Dealing with dimensionality: the application of machine learning to multi-omics data

Research output: Contribution to journal › Journal article › peer-review

Standard

Dealing with dimensionality: the application of machine learning to multi-omics data. / Feldner-Busztin, Dylan; Nisantzis, Panos Firbas; Edmunds, Shelley Jane; Boza, Gergely; Racimo, Fernando; Gopalakrishnan, Shyam; Limborg, Morten Tønsberg; Lahti, Leo; de Polavieja, Gonzalo G.

In: Bioinformatics, Vol. 39, No. 2, btad021, 2023.

Harvard

Feldner-Busztin, D, Nisantzis, PF, Edmunds, SJ, Boza, G, Racimo, F, Gopalakrishnan, S, Limborg, MT, Lahti, L & de Polavieja, GG 2023, 'Dealing with dimensionality: the application of machine learning to multi-omics data', Bioinformatics, vol. 39, no. 2, btad021. https://doi.org/10.1093/bioinformatics/btad021

APA

Feldner-Busztin, D., Nisantzis, P. F., Edmunds, S. J., Boza, G., Racimo, F., Gopalakrishnan, S., Limborg, M. T., Lahti, L., & de Polavieja, G. G. (2023). Dealing with dimensionality: the application of machine learning to multi-omics data. Bioinformatics, 39(2), [btad021]. https://doi.org/10.1093/bioinformatics/btad021

Vancouver

Feldner-Busztin D, Nisantzis PF, Edmunds SJ, Boza G, Racimo F, Gopalakrishnan S et al. Dealing with dimensionality: the application of machine learning to multi-omics data. Bioinformatics. 2023;39(2). btad021. https://doi.org/10.1093/bioinformatics/btad021

Author

Feldner-Busztin, Dylan; Nisantzis, Panos Firbas; Edmunds, Shelley Jane; Boza, Gergely; Racimo, Fernando; Gopalakrishnan, Shyam; Limborg, Morten Tønsberg; Lahti, Leo; de Polavieja, Gonzalo G. / Dealing with dimensionality: the application of machine learning to multi-omics data. In: Bioinformatics. 2023; Vol. 39, No. 2.

Bibtex

@article{6175461397f74955b3e441524bb5fe7a,
title = "Dealing with dimensionality: the application of machine learning to multi-omics data",
abstract = "Motivation: Machine learning (ML) methods are motivated by the need to automate information extraction from large datasets in order to support human users in data-driven tasks. This is an attractive approach for integrative joint analysis of vast amounts of omics data produced in next generation sequencing and other -omics assays. A systematic assessment of the current literature can help to identify key trends and potential gaps in methodology and applications. We surveyed the literature on ML multi-omic data integration and quantitatively explored the goals, techniques and data involved in this field. We were particularly interested in examining how researchers use ML to deal with the volume and complexity of these datasets. Results: Our main finding is that the methods used are those that address the challenges of datasets with few samples and many features. Dimensionality reduction methods are used to reduce the feature count alongside models that can also appropriately handle relatively few samples. Popular techniques include autoencoders, random forests and support vector machines. We also found that the field is heavily influenced by the use of The Cancer Genome Atlas dataset, which is accessible and contains many diverse experiments. Availability and implementation: All data and processing scripts are available at this GitLab repository: or in Zenodo: . Supplementary information: Supplementary data are available at Bioinformatics online.",
keywords = "CAUSAL INFERENCE, GENE, EXPRESSION, MODELS",
author = "Dylan Feldner-Busztin and Nisantzis, {Panos Firbas} and Edmunds, {Shelley Jane} and Gergely Boza and Fernando Racimo and Shyam Gopalakrishnan and Limborg, {Morten T{\o}nsberg} and Leo Lahti and {de Polavieja}, {Gonzalo G.}",
year = "2023",
doi = "10.1093/bioinformatics/btad021",
language = "English",
volume = "39",
journal = "Bioinformatics (Online)",
issn = "1367-4811",
publisher = "Oxford University Press",
number = "2",

}

RIS

TY - JOUR

T1 - Dealing with dimensionality

T2 - the application of machine learning to multi-omics data

AU - Feldner-Busztin, Dylan

AU - Nisantzis, Panos Firbas

AU - Edmunds, Shelley Jane

AU - Boza, Gergely

AU - Racimo, Fernando

AU - Gopalakrishnan, Shyam

AU - Limborg, Morten Tønsberg

AU - Lahti, Leo

AU - de Polavieja, Gonzalo G.

PY - 2023

Y1 - 2023

N2 - Motivation: Machine learning (ML) methods are motivated by the need to automate information extraction from large datasets in order to support human users in data-driven tasks. This is an attractive approach for integrative joint analysis of vast amounts of omics data produced in next generation sequencing and other -omics assays. A systematic assessment of the current literature can help to identify key trends and potential gaps in methodology and applications. We surveyed the literature on ML multi-omic data integration and quantitatively explored the goals, techniques and data involved in this field. We were particularly interested in examining how researchers use ML to deal with the volume and complexity of these datasets. Results: Our main finding is that the methods used are those that address the challenges of datasets with few samples and many features. Dimensionality reduction methods are used to reduce the feature count alongside models that can also appropriately handle relatively few samples. Popular techniques include autoencoders, random forests and support vector machines. We also found that the field is heavily influenced by the use of The Cancer Genome Atlas dataset, which is accessible and contains many diverse experiments. Availability and implementation: All data and processing scripts are available at this GitLab repository: or in Zenodo: . Supplementary information: Supplementary data are available at Bioinformatics online.

AB - Motivation: Machine learning (ML) methods are motivated by the need to automate information extraction from large datasets in order to support human users in data-driven tasks. This is an attractive approach for integrative joint analysis of vast amounts of omics data produced in next generation sequencing and other -omics assays. A systematic assessment of the current literature can help to identify key trends and potential gaps in methodology and applications. We surveyed the literature on ML multi-omic data integration and quantitatively explored the goals, techniques and data involved in this field. We were particularly interested in examining how researchers use ML to deal with the volume and complexity of these datasets. Results: Our main finding is that the methods used are those that address the challenges of datasets with few samples and many features. Dimensionality reduction methods are used to reduce the feature count alongside models that can also appropriately handle relatively few samples. Popular techniques include autoencoders, random forests and support vector machines. We also found that the field is heavily influenced by the use of The Cancer Genome Atlas dataset, which is accessible and contains many diverse experiments. Availability and implementation: All data and processing scripts are available at this GitLab repository: or in Zenodo: . Supplementary information: Supplementary data are available at Bioinformatics online.

KW - CAUSAL INFERENCE

KW - GENE

KW - EXPRESSION

KW - MODELS

U2 - 10.1093/bioinformatics/btad021

DO - 10.1093/bioinformatics/btad021

M3 - Journal article

C2 - 36637211

VL - 39

JO - Bioinformatics (Online)

JF - Bioinformatics (Online)

SN - 1367-4811

IS - 2

M1 - btad021

ER -

ID: 339325886