A flexible framework for sparse simultaneous component based data integration

K. Van Deun, Tom F Wilderjans, Robert A Van Den Berg, Anestis Antoniadis, Iven Van Mechelen

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Background
High-throughput data are complex, and methods that reveal the structure underlying the data are most useful. Principal component analysis, frequently implemented as a singular value decomposition, is a popular technique in this respect. Increasingly, the challenge is to reveal structure in several sources of information (e.g., transcriptomics, proteomics) that are available for the same biological entities under study. Simultaneous component methods are most promising in this respect. However, interpreting the principal and simultaneous components is often daunting because the contribution of every biomolecule (transcript, protein) has to be taken into account.
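To make the background concrete, the following is a minimal sketch of a simultaneous component analysis computed as a singular value decomposition of column-wise concatenated blocks. The random data, the block sizes, the number of components `R`, and all variable names are illustrative assumptions, and block scaling is omitted; it is not the authors' implementation.

```python
import numpy as np

# Two hypothetical data blocks measured on the same 20 samples (rows).
rng = np.random.default_rng(0)
n_samples = 20
X1 = rng.normal(size=(n_samples, 5))   # e.g. 5 variables from platform 1
X2 = rng.normal(size=(n_samples, 8))   # e.g. 8 variables from platform 2

def center(X):
    """Column-center a block so components describe variation around the mean."""
    return X - X.mean(axis=0)

# Simultaneous analysis: concatenate the blocks column-wise, then decompose.
X = np.hstack([center(X1), center(X2)])
U, s, Vt = np.linalg.svd(X, full_matrices=False)

R = 2                                  # number of components (assumed)
T = U[:, :R]                           # component scores shared by both blocks
P = Vt[:R].T * s[:R]                   # loadings, one row per variable
P1, P2 = P[:5], P[5:]                  # loadings split back per block
X_hat = T @ P.T                        # rank-R reconstruction of all blocks
```

Every entry of `P1` and `P2` is generally non-zero here, which is the interpretation burden that sparse variants aim to reduce.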
Results
We propose a sparse simultaneous component method that makes many of the parameters redundant by shrinking them to zero. It includes principal component analysis, sparse principal component analysis, and ordinary simultaneous component analysis as special cases. Several penalties can be tuned that account in different ways for the block structure present in the integrated data. This yields known sparse approaches such as the lasso, the ridge penalty, the elastic net, the group lasso, the sparse group lasso, and the elitist lasso. In addition, the algorithmic results carry over easily to the regression context. Metabolomics data obtained with two measurement platforms for the same set of Escherichia coli samples are used to illustrate the proposed methodology and the properties of the different penalties with respect to sparseness across and within data blocks.
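The penalties named above can be written down compactly. The sketch below evaluates them for a loading matrix whose rows are grouped into blocks, and combines them with a least-squares fit term. The function names, the `lam` dictionary, and the square-root block-size weight on the group lasso term are assumptions made for illustration (the weight is a common convention); the exact objective should be taken from the article itself.

```python
import numpy as np

def penalties(P, block_sizes):
    """Evaluate common sparsity penalties on loadings P (variables x components).

    block_sizes lists how many consecutive rows of P belong to each data block.
    """
    blocks = np.split(P, np.cumsum(block_sizes)[:-1], axis=0)
    n_comp = P.shape[1]
    lasso = np.abs(P).sum()                        # sum of absolute loadings
    ridge = (P ** 2).sum()                         # sum of squared loadings
    # Group lasso: L2 norm per block and component (sqrt(J_k) weight assumed).
    group = sum(np.sqrt(Pk.shape[0]) * np.linalg.norm(Pk[:, r])
                for Pk in blocks for r in range(n_comp))
    # Elitist lasso: squared L1 norm per block and component.
    elitist = sum(np.abs(Pk[:, r]).sum() ** 2
                  for Pk in blocks for r in range(n_comp))
    return {"lasso": lasso, "ridge": ridge,
            "group_lasso": group, "elitist_lasso": elitist}

def penalized_loss(X, T, P, block_sizes, lam):
    """Least-squares fit plus a weighted sum of the penalties above.

    lam maps penalty names to non-negative tuning parameters; missing names
    count as zero. Lasso plus ridge gives an elastic net, lasso plus group
    lasso gives a sparse group lasso.
    """
    fit = ((X - T @ P.T) ** 2).sum()
    pen = penalties(P, block_sizes)
    return fit + sum(lam.get(name, 0.0) * value for name, value in pen.items())

# Tiny example: 3 variables in two blocks (sizes 2 and 1), 2 components.
P_demo = np.array([[3.0, 0.0],
                   [4.0, 0.0],
                   [0.0, 1.0]])
pen_demo = penalties(P_demo, [2, 1])
```

Setting every entry of `lam` to zero leaves only the fit term, which corresponds to the unpenalized special cases mentioned in the text.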
Conclusion
Sparse simultaneous component analysis is a useful method for data integration: first, simultaneous analysis of multiple blocks offers advantages over sequential and separate analyses; second, the sparseness of the results greatly facilitates their interpretation. The approach is flexible and allows the block structure to be taken into account in different ways. As such, structures can be found that are exclusively tied to one data platform (group lasso approach) as well as structures that involve all data platforms (elitist lasso approach).
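The contrast between the two block-aware penalties can be illustrated with their proximal operators applied to the loadings of one component. This is a sketch using standard closed-form proximal steps for the unweighted penalties `lam * ||z||_2` (group lasso) and `lam * ||z||_1**2` (elitist lasso); it is not the estimation algorithm of the article, and the loadings and the value of `lam` are made up.

```python
import numpy as np

def prox_group_lasso(z, lam):
    """Prox of lam * ||z||_2 for one block: shrink the whole block or zero it."""
    norm = np.linalg.norm(z)
    if norm <= lam:
        return np.zeros_like(z)
    return (1.0 - lam / norm) * z

def prox_elitist_lasso(z, lam):
    """Prox of lam * (||z||_1)**2 for one block: soft-threshold within the block."""
    a = np.sort(np.abs(z))[::-1]                 # magnitudes, largest first
    m = np.arange(1, a.size + 1)
    s = np.cumsum(a) / (1.0 + 2.0 * lam * m)     # candidate L1 norm of the solution
    active = a > 2.0 * lam * s                   # entries that stay non-zero
    if not active.any():
        return np.zeros_like(z)
    tau = 2.0 * lam * s[np.nonzero(active)[0].max()]
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

# Hypothetical loadings of one component on two platforms.
block_a = np.array([2.0, -1.5, 1.0])   # strong block
block_b = np.array([0.3, 0.2, -0.1])   # weak block

group = [prox_group_lasso(b, 0.5) for b in (block_a, block_b)]
elitist = [prox_elitist_lasso(b, 0.5) for b in (block_a, block_b)]
```

With these numbers the group lasso step removes the weak block entirely and keeps every variable of the strong block, so the component is tied to one platform. The elitist lasso step keeps the largest loadings in both blocks and zeroes the smallest one in each, so the component involves all platforms while remaining sparse within each.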
Original language: English
Article number: 448
Journal: BMC Bioinformatics
Volume: 12
Issue number: 1
DOI: https://doi.org/10.1186/1471-2105-12-448
Publication status: Published - 2011
Externally published: Yes

Fingerprint

Data integration
Lasso
Principal component analysis
Penalty
Biomolecules
Block Structure
Singular value decomposition
Escherichia coli
Throughput
Proteins
Elastic Net
Metabolomics
Proteomics
Shrinking
Ridge
Framework

Cite this

Van Deun, K., Wilderjans, T. F., Van Den Berg, R. A., Antoniadis, A., & Van Mechelen, I. (2011). A flexible framework for sparse simultaneous component based data integration. BMC Bioinformatics, 12(1), [448]. https://doi.org/10.1186/1471-2105-12-448
@article{b4dd7b1288bd4919abf4d247a03cf337,
title = "A flexible framework for sparse simultaneous component based data integration",
author = "{Van Deun}, K. and Wilderjans, {Tom F} and {Van Den Berg}, {Robert A} and Anestis Antoniadis and {Van Mechelen}, Iven",
year = "2011",
doi = "10.1186/1471-2105-12-448",
language = "English",
volume = "12",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "1",

}

