Variable selection in the regularized simultaneous component analysis method for multi-source data integration

Zhengguo Gu*, Katrijn Van Deun, Niek de Schipper

*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Interdisciplinary research often involves analyzing data obtained from different data sources with respect to the same subjects, objects, or experimental units. For example, global positioning systems (GPS) data have been coupled with travel diary data, resulting in a better understanding of traveling behavior. The GPS data and the travel diary data are very different in nature, and, to analyze the two types of data jointly, one often uses data integration techniques, such as the regularized simultaneous component analysis (regularized SCA) method. Regularized SCA is an extension of the (sparse) principal component analysis model to the cases where at least two data blocks are jointly analyzed, which - in order to reveal the joint and unique sources of variation - heavily relies on proper selection of the set of variables (i.e., component loadings) in the components. Regularized SCA requires a proper variable selection method to either identify the optimal values for tuning parameters or stably select variables. By means of two simulation studies with various noise and sparseness levels in simulated data, we compare six variable selection methods, which are cross-validation (CV) with the “one-standard-error” rule, repeated double CV (rdCV), BIC, Bolasso with CV, stability selection, and index of sparseness (IS) - a lesser-known (compared to the first five methods) but computationally efficient method. Results show that IS is the best-performing variable selection method.
Original language: English
Article number: 18608
Number of pages: 21
Journal: Scientific Reports
Volume: 9
DOI: 10.1038/s41598-019-54673-2
Publication status: Published - 2019

Keywords

  • COMMON
  • INFORMATION
  • LOADINGS
  • METABOLOMICS
  • MODEL SELECTION
  • REGRESSION
  • SCA

Cite this

@article{3ccf0cef749941fe8ad06123d9da04bd,
title = "Variable selection in the regularized simultaneous component analysis method for multi-source data integration",
abstract = "Interdisciplinary research often involves analyzing data obtained from different data sources with respect to the same subjects, objects, or experimental units. For example, global positioning systems (GPS) data have been coupled with travel diary data, resulting in a better understanding of traveling behavior. The GPS data and the travel diary data are very different in nature, and, to analyze the two types of data jointly, one often uses data integration techniques, such as the regularized simultaneous component analysis (regularized SCA) method. Regularized SCA is an extension of the (sparse) principal component analysis model to the cases where at least two data blocks are jointly analyzed, which - in order to reveal the joint and unique sources of variation - heavily relies on proper selection of the set of variables (i.e., component loadings) in the components. Regularized SCA requires a proper variable selection method to either identify the optimal values for tuning parameters or stably select variables. By means of two simulation studies with various noise and sparseness levels in simulated data, we compare six variable selection methods, which are cross-validation (CV) with the “one-standard-error” rule, repeated double CV (rdCV), BIC, Bolasso with CV, stability selection, and index of sparseness (IS) - a lesser-known (compared to the first five methods) but computationally efficient method. Results show that IS is the best-performing variable selection method.",
keywords = "COMMON, INFORMATION, LOADINGS, METABOLOMICS, MODEL SELECTION, REGRESSION, SCA",
author = "Zhengguo Gu and {Van Deun}, Katrijn and {de Schipper}, Niek",
year = "2019",
doi = "10.1038/s41598-019-54673-2",
language = "English",
volume = "9",
journal = "Scientific Reports",
issn = "2045-2322",
publisher = "Nature Publishing Group",
}

Variable selection in the regularized simultaneous component analysis method for multi-source data integration. / Gu, Zhengguo; Van Deun, Katrijn; de Schipper, Niek.

In: Scientific Reports, Vol. 9, 18608, 2019.

TY  - JOUR
T1  - Variable selection in the regularized simultaneous component analysis method for multi-source data integration
AU  - Gu, Zhengguo
AU  - Van Deun, Katrijn
AU  - de Schipper, Niek
PY  - 2019
Y1  - 2019
AB  - Interdisciplinary research often involves analyzing data obtained from different data sources with respect to the same subjects, objects, or experimental units. For example, global positioning systems (GPS) data have been coupled with travel diary data, resulting in a better understanding of traveling behavior. The GPS data and the travel diary data are very different in nature, and, to analyze the two types of data jointly, one often uses data integration techniques, such as the regularized simultaneous component analysis (regularized SCA) method. Regularized SCA is an extension of the (sparse) principal component analysis model to the cases where at least two data blocks are jointly analyzed, which - in order to reveal the joint and unique sources of variation - heavily relies on proper selection of the set of variables (i.e., component loadings) in the components. Regularized SCA requires a proper variable selection method to either identify the optimal values for tuning parameters or stably select variables. By means of two simulation studies with various noise and sparseness levels in simulated data, we compare six variable selection methods, which are cross-validation (CV) with the “one-standard-error” rule, repeated double CV (rdCV), BIC, Bolasso with CV, stability selection, and index of sparseness (IS) - a lesser-known (compared to the first five methods) but computationally efficient method. Results show that IS is the best-performing variable selection method.
KW  - COMMON
KW  - INFORMATION
KW  - LOADINGS
KW  - METABOLOMICS
KW  - MODEL SELECTION
KW  - REGRESSION
KW  - SCA
U2  - 10.1038/s41598-019-54673-2
DO  - 10.1038/s41598-019-54673-2
M3  - Article
VL  - 9
JO  - Scientific Reports
JF  - Scientific Reports
SN  - 2045-2322
M1  - 18608
ER  -