Bayesian latent class models for the multiple imputation of categorical data

Davide Vidotto*, Jeroen K. Vermunt, Katrijn Van Deun

*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review


Abstract

Latent class analysis has recently been proposed for the multiple imputation (MI) of missing categorical data, using either a standard frequentist approach or a nonparametric Bayesian model called the Dirichlet process mixture of multinomial distributions (DPMM). The main advantage of using a latent class model for multiple imputation is its flexibility: it can capture complex relationships in the data, provided that the number of latent classes is large enough. However, the two existing approaches also have certain disadvantages. The frequentist approach is computationally demanding because it requires estimating many LC models: first, models with different numbers of classes must be estimated to determine the required number of classes; subsequently, the selected model is reestimated for multiple bootstrap samples to take parameter uncertainty into account during the imputation stage. Whereas the Bayesian Dirichlet process models perform model selection and handle parameter uncertainty automatically, the disadvantage of this method is that it tends to use too small a number of clusters during the Gibbs sampling, leading to an underfitting model that yields invalid imputations. In this paper, we propose an alternative approach that combines the strengths of the two existing approaches: we use the standard Bayesian latent class model as an imputation model. We show how model selection can be performed prior to the imputation step using a single run of the Gibbs sampler and, moreover, how underfitting is prevented by using large values for the hyperparameters of the mixture weights. The results of two simulation studies and one real-data study indicate that, with a proper setting of the prior distributions, the Bayesian latent class model yields valid imputations and outperforms competing methods.
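The imputation scheme the abstract describes can be sketched as a Gibbs sampler for a Bayesian latent class model: draw mixture weights from a Dirichlet prior with a large hyperparameter (the paper's device for preventing underfitting), draw class-conditional category probabilities, reassign class memberships from the observed cells, and fill missing cells from the class-conditional multinomials. The sketch below is illustrative only and is not the authors' implementation; all function and variable names (`gibbs_lc_impute`, `alpha`, etc.) are our own.

```python
import numpy as np

def gibbs_lc_impute(X, n_classes=5, alpha=50.0, n_iter=200, rng=None):
    """One Gibbs chain for a Bayesian latent class imputation model.

    X        : (n, p) integer array of category codes; missing cells coded -1.
    alpha    : large Dirichlet hyperparameter on the mixture weights, which
               keeps all classes occupied (the paper's guard against underfitting).
    Returns one imputed copy of X, drawn at the final iteration.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    n_cat = [int(X[X[:, j] >= 0, j].max()) + 1 for j in range(p)]
    miss = X < 0
    Ximp = X.copy()
    for j in range(p):                       # start from random fills
        Ximp[miss[:, j], j] = rng.integers(0, n_cat[j], miss[:, j].sum())
    z = rng.integers(0, n_classes, n)        # initial class memberships
    for _ in range(n_iter):
        # 1. mixture weights pi | z  ~  Dirichlet(alpha + class counts)
        pi = rng.dirichlet(alpha + np.bincount(z, minlength=n_classes))
        # 2. conditional probabilities theta | z, X  ~  Dirichlet(1 + counts)
        theta = []
        for j in range(p):
            cnt = np.zeros((n_classes, n_cat[j]))
            np.add.at(cnt, (z, Ximp[:, j]), 1.0)
            theta.append(np.vstack([rng.dirichlet(1.0 + cnt[c])
                                    for c in range(n_classes)]))
        # 3. class memberships z | pi, theta, observed cells only
        logp = np.log(pi)[None, :].repeat(n, axis=0)
        for j in range(p):
            obs = ~miss[:, j]
            logp[obs] += np.log(theta[j][:, X[obs, j]]).T
        logp -= logp.max(axis=1, keepdims=True)
        prob = np.exp(logp)
        prob /= prob.sum(axis=1, keepdims=True)
        z = (prob.cumsum(axis=1) > rng.random((n, 1))).argmax(axis=1)
        # 4. impute missing cells from the class-conditional multinomials
        for j in range(p):
            for i in np.where(miss[:, j])[0]:
                Ximp[i, j] = rng.choice(n_cat[j], p=theta[j][z[i]])
    return Ximp
```

For proper MI one would run the chain to convergence and keep several widely spaced draws of `Ximp` as the multiple imputations; the single-draw return here is only to keep the sketch short.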

Original language: English
Pages (from-to): 56-68
Journal: Methodology: European Journal of Research Methods for the Behavioral and Social Sciences
Volume: 14
Issue number: 2
DOI: 10.1027/1614-2241/a000146
Publication status: Published - 2018

Keywords

  • Bayesian mixture models
  • latent class models
  • missing data
  • multiple imputation

Cite this

@article{9bdfe6ab11c44bafb3c18596bc368854,
title = "Bayesian latent class models for the multiple imputation of categorical data",
abstract = "Latent class analysis has recently been proposed for the multiple imputation (MI) of missing categorical data, using either a standard frequentist approach or a nonparametric Bayesian model called the Dirichlet process mixture of multinomial distributions (DPMM). The main advantage of using a latent class model for multiple imputation is its flexibility: it can capture complex relationships in the data, provided that the number of latent classes is large enough. However, the two existing approaches also have certain disadvantages. The frequentist approach is computationally demanding because it requires estimating many LC models: first, models with different numbers of classes must be estimated to determine the required number of classes; subsequently, the selected model is reestimated for multiple bootstrap samples to take parameter uncertainty into account during the imputation stage. Whereas the Bayesian Dirichlet process models perform model selection and handle parameter uncertainty automatically, the disadvantage of this method is that it tends to use too small a number of clusters during the Gibbs sampling, leading to an underfitting model that yields invalid imputations. In this paper, we propose an alternative approach that combines the strengths of the two existing approaches: we use the standard Bayesian latent class model as an imputation model. We show how model selection can be performed prior to the imputation step using a single run of the Gibbs sampler and, moreover, how underfitting is prevented by using large values for the hyperparameters of the mixture weights. The results of two simulation studies and one real-data study indicate that, with a proper setting of the prior distributions, the Bayesian latent class model yields valid imputations and outperforms competing methods.",
keywords = "Bayesian mixture models, latent class models, missing data, multiple imputation",
author = "Davide Vidotto and Vermunt, {Jeroen K.} and {Van Deun}, Katrijn",
year = "2018",
doi = "10.1027/1614-2241/a000146",
language = "English",
volume = "14",
pages = "56--68",
journal = "Methodology: European Journal of Research Methods for the Behavioral and Social Sciences",
issn = "1614-1881",
publisher = "Hogrefe & Huber Publishers",
number = "2",

}

TY - JOUR

T1 - Bayesian latent class models for the multiple imputation of categorical data

AU - Vidotto, Davide

AU - Vermunt, Jeroen K.

AU - Van Deun, Katrijn

PY - 2018

Y1 - 2018

N2 - Latent class analysis has recently been proposed for the multiple imputation (MI) of missing categorical data, using either a standard frequentist approach or a nonparametric Bayesian model called the Dirichlet process mixture of multinomial distributions (DPMM). The main advantage of using a latent class model for multiple imputation is its flexibility: it can capture complex relationships in the data, provided that the number of latent classes is large enough. However, the two existing approaches also have certain disadvantages. The frequentist approach is computationally demanding because it requires estimating many LC models: first, models with different numbers of classes must be estimated to determine the required number of classes; subsequently, the selected model is reestimated for multiple bootstrap samples to take parameter uncertainty into account during the imputation stage. Whereas the Bayesian Dirichlet process models perform model selection and handle parameter uncertainty automatically, the disadvantage of this method is that it tends to use too small a number of clusters during the Gibbs sampling, leading to an underfitting model that yields invalid imputations. In this paper, we propose an alternative approach that combines the strengths of the two existing approaches: we use the standard Bayesian latent class model as an imputation model. We show how model selection can be performed prior to the imputation step using a single run of the Gibbs sampler and, moreover, how underfitting is prevented by using large values for the hyperparameters of the mixture weights. The results of two simulation studies and one real-data study indicate that, with a proper setting of the prior distributions, the Bayesian latent class model yields valid imputations and outperforms competing methods.

AB - Latent class analysis has recently been proposed for the multiple imputation (MI) of missing categorical data, using either a standard frequentist approach or a nonparametric Bayesian model called the Dirichlet process mixture of multinomial distributions (DPMM). The main advantage of using a latent class model for multiple imputation is its flexibility: it can capture complex relationships in the data, provided that the number of latent classes is large enough. However, the two existing approaches also have certain disadvantages. The frequentist approach is computationally demanding because it requires estimating many LC models: first, models with different numbers of classes must be estimated to determine the required number of classes; subsequently, the selected model is reestimated for multiple bootstrap samples to take parameter uncertainty into account during the imputation stage. Whereas the Bayesian Dirichlet process models perform model selection and handle parameter uncertainty automatically, the disadvantage of this method is that it tends to use too small a number of clusters during the Gibbs sampling, leading to an underfitting model that yields invalid imputations. In this paper, we propose an alternative approach that combines the strengths of the two existing approaches: we use the standard Bayesian latent class model as an imputation model. We show how model selection can be performed prior to the imputation step using a single run of the Gibbs sampler and, moreover, how underfitting is prevented by using large values for the hyperparameters of the mixture weights. The results of two simulation studies and one real-data study indicate that, with a proper setting of the prior distributions, the Bayesian latent class model yields valid imputations and outperforms competing methods.

KW - Bayesian mixture models

KW - latent class models

KW - missing data

KW - multiple imputation

U2 - 10.1027/1614-2241/a000146

DO - 10.1027/1614-2241/a000146

M3 - Article

VL - 14

SP - 56

EP - 68

JO - Methodology: European Journal of Research Methods for the Behavioral and Social Sciences

JF - Methodology: European Journal of Research Methods for the Behavioral and Social Sciences

SN - 1614-1881

IS - 2

ER -