Bayesian latent class models for the multiple imputation of cross-sectional, multilevel and longitudinal categorical data

Research output: Thesis › Doctoral Thesis › Scientific

Abstract

Categorical data are collected, for instance, through surveys or polls in which respondents answer each question by choosing one of two or more options (or ‘categories’). These data must be analysed statistically to answer the researchers’ scientific questions, but in many situations they are not fully observed for some respondents: when a respondent does not answer every question in a questionnaire, the so-called ‘missing data’ problem arises. This is an issue for the statistical analysis that follows data collection, because the loss of information caused by the missing data can lead to wrong conclusions.
To address this problem, analysts have developed a technique for replacing the missing values in a data set, a procedure known as ‘imputation’. The missing data are replaced probabilistically: each non-observed value is assigned a candidate imputed value with a given probability, and this probability is calculated from the available, observed answers of the subjects. Repeating the imputations several times yields ‘multiple imputation’, which gives a more complete representation of the uncertainty about the missing data. Once the imputations have been performed, the statistical analysis of interest for the research question can be carried out.
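The probabilistic replacement described above can be illustrated with a minimal sketch. It is not the method of the dissertation: it assumes a single categorical question, draws each missing answer from the observed answer frequencies only, and ignores uncertainty about those frequencies. The function names and the toy answers are hypothetical.

```python
import random
from collections import Counter

def impute_once(values, rng):
    """Replace each missing answer (None) with a category drawn
    with probability proportional to its observed frequency."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    categories = sorted(counts)
    weights = [counts[c] for c in categories]
    return [
        v if v is not None else rng.choices(categories, weights=weights)[0]
        for v in values
    ]

def multiple_imputation(values, m=5, seed=0):
    """Repeat the random imputation m times, giving m completed data sets."""
    rng = random.Random(seed)
    return [impute_once(values, rng) for _ in range(m)]

# Toy example: six respondents, two of whom skipped the question.
answers = ["yes", "no", None, "yes", None, "yes"]
completed = multiple_imputation(answers, m=5)
```

Each of the five completed data sets would then be analysed separately and the results combined, so that the variation between imputations reflects the uncertainty about the missing answers.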
This dissertation proposes novel models for generating the imputed values that replace the missing data in a data set. These models are called ‘Latent Class models’. To perform the imputations, they create clusters (‘latent classes’) of respondents who have provided similar answers, and impute the missing data according to the values observed in the units that belong to the same clusters as the units with non-responses.
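The clustering idea can be sketched as follows. This is an illustration under strong assumptions, not the dissertation's implementation: the class weights and the answer probabilities within each class are fixed, hypothetical numbers, whereas in practice they would be estimated from the data. A respondent is first assigned to a latent class based on the answers that were observed, and the missing answers are then drawn from that class.

```python
import random

# Hypothetical parameters: 2 latent classes, 3 binary questions.
CLASS_WEIGHTS = [0.6, 0.4]
# ITEM_PROBS[k][j][c] = P(question j is answered with category c | class k)
ITEM_PROBS = [
    [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]],
    [[0.2, 0.8], [0.3, 0.7], [0.1, 0.9]],
]

def class_posterior(row):
    """P(class | observed answers); missing answers (None) are skipped."""
    joint = []
    for k, weight in enumerate(CLASS_WEIGHTS):
        p = weight
        for j, value in enumerate(row):
            if value is not None:
                p *= ITEM_PROBS[k][j][value]
        joint.append(p)
    total = sum(joint)
    return [p / total for p in joint]

def impute_row(row, rng):
    """Draw a latent class for the respondent, then draw each
    missing answer from that class's answer probabilities."""
    posterior = class_posterior(row)
    k = rng.choices(range(len(posterior)), weights=posterior)[0]
    completed = []
    for j, value in enumerate(row):
        if value is None:
            probs = ITEM_PROBS[k][j]
            value = rng.choices(range(len(probs)), weights=probs)[0]
        completed.append(value)
    return completed

rng = random.Random(1)
respondent = [0, 0, None]          # third question not answered
posterior = class_posterior(respondent)
completed_row = impute_row(respondent, rng)
```

Because the two observed answers are typical of the first class, the posterior puts most of its weight there, and the imputed third answer is most likely to follow that class's pattern.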
Furthermore, the models proposed in the dissertation take into account the way in which the data were collected. For instance, if data have been observed for pupils from different schools, it is typical (and reasonable) to assume that pupils in the same school tend to give more similar answers than pupils from other schools. To obtain reliable statistical analyses after the imputation stage, this aspect needs to be accounted for when the imputations are performed. This is done with a modified version of the Latent Class model, called the ‘Multilevel Latent Class model’. It is called multilevel because it accounts for the hierarchical (or multilevel) structure of the data; in the schools example, schools represent the higher level of the hierarchy and pupils the lower level.
Data can also be collected over time for the same subjects; such data are known as ‘longitudinal data’. Longitudinal data are useful, for instance, for monitoring changes over time in a phenomenon of scientific interest. Imputing missing data arising from this kind of data collection design requires a model that takes the longitudinal structure of the data into account, including the fact that past observations can affect future ones. In the dissertation, longitudinal data are imputed with a model that captures this kind of relationship; once again, it is a modified version of the Latent Class model, known as the ‘Latent Markov model’.
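How past answers can inform a missing later answer may be sketched with a two-state latent Markov chain. All probabilities below are hypothetical and fixed rather than estimated, and the sketch uses only answers up to the current time point (a forward pass); a full imputation model would typically also use later answers.

```python
# Hypothetical parameters: 2 latent states, 1 binary question per time point.
INITIAL = [0.5, 0.5]                     # P(state at the first time point)
TRANSITION = [[0.9, 0.1], [0.2, 0.8]]    # TRANSITION[i][j] = P(next state j | current state i)
EMISSION = [[0.8, 0.2], [0.3, 0.7]]      # EMISSION[s][c] = P(answer c | state s)

def filtered_states(answers):
    """Forward pass: P(state at time t | answers up to t).
    Missing answers (None) leave the predicted belief unchanged."""
    n_states = len(INITIAL)
    belief = INITIAL[:]
    beliefs = []
    for t, answer in enumerate(answers):
        if t > 0:
            # Move the belief one step forward through the transition matrix.
            belief = [
                sum(belief[i] * TRANSITION[i][j] for i in range(n_states))
                for j in range(n_states)
            ]
        if answer is not None:
            # Update the belief with the observed answer and renormalise.
            belief = [b * EMISSION[s][answer] for s, b in enumerate(belief)]
            total = sum(belief)
            belief = [b / total for b in belief]
        beliefs.append(belief)
    return beliefs

def answer_probabilities(belief):
    """P(answer = c) implied by a belief over the latent states."""
    n_categories = len(EMISSION[0])
    return [
        sum(belief[s] * EMISSION[s][c] for s in range(len(belief)))
        for c in range(n_categories)
    ]

# The subject answered 0 at the first time point and skipped the second.
beliefs = filtered_states([0, None])
imputation_probs = answer_probabilities(beliefs[1])
```

A missing answer at the second time point would be imputed by drawing a category with the probabilities in `imputation_probs`, which depend on the earlier answer through the latent state.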
Throughout the dissertation, the performance of the proposed imputation models has been compared with that of other methods from the literature, using both synthetic (simulated) and empirical data. In all cases, the results show that Latent Class models are among the best-performing methods for multiple imputation, and they are therefore the recommended approach for applied researchers who wish to perform statistical analyses in the presence of missing categorical data.
Original language: English
Qualification: Doctor of Philosophy
Supervisors/Advisors
  • Vermunt, Jeroen, Promotor
  • Van Deun, Katrijn, Co-promotor
  • van Buuren, S., Member PhD commission, External person
  • Bassi, F., Member PhD commission, External person
  • Cramer, Angélique, Member PhD commission
  • van der Ark, L.A., Member PhD commission
Award date: 2 Mar 2018
Place of publication: s.l.
Publisher: Proefschriftmaken
Print ISBN: 978-94-6295-808-1
Publication status: Published - 2018

Cite this

@phdthesis{8360113b8eb34b4f92ebc3db7c5d2ea2,
title = "Bayesian latent class models for the multiple imputation of cross-sectional, multilevel and longitudinal categorical data",
author = "Davide Vidotto",
year = "2018",
language = "English",
isbn = "978-94-6295-808-1",
publisher = "Proefschriftmaken",

}

Bayesian latent class models for the multiple imputation of cross-sectional, multilevel and longitudinal categorical data. / Vidotto, Davide.

s.l. : Proefschriftmaken, 2018. 141 p.

TY - THES

T1 - Bayesian latent class models for the multiple imputation of cross-sectional, multilevel and longitudinal categorical data

AU - Vidotto, Davide

PY - 2018

Y1 - 2018

M3 - Doctoral Thesis

SN - 978-94-6295-808-1

PB - Proefschriftmaken

CY - s.l.

ER -