Abstract
Categorical data are data collected, for instance, by means of surveys or polls, in which respondents must choose an answer to each question from two or more options (or ‘categories’). These data must be analysed statistically to answer the researchers’ scientific questions, but in many situations they are not fully observed for some respondents. That is, if a respondent does not answer all the questions in a questionnaire, the problem of so-called ‘missing data’ occurs. This can be an issue for the statistical analysis that follows the data collection stage, because the loss of information due to the missing data can lead to wrong conclusions.
To address this problem, analysts have developed over the years techniques for replacing the missing data in a data set, a procedure known as ‘imputation’. In particular, the missing data are replaced probabilistically; that is, each unobserved value is assigned a potential imputed value with a given probability, and this probability is calculated from the available, observed answers of the subjects. Repeating the imputation several times is known as ‘multiple imputation’, which gives a more complete representation of the uncertainty about the missing data. After the imputations have been performed, the next step is to carry out the statistical analysis of interest for the research question.
This dissertation proposes novel models for generating the imputed values that replace the missing data in a data set. These models are called ‘Latent Class models’. To perform the imputations, they create clusters (‘latent classes’) of respondents who have provided similar answers, and impute the missing data according to the values observed in the units that belong to the same clusters as the units with nonresponses.
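The idea of imputing through latent classes can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function name `impute_latent_class`, the two-class setup, and all probabilities below are hypothetical values chosen for illustration, and the model parameters are treated as fixed rather than estimated from data. For one respondent, the sketch computes the probability of belonging to each class from the observed answers, draws a class, and then draws each missing answer from that class's response probabilities.

```python
import numpy as np


def impute_latent_class(row, class_weights, item_probs, rng):
    """Draw one imputed copy of `row`, where np.nan marks a missing answer.

    class_weights: shape (K,), P(class = k)            (hypothetical values)
    item_probs:    shape (K, J, C), P(item j = c | k)  (hypothetical values)
    """
    row = np.asarray(row, dtype=float)
    observed = ~np.isnan(row)

    # Posterior class membership, using the observed answers only.
    log_post = np.log(class_weights).copy()
    for j in np.flatnonzero(observed):
        log_post += np.log(item_probs[:, j, int(row[j])])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()

    # Draw a latent class, then draw each missing answer from that class.
    k = rng.choice(len(class_weights), p=post)
    imputed = row.copy()
    for j in np.flatnonzero(~observed):
        imputed[j] = rng.choice(item_probs.shape[2], p=item_probs[k, j])
    return imputed.astype(int)


# Made-up model: 2 classes, 3 yes/no questions (0 = no, 1 = yes).
class_weights = np.array([0.6, 0.4])
item_probs = np.array([
    [[0.90, 0.10], [0.80, 0.20], [0.85, 0.15]],  # class 0: mostly "no"
    [[0.20, 0.80], [0.10, 0.90], [0.15, 0.85]],  # class 1: mostly "yes"
])

rng = np.random.default_rng(0)
row = [0, np.nan, 0]  # the second question was not answered

# Repeating the draw several times gives multiple imputations.
imputations = [
    impute_latent_class(row, class_weights, item_probs, rng) for _ in range(5)
]
```

Because the respondent answered "no" to the two observed questions, the posterior favours the class that mostly answers "no", so the imputed value for the second question tends to be 0 but is not forced to be. A full multiple-imputation procedure would also reflect uncertainty in the model parameters, which this sketch leaves out.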
Furthermore, the models proposed in the dissertation take into account the way in which the data were collected. For instance, if data have been observed for pupils from different schools, it is typical (and reasonable) to assume that pupils in the same school tend to give more similar answers than pupils from other schools. To obtain reliable statistical analyses after the imputation stage, this aspect needs to be accounted for when the imputations are performed. This is done with a modified version of the Latent Class model, called the ‘Multilevel Latent Class model’. It is called multilevel because it accounts for the hierarchical (or multilevel) structure of the data; in the school example, schools represent the higher level in the hierarchy, while pupils represent the lower level.
Another way to collect data is to observe the same subjects repeatedly over time; such data are known as ‘longitudinal data’. Longitudinal data are useful, for instance, for monitoring how a phenomenon of scientific interest changes over time. Imputing missing data that arise from this kind of data collection design requires a model that takes into account the longitudinal structure of the data and the fact that past observations can affect future ones. In the dissertation, longitudinal data are imputed with a model that captures this kind of relationship; once again, it is a modified version of the Latent Class model, known as the ‘Latent Markov model’.
Throughout the dissertation, the performance of the proposed imputation models has been compared with that of other methods existing in the literature, by means of both synthetic (simulated) and empirical data. In all cases, results show that Latent Class models are among the best-performing methods for multiple imputation, and they are therefore the recommended approach for applied researchers who wish to perform statistical analysis in the presence of missing categorical data.
Original language  English 

Qualification  Doctor of Philosophy 
Supervisors/Advisors 

Award date  2 Mar 2018 
Place of Publication  s.l. 
Publisher  
Print ISBNs  9789462958081 
Publication status  Published  2018 