Multilevel modeling for data streams with dependent observations

L. Ippel

Research output: Thesis › Doctoral Thesis › Scientific


Abstract

The technological developments of recent decades, e.g., the introduction of the smartphone, have created opportunities to efficiently collect data on many individuals over an extensive period of time. While these technologies allow for intensive longitudinal measurements, they also come with new challenges: data sets collected with these technologies can become extremely large, and in many applications the data collection is never truly 'finished'. As a result, the data keep streaming in, and analyzing data streams with the standard computation of well-known models becomes inefficient, as the computation has to be repeated from scratch each time a new data point enters in order to remain up to date. In this thesis, methods to analyze data streams are developed. These methods allow researchers to broaden the scope of their research by using data streams.
In this dissertation, I introduce a commonly used approach to deal with data streams: online learning, a method that updates the result of an analysis while the data are entering, without revisiting previous data points. This approach can deal with data streams because (1) it is no longer required to store all data in memory, and (2) since the results are updated incrementally, they easily remain up to date when new data enter. A wide range of statistical models can be estimated using this approach, e.g., linear regression and correlations.
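The updating idea can be sketched with a simple linear regression: keep a handful of running summaries and revise them as each data point arrives, never revisiting earlier points. The class below is an illustrative sketch (not code from the thesis), using Welford-style updates of the sufficient statistics for a regression of y on a single x.

```python
# Illustrative sketch of online learning: a simple linear regression
# y = b0 + b1 * x, updated one data point at a time. Only five running
# summaries are stored; the raw data are discarded after each update.

class OnlineSimpleRegression:
    """Running estimates of intercept and slope for y = b0 + b1 * x."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.s_xx = 0.0   # running sum of squared deviations of x
        self.s_xy = 0.0   # running sum of cross-deviations of x and y

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x              # deviation from the OLD mean of x
        self.mean_x += dx / self.n
        self.mean_y += (y - self.mean_y) / self.n
        # Welford-style: dx uses the old mean_x, the factors below the new means
        self.s_xx += dx * (x - self.mean_x)
        self.s_xy += dx * (y - self.mean_y)

    @property
    def slope(self):
        return self.s_xy / self.s_xx

    @property
    def intercept(self):
        return self.mean_y - self.slope * self.mean_x


# Simulated stream: each point is processed once and then forgotten.
stream = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
model = OnlineSimpleRegression()
for x, y in stream:
    model.update(x, y)
```

After the stream, `model.slope` and `model.intercept` equal the ordinary least-squares estimates computed on the full data set, even though no data point was stored.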
However, when observations are clustered, for instance because the same individuals are observed repeatedly, rewriting the estimation procedure to be feasible in a data stream becomes more difficult. Clustered observations introduce dependencies between data points belonging to the same person. Ignoring these dependencies violates the important statistical assumption of independent observations, and when a researcher does not account for them, the results are likely to be biased (i.e., inaccurate). Multilevel models do take the dependencies between observations into account. These models, however, are computationally complex to estimate in a data stream, because the estimation procedure requires multiple passes over the data set. These repeated passes, in combination with the continuous stream of new data points, make it difficult to estimate the multilevel model in data streams.
In this thesis, I develop an online-learning algorithm that updates the multilevel model while new data enter, without passing over all the data repeatedly. This algorithm, called SEMA (Streaming Expectation Maximization Approximation), updates the results of the multilevel model with each new data point, without storing the data points. Using this algorithm, researchers can analyze data streams efficiently while taking the clustered structure of the data into account.
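A toy version of this idea, loosely in the spirit of SEMA but not the actual algorithm from the thesis, keeps only small per-person summaries (a count and a sum) and performs one local E-step plus a small stochastic M-step per incoming data point:

```python
import random
from collections import defaultdict

class StreamingRandomIntercept:
    """Toy streaming estimator for y_ij = mu + b_i + e_ij.

    Illustrative only -- NOT the SEMA algorithm itself. Per person we
    retain just a count and a running sum, never the raw observations.
    """

    def __init__(self, tau2=1.0, sigma2=1.0):
        self.mu, self.tau2, self.sigma2 = 0.0, tau2, sigma2
        self.n = 0
        self.counts = defaultdict(int)    # n_i per person
        self.sums = defaultdict(float)    # running sum of y per person

    def update(self, person, y):
        self.n += 1
        self.counts[person] += 1
        self.sums[person] += y
        n_i = self.counts[person]
        # Local E-step for this person only, at the current estimates
        shrink = n_i * self.tau2 / (self.sigma2 + n_i * self.tau2)
        m_i = shrink * (self.sums[person] / n_i - self.mu)
        v_i = self.tau2 * self.sigma2 / (self.sigma2 + n_i * self.tau2)
        # Stochastic-approximation style M-step with decreasing step size
        step = 1.0 / self.n
        self.mu += step * ((y - m_i) - self.mu)
        self.tau2 += step * ((m_i * m_i + v_i) - self.tau2)
        self.sigma2 += step * (((y - self.mu - m_i) ** 2 + v_i) - self.sigma2)


# Interleaved stream from 100 persons (true mu = 5, tau = 2, sigma = 1):
# each point updates the model once and is then discarded.
random.seed(2)
b = {i: random.gauss(0, 2.0) for i in range(100)}
model = StreamingRandomIntercept()
for t in range(20000):
    person = t % 100
    y = 5.0 + b[person] + random.gauss(0, 1.0)
    model.update(person, y)
```

Unlike the offline EM sketch, memory here grows only with the number of persons, not with the number of observations, and the estimates stay up to date after every single data point.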
Original language: English
Qualification: Doctor of Philosophy
Supervisors/Advisors
  • Vermunt, Jeroen, Promotor
  • Kaptein, Maurits, Promotor
  • van Breukelen, G.J.P., Member PhD commission, External person
  • Timmerman, M.E., Member PhD commission, External person
  • Postma, Marie, Member PhD commission
  • Croon, Marcel, Member PhD commission
Award date: 13 Oct 2017
Place of Publication: Vianen
Publisher: [s.n.]
Print ISBNs: 978-94-6295-757-2
Publication status: Published - 2017

Cite this

Ippel, L. / Multilevel modeling for data streams with dependent observations. Vianen : [s.n.], 2017. 132 p.
@phdthesis{3917f643b24e4cb1884fb85fb2ad7247,
title = "Multilevel modeling for data streams with dependent observations",
author = "L. Ippel",
year = "2017",
language = "English",
isbn = "978-94-6295-757-2",
publisher = "[s.n.]",

}

Ippel, L. 2017, 'Multilevel modeling for data streams with dependent observations', Doctor of Philosophy, Vianen.

TY - THES

T1 - Multilevel modeling for data streams with dependent observations

AU - Ippel, L.

PY - 2017

Y1 - 2017

M3 - Doctoral Thesis

SN - 978-94-6295-757-2

PB - [s.n.]

CY - Vianen

ER -