Abstract
The technological developments of the last decades, such as the introduction of the smartphone, have created opportunities to efficiently collect data from many individuals over extended periods of time. While these technologies allow for intensive longitudinal measurement, they also bring new challenges: the resulting data sets can become extremely large, and in many applications the data collection is never truly 'finished'. The data keep streaming in, and analyzing such data streams with the standard computation of well-known models becomes inefficient, because the computation has to be repeated each time a new data point enters in order to stay up to date. In this thesis, methods to analyze data streams are developed. These methods allow researchers to broaden the scope of their research by using data streams.
In this dissertation, I introduce a commonly used approach to dealing with data streams: online learning, a method that updates the result of an analysis as the data arrive, without revisiting previous data points. This approach can handle data streams because 1) it no longer requires storing all data in memory, and 2) since it updates the results incrementally, it easily remains up to date when new data enter. A wide range of statistical models can be estimated using this approach, e.g., linear regression and correlations.
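The idea of updating a result without revisiting old data points can be made concrete for linear regression. The sketch below is an illustration of the general online-learning principle, not a method taken from the thesis: instead of keeping the raw data, we maintain the sufficient statistics X'X and X'y and update them with each incoming observation. The variable names and the simulated data are my own.

```python
import numpy as np

# Streaming estimation of a simple linear regression: maintain the
# sufficient statistics X'X and X'y, updating them per data point.
rng = np.random.default_rng(0)

p = 2                      # intercept + one predictor
XtX = np.zeros((p, p))     # running sum of outer products x x'
Xty = np.zeros(p)          # running sum of x * y

true_beta = np.array([1.0, 0.5])
X_all, y_all = [], []      # stored here only to verify against the batch fit

for _ in range(1000):
    x = np.array([1.0, rng.normal()])            # a new data point arrives
    y = x @ true_beta + rng.normal(scale=0.1)
    XtX += np.outer(x, x)                        # update sufficient statistics;
    Xty += x * y                                 # old data are never revisited
    X_all.append(x); y_all.append(y)

beta_online = np.linalg.solve(XtX, Xty)

# The streaming estimate coincides with the ordinary batch least-squares fit:
beta_batch, *_ = np.linalg.lstsq(np.array(X_all), np.array(y_all), rcond=None)
print(beta_online, beta_batch)
```

Because X'X and X'y have a fixed size regardless of how many points have streamed in, memory use stays constant and the estimate is always up to date after each observation.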
However, when observations are clustered, for instance because the same individuals are observed repeatedly, rewriting the estimation procedure to be feasible in a data stream becomes more difficult. Clustered observations introduce dependencies between data points belonging to the same person, and ignoring these dependencies violates the important statistical assumption of independent observations. When a researcher does not account for these dependencies, the results are likely to be biased (i.e., inaccurate). Multilevel models are designed to take the dependencies between observations into account. These models, however, are computationally demanding to estimate in a data stream, because the estimation procedure requires multiple passes over the data set. These repeated passes, combined with the continuous arrival of new data points, make it difficult to estimate the multilevel model on data streams.
In this thesis, I developed an online-learning algorithm that updates the multilevel model as new data enter, without passing over all the data repeatedly. This algorithm, called SEMA (Streaming Expectation Maximization Approximation), updates the results of the multilevel model with each new data point, without storing the data points. Using this algorithm, researchers can analyze data streams efficiently while taking the clustered structure of the data into account.
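The following sketch is not the SEMA algorithm itself; it only illustrates the general mechanism SEMA builds on: maintaining per-individual summary statistics that are updated with each incoming data point, so that the clustered structure is tracked without storing the raw observations. The stream contents and identifiers are invented for the example.

```python
from collections import defaultdict

counts = defaultdict(int)    # number of observations seen per individual
means = defaultdict(float)   # running mean per individual

def update(person_id, value):
    """Process one incoming data point; earlier points are not revisited."""
    counts[person_id] += 1
    # incremental mean update: m_new = m_old + (x - m_old) / n
    means[person_id] += (value - means[person_id]) / counts[person_id]

# A toy data stream of (individual, measurement) pairs:
stream = [("anna", 4.0), ("bob", 2.0), ("anna", 6.0), ("anna", 5.0)]
for pid, x in stream:
    update(pid, x)

print(means["anna"], means["bob"])  # → 5.0 2.0
```

A full multilevel model additionally estimates variance components and shrinks the individual estimates toward the overall mean, which is where the EM-style approximation in SEMA comes in; the fixed-size-per-individual bookkeeping shown here is what makes such an approach feasible in a stream.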
Original language | English
---|---
Qualification | Doctor of Philosophy
Award date | 13 Oct 2017
Place of Publication | Vianen
Print ISBNs | 978-94-6295-757-2
Publication status | Published - 2017