Dealing with data streams

An online, row-by-row estimation tutorial

G.J.E. Ippel, M.C. Kaptein, J.K. Vermunt

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Novel technological advances allow distributed and automatic measurement of human behavior. While these technologies provide exciting new research opportunities, they also provide challenges: datasets collected using new technologies grow increasingly large, and in many applications the collected data are continuously augmented. These data streams make the standard computation of well-known estimators inefficient as the computation has to be repeated each time a new data point enters. In this tutorial paper, we detail online learning, an analysis method that facilitates the efficient analysis of Big Data and continuous data streams. We illustrate how common analysis methods can be adapted for use with Big Data using an online, or "row-by-row", processing approach. We present several simple (and exact) examples of the online estimation and we discuss Stochastic Gradient Descent as a general (approximate) approach to estimate more complex models. We end this article with a discussion of the methodological challenges that remain.
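To make the "row-by-row" idea concrete, here is a minimal sketch of an exact online estimator: a running mean and sample variance that are updated from each new data point without revisiting earlier rows. It is not taken from the article; the class name `OnlineMeanVar` and its fields are hypothetical, and the variance update uses Welford's well-known recurrence as an assumed illustration.

```python
from dataclasses import dataclass


@dataclass
class OnlineMeanVar:
    """Running mean and sample variance, updated one observation at a time.

    Illustrative only: the names here are hypothetical and the update rule
    is Welford's recurrence, not code from the article.
    """
    n: int = 0          # number of observations seen so far
    mean: float = 0.0   # current estimate of the mean
    m2: float = 0.0     # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold a single new data point into the summaries."""
        self.n += 1
        delta = x - self.mean              # deviation from the old mean
        self.mean += delta / self.n        # new mean after seeing x
        self.m2 += delta * (x - self.mean) # uses old and new deviation

    @property
    def variance(self) -> float:
        """Unbiased sample variance; undefined for fewer than two points."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


# Usage: process a stream one row at a time, keeping only three numbers.
estimator = OnlineMeanVar()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    estimator.update(value)
print(estimator.mean, estimator.variance)
```

Each update costs constant time and memory, so the estimate is always current and nothing has to be recomputed over the full dataset when a new data point arrives.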
Original language: English
Pages (from-to): 124-138
Journal: Methodology: European Journal of Research Methods for the Behavioral and Social Sciences
Volume: 12
Issue number: 4
DOI: 10.1027/1614-2241/a000116
Publication status: Published - 2017

Keywords

  • big data
  • Data streams
  • online learning
  • machine learning
  • stochastic gradient descent

Cite this

@article{225c2c7fa75245f6bac392bafe2f46ed,
title = "Dealing with data streams: An online, row-by-row estimation tutorial",
abstract = "Novel technological advances allow distributed and automatic measurement of human behavior. While these technologies provide exciting new research opportunities, they also provide challenges: datasets collected using new technologies grow increasingly large, and in many applications the collected data are continuously augmented. These data streams make the standard computation of well-known estimators inefficient as the computation has to be repeated each time a new data point enters. In this tutorial paper, we detail online learning, an analysis method that facilitates the efficient analysis of Big Data and continuous data streams. We illustrate how common analysis methods can be adapted for use with Big Data using an online, or ``row-by-row'', processing approach. We present several simple (and exact) examples of the online estimation and we discuss Stochastic Gradient Descent as a general (approximate) approach to estimate more complex models. We end this article with a discussion of the methodological challenges that remain.",
keywords = "big data, Data streams, online learning, machine learning, stochastic gradient descent",
author = "G.J.E. Ippel and M.C. Kaptein and J.K. Vermunt",
year = "2017",
doi = "10.1027/1614-2241/a000116",
language = "English",
volume = "12",
pages = "124--138",
journal = "Methodology: European Journal of Research Methods for the Behavioral and Social Sciences",
issn = "1614-1881",
publisher = "Hogrefe \& Huber Publishers",
number = "4",
}

Dealing with data streams: An online, row-by-row estimation tutorial. / Ippel, G.J.E.; Kaptein, M.C.; Vermunt, J.K.

In: Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, Vol. 12, No. 4, 2017, p. 124-138.

TY  - JOUR
T1  - Dealing with data streams
T2  - An online, row-by-row estimation tutorial
AU  - Ippel, G.J.E.
AU  - Kaptein, M.C.
AU  - Vermunt, J.K.
PY  - 2017
Y1  - 2017
N2  - Novel technological advances allow distributed and automatic measurement of human behavior. While these technologies provide exciting new research opportunities, they also provide challenges: datasets collected using new technologies grow increasingly large, and in many applications the collected data are continuously augmented. These data streams make the standard computation of well-known estimators inefficient as the computation has to be repeated each time a new data point enters. In this tutorial paper, we detail online learning, an analysis method that facilitates the efficient analysis of Big Data and continuous data streams. We illustrate how common analysis methods can be adapted for use with Big Data using an online, or ``row-by-row'', processing approach. We present several simple (and exact) examples of the online estimation and we discuss Stochastic Gradient Descent as a general (approximate) approach to estimate more complex models. We end this article with a discussion of the methodological challenges that remain.
AB  - Novel technological advances allow distributed and automatic measurement of human behavior. While these technologies provide exciting new research opportunities, they also provide challenges: datasets collected using new technologies grow increasingly large, and in many applications the collected data are continuously augmented. These data streams make the standard computation of well-known estimators inefficient as the computation has to be repeated each time a new data point enters. In this tutorial paper, we detail online learning, an analysis method that facilitates the efficient analysis of Big Data and continuous data streams. We illustrate how common analysis methods can be adapted for use with Big Data using an online, or ``row-by-row'', processing approach. We present several simple (and exact) examples of the online estimation and we discuss Stochastic Gradient Descent as a general (approximate) approach to estimate more complex models. We end this article with a discussion of the methodological challenges that remain.
KW  - big data
KW  - Data streams
KW  - online learning
KW  - machine learning
KW  - stochastic gradient descent
U2  - 10.1027/1614-2241/a000116
DO  - 10.1027/1614-2241/a000116
M3  - Article
VL  - 12
SP  - 124
EP  - 138
JO  - Methodology: European Journal of Research Methods for the Behavioral and Social Sciences
JF  - Methodology: European Journal of Research Methods for the Behavioral and Social Sciences
SN  - 1614-1881
IS  - 4
ER  -