Assessing the usefulness of Google Books' word frequencies for psycholinguistic research on word processing

Marc Brysbaert*, Emmanuel Keuleers, Boris New

*Corresponding author for this work

Research output: Contribution to journal > Article > Scientific > peer-review

Abstract

In this Perspective Article we assess the usefulness of Google's new word frequencies for word recognition research (lexical decision and word naming). We find that, despite the massive corpus on which the Google estimates are based (131 billion words from books published in the United States alone), the Google American English frequencies explain 11% less of the variance in the lexical decision times from the English Lexicon Project (Balota et al., 2007) than the SUBTLEX-US word frequencies, based on a corpus of 51 million words from film and television subtitles. Further analyses indicate that word frequencies derived from recent books (published after 2000) are better predictors of word processing times than frequencies based on the full corpus, and that word frequencies based on fiction books predict word processing times better than word frequencies based on the full corpus. The most predictive word frequencies from Google still do not explain more of the variance in word recognition times of undergraduate students and old adults than the subtitle-based word frequencies.

Original language: English
Article number: 27
Number of pages: 8
Journal: Frontiers in Psychology
Volume: 2
DOI: 10.3389/fpsyg.2011.00027
Publication status: Published - 2011
Externally published: Yes

Keywords

  • word frequency
  • lexical decision
  • Google Books ngrams
  • SUBTLEX

Cite this

@article{6ac7b8fcbfb546858f0e5ae6d577aa8d,
title = "Assessing the usefulness of Google Books' word frequencies for psycholinguistic research on word processing",
abstract = "In this Perspective Article we assess the usefulness of Google's new word frequencies for word recognition research (lexical decision and word naming). We find that, despite the massive corpus on which the Google estimates are based (131 billion words from books published in the United States alone), the Google American English frequencies explain 11{\%} less of the variance in the lexical decision times from the English Lexicon Project (Balota et al., 2007) than the SUBTLEX-US word frequencies, based on a corpus of 51 million words from film and television subtitles. Further analyses indicate that word frequencies derived from recent books (published after 2000) are better predictors of word processing times than frequencies based on the full corpus, and that word frequencies based on fiction books predict word processing times better than word frequencies based on the full corpus. The most predictive word frequencies from Google still do not explain more of the variance in word recognition times of undergraduate students and old adults than the subtitle-based word frequencies.",
keywords = "word frequency, lexical decision, Google Books ngrams, SUBTLEX",
author = "Marc Brysbaert and Emmanuel Keuleers and Boris New",
year = "2011",
doi = "10.3389/fpsyg.2011.00027",
language = "English",
volume = "2",
journal = "Frontiers in Psychology",
issn = "1664-1078",
publisher = "Frontiers Media S.A.",
}

Assessing the usefulness of Google Books' word frequencies for psycholinguistic research on word processing. / Brysbaert, Marc; Keuleers, Emmanuel; New, Boris.

In: Frontiers in Psychology, Vol. 2, Article 27, 2011.

TY - JOUR
T1 - Assessing the usefulness of Google Books' word frequencies for psycholinguistic research on word processing
AU - Brysbaert, Marc
AU - Keuleers, Emmanuel
AU - New, Boris
PY - 2011
Y1 - 2011
AB - In this Perspective Article we assess the usefulness of Google's new word frequencies for word recognition research (lexical decision and word naming). We find that, despite the massive corpus on which the Google estimates are based (131 billion words from books published in the United States alone), the Google American English frequencies explain 11% less of the variance in the lexical decision times from the English Lexicon Project (Balota et al., 2007) than the SUBTLEX-US word frequencies, based on a corpus of 51 million words from film and television subtitles. Further analyses indicate that word frequencies derived from recent books (published after 2000) are better predictors of word processing times than frequencies based on the full corpus, and that word frequencies based on fiction books predict word processing times better than word frequencies based on the full corpus. The most predictive word frequencies from Google still do not explain more of the variance in word recognition times of undergraduate students and old adults than the subtitle-based word frequencies.

KW - word frequency
KW - lexical decision
KW - Google Books ngrams
KW - SUBTLEX
U2 - 10.3389/fpsyg.2011.00027
DO - 10.3389/fpsyg.2011.00027
M3 - Article
VL - 2
JO - Frontiers in Psychology
JF - Frontiers in Psychology
SN - 1664-1078
M1 - 27
ER -