Adding part-of-speech information to the SUBTLEX-US word frequencies

Marc Brysbaert*, Boris New, Emmanuel Keuleers

*Corresponding author for this work

Research output: Contribution to journal > Article > Scientific > peer-review

Abstract

The SUBTLEX-US corpus has been parsed with the CLAWS tagger, so that researchers have information about the possible word classes (parts-of-speech, or PoSs) of the entries. Five new columns have been added to the SUBTLEX-US word frequency list: the dominant (most frequent) PoS for the entry, the frequency of the dominant PoS, the frequency of the dominant PoS relative to the entry's total frequency, all PoSs observed for the entry, and the respective frequencies of these PoSs. Because the current definition of lemma frequency does not seem to provide word recognition researchers with useful information (as illustrated by a comparison of the lemma frequencies and the word form frequencies from the Corpus of Contemporary American English), we have not provided a column with this variable. Instead, we hope that the full list of PoS frequencies will help researchers to collectively determine which combination of frequencies is the most informative.
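As an illustration of how the five columns described in the abstract relate to one another, the following minimal Python sketch derives them from a mapping of PoS labels to counts. The names `PosSummary` and `summarize_pos`, the PoS labels, and the counts are hypothetical and are not taken from the SUBTLEX-US files. The sketch also assumes that an entry's total frequency equals the sum of its PoS frequencies, which the abstract does not state.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PosSummary:
    """Hypothetical container mirroring the five columns in the abstract."""
    dominant_pos: str       # most frequent PoS for the entry
    dominant_freq: int      # frequency of the dominant PoS
    dominant_share: float   # dominant frequency / total frequency of the entry
    all_pos: List[str]      # every PoS observed, most frequent first
    all_freqs: List[int]    # frequencies aligned with all_pos


def summarize_pos(pos_counts: Dict[str, int]) -> PosSummary:
    """Derive the five PoS columns from per-PoS counts for one entry."""
    if not pos_counts:
        raise ValueError("pos_counts must contain at least one PoS")
    # Assumption: the entry's total frequency is the sum of its PoS counts.
    total = sum(pos_counts.values())
    if total <= 0:
        raise ValueError("total frequency must be positive")
    # Sort by descending frequency; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(pos_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    dominant_pos, dominant_freq = ranked[0]
    return PosSummary(
        dominant_pos=dominant_pos,
        dominant_freq=dominant_freq,
        dominant_share=dominant_freq / total,
        all_pos=[pos for pos, _ in ranked],
        all_freqs=[freq for _, freq in ranked],
    )


# Made-up counts for illustration only (not real SUBTLEX-US values).
summary = summarize_pos({"Verb": 30, "Noun": 70})
print(summary.dominant_pos, summary.dominant_freq, summary.dominant_share)
```

Keeping the full list of PoS labels and frequencies alongside the dominant PoS follows the abstract's stated aim of letting researchers try different combinations of frequencies themselves.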

Original language: English
Pages (from-to): 991-997
Number of pages: 7
Journal: Behavior Research Methods
Volume: 44
Issue number: 4
DOI: 10.3758/s13428-012-0190-4
Publication status: Published - Dec 2012
Externally published: Yes

Keywords

  • SUBTLEX
  • Word frequency
  • Part-of-speech information
  • Subtitles
  • Lexical decision
  • FILM SUBTITLES
  • LEXICON PROJECT
  • ENGLISH
  • NOUNS

Cite this

@article{9cec32c1dcf5416e80f5b268ab5ed83e,
title = "Adding part-of-speech information to the SUBTLEX-US word frequencies",
abstract = "The SUBTLEX-US corpus has been parsed with the CLAWS tagger, so that researchers have information about the possible word classes (parts-of-speech, or PoSs) of the entries. Five new columns have been added to the SUBTLEX-US word frequency list: the dominant (most frequent) PoS for the entry, the frequency of the dominant PoS, the frequency of the dominant PoS relative to the entry's total frequency, all PoSs observed for the entry, and the respective frequencies of these PoSs. Because the current definition of lemma frequency does not seem to provide word recognition researchers with useful information (as illustrated by a comparison of the lemma frequencies and the word form frequencies from the Corpus of Contemporary American English), we have not provided a column with this variable. Instead, we hope that the full list of PoS frequencies will help researchers to collectively determine which combination of frequencies is the most informative.",
keywords = "SUBTLEX, Word frequency, Part-of-speech information, Subtitles, Lexical decision, FILM SUBTITLES, LEXICON PROJECT, ENGLISH, NOUNS",
author = "Marc Brysbaert and Boris New and Emmanuel Keuleers",
year = "2012",
month = "12",
doi = "10.3758/s13428-012-0190-4",
language = "English",
volume = "44",
pages = "991--997",
journal = "Behavior Research Methods",
issn = "1554-351X",
publisher = "Springer",
number = "4",

}

Adding part-of-speech information to the SUBTLEX-US word frequencies. / Brysbaert, Marc; New, Boris; Keuleers, Emmanuel.

In: Behavior Research Methods, Vol. 44, No. 4, 12.2012, p. 991-997.

TY - JOUR

T1 - Adding part-of-speech information to the SUBTLEX-US word frequencies

AU - Brysbaert, Marc

AU - New, Boris

AU - Keuleers, Emmanuel

PY - 2012/12

Y1 - 2012/12

AB - The SUBTLEX-US corpus has been parsed with the CLAWS tagger, so that researchers have information about the possible word classes (parts-of-speech, or PoSs) of the entries. Five new columns have been added to the SUBTLEX-US word frequency list: the dominant (most frequent) PoS for the entry, the frequency of the dominant PoS, the frequency of the dominant PoS relative to the entry's total frequency, all PoSs observed for the entry, and the respective frequencies of these PoSs. Because the current definition of lemma frequency does not seem to provide word recognition researchers with useful information (as illustrated by a comparison of the lemma frequencies and the word form frequencies from the Corpus of Contemporary American English), we have not provided a column with this variable. Instead, we hope that the full list of PoS frequencies will help researchers to collectively determine which combination of frequencies is the most informative.

KW - SUBTLEX

KW - Word frequency

KW - Part-of-speech information

KW - Subtitles

KW - Lexical decision

KW - FILM SUBTITLES

KW - LEXICON PROJECT

KW - ENGLISH

KW - NOUNS

U2 - 10.3758/s13428-012-0190-4

DO - 10.3758/s13428-012-0190-4

M3 - Article

VL - 44

SP - 991

EP - 997

JO - Behavior Research Methods

JF - Behavior Research Methods

SN - 1554-351X

IS - 4

ER -