SOBR: A Corpus for Stylometry, Obfuscation, and Bias on Reddit

Chris Emmery*, Marilù Miotto, Sergey Kramp, Bennett Kleinberg

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

Sharing textual content in the form of public posts on online platforms remains a significant part of the social web. Research on stylometric profiling suggests that despite users' discreetness, and even under the guise of anonymity, the content and style of such posts may still reveal detailed author information. Studying how this might be inferred and obscured is relevant not only to the domain of cybersecurity, but also to those studying bias of classifiers drawing features from web corpora. While the collection of gold standard data is expensive, prior work shows that distant labels (i.e., those gathered via heuristics) offer an effective alternative. Currently, however, pre-existing corpora are limited in scope (e.g., variety of attributes and size). We present the SOBR corpus: 235M Reddit posts for which we used subreddits, flairs, and self-reports as distant labels for author attributes (age, gender, nationality, personality, and political leaning). In addition to detailing the data collection pipeline and sampling strategy, we report corpus statistics and provide a discussion on the various tasks and research avenues to be pursued using this resource. Along with the raw corpus, we provide sampled splits of the data, and suggest baselines for stylometric profiling. We close our work with a detailed set of ethical considerations relevant to the proposed lines of research.
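
As a rough illustration of what distant labeling via heuristics can look like, the following minimal sketch derives an age label from a self-report and a nationality label from a user flair. The regular expression, the flair mapping, the age bounds, and the function names are all hypothetical assumptions made for this example; they are not taken from the SOBR collection pipeline.

```python
import re
from typing import Optional

# Hypothetical pattern (an assumption, not the corpus authors' rule):
# matches self-reports such as "I am 27 years old" or "I'm 31 yo".
AGE_SELF_REPORT = re.compile(
    r"\bI(?:'m| am)\s+(\d{1,2})\s*(?:years?\s+old|y/?o)\b",
    re.IGNORECASE,
)

# Hypothetical flair-to-label mapping, for illustration only.
FLAIR_TO_NATIONALITY = {"germany": "DE", "italy": "IT", "netherlands": "NL"}


def age_from_self_report(text: str) -> Optional[int]:
    """Return a distant age label if the post contains an age self-report."""
    match = AGE_SELF_REPORT.search(text)
    if match is None:
        return None
    age = int(match.group(1))
    # Assumed plausibility bounds; values outside them are discarded as noise.
    return age if 13 <= age <= 99 else None


def nationality_from_flair(flair: Optional[str]) -> Optional[str]:
    """Return a distant nationality label if the flair is in the mapping."""
    if not flair:
        return None
    return FLAIR_TO_NATIONALITY.get(flair.strip().lower())
```

Heuristics of this kind trade precision for scale: they will miss many authors and mislabel some, which is why such labels are treated as distant rather than gold standard.
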
Original language: English
Title of host publication: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Subtitle of host publication: LREC-COLING 2024
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Pages: 14967–14983
Number of pages: 17
Publication status: Published - 20 May 2024
Event: LREC-COLING 2024: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation - Torino, Italy
Duration: 20 May 2024 – 25 May 2024
https://lrec-coling-2024.org/

Conference

Conference: LREC-COLING 2024
Country/Territory: Italy
City: Torino
Period: 20/05/24 – 25/05/24

Keywords

  • corpus
  • computational stylometry
  • author identification
  • author profiling
  • author obfuscation
  • adversarial stylometry
  • algorithmic bias
  • natural language processing
  • computational linguistics
