Abstract
Sharing textual content in the form of public posts on online platforms remains a significant part of the social web. Research on stylometric profiling suggests that despite users' discreetness, and even under the guise of anonymity, the content and style of such posts may still reveal detailed author information. Studying how this might be inferred and obscured is relevant not only to the domain of cybersecurity, but also to those studying bias of classifiers drawing features from web corpora. While the collection of gold standard data is expensive, prior work shows that distant labels (i.e., those gathered via heuristics) offer an effective alternative. Currently, however, pre-existing corpora are limited in scope (e.g., variety of attributes and size). We present the SOBR corpus: 235M Reddit posts for which we used subreddits, flairs, and self-reports as distant labels for author attributes (age, gender, nationality, personality, and political leaning). In addition to detailing the data collection pipeline and sampling strategy, we report corpus statistics and provide a discussion on the various tasks and research avenues to be pursued using this resource. Along with the raw corpus, we provide sampled splits of the data, and suggest baselines for stylometric profiling. We close our work with a detailed set of ethical considerations relevant to the proposed lines of research.
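As a rough illustration of what distant labeling from self-reports and flairs can look like, the sketch below derives an age label from a post and a nationality label from a flair. It is not the SOBR pipeline: the regular expression, the age bounds, the function names, and the flair mapping are all hypothetical assumptions made for this example.

```python
import re
from typing import Dict, Optional

# Hypothetical self-report pattern (an assumption, not the pattern used for SOBR).
AGE_PATTERN = re.compile(r"\bI(?:'m| am) (\d{1,2}) years old\b", re.IGNORECASE)


def age_from_self_report(post: str) -> Optional[int]:
    """Return a distant age label if the post contains an explicit self-report."""
    match = AGE_PATTERN.search(post)
    if match is None:
        return None
    age = int(match.group(1))
    # Assumed plausibility bounds; implausible ages are discarded.
    return age if 13 <= age <= 99 else None


def label_from_flair(flair: Optional[str], mapping: Dict[str, str]) -> Optional[str]:
    """Map a user flair to a label using a caller-supplied (hypothetical) lookup table."""
    if flair is None:
        return None
    return mapping.get(flair.strip().lower())


if __name__ == "__main__":
    print(age_from_self_report("Honestly, I'm 27 years old and still learning."))
    print(label_from_flair(" Netherlands ", {"netherlands": "NL"}))
```

Heuristics of this kind trade precision for scale: they are cheap to apply to large volumes of posts, but the resulting labels are noisy and would need validation before use.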
Original language | English |
---|---|
Title of host publication | Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation |
Subtitle of host publication | LREC-COLING 2024 |
Editors | Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue |
Pages | 14967–14983 |
Number of pages | 17 |
Publication status | Published - 20 May 2024 |
Event | LREC-COLING 2024: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation - Torino, Italy. Duration: 20 May 2024 → 25 May 2024. https://lrec-coling-2024.org/ |
Conference
Conference | LREC-COLING 2024 |
---|---|
Country/Territory | Italy |
City | Torino |
Period | 20/05/24 → 25/05/24 |
Internet address | https://lrec-coling-2024.org/ |
Keywords
- corpus
- computational stylometry
- author identification
- author profiling
- author obfuscation
- adversarial stylometry
- algorithmic bias
- natural language processing
- computational linguistics
Fingerprint
Dive into the research topics of 'SOBR: A Corpus for Stylometry, Obfuscation, and Bias on Reddit'. Together they form a unique fingerprint.
Projects
- 1 Finished
- GRASP 👊: Gathering Redditors Against Stylometric Profiling
Emmery, C. (Principal Investigator), Miotto, M. (Researcher), Kramp, S. (Researcher) & Kleinberg, B. (CoPI)
16/01/23 → 28/07/23
Project: Research project