TY - JOUR
T1 - Manual Annotation of Unsupervised Models
T2 - Close and Distant Reading of Politics on Reddit
AU - Aurnhammer, Christoph
AU - Cuppen, Iris
AU - van de Ven, Inge
AU - van Zaanen, Menno
PY - 2019
Y1 - 2019
AB - This article offers a methodological contribution to manually-assisted topic modeling. With the availability of vast amounts of (online) texts, performing full scale literary analysis using a close reading approach is not practically feasible. The set of alternatives proposed by Franco Moretti (2000) under the umbrella term of "distant reading" aims to show broad patterns that can be found throughout the entire text collection. After a survey of literary-critical practices that combine close and distant reading methods, we use manual annotations of a thread on Reddit, both to evaluate an LDA model, and to provide information that topic modeling lacks. We also make a case for applying these reading techniques that originate in literary reading more broadly to online, non-literary contexts. Given a large collection of posts from a Reddit thread, we compare a manual, close reading analysis against an automatic, computational distant reading approach based on topic modeling using LDA. For each text in the collection, we label the contents, effectively clustering related texts. Next, we evaluate the similarity of the respective outcomes of the two approaches. Our results show that the computational content/topic-based labeling partially overlaps with the manual annotation. However, the close reading approach not only identifies texts with similar content, but also those with similar function. The differences in annotation approaches require rethinking the purpose of computational techniques in reading analysis. Thus, we present a model that could be valuable for scholars who have a small amount of manual annotation that could be used to tune an unsupervised model of a larger dataset.
M3 - Article
SN - 1938-4122
VL - 13
JO - Digital Humanities Quarterly
JF - Digital Humanities Quarterly
IS - 3
ER -