Can Joe be brave and bad? Experimental and computational approaches to thick terms

Giovanni Cassani, Fenna Blom, Emanuele Arcelli, Matteo Colombo

Research output: Contribution to conference › Poster › Scientific › peer-review


Thick terms, such as smart or racist, are at once evaluative and descriptive [1], unlike thin terms (e.g., good or right), which are purely evaluative, and descriptive terms (e.g., Dutch), which state a fact without providing an evaluation.
We started by replicating the cancellability task of [2], expanding the pool of target adjectives to 30 thick terms to test the original conclusions and improve generalization. We asked (i) to what extent the evaluative content presupposed by a thick term is cancellable and (ii) what sources of information predict participants’ contradiction judgements in a cancellability task with thick terms. We manipulated the intended coherence of the prompt by cancelling (or not) the evaluative component of the target thick term (see prompts (1) and (2) below). We further included control trials with purely descriptive adjectives of neutral polarity (3). We predicted responses as a function of the interaction between the polarity of the target adjective (positive, neutral, negative) and the thin term (positive or negative). In line with [2], the interaction was reliable (b = -4.5345, SE = 0.4096, z = -11.070, p < .001): respondents judged a prompt as more contradictory when the adjective’s polarity matched that of the thin term. Prompts with descriptive adjectives were more likely to be judged as non-contradictory regardless of the thin term. Replacing polarity (categorical) with valence (continuous, operationalized through norms elicited from subjects [3] and included as a second-order polynomial) did not improve the model fit (ΔAIC = 201.1), although valence was also a reliable predictor of participants’ judgments.
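The AIC differences reported above can be read against the standard definition of the Akaike Information Criterion, where k is the number of estimated parameters and L-hat the maximized likelihood. The sign convention in the second expression (alternative model minus reference model) is an assumption, since the abstract does not state it explicitly.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\Delta\mathrm{AIC} = \mathrm{AIC}_{\text{alternative}} - \mathrm{AIC}_{\text{reference}}
```

Under this assumed convention, a positive difference means the alternative model fits worse than the reference model once complexity is penalized.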
We further considered the cosine similarity between the vector representation of each target adjective and those of the thin adjectives, extracting embeddings from word2vec [4] and computing the difference between the two similarities. Words with a larger positive delta have a vector closer to that of the word ‘positive’ than to that of the word ‘negative’, and vice versa. This predictor entered a reliable interaction with the thin term (z = -11.07) but had a slightly worse fit than valence (ΔAIC = 30.3). This suggests that (i) language statistics are a relevant source of information to predict the perceived cancellability of a thick term; (ii) embedding spaces capture aspects of valence that predict fine-grained phenomena such as perceived contradiction between thick and thin terms, expanding on [5]; and (iii) the predictive power of valence norms can be approximated with a predictor derived from language statistics [6], pointing to a possible mechanism by which certain terms acquire an evaluative component that cannot be cancelled.
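The similarity delta described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the three-dimensional vectors are made-up stand-ins for the 300-dimensional word2vec embeddings.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-d vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def similarity_delta(adj_vec, pos_vec, neg_vec):
    """Similarity to 'positive' minus similarity to 'negative'.

    A value above zero means the adjective lies closer to 'positive'.
    """
    return cosine(adj_vec, pos_vec) - cosine(adj_vec, neg_vec)


# Toy vectors (hypothetical values, for illustration only).
vectors = {
    "positive": np.array([1.0, 0.0, 0.2]),
    "negative": np.array([-1.0, 0.0, 0.2]),
    "honest": np.array([0.8, 0.3, 0.1]),
    "arrogant": np.array([-0.7, 0.4, 0.1]),
}

deltas = {
    word: similarity_delta(vectors[word], vectors["positive"], vectors["negative"])
    for word in ("honest", "arrogant")
}
print(deltas)
```

With real embeddings, the dictionary lookups would be replaced by lookups in the pre-trained space; the delta computation itself would stay the same.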
Finally, we used a Cloze task to test whether the evaluative component of thick terms influences how people complete a sentence, and how well that component predicts the polarity of the words they expect to follow. We asked participants to complete a sentence containing a thick term, manipulating its polarity and the conjunction (coordinating or adversative, see (4), (5) and (6) for example stimuli). We then fed the continuations to BERT’s sentiment classifier [7], obtaining the support the model assigns to the positive and negative sentiment labels. We observed a reliable interaction (t = -25.622) between the conjunction and the valence of the thick term, gauged from [3]. As expected, with a coordinating conjunction, the more positive the thick term, the more positive the continuation. The pattern was reversed, however, when the conjunction was adversative.
Our results show that people are sensitive to the evaluative components of thick terms: polarity predicts the sentiment of free completions in a Cloze task and judgments of contradiction in a cancellability task. Moreover, a measure extracted from vector representations learned from co-occurrence statistics [8] proved to be almost as predictive of contradiction ratings in the cancellability task as valence norms elicited from participants in a large study [3], suggesting that the evaluative component of thick terms may depend on how these terms are used in language. Our results also bear on ongoing philosophical debates about the relation between the descriptive and evaluative components of thick terms [9] and what this might indicate about the objectivity of evaluative language [10]. In the future, we plan to consider more implicit measures of perceived contradiction, e.g., reading times, and to investigate terms that combine evaluation and description only in specific circumstances, e.g., hot.
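One way to turn the classifier's per-label support into a single number per continuation is sketched below. The abstract does not say how the two supports were combined, so taking their difference is an assumption, as are the function names and the input format (a list of label/score dictionaries, as many sentiment classifiers return). The example scores are invented.

```python
def label_support(scores):
    """Map a list of {'label': ..., 'score': ...} dicts to {label: score}.

    Labels are lowercased so 'POSITIVE' and 'positive' are treated alike.
    """
    return {entry["label"].lower(): float(entry["score"]) for entry in scores}


def signed_sentiment(scores):
    """Support for the positive label minus support for the negative label."""
    support = label_support(scores)
    return support["positive"] - support["negative"]


# Hypothetical classifier output for one continuation (values are made up).
example_output = [
    {"label": "POSITIVE", "score": 0.9},
    {"label": "NEGATIVE", "score": 0.1},
]
print(signed_sentiment(example_output))
```

A signed score of this kind ranges from -1 to 1 when the two supports sum to one, which makes it usable as the dependent variable in a linear regression.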

Cancellability task: implemented in Qualtrics; 344 participants were recruited via MTurk. Prompts were created based on the template [NOUN PHRASE] is [ADJECTIVE], but by that I’m not saying something [positive/negative] about [PRONOUN]. We used the same adjectives as in the Cloze task and asked participants to judge to what extent the sentence they read was contradictory using a 4-point scale (definitely contradictory, somewhat contradictory, somewhat non-contradictory, definitely non-contradictory). Each participant saw 15 prompts, and we ensured that nobody saw two prompts that differed only in the target thin term (negative or positive). Responses were modeled using cumulative link mixed models, fitted with the ordinal package in R [12], with the maximal random-effects structure that resulted in a converging model. Sample stimuli:
(1) Ana is honest, but by that I am not saying something negative about her.
(2) Peter is insincere, but by that I am not saying something negative about him.
(3) The table is orange, but by that I am not saying something positive about it.
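The prompt template and the cancelling manipulation can be sketched as below. This is an illustrative reconstruction with hypothetical function names: the wording follows the sample stimuli ("I am"), and the rule that a prompt cancels the evaluative component when the thin term matches the adjective's polarity is inferred from the description of the design.

```python
TEMPLATE = "{np} is {adj}, but by that I am not saying something {thin} about {pron}."


def build_prompt(noun_phrase, adjective, thin_term, pronoun):
    """Fill the cancellability template with one item."""
    if thin_term not in ("positive", "negative"):
        raise ValueError("thin_term must be 'positive' or 'negative'")
    return TEMPLATE.format(np=noun_phrase, adj=adjective, thin=thin_term, pron=pronoun)


def cancels_evaluation(adjective_polarity, thin_term):
    """True when the denied thin term matches the adjective's own polarity.

    Neutral (purely descriptive) adjectives never match, so they never cancel.
    """
    return adjective_polarity == thin_term


prompt = build_prompt("Ana", "honest", "negative", "her")
print(prompt)
print(cancels_evaluation("positive", "negative"))
```

Keeping the polarity check separate from the string template makes it straightforward to build both versions of each item and assign only one of them to a given participant.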

Cloze task: implemented in Qualtrics; 400 participants were recruited via MTurk. Sentence prompts were created based on the template [NOUN PHRASE] is [ADJECTIVE] [and/but], where ADJECTIVE could be one of 30 positive or negative thick terms or one of 30 descriptive adjectives (used as control items). Each participant saw 10 prompts and was asked to provide two possible completions for each prompt in free text boxes. We filtered out uncooperative responses manually and lowercased all texts. Responses were modeled with multilevel linear regression models fitted using lme4 in R [11], including random intercepts for adjective and participant. Sample stimuli:
(4) John is compassionate but ...
(5) The lawyer is arrogant and ...
(6) The town’s main square is round but ...

Word2vec: we used the pre-trained space provided by Google, trained on a 100-billion word news corpus with the Skip-gram method, with a vocabulary of 3M words mapped to 300-dimensional vectors.

[1] Väyrynen, P. (2021). Thick Ethical Concepts. In E. N. Zalta (ed.), The Stanford Encyclopedia of Philosophy (Spring 2021 Edition).
[2] Willemsen, P. & Reuter, K. (2020). Separability and the Effect of Valence. Proceedings of the 42nd Annual Conference of the Cognitive Science Society, Austin, TX: Cognitive Science Society, pp. 794–800.
[3] Warriner, A. B., Kuperman, V., & Brysbaert, M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior research methods, 45(4), 1191-1207.
[4] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. NeurIPS.
[5] Hollis, G., & Westbury, C. (2016). The principals of meaning: Extracting semantic dimensions from co-occurrence models of semantics. Psychon B Rev, 23(6), 1744-1756.
[6] Westbury, C. (2016). Pay no attention to that man behind the curtain. The mental lexicon, 11(3), 350-374.
[7] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-2019.
[8] Firth, J. R. (1957). Papers in linguistics, 1934-1951. Oxford, UK: Oxford University Press.
[9] Väyrynen, P. (2013). The Lewd, the Rude and the Nasty, New York: Oxford University Press.
[10] Putnam, H. (2002). The Entanglement of Fact and Value. In The Collapse of the Fact/Value Dichotomy and Other Essays. Cambridge, Mass.: Harvard University Press, 28–45.
[11] Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1–48.
[12] Christensen, R. H. B. (2019). ordinal: Regression Models for Ordinal Data. R package version 2019.12-10.
Original language: English
Publication status: Published - Sept 2022
Event: Architectures and Mechanisms for Language Processing (2022) - York, United Kingdom
Duration: 7 Sept 2022 to 9 Sept 2022


Conference: Architectures and Mechanisms for Language Processing (2022)
Abbreviated title: AMLaP
Country/Territory: United Kingdom