Sound symbolic associations via contrastive learning of text and images

Giovanni Cassani, Giuseppe Attanasio, Federico Bianchi

Research output: Contribution to conference, poster, scientific, peer-reviewed


We show preliminary evidence that sound symbolic associations in the stimulus sets of two relevant psycholinguistic studies emerge in a computational model trained to embed sentence-image pairs in a joint latent space. We thus show for the first time that sound symbolic cross-modal associations between sub-lexical patterns and visual features can be learned via associative mechanisms, leveraging co-occurrences in the environment.

In detail, we use the CLIP model (Contrastive Language Image Pretraining [1]) to embed the stimuli from [2] and [3] and then use the WEAT score (Word Embedding Association Test [4]) to compute the association between sets of round/jagged images and pseudowords [5].

CLIP is a neural network trained on sentence-image pairs and represents sentences and images as points in a latent multimodal embedding space via contrastive learning. We used CLIP ViT-B/32, where images are fed to a visual encoder which relies on Vision Transformers [6], while sentences are fed to a language encoder, which relies on text transformers [7]. Importantly, the text encoder is also able to embed out-of-vocabulary words by leveraging part-words, enabling it to deal with pseudowords, which are typically used in studies on sound symbolism.

The WEAT score has been widely used to uncover associations between concepts, e.g., flowers and pleasantness or insects and unpleasantness, from word embeddings. We adopt the same approach but use multimodal CLIP embeddings rather than purely language-based ones to test sound-symbolic associations between visually presented pseudowords and roundish or jagged shapes.

We first embedded the two sets of pseudowords (8 sharp-sounding, W_jagged, and 8 round-sounding, W_round) and image stimuli (21 jagged shapes, I_jagged, and 20 roundish shapes, I_round) in [2]. The pseudowords were grouped into sharp- or round-sounding by the authors of the original study based on the phonemes they consist of and their known sound symbolic associations, e.g., b and round or k and sharp. We then computed the WEAT score between the embedded targets, W_jagged and W_round, and the embedded attributes, I_jagged and I_round, observing an association in the predicted direction (effect size of 0.479).

We then considered the stimuli in [3]. First, we embedded the two sets of target images (with 34 items each), featuring aliens with round vs. jagged contours, again obtaining I_jagged and I_round. Then, we embedded two sets of target words, one, W_sharp, containing 10 sharp attributes like angry, and one, W_soft, containing 10 soft attributes like friendly. The expected association between aggressive attributes and jagged aliens, based on lexical meaning rather than sound symbolism, was observed (effect size of 0.372). We then embedded two lists of 20 and 40 made-up names, equally divided into round- and jagged-sounding (W_round and W_jagged respectively), and computed the WEAT score with I_jagged and I_round, finding an association in the predicted direction for both lists (effect sizes of 0.316 and 0.131). A last test, considering 20 names equally divided into male- vs. female-sounding names and their expected association with jagged or round aliens respectively, also yielded an association (effect size of -0.346) but in the opposite direction, showing that CLIP may not be equally capable of capturing all documented sound-symbolic associations.

To conclude, we show for the first time that cross-modal sound-symbolic associations between written words and images from two influential studies on sound symbolism are learned by CLIP, a computational model which maps text and images onto a joint embedding space by exploiting co-occurrences between (sub-)lexical patterns and visual features. Unlike previous models [8], which were trained to map word-form representations to images, sound-symbolic associations emerge in CLIP as a by-product of a different objective. Cross-modal co-occurrences may thus be sufficient to learn sound-symbolic associations from experience via statistical learning [9]. Moreover, by showing that expected cross-modal associations are also captured when feeding CLIP with real words, we provide further evidence that a shared meaning space can embed words, about which we gather co-occurrence statistics, as well as pseudowords, which can be mapped onto the same representation space by leveraging statistics involving sub-lexical sequences [10]. Finally, we show that multimodal embedding models can be fruitfully used to study sound symbolism in a cross-modal set-up.

Text encoder: the 12-layer transformer operates on a lower-cased byte pair encoding (BPE) representation of the text and has a vocabulary size of 49,152. The activations of the highest layer (512 units) of the transformer at the End-of-Sequence token (appended to any input sentence) are considered as the feature representation of the input text which is linearly projected into the multi-modal embedding space after unit-normalization.
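
To illustrate how a BPE vocabulary lets the text encoder handle pseudowords it has never seen as whole words, the following minimal sketch applies a ranked list of merges to a lower-cased string. The merge table and the example pseudowords are hypothetical and chosen only for illustration; the actual CLIP tokenizer uses a learned merge table and an end-of-word marker, both of which are omitted here.

```python
def bpe_segment(word, merges):
    """Split a word into sub-word units by applying ranked BPE merges."""
    symbols = list(word.lower())  # start from single characters
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) merge rank.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table, NOT the one shipped with CLIP.
TOY_MERGES = [("k", "i"), ("b", "o"), ("u", "b")]

print(bpe_segment("kiki", TOY_MERGES))   # ['ki', 'ki']
print(bpe_segment("bouba", TOY_MERGES))  # ['bo', 'ub', 'a']
```

Each resulting unit has its own embedding, so a pseudoword is represented through the sub-lexical sequences it shares with attested words.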

Visual encoder: the 12-layer vision transformer (ViT-Base) operates on 32×32-pixel patches of the images and generates a feature representation (512 units) of the image which is projected into the multi-modal embedding space after unit-normalization.
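
As a small worked example of what operating on 32-pixel patches implies, the sketch below counts the patches a vision transformer would receive. The 224×224 input resolution is an assumption (it is not stated in this abstract), and `patch_grid` is a hypothetical helper name.

```python
def patch_grid(height, width, patch_size=32):
    """Return (rows, cols, total) of non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    rows, cols = height // patch_size, width // patch_size
    return rows, cols, rows * cols


# Assumed input resolution of 224x224 pixels.
print(patch_grid(224, 224))  # (7, 7, 49)
```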

CLIP-training: CLIP learns representations using a contrastive task: given a batch of image-sentence pairs <i_i, t_i>, the model is trained to increase the similarity of <i_i, t_i> while decreasing the similarity of <i_i, t_j> with i ≠ j. CLIP is trained on this task over more than 400M image-sentence pairs collected from the internet. We use the pre-trained CLIP model available through HuggingFace and do not apply further training. See [1] for further details.
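
The contrastive task can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the training code of [1]: it uses a symmetric cross-entropy over the image-sentence similarity matrix, a fixed temperature of 0.07 (an assumed constant here), and toy 2-dimensional embeddings.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k is (image_embs[k], text_embs[k])."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[r][c]: scaled cosine similarity of image r and sentence c.
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Each image should pick its own sentence, and vice versa.
    loss_i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    loss_t2i = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (loss_i2t + loss_t2i) / 2


images = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]     # sentence k matches image k
mismatched = [[0.0, 1.0], [1.0, 0.0]]  # sentences swapped

print(contrastive_loss(images, matched) < contrastive_loss(images, mismatched))
```

The loss is low when each matched pair <i_i, t_i> is more similar than every mismatched pair <i_i, t_j> in the batch, which is the pressure that pulls co-occurring text and image features together.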

WEAT: this method takes in four sets of embeddings: two sets of embedded target concepts (X and Y), e.g., round-sounding and sharp-sounding pseudowords, and two sets of embedded attributes (A and B), e.g., round and jagged images. The WEAT score summarizes whether X is more closely associated with A than Y is, and is computed following Eq. (1) and Eq. (2), where A and B are the sets of attributes (images, in our case), X and Y are the sets of targets (words or pseudowords, in our case), x, y, and w are (pseudo)words, and s is a measure of association based on cosine similarity:

(1) $s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B)$

(2) $s(w, A, B) = \frac{1}{|A|} \sum_{a \in A} \mathrm{cosine}(w, a) - \frac{1}{|B|} \sum_{b \in B} \mathrm{cosine}(w, b)$
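
Eq. (1) and Eq. (2) translate directly into code. The sketch below is a minimal pure-Python version operating on toy 2-dimensional vectors rather than CLIP embeddings. The `weat_effect_size` function is an assumption: the abstract reports effect sizes without giving a formula, so the sketch uses the normalisation described in [4] (difference of mean associations divided by the standard deviation over all targets), with the sample standard deviation as a further assumed choice.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, A, B):
    """Eq. (2): mean similarity of w to A minus mean similarity to B."""
    return mean([cosine(w, a) for a in A]) - mean([cosine(w, b) for b in B])


def weat_statistic(X, Y, A, B):
    """Eq. (1): summed association of X minus summed association of Y."""
    return sum(association(x, A, B) for x in X) - sum(
        association(y, A, B) for y in Y
    )


def weat_effect_size(X, Y, A, B):
    """Assumed normalisation following [4]; not spelled out in the abstract."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Toy data: A/B stand in for round/jagged images, X/Y for pseudowords.
A = [[1.0, 0.0]]
B = [[0.0, 1.0]]
X = [[1.0, 0.2], [0.9, 0.1]]  # targets lying close to A
Y = [[0.1, 1.0], [0.2, 0.8]]  # targets lying close to B

print(weat_statistic(X, Y, A, B) > 0)    # X leans towards A, Y towards B
print(weat_effect_size(X, Y, A, B) > 0)
```

A positive value indicates an association in the predicted direction (X with A, Y with B); a negative value indicates the reverse pattern.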

[1] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. ICML, 8748-8763.
[2] Imai, M., & Kita, S. (2014). The sound symbolism bootstrapping hypothesis for language acquisition and language evolution. Philos T Roy Soc B, 369(1651), 20130298.
[3] Sidhu, D. M., & Pexman, P. M. (2015). What's in a Name? Sound Symbolism and Gender in First Names. PLoS One, 10(5), e0126809.
[4] Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183-186.
[5] Ross, C., Katz, B., & Barbu, A. (2021). Measuring social biases in grounded vision and language embeddings. NAACL 2021.
[6] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[7] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. NeurIPS, 5998-6008.
[8] de Varda, A. G., & Strapparava, C. (2021). Phonovisual Biases in Language: is the Lexicon Tied to the visual world?. IJCAI-21, 643-649.
[9] Sidhu, D. M., & Pexman, P. M. (2018). Five mechanisms of sound symbolic association. Psychon B Rev, 25(5), 1619-1643.
[10] Cassani, G., & Limacher, N. (2021). Not just form, not just meaning: Words with consistent form-meaning mappings are learned earlier. Q J Exp Psychol.
Original language: English
Number of pages: 2
Publication status: Published - Sept 2022
Event: Architectures and Mechanisms for Language Processing (2022) - York, United Kingdom
Duration: 7 Sept 2022 to 9 Sept 2022


Conference: Architectures and Mechanisms for Language Processing (2022)
Abbreviated title: AMLaP
Country/Territory: United Kingdom

  • Sound symbolic associations
  • Sub-lexical patterns
  • Visual features

