From grounded spaces to linguistic prediction: multimodal word meaning in context

Research output: Contribution to conference › Poster

Abstract

It has been suggested that word meaning representations are both linguistic and multimodal in nature [1], yet most research has focused on isolated words, whereas language is processed in context [2]. We will conduct a self-paced reading experiment to investigate whether the pre-activation of words in context involves a multimodal, visual representation component. Specifically, we manipulate two aspects of sentence continuations: first, linguistic association, defined as the degree to which the continuation is predictable in context, operationalised with Cloze data [3]; second, multimodal association, defined as the degree to which different continuations are visually related to the likeliest Cloze completion, according to the Lancaster sensorimotor norms [4] and visual word vectors derived with computer vision methods [5].
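
As a minimal sketch of how the multimodal association measure could be computed, assuming ViSpa-style visual word vectors stored in a plain-text file (the file name and format below are illustrative assumptions, not our actual materials), the visual relatedness of each candidate continuation to the baseline completion can be expressed as the cosine similarity of their vectors:

    # Sketch: visual (multimodal) association as cosine similarity between
    # visual word vectors (e.g. ViSpa-style embeddings [5]).
    import numpy as np

    def load_visual_vectors(path):
        """Read 'word v1 v2 ... vn' lines into a dict of numpy vectors."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split()
                vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
        return vectors

    def cosine(u, v):
        """Cosine similarity between two vectors."""
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    vecs = load_visual_vectors("vispa_vectors.txt")  # hypothetical file
    for candidate in ["phone", "compass", "wife", "dog"]:
        print(candidate, round(cosine(vecs["watch"], vecs[candidate]), 3))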

For instance, in sentence (1), “watch” is the completion with the highest Cloze probability and forms our baseline condition. We hypothesise that the expectedness of this continuation will lead to pre-activation of both its linguistic and its visual information. To test this hypothesis, we crafted four further plausible completions by manipulating linguistic (L) and multimodal (MM) association:

(1) The impatient man kept looking at his... [baseline: “watch”]
- (L+MM+) “phone”: predicted Cloze completion, visually similar to a watch
- (L-MM+) “compass”: zero Cloze completion, visually similar to a watch
- (L+MM-) “wife”: predicted Cloze completion, visually dissimilar to a watch
- (L-MM-) “dog”: zero Cloze completion, visually dissimilar to a watch

All sentences will be validated in a norming study for plausibility and for linguistic and visual association. We expect the fastest reading times for continuations with high Cloze probability (L+) [6]. Our critical additional prediction is that, if the pre-activation of “watch” also carries a visual representation, then “compass” will be read faster than “dog” (both L-), benefitting from the visual similarity between “watch” and “compass”. We further plan to compare the observed reading times to predictions obtained from grounded computational models of language [7].
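
As a rough sketch of how such model predictions could be derived, assuming a multimodal pre-trained transformer such as CLIP as the grounded model (the choice of model, the Hugging Face checkpoint name, and the out-of-context comparison of single words are simplifying assumptions, not the procedure reported here), the similarity between the baseline completion and each candidate can be read off the model's text-embedding space:

    # Sketch: comparing candidate continuations to the baseline completion
    # in the text-embedding space of a multimodal transformer (CLIP).
    import torch
    from transformers import CLIPModel, CLIPTokenizer

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

    words = ["watch", "phone", "compass", "wife", "dog"]
    inputs = tokenizer(words, padding=True, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalise
    sims = feats[1:] @ feats[0]                       # cosine with "watch"
    for word, sim in zip(words[1:], sims.tolist()):
        print(f"sim(watch, {word}) = {sim:.3f}")

The same comparison could in principle be made with contextualised rather than isolated word representations, which would be closer to the design of the reading experiment.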

References
[1] M. Andrews, S. Frank and G. Vigliocco, “Reconciling embodied and distributional accounts of meaning in language”, Top Cogn Sci, vol. 6, no. 3, pp. 359–370, 2014.
[2] J. R. Firth, “Introduction”, in Studies in Linguistic Analysis, J. R. Firth, Ed., Oxford: Basil Blackwell, 1957, pp. 1–32.
[3] J. E. Peelle et al., “Completion norms for 3085 English sentence contexts”, Behav Res Methods, vol. 52, pp. 1795–1799, 2020.
[4] D. Lynott, L. Connell, M. Brysbaert, J. Brand and J. Carney, “The Lancaster Sensorimotor Norms: multidimensional measures of perceptual and action strength for 40,000 English words”, Behav Res Methods, vol. 52, no. 3, pp. 1271–1291, 2020.
[5] F. Günther, M. Marelli, S. Tureski and M. A. Petilli, “ViSpa (Vision Spaces): A computer-vision-based representation system for individual images and concept prototypes, with large-scale evaluation”, Psychol Rev, vol. 130, no. 4, 2023.
[6] R. Levy, “Expectation-based syntactic comprehension”, Cognition, vol. 106, no. 3, pp. 1126–1177, 2008.
[7] S. Pezzelle, E. Takmaz and R. Fernández, “Word representation learning in multimodal pre-trained transformers: an intrinsic evaluation”, TACL, vol. 9, pp. 1563–1579, 2021.
Original language: English
Publication status: Published - Oct 2024

Keywords

  • grounded cognition
  • word meaning
  • context
  • language models
  • sentence processing
