Coordination of vision and language in cross-modal referential processing
Coco, Moreno Ignazio
This thesis investigates the mechanisms underlying the formation, maintenance, and sharing of reference in tasks in which language and vision interact. Previous research in psycholinguistics and visual cognition has provided insights into the formation of reference in cross-modal tasks. The conclusions reached are largely independent, with the focus on mechanisms pertaining to either linguistic or visual processing. In this thesis, we present a series of eye-tracking experiments that aim to unify these distinct strands of research by identifying and quantifying factors that underlie the cross-modal interaction between scene understanding and sentence processing. Our results show that both low-level (image-based) and high-level (object-based) visual information interact actively with linguistic information during situated language processing tasks. In particular, during language understanding (Chapter 3), image-based information, i.e., saliency, is used to predict the upcoming arguments of the sentence, when the linguistic material alone is not sufficient to make such predictions. During language production (Chapter 4), visual attention has the active role of sourcing referential information for sentence encoding. We show that two important factors influencing this process are the visual density of the scene, i.e., clutter, and the animacy of the objects described. Both factors influence the type of linguistic encoding observed and the associated visual responses. We uncover a close relationship between linguistic descriptions and visual responses, triggered by the cross-modal interaction of scene and object properties, which implies a general mechanism of cross-modal referential coordination. Further investigation (Chapter 5) shows that visual attention and sentence processing are closely coordinated during sentence production: similar sentences are associated with similar scan patterns.
This finding holds across different scenes, which suggests that coordination goes beyond the well-known scene-based effects guiding visual attention, again supporting the existence of a general mechanism for the cross-modal coordination of referential information. The extent to which cross-modal mechanisms are activated depends on the nature of the task performed. We compare the three tasks of visual search, object naming, and scene description (Chapter 6) and explore how the modulation of cross-modal reference is reflected in the visual responses of participants. Our results show that the cross-modal coordination required in naming and description triggers longer visual processing and higher scan pattern similarity than in search. This difference is due to the coordination required to integrate and organize visual and linguistic referential processing. Overall, this thesis unifies explanations of distinct cognitive processes (visual and linguistic) based on the principle of cross-modal referentiality, and provides a new framework for unraveling the mechanisms that allow scene understanding and sentence processing to share and integrate information during cross-modal processing.