Unsupervised structure induction and multimodal grounding
dc.contributor.advisor
Titov, Ivan
dc.contributor.advisor
Lapata, Mirella
dc.contributor.author
Zhao, Yanpeng
dc.date.accessioned
2023-12-01T10:38:14Z
dc.date.available
2023-12-01T10:38:14Z
dc.date.issued
2023-12-01
dc.description.abstract
Structured representations build upon symbolic abstractions (e.g., words in natural language and visual concepts in natural images), offer a principled way of encoding our perceptions of the physical world, and enable human-like generalization in machine learning systems. The predominant paradigm for learning structured representations of observed data has been supervised learning, but it is limited in several respects. First, supervised learning is challenging given the scarcity of labeled data. Second, conventional approaches to structured prediction rely on a single modality (e.g., either images or text), ignoring learning cues that are readily available in other modalities of the data. In this thesis, we investigate unsupervised approaches to structure induction in a multimodal setting.
Unsupervised learning is inherently difficult in general, let alone when inducing complex, discrete structures from data without direct supervision. By considering the multimodal setting, we leverage alignments between different data modalities (e.g., text, audio, and images) to facilitate the learning of structure-induction models. For example, knowing that the individual words in "a white pigeon" always appear with the same visual object, a language parser is likely to treat them as a whole (i.e., a phrase). The multimodal learning setting is practically viable because multimodal alignments are generally abundant: they can be found in online posts such as news articles and tweets, which usually contain images and associated text, and in (YouTube) videos, where audio, scripts, and scenes are synchronized and grounded in each other.
We develop structure-induction models that exploit bimodal image-text alignments for two modalities: (1) for natural language, we consider unsupervised syntactic parsing with phrase-structure grammars and regularize the parser with visual groundings; and (2) for visual images, we induce scene graph representations by mapping arguments and predicates in the text to their visual counterparts (i.e., visual objects and the relations among them) in an unsupervised manner. While useful, crossmodal alignments are not always abundantly available on the web, e.g., alignments between non-speech audio and text. We tackle this challenge by sharing the visual modality between image-text alignment and image-audio alignment; images serve as a pivot that connects audio and text. The contributions of this thesis span model development to data collection. We demonstrate the feasibility of applying multimodal learning techniques to unsupervised structure induction and multimodal alignment collection. Our work opens up new avenues for multimodal and unsupervised structured representation learning.
en
dc.identifier.uri
https://hdl.handle.net/1842/41249
dc.identifier.uri
http://dx.doi.org/10.7488/era/3985
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Zellers, R., Lu, J., Lu, X., Yu, Y., Zhao, Y., Salehi, M., Kusupati, A., Hessel, J., Farhadi, A., and Choi, Y. (2022). Merlot reserve: Neural script knowledge through vision and language and sound. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16354–16366.
en
dc.relation.hasversion
Zhao, Y., Hessel, J., Yu, Y., Lu, X., Zellers, R., and Choi, Y. (2022). Connecting the dots between audio and text without parallel data through visual knowledge transfer. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4492–4507, Seattle, United States. Association for Computational Linguistics.
en
dc.relation.hasversion
Zhao, Y. and Titov, I. (2020). Visually grounded compound PCFGs. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4369–4379, Online. Association for Computational Linguistics.
en
dc.relation.hasversion
Zhao, Y. and Titov, I. (2021). An empirical study of compound PCFGs. In Proceedings of the Second Workshop on Domain Adaptation for NLP, pages 166–171, Kyiv, Ukraine. Association for Computational Linguistics.
en
dc.relation.hasversion
Zhao, Y. and Titov, I. (2023a). On the transferability of visually grounded PCFGs. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore. Association for Computational Linguistics.
en
dc.relation.hasversion
Zhao, Y. and Titov, I. (2023b). Unsupervised scene graph induction from natural language supervision. Technical report. https://github.com/zhaoyanpeng/sgi.
en
dc.relation.hasversion
Zhao, Y., Zhang, L., and Tu, K. (2018). Gaussian mixture latent vector grammars. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1181–1189, Melbourne, Australia. Association for Computational Linguistics.
en
dc.subject
multimodal grounding
en
dc.subject
Unsupervised structure induction
en
dc.subject
Unsupervised learning
en
dc.subject
multimodal learning
en
dc.subject
structure-induction models
en
dc.subject
bimodal image-text alignments
en
dc.subject
crossmodal alignments
en
dc.subject
multimodal alignment collection
en
dc.title
Unsupervised structure induction and multimodal grounding
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name:
- ZhaoY_2023.pdf
- Size:
- 5.33 MB
- Format:
- Adobe Portable Document Format