Unsupervised structure induction and multimodal grounding
dc.contributor.advisor
Titov, Ivan
dc.contributor.advisor
Lapata, Mirella
dc.contributor.author
Zhao, Yanpeng
dc.date.accessioned
2023-12-01T10:38:14Z
dc.date.available
2023-12-01T10:38:14Z
dc.date.issued
2023-12-01
dc.description.abstract
Structured representations build upon symbolic abstractions (e.g., words in natural language and visual concepts in natural images), offer a principled way of encoding our perceptions of the physical world, and enable human-like generalization in machine learning systems. The predominant paradigm for learning structured representations of observed data has been supervised learning, but it is limited in several respects. First, supervised learning is challenging given the scarcity of labeled data. Second, conventional approaches to structured prediction rely on a single modality (e.g., either images or text), ignoring learning cues that are readily available in other modalities of the data. In this thesis, we investigate unsupervised approaches to structure induction in a multimodal setting.
Unsupervised learning is inherently difficult in general, let alone when inducing complex, discrete structures from data without direct supervision. By considering the multimodal setting, we leverage alignments between different data modalities (e.g., text, audio, and images) to facilitate the learning of structure-induction models. For example, knowing that the individual words in "a white pigeon" always appear with the same visual object, a language parser is likely to treat them as a whole (i.e., a phrase). The multimodal learning setting is practically viable because multimodal alignments are generally abundant: they can be found in online posts such as news articles and tweets, which usually contain images and associated text, and in (YouTube) videos, where audio, scripts, and scenes are synchronized and grounded in each other.
We develop structure-induction models that exploit bimodal image-text alignments for two modalities: (1) for natural language, we consider unsupervised syntactic parsing with phrase-structure grammars and regularize the parser with visual groundings; and (2) for visual images, we induce scene graph representations by mapping arguments and predicates in the text to their visual counterparts (i.e., visual objects and the relations among them) in an unsupervised manner. While useful, crossmodal alignments are not always abundantly available on the web, e.g., alignments between non-speech audio and text. We tackle this challenge by sharing the visual modality between image-text alignment and image-audio alignment; images serve as a pivot that connects audio and text. The contributions of this thesis span model development to data collection. We demonstrate the feasibility of applying multimodal learning techniques to unsupervised structure induction and multimodal alignment collection. Our work opens up new avenues for multimodal and unsupervised structured representation learning.
en
dc.identifier.uri
https://hdl.handle.net/1842/41249
dc.identifier.uri
http://dx.doi.org/10.7488/era/3985
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Zellers, R., Lu, J., Lu, X., Yu, Y., Zhao, Y., Salehi, M., Kusupati, A., Hessel, J., Farhadi, A., and Choi, Y. (2022). Merlot reserve: Neural script knowledge through vision and language and sound. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16354–16366.
en
dc.relation.hasversion
Zhao, Y., Hessel, J., Yu, Y., Lu, X., Zellers, R., and Choi, Y. (2022). Connecting the dots between audio and text without parallel data through visual knowledge transfer. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4492–4507, Seattle, United States. Association for Computational Linguistics.
en
dc.relation.hasversion
Zhao, Y. and Titov, I. (2020). Visually grounded compound PCFGs. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4369–4379, Online. Association for Computational Linguistics.
en
dc.relation.hasversion
Zhao, Y. and Titov, I. (2021). An empirical study of compound PCFGs. In Proceedings of the Second Workshop on Domain Adaptation for NLP, pages 166–171, Kyiv, Ukraine. Association for Computational Linguistics.
en
dc.relation.hasversion
Zhao, Y. and Titov, I. (2023a). On the transferability of visually grounded PCFGs. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore. Association for Computational Linguistics.
en
dc.relation.hasversion
Zhao, Y. and Titov, I. (2023b). Unsupervised scene graph induction from natural language supervision. Technical report. https://github.com/zhaoyanpeng/sgi.
en
dc.relation.hasversion
Zhao, Y., Zhang, L., and Tu, K. (2018). Gaussian mixture latent vector grammars. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1181–1189, Melbourne, Australia. Association for Computational Linguistics.
en
dc.subject
multimodal grounding
en
dc.subject
Unsupervised structure induction
en
dc.subject
Unsupervised learning
en
dc.subject
multimodal learning
en
dc.subject
structure-induction models
en
dc.subject
bimodal image-text alignments
en
dc.subject
crossmodal alignments
en
dc.subject
multimodal alignment collection
en
dc.title
Unsupervised structure induction and multimodal grounding
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name:
- ZhaoY_2023.pdf
- Size:
- 5.33 MB
- Format:
- Adobe Portable Document Format