
dc.contributor.advisor: Keller, Frank
dc.contributor.advisor: Lavrenko, Victor
dc.contributor.author: Elliott, Desmond
dc.date.accessioned: 2015-09-02T14:14:44Z
dc.date.available: 2015-09-02T14:14:44Z
dc.date.issued: 2015-06-29
dc.identifier.uri: http://hdl.handle.net/1842/10524
dc.description.abstract: A photograph typically depicts an aspect of the real world, such as an outdoor landscape, a portrait, or an event. The task of creating abstract digital representations of images has received a great deal of attention in the computer vision literature because it is rarely useful to work directly with raw pixel data: small changes in lighting can produce very different pixel values, which hampers downstream tasks such as object detection. One approach to representing an image is to automatically extract and quantise visual features to create a bag-of-terms vector. The bag-of-terms vector overcomes the problems with raw pixel data, but this unstructured representation discards potentially useful information about the spatial and semantic relationships between the parts of the image.

The central argument of this thesis is that capturing and encoding the relationships between parts of an image will improve the performance of extrinsic tasks, such as image description or search. We explore this claim in the restricted domain of images representing events, such as riding a bicycle or using a computer.

The first major contribution of this thesis is the Visual Dependency Representation: a novel structured representation that captures the prominent region–region relationships in an image. The key idea is that images depicting the same events are likely to have similar spatial relationships between the regions contributing to the event. This representation is inspired by dependency syntax for natural language, which directly captures the relationships between the words in a sentence. We also contribute a data set of images annotated with multiple human-written descriptions, labelled image regions, and gold-standard Visual Dependency Representations, and explain how the gold-standard representations can be constructed by trained human annotators.

The second major contribution of this thesis is an approach to automatically predicting Visual Dependency Representations using a graph-based statistical dependency parser. A dependency parser is typically used in Natural Language Processing to predict the dependency structure of a sentence; we can apply one to images because we work with a discrete image representation, namely labelled image regions. Our approach exploits features from the region annotations and the description to predict the relationships between objects in an image. In a series of experiments using gold-standard region annotations, we report significant improvements in labelled and unlabelled directed attachment accuracy over a baseline that assumes there are no relationships between objects in an image.

Finally, we find significant improvements in two extrinsic tasks when we represent images as Visual Dependency Representations predicted from gold-standard region annotations. In an image description task, we show significant improvements in automatic evaluation measures and human judgements compared to state-of-the-art models that use either external text corpora or region proximity to guide the generation process. In the query-by-example image retrieval task, we show a significant improvement in Mean Average Precision and in the precision of the top 10 images compared to a bag-of-terms approach.

We also perform a correlation analysis of human judgements against automatic evaluation measures for the image description task. The automatic measures are standard measures adopted from the machine translation and summarisation literature. The main finding of the analysis is that unigram BLEU correlates less well with human judgements than Smoothed BLEU, Meteor, or skip-bigram ROUGE.
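The Visual Dependency Representation described in the abstract can be pictured as a dependency tree over labelled image regions, analogous to a dependency parse over the words of a sentence. The Python sketch below illustrates that idea only: the Region fields, the relation labels, and the nearest-neighbour attachment rule are invented stand-ins, not the thesis's annotation scheme or its trained graph-based parser, which scores head–dependent pairs with learned features and decodes the best tree.

```python
# Illustrative sketch only: a VDR-like tree over labelled image regions.
# Relation labels and the attachment rule are invented for illustration;
# the thesis instead predicts structures with a trained graph-based
# statistical dependency parser.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Region:
    label: str            # e.g. "person", "bicycle"
    x: float              # top-left corner (image coordinates, y grows downward)
    y: float
    w: float              # width
    h: float              # height

    @property
    def centre(self):
        return (self.x + self.w / 2, self.y + self.h / 2)

@dataclass
class Arc:
    head: Optional[str]   # None marks attachment to the image root
    dependent: str
    relation: str         # e.g. "above", "below", "beside"

def spatial_relation(head: Region, dep: Region) -> str:
    """Pick a coarse spatial label from region centres (hypothetical rules)."""
    hx, hy = head.centre
    dx, dy = dep.centre
    if abs(dy - hy) > abs(dx - hx):
        return "above" if dy < hy else "below"
    return "beside"

def sketch_vdr(regions: List[Region]) -> List[Arc]:
    """Build a tree by attaching each region to its nearest earlier region.
    A real graph-based parser would instead score every head-dependent
    pair with learned features and decode the best spanning tree."""
    arcs = [Arc(None, regions[0].label, "root")]
    for i, dep in enumerate(regions[1:], start=1):
        head = min(regions[:i],
                   key=lambda r: (r.centre[0] - dep.centre[0]) ** 2
                                 + (r.centre[1] - dep.centre[1]) ** 2)
        arcs.append(Arc(head.label, dep.label, spatial_relation(head, dep)))
    return arcs

# Toy "riding a bicycle" image: a person region above a bicycle region.
image = [Region("person", 40, 20, 30, 80), Region("bicycle", 35, 90, 45, 40)]
for arc in sketch_vdr(image):
    print(arc)  # Arc(head=None, ...), then Arc(head='person', dependent='bicycle', relation='below')
```

The baseline the abstract mentions corresponds to the degenerate case in which every region attaches directly to the root with no region–region relations.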
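The correlation analysis at the end of the abstract compares automatic measures against human judgements. As a rough, self-contained illustration, the sketch below computes Spearman's rank correlation, one standard choice for such analyses, over made-up score lists; every number is invented, so the output only mirrors the direction of the reported finding (Smoothed BLEU tracking human judgements more closely than unigram BLEU), not the thesis's actual results.

```python
# Illustrative sketch only: Spearman's rank correlation between human
# judgements and automatic metric scores. All numbers are invented for
# illustration and are not data from the thesis.

def ranks(values):
    """Rank positions (1 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for r, i in enumerate(order, start=1):
        out[i] = r
    return out

def spearman(xs, ys):
    """Spearman's rho via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical scores for five image descriptions.
human_judgements = [4.0, 2.5, 3.5, 1.0, 3.0]
smoothed_bleu    = [0.42, 0.18, 0.30, 0.05, 0.25]
unigram_bleu     = [0.60, 0.55, 0.58, 0.20, 0.30]

print(spearman(human_judgements, smoothed_bleu))  # 1.0 on these toy numbers
print(spearman(human_judgements, unigram_bleu))   # 0.9 on these toy numbers
```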
dc.contributor.sponsor: European Research Council
dc.language.iso: en
dc.publisher: The University of Edinburgh
dc.relation.hasversion: D. Elliott and F. Keller. 2011. A Treebank of Visual and Linguistic Data. In Proceedings of the Workshop on Integrating Language and Vision at Neural Information Processing Systems 2011. Granada, Spain.
dc.relation.hasversion: D. Elliott and F. Keller. 2013. Image Description using Visual Dependency Representations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1292–1302. Seattle, Washington, U.S.A.
dc.relation.hasversion: D. Elliott and F. Keller. 2014. Comparing Automatic Evaluation Measures for Image Description. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, Maryland, U.S.A.
dc.relation.hasversion: D. Elliott, V. Lavrenko, and F. Keller. 2014. Query-by-example Image Retrieval using Visual Dependency Representations. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics. Dublin, Ireland.
dc.subject: bag-of-terms vector
dc.subject: Visual Dependency Representation
dc.subject: region–region relationships
dc.subject: dependency parser
dc.title: Structured representation of images for language generation and image retrieval
dc.type: Thesis or Dissertation
dc.type.qualificationlevel: Doctoral
dc.type.qualificationname: PhD Doctor of Philosophy

