|dc.description.abstract||Narratives, such as movies and TV shows, provide a testbed for addressing a variety of challenges in the field of artificial intelligence. They are examples of complex stories where characters and events interact in many ways. Inferring what is happening in a narrative requires modeling long-range dependencies between events, understanding commonsense knowledge and accounting for non-linearities in the presentation of the story. Moreover, narratives are usually long (i.e., there are hundreds of pages in a screenplay and thousands of frames in a video) and cannot be easily processed by standard neural architectures. Movies and TV episodes also include information from multiple sources (i.e., video, audio, text) that complement each other when inferring high-level events and their interactions. Finally, creating large-scale multimodal datasets with narratives containing long videos and aligned textual data is challenging, resulting in small datasets that require data-efficient approaches.
Most prior work that analyzes narratives does not consider the above challenges all at once. In most cases, text-only approaches focus on full-length narratives with complex semantics and address tasks such as question-answering and summarization, or multimodal approaches are limited to short videos with simpler semantics (e.g., isolated actions and local interactions). In this thesis, we combine these two different directions in addressing narrative summarization. We use all input modalities (i.e., video, audio, text), consider full-length narratives and perform the task of narrative summarization both in a video-to-video setting (i.e., video summarization, trailer generation) and a video-to-text setting (i.e., multimodal abstractive summarization).
We hypothesize that information about the narrative structure of movies and TV episodes can facilitate summarizing them. We introduce the task of Turning Point identification and provide a corresponding dataset called TRIPOD as a means of analyzing the narrative structure of movies. According to screenwriting theory, turning points (e.g., change of plans, major setback, climax) are crucial narrative moments within a movie or TV episode: they define the plot structure and determine its progression and thematic units. We validate that narrative structure contributes to extractive screenplay summarization by testing our hypothesis on a dataset containing TV episodes and summary-specific labels.
We further hypothesize that movies should not be viewed as a sequence of scenes from a screenplay or shots from a video and instead be modelled as sparse graphs, where nodes are scenes or shots and edges denote strong semantic relationships between them. We utilize multimodal information for creating movie graphs in the latent space, and find that both graph-related and multimodal information help contextualization and boost performance on extractive summarization.
Moving one step further, we also address the task of trailer moment identification, which can be viewed as a specific instantiation of narrative summarization. We decompose this task, which is challenging and subjective, into two simpler ones: narrative structure identification, defined again by turning points, and sentiment prediction. We propose a graph-based unsupervised algorithm that uses interpretable criteria for retrieving trailer shots and convert it into an interactive tool with a human in the loop for trailer creation. Semi-automatic trailer shot selection exhibits comparable performance to fully manual selection according to human judges, while minimizing processing time.
After identifying salient content in narratives, we next attempt to produce abstractive textual summaries (i.e., video-to-text). We hypothesize that multimodal information is directly important for generating textual summaries, apart from contributing to content selection. To this end, we propose a parameter-efficient way of incorporating multimodal information into a pre-trained textual summarizer, while training only 3.8% of model parameters, and demonstrate the importance of multimodal information for generating high-quality and factual summaries. The findings of this thesis underline the need to focus on realistic and multimodal settings when addressing narrative analysis and generation tasks.||en