Edinburgh Research Archive

Controlling context factors in abstractive summarization of long documents

dc.contributor.advisor
Cohen, Shay
dc.contributor.advisor
Titov, Ivan
dc.contributor.author
Fonseca, Marcio
dc.date.accessioned
2024-10-17T15:15:13Z
dc.date.available
2024-10-17T15:15:13Z
dc.date.issued
2024-10-17
dc.description.abstract
The massive influx of textual data poses a significant challenge in technical fields, fueling research on text summarization systems. Through innovative approaches in representation learning and extensive data utilization, these systems have demonstrated remarkable advancements, particularly within domain-specific contexts. More recently, large language models (LLMs) such as ChatGPT have demonstrated an impressive ability to generate abstractive summaries that are fluent and relevant according to human judgments, even without domain-specific training. While those models are regarded as strong general-purpose summarizers, technical documents require more nuanced control of contextual factors that depend on the target audience and task goals. In this thesis, we argue that integrating contextual factors that are not easily distilled from reference summaries is crucial for advancing the summarization of long technical documents. We establish a conceptual framework separating intrinsic factors that can be determined from document-summary pairs (e.g., redundancy and relevance) from extrinsic factors (e.g., conciseness and rhetoric) that depend on the task context and subjective intentionality. Guided by this framework, we approach the summarization problem as a factorized energy-based model, in which we optimize for intrinsic and extrinsic factors separately. Our model, FactorSum, achieves significant improvements in lexical alignment with reference summaries while requiring modest compute resources compared to baselines. Furthermore, we delve into the application of large language models to three types of scientific summarization tasks: abstract generation, summarization for reviews, and lay summarization. Our results show that these LLMs excel at controlling stylistic features such as budget and narrative perspective. However, these models exhibit gaps in their understanding of domain concepts in scientific papers, which limits more fine-grained control.
Finally, we propose an approach to improve the lexical alignment of summaries by guiding LLM summarizers with keywords derived from FactorSum, thus combining the strengths of both approaches. In conclusion, our investigation confirms that large language models are powerful tools for summarization tasks, occasionally eclipsing human-authored summaries according to expert judgments. However, we find that LLMs struggle to match the richness of human perspectives in lay summarization, for instance. Our factorized modeling approach partially addresses these limitations and will hopefully inspire future work on context-aware summarization.
en
dc.identifier.uri
https://hdl.handle.net/1842/42308
dc.identifier.uri
http://dx.doi.org/10.7488/era/5028
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Fonseca, M. and Cohen, S. B. (2024). Can large language model summarizers adapt to diverse scientific communication goals? arXiv preprint arXiv:2401.10415.
en
dc.relation.hasversion
Fonseca, M., Ziser, Y., and Cohen, S. B. (2022). Factorizing content and budget decisions in abstractive summarization of long documents. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6341–6364, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
en
dc.relation.hasversion
Malik, M., Zhao, Z., Fonseca, M., Rao, S., and Cohen, S. B. (2024). CivilSum: A dataset for abstractive summarization of Indian court decisions. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2241–2250.
en
dc.relation.hasversion
Silva, N. F. F. d., Silva, M. C. R., Pereira, F. S. F., Tarrega, J. P. M., Beinotti, J. V. P., Fonseca, M., Andrade, F. E. d., and de Carvalho, A. C. P. d. L. F. (2021). Evaluating topic models in Portuguese political comments about bills from Brazil's Chamber of Deputies. In Intelligent Systems: 10th Brazilian Conference, BRACIS 2021, Virtual Event, November 29 – December 3, 2021, Proceedings, Part II, pages 104–120, Berlin, Heidelberg. Springer-Verlag.
en
dc.subject
abstractive summarization of long documents
en
dc.subject
textual data
en
dc.subject
large language models (LLMs)
en
dc.subject
contextual factors
en
dc.subject
target audience
en
dc.subject
task goals
en
dc.subject
document-summary pairs
en
dc.subject
extrinsic factors
en
dc.title
Controlling context factors in abstractive summarization of long documents
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
FonsecaM_2024.pdf
Size:
1.31 MB
Format:
Adobe Portable Document Format