Edinburgh Research Archive

Controlling context factors in abstractive summarization of long documents

dc.contributor.advisor
Cohen, Shay
dc.contributor.advisor
Titov, Ivan
dc.contributor.author
Fonseca, Marcio
dc.date.accessioned
2024-10-17T15:15:13Z
dc.date.available
2024-10-17T15:15:13Z
dc.date.issued
2024-10-17
dc.description.abstract
The massive influx of textual data poses a significant challenge in technical fields, fueling research on text summarization systems. Through innovative approaches in representation learning and extensive data utilization, these systems have demonstrated remarkable advancements, particularly within domain-specific contexts. More recently, large language models (LLMs) such as ChatGPT have demonstrated an impressive ability to generate abstractive summaries that are fluent and relevant according to human judgments, even without domain-specific training. While those models are regarded as strong general-purpose summarizers, technical documents require more nuanced control of contextual factors that depend on the target audience and task goals. In this thesis, we argue that integrating contextual factors that are not easily distilled from reference summaries is crucial for advancing the summarization of long technical documents. We establish a conceptual framework separating intrinsic factors that can be determined from document-summary pairs (e.g., redundancy and relevance) from extrinsic factors (e.g., conciseness and rhetoric) that depend on the task context and subjective intentionality. Guided by this framework, we approach the summarization problem as a factorized energy-based model, in which we optimize for intrinsic and extrinsic factors separately. Our model, FactorSum, achieves significant improvements in lexical alignment with reference summaries while requiring modest compute resources compared to baselines. Furthermore, we delve into the application of large language models to three types of scientific summarization tasks: abstract generation, summarization for reviews, and lay summarization. Our results show that these LLMs excel at controlling stylistic features such as budget and narrative perspective. However, these models exhibit gaps in their understanding of domain concepts in scientific papers, which limits more fine-grained control.
Finally, we propose an approach to improve the lexical alignment of summaries by guiding LLM summarizers with keywords derived from FactorSum, thus combining the strengths of both approaches. In conclusion, our investigation confirms that large language models are powerful tools for summarization tasks, occasionally eclipsing human-authored summaries according to expert judgments. However, we find that LLMs struggle to match the richness of human perspectives in lay summarization, for instance. Our factorized modeling approach partially addresses these limitations and will hopefully inspire future work on context-aware summarization.
en
dc.identifier.uri
https://hdl.handle.net/1842/42308
dc.identifier.uri
http://dx.doi.org/10.7488/era/5028
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Fonseca, M. and Cohen, S. B. (2024). Can large language model summarizers adapt to diverse scientific communication goals? arXiv preprint arXiv:2401.10415.
en
dc.relation.hasversion
Fonseca, M., Ziser, Y., and Cohen, S. B. (2022). Factorizing content and budget decisions in abstractive summarization of long documents. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6341–6364, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
en
dc.relation.hasversion
Malik, M., Zhao, Z., Fonseca, M., Rao, S., and Cohen, S. B. (2024). CivilSum: A dataset for abstractive summarization of Indian court decisions. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2241–2250.
en
dc.relation.hasversion
Silva, N. F. F. d., Silva, M. C. R., Pereira, F. S. F., Tarrega, J. P. M., Beinotti, J. V. P., Fonseca, M., Andrade, F. E. d., and de Carvalho, A. C. P. d. L. F. (2021). Evaluating topic models in Portuguese political comments about bills from Brazil's Chamber of Deputies. In Intelligent Systems: 10th Brazilian Conference, BRACIS 2021, Virtual Event, November 29 – December 3, 2021, Proceedings, Part II, pages 104–120, Berlin, Heidelberg. Springer-Verlag.
en
dc.subject
abstractive summarization of long documents
en
dc.subject
textual data
en
dc.subject
large language models (LLMs)
en
dc.subject
contextual factors
en
dc.subject
target audience
en
dc.subject
task goals
en
dc.subject
document-summary pairs
en
dc.subject
extrinsic factors
en
dc.title
Controlling context factors in abstractive summarization of long documents
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
FonsecaM_2024.pdf
Size:
1.31 MB
Format:
Adobe Portable Document Format