Methods for natural language processing
Authors
Voita, Lena
Abstract
In recent years, the field of natural language processing has been transformed by the rise of deep learning: in most cases, traditional feature-engineered systems have been replaced with end-to-end neural networks. Despite remarkable improvements in quality, neural models have a major drawback compared to traditional ones: a lack of interpretability. While the features in traditional models can usually be understood by humans (e.g., syntactic relations or morphological categories), the decision-making process of neural networks remains hidden, which poses challenges for goals such as trust, reliability and safety. This thesis focuses on ways in which the predictions and behaviour of NLP models can be understood. We first discuss in depth the machine translation task, which we chose because, in the traditional paradigm, it required solving many other tasks of the classical NLP pipeline. We then abstract away from the underlying task and discuss an information-theoretic point of view on the analysis of NLP models.
We first focus on methods to analyse the behaviour of neural machine translation (NMT) models, keeping in mind the more interpretable statistical machine translation (SMT) paradigm.
We first show that NMT model components can take on roles mirroring SMT components or features, e.g. tracking syntactic relations or resolving anaphora.
Second, we analyse how, during the generation process, NMT models balance information from the source sentence against previously generated tokens in the target sentence (which in SMT was modelled with distinct components). We develop a method to explicitly evaluate the proportions of source and target influence on generated tokens and show, among other things, that the NMT training process is non-monotonic, with several stages of a different nature. Taking a closer look at these stages, we find that during training an NMT model focuses on learning different competences mirroring SMT components (e.g., the target-side language model, translation, and alignment).
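As a toy illustration of what "proportion of source and target influence" means (a minimal sketch, not the thesis's actual relevance-propagation method, and with all names hypothetical), consider a purely linear decoder step: there the logit for the next token decomposes exactly into a part contributed by the source context and a part contributed by the target-prefix context, so the influence shares can be read off directly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear decoder step: the next-token logit is a linear
# function of a source context vector and a target-prefix context vector.
d = 16
W_src = rng.normal(size=d)   # weights attached to the source context
W_tgt = rng.normal(size=d)   # weights attached to the target-prefix context
src_ctx = rng.normal(size=d)
tgt_ctx = rng.normal(size=d)

logit = W_src @ src_ctx + W_tgt @ tgt_ctx

# In the linear case the logit splits exactly into two relevance terms.
r_src = W_src @ src_ctx      # relevance flowing from the source
r_tgt = W_tgt @ tgt_ctx      # relevance flowing from the target prefix

# Normalised influence proportions (using absolute relevance).
total = abs(r_src) + abs(r_tgt)
src_share = abs(r_src) / total
tgt_share = abs(r_tgt) / total
```

For real, non-linear NMT models the decomposition is no longer exact, which is why an attribution method is needed to propagate relevance through the network; the linear case above only fixes the intuition for what the resulting shares measure.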
We then go further and focus on model-agnostic methods, abstracting away from the target task and concentrating on the information encoded in the learned vector representations. Specifically, we discuss model analysis from an information-theoretic perspective. First, we illustrate the relation between the bottom-up encoding process in a network and Information Bottleneck theory: in the flow of information from bottom to top layers, the kinds of information that get lost or acquired by token representations are determined by the training objective. We explain many observations made in previous work and propose an explanation for the superior performance of the masked language modelling (MLM) objective over the standard language modelling (LM) one for pretraining. Finally, we turn to one of the most popular analysis methods in NLP: probing for linguistic structure. To measure how well pretrained representations encode a linguistic property, we propose a method which, in contrast to previous work, considers not only whether the information can be extracted but also the difficulty of this extraction. The results of this method, Minimum Description Length (MDL) probing, are more informative and stable than those of standard probes.
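The "difficulty of extraction" idea can be sketched with online coding: labels are transmitted in blocks, each encoded with a probe trained only on the previously sent data, and the total codelength in bits is the probing score (easy-to-extract properties yield short codes). The sketch below is a minimal toy version under assumed details (a small logistic-regression probe, arbitrary block fractions, synthetic data), not the thesis's exact experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_probe(X, y, n_classes, epochs=200, lr=0.5):
    """Fit a multinomial logistic-regression probe by gradient descent."""
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = p.copy()
        grad[np.arange(len(y)), y] -= 1.0     # dL/dlogits for cross-entropy
        W -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean(axis=0)
    return W, b

def codelength_bits(W, b, X, y):
    """Bits needed to transmit labels y under the probe's predictions."""
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].sum() / np.log(2)

def online_codelength(X, y, n_classes, fractions=(0.1, 0.2, 0.4, 0.8, 1.0)):
    """Online (prequential) codelength: each block is coded by a probe
    trained only on the data sent before it."""
    n = len(y)
    cuts = [max(1, int(f * n)) for f in fractions]
    total = cuts[0] * np.log2(n_classes)      # first block: uniform code
    for start, end in zip(cuts[:-1], cuts[1:]):
        W, b = train_probe(X[:start], y[:start], n_classes)
        total += codelength_bits(W, b, X[start:end], y[start:end])
    return total

# Toy "representations": labels linearly recoverable -> short code.
X = rng.normal(size=(400, 8))
y = (X @ rng.normal(size=8) > 0).astype(int)
good = online_codelength(X, y, n_classes=2)

# Random labels: nothing to extract -> codelength near the uniform baseline.
y_rand = rng.integers(0, 2, size=400)
bad = online_codelength(X, y_rand, n_classes=2)
```

Here `good` comes out far below `bad`, illustrating why codelength is a more graded signal than probe accuracy alone: it reflects how quickly the probe learns the property from limited data, not just whether it eventually can.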