A Data-Driven Methodology for Motivating a Set of Coherence Relations
The notion that a text is coherent in virtue of the 'relations' that hold between its component spans currently forms the basis for an active research programme in discourse linguistics. Coherence relations feature prominently in many theories of discourse structure, and have recently been used with considerable success in text generation systems. However, while the concept of coherence relations is now common currency for discourse theorists, there remains much confusion about them, and no standard set of relations has yet emerged. The aim of this thesis is to contribute towards the development of a standard set of relations. We begin from an explicitly empirical conception of relations: they are taken to model a collection of psychological mechanisms operative during the tasks of reading and writing. This conception is fleshed out with reference to psychological theories of skilled task performance, and to Rosch's notion of the basic level of categorisation. A methodology for investigating these mechanisms is then presented, which takes as its starting point a study of cue phrases, the sentence/clause connectives by which they are signalled. Although it is conventional to investigate psychological mechanisms by studying human behaviour, it is argued here that evidence for the constructs modelled by relations can be sought in an analysis of the linguistic resources available for marking them explicitly in text. The methodology is based on two simple linguistic tests: the test for cue phrases and the test for substitutability. Both tests are functional in inspiration: the former test identifies a heterogeneous class of phrases used for linking one portion of text to another; and the latter test is used to discover when a writer is willing to substitute one of these phrases for another. The tests are designed to capture the judgements of ordinary readers and writers, rather than the theoretical intuitions of specialised discourse analysts.
The test for cue phrases is used to analyse around 20 pages of naturally occurring text, from which a corpus of over 20 cue phrases is assembled. The substitutability test is then used to organise this corpus into a hierarchical taxonomy, representing the substitutability relationship between every pair of phrases. The taxonomy of cue phrases lends itself neatly to a model of relations as feature-based constructs. Many cue phrases can be interpreted as signalling just some features of relations, rather than whole relations. Small extracts from the taxonomy can be used systematically to determine the alternative values of single features; complex relation definitions can then be formed by combining the values of many features. The thesis delivers results on two levels. Firstly, it sets out a methodology for motivating a set of relation definitions, which rests on a systematic analysis of concrete linguistic data, and demands a minimum of theoretical assumptions. Also provided are the relation definitions which result from applying the methodology. The new definitions give an interesting picture of the variation that exists amongst cue phrases, and offer a number of innovative insights into text coherence.
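
The idea of recording a substitutability relationship for every pair of cue phrases can be illustrated with a minimal sketch. It is not the procedure used in the thesis: the context sets, the relationship labels, and the names `ACCEPTED`, `pair_relation` and `taxonomy` are all assumptions made for illustration, and the judgements are invented rather than drawn from the corpus. The sketch assumes that each phrase is associated with the set of contexts in which writers accepted it, and compares those sets pairwise.

```python
from itertools import combinations

# Hypothetical data: for each cue phrase, the IDs of the contexts in which
# writers accepted it. These judgements are invented for illustration only.
ACCEPTED = {
    "but": {1, 2, 3, 4},
    "whereas": {1, 2},
    "so": {5, 6},
    "therefore": {5, 6},
}


def pair_relation(a, b, accepted=ACCEPTED):
    """Classify how phrase `a` relates to phrase `b` (labels are assumed)."""
    contexts_a, contexts_b = accepted[a], accepted[b]
    if contexts_a == contexts_b:
        return "interchangeable"      # accepted in exactly the same contexts
    if contexts_a < contexts_b:
        return "more specific"        # a is accepted only where b is
    if contexts_a > contexts_b:
        return "more general"         # b is accepted only where a is
    if contexts_a & contexts_b:
        return "partly overlapping"   # some shared contexts, some not
    return "exclusive"                # no shared contexts


def taxonomy(accepted=ACCEPTED):
    """Record the relationship for every unordered pair of phrases."""
    return {
        (a, b): pair_relation(a, b, accepted)
        for a, b in combinations(sorted(accepted), 2)
    }


if __name__ == "__main__":
    for pair, relation in taxonomy().items():
        print(pair, relation)
```

In this sketch the "more specific" and "more general" labels give the pairwise table a hierarchical shape, which is one way to read the taxonomy described above; how the thesis itself defines and names the relationships is not stated in this abstract.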