Towards Formal Structural Representation of Spoken Language: An Evolving Transformation System (ETS) Approach
Speech recognition has been a very active area of research over the past twenty years. Despite evident progress, practitioners in the field generally agree that the performance of current speech recognition systems is rather suboptimal and that new approaches are needed. The motivation behind this research is the observation that the notion of representation of objects and concepts, once considered central in the early days of pattern recognition, has been largely marginalised by the advent of statistical approaches. As a consequence of the predominantly statistical approach to the speech recognition problem, and because of the numeric, feature-vector-based nature of the representation, the classes inductively discovered from real data using decision-theoretic techniques have little meaning outside the statistical framework. This is because decision surfaces or probability distributions are difficult to analyse linguistically. Because of this latter limitation, it is doubtful that the gap between speech recognition and linguistic research can be bridged by numeric representations. This thesis investigates an alternative, structural approach to spoken language representation and categorisation. The approach is based on a consistent program, known as the Evolving Transformation System (ETS), motivated by the development and clarification of the concept of structural representation in pattern recognition and artificial intelligence from both theoretical and applied points of view. This thesis consists of two parts. In the first part, a similarity-based approach to the structural representation of speech is presented. First, a linguistically well-motivated structural representation of phones, based on distinctive phonological features recovered from speech, is proposed. The representation consists of string templates representing phones together with a similarity measure. 
The set of phonological templates together with a similarity measure defines a symbolic metric space. Representation and ETS-inspired categorisation in the symbolic metric spaces corresponding to the phonological structural representation are then investigated by constructing appropriate symbolic space classifiers and evaluating them on a standard corpus of read speech. In addition, a similarity-based isometric transition from phonological symbolic metric spaces to the corresponding non-Euclidean vector spaces is investigated. The second part of this thesis deals with the formal approach to the structural representation of spoken language. Unlike the approach adopted in the first part, the representation developed in the second part is based on the mathematical language of the ETS formalism. This formalism has been specifically developed for structural modelling of dynamic processes. In particular, it allows the representation of both objects and classes in a uniform event-based hierarchical framework. In this thesis, the latter property of the formalism allows the adoption of a more physiologically concrete approach to structural representation. The proposed representation is based on gestural structures and encapsulates speech processes at the articulatory level. Algorithms for deriving the articulatory structures from the data are presented and evaluated.
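
As an illustration of the symbolic metric space described for the first part, the following is a minimal sketch and not the representation or measure used in the thesis. It assumes that each phone template is a short string over a made-up alphabet of feature symbols, and it substitutes unit-cost Levenshtein distance for the similarity measure, which is not specified here. The names `edit_distance`, `TEMPLATES`, and `classify` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance between two symbol strings.

    Assumption: this stands in for the similarity measure; with unit
    costs it is a true metric on strings.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i symbols of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical templates: each letter is an invented feature symbol
# (S = stop, L = labial, A = alveolar, U = unvoiced, V = voiced).
TEMPLATES = {
    "p": "SLU",
    "b": "SLV",
    "t": "SAU",
    "d": "SAV",
}


def classify(observed: str) -> str:
    """Nearest-template classifier in the symbolic metric space."""
    return min(sorted(TEMPLATES),
               key=lambda phone: edit_distance(observed, TEMPLATES[phone]))
```

Under these assumptions, a feature string recovered from speech is assigned to the phone whose template is closest to it, so a noisy string such as `"SAVV"` is still mapped to the template `"SAV"`.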