Information extraction from broadcast news
This paper discusses the development of trainable statistical models for extracting content from television and radio news broadcasts. In particular, we concentrate on statistical finite-state models for identifying proper names and other named entities in broadcast speech. Two models are presented: the first represents name class information as a word attribute; the second represents both word-word and class-class transitions explicitly. A common n-gram-based formulation is used for both models. Because named-entity identification is characterized by relatively sparse training data, we also discuss issues related to smoothing. Experiments are reported using the DARPA/NIST Hub-4E evaluation for North American broadcast news.
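
As an illustration of the idea behind the first model, the sketch below treats the name class as an attribute of each word, so that a bigram model runs over (word, class) pairs and a Viterbi search recovers the most likely class sequence. The toy sentences, the class labels, the function names, and the smoothing scheme (linear interpolation of the bigram estimate with an additively smoothed unigram, using arbitrary weights) are assumptions made for this example only; they are not the data, notation, or smoothing method of the paper.

```python
import math
from collections import Counter

# Toy training data, invented for illustration: (word, class) pairs per sentence.
TRAIN = [
    [("president", "O"), ("clinton", "PERSON"), ("visited", "O"), ("boston", "LOCATION")],
    [("clinton", "PERSON"), ("spoke", "O"), ("in", "O"), ("boston", "LOCATION")],
    [("reporters", "O"), ("visited", "O"), ("london", "LOCATION")],
]
START = ("<s>", "O")  # hypothetical sentence-start token


def train(sentences):
    """Count bigrams and unigrams over (word, class) tokens."""
    bigrams, unigrams, context = Counter(), Counter(), Counter()
    for sent in sentences:
        prev = START
        for tok in sent:
            bigrams[(prev, tok)] += 1
            context[prev] += 1
            unigrams[tok] += 1
            prev = tok
    return bigrams, unigrams, context


def make_prob(bigrams, unigrams, context, n_types, lam=0.7, alpha=0.1):
    """Return P(tok | prev), interpolating a bigram with a smoothed unigram.

    lam and alpha are arbitrary illustrative values, not tuned.
    """
    total = sum(unigrams.values())

    def prob(prev, tok):
        # Additive smoothing keeps unseen (word, class) pairs above zero.
        uni = (unigrams[tok] + alpha) / (total + alpha * n_types)
        # Maximum-likelihood bigram; zero when the context was never seen.
        bi = bigrams[(prev, tok)] / context[prev] if context[prev] else 0.0
        return lam * bi + (1 - lam) * uni

    return prob


def tag(words, prob, classes):
    """Viterbi search for the class sequence with the highest probability."""
    if not words:
        return []
    # best[c] = (log score, class path) of the best path ending in class c.
    best = {c: (math.log(prob(START, (words[0], c))), [c]) for c in classes}
    for prev_word, word in zip(words, words[1:]):
        best = {
            c: max(
                (score + math.log(prob((prev_word, pc), (word, c))), path + [c])
                for pc, (score, path) in best.items()
            )
            for c in classes
        }
    return max(best.values())[1]


CLASSES = sorted({c for sent in TRAIN for _, c in sent})
VOCAB = {w for sent in TRAIN for w, _ in sent}
prob = make_prob(*train(TRAIN), n_types=len(VOCAB) * len(CLASSES))
print(tag(["clinton", "visited", "london"], prob, CLASSES))
```

Even on this tiny corpus, most (word, class) bigrams are never observed, which is why the transition probability falls back on a smoothed lower-order estimate. With only three training sentences, the weights here are purely illustrative.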