Topic-based mixture language modelling.
dc.contributor.author: Gotoh, Yoshihiko
dc.contributor.author: Renals, Steve
dc.date.accessioned: 2006-05-26T12:34:48Z
dc.date.available: 2006-05-26T12:34:48Z
dc.date.issued: 1999
dc.description.abstract: This paper describes an approach for constructing a mixture of language models based on simple statistical notions of semantics, using probabilistic models developed for information retrieval. The approach encapsulates corpus-derived semantic information and is able to model varying styles of text. Using such information, the corpus texts are clustered in an unsupervised manner and a mixture of topic-specific language models is automatically created. The principal contribution of this work is to characterise the document space resulting from information retrieval techniques and to demonstrate the approach for mixture language modelling. A comparison is made between manual and automatic clustering in order to elucidate how the global content information is expressed in the space. We also compare (in terms of association with manual clustering and language modelling accuracy) alternative term-weighting schemes and the effect of singular value decomposition dimension reduction (latent semantic analysis). Test set perplexity results using the British National Corpus indicate that the approach can improve the potential of statistical language modelling. Using an adaptive procedure, the conventional model may be tuned to track text data with a slight increase in computational cost.
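The pipeline the abstract outlines — term-weight the documents, cluster them without supervision, train one language model per topic cluster, then combine the topic models as a weighted mixture scored by test-set perplexity — can be sketched as follows. This is a minimal illustration on a hypothetical toy corpus, not the paper's method: the clustering here is a greedy cosine assignment to two seed documents rather than the paper's procedure, the topic models are smoothed unigrams rather than full n-gram models, and the SVD/LSA dimension-reduction step is omitted.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the clustered corpus texts:
# two finance-flavoured and two sports-flavoured token lists.
docs = [
    "the market share price rose as the bank reported profit".split(),
    "the bank cut interest rates and the market price fell".split(),
    "the team won the match and the player scored a goal".split(),
    "the player joined the team before the final match".split(),
]

def tfidf(documents):
    """tf-idf term weighting: one sparse weight vector per document."""
    df = Counter(w for d in documents for w in set(d))
    n = len(documents)
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(d).items()}
            for d in documents]

def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Unsupervised two-way clustering: greedy cosine assignment to two seed
# documents (a crude stand-in for the paper's clustering; the SVD/LSA
# dimension reduction discussed in the abstract is omitted here).
vecs = tfidf(docs)
seeds = [0, 2]
clusters = [[], []]
for i, d in enumerate(docs):
    k = max((0, 1), key=lambda c: cosine(vecs[i], vecs[seeds[c]]))
    clusters[k].append(d)

vocab = {w for d in docs for w in d}

def unigram_lm(cluster_docs, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model for one topic cluster."""
    counts = Counter(w for d in cluster_docs for w in d)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

lms = [unigram_lm(c, vocab) for c in clusters]

def mixture_perplexity(tokens, lms, weights):
    """Perplexity of a token stream under a weighted mixture of topic LMs."""
    logp = sum(math.log(sum(w * lm[t] for w, lm in zip(weights, lms)))
               for t in tokens)
    return math.exp(-logp / len(tokens))

test_tokens = "the bank cut the price".split()
ppl = mixture_perplexity(test_tokens, lms, [0.5, 0.5])
```

On this toy data, shifting mixture weight toward the topic model that matches the test text lowers perplexity, which is the effect the mixture exploits; the adaptive procedure mentioned in the abstract corresponds, in this sketch, to re-estimating the mixture weights (e.g. by EM) as text is processed.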
dc.format.extent: 115186 bytes
dc.format.extent: 327290 bytes
dc.format.mimetype: application/octet-stream
dc.format.mimetype: application/pdf
dc.identifier.citation: Natural Language Engineering, 5:355-375, 1999.
dc.identifier.uri: http://hdl.handle.net/1842/1189
dc.language.iso: en
dc.publisher: Cambridge University Press
dc.title: Topic-based mixture language modelling.
dc.type: Article