From Distributional to Semantic Similarity
dc.contributor.advisor
Moens, Marc
en
dc.contributor.advisor
Finch, Steve
en
dc.contributor.author
Curran, James Richard
en
dc.date.accessioned
2004-07-01T15:31:38Z
dc.date.available
2004-07-01T15:31:38Z
dc.date.issued
2004-07
dc.description
Institute for Communicating and Collaborative Systems
en
dc.description.abstract
Lexical-semantic resources, including thesauri and WORDNET, have been successfully incorporated
into a wide range of applications in Natural Language Processing. However, they are
very difficult and expensive to create and maintain, and their usefulness has been severely
hampered by their limited coverage, bias and inconsistency. Automated and semi-automated
methods for developing such resources are therefore crucial for further resource development
and improved application performance.
Systems that extract thesauri often identify similar words using the distributional hypothesis
that similar words appear in similar contexts. This approach involves using corpora to examine
the contexts each word appears in and then calculating the similarity between context distributions.
Different definitions of context can be used, and I begin by examining how different
types of extracted context influence similarity.
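
As a rough illustration of the approach just described, the following minimal sketch collects context counts for each word and compares two context distributions. It is not the system described in the thesis: the window-based definition of context, the cosine measure, the toy corpus and the function names `context_counts` and `cosine` are all assumptions chosen for brevity.

```python
from collections import Counter
from math import sqrt


def context_counts(sentences, window=1):
    """Map each word to a Counter of the words seen within `window` positions.

    Assumption: a context is simply a neighbouring token; other definitions
    of context (e.g. grammatical relations) would replace this step.
    """
    contexts = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            neighbours = tokens[lo:i] + tokens[i + 1:hi]
            contexts.setdefault(word, Counter()).update(neighbours)
    return contexts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[c] * b[c] for c in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus, for illustration only.
corpus = [
    "she wore a red hat",
    "she wore a red scarf",
    "he ate a ripe apple",
]
contexts = context_counts(corpus)
hat_scarf = cosine(contexts["hat"], contexts["scarf"])
hat_apple = cosine(contexts["hat"], contexts["apple"])
```

In this toy corpus "hat" and "scarf" share their only context ("red"), so they score higher than "hat" and "apple", which share none.
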
To be of most benefit these systems must be capable of finding synonyms for rare words.
Reliable context counts for rare events can only be extracted from vast collections of text. In
this dissertation I describe how to extract contexts from a corpus of over 2 billion words. I
describe techniques for processing text on this scale and examine the trade-off between context
accuracy, information content and quantity of text analysed.
Distributional similarity is at best an approximation to semantic similarity. I develop improved
approximations motivated by the intuition that some events in the context distribution are more
indicative of meaning than others. For instance, the object-of-verb context wear is far more
indicative of a clothing noun than get. However, existing distributional techniques do not
effectively utilise this information. The new context-weighted similarity metric I propose in
this dissertation significantly outperforms every distributional similarity metric described in
the literature.
Nearest-neighbour similarity algorithms scale poorly with vocabulary and context vector size.
To overcome this problem I introduce a new context-weighted approximation algorithm with
bounded complexity in context vector size that significantly reduces the system runtime with
only a minor performance penalty. I also describe a parallelized version of the system that runs
on a Beowulf cluster for the 2 billion word experiments.
To evaluate the context-weighted similarity measure I compare ranked similarity lists against
gold-standard resources using precision and recall-based measures from Information Retrieval,
since the alternative, application-based evaluation, can often be influenced by distributional
as well as semantic similarity. I also perform a detailed analysis of the final results using
WORDNET.
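
To make this kind of evaluation concrete, here is a minimal sketch of scoring a ranked similarity list against a gold-standard synonym set. The exact measures used in the thesis are not given in this abstract, so precision at rank k and plain recall are assumed here; the function names and the example data are hypothetical.

```python
def precision_at_k(ranked, gold, k):
    """Fraction of the top-k ranked candidates that appear in the gold set."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for word in top if word in gold) / len(top)


def recall(ranked, gold):
    """Fraction of gold-standard entries found anywhere in the ranked list."""
    if not gold:
        return 0.0
    return len(set(ranked) & set(gold)) / len(gold)


# Hypothetical system output for the headword "hat", best candidate first.
ranked = ["cap", "scarf", "apple", "bonnet"]
# Hypothetical gold-standard synonyms for the same headword.
gold = {"cap", "bonnet", "beret"}

p_at_2 = precision_at_k(ranked, gold, 2)
overall_recall = recall(ranked, gold)
```

Precision at a small k rewards systems that place correct synonyms near the top of the list, while recall measures how much of the gold standard is recovered at all.
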
Finally, I apply my similarity metric to the task of assigning words to WORDNET semantic
categories. I demonstrate that this new approach outperforms existing methods and overcomes
some of their weaknesses.
en
dc.format.extent
958140 bytes
en
dc.format.mimetype
application/pdf
en
dc.identifier.uri
http://hdl.handle.net/1842/563
dc.language.iso
en
dc.publisher
University of Edinburgh. College of Science and Engineering. School of Informatics.
en
dc.subject.other
Natural Language Processing
en
dc.subject.other
thesauri
en
dc.title
From Distributional to Semantic Similarity
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Name: IP030023.pdf
Size: 935.68 KB
Format: Adobe Portable Document Format