Real-time event detection in massive streams

Petrovic, Sasa

dc.contributor.advisor

Osborne, Miles

en

dc.contributor.author

Petrovic, Sasa

en

dc.contributor.sponsor

Engineering and Physical Sciences Research Council (EPSRC)

en

dc.date.accessioned

2013-07-30T13:19:07Z

dc.date.available

2013-07-30T13:19:07Z

dc.date.issued

2013-07-02

dc.description

Grant award number EP/J020664/1

en

dc.description.abstract

New event detection, also known as first story detection (FSD), has become very popular in recent years. The task consists of finding previously unseen events in a stream of documents. Despite its apparent simplicity, FSD is very challenging and has applications wherever timely access to fresh information is crucial: from journalism to stock market trading, homeland security, or emergency response. With the rise of user-generated content and citizen journalism we have entered an era of big and noisy data, yet traditional approaches for solving FSD are not designed to deal with this new type of data. The amount of information that is being generated today exceeds previously available datasets by many orders of magnitude, making traditional approaches obsolete for modern event detection. In this thesis, we propose a modern approach to event detection that scales to unbounded streams of text, without sacrificing accuracy. This is a crucial property that enables us to detect events from large streams like Twitter, which none of the previous approaches were able to do. One of the major problems in detecting new events is vocabulary mismatch, also known as lexical variation. This problem is characterized by different authors using different words to describe the same event, and it is inherent to human language. We show how to mitigate this problem in FSD by using paraphrases. Our paraphrase-based approach achieves state-of-the-art results on the FSD task, while still maintaining efficiency and being able to process unbounded streams. Another important property of user-generated content is the high level of noise, and Twitter is no exception. This is another problem that traditional approaches were not designed to deal with, and here we investigate different methods of reducing the amount of noise. We show that by using information from Wikipedia, it is possible to significantly reduce the amount of spurious events detected in Twitter, while maintaining a very small latency in detection. A question is often raised as to whether Twitter is at all useful, especially if one has access to a high-quality stream such as the newswire, or if it should be considered as a sort of poor man’s newswire. In our comparison of these two streams we find that Twitter contains events not present in the newswire, and that it also breaks some events sooner, showing that it is useful for event detection, even in the presence of newswire.

en

dc.identifier.uri

http://hdl.handle.net/1842/7612

dc.language.iso

en

dc.publisher

The University of Edinburgh

en

dc.relation.hasversion

Petrović, S., Osborne, M., and Lavrenko, V. (2012). Using paraphrases for improving first story detection in news and Twitter. In Proceedings of Human Language Technologies: Conference of the North American Chapter of the Association for Computational Linguistics, pages 338–346. Association for Computational Linguistics.

en

dc.relation.hasversion

Petrović, S., Osborne, M., and Lavrenko, V. (2010). Streaming first story detection with application to Twitter. In Proceedings of the 11th annual conference of the North American Chapter of the Association for Computational Linguistics, pages 181–189. Association for Computational Linguistics.

en

dc.subject

event detection

en

dc.subject

first story detection

en

dc.subject

FSD

en

dc.subject

social media

en

dc.title

Real-time event detection in massive streams

en

dc.type

Thesis or Dissertation

en

dc.type.qualificationlevel

Doctoral

en

dc.type.qualificationname

PhD Doctor of Philosophy

en

Files

Original bundle

Name: Petrovic2013.pdf
Size: 861.5 KB
Format: Adobe Portable Document Format

Name: images.zip
Size: 234.49 KB
Format: Adobe Portable Document Format

Name: TEX files.zip
Size: 120.03 KB
Format: Plain Text

This item appears in the following Collection(s)

Informatics thesis and dissertation collection