Edinburgh Research Archive

Text categorization for intellectual property: comparing balanced Winnow with SVM on different document representations

Item Status

RESTRICTED ACCESS

Abstract

This study investigates the effect of training different categorization algorithms on various patent document representations. The automation of knowledge and content management in the intellectual property domain has attracted growing interest over the last decade [Cai and Hofmann, 2004, Fall et al., 2003, Koster et al., 2003, Krier and Zacca, 2002], since the first patent classification system was presented by Larkey in 1999 [Larkey, 1999]. Typical applications of patent classification systems are: (1) the automatic assignment of a new patent to the group of patent examiners concerned with the topic, (2) the search for prior art in fields similar to the incoming patent application, and (3) the reclassification of patent specifications. Machine learning techniques are applied to a collection of 1 270 185 patents to build a classifier that can handle documents with feature spaces of varying size. The two algorithms compared are Balanced Winnow and Support Vector Machines (SVMs). A previous study [Zhang, 2000] found that Winnow achieves accuracy similar to SVM but is much faster, as the execution time for Winnow is linear in the number of terms and the number of classes. This finding is tested on a feature space 100 times the size, using patent documents instead of newspaper articles. Results show that SVM outperforms Winnow considerably on all considered measures. Moreover, SVM is found to be a much more robust classifier than Winnow. The parameter tuning that was carried out for both algorithms confirms this result.
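For readers unfamiliar with Balanced Winnow, the following is a minimal sketch of a binary, mistake-driven variant, not the implementation used in the study. It assumes documents are represented as sets of terms (binary features); the class name `BalancedWinnow`, the promotion and demotion factors, the threshold, and the initial weights are all illustrative assumptions rather than values from the thesis. Each prediction or update touches only the terms present in a document, which is the source of the linear cost per term mentioned in the abstract; a multi-class setup would typically train one such classifier per class.

```python
from collections import defaultdict


class BalancedWinnow:
    """Binary Balanced Winnow over documents given as sets of terms.

    All default parameter values are illustrative assumptions.
    """

    def __init__(self, alpha=1.1, beta=0.9, theta=1.0,
                 init_pos=2.0, init_neg=1.0):
        self.alpha = alpha    # promotion factor, > 1
        self.beta = beta      # demotion factor, between 0 and 1
        self.theta = theta    # decision threshold
        # Two positive weights per term; the effective weight is their difference.
        self.w_pos = defaultdict(lambda: init_pos)
        self.w_neg = defaultdict(lambda: init_neg)

    def score(self, doc):
        # Sum of effective weights over the terms present in the document.
        return sum(self.w_pos[t] - self.w_neg[t] for t in doc)

    def predict(self, doc):
        return self.score(doc) > self.theta

    def update(self, doc, label):
        """Adjust weights only on a mistake; return True if a mistake occurred."""
        if self.predict(doc) == label:
            return False
        # Missed positive: promote w_pos and demote w_neg; false positive: the reverse.
        up, down = (self.alpha, self.beta) if label else (self.beta, self.alpha)
        for t in doc:
            self.w_pos[t] *= up
            self.w_neg[t] *= down
        return True

    def fit(self, docs, labels, epochs=10):
        for _ in range(epochs):
            mistakes = sum(self.update(d, y) for d, y in zip(docs, labels))
            if mistakes == 0:
                break
        return self


# Toy data (hypothetical): True marks a "mechanical engineering" document.
docs = [{"engine", "piston"}, {"engine", "valve"},
        {"protein", "gene"}, {"gene", "enzyme"}]
labels = [True, True, False, False]

model = BalancedWinnow().fit(docs, labels)
predictions = [model.predict(d) for d in docs]
```

On this toy data the weights for terms seen only in negative documents are pushed down until the training set is separated; real patent collections would need far more passes and tuned parameters.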
