PUMA
Istituto di Scienza e Tecnologie dell'Informazione     
Avancini H., Lavelli A., Sebastiani F., Zanoli R. Automatic Expansion of Domain-Specific Lexicons by Term Categorization. Technical report, 2004. Submitted to the journal ACM Transactions on Speech and Language Technology.
 
 
Abstract
(English)
We discuss an approach to the automatic expansion of domain-specific lexicons by means of term categorization, a novel task employing techniques from information retrieval and machine learning. Specifically, we view the expansion of such lexicons as a process of learning previously unknown associations between terms and domains (i.e., disciplines, or fields of activity). The process generates, for each c_i in a set C = {c_1, ..., c_m} of domains, a lexicon L^i_1, bootstrapping from an initial lexicon L^i_0 and a set of documents T given as input. The method is inspired by text categorization, the discipline concerned with labeling natural language texts with labels from a predefined set of domains, or categories. However, while text categorization deals with documents represented as vectors in a space of terms, we formulate the task of term categorization as one in which terms are (dually) represented as vectors in a space of documents, and in which terms (instead of documents) are labeled with domains. As a learning device we adopt a boosting-based method, since boosting (a) has demonstrated state-of-the-art effectiveness in a variety of text categorization applications, and (b) naturally allows for a form of 'data cleaning', thereby making the process of generating a lexicon an iteration of generate-and-test steps. We present the results of a number of experiments using a set of domain-specific lexicons called WordNetDomains (which actually consists of an extension of WordNet), and performed using the documents in the Reuters Corpus Volume 1 as 'implicit' representations for our terms.
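The dual representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `term_vectors`, the three toy documents, whitespace tokenization, and raw term-frequency weighting are all assumptions made for illustration, and the boosting-based learner is not shown. Each term is mapped to a vector with one entry per document, which is the transpose of the usual document-by-term view.

```python
from collections import Counter


def term_vectors(documents):
    """Represent each term as a vector over documents.

    This is the dual of the document-by-term representation used in
    text categorization. Raw term frequency is an assumed weighting.
    """
    # Count term occurrences in each document (naive whitespace tokenization).
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # Entry j of a term's vector is that term's frequency in document j.
    return {term: [c[term] for c in counts] for term in vocabulary}


# Hypothetical toy documents standing in for the document set T.
docs = [
    "bank raises interest rate",
    "team wins match",
    "bank cuts rate",
]
vectors = term_vectors(docs)
print(vectors["bank"])   # [1, 0, 1]
print(vectors["match"])  # [0, 1, 0]
```

In this view, terms that occur in the same documents (here "bank" and "rate") receive identical or similar vectors, which is what allows a classifier trained on the terms of an initial lexicon to label previously unseen terms with a domain.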
Subject
Lexicons
Text Classification
Machine Learning
I.5.2 Classifier design and evaluation
I.2.7 Natural language processing
H.3.1 Content Analysis and Indexing





 


For further information, contact: Librarian http://puma.isti.cnr.it
