Istituto di Scienza e Tecnologie dell'Informazione     
Lavelli A., Sebastiani F., Zanoli R. Distributional Term Representations: An Experimental Comparison. In: CIKM-04, ACM International Conference on Information and Knowledge Management (Washington, US, November 8-13, 2004). Proceedings, pp. 615-624. Evans, David A. and Gravano, Luis and Herzog, Otthein and Zhai, ChengXiang and Ronthaler, Marc (eds.). ACM Press, 2004.
A number of content management tasks, including term categorization, term clustering, and automated thesaurus generation, view natural language terms (e.g. words, noun phrases) as first-class objects, i.e. as objects endowed with an internal representation which makes them suitable for explicit manipulation by the corresponding algorithms. The information retrieval (IR) literature has traditionally used an extensional (aka distributional) representation for terms, according to which a term is represented by the 'bag of documents' in which the term occurs. The computational linguistics (CL) literature has independently developed an alternative distributional representation for terms, according to which a term is represented by the 'bag of terms' that co-occur with it in some document. This paper aims at discovering which of the two representations is more effective, i.e. brings about higher effectiveness once used in tasks that require terms to be explicitly represented and manipulated. We carry out experiments on (i) a term categorization task, and (ii) a term clustering task; this allows us to compare the two representations under closely controlled experimental conditions. We report the results of experiments in which we categorize/cluster under 42 different classes the terms extracted from a corpus of more than 65,000 documents. Our results show a substantial difference in effectiveness between the two representation styles; we give both an intuitive explanation and an information-theoretic justification for these different behaviours.
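The two representations described in the abstract can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names `bag_of_documents` and `bag_of_terms`, the toy corpus, and the use of raw counts as weights are all hypothetical choices made here, and the paper's actual weighting scheme is not given in this record.

```python
from collections import Counter, defaultdict


def bag_of_documents(docs):
    """Map each term to a Counter {doc_id: occurrences of the term in that document}."""
    rep = defaultdict(Counter)
    for doc_id, tokens in enumerate(docs):
        for tok in tokens:
            rep[tok][doc_id] += 1
    return dict(rep)


def bag_of_terms(docs):
    """Map each term to a Counter {other_term: number of documents where both occur}."""
    rep = defaultdict(Counter)
    for tokens in docs:
        vocab = set(tokens)  # count each co-occurring pair once per document
        for t in vocab:
            for u in vocab:
                if u != t:
                    rep[t][u] += 1
    return dict(rep)


# Toy corpus (hypothetical): each document is a list of already-tokenized terms.
docs = [
    ["bank", "loan", "interest"],
    ["bank", "river", "water"],
    ["loan", "interest", "rate"],
]

bod = bag_of_documents(docs)  # IR-style: term described by the documents it occurs in
bot = bag_of_terms(docs)      # CL-style: term described by the terms it co-occurs with
```

In this sketch both representations are sparse vectors keyed by document identifier or by term, so the same similarity measure or learner could in principle be applied to either, which is what makes a controlled comparison of the two possible.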
URL: http://www.isti.cnr.it/People/F.Sebastiani/Publications/CIKM04.pdf
Subject: Term representations
Distributional hypothesis
Extensional representations
I.5.3 Clustering
I.2.7 Natural Language Processing
I.5.2 Classifier design and evaluation
H.3.3 Information search and retrieval



