Istituto di Scienza e Tecnologie dell'Informazione     
Geraci F., Pellegrini M., Pisati P., Sebastiani F. A scalable algorithm for high-quality clustering of Web snippets. In: SAC-06. 21st ACM Symposium on Applied Computing (Dijon, FR, April 23-2, 2006). Proceedings, pp. 1058 - 1062. ACM Press, 2006.
We consider the problem of partitioning, in a highly accurate \emph{and} highly efficient way, a set of $n$ documents lying in a metric space into $k$ non-overlapping clusters. We augment the well-known \emph{furthest-point-first} algorithm for $k$-center clustering in metric spaces with a filtering scheme based on the triangular inequality. We apply this algorithm to Web snippet clustering, comparing it against strong baselines consisting of recent, fast variants of the classical $k$-means iterative algorithm. Our main conclusion is that our method attains solutions of better or comparable accuracy, and does so within a fraction of the time required by the baselines. Our algorithm is thus valuable when, as in Web snippet clustering, either the real-time nature of the task or the large amount of data makes the poorly scalable, traditional clustering methods unsuitable.
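The abstract names the technique without detailing it, so the following is only a minimal sketch of furthest-point-first $k$-center clustering with one common form of triangle-inequality filter, not the authors' implementation. It assumes points are coordinate tuples compared with Euclidean distance; the function name `fpf_kcenter`, the choice of index 0 as the first center, and the `computed` counter of distance evaluations are all illustrative assumptions. The filter relies on the fact that if $d(c_{old}, c_{new}) \ge 2\,d(p, c_{old})$, then $d(p, c_{new}) \ge d(p, c_{old})$, so the distance from $p$ to the new center need not be computed.

```python
import math


def fpf_kcenter(points, k, dist=math.dist):
    """Sketch of furthest-point-first k-center with a triangle-inequality filter.

    Returns (centers, assign, computed): indices of the chosen centers, the
    position in `centers` of each point's nearest center, and the number of
    distance evaluations performed. All names here are illustrative.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(points)")

    centers = [0]                    # assumption: start from the first point
    assign = [0] * n                 # every point starts in cluster 0
    d_near = [dist(p, points[0]) for p in points]
    computed = n

    while len(centers) < k:
        # The next center is the point furthest from its current center.
        new = max(range(n), key=d_near.__getitem__)
        new_pos = len(centers)

        # Distances between the new center and every existing center.
        d_cc = [dist(points[new], points[c]) for c in centers]
        computed += len(centers)
        centers.append(new)

        for i in range(n):
            # Filter: d(c_old, c_new) >= 2 * d(p, c_old) implies, by the
            # triangle inequality, d(p, c_new) >= d(p, c_old); skip p.
            if d_cc[assign[i]] >= 2 * d_near[i]:
                continue
            d = dist(points[i], points[new])
            computed += 1
            if d < d_near[i]:
                d_near[i] = d
                assign[i] = new_pos

    return centers, assign, computed
```

For example, `fpf_kcenter([(0, 0), (1, 0), (10, 0), (11, 0)], 2)` picks the points at indices 0 and 3 as centers and skips the distance computation for the two points already close to the first center.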
URL: http://nmis.isti.cnr.it/sebastiani/Publications/SAC06.pdf
Subject: Meta Search Engines
Web Snippets
Metric Spaces
H.3.3 Information Search and Retrieval. Clustering



For further information, contact: Librarian http://puma.isti.cnr.it
