Geraci F., Pellegrini M., Pisati P., Sebastiani F. A scalable algorithm for high-quality clustering of Web snippets. In: SAC-06. 21st ACM Symposium on Applied Computing (Dijon, FR, April 23-2, 2006). Proceedings, pp. 1058-1062. ACM Press, 2006.

Abstract (English)
We consider the problem of partitioning, in a highly accurate and highly efficient way, a set of $n$ documents lying in a metric space into $k$ non-overlapping clusters. We augment the well-known furthest-point-first algorithm for $k$-center clustering in metric spaces with a filtering scheme based on the triangular inequality. We apply this algorithm to Web snippet clustering, comparing it against strong baselines consisting of recent, fast variants of the classical $k$-means iterative algorithm. Our main conclusion is that our method attains solutions of better or comparable accuracy, and does so within a fraction of the time required by the baselines. Our algorithm is thus valuable when, as in Web snippet clustering, either the real-time nature of the task or the large amount of data makes the poorly scalable, traditional clustering methods unsuitable.
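
The abstract names the technique but does not spell out the filtering rule, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the common triangle-inequality filter: when a new center is chosen, a point p at distance d(p) from its current center c can be skipped whenever d(c, new) >= 2 d(p), since then d(p, new) >= d(c, new) - d(p) >= d(p). The names `fpf_cluster`, `first`, and `skipped` are hypothetical, and Euclidean distance is used only as a stand-in for whatever metric is applied to the snippets.

```python
import math


def fpf_cluster(points, k, dist=math.dist, first=0):
    """Furthest-point-first k-center clustering with a triangle-inequality filter.

    Returns (centers, assign, skipped): indices of the chosen centers, the
    position in `centers` of each point's nearest center, and the number of
    distance computations avoided by the filter.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(points)")

    centers = [first]
    assign = [0] * n
    # d[i] is the distance from point i to its current nearest center.
    d = [dist(points[first], p) for p in points]
    skipped = 0

    while len(centers) < k:
        # The next center is the point furthest from its nearest center.
        new = max(range(n), key=d.__getitem__)
        # Distances from the new center to every existing center.
        cc = [dist(points[new], points[c]) for c in centers]
        centers.append(new)
        j = len(centers) - 1

        for i in range(n):
            # Filter: if the new center is at least twice as far from point
            # i's current center as point i is, it cannot be closer to i.
            if cc[assign[i]] >= 2 * d[i]:
                skipped += 1
                continue
            di = dist(points[new], points[i])
            if di < d[i]:
                d[i] = di
                assign[i] = j

    return centers, assign, skipped
```

Calling `fpf_cluster(points, k)` on a list of coordinate tuples returns the center indices and a cluster label per point; the `skipped` counter gives a rough sense of how many distance computations the filter saves on a given input.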

URL: http://nmis.isti.cnr.it/sebastiani/Publications/SAC06.pdf

Subject: Meta Search Engines Web Snippets Clustering Metric Spaces; H.3.3 Information Search and Retrieval. Clustering
