Istituto di Scienza e Tecnologie dell'Informazione     
Lucchese C., Baraglia R., De Francisci Morales G. Document similarity self-join with MapReduce. In: ICDM 2010 - IEEE International Conference on Data Mining (Sydney, December 14-17 2010). Proceedings, pp. 731 - 736. IEEE, 2010.
Given a collection of objects, the Similarity Self-Join problem requires discovering all pairs of objects whose similarity is above a user-defined threshold. In this paper we focus on document collections, which are characterized by a sparseness that allows effective pruning strategies. Our contribution is a new parallel algorithm within the MapReduce framework. This work borrows from the state of the art in serial algorithms for similarity join and from MapReduce-based techniques for set-similarity join. The proposed algorithm shows that it is possible to leverage a distributed file system to support communication patterns that do not naturally fit the MapReduce framework. Scalability is achieved by introducing a partitioning strategy able to overcome memory bottlenecks. Experimental evidence on real-world data shows that our algorithm outperforms the state of the art by a factor of 4.5.
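As a rough illustration of the problem statement only (not the paper's algorithm), the following is a minimal single-machine sketch of a similarity self-join over sparse documents. It assumes cosine similarity over non-empty term-weight dictionaries and treats the threshold as inclusive; the function names, the sample documents, and the two-phase "map"/"reduce" split in the comments are illustrative assumptions. It omits the pruning, distributed file system communication, and partitioning strategies that the abstract mentions.

```python
from collections import defaultdict
from itertools import combinations
import math


def normalize(vec):
    """Scale a non-empty sparse term-weight vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()}


def similarity_self_join(docs, threshold):
    """Return {(id_a, id_b): cosine} for every pair with cosine >= threshold.

    docs maps a document id to a dict of term -> weight (hypothetical layout).
    """
    vectors = {doc_id: normalize(vec) for doc_id, vec in docs.items()}

    # "Map"-like step: build an inverted index, term -> [(doc_id, weight)].
    index = defaultdict(list)
    for doc_id, vec in vectors.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))

    # "Reduce"-like step: each posting list contributes partial dot products,
    # so documents that share no term are never compared (sparseness at work).
    scores = defaultdict(float)
    for postings in index.values():
        for (a, wa), (b, wb) in combinations(sorted(postings), 2):
            scores[(a, b)] += wa * wb

    return {pair: s for pair, s in scores.items() if s >= threshold}


if __name__ == "__main__":
    sample = {
        "d1": {"map": 1, "reduce": 1},
        "d2": {"map": 1, "reduce": 1, "join": 1},
        "d3": {"graph": 1},
    }
    print(similarity_self_join(sample, 0.8))
```

In this sketch only d1 and d2 share terms, so they are the only candidate pair; d3 is never compared against anything. The quadratic work inside long posting lists is one of the costs that pruning techniques for similarity joins aim to reduce.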
Subject: Data Mining
All Pair Similarity
Similarity Self-Join
Parallel Algorithms
H.2.8 Database Management. Data mining


For further information, contact: Librarian http://puma.isti.cnr.it
