Istituto di Scienza e Tecnologie dell'Informazione     
Lucchese C., Orlando S., Perego R. Mining top-K patterns from binary datasets in presence of noise. In: SDM10 - Tenth SIAM International Conference on Data Mining (Columbus, Ohio, US, April 29 - May 1 2010). Proceedings, pp. 165 - 176. SIAM, 2010.
The discovery of patterns in binary datasets has many applications, e.g. in electronic commerce, TCP/IP networking, Web usage logging, etc. Still, this is a very challenging task in many respects: overlapping vs. non-overlapping patterns, presence of noise, extraction of only the most important patterns. In this paper we formalize the problem of discovering the Top-K patterns from binary datasets in the presence of noise as the minimization of a novel cost function. According to the Minimum Description Length principle, the proposed cost function favors succinct pattern sets that may approximately describe the input data. We propose a greedy algorithm for the discovery of Patterns in Noisy Datasets, named PaNDa, and show that it outperforms related techniques on both synthetic and real-world data.
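As a rough illustration of the idea in the abstract, the sketch below scores a set of patterns with a description-length-style cost. The abstract does not state the cost function itself, so the form used here is an assumption: each pattern costs the number of rows plus the number of columns it names, and every cell where the patterns' combined cover disagrees with the data counts as one unit of noise. The names `Pattern`, `covered_cells`, and `description_cost` are hypothetical and are not taken from the paper.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical representation: a pattern is a (rows, columns) pair, and the
# binary dataset is the set of (row, column) cells that hold a 1.
Pattern = Tuple[FrozenSet[int], FrozenSet[int]]
Cell = Tuple[int, int]


def covered_cells(patterns: List[Pattern]) -> Set[Cell]:
    """Cells set to 1 by the union of all patterns (overlaps allowed)."""
    return {(r, c) for rows, cols in patterns for r in rows for c in cols}


def description_cost(data: Set[Cell], patterns: List[Pattern]) -> int:
    """Assumed cost: size of the pattern descriptions plus the noise count."""
    pattern_cost = sum(len(rows) + len(cols) for rows, cols in patterns)
    # Noise: cells that are 1 in the data but uncovered, or covered but 0.
    noise_cost = len(data ^ covered_cells(patterns))
    return pattern_cost + noise_cost


# Toy dataset: a 4x4 block of ones with one missing cell, plus one stray 1.
block = {(r, c) for r in range(4) for c in range(4)}
data = (block - {(3, 3)}) | {(9, 9)}
pattern: Pattern = (frozenset(range(4)), frozenset(range(4)))

cost_without = description_cost(data, [])        # every 1 counts as noise
cost_with = description_cost(data, [pattern])    # 8 for the pattern + 2 noise
```

Under this assumed cost, describing the block with a single approximate pattern is cheaper than listing every cell as noise, which is the kind of trade-off a greedy minimizer could exploit when deciding whether to add a pattern.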
URL: http://www.siam.org/proceedings/datamining/2010/dm10_015_lucchesec.pdf
Subject: Pattern mining
H.2.8 Database Management. Data mining


