Istituto di Scienza e Tecnologie dell'Informazione     
Bacarella V., Giannotti F., Nanni M., Pedreschi D. Discovery of ads web hosts through traffic data analysis. In: Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in (Paris, France, June 13, 2004). Proceedings, pp. 76-81. Gautam Das, Bing Liu, Philip S. Yu (eds.). ACM Press, 2004.
One of the most pressing problems in web crawling (the most expensive task of any search engine in terms of time and bandwidth consumption) is the detection of useless segments of the Internet. In some cases such segments are purposely created to deceive the crawling engine, while in others they simply do not contain any useful information. Currently, the typical approach to the problem is to use a human-compiled blacklist of sites to avoid (e.g., advertising sites and web counters), but, given the highly dynamic nature of the Internet, keeping such lists up to date manually is infeasible. In this work we present a solution based on web usage statistics, aimed at automatically, and therefore dynamically, building blacklists of sites that the users of a monitored web community consider (or appear to consider) useless or uninteresting. Our method performs a linear-time analysis of the traffic information, which yields an abstraction of the linked web that can be incrementally updated, thereby allowing a streaming computation. The crawler can use the list produced in this way to prune such sites, or to give them a low priority, before the (re-)spidering activity starts and, therefore, without analysing the content of crawled documents.
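The abstract does not state which traffic statistics are computed, so the following is only a rough sketch of what a linear-time, incrementally updatable host-level summary could look like. Everything in it is an assumption made for illustration: the `HostTrafficSummary` class, the use of (referrer, requested URL) pairs as input, the `min_incoming` and `max_out_ratio` thresholds, and the heuristic itself (flag hosts that are often reached from other hosts but almost never lead anywhere else). It should not be read as the authors' actual method.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class HostTrafficSummary:
    """Hypothetical host-level summary built from (referrer, url) log records."""

    def __init__(self):
        self.incoming = defaultdict(int)  # requests arriving from another host
        self.outgoing = defaultdict(int)  # requests leaving towards another host

    def update(self, referrer_url, requested_url):
        """Process one record in O(1); a full pass is linear in the log size."""
        src = urlsplit(referrer_url).hostname
        dst = urlsplit(requested_url).hostname
        if not src or not dst or src == dst:
            return  # skip missing referrers and intra-host navigation
        self.outgoing[src] += 1
        self.incoming[dst] += 1

    def blacklist(self, min_incoming=3, max_out_ratio=0.1):
        """Assumed heuristic: hosts often reached but rarely navigated away from."""
        flagged = []
        for host, n_in in self.incoming.items():
            n_out = self.outgoing.get(host, 0)
            if n_in >= min_incoming and n_out / n_in <= max_out_ratio:
                flagged.append(host)
        return sorted(flagged)


# Toy records with made-up hosts; real input would come from proxy or server logs.
summary = HostTrafficSummary()
records = [
    ("http://news.example/a", "http://ads.example/banner1"),
    ("http://blog.example/p", "http://ads.example/banner2"),
    ("http://shop.example/x", "http://ads.example/banner3"),
    ("http://news.example/a", "http://blog.example/p"),
    ("http://blog.example/p", "http://shop.example/x"),
]
for ref, url in records:
    summary.update(ref, url)
print(summary.blacklist())
```

Because each record only increments two counters, the summary can be kept up to date as new traffic arrives, and the blacklist can be recomputed before a crawl without fetching any page content.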
URL: http://doi.acm.org/10.1145/1008707
Subject Data Mining
H.2.8 Data mining



For further information, contact: Librarian http://puma.isti.cnr.it
