PUMA
Istituto di Scienza e Tecnologie dell'Informazione     
Orlando S., Perego R., Silvestri F. Assigning document identifiers to enhance compressibility of Web Search Engines indexes. In: Proceedings of the 2004 ACM symposium on Applied computing (Nicosia, Cyprus, 2004). Proceedings, pp. 600 - 605. ACM Press, 2004.
 
 
Abstract
(English)
Efficient access to the index is a key issue for the performance of Web Search Engines (WSEs). To improve memory utilization and speed up query resolution, WSEs use Inverted File (IF) indexes in which posting lists are stored as sequences of d_gaps (i.e., differences between successive document identifiers) compressed with variable-length encoding methods. This paper describes a lightweight clustering algorithm that assigns identifiers to documents so as to minimize the average value of the d_gaps. Simulations on a real dataset, the Google contest collection, show that our approach yields an IF index that is, depending on the d_gap encoding chosen, up to 23% smaller than one built over randomly assigned document identifiers. Moreover, we show, both analytically and empirically, that the complexity of our algorithm is linear in space and time.
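
To illustrate why smaller d_gaps compress better, here is a minimal Python sketch. It is not the paper's clustering algorithm. It only shows d_gap computation followed by a variable-byte code, which is assumed here as one example of a variable-length encoding; the abstract does not say which encodings were evaluated. The function names and the two sample posting lists are hypothetical.

```python
def d_gaps(doc_ids):
    """Return the first identifier followed by differences between successive ones.

    Assumes doc_ids is sorted in increasing order.
    """
    gaps = []
    prev = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def vbyte_encode(n):
    """Variable-byte code (assumed scheme): 7 payload bits per byte,
    low-order bits first, high bit set on the final byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)


def encoded_size(doc_ids):
    """Total bytes needed to store a posting list as variable-byte d_gaps."""
    return sum(len(vbyte_encode(g)) for g in d_gaps(doc_ids))


# Hypothetical posting lists for the same term under two identifier assignments.
scattered = [3, 1000, 40000, 900000]   # identifiers spread over the collection
clustered = [10, 11, 12, 13]           # similar documents given adjacent identifiers

print(d_gaps(scattered), encoded_size(scattered))
print(d_gaps(clustered), encoded_size(clustered))
```

With these sample lists the scattered assignment needs more bytes than the clustered one, which is the effect that an identifier assignment minimizing average d_gaps aims to exploit.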
URL: http://portal.acm.org/citation.cfm?doid=968024
Subject: Compression
Information retrieval
Clustering
H.2.8 Data mining
H.3.3 Information Search and Retrieval





 


For further information, contact: Librarian http://puma.isti.cnr.it
