PUMA
Istituto di Scienza e Tecnologie dell'Informazione     
Berardi G., Esuli A., Sebastiani F. Utility-theoretic ranking for semiautomated text classification. In: ACM Transactions on Knowledge Discovery from Data, vol. 10 (1) article n. 6. ACM, 2015.
 
 
Abstract
(English)
Semiautomated Text Classification (SATC) may be defined as the task of ranking a set D of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of D with the goal of increasing the overall labelling accuracy of D, the expected increase is maximized. An obvious SATC strategy is to rank D so that the documents that the classifier has labelled with the lowest confidence are top ranked. In this work, we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would derive from validating a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method mentioned earlier, and according to the proposed measure, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error.
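The contrast between the two ranking strategies described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Doc` record, the function names, the use of F1 as the effectiveness measure, and the estimated contingency counts `tp`, `fp`, `fn` are all assumptions made for illustration. The baseline sorts documents by ascending classifier confidence; the utility-style variant weights the probability that a label is wrong by the gain in effectiveness that correcting it would bring.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    # Hypothetical record for one automatically labelled document.
    doc_id: str
    predicted_positive: bool
    confidence: float  # assumed probability that the assigned label is correct


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from contingency counts (assumed effectiveness measure)."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


def rank_by_low_confidence(docs):
    """Baseline: least confident labels are ranked first."""
    return sorted(docs, key=lambda d: d.confidence)


def rank_by_expected_gain(docs, tp: int, fp: int, fn: int):
    """Utility-style ranking: probability of error times gain from correction.

    tp, fp, fn are estimated counts for the automatically labelled set
    (an assumption; they are unknown in practice and must be estimated).
    """
    base = f1(tp, fp, fn)
    # Correcting a false positive turns it into a true negative.
    gain_fix_fp = f1(tp, max(fp - 1, 0), fn) - base
    # Correcting a false negative turns it into a true positive.
    gain_fix_fn = f1(tp + 1, fp, max(fn - 1, 0)) - base

    def expected_gain(d: Doc) -> float:
        p_error = 1.0 - d.confidence
        gain = gain_fix_fp if d.predicted_positive else gain_fix_fn
        return p_error * gain

    return sorted(docs, key=expected_gain, reverse=True)
```

Under these assumptions the two rankings can disagree: a document labelled with slightly higher confidence may still deserve earlier validation when correcting its kind of error would improve effectiveness more.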
URL: http://dl.acm.org/citation.cfm?doid=2808688.2742548
DOI: 10.1145/2742548
Subject Semiautomated classification
I.2.6 Learning



 


For further information, contact: Librarian http://puma.isti.cnr.it
