Istituto di Scienza e Tecnologie dell'Informazione     
Moreo Fernandez A., Esuli A., Sebastiani F. Distributional random oversampling for imbalanced text classification. In: SIGIR 2016 - 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (Pisa, Italy, 17-21 July 2016). Proceedings, pp. 805-808. ACM, 2016.
The accuracy of many classification algorithms is known to suffer when the data are imbalanced (i.e., when the distribution of the examples across the classes is severely skewed). Many applications of binary text classification are of this type, with the positive examples of the class of interest far outnumbered by the negative examples. Oversampling (i.e., generating synthetic training examples of the minority class) is an often used strategy to counter this problem. We present a new oversampling method specifically designed for classifying data (such as text) for which the distributional hypothesis holds, according to which the meaning of a feature is somehow determined by its distribution in large corpora of data. Our Distributional Random Oversampling method generates new random minority-class synthetic documents by exploiting the distributional properties of the terms in the collection. We discuss results we have obtained on the Reuters-21578, OHSUMED-S, and RCV1-v2 datasets.
URL: http://dl.acm.org/citation.cfm?id=2914722&CFID=812657189&CFTOKEN=16638796
DOI: 10.1145/2911451.2914722
Subject: Distributional semantics



For further information, contact: Librarian http://puma.isti.cnr.it
