Knowledge-enhanced document embeddings for text classification

Roberta Akemi Sinoara1, José Camacho-Collados2, Rafael G. Rossi3, Roberto Navigli4, and Solange Oliveira Rezende1

1Laboratory of Computational Intelligence, Institute of Mathematics and Computer Science
University of São Paulo
São Carlos, São Paulo, Brazil
2School of Computer Science and Informatics
Cardiff University
Cardiff, United Kingdom
3Federal University of Mato Grosso do Sul
Três Lagoas, Brazil
4Department of Computer Science
Sapienza University of Rome
Rome, Italy

Abstract

The text representation model has an important impact on the results of the text mining process. For text mining to be applied successfully, the adopted text representation must preserve the interesting patterns to be discovered. Although good results for automatic text classification may be achieved with the traditional bag-of-words model, this representation cannot provide satisfactory classification performance for all classification problems, and richer text representations may be required. In this paper we present an approach to represent document collections based on embedded representations of words and word senses. We bring together the power of word sense disambiguation and the semantic richness of word and word-sense embedded vectors to construct embedded representations of document collections. Our approach results in semantically enhanced and low-dimensional representations. We reduce the lack of interpretability of embedded vectors, which is a drawback of this kind of representation, with the use of word sense embedded vectors. Moreover, the experimental evaluation indicates that the proposed representations provide stable classifiers with competitive classification performance, especially in semantically complex classification scenarios.
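
The abstract does not state how word and word-sense vectors are combined into a document vector. As a minimal sketch, assuming the document vector is simply the centroid (element-wise mean) of the vectors found for its tokens, the idea can be illustrated as follows. The function name, the toy vectors, and the sense identifier are hypothetical and are not taken from the paper or its tools.

```python
def document_embedding(tokens, vectors):
    """Return the element-wise mean of the vectors of the known tokens.

    Assumption: a document is represented by the centroid of its token
    vectors. Tokens without a vector are skipped. Returns None when no
    token has a vector.
    """
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]


# Toy 2-dimensional vectors, invented for illustration only.
TOY_VECTORS = {
    "bank": [1.0, 0.0],
    "river": [0.0, 1.0],
    "sense:bank_river": [2.0, 2.0],  # hypothetical word-sense identifier
}

# Word-based document: "unknown" has no vector and is ignored.
word_doc = document_embedding(["bank", "river", "unknown"], TOY_VECTORS)

# Sense-based document: tokens replaced by (hypothetical) sense identifiers.
sense_doc = document_embedding(["sense:bank_river"], TOY_VECTORS)
```

With a sense-level vocabulary, each contributing vector corresponds to a disambiguated concept rather than an ambiguous surface word, which is the property the abstract associates with improved interpretability.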


Datasets

All datasets used in our experimental evaluation are available as compressed files that contain the five representations of the respective dataset in the sparse ARFF format.

Description of the datasets
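
For readers unfamiliar with the sparse ARFF format mentioned above, the sketch below parses a small sparse ARFF text in plain Python. In sparse ARFF, each data line lists only the non-zero entries as `index value` pairs inside braces, with 0-based attribute indices. The sample content and attribute names are invented for illustration, and the parser covers only simple cases (no quoted attribute names, no missing-value markers), so it is not a substitute for a full ARFF reader.

```python
def parse_sparse_arff(text):
    """Parse simple sparse ARFF text into (attribute_names, rows).

    Each row is a dict {attribute_index: value}; indices that are
    absent from a row stand for the value zero.
    """
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                # "@attribute <name> <type>"; quoted names are not handled.
                attributes.append(line.split(None, 2)[1])
            elif lower.startswith("@data"):
                in_data = True
            continue
        if line.startswith("{") and line.endswith("}"):
            row = {}
            for pair in line[1:-1].split(","):
                pair = pair.strip()
                if not pair:
                    continue
                index, value = pair.split(None, 1)
                try:
                    row[int(index)] = float(value)
                except ValueError:
                    row[int(index)] = value  # nominal value such as a class label
            rows.append(row)
    return attributes, rows


# Invented sample; the real files may use different attribute names.
SAMPLE = """\
@relation toy
@attribute dim_0 numeric
@attribute dim_1 numeric
@attribute dim_2 numeric
@attribute class_label {pos,neg}
@data
{0 0.5, 2 1.25, 3 pos}
{1 2, 3 neg}
"""

attributes, rows = parse_sparse_arff(SAMPLE)
```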


Results

The complete set of results of the experimental evaluation is available as CSV files.

- Results of classification experimental evaluation
- Partial results for IMDB dataset

Each line has the following columns:

- dataset: CSTR, ohsumed-400, bbc, se-product, se-polarity, se-product-polarity, bs-topic, bs-semantic, bs-topic-semantic, or IMDB;
- representation-model: nasari_babel2vec, babel2vec, bow, lda, or word2vec;
- algorithm_parameters;
- accuracy(%);
- error(%);
- accuracy std. deviation;
- micro-precision;
- micro-recall;
- macro-precision;
- macro-recall;
- micro-f1;
- macro-f1.
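
As a hedged illustration of how a file with these columns might be consumed, the sketch below finds the highest accuracy per dataset and representation model. It assumes a comma-separated file whose header row uses the column names exactly as listed above; the actual files may use a different delimiter or header spelling. The sample rows contain made-up numbers and placeholder names, not results from the paper.

```python
import csv
import io

# Made-up rows used only to exercise the code; these are NOT real results.
SAMPLE_CSV = """\
dataset,representation-model,algorithm_parameters,accuracy(%)
toy-a,bow,config-1,70.0
toy-a,bow,config-2,75.5
toy-a,word2vec,config-1,72.0
toy-b,bow,config-1,60.0
"""


def best_accuracy(csv_file):
    """Map (dataset, representation-model) to (best accuracy, parameters)."""
    best = {}
    for row in csv.DictReader(csv_file):
        key = (row["dataset"], row["representation-model"])
        accuracy = float(row["accuracy(%)"])
        if key not in best or accuracy > best[key][0]:
            best[key] = (accuracy, row["algorithm_parameters"])
    return best


best = best_accuracy(io.StringIO(SAMPLE_CSV))
```

To run this against a downloaded file, replace `io.StringIO(SAMPLE_CSV)` with an open file handle, for example `open(path, newline="")`, where `path` points to the local copy of the CSV.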


Resources

The following resources were used to build the document collection representations and to perform the experimental evaluation.

- NASARI embedded vectors are available at the NASARI web page.
Reference: José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence 240, Elsevier, 2016, pp.567-577.

- The pre-trained word and phrase vectors trained on part of the Google News dataset (GoogleNews-vectors-negative300.bin.gz) are available at the word2vec web page.
Reference: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

- The Babelfy API is available at the Babelfy.org web site.
Reference: A. Moro, F. Cecconi, and R. Navigli. Multilingual Word Sense Disambiguation and Entity Linking for Everybody. In Proceedings of the 13th International Semantic Web Conference, Posters and Demonstrations (ISWC 2014), pp. 25-28, 2014.

- The tool used in the execution of the classification algorithms in the experimental evaluation was provided by Rafael G. Rossi (MultiThreading_20180223).
Reference: Rafael G. Rossi, Alneu de Andrade Lopes, and Solange O. Rezende. Optimization and label propagation in bipartite heterogeneous networks to improve transductive classification of texts. Information Processing & Management, Volume 52, Issue 2, pp. 217-257, 2016.

- The MALLET tool, used to run LDA in the experimental evaluation, is available at the MAchine Learning for LanguagE Toolkit web page.
Reference: Andrew Kachites McCallum. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu. 2002.

- The tool used in the construction of the Word2Vec representation is available at João Antunes's GitHub - Text Representation.

- Stoplists used in the pre-processing of the bag-of-words and LDA representations: English stoplist and Portuguese stoplist.

- Lee, Pincombe, & Welsh's dataset for document similarity.
Reference: M. D. Lee, B. M. Pincombe, and M. B. Welsh. An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society, pp. 1254-1259, 2005.

- The fastText pre-trained word vectors (crawl-300d-2M.vec) are available at the fastText web page.
Reference: Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in Pre-Training Distributed Word Representations. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.


Acknowledgments

This work was supported by grants #2013/14757-6, #2016/07620-2, and #2016/17078-0, São Paulo Research Foundation (FAPESP).
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
José Camacho-Collados was supported by a Google PhD Fellowship in Natural Language Processing while initially working on this project, and is currently supported by the ERC Starting Grant 637277.