JPreText: A Simple Text Preprocessing Tool
Ricardo M. Marcacini and Solange O. Rezende
Based on the paper "Nogueira, B. M., M. F. Moura, M. S. Conrado, R. G. Rossi, R. M. Marcacini, and S. O. Rezende. Winning Some of the Document Preprocessing Challenges in a Text Mining Process. In IV Workshop on Algorithms and Data Mining Applications. XXIV Brazilian Symposium on Database (SBBD), pages 10--18, Campinas, SP, Brazil, 2008".
As organizations store ever larger amounts of textual data, the Text Mining (TM) process, which applies computational knowledge-extraction techniques, acts as a transforming agent: useful knowledge is extracted from this large volume of text and used as a competitive advantage or as support for decision making. Pre-processing is frequently treated as a step of minor importance, or as less interesting than the others, because it lacks technical glamor and involves many manual tasks. This step aims to transform the text collection into a form that learning algorithms can use, and involves tasks such as treatment, cleaning, and reduction of the data. To that end, the JPreText tool builds a bag-of-words representation of a text collection and offers the following features:
- Stemming: reduces words to their stems, i.e., the portion of a word that is left after removing its prefixes and suffixes. The tool offers stemming for English and Portuguese texts, using the Porter algorithm and the Orengo stemmer, respectively.
- Stopword removal: non-significant words, such as articles, prepositions and conjunctions, are removed from the text. The tool provides stopword lists for English and Portuguese texts, and the user can add new stopwords as needed.
- Term Selection by document frequency: removes terms that occur in fewer than "DF" documents.
- Term Weighting: selects the measure that indicates the importance of a term in a document. The tool has two options: TF (term frequency) and TFIDF (term frequency–inverse document frequency).
- Normalization: normalizes the vector representations of the documents in the collection to unit length. This speeds up the computation of cosine similarity (for unit-length vectors it reduces to a dot product) and facilitates the use of clustering/classification methods based on Euclidean distance.
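The features listed above can be illustrated with a minimal Java sketch. It is not taken from the JPreText source: the class name `BagOfWordsSketch`, its method names, and the weighting formula tf * ln(N / df) are assumptions chosen for illustration, and stemming is left out. The input documents are assumed to be already tokenized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; names and the TF-IDF variant are assumptions,
// not the JPreText implementation.
public class BagOfWordsSketch {

    // Counts in how many documents each term occurs (document frequency).
    public static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) { // each term counted once per document
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Builds one sparse TF-IDF vector per document, dropping stopwords and
    // terms that occur in fewer than minDf documents.
    public static List<Map<String, Double>> tfidf(
            List<List<String>> docs, Set<String> stopwords, int minDf) {
        Map<String, Integer> df = documentFrequency(docs);
        int n = docs.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Double> weights = new HashMap<>();
            for (String term : doc) {
                if (stopwords.contains(term) || df.get(term) < minDf) {
                    continue;
                }
                weights.merge(term, 1.0, Double::sum); // raw term frequency
            }
            // Assumed weighting: tf * ln(N / df).
            weights.replaceAll((term, tf) -> tf * Math.log((double) n / df.get(term)));
            vectors.add(weights);
        }
        return vectors;
    }

    // Scales a vector in place to unit Euclidean length (no-op for a zero vector).
    public static void normalize(Map<String, Double> vector) {
        double sumOfSquares = 0.0;
        for (double w : vector.values()) {
            sumOfSquares += w * w;
        }
        double norm = Math.sqrt(sumOfSquares);
        if (norm == 0.0) {
            return;
        }
        vector.replaceAll((term, w) -> w / norm);
    }

    // Dot product of two sparse vectors; equals cosine similarity when both
    // vectors have unit length.
    public static double dot(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            sum += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("the", "oil", "price", "rise"),
                Arrays.asList("the", "oil", "export", "fall"),
                Arrays.asList("the", "wheat", "price", "fall"));
        Set<String> stopwords = new HashSet<>(Arrays.asList("the"));

        List<Map<String, Double>> vectors = tfidf(docs, stopwords, 2);
        for (Map<String, Double> v : vectors) {
            normalize(v);
        }
        System.out.println("cosine(d1, d2) = " + dot(vectors.get(0), vectors.get(1)));
    }
}
```

In this toy collection, a minimum DF of 2 removes "rise", "export" and "wheat", so each document keeps two terms, and the first two documents share one of them, giving a cosine similarity of about 0.5.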
This work is part of the results of the Master's thesis "Unsupervised learning of topic hierarchies from dynamic text collections", Ricardo M. Marcacini (Advisor: Solange O. Rezende). Master of Science Dissertation, Mathematical and Computer Sciences Institute (ICMC), University of Sao Paulo (USP), Sao Carlos, SP, Brazil. [link]
JPreText Tool (version 20120813).
Note: A more powerful tool for text preprocessing is Pretext2-PERL (link).
Sample Configuration File (config.ini)
#Text Collection (e.g. Reuters)
Text Source: ./Re8
# Stemming language options are English, Portuguese or None
Stem Language: English
Max. Keywords: 20
Stopwords File: ./stopwords.txt
# Term Selection using Document Frequency
Min. DF: 2
# Term weighting can be TF or TFIDF
Term Weighting: TFIDF
CSV Data File: Re8.csv
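The sample file uses one "Key: Value" pair per line, with lines starting with "#" acting as comments. As a rough illustration of that layout, the following Java sketch reads such text into a map. It is not the parser used by JPreText; the class name `ConfigSketch` and the rule of splitting at the first colon are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not the JPreText configuration parser.
public class ConfigSketch {

    // Parses "Key: Value" lines, skipping blank lines and "#" comments.
    // Assumption: the key ends at the first colon on the line.
    public static Map<String, String> parse(String text) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            int colon = trimmed.indexOf(':');
            if (colon < 0) {
                continue; // not a key/value line
            }
            String key = trimmed.substring(0, colon).trim();
            String value = trimmed.substring(colon + 1).trim();
            options.put(key, value);
        }
        return options;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "#Text Collection (e.g. Reuters)",
                "Text Source: ./Re8",
                "Stem Language: English",
                "# Term Selection using Document Frequency",
                "Min. DF: 2",
                "Term Weighting: TFIDF");

        Map<String, String> options = parse(sample);
        int minDf = Integer.parseInt(options.get("Min. DF"));
        System.out.println(options.get("Term Weighting") + " with minimum DF " + minDf);
    }
}
```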
Running the JPreText software (Windows and Linux)
java -Xmx2G -cp jpretext.jar pretext.Main ./config.ini
Note: -Xmx2G sets the maximum Java heap size to 2 GB. You can adjust this parameter according to your needs.
If you have a question, problem, or suggestion about this web page, please email the authors or use the contact form at http://labic.icmc.usp.br
This work was sponsored by FAPESP and CNPq.