JPreText: A Simple Text Preprocessing Tool
Ricardo M. Marcacini and Solange O. Rezende
Based on the paper "Nogueira, B. M., M. F. Moura, M. S. Conrado, R. G. Rossi, R. M. Marcacini, and S. O. Rezende. Winning Some of the Document Preprocessing Challenges in a Text Mining Process. In IV Workshop on Algorithms and Data Mining Applications. XXIV Brazilian Symposium on Database (SBBD), pages 10--18, Campinas, SP, Brazil, 2008".
As organizations store an increasing amount of textual data, the Text Mining (TM) process, which applies computational knowledge extraction techniques, acts as a transforming agent: useful knowledge is extracted from this large volume of text and used as a competitive advantage or as support for decision making. Preprocessing is frequently treated as a step of minor importance, or as less interesting than the others, because it lacks technical glamour and involves many manual tasks. This step aims to transform the text collection into a form that learning algorithms can use, and it involves tasks such as data treatment, cleaning, and reduction. To this end, the JPreText tool builds a bag-of-words representation of a text collection. The JPreText tool has the following features:
This work is part of the results of the Master's thesis "Unsupervised learning of topic hierarchies from dynamic text collections", by Ricardo M. Marcacini (advisor: Solange O. Rezende), Master of Science dissertation, Mathematical and Computer Sciences Institute (ICMC), University of Sao Paulo (USP), Sao Carlos, SP, Brazil. [link]
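The bag-of-words representation mentioned in the introduction can be illustrated with a minimal Python sketch. It is not JPreText's implementation: the stopword list, the tokenization rule, and the function names `tokenize` and `bag_of_words` are assumptions made for illustration only, and steps a real tool may apply (such as stemming or frequency cut-offs) are left out.

```python
import re
from collections import Counter

# Hypothetical stopword list for illustration; not the list used by JPreText.
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens, minus stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def bag_of_words(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the sorted set of all terms seen in any document.
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

docs = [
    "Text mining extracts knowledge from text.",
    "Preprocessing is a step of the text mining process.",
]
vocab, matrix = bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row of the matrix is one document and each column is one term, so word order is discarded and only term frequencies remain. That is the sense in which a text collection becomes a table that learning algorithms can consume.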
Download
Note: A more powerful tool for text preprocessing is Pretext2-PERL (link).
Sample Configuration File (config.ini)
Running the JPreText software (Windows and Linux)
Note: -Xmx1G sets the maximum Java heap size to 1 GB. Adjust this value to suit your needs.
Contact
If you have a question, problem, or suggestion about this web page, please email the authors or use the contact form at http://labic.icmc.usp.br.
This work was sponsored by FAPESP and CNPq.