About

The structured representations used in Text Mining problems are characterized by high dimensionality and high sparsity. Moreover, the documents of a domain share many general words, which can affect the similarity calculation in clustering or model induction in classification. To solve or reduce these problems, only the keywords of a document can be used instead of all its words. Keywords describe or summarize the document content concisely. Unfortunately, most documents do not have associated keywords. Methods to extract keywords automatically are therefore necessary, since manually extracting keywords from the documents of a collection or stream is infeasible.

Statistical Keyword Extraction Tool (SKET) implements statistical methods to extract keywords from a single document. The methods available in SKET are: i) Most Frequent, ii) Co-occurrence Statistical Information [1], iii) Term Frequency - Inverse Sentence Frequency [2], iv) Eccentricity-Based [3], v) KeyWorld [4] and vi) TextRank [5].
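As an illustration of the simplest method in the list, Most Frequent, the sketch below ranks the words of a single document by their number of occurrences and returns the top k. It is not SKET's source code: the class name MostFrequentSketch, the method topKeywords, the tokenization rule and the alphabetical tie-breaking are all assumptions made for this example, and it uses Java 8 features that are newer than the Java version SKET was built with.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a "Most Frequent" keyword extractor (not SKET's code).
class MostFrequentSketch {

    // Returns the k most frequent words of the text.
    // Assumption: ties are broken alphabetically to make the result deterministic.
    static List<String> topKeywords(String text, int k) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: a token is any maximal run of letters, compared in lower case.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> keywords = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            keywords.add(entries.get(i).getKey());
        }
        return keywords;
    }

    public static void main(String[] args) {
        String text = "keyword extraction selects keyword candidates; "
                + "extraction ranks candidates by keyword frequency";
        // "keyword" occurs 3 times; "candidates" and "extraction" occur twice each.
        System.out.println(topKeywords(text, 2)); // prints [keyword, candidates]
    }
}
```

In practice the counting would run on the preprocessed terms (stemmed, with stopwords removed) rather than on raw words.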

SKET was developed in Java (Oracle version 1.6.0_26 [6]) using the NetBeans IDE (version 7.1 [7]). To run SKET, the user can enter the command "java -jar KeyWordExtraction.jar" or double-click the JAR file in the graphical interface of the operating system. SKET provides a graphical interface in which the user selects the directory that contains the text files (*.txt), the directory in which to store the texts containing the extracted keywords, the statistical keyword extraction methods, and the text preprocessing options.

SKET accepts texts written in English and Portuguese. The preprocessing options available in SKET are: i) word stemming, ii) stopword removal and iii) frequency cut. English words are stemmed using Porter's stemmer [8], through the Java class available in [10]. Portuguese words are stemmed using the PTStemmer library [11]. Stopwords are removed using the stopword lists contained in the Pretext tool [9]. The user can cut the terms that occur above a threshold frequency or select just a percentage of the most frequent terms. SKET considers as terms the stemmed single words that occur in the texts.
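Two of these preprocessing steps, stopword removal and selecting a percentage of the most frequent terms, can be sketched as follows. This is only an illustration under stated assumptions: the class PreprocessSketch and its methods are hypothetical, the stopword set is a tiny sample rather than the Pretext list, the rounding rule (round up) is a guess, and stemming is left out because it depends on the external stemmers cited above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of two preprocessing options (not SKET's code).
class PreprocessSketch {

    // Drops every token that appears in the stopword set.
    static List<String> removeStopwords(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopwords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Keeps the given percentage (0-100) of the distinct terms, most frequent first.
    // Assumptions: the number of kept terms is rounded up, ties are broken alphabetically.
    static List<String> keepTopPercent(Map<String, Integer> frequencies, int percent) {
        List<String> terms = new ArrayList<>(frequencies.keySet());
        terms.sort((a, b) -> {
            int byCount = Integer.compare(frequencies.get(b), frequencies.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        int keep = (terms.size() * percent + 99) / 100; // integer ceiling
        return new ArrayList<>(terms.subList(0, Math.min(keep, terms.size())));
    }

    public static void main(String[] args) {
        // Sample stopwords chosen for this example only.
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "of", "a"));
        List<String> tokens = Arrays.asList("the", "keyword", "of", "a", "text");
        System.out.println(removeStopwords(tokens, stopwords)); // prints [keyword, text]

        Map<String, Integer> frequencies = new HashMap<>();
        frequencies.put("cluster", 5);
        frequencies.put("keyword", 3);
        frequencies.put("text", 2);
        frequencies.put("tool", 1);
        System.out.println(keepTopPercent(frequencies, 50)); // prints [cluster, keyword]
    }
}
```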

The user can select different numbers of keywords to be extracted. The resulting file can contain the keywords repeated according to their frequency in the original document or include each keyword only once. SKET generates a resulting file for each method and desired number of keywords. Given an input directory, SKET processes all the text files contained in it and in its subdirectories (if there are any) and generates an output file according to the following pattern: /[output_directory]/[method]/[number_of_keywords]_keywords/[subdirectories_of_the_input_directory]/[input_file].
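The output pattern above can be reproduced with the standard java.nio.file API, as in the minimal sketch below. The class OutputPathSketch, the method outputPath and the directory label "TextRank" are assumptions for illustration; the exact method directory names that SKET writes are not stated on this page.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of how the output location could be derived (not SKET's code).
class OutputPathSketch {

    // Builds [output_directory]/[method]/[number_of_keywords]_keywords/
    //        [subdirectories_of_the_input_directory]/[input_file].
    static Path outputPath(Path inputDir, Path inputFile, Path outputDir,
                           String method, int numberOfKeywords) {
        // The part of the file path below the input directory, subdirectories included.
        Path relative = inputDir.relativize(inputFile);
        return outputDir
                .resolve(method)
                .resolve(numberOfKeywords + "_keywords")
                .resolve(relative);
    }

    public static void main(String[] args) {
        Path result = outputPath(
                Paths.get("input"),
                Paths.get("input", "sports", "doc1.txt"),
                Paths.get("output"),
                "TextRank", // assumed directory label for the method
                10);
        System.out.println(result);
    }
}
```

Using relativize keeps the subdirectory structure of the input directory intact under each method and keyword-count folder.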

To extract the keywords, the user clicks the "Extract" button. The user can also store the options selected in SKET's graphical interface in a file (.ske extension) and run SKET from the command line. To do so, the user enters the command "java -jar KeyWordExtraction.jar [Name_Configuration_File].ske".

Download Tool


Publications


ROSSI, R. G.; MARCACINI, R. M.; REZENDE, S. O. Analysis of Statistical Keyword Extraction Methods for Incremental Clustering. In: X Encontro Nacional de Inteligência Artificial e Computacional (ENIAC), 2013. Sociedade Brasileira de Computação.

Technical Report Describing the Tool (soon)


Datasets


ACM
Estadão


Supplement


Leader-Follower algorithm


References

[1] Matsuo, Y. and Ishizuka, M. (2003). Keyword extraction from a single document using word co-occurrence statistical information. In Florida Artificial Intelligence Research Society Conference, pages 392–396.
[2] Martins, C. B., Pardo, T. A. S., Espina, A. P., and Rino, L. H. M. (2001). Introdução à sumarização automática. Technical Report RT-DC 002/2001, ICMC-USP. 38p.
[3] Palshikar, G. K. (2007). Keyword extraction from a single document using centrality measures. In Pattern Recognition and Machine Intelligence, pages 503–510.
[4] Matsuo, Y., Ohsawa, Y., and Ishizuka, M. (2001). Keyworld: Extracting keywords from a document as a small world. In Discovery Science, volume 2226/2001, pages 271–281.
[5] Mihalcea, R. and Tarau, P. (2004). TextRank: Bringing order into texts. In Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain.
[6] Oracle, 2013. http://www.oracle.com/technetwork/java/javase/6u26releasenotes-401875.html. Access: 24th June 2013.
[7] Netbeans, 2013. https://netbeans.org/community/releases/71/. Access: 24th June 2013.
[8] Porter, M. F. (1980). An algorithm for suffix stripping. Readings in Information Retrieval, 14(3):130–137.
[9] Soares, M. V. B., Prati, R. C., and Monard, M. C. (2008). Pretext II: Descrição da reestruturação da ferramenta de pré-processamento de textos. Relatório Técnico 333, ICMC-USP. http://www.icmc.usp.br/~biblio/BIBLIOTECA/rel_tec/RT_333.pdf.
[10] Porter, M. (2000). Porter stemmer in Java. http://tartarus.org/~martin/PorterStemmer/java.txt.
[11] Oliveira, P. (2010). PTStemmer - A Stemming toolkit for the Portuguese language. http://code.google.com/p/ptstemmer/.


Developed by Rafael Geraldeli Rossi.
This page was last modified on 23 July 2013.