PIMIENTO
Pimiento stands for Platform Independent Text Mining Engine Tool. It is a framework for
Text Mining.
Generally speaking, Text Mining (sometimes also called Intelligent Text Analysis, Text Data Mining or Knowledge-Discovery in Text) consists of discovering previously unknown information in existing resources. Text Mining is related to Data Mining, which aims to extract useful patterns from structured text or data, usually stored in large database repositories. Text Mining, in contrast, searches for patterns in unstructured natural-language texts (e.g. books, articles, e-mail messages and web pages). It is a multidisciplinary field that includes tasks such as Information Retrieval, Text Analysis, Clustering, Categorisation and Summarisation.
Text Mining is generally found useful in environments where large collections of text documents are handled. One of the well-known premises of using Text Mining is that the value obtained by mining text documents is directly proportional to the value of those documents. The more important the knowledge contained in the document collection, the more value will be derived.
FEATURES
These are the main features of Pimiento:
- Written 100% in Java 1.4.x, so it can be used in any system with an available JVM.
- High-quality code. Pimiento is a framework intended for production systems, although
it is also suitable for experimental purposes. Hence the source code has been profiled
and tuned for efficiency, performance, and scalability.
- Text Categorisation consists of automatically selecting the most suitable category for a given, previously unseen document. The categorisation system must first be trained with documents whose categories are known. These documents are known as training documents and are used by the machine learning algorithms in order to learn. After the learner has been trained, it is able to predict the category of an unseen document. This functionality includes the following key features:
  - Multilingual support including English, German, French, Spanish and Basque. This involves the
    proper pre-processing of documents (stop-word removal and stemming) according to their
    language and three different classification approaches (one classifier per language with
    language-specific pre-processing, one classifier for all languages with language-specific
    pre-processing, and one classifier for all languages with neutral pre-processing). A language-neutral
    pre-processing feature, based on n-grams, is also available for any language.
  - Multiple functions for feature selection such as Term Frequency (TF), Document Frequency (DF),
    Inverse Document Frequency (IDF), χ², and Information Gain.
  - Several weighting methods available including Term Existence (TE), Term Frequency (TF),
    and Term Frequency/Inverse Document Frequency (TF/IDF).
  - Several learning algorithms available including Naive Bayes Multinomial, Naive Bayes Complement,
    k-Nearest Neighbour (kNN), and Rocchio.
  - Ensembles of classifiers based on different multiclass decomposition methods like one-per-class,
    pairwise coupling, and Error-Correcting Output Codes (ECOC).
  - Trained Categorisers can be saved to disk for future loading and use without having to retrain.
  - High scalability based on a cache system that allows controlling the amount of memory
    allocated for the categorisation process. Both documents and feature vectors can be cached, so
    only a limited number of them are kept in memory, which minimises memory usage.
  - Complete evaluation of results including category-specific true positives (TPi), false positives (FPi),
    false negatives (FNi), precision (πi), recall (ρi), and F1 (F1i), as well as micro-averaged (πμ, ρμ, F1μ)
    and macro-averaged (πM, ρM, F1M) precision, recall, and F1.
- Document Clustering functionality using the k-means algorithm. The documents can be in English, Spanish, French, German, or Basque.
- Language Identification functionality for English, Spanish, French, German, and Basque. The applied model is based on statistics of character n-grams and achieves an accuracy of around 99%.
- Similarity Analysis functionality. This feature measures how similar two text documents are. A number of similarity functions are available, including Hamming, Euclidean, Manhattan, and Minkowski.
- Basic sentence-oriented Summarisation functionality. This feature consists of distilling the most important information from a source text document to produce a condensed account by means of statistical and linguistic methods.
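
As an illustration of the Similarity Analysis feature listed above, the four named functions (Hamming, Euclidean, Manhattan, and Minkowski) can be sketched in plain Java. This is only a minimal sketch and not Pimiento's actual API: the class name SimilaritySketch, the method names, and the representation of documents as arrays of term weights (or term presence flags for Hamming) are all assumptions made for the example.

```java
// Illustrative sketch only: class and method names are hypothetical, not Pimiento's API.
// Documents are assumed to be already converted to equal-length term vectors.
class SimilaritySketch {

    // Minkowski distance of order p (p >= 1) between two term-weight vectors.
    public static double minkowski(double[] a, double[] b, double p) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        if (p < 1.0) {
            throw new IllegalArgumentException("p must be >= 1");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.pow(Math.abs(a[i] - b[i]), p);
        }
        return Math.pow(sum, 1.0 / p);
    }

    // Manhattan distance is the Minkowski distance with p = 1.
    public static double manhattan(double[] a, double[] b) {
        return minkowski(a, b, 1.0);
    }

    // Euclidean distance is the Minkowski distance with p = 2.
    public static double euclidean(double[] a, double[] b) {
        return minkowski(a, b, 2.0);
    }

    // Hamming distance: number of positions at which two binary
    // (term present / term absent) vectors differ.
    public static int hamming(boolean[] a, boolean[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        int differences = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) {
                differences++;
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        // Two toy documents over a three-term vocabulary (made-up weights).
        double[] doc1 = {1.0, 0.0, 2.0};
        double[] doc2 = {0.0, 0.0, 4.0};
        System.out.println("Manhattan: " + manhattan(doc1, doc2));
        System.out.println("Euclidean: " + euclidean(doc1, doc2));
        System.out.println("Minkowski (p=3): " + minkowski(doc1, doc2, 3.0));
        System.out.println("Hamming: " + hamming(
                new boolean[] {true, false, true},
                new boolean[] {false, false, true}));
    }
}
```

All four functions are distances, so smaller values indicate more similar documents. How Pimiento converts such distances into similarity scores, if it does, is not described here.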
DOCUMENTATION
The paper "Mining Text with Pimiento", published in IEEE Internet Computing, describes Pimiento in some detail.
DOWNLOAD
Although Pimiento is in a fairly stable state regarding bugs and code quality, it is not directly downloadable and it is not released under any particular licence. However, you can obtain Pimiento on request if you belong to a non-profit research or academic institution. I am mostly interested in establishing collaborations with other researchers or developers interested in text-mining infrastructure. Also, I usually ignore email sent from free providers such as Hotmail, Yahoo or Gmail.
Copyright © Juan José García Adeva