What is Text Mining?
Text mining, also known as text analytics, is the process of extracting useful information and insights from unstructured text data. It involves using various computational and statistical techniques to analyze large collections of textual data, such as emails, social media posts, customer reviews, and news articles.
Text mining, or text analytics, is a handy tool for quantitatively examining the text generated on social media and organising it into clusters, patterns, and trends.
In other words, text mining represents the set of tools, techniques, and methods applied for automatically processing large amounts of natural language text stored as computer files. The extracted and structured content and themes are used for rapid analysis, identification of hidden data and information, and automatic decision-making. Text mining tools are often based on the principles of information retrieval and natural language processing.
Complex linkage structure makes text mining in social networks a challenging job, requiring the help of automated tools and sorting techniques. A number of text mining tools and algorithms have been developed to enable easy extraction of information from different textual resources. The recent developments in statistical and data processing tools have added to the evolution in the domain of text mining.
Text mining employs the concepts obtained from various fields ranging from linguistics and statistics to Information and Communication Technologies (ICT). Statistical pattern learning is applied to create patterns from the extracted text, which are further examined to obtain valuable information. The overall process of text mining comprises retrieval of information, lexical analysis, creation and recognition of patterns, tagging, extraction of information, application of data mining techniques, and predictive analytics.
This can be summarised as follows:
Text mining = Lexicometry + Data mining
The process begins with information retrieval, which involves collecting and identifying information from a set of textual material. The information can come from various sources such as websites, databases, documents, or content management systems. The textual information is processed by parsers and other linguistic analysis tools to examine and recognise textual features such as people, organisations, place names, stock ticker symbols, and abbreviations. Figure depicts the text mining process:
On the basis of certain identified patterns, other items such as entities, email addresses, and telephone numbers are identified. Further, sentiment analysis is applied to identify the underlying attitude. Finally, a psychological profile is derived by conducting quantitative text analysis.
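As a minimal sketch of pattern-based identification, the Python snippet below pulls email addresses and telephone numbers out of raw text with regular expressions. The two patterns, the sample sentence, and the function name `extract_patterns` are illustrative assumptions; real-world formats vary widely and would need more robust patterns.

```python
import re

# Simplified patterns, assumed for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def extract_patterns(text):
    """Return the email addresses and phone-like strings found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": [m.strip() for m in PHONE_RE.findall(text)],
    }

sample = "Contact jane.doe@example.com or call +1 555-010-9999 for details."
found = extract_patterns(sample)
print(found)
```

Recognising people, organisations, and places generally needs more than fixed patterns, which is why the text above mentions parsers and linguistic analysis tools for that step.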
The overall purpose of text mining analytics is to transform unstructured text into valuable structured data which can be further analysed and applied for various domains such as research, investigation, exploratory data analysis, biomedical applications, and business intelligence.
Statistical analysis tools such as R, together with simple measures such as word counts, aid in assessing a review as a whole. Further, positive and negative relationships can be explored using plotting techniques such as scatter plots.
Apart from the listed application areas, text mining techniques can be further applied for analysis of demographics, financial status, and buying tendencies of customers.
To sum up, text mining can be applied in the following areas:
- Competitive intelligence: In order to succeed, business organisations not only need to know about the key players in the industry, but also about the strengths and weaknesses of their competitors. Text mining provides factual data to organisations that can be applied for strategic decision making.
- Community leveraging: Text mining facilitates the identification and extraction of the information embedded in community interaction. This information can be applied for amending marketing strategies.
- Law enforcement: Text mining can be applied in the domain of government intelligence for counter-terrorism activities.
- Life sciences: Text mining can also be effectively applied in the area of research and development of drugs. Bioinformatics companies, such as PubGene, are applying biomedical text mining combined with network visualisation as an Internet service.
Text Mining Process
The enormous amount of unstructured data collected from social media makes text mining a very challenging process. The key steps for any text mining process can be summed up as follows:
Extracting the keyword
Any text analysis process begins by identification of relevant and precise keyword(s) that can be applied for specific queries. Next, the content and the linkage patterns are considered for applying keyword searches since the content related to similar keywords is often linked. The selected keywords act as social network nodes and play an important role while clustering the text.
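One common way to pick candidate keywords is TF-IDF weighting, which favours terms that are frequent in one document but rare across the collection. The text above does not name a specific method, so the sketch below is an assumption; the function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def top_keywords(docs, doc_index, k=3):
    """Rank the terms of one document by TF-IDF against the whole collection."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    tf = Counter(tokenized[doc_index])
    n_docs = len(docs)
    scores = {
        term: (count / len(tokenized[doc_index])) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Sort by descending score, then alphabetically so ties are stable.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

docs = [
    "the battery life of the phone is great",
    "the screen of the phone is too dim",
    "the delivery was late and the box was damaged",
]
print(top_keywords(docs, 0))
```

Words that occur in every document, such as "the", receive a score of zero and therefore never rank as keywords.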
Classifying and clustering the text
Various algorithms are applied for classifying text from the source content. For this process, the nodes are associated with labels prior to classification. After that, the classified text is clustered on the basis of similarity. The classification and clustering of the text are greatly influenced by the linkage structure of the data. Accurate results can be obtained by applying node labelling and content-based classification techniques.
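As an illustration of content-based classification from labelled examples, the sketch below implements a small multinomial naive Bayes classifier. Naive Bayes is chosen here only as a familiar example; the text does not say which algorithm is used, and the training data and function names are hypothetical. The linkage structure mentioned above is not modelled.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train(labelled_docs):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labelled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability under naive Bayes."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("great phone and great battery", "positive"),
    ("love the bright screen", "positive"),
    ("terrible battery and poor screen", "negative"),
    ("delivery was late and the box was damaged", "negative"),
]
model = train(training)
print(classify("great screen", model))
```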
Trend analysis applies the principle that even for the same content, the clusters collected at different nodes can have different concept distributions. For this reason, the concepts at various nodes are compared and classified accordingly in the same or different subcollections.
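The comparison of concept distributions can be sketched as follows, under some simplifying assumptions: each node is a list of texts, a "concept" is simply a word, the distributions are compared with cosine similarity, and the 0.5 threshold is arbitrary. All names are hypothetical.

```python
import math
import re
from collections import Counter

def concept_distribution(texts):
    """Relative frequency of each term across the texts of one node."""
    counts = Counter(w for t in texts for w in re.findall(r"[a-z]+", t.lower()))
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(p[t] * q[t] for t in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def same_subcollection(node_a, node_b, threshold=0.5):
    """Decide whether two nodes are similar enough to share a subcollection."""
    similarity = cosine_similarity(
        concept_distribution(node_a), concept_distribution(node_b)
    )
    return similarity >= threshold

node_a = ["budget vote delayed", "budget talks resume"]
node_b = ["budget vote passes"]
node_c = ["team wins final match"]
print(same_subcollection(node_a, node_b), same_subcollection(node_a, node_c))
```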
Obtaining desired results for a specific query involves careful processing of the relevant document. For effective text mining, several stages of processing need to be applied on a document, such as:
- Text preprocessing: This involves the identification of all the unique words in a document. Non-informative words (stop words), such as the, and, or, and when, are filtered out of the document text before word stemming is applied.
Word stemming refers to the process of reducing inflected or derived words to their base stem. For example, words such as cat, cats, catlike, and catty are all mapped to the same stem, cat. Programs that perform stemming are referred to interchangeably as stemmers or stemming algorithms. Affix stemmers trim both prefixes and suffixes (such as ed, ly, and ing) from a given word. Common approaches include the brute-force (lookup table) algorithm and the suffix-stripping algorithm.
- Document representation: A document is represented by the words and terms it contains.
- Document retrieval: This involves the retrieval of a document based on some query. Text indexing makes the search efficient, and accuracy measures are used to check the quality of the results. Text indexing and searching capabilities can be incorporated in an application using Apache Lucene, which is a Java library.
- Document clustering: This involves the grouping of conceptually related documents to ensure fast retrieval. A term for a given query can be searched faster from the well-clustered documents. Document clustering can be implemented using the following techniques:
- Hierarchical clustering
- One-pass clustering
- Buckshot clustering
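Of the techniques listed, one-pass clustering is the simplest to sketch: documents are processed once, in order, and each either joins the most similar existing cluster or starts a new one. The version below is a minimal illustration that assumes Jaccard similarity over word sets and an arbitrary threshold of 0.15; the function names and sample documents are hypothetical.

```python
import re

def word_set(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if a | b else 0.0

def one_pass_cluster(docs, threshold=0.15):
    """Assign each document, in order, to the most similar existing cluster,
    or open a new cluster when no similarity reaches the threshold."""
    clusters = []  # each cluster: {"vocab": set of words, "members": doc indices}
    for i, doc in enumerate(docs):
        words = word_set(doc)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(words, cluster["vocab"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            best["vocab"] |= words
        else:
            clusters.append({"vocab": set(words), "members": [i]})
    return [c["members"] for c in clusters]

docs = [
    "battery drains fast",
    "screen too dim",
    "battery life short",
    "dim screen at night",
]
print(one_pass_cluster(docs))
```

Because each document is seen only once, the result depends on the order in which documents arrive, which is the usual trade-off for the speed of this approach.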
Once clustered, the documents are then organised into user-defined categories or taxonomies. Figure depicts the stages of document processing:
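The earlier stages (preprocessing, representation, and retrieval) can be sketched in a few lines of Python. This is a simplified illustration, not a production pipeline: the stop-word list is tiny, the stemmer strips only four suffixes and would not reduce catlike or catty to cat, and the index is a plain in-memory dictionary in place of a library such as Lucene. All names are hypothetical.

```python
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "and", "or", "when", "a", "is", "of", "on"}  # tiny assumed list
SUFFIXES = ("ing", "ed", "ly", "s")  # checked in this order

def stem(word):
    """Strip one known suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stop words, and stem the remaining words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in preprocess(text):
            index[term].add(doc_id)
    return index

def retrieve(query, index):
    """Return ids of documents containing every stemmed query term."""
    terms = preprocess(query)
    if not terms:
        return []
    # .get avoids inserting empty entries into the defaultdict.
    result = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(result)

docs = [
    "The cats jumped on the table",
    "A cat is sleeping",
    "Dogs barked loudly when the cat jumped",
]
bag = Counter(preprocess(docs[0]))  # bag-of-words view of the first document
index = build_index(docs)
print(bag)
print(retrieve("jumping cats", index))
```

Stemming the query with the same function as the documents is what lets "jumping cats" match documents that contain "jumped" and "cat".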
Both structured and unstructured data are involved in text mining. Unstructured data comes from reviews and summaries, while structured data is obtained from organised spreadsheets. Text mining tools identify themes, patterns, and insights hidden in the structured as well as unstructured data. Organisations employ various text mining software packages for different data mining applications.
Text Mining Software
The following are some commonly used text mining software packages:
- R: Used for statistical data analysis, text processing, and sentiment analysis.
- ActivePoint: Applied for natural language processing and online catalog-based contextual search.
- Attensity: Used for extraction of facts including who, what, where, and why and then identifying people, places, and events and how they are related.
- Crossminder: Applied for cross-lingual text analytics.
- Compare Suite: Used for comparing texts by keywords and highlighting common and unique keywords.
- IBM SPSS Predictive Analytics Suite: Applied for data and text mining.
- Monarch: Applied for analysis and transformation of reports into live data.
- SAS Text Miner: Provides a rich suite of text processing and analysis tools.
- Textalyzer: Used for online text analysis.
Apart from these, some other text mining tools include AeroText, Angoss, Autonomy, Clarabridge, IBM LanguageWare, WordStat, and Lexalytics.