A SAS White Paper
SAS Text Miner
Distilling Textual Data for Competitive Business Advantage
Table of Contents
Introduction
What Is Text Mining?
  Text Mining Applications
  Automatic Classification of Documents
  Clustering Large Document Collections
  Examples of Integrated Text and Data Mining Analysis
SAS Text Miner
SAS Text Miner Functions
  Language Support
  Universal Data Access
  Term Identification
  Word Stemming and Synonyms
  Feature Extraction
  Stop and Start Lists
  Output
  Dimension Reduction
  Term Weightings
  Term Concepts
  Mining the Text
  Clustering
  Interpretation of Clusters
  Classification
  Interactive Results Viewer
  Searching: Similarity-based Filtering with Keywords and Clusters
How SAS Text Miner Differs from Its Competition
Conclusion
Primary content providers for this paper were Manya Mayes, Bernd Drewes and
Wayne Thompson
SAS® Text Miner: Distilling Textual Data
Introduction
Analysis of customer information has traditionally involved structured data sources, namely quantitative and qualitative data that can be arranged in tables or spreadsheets. While information such as customer e-mail, phone transcriptions and free-form survey text is widely collected, analyzing these textual fields (independently or in concert with structured information) has not been a simple task. Manually digesting large volumes of e-mail messages or, more generally, text documents of any nature is a time-consuming process that gains nothing from any structure contained in the data. As a result, much of this valuable information has remained unused. Harvesting the information found in large volumes of unstructured text data requires text mining.
SAS Text Miner provides an easy-to-use interface that enables you to quickly determine key information contained in document collections.
This paper defines text mining, discusses text mining applications and describes how best to leverage your business information by combining text mining and traditional data mining
techniques.
What Is Text Mining?
SAS defines text mining as the process of investigating a large collection of free-form documents in order to discover and use the knowledge that exists in the collection as a whole.
Unlike natural language processing (NLP) and knowledge extraction, text mining involves pattern searching across the entire document collection. NLP and knowledge extraction focus on summarizing individual documents within the collection, which means each document is
processed independently of the others. However, NLP and knowledge extraction can be important steps in the text mining process by adding value to the preprocessing of raw text in order to turn unstructured data into structured information suitable for enhanced analysis.
SAS text mining extends traditional text analysis algorithms by making text data available to data mining algorithms, thus enabling text documents to be used for predictive modeling and segmentation.
Text Mining Applications
Imagine that in your corporate warehouse you have verbatim text from your customer call center. Typically, this text is used rarely, if at all. Each day, literally hundreds of customers relay their comments about your products. Some of these calls are complimentary, but more likely, most of
these calls are related to complaints. In either case, being able to quickly ascertain information about what makes your customers pleased or unhappy may mean the difference between the
success or failure of a product and, perhaps more importantly, may enhance or tarnish the image of your company.
SAS Text Miner enables you to detect new problems with product packaging, issues with product distribution and possible health risks associated with product formulation. Finding this information
and acting on it quickly is key to keeping your customers happy.
In this information age, a broad audience can benefit from Text Miner: marketing representatives, middle management, customer service representatives, help desk support specialists, academic and medical researchers, product managers, attorneys, Web developers, human resource representatives and, more generally, anyone who must look through large volumes of text to extract information, ideas and trends. Typical applications of SAS Text Miner include:
Automatic Classification of Documents
• Offline e-mail filtering and routing.
• News item routing.
• Help desk/call center inquiries.
• Generate classification codes from descriptions.
Clustering Large Document Collections
• Free-form survey data.
• Customer complaints/comments.
• Scientific or legal databases.
• Document warehouse management.
• Warranty claims.
• Competitive reports.
• Patent claims.
Examples of Integrated Text and Data Mining Analysis
• Predicting customer satisfaction from customer complaints/comments.
• Capturing business announcements with stock metrics to predict stock market price fluctuations.
• Incorporating patient notes with patient measurements to better predict drug efficacy.
• Predicting trends in product defects from call center logs.
• Capturing customer comments with demographic and purchasing behavior to facilitate cross-selling and up-selling.
• Enhancement of other data mining activities by including previously unused free-form text.
SAS Text Miner
SAS Text Miner uncovers the themes and concepts contained in a document collection. It provides facilities to read different document formats, extract key features, generate essential
concepts (data reduction) and cluster text using specialized techniques, all within an interactive environment that optimizes the interpretation of a text collection.
Text Miner operates in the same integrated environment as SAS Enterprise Miner to jointly evaluate both structured and unstructured data elements to answer specific business questions.
Figure 1: Text Miner runs within the intuitive point-and-click process flow environment of Enterprise Miner to seamlessly incorporate textual data into the mining process.
For example, customer comments can be supplemented with customer purchase and campaign response information in order to predict if customer complaints have an impact on customer
loyalty. In the case where only the textual data is available, Text Miner can be used to uncover themes and concepts in the document collection as illustrated in Figures 2 and 3. A more detailed discussion of these concepts is provided in the remainder of the paper.
Figure 2: Sample Medline® abstract data for creating more refined document taxonomies
Figure 2 highlights the contents of several abstracts extracted from the Medline Database. Text Miner is used to create a document taxonomy as well as subject-based clusters among the
retrieved items. Each cluster is profiled by the terms that best discriminate it from all other clusters. Interactive exploration of the documents contained in each cluster allows for the discovery of any items of interest. Enterprise Miner can analyze further the clusters and
documents through predictive modeling or enhanced profiling using other attributes. The results of the analysis are described in the section titled "Interpretation of Clusters."
Figure 3: A partial display of Medline® document taxonomies.
Figure 3 highlights several key themes contained within the Medline query, including a cluster that captures documents related to decision support in health care and a cluster that contains documents related to neural network techniques. Note that the 13 documents that are blank
(these documents had titles with no abstracts) are automatically clustered together.
SAS Text Miner Functions
Language Support
Text Miner supports the core analysis of space-delimited documents. More sophisticated text
preprocessing is provided for English, French and German (or a combination of these) documents.
Universal Data Access
Text Miner can read documents from a variety of sources, including ASCII, PDF, HTML, Excel, Lotus and PowerPoint. All such documents can be easily imported into a single SAS data set for
text mining purposes. This data may be enriched using the SAS System to integrate documents and quantitative data from a wide variety of disparate but complementary sources. Once the desired SAS data set has been created, it becomes the input to SAS Text Miner.
Term Identification
For analysis purposes, terms may be defined by single words, word groups (e.g., balance of payments, Empire State Building), entities (such as name, address, etc.), word context (using part-of-speech tagging), numbers or punctuation. Frequency and weight information for each
term are valuable in determining a term's contribution to the document collection. Removal of low-information terms helps reduce the dimensionality of the data.
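As a minimal sketch of this idea, the following Python fragment identifies single words and word groups as terms. The word-group list and the function name `identify_terms` are illustrative assumptions, not part of Text Miner, and punctuation is simply discarded here.

```python
import re

# Hypothetical word groups to keep as single terms (illustrative only).
WORD_GROUPS = ["balance of payments", "empire state building"]

def identify_terms(text, word_groups=WORD_GROUPS):
    """Split text into terms, keeping known word groups together."""
    lowered = text.lower()
    # Join each known word group with underscores so it survives tokenizing.
    for group in word_groups:
        lowered = lowered.replace(group, group.replace(" ", "_"))
    # Words (including joined groups) and numbers become terms.
    return re.findall(r"[a-z0-9_]+", lowered)

terms = identify_terms("The balance of payments improved near the Empire State Building.")
```

A production parser would also use part-of-speech tagging and entity detection, which this sketch omits.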
Word Stemming and Synonyms
Words can be reduced (stemmed) to their root form to simplify analysis. The process of "stemming" is a commonly used text-processing feature. The words banks, banking, and banked
would be treated as bank. By using a synonym dictionary, words such as purse, pocketbook and handbag could be treated as one term. Abbreviations and symbols such as oz, lbs and $ are also treated as synonyms for the words they represent (ounce, pounds, dollars). Users have the flexibility to
develop their own domain-specific dictionaries.
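The effect of stemming and synonym dictionaries can be sketched with two small lookup tables. The tables and the function name `normalize` below are hypothetical and hand-made for illustration. A real stemmer derives root forms linguistically rather than from a fixed table.

```python
# Hypothetical lookup tables (illustrative only; real dictionaries are far larger).
STEMS = {"banks": "bank", "banking": "bank", "banked": "bank"}
SYNONYMS = {
    "pocketbook": "purse",
    "handbag": "purse",
    "oz": "ounce",
    "lbs": "pounds",
    "$": "dollars",
}

def normalize(term):
    """Map a term to its root form, then to its canonical synonym."""
    term = term.lower()
    term = STEMS.get(term, term)
    return SYNONYMS.get(term, term)

normalized = [normalize(t) for t in ["Banks", "banking", "handbag", "oz", "purse"]]
```

Applying the stem table before the synonym table means that a domain-specific synonym entry only needs to list root forms.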
Figure 3: Text Miner provides the capability of building and refining domain-specific synonym lists.
Figure 4: Stemmed terms are determined during analysis and may be further refined.
Feature Extraction
Much of a term's meaning can be derived from a dictionary and/or linguistic patterns. The term
"Mr. Bill H. Washburn" clearly defines the name of a person even though this may not be listed in any standard dictionary. From both dictionary information and linguistic patterns, many terms can be flagged with semantic categories such as location, company name, Internet address, date,
currency, measure, title, phone number, names of people. These semantic features can be used selectively to improve document processing by complementing the dictionary with entity categories.
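To illustrate how linguistic patterns alone can flag semantic categories, here is a minimal sketch using regular expressions. The three patterns and the name `tag_entities` are assumptions made for this example. They are far simpler than a real entity extractor and cover only a few of the categories listed above.

```python
import re

# Hypothetical patterns for a few semantic categories (illustrative only).
PATTERNS = {
    "person": re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\. (?:[A-Z][a-z]*\.? )*[A-Z][a-z]+"),
    "currency": re.compile(r"\$\d+(?:\.\d{2})?"),
    "internet_address": re.compile(r"\bwww\.[a-z0-9-]+\.[a-z]+\b"),
}

def tag_entities(text):
    """Return (category, matched text) pairs found by the patterns."""
    found = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((category, match.group()))
    return found

entities = tag_entities("Mr. Bill H. Washburn paid $25.00 at www.example.com today.")
```

The name is recognized from its title and capitalization pattern, even though it appears in no dictionary.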
Stop and Start Lists
The number of distinct terms in a document collection can be massive, which makes for a very high-dimensional analysis matrix. Two facilities that allow interactive control of this dimensionality are stop and start lists.
A stop list contains terms that should be ignored for analysis purposes. Typical entries are very frequently occurring words with little or no discriminating power, such as "the", "a", "and", "or", "because", "in" and "is". From a business perspective, these terms do little to illuminate the content of documents.
Figure 5: A partial list of stopped terms.
A start list is a list of terms that define the vocabulary to be used during text analysis. It is particularly useful in well-defined domains such as medical, legal or specific engineering
disciplines. It allows a domain-specific focus while controlling prohibitively large numbers of terms in a document.
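The two kinds of list can be sketched as simple set filters. The function name `filter_terms` and the sample lists are assumptions for illustration only.

```python
STOP_LIST = {"the", "a", "and", "or", "because", "in", "is"}

def filter_terms(terms, stop_list=None, start_list=None):
    """Keep terms on the start list (if given) and drop terms on the stop list (if given)."""
    kept = []
    for term in terms:
        if start_list is not None and term not in start_list:
            continue  # a start list defines the only vocabulary allowed
        if stop_list is not None and term in stop_list:
            continue  # a stop list removes low-information terms
        kept.append(term)
    return kept

terms = ["the", "patient", "is", "in", "recovery", "and", "stable"]
without_stops = filter_terms(terms, stop_list=STOP_LIST)
medical_only = filter_terms(terms, start_list={"patient", "recovery", "dosage"})
```

A stop list shrinks the vocabulary by exclusion, while a start list fixes it by inclusion, which is why the latter suits well-defined domains.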
Figure 6: Setting stop/start lists
Output
The result of parsing the documents is a term-by-document matrix of term frequency. The sheer size of this sparse matrix can make it too large to mine effectively. Text Miner methods for dimension reduction include Singular Value Decomposition (SVD) and Rollup Terms. These
methods reduce the size of this matrix while retaining the key information.
More specifically, the output of the parsing process is a list of terms, processed as described
above and displayed with various defining features. These include their syntactic and semantic (if applicable) categories, their frequencies and other information. These terms can then be searched, sorted and modified interactively as discussed in the interactive results viewer section below.
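A term-by-document frequency matrix of the kind described here can be sketched in a few lines. The function name `term_document_matrix` and the sample documents are hypothetical, and whitespace splitting stands in for the full parsing steps described above.

```python
from collections import Counter

def term_document_matrix(documents):
    """Return (sorted vocabulary, matrix) with one row per term and one column per document."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # Counter returns 0 for absent terms, so most entries are zero (a sparse matrix).
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix

docs = ["late delivery damaged box", "damaged box refund", "great product great price"]
vocabulary, matrix = term_document_matrix(docs)
```

Even in this tiny example most entries are zero, which hints at why the matrix for a real collection is both very large and very sparse.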
Figure 7: A partial list of parsed terms containing noun groups, stemming, synonyms and more.
Dimension Reduction
In order to represent the meaning of text, it is important to analyze the relationship that exists
between the isolated terms by distilling the key concepts contained in the documents. Text Miner provides two approaches to generate such concepts: Singular Value Decomposition (SVD) and "rolled-up terms."
Singular Value Decomposition (SVD) is a proven mathematical method that accomplishes concept (theme) creation by projecting each document into a reduced dimensional space. The
closer together two documents are in the reduced space, the more similar their content is; the farther apart they are, the more dissimilar it is. The definition of the
concepts is determined in part by mathematical considerations and in part by domain knowledge. These concepts are derived from the text collection as a whole, not from an analysis of a single
document itself.
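The projection into a reduced space can be sketched without any numerical library. The example below approximates the leading left singular vectors of a small term-by-document matrix by power iteration with deflation, then projects each document onto them. Power iteration is a stand-in chosen for brevity and is not necessarily the method Text Miner uses. The function names and the sample matrix are assumptions, and the sketch presumes the matrix has at least k nonzero, distinct singular values.

```python
import math

def top_left_singular_vectors(matrix, k, iterations=200):
    """Approximate the top-k left singular vectors by power iteration with deflation."""
    rows, cols = len(matrix), len(matrix[0])
    vectors = []
    for _ in range(k):
        u = [1.0 + 0.1 * i for i in range(rows)]  # fixed starting vector
        for _ in range(iterations):
            # Remove components along vectors already found (deflation).
            for prev in vectors:
                dot = sum(a * b for a, b in zip(u, prev))
                u = [a - dot * b for a, b in zip(u, prev)]
            # One step of u <- A (A^T u), then normalize to unit length.
            at_u = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
            u = [sum(matrix[i][j] * at_u[j] for j in range(cols)) for i in range(rows)]
            norm = math.sqrt(sum(a * a for a in u))
            u = [a / norm for a in u]
        vectors.append(u)
    return vectors

def project_documents(matrix, vectors):
    """Coordinates of each document (matrix column) in the reduced space."""
    rows, cols = len(matrix), len(matrix[0])
    return [[sum(vec[i] * matrix[i][j] for i in range(rows)) for vec in vectors]
            for j in range(cols)]

# Rows are terms, columns are documents (hypothetical counts).
A = [
    [2, 1, 0, 0],  # "box"
    [1, 2, 0, 0],  # "damaged"
    [0, 0, 1, 1],  # "price"
    [0, 0, 1, 1],  # "great"
]
concepts = top_left_singular_vectors(A, k=2)
coords = project_documents(A, concepts)
```

In this example the first two documents use different term counts yet land on the same point in the two-dimensional concept space, while the last two documents land on a different point. That is the sense in which closeness in the reduced space reflects similarity of content.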
Term Weightings
In addition to building similarities between words, terms are weighted to favor those terms that discriminate well between documents or document categories. Such categories could be different text topics or text sources. In situations where this information is known beforehand, it
could be coded in a separate variable and made available to the weighting algorithm. Text Miner offers a very wide variety of weighting schemes, most of which will favor terms that occur frequently in some documents or document categories but not very
frequently everywhere.
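One widely known weighting of this kind multiplies a log-scaled term frequency by an inverse document frequency. The sketch below uses that scheme purely as an example. The paper does not say which schemes Text Miner offers, so the formula and the name `tf_idf` are assumptions.

```python
import math

def tf_idf(matrix):
    """Weight a term-by-document count matrix: log-scaled frequency times inverse document frequency."""
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        doc_freq = sum(1 for count in row if count > 0)
        # Terms found in every document get idf = log(1) = 0.
        idf = math.log(n_docs / doc_freq) if doc_freq else 0.0
        weighted.append([math.log(1 + count) * idf for count in row])
    return weighted

counts = [
    [3, 2, 4],  # a term that appears in every document
    [5, 0, 0],  # a term concentrated in one document
]
weights = tf_idf(counts)
```

The term that occurs everywhere receives zero weight in all documents, whereas the term concentrated in a single document receives a positive weight there.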
Figure 8: Two dimension reduction methods allow for flexibility in analysis of a wide range of data sources.
Term Concepts
Each document can now be characterized by the most important terms in the entire text collection
with the highest weighted terms expressing the central concepts. As with SVD concepts, these are characterized by a descriptive list of words, each with an importance measure or weight.
Term weighting provides interpretability, which is very useful for exploratory purposes. This approach carries the same advantages and disadvantages as a keyword approach and is often used as a starting point for further text processing activities, rather than an end result. Interactive
explorations can best determine which of these terms are most useful for the tasks at hand, which to eliminate and which are so close in meaning as to be equivalent to one another. Using these insights, the list of terms can be modified and tailored to the task at hand.
Mining the Text
Once the documents have been represented as points in a multidimensional space, they are suitable for the typical goals of analysis: clustering and/or classification.
Clustering
One of the central activities of text processing consists of finding and analyzing the content of a document collection. Imagine being faced with a collection of several thousand customer letters. One would like to partition the letters into several piles with each pile representing the main issue
addressed, thus allowing the allocation of appropriate priorities and resources to the task of processing the letters. It will also provide the first insight into many of the specific concepts mentioned in the letters.
One way to facilitate this is by clustering the documents' numerical representation as points in space. Clusters of closely related points represent the similarity of document themes.
Text Miner provides two clustering techniques geared specifically toward the analysis of text-based data. One method uses a hierarchical clustering algorithm where each document is placed
in a specific subtree. The other method makes use of 'fuzzy clustering' where each document has a probability of membership in each cluster.
As previously discussed, clustering is achieved on the basis of presence and absence of the extracted central concepts in the text. Ideally, each cluster contains a different set of these concepts.
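The notion of a membership probability in each cluster can be sketched as follows. Given cluster centers, each document point receives probabilities proportional to exp(-d²/(2s²)), where d is its distance to a center and s is a scale parameter. This corresponds to the assignment step of an equal-weight spherical Gaussian mixture. It is an illustrative assumption, not a description of Text Miner's algorithm, and the function name and sample centers are hypothetical.

```python
import math

def membership_probabilities(point, centroids, scale=1.0):
    """Probability that a point belongs to each cluster, from distances to the centroids."""
    scores = []
    for centroid in centroids:
        sq_dist = sum((p - c) ** 2 for p, c in zip(point, centroid))
        scores.append(math.exp(-sq_dist / (2 * scale ** 2)))
    total = sum(scores)
    # Normalize so the memberships sum to one.
    return [score / total for score in scores]

centroids = [(0.0, 0.0), (4.0, 0.0)]
probs = membership_probabilities((1.0, 0.0), centroids)
halfway = membership_probabilities((2.0, 0.0), centroids)
```

A point near the first center belongs mostly to the first cluster, while a point midway between the two centers is shared equally. A hard clustering would have to force that point into one cluster or the other.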
Figure 9: The Text Miner Cluster tab allows for two text mining-specific clustering algorithms.
Interpretation of Clusters
Clusters are characterized by the terms that best describe them and thereby allow the user easy interpretation of their content. Because the details of a cluster analysis (e.g., the number of clusters expected and the size of useful clusters) are very much task specific, such analysis is inherently interactive.
The next section shows how Text Miner supports the discovery of the most pertinent clusters.
Sometimes it is useful to view clusters in a hierarchical fashion. For example, when analyzing a
set of documents for topical content, one might first look for the two main topics, then for each cluster "drill down" for subtopics as warranted by the application.
Cluster interpretation is a first step toward gaining a general document overview. More detailed interpretations are possible when integrating the full suite of data mining components available in SAS Enterprise Miner. It then becomes possible, for example, to extract sequences of words (i.e., common phrases) from the text, or to profile these clusters using structured data such as demographic and geographic information. Phrases can be enriched with some or all of the stop words that were originally excluded in order to capture their full meaning.
Classification
Another major text mining activity is the classification of text into predefined categories. Imagine working in a help desk environment that receives many requests and problems to act on. In the past you have stored such requests together in your database with the diagnosed problem
category, and perhaps total costs and time required to solve each problem. These database texts can be mined to reveal the key concepts that differentiate between the problem categories. You can then predict which category incoming requests fall into and route the problem accordingly
while also predicting expected time and cost to solve that problem (should that information be available). This allows for prioritization of resource allocations. Such a project can be easily implemented using SAS Text Miner and SAS Enterprise Miner.
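The help desk scenario can be sketched with a very simple classifier: build one term profile per problem category from historical requests, then route a new request to the category whose profile it most resembles. The cosine measure, the function names and the sample requests are all assumptions for illustration. In practice, the predictive modeling tools of Enterprise Miner would be applied to the reduced text representation instead.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for one request."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train(labeled_requests):
    """Sum the term counts of all historical requests in each category."""
    profiles = {}
    for text, category in labeled_requests:
        profiles.setdefault(category, Counter()).update(vectorize(text))
    return profiles

def classify(text, profiles):
    """Return the category whose profile is most similar to the request."""
    vec = vectorize(text)
    return max(profiles, key=lambda category: cosine(vec, profiles[category]))

history = [
    ("printer will not print", "hardware"),
    ("printer paper jam", "hardware"),
    ("cannot reset password", "account"),
    ("password expired login fails", "account"),
]
profiles = train(history)
category = classify("password reset fails", profiles)
```

If the historical records also carried cost and time fields, the same category assignment could be used to look up typical values for a new request.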
Even without preclassified data, a robust solution can be built. Imagine obtaining a document collection and interactively clustering and interpreting that collection iteratively until a satisfactory
level of cluster content and quality has been achieved. These clusters can now assume the role of the historical collection above. Key concepts separating the clusters can then be applied to new documents. In either case, the knowledge contained in the documents will be organized into
appropriate categories.
These categories can, in turn, be tailored to the needs of the users. A very general category
would be "interesting vs. not-interesting." This would effectively allow a filtering of information from retrieval systems, news wires or Internet sources, without having to specify detailed keyword logic. Other applications include categorization of results from survey analysis, medical diagnostic
comments, sales descriptions, and account or competitive information.
Interactive Results Viewer
Text Mining is not a "push-button" technology but instead requires the user to interpret results and make adaptive changes. These changes may affect:
• Vocabulary. Typically only a small subset of the tens of thousands of words is needed for the task at hand. Many terms can be eliminated not just on the basis of a stop list but also on the basis of their syntactic roles (determiners, auxiliaries, conjunctions and prepositions), low overall weight, frequency count, or semantic category (e.g., one may want to eliminate dates or addresses). Another goal that requires interactive intervention is the elimination or retention of certain terms or term categories.
• The selection of certain clusters, word groups, and texts, as well as the exploration and analysis of this subset.
• The ability to locate documents or clusters similar to one or more under consideration. Identifying similar terms as candidates for synonym lists is another highly interactive activity.
Figure 10: The Text Miner results window provides numerous interactive training features to tailor the analysis.
Performing these activities may be done in cycles — exploring and then focusing on another task.
The effects of using a different vocabulary may need to be assessed and could lead to a re-computation of the central concepts and clusters, all from a flexible interactive environment. These activities may also involve exploring the use of different types of concepts, different
numbers of concepts, different clustering approaches, or different lexical options in the first place. Ultimately, a suitable list of terms and, if appropriate, a satisfactory decomposition of the text collection, will result in clusters with well-defined concepts that serve as the basis for further
processing. This may involve other operations drawing on the suite of SAS data mining tools or a simple dissemination of findings.
Searching: Similarity-based Filtering with Keywords and Clusters
While Text Miner does not contain an explicit search engine, the interactive environment allows for searching documents. This can be achieved by retrieving the documents that are most similar
to the selected document or by selecting and filtering on specific terms. Rather than being geared toward a retrieval environment, this functionality is seen as one of support for the general text mining activities already discussed for gaining better insights and building better applications.
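Retrieving the documents most similar to a selected one can be sketched as a ranking by cosine similarity between document vectors. The vectors below are hypothetical coordinates in a two-dimensional reduced space, and both the similarity measure and the function names are assumptions rather than Text Miner internals.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(selected, vectors, top_n=2):
    """Indices of the documents most similar to the selected one, best first."""
    others = [i for i in range(len(vectors)) if i != selected]
    others.sort(key=lambda i: cosine_similarity(vectors[selected], vectors[i]), reverse=True)
    return others[:top_n]

# One (hypothetical) concept-space vector per document.
vectors = [(2.0, 0.1), (1.8, 0.3), (0.1, 1.5), (0.9, 1.0)]
neighbors = most_similar(0, vectors)
```

Here document 1 ranks first because it points in nearly the same direction as document 0, and document 3 ranks second.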
How SAS Text Miner Differs from Its Competition
Text Miner is designed to efficiently perform text mining on large collections of textual materials.
The goals of text mining, as with mining activities in general, are to obtain actionable information that can be used for business benefits (e.g., increased productivity or competitive advantage). The unique differentiator of Text Miner is its seamless integration with the SAS Enterprise Miner
data mining environment. This integration provides the full range of mining tools such as
prediction, classification and clustering, as well as the full range of data access and data preprocessing tools.
Some of the classical document management systems can also perform text classification; however, the goals of text mining are much more ambitious. Most document management
systems perceive texts as passive entities that need to be indexed, filed and efficiently retrieved when needed. The more ambitious among these categories of systems attempt to automate this process by exploiting document meta-information and/or neural networks that can learn
classifications based on collections of known training cases. Text Miner goes beyond these limited goals to build models that can be actively used for business benefits. For example, in a help desk environment, models can be generated in order to predict the problem category of a
new complaint. These models can be generated using a variety of different prediction tools, all in a statistically validated process. The impacts of costs and profits of the different models can be compared using various assessment charts. The results can then be applied to new reports. The
various sophisticated routines and algorithms provided in Text Miner ensure reliable support for the process of building such models.
The analysis can often be supplemented with nontextual data. For example, there may be data on previously processed call reports with captured information on costs, time to completion and resources needed. This might be supplemented by customer information including customer
value. This additional data can then be used to better define priorities, help allocate resources and pinpoint quality problems. Analysis extends far beyond filing and retrieving documents, moving toward a more cost-effective operation and improved customer relationship management.
There are also specific tools available in Enterprise Miner that may be of considerable help in text mining. Some examples are:
• The Sampling node, which enables the user to extract a proportion of the texts to be used for training purposes. When used for training classification, the sampling technique ensures that sufficient samples are included for all classes in order to achieve the best possible results.
• The Filter Outliers node, which can help manage the vocabulary that will be used for a task by including or excluding terms based on frequency information. This can usefully supplement a stop list and help manage an otherwise almost unmanageable set of tens of thousands of words.
• The Insight node, which can display the relationship between words and document classes and clusters.
• The Assessment node, which can measure and compare the quality of different models under consideration.
• The Score node, which captures the complete scoring code for scoring new documents.
Many other more technical tools are available and can increase productivity, including tools for building transformations, computing associations and sequences of terms, and linking of terms and documents through link analysis. These facilities go far beyond document management
systems and systems that use neural networks or similar techniques to learn classifications.
Conclusion
SAS Text Miner is a powerful tool for document classification and clustering. With the integration of Text Miner into the SAS Enterprise Miner solution for data mining, SAS is the first software vendor to offer a complete data mining solution for analyzing both unstructured and structured
data. By exploring and modeling large amounts of structured data, as well as text-based data, you can uncover hidden relationships and patterns of information, enhancing your ability to make accurate, on-target predictions.
SAS is the market leader in providing a new generation of business intelligence software and services that create true enterprise intelligence. SAS solutions are used at more than 38,000
sites, including 99 of the top 100 businesses on the Fortune 500, to develop more profitable relationships with customers and suppliers; to enable better, more accurate and informed decisions; and to drive organizations forward. SAS is the only vendor that completely integrates
leading data warehousing, analytics and traditional BI applications to create intelligence from massive amounts of data. For 25 years, SAS has been giving customers around the world The
Power to Know™. Visit us at www.sas.com.
SAS Text Miner contains:
LinguistX® from Inxight Software, Inc. Copyright © 1996-2002. All rights reserved. www.inxight.com.
Thing Finder™ Server from Inxight Software, Inc. Copyright © 1996-2002. All rights reserved. www.inxight.com.
World Headquarters and SAS Americas
SAS Campus Drive
Cary, NC 27513 USA
Tel: (919) 677 8000
Fax: (919) 677 4444
U.S. & Canada sales: (800) 727 0025

SAS International
PO Box 10 53 40
Neuenheimer Landstr. 28-30
D-69043 Heidelberg, Germany
Tel: (49) 6221 4160
Fax: (49) 6221 474850
www.sas.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. Copyright © 2002, SAS Institute Inc. All rights reserved. 51410US.0602