
A SAS White Paper

SAS Text Miner

Distilling Textual Data for Competitive Business Advantage

Table of Contents

Introduction

What Is Text Mining?
  Text Mining Applications
  Automatic Classification of Documents
  Clustering Large Document Collections
  Examples of Integrated Text and Data Mining Analysis

SAS Text Miner

SAS Text Miner Functions
  Language Support
  Universal Data Access
  Term Identification
  Word Stemming and Synonyms
  Feature Extraction
  Stop and Start Lists
  Output
  Dimension Reduction
  Term Weightings
  Term Concepts
  Mining the Text
  Clustering
  Interpretation of Clusters
  Classification
  Interactive Results Viewer
  Searching: Similarity-based Filtering with Keywords and Clusters

How SAS Text Miner Differs from Its Competition

Conclusion

Primary content providers for this paper were Manya Mayes, Bernd Drewes and Wayne Thompson.


Introduction

Analysis of customer information has traditionally involved structured data sources, namely quantitative and qualitative data that can be arranged in tables or spreadsheets. While information such as customer e-mail, phone transcriptions and free-form survey text is widely collected, analyzing these textual fields (independently or in concert with structured information) has not been a simple task. Manually digesting large volumes of e-mail messages or, more generally, text documents of any kind is a time-consuming process that cannot take advantage of any structure contained in the data. As a result, much of this valuable information has remained unused. Harvesting the information found in large volumes of unstructured text data requires text mining.

SAS Text Miner provides an easy-to-use interface that enables you to quickly determine key information contained in document collections.

This paper defines text mining, discusses text mining applications and describes how best to leverage your business information by combining text mining and traditional data mining techniques.

What Is Text Mining?

SAS defines text mining as the process of investigating a large collection of free-form documents in order to discover and use the knowledge that exists in the collection as a whole.

Unlike natural language processing (NLP) and knowledge extraction, text mining involves pattern searching across the entire document collection. NLP and knowledge extraction focus on summarizing individual documents within the collection, which means each document is processed independently of the others. However, NLP and knowledge extraction can be important steps in the text mining process by adding value to the preprocessing of raw text in order to turn unstructured data into structured information suitable for enhanced analysis.

SAS text mining extends traditional text analysis algorithms by making text data available to data mining algorithms, thus enabling text documents to be used for predictive modeling and segmentation.

Text Mining Applications

Imagine that in your corporate warehouse you have verbatim text from your customer call center. Typically, this text is used rarely, if at all. Each day, literally hundreds of customers relay their comments about your products. Some of these calls are complimentary, but more likely, most of these calls are related to complaints. In either case, being able to quickly ascertain what makes your customers pleased or unhappy may mean the difference between the success and failure of a product and, perhaps more importantly, may enhance or tarnish the image of your company.

SAS Text Miner enables you to detect new problems with product packaging, issues with product distribution and possible health risks associated with product formulation. Finding this information and acting on it quickly is key to keeping your customers happy.

In this information age, a broad audience can benefit from Text Miner: marketing representatives, middle management, customer service representatives, help desk support specialists, academic and medical researchers, product managers, attorneys, Web developers, human resource representatives and anyone else who must look through large volumes of text to extract information, ideas and trends. Typical applications of SAS Text Miner include:

Automatic Classification of Documents

• Offline e-mail filtering and routing.

• News item routing.

• Help desk/call center inquiries.

• Generation of classification codes from descriptions.

Clustering Large Document Collections

• Free-form survey data.

• Customer complaints/comments.

• Scientific or legal databases.

• Document warehouse management.

• Warranty claims.

• Competitive reports.

• Patent claims.



Examples of Integrated Text and Data Mining Analysis

• Predicting customer satisfaction from customer complaints/comments.

• Capturing business announcements with stock metrics to predict stock market price fluctuations.

• Incorporating patient notes with patient measurements to better predict drug efficacy.

• Predicting trends in product defects from call center logs.

• Capturing customer comments with demographic and purchasing behavior to facilitate cross-selling and up-selling.

• Enhancing other data mining activities by including previously unused free-form text.

SAS Text Miner

SAS Text Miner uncovers the themes and concepts contained in a document collection. It provides facilities to read different document formats, extract key features, generate essential concepts (data reduction) and cluster text using specialized techniques, all within an interactive environment that optimizes the interpretation of a text collection.

Text Miner operates in the same integrated environment as SAS Enterprise Miner to jointly evaluate both structured and unstructured data elements to answer specific business questions.

Figure 1: Text Miner runs within the intuitive point-and-click process flow environment of Enterprise Miner to seamlessly incorporate textual data into the mining process.



For example, customer comments can be supplemented with customer purchase and campaign response information in order to predict whether customer complaints have an impact on customer loyalty. In the case where only the textual data is available, Text Miner can be used to uncover themes and concepts in the document collection, as illustrated in Figures 2 and 3. A more detailed discussion of these concepts is provided in the remainder of the paper.

Figure 2: Sample Medline abstract data for creating more refined document taxonomies.

Figure 2 highlights the contents of several abstracts extracted from the Medline database. Text Miner is used to create a document taxonomy as well as subject-based clusters among the retrieved items. Each cluster is profiled by the terms that best discriminate it from all other clusters. Interactive exploration of the documents contained in each cluster allows for the discovery of any items of interest. Enterprise Miner can further analyze the clusters and documents through predictive modeling or enhanced profiling using other attributes. The results of the analysis are described in the section titled "Interpretation of Clusters."



Figure 3: A partial display of Medline document taxonomies.

Figure 3 highlights several key themes contained within the Medline query, including a cluster that captures documents related to decision support in health care and a cluster that contains documents related to neural network techniques. Note that the 13 blank documents (documents that had titles but no abstracts) are automatically clustered together.

SAS Text Miner Functions

Language Support

Text Miner supports the core analysis of space-delimited documents. More sophisticated text preprocessing is provided for English, French and German documents (or a combination of these).

Universal Data Access

Text Miner can read documents from a variety of sources, including ASCII, PDF, HTML, Excel, Lotus and PowerPoint. All such documents can be easily imported into a single SAS data set for text mining purposes. This data may be enriched using the SAS System to integrate documents and quantitative data from a wide variety of disparate but complementary sources. Once the desired SAS data set has been created, it becomes the input to SAS Text Miner.



Term Identification

For analysis purposes, terms may be defined by single words, word groups (e.g., balance of payments, Empire State Building), entities (such as name, address, etc.), word context (using part-of-speech tagging), numbers or punctuation. Frequency and weight information for each term is valuable in determining that term's contribution to the document collection. Removal of low-information terms helps reduce the dimensionality of the data.
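The idea of treating word groups as single terms and counting term frequencies can be sketched in a few lines of Python. This is an illustration only, not the Text Miner parser: the `WORD_GROUPS` list, the function names and the sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical word groups to keep together (an assumption, not a Text Miner list).
WORD_GROUPS = ["balance of payments", "empire state building"]

def identify_terms(text, word_groups=WORD_GROUPS):
    """Split text into lowercase terms, keeping known word groups as single terms."""
    lowered = text.lower()
    for group in word_groups:
        # Join the words of a group with underscores so the tokenizer keeps them together.
        lowered = lowered.replace(group, group.replace(" ", "_"))
    return re.findall(r"[a-z0-9_]+", lowered)

def term_frequencies(documents):
    """Count how often each term occurs across the whole collection."""
    counts = Counter()
    for doc in documents:
        counts.update(identify_terms(doc))
    return counts

docs = [
    "The balance of payments improved.",
    "Visitors love the Empire State Building.",
    "The balance of payments report is late.",
]
freqs = term_frequencies(docs)
```

Here `freqs` maps each term to its collection-wide count, so very frequent or very rare terms can be inspected before deciding what to drop.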

Word Stemming and Synonyms

Words can be reduced (stemmed) to their root form to simplify analysis. The process of "stemming" is a commonly used text-processing feature. The words banks, banking and banked would be treated as bank. By using a synonym dictionary, words such as purse, pocketbook and handbag could be treated as one term. Abbreviations such as oz, lbs and $ are also treated as synonyms for the words they represent (ounce, pounds, dollars). Users have the flexibility to develop their own domain-specific dictionaries.
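As a rough illustration of stemming combined with a synonym dictionary, consider the sketch below. The suffix-stripping rule is a deliberately crude stand-in for a real linguistic stemmer, and the `SYNONYMS` table and function names are hypothetical; only the example words come from the text above.

```python
# Hypothetical synonym dictionary; a real one would be domain-specific.
SYNONYMS = {
    "pocketbook": "purse",
    "handbag": "purse",
    "oz": "ounce",
    "lbs": "pounds",
    "$": "dollars",
}

def naive_stem(word):
    """Strip a few common English suffixes. A crude stand-in for real stemming."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(word):
    """Map a word to its synonym parent if it has one, otherwise to its stem."""
    word = word.lower()
    if word in SYNONYMS:
        return SYNONYMS[word]
    return naive_stem(word)

normalized = [normalize(w) for w in ["banks", "banking", "banked", "Handbag", "oz"]]
```

Looking up synonyms before stemming lets a domain dictionary override the generic rule, which mirrors the idea of user-maintained synonym lists.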

Figure 3: Text Miner provides the capability of building and refining domain-specific synonym lists.



Figure 4: Stemmed terms are determined during analysis and may be further refined.

Feature Extraction

Much of a term's meaning can be derived from a dictionary and/or linguistic patterns. The term "Mr. Bill H. Washburn" clearly defines the name of a person even though it may not be listed in any standard dictionary. From both dictionary information and linguistic patterns, many terms can be flagged with semantic categories such as location, company name, Internet address, date, currency, measure, title, phone number and names of people. These semantic features can be used selectively to improve document processing by complementing the dictionary with entity categories.

Stop and Start Lists

The number of distinct terms in a document collection can be massive, which makes for an analysis matrix of very high dimension. Two facilities that allow interactive control of this dimensionality are stop and start lists.

A stop list contains terms that should be ignored for analysis purposes. Typical entries are very frequently occurring words with little or no discriminating power, such as "the", "a", "and", "or", "because", "in", "is", etc. From a business perspective, these terms add little value in illuminating the content of documents.



Figure 5: A partial list of stopped terms.

A start list is a list of terms that define the vocabulary to be used during text analysis. It is particularly useful in well-defined domains such as medical, legal or specific engineering disciplines. It allows a domain-specific focus while controlling prohibitively large numbers of terms in a document.
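The effect of the two kinds of list can be sketched as a simple filter. The function name, the sample terms and both lists are assumptions chosen for illustration; this is not how Text Miner stores or applies its lists.

```python
def apply_term_lists(terms, stop_list=None, start_list=None):
    """Keep terms allowed by the start list (if given) and not in the stop list."""
    stop = set(stop_list or [])
    kept = []
    for term in terms:
        if start_list is not None and term not in start_list:
            continue  # a start list defines the only vocabulary allowed
        if term in stop:
            continue  # a stop list removes low-information terms
        kept.append(term)
    return kept

terms = ["the", "patient", "reported", "cardiac", "pain"]

# Stop list: drop very frequent words, keep everything else.
without_stop_words = apply_term_lists(terms, stop_list=["the", "a", "and", "or", "in", "is"])

# Start list: keep only terms from a (hypothetical) medical vocabulary.
medical_only = apply_term_lists(terms, start_list={"patient", "cardiac", "arrest"})
```

A stop list shrinks the vocabulary from the top (frequent words), while a start list fixes the vocabulary in advance; the second is the stricter control.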

Figure 6: Setting stop/start lists



Output

The result of parsing the documents is a term-by-document matrix of term frequency. The sheer size of this sparse matrix can make it too large to mine effectively. Text Miner methods for dimension reduction include Singular Value Decomposition (SVD) and Rollup Terms. These methods reduce the size of this matrix while retaining the key information.
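To make the term-by-document matrix concrete, the following sketch builds one from already-parsed documents, with one row per term and one column per document. The function name and the toy documents are assumptions; a production system would use a sparse representation rather than nested lists.

```python
def term_document_matrix(tokenized_docs):
    """Return (sorted vocabulary, matrix) with one row per term and one column per document."""
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    row_of = {term: i for i, term in enumerate(vocabulary)}
    matrix = [[0] * len(tokenized_docs) for _ in vocabulary]
    for col, doc in enumerate(tokenized_docs):
        for term in doc:
            matrix[row_of[term]][col] += 1  # raw term frequency
    return vocabulary, matrix

docs = [
    ["bank", "loan", "bank"],
    ["loan", "rate"],
    ["river", "bank"],
]
vocabulary, matrix = term_document_matrix(docs)
```

Even in this toy case most cells are zero, which is why the matrix for a real collection is both very large and very sparse.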

More specifically, the output of the parsing process is a list of terms, processed as described above and displayed with various defining features. These include their syntactic and semantic (if applicable) categories, their frequencies and other information. These terms can then be searched, sorted and modified interactively, as discussed in the "Interactive Results Viewer" section below.

Figure 7: A partial list of parsed terms containing noun groups, stemming, synonyms and more.

Dimension Reduction

In order to represent the meaning of text, it is important to analyze the relationships that exist among the isolated terms by distilling the key concepts contained in the documents. Text Miner provides two approaches to generate such concepts: Singular Value Decomposition (SVD) and "rolled-up terms."

Singular Value Decomposition (SVD) is a proven mathematical method that accomplishes concept (theme) creation by projecting each document into a reduced dimensional space. The closer together documents are in the reduced space, the more similar their content; the farther apart they are, the more dissimilar. The definition of the concepts is determined in part by mathematical considerations and in part by domain knowledge. These concepts are derived from the text collection as a whole, not from an analysis of a single document itself.
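For readers who want the underlying formula, the projection can be written in the standard latent semantic analysis form. The notation is an assumption introduced here, not taken from the product: A is the m-by-n term-by-document matrix, d_j is its j-th column (one document), k is the number of retained concepts, and U_k, Sigma_k and V_k hold the k leading singular vectors and values. The paper does not state which exact scaling Text Miner applies.

```latex
A \approx A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad
\hat{d}_j = U_k^{\mathsf{T}} d_j \in \mathbb{R}^{k}, \qquad
\operatorname{sim}(d_i, d_j) =
  \frac{\hat{d}_i^{\mathsf{T}} \hat{d}_j}
       {\lVert \hat{d}_i \rVert \, \lVert \hat{d}_j \rVert}
```

Each document is thus reduced from m term counts to k concept scores, and closeness in the reduced space can be measured, for example, by the cosine similarity shown on the right.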

Term Weightings

In addition to building similarities between words, Text Miner weights terms to favor those that discriminate well between documents or document categories. Such categories could be different text topics or text sources. In situations where this information is known beforehand, it could be coded in a separate variable and made available to the weighting algorithm. Text Miner offers a wide variety of weighting schemes, most of which favor terms that occur frequently in some documents or document categories but not very frequently everywhere.
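One widely known scheme with this behavior is inverse document frequency (IDF) weighting. The sketch below uses it purely as an illustration; it is not claimed to be one of the schemes Text Miner offers, and the function names and the toy matrix are assumptions.

```python
import math

def idf_weights(matrix):
    """Weight each term (row) by log2(N / document frequency) + 1.

    Assumes every term occurs in at least one document.
    """
    n_docs = len(matrix[0])
    weights = []
    for row in matrix:
        doc_freq = sum(1 for count in row if count > 0)
        weights.append(math.log2(n_docs / doc_freq) + 1)
    return weights

def weighted_matrix(matrix):
    """Multiply each raw frequency by the weight of its term."""
    weights = idf_weights(matrix)
    return [[count * w for count in row] for row, w in zip(matrix, weights)]

# Rows are terms, columns are documents.
matrix = [
    [1, 1, 1, 1],  # occurs in every document: low weight
    [3, 0, 0, 0],  # concentrated in one document: high weight
]
weights = idf_weights(matrix)
```

The term that appears everywhere receives the minimum weight of 1, while the term concentrated in a single document is weighted three times as heavily.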

Figure 8: Two dimension reduction methods allow for flexibility in analysis of a wide range of data sources.



Term Concepts

Each document can now be characterized by the most important terms in the entire text collection, with the highest weighted terms expressing the central concepts. As with SVD concepts, these are characterized by a descriptive list of words, each with an importance measure or weight.

Term weighting provides interpretability, which is very useful for exploratory purposes. This approach carries the same advantages and disadvantages as a keyword approach and is often used as a starting point for further text processing activities rather than as an end result. Interactive exploration can best determine which of these terms are most useful for the task at hand, which to eliminate and which are so close in meaning as to be equivalent to one another. Using these insights, the list of terms can be modified and tailored accordingly.

Mining the Text

Once the documents have been represented as points in a multidimensional space, they are suitable for the typical goals of analysis: clustering and/or classification.

Clustering

One of the central activities of text processing consists of finding and analyzing the content of a document collection. Imagine being faced with a collection of several thousand customer letters. One would like to partition the letters into several piles, with each pile representing the main issue addressed, thus allowing the allocation of appropriate priorities and resources to the task of processing the letters. Such a partition also provides a first insight into many of the specific concepts mentioned in the letters.

One way to facilitate this is by clustering the documents' numerical representation as points in space. Clusters of closely related points represent the similarity of document themes.

Text Miner provides two clustering techniques geared specifically toward the analysis of text-based data. One method uses a hierarchical clustering algorithm where each document is placed in a specific subtree. The other method makes use of "fuzzy clustering," where each document has a probability of membership in each cluster.

As previously discussed, clustering is achieved on the basis of presence and absence of the extracted central concepts in the text. Ideally, each cluster contains a different set of these concepts.
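The hierarchical idea can be illustrated with a minimal agglomerative sketch: start with one cluster per document and repeatedly merge the two most similar clusters. The single-linkage rule, the cosine similarity measure, the function names and the sample points are all assumptions made for the example; the paper does not specify the algorithm Text Miner implements.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def agglomerate(points, n_clusters):
    """Merge the two most similar clusters (single linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: similarity of the closest pair of members.
                sim = max(cosine(points[i], points[j])
                          for i in clusters[a] for j in clusters[b])
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Four documents in a two-concept space (hypothetical coordinates).
points = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.8)]
clusters = agglomerate(points, 2)
```

Recording the order of the merges, rather than stopping at a fixed count, would yield the full tree in which each document sits in a specific subtree.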



Figure 9: The Text Miner Cluster tab allows for two text mining-specific clustering algorithms.

Interpretation of Clusters

Clusters are characterized by the terms that best describe them, which allows the user to interpret their content easily. Because the details of a cluster analysis (e.g., the number of clusters expected and the size of useful clusters) are very much task specific, such analysis is inherently interactive.
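One simple way to pick descriptive terms for a cluster is to rank terms by how much more often they occur inside the cluster than outside it. The scoring rule below (difference of mean frequencies) is an assumption chosen for clarity, not the measure Text Miner uses; the function name and data are likewise hypothetical.

```python
def describe_cluster(vocabulary, matrix, members, top_n=2):
    """Rank terms by mean frequency inside the cluster minus mean frequency outside it."""
    n_docs = len(matrix[0])
    outside = [d for d in range(n_docs) if d not in members]
    scores = []
    for term, row in zip(vocabulary, matrix):
        inside_mean = sum(row[d] for d in members) / len(members)
        outside_mean = sum(row[d] for d in outside) / len(outside) if outside else 0.0
        scores.append((inside_mean - outside_mean, term))
    # Highest score first; ties broken alphabetically.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scores[:top_n]]

# Rows are terms, columns are documents 0..3.
vocabulary = ["bank", "loan", "river", "water"]
matrix = [
    [2, 1, 1, 0],  # bank
    [1, 2, 0, 0],  # loan
    [0, 0, 2, 1],  # river
    [0, 0, 1, 2],  # water
]
finance_terms = describe_cluster(vocabulary, matrix, members=[0, 1])
nature_terms = describe_cluster(vocabulary, matrix, members=[2, 3])
```

Note that "bank" ranks below "loan" for the first cluster because it also appears in the second; terms shared across clusters describe them less well.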

The next section shows how Text Miner supports the discovery of the most pertinent clusters.

Sometimes it is useful to view clusters in a hierarchical fashion. For example, when analyzing a set of documents for topical content, one might first look for the two main topics, then for each cluster "drill down" for subtopics as warranted by the application.

Cluster interpretation is a first step toward gaining a general document overview. More detailed interpretations are possible when integrating the full suite of data mining components available in SAS Enterprise Miner. It then becomes possible, for example, to extract sequences of words (i.e., common phrases from the text) or to profile these clusters using structured data such as demographic and geographic information. Phrases can be enriched with some or all of the stop words that were originally excluded in order to capture their full meaning.



Classification

Another major text mining activity is the classification of text into predefined categories. Imagine working in a help desk environment that receives many requests and problems to act on. In the past, you have stored such requests in your database together with the diagnosed problem category, and perhaps the total cost and time required to solve each problem. These database texts can be mined to reveal the key concepts that differentiate between the problem categories. You can then predict which category an incoming request falls into and route the problem accordingly, while also predicting the expected time and cost to solve that problem (should that information be available). This allows for prioritization of resource allocations. Such a project can be easily implemented using SAS Text Miner and SAS Enterprise Miner.
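The help desk scenario can be sketched with a nearest-centroid classifier: average the vectors of historical requests in each category, then route a new request to the category whose average it most resembles. This is one simple technique chosen for illustration, not the modeling method of Text Miner or Enterprise Miner; the category names, concept dimensions and function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def train(labeled_vectors):
    """Build one centroid per category from {category: [document vectors]}."""
    return {label: centroid(vecs) for label, vecs in labeled_vectors.items()}

def classify(model, vector):
    """Return the category whose centroid is most similar to the new document."""
    return max(model, key=lambda label: cosine(model[label], vector))

# Hypothetical concept dimensions: (password, printer, network).
history = {
    "account": [[3, 0, 0], [2, 0, 1]],
    "hardware": [[0, 3, 0], [0, 2, 1]],
}
model = train(history)
predicted = classify(model, [1, 0, 0])
```

The same stored history could also carry cost and time per request, in which case averages per predicted category would give a first estimate of the effort required.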

Even without preclassified data, a robust solution can be built. Imagine obtaining a document collection and interactively clustering and interpreting that collection iteratively until a satisfactory level of cluster content and quality has been achieved. These clusters can now assume the role of the historical collection above. Key concepts separating the clusters can then be applied to new documents. In either case, the knowledge contained in the documents will be organized into appropriate categories.

These categories can, in turn, be tailored to the needs of the users. A very general category would be "interesting vs. not-interesting." This would effectively allow a filtering of information from retrieval systems, news wires or Internet sources, without having to specify detailed keyword logic. Other applications include categorization of results from survey analysis, medical diagnostic comments, sales descriptions, and account or competitive information.

Interactive Results Viewer

Text mining is not a "push-button" technology but instead requires the user to interpret results and make adaptive changes. These changes may affect:

• Vocabulary. Typically, only a small subset of the tens of thousands of words is needed for the task at hand. Many terms can be eliminated not just on the basis of a stop list but also on the basis of their syntactic roles (determiners, auxiliaries, conjunctions and prepositions), low overall weight, frequency count or semantic category (e.g., one may want to eliminate dates or addresses). Another goal that requires interactive intervention is the elimination or retention of certain terms or term categories.

• The selection of certain clusters, word groups, and texts, as well as the exploration and analysis of this subset.

• The ability to locate documents or clusters similar to one or more under consideration. Defining similar terms as candidates for the creation of synonym lists is another highly interactive activity.



Figure 10: The Text Miner results window provides numerous interactive training features to tailor the analysis.

These activities may be performed in cycles: exploring, then focusing on another task. The effects of using a different vocabulary may need to be assessed and could lead to a re-computation of the central concepts and clusters, all from a flexible interactive environment. These activities may also involve exploring the use of different types of concepts, different numbers of concepts, different clustering approaches or different lexical options in the first place. Ultimately, a suitable list of terms and, if appropriate, a satisfactory decomposition of the text collection will result in clusters with well-defined concepts that serve as the basis for further processing. This may involve other operations drawing on the suite of SAS data mining tools or a simple dissemination of findings.

Searching: Similarity-based Filtering with Keywords and Clusters

While Text Miner does not contain an explicit search engine, the interactive environment allows for searching documents. This can be achieved by retrieving the documents that are most similar to the selected document or by selecting and filtering on specific terms. Rather than being geared toward a retrieval environment, this functionality is seen as one of support for the general text mining activities already discussed for gaining better insights and building better applications.
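Retrieving the documents most similar to a selected one can be sketched as a ranking by cosine similarity over document vectors. The similarity measure, the function names and the toy vectors are assumptions for illustration; the paper does not describe how Text Miner computes similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(doc_vectors, selected, top_n=2):
    """Rank all other documents by cosine similarity to the selected one."""
    scores = [
        (cosine(doc_vectors[selected], vec), idx)
        for idx, vec in enumerate(doc_vectors)
        if idx != selected
    ]
    # Most similar first; ties broken by document index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scores[:top_n]]

# One vector per document (hypothetical concept scores).
doc_vectors = [
    [2, 1, 0],
    [4, 2, 0],
    [0, 1, 3],
    [1, 1, 1],
]
neighbors = most_similar(doc_vectors, selected=0)
```

Document 1 ranks first because it points in the same direction as document 0 even though it is twice as long, which is the usual reason for preferring cosine similarity over raw distance with text.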

How SAS Text Miner Differs from Its Competition

Text Miner is designed to efficiently perform text mining on large collections of textual materials. The goals of text mining, as with mining activities in general, are to obtain actionable information that can be used for business benefits (e.g., increased productivity or competitive advantage). The unique differentiator of Text Miner is its seamless integration with the SAS Enterprise Miner data mining environment. This integration provides the full range of mining tools such as prediction, classification and clustering, as well as the full range of data access and data preprocessing tools.

Some of the classical document management systems can also perform text classification. However, the goals of text mining are much more ambitious. Most document management systems perceive texts as passive entities that need to be indexed, filed and efficiently retrieved when needed. The more ambitious among these systems attempt to automate this process by exploiting document meta-information and/or neural networks that can learn classifications based on collections of known training cases. Text Miner goes beyond these limited goals to build models that can be actively used for business benefits. For example, in a help desk environment, models can be generated in order to predict the problem category of a new complaint. These models can be generated using a variety of different prediction tools, all in a statistically validated process. The impacts of costs and profits of the different models can be compared using various assessment charts. The results can then be applied to new reports. The various sophisticated routines and algorithms provided in Text Miner ensure reliable support for the process of building such models.

The analysis can often be supplemented with nontextual data. For example, there may be data on previously processed call reports with captured information on costs, time to completion and resources needed. This might be supplemented by customer information including customer value. This additional data can then be used to better define priorities, help allocate resources and pinpoint quality problems. Analysis extends far beyond filing and retrieving documents, moving toward a more cost-effective operation and improved customer relationship management.

There are also specific tools available in Enterprise Miner that may be of considerable help in text mining. Some examples are:

• The Sampling node, which enables the user to extract a proportion of the texts to be used for training purposes. When used for training classification, the sampling technique ensures that sufficient samples are included for all classes in order to achieve the best possible results.

• The Filter Outliers node, which can help manage the vocabulary that will be used for a task by including or excluding terms based on frequency information. This can usefully supplement a stop list and help manage an otherwise almost unmanageable set of tens of thousands of words.

• The Insight node, which can display the relationship between words and document classes and clusters.

• The Assessment node, which can measure and compare the quality of different models under consideration.

• The Score node, which captures the complete scoring code for scoring new documents.

Many other, more technical tools are available and can increase productivity, including tools for building transformations, computing associations and sequences of terms, and linking terms and documents through link analysis. These facilities go far beyond document management systems and systems that use neural networks or similar methods to learn classifications.



Conclusion

SAS Text Miner is a powerful tool for document classification and clustering. With the integration of Text Miner into the SAS Enterprise Miner solution for data mining, SAS is the first software vendor to offer a complete data mining solution for analyzing both unstructured and structured data. By exploring and modeling large amounts of structured data, as well as text-based data, you can uncover hidden relationships and patterns of information, enhancing your ability to make accurate, on-target predictions.

SAS is the market leader in providing a new generation of business intelligence software and services that create true enterprise intelligence. SAS solutions are used at more than 38,000 sites, including 99 of the top 100 businesses on the Fortune 500, to develop more profitable relationships with customers and suppliers; to enable better, more accurate and informed decisions; and to drive organizations forward. SAS is the only vendor that completely integrates leading data warehousing, analytics and traditional BI applications to create intelligence from massive amounts of data. For 25 years, SAS has been giving customers around the world The Power to Know™. Visit us at www.sas.com.



SAS Text Miner contains:

LinguistX® from Inxight Software, Inc. Copyright © 1996-2002. All rights reserved. www.inxight.com.

Thing Finder™ Server from Inxight Software, Inc. Copyright © 1996-2002. All rights reserved. www.inxight.com.

World Headquarters and SAS Americas
SAS Campus Drive
Cary, NC 27513 USA
Tel: (919) 677 8000
Fax: (919) 677 4444
U.S. & Canada sales: (800) 727 0025

SAS International
PO Box 10 53 40
Neuenheimer Landstr. 28-30
D-69043 Heidelberg, Germany
Tel: (49) 6221 4160
Fax: (49) 6221 474850

www.sas.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. Copyright © 2002, SAS Institute Inc. All rights reserved. 51410US.0602