evaluating text analytics 9.2012
TRANSCRIPT
7/28/2019 Evaluating Text Analytics 9.2012
WHITE PAPER
Finding the Right Fit: How to Evaluate Text Analytics Software
SAS White Paper
Table of Contents
Introduction
How Does Text Analytics Work?
Text Analytics Applications
Deciding on Text Analytics Software - The Process
Self-Knowledge Phase
Filter Phase
Proof-of-Concept Phase
The POC Process for Text Analytics Technology: Key Issues
Stage One - Preparation
Stage Two - Development
Key considerations for categorization
Developing extraction catalogs
Stage Three - Results: Balancing Recall and Precision
Key considerations for measuring results
Stage Four - Report
A Simple Solution?
Conclusion: Getting the Most from Your Investment
Tom Reamy is the Chief Knowledge Architect and founder of KAPS Group, a group of knowledge
architecture, taxonomy and text analytics consultants.
Reamy has 20 years of experience in information architecture, enterprise search, intranet management
and consulting, education software and text analytics consulting. His academic background includes
a master's in the history of ideas, research in artificial intelligence and cognitive science, and a strong
focus in philosophy, particularly epistemology. He has published articles in various journals and is a
frequent speaker at knowledge management conferences.
When not writing or developing knowledge management projects, Reamy can usually be found at the
bottom of the ocean in Carmel, CA, taking photos of strange creatures.
Introduction
Although it is a fast-growing area, text analytics is still new to most organizations. Many
are looking for help to understand exactly what text analytics can do for their business
and how to choose a platform and vendor that work best for them.
This paper describes what text analytics is, how it works and why it is so valuable across
many different organizational areas. It is also intended to give guidelines for evaluating
various text analytics technologies and vendors.
How Does Text Analytics Work?
The analysis of text is based on four basic text-handling capabilities: text extraction,
categorization, sentiment analysis and summarization.
Text extraction. This is where the software identifies many types of words. These words
can range from all words in a document to only one type of word
in a document. For example, words could be limited to nouns, noun phrases or
entities (people, places, organizations, etc.) or to facts (sets of subject-object
pairs connected by some type of relationship). Extraction can be done with large,
elaborate lists of named entities, with automated, rule-based extraction, or
with any combination of the two.
Categorization. The heart of text analytics, categorization can be done in a variety
of ways with different levels of precision and different types of effort by different
resources. Despite what software vendors love to claim, categorization is not yet
done automatically. Categorization methods range from statistical to a variety of
rule-based techniques that utilize sophisticated operators to understand the
context of words.
Sentiment analysis. Social media, voice of the customer and sentiment analysis
have been drivers of the growth of text analytics. And while many people don't
consider sentiment analysis to be part of text analytics, it is important to include it
as part of the description, because it relies on many of the same techniques and
capabilities as the rest of text analytics, particularly categorization.
Summarization. This is a rarely used component of text analytics that gives the
ability to generate a rules-based summarization of large documents. It is mostly
used for replacing the rarely useful snippets that search engines provide.
Summarization is typically done using simple rules related to size and placement
in the document, and is tied to categorization when the summary is based on a
search term.
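To make the extraction capability described above more concrete, here is a minimal Python sketch that combines a small catalog of named entities with one rule-based pattern. The catalog entries, the money pattern and the function name extract are illustrative assumptions, not features of any particular product.

```python
import re

# Hypothetical catalog of named entities (illustrative entries only).
ENTITY_CATALOG = {
    "SAS": "organization",
    "KAPS Group": "organization",
    "Carmel": "place",
}

# Hypothetical rule: a dollar amount such as "$1,200" or "$35.50".
MONEY_RULE = re.compile(r"\$\d[\d,]*(?:\.\d+)?")


def extract(text):
    """Return (matched text, type) pairs from catalog lookups plus one rule."""
    found = []
    for name, entity_type in ENTITY_CATALOG.items():
        # Word boundaries keep "SAS" from matching inside a longer word.
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.append((name, entity_type))
    for match in MONEY_RULE.finditer(text):
        found.append((match.group(), "money"))
    return found


if __name__ == "__main__":
    print(extract("SAS quoted $1,200 for the pilot in Carmel."))
```

A real extraction catalog would be far larger, and real rules would cover many more patterns, but the combination of list lookup and rule matching is the same idea.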
Text analytics includes four basic text-handling capabilities:
text extraction, categorization, sentiment analysis and summarization.
How Can Your Organization Benefit from Text Analytics?
Enterprise search. Text analytics increases the relevance of search
efforts by adding concepts and meaning.
Content management. Text analytics enables a hybrid publishing model
to semi-automate categorization, improving overall quality.
Search-based applications. Embedding text analytics-based search in
enterprise applications helps to streamline, inform and optimize
business decisions.
Text Analytics Applications
Text analytics can be used as a platform for a variety of applications. These include:
enabling smarter search and integrating with content management systems to
add metadata, making retrieval more relevant; converting unstructured text to data
for predictive analytics; doing voice-of-the-customer applications to open up new
avenues of input into what customers are really saying; and a variety of event detection
applications, such as fraud detection or e-discovery.
The complexity of how text analytics can be used is compounded by an almost
bewildering variety of offerings. Text analytics software can include any number of
features. It can, for example, include taxonomy management software. It can include
text analytics platform software that offers everything from just text extraction to
simple categorization to all text analytics capabilities. Or, it can include all of the above
capabilities integrated with sophisticated text and data mining platforms. Text analytics
can be incorporated into a search application, a content management application, or
a business intelligence, customer intelligence or competitor intelligence application.
The vendors range from small start-up companies to ERP vendors (like SAP), hardware
companies (like IBM), to SAS, which specializes in analytics.
[Figure 1 diagram: Self-knowledge, Filter and Proof of Concept (Preparation, Development, Results, Report), leading to Vendor Selection.]
Figure 1: Before choosing a vendor, most companies go through three major phases and four sub-phases in the process of evaluating text analytics solutions.
Text analytics helps
Governments. Governments benefit by using text analytics to
improve the effectiveness and efficiency of citizen services. Text
analytics gives agencies a more comprehensive approach to
assessing communications and events from social media, and
improves monitoring of constituent inquiries and the overall public
pulse. It also improves early warning detection, enhances public
safety, increases transparency and promotes better-informed policy
decisions.
Deciding on Text Analytics Software - The Process
With all this complexity, it is important to approach with great care all decisions related
to selecting text analytics solutions. What follows is a description of such a process that
has been developed over a number of years across multiple projects.
The basic method is a three-step process:
Self-knowledge. How does text analytics fit with the information and business
goals and strategy of your organization?
Filter. Traditional software evaluation methods that involve investigating feature
sets, technology issues, usability and other features.
Proof of concept, or POC. Because text analytics deals with language and
semantics, this is the real heart of the evaluation.
Let's take a closer look at what is needed to complete each phase.
Self-Knowledge Phase
Too many decision makers decide suddenly that they need to jump on the social media
or text analytics bandwagon. Then, they try to pick a vendor without really understanding
what business value they are looking for.
A good evaluation process starts with doing a deep dive into what text analytics might
mean to your organization. This deep dive includes:
Understanding the strategic and business context for text analytics. For
example, how does information flow within specific business processes? Is it mostly
when you write large Word documents where research is done as a formal activity,
or is it when research is done on the fly as documents are being written? For every
company the answers will be different, but the main task for everyone is to map out the
relative strategic importance of each type of information or business process flow.
Deciding what your information problems are. You must decide what each information
problem is, and how severe and how critical it is to your organization. For example,
do your problems mostly relate to the difficulty of finding information within your
company, or do they relate to the inability to understand what your customers are
saying about your products, or to the need to find better patent information?
Asking strategic questions. You need to ask why you need text analytics, what
value you get from the taxonomy or text analytics, and how you are going to use
it. This will involve getting an idea of how much money and time you are currently
losing to your information problems and understanding how a text analytics solution
will help. This can be done abstractly (by applying the results of analyst research);
by doing actual studies or surveys to determine how much productivity is lost now;
or by calculating how much profit you think a new text analytics application will
generate.
Text analytics helps
Insurers. Insurers benefit by using text analytics to analyze claims
descriptions and obtain deep insight into each claim. Text analytics helps
automate triage and subrogation processing. It also focuses
reviewers' efforts on prioritized claims by enhancing predictive
analytical models, and it helps spot fraudulent activity in claims.
Determining what content and content resources you have. This will involve
answering questions such as: What is the mix of unstructured content and database
content? Does most of the unstructured content live in a content management
system or is it distributed on file shares? Is the content mostly just business content,
or do you have large collections of topical content such as biological research
results? Do you have existing taxonomies or glossaries, or even just good overview
books with good chapter structure?
Assessing your technology environment and how text analytics will integrate
with it. This will involve answering questions such as: Is SharePoint a major part
of your technology environment? Do you have well-integrated technology, or does
each department or division have its own technology? Do multiple programs have
to share information? Do you have multiple search engines within the organization,
and how integrated are they?
Answering these questions can be done during a formal two- to four-week process, or
as an informal set of research and discussion activities.¹ This new self-knowledge needs
to be documented, describing the extent of the potential value and effect text analytics
could have on your organization.
Filter Phase
The filter phase is the one that most resembles traditional software evaluation. It consists
of such activities as:
Market research into the company reputation, history and projected future.
Technology research into the underlying technology behind the software, so that you
can decide how it might integrate with your existing environment.
Feature scorecard with a focus on minimum features, must-have features and an
understanding of how those features are important to your organization. These
features can include general software features such as price, usability and editing,
but can also include comparisons of how well the basic text analytics features (text
extraction and categorization) are implemented.
These traditional software evaluation activities can produce a scorecard, but this
scorecard should be thought of as a filter to eliminate offerings that don't fit with your
needs, not as a final scorecard that you use to select your software. This phase should
reduce the number of viable alternatives to a small list. Then, you can invite those
vendors to do extended demos of one to three hours each.
Why not simply base the decision on features? First, because software features change.
But more importantly, because your content is unique. So the real issue is to find
features that are useful in understanding your materials.
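The idea of a scorecard used as a filter rather than a final ranking can be sketched in a few lines of Python. The vendor names, feature names, 0-5 ratings and weights below are hypothetical placeholders; the only point is that must-have features eliminate candidates first, and the weighted score merely orders the survivors for the short list.

```python
def filter_vendors(vendors, must_have, weights, shortlist_size=4):
    """Drop vendors missing a must-have feature, then rank the rest by weighted score."""
    viable = [
        vendor for vendor in vendors
        if all(vendor["features"].get(feature, 0) > 0 for feature in must_have)
    ]

    def weighted_score(vendor):
        return sum(weights[f] * vendor["features"].get(f, 0) for f in weights)

    ranked = sorted(viable, key=weighted_score, reverse=True)
    return [vendor["name"] for vendor in ranked[:shortlist_size]]


if __name__ == "__main__":
    # Hypothetical ratings on a 0-5 scale; 0 means the feature is absent.
    vendors = [
        {"name": "A", "features": {"extraction": 3, "categorization": 5, "usability": 2}},
        {"name": "B", "features": {"extraction": 5, "categorization": 0, "usability": 5}},
        {"name": "C", "features": {"extraction": 4, "categorization": 3, "usability": 4}},
    ]
    weights = {"extraction": 2, "categorization": 3, "usability": 1}
    print(filter_vendors(vendors, ["extraction", "categorization"], weights))
```

The short list that comes out of such a filter is the input to the demos and the proof of concept, not a purchasing decision.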
1 Such processes may be called a Readiness Assessment, performed by vendors or third parties like the KAPS Group.
The only way to really
understand a text analytics
solution is by doing a proof of
concept that tests with your
content, your scenarios and
your people.
Text analytics helps
Financial organizations. Financial departments and organizations use
text analytics to effectively identify fraudulent activity. Text analytics
provides a way to dig into the details contained in applications, notes,
descriptions and other unstructured text sources, helping prioritize
cases for examiners to investigate, and creating indicators for detection
alerts.
Overall, the filter phase should reduce the number of candidates to between two and
four. These are the candidates you will consider in the next phase. In some cases,
you might even be able to reduce your candidates to one clear leader. But even if this
happens, it still makes sense to do the last phase described below. In some cases, a
company could start with a preferred vendor because of an ongoing relationship, or
on the basis of a trusted recommendation or some other reason. In these cases, it still
makes sense to do the next phase, but with a different focus: to make sure that the
text analytics offering works in your environment.
Proof-of-Concept Phase
The proof-of-concept (POC) phase is the most important of the entire text analytics
evaluation. That is because text analytics is all about language and semantics and how
people think and express their thoughts. The only way to really understand that is to test
with your content, your scenarios and your people.
A basic approach to a POC is to set it up as a contest, of sorts, between the top two
or three vendors. POCs are needed because the complexity of language demands that
you look beyond simple out-of-the-box (OOB) capabilities. The key question is not
how well a vendor can set up a demo with carefully selected content and scenarios, but
how well those capabilities can be refined through two or more development-refine-test
cycles. This is what will really tell you if the software can solve the information problems
you uncovered in the self-knowledge phase.
A POC will also answer another critical question: How much effort will it take to get
to acceptable levels of accuracy? For example, some vendors have expended a
lot of effort to get better OOB results with built-in semantic networks, large multiple
dictionaries and the like. While those resources make the product look good in an initial
comparison, the real question is how much effort will be required to achieve the 90
percent accuracy rate that you may have set as your goal. For example, let's say a
product can determine OOB that specific content is about telecommunications. That
doesn't really tell you much if you are a telecommunications company and almost all of
your content is about telecommunications. And it can often take more time and effort to
go from telecommunications to specific concepts (like bill plans) than it would to go from
scratch to those levels using some other product.
Another question that a POC can answer is how well you can establish a working
relationship with the vendor. And from the vendor's perspective, the POC can uncover
any special issues that you need to have addressed so the vendor can work out
solutions while you are still in a relatively forgiving research frame of mind.
Text analytics helps
Health and Life Sciences organizations. Text analytics
improves patient safety and care. It promotes a proactive approach
to identifying adverse events, often found in doctors' notations
and in descriptions of symptoms and secondary effects from drug
treatments. Text analytics also improves health outcomes from
in-depth research assessments.
One of the most valuable aspects of a POC is that it gives you a head start in
development with the support of the vendors. This is true even in a case where the initial
selection was reduced to one. The POC creates a foundation for your initial project as
well as for any future projects. This foundation consists of the actual development
of taxonomies and rules for categorization, sentiment and extraction. But just as
important, it provides on-the-job training for your internal resources (taxonomists, text
analytics developers and others) under the guidance of experts. Training by doing, in this
case, is by far the best and cheapest way to train your internal resources so they can
take over after the initial POC.
A second benefit of using a POC for initial development is that it allows you to build
the right kind of foundation: one that is designed from the beginning to be a platform
technology that can support multiple applications. This keeps you from getting caught
in the trap of thinking about text analytics just in terms of your first project, a sure way
to not get the maximum benefit from text analytics software. In addition to getting the
maximum direct value from your investment, this approach can also enable you to
integrate text analytics with other advanced analytic technologies like text mining, data
mining and predictive analytics.
The POC Process for Text Analytics Technology: Key Issues
The actual POC can also be broken down into four stages: preparation (including
design), development, results definition and reporting. While each project will be
different, there are a number of key issues to consider for any POC at each of those
stages.
Stage One - Preparation
When designing a POC, you should start by deciding on an appropriate size and
length for this phase. While the overall length is somewhat dependent on the size and
complexity of content and anticipated uses, a rough guide is to allow four to six weeks
of effort, with one or two experienced taxonomists or text analytics developers per
candidate software. Ideally, these taxonomists will have experience with the particular
software they're evaluating; but it is even more important that they have experience
developing categorization, extraction and/or sentiment rules.² The four-to-six-week time
frame allows the POC to go through at least one and preferably two or three rounds of
development and refinement, which is essential for a meaningful POC.
Other design considerations involve selecting the amount and variations of content
that will accurately reflect the complexity of your organization, and then developing the
essential use cases for your POC. This includes getting access to the content, which is
not always easy.
2 If you don't have experienced personnel, most vendors will have a range of consultants and partners who can help you bridge this interim skills gap.
Tips and Tricks
Testing categorization rules requires careful design. Once the initial
test content is categorized, you can virtually automate scores
without having to open each file to determine the categorization's
correctness. In subsequent tests with uncategorized content, a
normal procedure is to open selected documents to let subject
matter experts do this evaluation.
Text analytics helps
Manufacturers. Through text analytics, manufacturers enhance
quality and reliability. Text analytics gives a common view of categorized
product and parts codes that are used in early-warning detection
systems. It also lets manufacturers examine quality and reliability issues
based on incoming customer communications, claims and social
media monitoring.
Another key consideration is the selection and recruitment of your internal resources
who will participate in the POC. These include subject-matter experts (SMEs) who will
select and categorize appropriate content for each individual category, and act as expert
evaluators of the success of the POC categorization and other scenarios. Others who
might need to be included are technical people who can support the technical aspect of
the POC and business users who can generate use case scenarios and also evaluate
the success of the text analytics.
Another important task for the preparation phase is to identify or develop a taxonomy
for use with the categorization portion of the POC. Categorization requires a taxonomy
as its organizing structure. It need not be a big taxonomy, and it can often just be a
list of important concepts. But if you have a large taxonomy (like biopharmaceutical
companies or government and military organizations often have), you should select a
small subset of the overall taxonomy to focus on getting good results, not complete
coverage.
Once you have defined the use case scenarios for the evaluation, the next step is to
map those to specific text analytics capabilities and then develop tests for each one.
This will vary from organization to organization, but a general suggestion is to develop
a set of small extraction catalogs that includes both named entities and rule-based
functionality, and test them on your selected content. The other primary test case(s)
will be categorization and/or sentiment. During the preparation phase, you will need to
determine what accuracy level you will aim for.
Stage Two - Development
After designing the POC project and setting everything up (including preparing
content and people), the next phase (development) typically starts with developing
categorization and/or sentiment scenarios. One reason to start with categorization/
sentiment is because this is where the majority of effort will be.
The process is roughly the same for both categorization and sentiment, and a simple
process can be used for both. Categorization typically starts with selecting example
content to build the first round of categorization rules. This example content, often
called training sets, can be obtained either by SMEs or in a more automated manner.
In some cases, you may be able to develop very simple categorization rules of a few
terms or a single term, and then find content that matches it, using an expert to help
select additional useful terms out of that small initial set. For example, your SME or your
software might do a simple search on the term "public health" and then explore the
result set of that search for additional terms.
Once the initial set of rules is developed based on the categories of the initial content
set, the next step is to test them against the complete content set and refine them to
get both good recall and good precision. ("Good" is something you define during the
preparation phase.) The next stage is to generate a new (usually larger) content set
and run your categorization rules against that new content. This will typically result in a
significant drop in accuracy, which is followed by another round of refining the rules to
produce good results for the new content.
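The seed-term search just described can be sketched as follows. This is a simplified illustration, not a description of how any product works: it assumes documents are plain strings, uses a tiny hand-written stop-word list, and simply counts in how many matching documents each other word appears, so an SME can review the candidates.

```python
from collections import Counter
import re

# Small illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "for", "is", "on"}


def candidate_terms(documents, seed, top_n=5):
    """Count, per word, how many seed-matching documents contain it."""
    seed_words = set(seed.lower().split())
    counts = Counter()
    for doc in documents:
        text = doc.lower()
        if seed.lower() not in text:
            continue  # only explore the result set of the seed search
        words = set(re.findall(r"[a-z]+", text))
        counts.update(words - STOP_WORDS - seed_words)
    return counts.most_common(top_n)


if __name__ == "__main__":
    docs = [
        "Public health officials track vaccination rates",
        "Vaccination campaigns improve public health",
        "Quarterly earnings rose in the telecom sector",
    ]
    print(candidate_terms(docs, "public health", top_n=3))
```

The output is only a list of suggestions; deciding which terms actually belong in the rule remains a human judgment.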
Text analytics helps
Retailers. Retailers use text analytics to improve brand image,
advertising, customer satisfaction and campaign efforts. Through text
analytics, retailers get feedback from consumers' social media chatter
so they can listen to consumers, understand competitive reactions,
follow trends and visibly address problems.
This entire process can be repeated. But if time or resources are short, or if you get good
results against the new content, then this is as far as you need go.
Key considerations for categorization
Almost anyone can develop categorization rules. For example, you can ask SMEs to
look at documents and pick out words that were suggested by the software, then
incorporate them into simple word list categorization rules. However, to get realistic
results that you can use to build upon, or as tests to compare the capabilities of two
competing packages, you have to go beyond simple word list categorizations.
This can be done in two ways. The first is to carefully tune the list of words in ways
that SMEs often have little experience with. To do this, you need words that not only
exemplify the concept or category you are building the rule for, but words that are unique
to those documents. Good software can aid in this process by choosing statistically
unique words, but human judgment is always needed.
The second way to create realistic rules is to develop advanced Boolean rules that
utilize operators like AND, OR, NOT and DIST or START. Developing advanced rules
or creating word list rules that combine significant and unique words both require
experience and learning. This is why one of the goals of a POC should be to train your
resources who will be charged with further development and maintenance.
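As a rough illustration of what an advanced Boolean rule looks like, the Python sketch below builds rules from small composable functions. The exact operator semantics differ from product to product; here it is assumed that DIST means "the two words occur within n tokens of each other", and the function names (contains, all_of, any_of, negate, within) are invented for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains(word):
    return lambda tokens: word in tokens


def all_of(*rules):  # plays the role of AND
    return lambda tokens: all(rule(tokens) for rule in rules)


def any_of(*rules):  # plays the role of OR
    return lambda tokens: any(rule(tokens) for rule in rules)


def negate(rule):  # plays the role of NOT
    return lambda tokens: not rule(tokens)


def within(n, first, second):  # assumed semantics for DIST
    def check(tokens):
        pos_a = [i for i, t in enumerate(tokens) if t == first]
        pos_b = [i for i, t in enumerate(tokens) if t == second]
        return any(abs(a - b) <= n for a in pos_a for b in pos_b)
    return check


# Hypothetical rule for a "Public Health" category.
PUBLIC_HEALTH_RULE = all_of(
    within(3, "public", "health"),
    negate(contains("insurance")),
)


def matches(rule, text):
    return rule(tokenize(text))


if __name__ == "__main__":
    print(matches(PUBLIC_HEALTH_RULE, "The public health agency issued guidance"))
    print(matches(PUBLIC_HEALTH_RULE, "Public companies offer health insurance"))
```

Even this toy rule shows why experience matters: the choice of distance and of excluded words directly trades recall against precision.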
Developing extraction catalogs
The second major activity of a POC is to develop extraction capabilities with catalogs or
lists of entities to extract and/or rules for extracting all kinds of noun phrases.
When it comes to extraction, there are usually two main considerations: scalability and
disambiguation.
Scalability testing is not particularly suited to a POC, but you can get some insight into
the scalability of the various offerings with simulations of large content sets and
artificial extraction catalogs. For example, you can generate or capture large
content sets that will accurately match the number and size of your documents,
even though they are not reflective of your specific content.
Tips and Tricks
One way to address scalability in the POC is to take names of worldwide
organizations and combine them with sets of generic rules. By starting
with small content sets and catalogs and increasing in measurable steps,
you can get a good idea of the basic scalability up to limits that are close
to your final needs.
Disambiguation is something that can and should be tested. Disambiguation is
the ability to distinguish between words that look the same but mean something
different, or between two words that are different but mean the same thing. The
latter case is usually relatively easy to handle through development of extended
synonyms. But the first case often calls for much more sophisticated rules that
take context into consideration. For example, "Ford" can refer to a person, a
car or a company (or in some contexts, a fictional person). To distinguish which
is being referred to in a particular text, you must be able to incorporate multiple
levels of context, from any type of work (fiction, newspaper, economic analysis) to
types of words in the document, the paragraph or the sentence.
Because you need this level of disambiguation even for sentiment applications, it is
important to look at the categorization functionality of each offering. It is the underlying
categorization capability that will typically be used for disambiguation.
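A very small sketch can show how context words drive disambiguation of a term like "Ford". The cue lists below are invented for illustration and look only at the sentence level; real rules would also draw on paragraph-level and document-level context and would be far richer.

```python
import re

# Hypothetical context cues for each sense of the ambiguous word.
CONTEXT_CUES = {
    "person": {"president", "mr", "said", "born", "actor"},
    "car": {"drove", "driving", "engine", "sedan", "truck", "parked"},
    "company": {"shares", "earnings", "automaker", "plant", "announced"},
}


def disambiguate(sentence, target="ford"):
    """Pick the sense whose cue words appear most often in the sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    if target not in words:
        return None  # the ambiguous word is not present at all
    scores = {
        sense: sum(word in cues for word in words)
        for sense, cues in CONTEXT_CUES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


if __name__ == "__main__":
    print(disambiguate("He parked the Ford and checked the engine."))
    print(disambiguate("Ford shares rose after earnings were announced."))
```

Testing each candidate product against a handful of such ambiguous terms from your own content is a practical way to compare their categorization operators.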
Stage Three - Results: Balancing Recall and Precision
The results stage would seem to be the most straightforward aspect of the POC, but
there are a number of key issues to be aware of during this phase. Initial measurements
typically generate numerical scores for overall accuracy, recall and precision.
Recall for categorization is the share of the documents known to belong to a category
that the software actually tags with that category. So if you know that there are 100
documents that should be tagged as Health Care > Public Health, and the software
correctly identified 80 of them, then it would produce a recall score of 80 percent.
Precision is the share of the documents tagged with a category that really belong to it.
It is lowered by false positives, which are the documents the
software incorrectly tagged as a particular category. So if 20 out of the top 100
documents tagged Health Care > Public Health don't belong in that category, then the
precision score is 80 percent.
Typically, recall and precision are inversely related: the better the recall, the worse the
precision. It is easy to write a rule that correctly categorizes all 100 known documents
if the rule is so general that it categorizes virtually everything as part of that category.
Conversely, it is easy to write a rule that is so specific it only returns 10 of the known
documents and therefore no false positives. The trick is to write rules that come up with
a good balance between recall and precision, with high scores for both.
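The two scores can be computed directly from sets of document identifiers. The sketch below is a minimal illustration that reuses the numbers from the Health Care > Public Health example; the document IDs and the collection size of 1,000 are made up.

```python
def recall_precision(relevant, tagged):
    """Return (recall, precision) for one category.

    relevant: IDs of documents known to belong to the category.
    tagged:   IDs of documents the software tagged with the category.
    """
    relevant, tagged = set(relevant), set(tagged)
    true_positives = len(relevant & tagged)
    recall = true_positives / len(relevant) if relevant else 0.0
    precision = true_positives / len(tagged) if tagged else 0.0
    return recall, precision


if __name__ == "__main__":
    known = range(100)  # 100 documents that should be tagged

    # 80 correct tags plus 20 false positives.
    balanced = list(range(80)) + list(range(100, 120))
    print(recall_precision(known, balanced))

    # An overly general rule tags the whole (hypothetical) 1,000-document set.
    print(recall_precision(known, range(1000)))

    # An overly specific rule tags only 10 of the known documents.
    print(recall_precision(known, range(10)))
```

The general rule reaches full recall at very low precision, and the specific rule does the reverse, which is the trade-off described above.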
It is important to realize that recall and precision are somewhat content dependent.
For example, in the normal develop-test-refine cycle,³ it is typical to develop rules that
give good results for the initial test set of documents but with a score that goes down
when applied to a new set. The goal is to produce rules that are general enough to
apply to new content almost as well as to old content.
3 For an in-depth description of the develop-test-refine cycle, see: Enterprise Content Categorization - How to Successfully Choose, Develop and Implement a Semantic Strategy. Available at: sas.com/reg/wp/corp/25624.
Text analytics helps
Media and Publishers. Text analytics provides more
personalized reader experiences and improves ad revenues in this
industry by automatically indexing content and associating it with
readers' specific topics of interest.
Tips and Tricks
Typically, recall and precision are inversely related: the better the
recall, the worse the precision. The trick is to write rules that come up
with a good balance between recall and precision, with high scores for
both.
Also, keep in mind that the right balance between recall and precision is dependent on
the particular application. For example, in a discovery application in which humans will
be reviewing the results, recall is the most important measure. But for an automated
application that is exposed to users, precision is often the most important, because too
many false positives will cause users to lose faith in the application.
Recall and precision are normally applied to categorization, but they can also be applied
to extraction, with the focus on specific entities instead of documents.
Key considerations for measuring results
There are three key considerations for getting good results from tests. The first is to
realize that testing will likely require significant human effort. Subject-matter experts
will need to provide a human categorization during the preparation phase (by
categorizing training sets) and/or evaluate the outcomes. There are tricks to reduce
the human effort involved, such as incorporating categories in file names or obtaining
pre-categorized content without having to use internal resources. The difficulty with
using pre-categorized content is that it is rarely available in sufficient depth to be
useful. Similar to good OOB categorization, it is usually not specific enough to provide
documents that are about your industry, such as telecommunications or health care.
Such specific categories of content are much harder to find.
Using humans for categorization, though, raises a question about accuracy. In
general, humans are very good at seeing patterns and coming up with a reasonable
categorization; but they are not very consistent. Machines, on the other hand, are
completely consistent. Humans can be inconsistent in two ways: agreement between
people and agreement over time. What this means in terms of getting good results from
testing is that you need to normalize results across multiple testers and over time for
individual testers.
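One common way to quantify how consistently two testers categorize the same documents is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The paper does not prescribe a specific measure, so the sketch below is offered only as one possible option; the category labels are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two testers on the same documents."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both testers must label the same non-empty document set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Probability that both testers pick the same category by chance.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both testers used a single identical category throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    tester_1 = ["health", "health", "telecom", "telecom"]
    tester_2 = ["health", "telecom", "telecom", "telecom"]
    print(cohen_kappa(tester_1, tester_2))
```

The same function can compare one tester's labels at two points in time, which covers the second kind of inconsistency mentioned above.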
A second key consideration to remember is that the scores are not the only story. First
of all, it is often hard to develop tests that reflect each vendor's unique capabilities. For
example, if one vendor has very strong statistical modeling but weak categorization
operators while the opposing vendor has weak statistical components but a complete
set of categorization operators, it can be very tricky to design a fair test. One way
around this is to develop a set of tests with weights that reflect your criteria and use
case scenarios. Second, it is important to factor in the overall level of effort needed to
achieve those scores. This is something that often counterbalances price differences
because a relatively cheap software package can have a much higher total cost of
ownership when labor is factored in. Third, it is important to recognize that scores are
only relative measures. If one vendor gets 90 percent accuracy and the other gets 85
percent accuracy with the same level of effort, the difference may not be significant in
the real world.
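The weighting and level-of-effort points above can be sketched in a few lines. Everything in this example is an assumption made for illustration: the test names, weights, scores, license fees, labor hours and hourly rate are invented, and a real evaluation would take them from its own criteria and use case scenarios.

```python
def weighted_score(scores, weights):
    """Weighted average of per-test scores; weights need not sum to 1."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[test] * w for test, w in weights.items()) / total_weight


def total_cost(license_fee, labor_hours, hourly_rate):
    """License price plus the labor needed to reach the reported scores."""
    return license_fee + labor_hours * hourly_rate


# Hypothetical weights reflecting one organization's priorities.
weights = {"categorization": 0.5, "extraction": 0.3, "sentiment": 0.2}

# Hypothetical accuracy per test for two vendors.
vendor_a = {"categorization": 0.90, "extraction": 0.70, "sentiment": 0.60}
vendor_b = {"categorization": 0.80, "extraction": 0.85, "sentiment": 0.75}

score_a = weighted_score(vendor_a, weights)
score_b = weighted_score(vendor_b, weights)

# Hypothetical costs: vendor A is cheaper to license but needs more labor.
cost_a = total_cost(license_fee=20_000, labor_hours=400, hourly_rate=100)
cost_b = total_cost(license_fee=50_000, labor_hours=50, hourly_rate=100)

print(f"A: score={score_a:.3f} cost={cost_a}")
print(f"B: score={score_b:.3f} cost={cost_b}")
```

With these made-up figures, the package with the lower license fee ends up with the higher total cost once labor is included, which is the effect described above.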
Text analytics helps
Energy and Transportation.
Companies in these industries use
text analytics to improve asset
maintenance schedules. Servicing
notes are used as inputs to improve
asset predictions and to proactively
identify potential safety issues from
logs and accident reports.
The third key consideration, knowing that the develop-test-refine cycle is not a linear
process, is extremely important for an overall evaluation of the project. For example,
you may be looking at only 30 percent accuracy after one round, which seems so poor
that the entire idea is questionable. Or it may be that after one round, one vendor is way
ahead with mostly 80 percent versus 50 percent accuracy.
In the first case, project owners may be thinking that it took two weeks to get 30 percent
accuracy, so they assume it will take another four weeks to get up to 80 or 90 percent,
when in fact a particular category can go from 30 percent accuracy to 90 percent in one
hour with a simple addition or deletion to the rule. In the second case, the relative scores
could easily be an artifact of experience with the software or inexperience with the
particular subject matter, which could be easily reversed with a second round of effort.
This scenario also highlights the importance of doing at least two rounds of development
and testing.
Stage Four Report
The last phase of the evaluation and the project is to measure the results of the
last round of testing and generate a final report. The report should describe the
process, present the results with any issues clearly delineated, and propose a final
recommendation about which software to purchase. It should also include other details,
such as deployment and implementation recommendations. Another component of the
final report is often a development road map to guide the development of a text analytics
platform and an initial set of applications that the organization plans to deploy.
One particularly effective technique is to generate an interim or preliminary report. This is
often in the form of a PowerPoint presentation that includes the results, an interpretation
of those results, and an emphasis on any unresolved issues and decisions. This interim
report is used to get feedback on the results. This feedback typically produces a better
final report and also ensures buy-in to the conclusions.
Another function of this interim report is to guide and focus discussions about any
unresolved issues, interpretations of results, and plans for the future.
The format and content of both the interim and final reports are strongly dependent on
the specific use case scenarios and other criteria that were developed in the preparation
phase. But there are a few general considerations to keep in mind.
A typical report will follow the major phases of the project: self-knowledge, preparation
and the POC. Some sample sections might be:
Review evaluation process and methodology. This section provides the overall
context of the report and describes the requirements and use case scenarios that
were developed in the self-knowledge phase.
To get accurate results, it's
important to do at least two
rounds of development and
testing.
Text analytics helps
Academic and other educational
fields. In this domain, people
benefit from the collaboration
that text analytics enables. With
text analytics, people are rapidly
connected with each other and
with external networks and relevant
materials. Text analytics even
identifies the level of expertise
contained within documents, sets
of documents and groups.
Initial evaluation. This part of the report should describe the research and thinking
that went into the initial evaluation of the entire vendor space. It should then review
the outcomes and describe the initial high-level conclusions. An interim version
would also contain any unresolved discussion points from that phase. This section
would end with a description and justification for the recommendations from that
phase.
Proof of concept. This section typically describes the methodology employed
during the POC, describes and interprets the results, and presents the final
conclusions. The interim version would also lay out the remaining discussion
points, while the final version would contain the results of those discussions.
Final recommendations. This section could be as simple as listing the final
vendor selection and the justifications for that selection. It could also contain a
set of recommendations about how to proceed to implement the software in one
or more applications, with an initial approach, resourcing recommendations and
prioritization of potential applications. The level of detail will vary, depending on how
much effort went into the self-knowledge phase.
These reports can be relatively informal or can follow any formal requirements that
are in place. Reports should provide both the history of and justification for any final
conclusions and decisions.
A Simple Solution?
This might seem like a very involved and complex process, and the question often
comes up: Isn't there an easier way? Or isn't there one product that is better than all
the others?
First of all, no, there is not an easier way. You could try setting up some sort of
number generator with the randomly placed names of all the text analytics vendors on
your "wheel of future text analytics fortunes" spreadsheet and spin the wheel. Or you
could ask a friend, but how many people have that kind of friend who has done it
before and happens to have exactly the same content, scenarios and use cases as
your organization?
Second of all, there is not one product that is better in all ways for all customer
environments. It would be nice if there were, but the reality is that different environments
sometimes call for different solutions. For example, an organization may be primarily
interested in developing products for resale in the voice-of-the-customer space. In that
case, their best fit might be a vendor that has spent the last five years developing built-in
customer intelligence reporting capabilities that are available out of the box.
When you take a platform
approach to text analytics right
from the start, your solution will
continue to deliver value over
time, across your organization.
Conclusion: Getting the Most from Your Investment
One generalization applies to the vast majority of companies: Text analytics delivers the
greatest value when approached as a platform or infrastructure technology that can
support and enable an impressive array of applications, both internally and externally.
There are both broad strategic reasons for a platform approach and myriad specific,
practical reasons.
The overall context is the explosion of information, particularly unstructured content.
It used to be that, in general, 80 percent of significant business information was
unstructured; but with the rise of social media and other factors, analysts estimate that it
has gone up to 90 percent. We've known for years that the way to maximize value from
this unstructured content is to add more structure to it. But the cost and effort of adding
structure with largely manual means have been too high, and the results too unreliable.
With the development of more sophisticated text analytics software to semi-automate
the process of adding structure, the situation has finally changed. Now we can add
structure to information in a more cost-effective way, faster and with better quality.
And this need for structure in unstructured content cuts across all boundaries and all
applications in the world. This means that the number and variety of applications for text
analytics is vast, almost beyond belief. And even if you are currently only interested in
fixing your search experience or getting better feedback from your customers, once text
analytics has been added to your organization, the number of potential applications for
it will grow dramatically if you approach text analytics as a platform.
You may only be looking at one of these applications right now, but in the future you will
almost certainly need and want other applications. If your current choice is limited, you
may have to go through this whole process again. Or someone else in your organization
may end up purchasing some other software solution that does one thing well, but not
everything you need. So another department may buy another solution, and this cycle
can go on and on. The real solution is to purchase technology that is integrated into a
comprehensive platform. In that way, a specific solution can be augmented over time
and the platform can be adapted or developed to support all of your application and
departmental needs.
The decision about which application should be developed first depends on the priority
of your organization and what has driven you to incorporate text analytics in the first
place. However, if text analytics is approached with a platform model, it doesn't matter
which is done first. Why? Because the first application will create a platform that will
enable your organization to add other applications at a fraction of the cost and effort it
would take if each application were developed independently.
We've known for years that
the way to maximize value
from unstructured content is to
add more structure to it. With
sophisticated text analytics
software, we can add structure
faster, more effectively and with
better quality.
SAS Institute Inc. World Headquarters +1 919 677 8000
To contact your local SAS office, please visit: www.sas.com/offices
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA
and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
Copyright 2012, SAS Institute Inc. All rights reserved. 105643_S84400_0512
About SAS
SAS is the leader in business analytics software and services, and the largest independent vendor in the business intelligence market.
Through innovative solutions, SAS helps customers at more than 55,000 sites improve performance and deliver value by making better
decisions faster. Since 1976, SAS has been giving customers around the world THE POWER TO KNOW. For more information on
SAS Business Analytics software and services, visit sas.com.