Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®

White Paper

by Stephen E. Arnold
ArnoldIT.com

Table of Contents

1. Enhancing the usefulness of existing information in an enterprise
   a. Integrate discovery in databases with unstructured text
   b. Difference between search and discovery
   c. How SAS® Text Miner works as a discovery search tool
   d. How SAS® Text Miner enhances the discovery process

2. SAS® Text Miner is part of the SAS® Enterprise Intelligence Platform
   a. Overview
   b. SAS® creates an information circuit for organizations

3. SAS – The company
   a. History
   b. Customers

4. Data mining products available from SAS
   a. Highlighting the SAS® Text Miner solution
      i. Textual data comes in many flavors
      ii. Competitive differentiators
      iii. Key features
      iv. SAS® Text Miner under the covers
      v. Reducing the manual work of text mining
   b. Highlighting SAS® Enterprise Miner™
      i. What is the difference between SAS Text Miner and SAS Enterprise Miner?
      ii. Model scoring
      iii. Customizing applications
   c. How do these products interrelate?

5. SAS® Text Miner Feature Snapshot

6. Conclusion

7. Appendix

For more information about search and text mining, visit www.arnoldit.com.

No portion of this document may be reproduced without the prior written permission of Stephen E. Arnold.

Enterprise search and its laundry lists of results are increasingly frustrating to users of enterprise search systems. What can an organization do to break through the logjam of often-irrelevant results and “missing” information? How can an organization make “finding information” reduce costs instead of increasing them? What can you do to squeeze more value from the digital data and electronic documents you have? SAS has an answer: text analytics.

1. Enhancing the usefulness of existing information in an enterprise

a. Integrate discovery in databases with unstructured text

Enterprise search is undergoing an important change. Systems using brute force matching of the words in a user’s query are becoming more pliant and discovery-centric. Text mining is coming to an enterprise search system near you. Its arrival is long overdue. A system that can seamlessly integrate discovery in databases and unstructured text can pay huge dividends quickly. (See Stephen Arnold’s article in the November 2006 issue of CXO America: http://www.cxoamerica.com/pastissue/printarticle.asp?art=25408.)

Many CFOs, CEOs, CIOs and CTOs (hereinafter, CxOs) have learned the painful way that in their organizations, text-centric information problems are common – and expensive to address using traditional management techniques such as hiring temporary staff, outsourcing or trying to get existing information technology to handle customer support, marketing and legal tasks. This analysis will outline the emerging topic of text mining as a potential answer to reduce the dissatisfaction with existing manual processes. Text mining provides a way to enhance the usefulness of existing information in an enterprise. By adding value to its information, a company can help everyone in the organization work smarter.

Data mining is the process of data selection, exploration and building models using vast data stores to uncover previously unknown patterns. What does this mean to you? You can produce new knowledge to better inform decision makers before they act. Build a model of the real world based on data collected from a variety of sources, including corporate transactions, customer histories and demographics, even external sources such as credit bureaus. Then use this model to produce patterns in the information that can support decision making and predict new business opportunities. Text mining capabilities enable you to apply such analyses to text-based documents. With SAS’ rich suite of text processing and analysis tools, you can uncover underlying themes or concepts contained in large document collections, group documents into topical clusters, classify documents into predefined categories and integrate text data with structured data for enriched predictive modeling endeavors.

Whether it’s used to drive new business, reduce costs or gain the competitive edge, data mining (and its close cousin text mining) is also a valuable asset for organizations. Especially in the area of customer relationship management (CRM), applying the techniques of data mining has become an accepted “best practice” for companies serious about business. No longer can successful organizations rest on their laurels and continue to do business as they always have in the past. Only those that take proactive measures will survive in today’s competitive world.

Text mining, however, is not well known because it is viewed as an emerging technology with untapped potential by those eager to analyze the text-based data that is accumulating and filling their vast data warehouses. Enterprise search vendors have asserted that their keyword matching systems are, in effect, text mining systems. Because text mining is not well understood, these assertions lead many organizations to believe that broad categories of results equal text mining. This is not true, and the situation can cost an organization money and opportunities. Enterprise search that generates a long list of possible matches wastes the user’s time. Every minute spent browsing and clicking on “hits” costs the organization money and reduces the employee’s ability to respond to an information need quickly. Tools like those available from SAS provide alternative ways to find an answer, pinpoint the fact needed, or get ‘oversight’ and insight into structured and unstructured data.

Organizations ought to take steps and begin to harness more of the data they are adding to their growing data stores – structured and unstructured alike – so that they will be empowered to take proactive steps and improve decisions to guide their companies to a brighter (more profitable) future. Text mining software can surface patterns and trends that you may never have thought to look for. With comprehensive data mining solutions such as SAS Text Miner, you can go from raw data to accurate, business-driven analytical models in a seamless, efficient process.

b. Difference between search and discovery

It’s important to remember that search engines are designed to return a specific subset of documents based on a query performed by the user. Typically, users know what they seek and construct a query to accomplish the goal. The search engine then returns a subset of documents that matches that query.

In contrast, text mining engines are designed to help the user discover key concepts within documents or subsets of documents without having to read the entire collection. Unlike the user of a search engine, text miners are unsure of what they want to know and are relying on the collection as a whole to speak for itself.
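The contrast between search and discovery can be sketched in a few lines of Python. This is a toy illustration only, not how SAS Text Miner works internally: the sample notes, the stop-word list and the function names are all invented for the example. Search starts from a query the user already has in mind; discovery starts from the collection and reports which terms recur across it.

```python
import re
from collections import Counter

# Hypothetical call center notes, invented for illustration only.
DOCS = [
    "Customer reports brake noise after the dealer visit",
    "Brake noise returns when the car is cold",
    "Customer asks about the warranty on the radio",
    "Radio display fails after a software update",
]

# A tiny assumed stop-word list; real systems use far larger ones.
STOP_WORDS = {"the", "a", "an", "on", "is", "when", "after", "about"}


def tokenize(text):
    """Lowercase a document and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def search(query, docs):
    """Search: return only the documents that contain every query term."""
    terms = set(tokenize(query))
    return [d for d in docs if terms <= set(tokenize(d))]


def discover(docs, top_n=4):
    """Discovery: rank terms by the number of documents that mention them."""
    counts = Counter()
    for d in docs:
        counts.update(set(tokenize(d)) - STOP_WORDS)
    return counts.most_common(top_n)


print(search("brake noise", DOCS))  # the user already knows what to ask
print(discover(DOCS))               # the collection suggests what to ask
```

The search call needs the words "brake noise" up front. The discovery call needs no query at all, and surfaces recurring terms such as "radio" that the user may not have thought to look for.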

Text mining software is designed to process unstructured information. SAS has had success in selling its technology where customers must find meaning and trends in e-mail, call center feedback from customers and textual information from individuals who file reports in electronic form. SAS Text Miner can be used to discover common themes in collections of unstructured information.

SAS Text Miner can be optimally integrated within your existing IT infrastructure with a single, unified system. The SAS Enterprise Intelligence Platform transcends organizational silos, diverse computing platforms and niche tools to deliver new insights that drive value for organizations. SAS customers have ready capabilities for data integration, analytics and business intelligence (BI). This ensures data can be cleaned, manipulated, prepared and fed into analytics for an online analytical processing (OLAP) engine, forecasting or other analytical modules before displaying results via Web graphics or interactive, flexible BI tools.

c. How SAS® Text Miner works as a discovery search tool

SAS Text Miner technology delivers extraction, automatic classification and discovery to SAS customers. The outputs of the SAS Text Miner subsystem allow you to view term statistics, identify similar documents, and view term and concept relationships in a graphic display. The system also performs automatic classification and metatagging.

If you have an existing taxonomy, SAS Text Miner can use that data and discover new categories. The system can identify clusters of related information such as call center comments or statements within e-mail. Stephen E. Arnold, ArnoldIT.com, calls this “information oversight” which, he says, “Makes it possible to spot a key fact or the needed information without wasting time clicking on links and hunting for the item of information.”

The data generated by SAS Text Miner can be used to enhance an existing search system. However, this data can be used for predictive modeling by itself. Alternatively, you can merge discovered data with existing structured data and perform

d. How SAS® Text Miner enhances the discovery process

SAS’ text analytics are an add-on to its premier data mining software solution – SAS Enterprise Miner™. The SAS Text Miner product is surfaced as an additional node to fit comfortably within the predictive analytics umbrella. SAS is the leader in business intelligence and predictive analytics software. SAS predictive analytics software provides a wide range of integrated capabilities for time series analysis and forecasting, econometrics and systems modeling, and financial analysis and reporting, with direct access to internal/external commercial data warehouses/marts. With 31 years of experience and 43,000 customer sites worldwide, SAS helps manage performance by transforming data into predictive insights.

SAS’ system assumes that the primary users will have a strong understanding of statistics and basic navigation familiarity with the Java GUI of SAS Enterprise Miner.

The benefit to the SAS Text Miner user is that virtually all of the natural language processing and statistical functionality is available within the SAS graphical environment. SAS has done the heavy lifting required to make algorithms available as drag-and-drop or point-and-click functions. According to Mary Crissey, Analytics Product Marketing Manager at SAS, “A key benefit to the user of SAS Text Miner is that the time required to graft linguistic functions onto a model is reduced by more than 90 percent.”

SAS Text Miner uses intelligent algorithms and lexical processing to automate the categorization, tagging, or building of topics and document indexes. The system:

• Provides automatic identification and indexing of concepts within the text.

• Visually presents a high-level view of the entire scope of the data processed by the system, with the ability to drill down to relevant detail.

• Enables users to make new associations and relationships and to present paths and links for further document analysis.

In 2006, SAS moved quickly to become a Google partner. Google OneBox API allows SAS to offer the Google search interface to core SAS data and reports. In addition, as the Google Search Appliance becomes more robust, SAS has the opportunity to offer collaboration and access to other Google functionalities. It’s too early in the adoption cycle to be able to quantify the value of the Google relationship, but as a privately held software company, SAS is able to move quickly and decisively.

In the same way that SAS has expanded access to BI by creating targeted user interfaces for its software that match the skill levels of individual users, SAS and Google will provide joint customers who activate the new Google OneBox for Enterprise feature of the Google Search Appliance with a familiar, secure way to search for real-time information delivered by SAS Enterprise BI Server software. In addition, the combination of the Google Search Appliance with the SAS Enterprise Intelligence Platform will give users more information than ordinary keyword searches can provide. SAS’ contextually relevant search capabilities with Google not only explore metadata but also look at the business views (SAS Information Maps) that have been defined by SAS clients.

The vision ahead for SAS will not be limited to its premier data analysis (statistical processing) capabilities. SAS is gaining a foothold in query and reporting software (BI), and its offering extends across the entire SAS Enterprise Intelligence Platform as a complete end-to-end system that includes data warehousing ETL (extract, transform, and load), which handles data input issues. The potential for SAS to deliver THE POWER TO KNOW® seems unlimited because SAS is meeting the challenge for small and large databases stuffed with traditional structured data that can be measured and quantified as well as the qualitative descriptors of data referred to as “unstructured” – text-based content that can occupy gigabytes of information in organizations. Examples of traditional structured data are age, job classification and income information. Examples of unstructured data include text-based content in Web pages, e-mail messages, articles, memos, customer feedback, warranty claims, patent information, surveys, research studies, resumes, client notes, competitive intelligence and more.

2. SAS® Text Miner is part of the SAS® Enterprise Intelligence Platform

a. Overview

The SAS platform optimally integrates individual technology components within your existing IT infrastructure into a single, unified system. The result is an information flow that transcends organizational silos, diverse computing platforms and niche tools – and delivers new insights that drive value for your organization.

The core notion of SAS is that analytic processes work off a single version of the truth. By defining an analytical platform and setting up a metadata architecture (metadata being information about data sources, including how the data was derived, business rules and access authorizations), SAS provides a centralized repository for data and information. SAS has created a variety of interfaces and tools to permit third-party applications to access information managed by SAS. SAS’ principal strength is that a client no longer has to bear the burden of figuring out the most cost-effective way to manage the complexities of integrating different repositories of data.

The company added text mining to its technology suite six years ago when interest in unstructured text first rippled through the analytics sector. SAS’ approach was to license modules from Inxight for stemming, part-of-speech tagging and entity extraction so they could link with the clustering and pattern recognition facilities of the SAS text mining solution. By contractual right, SAS continues the use of Inxight technology as it did before the Business Objects acquisition of Inxight. SAS Text Miner will continue to offer advanced technology for the integration of structured and unstructured data. Additional solutions focused on industry needs are already under development at SAS. Any future potential replacement of modules will be integrated into new releases of SAS technology, which current customers get at no extra cost.

SAS’ approach is designed to provide a single environment for data analysis, text mining and knowledge discovery. SAS’ marketing team likes to say, “SAS gives customers the power to know.” SAS Text Miner has tightly integrated text-based information with structured data. From a single interface, an analyst can obtain a complete view of structured and unstructured information to enhance analyses and decision making.

With SAS, no third-party tools are required, and SAS can “play nicely” with most third-party systems because it accommodates data in a myriad of formats and runs on a wide variety of operating systems. If a proprietary function is needed, SAS permits unlimited customization. Configuration assistance and installation experts are on hand to guide the customer’s SAS administrator in the role of managing the servers and software that comprise the SAS framework.

SAS’ text mining subsystem is available as an enterprise application (installed on a server and linked to many client computers) or a quasi-standalone subsystem (installed on a single laptop). Both text mining subsystems can:

• Filter e-mail for regulatory compliance or security applications.

• Group documents into predefined categories.

• Route news items.

• Perform cluster analysis of documents in a collection, survey data, and customer complaints and comments.

• Predict financial shifts from business information and customer satisfaction from customer comments.
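To make "group documents into predefined categories" concrete, here is a minimal Python sketch. It is an assumption-laden stand-in, not the SAS Text Miner method: the taxonomy, its terms and the `categorize` function are hypothetical, and simple keyword overlap takes the place of the statistical classification a real text mining system would apply.

```python
import re

# Hypothetical taxonomy: category name -> indicative terms (assumed).
TAXONOMY = {
    "billing": {"invoice", "charge", "refund"},
    "quality": {"defect", "broken", "noise"},
}


def categorize(text, taxonomy=TAXONOMY):
    """Return the category whose terms overlap the text most, or None."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    # Score each category by how many of its terms appear in the text.
    scores = {name: len(tokens & terms) for name, terms in taxonomy.items()}
    best = max(scores, key=scores.get)
    # A document that matches no category term is left uncategorized.
    return best if scores[best] > 0 else None


print(categorize("Please refund the duplicate charge on my invoice"))
print(categorize("Rattling noise, possibly a defect"))
print(categorize("Where is your office?"))
```

Documents that come back as `None` are the interesting ones for discovery: they fall outside the existing categories and may point to a category the taxonomy does not yet have.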

b. SAS® creates an information circuit for organizations

SAS is a system, not a single software package. The figure below shows the overall SAS architecture. The SAS Text Miner module (highlighted in yellow) sits in the client tier of the SAS framework. SAS Text Miner is a subcomponent of the SAS Enterprise Miner subsystem.

SAS, now in Version 9, provides a comprehensive information environment. It includes servers, data management tools, analytic routines, APIs, and such advanced features as the exchange of data, scripts, and models across the dozens of SAS modules. Featuring full support of portal-accessible text mining, SAS provides a comprehensive solution to organizations that need to analyze large volumes of textual and numeric data.

In this information circuit, information flows into the various servers. Analysts manipulate the data in the servers using a range of tools. Users of the data can access the reports in standard office software applications such as Microsoft Excel.

Before looking at the specific functions of the SAS Text Miner module, let’s review SAS’ twin development paths, which have yielded some important innovations.

The first is the development of the SAS Intelligence Storage system. The SAS customer can rely on SAS’ framework to provide data warehousing, integrity analysis tools and administrative interfaces when manipulating the gigabyte and terabyte collections of data found in many organizations. For the analyst, the data management techniques do not intrude on the specific statistical processes an individual analyst uses. However, those accessing and manipulating data see that the speed with which SAS handles certain data-intensive processes has increased.

To illustrate SAS’ storage efficiency, consider an Oracle table containing 10 terabytes of XML data. The Oracle database will require 10 terabytes for the table data and another 30 to 40 terabytes of storage for indices and the swap space that will allow for efficient data manipulation. The cost of storage is dropping, but the volume of data is increasing. It makes sense to minimize disc space requirements in order to hold down costs and to have the ability to process new data. The SAS Intelligence Storage system requires 10 terabytes for the data plus an additional 10 terabytes of space for metadata and data manipulation. SAS has mostly removed the bottlenecks while implementing a multithreaded kernel capable of scaling up to handle clusters of commodity servers. At the same time, SAS has reduced the burden on the database administrator.
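The storage comparison above comes down to simple addition. The short Python sketch below works it through using only the figures quoted in the paragraph; the function and variable names are illustrative and do not come from SAS or Oracle.

```python
def total_storage_tb(data_tb, overhead_tb):
    """Total footprint in terabytes: raw data plus working overhead."""
    return data_tb + overhead_tb


DATA_TB = 10  # size of the XML table in the example

# Oracle: 30 to 40 TB of indices and swap space on top of the data.
oracle_low = total_storage_tb(DATA_TB, 30)
oracle_high = total_storage_tb(DATA_TB, 40)

# SAS Intelligence Storage: 10 TB for metadata and data manipulation.
sas_total = total_storage_tb(DATA_TB, 10)

print(f"Oracle: {oracle_low} to {oracle_high} TB")
print(f"SAS:    {sas_total} TB")
print(f"Saved:  {oracle_low - sas_total} to {oracle_high - sas_total} TB")
```

On these figures the SAS footprint is 20 terabytes against 40 to 50 terabytes for Oracle, or between 40 and 50 percent of the Oracle total.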

The second innovation has been the refinement of tools for individual analysts and the services available to an analyst to create a boiled-down, graphical snapshot of key findings. If the tools built into the SAS enterprise mining and text mining subsystems are not adequate, an analyst can turn to the company’s graph product and create Madison Avenue-style charts complete with wizards and previews of a graph or chart.

The components for SAS Text Miner are the SAS® 9.1 system, SAS Enterprise Miner, Java and the necessary hardware servers. Before SAS Text Miner can be used, SAS requires that a basic statistical foundation be installed. SAS Enterprise Miner installs on that base. Finally, the SAS Text Miner code installs into the SAS Enterprise Miner module. There is, therefore, no way to use SAS Text Miner without licensing other software from SAS. Pricing can vary depending on the specific requirements of the client organization.

The platform concept is an important one. The idea is that a licensee can build an enterprisewide information analytics system on SAS technology. SAS’ descriptions of its platform simplify to a great degree the large number of components the company has developed. For example, a licensee can acquire software that facilitates grid computing. SAS also offers a storage management system. Both grid and storage technology are pivotal to the success of a large-scale text mining application.

3. SAS – The company

a. History

SAS can boast that it is one of the largest privately held software companies in the world. Headquartered in Cary, North Carolina, the company generates about US$1.5 billion in annual revenue. The company’s analytic products range from statistical toolkits to enterprisewide data analysis systems. The company’s software is included as part of the curriculum in data mining and BI in the US, Europe, Australia and elsewhere.

SAS began at North Carolina State University in 1976, when Jim Goodnight, now CEO, launched the new company with three of his colleagues. The company has grown to encompass 10,000 employees and 400 offices worldwide.

Dabblers in analytics will find SAS’ products inscrutable. Training in specific SAS technical approaches is being incorporated into more and more university campuses today – especially into business programs, which value the competitive advantage analytics provides. SAS public training centers are located around the world, staffed by certified instructors who also travel to customer sites as well as teach via Live Web. SAS Self-Paced e-Learning tutorials are becoming quite popular. SAS Press publishes a wide variety of books that clarify the terminology, technical framework and best practices customers are implementing with SAS software. Software user manuals and documentation are posted on the Internet for the public to see, so even those who have not yet installed the software can look up the mathematical algorithms incorporated in a certain procedure or product of interest.

SAS has worked diligently to make its tools accessible and user friendly in response to market demands and feedback from its large customer base across the globe. Today’s users attracted to the data mining solutions include traditional statisticians, business managers and university researchers as well as domain-specific industry professionals.

Today SAS is recognized as a software provider that does more than number-crunching and fancy data manipulation. Analysts have stated that SAS sets the standard for BI.¹

¹ http://www.sas.com/news/analysts/by_technology_bi.html

For those unfamiliar with SAS, the company provides several choices for kicking off numerical analysis processes, such as the SAS Enterprise Guide point-and-click interface or JMP® software (version 7, released in May 2007, links to SAS data sets). Stored services now provide automatic ways of rerunning common tasks so programming tools can be saved and quickly automated. These built-in options and user interface make it easy for business managers to produce and understand reports generated by SAS. Programmers trained to use SAS can make jumps through flaming hoops by adding their own creative touches with personalized macro programming or tailored interfaces. SAS wants to enable its customers to deploy as basic or as complex a text mining system as they desire. Then, when it becomes necessary to add additional functionality – for example, advanced data analytics – SAS components will work smoothly.

SAS continues to maintain its leadership role in the analytics market as it has been doing for more than three decades. New niche vendors enter the analytics market sector, but none possesses the breadth and depth of SAS analytics completed by strong data integration and BI capabilities across the platform.

The programmer-analyst with a strong foundation in statistics will revel in SAS’ technology and its power. A person who invests the time to learn the SAS approach will find that SAS delivers usable, flexible and scalable analytic tools.

b. Customers

Managers and professionals with SAS experience are strong advocates of the system. In articles and speeches, SAS customers emphasize:

• Reduced time-to-decisions and a more accurate organizational view. By combining structured data and unstructured text, and automating the process of analyzing the data, SAS Text Miner helps organizations gain meaningful insights that successfully drive overall business direction.

• Improved organizational performance. While most software vendors offer classification of one text field into a single class, SAS Text Miner enables the classification of multiple structured and unstructured fields. It improves your organization’s performance by distilling information from multiple business units into a format that is easy to manage and analyze.

• Ability to recognize trends and predict business opportunities. Analysis of information such as customer letters and call center notes may provide valuable information about customer dissatisfaction or insights into service and product needs.

SAS has implemented a number of interesting text mining systems for clients worldwide. Two examples below illustrate the use of the SAS text mining system in both academia and business:

(i) University of Louisville uses text mining, data analysis and geospatial mapping to understand medical conditions and improve patient care in hospitals.

(ii) Honda uses text mining in warranty and quality analyses.

(i) Understanding medical trends

The University of Louisville used SAS text and data mining technology to identify major trends in several cardiac and respiratory conditions, and to analyze correlations with factors such as median income and geographical location for Jefferson County patients in Kentucky. SAS and ArcGIS were used to compile, author, analyze, map, and publish geographic information and knowledge.


SAS in combination with ESRI tools generated graphic displays of data on maps of Jefferson County and kernel density comparison graphs of the combined data from structured tables and text mining.

This study ultimately demonstrated that text mining and clustering allowed for efficient statistical data analysis. Furthermore, the bridge between SAS and ESRI allowed for a statistical formulation of the data as well as a visual interpretation of the patients and health care provider locations in Jefferson County.

According to author Patricia Cerrito:

“SAS has integrated its analytic tools better than any other software vendor. Text Miner highlights relevant patterns in documents such as clinical reports, and it quantifies text-based information. We move seamlessly to Enterprise Miner to combine and analyze this unstructured text with structured data such as demographics and laboratory values. And, we augment that analysis with SAS/STAT. That’s why we standardized on SAS for our research. No other software delivered this depth and breadth of analytic functionality. In addition to superior algorithms, SAS delivered the simplest interface for managing and importing data from any source. During any given workday, these benefits add up to significant ROI at the University of Louisville.”2

(ii) Improving quality in automotive manufacturing

American Honda launched an initiative to extract early warning of potential product problems from customer feedback. An alert system was implemented that used SAS Text Miner and SAS’ quality control module. Although analysts already used comment fields to understand automotive problems under investigation, the new system had to “read” incoming free-form text data systematically and generate outputs that could be analyzed by automated SAS Analytics. The system used clustering models to group similar warranty claims together. New data about a particular auto model was then processed, scored and placed into the best cluster. American Honda analysts monitor cluster sizes and variances weekly. Deviations from expected values are flagged for human review.

The overall Honda process consisted of taking warranty and customer data (both structured and unstructured) and creating a common SAS representation. The data moved through a model review, scoring and statistical process control sequence. The outputs generated alerts when certain warranty-related values were above the expected threshold.
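The alerting step can be pictured with a short Python sketch. It is not SAS code and does not reproduce Honda's system. It assumes, purely for illustration, that weekly claim counts per cluster are already available, and it applies a common control-chart rule: flag a cluster when the current count exceeds the historical mean by more than three standard deviations. The function name, the three-sigma limit and the sample figures are all hypothetical.

```python
from statistics import mean, stdev

def flag_deviations(history, current, sigmas=3.0):
    """Return cluster ids whose current weekly count exceeds the
    upper control limit (mean + sigmas * sample standard deviation)."""
    alerts = []
    for cluster_id, counts in history.items():
        upper_limit = mean(counts) + sigmas * stdev(counts)
        if current.get(cluster_id, 0) > upper_limit:
            alerts.append(cluster_id)
    return sorted(alerts)

# Hypothetical weekly warranty-claim counts per cluster.
history = {
    "brake noise": [10, 12, 11, 9, 10, 11],
    "door seal": [4, 5, 3, 4, 5, 4],
}
current_week = {"brake noise": 11, "door seal": 12}

alerts = flag_deviations(history, current_week)
print(alerts)
```

In this sketch only the cluster whose count jumps well beyond its usual range is passed on for human review; a production system would also track variances, as the text notes.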

2 Dr. Patricia Cerrito, University of Louisville, quoted by SAS in a December 2003 news release. See http://www.sas.com/news/preleases/121503/news1.html


The text mining process highlighted in the figure above consisted of seven distinct steps. The outputs of these steps were clustered items.

4. Data mining products available from SAS

a. Highlighting the SAS® Text Miner solution

SAS Text Miner is an add-on to the SAS Enterprise Miner application. SAS Text Miner runs as a node within the larger SAS framework. SAS Text Miner provides a number of tools for discovering and extracting knowledge from text documents. The system transforms unstructured textual data into a usable, structured format. The unstructured data feeds into the SAS System, which transforms it into documents that are classified, displayed in relationship diagrams, clustered into categories and/or incorporated with other structured data for further analysis. The outputs of SAS Text Miner can be counted, parsed and analyzed in several ways.

SAS Text Miner provides tools for discovering and extracting information from a widely varied collection of text documents. The tools can uncover verbal patterns and concepts that are embedded within the e-document collection.

i. Textual data comes in many flavors

SAS Text Miner can process volumes of textual data including:

• Web content.

• E-mail messages.

• News articles.

• Research papers.

• Surveys.

• More than 200 office file types.


ii. Competitive differentiators

SAS’ approach to text mining has three distinctive features:

• Text mining is tightly integrated into SAS' broader system for statistical analysis. The benefit is that an analyst can use tools as opposed to having to figure out how to get different tools to “play well” with other applications.

• The programming and procedures expertise of SAS eliminates most of the learning curve associated with text mining.

• Data from the text mining module (what SAS calls a node) is available to other SAS processes and procedures. Data from text mining can be merged with numeric data from other analytic processes without a formatting step.

Because SAS Text Miner runs as a function within the SAS operating system, its outputs can be further processed without conversion by any other SAS software module. This integration of text mining outputs into usable formats is a definite advantage. SAS was one of the first companies to deliver a text mining solution able to merge existing structured data with the structured outputs of its text mining subsystem. By combining text and structured data, the reliability of predictive analyses improves.
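To make the idea of merging text mining outputs with structured data concrete, here is a minimal Python sketch. It is an illustration only, not SAS code: it assumes each record carries a shared key (a hypothetical `claim_id`) and that the text mining step has already produced a cluster label per document. All field names and values are invented for the example.

```python
def merge_text_features(structured_rows, text_features, key="claim_id"):
    """Attach text-derived fields to each structured row that shares the key.
    Rows without a matching text record are kept unchanged."""
    merged = []
    for row in structured_rows:
        features = text_features.get(row[key], {})
        merged.append({**row, **features})
    return merged

# Hypothetical structured warranty records.
structured_rows = [
    {"claim_id": 1, "mileage": 12000},
    {"claim_id": 2, "mileage": 48000},
]
# Hypothetical output of a text mining step, keyed by claim_id.
text_features = {1: {"text_cluster": "brake noise"}}

merged = merge_text_features(structured_rows, text_features)
print(merged)
```

Once both kinds of fields sit in the same record, a predictive model can use them side by side, which is the benefit the paragraph above describes.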

iii. Key features

SAS Text Miner guarantees compatibility with other SAS components. Furthermore, the text mining subsystem can be integrated into the SAS platform without custom programming. Other features of SAS Text Miner include:

• An integrated interface for analyzing text (unstructured data) in conjunction with multiple related database (structured) fields. The graphical user interface significantly reduces text mining time for both business analysts and statisticians.

• Text parsing to extract terms or phrases from large text collections, while stemming terms to root form based on parts of speech and finding phrases of interest such as abbreviations, country names, organization names and other specific entities.

• Transformation of data into a structured representation of the source content. The system distills key concepts contained in large document collections and analyzes relationships between isolated terms or phrases and documents. The transformed representation of the source text structures it for use in data exploration, clustering and predictive modeling.

• An interactive results browser that enables analysts to interactively explore concepts and relationships between documents and to make modifications to further tailor analyses.

• Full integration with SAS Enterprise Miner’s data analytics software. The analytics provide comprehensive analytic tools for text and related structured data.

iv. SAS® Text Miner under the covers

SAS Text Miner runs on most operating systems, including Linux and Windows. With the latest release, it runs on Solaris and AIX UNIX environments as well. SAS Text Miner is computationally intensive and requires dedicated hardware for optimum throughput.

SAS Text Miner relies primarily upon pattern recognition technology instead of a linguistics-centric or dictionary-based approach. SAS’ text mining software produces a numerical representation of a document. In general, SAS Text Miner delivers higher accuracy when a large number of documents are processed.


SAS Text Miner is best suited to applications where hundreds or thousands of megabytes of unstructured content must be processed rapidly. The system can generate reports that make it easy for an analyst or end user to grasp the key facts, ideas and trends in the processed content.

The processes of text mining use the numerical representation to provide insights into one or more documents in a collection.

The system performs text mining by moving each document through four processing steps. Documents are acquired and preprocessed so that a single data set is formed. Next, leveraging Inxight’s LinguistX Platform and ThingFinder, SAS Text Miner performs parsing and markup of the documents in the set. The processed documents are then transformed, objects tagged, and reduced to a structured form. The final stage in the process is the manipulation of the newly generated data set by analytic routines. There is no limit to the number of documents in a collection. Analysis can be performed on a single collection or across multiple collections as well as combinations of structured data from other applications and the outputs of the text mining process.

The SAS Text Miner performs preprocessing, text parsing, transformation, document reduction (creating a structured representation of the objects and associated metadata), and analysis.

The process flow for mining text consists of a series of actions that are performed on a static collection of data. The SAS Text Miner approach assumes that the collection or corpus to be analyzed is not updated during the parsing and other text mining processes. SAS Text Miner works through a sequence of activities in order to generate numerical snapshots of documents and clusters of objects, such as major concepts, in the collection itself.

There are several core processes that result in a cohesive set of data and reports about the documents processed. These basic processes are:

• Parsing: separating words and phrases in the document.

• Stemming: locating roots.


• Synonym identification: mapping roots in the document to synonyms.

• Stop list analysis: elimination of stop words via the file sashelp.stoplst.

• Term by document matrix generation: mathematical representation of each document.

• Singular value decomposition: a process to enable fast processing of numerical representations of documents.

• Clustering: a process that groups objects that are in some way related.

The results of these processes can be used by other SAS analytic routines for additional analyses and graphic expressions of the relationships identified by the algorithmic processes.
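Three of the processes listed above (parsing, stop list analysis and term by document matrix generation) can be sketched in a few lines of Python. This is a simplified illustration, not the SAS implementation: the tokenizer, the tiny stop list and the sample documents are assumptions made for the example, and stemming, synonym handling, singular value decomposition and clustering are left out.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption, not the contents of sashelp.stoplst).
STOP_WORDS = {"the", "a", "is", "and", "of", "to"}

def tokenize(text):
    """Parsing: split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_document_matrix(documents):
    """Return (sorted terms, matrix) with one row per term and one column per document."""
    counts = [Counter(t for t in tokenize(doc) if t not in STOP_WORDS)
              for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix

docs = [
    "The brake is noisy",
    "Noisy brake and worn brake pad",
    "The door seal leaks",
]
terms, matrix = term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(f"{term:<6} {row}")
```

Each row of the matrix is the numerical representation of one term across the collection, and each column is one document. Decomposition and clustering routines would then operate on this matrix.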

v. Reducing the manual work of text mining

Raw text is input to SAS Text Miner. Two pre-analysis steps are required: preparing a start list and a stop list. These are tasks usually performed by an analyst. With each release of SAS Text Miner, SAS has included routines that can reduce some of the necessary manual work. The two lists are important because the accuracy of the clustering process ultimately depends on the “knowledge” embodied in each list.

SAS Text Miner performs stemming in English, French, German and Spanish. Stemming allows variants to be grouped in one term set. An example appears in the table below.

Stem              Terms
aller (French)    vais, vas, va, allez, allons, vont
reach             reaches, reaching, reached
big               bigger, biggest
wagon             wagons
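The grouping shown in the table can be mimicked with a simple lookup in Python. This sketch is only a dictionary built from the rows of the table; it is an assumption made for illustration and says nothing about how SAS Text Miner derives stems internally.

```python
# Stem table taken from the rows above: stem -> variants.
STEM_TABLE = {
    "aller": ["vais", "vas", "va", "allez", "allons", "vont"],
    "reach": ["reaches", "reaching", "reached"],
    "big": ["bigger", "biggest"],
    "wagon": ["wagons"],
}

# Invert the table so each variant points to its stem.
VARIANT_TO_STEM = {variant: stem
                   for stem, variants in STEM_TABLE.items()
                   for variant in variants}

def stem(term):
    """Return the stem for a known variant, else the lowercased term unchanged."""
    term = term.lower()
    return VARIANT_TO_STEM.get(term, term)

tokens = ["Reaching", "bigger", "wagons", "truck"]
stems = [stem(t) for t in tokens]
print(stems)
```

Terms that share a stem are counted as one entry, which keeps the term set smaller and the resulting clusters tighter.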

SAS Text Miner for SAS® 9 is available in 15 languages. Customers receive SAS Text Miner for English and a choice of one additional language, with each customer’s native language recommended.

The analyst can prepare a start list, a list of words and phrases for the system to identify. Use of a start list can reduce the amount of time required to process a corpus or repository of text.

In any collection of documents, certain terms must be identified in a stop list so that SAS Text Miner processes ignore these words when processing the corpus.

Each corpus contains words that may be used as equivalents. Examples include a product name, its model number or an acronym. For example, F-150 may be a synonym for the bound phrase “light truck.” SAS Text Miner uses synonyms to group concepts into a meaningful cluster.


Equivalent terms are illustrated in the table below:

Term                        Parent                  Category
appearing                   appear                  Verb
EM                          SAS Enterprise Miner    Product
Employees                   Employee                Noun
Administrative Assistant    Employee                Noun
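A synonym list of this kind amounts to a mapping from each term to its parent. The Python sketch below is a hypothetical illustration of that mapping, using only the four rows of the table; the case-insensitive matching and the function name are assumptions, not documented SAS behavior.

```python
from collections import Counter

# Rows from the table above: term -> (parent, category).
EQUIVALENTS = {
    "appearing": ("appear", "Verb"),
    "em": ("SAS Enterprise Miner", "Product"),
    "employees": ("Employee", "Noun"),
    "administrative assistant": ("Employee", "Noun"),
}

def to_parent(term):
    """Map a term to its parent; unknown terms are returned unchanged."""
    parent, _category = EQUIVALENTS.get(term.lower(), (term, None))
    return parent

terms = ["Employees", "Administrative Assistant", "EM", "warranty"]
parent_counts = Counter(to_parent(t) for t in terms)
print(parent_counts)
```

Because “Employees” and “Administrative Assistant” resolve to the same parent, they contribute to a single concept when documents are clustered.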

SAS has developed a number of tools to assist an analyst with discovering and extracting information from a wide variety of text documents. SAS Text Miner can identify themes and concepts that are contained in the collection. The output of a SAS Text Miner analysis allows an analyst to understand the information contained in a collection of documents. An analyst or editor does not have to preprocess or pretag the content processed by SAS Text Miner.

The SAS Text Miner module generates reports. Unlike other SAS nodes, SAS Text Miner does not produce code that can be used outside the mining activity, nor does SAS Text Miner produce PMML. Nevertheless, the tight integration of the SAS Text Miner functionality in the SAS interface is attractive. An analyst familiar with SAS’ data mining software can be up and running with SAS Text Miner often in as little as one day.

b. Highlighting SAS® Enterprise Miner™

i. What is the difference between SAS Text Miner and SAS Enterprise Miner?

SAS Enterprise Miner provides the process flow diagram to manage the analytical investigation as the data analysis project is pursued. The text mining application is turned on by clicking the Text Miner node located on the workspace diagram of SAS Enterprise Miner.

SAS Enterprise Miner streamlines the data mining process. The software provides an integrated system to the analyst. From a single interface, the analyst can access data, test models and generate reports for end users. SAS Enterprise Miner supports collaborative work processes.

SAS provides the most complete data mining solution available. The system supports integration with third-party enterprise applications and desktop software. Delivered as a distributed client/server system, SAS Enterprise Miner scales and is well-suited for data mining in large organizations.

SAS Enterprise Miner is designed for data miners, marketing analysts, database marketers, risk analysts, fraud investigators, engineers and scientists.

Through data mining, which SAS defines as “the process of selecting, exploring and modeling massive amounts of data,” an analyst can uncover previously unknown patterns and help improve decision making across the enterprise. You may find more information at http://www.sas.com/technologies/analytics/datamining.

Many data mining solutions are standalone applications and often do not integrate easily into existing applications. Furthermore, scalability is a key issue, particularly when data sets can run into terabytes. Consequently, quantitative specialists must spend time accessing, preparing and manipulating disparate data. SAS Enterprise Miner automates these data preparation tasks. SAS Enterprise Miner allows analysts to model data and apply their expertise to solve specific business problems. It also ensures that the results can be deployed into scoring engines for actual implementation.


SAS Enterprise Miner includes a broad set of tools that support the complete data mining process. The functions are exposed in a graphical user interface. Most aspects of model building can be completed by pointing and clicking on a SAS Enterprise Miner option. SAS Enterprise Miner includes a process flow diagram system that eliminates the need for most manual coding and helps reduce the time required to construct models. SAS Enterprise Miner can generate diagrams to save the various types of analytical pattern-detecting algorithms explored. These diagrams serve as self-documenting templates. These templates can be updated as necessary and be applied to new problems without starting the analysis from scratch. Diagrams can also be shared with other analysts throughout the enterprise.

SAS Enterprise Miner offers a range of integrated assessment features. These features allow an analyst to compare the results generated by different modeling techniques. Model results can be easily shared throughout the organization. The models reside in a SAS data mining model repository. SAS was one of the first business intelligence vendors to implement model management in an enterprise intelligence framework.

Scoring is the process of applying a model to data. The output of a data mining process consists of raw scores. SAS Enterprise Miner automates the model scoring process and supplies complete scoring code for each stage of model development. The scoring code can be deployed in numerous real-time or batch environments within SAS, on the Web or directly in relational databases.
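To show what “applying a model to data” means in practice, the following Python sketch scores two records with a logistic regression formula. It is not scoring code generated by SAS Enterprise Miner; the coefficients, field names and records are hypothetical values chosen for the illustration.

```python
import math

# Hypothetical fitted coefficients; in practice these come from model training.
MODEL = {
    "intercept": -2.0,
    "weights": {"claims_last_year": 0.8, "mentions_leak": 1.5},
}

def score(record, model=MODEL):
    """Apply a logistic regression model to one record and return a probability."""
    z = model["intercept"] + sum(
        weight * record.get(name, 0.0)
        for name, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical records: one structured field and one text-derived flag.
records = [
    {"claims_last_year": 0, "mentions_leak": 0},
    {"claims_last_year": 2, "mentions_leak": 1},
]
scores = [score(r) for r in records]
print([round(s, 3) for s in scores])
```

Because the scoring function is a self-contained formula, the same logic could run in a batch job, behind a Web service or as a database expression, which is why exported scoring code is useful.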

SAS Enterprise Miner is delivered as a distributed client/server system. The subsystem eliminates the need for an organization to have niche data mining applications.

An analyst can drag and drop icons onto the process flow diagram using SAS Enterprise Miner’s graphical process flow tools.

The SAS Interface allows most functions related to programming and data manipulation to be handled within a multi-paned interface. The SAS system generates needed code automatically reducing the time an analyst spends creating scripts so that more “what if” and alternative model investigations can be conducted.


SAS Enterprise Miner implements the SEMMA data mining process. SEMMA is an acronym for SAS’ core approach to data analysis: sample, explore, modify, model and assess. The idea is that an analyst can experiment with different numerical recipes, see results, make changes and then generate results from the optimal model, often described by SAS as the champion model. SEMMA is a logical organization of the functions in the SAS Enterprise Miner subsystem. SEMMA provides a flexible framework for conducting the core tasks of data mining.
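The five SEMMA stages can be walked through in miniature with plain Python. This is a toy sketch under stated assumptions, not a representation of SAS Enterprise Miner: the data set is invented, the “models” are three fixed cutoff rules rather than trained models, and accuracy on a validation partition stands in for the assessment step.

```python
from statistics import mean

# Hypothetical labeled data: (feature value, outcome).
data = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1),
        (2, 0), (5, 1), (3, 0), (4, 1)]

# Sample: split into training and validation partitions.
train, valid = data[:6], data[6:]

# Explore: summarize the training feature.
train_mean = mean(x for x, _ in train)

# Modify: center the feature on the training mean.
def center(x):
    return x - train_mean

# Model: candidate rules that predict 1 when the centered value exceeds a cutoff.
candidates = {"low_cutoff": -1.0, "mid_cutoff": 0.0, "high_cutoff": 1.0}

def accuracy(cutoff, rows):
    """Share of rows where the rule's prediction matches the outcome."""
    hits = sum((center(x) > cutoff) == bool(y) for x, y in rows)
    return hits / len(rows)

# Assess: compare candidates on the validation partition and keep the champion.
results = {name: accuracy(c, valid) for name, c in candidates.items()}
champion = max(results, key=results.get)
print(results, champion)
```

The point of the sketch is the loop of trying alternatives and keeping the best performer on held-out data, which is the sense in which the text speaks of a champion model.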

SAS Enterprise Miner uses a Java client and the SAS server architecture. The result is that the mining computational server is separate from the user interface. An analyst can configure an installation that scales from a single-user system to very large enterprise solutions with no changes to the model or recoding. Certain process-intensive server tasks such as data sorting, summarization, variable selection and regression modeling are multithreaded. SAS Enterprise Miner can automatically distribute the workload over multiple CPUs. Its processes can be run in parallel and asynchronously. Large or repetitive training or scoring processes can be scheduled for batch processing during off-peak demand hours on the analytical server without programming.

SAS Enterprise Miner bakes in advanced predictive and descriptive modeling tools and algorithms. An analyst can implement decision trees, neural networks, auto neural networks, memory-based reasoning, linear and logistic regression, clustering, associations, and time series, among others, by pointing and clicking on a menu choice or a radio button. With multiple models and algorithms ready to run in a single subsystem, SAS Enterprise Miner allows an analyst to explore data and the outputs of different models from a visual user interface.

SAS Enterprise Miner includes assessment capabilities. These help ensure that a common framework for comparing different modeling techniques is available to the analyst.

SAS Enterprise Miner includes tools to help the analyst perform such tasks as sampling, data partitioning, missing value imputation, clustering, merging data sources, dropping variables, customized SAS language processing via the SAS code node, variable transformation and filtering outliers. Extensive descriptive summarization features are included, as well as advanced visualization tools that permit examination of large amounts of data in multidimensional plots and graphically compare modeling results.

The platform-independent Java-based GUI of SAS Enterprise Miner provides enhanced statistical graphics with flexible actions and controls. A Java graphics wizard is also provided for developing customized graphs. Plots and underlying tables are dynamically linked by SAS Enterprise Miner to allow interactions with the model and other components of the system.

The SAS Enterprise Miner environment features integrated assessment functions. These allow an analyst to compare the results of different modeling techniques. An analyst can gauge a model’s effectiveness in statistical terms and in more concrete measures such as money saved or new sales revenue.


Scoring data often appear in a series of tables. SAS’s approach is to make the scoring data for various runs accessible in a point-and-click interface.

SAS Enterprise Miner features a form of version control for models. Analysts can keep track of the models that have been updated and compare the improvements in accuracy over time. Models developed for other projects can be located and their components reused in a new project.

SAS Enterprise Miner allows the analyst to save schematics and flow diagrams. It exports these objects as XML files. An analyst can create compressed model results packages containing the data used in a data mining process flow, including data preprocessing, modeling logic, model results and scoring code.

These model result packages can be registered to the SAS Metadata Server for subsequent retrieval and querying by data miners, business managers and data managers via a Web-based model repository viewer. When introduced in 2002, SAS had the only Web-based system for effectively managing and distributing large model portfolios throughout the organization.

ii. Model scoring

In some organizations, the analyst makes the model available to managers in operating units. Scoring, sometimes called weighting of different factors in a model, requires considerable expertise. Scoring in some text mining systems is a time-consuming process. In some cases, the analyst will have to write or edit scoring scripts. The SAS Enterprise Miner environment automates virtually all of the scoring code process. If a manual adjustment is required, SAS provides scoring code in SAS script, C, Java and PMML (predictive model markup language). The SAS implementation captures the code of an analyst’s models, supports the import and export of PMML models for Naive Bayes and association rules models, and codes for preprocessing activities such as data normalization.


Once the scoring code has been captured, an analyst can:

• Score the production data set within SAS Enterprise Miner.

• Export the scoring code and score on a different machine.

• Deploy the scoring formula in batch or real-time on the Web or directly in relational databases.

iii. Customizing applications

SAS Enterprise Miner can be customized and extended. The default tool library of SAS Enterprise Miner allows an analyst to script additional functions using SAS code and XML logic. SAS supports a Java API that can be embedded into customized applications. SAS makes it possible for an analyst to build a proprietary analytical application such as having an OLAP report capability linked with a third-party application within the same interface.

c. How do these products interrelate?

The flagship product SAS Text Miner features a comprehensive range of text mining tools to speed search and data analysis, including:

• Noun group extraction support for all languages.

• Additional entity types.

• An enhanced TMFILTER macro tool.

• UNIX support.

• More robust synonym list processing and parsing, including a revised DOCPARSE procedure and DOCSCORE DATA step scoring function.


5. SAS® Text Miner Feature Snapshot*

Feature    Comment

Analytics functions    A comprehensive range of applications in a diverse range of industries exists. The most popular include (1) fraud detection, (2) warranty early warning and (3) supplier relationship management statistics. Comprehensive text preprocessing capabilities include:

• Capture and distill the most important underlying information within a document collection.

• Default or customized stop lists for each language to remove terms with little or no informational value.

• Automated spelling correction.

• Stemming to identify root words.

• Part-of-speech tagging based on sentence context.

• Noun group extraction for identifying phrase-level concepts such as “competitive intelligence.”

• User-defined multiword tokens, such as “point and click.”

• User-customized and default synonym lists.

• Compound word splitting into distinct subterms.

API    APIs are provided with complete documentation and sample code.

Controlled Vocabulary Support    The system can make use of existing dictionaries as well as develop one from scratch as documents are analyzed and common terms are surfaced. Instead of selling industry-specific dictionaries, SAS offers special routines that allow domain experts to rapidly tailor their own stop/start lists of synonyms with user-friendly interactive point-and-click programming interfaces. For a predictive modeling exercise, significant effort with industry-specific dictionaries can degrade results. If a customer has his or her own dictionary, SAS can incorporate that into the model.

Direct third-party support    Third-party applications can make use of SAS outputs. The APIs are used to link SAS Text Miner with other systems and software. Score code can be saved as a stored process and deployed through a familiar BI client such as Excel, SAS Web Report Studio, SAS Information Delivery Portal and SAS Enterprise Guide.

Document conversion support    SAS Text Miner can read text stored in a variety of document formats, such as PDF, ASCII, HTML, Microsoft Word and WordPerfect.

Entity extraction    People, places, events, dates and other objects can be extracted.

Language support    Danish, Dutch, English, Finnish, French, German, Italian, Japanese, Korean, Norwegian (Bokmal), Portuguese, Spanish, Swedish, Traditional Chinese and Simplified Chinese.

• Support for Latin-1, Double Byte Character and UTF-8 encodings.

• European languages (Latin-1 encoding): Danish, Dutch, English, Finnish, French, German, Italian, Norwegian (Bokmal), Portuguese, Spanish and Swedish.

• Far-Eastern languages (Double Byte Character Support): Japanese, Korean, Simplified Chinese and Traditional Chinese.

Encoding support for Unicode UTF-8 allows analysis of non-English interfaces and the languages listed above.

Professional services    SAS offers free Web tutorials, on-site education and training, on-demand Web seminars, engineering, conferences, academic initiatives for schools and universities, and consulting services.

Programming languages    Major programming languages are supported. Graphical interfaces are available to modify certain system functions.

Ready-to-run modules    SAS systems require installation and some setup prior to operation. Minimum processor speed is 1 GHz. Memory requirements: 512 MB server. Disk space required: 40 MB thin Java client, 300 MB SAS client, 300 MB server (average Win XP install).

Platform    SAS runs on Windows (32-bit), AIX and Solaris (64-bit).

Crawler    The system includes tools to acquire information. Additionally, the system can process information pushed to watched folders.

Standards support    The SAS Text Miner system supports Java and XML standards. SAS is actively participating in the UIMA standards discussions with IBM and other alliance partners, all of whom join in the PMML discussions at DMG.org and the Grid Computing Forum. SAS is monitoring development efforts to identify and select meaningful metadata so that when it does integrate with a UIMA-compliant standard of application, it will be useful to SAS customers, especially for analysis purposes.

Taxonomy management tools    The system provides a graphical interface to manage classification systems and represent the class/subclass relationships in a hierarchical fashion, displaying documents in a “tree.”

Taxonomy support    SAS allows the taxonomy to be “discovered” with an unsupervised method, hierarchical clustering. The system also “discovers” the labels for the classes and subclasses. Alternatively, SAS Text Miner can follow a predefined classification system, if available.

Third-party software support    Most third-party and proprietary systems can be supported by SAS Text Miner and SAS Enterprise Miner.

Visualization functions    SAS provides a range of data visualization tools. The concept-linking diagrams have been tailored specifically for SAS Text Miner. User-directed concept linking provides flexible navigation to visualize complex hidden relationships between terms, phrases and entities (such as people and place names). Concept linking, now integrated with the Interactive Results Browser for enhanced usability and greater insight, helps detect patterns in a clear visual fashion that may not have otherwise been observed.

* Features described reflect the November 2006 release of SAS Text Miner.


6. Conclusion

SAS has done the work required to ease the integration of text with structured data elements through drag-and-drop or point-and-click functions. SAS Text Miner provides a full range of predictive modeling tools that can unearth opportunities for timely exploitation. Benefits include:

• Data in a SAS System can be traced. The notion of data lineage is important when the organization must operate within the guidelines of Sarbanes-Oxley, Basel II, and other compliance and governance requirements.

• A large organization can develop a comprehensive, customized text mining system without the cost and time required to integrate third-party tools into a cohesive system.

• SAS has continued to add functionality to the SAS Text Miner node; for example, SAS has added a credit scoring function that can use a wide range of numeric and textual information.

SAS Text Miner’s approach sets it apart from many business intelligence and text mining systems. Because SAS Text Miner is a component residing within a larger data management and analytic environment, an organization using SAS Text Miner can repurpose models, streamline “what if” and alternative model analysis, and merge outputs from SAS Text Miner with data from numeric or hybrid systems. These built-in services slash the time and programming required to achieve similar goals using software from multiple vendors.

The programmer analyst with a strong foundation in statistics will revel in SAS’ technology and its power. A person who invests the time to learn the SAS approach will find that SAS delivers usable, flexible and scalable analytic tools. End users benefit from visual presentations of complex data and point-and-click interfaces created with SAS’ built-in tools. Adequate computational horsepower, system specialists, and analysts with the requisite programming and statistical expertise are all that’s needed to make SAS hum.

In summary, integrating text-based information with structured data enriches your predictive modeling capabilities and provides new stores of insightful information for driving your business and research initiatives forward. Text mining supports a wide variety of applications, such as categorizing huge collections of call center data, Web text and blogs analysis, finding patterns in customer feedback or employee surveys, detecting emerging product issues, analyzing competitive intelligence reports or patent databases, and classifying reports and company information by topic or business issue.

Achieving industry-leading status almost upon its introduction, these SAS data mining technologies continue to receive rave reviews from industry experts and users alike. Keep your eyes on SAS.

For more information on SAS Text Miner, visit:

www.sas.com/technologies/analytics/datamining/textminer


7. Appendix

Bird's-eye view

Product    SAS Text Miner (requires SAS Enterprise Miner and SAS/STAT® licenses)

Platform    Text mining toolkit; runs on Microsoft Windows, UNIX.

License Fee    Text Mining add-on starts at about $30,000.

Use Cases    Transforms textual data into a usable, intelligible format that facilitates classifying documents, finding explicit relationships or associations between documents, clustering documents into categories and incorporating text with other structured data to enrich predictive modeling endeavors.

Key Features    A variety of advanced analytics: statistical, heuristics, neural nets, pattern decisions and predictive forecasting data mining technologies.

Caveats    This SAS technology runs on a client/server platform. Installation must be done carefully to define the proper metadata architecture.

Similar to    Like products from SPSS, TEMIS and IBM, the SAS Text Miner product performs textual analytics. Unlike competitors, SAS excels in the integration of structured and unstructured data for predictive scoring models.

Points to Note    SAS provides an industry-standard, integrated statistical operating system and the SAS Text Miner module to “bridge the gap between data and text.”


SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies. Copyright © 2007, Stephen E. Arnold. All rights reserved. 000000_451499.0707