Transcript
Page 1: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

1

Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

The Lion thought itmight be as well to frighten theWizard, so he gave alarge, loud roar, whichwas so fierce and dreadful that Toto jumped awayfromhiminalarmandtippedoverthescreenthatstoodinacorner.Asitfellwith a crash they looked thatway, and thenextmomentall of themwerefilledwithwonder. For they saw, standing in just the spot the screen hadhidden,alittleoldman,withabaldheadandawrinkledface,whoseemedtobeasmuchsurprisedastheywere.TheTinWoodman,raisinghisaxe,rushedtowardthelittlemanandcriedout,"Whoareyou?""I amOz, theGreat andTerrible," said the littleman, in a trembling voice."Butdon'tstrikeme--pleasedon't--andI'lldoanythingyouwantmeto.”

L.FrankBaum.TheWonderfulWizardofOz.ProjectGutenberg(2008)TheWizardofOzcapturesthelifetrajectoryofmanytechnologies,andsearchisnoexception:thetechnologystartsinthepublicdomain,itisappropriatedandmademysterious(hiddenbehindthecurtain)bycommercialinterests,thentoutedandoverblownasmagical,eventuallythecurtainfalls,andwecanseewhathasbeenthereallalong,andworkwithitmoreintelligentlyasitis,knowingwhatit’sgoodforandknowingwhatit’snotgoodfor.Thisistrueofsearch,asitistrueofauto-classification.ModernsearchtechnologyhasitsrootsinVannevarBusch’sseminal1945paper“Aswemaythink”anditsfoundationsweredevelopedbyresearchgroupsatMIT,CornellandHarvardintothe1970s.Oneoftheconsequencesofhidingatechnologybehindthecommercialcurtainisthatdifferenttechnologiescompetewitheachotherforthesamemarket,andthereforedownplaytheirperceivedcompetitors,orattempttoduplicatethefunctionalitiesoftheirperceivedcompetitors.Searchtechnologyhascompetedwithtaxonomyworkforyears–“youdon’tneedtaxonomies,”wentthemantraofoneinfamoussearchvendor,“oursoftwarewillunderstandyourcontentforyou.”ThecurtainfellwiththeunveilingofGoogleKnowledgeGraph,whenitbecameobviousjusthowmuchGooglewasinvestinginknowledgeorganisationstructures(controlledvocabularies,taxonomies,ontologies,knowledgegraphs)tomakeitssearchsmarter.Thesuperiorperformanceofopensourcesearchengines(andtheirstrikingadoptionrates)overcommercialsearchenginesinsolvingparticulartypesofproblemis,likeToto’sleap,anothermanifestationofthedroppedcurtain.Machineclassificationofcontenthasbeenanotherexample.Itisfrequentlytoutedbyitscommercialvendorsasasingletechnologyperformingasingle“we’lltagyour

Page 2: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

2

contentforyou”functionwhereitisactuallyabundleofquitedistincttechnologies,androotedininformationretrievalresearchandsearchtechnologiesdevelopedinthe1970sand1980s.Machineclassificationvendorshavefoundthemselvescompetingwithsearch(“wecanmakeyoursearchsmarter”)andtaxonomies(“wecanderivetaxonomiesautomaticallyforyou”).ThecurtainisnowstartingtofallwithlargeorganisationslearninghowtousetheopensourcetoolkitssuchastheUniversityofSheffield’sGATEtoolset,tosolvetheirsearchanddiscoveryproblems.The“goitalone”propagandaofthetechnologyvendorsisdamagingbecauseitlocksbuyersintoasingletoolsetdesignedforcommondenominators,withprohibitivecustomisationandmaintenancecosts.The“magicblackbox”propagandablindsorganisationstotheneedtobuildcapabilitiesinarangeoftools.Andthe“wedoeverything”propagandahidesthesynergiesandcapabilitiesthatcanbeattainedbyintelligentlyusingsearch,machineclassificationandknowledgeorganisationtechnologiesasatoolkit,withdifferenttoolsetstobeusedindifferentcombinationstosolvedifferentkindsofsearchanddiscoveryproblem.Thepurposeofthisarticleistoexplaintheusesandinteractionsofthreetechnologiesthattogethersupportsearchanddiscovery,sothatorganisationscanlearnwhichtoolstoadoptinwhichcombinationtosolvedifferentproblems,andtheycanseewhatkindofinternalcapabilitiestheyneedtobuildandmaintaintocontinuesolvingthoseproblems:

• Enterprisesearch• Taxonomyandontologymanagement• Textanalyticstoperformmachineclassification.

Enterprisesearchisatechnologyfordeliveringusefulandrelevantresultstousersexploringalargecontentbase.Asatechnologyitisrelativelysimpleandnotespeciallysmart,andtodeliversuperiorresultsincomplexenvironmentsitneedstobesupportedbyothertechnologies.“Complexity”canrefertothediversityofusercommunities,andtothediversityofthecontentitself.Searchenginesbythemselvesdon’tunderstandanything.Theycrawlandindex,processqueriesandserveupuseranalytics.Tobecomesmart,theyneedhumandesignedcurationtools(taxonomyandontologytools)andtoolsforscalingthehumantaggingofcontent(textanalyticstools).Taxonomyandontologymanagementtechnologiessupportdesigning,deliveringandmaintainingknowledgestructuresandcontrolledvocabulariestoenhancethesearchanddiscoveryexperience.Thesestructuresarebasedonananalysisofbusinessanduserneedstogetherwithananalysisofthetargetcontentforsearchanddiscovery,andtheyhelptodisambiguateconcepts,capturevariationsinlanguageandmapthemassynonyms,andidentifyrelationshipsbetweenconceptssothatsearchescanbeexpandedornarrowed.Theconstraintsoftaxonomyandontologymanagementmostoftenrelate(a)tostayinguptodatewithnewandemergingneedsofdiverseusersets(whichiswheresearchanalyticscanhelp),and(b)toscalingthewaythatconceptscanbeidentifiedincontentandmetadataenriched(whichiswheretextanalyticsandmachineclassificationcanhelp).

Page 3: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

3

Textanalyticsreferstoaclusteroftechnologiesthatextractmeaningfromtextandturnitintometadata.Someoftheroottechnologies(crawlingandindexing)aresharedwithsearch.Thesetechnologiesarearesponsetotheproblemsofhumaninconsistency(peoplearenotconsistentinhowtheyapplytagstocontent)andofthescaleofcontenttobetagged(whenitwouldnotbefeasibletoapplyhuman-curatedtaggingtolargeamountsofcontentinashortperiodoftime).Differenttechnologiesareusedtosupplydifferenttypesoftaggingoperation,rangingfromidentificationofknownentitiessuchaspeople,organisationsandplaces,quantitiessuchasmoneyamounts,totheidentificationofabstractconceptsthatcharacterizewhatadocumentis“about”inwaysthatmakesensetotargetusers.Theconstraintsoftextanalyticsare(a)thatithasahardtimefiguringoutwithoutintensivehumancurationorverywellstructuredtrainingsetswhichconceptsaremostsalienttothemostpeople(whichiswheretaxonomyandontologymanagementhelps)and(b)likesearch,itrequiresconstanttuningbasedonemerginguserneeds(whichiswheresearchanalyticsonuserbehaviourscanhelp).Thesetechnologiescanbeextremelypowerfulwhenusedincombinationinsupportofthesearchanddiscoveryexperience.However,itisimportanttounderstandwhattheyarecapableof,howtheywork,andthelimitationsofwhattheycando.FigureA1showsadetailedviewofthethreepillarsofthesearchanddiscoverytechnologystack,howtheywork,andtheirkeyinteractions.Webeginatthetopofthediagram.Sincethesetechnologiesaresupposedtobedeployedintheserviceofuserneeds,decisionsabouttheirconfigurationanddesignneedtobemadeinthecontextofknownuserneeds,aswellasathoroughunderstandingofthecontenttobecovered.Notethatthisisasimplifiedconceptualframework,anditdoesnotfullyrepresentthecomplexitiesoftherelationshipsbetweenthepillars,noralltheoverlapsbetweenthem.

Page 4: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

4

Figu

reA1De

tailview

ofthese

archand

disc

overytechno

logystack

Page 5: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

5

“Content”canbestructured,semi-structuredorunstructured.Itcanbeintheformofdata,documents,webcontent,audiovisualfiles,discussions,orquestions.“People”canalsobeconsideredaformofcontentsinceknowledgeablepeoplemayalsobeausefulresultinssearchonaparticulartopic.Inacontentmanagementsystemapersonprofilecanbeusedasaproxyforthatperson,andcanbetaggedwithrelevantsubjecttagssoastosurfaceinsearchalongsideotherformsofcontent.Allthreetechnologiesinthestackdependuponathoroughsurveyofthetargetcontenttobecovered,andpreparationandstructuringofthecontentstoressothatthesearch,taxonomymanagementandtextanalyticstechnologiescanaccessthemandoperateuponthem.

Enterprise Search Let’sbeginwiththecentralpillar,theenterprisesearchstack.Therearesixmaincomponentsinthesearchstack.

CrawlingThefirstcomponentisthecrawlingmodule.Thisisdirectedtocontentsources,anditcrawlsthecontentinpreparationforindexing.Youmaywanttobeabletodirectthecrawlertospecificpartsofyourcontent.

ContentProcessingContentisthenprocessed,usingtoolsincommonwithinitialprocessingofcontentbytextanalyticstools.Atextfileiscreatedforparsingandindexing,andthecontentistokenized,meaningeachelementisgivenauniqueidentifiersothattheycanbemanipulatedandlocatedlateron.Thisisalsothepointatwhichlemmatization(resolvingvariantgrammaticalformssuchastensesandpluralstothesamegrammaticalroot)andstemming(resolvingdifferentwordendingstothesameword-stem)occurs.Youmayalsowanttoremove“stopwords”thatwillcreatemeaninglessnoise(commonoperatorssuchasif,but,and,the,a)fromtheindexing.Yourstopwordsmayneedtobetuned.Sometimesstopwordsthatarenoisyinsearchindexesprovidesemanticcluesinsomerules-drivenapplicationsoftextanalytics.

IndexingTheindexingmodulecreatestermindexesfromtheparsedtext.Eachtermislinkedtoitssourcecontent,andtokenizationprovidesthecontextinwhichthesearchtermappears.Thepre-processedindexesprovidespeedinsearch.Crawlingandpre-processingoflarge-scalecontentcollectionscanbeveryslowandoccupiessignificantprocessingcapacity.Hencemostindexingconsistsofincrementalupdatestotheindexesbynoticing,crawlingandindexingnewcontentwheneveritisadded.Itisveryimportanttoplantheconfigurationandsetupofyoursearchenginecarefullyinadvance.Anymajorchangestothesearchengineconfiguration,ortothe

Page 6: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

6

wayitexploitstaxonomyortextanalyticsmayrequireanexpensiveandslowcompletere-crawlandre-index.

Auto-categorizationSomesearchenginesalsooffersimpleauto-categorizationfunctions.Intheabsenceofsophisticatedtextanalyticstools,theseareusuallyentityrecognitionbasedonlookupsofreferencetermlists.Theselistscanbemaintainedwithinthesearchengineitself,orcanbesuppliedfromataxonomymanagementsystem.Thisisreferredtoas“knownentityextractionvialookup”.Theentitiesyouareinterestedincallingout(forexamplecompanynamesorcountrynames)arecompiledintolists,andareregisteredassignificantentitiesifthemainsearchindexescomeacrossthesetermsortheirsynonyms.Thiscanprovideaverysimplefilteringcapabilitybasedontheentitytypesyouareinterestedin.However,thisisafairlybasictechnology.Justbecauseadocumentmentions“Google”doesnotmeanthatthedocumentissubstantially“about”Google.

QueryProcessingMuchoftheutilityofasearchengineliesinthesophisticationwithwhichithandlesqueriesandqueryinterfaces.Queryinterfacescanincludetheubiquitoussearchquerybox,browsingtaxonomystructureswhereconceptsarepre-linkedtocontentthroughstoredsearches(requiresintegrationwithataxonomymanagementmodule),filteringofsearchresultsusingmetadataelementsortaxonomyfacets(againrequiresintegrationwithtaxonomy).Manysearchenginesofferauto-completeorauto-suggestofsearchqueriessuggestingcommonsearchqueriesastheusertypestheirqueryintothesearchbox.Auto-completecanexploitcommonsearchqueries,existingtaxonomyorthesaurusterms(linkedtoresultsthroughstoredsearches),and/orkeywordsincontextfrommetadatasuchastitlesofhighvaluecontent.Therelevanceofsearchresultsforcommonqueriescanbetunedattheback-endbyasearchmanagerbyalteringtherulesthatcalculatethelikelyrelevanceofagivenpieceofcontentforagivenquery.Wherethereisaknown“best”pieceofcontentforagivenquery,thiscanbepromotedtothetopoftheresultspage.Relevancetuningcanalsobeautomated,bylookingatclickbehaviorsbetweenqueryandclickingonresultsandpromotingcontentthatisfrequentlyclickedon.Relevance-tuningcanexploitwhatisknownabouttheusersandtheircontexts.Knowingthatauserissearchingfromwithinagivenorganizationalfunctionenablesrulestobewrittenforhowqueriesfromthoseusersshouldbehandled,fromzoningthesearchtoknowntargetcontentforthatfunctionalgroup,tohandlingrelevancecalculationbasedonmetadataassociatedwiththosefunctionsandusers,andmatchingitwiththeindexedcontent.Contentrankingmechanismscanalsobepulledintothemixed.

Page 7: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

7

SearchAnalyticsQueryhandlingandsearchtuningrequiresophisticationinthesearchengineitselfaswellasarelativelysophisticatedsearchmanagementcapability–i.e.thewaythatsearchisstaffed,administered,configuredandconstantlytunedinrelationtoknownuserneeds.Searchanalyticsisthemodulethatpowersthissophisticationandtuning.Searchengineshaveverypowerfulreportingcapabilities,trackinghowqueriesareconductedanduserinteractionswithresultspages,butasinallcomplexreportingcapabilities,thesmartslieinhowthereportsaredefinedandconfigured.Whatwillyoutrack,how,andhowfrequently?Howwillyoubackuphunchesgatheredfromthereports,andvalidatethemwithusers?Theinitialsearchdesignstrategybasedonusecasesandcontentanalysiswillgiveyouastartingpoint.Thiswillberefinedbytheinsightsyougainbyobservinguserbehaviorsthroughthesearchroll-outandsubsequentmaintenance.

Taxonomy and Ontology Management Likesearch,taxonomyandontologymanagementrequiresveryrobustanalysisofhowusersneedtointeractwithcontent,andanalysisofthecontentitselftounderstandhowitisdescribed,organizedandused.Taxonomydevelopmentandmanagementisstillbestconductedasaworkofskilledhumandesign,butitcanbesubstantiallyenhancedwiththeappropriatetechnologies.Therearetwobroadfunctionswithinenterprisetaxonomyandontologymanagementsystems.

TaxonomyBuildingandMaintainingEnterprisetaxonomymanagementsystemssupporttheworkofbuildingandmaintainingtaxonomiesbyprovidingworkflowsandmetadataaroundtaxonomyconcepts.Forexample,theycanhousesourcevocabulariesofvaryingauthorityandstructurethatcanbeusedtogeneratecandidatetermsforthetaxonomy.Thereareapprovalworkflowstotracktheapprovalprocess.Theremayberoles-basedpermissionsforcomplexenvironmentswheremultipletaxonomyprofessionalsareinvolvedinmaintainingthevocabularies.Sourcesandusagesoftaxonomytermscanbetrackedasattributesofthetaxonomyconcepts.Scopenotescanbeaddedtoexplicatetaxonomyterms,andtherevisionhistoryoftermsandtaxonomyfacetscanbetracked.

ReferenceStoreThesecondmajorfunctionofanenterprisetaxonomymanagementsystemistoactasareferencestoreforothersystems,supplyingtaxonomyterms,relationshipsbetweenterms,synonymsandcontrolledvocabulariestoothersystems.Itcansupply:

• controlledvocabularytermstocontentmanagementsystemstosupportmanualtaggingofcontent

• termsandrelationshipstosupportauto-categorizationviaknownentitylookup,whetherappliedwithinasearchengineoratextanalyticsengine;

Page 8: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

8

• synonymsandrelationshipstosearchsothatitcanimproverelevanceandprecisionofresults,andsupportexpansionofsearch

• taxonomyfacetstosearchtosupportfilteringofsearchresults.Becausetheyarebuilttomanipulatedefinedrelationshipsbetweenconceptsandtheirattributes,taxonomyandontologymanagementsystemscanalsoactasbrokerstootherLinkedDataorLinkedOpenDatasources,tocallauthoritytermsandrelationshipsfromelsewhere,tosupportsearchand/ortosupportenrichmentofcontentviatextanalytics.Finally,enterprisetaxonomyandontologymanagementsystemsarebuilttoactastaxonomy/ontologyhubscateringtomultiplesystemsandaudiences.Theyarecapableofinteractingdifferentlywithdifferentplatformsandaudiences.Forexample,differentversionsofthesamecoretaxonomycanbepresenteddifferentlytodifferentaudiences,dependingontheirneeds.Differentvocabulariesandtaxonomyservicescanbesuppliedtodifferentsystemsbasedontheircontentanduses.

Text Analytics Textanalyticsreferstoaquitediversefamilyofcomputer-assistedtechniquesforassigningmeaningtocontent.Differentmodulescanbeusedtoservedifferentpurposes.

ContentProcessingTherearesomebasiccontentprocessingfunctionsthattextanalyticshasincommonwithenterprisesearch:preparationoftextforparsing,tokenization,lemmatization,stemming,stopwords.

SyntacticandSemanticPre-ProcessingSeveralfunctionalapplicationsoftextanalyticsdependontwofoundationalprocessingtechniques.Thefirst,syntacticanalysis,analyzesthelanguagestructureoftextgrammatically,usingverylargeopensourcetrainingsets.Sentencesplittersbreakupsentencesintopartsofspeech,distinguishingnounsandnounphrasesfromverbsandoperators,andsoon,andsyntacticanalyzerscreatealogicalrepresentationofthesentences.Semanticanalysisattemptstoinfermeaningfromanalyzedtexts.Likesyntacticanalysisitdependsonreferencetrainingsets.Itenablescomputerstoinfermeaningfulassertionsandfactsfrombodiesoftext,andunderpinsthebasetechnologiesbehindmachinetranslation.Syntacticanalysisisbasedonknownrulesofgrammar.Semanticanalysishasamuchmorechallengingtask,whichistoinferhuman-understandablemeanings.Notallgrammaticallycorrectsentencesaremeaningfultohumans–"Colorlessgreenideassleepfuriously"isafamousexample

Page 9: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

9

presentedbyNoamChomskyofasyntacticallycorrectsentencethatissemanticallyincoherent.Syntacticandsemantictechniquesunderpintheidentificationof“entities”withincontent.Inthecontextoftextanalytics,an“entity”canbeaperson,athing,aplace,atime,atopic.Itisanydistinctconceptthatcanbeidentifiedfromtext.Identificationandextractionofentitiescanbedonemostsimplybysimplylookingupreferencelists,asinauto-categorizationofknownentitiesprovidedbysomesearchengines.However,withsyntacticandsemanticanalysistextanalyticscanalsoidentifyunknown(unpredicted)entitiesforwhichyoudon’tyethaveexamplesinyourlists.ForexampleinEnglishpropernamesarecapitalized,andpersonalnameshaveknownforms(thatalsovarybyculture).Hencecandidatesforpersonsmentionedintextcouldbepickedupbyarulethatsays“lookforoneormorecapitalizedsubjectsorobjectsofsentences,andthenlookforlemmatizedverbsfromthefollowinglist‘say’…‘reply’…etc”.Suchrulesarebasedoncommonusageasrepresentedintheirtrainingsetsandaretypicallyprovidedasstandard.Theseruleswilltypicallyneedtobetestedandrefinedbasedonlocalusage.Forexample,wefoundthatonestandardtooldidnotpickupcountryoforiginincontentfromanimmigrationandcustomsauthority,becauseitwasusingsimpleterm-phraselookup.However,customsofficersatcheckpointshabituallyusedtheconvention“[Country]-registeredvehicle”.Thehyphenationandextensionofthetermdisruptedthesimplelookupfunction.Noticethatthesolutiontothisproblemisacombinationoftechniques,likelysupplementingknownentitylookupwithsyntacticallydrivenentityextractionrules.Thepracticeofusingdifferenttechniquesincombinationofcharacteristicoftextanalyticsapproaches.Withthesefoundationalcapabilities,therearemanyspecializedapplicationsoftextanalytics,notallofwhichwillbesuitabletoeverypurpose.

Auto-categorizationTheabilitytoidentifybothknownandunknownentities/conceptsisasignificantimprovementinauto-categorizationcapabilities.Lookuplistsforknownentities/conceptscanbesupportedbyataxonomymanagementsystem,andstrengthenedthroughthesupplyofsynonyms.Conversely,throughthejudicioususeofrules,textanalyticscandiscoverentities/conceptsmentionedintextthatarenotyetinthetaxonomyorcontrolledvocabularies,andcanbeconsideredascandidateterms.

TextMiningTextminingreferstoanumberoftechniquesforanalyzingtextbasedonstatisticalalgorithms.Inverybroadterms,largebodiesoftextaresubmittedtoiterative

Page 10: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

10

operationsusingalgorithmsto“sort”thetextintoclustersor“buckets”thatareinternallystatisticallysimilar,andstatisticallydissimilartotheotherclustersorbuckets.Statisticaltechniquesarealsousedtodeterminethewordsorword-phrasesthataremostlikelytouniquelyrepresenttheirbucketcomparedtootherbuckets.Thesewordstringsareoftencalled“topicmodels”.Thelikelihoodofspecifictermsoccurringinproximitytoeachothercanalsobecalculated.Textminingissometimestoutedasafully-automatedalternativetodescribingandtaggingcontent.Inpracticehowever,theclusteringtechniquesusefuzzyalgorithms,andtheend“described”stateishighlyinfluencedbywhichclusteringpathwaysaretakenatthebeginningoftheoperation.Textminingtechniquessufferfromnon-reproducibility(thesameseriesofoperationsonthesamecontentcanproducedifferentresultsondifferentoccasions)andfromnon-comprehensibility–thetopicmodelsgeneratedoftendonotmatchhowusersthemselvesdescribethecontent.Soinpractice,applicationsoftextminingareheavilytunedbytheirhumanoperators,anditisnotalwaystransparenthowagivensetofresultsareachieved.Textminingalsosuffersfromendogeneity–meaningthatthetagsthatareextractedtodescribesimilarbucketsofcontentarewhollyderivedfromthatcontentset(orwhentrainingsetsareusedforcomparison,fromthetargetcontent+thetrainingset).Endogeneityisaproblemwhenyouwanttosurveydiversecollectionsofcontentandmapmeaningsconsistentlyacrossthem.Inthatcase,ataxonomyorontologythatcrossesthosecontentsetsandaudiencesprovidesanexogenousreferencepointthatcanbrokermeaningsacrossdifferentcontentsetsanduseraudiences.However,textminingcanbeapowerfultoolusedincombinationwithsearchandtaxonomymanagement.Itsstatisticaltechniquestoclusterbasedonsimilaritycanproposenewcategorizationsandpotentiallyrelevantrelationshipsbetweentopicstothetaxonomymanager.Itsabilitytoextracttopicmodelscanbeusefulinautomatedsummarizationtechniques.

SemanticAnalysisWehavealreadydiscussedsomeofthechallengesofcomplexityfacingsemanticanalysis,notleasttheheavyinfluenceofcontextandculture(factorsexternaltothecontent)onmeaning.Inpracticaltextanalyticsapplications,semanticanalysisreliesonanumberoftechniquesandnotjustcomputationalsemanticanalysis.Forexample,oneofthemainapplicationsofsemanticanalysisistoextractassertionsorfactsfromcomplexdocumentation.Ifsemanticanalysishasaccesstoontologiesviaataxonomymanagementsystemitcanthenenrichitsinferencesbasedonknownandvalidatedrelationshipsbetweenentities.

Page 11: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

11

SentimentAnalysisSentimentanalysisisaveryspecializedapplicationoftextanalytics,oftenusedinmarketing,socialmediaanalysis,andqualitativefeedback.Itlooksatthesemanticsofpositive,negativeandneutralexpressions,inassociationwithknown(extracted)entities.Againitishighlycontextual,andsubjecttoover-simplification–forexample,sarcasticlanguagecanbesuperficiallypositive,butthelinguisticmarkersofsarcasmareoftennotveryobvious.Aswithmosttextanalyticstechniques,therulesandalgorithmsneedtobetested,validatedandconstantlysupervisedforaccuracy.

SummarizationAutomaticsummarizationoftenexploitsknowndocumentordatastructures(itknowswheretolookforkeycontent),anditmayexploitsyntacticanalysis(e.g.removingextraneous“padding”languagesuchasadjectivesandadverbs).Itmayusesemanticanalysistoextractkeyfactsfromdocuments.

How the Pieces of the Stack Work Together Let’sbringthisbacktogethertogetabetterunderstandingofhowthepillarsofthestackworktogether.FigureA2illustratesthehighlevelviewofhowthedifferentpillarsofthesearchanddiscoverytechnologystackinteractandworktogethertoproducesuperiorresultsfororganisations.Thefocusforhowthestackisdeployed,andwhichparticulartechnologycomponentsareused,alwaysspringsfromuseranalysis,andanunderstandingofthequeriesthatusersmake,againstanunderstandingofthetargetcontent,howitisstructured,howandwhyitisproduced,howitiscurrentlybeingused,andhowitcouldpotentiallybeused,ifthesearchanddiscoveryfunctionalitycouldbeimproved.Indigitalenvironments,“targetcontent”canbeanycombinationofdocuments,webcontent,data,dashboards,andpeopleprofiles.TheconfigurationoftheelementsoftheSearchandDiscoverystackspringsfromanunderstandingofuserneedsagainstanunderstandingoftargetcontent.Determininghowtousethepillarsofthestacksitswithinalargerdiagnosticanddecision-makingframework.

1. Whatisthebusinessproblem(orsetofbusinessproblems)youaretryingtosolve?Theanswertothisquestioncanoftencomethroughaknowledgemanagementdiagnosticsactivitysuchasaknowledgeaudit,combinedwithanunderstandingoftheorganisation’sbusinessdirectionandstrategy.

2. Whoarethekeyusercommunitiesinrelationtothebusinessproblem,andhowwillyouunderstandtheirworkingneedsandopportunitiesforimprovement?(Thiscanalsocomefromaknowledgeaudit).

3. Whatisthetargetcontent(theknowledgeresourcesthatyourtargetusercommunitiesneedstoworkwithmoreeffectively),whereisit,howisitstructuredandusedcurrently?(Contentanalysisandmodelingcanhelp).

Page 12: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

12

4. Whattoolsfromwithinthesearchanddiscoverystackcanhelptoaddresstheproposedsolution?(Thefocusofthispaper)

5. Whatneedstochangeonthebehavioural,processandgovernancelevelsforthenewsolutiontowork?(Forsustainableimplementationplanning,checkoutTheKnowledgeManager’sHandbook(MiltonandLambe,2016)fromaknowledgemanagementperspectiveorMaishNichani’sseriesofarticlesathttp://olasearch.com/articlesfromasearchdesignperspective).

FigureA2Interactionsbetweentaxonomymanagement,searchandtextanalytics

SearchandTaxonomySearchistheuser-facingcomponentofthestack.Itmediatescontenttousers.Taxonomymanagementsuppliesknownsalientconceptsandrelationshipstosearch,andittellssearchwhichquerytermsaresynonymsofthesametaxonomyconcepts.By“salient”wemeanthattheconceptsarefrontofmind,action-relatedconceptsintheheadsofusersastheyperforminformationandknowledgeseekingactivitiestocompleteworktasks.Taxonomiesmakesearchsmarterbytellingitwhichconceptsareimportanttousers,howtheyarerelatedtootherconcepts,whetherthroughhierarchicalorotherrelationships.Hencesearchcanexploittaxonomiestohelpuserstorefineorexpandtheirsearch,tofollowmeaningfulpathwaysandexplorecontent.Taxonomiesandontologiesenhancetherelevanceandprecisionofsearchforusers.Becausesearchisuserfacing,itgathersalotofdataaboutuserneedsandbehaviors.Frequencyanalysisofsearchqueriestellsthetaxonomistalotabouthowusersthinkabouttheircontentandtheirsearchrequirements.Clickthroughactivity

Page 13: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

13

onsearchresultspagesalsoprovidesevidenceaboutuserperceptionsofwhatmayormaynotbeuseful.Hencesearchanalyticsprovidesafeedbacklooptohelpthetaxonomymanagerconstantlyrefineandimprovethetaxonomytouserneeds.Thisisapositivefeedbackloop,wheretaxonomyprovidesrelevanceandprecisiontosearch,andsearchprovidesaconstantflowofrealtimelarge-scaleevidenceonwhatisactuallyrelevantandusefultousersatthepresenttime.Searchanalyticscanalsopickupemergingtermsandconceptsthatareimportanttousers,andproposethemascandidatetermsfortaxonomiesandtheirsupportingthesauri.Withouttaxonomymanagement,searchisrelativelydumb,whetherintermsofwhatitcanindex,howitcanhelpusersmaketheirqueries,andhowitcanhelpusersactupontheirresults.Withoutsearch,taxonomymanagementlacksaconvincingmediumtodemonstrateitsvalue,anditlacksacontinuousflowofevidencetomaintainitsrelevanceandusefulnesstousers.

SearchandTextAnalyticsTextminingcanproposenewconceptsandnewassociationsbetweencontentitemsforthesearchmanagertobuildintoqueryprocessingandrelevancytuning.Semanticanalysisandsummarizationtechniquescanbeusedtoprovide“smart”summariesofcontentitemstobepreviewedinsearchsnippetsandinsearchresultspages,makingiteasierforuserstoassesswhetherornotthecontentisrelevanttotheirneed,beforetheyopenthecontentitem.Textanalyticscanextendthewayinwhichcontentcanbepulledtogetherinspecializedsearchbasedapplications.Allthreepillarsofthestack(search,taxonomyandtextanalytics)dependonconstanttuningoftheinfrastructuretodelivergoodsearchanddiscoveryresults.Textanalyticsinparticulardependsonrulestoguidehowthecontentisprocessedandtagged.Theserulesneedmaintenanceandtuning.Thistuningisnecessarybecauseoftheconstantongoingchangesincontent,context,priorities,usersandneeds.Searchanalyticsprovidealargescale,real-timewindowintouserneedsandbehaviors,andso,aswithtaxonomymanagement,itprovidestheevidencebaseneededtokeepthetextanalyticsstacktunedtodeliveraccurateandusefulresultsforusers.Whentherearelargeandcomplexknowledgebases,withouttextanalytics,searchfindsitdifficulttopickoutsalientconceptsrelatedtocontent.Aswithtaxonomymanagement,textanalyticsdependsondeepuserandcontentknowledgetoknowhowtoconfigureandtuneitsoperations.Withoutsearch,textanalyticsfindsithardtodothisunlesstheteamispreparedtoinvestinconstant(andexpensive)userresearchandtesting.

TextAnalyticsandTaxonomyManagementTaxonomiesaredesigned,evidence-basedartifacts.Assuch,theyconsistentlylagthepaceofchange.Theyareoptimizedtoexploitknownanddocumentedconcepts.Textanalyticstechniquessuchasauto-categorizationandtextminingcansupplycandidatetermsandrelationshipstothetaxonomymanagerwithouttheneedfor

Page 14: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

14

comprehensivere-surveyingofcontentanduserneeds.Theycanpickupemergingconceptsinthecontent,justassearchanalyticspicksupandproposesemergingconceptsevidencedinsearchqueries.Sotextanalyticscanactasapowerfultaxonomyenrichmenttool.Taxonomiesandontologiescanguideandstrengthentheapplicationoftextanalyticstechniquessuchasauto-categorization(knownentityextractionthroughlookup),factextractionthroughsemanticanalysis,andsentimentanalysis(byhelpingthesentimentengineresolvedifferentsynonymstothesameentitybeingreferencedinthecontent).Whentherearelargeandcomplexknowledgebases,withouttextanalytics,taxonomytagscannoteasilyandconsistentlybeappliedtolarge,diversebodiesofcontent.Withouttaxonomymanagement,textanalyticsstrugglestodescribecontentintermsthatmakesensetokeyusersandinthecontextofuseractivities.Machinetaggingtechniquesalonedonotdescribecontentintermsthathumansintuitivelyunderstand,withoutsomeformofhuman-curatedfilter.Taxonomymanagementprovidesthis.

Implications Fortoolongorganisationshavebeenlisteningtotheboomingvoicegivingmagicalreassurancesfrombehindthecurtain,andhaveneglectedtheirowncapabilities,andthetoolsetsthatexistintheopensourcearena(thatcommercialproductsalsoexploit).Astheworldbecomesmorecomplex,solutionstosearchanddiscoveryproblemswillbecomemorediverse.Weneedtobecomebetterjourneymen,moreknowledgeableaboutthecapabilitiesdifferenttoolsetsthatareavailable,sothatwecanchoosetherighttoolsforanygivenproblem.Someofthemwillbecommercialtools,acquiredfortheirexcellenceinaspecificsetoffunctions.Someofthemwillbeopensourcetools,acquiredandtunedtosolvespecifickindsofproblem.Theageofsinglesource“doitall”commercialplatformsisover.Thispaperhasproposedthatknowledgeofthesearchanddiscoverytechnologystackisanimportantcapabilitytoacquire.Itneedstobeatleastadequatetothejobofevaluatingneedsandpossibilities,andofaskingvendorssearchingquestionsabouthowtheirtechnologyworks,whatitisgoodat,andwhatitslimitationsare.Insomecases,organisationswillwanttointernalizeadeepertechnicalcapabilityinworkingwithanumberofsetsoftools.Thereareotherimportantsearchanddiscoveryrelatedcapabilitiesthatenterprisesneedtoacquirebeyondaknowledgeofthesearchanddiscoverytechnologystack–forexample,howuserneedsandcontentcanbeanalysedtodetermineprioritiesandopportunities,howcontentcanbemanagedandenrichedtoprovideoptimaluserexperiences,andhowbusinessproblemscanbetransformedthroughdesignprocessesintoeffectivesolutionsforbusinessneeds,problemsandopportunities.

Page 15: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

15

I would like to acknowledge the following colleagues who have provided valuable insights and advice on this paper: Dave Clarke, Agnes Molnar, Maish Nichani and Tom Reamy. Patrick Lambe October 2016-January 2017


Top Related