Integrating Text Into an Enterprise IT Environment
February 25, 2003
Curt Monash, Ph.D.
President
Monash Information Services
[email protected]
www.monash.com

Agenda for this talk
• How text indexing and search work – and what they assume
• Fitting text into a traditional IT context
• Sorting out your text application needs
• Key considerations in text application architecture

There are no miracles or magic bullets
• “Search engines” aren’t the answer
• “Content management” isn’t the answer
• Clustering isn’t the answer
• XML isn’t the answer
No one technology solves all search problems

Gresham’s Law of Coinage
Bad (i.e., debased) coinage drives out good

Monash’s Law of Jargon
Bad uses of (recently coined) jargon drive out good ones
• Example: “Content management” can mean almost anything

Best practices for text apps are the same as for any other major IT challenge
• Understand your application needs
• Use safely proven technology where you can
• Push the boundaries of technology where you must
• Ask your users to make small changes in the way they do their jobs

Key takeaways
• The classical “technology stack” is evolving nicely to accommodate text
• Standalone search-in-a-box doesn’t solve very many problems
• Careful application analysis is crucial
  – It’s not just data design and workflow
  – Security needs to be designed in

Part 1
How text indexing and search work – and what they assume

Different application contexts
• Different kinds of problems
• Different available resources

Recall vs. Precision
• Recall = What percentage of the valid hits did you get?
  – Crucial if you actually need 100%
• Precision = What percentage of the (top) hits returned really are valid?
  – Important for user satisfaction and efficiency
• But how is “valid” measured???
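
The two definitions above can be sketched in a few lines of Python. This is only an illustration: the document ids and the "valid" judgments are made up, and the function name `precision_recall` is a hypothetical helper. Deciding what counts as "valid" is the hard part, and the code simply takes that judgment as input.

```python
def precision_recall(returned, relevant, top_k=None):
    """Return (precision, recall) for one query.

    returned: ranked list of document ids the engine gave back
    relevant: set of document ids a human judged valid
    top_k:    if set, score only the first top_k hits
    """
    hits = returned[:top_k] if top_k is not None else returned
    good = sum(1 for doc_id in hits if doc_id in relevant)
    precision = good / len(hits) if hits else 0.0
    recall = good / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 valid documents exist, the engine returns 5 hits.
returned = ["d1", "d7", "d2", "d9", "d3"]
relevant = {"d1", "d2", "d3", "d4"}

print(precision_recall(returned, relevant))           # (0.6, 0.75)
print(precision_recall(returned, relevant, top_k=2))  # (0.5, 0.25)
```

Scoring only the top hits changes both numbers, which is why the "(top)" qualifier on precision matters.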

Three fundamentally different scenarios
• Article search
• Web search
• OLTP application text search

Article search
• Very high recall may be needed
• Metadata may be reliable
• Document style and structure may be predictable
This is the “traditional” information retrieval challenge

Successful only in clear-cut research markets
• Legal – Lexis
• Investments
  – Simple-minded apps
  – Stock symbols are the perfect keyword
• Intelligence community?
• Business “competitive intelligence”
• Scientific/medical

The “Daily Me” hasn’t arrived yet
• How well does the user understand information retrieval?
• Who has time to read anyway?
• Failures include Newsedge, Northern Light, et al.
• “Personalized” portals are wimpy, and nobody seems to care

Web search
• Precision is usually a bigger problem than recall (300,000 hits!)
• Metadata is unreliable (no standards, deliberate deception)
• Style and structure are enormously varied

Users like Google
But how are they using it?
• What they’re finding is good web sites
• They still have to navigate to the specific page

OLTP app text search
• 100% precision is assumed for the overall app …
• … so text search had better not be the only way to find documents
• The relational record probably is the metadata
• Hot future area
  – Usage is creeping up
  – Functionality is still primitive
  – App dev tools are improving dramatically, albeit from a dismal starting point
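
A minimal sketch of "the relational record is the metadata", using Python's built-in `sqlite3`. The table `call_report`, its columns, and the sample rows are hypothetical. The exact relational predicate does the precise work, and the text match (a plain `LIKE` here, not a real text index) only refines the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE call_report ("
    " id INTEGER PRIMARY KEY, customer TEXT, report_date TEXT, comment TEXT)"
)
conn.executemany(
    "INSERT INTO call_report (customer, report_date, comment) VALUES (?, ?, ?)",
    [
        ("Acme", "2003-01-10", "Customer reports pump leaking after install"),
        ("Acme", "2003-02-03", "Follow-up call, no further issues"),
        ("Globex", "2003-02-11", "Leaking valve, replacement shipped"),
    ],
)


def find_reports(customer, word):
    """Filter on the relational column first, then on the free-text comment."""
    rows = conn.execute(
        "SELECT id, report_date, comment FROM call_report"
        " WHERE customer = ? AND comment LIKE ? ORDER BY report_date",
        (customer, f"%{word}%"),
    )
    return rows.fetchall()


print(find_reports("Acme", "leaking"))
```

Because the customer column is always available as a lookup path, a miss in the text match never makes a record unreachable.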

Lessons from Amazon.com
• Search-based navigation can work
• The user needs a clear understanding of what s/he is looking for
• If you make an imprecise query, you have to accept an imprecise result set

It all starts with word search
• Big, specialized inverted-list index
  – Huge but sparse
  – Analogous to bit-map or star schema
• Digrams/trigrams/n-grams, offsets, stopwords
• Fortunately, integration into RDBMS has been largely solved
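
A toy inverted-list index, to make the terms above concrete. It is a sketch under simplifying assumptions: the stopword list is illustrative, tokenization is a naive regex, and the helper names (`build_index`, `phrase_search`) are hypothetical. Each word maps to the documents containing it plus the word offsets, and the offsets are what make phrase matching possible.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "of", "the", "to"}  # illustrative, not a standard list


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each non-stopword to {doc_id: [word offsets]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for offset, word in enumerate(tokenize(text)):
            if word in STOPWORDS:
                continue  # stopwords keep their offset slot but are not indexed
            index[word].setdefault(doc_id, []).append(offset)
    return index


def phrase_search(index, first, second):
    """Doc ids where `second` immediately follows `first` (uses the offsets)."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)


docs = {
    1: "The text index is huge but sparse",
    2: "An index of the text and the metadata",
    3: "Sparse bitmap index",
}
index = build_index(docs)
print(sorted(index["index"]))                 # [1, 2, 3]
print(phrase_search(index, "text", "index"))  # [1]
```

Most words appear in only a few documents, which is the "huge but sparse" property: the word-by-document matrix is mostly empty, so only the non-empty cells are stored.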

The ranking problem
• What does 75% relevance mean?
• How do you combine rankings from different subsystems?
• The SAME query against the SAME data can give different results in different search engines
• The SAME query against the SAME search engine can give different results if you add irrelevant data
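
The last point can be reproduced with a textbook TF-IDF score. This is a sketch, not how any particular engine ranks; the scoring formula (term count times log(N/df)) and the tiny corpus are assumptions chosen for illustration. Adding documents that do not match the query at all changes the corpus statistics, and that alone reorders the top two hits.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def rank(query, docs):
    """Rank doc ids by a plain TF-IDF sum; docs maps id -> text."""
    n = len(docs)
    tokens = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    scores = {}
    for doc_id, counts in tokens.items():
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for c in tokens.values() if term in c)
            if df:
                score += counts[term] * math.log(n / df)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by id so the output is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "A": "java java tutorial",
    "B": "coffee guide",
    "C": "java news",
    "D": "java jobs",
}
before = rank("java coffee", docs)

# Add eight documents that share no words with the query.
bigger = dict(docs)
bigger.update({f"X{i}": "quarterly budget memo" for i in range(8)})
after = rank("java coffee", bigger)

print(before)  # ['B', 'A', 'C', 'D']
print(after)   # ['A', 'B', 'C', 'D']
```

The same effect makes a raw score such as "75% relevance" hard to interpret: the number depends on everything else in the corpus, not just on the query and the document.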

Major issues for (key)word search
• Ambiguity
• Vagueness
• Information overload

Major tools
• Traditional linguistic techniques
• “Automagic” clustering
• Traditional metadata
• Socioheuristics

Traditional linguistic techniques
• Synonyms and other semantic clues
• Topic sentences and other syntactic clues
• Standard document structure

Query translation/expansion
• Thesaurus
  – End-user extensible
• Spelling correction
  – Traditional (e.g., drop the vowels)
  – Modern (e.g., compare to query logs)
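
The three techniques above, sketched with the Python standard library. Everything here is illustrative: the thesaurus entries, vocabulary, and query log are made up, the function names are hypothetical, and `difflib.get_close_matches` stands in for whatever similarity measure a real product would use.

```python
import difflib
from collections import Counter

THESAURUS = {"car": {"automobile", "auto"}}  # hypothetical; users could add entries


def expand(terms):
    """Thesaurus expansion: add known synonyms to the query terms."""
    expanded = set(terms)
    for term in terms:
        expanded |= THESAURUS.get(term, set())
    return expanded


def skeleton(word):
    """Traditional trick: keep the first letter, drop the later vowels."""
    if not word:
        return word
    return word[0] + "".join(ch for ch in word[1:] if ch not in "aeiou")


def correct_by_skeleton(word, vocabulary):
    key = skeleton(word)
    return sorted(v for v in vocabulary if skeleton(v) == key)


def correct_by_query_log(word, query_log):
    """Modern trick: among similar logged terms, prefer the most frequent."""
    counts = Counter(query_log)
    close = difflib.get_close_matches(word, list(counts), n=5, cutoff=0.75)
    return max(close, key=lambda w: counts[w]) if close else word


vocabulary = {"monitor", "mentor", "printer"}
query_log = ["monitor", "monitor", "mentor", "printer", "monitor"]

print(sorted(expand({"car", "rental"})))
print(correct_by_skeleton("moniter", vocabulary))  # ['mentor', 'monitor']
print(correct_by_query_log("moniter", query_log))  # monitor
```

The vowel-dropping approach returns two candidates for the same typo, while the query log picks the one users actually search for. That difference is the main argument for the "modern" approach.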

Automagic clustering and information discovery
• Nice mathematical buzzwords
  – Bayesian statistics, etc.
  – It all boils down to “distance” measured in a very high-dimensional vector space
• Nice social science buzzwords too
  – Semiotics, etc.
• Same appeal as neural networks
  – The computer “discovers” what humans can’t
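
What "distance in a very high-dimensional vector space" means in practice: each distinct word is one dimension, each document is a vector of word counts, and similar documents point in similar directions. The sketch below uses cosine distance, which is one common choice and an assumption here; the slide does not name a specific measure. The documents are made up.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: one dimension per distinct word."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if not norm_a or not norm_b:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


docs = {
    "d1": "pump seal leaking oil",
    "d2": "oil leaking from pump",
    "d3": "quarterly sales forecast",
}
vectors = {doc_id: vectorize(text) for doc_id, text in docs.items()}

print(cosine_distance(vectors["d1"], vectors["d2"]))  # 0.25
print(cosine_distance(vectors["d1"], vectors["d3"]))  # 1.0
```

A clustering product groups documents whose pairwise distances are small. The arithmetic is simple; the open question is whether the resulting groups mean anything to a user.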

Clustering technology isn’t sufficiently advanced yet to be “magic”
• Same weaknesses as neural networks too
  – Lack of reliability
  – Lack of transparency
  – Lack of predictability!
• Legacy of failure
  – Search engines: Excite, Northern Light
  – “Employee Internet Management” (i.e., porn/gambling filter) companies

Traditional metadata
• Typically supplied by the author/editor, or by a librarian
• Keywords, etc.
• Who/What/Where/When

Socioheuristics
• Measures of page popularity
• Guesses at author expertise
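
One very crude popularity measure is counting inbound links. The sketch below is an assumption-laden illustration, not any search engine's actual algorithm: the link graph is hypothetical, and real systems weight links far more carefully.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
links = {
    "home": ["products", "support"],
    "blog": ["products"],
    "support": ["products", "home"],
    "products": [],
}


def inlink_counts(links):
    """Count distinct pages linking to each page, ignoring self-links."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


def rerank(hits, counts):
    """Order matching pages by popularity; ties broken alphabetically."""
    return sorted(hits, key=lambda page: (-counts[page], page))


counts = inlink_counts(links)
print(rerank(["blog", "home", "products"], counts))  # ['products', 'home', 'blog']
```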

Sorting through the metadata
Since unaided “search” often works badly, metadata is crucial

So what is metadata in a search context?
• Standard definition of “metadata”: Data about data
• Actually, relational metadata usually is data about data structures
• But in the text world, metadata usually is data about the data itself

Categories of text metadata
• Library-like
• Extracted from the document
• Implicit in the corpus
• OLTP-like

Classical document metadata
• Comes from the library tradition (i.e., card catalogs) …
• … and/or from early online document stores used by librarians
• Examples:
  – Title, author, date, etc.
  – Hand-selected classification/categorization
  – Hand-selected keywords
• Can be created by author, editor, “librarian”

Extracted metadata
• In essence, precomputed text search
• Examples:
  – Key words (or keywords) and concepts
  – Titles and metatags
  – Topic sentences, summaries
  – Author, etc.
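
For the "titles and metatags" case, extraction can be as simple as one parse at index time. The sketch below uses Python's standard `html.parser`; the class name `MetadataExtractor` and the sample page are hypothetical. It trusts whatever the page declares, so it inherits the reliability problems of author-supplied metadata.

```python
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.metadata[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data.strip()


def extract_metadata(html):
    parser = MetadataExtractor()
    parser.feed(html)
    return parser.metadata


page = (
    "<html><head><title>Pump Maintenance Manual</title>"
    '<meta name="author" content="J. Smith">'
    '<meta name="keywords" content="pump, seal, maintenance">'
    "</head><body><p>Replace the seal every 500 hours.</p></body></html>"
)
print(extract_metadata(page))
```

The output is stored alongside the document, which is why the slide calls this "precomputed text search": the work is done once, at load time, instead of at query time.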

Implicit metadata – location, location, location
• Where on the net is the document?
• Judge a document by its neighbors
• Major problem – unstable net topography
  – URL patterns can’t be relied on, unfortunately
  – Google’s original algorithms were based on behavior analysis on the public WWW

Automatic metadata in “traditional” OLTP apps
• Examples
  – Comment fields in apps such as
    • CRM/call report
    • Maintenance/damage report
  – Web feedback forms
• Limited more by application imagination than by the data itself

Part 2
Fitting text into a traditional IT context

Benefits of storage in standard DBMS
• System management (e.g., backup, failover)
• Standard programming languages/APIs
• Security!!

Old objections to DBMS-based storage are invalid
• Performance – proprietary systems can’t index email in real time either
• Specialized functionality – DBMSs have long feature lists too

All enterprise data architectures are supported
• Central everything
• Central index, distributed storage
• Distributed/federated everything

Application development technology and tools are just emerging
• SQL/MM
• Search controls, etc.
• Emerging XML-centric technology
• Customizable “content management” systems

Canned text apps are a mixed bag
• Document management for regulatory filings
• Information discovery
• Generic search

Part 3
Sorting out your application needs

Different applications have very different profiles
• Precision/recall of result
• Quality of input
• Security

Basic application types, Group 1 – the fuzzies
• Portal (e.g., self-service HR)
  – Best case for generic WWW-like search
• Notes/Exchange/Email
  – Not clear what the real functionality needed is
  – Active area of research/development
• Information discovery

Basic application types, Group 2 – OLTP
• Heavy-duty transaction processing (ERP, supply chain, etc.)
  – Search is tangential
• Direct touch CRM
  – Basic search is underutilized but gaining ground
• Online sales/marketing (very different in different industries)
  – Search part of the app unlikely to be very demanding …
  – … except from a security standpoint

Basic application types, Group 3 – Heavy-duty analytic aids
• BI/CPM/Analytic apps
  – Great for taming the numerical part of the information tangle
  – Text search is largely irrelevant
• Product lifecycle management (engineering-centric)
  – Text is an afterthought
• Product lifecycle management (regulatory-centric)
  – Documentum et al. offer “compliance” solutions
• Online maintenance manuals
  – This is a biggie for text!!

Part 4
Key considerations in text application architecture

Five big issues
• Database integration
• Realistic options for document metadata
• Document stylistic consistency (local)
• Quality-of-search application requirements
• Security

Text database integration vs. relational database integration
• Remote indexing is an option
• Data cleaning and consistency issues are different
• Performance issues are different
• Everything is a little more primitive

Document metadata – consider the source
• Author/editor – can’t be relied on
• Implicit metadata – great if you trust your policies/procedures
• Extracted metadata – same strengths/weaknesses as general text search
• From a relational OLTP app – nice if you have it

Document stylistic consistency
• The best search techniques make assumptions about document structure
• Which assumptions are valid?
• 100% consistency is neither needed nor realistic

Quality-of-search requirements
• What ratio of bad hits is tolerable? (Precision)
• How close do you have to get to the target?
• What level of redundancy is tolerable?
• How crucial is it to get every good hit? (Recall)

Security
• What level of detail must be protected?
• How compartmentalized are need and permission to know?
• How important is security auditing?

Must search results themselves be kept secure?
• Existence of info?
• Summaries?
• Other document metadata?
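
One way to frame these questions in code is to filter the result list against the user's permissions before anything is shown, and to make "reveal that hidden documents exist" an explicit switch. This is a sketch only: the `Hit` record, the group-based permission model, and the `reveal_existence` flag are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc_id: str
    title: str
    summary: str
    allowed_groups: frozenset


def secure_results(hits, user_groups, reveal_existence=False):
    """Drop hits the user may not see; optionally admit that some were withheld."""
    visible = [h for h in hits if h.allowed_groups & user_groups]
    response = {"hits": [(h.doc_id, h.title, h.summary) for h in visible]}
    if reveal_existence:
        # Even a bare count leaks the existence of protected information.
        response["withheld_count"] = len(hits) - len(visible)
    return response


hits = [
    Hit("d1", "Pump manual", "Seal replacement steps", frozenset({"engineering"})),
    Hit("d2", "Merger memo", "Draft terms", frozenset({"legal", "executive"})),
]

print(secure_results(hits, {"engineering"}))
print(secure_results(hits, {"engineering"}, reveal_existence=True))
```

Titles and summaries are filtered together with the documents here. An application that needs to protect summaries or other metadata separately would need a permission check per field rather than per hit.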

Integrating Text Into an Enterprise IT Environment
February 25, 2003
Curt Monash, Ph.D.
President
Monash Information Services
[email protected]
www.monash.com