Integrating Text Into an Enterprise IT Environment, February 25, 2003

Page 1: Integrating Text Into an Enterprise IT Environment February 25, 2003

4502X.PPT 04/22/23 1

Integrating Text Into an Enterprise IT Environment

February 25, 2003

Curt Monash, Ph.D.
President
Monash Information Services
[email protected]
www.monash.com

Page 2: Integrating Text Into an Enterprise IT Environment February 25, 2003

Agenda for this talk

• How text indexing and search work – and what they assume
• Fitting text into a traditional IT context
• Sorting out your text application needs
• Key considerations in text application architecture

Page 3: Integrating Text Into an Enterprise IT Environment February 25, 2003

There are no miracles or magic bullets

• “Search engines” aren’t the answer
• “Content management” isn’t the answer
• Clustering isn’t the answer
• XML isn’t the answer

No one technology solves all search problems

Page 4: Integrating Text Into an Enterprise IT Environment February 25, 2003

Gresham’s Law of Coinage

Bad (i.e., debased) coinage drives out good

Page 5: Integrating Text Into an Enterprise IT Environment February 25, 2003

Monash’s Law of Jargon

Bad uses of (recently coined) jargon drive out good ones

• Example: “Content management” can mean almost anything

Page 6: Integrating Text Into an Enterprise IT Environment February 25, 2003

Best practices for text apps are the same as for any other major IT challenge

• Understand your application needs
• Use safely proven technology where you can
• Push the boundaries of technology where you must
• Ask your users to make small changes in the way they do their jobs

Page 7: Integrating Text Into an Enterprise IT Environment February 25, 2003

Key takeaways

• The classical “technology stack” is evolving nicely to accommodate text
• Standalone search-in-a-box doesn’t solve very many problems
• Careful application analysis is crucial
  – It’s not just data design and workflow
  – Security needs to be designed in

Page 8: Integrating Text Into an Enterprise IT Environment February 25, 2003

Part 1

How text indexing and search work – and what they assume

Page 9: Integrating Text Into an Enterprise IT Environment February 25, 2003

Different application contexts

• Different kinds of problems
• Different available resources

Page 10: Integrating Text Into an Enterprise IT Environment February 25, 2003

Recall vs. Precision

• Recall = What percentage of the valid hits did you get?
  – Crucial if you actually need 100%
• Precision = What percentage of the (top) hits returned really are valid?
  – Important for user satisfaction and efficiency
• But how is “valid” measured???
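
The two definitions on this slide can be made concrete with a few lines of Python. This is only a sketch: the function name `precision_recall`, the document ids, and the hand-supplied set of "valid" documents are all assumptions for illustration, and the slide's last bullet is exactly the point that somebody has to decide what goes into that set.

```python
def precision_recall(returned, valid, top_k=None):
    """Return (precision, recall) for a ranked result list.

    returned: ranked list of document ids the engine gave back
    valid:    ids a human judge marked as valid (assumed to be given)
    top_k:    if set, only the first top_k hits are evaluated
    """
    hits = returned if top_k is None else returned[:top_k]
    valid = set(valid)
    good = {doc for doc in hits if doc in valid}
    # Precision: share of the returned hits that are valid.
    precision = len(good) / len(hits) if hits else 0.0
    # Recall: share of all valid documents that were returned.
    recall = len(good) / len(valid) if valid else 0.0
    return precision, recall


# Hypothetical judgments: d5 is valid but the engine never returned it.
returned = ["d1", "d2", "d3", "d4"]
valid = {"d1", "d3", "d5"}
print(precision_recall(returned, valid))           # all four hits
print(precision_recall(returned, valid, top_k=1))  # only the top hit
```

Looking only at the top hit raises precision and lowers recall, which is the trade-off the rest of the talk keeps returning to.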

Page 11: Integrating Text Into an Enterprise IT Environment February 25, 2003

Three fundamentally different scenarios

• Article search
• Web search
• OLTP application text search

Page 12: Integrating Text Into an Enterprise IT Environment February 25, 2003

Article search

• Very high recall may be needed
• Metadata may be reliable
• Document style and structure may be predictable

This is the “traditional” information retrieval challenge

Page 13: Integrating Text Into an Enterprise IT Environment February 25, 2003

Successful only in clearcut research markets

• Legal – Lexis
• Investments
  – Simple-minded apps
  – Stock symbols are the perfect keyword
• Intelligence community?
• Business “competitive intelligence”
• Scientific/medical

Page 14: Integrating Text Into an Enterprise IT Environment February 25, 2003

The “Daily Me” hasn’t arrived yet

• How well does the user understand information retrieval?
• Who has time to read anyway?
• Failures include Newsedge, Northern Light, et al.
• “Personalized” portals are wimpy, and nobody seems to care

Page 15: Integrating Text Into an Enterprise IT Environment February 25, 2003

Web search

• Precision is usually a bigger problem than recall (300,000 hits!)
• Metadata is unreliable (no standards, deliberate deception)
• Style and structure are enormously varied

Page 16: Integrating Text Into an Enterprise IT Environment February 25, 2003

Users like Google

But how are they using it?

• What they’re finding is good web sites
• They still have to navigate to the specific page

Page 17: Integrating Text Into an Enterprise IT Environment February 25, 2003

OLTP app text search

• 100% precision is assumed for the overall app …
• … so text search had better not be the only way to find documents
• The relational record probably is the metadata
• Hot future area
  – Usage is creeping up
  – Functionality is still primitive
  – App dev tools are improving dramatically, albeit from a dismal starting point
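
A rough sketch of "the relational record is the metadata", using Python's built-in `sqlite3` module. The table `call_report`, its columns, and the sample rows are invented for illustration. Plain `LIKE` stands in for a real text index, which a production DBMS would supply through its own text extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE call_report ("
    " id INTEGER PRIMARY KEY, customer TEXT, call_date TEXT, comment TEXT)"
)
conn.executemany(
    "INSERT INTO call_report VALUES (?, ?, ?, ?)",
    [
        (1, "Acme", "2003-01-10", "Customer asked for a refund on the damaged unit"),
        (2, "Acme", "2003-02-01", "Routine check-in call"),
        (3, "Globex", "2003-02-03", "Refund issued after complaint"),
    ],
)


def find_reports(conn, customer, phrase):
    """Narrow by the relational columns first, then match the free text."""
    rows = conn.execute(
        "SELECT id FROM call_report"
        " WHERE customer = ? AND comment LIKE ? ORDER BY id",
        (customer, f"%{phrase}%"),
    )
    return [row[0] for row in rows]


print(find_reports(conn, "Acme", "refund"))
```

The structured predicate (`customer = ?`) is exact, so the application keeps its 100% precision on the relational side; the text match only narrows within records that could also be found by key.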

Page 18: Integrating Text Into an Enterprise IT Environment February 25, 2003

Lessons from Amazon.com

• Search-based navigation can work
• The user needs a clear understanding of what s/he is looking for
• If you make an imprecise query, you have to accept an imprecise result set

Page 19: Integrating Text Into an Enterprise IT Environment February 25, 2003

It all starts with word search

• Big, specialized inverted-list index
  – Huge but sparse
  – Analogous to bit-map or star schema
• Digrams/trigrams/n-grams, offsets, stopwords
• Fortunately, integration into RDBMS has been largely solved
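
To make the inverted-list idea concrete, here is a toy index in Python. It is a sketch under simplifying assumptions: the stopword list, the tokenizer, and the function names are invented, n-grams are left out, and real engines compress and store these lists far more carefully.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; real lists are longer and language-specific.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map word -> {doc_id: [word offsets]}. Most words miss most documents,
    which is why the structure is huge but sparse."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for offset, word in enumerate(tokenize(text)):
            if word not in STOPWORDS:
                index[word][doc_id].append(offset)
    return index


def search_all(index, query):
    """Documents containing every non-stopword in the query."""
    words = [w for w in tokenize(query) if w not in STOPWORDS]
    if not words:
        return set()
    result = set(index.get(words[0], {}))
    for word in words[1:]:
        result &= set(index.get(word, {}))
    return result


def search_phrase(index, first, second):
    """Documents where `second` immediately follows `first` (uses offsets)."""
    hits = set()
    for doc_id, offsets in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(o + 1 in following for o in offsets):
            hits.add(doc_id)
    return hits


docs = {
    1: "The text index is huge",
    2: "Index the text of email",
    3: "Relational index",
}
index = build_index(docs)
print(search_all(index, "text index"))
print(search_phrase(index, "text", "index"))
```

The offsets are what separate "both words appear" from "the words appear as a phrase": documents 1 and 2 both contain "text" and "index", but only document 1 has them adjacent in that order.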

Page 20: Integrating Text Into an Enterprise IT Environment February 25, 2003

The ranking problem

• What does 75% relevance mean?
• How do you combine rankings from different subsystems?
• The SAME query against the SAME data can give different results in different search engines
• The SAME query against the SAME search engine can give different results if you add irrelevant data
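
The last bullet can be reproduced with a toy scorer. The sketch below assumes a simple term-frequency times inverse-document-frequency weighting, chosen here only as a common textbook scheme and not as the method of any particular engine; the documents and the `rank` function are invented. Adding off-topic documents that happen to share one query word changes the corpus statistics, and the order of the two original documents flips.

```python
import math
from collections import Counter


def rank(docs, query):
    """Rank doc ids by sum over query terms of tf * log(1 + N / df)."""
    n = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))  # document frequency: count each doc once
    scores = {}
    for doc_id, words in tokenized.items():
        tf = Counter(words)
        scores[doc_id] = sum(
            tf[t] * math.log(1 + n / df[t])
            for t in query.lower().split()
            if df[t]
        )
    return sorted(scores, key=scores.get, reverse=True)


docs = {"A": "database database report", "B": "text report"}
before = rank(docs, "database text")

# Off-topic additions that mention "database" but not "text".
for i in range(6):
    docs[f"X{i}"] = "database backup schedule"
after = rank(docs, "database text")

print(before)     # ranking over the two original documents
print(after[:2])  # the same two documents after the additions
```

Nothing about documents A and B changed between the two runs. "database" simply became a common word in the corpus, so it now counts for less than "text".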

Page 21: Integrating Text Into an Enterprise IT Environment February 25, 2003

Major issues for (key)word search

• Ambiguity
• Vagueness
• Information overload

Page 22: Integrating Text Into an Enterprise IT Environment February 25, 2003

Major tools

• Traditional linguistic techniques
• “Automagic” clustering
• Traditional metadata
• Socioheuristics

Page 23: Integrating Text Into an Enterprise IT Environment February 25, 2003

Traditional linguistic techniques

• Synonyms and other semantic clues
• Topic sentences and other syntactic clues
• Standard document structure

Page 24: Integrating Text Into an Enterprise IT Environment February 25, 2003

Query translation/expansion

• Thesaurus
  – End-user extensible
• Spelling correction
  – Traditional (e.g., drop the vowels)
  – Modern (e.g., compare to query logs)
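
A minimal sketch of the two "traditional" ideas on this slide: thesaurus expansion and vowel-dropping spelling correction. The thesaurus contents, the vocabulary, and the function names are hypothetical. The "modern" query-log approach is not shown because it depends on log data this example does not have.

```python
# Hypothetical thesaurus; a real one would be larger and end-user extensible.
THESAURUS = {"car": {"automobile", "auto"}}


def expand(query_words, thesaurus):
    """Turn each query word into a set of acceptable alternatives."""
    return [{w} | thesaurus.get(w, set()) for w in query_words]


def skeleton(word):
    """Keep the first letter, then drop the vowels from the rest."""
    word = word.lower()
    return word[:1] + "".join(c for c in word[1:] if c not in "aeiou")


def suggest(word, vocabulary):
    """Suggest vocabulary words whose skeleton matches the typed word's."""
    key = skeleton(word)
    return sorted(v for v in vocabulary if v != word and skeleton(v) == key)


print(expand(["fast", "car"], THESAURUS))
print(suggest("recieve", {"receive", "recover", "river"}))
```

A search would then match a document if, for each query position, it contains any word from that position's set. The skeleton trick catches transposed or wrong vowels but says nothing about which of several candidates users most likely meant, which is the gap query logs are meant to fill.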

Page 25: Integrating Text Into an Enterprise IT Environment February 25, 2003

Automagic clustering and information discovery

• Nice mathematical buzzwords
  – Bayesian statistics, etc.
  – It all boils down to “distance” measured in a very high-dimensional vector space
• Nice social science buzzwords too
  – Semiotics, etc.
• Same appeal as neural networks
  – The computer “discovers” what humans can’t
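
What "distance in a very high-dimensional vector space" means in practice can be sketched briefly. The example assumes the simplest possible representation, one dimension per distinct word with raw counts as coordinates, and uses cosine distance as one common choice of measure. Actual clustering products use more elaborate weightings, and the function names here are invented.

```python
import math
from collections import Counter


def vectorize(text):
    """One dimension per distinct word; the coordinate is the word count."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """0.0 for vectors pointing the same way, 1.0 for no words in common."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


d1 = vectorize("text search index")
d2 = vectorize("text search engine")
d3 = vectorize("quarterly revenue forecast")
print(cosine_distance(d1, d2))  # small: two of three words shared
print(cosine_distance(d1, d3))  # no shared words
```

A clustering algorithm groups documents whose pairwise distances are small. With tens of thousands of dimensions the arithmetic is the same, but it becomes hard to explain why two documents ended up together.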

Page 26: Integrating Text Into an Enterprise IT Environment February 25, 2003

Clustering technology isn’t sufficiently advanced yet to be “magic”

• Same weaknesses as neural networks too
  – Lack of reliability
  – Lack of transparency
  – Lack of predictability!
• Legacy of failure
  – Search engines: Excite, Northern Light
  – “Employee Internet Management” (i.e., porn/gambling filter) companies

Page 27: Integrating Text Into an Enterprise IT Environment February 25, 2003

Traditional metadata

• Typically supplied by the author/editor, or by a librarian
• Keywords, etc.
• Who/What/Where/When

Page 28: Integrating Text Into an Enterprise IT Environment February 25, 2003

Socioheuristics

• Measures of page popularity
• Guesses at author expertise

Page 29: Integrating Text Into an Enterprise IT Environment February 25, 2003

Sorting through the metadata

Since unaided “search” often works badly, metadata is crucial

Page 30: Integrating Text Into an Enterprise IT Environment February 25, 2003

So what is metadata in a search context?

• Standard definition of “metadata”: Data about data
• Actually, relational metadata usually is data about data structures
• But in the text world, metadata usually is data about the data itself

Page 31: Integrating Text Into an Enterprise IT Environment February 25, 2003

Categories of text metadata

• Library-like
• Extracted from the document
• Implicit in the corpus
• OLTP-like

Page 32: Integrating Text Into an Enterprise IT Environment February 25, 2003

Classical document metadata

• Comes from the library tradition (i.e., card catalogs) …
• … and/or from early online document stores used by librarians
• Examples:
  – Title, author, date, etc.
  – Hand-selected classification/categorization
  – Hand-selected keywords
• Can be created by author, editor, “librarian”

Page 33: Integrating Text Into an Enterprise IT Environment February 25, 2003

Extracted metadata

• In essence, precomputed text search
• Examples:
  – Key words (or keywords) and concepts
  – Titles and metatags
  – Topic sentences, summaries
  – Author, etc.

Page 34: Integrating Text Into an Enterprise IT Environment February 25, 2003

Implicit metadata – location, location, location

• Where on the net is the document?
• Judge a document by its neighbors
• Major problem – unstable net topography
  – URL patterns can’t be relied on, unfortunately
  – Google’s original algorithms were based on behavior analysis on the public WWW

Page 35: Integrating Text Into an Enterprise IT Environment February 25, 2003

Automatic metadata in “traditional” OLTP apps

• Examples
  – Comment fields in apps such as
    • CRM/call report
    • Maintenance/damage report
  – Web feedback forms
• Limited more by application imagination than by the data itself

Page 36: Integrating Text Into an Enterprise IT Environment February 25, 2003

Part 2

Fitting text into a traditional IT context

Page 37: Integrating Text Into an Enterprise IT Environment February 25, 2003

Benefits of storage in standard DBMS

• System management (e.g., backup, failover)
• Standard programming languages/APIs
• Security!!

Page 38: Integrating Text Into an Enterprise IT Environment February 25, 2003

Old objections to DBMS-based storage are invalid

• Performance – Proprietary systems can’t index email in real time either
• Specialized functionality – the DBMSs have long feature lists too

Page 39: Integrating Text Into an Enterprise IT Environment February 25, 2003

All enterprise data architectures are supported

• Central everything
• Central index, distributed storage
• Distributed/federated everything

Page 40: Integrating Text Into an Enterprise IT Environment February 25, 2003

Application development technology and tools are just emerging

• SQL/MM
• Search controls, etc.
• Emerging XML-centric technology
• Customizable “content management” systems

Page 41: Integrating Text Into an Enterprise IT Environment February 25, 2003

Canned text apps are a mixed bag

• Document management for regulatory filings
• Information discovery
• Generic search

Page 42: Integrating Text Into an Enterprise IT Environment February 25, 2003

Part 3

Sorting out your application needs

Page 43: Integrating Text Into an Enterprise IT Environment February 25, 2003

Different applications have very different profiles

• Precision/recall of result
• Quality of input
• Security

Page 44: Integrating Text Into an Enterprise IT Environment February 25, 2003

Basic application types, Group 1 – the fuzzies

• Portal (e.g., self-service HR)
  – Best case for generic WWW-like search
• Notes/Exchange/Email
  – Not clear what the real functionality needed is
  – Active area of research/development
• Information discovery

Page 45: Integrating Text Into an Enterprise IT Environment February 25, 2003

Basic application types, Group 2 – OLTP

• Heavy-duty transaction processing (ERP, supply chain, etc.)
  – Search is tangential
• Direct touch CRM
  – Basic search is underutilized but gaining ground
• Online sales/marketing (very different in different industries)
  – Search part of the app unlikely to be very demanding …
  – … except from a security standpoint

Page 46: Integrating Text Into an Enterprise IT Environment February 25, 2003

Basic application types, Group 3 – Heavy-duty analytic aids

• BI/CPM/Analytic apps
  – Great for taming the numerical part of the information tangle
  – Text search is largely irrelevant
• Product lifecycle management (engineering-centric)
  – Text is an afterthought
• Product lifecycle management (regulatory-centric)
  – Documentum et al. offer “compliance” solutions
• Online maintenance manuals
  – This is a biggie for text!!

Page 47: Integrating Text Into an Enterprise IT Environment February 25, 2003

Part 4

Key considerations in text application architecture

Page 48: Integrating Text Into an Enterprise IT Environment February 25, 2003

Five big issues

• Database integration
• Realistic options for document metadata
• Document stylistic consistency (local)
• Quality-of-search application requirements
• Security

Page 49: Integrating Text Into an Enterprise IT Environment February 25, 2003

Text database integration vs. relational database integration

• Remote indexing is an option
• Data cleaning and consistency issues are different
• Performance issues are different
• Everything is a little more primitive

Page 50: Integrating Text Into an Enterprise IT Environment February 25, 2003

Document metadata – consider the source

• Author/editor – can’t be relied on
• Implicit metadata – great if you trust your policies/procedures
• Extracted metadata – same strengths/weaknesses as general text search
• From a relational OLTP app – nice if you have it

Page 51: Integrating Text Into an Enterprise IT Environment February 25, 2003

Document stylistic consistency

• The best search techniques make assumptions about document structure
• Which assumptions are valid?
• 100% consistency is neither needed nor realistic

Page 52: Integrating Text Into an Enterprise IT Environment February 25, 2003

Quality-of-search requirements

• What ratio of bad hits is tolerable? (Precision)
• How close do you have to get to the target?
• What level of redundancy is tolerable?
• How crucial is it to get every good hit? (Recall)

Page 53: Integrating Text Into an Enterprise IT Environment February 25, 2003

Security

• What level of detail must be protected?
• How compartmentalized are need and permission to know?
• How important is security auditing?

Page 54: Integrating Text Into an Enterprise IT Environment February 25, 2003

Must search results themselves be kept secure?

• Existence of info?
• Summaries?
• Other document metadata?
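
One way to picture these questions is a result-trimming step that runs before hits are shown to the user. The sketch below is only an illustration: the `Hit` record, the group-based permission model, and the `reveal_existence` switch are all assumptions, and a real deployment would lean on the DBMS's own security rather than application code alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc_id: str
    title: str
    summary: str
    readers: frozenset  # groups allowed to read the full document


def trim_results(hits, user_groups, reveal_existence=False):
    """Filter a result list according to what the user may know.

    Permitted hits are returned with title and summary. Forbidden hits are
    dropped entirely, unless policy allows revealing that something exists,
    in which case title and summary are blanked out.
    """
    user_groups = frozenset(user_groups)
    visible = []
    for hit in hits:
        if hit.readers & user_groups:
            visible.append((hit.doc_id, hit.title, hit.summary))
        elif reveal_existence:
            visible.append((hit.doc_id, "[restricted]", ""))
    return visible


hits = [
    Hit("a", "Q1 forecast", "Revenue outlook", frozenset({"finance"})),
    Hit("b", "Cafeteria menu", "Lunch options", frozenset({"all-staff"})),
]
print(trim_results(hits, {"all-staff"}))
print(trim_results(hits, {"all-staff"}, reveal_existence=True))
```

Each bullet on the slide corresponds to a field that may or may not survive trimming: whether the hit appears at all, whether its summary is shown, and whether its other metadata (here just the title) is shown.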
