![Page 1: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/1.jpg)
One Tool, Many Industries
Text Mining with Oracle
Omar Alonso, Chuck Adams
Oracle Corp.
Text Mining Summit, Boston, 2005
![Page 2: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/2.jpg)
Agenda
Introduction
Text mining
Define problems
Present solutions
A look at Oracle’s technology stack
Oracle’s roadmap
A case study
Conclusions
![Page 3: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/3.jpg)
Data mining and Text mining
Structured Data: OLTP, OLAP, DM
Unstructured Data: Keyword search, BK, TM
TM
• Classification
• Clustering
• Ontologies
• NLP
• Inexact match
![Page 4: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/4.jpg)
An analogy
RFID and robot vision
– Put tags on everything instead of having the robot do the vision
Similar approach for text mining
– Language is very social, not technical
– Instead, start with a unified storage model
– Then do mining
![Page 5: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/5.jpg)
What about text mining?
Text mining is one of many features in text technology
Real future of text technology is business intelligence (BI)
What is BI?
– Ability to make better decisions
What are the obstacles today?
– Structured data is well understood
– Unstructured data is different
![Page 6: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/6.jpg)
Text and XML
Increased exploitation of structure
Plain Old File System
File System on Steroids (WinFS)
Records Mgmt, ECM
Dynamic Doc Generation
Traditional Content Mgmt
XML Content Mgmt.
![Page 7: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/7.jpg)
First problem: access
No uniform access over all sources
Each source has separate storage and algebra
Examples
– Email
– Databases
– Applications
– Web
![Page 8: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/8.jpg)
Second problem: management
Management of unstructured data is very poor compared with structured data
Cleaning
Noise is larger than in structured data
Security
Multilingual
![Page 9: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/9.jpg)
Third problem – user needs
Perception with current search engines
Large data -> 80/20 rule
Doesn't provide uniform information
Two users type the same query and get the same results
– Cricket the game or cricket the bug?
![Page 10: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/10.jpg)
Foundations
XML as the common model
XML allows:
– Manipulation of data with standards
– Mining becomes more like data mining
– RDF emerging as a complementary model
The more structure you can explore, the better you can do mining
Integration use cases
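A minimal sketch of the "XML as the common model" idea: items from different sources are wrapped in one layout, so a keyword search can be combined with structured filters. The `<doc>`/`<meta>`/`<body>` layout, the field names, and the sample records are assumptions for illustration and do not come from the slides.

```python
import xml.etree.ElementTree as ET

def to_common_xml(source, doc_id, fields, body):
    """Wrap one item from any source in the same (hypothetical) <doc> layout."""
    doc = ET.Element("doc", attrib={"source": source, "id": doc_id})
    meta = ET.SubElement(doc, "meta")
    for name, value in fields.items():
        ET.SubElement(meta, name).text = value
    ET.SubElement(doc, "body").text = body
    return doc

# Invented sample items standing in for email, database, and web sources.
corpus = ET.Element("corpus")
corpus.append(to_common_xml("email", "e1", {"author": "alice", "date": "2005-03-01"},
                            "Quarterly financials attached."))
corpus.append(to_common_xml("database", "r7", {"author": "bob", "date": "2005-03-02"},
                            "Bug in the financials module."))
corpus.append(to_common_xml("web", "w3", {"author": "alice", "date": "2005-03-05"},
                            "Cricket scores for the weekend."))

def search(corpus, keyword, **meta_filters):
    """Keyword match on the body plus exact-match filters on metadata fields."""
    hits = []
    for doc in corpus.findall("doc"):
        body = doc.findtext("body", default="")
        if keyword.lower() not in body.lower():
            continue
        if all(doc.findtext(f"meta/{k}") == v for k, v in meta_filters.items()):
            hits.append(doc.get("id"))
    return hits

print(search(corpus, "financials"))                  # keyword only
print(search(corpus, "financials", author="alice"))  # keyword plus structure
```

Once every source shares the layout, the metadata filter is an ordinary structured query, which is the sense in which mining becomes more like data mining.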
![Page 11: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/11.jpg)
Foundations - II
Unstructured data is too AI
Too easy to get fooled by the complexity
Hybrid solution
Domain knowledge
– You know your domain
– You own the content
– You can do better
![Page 12: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/12.jpg)
Remember?
![Page 13: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/13.jpg)
Personalization problem
Lack of personalization
You own the content, you own the user
Two users type the same query: “financials”
– Sales rep looks for customers and other deals
– Tech guy looks for bugs, architecture, etc.
LDAP shows who they are
Combination with query logs shows patterns in the same peer group
Recommendation systems
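One way to picture the combination of directory data and query logs is a re-ranking step: hits that a user's peer group clicked for the same query move up. This is only a sketch under stated assumptions. The `DIRECTORY` dictionary stands in for an LDAP lookup, and the log entries, document names, and function names are invented.

```python
from collections import Counter, defaultdict

# Hypothetical directory lookup standing in for LDAP: user -> peer group.
DIRECTORY = {"ann": "sales", "raj": "sales", "kim": "engineering", "lee": "engineering"}

# Hypothetical query log: (user, query, document the user clicked).
QUERY_LOG = [
    ("ann", "financials", "customer-deals"),
    ("raj", "financials", "customer-deals"),
    ("raj", "financials", "pricing-sheet"),
    ("kim", "financials", "bug-list"),
    ("lee", "financials", "architecture-notes"),
    ("lee", "financials", "bug-list"),
]

def build_group_clicks(log, directory):
    """Count clicks per (peer group, query) so groups can be compared."""
    clicks = defaultdict(Counter)
    for user, query, doc in log:
        clicks[(directory[user], query)][doc] += 1
    return clicks

def personalize(hits, user, query, clicks, directory):
    """Re-rank hits by how often the user's peer group clicked each one."""
    counts = clicks[(directory[user], query)]
    # sorted() is stable, so ties keep the engine's original order.
    return sorted(hits, key=lambda doc: -counts[doc])

clicks = build_group_clicks(QUERY_LOG, DIRECTORY)
hits = ["architecture-notes", "bug-list", "customer-deals", "pricing-sheet"]
print(personalize(hits, "ann", "financials", clicks, DIRECTORY))
print(personalize(hits, "kim", "financials", clicks, DIRECTORY))
```

The same hit list is reordered differently for the sales user and the engineering user, which is the behavior the "financials" example describes.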
![Page 14: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/14.jpg)
Better Answers: Beyond Keywords
Noise theory
– As you cast your nets ever wider, you catch disproportionately more junk
Must develop new models of Quality in the face of comprehensiveness
– Combine Link-Analysis with Context-sensitive relevance
– Personalization
Must summarize information
– Theme Maps, Gists
Show patterns in information vs. many pages of hit-lists
– Tree Maps, Stretch Viewer
Ability to post-process and refine search hit lists
– Dynamic categories for navigation
– Reorder by date
Progressive query relaxation
– Nearest inexact match
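Progressive query relaxation can be sketched as a sequence of increasingly loose match stages that stops as soon as enough hits are found. The four stages, the sample documents, and the `wanted` threshold below are assumptions chosen for illustration. This is not Oracle Text syntax, and the fuzzy stage uses Python's `difflib` as a stand-in for an inexact matcher.

```python
import difflib

# Invented sample documents: id -> lowercase text.
DOCS = {
    1: "oracle text mining summit",
    2: "text mining with oracle database",
    3: "data mining for business intelligence",
    4: "keyword search engines",
}

def exact_phrase(query, text):
    return query in text

def all_terms(query, text):
    words = text.split()
    return all(term in words for term in query.split())

def any_term(query, text):
    words = text.split()
    return any(term in words for term in query.split())

def fuzzy_term(query, text):
    words = text.split()
    return any(difflib.get_close_matches(term, words, n=1, cutoff=0.8)
               for term in query.split())

# Strictest stage first; each later stage relaxes the previous one.
STAGES = [("phrase", exact_phrase), ("all terms", all_terms),
          ("any term", any_term), ("fuzzy", fuzzy_term)]

def relaxed_search(query, docs, wanted=2):
    """Run stages in order and stop once at least `wanted` hits are collected."""
    query = query.lower()
    results, seen = [], set()
    for name, matches in STAGES:
        for doc_id, text in docs.items():
            if doc_id not in seen and matches(query, text):
                seen.add(doc_id)
                results.append((doc_id, name))
        if len(results) >= wanted:
            break
    return results

print(relaxed_search("text mining", DOCS, wanted=3))
print(relaxed_search("minning", DOCS, wanted=1))
```

Because results are appended stage by stage, the strictest matches stay at the top of the hit list and looser matches only fill the remaining slots.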
![Page 15: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/15.jpg)
Technology Stack
Better Answers
Relevance Toward BI
Progressive Relaxation
Multi-Criterion Support
Visualization
Classification
Personalization
Direct Answers
Link Analysis
Query Log Analysis
Metadata Extraction
Keyword Ranking
Intelligent Match
Duplicate Elimination
![Page 16: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/16.jpg)
![Page 17: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/17.jpg)
![Page 18: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/18.jpg)
![Page 19: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/19.jpg)
Oracle’s position
Text mining is one of many tools for information retrieval and discovery in many assets
Text mining is best used in the context of other techniques
– Personalization
– Search query logs
– Visualization
Product: one integrated platform
![Page 20: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/20.jpg)
Oracle platform
Integrated platform vs. niche technology
Integrated platform: one platform, low cost, low complexity
Niche technology: several products, different APIs, performance, maintenance cost, etc.
Capability: niche products
– Full-text searching: Google, FAST
– XML: Tamino
– Classification: Autonomy
– Clustering: Vivisimo
– Visualization: Inxight
– Application search: SAP/TREX
![Page 21: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/21.jpg)
Oracle platform
“If I have seen further, it is by standing on the shoulders of giants” – Isaac Newton
Oracle provides you all the functionality
– Plus you get backup, recovery, scalability, and other benefits
You build the mining application
![Page 22: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/22.jpg)
Case study
Federal customer
High Performance Text Information Mining and Entity Extraction
![Page 23: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/23.jpg)
Business Need
Enterprise Search Capability
Information Fusion
Profiles and alerting
Security – user need to know
Entity identification and extraction
High Performance ingestion, search, and indexing
Scalability
![Page 24: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/24.jpg)
Challenges
Search quality
Performance
Scalability
Document formats
Integration
Operations and maintenance
![Page 25: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/25.jpg)
Solutions Architecture
Oracle 10g Integrated Framework
10g release 2
– Oracle Real Application Clusters
– Oracle Text
  • Full text and rule based indexing
  • Extensible thesauri
  • Document classification
  • Document filters
– Oracle Partitioning
– Oracle Virtual Private Database
– Oracle Advanced Security
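Rule-based document classification, which also underlies profile-style alerting, can be pictured as stored rules that are matched against each incoming document. The sketch below is plain Python under stated assumptions, not the Oracle Text API. The rule format (`all` and `any` term lists), the category names, and the tokenizer are invented for illustration.

```python
# Hypothetical rules: a document matches a category when it contains every
# term in "all" and, if "any" is non-empty, at least one term in "any".
RULES = {
    "finance": {"all": ["budget"], "any": ["quarter", "forecast"]},
    "security": {"all": [], "any": ["breach", "intrusion"]},
}

def tokenize(text):
    """Lowercase the text and split on whitespace after dropping , and ."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())

def classify(text, rules):
    """Return every category whose rule the document satisfies."""
    words = tokenize(text)
    matched = []
    for category, rule in rules.items():
        has_all = all(term in words for term in rule["all"])
        has_any = not rule["any"] or any(term in words for term in rule["any"])
        if has_all and has_any:
            matched.append(category)
    return matched

print(classify("Budget forecast for next quarter.", RULES))
print(classify("Intrusion detected, budget forecast delayed.", RULES))
```

Here the rules are indexed and the documents act as the queries, the reverse of ordinary search, so a new document can be routed or trigger an alert the moment it arrives.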
![Page 26: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/26.jpg)
Technical Architecture
[Architecture diagram; labels and annotations as extracted]
Components
– EDL Portal Users
– Load Balancer
– Application Servers
– Oracle 10g RAC nodes
– Oracle 10g RAC Interconnect
– Enterprise MetaData Layer
– Scalar, Domain, and B*Tree Indices
– ADS OID
– Existing Mission Systems
Annotations
– Process-isolated RAC DB nodes: one tuned for user query and the other for data synchronization
– Key metadata consolidated and indexed for enterprise data layer access
– CIA PKI authentication from ADSN clients
– ADS LDAP integrated for client and server authentication
– Network-based integration hub and EDL synchronization services
– Federated data access J2EE services for mission system drill
![Page 27: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/27.jpg)
Scalable load and indexing
[Pipeline diagram; labels as extracted]
– Data Collection
– Preprocess & Filtering
– UTF8 Text Extracted from Collection
– Standardized XML DTD
– Java Load Distribution Process
– Java Load Thread (×8)
– Oracle 9i & 9i Text: Raw Payload, Payload Index, Scalar Indexes, XML Indexes
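The load pipeline on this slide (preprocess, wrap in a standard XML layout, distribute across load threads) can be sketched with a thread pool. The slide describes Java load threads. The version below uses Python for brevity, and the `<doc>`/`<payload>` layout, the function names, and the in-memory `store` standing in for database inserts are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET

def preprocess(raw_bytes):
    """Decode to text and drop blank lines; undecodable bytes are replaced."""
    text = raw_bytes.decode("utf-8", errors="replace")
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())

def to_xml(doc_id, text):
    """Wrap the cleaned text in a hypothetical standardized layout."""
    doc = ET.Element("doc", attrib={"id": doc_id})
    ET.SubElement(doc, "payload").text = text
    return ET.tostring(doc, encoding="unicode")

def load_one(item, store):
    """One unit of work for a load thread; the dict write stands in for an insert."""
    doc_id, raw = item
    store[doc_id] = to_xml(doc_id, preprocess(raw))
    return doc_id

def load_all(items, workers=8):
    """Distribute items across a fixed pool of load threads."""
    store = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        loaded = list(pool.map(lambda item: load_one(item, store), items))
    return store, loaded

items = [("d1", b"  hello \n\n world "), ("d2", "caf\u00e9".encode("utf-8"))]
store, loaded = load_all(items)
print(loaded)
print(store["d1"])
```

Each worker writes a distinct key, so the shared dictionary needs no extra locking in this sketch. A real loader would give each thread its own database connection instead.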
![Page 28: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/28.jpg)
Real world results
Single search for user
Profiles and alerts
Query response within a couple of seconds
80,000,000+ documents indexed
1.2 TB raw text and growing
700 GB index size
Incremental index 1-2 GB / day
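A quick back-of-the-envelope reading of these figures, assuming decimal units (1 TB = 1000 GB) and that the daily 1-2 GB refers to index growth. Neither assumption is stated on the slide.

```python
docs = 80_000_000
raw_bytes = 1_200_000_000_000   # 1.2 TB of raw text, decimal units assumed
index_bytes = 700_000_000_000   # 700 GB index, decimal units assumed

avg_doc_kb = raw_bytes / docs / 1000    # average raw size per document
index_ratio = index_bytes / raw_bytes   # index size relative to raw text
daily_growth_low = 1 / 700              # 1 GB/day relative to a 700 GB index
daily_growth_high = 2 / 700             # 2 GB/day relative to a 700 GB index

print(f"average document: {avg_doc_kb:.1f} KB")
print(f"index / raw text: {index_ratio:.0%}")
print(f"daily index growth: {daily_growth_low:.2%} to {daily_growth_high:.2%}")
```

Under those assumptions the average document is about 15 KB and the index is a little under 60% of the raw text size.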
![Page 29: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/29.jpg)
Next Steps
[Diagram; labels as extracted]
• Entity Extraction and Relationship Awareness
– Oracle 10g Text index structure
– Entity identification and extraction engine
– Language-specific dictionaries (×4)
– Extracted entities
– XML interface
– Relationship detection engine
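A minimal sketch of how language-specific dictionaries and a relationship step could fit together, assuming simple dictionary lookup for extraction and sentence-level co-occurrence for relationships. The slides do not say which techniques the engines use, so both are stand-ins. The dictionaries, entity types, and sample sentences are invented.

```python
import itertools
import re
from collections import Counter

# Hypothetical language-specific dictionaries: surface form -> entity type.
DICTIONARIES = {
    "en": {"oracle": "ORG", "boston": "PLACE", "omar alonso": "PERSON"},
    "es": {"oracle": "ORG", "madrid": "PLACE"},
}

def extract_entities(sentence, lang):
    """Return sorted (name, type) pairs for dictionary entries found as whole words."""
    lowered = sentence.lower()
    found = []
    for name, kind in DICTIONARIES[lang].items():
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            found.append((name, kind))
    return sorted(found)

def detect_relationships(sentences, lang):
    """Count how often two entities appear in the same sentence."""
    pairs = Counter()
    for sentence in sentences:
        names = [name for name, _ in extract_entities(sentence, lang)]
        for a, b in itertools.combinations(names, 2):
            pairs[(a, b)] += 1
    return pairs

sentences = [
    "Omar Alonso spoke in Boston.",
    "Oracle sent Omar Alonso to Boston.",
    "Oracle is a database vendor.",
]
print(extract_entities(sentences[0], "en"))
print(detect_relationships(sentences, "en").most_common())
```

Co-occurrence counts are a crude signal, but they show why extraction has to come first: relationships are only detectable between entities that have already been identified.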
![Page 30: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/30.jpg)
Oracle Database 10g Release 2
Enterprise Search Capability
Information Fusion
Profiles and alerting
Security – user need to know
Entity identification and extraction
High Performance ingestion, search, and indexing
Scalability
![Page 31: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/31.jpg)
Conclusions
Text mining is one of many features needed for BI on unstructured data
– Not a silver bullet in itself
Must exploit other approaches – metadata (XML, RDF), personalization, classification, entity extraction, full-text search, …
– Hybrid solution
Focus on an integrated platform that gives you all the functionality
Drive the platform for your information need
![Page 32: One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005](https://reader036.vdocument.in/reader036/viewer/2022062318/55163ba8550346c6758b51c3/html5/thumbnails/32.jpg)