sla summer 2008
DESCRIPTION
My presentation to SLA, summer 2008
TRANSCRIPT
Mining Solutions: A New Approach to Making the Most of Your Research Time
SLA, Strategic Technology Alliance, Seattle, 2008
Joe Buzzanga, Product Manager, Elsevier Science and Technology
June 17, 2008
Agenda
•Challenges and Framework for Information Retrieval (IR)
•Using Natural Language Processing (NLP) in IR (illumin8)
•Product Demo
Digital Universe: 10x bigger in 5 years
“Searching for meaning in the content of unstructured data like images, video clips, documents, and the numbers and characters in databases is the rocket science of the digital universe.” IDC
Source: IDC Whitepaper, The Diverse and Exploding Digital Universe, March 2008
Today’s Researcher?
Search for Meaning?
Business Week Innovation Scorecard
What’s at Stake?
Amazon “Kindle”
Impact on Information Retrieval
•Separate the Signal from Noise
•Signal processing
Our Goal
•Make you successful through superior information retrieval tools
Framework for Information Retrieval
•Simple Model: single book
[Diagram, Simple Model: Content, Human Index, Search]
•Traditional: card catalog, periodical index…
[Diagram, Print Collections: Content, Surrogate Record, Meta Data, Human Index, Search]
Framework for Information Retrieval
[Diagram, Digital Bibliographic A&I: Content, Surrogate Record, Meta Data, Digital Index, Human Index, Hybrid Index, Search, Results]
•Digital bibliographic A&I
•Semi-structured records
•Content under editorial control
•Application of controlled terms
•Application of digital indexing
•Results need to be organized and ranked
•Additional access points (e.g., facets, tags…)
Framework for Information Retrieval
•No Human Intervention
•Content unstructured, uncontrolled and unmeasurable
•Crawling is inherently imperfect
•Typically Keyword indexing
•Ranking of results becomes critical
[Diagram, Web Search: Content, Crawl, Digital Index, Results]
Content: How Big is the Web?
Today
170 million websites across all domains
Source: Netcraft
2 years ago
80 million websites across all domains
Content: Plumbing the Depths
Source: Mills Davis, Project 10X
Content: How Big is the Web?
~10 Billion pages (2003 estimate)
http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/
Crawling in the Dark
The Key in Keyword?
• Keyword is a misnomer in context of an index
• Keyword is in the mind of the searcher
• Every word is indexed, since the computer is not smart enough to know significant words (i.e., the “key” in “keyword”)
• Brute force approach, feasible with compute power
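The "every word is indexed" idea above can be sketched as a tiny inverted index. This is only an illustration: the two documents and the function names `build_index` and `search` are made up for this example and do not come from any product described in the talk.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map every lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # No notion of "key" words: every token is indexed, brute force.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical two-document collection, for illustration only.
docs = {
    1: "Compostable film for food packaging",
    2: "The film festival opens in Seattle",
}
index = build_index(docs)
print(search(index, "film"))
print(search(index, "compostable film"))
```

Note that "the" is indexed just like "compostable"; the index itself cannot tell which word the searcher considers key.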
Results: Facets
Research and its Discontents
5.5 hours / week * Searching and gathering information
4.7 hours / week * Organizing, analyzing, and applying information
* Source: 2007 survey of 6,300 knowledge workers, Outsell, Inc.
Introducing illumin8
•Cut through the noise
•Rapid summary/overview
•Cross domain view
•Integrated content
•Web-based
•Sharing results
Applies Natural Language Processing at Internet Scale!
Typical Search
Current general search
Get millions of documents to sift through
Page 1, Page 2, … Page 180,000
compostable film
There is just no way any researcher can read through all this information. It just takes too long!
illumin8 Uses Natural Language Processing to “read” text
Enter search terms
Generate Organized Result Set
Products Companies/Organizations Technical Approaches
•Results grouped into meaningful classes
•System generates list of solutions, not records
•Quickly see interesting and useful areas for investigation
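Grouping results into classes, rather than returning a flat list of records, might look roughly like the sketch below. The class names come from the slide; the solution names ("Product A", "Company X", and so on) are placeholders, and `group_by_class` is a hypothetical helper, not part of illumin8.

```python
from collections import defaultdict


def group_by_class(solutions):
    """Group (name, class) pairs into a dict of class -> list of names."""
    grouped = defaultdict(list)
    for name, cls in solutions:
        grouped[cls].append(name)
    return dict(grouped)


# Placeholder solutions; real results would come from the extraction step.
solutions = [
    ("Product A", "Products"),
    ("Company X", "Companies/Organizations"),
    ("Product B", "Products"),
    ("Approach 1", "Technical Approaches"),
]

for cls, names in group_by_class(solutions).items():
    print(cls, names)
```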
Our Approach
• Premium Scientific
• Patent
• Web
[Diagram: Content, Search-Crawl-Load, Semantic Index, Results]
NLP Applied
Problems, Solutions, Benefits
NLP Applied
Fuse, Classify, Summarize
NLP Applied
NLP applied throughout the system: index, query, result set
Full Text
Abstracts
illumin8 searches on solutions. The solutions are extracted from full text sources, abstracts, web, and patents
Internet
Patents
illumin8 Solution Database
1.1 billion
5 Billion web pages, blogs and forums
3 Million full-text scientific and technical articles from 1,800 Elsevier journals
33 Million scientific records from 15,000 peer reviewed journals & more than 4,000 publishers
21 Million patents from 5 world-wide patent offices
Extract and Summarize Solutions
Search
How does illumin8 work?
WEB JOURNAL PATENT
• Summarizing information about Companies, Products, etc., for technologies that researchers care about
• Organizing results from the world's most trusted scientific content and billions of web pages
A Uniform Lens (index) Across Content Sources
Keyword Indexing
• Meaning is lost
Taking Search Beyond Keyword Indexing
Sentence processing
• Meaning is maintained
• Identify & classify problems, solutions and benefits
Neural Network [Solution] used in handwriting recognition [Problem]
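A small sketch of why meaning is lost under keyword indexing: once a sentence is reduced to an unordered set of words, swapping the solution and the problem produces exactly the same keywords. The reversed sentence and the helper `keyword_set` are made up for this illustration.

```python
def keyword_set(text):
    """Reduce a sentence to an unordered set of lowercased words."""
    return frozenset(text.lower().split())


a = "Neural network used in handwriting recognition"
b = "Handwriting recognition used in neural network"

# Same keywords, so a keyword index cannot tell the two sentences apart.
same_keywords = keyword_set(a) == keyword_set(b)

# Word order, which carries the solution/problem roles, does differ.
same_order = a.lower().split() == b.lower().split()

print(same_keywords, same_order)
```

Sentence processing keeps the order and grammatical roles, which is what allows "neural network" to be labeled as the solution and "handwriting recognition" as the problem.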
Natural Language Parsing
Help_patterns
Succeed2
Correct_problem
treat
Person_SAVS
positively_influence
have_positive_influence
protect_sb_against_sth
Product_would_do_good
provide_sb_with_sth
Product_is_shown_to
talented_at
use_sth_to_do_sth
approve_sth
rely_on_product_to
application_is
Product_allows_sb_to
VG2
ensure_protagonist
A_makes_B_good
benefit_of
...
Thousands of rules
Plus statistical models
illumin8 Rules
Capacitive deionization with carbon aerogel electrodes provides an economical and efficient method for removing salt and impurities from water.
Grammatical Role / Role Test / Role Assignment
•Verb: “provides” / Modal? / Continue …
•Subject: “Capacitive deionization” / Negated? / Solution
•Object: “an economical and efficient method for removing salt and impurities from water” / Antagonistic? / Benefit
For each role test: no, the rule continues; yes, the rule does not match.
Check that Verb polarity is positive; this rule would not match if the Verb were modal (i.e., only in certain cases), for example if it said “should provide … but”
Check that Subject is not negated; this rule would not match if Subject were not positive, for example if it said “no process provides an economical and efficient …”
Check that Object is not antagonistic; this rule would not match if Object were, for example, “provides a costly and complicated method”
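The three role tests above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not illumin8's actual rule engine: it assumes a parser has already split the sentence into subject, verb and object, and the word lists `MODALS`, `NEGATORS` and `ANTAGONISTIC` are hypothetical stand-ins for whatever the real system uses.

```python
from dataclasses import dataclass

# Hypothetical word lists, chosen only to cover the slide's examples.
MODALS = {"should", "could", "might", "may", "would", "can"}
NEGATORS = {"no", "not", "never", "none"}
ANTAGONISTIC = {"costly", "complicated"}


@dataclass
class Clause:
    """A sentence already split into grammatical roles (assumed input)."""
    subject: str
    verb: str
    obj: str


def words(text):
    return set(text.lower().split())


def apply_provides_rule(clause):
    """Return role assignments, or None when any role test fails."""
    if words(clause.verb) & MODALS:        # Modal? yes -> no match
        return None
    if words(clause.subject) & NEGATORS:   # Negated? yes -> no match
        return None
    if words(clause.obj) & ANTAGONISTIC:   # Antagonistic? yes -> no match
        return None
    return {"Solution": clause.subject, "Benefit": clause.obj}


match = apply_provides_rule(Clause(
    "Capacitive deionization with carbon aerogel electrodes",
    "provides",
    "an economical and efficient method for removing salt and impurities from water",
))
print(match)
```

Each of the slide's counterexamples ("should provide", "no process provides", "a costly and complicated method") fails one of the three checks and returns `None`.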
Analyzing A Sentence
Carrier’s Infinity™ Air Purifier uses ultraviolet light to eliminate germs such as viruses, molds, bacteria, mildew and mold spores from the indoor air of homes and offices, ensuring a higher indoor air quality.
Carrier [Organization]
Infinity Air Purifier [Product]
Ultraviolet light [Technology]
Germ [Problem]
Kind of Germ: Virus, Mold, Bacteria, Mildew, Mold spore
Indoor air quality [Benefit]
Relations: Makes, Uses, Solves, Provides, Kind of
Concepts, ideas and entities extracted from a single sentence.
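One way to hold what was extracted from that single sentence is a set of typed entities plus (subject, relation, object) triples. The entity types and relation names are taken from the slide; which entity sits on each side of a relation is our reading of the sentence, not something the slide spells out, and the helper names are hypothetical.

```python
# Entity -> type, as labeled on the slide.
entities = {
    "Carrier": "Organization",
    "Infinity Air Purifier": "Product",
    "Ultraviolet light": "Technology",
    "Germ": "Problem",
    "Indoor air quality": "Benefit",
}

germ_kinds = ["Virus", "Mold", "Bacteria", "Mildew", "Mold spore"]

# Assumed pairing of relations, read off the example sentence.
relations = [
    ("Carrier", "Makes", "Infinity Air Purifier"),
    ("Infinity Air Purifier", "Uses", "Ultraviolet light"),
    ("Infinity Air Purifier", "Solves", "Germ"),
    ("Infinity Air Purifier", "Provides", "Indoor air quality"),
] + [(kind, "Kind of", "Germ") for kind in germ_kinds]


def objects_of(subject, relation):
    """All objects linked to `subject` by `relation`."""
    return [o for s, r, o in relations if s == subject and r == relation]


def subjects_of(relation, obj):
    """All subjects linked to `obj` by `relation`."""
    return [s for s, r, o in relations if r == relation and o == obj]


print(objects_of("Infinity Air Purifier", "Uses"))
print(subjects_of("Kind of", "Germ"))
```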
DEMO