Open Information Extraction from the Web
Oren Etzioni
KnowItAll Project (2003…): Rob Bart, Janara Christensen, Tony Fader, Tom Lin, Alan Ritter, Michael Schmitz, Dr. Niranjan Balasubramanian, Dr. Stephen Soderland, Prof. Mausam, Prof. Dan Weld
PhD alumni: Michele Banko, Prof. Michael Cafarella, Prof. Doug Downey, Ana-Maria Popescu, Stefan Schoenmackers, and Prof. Alex Yates
Funding: DARPA, IARPA, NSF, ONR, Google.
Etzioni, University of Washington
Outline
I. A “scruffy” view of Machine Reading
II. Open IE (overview, progress, new demo)
III. Critique of Open IE
IV. Future work: Open, Open IE
I. Machine Reading (Etzioni, AAAI ‘06)
• “MR is an exploratory, open-ended, serendipitous process”
• “In contrast with many NLP tasks, MR is inherently unsupervised”
• “Very large scale”
• “Forming Generalizations based on extracted assertions”
No Ontology… Ontology Free!
Lessons from DB/KR Research
• Declarative KR is expensive & difficult
• Formal semantics is at odds with
– Broad scope– Distributed authorship
• KBs are brittle: “can only be used for tasks whose knowledge needs have been anticipated in advance” (Halevy IJCAI ‘03)
A fortiori, for KBs extracted from text!
Machine Reading at Web Scale
• A “universal ontology” is impossible
• Global consistency is like world peace
• Micro-ontologies: scale? Interconnections?
• Ontological “glass ceiling”
  – Limited vocabulary
  – Pre-determined predicates
  – Swamped by reading at scale!
II. Open vs. Traditional IE

             Traditional IE                    Open IE
Input:       Corpus + O(R) hand-labeled data   Corpus
Relations:   Specified in advance              Discovered automatically
Extractor:   Relation-specific                 Relation-independent
How is Open IE Possible?
Semantic Tractability Hypothesis
There is an easy-to-understand subset of English
• Characterized relations/arguments syntactically (Banko, ACL ’08; Fader, EMNLP ’11; Etzioni, IJCAI ‘11)
• Characterization is compact, domain independent
• Covers 85% of binary, verb-based relations
SAMPLE RELATION PHRASES
invented       acquired by     has a PhD in
denied         voted for       inhibits tumor growth in
inherited      born in         mastered the art of
downloaded     aspired to      is the patron saint of
expelled       arrived from    wrote the book on
Number of Relations
DARPA MR Domains              <50
NYU, Yago                     <100
NELL                          ~500
DBpedia 3.2                   940
PropBank                      3,600
VerbNet                       5,000
Wikipedia InfoBoxes, f > 10   ~5,000
TextRunner (phrases)          100,000+
ReVerb (phrases)              1,000,000+
TextRunner (2007)
• First Web-scale Open IE system
• Distant supervision + CRF models of relations
• Extractions take the form (Arg1, Relation phrase, Arg2)
• 1,000,000,000 distinct extractions
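A minimal sketch of producing such (Arg1, Relation phrase, Arg2) tuples. The tiny regex and three-entry verb-phrase lexicon here are hypothetical stand-ins for TextRunner's CRF extractor, which operates over POS-tagged text:

```python
import re

# Sketch of Open IE tuple extraction (assumption: a naive regex and a
# hand-picked relation-phrase list stand in for the real CRF model).
PATTERN = re.compile(
    r"(?P<arg1>[A-Z]\w*(?: [A-Z]\w*)*)"      # capitalized noun phrase
    r" (?P<rel>was born in|invented|acquired) "  # tiny relation lexicon
    r"(?P<arg2>[A-Z]\w*(?: [A-Z]\w*)*)"      # capitalized noun phrase
)

def extract(sentence):
    """Return (arg1, relation phrase, arg2) tuples found in the sentence."""
    return [(m.group("arg1"), m.group("rel"), m.group("arg2"))
            for m in PATTERN.finditer(sentence)]

tuples = extract("Einstein was born in Ulm.")
```

Running this on the slide's example sentence yields the tuple (Einstein, was born in, Ulm); the real system is relation-independent rather than driven by a fixed lexicon.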
Relation Extraction from Web
Open IE (2012)
• Open source ReVerb extractor
• Synonym detection
• Parser-based Ollie extractor (Mausam, EMNLP ’12)
  – Verbs → nouns and more
  – Analyze context (beliefs, counterfactuals)
• Sophistication of IE is a major focus
But what about entities, types, ontologies?
After beating the Heat, the Celtics are now the “top dog” in the NBA.
(the Celtics, beat, the Heat)
If he wins 5 key states, Romney will be president
(counterfactual: “if he wins 5 key states”)
Towards “Ontologized” Open IE
• Link arguments to Freebase (Lin, AKBC ’12)
  – When possible!
• Associate types with args
• No Noun Phrase Left Behind (Lin, EMNLP ’12)
System Architecture
• Input: Web corpus
• Extractor (relation-independent extraction) → raw tuples
• Assessor (synonyms, confidence) → extractions
• Processing: index in Lucene; link entities
• Output
Raw tuples:
(XYZ Corp.; acquired; Go Inc.)   (oranges; contain; Vitamin C)   (Einstein; was born in; Ulm)
(XYZ; buyout of; Go Inc.)   (Albert Einstein; born in; Ulm)   (Einstein Bros.; sell; bagels)
Synonyms: XYZ Corp. = XYZ; Albert Einstein = Einstein ≠ Einstein Bros.
Assessed extractions:
Acquire(XYZ Corp., Go Inc.) [7]
BornIn(Albert Einstein, Ulm) [5]
Sell(Einstein Bros., bagels) [1]
Contain(oranges, Vitamin C) [1]
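The Assessor step above (merging synonymous names and relation phrases, then counting repeated assertions) can be sketched as follows; the hard-coded synonym table is a placeholder for what the real system learns:

```python
from collections import Counter

# Sketch of the Assessor: canonicalize via a synonym table, then count.
# Assumption: the table is given; the real system learns synonyms and
# attaches confidence scores instead of raw counts.
SYNONYMS = {
    "XYZ": "XYZ Corp.",
    "Einstein": "Albert Einstein",   # note: "Einstein Bros." must NOT map
    "buyout of": "acquired",
    "was born in": "born in",
}

def normalize(tuples):
    """Canonicalize arguments/relations, then count duplicate assertions."""
    counts = Counter()
    for arg1, rel, arg2 in tuples:
        counts[(SYNONYMS.get(arg1, arg1), SYNONYMS.get(rel, rel), arg2)] += 1
    return counts

raw = [("XYZ Corp.", "acquired", "Go Inc."),
       ("XYZ", "buyout of", "Go Inc."),
       ("Einstein", "was born in", "Ulm"),
       ("Albert Einstein", "born in", "Ulm"),
       ("Einstein Bros.", "sell", "bagels")]
counts = normalize(raw)
```

After normalization the two acquisition tuples and the two birthplace tuples collapse, while the Einstein Bros. tuple stays separate, mirroring the slide's example.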
Query processor DEMO
III. Critique of Open IE
• Lack of formal ontology/vocabulary
• Inconsistent extractions
• Can it support reasoning?
• What’s the point of Open IE?
Perspectives on Open IE
A. “Search Needs a Shakeup” (Etzioni, Nature ’11)
B. Textual Resources
C. Reasoning over Extractions
A. New Paradigm for Search
“Moving Up the Information Food Chain” (Etzioni, AAAI ‘96)
Retrieval → Extraction
Snippets, docs → Entities, relations
Keyword queries → Questions
List of docs → Answers
Essential for smartphones!(Siri meets Watson)
Case Study over Yelp Reviews
1. Map review corpus to (attribute, value) pairs: (sushi = fresh), (parking = free)
2. Natural-language queries: “Where’s the best sushi in Seattle?”
3. Sort results via sentiment analysis: exquisite > very good > so-so
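The three steps could be sketched like this; the attribute-extraction pattern, the sentiment scale, and the restaurant names are hypothetical simplifications of what RevMiner actually learns from review text:

```python
import re

# Sketch of the RevMiner pipeline (assumption: a fixed regex and a
# hand-set sentiment scale stand in for the learned models).
SENTIMENT = {"so-so": 0, "good": 1, "very good": 2, "exquisite": 3}

def mine(review):
    """Step 1: map a review sentence to an (attribute, value) pair."""
    m = re.search(r"the (\w+) (?:is|was) ([\w-]+(?: \w+)?)", review.lower())
    return (m.group(1), m.group(2)) if m else None

def rank(pairs, attribute):
    """Steps 2-3: answer 'where's the best X?' by sorting on sentiment."""
    hits = [(place, val) for place, (attr, val) in pairs.items()
            if attr == attribute]
    return sorted(hits, key=lambda h: SENTIMENT.get(h[1], -1), reverse=True)

reviews = {"Sushi Kashiba": mine("The sushi was exquisite"),
           "Joe's Diner": mine("The sushi was good")}
best = rank(reviews, "sushi")
```

The query “Where’s the best sushi?” then reduces to ranking the places whose mined attribute is “sushi” by the sentiment of the mined value.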
RevMiner: Extractive Interface to 400K Yelp Reviews (Huang, UIST ’12)
revminer.com
B. Public Textual Resources(Leveraging Open IE)
• 94M Rel-grams: n-grams, but over relations in text (Balasubramanian, AKBC ’12)
• 600K relation phrases (Fader, EMNLP ’11)
• Relation meta-data:
  – 50K domain/range for relations (Ritter, ACL ’10)
  – 10K functional relations (Lin, EMNLP ’10)
• 30K learned Horn clauses (Schoenmackers, EMNLP ’10)
• CLEAN (Berant, ACL ’12)
  – 10M entailment rules (coming soon)
  – Precision double that of DIRT
See openie.cs.washington.edu
(police investigate X) → (police charge Y)
C. Reasoning over Extractions
• 1,000,000,000 extractions
• Identify synonyms (Yates & Etzioni, JAIR ’09)
• Learn argument types via generative model (Ritter, ACL ’10)
• Linear-time 1st-order Horn-clause inference (Schoenmackers, EMNLP ’08)
• Transitive inference (Berant, ACL ’11)
Unsupervised, probabilistic model for identifying synonyms
• P(Bill Clinton = President Clinton)
  – Count shared (relation, arg2) pairs
• P(acquired = bought)
  – Relations: count shared (arg1, arg2) pairs
• Functions, mutual recursion
• Next step: unify with
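The shared-context counting above can be illustrated with a simple overlap score; the Jaccard measure and the toy facts here are stand-ins for the actual probabilistic model:

```python
# Sketch of synonym detection by shared contexts (assumption: Jaccard
# overlap stands in for the unsupervised probabilistic model).
def contexts(entity, extractions):
    """All (relation, arg2) pairs observed with this entity as arg1."""
    return {(rel, arg2) for arg1, rel, arg2 in extractions if arg1 == entity}

def synonymy_score(e1, e2, extractions):
    """Higher when two names share more of their extraction contexts."""
    c1, c2 = contexts(e1, extractions), contexts(e2, extractions)
    if not c1 or not c2:
        return 0.0
    return len(c1 & c2) / len(c1 | c2)

facts = [("Bill Clinton", "born in", "Hope"),        # hypothetical tuples
         ("President Clinton", "born in", "Hope"),
         ("Bill Clinton", "vetoed", "the bill"),
         ("Einstein", "born in", "Ulm")]
score = synonymy_score("Bill Clinton", "President Clinton", facts)
```

The same idea runs in the other direction for relation synonymy (acquired = bought), counting shared (arg1, arg2) pairs instead.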
Scalable Textual Inference
Desiderata for inference:
• In text → probabilistic inference
• On the Web → linear in |Corpus|
Argument distributions of textual relations:
• Inference provably linear
• Empirically linear!
Inference Scalability for Holmes
Extractions → Domain/range
• Much previous work (Resnik, Pantel, etc.)
• Utilize generative topic models
Analogy: extractions of R ↔ document; domain/range of R ↔ topics
born_in(Einstein, Ulm)
headquartered_in(Microsoft, Redmond)
founded_in(Microsoft, 1973)
born_in(Bill Gates, Seattle)
founded_in(Google, 1998)
headquartered_in(Google, Mountain View)
born_in(Sergey Brin, Moscow)
founded_in(Microsoft, Albuquerque)
born_in(Einstein, March)
born_in(Sergey Brin, 1973)
TextRunner Extractions: Relations as Documents
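The analogy can be made concrete: treat each relation as a “document” whose “words” are its arg2 values, ready to feed a topic model. A minimal sketch over the example extractions:

```python
from collections import defaultdict

# Sketch of "relations as documents": for each relation, collect its
# arg2 strings as a bag of words for a topic model over argument types.
def relation_documents(extractions):
    docs = defaultdict(list)
    for arg1, rel, arg2 in extractions:
        docs[rel].append(arg2)    # arg2 plays the role of a word
    return dict(docs)

ex = [("Einstein", "born_in", "Ulm"),
      ("Bill Gates", "born_in", "Seattle"),
      ("Microsoft", "founded_in", "1973"),
      ("Google", "founded_in", "1998")]
docs = relation_documents(ex)
```

A topic model run over such documents would then discover that born_in's arg2 “words” cluster into Location (and occasionally Date) topics.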
Generative Story [LinkLDA, Erosheva et al. 2004]
[Plate diagram: relation R with hidden topics z1, z2 and arguments a1, a2, over N extractions and T topics, with two sets of type distributions h1, h2]
• For each relation, randomly pick a distribution over types:
  X born_in Y: P(Topic1 | born_in) = 0.5, P(Topic2 | born_in) = 0.3, …
• For each extraction, pick a type for a1 and a2 (two separate sets of type distributions, one per argument slot):
  Person born_in Location
• Then pick arguments based on types:
  Sergey Brin born_in Moscow
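A toy simulation of this generative story, with hand-set type distributions and argument lists standing in for the ones the model would infer from TextRunner data:

```python
import random

# Toy simulation of the LinkLDA generative story (assumption: all
# distributions below are hand-set, not learned). Note the two
# separate sets of type distributions, one per argument slot.
T1_GIVEN_REL = {"born_in": {"Person": 1.0}}
T2_GIVEN_REL = {"born_in": {"Location": 0.7, "Date": 0.3}}
ARGS_GIVEN_TYPE = {"Person": ["Sergey Brin", "Einstein"],
                   "Location": ["Moscow", "Ulm"],
                   "Date": ["1879", "1973"]}

def sample_type(dist, rng):
    types, weights = zip(*dist.items())
    return rng.choices(types, weights=weights)[0]

def generate(relation, rng):
    """Pick a type per argument slot, then pick arguments given types."""
    t1 = sample_type(T1_GIVEN_REL[relation], rng)
    t2 = sample_type(T2_GIVEN_REL[relation], rng)
    return (rng.choice(ARGS_GIVEN_TYPE[t1]), relation,
            rng.choice(ARGS_GIVEN_TYPE[t2]))

tup = generate("born_in", random.Random(0))
```

Inverting this story with Bayesian inference is what recovers domain/range types like born_in(Person, Location OR Date) from raw extractions.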
Examples of Learned Domain/range
• elect(Country, Person)
• predict(Expert, Event)
• download(People, Software)
• invest(People, Assets)
• was-born-in(Person, Location OR Date)
Summary: Trajectory of Open IE
2003     KnowItAll project
2007     TextRunner: 1,000,000,000 “ontology free” extractions
2008–9   Inference over extractions
2010–11  Open source extractor; public textual resources
2012     Freebase types; IE-based search; deeper analysis of sentences
openie.cs.washington.edu
IV. Future: Open Open IE
• Open input: ingest tuples from any source as (Tuple, Source, Confidence)
• Linked open output:
  – Extractions → Linked Open Data (LOD) cloud
  – Relation normalization
  – Use LOD best practices
• Specialized reasoners
Conclusions
1. Ontology is not necessary for reasoning
2. Open IE is “gracefully” ontologized
3. Open IE is boosting text analysis
4. LOD has distribution & scale (but not text) = opportunity
Thank you
Questions
• Why Open?
• What’s next?
• Dimensions for analyzing systems
• What’s worked, what’s failed? (lessons)
• What can we learn from Watson?
• What can we learn from DB/KR? (Alon)
Questions
• What extraction mechanism is used?
• What corpus?
• What input knowledge?
• Role for people / manual labeling?
• Form of the extracted knowledge?
• Size/scope of extracted knowledge?
• What reasoning is done?
• Most unique aspect?
• Biggest challenge?
Scalability notes
• Interoperability, distributed authorship, vs. a monolithic system
• Open IE meets RDF:
  – Need URIs for predicates. How to obtain them?
  – What about errors in mapping to URIs?
  – Ambiguity? Uncertainty?
Reasoning
• NELL: inter-class constraints to generate negative examples
Dimensions of Scalability
• Corpus size
• Syntactic coverage over text
• Semantic coverage over text
  – Time, belief, n-ary relations, etc.
• Number of entities, relations
• Ability to reason
• How much CPU?
• How much manual effort?
• Bounding, ceiling effect, ontological glass ceiling
Example of limiting assumptions
• NELL: “apple” has a single meaning
• Single atom per entity
  – Global computation to add an entity
  – Can’t be sure
• LOD:
  – Best practice
  – Same-as links
Risk for scalable system
• Limited semantics, reasoning
• No reasoning…
LOD triples in Aug 2011: 31,634,213,770
The following statement appears in the last paragraph of the W3C Linked Library Data Group Final Report:
“…Linked Data follows an open-world assumption: the assumption that data cannot generally be assumed to be complete and that, in principle, more data may become available for any given entity.”
Entity Linking an Extraction Corpus
“Einstein quit his job at the patent office” (8)
1. String Match   2. Prominence Priors   3. Context Match
String match: obtain candidates (US Patent Office, EU Patent Office, Japan Patent Office, Swiss Patent Office, Patent) and measure string similarity. Exact string match = best match; also consider:
• Alternate capitalization
• Edit distance
• Word overlap
• Substring/superstring
• Known aliases
• Potential abbreviations
Prominence ∝ # of links in Wikipedia to that entity’s article
Context match: cosine similarity between the “document” of the extraction’s source sentences (“Einstein quit his job at the patent office.”, “Einstein quit his job at the patent office to become a professor.”, …) and Wikipedia article texts.
Link Score is a function of (String Match Score, Prominence Prior Score, Context Match Score),
e.g., String Match Score × ln(Prominence Prior Score) × Context Match Score
Link Ambiguity = (2nd Top Link Score) / (Top Link Score)
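The score combination and ambiguity ratio above can be sketched directly; the candidate entities, inlink counts, and component scores below are hypothetical illustrations, not values from the real system:

```python
import math

# Sketch of the link-scoring formula: string match x ln(prominence
# inlinks) x context match, plus the ambiguity ratio of the top two
# scores. All numbers below are made up for illustration.
def link_score(string_match, prominence_inlinks, context_match):
    return string_match * math.log(prominence_inlinks) * context_match

def link_ambiguity(scores):
    """Second-highest link score divided by the highest link score."""
    top = sorted(scores, reverse=True)
    return top[1] / top[0] if len(top) > 1 and top[0] > 0 else 0.0

candidates = {
    "Swiss Patent Office": link_score(0.6, 168, 0.9),   # strong context
    "US Patent Office":    link_score(0.6, 1281, 0.2),  # prominent, poor fit
    "Patent":              link_score(0.9, 4620, 0.1),  # generic concept
}
best = max(candidates, key=candidates.get)
ambiguity = link_ambiguity(list(candidates.values()))
```

A low ambiguity ratio signals a confident link; a ratio near 1 means the top two candidates are nearly indistinguishable.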
A 2.53GHz computer links 15 million text arguments in ~3 days (60+ per second).
Collective linking vs. one extraction at a time: faster, higher precision.
[Results panel: “Sports that originated in China”: Golf, Ping Pong, Dragon Boating, Wushu, Karate, Soccer, …]
Q/A with Linked Extractions
• Ambiguous entities
• Typed search
• Linked resources
“I need to learn about Titanic the ship for my homework.”
“Titanic earned more than $1 billion worldwide”“The Titanic sank in 1912”“The Titanic was released in 1998”“Titanic represents the state-of-the-art in special effects”“Titanic was built in Belfast”
(3,761 more …)
“The Titanic set sail from Southampton”
“RMS Titanic weighed about 26 kt”“The Titanic was built for safety and comfort”“The Titanic sank in 12,460 feet of water”
(1,902 more …)
“Which sports originated in China?”
Untyped results: “Noodles originated in China”, “Printmaking originated in China”, “Soy Beans originated in China”, “Wushu originated in China”, “Taoism originated in China”, “Ping Pong originated in China” (534 more …)
Typed search (sports only): “Golf originated in China”, “Soccer originated in China”, “Karate originated in China”, “Dragon Boating originated in China” (14 more …)
Leverages KBs by linking textual arguments to entities found in the knowledge base.
Freebase Sports: “Dragon Boat Racing”, “Table Tennis”, …
Linked Extractions support Reasoning
In addition to Question Answering, Linking can also benefit:
Functions [Ritter et al., 2008; Lin et al., 2010]
Other Relation Properties [Popescu 2007; Lin et al., CSK 2010]
Inference [Schoenmackers et al., 2008; Berant et al., 2011]
Knowledge-Base Population [Dredze et al., 2010]
Concept-Level Annotations [Christensen and Pasca, 2012]
… basically anything using the output of extraction
Other Web-based text containing Entities (e.g., Query Logs) can also be linked to enable new experiences…
Challenges
• Single-sentence extraction
  – He believed the plan will work
  – John Glenn was the first American in space
  – Obama was elected President in 2008.
  – American president Barack Obama asserted…
• ??