ppt
Post on 13-Sep-2014
17 views
DESCRIPTION
TRANSCRIPT
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 1
Towards Web-Scale Information Extraction
Eugene Agichtein Mathematics & Computer ScienceEmory [email protected]://www.mathcs.emory.edu/~eugene/
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 2
The Value of Text Data “Unstructured” text data is the primary form of human-generated
information Blogs, web pages, news, scientific literature, online reviews, … Semi-structured data (database generated): see Prof. Bing Liu’s
KDD webinar: http://www.cs.uic.edu/~liub/WCM-Refs.html The techniques discussed here are complimentary to structured
object extraction methods
Need to extract structured information to effectively manage, search, and mine the data
Information Extraction: mature, but active research area Intersection of Computational Linguistics, Machine Learning,
Data mining, Databases, and Information Retrieval Traditional focus on accuracy of extraction
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 3
Example: Answering Queries Over TextFor years, Microsoft
Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
Name Title OrganizationBill Gates CEO MicrosoftBill Veghte VP MicrosoftRichard Stallman Founder Free Soft..
PEOPLE
Select Name From PEOPLE Where Organization = ‘Microsoft’
Bill Gates
Bill Veghte
(from William Cohen’s IE tutorial, 2003)
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 4
Outline Information Extraction Tasks
Entity tagging Relation extraction Event extraction
Scaling up Information Extraction Focus on scaling up to large collections (where
data mining can be most beneficial) Other dimensions of scalability
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 5
Information Extraction Tasks Extracting entities and relations: this talk
Entities: named (e.g., Person) and generic (e.g., disease name) Relations: entities related in a predefined way (e.g., Location of a
Disease outbreak, or a CEO of a Company) Events: can be composed from multiple relation tuples
Common extraction subtasks: Preprocess: sentence chunking, syntactic parsing, morphological
analysis Create rules or extraction patterns: hand-coded, machine
learning, and hybrid Apply extraction patterns or rules to extract new information Postprocess and integrate information
Co-reference resolution, deduplication, disambiguation
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 6
Entity Tagging Identifying mentions of entities (e.g., person names, locations,
companies) in text MUC (1997): Person, Location, Organization, Date/Time/Currency ACE (2005): more than 100 more specific types
Hand-coded vs. Machine Learning approaches
Best approach depends on entity type and domain: Closed class (e.g., geographical locations, disease names, gene &
protein names): hand coded + dictionaries Syntactic (e.g., phone numbers, zip codes): regular expressions Semantic (e.g., person and company names): mixture of context,
syntactic features, dictionaries, heuristics, etc. “Almost solved” for common/typical entity types
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 7
Example: Extracting Entities from Text Useful for data warehousing, data cleaning, web
data integration
14089 Whispering Pines Nobel Drive San Diego CA 92122
House number Building Road City ZipState
Address
Citation
Segment(si) Sequence Label(si)
S1 Ronald Fagin Author
S2 Combining Fuzzy Information from Multiple Systems Title
S3 Proc. of ACM SIGMOD Conference
S4 2002 Year
Ronald Fagin, Combining Fuzzy Information from Multiple Systems, Proc. of ACM SIGMOD, 2002
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 8
Hand-Coded Methods Easy to construct in some cases
e.g., to recognize prices, phone numbers, zip codes, conference names, etc. Intuitive to debug and maintain
Especially if written in a “high-level” language:
Can incorporate domain knowledge Scalability issues:
Labor-intensive to create Highly domain-specific Often corpus-specific Rule-matches can be expensiveContactPattern RegularExpression(Email.body,”can be reached at”)
[IBM Avatar]
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 9
Machine Learning Methods Can work well when lots of training data easy to construct
Can capture complex patterns that are hard to encode with hand-crafted rules e.g., determine whether a review is positive or negative extract long complex gene names Non-local dependencies
The human T cell leukemia lymphotropic virus type 1 Tax protein represses MyoD-dependent transcription by inhibiting MyoD-binding to the KIX domain of p300.“ [From AliBaba]
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 10
Representation Models [Cohen and McCallum, 2003]
Any of these models can be used to capture words, formatting or both.
Lexicons
AlabamaAlaska…WisconsinWyoming
Sliding WindowClassify Pre-segmented
Candidates
Finite State Machines Context Free GrammarsBoundary Models
Abraham Lincoln was born in Kentucky.
member?
Abraham Lincoln was born in Kentucky.Abraham Lincoln was born in Kentucky.
Classifier
which class?
…and beyond
Abraham Lincoln was born in Kentucky.
Classifier
which class?
Try alternatewindow sizes:
Classifier
which class?
BEGIN END BEGIN END
BEGIN
Abraham Lincoln was born in Kentucky.
Most likely state sequence?
Abraham Lincoln was born in Kentucky.
NNP V P NPVNNP
NP
PP
VP
VP
S
Mos
t lik
ely
pars
e?
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 11
Popular Machine Learning Methods Naive Bayes
SRV [Freitag 1998], Inductive Logic Programming Rapier [Califf and Mooney 1997] Hidden Markov Models [Leek 1997] Maximum Entropy Markov Models [McCallum et al. 2000] Conditional Random Fields [Lafferty et al. 2001]
Scalability Can be labor intensive to construct training data At run time, complex features can be expensive to construct or
process (batch algorithms can help: [Chandel et al. 2006] )
For details: [Feldman, 2006 and Cohen, 2004]
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 12
Some Available Entity Taggers ABNER:
http://www.cs.wisc.edu/~bsettles/abner/ Linear-chain conditional random fields (CRFs) with orthographic and contextual
features.
Alias-I LingPipe http://www.alias-i.com/lingpipe/
MALLET: http://mallet.cs.umass.edu/index.php/Main_Page Collection of NLP and ML tools, can be trained for name entity tagging
MinorThird: http://minorthird.sourceforge.net/ Tools for learning to extract entities, categorization, and some visualization
Stanford Named Entity Recognizer: http://nlp.stanford.edu/software/CRF-NER.shtml CRF-based entity tagger with non-local features
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 13
Alias-I LingPipe ( http://www.alias-i.com/lingpipe/ ) Statistical named entity tagger
Generative statistical model Find most likely tags given lexical and linguistic features Accuracy at (or near) state of the art on benchmark tasks
Explicitly targets scalability: ~100K tokens/second runtime on single PC Pipelined extraction of entities User-defined mentions, pronouns and stop list
Specified in a dictionary, left-to-right, longest match Can be trained/bootstrapped on annotated corpora
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 14
Outline Overview of Information Extraction
Entity tagging Relation extraction Event extraction
Scaling up Information Extraction Focus on scaling up to large collections (where
data mining and ML techniques shine) Other dimensions of scalability
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 15
Relation Extraction Examples Extract tuples of entities that are related in predefined way
Date Disease Name Location
Jan. 1995 Malaria Ethiopia
July 1995 Mad Cow Disease U.K.
Feb. 1995 Pneumonia U.S.
May 1995 Ebola Zaire
May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is findingitself hard pressed to cope with the crisis…
Disease Outbreaks relation
Relation Extraction
„We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.“
CBF-A CBF-C
CBF-B CBF-A-CBF-C complex
interactcomplex
associates [From AliBaba]
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 16
Relation Extraction ApproachesKnowledge engineering Experts develop rules, patterns:
Can be defined over lexical items: “<company> located in <location>” Over syntactic structures: “((Obj <company>) (Verb located) (*) (Subj <location>))”
Sophisticated development/debugging environments: Proteus, GATE
Machine learning Supervised: Train system over manually labeled data
Soderland et al. 1997, Muslea et al. 2000, Riloff et al. 1996, Roth et al 2005, Cardie et al 2006, Mooney et al. 2005, …
Partially-supervised: train system by bootstrapping from “seed” examples: Agichtein & Gravano 2000, Etzioni et al., 2004, Yangarber & Grishman 2001, … “Open” (no seeds): Sekine et al. 2006, Cafarella et al. 2007, Banko et al. 2007
Hybrid or interactive systems: Experts interact with machine learning algorithms (e.g., active learning family) to
iteratively refine/extend rules and patterns Interactions can involve annotating examples, modifying rules, or any combination
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 17
Open Information Extraction [Banko et al., IJCAI 2007] Self-Supervised Learner:
All triples in a sample corpus (e1, r, e2) are considered potential “tuples” for relation r Positive examples: candidate triplets generated by a dependency parser Train classifier on lexical features for positive and negative examples
Single-Pass Extractor: Classify all pairs of candidate entities for some (undetermined) relation Heuristically generate a relation name from the words between entities
Redundancy-Based Assessor: Estimate probability that entities are related from co-occurrence statistics
Scalability Extraction/Indexing
No tuning or domain knowledge during extraction, relation inclusion determined at query time 0.04 CPU seconds pre sentence, 9M web page corpus in 68 CPU hours Every document retrieved, processed (parsed, indexed, classified) in a single pass
Query-time Distributed index for tuples by hashing on the relation name text
Related efforts: [Cucerzan and Agichtein 2005], [Pasca et al. 2006], [Sekine et al. 2006], [Rozenfeld and Feldman 2006], …
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 18
Event Extraction Similar to Relation Extraction, but:
Events can be nested Significantly more complex (e.g., more slots) than relations/template
elements Often requires coreference resolution, disambiguation, deduplication, and
inference
Example: an integrated disease outbreak event [Hatunnen et al. 2002]
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 19
Event Extraction: Integration Challenges Information spans multiple documents
Missing or incorrect values Combining simple tuples into complex events No single key to order or cluster likely duplicates while separating them
from similar but different entities. Ambiguity: distinct physical entities with same name (e.g., Kennedy)
Duplicate entities, relation tuples extracted Large lists with multiple noisy mentions of the same entity/tuple Need to depend on fuzzy and expensive string similarity functions Cannot afford to compare each mention with every other.
See Part II of KDD 2006 Tutorial “Scalable Information Extraction and Integration” -- scaling up integration: http://www.scalability-tutorial.net/
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 20
Other Information Extraction Tutorials
See these tutorials for more details:
R. Feldman, Information Extraction – Theory and Practice, ICML 2006http://www.cs.biu.ac.il/~feldman/icml_tutorial.html
W. Cohen, A. McCallum, Information Extraction and
Integration: an Overview, KDD 2003 http://www.cs.cmu.edu/~wcohen/ie-survey.ppt
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 21
Summary: Accuracy of Extraction Tasks
Errors cascade (errors in entity tag cause errors in relation extraction)
This estimate is optimistic: Primarily for well-established (tuned) tasks Many specialized or novel IE tasks (e.g. bio- and medical- domains) exhibit
lower accuracy Accuracy for all tasks is significantly lower for non-English
[Feldman, ICML 2006 tutorial]
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 22
Multilingual Information Extraction Active research area, beyond the scope of this talk. Nevertheless, a few (incomplete)
pointers are provided.
Closely tied to machine translation and cross-language information retrieval efforts.
Language-independent named entity tagging and related tasks at CoNLL: 2006: multi-lingual dependency parsing (http://nextens.uvt.nl/~conll/) 2002, 2003 shared tasks: language independent Named Entity Tagging
(http://www.cnts.ua.ac.be/conll2003/ner/)
Global Autonomous Language Exploitation program (GALE): http://www.darpa.mil/ipto/Programs/gale/concept.htm
Interlingual Annotation of Multilingual Text Corpora (IAMTC) Tools and data for building MT and IE systems for six languages http://aitc.aitcnet.org/nsf/iamtc/index.html
REFLEX project: NER for 50 languages Exploit for training temporal correlations in weekly aligned corpora http://l2r.cs.uiuc.edu/~cogcomp/wpt.php?pr_key=REFLEX
Cross-Language Information Retrieval (CLEF) http://www.clef-campaign.org/
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 23
Outline Overview of Information Extraction
Entity tagging Relation extraction Event Extraction
Scaling up Information Extraction Focus on scaling up to large collections (where
data mining and ML techniques shine) Other dimensions of scalability
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 24
Scaling Information Extraction to the Web Dimensions of Scalability
Corpus size: Applying rules/patterns is expensive Need efficient ways to select/filter relevant documents
Document accessibility: Deep web: documents only accessible via a search interface Dynamic sources: documents disappear from top page
Source heterogeneity: Coding/learning patterns for each source is expensive Requires many rules (expensive to apply)
Domain diversity: Extracting information for any domain, entities, relationships Some recent progress (e.g., see slide 17) Not the focus of this talk
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 25
Scaling Up Information Extraction Scan-based extraction
Classification/filtering to avoid processing documents Sharing common tags/annotations
General keyword index-based techniques QXtract, KnowItAll
Specialized indexes BE/KnowItNow, Linguist’s Search Engine
Parallelization/distributed processing IBM WebFountain, UIMA, Google’s Map/Reduce
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 26
Efficient Scanning for Information Extraction
80/20 rule: use few simple rules to capture majority of the instances [Pantel et al. 2004]
Train a classifier to discard irrelevant documents without processing [Grishman et al. 2002] (e.g., the Sports section of NYT is unlikely to describe disease
outbreaks)
Share base annotations (entity tags) for multiple extraction tasks
Output Tuples
…Extraction
System
Text Database
4. Extract output tuples
3. Process documents
1. Retrieve docs from database
Classifier
2. Filter documents filtered
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 27
Exploiting Keyword and Phrase Indexes Generate queries to retrieve only relevant documents
Data mining problem!
Some methods in literature: Traversing Query Graphs [Agichtein et al. 2003] Iteratively refine queries [Agichtein and Gravano 2003] Iteratively partition document space [Etzioni et al., 2004]
Case studies: QXtract, KnowItAll
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 28
Simple Strategy: Iterative Set Expansion Output Tuples
…Extraction
System
Text Database
3. Extract tuplesfrom docs
2. Process retrieved documents
1. Query database with seed tuples
Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q
Time for retrieving a document
Time for answering a query
Time for processing a document
Query
Generation
4. Augment seed tuples with new tuples
(e.g., [Ebola AND Zaire])(e.g., <Malaria, Ethiopia>)
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 29
Reachability via Querying
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Upper recall limit: determined by the size of the biggest connected component
Reachability GraphTuples Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
<SARS, China>
<Ebola, Zaire>
<Malaria, Ethiopia>
<Cholera, Sudan>
<H5N1, Vietnam>
[Agichtein et al. 2003b]
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 30
Reachability Graph for DiseaseOutbreaks
DiseaseOutbreaks, New York Times 1995
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 31
Getting Around Reachability Limits KnowItAll [Etzioni et al., WWW 2004]:
Add keywords to partition documents into retrievable disjoint sets
Submit queries with parts of extracted instances
QXtract [Agichtein and Gravano 2003a]
General queries with many matching documents Assumes many documents retrievable per query
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 32
QXtract [Agichtein and Gravano 2003]
1. Get document sample with “likely negative” and “likely positive” examples.
2. Label sample documents usinginformation extraction systemas “oracle.”
3. Train classifiers to “recognize”useful documents.
4. Generate queries from classifiermodel/rules.
Query Generation
Information Extraction
Text Database
Search Engine
? ???
? ?
??
++
++
- -
--
Seed Sampling
Classifier Training
Queries
tuple1tuple2tuple3tuple4tuple5
++
++
- -
--
User-Provided Seed Tuples
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 33
Using Generic Indexes: Summary Order of magnitude speed up in runtime But: keyword indexes are approximate (so the
queries are not precise) Require many documents to retrieve
Can we do better?
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 34
Index Structures for Information Extraction Bindings Engine [Cafarella and Etzioni 2005]
Indexing and querying entities: [K. Chakrabarti et al. 2006]
IBM Avatar project: http://www.almaden.ibm.com/cs/projects/avatar/
Other indexing schemes: Linguist’s search engine (P. Resnik): http://lse.umiacs.umd.edu:8080/ FREE: Indexing regular expressions: [Cho and Rajagolapan, ICDE 2002] Indexing and querying linguistic information in XML: [Bird et al., 2006]
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 35
Variabilized search query language Integrates variable/type data with inverted index, minimizing query seeks
Index <NounPhrase>, <Adj-Term> terms Key idea: neighbor index
At each position in the index, store “neighbor text” – both lexemes and tags
Query: “cities such as <NounPhrase>”
Bindings Engine (BE) [Cafarella
and Etzioni 2005]
as
billy
cities
friendly
give
mayors
nickels
philadelphia
such
words
neighbor1 str1
#docs …pos0 pos1 docid#docs-1 pos#docs-1
#posns pos0 pos1 … pos#pos-1
docid0 docid1
#posns pos0 neighbor0 pos1 neighbor1 …pos#pos-1
…
#neighborsblk_offset
19
12
Result: in document 19: “I love cities such as Philadelphia.”
neighbor0 str0
NPright Philadelphia3<offset> AdjTleft such
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 36
Related Approach: [K. Chakrabarti
et al. 2006] Support “relationship” keyword queries over indexed entities Top-K support for early processing termination
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 37
Workload-Driven Indexing [S. Chakrabarti et al. 2006]
Indexing Thousands of Entity Types
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 38
Workload-Driven Indexing (continued)
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 39
Parallelization/Adaptive Processing Parallelize processing:
WebFountain [Gruhl et al. 2004] UIMA architecture Map/Reduce
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 40
IBM WebFountain
Dedicated share-nothing 256-node cluster Blackboard annotation architecture Data pipelined and streamed past each
“augmenter” to add annotations Merge and index annotations Index both tokens and annotations Between 25K-75K entities per second
[Gruhl et al. 2004]
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 41
UIMA (IBM Research) Unstructured Information Management Architecture (UIMA)
http://www.research.ibm.com/UIMA/
Open component software architecture for development, composition, and deployment of text processing and analysis components.
Run-time framework allows to plug in components and applications and run them on different platforms. Supports distributed processing, failure recovery, …
Scales to millions of documents – incorporated into IBM OmniFind, grid computing-ready
The UIMA SDK (freely available) includes a run-time framework, APIs, and tools for composing and deploying UIMA components.
Framework source code also available on Sourceforge: http://uima-framework.sourceforge.net/
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 42
Map/Reduce ([Dean & Ghemawat, OSDI 2004])
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 43
Map/Reduce (continued) General framework
Scales to 1000s of machines Recently implemented in Nutch and other open source efforts
“Maps” nicely to information extraction Map phase:
Parse individual documents Tag entities Propose candidate relation tuples
Reduce phase Merge multiple mentions of same relation tuple Resolve co-references, duplicates
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 44
Summary Presented a brief overview of information extraction
Techniques and ideas to scale up information extraction Scan-based techniques (limited impact) Exploiting general indexes (limited accuracy) Building specialized index structures (most promising)
Scalability is a data mining problem Querying graphs link discovery Workload mining for index optimization Automatically optimizing for specific text mining application Considerations for building integrated data mining systems
Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction 45
References and Supplemental Materials References, slides, and additional information
available at: http://www.mathcs.emory.edu/~eugene/kdd-webinar/
Will also post answers to questions