web scale information extraction

40
Web scale Information Extraction

Upload: madelyn-sapphire

Post on 02-Jan-2016

35 views

Category:

Documents


1 download

DESCRIPTION

Web scale Information Extraction. The Value of Text Data. “Unstructured” text data is the primary form of human-generated information Blogs, web pages, news, scientific literature, online reviews, … - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Web scale Information Extraction

Web scale Information Extraction

Page 2: Web scale Information Extraction

The Value of Text Data

“Unstructured” text data is the primary form of human-generated information Blogs, web pages, news, scientific literature, online reviews, … Semi-structured data (database generated): see Prof. Bing Liu’s

KDD webinar: http://www.cs.uic.edu/~liub/WCM-Refs.html The techniques discussed here are complimentary to structured

object extraction methods

Need to extract structured information to effectively manage, search, and mine the data

Information Extraction: mature, but active research area Intersection of Computational Linguistics, Machine Learning,

Data mining, Databases, and Information Retrieval Traditional focus on accuracy of extraction

Page 3: Web scale Information Extraction

Outline

Information Extraction Tasks Entity tagging Relation extraction Event extraction

Scaling up Information Extraction Focus on scaling up to large collections (where data

mining can be most beneficial) Other dimensions of scalability

Page 4: Web scale Information Extraction

Information Extraction Tasks

Extracting entities and relations: this talk Entities: named (e.g., Person) and generic (e.g., disease name) Relations: entities related in a predefined way (e.g., Location of a

Disease outbreak, or a CEO of a Company) Events: can be composed from multiple relation tuples

Common extraction subtasks: Preprocess: sentence chunking, syntactic parsing, morphological

analysis Create rules or extraction patterns: hand-coded, machine learning, and

hybrid Apply extraction patterns or rules to extract new information Postprocess and integrate information

Co-reference resolution, deduplication, disambiguation

Page 5: Web scale Information Extraction

Entity Tagging Identifying mentions of entities (e.g., person names, locations, companies) in

text MUC (1997): Person, Location, Organization, Date/Time/Currency ACE (2005): more than 100 more specific types

Hand-coded vs. Machine Learning approaches

Best approach depends on entity type and domain: Closed class (e.g., geographical locations, disease names, gene & protein

names): hand coded + dictionaries Syntactic (e.g., phone numbers, zip codes): regular expressions Semantic (e.g., person and company names): mixture of context, syntactic

features, dictionaries, heuristics, etc. “Almost solved” for common/typical entity types

Page 6: Web scale Information Extraction

Example: Extracting Entities from Text

Useful for data warehousing, data cleaning, web data integration

14089 Whispering Pines Nobel Drive San Diego CA 92122

House number Building Road City ZipState

Address

Citation

Segment(si) Sequence Label(si)

S1 Ronald Fagin Author

S2 Combining Fuzzy Information from Multiple Systems Title

S3 Proc. of ACM SIGMOD Conference

S4 2002 Year

Ronald Fagin, Combining Fuzzy Information from Multiple Systems, Proc. of ACM SIGMOD, 2002

Page 7: Web scale Information Extraction

Hand-Coded Methods Easy to construct in some cases

e.g., to recognize prices, phone numbers, zip codes, conference names, etc. Intuitive to debug and maintain

Especially if written in a “high-level” language:

Can incorporate domain knowledge Scalability issues:

Labor-intensive to create Highly domain-specific Often corpus-specific Rule-matches can be expensiveContactPattern RegularExpression(Email.body,”can be reached at”)

[IBM Avatar]

Page 8: Web scale Information Extraction

Machine Learning Methods Can work well when lots of training data easy to construct

Can capture complex patterns that are hard to encode with hand-crafted rules e.g., determine whether a review is positive or negative extract long complex gene names Non-local dependencies

The human T cell leukemia lymphotropic virus type 1 Tax protein represses MyoD-dependent transcription by inhibiting MyoD-binding to the KIX domain of p300.“

[From AliBaba]

Page 9: Web scale Information Extraction

Popular Machine Learning Methods Naive Bayes SRV [Freitag 1998], Inductive Logic Programming Rapier [Califf and Mooney 1997] Hidden Markov Models [Leek 1997] Maximum Entropy Markov Models [McCallum et al. 2000] Conditional Random Fields [Lafferty et al. 2001]

Scalability Can be labor intensive to construct training data At run time, complex features can be expensive to construct or process

(batch algorithms can help: [Chandel et al. 2006] )

For details: [Feldman, 2006 and Cohen, 2004]

Page 10: Web scale Information Extraction

Some Available Entity Taggers

ABNER: http://www.cs.wisc.edu/~bsettles/abner/ Linear-chain conditional random fields (CRFs) with orthographic and contextual

features.

Alias-I LingPipe http://www.alias-i.com/lingpipe/

MALLET: http://mallet.cs.umass.edu/index.php/Main_Page Collection of NLP and ML tools, can be trained for name entity tagging

MinorThird: http://minorthird.sourceforge.net/ Tools for learning to extract entities, categorization, and some visualization

Stanford Named Entity Recognizer: http://nlp.stanford.edu/software/CRF-NER.shtml CRF-based entity tagger with non-local features

Page 11: Web scale Information Extraction

Alias-I LingPipe ( http://www.alias-i.com/lingpipe/ )

Statistical named entity tagger Generative statistical model

Find most likely tags given lexical and linguistic features Accuracy at (or near) state of the art on benchmark tasks

Explicitly targets scalability: ~100K tokens/second runtime on single PC Pipelined extraction of entities User-defined mentions, pronouns and stop list

Specified in a dictionary, left-to-right, longest match Can be trained/bootstrapped on annotated corpora

Page 12: Web scale Information Extraction

Outline

Overview of Information Extraction Entity tagging Relation extraction Event extraction

Scaling up Information Extraction Focus on scaling up to large collections (where data

mining and ML techniques shine) Other dimensions of scalability

Page 13: Web scale Information Extraction

Relation Extraction Examples Extract tuples of entities that are related in predefined way

Date Disease Name Location

Jan. 1995 Malaria Ethiopia

July 1995 Mad Cow Disease U.K.

Feb. 1995 Pneumonia U.S.

May 1995 Ebola Zaire

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is findingitself hard pressed to cope with the crisis…

Disease Outbreaks relation

Relation Extraction

„We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.“

CBF-A CBF-C

CBF-B CBF-A-CBF-C complex

interactcomplex

associates [From AliBaba]

Page 14: Web scale Information Extraction

Relation Extraction ApproachesKnowledge engineering Experts develop rules, patterns:

Can be defined over lexical items: “<company> located in <location>” Over syntactic structures: “((Obj <company>) (Verb located) (*) (Subj

<location>))” Sophisticated development/debugging environments:

Proteus, GATE

Machine learning Supervised: Train system over manually labeled data

Soderland et al. 1997, Muslea et al. 2000, Riloff et al. 1996, Roth et al 2005, Cardie et al 2006, Mooney et al. 2005, …

Partially-supervised: train system by bootstrapping from “seed” examples: Agichtein & Gravano 2000, Etzioni et al., 2004, Yangarber & Grishman 2001, … “Open” (no seeds): Sekine et al. 2006, Cafarella et al. 2007, Banko et al. 2007

Hybrid or interactive systems: Experts interact with machine learning algorithms (e.g., active learning family) to

iteratively refine/extend rules and patterns Interactions can involve annotating examples, modifying rules, or any

combination

Page 15: Web scale Information Extraction

Open Information Extraction [Banko et al., IJCAI 2007] Self-Supervised Learner:

All triples in a sample corpus (e1, r, e2) are considered potential “tuples” for relation r Positive examples: candidate triplets generated by a dependency parser Train classifier on lexical features for positive and negative examples

Single-Pass Extractor: Classify all pairs of candidate entities for some (undetermined) relation Heuristically generate a relation name from the words between entities

Redundancy-Based Assessor: Estimate probability that entities are related from co-occurrence statistics

Scalability Extraction/Indexing

No tuning or domain knowledge during extraction, relation inclusion determined at query time 0.04 CPU seconds pre sentence, 9M web page corpus in 68 CPU hours Every document retrieved, processed (parsed, indexed, classified) in a single pass

Query-time Distributed index for tuples by hashing on the relation name text

Related efforts: [Cucerzan and Agichtein 2005], [Pasca et al. 2006], [Sekine et al. 2006], [Rozenfeld and Feldman 2006], …

Page 16: Web scale Information Extraction

Event Extraction Similar to Relation Extraction, but:

Events can be nested Significantly more complex (e.g., more slots) than relations/template elements Often requires coreference resolution, disambiguation, deduplication, and

inference

Example: an integrated disease outbreak event [Hatunnen et al. 2002]

Page 17: Web scale Information Extraction

Event Extraction: Integration Challenges

Information spans multiple documents Missing or incorrect values Combining simple tuples into complex events No single key to order or cluster likely duplicates while separating

them from similar but different entities. Ambiguity: distinct physical entities with same name (e.g., Kennedy)

Duplicate entities, relation tuples extracted Large lists with multiple noisy mentions of the same entity/tuple Need to depend on fuzzy and expensive string similarity functions Cannot afford to compare each mention with every other.

See Part II of KDD 2006 Tutorial “Scalable Information Extraction and Integration” -- scaling up integration: http://www.scalability-tutorial.net/

Page 18: Web scale Information Extraction

Summary: Accuracy of Extraction Tasks

Errors cascade (errors in entity tag cause errors in relation extraction)

This estimate is optimistic: Primarily for well-established (tuned) tasks Many specialized or novel IE tasks (e.g. bio- and medical- domains) exhibit lower

accuracy Accuracy for all tasks is significantly lower for non-English

[Feldman, ICML 2006 tutorial]

Page 19: Web scale Information Extraction

Multilingual Information Extraction Closely tied to machine translation and cross-language information retrieval efforts.

Language-independent named entity tagging and related tasks at CoNLL: 2006: multi-lingual dependency parsing (http://nextens.uvt.nl/~conll/) 2002, 2003 shared tasks: language independent Named Entity Tagging

(http://www.cnts.ua.ac.be/conll2003/ner/)

Global Autonomous Language Exploitation program (GALE): http://www.darpa.mil/ipto/Programs/gale/concept.htm

Interlingual Annotation of Multilingual Text Corpora (IAMTC) Tools and data for building MT and IE systems for six languages http://aitc.aitcnet.org/nsf/iamtc/index.html

REFLEX project: NER for 50 languages Exploit for training temporal correlations in weekly aligned corpora http://l2r.cs.uiuc.edu/~cogcomp/wpt.php?pr_key=REFLEX

Cross-Language Information Retrieval (CLEF) http://www.clef-campaign.org/

Page 20: Web scale Information Extraction

Outline

Overview of Information Extraction Entity tagging Relation extraction Event Extraction

Scaling up Information Extraction Focus on scaling up to large collections (where data

mining and ML techniques shine) Other dimensions of scalability

Page 21: Web scale Information Extraction

Scaling Information Extraction to the Web

Dimensions of Scalability Corpus size:

Applying rules/patterns is expensive Need efficient ways to select/filter relevant documents

Document accessibility: Deep web: documents only accessible via a search interface Dynamic sources: documents disappear from top page

Source heterogeneity: Coding/learning patterns for each source is expensive Requires many rules (expensive to apply)

Domain diversity: Extracting information for any domain, entities, relationships Some recent progress (e.g., see slide 17) Not the focus of this talk

Page 22: Web scale Information Extraction

Scaling Up Information Extraction

Scan-based extraction Classification/filtering to avoid processing documents Sharing common tags/annotations

General keyword index-based techniques QXtract, KnowItAll

Specialized indexes BE/KnowItNow, Linguist’s Search Engine

Parallelization/distributed processing IBM WebFountain, UIMA, Google’s Map/Reduce

Page 23: Web scale Information Extraction

Efficient Scanning for Information Extraction

80/20 rule: use few simple rules to capture majority of the instances [Pantel et al. 2004]

Train a classifier to discard irrelevant documents without processing [Grishman et al. 2002] (e.g., the Sports section of NYT is unlikely to describe disease outbreaks)

Share base annotations (entity tags) for multiple extraction tasks

Output Tuples

…Extraction

System

Text Database

4. Extract output tuples

3. Process documents

1. Retrieve docs from database

Classifier

2. Filter documents filtered

Page 24: Web scale Information Extraction

Exploiting Keyword and Phrase Indexes Generate queries to retrieve only relevant documents

Data mining problem!

Some methods in literature: Traversing Query Graphs [Agichtein et al. 2003] Iteratively refine queries [Agichtein and Gravano 2003] Iteratively partition document space [Etzioni et al., 2004]

Case studies: QXtract, KnowItAll

Page 25: Web scale Information Extraction

Simple Strategy: Iterative Set ExpansionOutput Tuples

…Extraction

System

Text Database

3. Extract tuplesfrom docs

2. Process retrieved documents

1. Query database with seed tuples

Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q

Time for retrieving a document

Time for answering a query

Time for processing a document

Query

Generation

4. Augment seed tuples with new tuples

(e.g., [Ebola AND Zaire])(e.g., <Malaria, Ethiopia>)

Page 26: Web scale Information Extraction

Some IE tools Available

MALLET (UMass) statistical natural language processing, document classification, clustering, information extraction

other machine learning applications to text. Sample Application:

GeneTaggerCRF: a gene-entity tagger based on MALLET (MAchine Learning for LanguagE Toolkit). It uses conditional random fields to find genes in a text file.

Page 27: Web scale Information Extraction

http://minorthird.sourceforge.net/ “a collection of Java classes for storing text,

annotating text, and learning to extract entities and categorize text”

Stored documents can be annotated in independent files using TextLabels (denoting, say, part-of-speech and semantic information)

MinorThird

Page 28: Web scale Information Extraction

GATE

http://gate.ac.uk/ie/annie.html

leading toolkit for Text Mining distributed with an Information Extraction component set

called ANNIE (demo) Used in many research projects

Long list can be found on its website Under integration of IBM UIMA

Page 29: Web scale Information Extraction

Sunita Sarawagi's CRF package

http://crf.sourceforge.net/ A Java implementation of conditional random fields for

sequential labeling.

Page 30: Web scale Information Extraction

UIMA (IBM)

Unstructured Information Management Architecture.

A platform for unstructured information management solutions from combinations of semantic analysis (IE) and search components.

Page 31: Web scale Information Extraction

Some Interesting Website based on IE

ZoomInfo CiteSeer.org (some of us using it everyday!)

Google Local, Google Scholar and many more…

Page 32: Web scale Information Extraction

UIMA (IBM Research)

Unstructured Information Management Architecture (UIMA) http://www.research.ibm.com/UIMA/

Open component software architecture for development, composition, and deployment of text processing and analysis components.

Run-time framework allows to plug in components and applications and run them on different platforms. Supports distributed processing, failure recovery, …

Scales to millions of documents – incorporated into IBM OmniFind, grid computing-ready

The UIMA SDK (freely available) includes a run-time framework, APIs, and tools for composing and deploying UIMA components.

Framework source code also available on Sourceforge: http://uima-framework.sourceforge.net/

Page 33: Web scale Information Extraction

UIMA – Quick Overview Architecture, Software Framework and Tooling

Page 34: Web scale Information Extraction

UnstructuredInformation

UnstructuredInformation

• High-Value• Most Current• Fastest Growing • ...BUT ...• Buried in Huge Volumes (Noise)• Implicit Semantics• Inefficient Search

• Explicit Semantics• Efficient Search• Focused Content...BUT...• Slow Growing• Narrow Coverage• Less Current/Relevant

Text, Chat, Email, Audio,

Video

Text, Chat, Email, Audio,

Video

IndicesIndices

DBsDBs

KBsKBs

Discover Relevant Semantics → Build into StructureDocs, Emails, Phone Calls, ReportsTopics, Entities, Relationships People, Places, Org, Times, EventsCustomer Opinions, Products, ProblemsThreats, Chemicals, Drugs, Drug Interactions....

Discover Relevant Semantics → Build into StructureDocs, Emails, Phone Calls, ReportsTopics, Entities, Relationships People, Places, Org, Times, EventsCustomer Opinions, Products, ProblemsThreats, Chemicals, Drugs, Drug Interactions....

StructuredInformation

Analytics Bridge the Unstructured & Structured Worlds

Text and Multi-ModalAnalytics

UIMA

Page 35: Web scale Information Extraction

Language, Speaker Identifiers Tokenizers Classifiers Part of Speech Detectors Document Structure Detectors Parsers, Translators Named-Entity Detectors Face Recognizers Relationship Detectors

Modality Human Language Domain of Interest Source: Style and Format Input/Output Semantics Privacy/Security Precision/Recall Tradeoffs Performance/Precision

Tradeoffs...

Capability SpecializationsCapability SpecializationsAnalysis CapabilitiesAnalysis Capabilities

Analytics: The kinds of things they do• Independently developed• From an increasing # of sources

• Independently developed• From an increasing # of sources

• Different technologies & interfaces• Highly specialized & fine grained

• Different technologies & interfaces• Highly specialized & fine grained

The right analysis for the job will likely be a best-of-breed combination integrating capabilities across many dimensions.

The right analysis for the job will likely be a best-of-breed combination integrating capabilities across many dimensions.

Page 36: Web scale Information Extraction

UIMA’s Basic Building Blocks are Annotators. They iterate over an artifact to discover new types based on existing ones and update the Common Analysis Structure (CAS) for upstream processing.

FredFred isis thetheCenterCenter CEOCEO ofof

OrganizationOrganizationPersonPerson

CeoOfCeoOf

Arg2:OrgArg2:OrgArg1:PersonArg1:Person

PPPPVPVPNPNPParserParser

Named EntityNamed Entity

RelationshipRelationship

CenterCenter MicrosMicros

Common Analysis Structure (CAS)

Artifact (e.g., Document)

Analysis Results (i.e., Artifact Metadata)

UIMA CASRepresentation now

Alignedwith XMI standard

Page 37: Web scale Information Extraction

• Analyzed by a collection of text analytics

• Detected Semantic Entities and Relations Highlighted

• Represented in UIMA Common Analysis Structure (CAS)

• Analyzed by a collection of text analytics

• Detected Semantic Entities and Relations Highlighted

• Represented in UIMA Common Analysis Structure (CAS)

Page 38: Web scale Information Extraction

UIMA: Unstructured Information Management Architecture

Open Software Architecture and Emerging Standard Platform independent standard for interoperable text and multi-modal analytics Under Development: UIMA Standards Technical Committee Initiated under

OASIS http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=uima

Software Framework Implementation SDK Available on IBM Alphaworks

http://www.alphaworks.ibm.com/tech/uima Tools, Utilities, Runtime, Extensive Documentation

Creation, Integration, Discovery, Deployment of analytics Java, C++, Perl, Python (others possible) Supports co-located and service-oriented deployments (eg., SOAP) x-Language High-Performances APIs to common data structure (CAS)

Embeddable on Systems Middleware (e.g., ActiveMQ, WebSphere, DB2) Apache UIMA open-source project

http://incubator.apache.org/uima/

Page 39: Web scale Information Extraction

VideoObject Detector

VideoObject Detector

TranscriptionEngine

TranscriptionEngine

StreamingSpeech Segmenter

StreamingSpeech Segmenter

Text IR EngineIndex

Text IR EngineIndex

End-User

Application

Interfaces

End-User

Application

Interfaces

Query

Services

Query

Services

Query Interface(s)Query Interface(s)

Relevant KnowledgeRelevant Knowledge

Entity & RelationDetector(s)

Entity & RelationDetector(s)

IndexEntities & Relationsin RDB or OWL KB

IndexEntities & Relationsin RDB or OWL KB

Relational

Database

Relational

Database

Text, Chat, Email,

Speech, Video

Text, Chat, Email,

Speech, Video

Deep ParserDeep Parser

Arabic-EnglishTranslator

(Web Service)

Arabic-EnglishTranslator

(Web Service)

OWL

Knowledge-

Base

OWL

Knowledge-

Base

Index Tokens& Annotationsin IR Engine

Index Tokens& Annotationsin IR Engine

UIMA: Pluggable Framework, User-defined Workflows

CAS: Common UIMA Data Representation & Interchange

Aligned with OMG & W3C standards (i.e., XMI, SOAP, RDF)

Any UIMA-Compliant Analysis Engine(s)

Any UIMA-Compliant Analysis Engine(s)

File System Reader

File System Reader

Web CrawlerWeb Crawler

Any UIMA-CompliantReaders, SegmentersAny UIMA-CompliantReaders, Segmenters

Any UIMA-CompliantCAS Consumer(s)

Any UIMA-CompliantCAS Consumer(s)

Video SearchIndex

Video SearchIndex

Analyze ContentAssign

Task-RelevantSemantics

Analyze ContentAssign

Task-RelevantSemantics

Index orProcessResults

Index orProcessResults

Connect, Read& Segment

Sources

Connect, Read& Segment

SourcesCAS

CASCAS

CAS

Page 40: Web scale Information Extraction

Collection Processing Engine (CPE)Collection Processing Engine (CPE)

CAS ConsumerCAS Consumer

Aggregate Analysis EngineAggregate Analysis Engine

UIMA Component Architecture

CAS ConsumerCAS Consumer

CAS ConsumerCAS Consumer

OntologiesOntologies

IndicesIndices

DBsDBs

KnowledgeBases

KnowledgeBases

Collection

Reader

Collection

Reader

Text, Chat, Email, Audio, Video

Text, Chat, Email, Audio, Video

Analysis EngineAnalysis Engine

AnnotatorAnnotator

Analysis EngineAnalysis Engine

AnnotatorAnnotator

CASCAS

CASCAS

CASCAS

FlowController

FlowController

FlowController

FlowController

Key

FrameworkConstructionFramework

Construction

Developer CodesDeveloper Codes