csa2050: natural language processing

49
December 2005 CSA3180: Information Extr action I 1 CSA2050: Natural Language Processing Information Extraction • Information Extraction • Named Entities • IE Systems • MUC • Finite State Machines • Pattern Recognition

Upload: gretchen-herrmann

Post on 01-Jan-2016

43 views

Category:

Documents


7 download

DESCRIPTION

CSA2050: Natural Language Processing. Information Extraction Information Extraction Named Entities IE Systems MUC Finite State Machines Pattern Recognition. Classification at different granularities. Text Categorization: Classify an entire document Information Extraction (IE): - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 1

CSA2050: Natural Language Processing

Information Extraction• Information Extraction• Named Entities• IE Systems• MUC• Finite State Machines• Pattern Recognition

Page 2: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 2

Classification at different granularities

• Text Categorization: – Classify an entire document

• Information Extraction (IE):– Identify and classify small units within

documents

• Named Entity Extraction (NE):– A subset of IE– Identify and classify proper names

• People, locations, organizations

Page 3: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 3

Martin Baker, a person

Genomics job

Employers job posting form

Page 4: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 4

Aggregator Websites

Page 5: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 5

foodscience.com-Job2

JobTitle: Ice Cream Guru

Employer: foodscience.com

JobCategory: Travel/Hospitality

JobFunction: Food Services

JobLocation: Upper Midwest

Contact Phone: 800-488-2611

DateExtracted: January 8, 2001

Source: www.foodscience.com/jobs_midwest.html

OtherCompanyJobs: foodscience.com-Job1

Page 6: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 6

Aggregator Websites

• Read in many web pages from different sites

• Extract information into a database

• Screen Scraping

• Can then return data matching particular queries

• Data mining can extract meaningful insight that might not have been obvious

Page 7: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 7

Page 8: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 8

Data Mining

Page 9: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 9

IE from Research Papers

Page 10: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 10

IE from Commercial Websites

Page 11: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 11

What is Information Extraction?Filling slots in a database from sub-segments of text.As a task:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION

Page 12: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 12

What is Information Extraction?Filling slots in a database from sub-segments of text.As a task:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATIONBill Gates CEO MicrosoftBill Veghte VP MicrosoftRichard Stallman founder Free Soft..

IE

Page 13: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 13

What is Information Extraction?Information Extraction = segmentation + classification + association

As a familyof techniques:

Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation

aka “named entity extraction”

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Page 14: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 14

What is Information Extraction?Information Extraction = segmentation + classification + association

A familyof techniques:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation

Page 15: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 15

What is Information Extraction?Information Extraction = segmentation + classification + association

A familyof techniques:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation

Page 16: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 16

IE in Context

Create ontology

SegmentClassifyAssociateCluster

Load DB

Spider

Query,Search

Data mine

IE

Documentcollection

Database

Filter by relevance

Label training data

Train extraction models

Page 17: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 17

IE in Context: Formatting

Grammatical sentencesand some formatting & links

Text paragraphswithout formatting

Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

Page 18: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 18

IE in Context: Formatting

Non-grammatical snippets,rich formatting & links Tables

Page 19: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 19

IE in Context: CoverageWeb site specific

Amazon.com Book Pages

Formatting

Genre specific

Resumes

Layout

Page 20: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 20

IE in Context: CoverageWide, non-specific

University Names

Language

Page 21: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 21

IE in Context: ComplexityClosed set

He was born in Alabama…

The big Wyoming sky…

U.S. states

Regular set

Phone: (413) 545-1323

The CALD main office can be reached at 412-268-1299

U.S. phone numbers

Complex pattern

University of ArkansasP.O. Box 140Hope, AR 71802

U.S. postal addresses

Headquarters:1128 Main Street, 4th FloorCincinnati, Ohio 45210

…was among the six houses sold by Hope Feldman that year.

Ambiguous patterns,needing context andmany sources of evidence

Person names

Pawel Opalinski, SoftwareEngineer at WhizBang Labs.

Page 22: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 22

IE in Context: Single Field/Record

Single entity

Person: Jack Welch

Binary relationship

Relation: Person-TitlePerson: Jack WelchTitle: CEO

N-ary record

“Named entity” extraction

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Relation: Company-LocationCompany: General ElectricLocation: Connecticut

Relation: SuccessionCompany: General ElectricTitle: CEOOut: Jack WelshIn: Jeffrey Immelt

Person: Jeffrey Immelt

Location: Connecticut

Page 23: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 23

State of the Art

• Named entity recognition from newswire text– Person, Location, Organization, …– F1 in high 80’s or low- to mid-90’s

• Binary relation extraction– Contained-in (Location1, Location2)

Member-of (Person1, Organization1)– F1 in 60’s or 70’s or 80’s

• Web site structure recognition– Extremely accurate performance obtainable– Human effort (~10min?) required on each site

Page 24: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 24

IE Generations

• Hand-Built Systems – Knowledge Engineering [1980s– ]– Rules written by hand– Require experts who understand both the systems and the

domain– Iterative guess-test-tweak-repeat cycle

• Automatic, Trainable Rule-Extraction Systems [1990s– ]– Rules discovered automatically using predefined templates,

using automated rule learners– Require huge, labeled corpora (effort is just moved!)

• Statistical Models [1997 – ]– Use machine learning to learn which features indicate

boundaries and types of entities.– Learning usually supervised; may be partially unsupervised

Page 25: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 25

IE Techniques

Lexicons

AlabamaAlaska…WisconsinWyoming

Abraham Lincoln was born in Kentucky.

member?

Classify Pre-segmentedCandidates

Abraham Lincoln was born in Kentucky.

Classifier

which class?

Sliding Window

Abraham Lincoln was born in Kentucky.

Classifier

which class?

Try alternatewindow sizes:

Boundary Models

Abraham Lincoln was born in Kentucky.

Classifier

which class?

BEGIN END BEGIN END

BEGIN

Context Free Grammars

Abraham Lincoln was born in Kentucky.

NNP V P NPVNNP

NP

PP

VP

VP

S

Mos

t lik

ely

pars

e?

Finite State Machines

Abraham Lincoln was born in Kentucky.

Most likely state sequence?

Page 26: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 26

Trainable IE SystemsPros

• Annotating text is simpler & faster than writing rules.

• Domain independent

• Domain experts don’t need to be linguists or programers.

• Learning algorithms ensure full coverage of examples.

Cons• Hand-crafted systems

perform better, especially at hard tasks. (but this is changing)

• Training data might be expensive to acquire

• May need huge amount of training data

• Hand-writing rules isn’t that hard!!

Page 27: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 27

MUC: Genesis of IE

• DARPA funded significant efforts in IE in the early to mid 1990’s.

• Message Understanding Conference (MUC) was an annual event/competition where results were presented.

• Focused on extracting information from news articles:– Terrorist events– Industrial joint ventures– Company management changes

• Information extraction of particular interest to the intelligence community (CIA, NSA). (Note: early ’90’s)

Page 28: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 28

MUC

• Named entity• Person, Organization, Location

• Co-reference• Clinton President Bill Clinton

• Template element• Perpetrator, Target

• Template relation• Incident

• Multilingual

Page 29: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 29

MUC Typical Text

Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production of 20,000 iron and “metal wood” clubs a month

Page 30: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 30

MUC Typical Text

Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production of 20,000 iron and “metal wood” clubs a month

Page 31: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 31

MUC Templates

• Relationship• tie-up

• Entities:• Bridgestone Sports Co, a local concern, a Japanese trading

house

• Joint venture company• Bridgestone Sports Taiwan Co

• Activity• ACTIVITY 1

• Amount• NT$2,000,000

Page 32: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 32

MUC Templates

• ATIVITY 1– Activity

• Production

– Company• Bridgestone Sports Taiwan Co

– Product• Iron and “metal wood” clubs

– Start Date• January 1990

Page 33: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 33

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house toproduce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20million new Taiwan dollars, will start production in January 1990with production of 20,000 iron and “metal wood” clubs a month.

TIE-UP-1Relationship: TIE-UPEntities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house”Joint Venture Company: “Bridgestone Sports Taiwan Co.”Activity: ACTIVITY-1Amount: NT$200000000

Example from Fastus (1993)

Page 34: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 34

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house toproduce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20million new Taiwan dollars, will start production in January 1990with production of 20,000 iron and “metal wood” clubs a month.

TIE-UP-1Relationship: TIE-UPEntities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house”Joint Venture Company: “Bridgestone Sports Taiwan Co.”Activity: ACTIVITY-1Amount: NT$200000000

ACTIVITY-1Activity: PRODUCTIONCompany: “Bridgestone Sports Taiwan Co.”Product: “iron and ‘metal wood’ clubs”Start Date: DURING: January 1990

Page 35: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 35

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house toproduce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20million new Taiwan dollars, will start production in January 1990with production of 20,000 iron and “metal wood” clubs a month.

TIE-UP-1Relationship: TIE-UPEntities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house”Joint Venture Company: “Bridgestone Sports Taiwan Co.”Activity: ACTIVITY-1Amount: NT$200000000

ACTIVITY-1Activity: PRODUCTIONCompany: “Bridgestone Sports Taiwan Co.”Product: “iron and ‘metal wood’ clubs”Start Date: DURING: January 1990

Page 36: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 36

1.Complex Words: Recognition of multi-words and proper names

2.Basic Phrases:Simple noun groups, verb groups and particles

3.Complex phrases:Complex noun groups and verb groups

4.Domain Events:Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:Templates from different parts of the texts are merged if they provide information about the same entity or event.

set upnew Taiwan dollars

a Japanese trading househad set up

production of 20, 000 iron and metal wood clubs

[company][set up][Joint-Venture]with[company]

Page 37: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 37

Evaluating IE Accuracy

• Always evaluate performance on independent, manually-annotated test data not used during system development.

• Measure for each test document:– Total number of correct extractions in the solution template: N– Total number of slot/value pairs extracted by the system: E– Number of extracted slot/value pairs that are correct (i.e. in the

solution template): C

• Compute average value of metrics adapted from IR:– Recall = C/N– Precision = C/E– F-Measure = Harmonic mean of recall and precision

Page 38: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 38

Named Entities

• Named Entities• Person Name: Colin Powell, Frodo• Location Name: Middle East, Aiur• Organization: UN, DARPA

• Domain Specific vs. Open Domain

Page 39: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 39

Nymble (BBN Corporation)

• State of the art system

• Near-human performance ~90% accuracy

• Statistical system

• Approach: Hidden Markov Model (HMM)

Page 40: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 40

Nymble (BBN Corporation)

• Noisy channel paradigm• Originally, entities were marked in the raw text• Post noisy channel, annotation is lost

• Probability of most likely sequence of name classes (NC) given a sequence of words (W)

Pr(NC|W) = Pr(W,NC) / Pr(W)

since the a priori probability of the word sequence can be considered constant for any given sentence maximize just numerator

Page 41: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 41

Nymble (BBN Corporation)

Person

Organization

Not-A-Name

Start of Sentence

Five other classes

End of Sentence

Page 42: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 42

Automatic Content Extraction

• DARPA ACE Program

• Identify Entities– Named: Bilbo, San Diego, UNICEF– Nominal: the president, the hobbit– Pronominal: she

• Reference resolution– Clinton the president he

Page 43: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 43

Question Answering

The over-used pipeline paradigm:

Question

AnalysisQuestionInformation

Retrieval

Answer

Extraction

Answer

MergingAnswer

Page 44: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 44

Question Answering

• Feedback loops can be present for constraint relaxation purposes

• Not all QA systems adhere to the pipeline architecture

• Question answering flavors– Factoid vs. complex

– Who invented paper? vs. Which of Mr. Bush’s friends are Black Sabbath fans?

– Closed vs. open domain

Page 45: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 45

Answer Extraction

The over-used pipeline paradigm:

• Focus on open domain, factoid question answering

Question

AnalysisQuestionInformation

Retrieval

Answer

Extraction

Answer

MergingAnswer

Page 46: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 46

Practical Issues

• Web Spell Checking– Mispling– nucular

• Infrequent forms: – Niagra vs. Niagara, Filenes vs. Filene’s

• Google QA

• Genome, video, games

Page 47: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 47

Practical Issues

• Traditional Information Extraction• Either expert built or statistical

• Specific strategies for specific question types

• Person Bio vs. Location question types

• Ability to generalize to new questions and new question types

Page 48: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 48

Practical Issues

• Who invented Blah?

Blah was invented by PersonName

Blah was Verb by PersonNamewhere Verb is synonym to invented

Blah VerbPhrase by PersonName

Page 49: CSA2050: Natural Language Processing

December 2005 CSA3180: Information Extraction I 49

Popular Resources

• Experts and/or Learning Algorithms• Gazeteers• NE taggers• Part Of Speech taggers• Parsers• Wordnet• Stopword list• Stemmer