BBN Technologies Statistical Models of Text: From Bags of Words to Structure Ralph Weischedel 17 April 2000



Extraction Vision

Multi-dimensional Meta-data Extraction

[Figure: timeline with month axis J F M A M J J A]

EMPLOYEE / EMPLOYER Relationships:
• Jan Clesius works for Clesius Enterprises
• Bill Young works for InterMedia Inc.

COMPANY / LOCATION Relationships:
• Clesius Enterprises is in New York, NY
• InterMedia Inc. is in Boston, MA

Meta-Data

[Figure labels: India Bombing; NY Times, Andhra Bhoomi, Dinamani, Dainik Jagran]

Topic Discovery

Concept Indexing

Thread Creation

Term Translation

Document Translation

Story Segmentation

Entity Extraction

Fact Extraction

Outline

Statistical models that support feature extraction

Bags of words
• Topic extraction

Sequences (HMMs)
• Name extraction and classification

Lexicalized probabilistic context-free grammars
• Parses
• Facts/relationships

TBD
• Propositions

Topic Extraction via Bag of Words

“President Clinton dumped his embattled Mexican bailout today. Instead, he announced another plan that doesn’t need congressional approval.”

Topics
• Clinton, Bill
• Mexico
• Money
• Economic assistance, American

[Diagram labels: Speech, Speech Recognition, Text, Classifier, Topics; Training Program (training sentences, answers), Models]

Generative Model of Story and Topics

First, choose a set of topics, T0 ... TM

For each word in the story:
• Choose a topic according to P ( Tj | Set )
• Choose a word according to output distribution P ( Wn | Tj )
• Loop

[Diagram: story start, P( Set ), P( Tj | Set ), topic states T0 (General Language), T1, T2, ..., TM, each emitting words with P ( Wn | Tj ), Loop, story end]
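The generative story above can be turned into a score for a candidate topic set. The following is a minimal sketch, not the system described on the slide: it sums over the hidden topic of each word, so that P(Wn | Set) = sum over j of P(Tj | Set) · P(Wn | Tj), and it leaves out the prior P(Set). The function name, the `floor` value for unseen words, and all probabilities are hypothetical.

```python
import math

def story_log_likelihood(words, topic_given_set, word_given_topic, floor=1e-9):
    """Log-probability of a story's words given a chosen topic set.

    Each word is assumed to be generated by picking a topic Tj with
    P(Tj | Set) and then a word with P(Wn | Tj). Summing over the hidden
    topic gives the per-word probability. `floor` is an assumed stand-in
    for smoothing of unseen words.
    """
    total = 0.0
    for w in words:
        p_word = sum(
            p_topic * word_given_topic[topic].get(w, floor)
            for topic, p_topic in topic_given_set.items()
        )
        total += math.log(p_word)
    return total

# Hypothetical toy distributions: T0 is the general-language topic.
topic_given_set = {"T0_general": 0.5, "Mexico": 0.5}
word_given_topic = {
    "T0_general": {"the": 0.5, "plan": 0.25, "peso": 0.25},
    "Mexico": {"the": 0.25, "plan": 0.25, "peso": 0.5},
}

ll = story_log_likelihood(["the", "peso"], topic_given_set, word_given_topic)
```

Comparing this score (plus a log prior for each set) across candidate topic sets is one way to pick topics for a story under these assumptions.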

Topic Classification on Broadcast News

Trained on 1 year of stories from July ’95 to June ’96 (42,502 stories)

Tested on 989 stories from July ’96

Allowed 4,627 topics that occur at least twice

OOT (out-of-topic) rate was 2.45%

Results:

• 75.8% of the first choice topics are among the annotated labels

• 63.6% for a simple likelihood-based method

• 45% for the traditional tf-idf measure used in IR

On cursory examination of errors, often the recognized topic was correct and the annotator failed to include it.
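The headline measure above (how often the first-choice topic is among the annotated labels) can be computed as in the sketch below. The function name and the toy stories are hypothetical and are not data from the evaluation.

```python
def first_choice_accuracy(first_choices, annotated_labels):
    """Fraction of stories whose top-ranked topic appears in the annotated label set."""
    if not first_choices or len(first_choices) != len(annotated_labels):
        raise ValueError("need one non-empty, aligned prediction per story")
    hits = sum(
        1 for top, labels in zip(first_choices, annotated_labels) if top in labels
    )
    return hits / len(first_choices)

# Hypothetical toy stories: the third first choice is not among its labels.
predicted = ["Mexico", "Money", "Bosnia", "Weather"]
annotated = [
    {"Mexico", "Clinton, Bill"},
    {"Money"},
    {"Pakistan"},
    {"Weather", "Floods"},
]
accuracy = first_choice_accuracy(predicted, annotated)
```

As the slide notes, a measure of this kind undercounts when annotators omit a correct topic, since a correct first choice that is missing from the label set is scored as an error.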

Name Extraction via HMMs

[Diagram labels: Speech, Speech Recognition, Text, Extractor, Entities; Training Program (training sentences, answers), NE Models]

Entity types: Locations, Persons, Organizations

The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic.

Prior to 1997 - no learning approach competitive with hand-built rule systems

Since 1997 - Statistical approaches (BBN, NYU, MITRE) achieve state-of-the-art performance

A Hidden Markov Model

[Diagram: states START OF SENTENCE, PERSON, ORGANIZATION, (if MUC, then five other entity types), NOT-A-NAME, END OF SENTENCE]

Structure of Model
• One language model for each category plus one for other (not-a-name)
• The number of categories is learned from training
• Bi-gram transition probabilities

Entering a name class NC and generating its first word W1:

Pr(NC | NC-1, W-1) • Pr(W1 | NC, NC-1)

Generating each later word in the class (W2, ..., +end+):

Pr(W | W-1, NC)
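The probabilities above can be combined to score one tagged sentence. The sketch below is an illustration under stated assumptions, not the IdentiFinder implementation: a class change is inferred whenever adjacent words carry different labels, the lookup tables are plain dictionaries, `floor` is an assumed stand-in for smoothing, and every number is hypothetical.

```python
import math

START_CLASS, END_CLASS = "START-OF-SENTENCE", "END-OF-SENTENCE"
START_WORD, END_WORD = "+start+", "+end+"

def sequence_log_prob(tagged, trans, first_word, next_word, floor=1e-6):
    """Log-probability of a sentence given as (word, name_class) pairs.

    trans[(NC, NC-1, W-1)]      ~ Pr(NC | NC-1, W-1)
    first_word[(W1, NC, NC-1)]  ~ Pr(W1 | NC, NC-1)
    next_word[(W, W-1, NC)]     ~ Pr(W | W-1, NC), including W = +end+
    """
    logp = 0.0
    prev_word, prev_class = START_WORD, START_CLASS
    for word, nc in tagged:
        if nc != prev_class:
            if prev_class != START_CLASS:
                # Close the previous class region by generating +end+.
                logp += math.log(next_word.get((END_WORD, prev_word, prev_class), floor))
            logp += math.log(trans.get((nc, prev_class, prev_word), floor))
            logp += math.log(first_word.get((word, nc, prev_class), floor))
        else:
            logp += math.log(next_word.get((word, prev_word, nc), floor))
        prev_word, prev_class = word, nc
    # Close the last region and move to the end-of-sentence state.
    logp += math.log(next_word.get((END_WORD, prev_word, prev_class), floor))
    logp += math.log(trans.get((END_CLASS, prev_class, prev_word), floor))
    return logp

# Hypothetical toy tables in which every factor is 0.5.
trans = {
    ("NOT-A-NAME", START_CLASS, START_WORD): 0.5,
    ("PERSON", "NOT-A-NAME", "Mr."): 0.5,
    (END_CLASS, "PERSON", "Rose"): 0.5,
}
first_word = {
    ("Mr.", "NOT-A-NAME", START_CLASS): 0.5,
    ("Rose", "PERSON", "NOT-A-NAME"): 0.5,
}
next_word = {
    (END_WORD, "Mr.", "NOT-A-NAME"): 0.5,
    (END_WORD, "Rose", "PERSON"): 0.5,
}

lp = sequence_log_prob(
    [("Mr.", "NOT-A-NAME"), ("Rose", "PERSON")], trans, first_word, next_word
)
```

A decoder would search over all labelings of the words (for example with the Viterbi algorithm) and keep the one with the highest score. This sketch only scores a labeling that is already given.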

Effect of Speech Recognition Error

[Figure: F-measure (y-axis, 70 to 92) versus word error rate (x-axis, 0 to 30; Hub4 Eval 98)]

BBN and NIST found IdentiFinder performance degrades 0.7 points of F per 1% WER
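As a worked example of the reported slope, the sketch below applies a straight-line estimate of 0.7 F points lost per 1% WER. Treating the relationship as exactly linear, and the clean-text F value of 90.0, are assumptions made only for illustration.

```python
F_POINTS_PER_WER_POINT = 0.7  # slope reported on the slide

def estimated_f(f_on_clean_text, wer_percent):
    """Linear estimate of F-measure at a given word error rate (in percent)."""
    return f_on_clean_text - F_POINTS_PER_WER_POINT * wer_percent

# Hypothetical baseline: F = 90.0 on error-free text, recognizer at 20% WER.
estimate = estimated_f(90.0, 20.0)  # roughly 76 F under the linear assumption
```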

Parsing via Lexicalized Probabilistic CFGs

[Diagram labels: Speech, Speech Recognition, Text, Parser, Trees; Training Program (training sentences, answers), Models]

[Figure: parse tree with node labels S, NP, VP, PP, SBAR, WH for the sentence below]

Nawaz Sharif, who led Pakistan, was ousted October 12 by Pervez Musharraf, Pakistani Army General.

Prior to 1990 - accuracy for non-statistical parsers around 65%

Since 1995 - Statistical parsers (IBM, UPenn, Brown and BBN) achieve 85-90% accuracy

Example of Generating a Parse Tree

[Figure: step-by-step generation of the parse tree for "Nawaz Sharif, who led Pakistan, was ousted October 12 by Pervez Musharraf, Pakistani Army General." Node labels include S, NP, VP, PP, SBAR, WH; the words "was" and "ousted" appear attached to S and VP nodes.]

Extracting Facts via LPCFG

“Nance, who is also a paid consultant to ABC News, said ...”

PositionHolder
• Person: Nance
• Post: a paid consultant
• Org: ABC News

[Diagram labels: Speech, Speech Recognition, Text, Extractor, Relationships/Events; Training Program (training sentences, answers), Models]

1998 - First state-of-the-art trainable system (70% accuracy)

Type of Annotation Required

Nance , who is also a paid consultant to ABC News , said ...

[Figure: annotations over the sentence with the labels person, person-descriptor, organization, Coreference, and Employee relation]

Training data consists ONLY of
• Named entities (as in NE)
• Descriptor phrases (for TE)
• Descriptor references (for TE)
• Relation/events to be extracted (for TR)

The Sentential Model

• Search criterion: find M such that p(M | W) is maximized

$$p(M \mid W) = \frac{p(M, W)}{p(W)} = \frac{\sum_{T} p(M, T, W)}{p(W)} \approx \frac{\max_{T} p(M, T, W)}{p(W)}$$

• Since p(W) is constant, search for:

$$\max_{M,T} \; p(M, T, W)$$

• Model the probability as the product of the probabilities of generating each element in the tree:

$$p(M, T, W) = \prod_{e \in \text{tree}} p(e \mid h)$$
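The search criterion can be illustrated with a small sketch that scores each candidate analysis as a sum of log element probabilities and keeps the best one. It assumes that W is the word sequence, T a syntactic tree and M the meaning annotation, and that the candidates are already enumerated. A real parser would search the space rather than list it. All names and numbers are hypothetical.

```python
import math

def tree_log_prob(element_probs):
    """log p(M, T, W): sum of log p(e | h) over the elements of one tree."""
    return sum(math.log(p) for p in element_probs)

def best_analysis(candidates):
    """Return the candidate (M, T) label with the highest joint probability.

    `candidates` maps a label to the list of p(e | h) values of its tree.
    """
    return max(candidates, key=lambda label: tree_log_prob(candidates[label]))

# Hypothetical element probabilities for two competing analyses.
candidates = {
    "analysis_A": [0.5, 0.5, 0.25],   # product 0.0625
    "analysis_B": [0.5, 0.25, 0.25],  # product 0.03125
}
best = best_analysis(candidates)
```

Working in log space avoids numerical underflow when a tree has many elements, and it leaves the argmax unchanged.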

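Augmented Semantic Tree

Nance , who is also a paid consultant to ABC News , said ...

Node labels are written as semantic label/syntax label (for example, per/np).

Word-level tags: Nance (per/nnp) , who (wp) is (vbz) also (rb) a (det) paid (vbn) consultant (per-desc/nn) to (to) ABC (org-c/nnp) News (org/nnp) , said (vbd)

[Figure: tree over these tags with node labels per/np, per-r/np, per-desc/np, per-desc-r/np, per-desc-ptr/vp, per-desc-ptr/sbar, per-desc-of/sbar-lnk, org-r/np, org-ptr/pp, emp-of/pp-lnk, whnp, advp, vp, s]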

Propositions via TBD

Within the past two months, a bomb exploded in the offices of the El Espectador in Bogota, destroying a major part of its installations and equipment.

[Diagram labels: Speech, Speech Recognition, Text, Extractor, Propositions; Training Program (training sentences, answers), Models]

Predicate    Arg      Value
exploded     L-OBJ    a bomb
destroying   L-SUBJ   a bomb
destroying   L-OBJ    a major part of its installations and equipment
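The predicate/argument/value rows above map naturally onto a small record type. The sketch below stores the three rows from the slide and groups them by predicate. The `Proposition` type and `group_by_predicate` helper are hypothetical names introduced only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple

class Proposition(NamedTuple):
    predicate: str
    arg: str      # argument role, e.g. L-SUBJ (logical subject) or L-OBJ
    value: str

# The three rows shown on the slide.
props = [
    Proposition("exploded", "L-OBJ", "a bomb"),
    Proposition("destroying", "L-SUBJ", "a bomb"),
    Proposition("destroying", "L-OBJ",
                "a major part of its installations and equipment"),
]

def group_by_predicate(propositions):
    """Collect the arguments of each predicate into one dictionary per predicate."""
    frames = defaultdict(dict)
    for p in propositions:
        frames[p.predicate][p.arg] = p.value
    return dict(frames)

frames = group_by_predicate(props)
```

Keying on the predicate string alone merges repeated uses of the same verb, so a fuller representation would need a per-occurrence identifier.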

Towards a Proposition Bank

[Figure: parse tree (node labels S, NP, VP, PP, SBAR, WH) for "Nawaz Sharif, who led Pakistan, was ousted October 12 by Pervez Musharraf, Pakistani Army General", with verb sense markings -1 on "ousted" and -3 on "led"]

Event: ousted-1
Logical Object:
Logical Subject:
Time:
Location: --

Event: led-3
Logical Object:
Logical Subject:
Time: --
Location: --

Add Predicate/Argument Markings

Add Co-reference

Add Verb Sense Markings

Statistical Speech/Language Modeling

[Diagram labels: Trainer (Language Input, Answers), Model, Decoder (Language Input, Answers)]

Technology                Input         Answers
Speech recognition        audio         transcription
OCR                       image         characters
Speech understanding      audio         response
Topic classification      document      topics
Topic detection           text/speech   clusters
Topic tracking            text/speech   relevant stories
Story segmentation        speech        stories
Information retrieval     query         text/speech
Named entity extraction   text/speech   names & types

Advantages
• Mathematically rigorous approach
• State-of-the-art performance
• Highly robust in the face of degraded input
• Language independent, requiring only annotated training data
• Affordable annotation
  • Only domain knowledge is needed
  • Can be performed by students/interns
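The trainer/decoder split can be sketched with a deliberately simple stand-in model. The example below uses a naive Bayes bag-of-words classifier with add-one smoothing, which is not the model used in any of the systems described in the slides. The function names and the two training examples are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(pairs):
    """Trainer: (language input, answer) pairs in, a model out."""
    word_counts = defaultdict(Counter)
    answer_counts = Counter()
    vocab = set()
    for text, answer in pairs:
        words = text.lower().split()
        word_counts[answer].update(words)
        answer_counts[answer] += 1
        vocab.update(words)
    return {"word_counts": word_counts, "answer_counts": answer_counts, "vocab": vocab}

def decode(model, text):
    """Decoder: language input plus model in, most probable answer out."""
    words = text.lower().split()
    total = sum(model["answer_counts"].values())
    vocab_size = len(model["vocab"])

    def score(answer):
        counts = model["word_counts"][answer]
        n_words = sum(counts.values())
        s = math.log(model["answer_counts"][answer] / total)  # prior
        for w in words:
            # Add-one smoothing so unseen words do not zero out the score.
            s += math.log((counts[w] + 1) / (n_words + vocab_size))
        return s

    return max(model["answer_counts"], key=score)

# Hypothetical annotated training data.
model = train([
    ("peso bailout plan for mexico", "Mexico"),
    ("serb leader talks near sarajevo", "Bosnia"),
])
answer = decode(model, "new mexico bailout")
```

Only `train` sees the answers. Swapping in a different model changes the two function bodies but not the shape of the pipeline: annotated pairs produce a model, and the model plus new input produces answers.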