(C) 2003, The University of Michigan 1

Information Retrieval

Handout #5

March 3, 2003

Course Information

• Instructor: Dragomir R. Radev

• Office: 3080, West Hall Connector

• Phone: (734) 615-5225

• Office hours: M&F 11-12

• Course page: http://tangra.si.umich.edu/~radev/650/

• Class meets on Mondays, 1-4 PM in 409 West Hall

The Weka package

Weka

• A general environment for machine learning (e.g. for classification and clustering)

• Book by Witten and Frank

• www.cs.waikato.ac.nz/ml/weka

K-means (continued)

Demos

• http://www.cs.mcgill.ca/~bonnef/project.html

• http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster/

• http://www-2.cs.cmu.edu/~dellaert/software/

• java weka.clusterers.SimpleKMeans -t data/weather.arff

EM algorithm

EM algorithms

[Dempster et al. 77]

• Needed: probabilistic model Θ

• Given estimate Θ0

• Useful in the absence of certain data

• Class of iterative algorithms for maximum likelihood estimation in problems with incomplete data. Given a model of data generation and data with some missing values, EM alternately uses the current model to estimate the missing values, and then uses the missing value estimates to improve the model. Using all the available data, EM will locally maximize the likelihood of the generative parameters giving estimates for the missing values.

[McCallum & Nigam 98]

E-M algorithms

• Initialize probability model

• Repeat
  – E-step: use the best available current classifier to classify some datapoints
  – M-step: modify the classifier based on the classes produced by the E-step.

• Until convergence
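
The loop above (initialize, E-step, M-step, repeat until convergence) can be sketched for one common special case: a mixture of two one-dimensional Gaussians. This is an illustrative assumption, not the classifier-based variant described by McCallum & Nigam and not Weka's implementation; the function names, the initialization at the data extremes, and the log-likelihood stopping rule are all choices made for this sketch.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(data, max_iter=100, tol=1e-6):
    """Fit a two-component 1-D Gaussian mixture with EM (illustrative only)."""
    # Initialize the probability model (assumption: start the means at the extremes).
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    prev_ll = -math.inf

    for _ in range(max_iter):
        # E-step: use the current model to estimate each point's hidden class.
        resp = []
        ll = 0.0
        for x in data:
            scores = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = max(sum(scores), 1e-300)  # guard against underflow
            resp.append([s / total for s in scores])
            ll += math.log(total)

        # M-step: re-estimate the model from the soft class assignments.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # keep the variance strictly positive
            weights[k] = nk / len(data)

        # Until convergence: stop when the log-likelihood stops changing.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll

    return means, variances, weights
```

Calling `em_two_gaussians` on data drawn from two well-separated clusters should return means close to the two cluster centers. EM only finds a local maximum, so a different initialization can give a different answer.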

Demos

• java weka.clusterers.EM -t data/iris.arff

• http://www.neurosci.aist.go.jp/~akaho/MixtureEM.html

• http://www.cs.uic.edu/~liub/S-EM/S-EM-download.html

Question Answering

Question answering

Q: When did Nelson Mandela become president of South Africa?

A: 10 May 1994

Q: How tall is the Matterhorn?

A: The institute revised the Matterhorn 's height to 14,776 feet 9 inches

Q: How tall is the replica of the Matterhorn at Disneyland?

A: In fact he has climbed the 147-foot Matterhorn at Disneyland every week end for the last 3 1/2 years

Q: If Iraq attacks a neighboring country, what should the US do?

A: ??

Q: Why did David Koresh ask the FBI for a word processor?
Q: Name the designer of the shoe that spawned millions of plastic imitations, known as "jellies".
Q: What is the brightest star visible from Earth?
Q: What are the Valdez Principles?
Q: Name a film that has won the Golden Bear in the Berlin Film Festival?
Q: Name a country that is developing a magnetic levitation railway system?
Q: Name the first private citizen to fly in space.
Q: What did Shostakovich write for Rostropovich?
Q: What is the term for the sum of all genetic material in a given organism?
Q: What is considered the costliest disaster the insurance industry has ever faced?
Q: What is Head Start?
Q: What was Agent Orange used for during the Vietnam War?
Q: What did John Hinckley do to impress Jodie Foster?
Q: What was the first Gilbert and Sullivan opera?
Q: What did Richard Feynman say upon hearing he would receive the Nobel Prize in Physics?
Q: How did Socrates die?
Q: Why are electric cars less efficient in the north-east than in California?

The TREC evaluation

• Document retrieval

• Eight years

• Information retrieval?

• Corpus: texts and questions

Answer selection

[Figure: system architecture diagram. Labels in the diagram: documents, query, Textract/Resporator, Indexer, Index, Query Processing, Search, GuruQA, Hit List, AnSel/Werlect, Ranked Hit List.]

Prager et al. 2000 (SIGIR); Radev et al. 2000 (ANLP/NAACL)

QA-Token     Question type        Example
PLACE$       Where                In the Rocky Mountains
COUNTRY$     Where/What country   United Kingdom
STATE$       Where/What state     Massachusetts
PERSON$      Who                  Albert Einstein
ROLE$        Who                  Doctor
NAME$        Who/What/Which       The Shakespeare Festival
ORG$         Who/What             The US Post Office
DURATION$    How long             For 5 centuries
AGE$         How old              30 years old
YEAR$        When/What year       1999
TIME$        When                 In the afternoon
DATE$        When/What date       July 4th, 1776
VOLUME$      How big              3 gallons
AREA$        How big              4 square inches
LENGTH$      How big/long/high    3 miles
WEIGHT$      How big/heavy        25 tons
NUMBER$      How many             1,234.5
METHOD$      How                  By rubbing
RATE$        How much             50 per cent
MONEY$       How much             4 million dollars

<p><NUMBER>1</NUMBER></p>
<p><QUERY>Who is the author of the book, "The Iron Lady: A Biography of Margaret Thatcher"?</QUERY></p>
<p><PROCESSED_QUERY>@excwin(*dynamic* @weight(200*Iron_Lady) @weight(200 Biography_of_Margaret_Thatcher) @weight(200 Margaret) @weight(100 author) @weight(100 book) @weight(100 iron) @weight(100 lady) @weight(100 :) @weight(100 biography) @weight(100 thatcher) @weight(400 @syn(PERSON$ NAME$)) )</PROCESSED_QUERY></p>
<p><DOC>LA090290-0118</DOC></p>
<p><SCORE>1020.8114</SCORE></p>
<TEXT><p>THE IRON LADY; A <span class="NAME">Biography of Margaret Thatcher</span> by <span class="PERSON">Hugo Young</span> (<span class="ORG">Farrar , Straus & Giroux</span>) The central riddle revealed here is why, as a woman <span class="PLACEDEF">in a man</span>'s world, <span class="PERSON">Margaret Thatcher</span> evinces such an exclusionary attitude toward women.</p></TEXT>

SYN-set                             N    Score   Score/N
PERSON NAME                         30   16.5    55.0%
PLACE COUNTRY STATE NAME PLACEDEF   21   7.08    33.7%
NAME                                18   3.67    20.4%
DATE YEAR                           18   5.31    29.5%
PERSON ORG NAME ROLE                19   4.62    24.3%
undefined                           19   11.45   60.3%
NUMBER                              18   8.00    44.4%
PLACE NAME PLACEDEF                 14   10.00   71.4%
PERSON ORG PLACE NAME PLACEDEF      10   3.03    30.3%
MONEY RATE                          6    1.50    25%
ORG NAME                            4    1.25    31.2%
SIZE1                               4    2.50    62.5%
SIZE1 DURATION                      3    0.83    27.7%
STATE                               3    2.00    66.7%
COUNTRY                             3    1.33    44.3%
YEAR                                2    1.00    50.0%
RATE                                2    1.50    75.0%
TIME DURATION                       1    0.00    0.0%
SIZE1 SIZE2                         1    0.00    0.0%
DURATION TIME                       1    0.33    33.3%
DATE                                1    0       0.00%

Span                                  Span type  Number  Rspanno  Count  Notinq  Type  Avgdst  Sscore    TOTAL
Ollie Matson                          PERSON     3       3        6      2       1     12      0.02507   -7.53
Lou Vasquez                           PERSON     1       1        6      2       1     16      0.02507   -9.93
Tim O'Donohue                         PERSON     17      1        4      2       1     8       0.02257   -12.57
Athletic Director Dave Cowen          PERSON     23      6        4      4       1     11      0.02257   -15.87
Johnny Ceballos                       PERSON     22      5        4      1       1     9       0.02257   -19.07
Civic Center Director Martin Durham   PERSON     13      1        2      5       1     16      0.02505   -19.36
Johnny Hodges                         PERSON     25      2        4      1       1     15      0.02256   -25.22
Derric Evans                          PERSON     33      4        4      2       1     14      0.02256   -25.37
NEWSWIRE Johnny Majors                PERSON     30      1        4      2       1     17      0.02256   -25.47
Woodbridge High School                ORG        18      2        4      1       2     6       0.02257   -28.37
Evan                                  PERSON     37      6        4      1       1     14      0.02256   -29.57
Gary Edwards                          PERSON     38      7        4      2       1     17      0.02256   -30.87
O.J. Simpson                          NAME       2       2        6      2       3     12      0.02507   -37.40
South Lake Tahoe                      NAME       7       5        6      3       3     14      0.02507   -40.06
Washington High                       NAME       10      6        6      1       3     18      0.02507   -49.80
Morgan                                NAME       26      3        4      1       3     12      0.02256   -52.52
Tennesseefootball                     NAME       31      2        4      1       3     15      0.02256   -56.27
Ellington                             NAME       24      1        4      1       3     20      0.02256   -59.42
assistant                             ROLE       21      4        4      1       4     8       0.02257   -62.77
the Volunteers                        ROLE       34      5        4      2       4     14      0.02256   -71.17
Johnny Mathis                         PERSON     4       4        6      -100    1     11      0.02507   -211.33
Mathis                                NAME       14      2        2      -100    3     10      0.02505   -254.16
coach                                 ROLE       19      3        4      -100    4     4       0.02257   -259.67

Features (1)

• Number: position of the span among all spans returned. Example: “Lou Vasquez” was the first span returned by GuruQA on the sample question.

• Rspanno: position of the span among all spans returned within the current passage.

• Count: number of spans of any span class retrieved within the current passage.

• Notinq: the number of words in the span that do not appear in the query. Example: Notinq (“Woodbridge high school”) = 1, because both “high” and “school” appear in the query while “Woodbridge” does not. It is set to –100 when the actual value is 0.
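
The Notinq definition can be sketched as a small function. The tokenization (lowercased word characters and apostrophes) and the example queries are assumptions made for illustration; the actual sample question and the tokenizer used by the system are not given in this handout.

```python
import re


def tokenize(text):
    """Lowercase word tokens (assumption: a simple regex tokenizer)."""
    return re.findall(r"[\w']+", text.lower())


def notinq(span, query):
    """Number of span words that do not appear in the query.

    Following the slide, the value is set to -100 when the count is 0.
    """
    query_words = set(tokenize(query))
    missing = sum(1 for word in tokenize(span) if word not in query_words)
    return missing if missing > 0 else -100


# Hypothetical query: it only needs to contain "high" and "school".
print(notinq("Woodbridge high school", "who coached the high school team"))  # 1
print(notinq("coach", "who is the coach"))  # -100
```

The large negative value acts as a penalty: a span made up entirely of query words is unlikely to be the answer.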

Features (2)

• Type: the position of the span type in the list of potential span types. Example: Type (“Lou Vasquez”) = 1, because the span type of “Lou Vasquez”, namely “PERSON” appears first in the SYN-set, “PERSON ORG NAME ROLE”.

• Avgdst: the average distance in words between the beginning of the span and the words in the query that also appear in the passage. Example: given the passage “Tim O'Donohue, Woodbridge High School's varsity baseball coach, resigned Monday and will be replaced by assistant Johnny Ceballos, Athletic Director Dave Cowen said.” and the span “Tim O’Donohue”, the value of avgdst is equal to 8.

• Sscore: passage relevance as computed by GuruQA.

Combining evidence

• TOTAL (span) = – 0.3 * number – 0.5 * rspanno + 3.0 * count + 2.0 * notinq – 15.0 * types – 1.0 * avgdst + 1.5 * sscore
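
As a worked check of the linear combination, the sketch below applies the weights from the formula to the feature values listed for “Tim O'Donohue” in the span table (Number 17, Rspanno 1, Count 4, Notinq 2, Type 1, Avgdst 8, Sscore 0.02257). The dictionary layout and function name are illustrative choices, not part of the original system.

```python
# Weights copied from the TOTAL formula on the slide.
WEIGHTS = {
    "number": -0.3,
    "rspanno": -0.5,
    "count": 3.0,
    "notinq": 2.0,
    "types": -15.0,
    "avgdst": -1.0,
    "sscore": 1.5,
}


def total_score(features):
    """Weighted sum of the seven span features."""
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


# Feature values for the span "Tim O'Donohue", taken from the span table.
tim_odonohue = {
    "number": 17,
    "rspanno": 1,
    "count": 4,
    "notinq": 2,
    "types": 1,
    "avgdst": 8,
    "sscore": 0.02257,
}

print(round(total_score(tim_odonohue), 2))  # -12.57
```

The result agrees with the TOTAL value of -12.57 listed for that span. The Type weight dominates: moving from the first to the third span type in the SYN-set costs 30 points.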

Extracted text

DocumentID      Score   Extract
LA053189-0069   892.5   of O.J. Simpson , Ollie Matson and Johnny Mathis
LA053189-0069   890.1   Lou Vasquez , track coach of O.J. Simpson , Ollie
LA060889-0181   887.4   Tim O'Donohue , Woodbridge High School 's varsity
LA060889-0181   884.1   nny Ceballos , Athletic Director Dave Cowen said.
LA060889-0181   880.9   aced by assistant Johnny Ceballos , Athletic Direc

Results

50 bytes

           First   Second   Third   Fourth   Fifth   TOTAL
# cases    49      15       11      9        4       88
Points     49.00   7.50     3.67    2.25     0.80    63.22

250 bytes

           First   Second   Third   Fourth   Fifth   TOTAL
# cases    71      16       11      6        5       109
Points     71.00   8.00     3.67    1.50     1.00    85.17
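
The Points rows in the results are consistent with reciprocal-rank scoring, in which an answer found at rank r earns 1/r points. The sketch below recomputes both totals from the "# cases" rows under that assumption; the function name is illustrative.

```python
def reciprocal_rank_points(cases_by_rank):
    """Total points when each answer found at rank r contributes 1/r.

    cases_by_rank[0] is the number of questions answered at rank 1,
    cases_by_rank[1] at rank 2, and so on.
    """
    return sum(n / rank for rank, n in enumerate(cases_by_rank, start=1))


first_table = [49, 15, 11, 9, 4]
second_table = [71, 16, 11, 6, 5]

print(round(reciprocal_rank_points(first_table), 2))   # 63.22
print(round(reciprocal_rank_points(second_table), 2))  # 85.17
```

Dividing a total by the number of questions in the evaluation would give the mean reciprocal rank; the number of questions is not stated on this slide.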

Information Extraction

Types of Information Extraction

• Template filling

• Language reuse

• Biographical information

• Question answering

MUC-4 Example

INCIDENT: DATE 30 OCT 89
INCIDENT: LOCATION EL SALVADOR
INCIDENT: TYPE ATTACK
INCIDENT: STAGE OF EXECUTION ACCOMPLISHED
INCIDENT: INSTRUMENT ID
INCIDENT: INSTRUMENT TYPE
PERP: INCIDENT CATEGORY TERRORIST ACT
PERP: INDIVIDUAL ID "TERRORIST"
PERP: ORGANIZATION ID "THE FMLN"
PERP: ORG. CONFIDENCE REPORTED: "THE FMLN"
PHYS TGT: ID
PHYS TGT: TYPE
PHYS TGT: NUMBER
PHYS TGT: FOREIGN NATION
PHYS TGT: EFFECT OF INCIDENT
PHYS TGT: TOTAL NUMBER
HUM TGT: NAME
HUM TGT: DESCRIPTION "1 CIVILIAN"
HUM TGT: TYPE CIVILIAN: "1 CIVILIAN"
HUM TGT: NUMBER 1: "1 CIVILIAN"
HUM TGT: FOREIGN NATION
HUM TGT: EFFECT OF INCIDENT DEATH: "1 CIVILIAN"
HUM TGT: TOTAL NUMBER

On October 30, 1989, one civilian was killed in a reported FMLN attack in El Salvador.

Language reuse

Yugoslav President Slobodan Milosevic

NP = [description] [entity]

[description] Yugoslav President (phrase to be reused)

[entity] Slobodan Milosevic

Example

Andrija Hebrang , The Croatian Defense Minister

NP = NP Punc NP

[entity] Andrija Hebrang

[description] The Croatian Defense Minister

Issues involved

• Text generation depends on lexical resources

• Lexical choice

• Corpus processing vs. manual compilation

• Deliberate decisions by writers

• Difficult to encode by hand

• Dynamically updated (Scott O’Grady)

• No full semantic representation

Named entities

Richard Butler met Tareq Aziz Monday after rejecting Iraqi attempts to set deadlines for finishing his work.

Yitzhak Mordechai will meet Mahmoud Abbas at 7 p.m. (1600 GMT) in Tel Aviv after a 16-month-long impasse in peacemaking.

Sinn Fein deferred a vote on Northern Ireland's peace deal Sunday.

Hundreds of troops patrolled Dili on Friday during the anniversary of Indonesia's 1976 annexation of the territory.

Entities + Descriptions

Chief U.N. arms inspector Richard Butler met Iraq’s Deputy Prime Minister Tareq Aziz Monday after rejecting Iraqi attempts to set deadlines for finishing his work.

Israel's Defense Minister Yitzhak Mordechai will meet senior Palestinian negotiator Mahmoud Abbas at 7 p.m. (1600 GMT) in Tel Aviv after a 16-month-long impasse in peacemaking.

Sinn Fein, the political wing of the Irish Republican Army, deferred a vote on Northern Ireland's peace deal Sunday.

Hundreds of troops patrolled Dili, the Timorese capital, on Friday during the anniversary of Indonesia's 1976 annexation of the territory.

Building a database of descriptions

• Size of database: 59,333 entities and 193,228 descriptions as of 08/01/98

• Text processed: 494 MB (ClariNet, Reuters, UPI)

• Length: 1-15 lexical items

• Accuracy: (precision 94%, recall 55%)

Multiple descriptions per entity

Profile for Ung Huot

Ung Huot
A senior member
Cambodia’s
Cambodian foreign minister
Co-premier
First prime minister
Foreign minister
His excellency
Mr.
New co-premier
New first prime minister
Newly-appointed first prime minister
Premier

Language reuse and regeneration

CONCEPTS + CONSTRAINTS = CONSTRUCTS

Corpus analysis: determining constraints

Text generation: applying constraints

Language reuse and regeneration

• Understanding: full parsing is expensive

• Generation: expensive to use full parses

• Bypassing certain stages (e.g., syntax)

• Not(!) template-based: still required extraction, analysis, context identification, modification, and generation

• Factual sentences, sentence fragments

• Reusability of a phrase

Context-dependent solution

Redefining the relation:

DescriptionOf (E,C) = { D_{i,C} : D_{i,C} is a description of E in context C }

If named entity E appears in text and the context is C:

Insert DescriptionOf (E,C) in text.

Multiple descriptions per entity

Profile for Bill Clinton

Bill Clinton
U.S. President
President
An Arkansas native
Democratic presidential candidate

Choosing the right description

Bill Clinton                        CONTEXT
U.S. President                      foreign relations
President                           national affairs
An Arkansas native                  false bomb alert in AR
Democratic presidential candidate   elections

Pragmatic and semantic constraints on lexical choice.
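
A minimal sketch of the context-dependent lookup, using the Bill Clinton pairs from this slide. It simplifies DescriptionOf (E,C) to a single description per context, stored in a hand-written dictionary; the actual system learns the choice from a corpus, and the function and variable names here are hypothetical.

```python
# Toy profile: entity -> {context label -> description}, copied from the slide.
PROFILE = {
    "Bill Clinton": {
        "foreign relations": "U.S. President",
        "national affairs": "President",
        "false bomb alert in AR": "An Arkansas native",
        "elections": "Democratic presidential candidate",
    }
}


def description_of(entity, context):
    """Return the description stored for this entity in this context, or None."""
    return PROFILE.get(entity, {}).get(context)


def insert_description(text, entity, context):
    """Prefix the first mention of the entity with its context-appropriate description."""
    description = description_of(entity, context)
    if description is None or entity not in text:
        return text
    return text.replace(entity, f"{description} {entity}", 1)


print(insert_description("Bill Clinton met with foreign leaders.",
                         "Bill Clinton", "foreign relations"))
# U.S. President Bill Clinton met with foreign leaders.
```

When no description is known for the context, the text is returned unchanged rather than falling back to an arbitrary description.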

Semantic information from WordNet

• All words contribute to the semantic representation

• Only the first sense is used

• What is a synset?

WordNet synset hierarchy

{07063762} director, manager, managing director

{07063507} administrator, decision maker

{07311393} head, chief, top dog

{06950891} leader

{00004123} person, individual, someone, somebody, human

{00002086} life form, organism, being, living thing

{00001740} entity, something
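
The chain above can be walked programmatically to collect the parent synsets of a word. The sketch hard-codes only the seven synsets shown on this slide and assumes each line is the parent of the line before it; it does not query WordNet itself, and the offsets belong to the WordNet version used in the lecture.

```python
# offset -> (words, parent offset), assuming each slide line is the parent
# of the line above it.
SYNSETS = {
    "07063762": ("director, manager, managing director", "07063507"),
    "07063507": ("administrator, decision maker", "07311393"),
    "07311393": ("head, chief, top dog", "06950891"),
    "06950891": ("leader", "00004123"),
    "00004123": ("person, individual, someone, somebody, human", "00002086"),
    "00002086": ("life form, organism, being, living thing", "00001740"),
    "00001740": ("entity, something", None),
}


def parent_offsets(offset):
    """Offsets of all ancestors of a synset, nearest parent first."""
    chain = []
    parent = SYNSETS[offset][1]
    while parent is not None:
        chain.append(parent)
        parent = SYNSETS[parent][1]
    return chain


print(parent_offsets("07311393"))
# ['06950891', '00004123', '00002086', '00001740']
```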

Lexico-semantic matrix

Profile for Ung Huot

Word synsets (columns): {07147929} premier, {07009772} Kampuchean, …
Parent synsets (columns): {07412658} minister, {07087841} associate

Description                            Word synsets   …   Parent synsets
A senior member                                       …   X
Cambodia's                             X              …
Cambodian foreign minister             X              …   X
Co-premier                             X              …   X
First prime minister                   X              …   X
Foreign minister                                      …   X
His excellency                                        …
Mr.                                                   …
New co-premier                         X              …   X
New first prime minister               X              …   X
Newly-appointed first prime minister   X              …   X
Premier                                X              …   X
Prime minister                         X              …   X

Choosing the right description

• Topic approximation by context: words that appear near the entity in the text (bag)

• Name of the entity (set)

• Length of article (continuous)

• Profile: set of all descriptions for that entity (bag)
  - parent synset offsets for all words wi.

• Semantic information: WordNet synset offsets (bag)

Choosing the right description

(Context, Entity, Description, Length, Profile, Parent) → Classes

Ripper feature vector [Cohen 1996]

Example (training)

T# 1
  Context: Election, promised, said, carry, party …
  Entity: Kim Dae-Jung
  Description: Veteran opposition leader
  Len: 949
  Profile: Candidate, chief, policymaker, Korean ...
  Parent: person, leader, Asian, important person ...
  Classes: {07136302} {07486519} {07311393} {06950891} {07486079}

T# 2
  Context: Introduced, responsible, running, should, bringing …
  Entity: Kim Dae-Jung
  Description: South Korea's opposition candidate
  Len: 629
  Profile: Candidate, chief, policymaker, Korean ...
  Parent: person, leader, Asian, important person ...
  Classes: {07136302} {07486519} {07311393} {06950891} {07486079}

T# 3
  Context: Attend, during, party, time, traditionally …
  Entity: Kim Dae-Jung
  Description: A front-runner
  Len: 535
  Profile: Candidate, chief, policymaker, Korean ...
  Parent: person, leader, Asian, important person ...
  Classes: {07136302} {07486519} {07311393} {06950891} {07486079}

T# 4
  Context: Discuss, making, party, statement, said …
  Entity: Kim Dae-Jung
  Description: A front-runner
  Len: 1114
  Profile: Candidate, chief, policymaker, Korean ...
  Parent: person, leader, Asian, important person ...
  Classes: {07136302} {07486519} {07311393} {06950891} {07486079}

T# 5
  Context: New, party, politics, in, it …
  Entity: Kim Dae-Jung
  Description: South Korea's president-elect
  Len: 449
  Profile: Candidate, chief, policymaker, Korean ...
  Parent: person, leader, Asian, important person ...
  Classes: {07136302} {07486519} {07311393} {06950891} {07486079}

Sample rules

Total number of rules: 4085 for 100,000 inputs

{07136302} IF PROFILE ~ P{07136302} LENGTH <= 603 LENGTH >= 361 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ presidential LENGTH <= 412 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ nominee CONTEXT ~ during .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ case .
{07136302} IF PROFILE ~ P{07136302} LENGTH <= 603 LENGTH >= 390 LENGTH <= 412 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ nominee CONTEXT ~ and .
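
To make the rule format concrete, the sketch below evaluates one of the sample rules against a feature vector. It assumes that `~` tests membership in a set-valued (bag) feature and that all conditions in a rule must hold together; the tuple encoding, the function name, and the example input are hypothetical and are not Ripper's own input format.

```python
import operator

# Comparison operators used in the sample rules.
# Assumption: "~" means "the set-valued feature contains this value".
OPS = {
    "~": lambda feature_value, target: target in feature_value,
    "<=": operator.le,
    ">=": operator.ge,
}


def rule_matches(conditions, example):
    """True if every (feature, op, value) condition holds for the example."""
    return all(OPS[op](example[feature], value) for feature, op, value in conditions)


# First sample rule: {07136302} IF PROFILE ~ P{07136302} LENGTH <= 603 LENGTH >= 361 .
rule = [
    ("PROFILE", "~", "P{07136302}"),
    ("LENGTH", "<=", 603),
    ("LENGTH", ">=", 361),
]

# Hypothetical input whose profile contains the synset and whose article length is 449.
example = {
    "PROFILE": {"P{07136302}", "P{07311393}"},
    "CONTEXT": {"new", "party", "politics"},
    "LENGTH": 449,
}

print(rule_matches(rule, example))  # True
```

When a rule matches, its class (here the synset {07136302}) is added to the constraints predicted for the description.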

Evaluation

• 35,206 tuples; 11,504 distinct entities; 3.06 distinct descriptions per entity (DDPE)

• Training: 90% of corpus (10,353 entities)

• Test: 10% of corpus (1,151 entities)

Evaluation

• Rule format (each matching rule adds constraints):

X [A] (evidence of A)

Y [B] (evidence of B)

X Y [A] [B] (evidence of A and B)

• Classes are in 2^W (powerset of WN nodes)

• P&R on the constraints selected by system

Definition of precision and recall

Model         System        P        R
[A] [B] [C]   [B] [D]       50.0 %   33.3 %
[A] [B] [C]   [A] [B] [D]   66.7 %   66.7 %
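
The percentages on this slide follow from treating the model and system outputs as sets of constraints: precision is the share of system constraints that are in the model, and recall is the share of model constraints that the system found. The sketch below assumes the model is [A] [B] [C] in both examples; the function name is illustrative.

```python
def precision_recall(model, system):
    """Set-based precision and recall of system constraints against the model."""
    correct = len(model & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(model) if model else 0.0
    return precision, recall


model = {"A", "B", "C"}

p, r = precision_recall(model, {"B", "D"})
print(f"{p:.1%} {r:.1%}")  # 50.0% 33.3%

p, r = precision_recall(model, {"A", "B", "D"})
print(f"{p:.1%} {r:.1%}")  # 66.7% 66.7%
```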

Precision and recall

               Word nodes only         Word and parent nodes
Training set   Precision   Recall      Precision   Recall
500            64.29%      2.86%       78.57%      2.86%
1000           71.43%      2.86%       85.71%      2.86%
2000           42.86%      40.71%      67.86%      62.14%
5000           59.33%      48.40%      64.67%      53.73%
10000          69.72%      45.04%      74.44%      59.32%
15000          76.24%      44.02%      73.39%      53.17%
20000          76.25%      49.91%      79.08%      58.70%
25000          83.37%      52.26%      82.39%      57.49%
30000          80.14%      50.55%      82.77%      57.66%
50000          83.13%      58.53%      88.87%      63.39%
100000         85.42%      62.81%      89.70%      64.64%
150000         87.07%      63.17%
200000         85.73%      62.86%
250000         87.15%      63.85%