Page 1: Recent advances in multi-document summarization Dragomir Radev University of Michigan, Ann Arbor radev@umich.edu Presentation at UC Berkeley SIMS, November

Recent advances in multi-document summarization

Dragomir Radev

University of Michigan, Ann Arbor

radev@umich.edu

Presentation at UC Berkeley SIMS, November 10, 2004


WWW as a textual database

• Large: 10^10 pages, 200 TB [Lyman&Varian 03]; cf. brain (10^11 neurons)

• Multilingual: English 56.4% of sites, German 7.7%, French 5.6%, Japanese 4.9%, Chinese 2.4%

• Evolving: 22% of sites change every day, another 31% change every month [Cho&Garcia-Molina 00]

• Uneven importance: at different levels

• Adequate representations are needed for user-friendly access


Outline

• Introduction
• Random walks and social networks
• LexRank
• Projects in language modeling and machine learning



Natural Language Processing (NLP)

Typical NLP problems

• Entity extraction
• Relation extraction
• Text classification
• Summarization
• Information retrieval
• Machine translation
• Question answering
• Text understanding
• Parsing
• Word sense disambiguation
• Lexical acquisition
• Paraphrasing

• NLP is very hard!
– The pen is in the box.
– Every American has a mother.
– Boston called.
– I saw Zoe. The poor girl looked tired.
– Mary and Sue bought each other a book.
– The spirit is willing but the flesh is weak.
– Children make delicious snacks.
– Army head seeks arms.
– Czech President and playwright Havel to receive honors


Recent trends in NLP

• Multidisciplinary
• Statistical
• Well founded
• Scalable

[Diagram: NLP linked to neighboring fields and application areas: Sociology, Linguistics, Lin. Algebra, Graph theory, Bioinformatics, Stat. Mechanics, E-commerce, Info. Retrieval, Intelligence, User interfaces, Translation]


Finding structure

• Language doesn’t have a regular structure (like a database)

• Sentences are very unlike each other

• Linguistic analysis: parse trees

• Hard to generalize

• Finding structure
– Across sentences
– Across sites/sources/documents
– Over time

• Representations
– Graphs everywhere!


NewsInEssence

• MEAD: salience-based extractive summarization

• Centroid-based summarization (single and multi document)

• Vector space model

• Additional features: position, length, LexRank

• (1000+ downloads)

• Cross-document structure theory (CST)

• NIE: first robust news summarization system (2001)


Outline

• Introduction
• Random walks and social networks
• LexRank
• Projects in language modeling and machine learning


Social networks

• Induced by a relation
• Symmetric or not
• Examples:
– Friendship networks
– Board membership
– Citations
– Power grid of the US
– WWW


Krebs 2004


Graph-based representations

[Figure: graph G (V,E) with nodes 1 to 8, shown together with its square 8 × 8 connectivity (incidence) matrix P; rows 1 to 6 of P contain 2, 1, 2, 1, 4, and 2 ones, rows 7 and 8 are empty]


Markov chains

• A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel P.

• Path = sequence (x0, x1, …, xn).

• The probability of a path can be computed as a product of probabilities for each step i.
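As a small illustration of the last bullet, the sketch below multiplies the initial probability of the first state by the kernel entries along the path. The three-state chain and the name `path_probability` are made up for illustration and do not come from the talk.

```python
def path_probability(x, P, path):
    """Probability of visiting the states in `path` in this exact order.

    x    : initial distribution over states (list of floats)
    P    : row-stochastic Markov kernel, P[i][j] = Pr(next = j | current = i)
    path : sequence of state indices (x0, x1, ..., xn)
    """
    prob = x[path[0]]
    for i, j in zip(path, path[1:]):
        prob *= P[i][j]  # one factor per step
    return prob

# Hypothetical 3-state chain (illustrative numbers only).
x = [0.5, 0.5, 0.0]
P = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]
print(path_probability(x, P, [0, 1, 2]))  # 0.5 * 1.0 * 0.5 = 0.25
```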


Random walks

• Access time $H_{ij}$ = expected number of steps to go from i to j.

• Example [Lovász 1993]. What is $H_{ij}$ on a path with nodes 0, 1, …, n-1?

$$H(k-1, k) = 2k - 1$$
$$H(i, k) = H(i, k-1) + 2k - 1$$
$$H(i, k) = (2i+1) + (2i+3) + \dots + (2k-1) = k^2 - i^2$$
$$H(0, k) = k^2$$

(Brownian motion: travel distance sqrt(t) in time t)

• Electrical networks
– $R_{st}$ is the resistance between two nodes s and t. The round-trip travel time between s and t is exactly $2mR_{st}$, where m is the number of edges.
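The path example can be checked numerically. The sketch below computes the access time from the one-step times H(j, j+1) = 2j + 1 and compares it with a Monte Carlo estimate. The function names, the seed, and the number of trials are illustrative choices, not part of the talk.

```python
import random

def access_time(i, k):
    """Expected steps from node i to node k (i <= k) on the path 0, 1, ..., n-1.

    First-step analysis gives H(j, j+1) = 2j + 1 (node 0 always steps right,
    interior nodes step left or right with probability 1/2). The walk must
    pass through every intermediate node, so the one-step times add up.
    """
    return sum(2 * j + 1 for j in range(i, k))

def simulate_access_time(i, k, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        pos, steps = i, 0
        while pos != k:
            # Node 0 has a single neighbor; other nodes pick a side at random.
            pos += 1 if pos == 0 or rng.random() < 0.5 else -1
            steps += 1
        total += steps
    return total / trials

print(access_time(0, 5))           # 25 = 5**2
print(access_time(2, 5))           # 21 = 5**2 - 2**2
print(simulate_access_time(0, 5))  # should be close to 25
```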


Stationary solutions

• The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions:
– E is stochastic
– E is irreducible
– E is aperiodic

• To make these conditions true:
– All rows of E add up to 1 (and no value is negative)
– Make sure that E is strongly connected
– Make sure that E is not bipartite

• Example: PageRank [Brin and Page 1998]: use “teleportation”
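One way to picture "teleportation" is to mix the random-walk kernel with a uniform jump, which makes every entry positive and therefore the chain irreducible and aperiodic. The sketch below is a minimal illustration under assumptions: `add_teleportation`, the jump probability `d = 0.15`, the uniform treatment of nodes without out-links, and the three-node graph are all hypothetical choices.

```python
def add_teleportation(adjacency, d=0.15):
    """Return a row-stochastic kernel: (1 - d) * random walk + d * uniform jump."""
    n = len(adjacency)
    kernel = []
    for row in adjacency:
        out = sum(row)
        if out == 0:
            # Node without out-links: jump uniformly (an assumption of this sketch).
            walk = [1.0 / n] * n
        else:
            walk = [a / out for a in row]
        kernel.append([d / n + (1 - d) * w for w in walk])
    return kernel

# Hypothetical directed graph: 0 -> 1, 0 -> 2, 1 -> 2, node 2 has no out-links.
adjacency = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
kernel = add_teleportation(adjacency)
for row in kernel:
    print([round(v, 4) for v in row], "row sum:", round(sum(row), 4))
```

Every row still sums to 1 and no entry is zero, so the three conditions listed above hold for the mixed kernel.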


Example

[Figure: the eight-node example graph and a bar chart of PageRank values (y-axis from 0 to 1) for nodes 1 to 8 at t=10]

This graph E has a second graph E’ superimposed on it: E’ is the uniform transition graph.


Eigenvectors

• An eigenvector is an implicit “direction” for a matrix: $Ev = \lambda v$, where v is non-zero, though λ can be any complex number in principle.

• The largest eigenvalue of a stochastic matrix E is $\lambda_1 = 1$.

• For $\lambda_1$, the left (principal) eigenvector is p, the right eigenvector = 1 (the all-ones vector).

• In other words, $E^T p = p$.


Prestige and centrality

• Degree centrality: how many neighbors each node has.

• Closeness centrality: how close an actor is to all of the other nodes

• Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes

• Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects.

• Prestige = same as centrality but for directed graphs.


Computing the stationary distribution

Solution for the stationary distribution:

$$p = E^T p \quad\Longleftrightarrow\quad (I - E^T)\,p = 0$$

function PowerStatDist (E):
begin
    p(0) = u;
    i = 1;
    repeat
        p(i) = E^T p(i-1);
        L = ||p(i) - p(i-1)||_1;
        i = i + 1;
    until L < ε
end
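The PowerStatDist pseudocode can be sketched in Python roughly as follows. This is a minimal version under assumptions: the start vector is taken to be uniform, `eps` and `max_iter` are illustrative defaults, and the two-state kernel is a made-up example.

```python
def power_stat_dist(E, eps=1e-10, max_iter=10000):
    """Power iteration for the stationary distribution of a row-stochastic kernel E."""
    n = len(E)
    p = [1.0 / n] * n  # start vector, assumed uniform here
    for _ in range(max_iter):
        # p_next = E^T p, i.e. p_next[j] = sum_i E[i][j] * p[i]
        p_next = [sum(E[i][j] * p[i] for i in range(n)) for j in range(n)]
        # L1 distance between successive iterates
        if sum(abs(a - b) for a, b in zip(p_next, p)) < eps:
            return p_next
        p = p_next
    return p

# Hypothetical two-state chain; solving p = E^T p by hand gives p = (5/6, 1/6).
E = [
    [0.9, 0.1],
    [0.5, 0.5],
]
p = power_stat_dist(E)
print([round(v, 4) for v in p])  # [0.8333, 0.1667]
```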


Example

[Figure: the eight-node example graph and three bar charts of PageRank values (y-axis from 0 to 1) for nodes 1 to 8, at t=0, t=1, and t=10]


Outline

• Introduction
• Random walks and social networks
• LexRank


Centrality in summarization

• Motivation: capture the most central words in a document or cluster

• Centroid score [Radev & al. 2000, 2004a]

• Alternative methods for computing centrality?


Sample multidocument cluster

1 (d1s1) Iraqi Vice President Taha Yassin Ramadan announced today, Sunday, that Iraq refuses to back down from its decision to stop cooperating with disarmament inspectors before its demands are met.

2 (d2s1) Iraqi Vice president Taha Yassin Ramadan announced today, Thursday, that Iraq rejects cooperating with the United Nations except on the issue of lifting the blockade imposed upon it since the year 1990.

3 (d2s2) Ramadan told reporters in Baghdad that "Iraq cannot deal positively with whoever represents the Security Council unless there was a clear stance on the issue of lifting the blockade off of it.

4 (d2s3) Baghdad had decided late last October to completely cease cooperating with the inspectors of the United Nations Special Commission (UNSCOM), in charge of disarming Iraq's weapons, and whose work became very limited since the fifth of August, and announced it will not resume its cooperation with the Commission even if it were subjected to a military operation.

5 (d3s1) The Russian Foreign Minister, Igor Ivanov, warned today, Wednesday against using force against Iraq, which will destroy, according to him, seven years of difficult diplomatic work and will complicate the regional situation in the area.

6 (d3s2) Ivanov contended that carrying out air strikes against Iraq, who refuses to cooperate with the United Nations inspectors, ``will end the tremendous work achieved by the international group during the past seven years and will complicate the situation in the region.''

7 (d3s3) Nevertheless, Ivanov stressed that Baghdad must resume working with the Special Commission in charge of disarming the Iraqi weapons of mass destruction (UNSCOM).

8 (d4s1) The Special Representative of the United Nations Secretary-General in Baghdad, Prakash Shah, announced today, Wednesday, after meeting with the Iraqi Deputy Prime Minister Tariq Aziz, that Iraq refuses to back down from its decision to cut off cooperation with the disarmament inspectors.

9 (d5s1) British Prime Minister Tony Blair said today, Sunday, that the crisis between the international community and Iraq ``did not end'' and that Britain is still ``ready, prepared, and able to strike Iraq.''

10 (d5s2) In a gathering with the press held at the Prime Minister's office, Blair contended that the crisis with Iraq ``will not end until Iraq has absolutely and unconditionally respected its commitments'' towards the United Nations.

11 (d5s3) A spokesman for Tony Blair had indicated that the British Prime Minister gave permission to British Air Force Tornado planes stationed in Kuwait to join the aerial bombardment against Iraq.

(DUC cluster d1003t)


Cosine between sentences

• Let s1 and s2 be two sentences.

• Let x and y be their representations in an n-dimensional vector space

• The cosine between them is then computed based on the inner product of the two.

$$\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\|x\| \, \|y\|}$$

• The cosine ranges from 0 to 1 (term weights are non-negative).
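A minimal sketch of the cosine between two sentences is shown below. It assumes a plain bag-of-words representation with raw term counts and whitespace tokenization; the weighting actually used for the similarity matrix in the talk may differ, and the function name `cosine` is illustrative.

```python
import math
from collections import Counter

def cosine(s1, s2):
    """Cosine between two sentences represented as term-count vectors."""
    x = Counter(s1.lower().split())
    y = Counter(s2.lower().split())
    dot = sum(x[w] * y[w] for w in x)  # words missing from y contribute 0
    norm = (math.sqrt(sum(v * v for v in x.values()))
            * math.sqrt(sum(v * v for v in y.values())))
    return dot / norm if norm else 0.0

print(cosine("iraq refuses", "iraq agrees"))      # one shared word out of two
print(cosine("iraq refuses", "blair contended"))  # 0.0, no shared words
```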


LexRank (Cosine centrality)

1 2 3 4 5 6 7 8 9 10 11

1 1.00 0.45 0.02 0.17 0.03 0.22 0.03 0.28 0.06 0.06 0.00

2 0.45 1.00 0.16 0.27 0.03 0.19 0.03 0.21 0.03 0.15 0.00

3 0.02 0.16 1.00 0.03 0.00 0.01 0.03 0.04 0.00 0.01 0.00

4 0.17 0.27 0.03 1.00 0.01 0.16 0.28 0.17 0.00 0.09 0.01

5 0.03 0.03 0.00 0.01 1.00 0.29 0.05 0.15 0.20 0.04 0.18

6 0.22 0.19 0.01 0.16 0.29 1.00 0.05 0.29 0.04 0.20 0.03

7 0.03 0.03 0.03 0.28 0.05 0.05 1.00 0.06 0.00 0.00 0.01

8 0.28 0.21 0.04 0.17 0.15 0.29 0.06 1.00 0.25 0.20 0.17

9 0.06 0.03 0.00 0.00 0.20 0.04 0.00 0.25 1.00 0.26 0.38

10 0.06 0.15 0.01 0.09 0.04 0.20 0.00 0.20 0.26 1.00 0.12

11 0.00 0.00 0.00 0.01 0.18 0.03 0.01 0.17 0.38 0.12 1.00


[Figure: Cosine centrality (t=0.3): similarity graph over the sentences d1s1, d2s1, d2s2, d2s3, d3s1, d3s2, d3s3, d4s1, d5s1, d5s2, d5s3]


[Figure: Cosine centrality (t=0.2): similarity graph over the same eleven sentences]


[Figure: Cosine centrality (t=0.1): similarity graph over the same eleven sentences]

Sentences vote for the most central sentence!
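The voting idea can be sketched as degree centrality on a thresholded similarity graph: each sentence receives one vote from every other sentence whose cosine with it exceeds t. The matrix below is copied from the "LexRank (Cosine centrality)" table. Treating the threshold as strict (`> t`) and ignoring self-similarity are assumptions of this sketch, and `degree_centrality` is an illustrative name.

```python
IDS = ["d1s1", "d2s1", "d2s2", "d2s3", "d3s1", "d3s2",
       "d3s3", "d4s1", "d5s1", "d5s2", "d5s3"]
COS = [
    [1.00, 0.45, 0.02, 0.17, 0.03, 0.22, 0.03, 0.28, 0.06, 0.06, 0.00],
    [0.45, 1.00, 0.16, 0.27, 0.03, 0.19, 0.03, 0.21, 0.03, 0.15, 0.00],
    [0.02, 0.16, 1.00, 0.03, 0.00, 0.01, 0.03, 0.04, 0.00, 0.01, 0.00],
    [0.17, 0.27, 0.03, 1.00, 0.01, 0.16, 0.28, 0.17, 0.00, 0.09, 0.01],
    [0.03, 0.03, 0.00, 0.01, 1.00, 0.29, 0.05, 0.15, 0.20, 0.04, 0.18],
    [0.22, 0.19, 0.01, 0.16, 0.29, 1.00, 0.05, 0.29, 0.04, 0.20, 0.03],
    [0.03, 0.03, 0.03, 0.28, 0.05, 0.05, 1.00, 0.06, 0.00, 0.00, 0.01],
    [0.28, 0.21, 0.04, 0.17, 0.15, 0.29, 0.06, 1.00, 0.25, 0.20, 0.17],
    [0.06, 0.03, 0.00, 0.00, 0.20, 0.04, 0.00, 0.25, 1.00, 0.26, 0.38],
    [0.06, 0.15, 0.01, 0.09, 0.04, 0.20, 0.00, 0.20, 0.26, 1.00, 0.12],
    [0.00, 0.00, 0.00, 0.01, 0.18, 0.03, 0.01, 0.17, 0.38, 0.12, 1.00],
]

def degree_centrality(cos, t):
    """Number of other sentences whose cosine with sentence i exceeds t."""
    n = len(cos)
    return [sum(1 for j in range(n) if j != i and cos[i][j] > t)
            for i in range(n)]

deg = degree_centrality(COS, 0.1)
best = max(range(len(IDS)), key=lambda i: deg[i])
print(dict(zip(IDS, deg)))
print("most central at t=0.1:", IDS[best])  # d4s1, with 8 votes
```

With this matrix, d4s1 collects the most votes at t = 0.1, which agrees with its top LexRank score at that threshold in the comparison table.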


LexRank

$$p(A) = \frac{d}{N} + (1 - d)\left(\frac{p(T_1)}{c(T_1)} + \dots + \frac{p(T_n)}{c(T_n)}\right)$$

• T_1 … T_n are pages that link to A, c(T_i) is the outdegree of page T_i, and N is the total number of pages.

• d is the “damping factor”, or the probability that we “jump” to a far-away node during the random walk. It accounts for disconnected components or periodic graphs.

• When d = 1, we have a strict uniform distribution. When d = 0, the method is not guaranteed to converge to a unique solution.

• Typical value for d is in the range [0.1, 0.2] (Brin and Page, 1998).
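The recurrence p(A) = d/N + (1 − d) Σ p(T_i)/c(T_i) can be iterated directly until it stops changing. The sketch below is one possible reading, not the implementation used in the talk: the name `lexrank`, the defaults, and the four-node star graph are hypothetical, and every node is assumed to have at least one neighbor.

```python
def lexrank(adjacency, d=0.15, eps=1e-10, max_iter=1000):
    """Iterate p(A) = d/N + (1 - d) * sum_i p(T_i) / c(T_i) to a fixed point.

    adjacency[t][a] is 1 if node t links to (votes for) node a.
    Assumes every node has at least one outgoing link.
    """
    n = len(adjacency)
    outdeg = [sum(row) for row in adjacency]
    p = [1.0 / n] * n
    for _ in range(max_iter):
        p_next = []
        for a in range(n):
            votes = sum(p[t] / outdeg[t] for t in range(n) if adjacency[t][a])
            p_next.append(d / n + (1 - d) * votes)
        delta = sum(abs(u - v) for u, v in zip(p_next, p))
        p = p_next
        if delta < eps:
            break
    return p

# Hypothetical undirected star: node 0 is linked with nodes 1, 2, and 3.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
scores = lexrank(star)
print([round(s, 4) for s in scores])  # the hub (node 0) gets the largest score
```

The star without the jump term would be a bipartite, periodic graph; the d/N term is what makes the iteration settle.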


Cosine centrality vs. centroid centrality

ID LPR (0.1) LPR (0.2) LPR (0.3) Centroid

d1s1 0.6007 0.6944 1.0000 0.7209

d2s1 0.8466 0.7317 1.0000 0.7249

d2s2 0.3491 0.6773 1.0000 0.1356

d2s3 0.7520 0.6550 1.0000 0.5694

d3s1 0.5907 0.4344 1.0000 0.6331

d3s2 0.7993 0.8718 1.0000 0.7972

d3s3 0.3548 0.4993 1.0000 0.3328

d4s1 1.0000 1.0000 1.0000 0.9414

d5s1 0.5921 0.7399 1.0000 0.9580

d5s2 0.6910 0.6967 1.0000 1.0000

d5s3 0.5921 0.4501 1.0000 0.7902


Evaluation metrics

• Difficult to evaluate summaries
– Intrinsic vs. extrinsic evaluations
– Extractive vs. non-extractive evaluations
– Manual vs. automatic evaluations

• ROUGE = mixture of n-gram recall for different values of n.

• Example:
– Reference = “The cat in the hat”
– System = “The cat wears a top hat”
– 1-gram recall = 3/5; 2-gram recall = 1/4; 3,4-gram recall = 0

• ROUGE-W = (weighted) longest common subsequence

• Example above: 3/5
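The numbers in the example can be reproduced with a few lines of Python. This is a simplified sketch (lowercasing, whitespace tokenization, one reference, unweighted LCS), not the official ROUGE toolkit; the function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_recall(reference, system, n):
    """Fraction of reference n-grams that also occur in the system output."""
    ref = ngrams(reference.lower().split(), n)
    out = ngrams(system.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, out[gram]) for gram, count in ref.items())
    return overlap / total

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

reference = "The cat in the hat"
system = "The cat wears a top hat"
print(ngram_recall(reference, system, 1))  # 0.6  (3/5)
print(ngram_recall(reference, system, 2))  # 0.25 (1/4)
print(ngram_recall(reference, system, 3))  # 0.0
ref_tokens = reference.lower().split()
sys_tokens = system.lower().split()
print(lcs_length(ref_tokens, sys_tokens) / len(ref_tokens))  # 0.6 (3/5)
```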


Evaluation results

Centroid: C0.5, C10, C1.5, C1, C2.5, C2

Degree: D0.5T0.1, D0.5T0.2, D0.5T0.3, D1.5T0.1, D1.5T0.2, D1.5T0.3, D1T0.1, D1T0.2, D1T0.3

LexRank: Lr0.5T0.1, Lr0.5T0.2, Lr0.5t0.3, Lr1.5t0.1, Lr1.5t0.2, Lr1.5t0.3, Lr1T0.1, Lr1T0.2, Lr1T0.3

Rouge-2
Lr1.5t0.2 0.115
D1.5T0.2 0.114
D1T0.2 0.113
…
C1.5 0.099

Rouge-1
Lr1.5t0.1 0.400
Lr1.5t0.2 0.400
Lr1T0.2 0.396
…
C1 0.382

Rouge-4
Lr1.5t0.1 0.124
Lr1.5t0.2 0.124
Lr1T0.2 0.124
…
C2 0.118


DUC results

Peer code Task ROUGE-1 ROUGE-2 ROUGE-3 ROUGE-4 ROUGE-L ROUGE-W

141 3 5 2 1 1 2 2

142 3 5 1 1 1 4 3

143 4 1 2 1 1 6 6

144 4 3 1 1 1 7 7

145 4 1 2 2 2 4 4

Recall LCS


Results and applications

• DUC results (MU recall, ROUGE):
– 1st place 2003 (duc.nist.gov)
– 1-2 place 2004

• Applications:
– Web page summarization (WIE)
– Topical crawling
– Answer focused
– Wireless access
– Cross-lingual
– IR-based evaluation
– Knowledge based

• Beyond summarization:
– Classification
– WSD
– Spam recognition


[Figure: graph with nodes numbered 1 to 28]


Outline

• Introduction
• Random walks and social networks
• LexRank
• Projects in language modeling and machine learning


Syntax in Statistical Machine Translation

• Noisy channel model: assume that a source sentence has to be translated into a target language sentence

• Goal: find

$$\hat{e} = \operatorname*{argmax}_{e} \{ P(e \mid f) \}$$

• Obvious problems can be fixed with syntax (?)

• JHU 02 and 03 projects

• (Franz Och, Jan Hajic, Dan Gildea + others)

• Solution using log-linear combination of features

$$\hat{e} = \operatorname*{argmax}_{e} \Big\{ \sum_{m} \lambda_m h_m(e, f) \Big\}$$
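A log-linear combination scores each candidate translation e of a source sentence f as a weighted sum of feature values h_m(e, f) and picks the highest-scoring candidate. The sketch below only illustrates that selection step. The feature names (`lm`, `tm`, `length`), their values, and the weights are invented for illustration and are not the features or weights from the JHU projects.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values: sum_m lambda_m * h_m(e, f)."""
    return sum(weights[name] * value for name, value in features.items())

def rerank(candidates, weights):
    """Return the candidate translation with the highest log-linear score."""
    best = max(candidates, key=lambda c: loglinear_score(c[1], weights))
    return best[0]

# Hypothetical feature weights (lambda_m) and feature values (h_m).
weights = {"lm": 1.0, "tm": 1.0, "length": 0.1}
candidates = [
    ("china 's 14 open border cities marked economic achievements",
     {"lm": -12.0, "tm": -8.0, "length": 9}),
    ("chinese 14 border an open city",
     {"lm": -15.0, "tm": -7.5, "length": 6}),
]
print(rerank(candidates, weights))
```

With these made-up numbers the first candidate scores -19.1 and the second -21.9, so the first is selected.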


Setup

• Given: a Chinese sentence +

• The top 1000 candidate translations in English

• Parse all of these

• Compute features: monolingual, bilingual, syntax-free, and syntactic

• Evaluation using BLEU (BiLingual Evaluation Understudy)

• Example:
– Is the number of constituents across languages the same?
– Is the English tree grammatical?
– Are the two sentences of comparable length?

• Feature combination
– Use a greedy maxbleu algorithm


中国 十四 个 边境 开放 城市 经济 建设 成就 显著

NR CD M NN NN NN NN NN NN VV

China 14 border open cities economic achievements marked

[Figure: Chinese parse tree of the sentence above, with constituent labels CLP, QP, NP, VP, and IP]


1. fourteen chinese open border cities make significant achievements in economic construction
2. xinhua news agency report of february 12 from beijing - the fourteen chinese border cities that have been opened to foreigners achieved satisfactory results in their economic construction in 1995 .
3. according to statistics , the cities achieved a combined gross domestic product of rmb 19 billion last year , an increase of more than 90 % over 1991 before their opening .
4. the state council successively approved the opening of fourteen border cities to foreigners in 1992 , including heihe , pingxiang , hunchun , yining and ruili , and permitted them to set up 14 border economic cooperation zones .

1. significant accomplishment achieved in the economic construction of the fourteen open border cities in china
2. xinhua news agency , beijing , feb. 12 - exciting accomplishment has been achieved in 1995 in the economic construction of china 's fourteen border cities open to foreigners .
3. statistics have indicated that these cities produced a combined gdp of over 19 billion yuan last year , an increase of more than 90 % , compared with that in 1991 before the cities were open to foreigners .
4. in 1992 , the state council successively opened fourteen border cities to foreigners . these included heihe , pingxiang , huichun , yining , and ruili . meanwhile , the state council also gave its approval to these cities to establish fourteen border zones for economic cooperation .

1. in china , fourteen cities along the border opened to foreigners achieved remarkable economic development
2. xinhua news agency , beijing , february 12 - the economic development in china 's fourteen cities along the border opened to foreigners achieved gratifying results in 1995 .
3. according to statistics , these cities completed a gross domestic product in excess of rmb 19 billion in last year , an increase of more than 90 % over 1991 ( the year before they were opened ) .
4. in 1992 , the state council successively approved fourteen cities along the border to be opened to foreigners , which included hei he , pingxiang , hunchun , yining and ruili etc. at the same time , these cities were also given approvals to set up fourteen border @-@ economic @-@ cooperation zones .

1. economic construction achievement is prominent in china 's fourteen border opening up cities .
2. xinhua news agency , beijing , february 12 - delightful economic construction result was achieved in china 's fourteen border opening up cities in 1995 .
3. according to statistics , gdp registered over 19 billion yuan last year in those cities , over 90 % higher than those of year 1991 before opening up .
4. fourteen border cities like heihe , pingxiang , huichun , yinin , and ruili etc were approved successively by the state council in 1992 as the cities opening to the outside world , setting up of fourteen border economic cooperation zones in these cities were also approved simultaneously .

1. china 's 14 open border cities marked economic achievements
2. xinhua news agency , beijing , february 12 chinese 14 border an open city 1995 economic development to achieve good results
3. according to statistics , the city last year 's gross domestic product ( gdp ) over 19 billion yuan , and opening up of more than 90 % growth in 1991 .
4. the state council in 1992 has approved the heihe , pingxiang , huichun , yining and ruili , 14 border cities as an open city , and the city also approved a total of 14 border economic cooperation .

Multiple references


Syntactic features

(S1 (S (NP (NP (NNP china) (POS 's)) (CD 14) (ADJP (JJ open)) (NN border) (NNS cities)) (VP (VBD marked) (NP (JJ economic) (NNS achievements)))))

(S1 (S (NP (CD fourteen) (ADJP (JJ chinese) (JJ open)) (NN border) (NNS cities)) (VP (VBP make) (NP (JJ significant) (NNS achievements)) (PP (IN in) (NP (JJ economic) (NN construction))))))

(S1 (NP (NP (JJ significant) (NN accomplishment)) (VP (VBN achieved) (PP (IN in) (NP (NP (DT the) (JJ economic) (NN construction)) (PP (IN of) (NP (NP (DT the) (CD fourteen) (JJ open) (NN border) (NNS cities)) (PP (IN in) (NP (NNP china))))))))))

(S1 (S (PP (IN in) (NP (NNP china))) (, ,) (NP (NP (CD fourteen) (NNS cities)) (PP (IN along) (NP (DT the) (NN border)))) (VP (VBN opened) (PP (TO to) (NP (NP (NNS foreigners)) (VP (VBN achieved) (NP (JJ remarkable) (JJ economic) (NN development))))))))

(S1 (S (NP (JJ economic) (NN construction) (NN achievement)) (VP (AUX is) (ADJP (JJ prominent) (PP (IN in) (S (NP (NP (NNP china) (POS 's)) (NP (CD fourteen) (NN border))) (VP (VBG opening) (PRT (RP up)) (NP (NNS cities)))))))))


Flipdeps

$$d(w_1, w_2) = \log \frac{p(w_1\, w_2)}{p(w_2\, w_1)}$$


[Figure: a TR dependency tree (node labels include PRED say, ACT Spoon, RSTR Alan, TWHEN recently, PAT president, APP Newsweek, PAT increase, ACT rate, EXT pct, TWHEN January) shown alongside a FUF structure (CAT clause, PROCESS, PARTIC, CIRCUM, AFFECTED, AGENT, CREATED, LEX say, TENSE past, OBJECT-CLAUSE that, LEX Newsweek, CAT pp, PREP LEX in, NP LEX January, DETERMINER none)]


Results

• BLEU baseline: 31.6%

• Most features: 30.0%-31.8%

• Flipdeps: 31.8%

• Best single feature: 32.5%

• Best combination: 32.9%

• (statistically significant improvement)

• Results in [Och&al.04]


1. Other Party, governmental and law enforcement authorities must take similar actions beginning from the start of next year.
2. Other Party and government agencies and judicial departments must also take similar actions early next year.
3. All other Party, Government and Judicial Departments must start similar actions at the beginning of next year.
4. Other Party, government, and judicatory departments must take similar action at the beginning of next year.
5. Other party and government departments as well as judicial departments must take similar action from the beginning of next year.
6. All other party government and judicial departments must also take similar measures from the beginning of next year.
7. Other party and judicial authorities should take similar actions from the beginning of next year.
8. Other departments of the Party, the government and the judicial departments must also take similar actions early next year.
9. Other Party and Government departments as well as judicial departments must also take similar measures from the beginning of next year.
10. The other law enforcement agencies and departments will also take part in similar proceedings from the beginning of next year.
11. Other party, governmental and judicial departments will have to take similar action from the beginning of next year.

12. Other party politics and judicial department also will have to start from next year beginning of the year to adopt similar motion.
13. Other party and judicial section must start from the beginning of year of next year taking similar action also
14. The beginning of a year for and res judiciaria as welling must from next year of other party commences assumingis similar toing the proceeding.
15. At the beginning of next year politics and judicial department other parties must also start to pick to take similar action.
16. Other party politics and the judicial department also will have to start from at the beginning of next year to take the similar action.
17. Other party policies and judicial department must also begin from early next year to take similar action.

其他党政及司法部门也必须从明年年初开始采取类似行动。

Phylogenetic Text Modeling

Machine translation identification


t-test: p<0.05

Chinese: Levenshtein 50/50, BLEU 50/50

Arabic: Levenshtein 50/50, BLEU 48/50


Chronological ordering

S1: Italian TV says the crash put a hole in the 25th floor of the Pirelli building, and that smoke is pouring from the opening. (04/18/02 12:22)

S2: Italian TV showed a hole in the side of the Pirelli building with smoke pouring from the opening. (04/18/02 12:32)

S3: Italian state television said the crash put a hole in the 25th floor of the Pirelli building. (04/18/02 12:42)

S4: Italian state television said the crash put a hole in the 25th floor of the 30-story building. (04/18/02 12:44)

    S1  S2  S3  S4
S1   0  10  12  13
S2  10   0  15  16
S3  12  15   0   1
S4  13  16   1   0
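The distances in this matrix are consistent with a word-level Levenshtein (edit) distance over whitespace-separated tokens, with punctuation left attached to words. That reading is an inference from the numbers rather than something the slide states, and `word_levenshtein` is an illustrative name. Under that assumption, a sketch reproduces several of the entries:

```python
def word_levenshtein(s, t):
    """Edit distance between two sentences, counted in whole words."""
    a, b = s.split(), t.split()
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or match
        prev = cur
    return prev[-1]

S1 = ("Italian TV says the crash put a hole in the 25th floor of the "
      "Pirelli building, and that smoke is pouring from the opening.")
S2 = ("Italian TV showed a hole in the side of the Pirelli building "
      "with smoke pouring from the opening.")
S3 = ("Italian state television said the crash put a hole in the 25th "
      "floor of the Pirelli building.")
S4 = ("Italian state television said the crash put a hole in the 25th "
      "floor of the 30-story building.")

print(word_levenshtein(S1, S2))  # 10
print(word_levenshtein(S1, S3))  # 12
print(word_levenshtein(S1, S4))  # 13
print(word_levenshtein(S3, S4))  # 1
```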


[Figure: sentences S1 to S4 arranged along a time axis t, labeled S1 (d=0), S2 (d=10), S3 (d=12), S4 (d=13), with internal nodes 1 (d=3.5) and 2 (d=12)]

Best representation: stop words removed


A small plane has hit a skyscraper in central Milan, setting the top floors of the 30-story building on fire, an Italian journalist told CNN. The crash by the Piper tourist plane into the 26th floor occurred at 5:50 p.m. (1450 GMT) on Thursday, said journalist Desideria Cavina. The building houses government offices and is next to the city's central train station. Several storeys of the building were engulfed in fire, she said. Italian TV says the crash put a hole in the 25th floor of the Pirelli building, and that smoke is pouring from the opening. Police and ambulances are at the scene. Many people were on the streets as they left work for the evening at the time of the crash. Police were trying to keep people away, and many ambulances were on the scene. There is no word yet on casualties.

CNN 4/18/02 12:22pm; CNN 4/18/02 12:32pm; ABCNews 4/18/02 1:00pm; MSNBC 4/18/02 1:00pm; La Stampa 4/18/02 12:45pm

A small plane has hit a skyscraper in central Milan, setting the top floors of the 30-story building on fire, an Italian journalist told CNN. The crash by the Piper tourist plane into the 26th floor occurred at 5:50 p.m. (1450 GMT) on Thursday, said journalist Desideria Cavina. The building houses government offices and is next to the city's central train station. Several storeys of the building were engulfed in fire, she said. Italian TV showed a hole in the side of the Pirelli building with smoke pouring from the opening. RAI state TV reported that the plane had apparently radioed an SOS because of engine trouble. Earlier though, in Rome, the senate's president, Marcello Pera, said it "very probably" appeared to be a terrorist attack. Police and ambulances are at the scene. Many people were on the streets as they left work for the evening at the time of the crash. Police were trying to keep people away, and many ambulances were on the scene. There is no word yet on casualties. TV pictures from the scene evoked horrific memories of the September 11 attacks on the World Trade Center in New York and the collapse of the building's twin towers. "I heard a strange bang so I went to the window and outside I saw the windows of the Pirelli building blown out and then I saw smoke coming from them," said Gianluca Liberto, an engineer who was working in the area told Reuters. The building is known as the Pirelli skyscraper but the Italian tyre and cable company does not operate out of the building. It is one of the symbols of Italy's financial capital and is one of the world's tallest concrete buildings, designed between 1955 and 1960.

A small plane crashed into a skyscraper in downtown Milan today, setting several floors of the 30-story building on fire. The plane crashed into the 25th floor of the Pirelli building in downtown Milan. The weather was clear at the time of the crash. Smoke poured from the opening as police and ambulances rushed to the area. The president of the Italian Senate, Marcello Pera, told Italian television it "very probably" appeared to be a terrorist attack but soon afterwards his spokesman said it was probably an accident. A transport official told Reuters the plane had reported problems with its undercarriage and was circling the city ahead of trying to land at a local airport. The Pirelli building houses the administrative offices of the local Lombardy region and sits next to the city's central train station. It is constructed of concrete and glass. The crash happened just before rush hour, as office workers were closing their day.

A small airplane crashed into a government building in the heart of Milan, setting the top floors on fire, Italian police reported. There were no immediate reports on casualties as rescue workers attempted to clear the area in the city's financial district. Few details of the crash were available, but news reports about it immediately set off fears that it might be a terrorist act akin to the Sept. 11 attacks in the United States. Those fears sent U.S. stocks tumbling to session lows in late morning trading. Witnesses reported hearing a loud explosion from the 30-story office building, which houses the administrative offices of the local Lombardy region and sits next to the city's central train station. Italian state television said the crash put a hole in the 25th floor of the Pirelli building. News reports said smoke poured from the opening. Police and ambulances rushed to the building in downtown Milan. No further details were immediately available.

[La Stampa, translated from Italian] A tourist plane, a Piper, crashed this afternoon in Milan, shortly before 6 p.m., into the Pirelli skyscraper, which also houses the Lombardy Region (the president of the Region, Roberto Formigoni, is on an official mission in India with a regional delegation). This was learned from investigative sources. The impact is reported to have occurred around the 25th of the skyscraper's 30 floors. At least six floors are visibly gutted. The explosion threw debris some forty meters around the building. Throughout the area around the Pirelli skyscraper, telephone communications, including mobile, are interrupted or nearly impossible. The Stock Exchange suspended its evening session at Piazza Affari after the crash of the tourist plane, and President Bush was immediately informed of the explosion at the Pirellone. "Very probably this is an attack," said Marcello Pera, opening the session at Palazzo Madama. But according to what has been learned, the tourist plane was probably in distress: the pilot is reported to have sent an SOS, picked up by the Linate control tower.


Fact tracking

04/18/02 13:17 (CNN) The plane, en route from Locarno in Switzerland, to Rome, Italy, smashed into the Pirelli building's 26th floor at 5:50 p.m. (1450 GMT) on Thursday.

04/18/02 13:42 (ABCNews) The plane was destined for Italy's capital Rome, but there were conflicting reports as to whether it had come from Locarno, Switzerland or Sofia, Bulgaria.

04/18/02 13:42 (CNN) The plane, en route from Locarno in Switzerland, to Rome, Italy, smashed into the Pirelli building's 26th floor at 5:50 p.m. (1450 GMT) on Thursday.

04/18/02 13:42 (FoxNews) The plane had taken off from Locarno, Switzerland, and was heading to Milan's Linate airport, De Simone said.


Questions from Milan corpus

1. How many people were injured?
2. How many people were killed? (age, number, gender, description)
3. Was the pilot killed?
4. Where was the plane coming from?
5. Was it an accident (technical problem, illness, terrorist act)?
6. Who was the pilot? (age, number, gender, description)
7. When did the plane crash?
8. How tall is the Pirelli building?
9. Who was on the plane with the pilot?
10. Did the plane catch fire before hitting the building?
11. What was the weather like at the time of the crash?
12. When was the building built?
13. What direction was the plane flying?
14. How many people work in the building?
15. How many people were in the building at the time of the crash?
16. How many people were taken to the hospital?
17. What kind of aircraft was used?


Relative order, time to stabilize and number of incorrect or partially correct answers before stabilization

• Changing answers:
– How many people were injured?: 40 different answers!

``no word yet on casualties/injuries'', ``20 people were taken to a nearby hospital'', ``20 to 30 people were hospitalized with injuries'', ``many people were injured'', ``there was no official word on the number of people injured in the building'', ``at least 20 injured were taken to hospital from the scene'', ``dozens of people had been taken to the hospital'', ``injuring dozens'', ``injuring at least 30'', ``injuring 60'', ``dozens were injured'', ``60 others were injured'', ``the number of injured, originally at 60, was revised downward Friday to 36''.

Only 24 hours after the crash do agencies settle on the accurate number, namely ``36 people''.
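
A minimal sketch of how "time to stabilize" could be measured for one question, assuming each report is reduced to a (timestamp, answer) pair. The function name `time_to_stabilize`, the timestamps, and the shortened answer strings are illustrative assumptions, not data from the Milan corpus.

```python
from datetime import datetime, timedelta

def time_to_stabilize(reports, event_time):
    """Return (final_answer, delay, n_earlier) for (timestamp, answer) pairs.

    final_answer: the answer given by the latest report
    delay:        time from the event to the first report in the final,
                  uninterrupted run of that answer
    n_earlier:    number of reports published before that point
    """
    ordered = sorted(reports, key=lambda r: r[0])
    final_answer = ordered[-1][1]
    stable_since = ordered[-1][0]
    # Walk backwards while the answer still equals the final one.
    for stamp, answer in reversed(ordered):
        if answer != final_answer:
            break
        stable_since = stamp
    n_earlier = sum(1 for stamp, _ in ordered if stamp < stable_since)
    return final_answer, stable_since - event_time, n_earlier

# Hypothetical timeline (times are made up for illustration).
crash = datetime(2002, 4, 18, 9, 50)
reports = [
    (datetime(2002, 4, 18, 12, 22), "no word yet"),
    (datetime(2002, 4, 18, 13, 42), "at least 30"),
    (datetime(2002, 4, 18, 18, 13), "60"),
    (datetime(2002, 4, 19, 9, 50), "36"),
    (datetime(2002, 4, 19, 18, 2), "36"),
]
answer, delay, n_earlier = time_to_stabilize(reports, crash)
print(answer, delay, n_earlier)
```

Comparing answers as exact strings is a simplification; a real tracker would first have to normalize paraphrases such as "dozens were injured" and "injuring dozens".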


[Figure: timeline of reported death tolls by source (ABCNews, CNN, FoxNews, MSNBC, USAToday), each answer marked incorrect, partial, or correct. Time axis (EST): 9:50, 12:47, 12:49, 12:51, 12:51, 13:01, 13:17, 13:42, 13:46, 14:13, 14:21, 14:29, 14:32, 14:52, 15:02, 15:22, 15:31, 15:36, 17:52, 18:13, 18:35, 18:40, 9:31, 18:02, with the final entries labeled "Next Day". Answers shown include "no word yet on casualties", "no immediate reports", "one dead", "two deaths", "at least two", "at least three", "three people killed", "three people dead", "four people", "four dead", "at least four", "at least five", "five people killed", "five reported dead", and "Fasulo and two others killed".]


Syntactic Alignment

• Sequence alignment for (near) paraphrasing [Barzilay&Lee 03]

• No syntax used

• Dynamic programming

• Different penalties for alignment depending on the syntactic similarity

[Figure: alignment example over the words "John", "talked", "Mary", "had", "with", "a chat"]
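
The dynamic-programming idea can be sketched as a global alignment in which the substitution score depends on syntactic category. This is a generic Needleman-Wunsch sketch, not the method evaluated in the talk: the scores (2, 1, -1), the gap penalty, the part-of-speech tags, and the two example sentences (assembled from the words in the figure) are all assumptions.

```python
def align(a, b, gap=-1):
    """Globally align two lists of (word, tag) tokens; return (score, pairs)."""
    def score(x, y):
        if x[0].lower() == y[0].lower():
            return 2      # identical words
        if x[1] == y[1]:
            return 1      # different words, same syntactic category
        return -1         # different word and different category

    n, m = len(a), len(b)
    # dp[i][j] = best score for aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            pairs.append((a[i - 1][0], b[j - 1][0]))
            i -= 1
            j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            pairs.append((a[i - 1][0], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1][0]))
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs

# Hand-tagged example sentences (tags are assumptions).
s1 = [("John", "NNP"), ("talked", "VBD"), ("with", "IN"), ("Mary", "NNP")]
s2 = [("John", "NNP"), ("had", "VBD"), ("a", "DT"), ("chat", "NN"),
      ("with", "IN"), ("Mary", "NNP")]
total, pairs = align(s1, s2)
print(total)
for left, right in pairs:
    print(left, right)
```

With these scores the two verbs end up aligned with each other even though the words differ, which is the effect a syntax-sensitive penalty is meant to produce.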


Syntactic Alignment

A police official said it was a Piper tourist plane and that the crash had set the top floors on fire.
According to ABCNEWS aviation expert John Nance, Piper planes have no history of mechanical troubles or other problems that would lead a pilot to lose control.
April 18, 2002 - A small Piper aircraft crashes into the 417-foot-tall Pirelli skyscraper in Milan, setting the top floors of the 32-story building on fire.
Authorities said the pilot of a small Piper plane called in a problem with the landing gear to the Milan's Linate airport at 5:54 p.m., the smaller airport that has a landing strip for private planes.
Initial reports described the plane as a Piper, but did not note the specific model.
Italian rescue officials reported that at least two people were killed after the Piper aircraft struck the 32-story Pirelli building, which is in the heart of the city's financial district.
A small piper plane with only the pilot on board crashed Thursday into a 30-story landmark skyscraper, killing at least two people and injuring at least 30.
Police officer Celerissimo De Simone said the pilot of the Piper Air Commander plane had sent out a distress call at 5:50 p.m. just before the crash near Milan's main train station.
Police officer Celerissimo De Simone said the pilot of the Piper aircraft had sent out a distress call at 5:50 p.m. (11:50 a.m.)
Police officer Celerissimo De Simone said the pilot of the Piper aircraft had sent out a distress call at 5:50 p.m. just before the crash near Milan's main train station.
Police officer Celerissimo De Simone said the pilot of the Piper aircraft sent out a distress call at 5:50 p.m. just before the crash near Milan's main train station.
Police officer Celerissimo De Simone told The AP the pilot of the Piper aircraft had sent out a distress call at 5:50 p.m. just before crashing.
Police say the aircraft was a Piper tourism plane with only the pilot on board.
Police say the plane was an Air Commando - a small plane similar to a Piper.
Rescue officials said that at least three people were killed, including the pilot, while dozens were injured after the Piper aircraft struck the Pirelli high-rise in the heart of the city's financial district.
The crash by the Piper tourist plane into the 26th floor occurred at 5:50 p.m. (1450 GMT) on Thursday, said journalist Desideria Cavina.

Police officer Celerissimo De Simone said the pilot of the Piper aircraft, en route from Switzerland, sent out a distress call at 5:54 p.m. just before the crash near Milan's main train station.


Algorithm and results

• Three lexical methods
• Two syntactic methods
• Generate new sentences
• Method 4 (syntactic alignment except for stop words):
– Grammaticality 3.74
– Fidelity 3.77
– on a scale from 1 to 4

• Best lexical method:
– Grammaticality 3.12
– Fidelity 3.07


Web-based QA

• TREC questions
– Where is Inoco based?
– When was London's Docklands Light Railway constructed?
– Who followed Willy Brandt as chancellor of the Federal Republic of Germany?
– What is Grenada's main commodity export?

• TREC evaluation
– Earliest conference papers (Radev & al. ANLP’2000, Prager & al. SIGIR’2000)

• Reranking models


Question Modulation

• TREC question set
• Start with initial formulation
• TRDR = Total Reciprocal Document Rank (range: 0 to 2.92)

• Evolutionary operators: mutation, permutation, crossover, drop, insert, phrase

• What country is the biggest producer of tungsten? 0.44

• What country “biggest producer” of tungsten? 1.11

• country “biggest producer of tungsten”? 1.98

• Web results using Google as the backend search engine

– 0.4 MRR (mean reciprocal rank)

• Query modulation results

– 42% increase in TRDR (from 0.79 to 1.12)
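
The two metrics can be sketched as follows. TRDR is taken here as the sum of reciprocal ranks of all relevant documents in the result list. The stated upper bound of 2.92 matches a list of ten results that are all relevant (1 + 1/2 + ... + 1/10 ≈ 2.93), so a cutoff of ten is assumed. The function names are illustrative.

```python
def trdr(relevant_ranks):
    """Total reciprocal document rank: sum of 1/rank over relevant documents."""
    return sum(1.0 / rank for rank in relevant_ranks)

def mrr(first_relevant_ranks):
    """Mean reciprocal rank over questions; None means no relevant hit."""
    scores = [1.0 / rank if rank else 0.0 for rank in first_relevant_ranks]
    return sum(scores) / len(scores)

# Upper bound under the assumed cutoff of ten results, all relevant.
best_possible = trdr(range(1, 11))
print(round(best_possible, 3))

# Relevant documents at ranks 1 and 4 only.
print(trdr([1, 4]))

# Relative improvement reported on the slide: TRDR from 0.79 to 1.12.
relative_gain = (1.12 - 0.79) / 0.79
print(round(100 * relative_gain))
```

Unlike MRR, which only rewards the first relevant hit, TRDR keeps growing as more relevant documents appear near the top of the list.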


Models of the Web

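Random graph (Poisson degree distribution): $P(k) = e^{-\langle k \rangle} \frac{\langle k \rangle^{k}}{k!}$, with $\langle k \rangle = pN$

Scale-free network (power-law degree distribution): $P(k) \sim k^{-\gamma}$

[Figure: panels A, B, a, b]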

• Erdős/Rényi 59, 60

• Barabási/Albert 99

• Watts/Strogatz 98

• Kleinberg 98

• Menczer 02

• Radev 03

• Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology
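
A small sketch contrasting the degree distributions usually associated with the first two models in the list: Poisson for Erdős/Rényi random graphs and a power law for Barabási/Albert networks. The mean degree of 10, the exponent of 3, and the truncation at degree 1000 are illustrative assumptions.

```python
import math

def poisson_pmf(k, mean_degree):
    """Probability of degree k when degrees are Poisson distributed."""
    return math.exp(-mean_degree) * mean_degree ** k / math.factorial(k)

def power_law_pmf(k, gamma, k_max):
    """Probability of degree k under P(k) ~ k^(-gamma), truncated at k_max."""
    norm = sum(j ** -gamma for j in range(1, k_max + 1))
    return k ** -gamma / norm

mean_degree = 10.0   # assumed average degree of the random graph
gamma = 3.0          # assumed power-law exponent
k_max = 1000         # assumed truncation point for normalization

poisson_total = sum(poisson_pmf(k, mean_degree) for k in range(0, 101))
hub_poisson = poisson_pmf(50, mean_degree)
hub_power_law = power_law_pmf(50, gamma, k_max)

print(round(poisson_total, 6))
print(hub_poisson, hub_power_law)
```

The comparison at degree 50 shows why hubs are effectively absent from random graphs yet expected in scale-free ones: the power-law tail is many orders of magnitude heavier.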


Self-triggerability across hyperlinks

• Document closures for information retrieval

• Self-triggerability [Mosteller&Wallace 84] Poisson distribution

• Two-Poisson [Bookstein&Swanson 74]

• Negative Binomial, K-mixture [Church&Gale 95]

• Triggerability across hyperlinks?

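[Figure: probabilities p and p′ for the example words "by", "with", "from" and "photo", "dream", "path"; the accompanying conditional-probability formula is not recoverable from the transcript]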


Evolving Word-based Web

• Observations:
– Links are made based on topics
– Topics are expressed with words
– Words are distributed very unevenly (Zipf, Benford, self-triggerability laws)

• Model
– Pick n
– Generate n lengths according to a power-law distribution
– Generate n documents using a trigram model

• Model (cont’d)
– Pick words in decreasing order of r.
– Generate hyperlinks with random directionality

• Outcome
– Generates power-law degree distributions
– Generates topical communities
– Natural variation of PageRank: LexRank

PageRank: $p' = E^{T} p$

HITS: $a' = E^{T} h$, $h' = E a$
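
The PageRank and HITS updates can be sketched as plain power iteration over a link matrix E, where E[i][j] = 1 if page i links to page j. This is a minimal illustration under stated assumptions: the three-page graph is made up, E is row-normalized for PageRank, there is no damping factor, and every page is assumed to have at least one out-link.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def mat_vec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def normalize(v):
    total = sum(v)
    return [x / total for x in v]

def pagerank(E, iterations=50):
    """Iterate p' = P^T p, where P is E with each row scaled to sum to 1."""
    Pt = transpose([normalize(row) for row in E])
    p = [1.0 / len(E)] * len(E)
    for _ in range(iterations):
        p = normalize(mat_vec(Pt, p))
    return p

def hits(E, iterations=50):
    """Iterate a' = E^T h and h' = E a, rescaling after each step."""
    Et = transpose(E)
    h = [1.0] * len(E)
    a = h
    for _ in range(iterations):
        a = normalize(mat_vec(Et, h))   # authorities collect hub scores
        h = normalize(mat_vec(E, a))    # hubs collect authority scores
    return a, h

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
E = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
p = pagerank(E)
a, h = hits(E)
print([round(x, 3) for x in p])
print([round(x, 3) for x in a])
```

Practical PageRank implementations usually add a damping term so that the iteration converges on arbitrary graphs; it is left out here so the code mirrors the update as written on the slide.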


Tripartite updating

• Modeling classification problems using bipartite graphs

• Weakly supervised learning – why?
– bootstrapping, co-training, active learning

• Spectral partitioning
– Fiedler vector

• Singular value decomposition

• Random walks

• Tripartite updating
• Matrix representation
• Iterative power method

[Figure: graph over node sets L, U, and F with link matrices T1 and T2]


Tripartite updating

• Tasks:
– Spam detection
– Named entity classification
– PP attachment
– Number classification

• Features:
– Number classification: 5 classes based on context and Hobbs class

• Four-way or three-way classification
• For the same accuracy of SP and TU, TU handles twice as many labeled examples with ten times as many unlabeled examples

[Equations: iterative updates of $F^{(t)}$ and $U^{(t)}$ involving the products $T_1^{T} L$, $T_2 F$, and $T_2^{T} U$; the exact form is not recoverable from the transcript]

[Figure: graph over node sets L, U, and F with link matrices T1 and T2]
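
One possible reading of the iterative scheme is label propagation through shared features: class scores flow from labeled examples L to features F, and from features to unlabeled examples U. The sketch below is not the update rule from the talk. The combination rule, the row normalization, the function name `propagate`, and the toy matrices are all assumptions.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def row_normalize(m):
    out = []
    for row in m:
        total = sum(row)
        out.append([x / total for x in row] if total else list(row))
    return out

def propagate(L, T1, T2, iterations=5):
    """L: labeled x classes, T1: labeled x features, T2: unlabeled x features."""
    n_classes = len(L[0])
    U = [[0.0] * n_classes for _ in T2]
    F = []
    for _ in range(iterations):
        # Features collect class scores from labeled and unlabeled examples.
        F = row_normalize(add(matmul(transpose(T1), L), matmul(transpose(T2), U)))
        # Unlabeled examples collect class scores from their features.
        U = row_normalize(matmul(T2, F))
    return U, F

# Toy data: 2 labeled examples, 3 unlabeled examples, 4 features, 2 classes.
L = [[1, 0], [0, 1]]
T1 = [[1, 1, 0, 0], [0, 0, 1, 0]]
T2 = [[1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
U, F = propagate(L, T1, T2)
predicted = [row.index(max(row)) for row in U]
print(predicted)
```

The third unlabeled example shares no feature with any labeled example. It still receives a class because the second unlabeled example passes its score along through their common feature, which is the benefit of including unlabeled data in the graph.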


Relation extraction

• User gives examples of entity E1 and entity E2.

• Example: song = “Let it Be”, singer = “the Beatles”.

• System finds other songs and singers with a very minimal number of training examples.

• The relation may be quite different, e.g., protein-protein, organization-leader, book-author, drug-disease.

• Weakly supervised learning based on graphs is used.


Protein Regulatory Network Recognition

• Wnt signaling

• Glycogen synthase kinase-3 (GSK-3) and CK1 (casein kinase 1) alpha phosphorylate Arm (Armadillo, β-catenin) and cause it to degrade.

• Axin also binds to the phosphatase PP2A

• PP2A activity inhibits Wnt signaling

Hsu 1999, Li 2001, Yanagawa 2002, Liu 2002, Nusse 2003


Method and Results

• Medline:
– “signal transduction” as MeSH major topic and “Wnt” or “AKT” or “Beta-catenin” as words

• 3300 papers extracted by Carlos Santos

• 441 putative proteins (“X is a protein”, “the X protein”, “X verbs”)

• Verbs: bind, associate, interact, activate, repress, inhibit, upregulate, regulate, downregulate, complex, dimerize, localize, bound, stabilize, control, translocate, antagonize, amplify, transduce, trigger
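
The three surface patterns can be sketched with regular expressions. This is a simplified illustration, not the extraction pipeline used in the project: the name pattern, the subset of verbs (only those that take a plain "-s" ending), and the sample text are assumptions.

```python
import re

# Assumed shape of a candidate name: a letter followed by letters, digits, "_" or "-".
NAME = r"\b([A-Za-z][\w-]*)"

# Subset of the verb list whose third-person form is simply verb + "s".
VERBS = ["bind", "activate", "inhibit", "regulate", "stabilize", "interact"]

PATTERNS = [
    re.compile(NAME + r" is a protein"),                    # "X is a protein"
    re.compile(r"\b[Tt]he " + NAME + r" protein"),          # "the X protein"
    re.compile(NAME + r" (?:" + "|".join(VERBS) + r")s\b"), # "X verbs"
]

def putative_proteins(text):
    """Collect every string that fills the X slot of any pattern."""
    found = set()
    for pattern in PATTERNS:
        found.update(pattern.findall(text))
    return found

# Made-up sample text for illustration.
sample = ("Axin binds to the phosphatase PP2A. "
          "The Arm protein is degraded. "
          "GSK-3 is a protein.")
candidates = putative_proteins(sample)
print(sorted(candidates))
```

Patterns this loose also fire on ordinary words (for example "also binds" would yield "also"), which is presumably why the extracted names are called putative and need filtering afterwards.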


Properties (columns): multilingual, multisource, uneven importance, redundant, graph structure, evolving, unstructured, hard to train, manual evaluation

Number classification X X X X X
Summarization MEAD/CST/NIE X X X X X X X X X
Lexical Web models X X X X X
Statistical MT X X X X X
Protein networks X X X X X X
Relation extraction X X X X X X X
Phylogenetic alignment X X X X X X X X
QA/NSIR X X X X X
Topical crawling X X X X X X
XML retrieval X X X
Fact tracking X X X X X


A grab bag of research problems

• Finding adequate representations for dynamic texts
• Integrating user models
• Using self-triggering for information retrieval
• Weakly supervised and active learning
• Robust semantic analysis
• Adequate models of the Web
• Relation extraction
• Syntax-based machine translation and summarization

• Automatic knowledge acquisition from the Web


Conclusion

• New approaches to natural language processing and information retrieval using graph-based techniques such as random walks
• Applications beyond NLP
• Highest ranked system at DUC
• Promising results in semi-supervised machine learning
• Acknowledgments:
– CLAIR (Güneş Erkan, Jahna Otterbacher, Siwei Shen, Zhu Zhang)
– UROP program
– NSF and NIH
– Mark Newman

• To read more:
– http://tangra.si.umich.edu/clair
– http://www.summarization.com
– http://www.newsinessence.com

• Papers: CACM 2005; JAIR 2004; EMNLP 2004; IP&M 2004; JASIST 2002, 2004, 2005; WWW 2002; AAAI 2002; SIGIR 1995, 2000; ACL 1998, 2003; HLT 2001; HLT-NAACL 2004; CIKM 2001, 2003; ANLP 1997, 2000; LREC 2002, 2004; IJCNLP 2004; CL 1998, 2002; COLING 2000, 2004


ACL 2005

www.aclweb.org
June 25-30, 2005
Ann Arbor, MI

General chair: Kevin Knight, ISI
Program co-chairs: Kemal Oflazer, Sabanci U.; Hwee Tou Ng, NUS
Local chair: Dragomir Radev, U. Michigan
Submission deadline: January 14


[Figure: parse tree of the closing sentence, with node labels S, VP, NP, VB, PP, IN, PRP, NP, PRP$, NN]

Thank you for your attention!

tangra.si.umich.edu/clair