
COE Quarterly Technical Exchange, June 10th 2008

Using MapReduce for Scalable Coreference Resolution

Tamer Elsayed, Doug Oard, Jimmy Lin, Asad Sayeed and Tan Xu

HLT COE and UMIACS Laboratory for Computational Linguistics and Information Processing


COE ACE System

[Diagram: pipeline stages]
  English Pipeline: Within-Doc Coref., Pairs Filtering, Feature Generation, Clustering
  Arabic Pipeline: Within-Doc Coref., Feature Generation, Clustering
  Features: Context Features, Conversational Genre Features


Roadmap

1. Context Features
  Pairwise similarity
  Efficiency vs. effectiveness
  Generating features for ACE

2. Conversational-genre Features
  New generative model
  Joint Resolution
  Evaluation using ACE-Usenet


Context Features

Close friends and colleagues of Cheney -- including former Gen. Brent Scowcroft, who was national security adviser when Cheney was Gerald Ford's chief of staff and George H. W. Bush's defense secretary -- have been famously quoted they just don't recognize the Cheney they served along side and the Cheney of today who repeatedly made false assertions about the Iraq war and weapons of mass destruction.

Now, an article in Vanity Fair Magazine by Todd S. Purdum has published a number of strikingly similar assessments from Clinton's former confidants -- plus medically authoritative guesswork speculating about how health problems of the sort Clinton experienced can change a person.

But we avoid that trash talk to focus only on the real, striking changes in the public performances of Bill Clinton and Dick Cheney today. Compared to the way they were, back when they were greatly admired by those who knew them best, back in the day.

Once, Clinton and Cheney were considered consummate political performers. Now they utter gaffes and commit blunders. And they leave the lasting impression that they just don't care about what you think about it.

Once, they were smart and savvy strategic forces that always seemed to boost the political fortunes of their team (Clinton with sterling public performances; Cheney with rock-steady behind-the-scenes guidance). Now they have become liabilities to their causes, grand grist for late-night monologues, caricatures on "Saturday Night Live."

It barely seems credible now but there was a time when it seemed the Democratic nomination was Hillary Clinton's for the taking. The air of certainty in January was convincing when Clinton declared from a sofa at her Washington home: "I'm in and I'm in to win." Two Democratic senators and two former governors swiftly pulled out rather than get between Clinton and White House. Then along came Barack Obama and the aura of inevitability that was crucial to Clinton's strategy vanished.

"The Clinton campaign was meant to be shock and awe: big events in big states, sweep the board on Super Tuesday, overwhelm the less well-known competitors," said Chip Smith, who was deputy campaign manager for Al Gore in 2000. "Unfortunately, Obama uprooted that strategy. Inevitability isn't a viable strategy against a well-funded candidate with a powerful message."

It is unclear whether there was anything Clinton could have done to stop a gifted politician such as Obama, once his early win in Iowa and prodigious fundraising ability established that he really did have a chance of winning the Democratic nomination.

Clinton also may have destroyed any chance of a comeback after being caught out in her fib about coming under sniper fire while in Bosnia in the 1990s. The lie crystallised voter unease with Clinton, and held back chances of a grand comeback in Pennsylvania. In April, a Washington Post/ABC News poll found that 61% of American voters considered her dishonest and untrustworthy.


Abstract Problem

[Figure: five documents, each shown with a vector of values (0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, 0.74)]

Goal: Scalable Pairwise Similarity

~10K docs → ~50 million doc pairs

~140K entities → ~10 billion entity pairs
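The pair counts on this slide are consistent with counting unordered pairs, n(n-1)/2. A quick arithmetic check in Python; the function name `unordered_pairs` is chosen here for illustration and does not come from the talk:

```python
def unordered_pairs(n: int) -> int:
    """Number of distinct unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


doc_pairs = unordered_pairs(10_000)      # ~10K documents
entity_pairs = unordered_pairs(140_000)  # ~140K entities

print(f"{doc_pairs:,} doc pairs")        # on the order of 50 million
print(f"{entity_pairs:,} entity pairs")  # on the order of 10 billion
```

The quadratic growth is what makes a naive all-pairs comparison impractical at the entity level.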


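Solutions

Trivial
  $$\mathrm{sim}(d_i, d_j) = \sum_{t \in V} w_{t,d_i} \cdot w_{t,d_j}$$
  Loads each vector $O(N)$ times
  Loads each term $t$ $O(df_t^2)$ times

Better
  Each term contributes only if it appears in $d_i \cap d_j$
  $$\mathrm{sim}(d_i, d_j) = \sum_{t \in d_i \cap d_j} w_{t,d_i} \cdot w_{t,d_j}$$
  $$\mathrm{sim}(d_i, d_j) = \sum_{t \in d_i \cap d_j} \mathrm{term\_contrib}(t, d_i, d_j)$$
  Loads each term (with posting list) once
  Each term contributes $O(df_t^2)$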
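A small sketch of why the "better" formulation gives the same score as the trivial one: a term absent from either document has weight zero there, so summing over shared terms equals summing over the whole vocabulary. The weights, document names, and function names below are made up for illustration.

```python
from typing import Dict, Set

Vector = Dict[str, float]  # term -> weight; absent terms have weight 0


def sim_full(di: Vector, dj: Vector, vocab: Set[str]) -> float:
    """Trivial form: sum over every term in the vocabulary V."""
    return sum(di.get(t, 0.0) * dj.get(t, 0.0) for t in vocab)


def term_contrib(t: str, di: Vector, dj: Vector) -> float:
    """Contribution of one shared term t to sim(di, dj)."""
    return di[t] * dj[t]


def sim_shared(di: Vector, dj: Vector) -> float:
    """Better form: sum only over terms that appear in both documents."""
    return sum(term_contrib(t, di, dj) for t in di.keys() & dj.keys())


# Hypothetical term weights for two documents.
doc_a = {"clinton": 2.0, "obama": 1.0}
doc_b = {"clinton": 1.0, "barack": 1.0, "obama": 1.0}
vocab = {"clinton", "obama", "barack", "cheney"}

print(sim_full(doc_a, doc_b, vocab), sim_shared(doc_a, doc_b))
```

Both calls return the same value; the second never touches terms that only one document contains.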


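Indexing (3-doc toy collection)

Documents (labeled d1 to d3 here, in order of appearance):
  d1: Clinton Obama Clinton
  d2: Clinton Cheney
  d3: Clinton Barack Obama

Standard IR Indexing

Posting lists (document, term frequency):
  Clinton: (d1, 2), (d2, 1), (d3, 1)
  Barack: (d3, 1)
  Cheney: (d2, 1)
  Obama: (d1, 1), (d3, 1)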
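The toy index can be reproduced in a few lines. This is a minimal single-machine sketch, not the indexing code used in the talk; the document ids `d1` to `d3` and the function name `build_index` are labels introduced here.

```python
from collections import Counter, defaultdict

# Toy collection from the slide; the ids d1..d3 are labels added here.
docs = {
    "d1": "Clinton Obama Clinton",
    "d2": "Clinton Cheney",
    "d3": "Clinton Barack Obama",
}


def build_index(docs):
    """Return term -> list of (doc_id, term_frequency) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Counter gives the term frequency of each distinct term in the doc.
        for term, tf in Counter(text.split()).items():
            index[term].append((doc_id, tf))
    return dict(index)


index = build_index(docs)
for term, postings in sorted(index.items()):
    print(term, postings)
```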


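Pairwise Similarity

(a) Generate pairs   (b) Group pairs   (c) Sum pairs

[Figure: the posting lists for Clinton, Barack, Cheney, and Obama feed the three stages; each posting list generates per-pair products, the products are grouped by document pair, and each group is summed into a similarity score]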
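The three stages can be traced on the toy collection. The sketch below assumes raw term frequency as the term weight and uses document ids `d1` to `d3` as labels; both are choices made here for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Postings for the toy collection: term -> [(doc_id, term_frequency)].
# Raw term frequency is used as the weight, an assumption for illustration.
# Each list is sorted by doc id, so every pair key comes out as (lower, higher).
index = {
    "Clinton": [("d1", 2), ("d2", 1), ("d3", 1)],
    "Barack": [("d3", 1)],
    "Cheney": [("d2", 1)],
    "Obama": [("d1", 1), ("d3", 1)],
}

# (a) Generate pairs: each posting list emits one product per document pair.
generated = []
for term, postings in index.items():
    for (di, wi), (dj, wj) in combinations(postings, 2):
        generated.append(((di, dj), wi * wj))

# (b) Group pairs: collect the products that share a document pair.
grouped = defaultdict(list)
for pair, product in generated:
    grouped[pair].append(product)

# (c) Sum pairs: each group's total is the similarity of that pair.
similarity = {pair: sum(products) for pair, products in grouped.items()}
print(similarity)
```

Terms that occur in a single document (Barack, Cheney) generate no pairs at all, which is where the savings over the trivial approach come from.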


Pairwise Similarity (abstract)

(a) Generate pairs   (b) Group pairs   (c) Sum pairs

[Figure: term postings → multiply → Grouping → sum → similarity]


MapReduce!

(a) Map   (b) Shuffle   (c) Reduce

[Figure: input → map → Shuffling (group values by keys) → reduce → output]
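To make the three phases concrete, here is a single-process simulation of the map, shuffle, and reduce flow. It is a teaching sketch, not the Hadoop API: `run_mapreduce`, `count_mapper`, and `count_reducer` are hypothetical names, and a term-count job stands in for a real workload.

```python
from collections import defaultdict


def run_mapreduce(inputs, mapper, reducer):
    """Simulate MapReduce in memory: map, shuffle (group by key), reduce."""
    # (a) Map: every input record yields zero or more (key, value) pairs.
    intermediate = []
    for record in inputs:
        intermediate.extend(mapper(record))

    # (b) Shuffle: group values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # (c) Reduce: collapse each key's list of values into one output.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_mapper(line):
    """Emit (term, 1) for each token in a line of text."""
    return [(term, 1) for term in line.split()]


def count_reducer(term, ones):
    """Sum the 1s emitted for a term."""
    return sum(ones)


lines = ["Clinton Obama Clinton", "Clinton Cheney", "Clinton Barack Obama"]
counts = run_mapreduce(lines, count_mapper, count_reducer)
print(counts)
```

In a real cluster the mappers and reducers run on different machines and the framework performs the shuffle; only the two user functions change from job to job.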


And indexing .. of course!

(a) Map   (b) Shuffle   (c) Reduce

[Figure: doc → tokenize → Shuffling (group values by keys) → combine → Posting list]


Terms: Zipfian Distribution

[Plot: doc freq (df) vs. term rank]

each term t contributes $O(df_t^2)$ partial results

very few terms dominate the computations
  most frequent term (“said”): 3%
  most frequent 10 terms: 15%
  most frequent 100 terms: 57%
  most frequent 1000 terms: 95%

~0.1% of total terms (99.9% df-cut)
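The effect of a df-cut can be sketched on synthetic data. The document frequencies below are generated from a hypothetical Zipf-like curve, not taken from the collection in the talk, and `pairs_generated` and `apply_df_cut` are names introduced here. A term with document frequency df generates df(df-1)/2 intermediate pairs, so dropping the few most frequent terms removes a disproportionate share of the work.

```python
def pairs_generated(dfs):
    """Total intermediate pairs: each term with frequency df emits df*(df-1)/2."""
    return sum(df * (df - 1) // 2 for df in dfs)


def apply_df_cut(dfs, keep_fraction):
    """Keep the keep_fraction of terms with the lowest df; drop the rest."""
    ranked = sorted(dfs)
    n_keep = round(len(ranked) * keep_fraction)
    return ranked[:n_keep]


# Hypothetical Zipf-like document frequencies for 1,000 terms.
dfs = [max(1, 10_000 // rank) for rank in range(1, 1001)]

before = pairs_generated(dfs)
kept = apply_df_cut(dfs, 0.99)  # a 99% df-cut drops the 10 most frequent terms
after = pairs_generated(kept)

print(f"pairs without df-cut: {before:,}")
print(f"pairs with 99% df-cut: {after:,}")
```

The exact savings depend on the real df distribution; the synthetic curve only shows the direction of the effect.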


Efficiency (disk space)

[Plot: Intermediate Pairs (billions), 0 to 9,000, vs. Corpus Size (%), 0 to 100]

8 trillion intermediate pairs

Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk

Aquaint-2 Collection, ~ million doc


Efficiency (disk space)

[Plot: Intermediate Pairs (billions), 0 to 9,000, vs. Corpus Size (%), 0 to 100; series: df-cut at 99%, df-cut at 99.9%, df-cut at 99.99%, df-cut at 99.999%, no df-cut]

8 trillion intermediate pairs

0.5 trillion intermediate pairs

Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk

Aquaint-2 Collection, ~ million doc


Effectiveness

[Plot: Effect of df-cut on effectiveness; Relative P5 (%), 50 to 100, vs. df-cut (%), 99.00 to 100.00; Medline04, 909k abstracts, ad-hoc retrieval]

Drop 0.1% of terms
  “Near-Linear” Growth
  Fit on disk
  Cost 2% in Effectiveness

Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk

For more details, check “Pairwise Document Similarity in Large Collections with MapReduce” at ACL 2008 (presented next week!)


In ACE!

~10K docs
  each document is a vector
~140K entities
  each has multiple mentions
  each entity context is a vector

Generated 8 feature matrices (6 English + 2 Arabic)

[Diagram: pipeline stages]
  English Pipeline: Within-Doc Coref., Pairs Filtering, Feature Generation, Clustering
  Arabic Pipeline: Within-Doc Coref., Feature Generation, Clustering


Roadmap

1. Context Features
  Pairwise similarity
  Efficiency vs. effectiveness
  Generating features for ACE

2. Conversational-genre Features
  New generative model
  Joint Resolution
  Evaluation using ACE-Usenet


Identity Resolution in Email

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <[email protected]>
To: Mary Adams <[email protected]>
Subject: Re: tennis tomorrow!

Did Sue want Scott to join? Looks like the game will be too late for him.

Sue → Identity Resolution → Who? (i.e., label with email address)


New Generative Model

1. Choose “person” c to mention
  p(c)

2. Choose appropriate “context” X to mention c
  p(X | c)

3. Choose a “mention” l
  p(l | X, c)

[Figure: the mention “sue”, with the context “playing tennis”]
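Read as a probabilistic factorization, the three steps define a joint distribution, and resolving an observed mention amounts to applying Bayes' rule over candidate persons. The derivation below is generic and follows only from the three factors listed on the slide; the candidate set $C$ is notation introduced here, and the scoring actually used in the system may differ.

```latex
% Joint probability of choosing person c, context X, and mention l
p(c, X, l) = p(c)\, p(X \mid c)\, p(l \mid X, c)

% Posterior over candidate persons given the observed context and mention
p(c \mid X, l) =
  \frac{p(c)\, p(X \mid c)\, p(l \mid X, c)}
       {\sum_{c' \in C} p(c')\, p(X \mid c')\, p(l \mid X, c')}
```

The denominator is the same for every candidate, so ranking candidates only requires the numerator.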


Context

Social Context

Local Context

Conversational Context

Topical Context


Single-Mention: 2-Step Solution

[Figure labels: (1) Identity Modeling, Prior Distribution; (2) Mention Resolution, Evidence, Posterior Distribution]


Improved Results

[Chart: Effectiveness Comparison on Enron Collection; MRR and P@1 on a 0 to 1 scale; Heuristic vs. Generative; annotated gains of +8.9% and +8.6%]

For more details, check “Resolving Personal Names in Email using Context Expansion” at ACL 2008 (also presented next week!)


Limitation!

[Figure: the mentions “Susan Scott”, “Sue”, “Suebob”, [email protected], “Susan”, “Susan Jones”, and “Sue”, connected by social, conversational, and topical context links]

Joint Resolution!

Context-Free Resolution


Joint Resolution

[Figure: Mention Graph]
  Spread Current Resolution
  Combine Context Info
  Update Resolution


Joint Resolution

[Figure: Mention Graph processed as map, shuffle, reduce]

MapReduce!

Work in Progress!


Roadmap

Context Features
  Pairwise similarity
  Efficiency vs. effectiveness
  Generating features for ACE

Conversational-genre Features
  New generative model
  Joint Resolution
  Evaluation using ACE-Usenet


Email Message

From: Machiavegli <[email protected]>
To: Mark <mk@hotmail>
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election

In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote.
WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won.
This would have delayed secession and the Civil War.

receiver is email address


Usenet Message

From: Machiavegli <[email protected]>
Newsgroup: soc.history.what-if
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election

In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote.
WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won.
This would have delayed secession and the Civil War.

newsgroup!


ACE Usenet Document

<DOCID> soc.history.what-if_20350205910 </DOCID>
<POSTER> Machiavegli </POSTER>
<POSTDATE> 29 Jan 2005 22:04:38 GMT </POSTDATE>
<SUBJECT> The 1860 Presidential Election </SUBJECT>

In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote.
WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won.
This would have delayed secession and the Civil War.

no email addresses in headers!


Reconstruct from automatically

From: Machiavegli <[email protected]>
Newsgroup: soc.history.what-if
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election

In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote.
WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won.
This would have delayed secession and the Civil War.

Got the address back!


Handling it as @

From: Machiavegli <[email protected]>
To: [email protected]@usenet.com
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election

In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote.
WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won.
This would have delayed secession and the Civil War.

handle group as receiver


Feature Value: same label

[email protected] [email protected]

“Steph”

“Stephan”

“Stephan”

“S. Smith”

+1.0

Need for feature matrix (pairwise score)


Feature Value: different labels

[email protected] [email protected]

“Steph”

“Stephan”

“Stephan”

“S. Smith”

-1.0

Need for feature matrix (pairwise score)
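The two feature values can be turned into a pairwise matrix once every mention has a resolved label. The sketch below assumes each mention already carries one resolved email address; the addresses, the assignment of mentions to addresses, and the function name `label_feature_matrix` are hypothetical and serve only to illustrate the +1.0 / -1.0 scoring.

```python
from typing import List, Tuple


def label_feature_matrix(resolved: List[Tuple[str, str]]) -> List[List[float]]:
    """Pairwise feature: +1.0 if two mentions share a label, -1.0 otherwise."""
    n = len(resolved)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            same = resolved[i][1] == resolved[j][1]
            matrix[i][j] = 1.0 if same else -1.0
    return matrix


# (mention text, resolved label); labels are made-up placeholder addresses.
resolved = [
    ("Steph", "person.a@example.com"),
    ("Stephan", "person.a@example.com"),
    ("Stephan", "person.b@example.com"),
    ("S. Smith", "person.b@example.com"),
]

matrix = label_feature_matrix(resolved)
for (mention, _), row in zip(resolved, matrix):
    print(f"{mention:10s}", row)
```

Two mentions with the same surface form ("Stephan") can still score -1.0 when they resolve to different labels, which is the information this feature adds over string matching.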


Conclusion

MapReduce can be applied to many HLT applications
  easy, cheap, and fast for distributed processing
  e.g., scalable pairwise similarity for coreference resolution
  calls for new ways of thinking

Identity resolution in email
  new generative model yields improved accuracy
  scalable joint resolution needed
  Usenet-ACE is new test collection


Thank You!


MapReduce and Text Analysis

  Computing pairwise similarity in large collections
  Joint resolution of mentions in email collections
  Search engines (of course!)
  Building language models
  Clustering applications
  Machine translation
  …