Knowledge Base Acceleration TREC 2012 November 17, 2011

John R. Frank, Ian Soboroff

TRANSCRIPT

Page 1: Knowledge Base Acceleration TREC 2012

Knowledge Base Acceleration TREC 2012

November 17, 2011

Page 2: Knowledge Base Acceleration TREC 2012

Number of People Creating Representations of Knowledge

[Figure: timeline with axis marks at 300 BC, 140 AD, 1828, 1900, 1950, 1970, 1984, 1994, 2001, and now. Labels on the figure: Library at Alexandria, Gutenberg Bible, Telegraph, Transistor, Machine Learning, Expert Systems, WWW, maps.]

Page 3: Knowledge Base Acceleration TREC 2012

Accelerate?

rate of assimilation << rate of new info

# editors << # active entities

(definition of a “large” KB)

Page 4: Knowledge Base Acceleration TREC 2012

Random Choices

[Plot axes: num pages, days.]

How many days must a news article wait before eventually being cited in Wikipedia?

Time lag in days between publication and eventual citation in Wikipedia of a sample of 50,000 web pages (mostly news) cited in Category:LivingPeople.
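The lag on this slide is the difference between two dates. A minimal sketch of that arithmetic follows. The function name and the sample dates are hypothetical illustrations and are not taken from the 50,000-page sample.

```python
from datetime import date
from statistics import median


def citation_lag_days(published: date, cited: date) -> int:
    """Days between a page's publication and its eventual citation."""
    return (cited - published).days


# Hypothetical (publication date, citation date) pairs, not real data.
sample = [
    (date(2011, 1, 3), date(2011, 1, 4)),
    (date(2010, 6, 15), date(2011, 6, 15)),
    (date(2011, 3, 1), date(2011, 3, 31)),
]

lags = [citation_lag_days(published, cited) for published, cited in sample]
median_lag = median(lags)
```

A histogram of `lags` over the full sample would give a plot with the axes named on this slide.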

Page 5: Knowledge Base Acceleration TREC 2012

Even for entities mentioned frequently in the news, there is no correlation between mention and edit frequencies.

Human analysts follow their personal interests, hunches, hobbies, habits.

True for all large knowledge bases.

[Plot axes: mean edit interval (days), mean mention interval (hours).]
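One way to check a "no correlation" claim like this is a Pearson coefficient over per-entity values. The sketch below assumes that approach; the slides do not say which statistic was used, and the numbers are hypothetical placeholders rather than measurements.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-entity values, one entry per KB entity.
mention_interval_hours = [2.0, 5.0, 12.0, 30.0, 48.0]
edit_interval_days = [40.0, 3.0, 90.0, 10.0, 35.0]

r = pearson(mention_interval_hours, edit_interval_days)
```

A value of `r` near zero would be consistent with the slide's statement; values near 1 or -1 would contradict it.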

Page 6: Knowledge Base Acceleration TREC 2012

Methods in the Madness

[Plot axes: mean edit interval (days), mean mention interval (hours).]

Page 7: Knowledge Base Acceleration TREC 2012

First Year Task: Basic CCR
“Cumulative Citation Recommendation”

Steps:
1) Initialize with a single KB node:
   1) Freebase & Wikipedia content
   2) WP edits from Aug-Nov 2011
2) Begin iterating over news stream
3) For each article, output a “pertinence” confidence score between 0 and 1.
   1) Aug-Sep: train on labels
   2) Oct-Mar: labels hidden
4) Your system generates labels and excerpts Oct-Mar

Content Stream
• ~500,000 English de-duplicated articles per day
• Half news, half blogs & forums

Your System
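The steps on this slide can be sketched as a loop over the stream. This is a toy illustration under stated assumptions: `KBNode`, `Article`, `pertinence`, and `run_ccr` are hypothetical names, and the term-overlap score is a stand-in, not the task's baseline or any participant's method.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, Set, Tuple


@dataclass
class KBNode:
    """A single target node (hypothetical structure)."""
    name: str
    terms: Set[str] = field(default_factory=set)


@dataclass
class Article:
    """One item from the content stream (hypothetical structure)."""
    doc_id: str
    text: str


def pertinence(node: KBNode, article: Article) -> float:
    """Toy score in [0, 1]: fraction of the node's terms found in the text."""
    if not node.terms:
        return 0.0
    words = set(article.text.lower().split())
    return len(node.terms & words) / len(node.terms)


def run_ccr(node: KBNode, stream: Iterable[Article]) -> Iterator[Tuple[str, float]]:
    """Iterate over the stream and emit one (doc_id, score) pair per article."""
    for article in stream:
        yield article.doc_id, pertinence(node, article)


# Hypothetical node and two made-up articles.
node = KBNode("Gavin Rain", {"gavin", "rain", "painter"})
stream = [
    Article("doc-1", "South African painter Gavin Rain will exhibit in Venice"),
    Article("doc-2", "Heavy rain expected this weekend"),
]
results = list(run_ccr(node, stream))
```

A real system would replace `pertinence` with a model trained on the Aug-Sep labels and would also produce labels and excerpts.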

Page 8: Knowledge Base Acceleration TREC 2012

Challenging Example

[Diagram: three nodes joined by links numbered 1, 2, 3.]

Nodes:
• Gavin Rain (South African, Painter)
• Venice Biennale (art show)
• Controversy about South African Pavilion at Venice Biennale

Link labels:
• will have an exhibit in
• explicit mentions in news
• explicit mentions
• No explicit co-occurrence …inference?

Page 9: Knowledge Base Acceleration TREC 2012

Annotations (guidelines under development)

Page 10: Knowledge Base Acceleration TREC 2012

Future Tasks

Detect changes to infobox slot values

Detect new links between entities

Resurrection of old articles (archive mining)

Identify emerging entities (not yet in KB)

… many more ideas …

Page 11: Knowledge Base Acceleration TREC 2012

Timeline

[Timeline axis: Nov, Dec, Jan, Feb, Mar, Apr, Jun, Jul, Aug, Sep, Oct, Nov, with markers "now" and "TREC 2012".]

December 2011
Call for Participation
Test data
• Three nodes
• Four months

January 2012
Tentative:
• Eval results for baseline system
• Data for more nodes

March/April 2012
Full data:
• ~50 nodes
• Eight months

Summer meet up
• At a convenient conference?

Monthly Skype Calls and discussion in Google Groups

Submit your runs for eval

Page 12: Knowledge Base Acceleration TREC 2012

Optional Output Values (not judged in 2012)

• Novelty Group ID:
  – Output is a list of docIDs:
    • An empty list means this doc has new information
    • One or more previous docIDs means that all of this document’s pertinent info was already revealed in earlier docs
  – Would help us plan future tasks about novelty

• Links to other nodes:
  – Output a list of other KB nodes that this content item associates to the target node
  – Would help us plan future tasks about link detection

• Infobox slot name=value
  – Output a list of two-tuples of strings
    • [(slot name, slot value),…]
  – Would help us plan future tasks about detecting infobox changes
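The three optional values could be carried in one record per scored document. The sketch below is only one possible shape: the class name, field names, and example values are hypothetical, and the slides do not define a submission format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class OptionalOutputs:
    """Optional per-document outputs (hypothetical field names)."""
    # Earlier docIDs that already revealed this doc's pertinent info.
    novelty_group: List[str] = field(default_factory=list)
    # Other KB nodes this content item associates to the target node.
    linked_nodes: List[str] = field(default_factory=list)
    # (slot name, slot value) pairs, both plain strings.
    infobox_slots: List[Tuple[str, str]] = field(default_factory=list)

    def has_new_information(self) -> bool:
        """An empty novelty list means the document has new information."""
        return not self.novelty_group


# Hypothetical examples: one novel document, one redundant document.
novel = OptionalOutputs(
    linked_nodes=["Venice Biennale"],
    infobox_slots=[("occupation", "Painter")],
)
redundant = OptionalOutputs(novelty_group=["doc-1"])
```

Keeping the novelty list empty by default matches the slide's rule that an empty list signals new information.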
