Entities, Topics and Events in Community Memories. Elena Demidova, Nicola Barbieri, Stefan Dietze, Adam Funk, Gerhard Gossen, Diana Maynard, Nikos Papailiou, Vassilis Plachouras, Wim Peters, Thomas Risse, Yannis Stavrakas, and Nina Tahmasebi. 1st International Workshop on Archiving Community Memories, 6 September 2013, Lisbon, Portugal.


Entities, Topics and Events in Community Memories

Elena Demidova, Nicola Barbieri, Stefan Dietze, Adam Funk, Gerhard Gossen, Diana Maynard, Nikos

Papailiou, Vassilis Plachouras, Wim Peters, Thomas Risse, Yannis Stavrakas, and Nina Tahmasebi

1st International Workshop on Archiving Community Memories, 6 September 2013, Lisbon, Portugal


Architecture Overview

Offline processing: ETOEs extraction, semantic enrichment & consolidation

Cross-crawl analysis: dynamics detection


TEXT ANALYSIS & CONSOLIDATION


Entity & Event Extraction from Text

Development of applications that identify document sections by language, automatically select appropriate resources to process multilingual text (within as well as across documents), and handle different domains appropriately within a single pipeline

GATE applications are wrapped in the off-line module

Entity types: Person, Location, Organisation, …

Cross-document co-reference within GATE

Improved linguistic pre-processing for degraded text in tweets (joint development with TrendMiner project)

Improvements to event recognition, including use of low-scoring terms as event indicators

Adaptation to German


Entity Enrichment and Correlation

Enrichment and correlation using DBpedia & Freebase

<Enrichment>http://dbpedia.org/resource/Jean-Claude_Trichet</Enrichment>

<Enrichment>http://dbpedia.org/resource/ECB</Enrichment>

<Event>Trichet warns of systemic debt crisis</Event>

<Person>Jean Claude Trichet</Person> <Organisation>ECB</Organisation>

DBpedia Spotlight: keyword search using entity labels with confidence 0.6

Freebase: structured queries using ARCOMEM entity types

FC data: 5,800 enriched entities (DBpedia: 492; Freebase: 5,309)

RAR data: 19,429 enriched entities (DBpedia: 6,021; Freebase: 13,408)

Avg. precision 0.89 (between 0.8 and 1, depending on the entity type and source)

[SDA 12]


Freebase Dataset

• Data: 22 million entities, 350 million facts

• Schema: 7,500 entity types in about 100 domains

• (June 2011)

• Wikipedia, MusicBrainz, …


ARCOMEM Entities and Enrichments - Graph

Nodes: entities/events (blue), DBpedia enrichments (green), Freebase enrichments (orange)

1,013 clusters of correlated entities/events in FC

=> cluster expansion using related enrichments


Enrichment and Correlation: Clustering

Direct correlations (entities sharing the same enrichments):

E.g. {Mexico, Mexiko, MEXIKO}, {Greece, Griechenland}

Number of clusters with at least 2 correlated entities: FC: 1,013; RAR: 1,381

Exploit graph analysis methods to detect closeness of the enrichments

Linking: e.g. related events with organisations and persons

The Enrichment & Clustering component has been integrated into the offline processing and released.

SARA integration: Enrichments: direct links to LOD entities; Clusters: finding similar (or related) entities

Outlook: integration of indirect relationships, studying data quality aspects in LOD

[WOLE 12]
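The direct-correlation step described on this slide (entities that share the same enrichment end up in one cluster) can be illustrated with a small sketch. This is not the ARCOMEM component itself: the function name `cluster_by_shared_enrichment`, the input layout, and the example URIs are assumptions made for illustration only.

```python
from collections import defaultdict


def cluster_by_shared_enrichment(enrichments):
    """Group entity labels that share at least one enrichment URI.

    enrichments: dict mapping an entity label to a set of enrichment URIs
    (hypothetical layout). Returns a list of sets, keeping only clusters
    with at least 2 members.
    """
    parent = {entity: entity for entity in enrichments}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Invert the mapping: enrichment URI -> entities that carry it.
    by_uri = defaultdict(list)
    for entity, uris in enrichments.items():
        for uri in uris:
            by_uri[uri].append(entity)

    # Entities that share a URI belong to the same cluster.
    for entities in by_uri.values():
        for other in entities[1:]:
            union(entities[0], other)

    clusters = defaultdict(set)
    for entity in enrichments:
        clusters[find(entity)].add(entity)
    return [members for members in clusters.values() if len(members) >= 2]


# Illustrative data only; the URIs are assumed, not taken from the crawl.
example = {
    "Mexico": {"http://dbpedia.org/resource/Mexico"},
    "Mexiko": {"http://dbpedia.org/resource/Mexico"},
    "MEXIKO": {"http://dbpedia.org/resource/Mexico"},
    "Greece": {"http://dbpedia.org/resource/Greece"},
    "Griechenland": {"http://dbpedia.org/resource/Greece"},
    "ECB": {"http://dbpedia.org/resource/ECB"},
}
clusters = cluster_by_shared_enrichment(example)
```

Union-find is used here so that an entity with several enrichments can bridge two groups; the graph-based closeness analysis mentioned on the slide is not covered by this sketch.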


TOPIC DETECTION


Topic Modeling on Rock am Ring

Probabilistic topic models provide a suite of techniques to uncover the hidden semantic themes of a large collection of data

Documents may exhibit multiple topics

Each topic is described by a probability distribution over the dictionary

Associate each topic with a list of representative documents and write them into the ARCOMEM KB

Rock am Ring data: 32,864 documents, multilingual (English, German, etc.)

The Topic Detection module is based on Mahout Collapsed Variational Bayes, which scales to very large datasets

Task 1: Topic Detection

Task 2: Assign Documents to Topics

Example topics (term and probability):

Album 0.021, Metal 0.015, Songs 0.014, Band 0.013, Dj 0.007, Lyrics 0.004

Rock 0.055, Am 0.050, Ring 0.042, Festival 0.009, Tickets 0.003

Fashion 0.003, Collection 0.003, Food 0.003, Style 0.003, Color 0.002

Page 0.007, Site 0.005, Web 0.005, Click 0.004, Link 0.004
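As a rough illustration of Task 2, the sketch below picks representative documents per topic from per-document topic distributions. It assumes such distributions have already been produced by a topic model; the function name `representative_documents`, the `top_k` parameter, and the toy numbers are hypothetical and not part of the ARCOMEM module.

```python
def representative_documents(doc_topics, top_k=2):
    """Rank documents per topic by p(topic | document).

    doc_topics: dict mapping a document id to a list of topic
    probabilities (one entry per topic; assumed input format).
    Returns a dict: topic index -> list of (doc_id, probability),
    highest probability first.
    """
    num_topics = len(next(iter(doc_topics.values())))
    result = {}
    for topic in range(num_topics):
        ranked = sorted(
            doc_topics.items(),
            key=lambda item: item[1][topic],
            reverse=True,
        )
        result[topic] = [(doc, probs[topic]) for doc, probs in ranked[:top_k]]
    return result


# Toy distributions, invented for illustration.
doc_topics = {
    "doc-a": [0.80, 0.15, 0.05],
    "doc-b": [0.10, 0.70, 0.20],
    "doc-c": [0.45, 0.45, 0.10],
    "doc-d": [0.05, 0.05, 0.90],
}
reps = representative_documents(doc_topics)
```

Because a document may exhibit multiple topics, the same document (here `doc-c`) can appear in the list of more than one topic.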


Temporal Evolution in Topic Modeling

Topics may evolve and emerge over time [Mantrach 13]

Several challenges:

Tracking the evolution of topics

Early detection of emerging topics

Prediction of trendy topics


Trendy Topic Detection

Pipeline: HBase -> POS / Named Entity Rec. / Tokens -> Trendy Tf-Idf -> Ranked List

Understanding what the trend was at a specific time in the past

Detect events/entities/words that are popular in a time frame

Compute trendiness: the term frequency in a period is penalized with the average term frequency over the other time periods

Tokens that are popular in all time periods are down-weighted
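The slide does not give the exact trendiness formula, so the following is only one plausible reading: the term frequency in the target period is divided by one plus the average term frequency over the other periods. The function name `trendiness`, the smoothing constant, and the sample counts are assumptions.

```python
from collections import Counter


def trendiness(counts_by_period, period):
    """Rank tokens of `period` by a penalized term frequency.

    counts_by_period: dict mapping a period label to a Counter of token
    frequencies (assumed input format).
    Score (an assumption, not the slide's formula):
        tf(token, period) / (1 + average tf(token) over the other periods)
    """
    target = counts_by_period[period]
    others = [c for p, c in counts_by_period.items() if p != period]
    scores = {}
    for token, tf in target.items():
        # Counter returns 0 for tokens missing in a period.
        avg_other = sum(c[token] for c in others) / len(others) if others else 0.0
        scores[token] = tf / (1.0 + avg_other)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented counts for three periods.
counts = {
    "2012-05": Counter({"festival": 10, "the": 50, "tickets": 2}),
    "2012-06": Counter({"festival": 40, "the": 52, "lineup": 30}),
    "2012-07": Counter({"festival": 12, "the": 48}),
}
ranked = trendiness(counts, "2012-06")
```

With these counts, "the" is frequent in every period and therefore ends up at the bottom of the ranking, while "lineup", which appears only in the target period, ranks first.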


DYNAMICS DETECTION


Twitter Dynamics

Motivation: being able to pose questions like:

“What are the hashtags associated with #obama at time t?”

“Find tweets that mention #cnn during the periods that #obama is associated with #romney”

“How have the hashtags associated with #obamawins evolved over time?”

“Find tweets that mention #romney during the peak periods of #obama”

Designed a model that takes the temporal aspect into account when associating hashtags in tweets (e.g. based on co-occurrence)

Implemented query operators for retrieving the tweets that satisfy complex conditions: filter, fold, jump, merge, join

Implemented a prototype system

Experiments with 25,000 tweets about the US elections

[WOSS 12]
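A minimal sketch of time-aware hashtag association based on co-occurrence is given below. It does not implement the query operators from [WOSS 12] (filter, fold, jump, merge, join); the function names, the per-day bucketing, and the sample tweets are all assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_by_period(tweets):
    """Build period -> hashtag -> Counter of co-occurring hashtags.

    tweets: iterable of (period, list_of_hashtags) pairs (assumed format).
    """
    assoc = defaultdict(lambda: defaultdict(Counter))
    for period, hashtags in tweets:
        for a, b in combinations(sorted(set(hashtags)), 2):
            assoc[period][a][b] += 1
            assoc[period][b][a] += 1
    return assoc


def associated(assoc, hashtag, period, min_count=1):
    """Hashtags co-occurring with `hashtag` in `period` at least min_count times."""
    counts = assoc.get(period, {}).get(hashtag, Counter())
    return {tag for tag, n in counts.items() if n >= min_count}


def periods_where_associated(assoc, a, b, min_count=1):
    """Periods in which hashtag `b` is associated with hashtag `a`."""
    return sorted(p for p in assoc if b in associated(assoc, a, p, min_count))


# Invented sample data.
tweets = [
    ("2012-11-05", ["#obama", "#romney", "#election"]),
    ("2012-11-05", ["#obama", "#cnn"]),
    ("2012-11-06", ["#obama", "#obamawins"]),
    ("2012-11-06", ["#romney", "#cnn"]),
]
assoc = cooccurrence_by_period(tweets)
at_t = associated(assoc, "#obama", "2012-11-05")
linked_periods = periods_where_associated(assoc, "#obama", "#romney")
```

`associated` corresponds to the first motivating question ("hashtags associated with #obama at time t"), and `periods_where_associated` yields the periods needed for the second one; retrieving the matching tweets would be a further filtering step over those periods.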


Named Entity Evolution

Named Entities (NE): people, places, companies...

Characteristics of Named Entity Evolution (NEE)

Same thing but different terms over time

Change occurs over short periods of time

Small or no concept shift

Announced to the public repeatedly

Goal: Find a method for named entity evolution recognition that is independent of external knowledge sources

Example (figure with a marked change period): Joseph Ratzinger, Joseph Aloisius Ratzinger, Cardinal Ratzinger, Cardinal Joseph Ratzinger; Pope Benedict, Pope Benedict XVI, Benedict XVI

[TPDL 12]


Named Entity Evolution Recognizer (NEER)

Processing Chain [NEER Coling 12]: Identifying Change Periods (Burst Detection) -> Extract Text -> NLP Processing -> Context Creation -> Finding Temporal Co-references -> Filtering Co-References

In his latest address to American bishops visiting Rome, Pope Benedict XVI stressed that Catholic educators should remain true to the faith -- a reminder issued just in time for another tense season of commencement addresses. No, the pope did not mention Georgetown University by name when discussing the Catholic campus culture wars.

Benedict XVI -> Joseph Ratzinger -> Cardinal Ratzinger

1. Pope Benedict XVI

2. Pope Benedict

3. Benedict XVI

4. Cardinal Ratzinger

5. Pope

6. Benedict

Barack Obama: Senator, State Senator Barack Obama, Senator-elect Barack Obama, Senator Barack Obama, Illinois Democrat

Vladimir Putin: President-elect Vladimir V Putin, Minister Vladimir Putin, Acting President Vladimir V Putin, President Vladimir V Putin

Evaluation Results

Burst detection found a total of 73% of all change periods

High recall for an unsupervised method

Machine learning boosts precision

Data set: http://www.l3s.de/neer-dataset/
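The change-period identification step relies on burst detection. The slide does not say which burst detection algorithm is used, so the sketch below swaps in a deliberately simple threshold rule (frequency above mean plus k standard deviations) purely to illustrate the idea. The function name `detect_bursts`, the parameter `k`, and the monthly counts are assumptions.

```python
from statistics import mean, pstdev


def detect_bursts(freqs, k=2.0):
    """Return indices of periods whose frequency exceeds mean + k * std.

    freqs: list of mention counts per period for one entity.
    This is a simple stand-in for a burst detector, not the method
    used by NEER.
    """
    threshold = mean(freqs) + k * pstdev(freqs)
    return [i for i, f in enumerate(freqs) if f > threshold]


# Invented monthly mention counts with one obvious spike at index 5.
monthly_mentions = [3, 4, 2, 5, 3, 40, 6, 4, 3, 2, 4, 3]
bursts = detect_bursts(monthly_mentions)
```

Each detected burst would then serve as a candidate change period from which text is extracted for the later steps of the chain.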


FOKAS – Formerly Known As Search Engine [FOKAS Coling 12]

http://www.l3s.de/fokas/


References

[SDA 12] Dietze, S., Maynard, D., Demidova, E., Risse, T., Peters, W., Doka, K., Stavrakas, Y., Entity Extraction and Consolidation for Social Web Content Preservation, 2nd SDA Workshop, Pafos, 2012.

[WOLE 12] Nunes, B. P., Kawase, R., Dietze, S., Taibi, D., Casanova, M.A., Nejdl, W., Can entities be friends?, Proc. of WOLE2012 Workshop at the ISWC2012, Boston, US (2012).

[KECSM 12] Maynard, D., Dietze, S., Hare, J., Peters, W., (Eds.), Proc. of the 1st KECSM Workshop at the ISWC2012, CEUR Workshop Proceedings Vol. 895, 2012.

[TPDL 12] Risse, T., Dietze, S., Peters, W., Doka, K., Stavrakas, Y., Senellart, P., Exploiting the Social and Semantic Web for guided Web Archiving, TPDL2012, Pafos, Cyprus, September 2012.

[ICDM 12] Nicola Barbieri, Francesco Bonchi and Giuseppe Manco. Topic-aware Social Influence Propagation Models. Proc. of the ICDM 2012, Brussels, Belgium, December 2012.

[WSDM 13] Nicola Barbieri, Francesco Bonchi and Giuseppe Manco. Cascade-Based Community Detection. Proc. of the WSDM 2013, Rome, Italy, February 2013.

[NEER Coling 12] Nina Tahmasebi, Gerhard Gossen, Nattiya Kanhabua, Helge Holzmann, Thomas Risse. NEER: An Unsupervised Method for Named Entity Evolution Recognition. Coling 2012, Mumbai.

[FOKAS Coling 12] Helge Holzmann, Gerhard Gossen, Nina Tahmasebi. fokas: Formerly Known As -- A Search Engine Incorporating Named Entity Evolution. Proc. of the Coling 2012, Mumbai, India.

[WOSS 12] Vassilis Plachouras and Yannis Stavrakas. Querying Term Associations and their Temporal Evolution in Social Data. Int. VLDB Workshop on Online Social Systems (WOSS 2012).

[ICMR 12] Hare, Jonathon, Samangooei, Sina, Dupplaw, David and Lewis, Paul H. ImageTerrier: an extensible platform for scalable high-performance image retrieval. ACM ICMR'12, Hong Kong, HK.

[MTA12] Hare, Jonathon S., Samangooei, Sina and Lewis, Paul H. (2012) Practical scalable image analysis and indexing using Hadoop. Multimedia Tools and Applications, 1-34.

[Mantrach 13] Amin Mantrach. A Joint Past and Present NMF for Topic Detection and Transitions in Social Media; Subm. 13


Thank You!

Dr. Elena Demidova

[email protected] Research Center

Appelstrasse 9a

30167 Hannover