trec2009blog overview v9


Overview of the TREC 2009 Blog Track
Iadh Ounis, Craig Macdonald, Ian [email protected]


Outline
- Blog Track: Background
- TREC Blog Track 2009 Overview
- Blogs08 collection
- Faceted blog distillation task
- Top stories identification task
- Conclusions

Blog Track @ TREC
- Introduced in TREC 2006
- Explores information-seeking behaviour in the blogosphere
- The Blog track adopted an incremental approach
  - From core and simple retrieval tasks to more complex search scenarios
- Thus far, two main search tasks have been addressed:
  - Opinion-finding task [2006-2008]
    - Find me posts about what people think of X
  - Blog distillation task [2007-2008]
    - Find me blogs with a principal, recurring interest in X

Blog Track 2009
- In 2009, the Blog track has been markedly revamped
- Addresses more refined and complex search scenarios using a larger sample of the blogosphere
- An up-to-date sample of the blogosphere: Blogs08
  - One order of magnitude larger than the older Blogs06 (28M posts, 1.3M feeds in Blogs08)
  - A much longer timespan: 13 months, from Jan 08 to Feb 09
- Two new search tasks:
  - Faceted blog distillation
    - Addresses the quality aspect of the retrieved blogs
  - Top stories identification task
    - Addresses the news-related dimension of the blogosphere

The New Blogs08 Collection
- Crawled from the blogosphere over a 13-month period, from 14th Jan 08 to 10th Feb 09
  - Includes spam, non-English documents, and non-blogs
- Facilitates addressing the temporal/chronological aspect of the blogosphere
  - e.g. news and filtering tasks
- Follows a similar structure to the older Blogs06 collection:
  - 808GB feeds (>1.3M blogs)
  - 1445GB permalinks (28M documents)
    - A single post and its comments
  - 56GB homepages
- Created by the Univ. of Glasgow and distributed since April 2009

Blog Distillation Task
- Blog search users often wish to identify blogs about a given topic
  - Blogs which they can subscribe to and read on a regular basis
- Filtering: subscribe to a repeated search in their RSS reader
- Distillation: add blog feeds with a recurring central interest to their RSS reader
- Blog distillation task [2007-2008]
  - Find me a blog with a principal, recurring interest in X
- The TREC 2007 and 2008 incarnations focused on topical relevance
  - The task did not address the quality aspect of the retrieved blogs

First, let us look at the motivation of the task.

In its previous incarnation, the task was addressed as an adhoc topical-relevance task. Users might not wish to subscribe to all retrieved blogs, but only to those that meet some constraints/criteria.

Faceted Blog Search
- New task mimics an exploratory search task
  - Find me a quality blog to follow/read about X
- Quality aspect is addressed through the use of facets in the search interface (Hearst et al., SSM 2008)
- Faceted search allows users to explore the attributes of those blogs they might wish to follow and read:
  - In-depth/shallow analysis
  - Humorous/serious style
  - Expert/novice viewpoint
  - etc.

The idea of the task is that groups identify features for ranking blogs with respect to a facet inclination.

Task Definition
- For operationalising at TREC:
  - Each topic has a facet of interest attached to it
  - Blogs do not have facet attributes
- For TREC 2009, we used an initial set of 3 facets of varying difficulty:
  - Opinionated: opinionated vs. factual blogs
  - Personal: personal vs. official blogs
  - Indepth: in-depth vs. shallow blogs
- The use of the Opinionated facet made it possible to leverage past track work on opinion-finding
- All facets are assumed to have binary inclinations for operational simplicity

Topics
- One appropriate facet added to each topic

Query: hugo chavez
Description: I am looking for blogs that talk about Venezuelan president Hugo Chavez and his politics.
Facet: indepth
Narrative: I want to follow blogs that talk about Hugo Chavez, the president of Venezuela. Blogs that follow his role in Venezuelan politics are relevant, as well as those that discuss non-political stories and activities. I am more interested in blogs about Chavez than blogs about Venezuelan politics generally.

- 50 new topics were created by TREC assessors:
  - 21 Opinionated
  - 10 Personal
  - 19 Indepth

Runs
- Retrieval unit: blogs from the Feeds component of Blogs08
- For each topic, a run consists of three rankings of 100 blogs:
  - One with the 1st inclination of the facet enabled
  - One with the 2nd inclination of the facet enabled
  - One with no facet inclination enabled (akin to a topic-relevance baseline)
- Example: for a topic with the Personal facet
  - 1st ranking should have 100 personal blogs
  - 2nd ranking should have 100 official blogs
  - 3rd ranking should have 100 relevant blogs

Assessment Procedure
- How does one assess a blog?
  - By reading some of its posts
- Assessment scale:
  - [0]: Not relevant
  - [1]: Relevant but not clearly inclined towards a facet inclination
  - [2]: Relevant and clearly inclined towards the 1st facet inclination (opinionated, personal, indepth)
  - [3]: Relevant and clearly inclined towards the 2nd facet inclination (factual, official, shallow)
- Topic-relevance baseline runs
  - Measure using NR={0}, R={1,2,3}
- Faceted blog search runs
  - Measure using NR={0,1}, R={2|3}
  - Measure MAP for all facet inclination rankings (2 inclinations for each topic)
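The two relevance mappings above can be illustrated with a small sketch of average precision (AP). It assumes that R={2|3} means grade 2 counts as relevant for the 1st-inclination ranking and grade 3 for the 2nd-inclination ranking, and that unjudged blogs count as not relevant. The blog identifiers and grades are hypothetical, and this is not the official evaluation code.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str],
                      grades: Dict[str, int],
                      relevant_grades: Set[int]) -> float:
    """AP of one ranking; a blog is relevant if its grade is in relevant_grades."""
    num_relevant = sum(1 for g in grades.values() if g in relevant_grades)
    if num_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, blog in enumerate(ranking, start=1):
        # Assumption: blogs without a judgment are treated as grade 0.
        if grades.get(blog, 0) in relevant_grades:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant


# Hypothetical judgments for one topic, on the 0-3 assessment scale.
grades = {"blogA": 2, "blogB": 0, "blogC": 3, "blogD": 1, "blogE": 2}
ranking = ["blogA", "blogB", "blogE", "blogC"]

BASELINE = {1, 2, 3}        # NR={0}, R={1,2,3}
FIRST_INCLINATION = {2}     # NR={0,1,3}, R={2} for the 1st inclination ranking
SECOND_INCLINATION = {3}    # NR={0,1,2}, R={3} for the 2nd inclination ranking

ap_baseline = average_precision(ranking, grades, BASELINE)
ap_first = average_precision(ranking, grades, FIRST_INCLINATION)
ap_second = average_precision(ranking, grades, SECOND_INCLINATION)
```

The same ranking scores differently under each mapping, which is why a run submits a separate ranking per inclination.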


Runs and Pooling
- Each group permitted up to 4 runs
- 9 groups took part in the faceted blog distillation task
- 29 submitted runs, including 24 title-only runs
- All runs pooled (and all 3 rankings in each run) to depth 30

                         # Queries   # Blogs
Not Relevant             49          25381
Relevant (cannot tell)   49          210
Relevant (opinionated)   13          159
Relevant (factual)       13          92
Relevant (official)      8           63
Relevant (personal)      8           118
Relevant (indepth)       18          220
Relevant (shallow)       18          176
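As a quick arithmetic check on the pool, the sketch below sums the judged blogs and computes the share judged not relevant. It assumes the blog counts in the table above are read as 25381, 210, 159, 92, 63, 118, 220 and 176; the dictionary keys are only labels chosen for this example.

```python
# Blog counts per judgment category, as read from the pooling table (assumed split of digits).
pool_counts = {
    "not relevant": 25381,
    "relevant (cannot tell)": 210,
    "relevant (opinionated)": 159,
    "relevant (factual)": 92,
    "relevant (official)": 63,
    "relevant (personal)": 118,
    "relevant (indepth)": 220,
    "relevant (shallow)": 176,
}

total_judged = sum(pool_counts.values())
not_relevant_share = pool_counts["not relevant"] / total_judged

print(f"{total_judged} judged blogs, {not_relevant_share:.1%} not relevant")
```

Under this reading, roughly 96% of the pool is not relevant, which is in line with the figure quoted in the results overview.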


Overview of Results
- Baseline retrieval performances are lower than expected
  - 96% of the pooled blogs were judged irrelevant
- Facet performances are low
  - Performance across facets differs
  - E.g. Indepth vs Opinionated
- Task complexity, early-stage techniques, or difficult topics?

Facet Inclination   MAP (Best)   MAP (Median)   P@10 (Best)   P@10 (Median)
Baseline            0.3617       0.1285         0.5308        0.2436
Opinionated         0.2338       0.0727         0.2615        0.1000
Factual             0.2945       0.0685         0.2308        0.0769
Official            0.3167       0.0560         0.2375        0.0625
Personal            0.2995       0.0937         0.3250        0.1125
Indepth             0.3489       0.0549         0.2778        0.0889
Shallow             0.1906       0.0250         0.2111        0.0333

Baseline runs results: 39 topics; Top 5 Groups; Title-only (ranked by MAP)
- Most of the groups indexed only the Permalinks component of Blogs08
- Almost all deployed retrieval techniques scored a blog based on the scores of its corresponding relevant posts

Group           Run            MAP      P@10     bPref
buptpris_2009   prisb          0.2756   0.2767   0.3206
ICTNET          ICTNETBDRUN2   0.2399   0.2384   0.2863
USI             combined       0.2326   0.2409   0.2815
FEUP            FEUPirlab2     0.1752   0.1986   0.2447
uogTr           uogTrFBAlr     0.1317   0.1531   0.2004

- Topic relevance model and expansion using terms from and topic fields.
- Fuzzy aggregation methods to combine regularized blog post scores into blog scores.
- Blog posts ranked using BM25, then scores aggregated to blogs.
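The general pattern described above, scoring posts first and then aggregating to blogs, can be sketched as follows. This is only an illustration: summing post scores is one possible aggregation chosen for the example, not necessarily what any participating group used, and the function name, post identifiers and scores are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate_to_blogs(post_scores: Dict[str, float],
                       post_to_blog: Dict[str, str],
                       top_k: int = 100) -> List[Tuple[str, float]]:
    """Sum the retrieval scores of each blog's posts and rank blogs by the total."""
    blog_scores: Dict[str, float] = defaultdict(float)
    for post, score in post_scores.items():
        blog_scores[post_to_blog[post]] += score
    # Sort by descending score; ties are broken by blog identifier.
    ranked = sorted(blog_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Hypothetical post scores (e.g. from a BM25 ranking of permalinks).
post_scores = {"p1": 2.0, "p2": 1.5, "p3": 3.0, "p4": 0.5}
post_to_blog = {"p1": "blogA", "p2": "blogA", "p3": "blogB", "p4": "blogC"}

blog_ranking = aggregate_to_blogs(post_scores, post_to_blog)
```

With a sum, a blog with several moderately scored posts (blogA) can outrank a blog with a single high-scoring post (blogB), which fits the idea of a recurring interest in the topic.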

Faceted blog search runs results: 39 topics; Top 5 Groups; Ranked by ALL (MAP)
- Faceted search proved to be particularly challenging
- For all groups, and in almost all cases: applying faceted search leads to a decrease in performance compared with the faceted performance of the baseline ranking

MAP:

Group    Run           All      Opinion  Factual  Official  Personal  Indepth  Shallow
USI      regularized   0.1261   0.0897   0.1044   0.1577    0.1337    0.1469   0.1298
FEUP     Pirlab2*      0.1198   0.1068   0.1339   0.1523    0.1791    0.1489   0.0491
ICTNET   BDRUN2        0.1030   0.1259   0.1176   0.0257    0.1855    0.1200   0.0567
BIT      09PH          0.1026   0.0798   0.1350   0.1047    0.1239    0.1403   0.0475
uogTr    TrFBHlr       0.0918   0.0919   0.1103   0.1965    0.0739    0.1015   0.0301

- Indepth facet: posts scored using Cross Entropy. For other facets: Mutual Information is used to weight terms in posts, using various lexicons.
- Did not attempt faceted search. Post scores are altered using temporal information before being aggregated into blog scores.
- Learned a classifier for the Indepth facet. For other facets, they used heuristics to score blog posts before aggregation.

For each facet: 2 rankings, giving 6 rankings in total. 39 queries: each query has two inclination rankings, giving 78 AP values.
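The note above suggests that the ALL column is a mean over 78 AP values. A small sketch, assuming 13 Opinionated, 8 Personal and 18 Indepth topics (the query counts in the pooling table) and using the USI row of the results table, shows that a topic-weighted mean of the per-inclination MAPs gives the reported value. The variable names are chosen for this example only.

```python
# Number of judged topics behind each inclination column (assumed from the pooling table).
topics_per_inclination = {
    "Opinion": 13, "Factual": 13,
    "Official": 8, "Personal": 8,
    "Indepth": 18, "Shallow": 18,
}

# Per-inclination MAP of the USI "regularized" run, from the results table.
usi_map = {
    "Opinion": 0.0897, "Factual": 0.1044,
    "Official": 0.1577, "Personal": 0.1337,
    "Indepth": 0.1469, "Shallow": 0.1298,
}

num_ap_values = sum(topics_per_inclination.values())

# Weighting each MAP by its topic count recovers the mean over all AP values.
all_map = sum(usi_map[k] * topics_per_inclination[k] for k in usi_map) / num_ap_values

print(num_ap_values, round(all_map, 4))
```

The result rounds to the 0.1261 listed under All for that run.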



Top Stories Identification Task
- Many blog search engine queries are news-related
- The new task's main research question: How well does the blogosphere respond to real-world events?
- Facilitated by the Blogs08 test collection, 54 weeks in length, including
  - US election cycle
  - China earthquake
  - etc.

Task Definition
- For a given unit of time (query date), identify the top news stories on that date
- Also identify some blog posts related to each headline, covering its various/diverse aspects
- News stories are represented by headlines broadcast by the NY Times
  - For the entire timespan of Blogs08
  - Distributed with kind permission of the NYT
- Example headline: Federal takeover of Fannie Mae and Freddie Mac

Task Details
Example query:

TS09-33 2008-08-25

- Provide a ranking of news headlines in the range of ±1 day around the query date
  - e.g. if a story happens early on day d in Europe, it will be reported by an American broadcaster (NYT) on day d-1

- For each ranked news headline, suggest relevant, diverse blog posts
  - Relevant blog posts may occur anytime after the date of the event
- The task is of Retrospective Event Detection (RED) type
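The date handling can be sketched as below, assuming the headline range is one day either side of the query date (so that a story reported on day d-1 is still a candidate). The function name and the window parameter are hypothetical.

```python
from datetime import date, timedelta
from typing import List


def candidate_dates(query_date: date, window_days: int = 1) -> List[date]:
    """Dates whose headlines are candidates for a query date (assumed +/- window)."""
    return [query_date + timedelta(days=offset)
            for offset in range(-window_days, window_days + 1)]


# Example query TS09-33 from the slide above.
dates_for_ts09_33 = candidate_dates(date(2008, 8, 25))
```

A system would then rank only the headlines published on one of these dates.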


Topic Development
- The organisers selected 55 dates as topics
  - Covering various global, political, economic, cultural, sports and technology events
- These included dates related to events such as:
  - Chinese earthquake
  - Obama's inauguration
  - Banking crisis
  - Beijing Olympics
  - Oscars
  - Microsoft/Yahoo (aborted) deal
  - etc.

Runs and Assessments
- A run consists of a ranking of 100 headlines, each supported by up to 10 diverse blog posts
- Runs use the SUPPORTing run format developed for the Enterprise track expert search task
- 25 runs by 7 groups: pooled top 20 headlines from each run
- Two phases of participant community judging:
  - Top news story judging: identify important news stories for each day
  - Blog post judging: identify relevant and diverse blog posts for relevant headlines

Phase 1: Top News Story Judging
- We asked assessors to take the role of a newspaper editor
  - What stories would they put on the front page of a newspaper or news website?
- Assess whether the headline actually occurred on the query day, and judge each headline story as Important or Not Important
  - Could consider their own recollection of events, or refer to external Web resources
- Editorial factors to consider: Timing, Significance, Prominence, Human Interest, Proximity
- Interface provided a pool of headlines to judge, the headline and a snippet of each story, and a link to the actual NYT news article

Phase 2: Blog Post Judging
- Once headlines were judged, important ones were sampled for blog post judging
- 2-phase judging avoids judging blog posts at the same time as judging headlines
  - Assessors only have to read blog posts for headlines judged important
- Blog posts were judged Relevant or Not Relevant to the headline
- When judging, assessors defined aspects to group relevant blog posts
  - e.g. for a headline on the Oscars, the assessor defined aspects such as liveblogs, factual, opinionated, accuracy of predictions
- Aspects are used during diversity evaluation

Relevance Assessments
- Top news story identification was hard:

Relevance Level   # Stories
Not Important     9453
Important         1434

- Blog post judging, less so:

Relevance Level   # Blog Posts
Not Relevant      3453
Relevant          4375

- Result reporting in two phases: top news story identification, then diverse blog post retrieval

Identifying Top News Stories
- All 25 submitted runs were automatic
- Task was fairly difficult: retrieval performances were rather low

              MAP      P@10
TREC best     0.2553   0.4873
TREC median   0.0445   0.1164

Identifying Top News Stories: Runs
- All groups indexed only the Permalinks component of Blogs08 (exceptions are UAms & USI)

[email protected]_KLE   KLEClusPrior   0.1605   0.2964   0.4553
UAms                       IlpsTSExP      0.1354   0.2745   0.4271
IowaS                      IowaSBT0904    0.0882   0.1709   0.3294
ICTNET                     ICTNETTSRun1   0.0391   0.0982   0.1801
shakwatri1025rw5h2b                       0.0388   0.1200   0.2127
USI                        runtag         0.0062   0.0182   0.1818

- Voting Model: number of blog posts mentioning a headline.
- Probabilistic: combination of the query-generating headline probability and a headline prior calculated from time- or term-based evidence.
- Two probabilistic approaches: news to blogs or blogs to news.
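The voting idea above, ranking a headline by the number of blog posts that mention it, can be sketched as follows. How a "mention" is detected is an assumption made for this example (at least half of the headline's terms appear in the post); the participants' actual matching is not described on the slide. The second headline and all posts are hypothetical.

```python
import re
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lower-cased alphanumeric terms of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def vote_count(headline: str, posts: List[str], min_overlap: float = 0.5) -> int:
    """Number of posts containing at least min_overlap of the headline's terms."""
    headline_terms = tokens(headline)
    if not headline_terms:
        return 0
    return sum(
        1 for post in posts
        if len(headline_terms & tokens(post)) / len(headline_terms) >= min_overlap
    )


def rank_headlines(headlines: List[str], posts: List[str]) -> List[Tuple[str, int]]:
    """Headlines ordered by descending vote count, ties broken alphabetically."""
    scored = [(h, vote_count(h, posts)) for h in headlines]
    return sorted(scored, key=lambda hv: (-hv[1], hv[0]))


headlines = [
    "Federal takeover of Fannie Mae and Freddie Mac",
    "Local bakery wins award",  # hypothetical headline
]
posts = [
    "The federal takeover of Fannie Mae and Freddie Mac worries me",
    "Fannie and Freddie: what the takeover means",
    "My holiday photos",
]

headline_ranking = rank_headlines(headlines, posts)
```

In practice the posts would be restricted to those published around the query date, and stopwords would probably be removed before matching.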

Identifying Blog Posts
- Runs with high top story recall have more chance to identify relevant blog posts
- Moreover, systems found identifying blog posts for a headline easier
- Evaluation measures are diversity-based, from the Web track:
  - α-NDCG@10 (α=0.5)
  - IA-P@10
  - See Charlie's talk for the Web track

              α-NDCG@10   IA-P@10
TREC best     0.7723      0.2759
TREC median   0.0217      0.0041
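To make the diversity measure concrete, here is a minimal sketch of α-nDCG@10 with α = 0.5, following the usual definition in which the gain of a post for an aspect is discounted by (1 - α) for each earlier post covering that aspect, and the ideal ranking is approximated greedily. It is not the official Web track evaluation tool, and the post identifiers and aspect assignments are hypothetical.

```python
import math
from typing import Dict, List, Set


def alpha_dcg(ranking: List[str], aspects: Dict[str, Set[str]],
              alpha: float = 0.5, depth: int = 10) -> float:
    """Alpha-DCG: repeated aspects earn geometrically less gain."""
    seen: Dict[str, int] = {}
    score = 0.0
    for rank, doc in enumerate(ranking[:depth], start=1):
        gain = 0.0
        # Posts without judged aspects are treated as not relevant.
        for aspect in aspects.get(doc, set()):
            gain += (1 - alpha) ** seen.get(aspect, 0)
            seen[aspect] = seen.get(aspect, 0) + 1
        score += gain / math.log2(rank + 1)
    return score


def greedy_ideal(aspects: Dict[str, Set[str]], alpha: float, depth: int) -> List[str]:
    """Greedy approximation of the ideal ranking (the exact ideal is expensive)."""
    remaining = set(aspects)
    seen: Dict[str, int] = {}
    ideal: List[str] = []
    while remaining and len(ideal) < depth:
        best = max(sorted(remaining),
                   key=lambda d: sum((1 - alpha) ** seen.get(a, 0) for a in aspects[d]))
        ideal.append(best)
        remaining.remove(best)
        for aspect in aspects[best]:
            seen[aspect] = seen.get(aspect, 0) + 1
    return ideal


def alpha_ndcg(ranking: List[str], aspects: Dict[str, Set[str]],
               alpha: float = 0.5, depth: int = 10) -> float:
    ideal_score = alpha_dcg(greedy_ideal(aspects, alpha, depth), aspects, alpha, depth)
    if ideal_score == 0:
        return 0.0
    return alpha_dcg(ranking, aspects, alpha, depth) / ideal_score


# Hypothetical aspect judgments for one headline.
aspects = {"post1": {"liveblog"}, "post2": {"liveblog"}, "post3": {"opinionated"}}

diverse = alpha_ndcg(["post1", "post3", "post2"], aspects)
redundant = alpha_ndcg(["post1", "post2", "post3"], aspects)
```

The ranking that covers both aspects early scores higher than the one that repeats the liveblog aspect first, which is the behaviour the measure is meant to reward.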


Identifying Blog Posts: Runs
- Means calculated over all 258 judged headlines
- However, the ranking of runs is not identical to that of the top story identification evaluation
  - Some swaps between groups, and between runs for a given group

Group                      Run            α-NDCG@10   IA-P@10
[email protected]_KLE   KLEFeedPrior   0.504       0.162
IowaS                      IowaSBT0901    0.341       0.099
UAms                       IlpsTSExT      0.104       0.030
ICTNET                     ICTNETTSRun1   0.073       0.024
shakwatri1025rw5432                       0.002       0.000
USI                        runtag         0.000       0.000

- Divergence From Randomness DPH ranking and MMR.
- Latent Dirichlet Relevance Model, but applied no diversification.
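For readers unfamiliar with MMR (Maximal Marginal Relevance), the sketch below shows the basic greedy re-ranking step: each pick trades off a post's relevance score against its similarity to posts already selected. The relevance scores stand in for the output of a first-pass ranker and are hypothetical, as are the post texts; Jaccard term overlap is an assumed similarity chosen for simplicity, not the participants' setting.

```python
from typing import Callable, Dict, List


def jaccard(a: str, b: str) -> float:
    """Term-overlap similarity between two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def mmr(relevance: Dict[str, float],
        similarity: Callable[[str, str], float],
        lam: float = 0.5,
        k: int = 10) -> List[str]:
    """Greedy MMR: maximise lam * relevance - (1 - lam) * max similarity to selected."""
    selected: List[str] = []
    remaining = dict(relevance)
    while remaining and len(selected) < k:
        def mmr_score(doc: str) -> float:
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * remaining[doc] - (1 - lam) * redundancy
        best = max(sorted(remaining), key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical posts about an Oscars headline, with hypothetical relevance scores.
texts = {
    "p1": "oscars live blog red carpet",
    "p2": "oscars live blog red carpet updates",
    "p3": "oscars predictions accuracy",
}
relevance = {"p1": 0.9, "p2": 0.85, "p3": 0.6}

diversified = mmr(relevance, lambda a, b: jaccard(texts[a], texts[b]))
```

Here the near-duplicate liveblog post p2 is pushed below the less relevant but different post p3.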

Conclusions
- In 2009, the Blog track has been markedly revamped
  - Two new pilot search tasks that go beyond topical relevance and simple adhoc retrieval
- The results on both tasks confirm the complexities of faceted blog search and top stories identification
  - There is considerable scope for further research and improvements
- Blog track will run in 2010
  - Same tasks but with a few proposed refinements intended to facilitate research into considering the blogosphere as a time stream
  - More at the Blog track workshop on Friday
