Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic

Jin Young Kim*, Kevyn Collins-Thompson, Paul Bennett and Susan Dumais
*Work done during internship at Microsoft Research

Post on 24-Mar-2016


Page 1: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic

Jin Young Kim*, Kevyn Collins-Thompson, Paul Bennett and Susan Dumais

*Work done during internship at Microsoft Research

Page 2: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Search and recommendation are about matching.

Queries, Documents, Websites, Users

Page 3: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Term-space matching is not always a good idea.

Granularity, Sparsity, Efficiency

Page 4: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Can we build representations beyond the term vectors?

Topic Category, Reading Level, Sentiment, Style

Page 5: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

What would be their implications for search and recommendations?

Entities: Queries, Documents, Websites, Users

Representations: Topic Category, Reading Level, Sentiment, Style

Page 6: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

In a Nutshell

WHAT WE DID:
  Build profiles of Reading Level and Topic (RLT)
  For queries, websites, users and search sessions
  In order to characterize and compare entities

WHAT WE FOUND:
  Profile matching predicts user’s content preference
  Profiles can indicate when not to personalize
  Profile features can predict expert content

Page 7: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Building Reading Level and Topic Profiles

Page 8: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Predicting Reading Level and Topic for URL

Reading Level Classifier
  Based on language model and other sources

Topic Classifier
  Trained using URLs in each Open Directory Project category

Profile
  Distribution over reading level, topic, or reading level and topic (RLT)

P(R|d1), P(T|d1)

Page 9: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Entity Profile Built from Related URLs

Entities and Related URLs
  Websites: content vs. user-viewed URLs
  Users: URLs visited during search sessions
  Queries: top-10 retrieved URLs

Example: Site profile made from URLs visited during search sessions

[Diagram: the per-URL profiles P(R|d1), P(T|d1) of each related URL are combined into the site profile P(R,T|s)]
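The slide does not say how the per-URL distributions are combined, so the following is only a minimal sketch under two stated assumptions: the joint P(R,T|d) for a URL is taken as the product of P(R|d) and P(T|d), and the entity profile is the unweighted mean over related URLs. The function names `url_joint_profile` and `entity_profile` are hypothetical, and the toy numbers are made up.

```python
def url_joint_profile(p_r, p_t):
    """Joint P(R,T|d) as an outer product.

    Assumption: reading level and topic are independent given the URL.
    """
    return [[pr * pt for pt in p_t] for pr in p_r]


def entity_profile(url_profiles):
    """Entity profile P(R,T|s) as the unweighted mean of its URLs' joint profiles.

    Assumption: every related URL counts equally.
    """
    n = len(url_profiles)
    rows, cols = len(url_profiles[0]), len(url_profiles[0][0])
    mean = [[sum(p[i][j] for p in url_profiles) / n for j in range(cols)]
            for i in range(rows)]
    total = sum(sum(row) for row in mean)
    return [[v / total for v in row] for row in mean]  # renormalize


# Toy example with 3 reading levels and 2 topics
# (the slides use 12 reading levels and ODP categories).
d1 = url_joint_profile([0.7, 0.2, 0.1], [0.9, 0.1])
d2 = url_joint_profile([0.1, 0.3, 0.6], [0.5, 0.5])
site = entity_profile([d1, d2])
```

The same averaging could be applied to user or query profiles by swapping in the URLs listed for those entities.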

Page 10: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Entity Profile Built with Related Entities

Entity and related entities
  User: Websites visited
  Website: Surfacing queries
  Query: Issuing users

Example: Site profile made from the profiles of its visitors

[Diagram: User, Query and Website linked by Visit, Issue and Surface relations; the visitor profiles P(R,T|u) are combined into the site profile P(R,T|s)]

Page 11: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Characterizing and Comparing Profiles

Characterizing an Individual Entity
  Mean: expectation
  Variance: entropy

Characterizing a Group of Entities
  Build a group centroid from its members
  Variance: divergence among members

Comparing Entities and Groups
  Difference in mean
  Divergence in profile (distribution)
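The measures named on this slide can be sketched as follows. This is an illustration, not the authors' code: the slide names expectation, entropy, a group centroid and divergence, but the specific choices here (log base 2, KL divergence with a small smoothing constant, mean KL from the centroid as the group divergence) are assumptions. A profile is a flat list of probabilities; a joint RLT profile can be flattened into the same shape.

```python
import math


def expectation(p_r):
    """Expected reading level, with levels numbered 1..len(p_r)."""
    return sum(level * p for level, p in enumerate(p_r, start=1))


def entropy(p):
    """Shannon entropy in bits; higher means a more diverse profile."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in bits; eps smoothing on q is an assumption to avoid dividing by zero."""
    return sum(x * math.log2(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def centroid(profiles):
    """Group centroid as the element-wise mean of member profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]


def group_divergence(profiles):
    """Mean KL divergence of each member from the centroid (one possible choice)."""
    c = centroid(profiles)
    return sum(kl_divergence(p, c) for p in profiles) / len(profiles)


focused = [0.0, 0.1, 0.9]       # made-up profile concentrated on one level
diverse = [1 / 3, 1 / 3, 1 / 3]  # made-up profile spread evenly
```

With these toy profiles, `entropy(focused)` is lower than `entropy(diverse)`, which is the sense in which entropy plays the role of variance for a single entity.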

Page 12: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Characterizing Web Content, User Interests, and Search Behavior

Page 13: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Data Set

Session Log Data
  2,281,150 URL visits (1,218,433 SERP clicks)
  Collected from 8,841 users

Profiles of Entities
  4,715 websites with 25+ clicked URLs
  7,613 users with 25+ URL visits
  141,325 unique queries

Page 14: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Each topic has different reading level distribution

Reading Level Distribution for Top ODP Categories

Category        R1    R2    R3    R4    R5    R6    R7    R8    R9    R10   R11   R12   E[R|T]
Reference       0.00  0.00  0.00  0.02  0.17  0.10  0.15  0.04  0.02  0.03  0.20  0.27  8.80
Health          0.00  0.00  0.00  0.03  0.18  0.08  0.13  0.04  0.04  0.10  0.27  0.11  8.53
Science         0.00  0.00  0.00  0.06  0.23  0.09  0.07  0.02  0.01  0.08  0.27  0.17  8.44
Computers       0.00  0.00  0.00  0.06  0.24  0.19  0.03  0.01  0.01  0.02  0.32  0.12  8.11
Business        0.00  0.00  0.00  0.05  0.22  0.16  0.09  0.03  0.02  0.04  0.26  0.12  8.08
Society         0.00  0.00  0.00  0.02  0.23  0.07  0.35  0.03  0.01  0.01  0.22  0.06  7.62
Adult           0.00  0.00  0.00  0.05  0.28  0.26  0.14  0.05  0.02  0.01  0.13  0.06  6.98
Kids and Teens  0.00  0.00  0.02  0.23  0.26  0.13  0.09  0.02  0.01  0.02  0.15  0.08  6.60
Games           0.00  0.00  0.00  0.19  0.36  0.10  0.11  0.02  0.02  0.03  0.12  0.03  6.39
Recreation      0.00  0.00  0.00  0.11  0.44  0.19  0.08  0.02  0.02  0.02  0.09  0.02  6.18
Arts            0.00  0.00  0.00  0.08  0.40  0.27  0.10  0.05  0.01  0.01  0.06  0.02  6.18
Home            0.00  0.00  0.02  0.19  0.41  0.14  0.04  0.03  0.01  0.03  0.09  0.04  6.08
News            0.00  0.00  0.00  0.04  0.41  0.33  0.14  0.02  0.02  0.01  0.03  0.01  5.99
Shopping        0.00  0.00  0.01  0.22  0.29  0.24  0.09  0.03  0.01  0.02  0.07  0.02  5.98
Sports          0.00  0.00  0.00  0.09  0.56  0.11  0.10  0.03  0.03  0.02  0.06  0.02  5.94
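As a check on how the E[R|T] column relates to the R1 to R12 columns, the expectation can be recomputed from a row. The sketch below uses the Reference row; the function name `expected_reading_level` is hypothetical, and dividing by the row sum is an assumption made because the printed values are rounded and some rows do not sum to exactly 1.

```python
# Reference row of the table, reading levels R1..R12.
reference = [0.00, 0.00, 0.00, 0.02, 0.17, 0.10, 0.15, 0.04, 0.02, 0.03, 0.20, 0.27]


def expected_reading_level(dist):
    """E[R|T] = sum over levels r of r * P(R=r|T), renormalized for rounding."""
    total = sum(dist)
    return sum(level * p for level, p in enumerate(dist, start=1)) / total


e_reference = expected_reading_level(reference)
```

The result lands close to the 8.80 reported for Reference; the small gap is consistent with the table entries being rounded to two decimals.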

Page 15: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Topic and reading level characterize websites in each category

Page 16: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Profile matching predicts user’s preference over search results

Metric
  % of user’s preferences predicted by profile matching, for each clicked website over the skipped website above

Results
  By degree of focus in user profile: H(R,T|u)
  By the distance metric between user and website: KLR(u,s) / KLT(u,s) / KLRLT(u,s)

User Group   #Clicks   KLR(u,s)   KLT(u,s)   KLRLT(u,s)
↑ Focused    5,960     59.23%     60.79%     65.27%
             147,195   52.25%     54.20%     54.41%
↓ Diverse    197,733   52.75%     53.36%     53.63%
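The metric above can be sketched as a pairwise test: a preference counts as predicted when the clicked website's profile is closer to the user's profile than the skipped website's profile. This is only an illustration under assumptions: the direction KL(u || s), the smoothing constant, and the names `kl` and `preference_accuracy` are not given on the slide, and the profiles are made up.

```python
import math


def kl(p, q, eps=1e-9):
    """KL(p || q); eps smoothing on q is an assumption."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def preference_accuracy(user_profile, pairs):
    """Fraction of (clicked, skipped) site pairs where the clicked site is closer to the user."""
    correct = sum(
        1 for clicked, skipped in pairs
        if kl(user_profile, clicked) < kl(user_profile, skipped)
    )
    return correct / len(pairs)


user = [0.8, 0.15, 0.05]
pairs = [
    ([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]),    # clicked site is closer to the user
    ([0.2, 0.2, 0.6], [0.75, 0.2, 0.05]),  # skipped site is closer to the user
]
accuracy = preference_accuracy(user, pairs)
```

Passing reading-level-only, topic-only, or joint profiles to the same function would correspond to the three distance columns in the table.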

Page 17: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Users’ Deviation from Their Own Profiles

Stretch reading: Session-level reading level >> Long-term reading level
Casual reading: Session-level reading level << Long-term reading level

URL Title Words for Stretch Reading      URL Title Words for Casual Reading
Title word             Log ratio         Title word       Log ratio
tests                  2.22              best             -0.42
test                   1.99              football         -0.45
sample                 1.94              store            -0.46
digital                1.88              great (deals)    -0.47
(tuition) options      1.87              items            -0.52
(financial) aid        1.87              new              -0.53
(medication) effects   1.84              sale             -0.61
education              1.77              games            -0.65
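The slide does not define how the log ratio is computed. One plausible reading, sketched here purely as an assumption, is the log of a word's smoothed relative frequency in one set of URL titles over its relative frequency in a comparison set. The function name `log_ratios`, the add-one smoothing, the natural log, and the example titles are all hypothetical.

```python
import math
from collections import Counter


def log_ratios(target_titles, background_titles, alpha=1.0):
    """Log of smoothed relative word frequency in target titles vs. background titles."""
    target = Counter(w for t in target_titles for w in t.lower().split())
    background = Counter(w for t in background_titles for w in t.lower().split())
    vocab = set(target) | set(background)
    n_target = sum(target.values()) + alpha * len(vocab)
    n_background = sum(background.values()) + alpha * len(vocab)
    return {
        w: math.log(((target[w] + alpha) / n_target)
                    / ((background[w] + alpha) / n_background))
        for w in vocab
    }


# Made-up titles, not taken from the study's data.
stretch_titles = ["sample tests", "financial aid options"]
casual_titles = ["football games", "great deals sale"]
ratios = log_ratios(stretch_titles, casual_titles)
```

Under this reading, words typical of stretch reading get positive values and words typical of casual reading get negative values, matching the signs in the table.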

Page 18: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Comparing Expert vs. Non-expert URLs

Expert vs. Non-expert URLs taken from [White’09]

Page 19: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Predicting Expert vs. Novice Websites

Results
  Baseline (predict most likely class): 65.8%
  Classifier accuracy: 82.2%

Features

Feature       Correl. with Expertness   Description
E[R|Qs]       +0.34                     Expectation of Surfacing Query's RL
E[R|Us]       +0.44                     Expectation of Visitor's RL
DivRLT(U,s)   -0.56                     Distance of visitors’ RLT profile from site's
DivT(U,s)     -0.55                     Distance of visitors’ Topic profile from site's

Page 20: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Thank you for your attention!

WHAT WE DID:
  Build profiles of Reading Level and Topic (RLT)
  For queries, websites, users and search sessions
  To characterize and compare entities

WHAT WE FOUND:
  Profile matching predicts user’s content preference
  Profiles can indicate when not to personalize
  Profile features can predict expert content

More at: @jin4ir / cs.umass.edu/~jykim

Page 21: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Optional Slides

Page 22: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Website reading level vs. visitor diversity

Breakdown per topic reveals stronger relationship

Correlation between Site vs. Visitor Profiles

Website Reading Level   Visitor Profile Diversity
                        DivR(U|s)   DivT(U|s)   DivRT(U|s)
E[R|s]                  0.052       0.081       0.095

[Figure: per-topic breakdown of the correlation for Computers, Reference, News, Arts, Recreation, Science, Health, Sports, Society, Business, Adult, Games, Home, Shopping, and Kids_and_Teens; axis from -0.3 to 0.4]

Page 23: Characterizing Web Content, User Interests, and Search Behavior by Reading  Level and  Topic

Query / User Reading Level against P(Topic)

User profile shows different trends in Computers