Laura Hollink, Peter Mika and Roi Blanco. Web Usage Mining with Semantic Analysis. In Proceedings of the International World Wide Web Conference, Rio de Janeiro, Brazil, May 2013.
Web Usage Mining with Semantic Analysis
Laura Hollink, VU University Amsterdam
Peter Mika, Yahoo! Labs Barcelona
Roi Blanco, Yahoo! Labs Barcelona
Analysis of web user behavior
What are typical use cases? Are these carried out in a particular order?
Which use cases are not satisfied? And to which other sites do users go?
oakland as bradd pitt movie | moneyball | movies.yahoo.com | oakland as | wikipedia.org
captain america | movies.yahoo.com | moneyball trailer | movies.yahoo.com
money | moneyball | movies.yahoo.com
moneyball | movies.yahoo.com | movies.yahoo.com | en.wikipedia.org | movies.yahoo.com | peter brand | peter brand oakland | nymag.com | moneyball the movie | www.imdb.com
moneyball trailer | movies.yahoo.com | moneyball trailer
brad pitt | brad pitt moneyball | brad pitt moneyball movie | brad pitt moneyball | brad pitt moneyball oscar | www.imdb.com
relay for life calvert ocunty | www.relayforlife.org | trailer for moneyball | movies.yahoo.com | moneyball.movie-trailer.com
moneyball | en.wikipedia.org | movies.yahoo.com | map of africa | www.africaguide.com
money ball movie | www.imdb.com | money ball movie trailer | moneyball.movie-trailer.com
brad pitt new | www.zimbio.com | www.usaweekend.com | www.ivillage.com | www.ivillage.com | brad pitt news | news.search.yahoo.com | moneyball trailer | moneyball trailer | www.imdb.com | www.imdb.com
Transaction logs: sessions of queries and clicks
Are these use cases typical for all movies? Recent movies? Only for Moneyball?
Why are these questions difficult to answer?
Sparsity of the event space
‣ 64% of queries are unique within a year
‣ even the most frequent patterns have extremely low support
To illustrate: top 12 most frequent sessions observed in our data:
Tasks
Question 1: What are typical use cases?
‣ Task 1: find sequences of events in the data that are more frequent (have a higher support) than a threshold.
Question 2: What use cases are not satisfied?
‣ Task 2: learn to predict website abandonment from queries and clicks.
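Task 1 can be illustrated with a minimal sketch. It assumes that each session is a list of event labels and that a pattern is a contiguous run of events of fixed length; the slides do not say which mining algorithm was used, so the function name, the `length` parameter and the sample sessions are all hypothetical.

```python
from collections import Counter

def frequent_sequences(sessions, length=2, min_support=0.5):
    """Return contiguous event sequences whose support exceeds min_support.

    Support is the fraction of sessions that contain the sequence at least once.
    """
    counts = Counter()
    for events in sessions:
        # Collect each distinct sequence once per session.
        seen = {tuple(events[i:i + length])
                for i in range(len(events) - length + 1)}
        counts.update(seen)
    total = len(sessions)
    return {seq: n / total for seq, n in counts.items() if n / total > min_support}

# Hypothetical sessions, already generalized to event labels.
sessions = [
    ["film", "actor", "film"],
    ["film", "actor"],
    ["actor", "location"],
]
patterns = frequent_sequences(sessions, length=2, min_support=0.5)
```

On raw query strings almost no sequence would pass the threshold, which is the sparsity problem described above; generalized labels make repeated patterns visible.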
Approach
Example session: oakland as bradd pitt movie | moneyball | movies.yahoo.com | oakland as | wikipedia.org
Applied to the movie domain
Connect queries to entities in the linked open data cloud and use properties of these entities to generalize and categorize queries.
Data processing and linking steps
1. link queries to entities
2. select types of entities (classes)
3. detect modifier words (download, trailer, cast, date, etc.)
4. identify navigational queries
5. identify ‘losing’ queries
1. Linking queries to entities in the LOD cloud
• We link one entity to each query.
• The intent of about 40% of unique Web queries is to find a particular entity [Pound, WWW2008].
• We link to Freebase (has a lot of movie-related info) and DBpedia (Wikipedia is widely used).
2. Select one type per entity
• We use the Freebase API to get the semantic “types” of each query URI
• Freebase ‘Notable types API’ is not official and not documented.
• For repeatability and transparency, we have created our own heuristics to select one type for each entity:
1. no internal or administrative types,
2. prefer established domains (‘Commons’) over user-defined schemas (‘Bases’),
3. aggregate specific types into more general types:
a) subtypes of location -> location
b) subtypes of award winners and nominees -> award_winner_nominee
c) prefer movie-related types over other types: film, actor, artist, tv_program, tv_actor and location (in order of decreasing preference).
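The heuristics above can be sketched as a small selection function. This is an illustration under assumptions, not the authors' implementation: the prefix lists, the subtype table and the full type identifiers are hypothetical stand-ins chosen to resemble Freebase type paths.

```python
# Assumed prefixes for internal/administrative types and user-defined schemas.
INTERNAL_PREFIXES = ("/type/", "/common/", "/freebase/")
USER_PREFIXES = ("/base/", "/user/")

# Assumed identifiers for the preference list, in decreasing preference.
PREFERRED = ["/film/film", "/film/actor", "/music/artist",
             "/tv/tv_program", "/tv/tv_actor", "/location/location"]

# Hypothetical subtype table used for aggregation.
GENERALIZE = {
    "/location/citytown": "/location/location",
    "/location/country": "/location/location",
    "/award/award_winner": "award_winner_nominee",
    "/award/award_nominee": "award_winner_nominee",
}

def select_type(types):
    """Pick one type for an entity from its list of candidate types."""
    # Drop internal or administrative types.
    candidates = [t for t in types if not t.startswith(INTERNAL_PREFIXES)]
    # Prefer established domains over user-defined schemas when both exist.
    commons = [t for t in candidates if not t.startswith(USER_PREFIXES)]
    if commons:
        candidates = commons
    # Aggregate specific types into more general ones.
    candidates = [GENERALIZE.get(t, t) for t in candidates]
    # Prefer movie-related types, in the given order.
    for preferred in PREFERRED:
        if preferred in candidates:
            return preferred
    # Otherwise fall back to the first remaining candidate, if any.
    return candidates[0] if candidates else None
```

Falling back to the first remaining candidate is an arbitrary choice made for this sketch; the slides do not say how ties among non-movie types are resolved.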
3. Detect modifier words in queries
Top 100 most frequent words that appear in the query log before or after entity names [Mika ISWC2009, Pantel WWW2012].
movie, movies, theater, cast, quotes, free, theaters, watch, 2011, new, tv, show, dvd, online, sex, video, cinema, trailer, list, theatre . . .
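A minimal sketch of this counting step, assuming each query has already been paired with the label of its linked entity. The function names and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter

def context_words(query, entity_label):
    """Return the query words that appear before or after the entity name."""
    words = query.lower().split()
    entity = entity_label.lower().split()
    n = len(entity)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == entity:
            return words[:i] + words[i + n:]
    return []  # entity name not found verbatim in the query

def top_modifiers(query_entity_pairs, k=100):
    """Return the k most frequent context words over all (query, label) pairs."""
    counts = Counter()
    for query, label in query_entity_pairs:
        counts.update(context_words(query, label))
    return [word for word, _ in counts.most_common(k)]

# Hypothetical input pairs.
pairs = [
    ("moneyball trailer", "Moneyball"),
    ("captain america trailer", "Captain America"),
    ("moneyball cast", "Moneyball"),
]
modifiers = top_modifiers(pairs, k=1)
```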
4. Identifying navigational queries
• A navigational query is a query entered with the intention of navigating to a particular website.
• A common heuristic is to consider a query navigational when it matches the domain name of a clicked result.
• The “official homepage” of an entity is taken from the values of dbpedia:homepage, dbpedia:url, and foaf:homepage.
Examples (query, clicked website):
netflix login, www.netflix.com
banana, www.bananas.org
European Parliament, europarl.europa.eu
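The two ways of spotting a navigational query can be contrasted in a short sketch. It assumes exact token matching for the domain-name heuristic and a plain dictionary in place of the dbpedia:homepage, dbpedia:url and foaf:homepage values; the function names, the entity key and the homepage value are hypothetical.

```python
from urllib.parse import urlparse

def host_of(url):
    """Lower-cased host name of a URL, without a leading 'www.'."""
    if "//" not in url:
        url = "//" + url  # let urlparse treat a bare host as a network location
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def matches_domain_name(query, clicked_url):
    """String heuristic: some query word equals a label of the clicked host."""
    labels = host_of(clicked_url).split(".")[:-1]  # ignore the top-level domain
    return any(token in labels for token in query.lower().split())

def clicks_official_homepage(entity, clicked_url, homepages):
    """Semantic check: the clicked host is the entity's official homepage."""
    homepage = homepages.get(entity)
    return homepage is not None and host_of(homepage) == host_of(clicked_url)

# Illustrative homepage table, standing in for values from the linked data.
homepages = {"European_Parliament": "http://www.europarl.europa.eu/"}
```

With these definitions, "netflix login" matches www.netflix.com by name, while "European Parliament" is only recognized as navigational to europarl.europa.eu through the homepage lookup.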
5. Identify ‘losing’ queries
• A ‘losing’ query is the query that leads a user to abandon a service in favor of another service.
• Common definition: A user repeats the same query and clicks on another result in the list.
• Our broader, semantic definition:
Evaluation
1. Linking to entities and types
2. Detection of frequent usage patterns
3. Prediction of website abandonment
Applied to the movie domain
• Sample of server logs of Yahoo! Search in the US from June 2011, split into sessions.
• Only sessions that contain at least one visit to any of 16 popular movie sites.
• 1.7 million sessions, containing over 5.8 million queries and over 6.8 million clicks.
Evaluation of links to entities and types
• Compare manually created <query, entity> and <entity, type> pairs to automatically created links.
• 2 samples: the 50 most frequent queries and 50 random queries.
Examples:
• Ambiguous query: “Green Lantern” - the movie or the fictional character?
• Wrong type: Oil peak is a serious game subject?
Evaluation of links to entities and types
[Figure: three panels (Queries, Entities, Types), each plotting frequency of occurrence.]
Frequent usage patterns I
• Freebase:release_date property of entities.
[Figure: two panels, Recent movies and Older movies.]
Frequent usage patterns II
• Sequences of consecutive query types.
Frequent usage patterns III
• A comparison of websites.
• Most frequent query types that lead to a click on a website.
[Figure: four bar charts (Website 1 to Website 4). Y-axis: proportion of queries that lead to a click on the website (0.0 to 0.6). X-axis: query types, listed below as far as the labels are recoverable.]
Website 1: /film, /film/actor, /tv_program, /people/person, /book/book, /fictional_universe/fictional_character, /music/artist, /tv/tv_actor, /location, /film/film_series
Website 2: /film, /location, /book/book, /film/actor, /business/employer, /fictional_universe/work_of_fiction, /fictional_universe/fictional_character, /tv_program, /architecture/building_function, /film/film_series
Website 3: /location, /business/employer, /film, /film/actor, /organization/organization, /architecture/building_function, /people/person, /tv_program, /tv/tv_network, /internet/website_category
Website 4: /business/employer, /film, /tv_program, /tv/tv_series_season, /film/actor, /cvg/cvg_platform, /people/person, /computer/software, /tv/tv_network, /book/book
[Figure: two panels (Website A, Website B). Y-axis: proportion of queries.]
Predicting website abandonment
• 3 Classification Tasks: Given a (part of a) session in which a user is lost/gained, predict...
1. ...whether a user will be gained for a given website.
2. ...given that the session includes a given website, whether this website is in the losing or gaining position.
3. ...given that the session includes two given websites, which one is in the gaining position.
• Gradient Boosted Decision Trees.
Discussion and future work
• Mining patterns of entire queries runs into data sparsity.
• We interpret the structure and semantics of the queries, using openly available, up-to-date information on the Web.
‣ give a “semantic” definition of navigational and ‘losing’ queries
‣ find patterns of user behavior
‣ predict website abandonment
• This is the beginning:
‣ Use more properties of entities, more features.
‣ Detect more complex patterns.
‣ Explore other linked open datasets.
Thank you!
Questions?