clickstream data with spark
Clustering click-stream data using Spark
Marissa Saunders
Slides available at: http://www.slideshare.net/MarissaSaunders/clickstream-data-with-spark
Agenda
• Why?
  – Why clustering?
  – Why Spark?
  – Why click-stream?
• What?
  – What is the raw data?
• How?
  – Parsing user agent data on Spark
  – Distributed K-modes on Spark
• So what?
  – Details of applying the method to this use case
  – Resulting clusters
  – Time access patterns
  – Preferred websites
• Questions
Objectives
Understand:
• k-means and k-modes clustering
• why Spark is a good choice
• different data structures in Spark
  – RDD, dataframe and dataset
• clickstream data and how user-agent parsing works
Demonstrate:
• mapping a function over an RDD
• defining a custom UDF and mapping it over a dataframe
• mapping a python function over a partition
• how identifying different user types can drive insight into user behavior
Why Clustering
Why clustering?
We have a plot like this …
• 2 groups of data
• Clustering can find them
• This can lead to insight …
  – There are two different groups of unladen swallows
  – The heavy species flies more slowly
  – When asking for airspeed, we should specify if we mean African or European swallows
… with apologies to Monty Python
[Figure: Flight velocities vs. bird mass, colored by bird type]
How does it work?
For 2 clusters:
1. Pick 2 points at random as centroids
2. Cluster data based on closest point
3. Calculate the mean of each cluster as centroids
4. Repeat 2 and 3 to convergence
This is called K-means clustering … and there is a Spark function for this
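The four steps can be sketched in plain Python. This is a minimal single-machine illustration, not the Spark function the slide refers to; the function name `kmeans`, the fixed seed and the swallow numbers are made up for the example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of (x, y) tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: random starting centroids
    labels = None
    for _ in range(max_iter):
        # step 2: assign every point to its closest centroid (squared distance)
        new_labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:               # step 4: stop once assignments are stable
            break
        labels = new_labels
        # step 3: move each centroid to the mean of its cluster
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                        # an empty cluster keeps its old centroid
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return centroids, labels

# Made-up (mass, airspeed) values for two kinds of swallow
swallows = [(20, 11.0), (21, 10.5), (19, 11.2), (40, 8.0), (42, 7.5), (41, 8.2)]
centroids, labels = kmeans(swallows, k=2)
```

The light birds end up in one cluster and the heavy birds in the other; which cluster gets label 0 depends on the random start.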
What about categorical data?
• Use modes instead of means
  – Most frequently occurring value
• Use binary distance metric for each dimension
  – 0 = the same
  – 1 = not the same
• Use the same iterative cluster assignment algorithm

Color        Mass   Speed  Type
Green/Grey   Heavy  Slow   African
Green/Grey   Heavy  Fast   African
Green/Black  Heavy  Slow   African
Green/Grey   Light  Slow   African
Blue/White   Heavy  Fast   European
Blue/White   Light  Fast   European
Blue/Grey    Light  Slow   European
Blue/White   Light  Fast   European

This is called K-modes clustering … and we've open-sourced a Spark function for this
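The two building blocks of K-modes, the column-wise mode and the binary (Hamming) distance, can be sketched on the swallow rows. This is only an illustration, not the open-sourced package: the helper names are made up, and the modes are computed from the known Type column rather than found iteratively.

```python
from collections import Counter

def hamming(a, b):
    """Binary distance per dimension: 0 if the values match, 1 if not, summed."""
    return sum(x != y for x, y in zip(a, b))

def mode_of(records):
    """Column-wise most frequent value of a list of equal-length tuples."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

# (color, mass, speed) rows from the swallow table
african = [("Green/Grey", "Heavy", "Slow"), ("Green/Grey", "Heavy", "Fast"),
           ("Green/Black", "Heavy", "Slow"), ("Green/Grey", "Light", "Slow")]
european = [("Blue/White", "Heavy", "Fast"), ("Blue/White", "Light", "Fast"),
            ("Blue/Grey", "Light", "Slow"), ("Blue/White", "Light", "Fast")]

modes = [mode_of(african), mode_of(european)]

def assign(record):
    """Index of the closest mode under the Hamming distance."""
    return min(range(len(modes)), key=lambda i: hamming(record, modes[i]))
```

The full algorithm alternates `assign` over all records with `mode_of` over each resulting cluster, exactly as K-means alternates assignment and averaging.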
Why Spark?
What is Spark?
"Apache Spark™ is a fast and general engine for large-scale data processing." - spark.apache.org
• Distributed computing
• Typically reads from HDFS (or other DFS)
• In-memory
• Optimized execution
• High level functionality
Why Spark?
• Take the computation to the data
• Spark works faster on partitioned data than map-reduce
  – In-memory operation avoids disk I/O costs between steps
  – DAG optimization reduces computational costs
• Fast to develop
  – Data transformation and machine learning libraries are part of Spark
It is FAST
[Figure: data blocks 1-8 spread over the cluster; see http://spark.apache.org/docs/latest/cluster-overview.html]
Basic data structures in Spark
• Resilient Distributed Dataset (RDD)
• Dataframe = RDD with a schema
  – SQL-style syntax
  – Refer to column by name
  – Optimized queries
• Dataset = best of both worlds?!?
What makes it resilient?
• Multiple copies
• Stores lineage
[Figure: the full data set (Blocks 1-8) split over four machines, each block stored on two of them]
A little terminology …
• node
• partition
• record
[Figure: the full data set split into blocks over four machines, with a node, a partition and a record labeled]
Why Clickstream?
What is clickstream data?
• Information trail left behind by each user
• Semi-structured website log files
• Includes:
  – User agent information
    - Device
    - OS
    - Browser
  – Geo information
    - Timezone
    - Lat/Longitude
    - City
    - Country
  – Time of access
  – Referring website
  – Website accessed
Photo credit: Tim Franklin Photography via Foter.com
What is this good for?
• Web analytics can answer questions like:
  – How long do users take from first visit to purchase?
  – When do users visit the website?
  – What marketing channels are effective in attracting users?
  – Where are users located?
  – What are the paths that users take through the website?
  – How long do users stay on a specific page?
  – Which pages draw the most users?
  – etc…
The sample use case
Clickstream data from 1usagov
– Created whenever anyone shortens a .gov or .mil site with bitly
– Feed at http://developer.usa.gov/1usagov
– Archive for 2011-2013: http://bitly.measuredvoice.com/bitly_archive/?C=M;O=D
Why this is a great dataset:
– Large volume
– Realistic format
  - Streaming
  - Not cleaned
– Interesting questions
  - What subtypes of users are there?
  - How do the activity patterns of these subtypes differ?
– Publicly available archive
What is the raw data?
{"h":"1rzB4JL","g":"1laU0gx","l":"anonymous","hh":"1.usa.gov","u":"http://www.cdc.gov/cdcgrandrounds/index.htm","r":"direct","a":"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)","i":"","t":1460753233,"k":"","nk":0,"hc":1413468615,"_id":"52d237f4-8c0e-0ac2-a0ed-a32acabe05bb","al":"en-US","c":"US","ll":[38,-97],"sl":"1rzB4JL"}
• json format
• Fields include:
  – Website clicked/long url ("u")
  – Referring url ("r")
  – User agent – what machine is this? ("a")
  – Time accessed ("t")
  – etc…
Parsing click stream data on Spark
High level picture
• Need to extract:
  – Time in date, hours
  – Information about the user:
    - Device type
    - OS
    - Timezone
  – Main domain of the url
  – Referring url
• Do this for one record in python
• Map this function over all records using Spark
Input record:
{"h":"1rzB4JL","g":"1laU0gx","l":"anonymous","hh":"1.usa.gov","u":"http://www.cdc.gov/cdcgrandrounds/index.htm","r":"direct","a":"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)","i":"","t":1460753233,"k":"","nk":0,"hc":1413468615,"_id":"52d237f4-8c0e-0ac2-a0ed-a32acabe05bb","al":"en-US","c":"US","ll":[38,-97],"sl":"1rzB4JL","tz":"America/New_York"}
Parsed output:
Day: Friday
Local_hour: 16
Device_type: pc
Browser: IE
OS: Windows 7
Is_bot: false
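The time and URL part of "do this for one record in python" might look like the sketch below. The function name `parse_record` and its `utc_offset_hours` parameter are assumptions made for this example; a fixed offset stands in for a proper lookup of the record's "tz" field, and the record is trimmed to the fields used.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def parse_record(line, utc_offset_hours=0):
    """Extract day, hour, clicked domain and referrer from one JSON click record."""
    rec = json.loads(line)
    # "t" is a Unix timestamp in seconds; shift it by a fixed offset (assumption)
    tz = timezone(timedelta(hours=utc_offset_hours))
    when = datetime.fromtimestamp(rec["t"], tz)
    return {
        "day": DAYS[when.weekday()],
        "hour": when.hour,
        "domain": urlparse(rec["u"]).netloc,   # main domain of the long url
        "referrer": rec["r"],
    }

# Trimmed version of the sample record
line = ('{"u":"http://www.cdc.gov/cdcgrandrounds/index.htm","r":"direct",'
        '"t":1460753233,"tz":"America/New_York"}')
parsed = parse_record(line, utc_offset_hours=-4)   # -4 h: New York daylight time (assumption)
```

On Spark the same function would be mapped over an RDD of raw lines, for example `sc.textFile(path).map(parse_record)`, where `path` is a placeholder for the log location.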
Actual transformation
• Define parsing function
  – Leverage user-agents library
  – Apply custom function to user agent string
  – Keep every entry as item in list
• Map parsing function over RDD
  – Apply user_agent library for every record
  – Result: RDD containing parsed json data
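The slides rely on the user-agents library for this step. The sketch below is a much cruder stand-in built on string matching, so that it runs without extra packages; the function name, the token lists and the version table are assumptions for illustration and do not reproduce the library's logic.

```python
import re

# Windows NT version tokens mapped to product names (partial list)
WINDOWS_VERSIONS = {"5.1": "Windows XP", "6.0": "Windows Vista",
                    "6.1": "Windows 7", "6.2": "Windows 8"}

def classify_user_agent(ua):
    """Very rough device/browser/OS guess from a user-agent string."""
    lowered = ua.lower()
    is_bot = any(token in lowered for token in ("bot", "spider", "crawler"))
    is_mobile = any(token in lowered for token in ("iphone", "ipad", "android", "mobile"))

    nt = re.search(r"Windows NT (\d+\.\d+)", ua)
    if nt:
        os_name = WINDOWS_VERSIONS.get(nt.group(1), "Windows other")
    elif "iphone" in lowered or "ipad" in lowered:
        os_name = "iOS"
    elif "android" in lowered:
        os_name = "Android"
    elif "mac os x" in lowered:
        os_name = "Mac OS X"
    else:
        os_name = "Other"

    if "MSIE" in ua or "Trident/" in ua:
        browser = "IE"
    elif "Firefox/" in ua:
        browser = "Firefox"
    elif "Chrome/" in ua:
        browser = "Chrome"
    elif "Safari/" in ua:
        browser = "Safari"
    else:
        browser = "Other"

    device = "bot" if is_bot else "mobile" if is_mobile else "pc"
    # Keep every entry as an item in a list, as on the slide
    return [device, browser, os_name, is_bot]

ua = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
features = classify_user_agent(ua)
```

For the sample record this gives a pc running IE on Windows 7 that is not a bot, matching the parsed output shown earlier in the deck.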
Distributed K-modes
How does clustering have to change to be distributed?
K-means example:
• Clustering is a collective operation. How can we distribute it?
• Do k-means on each partition
• Cluster the collected centroids
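The two-stage idea can be sketched without Spark, using plain lists as stand-ins for partitions. This is a simplified one-dimensional illustration under stated assumptions: `kmeans_1d` is a made-up helper with a deterministic start, and the second stage treats every local centroid equally, ignoring how many records it represents.

```python
def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means; initial centroids are spread over the sorted values."""
    ordered = sorted(values)
    # pick k starting centroids at evenly spaced ranks (deterministic, for the sketch)
    centroids = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # an empty cluster keeps its previous centroid
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Each inner list stands in for one partition of the data set
partitions = [[1.0, 1.2, 9.8, 10.1], [0.9, 10.3, 9.9], [1.1, 0.8, 10.0]]

# Stage 1: cluster every partition on its own (would run in parallel on the nodes)
local_centroids = [c for part in partitions for c in kmeans_1d(part, k=2)]

# Stage 2: cluster the collected centroids to get the global result
global_centroids = kmeans_1d(local_centroids, k=2)
```

Only the handful of local centroids has to travel to one place; the records themselves never leave their partition.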
Mapping over data in Spark
• Map over a record:
    def f(record):
        return transform(record)
    rdd2 = rdd1.map(f)
What is the equivalent for a partition? Spark has two possibilities:
1. mapPartitions:
  – get each record in turn and do something; return after all records are done
2. mapPartitionsWithIndex:
  – keep track of which partition returned which result
• Map over a partition:
    def f(iterator):
        yield cluster(iterator)
    rdd2 = rdd1.mapPartitions(f)
• Map over a partition with a partition key:
    def f(splitIndex, iterator):
        yield (splitIndex, cluster(iterator))
    rdd2 = rdd1.mapPartitionsWithIndex(f)
For K-modes, we have open-sourced an implementation of distributed clustering: https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes
Iterator = cycle once through each record
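A partition function has to consume its iterator in a single pass and then yield its result. The sketch below shows that shape with a per-partition mode instead of a full clustering; `local_mode` is a made-up name, and plain lists stand in for the partitions so the example runs without a Spark installation.

```python
from collections import Counter

def local_mode(split_index, iterator):
    """Shaped like a function passed to mapPartitionsWithIndex:
    consume the partition's records once, then yield a single result."""
    columns = None
    for record in iterator:                    # one pass through the iterator
        if columns is None:
            columns = [Counter() for _ in record]
        for counter, value in zip(columns, record):
            counter[value] += 1
    if columns is not None:                    # an empty partition yields nothing
        mode = tuple(c.most_common(1)[0][0] for c in columns)
        yield (split_index, mode)

# Plain lists stand in for the partitions of an RDD (assumption for this sketch)
partitions = [
    [("pc", "Windows"), ("pc", "Windows"), ("mobile", "iOS")],
    [("mobile", "iOS"), ("mobile", "Android"), ("mobile", "iOS")],
]
results = [out for i, part in enumerate(partitions)
           for out in local_mode(i, iter(part))]
```

With a real RDD the same function would be passed as `rdd.mapPartitionsWithIndex(local_mode)`, and the index in each result tells which partition produced it.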
Applying to 1USAGOV data
Getting 1usagov clickstream data
• Scrape data from archive site:
  – http://1usagov.measuredvoice.com/
  – json format
• Concatenate into files by month
• Store in HDFS
• Load into Spark
Loading json data
Parse to extract user agent information
• Python package user_agents
  – Input string -> output information
• Add some custom parsing to extract features
  – os family, os_version, device
• Use spark to map this over each clickstream entry
Prepare for K-modes clustering
The CURSE of dimensionality ….
To reduce dimensionality:
• Decide which variables to use for clustering
• Keep only the top few categories for each variable
[Figure: Prasad Patil, as referenced on http://www.newsnshit.com/curse-of-dimensionality-interactive-demo/]
Prepare for K-modes clustering
• Decide which variables to use for clustering
  – Country
  – Timezone
  – Device Type
  – OS
  – Browser
• Keep only the top few categories for each variable
  – Custom UDF for Spark dataframes
  – Apply a series of UDFs
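"Keep only the top few categories" can be sketched on a plain list. The function name `keep_top_categories` and the catch-all label "Other" are assumptions for this example; the slides do this with custom UDFs on a Spark dataframe, whose code is not in the transcript.

```python
from collections import Counter

def keep_top_categories(values, n, other="Other"):
    """Replace every value outside the n most frequent ones with a catch-all label."""
    top = {value for value, _ in Counter(values).most_common(n)}
    return [v if v in top else other for v in values]

browsers = ["Firefox", "Chrome", "IE", "Firefox", "Opera", "Chrome", "Firefox", "Lynx"]
reduced = keep_top_categories(browsers, n=2)
```

On a dataframe, the set of top values would be computed once per column and the lookup wrapped with `pyspark.sql.functions.udf`, one UDF per variable.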
Perform distributed k-modes clustering
• Uses open-source package https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes
• Parameters: # of modes, max. iterations
[Figure: Full log -> (create RDD) -> Partitions -> (local clustering) -> Centroids per partition -> (distributed clustering) -> Centroids]
Clustering results: 10 clusters
What do the clusters look like?

#   Size    Country         Timezone        Device Type  OS                     Browser
1   617820  US: 93%         US/NY: 53%      PC: 97%      Win 7: 75%             Firefox: 57%
2   226035  NotUS: 68%      Other: 57%      Mobile: 75%  iOS: 84%               MobileSafari: 78%
3   152053  NoGeoInfo: 86%  NoGeoInfo: 86%  PC: 99%      Windows: 81%           Chrome/IE: 72%
4   161947  US: 96%         US/NY: 60%      PC: 99%      Windows not 7: 99%     IE: 81%
5   105090  NoGeoInfo: 76%  NoGeoInfo: 76%  Mobile: 70%  Other: 70%             Other: 99%
6   235719  NotUS: 99%      Other: 89%      PC: 99%      Win 7: 68%             Chrome: 51%
7   121464  US: 100%        US/LA: 59%      PC: 95%      MacOSX: 72%            Chrome: 54%
8   121115  US: 48%         NoGeoInfo: 40%  Mobile: 93%  Android: 100%          Android: 99%
9   101052  NotUS: 98%      Other: 90%      PC: 100%     Win other than 7: 84%  Firefox: 57%
10  173424  US: 100%        US/NY: 48%      Mobile: 68%  iOS: 100%              MobileSafari: 74%
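Each row of the table reads as a cluster size plus, per variable, the most common value and its share of the cluster. Assuming that reading, a profile of this kind could be computed as sketched below; the function name and the toy records are made up for illustration.

```python
from collections import Counter, defaultdict

def profile_clusters(records, labels):
    """For each cluster: its size and, per column, the top value with its share in %."""
    grouped = defaultdict(list)
    for record, label in zip(records, labels):
        grouped[label].append(record)
    profile = {}
    for label, members in grouped.items():
        dominant = []
        for column in zip(*members):
            value, count = Counter(column).most_common(1)[0]
            dominant.append((value, round(100 * count / len(members))))
        profile[label] = {"size": len(members), "dominant": dominant}
    return profile

# Made-up (country, device) records with made-up cluster labels
records = [("US", "pc"), ("US", "pc"), ("US", "mobile"), ("NotUS", "mobile"),
           ("NotUS", "mobile"), ("US", "pc")]
labels = [0, 0, 0, 1, 1, 0]
profile = profile_clusters(records, labels)
```

A low share, such as the 48% in some rows of the table, signals that the cluster is mixed on that variable and its mode should be read with care.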
Access patterns
[Figures: time access patterns per cluster]
Top sites visited: January 2012

Description             Top 3 domains
US, pc, Win7            www.nysdot.gov 212K, www.nasa.gov 59K, www.fda.gov 18K
US, pc, Win_not7, IE    www.nasa.gov 15K, www.shrewsbury-ma.gov 9K, www.fda.gov 5K
US, pc, Mac OS X        www.nysdot.gov 29K, www.nasa.gov 16K, www.whitehouse.gov 6K
notUS, pc, Win7         www.nasa.gov 87K, earthobservatory.nasa.gov 15K, www.nysdot.gov 14K
notUS, pc, Win_not7     www.nasa.gov 30K, www.navy.mil 8K, globalhealth.gov 7K
noGeo, pc, Win, Chrome  www.nasa.gov 34K, www.nysdot.gov 17K, earthobservatory.nasa.gov 6K
US, mobile, iOS         www.nasa.gov 33K, earthobservatory.nasa.gov 11K, forecast.weather.gov 9K
notUS, mobile, iOS      www.nasa.gov 82K, earthobservatory.nasa.gov 24K, www.navy.mil 13K
Mobile, Android         www.nasa.gov 29K, earthobservatory.nasa.gov 9K, www.navy.mil 6K
noGeo, mobile, OtherOS  www.nasa.gov 24K, www.nysdot.gov 8K, www.army.mil 5K
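A ranking like this is a count of clicked domains grouped by cluster. The sketch below shows the aggregation on a handful of made-up clicks; the function name `top_domains` is an assumption, and the real analysis would run the equivalent grouping on Spark.

```python
from collections import Counter, defaultdict

def top_domains(clicks, n=3):
    """clicks: iterable of (cluster_description, domain); returns the top n domains per cluster."""
    counts = defaultdict(Counter)
    for cluster, domain in clicks:
        counts[cluster][domain] += 1
    return {cluster: counter.most_common(n) for cluster, counter in counts.items()}

# Made-up clicks, far smaller than the real January 2012 volumes
clicks = [("US, pc, Win7", "www.nysdot.gov"), ("US, pc, Win7", "www.nasa.gov"),
          ("US, pc, Win7", "www.nysdot.gov"), ("notUS, mobile, iOS", "www.nasa.gov"),
          ("notUS, mobile, iOS", "www.navy.mil"), ("notUS, mobile, iOS", "www.nasa.gov")]
ranking = top_domains(clicks)
```

Swapping the clicked domain for the referring domain gives the "where do users come from" view with the same code.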
Where do users come from: January 2012

Description             Top 3 domains
US, pc, Win7            direct 342K, t.co 135K, www.facebook.com 67K
US, pc, Win_not7, IE    direct 69K, t.co 33K, www.facebook.com 19K
US, pc, Mac OS X        t.co 49K, direct 41K, www.facebook.com 15K
notUS, pc, Win7         t.co 125K, www.facebook.com 45K, direct 38K
notUS, pc, Win_not7     t.co 41K, direct 29K, www.facebook.com 14K
noGeo, pc, Win, Chrome  t.co 56K, direct 47K, www.facebook.com 24K
US, mobile, iOS         twitter.com 83K, direct 59K, m.facebook.com 17K
notUS, mobile, iOS      twitter.com 119K, direct 69K, t.co 21K
Mobile, Android         t.co 62K, direct 34K, m.facebook.com 17K
noGeo, mobile, OtherOS  direct 63K, t.co 20K, m.facebook.com 13K
What happened in space that had the twitter-sphere abuzz in January 2012?
Solar Flares!
• Especially non-US users
• to: www.nasa.gov, earthobservatory.nasa.gov
• from: Twitter
http://earthobservatory.nasa.gov/NaturalHazards/view.php?id=76998
Summary
• Data processing operations, like parsing user-agent strings, can be distributed using Spark
• Clustering of large data sets can be distributed using Spark
• Clustering finds groups of related users/records
• These user types show distinct behaviors
• Segmenting users can drive insight and facilitate appropriate messaging
  – When are they visiting?
  – Where are they looking?
  – Where are they coming from?
[Figure: Web log data -> User information -> User groups -> Targeted message]
Questions?
Slides available at: http://www.slideshare.net/MarissaSaunders/clickstream-data-with-spark
Distributed K-modes clustering for pyspark: https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes