Implementing improved and consistent arbitrary event tracking company-wide using Snowplow
Nora Paymer, Sr. Business & Consumer Insights Analyst, StumbleUpon
10/6/2015, SF Snowplow MeetUp
About me
• Hi, I’m Nora
• BS & MA in Cognitive Neuroscience
  – Ask me about sign/speech bilingualism or optical illusions in the brain!
• Previous Roles:
  – UC Berkeley: Institutional Analytics
  – CBS Interactive: Inventory Analytics
  – SquareTrade: Marketing/Consumer Insights Analytics
• StumbleUpon: Business & Product Analytics
About StumbleUpon
• What is StumbleUpon?
  – Recommendation Engine for the Internet
  – Ad Platform for native advertisement
  – Social engagement platform
• Still #4 in Referral Traffic* (behind Facebook, Twitter, and Pinterest; ahead of Reddit)
• Still alive and kicking!
*Shareaholic, Q4 2014 (most recent data available)
My Role
• Data Science Team & Finance/Sales Analytics Team, but no dedicated Product or Business Analytics
• When I was hired, I was asked to:
  – Help the Product team build a data-driven culture
  – Make data more available company-wide
    • Better & easier to change dashboards
    • Ability for non-data people to access data
  – Help clean up Data Pipelines
    • With support from amazing Data Engineering Team
• Other data all over the place
• No way to integrate with user/stumble/activity data
• Only accessible by a couple people each
• Only place to access most real site data
• Dashboards all made with R/Shiny
• Queries done at terminal, only by Data Science/Analytics Team
• Hive/MapReduce is slow for real-time data querying!
Data sources
Protobuf messages
MySQL
HBase/Hive
MixPanel
FireBase
Adjust
App Annie
Desk.com
SalesForce
StrongView
Solutions
1. Copy product data to quicker/more universal data solution
2. Implement BI tool (Looker)
Data sources
Protobuf messages
MySQL
HBase/Hive
MixPanel
FireBase
Adjust
App Annie
Desk.com
SalesForce
• Send data to RedShift for faster querying
• Connect RedShift to Looker:
  – Dashboards
  – GUI Query Builder
RedShift
Looker
StrongView
Problems
1. Data siloed all over the place
2. Data inaccessible to most people
3. Difficult for teams to add new events
  – Only “official” solution was protobuf messages, which was slow and needed to go through Engineering/Data Science/Me just to record a button click
  – Teams started using MixPanel, which is expensive and limited
Solutions
1. Copy product data to quicker/more universal data solution
2. Implement BI tool (Looker)
3. Replace MixPanel with Snowplow for arbitrary Event Reporting
  – Sends data to RedShift for easy integration with other data
  – Easy for teams to add new events
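
To illustrate why landing events in the same warehouse as product data helps, here is a minimal sketch of joining event rows to user rows. It uses an in-memory SQLite database purely as a stand-in for RedShift; the table names, columns, and sample rows are hypothetical and not taken from the talk.

```python
import sqlite3

# In-memory SQLite stands in for RedShift here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical product table and hypothetical Snowplow event table.
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, signup_platform TEXT);
    CREATE TABLE snowplow_events (user_id INTEGER, event_name TEXT);
    INSERT INTO users VALUES (1, 'iOS'), (2, 'site');
    INSERT INTO snowplow_events VALUES (1, 'thumbup'), (1, 'thumbup'), (2, 'share');
""")

# Once both tables live in one database, a plain join answers
# "which events come from which kind of user?"
rows = conn.execute("""
    SELECT u.signup_platform, e.event_name, COUNT(*) AS n
    FROM snowplow_events AS e
    JOIN users AS u ON u.user_id = e.user_id
    GROUP BY u.signup_platform, e.event_name
    ORDER BY u.signup_platform, e.event_name
""").fetchall()

for platform, event_name, n in rows:
    print(platform, event_name, n)
```

With the data in a separate tool, the same question would need an export and a manual match on user id.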
Data Sources
Protobuf messages
MySQL
HBase/Hive
MixPanel
FireBase
Adjust
App Annie
Desk.com
SalesForce
RedShift
Looker
Snowplow
StrongView
Problems
1. Data siloed all over the place
2. Data inaccessible to most people
3. Difficult for teams to add new events
4. So many teams! So much integration!
  – Mobile (iOS & Android), Site (back end & front end), Ads, Marketing (including install referral info & email marketing & other), Firefox & Chrome toolbars, etc. etc.
How we did it
Intended Plan:
1. Site implements default page tracker
2. Site implements 2-3 events to make sure flow is working properly
  – Structured Events
3. Assess if everything is working
4. Mobile implements 2-3 events per platform
5. Then roll out everywhere
How we did it
What Actually Happened:
1. Site implemented default page tracker
2. Site implemented ~100 events
  – Structured Events
3. Mobile replaced all MixPanel events with Snowplow
  – Structured Events
  – Some trouble with implementation/integration with Android
  – Used wiki page created by a site engineer, had confusing language, did some things weirdly
4. Testing??
Uh-Oh
• Structured Events not really the right thing:

Snowplow Term   Our Use
Category        Event Name (e.g. thumbup)
Action          Event Type (e.g. click vs view)
Label           Platform (site, iOS…)
Property        Version #
Value           When event had a value associated with it

• Didn’t have userid implemented properly originally
• More fields were going to be needed
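
The mapping above can be sketched in a few lines. This is an illustration only: it builds a plain dictionary rather than calling any Snowplow tracker, and the function name and sample version string are hypothetical.

```python
from typing import Optional


def build_structured_event(event_name: str, event_type: str, platform: str,
                           version: str, value: Optional[float] = None) -> dict:
    """Pack our fields into the five structured-event slots from the table."""
    event = {
        "category": event_name,  # Event Name, e.g. "thumbup"
        "action": event_type,    # Event Type, e.g. "click" vs "view"
        "label": platform,       # Platform, e.g. "site", "iOS"
        "property": version,     # Version #
    }
    if value is not None:
        event["value"] = value   # only when the event carries a value
    return event


# Hypothetical example event; "4.2.1" is a made-up version string.
thumbup = build_structured_event("thumbup", "click", "iOS", "4.2.1")
print(thumbup)
```

All five slots are already spoken for, so there is nowhere to put a user id or any field added later.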
So? Switch to Unstructured Events! Easy, right?
• OK great, come up with a new framework for Unstructured Events!
  – Some required fields across all events
  – Some optional fields that we know will be widely used from day 1
  – Nature of unstructured events is that more fields could be added later

Field           Req’d?  Description
event_name      y       Event name
platform        y       site, iOS, Android, etc.
device_version  y       Version number (standard field)
event_category  n       e.g. click, view; useful for filtering
event_group     n       For defining a group of events, for filtering
value           n       For events with a value
referrer        n       Referral source (when applicable)
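
A minimal sketch of how the framework above could be enforced before an event is sent. It assumes the event is wrapped as a self-describing JSON with `schema` and `data` keys, which is how Snowplow represents unstructured events; the schema URI and function name are hypothetical, since the talk does not give the real ones.

```python
REQUIRED_FIELDS = {"event_name", "platform", "device_version"}
OPTIONAL_FIELDS = {"event_category", "event_group", "value", "referrer"}

# Hypothetical schema URI; the real one is not given in the talk.
SCHEMA_URI = "iglu:com.example/generic_event/jsonschema/1-0-0"


def build_unstructured_event(fields: dict, allow_extra: bool = True) -> dict:
    """Check required fields, then wrap the event as {"schema", "data"}."""
    missing = REQUIRED_FIELDS - set(fields)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if not allow_extra:
        # Strict mode: reject anything outside the agreed field list.
        unknown = set(fields) - REQUIRED_FIELDS - OPTIONAL_FIELDS
        if unknown:
            raise ValueError(f"unknown fields: {sorted(unknown)}")
    return {"schema": SCHEMA_URI, "data": dict(fields)}


event = build_unstructured_event({
    "event_name": "thumbup",
    "platform": "iOS",
    "device_version": "4.2.1",   # made-up version string
    "event_category": "click",
})
print(event)
```

Leaving `allow_extra` on by default matches the point that more fields can be added later; turning it off is one way to keep teams on a shared format.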
Sounds good so far!
• Teams that had already implemented Structured Events did not want to implement Unstructured Events
  – They had already spent Eng time on this, why spend more?
    • Everyone is always on a tight timeline
  – Had trouble seeing the value in the format of their events matching the format of teams they didn’t work with
• Result? Arguments and top-down mandates
What should we have done differently?
1. Program management across all teams
  – Didn’t have anyone officially in charge
2. Implement in phases: do test events & a test project before going full live
3. Excellent Documentation
4. Get buy-in from everyone from day one
5. Think through dream/far-fetched use cases: what will you need for that?
6. Use Snowplow team for advice!
So now what?
• Still working on it
• Connecting all existing data pipelines to RedShift, sometimes via Snowplow
• Better utilizing Snowplow when back end tracking is too cumbersome
  – Referral Tracking: both reg and landing page
  – Better understanding of engagement and Time on Site (for non-stumble pages especially)
  – Understanding user flow through the site
  – Etc. etc. etc, hopefully!
Protobuf messages
MySQL
HBase/Hive
MixPanel
FireBase
Adjust
App Annie
Desk.com
SalesForce
RedShift
Looker
Snowplow
StrongView
New Data!