Adding Data Schemas to Snowplow
Big Data Budapest Meetup - 5 June 2014
Agenda today
1. Introduction to Snowplow
2. Evolution of Snowplow
3. The answer: schema all the things!
4. Snowplow roadmap
5. Questions
Introduction to Snowplow
Snowplow is an open-source web and event analytics platform, first version released in early 2012
• Co-founders Alex Dean and Yali Sassoon met at OpenX, the open-source ad technology business, in 2008
• After leaving OpenX, Alex and Yali set up Keplar, a niche digital product and analytics consultancy
• We released Snowplow as a skunkworks prototype at start of 2012:
github.com/snowplow/snowplow
• We started working full time on Snowplow in summer 2013
We wanted to take a fresh approach to web analytics
• Your own web event data -> in your own data warehouse
• Your own event data model
• Slice / dice and mine the data in highly bespoke ways to answer your specific business questions
• Plug in the broadest possible set of analysis tools to drive value from your data
Data warehouse Data pipeline
Analyse your data in any analysis tool
By spring 2013 we had arrived at a relatively stable batch-based processing architecture
[Architecture diagram. Components: Website / webapp with JavaScript event tracker; CloudFront-based event collector or Clojure-based event collector; Snowplow Hadoop data pipeline with Scalding-based enrichment on Hadoop; Amazon S3; Amazon Redshift / PostgreSQL]
Evolution of Snowplow
Snowplow is evolving from a web analytics platform into a general event analytics platform
Data warehouse
Collect event data from any connected device
Web analysts work with a small number of event types – outside of web, the number of possible event types is… infinite
Web events:
• Page view • Order • Add to basket • Page activity
All events:
• Game saved • Machine broke • Car started
• Spellcheck run • Screenshot taken • Fridge empty
• App crashed • Disk full • SMS sent
• Screen viewed • Tweet drafted • Player died
• Taxi arrived • Phonecall ended • Cluster started
• Till opened • Product returned • ∞
There are two historic approaches to dealing with the explosion of possible event types
• Web analytics vendors: Custom Variables
• Mobile and app analytics vendors: Schema-less JSONs
Custom variables are very restrictive
1. Take a standard web event, like a page view:
Page View
2. and add custom variables until it becomes something totally different:
Page View + vehicle=taxi23 + status=arrived
= a “taxi arrived” event, kind of!
Schema-less JSONs are better, but they have a different set of problems
Issues with the event name:
• Separate from the event properties
• Not versioned
• Not unique – HBO video played versus Brightcove video played
Lots of unanswered questions about the properties:
• Is length required, and is it always a number?
• Is id required, and is it always a string?
• What other optional properties are allowed for a video play?
Other issues:
• What if the developer accidentally starts sending “len” instead of “length”? The data will end up split across two separate fields
• Why does the analyst need to keep an implicit schema in their head to analyze video played events?
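To make the "len" versus "length" problem concrete, here is a minimal Python sketch. The three events are hypothetical (the event shape, ids and values are invented for illustration, not taken from a real tracker); the point is only that nothing stops the property names and types from drifting.

```python
from collections import Counter

# Hypothetical schema-less "video played" events; names and values are
# illustrative assumptions, not output from a real tracker.
events = [
    {"event": "video played", "properties": {"id": "vid-1", "length": 213}},
    {"event": "video played", "properties": {"id": "vid-2", "len": 95}},
    {"event": "video played", "properties": {"id": "vid-3", "length": "3:10"}},
]

# Count which (property name, value type) pairs actually show up.
seen = Counter(
    (key, type(value).__name__)
    for event in events
    for key, value in event["properties"].items()
)

# The same logical field now appears under two names and two types.
for (key, type_name), count in sorted(seen.items()):
    print(f"{key:<8}{type_name:<6}{count}")
```

Without a schema, the analyst only discovers this split by inspecting the data after the fact.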
The answer: schema all the things!
When a developer or analyst defines a new event in JSON, let’s ask them to create a JSON Schema for that event
Additional optional field we might not know about otherwise
No other fields allowed
Yes length should always be a number
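The three annotations above can be sketched as a schema plus a check. This is a rough illustration only: the schema is a guess at what a "video played" schema might contain (the `quality` field and the choice of required fields are assumptions), and `validate` is a hand-rolled helper covering just `required`, `type` and `additionalProperties` on flat objects. A real pipeline would use a full JSON Schema validator.

```python
# Hypothetical JSON Schema for a "video played" event, as a Python dict.
VIDEO_PLAYED_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "length": {"type": "number"},   # length should always be a number
        "quality": {"type": "string"},  # assumed optional field
    },
    "required": ["id", "length"],       # assumption: both are required
    "additionalProperties": False,      # no other fields allowed
}

# Only the two JSON types this sketch needs.
PYTHON_TYPES = {"string": str, "number": (int, float)}


def validate(event, schema):
    """Return a list of error strings; an empty list means the event passed.

    Illustrative only: handles flat objects, not the full JSON Schema spec.
    """
    errors = []
    for name in schema.get("required", []):
        if name not in event:
            errors.append(f"missing required property: {name}")
    for name, value in event.items():
        spec = schema["properties"].get(name)
        if spec is None:
            if schema.get("additionalProperties", True) is False:
                errors.append(f"unexpected property: {name}")
            continue
        expected = PYTHON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{name} should be of type {spec['type']}")
    return errors


print(validate({"id": "vid-1", "length": 213}, VIDEO_PLAYED_SCHEMA))
print(validate({"id": "vid-1", "len": 213}, VIDEO_PLAYED_SCHEMA))
```

With a schema in place, the "len" typo is rejected at validation time instead of silently creating a second field.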
But we need to let our event definitions evolve, so let’s add versioning – we’re calling this SchemaVer
MODEL-REVISION-ADDITION
• Start versioning at 1-0-0 – so 1-0-0, 1-0-1, 1-0-2, 1-1-0 etc
• Try to stick to backwards-compatible ADDITION upgrades as much as possible
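A small Python sketch of how a MODEL-REVISION-ADDITION version string might be parsed and bumped. The class and method names are hypothetical, and resetting the lower parts to zero on a bump is inferred from the example sequence 1-0-0, 1-0-1, 1-0-2, 1-1-0 rather than stated on the slide.

```python
import re
from typing import NamedTuple


class SchemaVer(NamedTuple):
    """A MODEL-REVISION-ADDITION version (hypothetical helper class)."""

    model: int
    revision: int
    addition: int

    @classmethod
    def parse(cls, text: str) -> "SchemaVer":
        match = re.fullmatch(r"(\d+)-(\d+)-(\d+)", text)
        if match is None:
            raise ValueError(f"not a SchemaVer: {text!r}")
        return cls(*(int(part) for part in match.groups()))

    def bump(self, kind: str) -> "SchemaVer":
        # Assumption: bumping a part resets the parts to its right to zero.
        if kind == "addition":
            return self._replace(addition=self.addition + 1)
        if kind == "revision":
            return self._replace(revision=self.revision + 1, addition=0)
        if kind == "model":
            return SchemaVer(self.model + 1, 0, 0)
        raise ValueError(f"unknown bump kind: {kind!r}")

    def __str__(self) -> str:
        return f"{self.model}-{self.revision}-{self.addition}"


v = SchemaVer.parse("1-0-0")
v = v.bump("addition").bump("addition")  # backwards-compatible changes
print(v)
print(v.bump("revision"))
```

Because the parts are stored as integers in a tuple, versions compare numerically, so 1-0-10 sorts after 1-0-2.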
Where are our schemas going to live? We need a schema repository/registry
[Diagram. Components: Schema repo {}; Raw events in JSON format; Enrichment Manager; Enriched events in Thrift or Avro format; Shredder; Enriched events in TSV ready for loading into db]
Uses of the schema repo marked on the diagram:
1. Test instrumentation
2. Validate events
3. Define structure
4. Drive shredding
5. Define structure
We need to namespace our schemas properly to prevent clashes and confusion in our schema repository
iglu:com.channel2.vod/video_played/jsonschema/1-0-0
• iglu: we are calling our schema methodology “Iglu”
• com.channel2.vod: the vendor of this event
• video_played: event name
• jsonschema: schema format
• 1-0-0: schema version
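As a sketch of how the four namespaced parts could be pulled out of such a URI, here is a short Python parser. `SchemaKey` and `parse_iglu_uri` are hypothetical names chosen for this example, and the regular expression assumes that none of the parts contains a slash.

```python
import re
from typing import NamedTuple


class SchemaKey(NamedTuple):
    """The four parts of a schema URI (hypothetical helper type)."""

    vendor: str
    name: str
    format: str
    version: str


# Assumption: vendor, name and format never contain "/", and the version
# is always three dash-separated integers.
IGLU_URI = re.compile(
    r"iglu:(?P<vendor>[^/]+)/(?P<name>[^/]+)/(?P<format>[^/]+)"
    r"/(?P<version>\d+-\d+-\d+)"
)


def parse_iglu_uri(uri: str) -> SchemaKey:
    match = IGLU_URI.fullmatch(uri)
    if match is None:
        raise ValueError(f"not a valid schema URI: {uri!r}")
    return SchemaKey(**match.groupdict())


key = parse_iglu_uri("iglu:com.channel2.vod/video_played/jsonschema/1-0-0")
print(key.vendor, key.name, key.format, key.version)
```

Putting the vendor first means two companies can both define a video_played event without clashing in the repository.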
Bringing it all together, let’s now make the event JSONs self-describing, with a schema header and data body
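A minimal Python sketch of a self-describing event follows. It assumes the header and body are stored under the keys `schema` and `data`; the helper names and the property values are hypothetical.

```python
import json


def self_describing(schema_uri, data):
    """Wrap an event body with a header naming the schema it conforms to."""
    return {"schema": schema_uri, "data": data}


def unwrap(event):
    """Split a self-describing event back into (schema URI, body)."""
    return event["schema"], event["data"]


event = self_describing(
    "iglu:com.channel2.vod/video_played/jsonschema/1-0-0",
    {"id": "vid-1", "length": 213},  # illustrative values
)
print(json.dumps(event, indent=2))

schema_uri, body = unwrap(event)
```

The event name, vendor and version now travel with the data, so a consumer can look up the right schema without any out-of-band knowledge.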
And for good measure, let’s add our schema information into the JSON Schema itself
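One way this could look, sketched in Python: the schema carries a `self` block with its own vendor, name, format and version, from which its URI can be rebuilt. The `self` key and both helper functions are assumptions made for this illustration, and the schema body is invented.

```python
def with_self(schema, vendor, name, fmt, version):
    """Return a copy of the schema that also describes itself."""
    described = dict(schema)
    described["self"] = {
        "vendor": vendor,
        "name": name,
        "format": fmt,
        "version": version,
    }
    return described


def iglu_uri(described):
    """Rebuild the schema URI from the schema's own "self" block."""
    s = described["self"]
    return f"iglu:{s['vendor']}/{s['name']}/{s['format']}/{s['version']}"


# Hypothetical schema body for a "video played" event.
schema = {
    "type": "object",
    "properties": {"id": {"type": "string"}, "length": {"type": "number"}},
    "additionalProperties": False,
}

described = with_self(schema, "com.channel2.vod", "video_played", "jsonschema", "1-0-0")
print(iglu_uri(described))
```

A schema file that knows its own identity can be moved or copied without losing track of which events it validates.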
Snowplow roadmap
Self-describing JSON Schemas are coming in the next release of Snowplow
We are also starting to define third-party events for Snowplow integration, starting with Zendesk customer support events
Questions?
http://snowplowanalytics.com https://github.com/snowplow/snowplow
@snowplowdata
To chat – @alexcrdean on Twitter or alex@snowplowanalytics.com