Building data pipelines
from simple to more advanced - hands-on
Sergii Khomenko, Data Scientist, @lc0d3r
CrunchConf - October 29, 2015
Sergii Khomenko
Data scientist at one of the biggest fashion communities, Stylight.
Data analysis and visualisation hobbyist, working on problems not only at work but also in free time, for fun and for personal data visualisations.
Speaker at Berlin Buzzwords 2014, ApacheCon Europe 2014, Puppet Camp London, Berlin Buzzwords 2015, Tableau Conference on Tour, Budapest BI Forum 2015
Profitable Leads: Stylight provides its partners with high-quality leads, enabling partner shops to leverage Stylight as an ROI-positive traffic channel.
Inspiration: Stylight offers shoppable inspiration that makes it easy to know what to buy and how to style it.
Branding & Reach: Stylight offers a unique opportunity for brands to reach an audience that is actively looking for style online.
Shopping: Stylight helps users search and shop fashion and lifestyle products smarter across hundreds of shops.
Stylight – Make Style Happen
Core Target Group
Stylight helps aspiring women between 18 and 35 to evolve their style through shoppable inspiration.
Stylight – acting on a global scale
Experienced & Ambitious Team
Innovative cross-functional organisation with flat hierarchy builds a unique team spirit.
• 200+ employees
• 40 PhDs/Engineers
• 28 years average age
• 63% female
• 23 nationalities
• 0 suits
Agenda
The Good, The Bad And The Legacy
Open Source stack
Amazon AWS
Google Cloud
Tips, tricks and best practices
In computing, a pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one.
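The definition above can be illustrated with a minimal Python sketch in which each stage consumes the output of the previous one. The stage names (`parse`, `keep_clicks`, `count_by_page`) and the sample events are hypothetical and do not come from the talk.

```python
from collections import Counter
from typing import Iterable, Iterator


def parse(lines: Iterable[str]) -> Iterator[dict]:
    """Stage 1: turn raw 'event,page' lines into dictionaries."""
    for line in lines:
        event, page = line.strip().split(",")
        yield {"event": event, "page": page}


def keep_clicks(events: Iterable[dict]) -> Iterator[dict]:
    """Stage 2: pass through click events only."""
    return (e for e in events if e["event"] == "click")


def count_by_page(events: Iterable[dict]) -> Counter:
    """Stage 3: aggregate the filtered events per page."""
    return Counter(e["page"] for e in events)


def run_pipeline(lines: Iterable[str]) -> Counter:
    # The output of each element is the input of the next one.
    return count_by_page(keep_clicks(parse(lines)))


# Made-up sample data, for illustration only.
RAW = ["view,/home", "click,/home", "click,/shoes", "click,/home"]
RESULT = run_pipeline(RAW)
```

Because the first two stages are generators, events flow through one at a time, so the same structure would work on a file or a stream without loading everything into memory.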
The Good, The Bad And The Legacy
Sources of data:
• Web tracking
• Metrics tracking
• Behaviour tracking
• Business intelligence ETL
• Internal Services
• ML tagging service
Access patterns
• Real-time
• Nearly real-time
• Daily batches
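As a rough illustration of the three access patterns, the sketch below consumes the same events one by one, in small groups, and grouped by calendar day. It is only an assumption about how the patterns might look in code: the event fields (`ts`, `event`) and the helper names are hypothetical, and the nearly real-time case uses fixed-size groups for simplicity where a real system would more likely use time windows.

```python
from collections import defaultdict
from datetime import datetime
from itertools import islice
from typing import Iterable, Iterator


def micro_batches(events: Iterable[dict], size: int) -> Iterator[list]:
    """Nearly real-time: hand over small fixed-size groups of events."""
    it = iter(events)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def daily_batches(events: Iterable[dict]) -> dict:
    """Daily batches: group events by the calendar day of their timestamp."""
    days = defaultdict(list)
    for e in events:
        days[e["ts"].date().isoformat()].append(e)
    return dict(days)


# Made-up sample events, for illustration only.
EVENTS = [
    {"ts": datetime(2015, 10, 28, 23, 59), "event": "view"},
    {"ts": datetime(2015, 10, 29, 0, 1), "event": "click"},
    {"ts": datetime(2015, 10, 29, 9, 30), "event": "view"},
]

# Real-time: each event is handled individually as it arrives.
REALTIME = [e["event"] for e in EVENTS]
NEAR_REALTIME = [len(b) for b in micro_batches(EVENTS, size=2)]
DAILY = {day: len(evts) for day, evts in daily_batches(EVENTS).items()}
```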
Properties
• Data consistency
• Doesn’t scale
• Hard to add new sources
• Complex system
• Many interfaces
• As lean and legacy as possible
• No need for special services
Streaming
Open Source Stack
http://lambda-architecture.net/
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
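To make the "commit log" idea concrete, here is a toy in-memory sketch: records are only ever appended, and each subscriber keeps its own read offset, so several consumers can read the same data independently. This is not Kafka's API; the class and method names (`CommitLog`, `publish`, `poll`) are hypothetical, and distribution, partitioning, and persistence are left out entirely.

```python
from collections import defaultdict
from typing import Any, List


class CommitLog:
    """Toy append-only log; records are never modified once written."""

    def __init__(self) -> None:
        self._records: List[Any] = []
        # Each subscriber tracks its own read position in the log.
        self._offsets = defaultdict(int)

    def publish(self, record: Any) -> int:
        """Append a record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer: str, max_records: int = 10) -> List[Any]:
        """Return unread records for this consumer and advance its offset."""
        start = self._offsets[consumer]
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = CommitLog()
for page in ["/home", "/shoes", "/bags"]:
    log.publish({"page": page})

# Two independent subscribers (hypothetical names) read the same log.
first_bi = log.poll("bi", max_records=2)
all_ml = log.poll("ml")
rest_bi = log.poll("bi")
```

Reading does not remove anything from the log, which is what lets a new consumer be added later and replay the data from the beginning.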
http://www.ipponusa.com/wp-content/uploads/2014/10/spark-architecture.jpg
Results
• Scalable
• Flexible
• High costs of maintenance
• Not so easy to set up
A programming language is low level when its programs require attention to the irrelevant.
Alan Jay Perlis / Epigrams on Programming
Amazon AWS
Kinesis Streams
[Diagram labels: website events, enrichment, Business Intelligence, business development & finance]
Kinesis Firehose, Kinesis Analytics
[Diagram labels: product events (variety of event types and structures), custom unification pipeline, Product Processing, Business Intelligence, ML/Tagging]
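One way to think about a unification step for events with varying types and structures is a set of per-source normalisers that map each event onto one shared shape. The sketch below is an assumption for illustration only: the source names, field names, and the target structure are hypothetical and are not taken from the pipeline shown in the talk.

```python
from typing import Callable, Dict


def from_web(e: dict) -> dict:
    """Normalise a (hypothetical) web tracking event."""
    return {"source": "web", "type": e["action"], "product_id": str(e["pid"])}


def from_tagging(e: dict) -> dict:
    """Normalise a (hypothetical) ML tagging event."""
    return {
        "source": "ml_tagging",
        "type": "tagged",
        "product_id": str(e["product"]["id"]),
    }


# Registry of normalisers; adding a source means adding one entry here.
NORMALISERS: Dict[str, Callable[[dict], dict]] = {
    "web": from_web,
    "ml_tagging": from_tagging,
}


def unify(source: str, event: dict) -> dict:
    """Map a source-specific event onto one shared structure."""
    try:
        normalise = NORMALISERS[source]
    except KeyError:
        raise ValueError(f"no normaliser registered for source {source!r}")
    return normalise(event)


# Made-up sample events, for illustration only.
UNIFIED = [
    unify("web", {"action": "click", "pid": 42}),
    unify("ml_tagging", {"product": {"id": 7}, "tags": ["red"]}),
]
```

Downstream consumers then depend on the shared structure only, not on the details of each producer.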
AWS Data Pipeline
Google Cloud
Tips, tricks and best practices
Cross-Functional Team
Department: mission oriented team with all resources and the least dependencies
Product Team: builds the software the department or its customers use
Squad: team that executes the product development
[Org chart labels: Department, Product Team, Squad, PO, Engineer, Engineer, Designer, Data Scientist, Head of, Business Role, Business Role]
Cross-Functional Team
• You build it - you run it
• You check your numbers (domain knowledge)
• You provide your data as interface layer
• Data report comes after data tracking
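The idea of providing data as an interface layer could be sketched as a typed record that the owning team validates before handing it to others. Everything in this example is an assumption for illustration: the `ProductClick` record, its fields, and the `publish` helper are hypothetical and not part of the talk.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProductClick:
    """Hypothetical record a team publishes for other teams to consume."""

    product_id: str
    shop: str
    clicked_at: datetime

    def __post_init__(self) -> None:
        # The owning team checks its own numbers before publishing.
        if not self.product_id:
            raise ValueError("product_id must not be empty")
        if not self.shop:
            raise ValueError("shop must not be empty")


def publish(raw: dict) -> ProductClick:
    """Validate a raw tracking record against the published interface."""
    return ProductClick(
        product_id=str(raw["product_id"]),
        shop=raw["shop"],
        clicked_at=datetime.fromisoformat(raw["clicked_at"]),
    )


# Made-up sample record, for illustration only.
CLICK = publish(
    {"product_id": 42, "shop": "example-shop", "clicked_at": "2015-10-29T10:00:00"}
)
```

Consumers rely on the record definition, while the producing team stays free to change how the data is tracked internally.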
I think that it's extraordinarily important that we in computer science keep fun in computing. When it started out, it was an awful lot of fun.
Alan Jay Perlis / Structure and Interpretation of Computer Programs
Related talks
• Helping Data Teams with Puppet / Puppet Camp London
• Secure Data Scalability at Stylight with Tableau Online and Amazon Redshift / Tableau Conference on Tour - Berlin
• Google Cloud Dataflow Two Worlds Become a Much Better One