Hadoop Summit - Sanoma Self Service on Hadoop
TRANSCRIPT
Scaling self service on Hadoop
Sander Kieft
@skieft
Photo credits: niznoz - https://www.flickr.com/photos/niznoz/3116766661/
About me
@skieft
Manager Core Services at Sanoma
Responsible for all common services, including the Big Data platform
Work:
– Centralized services
– Data platform
– Search
Like:
– Work
– Water(sports)
– Whiskey
– Tinkering: Arduino, Raspberry Pi, soldering stuff
Sanoma, a publishing and learning company
2 Finnish newspapers
Over 100 magazines
5 TV channels in Finland and the Netherlands
200+ websites
100 mobile applications on various mobile platforms
Past
History
< 2008 2009 2010 2011 2012 2013 2014 2015
Self service
Photo credits: misternaxal - http://www.flickr.com/photos/misternaxal/2888791930/
Self service levels
Personal – Full self service: information is created by end users with little or no oversight. Users are empowered to integrate different data sources and make their own calculations.
Departmental – Support with publishing dashboards and data loading: information has been created by end users and is worth sharing, but has not been validated.
Corporate – Full service and support on dashboards: information that has gone through a rigorous validation process can be disseminated as official data.
The focus shifts from Information Workers to Information Consumers, from full agility to centralized development, and from Excel to static reports.
Self service position – 2010 starting point
Source -> Extraction -> Transformation -> Modeling -> Load -> Report / Dashboard -> Insight
Data team: Source through Load. Analysts: Report / Dashboard and Insight.
History
< 2008 2009 2010 2011 2012 2013 2014 2015
Glue: ETL
EXTRACT
TRANSFORM
LOAD
#fail
Hadoop and QlikView proved to be really valuable tools
Traditional ETL tools don't scale for Big Data sources
Big Data projects are not BI projects
Doing full end-to-end integrations and dashboard development doesn't scale
QlikView was not good enough as the front-end to the cluster
Hadoop requires developers, not BI consultants
Learnings
History
< 2008 2009 2010 2011 2012 2013 2014 2015
Russell Jurney, Agile Data
Self service position – Agile Data
Source -> Extraction -> Transformation -> Modeling -> Load -> Report / Dashboard -> Insight
New glue
Photo credits: Sheng Hunglin - http://www.flickr.com/photos/shenghunglin/304959443/
ETL Tool features
Processing
Scheduling
Data quality
Data lineage
Versioning
Annotating
Photo credits: atomicbartbeans - http://www.flickr.com/photos/atomicbartbeans/71575328/
Processing - Jython
No JVM startup overhead for Hadoop API usage
Relatively concise syntax (Python)
Mix Python standard library with any Java libs
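Because the jobs run on Jython, a single script can use the Python standard library and call the Hadoop Java API in-process. A minimal sketch of such a step, with the paths and daily layout invented for illustration:

```python
# Hypothetical Jython ETL step: mixes the Python standard library with the
# Hadoop Java API. Because Jython runs inside the JVM, no extra JVM is
# started for each HDFS operation.
from datetime import date                          # Python stdlib
from org.apache.hadoop.conf import Configuration   # Hadoop Java API
from org.apache.hadoop.fs import FileSystem, Path

conf = Configuration()        # picks up core-site.xml from the classpath
fs = FileSystem.get(conf)

# Move today's raw extract from the incoming area to its final location
day = date.today().isoformat()
src = Path("/staging/incoming/adserving/%s" % day)
dst = Path("/staging/adserving/%s" % day)
if fs.exists(src):
    fs.rename(src, dst)
```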
Scheduling - Jenkins
Flexible scheduling with dependencies
Saves output
E-mails on errors
Scales to multiple nodes
REST API (see the sketch after this list)
Status monitor
Integrates with version control
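As an illustration of the REST API item above, a hedged sketch of triggering a parameterized job remotely with the Python requests library; the host, job name, parameter and credentials are all invented:

```python
# Hypothetical sketch: triggering a parameterized Jenkins job over its REST
# API. Host, job name and credentials are invented for illustration.
import requests

JENKINS = "https://jenkins.example.com"

resp = requests.post(
    "%s/job/etl-adserving-daily/buildWithParameters" % JENKINS,
    params={"RUN_DATE": "2015-04-24"},
    auth=("etl-bot", "api-token"),
)
resp.raise_for_status()  # Jenkins answers 201 and queues the build
```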
ETL Tool features
Processing – Bash & Jython
Scheduling – Jenkins
Data quality
Data lineage
Versioning – Mercurial (hg)
Annotating – Commenting the code
Processes
Photo credits: Paul McGreevy - http://www.flickr.com/photos/48379763@N03/6271909867/
Independent jobs
Source (external)
-> HDFS upload + move in place
Staging (HDFS)
-> MapReduce + HDFS move
Hive-staging (HDFS)
-> Hive map external table + SELECT INTO
Hive
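The "move in place" step is what keeps these jobs independent: uploads are slow and can fail halfway, but an HDFS rename is atomic, so consumers only ever see complete files. A sketch of the pattern, with all paths invented:

```python
# Hypothetical "move in place" step (Jython): upload to a temporary path
# first, then rename, so downstream jobs never see a half-written file.
from org.apache.hadoop.conf import Configuration
from org.apache.hadoop.fs import FileSystem, Path

fs = FileSystem.get(Configuration())

local = Path("file:///tmp/clicklog.2015-04-24.gz")
tmp = Path("/staging/_incoming/clicklog.2015-04-24.gz")
final = Path("/staging/clicklog/clicklog.2015-04-24.gz")

fs.copyFromLocalFile(local, tmp)  # slow, may fail halfway
fs.rename(tmp, final)             # HDFS rename is atomic
```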
Typical data flow - Extract
[Diagram: Jenkins drives Hadoop (HDFS, M/R or Hive) and QlikView]
1. Job code from hg (Mercurial)
2. Get the data from S3, FTP or API
3. Store the raw source data on HDFS
Typical data flow - Transform
1. Job code from hg (Mercurial)
2. Execute M/R or Hive query
3. Transform the data to the intermediate structure
Typical data flow - Load
1. Job code from hg (Mercurial)
2. Execute Hive query
3. Load data to QlikView
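In practice the load step often reduces to a Hive query whose result lands in a directory that the QlikView reload picks up. A sketch under that assumption; the table, columns and export path are invented:

```python
# Hypothetical load step: dump an aggregate to an HDFS directory from which
# the QlikView reload job reads. Table and path names are invented.
import subprocess

hql = """
INSERT OVERWRITE DIRECTORY '/export/qlikview/pageviews_daily'
SELECT site, dt, COUNT(*) AS views
FROM pageviews
GROUP BY site, dt
"""
subprocess.check_call(["hive", "-e", hql])
```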
Out of order jobs
At any point, you don't really know what 'made it' to Hive
It will happen anyway, because some days the data delivery is going to be three hours late
Or you get half in the morning and the other half later in the day
How much it hurts really depends on what you do with the data
This is where metrics + a fixable data store help... (see the sketch below)
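One way to make "fit for purpose" measurable is a cheap sanity check on row counts before publishing a day of data. A rough sketch with invented table names and a crude 50% threshold:

```python
# Hypothetical fitness check: compare today's partition row count against
# yesterday's before letting the dashboards load it. Names are invented.
import subprocess

def row_count(table, day):
    out = subprocess.check_output([
        "hive", "-S", "-e",
        "SELECT COUNT(*) FROM %s WHERE dt='%s'" % (table, day),
    ])
    return int(out.strip().split()[-1])

if row_count("pageviews", "2015-04-24") < 0.5 * row_count("pageviews", "2015-04-23"):
    raise SystemExit("pageviews 2015-04-24 looks incomplete; not publishing")
```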
Fixable data store
Using Hive partitions
Jobs that move data from staging create partitions
When new data / insight about the data arrives, drop the partition and re-insert
Be careful to reset any metrics in this case
Basically: instead of trying to make everything transactional, repair afterwards
Use metrics to determine whether data is fit for purpose
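A hedged sketch of what the drop-and-reinsert repair can look like; the table, columns and staging layout are invented:

```python
# Hypothetical repair-afterwards pattern: when late or corrected data shows
# up for a day, drop that Hive partition and reload it instead of patching
# individual rows. All names are invented for illustration.
import subprocess

TABLE, DAY = "pageviews", "2015-04-24"
hql = """
ALTER TABLE {t} DROP IF EXISTS PARTITION (dt='{d}');
INSERT OVERWRITE TABLE {t} PARTITION (dt='{d}')
SELECT site, url, user_id, ts
FROM staging_{t}
WHERE dt = '{d}';
""".format(t=TABLE, d=DAY)
subprocess.check_call(["hive", "-e", hql])
# Remember to reset any derived metrics for this partition afterwards.
```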
Photo credits: DL76 - http://www.flickr.com/photos/dl76/3643359247/
Enabling self service
[Diagram: source systems in Finland (FI) and the Netherlands (NL) -> Extract -> staging: raw data in native format on HDFS -> Transform -> normalized data structure in Hive. Alongside sits reusable stem data: weather, international event calendars, international AdWords prices, exchange rates. Consumers: data scientists and business analysts.]
Enabling self service
[Same diagram, now with the access tools drawn in: HUE & Hive, QlikView, R Studio Server and Jenkins.]
Architecture
High Level Architecture
[Diagram: sources (Back Office, Analytics, AdServing (incl. mobile, video, CPC, yield, etc.), Market, Content, Subscription) feed Jenkins & Jython ETL, with metadata, into Hadoop via scheduled loads. On top of Hadoop sit Hive & HUE for ad hoc queries, QlikView dashboards/reporting via scheduled exports, and R Studio for advanced analytics.]
Current State - Real time
Extending our own collecting infrastructure
Using this to drive recommendations, real-time dashboarding, and user segmentation and targeting in real time
Using Kafka and Storm (see the sketch below)
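On the collecting side, events end up on a Kafka topic that the Storm topologies consume. A minimal producer sketch, assuming the kafka-python library; the broker, topic and event fields are invented:

```python
# Hypothetical event producer: pushes a pageview onto a Kafka topic that a
# Storm topology consumes for real-time dashboards, segmentation and
# recommendations. Broker and topic names are invented.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka01:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send("pageviews", {"user": "u123", "url": "/article/42"})
producer.flush()  # block until the event is actually on the broker
```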
High Level Architecture
[Diagram: the batch architecture as above (sources: Back Office, Analytics, AdServing, Market, Content, Subscription -> Jenkins & Jython ETL with metadata -> Hadoop -> Hive & HUE for ad hoc queries, QlikView dashboards/reporting via scheduled exports, R Studio for advanced analytics), extended with a real-time path: CLOE (real-time collecting) -> Flume / Kafka -> Storm stream processing -> recommendations & online learning / Learn to Rank.]
Sanoma Media The Netherlands Infra (DC1, DC2)
Colocation – Big Data platform; dev, test and acceptance VMs:
– No SLA (yet)
– Limited support, business days 9-5
– No full system backups
– Managed by us, with help from the systems department
Managed Hosting – Production VMs:
– 99.98% availability
– 24/7 support
– Multi DC
– Full system backups
– High performance EMC SAN storage
– Managed by a dedicated systems department
[Diagram: the platform split into collecting, processing (batch and real time), storage and recommendations. If the batch platform is unavailable, the collecting side can queue data for 3 days and the recommendation services can run on stale data for 3 days.]
Present
Self service position – Present
Source -> Extraction -> Transformation -> Modeling -> Load -> Report / Dashboard -> Insight
Present
< 2008 2009 2010 2011 2012 2013 2014 2015
YARN
Moving to YARN (CDH 4.3 -> 5.1)
Photo credits: "Blasting frankfurt" by Heptagon - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Blasting_frankfurt.jpg
Functional testing wasn't enough
It takes time to tune the parameters; the defaults are NOT good enough
Cluster deadlocks
Grouped nodes with similar hardware requirements into node groups
CDH 4 to 5 specific:
– Hive CLI -> Beeline (see the sketch below)
– Hive locking
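The CLI switch mostly meant changing how jobs invoke Hive. A hedged before/after sketch, wrapped the way a Jython job might shell out; the HiveServer2 URL and user are invented:

```python
# Hypothetical before/after of the Hive CLI -> Beeline switch.
# The JDBC URL and username are invented for illustration.
import subprocess

hql = "SELECT COUNT(*) FROM pageviews"

# CDH 4 style: the thick Hive CLI talks straight to the metastore
subprocess.check_call(["hive", "-e", hql])

# CDH 5 style: Beeline is a thin JDBC client against HiveServer2
subprocess.check_call([
    "beeline",
    "-u", "jdbc:hive2://hiveserver2:10000/default",
    "-n", "etl",
    "-e", hql,
])
```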
Current state – Use cases
Combining various data sources (web analytics + advertising + user profiles)
A/B testing + deeper analyses
Ad spend ROI optimization & attribution
Ad auction price optimization
Ad retargeting
Item- and user-based recommendations
Search optimizations
– Online Learn to Rank
– Search suggestions
Current state - Usage
Mainly used for reporting and analytics, with increasing data science workloads
Sanoma's standard data platform, used in all Sanoma countries
> 250 daily dashboard users
40 daily users: analysts & developers
43 source systems, with 125 different sources
400 tables in Hive
Current state – Tech and team
Platform team:
– 1 product owner
– 3 developers
Close collaborators:
– ~10 data scientists
– 1 QlikView application manager
– ½ (system) architect
Platform:
– 50-60 nodes
– > 600 TB storage
– ~3,000 jobs/day
Typical nodes:
– 1-2 CPUs, 4-12 cores
– 2 system disks (RAID 1)
– 4 data disks (2 TB, 3 TB or 4 TB)
– 24-32 GB RAM
Challenges
Photo credits: http://kdrunlimited1.blogspot.hu/2013/01/late-edition-squirrel-appreciation-day.html
1. Security
2. Increase SLAs
3. Integration with other BI environments
4. Improve self service
5. Data quality
6. Code reuse
Future
Better integration of batch and real time (unified code base)
Improve the availability and SLAs of the platform
Optimizing job scheduling (resource pools)
Automated query/job optimization tips for analysts
Fix Jenkins being a single point of failure
Moving some NLP (Natural Language Processing) and image recognition workloads to Hadoop
What’s next
Thank you! Questions?