Hadoop Summit - Sanoma Self Service on Hadoop


Scaling self service on Hadoop

Sander Kieft
@skieft

Photo credits: niznoz - https://www.flickr.com/photos/niznoz/3116766661/

About me
@skieft

Manager Core Services at Sanoma

Responsible for all common services, including the Big Data platform

Work:

– Centralized services

– Data platform

– Search

Like:

– Work

– Water(sports)

– Whiskey

– Tinkering: Arduino, Raspberry Pi, soldering stuff


Sanoma, Publishing and Learning company

• 2 Finnish newspapers
• Over 100 magazines
• 5 TV channels in Finland and The Netherlands
• 200+ websites
• 100 mobile applications on various mobile platforms

Past

History

< 2008 2009 2010 2011 2012 2013 2014 2015

Self service

Photo credits: misternaxal - http://www.flickr.com/photos/misternaxal/2888791930/

Self service

Self service levels

Personal: full self service. Information is created by end users with little or no oversight. Users are empowered to integrate different data sources and make their own calculations.

Departmental: support with publishing dashboards and data loading. Information has been created by end users and is worth sharing, but has not been validated.

Corporate: full service and support on the dashboard. Information that has gone through a rigorous validation process can be disseminated as official data.

The focus shifts from information workers (full agility, Excel) to information consumers (centralized development, static reports).

Self service position – 2010 starting point

Source -> Extraction -> Transformation -> Modeling -> Load -> Report / Dashboard -> Insight, split between the data team and the analysts.

History

< 2008 2009 2010 2011 2012 2013 2014 2015


Glue: ETL

EXTRACT

TRANSFORM

LOAD


#fail

Hadoop and QlikView proved to be really valuable tools

Traditional ETL tools don't scale for Big Data sources

Big Data projects are not BI projects

Doing full end-to-end integrations and dashboard development doesn't scale

QlikView was not good enough as the front-end to the cluster

Hadoop requires developers, not BI consultants

Learnings

History

< 2008 2009 2010 2011 2012 2013 2014 2015

Russell Jurney, Agile Data


Self service position – Agile Data

Source -> Extraction -> Transformation -> Modeling -> Load -> Report / Dashboard -> Insight

New glue

Photo credits: Sheng Hunglin - http://www.flickr.com/photos/shenghunglin/304959443/

ETL Tool features

Processing

Scheduling

Data quality

Data lineage

Versioning

Annotating


Photo credits: atomicbartbeans - http://www.flickr.com/photos/atomicbartbeans/71575328/


Processing - Jython

No JVM startup overhead for Hadoop API usage

Relatively concise syntax (Python)

Mix Python standard library with any Java libs
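A minimal sketch of the kind of Jython glue this enables (paths are illustrative, not from the deck): the Hadoop Java API is called directly from Python code and mixed with the Python standard library, without spawning a separate JVM per command.

```python
# Illustrative Jython sketch: use the Hadoop Java API from Python code.
# Assumes the Hadoop client jars are on the Jython classpath.
from datetime import date                            # Python standard library
from org.apache.hadoop.conf import Configuration     # Java: Hadoop configuration
from org.apache.hadoop.fs import FileSystem, Path    # Java: HDFS client

conf = Configuration()            # picks up core-site.xml / hdfs-site.xml
fs = FileSystem.get(conf)

# List today's staging directory without starting a separate 'hadoop fs' JVM
staging = Path("/staging/weblogs/%s" % date.today().isoformat())
for status in fs.listStatus(staging):
    print status.getPath().getName(), status.getLen()
```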


Scheduling - Jenkins

Flexible scheduling with dependencies

Saves output

E-mails on errors

Scales to multiple nodes

REST API

Status monitor

Integrates with version control
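As an illustration of the REST API point, a hedged sketch of triggering a downstream Jenkins job once an upstream extract finishes; the host, job name and credentials are assumptions, not taken from the deck.

```python
# Illustrative sketch: queue a dependent Jenkins build through the REST API.
import base64
import urllib2

JENKINS = "http://jenkins.example.com"
JOB = "load_weblogs_to_hive"

request = urllib2.Request("%s/job/%s/build" % (JENKINS, JOB), data="")  # empty body forces a POST
auth = base64.b64encode("etl-bot:api-token")
request.add_header("Authorization", "Basic %s" % auth)
urllib2.urlopen(request)  # Jenkins answers 201 when the build is queued (newer versions also need a CSRF crumb)
```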


ETL Tool features

Processing – Bash & Jython

Scheduling – Jenkins

Data quality

Data lineage

Versioning – Mercurial (hg)

Annotating – Commenting the code


Processes

Photo credits: Paul McGreevy - http://www.flickr.com/photos/48379763@N03/6271909867/

Independent jobs

Source (external)

HDFS upload + move in place

Staging (HDFS)

MapReduce + HDFS move

Hive-staging (HDFS)

Hive map external table + SELECT INTO

Hive
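A sketch of the last hop in this chain, with illustrative table names and paths: the files in Hive-staging are mapped as an external table, then inserted into a managed, partitioned Hive table (assumed here to already exist).

```python
# Illustrative Hive-staging -> Hive step, executed through the hive CLI.
import subprocess

hql = """
CREATE EXTERNAL TABLE IF NOT EXISTS staging_weblogs (ts STRING, url STRING, user_id STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
  LOCATION '/hive-staging/weblogs/2015-04-24';

-- 'weblogs' is assumed to be a managed table partitioned by dt
INSERT INTO TABLE weblogs PARTITION (dt='2015-04-24')
  SELECT ts, url, user_id FROM staging_weblogs;
"""
subprocess.check_call(["hive", "-e", hql])
```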


Typical data flow - Extract

Components in the flow: Jenkins, Hadoop HDFS, Hadoop M/R or Hive, QlikView

1. Job code from hg (Mercurial)
2. Get the data from S3, FTP or API
3. Store the raw source data on HDFS
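A hedged sketch of such an extract job (bucket, key and HDFS paths are made up): fetch the raw file, upload it to a temporary HDFS location, and only then move it into place so downstream jobs never see half-written files.

```python
# Illustrative extract step: pull a file and upload it to HDFS.
import subprocess
import tempfile
import urllib2

# Public object assumed for brevity; in practice S3 credentials, FTP or API clients are used.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(urllib2.urlopen("https://s3.amazonaws.com/example-bucket/weblogs/2015-04-24.gz").read())
tmp.close()

# Upload to an incoming directory first, then 'move in place'.
subprocess.check_call(["hadoop", "fs", "-put", tmp.name,
                       "/staging/weblogs/_incoming/2015-04-24.gz"])
subprocess.check_call(["hadoop", "fs", "-mv",
                       "/staging/weblogs/_incoming/2015-04-24.gz",
                       "/staging/weblogs/2015-04-24.gz"])
```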

Typical data flow - Transform

1. Job code from hg (Mercurial)
2. Execute M/R or Hive query
3. Transform the data to the intermediate structure
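For the transform step, a minimal sketch of launching a MapReduce job from the scheduled Jenkins job; the jar, class and paths are illustrative.

```python
# Illustrative transform step: run a MapReduce job (a Hive query works the same way via 'hive -e').
import subprocess

subprocess.check_call([
    "hadoop", "jar", "sessionize.jar", "com.example.Sessionize",
    "/staging/weblogs/2015-04-24",          # input: raw data on HDFS
    "/intermediate/sessions/2015-04-24",    # output: intermediate structure
])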

Typical data flow - Load

1. Job code from hg (Mercurial)
2. Execute Hive query
3. Load data to QlikView
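A sketch of one way the hand-off to QlikView could work (the actual mechanism is not specified in the deck, and QlikView can also read Hive over ODBC): run the Hive query and write its output to an export file that a scheduled QlikView reload picks up.

```python
# Illustrative load step: dump a Hive query result to a file for QlikView to reload.
import subprocess

hql = "SELECT dt, url, views FROM pageviews_daily WHERE dt >= '2015-03-25';"
with open("/exports/qlikview/pageviews_last_30d.tsv", "w") as out:
    # 'hive -e' prints result rows tab-separated to stdout
    subprocess.check_call(["hive", "-e", hql], stdout=out)
```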

Out of order jobs

At any point, you don’t really know what ‘made it’ to Hive

Will happen anyway, because some days the data delivery is going to be three hours late

Or you get half in the morning and the other half later in the day

It really depends on what you do with the data

This is where metrics + fixable data store help...


Fixable data store

Using Hive partitions

Jobs that move data from staging create partitions

When new data / insight about the data arrives, drop the partition and re-insert

Be careful to reset any metrics in this case

Basically: instead of trying to make everything transactional, repair afterwards

Use metrics to determine whether data is fit for purpose
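A sketch of the repair pattern described above, with illustrative table and partition names: when late or corrected data arrives, drop the affected partition and re-insert it from staging, then reset the metrics for that day.

```python
# Illustrative 'repair afterwards' job: drop and reload one Hive partition.
import subprocess

day = "2015-04-24"
hql = """
ALTER TABLE weblogs DROP IF EXISTS PARTITION (dt='%(day)s');
INSERT INTO TABLE weblogs PARTITION (dt='%(day)s')
  SELECT ts, url, user_id FROM staging_weblogs;
""" % {"day": day}
subprocess.check_call(["hive", "-e", hql])
# Remember to reset any row-count / quality metrics recorded for this day.
```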


Photo credits: DL76 - http://www.flickr.com/photos/dl76/3643359247/

Enabling self service

Source systems in Finland and the Netherlands are extracted into staging: raw data in native format on HDFS. From there it is transformed into a normalized data structure in Hive, together with reusable stem data:

• Weather
• Int. event calendars
• Int. AdWord prices
• Exchange rates

Data scientists and business analysts work on top of this.

Enabling self service

The same flow, with the access tools added on top:

• HUE & Hive
• QlikView
• R Studio Server
• Jenkins

Architecture

High Level Architecture

Sources (back office, analytics, ad serving incl. mobile, video, CPC and yield, subscriptions, market, content and metadata) flow through the Jenkins & Jython ETL into Hadoop via scheduled loads. On top of Hadoop: Hive & HUE for ad hoc queries, scheduled exports into QlikView for dashboards and reporting, and R Studio for advanced analytics.

Current State - Real time

Extending our own collecting infrastructure

Using this to drive recommendations, real-time dashboarding, user segmentation and targeting in real time

Using Kafka and Storm
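A minimal, illustrative producer-side sketch (broker, topic name and event schema are assumptions, and the kafka-python package stands in for the actual collector): events published to Kafka are consumed by the Storm topology for the real-time use cases above.

```python
# Illustrative sketch: publish a click event to Kafka for stream processing in Storm.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka01:9092"],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"))

producer.send("clickstream", {"user_id": "u123", "url": "/article/42", "ts": 1429860000})
producer.flush()
```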


High Level Architecture

The same architecture extended with a real-time path: a (real time) collecting layer (CLOE) feeds Flume / Kafka, Storm does the stream processing, and the results drive recommendations and online learning (learn to rank) next to the existing scheduled loads into Hadoop, Hive & HUE, QlikView and R Studio.

Sanoma Media The Netherlands infra: two data centres (DC1 and DC2), a mix of managed hosting and colocation.

Big Data platform and dev, test and acceptance VMs:
• No SLA (yet)
• Limited support, business days 9-5
• No full system backups
• Managed by us, with help from the systems department

Production VMs:
• 99.98% availability
• 24/7 support
• Multi DC
• Full system backups
• High performance EMC SAN storage
• Managed by a dedicated systems department


Layers: processing, storage, batch, collecting, real time, recommendations. The collecting queue can buffer data for 3 days, and recommendations can run on stale data for 3 days.

Present

Self service position – Present

Source -> Extraction -> Transformation -> Modeling -> Load -> Report / Dashboard -> Insight

Present

< 2008 2009 2010 2011 2012 2013 2014 2015

YARN

Moving to YARN (CDH 4.3 -> 5.1)

24 April 2015 "Blasting frankfurt" by Heptagon -Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons -http://commons.wikimedia.org/wiki/File:Blasting_frankfurt.jpg#/media/File:Blasting_frankfurt.jpg54

Functional testing wasn’t enough

Takes time to tune the parameters

Defaults are NOT good enough

Cluster deadlocks

Grouped nodes with similar hardware requirements into node groups

CDH 4 to 5 specific

Hive CLI -> Beeline

Hive Locking
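For the Hive CLI to Beeline change, a small hedged sketch of what the glue jobs switch to: submitting queries through Beeline against HiveServer2 (the JDBC URL is an assumption).

```python
# Illustrative CDH 5 change: query through Beeline / HiveServer2 instead of the old Hive CLI.
import subprocess

subprocess.check_call([
    "beeline", "-u", "jdbc:hive2://hiveserver2.example.com:10000/default",
    "-e", "SHOW TABLES;",
])
```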

Combining various data sources (web analytics + advertising + user profiles)

A/B testing + deeper analyses

Ad spend ROI optimization & attribution

Ad auction price optimization

Ad retargeting

Item and user based recommendations

Search optimizations

– Online Learn to Rank

– Search suggestions

Current state – Use case


Current state - Usage

Main use case for reporting and analytics, increasing data science workloads

Sanoma standard data platform, used in all Sanoma countries

> 250 daily dashboard users

40 daily users: analysts & developers

43 source systems, with 125 different sources

400 tables in Hive


Current state – Tech and team

Platform Team:

– 1 product owner

– 3 developers

Close collaborators:

– ~10 data scientists

– 1 QlikView application manager

– ½ (system) architect

Platform:

– 50-60 nodes

– > 600TB storage

– ~3000 jobs/day

Typical nodes

– 1-2 CPU 4-12 cores

– 2 system disks (RAID 1)

– 4 data disks (2TB, 3TB or 4TB)

– 24-32GB RAM


Challenges

Photo credits: http://kdrunlimited1.blogspot.hu/2013/01/late-edition-squirrel-appreciation-day.html

1. Security

2. Increase SLAs

3. Integration with other BI environments

4. Improve Self service

5. Data Quality

6. Code reuse

Challenges


Future

Better integration of batch and real time (unified code base)

Improve the availability and SLAs of the platform

Optimizing job scheduling (resource pools)

Automated query/job optimization tips for analysts

Fix Jenkins, which is a single point of failure

Moving some NLP (Natural Language Processing) and image recognition workloads to Hadoop


What’s next

Thank you! Questions?

@skieft
sander.kieft@sanoma.com
