Hadoop Summit - Sanoma self service on Hadoop

Scaling self service on Hadoop - Sander Kieft (@skieft)


TRANSCRIPT

Page 1: Hadoop Summit - Sanoma self service on hadoop

Scaling self service on Hadoop - Sander Kieft (@skieft)

Page 2: Hadoop Summit - Sanoma self service on hadoop

Photo credits: niznoz - https://www.flickr.com/photos/niznoz/3116766661/

Page 3: Hadoop Summit - Sanoma self service on hadoop

About me - @skieft

Manager Core Services at Sanoma

Responsible for all common services, including the Big Data platform

Work:

– Centralized services

– Data platform

– Search

Like:

– Work

– Water(sports)

– Whiskey

– Tinkering: Arduino, Raspberry Pi, soldering stuff


Page 4: Hadoop Summit - Sanoma self service on hadoop

Sanoma, Publishing and Learning company

2+100 Finnish newspapers

Over 100 magazines

5 TV channels in Finland and The Netherlands

200+ websites

100 mobile applications on various mobile platforms

Page 5: Hadoop Summit - Sanoma self service on hadoop

Past

Page 6: Hadoop Summit - Sanoma self service on hadoop

History

< 2008 2009 2010 2011 2012 2013 2014 2015

Page 7: Hadoop Summit - Sanoma self service on hadoop

Self service

Photo credits: misternaxal - http://www.flickr.com/photos/misternaxal/2888791930/

Page 8: Hadoop Summit - Sanoma self service on hadoop

Self service

Page 9: Hadoop Summit - Sanoma self service on hadoop

Self service levels

Personal - full self service. Information is created by end users with little or no oversight. Users are empowered to integrate different data sources and make their own calculations.

Departmental - support with publishing dashboards and data loading. Information has been created by end users and is worth sharing, but has not been validated.

Corporate - full service and support on the dashboard. Information that has gone through a rigorous validation process can be disseminated as official data.

Moving from Personal to Corporate, the focus shifts from information workers to information consumers, from full agility to centralized development, and from Excel to static reports.

Page 10: Hadoop Summit - Sanoma self service on hadoop

Self service position – 2010 starting point

Source → Extraction → Transformation → Modeling → Load → Report / Dashboard → Insight

Data team on the left of the pipeline, analysts on the right.

Page 11: Hadoop Summit - Sanoma self service on hadoop

History

< 2008 2009 2010 2011 2012 2013 2014 2015


Page 12: Hadoop Summit - Sanoma self service on hadoop

Glue: ETL

EXTRACT

TRANSFORM

LOAD


Page 13: Hadoop Summit - Sanoma self service on hadoop


Page 14: Hadoop Summit - Sanoma self service on hadoop


Page 15: Hadoop Summit - Sanoma self service on hadoop
Page 16: Hadoop Summit - Sanoma self service on hadoop
Page 17: Hadoop Summit - Sanoma self service on hadoop

#fail

Page 18: Hadoop Summit - Sanoma self service on hadoop

Learnings

Hadoop and QlikView proved to be really valuable tools

Traditional ETL tools don’t scale for Big Data sources

Big Data projects are not BI projects

Doing full end-to-end integrations and dashboard development doesn’t scale

QlikView was not good enough as the front-end to the cluster

Hadoop requires developers, not BI consultants

Page 19: Hadoop Summit - Sanoma self service on hadoop

History

< 2008 2009 2010 2011 2012 2013 2014 2015

Page 20: Hadoop Summit - Sanoma self service on hadoop
Page 21: Hadoop Summit - Sanoma self service on hadoop

Russell Jurney - Agile Data

Page 22: Hadoop Summit - Sanoma self service on hadoop

Self service position – Agile Data

Source → Extraction → Transformation → Modeling → Load → Report / Dashboard → Insight

Page 23: Hadoop Summit - Sanoma self service on hadoop

New glue

Photo credits: Sheng Hunglin - http://www.flickr.com/photos/shenghunglin/304959443/

Page 24: Hadoop Summit - Sanoma self service on hadoop

ETL Tool features

Processing

Scheduling

Data quality

Data lineage

Versioning

Annotating


Page 25: Hadoop Summit - Sanoma self service on hadoop


Photo credits: atomicbartbeans - http://www.flickr.com/photos/atomicbartbeans/71575328/

Page 26: Hadoop Summit - Sanoma self service on hadoop


Page 27: Hadoop Summit - Sanoma self service on hadoop


Page 28: Hadoop Summit - Sanoma self service on hadoop

Processing - Jython

No JVM startup overhead for Hadoop API usage

Relatively concise syntax (Python)

Mix Python standard library with any Java libs

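A minimal sketch of what such a job can look like, assuming a Jython interpreter started with the Hadoop client jars on the classpath; the paths and file names are invented:

```python
# Jython: Python syntax calling the Java Hadoop API in-process,
# mixed freely with the Python standard library.
from org.apache.hadoop.conf import Configuration
from org.apache.hadoop.fs import FileSystem, Path
import datetime  # plain Python stdlib

conf = Configuration()    # picks up core-site.xml / hdfs-site.xml from the classpath
fs = FileSystem.get(conf)

# Move today's raw extract into its staging directory on HDFS
day = datetime.date.today().isoformat()
src = Path('/staging/incoming/clicks_%s.gz' % day)
dst = Path('/staging/clicks/date=%s' % day)
if fs.exists(src):
    fs.mkdirs(dst)
    fs.rename(src, Path(dst, src.getName()))
```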

Page 29: Hadoop Summit - Sanoma self service on hadoop

Scheduling - Jenkins

Flexible scheduling with dependencies

Saves output

E-mails on errors

Scales to multiple nodes

REST API

Status monitor

Integrates with version control

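For example, one job can queue a downstream job through the REST API. A hedged sketch; host, job name and credentials are made up:

```python
# Trigger a downstream Jenkins job over its REST API (POST /job/<name>/build).
import urllib2, base64

jenkins = 'http://jenkins.example.com'
req = urllib2.Request('%s/job/load-clicks-to-hive/build' % jenkins, data='')
req.add_header('Authorization',
               'Basic ' + base64.b64encode('etl-bot:api-token'))
urllib2.urlopen(req)  # HTTP 201 means the build was queued
```

Note that newer Jenkins versions with CSRF protection enabled also expect a crumb header on POST requests.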

Page 30: Hadoop Summit - Sanoma self service on hadoop

ETL Tool features

Processing – Bash & Jython

Scheduling – Jenkins

Data quality

Data lineage

Versioning – Mercurial (hg)

Annotating – Commenting the code


Page 31: Hadoop Summit - Sanoma self service on hadoop

Processes

Photo credits: Paul McGreevy - http://www.flickr.com/photos/48379763@N03/6271909867/

Page 32: Hadoop Summit - Sanoma self service on hadoop

Independent jobs

Source (external)

→ (HDFS upload + move in place)

Staging (HDFS)

→ (MapReduce + HDFS move)

Hive-staging (HDFS)

→ (Hive map external table + SELECT INTO)

Hive
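A sketch of the last hop, under invented table names and columns (not the actual DDL from the deck):

```python
# Map the staged files as an external Hive table, then SELECT the rows
# into the managed table; both statements are run through beeline.
import subprocess

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS clicks_staging (
  ts STRING, user_id STRING, url STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION '/staging/hive/clicks/2015-04-24'
"""
dml = """
INSERT INTO TABLE clicks PARTITION (dt='2015-04-24')
SELECT ts, user_id, url FROM clicks_staging
"""
subprocess.check_call(['beeline', '-u', 'jdbc:hive2://hive.example.com:10000',
                       '-e', ddl, '-e', dml])
```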

Page 33: Hadoop Summit - Sanoma self service on hadoop

Typical data flow - Extract

[Diagram: Jenkins drives HDFS, Hadoop M/R or Hive, and QlikView]

1. Job code from hg (Mercurial)

2. Get the data from S3, FTP or API

3. Store the raw source data on HDFS
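A hedged sketch of such an extract job (bucket and paths are invented). The upload-then-rename dance is the "move in place" step: the rename is atomic on HDFS, so downstream jobs never see a half-written file:

```python
# Pull one day of source data from S3 and land it on HDFS.
import subprocess

local = '/tmp/clicks_2015-04-24.gz'
subprocess.check_call(['aws', 's3', 'cp',
                       's3://example-bucket/clicks/2015-04-24.gz', local])
# Upload under a temporary name first...
subprocess.check_call(['hadoop', 'fs', '-put', local,
                       '/staging/incoming/_clicks_2015-04-24.gz'])
# ...then move it in place with an atomic rename.
subprocess.check_call(['hadoop', 'fs', '-mkdir', '-p',
                       '/staging/clicks/date=2015-04-24'])
subprocess.check_call(['hadoop', 'fs', '-mv',
                       '/staging/incoming/_clicks_2015-04-24.gz',
                       '/staging/clicks/date=2015-04-24/clicks.gz'])
```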

Page 34: Hadoop Summit - Sanoma self service on hadoop

Typical data flow - Transform

[Diagram: Jenkins drives HDFS, Hadoop M/R or Hive, and QlikView]

1. Job code from hg (Mercurial)

2. Execute M/R or Hive query

3. Transform the data to the intermediate structure

Page 35: Hadoop Summit - Sanoma self service on hadoop

Typical data flow - Load

[Diagram: Jenkins drives HDFS, Hadoop M/R or Hive, and QlikView]

1. Job code from hg (Mercurial)

2. Execute Hive query

3. Load data to QlikView

Page 36: Hadoop Summit - Sanoma self service on hadoop

Out of order jobs

At any point, you don’t really know what ‘made it’ to Hive

Will happen anyway, because some days the data delivery is going to be three hours late

Or you get half in the morning and the other half later in the day

It really depends on what you do with the data

This is where metrics + a fixable data store help...

Page 37: Hadoop Summit - Sanoma self service on hadoop

Fixable data store

Using Hive partitions

Jobs that move data from staging create partitions

When new data / insight about the data arrives, drop the partition and re-insert

Be careful to reset any metrics in this case

Basically: instead of trying to make everything transactional, repair afterwards

Use metrics to determine whether data is fit for purpose

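A hedged sketch of that repair, with invented table and partition names:

```python
# "Fixable data store": when late or corrected data arrives, drop the
# affected Hive partition and re-insert it from staging.
import subprocess

DT = '2015-04-24'
drop = "ALTER TABLE clicks DROP IF EXISTS PARTITION (dt='%s')" % DT
fill = """
INSERT OVERWRITE TABLE clicks PARTITION (dt='%s')
SELECT ts, user_id, url FROM clicks_staging WHERE dt='%s'
""" % (DT, DT)
subprocess.check_call(['beeline', '-u', 'jdbc:hive2://hive.example.com:10000',
                       '-e', drop, '-e', fill])
# Reset any row-count / quality metrics recorded for this partition,
# so downstream fitness checks re-evaluate the re-inserted data.
```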

Page 38: Hadoop Summit - Sanoma self service on hadoop

Photo credits: DL76 - http://www.flickr.com/photos/dl76/3643359247/

Page 39: Hadoop Summit - Sanoma self service on hadoop

Enabling self service

[Diagram: Source systems (FI, NL) → Extract → Staging: raw data in native format (HDFS) → Transform → Normalized data structure (Hive). Reusable stem data feeds in: weather, international event calendars, international AdWords prices, exchange rates. Consumers: data scientists and business analysts.]

Page 40: Hadoop Summit - Sanoma self service on hadoop

Enabling self service

[Same diagram as the previous slide]

Page 41: Hadoop Summit - Sanoma self service on hadoop

HUE & Hive

Page 42: Hadoop Summit - Sanoma self service on hadoop

QlikView

Page 43: Hadoop Summit - Sanoma self service on hadoop

RStudio Server

Page 44: Hadoop Summit - Sanoma self service on hadoop

Jenkins

Page 45: Hadoop Summit - Sanoma self service on hadoop

Architecture

Page 46: Hadoop Summit - Sanoma self service on hadoop

High Level Architecture

[Diagram: High level architecture. Sources (ad serving incl. mobile, video, CPC, yield, etc.; market; content; back office; subscription; analytics) feed Hadoop through scheduled loads run by the Jenkins & Jython ETL (with metadata). On top of Hadoop: Hive & HUE for ad hoc queries, scheduled exports into QlikView for dashboards / reporting, and RStudio for advanced analytics.]

Page 47: Hadoop Summit - Sanoma self service on hadoop

Current State - Real time

Extending our own collecting infrastructure

Using this to drive recommendations, real time dashboarding, user segmentation and targeting in real time

Using Kafka and Storm
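For flavour, a minimal sketch of the collecting side using the kafka-python client; the broker address, topic name and event fields are invented, and the Storm topologies would consume from this topic:

```python
# Hypothetical producer: every pageview/click event is serialized and
# appended to a Kafka topic for stream processing in Storm.
import json
from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(bootstrap_servers='kafka.example.com:9092')
event = {'user': 'u123', 'url': '/article/42', 'ts': '2015-04-24T10:15:00Z'}
producer.send('clickstream', json.dumps(event).encode('utf-8'))
producer.flush()  # block until the event is acknowledged
```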

Page 48: Hadoop Summit - Sanoma self service on hadoop

High Level Architecture

[Diagram: the same high level architecture extended with a real-time path. Batch side unchanged: sources (ad serving incl. mobile, video, CPC, yield, etc.; market; content; back office; subscription; analytics) → Jenkins & Jython ETL (with metadata, scheduled loads) → Hadoop → Hive & HUE (ad hoc queries), QlikView (dashboards / reporting via scheduled exports), RStudio (advanced analytics). Real-time side: collecting (CLOE) → Flume / Kafka → Storm stream processing → recommendations & online learning and recommendations / Learn to Rank.]

Page 49: Hadoop Summit - Sanoma self service on hadoop

DC1 / DC2 - Sanoma Media The Netherlands infra

Colocation (Big Data platform; dev, test and acceptance VMs):

• No SLA (yet)

• Limited support, business days 9-5

• No full system backups

• Managed by us, with help from the systems department

Managed hosting (production VMs):

• 99.98% availability

• 24/7 support

• Multi DC

• Full system backups

• High performance EMC SAN storage

• Managed by a dedicated systems department

Page 50: Hadoop Summit - Sanoma self service on hadoop

[The same infrastructure diagram as the previous slide, annotated with the platform components: processing, storage, batch, collecting, real time and recommendations all run on the colocation side. Collecting can queue data for 3 days; recommendations can run on stale data for 3 days.]

Page 51: Hadoop Summit - Sanoma self service on hadoop

Present

Page 52: Hadoop Summit - Sanoma self service on hadoop

Self service position – Present

Source → Extraction → Transformation → Modeling → Load → Report / Dashboard → Insight

Page 53: Hadoop Summit - Sanoma self service on hadoop

Present

< 2008 2009 2010 2011 2012 2013 2014 2015

YARN

Page 54: Hadoop Summit - Sanoma self service on hadoop

Moving to YARN (CDH 4.3 -> 5.1)

24 April 2015 "Blasting frankfurt" by Heptagon -Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons -http://commons.wikimedia.org/wiki/File:Blasting_frankfurt.jpg#/media/File:Blasting_frankfurt.jpg54

Functional testing wasn’t enough

Takes time to tune the parameters; the defaults are NOT good enough

Cluster deadlocks

Grouped nodes with similar hardware requirements into node groups

CDH 4 to 5 specific:

– Hive CLI -> Beeline

– Hive locking
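To give a flavour of the hand tuning involved, a hedged config fragment: the property names are standard YARN/MapReduce settings, but the values are invented and depend entirely on the node hardware:

```xml
<!-- yarn-site.xml: per-node resources handed to YARN (values illustrative) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>24576</value>  <!-- leave headroom for the OS and DataNode -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>   <!-- cap for any single container -->
</property>

<!-- mapred-site.xml: per-task container and heap sizes -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>  <!-- JVM heap ~80% of the container size -->
</property>
```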

Page 55: Hadoop Summit - Sanoma self service on hadoop

Current state – Use case

Combining various data sources (web analytics + advertising + user profiles)

A/B testing + deeper analyses

Ad spend ROI optimization & attribution

Ad auction price optimization

Ad retargeting

Item and user based recommendations

Search optimizations

– Online Learn to Rank

– Search suggestions

Page 56: Hadoop Summit - Sanoma self service on hadoop

Current state - Usage

Main use case for reporting and analytics, increasing data science workloads

Sanoma standard data platform, used in all Sanoma countries

> 250 daily dashboard users

40 daily users: analysts & developers

43 source systems, with 125 different sources

400 tables in Hive

Page 57: Hadoop Summit - Sanoma self service on hadoop

Current state – Tech and team

Platform Team:

– 1 product owner

– 3 developers

Close collaborators:

– ~10 data scientists

– 1 QlikView application manager

– ½ (system) architect

Platform:

– 50-60 nodes

– > 600TB storage

– ~3000 jobs/day

Typical nodes:

– 1-2 CPU 4-12 cores

– 2 system disks (RAID 1)

– 4 data disks (2TB, 3TB or 4TB)

– 24-32GB RAM


Page 58: Hadoop Summit - Sanoma self service on hadoop

Challenges

Photo credits: http://kdrunlimited1.blogspot.hu/2013/01/late-edition-squirrel-appreciation-day.html

Page 59: Hadoop Summit - Sanoma self service on hadoop

Challenges

1. Security

2. Increase SLAs

3. Integration with other BI environments

4. Improve self service

5. Data quality

6. Code reuse

Page 60: Hadoop Summit - Sanoma self service on hadoop

Future

Page 61: Hadoop Summit - Sanoma self service on hadoop

What’s next

Better integration of batch and real time (unified code base)

Improve the availability and SLAs of the platform

Optimizing job scheduling (resource pools)

Automated query/job optimization tips for analysts

Fix Jenkins: it is a single point of failure

Moving some NLP (Natural Language Processing) and image recognition workloads to Hadoop

Page 62: Hadoop Summit - Sanoma self service on hadoop

Thank you! Questions?

@skieft / [email protected]

Page 63: Hadoop Summit - Sanoma self service on hadoop