Hadoop in the Adtech World

Yuta Imai, Solutions Engineer, Hortonworks

© Hortonworks Inc. 2011–2015. All Rights Reserved

What is Apache Hadoop?

Hadoop Ecosystem

• RDBMS Import/Export
• Distributed Storage & Processing Framework
• Secure NoSQL DB
• SQL on HBase
• NoSQL DB
• Workflow Management
• Streaming Data Ingestion
• Cluster System Operations
• Secure Gateway
• Distributed Registry
• Search & Indexing
• Even Faster Data Processing
• Data Management
• Machine Learning

Hortonworks Data Platform (HDP)

1st Gen Hadoop: Cost-Effective Batch at Scale

HADOOP 1.0: Built for Web-Scale Batch Apps

Silos created for distinct use cases:
• Single App: BATCH
• Single App: INTERACTIVE
• Single App: BATCH
• Single App: ONLINE

Hadoop Beyond Batch with YARN

A shift from the old to the new…

HADOOP 1: Single Use System (Batch Apps), with MapReduce as the base
• APIs: Data Flow (Pig), SQL (Hive), Others
• API, engine, and system: MapReduce (cluster resource management & data processing)

HADOOP 2: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …), with Apache YARN as the base
• APIs: Data Flow (Pig), SQL (Hive), Other ISV
• Engines: Batch (MapReduce), Tez
• System: YARN (Data Operating System: resource management, etc.)

HDFS (redundant, reliable storage) across nodes 1 … N

Architecture Enabled by YARN

A single set of data across the entire cluster with multiple access methods using “zones” for processing

• SQL (Hive): interactive SQL query for analytics
• Script-based ETL (Pig): algorithm executed in batch to rework data used by Hive and HBase consumers
• Stream processing (Storm): identify & act on real-time events
• NoSQL (HBase, Accumulo): low-latency access serving up a web front end

• Maximize compute resources to lower TCO
• No standalone, siloed clusters
• Simple management & operations

…all enabled by YARN

Hadoop Workload Evolution

A shift from the old to the new…

• HADOOP 1: Single Use System (Batch Apps). MapReduce on nodes 1 … N.
• HADOOP 2: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …). YARN on nodes 1 … N, running data access engines: MR2, Others (ISV Engines), Multiple (Script, SQL, NoSQL, …).
• HADOOP.Next: Multi Use Platform (Data & Beyond). YARN on nodes 1 … N, running the same data access engines alongside apps in Docker containers (MySQL, Tomcat).

Hadoop Operations & Tools

How Do You Operate a Hadoop Cluster?

Apache™ Ambari is a platform to provision, manage and monitor Hadoop clusters

Ambari Core Features and Extensibility

Ambari provides core services for operations, development and extension points for both.

• Tasks: Install & Configure; Operate, Manage & Administer; Develop; Optimize & Tune
• Roles: Cluster Admin, Developer, Data Architect
• How (core features): Install Wizard & Web; Web, Operator Views, Metrics & Alerts; User Views
• How (extensibility features): Stacks, Blueprints & REST APIs; Views Framework & REST APIs; Views Framework
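To make the REST API extension point a little more concrete, here is a minimal Python sketch that only builds (and does not send) a request for the list of clusters that an Ambari server manages. The host name, port and credentials are hypothetical placeholders, and the helper name `build_cluster_list_request` is invented for this illustration; the `/api/v1/clusters` path and HTTP basic authentication are how Ambari's REST API is commonly used, but check them against your Ambari version.

```python
import base64
import urllib.request


def build_cluster_list_request(base_url, user, password):
    """Build a GET request for Ambari's cluster list (nothing is sent here)."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/clusters",
        headers={
            "Authorization": f"Basic {token}",
            # Ambari expects this header on modifying calls; it is harmless on GET.
            "X-Requested-By": "ambari",
        },
        method="GET",
    )


# Hypothetical server address and credentials, for illustration only.
req = build_cluster_list_request("http://ambari.example.com:8080", "admin", "admin")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would actually send it; the sketch stops short of that so it can be run without a cluster.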

New User Views for DevOps

• Capacity Scheduler View: browse and manage YARN queues.
• Tez View: view information related to Tez jobs that are executing on the cluster.

New User Views for Development

New user interface enables fast & easy SQL definition and execution.

• Pig View: author and execute Pig scripts.
• Hive View: author, execute and debug Hive queries.
• Files View: browse the HDFS file system.

Apache Zeppelin

• Web-based notebook for data engineers, data analysts and data scientists
• Brings interactive data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop and Spark
• Modern data science studio
  – Scala with Spark
  – Python with Spark
  – SparkSQL
  – Apache Hive, and more.

Hadoop use cases in the adtech world

Most Hadoop use cases are Hive

• For example, Hive was often used to build access reports for web services, and an architecture like the one below was very common.
• Queries often took a fair amount of time, so they were usually run as scheduled jobs.

[Diagram: Hadoop, MySQL, Report UI]

Hadoop, and MapReduce in particular, was the tool for running heavy processing over large volumes of data.
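As a rough illustration of the kind of batch aggregation behind such an access report, the following Python sketch mimics the map, shuffle and reduce phases of a MapReduce job in a single process. It is not Hadoop code: the log records and all function names are hypothetical, and a real job would read its input from HDFS and run across many nodes.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical access-log records as (date, path) pairs.
ACCESS_LOG = [
    ("2016-06-01", "/index.html"),
    ("2016-06-01", "/ads/click"),
    ("2016-06-01", "/index.html"),
    ("2016-06-02", "/index.html"),
]


def map_phase(record):
    """Emit one ((date, path), 1) pair per log record."""
    date, path = record
    yield (date, path), 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence and return {key: count}."""
    mapped = chain.from_iterable(map_phase(record) for record in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


report = run_job(ACCESS_LOG)
for (date, path), views in sorted(report.items()):
    print(date, path, views)
```

In Hive the same report would be a single `GROUP BY` query, which is a large part of why Hive became the usual front end for this kind of job.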

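SQL on big data: attempts to make it faster

Hive (on MapReduce) was not fast enough for interactive queries.

• Presto
• Impala
• Drill
• Shark (now Spark SQL)

Most Hadoop use cases are Hive

• Configurations that combine Hive with Presto or MySQL (as a data mart) are becoming common.

[Diagram: Hadoop, Report UI]

SQL on big data: the arrival of cloud services

• Amazon Redshift
• Google BigQuery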

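Of course, Hive itself is also getting faster

Stinger Initiative: making Hive more than 100x faster

• Up to Hive 1.2.1
  – Tez
  – Cost Based Optimizer (CBO)
  – ORC file format
  – Vectorization
• Hive 2.0
  – LLAP

Sub-second: aiming for responses of under 1 second for short queries

Already available on HDP!

Speeding up Hive

• Hive can now handle interactive queries directly.

[Diagram: Hadoop, Report UI]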

Today the Hadoop ecosystem is used in many places

[Diagram: Ads server, delivery DB, Hadoop / HDFS, Report UI]

• Reports
• Long-term storage of all logs
• ETL and assorted batch processing
• Model generation for bidding and optimization
• Real-time log collection
• Real-time tracking

• GOVERNANCE: load data and manage according to policy
• SECURITY: provide a layered approach to security through Authentication, Authorization, Accounting, and Data Protection
• OPERATIONS: deploy and effectively manage the platform
  – Provision, Manage & Monitor: Ambari, Zookeeper
  – Scheduling
• BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script, Java / Scala (Cascading), Stream, Search, HBase, Accumulo, In-Memory, Others (ISV Engines), running on Tez and Slider
• YARN: Data Operating System (Cluster Resource Management)
• HDFS (Hadoop Distributed File System) across nodes 1 … N

Key highlights in recent Hadoop evolution

• Hive
  – LLAP
  – ACID, HCatalog Stream Mutation API
• Cloudbreak

Apache Hive: Fast Facts

• Most Queries Per Hour: 100,000 queries per hour
• Analytics Performance: 100 million rows/s per node (with Hive LLAP)
• Largest Hive Warehouse: 300+ PB raw storage (Facebook)
• Largest Cluster: 4,500+ nodes (Yahoo)

SQL evolution on Hadoop

[Chart: SQL engines placed by Speed and Feature. Categories: Batch SQL, Interactive SQL, Sub-Second SQL, OLAP / Cube, ACID / MERGE]

• Hive 0.x (MapReduce)
• Hive 1.2- (Tez, Vectorize, ORC, CBO)
• Hive 2.0 (LLAP)
• Presto, Impala, Spark SQL, HAWQ
• Kylin, Druid
• Commercial: Kyvos Insights, AtScale

Hive 2 with LLAP: Architecture Overview

• ODBC / JDBC: SQL queries
• HiveServer2 (query endpoint)
• Query coordinators
• YARN cluster: LLAP daemons, each with query executors
• In-memory cache (shared across all users)
• Storage: HDFS, S3 + other HDFS-compatible filesystems

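While its architecture is close to an MPP design, LLAP also:
• has a cache layer,
• uses YARN for scaling, and
• can run queries that do not need low latency in regular Tez containers.

The design picks the best parts of each approach.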

Hive 2 with LLAP: 25+x Performance Boost

Hive 2 with LLAP averages 26x faster than Hive 1.

[Chart: Hive 1 / Tez time (s) and Hive 2 / LLAP time (s), lower is better, with speedup (x factor)]

Hive ACID Production-Ready with HDP 2.5

[Chart: Query Time versus Data Size. Runtime for all queries (s) and total compressed data, 16/05/24 to 16/06/01]

[Chart: Times for Inserts and Deletes. time_insert_lineitem, time_insert_orders, time_delete_lineitem, time_delete_orders (s), 16/05/23 to 16/06/01]

Notable Improvements

• Tested at multi-TB scale using the TPC-H benchmark.
  – Reliably ingest 400 GB+ per day within a partition.
  – 10 TB+ raw data in a single partition.
  – Simultaneous ingest, delete and query.
• 70+ stabilization improvements.
• Supported:
  – SQL INSERT, UPDATE, DELETE.
  – Streaming API.
• Future: SQL MERGE under development (HIVE-10924).
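To show what the supported SQL INSERT, UPDATE and DELETE look like in practice, here is a minimal Python sketch that only assembles HiveQL strings for a transactional table; it does not connect to Hive. The table `ad_impressions`, its columns and the helper `acid_table_ddl` are hypothetical. The sketch assumes the usual requirements for Hive ACID tables of this generation (bucketed, stored as ORC, `transactional` table property) and the session settings that transactions commonly need, which a cluster may already set globally.

```python
# Session settings commonly needed for Hive transactions (assumed here;
# a cluster may already configure them globally).
SESSION_SETTINGS = [
    "SET hive.support.concurrency=true",
    "SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager",
]


def acid_table_ddl(table, columns, bucket_column, num_buckets=4):
    """Build DDL for a bucketed, ORC-backed transactional table."""
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE {table} ({column_list}) "
        f"CLUSTERED BY ({bucket_column}) INTO {num_buckets} BUCKETS "
        "STORED AS ORC "
        "TBLPROPERTIES ('transactional'='true')"
    )


# Hypothetical adtech table, for illustration only.
ddl = acid_table_ddl(
    "ad_impressions",
    [("impression_id", "BIGINT"), ("campaign_id", "INT"), ("cost", "DOUBLE")],
    bucket_column="impression_id",
)

statements = SESSION_SETTINGS + [
    ddl,
    "INSERT INTO TABLE ad_impressions VALUES (1, 100, 0.25), (2, 100, 0.30)",
    # The bucketing column is left untouched; only a regular column is updated.
    "UPDATE ad_impressions SET cost = 0.20 WHERE impression_id = 1",
    "DELETE FROM ad_impressions WHERE impression_id = 2",
]

for statement in statements:
    print(statement + ";")
```

The printed statements could be pasted into a Hive session (for example through the Hive View) on a cluster where transactions are enabled.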

A pain point of analytics and aggregation databases has been that data had to be loaded in batches. Being able to do streaming inserts is a big advantage.

HCatalog Stream Mutation API

[Diagram labels: ORC, ORC, Bucket]

Cloudbreak

Makes deploying HDP to the cloud easy

1. Pick a Blueprint
2. Choose a Cloud
3. Launch HDP!

Example Ambari Blueprints: IoT Apps, BI / Analytics, Data Science, Dev / Test

• BI / Analytics (Hive)
• IoT Apps (Storm, HBase, Hive)
• Dev / Test (all HDP services)
• Data Science (Spark)
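For readers who have not seen one, an Ambari Blueprint is a JSON document that names a stack and lists which components go into which host group. The Python sketch below builds a deliberately small, BI / Analytics flavoured blueprint as a dictionary and prints it as JSON. The blueprint name, host group names and the selection of components are illustrative assumptions rather than a validated topology, and `build_blueprint` is a helper invented for this example; the stack version is taken from the HDP 2.5 release mentioned in this deck.

```python
import json


def build_blueprint(name, stack_version, host_groups):
    """Assemble an Ambari-style blueprint dictionary.

    host_groups maps a group name to (cardinality, [component names]).
    """
    return {
        "Blueprints": {
            "blueprint_name": name,
            "stack_name": "HDP",
            "stack_version": stack_version,
        },
        "host_groups": [
            {
                "name": group,
                "cardinality": str(cardinality),
                "components": [{"name": component} for component in components],
            }
            for group, (cardinality, components) in host_groups.items()
        ],
    }


# Illustrative component selection only; a real Hive cluster needs more services.
blueprint = build_blueprint(
    name="bi-analytics-sketch",
    stack_version="2.5",
    host_groups={
        "master": (1, ["NAMENODE", "RESOURCEMANAGER", "ZOOKEEPER_SERVER",
                       "HIVE_METASTORE", "HIVE_SERVER"]),
        "worker": (3, ["DATANODE", "NODEMANAGER"]),
    },
)

print(json.dumps(blueprint, indent=2))
```

A tool such as Cloudbreak takes a blueprint like this, provisions machines on the chosen cloud, and hands the blueprint to Ambari to install the cluster.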

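Recent evolution of Hadoop: in summary…

• Cloudbreak

Recent evolution of Hadoop: toward working well alongside the cloud

• Real-time data collection
• On-demand cluster deployment inside and outside the cloud
• Low-latency query processing while making use of cloud storage

[Diagram labels: Cache, Cache]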
