integrating hadoop in your existing dw and bi environment

37

Upload: cloudera-inc

Post on 20-Aug-2015

4.229 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Integrating Hadoop in Your Existing DW and BI Environment
Page 2: Integrating Hadoop in Your Existing DW and BI Environment

Page 3: Integrating Hadoop in Your Existing DW and BI Environment

Page 4: Integrating Hadoop in Your Existing DW and BI Environment

Presentation Outline! 1. The standard model

! 2. The 3 stages of Hadoop adoption

! 3. Cloudera partnerships

! 4. Analytics at eBay

! Questions and Discussion

Wednesday, November 17, 2010

Page 5: Integrating Hadoop in Your Existing DW and BI Environment

1. The Standard ModelData Warehousing and Business Intelligence

Wednesday, November 17, 2010

Page 6: Integrating Hadoop in Your Existing DW and BI Environment

ApplicationDatabase

ApplicationRequests

Wednesday, November 17, 2010

Page 7: Integrating Hadoop in Your Existing DW and BI Environment

ApplicationDatabase

ApplicationRequests

DataWarehouse

Wednesday, November 17, 2010

Page 8: Integrating Hadoop in Your Existing DW and BI Environment

ApplicationDatabase

ApplicationRequests

DataWarehouse

ETL

Wednesday, November 17, 2010

Page 9: Integrating Hadoop in Your Existing DW and BI Environment

ApplicationDatabase

ApplicationRequests

DataWarehouse

ETL

BusinessIntelligence

Wednesday, November 17, 2010

Page 10: Integrating Hadoop in Your Existing DW and BI Environment

ApplicationDatabase

ApplicationRequests

DataWarehouse

ETL

BusinessIntelligence

Analytics

Wednesday, November 17, 2010

Page 11: Integrating Hadoop in Your Existing DW and BI Environment

2. The 3 Stages of Hadoop Adoption

Wednesday, November 17, 2010

Page 12: Integrating Hadoop in Your Existing DW and BI Environment

Stage 1Off the Critical Path

Wednesday, November 17, 2010

Page 13: Integrating Hadoop in Your Existing DW and BI Environment

Stage 1Copy or Archive

ApplicationDatabase

ApplicationRequests

DataWarehouse

ETL

BusinessIntelligence

Analytics

Hadoop

Wednesday, November 17, 2010

Page 14: Integrating Hadoop in Your Existing DW and BI Environment

Stage 1Add Unstructured Data

ApplicationDatabase

ApplicationRequests

DataWarehouse

ETL

BusinessIntelligence

Analytics

Hadoop

Wednesday, November 17, 2010

Page 15: Integrating Hadoop in Your Existing DW and BI Environment

Stage 1Consolidate Multiple Data Warehouses

ApplicationDatabase

DataWarehouse

ETL

Hadoop

ApplicationDatabase

DataWarehouse

ETL

Wednesday, November 17, 2010

Page 16: Integrating Hadoop in Your Existing DW and BI Environment

Stage 2On the Critical Path

Wednesday, November 17, 2010

Page 17: Integrating Hadoop in Your Existing DW and BI Environment

Stage 2Structure and Store

ApplicationDatabase

ApplicationRequests

DataWarehouse

BusinessIntelligence

Analytics

Hadoop

Wednesday, November 17, 2010

Page 18: Integrating Hadoop in Your Existing DW and BI Environment

Stage 3Ad Hoc Query Support

Wednesday, November 17, 2010

Page 19: Integrating Hadoop in Your Existing DW and BI Environment

ApplicationDatabase

ApplicationRequests

DataWarehouse

BusinessIntelligence

Analytics

Hadoop + Hive

BusinessIntelligence

Analytics

Wednesday, November 17, 2010

Page 20: Integrating Hadoop in Your Existing DW and BI Environment

Cloudera’s Distribution for HadoopThe Industry-leading Hadoop Distribution

Wednesday, November 17, 2010

Page 21: Integrating Hadoop in Your Existing DW and BI Environment

3. Cloudera Partnerships

Wednesday, November 17, 2010

Page 22: Integrating Hadoop in Your Existing DW and BI Environment

Cloudera PartnershipsCloud, Hardware, and OS

! Processor

! AMD, Intel

! Server

! Acer, HP, Supermicro

! OS

! Canonical

! Cloud

! VMware vCloud

! CDH runs on AWS and Rackspace Cloud as well

Wednesday, November 17, 2010

Page 23: Integrating Hadoop in Your Existing DW and BI Environment

Cloudera PartnershipsData Integration

! Informatica

! Talend

! Pentaho Data Integration

Wednesday, November 17, 2010

Page 24: Integrating Hadoop in Your Existing DW and BI Environment

Cloudera PartnershipsDatabase

! Aster Data

! Greenplum

! Membase

! Netezza

! Quest Software (OraOop)

! Teradata

! Vertica

Wednesday, November 17, 2010

Page 25: Integrating Hadoop in Your Existing DW and BI Environment

Cloudera PartnershipsBusiness Intelligence

! Jaspersoft

! Microstrategy

! Pentaho BI Suite

Wednesday, November 17, 2010

Page 26: Integrating Hadoop in Your Existing DW and BI Environment

4. Analytics at eBay

Wednesday, November 17, 2010

Page 27: Integrating Hadoop in Your Existing DW and BI Environment

1

eBay’s Data Scale

• eBay manages …• Over 90 million active users worldwide• Over 220 million items for sale• Over 10 billion URL requests per day

• •  … in a dynamic environment• Tens of new features each week• Roughly 10% of items are listed or ended every day

• Collect Everything• eBay processes 40TB of new, incremental data per day• eBay analyzes 40PB of data per day• Store every historical item and purchase

eBay has one of the largest EDW system and is building one of the world’slargest Hadoop clusters

Page 28: Integrating Hadoop in Your Existing DW and BI Environment

2

Where – it fits in our Data Platform…

Page 29: Integrating Hadoop in Your Existing DW and BI Environment

Integration into Existing Warehouse

3

Click Stream

EDW

Images

Search Indices

Analytics Reporting

Algorithmic Models

Acquisition

Item Description

DataAcquisition

BIGeneration

InsightDelivery

Page 30: Integrating Hadoop in Your Existing DW and BI Environment

Data Sourcing Patterns

4

Source Preparation Format Pattern / LearningClick StreamSessionEventSession Container

Session/Event Streamed as Gzip/ Binary. Prepared as LZO/Text.

Session/Event DataBuild an index and use LzoTextInputFormat for splits

Session Container - a join of Session and corresponding Event data. Prepared as Sequence Files.

Session Container - Secondary sort with reduce side join

EDWItemTransactionUserFeedbackBids

Incremental feed streamed and maintained as GZIP/Text.

Smaller data set , keep it in the original format.

Prepare a snapshot as SequenceFile.

Rebuild daily snapshot with previous snapshot and incremental day’s data.

Build a Hive table on snapshot data Create external Hive table which points to SequenceFile

HBasea) Leverage TotalOrderPartitonerwith RandomSamplers to identify partition ranges for reducers.b) Create HBaseregions using Hfilec) Update RegionServers using ruby script loadtable.rb

Learninga) Incremental data not temporal/sparse, hence not suitable as versions in a column oriented DB.b) HBase insert vs. append performance, 120K vs. 12K rows per secc) Hfile flush durability issues HBASE-1923

Page 31: Integrating Hadoop in Your Existing DW and BI Environment

Hadoop Ecosystem

55

Hadoop Core (HDFS,Common)

MapReduce (Java, Streaming, Pipes,Scala)

Data Access (Hbase, Pig, Hive)

Tools & L ibraries(HUE,UC4,Oozie.Mobius,Mahout)

Monitoring & A lerting (Ganglia, Nagios)

•MapReduceSourcing data primarily JavaApplications using Perl, Scala, Python…

• Data Access FrameworksPig – data piplelinesHive – Adhoc queries MQL –Mobius Query Language

•Monitoring & AlertingGanglia, Nagios,  Cloudera Enterprise

• Tools & LibrariesHUE/Mobius – lifecycle of user  jobsUC4 ‐ schedulingOozie – user workflow and data pipelinesMahout – data mining    

Page 32: Integrating Hadoop in Your Existing DW and BI Environment

6

Page 33: Integrating Hadoop in Your Existing DW and BI Environment

Metadata ‐ Data Discovery & Management

7

Clients

Data Sourcing

Data Access Layer

HDFS

Metadata        

Data Discovery Data Monitoring

Logical Type System

Provisioning Tools

Metadata Store

Hive, Java

Pig Schemas

Pig  load UDFs

Hive Tables

Java POJO

ValidationLoad

HBASE Tables

Extract Transform

Page 34: Integrating Hadoop in Your Existing DW and BI Environment

Administration

• Groups• Cloudera Enterprise• Workload Management

• Allocation, Weights , Preemption, Speculative Execution, Data Locality

• Security• Integrate Hadoop security spec with corporate policies• Authentication

• HUE – custom module to use corp. credentials• Command Line Interface – PAM custom module

• Authorization• Establish roles based on data classification and access patterns

8

Page 35: Integrating Hadoop in Your Existing DW and BI Environment

9

Metrics Details

Data Sourcing Latency, Data Load Status, Integrity, Quality, Availability

Consumption Cloudera’s Bean Counter , Job Statistics, System consumption

Budgeting Resource Allocation Models, Forecasting, Chargeback

Utilization Cloudera’s Activity Monitor, Efficiency, Performance

Platform Description

Availability Standby Nodes ‐ Checkpoint ,Backup , Avatar Node, SLAs

Manageability Installation, Provisioning, De‐Provisioning, Version upgrades

Scalability Federated NameNode, Metadata Replication, Zookeeper

Data Movement Publish/Subscribe  ETL tools, low latency , self‐service

Storage Consistency, Partitioning, Compression, Replication

Workload Concurrency, Resource Sharing, Schedulers, Allocation

Policies Retention, Archival, Backup, Quotas

Platform & Metrics

Page 36: Integrating Hadoop in Your Existing DW and BI Environment
Page 37: Integrating Hadoop in Your Existing DW and BI Environment