scaling etl with hadoop shapira 3

46
1 Scaling ETL with Hadoop Gwen Shapira, Solutions Architect @gwenshap [email protected]

Upload: chen-gwen-shapira

Post on 26-Jan-2015

115 views

Category:

Technology


1 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Scaling etl with hadoop   shapira 3

1

Scaling ETL with Hadoop

Gwen Shapira, Solutions Architect@[email protected]

Page 2: Scaling etl with hadoop   shapira 3

2

Have you read the ZooKeeper paper?

(“ZooKeeper: Wait-free coordination for Internet-scale systems”)

Page 3: Scaling etl with hadoop   shapira 3

3

ETL is…

• Extracting data from outside sources• Transforming it to fit operational needs• Loading it into the end target

• (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)

Page 4: Scaling etl with hadoop   shapira 3

4

Hadoop Is…

• HDFS – Replicated, distributed data storage• Map-Reduce – Batch oriented data processing at

scale. Parallel programming platform.

Page 5: Scaling etl with hadoop   shapira 3

5

The Ecosystem

• High level languages and abstractions• File, relational and streaming data integration• Process Orchestration and Scheduling• Libraries for data wrangling• Low latency query language

Page 6: Scaling etl with hadoop   shapira 3

6

Why ETL with Hadoop?

Page 7: Scaling etl with hadoop   shapira 3

8

Got unstructured data?

• Traditional ETL:• Text• CSV• XLS• XML

• Hadoop:• HTML• XML, RSS• JSON• Apache Logs• Avro, ProtoBuffs, ORC, Parquet• Compression• Office, OpenDocument, iWorks• PDF, Epup, RTF• Midi, MP3• JPEG, Tiff• Java Classes• Mbox, RFC822• Autocad• TrueType Parser• HFD / NetCDF

Page 8: Scaling etl with hadoop   shapira 3

9

Replace ETL Clusters

• Informatica is:• Easy to program• Complete ETL solution

• Hadoop is:• Cheaper• MUCH more flexible• Faster?• More scalable?

• You can have both

Page 9: Scaling etl with hadoop   shapira 3

10

Data Warehouse Offloading

• DWH resources are EXPENSIVE

• Reduce storage costs• Release CPU capacity• Scale • Better tools

Page 10: Scaling etl with hadoop   shapira 3

11

What I often see

ETL Cluster

ELT in DWH

ETL in Hadoop

ETL Cluster ETL Cluster with some Hadoop

OR

Page 11: Scaling etl with hadoop   shapira 3

12

Moving your transformations from the DWH to Hadoop?

Lets do it right.

Page 12: Scaling etl with hadoop   shapira 3

13

We’ll Discuss:

Technologies Speed & Scale Tips & Tricks

Extract

Transform

Load

Workflow

Page 13: Scaling etl with hadoop   shapira 3

14

Extract

Page 14: Scaling etl with hadoop   shapira 3

15

Let me count the ways

• From Databases: Sqoop • Log Data: Flume + CDK• Or just write data to HDFS

Page 15: Scaling etl with hadoop   shapira 3

16

Sqoop – The Balancing Act

Page 16: Scaling etl with hadoop   shapira 3

17

Scale Sqoop Slowly

• Balance between:• Maximizing network utilization• Minimizing database impact

• Start with smallish table (1-10G)• 1 mapper, 2 mappers, 4 mappers• Where’s the bottleneck?• vmtat, iostat, mpstat, netstat, iptraf

Page 17: Scaling etl with hadoop   shapira 3

18

When Loading Files:

Same principles apply:• Parallel Copy• Add Parallelism• Find Bottlenecks• Resolve them• Avoid Self-DDOS

Page 18: Scaling etl with hadoop   shapira 3

19

Scaling Sqoop

• Split column - match index or partitions• Compression• Drivers + Connectors + Direct mode• Incremental import

Page 19: Scaling etl with hadoop   shapira 3

20

Ingest Tips

• Use file system tricks to ensure consistency• Directory structure:

/intent/category

/application (optional)/dataset

/partitions/files

• Examples:/data/fraud/txs/2011-01-01/20110101-00.avro/group/research/model-17/training-tx/part-00000.txt/user/gshapira/scratch/surge/

Page 20: Scaling etl with hadoop   shapira 3

21

Ingest Tips

• External tables in Hive• Keep raw data• Trigger workflows on

file arrival

Page 21: Scaling etl with hadoop   shapira 3

22

Transform

Page 22: Scaling etl with hadoop   shapira 3

23

Endless Possibilites

• Map Reduce (in any language)

• Hive (i.e. SQL)• Pig• R• Shell scripts• Plain old Java• Morphlines

Page 23: Scaling etl with hadoop   shapira 3

24

Prototype

Page 24: Scaling etl with hadoop   shapira 3

25

Partitioning

• Hive• Directory Structure• Pre-filter• Adds metadata

Page 25: Scaling etl with hadoop   shapira 3

26

Tune Data Structures

• Joins are expensive• Disk space is not• De-normalize• Store same data in

multiple formats

Page 26: Scaling etl with hadoop   shapira 3

27

Map-Reduce

• Assembly language of data processing• Simple things are hard, hard things are possible• Use for:

• Optimization: Do in one MR job what Hive does in 3• Optimization: Partition the data just right• GeoSpatial• Mahout – Map/Reduce machine learning

Page 27: Scaling etl with hadoop   shapira 3

28

Parallelism –Unit of Work

• Amdahl’s Law• Small Units• That stay small

• One user?• One day?• Ten square meter?

Page 28: Scaling etl with hadoop   shapira 3

29

Remember the Basics

• X reduce output is 3X disk IO and 2X network IO• Less jobs = Less reduces = Less IO = Faster and Scalier• Know your network and disk throughput• Have rough idea of ops-per-second

Page 29: Scaling etl with hadoop   shapira 3

30

Instrumentation

• Optimize the right things• Right jobs • Right hardware• 90% of the time –

its not the hardware

Page 30: Scaling etl with hadoop   shapira 3

31

Tips

Avoid the“old data warehouse guy”syndrome

Page 31: Scaling etl with hadoop   shapira 3

32

Tips

• Slowly changing dimensions:• Load changes • Merge• And swap

• Store intermediate results:• Performance• Debugging• Store Source/Derived relation

Page 32: Scaling etl with hadoop   shapira 3

33

Fault and Rebuild

• Tier 0 – raw data• Tier 1 – cleaned data• Tier 2 – transformations, lookups and denormalization• Tier 3 - Aggregations

Page 33: Scaling etl with hadoop   shapira 3

34

Never Metadata I didn’t like

• Metadata is small data• That changes frequently• Solutions:

• Oozie• Directory structure / Partitions• Outside of HDFS• HBase• Cloudera Navigator

Page 34: Scaling etl with hadoop   shapira 3

35

Few words about Real Time ETL

• What does it even mean?• Fast reporting?• No delay from OLTP to DWH?• Micro-batches make more sense:

• Aggregation• Economy of scale

• Late data happens• Near-line solutions

Page 35: Scaling etl with hadoop   shapira 3

36

Load

Page 36: Scaling etl with hadoop   shapira 3

37

Technologies

• Sqoop• Fuse-DFS• Oracle Connectors• NoSQLs

Page 37: Scaling etl with hadoop   shapira 3

38

Scaling

• Sqoop works better for some DBs• Fuse-FS and DB tools give more control• Load in parallel• Directly to FS• Partitions • Do all formatting on Hadoop• Do you REALLY need to load that?

Page 38: Scaling etl with hadoop   shapira 3

39

How not to Load

• Most Hadoop customers don’t load data in bulk• History can stay in Hadoop• Load only aggregated data• Or computation results – recommendations, reports.• Most queries can run in Hadoop• BI tools often run in Hadoop

Page 39: Scaling etl with hadoop   shapira 3

40

Workflow Management

Page 40: Scaling etl with hadoop   shapira 3

41

Tools

• Oozie• Azkaban

• Pentaho Kettle• TalenD

• Informatica

}

Native Hadoop

}

Kinda Open Source

Page 41: Scaling etl with hadoop   shapira 3

42

Scaling Challenges

• Keeping track of:• Code Components• Metadata• Integrations and Adapters• Reports, results, artifacts

• Scheduling and Orchestration• Cohesive System View• Life Cycle• Instrumentation, Measurement and Monitoring

Page 42: Scaling etl with hadoop   shapira 3

43

My Toolbox

• Hue + Oozie:• Scheduling + Orchestration• Cohesive system view• Process repository• Some metadata• Some instrumentation

• Cloudera Manager for monitoring

• … and way too many home grown scripts

Page 43: Scaling etl with hadoop   shapira 3

44

Hue + Oozie

Page 44: Scaling etl with hadoop   shapira 3

45

Writing a workflow engine isthe software engineering equivalent of getting involved in a land war in Asia.

“ ”

— Josh Wills

Page 45: Scaling etl with hadoop   shapira 3

46

A lot is still missing

• Metadata solution• Instrumentation and monitoring solution• Lifecycle management• Lineage tracking

Contributions are welcome

Page 46: Scaling etl with hadoop   shapira 3

47