
Page 1: Scaling ETL with Hadoop - Avoiding Failure

Scaling ETL with Hadoop
Gwen Shapira
@gwenshap
[email protected]

Page 2: Scaling ETL with Hadoop - Avoiding Failure

Coming soon to a bookstore near you…

• Hadoop Application Architectures

How to build end-to-end solutions using Apache Hadoop and related tools

@hadooparchbook
www.hadooparchitecturebook.com

Page 3: Scaling ETL with Hadoop - Avoiding Failure

ETL is…

• Extracting data from outside sources
• Transforming it to fit operational needs
• Loading it into the end target

• (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)

Page 4: Scaling ETL with Hadoop - Avoiding Failure

Hadoop Is…

• HDFS – Massive, redundant data storage
• MapReduce – Batch-oriented data processing at scale
• Many, many ways to process data in parallel at scale

Page 5: Scaling ETL with Hadoop - Avoiding Failure

The Ecosystem

• High-level languages and abstractions
• File, relational and streaming data integration
• Process orchestration and scheduling
• Libraries for data wrangling
• Low-latency query language

Page 6: Scaling ETL with Hadoop - Avoiding Failure


Why ETL with Hadoop?

Page 7: Scaling ETL with Hadoop - Avoiding Failure

Data Has Changed in the Last 30 Years

[Figure: data growth from 1980 to 2013, driven by end-user applications, the internet, mobile devices, and sophisticated machines. Structured data is about 10% of the total; unstructured data is about 90%.]

Page 8: Scaling ETL with Hadoop - Avoiding Failure

Volume, Variety, Velocity Cause Problems

[Diagram: the traditional pipeline. OLTP systems and enterprise applications feed Extract / Transform / Load into the data warehouse, which serves business intelligence queries. Three pain points are called out:]

1. Slow data transformations. Missed SLAs.
2. Slow queries. Frustrated business and IT.
3. Must archive. Archived data can't provide value.

Page 9: Scaling ETL with Hadoop - Avoiding Failure

Got unstructured data?

Traditional ETL:
• Text
• CSV
• XLS
• XML

Hadoop:
• HTML
• XML, RSS
• JSON
• Apache logs
• Avro, ProtoBufs, ORC, Parquet
• Compression
• Office, OpenDocument, iWorks
• PDF, Epub, RTF
• Midi, MP3
• JPEG, Tiff
• Java classes
• Mbox, RFC822
• AutoCAD
• TrueType
• HDF / NetCDF

Page 10: Scaling ETL with Hadoop - Avoiding Failure

What is Apache Hadoop?

Apache Hadoop is an open source platform for data storage and processing that is:
• Distributed
• Fault tolerant
• Scalable

Has the Flexibility to Store and Mine Any Type of Data
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema

Excels at Processing Complex Data
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks

Scales Economically
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in

Core Hadoop system components:
• Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
• MapReduce: distributed computing framework

Page 11: Scaling ETL with Hadoop - Avoiding Failure

What I often see

• ETL Cluster
• ELT in DWH
• ETL in Hadoop

Page 12: Scaling ETL with Hadoop - Avoiding Failure

Moving your transformations from the DWH to Hadoop?

Let's do it right.

Page 13: Scaling ETL with Hadoop - Avoiding Failure

Best Practices

Arup Nanda taught me to ask:
1. Why is it better than the rest?
2. What happens if it is not followed?
3. When are they not applicable?

Page 14: Scaling ETL with Hadoop - Avoiding Failure

Or at least, let's avoid the worst mistakes

Page 15: Scaling ETL with Hadoop - Avoiding Failure


Extract

Page 16: Scaling ETL with Hadoop - Avoiding Failure

Let me count the ways

1. From Databases: Sqoop
2. Log Data: Flume
3. Copy data to HDFS

(a minimal sketch of each follows below)
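
For illustration, with made-up host names, table names and directories:

    # 1. From a database with Sqoop (hypothetical connection details):
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
      --username etl_user -P \
      --table ORDERS \
      --target-dir /data/sales/orders

    # 2. Log data with Flume; agent.conf is assumed to define a source,
    #    channel and sink for an agent named a1:
    flume-ng agent --conf ./conf --conf-file agent.conf --name a1

    # 3. Plain file copy into HDFS:
    hadoop fs -put /local/exports/orders_20131101.csv /data/sales/orders/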

Page 17: Scaling ETL with Hadoop - Avoiding Failure

Data Loading Mistake #1

Hadoop is scalable. Let's run as many Sqoop mappers as possible, to get the data from our DB faster!

— Famous last words

Page 18: Scaling ETL with Hadoop - Avoiding Failure


Result:

Page 19: Scaling ETL with Hadoop - Avoiding Failure

Lesson:

• Start with 2 mappers, add slowly (see the sketch below)
• Watch DB load and network utilization
• Use FairScheduler to limit the number of mappers
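
As a sketch (connection string, table and split column are made up), the conservative starting point looks like this:

    # Two parallel mappers, split on an indexed numeric key; add mappers
    # only after checking database load and network utilization.
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
      --username etl_user -P \
      --table ORDERS \
      --split-by ORDER_ID \
      --num-mappers 2 \
      --target-dir /data/sales/orders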

Page 20: Scaling ETL with Hadoop - Avoiding Failure

Data Loading Mistake #2

Database-specific connectors are complicated and scary. Let's just use the default JDBC connector.

— Famous last words

Page 21: Scaling ETL with Hadoop - Avoiding Failure


Result:

Page 22: Scaling ETL with Hadoop - Avoiding Failure

Lesson:

1. There are connectors for Oracle, Netezza and Teradata
2. Download them
3. Read the documentation
4. Ask questions if anything is not clear
5. Follow the installation instructions
6. Use Sqoop with the connectors (see the sketch below)
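
The exact wiring differs per connector and version, so follow the connector's own documentation. As a rough sketch with hypothetical connection details:

    # --direct asks Sqoop to use an installed native connector instead of
    # generic JDBC; how a given connector hooks in varies by version.
    sqoop import \
      --direct \
      --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
      --username etl_user -P \
      --table ORDERS \
      --target-dir /data/sales/orders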

Page 23: Scaling ETL with Hadoop - Avoiding Failure

Data Loading Mistake #3

Just copying files? This sounds too simple. We probably need some cool whizzbang tool.

— Famous last words

Page 24: Scaling ETL with Hadoop - Avoiding Failure


Result

Page 25: Scaling ETL with Hadoop - Avoiding Failure

Lessons:

• Copying files is a legitimate solution (see the sketch below)
• In general, simple is good
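
The whole "tool" is often just this (paths are illustrative):

    # Local files into HDFS:
    hadoop fs -put /local/exports/*.csv /data/sales/orders/

    # Large or cluster-to-cluster copies, done in parallel:
    hadoop distcp hdfs://source-nn:8020/data/sales hdfs://dest-nn:8020/data/sales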

Page 26: Scaling ETL with Hadoop - Avoiding Failure


Transform

Page 27: Scaling ETL with Hadoop - Avoiding Failure

Endless Possibilities

• MapReduce
• Crunch / Cascading
• Spark
• Hive (i.e. SQL)
• Pig
• R
• Shell scripts
• Plain old Java

Page 28: Scaling ETL with Hadoop - Avoiding Failure


Data Processing Mistake #0

Page 29: Scaling ETL with Hadoop - Avoiding Failure

Data Processing Mistake #1

This system must be ready in 12 months. We have to convert 100 data sources and 5,000 transformations to Hadoop. Let's spend 2 days planning a schedule and budget for the entire year and then just go and implement it.

Prototype? Who needs that?

— Famous last words

Page 30: Scaling ETL with Hadoop - Avoiding Failure


Result

Page 31: Scaling ETL with Hadoop - Avoiding Failure

Lessons

• Take the learning curve into account
• You don't know what you don't know
• Hadoop will be difficult and frustrating for at least 3 months

Page 32: Scaling ETL with Hadoop - Avoiding Failure


Data Processing Mistake #2

Hadoop is all about MapReduce. So I’ll use MapReduce for all my data processing needs.

— Famous last words

Page 33: Scaling ETL with Hadoop - Avoiding Failure


Result:

Page 34: Scaling ETL with Hadoop - Avoiding Failure

Lessons:

MapReduce is the assembly language of Hadoop:

Simple things are hard. Hard things are possible.
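
For example, a top-10 aggregation that takes a full Java MapReduce job (mapper, reducer, driver, packaging, submission) is a single statement in Hive; table and column names here are hypothetical:

    hive -e "
      SELECT product_id, SUM(amount) AS total
      FROM orders
      GROUP BY product_id
      ORDER BY total DESC
      LIMIT 10;"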

Page 35: Scaling ETL with Hadoop - Avoiding Failure


Data Processing Mistake #3

I got 5000 tiny XMLs, and Hadoop is great at processing unstructured data. So I’ll just leave the data like that and parse the XML in every job.

— Famous last words

Page 36: Scaling ETL with Hadoop - Avoiding Failure


Result

Page 37: Scaling ETL with Hadoop - Avoiding Failure

Lessons

1. Consolidate small files
2. Don't argue about #1
3. Convert files to easy-to-query formats (see the sketch below)
4. De-normalize
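
One common pattern, sketched with hypothetical table names: parse the XMLs once into a staging table, then rewrite them as a single table in a columnar format (Parquet here, assuming a Hive version that supports STORED AS PARQUET):

    # One-time rewrite: thousands of tiny parsed XMLs become one
    # consolidated table that downstream jobs query without re-parsing.
    hive -e "
      CREATE TABLE orders_parquet
      STORED AS PARQUET
      AS SELECT * FROM raw_orders;"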

Page 38: Scaling ETL with Hadoop - Avoiding Failure


Data Processing Mistake #4

Partitions are for relational databases

— Famous last words

Page 39: Scaling ETL with Hadoop - Avoiding Failure


Result

Page 40: Scaling ETL with Hadoop - Avoiding Failure

Lessons

1. Without partitions, every query is a full table scan
2. Yes, Hadoop scans fast
3. But the fastest read is the one you don't perform
4. Cheap storage allows you to store the same dataset, partitioned multiple ways
5. Use partitions for fast data loading (see the sketch below)
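
A sketch with illustrative names: with a date-partitioned Hive table, a query that filters on dt reads only the matching directories, and loading a new day of data is just registering a new partition:

    hive -e "
      CREATE EXTERNAL TABLE orders (order_id BIGINT, amount DOUBLE)
      PARTITIONED BY (dt STRING)
      STORED AS PARQUET
      LOCATION '/data/pharmacy/fraud/orders';

      -- new data lands in .../dt=20131101; register it:
      ALTER TABLE orders ADD PARTITION (dt='20131101');"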

Page 41: Scaling ETL with Hadoop - Avoiding Failure


Load

Page 42: Scaling ETL with Hadoop - Avoiding Failure

Technologies

• Sqoop
• Fuse-DFS
• Oracle Connectors
• Just copy files
• Query Hadoop

Page 43: Scaling ETL with Hadoop - Avoiding Failure


Data Loading Mistake #1

All of the data must end up in a relational DWH.

— Famous last words

Page 44: Scaling ETL with Hadoop - Avoiding Failure


Result

Page 45: Scaling ETL with Hadoop - Avoiding Failure

Lessons:

• Use relational:
  • To maintain tool compatibility
  • For DWH enrichment
• Stay in Hadoop for:
  • Text search
  • Graph analysis
  • Reducing time in the pipeline
  • Big data & small network
  • A congested database

Page 46: Scaling ETL with Hadoop - Avoiding Failure

Data Loading Mistake #2

We used Sqoop to get data out of Oracle. Let's use Sqoop to get it back in.

— Famous last words

Page 47: Scaling ETL with Hadoop - Avoiding Failure


Result

Page 48: Scaling ETL with Hadoop - Avoiding Failure

Lesson

Use the Oracle direct connectors if you can afford them.

They:
1. Are faster than any alternative
2. Use Hadoop to make Oracle more efficient
3. Make *you* more efficient

(a sketch follows below)
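
For comparison, the generic export looks like this (names are hypothetical); the Oracle direct connectors keep the same shape but replace generic JDBC inserts with the database's bulk-load path:

    sqoop export \
      --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
      --username etl_user -P \
      --table DWH_ORDERS \
      --export-dir /etl/pharmacy/fraud/orders/validated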

Page 49: Scaling ETL with Hadoop - Avoiding Failure


Workflow Management

Page 50: Scaling ETL with Hadoop - Avoiding Failure

Tools

• Oozie
• Pentaho, Talend, ActiveBatch, AutoSys, Informatica, UC4, Cron

Page 51: Scaling ETL with Hadoop - Avoiding Failure

Workflow Mistake #1

Workflow management is easy. I'll just write a few scripts.

— Famous last words

Page 52: Scaling ETL with Hadoop - Avoiding Failure

“Writing a workflow engine is the software engineering equivalent of getting involved in a land war in Asia.”

— Josh Wills

Page 53: Scaling ETL with Hadoop - Avoiding Failure

Lesson:

A workflow management tool should enable:
• Keeping track of metadata, components and integrations
• Scheduling and orchestration
• Restarts and retries
• A cohesive system view
• Instrumentation, measurement and monitoring
• Reporting

(see the Oozie CLI sketch below)
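
With Oozie, for example, submitting, monitoring and rerunning are CLI one-liners rather than hand-rolled scripts (server URL, properties file and job id are made up):

    # Submit and start a workflow described by job.properties:
    oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run

    # Check status:
    oozie job -oozie http://oozie-host:11000/oozie -info 0000001-131101-oozie-W

    # Rerun failed parts without redoing finished actions (the properties
    # need oozie.wf.rerun.failnodes=true or a skip-nodes list):
    oozie job -oozie http://oozie-host:11000/oozie -rerun 0000001-131101-oozie-W -config job.properties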

Page 54: Scaling ETL with Hadoop - Avoiding Failure


Workflow Mistake #2

Schema? This is Hadoop. Why would we need a schema?

— Famous last words

Page 55: Scaling ETL with Hadoop - Avoiding Failure


Result

Page 56: Scaling ETL with Hadoop - Avoiding Failure

Lesson

What not to do:
/user/…
/user/gshapira/testdata/orders

A better convention:
/data/<database>/<table>/<partition>
/data/<biz unit>/<app>/<dataset>/<partition>
/data/pharmacy/fraud/orders/date=20131101

/etl/<biz unit>/<app>/<dataset>/<stage>
/etl/pharmacy/fraud/orders/validated
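
Ingestion scripts then stay trivial; a sketch using the example paths above:

    # One new directory per load, following the convention:
    hadoop fs -mkdir -p /data/pharmacy/fraud/orders/date=20131101
    hadoop fs -mkdir -p /etl/pharmacy/fraud/orders/validated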

Page 57: Scaling ETL with Hadoop - Avoiding Failure


Workflow Mistake #3

Oozie was written for Hadoop, so the right solution will always use Oozie

— Famous last words

Page 58: Scaling ETL with Hadoop - Avoiding Failure


Result

Page 59: Scaling ETL with Hadoop - Avoiding Failure

Lessons:

• Oozie has advantages
• Use the tool that works for you

Page 60: Scaling ETL with Hadoop - Avoiding Failure


Hue + Oozie

Page 61: Scaling ETL with Hadoop - Avoiding Failure

“I hope that in this year to come, you make mistakes.”

— Neil Gaiman

Page 62: Scaling ETL with Hadoop - Avoiding Failure

“Always Make New Mistakes”

— Esther Dyson

Page 63: Scaling ETL with Hadoop - Avoiding Failure


Page 64: Scaling ETL with Hadoop - Avoiding Failure

Should DBAs learn Hadoop?

• Hadoop projects are more visible
• 48% of Hadoop clusters are owned by the DWH team
• Big Data == business pays attention to data
• New skills – from coding to cluster administration
• Interesting projects

• No, you don’t need to learn Java

Page 65: Scaling ETL with Hadoop - Avoiding Failure

Beginner Projects

• Take a class
• Download a VM
• Install a 5-node Hadoop cluster in AWS
• Load data:
  • Complete works of Shakespeare
  • MovieLens database
• Find the 10 most common words in Shakespeare (see the pipeline sketch below)
• Find the 10 most recommended movies
• Run TPC-H
• Cloudera Data Science Challenge
• Actual use-case: XML ingestion, ETL process, DWH history
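
As a taste of how small the starter exercises can be: assuming the complete works landed under a hypothetical /data/shakespeare directory, the "10 most common words" exercise fits in one pipeline (fine at this data size, since it streams through the client):

    hadoop fs -cat /data/shakespeare/* \
      | tr '[:upper:]' '[:lower:]' \
      | tr -cs '[:alpha:]' '\n' \
      | sort | uniq -c | sort -rn | head -10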

Page 66: Scaling ETL with Hadoop - Avoiding Failure


Books

Page 67: Scaling ETL with Hadoop - Avoiding Failure


More Books