Scaling ETL with Hadoop - Avoiding Failure


1

Scaling ETL with Hadoop
Gwen Shapira
@gwenshap | gshapira@cloudera.com

Coming soon to a bookstore near you…

• Hadoop Application Architectures

How to build end-to-end solutions using Apache Hadoop and related tools

@hadooparchbook | www.hadooparchitecturebook.com

3

ETL is…

• Extracting data from outside sources
• Transforming it to fit operational needs
• Loading it into the end target

• (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)

4

Hadoop Is…

• HDFS – Massive, redundant data storage
• MapReduce – Batch-oriented data processing at scale
• Many, many ways to process data in parallel at scale

5

The Ecosystem

• High-level languages and abstractions
• File, relational and streaming data integration
• Process orchestration and scheduling
• Libraries for data wrangling
• Low-latency query languages

6

Why ETL with Hadoop?

Data Has Changed in the Last 30 Years

[Chart: data growth from 1980 to 2013, driven by end-user applications, the internet, mobile devices and sophisticated machines. Structured data now accounts for roughly 10% of the total; unstructured data for roughly 90%.]

Volume, Variety, Velocity Cause Problems

8

[Diagram: the traditional pipeline. OLTP systems and enterprise applications feed Extract, Transform and Load jobs into the data warehouse, with further transformation inside the DWH, which then serves business intelligence queries.]

1. Slow data transformations. Missed SLAs.
2. Slow queries. Frustrated business and IT.
3. Must archive. Archived data can’t provide value.

9

Got unstructured data?

• Traditional ETL:
  • Text
  • CSV
  • XLS
  • XML

• Hadoop:
  • HTML
  • XML, RSS
  • JSON
  • Apache logs
  • Avro, Protocol Buffers, ORC, Parquet
  • Compressed formats
  • Office, OpenDocument, iWork
  • PDF, EPUB, RTF
  • MIDI, MP3
  • JPEG, TIFF
  • Java classes
  • Mbox, RFC 822
  • AutoCAD
  • TrueType
  • HDF / NetCDF

10

What is Apache Hadoop?

Apache Hadoop is an open source platform for data storage and processing that is distributed, fault tolerant and scalable.

Core Hadoop system components:
• Hadoop Distributed File System (HDFS) – self-healing, high-bandwidth clustered storage
• MapReduce – distributed computing framework

Has the flexibility to store and mine any type of data:
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema

Excels at processing complex data:
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks

Scales economically:
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in

11

What I often see

ETL Cluster

ELT in DWH

ETL in Hadoop

12

Moving your transformations from the DWH to Hadoop?

Let’s do it right.

13

Best Practices

Arup Nanda taught me to ask:
1. Why is it better than the rest?
2. What happens if it is not followed?
3. When is it not applicable?

14

Or at least, let’s avoid the worst mistakes.

15

Extract

16

Let me count the ways

1. From Databases: Sqoop
2. Log Data: Flume
3. Copy data to HDFS

17

Data Loading Mistake #1

Hadoop is scalable. Let’s run as many Sqoop mappers as possible, to get the data from our DB faster!

— Famous last words

18

Result:

19

Lesson:

• Start with 2 mappers, add slowly (see the sketch below)
• Watch DB load and network utilization
• Use the FairScheduler to limit the number of mappers
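For illustration only (the connection string, credentials and paths below are made up, not from the deck), a conservative first Sqoop import might look like this:

    # Hypothetical example: pull one table with only 2 parallel mappers so the
    # source database and the network are not overwhelmed.
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table orders \
      --num-mappers 2 \
      --target-dir /etl/sales/orders/raw

Raise --num-mappers gradually while watching database load; a FairScheduler queue with a modest cap gives you an additional guard rail.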

20

Data Loading Mistake #2

Database-specific connectors are complicated and scary. Let’s just use the default JDBC connector.

— Famous last words

21

Result:

22

Lesson:

1. There are dedicated connectors for Oracle, Netezza and Teradata
2. Download them
3. Read the documentation
4. Ask questions if something is not clear
5. Follow the installation instructions
6. Use Sqoop with the connectors (see the sketch below)
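A hedged sketch (the JDBC URL and table are invented): once a database-specific connector is installed, recent Sqoop releases activate it through direct mode instead of the generic JDBC path:

    # Hypothetical example: import an Oracle table through the dedicated
    # connector (--direct) rather than the generic JDBC code path.
    sqoop import \
      --direct \
      --connect jdbc:oracle:thin:@//orahost:1521/ORCL \
      --username ETL_USER -P \
      --table SALES.ORDERS \
      --num-mappers 4 \
      --target-dir /etl/sales/orders/raw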

23

Data Loading Mistake #3

Just copying files? This sounds too simple. We probably need some cool whizzbang tool.

— Famous last words

24

Result

25

Lessons:

• Copying files is a legitimate solution (see the sketch below)
• In general, simple is good
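A minimal sketch of the “just copy the files” approach (paths are hypothetical):

    # Hypothetical example: land extract files in HDFS exactly as they arrived.
    hdfs dfs -mkdir -p /etl/sales/orders/raw
    hdfs dfs -put /staging/orders/*.csv /etl/sales/orders/raw/

    # Between clusters, plain DistCp copies files at scale with no extra tooling.
    hadoop distcp hdfs://clusterA/etl/sales/orders hdfs://clusterB/etl/sales/orders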

26

Transform

27

Endless Possibilities

• MapReduce
• Crunch / Cascading
• Spark
• Hive (i.e. SQL)
• Pig
• R
• Shell scripts
• Plain old Java

28

Data Processing Mistake #0

29

Data Processing Mistake #1

This system must be ready in 12 months. We have to convert 100 data sources and 5000 transformations to Hadoop. Let’s spend 2 days planning a schedule and budget for the entire year and then just go and implement it.

Prototype? Who needs that?

— Famous last words

30

Result

31

Lessons

• Take the learning curve into account
• You don’t know what you don’t know
• Hadoop will be difficult and frustrating for at least 3 months

32

Data Processing Mistake #2

Hadoop is all about MapReduce. So I’ll use MapReduce for all my data processing needs.

— Famous last words

33

Result:

34

Lessons:

MapReduce is the assembly language of Hadoop:

Simple things are hard. Hard things are possible.
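To make the point concrete (a sketch only; the table and column names are invented), a grouped count that would take a full mapper, reducer and driver in Java MapReduce is a single statement in a higher-level tool such as Hive:

    # Hypothetical example: one line of Hive instead of a MapReduce program.
    hive -e "SELECT product_id, COUNT(*) AS cnt FROM orders GROUP BY product_id;"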

35

Data Processing Mistake #3

I got 5000 tiny XMLs, and Hadoop is great at processing unstructured data. So I’ll just leave the data like that and parse the XML in every job.

— Famous last words

36

Result

37

Lessons

1. Consolidate small files
2. Don’t argue about #1
3. Convert files to easy-to-query formats (see the sketch below)
4. De-normalize
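One hedged way to do both the consolidation and the format conversion (table names are hypothetical, and CREATE TABLE AS SELECT into Parquet assumes a reasonably recent Hive) is to rewrite the many small raw files as one compact columnar table:

    # Hypothetical example: compact thousands of small staging files into
    # a Parquet table that is cheap to scan.
    hive -e "CREATE TABLE orders_parquet STORED AS PARQUET
             AS SELECT * FROM orders_raw_staging;"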

38

Data Processing Mistake #4

Partitions are for relational databases

— Famous last words

39

Result

40

Lessons

1. Without partitions every query is a full table scan
2. Yes, Hadoop scans fast
3. But the fastest read is the one you never perform
4. Cheap storage allows you to store the same dataset partitioned multiple ways
5. Use partitions for fast data loading (see the sketch below)
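A minimal sketch of a date-partitioned table (all names invented); each daily load lands in its own partition, so queries that filter on the date never touch the rest of the data:

    # Hypothetical example: a Hive table partitioned by load date.
    hive -e "CREATE TABLE orders (
               order_id    BIGINT,
               customer_id BIGINT,
               amount      DECIMAL(10,2)
             )
             PARTITIONED BY (load_date STRING);"

    # Load one day's files into one partition (the files are assumed to already
    # match the table's delimited text format).
    hive -e "LOAD DATA INPATH '/etl/sales/orders/2013-11-01'
             INTO TABLE orders PARTITION (load_date='2013-11-01');"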

41

Load

42

Technologies

• Sqoop
• Fuse-DFS
• Oracle connectors
• Just copy files
• Query Hadoop

43

Data Loading Mistake #1

All of the data must end up in a relational DWH.

— Famous last words

44

Result

45

Lessons:

• Use relational:
  • To maintain tool compatibility
  • For DWH enrichment

• Stay in Hadoop for:
  • Text search
  • Graph analysis
  • Reduced time in the pipeline
  • Big data & a small network
  • A congested database

46

Data Loading Mistake #2

We used Sqoop to get data out of Oracle. Let’s use Sqoop to get it back in.

— Famous last words

47

Result

48

Lesson

Use Oracle direct connectors if you can afford them.

They:
1. Are faster than any alternative
2. Use Hadoop to make Oracle more efficient
3. Make *you* more efficient

49

Workflow Management

50

Tools

• Oozie
• Pentaho, Talend, ActiveBatch, AutoSys, Informatica, UC4, cron

51

Workflow Mistake #1

Workflow management is easy. I’ll just write a few scripts.

— Famous last words

52

“Writing a workflow engine is the software engineering equivalent of getting involved in a land war in Asia.”
— Josh Wills

53

Lesson:

A workflow management tool should enable:
• Keeping track of metadata, components and integrations
• Scheduling and orchestration
• Restarts and retries
• A cohesive system view
• Instrumentation, measurement and monitoring
• Reporting

54

Workflow Mistake #2

Schema? This is Hadoop. Why would we need a schema?

— Famous last words

55

Result

56

Lesson

/user/<username>/… – personal sandboxes, e.g. /user/gshapira/testdata/orders

/data/<database>/<table>/<partition>
/data/<biz unit>/<app>/<dataset>/<partition>
e.g. /data/pharmacy/fraud/orders/date=20131101

/etl/<biz unit>/<app>/<dataset>/<stage>
e.g. /etl/pharmacy/fraud/orders/validated
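As a small sketch (the business unit, app and dataset names are taken from the example paths above), the convention is nothing more than directories created up front:

    # Hypothetical example: create the landing area and one ETL stage
    # for a dataset before any job runs.
    hdfs dfs -mkdir -p /data/pharmacy/fraud/orders
    hdfs dfs -mkdir -p /etl/pharmacy/fraud/orders/validated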

57

Workflow Mistake #3

Oozie was written for Hadoop, so the right solution will always use Oozie

— Famous last words

58

Result

59

Lessons:

• Oozie has advantages
• Use the tool that works for you

60

Hue + Oozie

61

“I hope that in this year to come, you make mistakes.”
— Neil Gaiman

62

“Always Make New Mistakes”
— Esther Dyson

63

64

Should DBAs learn Hadoop?

• Hadoop projects are more visible
• 48% of Hadoop clusters are owned by the DWH team
• Big Data == business pays attention to data
• New skills – from coding to cluster administration
• Interesting projects

• No, you don’t need to learn Java

65

Beginner Projects

• Take a class
• Download a VM
• Install a 5-node Hadoop cluster in AWS
• Load data:
  • The complete works of Shakespeare
  • The MovieLens database
• Find the 10 most common words in Shakespeare (see the sketch below)
• Find the 10 most recommended movies
• Run TPC-H
• Cloudera Data Science Challenge
• An actual use case: XML ingestion, ETL process, DWH history
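As a toy illustration of the Shakespeare exercise (the input path is made up, and this runs as a single local pipeline rather than a parallel job):

    # Hypothetical example: the 10 most common words in files stored on HDFS.
    hdfs dfs -cat /user/gshapira/shakespeare/*.txt \
      | tr -cs '[:alpha:]' '\n' \
      | tr '[:upper:]' '[:lower:]' \
      | sort | uniq -c | sort -rn | head -10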

66

Books

67

More Books
