Scaling ETL with Hadoop - Avoiding Failure


1

Scaling ETL with Hadoop
Gwen Shapira
@gwenshap | gshapira@cloudera.com

Coming soon to a bookstore near you…

• Hadoop Application Architectures

How to build end-to-end solutions using Apache Hadoop and related tools

@hadooparchbook | www.hadooparchitecturebook.com

3

ETL is…

• Extracting data from outside sources
• Transforming it to fit operational needs
• Loading it into the end target

• (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)

4

Hadoop Is…

• HDFS – Massive, redundant data storage
• MapReduce – Batch-oriented data processing at scale
• Many, many ways to process data in parallel at scale

5

The Ecosystem

• High-level languages and abstractions
• File, relational and streaming data integration
• Process orchestration and scheduling
• Libraries for data wrangling
• Low-latency query languages

6

Why ETL with Hadoop?

Data Has Changed in the Last 30 Years

[Chart: data growth from 1980 to 2013, driven by end-user applications, the internet, mobile devices and sophisticated machines. Structured data now accounts for roughly 10% of the total; unstructured data for roughly 90%.]

Volume, Variety, Velocity Cause Problems

8

[Diagram: the traditional pipeline. OLTP systems and enterprise applications feed Extract, Transform and Load jobs into the data warehouse, with further transformation inside the DWH, which then serves business intelligence queries.]

1. Slow data transformations. Missed SLAs.
2. Slow queries. Frustrated business and IT.
3. Must archive. Archived data can’t provide value.

9

Got unstructured data?

• Traditional ETL:
  • Text
  • CSV
  • XLS
  • XML

• Hadoop:
  • HTML
  • XML, RSS
  • JSON
  • Apache logs
  • Avro, Protocol Buffers, ORC, Parquet
  • Compressed formats
  • Office, OpenDocument, iWork
  • PDF, EPUB, RTF
  • MIDI, MP3
  • JPEG, TIFF
  • Java classes
  • Mbox, RFC 822
  • AutoCAD
  • TrueType
  • HDF / NetCDF

10

What is Apache Hadoop?

Apache Hadoop is an open source platform for data storage and processing that is distributed, fault tolerant and scalable.

Core Hadoop system components:
• Hadoop Distributed File System (HDFS) – self-healing, high-bandwidth clustered storage
• MapReduce – distributed computing framework

Has the flexibility to store and mine any type of data:
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema

Excels at processing complex data:
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks

Scales economically:
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in

11

What I often see

ETL Cluster

ELT in DWH

ETL in Hadoop

12

Moving your transformations from the DWH to Hadoop?

Let’s do it right.

13

Best Practices

Arup Nanda taught me to ask:
1. Why is it better than the rest?
2. What happens if it is not followed?
3. When is it not applicable?

14

Or at least, let’s avoid the worst mistakes.

15

Extract

16

Let me count the ways

1. From Databases: Sqoop
2. Log Data: Flume
3. Copy data to HDFS

17

Data Loading Mistake #1

Hadoop is scalable. Let’s run as many Sqoop mappers as possible, to get the data from our DB faster!

— Famous last words

18

Result:

19

Lesson:

• Start with 2 mappers, add slowly (see the sketch below)
• Watch DB load and network utilization
• Use the FairScheduler to limit the number of mappers
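For illustration only (the connection string, credentials and paths below are made up, not from the deck), a conservative first Sqoop import might look like this:

    # Hypothetical example: pull one table with only 2 parallel mappers so the
    # source database and the network are not overwhelmed.
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table orders \
      --num-mappers 2 \
      --target-dir /etl/sales/orders/raw

Raise --num-mappers gradually while watching database load; a FairScheduler queue with a modest cap gives you an additional guard rail.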

20

Data Loading Mistake #2

Database-specific connectors are complicated and scary. Let’s just use the default JDBC connector.

— Famous last words

21

Result:

22

Lesson:

1. There are dedicated connectors for Oracle, Netezza and Teradata
2. Download them
3. Read the documentation
4. Ask questions if something is not clear
5. Follow the installation instructions
6. Use Sqoop with the connectors (see the sketch below)
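A hedged sketch (the JDBC URL and table are invented): once a database-specific connector is installed, recent Sqoop releases activate it through direct mode instead of the generic JDBC path:

    # Hypothetical example: import an Oracle table through the dedicated
    # connector (--direct) rather than the generic JDBC code path.
    sqoop import \
      --direct \
      --connect jdbc:oracle:thin:@//orahost:1521/ORCL \
      --username ETL_USER -P \
      --table SALES.ORDERS \
      --num-mappers 4 \
      --target-dir /etl/sales/orders/raw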

23

Data Loading Mistake #3

Just copying files? This sounds too simple. We probably need some cool whizzbang tool.

— Famous last words

24

Result

25

Lessons:

• Copying files is a legitimate solution (see the sketch below)
• In general, simple is good
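A minimal sketch of the “just copy the files” approach (paths are hypothetical):

    # Hypothetical example: land extract files in HDFS exactly as they arrived.
    hdfs dfs -mkdir -p /etl/sales/orders/raw
    hdfs dfs -put /staging/orders/*.csv /etl/sales/orders/raw/

    # Between clusters, plain DistCp copies files at scale with no extra tooling.
    hadoop distcp hdfs://clusterA/etl/sales/orders hdfs://clusterB/etl/sales/orders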

26

Transform

27

Endless Possibilities

• MapReduce
• Crunch / Cascading
• Spark
• Hive (i.e. SQL)
• Pig
• R
• Shell scripts
• Plain old Java

28

Data Processing Mistake #0

29

Data Processing Mistake #1

This system must be ready in 12 months. We have to convert 100 data sources and 5000 transformations to Hadoop. Let’s spend 2 days planning a schedule and budget for the entire year and then just go and implement it.

Prototype? Who needs that?

— Famous last words

30

Result

31

Lessons

• Take the learning curve into account
• You don’t know what you don’t know
• Hadoop will be difficult and frustrating for at least 3 months

32

Data Processing Mistake #2

Hadoop is all about MapReduce. So I’ll use MapReduce for all my data processing needs.

— Famous last words

33

Result:

34

Lessons:

MapReduce is the assembly language of Hadoop:

Simple things are hard. Hard things are possible.
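To make the point concrete (a sketch only; the table and column names are invented), a grouped count that would take a full mapper, reducer and driver in Java MapReduce is a single statement in a higher-level tool such as Hive:

    # Hypothetical example: one line of Hive instead of a MapReduce program.
    hive -e "SELECT product_id, COUNT(*) AS cnt FROM orders GROUP BY product_id;"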

35

Data Processing Mistake #3

I got 5000 tiny XMLs, and Hadoop is great at processing unstructured data. So I’ll just leave the data like that and parse the XML in every job.

— Famous last words

36

Result

37

Lessons

1. Consolidate small files
2. Don’t argue about #1
3. Convert files to easy-to-query formats (see the sketch below)
4. De-normalize
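One hedged way to do both the consolidation and the format conversion (table names are hypothetical, and CREATE TABLE AS SELECT into Parquet assumes a reasonably recent Hive) is to rewrite the many small raw files as one compact columnar table:

    # Hypothetical example: compact thousands of small staging files into
    # a Parquet table that is cheap to scan.
    hive -e "CREATE TABLE orders_parquet STORED AS PARQUET
             AS SELECT * FROM orders_raw_staging;"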

38

Data Processing Mistake #4

Partitions are for relational databases

— Famous last words

39

Result

40

Lessons

1. Without partitions every query is a full table scan
2. Yes, Hadoop scans fast
3. But the fastest read is the one you never perform
4. Cheap storage allows you to store the same dataset partitioned multiple ways
5. Use partitions for fast data loading (see the sketch below)
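A minimal sketch of a date-partitioned table (all names invented); each daily load lands in its own partition, so queries that filter on the date never touch the rest of the data:

    # Hypothetical example: a Hive table partitioned by load date.
    hive -e "CREATE TABLE orders (
               order_id    BIGINT,
               customer_id BIGINT,
               amount      DECIMAL(10,2)
             )
             PARTITIONED BY (load_date STRING);"

    # Load one day's files into one partition (the files are assumed to already
    # match the table's delimited text format).
    hive -e "LOAD DATA INPATH '/etl/sales/orders/2013-11-01'
             INTO TABLE orders PARTITION (load_date='2013-11-01');"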

41

Load

42

Technologies

• Sqoop
• Fuse-DFS
• Oracle connectors
• Just copy files
• Query Hadoop

43

Data Loading Mistake #1

All of the data must end up in a relational DWH.

— Famous last words

44

Result

45

Lessons:

• Use relational:
  • To maintain tool compatibility
  • For DWH enrichment

• Stay in Hadoop for:
  • Text search
  • Graph analysis
  • Reduced time in the pipeline
  • Big data & a small network
  • A congested database

46

Data Loading Mistake #2

We used Sqoop to get data out of Oracle. Let’s use Sqoop to get it back in.

— Famous last words

47

Result

48

Lesson

Use Oracle direct connectors if you can afford them.

They:
1. Are faster than any alternative
2. Use Hadoop to make Oracle more efficient
3. Make *you* more efficient

49

Workflow Management

50

Tools

• Oozie
• Pentaho, Talend, ActiveBatch, AutoSys, Informatica, UC4, cron

51

Workflow Mistake #1

Workflow management is easy. I’ll just write a few scripts.

— Famous last words

52

“Writing a workflow engine is the software engineering equivalent of getting involved in a land war in Asia.”
— Josh Wills

53

Lesson:

A workflow management tool should enable:
• Keeping track of metadata, components and integrations
• Scheduling and orchestration
• Restarts and retries
• A cohesive system view
• Instrumentation, measurement and monitoring
• Reporting

54

Workflow Mistake #2

Schema? This is Hadoop. Why would we need a schema?

— Famous last words

55

Result

56

Lesson

/user/<username>/… – personal sandboxes, e.g. /user/gshapira/testdata/orders

/data/<database>/<table>/<partition>
/data/<biz unit>/<app>/<dataset>/<partition>
e.g. /data/pharmacy/fraud/orders/date=20131101

/etl/<biz unit>/<app>/<dataset>/<stage>
e.g. /etl/pharmacy/fraud/orders/validated
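As a small sketch (the business unit, app and dataset names are taken from the example paths above), the convention is nothing more than directories created up front:

    # Hypothetical example: create the landing area and one ETL stage
    # for a dataset before any job runs.
    hdfs dfs -mkdir -p /data/pharmacy/fraud/orders
    hdfs dfs -mkdir -p /etl/pharmacy/fraud/orders/validated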

57

Workflow Mistake #3

Oozie was written for Hadoop, so the right solution will always use Oozie

— Famous last words

58

Result

59

Lessons:

• Oozie has advantages
• Use the tool that works for you

60

Hue + Oozie

61

“I hope that in this year to come, you make mistakes.”
— Neil Gaiman

62

“Always Make New Mistakes”
— Esther Dyson

63

64

Should DBAs learn Hadoop?

• Hadoop projects are more visible
• 48% of Hadoop clusters are owned by the DWH team
• Big Data == business pays attention to data
• New skills – from coding to cluster administration
• Interesting projects

• No, you don’t need to learn Java

65

Beginner Projects

• Take a class
• Download a VM
• Install a 5-node Hadoop cluster in AWS
• Load data:
  • The complete works of Shakespeare
  • The MovieLens database
• Find the 10 most common words in Shakespeare (see the sketch below)
• Find the 10 most recommended movies
• Run TPC-H
• Cloudera Data Science Challenge
• An actual use case: XML ingestion, ETL process, DWH history
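As a toy illustration of the Shakespeare exercise (the input path is made up, and this runs as a single local pipeline rather than a parallel job):

    # Hypothetical example: the 10 most common words in files stored on HDFS.
    hdfs dfs -cat /user/gshapira/shakespeare/*.txt \
      | tr -cs '[:alpha:]' '\n' \
      | tr '[:upper:]' '[:lower:]' \
      | sort | uniq -c | sort -rn | head -10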

66

Books

67

More Books
