Simplified Data and Process Scheduling in Hadoop

Upload: getindata

Post on 16-Jul-2015


TRANSCRIPT

Simplified Data and Process Scheduling in Hadoop

Somebody Still Investigates

Do you think we can find the location and the owner of the “streams” dataset today?

STREAMS{trackId:long, userId:long, ts:timestamp, ...}

Location: hdfs://data/core/streams
Format: avro
Owner: etl
Properties: official=>true, frequency=>hourly
Description: "UserId started to stream trackId at time ts"
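In Falcon, metadata like this (location, format, owner, properties, description) can all live in a single feed entity definition. A hedged sketch of such a feed, where the cluster name, validity window, retention policy, path layout, and schema location are illustrative assumptions rather than the exact entity from the talk:

```xml
<feed name="streams"
      description="UserId started to stream trackId at time ts"
      xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primary" type="source">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="months(6)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data"
              path="/data/core/streams/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
  </locations>
  <ACL owner="etl" group="users" permission="0755"/>
  <schema location="/data/core/streams.avsc" provider="avro"/>
  <properties>
    <property name="official" value="true"/>
  </properties>
</feed>
```

With a definition like this registered in Falcon, the question "where is the streams dataset and who owns it?" becomes a lookup rather than an investigation.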

HCatalog way in Pig:

users = LOAD 'data.user' USING HCatLoader();

HCatalog way in Spark:

val users = hiveContext.hql("FROM data.user SELECT name, country")

Non-HCatalog way in Pig:

users = LOAD '/data/core/user/part-00000.avro' USING AvroStorage();

ID  NAME  COUNTRY  GENDER
1   JOSH  US       M
2   ADAM  PL       M

Switching to ORC requires reimplementing the reader code in hundreds of production jobs...

...unless the jobs read through HCatalog, which hides the underlying storage format (e.g. ORC):

users = LOAD 'data.users' USING HCatLoader();
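As a hedged sketch of why HCatalog makes this cheap (table names are illustrative): the Avro-to-ORC switch could be done once at the Hive/HCatalog layer, while every Pig or Spark job that loads the table through HCatalog keeps working unchanged.

```sql
-- Hypothetical one-time migration, done by the data owner:
-- rewrite the table as ORC, then swap it in under the old name.
CREATE TABLE data.users_orc STORED AS ORC
  AS SELECT * FROM data.users;

ALTER TABLE data.users RENAME TO data.users_avro;
ALTER TABLE data.users_orc RENAME TO data.users;
```

Jobs reading raw part files with AvroStorage(), by contrast, would all need their load statements and paths rewritten.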

The picture comes from http://hortonworks.com/blog/introduction-apache-falcon-hadoop. Thanks Hortonworks!

Raw Data → Cleansed Data → Conformed Data → Presented Data

Raw Data → Presented Data

Which Elephant Is Yours?

A. Elephantus Dirtus

B. Elephantus Cleanus

Backup Slides

Falcon’s Adoption

■ Top Level Project since December 2014
■ 14 contributors from 3 companies
■ Originated and heavily used at inMobi
  ● 400+ pipelines and 2000+ data feeds
■ Also used at Expedia and at some undisclosed companies

Future Enhancements And Ideas

■ Improved Web UI [FALCON-790]
  ● More extensive search box, more widgets
  ● The “today morning” dashboard [FALCON-994]
  ● Re-running processes

■ Automatic discovery of datasets in HDFS and Hive
■ Streaming feeds and processes, e.g. Storm, Spark Streaming
■ Triage of data processing issues [FALCON-796]
■ HDFS snapshots
■ High availability of the Falcon server
