Simplified Data Management and Process Scheduling in Hadoop
TRANSCRIPT
Somebody Still Investigates
Do you think we can find the location and the owner of the “streams” dataset today?
Schema: STREAMS{trackId:long, userId:long, ts:timestamp, ...}
Location: hdfs://data/core/streams
Format: Avro
Owner: etl
Properties: official=>true, frequency=>hourly
Description: "userId started to stream trackId at time ts"
HCatalog way in Pig:

users = LOAD 'data.user' USING HCatLoader();

HCatalog way in Spark:

val users = hiveContext.hql(
  "FROM data.user SELECT name, country"
)

Non-HCatalog way in Pig:

users = LOAD '/data/core/user/part-00000.avro' USING AvroStorage();
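The key difference between the two Pig snippets is indirection: with HCatalog a job asks for a table by name and the catalog resolves the path, format, and schema; without it, every job hard-codes those details. A minimal Python sketch of that lookup idea (not the actual HCatalog API; the table entry below is invented for illustration):

```python
# Illustrative sketch of a metadata catalog: a table name resolves to
# storage details, so jobs reference "data.user" instead of hard-coding
# a path like /data/core/user/part-00000.avro and its format.

CATALOG = {
    "data.user": {
        "location": "/data/core/user",
        "format": "avro",
        "schema": ["id", "name", "country", "gender"],
    },
}

def load(table_name):
    """Resolve a table name to its storage details, as a catalog loader would."""
    entry = CATALOG.get(table_name)
    if entry is None:
        raise KeyError(f"table not registered: {table_name}")
    return entry

info = load("data.user")
print(info["location"])  # the path is owned by the catalog, not the job
print(info["format"])
```

If the dataset later moves or changes format, only the catalog entry changes; the jobs that reference the table name keep working.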
ID | NAME | COUNTRY | GENDER
1  | JOSH | US      | M
2  | ADAM | PL      | M
The picture comes from http://hortonworks.com/blog/introduction-apache-falcon-hadoop. Thanks Hortonworks!
Falcon’s Adoption
■ Top Level Project since December 2014
■ 14 contributors from 3 companies
■ Originated and heavily used at InMobi
● 400+ pipelines and 2000+ data feeds
■ Also used at Expedia and at some undisclosed companies
Future Enhancements And Ideas
■ Improved Web UI [FALCON-790]
● More extensive search box, more widgets
● The “today morning” dashboard [FALCON-994]
● Re-running processes
■ Automatic discovery of datasets in HDFS and Hive
■ Streaming feeds and processes, e.g. Storm, Spark Streaming
■ Triage of data processing issues [FALCON-796]
■ HDFS snapshots
■ High availability of the Falcon server