Big Data Ingestion with Kafka -> HDFS using Apache Apex

Chinmay Kolhatkar

Upload: apache-apex

Post on 08-Jan-2017

Agenda

● Data Ingestion
● Use case: Kafka => HDFS
● Brief about Kafka
● Steps for development
● Let’s code!!!


Data Ingestion

● Reading data in

● Storing it in an accessible location

● This is the beginning of the data pipeline (the write path)

● From here, the data is processed further (the read path)

Use case: Kafka => HDFS

● Reading from Kafka Messaging Queue

● Writing to HDFS

[Diagram: Kafka => HDFS]

Use case: Examples

● Log Aggregation
  ○ Collect logs from various sources
  ○ Stream them as a single topic
  ○ Put all the logs in a centralized place, i.e. HDFS

● Real-time sensor data processing
  ○ Read sensor data from various sources
  ○ Process the stream
  ○ Dump the results to HDFS

Brief about Kafka

● Distributed Messaging System

● Fast Reads and Writes

● Can handle a large number of clients

● Scalable, fault-tolerant, partitionable

● Persistent messages

Brief about Kafka (contd.)

● Terminologies
  ○ Topic
  ○ Producer
  ○ Consumer
  ○ Broker
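
To make the four terms concrete, here is a minimal in-memory sketch in Java. It is not the Kafka client API: `ToyBroker` and `ToyConsumer` are hypothetical names invented for illustration, and the model assumes a single process with no partitions or replication. It only shows the roles: a producer appends messages to a named topic held by a broker, and each consumer reads from its own offset, so reading does not remove messages (the "persistent messages" idea).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory stand-in for a broker. Hypothetical; NOT the Kafka API. */
final class ToyBroker {
    // Each topic is an append-only log; reads never delete messages.
    private final Map<String, List<String>> topics = new HashMap<>();

    /** Producer side: append a message and return its offset in the topic. */
    synchronized long publish(String topic, String message) {
        List<String> log = topics.computeIfAbsent(topic, t -> new ArrayList<>());
        log.add(message);
        return log.size() - 1;
    }

    /** Consumer side: read up to maxMessages starting at the given offset. */
    synchronized List<String> fetch(String topic, long offset, int maxMessages) {
        List<String> log = topics.getOrDefault(topic, new ArrayList<>());
        int from = (int) Math.min(Math.max(offset, 0), log.size());
        int to = (int) Math.min((long) from + Math.max(maxMessages, 0), log.size());
        return new ArrayList<>(log.subList(from, to));
    }
}

/** A consumer keeps its own position (offset) in one topic. */
final class ToyConsumer {
    private final ToyBroker broker;
    private final String topic;
    private long nextOffset = 0;

    ToyConsumer(ToyBroker broker, String topic) {
        this.broker = broker;
        this.topic = topic;
    }

    /** Returns the next batch and advances this consumer's offset. */
    List<String> poll(int maxMessages) {
        List<String> batch = broker.fetch(topic, nextOffset, maxMessages);
        nextOffset += batch.size();
        return batch;
    }
}
```

Because the offset lives in the consumer rather than in the topic, two `ToyConsumer` instances on the same topic each see every message independently.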

Steps for developing the application

1. Create a Maven project using the Apex Maven archetype
2. Add the required Maven dependencies
3. Add operators to the DAG
4. Add stream(s) to the DAG
5. Set properties in properties.xml
6. Compile and run
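
Steps 3 and 4 above can be illustrated with a small, self-contained sketch. This is plain Java rather than the Apex API: `ToyDag`, `IngestionSketch`, and the operator names `kafkaInput` and `fileOutput` are hypothetical, an in-memory queue stands in for the Kafka topic, and a local file stands in for HDFS. In a real Apex application the same wiring would be done against the Apex `DAG` object with Malhar operators; the sketch only shows the shape of the idea, which is that operators are registered by name and then connected by a named stream.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Toy DAG: named source and sink operators joined by named streams. Hypothetical. */
final class ToyDag {
    private final Map<String, Supplier<List<String>>> sources = new LinkedHashMap<>();
    private final Map<String, Consumer<String>> sinks = new LinkedHashMap<>();
    private final List<String[]> streams = new ArrayList<>(); // {stream, from, to}

    void addSource(String name, Supplier<List<String>> op) { sources.put(name, op); }

    void addSink(String name, Consumer<String> op) { sinks.put(name, op); }

    void addStream(String streamName, String from, String to) {
        if (!sources.containsKey(from) || !sinks.containsKey(to)) {
            throw new IllegalArgumentException("Unknown operator in stream " + streamName);
        }
        streams.add(new String[] {streamName, from, to});
    }

    /** One pass: pull a batch from each source, push every tuple downstream. */
    int runOnce() {
        int moved = 0;
        for (String[] s : streams) {
            Consumer<String> sink = sinks.get(s[2]);
            for (String tuple : sources.get(s[1]).get()) {
                sink.accept(tuple);
                moved++;
            }
        }
        return moved;
    }
}

final class IngestionSketch {
    /** Wires a queue (Kafka stand-in) to a local file (HDFS stand-in). */
    static ToyDag build(Queue<String> kafkaStandIn, Path outputFile) {
        ToyDag dag = new ToyDag();
        // Step 3: add operators to the DAG.
        dag.addSource("kafkaInput", () -> {
            List<String> batch = new ArrayList<>();
            String message;
            while ((message = kafkaStandIn.poll()) != null) {
                batch.add(message);
            }
            return batch;
        });
        dag.addSink("fileOutput", line -> append(outputFile, line));
        // Step 4: add a stream connecting the two operators.
        dag.addStream("messages", "kafkaInput", "fileOutput");
        return dag;
    }

    /** Appends one line to the file, creating the file if needed. */
    static void append(Path file, String line) {
        try {
            Files.write(file, (line + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for inspecting the output file. */
    static List<String> readAll(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `IngestionSketch.build(queue, path).runOnce()` drains whatever is in the queue into the file, one message per line. Things a real deployment handles, such as checkpointing, partitioning, and file rolling, are deliberately left out.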


Summary

● Ease of development using Apex

● Reusable Malhar components

● Fault-tolerant, scalable

● Reduced time to production


Resources

Apache Apex Meetup

• Apache Apex website - http://apex.incubator.apache.org/

• Subscribe - http://apex.incubator.apache.org/community.html

• Download - http://apex.incubator.apache.org/downloads.html

• Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex

• Facebook - https://www.facebook.com/ApacheApex/

• Meetup - http://www.meetup.com/topics/apache-apex

• Startup Program – Free Enterprise License for startups, Universities, Non-Profits

Upcoming events...

Apache Apex Meetup

• April 12th 9am PST - Fault Tolerance and Processing Semantics with Apache Apex

• March 28th 6pm PST - Low-latency ingestion and analytics with Apache Kafka and Apache Apex (Hadoop)

• ...