Kafka to Hadoop Ingest with Parsing, Dedup, and Other Big Data Transformations

Chaitanya Chebolu
Committer, Apache Apex
Engineer, DataTorrent
Sep 14, 2016

Data Ingestion - Kafka ETL

Agenda

• Introduction to Apache Apex (Architecture, Application, Native Hadoop Integration)
• What is Data Ingestion
• Use Case: Kafka ETL
• Brief about Kafka
• Kafka ETL App
• Kafka ETL Demo

Apache Apex

• Platform and runtime engine that enables development of scalable and fault-tolerant distributed applications
• Hadoop native (Hadoop >= 2.2)
  ᵒ No separate service to manage stream processing
  ᵒ Streaming engine built into the Application Master and containers
• Process streaming or batch big data
• High throughput and low latency
• Library of commonly needed business logic
• Write any custom business logic in your application

Apex Architecture

An Apex application is a DAG (Directed Acyclic Graph)

• A DAG is composed of vertices (Operators) and edges (Streams).
• A Stream is a sequence of data tuples which connects operators at end-points called Ports.
• An Operator takes one or more input streams, performs computations, and emits one or more output streams.
• Each operator is the user's business logic, or a built-in operator from our open source library.
• An operator may have multiple instances that run in parallel.
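
To make the DAG model concrete, here is a minimal sketch of an Apex application in Java. It wires two toy operators (a number generator and a console logger, both hypothetical classes written only for this illustration) into a DAG through the core `populateDAG` API; `StreamingApplication`, `DAG`, `BaseOperator`, `InputOperator`, and the default port classes are from the Apex API, everything else is made up for the example.

```java
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.common.util.BaseOperator;

@ApplicationAnnotation(name = "DagExample")
public class DagExample implements StreamingApplication
{
  // Hypothetical source operator: emits an increasing counter.
  public static class NumberGenerator extends BaseOperator implements InputOperator
  {
    public final transient DefaultOutputPort<Long> out = new DefaultOutputPort<>();
    private long count;

    @Override
    public void emitTuples()
    {
      out.emit(count++);
    }
  }

  // Hypothetical sink operator: logs every tuple it receives.
  public static class ConsoleLogger extends BaseOperator
  {
    public final transient DefaultInputPort<Long> in = new DefaultInputPort<Long>()
    {
      @Override
      public void process(Long tuple)
      {
        System.out.println("received: " + tuple);
      }
    };
  }

  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Operators are the vertices of the DAG ...
    NumberGenerator gen = dag.addOperator("generator", new NumberGenerator());
    ConsoleLogger log = dag.addOperator("console", new ConsoleLogger());

    // ... and a stream is the edge connecting an output port to an input port.
    dag.addStream("numbers", gen.out, log.in);
  }
}
```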

Apex - Native Hadoop Integration

• YARN is the resource manager

• HDFS is used for storing any persistent state

What is Data Ingestion?

• Data Ingestion
  A process of obtaining, importing, and analyzing data for later use or storage in a database
• Big Data Ingestion
  ᵒ Discovering the data sources
  ᵒ Importing the data
  ᵒ Processing data to produce intermediate data
  ᵒ Sending data out to durable data stores

Use Case: Kafka ETL

• Consuming data from Kafka
• Processing data to produce intermediate data
• Writing the processed data to HDFS

Brief about Kafka

● Distributed Messaging System.

● Data Partitioning Capability.

● Fast Read and Writes.

● Basic Terminology
  ○ Topic
  ○ Producer
  ○ Consumer
  ○ Broker
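
As a quick illustration of these terms, the sketch below uses the standard Kafka Java client (a recent `kafka-clients` version): a producer publishes records to a topic hosted by a broker, and a consumer in a consumer group reads them back. The topic name, broker address, and group id are example values, not something from this deck.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaTermsDemo
{
  public static void main(String[] args)
  {
    // Producer: publishes records to a topic on the broker(s).
    Properties producerProps = new Properties();
    producerProps.put("bootstrap.servers", "localhost:9092"); // broker address (example value)
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
      producer.send(new ProducerRecord<>("transactions", "key-1", "{\"amount\": 42}"));
    }

    // Consumer: subscribes to the same topic and polls the broker for new records.
    Properties consumerProps = new Properties();
    consumerProps.put("bootstrap.servers", "localhost:9092");
    consumerProps.put("group.id", "etl-demo"); // consumer group (example value)
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    consumerProps.put("auto.offset.reset", "earliest");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
      consumer.subscribe(Collections.singletonList("transactions"));
      ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
      for (ConsumerRecord<String, String> record : records) {
        System.out.println(record.partition() + ": " + record.value());
      }
    }
  }
}
```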

Kafka ETL App

Kafka → Parser → Dedup → Transform → Formatter → HDFS
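
Below is a sketch of how such a pipeline could be wired in `populateDAG` using operators from the Apache Apex Malhar library. The class and port names (KafkaSinglePortInputOperator, CsvParser, TimeBasedDedupOperator, TransformOperator, CsvFormatter, StringFileOutputOperator, and their ports) are my recollection of Malhar and should be treated as assumptions to verify against the Malhar version in use; this is not the exact application shown in the demo.

```java
import org.apache.hadoop.conf.Configuration;

import org.apache.apex.malhar.kafka.KafkaSinglePortInputOperator;
import org.apache.apex.malhar.lib.dedup.TimeBasedDedupOperator;
import org.apache.apex.malhar.lib.fs.GenericFileOutputOperator.StringFileOutputOperator;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.contrib.formatter.CsvFormatter;
import com.datatorrent.contrib.parser.CsvParser;
import com.datatorrent.lib.transform.TransformOperator;

@ApplicationAnnotation(name = "KafkaEtlApp")
public class KafkaEtlApp implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Source: reads byte[] messages from a Kafka topic (topic/brokers set via properties).
    KafkaSinglePortInputOperator kafka = dag.addOperator("kafkaInput", new KafkaSinglePortInputOperator());

    // Parser: converts each message into a POJO according to a delimited-data schema.
    CsvParser parser = dag.addOperator("parser", new CsvParser());

    // Dedup: drops tuples whose key was already seen within the configured time bucket.
    TimeBasedDedupOperator dedup = dag.addOperator("dedup", new TimeBasedDedupOperator());

    // Transform: applies field-level expressions to produce the intermediate data.
    TransformOperator transform = dag.addOperator("transform", new TransformOperator());

    // Formatter: serializes the transformed POJO back to delimited text.
    CsvFormatter formatter = dag.addOperator("formatter", new CsvFormatter());

    // Sink: rolls the formatted lines into files on HDFS.
    StringFileOutputOperator hdfs = dag.addOperator("hdfsOutput", new StringFileOutputOperator());

    // Wire the stages: Kafka -> Parser -> Dedup -> Transform -> Formatter -> HDFS.
    dag.addStream("messages", kafka.outputPort, parser.in);
    dag.addStream("parsed", parser.out, dedup.input);
    dag.addStream("unique", dedup.unique, transform.input);
    dag.addStream("transformed", transform.output, formatter.in);
    dag.addStream("formatted", formatter.out, hdfs.input);
  }
}
```

In a real deployment the Kafka topic and broker list, the parser schema, the dedup key and time bounds, the transform expressions, and the HDFS output path would be supplied through the application's configuration properties; the Malhar documentation for each operator lists the exact property names.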

Kafka ETL Demo

Demo

Resources

• Apache Apex - http://apex.apache.org/
• Subscribe - http://apex.apache.org/community.html
• Download - https://www.datatorrent.com/download/
• Twitter
  ᵒ @ApacheApex; Follow - https://twitter.com/apacheapex
  ᵒ @DataTorrent; Follow - https://twitter.com/datatorrent
• Meetups - http://www.meetup.com/topics/apache-apex
• Webinars - https://www.datatorrent.com/webinars/
• Videos - https://www.youtube.com/user/DataTorrent
• Slides - http://www.slideshare.net/DataTorrent/presentations
• Startup Accelerator Program - Full-featured enterprise product
  ᵒ https://www.datatorrent.com/product/startup-accelerator/

We Are Hiring

• jobs@datatorrent.com
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
• Community Leaders
