Apache Ignite as a Data Processing Hub
APACHE IGNITE AS A DATA PROCESSING HUB
ROMAN SHTYKH, CYBERAGENT, INC.
INTRODUCTION
ABOUT ME
Roman Shtykh
• R&D Engineer at CyberAgent, Inc.
• Areas of focus: data streaming and NLP
• Committer on the Apache Ignite and MyBatis projects
• Judoka
• @rshtykh
CYBERAGENT, INC.
• Internet ads
• Games
• Media
• Investing
[Pie chart: Internet ads 52%, Games 25%, Media 13%, Other 7%, Investing 3% (as of Sep 2015)]
AMEBA SERVICES
• Monthly visitors (DUB total): 6 billion*
• Number of member users: about 39 million*
• Games • Community services • Content curation • Other
* As of Dec 2014
AMEBA SERVICES
Ameba Pigg
CONTENTS
• Apache Ignite
• Feed your data
  • Log Aggregation with Apache Flume
  • Integration with Apache Ignite
  • Streaming Data with Apache Kafka
• Data Pipeline with Kafka and Ignite: Example
APACHE IGNITE
• "High-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash-based technologies."
• High performance, unlimited scalability, and resiliency
• High-performance transactions and fast analytics
• Hadoop acceleration and Apache Spark integration
• Apache project
https://ignite.apache.org/
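To make the "in-memory platform" claim concrete, a minimal sketch of working with an Ignite cache (the cache name "testCache" is an illustrative assumption):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class IgniteHello {
    public static void main(String[] args) {
        // Start a node with the default configuration.
        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Integer, String> cache = ignite.getOrCreateCache("testCache");
            cache.put(1, "Hello, Ignite!");   // distributed put
            System.out.println(cache.get(1)); // distributed get
        }
    }
}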
MAKING APACHE IGNITE A DATA PROCESSING HUB
• Question: How to feed data?
• A simple solution: create a client node (a minimal sketch follows)
  • Is it reliable?
  • Does it scale?
  • Is it an Ignite-only solution?
  • Does it keep your operational costs low?
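A minimal sketch of the client-node approach (the cache name "testCache" is illustrative): the client joins the cluster without storing data and injects entries through IgniteDataStreamer.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class ClientFeeder {
    public static void main(String[] args) {
        Ignition.setClientMode(true); // this node holds no data
        try (Ignite ignite = Ignition.start()) {
            ignite.getOrCreateCache("testCache"); // make sure the cache exists
            try (IgniteDataStreamer<String, String> streamer = ignite.dataStreamer("testCache")) {
                streamer.addData("key1", "value1"); // batched, high-throughput injection
            }
        }
    }
}

The questions above are exactly what this single-feeder design struggles with, which is where Flume and Kafka come in.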
LOG AGGREGATION WITH APACHE FLUME
• Flume: "Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data."
• Scalable
• Flexible
• Robust and fault tolerant
• Declarative configuration
• Apache project
DATA FLOW IN FLUME
[Diagram: incoming data enters an Agent through a Source, is buffered in a Channel, and is delivered by a Sink to another Agent or to the destination]
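The declarative configuration behind this flow looks like the following minimal agent sketch (the netcat source and the component names a1/r1/c1/k1 are illustrative):

a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1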
DATA FLOW IN FLUME (REPLICATION/MULTIPLEXING)
[Diagram: a Channel Selector routes incoming data from the Source to multiple Channels, each drained by its own Sink]
DATA FLOW IN FLUME (RELIABILITY)
• No data is lost (configurable)
[Diagram: events move from Source to Channel to Sink, wrapped in separate source and sink transactions]
LOG TRANSFER AT AMEBA
[Diagram: Ameba services send logs to Flume aggregators, which fan out to a monitoring system, a recommender system, Elasticsearch, Hadoop (batch processing), HBase, and stream processing (Onix, HBaseSink)]
LOG TRANSFER AT AMEBA
• Web hosts: more than 1,600
• Size: 5.0 TB/day (raw)
• Traffic at peak: 160 Mbps (compressed)
IGNITE SINK
• Reads Flume events from a channel
• Converts them into cacheable entries with a user-implemented pluggable transformer
• Adding it requires no modification to the existing architecture
FLUME ⇒ IGNITE (1)
[Diagram: the Ignite Sink opens a new connection to the Ignite cluster]
FLUME ⇒ IGNITE (2)
[Diagram: the Ignite Sink starts a sink transaction on the channel]
FLUME ⇒ IGNITE (3)
[Diagram: within the transaction, the Ignite Sink takes events from the channel and sends them to Ignite]
ENABLING FLUME SINK
• Steps
  1. Implement EventTransformer to convert Flume events into cacheable entries (java.util.Map<K, V>); see the sketch below
  2. Put the transformer's jar into ${FLUME_HOME}/plugins.d/ignite/lib
  3. Put the IgniteSink and Ignite core jar files into ${FLUME_HOME}/plugins.d/ignite/libext
  4. Set up a Flume agent
• Sink setup
a1.sinks.k1.type = org.apache.ignite.stream.flume.IgniteSink
a1.sinks.k1.igniteCfg = /some-path/ignite.xml
a1.sinks.k1.cacheName = testCache
a1.sinks.k1.eventTransformer = my.company.MyEventTransformer
a1.sinks.k1.batchSize = 100
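For step 1, a minimal EventTransformer sketch, assuming the interface contract is transform(List<Event>) returning java.util.Map<K, V>; keying by the "timestamp" header and storing bodies as Strings are illustrative choices:

package my.company;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.ignite.stream.flume.EventTransformer;

public class MyEventTransformer implements EventTransformer<Event, String, String> {
    @Override
    public Map<String, String> transform(List<Event> events) {
        Map<String, String> entries = new HashMap<>(events.size());
        for (Event event : events) {
            // Key by the "timestamp" header (an assumption); skip events without it.
            String key = event.getHeaders().get("timestamp");
            if (key != null)
                entries.put(key, new String(event.getBody()));
        }
        return entries.isEmpty() ? null : entries; // nothing to cache for this batch
    }
}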
FLUME SINKS
• HDFS
• Thrift
• Avro
• HBase
• Elasticsearch
• IRC
• Ignite
APACHE FLUME & APACHE IGNITE
¡ If you do data aggregation with Flume
¡ Adding an Ignite cluster is as simple as writing a simple data transformer and deploying a new Flume agent
¡ If you store your data (and do computations) in Ignite
¡ Improving data injection becomes easy with Flume sink
¡ Combining Apache Flume and Ignite makes/keeps your data pipeline (both aggregation and processing) ¡ Scalable
¡ Reliable
¡ Highly-Performant
STREAMING DATA WITH APACHE KAFKA
APACHE KAFKA
"Publish-subscribe messaging rethought as a distributed commit log"
• Low latency
• High throughput
• Partitioned and replicated
• Kafka is an essential component of many data pipelines today
http://kafka.apache.org/
APACHE KAFKA
• Messages are grouped in topics
• Each partition is a log
• Each partition is managed by a broker (when replicated, one broker is the partition leader)
• Producers & consumers (consumer groups)
• Used for
  • Log aggregation
  • Activity tracking
  • Monitoring
  • Stream processing
http://kafka.apache.org/documentation.html
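To illustrate the producer side of this model, a minimal sketch of publishing records to a topic (the broker address and topic name are illustrative):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Records with the same key land in the same partition of the topic's log.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("someTopic1", "key1", "value1"));
        }
    }
}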
KAFKA CONNECT
• Designed for large-scale stream data integration using Kafka
• Provides an abstraction over communication with your Kafka cluster
  • Offset management
  • Delivery semantics
  • Fault tolerance
  • Monitoring, etc.
• Worker (scalability & fault tolerance)
• Connector (task configuration)
• Task (thread)
• Standalone & distributed execution models
http://www.confluent.io/blog/apache-kafka-0.9-is-released
INGESTING DATA STREAMS
• Two ways
  • Kafka Streamer
  • Sink Connector
[Diagram: Kafka Connect moves data from Kafka into the Ignite cluster (ETL), where SQL queries, distributed closures, and transactions operate on it]
STREAMING VIA SINK CONNECTOR
• Configure your connector
• Configure the Kafka Connect worker (a sketch follows)
• Start your connector
# connector
name=my-ignite-connector
connector.class=IgniteSinkConnector
tasks.max=2
topics=someTopic1,someTopic2
# cache
cacheName=myCache
cacheAllowOverwrite=true
igniteCfg=/some-path/ignite.xml
$ bin/connect-standalone.sh myconfig/connect-standalone.properties myconfig/ignite-connector.properties
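A minimal sketch of the worker file referenced above (myconfig/connect-standalone.properties); the broker address and offset-file location are illustrative assumptions:

bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.file.filename=/tmp/connect.offsets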
STREAMING VIA SINK CONNECTOR
• Easy data pipeline
• Records from Kafka are written to the Ignite grid via the high-performance IgniteDataStreamer
• At-least-once delivery guarantee
• As of Ignite 1.6, start a new connector to write to a different cache
[Diagram: records a, b, c, … at Kafka offsets 0, 1, 2, … become cache entries (a.key, a.val), (b.key, b.val), …; a second record stream (a2, b2, …) flows the same way]
INGESTING DATA STREAMS
• Bi-directional streaming
[Diagram: a sink connector streams data from Kafka into Ignite (SQL queries, distributed closures, transactions), and a source connector streams cache events and continuous-query results from Ignite back to Kafka]
STREAMING BACK TO KAFKA
• Listening to cache events
  • PUT
  • READ
  • REMOVED
  • EXPIRED, etc.
• Remote filtering can be enabled
• Kafka Connect offsets are ignored
• Currently, no delivery guarantees
[Diagram: cache events evt1, evt2, evt3 are forwarded to Kafka as records]
ENABLING SOURCE CONNECTOR
• Configure your connector
• Define a remote filter if needed: cacheFilterCls=MyCacheEventFilter (a sketch follows)
• Make sure that event listening is enabled on the server nodes
• Configure the Kafka Connect worker
• Start your connector
# connector
name=ignite-src-connector
connector.class=org.apache.ignite.stream.kafka.connect.IgniteSourceConnector
tasks.max=2
# topics, events
topicNames=test
cacheEvts=put,removed
# cache
cacheName=myCache
igniteCfg=myconfig/ignite.xml
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.ignite.stream.kafka.connect.serialization.CacheEventConverter
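A minimal sketch of the remote filter named above, assuming it implements IgnitePredicate<CacheEvent>; the "user-" key prefix is an illustrative choice:

package my.company;

import org.apache.ignite.events.CacheEvent;
import org.apache.ignite.lang.IgnitePredicate;

public class MyCacheEventFilter implements IgnitePredicate<CacheEvent> {
    @Override
    public boolean apply(CacheEvent evt) {
        // Forward only events whose keys start with "user-" (an assumption).
        Object key = evt.key();
        return key instanceof String && ((String)key).startsWith("user-");
    }
}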
APACHE KAFKA & APACHE IGNITE
¡ If you do data streaming with Kafka
¡ Adding an Ignite cluster is as simple as writing a configuration file (and creating a filter if you need it for source)
¡ If you store your data (and do computations) in Ignite
¡ Improving data injection and listening for events on data becomes easy with Kafka Connectors
¡ Combining Apache Kafka and Ignite makes/keeps your data pipeline
¡ Scalable
¡ Reliable
¡ Highly-Performant
¡ Covers a wide range of ETL contexts
DATA PIPELINE WITH KAFKA AND IGNITE: EXAMPLE
DATA PIPELINE WITH KAFKA AND IGNITE
• Requirements
  • Instant processing and analysis
  • Scalable and resilient to failures
  • Low latency
  • High throughput
  • Flexibility
DATA PIPELINE WITH KAFKA AND IGNITE
• Filter and aggregate events
[Diagram: a Flume-only pipeline filtering/transforming incoming data slows down under heavy loads and requires more channels/layers]
DATA PIPELINE WITH KAFKA AND IGNITE
[Diagram: Kafka fronts the pipeline, collecting data from multiple sources and feeding the filter/transform stages]
• Parsimonious resource use
• Replay enabled
• More operations on streams
• Flexibility
DATA PIPELINE WITH KAFKA AND IGNITE
• Filter and aggregate events
• Store events
• Notify about updates on aggregates
[Diagram: connectors move the filtered/transformed streams between Kafka and Ignite]
DATA PIPELINE WITH KAFKA AND IGNITE
• Improving ads delivery
[Diagram: clicks, impressions, and ads flow through Kafka into Ignite (storage/computation), alongside image storage; ads delivery and the ads recommender keep data & computation in one place]
DATA PIPELINE WITH KAFKA AND IGNITE
• Improving ads delivery
• Better network utilization and reliability
[Diagram: the same pipeline extended with anomaly detection]
OTHER INTEGRATIONS
OTHER COMPLETED INTEGRATIONS
• Camel
• MQTT
• Storm
• Flink sink
THE END