
Page 1: Ingestion and Dimensions Compute and Enrich using Apache Apex

Apache Apex
Intro to Apex
Ingestion and Dimensions Compute for a customer use-case

Devendra Tagare | [email protected] | @devtagare | 9th July 2016

Page 2: Ingestion and Dimensions Compute and Enrich using Apache Apex

What is Apex

• Platform and runtime engine that enables development of scalable and fault-tolerant distributed applications
• Hadoop native
• Process streaming or batch big data
• High throughput and low latency
• Library of commonly needed business logic
• Write any custom business logic in your application

Page 3: Ingestion and Dimensions Compute and Enrich using Apache Apex

Applications on Apex

• Distributed processing
  • Application logic is broken into components called operators that run in a distributed fashion across your cluster
• Scalable
  • Operators can be scaled up or down at runtime according to the load and SLA
• Fault tolerant
  • Automatically recover from node outages without having to reprocess from the beginning
  • State is preserved
  • Long-running applications
• Operators
  • Use the library to build applications quickly
  • Write your own in Java using the API
• Operational insight – DataTorrent RTS
  • See how each operator is performing and even record data

Page 4: Ingestion and Dimensions Compute and Enrich using Apache Apex

Apex Stack Overview

Page 5: Ingestion and Dimensions Compute and Enrich using Apache Apex

Apex Operator Library - Malhar

Page 6: Ingestion and Dimensions Compute and Enrich using Apache Apex

Native Hadoop Integration

• YARN is the resource manager

• HDFS used for storing any persistent state

Page 7: Ingestion and Dimensions Compute and Enrich using Apache Apex

Application Development Model

A Stream is a sequence of data tuples. A typical Operator takes one or more input streams, performs computations, and emits one or more output streams.

• Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
• An Operator has many instances that run in parallel, and each instance is single-threaded

A Directed Acyclic Graph (DAG) is made up of operators and streams.
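To make the operator model concrete, here is a minimal sketch of a custom operator in Java (the class name and filtering rule are illustrative, not from the deck):

import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

// Illustrative operator: one input stream in, one filtered output stream out.
public class LineFilterOperator extends BaseOperator
{
  public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

  public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
  {
    @Override
    public void process(String tuple)
    {
      // Called once per tuple; each operator instance is single-threaded.
      if (!tuple.isEmpty()) {
        output.emit(tuple);
      }
    }
  };
}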

[Figure: Directed Acyclic Graph (DAG) — tuples flow through a chain of operators, producing a filtered stream, an enriched stream, and an output stream]

Page 8: Ingestion and Dimensions Compute and Enrich using Apache Apex

Advanced Windowing Support

• Application window
• Sliding window and tumbling window
• Checkpoint window
• No artificial latency
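For illustration, window sizes are declared as attributes on the DAG and on individual operators — a sketch assuming an operator handle named aggregator inside populateDAG():

import com.datatorrent.api.Context.DAGContext;
import com.datatorrent.api.Context.OperatorContext;

// Sketch, inside populateDAG(): each streaming window is 500 ms...
dag.setAttribute(DAGContext.STREAMING_WINDOW_SIZE_MILLIS, 500);
// ...and this (hypothetical) aggregator computes over an application window
// of 120 streaming windows, i.e. one minute, with no artificial latency added.
dag.setAttribute(aggregator, OperatorContext.APPLICATION_WINDOW_COUNT, 120);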

Page 9: Ingestion and Dimensions Compute and Enrich using Apache Apex

Application in Java
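The code on this slide did not survive transcription; below is a minimal sketch of an Apex application in Java, wiring a file reader, the LineFilterOperator sketched earlier, and a console writer into a DAG (operator names and the input directory are placeholders):

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.lib.io.ConsoleOutputOperator;
import com.datatorrent.lib.io.fs.LineByLineFileInputOperator;

@ApplicationAnnotation(name = "FilterDemo")
public class Application implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Operators are the nodes of the DAG.
    LineByLineFileInputOperator reader = dag.addOperator("reader", new LineByLineFileInputOperator());
    reader.setDirectory("/tmp/input"); // placeholder input directory
    LineFilterOperator filter = dag.addOperator("filter", new LineFilterOperator());
    ConsoleOutputOperator console = dag.addOperator("console", new ConsoleOutputOperator());

    // Streams are the edges, connecting output ports to input ports.
    dag.addStream("lines", reader.output, filter.input);
    dag.addStream("filtered", filter.output, console.input);
  }
}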

Page 10: Ingestion and Dimensions Compute and Enrich using Apache Apex

Partitioning and Unification — NxM Partitions, Unifier

[Figure: a logical DAG of operators 0 → 1 → 2 → 3; the physical diagram with operator 1 split into 3 partitions feeding a unifier; a physical DAG with partitions (1a, 1b, 1c) and (2a, 2b) with no bottleneck; and the same physical DAG with a bottleneck on the intermediate unifier]
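A static partition count can be declared with the stock StatelessPartitioner — a sketch, assuming an operator handle named operator1 inside populateDAG():

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.common.partitioner.StatelessPartitioner;

// Sketch, inside populateDAG(): run operator 1 as three parallel partitions
// (1a, 1b, 1c); the platform inserts unifiers to merge the partitioned stream.
dag.setAttribute(operator1, OperatorContext.PARTITIONER,
    new StatelessPartitioner<LineFilterOperator>(3));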

Page 11: Ingestion and Dimensions Compute and Enrich using Apache Apex

Advanced Partitioning

[Figure: logical plan 0 → 1 → 2 → 3 → 4 and its physical DAGs — a parallel partition where each upstream partition (1a, 1b) is paired with its own downstream chain (2a/3a, 2b/3b) instead of being unified first; the execution plan for N = 4, M = 1 with upstream partitions uopr1–uopr4 feeding unifiers and a downstream operator (dopr) across containers and NICs; and the execution plan for N = 4, M = 1, K = 2 with cascading unifiers]
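A parallel partition is declared on the downstream operator's input port — a sketch, assuming a handle named downstreamOperator:

import com.datatorrent.api.Context.PortContext;

// Sketch, inside populateDAG(): tag the downstream operator's input port so each
// upstream partition gets its own downstream instance (no unifier in between).
dag.setInputPortAttribute(downstreamOperator.input, PortContext.PARTITION_PARALLEL, true);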

Page 12: Ingestion and Dimensions Compute and Enrich using Apache Apex

Dynamic Partitioning

• Partitioning can change while the application is running
  ᵒ Change the number of partitions at runtime based on stats
  ᵒ Determine the initial number of partitions dynamically
• Kafka operators scale according to the number of Kafka partitions
  ᵒ Supports re-distribution of state when the number of partitions changes
  ᵒ API for custom scaler or partitioner
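The stats-based hook looks roughly like the sketch below: the platform calls processStats() periodically with each operator's metrics, and setting repartitionRequired asks the platform to invoke the operator's partitioner (the threshold and policy here are hypothetical):

import com.datatorrent.api.StatsListener;

// Sketch of the stats-based scaling hook; the threshold and policy are hypothetical.
public class ThroughputScaler implements StatsListener
{
  @Override
  public Response processStats(BatchedOperatorStats stats)
  {
    Response response = new Response();
    // Ask the platform to invoke the operator's partitioner when the
    // per-second moving average of processed tuples crosses a threshold.
    response.repartitionRequired = stats.getTuplesProcessedPSMA() > 50000;
    return response;
  }
}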

[Figure: dynamic partitioning — operator 2 scaled at runtime from two partitions (2a, 2b) to four (2a–2d), and operator 3 from one instance to two (3a, 3b); unifiers not shown]

Page 13: Ingestion and Dimensions Compute and Enrich using Apache Apex

Use Case

• Ingest from Kafka and S3
• Parse, filter, and enrich
• Dimensional compute for key performance indicators
• Reporting of critical metrics around campaign monetization
• Aggregate counters & reporting on top N metrics
• Low latency querying using Kafka in a pub-sub model

Page 14: Ingestion and Dimensions Compute and Enrich using Apache Apex

Screenshots - Demo UI

Page 15: Ingestion and Dimensions Compute and Enrich using Apache Apex

Scale

• 6 geographically distributed data centers
• Combination of co-located & AWS-based DCs
• > 5 PB under data management
• 22 TB/day of data generated from auction & client logs
• Heterogeneous data log formats
• North of 15 Bn impressions/day
• Average data inflow of 200K events/s

Page 16: Ingestion and Dimensions Compute and Enrich using Apache Apex

Initial Requirements

• Ad server log events consumed as Avro-encoded, Snappy-compressed files from S3. New files uploaded every 10-20 minutes.
• Data may arrive in S3 out of order (timestamps).
• Event size is about 2 KB uncompressed; only a subset of fields is retrieved for aggregation.
• Aggregates kept in memory (checkpointed) with an expiration policy; query processing runs against the in-memory data.
• Front-end integration through a Kafka-based query protocol for real-time dashboard components.

Page 17: Ingestion and Dimensions Compute and Enrich using Apache Apex

Real-time Architecture - Powered by Apex

[Figure: Architecture 1.0 - Batch Reads + Streaming Aggregations. The AdServer writes auction and client logs to S3; in Apex, S3Reader operators feed Filter operators (filtered events), which feed Dimensions Aggregators (aggregates) into a Dimensions Store; queries from the Middleware arrive via REST proxies over a Kafka cluster, and query results return the same way.]

Page 18: Ingestion and Dimensions Compute and Enrich using Apache Apex

Challenges

• Unstable S3 client libraries
  – Unpredictable hangs and corrupted data
  – On a hang, the master kills the container and restarts reading of the file from a different container
  – Corrupt files caused containers to be killed – application-configurable retry mechanism and skipping of bad files
  – Limited read throughput – 1 reader per file
• Out-of-order data
  – Some timestamps in the future and the past
• Spike in load when new files are added, followed by periods of inactivity
• Memory requirement for the store
  – Cardinality estimation for incoming data

Page 19: Ingestion and Dimensions Compute and Enrich using Apache Apex

Real-time Architecture - Powered by Apex

[Figure: Architecture 2.0 - Batch + Streaming. Auction logs are now also ingested directly from a Kafka cluster: Kafka Input operators (auction logs) feed an ETL operator that decompresses & flattens the Kafka messages, while an S3Reader continues to ingest logs (client logs) from S3; both paths feed Filter operators (filtered events), Dimensions Aggregators (aggregates), and a Dimensions Store/HDHT; queries from the Middleware and their results flow via REST proxies over a Kafka cluster.]

Page 20: Ingestion and Dimensions Compute and Enrich using Apache Apex

Challenges

• Complex logical DAG
• Kafka operator issues
  – Dynamic partitioning
  – Memory configuration
  – Offset snapshotting to ensure exactly-once semantics
• Resource allocation
  – More memory required for the store (large number of unifiers)
• Harder debugging (more components)
  – GBs of container logs
  – Difficult to locate the sequence of failure
• More data transferred over the wire within the cluster
• Limiting the Kafka read rate

Page 21: Ingestion and Dimensions Compute and Enrich using Apache Apex

Real-time Architecture - Powered by Apex

[Figure: Architecture 3.0 - Streaming. The User Browser and AdServer emit client and auction logs through REST proxies (client logs cached via a CDN) into Kafka clusters; in Apex, Kafka Input operators for auction logs and client logs feed ETL operators (decompress & flatten), then Filter operators (filtered events), Dimensions Aggregators (aggregates), and a Dimensions Store; queries from the Middleware and their results flow over a Kafka cluster.]

Page 22: Ingestion and Dimensions Compute and Enrich using Apache Apex

Operational Architecture

Page 23: Ingestion and Dimensions Compute and Enrich using Apache Apex

Application Configuration

• 64 Kafka input operators reading from 6 geographically distributed DCs
• Under 40 seconds end-to-end latency - from ad-serving to visualization
• 32 instances of the in-memory distributed store
• 64 aggregators
• 1.2 TB memory footprint @ peak load
• The in-memory store was later replaced by HDHT for fault tolerance

Page 24: Ingestion and Dimensions Compute and Enrich using Apache Apex

Learnings

• DAG - sizing, locality & partitioning (benchmark)
• Memory sizing for the store and other memory-heavy operators
• Cardinality estimation for incoming data is critical
• Upstream operators tend to require more memory than downstream operators for high-velocity reads
• Back pressure arises from downstream failures (due to skew in the velocity of events) and upstream failures - buffer server sizing is critical
• For end-to-end exactly-once, it is necessary to understand the external systems' semantics & delivery guarantees
• Think fault tolerance & recovery before starting implementation

Page 25: Ingestion and Dimensions Compute and Enrich using Apache Apex

Before And After

Before Scenario - No Real-time (latency: 5 hours + 20 minutes)

• No real-time processing system in place
• Publishers and buyers could only rely on a batch processing system for gathering relevant data
• Outdated data, not relevant to the current time
• Current data being pushed to a waiting queue
• Cumbersome batch-processing lifecycle
• No visualization for reports
• No glimpse into everyday happenings, translating to lost or untimely decision-making

After Scenario (latency: ~ 30 seconds)

• Phase 1 - Batch + Real-time
  • With DataTorrent RTS (built on Apache Apex), the dev team put together the first real-time analytics platform
  • This enabled reporting of critical metrics around campaign monetization
  • Reuse of the batch ingestion mechanism for the impression data, shared with other pipelines (S3)
• Phase 2 - Real-time Streaming
  • Reduced end-to-end latency through real-time ingestion of impression data from Kafka
  • Results available much sooner to the user
  • Balances load (no more batch ingestion spikes), reduces resource consumption
  • Handles ever-growing traffic with more efficient resource utilization

Page 26: Ingestion and Dimensions Compute and Enrich using Apache Apex

Operators used

S3 reader (File Input Operator)
• Recursively reads the contents of an S3 bucket based on a partitioning pattern
• Inclusion & exclusion support
• Fault tolerance (replay and idempotency)
• Throughput of over 12K reads/second for an event size of 1.2 KB each

Kafka Input Operator
• Ability to consume from multiple Kafka clusters
• Offset management support
• Fault-tolerant reads
• Support for idempotent & exactly-once semantics
• Controlled reads for managing back-pressure

POJO Enrichment Operator
• Takes a POJO as input and does a look-up in a store for a given key
• Supports caching
• Stores are pluggable
• App Builder ready
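As a rough sketch of wiring the Malhar Kafka input into the DAG (topic, broker list, and offset policy are placeholders):

import org.apache.apex.malhar.kafka.KafkaSinglePortInputOperator;

// Sketch, inside populateDAG(): consume auction logs (placeholder topic/brokers).
KafkaSinglePortInputOperator kafkaInput =
    dag.addOperator("kafkaInput", new KafkaSinglePortInputOperator());
kafkaInput.setTopics("auction-logs");
kafkaInput.setClusters("broker1:9092,broker2:9092");
kafkaInput.setInitialOffset("EARLIEST");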

Page 27: Ingestion and Dimensions Compute and Enrich using Apache Apex

Operators used (cont...)

Parser
• Specify a JSON schema
• Emits a POJO based on the output schema
• No user code required

Dimension Store
• Distributed in-memory store
• Supports re-aggregation of events
• Partitioning of aggregates per view
• Low latency query support with a pub/sub model using Kafka

HDHT
• HDFS-backed embedded key-value store
• Fault tolerant, random read & write
• Durability in case of cold restarts

Page 28: Ingestion and Dimensions Compute and Enrich using Apache Apex

Dimensional Model - Key Concepts

Metrics: pieces of information we want to collect statistics about.

Dimensions: variables which can impact our measures.

Combinations: sets of dimensions for which one or more metrics would be aggregated. They are sub-sets of the dimensions.

Aggregations: the aggregate function, e.g. SUM, TOPN, standard deviation.

Example:
Dimensions - campaignId, advertiserId, time
Metrics - cost, revenue, clicks, impressions
Aggregate functions - SUM, AM (arithmetic mean), etc.

Combinations:
1. campaignId x time - cost, revenue
2. advertiser - revenue, impressions
3. campaignId x advertiser x time - revenue, clicks, impressions

How to aggregate on the combinations?
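To make this concrete before looking at the schema, here is a plain-Java sketch (illustrative only, not the Apex dimensions operator) of combination 1 - campaignId x time, aggregating cost and revenue with SUM:

import java.util.HashMap;
import java.util.Map;

public class CombinationAggregation
{
  // Hypothetical event with dimensions (campaignId, advertiserId, timeBucket)
  // and metrics (cost, revenue) in integer micro-units.
  record Event(int campaignId, int advertiserId, String timeBucket, long cost, long revenue) {}

  record Totals(long cost, long revenue) {}

  public static void main(String[] args)
  {
    Event[] events = {
      new Event(1, 10, "10:00", 250, 400),
      new Event(1, 11, "10:00", 300, 550),
      new Event(2, 10, "10:00", 100, 200),
    };

    // Combination "campaignId x time": the key holds exactly those two dimensions;
    // every other dimension (advertiserId here) is rolled up.
    Map<String, Totals> aggregates = new HashMap<>();
    for (Event e : events) {
      String key = e.campaignId() + "|" + e.timeBucket();
      aggregates.merge(key, new Totals(e.cost(), e.revenue()),
          (a, b) -> new Totals(a.cost() + b.cost(), a.revenue() + b.revenue()));
    }

    // Prints e.g. "1|10:00 -> Totals[cost=550, revenue=950]".
    aggregates.forEach((k, v) -> System.out.println(k + " -> " + v));
  }
}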

Page 29: Ingestion and Dimensions Compute and Enrich using Apache Apex

Dimensional Model

Dimensions Schema

{"keys":[{"name":"campaignId","type":"integer"}, {"name":"adId","type":"integer"}, {"name":"creativeId","type":"integer"}, {"name":"publisherId","type":"integer"}, {"name":"adOrderId","type":"integer"}], "timeBuckets":["1h","1d"], "values": [{"name":"impressions","type":"integer","aggregators":["SUM"]}, {"name":"clicks","type":"integer","aggregators":["SUM"]}, {"name":"revenue","type":"integer"}], "dimensions": [{"combination":["campaignId","adId"]}, {"combination":["creativeId","campaignId"]}, {"combination":["campaignId"]}, {"combination":["publisherId","adOrderId","campaignId"],"additionalValues":["revenue:SUM"]}]}

Page 30: Ingestion and Dimensions Compute and Enrich using Apache Apex

More Use-cases

• Real-time Monitoring
  • Alerts on deal tracking & monetization
  • Campaign & deal health
• Real-time Learning
  • Using the lost-bid insights for price recommendations
• Allocation Engine
  • Feedback to ad serving for guaranteed delivery & line-item pacing

Page 31: Ingestion and Dimensions Compute and Enrich using Apache Apex

Data Processing Pipeline Example - App Builder

Page 32: Ingestion and Dimensions Compute and Enrich using Apache Apex

Monitoring Console - Logical View

Page 33: Ingestion and Dimensions Compute and Enrich using Apache Apex

Monitoring Console - Physical View

Page 34: Ingestion and Dimensions Compute and Enrich using Apache Apex

Real-Time Dashboards - Real-Time Visualization

Page 35: Ingestion and Dimensions Compute and Enrich using Apache Apex

Q&A

Page 36: Ingestion and Dimensions Compute and Enrich using Apache Apex

Resources

• http://apex.apache.org/
• Learn more: http://apex.apache.org/docs.html
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Follow @ApacheApex - https://twitter.com/apacheapex
• Meetups - http://www.meetup.com/pro/apacheapex/
• More examples: https://github.com/DataTorrent/examples
• Slideshare: http://www.slideshare.net/ApacheApex/presentations
• https://www.youtube.com/results?search_query=apache+apex
• Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/