Introducing Athena: 08/19 Big Data Application Meetup, Talk #3
Athena Streaming with Samza
Goddess of Wisdom
The team
Brief Background: Stream Processing, Samza, Athena
Kafka: Uber uses Kafka as its logging/event system
Message streams
Near-real time computation
Stream Processing Framework: Apache Samza
Platform built on top of Samza: Athena
Current use cases
Current use cases: Aggregation
Pricing
Assess supply and demand in real time to calculate accurate surge multiples
[Diagram. Realtime path: Kafka, Samza (location updates), Elasticsearch. Batch path: S3, Spark (location updates). Queries go through an HTTP Service.]
[Diagram: Artemis Samza. Raw events go to three Mapper Jobs (event parsing, filtering, classification). Event Aggregation uses Inserter Jobs (Tiles) and Reducer Jobs (1m Agg Tiles), with Riak as the store. Components are labeled Per Contract or Common.]
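The "1m Agg Tiles" label in the aggregation slides suggests events are rolled up per tile into one-minute windows. A minimal sketch of that idea in plain Java, with no Samza dependency. The class name MinuteTileAggregator, the string tile id, and the choice of a simple event count as the metric are all hypothetical and not taken from the talk.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: counts events per (tile, one-minute window).
// Assumes non-negative epoch timestamps in milliseconds.
class MinuteTileAggregator {
    private static final long WINDOW_MS = 60_000L;

    // Key format "tileId@windowStartMs" is an illustration-only choice.
    private final Map<String, Long> counts = new HashMap<>();

    // Truncate an event timestamp to the start of its one-minute window.
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_MS);
    }

    // Record one event for the given tile.
    void add(String tileId, long timestampMs) {
        counts.merge(key(tileId, windowStart(timestampMs)), 1L, Long::sum);
    }

    // Number of events seen for the tile in the window containing timestampMs.
    long count(String tileId, long timestampMs) {
        return counts.getOrDefault(key(tileId, windowStart(timestampMs)), 0L);
    }

    private static String key(String tileId, long windowStartMs) {
        return tileId + "@" + windowStartMs;
    }
}
```

In a real job the map would more likely live in a fault-tolerant store than in memory; the sketch only shows the windowing arithmetic.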
Current use cases: Event-driven update engine
Driver Activation
[Diagram: Kafka, Samza (driver status updates), Onboarding Service (fetch driver info, update status), RocksDB (retry queue).]
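The driver activation slide pairs status updates with a retry queue. A minimal sketch of that pattern follows, assuming an update either succeeds or must be retried later. The class RetryingUpdater and its methods are hypothetical, and the queue here is an in-memory deque rather than the RocksDB store named on the slide.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of an update engine with a retry queue.
class RetryingUpdater {
    // Stand-in for the call to a downstream service; returns true on success.
    private final Predicate<String> applyUpdate;
    private final Deque<String> retryQueue = new ArrayDeque<>();
    private final List<String> applied = new ArrayList<>();

    RetryingUpdater(Predicate<String> applyUpdate) {
        this.applyUpdate = applyUpdate;
    }

    // Try the update once; queue the driver id for retry if it fails.
    void process(String driverId) {
        if (applyUpdate.test(driverId)) {
            applied.add(driverId);
        } else {
            retryQueue.addLast(driverId);
        }
    }

    // Retry each currently queued id exactly once; failures are re-queued.
    void retryPending() {
        int pending = retryQueue.size();
        for (int i = 0; i < pending; i++) {
            process(retryQueue.removeFirst());
        }
    }

    int pendingCount() {
        return retryQueue.size();
    }

    List<String> applied() {
        return applied;
    }
}
```

Bounding the loop by the queue size at entry keeps a persistently failing id from spinning forever within a single retry pass.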
Upcoming use cases
Fraud monitoring and alerting
[Diagram: Kafka, Partitioner, Real Time and Hourly jobs, Alerting Service, Monitoring Service, Uberx metrics.]
Da Vinci: Streaming platform for data science
[Diagram: Kafka, Aggregation job (Model generation), RocksDB (Historical state, Timeseries data), Model evaluation, Model Updates, Database / Elasticsearch updates.]
Samza Architecture Overview
Samza Architecture
Basic Structure of a Task
Task
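A Samza task is typically built from three callbacks: InitableTask.init for setup, StreamTask.process for each incoming message, and WindowableTask.window for periodic work. The sketch below mirrors that shape in plain Java so it compiles without Samza on the classpath. The Collector interface and CountingTask class are hypothetical stand-ins, not Samza's own types, and the method signatures are simplified.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for an output sink; NOT Samza's MessageCollector.
interface Collector {
    void send(String stream, String key, Object value);
}

// Hypothetical task: counts messages per key and emits the counts
// each time the window callback fires.
class CountingTask {
    private Map<String, Integer> counts;

    // Analogous to InitableTask.init: set up state once before processing.
    void init() {
        counts = new HashMap<>();
    }

    // Analogous to StreamTask.process: called once per incoming message.
    // This sketch only counts, so the message body and collector go unused.
    void process(String key, String message, Collector out) {
        counts.merge(key, 1, Integer::sum);
    }

    // Analogous to WindowableTask.window: called periodically,
    // for example every task.window.ms milliseconds.
    void window(Collector out) {
        counts.forEach((key, count) -> out.send("counts", key, count));
        counts.clear();
    }
}
```

Keeping per-message work cheap in process and deferring output to window is one common way to batch writes; whether Athena jobs follow this split is not stated in the slides.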
Deployment in Uber
Athena Tooling
Tooling
● Athena manager
● Job configuration
● Unit test framework
● Graphite integration
● Codahale library support
● Maven archetype
● Artifactory support
Athena Manager
Job configuration
[Diagram: athena-core-lib provides reference.conf and each Samza job provides application.conf; both cover the Sandbox, Staging, Production, and Dev environments.]
Job configuration
projects {
  artemis {
    job_1 {
      mapper {
        envs.common {
          task.inputs = "kafka.topic_1"
          task.class = "com.uber.athena.SampleTaskClass"
        }
        envs.local = ${projects.artemis.job_1.mapper.envs.common}
        envs.local = { task.window.ms = 30000 }
        envs.sandbox = ${projects.artemis.job_1.mapper.envs.local}
        envs.sandbox = { yarn.package.path = "http://artifactory..../artifactory/libs-snapshot-local/com/uber/athena/.../hello-athena.tar.gz" }
        envs.staging = ${projects.artemis.job_1.mapper.envs.sandbox}
        envs.production = ${projects.artemis.job_1.mapper.envs.sandbox}
      }
    }
  }
}
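The configuration above layers environments: local starts from common and adds task.window.ms, sandbox starts from local and adds yarn.package.path, and staging and production reuse sandbox. The effect can be sketched as a map merge in plain Java. The EnvConfigs class and its inherit helper are hypothetical, and the package path is a placeholder because the real URL is elided on the slide.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of environment inheritance as a flat map merge.
class EnvConfigs {
    // Return a new map holding the parent's settings with the child's
    // overrides applied on top; the parent map is left untouched.
    static Map<String, String> inherit(Map<String, String> parent,
                                       Map<String, String> overrides) {
        Map<String, String> merged = new LinkedHashMap<>(parent);
        merged.putAll(overrides);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> common = new LinkedHashMap<>();
        common.put("task.inputs", "kafka.topic_1");
        common.put("task.class", "com.uber.athena.SampleTaskClass");

        Map<String, String> local =
            inherit(common, Collections.singletonMap("task.window.ms", "30000"));
        // "<package-url>" is a placeholder, not the real artifact location.
        Map<String, String> sandbox =
            inherit(local, Collections.singletonMap("yarn.package.path", "<package-url>"));

        System.out.println(sandbox);
    }
}
```

The merge here is shallow and works on flat keys only; nested objects would need a recursive merge.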
Unit test framework
[Diagram: TaskUnitTestHarness wraps a Stream Job (StreamTask, InitableTask, WindowableTask). Data is injected as a custom IncomingMessageEnvelope; output goes through a custom MessageCollector to a Message Listener.]
Unit test framework
String classTaskName = "com.uber.athena.test.TestProcessTask";

// Register the job to the test harness
TaskUnitTestHarness<String, Integer> testProcessTask =
    new TaskUnitTestHarness<>(classTaskName, false, true);

// Register a listener for validating the output of every process function
testProcessTask.registerMessageListenerOnProcess(new KeyAsserterListener());

// Start the job
testProcessTask.start();

// Inject data
testProcessTask.inject(key, value);

// Get the full output
testProcessTask.getResult();
Observations
● YARN is not bad!
● Offset lag and buffered messages
● Kafka Appender for ELK
● Checkpoint topic partition incorrect count
● Config validation needs improvement
● Job restarts are complicated
● Built-in metrics are insufficient
Upcoming Samza improvements
● Seamless upgrades
● Custom built-in serde support
● Config validation enhancement
● Auto benchmarking a Samza job
● Unit test framework enhancement
Q&A