Designing Data Architectures for Robust Decision Making
Gwen Shapira / Software Engineer
About Me
• 15 years of moving data around
• Formerly consultant
• Now Cloudera Engineer:
  – Sqoop Committer
  – Kafka
  – Flume
• @gwenshap
There’s a book on that!
About you:
You know Hadoop
“Big Data” is stuck at The Lab.
We want to move to The Factory
What does it mean to “Systemize”?
• Ability to easily add new data sources
• Easily improve and expand analytics
• Ease data access by standardizing metadata and storage
• Ability to discover mistakes and to recover from them
• Ability to safely experiment with new approaches
We will not discuss:
• Actual decision making
• Data Science
• Machine learning
• Algorithms

We will discuss:
• Architectures
• Patterns
• Ingest
• Storage
• Schemas
• Metadata
• Streaming
• Experimenting
• Recovery
So how do we build real data architectures?
The Data Bus
Data Pipelines Start like this.
[Diagram: a single client sending data to a single source]
Then we reuse them
[Diagram: many clients sending data to the same source]
Then we add consumers to the existing sources
[Diagram: many clients writing to one backend, with another backend now consuming from the same sources]
Then it starts to look like this
[Diagram: every client connected to several backends, a point-to-point mesh of pipelines]
With maybe some of this
[Diagram: the same mesh, with some backends also feeding data back to clients and to each other]
Adding applications should be easier. We need:
• Shared infrastructure for sending records
• Infrastructure that scales
• Set of agreed-upon record schemas
Kafka Based Ingest Architecture
[Diagram: source systems act as Kafka producers; brokers sit in the middle; Hadoop, security systems, real-time monitoring, and the data warehouse are the consumers]
Kafka decouples Data Pipelines
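To make the producer side concrete, here is a minimal sketch of a source system writing onto the bus from Scala with the Kafka Java producer; the broker addresses, topic name, and payload are placeholders, not part of the talk:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092,broker2:9092")  // placeholder brokers
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// Every source system writes records the same way; any number of consumers
// (Hadoop, monitoring, the data warehouse) can read them independently later.
producer.send(new ProducerRecord("orders", "order-123", """{"item": "aspirin", "qty": 3}"""))
producer.close()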
Retain All Data
Data Pipeline – Traditional View
[Diagram: raw data flows through clean, enriched, and aggregated stages; only the input and the final output are kept, and the intermediate data is treated as a waste of disk space]
It is all valuable data
[Diagram: every stage of the pipeline (raw, clean, enriched, aggregated, filtered data) is consumed by someone: dashboards, reports, the data scientist, alerts]
Hadoop Based ETL – The FileSystem is the DB
/user/…
/user/gshapira/testdata/orders
/data/<database>/<table>/<partition>
/data/<biz unit>/<app>/<dataset>/partition
/data/pharmacy/fraud/orders/date=20131101
/etl/<biz unit>/<app>/<dataset>/<stage>
/etl/pharmacy/fraud/orders/validated
Store intermediate data
/etl/<biz unit>/<app>/<dataset>/<stage>/<dataset_id>
/etl/pharmacy/fraud/orders/raw/date=20131101
/etl/pharmacy/fraud/orders/deduped/date=20131101
/etl/pharmacy/fraud/orders/validated/date=20131101
/etl/pharmacy/fraud/orders_labs/merged/date=20131101
/etl/pharmacy/fraud/orders_labs/aggregated/date=20131101
/etl/pharmacy/fraud/orders_labs/ranked/date=20131101
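As an illustration of this layout, here is a minimal Spark batch sketch that reads one stage directory and writes the next; the job name, the dedup logic, and the hard-coded date are assumptions made for the example:

import org.apache.spark.{SparkConf, SparkContext}

object DedupeOrders {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dedupe-orders"))
    val date = "date=20131101"  // in a real job this would be a parameter

    // Read the raw stage, de-duplicate, and write the next stage directory
    val raw = sc.textFile(s"/etl/pharmacy/fraud/orders/raw/$date")
    raw.distinct().saveAsTextFile(s"/etl/pharmacy/fraud/orders/deduped/$date")

    sc.stop()
  }
}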
Batch ETL is old news
Small Problem!
• HDFS is optimized for large chunks of data
• Don’t write individual events or micro-batches
• Think 100MB–2GB batches
• What do we do with small events?
Well, we have this data bus…
[Diagram: a Kafka topic with three partitions; each partition is an ordered, append-only log of messages with offsets 0, 1, 2, …; writes go to the new end while old messages remain readable]
Kafka has topics. How about:
<biz unit>.<app>.<dataset>.<stage>
pharmacy.fraud.orders.raw
pharmacy.fraud.orders.deduped
pharmacy.fraud.orders.validated
pharmacy.fraud.orders_labs.merged
It’s (almost) all topics
[Diagram: the same pipeline, with each stage (raw, clean, enriched, aggregated, filtered data) flowing through its own topic to dashboards, reports, the data scientist, and alerts]
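To make one stage of this concrete, here is a minimal sketch that consumes the raw topic, de-duplicates within each poll, and produces to the deduped topic. It assumes the newer Kafka Java consumer (0.9+), string-encoded messages, and placeholder broker and group names:

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Placeholder configs: bootstrap.servers, String (de)serializers, and a consumer group
val cProps = new Properties()
cProps.put("bootstrap.servers", "broker1:9092")
cProps.put("group.id", "orders-dedupe")
cProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
cProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val pProps = new Properties()
pProps.put("bootstrap.servers", "broker1:9092")
pProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
pProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val consumer = new KafkaConsumer[String, String](cProps)
val producer = new KafkaProducer[String, String](pProps)
consumer.subscribe(java.util.Arrays.asList("pharmacy.fraud.orders.raw"))

while (true) {
  val records = consumer.poll(1000).asScala
  // Naive de-duplication within a single poll, keyed by the message key
  records.map(r => r.key -> r.value).toMap.foreach { case (key, value) =>
    producer.send(new ProducerRecord("pharmacy.fraud.orders.deduped", key, value))
  }
}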
Benefits
• Recover from accidents
• Debug suspicious results
• Fix algorithm errors
• Experiment with new algorithms
• Expand pipelines
• Jump-start expanded pipelines
Kinda Lambda
Lambda Architecture
• Immutable events
• Store intermediate stages
• Combine Batches and Streams
• Reprocessing
What we don’t like
Maintaining two applications
Often in two languages
That do the same thing
Pain Avoidance #1 – Use Spark + Spark Streaming
• Spark is awesome for batch, so why not?
  – The New Kid that isn’t that New Anymore
  – Easily 10x less code
  – Extremely Easy and Powerful API
  – Very good for machine learning
  – Scala, Java, and Python
  – RDDs
  – DAG Engine
Spark Streaming
• Calling Spark in a Loop
• Extends RDDs with DStream
• Very Little Code Changes from ETL to Streaming
Spark Streaming
[Diagram: before the first batch, a receiver reads the source into an RDD; each batch is then a single pass that runs the same filter, count, and print steps over a new RDD]
Small Example

val sparkConf = new SparkConf()
  .setMaster(args(0))
  .setAppName(this.getClass.getCanonicalName)
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Create the DStream from data sent over the network
val dStream = ssc.socketTextStream(args(1), args(2).toInt,
  StorageLevel.MEMORY_AND_DISK_SER)

// Count the errors in each RDD in the stream
val errCountStream = dStream.transform(rdd => ErrorCount.countErrors(rdd))
val stateStream = errCountStream.updateStateByKey[Int](updateFunc)
errCountStream.foreachRDD(rdd => {
  System.out.println("Errors this minute: %d".format(rdd.first()._2))
})
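The snippet leaves a few pieces off the slide: the imports, the ErrorCount.countErrors and updateFunc helpers, and the calls that start the streaming context. A minimal sketch of what they might look like; the "ERROR" substring match is an assumption made for the example:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._   // pair-RDD implicits for reduceByKey on older Spark
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ErrorCount {
  // Count lines containing "ERROR", keyed so the result fits updateStateByKey
  def countErrors(rdd: RDD[String]): RDD[(String, Int)] =
    rdd.filter(_.contains("ERROR")).map(_ => ("ERROR", 1)).reduceByKey(_ + _)
}

// Keep a running total of errors across batches
val updateFunc = (newCounts: Seq[Int], state: Option[Int]) =>
  Some(newCounts.sum + state.getOrElse(0))

ssc.checkpoint("/tmp/streaming-checkpoint")  // updateStateByKey needs a checkpoint directory
ssc.start()
ssc.awaitTermination()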
Pain Avoidance #2 – Split the Stream
Why do we even need stream + batch?
• Batch efficiencies
• Re-process to fix errors
• Re-process after delayed arrival
What if we could re-play data?
Let’s Re-Process with a new algorithm
[Diagram: the same Kafka topic is read by Streaming App v1 and Streaming App v2; each produces its own result set, and the downstream app can compare or switch between them]
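With Kafka underneath, replaying is mostly consumer configuration: give the v2 application its own consumer group and tell it to start from the earliest retained offset. A minimal sketch of the idea, assuming the newer Java consumer and placeholder broker, group, and topic names:

val props = new java.util.Properties()
props.put("bootstrap.servers", "broker1:9092")   // placeholder
props.put("group.id", "fraud-app-v2")            // fresh group, so no committed offsets yet
props.put("auto.offset.reset", "earliest")       // start from the beginning of the retained log
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new org.apache.kafka.clients.consumer.KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Arrays.asList("pharmacy.fraud.orders.validated"))
// ... run the new algorithm over every retained event and write "result set 2"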
Oh no, we just got a bunch of data for yesterday!
[Diagram: today’s and yesterday’s events sit in the same topic; one streaming app processes today’s data while another instance re-processes yesterday’s late arrivals]
Note:No need to choose between the approaches.
There are good reasons to do both.
Prediction:Batch vs. Streaming distinction is going away.
Yes, you really need a Schema
Schema is a MUST HAVE for data integration
[Diagram: the point-to-point tangle of clients and backends from before]
Remember that we want this?
[Diagram: the Kafka-based ingest architecture again: source systems producing into Kafka, with Hadoop, security systems, real-time monitoring, and the data warehouse consuming]
This means we need this:
[Diagram: the same architecture with a Schema Repository next to Kafka, so producers and consumers agree on record formats]
We can do it in a few ways:
• People go around asking each other: “So, what does the 5th field of the messages in topic Blah contain?”
• There’s utility code for reading/writing messages that everyone reuses
• Schema embedded in the message
• A centralized repository for schemas
  – Each message has a Schema ID
  – Each topic has a Schema ID
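A common way to combine the last two options is to prepend a small schema ID to every serialized message, so consumers can look the schema up in the repository before decoding. A minimal sketch of the framing; the repository lookup itself is not shown:

import java.nio.ByteBuffer

// Producer side: wrap an already-serialized (e.g. Avro-encoded) payload with a 4-byte schema ID
def frame(schemaId: Int, payload: Array[Byte]): Array[Byte] =
  ByteBuffer.allocate(4 + payload.length).putInt(schemaId).put(payload).array()

// Consumer side: read the ID, fetch that schema from the repository, then decode the rest
def unframe(message: Array[Byte]): (Int, Array[Byte]) = {
  val buf = ByteBuffer.wrap(message)
  val schemaId = buf.getInt()
  val payload = new Array[Byte](buf.remaining())
  buf.get(payload)
  (schemaId, payload)
}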
I ❤ Avro
• Define Schema
• Generate code for objects
• Serialize / Deserialize into Bytes or JSON
• Embed schema in files / records… or not
• Support for our favorite languages… Except Go.
• Schema Evolution
  – Add and remove fields without breaking anything
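As a small illustration of defining a schema and serializing with it, here is a sketch using Avro's generic API; the Order schema and its fields are invented for the example:

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

// Define a schema (normally kept in a .avsc file and registered in the schema repository)
val schemaJson = """
{"type": "record", "name": "Order", "namespace": "pharmacy.fraud",
 "fields": [
   {"name": "order_id", "type": "long"},
   {"name": "drug",     "type": "string"},
   {"name": "quantity", "type": "int"}
 ]}
"""
val schema = new Schema.Parser().parse(schemaJson)

// Build a record and serialize it to bytes
val record: GenericRecord = new GenericData.Record(schema)
record.put("order_id", 42L)
record.put("drug", "aspirin")
record.put("quantity", 3)

val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
encoder.flush()
val bytes = out.toByteArray  // ready to send to Kafka or write to HDFS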
Schemas are Agile
• Leave out MySQL and your favorite DBA for a second
• Schemas allow adding readers and writers easily
• Schemas allow modifying readers and writers independently
• Schemas can evolve as the system grows
• Allows validating data soon after it’s written
  – No need to throw away data that doesn’t fit!
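Schema evolution in practice: a reader can use a newer schema than the writer, as long as added fields carry defaults. A sketch that continues the invented Order example above (the schema and bytes values come from that sketch):

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// Newer reader schema: adds a "pharmacy_id" field with a default,
// so records written with the old schema still decode cleanly
val readerSchemaJson = """
{"type": "record", "name": "Order", "namespace": "pharmacy.fraud",
 "fields": [
   {"name": "order_id",    "type": "long"},
   {"name": "drug",        "type": "string"},
   {"name": "quantity",    "type": "int"},
   {"name": "pharmacy_id", "type": "string", "default": "unknown"}
 ]}
"""
val readerSchema = new Schema.Parser().parse(readerSchemaJson)

// Decode bytes written with the old schema using the new reader schema
val reader = new GenericDatumReader[GenericRecord](schema, readerSchema)
val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
val decoded = reader.read(null, decoder)
println(decoded.get("pharmacy_id"))  // prints "unknown", filled in from the default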
Woah, that was lots of stuff!
Recap – if you remember nothing else…
• After the POC, it’s time for production
• Goal: Evolve fast without breaking things

For this you need:
• Keep all data
• Design pipelines for error recovery – batch or stream
• Integrate with a data bus
• And Schemas
Thank you