Data Architectures for Robust Decision Making
![Page 1: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/1.jpg)
Designing Data Architectures for Robust Decision Making
Gwen Shapira / Software Engineer
![Page 2: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/2.jpg)
©2014 Cloudera, Inc. All rights reserved.
About Me
• 15 years of moving data around
• Formerly a consultant
• Now a Cloudera engineer:
– Sqoop committer
– Kafka
– Flume
• @gwenshap
![Page 3: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/3.jpg)
There’s a book on that!
![Page 4: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/4.jpg)
About you:
You know Hadoop
![Page 5: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/5.jpg)
“Big Data” is stuck at The Lab.
![Page 6: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/6.jpg)
We want to move to The Factory
![Page 7: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/7.jpg)
![Page 8: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/8.jpg)
What does it mean to “Systemize”?
• Ability to easily add new data sources
• Ability to easily improve and expand analytics
• Easier data access through standardized metadata and storage
• Ability to discover mistakes and to recover from them
• Ability to safely experiment with new approaches
![Page 9: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/9.jpg)
We will discuss:
• Architectures
• Patterns
• Ingest
• Storage
• Schemas
• Metadata
• Streaming
• Experimenting
• Recovery
We will not discuss:
• Actual decision making
• Data Science
• Machine learning
• Algorithms
![Page 10: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/10.jpg)
So how do we build real data architectures?
![Page 11: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/11.jpg)
The Data Bus
![Page 12: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/12.jpg)
Data pipelines start like this.
[Diagram: one Client connected to one Source]
![Page 13: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/13.jpg)
Then we reuse them
[Diagram: four Client boxes connected to one Source]
![Page 14: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/14.jpg)
Then we add consumers to the existing sources
[Diagram: four Client boxes, one Backend, and one Another Backend]
![Page 15: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/15.jpg)
Then it starts to look like this
[Diagram: four Client boxes, one Backend, and three Another Backend boxes]
![Page 16: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/16.jpg)
With maybe some of this
[Diagram: four Client boxes, one Backend, and three Another Backend boxes]
![Page 17: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/17.jpg)
Adding applications should be easier
We need:
• Shared infrastructure for sending records
• Infrastructure must scale
• Set of agreed-upon record schemas
![Page 18: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/18.jpg)
Kafka Based Ingest Architecture
Kafka decouples Data Pipelines
[Diagram: four Source Systems (producers) write to Kafka brokers; the consumers are Hadoop, Security Systems, Real-time monitoring, and the Data Warehouse]
![Page 19: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/19.jpg)
Retain All Data
![Page 20: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/20.jpg)
Data Pipeline – Traditional View
[Diagram: pipeline stages Raw data, Clean data, Enriched data, Aggregated data, annotated “Input”, “Output”, and “Waste of disk space”]
![Page 21: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/21.jpg)
It is all valuable data
[Diagram: pipeline stages Raw data, Clean data, Enriched data, Aggregated data, Filtered data, feeding Dashboard, Report, Data scientist, and Alerts; annotated “OMG”]
![Page 22: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/22.jpg)
Hadoop Based ETL – The FileSystem is the DB
/user/…
/user/gshapira/testdata/orders
/data/<database>/<table>/<partition>
/data/<biz unit>/<app>/<dataset>/<partition>
/data/pharmacy/fraud/orders/date=20131101
/etl/<biz unit>/<app>/<dataset>/<stage>
/etl/pharmacy/fraud/orders/validated
![Page 23: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/23.jpg)
Store intermediate data
/etl/<biz unit>/<app>/<dataset>/<stage>/<dataset_id>
/etl/pharmacy/fraud/orders/raw/date=20131101
/etl/pharmacy/fraud/orders/deduped/date=20131101
/etl/pharmacy/fraud/orders/validated/date=20131101
/etl/pharmacy/fraud/orders_labs/merged/date=20131101
/etl/pharmacy/fraud/orders_labs/aggregated/date=20131101
/etl/pharmacy/fraud/orders_labs/ranked/date=20131101
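The layout above can be generated rather than typed by hand. A minimal Scala sketch, assuming a hypothetical `EtlDataset` case class and `stagePath` helper (neither is from the talk); only the path pattern itself comes from the slide:

```scala
// Hypothetical helper; only the /etl/... pattern comes from the slide.
final case class EtlDataset(bizUnit: String, app: String, dataset: String) {
  // Path of one stage for one date partition, e.g. stage = "deduped", date = "20131101"
  def stagePath(stage: String, date: String): String =
    s"/etl/$bizUnit/$app/$dataset/$stage/date=$date"
}

object EtlDatasetExample {
  val orders = EtlDataset("pharmacy", "fraud", "orders")
  // Each stage keeps its own output directory for the same date partition.
  val stages = Seq("raw", "deduped", "validated")
  val paths: Seq[String] = stages.map(stage => orders.stagePath(stage, "20131101"))
}
```

Because every stage writes to its own directory, a failed or buggy stage can be re-run from the output of the stage before it.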
![Page 24: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/24.jpg)
Batch ETL is old news
![Page 25: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/25.jpg)
Small Problem!
• HDFS is optimized for large chunks of data
• Don’t write individual events or micro-batches
• Think 100 MB to 2 GB batches
• What do we do with small events?
![Page 26: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/26.jpg)
Well, we have this data bus…
[Diagram: a topic with three partitions (Partition 1, Partition 2, Partition 3), each an ordered sequence of numbered messages running from Old to New; writes are appended at the New end]
![Page 27: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/27.jpg)
Kafka has topics
How about?
<biz unit>.<app>.<dataset>.<stage>
pharmacy.fraud.orders.raw
pharmacy.fraud.orders.deduped
pharmacy.fraud.orders.validated
pharmacy.fraud.orders_labs.merged
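A naming convention is easier to keep when topic names are built and checked by code. The sketch below assumes a hypothetical `StageTopic` type (not from the talk) and uses only the `<biz unit>.<app>.<dataset>.<stage>` pattern shown above:

```scala
// Hypothetical helper; only the <biz unit>.<app>.<dataset>.<stage> pattern comes from the slide.
final case class StageTopic(bizUnit: String, app: String, dataset: String, stage: String) {
  def name: String = Seq(bizUnit, app, dataset, stage).mkString(".")
}

object StageTopic {
  // Parse a topic name back into its parts; None unless it has exactly four non-empty parts.
  def parse(topic: String): Option[StageTopic] =
    topic.split('.') match {
      case Array(b, a, d, s) if Seq(b, a, d, s).forall(_.nonEmpty) => Some(StageTopic(b, a, d, s))
      case _                                                      => None
    }
}
```

A consumer could use `parse` to reject topics that do not follow the convention before subscribing to them.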
![Page 28: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/28.jpg)
It’s (almost) all topics
[Diagram: pipeline stages Raw data, Clean data, Enriched data, Aggregated data, Filtered data, feeding Dashboard, Report, Data scientist, and Alerts; annotated “OMG”]
![Page 29: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/29.jpg)
Benefits
• Recover from accidents
• Debug suspicious results
• Fix algorithm errors
• Experiment with new algorithms
• Expand pipelines
• Jump-start expanded pipelines
![Page 30: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/30.jpg)
Kinda Lambda
![Page 31: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/31.jpg)
Lambda Architecture
• Immutable events
• Store intermediate stages
• Combine Batches and Streams
• Reprocessing
![Page 32: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/32.jpg)
What we don’t like
Maintaining two applications
Often in two languages
That do the same thing
![Page 33: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/33.jpg)
Pain Avoidance #1 – Use Spark + Spark Streaming
• Spark is awesome for batch, so why not?
– The New Kid that isn’t that New Anymore
– Easily 10x less code
– Extremely Easy and Powerful API
– Very good for machine learning
– Scala, Java, and Python
– RDDs
– DAG Engine
![Page 34: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/34.jpg)
Spark Streaming
• Calling Spark in a Loop
• Extends RDDs with DStream
• Very Few Code Changes from ETL to Streaming
![Page 35: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/35.jpg)
Spark Streaming
[Diagram: Source → Receiver → RDD, with one new RDD per batch (pre-first, first, and second batch); each batch’s RDD goes through a single pass of Filter → Count → Print]
![Page 36: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/36.jpg)
Small Example
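import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// ErrorCount.countErrors and updateFunc are defined elsewhere in the application.
val sparkConf = new SparkConf()
  .setMaster(args(0))
  .setAppName(this.getClass.getCanonicalName)
// One RDD is produced per 10-second batch
val ssc = new StreamingContext(sparkConf, Seconds(10))
// Create the DStream from data sent over the network
val dStream = ssc.socketTextStream(args(1), args(2).toInt, StorageLevel.MEMORY_AND_DISK_SER)
// Count the errors in each RDD in the stream
val errCountStream = dStream.transform(rdd => ErrorCount.countErrors(rdd))
// Running total across batches (updateStateByKey needs a checkpoint directory, set with ssc.checkpoint)
val stateStream = errCountStream.updateStateByKey[Int](updateFunc)
errCountStream.foreachRDD { rdd =>
  // take(1) instead of first(): first() throws on an empty batch
  rdd.take(1).foreach { case (_, count) => println("Errors in this batch: %d".format(count)) }
}
ssc.start()
ssc.awaitTermination()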
![Page 37: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/37.jpg)
Pain Avoidance #2 – Split the Stream
Why do we even need stream + batch?
• Batch efficiencies
• Re-process to fix errors
• Re-process after delayed arrival
What if we could re-play data?
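The replay idea can be sketched without Kafka. The example below is an in-memory stand-in, not the Kafka API: `ReplayableLog`, `countErrorsV1`, and `countErrorsV2` are hypothetical names, and the sample records are made up for illustration. It shows why a retained log makes re-processing cheap: a new version of the application simply reads again from offset 0.

```scala
import scala.collection.mutable.ArrayBuffer

// In-memory stand-in for one partition of a retained log; not the Kafka API.
final class ReplayableLog[A] {
  private val records = ArrayBuffer.empty[A]
  // Append a record and return its offset.
  def append(record: A): Long = { records += record; records.size - 1L }
  // Read everything from `offset` (inclusive) to the current end of the log.
  def readFrom(offset: Long): Seq[A] = records.drop(offset.toInt).toList
}

object ReplayExample {
  val log = new ReplayableLog[String]
  Seq("ok", "ERROR disk", "ok", "error net").foreach(log.append)

  // v1 has a bug: it misses lower-case "error".
  def countErrorsV1(lines: Seq[String]): Int = lines.count(_.startsWith("ERROR"))
  def countErrorsV2(lines: Seq[String]): Int = lines.count(_.toLowerCase.startsWith("error"))

  // The log still holds every record, so v2 starts again at offset 0
  // and produces a second result set next to v1's.
  val resultV1: Int = countErrorsV1(log.readFrom(0L))
  val resultV2: Int = countErrorsV2(log.readFrom(0L))
}
```

Once the second result set has caught up and looks right, readers can be switched to it and the first one dropped.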
![Page 38: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/38.jpg)
Let’s re-process with a new algorithm
[Diagram: a log of numbered messages is read by Streaming App v1 and Streaming App v2, which produce Result set 1 and Result set 2 for the App]
![Page 39: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/39.jpg)
![Page 40: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/40.jpg)
Oh no, we just got a bunch of data for yesterday!
[Diagram: a log of numbered messages read by two Streaming App instances, one labeled Today and one labeled Yesterday]
![Page 41: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/41.jpg)
Note:
No need to choose between the approaches.
There are good reasons to do both.
![Page 42: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/42.jpg)
Prediction:
Batch vs. Streaming distinction is going away.
![Page 43: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/43.jpg)
Yes, you really need a Schema
![Page 44: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/44.jpg)
Schema is a MUST HAVE for data integration
![Page 45: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/45.jpg)
[Diagram: four Client boxes, one Backend, and three Another Backend boxes]
![Page 46: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/46.jpg)
Remember that we want this?
[Diagram: four Source Systems (producers) write to Kafka brokers; the consumers are Hadoop, Security Systems, Real-time monitoring, and the Data Warehouse]
![Page 47: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/47.jpg)
This means we need this:
[Diagram: four Source Systems; Kafka together with a Schema Repository; Hadoop, Security Systems, Real-time monitoring, and the Data Warehouse]
![Page 48: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/48.jpg)
We can do it in a few ways
• People go around asking each other: “So, what does the 5th field of the messages in topic Blah contain?”
• There’s utility code for reading/writing messages that everyone reuses
• Schema embedded in the message
• A centralized repository for schemas
– Each message has a Schema ID
– Each topic has a Schema ID
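The “each message has a Schema ID” option can be sketched as follows. Everything here is an assumption for illustration: `SchemaRepository` is a hypothetical in-memory class (a real repository would be a shared service), and the 4-byte ID prefix is one possible message layout, not a format prescribed by the talk.

```scala
import java.nio.ByteBuffer

// Hypothetical in-memory repository; a real one would be a shared service.
final class SchemaRepository {
  private var schemas = Vector.empty[String]
  // Register a schema definition and return its ID (re-registering returns the same ID).
  def register(schema: String): Int = {
    val existing = schemas.indexOf(schema)
    if (existing >= 0) existing else { schemas = schemas :+ schema; schemas.size - 1 }
  }
  def lookup(id: Int): Option[String] = schemas.lift(id)
}

object Framing {
  // Message layout assumed here: 4-byte schema ID followed by the payload bytes.
  def frame(schemaId: Int, payload: Array[Byte]): Array[Byte] =
    ByteBuffer.allocate(4 + payload.length).putInt(schemaId).put(payload).array()

  // Split a framed message back into (schema ID, payload).
  def unframe(message: Array[Byte]): (Int, Array[Byte]) = {
    val buf = ByteBuffer.wrap(message)
    val id = buf.getInt()
    val payload = new Array[Byte](buf.remaining())
    buf.get(payload)
    (id, payload)
  }
}
```

The writer registers its schema once and prefixes every message with the returned ID; a reader calls `unframe`, looks the ID up in the repository, and then knows how to decode the payload without asking anyone what the 5th field means.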
![Page 49: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/49.jpg)
I ♥ Avro
• Define Schema
• Generate code for objects
• Serialize / Deserialize into Bytes or JSON
• Embed schema in files / records… or not
• Support for our favorite languages… Except Go.
• Schema Evolution
– Add and remove fields without breaking anything
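To see why adding and removing fields need not break readers, here is a simplified illustration of reader-side schema resolution. It is not the Avro API: `Field`, `Schema`, and `SchemaResolution.resolve` are hypothetical names, and records are modeled as plain maps. The rules it follows mirror the idea behind Avro's resolution: unknown fields are ignored, and missing fields are filled from the reader's defaults.

```scala
// Simplified illustration of reader-side schema resolution; this is not the Avro API.
final case class Field(name: String, default: Option[Any] = None)
final case class Schema(fields: Seq[Field])

object SchemaResolution {
  // Read a record written with some other schema version:
  //  - fields the reader does not know are dropped
  //  - fields the writer did not have are filled from the reader's default
  //  - a missing field with no default is an incompatibility
  def resolve(reader: Schema, written: Map[String, Any]): Either[String, Map[String, Any]] = {
    val resolved = reader.fields.map { f =>
      written.get(f.name).orElse(f.default)
        .map(f.name -> _)
        .toRight(s"missing field without default: ${f.name}")
    }
    resolved.collectFirst { case Left(err) => err } match {
      case Some(err) => Left(err)
      case None      => Right(resolved.collect { case Right(kv) => kv }.toMap)
    }
  }
}
```

With this rule, a reader that adds a `currency` field with a default can still read old records, and a reader that no longer lists a field simply stops seeing it.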
![Page 50: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/50.jpg)
Schemas are Agile
• Leave out MySQL and your favorite DBA for a second
• Schemas allow adding readers and writers easily
• Schemas allow modifying readers and writers independently
• Schemas can evolve as the system grows
• Schemas allow validating data soon after it’s written
– No need to throw away data that doesn’t fit!
![Page 51: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/51.jpg)
![Page 52: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/52.jpg)
Whoa, that was lots of stuff!
![Page 53: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/53.jpg)
Recap – if you remember nothing else…
• After the POC, it’s time for production
• Goal: Evolve fast without breaking things
For this you need to:
• Keep all data
• Design pipelines for error recovery – batch or stream
• Integrate with a data bus
• And use schemas
![Page 54: Data Architectures for Robust Decision Making](https://reader033.vdocument.in/reader033/viewer/2022042614/55a512f81a28ab532d8b47f1/html5/thumbnails/54.jpg)
Thank you