
Designing Data Architectures for Robust Decision Making

Gwen Shapira / Software Engineer


About Me
• 15 years of moving data around
• Formerly consultant
• Now Cloudera Engineer:
  – Sqoop Committer
  – Kafka
  – Flume
• @gwenshap


There’s a book on that!


About you:

You know Hadoop

“Big Data” is stuck at The Lab.


We want to move to The Factory


What does it mean to “Systemize”?
• Ability to easily add new data sources
• Easily improve and expand analytics
• Ease data access by standardizing metadata and storage
• Ability to discover mistakes and to recover from them
• Ability to safely experiment with new approaches


We will discuss:
• Architectures
• Patterns
• Ingest
• Storage
• Schemas
• Metadata
• Streaming
• Experimenting
• Recovery

We will not discuss:
• Actual decision making
• Data Science
• Machine learning
• Algorithms


So how do we build real data architectures?


The Data Bus

Data Pipelines Start like this.
[Diagram: a single client connected to a single source.]

Then we reuse them.
[Diagram: several clients connected to the same source.]

Then we add consumers to the existing sources.
[Diagram: several clients feeding a backend, with another backend added as a consumer.]

Then it starts to look like this.
[Diagram: several clients connected point-to-point to a backend and three other backends.]

With maybe some of this.
[Diagram: the same clients and backends, now with connections between the backends as well.]


Adding applications should be easier. We need:
• Shared infrastructure for sending records
• Infrastructure must scale
• Set of agreed-upon record schemas


Kafka Based Ingest Architecture


Kafka decouples Data Pipelines.
[Diagram: source systems publish through producers to Kafka brokers; Hadoop, security systems, real-time monitoring, and the data warehouse each read through their own consumers.]
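To make the decoupling concrete, here is a minimal producer sketch (not from the slides; the broker address, topic name, and JSON payload are made up for illustration). The producer only knows about the topic; Hadoop, the warehouse, and the monitoring system each attach their own consumers without touching this code.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object OrderProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")   // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Publish to the shared topic; downstream consumers are added independently.
    producer.send(new ProducerRecord[String, String]("orders", "order-42", """{"amount":17.5}"""))
    producer.close()
  }
}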


Retain All Data


Data Pipeline – Traditional View:
[Diagram: raw data is cleaned, enriched, and aggregated; only the raw input and the final output are kept, and the intermediate clean and enriched datasets are treated as a waste of disk space.]


It is all valuable data:
[Diagram: raw, clean, enriched, aggregated, and filtered datasets each feed something useful – dashboards, reports, data scientists, alerts, or the occasional “OMG” moment – so every intermediate stage is worth retaining.]


Hadoop Based ETL – the filesystem is the DB
/user/…
/user/gshapira/testdata/orders

/data/<database>/<table>/<partition>
/data/<biz unit>/<app>/<dataset>/partition
/data/pharmacy/fraud/orders/date=20131101

/etl/<biz unit>/<app>/<dataset>/<stage>
/etl/pharmacy/fraud/orders/validated


Store intermediate data:
/etl/<biz unit>/<app>/<dataset>/<stage>/<dataset_id>
/etl/pharmacy/fraud/orders/raw/date=20131101
/etl/pharmacy/fraud/orders/deduped/date=20131101
/etl/pharmacy/fraud/orders/validated/date=20131101
/etl/pharmacy/fraud/orders_labs/merged/date=20131101
/etl/pharmacy/fraud/orders_labs/aggregated/date=20131101
/etl/pharmacy/fraud/orders_labs/ranked/date=20131101
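As a sketch of how one intermediate stage might be produced under this convention (assumed, not from the deck; the "dedup" here is just distinct() on text lines):

import org.apache.spark.{SparkConf, SparkContext}

object DedupeOrdersSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dedupe-orders"))
    // Read the raw stage for the day...
    val raw = sc.textFile("/etl/pharmacy/fraud/orders/raw/date=20131101")
    // ...and persist the deduped stage so later stages and re-runs can start from it.
    raw.distinct().saveAsTextFile("/etl/pharmacy/fraud/orders/deduped/date=20131101")
    sc.stop()
  }
}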


Batch ETL is old news


Small Problem!
• HDFS is optimized for large chunks of data
• Don’t write individual events or micro-batches
• Think 100MB–2GB batches
• What do we do with small events?


Well, we have this data bus…

[Diagram: anatomy of a Kafka topic – an append-only log split into partitions 1–3, each an ordered sequence of numbered offsets; writes go to the new end of each partition while old records remain available at the beginning.]


Kafka has topics. How about:

<biz unit>.<app>.<dataset>.<stage>
pharmacy.fraud.orders.raw
pharmacy.fraud.orders.deduped
pharmacy.fraud.orders.validated
pharmacy.fraud.orders_labs.merged
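A tiny sketch of that convention (an illustration, not part of the deck): encode the naming rule once so every producer and consumer builds topic names the same way.

// Build topic names as <biz unit>.<app>.<dataset>.<stage>
def stageTopic(bizUnit: String, app: String, dataset: String, stage: String): String =
  s"$bizUnit.$app.$dataset.$stage"

// stageTopic("pharmacy", "fraud", "orders", "validated")  // "pharmacy.fraud.orders.validated"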


It’s (almost) all topics:
[Diagram: the same pipeline, with raw, clean, enriched, aggregated, and filtered data living in Kafka topics that feed dashboards, reports, data scientists, and alerts.]


Benefits
• Recover from accidents
• Debug suspicious results
• Fix algorithm errors
• Experiment with new algorithms
• Expand pipelines
• Jump-start expanded pipelines


Kinda Lambda


Lambda Architecture
• Immutable events
• Store intermediate stages
• Combine Batches and Streams
• Reprocessing


What we don’t like

Maintaining two applications, often in two languages, that do the same thing.


Pain Avoidance #1 – Use Spark + Spark Streaming
• Spark is awesome for batch, so why not?
  – The New Kid that isn’t that New Anymore
  – Easily 10x less code
  – Extremely Easy and Powerful API
  – Very good for machine learning
  – Scala, Java, and Python
  – RDDs
  – DAG Engine


Spark Streaming
• Calling Spark in a Loop
• Extends RDDs with DStream
• Very Little Code Changes from ETL to Streaming


Spark Streaming
[Diagram: micro-batch execution – before the first batch, a receiver pulls data from the source into RDDs; in each batch interval the accumulated RDDs run through a single pass of the job (filter → count → print) while the receiver keeps collecting RDDs for the next batch.]


Small Example

// A lightly cleaned-up version of the slide’s code; ErrorCount.countErrors and
// updateFunc are helpers defined elsewhere in the original example.
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setMaster(args(0)).setAppName(this.getClass.getCanonicalName)
val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint("/tmp/checkpoint")  // updateStateByKey needs a checkpoint directory

// Create the DStream from data sent over the network
val dStream = ssc.socketTextStream(args(1), args(2).toInt, StorageLevel.MEMORY_AND_DISK_SER)

// Counting the errors in each RDD in the stream
val errCountStream = dStream.transform(rdd => ErrorCount.countErrors(rdd))
val stateStream = errCountStream.updateStateByKey[Int](updateFunc)  // running totals across batches

errCountStream.foreachRDD(rdd => {
  System.out.println("Errors this minute:%d".format(rdd.first()._2))
})

ssc.start()
ssc.awaitTermination()


Pain Avoidance #2 – Split the Stream
Why do we even need stream + batch?
• Batch efficiencies
• Re-process to fix errors
• Re-process after delayed arrival

What if we could re-play data?
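One way to re-play (a sketch under the assumption of plain Kafka consumer APIs; the topic and group names are made up): point a consumer with a fresh group.id and auto.offset.reset=earliest at the retained topic, and it reads the log from the oldest record onward.

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("group.id", "fraud-orders-reprocess-v2")   // new group => no committed offsets
props.put("auto.offset.reset", "earliest")           // so we start at the oldest retained record
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(List("pharmacy.fraud.orders.raw").asJava)
val records = consumer.poll(1000)   // hand these to the new version of the streaming app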


Let’s re-process with a new algorithm:
[Diagram: the same retained log (offsets 0–13) feeds both Streaming App v1, which keeps producing Result set 1, and Streaming App v2, which replays the log from the start to build Result set 2; a downstream app can then read either result set.]


Oh no, we just got a bunch of data for yesterday!
[Diagram: because the log is retained, one streaming app keeps processing today’s data while a second streaming app re-reads the log to process yesterday’s late-arriving records.]


Note: no need to choose between the approaches.

There are good reasons to do both.


Prediction: the Batch vs. Streaming distinction is going away.


Yes, you really need a Schema


Schema is a MUST HAVE for data integration

[Diagram: the earlier point-to-point tangle of clients, a backend, and several other backends, shown again.]


Remember that we want this?

[Diagram: the Kafka-based architecture again – source systems publishing through producers to Kafka brokers, with Hadoop, security systems, real-time monitoring, and the data warehouse reading through consumers.]


This means we need this:
[Diagram: the same architecture with a schema repository alongside Kafka, so every producer and consumer agrees on the schemas of the records flowing through the bus.]


We can do it in a few ways:
• People go around asking each other:
  “So, what does the 5th field of the messages in topic Blah contain?”
• There’s utility code for reading/writing messages that everyone reuses
• Schema embedded in the message
• A centralized repository for schemas
  – Each message has a Schema ID
  – Each topic has a Schema ID
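As an illustration of the “each message has a Schema ID” option (a generic sketch, not any particular product’s wire format): prefix the serialized payload with the numeric ID the repository assigned to its schema.

import java.nio.ByteBuffer

// Frame a payload as [4-byte schema id][serialized bytes]; the consumer reads the id,
// looks the schema up in the repository, and only then decodes the payload.
def frame(schemaId: Int, payload: Array[Byte]): Array[Byte] = {
  val buf = ByteBuffer.allocate(4 + payload.length)
  buf.putInt(schemaId)
  buf.put(payload)
  buf.array()
}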


I ❤ Avro
• Define Schema
• Generate code for objects
• Serialize / Deserialize into Bytes or JSON
• Embed schema in files / records… or not
• Support for our favorite languages… Except Go.
• Schema Evolution
  – Add and remove fields without breaking anything
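A hedged example of that workflow using Avro’s generic API (the Order schema and its fields are invented for illustration): define the schema, build a record, and serialize it to bytes that can go onto the bus or into a file.

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

val orderSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Order","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"amount","type":"double"}
    |]}""".stripMargin)

val order = new GenericData.Record(orderSchema)
order.put("id", 42L)
order.put("amount", 17.5)

val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](orderSchema).write(order, encoder)
encoder.flush()
val orderBytes = out.toByteArray   // ready to publish or store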


Schemas are Agile
• Leave out MySQL and your favorite DBA for a second
• Schemas allow adding readers and writers easily
• Schemas allow modifying readers and writers independently
• Schemas can evolve as the system grows
• Allow validating data soon after it’s written
  – No need to throw away data that doesn’t fit!
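Continuing the Avro sketch above (still with invented field names), schema evolution in practice: a reader schema that adds a "channel" field with a default can decode orderBytes written with the old schema, and old records simply pick up the default instead of being thrown away.

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

val readerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Order","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"amount","type":"double"},
    |  {"name":"channel","type":"string","default":"web"}
    |]}""".stripMargin)

// writer schema = what the bytes were written with; reader schema = what we want now
val reader = new GenericDatumReader[GenericRecord](orderSchema, readerSchema)
val decoder = DecoderFactory.get().binaryDecoder(orderBytes, null)
val decoded = reader.read(null, decoder)
println(decoded.get("channel"))   // "web" – the default, since the old record never had the field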


Woah, that was lots of stuff!


Recap – if you remember nothing else…
• After the POC, it’s time for production
• Goal: Evolve fast without breaking things

For this you need:
• Keep all data
• Design pipelines for error recovery – batch or stream
• Integrate with a data bus
• And schemas

Thank you
