kansas city big data: the future of insights - keynote: "big data technologies and...

Big Data Technologies and Techniques. Ryan Brush, Distinguished Engineer, Cerner Corporation (@ryanbrush)

Upload: kcitp

Post on 26-Jan-2015


DESCRIPTION

Kansas City IT Professionals, a grassroots tech community of 9,000+ members, held an event on August 30th, 2012 entitled Big Data: The Future Of Insights (see: http://kcitp.me/M67S9M). The event consisted of 2 keynotes and a panel with expert data scientists, engineers, and data analysts from companies like Adknowledge and Cerner. This talk, entitled "Big Data Technologies and Techniques", was delivered by Ryan Brush, Distinguished Engineer with Cerner.

TRANSCRIPT

Page 1: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"

Big Data Technologies and Techniques

Ryan Brush, Distinguished Engineer, Cerner Corporation

@ryanbrush

Pages 2-5

Relational Databases are Awesome

Atomic, transactional updates

Declarative queries

Guaranteed consistency

Easy to reason about

Long track record of success

…so use them!

But…

Page 6

Those advantages have a cost

Global, atomic state means global, atomic coordination

Coordination does not scale linearly

Pages 7-8

The costs of coordination

Remember the network effect?

2 nodes = 1 channel

5 nodes = 10 channels

12 nodes = 66 channels

25 nodes = 300 channels
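
The channel counts on this slide match the number of distinct pairs among n nodes, n(n-1)/2. A minimal sketch, assuming every node may need to coordinate with every other node (the function name `channels` is illustrative and not from the talk):

```python
def channels(nodes: int) -> int:
    """Number of pairwise channels among `nodes` fully connected nodes."""
    return nodes * (nodes - 1) // 2


# Reproduce the node counts used on the slide.
for n in (2, 5, 12, 25):
    print(f"{n} nodes -> {channels(n)} channel(s)")
```

The count grows roughly with the square of the node count, which is why coordination does not scale linearly.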

Page 9

So we better be able to scale

Page 10

The costs of coordination

Databases have optimized this in many clever ways, but a limit on scalability still exists

Page 11

Let’s look at some ways to scale

Pages 12-15

Bulk processing billions of records

Data aggregation and storage

Real-time processing of updates

Serving data for: Online Apps, Analytics

Page 16

Let’s start with scalability of bulk processing

Pages 17-20

Quiz: which one is scalable?

1000-node Hadoop cluster where jobs depend on a common process

1000 Windows ME machines running independent Excel macros

Pages 21-22

Independence → Parallelizable

Parallelizable → Scalable

Pages 23-25

“Shared Nothing” architectures are the most scalable…

…but most real-world problems require us to share something…

…so our designs usually have a parallel part and a serial part

Page 26

The key is to make sure the vast majority of our work in the cloud is independent and parallelizable.

Page 27

Amdahl’s Law

$$S = \frac{1}{(1 - P) + \frac{P}{N}}$$

S: speed improvement

P: ratio of the problem that can be parallelized

N: number of processors
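
To make the effect of the serial part concrete, here is a small sketch of Amdahl's Law in its standard form, S = 1 / ((1 - P) + P / N), using the slide's symbols. The function names are illustrative, not from the talk:

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speed improvement S for parallelizable ratio p on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


def max_speedup(p: float) -> float:
    """Upper bound on S as n grows without limit: 1 / (1 - p)."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


# Even with 1000 processors, a job that is 95% parallel stays below 20x.
print(amdahl_speedup(0.95, 1000), max_speedup(0.95))
```

The bound 1 / (1 - P) is why the talk stresses keeping the vast majority of the work independent: the serial remainder caps the gain no matter how many machines are added.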

Page 28

MapReduce Primer

[Diagram: Input Data is divided into Split 1 through Split N; each split feeds its own mapper, Mapper 1 through Mapper N (Map Phase); mapper output passes through the Shuffle to Reducer 1 through Reducer N (Reduce Phase).]

Page 29

MapReduce Example: Word Count

[Diagram: Books are the input; each mapper counts words per book (Map Phase); the Shuffle routes the counts to reducers that sum words A-C, sum words D-E, and so on through sum words W-Z (Reduce Phase).]
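
The word count flow can be imitated in a single process to show where each phase sits. This is a minimal sketch in plain Python, not Hadoop code; the function names are illustrative, and the partition rule (first character of the word) is an assumed stand-in for the slide's letter ranges:

```python
from collections import defaultdict


def map_book(text):
    """Map phase: count words within one book, independently of all other books."""
    counts = defaultdict(int)
    for word in text.lower().split():
        counts[word] += 1
    return counts


def partition(word, num_reducers):
    """Pick a reducer from the word's first character (stand-in for ranges like A-C)."""
    return ord(word[0]) % num_reducers


def shuffle(mapped, num_reducers):
    """Shuffle: route counts so that a given word always reaches the same reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for counts in mapped:
        for word, count in counts.items():
            buckets[partition(word, num_reducers)][word].append(count)
    return buckets


def reduce_bucket(bucket):
    """Reduce phase: sum the per-book counts for each word in one reducer's bucket."""
    return {word: sum(counts) for word, counts in bucket.items()}


def word_count(books, num_reducers=3):
    mapped = [map_book(text) for text in books]                 # independent per book
    reduced = [reduce_bucket(b) for b in shuffle(mapped, num_reducers)]
    total = {}
    for part in reduced:                                        # small serial combine step
        total.update(part)
    return total
```

Because each word lands in exactly one bucket, the final combine step only concatenates disjoint results and never has to re-add counts.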

Pages 30-31

Notice there is still a serial part of the problem: the output of the reducers must be combined

…but this is much smaller, and can be handled by a single process

Pages 32-33

Also notice that the network is a shared resource when processing big data

So rather than moving data to computation, we move computation to data.

Page 34

MapReduce Data Locality

[Diagram: the same splits, mappers, Shuffle, and reducers as in the MapReduce Primer, with a legend marking which boxes are a physical machine.]

Pages 35-37

Data locality is only guaranteed in the Map phase

So the most data-intensive work should be done in the map, with smaller sets sent to the reducer

Some Map/Reduce jobs have no reducer at all!

Page 38

MapReduce Gone Wrong

[Diagram: the word count job (mappers count words per book; reducers sum words A-C, D-E, through W-Z), now with a Word Addition Service attached to the job.]

Page 39

Even if our Word Addition Service is scalable, we’d need to scale it to the size of the largest Map/Reduce job that will ever use it

Pages 40-41

So for data processing, prefer embedded libraries over remote services

Use remote services for configuration, to prime caches, etc. – just not for every data element!
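
One way to read this advice: call the remote service once per mapper to prime a local cache, then do all per-record work locally. A minimal sketch under that assumption; `LookupMapper`, `make_fake_service`, and the sample table are hypothetical names and data, not part of the talk or of any Hadoop API:

```python
class LookupMapper:
    """Mapper that primes a local cache once, then resolves every record locally."""

    def __init__(self, fetch_table):
        # One remote call per mapper, made up front to prime the cache.
        self.table = dict(fetch_table())

    def map(self, code):
        # Local dictionary lookup: no network round trip per data element.
        return self.table.get(code, "unknown")


def make_fake_service():
    """Hypothetical stand-in for a remote service; counts how often it is called."""
    stats = {"calls": 0}

    def fetch_table():
        stats["calls"] += 1
        return {"a": "alpha", "b": "beta"}

    return fetch_table, stats


fetch_table, stats = make_fake_service()
mapper = LookupMapper(fetch_table)
labels = [mapper.map(code) for code in ["a", "b", "a", "z"]]
```

With this shape, load on the remote service grows with the number of mappers rather than with the number of records.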

Page 42

Joining a billion records

Word counts are great, but many real-world problems mean bringing together multiple datasets.

So how do we “join” with MapReduce?

Page 43

Map-Side Joins

[Diagram: Data Set 1 is divided into Split 1, Split 2, and Split 3, read by Mapper 1, Mapper 2, and Mapper 3; a copy of Data Set 2 sits next to each mapper (Map Phase); mapper output passes through the Shuffle to the reducers (Reduce Phase).]

When joining one big input to a small one, simply copy the small data set to each mapper
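
A map-side join can be sketched as a dictionary lookup inside each mapper. The example below is plain Python rather than Hadoop code, assumes inner-join semantics (unmatched keys are dropped), and uses made-up sample records:

```python
def map_side_join(big_split, small_table):
    """Join one split of the big data set against an in-memory copy of the small one."""
    joined = []
    for key, value in big_split:
        if key in small_table:  # local lookup, so no shuffle is needed for the join
            joined.append((key, value, small_table[key]))
    return joined


# The small data set is copied to every mapper (here: passed to every call).
small = {1: "Alice", 2: "Bob"}
splits = [
    [(1, "order-a"), (3, "order-b")],
    [(2, "order-c"), (1, "order-d")],
]
results = [row for split in splits for row in map_side_join(split, small)]
```

This only works while the small data set fits comfortably in each mapper's memory; otherwise the join has to move to the reducers.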

Page 44

Merge in Reducer

[Diagram: Data Set 1 and Data Set 2 are each divided into Split 1, Split 2, and Split 3; every split is grouped by key (Map Phase); the Shuffle sends the groups to Reducer 1 through Reducer N (Reduce Phase).]

Route common items to the same reducer
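
When neither input is small, the join can happen in the reducer: both data sets are keyed the same way so that matching records meet after the shuffle. A single-process sketch with illustrative names (the "left"/"right" tags are an assumption used to tell the inputs apart, and the join is an inner join):

```python
from collections import defaultdict
from itertools import product


def map_tagged(records, tag):
    """Map phase: emit (key, (tag, value)) so the reducer knows each value's origin."""
    return [(key, (tag, value)) for key, value in records]


def shuffle(pairs):
    """Shuffle: group all tagged values sharing a key, as if routed to one reducer."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def reduce_join(key, tagged_values):
    """Reduce phase: pair every left value with every right value for this key."""
    left = [value for tag, value in tagged_values if tag == "left"]
    right = [value for tag, value in tagged_values if tag == "right"]
    return [(key, l, r) for l, r in product(left, right)]


def reduce_side_join(left_records, right_records):
    pairs = map_tagged(left_records, "left") + map_tagged(right_records, "right")
    joined = []
    for key, tagged_values in shuffle(pairs).items():
        joined.extend(reduce_join(key, tagged_values))
    return joined
```

Unlike the map-side variant, every record crosses the network during the shuffle, so this approach costs more but places no size limit on either input.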

Page 45

Higher-Level Constructs

MapReduce is a primitive operation for higher-level constructs

Hive, Pig, Cascading, and Crunch all compile into MapReduce

Crunch!

Use one!

Pages 46-52

MapReduce and MPP Databases

MapReduce | MPP Databases
Oriented towards unstructured or semi-structured data | Oriented towards structured data
Java or Domain-Specific Languages (e.g., Pig and Hive) | SQL
Data in a distributed filesystem | Data in sharded relational databases
Poor support for iterative operations | Good support for iterative operations
Arbitrarily complex programs running next to data | SQL and User-Defined Functions running next to data
Poor interactive query support | Good interactive query support

Pages 53-54

MapReduce and MPP Databases

…are complementary!

Use Map/Reduce to clean, normalize, reconcile and codify data to load into an MPP system for interactive analysis

Page 55

Bulk processing of millions of records

Data aggregation and storage

Pages 56-60

Hadoop Distributed Filesystem

Scales to many petabytes

Splits all files into blocks and spreads them across data nodes

The name node keeps track of what blocks belong to what file

All blocks written in triplicate

Write and append only – no random updates!
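
The block-and-replica idea can be shown with a toy model. This is a sketch only: the round-robin placement below is an assumption made for illustration and is not HDFS's actual placement policy, and the function names are invented:

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, data_nodes, replication: int = 3):
    """Assign each block to `replication` distinct data nodes, round-robin.

    Returns the kind of mapping a name node keeps: block index -> node names.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    return {
        block: [data_nodes[(block + r) % len(data_nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Losing any single data node in this model still leaves two copies of every block, which is the point of writing in triplicate.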

Page 61

HDFS Writes

[Diagram: the Client looks up a data node through the Name Node, writes a block to that data node, and the block is replicated to other data nodes (Data Node 1 through Data Node N).]

Page 62

HDFS Reads

[Diagram: the Client looks up block locations through the Name Node, then reads the blocks from the data nodes (Data Node 1 through Data Node N).]

Pages 63-64

HDFS Shortcomings

No random reads

No random writes

Doesn’t deal with many small files

Enter HBase

“Random Access To Your Planet-Size Data”

Pages 65-67

HBase

Emulates random I/O with a Write Ahead Log (WAL)

Periodically flushes log to sorted files

Files accessible as tables, split across many regions, hosted by region servers

Preserves scalability, data locality, and Map/Reduce features of Hadoop
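
The "append to a log, then flush to sorted files" idea can be illustrated with a toy key/value store. This is a heavily simplified sketch and not HBase's implementation; the class name, the in-memory buffer, and the flush threshold are all assumptions made for the example:

```python
import bisect


class TinyLogStore:
    """Toy store: writes are appended to a log and flushed to immutable sorted files."""

    def __init__(self, flush_threshold=3):
        self.wal = []            # append-only record of writes not yet flushed
        self.memstore = {}       # in-memory view of those same writes
        self.sorted_files = []   # each entry is a list of (key, value) sorted by key
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))   # sequential append, never an in-place update
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memstore:
            self.sorted_files.append(sorted(self.memstore.items()))
            self.memstore = {}
            self.wal = []               # flushed writes no longer need the log

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for sorted_file in reversed(self.sorted_files):   # newest file wins
            keys = [k for k, _ in sorted_file]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return sorted_file[i][1]
        return None
```

An "update" here is just a newer write that shadows the older one at read time, which is how random writes can be emulated on top of storage that only supports writes and appends.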

Pages 68-71

Use HBase when:

You have noisy, semi-structured data

You want to apply massively parallel processing to your problem

You need to handle huge write loads

You need a scalable key/value store

Page 72

But there are drawbacks:

Limited schema support

Limited atomicity guarantees

No built-in secondary indexes

HBase is a great tool for many jobs, but not every job

Page 73

The data store should align with the needs of the application

Page 74

So a pattern is emerging:

[Diagram: a pipeline with the stages Collection, Aggregation, Processing, and Storage. Other labels: Millennium, CCDs, Claims, HL7; Hadoop with HBase; MapReduce Jobs; HBase, MPP, Relational, Document Store.]

Page 75

But we have a potential bottleneck

[Diagram: the same pipeline as on Page 74.]

Page 76

Direct inserts are designed for online updates, not massively parallel data loads

So shift the work into MapReduce, and pre-build files for bulk import

Oracle Loader for Hadoop

HBase HFile Import

Bulk Loads for MPP

Pages 77-78

And we’re missing an important piece:

[Diagram: the same pipeline as on Page 74, with Realtime Processing added alongside Map/Reduce Jobs (batch).]

Pages 79-81

How do we make it fast?

Batch Layer | Speed Layer
High Latency (minutes or hours to process) | Low Latency (seconds to process)
Move computation to data | Move data to computation
Years of data | Hours of data
Bulk loads | Incremental updates
MapReduce, Hadoop | Storm, Complex Event Processing

http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems
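
The slides do not spell out how results from the two layers are read together. One common arrangement, assumed here purely for illustration, is to answer queries by combining a periodically recomputed batch result with incrementally maintained recent counts. All names below are hypothetical:

```python
def batch_view(events):
    """Batch layer: recompute counts from scratch over the full history."""
    view = {}
    for key in events:
        view[key] = view.get(key, 0) + 1
    return view


class SpeedView:
    """Speed layer: counts updated one event at a time since the last batch run."""

    def __init__(self):
        self.counts = {}

    def update(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1


def query(key, batch, speed):
    """Combine the batch result with the recent increments."""
    return batch.get(key, 0) + speed.counts.get(key, 0)


batch = batch_view(["a", "b", "a"])   # years of data, bulk processed
speed = SpeedView()                   # hours of data, incremental updates
speed.update("a")
speed.update("c")
```

In this arrangement the speed layer only has to cover the gap since the last batch run, so its state can be discarded and rebuilt whenever the batch layer catches up.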

Page 82

And now, the challenge…

Pages 83-84

Process all data overnight

Quickly create new data models

Simple correction of any bugs

Fast iteration cycles mean fast innovation

Much easier to understand and work with

Page 85

Questions?