The Promise and Peril of Abundance: Making Big Data Small. Brendan McAdams at Big Data Spain 2012



DESCRIPTION

Session presented at the Big Data Spain 2012 conference, 16 Nov 2012, ETSI Telecomunicación, UPM, Madrid. www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/the-promise-and-peril-of-abundance-making-big-data-small/brendan-mcadams

TRANSCRIPT

Page 1

Brendan McAdams, 10gen, Inc.

[email protected]
@rit

A Modest Proposal for Taming and Clarifying the Promises of Big Data and the Software Driven Future

Page 2

"In short, software is eating the world."- Marc Andreesen Wall Street Journal, Aug. 2011 http://on.wsj.com/XLwnmo

Page 3

Software is Eating the World

• Amazon.com (and .uk, .es, etc.) started as a bookstore
• Today, they sell just about everything - bicycles, appliances, computers, TVs, etc.
• In some cities in America, they even do home grocery delivery
• No longer as much a physical goods company - increasingly defined by and built around software
• Pioneering the eBook revolution with Kindle
• EC2 is running a huge percentage of the public internet

Page 4

Software is Eating the World

• Netflix started as a company to deliver DVDs to the home...

Page 5

Software is Eating the World

• Netflix started as a company to deliver DVDs to the home...
• But as they’ve grown, the business has shifted to an online streaming service
• They are now rolling out rapidly in many countries, including Ireland, the UK, Canada and the Nordics
• No need for physical inventory or postal distribution... just servers and digital copies

Page 6

Disney Found Itself Forced To Transform...

From This...

Page 7

Disney Found Itself Forced To Transform...

... To This

Page 8

But What Does All This Software Do?

• Software always eats data – be it text files, user form input, emails, etc

• All things that eat must eventually excrete...

Page 9

Ingestion = Excretion

Yeast ingests sugars, and excretes ethanol

Page 10

Ingestion = Excretion

Cows, er... well, you get the point.

Page 11

So What Does Software Eat?

• Software always eats data – be it text files, user form input, emails, etc

• But what does software excrete?
• More Data, of course...
• This data gets bigger and bigger
• The solutions for storing & processing this data become narrower
• Data fertilizes software, in an endless cycle...

Page 12

There’s a Big Market Here...

• Lots of solutions for Big Data
  • Data warehouse software
  • Operational databases
    • Old-style systems being upgraded to scale storage + processing
    • NoSQL - Cassandra, MongoDB, etc.
  • Platforms
    • Hadoop

Page 13

Don’t Tilt At Windmills...

Page 14

Don’t Tilt At Windmills...

• It is easy to get distracted by all of these solutions

• Keep it simple
• Use tools you (and your team) can understand
• Use tools and techniques that can scale
• Try not to reinvent the wheel

Page 15

... And Don’t Bite Off More Than You Can Chew

• Break it into smaller pieces
• You can’t fit a whole pig into your mouth...
• ... slice it into small parts that you can consume.

Page 16

Big Data at a Glance

• Big Data can be gigabytes, terabytes, petabytes or exabytes

• An ideal big data system scales up and down around various data sizes – while providing a uniform view

• Major concerns
  • Can I read & write this data efficiently at different scales?
  • Can I run calculations on large portions of this data?

[Diagram: Large dataset, primary key “username”]

Page 17

Big Data at a Glance

• Systems like Google File System (which inspired Hadoop’s HDFS) and MongoDB’s Sharding handle the scale problem by chunking

• Break up pieces of data into smaller chunks, spread across many data nodes

• Each data node contains many chunks
• If a chunk gets too large or a node is overloaded, data can be rebalanced (see the sketch below)

[Diagram: Large dataset, primary key “username”, broken into chunks spread across data nodes]
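As a rough illustration of how MongoDB sets this up, here is a minimal mongo-shell sketch; the database and collection names ("mydb", "users") are hypothetical:

    // Run against a mongos router; "mydb" and "users" are hypothetical names
    sh.enableSharding("mydb")                          // allow collections in mydb to be sharded
    sh.shardCollection("mydb.users", { username: 1 })  // chunk the collection by username ranges
    sh.status()                                        // show how chunks are spread across shards

Choosing "username" as the shard key is what produces the key-range chunks illustrated on the following slides.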

Page 18

Chunks Represent Ranges of Values

Initially, an empty collection has a single chunk, running the range of minimum (-∞) to maximum (+∞).

As we add data, more chunks are created for new ranges:

    INSERT {USERNAME: “Bill”}

[Diagram: the single chunk splits into ranges -∞ → “B”, “B” → “C”, “C” → +∞]

Individual or partial letter ranges are one possible chunk value... but they can get smaller!

    INSERT {USERNAME: “Becky”}
    INSERT {USERNAME: “Brendan”}
    INSERT {USERNAME: “Brad”}

[Diagram: finer splits such as -∞ → “Ba”, “Ba” → “Be”, “Be” → “Br”]

The smallest possible chunk value is not a range, but a single possible value (e.g. “Brad” or “Brendan”).
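A hedged mongo-shell sketch of how you could watch these splits happen, reusing the hypothetical "mydb.users" collection from the earlier sketch (in practice MongoDB only splits a chunk once it grows past the configured chunk size):

    // Insert the slide's example users
    db.users.insert({ username: "Bill" })
    db.users.insert({ username: "Becky" })
    db.users.insert({ username: "Brendan" })
    db.users.insert({ username: "Brad" })

    // Chunk metadata lives in the config database; each document records
    // the min and max of one key range
    db.getSiblingDB("config").chunks.find({ ns: "mydb.users" }).forEach(printjson)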

Page 19

Big Data at a Glance

• To simplify things, let’s look at our dataset split into chunks by letter

• Each chunk is represented by a single letter marking its contents

• You could think of “B” as really being “Ba” → “Bz”

[Diagram: Large dataset, primary key “username”, split into letter chunks a b c d e f g h ... s t u v w x y z]

Page 20

Big Data at a Glance

[Diagram: the same letter-chunked dataset, repeated from the previous slide]

Page 21

Big Data at a Glance

[Diagram: Large dataset, primary key “username”, split into individual letter chunks a through z]

MongoDB sharding (as well as HDFS) breaks data into chunks (~64 MB)

Page 22

Big Data at a Glance

[Diagram: Large dataset, primary key “username”; chunks a through z distributed across Data Node 1 through Data Node 4, each holding 25% of the chunks]

Representing data as chunks allows many levels of scale across n data nodes

Page 23

Scaling

[Diagram: chunks a through z distributed across Data Node 1 through Data Node 5]

The set of chunks can be evenly distributed across n data nodes

Page 24

Add Nodes: Chunk Rebalancing

[Diagram: after adding Data Node 5, chunks a through z are redistributed across all five data nodes]

The goal is equilibrium - an equal distribution. As nodes are added (or even removed), chunks can be redistributed for balance.
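In MongoDB this redistribution is handled by the balancer. A minimal mongo-shell sketch of adding a node and watching the rebalance; the shard hostname is hypothetical:

    sh.addShard("node5.example.com:27017")  // hypothetical new data node joining the cluster
    sh.getBalancerState()                   // true when automatic balancing is enabled
    sh.status()                             // watch chunks migrate toward an even distribution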

Page 25

Don’t Bite Off More Than You Can Chew...

• The answer to calculating on big data is much the same as for storing it

• We need to break our data into bite-sized pieces
• Build functions which can be composed together repeatedly on partitions of our data
• Process portions of the data across multiple calculation nodes
• Aggregate the results into a final set of results

Page 26

Bite Sized Pieces Are Easier to Swallow

• These pieces are not chunks – rather, the individual data points that make up each chunk

• Chunks make useful data transfer units for processing as well

• Transfer Chunks as “Input Splits” to calculation nodes, allowing for scalable parallel processing

Page 27

MapReduce the Pieces

• The most common application of these techniques is MapReduce

• Based on a Google whitepaper, it works with two primary functions – map and reduce – to calculate against large datasets

Page 28

MapReduce to Calculate Big Data

• MapReduce is designed to effectively process data at varying scales

• Composable function units can be reused repeatedly for scaled results

Page 29

MapReduce to Calculate Big Data

• In addition to the HDFS storage component, Hadoop is built around MapReduce for calculation

• MongoDB can be integrated with Hadoop to MapReduce data
• No HDFS storage needed - data moves directly between MongoDB and Hadoop’s MapReduce engine

Page 30

What is MapReduce?

• MapReduce is made up of a series of phases, the primary of which are:
  • Map
  • Shuffle
  • Reduce

• Let’s look at a typical MapReduce job:
  • Email records
  • Count the number of times a particular user has received email

Page 31

MapReducing Email

    to: tyler    from: brendan  subject: Ruby Support
    to: brendan  from: tyler    subject: Re: Ruby Support
    to: mike     from: brendan  subject: Node Support
    to: brendan  from: mike     subject: Re: Node Support
    to: mike     from: tyler    subject: COBOL Support
    to: tyler    from: mike     subject: Re: COBOL Support (WTF?)
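As MongoDB documents, these records might be loaded like this; the "messages" collection name is hypothetical:

    // One document per email, matching the slide's records
    db.messages.insert({ to: "tyler",   from: "brendan", subject: "Ruby Support" })
    db.messages.insert({ to: "brendan", from: "tyler",   subject: "Re: Ruby Support" })
    db.messages.insert({ to: "mike",    from: "brendan", subject: "Node Support" })
    db.messages.insert({ to: "brendan", from: "mike",    subject: "Re: Node Support" })
    db.messages.insert({ to: "mike",    from: "tyler",   subject: "COBOL Support" })
    db.messages.insert({ to: "tyler",   from: "mike",    subject: "Re: COBOL Support (WTF?)" })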

Page 32

Map Step

The map function (emit(k, v)) breaks each document into a key (grouping) & value:

    to: tyler    from: brendan  subject: Ruby Support             →  key: tyler,   value: {count: 1}
    to: brendan  from: tyler    subject: Re: Ruby Support         →  key: brendan, value: {count: 1}
    to: mike     from: brendan  subject: Node Support             →  key: mike,    value: {count: 1}
    to: brendan  from: mike     subject: Re: Node Support         →  key: brendan, value: {count: 1}
    to: mike     from: tyler    subject: COBOL Support            →  key: mike,    value: {count: 1}
    to: tyler    from: mike     subject: Re: COBOL Support (WTF?) →  key: tyler,   value: {count: 1}
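In MongoDB's JavaScript MapReduce, the map function for this job could be as small as the following sketch:

    // Map: runs once per document, with the document bound to "this";
    // emits one (key, value) pair per email, keyed by the recipient
    var mapFn = function () {
        emit(this.to, { count: 1 });
    };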

Page 33

Group/Shuffle Step

key: tyler,   value: {count: 1}
key: brendan, value: {count: 1}
key: mike,    value: {count: 1}
key: brendan, value: {count: 1}
key: mike,    value: {count: 1}
key: tyler,   value: {count: 1}

Group like keys together, creating an array of their values (done automatically by M/R frameworks).

Page 34

Group/Shuffle Step

key: brendan, values: [{count: 1}, {count: 1}]
key: mike,    values: [{count: 1}, {count: 1}]
key: tyler,   values: [{count: 1}, {count: 1}]

Group like keys together, creating an array of their values (done automatically by M/R frameworks).

Page 35

Reduce Step

key: brendan, values: [{count: 1}, {count: 1}]
key: mike,    values: [{count: 1}, {count: 1}]
key: tyler,   values: [{count: 1}, {count: 1}]

For each key, the reduce function aggregates the values and flattens them to a single result:

key: tyler,   value: {count: 2}
key: mike,    value: {count: 2}
key: brendan, value: {count: 2}
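A matching reduce function sketch; note that it returns a value of the same shape it receives, so the framework can safely re-reduce partial results:

    // Reduce: runs per key with the array of emitted values,
    // flattening them into a single result document
    var reduceFn = function (key, values) {
        var total = 0;
        values.forEach(function (v) { total += v.count; });
        return { count: total };
    };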

Page 36

Processing Scalable Big Data

• MapReduce provides an effective system for calculating and processing our large datasets (from gigabytes through exabytes and beyond)

• MapReduce is supported in many places including MongoDB & Hadoop

• We have effective answers for both of our concerns (a complete job is sketched below):
  • Can I read & write this data efficiently at different scales?
  • Can I run calculations on large portions of this data?
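Putting the pieces together in the mongo shell, using the map and reduce sketches from the previous slides (collection names remain hypothetical):

    var mapFn = function () { emit(this.to, { count: 1 }); };
    var reduceFn = function (key, values) {
        var total = 0;
        values.forEach(function (v) { total += v.count; });
        return { count: total };
    };

    // Run the job over "messages" and store the output in "emails_received"
    db.messages.mapReduce(mapFn, reduceFn, { out: "emails_received" })
    db.emails_received.find()
    // e.g. { "_id" : "brendan", "value" : { "count" : 2 } }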

Page 37

Batch Isn’t a Sustainable Answer

• There are downsides here - fundamentally, MapReduce is a batch process
• Batch systems like Hadoop give us a “Catch-22”
  • You can get answers to questions from petabytes of data
  • But you can’t guarantee you’ll get them quickly

• In some ways, this is a step backwards in our industry

• Business stakeholders tend to want answers now
• We must evolve

Page 38

Moving Away from Batch

• The Big Data world is moving rapidly away from slow, batch-based processing solutions

• Google has moved forward from batch into more realtime processing over the last few years

• Hadoop is replacing “MapReduce as Assembly Language” with more flexible resource management in YARN
  • Now MapReduce is just one feature implemented on top of YARN
  • We can build anything we want

• Newer systems like Spark & Storm provide platforms for realtime processing

Page 39

In Closing

• The World IS Being Eaten By Software

• All that software is leaving behind an awful lot of data
  • We must be careful not to “step in it”
  • More Data Means More Software Means More Data Means...

• Practical Solutions for Processing & Storing Data will save us

• We as Data Scientists & Technologists must always evolve our strategies, thinking and tools

Page 40

Download the Hadoop Connector: http://github.com/mongodb/mongo-hadoop

Docs: http://api.mongodb.org/hadoop/

¿QUESTIONS?

Contact Me: [email protected]

(twitter: @rit)
