ashish thusoo evolution of big data architectures

Evolution of Big Data Architectures

Architecture Summit, Aug 2012

Ashish Thusoo

Outline

Demand for Big Data

Architectural Trade Offs and Evolution

Where next?

The Changing Planet

3 Technology Drivers

Devices

Infrastructure

Applications

Evolution: Devices

Key Capabilities

Connected

Location Aware

Sensory & Powerful

Evolution: Devices

Evolution: Connectivity

Mobile Subscription Density 2004

Evolution: Connectivity

Mobile Subscription Density 2010

Evolution: Bandwidth

Evolution: Applications

Salient Traits

Cloud based

Web scale

Explosion in Data

Big Data

Volume

Velocity

Variety

Big Data: Volume

Volume:

2011: 1.8 zettabytes of digital universe

2009 - 2020: 35 zettabytes

Big Data: Velocity

Velocity

340 million tweets per day

72 hours of video uploaded every minute on YouTube

2.9 million emails a second

Big Data: Variety

Variety

Pictures

Application Logs

etc. etc...

Disruptive Architectures

Disruptions in Data Arch

Change in Focus (1990s -> 2000s)

Performance -> Scalability & Availability

Rigid/Structured -> Flexible/Semistructured

Scalability & Availability

Towards Scalability

Problem

10K ops/sec -> 1M ops/sec

TB of data -> PB of data

Towards Scalability

Solution: SHARDING (Divide and Conquer)

Towards Scalability

How do we quickly route a record to a shard?

fn(record):

- Consistent Hashing

- Mapping Table
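A minimal sketch of the consistent hashing option for the routing function, not taken from the talk. The class name `ConsistentHashRing`, the shard names, the use of MD5 and the virtual-node count are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit position on the ring (MD5 is an arbitrary choice)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Hypothetical router from record keys to shard names."""

    def __init__(self, shards, vnodes=100):
        self._ring = []  # sorted list of (position, shard) pairs
        for shard in shards:
            self.add_shard(shard, vnodes)

    def add_shard(self, shard, vnodes=100):
        # Several virtual nodes per shard even out the key distribution.
        for i in range(vnodes):
            bisect.insort(self._ring, (_hash(f"{shard}#{i}"), shard))

    def remove_shard(self, shard):
        self._ring = [(pos, s) for pos, s in self._ring if s != shard]

    def route(self, record_key: str) -> str:
        # Pick the first virtual node at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(record_key), ""))
        return self._ring[idx % len(self._ring)][1]
```

With this layout, adding a shard only moves the keys that now fall on the new shard's virtual nodes; every other key keeps its old shard. A mapping table trades that property for explicit control over placement.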

Towards Scalability

What happens if part of the record is in one shard and part in another?

Towards Scalability

Keep it Simple: Application deals with atomicity & consistency semantics

Towards Availability

What if my shard is down? Where do I put my record?

Towards Availability

Let's just replicate the shards and pray that one is available :)

[Figure: the record replicated across shards; X marks a shard that is down]

Towards Availability

Replication strategies

What should be the number of replicas?

How to rebuild a replica?

How to propagate a record to a replica?
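The three questions above can be made concrete with a toy in-memory model. This is only a sketch under assumptions: the names `Replica` and `ReplicatedShard`, the default of three replicas, synchronous propagation to every live replica, and rebuilding by copying from a healthy peer are illustrative choices, not the strategy of any particular system.

```python
class Replica:
    """One copy of a shard's data (in-memory stand-in for a real server)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class ReplicatedShard:
    """Hypothetical shard that keeps num_replicas copies of every record."""

    def __init__(self, num_replicas=3):
        self.replicas = [Replica(f"replica-{i}") for i in range(num_replicas)]

    def put(self, key, value):
        # Propagate the record synchronously to every replica that is up.
        live = [r for r in self.replicas if r.up]
        if not live:
            raise RuntimeError("no replica available")
        for replica in live:
            replica.data[key] = value
        return len(live)  # number of copies written

    def get(self, key):
        # Read from the first available replica.
        for replica in self.replicas:
            if replica.up:
                return replica.data.get(key)
        raise RuntimeError("no replica available")

    def rebuild(self, index):
        # Rebuild a replica by copying everything from a healthy peer.
        source = next(
            (r for i, r in enumerate(self.replicas) if r.up and i != index), None
        )
        if source is None:
            raise RuntimeError("no healthy replica to copy from")
        target = self.replicas[index]
        target.data = dict(source.data)
        target.up = True
```

A replica that was down misses writes, so in this sketch it has to go through `rebuild` before it serves reads again; otherwise it would return stale data.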

1990s vs 2000s

Different Focus:

1990s (Raw Performance)

Optimal I/O structures

Cache Sensitive Algorithms

2000s (Scalability, Availability)

Sharding

Replication

Flexibility/Semi-structure

Towards Flexibility

Problem

Does structure in a database make it slower to write applications (sprint vs waterfall model)?

What if my data is not records and tables?

Towards Flexibility

How does knowing my record structure help my data system?

Helps to optimize execution plans

Helps to optimize my storage layouts

Trade off?

Application change means database schema change, rebuilding indexes etc. etc.

Towards Flexibility

Most of my operations are simple lookups, range lookups and updates

Since the execution is simple we don’t need all the structure

Keep enough structure to support fast gets and puts

Towards Flexibility

Solution: Key-Value Stores (NoSQL)

[Figure: KEY / VALUE pairs]

- Sorted HashMaps

- Sorted Files
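A minimal in-memory sketch of the "sorted map" idea, assuming a hypothetical `SortedKVStore` class: keeping the keys in sorted order is just enough structure to support fast gets, puts and range lookups. Real stores persist this as sorted files; that part is not modeled here.

```python
import bisect


class SortedKVStore:
    """Hypothetical key-value store whose only structure is a sorted key list."""

    def __init__(self):
        self._keys = []    # keys kept in sorted order for range lookups
        self._values = {}  # key -> opaque value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key, default=None):
        return self._values.get(key, default)

    def range(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]
```

The store never inspects the value, which is what lets the application change its record layout without a schema change.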

Towards Flexibility

Need to update related “values” of a key (Some Atomicity)

[Figure: KEY / VALUE]

Towards Flexibility

Need to update related “values” of a key (Some Atomicity)

[Figure: KEY / VALUE]

TAG = COLUMN FAMILY
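One way to picture the column-family idea is a map from key to family to columns, where all columns written under one key and family change together. The sketch below is an assumption-laden illustration: `ColumnFamilyStore`, `put_columns` and the single store-wide lock are hypothetical and much cruder than what a real column-family store does.

```python
import threading


class ColumnFamilyStore:
    """Hypothetical store: key -> column family (tag) -> {column: value}."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # one coarse lock, for illustration only

    def put_columns(self, key, family, columns):
        # All related values under one key and family are updated as a unit.
        with self._lock:
            row = self._rows.setdefault(key, {})
            row.setdefault(family, {}).update(columns)

    def get_family(self, key, family):
        # Return a copy so callers never see a half-applied update.
        with self._lock:
            return dict(self._rows.get(key, {}).get(family, {}))
```

Atomicity here stops at a single key; nothing in this sketch coordinates updates across keys or shards.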

Towards Flexibility

Gets and puts are fine for online applications, BUT...

What about Analytics?

Transformations can be really complicated...

Towards Flexibility

Is there a simple construct that can solve a number of analytics queries?

Of course: SORT

And it can be parallelized too

Towards Flexibility

MAP/REDUCE (Scalable Parallel Pluggable SORT)

[Figure: input records flow through Mappers, then Reducers]

m{ }: user-defined map function

r{ }: user-defined reduce function
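A single-process sketch of map/reduce as a "pluggable sort", using word count as the example. The function names `map_reduce`, `word_count_map` and `word_count_reduce` are illustrative; a real framework would run mappers and reducers in parallel across machines, which this sketch does not attempt.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(records, m, r):
    """Run user-defined map function m and reduce function r over records."""
    # Map phase: each record may emit any number of (key, value) pairs.
    pairs = [kv for record in records for kv in m(record)]
    # Shuffle phase: the sort is what brings equal keys next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one call to r per distinct key, with all of its values.
    return [
        r(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


def word_count_map(line):
    for word in line.split():
        yield (word.lower(), 1)


def word_count_reduce(word, counts):
    return (word, sum(counts))
```

Only `m` and `r` are specific to the query; the sort in the middle is the shared, parallelizable part.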

Towards Flexibility

MAP/REDUCE and Failures

[Figure: Mappers and Reducers; X marks a failed mapper]
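One common way to handle a failed mapper is to re-run only that task on its input split, which works when the map function is a deterministic function of its input. The sketch below assumes that approach; `run_task_with_retries`, `run_map_phase` and the `max_attempts` parameter are hypothetical names, and the talk does not say which recovery scheme it has in mind.

```python
def run_task_with_retries(task, split, max_attempts=3):
    """Run task(split), re-executing it from scratch if it raises."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task(split)
        except Exception as err:  # a real scheduler would distinguish failure types
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def run_map_phase(splits, m, max_attempts=3):
    """Apply map function m to every split, retrying only the splits that fail."""
    results = []
    for split in splits:
        # Materialize the output so a crash mid-way discards the partial result.
        results.extend(
            run_task_with_retries(lambda s: list(m(s)), split, max_attempts)
        )
    return results
```

Splits that already completed are never re-run, so one failure costs one task rather than the whole job.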

1990s vs 2000s

Different Focus:

1990s (Raw Performance)

Structure important for speed optimizations

Stream everything through Query plan

2000s (Sprint mode of application development)

Support dev efficiency and data variety

Checkpointing for restartability

Where now?

The New Meets The Old

Disruption?

Well, we still need SQL

We still need to make these work with other components

Guess what? Efficiency is also important at scale

Where Does New Fail?

Transactions?

Moving money from one account to another

Graphs?

Networks everywhere

How to do second-order analysis on graphs?

Thank You!
