big data and containers

Big Data and Containers Charles Smith @charles_s_smith

Upload: charles-smith

Post on 20-Jul-2015


Category: Data & Analytics



TRANSCRIPT

Page 1: Big data and containers

Big Data and Containers

Charles Smith @charles_s_smith

Page 2: Big data and containers

Netflix / Lead the big data platform architecture team

Spend my time / Thinking how to make it easy/efficient to work with big data

University of Florida / PhD in Computer Science

Who am I?

Page 3: Big data and containers

“It is important that we know where we come from, because

if you do not know where you come from, then you don't

know where you are, and if you don't know where you are,

you don't know where you're going. And if you don't know

where you're going, you're probably going wrong.”

Terry Pratchett

Page 4: Big data and containers

Database -> Distributed Database -> Distributed Storage -> Distributed Processing -> ???

Page 5: Big data and containers

Why do we care about containers?

Page 6: Big data and containers

Containers ~= Virtual Machines

Virtual Machines ~= Servers

Page 7: Big data and containers

Lightweight: fast to start, low memory use

Secure: process isolation, data isolation

Portable

Composable

Reproducible

Everything old is new

Page 8: Big data and containers

Microservices and large architectures

Page 9: Big data and containers

Data storage (Cassandra, MySQL, MongoDB, etc.)

Page 10: Big data and containers

Operational (Mesos, Kubernetes, etc.)

Page 11: Big data and containers

Discovery/Routing

Page 12: Big data and containers

What’s different about big data?

Page 13: Big data and containers

Data at rest

Data in motion

Page 14: Big data and containers

Customer Facing

Minimize latency

Maximize reliability

Page 15: Big data and containers

Data Analytics

Minimize I/O

Maximize processing

Page 16: Big data and containers

Ship computation to data
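As a rough illustration of the idea on this slide, here is a minimal sketch, assuming viewing records are spread over three hypothetical nodes. The node names, the sample rows, and the functions `local_sum`, `ship_computation`, and `ship_data` are invented for the example and do not come from the talk. Each node reduces its own rows to partial sums, so only the small partial results cross the network.

```python
from collections import Counter

# Hypothetical partitions: node name -> (title_id, view_secs) rows stored on that node.
PARTITIONS = {
    "node-a": [(1, 120), (2, 300), (1, 60)],
    "node-b": [(2, 45), (3, 500)],
    "node-c": [(1, 30), (3, 10), (3, 90)],
}

def local_sum(rows):
    """Runs where the data lives: collapse rows to one partial sum per title."""
    partial = Counter()
    for title_id, view_secs in rows:
        partial[title_id] += view_secs
    return partial

def ship_computation(partitions):
    """Run local_sum on each node, then merge only the partial results."""
    total = Counter()
    records_moved = 0
    for rows in partitions.values():
        partial = local_sum(rows)
        records_moved += len(partial)  # one record per title leaves the node
        total.update(partial)          # Counter.update adds the counts together
    return dict(total), records_moved

def ship_data(partitions):
    """Baseline: copy every raw row to one place before aggregating."""
    all_rows = [row for rows in partitions.values() for row in rows]
    return dict(local_sum(all_rows)), len(all_rows)
```

Both functions return the same totals. The only difference is how many records have to move, and that gap widens as the raw data grows relative to the number of distinct titles.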

Page 17: Big data and containers

The questions you can answer aren’t predefined

Page 18: Big data and containers

Hive/Pig/MR

Presto

Metacat

Hive Metastore

Page 19: Big data and containers

That doesn’t look very container-y (or microservice-y, for that matter)

Page 20: Big data and containers

Data storage - HDFS (or in our case, S3)

Page 21: Big data and containers

Operational - YARN

Page 22: Big data and containers

Containers - JVM

Page 23: Big data and containers

So what happens when you want to do something else?

Page 24: Big data and containers
Page 25: Big data and containers

But is that really the way we want to approach containers?

Page 26: Big data and containers

What’s different about big data?

Page 27: Big data and containers

Running many different short-lived processes

Page 28: Big data and containers

Running many different short-lived processes

Efficient container construction, allocation, and movement

Page 29: Big data and containers

Groups of processes having meaning

Page 30: Big data and containers

Groups of processes having meaning

How we observe processes needs to be holistic

Page 31: Big data and containers

Processes need to be scheduled by data locality (and not just data locality for data at rest)
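A locality-aware placement decision could be sketched roughly as follows. This is a toy illustration, not how any particular scheduler (YARN, Mesos, Kubernetes) works; `pick_node` and its three arguments are hypothetical names. It assumes the scheduler knows which nodes hold each input block and how many free slots each node has. The same map could just as well describe data in motion, for example the node where an upstream task is writing its output.

```python
def pick_node(task_inputs, block_locations, free_slots):
    """Return the node with a free slot that holds the most input bytes locally.

    task_inputs:     {block_id: size_in_bytes} the task will read
    block_locations: {block_id: set of node names holding that block}
    free_slots:      {node name: number of free container slots}
    Returns None when no node has a free slot.
    """
    best_node, best_local = None, -1
    for node in sorted(free_slots):  # sorted for a deterministic tie-break
        if free_slots[node] <= 0:
            continue
        local = sum(
            size
            for block, size in task_inputs.items()
            if node in block_locations.get(block, ())
        )
        if local > best_local:
            best_node, best_local = node, local
    return best_node
```

With inputs `{"b1": 128, "b2": 64}`, where `b1` lives on `node-a` and `node-b` and `b2` only on `node-b`, the sketch prefers `node-b` and falls back to `node-a` when `node-b` is full.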

Page 32: Big data and containers

Processes need to be scheduled by data locality (and not just data locality for data at rest)

A special case of affinity (although possibly over time)

but...

Page 33: Big data and containers

We do need a data discovery service (kind of… maybe… a namenode?)
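To make the idea concrete, a data discovery service can be pictured as a catalogue that answers "where does this table's data live?". The class below is a toy in-memory sketch under that assumption. `DataDiscovery`, `register`, and `locate` are hypothetical names, and a real service would also need persistence, replication, and access control.

```python
from collections import defaultdict

class DataDiscovery:
    """Toy catalogue: which nodes hold which partitions of a table."""

    def __init__(self):
        # table -> partition -> set of node names
        self._locations = defaultdict(lambda: defaultdict(set))

    def register(self, table, partition, node):
        """Record that `node` holds a copy of `partition` of `table`."""
        self._locations[table][partition].add(node)

    def locate(self, table):
        """Return {partition: sorted node names}, or {} for an unknown table."""
        if table not in self._locations:
            return {}
        return {p: sorted(nodes) for p, nodes in self._locations[table].items()}
```

A scheduler could query `locate` before placing work, which is roughly the role the slide compares to a namenode.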

Page 34: Big data and containers

SELECT
    t.title_id,
    t.title_desc,
    SUM(v.view_secs)
FROM view_history AS v
JOIN title_d AS t ON v.title_id = t.title_id
WHERE v.view_dateint > 20150101
GROUP BY 1, 2;

LOAD + LOAD -> JOIN -> GROUP
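The query on this slide can be read as a small DAG of stages: two loads feed a join, which feeds a group. The sketch below is one way to express that, assuming Python 3.9+ for `graphlib`. The stage names, the sample rows, and the `run` function are invented for illustration, and applying the date filter during the load is an assumption about where a planner might place it.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
PLAN = {
    "load_view_history": set(),
    "load_title_d": set(),
    "join": {"load_view_history", "load_title_d"},
    "group": {"join"},
}

# Hypothetical sample data: (title_id, view_dateint, view_secs)
VIEW_HISTORY = [
    (1, 20150105, 100),
    (1, 20141231, 999),  # excluded by view_dateint > 20150101
    (2, 20150210, 50),
    (1, 20150301, 25),
]
TITLE_D = {1: "Title One", 2: "Title Two"}  # title_id -> title_desc

def run(plan):
    """Execute the stages in dependency order; return (order, grouped sums)."""
    results = {}
    order = list(TopologicalSorter(plan).static_order())
    for stage in order:
        if stage == "load_view_history":
            # Assumption: the WHERE filter is applied while loading.
            results[stage] = [r for r in VIEW_HISTORY if r[1] > 20150101]
        elif stage == "load_title_d":
            results[stage] = TITLE_D
        elif stage == "join":
            titles = results["load_title_d"]
            results[stage] = [
                (tid, titles[tid], secs)
                for tid, _, secs in results["load_view_history"]
                if tid in titles
            ]
        elif stage == "group":
            sums = {}
            for tid, desc, secs in results["join"]:
                sums[(tid, desc)] = sums.get((tid, desc), 0) + secs
            results[stage] = sums
    return order, results["group"]
```

The two load stages have no dependency on each other, so a scheduler is free to run them in parallel and, ideally, close to where each table's data lives.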

Page 35: Big data and containers

Data Discovery

Query Compiler

Query Planner

Metadata

DAG Watcher

Page 36: Big data and containers

Bottom line

Containers provide process-level security

The goal should be to minimize monoliths

This isn’t different from what we are doing already

Our languages are abstractions of composable, distributed processing

Different big data projects should share services

No matter what we do, joining is going to be a big problem

Page 37: Big data and containers

Questions?