Big Data in a Public Cloud
Maximize RAM without paying for overhead CPU, and other price/performance tricks

Uploaded by cloudsigma on 15-Jan-2015


DESCRIPTION

These data sets are so large that they are difficult to manage with traditional tools. Distributed computing is an approach to solving that problem: first the data needs to be mapped, then it can be analyzed, or reduced.

TRANSCRIPT

Page 1: Big Data in a Public Cloud

Big Data in a public cloud

Maximize RAM without paying for overhead CPU, and other price/performance tricks

Page 2: Big Data in a Public Cloud

Big Data

• Big Data: just another way of saying large data sets

• They are so large that it's difficult to manage them with traditional tools

• Distributed computing is an approach to solve that problem

• First the data needs to be Mapped

• Then it can be analyzed – or Reduced

Page 3: Big Data in a Public Cloud

MapReduce

• Instead of talking about it, let's just see some code
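The code shown on the original slide is not preserved in this transcript. As a stand-in, here is a minimal word-count sketch in plain Python that mirrors the two phases: map emits (word, 1) pairs, reduce sums them per word. The function names are illustrative, not taken from the deck.

    # Minimal word-count MapReduce sketch (illustrative stand-in, not the slide's code).
    from collections import defaultdict

    def map_phase(document):
        # Map: emit a (word, 1) pair for every word in the document.
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_phase(pairs):
        # Reduce: sum the counts emitted for each word.
        counts = defaultdict(int)
        for word, count in pairs:
            counts[word] += count
        return dict(counts)

    documents = ["big data in a public cloud", "big data needs big RAM"]
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    print(reduce_phase(pairs))  # e.g. {'big': 3, 'data': 2, ...}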

Page 4: Big Data in a Public Cloud

Map/Reduce – distributed work
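This slide was a diagram of the distributed work. As a rough local stand-in (my illustration, not the deck's), the sketch below spreads the map phase across worker processes with Python's multiprocessing and merges the partial results in a single reduce step; in a real cluster the processes would be worker nodes and the merge would run on a reducer node.

    # Distributing the map phase across workers (local processes stand in for nodes).
    from collections import Counter
    from multiprocessing import Pool

    def map_document(document):
        # Map: each worker counts the words of one document independently.
        return Counter(document.lower().split())

    def reduce_counts(partial_counts):
        # Reduce: merge the per-worker partial counts into a single result.
        total = Counter()
        for counts in partial_counts:
            total.update(counts)
        return total

    if __name__ == "__main__":
        documents = ["big data in a public cloud", "big data needs big RAM"]
        with Pool(processes=4) as pool:   # four local "worker nodes"
            partial = pool.map(map_document, documents)
        print(reduce_counts(partial))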

Page 5: Big Data in a Public Cloud

But that’s perfect for public cloud!

Page 6: Big Data in a Public Cloud

CPU / RAM

• Mapping – is not CPU intensive

• Reducing – is (usually) not CPU intensive

• Speed: load the data into RAM, or it hits the HDD and creates iowait (a quick check follows below)
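One simple way to act on that last point is to check whether the working set fits in memory before the job starts. This is an illustrative sketch, assuming the third-party psutil package is installed; the input file name is hypothetical.

    # Rough check: will the data fit in RAM, or will the job spill to disk and iowait?
    # (Illustrative sketch; assumes the psutil package is installed.)
    import os
    import psutil

    def fits_in_ram(path, headroom=0.8):
        # Compare file size against available memory, leaving headroom for the OS
        # and for the job's own working structures.
        data_size = os.path.getsize(path)
        available = psutil.virtual_memory().available
        return data_size <= headroom * available

    if __name__ == "__main__":
        path = "events.log"  # hypothetical input file
        if fits_in_ram(path):
            print("Load it into RAM and map in memory.")
        else:
            print("Too big for this node's RAM: expect iowait, or add RAM/nodes.")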

Page 7: Big Data in a Public Cloud

First point – a lot of RAM

Page 8: Big Data in a Public Cloud

Ah…now they have RAM instances

Page 9: Big Data in a Public Cloud

I wanted more RAM, I said ☹

Page 10: Big Data in a Public Cloud

What about MapReduce PaaS?

• Be aware of data lock-in

• Be aware of the forced tool set – it limits your workflow

• …and as a developer I just don't like it as much; it's a control thing

Page 11: Big Data in a Public Cloud

There is another way

Page 12: Big Data in a Public Cloud

What we do

Page 13: Big Data in a Public Cloud

What is compute like – why commodity?

Page 14: Big Data in a Public Cloud

Now let's get to work

• Chef & Puppet – deploy and config

• Clone; scale up and down (a sketch follows this list)

• Problem solved?
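As a sketch of the clone-and-scale step (my illustration, with hypothetical stub functions rather than any real provisioning API): decide how many workers the pending work needs, then clone new nodes from a golden image or delete idle ones, and let Chef or Puppet configure whatever comes up.

    # Scale worker nodes up or down to match the pending work.
    # clone_worker()/delete_worker() are hypothetical stubs standing in for a
    # provider's provisioning API; Chef/Puppet would configure each new clone.
    TASKS_PER_WORKER = 50

    def clone_worker(golden_image):
        # Stub: would call the cloud provisioning API to clone the golden image.
        print(f"cloning worker from {golden_image}")

    def delete_worker(node):
        # Stub: would call the provisioning API to delete an idle worker.
        print(f"deleting worker {node}")

    def rescale(pending_tasks, workers, golden_image="hadoop-worker-golden"):
        # Target worker count = ceiling(pending_tasks / TASKS_PER_WORKER), at least 1.
        target = max(1, -(-pending_tasks // TASKS_PER_WORKER))
        if target > len(workers):
            for _ in range(target - len(workers)):
                clone_worker(golden_image)
        else:
            for node in workers[target:]:
                delete_worker(node)

    rescale(pending_tasks=230, workers=["w1", "w2", "w3"])  # clones 2 more workers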

Page 15: Big Data in a Public Cloud

And then there is reality…iowait

Page 16: Big Data in a Public Cloud

ioWAIT

• 1,000 IOPS on a high-end enterprise HDD

• 500,000 IOPS on a high-end SSD

• Public cloud on HDD should be forbidden
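To make that gap concrete, here is the arithmetic using the slide's IOPS figures (the read count is just an example of mine): one hundred million random 4 KB reads take roughly 28 hours at 1,000 IOPS, but about 200 seconds at 500,000 IOPS.

    # Back-of-the-envelope time for random reads at the slide's IOPS figures.
    random_reads = 100_000_000  # example workload: 1e8 random 4 KB reads

    for device, iops in [("enterprise HDD", 1_000), ("high-end SSD", 500_000)]:
        seconds = random_reads / iops
        print(f"{device:>15}: {seconds:,.0f} s (~{seconds / 3600:.1f} h)")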

Page 17: Big Data in a Public Cloud

Announcing: SSD priced as HDD

Page 18: Big Data in a Public Cloud

Who runs these workloads, and how?

• CERN – the LHC experiments to find the Higgs Boson

• 200 Petabytes per year

• 200,000 CPU days / day on hundreds of partner grids and clouds

• Monte Carlo simulations (CPU intensive)

Page 19: Big Data in a Public Cloud

The CERN case

• Golden VM images with tools, from which they clone

• A set of coordinator servers that scale the worker nodes up and down via the provisioning API (clone, Puppet config)

• The coordinator servers manage the workload on each worker node running Monte Carlo simulations

• Federated cloud brokers such as Enstratus and Slipstream

• Self-healing architecture – the uptime of a worker node is not critical; just spin up a new worker node instance (see the sketch below)
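A minimal sketch of the self-healing idea (my illustration; the health check and provisioning calls are hypothetical stubs, not CERN's or any vendor's actual API): the coordinator polls its workers and simply replaces any node that stops responding.

    # Self-healing loop: replace any worker that stops responding.
    # is_alive()/clone_worker() are hypothetical stubs; a real coordinator would
    # probe the node, call the cloud's provisioning API, then apply Puppet config.
    import time

    def is_alive(node):
        # Stub: would ping the worker or check its heartbeat.
        return True

    def clone_worker(golden_image="montecarlo-worker-golden"):
        # Stub: would clone a fresh worker from the golden VM image.
        print(f"spinning up replacement from {golden_image}")
        return "new-worker"

    def heal(workers, poll_seconds=60):
        # Keep the worker pool at full strength; dead nodes are simply replaced.
        while True:
            workers = [node if is_alive(node) else clone_worker() for node in workers]
            time.sleep(poll_seconds)

    # heal(["worker-1", "worker-2", "worker-3"])  # runs forever; left commented out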

Page 20: Big Data in a Public Cloud

The loyalty program case

• The customer runs various loyalty programs

• Plain vanilla Hadoop Map/Reduce (a streaming-style sketch follows this list):

• Chef and Puppet to deploy and config worker nodes

• Lots of RAM, little CPU

• Self-healing architecture – the uptime of a worker node is not critical; just spin up a new worker node instance
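For the "plain vanilla Hadoop Map/Reduce" part, a compact way to sketch such a job from Python is the mrjob library, which runs locally or submits to a Hadoop cluster via streaming. The job below is my illustration, not the customer's code, and the CSV layout (customer_id,points per line) is an assumption.

    # Sum loyalty points per customer with a Hadoop-streaming-style job (mrjob).
    # Illustrative only; the customer_id,points input format is an assumption.
    from mrjob.job import MRJob

    class MRPointsPerCustomer(MRJob):
        def mapper(self, _, line):
            # Map: emit (customer_id, points) for each CSV line.
            customer_id, points = line.split(",")[:2]
            yield customer_id, int(points)

        def reducer(self, customer_id, points):
            # Reduce: total the points emitted for each customer.
            yield customer_id, sum(points)

    if __name__ == "__main__":
        MRPointsPerCustomer.run()

Run it locally with "python points_job.py purchases.csv", or against a Hadoop cluster by adding "-r hadoop"; the file names here are hypothetical.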

Page 21: Big Data in a Public Cloud

Find IaaS suppliers with

• Unbundled resources

• Short billing cycles

• New equipment: what's your life cycle?

• SSD as compute storage

• Allows cross-connects from your equipment in the DC

• CloudSigma, ElasticHost, ProfitBricks to mention a few

Page 22: Big Data in a Public Cloud

What we do