Big Data in a Public Cloud
Maximize RAM without paying for overhead CPU, and other price/performance tricks

Uploaded by cloudsigma on 15-Jan-2015


DESCRIPTION

These data sets are so large that they are difficult to manage with traditional tools. Distributed computing is an approach to solving that problem: first the data needs to be mapped, then it can be analyzed, or reduced.

TRANSCRIPT

Page 1: Big Data in a Public Cloud

Big Data in a public cloud

Maximize RAM without paying for overhead CPU, and other price/performance tricks

Page 2: Big Data in a Public Cloud

Big Data

• Big Data: just another way of saying large data sets

• They are so large that it's difficult to manage them with traditional tools

• Distributed computing is an approach to solve that problem

• First the data needs to be Mapped

• Then it can be analyzed – or Reduced

Page 3: Big Data in a Public Cloud

MapReduce

• Instead of talking about it, let's just see some code
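The code shown on the original slide is not preserved in this transcript. As a stand-in, here is a minimal word-count sketch in plain Python that mirrors the two phases: map emits (word, 1) pairs, reduce sums them per word. The function names are illustrative, not taken from the deck.

    # Minimal word-count MapReduce sketch (illustrative stand-in, not the slide's code).
    from collections import defaultdict

    def map_phase(document):
        # Map: emit a (word, 1) pair for every word in the document.
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_phase(pairs):
        # Reduce: sum the counts emitted for each word.
        counts = defaultdict(int)
        for word, count in pairs:
            counts[word] += count
        return dict(counts)

    documents = ["big data in a public cloud", "big data needs big RAM"]
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    print(reduce_phase(pairs))  # e.g. {'big': 3, 'data': 2, ...}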

Page 4: Big Data in a Public Cloud

Map/Reduce – distributed work
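This slide was a diagram of the distributed work. As a rough local stand-in (my illustration, not the deck's), the sketch below spreads the map phase across worker processes with Python's multiprocessing and merges the partial results in a single reduce step; in a real cluster the processes would be worker nodes and the merge would run on a reducer node.

    # Distributing the map phase across workers (local processes stand in for nodes).
    from collections import Counter
    from multiprocessing import Pool

    def map_document(document):
        # Map: each worker counts the words of one document independently.
        return Counter(document.lower().split())

    def reduce_counts(partial_counts):
        # Reduce: merge the per-worker partial counts into a single result.
        total = Counter()
        for counts in partial_counts:
            total.update(counts)
        return total

    if __name__ == "__main__":
        documents = ["big data in a public cloud", "big data needs big RAM"]
        with Pool(processes=4) as pool:   # four local "worker nodes"
            partial = pool.map(map_document, documents)
        print(reduce_counts(partial))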

Page 5: Big Data in a Public Cloud

But that’s perfect for public cloud!

Page 6: Big Data in a Public Cloud

CPU / RAM

• Mapping – is not CPU intensive

• Reducing – is (usually) not CPU intensive

• Speed: load the data into RAM, or it hits the HDD and creates iowait (a quick check follows below)
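One simple way to act on that last point is to check whether the working set fits in memory before the job starts. This is an illustrative sketch, assuming the third-party psutil package is installed; the input file name is hypothetical.

    # Rough check: will the data fit in RAM, or will the job spill to disk and iowait?
    # (Illustrative sketch; assumes the psutil package is installed.)
    import os
    import psutil

    def fits_in_ram(path, headroom=0.8):
        # Compare file size against available memory, leaving headroom for the OS
        # and for the job's own working structures.
        data_size = os.path.getsize(path)
        available = psutil.virtual_memory().available
        return data_size <= headroom * available

    if __name__ == "__main__":
        path = "events.log"  # hypothetical input file
        if fits_in_ram(path):
            print("Load it into RAM and map in memory.")
        else:
            print("Too big for this node's RAM: expect iowait, or add RAM/nodes.")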

Page 7: Big Data in a Public Cloud

First point – a lot of RAM

Page 8: Big Data in a Public Cloud

Ah…now they have RAM instances

Page 9: Big Data in a Public Cloud

I wanted more RAM, I said ☹

Page 10: Big Data in a Public Cloud

What about MapReduce PaaS?

• Be aware of data lock-in

• Be aware of the forced tool set – it limits your workflow

• …and as a developer I just don't like it as much; it's a control thing

Page 11: Big Data in a Public Cloud

There is another way

Page 12: Big Data in a Public Cloud

What we do

Page 13: Big Data in a Public Cloud

What is compute like – why commodity?

Page 14: Big Data in a Public Cloud

Now let's get to work

• Chef & Puppet – deploy and config

• Clone; scale up and down (a sketch follows this list)

• Problem solved?
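As a sketch of the clone-and-scale step (my illustration, with hypothetical stub functions rather than any real provisioning API): decide how many workers the pending work needs, then clone new nodes from a golden image or delete idle ones, and let Chef or Puppet configure whatever comes up.

    # Scale worker nodes up or down to match the pending work.
    # clone_worker()/delete_worker() are hypothetical stubs standing in for a
    # provider's provisioning API; Chef/Puppet would configure each new clone.
    TASKS_PER_WORKER = 50

    def clone_worker(golden_image):
        # Stub: would call the cloud provisioning API to clone the golden image.
        print(f"cloning worker from {golden_image}")

    def delete_worker(node):
        # Stub: would call the provisioning API to delete an idle worker.
        print(f"deleting worker {node}")

    def rescale(pending_tasks, workers, golden_image="hadoop-worker-golden"):
        # Target worker count = ceiling(pending_tasks / TASKS_PER_WORKER), at least 1.
        target = max(1, -(-pending_tasks // TASKS_PER_WORKER))
        if target > len(workers):
            for _ in range(target - len(workers)):
                clone_worker(golden_image)
        else:
            for node in workers[target:]:
                delete_worker(node)

    rescale(pending_tasks=230, workers=["w1", "w2", "w3"])  # clones 2 more workers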

Page 15: Big Data in a Public Cloud

And then there is reality…iowait

Page 16: Big Data in a Public Cloud

ioWAIT

• 1,000 IOPS on a high-end enterprise HDD

• 500,000 IOPS on a high-end SSD

• Public cloud on HDD should be forbidden
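To make that gap concrete, here is the arithmetic using the slide's IOPS figures (the read count is just an example of mine): one hundred million random 4 KB reads take roughly 28 hours at 1,000 IOPS, but about 200 seconds at 500,000 IOPS.

    # Back-of-the-envelope time for random reads at the slide's IOPS figures.
    random_reads = 100_000_000  # example workload: 1e8 random 4 KB reads

    for device, iops in [("enterprise HDD", 1_000), ("high-end SSD", 500_000)]:
        seconds = random_reads / iops
        print(f"{device:>15}: {seconds:,.0f} s (~{seconds / 3600:.1f} h)")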

Page 17: Big Data in a Public Cloud

Announcing: SSD priced as HDD

Page 18: Big Data in a Public Cloud

Who runs these workloads, and how?

• CERN – the LHC experiments to find the Higgs Boson

• 200 Petabytes per year

• 200,000 CPU days / day on hundreds of partner grids and clouds

• Monte Carlo simulations (CPU intensive)

Page 19: Big Data in a Public Cloud

The CERN case

• Golden VM images with tools, from which they clone

• A set of coordinator servers that scale the worker nodes up and down via the provisioning API (clone, Puppet config)

• The coordinator servers manage the workload on each worker node running Monte Carlo simulations

• Federated cloud brokers such as Enstratus and Slipstream

• Self-healing architecture – the uptime of a worker node is not critical; just spin up a new worker node instance (see the sketch below)
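A minimal sketch of the self-healing idea (my illustration; the health check and provisioning calls are hypothetical stubs, not CERN's or any vendor's actual API): the coordinator polls its workers and simply replaces any node that stops responding.

    # Self-healing loop: replace any worker that stops responding.
    # is_alive()/clone_worker() are hypothetical stubs; a real coordinator would
    # probe the node, call the cloud's provisioning API, then apply Puppet config.
    import time

    def is_alive(node):
        # Stub: would ping the worker or check its heartbeat.
        return True

    def clone_worker(golden_image="montecarlo-worker-golden"):
        # Stub: would clone a fresh worker from the golden VM image.
        print(f"spinning up replacement from {golden_image}")
        return "new-worker"

    def heal(workers, poll_seconds=60):
        # Keep the worker pool at full strength; dead nodes are simply replaced.
        while True:
            workers = [node if is_alive(node) else clone_worker() for node in workers]
            time.sleep(poll_seconds)

    # heal(["worker-1", "worker-2", "worker-3"])  # runs forever; left commented out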

Page 20: Big Data in a Public Cloud

The loyalty program case

• The customer runs various loyalty programs

• Plain vanilla Hadoop Map/Reduce (a streaming-style sketch follows this list):

• Chef and Puppet to deploy and config worker nodes

• Lots of RAM, little CPU

• Self-healing architecture – the uptime of a worker node is not critical; just spin up a new worker node instance
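For the "plain vanilla Hadoop Map/Reduce" part, a compact way to sketch such a job from Python is the mrjob library, which runs locally or submits to a Hadoop cluster via streaming. The job below is my illustration, not the customer's code, and the CSV layout (customer_id,points per line) is an assumption.

    # Sum loyalty points per customer with a Hadoop-streaming-style job (mrjob).
    # Illustrative only; the customer_id,points input format is an assumption.
    from mrjob.job import MRJob

    class MRPointsPerCustomer(MRJob):
        def mapper(self, _, line):
            # Map: emit (customer_id, points) for each CSV line.
            customer_id, points = line.split(",")[:2]
            yield customer_id, int(points)

        def reducer(self, customer_id, points):
            # Reduce: total the points emitted for each customer.
            yield customer_id, sum(points)

    if __name__ == "__main__":
        MRPointsPerCustomer.run()

Run it locally with "python points_job.py purchases.csv", or against a Hadoop cluster by adding "-r hadoop"; the file names here are hypothetical.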

Page 21: Big Data in a Public Cloud

Find IaaS suppliers with

• Unbundled resources

• Short billing cycles

• New equipment: what's your life cycle?

• SSD as compute storage

• Allows cross-connects from your equipment in the DC

• CloudSigma, ElasticHost, ProfitBricks to mention a few

Page 22: Big Data in a Public Cloud

What we do