Big Data in a public cloud

DESCRIPTION

All these large data sets are so big that it's difficult to manage them with traditional tools. Distributed computing is an approach to solving that problem: first the data needs to be mapped, then it can be analyzed, or reduced.

TRANSCRIPT

Big Data in a public cloud

Maximize RAM without paying for overhead CPU, and other price/performance tricks

Big Data

• Big Data: just another way of saying large data sets

• All of them are so large that it's difficult to manage them with traditional tools

• Distributed computing is an approach to solve that problem

• First the data needs to be Mapped

• Then it can be analyzed – or Reduced

MapReduce

• Instead of talking about it, let's just see some code
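The transcript does not preserve the code that was shown on the slide, so here is a minimal, self-contained Python sketch of the two phases; the word-count task is my own placeholder example, not necessarily what the talk used.

from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one document.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(pairs):
    # Reduce: sum the counts emitted for each word.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

if __name__ == "__main__":
    docs = ["Big Data in a public cloud", "Big Data is just large data sets"]
    mapped = [pair for doc in docs for pair in map_phase(doc)]
    print(reduce_phase(mapped))

The key property is that map_phase touches each document independently, which is exactly what lets the work be spread over many machines.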

Map/Reduce – distributed work

But that’s perfect for public cloud!
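To illustrate why the pattern fits a fleet of cloud instances, here is a sketch that runs the map phase in parallel with Python's multiprocessing module; in a real deployment each shard would go to a separate cloud instance rather than a local process, so this only shows the shape of the workload.

from multiprocessing import Pool
from collections import Counter

def map_worker(document):
    # Each worker counts words in its own shard of the data.
    return Counter(document.lower().split())

def reduce_counts(partials):
    # Reduce: merge the partial counts produced by every worker.
    total = Counter()
    for partial in partials:
        total += partial
    return total

if __name__ == "__main__":
    shards = ["big data in a public cloud",
              "distributed computing solves the size problem",
              "map first then reduce"]
    with Pool(processes=3) as pool:
        partial_counts = pool.map(map_worker, shards)   # the distributed part
    print(reduce_counts(partial_counts))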

CPU / RAM

• Mapping is not CPU intensive

• Reducing is (usually) not CPU intensive

• Speed: load the data into RAM, or it hits the HDD and creates iowait (a small check for this follows below)
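One way to make the iowait point concrete: the sketch below samples the share of CPU time spent waiting on disk I/O using the third-party psutil library (the iowait field is reported on Linux; the 10-second window is an arbitrary choice for illustration).

import time
import psutil  # third-party: pip install psutil

def iowait_share(interval=10):
    # Fraction of CPU time spent in iowait over the interval (Linux only).
    before = psutil.cpu_times()
    time.sleep(interval)
    after = psutil.cpu_times()
    total = sum(after) - sum(before)
    waited = after.iowait - before.iowait
    return waited / total if total else 0.0

if __name__ == "__main__":
    print(f"iowait share over the last 10 s: {iowait_share():.1%}")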

First point – a lot of RAM

Ah…now they have RAM instances

I wanted more RAM, I said ☹

What about a MapReduce PaaS?

• Be aware of data lock-in

• Be aware of forced tool set – limits your workflow

• …and I just don’t like it so much as a developer; it’s a control thing

There is another way

What we do

What is compute like – why commodity?

Now let's get to work

• Chef & Puppet – deploy and config (a minimal bootstrap sketch follows after this list)

• Clone; scale up and down

• Problem solved?
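As a rough picture of the "deploy and config" step, here is a first-boot bootstrap sketch for a freshly cloned worker: it simply shells out to the Puppet agent, assuming the golden image already ships with the agent installed and pointed at your Puppet master (a Chef setup would run chef-client instead).

import subprocess

def bootstrap_worker():
    # Pull and apply this node's configuration once, in the foreground.
    # 'puppet agent --test' exits with code 2 when changes were applied,
    # so a non-zero exit is not treated as a hard failure here.
    subprocess.run(["puppet", "agent", "--test"], check=False)

if __name__ == "__main__":
    bootstrap_worker()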

And then there is reality…iowait

ioWAIT

• 1,000 IOPS on a high-end enterprise HDD

• 500,000 IOPS on a high-end SSD (a quick back-of-the-envelope comparison follows below)

• Public cloud on HDD should be forbidden
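To put those numbers in perspective, here is a quick comparison using the IOPS figures from the slide (the one-million-random-reads workload is just an arbitrary example):

RANDOM_READS = 1_000_000   # arbitrary example workload

for name, iops in (("enterprise HDD", 1_000), ("high-end SSD", 500_000)):
    seconds = RANDOM_READS / iops
    print(f"{name}: {RANDOM_READS:,} random reads take roughly {seconds:,.0f} s")

At 1,000 IOPS that is on the order of a thousand seconds of pure waiting; at 500,000 IOPS it is about two seconds.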

Announcing: SSD priced as HDD

Who does this kind of workload, and how?

• CERN – the LHC experiments to find the Higgs Boson

• 200 Petabytes per year

• 200,000 CPU days / day on hundreds of partner grids and clouds

• Monte Carlo simulations (CPU intensive)

The CERN case

• Golden VM images with tools, from which they clone

• A set of coordinator servers, which scale the worker nodes up and down via the provisioning API (clone, then Puppet config)

• The coordinator servers manage the workload on each worker node running Monte Carlo simulations

• Federated cloud brokers such as Enstratus and Slipstream

• Self-healing architecture – the uptime of a single worker node is not critical; just spin up a new worker node instance (a rough sketch follows after this list)
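Here is a rough Python sketch of the coordinator idea described above: clone workers from the golden image, configure them, and replace any worker that fails a health check. The FakeProvisioningAPI class and all of its methods are invented placeholders for illustration, not CERN's or any vendor's actual API.

import itertools

class FakeProvisioningAPI:
    # Stand-in for a real IaaS provisioning API.
    _ids = itertools.count(1)

    def clone(self, image):
        worker = f"worker-{next(self._ids)}"
        print(f"cloned {image} -> {worker}")
        return worker

    def apply_config(self, worker, role):
        print(f"puppet config: {worker} as {role}")

    def destroy(self, worker):
        print(f"destroyed {worker}")

    def is_healthy(self, worker):
        return not worker.endswith("-3")   # pretend worker-3 has died

def provision(api, image="golden-montecarlo"):
    worker = api.clone(image)                    # clone the golden VM image
    api.apply_config(worker, role="montecarlo")  # then apply the Puppet config
    return worker

def reconcile(api, workers):
    # Self-healing pass: a dead worker is simply destroyed and replaced.
    healthy = []
    for worker in workers:
        if api.is_healthy(worker):
            healthy.append(worker)
        else:
            api.destroy(worker)
            healthy.append(provision(api))
    return healthy

if __name__ == "__main__":
    api = FakeProvisioningAPI()
    workers = [provision(api) for _ in range(4)]   # scale up to four workers
    workers = reconcile(api, workers)              # worker-3 gets replaced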

The loyalty program case

• The customer runs various loyalty programs

• Plain vanilla Hadoop Map/Reduce (a minimal mapper/reducer sketch follows after this list):

• Chef and Puppet to deploy and config worker nodes

• Lots of RAM, little CPU

• Self-healing architecture – the uptime of a single worker node is not critical; just spin up a new worker node instance
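Since the slide calls this plain vanilla Hadoop Map/Reduce, here is what a minimal Hadoop Streaming job could look like with a Python mapper and reducer; the idea of summing points per loyalty-card member, and the tab-separated input format, are assumptions made for the sake of the example. The two scripts would be wired together with the hadoop-streaming jar's -mapper, -reducer, -input and -output options.

# mapper.py -- Hadoop Streaming mapper (its own file; Hadoop feeds raw lines on stdin)
# Assumed input format: member_id<TAB>points_earned
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 2:
        member_id, points = fields
        print(f"{member_id}\t{points}")   # emit key<TAB>value

# reducer.py -- Hadoop Streaming reducer (a separate file; input arrives sorted by key)
import sys

current_member, total = None, 0
for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    member_id, points = line.split("\t")
    if member_id != current_member:
        if current_member is not None:
            print(f"{current_member}\t{total}")
        current_member, total = member_id, 0
    total += int(points)
if current_member is not None:
    print(f"{current_member}\t{total}")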

Find IaaS suppliers with

• Unbundled resources

• Short billing cycles

• New equipment: what's your life cycle?

• SSD as compute storage

• Allows cross connects from your equipment in the DC

• CloudSigma, ElasticHost, ProfitBricks to mention a few

What we do
