casual mass parallel computing

Casual mass parallel data processing in Java Alexey Ragozin Mar 2014

Upload: aragozin

Post on 24-May-2015


DESCRIPTION

Slide deck from NoSQL day in Minsk http://www.belarusjug.org/events/nosql-meetup

TRANSCRIPT

Page 1: Casual mass parallel computing

Casual mass parallel data processing in Java

Alexey Ragozin

Mar 2014

Page 2: Casual mass parallel computing

Building a new bicycle …

Page 3: Casual mass parallel computing

Build Vs. Buy

Build

• No dedicated team to support infrastructure

• Very specific tasks

• Exclusive use of infrastructure

• Reasonable scale

Buy

• Product can be bought as a service (internal or external)

• Large scale

• Multi-tenancy

• You are going to use advanced features (e.g. map/reduce)

Page 4: Casual mass parallel computing

“Casual” computing

• Small computation farms (< 100 servers)

• Team owns both application and grid

• Java platform

• Reasonably short batches (< 24 hours)

• Reasonably small data sets (< 10 TiB)

Page 5: Casual mass parallel computing

Simple master slave topology

[Diagram: master process containing the task queue and scheduler, connected to three slaves. Message labels: Advertise, Task, Report.]

Page 6: Casual mass parallel computing

Simple master slave topology

Control plane

• RMI

Queue / scheduler

• Simple in-memory queue

• May be more complex than just a task queue

Data plane

Page 7: Casual mass parallel computing

Data plane

Never, ever, try to send data over RMI

• File system: avoid network mounts!

• In-memory key-value: client side sharding works best

• Disk database (RDBMS or NoSQL): consider prefetch of data

• Direct socket streaming …
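The client side sharding mentioned above can be sketched roughly as follows. This is a minimal illustration, not code from the talk: the class name `ShardedKeyValueClient` is hypothetical, and plain in-process maps stand in for the remote key-value nodes. The point is only that the client alone decides which node owns a key, so no routing tier is needed.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

/** Hypothetical client that picks a shard by hashing the key on the client side. */
final class ShardedKeyValueClient {

    // Each map stands in for one remote key-value node (an assumption for this sketch).
    private final List<Map<String, String>> shards = new ArrayList<>();

    ShardedKeyValueClient(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        for (int i = 0; i < shardCount; i++) {
            shards.add(new HashMap<>());
        }
    }

    /** Stable hash of the key, reduced to a shard index in [0, shardCount). */
    int shardFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % shards.size());
    }

    void put(String key, String value) {
        shards.get(shardFor(key)).put(key, value);
    }

    /** Returns the stored value, or null if the key is unknown. */
    String get(String key) {
        return shards.get(shardFor(key)).get(key);
    }

    public static void main(String[] args) {
        ShardedKeyValueClient client = new ShardedKeyValueClient(4);
        client.put("task-1", "input-a");
        System.out.println("task-1 lives on shard " + client.shardFor("task-1"));
        System.out.println("value: " + client.get("task-1"));
    }
}
```

Simple modulo sharding reshuffles most keys when the node count changes, which is usually acceptable for a batch whose set of nodes is fixed up front.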

Page 8: Casual mass parallel computing

Distributed objects revised

Pitfalls of CORBA/RMI

• IDL – functional contract

• IDL – protocol

Separating concerns

• Functional contract – wrapper object

• Protocol – hidden remote interface
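One possible reading of this separation, sketched in Java. All names here (`PriceSource`, `RemotePriceService`, `CachingPriceSource`) are hypothetical and not taken from the talk; the remote side is replaced by a local lambda so the sketch runs without an RMI registry. Application code sees only the functional contract, while the wrapper object keeps the remote interface and a cache to itself.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.HashMap;
import java.util.Map;

/** Functional contract: what application code programs against. No RMI types leak out. */
interface PriceSource {
    double priceOf(String symbol);
}

/** Protocol: the remote interface, known only to the wrapper. */
interface RemotePriceService extends Remote {
    double fetchPrice(String symbol) throws RemoteException;
}

/** Wrapper object: implements the contract, hides the protocol, adds caching. */
final class CachingPriceSource implements PriceSource {

    private final RemotePriceService remote;
    private final Map<String, Double> cache = new HashMap<>();

    CachingPriceSource(RemotePriceService remote) {
        this.remote = remote;
    }

    @Override
    public double priceOf(String symbol) {
        Double cached = cache.get(symbol);
        if (cached != null) {
            return cached;
        }
        try {
            double price = remote.fetchPrice(symbol);
            cache.put(symbol, price);
            return price;
        } catch (RemoteException e) {
            // Translate the protocol failure so callers never see RMI exceptions.
            throw new IllegalStateException("remote call failed for " + symbol, e);
        }
    }

    /** Demo helper: counts how often the (fake) remote side is actually called. */
    static int remoteCallsForTwoLookups() {
        int[] calls = {0};
        RemotePriceService stub = symbol -> {
            calls[0]++;
            return 42.0;
        };
        PriceSource prices = new CachingPriceSource(stub);
        prices.priceOf("ACME");
        prices.priceOf("ACME");
        return calls[0];
    }

    public static void main(String[] args) {
        System.out.println("remote calls: " + remoteCallsForTwoLookups());
    }
}
```

Because the cache lives in the wrapper, the network protocol can change without touching any caller.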

Page 9: Casual mass parallel computing

Distributed objects revised

Renewed distributed objects paradigm

Strong

• Polymorphism

• Encapsulation (network protocol, caching aspects, etc.)

Weak

• Homogeneous code base required

• Synchronous network communications

Page 10: Casual mass parallel computing

Deployment problem

Brute force

• Build / package

• Deploy / SCP

• Restart slaves

• Start batch

• Change code, repeat

Computation grid software

• Compile and run batch

Behind the scenes

• Your classes are collected

• Associated with the batch

• Deployed on participating slaves

Page 11: Casual mass parallel computing

Central scheduler topology

[Diagram: a queue server hosts the task queue. Batch controllers add tasks and consume reports; slaves pull tasks and send reports.]
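The pull model in this topology can be sketched in a single JVM. This is an illustration under assumptions, not the author's implementation: threads stand in for slaves, two `BlockingQueue` instances stand in for the queue server, and the class name and the poison-pill shutdown convention are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** In-process stand-in for the queue server: one task queue, one report queue. */
final class QueueServerSketch {

    /** Sentinel task telling a slave to stop (a convention assumed for this sketch). */
    private static final int POISON = -1;

    private final BlockingQueue<Integer> tasks = new LinkedBlockingQueue<>();
    private final BlockingQueue<String> reports = new LinkedBlockingQueue<>();

    /** Slave side: pull tasks until the poison pill arrives, reporting each result. */
    Thread startSlave(String name) {
        Thread slave = new Thread(() -> {
            try {
                while (true) {
                    int n = tasks.take();
                    if (n == POISON) {
                        return;
                    }
                    reports.add(n + "^2=" + (long) n * n);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        slave.start();
        return slave;
    }

    /** Batch controller side: add tasks, consume one report per task, stop the slaves. */
    static List<String> runBatch(int taskCount, int slaveCount) {
        QueueServerSketch server = new QueueServerSketch();
        List<Thread> slaves = new ArrayList<>();
        for (int i = 0; i < slaveCount; i++) {
            slaves.add(server.startSlave("slave-" + i));
        }
        for (int n = 1; n <= taskCount; n++) {
            server.tasks.add(n);
        }
        List<String> result = new ArrayList<>();
        try {
            for (int i = 0; i < taskCount; i++) {
                result.add(server.reports.take());
            }
            for (int i = 0; i < slaveCount; i++) {
                server.tasks.add(POISON);
            }
            for (Thread slave : slaves) {
                slave.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running batch", e);
        }
        Collections.sort(result); // reports arrive in arbitrary order
        return result;
    }

    public static void main(String[] args) {
        System.out.println(runBatch(5, 3));
    }
}
```

Since slaves pull work, a slow slave simply takes fewer tasks; the controller needs no knowledge of slave capacity.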

Page 12: Casual mass parallel computing

Or more elaborate

Page 13: Casual mass parallel computing

Flavors of parallel processing

Flow organized tasks

• Input data available before task starts

• e.g. Map/Reduce

Collaborative tasks

• Tasks communicate intermediate results to each other

• e.g. physics simulations

Page 14: Casual mass parallel computing

Back to the data plane

Rules of thumb

• Insert / delete – never update

• Write locally (reducing risks)

• Read remotely (retry on error)

• Store input as is: file system, or document / column oriented NoSQL

• Input and temporary data are different: choose the right store for each
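The "read remotely (retry on error)" rule could look something like the helper below. It is a sketch under assumptions: the class name `RetryingReader`, the fixed pause between attempts, and the choice to wrap the final failure in `IllegalStateException` are all illustrative, not from the talk. Retrying is safe here precisely because of the first rule: data that is never updated reads the same on every attempt.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper that retries a remote read a bounded number of times. */
final class RetryingReader {

    /**
     * Calls {@code read} until it succeeds or {@code maxAttempts} is exhausted,
     * sleeping {@code pauseMillis} between attempts.
     */
    static <T> T readWithRetry(Callable<T> read, int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return read.call();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted during read", e);
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
            if (attempt < maxAttempts) {
                pause(pauseMillis);
            }
        }
        throw new IllegalStateException("read failed after " + maxAttempts + " attempts", last);
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fake remote source that fails twice before answering.
        String value = readWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("transient network error");
            }
            return "payload";
        }, 5, 10L);
        System.out.println(value + " after " + calls[0] + " attempts");
    }
}
```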

Page 15: Casual mass parallel computing

Exploiting file system

Avoid network file systems

• File system concept is not designed to be distributed

• A good network file system cannot exist

• Use simple remote file access protocols

  • SCP (unencrypted data transfer options added by CERN guys)

  • HTTP (if you really do not want SCP)

A cheap SAN could be built from open source

Page 16: Casual mass parallel computing

Algorithmic optimization

Parallel computing

• N times speed up will increase your OPEX and CAPEX cost by N*lg(N)

Algorithmic optimization

• Up front costs only

• Orders of magnitude optimization opportunities

• Exciting coding

• Ecological way of computing
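To get a feel for the parallel computing rule above, here is a rough worked example. It assumes that lg means the base-2 logarithm (the slide does not state the base) and reads the rule as a multiplier on the single-node cost, written here as the hypothetical symbol $C_1$:

```latex
C_N \approx C_1 \cdot N \lg N
\qquad\Longrightarrow\qquad
C_{16} \approx C_1 \cdot 16 \cdot \lg 16 = C_1 \cdot 16 \cdot 4 = 64\, C_1
```

Under that reading, a 16 times speed up costs about 64 times as much, whereas an algorithmic improvement of the same size is paid for once, up front.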

Page 17: Casual mass parallel computing

Streaming algorithms

Finding N most frequent elements

• Count-Min sketch

Estimating number of unique values

• HyperLogLog

Distribution histograms

https://github.com/addthis/stream-lib

https://github.com/rwl/ParallelColt
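For a sense of how such streaming structures work, below is a small self-contained Count-Min sketch. It is written from scratch for illustration and does not use or mirror the stream-lib API; the class name, the bit-mixing hash and the parameters are assumptions. The sketch only estimates per-item frequencies; finding the N most frequent elements would additionally require tracking candidate items, for example in a heap.

```java
import java.util.Random;

/** Minimal Count-Min sketch: fixed memory, estimates never undercount. */
final class CountMinSketchDemo {

    private final int depth;
    private final int width;
    private final long[][] counts;
    private final int[] seeds;

    CountMinSketchDemo(int depth, int width, long seed) {
        if (depth <= 0 || width <= 0) {
            throw new IllegalArgumentException("depth and width must be positive");
        }
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
        this.seeds = new int[depth];
        Random random = new Random(seed);
        for (int row = 0; row < depth; row++) {
            seeds[row] = random.nextInt();
        }
    }

    /** Per-row hash: the item's hashCode mixed with a row seed, reduced to [0, width). */
    private int bucket(Object item, int row) {
        int h = item.hashCode() ^ seeds[row];
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return Math.floorMod(h, width);
    }

    /** Records one occurrence of the item. */
    void add(Object item) {
        for (int row = 0; row < depth; row++) {
            counts[row][bucket(item, row)]++;
        }
    }

    /** Returns an estimate that is at least the true count; collisions can only inflate it. */
    long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counts[row][bucket(item, row)]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketchDemo sketch = new CountMinSketchDemo(4, 1024, 42L);
        for (int i = 0; i < 1000; i++) {
            sketch.add("user-" + (i % 50));
        }
        for (int i = 0; i < 200; i++) {
            sketch.add("hot-key");
        }
        System.out.println("hot-key estimate: " + sketch.estimate("hot-key"));
    }
}
```

Memory use is depth × width counters regardless of how many distinct items pass through, which is what makes the structure usable on a stream.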

Page 18: Casual mass parallel computing

NanoCloud – drastically simplified coding for computing clusters

Page 19: Casual mass parallel computing

As easy as …

    @Test
    public void hello_remote_world() {
        Cloud cloud = CloudFactory.createSimpleSshCloud();
        cloud.node("myserver.acme.com").exec(new Callable<Void>() {
            @Override
            public Void call() throws Exception {
                String localhost = InetAddress.getLocalHost().toString();
                System.out.println("Hi! I'm running on " + localhost);
                return null;
            }
        });
    }

Page 20: Casual mass parallel computing

All you need is …

NanoCloud requirements

SSHd

Java (1.6 or above) installed

Works through NAT and firewalls

Works on Amazon EC2

Works wherever SSH works

Page 21: Casual mass parallel computing

Master – slave communications

[Diagram: master process and slave host linked by SSH (single TCP connection). Labels: slave controllers, agent, multiplexed slave streams (std in, std out, std err, diag), RMI (TCP), slaves.]
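Carrying several slave streams over one connection can be pictured with a simple framing scheme. The sketch below is not NanoCloud's actual wire protocol, which the slides do not describe; the frame layout (channel id, length, payload), the channel numbers and all names are assumptions made for illustration, and byte arrays stand in for the SSH connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical framing: many logical streams interleaved on one byte stream. */
final class StreamMuxSketch {

    static final int STD_OUT = 1; // channel ids are arbitrary choices for this sketch
    static final int STD_ERR = 2;

    /** Writes one frame: channel id, payload length, payload bytes. */
    static void writeFrame(DataOutputStream out, int channel, String text) throws IOException {
        byte[] payload = text.getBytes(StandardCharsets.UTF_8);
        out.writeInt(channel);
        out.writeInt(payload.length);
        out.write(payload);
    }

    /** Reads frames until end of stream and reassembles the text of each channel. */
    static Map<Integer, String> demux(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        Map<Integer, ByteArrayOutputStream> buffers = new TreeMap<>();
        while (true) {
            int channel;
            try {
                channel = data.readInt();
            } catch (EOFException endOfStream) {
                break;
            }
            byte[] payload = new byte[data.readInt()];
            data.readFully(payload);
            buffers.computeIfAbsent(channel, c -> new ByteArrayOutputStream()).write(payload);
        }
        Map<Integer, String> result = new TreeMap<>();
        for (Map.Entry<Integer, ByteArrayOutputStream> e : buffers.entrySet()) {
            result.put(e.getKey(), new String(e.getValue().toByteArray(), StandardCharsets.UTF_8));
        }
        return result;
    }

    /** Interleaves two channels on one "wire" and splits them again. */
    static Map<Integer, String> roundTrip() {
        try {
            ByteArrayOutputStream wire = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(wire);
            writeFrame(out, STD_OUT, "hello ");
            writeFrame(out, STD_ERR, "oops");
            writeFrame(out, STD_OUT, "world");
            out.flush();
            return demux(new ByteArrayInputStream(wire.toByteArray()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip());
    }
}
```

Length-prefixed frames let the receiver split the streams without scanning for delimiters, so arbitrary binary output from a slave passes through unchanged.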

Page 23: Casual mass parallel computing

Thank you

Alexey Ragozin [email protected]

http://blog.ragozin.info - my articles

http://code.google.com/p/gridkit

http://github.com/gridkit - my open source code

http://aragozin.timepad.ru - community events in Moscow