Data Aware Routing in the Cloud

Eugene Steinberg, CTO, Grid Dynamics


Post on 11-May-2015


DESCRIPTION

Large-scale HPC infrastructures such as Sun Grid Engine (SGE) sooner or later face data access issues, usually caused by the centralized nature of data sources. This talk discusses an approach to mitigating those issues by employing in-memory data grids such as Oracle Coherence. Collocating compute and data grid nodes on the same hosts gives not only better resource utilization but also performance gains through data aware routing. Grid Dynamics will present a reference architecture for building a data aware routing layer on top of SGE and Oracle Coherence, along with a demo that highlights the performance benefits of data aware routing.

TRANSCRIPT

Page 1: Data Aware Routing In The Cloud

Data Aware Routing in the Cloud

Eugene Steinberg, CTO, Grid Dynamics

Page 2: Data Aware Routing In The Cloud

Sun HPC Software Workshop 2009 | regensburg, germany | hpcworkshop.com 2

Sun HPC Software Workshop

Traditional HPC approach: HPC cluster with centralized data access

Page 3: Data Aware Routing In The Cloud


Data-driven scalability challenges in the traditional HPC approach

• Data is far away
  • Latency of remote connection
  • Latency of data travelling through pipes
  • Chatty random data access algorithms are expensive
• Data is centralized
  • Centralized HW resources are always limited
  • Centralized RAM is limited: disk I/O is inevitable
  • Connections are limited
  • Highly concurrent data access doesn't scale well

Page 4: Data Aware Routing In The Cloud


Usual solution: Data Grids

• Classic data grid
  • Data is partitioned
  • Partitions are stored in memory
  • Data grid is deployed close to the compute grid
  • Search is parallelized over partitions
  • Built-in replication, persistence, coherence, failover
• What is achieved?
  • Reduced latency and data movement cost
  • Improved connection scalability
  • Reduced data contention
  • No disk I/O: 100% memory speed
• Is this the best we can do?

Page 5: Data Aware Routing In The Cloud


Limitations of Compute Grid + Data Grid

• Two separate grid environments
  • Hardware, footprint and management costs of dual infrastructure
  • Segregated infrastructures cannot share resources
• Sub-optimal resource utilization
  • Compute grid is CPU-bound, not RAM-bound
  • Data grid is RAM-bound, not CPU-bound
• Still sub-optimal performance
  • Still paying for remote network calls and data movement

Page 6: Data Aware Routing In The Cloud


Better answer: Compute-Data Grid

• Shared hardware between compute and data grid
  • Data grid utilizes RAM of host machines
  • Compute grid runs HPC jobs on the same host machines
• Opportunity for data aware routing
  • Many applications support compute-data affinity
  • Reduced overhead on remote calls and data movement
• Opportunity to scale
  • As the HPC application scales out or in, data partitions are spread over a larger or smaller pool of hosts
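The scaling bullet above can be illustrated with a minimal sketch. It assumes a simple round-robin placement of a fixed number of partitions over whatever hosts are currently in the pool; the function name, host names and partition count are hypothetical, and a real data grid such as Oracle Coherence performs its own partition distribution rather than using this scheme.

```python
def assign_partitions(partition_count, hosts):
    """Spread partitions round-robin over the current pool of hosts.

    Illustration only: a real data grid decides placement itself.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    assignment = {host: [] for host in hosts}
    for partition in range(partition_count):
        host = hosts[partition % len(hosts)]
        assignment[host].append(partition)
    return assignment


# The same 12 partitions on a small pool and on a scaled-out pool.
small_pool = assign_partitions(12, ["node1", "node2"])
large_pool = assign_partitions(12, ["node1", "node2", "node3", "node4"])

print({host: len(parts) for host, parts in small_pool.items()})
print({host: len(parts) for host, parts in large_pool.items()})
```

With more hosts each one holds fewer partitions, so both the RAM footprint and the compute work that follows the data are spread more thinly.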

Page 7: Data Aware Routing In The Cloud


Data aware routing in Compute-Data grid

• What is data aware routing?
  • Route HPC tasks to the machine hosting the relevant data grid partition
  • The relevant partition is identified by a data affinity key
• Why data aware routing?
  • Tasks access data via loopback: reduced latency
  • Data doesn't travel through pipes and switches: bandwidth savings
• Data aware routing architecture goals
  • Pluggable: lightweight core + adapters for a variety of grid products
  • Non-intrusive: neither the compute grid nor the data grid is aware of being coordinated
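As a rough sketch of the idea on this slide, an affinity key can be hashed to a partition, and the partition looked up in an ownership table to find the host that a task should run on. Everything below is an assumption made for illustration: the CRC32-based hash, the partition count, the ownership table and the host names are hypothetical and do not reflect how any particular data grid assigns keys.

```python
import zlib

PARTITION_COUNT = 8  # hypothetical number of data grid partitions

# Hypothetical ownership table: partition number -> host holding it in RAM.
PARTITION_OWNERS = {
    0: "node1", 1: "node2", 2: "node3", 3: "node4",
    4: "node1", 5: "node2", 6: "node3", 7: "node4",
}


def partition_for(affinity_key):
    """Map an affinity key to a partition with a stable hash."""
    return zlib.crc32(affinity_key.encode("utf-8")) % PARTITION_COUNT


def host_for(affinity_key):
    """Return the host that owns the partition for this affinity key."""
    return PARTITION_OWNERS[partition_for(affinity_key)]


for ticker in ["ORCL", "JAVA", "IBM"]:
    print(ticker, "->", host_for(ticker))
```

A task that reads only the data behind one key can then be sent to that host and reach the data over loopback instead of the network.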

Page 8: Data Aware Routing In The Cloud


Data aware routing architecture

Page 9: Data Aware Routing In The Cloud


Components

• Core components
  • Data Grid Adapter: service responsible for knowing the Data Grid's topology and state
  • Data Aware Wrapper: client-side library extending the Compute Grid's scheduling API to support data-aware job scheduling
• Configuration variations
  • Data Grid Adapter can be deployed as a remote service or embedded as a library

Page 10: Data Aware Routing In The Cloud


Data aware routing workflow

1. Client submits task + affinity key to Wrapper
2. Wrapper requests host hint from Data Grid Adapter
3. Data Grid Adapter gives host hint using affinity key
4. Wrapper submits task with host hint using HPC API
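The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the architecture's actual code: the class and method names are hypothetical, the adapter is a plain in-memory lookup standing in for a service that tracks grid topology, and the host hint is expressed as an SGE-style soft resource request on a `qsub` command line, whereas the talk only says the task is submitted through the HPC API. The submit function is injected so the sketch runs without a cluster.

```python
class DataGridAdapter:
    """Stand-in for the service that knows the data grid topology."""

    def __init__(self, key_to_host):
        self._key_to_host = dict(key_to_host)

    def host_hint(self, affinity_key):
        # Step 3: answer with the host holding the key, or None if unknown.
        return self._key_to_host.get(affinity_key)


class DataAwareWrapper:
    """Client-side wrapper that adds a host hint before submitting."""

    def __init__(self, adapter, submit):
        self._adapter = adapter
        self._submit = submit  # injected; would call the scheduler in practice

    def submit(self, script, affinity_key):
        # Step 2: ask the adapter where the relevant data lives.
        host = self._adapter.host_hint(affinity_key)
        command = ["qsub"]
        if host is not None:
            # Assumed form of the hint: a soft request, so the scheduler
            # may still place the task elsewhere if that host is busy.
            command += ["-soft", "-l", f"hostname={host}"]
        command.append(script)
        # Step 4: hand the task, with its hint, to the compute grid.
        return self._submit(command)


submitted = []  # records commands instead of running them
adapter = DataGridAdapter({"ORCL": "node2"})
wrapper = DataAwareWrapper(adapter, submitted.append)

# Step 1: the client submits a task together with its affinity key.
wrapper.submit("evaluate_ticker.sh", "ORCL")
wrapper.submit("evaluate_ticker.sh", "JAVA")  # unknown key: no hint added
print(submitted)
```

Keeping the hint optional matches the non-intrusive goal: when the adapter has no answer, the task is submitted exactly as it would be without data aware routing.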

Page 11: Data Aware Routing In The Cloud


Cloud Factor: opportunities and challenges

Emerging cloud computing technologies

• Offer benefits for HPC architectures
  • Automated cluster provisioning and deployment
  • Ability to set up large clusters quickly for once-in-a-blue-moon jobs
  • Ability to scale the cluster up and down on demand
  • Easy to spawn isolated development/testing environments
• … as well as some new challenges
  • Networking limitations (e.g. absence of multicast)
  • No strong promises on CPU, I/O and bandwidth shares
  • No real perimeter firewall: security considerations

Page 12: Data Aware Routing In The Cloud


Demo: Simplistic Trading Analytics

• Application characteristics
  • Data-intensive application: N*10K trade objects (2 KB payload each) for 4*N stock tickers
• Processing algorithm
  • Job: evaluate all tickers
  • Task: process one ticker's trades with a chatty random-access algorithm
• Data source and task scheduling control
  • Data sources: RDBMS or IMDG
  • Scheduling modes: neutral and data-aware
  • Task and job latency measurement
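To give a feel for the data volumes behind these figures, the sketch below works through the arithmetic for a chosen N. It assumes that "2Kb" means 2 KiB (2048 bytes) per trade and that trades are spread evenly across tickers; neither is stated on the slide, and the function and parameter names are hypothetical.

```python
def demo_dataset(n, trades_per_n=10_000, payload_bytes=2 * 1024, tickers_per_n=4):
    """Size the demo data set for a given N (assumes 2 KiB per trade)."""
    trades = n * trades_per_n
    tickers = n * tickers_per_n
    return {
        "trades": trades,
        "tickers": tickers,
        # Assumes an even spread of trades over tickers.
        "trades_per_ticker": trades // tickers,
        "payload_mib": trades * payload_bytes / (1024 * 1024),
    }


sizing = demo_dataset(10)
print(sizing)
```

For N = 10 this gives 100,000 trades over 40 tickers, 2,500 trades per ticker, and about 195 MiB of payload in total. Under these assumptions the per-ticker count stays at 2,500 for any N, so each task touches roughly 5 MiB of data regardless of the overall scale.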

Page 13: Data Aware Routing In The Cloud


Page 14: Data Aware Routing In The Cloud


SGE + Oracle Coherence on AWS

• AWS provisioning considerations
  • Customizing a popular image is faster than spinning up a custom AMI
  • Make sure sshd has started on a freshly launched machine
• SGE considerations
  • Reverse DNS lookup: /etc/hosts must contain all hostnames
  • Passwordless ssh access
• Oracle Coherence considerations
  • No multicast: use unicast + well-known addresses
  • Many short-lived clients: do not join the TCMP cluster, use Coherence*Extend
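For the SGE point about /etc/hosts, one way to keep every node's host table consistent is to render the entries from a single map of cluster hostnames to private IP addresses and distribute the result at provisioning time. The sketch below only builds the text; the hostnames, addresses and function name are hypothetical, and writing the file or pushing it to nodes is left out.

```python
def render_hosts_entries(hosts):
    """Render /etc/hosts lines for every node in the cluster.

    `hosts` maps hostname -> IP address; entries are sorted by hostname
    so that every node receives an identical file.
    """
    lines = ["127.0.0.1 localhost"]
    for name, ip in sorted(hosts.items()):
        lines.append(f"{ip} {name}")
    return "\n".join(lines) + "\n"


# Hypothetical cluster: one SGE master and two execution hosts.
cluster = {
    "node2": "10.0.0.12",
    "master": "10.0.0.10",
    "node1": "10.0.0.11",
}
hosts_text = render_hosts_entries(cluster)
print(hosts_text)
```

Because the output lists every hostname on every node, forward and reverse lookups resolve the same way across the cluster without relying on external DNS.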

Page 15: Data Aware Routing In The Cloud


Page 16: Data Aware Routing In The Cloud

Eugene Steinberg, [email protected]
