Azure Brain: 4th paradigm, scientific discovery & (really) big data

Azure Brain: 4th paradigm, scientific discovery & (really) big data (REC201)
Gabriel Antoniu, Senior Research Scientist, Inria; Head of the KerData Project-Team, Inria Rennes – Bretagne Atlantique
Radu Tudoran, PhD student, ENS Cachan – Brittany; KerData Project-Team, Inria Rennes – Bretagne Atlantique
Feb. 12, 2013

Upload: microsoft-technet-france

Post on 20-Jan-2015

DESCRIPTION

"A cloud for comparing our genes with brain images." The database pioneer Jim Gray, now deceased, announced in 2007 the emergence of a 4th scientific paradigm: digital scientific research driven entirely by the exploration of massive data. This vision is now everyday reality in scientific research laboratories, and it goes well beyond what is commonly called "BIG DATA". In 2010, Microsoft Research and Inria started a project called Azure-Brain (or A-Brain), whose originality is to both build, on top of Windows Azure, a new platform for accessing massive data from scientific applications, and to confront the reality of scientific research. In this session we first set out the research challenges of managing massive data in the cloud, then present the "TomusBlobs" platform, cloud storage optimized for Azure. Finally we present the A-Brain project and the results we have obtained. Neuroimaging contributes to the diagnosis of certain diseases of the nervous system, but our brains all turn out to be slightly different from one another, and this variability complicates medical interpretation. Hence the idea of correlating MRI images of the brain with each patient's genetic profile in order to better delimit the brain regions of symptomatic interest. The high-resolution MRI images of this project are produced by the NeuroSpin platform at CEA (Saclay). The problem for researchers is the mass of information to process: an individual's genetic profile comprises about one million data points, to which are added equally colossal volumes of 3D voxels describing the images. A data deluge: petabytes of data and potentially years of computation. This is where the cloud comes in, with a platform optimized on Azure to run massively parallel applications on massive data. As Gabriel Antoniu, its head, explains, this Rennes-based research team has developed "efficient storage mechanisms to improve access to this massive data and optimize its processing. Our developments meet the application needs of our colleagues in Saclay."

TRANSCRIPT

Page 1: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Azure Brain: 4th paradigm, scientific discovery & (really) big data (REC201)

Gabriel Antoniu, Senior Research Scientist, Inria

Head of the KerData Project-Team, Inria Rennes – Bretagne Atlantique

Radu Tudoran, PhD student, ENS Cachan – Brittany

KerData Project-Team, Inria Rennes – Bretagne Atlantique

Feb. 12, 2013

Page 2: Azure Brain: 4th paradigm, scientific discovery & (really) big data

INRIA’s strategy in Cloud Computing

INRIA is among the leaders in Europe in the area of distributed computing and HPC

• Long history of research on distributed systems, HPC, Grids

• Now several activities on virtualized environments / cloud infrastructures

• Culture of multidisciplinary research

• Culture of exploration tools (owner of massively parallel machines since 1987, large-scale testbeds such as Grid'5000)

• Strong involvement in national, European and international collaborative projects

• Strong collaboration history with industry (Joint Microsoft Research – Inria Centre, IBM, EDF, Bull, etc.)

- 2

Page 3: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Clouds: where within Inria?

Inria's five research domains:
1. Applied Mathematics, Computation and Simulation
2. Algorithmics, Programming, Software and Architecture
3. Networks, Systems and Services, Distributed Computing
4. Perception, Cognition, Interaction
5. Computational Sciences for Biology, Medicine and the Environment

- 3

Page 4: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Some project-teams involved in Cloud Computing

Inria centres (on the map): Nancy – Grand Est, Grenoble – Rhône-Alpes, Sophia Antipolis – Méditerranée, Rennes – Bretagne Atlantique, Bordeaux – Sud-Ouest, Lille – Nord Europe, Saclay – Île-de-France, Paris – Rocquencourt

Project-teams:
• KERDATA: data storage and processing
• MYRIADS: autonomous distributed systems
• ASCOLA: languages and virtualization
• CEPAGE: task management
• AVALON: middleware & programming
• MESCAL: models & tools
• REGAL: large-scale distributed systems
• ALGORILLE: algorithms & models
• OASIS: programming
• ZENITH: scientific data management

- 4

Page 5: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Initiatives to support Cloud Computing and HPC within Inria

Why dedicated initiatives to support HPC/Clouds?

• Project-teams are geographically dispersed

• Project-teams belong to different domains

• Researchers from scientific computing need access to the latest research results related to tools, libraries, runtime systems, …

• Researchers from "computer science" need access to applications to test their ideas as well as to find new ideas!

Concept of "Inria Large Scale Initiatives"

• Enable ambitious projects linked with the strategic plan

• Promote an interdisciplinary approach

• Mobilize the expertise of Inria researchers around key challenges

- 5

Page 6: Azure Brain: 4th paradigm, scientific discovery & (really) big data

CLOUD COMPUTING @ INRIA RENNES – BRETAGNE ATLANTIQUE

- 6

Page 7: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Some Research Focus Areas

Software architecture and infrastructure for cloud computing

• Autonomic service management, resource management, SLA, sky computing: Myriads

• Big Data storage and management, MapReduce: KerData

• Hybrid Cloud and P2P systems, privacy: ASAP

Advanced usage for specific application communities

• Bioinformatics: GENSCALE

• Cloud for medical imaging: EasyMed project (IRT B-Com): Visages

- 7


Page 9: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Contrail EU project

Goal: develop an integrated approach to virtualization offering

• services for federating IaaS clouds

• elastic PaaS services on top of federated clouds

Overview: provide tools for

• managing federation of multiple heterogeneous IaaS clouds

• offering a secure yet usable platform for end users through federated identity management

• supporting SLAs and quality of service (QoS) for satisfying stringent business requirements for using the cloud

[Architecture diagram: applications run on top of a federation API and federation core that spans multiple resource providers, storage providers, a network provider and a public cloud]

Contrail is an open source cloud computing software stack compliant with cloud standards

http://contrail-project.eu

- 9

Page 10: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Contrail EU project

http://contrail-project.eu

http://contrail.projects.ow2.org/xwiki/bin/view/Main/WebHome

- 10

Page 11: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Other Research Activities on Cloud Computing

Snooze: an autonomic energy-efficient IaaS management system
• Scalability: distributed VM management system; self-organizing & self-healing hierarchy
• Energy conservation: idle nodes in power-saving mode; holistic approach to favor idle nodes
• VM management algorithms: energy-efficient VM placement; under-load / overload mitigation; automatic node power-cycling and wake-up
Open source software under the GNU GPLv2 license: http://snooze.inria.fr

Resilin: Elastic MapReduce on multiple clouds (sky computing)
• Goals: creation of MapReduce execution platforms on top of multiple clouds; elasticity of the platforms; support all kinds of Hadoop jobs; support different Hadoop versions
• Interfaces: Amazon EMR for users; Libcloud with underlying IaaS providers
Open source software under the GNU Affero GPL license: http://resilin.inria.fr

- 11

Page 13: Azure Brain: 4th paradigm, scientific discovery & (really) big data

The Data Science: The 4th Paradigm for Scientific Discovery

• Thousand years ago: description of natural phenomena
• Last few hundred years: Newton's laws, Maxwell's equations…
• Last few decades: simulation of complex phenomena
• Today and the future: using data exploration and data mining (from instruments, sensors, humans…) to unify theory, experiment and simulation with large multidisciplinary data; distributed communities

Credits: Dennis Gannon

13


Page 15: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Research Focus: How to efficiently store, share and process data for new-generation, data-intensive applications?

• Scientific challenges

• Massive data (1 object = 1 TB)

• Geographically distributed

• Fine-grain access (MB) for reading and writing

• High concurrency (10³ concurrent clients)

• Without locking

- Major goal: high-throughput under heavy concurrency

- Our contribution

Design and implementation of distributed algorithms

Validation with real apps on real platforms with real users

• Applications

• Massive data analysis: clouds (e.g. MapReduce)

• Post-Petascale HPC simulations: supercomputers

- 15

Page 16: Azure Brain: 4th paradigm, scientific discovery & (really) big data

BlobSeer: A Software Platform for Scalable, Distributed BLOB Management

Started in 2008, 6 PhD theses (Gilles Kahn/SPECIF PhD Thesis Award in 2011)

Main goal: optimized data access under heavy concurrency

Three key ideas
• Decentralized metadata management
• Lock-free concurrent writes (enabled by versioning)
  - Write = create new version of the data
• Data and metadata "patching" rather than updating
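A minimal sketch of the versioning idea (a conceptual illustration only, not BlobSeer code): a write never updates data in place, it publishes a new immutable version patched on top of an existing one, so readers of older versions are never blocked.

```python
# Conceptual sketch of lock-free writes through versioning (not BlobSeer itself).
import itertools

class VersionedBlob:
    def __init__(self):
        self._versions = {0: b""}          # version number -> immutable snapshot
        self._counter = itertools.count(1)
        self.latest = 0

    def write(self, offset, data, base_version=None):
        """Write = create a new version of the data (patch, do not update)."""
        base = self._versions[self.latest if base_version is None else base_version]
        base = base.ljust(offset, b"\0")                     # extend if needed
        patched = base[:offset] + data + base[offset + len(data):]
        version = next(self._counter)
        self._versions[version] = patched                    # publish new snapshot
        self.latest = version
        return version

    def read(self, version=None):
        return self._versions[self.latest if version is None else version]

blob = VersionedBlob()
v1 = blob.write(0, b"hello world")
v2 = blob.write(6, b"cloud")       # old version remains readable after the new write
assert blob.read(v1) == b"hello world" and blob.read(v2) == b"hello cloud"
```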

A back-end for higher-level data management systems
• Short term: highly scalable distributed file systems
• Middle term: storage for cloud services

Our approach
• Design and implementation of distributed algorithms
• Experiments on the Grid'5000 grid/cloud testbed
• Validation with "real" apps on "real" platforms: Nimbus, Azure, OpenNebula clouds…

http://blobseer.gforge.inria.fr/

- 16

Page 17: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Impact of BlobSeer: MapReduce

BlobSeer improves Hadoop
• Gain (execution time): 35%

ANR MapReduce Project (2010-2014)
• Lead: G. Antoniu (KerData)
• Partners: INRIA (AVALON), Argonne National Lab, U. Illinois Urbana-Champaign, IBM, JLPC, IBCP, MEDIT
• Strong collaboration with the Nimbus team from Argonne National Lab
  - BlobSeer integrated with the Nimbus cloud toolkit
  - BlobSeer used for efficient VM deployment and snapshotting
• Validation: Grid'5000 with Nimbus, FutureGrid (USA), Open Cirrus (USA)

http://mapreduce.inria.fr

- 17

Page 18: Azure Brain: 4th paradigm, scientific discovery & (really) big data

The A-Brain Project: Data-Intensive Processing on

Microsoft Azure Clouds

Application
• Large-scale joint genetic and neuroimaging data analysis

Goal
• Assess and understand the variability between individuals

Approach
• Optimized data processing on Microsoft's Azure clouds

Inria teams involved
• KerData (Rennes)
• Parietal (Saclay)

Framework
• Joint MSR-Inria Research Center
• MS involvement: Azure teams, EMIC

18

Page 19: Azure Brain: 4th paradigm, scientific discovery & (really) big data

The Imaging Genetics Challenge: Comparing Heterogeneous Information

[Diagram: three heterogeneous data types, genetic information (SNPs), MRI brain images, and clinical / behavioural data; here we focus on the gene-image link]

- 19

Page 20: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Neuroimaging-genetics: The Problem

Several brain diseases have a genetic origin, or their occurrence/severity is related to genetic factors

Genetics is important to understand & predict response to treatment

Genetic variability is captured in DNA micro-array data

Gene → Image link: p(image | genetic)

20

Page 21: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Imaging Genetics Methodological Issues

[Diagram: brain image data Y (~10^5-10^6 voxels) related to genetic data X (~10^5-10^6 variables) for ~2000 subjects]

• Imaging data: anatomical MRI, functional MRI, diffusion MRI
• Genetic data: DNA array (SNP/CNV), gene expression data, others...

- 21

Page 22: Azure Brain: 4th paradigm, scientific discovery & (really) big data

A BIG DATA Challenge …

Azure can help…

Data: double permutation over voxels and SNPs; 5%-10% useful

Computation: estimated timespan on a single machine vs. estimation for A-Brain on Azure (350 cores)

[Table: storage capacity estimations (350 cores)]

Page 23: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Imaging Genetics Methodological Issues

Multivariate methods: predict a brain characteristic from many genetic variables

Elastic net regularization: combination of ℓ1 and ℓ2 penalties → sparse loadings

Parameter setting: internal cross-validation / bootstrap

Performance evaluated using permutations
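A minimal sketch of this analysis on toy data (illustrative only, not the A-Brain pipeline): elastic-net regression of an imaging phenotype on SNPs, parameters set by internal cross-validation, significance assessed by permutations.

```python
# Elastic net + permutation test sketch on synthetic data (scikit-learn).
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 500)).astype(float)      # subjects x SNPs (toy)
y = X[:, :5] @ rng.normal(size=5) + rng.normal(size=200)   # one imaging phenotype

model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5)       # mixes l1 and l2 penalties
true_score = cross_val_score(model, X, y, cv=5).mean()

null_scores = []                  # permutation test (10,000 permutations in the study)
for _ in range(20):
    null_scores.append(cross_val_score(model, X, rng.permutation(y), cv=5).mean())
p_value = (np.sum(np.array(null_scores) >= true_score) + 1) / (len(null_scores) + 1)
print(f"cross-validated R^2 = {true_score:.3f}, permutation p = {p_value:.3f}")
```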

23

Page 24: Azure Brain: 4th paradigm, scientific discovery & (really) big data

A-Brain as Map-Reduce Processing

- 24

Page 25: Azure Brain: 4th paradigm, scientific discovery & (really) big data

A-Brain as Map-Reduce Data Processing

25
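The two slides above present the A-Brain computation as MapReduce diagrams only; a minimal sketch of the idea, with hypothetical function names rather than the actual A-Brain code, could look like this: each map task scans one block of SNPs against all voxels, and the reduce step aggregates the partial scores.

```python
# Illustrative map/reduce phrasing of the voxel-SNP association scan (hypothetical).
import numpy as np

def map_task(snp_block, voxels):
    """Correlate every SNP in the block with every voxel (both z-scored)."""
    z = lambda a: (a - a.mean(0)) / a.std(0)
    scores = np.abs(z(snp_block).T @ z(voxels)) / len(voxels)   # SNPs x voxels
    return scores.max(axis=0)                                    # best SNP per voxel

def reduce_task(partials):
    """Aggregate partial results: element-wise max over all SNP blocks."""
    return np.maximum.reduce(partials)

rng = np.random.default_rng(1)
voxels = rng.normal(size=(200, 1000))                     # subjects x voxels (toy)
snp_blocks = [rng.normal(size=(200, 500)) for _ in range(4)]
best_per_voxel = reduce_task([map_task(b, voxels) for b in snp_blocks])
```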

Page 26: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Efficient Procedures for Statistics

Example: voxelwise Genome Wide Association Studies (vGWAS)

• 740 subjects
• ~50,000 voxels
• ~500,000 SNPs
• 10,000 permutations

→ ~12,000 hours of computation

→ ~1.8 PB of statistical scores
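A back-of-the-envelope check of these figures, assuming roughly 8 bytes per statistical score (the byte size is an assumption, not stated on the slide):

```python
# Order-of-magnitude check of the data volume quoted above.
voxels, snps, permutations = 50_000, 500_000, 10_000
scores = voxels * snps * permutations              # ~2.5e14 association statistics
print(scores * 8 / 1e15, "PB of scores")           # on the order of a couple of PB
```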

- 26

Page 27: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Efficient Procedures for Statistics

Example: Ridge regression with cross-validation loops

Some costly computations (SVD, ~60 sec each) are used 1-2 million times and cannot all be kept in memory: ~60-120 × 10^6 sec of SVD time (1.9-3.8 years).

→ An efficient distributed cache can achieve a huge speedup!
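A minimal sketch of the caching idea (an illustration of the numerical trick, not the A-Brain distributed cache): factor the design matrix once and reuse the SVD for every penalty value or permuted target, turning each ridge solve into two cheap matrix-vector products.

```python
# Ridge regression via a cached SVD: w = V diag(s / (s^2 + lam)) U^T y.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 500))
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # the ~60 s step, done once and cached

def ridge_from_svd(y, lam):
    """Ridge solution reusing the cached SVD of X."""
    return Vt.T @ ((s / (s ** 2 + lam)) * (U.T @ y))

y = rng.normal(size=2000)
for lam in [0.1, 1.0, 10.0]:                       # cross-validation grid
    w = ridge_from_svd(y, lam)                     # cost: two small mat-vec products
```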

- 27

Page 28: Azure Brain: 4th paradigm, scientific discovery & (really) big data

TomusBlobs approach

- 28

Page 29: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Requirements for a cloud storage / data management

High throughput under heavy concurrency

Fine grain access

Scalability / Elasticity

Data availability

Transparency

Design principles

Data locality – use the local storage

No modification of the cloud middleware

Loose coupling between storage and applications

Storage hierarchy

- 29

Page 30: Azure Brain: 4th paradigm, scientific discovery & (really) big data

TomusBlobs – Architecture

[Architecture diagram: the computation nodes themselves host the storage]

- 30

Page 31: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Architecture contd.

System components

Initiator
- Cloud-specific
- Generic stub
- Properties: scaling; self-configuration

Distributed Storage
- Aggregates the virtual disks
- Does not depend on a specific solution

Client API
- Cloud-specific API
- Exposes the operations transparently
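A hypothetical sketch of the loose coupling these components provide (names below are illustrative, not the actual TomusBlobs API): the application talks to a generic client API, while the distributed storage that aggregates the VMs' local disks remains a pluggable backend.

```python
# Illustrative separation of the Client API from the storage backend.
from typing import Protocol

class DistributedStorage(Protocol):
    """Any backend aggregating the VMs' local disks (e.g. BlobSeer) fits here."""
    def write(self, key: str, data: bytes) -> None: ...
    def read(self, key: str) -> bytes: ...

class TomusBlobsClient:
    """Generic stub: applications see blobs, not the backend behind them."""
    def __init__(self, backend: DistributedStorage):
        self.backend = backend

    def put_blob(self, name: str, data: bytes) -> None:
        self.backend.write(name, data)

    def get_blob(self, name: str) -> bytes:
        return self.backend.read(name)
```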

[Diagram of a compute VM: the application uses the Client API to reach the TomusBlobs (TB) entity over the local disk; the Initiator configures the customizable environment from the VM snapshot]

Page 32: Azure Brain: 4th paradigm, scientific discovery & (really) big data

TomusBlobs Evaluation

• Scenario: Single reader / writer

• Data transfer from memory to storage

• Metric: Client IO throughput

Page 33: Azure Brain: 4th paradigm, scientific discovery & (really) big data

TomusBlobs Evaluation

[Plots: cumulative read throughput and cumulative write throughput]

• Scenario: multiple readers / writers

• Throughput limited by bandwidth

• Read: ~4× ; Write: ~5×

- 34

Page 34: Azure Brain: 4th paradigm, scientific discovery & (really) big data

TomusBlobs as a Storage Backend for Sharing Application Data in MapReduce

[Diagram: multiple application instances, each accessing TomusBlobs through the client API]

- 35

Page 35: Azure Brain: 4th paradigm, scientific discovery & (really) big data

TomusMapReduce Evaluation

• Scenario: Increase the problem size

• Optimize computation by better managing intermediate data

- 36

Page 36: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Iterative MapReduce - Daytona

• Merge step
• In-memory caching of static data
• Cache-aware hybrid scheduling using queues as well as a bulletin board (special table)

[Diagram: Job Start → Map/Combine tasks reading static data from the cache → Reduce → MergeAdd → iteration check; if another iteration is needed, hybrid scheduling of the new iteration, otherwise Job Finish]

Credits: Dennis Gannon

- 37

Page 37: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Beyond MapReduce

• Unique result with parallel reduce phase

• No central control entity

• No synchronization barrier

[Diagram: map tasks feeding a tree of reducers running in parallel]

- 38

Page 38: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Zoom on the Reduction Ratio

• Compute the minimum of a set of large matrices (7.5 GB) using 30 mappers
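A small sketch of the reduction-ratio idea (illustrative, not the A-Brain code): mapper outputs are combined by a tree of reducers, each consuming `ratio` partial results, so the reduce phase itself runs in parallel instead of through one central reducer.

```python
# Tree reduction of an element-wise minimum with a configurable reduction ratio.
import numpy as np

def tree_reduce(partials, ratio):
    while len(partials) > 1:
        partials = [np.minimum.reduce(partials[i:i + ratio])
                    for i in range(0, len(partials), ratio)]
    return partials[0]

mapper_outputs = [np.random.rand(100, 100) for _ in range(30)]   # 30 mappers
global_min = tree_reduce(mapper_outputs, ratio=5)                # element-wise minimum
```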

- 39

Page 39: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Azure integration

- 40

Page 40: Azure Brain: 4th paradigm, scientific discovery & (really) big data

The Most Frequent Words benchmark

•Input data size varies from 3.2 GB to 32 GB

•ReductionRatio = 5

- 41

Page 41: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Execution times for A-Brain

• Increasing number of map jobs = increasing size of data (5 GB to 50 GB)

- 42

Page 42: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Beyond Single-Site Processing

• Data movements across geo-distributed deployments are costly

• Minimize the size and number of transfers

• All deployments must collaborate towards reaching the overall goal

• The deployments work as independent services

• The architecture can be used for scenarios in which data is produced in different locations

- 43

Page 43: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Towards a Geo-distributed TomusBlobs approach

• TomusBlobs for intra-deployment data management

• Public storage (Azure Blobs/Queues) for inter-deployment communication

• Iterative Reduce technique for minimizing the number of transfers (and data size)

• Balance the network bottleneck of a single data center
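A minimal sketch of this hierarchical scheme (illustrative only, not the actual multi-site engine): each deployment reduces its own data locally and only the small per-site results cross data-center boundaries.

```python
# Two-level reduce: raw data stays inside each site, only reduced values travel.
def site_mapreduce(local_partitions, map_fn, reduce_fn):
    return reduce_fn([map_fn(p) for p in local_partitions])   # stays inside the site

def multi_site_job(sites, map_fn, reduce_fn):
    site_results = [site_mapreduce(parts, map_fn, reduce_fn) for parts in sites]
    return reduce_fn(site_results)            # only len(sites) values move between sites

# Example: a global sum over three deployments without shipping raw data.
sites = [[1, 2, 3], [40, 50], [600]]
total = multi_site_job(sites, map_fn=lambda x: x, reduce_fn=sum)   # -> 696
```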

- 44

Page 44: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Multi-Site MapReduce

• 3 deployments (NE,WE,NUS)

• 1000 CPUs

• A-Brain execution across multiple sites

- 45

Page 45: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Beyond MapReduce – Workflow Processing

- 46

Page 46: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Data access patterns for workflows [1]

• Pipeline: caching; data-informed workflow
• Broadcast: replication; data size
• Reduce/Gather: co-placement of all data; data-informed workflow
• Scatter: file size awareness; data-informed workflow

[1] Vairavanathan et al. A Workflow-Aware Storage System: An Opportunity Study. http://ece.ubc.ca/~matei/papers/ccgrid2012.pdf

- 47

Page 47: Azure Brain: 4th paradigm, scientific discovery & (really) big data

eScience Central (Newcastle University)

- 48

Page 48: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Generic Worker Walkthrough (Microsoft ATLE)

[Architecture diagram: the researcher's client code submits jobs to the Job Management Service; on each VM, a GW Driver with a pluggable runtime environment executes the application code, staging input files from and output files to shared BLOB storage and local storage; job details and job index tables, notification listeners (accounting, status change, etc.), a notification service and a scaling service support the runtime business logic; interoperable standard protocols and data schemas are used (OGF BES, OGF JSDL, SOAP WS-*)]

- 49

Credits: Microsoft

Page 49: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Defining the scope

Assumptions about the workflows:
• Workflows are composed of batch jobs with well-defined data-passing schemas
• The input and the output of the batch jobs are files
• The batch jobs and their inputs and outputs (IDs, files, batch jobs) can be uniquely identified in the system

Most workflows fit in this subclass.

Idea: manage files inside the deployment

- 50

Page 50: Azure Brain: 4th paradigm, scientific discovery & (really) big data

The Concept

Components:
• File Metadata Registry: holds key-value pairs mapping file names to locations (e.g. F1 → VM1; F2 → VM1, VM2; F3 → VM2)
• Transfer Module: one per VM, working against the local disk

[Diagram: (1) the transfer module on VM1 registers (F1, VM1) in the registry; (2) VM2 calls GetLocation(F1); (3) VM2 downloads F1 directly from VM1's transfer module]
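A minimal sketch of this concept (class and method names are illustrative, not the actual system): a shared file-metadata registry maps file IDs to the VMs holding them, and a per-VM transfer module registers local files and fetches remote ones on demand.

```python
# Register / GetLocation / DownloadFile flow from the diagram, in miniature.
class FileMetadataRegistry:
    """Shared key-value map: file name -> set of locations (transfer modules)."""
    def __init__(self):
        self.locations = {}

    def register(self, file_name, module):             # (1) Register(F1, VM1)
        self.locations.setdefault(file_name, set()).add(module)

    def get_locations(self, file_name):                 # (2) GetLocation(F1)
        return self.locations.get(file_name, set())

class TransferModule:
    def __init__(self, vm_name, registry):
        self.vm, self.registry, self.local_files = vm_name, registry, {}

    def publish(self, file_name, data):
        self.local_files[file_name] = data
        self.registry.register(file_name, self)

    def fetch(self, file_name):                         # (3) DownloadFile(F1)
        if file_name in self.local_files:
            return self.local_files[file_name]
        source = next(iter(self.registry.get_locations(file_name)))
        data = source.local_files[file_name]            # real system: FTP/Torrent/HTTP
        self.local_files[file_name] = data               # keep a local replica
        return data

registry = FileMetadataRegistry()
vm1, vm2 = TransferModule("VM1", registry), TransferModule("VM2", registry)
vm1.publish("F1", b"intermediate data")
assert vm2.fetch("F1") == b"intermediate data"
```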

- 51

Page 51: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Characteristics of the components

Metadata Registry
• Role: hold the location of files within the deployment
• Data type: key-value pairs (file identification; retrieval information)
• Accessibility: accessible by all nodes
• Solutions: Azure Caching Preview, Azure Tables, in-memory DB

Transfer Module
• Role: transfer files from one node to another
• Data type: files
• Accessibility: each VM has such a module; applications access the local module; the modules interact across nodes
• Solutions: FTP, Torrent, InMemory, HTTP, etc.

Idea: adopt multiple transfer solutions and adapt to the context, selecting the one that fits best

- 52

Page 52: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Transfer methods

Methods and observations:

• InMemory: caching data; in-memory data offers fast access; GBs of memory capacity per deployment; small files

• BitTorrent: replicas for file dissemination; collaborative reads; a new way of staging in data

• FTP: TCP transfer; medium and large files; potential for inter-operability
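An illustrative selection policy for the adaptive storage based on the observations above (the thresholds are assumptions, not values from the slides):

```python
# Pick a transfer method from file size and replica availability (illustrative).
SMALL_FILE = 10 * 1024 * 1024          # 10 MB, assumed cut-off for in-memory transfers

def choose_transfer_method(file_size_bytes, replica_count):
    if file_size_bytes <= SMALL_FILE:
        return "InMemory"              # small files: fast in-memory caching
    if replica_count > 1:
        return "BitTorrent"            # replicas available: collaborative reads
    return "FTP"                       # medium/large files from a single source
```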

- 53

Page 53: Azure Brain: 4th paradigm, scientific discovery & (really) big data

[Diagram: adaptive storage inside a VM, built from the VM snapshot: a metadata registry plus transfer module services (FTP with tracker, Torrent peer, InMemory) over the VM memory and local disk, with a replication queue and replication service]

- 54

Page 54: Azure Brain: 4th paradigm, scientific discovery & (really) big data

[Diagram: two application VMs using the adaptive storage: the writer creates F1, writes its metadata and uploads the data to memory / local storage; the reader obtains the metadata through Azure Caching, then reads or downloads F1]

- 55

Page 55: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Scenario 2 – Large files; replication enabled

[Plot: transfer time (sec) vs. size of a single file (50-250 MB) for DirectLink, Torrent, Adaptive and AzureBlobs]

• Torrents are superior for broadcast when replicas are used

• DirectLink is faster for pipeline (reduction tree)

• Adaptive storage can choose the best strategy each time

- 56

Page 56: Azure Brain: 4th paradigm, scientific discovery & (really) big data

NCBI Blast for Azure

Seamless experience
• Evaluate data and invoke computational models from Excel
• Computationally heavy analysis done close to a large database of curated data
• Scalable for large, surge computationally heavy analysis

[Architecture diagram: the user selects DBs and an input sequence; a Web Role splits the input; BLAST Execution Worker Roles #1..#n run against Genome DBs 1..K; a Combiner Worker Role merges the results; the BLAST DB configuration lives in Azure Blob Storage]

Credits: Dennis Gannon

- 57

Page 57: Azure Brain: 4th paradigm, scientific discovery & (really) big data

BLAST analysis: data management component

[Plot: transfer time (sec) vs. number of BLAST jobs (5 to 60), comparing Adaptive and AzureBlobs for download and upload]

• Database files: 1.6 GB
• Input size: 800 MB
• 50 nodes

- 58

Page 58: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Scalable Storage on Clouds: Open Issues

Understanding price-performance trade-offs

• Consistency, availability, performance, cost, security, quality of service, energy consumption

• Autonomy, adaptive consistency

• Dynamic elasticity

• Trade-offs exposed to the user

High performance variability

- Understand it, model it, cope with it

Deployment/application launching time is high

Latency of data accesses is still an issue

Data movements are expensive

Cope with tightly-coupled applications

Cope with various cloud programming models

Virtualization overhead

Benchmarking

Performance modeling

Self-optimization for cost reduction

- Elastic scale down

Security and privacy

- 59

Page 59: Azure Brain: 4th paradigm, scientific discovery & (really) big data

Extreme scale does matter BUT not only

Other focus areas

– Affordability and usability of intermediate-size systems

– Pervasiveness of usage across the entire industry, including Small and Medium Enterprises (SMEs) and ISVs

– New HPC deployments (e.g. Big Data, HPC in Clouds)

– HPC and Cloud usage expansion, fostering the development of consultancy, expertise and service business / end-user support

– Facilitating the creation of start-ups and the development of the SME sector (hw/sw supply side)

– Education and training (incl. engineering skills for industry)

Cloud Computing @ INRIA Strategic Research Agenda

- 60

Page 60: Azure Brain: 4th paradigm, scientific discovery & (really) big data

- 61

Azure Brain: 4th paradigm, scientific discovery & (really) big data (REC201)

Gabriel Antoniu

Senior Research Scientist, Inria

Head of the KerData Project-Team, Inria Rennes – Bretagne Atlantique

Radu Tudoran

PhD student, ENS Cachan – Brittany

KerData Project-Team, Inria Rennes – Bretagne Atlantique

Contacts: [email protected], [email protected]