scale-out beyond mapreduce - semantic scholar...scale-out beyond mapreduce raghu ramakrishnan cloud...

97
Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft

Upload: others

Post on 20-May-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Scale-out Beyond MapReduce

Raghu Ramakrishnan

Cloud Information Services Lab (CISL)

Microsoft

Page 2: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Outline

• Big Data

– The New Applications

– The Digital Shoebox

• Tiered Storage

• Compute Fabric

• REEF

Page 3: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Cloud Information Services Lab (CISL)

• Applied research for Cloud and Enterprise (CE)

• Focus areas:

– Cloud data platforms, predictive analytics and data-driven enterprise applications

• Modus innovatii:

– Embedded with the product team

– Engage closely with MSR

– Balance of external and internal impact

Page 4: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Big DataWhat’s the big deal?

Page 5: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

What’s New?

• What we’re doing with it!

– The tech is best thought of in terms of what it enables

• Why is this more than tech evolution?

– Cloud services + advances in analytics + HW trends = Ability to cost-effectively do things we couldn’t dream of before

– Uncomfortably fast evolution = revolution

Page 6: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Challenges

• Is there real technical innovation here?– Yes: Elastic scale-out; heterogeneous data and analysis;

real-time/interactive; “instant on” cloud access

– Many fascinating challenges, no deal breakers

• What about the social, legal and regulatory issues?– Will take longer to understand and resolve

• Biggest gap– People: data scientists, data-driven managers (McKinsey)

– NAE CATS Big Data training workshop (planned for Jan 2014)

Page 7: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Web of ConceptsPODS 2009 keynote

Mumbai

julia robertsrestaurant

san jose

Aggregated KB INDEX SERP

conceptstructured data

The “index” is keyed by concept instance, and organizes all relevant information, wherever it is drawn from, in semantically meaningful ways

Page 8: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications
Page 9: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Content OptimizationAgrawal et al., CACM 56(6):92-101 (2013) Content Recommendation on Web Portals

Key Features

Package Ranker (CORE)

Ranks packages by expected CTR based on

data collected every 5 minutes

Dashboard (CORE)

Provides real-time insights into performance by

package, segment, and property

Mix Management (Property)

Ensures editorial voice is maintained and user

gets a variety of content

Package rotation (Property)

Tracks which stories a user has seen and

rotates them after user has seen them for a

certain period of time

Key Performance Indicators

Lifts in quantitative metrics

Editorial Voice Preserved

Recommended links News Interests Top Searches

Page 10: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

CORE Dashboard: Segment Heat Map

Page 11: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Kinect

• The Kinect is an array of sensors.– Depth, audio, RGB camera …

• SDK provides a 3D virtual skeleton.– 20 points around the body, 30 fps

– 30 frames per second

– Between 60-70M sold by May 2013

• Exemplar of “Internet of Things”– Event streams from a multitude of

devices, enabling broad new apps

(Slide modified from Assaf Schuster, Technion)

Page 12: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

• Non-intrusive – Suitable for home monitoring.– Place Kinect anywhere in corridor or room, start

measuring.

– Measure gait as subjects go about their daily routine.

• Comprehensive – Extract parameters from full body.– Parameters extracted from 3D skeleton of entire body.

– Simultaneously measure any part of the body.

• Accurate – Supervised learning overcomes errors.– Full body information improves accuracy.

Kinect-based Full Body Gait AnalysisMickey Gabel, Ran Gilad-Bachrach, Assaf Schuster, Eng. Med. Bio. 2012

(Slide courtesy Assaf Schuster, Technion)

Page 13: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Connected devices will soon be EVERYWHERE

http://blogs.cisco.com/news/the-internet-of-things-infographic/

(Slide courtesy Ratul Mahajan, MSR)

Page 14: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Apps

MonitorData logger

Analysis scripts

AppUI

(Slide courtesy Ratul Mahajan, MSR)

HomeOS: Another Instance of IoT

Page 15: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Big DataBuild it—they’re here already!

Page 16: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

One Slide MapReduce Primer

Data file

Page 17: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

HDFS

One Slide MapReduce Primer

Page 18: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

HDFS

One Slide MapReduce Primer

Page 19: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

HDFS

One Slide MapReduce Primer

Map tasks

Page 20: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

HDFS

One Slide MapReduce Primer

Map tasks

Page 21: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

HDFS

One Slide MapReduce Primer

Map tasks

Page 22: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

HDFS

One Slide MapReduce Primer

Map tasks

Reduce tasks

Page 23: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

HDFS

One Slide MapReduce Primer

Map tasks

Reduce tasks

Page 24: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

HDFS

One Slide MapReduce Primer

Map tasks

HDFS

Reduce tasks

Page 25: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

HDFS

One Slide MapReduce Primer

Map tasks

HDFS

Reduce tasks

Page 26: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

HDFS

One Slide MapReduce Primer

Data file

Map tasks

HDFS

Reduce tasks

Good for scanning/sequentially writing/appending to huge filesScales by “mapping” input to partitions, “reducing” partitions in parallel

Partitions written to disk for fault-toleranceExpensive “shuffle” step between Map & Reduce

No concept of iteration

Hive and Pig are SQL variants implemented by translation to MapReduce

Not great for serving (reading or writing individual objects)

Page 27: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Shoebox Store

• Capture any data, react instantaneously, mix with data stored anywhere

• Tiered storage management

• Federated access

• Use any analysis tool (anywhere, mix and match, interactively)

• Compute fabric

• Collaborate/Share selectively

Tiered Shoebox Store

SQL / Hive /MR

Stream Processing

BusinessIntelligence

MachineLearning

RemoteStores

Compute Fabric

DATA INGEST

Page 28: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Integrated Query “In-Place”

Can join and group-by tables from a relational source with tables in a Hadoop cluster without needing to learn MapReduce

Integrated BI Tools

Using Excel, end users can search for data sources with Power Query and do roll-up/drill-down etc. with Power Pivot—across both relational and Hadoop data

Interactive Visualizations

Use Power View for immersive interactivity and visualizations of both relational and Hadoop data

Page 29: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Aster/Teradata

Berkeley Data Analytics Stack

Cloudera

Google

HortonWorks

Microsoft

Pivotal/EMC

SQL on Hadoop panel, Aug 2013:http://hivedata.com/real-time-query-panel-discussion/

Page 30: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Challenges

• Volume

– Elastic scale-out

– Multi-tenancy

• Variety

– Trade-off: Shared building blocks vs. custom engines

• Velocity

– Real-time and OLTP, interactive, batch

Page 31: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

How Far Away is Data?

• GFS and Map-Reduce:– Schedule computation “near” data– i.e., on machines that have data on their disks

• But– Windows Azure Storage

• And slower tiers such as tape storage …

– Main memory growth• And flash, SSDs, NVRAM etc. …

• Must play two games simultaneously:– Cache data across tiers, anticipating workloads– Schedule compute near cached data

Page 32: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Compute Fabric: YARN

• Resource manager for Hadoop2.x

• Allocates compute containers to competing jobs

– Not necessarily MR jobs!

– Containers are the unit of resource

– Can fail or be taken away; programmer must handle these cases

• Other RMs include Corona, Mesos, Omega

Page 33: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Making YARN Easier to Use: REEF

• Evaluator: YARN container with REEF services– Capability-awareness, Storage support, Fault-

handling support, Communications, Job/task tracking, scheduling hooks

• Activity: User Code to be executed in an Evaluator– Monitored, preemptable, re-started as needed

– Unique id over lifetime of job

– Executes in an Evaluator, which can be re-used

Page 34: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Digital Shoebox Architecture

HDFS-as-Cache

Relational Queries

MachineLearning

REEF

YARN

WAS

TIEREDSTORAGE

COMPUTEFABRIC

ANALYSISENGINES

DURABLESTORAGE

COMPUTETIER(Cluster of machines with local RAM, SSDs, disks, …)

Operators

Page 35: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Example

FormationModeling

Evaluation /

Deployment

Page 36: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

26

• Large dimensionality vector describing possible user activities

• But a typical user has a sparse activity vector

• Hadoop pipeline to model user interests from activities

Attribute Possible Values Typical values per

user

Pages ~ MM 10 – 100

Queries ~ 100s of MM Few

Ads ~ 100s of thousands 10s

Page 37: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

2727

Time

Query Visit Y! finance

Feature Window Target Window

Event of interest

Moving Window

T0

Page 38: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

28

Component Data Processed Time

Data Acquisition ~ 1 Tb per time

period

2 – 3 hours

Feature and Target

Generation

~ 1 Tb * Size of

feature window

4 - 6 hours

Model Training ~ 50 - 100 Gb 1 – 2 hours for 100’s

of models

Scoring ~ 500 Gb 1 hour

Page 39: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Example

EMail

Click Log

Bag of

Words

I

D

LabelI

D

Bag of

WordsLabel

I

D

Feature Extraction

Label Extraction

Data Parallel

Functions

Large Scale

Join

Large Scale

Join

Page 40: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications
Page 41: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Avoid forced rescheduling between iterations

Node-local data storage / caches

Machine learning cost is often I/O dominated

Efficient means of communication withinan iteration

Apply

Model

to Data

Observe

Errors

Update

Model

Page 42: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

YARN / HDFS

SQL / Hive … …Machine

Learning

Fault Tolerance

Row/Column Storage

High Bandwidth Networking

Page 43: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

YARN / HDFS

SQL / Hive … …Machine

Learning

Fault Awareness

Local data caching

Low Latency Networking

Page 44: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

YARN

Can we share more than just Resource Management?

Example

FormationModeling

Evaluation /

Deployment

Spark

GraphLab

MPI

Pregel

One-Offs

Dryad

Pig/Hive

M/R

SQL

Hyracks

Dryad

Pig/Hive/SQL

StreamInsight

One-Offs

Page 45: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Bad for systems builders:

Bad for users:

Bad for cloud providers:

Page 46: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications
Page 47: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

MapReduce library Runs Hive and Pig

Excellent starting point for M/R optimizations: Caching, Shuffle, Map-Reduce-Reduce, Sessions, …

Machine Learning algorithms Scalable implementations: Decision

Trees, Linear Models, Soon: SVD

Excellent starting point for: Fault awareness in ML

Page 48: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

SQL / Hive

YARN / HDFS

… …Machine

Learning

REEF

Page 49: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Storage

Network

State Management

Job

Driver

Control plane

implementation. User code

executed on YARN’s

Application Master

ActivityUser code executed

within an Evaluator.

EvaluatorExecution Environment

for Activities. One

Evaluator is bound to

one YARN Container.

Page 50: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications
Page 51: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Client

public class DistributedShell {...public static void main(String[] args){

...Injector i = new Injector(yarnConfiguration);...REEF reef = i.getInstance(REEF.class);...reef.submit(driverConf);

}}

Page 52: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

public class DistributedShell {...public static void main(String[] args){

...Injector i = new Injector(yarnConfiguration);...REEF reef = i.getInstance(REEF.class);...reef.submit(driverConf);

}}

Client

Page 53: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

public class DistributedShellJobDriver {private final EvaluatorRequestor requestor;...

public void onNext(StartTime time) {

requestor.submit(EvaluatorRequest.Builder().setSize(SMALL).setNumber(2).build());

}

...}

Client

Page 54: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

evaluator

config +

Client

Page 55: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

public class DistributedShellJobDriver {private final String cmd = “dir”;

[...]

public void onNext(RunningEvaluator eval) {final String activityId = [...];

final JavaConfigurationBuilder b = [...];

b.bind(Activity.class, ShellActivity.class);b.bindNamedParameter(Command.class, this.cmd);

eval.submit(activityId, cb.build());

}

[...]

}

activity

config

Client

Page 56: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

class ShellActivity implements Activity {

private final String command;

@InjectShellActivity(@Parameter(Command.class) String c) {

this.command = c;}

private String exec(final String command){...

}

@Overridepublic byte[] call(byte[] memento) {

String s = exec(this.cmd);return s.getBytes();

}

}

Client

Page 57: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Client

Page 58: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Retains

State!

Client

Page 59: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Job Driver

activity

config

Client

Page 60: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Job Driver

Client

Page 61: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Job Driver

Client

Page 62: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications
Page 63: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications
Page 64: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Feature Vector

Label

The Task: Learn a Regression Model

Page 65: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

weight vector

Linear Models

Page 66: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

The Learning Algorithm: Batch Gradient Descent

Page 67: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

How It Maps to REEF

Page 68: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

How It Maps to REEF: Control Flow

Page 69: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Contrast: Hadoop MapReduce

Page 70: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Data Management Services

Page 71: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Job Driver

Page 72: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Job Driver

Page 73: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

public interface Spool<T>implements Iterable<T>, Accumulable<T> {Iterator<T> iterator();Accumulator<T> accumulator();

}

public interface Iterator<T> {boolean hasNext();T next();

}

public interface Accumulator<T> {void add(T t);void close();

}

Page 74: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

public interface Map<K, V> {boolean containsKey(K key);V get(K key);void put(K key, V value);V remove(K key);

// Scatter-gather

void putAll(Map<K, V> m);Iterable<Map.Entry<K, V>> getAll(Set<K> keys);

// Optional

boolean testAndSet(K key, V value, V old);V atomicPut(K key, V value);

}

Page 75: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Job Driver

Page 76: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

all communication is done with identifiersdecouple from physical locations

decouple from temporal constraints

public interface Identifier {}

public class NameService {...

public InetSocketAddress lookup(Identifier i){...

}...

}

Page 77: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Driver

Activity A Activity B

Activity B

node1 node2

node3 node4

Activity A, node1

Activity B, node2

NIB

Activity A, node1

Activity B, node4

NetworkService ns = newNetworkService(…);

ns.send("Activity B", message);

Page 78: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Driver

Activity A Activity B

node1 node2

node3 node4

Activity A, node1

Activity B, node2

NIB

Activity B

NetworkService ns= new NetworkService(…);

ns.getMailbox().send(“Activity B”, message);

Page 79: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

asynchronous send, receive upcall

spool, iterator, accumulator interface

public class NetworkService<T> {…public void send(Identifier id, T obj) {

...}...public void handler(Receiver<T> receiver) {

...}

public Spool<T> getSpool(Identifier id) {...

}…

}

public interface Receiver<T> {void recv(Identifier id, T obj);

}

Page 80: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

1-to-N

N-to-1

M-to-N

Broker

Activity A Activity B

Activity CActivity D

Page 81: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Job Driver

Page 82: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

State Management

Manage (save and retrieve) the state of a computation

Objectives:

Page 83: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

State Management

App Code App Code App Code

Page 84: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

State Management

Checkpoint Service

Preemption Mechanisms / Stats Collection

Preemption PoliciesOptimization

Policies

Also used in: Apache Yarn/MapReduce

App Code App Code App Code

Page 85: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Checkpoint Service API

Atomic, append-only, single-writer, write-once

Support for HDFS and Local Filesystem

Job-level quotas and garbage collection (via Hadoop staging)

(Configured via Tang or directly)

public interface CheckpointService {

public CheckpointWriteChannel create();

public Memento commit(CheckpointWriteChannel ch);

public void abort(CheckpointWriteChannel ch);

public CheckpointReadChannel open(Memento mem);

public boolean delete(Memento mem);

}

public interface WritableByteChannel extends Channel {

public int write(ByteBuffer src);

}

public interface ReadableByteChannel extends Channel {

public int read(ByteBuffer dst);

}

Page 86: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

public interface ResumeableActivity<T,

M extends Memento> extends Activity<T> {

public M suspend();

public T resume(M memento);

}

Name

Node

Yarn

RM

HDFS NM

REEF

HDFS NM

HDFS NM

Job Driver

Activity

Client

node1

node2

node3

node4

Page 87: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Name

Node

Yarn

RM

HDFS NM

REEF

HDFS NM

HDFS NM

Job Driver

Activity

Client

node1

node2

node3

node4

public Memento suspend(){

CheckpointService cs = …Tang config…

CheckpointWriteChannel cwc = cs.create();

cwc.write(…state…);

Memento mem = cs.commit(cwc);

return mem;

}

Page 88: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Name

Node

Yarn

RM

HDFS NM

REEF

HDFS NM

HDFS NM

Job Driver

Client

node1

node2

node3

node4

public void resume(Memento mem){

CheckpointReadChannel crc = cs.open(mem);

crc.read(…buffer…);

}Activity

activity

config +

Page 89: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Popular schedulers

CapacityScheduler

FairScheduler

Deadline-oriented scheduling

New idea:

Support work-preserving preemption

(via) checkpointing more than preemption

Scheduling in Hadoop(Curino, Douglas, Rao)

(Amoeba paper, SOCC 2012)

Page 90: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

80

Page 91: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Killing Tasks vs. Preemption

0

10

20

30

40

50

60

70

80

90

100

02

25

43

06

35

84

01

02

91

21

01

41

51

62

01

82

52

03

02

23

52

44

02

64

52

85

03

05

53

26

03

46

53

67

03

87

54

07

04

25

54

46

04

66

54

87

05

07

55

28

05

48

55

69

05

89

56

10

06

30

56

51

06

71

56

92

07

12

57

33

07

53

57

74

07

94

58

15

08

35

58

56

0

% C

om

ple

te

Time (s)

Kill Preempt33% Improvement

Page 92: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Adding Preemption to YARNAnd Open-Sourcing to Apache

ClientJob1

RM

Scheduler

NodeManager NodeManager NodeManager

AppMaster Task

Task

Task

Task

Task

TaskTask

PreemptionMessage {

Strict { Set<ContainerID> }

Flexible { Set<ResourceRequest>,

Set<ContainerID> }

}

Collaborative applicationPolicy-based binding for flexible preemption requests

Use of Preemption

Context: Outdated informationDelayed effects of actionsMulti-actor orchestration

Interesting type of preemption:RM declarative requestAM binds it to containers

Page 93: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Changes throughout YARN

ClientJob1

RM

Scheduler

NodeManager NodeManager NodeManager

MR AMTask

Task

Task

Task

Task

TaskTask

When can I preempt?tag safe UDFs or user-saved state

@Preemptable

public class

MyReducer{

} Common Checkpoint ServiceWriteChannel cwc = cs.create();

cwc.write(…state…);

CheckpointID cid = cs.commit(cwc);

ReadChannel crc = cs.open(cid);

Page 94: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

ClientJob1

RM

Scheduler

NodeManager NodeManager NodeManager

AppMaster Task

Task

Task

Task

Task

TaskTask

MR-5192

MR-5194

MR-5197

MR-5189

MR-5189

MR-5176

YARN-569

MR-5196

Contributing to Apache

Engaging with OSStalk with active developersshow early/partial work small patches ok to leave things unfinished

Page 95: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

Configuration Management: Tang

Page 96: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications

If you want to implement the next Mahout, consider building on REEF

• Data is the new gold, data mining the new Klondike

• The next generation of data platforms will fuse traditional data management, scale-out systems like Hadoop, and cloud capabilities

• Convergence of analytic toolsets, blurring market boundaries

Page 97: Scale-out Beyond MapReduce - Semantic Scholar...Scale-out Beyond MapReduce Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft. Outline •Big Data –The New Applications