System Support for Data-Intensive Applications

Katherine Yelick, U.C. Berkeley, EECS

The “Post PC” Generation

Two technologies will likely dominate:

1) Mobile Consumer Electronic Devices
  – e.g., PDA, cell phone, wearable computers, with cameras, recorders, sensors
  – make the computing “invisible” through reliability and simple interfaces
2) Infrastructure to Support such Devices
  – e.g., successor to Big Fat Web Servers, Database Servers
  – make these “utilities” with reliability and new economic models

Open Research Issues

• Human-computer interaction
  – uniformity across devices
• Distributed computing
  – coordination across independent devices
• Power
  – low-power designs and renewable power sources
• Information retrieval
  – finding useful information amidst a flood of data
• Scalability
  – scaling devices down
  – scaling services up
• Reliability and maintainability

The problem space: big data

• Big demand for enormous amounts of data
  – today: enterprise and internet applications
    » online applications: e-commerce, mail, web, archives
    » enterprise decision-support, data mining databases
  – future: richer data and more of it
    » computational & storage back-ends for mobile devices
    » more multimedia content
    » more use of historical data to provide better services
• Two key application domains:
  – storage: public, private, and institutional data
  – search: building static indexes, dynamic discovery

Reliability/Performance Trade-off

• Techniques for reliability:
  – High-level languages with strong types
    » avoid memory leaks, wild pointers, etc.
    » C vs. Java
  – Redundant storage, computation, etc.
    » adds storage and bandwidth overhead
• Techniques for performance:
  – Optimize for a specific machine
    » e.g., cache or memory hierarchy
  – Minimize redundancy
• These two goals work against each other

Specific Projects

• ISTORE
  – A reliable, scalable, maintainable storage system
• Data-intensive applications for “backend” servers
  – Modeling the real world
  – Storing and finding information
• Titanium
  – A high-level language (Java) with high performance
  – A domain-specific language and optimizing compiler
• Sparsity
  – Optimization using partial program input

ISTORE: Reliable Storage System

• 80-node x86-based cluster, 1.4TB storage
  – cluster nodes are plug-and-play, intelligent, network-attached storage “bricks”
    » a single field-replaceable unit to simplify maintenance
  – each node is a full x86 PC w/256MB DRAM, 18GB disk
  – 2-node system running now; full system in next quarter

[Figure: ISTORE chassis]
• 80 nodes, 8 per tray
• 2 levels of switches: 20 at 100 Mbit/s, 2 at 1 Gbit/s
• Environment monitoring: UPS, redundant power supplies, fans, heat and vibration sensors...

[Figure: intelligent disk “brick”]
• Portable PC CPU: Pentium II/266 + DRAM
• Redundant NICs (4 100 Mb/s links)
• Diagnostic processor
• Disk
• Half-height canister

A glimpse into the future?

• System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk

• ISTORE HW in 5-7 years:

– building block: 2006 MicroDrive integrated with IRAM
  » 9GB disk, 50 MB/sec from disk
  » connected via crossbar switch

– 10,000 nodes fit into one rack!

• O(10,000) scale is our ultimate design point

Heart Modeling

• A computer simulation of a human heart
  – Used to design artificial heart valves
  – Simulations run for days on a C90 supercomputer
  – Done by Peskin and McQueen at NYU
• Modern machines are faster but harder to use
  – working with NYU
  – using Titanium
• Shown here: close-up of aortic valve during ejection
• Images from the Pittsburgh Supercomputing Center

Simulation of a Beating Heart

• Shown here:
  – Aortic valve (yellow); mitral valve (purple)
  – Mitral valve closes when the left ventricle pumps

• Future: virtual surgery?

Earthquake Simulation

• Earthquake modeling
  – Used for retrofitting buildings, emergency preparedness, construction policies
  – Done by Bielak (CMU); also by Fenves (Berkeley)
  – Problems: grid (graph) generation; using images

Earthquake Simulation

• Movie shows a simulated aftershock following the 1994 Northridge earthquake in California
• Future: sensors everywhere; tied to central system

Pollution Standards

• Simulation of ozone layer
  – Done by Russell (CMU) and McRae (MIT)
  – Used to influence automobile emissions policy

Los Angeles Basin shown at 8am (left) and 2pm (right)

The “cloud” shows areas where ozone levels are above federal ambient air quality standards (0.12 parts per million)

Information Retrieval

• Finding useful information amidst huge data sets
  – I/O-intensive application
• Today’s example: web search engines
  – 10 million documents in a typical matrix
  – Web storage increasing 2x every 5 months
  – One class of techniques based on sparse matrices
• Problem: Can you make this run faster, without writing hand-optimized, non-portable code?

[Figure: sparse matrix (# keywords ~100K by # documents ~10 M) multiplied by a vector x]

• Matrix is compressed
• “Random” memory access
• Cache miss per 2 flops
• Runs at 1-5% of machine peak
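A minimal sketch of the sparse matrix-vector product behind this kind of search kernel, written in plain Java. It assumes compressed sparse row (CSR) storage, which the slides do not name; the class, field, and method names are illustrative only.

```java
// Illustrative sketch: y = A*x with A stored in compressed sparse row (CSR) form.
// CSR is an assumed storage format; all names here are hypothetical.
final class CsrMatrix {
    final int rows;
    final int[] rowStart; // nonzeros of row i sit at rowStart[i] .. rowStart[i+1]-1
    final int[] colIndex; // column of each stored nonzero
    final double[] value; // value of each stored nonzero

    CsrMatrix(int rows, int[] rowStart, int[] colIndex, double[] value) {
        this.rows = rows;
        this.rowStart = rowStart;
        this.colIndex = colIndex;
        this.value = value;
    }

    double[] multiply(double[] x) {
        double[] y = new double[rows];
        for (int i = 0; i < rows; i++) {
            double sum = 0.0;
            for (int k = rowStart[i]; k < rowStart[i + 1]; k++) {
                // x is read through colIndex: the indirect, "random" access
                sum += value[k] * x[colIndex[k]];
            }
            y[i] = sum;
        }
        return y;
    }
}
```

Each stored nonzero costs one multiply and one add plus an indirect load from x, which is the access pattern the bullets above describe.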

Image-Based Retrieval

• Digital library problem:
  – retrieval on images
  – content-based
• Computer vision problem
  – uses sparse matrix
• Future: search in medical image databases; diagnosis; epidemiological studies

Object Based Image Description

Titanium Goals

• Help programmers write reliable software
  – Retain safety properties of Java
  – Extend to parallel programming constructs
• Performance
  – Sequential code comparable to C/C++/Fortran
  – Parallel performance comparable to MPI
• Portability
• How?
  – Domain-specific language and compiler
  – No JVM
  – Optimizing compiler
  – Explicit parallelism and other language constructs for high performance

Titanium Overview: Sequential

Object-oriented language based on Java with:

• Immutable classes
  – user-definable non-reference types for performance
• Unordered loops
  – compiler is free to run iterations in any order
  – useful for cache optimizations and others
• Operator overloading
  – by demand from our user community
• Multidimensional arrays
  – points and index sets as first-class values
  – specific to an application domain: scientific computing with block-structured grids

Titanium Overview: Parallel

Extensions of Java for scalable parallelism:

• Scalable parallelism
  – SPMD model with global address space
• Global communication library
  – E.g., broadcast, exchange (all-to-all)
  – Used to build data structures in the global address space
• Parallel optimizations
  – Pointer operations
  – Communication (underway)
• Bulk asynchronous I/O
  – speed with safety

Implementation

• Strategy
  – Compile Titanium into C
  – Communicate through shared memory on SMPs
  – Lightweight communication for distributed memory
• Titanium currently runs on:
  – Uniprocessors
  – SMPs with POSIX or Solaris threads
  – Berkeley NOW, SP2 (distributed memory)
  – Tera MTA (multithreaded, hierarchical)
  – Cray T3E (global address space)
  – SP3 (cluster of SMPs, e.g., Blue Horizon at SDSC)

Sequential Performance

Ultrasparc:

                 C/C++/FORTRAN   Java Arrays   Titanium Arrays   Overhead
DAXPY            1.4s            6.8s          1.5s              7%
3D multigrid     12s                           22s               83%
2D multigrid     5.4s                          6.2s              15%
EM3D             0.7s            1.8s          1.0s              42%

Pentium II:

                 C/C++/FORTRAN   Java Arrays   Titanium Arrays   Overhead
DAXPY            1.8s                          2.3s              27%
3D multigrid     23.0s                         20.0s             -13%
2D multigrid     7.3s                          5.5s              -25%
EM3D             1.0s                          1.6s              60%

Performance results from 1998; new IR and optimization framework almost complete.

SPMD Execution Model

• Java programs can be run as Titanium, but the result will be that all processors do all the work
• E.g., parallel hello world:

    class HelloWorld {
        public static void main (String [] argv) {
            System.out.println("Hello from proc " + Ti.thisProc());
        }
    }

• Any non-trivial program will have communication and synchronization

SPMD Execution Model

• A common style is compute/communicate
• E.g., in each timestep within a particle simulation with gravitational attraction:

    read all particles and compute forces on mine
    Ti.barrier();
    write to my particles using new forces
    Ti.barrier();

• This basic model is used in the large-scale parallel simulations described earlier
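The two-barrier timestep can be imitated in plain Java as a rough sketch. Here threads and `java.util.concurrent.CyclicBarrier` stand in for Titanium processes and `Ti.barrier()`, and the update rule (move halfway toward the mean position) is a made-up placeholder for a force computation. None of these names come from Titanium.

```java
import java.util.concurrent.CyclicBarrier;

// Illustrative sketch: Java threads and CyclicBarrier stand in for Titanium
// processes and Ti.barrier(); the update rule and all names are hypothetical.
final class ComputeCommunicate {
    static double[] run(int numProcs, int steps, double[] initial) {
        final double[] pos = initial.clone();            // shared "particles", one per proc
        final CyclicBarrier barrier = new CyclicBarrier(numProcs);
        Thread[] procs = new Thread[numProcs];
        for (int p = 0; p < numProcs; p++) {
            final int me = p;
            procs[p] = new Thread(() -> {
                try {
                    for (int t = 0; t < steps; t++) {
                        // read phase: read all particles, compute the update for mine
                        double sum = 0.0;
                        for (double x : pos) sum += x;
                        double next = 0.5 * (pos[me] + sum / numProcs);
                        barrier.await();                 // nobody writes until all have read
                        pos[me] = next;                  // write phase: update my particle
                        barrier.await();                 // nobody reads until all have written
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            procs[p].start();
        }
        for (Thread t : procs) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return pos;
    }
}
```

Dropping either barrier would let one thread overwrite a position that another thread is still reading for the current timestep.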

SPMD Model

• All processors start together and execute the same code, but not in lock-step
• Basic control done using
  – Ti.numProcs(): total number of processors
  – Ti.thisProc(): number of executing processor
• Sometimes they do something independent:

    if (Ti.thisProc() == 0) { ….. do setup ..… }
    System.out.println("Hello from " + Ti.thisProc());
    double [1d] a = new double [Ti.numProcs()];

Barriers and Single

• A common source of bugs is barriers or other global operations (barrier, broadcast, reduction, exchange) inside branches or loops
• A “single” method is one called by all procs

    public single static void allStep(...)

• A “single” variable has the same value on all procs

    int single timestep = 0;

• The compiler uses “single” type annotations to ensure there are no synchronization bugs with barriers

Explicit Communication: Exchange

• To create shared data structures
  – each processor builds its own piece
  – pieces are exchanged (for objects, just exchange pointers)
• Exchange primitive in Titanium

    int [1d] single allData;
    allData = new int [0:Ti.numProcs()-1];
    allData.exchange(Ti.thisProc()*2);

• E.g., on 4 procs, each will have a copy of allData:

    0 2 4 6
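The effect of the exchange can be emulated in plain Java as a sketch: every thread deposits one value, waits at a barrier, and then takes a copy of the whole array. This only mimics the observable result and says nothing about how Titanium implements `exchange`; the class and method names are hypothetical.

```java
import java.util.concurrent.CyclicBarrier;

// Illustrative emulation of an all-to-all exchange using Java threads.
// Not Titanium's implementation; all names are hypothetical.
final class ExchangeDemo {
    static int[][] run(int numProcs) {
        final int[] slots = new int[numProcs];          // one slot per process
        final int[][] seen = new int[numProcs][];       // what each process ends up holding
        final CyclicBarrier barrier = new CyclicBarrier(numProcs);
        Thread[] procs = new Thread[numProcs];
        for (int p = 0; p < numProcs; p++) {
            final int thisProc = p;
            procs[p] = new Thread(() -> {
                try {
                    slots[thisProc] = thisProc * 2;     // contribute my value
                    barrier.await();                    // wait until every slot is filled
                    seen[thisProc] = slots.clone();     // every process gets a full copy
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            procs[p].start();
        }
        for (Thread t : procs) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return seen;
    }
}
```

With four threads, every row of the returned array holds 0 2 4 6, matching the slide's example.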

Exchange on Objects

• More interesting example:

    class Boxed {
        public Boxed (int j) {
            val = j;
        }
        public int val;
    }

    Object [1d] single allData;
    allData = new Object [0:Ti.numProcs()-1];
    allData.exchange(new Boxed(Ti.thisProc()));

Use of Global / Local

• As seen, references (pointers) may be remote
  – easy to port shared-memory programs
• Global pointers are more expensive than local
  – True even when data is on the same processor
  – Use local declarations in critical sections
• Costs of global:
  – space (processor number + memory address)
  – dereference time (check to see if local)
• May declare references as local
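The two costs listed above can be pictured with a toy model, offered only as a sketch: a global pointer is a (processor number, address) pair, and every dereference first tests whether the target is local. The class, the `memories` array, and the `remoteFetches` counter are all hypothetical and do not reflect Titanium's runtime.

```java
// Illustrative model of a global pointer: a (processor, address) pair whose
// dereference first checks locality. Hypothetical names; not Titanium's runtime.
final class GlobalPtr {
    final int proc;     // owning processor number (the extra space cost)
    final int address;  // index into that processor's memory

    GlobalPtr(int proc, int address) {
        this.proc = proc;
        this.address = address;
    }

    // memories[p] models processor p's local memory; remoteFetches[0] counts
    // dereferences that would need communication.
    static int deref(GlobalPtr g, int thisProc, int[][] memories, int[] remoteFetches) {
        if (g.proc == thisProc) {
            return memories[thisProc][g.address];   // local data, but the check was still paid
        }
        remoteFetches[0]++;                          // remote data: would be a network message
        return memories[g.proc][g.address];
    }
}
```

A reference declared local could skip both the processor field and the test, which is the saving the slide points to.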

Global Address Space

• Processes allocate locally
• References can be passed to other processes

    class C { int val; ….. }
    C gv;        // global pointer
    C local lv;  // local pointer

    if (Ti.thisProc() == 0) {
        lv = new C();
    }
    gv = broadcast lv from 0;
    gv.val = …..;
    ….. = gv.val;

[Figure: lv and gv on Process 0 and on the other processes]

Local Pointer Analysis

• Compiler can infer many uses of local
  – “Local Qualification Inference” (Liblit’s work)
• Data structures must be well partitioned

Effect of LQI

[Figure: running time (sec), Original vs. After LQI, for the applications cannon, lu, sample, gsrb, and poison]

Bulk Asynchronous I/O Performance

[Figure: throughput (MB/sec) vs. file size (MB) for async, bulkds, bulkraf, dsb, ds, and raf]

External sort benchmark on NOW

• raf: random access file (Java)

• ds: unbuffered stream (Java)

• dsb: buffered stream (Java)

• bulkraf: bulk random access (Titanium)

• bulkds: bulk sequential (Titanium)

• async: asynchronous (Titanium)

Performance Heterogeneity

• System performance limited by the weakest link
• Performance heterogeneity is the norm
  – disks: inner vs. outer track (50%), fragmentation
  – processors: load (1.5-5x)
• Virtual Streams: dynamically off-load I/O work from slower disks to faster ones

[Figure: minimum per-process bandwidth (MB/sec) vs. efficiency of single slow disk (100%, 67%, 39%, 29%) for Ideal, Virtual Streams, and Static]
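A back-of-the-envelope model of the weakest-link effect, offered as an assumption-laden sketch rather than a description of Virtual Streams itself. It supposes identical disks except for one running at a fraction of full rate, and compares a static assignment (every process waits on the slow disk) with an idealized dynamic one (work shifts until aggregate bandwidth is shared evenly). The model and all names are hypothetical.

```java
// Illustrative arithmetic only: how one slow disk bounds per-process bandwidth
// under a static assignment versus ideal dynamic load balancing. The model and
// names are assumptions, not the Virtual Streams implementation.
final class SlowDiskModel {
    // Static: the minimum per-process bandwidth is set by the slow disk alone.
    static double staticMinBandwidth(double diskRate, double slowEfficiency) {
        return diskRate * slowEfficiency;
    }

    // Ideal dynamic: aggregate bandwidth of all disks is shared evenly.
    static double idealMinBandwidth(int disks, double diskRate, double slowEfficiency) {
        double aggregate = (disks - 1) * diskRate + diskRate * slowEfficiency;
        return aggregate / disks;
    }
}
```

Under this model the static figure falls in proportion to the slow disk's efficiency, while the ideal figure falls by only a 1/disks share of the lost bandwidth.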

Parallel performance on an SMP

• Speedup on Ultrasparc SMP (shared memory multiprocessor)
• EM3D performance linear
  – simple kernel
• AMR largely limited by
  – problem size
  – 2 levels, with top one serial

[Figure: speedup of em3d and amr; x-axis ticks 1, 2, 4, 8]

Parallel Performance on a NOW

• MLC for Finite-Differences by Balls and Colella
• Poisson equation with infinite boundaries
  – arises in astrophysics, some biological systems, etc.
• Method is scalable
  – Low communication
• Performance on
  – SP2 (shown) and T3E
  – scaled speedups
  – nearly ideal (flat)
• Currently 2D and non-adaptive

[Figure: time/fine-patch-iter/proc vs. processors (1, 4, 16) for 129x129/65x65, 129x129/33x33, 257x257/129x129, and 257x257/65x65]

Performance on CLUMPs

• Clusters of SMPs (CLUMPs) have two levels of communication
  – BH at SDSC has 144 nodes, each with 8 processors
  – 8th processor cannot be used effectively

[Figure: GSRB performance with 700x700 patches. Time (s) vs. processes for 1, 2, 4, 7, and 8 p/node]

Cluster of SMPs

• Communication within a node is shared-memory
• Communication between nodes uses LAPI
  – for large messages, a separate thread is created by LAPI
  – interferes with computation performance

[Figure: aggregate bandwidth with multiple processes. Bandwidth (MB/s) vs. data size (bytes) for 1, 2, 4, 7, and 8 p/node]

Optimizing Parallel Programs

• Would like compiler to introduce asynchronous communication, which is a form of possible reordering
• Hardware also reorders
  – out-of-order execution
  – write buffered with read by-pass
  – non-FIFO write buffers
• Software already reorders too
  – register allocation
  – any code motion
• System provides enforcement primitives
  – volatile: at the language level not well-defined
  – tend to be heavy weight, unpredictable
• Can the compiler hide all this?

Semantics: Sequential Consistency

• When compiling sequential programs, these two orderings are interchangeable:

    x = expr1;          y = expr2;
    y = expr2;          x = expr1;

  Valid if y not in expr1 and x not in expr2 (roughly)

• When compiling parallel code, this test is not sufficient:

    Initially flag = data = 0

    Proc A              Proc B
    data = 1;           while (flag != 1);
    flag = 1;           ... = ...data...;
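For comparison, here is a sketch of the same flag/data handoff in plain Java, where declaring the flag `volatile` is one way to forbid the problematic reordering under the Java memory model. This illustrates the pattern only; it is not how Titanium handles it, and the class and method names are hypothetical.

```java
// Illustrative Java version of the flag/data handoff. Declaring flag volatile
// is the assumption that rules out the reordering; names are hypothetical.
final class FlagHandoff {
    static int data = 0;
    static volatile boolean flag = false;

    static int run() {
        data = 0;
        flag = false;
        final int[] observed = new int[1];
        Thread procB = new Thread(() -> {
            while (!flag) { }          // Proc B: spin until Proc A signals
            observed[0] = data;        // ordered after the volatile read of flag
        });
        procB.start();
        data = 1;                      // Proc A: write the data first ...
        flag = true;                   // ... then publish with a volatile write
        try { procB.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        return observed[0];
    }
}
```

If `flag` were an ordinary field, the compiler or hardware would be free to swap Proc A's two writes or hoist Proc B's read of `flag` out of the loop, which is exactly the hazard shown above.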

Cycle Detection: Dependence Analog

• Processors define a “program order” on accesses from the same thread
    P is the union of these total orders
• The memory system defines an “access order” on accesses to the same variable
    A is the access order (read/write & write/write pairs)
• A violation of sequential consistency is a cycle in P U A.
• Intuition: time cannot flow backwards.

[Figure: accesses “write data” and “write flag” on one processor, “read flag” and “read data” on the other]
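A small sketch of the cycle test for one concrete execution of the flag/data example. Nodes are the four accesses, program-order and access-order edges are added as directed edges, and a depth-first search looks for a cycle. The node numbering and the helper `flagData` are hypothetical conveniences, not part of the cited analysis.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative check: an execution violates sequential consistency if the
// directed graph of program-order (P) and access-order (A) edges has a cycle.
// Node numbering and names are hypothetical.
final class OrderGraph {
    private final List<List<Integer>> succ = new ArrayList<>();

    OrderGraph(int nodes) {
        for (int i = 0; i < nodes; i++) succ.add(new ArrayList<>());
    }

    void addEdge(int from, int to) { succ.get(from).add(to); }

    boolean hasCycle() {
        int[] state = new int[succ.size()];   // 0 = unvisited, 1 = on stack, 2 = done
        for (int v = 0; v < state.length; v++) {
            if (state[v] == 0 && dfs(v, state)) return true;
        }
        return false;
    }

    private boolean dfs(int v, int[] state) {
        state[v] = 1;
        for (int w : succ.get(v)) {
            if (state[w] == 1) return true;                 // back edge closes a cycle
            if (state[w] == 0 && dfs(w, state)) return true;
        }
        state[v] = 2;
        return false;
    }

    // 0 = write data, 1 = write flag (Proc A); 2 = read flag, 3 = read data (Proc B)
    static OrderGraph flagData(boolean readDataSawOldValue) {
        OrderGraph g = new OrderGraph(4);
        g.addEdge(0, 1);                          // P: Proc A program order
        g.addEdge(2, 3);                          // P: Proc B program order
        g.addEdge(1, 2);                          // A: read flag observed the write
        if (readDataSawOldValue) g.addEdge(3, 0); // A: read data came before the write
        else g.addEdge(0, 3);                     // A: read data came after the write
        return g;
    }
}
```

Seeing the new flag but the old data produces the cycle write data, write flag, read flag, read data, write data, so that outcome is not sequentially consistent.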

Cycle Detection

• Generalizes to arbitrary numbers of variables and processors
• Cycles may be arbitrarily long, but it is sufficient to consider only cycles with 1 or 2 consecutive stops per processor [Shasha & Snir]

[Figure: example cycle over the accesses write x, write y, read y, read y, write x]

Static Analysis for Cycle Detection

• Approximate P by the control flow graph
• Approximate A by undirected “dependence” edges
• Let the “delay set” D be all edges from P that are part of a minimal cycle
• The execution order of D edges must be preserved; other P edges may be reordered (modulo usual rules about serial code)
• Synchronization analysis also critical [Krishnamurthy]

[Figure: example with accesses write z, read x, read y, write z, write y, read x]

Automatic Communication Optimization

• Implemented in subset of C with limited pointers
• Experiments on the NOW; 3 synchronization styles
• Future: pointer analysis and optimizations

[Figure: time (normalized)]

Sparsity: Sparse Matrix Optimizer

• Several data mining or web search algorithms use sparse matrix-vector multiplication
  – use for documents, images, video, etc.
  – irregular, indirect memory patterns perform poorly on memory hierarchies
• Performance improvements possible, but depend on:
  – sparsity structure, e.g., keywords within documents
  – machine parameters without analytical models
• Good news:
  – operation repeated many times on similar matrix
  – Sparsity: automatic code generator based on matrix structure and machine

Summary

• Future
  – small devices + larger servers
  – reliability increasingly important
• Reliability techniques include
  – hardware: redundancy, monitoring
  – software: better languages, many others
• Performance trades off against safety in languages
  – use of domain-specific features (e.g., Titanium)

Backup Slides

The Big Motivators for Programming Systems Research

• Ease of Programming
  – Hardware costs -> 0
  – Software costs -> infinity
• Correctness
  – Increasing reliance on software increases cost of software errors (medical, financial, etc.)
• Performance
  – Increasing machine complexity
  – New languages and applications
    » Enabling Java; network packet filters

The Real Scalability Problems: AME

• Availability
  – systems should continue to meet quality of service goals despite hardware and software failures and extreme load
• Maintainability
  – systems should require only minimal ongoing human administration, regardless of scale or complexity
• Evolutionary Growth
  – systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
• These are problems at today’s scales, and will only get worse as systems grow

Research Principles

• Redundancy everywhere, no single point of failure
• Performance secondary to AME
  – Performance robustness over peak performance
  – Dedicate resources to AME
    » biological systems use > 50% of resources on maintenance
  – Optimizations viewed as AME-enablers
    » e.g., use of (slower) safe languages like Java with static and dynamic optimizations
• Introspection
  – reactive techniques to detect and adapt to failures, workload variations, and system evolution
  – proactive techniques to anticipate and avert problems before they happen

Outline

• Motivation
• Hardware Techniques
  – general techniques
  – ISTORE projects
• Software Techniques
• Availability Benchmarks
• Conclusions

Hardware techniques

• Fully shared-nothing cluster organization
  – truly scalable architecture, automatic redundancy
  – tolerates partial hardware failure
• No Central Processor Unit: distribute processing with storage
  – Most storage servers limited by speed of CPUs; why does this make sense?
  – Amortize sheet metal, power, cooling infrastructure for disk to add processor, memory, and network
• On-demand network partitioning/isolation
  – Applications must tolerate these anyway
  – Allows testing, repair of online system

Hardware techniques

• Heavily instrumented hardware
  – sensors for temp, vibration, humidity, power
• Independent diagnostic processor on each node
  – remote control of power, console, boot code
  – collects, stores, processes environmental data
  – connected via independent network
• Built-in fault injection capabilities
  – Used for proactive hardware introspection
    » automated detection of flaky components
    » controlled testing of error-recovery mechanisms
  – Important for AME benchmarking

ISTORE-2 Hardware Proposal

• Smaller disks
  – replace 3.5” disks with 2.5” or 1” drives
    » 340MB available now in 1”, 1 GB next year (?)
• Smaller, more highly integrated processors
  – E.g., Transmeta Crusoe includes processor and Northbridge (interface) functionality in 1 Watt
  – Xilinx FPGA for Southbridge, diagnostic proc, etc.
• Larger scale
  – Roughly 1000 nodes, depending on support
    » ISTORE-1 built with donated disks, memory, processors
    » Paid for network, board design, enclosures (discounted)

Outline

• Motivation
• Hardware Techniques
• Software Techniques
  – general techniques
  – Titanium: a high performance Java dialect
  – Sparsity: using dynamic information
  – Virtual streams: performance robustness
• Availability Benchmarks
• Conclusions

Software Techniques

• Fault-tolerant data structures
  – Application controls replication, checkpointing, and consistency policy
  – Self-scrubbing used to identify software errors that have corrupted application state
• Encourage use of safe languages
  – Type safety and automatic memory management avoid a host of application errors
  – Use of static and dynamic information to meet performance needs
• Runtime adaptation to performance heterogeneity
  – e.g., outer vs. inner track (1.5X), fragmentation
  – Evolution of systems adds to this problem

Software Techniques

• Reactive introspection
  – Use statistical techniques to identify normal behavior and detect deviations from it
    » e.g., network activity, response time, program counter (?)
  – Semi-automatic response to abnormal behavior
    » initially, rely on human administrator
    » eventually, system learns to set response parameters
• Proactive introspection
  – Continuous online self-testing
    » in deployed systems!
    » goal is to shake out bugs in failure response code on isolated subset
    » use of fault-injection and stress testing
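One simple form the "identify normal behavior and detect deviations" step might take is sketched here. The slides do not say which statistical technique is meant, so the running mean and variance (Welford's method) and the k-standard-deviations rule are assumptions, as are all names.

```java
// Illustrative sketch of statistical deviation detection: learn a running mean
// and variance (Welford's method), then flag samples far from the mean.
// The threshold rule and all names are assumptions.
final class DeviationDetector {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0;            // sum of squared deviations from the mean
    private final double threshold;     // allowed distance in standard deviations

    DeviationDetector(double threshold) {
        this.threshold = threshold;
    }

    // Fold one sample of normal behavior (e.g., a response time) into the model.
    void observe(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    // Report whether a new sample deviates from the learned behavior.
    boolean isAbnormal(double x) {
        if (n < 2) return false;                   // not enough history yet
        double stddev = Math.sqrt(m2 / (n - 1));
        return Math.abs(x - mean) > threshold * stddev;
    }
}
```

In the semi-automatic setting described above, the threshold is the kind of response parameter an administrator would set at first and the system might later learn.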

Techniques for Safe Languages

Titanium: A high performance dialect of Java

• Scalable parallelism
  – A global address space, but not shared memory
  – For tightly-coupled applications, e.g., mining
  – Safe, region-based memory management
• Scalar performance enhancements, some specific to application domain
  – immutable classes (avoids indirection)
  – multidimensional arrays with subarrays
• Application domains
  – scientific computing on grids
    » typically +/-20% of C++/F in this domain
  – data mining in progress

Use of Static Information

• Titanium compiler performs parallel optimizations
  – communication overlap (40%) and aggregation
• Uses two new analyses
  – synchronization analysis: the parallel analog to control flow analysis
    » identifies code segments that may execute in parallel
  – shared variable analysis: the parallel analog to dependence analysis
    » recognizes when reordering can be observed by another processor
    » necessary for any code motion or use of relaxed memory models in hardware => missed or illegal optimizations

Conclusions

• Two key application domains
  – Storage: loosely coupled
  – Search: tightly coupled, computation important
• Key challenges to future servers are:
  – Availability, Maintainability, and Evolutionary growth
• Use of self-monitoring to satisfy AME goals
  – Proactive and reactive techniques
• Use of static techniques for high performance and reliable software
  – Titanium extension of Java
• Use of dynamic information for performance robustness
  – Sparsity and Virtual Streams
• Availability benchmarks a powerful tool?

Projects and Participants

ISTORE: iram.cs.berkeley.edu/istore

With James Beck, Aaron Brown, Daniel Hettena, David Oppenheimer, Randi Thomas, Noah Treuhaft, David Patterson, John Kubiatowicz

Titanium: www.cs.berkeley.edu/projects/titanium

With Greg Balls, Dan Bonachea, David Gay, Ben Liblit, Chang-Sun Lin, Peter McCorquodale, Carleton Miyamoto, Geoff Pike, Alex Aiken, Phil Colella, Susan Graham, Paul Hilfinger

Sparsity: www.cs.berkeley.edu/~ejim/sparsity

With Eun-Jin Im

History of Programming Language Research

[Figure: timeline across the 70s, 80s, 90s, and 2K showing Flop Optimization, Memory Optimizations, General-Purpose Language Design, Domain-Specific Language Design, Parsing Theory, Type Systems Theory, Garbage Collection, Threads, Program Verification, Program Checking Tools, Data and Control Analysis, and Type-Based Analysis]
