Berkeley RAD Lab Technical Overview

Armando Fox, Randy Katz, Michael Jordan, Dave Patterson, Scott Shenker, Ion Stoica
March 2006



Page 2: Berkeley RAD Lab Technical Overview

RAD Lab

The 5-year Vision: A single person can go from vision to a next-generation IT service (“the Fortune 1 million”)
• E.g., over a long holiday weekend in 1995, Pierre Omidyar created eBay v1.0

The Challenges:
• Develop the new service: today, easy prototyping ≠ easy operations
• Assess: measuring, testing, and debugging the new service in a realistic distributed environment: how will it scale?
• Deploy: scaling up a new, geographically distributed service
• Operate a service that could quickly scale to millions of users with <1 operator

The Vehicle: An interdisciplinary center creates the core technical competency to demo 10X to 100X
• Researchers are leaders in machine learning, networking, and systems
• Industrial participants: leading companies in HW, systems SW, and online services
• “RAD Lab” = Reliable, Adaptive, Distributed systems

Page 3: Berkeley RAD Lab Technical Overview

Founding the RAD Lab

• Looked for 3 to 4 founding companies to fund 5 years @ cost of $0.5M / year
  • Google, Microsoft, Sun Microsystems signed up
• Affiliate companies ($0.1M/yr): HP, IBM, others
• Founding company model
  • Prefer founding partner technology in prototypes
  • Designate employees to act as consultants
  • Putting IP in the public domain
  • 3-year project review by founding partners
• $2.5-$3M/yr: ~65% industry, ~25% state, ~10% federal
• 30 grad students + 10 undergrads + 6 faculty + 2 staff

Page 4: Berkeley RAD Lab Technical Overview

Steps vs. Process

• Process: Support DADO (Develop, Assess, Deploy, Operate) evolution, 1 group
• Steps: Traditional, static handoff model, N groups

[Figure: the four stages Develop, Assess, Deploy, Operate, drawn once for each model]

Page 5: Berkeley RAD Lab Technical Overview

Key Ingredients: Visualization & Statistical Machine Learning (SML)

• Too much data for a human to troubleshoot manually
  • E.g., Amazon: tens of metrics, 100s-1000s of machines
• Visualization exploits human visual processing
• SML finds patterns in large quantities of data

Page 6: Berkeley RAD Lab Technical Overview

Operations example: combining visualization & machine learning

• Idea: end-user behavior as “failure detector”
• Approach: combine visualization with SML analysis so operators see anomalies too
• Experiment: does the distribution of hits to various pages match the “historical” distribution?
  • Each minute, compare hit counts of the top N pages to hit counts over the last 6 hours, using Bayesian networks and the χ² test on real Ebates data

To learn more, see “Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization,” in Proc. 2nd IEEE Int’l Conf. on Autonomic Computing, June 2005, by Peter Bodik, Greg Friedman, Lukas Biewald, Helen Levine (Ebates.com), George Candea, Kayur Patel, Gilman Tolle, Jon Hui, Armando Fox, Michael I. Jordan, David Patterson.
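
The χ² part of the experiment above can be sketched in a few lines. This is a minimal illustration, assuming Pearson's χ² statistic over the top-N pages; the function name `chi_square_statistic`, the `top_n` default, and the page names and counts are hypothetical and are not taken from the Ebates data. The Bayesian-network analysis is not covered here.

```python
from typing import Dict


def chi_square_statistic(current: Dict[str, int],
                         historical: Dict[str, int],
                         top_n: int = 40) -> float:
    """Pearson chi-square statistic for the current interval's hit counts,
    measured against the page distribution implied by the historical counts."""
    # Restrict the comparison to the top-N pages by historical traffic.
    top_pages = sorted(historical, key=historical.get, reverse=True)[:top_n]
    top_pages = [p for p in top_pages if historical[p] > 0]
    total_hist = sum(historical[p] for p in top_pages)
    total_now = sum(current.get(p, 0) for p in top_pages)
    if total_hist == 0 or total_now == 0:
        # Nothing to compare. An interval with no hits at all would need
        # separate handling in a real detector.
        return 0.0

    statistic = 0.0
    for page in top_pages:
        # Hits this page would receive if traffic followed the historical mix.
        expected = total_now * historical[page] / total_hist
        observed = current.get(page, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: six hours of history and two one-minute samples.
history = {"home": 6000, "search": 3000, "checkout": 1000}
typical_minute = {"home": 60, "search": 30, "checkout": 10}
odd_minute = {"home": 90, "search": 10, "checkout": 0}

print(chi_square_statistic(typical_minute, history))
print(chi_square_statistic(odd_minute, history))
```

A larger statistic means the current page mix departs further from the historical one. Where to set the alarm threshold is a tuning choice that the slide does not specify.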

Page 7: Berkeley RAD Lab Technical Overview

• Visualizing user behavior is a completely different approach; visualizations usually animate the architecture
• Win trust in SLT by leveraging operator expertise and human visual pattern recognition

[Figure: hits to the top 40 pages (y-axis: Pages) over time (x-axis: Time, 5-minute intervals)]

Page 8: Berkeley RAD Lab Technical Overview

Build Academic MPP from FPGAs

• As 25 CPUs will fit in a Field Programmable Gate Array (FPGA), 1000-CPU system from 40 FPGAs?
  • 16 32-bit simple “soft core” RISC at 150 MHz in 2004 (Virtex-II)
  • FPGA generations every 1.5 yrs; 2X CPUs, 1.2X clock rate
• HW research community does logic design (“gate shareware”) to create out-of-the-box MPP
  • E.g., 1000-processor, standard ISA binary-compatible, 64-bit, cache-coherent supercomputer @ 100 MHz/CPU in 2007
• RAMPants: Arvind (MIT), Krste Asanović (MIT), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley, Co-PI), Jan Rabaey (Berkeley), and John Wawrzynek (Berkeley, PI)
• RAMP = “Research Accelerator for Multiple Processors”
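
The scaling argument on this slide is compound growth, which a short sketch can check. It assumes only the slide's figures (16 simple 32-bit soft cores at 150 MHz in 2004, one FPGA generation every 1.5 years, 2X CPUs and 1.2X clock per generation, 25 CPUs per FPGA for the 1000-CPU system). The function names are illustrative, and this is a back-of-envelope model rather than a RAMP design tool.

```python
import math
from typing import Tuple


def project_fpga(year: float,
                 base_year: float = 2004.0,
                 base_cpus: int = 16,
                 base_mhz: float = 150.0,
                 years_per_gen: float = 1.5,
                 cpu_growth: float = 2.0,
                 clock_growth: float = 1.2) -> Tuple[int, float]:
    """Project simple soft cores per FPGA and their clock rate for a year."""
    # Only completed FPGA generations count toward the projection.
    generations = math.floor((year - base_year) / years_per_gen)
    cpus = int(base_cpus * cpu_growth ** generations)
    mhz = base_mhz * clock_growth ** generations
    return cpus, mhz


def fpgas_needed(total_cpus: int, cpus_per_fpga: int) -> int:
    """Number of FPGAs required to host total_cpus processors."""
    return math.ceil(total_cpus / cpus_per_fpga)


print(project_fpga(2007))      # simple 32-bit cores per FPGA and MHz in 2007
print(fpgas_needed(1000, 25))  # the slide's 25 CPUs per FPGA
```

The projection covers the simple 32-bit soft cores only. The 64-bit, cache-coherent processors targeted for 2007 are larger, which is consistent with the slide budgeting fewer CPUs per FPGA and a lower clock than this model yields.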

Page 9: Berkeley RAD Lab Technical Overview

Why RAMP Good for Research MPP?

Criterion                      | SMP                  | Cluster              | Simulate               | RAMP
Scalability (1k CPUs)          | C                    | A                    | A                      | A
Cost (1k CPUs)                 | F ($40M)             | C ($2-3M)            | A+ ($0M)               | A ($0.1-0.2M)
Cost of ownership              | A                    | D                    | A                      | A
Power/Space (kilowatts, racks) | D (120 kW, 12 racks) | D (120 kW, 12 racks) | A+ (0.1 kW, 0.1 racks) | A (1.5 kW, 0.3 racks)
Community                      | D                    | A                    | A                      | A
Observability                  | D                    | C                    | A+                     | A+
Reproducibility                | B                    | D                    | A+                     | A+
Reconfigurability              | D                    | C                    | A+                     | A+
Credibility                    | A+                   | A+                   | F                      | B+/A-
Perform. (clock)               | A (2 GHz)            | A (3 GHz)            | F (0 GHz)              | C (0.1-0.2 GHz)
GPA                            | C                    | B-                   | B                      | A-
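
The GPA row can be roughly reproduced by averaging the letter grades in each column. The sketch below assumes a conventional 4.0 grade-point scale, with A+ capped at 4.0 and the split grade B+/A- taken as the mean of its two halves. The slide does not say which scale the authors used, so the mapping and the names `GRADE_POINTS`, `REPORT_CARD`, and `gpa` are assumptions.

```python
from typing import Dict, List

# Assumed grade-point scale; the slide does not state one.
GRADE_POINTS: Dict[str, float] = {
    "A+": 4.0, "A": 4.0, "A-": 3.7,
    "B+": 3.3, "B": 3.0, "B-": 2.7,
    "C": 2.0, "D": 1.0, "F": 0.0,
}

# Grades per platform, in the row order of the table (Scalability ... Perform.).
REPORT_CARD: Dict[str, List[str]] = {
    "SMP":      ["C", "F", "A", "D", "D", "D", "B", "D", "A+", "A"],
    "Cluster":  ["A", "C", "D", "D", "A", "C", "D", "C", "A+", "A"],
    "Simulate": ["A", "A+", "A", "A+", "A", "A+", "A+", "A+", "F", "F"],
    "RAMP":     ["A", "A", "A", "A", "A", "A+", "A+", "A+", "B+/A-", "C"],
}


def grade_points(grade: str) -> float:
    """Points for one grade; a split grade like 'B+/A-' is averaged."""
    parts = grade.split("/")
    return sum(GRADE_POINTS[p] for p in parts) / len(parts)


def gpa(grades: List[str]) -> float:
    """Unweighted mean of the grade points."""
    return sum(grade_points(g) for g in grades) / len(grades)


for platform, grades in REPORT_CARD.items():
    print(platform, round(gpa(grades), 2))
```

Under these assumptions the averages come out near 2.1, 2.5, 3.2, and 3.75, close to the slide's C, B-, B, and A-. The exact letters depend on the scale and rounding the authors had in mind.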

Page 10: Berkeley RAD Lab Technical Overview

RAMP 1 Hardware

• Completed Dec. 2004 (14x17 inch 22-layer PCB)
• Board: 5 Virtex II FPGAs, 18 banks DDR2-400 memory, 20 10GigE conn.
• Box: 8 compute modules in 8U rack mount chassis
• 1000 CPUs: 1.5 kW, ¼ rack, $100,000
  • 1.5 W / computer, 5 cu. in. / computer, $100 / computer
• BEE2: Berkeley Emulation Engine 2, by John Wawrzynek and Bob Brodersen with students Chen Chang and Pierre Droz

Page 11: Berkeley RAD Lab Technical Overview

RAMP in RADS: Internet in a Box

• RAMP building blocks also apply to distributed computing
• RAMP vs. Clusters (Emulab, PlanetLab)
  • Scale: RAMP O(1000) vs. Clusters O(100)
  • Private use: at $100k, every group has one
  • Develop/Debug: reproducibility, observability
  • Flexibility: modify modules (router, SMP, OS)
• Explore via repeatable experiments while varying parameters and configurations, vs. observations on a single (aging) cluster that is often idiosyncratic

Page 12: Berkeley RAD Lab Technical Overview

Planned Apps & Courses

• ResearchIndex: reputation & ranking system for CS research papers and digests
  • Seeking suggestions/collaboration on this & other possible apps, to get experience with Develop & Deploy
  • Seeking datasets corresponding to larger (real) apps as well, to increase experience with Assess & Operate
• Courses
  • CS 294, Fall 06: MS/PhD level projects contributing to RAD Lab infrastructure in all areas (DADO)
  • CS 294, Fall 07: Prototype services to run in “production mode” on RAD Lab platform, improve platform/environment based on lessons from deployment
  • CS 294, Fall 08: “Web 2.0” style services on RAD Lab platform (e.g. joint with Haas Business School)
  • Undergrad courses, >2008: software eng. assignments are network services running on RADS platform

Page 13: Berkeley RAD Lab Technical Overview

RAD Lab: Interdisciplinary Center for Reliable, Adaptive, Distributed Systems

• Develop using primitives to enable functions (MapReduce), services (Craigslist)
• Assess using deterministic replay and statistical debugging
• Deploy via “Internet-in-a-Box” FPGAs
• Operate SLT-friendly, Control Theory-friendly architectures and operator-centric visualization and analysis tools

Capability (Desired): 1 person can invent & run the next-gen IT service

Base Technology: Server Hardware, System Software, Middleware, Networking
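
As an illustration of the kind of primitive the Develop step names, here is a minimal single-process sketch of the MapReduce pattern. It shows only the programming model (map, group by key, reduce). The function names and the sample input are hypothetical, and this is not the RAD Lab's or any production implementation.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def map_reduce(records: Iterable[str],
               mapper: Callable[[str], List[Tuple[str, int]]],
               reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Apply mapper to every record, group values by key, then reduce each group."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words(line: str) -> List[Tuple[str, int]]:
    """Map step: emit (word, 1) for each word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def total(key: str, values: List[int]) -> int:
    """Reduce step: sum the counts collected for one word."""
    return sum(values)


print(map_reduce(["develop assess", "deploy operate develop"], count_words, total))
```

The service author writes only `count_words` and `total`; distribution, grouping, and fault handling would belong to the platform, which is the appeal of building on such primitives.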

Page 14: Berkeley RAD Lab Technical Overview

Industrial collaboration

• Historically a UCB strength
• Industrial research labs are ideal partners
  • High quality research staff => symmetric collaboration
  • Ties to product groups => work on relevant problems
  • Access to real data sets => realistic evaluation of prototypes
• Goal: ongoing transfer of software, technology & people
  • “BSD License” for RAD Lab technology intended to ease adoption by industrial partners
• RAD Lab targets: SML & control theory, visualization, development of service-oriented archs. & apps.

Page 15: Berkeley RAD Lab Technical Overview

RAD Lab Timeline

• 2005: Launch RAD Lab
• 2006: Collect workloads, Internet in a Box
• 2007: SLT/CT distributed architectures, Iboxes, annotation layer, class testing
• 2008: Development toolkit 1.0, tuple space, class testing; Mid Project Review
• 2009: RAD Lab software suite 1.0, class testing
• 2010: End of Project Party