Haryadi Gunawi and Andrew Chien - Synergy...

TRANSCRIPT

in collaboration with Gokul Soundararajan and Deepak Kenchammana (NetApp)

Rob Ross and Dries Kimpe (Argonne National Labs)

Haryadi Gunawi and Andrew Chien

2

- Complete fail-stop
- Fail-partial
- Corruption
- Performance degradation (“limpware”)?

Rich literature

6/1/15 LigHTS @ XPS Workshop 2015

3

- “… a 1Gb NIC card on a machine that suddenly starts transmitting at 1 Kbps … this one slow machine caused a chain reaction … a 100-node cluster was crawling at a snail's pace” – Facebook Engineers

“Limping” NIC! (1,000,000x)

Cascading impact!


4

- Disks
  - “… 4 servers having high wait times on I/O, up to 103 seconds. This was left uncorrected for 50 days.” @ Argonne
  - Causes: weak disk head, bad packaging, missing screws, broken/old fans, too many disks/box, firmware bugs, bad sector remapping, …
- SSDs
  - Samsung firmware bug (reduced bandwidth by 4x)
- Network cards and switches
  - “On Intrepid, a bad batch of optical transceivers with an extremely high error rate caused an effective throughput of 1-2 Kbps.” @ Argonne
  - Causes: broken adapter, error correcting, driver bugs, power fluctuation, …
- Memory
  - Runs at only 25% of normal speed – HBase operators
- Processors
  - 26% variation
  - Aging transistors, overheating, self-throttling, …
- Many others: “Yes, we've seen that in production”
  - More anecdotes in our paper [SoCC ’13]


- Introduction
- Impact of limpware on scale-out cloud systems? [HotCloud ’13, SoCC ’13]
- Progress Summary
  - What bugs live in the cloud? [SoCC ’14]
  - Detecting performance bugs [HotCloud ’15]
  - The Tail at Store [In Submission]
  - Other ongoing work


- Anecdotes
  - “The performance of a 100-node cluster was crawling at a snail's pace” – Facebook
- But … why?


- Goals:
  - Measure system-level impacts
  - Find design flaws
- Run distributed systems/protocols
  - E.g., a 3-node write in HDFS
- Measure slowdowns under:
  - No failure, a crash, and a limping NIC

[Chart: execution slowdown of the workload (1x to 1000x) under a 10 Mbps, 1 Mbps, and 0.1 Mbps NIC]
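To see why a single degraded NIC dominates end-to-end performance, consider that an HDFS-style replicated write streams the block through a pipeline of datanodes, so the slowest link bounds the whole write. A minimal sketch (illustrative numbers and function names are our own, not the authors' measurement harness):

```python
# Hypothetical sketch: end-to-end time of a pipelined 3-node write is
# throttled by the slowest NIC in the chain (assumed simplified model,
# ignoring latency, protocol overhead, and disk time).

def pipeline_write_time(block_mb, link_mbps):
    """Seconds to stream a block through a replication pipeline."""
    slowest = min(link_mbps)            # the pipeline drains at the slowest link
    return (block_mb * 8) / slowest     # MB -> Mb, divided by Mbps

BLOCK_MB = 64
healthy = [1000, 1000, 1000]            # 1 Gbps on every node
limping = [1000, 1000, 0.1]             # one NIC limping at 0.1 Mbps

baseline = pipeline_write_time(BLOCK_MB, healthy)
degraded = pipeline_write_time(BLOCK_MB, limping)
# One limping NIC slows the entire 3-node write by the full 1000/0.1 ratio:
print(f"slowdown: {degraded / baseline:,.0f}x")
```

Under this model the two healthy replicas contribute nothing to the tail: the write is exactly as slow as its worst link, matching the 1000x-slower band in the chart.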


[Systems evaluated: HDFS, Hadoop, ZooKeeper, Cassandra, HBase]


Fail-stop tolerant, but not limpware tolerant (no failover recovery)


- Run Hadoop with 6+ hours of a Facebook workload
  - 30-node cluster
  - 30-node cluster (with 1 slow node @ 0.1 Mbps)

The slow cluster collapses after ~4 hours, down to 1 job/hour. This also happens in HDFS and ZooKeeper.


- Single point of performance failure
- Coarse-grained timeouts
- Bounded thread/queue pool → resource exhaustion
- Unbounded thread/queue pool → OOM
- No throttling or back-pressure
- Limp-oblivious background jobs
- Unexploited parallelism of small transactional I/Os
- Long lock/resource contention
- …
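The first three flaws compound: a limping node answers every RPC just under the coarse-grained timeout, so failover never triggers, and a bounded worker pool gradually queues behind it. A toy sketch (all names and numbers here are illustrative assumptions, not code from the studied systems):

```python
# Illustrative sketch: why coarse-grained timeouts plus a bounded worker
# pool "limplock" a system. A crashed node trips the timeout and recovers;
# a limping node that replies just under the timeout never does.

TIMEOUT_S = 60          # coarse-grained RPC timeout (assumed value)
POOL_SIZE = 8           # bounded worker-thread pool (assumed value)

def rpc(node_latency_s):
    """Return 'ok' if the reply beats the timeout, else 'failed-over'."""
    return "ok" if node_latency_s < TIMEOUT_S else "failed-over"

# A crashed node (infinite latency) times out and triggers failover...
assert rpc(float("inf")) == "failed-over"
# ...but a limping node at 50 s per request (~1000x slower than a healthy
# 50 ms) stays "ok" forever, so no recovery path ever runs:
assert rpc(50) == "ok"

# With all POOL_SIZE workers stuck serving the limping node, effective
# throughput collapses from POOL_SIZE/0.05 to POOL_SIZE/50 requests/s:
healthy_tput = POOL_SIZE / 0.05
limping_tput = POOL_SIZE / 50
print(f"throughput drop: {healthy_tput / limping_tput:.0f}x")
```

The pool size cancels out: making the bounded pool bigger only delays exhaustion, and an unbounded pool trades the same limplock for OOM, as the list above notes.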


- Introduction
- Impact of limpware [SoCC ’13]
- Progress Summary


- Study/Analysis
  - Limplock/limpware [HotCloud ’13, SoCC ’13]
  - What bugs live in the cloud? [SoCC ’14]


- Study/Analysis
  - Limplock/limpware [HotCloud ’13, SoCC ’13]
  - What bugs live in the cloud? [SoCC ’14]
    - A study of 3000+ bugs in scale-out distributed systems
    - New: scalability bugs, single-point-of-failure bugs, …


- Study/Analysis
  - Limplock/limpware [HotCloud ’13, SoCC ’13]
  - What bugs live in the cloud? [SoCC ’14]
  - The Tail at Store [In Submission]
    - Goal: turn anecdotes into real statistics
    - Collaboration with Gokul Soundararajan and Deepak Kenchammana
    - A study of over 450,000 disks, 4,000 SSDs, and 240 EBS drives
    - Questions: How many slow drives? How often? Transient?


- Study/Analysis
  - Limplock/limpware [HotCloud ’13, SoCC ’13]
  - What bugs live in the cloud? [SoCC ’14]
  - The Tail at Store [In Submission]
    - Limping disks and SSDs are real!
    - 2-digit slowdowns occurred in 0.01% of disk and SSD hours
    - 4- and 3-digit slowdowns occurred in 124 and 2461 disk hours respectively, and 3-digit SSD slowdowns in 10 SSD hours
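These "N-digit slowdown" figures are ratios of a drive's latency to its peers. A minimal sketch of such a metric (our assumed formulation, normalizing each drive against the median of its group; the paper's exact definition may differ):

```python
# Assumed slowdown metric: each drive's latency divided by the median
# latency of its peer group; a "2-digit" slowdown is a ratio >= 10.
import statistics

def slowdowns(latencies_ms):
    """Per-drive slowdown relative to the group median latency."""
    med = statistics.median(latencies_ms)
    return [lat / med for lat in latencies_ms]

# Illustrative group: four healthy drives and one tail drive ~50x slower.
group = [5.0, 5.2, 4.8, 5.1, 260.0]
limping = [round(s) for s in slowdowns(group) if s >= 10]
print(limping)  # prints [51]: the tail drive shows a 2-digit slowdown
```

Normalizing against the median (rather than the mean) keeps the metric stable even when the tail drive itself is wildly slow, which matters when one drive limps by three or four digits.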


- Study/Analysis
- Towards Limpware-Tolerant Systems
  - Detecting limpware-intolerant designs in distributed systems [HotCloud ’15]
  - Tail-tolerant storage [In Progress]
    - In the flash controller, operating system, and distributed storage
    - Plus coordination with MapReduce speculative execution
    - (A cross-cutting approach)

[Diagram: TT Flash Ctrl, TT OS/RAID, and TT Distr. FS layers, coordinated with MapReduce Spec. Ex.]


XPS → Exploit Scale

Limpware → Underexploit Scale


ucare.cs.uchicago.edu ceres.uchicago.edu