Triple-A: A Non-SSD Based Autonomic All-Flash Array for High Performance Storage Systems
Myoungsoo Jung (UT-Dallas), Wonil Choi (UT-Dallas), John Shalf (LBNL), Mahmut Kandemir (PSU)

Posted on 31-Mar-2015
TRANSCRIPT

Page 1:

Triple-A: A Non-SSD Based Autonomic All-Flash Array for High Performance Storage Systems

Myoungsoo Jung (UT-Dallas), Wonil Choi (UT-Dallas), John Shalf (LBNL), Mahmut Kandemir (PSU)

Page 2:

Executive Summary
• Challenge: SSD arrays may not be suitable for high-performance computing storage
• Our goal: propose a new high-performance storage architecture
• Observations
– High maintenance cost: caused by worn-out flash-SSD replacements
– Performance degradation: caused by shared-resource contention
• Key ideas
– Cost reduction: take the bare NAND flash out of the SSD box
– Contention resolution: distribute the excessive I/O that generates bottlenecks
• Triple-A: a new architecture for HPC storage
– Consists of non-SSD bare flash memories
– Automatically detects and resolves performance bottlenecks
• Results: non-SSD all-flash arrays are expected to save 35~50% of cost and offer 53% higher throughput than a traditional SSD array

Page 3:

Outline

• Motivations
• Triple-A Architecture
• Triple-A Management
• Evaluations
• Conclusions

Page 4:

SSD Arrays

• SSD arrays are in a position to (partially) replace HDD arrays
• HPC starts to employ SSDs

[Figure: HPC deployments move from HDD arrays to SSD caches on HDD arrays, SSD buffers on compute nodes, and SSD arrays]

Page 5:

High-Cost Maintenance of SSD Arrays

• As time goes by, worn-out SSDs must be replaced
• A thrown-away SSD has complex internals
• The other parts are still useful; only the flash memories are worn out

[Figure: in an SSD array, a worn-out (dead) SSD is abandoned and replaced even though most of its components are still live]

Page 6:

I/O Services Suffer in SSD Arrays

• Varying data locality in an array consisting of 80 SSDs
• A hot region is a group of SSDs holding 10% of the total data
• Arrays without a hot region show reasonable latency
• As the number of hot regions increases, the performance of SSD arrays degrades

Page 7:

Why Is Latency Delayed? Link Contention

• A single data bus is shared by a group of SSDs
• When the target SSD is ready and the shared bus is idle, the I/O request gets serviced right away
• Problems arise when excessive I/Os are destined to a specific group of SSDs

[Figure: two groups of four SSDs (SSD-1 to SSD-8), each group sharing one data bus; requests Dest-1 through Dest-8 proceed while the buses are IDLE and the target SSDs are READY]

Page 8:

Why Is Latency Delayed? Link Contention

• When the shared bus is busy, I/O requests must stay in the buffer even though the target SSD is ready
• This stall occurs because the SSDs in a group share a data bus: link contention

[Figure: requests STALL in the buffer while the shared bus is BUSY, even though their target SSDs are READY]
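The stall described above can be illustrated with a toy model (a hypothetical sketch, not the authors' simulator): requests whose targets share one data bus serialize on that bus even when every target SSD is ready.

```python
# Toy model of link contention: SSDs in a group share one data bus,
# so transfers to different ready SSDs still serialize on the bus.
# Hypothetical sketch with made-up unit costs, not the paper's simulator.

TRANSFER_TIME = 1  # bus time units consumed per request

def bus_finish_times(requests, group_of, num_buses):
    """Return per-request finish times when each group shares one bus.

    requests: list of target SSD ids, all arriving at time 0.
    group_of: maps SSD id -> bus (group) id.
    """
    bus_free_at = [0] * num_buses
    finish = []
    for ssd in requests:
        bus = group_of[ssd]
        start = bus_free_at[bus]            # wait until the shared bus is idle
        bus_free_at[bus] = start + TRANSFER_TIME
        finish.append(start + TRANSFER_TIME)
    return finish

# Eight SSDs on two buses: SSD 1-4 share bus 0, SSD 5-8 share bus 1
group_of = {s: 0 if s <= 4 else 1 for s in range(1, 9)}

balanced = bus_finish_times([1, 5, 2, 6], group_of, 2)  # load spread over buses
skewed = bus_finish_times([1, 2, 3, 4], group_of, 2)    # all I/O hits bus 0
print(balanced)  # [1, 1, 2, 2] -- two transfers per bus
print(skewed)    # [1, 2, 3, 4] -- four transfers serialize on one bus
```

Even though every target SSD is ready in both runs, the skewed run finishes twice as late: the shared link, not the storage, is the bottleneck.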

Page 9:

Why Is Latency Delayed? Storage Contention

• Problems arise when excessive I/Os are destined to a specific SSD

[Figure: many requests (Dest-8, Dest-8, …) queue up for SSD-8 while it is BUSY, although other SSDs are READY]

Page 10:

Why Is Latency Delayed? Storage Contention

• Problems arise when excessive I/Os are destined to a specific SSD
• When the target SSD is busy, I/O requests must stay in the buffer even though the link is available
• This stall occurs because a specific SSD is continuously busy: storage contention

[Figure: requests to SSD-8 STALL in the buffer while SSD-8 remains BUSY, even though the bus is IDLE and other SSDs are READY]
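The complementary case can be sketched the same way (again a hypothetical toy, not the authors' simulator): here the links are free, but requests piling up on one SSD queue behind each other.

```python
# Toy model of storage contention: the bus is idle, but one SSD is
# continuously busy, so requests destined to it stall in the queue.
# Hypothetical sketch with made-up unit costs, not the paper's simulator.

SERVICE_TIME = 1  # time an SSD needs to serve one request

def device_finish_times(requests):
    """Finish time of each request; each SSD serves one request at a time."""
    ssd_free_at = {}
    finish = []
    for ssd in requests:
        start = ssd_free_at.get(ssd, 0)    # wait only for the target SSD
        ssd_free_at[ssd] = start + SERVICE_TIME
        finish.append(start + SERVICE_TIME)
    return finish

spread = device_finish_times([1, 2, 3, 4])   # each request hits a ready SSD
hotspot = device_finish_times([8, 8, 8, 8])  # all requests pile up on SSD-8
print(spread)   # [1, 1, 1, 1]
print(hotspot)  # [1, 2, 3, 4]
```

With the same four requests, a hot SSD quadruples the worst-case latency even though no link is ever busy.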

Page 11:

Outline

• Motivations
• Triple-A Architecture
• Triple-A Management
• Evaluations
• Conclusions

Page 12:

Unboxing the SSD for Cost Reduction

• Worn-out flash packages should be replaced
• Much of an SSD's logic, including the H/W controllers and firmware, is wasted when a worn-out SSD is replaced
• Instead of a whole SSD, let's use only the bare flash packages

[Figure: SSD internals. The bare NAND flash packages are the part that wears out and is replaced; the host interface controller, flash controllers, microprocessors, DRAM buffers, and firmware are still useful and reusable, and account for 35~50% of total SSD cost]

Page 13:

Use of Unboxed Flash Packages: FIMM

• Multiple NAND flash packages integrated onto a board
– Looks like a passive memory device such as a DIMM
– Referred to as a Flash Inline Memory Module (FIMM)
• Control signals and pin assignment are defined
• For convenient replacement of worn-out FIMMs
– A FIMM has a hot-swappable connector
– NV-DDR2 interface defined by ONFi

[Figure: a FIMM board populated with flash packages]

Page 14:

How Are FIMMs Connected? PCI-E

• PCI-E technology provides a high-performance interconnect
• Root complex: the component where I/O starts
• Switch: a middle-layer component
• Endpoint: where FIMMs are directly attached
• Link: the bus connecting components

[Figure: an HPC host's root complex fans out through switches to endpoints, each with several FIMMs attached]
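The fabric described above is a tree, which a short sketch can make concrete (component names are illustrative, not from the paper): the root complex fans out to switches, switches to endpoints, and endpoints to FIMMs.

```python
# Sketch of the PCI-E fabric as a tree: root complex -> switches ->
# endpoints -> FIMMs. All names below are illustrative placeholders.

fabric = {
    "root_complex": ["switch0", "switch1"],
    "switch0": ["endpoint0", "endpoint1"],
    "switch1": ["endpoint2"],
    "endpoint0": ["FIMM0", "FIMM1"],
    "endpoint1": ["FIMM2", "FIMM3"],
    "endpoint2": ["FIMM4", "FIMM5"],
}

def fimms_under(node):
    """All FIMMs reachable from a fabric component (I/O starts at the root)."""
    children = fabric.get(node, [])
    if not children:                      # a leaf of the tree is a FIMM
        return [node]
    found = []
    for child in children:
        found.extend(fimms_under(child))
    return found

print(fimms_under("switch0"))            # ['FIMM0', 'FIMM1', 'FIMM2', 'FIMM3']
print(len(fimms_under("root_complex")))  # 6
```

The tree shape is why a busy link near a switch throttles every FIMM beneath it, which is the link contention discussed earlier.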

Page 15:

Connection between FIMMs and PCI-E

• A PCI-E endpoint is where the PCI-E fabric and the FIMMs meet
– Front-end: PCI-E protocol toward the PCI-E fabric
– Back-end: ONFi NV-DDR2 interface toward the FIMMs
• An endpoint consists of three parts
– PCI-E device layers: handle the PCI-E interface
– Control logic: handles the FIMMs over the ONFi interface
– Upstream/downstream buffers: control traffic communication

Page 16:

Connection between FIMMs and PCI-E

• Communication example
– (1) A PCI-E packet arrives at the target endpoint
– (2) The PCI-E device layers disassemble the packet
– (3) The disassembled packet is enqueued into the downstream buffer
– (4) The HAL dequeues the packet and constructs a NAND flash command
• Hot-swappable connector for FIMMs
– ONFi 78-pin NV-DDR2 slot
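The four steps above can be sketched as a small downstream pipeline. This is an illustrative software model only; the field names, the `disassemble`/`hal_dequeue` helpers, and the mapping to ONFi command codes are assumptions, since the real path is hardware and firmware.

```python
# Sketch of the endpoint's downstream path, following steps (1)-(4) above.
# All names and structures are illustrative; the real logic is HW/firmware.
from collections import deque

downstream = deque()  # (3) buffer between PCI-E device layers and the HAL

def disassemble(packet):
    """(2) PCI-E device layers strip transport framing, keeping the payload."""
    return {"op": packet["op"], "addr": packet["addr"], "data": packet.get("data")}

def on_packet_arrival(packet):
    """(1) A PCI-E packet arrives at its target endpoint."""
    downstream.append(disassemble(packet))

def hal_dequeue():
    """(4) The HAL dequeues a payload and builds a NAND flash command."""
    payload = downstream.popleft()
    cmd = {"READ": "00h/30h", "WRITE": "80h/10h"}[payload["op"]]  # assumed mapping
    return (cmd, payload["addr"], payload["data"])

on_packet_arrival({"op": "READ", "addr": 0x40})
print(hal_dequeue())  # ('00h/30h', 64, None)
```

The downstream buffer decouples the PCI-E clock domain from the flash interface, which is why step (3) sits between the device layers and the HAL.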

Page 17:

Triple-A Architecture

• PCI-E allows the architect to build arbitrary configurations
• An endpoint is where FIMMs are directly attached
• Triple-A comprises a set of FIMMs connected over PCI-E
• The useful parts of SSDs are aggregated on top of the PCI-E fabric

[Figure: multi-cores and DRAMs sit above the root complexes and switches of the PCI-E fabric; each endpoint attaches four FIMMs]

Page 18:

Triple-A Architecture

• Flash control logic is also moved out of the SSD
– Address translation, garbage collection, I/O scheduling, and so on
– Autonomic I/O contention management
• The Triple-A architecture interacts with hosts or compute nodes

[Figure: a management module joins the multi-cores and DRAMs above the PCI-E fabric; hosts and compute nodes connect to the array]

Page 19:

Outline

• Motivations
• Triple-A Architecture
• Triple-A Management
• Evaluations
• Conclusions

Page 20:

Link Contention Management

(1) Hot cluster detection: I/O stalled due to link contention

[Figure: a PCI-E switch with four endpoints; the cluster of FIMMs behind one BUSY shared data bus is detected as a hot cluster]

Page 21:

Link Contention Management

(1) Hot cluster detection: I/O stalled due to link contention
(2) Cold cluster securement: clusters with a free link
(3) Autonomic data migration: from the hot cluster to the cold cluster

[Figure: data migrates from the BUSY hot cluster to an IDLE cold cluster behind the same switch]

Shadow cloning can hide the migration overheads
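The three management steps can be sketched as a simple policy loop. This is a hypothetical sketch: the stall-count threshold, the cluster state fields, and the one-block-at-a-time migration are all made up for illustration, not taken from the paper.

```python
# Sketch of the three steps above: (1) detect hot clusters by link-
# contention stalls, (2) secure cold clusters with a free link, and
# (3) migrate data hot -> cold. Threshold and fields are illustrative.

HOT_STALLS = 100  # hypothetical stall-count threshold for "hot"

def manage_link_contention(clusters):
    """clusters: {name: {'stalls': int, 'link_busy': bool, 'data': set}}"""
    hot = [c for c, s in clusters.items() if s["stalls"] >= HOT_STALLS]   # (1)
    cold = [c for c, s in clusters.items() if not s["link_busy"]]         # (2)
    migrations = []
    for h, c in zip(hot, cold):                                           # (3)
        victim = clusters[h]["data"].pop()    # move one block hot -> cold
        clusters[c]["data"].add(victim)
        migrations.append((victim, h, c))
    return migrations

clusters = {
    "A": {"stalls": 250, "link_busy": True,  "data": {"blk7"}},
    "B": {"stalls": 3,   "link_busy": False, "data": set()},
}
print(manage_link_contention(clusters))  # [('blk7', 'A', 'B')]
```

After the migration, subsequent I/Os for the moved block travel over the cold cluster's idle link instead of queuing on the hot one.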

Page 22:

Storage Contention Management

(1) Laggard detection: I/O stalled due to storage contention
(2) Autonomic data-layout reshaping for stalled I/Os in the queue

[Figure: REQ-3s pile up in the queue behind FIMM-3 while other requests are issued; the continuously busy FIMM is detected as a laggard]

Page 23:

Storage Contention Management

(1) Laggard detection: I/O stalled due to storage contention
(2) Autonomic data-layout reshaping for stalled I/Os in the queue
– Write I/O: physical data-layout reshaping (to non-laggard neighbors)
– Read I/O: shadow copying (to non-laggard neighbors) and reshaping

[Figure: stalled requests to the laggard FIMM-3 are re-targeted to non-laggard neighbors and issued]
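The write/read split above can be sketched as a re-targeting pass over the queue. This is an illustrative sketch under stated assumptions: the queue format, the `neighbors` list, and the `shadow` map (laggard → FIMM holding a shadow copy) are hypothetical names, not the paper's data structures.

```python
# Sketch of laggard handling: stalled I/Os aimed at a laggard FIMM are
# re-targeted. Writes are reshaped to a non-laggard neighbor; reads are
# served from a shadow copy when one exists. Illustrative names only.

def reshape_stalled(queue, laggard, neighbors, shadow):
    """queue: list of (op, fimm) with op in {"R", "W"}; returns new queue."""
    out = []
    for op, fimm in queue:
        if fimm != laggard:
            out.append((op, fimm))             # not stalled: issue as-is
        elif op == "W":
            out.append((op, neighbors[0]))     # write: physical reshaping
        elif fimm in shadow:
            out.append((op, shadow[fimm]))     # read: serve the shadow copy
        else:
            out.append((op, fimm))             # no copy yet: must wait
    return out

q = [("W", 3), ("R", 3), ("R", 1)]
print(reshape_stalled(q, laggard=3, neighbors=[4], shadow={3: 2}))
# [('W', 4), ('R', 2), ('R', 1)]
```

Writes are easy to redirect because new data can land anywhere; reads can only move once a shadow copy of the data exists elsewhere, which is why the slide pairs them with shadow copying.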

Page 24:

Outline

• Motivations
• Triple-A Architecture
• Triple-A Management
• Evaluations
• Conclusions

Page 25:

Experimental Setup

• Flash array network simulation model
– Captures PCI-E specific characteristics: data movement delay, switching and routing latency (PLX 3734), contention cycles
– Configures diverse system parameters
– Will be made publicly available (an open-source framework is in preparation)
• Baseline all-flash array configuration
– 4 switches x 16 endpoints x 4 FIMMs (64GB) = 16TB
– 80-cluster, 320-FIMM network evaluation
• Workloads
– Enterprise workloads (cfs, fin, hm, mds, msnfs, …)
– HPC workload (eigensolver simulated on an LBNL supercomputer)
– Micro-benchmarks (read/write, sequential/random)

Page 26:

Latency Improvement

• Triple-A latency normalized to a non-autonomic all-flash array
• Real-world workloads: enterprise and HPC I/O traces
• On average, 5x shorter latency
• Certain workloads (cfs and web) generate no hot clusters

Page 27:

Throughput Improvement

• Triple-A IOPS normalized to a non-autonomic all-flash array, for system throughput
• On average, 6x higher IOPS
• Certain workloads (cfs and web) generate no hot clusters
• Triple-A boosts the storage system by resolving contentions

Page 28:

Queue Stall Time Decrease

• Queue stall time comes from the two resource contentions
• On average, stall time is shortened by 81%
• According to our analysis, Triple-A dramatically decreases link-contention time
• msnfs shows a low I/O ratio on hot clusters

Page 29:

Network Size Sensitivity

• The number of clusters (endpoints) is increased
• Execution time is broken down into stall times and storage latency
• Triple-A shows better performance on larger networks
– PCI-E component stall times are effectively reduced
– FIMM latency is outside Triple-A's concern

[Figure legend: non-autonomic array vs. Triple-A]

Page 30:

Related Works (1)
• Market products (SSD arrays)
– [Pure Storage] one large pool storage system with 100% NAND-flash-based SSDs
– [Texas Memory Systems] 2D flash-RAID
– [Violin Memory] flash memory array of thousands of flash cells
• Academic studies (SSD arrays)
– [A. M. Caulfield, ISCA'13] proposed an SSD-based storage area network (QuickSAN) by integrating the network adapter into SSDs
– [A. Akel, HotStorage'11] proposed a prototype of a PCM-based storage array (Onyx)
– [A. M. Caulfield, MICRO'10] proposed a high-performance storage array architecture for emerging non-volatile memories

Page 31:

Related Works (2)
• Academic studies (SSD RAID)
– [M. Balakrishnan, TS'10] proposed SSD-optimized RAID for better reliability by creating age disparities within arrays
– [S. Moon, HotStorage'13] investigated the effectiveness of SSD-based RAID and discussed its reliability potential
• Academic studies (NVM usage for HPC)
– [A. M. Caulfield, ASPLOS'09] exploited flash memory in clusters for performance and power consumption
– [A. M. Caulfield, SC'10] explored the impact of NVMs on HPC

Page 32:

Conclusions
• Challenge: SSD arrays may not be suitable for high-performance storage
• Our goal: propose a new high-performance storage architecture
• Observations
– High maintenance cost: caused by worn-out flash-SSD replacements
– Performance degradation: caused by shared-resource contention
• Key ideas
– Cost reduction: take the bare NAND flash out of the SSD box
– Contention resolution: distribute the excessive I/O that generates bottlenecks
• Triple-A: a new architecture suitable for HPC storage
– Consists of non-SSD bare flash memories
– Automatically detects and resolves performance bottlenecks
• Results: non-SSD all-flash arrays are expected to save 35~50% of cost and offer 53% higher throughput than traditional SSD arrays

Page 33:

Backup

Page 34:

Data Migration Overhead

• A data migration comprises a series of steps:
– (1) Data read from the source FIMM
– (2) Data moved to the parental switch
– (3) Data moved to the target endpoint
– (4) Data written to the target FIMM
• Naïve data migration shares the all-flash array's resources with normal I/O requests
– I/O latency is delayed due to resource contention

[Figure: naïve migration]
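The cost of a naïve migration can be sketched by summing the four steps and charging the shared-link traffic to concurrent normal I/Os. The per-step latencies and the interference model below are made-up illustrative numbers, not measurements from the paper.

```python
# Sketch of the four migration steps above and the interference a naïve
# migration causes when it shares links with normal I/O.
# Step costs are made-up illustrative time units.

STEP_COST = {
    "read_source_fimm": 4,   # (1) read from the source FIMM
    "move_to_switch": 1,     # (2) move to the parental switch
    "move_to_endpoint": 1,   # (3) move to the target endpoint
    "write_target_fimm": 8,  # (4) write to the target FIMM
}

def naive_migration_delay(normal_ios_in_flight):
    """Total migration time, plus latency added to concurrent normal I/Os
    that share the fabric links used by steps (2) and (3)."""
    migration_time = sum(STEP_COST.values())
    added_io_latency = normal_ios_in_flight * (
        STEP_COST["move_to_switch"] + STEP_COST["move_to_endpoint"]
    )
    return migration_time, added_io_latency

print(naive_migration_delay(5))  # (14, 10)
```

The second return value is the point of the slide: the migration itself is not free, and under naïve scheduling its traffic also taxes every normal I/O sharing the fabric.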

Page 35:

Data Migration Overhead

• The data read of a migration (the first step) seriously hurts system performance
• Shadow cloning overlaps a normal read I/O request with the migration's data read
• Shadow cloning successfully hides the data migration overhead and minimizes system performance degradation

[Figure: naïve migration vs. shadow cloning]
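The overlap idea can be reduced to one decision: if a normal read is already fetching the block to be migrated, clone the data from that read instead of issuing a separate migration read. The function and cost below are an illustrative sketch, not the paper's implementation.

```python
# Sketch of shadow cloning: the migration's source read piggybacks on a
# normal read that already fetches the same block, instead of issuing a
# separate read that contends for the FIMM. Illustrative cost model only.

READ_COST = 4  # made-up time units per FIMM read

def migration_read_cost(pending_reads, migrate_block, shadow_cloning):
    """Extra FIMM-read time the migration adds on top of normal I/O."""
    if shadow_cloning and migrate_block in pending_reads:
        return 0          # clone the data as the normal read returns it
    return READ_COST      # naive: a dedicated migration read contends

pending = {"blk7", "blk9"}
print(migration_read_cost(pending, "blk7", shadow_cloning=True))   # 0
print(migration_read_cost(pending, "blk7", shadow_cloning=False))  # 4
```

When the workload is hot enough to trigger a migration, it is also likely to be reading the migrating blocks anyway, which is why the piggyback opportunity arises often.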

Page 36:

Real Workload Latency (1)

• CDF of workload latency for the non-autonomic all-flash array and Triple-A
• Triple-A significantly improves I/O request latency
• Relatively low latency improvement in msnfs
– The ratio of I/O requests heading to hot clusters is not very high
– Hot clusters are detected, but they are not that hot (less hot)

[Figure panels: proj, msnfs]

Page 37:

Real Workload Latency (2)

• prxy experiences a great latency improvement with Triple-A
• websql did not benefit as much as expected
– It has more and hotter clusters than prxy
– But all of its hot clusters are located under the same switch
• In addition to (1) hotness, (2) the balance of I/O requests among switches determines the effectiveness of Triple-A

[Figure panels: prxy, websql]

Page 38:

Network Size Sensitivity

• Triple-A successfully reduces both contention times
– By distributing the extra load of hot clusters
– Via data migration and physical data reshaping
• Link contention time is almost completely eliminated
• Storage contention time is steadily reduced
– It is bounded by the number of I/O requests to target clusters

[Figure: results normalized to non-autonomic all-flash arrays]

Page 39:
Page 40:

Why Is Latency Delayed? Storage Contention

• Regardless of the array's condition, individual SSDs can be busy or idle (ready to serve a new I/O)
• When the SSD an I/O is destined to is ready, the I/O gets serviced right away
• When the SSD an I/O is destined to is busy, the I/O must wait

[Figure: Dest-3 hits a READY SSD and proceeds; Dest-8 hits a BUSY SSD and waits]