An Active and Hybrid Storage System for Data-intensive Applications

Ph.D. Candidate: Zhiyang Ding
Defense Committee Members: Dr. Xiao Qin, Dr. Kai H. Chang, Dr. David A. Umphress
University Reader: Prof. Wei Wang, Chair of the Art Design Dept.

Upload: xiao-qin

Post on 19-Nov-2014

DESCRIPTION

As large-scale, data-intensive applications become widely deployed, demand is growing for high-performance storage systems to support them. Compared with traditional storage systems, next-generation systems will embrace dedicated processors to reduce the computational load on host machines and will combine different types of storage devices. We present a new active storage architecture that leverages the computational power of a dedicated processor, and show how it exploits a multicore processor to offload computation from the host machine. We then address the challenge of making an active storage node cooperate with the other nodes in a cluster by designing a pipeline-parallel processing pattern, and we report the effectiveness of this mechanism. To evaluate the design, we extend an open-source bioinformatics application based on the pipeline-parallel mechanism. We also explore a hybrid configuration of storage devices within the active storage node. Flash-memory-based solid state disks have come to play a critical role in revolutionizing storage. However, rather than simply replacing traditional magnetic hard disks with solid state disks, researchers believe that finding a complementary approach that incorporates both is more challenging and more attractive. We therefore propose a hybrid combination of different types of disk drives for our active storage system. A simulator is designed and implemented to verify the new configuration. In summary, this dissertation explores active storage, an emerging storage configuration, in terms of its architecture and design, its parallel processing capability, its cooperation with other machines in a cluster computing environment, and a new disk configuration: a hybrid combination of different types of disk drives.

TRANSCRIPT

Page 1: An Active and Hybrid Storage System for Data-intensive Applications

04/08/2023

An Active and Hybrid Storage System for Data-intensive Applications

Ph.D. Candidate: Zhiyang Ding

Defense Committee Members: Dr. Xiao Qin, Dr. Kai H. Chang, Dr. David A. Umphress

University Reader: Prof. Wei Wang, Chair of the Art Design Dept.

Page 2: An Active and Hybrid Storage System for Data-intensive Applications

Cluster Computing

• Large-scale data processing is everywhere.

Page 3: An Active and Hybrid Storage System for Data-intensive Applications

Motivation

• Traditional Storage Nodes on the Cluster

[Figure: cluster architecture. Labels: Client, Internet, Network switch, Head Node, Compute Nodes, Storage Node (or Storage Area Network).]

Page 4: An Active and Hybrid Storage System for Data-intensive Applications

Motivation

• What’s next?
• More “Active”.

[Figure: cluster with an active storage node. Labels: Client, Internet, Network switch, Head Node, Compute Nodes, Storage Node. Flow labels: I/O Request, Computation Offload, Raw Data, Pre-processed Data.]

Page 5: An Active and Hybrid Storage System for Data-intensive Applications

About the Active Storage

Storage Node
• pp-mpiBlast: How to deploy Active Storage?
• McSD: A Smart Disk Model
• HcDD: Hybrid Disk for Active Storage

Page 6: An Active and Hybrid Storage System for Data-intensive Applications

McSD: A Multicore Active Storage Device

• I/O Wall Problem: CPU-I/O Gap
  – Limited I/O Bandwidth
  – CPU Waiting and Dissipating Power
• How to
  – Bridge the CPU-I/O Gap
  – Reduce I/O Traffic

Page 7: An Active and Hybrid Storage System for Data-intensive Applications

Why McSD?

• “Active”:
  – Leveraging the Processing Power of Storage Devices
• Benefits:
  – Offloading Data-intensive Computation
  – Reducing I/O Traffic
  – Pipeline Parallel Programming

Page 8: An Active and Hybrid Storage System for Data-intensive Applications

Contributions

• Design a prototype of a multicore active storage
• Design a pre-assembled processing module
• Extend a shared-memory MapReduce system
• Emulate the whole system on a real testbed

Page 9: An Active and Hybrid Storage System for Data-intensive Applications

Background: Active Disks

• Traditional Smart/Active Disks
  – On-board: Embedding a processor into the hard disk
  – Various Research Models, e.g., active disk, smart disk, IDISK, SmartSTOR, etc.
• However, “active disk” has not been adopted by hardware vendors
  – Improved attachment technologies
  – I/O-bound Workloads
  – Cost of the System
  – Reliability

Page 10: An Active and Hybrid Storage System for Data-intensive Applications

Background: Parallel Processing

• Multi-core Processors or Multi-processors
  – A 45% increase in transistors yields a 20% increase in processing power
• MapReduce: a Parallel Programming Model
  – MapReduce by Google
  – Hadoop, Mars, Phoenix, etc.
• Multicore and Shared-memory Parallel Processing

Page 11: An Active and Hybrid Storage System for Data-intensive Applications

Design: System Overview

[Figure: design of an active storage. Components: Multicore and Shared-memory Parallel Processing, Communication Mechanism, Hybrid Storage Disks, Pipeline Parallel Processing.]

Page 12: An Active and Hybrid Storage System for Data-intensive Applications

Design and Implementation

• Computation Mechanism
  – Pre-assembled Processing Model
  – smartFAM
• Extend the Shared-Memory MapReduce by Partitioning

Page 13: An Active and Hybrid Storage System for Data-intensive Applications

Pre-assembled Processing Modules

• Pre-assembled Processing Modules
  – Meet the nature of embedded services
  – Reduce Complexity and Cost
  – Provide Services, e.g., multi-version antivirus service, pre-processing for data-intensive apps, de-duplication, etc.
• How to invoke services?

Page 14: An Active and Hybrid Storage System for Data-intensive Applications

smartFAM

• smartFAM = Smart File Alteration Monitor
  – Invokes the pre-assembled processing modules or functions by monitoring changes to the system log file.
• Two Components:
  – an inotify function: a Linux system function
  – a trigger daemon
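The monitor-and-trigger idea can be illustrated with a minimal Python sketch. It is not the smartFAM implementation: it polls the log file for appended lines instead of using inotify (which is not in the Python standard library), and the log format ("module-name argument"), the `LogMonitor` class, and the `MODULES` registry are all assumptions made for illustration.

```python
import os
import tempfile


def word_count(arg):
    """Hypothetical pre-assembled module; a real one would start the job."""
    return f"wordcount started on {arg}"


# Assumed registry: module name written by the host -> callable to trigger.
MODULES = {"wordcount": word_count}


class LogMonitor:
    """Polls a log file and triggers a module for each newly appended line."""

    def __init__(self, path, modules):
        self.path = path
        self.modules = modules
        self.offset = 0  # number of bytes of the log already processed

    def poll(self):
        with open(self.path, "rb") as log:
            log.seek(self.offset)
            data = log.read()
        # Consume complete lines only; a partial last line waits for the next poll.
        end = data.rfind(b"\n") + 1
        self.offset += end
        results = []
        for raw in data[:end].splitlines():
            parts = raw.decode("utf-8").split(maxsplit=1)
            if not parts:
                continue
            handler = self.modules.get(parts[0])
            if handler is not None:  # unknown module names are ignored
                results.append(handler(parts[1] if len(parts) > 1 else ""))
        return results


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "smartfam.log")
        open(path, "w").close()
        monitor = LogMonitor(path, MODULES)
        first = monitor.poll()  # nothing written yet
        with open(path, "a") as log:
            log.write("wordcount /data/input.txt\n")
            log.write("unknown-module x\n")
        second = monitor.poll()  # one known request is dispatched
        third = monitor.poll()  # already-processed lines are not replayed
        return first, second, third
```

In a daemon, `poll()` would run in a loop or be driven by a file-change notification; tracking the byte offset keeps each request from being triggered twice.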

Page 15: An Active and Hybrid Storage System for Data-intensive Applications

Design and Implementation

[Figure: design and implementation workflow, with steps numbered 1 to 3.]

Page 16: An Active and Hybrid Storage System for Data-intensive Applications

Extend the Phoenix: A Shared-memory MapReduce Model

• Extend the Phoenix MapReduce Programming Model by partitioning and merging
  – New API: partition_input
  – New Functions:
    • partition (provided by the new API)
    • merge (developed by the user)
• Example:
  – wordcount [data-file][partition-size][]
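The partition-then-merge pattern can be sketched in Python for the word-count case. This is an illustration of the idea only: Phoenix is a C library, and the Python signature of `partition_input`, the whitespace-aligned cutting rule, and the `Counter`-based merge are assumptions, not the extended Phoenix API.

```python
from collections import Counter


def partition_input(text, partition_size):
    """Split text into pieces of about partition_size characters.

    Each cut is pushed forward to the next whitespace so that no word is
    split across two partitions.
    """
    partitions = []
    start, n = 0, len(text)
    while start < n:
        end = min(start + partition_size, n)
        while end < n and not text[end].isspace():
            end += 1
        partitions.append(text[start:end])
        start = end
    return partitions


def word_count(partition):
    """Per-partition job: stands in for one MapReduce run over one partition."""
    return Counter(partition.split())


def merge(partial_counts):
    """User-supplied merge: combine the per-partition results."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


text = "a bb a ccc bb a"
parts = partition_input(text, 4)
result = merge(word_count(p) for p in parts)
```

Because each partition is processed independently, only one partition's working set has to fit in memory at a time; the merge step restores the result a single run over the whole file would give.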

Page 17: An Active and Hybrid Storage System for Data-intensive Applications

Pipeline Processing

Page 18: An Active and Hybrid Storage System for Data-intensive Applications

Evaluation Environment

• Testbed
• Benchmarks
  – Word Count
  – String Match
  – Matrix Multiplication
• Individual Node Performance
• System Performance

Page 19: An Active and Hybrid Storage System for Data-intensive Applications

Individual Node Performance

                 Word Count (seconds)     String Match (seconds)
                 1 GB       1.25 GB       1 GB       1.25 GB
w/ Partition     40.60      50.91         17.76      20.61
w/o Partition    85.74      139.54        17.62      21.00
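The effect of partitioning can be read off these timings as the ratio of the run without partitioning to the run with it. A small Python sketch of that arithmetic, using only the numbers reported on this slide (the names `TIMES` and `speedup` are illustrative):

```python
# (benchmark, input size) -> (seconds with partition, seconds without partition)
TIMES = {
    ("Word Count", "1 GB"): (40.60, 85.74),
    ("Word Count", "1.25 GB"): (50.91, 139.54),
    ("String Match", "1 GB"): (17.76, 17.62),
    ("String Match", "1.25 GB"): (20.61, 21.00),
}


def speedup(with_partition, without_partition):
    """Speedup of the partitioned run relative to the unpartitioned run."""
    return without_partition / with_partition


SPEEDUPS = {key: round(speedup(*pair), 2) for key, pair in TIMES.items()}
```

By this measure, partitioning speeds up Word Count by about 2.1x at 1 GB and 2.7x at 1.25 GB, while String Match is essentially unchanged (about 0.99x and 1.02x).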

Page 20: An Active and Hybrid Storage System for Data-intensive Applications

System Evaluation

Matrix-Multiplication and Word-Count (Speedups)

Input Data Size    vs Single Machine    vs Single-core Active    vs McSD w/o Partition
500 MB             1.47x                2.15x                    0.99x
750 MB             1.45x                2.09x                    1.04x
1 GB               7.62x                2.14x                    6.07x
1.25 GB            19.01x               2.50x                    15.39x

Page 21: An Active and Hybrid Storage System for Data-intensive Applications

Summary

• McSD can improve system performance by offloading data-intensive computation
• McSD is a promising active storage model with
  – Pre-assembled processing modules
  – Parallel data processing
  – Better performance in the evaluation

Page 22: An Active and Hybrid Storage System for Data-intensive Applications

About the Active Storage

Storage Node
• pp-mpiBlast: How to deploy Active Storage?
• McSD: A Smart Disk Model
• HcDD: Hybrid Disk for Active Storage

Page 23: An Active and Hybrid Storage System for Data-intensive Applications

Apply Active Storages to a Cluster

• So far, we know the potential of Active Storages
• Challenge: How to coordinate active storage nodes with computing nodes?
• Propose a Pipeline-parallel Processing pattern

Page 24: An Active and Hybrid Storage System for Data-intensive Applications

Contributions

• Propose a pipeline-parallel processing framework to “connect” an Active Storage node with computing nodes.
• Evaluate the framework using both an analytic model and a real implementation.
• Case Study: Extend an existing bioinformatics application based on the framework.

Page 25: An Active and Hybrid Storage System for Data-intensive Applications

Background: Active Storage

[Figure: an active storage node. Labels: Processor, Memory, SSD, Buff Disks, Mass Storage, Computation, Bridge?]

Page 26: An Active and Hybrid Storage System for Data-intensive Applications

Background: Bioinformatics App

• BLAST*: Basic Local Alignment Search Tool
  – Comparing primary biological sequence information
• mpiBLAST** is a freely available, open-source, parallel implementation of NCBI BLAST.
  – Format raw data files
  – Run a parallel BLAST function

* http://blast.ncbi.nlm.nih.gov/
** http://www.mpiblast.org/

Page 27: An Active and Hybrid Storage System for Data-intensive Applications

Pipeline-parallel Design

• Offload the raw-data formatting task to where the data is stored.
• Intra-application Pipeline-parallel Processing by “partition” and “merge”.
• pp-mpiBlast, a case study.

Page 28: An Active and Hybrid Storage System for Data-intensive Applications

Pipelining Workflow

[Figure: pipelining workflow across the Active Storage Node and the Computing Nodes. Stages: Partition, FormatDB, mpiBlast, Merge. Data: Raw Input File; Partitions 1 to n; Intermediates 1 to n; Sub-outputs 1 to n; Output File. Timing labels: 1, (n-1) times, n.]
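The overlap between the formatting stage and the search stage can be sketched as a two-stage producer/consumer pipeline in Python. This is a toy model under stated assumptions: `format_db` and `search` are hypothetical stand-ins (sorting and substring matching), not the real FormatDB or mpiBLAST tools, and one thread per stage stands in for the active storage node and the computing nodes.

```python
import queue
import threading


def format_db(partition):
    """Stand-in for the formatting stage run on the active storage node."""
    return sorted(partition)


def search(intermediate, query):
    """Stand-in for the search stage run on the computing nodes."""
    return [seq for seq in intermediate if query in seq]


def run_pipeline(partitions, query):
    handoff = queue.Queue()  # carries formatted partitions between stages
    sub_outputs = []

    def storage_stage():
        for partition in partitions:
            handoff.put(format_db(partition))
        handoff.put(None)  # sentinel: no more partitions

    def compute_stage():
        while True:
            intermediate = handoff.get()
            if intermediate is None:
                break
            sub_outputs.append(search(intermediate, query))

    producer = threading.Thread(target=storage_stage)
    consumer = threading.Thread(target=compute_stage)
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()

    # Merge: concatenate the sub-outputs in partition order.
    merged = []
    for sub_output in sub_outputs:
        merged.extend(sub_output)
    return merged


partitions = [["GATTACA", "CCGT"], ["ACGT", "TTGA"], ["GGAT", "ATTA"]]
hits = run_pipeline(partitions, "ATT")
```

While the consumer searches partition i, the producer is free to format partition i+1, so with n partitions the two stages can overlap for n-1 steps; only the first format and the last search run alone.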

Page 29: An Active and Hybrid Storage System for Data-intensive Applications

Analytic Model

• Three Critical Measures

Page 30: An Active and Hybrid Storage System for Data-intensive Applications

Evaluation Environment

           Computing Nodes Configuration    Active Storage Configuration
CPU        Intel Xeon X3430                 Intel Core 2 Q9400
Memory     2 GB DDR3 (PC3-10600)
OS         Ubuntu 9.04 Jaunty Jackalope, 32-bit version
Kernel     2.6.28-15-generic
Network    Gigabit LAN

Our Testbed               Comparison Testbeds
“Pipeline-parallel”       “12-node Cluster”      “13-node Cluster”
12 Computing Nodes        12 Computing Nodes     13 Computing Nodes
1 Active Storage Node     1 Storage Node         1 Storage Node

Page 31: An Active and Hybrid Storage System for Data-intensive Applications

Pipeline-parallel Design

Results: Compared With 12-node System

Results: Compared With 13-node System

Page 32: An Active and Hybrid Storage System for Data-intensive Applications

Speedups Trends: Partition Size

Page 33: An Active and Hybrid Storage System for Data-intensive Applications

Summary

• We proposed a pipeline-parallel processing mechanism for deploying an Active Storage Node alongside computing nodes.
• As a case study, we extended a classic bioinformatics application based on the pipeline-parallel style.

Page 34: An Active and Hybrid Storage System for Data-intensive Applications

About the Active Storage

Storage Node
• pp-mpiBlast: How to deploy Active Storage?
• McSD: A Smart Disk Model
• HcDD: Hybrid Disk for Active Storage

Page 35: An Active and Hybrid Storage System for Data-intensive Applications

What’s Hybrid?

A Hybrid Combination of a Gas Engine and an Electric Motor

Power Efficiency

Page 36: An Active and Hybrid Storage System for Data-intensive Applications

Hybrid Disk Drives

• A Hybrid Combination of Two Types of Storage Devices: HDD and SSD
  – HDD: Magnetic Hard Disk
  – SSD: Solid State Disk, built from NAND-based flash memory.

What are their roles?

Page 37: An Active and Hybrid Storage System for Data-intensive Applications

Motivation

• However, SSDs suffer from reliability issues.
• In a hybrid storage system, using SSDs as the buffer can boost performance.

Page 38: An Active and Hybrid Storage System for Data-intensive Applications

Limitations Related to SSDs

• Flash Memory:
  – Each block consists of 32, 64, or 128 pages.
  – Each page is typically 512, 2,048, or 4,096 bytes.
• “Erase-before-write” at block level.
• Lifespan is 10,000 Program/Erase cycles.
  – E.g., an 80 GB MLC SSD lasts only 106 days if the write rate is 30 MB/s.*
• Rethink their roles?

* Based on the SSD lifespan calculator provided by Virident.com
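To see where a lifespan figure of this order comes from, the Python sketch below computes an idealized bound from the numbers on this slide. It is not the Virident calculator: it assumes perfect wear leveling, decimal units (1 GB = 1000 MB), and a `write_amplification` parameter that is introduced here purely for illustration.

```python
SECONDS_PER_DAY = 86_400


def lifespan_days(capacity_gb, pe_cycles, write_rate_mb_s, write_amplification=1.0):
    """Idealized SSD lifespan in days under continuous writing.

    Assumes every cell can be programmed pe_cycles times and that writes are
    spread evenly over the whole device (perfect wear leveling).
    """
    total_writable_mb = capacity_gb * 1000 * pe_cycles
    effective_rate_mb_s = write_rate_mb_s * write_amplification
    return total_writable_mb / effective_rate_mb_s / SECONDS_PER_DAY


ideal = lifespan_days(80, 10_000, 30)
amplified = lifespan_days(80, 10_000, 30, write_amplification=2.9)
```

With no write amplification this bound is about 309 days. The 106-day figure quoted on the slide is lower, which would be consistent with an effective write amplification of roughly 2.9, although the calculator's actual assumptions are not stated here.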

Page 39: An Active and Hybrid Storage System for Data-intensive Applications

Contributions

• Hybrid Combination of HDDs and SSDs
• De-duplication Service using HDDs as a Write Buffer
• Internal-parallel Processing in SSD
• Simulation of the Whole System for Evaluation

Page 40: An Active and Hybrid Storage System for Data-intensive Applications

Hybrid Disk Configuration

[Figure: hybrid disk configuration. Components: HDD, SSD, Dedicated Processor (De-duplication, Pre-processing). Flow labels: I/O Requests, Read Requests, Data of Write Requests, Deduplicated data, Pre-processed Data.]
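The de-duplication step in this configuration can be illustrated with a generic content-hash scheme in Python. This is a sketch under assumptions, not the HcDD design: fixed-size chunking, SHA-256 fingerprints, and the `DedupStore` class are all illustrative choices, and the in-memory dictionary merely stands in for the SSD that receives de-duplicated data.

```python
import hashlib


class DedupStore:
    """Stores each distinct chunk once and keeps a per-file list of fingerprints."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # fingerprint -> chunk bytes (stands in for the SSD)
        self.files = {}   # file name -> ordered list of fingerprints

    def write(self, name, data):
        """De-duplicate buffered write data; return the bytes actually stored."""
        recipe = []
        new_bytes = 0
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if fingerprint not in self.chunks:
                self.chunks[fingerprint] = chunk
                new_bytes += len(chunk)
            recipe.append(fingerprint)
        self.files[name] = recipe
        return new_bytes

    def read(self, name):
        """Rebuild a file from its chunk fingerprints."""
        return b"".join(self.chunks[fp] for fp in self.files[name])


store = DedupStore(chunk_size=4)
written_a = store.write("a", b"AAAABBBBAAAA")  # the third chunk repeats the first
written_b = store.write("b", b"BBBBCCCC")      # only the CCCC chunk is new
```

Every duplicate chunk filtered out before it reaches flash is a write the SSD never performs, which is the link between de-duplication and the limited Program/Erase budget discussed earlier.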

Page 41: An Active and Hybrid Storage System for Data-intensive Applications

HcDD Architecture

Page 42: An Active and Hybrid Storage System for Data-intensive Applications

Deduplication Design

Page 43: An Active and Hybrid Storage System for Data-intensive Applications

Internal Parallel Processing

Page 44: An Active and Hybrid Storage System for Data-intensive Applications

Evaluation

Page 45: An Active and Hybrid Storage System for Data-intensive Applications

Internal Parallelism Evaluation: Single Node

Page 46: An Active and Hybrid Storage System for Data-intensive Applications

Single Node: Dedup Ratio

Page 47: An Active and Hybrid Storage System for Data-intensive Applications

System Performance Evaluation

Page 48: An Active and Hybrid Storage System for Data-intensive Applications

System Performance Evaluation

Page 49: An Active and Hybrid Storage System for Data-intensive Applications

Summary

Page 50: An Active and Hybrid Storage System for Data-intensive Applications

Conclusion

Storage Node
• pp-mpiBlast: How to deploy Active Storage?
• McSD: A Smart Disk Model
• HcDD: Hybrid Disk for Active Storage

Page 51: An Active and Hybrid Storage System for Data-intensive Applications

Future Work

Page 52: An Active and Hybrid Storage System for Data-intensive Applications

Many Thanks! And Questions?