Transcript
Page 1: An Active and Hybrid Storage System for Data-intensive Applications

04/08/2023

An Active and Hybrid Storage System for Data-intensive Applications

Ph.D. Candidate: Zhiyang Ding

Defense Committee Members: Dr. Xiao Qin, Dr. Kai H. Chang, Dr. David A. Umphress
University Reader: Prof. Wei Wang, Chair of the Art Design Dept.

Page 2: An Active and Hybrid Storage System for Data-intensive Applications

Cluster Computing

• Large-scale Data Processing is everywhere.

Page 3: An Active and Hybrid Storage System for Data-intensive Applications

Motivation

• Traditional Storage Nodes on the Cluster

[Diagram labels: Client, Internet, Head Node, Network switch, Compute Nodes, Storage Node (or Storage Area Network)]

Page 4: An Active and Hybrid Storage System for Data-intensive Applications

Motivation

• What’s next?
• More “Active”.

[Diagram labels: Client, Internet, Head Node, Network switch, Compute Nodes, Storage Node, Computation Offload, I/O Request, Raw Data, Pre-processed Data]

Page 5: An Active and Hybrid Storage System for Data-intensive Applications

About the Active Storage

pp-mpiBlast: How to deploy Active Storage?

McSD: A Smart Disk Model

HcDD: Hybrid Disk for Active Storage

Storage Node

Page 6: An Active and Hybrid Storage System for Data-intensive Applications

McSD: A Multicore Active Storage Device

• I/O Wall Problem: CPU-I/O Gap
  – Limited I/O Bandwidth
  – CPU Waiting and Dissipating Power

• How to
  – Bridge the CPU-I/O Gap
  – Reduce I/O Traffic

Page 7: An Active and Hybrid Storage System for Data-intensive Applications

Why McSD?

• “Active”:
  – Leveraging the Processing Power of Storage Devices

• Benefits:
  – Offloading Data-intensive Computation
  – Reducing I/O Traffic
  – Pipeline Parallel Programming

Page 8: An Active and Hybrid Storage System for Data-intensive Applications

Contributions

• Design a prototype of a multicore active storage

• Design a pre-assembled processing module

• Extend a shared-memory MapReduce system

• Emulate the whole system on a real testbed

Page 9: An Active and Hybrid Storage System for Data-intensive Applications

Background: Active Disks

• Traditional Smart/Active Disks
  – On-board: Embedding a processor into the hard disk
  – Various Research Models
    • e.g., active disk, smart disk, IDISK, SmartSTOR, etc.

• However, the “active disk” has not been adopted by hardware vendors

Improved attachment technologies

I/O Bound Workloads

Cost of the System

Reliability

Page 10: An Active and Hybrid Storage System for Data-intensive Applications

Background: Parallel Processing

• Multi-core Processors or Multi-processors
  – A 45% increase in transistors yields a 20% increase in processing power

• MapReduce: a Parallel Programming Model
  – MapReduce by Google
  – Hadoop, Mars, Phoenix, etc.

• Multicore and Shared-memory Parallel Processing

Page 11: An Active and Hybrid Storage System for Data-intensive Applications

Design: System Overview

[Diagram: Design of an Active Storage. Labels: Multicore and Shared-memory Parallel Processing, Communication Mechanism, Hybrid Storage Disks, Pipeline Parallel Processing]

Page 12: An Active and Hybrid Storage System for Data-intensive Applications

Design and Implementation

• Computation Mechanism
  – Pre-assembled Processing Model
  – smartFAM

• Extend the Shared-Memory MapReduce by Partitioning

Page 13: An Active and Hybrid Storage System for Data-intensive Applications

Pre-assembled Processing Modules

• Pre-assembled Processing Modules
  – Meet the nature of embedded services
  – Reduce Complexity and Cost
  – Provide Services
    • E.g., multi-version antivirus service, pre-processing of data-intensive apps, de-duplication, etc.

• How to invoke services?

Page 14: An Active and Hybrid Storage System for Data-intensive Applications

smartFAM

• smartFAM = Smart File Alteration Monitor
  – Invokes the pre-assembled processing modules or functions by monitoring changes to the system log file.

• Two Components:
  – an inotify function: a Linux system function
  – a trigger daemon
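The trigger idea can be sketched as follows. This is a minimal illustration, not smartFAM's actual protocol: it polls a log file for newly appended lines instead of calling Linux inotify (Python's standard library has no inotify binding), and the `MODULES` registry, the `poll_log_once` function, and the `<module> <argument>` line format are all assumptions made for the example.

```python
import os
import tempfile

# Hypothetical registry of pre-assembled modules: name -> callable.
MODULES = {
    "wordcount": lambda arg: f"wordcount ran on {arg}",
    "dedup": lambda arg: f"dedup ran on {arg}",
}


def poll_log_once(log_path, offset):
    """Invoke the module named on each complete line appended after `offset`.

    Returns (new_offset, results). A real daemon would block on inotify
    events instead of being called from a polling loop.
    """
    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()
    # Consume complete lines only; a partial last line waits for the next poll.
    end = data.rfind(b"\n") + 1
    results = []
    for raw in data[:end].splitlines():
        parts = raw.decode("utf-8").split(maxsplit=1)
        if len(parts) == 2 and parts[0] in MODULES:
            results.append(MODULES[parts[0]](parts[1]))
    return offset + end, results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "requests.log")
        with open(path, "wb") as log:
            log.write(b"wordcount /data/input.txt\n")
        print(poll_log_once(path, 0))
```

Tracking a byte offset means each request is handled once, and unknown module names are ignored rather than raising.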

Page 15: An Active and Hybrid Storage System for Data-intensive Applications

Design and Implementation

[Figure with numbered steps 1, 2, and 3]

Page 16: An Active and Hybrid Storage System for Data-intensive Applications

Extend the Phoenix: A Shared-memory MapReduce Model

• Extend the Phoenix MapReduce Programming Model by partitioning and merging
  – New API: partition_input
  – New Functions:
    • partition (provided by the new API)
    • merge (developed by the user)

• Example:
  – wordcount [data-file][partition-size][]
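The partition-and-merge pattern can be sketched in Python. This is not the Phoenix C API: `partition_input` borrows the name from the slide, but its signature and the whitespace-aligned splitting rule are assumptions, and `wordcount` and `merge` are illustrative stand-ins.

```python
from collections import Counter


def partition_input(text, partition_size):
    """Yield chunks of roughly `partition_size` characters.

    Each chunk is extended to the next whitespace so no word is split.
    """
    start, n = 0, len(text)
    while start < n:
        end = min(start + partition_size, n)
        while end < n and not text[end].isspace():
            end += 1
        yield text[start:end]
        start = end


def wordcount(chunk):
    """Per-partition job: count words in one chunk."""
    return Counter(chunk.split())


def merge(partials):
    """User-supplied merge: combine the per-partition counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    text = "a bb a ccc bb a"
    print(merge(wordcount(chunk) for chunk in partition_input(text, 4)))
```

Processing one partition at a time bounds the working set, which is the point of partitioning when the input is large relative to memory.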

Page 17: An Active and Hybrid Storage System for Data-intensive Applications


Pipeline Processing

Page 18: An Active and Hybrid Storage System for Data-intensive Applications

Evaluation Environment

• Testbed

• Benchmarks
  – Word Count
  – String Match
  – Matrix Multiplication

• Individual Node Performance
• System Performance

Page 19: An Active and Hybrid Storage System for Data-intensive Applications

Individual Node Performance

                 Word Count (seconds)     String Match (seconds)
                 1 GB       1.25 GB       1 GB       1.25 GB
w/ Partition     40.60      50.91         17.76      20.61
w/o Partition    85.74      139.54        17.62      21.00
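The effect of partitioning is easier to read as a ratio. The short sketch below derives speedups (time without partitioning divided by time with partitioning) from the figures on this slide; the dictionary layout and the function name are illustrative.

```python
# Elapsed seconds from the slide: (with partition, without partition).
TIMES = {
    ("Word Count", "1 GB"): (40.60, 85.74),
    ("Word Count", "1.25 GB"): (50.91, 139.54),
    ("String Match", "1 GB"): (17.76, 17.62),
    ("String Match", "1.25 GB"): (20.61, 21.00),
}


def partition_speedups(times):
    """Return speedup = time without partition / time with partition."""
    return {key: round(without / with_, 2)
            for key, (with_, without) in times.items()}


if __name__ == "__main__":
    for (bench, size), speedup in partition_speedups(TIMES).items():
        print(f"{bench:12s} {size:8s} {speedup:.2f}x")
```

Word Count gains about 2.11x at 1 GB and 2.74x at 1.25 GB, while String Match stays within about 2% of its unpartitioned time.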

Page 20: An Active and Hybrid Storage System for Data-intensive Applications

System Evaluation

Matrix-Multiplication and Word-Count (Speedups)

Input Data Size    vs Single Machine    vs Single-core Active    vs McSD w/o Partition
500 MB             1.47 X               2.15 X                   0.99 X
750 MB             1.45 X               2.09 X                   1.04 X
1 GB               7.62 X               2.14 X                   6.07 X
1.25 GB            19.01 X              2.50 X                   15.39 X

Page 21: An Active and Hybrid Storage System for Data-intensive Applications

Summary

• McSD can improve system performance by offloading data-intensive computation

• McSD is a promising active storage model with
  – Pre-assembled processing modules
  – Parallel data processing
  – Better performance in the evaluation

Page 22: An Active and Hybrid Storage System for Data-intensive Applications

About the Active Storage

pp-mpiBlast: How to deploy Active Storage?

McSD: A Smart Disk Model

HcDD: Hybrid Disk for Active Storage

Storage Node

Page 23: An Active and Hybrid Storage System for Data-intensive Applications

Apply Active Storages to a Cluster

• So far, we know the potential of Active Storage

• Challenge: How to coordinate active storage nodes with computing nodes?

• Propose a Pipeline-parallel Processing pattern

Page 24: An Active and Hybrid Storage System for Data-intensive Applications

Contributions

• Propose a pipeline-parallel processing framework to “connect” an Active Storage node with computing nodes.

• Evaluate the framework using both an analytic model and a real implementation.

• Case Study: Extend an existing bioinformatics application based on the framework.

Page 25: An Active and Hybrid Storage System for Data-intensive Applications

Background: Active Storage

[Diagram labels: SSD, Mass Storage, Active Storage Node, SSD, Memory, Buff Disks, Processor, Computation, Bridge?]

Page 26: An Active and Hybrid Storage System for Data-intensive Applications

Background: Bioinformatics App

• BLAST*: Basic Local Alignment Search Tool
  – Comparing primary biological sequence information

• mpiBLAST** is a freely available, open-source, parallel implementation of NCBI BLAST.
  – Format raw data files
  – Run a parallel BLAST function

* http://blast.ncbi.nlm.nih.gov/
** http://www.mpiblast.org/

Page 27: An Active and Hybrid Storage System for Data-intensive Applications

Pipeline-parallel Design

• Offload the raw-data formatting task to where the data is stored.

• Intra-application Pipeline-parallel Processing by “partition” and “merge”.

• pp-mpiBlast, a case study.

Page 28: An Active and Hybrid Storage System for Data-intensive Applications

Pipelining Workflow

[Diagram: Active Storage Node | Computing Nodes. Stages: Partition, FormatDB, mpiBlast, Merge. Data items: Raw Input File; Partition 1, 2, …, n; Intermediate 1, 2, …, n; Sub-output 1, 2, …, n; Output File. Timeline labels: FormatDB, Intermediates, Output, “(n-1) times”, “n”, “1”]
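The workflow can be sketched as a two-stage pipeline in which one stage formats partition i while the other processes partition i-1. This is a single-machine illustration with threads, not the pp-mpiBlast implementation: `format_db` and `search` are toy stand-ins for FormatDB and mpiBlast, and the merge is a simple sum.

```python
import queue
import threading


def format_db(partition):
    """Stand-in for FormatDB on the active storage node."""
    return partition.upper()


def search(intermediate, query):
    """Stand-in for the mpiBlast stage on the computing nodes."""
    return intermediate.count(query)


def run_pipeline(partitions, query):
    """Overlap formatting of one partition with searching of the previous one."""
    handoff = queue.Queue(maxsize=1)  # one intermediate in flight at a time
    sub_outputs = []

    def storage_stage():
        for partition in partitions:
            handoff.put(format_db(partition))
        handoff.put(None)  # sentinel: no more intermediates

    def compute_stage():
        while True:
            intermediate = handoff.get()
            if intermediate is None:
                break
            sub_outputs.append(search(intermediate, query))

    stages = [threading.Thread(target=storage_stage),
              threading.Thread(target=compute_stage)]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return sum(sub_outputs)  # merge step


if __name__ == "__main__":
    print(run_pipeline(["acgt", "ttac", "gacac"], "AC"))
```

The bounded queue keeps the storage stage at most one partition ahead, which mirrors the hand-off of intermediates between the two sides of the diagram.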

Page 29: An Active and Hybrid Storage System for Data-intensive Applications


Analytic Model

• Three Critical Measures

Page 30: An Active and Hybrid Storage System for Data-intensive Applications

Evaluation Environment

           Computing Nodes Configuration    Active Storage Configuration
CPU        Intel XEON X3430                 Intel Core 2 Q9400
Memory     2 GB DDR3 (PC3-10600)
OS         Ubuntu 9.04 Jaunty Jackalope 32bit Version
Kernel     2.6.28-15-generic
Network    Gigabit LAN

Our Testbed              Comparison Testbeds
“Pipeline-parallel”      “12-node Cluster”      “13-node Cluster”
12 Computing Nodes       12 Computing Nodes     13 Computing Nodes
1 Active Storage Node    1 Storage Node         1 Storage Node

Page 31: An Active and Hybrid Storage System for Data-intensive Applications


Pipeline-parallel Design

Results: Compared With 12-node System

Results: Compared With 13-node System

Page 32: An Active and Hybrid Storage System for Data-intensive Applications


Speedups Trends: Partition Size

Page 33: An Active and Hybrid Storage System for Data-intensive Applications

Summary

• We proposed a pipeline-parallel processing mechanism to apply an Active Storage node to a cluster.

• As a case study, we extended a classic bioinformatics application based on the pipeline-parallel style.

Page 34: An Active and Hybrid Storage System for Data-intensive Applications

About the Active Storage

pp-mpiBlast: How to deploy Active Storage?

McSD: A Smart Disk Model

HcDD: Hybrid Disk for Active Storage

Storage Node

Page 35: An Active and Hybrid Storage System for Data-intensive Applications

What’s Hybrid?

A Hybrid Combination of a Gas Engine and an Electric Engine

Power Efficiency

Page 36: An Active and Hybrid Storage System for Data-intensive Applications

Hybrid Disk Drives

• A Hybrid Combination of Two Types of Storage Devices: HDD and SSD
  – HDD: Magnetic Hard Disk
  – SSD: Solid State Disk, built from NAND-based flash memory.

What are their roles?

Page 37: An Active and Hybrid Storage System for Data-intensive Applications

Motivation

• In a hybrid storage system, using SSDs as the buffer can boost performance.

• However, SSDs suffer from reliability issues.

Page 38: An Active and Hybrid Storage System for Data-intensive Applications

Limitations Related to SSDs

• Flash Memory:
  – Each block consists of 32, 64, or 128 pages.
  – Each page is typically 512, 2,048, or 4,096 bytes.

• “Erase-before-write” at block level.
• Lifespan is 10,000 Program/Erase cycles.
  – E.g., an 80 GB MLC SSD can last only 106 days if the write rate is 30 MB/s.*

• Rethink their roles?

* Based on the SSD lifespan calculator provided by Virident.com
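For a sense of scale, the sketch below computes an idealized upper bound on lifespan from the same inputs. It assumes perfect wear leveling, no write amplification, and decimal units (1 GB = 1000 MB); these assumptions are the example's own, not those of the calculator cited on the slide.

```python
SECONDS_PER_DAY = 86_400


def ideal_lifespan_days(capacity_gb, pe_cycles, write_mb_per_s):
    """Upper bound: every cell absorbs exactly `pe_cycles` writes.

    Assumes perfect wear leveling and a write amplification of 1.
    """
    total_writable_mb = capacity_gb * 1000 * pe_cycles
    return total_writable_mb / write_mb_per_s / SECONDS_PER_DAY


if __name__ == "__main__":
    days = ideal_lifespan_days(80, 10_000, 30)
    print(f"ideal bound: {days:.1f} days")
    print(f"ratio to the slide's 106 days: {days / 106:.2f}")
```

Under these assumptions the bound is about 309 days, roughly 2.9 times the 106 days quoted on the slide. The slides do not say which overheads account for the difference.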

Page 39: An Active and Hybrid Storage System for Data-intensive Applications

Contributions

• Hybrid Combination of HDDs and SSDs

• De-duplication Service using HDDs as a Write Buffer

• Internal-parallel Processing in SSD

• Simulation of the Whole System for Evaluation

Page 40: An Active and Hybrid Storage System for Data-intensive Applications

Hybrid Disk Configuration

[Diagram labels: I/O Requests, Read Requests, Data of Write Requests, HDD, De-duplication, Dedicated Processor, Deduplicated data, SSD, Pre-processing, Read Requests, Pre-processed Data]
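The write path in the diagram (writes land on the HDD, pass through de-duplication, and only then reach the SSD) can be sketched as follows. The class name, the use of SHA-256 fingerprints, and whole-block granularity are assumptions made for the example, not details taken from the slides.

```python
import hashlib


class DedupWriteBuffer:
    """Toy model: the HDD buffers writes, the SSD stores unique blocks only."""

    def __init__(self):
        self.hdd_buffer = []   # pending (address, block) writes
        self.ssd_blocks = {}   # fingerprint -> block content
        self.address_map = {}  # logical address -> fingerprint

    def write(self, address, block):
        """Absorb a write on the HDD side; nothing touches the SSD yet."""
        self.hdd_buffer.append((address, block))

    def flush(self):
        """De-duplicate buffered writes and return the number of SSD writes."""
        ssd_writes = 0
        for address, block in self.hdd_buffer:
            fingerprint = hashlib.sha256(block).hexdigest()
            if fingerprint not in self.ssd_blocks:
                self.ssd_blocks[fingerprint] = block
                ssd_writes += 1
            self.address_map[address] = fingerprint
        self.hdd_buffer.clear()
        return ssd_writes

    def read(self, address):
        """Serve a read from the SSD (flushed data only in this sketch)."""
        return self.ssd_blocks[self.address_map[address]]


if __name__ == "__main__":
    disk = DedupWriteBuffer()
    disk.write(0, b"alpha")
    disk.write(1, b"beta")
    disk.write(2, b"alpha")  # duplicate content
    print(disk.flush())      # number of blocks actually written to the SSD
    print(disk.read(2))
```

Each duplicate removed before the flush is one fewer program operation on the flash, which is how de-duplication relates to the lifespan limits discussed earlier.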

Page 41: An Active and Hybrid Storage System for Data-intensive Applications


HcDD Architecture

Page 42: An Active and Hybrid Storage System for Data-intensive Applications


Deduplication Design

Page 43: An Active and Hybrid Storage System for Data-intensive Applications


Internal Parallel Processing

Page 44: An Active and Hybrid Storage System for Data-intensive Applications


Evaluation

Page 45: An Active and Hybrid Storage System for Data-intensive Applications

Internal Parallelism Evaluation: Single Node

Page 46: An Active and Hybrid Storage System for Data-intensive Applications


Single Node: Dedup Ratio

Page 47: An Active and Hybrid Storage System for Data-intensive Applications


System Performance Evaluation

Page 48: An Active and Hybrid Storage System for Data-intensive Applications


System Performance Evaluation

Page 49: An Active and Hybrid Storage System for Data-intensive Applications


Summary

Page 50: An Active and Hybrid Storage System for Data-intensive Applications

Conclusion

pp-mpiBlast: How to deploy Active Storage?

McSD: A Smart Disk Model

HcDD: Hybrid Disk for Active Storage

Storage Node

Page 51: An Active and Hybrid Storage System for Data-intensive Applications

Future Work

Page 52: An Active and Hybrid Storage System for Data-intensive Applications

Many Thanks! And Questions?

