
CompoundFS: Compounding I/O Operations in Firmware File Systems

Yujie Ren (1), Jian Zhang (2) and Sudarsun Kannan (1)

(1) Rutgers University; (2) ShanghaiTech University

Outline

• Background
• Analysis
• Design
• Evaluation
• Conclusion

In-storage Processors Are Powerful

              Intel X25-M (2008)   Samsung 840 (2013)   Samsung 970 (2018)
CPU:          2-core               3-core               5-core
RAM:          128MB DDR2           512MB LPDDR2         1GB LPDDR4
Price:        $7.4/GB              $0.92/GB             $0.80/GB
Latency:      ~70µs                ~60µs                ~40µs
Bandwidth:    250 MB/s             500 MB/s             3300 MB/s

Software Latency Matters Now

OS Kernel Software Overhead Matters!

[Figure: a write() issued by the application traverses the OS storage stack (VFS Layer → Actual FS, e.g., ext4 or PMFS → Page Cache → Block I/O Layer → Device Driver), incurring a kernel trap, data copies, and roughly 1-4µs of OS overhead.]

Current Solutions

• DirectFS designs (e.g., Strata, SplitFS, DevFS) reduce software overhead by bypassing the OS kernel partially or fully

[Figure: three direct-access architectures, with data-plane and control-plane operation paths marked. Strata (SOSP '17): application and FS Lib in user space, with a separate FS Server above storage. SplitFS (SOSP '19): application and FS Lib over a kernel DAX FS. DevFS (FAST '18): application and FS Lib talking directly to a Firmware FS inside the storage device.]

Limitations of Current Solutions

• DirectFS designs do not reduce boundary crossings
  - Strata needs boundary crossings between FS Lib and FS Server
  - SplitFS needs kernel traps for control-plane operations
  - DevFS suffers high PCIe latency for every operation

• DirectFS designs do not efficiently reduce data copies
  - Current solutions copy data back and forth multiple times between the application and the storage stack

• DirectFS designs do not utilize in-storage computation
  - Current solutions use only host CPUs for I/O-related operations

Outline

• Background
• Analysis
• Design
• Evaluation
• Conclusion

Analysis Methodology

• File Systems
  - ext4-DAX: ext4 on byte-addressable storage, bypassing the page cache
  - SplitFS: direct-access file system bypassing the kernel for data-plane ops

• Application
  - LevelDB: well-known persistent key-value store
  - db_bench: random write and read benchmarks

• Storage
  - Persistent memory emulated on DRAM, as in prior work (e.g., SplitFS)

LevelDB Overhead Breakdown

• LevelDB spends significant time (~50%) in the OS storage stack

• Spends ~15% of its time on data copies between the application and the OS

• Spends ~20% of its time on application-level crash consistency (CRC of data)

[Figure: runtime percentage breakdown for 256- and 4096-byte values under ext4-DAX and SplitFS, split into data allocation (OS), data copy (OS), filesystem update (OS), lock (OS), data allocation (user), data copy (user), and CRC32 (user).]

Outline

• Background
• Analysis
• Design
• Evaluation
• Conclusion

Our solution: CompoundFS

• Combine (compound) multiple file system I/O ops into one

• Offload I/O pre- and post-processing to storage-level CPUs

• Bypass OS kernel and provide direct-access

Our solution: CompoundFS

• Combine (compound) multiple file system I/O ops into one
  - e.g., write() after read() compounded into write-after-read()
  - Reduces boundary crossings between host and storage (e.g., syscalls)

• Offload I/O pre- and post-processing to storage-level CPUs
  - e.g., checksum() after write() compounded into write-and-checksum()
  - Storage CPUs perform the computation (e.g., checksum) and persist the result
  - Reduces data movement cost across boundaries

• Bypass the OS kernel and provide direct access
  - A firmware file system design provides direct access for data-plane and most control-plane operations

I/O Only Compound Operations

Read-modify-write:

[Figure: traditional FS path vs. CompoundFS path for read-modify-write. The traditional path issues read(data) and write(data), costing 2 syscalls and 2 data copies between user space, kernel space, and storage. The CompoundFS path issues a single read_modify_write(data); StorageFS performs the modify step inside the device, so only 1 data copy is needed with direct access.]
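
To make the contrast concrete, below is a minimal C sketch of the two paths. The compoundfs_read_modify_write() prototype mirrors the read_modify_write(fd, buf, off, sz) signature shown later on the architecture slide; the function name, header, and how the modification itself is described to StorageFS are illustrative assumptions, not the actual CompoundFS API.

    #include <fcntl.h>
    #include <unistd.h>

    /* Hypothetical UserLib entry point; mirrors the architecture slide's
     * read_modify_write(fd, buf, off, sz) signature. */
    ssize_t compoundfs_read_modify_write(int fd, void *buf, off_t off, size_t sz);

    /* Traditional path: two syscalls and two user<->kernel data copies. */
    void update_traditional(int fd, char *buf, off_t off, size_t sz)
    {
        pread(fd, buf, sz, off);    /* syscall 1: data copied to user space */
        buf[0] ^= 1;                /* modify in the application            */
        pwrite(fd, buf, sz, off);   /* syscall 2: data copied back          */
    }

    /* CompoundFS path: one compound op; StorageFS runs the modify step on
     * the device CPUs, so the data crosses the boundary at most once. */
    void update_compound(int fd, char *buf, off_t off, size_t sz)
    {
        compoundfs_read_modify_write(fd, buf, off, sz);
    }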

I/O + Compute Compound Operations

Write-and-checksum:

[Figure: traditional FS path vs. CompoundFS path for write-and-checksum. The traditional path issues write(data) and write(checksum), costing 2 syscalls and 2 data copies. The CompoundFS path issues a single write_and_checksum(data); StorageFS handles the checksum calculation inside the device, so only 1 data copy is needed with direct access.]
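
A similar sketch for write-and-checksum, which folds the CRC work that the LevelDB breakdown attributed to user space into the device. compoundfs_write_and_checksum() follows the write_and_checksum(fd, buf, off, sz, checksum_pos) signature from the architecture slide; the function and the on-disk record layout are illustrative assumptions.

    #include <stdint.h>
    #include <unistd.h>
    #include <zlib.h>    /* crc32(), used on the traditional path */

    /* Hypothetical UserLib entry point; mirrors the architecture slide's
     * write_and_checksum(fd, buf, off, sz, checksum_pos=head) signature. */
    ssize_t compoundfs_write_and_checksum(int fd, const void *buf, off_t off,
                                          size_t sz, int checksum_at_head);

    /* Traditional path: the host CPU computes CRC32, then issues two writes
     * (checksum header followed by data): 2 syscalls, 2 data copies. */
    void put_traditional(int fd, const char *data, off_t off, size_t sz)
    {
        uint32_t crc = crc32(0L, (const unsigned char *)data, sz);
        pwrite(fd, &crc, sizeof(crc), off);
        pwrite(fd, data, sz, off + sizeof(crc));
    }

    /* CompoundFS path: one compound op; the device CPUs compute and persist
     * the checksum next to the data, saving a syscall, a copy, and host cycles. */
    void put_compound(int fd, const char *data, off_t off, size_t sz)
    {
        compoundfs_write_and_checksum(fd, data, off, sz, 1 /* checksum at head */);
    }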

CompoundFS Architecture

[Figure: application threads issue ops such as Op1 open(File1) -> fd1, Op2+ read_modify_write(fd2, buf, off=30, sz=5), Op3* write_and_checksum(fd1, buf, off=10, sz=1K, checksum_pos=head), and Op4 read(fd2, buf, off=30, sz=5). UserLib (on the host) converts POSIX I/O syscalls into CompoundFS compound ops and queues them in per-inode I/O queues with per-inode data buffers. StorageFS (in the device) runs I/O request processing threads on the device CPU cores, compounds I/O ops (e.g., performs the CRC calculation before the write for Op3*), and maintains a journal (TxB ... metadata, NVM data block addresses ... TxE) and a per-CPU credential table (CPUID, Cred).]
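
To ground the architecture, here is a minimal sketch of the kind of compound-op descriptor and per-inode command ring that UserLib (host) and StorageFS (device) could share; all field names, sizes, and the ring layout are assumptions for illustration, not the actual CompoundFS structures.

    #include <stdint.h>

    /* Illustrative compound-op descriptor placed in a per-inode I/O queue. */
    enum cfs_opcode {
        CFS_OP_OPEN,
        CFS_OP_READ,
        CFS_OP_READ_MODIFY_WRITE,
        CFS_OP_WRITE_AND_CHECKSUM,
    };

    struct cfs_cmd {
        uint32_t opcode;        /* one of enum cfs_opcode                    */
        int32_t  fd;            /* file descriptor from a prior open         */
        uint64_t offset;        /* file offset of the operation              */
        uint64_t size;          /* number of bytes                           */
        uint64_t data_off;      /* offset into the per-inode data buffer     */
        uint32_t checksum_pos;  /* e.g., head, for write_and_checksum        */
        uint32_t status;        /* filled in by StorageFS on completion      */
    };

    /* Illustrative single-producer/single-consumer ring per inode:
     * UserLib appends at tail, a StorageFS device thread consumes at head. */
    struct cfs_cmd_queue {
        uint32_t head;          /* advanced by StorageFS                     */
        uint32_t tail;          /* advanced by UserLib                       */
        struct cfs_cmd slots[64];
    };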

CompoundFS Implementation

• Command-based architecture based on PMFS (EuroSys '14)
  - Control-plane ops (e.g., open) are issued as commands via ioctl()
  - ioctl() carries the arguments for each I/O op

• Avoids VFS overhead
  - Control-plane ops go through ioctl(), not the VFS layer

• Avoids system call overhead
  - UserLib and StorageFS share a command buffer
  - UserLib adds requests to the command buffer
  - StorageFS processes requests from the buffer
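
A rough sketch of the two paths described above, assuming the command ring from the previous sketch. The ioctl request number, the command struct, and the helper names are hypothetical; only the split itself (control plane via ioctl(), data plane via the shared command buffer) comes from the slide.

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>

    /* Hypothetical control-plane command carried through ioctl(). */
    struct cfs_open_cmd {
        char     path[256];
        uint32_t flags;
        int32_t  fd_out;            /* filled in by StorageFS */
    };
    #define CFS_IOC_OPEN _IOWR('C', 1, struct cfs_open_cmd)

    /* Control plane: one ioctl() on the CompoundFS device per open();
     * the command carries all arguments, so the VFS layer is avoided. */
    int cfs_open(int dev_fd, const char *path, uint32_t flags)
    {
        struct cfs_open_cmd cmd = { .flags = flags };
        strncpy(cmd.path, path, sizeof(cmd.path) - 1);
        return ioctl(dev_fd, CFS_IOC_OPEN, &cmd) < 0 ? -1 : cmd.fd_out;
    }

    /* Data plane: UserLib appends a struct cfs_cmd to the shared per-inode
     * ring (previous sketch) and returns without any syscall; a StorageFS
     * device thread polls the ring and processes the request. */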

CompoundFS Challenges

• Crash-consistency model for compound I/O operations

• All-or-nothing model (current solution)
  - An entire compound operation is one transaction
  - Partially completed operations cannot be recovered
  - e.g., for write-and-checksum, a crash that persisted only the data but not the checksum discards the whole operation on recovery

• All-or-something model (ongoing)
  - Fine-grained journaling and partial recovery are supported
  - Recovery could become complex
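
As a minimal sketch of the all-or-nothing model, assuming the TxB ... TxE journal drawn in the architecture figure; the record layout and recovery check below are illustrative assumptions, not the actual CompoundFS journal format.

    #include <stdint.h>

    /* Illustrative journal records: an entire compound op is bracketed by a
     * TxB record and a TxE (commit) record, as in the architecture figure. */
    enum cfs_rec_type { CFS_TXB, CFS_META, CFS_DATA_ADDR, CFS_TXE };

    struct cfs_journal_rec {
        uint8_t  type;          /* enum cfs_rec_type                               */
        uint32_t tx_id;         /* one transaction id per compound operation       */
        uint64_t payload[2];    /* e.g., metadata update or NVM data block address */
    };

    /* Recovery sketch for all-or-nothing: a compound op is replayed only if its
     * commit record (TxE) is present; otherwise the partial op is discarded, so
     * a write-and-checksum never leaves data persisted without its checksum. */
    int cfs_tx_committed(const struct cfs_journal_rec *log, int n, uint32_t tx_id)
    {
        for (int i = 0; i < n; i++)
            if (log[i].tx_id == tx_id && log[i].type == CFS_TXE)
                return 1;
        return 0;
    }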

Outline

• Background
• Analysis
• Design
• Evaluation
• Conclusion

Evaluation Goal

• Effectiveness in reducing boundary crossings

• Effectiveness in reducing data copy overheads

• Ability to exploit the compute capability of modern storage

Experimental Setup

• Hardware Platform
  - Dual-socket 64-core Xeon Scalable CPU @ 2.6GHz
  - 512GB Intel DC Optane NVM

• Emulated firmware-level FS
  - Reserve dedicated device threads to handle I/O requests
  - Add PCIe latency to every I/O operation
  - Reduce CPU frequency to 1.2GHz for the device CPUs

• State-of-the-art File Systems
  - ext4-DAX (kernel-level file system)
  - SplitFS (user-level file system)
  - DevFS (device-level file system)
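
The talk does not detail the emulation mechanism, but as a rough illustration, a reserved "device" thread could charge a fixed PCIe-like delay to each request before processing it on a frequency-capped core; the delay constant and the process_one_request() handler below are assumptions.

    #include <time.h>

    /* Assumed per-operation PCIe round-trip cost used by the emulation. */
    #define EMU_PCIE_LATENCY_NS 900

    /* Hypothetical StorageFS handler: dequeues and executes one request,
     * returning a negative value when the emulated device is shut down. */
    extern int process_one_request(void);

    /* Body of a reserved device thread (pinned to a core capped at 1.2GHz,
     * e.g., via cpufreq) that models the PCIe crossing for every I/O. */
    void emulated_device_loop(void)
    {
        struct timespec pcie = { .tv_sec = 0, .tv_nsec = EMU_PCIE_LATENCY_NS };
        for (;;) {
            nanosleep(&pcie, NULL);          /* add PCIe latency per operation */
            if (process_one_request() < 0)
                break;
        }
    }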

Micro-Benchmark

• CompoundFS reduces unnecessary data movement and system call overhead by combining operations

[Figure: microbenchmark throughput (MB/s) for read-modify-write and write-and-checksum with value sizes of 256 and 4096 bytes, comparing ext4-DAX, SplitFS, DevFS, CompoundFS, and CompoundFS-slowCPU; the charts annotate speedups of 2.1x and 1.25x.]

• Even with slow device CPUs, CompoundFS can still provide gains from in-storage computation

LevelDB

• CompoundFS also shows promising speedups in LevelDB

[Figure: LevelDB db_bench with 500k keys and value sizes of 512 and 4096 bytes, comparing ext4-DAX, SplitFS, DevFS, CompoundFS, and CompoundFS-slowCPU: random-write throughput (MB/s) and random-read latency (us/op), with an annotated 1.75x speedup.]

Conclusion

• Storage hardware is moving into the microsecond era
  - Software overhead matters, and providing direct access is critical
  - Storage compute capability can benefit I/O-intensive applications

• CompoundFS combines I/O ops and offloads computation
  - Reduces boundary-crossing (system call) and data copy overheads
  - Takes advantage of in-storage compute resources

• Our ongoing work
  - Fine-grained crash-consistency mechanism
  - Efficient I/O scheduler for managing computation in storage

Thanks!


Questions?

[email protected]