
  • Write Optimization of Log-structured Flash File System for Parallel I/O on Manycore Servers

    Chang-Gyu Lee, Hyunki Byun, Sunghyun Noh, Hyeongu Kang, Youngjae Kim
    Department of Computer Science and Engineering
    Sogang University, Seoul, Republic of Korea

    SYSTOR ‘19

  • Data Intensive Applications

    Massive data explosion in recent years, and it is expected to keep growing:
    2007: 281 EB → 2010: 1.2 ZB → 2013: 4.4 ZB → 2020: ~44 ZB (projected).

    Database applications place growing capacity demands on both storage and memory, and they issue parallel writes.

  • Manycore CPU and NVMe SSD

    [Figure: parallel writes from a manycore server pass through the OS file system (F2FS) to a high-performance NVMe SSD.]

  • What are Parallel Writes?

    Shared File Writes (DWOM from FxMark [ATC’16]):
    multiple processes write to private regions of a single shared file.

    Private File Writes with fsync (DWSL from FxMark [ATC’16]):
    multiple processes write to private files, then call the fsync system call.

    [Figure: Process 1 ... Process N issuing direct I/O writes to a shared file, and Process 1 ... Process N writing and fsyncing private files.]

    A minimal sketch of the shared-file pattern follows this slide.

    * FxMark [ATC’16]: Min et al., "Understanding Manycore Scalability of File Systems", USENIX ATC 2016
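    To make the DWOM-style pattern concrete, here is a minimal user-level sketch (our own illustration, not FxMark's code; the file name shared.dat, thread count, and region size are assumptions): each thread overwrites its own private 4KB region of one shared file in parallel.

```c
/* Minimal sketch (not FxMark itself) of the DWOM-style shared-file
 * pattern: each thread overwrites its own private 4KB region of one
 * shared file in parallel. Build with: cc -pthread dwom.c */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 8
#define BLK      4096

static int fd;  /* shared file descriptor */

static void *writer(void *arg)
{
    long id = (long)arg;
    char buf[BLK];
    memset(buf, 'A' + (int)id, BLK);
    /* Each thread repeatedly overwrites its own region; with a single
     * per-inode mutex, these pwrite() calls serialize in the kernel. */
    for (int i = 0; i < 10000; i++)
        if (pwrite(fd, buf, BLK, (off_t)id * BLK) != BLK)
            perror("pwrite");
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    fd = open("shared.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, (off_t)NTHREADS * BLK)) { perror("ftruncate"); return 1; }
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, writer, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    close(fd);
    return 0;
}
```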

  • Preliminary Results

    [Two plots: KIOPS vs. # of cores (1–120), one for each workload.]

    In the DWOM workload, performance does not scale at all.
    In the DWSL workload, performance does not scale beyond 42 cores.

  • Contents

    Introduction and Motivation
    Background: F2FS
    Research Problem: parallel writes do not scale with the number of cores on manycore servers.
    Approaches:
    - Applying Range-Locking
    - NVM Node Logging for file and file system metadata
    - Pin-Point Update to completely eliminate checkpointing
    Evaluation Results
    Conclusion

  • F2FS: Flash Friendly File System

    F2FS is a log-structured file system designed for NAND flash SSDs. F2FS employs two types of logs to benefit from flash parallelism and to ease garbage collection:
    - Data log for directory entries and user data
    - Node log for inodes and indirect nodes

    The Node Address Table (NAT) translates a node id (NID) to a block address. In memory, the block address in a NAT entry is updated when the corresponding node log is flushed; the entire NAT is flushed to the storage device during checkpointing. A simplified sketch of the NAT follows this slide.

    [On-disk layout: CP | NAT | SIT | SSA (file system metadata, random writes), followed by the Node Log and Data Log (main log area, sequential writes).]
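    To make the NID-to-block indirection concrete, here is a simplified userspace sketch; the flat array, its size, and the field layout are our own illustration, not the kernel's struct f2fs_nat_entry format.

```c
/* Simplified sketch of NAT translation (hypothetical types and layout;
 * the real on-disk format is struct f2fs_nat_entry in the Linux kernel). */
#include <stdint.h>

typedef uint32_t nid_t;     /* node id */
typedef uint32_t block_t;   /* block address on the SSD */

struct nat_entry {
    nid_t   nid;        /* which node this entry describes */
    block_t blk_addr;   /* current on-disk location of the node block */
};

/* In-memory NAT, indexed by NID (size chosen arbitrarily here). */
static struct nat_entry nat[1 << 20];

/* Look up where a node (e.g., an inode) currently lives. */
static block_t nat_lookup(nid_t nid)
{
    return nat[nid].blk_addr;
}

/* When a node block is written to a new location in the node log,
 * only the in-memory entry is updated; persisting the whole NAT is
 * deferred to the next checkpoint in baseline F2FS. */
static void nat_update(nid_t nid, block_t new_addr)
{
    nat[nid].blk_addr = new_addr;
}
```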

  • Problem(1): Serialized Shared File Writes

    [Figure: processes A, B, and C all write to a single file; the inode lock is granted to one writer at a time ("Grant Lock") while the others are blocked.]

    All writes to a single file are serialized; the sketch after this slide shows the pattern.
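    The serialization comes from the exclusive per-inode lock held across each write. Below is a userspace analogue (our illustration, not F2FS's kernel code; file_state and do_write are hypothetical) of the pattern.

```c
/* Userspace analogue (our illustration, not kernel code) of the
 * problem: one mutex per file serializes all writers, just like the
 * per-inode mutex taken across the kernel write path. */
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>

struct file_state {
    pthread_mutex_t lock;   /* plays the role of the inode mutex */
    /* ... file metadata ... */
};

/* Hypothetical helper that copies data and allocates blocks. */
ssize_t do_write(struct file_state *f, const void *buf, size_t len, off_t pos);

ssize_t serialized_write(struct file_state *f, const void *buf,
                         size_t len, off_t pos)
{
    ssize_t ret;
    pthread_mutex_lock(&f->lock);   /* writers to the SAME file queue here */
    ret = do_write(f, buf, len, pos);
    pthread_mutex_unlock(&f->lock);
    return ret;
}
```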

  • Problem(2): fsync Processing in F2FS

    [Figure: on fsync, ❶ the new data and its inode are flushed from DRAM to the Data Log and Node Log on the SSD (the old data block loses its reference), and ❷ the in-memory NAT entry is updated so the node id references the inode's new block address.]
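    A tiny userspace probe (our illustration; the file name and iteration count are arbitrary) makes the cost visible: even a 64-byte write pays full block I/O latency on every fsync, because the file system must flush whole data and node blocks.

```c
/* Probe for the per-fsync cost of small writes. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char buf[64];
    struct timespec a, b;
    int fd = open("probe.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    memset(buf, 'x', sizeof buf);

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < 1000; i++) {
        if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) perror("write");
        if (fsync(fd)) perror("fsync");   /* forces node + data block I/O */
    }
    clock_gettime(CLOCK_MONOTONIC, &b);

    double us = (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
    printf("avg write+fsync latency: %.1f us\n", us / 1000);
    close(fd);
    return 0;
}
```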

  • Problem(3): I/O Blocking during Checkpointing

    [Figure: the same fsync flow as before (❶ flush data and inode to the logs, ❷ update the in-memory NAT entry), plus ❸ checkpointing, which flushes the entire NAT from DRAM to the SSD; the figure annotates a 60-second checkpoint interval.]

    While a checkpoint is in progress, F2FS blocks all incoming I/O requests, from the user level down through the file system level. The sketch after this slide shows the locking pattern behind this blocking.
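    A userspace analogue (our illustration) of the blocking pattern: foreground operations take a checkpoint lock shared, while the checkpoint takes it exclusive, stalling every new write until the whole NAT and related metadata are flushed. F2FS implements this with a reader-writer semaphore (cp_rwsem) in the kernel.

```c
/* Reader-writer model of checkpoint blocking. */
#include <pthread.h>

static pthread_rwlock_t cp_lock = PTHREAD_RWLOCK_INITIALIZER;

void foreground_write(void)
{
    pthread_rwlock_rdlock(&cp_lock);   /* blocks while a checkpoint runs */
    /* ... perform the write, append to the logs ... */
    pthread_rwlock_unlock(&cp_lock);
}

void checkpoint(void)
{
    pthread_rwlock_wrlock(&cp_lock);   /* drains writers, then blocks new ones */
    /* ... flush the entire NAT, SIT, and checkpoint block ... */
    pthread_rwlock_unlock(&cp_lock);
}
```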

  • Summary

    We identified the causes of the bottlenecks in F2FS for parallel writes as follows.

    1. Serialization of parallel writes on a single file

    2. High latency of the fsync system call

    3. I/O blocking caused by F2FS checkpointing

  • Approach(1): Range Locking

    In F2FS, parallel writes to a single file are serialized by the inode mutex lock. We employ a range-based lock to allow parallel writes on a single file.

    [Figure: writers A (ref=0), B (ref=0), and C (ref=1) each request a lock on their own byte range; the non-overlapping requests are granted in parallel ("Grant Lock"), and only the overlapping request blocks.]

    A minimal sketch of such a range lock follows this slide.
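    Here is a minimal range-lock sketch (our illustration, assuming a simple linked list of held ranges; the paper's in-kernel structure may differ). Writers to non-overlapping [off, off+len) ranges proceed in parallel; overlapping writers wait.

```c
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>

struct range {
    off_t start, end;          /* held range: [start, end) */
    struct range *next;
};

static struct range *held;     /* list of granted ranges */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

static int overlaps(off_t s, off_t e)
{
    for (struct range *r = held; r; r = r->next)
        if (s < r->end && r->start < e)
            return 1;
    return 0;
}

void range_lock(off_t off, size_t len)
{
    pthread_mutex_lock(&m);
    while (overlaps(off, off + (off_t)len))    /* wait for conflicting writers */
        pthread_cond_wait(&cv, &m);
    struct range *r = malloc(sizeof *r);
    r->start = off; r->end = off + (off_t)len;
    r->next = held; held = r;
    pthread_mutex_unlock(&m);
}

void range_unlock(off_t off, size_t len)
{
    pthread_mutex_lock(&m);
    for (struct range **pp = &held; *pp; pp = &(*pp)->next)
        if ((*pp)->start == off && (*pp)->end == off + (off_t)len) {
            struct range *r = *pp;
            *pp = r->next;
            free(r);
            break;
        }
    pthread_cond_broadcast(&cv);   /* wake waiters to re-check overlap */
    pthread_mutex_unlock(&m);
}
```

    Note that only the short lock/unlock bookkeeping is serialized; the data copy itself runs outside the mutex, so disjoint writers overlap.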

  • Approach(2): High Latency of fsync Processing

    When fsync is called, F2FS has to flush data and metadata. Even if only a small portion of the metadata has changed, a whole block has to be flushed, so the latency of fsync is dominated by block I/O latency (slow block I/O plus write amplification on the DRAM→SSD path).

    To mitigate the high latency of fsync, we propose NVM Node Logging with a fine-grained inode: metadata moves from DRAM to byte-addressable NVM, which offers far better latency than block I/O.

  • Approach(2): Node Logging on NVM

    [Figure: the fsync flow with the Node Log and NAT placed on NVM instead of the SSD; ❶ data is flushed to the Data Log on the SSD while the inode is appended to the Node Log on NVM, and ❷ the NAT entry is updated to reference the new node location.]

    A sketch of an NVM log append follows this slide.
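    The sketch below uses PMDK's libpmem from userspace rather than the paper's in-kernel code; the log path, log size, and record size are assumptions. The point: appending a small inode record to an NVM-resident node log is a memcpy plus a persist barrier, not a 4KB block I/O.

```c
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

#define LOG_SIZE (64UL << 20)   /* hypothetical 64MB node log */

int main(void)
{
    size_t mapped;
    int is_pmem;
    /* Map the (emulated) PMEM device; the path is hypothetical. */
    char *log = pmem_map_file("/mnt/pmem/node_log", LOG_SIZE,
                              PMEM_FILE_CREATE, 0600, &mapped, &is_pmem);
    if (!log) { perror("pmem_map_file"); return 1; }
    if (!is_pmem)
        fprintf(stderr, "warning: mapping is not real persistent memory\n");

    char inode_rec[400];        /* ~0.4KB fine-grained inode record */
    memset(inode_rec, 0, sizeof inode_rec);

    static size_t tail;         /* append position in the log */
    /* Copy the record into NVM and persist it with cache-line flushes
     * and a fence; no block-sized write amplification. */
    pmem_memcpy_persist(log + tail, inode_rec, sizeof inode_rec);
    tail += sizeof inode_rec;

    pmem_unmap(log, mapped);
    return 0;
}
```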

  • Approach(3): Fine-grained inode Structure

    In baseline F2FS, an inode occupies a full 4KB block (NID, address space, and direct/indirect/double-indirect node pointers), so persisting any inode change costs a 4KB write. The fine-grained inode trims this to roughly 0.4KB of data (NID plus block addresses), which suits byte-addressable NVM; see the sketch after this slide.
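    The structs below are our illustration of the contrast; the exact field layout is hypothetical, not F2FS's struct f2fs_inode definition.

```c
#include <stdint.h>

typedef uint32_t nid_t;
typedef uint32_t block_t;

/* Baseline: the on-disk inode fills a whole 4KB block, so any update
 * is persisted as a 4KB block write. */
struct baseline_inode {
    nid_t   nid;
    block_t addr[923];          /* direct block pointers */
    nid_t   node[5];            /* (double-)indirect node ids */
    uint8_t pad[4096 - sizeof(nid_t) * 6 - sizeof(block_t) * 923];
};

/* Fine-grained: only ~0.4KB, containing the NID and block addresses;
 * small enough to persist with a few cache-line flushes on NVM. */
struct fine_grained_inode {
    nid_t   nid;
    block_t addr[99];           /* ~0.4KB total */
};
```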

  • Approach(4): Pin-Point NAT Update

    Frequent fsync calls trigger checkpointing in F2FS, and F2FS blocks all incoming I/O requests during checkpointing. To eliminate checkpointing, we propose Pin-Point NAT Update.

  • Approach(4): Pin-Point NAT Update

    [Figure: on fsync, ❶ data is flushed to the Data Log on the SSD, ❷ the inode is appended to the Node Log on NVM, and ❸ only the modified NAT entry is updated in place on NVM.]

    In Pin-Point NAT Update, we update only the modified NAT entry directly in NVM when fsync is called. Therefore, checkpointing is no longer necessary to persist the entire NAT. A sketch of the entry update follows this slide.
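    The sketch below is our illustration of a pin-point entry update (the packed 8-byte encoding and the persist helper are assumptions): because a NID-to-block mapping fits in 8 bytes, one atomic store plus a cache-line flush and fence persists just the modified entry instead of the whole table.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef uint32_t nid_t;
typedef uint32_t block_t;

struct nvm_nat_entry {
    _Atomic uint64_t packed;    /* [nid : 32 | blk_addr : 32] */
};

/* NAT resident on NVM (assume it was mapped elsewhere, e.g. via libpmem). */
extern struct nvm_nat_entry *nvm_nat;

static inline void persist(void *addr)
{
#if defined(__x86_64__)
    __builtin_ia32_clflush(addr);       /* flush the cache line to NVM */
#endif
    atomic_thread_fence(memory_order_seq_cst);
}

void pinpoint_nat_update(nid_t nid, block_t new_blk)
{
    uint64_t v = ((uint64_t)nid << 32) | new_blk;
    /* 8-byte atomic store: the entry is never observed half-updated. */
    atomic_store(&nvm_nat[nid].packed, v);
    persist(&nvm_nat[nid]);             /* persist only this one entry */
}
```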


  • Evaluation Setup

    Microbenchmark (FxMark):
      DWOM: shared file write
      DWSL: private file write with fsync

    Test-bed: IBM x3950 X6 manycore server
      CPU:  Intel Xeon E7-8870 v2, 2.3GHz; 8 CPU nodes (15 cores per node), 120 cores total
      RAM:  740GB
      SSD:  Intel SSD 750 Series 400GB (NVMe); read 2200 MB/s, write 900 MB/s
      NVM:  32GB, emulated as a PMEM device on RAM
      OS:   Linux kernel 4.14.11

    * FxMark [ATC’16]: Min et al., "Understanding Manycore Scalability of File Systems", USENIX ATC 2016

  • Shared File Write (DWOM Workload)

    [Plot: KIOPS vs. # of cores (1–120) for baseline, range lock, node logging, and integrated; annotated speedups of x15 and x6.8.]

    The baseline and node logging lines overlap: Node Logging does not help at all here, because the DWOM workload issues no fsync calls.

  • Frequent fsync (DWSL Workload)

    [Plot: KIOPS vs. # of cores (1–120) for baseline, range lock, node logging, and integrated; annotated speedup of x1.6.]

  • Conclusion

    We identified the performance bottlenecks of F2FS for parallel writes:
    1. Serialization of shared file writes on a single file
    2. High latency of fsync operations in F2FS
    3. High I/O blocking times during checkpointing

    To solve these problems, we proposed:
    1. File-level Range Lock to allow parallel writes on a shared file
    2. NVM Node Logging to provide lower latency for updating file and file system metadata
    3. Pin-Point NAT Update to eliminate the I/O blocking times of checkpointing

  • Q&A

    Thank you!

    Contact: Changgyu Lee ([email protected])
    Department of Computer Science and Engineering
    Sogang University, Seoul, Republic of Korea