
DADI Block-Level Image Service for Agile and Elastic Application Deployment Huiba Li, Yifan Yuan, Rui Du, Kai Ma, Lanzheng Liu and Windsor Hsu Alibaba Group


  • DADI Block-Level Image Service for Agile and Elastic Application Deployment

    Huiba Li, Yifan Yuan, Rui Du, Kai Ma, Lanzheng Liu and Windsor Hsu

    Alibaba Group

  • The Problem

    • Container deployment (cold startup) is slow
      • Long-tail latency reaches 10s of minutes
      • The essential reasons are image downloading and unpacking
      • Only 6.4% [Slacker] of the image is used for startup
      • A regression to a decade ago, when VM images were also downloaded to hosts

    • P2P downloading [Dragonfly, Kraken, Borg, Tupperware, FID] is not enough
      • Addresses only half of the problem (downloading, not unpacking)
      • Little effect for small clusters

    • Slimming the images [DockerSlim, Cntr] is not universal
      • Hard to automatically find all dependencies for all applications
      • Hard to support ad-hoc operations

  • Remote Image

    • is the trend [CRFS, Teleport, CernVM-FS, Slacker, Wharf, CFS, Cider]
      • optionally with P2P transfer for large clusters

    • The container image (tarball) is, however, NOT viable as a remote image
      • designed for unpacking, not seekable
      • hard to support advanced features, such as xattr, cross-layer reference, etc.
      • we'd better design a new one

    • Type of image
      • file-system-based image?
      • block-device-based image?

  • Type of Image: Block!

    Block-Device-Based
      • Works together with a regular file system, such as ext4
      • Viable for containers, secure containers and virtual machines
      • Existing systems: Cider (based on Ceph; no layering format)
      • Complexity: low (stability↑, optimization↑, advanced features↑)
      • Universality: the app can choose a best-match file system, e.g. NTFS, and pack it into the image as a dependency
      • Security: small attack surface
      • Overall: needs the courage to walk alone (almost); TODO: layering

    File-System-Based
      • Provides a file-system interface directly
      • A "natural" extension of the container image
      • Less mental friction (due to inertia and following the crowd)
      • Existing systems: CRFS, Teleport, CernVM-FS, Slacker, Wharf, CFS
      • Complexity: high (stability↓, optimization↓, advanced features↓)
      • Universality: fixed features; may not match all applications (e.g. a Windows container on a Linux host)
      • Security: large attack surface
      • Overall: technical advantage is insignificant


  • Background: Layered Image of Container

    • Each layer is a change set compared to the previous state (files added, modified, deleted); read-only, shared
    • The container layer is a change set compared to the image (files added, modified, deleted); read-write, private
    • Usually the layers are stored in separate directories, and a merged view is created with the kernel module overlayfs.

    (Figure: layers are downloaded from the docker registry and untarred)

  • Background: I/O Path

    (Figure: app processes in a container access a merged view provided by overlayfs, in kernel space, over the layer directories; layers are downloaded, ungzipped and untarred from the Docker Registry)


  • DADI Remote Image

    • A layered image format
      • based on a virtual block device
      • works together with a regular file system, e.g. ext4
      • a general solution for the container ecosystem

    • Compression
      • with seekable (online) decompression: ZFile

    • Scalability
      • peer-to-peer (P2P) on-demand read in a tree-structured topology

    (Figure: Overlay Block Device with ZFile layers and P2P transfer)

  • DADI I/O Path

    (Figure: app processes in a container access a regular file system (ext4, etc.) on a virtual block device (OverlayBD); in user space, the lsmd daemon serves the device from local ZFile layer blobs for downloaded layers, from a writable file for new layers, and via P2P RPC for data not yet local)

  • Overlay Block Device

    • Each layer is a change set of overwritten blocks
      • no concept of file or file system
      • 512-byte block size (granularity)

    • An index for fast reading
      • variable-length entries to save memory by combining
      • non-overlapping entries sorted by logical offsets
      • range query by binary search

    (Figure: a pread(offset, length) is resolved against the index segments; ranges covered by a segment are read as raw data, the uncovered ranges are holes)
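The range query could be sketched as follows. `Segment` and `LayerIndex` are illustrative stand-ins, not DADI's actual structures (which pack each entry into 16 bytes):

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class Segment:
    offset: int   # logical offset, in 512-byte blocks
    length: int   # length, in blocks
    # a real entry would also record where the data lives in the layer blob

class LayerIndex:
    """Non-overlapping segments, sorted by logical offset."""
    def __init__(self, segments):
        self.segs = sorted(segments, key=lambda s: s.offset)
        self.offsets = [s.offset for s in self.segs]

    def lookup(self, offset, length):
        """Split [offset, offset+length) into covered ranges and holes,
        returning (start, length, segment-or-None) tuples."""
        end = offset + length
        results = []
        # binary search for the first segment that may cover `offset`
        i = bisect_right(self.offsets, offset) - 1
        if i >= 0 and self.segs[i].offset + self.segs[i].length <= offset:
            i += 1
        i = max(i, 0)
        pos = offset
        while pos < end and i < len(self.segs):
            s = self.segs[i]
            if s.offset > pos:                       # hole before next segment
                hole_end = min(s.offset, end)
                results.append((pos, hole_end - pos, None))
                pos = hole_end
            else:                                    # range covered by s
                cov_end = min(s.offset + s.length, end)
                results.append((pos, cov_end - pos, s))
                pos = cov_end
                i += 1
        if pos < end:                                # trailing hole
            results.append((pos, end - pos, None))
        return results
```

Holes fall through to lower layers (or read as zeros at the bottom), which is what makes each layer a pure change set of overwritten blocks.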

  • Index Merge

    (Figure: two per-layer segment lists are merged into one non-overlapping list, with upper-layer segments shadowing lower-layer ones)

    (Chart: # of segments in the merged index (0K–5K) vs. layer depth (0–45) for production images; merged index size: 4.5K entries * 16 bytes = 72KB)
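A minimal sketch of the merge, under assumed names and a (offset, length, tag) tuple layout: each lower-layer segment is clipped against the upper layer, then the two sorted non-overlapping lists are combined.

```python
def merge_indices(upper, lower):
    """Merge two sorted, non-overlapping segment lists; each segment is
    (offset, length, tag), where tag identifies the source layer.
    Upper-layer segments shadow overlapping parts of lower-layer ones."""
    clipped = []
    for off, length, tag in lower:
        end = off + length
        pos = off
        for uo, ul, _ in upper:            # clip out parts covered above
            ue = uo + ul
            if ue <= pos or uo >= end:
                continue
            if uo > pos:                   # keep the part before the overlap
                clipped.append((pos, uo - pos, tag))
            pos = max(pos, ue)
            if pos >= end:
                break
        if pos < end:                      # keep the remaining tail
            clipped.append((pos, end - pos, tag))
    return sorted(clipped + list(upper), key=lambda s: s[0])
```

Applying this pairwise from the bottom layer up yields the single merged index queried at read time; merging is also where adjacent entries can be combined to keep the index small.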

  • Index Performance

    (Chart 1: queries/second (0M–9M) vs. size of index (1K–10K segments))

    (Chart 2: IOPS (bs=8KB, non-cached; 0K–120K) vs. I/O queue depth (1–256), for Thin LVM, DADI w/o compression, and DADI-ZFile)

    > 6M QPS for production images

  • Writable Layer

    • Log-structured design
      • appending index and raw data to separate logs
    • Maintaining an in-memory index
      • red-black tree
    • Commit only useful data blocks (in offset order)
      • combine index entries

    (Figure: the R/W layer appends to separate raw-data and index logs; commit produces a read-only layer blob: header, raw data, index, trailer)
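A toy sketch of the write and commit paths, under stated assumptions: in-memory logs, no handling of overlapping writes, and a sorted list standing in for the red-black tree. `WritableLayer` and its fields are illustrative names only.

```python
import bisect
import io

class WritableLayer:
    """Log-structured writable layer: raw data is appended to a data log,
    while an in-memory index maps logical offsets to log positions."""
    def __init__(self):
        self.data_log = io.BytesIO()
        self.index = []          # sorted (logical_offset, length, log_pos)

    def write(self, offset, buf):
        # append-only: data goes to the end of the log, never in place
        log_pos = self.data_log.tell()
        self.data_log.write(buf)
        bisect.insort(self.index, (offset, len(buf), log_pos))

    def commit(self):
        """Emit only the useful data blocks, in logical-offset order,
        producing the read-only blob and its index entries."""
        blob = io.BytesIO()
        entries = []
        for off, length, pos in self.index:   # already offset-sorted
            self.data_log.seek(pos)
            entries.append((off, length, blob.tell()))
            blob.write(self.data_log.read(length))
        return entries, blob.getvalue()
```

Because commit rewrites blocks in offset order and drops superseded data, the committed layer is compact and its index entries are mergeable, unlike the append-order data log.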

  • ZFile

    • A seekable compression format
      • random reading, with online decompression
    • Compresses data in fixed-sized chunks
      • decompresses only the needed chunks
    • Not tied to DADI

    (Figure: a ZFile (header, [dict], compressed chunks, index, trailer) compresses an underlay file, the DADI layer blob (header, raw data, index, trailer))
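As an illustration of seekable per-chunk compression (not ZFile's actual on-disk format), the idea can be shown with zlib over fixed-size chunks plus an offset table:

```python
import zlib

CHUNK = 64 * 1024  # fixed chunk size; an assumption, not ZFile's actual value

def zfile_compress(data, chunk=CHUNK):
    """Compress fixed-size chunks independently and record where each
    compressed chunk starts, so any chunk can be located and decompressed
    on its own."""
    blobs, offsets, pos = [], [], 0
    for i in range(0, len(data), chunk):
        c = zlib.compress(data[i:i + chunk])
        offsets.append(pos)
        blobs.append(c)
        pos += len(c)
    offsets.append(pos)          # end sentinel
    return b"".join(blobs), offsets

def zfile_pread(blob, offsets, offset, length, chunk=CHUNK):
    """Read [offset, offset+length) by decompressing only the needed chunks."""
    first, last = offset // chunk, (offset + length - 1) // chunk
    out = b""
    for i in range(first, last + 1):
        out += zlib.decompress(blob[offsets[i]:offsets[i + 1]])
    skip = offset - first * chunk
    return out[skip:skip + length]
```

Gzip, by contrast, is a single stream: reading byte N requires decompressing everything before it, which is exactly what makes tar.gz images non-seekable.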

  • On-Demand P2P Transfer

    • In a tree-structured topology
      • Each P2P node caches recently used data blocks.
      • A request is likely to hit its parent's cache,
      • or the parent will forward the request upward, recursively.

    (Figure: in each datacenter, a tree of DADI-Agents is rooted at a DADI-Root, which fetches layer blobs from the Registry via HTTP(S); agents serve each other via DADI requests)
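The cache-and-forward behavior might be sketched as below. `P2PNode` is a hypothetical class; the real agents speak an RPC protocol and cache on local disk rather than in memory.

```python
from collections import OrderedDict

class P2PNode:
    """A node in the tree-structured P2P cache. Each node keeps an LRU
    cache of recently used blocks; a miss is forwarded to the parent,
    recursively, until the root fetches from the registry."""
    def __init__(self, parent=None, fetch_from_registry=None, cache_size=1024):
        self.parent = parent
        self.fetch = fetch_from_registry   # only set on the root
        self.cache = OrderedDict()
        self.cache_size = cache_size

    def read(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)       # LRU hit
            return self.cache[block_id]
        if self.parent is not None:
            data = self.parent.read(block_id)      # forward upward
        else:
            data = self.fetch(block_id)            # root hits the registry
        self.cache[block_id] = data                # cache on the way down
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)         # evict least recently used
        return data
```

Since hot blocks are cached at every level on the way down, a popular image is pulled from the registry roughly once per tree, not once per host.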

  • Evaluations

  • Startup Latency with DADI

    (Chart 1: cold start latency (0–20 s), split into image pull and app launch, for .tgz + overlay2, CRFS, pseudo-Slacker, DADI from Registry, and DADI from P2P Root)

    (Chart 2: warm startup latency (0–2.4 s) for overlay2, Thin LVM (device mapper), and DADI, on NVMe SSD and Cloud Disk)

  • Startup Latency with DADI

    (Chart 1: startup latency (0–2.4 s) with warm vs. cold cache, app launch with and without prefetch)

    (Chart 2: cold startup latency (0–3 s) vs. # of hosts (and containers) (0–40), for pseudo-Slacker and DADI)

  • Scalability with DADI

    (Chart 1: large-scale startup of Agility on 1,000 hosts; # of container instances started (0K–10K) vs. time (0–4 s), for three cold startups and a warm startup)

    (Chart 2: projected hyper-scale startup of Agility, by evaluating a single branch of the P2P tree; estimated startup latencies (1.5–3.5 s) vs. # of containers (10K–100K), for 2-, 3-, 4- and 5-ary trees)

    (Agility is a small application written in Python specifically to assist the test)

  • I/O Performance

    (Chart 1: image scanning with du; time to du all files (0–1.6 s) for overlay2, Thin LVM and DADI, on NVMe SSD and Cloud Disk)

    (Chart 2: image scanning with tar; time to tar all files (0–12 s) for the same systems and disks)

  • Thanks!