
  • FAWN-Forward 2010

    David Andersen, Vijay Vasudevan, Michael Kaminsky*, Amar Phanishayee, Lawrence Tan, Jason Franklin, Hyeontaek Lim, Bin Fan, Alex Gartrell, Ben Blum, Nathan Wan

    Carnegie Mellon University    *Intel Labs Pittsburgh

  • "FAWN" Microservers

    1.6 GHz Atom, 2 GB DRAM, 160 GB SSD

    Amdahl Blades, "Gordon", SeaMicro, Smooth-Stone...

    Arrays of low(er)-power nodes can serve datacenter workloads at roughly 1/3rd the power.

  • Energy in computing

    Power is a significant burden on computing.

    3-year total cost of ownership soon to be dominated by power.

  • Theme: Towards balanced data-intensive systems

    [Chart: latency in nanoseconds (log scale, 1 to 100,000,000) vs. year, 1980-2005, for Disk Seek, DRAM Access, and CPU Cycle; the widening gap between them is labeled "Wasted resources".]

    What's the most energy-efficient way to balance computation, memory, and I/O?

  • Theme: Targeting the sweet spot in efficiency

    [Chart: Efficiency vs. Speed (includes a 0.1 W power overhead; numbers from spec sheets).]

    Fixed costs dominate; the fastest processors may not be the most efficient.

    FAWN targets the sweet spot in efficiency: slower CPU + flash storage.

  • My perspective on this

    • FAWN-KV key-value store [SOSP'09]
    • Lesson 1: It worked!
      • < 1 ms lookup latency (on a 6-year-old node)
      • Drastically improved energy efficiency
    • Lesson 2: It took a lot of engineering...
      • And even some new algorithms
      • Finding/making balanced HW was challenging

  • Initial Target: Key-value storage systems

    • Critical infrastructure service
    • Performance-conscious
    • Random-access, read-mostly, hard to cache

  • FAWN-KV Results

    • Log-structured datastore
      • Flash-optimized: no random writes, fast random reads
      • Memory-efficient index (~12 bytes per K-V pair) to locate items in the log
    • Consistent hashing + chain replication for consistent, primary-backup (3x) replication and load balancing (see the sketch below)
    • Already published at SOSP last year. Tell us something new!
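    The node-selection half of that design is standard consistent hashing: nodes (or virtual IDs) sit at positions on a hash ring, and a key belongs to the first node at or after the key's hash, wrapping around; in FAWN-KV that owner heads a chain of three replicas. The C sketch below illustrates only the ring lookup; the hash function, ring contents, and types are assumptions, not the FAWN-KV code.

      /* Illustrative consistent hashing: find the owner of a key on a hash ring.
       * Sketch only -- node positions, hash function, and types are assumptions. */
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      typedef struct { uint32_t ring_pos; int node_id; } vnode;

      /* FNV-1a, standing in for whatever hash places keys and nodes on the ring. */
      static uint32_t hash32(const void *buf, size_t len) {
          const unsigned char *p = buf;
          uint32_t h = 2166136261u;
          for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 16777619u; }
          return h;
      }

      /* vnodes[] must be sorted by ring_pos; the key's owner is the first
       * vnode at or after the key's hash, wrapping around the ring. */
      static int owner_of(const char *key, const vnode *vnodes, size_t n) {
          uint32_t h = hash32(key, strlen(key));
          for (size_t i = 0; i < n; i++)
              if (vnodes[i].ring_pos >= h) return vnodes[i].node_id;
          return vnodes[0].node_id;             /* wrap around */
      }

      int main(void) {
          vnode ring[] = { {0x20000000u, 1}, {0x80000000u, 2}, {0xd0000000u, 3} };
          printf("key 'foo' -> node %d\n", owner_of("foo", ring, 3));
          return 0;
      }

    Giving each physical node several virtual IDs on the ring smooths load and limits data movement when nodes join or leave.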

  • Recent Result: SplitScreen [NSDI 2010]

    • Inspired by FAWN, a...
    • New algorithm for "multi-grep": "Feed-forward Bloom Filters" (minimal sketch below)
    • Way to get 2x virus scanning speed at 1/10th the DRAM
    • (We'd be delighted to help others use this)
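    For background, the building block here is an ordinary Bloom filter: a bit array plus k hash functions that gives a compact, probabilistic set-membership test (no false negatives, tunable false positives). The C sketch below shows only that building block, not SplitScreen's feed-forward variant, which additionally carries information about which patterns could have matched forward to a cheaper second pass; the filter size and hashing scheme are assumptions.

      /* Minimal Bloom filter sketch (the building block behind feed-forward
       * Bloom filters). Sizes and hashing are assumptions. */
      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      #define BLOOM_BITS   (1u << 20)   /* 1 Mbit filter (assumed size) */
      #define BLOOM_HASHES 4

      typedef struct { uint8_t bits[BLOOM_BITS / 8]; } bloom;

      static uint64_t hash64(const void *buf, size_t len, uint64_t seed) {
          const unsigned char *p = buf;
          uint64_t h = 1469598103934665603ull ^ seed;   /* seeded FNV-1a */
          for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 1099511628211ull; }
          return h;
      }

      static void bloom_add(bloom *b, const char *s) {
          for (int i = 0; i < BLOOM_HASHES; i++) {
              uint64_t bit = hash64(s, strlen(s), i) % BLOOM_BITS;
              b->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
          }
      }

      static int bloom_maybe_contains(const bloom *b, const char *s) {
          for (int i = 0; i < BLOOM_HASHES; i++) {
              uint64_t bit = hash64(s, strlen(s), i) % BLOOM_BITS;
              if (!(b->bits[bit / 8] & (1u << (bit % 8)))) return 0; /* definitely absent */
          }
          return 1;   /* possibly present (false positives allowed) */
      }

      int main(void) {
          bloom *b = calloc(1, sizeof *b);
          bloom_add(b, "virus-signature-0001");
          printf("%d %d\n", bloom_maybe_contains(b, "virus-signature-0001"),
                            bloom_maybe_contains(b, "benign-string"));
          free(b);
          return 0;
      }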

  • Challenges for FAWN

    • Algorithms and architectures at 10x scale
    • High performance using low-capacity nodes
    • Violate programmer assumptions...

    "Each decimal order of magnitude increase in parallelism requires a major redesign and rewrite of parallel code" - Kathy Yelick

  • 512 B random reads (fast SSD)

    Platform                 Reads / Second
    i7 + X25-M               ~60 K
    Atom (1 core) + X25-M    ~23 K

    (A minimal microbenchmark sketch follows.)
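    A benchmark of this shape is typically a loop of 512-byte pread() calls at random, sector-aligned offsets with the page cache bypassed. The sketch below is a generic Linux microbenchmark along those lines, not the tester used for the numbers above; the device path, run time, and use of O_DIRECT are assumptions.

      /* Generic 512-byte random-read microbenchmark sketch (Linux).
       * Not the original test program; device path and parameters are assumptions. */
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>
      #include <unistd.h>

      int main(int argc, char **argv) {
          const char *dev = (argc > 1) ? argv[1] : "/dev/sdb";   /* assumed device */
          int fd = open(dev, O_RDONLY | O_DIRECT);
          if (fd < 0) { perror("open"); return 1; }

          off_t dev_bytes = lseek(fd, 0, SEEK_END);
          if (dev_bytes < (1 << 20)) { fprintf(stderr, "device too small\n"); return 1; }

          void *buf;
          if (posix_memalign(&buf, 512, 512)) return 1;   /* O_DIRECT needs alignment */

          long reads = 0;
          struct timespec t0, t1;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          t1 = t0;
          while (t1.tv_sec - t0.tv_sec < 10) {             /* run for ~10 seconds */
              off_t off = ((off_t)rand() * 512) % (dev_bytes - 512);
              off -= off % 512;                            /* sector-aligned offset */
              if (pread(fd, buf, 512, off) != 512) { perror("pread"); break; }
              reads++;
              clock_gettime(CLOCK_MONOTONIC, &t1);
          }

          double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
          printf("%.0f reads/sec\n", reads / secs);
          close(fd);
          return 0;
      }

    A single synchronous reader like this will understate a fast SSD; reaching rates like those above generally requires many outstanding requests (multiple threads or asynchronous I/O).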

  • Slow "wimpies"

    • Prior results: wimpies dominated in efficiency
    • What's happening here? 23K vs. 60K reads/second

    Overheads uncovered:
    • add_disk_randomness(rq->rq_disk); (Linux block layer)
    • 23,000 interrupts/second
    • The tester program called gettimeofday

    Fixed these, plus new interrupt coalescing: 37K and rising.

    Block I/O is a poor interface for IOPS.

  • Latest problem: Every now and then, a write latency of 2 seconds

    • Garbage collection on the SSD
    • But we're log-structured!

    What is the appropriate logical interface for high-performance, "server-like" use of flash?

  • Memory

    • An ongoing theme in FAWN
    • If you can save memory, you can save power and $.
    • If you want to save power and $, you have to save memory!

    Old nodes: 256 MB RAM, 4 GB flash (16x)
    New nodes: 2 GB RAM, 160 GB flash (80x)!

  • A motivating vision...

    A future memory technology (PCM??): better bandwidth, latency, and power, but far less capacity.

    Requires even more memory-efficient systems.

  • With limited DRAM


    (The slide reproduces Figure 2 and the surrounding FAWN-DS description from the FAWN-KV paper [SOSP'09]:)

    [Figure 2 diagram: a 160-bit key is mapped through an in-memory Hash Index (buckets of KeyFrag, Valid, Offset) to a Log Entry (KeyFrag, Key, Len, Data) in the Data Log; inserted values are appended. During a Split, the data region is scanned while concurrent inserts continue, and the datastore list is atomically updated to separate data in the original range from data in the new range.]

    Figure 2: (a) FAWN-DS appends writes to the end of the Data Log. (b) Split requires a sequential scan of the data region, transferring out-of-range entries to the new store. (c) After the scan is complete, the datastore list is atomically updated to add the new store. Compaction of the original store will clean up out-of-range entries.

    3.2 Understanding Flash Storage

    Flash provides a non-volatile memory store with several significant benefits over typical magnetic hard disks for random-access, read-intensive workloads, but it also introduces several challenges. Three characteristics of flash underlie the design of the FAWN-KV system described throughout this section:

    1. Fast random reads: (≪ 1 ms), up to 175 times faster than random reads on magnetic disk [35, 40].

    2. Efficient I/O: Flash devices consume less than one Watt even under heavy load, whereas mechanical disks can consume over 10 W at load. Flash is over two orders of magnitude more efficient than mechanical disks in terms of queries/Joule.

    3. Slow random writes: Small writes on flash are very expensive. Updating a single page requires first erasing an entire erase block (128 KB–256 KB) of pages, and then writing the modified block in its entirety. As a result, updating a single byte of data is as expensive as writing an entire block of pages [37].

    Modern devices improve random write performance using write buffering and preemptive block erasure. These techniques improve performance for short bursts of writes, but recent studies show that sustained random writes still perform poorly on these devices [40].
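    To make the write-amplification penalty concrete, assume (hypothetically) 4 KB flash pages inside a 256 KB erase block, i.e. 64 pages per block. A naive in-place update of one page then forces an erase and rewrite of the whole block:

      256 KB rewritten / 4 KB requested = 64x write amplification

    which is why appending sequentially, as a log-structured design does, is so much cheaper than issuing small random writes.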

    These performance problems motivate log-structured techniques for flash filesystems and data structures [36, 37, 23]. These same considerations inform the design of FAWN's node storage management system, described next.

    3.3 The FAWN Data Store

    FAWN-DS is a log-structured key-value store. Each store contains values for the key range associated with one virtual ID. It acts to clients like a disk-based hash table that supports Store, Lookup, and Delete.¹

    FAWN-DS is designed specifically to perform well on flash storage and to operate within the constrained DRAM available on wimpy nodes: all writes to the datastore are sequential, and reads require a single random access. To provide this property, FAWN-DS maintains an in-DRAM hash table (Hash Index) that maps keys to an offset in the append-only Data Log on flash (Figure 2a). This log-structured design is similar to several append-only filesystems [42, 15], which avoid random seeks on magnetic disks for writes.

    ¹We differentiate datastore from database to emphasize that we do not provide a transactional or relational interface.

    /* KEY = 0x93df7317294b99e3e049, 16 index bits */
    INDEX = KEY & 0xffff;             /* = 0xe049 */
    KEYFRAG = (KEY >> 16) & 0x7fff;   /* = 0x19e3 */
    for i = 0 to NUM_HASHES do
        bucket = hash[i](INDEX);
        if bucket.valid && bucket.keyfrag == KEYFRAG &&
           readKey(bucket.offset) == KEY then
            return bucket;
        end if
        {Check next chain element...}
    end for
    return NOT_FOUND;

    Figure 3: Pseudocode for hash bucket lookup in FAWN-DS.

    Mapping a Key to a Value. FAWN-DS uses an in-memory (DRAM) Hash Index to map 160-bit keys to a value stored in the Data Log. It stores only a fragment of the actual key in memory to find a location in the log; it then reads the full key (and the value) from the log and verifies that the key it read was, in fact, the correct key. This design trades a small and configurable chance of requiring two reads from flash (we set it to roughly 1 in 32,768 accesses) for drastically reduced memory requirements (only six bytes of DRAM per key-value pair).

    Figure 3 shows the pseudocode that implements this design for Lookup. FAWN-DS extracts two fields from the 160-bit key: the i low-order bits of the key (the index bits) and the next 15 low-order bits (the key fragment). FAWN-DS uses the index bits to select a bucket from the Hash Index, which contains 2^i hash buckets. Each bucket is only six bytes: a 15-bit key fragment, a valid bit, and a 4-byte pointer to the location in the Data Log where the full entry is stored.

    Lookup proceeds, then, by locating a bucket using the index bits and comparing the key against the key fragment. If the fragments do not match, FAWN-DS uses hash chaining to continue searching the hash table. Once it finds a matching key fragment, FAWN-DS reads the record off of the flash. If the stored full key in the on-flash record matches the desired lookup key, the operation is complete. Otherwise, FAWN-DS resumes its hash chaining search of the in-memory hash table and searches additional records. With the 15-bit key fragment, only 1 in 32,768 retrievals from the flash will be incorrect and require fetching an additional record.

    The constants involved (15 bits of key fragment, 4 bytes of log pointer) target the prototype FAWN nodes described in Section 4.
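    As a concrete reading of the six-byte bucket and the Figure 3 lookup, here is a compilable C sketch. It is illustrative only: the struct packing, the probing scheme, the in-memory stand-in for the Data Log, and the big-endian key bytes are assumptions, not the FAWN-DS code.

      /* Illustrative C rendering of the six-byte FAWN-DS bucket and the Figure 3
       * lookup. Sketch only: packing, probing, the fake log, and key byte order
       * are assumptions. */
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      #define INDEX_BITS  16                     /* i index bits (the example uses 16) */
      #define NUM_BUCKETS (1u << INDEX_BITS)     /* 2^i hash buckets */
      #define NUM_HASHES  8                      /* probe limit along the chain */

      #pragma pack(push, 1)
      typedef struct {                  /* six bytes per bucket */
          uint16_t keyfrag_valid;       /* low 15 bits: key fragment; top bit: valid */
          uint32_t offset;              /* location of the full entry in the Data Log */
      } bucket_t;
      #pragma pack(pop)

      typedef uint8_t key160[20];       /* 160-bit key, big-endian bytes (assumed) */

      static bucket_t table[NUM_BUCKETS];
      static key160   fake_log[1024];   /* stand-in for the on-flash Data Log */
      static uint32_t log_tail;

      static uint32_t index_of(const key160 k)   { return (uint32_t)k[19] | ((uint32_t)k[18] << 8); }
      static uint16_t keyfrag_of(const key160 k) { return (uint16_t)(((uint16_t)k[17] | ((uint16_t)k[16] << 8)) & 0x7fff); }
      static uint32_t probe(int i, uint32_t idx) { return (idx + i) % NUM_BUCKETS; }

      /* Store: append the full key (value omitted here) and fill a free bucket. */
      static int fawnds_store(const key160 key) {
          for (int i = 0; i < NUM_HASHES; i++) {
              bucket_t *b = &table[probe(i, index_of(key))];
              if (!(b->keyfrag_valid & 0x8000)) {
                  memcpy(fake_log[log_tail], key, sizeof(key160));   /* append to "log" */
                  b->keyfrag_valid = (uint16_t)(0x8000 | keyfrag_of(key));
                  b->offset = log_tail++;
                  return 0;
              }
          }
          return -1;                                                 /* chain full */
      }

      /* Lookup (Figure 3): match the 15-bit fragment, then confirm with one read
       * of the full key from the log; a mismatch means a fragment collision. */
      static const bucket_t *fawnds_lookup(const key160 key) {
          for (int i = 0; i < NUM_HASHES; i++) {
              bucket_t *b = &table[probe(i, index_of(key))];
              if ((b->keyfrag_valid & 0x8000) &&
                  (b->keyfrag_valid & 0x7fff) == keyfrag_of(key) &&
                  memcmp(fake_log[b->offset], key, sizeof(key160)) == 0)
                  return b;
          }
          return NULL;                                               /* NOT_FOUND */
      }

      int main(void) {
          key160 k = { 0x93, 0xdf, 0x73, 0x17, 0x29, 0x4b, 0x99, 0xe3, 0xe0, 0x49 };
          fawnds_store(k);
          printf("bucket size = %zu bytes, found = %d\n",
                 sizeof(bucket_t), fawnds_lookup(k) != NULL);
          return 0;
      }

    The real FAWN-DS hashes the index bits (hash[i](INDEX)) and has its own chaining discipline; the sketch only fixes the six-byte bucket layout and the verify-against-the-full-key step.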


    Slide annotations on the figure:
    • 12 bytes per entry (hashtable in DRAM, data region on flash)
    • KeyFrag != Key, so potential collisions! Low probability of multiple flash reads.
    • New nodes have 5x lower memory/flash ratio... (worked out below)
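    The "5x" follows directly from the node specs quoted on the Memory slide:

      Old nodes: 4 GB flash / 256 MB DRAM  = 16 bytes of flash per byte of DRAM
      New nodes: 160 GB flash / 2 GB DRAM  = 80 bytes of flash per byte of DRAM
      80 / 16 = 5

    so each byte of DRAM must now index five times as much flash, motivating the even more memory-efficient indexes on the next slides.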

  • Step 1: Split into current and old stores

    [Diagram: two datastores, each with its own DRAM index. The old store is static; inserts go only to the current store; queries consult both. A minimal sketch follows.]
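    A minimal, self-contained sketch of that query path, using a trivial array-backed stand-in for the per-range stores (the store type and helpers here are hypothetical, not the FAWN-DS interface): inserts append only to the current store, and a lookup checks the current store before falling back to the frozen old one.

      /* Hypothetical sketch of the query path after a split: inserts go only to
       * the current store, queries check the current store then the static old
       * store. The tiny array-backed store below is a stand-in, not FAWN-DS. */
      #include <stdio.h>
      #include <string.h>

      #define MAX_ENTRIES 128

      typedef struct {                      /* stand-in store: tiny in-memory index */
          const char *keys[MAX_ENTRIES];
          const char *vals[MAX_ENTRIES];
          int n;
      } store_t;

      static const char *store_lookup(const store_t *s, const char *key) {
          for (int i = s->n - 1; i >= 0; i--)        /* newest entry wins */
              if (strcmp(s->keys[i], key) == 0) return s->vals[i];
          return NULL;
      }

      static void store_append(store_t *s, const char *key, const char *val) {
          s->keys[s->n] = key; s->vals[s->n] = val; s->n++;
      }

      typedef struct {
          store_t *current;   /* receives all new inserts */
          store_t *old;       /* static: frozen at the split, no new writes */
      } split_store;

      static void split_put(split_store *ss, const char *k, const char *v) {
          store_append(ss->current, k, v);           /* inserts: current store only */
      }

      static const char *split_get(const split_store *ss, const char *k) {
          const char *v = store_lookup(ss->current, k);
          return v ? v : store_lookup(ss->old, k);   /* fall back to the old store */
      }

      int main(void) {
          store_t old_store = {0}, cur_store = {0};
          store_append(&old_store, "k1", "old-value");   /* data from before the split */
          split_store ss = { &cur_store, &old_store };
          split_put(&ss, "k2", "new-value");
          printf("%s %s\n", split_get(&ss, "k1"), split_get(&ss, "k2"));
          return 0;
      }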

  • Memory-efficient indexes

                           Sorted                              Hashing
    Lookup                 10 µs per lookup (still             < 2 µs per lookup
                           CPU-intense)
    DRAM index size        ~2 bits per key                     ~2 bits per key
    Space use on flash     100%                                93%
    Other                  Merging is easy; can discard        Range queries harder
                           the index

  • Wrapup

    • "FAWN"damental efficiency from the architecture
    • But no free lunch! OS, distributed systems, and algorithmic challenges along the path. Really fun ones.
    • Current big focuses:
      • Memory efficiency is our bogey-man.
      • Systems cope poorly with high IOPS.
    • Progressing well; just brought up a new cluster.
    • p.s. Most common question? "What about the network?" :) [SIGCOMM forward ref]