FAWN-Forward 2010
David Andersen, Vijay Vasudevan, Michael Kaminsky*, Amar Phanishayee, Lawrence Tan, Jason Franklin, Hyeontaek Lim, Bin Fan, Alex Gartrell, Ben Blum, Nathan Wan
Carnegie Mellon University, *Intel Labs Pittsburgh
-
1.6 GHz Atom, 2 GB DRAM, 160 GB SSD
“FAWN” Microservers: Amdahl Blades, “Gordon”, SeaMicro, Smooth-Stone...
Arrays of low(er)-power nodes can serve datacenter workloads at roughly 1/3rd the power.
-
Energy in computing
Power is a significant burden on computing: the 3-year total cost of ownership will soon be dominated by power.
3
-
Theme Research Project Process
Towards balanced data-intensive systems
4
[Figure: latency in nanoseconds (log scale, 1 to 100,000,000) vs. year, 1980–2005, plotting Disk Seek, DRAM Access, and CPU Cycle; the widening gap between them is wasted resources.]
What’s the most energy-efficient way to balance computation, memory, and I/O?
-
Targeting the sweet spot in efficiency
5
Efficiency vs. Speed
(Includes 0.1 W power overhead; numbers from spec sheets)
Fixed costs dominate; the fastest processors may not be the most efficient.
FAWN targets the sweet spot in efficiency: slower CPU + Flash storage
-
My perspective on this
• FAWN-KV key-value store [SOSP’09]
• Lesson 1: It worked!
  • < 1 ms lookup latency (on a 6-year-old node)
  • Drastically improved energy efficiency
• Lesson 2: It took a lot of engineering...
  • And even some new algorithms
  • Finding/making balanced HW was challenging
-
Initial target: Key-value storage systems
• Critical infrastructure service
• Performance-conscious
• Random-access, read-mostly, hard to cache
7
-
FAWN-KV Results
• Log-structured datastore
  • Flash-optimized: no random writes, fast random reads
  • Memory-efficient index (~12 bytes per K-V pair) to locate items in the log
• Consistent hashing + chain replication for consistent primary-backup (3x) replication and load balancing
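The partitioning half of that bullet can be sketched as a toy consistent-hash ring with a few virtual IDs per physical node; the replica chain for a key is its owner plus the next distinct successors clockwise. All names here are hypothetical (the real system hashes 160-bit keys and manages membership separately):

```python
import hashlib
from bisect import bisect_right

def ring_hash(s: str) -> int:
    # Hash a string onto the ring (SHA-1, matching FAWN-KV's 160-bit keyspace).
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class Ring:
    """Toy consistent-hash ring; each physical node owns several virtual IDs."""
    def __init__(self, nodes, vids_per_node=3):
        pts = sorted((ring_hash(f"{n}-vid{i}"), n)
                     for n in nodes for i in range(vids_per_node))
        self.hashes = [h for h, _ in pts]
        self.owners = [n for _, n in pts]

    def chain(self, key: str, r=3):
        """Walk clockwise from the key's position, collecting the first r
        distinct physical nodes: head owns the key, the rest are replicas."""
        i = bisect_right(self.hashes, ring_hash(key))
        chain = []
        for step in range(len(self.owners)):
            n = self.owners[(i + step) % len(self.owners)]
            if n not in chain:
                chain.append(n)
            if len(chain) == r:
                break
        return chain

print(Ring(["n0", "n1", "n2", "n3"]).chain("user:42"))
```

Because virtual IDs scatter each node around the ring, adding or removing a node only moves the key ranges adjacent to its virtual IDs, which is what makes splits and joins incremental.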
• Already published at SOSP last year. Tell us something new!
8
-
Recent result: SplitScreen [NSDI 2010]
• Inspired by FAWN, a new algorithm for “multi-grep”: “Feed-forward Bloom Filters”
• A way to get 2x virus-scanning speed at 1/10th the DRAM
• (We’d be delighted to help others use this)
9
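The slide names the technique but not the mechanism. As a loose sketch of the general idea (a cheap probabilistic prescan so the expensive exact matcher only sees survivors), here is a plain Bloom filter used as a first pass over fixed-length text windows. This is not the full feed-forward construction from the paper, which additionally feeds the surviving text back to shrink the pattern set; all names and sizes below are illustrative:

```python
import hashlib

class Bloom:
    """Minimal Bloom filter: a bit array probed by k derived hashes."""
    def __init__(self, m_bits=1 << 16, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: bytes):
        for i in range(self.k):
            h = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes):
        # May report false positives, never false negatives.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# Hypothetical fixed-length "virus signatures" and a scanned buffer.
sigs = [b"EVILCODE", b"BADBYTES"]
f = Bloom()
for s in sigs:
    f.add(s)

data = b"xxEVILCODEyy"
w = len(sigs[0])
# Pass 1: cheap filter over every window; survivors are rare.
candidates = {data[i:i+w] for i in range(len(data) - w + 1) if data[i:i+w] in f}
# Pass 2: exact match only on the survivors.
hits = candidates & set(sigs)
print(hits)
```

The memory win comes from pass 1 touching only the small bit array instead of the full signature set, which is what lets the scan fit in a fraction of the DRAM.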
-
Challenges for FAWN
• Algorithms and Architectures at 10x scale
• High performance using low capacity nodes• Violate programmer assumptions...
“Each decimal order of magnitude increase in parallelism requires a major redesign and rewrite of parallel code” - Kathy Yelick
10
-
512 B random reads (fast SSD)
11
Platform                Reads / second
i7 + X25-M              ~60 K
Atom (1 core) + X25-M   ~23 K
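A measurement loop along these lines shows where the table's numbers come from, structurally. This is only a sketch: it reads through the page cache rather than using O_DIRECT against a raw device, so on a warm file it will report far more than real device IOPS. Names and sizes are illustrative:

```python
import os
import random
import tempfile
import time

def measure_random_reads(path, read_size=512, n_reads=10_000):
    """Issue n_reads random pread()s of read_size bytes; return reads/sec.
    (A real benchmark would bypass the page cache with O_DIRECT.)"""
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size
    offsets = [random.randrange(0, size - read_size) for _ in range(n_reads)]
    t0 = time.perf_counter()
    for off in offsets:
        os.pread(fd, read_size, off)
    dt = time.perf_counter() - t0
    os.close(fd)
    return n_reads / dt

# Demo against a 1 MiB scratch file (cached, so the number is inflated).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(1 << 20))
print(f"{measure_random_reads(f.name):,.0f} reads/sec")
```

Using `os.pread` keeps each read a single syscall with an explicit offset, which mirrors how a server issues independent random I/Os without a shared file cursor.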
-
Slow “wimpies”
• Prior results: wimpies dominated in efficiency
• What’s happening here? 23 K vs. 60 K
12
add_disk_randomness(rq->rq_disk); → 23,000 interrupts/second
The tester program called gettimeofday
Fixed these, plus new interrupt coalescing: 37 K and rising
Block I/O is a poor interface for IOPS
-
• Latest problem: every now and then, a write latency of 2 seconds
• Garbage collection on the SSD. But we’re log-structured!
13
What is the appropriate logical interface for high-performance, “server-like” use of flash?
-
Memory
• Ongoing theme in FAWN
• If you can save memory, you can save power and $.
• If you want to save power and $, you have to save memory!
14
Old nodes: 256 MB RAM, 4 GB flash (16x)
New nodes: 2 GB RAM, 160 GB flash (80x)!
-
A motivating vision...
15
(PCM??)
+ Bandwidth, + Latency, + Power; -- Capacity
Requires even more memory-efficient systems
-
With limited DRAM
16
[Figure 2 diagram: a 160-bit key yields a KeyFrag; in-memory Hash Index buckets (KeyFrag, Valid, Offset) point into the append-only Data Log on flash (Key, Len, Data). Inserted values are appended. During a Split, a scan moves data in the new range to a second datastore while concurrent inserts continue; the datastore list is then atomically updated.]
Figure 2: (a) FAWN-DS appends writes to the end of the Data Log. (b) Split requires a sequential scan of the data region, transferring out-of-range entries to the new store. (c) After the scan is complete, the datastore list is atomically updated to add the new store. Compaction of the original store will clean up out-of-range entries.
3.2 Understanding Flash Storage
Flash provides a non-volatile memory store with several significant benefits over typical magnetic hard disks for random-access, read-intensive workloads—but it also introduces several challenges. Three characteristics of flash underlie the design of the FAWN-KV system described throughout this section:

1. Fast random reads: (≪ 1 ms), up to 175 times faster than random reads on magnetic disk [35, 40].

2. Efficient I/O: Flash devices consume less than one Watt even under heavy load, whereas mechanical disks can consume over 10 W at load. Flash is over two orders of magnitude more efficient than mechanical disks in terms of queries/Joule.

3. Slow random writes: Small writes on flash are very expensive. Updating a single page requires first erasing an entire erase block (128 KB–256 KB) of pages, and then writing the modified block in its entirety. As a result, updating a single byte of data is as expensive as writing an entire block of pages [37].

Modern devices improve random write performance using write buffering and preemptive block erasure. These techniques improve performance for short bursts of writes, but recent studies show that sustained random writes still perform poorly on these devices [40].

These performance problems motivate log-structured techniques for flash filesystems and data structures [36, 37, 23]. These same considerations inform the design of FAWN’s node storage management system, described next.
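A quick back-of-envelope makes point 3 concrete. The text gives only the erase-block range; the 4 KB page size below is an assumption for illustration:

```python
# Why small in-place flash writes are expensive.
erase_block = 256 * 1024   # bytes per erase block (128-256 KB per the text)
page = 4 * 1024            # typical flash page size (assumption)

# Rewriting one page in place forces an erase + rewrite of the whole block:
write_amplification = erase_block / page
print(write_amplification)  # 64.0: one 4 KB update moves 256 KB of data
```

A log-structured store sidesteps this entirely: appends fill each block sequentially, so every block is written once before it must be erased.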
3.3 The FAWN Data Store
FAWN-DS is a log-structured key-value store. Each store contains values for the key range associated with one virtual ID. It acts to clients like a disk-based hash table that supports Store, Lookup, and Delete.¹

FAWN-DS is designed specifically to perform well on flash storage and to operate within the constrained DRAM available on wimpy nodes: all writes to the datastore are sequential, and reads require a single random access. To provide this property, FAWN-DS maintains an in-DRAM hash table (Hash Index) that maps keys to an offset in the append-only Data Log on flash (Figure 2a). This log-structured design is similar to several append-only filesystems [42, 15], which avoid random seeks on magnetic disks for writes.

¹We differentiate datastore from database to emphasize that we do not provide a transactional or relational interface.
/* KEY = 0x93df7317294b99e3e049, 16 index bits */
INDEX = KEY & 0xffff;            /* = 0xe049 */
KEYFRAG = (KEY >> 16) & 0x7fff;  /* = 0x19e3 */
for i = 0 to NUM_HASHES do
    bucket = hash[i](INDEX);
    if bucket.valid && bucket.keyfrag == KEYFRAG &&
       readKey(bucket.offset) == KEY then
        return bucket;
    end if
    {Check next chain element...}
end for
return NOT_FOUND;

Figure 3: Pseudocode for hash bucket lookup in FAWN-DS.
Mapping a Key to a Value. FAWN-DS uses an in-memory (DRAM) Hash Index to map 160-bit keys to a value stored in the Data Log. It stores only a fragment of the actual key in memory to find a location in the log; it then reads the full key (and the value) from the log and verifies that the key it read was, in fact, the correct key. This design trades a small and configurable chance of requiring two reads from flash (we set it to roughly 1 in 32,768 accesses) for drastically reduced memory requirements (only six bytes of DRAM per key-value pair).

Figure 3 shows the pseudocode that implements this design for Lookup. FAWN-DS extracts two fields from the 160-bit key: the i low-order bits of the key (the index bits) and the next 15 low-order bits (the key fragment). FAWN-DS uses the index bits to select a bucket from the Hash Index, which contains 2^i hash buckets. Each bucket is only six bytes: a 15-bit key fragment, a valid bit, and a 4-byte pointer to the location in the Data Log where the full entry is stored.

Lookup proceeds, then, by locating a bucket using the index bits and comparing the key against the key fragment. If the fragments do not match, FAWN-DS uses hash chaining to continue searching the hash table. Once it finds a matching key fragment, FAWN-DS reads the record off of the flash. If the stored full key in the on-flash record matches the desired lookup key, the operation is complete. Otherwise, FAWN-DS resumes its hash chaining search of the in-memory hash table and searches additional records. With the 15-bit key fragment, only 1 in 32,768 retrievals from the flash will be incorrect and require fetching an additional record.

The constants involved (15 bits of key fragment, 4 bytes of log pointer) target the prototype FAWN nodes described in Section 4.
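As a sanity check, the Figure 3 lookup can be turned into runnable form. This is a loose sketch, not the real store: buckets live in a Python dict rather than a packed 6-byte array, the "flash" Data Log is a list, and `hash_i` is an arbitrary stand-in for the paper's hash functions:

```python
NUM_HASHES = 4        # bound on the hash chain (assumed)
INDEX_BITS = 16
KEYFRAG_BITS = 15

log = []              # "flash": append-only list of (full_key, value) records
table = {}            # bucket id -> (keyfrag, offset); presence == valid bit

def hash_i(i, index):
    # i-th hash over the index bits (a simple multiplicative mix; assumption).
    return (index * 2654435761 + i * 0x9E3779B9) & 0xFFFF

def fields(key):
    index = key & ((1 << INDEX_BITS) - 1)
    keyfrag = (key >> INDEX_BITS) & ((1 << KEYFRAG_BITS) - 1)
    return index, keyfrag

def store(key, value):
    log.append((key, value))          # sequential write to the Data Log
    index, keyfrag = fields(key)
    for i in range(NUM_HASHES):
        b = hash_i(i, index)
        if b not in table:
            table[b] = (keyfrag, len(log) - 1)
            return
    raise RuntimeError("all chain slots full")

def lookup(key):
    index, keyfrag = fields(key)
    for i in range(NUM_HASHES):
        b = table.get(hash_i(i, index))
        if b and b[0] == keyfrag:
            full_key, value = log[b[1]]  # the single flash read
            if full_key == key:          # verify: keyfrags can collide
                return value
    return None

store(0x93df7317294b99e3e049, b"value")
print(lookup(0x93df7317294b99e3e049))
```

The verify-after-read step is the crux: the in-memory bucket never stores the full key, so a matching 15-bit fragment is only probable evidence, and the full key on flash is the ground truth.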
~12 bytes per entry in the DRAM hashtable; data region on flash.
KeyFrag != Key: potential collisions! But low probability of multiple flash reads.
New nodes have a 5x lower memory/flash ratio...
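A rough sketch of why that ratio bites, assuming the slide's ~12 bytes of index per entry and a hypothetical 256-byte average value size:

```python
# Index DRAM needed to cover a given amount of flash.
def index_dram_gb(flash_bytes, value_bytes, index_bytes_per_entry):
    entries = flash_bytes / value_bytes
    return entries * index_bytes_per_entry / 1e9

old = index_dram_gb(4e9, 256, 12)    # old node: 4 GB flash
new = index_dram_gb(160e9, 256, 12)  # new node: 160 GB flash, same value size
print(old, new)  # ~0.19 GB vs 7.5 GB -- the new node's 2 GB DRAM can't hold it
```

Under these (illustrative) numbers, the old 256 MB nodes could afford the index but the new 2 GB nodes cannot, which is what forces the more memory-efficient index designs on the next slides.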
-
Step 1: Split into current and old stores
17
[Diagram: queries consult both DRAM indexes; the old store is static, and inserts go only to the current store.]
-
Memory-efficient indexes
18
Sorted: still CPU-intense, ~10 µs per lookup; 100% space use on flash; ~2 bits of DRAM per key; merging is easy; can discard the index.
Hashing: < 2 µs per lookup; 93% space use on flash; ~2 bits of DRAM per key; range queries harder.
-
Wrapup
• “FAWN”damental efficiency from the architecture; but no free lunch! OS, distributed-systems, and algorithmic challenges along the path. Really fun ones.
• Current big focuses:
  • Memory efficiency is our bogeyman.
  • Systems cope poorly with high IOPS.
• Progressing well; just brought up a new cluster.
• p.s. Most common question? “What about the network?” :) [SIGCOMM forward ref]
19