
  • FAWN-Forward 2010

    David Andersen, Vijay Vasudevan, Michael Kaminsky*, Amar Phanishayee, Lawrence Tan, Jason Franklin, Hyeontaek Lim, Bin Fan, Alex Gartrell, Ben Blum, Nathan Wan

    Carnegie Mellon University    *Intel Labs Pittsburgh

  • "FAWN" Microservers

    1.6 GHz Atom, 2 GB DRAM, 160 GB SSD

    Amdahl Blades, "Gordon", SeaMicro, Smooth-Stone...

    Arrays of low(er)-power nodes can serve datacenter workloads at roughly 1/3rd the power.

  • Energy in computing

    Power is a significant burden on computing.

    3-year total cost of ownership soon to be dominated by power.

  • Theme: Towards balanced data-intensive systems

    [Chart: latency in nanoseconds (log scale, 1 to 100,000,000) vs. year, 1980-2005, for Disk Seek, DRAM Access, and CPU Cycle; the widening gap between them is labeled "Wasted resources".]

    What's the most energy-efficient way to balance computation, memory, and I/O?

  • Theme: Targeting the sweet spot in efficiency

    [Chart: Efficiency vs. Speed (includes a 0.1 W power overhead; numbers from spec sheets).]

    Fixed costs dominate; the fastest processors may not be the most efficient.

    FAWN targets the sweet spot in efficiency: slower CPU + flash storage.

  • My perspective on this

    • FAWN-KV key-value store [SOSP'09]
    • Lesson 1: It worked!
      • < 1 ms lookup latency (on a 6-year-old node)
      • Drastically improved energy efficiency
    • Lesson 2: It took a lot of engineering...
      • And even some new algorithms
      • Finding/making balanced HW was challenging

  • Initial Target: Key-value storage systems

    • Critical infrastructure service
    • Performance-conscious
    • Random-access, read-mostly, hard to cache

  • FAWN-KV Results

    • Log-structured datastore
      • Flash-optimized: no random writes, fast random reads
      • Memory-efficient index (~12 bytes per K-V pair) to locate items in the log
    • Consistent hashing + chain replication for consistent, primary-backup (3x) replication and load balancing (see the sketch below)
    • Already published at SOSP last year. Tell us something new!
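    The node-selection half of that design is standard consistent hashing: nodes (or virtual IDs) sit at positions on a hash ring, and a key belongs to the first node at or after the key's hash, wrapping around; in FAWN-KV that owner heads a chain of three replicas. The C sketch below illustrates only the ring lookup; the hash function, ring contents, and types are assumptions, not the FAWN-KV code.

      /* Illustrative consistent hashing: find the owner of a key on a hash ring.
       * Sketch only -- node positions, hash function, and types are assumptions. */
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      typedef struct { uint32_t ring_pos; int node_id; } vnode;

      /* FNV-1a, standing in for whatever hash places keys and nodes on the ring. */
      static uint32_t hash32(const void *buf, size_t len) {
          const unsigned char *p = buf;
          uint32_t h = 2166136261u;
          for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 16777619u; }
          return h;
      }

      /* vnodes[] must be sorted by ring_pos; the key's owner is the first
       * vnode at or after the key's hash, wrapping around the ring. */
      static int owner_of(const char *key, const vnode *vnodes, size_t n) {
          uint32_t h = hash32(key, strlen(key));
          for (size_t i = 0; i < n; i++)
              if (vnodes[i].ring_pos >= h) return vnodes[i].node_id;
          return vnodes[0].node_id;             /* wrap around */
      }

      int main(void) {
          vnode ring[] = { {0x20000000u, 1}, {0x80000000u, 2}, {0xd0000000u, 3} };
          printf("key 'foo' -> node %d\n", owner_of("foo", ring, 3));
          return 0;
      }

    Giving each physical node several virtual IDs on the ring smooths load and limits data movement when nodes join or leave.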

  • Recent Result: SplitScreen [NSDI 2010]

    • Inspired by FAWN, a...
    • New algorithm for "multi-grep": "Feed-forward Bloom Filters" (minimal sketch below)
    • Way to get 2x virus scanning speed at 1/10th the DRAM
    • (We'd be delighted to help others use this)
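    For background, the building block here is an ordinary Bloom filter: a bit array plus k hash functions that gives a compact, probabilistic set-membership test (no false negatives, tunable false positives). The C sketch below shows only that building block, not SplitScreen's feed-forward variant, which additionally carries information about which patterns could have matched forward to a cheaper second pass; the filter size and hashing scheme are assumptions.

      /* Minimal Bloom filter sketch (the building block behind feed-forward
       * Bloom filters). Sizes and hashing are assumptions. */
      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      #define BLOOM_BITS   (1u << 20)   /* 1 Mbit filter (assumed size) */
      #define BLOOM_HASHES 4

      typedef struct { uint8_t bits[BLOOM_BITS / 8]; } bloom;

      static uint64_t hash64(const void *buf, size_t len, uint64_t seed) {
          const unsigned char *p = buf;
          uint64_t h = 1469598103934665603ull ^ seed;   /* seeded FNV-1a */
          for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 1099511628211ull; }
          return h;
      }

      static void bloom_add(bloom *b, const char *s) {
          for (int i = 0; i < BLOOM_HASHES; i++) {
              uint64_t bit = hash64(s, strlen(s), i) % BLOOM_BITS;
              b->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
          }
      }

      static int bloom_maybe_contains(const bloom *b, const char *s) {
          for (int i = 0; i < BLOOM_HASHES; i++) {
              uint64_t bit = hash64(s, strlen(s), i) % BLOOM_BITS;
              if (!(b->bits[bit / 8] & (1u << (bit % 8)))) return 0; /* definitely absent */
          }
          return 1;   /* possibly present (false positives allowed) */
      }

      int main(void) {
          bloom *b = calloc(1, sizeof *b);
          bloom_add(b, "virus-signature-0001");
          printf("%d %d\n", bloom_maybe_contains(b, "virus-signature-0001"),
                            bloom_maybe_contains(b, "benign-string"));
          free(b);
          return 0;
      }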

  • Challenges for FAWN

    • Algorithms and architectures at 10x scale
    • High performance using low-capacity nodes
    • Violate programmer assumptions...

    "Each decimal order of magnitude increase in parallelism requires a major redesign and rewrite of parallel code" - Kathy Yelick

  • 512 B random reads (fast SSD)

    Platform                 Reads / Second
    i7 + X25-M               ~60 K
    Atom (1 core) + X25-M    ~23 K

    (A minimal microbenchmark sketch follows.)
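    A benchmark of this shape is typically a loop of 512-byte pread() calls at random, sector-aligned offsets with the page cache bypassed. The sketch below is a generic Linux microbenchmark along those lines, not the tester used for the numbers above; the device path, run time, and use of O_DIRECT are assumptions.

      /* Generic 512-byte random-read microbenchmark sketch (Linux).
       * Not the original test program; device path and parameters are assumptions. */
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>
      #include <unistd.h>

      int main(int argc, char **argv) {
          const char *dev = (argc > 1) ? argv[1] : "/dev/sdb";   /* assumed device */
          int fd = open(dev, O_RDONLY | O_DIRECT);
          if (fd < 0) { perror("open"); return 1; }

          off_t dev_bytes = lseek(fd, 0, SEEK_END);
          if (dev_bytes < (1 << 20)) { fprintf(stderr, "device too small\n"); return 1; }

          void *buf;
          if (posix_memalign(&buf, 512, 512)) return 1;   /* O_DIRECT needs alignment */

          long reads = 0;
          struct timespec t0, t1;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          t1 = t0;
          while (t1.tv_sec - t0.tv_sec < 10) {             /* run for ~10 seconds */
              off_t off = ((off_t)rand() * 512) % (dev_bytes - 512);
              off -= off % 512;                            /* sector-aligned offset */
              if (pread(fd, buf, 512, off) != 512) { perror("pread"); break; }
              reads++;
              clock_gettime(CLOCK_MONOTONIC, &t1);
          }

          double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
          printf("%.0f reads/sec\n", reads / secs);
          close(fd);
          return 0;
      }

    A single synchronous reader like this will understate a fast SSD; reaching rates like those above generally requires many outstanding requests (multiple threads or asynchronous I/O).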

  • Slow "wimpies"

    • Prior results: wimpies dominated in efficiency
    • What's happening here? 23K vs. 60K reads/second

    Overheads uncovered:
    • add_disk_randomness(rq->rq_disk); (Linux block layer)
    • 23,000 interrupts/second
    • The tester program called gettimeofday

    Fixed these, plus new interrupt coalescing: 37K and rising.

    Block I/O is a poor interface for IOPS.

  • Latest problem: Every now and then, a write latency of 2 seconds

    • Garbage collection on the SSD
    • But we're log-structured!

    What is the appropriate logical interface for high-performance, "server-like" use of flash?

  • Memory

    • An ongoing theme in FAWN
    • If you can save memory, you can save power and $.
    • If you want to save power and $, you have to save memory!

    Old nodes: 256 MB RAM, 4 GB flash (16x)
    New nodes: 2 GB RAM, 160 GB flash (80x)!

  • A motivating vision...

    A future memory technology (PCM??): better bandwidth, latency, and power, but far less capacity.

    Requires even more memory-efficient systems.

  • With limited DRAM


    (The slide reproduces Figure 2 and the surrounding FAWN-DS description from the FAWN-KV paper [SOSP'09]:)

    [Figure 2 diagram: a 160-bit key is mapped through an in-memory Hash Index (buckets of KeyFrag, Valid, Offset) to a Log Entry (KeyFrag, Key, Len, Data) in the Data Log; inserted values are appended. During a Split, the data region is scanned while concurrent inserts continue, and the datastore list is atomically updated to separate data in the original range from data in the new range.]

    Figure 2: (a) FAWN-DS appends writes to the end of the Data Log. (b) Split requires a sequential scan of the data region, transferring out-of-range entries to the new store. (c) After the scan is complete, the datastore list is atomically updated to add the new store. Compaction of the original store will clean up out-of-range entries.

    3.2 Understanding Flash Storage

    Flash provides a non-volatile memory store with several significant benefits over typical magnetic hard disks for random-access, read-intensive workloads, but it also introduces several challenges. Three characteristics of flash underlie the design of the FAWN-KV system described throughout this section:

    1. Fast random reads: (≪ 1 ms), up to 175 times faster than random reads on magnetic disk [35, 40].

    2. Efficient I/O: Flash devices consume less than one Watt even under heavy load, whereas mechanical disks can consume over 10 W at load. Flash is over two orders of magnitude more efficient than mechanical disks in terms of queries/Joule.

    3. Slow random writes: Small writes on flash are very expensive. Updating a single page requires first erasing an entire erase block (128 KB–256 KB) of pages, and then writing the modified block in its entirety. As a result, updating a single byte of data is as expensive as writing an entire block of pages [37].

    Modern devices improve random write performance using write buffering and preemptive block erasure. These techniques improve performance for short bursts of writes, but recent studies show that sustained random writes still perform poorly on these devices [40].
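    To make the write-amplification penalty concrete, assume (hypothetically) 4 KB flash pages inside a 256 KB erase block, i.e. 64 pages per block. A naive in-place update of one page then forces an erase and rewrite of the whole block:

      256 KB rewritten / 4 KB requested = 64x write amplification

    which is why appending sequentially, as a log-structured design does, is so much cheaper than issuing small random writes.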

    These performance problems motivate log-structured techniques for flash filesystems and data structures [36, 37, 23]. These same considerations inform the design of FAWN's node storage management system, described next.

    3.3 The FAWN Data Store

    FAWN-DS is a log-structured key-value store. Each store contains values for the key range associated with one virtual ID. It acts to clients like a disk-based hash table that supports Store, Lookup, and Delete.¹

    FAWN-DS is designed specifically to perform well on flash storage and to operate within the constrained DRAM available on wimpy nodes: all writes to the datastore are sequential, and reads require a single random access. To provide this property, FAWN-DS maintains an in-DRAM hash table (Hash Index) that maps keys to an offset in the append-only Data Log on flash (Figure 2a). This log-structured design is similar to several append-only filesystems [42, 15], which avoid random seeks on magnetic disks for writes.

    ¹We differentiate datastore from database to emphasize that we do not provide a transactional or relational interface.

    /* KEY = 0x93df7317294b99e3e049, 16 index bits */
    INDEX = KEY & 0xffff;             /* = 0xe049 */
    KEYFRAG = (KEY >> 16) & 0x7fff;   /* = 0x19e3 */
    for i = 0 to NUM_HASHES do
        bucket = hash[i](INDEX);
        if bucket.valid && bucket.keyfrag == KEYFRAG &&
           readKey(bucket.offset) == KEY then
            return bucket;
        end if
        {Check next chain element...}
    end for
    return NOT_FOUND;

    Figure 3: Pseudocode for hash bucket lookup in FAWN-DS.

    Mapping a Key to a Value. FAWN-DS uses an in-memory (DRAM) Hash Index to map 160-bit keys to a value stored in the Data Log. It stores only a fragment of the actual key in memory to find a location in the log; it then reads the full key (and the value) from the log and verifies that the key it read was, in fact, the correct key. This design trades a small and configurable chance of requiring two reads from flash (we set it to roughly 1 in 32,768 accesses) for drastically reduced memory requirements (only six bytes of DRAM per key-value pair).

    Figure 3 shows the pseudocode that implements this design for Lookup. FAWN-DS extracts two fields from the 160-bit key: the i low-order bits of the key (the index bits) and the next 15 low-order bits (the key fragment). FAWN-DS uses the index bits to select a bucket from the Hash Index, which contains 2^i hash buckets. Each bucket is only six bytes: a 15-bit key fragment, a valid bit, and a 4-byte pointer to the location in the Data Log where the full entry is stored.

    Lookup proceeds, then, by locating a bucket using the index bits and comparing the key against the key fragment. If the fragments do not match, FAWN-DS uses hash chaining to continue searching the hash table. Once it finds a matching key fragment, FAWN-DS reads the record off of the flash. If the stored full key in the on-flash record matches the desired lookup key, the operation is complete. Otherwise, FAWN-DS resumes its hash chaining search of the in-memory hash table and searches additional records. With the 15-bit key fragment, only 1 in 32,768 retrievals from the flash will be incorrect and require fetching an additional record.

    The constants involved (15 bits of key fragment, 4 bytes of log pointer) target the prototype FAWN nodes described in Section 4.
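    As a concrete reading of the six-byte bucket and the Figure 3 lookup, here is a compilable C sketch. It is illustrative only: the struct packing, the probing scheme, the in-memory stand-in for the Data Log, and the big-endian key bytes are assumptions, not the FAWN-DS code.

      /* Illustrative C rendering of the six-byte FAWN-DS bucket and the Figure 3
       * lookup. Sketch only: packing, probing, the fake log, and key byte order
       * are assumptions. */
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      #define INDEX_BITS  16                     /* i index bits (the example uses 16) */
      #define NUM_BUCKETS (1u << INDEX_BITS)     /* 2^i hash buckets */
      #define NUM_HASHES  8                      /* probe limit along the chain */

      #pragma pack(push, 1)
      typedef struct {                  /* six bytes per bucket */
          uint16_t keyfrag_valid;       /* low 15 bits: key fragment; top bit: valid */
          uint32_t offset;              /* location of the full entry in the Data Log */
      } bucket_t;
      #pragma pack(pop)

      typedef uint8_t key160[20];       /* 160-bit key, big-endian bytes (assumed) */

      static bucket_t table[NUM_BUCKETS];
      static key160   fake_log[1024];   /* stand-in for the on-flash Data Log */
      static uint32_t log_tail;

      static uint32_t index_of(const key160 k)   { return (uint32_t)k[19] | ((uint32_t)k[18] << 8); }
      static uint16_t keyfrag_of(const key160 k) { return (uint16_t)(((uint16_t)k[17] | ((uint16_t)k[16] << 8)) & 0x7fff); }
      static uint32_t probe(int i, uint32_t idx) { return (idx + i) % NUM_BUCKETS; }

      /* Store: append the full key (value omitted here) and fill a free bucket. */
      static int fawnds_store(const key160 key) {
          for (int i = 0; i < NUM_HASHES; i++) {
              bucket_t *b = &table[probe(i, index_of(key))];
              if (!(b->keyfrag_valid & 0x8000)) {
                  memcpy(fake_log[log_tail], key, sizeof(key160));   /* append to "log" */
                  b->keyfrag_valid = (uint16_t)(0x8000 | keyfrag_of(key));
                  b->offset = log_tail++;
                  return 0;
              }
          }
          return -1;                                                 /* chain full */
      }

      /* Lookup (Figure 3): match the 15-bit fragment, then confirm with one read
       * of the full key from the log; a mismatch means a fragment collision. */
      static const bucket_t *fawnds_lookup(const key160 key) {
          for (int i = 0; i < NUM_HASHES; i++) {
              bucket_t *b = &table[probe(i, index_of(key))];
              if ((b->keyfrag_valid & 0x8000) &&
                  (b->keyfrag_valid & 0x7fff) == keyfrag_of(key) &&
                  memcmp(fake_log[b->offset], key, sizeof(key160)) == 0)
                  return b;
          }
          return NULL;                                               /* NOT_FOUND */
      }

      int main(void) {
          key160 k = { 0x93, 0xdf, 0x73, 0x17, 0x29, 0x4b, 0x99, 0xe3, 0xe0, 0x49 };
          fawnds_store(k);
          printf("bucket size = %zu bytes, found = %d\n",
                 sizeof(bucket_t), fawnds_lookup(k) != NULL);
          return 0;
      }

    The real FAWN-DS hashes the index bits (hash[i](INDEX)) and has its own chaining discipline; the sketch only fixes the six-byte bucket layout and the verify-against-the-full-key step.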


    Slide annotations on the figure:
    • 12 bytes per entry (hashtable in DRAM, data region on flash)
    • KeyFrag != Key, so potential collisions! Low probability of multiple flash reads.
    • New nodes have 5x lower memory/flash ratio... (worked out below)
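    The "5x" follows directly from the node specs quoted on the Memory slide:

      Old nodes: 4 GB flash / 256 MB DRAM  = 16 bytes of flash per byte of DRAM
      New nodes: 160 GB flash / 2 GB DRAM  = 80 bytes of flash per byte of DRAM
      80 / 16 = 5

    so each byte of DRAM must now index five times as much flash, motivating the even more memory-efficient indexes on the next slides.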

  • Step 1: Split into current and old stores

    [Diagram: two datastores, each with its own DRAM index. The old store is static; inserts go only to the current store; queries consult both. A minimal sketch follows.]
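    A minimal, self-contained sketch of that query path, using a trivial array-backed stand-in for the per-range stores (the store type and helpers here are hypothetical, not the FAWN-DS interface): inserts append only to the current store, and a lookup checks the current store before falling back to the frozen old one.

      /* Hypothetical sketch of the query path after a split: inserts go only to
       * the current store, queries check the current store then the static old
       * store. The tiny array-backed store below is a stand-in, not FAWN-DS. */
      #include <stdio.h>
      #include <string.h>

      #define MAX_ENTRIES 128

      typedef struct {                      /* stand-in store: tiny in-memory index */
          const char *keys[MAX_ENTRIES];
          const char *vals[MAX_ENTRIES];
          int n;
      } store_t;

      static const char *store_lookup(const store_t *s, const char *key) {
          for (int i = s->n - 1; i >= 0; i--)        /* newest entry wins */
              if (strcmp(s->keys[i], key) == 0) return s->vals[i];
          return NULL;
      }

      static void store_append(store_t *s, const char *key, const char *val) {
          s->keys[s->n] = key; s->vals[s->n] = val; s->n++;
      }

      typedef struct {
          store_t *current;   /* receives all new inserts */
          store_t *old;       /* static: frozen at the split, no new writes */
      } split_store;

      static void split_put(split_store *ss, const char *k, const char *v) {
          store_append(ss->current, k, v);           /* inserts: current store only */
      }

      static const char *split_get(const split_store *ss, const char *k) {
          const char *v = store_lookup(ss->current, k);
          return v ? v : store_lookup(ss->old, k);   /* fall back to the old store */
      }

      int main(void) {
          store_t old_store = {0}, cur_store = {0};
          store_append(&old_store, "k1", "old-value");   /* data from before the split */
          split_store ss = { &cur_store, &old_store };
          split_put(&ss, "k2", "new-value");
          printf("%s %s\n", split_get(&ss, "k1"), split_get(&ss, "k2"));
          return 0;
      }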

  • Memory-efficient indexes

                           Sorted                              Hashing
    Lookup                 10 µs per lookup (still             < 2 µs per lookup
                           CPU-intense)
    DRAM index size        ~2 bits per key                     ~2 bits per key
    Space use on flash     100%                                93%
    Other                  Merging is easy; can discard        Range queries harder
                           the index

  • Wrapup

    • "FAWN"damental efficiency from the architecture
    • But no free lunch! OS, distributed systems, and algorithmic challenges along the path. Really fun ones.
    • Current big focuses:
      • Memory efficiency is our bogey-man.
      • Systems cope poorly with high IOPS.
    • Progressing well; just brought up a new cluster.
    • p.s. Most common question? "What about the network?" :) [SIGCOMM forward ref]