
Page 1: Datacenter Computers

Datacenter Computers: modern challenges in CPU design

Dick Sites, Google Inc., February 2015

Page 2: Datacenter Computers

February 2015 2

Thesis: Servers and desktops require different design emphasis

Page 3: Datacenter Computers

February 2015 3

Goals

• Draw a vivid picture of a computing environment that may be foreign to your experience so far

• Expose some ongoing research problems

• Perhaps inspire contributions in this area

Page 4: Datacenter Computers

February 2015 4

Analogy (S.A.T. pre-2005)

[image-only slide: an "A is to B as C is to D" analogy posed with pictures]

Page 5: Datacenter Computers

February 2015 5

Analogy (S.A.T. pre-2005)

[image-only slide: the same picture analogy, continued]

Page 6: Datacenter Computers

February 2015 6

Datacenter Servers are Different

Move data: big and small

Real-time transactions: 1000s per second

Isolation between programs

Measurement underpinnings

Page 7: Datacenter Computers

Move data: big and small

Page 8: Datacenter Computers

February 2015 8

Move data: big and small

• Move lots of data
  – Disk to/from RAM
  – Network to/from RAM
  – SSD to/from RAM
  – Within RAM

• Bulk data
• Short data: variable-length items
• Compress, encrypt, checksum, hash, sort

Page 9: Datacenter Computers

February 2015 9

Lots of memory

• Intel 4004: no on-chip memory

• Core i7: 12MB L3 cache

Page 10: Datacenter Computers

February 2015 10

Little brain, LOTS of memory

• Server: 64GB-1TB

Drinking-straw access

Page 11: Datacenter Computers

February 2015 11

High-Bandwidth Cache Structure

• Hypothetical 64-CPU-thread server

[block diagram: 16 cores x 4 CPU threads each (threads 0..63, each with its own PCs), per-core L1 caches, shared L2 caches, one shared L3 cache, and many DRAM channels]

Page 12: Datacenter Computers

February 2015 12

Forcing Function

Move 16 bytes every CPU cycle

What are the direct consequences of this goal?

Page 13: Datacenter Computers

February 2015 13

Move 16 bytes every CPU cycle

• Load 16B, Store 16B, test, branch
  – All in one cycle = 4-way issue minimum
  – Need some 16-byte registers

• At 3.2GHz: 50 GB/s read + 50 GB/s write
  – 100 GB/s through one L1 D-cache

• Must avoid reading a cache line before write
  – if it is to be fully written (else 150 GB/s)
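A minimal C sketch of the loop that first bullet implies (function and variable names are invented here, not from the talk); each iteration is one 16-byte load, one 16-byte store, a test, and a branch, and mainstream compilers will typically lower each fixed 16-byte memcpy to a single vector load or store:

    #include <stdint.h>
    #include <string.h>

    /* One iteration = 16-byte load + 16-byte store + test + branch.
       At 3.2GHz this is ~51 GB/s of reads plus ~51 GB/s of writes through
       one L1 D-cache, matching the 50 + 50 GB/s above. */
    static void copy_16B_per_cycle(uint8_t *dst, const uint8_t *src, size_t n16)
    {
        for (size_t i = 0; i < n16; i++) {      /* test + branch      */
            uint8_t chunk[16];
            memcpy(chunk, src + 16 * i, 16);    /* one 16-byte load   */
            memcpy(dst + 16 * i, chunk, 16);    /* one 16-byte store  */
        }
    }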

Page 14: Datacenter Computers

February 2015 14

Short strings

• Parsing, words, packet headers, scatter-gather, checksums

[figure: variable-length words "the quick brown fox"; a packet's VLAN, IP header, TCP header, and checksum]

Page 15: Datacenter Computers

February 2015 15

Generic Move, 37 bytes

[figure: "thequickbrownfoxjumpsoverthelazydog.." copied source-to-destination as 5 bytes + 16 bytes + 16 bytes]

Page 16: Datacenter Computers

February 2015 16

Generic Move, 37 bytes

[same figure, showing the 16B aligned cache accesses each chunk touches]

Page 17: Datacenter Computers

February 2015 17

Generic Move, 37 bytes, aligned target

[figure: the same 37-byte string, but with the target 16B aligned the stores become 10 bytes + 16 bytes + 11 bytes of aligned cache accesses, while the loads stay shifted]

A 32B shift is helpful: LD, >>, ST
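A scaled-down C sketch of that LD, >>, ST pattern, using 8-byte registers in place of the 16-byte registers and 32B shifter the slide asks for (names invented; assumes a little-endian machine and a source offset of 1..7 bytes):

    #include <stddef.h>
    #include <stdint.h>

    /* Copy nwords 8-byte words to an aligned destination from a source that
       begins byte_off (1..7) bytes past the aligned address src_aligned.
       Each iteration: one aligned load (LD), one shift/combine (>>), one
       aligned store (ST). It reads nwords+1 aligned words (the usual
       over-read of this technique), so the caller must keep that last
       word readable. */
    static void copy_shifted(uint64_t *dst, const uint64_t *src_aligned,
                             unsigned byte_off, size_t nwords)
    {
        unsigned sh = 8u * byte_off;                      /* 8..56 bits   */
        uint64_t prev = src_aligned[0];
        for (size_t i = 0; i < nwords; i++) {
            uint64_t next = src_aligned[i + 1];           /* LD, aligned  */
            dst[i] = (prev >> sh) | (next << (64 - sh));  /* >>, combine  */
            prev = next;                                  /* ST above     */
        }
    }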

Page 18: Datacenter Computers

February 2015 18

Useful Instructions

• Load Partial R1, R2, R3
• Store Partial R1, R2, R3
  – Load/store R1 low bytes with 0(R2), length 0..15 taken from R3 low 4 bits, 0-pad the high bytes; R1 is a 16-byte register
  – 0(R2) can be unaligned
  – Length zero never segfaults
  – Similar to the Power architecture Move Assist instructions

• Handles all short moves
• Remaining length is always a multiple of 16
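A C sketch of the semantics these two proposed instructions would have (they are the slide's proposal, not an existing ISA; the emulation below is illustrative): the length comes from the low 4 bits, the high bytes are zero-padded, and length zero moves nothing, so it cannot fault.

    #include <stdint.h>
    #include <string.h>

    typedef struct { uint8_t b[16]; } reg16;  /* stand-in for a 16-byte register */

    /* Load Partial: bring in 0..15 bytes from a possibly unaligned address,
       zero-padding the rest of the register. */
    static reg16 load_partial(const void *addr, unsigned len)
    {
        reg16 r;
        memset(r.b, 0, sizeof r.b);       /* 0 pad high bytes            */
        memcpy(r.b, addr, len & 15u);     /* length from the low 4 bits  */
        return r;
    }

    /* Store Partial: write out the low 0..15 bytes of the register. */
    static void store_partial(void *addr, const reg16 *r, unsigned len)
    {
        memcpy(addr, r->b, len & 15u);
    }

With these, a generic move does one partial transfer of length n mod 16 and handles the rest in full 16-byte chunks, which is why the remaining length is always a multiple of 16.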

Page 19: Datacenter Computers

February 2015 19

Move 16 bytes every CPU cycle

• L1 data cache: 2 aligned LD/ST accesses per cycle: 100 GB/s, plus 100 GB/s fill

[same block diagram: per-core L1 caches, shared L2 caches, shared L3 cache, DRAM channels]

Page 20: Datacenter Computers

February 2015 20

Move 16 bytes every CPU cycle

• Sustained throughout L2, L3, and RAM
• Perhaps four CPUs simultaneously

[same block diagram, annotated: 100 GB/s x 16 at the L1 caches, 200 GB/s x 4 at the L2 caches, 400 GB/s x 1 at L3/DRAM]

Page 21: Datacenter Computers

February 2015 21

Top 20 Stream Copy Bandwidth (April 2014)

Rank  Date        Machine ID                    ncpus  COPY (GB/s)
 1.   2012.08.14  SGI_Altix_UV_2000              2048     6591
 2.   2011.04.05  SGI_Altix_UV_1000              2048     5321
 3.   2006.07.10  SGI_Altix_4700                 1024     3661
 4.   2013.03.26  Fujitsu_SPARC_M10-4S           1024     3474
 5.   2011.06.06  ScaleMP_Xeon_X6560_64B          768     1493
 6.   2004.12.22  SGI_Altix_3700_Bx2              512      906
 7.   2003.11.13  SGI_Altix_3000                  512      854
 8.   2003.10.02  NEC_SX-7                         32      876
 9.   2008.04.07  IBM_Power_595                    64      679
10.   2013.09.12  Oracle_SPARC_T5-8               128      604
11.   1999.12.07  NEC_SX-5-16A                     16      607
12.   2009.08.10  ScaleMP_XeonX5570_vSMP_16B      128      437
13.   1997.06.10  NEC_SX-4                         32      434
14.   2004.08.11  HP_AlphaServer_GS1280-1300       64      407
 --   Our Forcing Function                          1      400
15.   1996.11.21  Cray_T932_321024-3E              32      310
16.   2014.04.24  Oracle_Sun_Server_X4-4           60      221
17.   2007.04.17  Fujitsu/Sun_Enterprise_M9000    128      224
18.   2002.10.16  NEC_SX-6                          8      202
19.   2006.07.23  IBM_System_p5_595                64      186
20.   2013.09.17  Intel_XeonPhi_SE10P              61      169

https://www.cs.virginia.edu/stream/top20/Bandwidth.html

Page 22: Datacenter Computers

February 2015 22

Latency

• Buy 256GB of RAM only if you use it
  – Higher cache miss rates
  – And main memory is ~200 cycles away

Page 23: Datacenter Computers

February 2015 23

Cache Hits: 1-cycle hit @4 Hz (vs. a 200-cycle miss @4 Hz)

Page 24: Datacenter Computers

February 2015 24

Cache Miss to Main Memory: 200-cycle miss @4 Hz

[animation: one cache miss, replayed at 4 cycles per second]

Page 25: Datacenter Computers

February 2015 25

What constructive work could you do during this time?

Page 26: Datacenter Computers

February 2015 26

Latency

• Buy 256GB of RAM only if you use it
  – Higher cache miss rates
  – And main memory is ~200 cycles away

• For starters, we need to prefetch about 16B * 200 cycles = 3.2KB to meet our forcing function; call it 4KB
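A hedged C sketch of that prefetching (distances are illustrative; __builtin_prefetch is the GCC/Clang builtin, and the prefetch is only a hint, so running a little past the end of the buffers does not fault on mainstream targets):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    enum { LINE = 64, AHEAD = 4096 };   /* assumed line size; ~16B * 200cy, rounded up */

    static void copy_with_prefetch(uint8_t *dst, const uint8_t *src, size_t n)
    {
        size_t i = 0;
        for (; i + LINE <= n; i += LINE) {
            __builtin_prefetch(src + i + AHEAD, 0);   /* read stream, ~4KB ahead  */
            __builtin_prefetch(dst + i + AHEAD, 1);   /* write stream, ~4KB ahead */
            memcpy(dst + i, src + i, LINE);           /* move one cache line      */
        }
        memcpy(dst + i, src + i, n - i);              /* tail                     */
    }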

Page 27: Datacenter Computers

February 2015 27

Additional Considerations

• L1 cache size = associativity * page size
  – Need bigger than 4KB pages

• Translation buffer at 256 x 4KB covers only 1MB of memory
  – Covers much less than the on-chip caches
  – TB miss handling can consume 15% of total CPU time
  – Need bigger than 4KB pages

• With 256GB of RAM @4KB: 64M pages
  – Need bigger than 4KB pages
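The arithmetic behind those three bullets as a tiny C program (the 8-way associativity and the 2MB large-page size are assumed typical values, not figures from the talk; the L1 constraint is the usual one for virtually indexed, physically tagged caches):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long page = 4096;          /* 4KB pages         */
        unsigned long long big  = 2ULL << 20;    /* assumed 2MB pages */
        unsigned long long ram  = 256ULL << 30;  /* 256GB of RAM      */

        printf("max VIPT L1: 8-way * 4KB = %llu KB\n", (8 * page) >> 10);
        printf("TLB span: 256 * 4KB      = %llu MB\n", (256 * page) >> 20);
        printf("pages: 256GB / 4KB       = %llu M\n",  (ram / page) >> 20);
        printf("pages: 256GB / 2MB       = %llu K\n",  (ram / big)  >> 10);
        return 0;
    }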

Page 28: Datacenter Computers

February 2015 28

Move data: big and small

• It’s still the memory

• Need a coordinated design of instruction architecture and memory implementation to achieve high bandwidth with low delay

Page 29: Datacenter Computers

February 2015 29

Modern challenges in CPU design

• Lots of memory
• Multi-issue CPU instructions every cycle that drive full bandwidth
• Full bandwidth all the way to RAM, not just to L1 cache
• More prefetching in software
• Bigger page size(s)

Topic: do better

Page 30: Datacenter Computers

Real-time transactions: 1000s per second

Page 31: Datacenter Computers

February 2015 31

A Single Transaction Across ~40 Racks of ~60 Servers Each

• Each arc is a related client-server RPC (remote procedure call)

Page 32: Datacenter Computers

February 2015 32

A Single Transaction RPC Tree vs. Time: Client & 93 Servers

[timeline figure; "?" marks flag unexplained delays]

Page 33: Datacenter Computers

February 2015 33

Single Transaction Tail Latency

• One slow response out of 93 parallel RPCs slows the entire transaction
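A back-of-the-envelope illustration of why this dominates (the 1% tail probability is assumed for illustration, not a number from the talk): if each of 93 parallel RPCs independently lands in its own 1% latency tail, the whole transaction is slow far more often than 1% of the time.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* P(at least one of 93 RPCs is in its 1% tail) = 1 - 0.99^93, about 0.61 */
        double p_slow = 1.0 - pow(0.99, 93);
        printf("transaction slowed %.0f%% of the time\n", 100.0 * p_slow);
        return 0;
    }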

Page 34: Datacenter Computers

February 2015 34

One Server, One User-Mode Thread vs. Time: Eleven Sequential Transactions (circa 2004)

Normal: 60-70 msec

Long: 1800 msec

Page 35: Datacenter Computers

February 2015 35

One Server, Four CPUs: user/kernel transitions, every CPU, every nanosecond (Ktrace)

[timeline: CPUs 0-3 across a 200 usec window]

Page 36: Datacenter Computers

February 2015 36

16 CPUs, 600us, Many RPCs

[timeline: CPUs 0..15 above, RPCs 0..39 below]

Page 37: Datacenter Computers

February 2015 37

16 CPUs, 600us, Many RPCs

[same 600 us window: threads 0..46 and locks 0..22]

Page 38: Datacenter Computers

February 2015 38

That is A LOT going on at once

• Let’s look at just one long-tail RPC in context

Page 39: Datacenter Computers

February 2015 39

16 CPUs, 600us, one RPC

[timeline: CPUs 0..15 and RPCs 0..39, with the one RPC highlighted]

Page 40: Datacenter Computers

February 2015 40

16 CPUs, 600us, one RPC

[same window: threads 0..46 and locks 0..22, with the one RPC highlighted]

Page 41: Datacenter Computers

February 2015 41

Wakeup Detail

• Thread 19 frees a lock and sends a wakeup to waiting thread 25
• Thread 25 actually runs ~50 us later
• 50us wakeup delay ??

Page 42: Datacenter Computers

February 2015 42

Wakeup Detail

• Target CPU was busy; the kernel waited

[detail: CPU 9 stayed busy for those 50 us, but CPU 3 was idle]

Page 43: Datacenter Computers

February 2015 43

CPU Scheduling, 2 Designs

• Re-dispatch on any idle CPU core
  – But if the idle CPU core is in deep sleep, it can take 75-100us to wake up

• Wait to re-dispatch on the previous CPU core, to get cache hits
  – Saves squat if it could use the same L1 cache
  – Saves ~10us if it could use the same L2 cache
  – Saves ~100us if it could use the same L3 cache
  – Expensive if cross-socket cache refills
  – Don't wait too long…
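A hedged C sketch of the trade-off both designs weigh (this is not any real kernel's scheduler; the structure, fields, and thresholds are invented for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    struct cpu {
        bool     idle;
        bool     deep_sleep;     /* ~75-100us to wake from a deep sleep state */
        uint64_t free_at_us;     /* estimate of when its current task yields  */
    };

    /* Pick a CPU for an awakened thread. 'prev' is where it last ran (warm
       L2/L3). Illustrative thresholds: wait up to ~10us for a warm core,
       avoid deep-sleep cores that cost ~100us to wake. */
    static int place_wakeup(const struct cpu *cpus, int ncpu, int prev, uint64_t now_us)
    {
        if (cpus[prev].idle || cpus[prev].free_at_us <= now_us + 10)
            return prev;                          /* reuse warm caches            */
        for (int c = 0; c < ncpu; c++)
            if (cpus[c].idle && !cpus[c].deep_sleep)
                return c;                         /* awake idle core, cold caches */
        return prev;                              /* otherwise queue behind prev  */
    }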

Page 44: Datacenter Computers

February 2015 44

Real-time transactions: 1000s per second

• Not your father’s SPECmarks

• To understand delays, need to track simultaneous transactions across servers, CPU cores, threads, queues, locks

Page 45: Datacenter Computers

February 2015 45

Modern challenges in CPU design

• A single transaction can touch thousands of servers in parallel

• The slowest parallel path dominates
• Tail latency is the enemy
  – Must control lock-holding times
  – Must control scheduler delays
  – Must control interference via shared resources

Topic: do better

Page 46: Datacenter Computers

Isolation between programs

Page 47: Datacenter Computers

February 2015 47

Histogram: Disk Server Latency; Long Tail

Latency 99th %ile = 696 msec

Page 48: Datacenter Computers

Isolation of programs reduces tail latency.
Reduced tail latency = higher utilization.
Higher utilization = $$$.

Non-repeatable Tail Latency Comes from Unknown Interference

Page 49: Datacenter Computers

February 2015 49

Many Sources of Interference

• Most interference comes from software
• But a bit from the hardware underpinnings

• In a shared apartment building, most interference comes from jerky neighbors

• But thin walls and bad kitchen venting can be the hardware underpinnings

Page 50: Datacenter Computers

February 2015 50

Isolation issue: Cache Interference

CPU thread 0 is moving 16B/cycle flat out, filling caches, hurting threads 3, 7, 63

[block diagram: thread 0's lines (shown as 0s) fill its own L1, the shared L2, and the shared L3, crowding out the other threads' lines]

Page 51: Datacenter Computers

February 2015 51

Isolation issue: Cache Interference

• CPU thread 0 is moving 16B/cycle flat out, hurting threads 3, 7, 63

Today: [the shared caches hold almost nothing but thread 0's lines; each core's L1 is monopolized by its busiest thread (core 0: all 0s, core 1: all 7s)]

Desired: [cache space shared roughly evenly: core 0's L1 ~00112233, core 1's L1 ~44556677, and the shared cache ~000111222333444555666777…]

Page 52: Datacenter Computers

February 2015 52

Cache Interference

• How to get there?
  – Partition by ways
    • No good if 16 threads and 8 ways
    • No good if the result is direct-mapped
    • Underutilizes the cache
  – Selective allocation
    • Give each thread a target cache size
    • Allocate lines freely if under target
    • Replace only its own lines if over target
    • Allow over-budget slop to avoid underutilization
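A C sketch of that selective-allocation victim choice (data structures and field names are invented; real hardware would track this with per-line owner bits, as the next slides discuss):

    #include <stdint.h>

    enum { WAYS = 16, NTHREADS = 64 };

    struct line { uint16_t owner; uint16_t age; /* ... tag, data ... */ };
    struct set  { struct line way[WAYS]; };

    static uint32_t target[NTHREADS];    /* per-thread target cache size, in lines */
    static uint32_t current[NTHREADS];   /* lines each thread currently owns       */

    /* Choose a victim way in set s for a miss by thread t. */
    static int pick_victim(const struct set *s, uint16_t t)
    {
        int v = -1;
        if (current[t] <= target[t]) {           /* under target: allocate freely */
            v = 0;
            for (int w = 1; w < WAYS; w++)
                if (s->way[w].age > s->way[v].age) v = w;
            return v;
        }
        for (int w = 0; w < WAYS; w++)           /* over target: evict own lines  */
            if (s->way[w].owner == t && (v < 0 || s->way[w].age > s->way[v].age))
                v = w;
        return v >= 0 ? v : 0;  /* may rarely find none of its own lines in a set */
    }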

Page 53: Datacenter Computers

February 2015 53

Cache Interference

• Each thread has target:current, replace only own lines if over target

[example: Target 25 25 25 25 vs. Current 24 27 24 24 per thread; Target 100, 100, … vs. Current 105, 99, … at the shared cache; the goal is the evenly shared occupancy shown above]

Page 54: Datacenter Computers

February 2015 54

Cache Interference

• Each thread has target:current, replace only own lines if over target

• Requires owner bits per cache line – expensive bits

• Requires 64 target/current at L3
• Fails if L3 not at least 64-way associative

– Can rarely find own lines in a set

Page 55: Datacenter Computers

February 2015 55

Design Improvement

• Track ownership just by incoming paths
  – Plus a separate target for kernel accesses
  – Plus a separate target for over-target accesses
• Fewer bits; 8-way associativity is OK

[tables: Target and Current counts kept per path (thread groups 0-3, 4-7, 8-…, plus OV for over-target and K for kernel), with the same evenly shared occupancy as the goal]
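A sketch of this refinement in the same invented style: ownership is tracked per incoming path (a thread group, the kernel bucket K, or the over-target bucket OV), so only a handful of target/current counters and a few owner bits per line are needed.

    #include <stdint.h>

    enum { NPATHS = 18, PATH_OV = 16, PATH_K = 17 };  /* 16 thread groups + OV + K */

    static uint32_t path_target[NPATHS];   /* per-path target line counts  */
    static uint32_t path_current[NPATHS];  /* per-path current line counts */

    /* Map an access to the path it is charged to: its thread group (assumed
       4 threads per core here), or K for kernel accesses; a path already
       over its target is charged to the shared over-target bucket OV. */
    static unsigned path_of(unsigned hw_thread, int is_kernel)
    {
        unsigned p = is_kernel ? PATH_K : hw_thread / 4u;
        return (path_current[p] > path_target[p]) ? PATH_OV : p;
    }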

Page 56: Datacenter Computers

February 2015 56

Isolation between programs

• Good fences make good neighbors

• We need better hardware support for program isolation in shared memory systems

Page 57: Datacenter Computers

February 2015 57

Modern challenges in CPU design

• Isolating programs from each other on a shared server is hard

• As an industry, we do it poorly
  – Shared CPU scheduling
  – Shared caches
  – Shared network links
  – Shared disks

• More hardware support needed
• More innovation needed

Page 58: Datacenter Computers

Measurement underpinnings

Page 59: Datacenter Computers

February 2015 59

Profiles: What, not Why

• Samples of 1000s of transactions, merging results

• Pro: understanding average performance

• Blind spots: outliers, idle time

– Chuck Close

Page 60: Datacenter Computers

February 2015 60

Traces: Why

• Full detail of individual transactions

• Pro: understanding outlier performance

• Blind spot: tracing overhead

Page 61: Datacenter Computers

February 2015 61

Histogram: Disk Server Latency; Long Tail

Latency 99th %ile = 696 msec

[histogram, peaks numbered 1-7; x-axis: 0 25 50 100 250 500 750 1000 msec]

A CPU profile shows the 99% common cases.
The 1% long tail never shows up.

Page 62: Datacenter Computers

February 2015 62

Trace: Disk Server, Event-by-Event

• Read RPC + disk time

• Write hit, hit, miss

• 700ms mixture

• 13 disks, three normal seconds:

Page 63: Datacenter Computers

February 2015 63

Trace: Disk Server, 13 disks, 1.5 sec

• Phase transition to 250ms boundaries exactly

[trace chart legend: Read, Write, Read hit, Write hit]

Page 64: Datacenter Computers

February 2015 64

Trace: Disk Server, 13 disks, 1.5 sec

• Latencies: 250ms, 500ms, … for nine minutes

[same trace chart legend: Read, Write, Read hit, Write hit]

Page 65: Datacenter Computers

February 2015 65

Why?

• Probably not on your guessing radar…

• Kernel throttling the CPU use of any process that is over its purchased quota

• Only happened on old, slow servers

Page 66: Datacenter Computers

February 2015 66

Disk Server, CPU Quota bug

• Understanding Why sped up 25% of the entire disk fleet worldwide!
  – Had been going on for three years
  – Savings paid my salary for 10 years

• Hanlon's razor: Never attribute to malice that which is adequately explained by stupidity.

• Sites’ corollary: Never attribute to stupidity that which is adequately explained by software complexity.


Page 67: Datacenter Computers

February 2015 67

Measurement Underpinnings

• All performance mysteries are simple once they are understood

• “Mystery” means that the picture in your head is wrong; software engineers are singularly inept at guessing how their view differs from reality

Page 68: Datacenter Computers

February 2015 68

Modern challenges in CPU design

• Need low-overhead tools to observe the dynamics of performance anomalies
  – Transaction IDs
  – RPC trees
  – Timestamped transaction begin/end

• Traces
  – CPU kernel+user, RPC, lock, thread traces
  – Disk, network, power consumption

Topic: do better
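A sketch of the kind of compact per-event record such low-overhead tracing needs (field names invented; real systems such as Dapper, listed in the references, use richer schemas): enough to rebuild an RPC tree and line it up with per-CPU, thread, and lock activity.

    #include <stdint.h>

    enum event_kind { RPC_BEGIN, RPC_END, LOCK_WAIT, LOCK_ACQUIRE,
                      CTX_SWITCH, USER_TO_KERNEL, KERNEL_TO_USER };

    struct trace_event {            /* 32 bytes per event                 */
        uint64_t timestamp_ns;      /* timestamped begin/end              */
        uint64_t rpc_id;            /* transaction / RPC identifier       */
        uint64_t parent_rpc_id;     /* links the RPC tree (0 = tree root) */
        uint16_t cpu;               /* which CPU                          */
        uint16_t tid;               /* which thread                       */
        uint16_t kind;              /* enum event_kind                    */
        uint16_t arg;               /* e.g. a lock id                     */
    };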

Page 69: Datacenter Computers

Summary: Datacenter Servers are Different

Page 70: Datacenter Computers

February 2015 70

Datacenter Servers are Different

Move data: big and small

Real-time transactions: 1000s per second

Isolation between programs

Measurement underpinnings

Page 71: Datacenter Computers

February 2015 71

References

Claude Shannon, A Mathematical Theory of Communication, The Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, July and October 1948.
http://cs.ucf.edu/~dcm/Teaching/COP5611-Spring2013/Shannon48-MathTheoryComm.pdf

Richard Sites, It's the Memory, Stupid!, Microprocessor Report, August 5, 1996.
http://cva.stanford.edu/classes/cs99s/papers/architects_look_to_future.pdf

Ravi Iyer, CQoS: A Framework for Enabling QoS in Shared Caches of CMP Platforms, ACM International Conference on Supercomputing, 2004.
http://cs.binghamton.edu/~apatel/cache_sharing/CQoS_iyer.pdf

M. Bligh, M. Desnoyers, R. Schultz, Linux Kernel Debugging on Google-sized Clusters, Linux Symposium, 2007.
https://www.kernel.org/doc/mirror/ols2007v1.pdf#page=29

Page 72: Datacenter Computers

February 2015 72

References (continued)

Daniel Sanchez and Christos Kozyrakis, The ZCache: Decoupling Ways and Associativity, IEEE/ACM Symp. on Microarchitecture (MICRO-43), 2010.
http://people.csail.mit.edu/sanchez/papers/2010.zcache.micro.pdf

Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, Chandan Shanbhag, Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, Google Technical Report dapper-2010-1, April 2010.
http://scholar.google.com/scholar?q=Google+Technical+Report+dapper-2010-1&btnG=&hl=en&as_sdt=0%2C5&as_vis=1

Daniel Sanchez and Christos Kozyrakis, Vantage: Scalable and Efficient Fine-Grained Cache Partitioning, Int'l Symp. on Computer Architecture (ISCA), 2011.
http://ppl.stanford.edu/papers/isca11-sanchez.pdf

Luiz André Barroso and Urs Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, 2nd Edition, 2013.
http://www.morganclaypool.com/doi/pdf/10.2200/S00516ED2V01Y201306CAC024

Page 73: Datacenter Computers

February 2015 73

Thank You, Questions?

Page 74: Datacenter Computers

February 2015 74

Thank You