Alpha 21364
![Page 1: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/1.jpg)
Alpha 21364
• Goal: very fast multiprocessor systems, highly scalable
• Main trick is high-bandwidth, low-latency data access.
• How to do it, how to do it?
![Page 2: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/2.jpg)
Fast access to L2 cache
• Easy solution: put it on chip
• Technology scaling has made it practical.
• Higher bandwidth, lower latency, but smaller size than SRAM.
• Many design and CAD problems.
![Page 3: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/3.jpg)
Fast access to main memory
• Build a NUMA system.
• Each CPU directly controls its main memory chips (no intervening chipset).
• On-chip Rambus memory controller
• Multiple frequencies cause design and CAD problems.
![Page 4: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/4.jpg)
Fast remote memory access
• Direct communication with other CPUs.
• 2-D torus (folded checkerboard)
• Switchbox/router on chip for passing packets between any 2 grid points.
• Clock-forwarded data via matched T-lines.
• Many design and CAD challenges.
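As a rough illustration of why a 2-D torus keeps remote accesses short, the sketch below computes the minimal hop count between two grid points when each dimension wraps around. The function names `torus_hops` and `max_hops` and the 4 x 3 grid size are assumptions made for this example. They are not taken from the 21364 design.

```python
def torus_hops(src, dst, cols, rows):
    """Minimal hop count between two nodes on a cols x rows 2-D torus."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    # Each dimension wraps around, so a packet may go either way round the ring.
    return min(dx, cols - dx) + min(dy, rows - dy)


def max_hops(cols, rows):
    """Worst-case hop count from node (0, 0) to any other node."""
    return max(torus_hops((0, 0), (x, y), cols, rows)
               for x in range(cols) for y in range(rows))


# Hypothetical 4 x 3 grid: opposite corners are only 2 hops apart.
print(torus_hops((0, 0), (3, 2), cols=4, rows=3))  # 2
print(max_hops(cols=4, rows=3))                    # 3
```

Without the wrap-around links the same corner-to-corner trip would take 5 hops, which is the main reason for folding the grid into a torus.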
![Page 5: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/5.jpg)
All of that, and FAST
• Greater than 1 GHz in initial part.
• Faster shrinks to follow.
• Many design and CAD challenges!
![Page 6: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/6.jpg)
One-chip scalable system
[Figure: four CPU blocks and four Mem blocks.]
![Page 7: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/7.jpg)
October 13 & 14 Microprocessor Forum 19
21364 System Block Diagram
[Figure: 12 nodes in three rows of four, each node labeled 364 with its own M and IO blocks.]
![Page 8: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/8.jpg)
It gets worse
• Much of this has been designed before -- by trial and error.
• Now it’s part of a full-custom CPU.
• Must be right the first time.
![Page 9: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/9.jpg)
L2 cache
• We are combining memory and logic in a high-speed part.
• Cache covers a large die area, but is synchronous and needs a clock.
• Many conditional clocks are needed to save power.
• Problem: how do we control/simulate clock skew?
![Page 10: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/10.jpg)
H tree?
• H tree has nominal 0 skew at terminuses.
• Real life must include on-chip variation (OCV):
– L, [...], sheet resistance, C
– Vdd, T
• How do we minimize the sensitivity of skew to OCV?
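The "nominal 0 skew" property comes from symmetry: every leaf of an ideal H tree sits at the same wire length from the root. The sketch below builds such a tree recursively and checks that property. The function name `h_tree_leaves`, the arm lengths, and the rule that arms halve after each complete "H" are assumptions for illustration only.

```python
def h_tree_leaves(x, y, half, levels, horizontal=True, path=0.0):
    """Return (x, y, wire_length_from_root) for every leaf of an ideal H tree.

    Each level branches symmetrically by `half` along one axis, and the next
    level branches along the other axis. The arm length halves after every
    two levels, i.e. after each complete 'H'.
    """
    if levels == 0:
        return [(x, y, path)]
    leaves = []
    for sign in (-1, 1):
        nx = x + sign * half if horizontal else x
        ny = y if horizontal else y + sign * half
        next_half = half if horizontal else half / 2
        leaves += h_tree_leaves(nx, ny, next_half, levels - 1,
                                not horizontal, path + half)
    return leaves


leaves = h_tree_leaves(0.0, 0.0, 8.0, levels=4)
lengths = {length for _, _, length in leaves}
print(len(leaves), lengths)  # 16 {24.0}
```

Equal wire length only guarantees equal delay if every segment has identical electrical properties, which is exactly what OCV takes away.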
![Page 11: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/11.jpg)
L2 cache logic verification
• A cache is not a simple animal.
• The “simple” high-level picture is complicated by redundancy, BIST/BISR, fuse farms, optimal repair algorithms, complex circuit design.
• Needs verification of RTL and schematics
![Page 12: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/12.jpg)
Too big to verify?
• Flat? 4 GB virtual memory / 100M MOS = 40 B/MOS.
• The cache is “not quite” hierarchical.
– ECC gets in the way (odd # of bits)
– mirrored bank pairs share logic
– The “same” path may be a race or a critical path in different banks.
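The 40 B/MOS figure is the memory budget per transistor when the whole netlist is loaded flat. It works out if the tool is limited to roughly 4 GB of virtual memory (a 32-bit address space, which is an assumption here) and the design has 100M devices. A minimal check of that arithmetic, with hypothetical names:

```python
ADDRESS_SPACE_BYTES = 4 * 10**9   # assumption: ~4 GB, a 32-bit virtual address space
DEVICE_COUNT = 100 * 10**6        # 100M MOS devices, from the slide


def bytes_per_device(address_space=ADDRESS_SPACE_BYTES, devices=DEVICE_COUNT):
    """Memory budget per transistor if the whole design is loaded flat."""
    return address_space / devices


print(bytes_per_device())  # 40.0
```

Forty bytes has to cover the device itself, its connectivity, and any per-device analysis state, which is why flat verification is in doubt.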
![Page 13: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/13.jpg)
Formal verification?
• Symbolic simulation of something this big (e.g., with symbolic trajectory evaluation, STE) is impossible.
• Redundancy is an interesting challenge.
• We can verify the pieces, but how do we prove they equal the whole?
![Page 14: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/14.jpg)
The abstraction gap
• The model must run fast.
• The schematics contain 100M devices.
• Thus there is an abstraction gap.
• This makes formal verification difficult.
![Page 15: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/15.jpg)
Fast access to main memory
• Build a NUMA system.
• Each CPU directly controls its main memory chips (no intervening chipset).
• On-chip Rambus memory controller
• Multiple frequencies cause design and CAD problems.
![Page 16: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/16.jpg)
On-chip Rambus Controller
• 400 MHz dual data rate Rambus
• > 1 GHz CPU
• How do they interact?
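One way to reason about the interaction is to ask how often the two clocks line up again. The sketch below assumes both clocks are phase-locked to a common reference and uses hypothetical CPU frequencies of 1000 MHz and 1200 MHz, since the slide only says "> 1 GHz". The function name `realign_period` is also invented for this example.

```python
from fractions import Fraction


def realign_period(cpu_mhz, mem_mhz):
    """Return (cpu_cycles, mem_cycles) after which both clock edges realign.

    Assumes the two clocks are locked to a common reference, so their
    frequency ratio is an exact rational number.
    """
    ratio = Fraction(cpu_mhz, mem_mhz)   # Fraction reduces to lowest terms
    return ratio.numerator, ratio.denominator


print(realign_period(1000, 400))  # (5, 2)
print(realign_period(1200, 400))  # (3, 1)
```

With a 5:2 ratio the phase relationship between the domains changes from cycle to cycle and only repeats every five CPU cycles, so any timing analysis of the crossing has to consider every phase in that pattern.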
![Page 17: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/17.jpg)
Fast remote memory access
• Direct communication with other CPUs.
• 2-D torus (folded checkerboard)
• Switchbox/router on chip for passing packets between any 2 grid points.
• Clock-forwarded data via matched T-lines.
• Many design and CAD challenges.
![Page 18: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/18.jpg)
On Chip Switchbox/router
• Message passing is usually handled by chipsets.
• Now it’s on the CPU.
• We’ve got to get it right the 1st time.
![Page 19: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/19.jpg)
Routers are tricky
• Deadlock, Livelock
• Route around broken links
• Easy to forget corner cases
• Formal verification is a must
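To make the deadlock concern concrete, the sketch below uses a textbook technique, not the 21364 router's actual design: build the channel-dependency graph for one ring of a torus and look for a cycle. A cycle means packets can end up waiting on each other forever. Adding a second virtual channel and switching to it at a "dateline" link is one well-known way to break the cycle. All names and the 4-node ring are assumptions for illustration.

```python
def ring_dependencies(n, use_dateline):
    """Channel-dependency edges for clockwise routing on an n-node ring.

    A channel is (node, vc): the link leaving `node`, on virtual channel vc.
    Without a dateline every packet stays on vc 0. With a dateline, a packet
    starts on vc 0 and switches to vc 1 after crossing the link n-1 -> 0.
    """
    deps = set()
    for src in range(n):
        for dst in range(n):
            if src == dst:
                continue
            node, vc, prev = src, 0, None
            while node != dst:
                chan = (node, vc)
                if prev is not None:
                    deps.add((prev, chan))   # packet holds prev while waiting for chan
                prev = chan
                # Crossing the wrap-around link moves the packet to vc 1.
                if use_dateline and node == n - 1:
                    vc = 1
                node = (node + 1) % n
    return deps


def has_cycle(deps):
    """Depth-first search for a cycle in the dependency graph."""
    graph = {}
    for a, b in deps:
        graph.setdefault(a, []).append(b)
    state = {}  # 1 = on the current DFS path, 2 = fully explored

    def visit(u):
        state[u] = 1
        for v in graph.get(u, []):
            if state.get(v) == 1 or (v not in state and visit(v)):
                return True
        state[u] = 2
        return False

    return any(u not in state and visit(u) for u in list(graph))


print(has_cycle(ring_dependencies(4, use_dateline=False)))  # True
print(has_cycle(ring_dependencies(4, use_dateline=True)))   # False
```

An acyclic dependency graph is a sufficient condition for deadlock freedom under this routing rule. It says nothing about livelock or about routing around broken links, which need their own arguments.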
![Page 20: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/20.jpg)
High speed CPU
• Clocking is a challenge.
• Short tick is a challenge.
• OCV is a killer.
• Power density is also.
![Page 21: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/21.jpg)
Clocking
• Wires do not scale (even with copper).
• Low clock skew = high clock power.
• No longer practical to have a single main clock grid.
![Page 22: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/22.jpg)
Multiple grids
• Solution - multiple grids linked by Delay Locked Loops (DLLs).
• Use skew-insensitive circuits to cross clock domains. These are functional at any skew (albeit at a lower clock frequency).
• How do you do static timing verification?
![Page 23: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/23.jpg)
Short tick
• “Short tick” CPU is highly pipelined, with a small number of gates between latches.
• Most of the design is single-wire clocking, true single phase.
• Races are bad.
![Page 24: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/24.jpg)
Double-sided constraints
• $T_{d,\max} + T_{\mathrm{setup}} < T_{\mathrm{cycle}} + T_{s,\min}$
• $T_{d,\min} > T_{\mathrm{hold}} + T_{s,\max}$
• Short tick and large delay variation give you a small design window.
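The two inequalities bound the logic delay between latches from both sides. The sketch below turns them into an allowed delay window and shows how the window narrows as skew uncertainty grows. It assumes skew is defined as the capture clock's arrival time minus the launch clock's arrival time, which is the sign convention the inequalities imply. The function names and all numbers (in picoseconds) are hypothetical.

```python
def timing_window(t_cycle, t_setup, t_hold, skew_min, skew_max):
    """Allowed (lower, upper) bounds on logic delay between two latches.

    Skew is the capture clock's arrival minus the launch clock's arrival,
    bounded by [skew_min, skew_max]. All values share one time unit.
    """
    upper = t_cycle + skew_min - t_setup   # setup: Tdmax + Tsetup < Tcycle + Ts,min
    lower = t_hold + skew_max              # hold:  Tdmin > Thold + Ts,max
    return lower, upper


def meets_timing(td_min, td_max, **limits):
    """True if the path's delay range fits strictly inside the window."""
    lower, upper = timing_window(**limits)
    return td_min > lower and td_max < upper


# Hypothetical 800 ps cycle with +/-60 ps skew, then +/-150 ps skew.
print(timing_window(800, 50, 30, -60, 60))    # (90, 690)
print(timing_window(800, 50, 30, -150, 150))  # (180, 600)
```

Widening the skew range from 120 ps to 300 ps shrinks the usable window from 600 ps to 420 ps in this example, and a shorter cycle shrinks it further from the top.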
![Page 25: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/25.jpg)
OCV
• OCV gets worse every generation.
• Higher density → more T and V variation.
• Smaller feature size → more variability.
• Result is more delay variation.
![Page 26: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/26.jpg)
Statistical delay correlation
• Many delays are correlated.
• Most “nearby” effects move together.
• If two clocks have identical layout, they mostly move together.
• How do we quantify this and use it in timing verification?
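One simple way to quantify "moving together" is the standard deviation of the difference between two delays. For two delays with standard deviations σa and σb and correlation ρ, the variance of their difference is σa² + σb² − 2ρσaσb. The sketch below evaluates that formula. Treating delays as jointly distributed random variables is a modeling assumption made here, and the function name and numbers are hypothetical.

```python
import math


def skew_sigma(sigma_a, sigma_b, rho):
    """Standard deviation of (delay_a - delay_b) for correlation rho."""
    var = sigma_a**2 + sigma_b**2 - 2.0 * rho * sigma_a * sigma_b
    return math.sqrt(max(var, 0.0))   # clamp tiny negative rounding error


# Two clock paths, each with 10 ps of delay variation.
for rho in (0.0, 0.9, 1.0):
    print(rho, round(skew_sigma(10.0, 10.0, rho), 2))
```

Uncorrelated paths give about 14 ps of skew variation, while perfectly correlated paths give none. A timing verifier that ignores correlation is therefore pessimistic for matched layouts.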
![Page 27: Alpha 21364](https://reader036.vdocument.in/reader036/viewer/2022062501/568160f4550346895dd02e5c/html5/thumbnails/27.jpg)
Summary
• Alpha 21364 is a high-speed CPU targeted at glueless, scalable MP systems.
• On-chip L2 cache
• On-chip Rambus controllers
• On-chip Routing
• Many new CAD challenges - not all have solutions identified.