mit lincoln laboratory 000523-jca-1 kam 11/13/2015 panel session: paving the way for multicore open...
TRANSCRIPT
MIT Lincoln Laboratory000523-jca-1KAM 04/20/23
Panel Session:Paving the Way for Multicore Open Systems Architectures
James C. Anderson MIT Lincoln Laboratory
HPEC08Wednesday, 24 September 2008
This work was sponsored by the Department of the Air Force under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations,
conclusions and recommendations are those of the author, and are not necessarily endorsed by the United States Government.
Reference to any specific commercial product, trade name, trademark or manufacturer does not constitute or imply endorsement.
MIT Lincoln Laboratory000523-jca-2KAM 04/20/23
Objective & Schedule
• Objective: Assess the infrastructure (hardware, software & support) that enables use of multicore open systems architectures
– Where are we now?– What needs to be done?
• Schedule– 1525: Overview– 1540: Guest speaker: Mr. Markus Levy– 1600: Introduction of the panelists– 1605: Previously submitted questions for the panel– 1635: Open forum– 1655: Conclusions & the way ahead– 1700: Closing remarks & adjourn
MIT Lincoln Laboratory000523-jca-3KAM 04/20/23
Paving the Way forMulticore Open Systems Architectures
Multicore µPs
Uni-processors
Design paths diverged ca. 2005
Tools
Methodologies
MIT Lincoln Laboratory000523-jca-4KAM 04/20/23
But First, A Few Infrastructure Issues
Performance was doubling every 18
months (Moore’s Law), but not anymore
MIT Lincoln Laboratory000523-jca-5KAM 04/20/23
0
10000
20000
30000
40000
Dec-06 Dec-07 Dec-08 Jan-10 Jan-11 Jan-12 Jan-13 Jan-14 Jan-15
2000 International Technology Roadmap for Semiconductors (ITRS00)
20
2007(65nm)
10
40
30
On
-ch
ip l
oca
l cl
ock
, G
Hz
2010(45nm)
2013(32nm)
2016(22nm)
Year(Node)
ITRS00
• ~3.5X throughput every 3 yrs predicted for multiple independent cores (~same as 4X every 3 yrs for historical Moore’s Law)
– 1.4X clock speed every 3 yrs for constant power– 2.5X transistors/chip every 3 yrs (partially driven by economics) for
constant chip size (chip size growth ended ~1998)
In 2000, ITRS00 predicted a slightly lower improvement
rate vs. historical Moore’s Law for the 2008-2014 timeframe
MIT Lincoln Laboratory000523-jca-6KAM 04/20/23
0
10000
20000
30000
40000
Dec-06 Dec-07 Dec-08 Jan-10 Jan-11 Jan-12 Jan-13 Jan-14 Jan-15
2001-2002 International Technology Roadmap for Semiconductors (ITRS01-02)
20
2007(65nm)
10
40
30
On
-ch
ip l
oca
l cl
ock
, G
Hz
2010(45nm)
2013(32nm)
2016(22nm)
Year(Node)
ITRS00
ITRS01-02
• 2.8X throughput every 3 yrs predicted for multiple independent cores– 1.4X clock speed every 3 yrs for constant power (same as ITRS00)– 2X transistors/chip every 3 yrs for constant chip size (less than ITRS00)
ITRS01-02 predicted substantially lower improvement rate vs.
ITRS00, but higher clock speeds
MIT Lincoln Laboratory000523-jca-7KAM 04/20/23
0
10000
20000
30000
40000
Dec-06 Dec-07 Dec-08 Jan-10 Jan-11 Jan-12 Jan-13 Jan-14 Jan-15
2003-2006 International Technology Roadmap for Semiconductors (ITRS03-06)
20
2007(65nm)
ITRS03-ITRS06
10
40
30
On
-ch
ip l
oca
l cl
ock
, G
Hz
2010(45nm)
2013(32nm)
2016(22nm)
Year(Node)
ITRS00
ITRS01-02
ITRS03-06 predicted same improvement rate as ITRS01-02,
but even higher clock speeds
• 2.8X throughput every 3 yrs predicted for multiple independent cores– 1.4X clock speed every 3 yrs for constant power (same as ITRS00-02)– 2X transistors/chip every 3 yrs for constant chip size (same as ITRS01-02)
1.4X increase
MIT Lincoln Laboratory000523-jca-8KAM 04/20/23
0
10000
20000
30000
40000
Dec-06 Dec-07 Dec-08 Jan-10 Jan-11 Jan-12 Jan-13 Jan-14 Jan-15
2007 International Technology Roadmap for Semiconductors (ITRS07)
20
2007(65nm)
ITRS03-ITRS06
10
40
30
ITRS07 predicts lower clock speeds & improvement rate vs. ITRS00-06
On
-ch
ip l
oca
l cl
ock
, G
Hz
4.3X reduction
2010(45nm)
2013(32nm)
2016(22nm)
Year(Node)
ITRS00
ITRS01-02
• ~2.5X throughput every 3 yrs predicted for multiple independent cores– 1.23X clock speed every 3 yrs for constant power (less than ITRS00-06)– 2X transistors/chip every 3 yrs for constant chip size (same as ITRS01-06)
ITRS07
MIT Lincoln Laboratory000523-jca-9KAM 04/20/23
0.01
0.1
1
10
100
1/1/
1993
1/1/
1994
1/1/
1995
1/1/
1996
1/1/
1997
1/1/
1998
1/1/
1999
1/1/
2000
1/1/
2001
1/1/
2002
1/1/
2003
1/1/
2004
1/1/
2005
1/1/
2006
1/1/
2007
1/1/
2008
1/1/
2009
1/1/
2010
1/1/
2011
1/1/
2012
1/1/
2013
1/1/
2014
1/1/
2015
1/1/
2016
1/1/
2017
1/1/
2018
1/1/
2019
1/1/
2020
1/1/
2021
1/1/
2022
1/1/
2023
COTS Compute Node (processor, memory & I/O) Performance History & Projections (2Q08)
2023
100
GF
LO
PS
/W(b
illio
n 3
2-b
it f
loat
ing
po
int
op
erat
ion
s p
er
sec-
wat
t fo
r co
mp
uti
ng
1K
-po
int
com
ple
x F
FT
)
Year
0.011993
1
10
2.5X in 3 yrs im
provement rate fo
r
multicore µPs & COTS ASICs
0.1
2X in 3 yrs improvement ra
te for general-
purpose uni-processor µPs
4X in
3 y
rs im
prove
men
t
rate
for a
ll te
chnolo
gies
2008
1000nm
100nm
10nm
2.5X in 3 yrs im
provement
rate for S
RAM-based FPGAs
Projected improvement rates, although smaller than historical
values, are still substantial
IBM Cell (90nm)
Intel i860µP (1µm)
MIT Lincoln Laboratory000523-jca-10KAM 04/20/23
0.1
1
10
100
1000
10000
Notional Cost (cumulative) & Schedule for COTS 90nm Cell Broadband Engine
20080.1
Year
1
2000
10
1000
Cell Broadband Engine: 205 GFLOPS (peak, 32-
bit) @ 100W (est.), ~2 GFLOPS/W
100
Dev
elo
pm
ent
Co
st (
mil
lio
ns
of
$)
20042002 2006
Cell
IBM, Sony & Toshiba hold architectural discussions
Austin (Texas) design center opens ($400M
joint investment in Cell design)
Sony exits future Cell development after
investing $1.7B
Technology evaluation systems shipped
MIT Lincoln Laboratory000523-jca-11KAM 04/20/23
Multicore Open Systems Architecture Example
• LEON3– 32-bit SPARC V8 processor developed by Gaisler Research (Aeroflex as
of 7/14/08) for the European Space Agency– Synthesizable VHDL (GNU general public license) & documentation
downloadable from www.gaisler.com– Open source software support (embedded Linux, C/C++ cross-compiler,
simulator & symbolic debugger)
• 0.25µm LEON3FT– Commercial fault-tolerant implementation of LEON3– 75 MFLOPS/W (150 MIPS & 30 MFLOPS @ 150 MHz for 0.4W)
• 90nm quad-core LEON3FT– System emulated with a single SRAM-based FPGA– 133 MFLOPS/W (4x500 MIPS & 4x100 MFLOPS for 3W)– Each core occupies <1mm2 including caches– MOSIS fabricates 65nm & 90nm die up to 360mm2 (IBM process)
How can we improve performance (FLOPS/W), which lags COTS by up to 9 yrs (15X) in this example?
MIT Lincoln Laboratory000523-jca-12KAM 04/20/23
0.1
1
10
100
1000
10000
Notional Cost (cumulative) & Schedule for 90nm LEON3FT Multicore Processor
20080.1
Year
1
2000
10
1000
100
$3M estimated development cost is mostly staff expense, with
schedule determined by foundry
Dev
elo
pm
ent
Co
st (
mil
lio
ns
of
$)
20042002 2006
0.25µm LEON3FT
90nm quad-core LEON3FT
65nm fab access
MIT Lincoln Laboratory000523-jca-13KAM 04/20/23
Objective & Schedule
• Objective: Assess the infrastructure (hardware, software & support) that enables use of multicore open systems architectures
– Where are we now?– What needs to be done?
• Schedule– 1525: Overview– 1540: Guest speaker: Mr. Markus Levy– 1600: Introduction of the panelists– 1605: Previously submitted questions for the panel– 1635: Open forum– 1655: Conclusions & the way ahead– 1700: Closing remarks & adjourn
MIT Lincoln Laboratory000523-jca-14KAM 04/20/23
Panel Session: Paving the Way for Multicore Open Systems Architectures
Panel members & audience may hold diverse, evolving opinions
Moderator: Dr. James C. AndersonMIT Lincoln Laboratory
Prof. Saman AmarasingheMIT Computer Science & Artificial
Intelligence Laboratory (CSAIL)
Mr. Markus LevyThe Multicore Association &The Embedded Microprocessor Benchmark Consortium (EEMBC)
Dr. Matthew ReillyChief EngineerSiCortex, Inc. Mr. John Rooks
Air Force Research Laboratory (AFRL/RITC)
Emerging Computing Technology
Dr. Steve MuirChief Technology Officer
Vanu, Inc.
MIT Lincoln Laboratory000523-jca-15KAM 04/20/23
Objective & Schedule
• Objective: Assess the infrastructure (hardware, software & support) that enables use of multicore open systems architectures
– Where are we now?– What needs to be done?
• Schedule– 1525: Overview– 1540: Guest speaker: Mr. Markus Levy– 1600: Introduction of the panelists– 1605: Previously submitted questions for the panel– 1635: Open forum– 1655: Conclusions & the way ahead– 1700: Closing remarks & adjourn
MIT Lincoln Laboratory000523-jca-16KAM 04/20/23
• Despite industry slowdown, embedded processors are still improving exponentially (2/3 of historical Moore’s Law rate)
• Although performance improvements in multicore designs (2.5X every 3 yrs) continue to outpace those of uni-processors (2X every 3 yrs), the “performance gap” is less than previously projected
• New tools and methodologies will be needed to maximize the benefits of using multicore open systems architectures
– Power & packaging issues– Cost & availability issues– Training & ease-of-use issues– Platform independence issues
• Although many challenges remain in reducing the performance gap between highly specialized systems vs. multicore open systems architectures, the latter will help insulate users from manufacturer-specific issues
Conclusions & The Way Ahead
Success still depends on ability of foundries to provide smaller geometries & increasing speed for constant
power (driven by large-scale COTS product economics)
MIT Lincoln Laboratory000523-jca-17KAM 04/20/23
Backup Slides
MIT Lincoln Laboratory000523-jca-18KAM 04/20/23
COTS ASIC: 90nm IBMCell Broadband Engine (4Q06)
• 100W (est.) @ 3.2 GHz
• 170 GFLOPS sustained for 32-bit flt pt 1K cmplx FFT (83% of peak)
• 16 Gbyte memory options (~10 FLOPS/byte)– COTS Rambus XDR DRAM (Cell is designed to use only this memory)
256 chips 690W (note: Rambus devices may not be 3D stackable due to 2.7W/chip
power consumption)
– Non-COTS solution: Design a bridge chip ASIC (10W est.) to allow use of 128 DDR2 SDRAM devices (32W)
128 chips in 3D stacks to save space (0.25W/chip) Operate many memory chips in parallel Buffer to support Rambus speeds Increased latency vs. Rambus
• 40W budget for external 27 Gbytes/sec simultaneous I&O (using same non-COTS bridge chip to handle I/O with Cell)
• Single non-COTS CN (compute node) using DDR2 SDRAM– 170 GFLOPS sustained for 200W (182W est. for CN plus 18W for 91%
efficient DC-to-DC converter)– 0.85 GFLOPS/W & 56 GFLOPS/L
MIT Lincoln Laboratory000523-jca-19KAM 04/20/23
0.01
0.1
1
10
100
1/1/
1993
1/1/
1994
1/1/
1995
1/1/
1996
1/1/
1997
1/1/
1998
1/1/
1999
1/1/
2000
1/1/
2001
1/1/
2002
1/1/
2003
1/1/
2004
1/1/
2005
1/1/
2006
1/1/
2007
1/1/
2008
1/1/
2009
1/1/
2010
1/1/
2011
1/1/
2012
1/1/
2013
1/1/
2014
1/1/
2015
1/1/
2016
1/1/
2017
1/1/
2018
1/1/
2019
1/1/
2020
1/1/
2021
1/1/
2022
1/1/
2023
COTS Compute Node Performance History & Projections (2Q08)
2023
100
GF
LO
PS
/W(b
illio
n 3
2-b
it f
loat
ing
po
int
op
erat
ion
s p
er
sec-
wat
t fo
r co
mp
uti
ng
1K
-po
int
com
ple
x F
FT
)
Year
0.011993 2003
1
102.5X in
3 yrs
improvement ra
te for
SRAM-based FPGAs,
COTS ASICs &
multicore µPs
0.1
2X in 3 yrs improvement ra
te for general-
purpose uni-processor µPs
2013
Intel Polaris (65nm)
IBM Cell (90nm)
Virtex-4(90nm)
FreescaleMPC7447A
(130nm)Intel i860µP (1000nm)
Analog DevicesSHARC DSP
(600nm)
Compute Node includes FFT (fast Fourier transform)
processor, memory (10 FLOPS/byte),
simultaneous I&O (1.28 bits/sec per FLOPS) &
DC-to-DC converter
Catalina Research
Pathfinder-1 ASIC (350nm)
Texas Memory Systems TM-44 Blackbird ASIC
(180nm)
Xilinx Virtex FPGA (180nm)
Motorola MPC7400 PowerPC RISC with AltiVec (180-220nm)
MPC7410 (180nm)
Virtex II (150nm)
Clearspeed CS301 ASIC
(130nm)4X in 3 yrs
MPC7448(90nm)
22nm
16nm
16nm
MIT Lincoln Laboratory000523-jca-20KAM 04/20/23
World’s Largest Economies: 2000 vs. 2024
U.S. population grows by 1/3 & income shrinks from 5X to <4X world average
* Gross domestic product (purchasing power parity)
2000 GDP*
2024GDP*
2000 population
2024population
6 billion total
“Europe’s Top 5” are Germany, Great Britain,
France, Italy & Spain
8 billion total
MIT Lincoln Laboratory000523-jca-21KAM 04/20/23
0
4
8
12
16
20
0 1 10 100 1000 10000
Sampling Rate (million samples/sec)
Eff
ecti
ve N
um
ber
of
Bit
s
1986-1990
1991-1995
1996-2000
2001-2005
2006-2008
Highest-performance COTS (commercial off-the-shelf) ADCs (analog-to-digital converters), 3Q08
.1
Max processing gain w/ linearization
2000-2007: 2X speed in 6.75 yrs (but up to 0.5 bit processing gain is more than offset by loss of 1 effective bit)
Historic device-level
improvement rates may not be sustainable as technical &
economic limits are
approached
Thermal noise(~0.5 bit/octave)
Aperture
uncertainty(~1
bit/octave)
2003-2007:~0.25 bit/yr
@ 400 MSPS
MIT Lincoln Laboratory000523-jca-22KAM 04/20/23
30
40
50
60
70
80
90
100
110
120
0 1 10 100 1000 10000
Sampling rate (million samples/sec)
Sp
ur-
free
dyn
amic
ran
ge
(dB
)
1986-1990
1991-1995
1996-2000
2001-2005
2006-2008
SFDR (spur-free dynamic range) for Highest-performance COTS ADCs, 3Q08
.1
3dB/octave
6dB/octave
SFDR performance often limits
ability to subsequently
achieve “processing
gain”
MIT Lincoln Laboratory000523-jca-23KAM 04/20/23
0
50
100
150
200
250
300
350
0 1 10 100 1000 10000
Sampling rate (million samples/sec)
Qu
anti
zati
on
en
erg
y (p
J)
1986-1990
1991-1995
1996-2000
2001-2005
2006-2008
Energy per Effective Quantization Level for Highest-performance COTS ADCs, 3Q08
.1
Recent power decrease (driven by
mobile devices market) from
smaller geometry & advanced
architectures
MIT Lincoln Laboratory000523-jca-24KAM 04/20/23
Resolution Improvement Timeline forHighest-performance COTS ADCs, 1986-2008
0.1
1
10
100
1000
10000
Jan-86 Jan-88 Jan-90 Jan-92 Jan-94 Jan-96 Jan-98 Jan-00 Jan-02 Jan-04 Jan-06 Jan-08
Year
Sam
plin
g R
ate
(MSP
S)
86 90 94 9896 020.1
1
10
Sam
pli
ng
Rat
e (M
SP
S)
Year
100
88 92 00
1000
10,000 Non-COTS chip set (ENOB=4.6 @ 12 GSPS)
10-bit (ENOB>7.6)
12-bit (ENOB>9.8)
14-bit (ENOB>10.9)
16-bit (ENOB>12.5)
24-bit (ENOB>15.3)
04 06 08
6- to 8-bit (ENOB>4.3)