mit lincoln laboratory 000523-jca-1 kam 11/13/2015 panel session: paving the way for multicore open...

MIT Lincoln Laboratory000523-jca-1KAM 04/20/23

Panel Session:Paving the Way for Multicore Open Systems Architectures

James C. Anderson MIT Lincoln Laboratory

HPEC08Wednesday, 24 September 2008

This work was sponsored by the Department of the Air Force under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations,

conclusions and recommendations are those of the author, and are not necessarily endorsed by the United States Government.

Reference to any specific commercial product, trade name, trademark or manufacturer does not constitute or imply endorsement.


Objective & Schedule

• Objective: Assess the infrastructure (hardware, software & support) that enables use of multicore open systems architectures

– Where are we now?– What needs to be done?

• Schedule– 1525: Overview– 1540: Guest speaker: Mr. Markus Levy– 1600: Introduction of the panelists– 1605: Previously submitted questions for the panel– 1635: Open forum– 1655: Conclusions & the way ahead– 1700: Closing remarks & adjourn


Paving the Way forMulticore Open Systems Architectures

Multicore µPs

Uni-processors

Design paths diverged ca. 2005

Tools

Methodologies


But First, A Few Infrastructure Issues

Performance was doubling every 18

months (Moore’s Law), but not anymore


0

10000

20000

30000

40000

Dec-06 Dec-07 Dec-08 Jan-10 Jan-11 Jan-12 Jan-13 Jan-14 Jan-15

2000 International Technology Roadmap for Semiconductors (ITRS00)

20

2007(65nm)

10

40

30

On

-ch

ip l

oca

l cl

ock

, G

Hz

2010(45nm)

2013(32nm)

2016(22nm)

Year(Node)

ITRS00

• ~3.5X throughput every 3 yrs predicted for multiple independent cores (~same as 4X every 3 yrs for historical Moore’s Law)

– 1.4X clock speed every 3 yrs for constant power– 2.5X transistors/chip every 3 yrs (partially driven by economics) for

constant chip size (chip size growth ended ~1998)

In 2000, ITRS00 predicted a slightly lower improvement

rate vs. historical Moore’s Law for the 2008-2014 timeframe


0

10000

20000

30000

40000


2001-2002 International Technology Roadmap for Semiconductors (ITRS01-02)

20

2007(65nm)

10

40

30

On

-ch

ip l

oca

l cl

ock

, G

Hz

2010(45nm)

2013(32nm)

2016(22nm)

Year(Node)

ITRS00

ITRS01-02

• 2.8X throughput every 3 yrs predicted for multiple independent cores– 1.4X clock speed every 3 yrs for constant power (same as ITRS00)– 2X transistors/chip every 3 yrs for constant chip size (less than ITRS00)

ITRS01-02 predicted substantially lower improvement rate vs.

ITRS00, but higher clock speeds


0

10000

20000

30000

40000


2003-2006 International Technology Roadmap for Semiconductors (ITRS03-06)

20

2007(65nm)

ITRS03-ITRS06

10

40

30

On

-ch

ip l

oca

l cl

ock

, G

Hz

2010(45nm)

2013(32nm)

2016(22nm)

Year(Node)

ITRS00

ITRS01-02

ITRS03-06 predicted same improvement rate as ITRS01-02,

but even higher clock speeds

• 2.8X throughput every 3 yrs predicted for multiple independent cores– 1.4X clock speed every 3 yrs for constant power (same as ITRS00-02)– 2X transistors/chip every 3 yrs for constant chip size (same as ITRS01-02)

1.4X increase


0

10000

20000

30000

40000


2007 International Technology Roadmap for Semiconductors (ITRS07)

20

2007(65nm)

ITRS03-ITRS06

10

40

30

ITRS07 predicts lower clock speeds & improvement rate vs. ITRS00-06

On

-ch

ip l

oca

l cl

ock

, G

Hz

4.3X reduction

2010(45nm)

2013(32nm)

2016(22nm)

Year(Node)

ITRS00

ITRS01-02

• ~2.5X throughput every 3 yrs predicted for multiple independent cores– 1.23X clock speed every 3 yrs for constant power (less than ITRS00-06)– 2X transistors/chip every 3 yrs for constant chip size (same as ITRS01-06)

ITRS07


0.01

0.1

1

10

100

1/1/

1993

1/1/

1994

1/1/

1995

1/1/

1996

1/1/

1997

1/1/

1998

1/1/

1999

1/1/

2000

1/1/

2001

1/1/

2002

1/1/

2003

1/1/

2004

1/1/

2005

1/1/

2006

1/1/

2007

1/1/

2008

1/1/

2009

1/1/

2010

1/1/

2011

1/1/

2012

1/1/

2013

1/1/

2014

1/1/

2015

1/1/

2016

1/1/

2017

1/1/

2018

1/1/

2019

1/1/

2020

1/1/

2021

1/1/

2022

1/1/

2023

COTS Compute Node (processor, memory & I/O) Performance History & Projections (2Q08)

2023

100

GF

LO

PS

/W(b

illio

n 3

2-b

it f

loat

ing

po

int

op

erat

ion

s p

er

sec-

wat

t fo

r co

mp

uti

ng

1K

-po

int

com

ple

x F

FT

)

Year

0.011993

1

10

2.5X in 3 yrs im

provement rate fo

r

multicore µPs & COTS ASICs

0.1

2X in 3 yrs improvement ra

te for general-

purpose uni-processor µPs

4X in

3 y

rs im

prove

men

t

rate

for a

ll te

chnolo

gies

2008

1000nm

100nm

10nm

2.5X in 3 yrs im

provement

rate for S

RAM-based FPGAs

Projected improvement rates, although smaller than historical

values, are still substantial

IBM Cell (90nm)

Intel i860µP (1µm)


0.1

1

10

100

1000

10000

Notional Cost (cumulative) & Schedule for COTS 90nm Cell Broadband Engine

20080.1

Year

1

2000

10

1000

Cell Broadband Engine: 205 GFLOPS (peak, 32-

bit) @ 100W (est.), ~2 GFLOPS/W

100

Dev

elo

pm

ent

Co

st (

mil

lio

ns

of

$)

20042002 2006

Cell

IBM, Sony & Toshiba hold architectural discussions

Austin (Texas) design center opens ($400M

joint investment in Cell design)

Sony exits future Cell development after

investing $1.7B

Technology evaluation systems shipped


Multicore Open Systems Architecture Example

• LEON3– 32-bit SPARC V8 processor developed by Gaisler Research (Aeroflex as

of 7/14/08) for the European Space Agency– Synthesizable VHDL (GNU general public license) & documentation

downloadable from www.gaisler.com– Open source software support (embedded Linux, C/C++ cross-compiler,

simulator & symbolic debugger)

• 0.25µm LEON3FT– Commercial fault-tolerant implementation of LEON3– 75 MFLOPS/W (150 MIPS & 30 MFLOPS @ 150 MHz for 0.4W)

• 90nm quad-core LEON3FT– System emulated with a single SRAM-based FPGA– 133 MFLOPS/W (4x500 MIPS & 4x100 MFLOPS for 3W)– Each core occupies <1mm2 including caches– MOSIS fabricates 65nm & 90nm die up to 360mm2 (IBM process)

How can we improve performance (FLOPS/W), which lags COTS by up to 9 yrs (15X) in this example?


0.1

1

10

100

1000

10000

Notional Cost (cumulative) & Schedule for 90nm LEON3FT Multicore Processor

20080.1

Year

1

2000

10

1000

100

$3M estimated development cost is mostly staff expense, with

schedule determined by foundry

Dev

elo

pm

ent

Co

st (

mil

lio

ns

of

$)

20042002 2006

0.25µm LEON3FT

90nm quad-core LEON3FT

65nm fab access


Panel Session: Paving the Way for Multicore Open Systems Architectures

Panel members & audience may hold diverse, evolving opinions

Moderator: Dr. James C. AndersonMIT Lincoln Laboratory

Prof. Saman AmarasingheMIT Computer Science & Artificial

Intelligence Laboratory (CSAIL)

Mr. Markus LevyThe Multicore Association &The Embedded Microprocessor Benchmark Consortium (EEMBC)

Dr. Matthew ReillyChief EngineerSiCortex, Inc. Mr. John Rooks

Air Force Research Laboratory (AFRL/RITC)

Emerging Computing Technology

Dr. Steve MuirChief Technology Officer

Vanu, Inc.


• Despite industry slowdown, embedded processors are still improving exponentially (2/3 of historical Moore’s Law rate)

• Although performance improvements in multicore designs (2.5X every 3 yrs) continue to outpace those of uni-processors (2X every 3 yrs), the “performance gap” is less than previously projected

• New tools and methodologies will be needed to maximize the benefits of using multicore open systems architectures

– Power & packaging issues– Cost & availability issues– Training & ease-of-use issues– Platform independence issues

• Although many challenges remain in reducing the performance gap between highly specialized systems vs. multicore open systems architectures, the latter will help insulate users from manufacturer-specific issues

Conclusions & The Way Ahead

Success still depends on ability of foundries to provide smaller geometries & increasing speed for constant

power (driven by large-scale COTS product economics)


Backup Slides


COTS ASIC: 90nm IBMCell Broadband Engine (4Q06)

• 100W (est.) @ 3.2 GHz

• 170 GFLOPS sustained for 32-bit flt pt 1K cmplx FFT (83% of peak)

• 16 Gbyte memory options (~10 FLOPS/byte)– COTS Rambus XDR DRAM (Cell is designed to use only this memory)

256 chips 690W (note: Rambus devices may not be 3D stackable due to 2.7W/chip

power consumption)

– Non-COTS solution: Design a bridge chip ASIC (10W est.) to allow use of 128 DDR2 SDRAM devices (32W)

128 chips in 3D stacks to save space (0.25W/chip) Operate many memory chips in parallel Buffer to support Rambus speeds Increased latency vs. Rambus

• 40W budget for external 27 Gbytes/sec simultaneous I&O (using same non-COTS bridge chip to handle I/O with Cell)

• Single non-COTS CN (compute node) using DDR2 SDRAM– 170 GFLOPS sustained for 200W (182W est. for CN plus 18W for 91%

efficient DC-to-DC converter)– 0.85 GFLOPS/W & 56 GFLOPS/L


0.01

0.1

1

10

100

1/1/

1993

1/1/

1994

1/1/

1995

1/1/

1996

1/1/

1997

1/1/

1998

1/1/

1999

1/1/

2000

1/1/

2001

1/1/

2002

1/1/

2003

1/1/

2004

1/1/

2005

1/1/

2006

1/1/

2007

1/1/

2008

1/1/

2009

1/1/

2010

1/1/

2011

1/1/

2012

1/1/

2013

1/1/

2014

1/1/

2015

1/1/

2016

1/1/

2017

1/1/

2018

1/1/

2019

1/1/

2020

1/1/

2021

1/1/

2022

1/1/

2023

COTS Compute Node Performance History & Projections (2Q08)

2023

100

GF

LO

PS

/W(b

illio

n 3

2-b

it f

loat

ing

po

int

op

erat

ion

s p

er

sec-

wat

t fo

r co

mp

uti

ng

1K

-po

int

com

ple

x F

FT

)

Year

0.011993 2003

1

102.5X in

3 yrs

improvement ra

te for

SRAM-based FPGAs,

COTS ASICs &

multicore µPs

0.1

2X in 3 yrs improvement ra

te for general-

purpose uni-processor µPs

2013

Intel Polaris (65nm)

IBM Cell (90nm)

Virtex-4(90nm)

FreescaleMPC7447A

(130nm)Intel i860µP (1000nm)

Analog DevicesSHARC DSP

(600nm)

Compute Node includes FFT (fast Fourier transform)

processor, memory (10 FLOPS/byte),

simultaneous I&O (1.28 bits/sec per FLOPS) &

DC-to-DC converter

Catalina Research

Pathfinder-1 ASIC (350nm)

Texas Memory Systems TM-44 Blackbird ASIC

(180nm)

Xilinx Virtex FPGA (180nm)

Motorola MPC7400 PowerPC RISC with AltiVec (180-220nm)

MPC7410 (180nm)

Virtex II (150nm)

Clearspeed CS301 ASIC

(130nm)4X in 3 yrs

MPC7448(90nm)

22nm

16nm

16nm


World’s Largest Economies: 2000 vs. 2024

U.S. population grows by 1/3 & income shrinks from 5X to <4X world average

* Gross domestic product (purchasing power parity)

2000 GDP*

2024GDP*

2000 population

2024population

6 billion total

“Europe’s Top 5” are Germany, Great Britain,

France, Italy & Spain

8 billion total


0

4

8

12

16

20

0 1 10 100 1000 10000

Sampling Rate (million samples/sec)

Eff

ecti

ve N

um

ber

of

Bit

s

1986-1990

1991-1995

1996-2000

2001-2005

2006-2008

Highest-performance COTS (commercial off-the-shelf) ADCs (analog-to-digital converters), 3Q08

.1

Max processing gain w/ linearization

2000-2007: 2X speed in 6.75 yrs (but up to 0.5 bit processing gain is more than offset by loss of 1 effective bit)

Historic device-level

improvement rates may not be sustainable as technical &

economic limits are

approached

Thermal noise(~0.5 bit/octave)

Aperture

uncertainty(~1

bit/octave)

2003-2007:~0.25 bit/yr

@ 400 MSPS


30

40

50

60

70

80

90

100

110

120

0 1 10 100 1000 10000

Sampling rate (million samples/sec)

Sp

ur-

free

dyn

amic

ran

ge

(dB

)

1986-1990

1991-1995

1996-2000

2001-2005

2006-2008

SFDR (spur-free dynamic range) for Highest-performance COTS ADCs, 3Q08

.1

3dB/octave

6dB/octave

SFDR performance often limits

ability to subsequently

achieve “processing

gain”


0

50

100

150

200

250

300

350

0 1 10 100 1000 10000

Sampling rate (million samples/sec)

Qu

anti

zati

on

en

erg

y (p

J)

1986-1990

1991-1995

1996-2000

2001-2005

2006-2008

Energy per Effective Quantization Level for Highest-performance COTS ADCs, 3Q08

.1

Recent power decrease (driven by

mobile devices market) from

smaller geometry & advanced

architectures


Resolution Improvement Timeline forHighest-performance COTS ADCs, 1986-2008

0.1

1

10

100

1000

10000

Jan-86 Jan-88 Jan-90 Jan-92 Jan-94 Jan-96 Jan-98 Jan-00 Jan-02 Jan-04 Jan-06 Jan-08

Year

Sam

plin

g R

ate

(MSP

S)

86 90 94 9896 020.1

1

10

Sam

pli

ng

Rat

e (M

SP

S)

Year

100

88 92 00

1000

10,000 Non-COTS chip set (ENOB=4.6 @ 12 GSPS)

10-bit (ENOB>7.6)

12-bit (ENOB>9.8)

14-bit (ENOB>10.9)

16-bit (ENOB>12.5)

24-bit (ENOB>15.3)

04 06 08

6- to 8-bit (ENOB>4.3)

mit lincoln laboratory 000523-jca-1 kam 11/13/2015 panel session: paving the way for multicore open...

Documents