Computer Architectures for DNA Self-Assembled Nanoelectronics
Alvin R. Lebeck
Department of Computer Science
Duke University
[Figure: engineered DNA nanostructures + nanoelectronics = the computers of tomorrow]
Slide 2
Acknowledgements
People
• Students: Jaidev Patwardhan, Constantin Pistol, Vijeta Johri, Sung-Ha Park, Nathan Sadler, Niranjan Soundararajan, Ben Burnham, R. Curt Harting
• Chris Dwyer, Daniel J. Sorin, Thomas H. LaBean, Jie Liu, John H. Reif, Hao Yan
• Sean Washburn, Dorothy A. Erie (UNC)
Funding
• Air Force Research Lab
• National Science Foundation (ITR)
• Duke University Office of the Provost
• Equipment from IBM & Intel
Slide 3
Current Processor Designs
• Large Complex Systems (millions/billions of transistors)
• Mature technology (CMOS)
• Precise control of entire design and fabrication process
• Lithographic process to create smaller and smaller features
  – But has limits…
• Cost of facility, high defect rates, process variation, etc.
[Figure: MOSFET transistor: gate (G), source (S), drain (D), N-doped regions in silicon]
Slide 4
The Red Brick Wall
• "Eventually, toward the end of the Roadmap or beyond, scaling of MOSFETs (transistors) will become ineffective and/or very costly, and advanced non-CMOS solutions will need to be implemented." [International Technology Roadmap for Semiconductors, 2003 Edition, Difficult Challenge #10]
[Chart: NAND delay (ps) for known CMOS nodes vs. the CMOS red brick wall]
Slide 5
The Potential Solution
• Self-Assembled Nanoelectronics
• Self-assembly
  – Molecules self-organize into stable structures (nano)
• What nanostructures?
• What nanoelectronic devices?
• How does self-assembly affect computer system design?
Slide 6
Outline
• Nanostructures & Components
• Circuit Design Issues
• Architectural Implications
• Proposed Architectures
• Defect Tolerance
• Conclusion
Slide 7
DNA Self-Assembly
• Well-defined rules for base-pair matching (see the sketch below)
  – Thermodynamics-driven hybridization
• Can specify the sequence of pairs; strands form a double helix
– Synthetic DNA
– Engineered Nanostructures
– Inexpensive lab equipment
[Figure: base pairing: Adenine (A) with Thymine (T), Cytosine (C) with Guanine (G); sticky end (tag); 20 nm tile]
[Seeman '99, Winfree et al. '98, Yan et al. '03]
Strands→Tiles→Structures
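To make the pairing rule concrete, here is a minimal Python sketch of Watson-Crick matching (an illustration, not code from the talk). The exact-match criterion is a deliberate simplification: real hybridization is thermodynamic and can bind on partial matches, which is why later slides minimize the number of unique tags.

    # Watson-Crick complements: A pairs with T, C pairs with G.
    COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def reverse_complement(strand):
        # The binding partner runs antiparallel, so reverse the
        # sequence and complement each base.
        return "".join(COMPLEMENT[b] for b in reversed(strand))

    def hybridizes(sticky_end, tag):
        # Idealized rule: two single strands bind when one is the
        # exact reverse complement of the other.
        return tag == reverse_complement(sticky_end)

    assert reverse_complement("ATGC") == "GCAT"
    assert hybridizes("ATGC", "GCAT")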
Slide 8
DNA-based Self-Assembly of Nanoscale Systems
• Use synthetic DNA as scaffolding for nanoelectronics
• Create circuits (nodes) using aperiodic patterning
  – Demonstrated aperiodic patterns with 20nm pitch
[FNANO ’05, Angewandte Chemie ’06, DAC ’06]
Slide 9
Nanoelectronic Components
• Many Choices / Challenges
  – Good Transistor Behavior
  – Interaction with DNA Lattice
• Crossed Nanotube Transistor [Fuhrer et al. '01]
• Demonstrated Functionalization of Tube Ends [Dwyer, et al. '02]
• Other candidates: Ring-gated, Crossed Nanorod, Crossed Carbon Nanotube FETs
[Figure: DNA-functionalized nanotube ends (A, C, G, T) [Dwyer, et al. IEEE FNANO '04]]
[Chart: NAND delay (ps): CNFET vs. known CMOS and the CMOS red brick wall]
Slide 10
Circuit Design Issues
Goal
Construct a computing system using the DNA lattice and nanoelectronic components.
Proposal
Use DNA tags (sticky ends) to place nano-components on the lattice.
1. Regularity of DNA lattice: easy to replicate simple structures on a moderate scale
2. Complexity of digital circuits: large graph with many unique nodes and edges
3. Tolerating defects: single-stranded DNA tags (sticky ends) may have partial matches (must minimize the number of unique tags); nanotubes may not work as advertised
Slide 11
Balancing Regularity & Complexity
• Array of simple objects
• Unit Cell based on lattice cavity
  – Uniform-length nanotubes
  – Minimizes # of DNA tags => reduces probability of partial match
  – 20nm x 20nm
• Two levels of interconnect
• Complex circuits on single lattice (10K FETs)
• Envision ~9 µm² node size: ~10,000 FETs + interconnect
• How to get billions or more?
[Figure: 20 nm unit cell (terminals A, B) with Vdd plane, ground plane, insulating layer, and interconnect layers]
Slide 12
Self-Assembled System
• Self-assemble ~10⁹ - 10¹² simple nodes (~10K FETs)
• Potential: Tera- to Peta-scale computing
• Random graph of small-scale nodes
  – There will be defects
  – Scaled CMOS may look similar
• How do we perform useful computation?
[Figure: 20 nm nodes joined by self-assembled interconnect; wires formed by selective metallization [Yan '03]]
Slide 13
Outline
• Nanostructures & Components
• Circuit Design Issues
• Architectural Implications
• Proposed Architectures
• Defect Tolerance
• Conclusion
Slide 14
Implications of Small Nodes
• Node: DNA grid with FETs
  – 3 µm x 3 µm node
  – Carbon nanotube [Dwyer '02]
  – Ring-gated [Skinner '05]
• Small-scale control
  – Controlled complexity only within one node
• Limited space on each node
  – Simple circuits (e.g., full adder)
• Limited communication between nodes
  – Only 4 neighbors
  – No global (long-haul) interconnect
• Limited coordination
  – Difficult to get many nodes to work together (e.g., 64-bit adder)
[Figure: 20 nm node schematic]
Slide 15
Implications of Randomness
• Self-assemble interconnect of nodes
1. Random node placement
2. Random node orientation
3. Random connectivity
4. High defect rates (assume fail-stop nodes)
• Limitations → architectural challenges
Slide 16
Architectural Challenges
• Node Design
• Utilizing Multiple Nodes
  – Each node is very simple
• Routing
• Execution Model
  – Must overcome implementation constraints
• Instruction Set
• Micro-scale Interface
Slide 17
Outline
• Nanostructures & Components
• Circuit Design Issues
• Architectural Implications
• Proposed Architectures
  – Defect Isolation & Structure
  – NANA [JETC '06]
  – SOSA [ASPLOS '06]
• Defect Tolerance
• Conclusion
Slide 18
Nano-scale Active Network Architecture
• Large-scale fabrication (10¹² nodes, 10⁹ cells)
• Via provides micro-scale interface; multiple node types
• First cut: understand issues
[Figure: system view and a single cell]
Slide 19
Defect Isolation/Structure
• Grid w/ defects → random graph
• Reverse path forwarding (RPF) [Dalal '78]
• Broadcast on all links except input [Nanoarch '05]
  – Forward broadcast if not seen before
  – Implement fail-stop nodes [Nanoarch '06]
• RPF maps out defective regions
  – No external defect map
  – Can tolerate up to 30% defective nodes
• Distributed algorithm to create spanning tree (sketched below)
• Route packets along tree
  – Up*/down*
  – Depth first
• How do we compute?
[Figure: anchor node, defective nodes, and nodes mapped after RPF, with root direction]
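A minimal Python sketch of this configuration flood (a reconstruction under simplifying assumptions: square grid, independent fail-stop defects, illustrative names; not the authors' implementation). Flooding from the anchor and having each live node adopt the first neighbor it hears from yields a spanning tree of the reachable region, with no external defect map:

    import random
    from collections import deque

    def rpf_spanning_tree(width, height, defect_rate, anchor=(0, 0), seed=0):
        rng = random.Random(seed)
        # Fail-stop defects: a defective node never forwards anything.
        alive = {(x, y) for x in range(width) for y in range(height)
                 if rng.random() >= defect_rate or (x, y) == anchor}
        parent = {anchor: None}          # tree edges of the mapped region
        frontier = deque([anchor])
        while frontier:
            x, y = frontier.popleft()
            for nbr in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nbr in alive and nbr not in parent:  # forward only if unseen
                    parent[nbr] = (x, y)
                    frontier.append(nbr)
        return parent                    # reachable nodes + spanning tree

    tree = rpf_spanning_tree(100, 100, defect_rate=0.30)
    print(f"{len(tree)} of {100 * 100} nodes mapped into the tree")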
Slide 20
NANA: Computing on a Random Graph
• Perform 3 operations: Add, Add, Multiply
• Search along path for correct blocks to perform function
• Execution packets carry operation and values
• Proof-of-concept simulations
[Figure: an execution packet enters the random graph, visits add (+), subtract (-), and multiply (x) blocks along its path, and exits]
Slide 21
NANA: Execution Model & ISA
• Accumulator-based ISA
• Carry data and instructions in a "packet"
• Use bit-serial processing elements (see the sketch below)
  – Each element operates on one bit at a time
  – Minimize inter-bit communication
[Packet format: Header | op 1 | op 2 | op 3 | A0 B0 C0 D0 | A1 B1 C1 D1 | … | A31 B31 C31 D31 | Tail: opcodes followed by bit-interleaved operands]
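As a concrete illustration of bit-serial processing on such a packet, here is a Python sketch (mine; the talk gives no code) of an LSB-first serial adder. Each step consumes one bit per operand, and only a 1-bit carry survives between steps, which is exactly the inter-bit communication being minimized:

    def bit_serial_add(a_bits, b_bits):
        # Operand bits arrive LSB first, as in the interleaved packet.
        carry, out = 0, []
        for a, b in zip(a_bits, b_bits):
            out.append(a ^ b ^ carry)                # full-adder sum bit
            carry = (a & b) | (carry & (a ^ b))      # full-adder carry-out
        return out

    def to_bits(v, width=32):
        return [(v >> i) & 1 for i in range(width)]  # LSB-first stream

    def from_bits(bits):
        return sum(b << i for i, b in enumerate(bits))

    assert from_bits(bit_serial_add(to_bits(1234), to_bits(5678))) == 6912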
Slide 22
NANA: System Overview
• Simple programs
  – Fibonacci
  – String compare
• Utilization is low
• Divide 10¹² nodes into 10⁹ cells
• Peak performance potentially higher than IBM Blue Gene and NEC Earth Simulator
• Need to use more nodes!
[Chart: log peak performance (bit ops/sec): DNA-SA vs. BlueGene vs. EarthSim]
Slide 23
Self-Assembled System (recap of Slide 12, extended)
• Group many nodes into a SIMD PE
• PEs connected in logical ring
• Familiar data parallel programming
[Figure: control processor driving a logical ring of SIMD PEs built from self-assembled nodes]
Slide 24
Self-Organizing SIMD Architecture (SOSA)
• Nodes grouped to form SIMD Processing Element (PE)
  – Head, Tail, N computation nodes (k-wide bit-slice of PE)
• Configuration: depth-first traversal of spanning tree (see the sketch below)
  – Orders nodes within PE (Head → LSB → … → MSB → Tail)
  – Orders PEs
• Many SIMD PEs on logical ring → familiar data parallel programming abstraction
[Figure: spanning tree rooted at the via; tree edges and PE boundaries over numbered nodes]
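The configuration step can be sketched in a few lines of Python (an illustration; the tree shape, group size, and names are assumptions, not the authors' code). A depth-first traversal linearizes the spanning tree, and consecutive runs of nodes become PEs with a head, the LSB-through-MSB compute nodes, and a tail:

    def dfs_order(children, root):
        # Preorder depth-first traversal of the spanning tree.
        order, stack = [], [root]
        while stack:
            node = stack.pop()
            order.append(node)
            stack.extend(reversed(children.get(node, [])))
        return order

    def group_into_pes(order, compute_nodes_per_pe):
        size = compute_nodes_per_pe + 2       # plus head and tail nodes
        return [{"head": order[i],
                 "compute": order[i + 1:i + size - 1],  # LSB ... MSB
                 "tail": order[i + size - 1]}
                for i in range(0, len(order) - size + 1, size)]

    children = {0: [1, 4], 1: [2, 3], 4: [5, 6, 7]}
    order = dfs_order(children, 0)            # [0, 1, 2, 3, 4, 5, 6, 7]
    pes = group_into_pes(order, compute_nodes_per_pe=2)
    # -> heads 0 and 4, compute nodes [1, 2] and [5, 6], tails 3 and 7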
Slide 25
SOSA: Instruction Broadcast
• Instructions broadcast to all nodes
• Instructions decomposed into three "microinstructions" (opcode, registers, synch)
• Can reach nodes/PEs at different times (5 before 9)
[Figure: a broadcast entering the spanning tree and crossing PE boundaries]
Slide 26
SOSA: Instruction Execution
• Instructions execute asynchronously within/across PEs
• XOR is parallel within a PE vs. addition, which is serial within a PE (see the sketch below)
• ISA: three register operands, predication, optimizations; see paper for details…
[Figure: asynchronous execution across the spanning tree and PE boundaries]
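A small Python sketch (with illustrative 2-bit slices; not from the talk) of why the two instructions behave differently: XOR has no dependence between a PE's bit-slice nodes, while ADD must ripple a carry from the LSB node toward the MSB node:

    def xor_per_slice(a_slices, b_slices):
        # Every node can fire at once: no inter-slice dependence.
        return [a ^ b for a, b in zip(a_slices, b_slices)]

    def add_per_slice(a_slices, b_slices, slice_bits=2):
        # The carry ripples node to node: slice i waits on slice i-1.
        mask, carry, out = (1 << slice_bits) - 1, 0, []
        for a, b in zip(a_slices, b_slices):      # LSB slice first
            total = a + b + carry
            out.append(total & mask)
            carry = total >> slice_bits
        return out

    assert xor_per_slice([0b01, 0b10], [0b11, 0b10]) == [0b10, 0b00]
    assert add_per_slice([0b11, 0b00], [0b01, 0b00]) == [0b00, 0b01]  # 3 + 1 = 4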
Slide 27
Two System Configurations
• One large system → latency
• Multiple "cells" with space sharing → throughput
Slide 28
Outline
• Nanostructures & Components
• Circuit Design Issues
• Architectural Implications
• Proposed Architectures
  – SOSA [ASPLOS '06]
  – Node Design
  – Evaluation
• Defect Tolerance
• Conclusion
Slide 29
SOSA Node
• Homogeneous nodes
  – Specialized during configuration
• Asynchronous logic
• Communication
  – 4 transceivers (4-phase handshake)
  – 3 virtual channels (instruction broadcast, ring left & right)
• Computation
  – ALU
  – Register (32 bits: 32x1 or 16x2)
  – Instruction buffer
• Configuration
  – Route setup
• Subcomponent BIST [Nanoarch '06]
[Figure: SOSA node block diagram: four transceivers; three virtual channels with input/output buffers, mux/demux, and routing/route-setup logic; instruction buffer, register file, and ALU; analog and synch control]
Slide 30
SOSA Node
• VHDL
  – ~10K FETs
• Area ≈ 9 µm²
  – Custom layout tools for standard cells
• Power ≈ 6.5 W/cm²
  – Semi-empirical SPICE model [IEEE Nano '04]
  – 1 ns switching time
  – 88% of devices active
  – 0.775 µW / node
• Modern processors > 75 W/cm²
[Figure: node floorplan: four transceivers surrounding configuration logic and compute logic]
Slide 31
Evaluation Methodology
• Custom event simulator (see the sketch below)
  – Conservative 1 ns time quantum (switching time)
  – 2 bits per node (16 registers; 16 + 2 nodes for a 32-bit PE)
• Nine benchmarks
  – Integer code only: no hardware support for floating point
  – Matrix multiplication, image filters (Gaussian, generic, median), encryption (TEA, XTEA), sort, search, bin-packing
• Compare performance to four other architectures
  – Pentium 4 (P4) (real hardware)
  – Ideal out-of-order superscalar (I-SS): 10 GHz, 128-wide, 8K ROB
  – Ideal chip multiprocessor (I-CMP): 16-way ideal
  – Ideal SOSA (I-SOSA): no communication overhead, unit instruction latency
  – Extrapolate for large SOSA systems (back-validated)
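For flavor, a minimal Python skeleton of an event-driven simulator with a conservative 1 ns quantum (a sketch only; the actual simulator's event types and handlers are not described in this deck):

    import heapq

    class EventSim:
        QUANTUM_NS = 1                    # conservative switching-time quantum

        def __init__(self):
            self.now = 0
            self._queue = []              # (time_ns, seq, handler, payload)
            self._seq = 0

        def schedule(self, delay_ns, handler, payload=None):
            t = self.now + max(delay_ns, self.QUANTUM_NS)
            heapq.heappush(self._queue, (t, self._seq, handler, payload))
            self._seq += 1                # seq breaks ties deterministically

        def run(self, until_ns):
            while self._queue and self._queue[0][0] <= until_ns:
                self.now, _, handler, payload = heapq.heappop(self._queue)
                handler(self, payload)

    # Example: a node delivering a microinstruction after one quantum.
    sim = EventSim()
    sim.schedule(1, lambda s, p: print(f"t={s.now}ns deliver {p}"), "opcode")
    sim.run(until_ns=10)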
Slide 32
Matrix Multiply (Execution Time)
• Hand optimizations (loop unrolling, etc.)
• Better scalability than other systems (crossover < 1000)
• Still room for improvement
[Chart: run time (microseconds, log scale) vs. matrix dimension for Pentium 4, Ideal Single Core, Ideal 16-CMP, Ideal SOSA, SOSA, and Extrapolated SOSA]
Slide 33
TEA Encryption (Throughput)
Architecture                   | Encryptions/sec
P4 @ 3 GHz (100 mm²)           | 3.9 M/sec
I-SS                           | 73.62 M/sec
16-CMP                         | 1180 M/sec
SOSA (1 cell, ~0.019 mm²)      | 0.175 M/sec
I-SOSA (1 cell)                | 27.7 M/sec
SOSA (5400 cells, 100 mm²)     | 940 M/sec
I-SOSA (5400 cells)            | 72300 M/sec
• Used in XBOX
• Shift, add, and xor (see the sketch below)
• 64-bit data blocks
• 128-bit key
• Pipelined on 64 PEs
• Configure multiple cells of 64 PEs
• Single cell: poor
• 200x better than P4 in the same area
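For reference, the TEA kernel being pipelined is tiny (shifts, adds, and XORs on a 64-bit block under a 128-bit key), which is what makes it a good fit for simple PEs. A Python sketch using the standard published TEA round; the key and plaintext below are arbitrary example values:

    def tea_encrypt(v0, v1, key, rounds=32):
        # v0, v1: 32-bit halves of the block; key: four 32-bit words.
        k0, k1, k2, k3 = key
        delta, total, mask = 0x9E3779B9, 0, 0xFFFFFFFF
        for _ in range(rounds):
            total = (total + delta) & mask
            v0 = (v0 + (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & mask
            v1 = (v1 + (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & mask
        return v0, v1

    c0, c1 = tea_encrypt(0x01234567, 0x89ABCDEF,
                         (0xDEADBEEF, 0xCAFEBABE, 0x00112233, 0x44556677))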
Slide 34
Outline
• Nanostructures & Components
• Circuit Design Issues
• Architectural Implications
• Proposed Architectures
• Defect Tolerance (not transient faults)
• Conclusion
Slide 35
Defect Tolerance
• Simple fail-stop model
• Encryption gracefully degrades
• MXM: < 10% degradation up to 20% defective nodes
[Charts: TEA throughput (million encryptions/sec) vs. node defect rate (0-40%); normalized run time vs. node defect rate for matrix multiply and encryption]
Slide 36
Node Failure Modes [Nanoarch '06]
[Figure: node diagrams for the failure modes: Simple, Communication-Centric, Compute-Centric, and Hybrid (any two components)]
• Exploit modular node design
  – VHDL BIST for communication & configuration (all stuck-at faults)
  – Assume software test for compute logic
• Configuration logic is critical
Slide 37
Evaluation
• Simple node model in C
• Model network with 10,000 nodes
• Vary transistor defect probability from 0% to 0.1%
  – Map defective transistors to defective components
• Average 500 runs per data point
• How much do we benefit from node modularity? (see the sketch below)
  – What device defect probability can it handle?
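A Python re-sketch of the core step of this experiment (the talk's model is in C, and its per-component transistor budgets are not given here, so the numbers below are illustrative assumptions): transistors fail independently, a component fails if any of its transistors do, and the failure mode decides whether the node is still usable:

    import random

    # Assumed transistor budgets per component (not from the talk).
    COMPONENTS = {"communication": 4000, "compute": 4000, "configuration": 2000}

    def component_ok(p, n_transistors, rng):
        # A component survives only if all of its transistors are defect-free.
        return rng.random() < (1.0 - p) ** n_transistors

    def fraction_usable(p, model, trials=500, seed=0):
        rng = random.Random(seed)
        usable = 0
        for _ in range(trials):
            ok = [component_ok(p, n, rng) for n in COMPONENTS.values()]
            if model == "simple":      # any defect kills the whole node
                usable += all(ok)
            elif model == "hybrid":    # still useful if any two components survive
                usable += sum(ok) >= 2
        return usable / trials

    for p in (1e-6, 1e-5, 1e-4, 1e-3):
        print(p, fraction_usable(p, "simple"), fraction_usable(p, "hybrid"))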
Slide 38
Results: Usable Nodes
[Chart: % defective nodes vs. device failure probability (10⁻⁶ to 10⁻³) for Simple, Communication-Centric, Compute-Centric, and Hybrid (any two)]
• Hybrid failure mode can tolerate a higher device failure probability
  – Three orders of magnitude greater than typical CMOS designs (10⁻⁴ vs. 10⁻⁷)
Slide 39
Results: Reachable Nodes
[Chart: % nodes reachable vs. device failure probability (0 to 4.0x10⁻⁴) for Simple, Compute-Centric, Hybrid-Total, and Hybrid (compute only)]
• Hybrid increases the number of reachable nodes
  – More nodes with functioning compute logic are reachable and usable
Slide 40
Fail-Stop Summary
• Test logic detects defects in node components
• Modular node design enables partial node operation
• Node is useful if
  – It can compute
  OR
  – It can improve system connectivity
• Hybrid failure mode increases available nodes
  – Can help tolerate a device failure probability of 1.5x10⁻⁴ (1000 times greater than typical CMOS designs)
Slide 41
SOSA Summary
• Distributed algorithm for structure & defect tolerance
  – No external defect map
• Configuration groups nodes into SIMD PEs
• High utilization w/ familiar programming model
• Ability to reconfigure
  – One system for latency-critical systems
  – Multiple cells for throughput systems
• Limitations: I/O bandwidth, general-purpose codes, FP, transient faults
Slide 42
Conclusion
• Future limits on traditional CMOS scaling
  – Multicore, etc. → tera/peta scale w/ 1M nodes
• Defects, cost of fabrication, process variation, etc.
• High performance, low power despite randomness and defects
[Figure: engineered DNA nanostructures + nanoelectronics = the computers of tomorrow]
Slide 43
Duke Nanosystems Overview
[Figure: the Duke nanosystems stack:
  DNA Self-Assembly [FNANO 2005, Ang. Chemie 2006, DAC 2006]
  Nano Devices: electronic, optical, etc. [Nanoletters 2006]
  Large Scale Interconnection [NANONETS 2006]
  Circuit Architecture [FNANO 2004]
  Logical Structure & Defect Isolation [NANOARCH 2005]
  SOSA - Data Parallel Architecture [NANOARCH 2006, ASPLOS 2006]
  NANA - General Purpose Architecture [JETC 2006]]
Slide 44
Generic Filter (Execution Time)
• 3x3 generic filter (Gaussian & median similar)
[Chart: run time (nanoseconds, log scale) vs. image width for P4, I-SS, 16-CMP, I-SOSA, SOSA, and Extrapolated SOSA]
Slide 45
Circuit Architecture
• Unit cell based on lattice cavity
  – Place uniform-length nanoelectronic devices
  – Reduces probability of partial matches
  – Two layers of interconnect
• Achieve balance between
  – Regularity of DNA lattice
  – Complexity required for circuits
  – Defect tolerance
• Node: DNA lattice with CNFETs
[Figure: 20 nm unit cell with carbon nanotubes, metal nanoparticles, Vdd plane, ground plane, insulating layer, and interconnect layers]
Slide 46
Fail-Stop Transceivers
• Minimize test overhead
  – Reuse node hardware during test
• Hardware test (see the sketch below)
  – Send '0' and '1' in a loop
  – If data returns, enable the component
  – If data does not return, the component remains disabled
• Similar principle for configuration logic
• Modular design enables graceful degradation
[Figure: transceiver with transmit/receive logic, input/output buffers, test logic, and a test loopback path; TEST_OK=0 until the test passes]
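In software terms, the loopback test amounts to the following Python sketch (an illustration; the Transceiver class and fault-injecting loopbacks are made up here, and the real test is the VHDL BIST reusing node hardware):

    class Transceiver:
        def __init__(self, loopback):
            self.loopback = loopback     # hardware loopback path under test
            self.test_ok = 0             # TEST_OK = 0 until proven good

        def self_test(self):
            for bit in (0, 1):           # send '0' and '1' in a loop
                if self.loopback(bit) != bit:
                    return False         # no (or garbled) return: stay disabled
            self.test_ok = 1             # data returned intact: enable component
            return True

    good = Transceiver(lambda b: b)      # healthy path echoes both bits
    stuck = Transceiver(lambda b: 0)     # stuck-at-0 fault never returns a '1'
    assert good.self_test() and good.test_ok == 1
    assert not stuck.self_test() and stuck.test_ok == 0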