Approximate On-Chip Communication

Davide Patti, Ph.D., University of Catania, Italy

Page 2: Approximate On-Chip Communication2

…in the Previous Episodes

1. The goal of computing was to be the fastest

2. The race to maximize MHz hit the ‘power wall’ in the mid-2000s

3. Initial solution: “ok, no problem, let’s optimise for speed and power…”

4. …but, eventually, the dramatically increasing workloads ruined the party…

Page 3: Approximate On-Chip Communication2

Why?

• Ever-increasing amount of information
• Industry reports
  – 2010–2020: the amount of information will expand by 50x
  – ...while the number of servers will only grow by a factor of 10!

Page 4: Approximate On-Chip Communication2

Emerging RMS Applications

Page 5: Approximate On-Chip Communication2

Error-Resilience Property

Forgiving workloads (multimedia, recognition, search) can tolerate imperfect computing. Examples:
• Inexact inputs, derived from noisy and redundant sources (e.g., sensors)
• The human consumer of the results may not discern small variations
• Data/algorithms including statistical/probabilistic computations
• Computations which may be refined with multiple iterations

Page 6: Approximate On-Chip Communication2


Page 7: Approximate On-Chip Communication2

Approximate Computing: A Third Dimension for Optimization

Page 8: Approximate On-Chip Communication2

“Error” or “Feature”?

• Approximation not as a “problem” to deal with, not as a “limitation”, but as part of the game

• A neuron spikes when the combination of all the excitation and inhibition it receives makes it reach threshold (around −50 mV)

Page 9: Approximate On-Chip Communication2

Approximating at Multiple Levels of the Stack

Hardware level

• Less accurate yet energy-efficient circuits (e.g., simplified adder)

• Tuning the supply voltage

Software level

• Ignore some computations (skip loop iterations, relaxing control dependences)

• Data structures, e.g., reducing vector sizes

• Ignore certain memory accesses replacing them by estimated values
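
As an illustration of the "skip loop iterations" idea at the software level, here is a minimal sketch of loop perforation applied to a mean computation. The function names and the `stride` knob are assumptions made for this example; the slides do not prescribe a specific implementation.

```c
#include <stddef.h>

/* Exact mean: visits every element. */
double mean_exact(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return n ? sum / (double)n : 0.0;
}

/* Perforated mean: visits only every `stride`-th element.
 * `stride` is a hypothetical tuning knob; stride == 1 is exact. */
double mean_perforated(const double *x, size_t n, size_t stride)
{
    double sum = 0.0;
    size_t visited = 0;

    if (stride == 0)
        stride = 1;                 /* guard against an infinite loop */
    for (size_t i = 0; i < n; i += stride) {
        sum += x[i];
        visited++;
    }
    return visited ? sum / (double)visited : 0.0;
}
```

With a stride of 2 the loop does half the work; how much accuracy is lost depends on how smooth or redundant the input data is.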

Page 10: Approximate On-Chip Communication2

Current Applications

• Database Querying/Visualization:

• BlinkDB, Facebook’s Presto, M4 from SAP

2B points (70 mins) vs 1M points (3 mins)

Page 11: Approximate On-Chip Communication2

Current Applications

• Neural Networks

• Using a NN to replace some expensive computation or algorithm

• Approximate NN implementations for inference (e.g., fewer bits to represent weights)

• SqueezeNet, Google’s Neural Machine Translation
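
To make "fewer bits to represent weights" concrete, the following is a minimal sketch of symmetric 8-bit weight quantization. The function names and the per-tensor `scale` parameter are assumptions for illustration; they are not taken from SqueezeNet or any other system named above.

```c
#include <stdint.h>

/* Quantize one float weight to a signed 8-bit value:
 * q = round(w / scale), clamped to [-127, 127].
 * `scale` is a hypothetical per-tensor scale factor. */
int8_t quantize_weight(float w, float scale)
{
    float q = w / scale;

    q += (q >= 0.0f) ? 0.5f : -0.5f;   /* round half away from zero */
    if (q > 127.0f)
        q = 127.0f;
    if (q < -127.0f)
        q = -127.0f;
    return (int8_t)q;                  /* truncation completes the rounding */
}

/* Recover an approximate float weight from its 8-bit code. */
float dequantize_weight(int8_t q, float scale)
{
    return (float)q * scale;
}
```

Each weight then occupies 8 bits instead of 32, at the cost of a rounding error of at most half a quantization step for weights inside the representable range.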

Page 12: Approximate On-Chip Communication2

Approximate Communication: the NoC Case Study

■ Shared bus
  ➔ Low area
  ➔ Poor scalability
  ➔ High energy consumption

■ Network-on-Chip
  ➔ Mesh of Routers (in red)
  ➔ Each Processing Element connected to a Router
  ➔ Scalability and modularity
  ➔ Low energy consumption
  ➔ Increased design complexity

[Figure: shared bus vs. Network-on-Chip]

Page 13: Approximate On-Chip Communication2

Communication Overhead

• Interconnection networks consume 10% to 20% of the power in current HPC systems
  • Majority due to the network's links

• NoC-based design
  • More than one-third of the chip's power consumption

Page 14: Approximate On-Chip Communication2

Example

for (i = 0; i < n; i++)
    v[i] = f(w[i]);

[Figure: CPU, MI, Memory]

Page 15: Approximate On-Chip Communication2

Example – Load w[i]

for (i = 0; i < n; i++)
    v[i] = f(w[i]);

[Figure: CPU, MI, Memory; Address and Data transfers]

Page 16: Approximate On-Chip Communication2

Example – Store v[i]

for (i = 0; i < n; i++)
    v[i] = f(w[i]);

[Figure: CPU, MI, Memory; Data transfer]

Page 17: Approximate On-Chip Communication2

Approximate Communication

• Send(data, destination)
• Send(data, destination, reliability_level)

[Plot axes: Reliability Level, Communication Energy]

Communication system “aware” of error resilience, acting on two knobs:
• Voltage swing (wired)
• Transmission power (wireless)
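
One way to picture the extended `Send` primitive is a thin layer that maps the reliability level to the two knobs before the transfer is issued. The sketch below is only an assumption about how such an interface could look: the type names, the two-level enum, and the function `send_with_reliability` are hypothetical, and the 1.1 V / 0.6 V swings are the values reported in the experiments later in the deck.

```c
#include <stddef.h>

/* Hypothetical reliability levels; the slides only name the parameter. */
typedef enum { RL_NOMINAL = 0, RL_APPROX = 1 } reliability_level_t;

/* Hypothetical per-transfer link settings (the two knobs). */
typedef struct {
    double voltage_swing_v;   /* wired knob */
    int    low_tx_power;      /* wireless knob: 1 = reduced transmission power */
} link_config_t;

/* Hypothetical descriptor of a transfer handed to the network interface. */
typedef struct {
    int           destination;
    size_t        length;
    link_config_t link;
} transfer_t;

/* Map a reliability level to the knob settings. */
link_config_t select_link_config(reliability_level_t rl)
{
    link_config_t cfg;

    cfg.voltage_swing_v = (rl == RL_APPROX) ? 0.6 : 1.1;
    cfg.low_tx_power    = (rl == RL_APPROX) ? 1 : 0;
    return cfg;
}

/* Build the transfer descriptor; a real implementation would also
 * hand `data` to the network interface. */
transfer_t send_with_reliability(const void *data, size_t length,
                                 int destination, reliability_level_t rl)
{
    transfer_t t;

    (void)data;               /* payload handling is out of scope here */
    t.destination = destination;
    t.length      = length;
    t.link        = select_link_config(rl);
    return t;
}
```

Keeping the knob selection in one function means the application only ever states how reliable a transfer must be, not how the link achieves it.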

Page 18: Approximate On-Chip Communication2

Tuning the Link Voltage Swing

• Reliability vs. energy (1 mm bit-line):
  • Nominal voltage swing → low BER, high energy
  • Low voltage swing → high BER, low energy

Page 19: Approximate On-Chip Communication2

Reconfigurable Link

[Figure: mesh of tiles. Legend: core = IP Core; NI = Network Interface; R = Router; each Tile comprises a core, an NI, and an R; routers are connected by Physical Links.]

Page 20: Approximate On-Chip Communication2

Reconfigurable Link

[Figure: mesh of tiles (core, NI, R), continued]

Page 21: Approximate On-Chip Communication2

Reconfigurable Link

[Figure: mesh of tiles (core, NI, R), continued]

Page 22: Approximate On-Chip Communication2

HSPICE Link Simulation

• 45 nm CMOS technology (NanGate's Open Cell Library):
  • 10 metal layers
  • 3 mm link line using the seventh metal layer
  • 2 GHz target frequency

Improving energy efficiency in wireless network-on-chip architectures, V Catania, A Mineo, S Monteleone, M Palesi, D Patti, ACM Journal on Emerging Technologies in Computing Systems (JETC) 14 (1), 2018

Page 23: Approximate On-Chip Communication2

HSPICE Link Simulation

70% saving, 3% overhead

Improving energy efficiency in wireless network-on-chip architectures, V Catania, A Mineo, S Monteleone, M Palesi, D Patti, ACM Journal on Emerging Technologies in Computing Systems (JETC) 14 (1), 9

Page 24: Approximate On-Chip Communication2

HSPICE Link Simulation

Improving energy efficiency in wireless network-on-chip architectures, V Catania, A Mineo, S Monteleone, M Palesi, D Patti, ACM Journal on Emerging Technologies in Computing Systems (JETC) 14 (1), 9

Page 25: Approximate On-Chip Communication2

Implementation

Packet: Header | Data | Data | Data | Tail

Header fields: Reliability Level | Destination | Other control info
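
The header layout can be sketched as a packed word in which a few bits carry the reliability level next to the destination. The field widths below (8 bits for the destination, 2 bits for the reliability level) and the function names are assumptions chosen for the example; the slides do not give the actual widths or bit positions.

```c
#include <stdint.h>

/* Hypothetical field widths; not specified in the slides. */
#define DEST_BITS 8u
#define RL_BITS   2u
#define DEST_MASK ((1u << DEST_BITS) - 1u)
#define RL_MASK   ((1u << RL_BITS) - 1u)

/* Pack destination and reliability level into a header word.
 * Remaining upper bits are left free for other control info. */
uint32_t header_pack(uint32_t destination, uint32_t reliability_level)
{
    return (destination & DEST_MASK) |
           ((reliability_level & RL_MASK) << DEST_BITS);
}

/* Extract the destination field. */
uint32_t header_destination(uint32_t header)
{
    return header & DEST_MASK;
}

/* Extract the reliability level field. */
uint32_t header_reliability(uint32_t header)
{
    return (header >> DEST_BITS) & RL_MASK;
}
```

A router that reads the reliability field from the header can configure the link once per packet and apply the same setting to the data and tail parts that follow.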

Page 26: Approximate On-Chip Communication2

Annotation Example

• Data coming from/delivered to w[i] travel with a reliability level rl

#pragma resilient(w, rl)
for (i = 0; i < n; i++)
    v[i] = f(w[i]);

Page 27: Approximate On-Chip Communication2
Page 28: Approximate On-Chip Communication2

Application Characterization

• How does the imprecision on inputs and internal data reflect on the outputs?

• Classify data structures according to their impact on the outputs
  – Exploitation:
    • Storing less sensitive data in energy-efficient memories (low voltage, low refresh rate, ...)
    • Optimizing communication of less sensitive data (unreliable communications, lossy compression, ...)

Page 29: Approximate On-Chip Communication2

Experiments

• Two voltage swing levels
  – Nominal 1.1 V → BER: 10^-17, Ebit: 512 fJ
  – Low 0.6 V → BER: 10^-6, Ebit: 152 fJ
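
The per-bit figures above translate directly into link energy for a given transfer. The sketch below only does that arithmetic; the function names are assumptions, while the 512 fJ and 152 fJ constants come from the slide.

```c
#include <stdint.h>

#define EBIT_NOMINAL_FJ 512.0   /* 1.1 V swing, from the slide */
#define EBIT_LOW_FJ     152.0   /* 0.6 V swing, from the slide */

/* Link energy in femtojoules for `bits` transmitted bits. */
double link_energy_fj(uint64_t bits, int low_swing)
{
    return (double)bits * (low_swing ? EBIT_LOW_FJ : EBIT_NOMINAL_FJ);
}

/* Fraction of per-bit energy saved by the low swing: 1 - 152/512. */
double low_swing_saving(void)
{
    return 1.0 - EBIT_LOW_FJ / EBIT_NOMINAL_FJ;
}
```

For a 64-bit transfer this gives 32768 fJ at nominal swing versus 9728 fJ at low swing, a per-bit saving of roughly 70%, paid for with a BER that rises from 10^-17 to 10^-6.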

Page 30: Approximate On-Chip Communication2

Experiments

• JPEG encoding pipeline (AXBench)

[Figure: Level Shift → DCT → Quantize → Entropy Encode, with Quantizer Table and Huffman Table]

UINT8* encodeMcu(UINT32 imageFormat, UINT8 *outputBuffer)
{
    levelShift(Y1);
    dct(Y1);
    quantization(Y1, ILqt);
    outputBuffer = huffman(1, outputBuffer);
    return outputBuffer;
}

Page 31: Approximate On-Chip Communication2

Experiments

UINT8* encodeMcu(UINT32 imageFormat, UINT8 *outputBuffer)
{
    #pragma resilient_load(Y1, rl_load)
    levelShift(Y1);
    ...
}

[Figure: JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode) with Quantizer Table, Huffman Table, and Memory; the load from Memory is labeled rl_load]

Page 32: Approximate On-Chip Communication2

Experiments

UINT8* encodeMcu(UINT32 imageFormat, UINT8 *outputBuffer)
{
    #pragma resilient_store(Y1, rl_store)
    levelShift(Y1);
    ...
}

[Figure: JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode) with Quantizer Table, Huffman Table, and Memory; the store to Memory is labeled rl_store]

Page 33: Approximate On-Chip Communication2

Experiments

UINT8* encodeMcu(UINT32 imageFormat, UINT8 *outputBuffer)
{
    #pragma resilient(Y1, rl)
    levelShift(Y1);
    ...
}

[Figure: JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode) with Quantizer Table, Huffman Table, and Memory; both the load and the store are labeled rl]

Page 34: Approximate On-Chip Communication2

Approximation Profiles

Page 35: Approximate On-Chip Communication2

Experiments

[Figure: JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode) with Quantizer Table, Huffman Table, Mem 1, and Mem 2]

Page 36: Approximate On-Chip Communication2

Experiments

[Figure: NoC mapping with six routers (R), attached to Level Shifter, DCT, MC, Quantizer, Entropy Encoder, and MC; the two MC nodes serve Mem 1 and Mem 2]

Page 37: Approximate On-Chip Communication2

Evaluation Flow

[Flow diagram; box labels in extraction order, connections not recoverable:]
• Application
• Resilient data selection
• Annotated application
• Resilience level selection
• Full Simulation (MIT Graphite)
• Memory reference trace
• NoC architecture
• Energy estimation (Noxim)
• Error injection
• Perturbed application
• Execution
• Communication energy
• Execution
• Imprecise results
• Exact results
• Comparison
• Quality metric
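
The error-injection step of the flow can be approximated by flipping each transmitted bit independently with probability equal to the link BER. The sketch below is one such model and is not the authors' tool: the function names, the independent-bit-error assumption, and the xorshift32 generator (used only so that runs are reproducible) are all choices made for this example.

```c
#include <stdint.h>
#include <stddef.h>

/* Small xorshift32 generator; state must be nonzero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Flip each bit of `buf` independently with probability `ber`.
 * Returns the number of flipped bits. */
size_t inject_bit_errors(uint8_t *buf, size_t len, double ber, uint32_t seed)
{
    uint32_t state = seed ? seed : 1u;   /* avoid the all-zero state */
    size_t flipped = 0;

    for (size_t i = 0; i < len; i++) {
        for (int bit = 0; bit < 8; bit++) {
            /* Uniform draw in (0, 1). */
            double u = xorshift32(&state) / 4294967296.0;

            if (u < ber) {
                buf[i] ^= (uint8_t)(1u << bit);
                flipped++;
            }
        }
    }
    return flipped;
}
```

Applying this to the buffers of annotated data on every load or store, then running the application to completion, yields the imprecise results that are compared against the exact run.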

Page 38: Approximate On-Chip Communication2


Experiments

Page 39: Approximate On-Chip Communication2

Experiments

[Figure, Conf 0 (gold): JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode) with Quantizer Table, Huffman Table, Mem 1, and Mem 2. Legend: Nominal (high energy, high reliability); Approx (low energy, low reliability).]

Page 40: Approximate On-Chip Communication2

Experiments

[Figure, Conf 1: JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode) with Quantizer Table, Huffman Table, Mem 1, and Mem 2. Legend: Nominal (high energy, high reliability); Approx (low energy, low reliability).]

Page 41: Approximate On-Chip Communication2

Experiments

[Figure, Conf 2: JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode) with Quantizer Table, Huffman Table, Mem 1, and Mem 2. Legend: Nominal (high energy, high reliability); Approx (low energy, low reliability).]

Page 42: Approximate On-Chip Communication2
Page 43: Approximate On-Chip Communication2

Experiments

[Figure, Conf 3: JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode) with Quantizer Table, Huffman Table, Mem 1, and Mem 2. Legend: Nominal (high energy, high reliability); Approx (low energy, low reliability).]

Page 44: Approximate On-Chip Communication2

Experiments

[Figure, Conf 4: JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode) with Quantizer Table, Huffman Table, Mem 1, and Mem 2. Legend: Nominal (high energy, high reliability); Approx (low energy, low reliability).]

Page 45: Approximate On-Chip Communication2

Experiments

[Figure, Conf 5: JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode) with Quantizer Table, Huffman Table, Mem 1, and Mem 2. Legend: Nominal (high energy, high reliability); Approx (low energy, low reliability).]

Page 46: Approximate On-Chip Communication2
Page 47: Approximate On-Chip Communication2


Experiments

Page 48: Approximate On-Chip Communication2

Experiments

[Figure, Conf 6: JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode) with Quantizer Table, Huffman Table, Mem 1, and Mem 2. Legend: Nominal (high energy, high reliability); Approx (low energy, low reliability).]

Page 49: Approximate On-Chip Communication2
Page 50: Approximate On-Chip Communication2

Experiments

[Figure, Conf 7: JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode) with Quantizer Table, Huffman Table, Mem 1, and Mem 2. Legend: Nominal (high energy, high reliability); Approx (low energy, low reliability).]

Page 51: Approximate On-Chip Communication2
Page 52: Approximate On-Chip Communication2

Experiments

[Figure, Conf 8: JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode) with Quantizer Table, Huffman Table, Mem 1, and Mem 2. Legend: Nominal (high energy, high reliability); Approx (low energy, low reliability).]

Page 53: Approximate On-Chip Communication2
Page 54: Approximate On-Chip Communication2

Experiments

[Figure, Conf 9: JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode) with Quantizer Table, Huffman Table, Mem 1, and Mem 2. Legend: Nominal (high energy, high reliability); Approx (low energy, low reliability).]

Page 55: Approximate On-Chip Communication2
Page 56: Approximate On-Chip Communication2


Experiments

Page 57: Approximate On-Chip Communication2

Experiments

[Plot: Image diff (RMSE, axis from 0.0000 to 0.0006) and Normalized energy (axis from 0 to 1) for Configurations 0 to 9]

Page 58: Approximate On-Chip Communication2

Sensitivity Analysis

[Bar chart: Sensitivity (axis from 0.00000 to 0.00030) for each data flow: in Y1/levelShift, out Y1/levelShift, in Y1/dct, out Y1/dct, in Y1/quantization, in ILqt/quantization, out Temp/quantization, in Temp/huffman, out outputBuffer/huffman]

Page 59: Approximate On-Chip Communication2

Experiments

[Figure, Conf 9: JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode) with Quantizer Table, Huffman Table, Mem 1, and Mem 2. Legend: Nominal (high energy, high reliability); Approx (low energy, low reliability).]

Page 60: Approximate On-Chip Communication2

Experiments

[Plot: Normalized energy (axis from 0.00 to 1.20) vs. Image diff (RMSE, axis from 0.0E+0 to 6.0E-4)]

Page 61: Approximate On-Chip Communication2

Next Step: On-Chip Wireless Communications

V. Catania, A. Mineo, S. Monteleone, M. Palesi, and D. Patti, “Improving energy efficiency in wireless network-on-chip architectures,” ACM Journal on Emerging Technologies in Computing Systems, vol. 14, no. 1, 2017.

Page 62: Approximate On-Chip Communication2

Tuning Transmitting Power

• High BER as compared to wired NoC
  – 10^-9 vs. 10^-14

• General approach
  – Increasing the transmitting power to compensate for the attenuation introduced by the wireless medium

• Proposed approach
  – Tuning the transmitting power based on the reliability level of the data currently being transmitted

Page 63: Approximate On-Chip Communication2

Tunable Transmitting Power

Zigzag antenna modeled with Ansoft HFSS to compute attenuation (16 Gbps)

Variable Power Amplifier

• S. Kaushik, M. Agrawal, H. K. Mondal, S. H. Gade, and S. Deb, “Path loss-aware adaptive transmission power control scheme for energy-efficient wireless noc,” in International Midwest Symposium on Circuits and Systems (MWSCAS), Aug. 2017, pp. 132–135.

• A. Mineo, M. Palesi, G. Ascia, and V. Catania, “Exploiting antenna directivity in wireless noc architectures,” Microprocessors and Microsystems, vol. 43, pp. 59–66, 2016.

Page 64: Approximate On-Chip Communication2

Simulation Setup

• Two transmission profiles:
  • (normal) BER 10e-12 → 1.47 pJ/bit
  • (approximate) BER 10e-6 → 1 pJ/bit

• Wireless interfaces placed at the same positions as the memory controllers (mesh corners)

• 8 × 8 mesh-based NoC architecture simulated using the Graphite Multicore Simulator

[Table: simulation parameters, not recovered]
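
Given the two transmission profiles, the average wireless energy per bit depends on how much of the traffic can use the approximate profile. The sketch below does this arithmetic under the assumption that a fraction `approx_fraction` of the transmitted bits is annotated as error tolerant; the function names and that parameter are hypothetical, while 1.47 pJ/bit and 1 pJ/bit come from the slide.

```c
#define E_NORMAL_PJ 1.47   /* normal profile, from the slide */
#define E_APPROX_PJ 1.00   /* approximate profile, from the slide */

/* Average wireless energy per bit when `approx_fraction` of the
 * bits (clamped to [0, 1]) uses the approximate profile. */
double avg_energy_pj_per_bit(double approx_fraction)
{
    if (approx_fraction < 0.0)
        approx_fraction = 0.0;
    if (approx_fraction > 1.0)
        approx_fraction = 1.0;
    return approx_fraction * E_APPROX_PJ +
           (1.0 - approx_fraction) * E_NORMAL_PJ;
}

/* Relative saving with respect to sending everything at normal power. */
double wireless_saving(double approx_fraction)
{
    return 1.0 - avg_energy_pj_per_bit(approx_fraction) / E_NORMAL_PJ;
}
```

Even if every bit were sent with the approximate profile, the wireless transmission saving would be bounded by 1 − 1/1.47, about 32%; this bound covers only the radio links, not the wired part of the network.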

Page 65: Approximate On-Chip Communication2

Representative Applications

Application: streamcluster
Description: a RMS kernel developed by Princeton University that solves the online clustering problem.
Approximated regions: regions of 256 bytes required for storing the 64 dimensions of each point, each encoded as a 4-byte floating-point value, for a total of 8192 regions.

Application: canneal
Description: developed by Princeton University, it uses cache-aware simulated annealing (SA) to minimize the routing cost of a chip design.
Approximated regions: the annotation has been performed on the netlist element, for a total of 160,000 instances of 64-byte netlist elements.

Application: blackscholes
Description: an Intel RMS benchmark that calculates prices for a portfolio of European options analytically with the Black-Scholes partial differential equation.
Approximated regions: two data structures have been annotated: optiondata, a 36-byte floating-point structure, and prices (4-byte floating point), for a total of 147,456 bytes and 16,384 bytes, respectively.

Application: radiosity
Description: computes the equilibrium distribution of light in a scene using the hierarchical diffuse radiosity method.
Approximated regions: elemvertex buf.col, a data structure encoding the three RGB components as 4-byte floating-point values, and elemvertex buf.vertex, a data structure encoding the 3-dimensional coordinates of each vertex of the polygons describing the 3D model of the scene. Each of these two structures occupies 12 bytes, for a total of 65,535 regions and 786,420 annotated bytes each.

Page 66: Approximate On-Chip Communication2

Evaluation Flow

Four scenarios:
1. NoC
2. WiNoC
3. Approx. NoC
4. Approx. WiNoC

Page 67: Approximate On-Chip Communication2

Results

∗ All energy values are normalized with respect to the wired NoC energy consumption.

Page 68: Approximate On-Chip Communication2

Results – Performance Metrics

Page 69: Approximate On-Chip Communication2

Conclusions

• Approximate communication technique for improving the energy efficiency of WiNoC architectures
  • Dynamic link voltage swing (NoC links)
  • Dynamic transmitting power modulation (wireless communications)

• Pragma-based annotation of the application code
  • Load- and store-induced communications related to error-tolerant data

• Assessment on a set of representative benchmarks
  • Energy saving versus application accuracy trade-off
  • Up to 30% of total communication energy saving has been observed without any appreciable impact on the accuracy metrics

Page 70: Approximate On-Chip Communication2

Future Developments

• Generalize & automate in order to reduce the required knowledge about the application

• A methodology to identify approximable communication flows

• Automated choice of the most efficient approximation technique (reduced bit representation, reduced iterations, etc.)

• Automatic exploration loop
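
An automatic exploration loop could, in its simplest form, evaluate every candidate configuration and keep the lowest-energy one whose quality loss stays within a user-given bound. The sketch below shows only that selection step, under the assumption that each configuration has already been simulated; the `config_result_t` type and the function name are hypothetical and not part of the presented work.

```c
#include <stddef.h>

/* Hypothetical outcome of evaluating one configuration. */
typedef struct {
    double quality_loss;   /* e.g. image diff (RMSE) vs. the exact run */
    double energy;         /* e.g. normalized communication energy */
} config_result_t;

/* Return the index of the lowest-energy configuration whose quality
 * loss does not exceed `max_quality_loss`, or -1 if none qualifies. */
int explore_configs(const config_result_t *results, size_t n,
                    double max_quality_loss)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (results[i].quality_loss > max_quality_loss)
            continue;   /* violates the quality bound */
        if (best < 0 || results[i].energy < results[(size_t)best].energy)
            best = (int)i;
    }
    return best;
}
```

Exhaustive evaluation is only practical for a handful of configurations; a larger design space would need a search heuristic in front of this selection step.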

Page 71: Approximate On-Chip Communication2

Bibliography

• Vincenzo Catania, Andrea Mineo, Salvatore Monteleone, Maurizio Palesi, and Davide Patti. 2016. Cycle-Accurate Network on Chip Simulation with Noxim. ACM Trans. Model. Comput. Simul. 27, 1, Article 4 (August 2016), 25 pages. DOI: https://doi.org/10.1145/2953878

• Improving energy efficiency in wireless network-on-chip architectures, V Catania, A Mineo, S Monteleone, M Palesi, D Patti, ACM Journal on Emerging Technologies in Computing Systems (JETC) 14 (1), 9

• S. Kaushik, M. Agrawal, H. K. Mondal, S. H. Gade, and S. Deb, “Path loss-aware adaptive transmission power control scheme for energy-efficient wireless noc,” in International Midwest Symposium on Circuits and Systems (MWSCAS), Aug. 2017, pp. 132–135.

• C. Roth, H. Bucher, S. Reder, F. Buciuman, O. Sander, and J. Becker. 2013. A SystemC modeling and simulation methodology for fast and accurate parallel MPSoC simulation. In Integrated Circuits and Systems Design (SBCCI), 2013 26th Symposium on. 1–6. DOI:http://dx.doi.org/10.1109/SBCCI.2013.6644853

• S. Deb, K. Chang, M. Cosic, A. Ganguly, P. P. Pande, D. Heo, and B. Belzer, “Enhancing performance of network-on-chip architectures with millimeter-wave wireless interconnects,” in IEEE International Conference on Application-specific Systems Architectures and Processors, 2010, pp. 73–80.

• E. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal, “Graphite: A distributed parallel simulator for multicores,” in High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on. IEEE, 2010, pp. 1–12.