design and implementation of an efficient titolo ... · two sv packages implementing the functions...

11
Titolo presentazione sottotitolo Milano, XX mese 20XX Design and implementation of an efficient Floating Point Unit (FPU) in the IoT era Andrea Galimberti Andrea Romanoni Davide Zoni OPRECOMP summer of code NiPS Summer School Sept 3-6, 2019 Perugia, Italy

Upload: others

Post on 14-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Design and implementation of an efficient Titolo ... · Two SV packages implementing the functions for each operation according to the selected size of the mantissa; Each operation

Titolo presentazionesottotitolo

Milano, XX mese 20XX

Design and implementation of an efficient Floating Point Unit (FPU) in the IoT era

Andrea Galimberti

Andrea Romanoni

Davide Zoni

OPRECOMP summer of codeNiPS Summer SchoolSept 3-6, 2019 Perugia, Italy

Page 2: Design and implementation of an efficient Titolo ... · Two SV packages implementing the functions for each operation according to the selected size of the mantissa; Each operation

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni

The computing platforms in the Internet-of-Things (IoT) era

Efficient computing: challenges and opportunities in the Internet-of-Things (HiPEAC vision 2019*)

Distributed data processing

Specific requirements and constraints at each stage

Revise the design of the System-on-Chip to maximize efficiency

2 / -

* HiPEAC Vision 2019: https://www.hipeac.net/vision/2019/

Page 3: Design and implementation of an efficient Titolo ... · Two SV packages implementing the functions for each operation according to the selected size of the mantissa; Each operation

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni

Open-hardware platforms for research

Motivations (HiPEAC 2019): Design open-hardware computing

platforms for efficiency, security and safety Trustfulness and security requirements Computational efficiency

Limitations of currentopen-hardware solutions: Monolithic solutions, not modular Targeting efficient computation Mainly focused to ASIC design flow Not for research in microarchitecture

Successful open-hardware projects

Spectre and meltdown: ICT cannot rely on closed-hardware solutions

3 / 11

Let’s focus on a single optimization direction ...

Floating point data transfer and computation dominate the energy consumption

Page 4: Design and implementation of an efficient Titolo ... · Two SV packages implementing the functions for each operation according to the selected size of the mantissa; Each operation

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni

The motivation behind the bfloat16 format

Brain-inspired Float (bfloat16): same format of the IEEE754 single-precision (SP) floating-point standard but truncates the mantissa field from 23 bits to 7 bits.

4 / 11

Motivations:

Better timing and smaller area

For machine learning tasks a 7 bit mantissa is enough (Google and Intel)

Still possible to represent tiny values

Easier debugging than float16 since bfloat16 shares the same format of the IEEE754-SP standard

S E E E E E E E E M M M M M M M M M M M M M M M M M M M M M M M

S E E E E E M M M M M M M M M M

S E E E E E E E E M M M M M M M

Exp (8 bits)

Exp (8 bits) Mantissa (7 bits)

Mantissa (10 bits)Exp (5 bits)

Mantissa (23 bits)

~2-(126+23) < x < ~2(128)

16 bits

~2-(14+10) < x < ~2(16)

~2-(126+7) < x < ~2(128)

MRRE = 0.5 * ulp * B = = 0.5 * 2-23 * 2 =

= 2-23

MRRE = 0.5 * ulp * B = = 0.5 * 2-10 * 2 =

= 2-10

MRRE = 0.5 * ulp * B = = 0.5 * 2-7 * 2 =

= 2-7

Float16

bFloat16

Float32

Page 5: Design and implementation of an efficient Titolo ... · Two SV packages implementing the functions for each operation according to the selected size of the mantissa; Each operation

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni

single SystemVerilog design that implements bFloat and IEEE-754 single-precision

Goals andstatus of the

project

5 / 11

Efficient FPU for IoT SoCs, i.e., lamp-platform: both timing and area efficient

Reproducible results: new set of benchmarks targeting machine learning,

vision and IA validation infrastructure using Xilinx Vivado,

coupled with SystemVerilog DPI-C and random features. Free tools only

Page 6: Design and implementation of an efficient Titolo ... · Two SV packages implementing the functions for each operation according to the selected size of the mantissa; Each operation

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni

Design and verification methodology

6 / 11

Flexibility

Two SV packages implementing

the functions for each operation

according to the selected size

of the mantissa;

Each operation can be either

implemented or not

Efficiency:

Rounding scheme can be

configured to consider N LSB

bits to optimize timing/area;

Constrained rnd verification

using gcc and Xilinx Vivado

FPU_top

add_sub

mul

div

i2f

f2i

Generate random

operands and

opcode

DPI-C

FPU_top_tb

cmp

Checkresults

IEEE754_SP_pkg

bFloat_pkg

opCode,op1,op2doOp,rndMode Res,

IsValid,flags

Page 7: Design and implementation of an efficient Titolo ... · Two SV packages implementing the functions for each operation according to the selected size of the mantissa; Each operation

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni

Area and timing: preliminary results

7 / 11

We optimize each operation in the FPU to improve the timing

The reduction in the mantissa improves both area and timing

The proposed FPU allows to selectively implement a subset of the available operations

NOTE: Flip-Flops results are not reported

Page 8: Design and implementation of an efficient Titolo ... · Two SV packages implementing the functions for each operation according to the selected size of the mantissa; Each operation

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni

New set of benchmarks targeting machine learning, vision and IA

Ongoing

8 / 11

FPU integration in the lamp-platform

Page 9: Design and implementation of an efficient Titolo ... · Two SV packages implementing the functions for each operation according to the selected size of the mantissa; Each operation

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni

FPU benchmarks for IoT System-on-Chip*

9 / 11

Errors as function of the bits in the mantissa (tolerance 0.001)

Motivations:

Bare-metal C code for IoT devices (different from PARSEC and SPLASH2 bench suites)

Consider emerging computing scenarios:

(computer vision, computer graphics, machine learning)

* Project website: www.lamp-platform.org

Page 10: Design and implementation of an efficient Titolo ... · Two SV packages implementing the functions for each operation according to the selected size of the mantissa; Each operation

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni

Lightweight Application-specific Modular Processor (LAMP)*

* Project website: www.lamp-platform.org

Contributions: FPGA targets with up to 4k LUTs with CPU(a RISC-V 5 stage CPU is offered as well )

Modular design to target different design metrics For example: Side-channel evaluation using

the same design in post-synthesis, post-implementation and board

Designed for Artix 7 FPGA family

Easily plug different computing devices (request/response data/instruction interface)

Key idea: The Raspberry successful story

A small and affordable computer that you can use to learn programming

Our vision:

LAMP-platform: a flexibile and affordable System-on-Chip for investigating IoT challenges

10 / 11

Page 11: Design and implementation of an efficient Titolo ... · Two SV packages implementing the functions for each operation according to the selected size of the mantissa; Each operation

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni

Thank you

11 / 11

Davide Zoni, PhDEmail: [email protected]

Andrea Galimberti, PhD StudentEmail: [email protected]

Andrea Romanoni, PhDEmail: [email protected]