
Architecture of Digital Integrated Systems

Course Presentation

Davide Bertozzi, University of Ferrara

Most of the course material will be in English


Course Information

Instructor: Davide Bertozzi, Assistant Professor
Email: [email protected]
Phone: +390532974832

Teaching assistant, responsible for laboratory experiences: Meriem Turki, PhD student
Email: [email protected]
Office: no. 338 (third floor, Engineering Department)

Course schedule

MONDAY 16.30-19.00 - Room 20

WEDNESDAY 16.30-19.00 - Room 9 OR Informatics Lab (Small)

All lab experiences will be taught in English

Office hours: by appointment (email reservation, or after lectures)

The Exam

Roughly one third of the course will be in the lab (or related to it)

Expertise on the (C++-derived) SystemC hardware description language (HDL)

Exam split into 2 parts:

Oral exam (25 points) - 3 questions.

Course project (5 points)

Hands-on final project assignment showing off SystemC programming skills

Exams are by appointment, and requests should be emailed to me at least one week in advance

Course material

Course website: http://mpsoc.unife.it/~arch-dig/
- Slides (at least 1 hour before lessons)
- News, course information

No single course book is available, since the topic of this course is fast evolving

It is at the frontier of research

Specific book chapters, papers, ... will be suggested on a topic-by-topic basis

Taking the course and taking notes is the best way to enjoy the course!

Useful books

1. T. Groetker, S. Liao, G. Martin, S. Swan; System Design with SystemC; Kluwer Academic Publishers, 2002. Topic: SystemC hardware description language

2. J. Flich, D. Bertozzi; Designing Network-on-Chip Architectures in the Nanoscale Era; CRC Press, 2011. Topic: Networks-on-Chip

3. William James Dally, Brian Patrick Towles; Principles and Practices of Interconnection Networks; Morgan Kaufmann, 2004. Topic: Interconnection networks

4. J.M. Rabaey, A. Chandrakasan, B. Nikolic; Digital Integrated Circuits - A Design Perspective (second edition); Prentice Hall. Topic: Design methodologies; timing issues in digital circuit design

5. David A. Patterson, John L. Hennessy; Computer Organization and Design: The Hardware/Software Interface; Morgan Kaufmann, 2004. Topic: Microprocessor architecture

6. D. Culler, P. Singh, A. Gupta; Parallel Computer Architecture: a Hardware/Software Approach; Morgan Kaufmann, 2004. Topic: Design issues of multicore processors

Lab Schedule

Indicative dates for lab experiences

- March 21st from 16:30 to 18:30
- March 28th from 16:30 to 18:30
- April 11th from 16:30 to 18:30
- April 18th from 16:30 to 18:30
- May 9th from 16:30 to 18:30
- May 16th from 16:30 to 18:30
- May 23rd from 16:30 to 18:30 (Final Project Assignment)

These dates may change based on my work commitments. They should be considered as indicative.


The course at a glance

Architecture

Technology

Synthesis flow

SystemC Hardware Description Language (HDL)

Technology-aware design

Why take this course?

Breaking the abstraction layers and knowing what is underneath enables you to solve problems and design better future systems

Cooperation between multiple components and layers can enable more effective solutions and systems


Off-chip memory, Microprocessor core, Bus, Memory, I/O, Accelerators

Operating System

Language Runtime

Application and Libraries

Hypervisor

Netlist of logic gates

Circuits

Layout

Transistors

Horizontal integration

Vertical integration

I am ready!

When I push a button on the touchscreen, the smartphone recovers from sleep mode and starts working

iPhone 5s

Operation         Battery duration
Standby           250 hrs
3G talk time      10 hrs
3G browsing       8 hrs
LTE browsing      10 hrs
Wi-Fi browsing    10 hrs
Video             10 hrs
Music             40 hrs

Different kinds of workload stress the smartphone to different extents

This has direct implications for the battery duration

Let us start from our daily experience

Who does the actual computing inside the smartphone?

Electronic board

Opening the smartphone

Touch Screen Controller

Antennas and controllers

Electronic Board – Face up

Application Processor

Non-Volatile Memory

Gyroscope & Accelerometer

LTE/GSM Modem

NFC Controller

Image Processor

Electronic Board – Face down

Application Processor

The «brain» of the smartphone is its «Application Processor»: Snapdragon (Qualcomm), Exynos (Samsung), Helio (MediaTek), OMAP (Texas Instruments), Kirin (HiSilicon), Tegra (Nvidia), Ax (Apple), ...

What is a microprocessor?

Execute!

Yes Sir!

Programmer

Microprocessor

Send an email!

Yes Sir!

Programmer

Microprocessor

Microprocessors are not able to understand and process such «abstract» commands!

From now on, the terms «application processor» and «microprocessor» will be used interchangeably, although other kinds of microprocessors do exist (e.g., for power management, wireless control, display control, etc.).

Machine Language

Talking to a microprocessor goes through «electrical signals»

The basic thing a microprocessor does is to detect the following conditions:
- Presence of signal (symbol «1»)
- Lack of signal (symbol «0»)

Fundamentals of digital processing: binary numbers are used to communicate both instructions and data to a microprocessor

1000110010100000!

Microprocessors speak a language whose alphabet consists of 2 letters (the Italian language has 21 letters). As a result, machine language consists of binary numbers:

microprocessor

programmer

«0»«1»«0»«1»«0»«1»

With this language, what kind of orders can I give to a microprocessor?

A microprocessor can only execute simple «low-level» arithmetic-logic instructions

- E.g., sum/subtract/multiply/divide two integers
- E.g., carry out the logic AND/OR/EXOR of two bits or sets of bits
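The low-level instructions listed here can be tried out on small integers. The following is a minimal Python sketch, not actual machine code: the operand values and the 16-bit display width are assumptions chosen for illustration, and Python operators stand in for the machine instructions. It also reads the 16-bit pattern shown earlier as plain data, to stress that the same bits can carry either an instruction or a number.

```python
# Illustrative operands (assumed values, not taken from the slides).
a = 0b1100  # 12
b = 0b1010  # 10

# Arithmetic instructions on two integers.
total = a + b        # sum
difference = a - b   # subtract
product = a * b      # multiply
quotient = a // b    # integer divide

# Logic instructions applied bit by bit to two sets of bits.
bits_and = a & b     # AND
bits_or = a | b      # OR
bits_xor = a ^ b     # EXOR


def to_bits(value, width=16):
    """Render a non-negative integer as a fixed-width binary string."""
    return format(value, "0{}b".format(width))


# The 16-bit pattern from the slide, interpreted as an unsigned integer.
word = "1000110010100000"
as_integer = int(word, 2)

print(to_bits(bits_and), to_bits(bits_or), to_bits(bits_xor))
print(as_integer)
```

Whether a given pattern is an instruction or a piece of data depends only on how the microprocessor is asked to interpret it.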

Applications consist of «complex» (or abstract) operations, which take for granted the capability to think/abstract/plan/structure of the human mind:

- Start a phone call!
- Play back a video!
- Send an email!

Abstractions hide details, but make it possible to cope with problem complexity

Several intermediate HW/SW layers are needed to interpret and translate high-level operations into the basic operations a microprocessor can do.

Moving to a «lower abstraction layer» increases the information content

ABSTRACTION GAP

Application Processor: an Example

SAMSUNG EXYNOS7420 OCTA-CORE

A microprocessor does NOT ONLY consist of a single CORE (computation unit), but rather of a (more or less regular) network of cores.

ANALOGY

Once upon a time (~50 years ago) it was not like that…

Microprocessors used to be «Monolithic processor cores»

Hardware capable of executing arithmetic and logic instructions

Options of this hardware:
- Processing speed (or clock frequency)
- Instruction throughput
- Instruction-Level Parallelism
- Out-of-order execution capability
- Branch prediction strategy
- Memory hierarchy and access speed
- Virtual memory
- ...

Trend to integrate more and more functions (computation units, controllers, modems, memory macros, etc.) on the same silicon die, thus building up «Systems-on-Chip».
- lower power, better performance, smaller size

Today, all application processors are «systems-on-chip»

What slows down this trend: Technology, Cost, Reliability

The long-term ideal asymptotic trend consists of the «smartphone-on-a-chip»

The «system-on-chip (SoC)» revolution

«Integrated» Application Processors

Microprocessors in the «system-on-chip (SoC)» era

Memory

Peripheral units and I/O

SoC for wireless applications

Computation unit from a third-party vendor (implementation made by system integrator - SOFT MACRO)

Computation unit from a third-party vendor (layout defined by the vendor as well - HARD MACRO)

Data Memory and Instruction Memory (from third-party vendor). Their layout comes from the vendor as well - HARD MACRO

A typical «SoC» consists of pre-designed and pre-verified blocks, which can be made in-house or bought from «third party» vendors (against the payment of royalties)

New terminology coming up:
- Platform-based design
- Design reuse
- System integration task

Systems-on-chip: a different way of designing systems... and of doing business!

«ARM» PROCESSORS

You may have heard that «most smartphone processors» are ARM processors...

... but when talking about application processors ARM was not mentioned: Qualcomm, MediaTek, Nvidia, HiSilicon, Apple, ...??!?!

Generic Application Processors with «ARM core inside».

ARM core

In turn, ARM processors are systems-on-chip….

Application processors are Systems-on-chip!

Evolution: the Memory Hierarchy

1st level memory

Peripheral units and I/O

2nd level memory


Microprocessors in the «system-on-chip (SoC)» era

Memory – the Problem

We all would like knowledge to be accessible in a single book!

Hence the need to select, every time, the book containing the information needed at that point in time!

Working Set

• We may identify a set of books containing all the information that we normally need (except for specific cases!) = Working set.

• What are the «habits» of the microprocessor «reader», so that we can build the working set?

The Problem

The microprocessor would like to have infinite memory with very fast access times... But this is not feasible in practice:
- Fast memories are small.
- Large memories are slow.
- The amount of memory that can stay on a chip is limited!
- Fast memories are also very expensive, so they have to be small.

Key Insight

The microprocessor has a distinctive feature: it tends to reuse data and instructions that have been accessed recently, or that are close to the recently-accessed ones.

TEMPORAL LOCALITY

Recently-accessed elements are likely to be accessed again soon

SPATIAL LOCALITY

When accessing an element, the elements nearby are likely to be accessed soon

Example

Spatial locality: access to the elements A[i] in sequence

Temporal locality: at each iteration, the «sum» instruction is used
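A loop of the kind this example refers to can be sketched as follows. This is an illustrative reconstruction in Python, not the code from the original slide: the function name `sum_array` and the sample data are assumptions. Python lists do not guarantee that element values sit next to each other in memory, so the sketch only models the access pattern.

```python
def sum_array(values):
    """Sum the elements of a list, visiting them in index order."""
    total = 0
    for i in range(len(values)):
        # Spatial locality: values[i] is the neighbour of values[i - 1],
        # which was read in the previous iteration.
        # Temporal locality: the same add instruction and the same
        # variable `total` are reused at every iteration.
        total = total + values[i]
    return total


A = [3, 1, 4, 1, 5, 9, 2, 6]  # assumed sample data
print(sum_array(A))
```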

How to exploit this?

Low latency High throughput Small size High cost

Memory Hierarchy

The approach is to store on a small and fast memory «close to» the processor (i.e., a cache) the data/instructions that I am currently accessing, in addition to the «nearby» ones

By implementing the memory system as a «hierarchy», the microprocessor is given the illusion of having a memory as large as the last-level one and as fast as the first-level one

Working set

Whether the working set is «good» or not depends on the number of MISSes in the first-level memory

How does it work? Hit and Miss

I am searching for a data item/instruction. Do you have it?

YES (HIT!)

NO (MISS!) Transfer from the lower level

Registers

L1 Cache

L2 Cache

The philosophy is as follows: fast access to data/instructions that are most commonly used or which can be foreseen to be accessed in the near future. For the exceptions, a temporary performance slowdown has to be accounted for.
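The hit/miss mechanism can be made concrete with a toy model. The sketch below assumes a direct-mapped cache, which is one possible organisation and is not stated on the slides; the function name `simulate_cache` and the parameters `num_lines` and `block_size` are hypothetical values picked for illustration.

```python
def simulate_cache(addresses, num_lines=4, block_size=4):
    """Count hits and misses for a toy direct-mapped cache model.

    num_lines and block_size are assumed parameters, not real hardware values.
    """
    lines = [None] * num_lines  # block number currently held by each line
    hits = 0
    misses = 0
    for address in addresses:
        block = address // block_size  # block that contains this address
        index = block % num_lines      # cache line the block maps to
        if lines[index] == block:
            hits += 1                  # HIT: already in the cache
        else:
            misses += 1                # MISS: transfer from the lower level
            lines[index] = block
    return hits, misses


# Sequential scan of 16 addresses: one miss per block, hits for its neighbours.
print(simulate_cache(range(16)))
```

A sequential scan benefits from spatial locality (one miss brings in a whole block), while two addresses that map to the same line keep evicting each other and never hit.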


Trend: increasing chip size!

Growing to Super-cores!

1st-level memory

Peripheral unit and/or I/O port

2nd-level memory

Chip Size

Goal: meet the growing user expectations for advanced software services

Chip Size

Next-Generation Processor Core

1st-level memory

2nd-level memory

3rd-level memory

Peripheral unit and I/O (x6)

Integration of tens or hundreds of «cores»

More memory levels or same levels with more memory

Higher performance, lower cost, etc.



Chip Size

Opposite trend: chip size reduction under the pressure of technology scaling

Super-cores!

90 nm, 65 nm, 45 nm

Today we are heading below 14nm process nodes!

Below 10nm fundamental physical issues come to the forefront

Trend stabilization: constant chip size

Microprocessor area has today stabilized (rough numbers provided):
- 140 mm² for desktop computers
- 260 mm² for high-performance computing (e.g., scientific computing)
- 70–100 mm² for “embedded” microprocessors

Area split into LOGIC, MEMORY AND INTEGRATION OVERHEAD

There are limiting factors for chip area:
- power consumption
- manufacturing cost
- chip-wide transmission delay
- design cost

Then, around 2000, an epoch-making paradigm shift occurred...

SOMETHING WENT WRONG!

Designers realized that at each new generation of microprocessors, the cost to achieve a predefined performance increase skyrockets (if it is achievable at all)

Pollack’s rule: at a given feature size (process node), a new microprocessor generation takes 2-3x the area of the old one, while the performance speedup is only 1.4-1.6x

DECREASING MARGINAL UTILIZATION
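Pollack's rule is often summarised as performance growing roughly with the square root of the core area. The sketch below uses that square-root form as an assumption (it is an approximation, and it is not how the slide states the rule) to show why the return per unit of extra area keeps shrinking; the function name `pollack_speedup` is hypothetical.

```python
import math


def pollack_speedup(area_ratio):
    """Performance ratio predicted by the square-root form of Pollack's rule."""
    return math.sqrt(area_ratio)


for area_ratio in (2, 3, 4):
    speedup = pollack_speedup(area_ratio)
    # The last column is the speedup obtained per unit of area spent.
    print(area_ratio, round(speedup, 2), round(speedup / area_ratio, 2))
```

Under this approximation, doubling the area yields about 1.41x and tripling it about 1.73x, in the same ballpark as the 1.4-1.6x range quoted here.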

WHAT EXACTLY WENT WRONG?

1. Few applications can expose more than 2 parallel instructions/cycle
   The main limiter of the instruction throughput is the presence of dependencies in the instruction flow, which microarchitecture/compiler designers cannot completely get rid of

2. Sometimes, although instruction parallelism is there, the compiler and/or the hardware are not able to extract it
   E.g., potentially parallel instructions that are thousands of cycles apart

3. Memory access latencies limit the utilization rate of the processor
   Memory latency cannot be completely hidden

4. Beyond a 150W power envelope, it is not economically convenient to cool down any more

5. Beyond a given clock frequency, there are concerns:
   A well-formed clock pulse becomes challenging
   Within the clock cycle, there is an inactive time that does not scale

6. Although the processor is fast in processing data, data communication is overly slow and costly!
   The communication bottleneck becomes more severe as technology scales down!

Equivalent to:

[Figure: a single large core (with 1st-level and 2nd-level memory) compared with four small cores C1-C4. Relative to the large core, each small core has Power = 1/4 and Performance = 1/2.]

Multi-Core: power efficient, with better power and thermal management

A NEW ERA: MULTI-CORE COMPUTING

Computation parallelism represents a more efficient and scalable way of delivering computing performance and power management!
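The Power = 1/4 and Performance = 1/2 figures quoted for a small core can be turned into a quick calculation. This is a minimal sketch: the helper `multicore` and the `parallel_efficiency` factor are hypothetical, and a perfectly parallel workload is assumed by default.

```python
# The large core is the reference: power = 1, performance = 1.
LARGE_CORE = {"power": 1.0, "performance": 1.0}
# Small-core figures quoted on the slide, relative to the large core.
SMALL_CORE = {"power": 0.25, "performance": 0.5}


def multicore(core, count, parallel_efficiency=1.0):
    """Aggregate power and throughput of `count` identical cores.

    parallel_efficiency is an assumed factor (1.0 means perfectly parallel work).
    """
    return {
        "power": core["power"] * count,
        "performance": core["performance"] * count * parallel_efficiency,
    }


four_small = multicore(SMALL_CORE, 4)
print(four_small)
```

Under these assumptions, four small cores fit in the power budget of one large core while delivering twice its throughput, provided the software exposes enough parallelism.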

Multi-core computing

[Figure: number of cores per chip (1, 2, 4, 8, 16, 32, 64, 128, 256, 512) plotted against year (1970 to 2005 and beyond, «20??»). Processors shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045.]


Multi-Core Architectures

Microprocessors in the era of parallelism

Replicated Hardware Capable of Executing Arithmetic and Logic Instructions

1st-level Memory (x8)

Shared L2 Memory

Shared L3 Memory

Peripheral unit and I/O (x4)

Power Management

The processor is split into voltage and frequency islands

Peripheral unit and/or I/O port (x5)

3rd-level memory

1st-level memory (x6)

OFF

OFF

OR

per-core activation

Each core (or «cluster» of cores) can be operated at different voltage and frequency settings, or selectively powered off

Case Study: an Industrial Research Prototype

Power Management

Intel Single-Chip Cloud Computer (SCC)
- 48 cores structured into 2-core clusters
- 24 frequency islands
- 8 voltage islands
- 15 speed settings from 100 to 800 MHz
- 7 voltage levels from 0.7V to 1.3V in steps of 0.1V
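To see why voltage and frequency islands matter, the usual first-order CMOS approximation for dynamic power, P ~ C * V^2 * f, can be applied to the SCC voltage and frequency ranges. This is a sketch under stated assumptions: the approximation ignores static (leakage) power, the function name `relative_dynamic_power` is hypothetical, and the pairing of voltages with frequencies below is made up for illustration rather than taken from the SCC documentation.

```python
# Voltage levels quoted for the SCC: 0.7 V to 1.3 V in steps of 0.1 V.
VOLTAGE_LEVELS = [round(0.7 + 0.1 * step, 1) for step in range(7)]


def relative_dynamic_power(voltage, frequency_mhz,
                           ref_voltage=1.3, ref_frequency_mhz=800.0):
    """Dynamic power relative to a reference point, using P ~ C * V^2 * f.

    The switched capacitance C cancels out in the ratio.
    """
    return (voltage / ref_voltage) ** 2 * (frequency_mhz / ref_frequency_mhz)


# Hypothetical operating points (the voltage/frequency pairing is assumed).
for voltage, frequency in ((1.3, 800.0), (1.0, 400.0), (0.7, 100.0)):
    print(voltage, frequency,
          round(relative_dynamic_power(voltage, frequency), 3))
```

Because voltage enters quadratically, lowering voltage and frequency together saves far more power than lowering frequency alone.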

From Multicores to Manycores

There are applications where the hardware parallelism is perfectly matched to the software parallelism (this is not always the case!!)

E.g., graphics

Single instruction («sum with 6») applied to Multiple Data (SIMD – Single Instruction Multiple Data) Implementation
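The «sum with 6» example can be modelled in a few lines. This is only a functional sketch: Python runs the additions one after another, whereas SIMD hardware applies the instruction to all elements at once. The function name `simd_add` and the sample data are assumptions.

```python
def simd_add(data, constant):
    """Functional model of a SIMD add: one operation, many data elements."""
    return [element + constant for element in data]


pixels = [10, 20, 30, 40]   # assumed sample data
print(simd_add(pixels, 6))  # same instruction applied to every element
```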

Graphics Processing Unit

Server of/with GPUs

Optimized for SIMD Workloads

NVIDIA TITAN V GPU
- 5120 cores
- 640 additional cores to support artificial intelligence
- Other 320 cores («texture» units)
- 21.1 billion transistors
- Maximum clock frequency: 1.5 GHz
- 12 GB of memory
- 12nm technology
- 110 TFLOPS of compute performance
- Target applications: deep learning, supercomputing, financial services, high-end gaming, big data applications
- TDP: 250 W
- Price: roughly 3000 dollars

LATEST NEWS FROM THE WORLD OF MICROPROCESSORS

Heterogeneity

1st-level Memory (x8)

System-on-Chip

Low-Performance Low-Power Multicore

High-Performance High-Power Multicore

OPERATING SYSTEM SCHEDULER

OFF / ON

«big.LITTLE» ARM architecture: combination of high-end ARM A57 with low-end ARM A53

SAMSUNG EXYNOS7420 OCTA-CORE
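The role of the operating system scheduler in such a heterogeneous design can be illustrated with a deliberately simplified policy. Everything in this sketch is hypothetical: real schedulers use far richer heuristics and can keep both clusters active, while here a single load threshold decides which cluster is ON.

```python
def select_cluster(load, threshold=0.6):
    """Pick which cluster is powered ON for a given load (0.0 to 1.0).

    The threshold is a hypothetical tuning value, not a vendor setting.
    """
    if load > threshold:
        # Heavy workload: use the high-performance, high-power cluster.
        return {"big": "ON", "little": "OFF"}
    # Light workload: the low-power cluster is enough.
    return {"big": "OFF", "little": "ON"}


for load in (0.1, 0.9):
    print(load, select_cluster(load))
```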

The Accelerator Store

1st-level Memory (x8)

System-on-Chip

Multicore HOST PROCESSOR

Hardware Accelerators: Image processing, Video Playback, FFT, Neuromorphic Accelerator

Hardware Accelerator: Graphics (Embedded GPU)

Re-programmable accelerators

Specialized accelerators

Example: Huawei Kirin 970
- «big.LITTLE» host processor: 4-core A73 (2.4 GHz), 4-core A53 (1.8 GHz)
- Embedded GPU with 12 cores
- 5.5 billion transistors
- Area: 1 cm²
- 1 accelerator for machine learning (25x performance, 50x energy efficiency)
- 2005 classified images/minute (Samsung Galaxy S8: 95, iPhone 7 Plus: 487)

THE FUTURE

Deep Learning


Hardware for Deep Learning


UNIFE is playing the game!


Collaboration with Fabrizio Riguzzi, Department of Informatics

Dynamically Reconfigurable DNN

Courtesy of Rice University

Brain-inspired Computing


Courtesy of D.Querlioz

Brain-Inspired Computing: Why?


Courtesy of D.Querlioz

We are playing the game right now!


UNIFE is playing the game

Collaboration with Michele Favalli, Columbia University and AMD

Enabling Asynchronous Interconnect Technology

Integrated Photonics for Communication

Laser source, (typically) off-chip

Optical signal carried to the chip via optical fiber

Tapered input

Silicon waveguide

Optical OOK modulation

Photonic Switching

Photodetector

Transimpedance amplifier

Digital comparator

3-D Integration


Automatic Place&Route Tool

Full-custom

Irregular pattern

Regular pattern

Optical Ring

Optical Rings

Automatic topology synthesis framework

Collaboration with Maddalena Nonato, Marco Gavanelli and TU Munich (Germany)

UNIFE is playing the game

Fabrication of photonic integrated circuits and measurements on them

Collaboration with Gaetano Bellanca and Inphotec Pisa


[Figure: block diagram of the electronic transmitter (TX) and receiver (RX) that drive an optical network-on-chip (ONOC). Recoverable labels: DC-FIFOs 1 to 30, VC decoder and VC-ID, arbiters, credit counters, credits from/to Rx15, MUX and DEMUX stages, 32x1 binary tree serializer, 1x32 binary tree deserializer, drivers, SE2D and D2SE converters, TIA and photodetectors (PD29, PD30), PLL with divide-by-2 stages producing clk1 to clk5 (f1, f1/2, f1/4, f1/8, f1/16). Technology labels: CMOS 40nm and ECL 130nm; variants labelled FULLY CMOS and Hybrid.]

HOW TO DRIVE AN OPTICAL NETWORK?

Collaboration with IHP Microelectronics (Germany)
