Architecture of Digital Integrated Systems Course Presentation Davide Bertozzi University of Ferrara

Post on 16-May-2018


Page 1: Architecture of Digital Integrated Systems - mpsoc.unife.itmpsoc.unife.it/~arch-dig/slides2016/Course2018.pdfmore effective solutions and systems 10 ... 7420 OCTA-CORE ... We all would

Architecture of Digital Integrated Systems

Course Presentation

Davide Bertozzi
University of Ferrara

Most of the course material will be in English

Course Information

Instructor: Davide Bertozzi, Assistant Professor
Email: [email protected]
Phone: +390532974832

Teaching assistant, responsible for laboratory experiences: Meriem Turki, PhD student
Email: [email protected]
Office: no. 338 (third floor, Engineering Department)

Course schedule

MONDAY 16.30-19.00 - Room 20

WEDNESDAY 16.30-19.00 - Room 9 OR Informatics Lab (Small)

All lab experiences will be taught in English

Office hours: by appointment (email reservation, or after lectures)

The Exam

- Roughly one third of the course will be in the lab (or related to it)
- Expertise in the (C++-derived) SystemC hardware description language (HDL)
- Exam split into 2 parts:
  - Oral exam (25 points): 3 questions
  - Course project (5 points): hands-on final project assignment showing off SystemC programming skills
- Exams are by appointment, and requests should be emailed to me at least one week in advance

Course material

Course website: http://mpsoc.unife.it/~arch-dig/
- Slides (at least 1 hour before lessons)
- News, course information

No single course book is available, since the topic of this course is fast evolving: it is at the frontier of research. Specific book chapters, papers, ... will be suggested on a topic-by-topic basis.

Attending the course and taking notes is the best way to enjoy the course!

Useful books

1. T. Groetker, S. Liao, G. Martin, S. Swan; System Design with SystemC, Kluwer Academic Publishers, 2002. Topic: SystemC hardware description language
2. J. Flich, D. Bertozzi; Designing Network-on-Chip Architectures in the Nanoscale Era, CRC Press, 2011. Topic: Networks-on-Chip
3. William James Dally, Brian Patrick Towles; Principles and Practices of Interconnection Networks, Morgan Kaufmann, 2004. Topic: interconnection networks
4. J.M. Rabaey, A. Chandrakasan, B. Nikolic; Digital Integrated Circuits - A Design Perspective (second edition), Prentice Hall. Topics: design methodologies; timing issues in digital circuit design
5. David A. Patterson, John L. Hennessy; Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, 2004. Topic: microprocessor architecture
6. D. Culler, P. Singh, A. Gupta; Parallel Computer Architecture: a Hardware/Software Approach, Morgan Kaufmann, 2004. Topic: design issues of multicore processors

Lab Schedule

Indicative dates for lab experiences

- March 21st from 16:30 to 18:30
- March 28th from 16:30 to 18:30
- April 11th from 16:30 to 18:30
- April 18th from 16:30 to 18:30
- May 9th from 16:30 to 18:30
- May 16th from 16:30 to 18:30
- May 23rd from 16:30 to 18:30 (Final Project Assignment)

These dates may change based on my work commitments. They should be considered as indicative.

The course at a glance

- Architecture
- Technology
- Synthesis flow
- SystemC Hardware Description Language (HDL)
- Technology-aware design

Why take this course?

Breaking the abstraction layers and knowing what is underneath enables you to solve problems and design better future systems

Cooperation between multiple components and layers can enable more effective solutions and systems

Horizontal integration (across components): microprocessor core, bus, memory, I/O, accelerators, off-chip memory

Vertical integration (across layers): application and libraries, language runtime, operating system, hypervisor, netlist of logic gates, circuits, layout, transistors

Let us start from our daily experience

When I push a button on the touchscreen, the smartphone recovers from sleep mode and starts working: «I am ready!»

iPhone 5s battery duration:

Operation        Battery duration
Standby          250 hrs
3G talk time     10 hrs
3G browsing      8 hrs
LTE browsing     10 hrs
Wi-Fi browsing   10 hrs
Video            10 hrs
Music            40 hrs

Different kinds of workloads stress the smartphone to different extents. This has direct implications for the battery duration.

Who does the actual computing inside the smartphone?

Electronic board

Opening the smartphone

Touch Screen Controller

Antennas and controllers

Electronic Board – Face up

Application Processor

Non-Volatile Memory

Gyroscope & Accelerometer

LTE/GSM Modem

NFC Controller

Image Processor

Electronic Board – Face down

Application Processor

The «brain» of the smartphone is its «Application Processor»: Snapdragon (Qualcomm), Exynos (Samsung), Helio (Mediatek), OMAP (Texas Instr.), Kirin (HiSilicon), Tegra (Nvidia), Ax (Apple), ...

What is a microprocessor?

Programmer: «Execute!»
Microprocessor: «Yes Sir!»

Programmer: «Send an email!»
Microprocessor: «Yes Sir!»

Microprocessors are not able to understand and process such «abstract» commands!

From now on, the terms «application processor» and «microprocessor» will be used interchangeably, although other kinds of microprocessors do exist (e.g., for power management, wireless control, display control, etc.).

Machine Language

Talking to a microprocessor goes through «electrical signals». The basic thing a microprocessor does is to detect the following conditions:
- Presence of signal (symbol «1»)
- Lack of signal (symbol «0»)

Fundamentals of digital processing: binary numbers are used to communicate both instructions and data to a microprocessor

Microprocessors speak a language whose alphabet consists of 2 letters, «0» and «1» (the Italian language has 21 letters). As a result, machine language consists of binary numbers:

Programmer to microprocessor: 1000110010100000!

With this language, what kind of orders can I give to a microprocessor?
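
As a minimal sketch of what such a binary word looks like from the software side, the Python snippet below takes the 16-bit pattern shown above and reads it both as a sequence of signal / no-signal conditions and as an unsigned number. The helper names (`to_signals`, `to_unsigned`) are hypothetical and chosen for illustration only; how a real microprocessor would decode this pattern as an instruction depends on its instruction set, which the slide does not specify.

```python
WORD = "1000110010100000"  # the bit pattern shown on the slide


def to_signals(bits):
    """Map each symbol to the electrical condition it stands for."""
    return ["signal" if b == "1" else "no signal" for b in bits]


def to_unsigned(bits):
    """Read a string of 0/1 symbols as an unsigned binary number."""
    value = 0
    for b in bits:
        value = value * 2 + (1 if b == "1" else 0)
    return value


value = to_unsigned(WORD)
print(len(WORD), "bits ->", value, hex(value))
print(to_signals(WORD[:4]))
```

The same 16 symbols can stand for a number, a piece of data, or (part of) an instruction: the bits alone do not say which, the interpretation is fixed by the hardware that receives them.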

A microprocessor can only execute simple «low-level» arithmetic-logic instructions
- E.g., sum/subtract/multiply/divide two integers
- E.g., carry out the logic AND/OR/EXOR of two bits or sets of bits

Applications consist of «complex» (or abstract) operations, which take for granted the human mind's capability to think/abstract/plan/structure:
- Start a phone call!
- Play back a video!
- Send an email!

Abstractions hide details, but make it possible to cope with problem complexity

ABSTRACTION GAP: several intermediate HW/SW layers are needed to interpret and translate high-level operations into the basic operations a microprocessor can do.

Moving to a «lower abstraction layer» increases the informative content
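
One small step across the abstraction gap can be sketched in Python. The example below is an illustration under stated assumptions, not the way any specific processor is built: it expresses the «sum of two integers» using only the bit-level AND/OR/EXOR operations, in the style of a textbook ripple-carry adder. The function names and the 8-bit width are hypothetical choices.

```python
def full_adder(a, b, carry_in):
    """Add three bits using only EXOR, AND and OR."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out


def add_bits(x, y, width=8):
    """Sum two unsigned integers one bit at a time (ripple-carry style).

    The result wraps around modulo 2**width, as a fixed-width adder would.
    """
    carry = 0
    result = 0
    for i in range(width):
        a = (x >> i) & 1          # i-th bit of x
        b = (y >> i) & 1          # i-th bit of y
        s, carry = full_adder(a, b, carry)
        result |= s << i
    return result


print(add_bits(5, 6))      # a single "sum" at the higher layer
print(add_bits(200, 100))  # wraps around with 8 bits
```

A single «sum» at the upper layer becomes eight full-adder steps at the lower one: the lower layer carries more detail about what actually happens.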

Application Processor: an Example

SAMSUNG EXYNOS 7420 OCTA-CORE

A microprocessor does NOT consist of just a single CORE (computation unit), but rather of a (more or less regular) network of cores.

ANALOGY

Once upon a time (~50 years ago) it was not like that…

Microprocessors used to be «Monolithic processor cores»

Hardware capable of executing arithmetic and logic instructions

OPTIONS of this hardware:
- Processing speed (or clock frequency)
- Instruction throughput
- Instruction-Level Parallelism
- Out-of-order execution capability
- Branch prediction strategy
- Memory hierarchy and access speed
- Virtual memory
- ...

The «system-on-chip (SoC)» revolution

Trend to integrate more and more functions (computation units, controllers, modems, memory macros, etc.) on the same silicon die, thus building up «Systems-on-Chip».
- lower power, better performance, smaller size

Today, all application processors are «systems-on-chip»

What slows down this trend: Technology, Cost, Reliability

The long-term ideal asymptotic trend consists of the «smartphone-on-a-chip»

«Integrated» Application Processors
Microprocessors in the «system-on-chip (SoC)» era

Memory

Peripheral units and I/O

Hardware capable of executing arithmetic and logic instructions

Systems-on-chip: a different way of designing systems... and of doing business!

SoC for wireless applications

A typical «SoC» consists of pre-designed and pre-verified blocks, which can be made in-house or bought from «third-party» vendors (against the payment of royalties):
- Computation unit from a third-party vendor (implementation made by system integrator: SOFT MACRO)
- Computation unit from a third-party vendor (layout defined by the vendor as well: HARD MACRO)
- Data Memory and Instruction Memory (from third-party vendor). Their layout comes from the vendor as well: HARD MACRO

New terminology coming up:
- Platform-based design
- Design reuse
- System integration task

«ARM» PROCESSORS

You may have heard that «most smartphone processors» are ARM processors...
... but when talking about application processors ARM was not mentioned: Qualcomm, MediaTek, Nvidia, HiSilicon, Apple, ...??!?!

Generic application processors with «ARM core inside».

ARM core

In turn, ARM processors are systems-on-chip….

Application processors are Systems-on-chip!

Evolution: the Memory Hierarchy

1st level memory

Peripheral units and I/O

2nd level memory

Hardware capable of executing arithmetic and logic instructions

Microprocessors in the «system-on-chip (SoC)» era

Memory – the Problem

We all would like knowledge to be accessible in a single book!

Since it is not, we need to select, every time, the book containing the information needed at that point in time!

Working Set

• We may identify a set of books containing all the information that we normally need (except for specific cases!) = Working set.

• What are the «habits» of the microprocessor «reader», so as to build the working set?

The Problem

The microprocessor would like to have infinite memory with extremely fast access times... But this is not feasible in practice:
- Fast memories are small. Large memories are slow.
- The amount of memory that can stay on a chip is limited!
- Fast memories are also very expensive, so they have to be small.

Key Insight

The microprocessor has a distinctive feature: it tends to reuse data and instructions that have been accessed recently, or that are close to the recently-accessed ones.

TEMPORAL LOCALITY

Recently-accessed elements are likely to be accessed again soon

SPATIAL LOCALITY

When accessing an element, the elements nearby are likely to be accessed soon

Example

Spatial locality: access to the elements A[i] in sequence

Temporal locality: at each iteration, the «sum» instruction is used
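
A loop of the kind this example refers to can be sketched as follows. This is an illustrative reconstruction, not the code from the original slide: the array contents, the base address `0x1000` and the 4-byte element size are assumptions used only to model how consecutive elements of a C-like array would sit in memory.

```python
A = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical data


def sum_array(values):
    total = 0
    for i in range(len(values)):
        # Spatial locality: values[i] is the neighbour of values[i - 1].
        # Temporal locality: the same "sum" instruction (and `total`)
        # is reused at every iteration.
        total = total + values[i]
    return total


def data_access_trace(n, base=0x1000, elem_size=4):
    """Addresses touched when reading A[0], A[1], ..., A[n-1] (assumed layout)."""
    return [base + i * elem_size for i in range(n)]


trace = data_access_trace(len(A))
strides = [b - a for a, b in zip(trace, trace[1:])]
print(sum_array(A), [hex(a) for a in trace[:3]], strides)
```

Every data access lands exactly one element after the previous one, which is the pattern a cache can exploit by fetching neighbouring elements together.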

How to exploit this?

Memory Hierarchy

The approach is to store in a small and fast memory «close to» the processor (i.e., a cache) the data/instructions that are currently being accessed, in addition to the «nearby» ones

Fast memory: low latency, high throughput, small size, high cost

By implementing the memory system as a «hierarchy», the microprocessor is given the illusion of having a memory as large as the last-level one and as fast as the first-level one

Working set: whether the working set is «good» or not depends on the number of MISSes in the first-level memory

How does it work? Hit and Miss

I am searching for a data item/instruction. Do you have it?
- YES (HIT!)
- NO (MISS!): transfer from the lower level

Registers

L1 Cache

L2 Cache

The philosophy is as follows: fast access to data/instructions that are most commonly used or which can be foreseen to be accessed in the near future. For the exceptions, a temporary performance slowdown has to be accounted for.
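
The hit/miss mechanism can be sketched with a tiny cache model. The code below assumes a direct-mapped organization with 4 lines of 16 bytes each; these parameters, the function name `simulate_cache` and the address traces are hypothetical choices for illustration, and real caches differ in size, associativity and replacement policy.

```python
def simulate_cache(addresses, num_lines=4, block_size=16):
    """Direct-mapped cache model: returns (hits, misses) for an address trace."""
    lines = [None] * num_lines           # block number currently held by each line
    hits = misses = 0
    for addr in addresses:
        block = addr // block_size       # which memory block the address belongs to
        index = block % num_lines        # the only line where that block may live
        if lines[index] == block:
            hits += 1                    # HIT: served by the fast memory
        else:
            misses += 1                  # MISS: transfer from the lower level
            lines[index] = block
    return hits, misses


sequential = [0x1000 + 4 * i for i in range(16)]   # A[0..15], 4 bytes each
conflicting = [0x1000, 0x1040] * 4                 # two blocks mapping to the same line

print(simulate_cache(sequential))
print(simulate_cache(sequential + sequential))
print(simulate_cache(conflicting))
```

The sequential trace misses once per block and then hits on the neighbouring elements (spatial locality); repeating it hits every time (temporal locality); the conflicting trace shows the «exceptions» where every access pays the slowdown.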

Evolution: the Memory Hierarchy

1st level memory

Peripheral units and I/O

2nd level memory

Hardware capable of executing arithmetic and logic instructions

Microprocessors in the «system-on-chip (SoC)» era

Trend: increasing chip size!

Growing to Super-cores!

1st-level memory

Peripheral unit and/or I/O port

2nd-level memory

Chip Size

Goal: meet the growing user expectations for advanced software services

Trend: increasing chip size!

Chip Size

Next-Generation Processor Core
1st-level memory
2nd-level memory
3rd-level memory
Peripheral units and I/O

- Integration of tens or hundreds of «cores»
- More memory levels, or same levels with more memory
- Higher performance, lower cost, etc.

Opposite trend: chip size reduction under the pressure of technology scaling

Chip Size

Next-Generation Processor Core
1st-level memory
2nd-level memory
3rd-level memory
Peripheral units and I/O

Super-cores! 90 nm, 65 nm, 45 nm

Today we are heading below 14nm process nodes!

Below 10nm, fundamental physical issues come to the forefront

Trend stabilization: constant chip size

Microprocessor area has today stabilized (rough numbers provided):
- 140 mm² for desktop computers
- 260 mm² for high-performance computing (e.g., scientific computing)
- 70–100 mm² for “embedded” microprocessors

Area split into LOGIC, MEMORY AND INTEGRATION OVERHEAD

There are limiting factors for chip area:
- power consumption
- manufacturing cost
- chip-wide transmission delay
- design cost

Then, around 2000, an epoch-making paradigm shift occurred…

SOMETHING WENT WRONG!

Designers realized that at each new generation of microprocessors, the cost to achieve a predefined performance increase skyrockets (if the increase is achievable at all)

Pollack’s rule: at a given feature size (process node), a new microprocessor generation takes 2-3x the area of the old one, while the performance speedup is only 1.4-1.6x
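
Pollack's rule is commonly summarized as single-core performance growing roughly with the square root of the area (complexity) invested in the core. The short sketch below applies that rule of thumb to the area ratios quoted above; it is an approximation to compare against the slide's figures, not a measurement, and the function name `pollack_speedup` is hypothetical.

```python
import math


def pollack_speedup(area_ratio):
    """Rule-of-thumb speedup for a core that is `area_ratio` times larger."""
    return math.sqrt(area_ratio)


for area in (2.0, 3.0):
    print(f"{area:.0f}x area -> about {pollack_speedup(area):.2f}x performance")

# The rule also reads backwards: a core with 1/4 of the area
# is expected to keep about 1/2 of the performance.
print(pollack_speedup(0.25))
```

Doubling or tripling the area buys far less than double or triple the performance, which is the diminishing return the slide points at.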

DECREASING MARGINAL UTILIZATION

WHAT EXACTLY WENT WRONG?

1. Few applications can expose more than 2 parallel instructions/cycle
   - The main limiter of the instruction throughput is the presence of dependencies in the instruction flow, which microarchitecture/compiler designers cannot completely get rid of
2. Sometimes, although instruction parallelism is there, the compiler and/or the hardware are not able to extract it
   - E.g., potentially parallel instructions that are thousands of cycles apart
3. Memory access latencies limit the utilization rate of the processor
   - Memory latency cannot be completely hidden
4. Beyond a 150W power envelope, it is not economically convenient to cool down any more
5. Beyond a given clock frequency, there are concerns:
   - A well-formed clock pulse becomes challenging
   - Within the clock cycle, there is an inactive time that does not scale
6. Although the processor is fast in processing data, data communication is overly slow and costly!
   - The communication bottleneck becomes more severe as technology scales down!
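
Point 1 can be made concrete with a small sketch. The six-instruction program below is hypothetical, as are the simplifying assumptions of one cycle per instruction and unlimited execution units; the code computes the earliest cycle in which each instruction can run once its operands are ready, and from that the average number of instructions per cycle that the dependencies allow.

```python
# Each entry: (destination, source operands). Hypothetical instruction flow.
PROGRAM = [
    ("a", ()),           # a = load ...
    ("b", ()),           # b = load ...
    ("c", ("a", "b")),   # c = a + b
    ("d", ("c",)),       # d = c * 2
    ("e", ("a",)),       # e = a - 1
    ("f", ("d", "e")),   # f = d + e
]


def schedule_levels(program):
    """Earliest cycle of each instruction, honouring data dependencies only."""
    ready = {}
    for dest, sources in program:
        ready[dest] = 1 + max((ready[s] for s in sources), default=0)
    return ready


levels = schedule_levels(PROGRAM)
critical_path = max(levels.values())
ilp = len(PROGRAM) / critical_path
print(levels, critical_path, ilp)
```

Even with unlimited hardware, the chain a → c → d → f forces four cycles for six instructions, so the available parallelism stays below 2 instructions per cycle.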

A NEW ERA: MULTI-CORE COMPUTING

Figure: a large core with its 1st-level and 2nd-level memory, compared with four small cores (C1, C2, C3, C4). Bar charts of power and performance (scale 1 to 4).

Small core, relative to the large core:
Power = 1/4
Performance = 1/2

Multi-Core: power efficient + better power and thermal management

Computation parallelism represents a more efficient and scalable way of delivering computing performance and power management!
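
The arithmetic behind this claim can be sketched as follows, taking the figures Power = 1/4 and Performance = 1/2 per small core from the slide. The `parallel_fraction` parameter and the Amdahl-style timing model are assumptions added for illustration (they are not part of the slide): they show how the benefit depends on how much of the workload can actually run in parallel.

```python
from fractions import Fraction

SMALL_CORE_POWER = Fraction(1, 4)   # relative to the large core (from the slide)
SMALL_CORE_PERF = Fraction(1, 2)    # relative to the large core (from the slide)


def multicore(n_cores, parallel_fraction=Fraction(1)):
    """Return (power, performance) of n small cores, relative to one large core.

    Assumed model: the serial part runs on one small core, the parallel part
    is spread evenly over all n cores.
    """
    power = n_cores * SMALL_CORE_POWER
    serial_time = (1 - parallel_fraction) / SMALL_CORE_PERF
    parallel_time = parallel_fraction / (n_cores * SMALL_CORE_PERF)
    return power, 1 / (serial_time + parallel_time)


print(multicore(4))                    # fully parallel workload
print(multicore(4, Fraction(1, 2)))    # only half of the work is parallel
```

With a fully parallel workload, four small cores deliver twice the performance of the large core in the same power budget; when only half of the work is parallel, the same four cores end up slower than the large core.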

Multi-core computing

Figure: number of cores per chip (1, 2, 4, 8, 16, 32, 64, 128, 256, 512) versus year (1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 20??).

Processors on the chart: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045

Inset: large core with its cache versus four small cores (C1, C2, C3, C4); small core: Power = 1/4, Performance = 1/2

Multi-Core: power efficient + better power and thermal management

Multi-Core Architectures
Microprocessors in the era of parallelism

Replicated hardware capable of executing arithmetic and logic instructions (8 cores, each with its own 1st-level memory)

Shared L2 Memory

Shared L3 Memory

Peripheral units and I/O

Power Management

The processor is split into voltage and frequency islands

Figure labels: cores with their 1st-level memory (some of them switched OFF), 3rd-level memory, peripheral units and/or I/O ports. Alternative shown (OR): per-core activation

Each core (or «cluster» of cores) can be operated at different voltage and frequency settings, or selectively powered off

Power Management
Case Study: an Industrial Research Prototype

Intel Single-Chip Cloud Computer (SCC)
- 48 cores structured into 2-core clusters
- 24 frequency islands
- 8 voltage islands
- 15 speed settings from 100 to 800 MHz
- 7 voltage levels from 0.7V to 1.3V in steps of 0.1V
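
Why voltage and frequency settings matter can be sketched with the standard first-order model of dynamic power, P ∝ V² · f. The code below enumerates the 7 voltage levels listed for the SCC and compares operating points against the highest one. Two things are assumptions made for illustration: the pairing of a given voltage with a given frequency (the slide does not say which combinations are allowed), and the neglect of static (leakage) power.

```python
def voltage_levels(v_min=0.7, v_max=1.3, step=0.1):
    """Voltage levels from v_min to v_max in the given steps (values in volts)."""
    n = round((v_max - v_min) / step) + 1
    return [round(v_min + i * step, 1) for i in range(n)]


def relative_dynamic_power(v, f, v_ref=1.3, f_ref=800.0):
    """Dynamic power relative to the (v_ref, f_ref) point, using P ~ V^2 * f."""
    return (v / v_ref) ** 2 * (f / f_ref)


levels = voltage_levels()
print(levels)
print(relative_dynamic_power(1.3, 800.0))   # reference point
print(relative_dynamic_power(0.7, 100.0))   # assumed lowest-voltage, lowest-speed pairing
```

Because voltage enters quadratically, lowering voltage and frequency together reduces dynamic power much faster than it reduces speed, which is what makes per-island settings attractive.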

From Multicores to Manycores

There are applications where the hardware parallelism is perfectly matched to the software parallelism (this is not always the case!!)
- E.g., graphics

Single instruction («sum with 6») applied to multiple data (SIMD – Single Instruction Multiple Data) implementation
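
The «sum with 6» example can be sketched in Python as follows. This is a software model of the idea rather than real vector hardware: the vector width of 4 lanes and the function name `simd_sum_with_6` are assumptions, and the instruction counter simply counts how many times the single operation has to be issued.

```python
def simd_sum_with_6(data, width=4):
    """Apply the same 'sum with 6' operation to `width` data items at a time.

    Returns the results and the number of SIMD instructions issued.
    """
    result = []
    instructions = 0
    for start in range(0, len(data), width):
        lanes = data[start:start + width]     # data items processed together
        result.extend(x + 6 for x in lanes)   # same operation on every lane
        instructions += 1
    return result, instructions


data = list(range(1, 9))
print(simd_sum_with_6(data))           # 2 instructions for 8 data items
print(simd_sum_with_6(data, width=1))  # scalar case: 8 instructions
```

The work done is the same in both cases; what changes is how many instructions have to be fetched and decoded to get it done.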

Graphics Processing Unit

Server of/with GPUs

Optimized for SIMD Workloads

NVIDIA TITAN V GPU

- 5120 cores
- 640 additional cores to support artificial intelligence
- Other 320 cores («texture» units)
- 21.1 billion transistors
- Maximum clock frequency: 1.5 GHz
- 12 GB of memory
- 12nm technology
- 110 TFLOPS of compute performance
- Target applications: deep learning, supercomputing, financial services, high-end gaming, big data applications
- TDP: 250 W
- Price: roughly 3000 dollars
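
A back-of-the-envelope check of these numbers can be sketched as follows. The core count, clock frequency, TFLOPS figure and TDP come from the list above; the assumption that each of the 5120 cores completes 2 floating-point operations per cycle (one fused multiply-add) is mine and is used only for illustration. Under that assumption the general-purpose cores alone fall well short of 110 TFLOPS, which suggests that the quoted figure relies on the additional cores for artificial intelligence.

```python
CORES = 5120
CLOCK_HZ = 1.5e9
FLOPS_PER_CORE_PER_CYCLE = 2      # assumption: one fused multiply-add per cycle
QUOTED_TFLOPS = 110.0
TDP_WATTS = 250.0

# Peak throughput of the general-purpose cores under the assumption above.
peak_tflops = CORES * FLOPS_PER_CORE_PER_CYCLE * CLOCK_HZ / 1e12

# Energy efficiency implied by the quoted figures.
tflops_per_watt = QUOTED_TFLOPS / TDP_WATTS

print(peak_tflops, tflops_per_watt, QUOTED_TFLOPS / peak_tflops)
```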

LATEST NEWS FROM THE WORLD OF MICROPROCESSORS

Heterogeneity

System-on-Chip
- Low-performance, low-power multicore (each core with its 1st-level memory)
- High-performance, high-power multicore (each core with its 1st-level memory)

OPERATING SYSTEM SCHEDULER (one multicore OFF, the other ON)

«big.LITTLE» ARM architecture: combination of high-end ARM A57 with low-end ARM A53

SAMSUNG EXYNOS 7420 OCTA-CORE
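
How an operating system scheduler might choose between the two multicores can be sketched with a simple threshold policy. Everything in this example is hypothetical: the load metric (a number between 0 and 1), the two thresholds and the function names are illustrative assumptions, and the schedulers actually used on big.LITTLE devices are considerably more sophisticated.

```python
def pick_cluster(load, current="LITTLE", up=0.8, down=0.3):
    """Choose the active cluster from the current load, with hysteresis.

    Move to the high-performance cluster when the load is high, move back
    to the low-power cluster only when the load has clearly dropped.
    """
    if current == "LITTLE" and load > up:
        return "big"
    if current == "big" and load < down:
        return "LITTLE"
    return current


def run(loads):
    """Replay a sequence of load samples and record the active cluster."""
    cluster = "LITTLE"
    trace = []
    for load in loads:
        cluster = pick_cluster(load, cluster)
        trace.append(cluster)
    return trace


print(run([0.1, 0.9, 0.5, 0.2, 0.4]))
```

The two thresholds are kept apart on purpose, so that a load hovering around a single value does not make the system bounce between clusters.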

The Accelerator Store

System-on-Chip
- HOST PROCESSOR: low-performance, low-power multicore + high-performance, high-power multicore (each core with its 1st-level memory)
- Hardware accelerators: image processing, video playback, FFT, neuromorphic accelerator
- Hardware accelerator for graphics (embedded GPU)

Two kinds of accelerators:
- Re-programmable accelerators
- Specialized accelerators

Example: Huawei Kirin 970

- «big.LITTLE» host processor: 4-core A73 (2.4 GHz) + 4-core A53 (1.8 GHz)
- Embedded GPU with 12 cores
- 5.5 billion transistors
- Area: 1 cm²
- 1 accelerator for machine learning (25x performance, 50x energy efficiency)
- 2005 classified images/minute (Samsung Galaxy S8: 95, iPhone 7 Plus: 487)

THE FUTURE

Deep Learning

Hardware for Deep Learning

UNIFE is playing the game!

Collaboration with Fabrizio Riguzzi, Department of Informatics

Dynamically Reconfigurable DNN

Courtesy of Rice University

Brain-inspired Computing

Courtesy of D.Querlioz

Brain-Inspired Computing: Why?

Courtesy of D.Querlioz

We are playing the game right now!

UNIFE is playing the game

Collaboration with Michele Favalli, Columbia University and AMD

Enabling Asynchronous Interconnect Technology

Integrated Photonics for Communication

Laser source, (typically) off-chip
Optical signal carried to the chip via optical fiber
Tapered input
Silicon waveguide
Optical OOK modulation
Silicon waveguide

Integrated Photonics for Communication

Optical OOK modulation
Silicon waveguide
Photonic Switching
Photodetector
Transimpedance amplifier
Digital Comparator

3-D Integration

UNIFE is playing the game

Figure labels: Automatic Place&Route Tool, Full-custom, Irregular pattern, Regular pattern, Optical Ring, Optical Rings, Automatic topology synthesis framework

Collaboration with Maddalena Nonato, Marco Gavanelli and TU Munich (Germany)

UNIFE is playing the game

Fabrication of photonic integrated circuits and measurements on them

Collaboration with Gaetano Bellanca and Inphotec Pisa

UNIFE is playing the game

HOW TO DRIVE AN OPTICAL NETWORK?

Figure: block diagram of the transmitter (TX) and receiver (RX) interfaces to an optical network-on-chip (ONOC). Labels on the TX side: DC-FIFOs, RL, MUX, DEMUX, arbiters, credit counters, credits from Rx, VC decoder, MESO, 32x1 binary tree serializer, SE2D, drivers. Labels on the RX side: PD, TIA, D2SE, 1x32 binary tree deserializer, DEMUX, DC-FIFOs, VC_ID, credits to Rx. Clocking: PLL with ÷2 dividers (clk1 to clk5; f0; f1, f1/2, f1/4, f1/8, f1/16). Technology labels: CMOS 40nm, ECL 130nm. Design options: FULLY CMOS, Hybrid.

Collaboration with IHP Microelectronics (Germany)