Architecture of Digital Integrated Systems
Course Presentation
Davide Bertozzi, University of Ferrara
Most of the course material will be in English
Course Information
Instructor: Davide Bertozzi, Assistant Professor
Email: [email protected]
Phone: +39 0532 974832
Teaching assistant, responsible for laboratory experiences: Meriem Turki, PhD student
Email: [email protected]
Office: no. 338 (third floor, Engineering Department)
Course schedule
MONDAY 16.30-19.00 - Room 20
WEDNESDAY 16.30-19.00 - Room 9 OR Informatics Lab (Small)
All lab experiences will be taught in English
Office hours: by appointment (email reservation, or after lectures)
The Exam
Roughly one third of the course will be in the lab (or related to it)
- Expertise on the (C++-derived) SystemC hardware description language (HDL)
Exam split into 2 parts:
- Oral exam (25 points) - 3 questions.
- Course project (5 points): hands-on final project assignment showing off SystemC programming skills
Exams are by appointment, and requests should be emailed to me at least one week in advance
Course material
Course website: http://mpsoc.unife.it/~arch-dig/
- Slides (at least 1 hour before lessons)
- News, course information
No single course book is available, since the topic of this course is fast evolving
- It is at the frontier of research
- Specific book chapters, papers, ... will be suggested on a topic-by-topic basis
Attending the course and taking notes is the best way to enjoy it!
Useful books
1. T. Groetker, S. Liao, G. Martin, S. Swan; System Design with SystemC; Kluwer Academic Publishers, 2002. Topic: SystemC hardware description language
2. J. Flich, D. Bertozzi; Designing Network-on-Chip Architectures in the Nanoscale Era; CRC Press, 2011. Topic: Networks-on-Chip
3. William James Dally, Brian Patrick Towles; Principles and Practices of Interconnection Networks; Morgan Kaufmann, 2004. Topic: Interconnection networks
4. J.M. Rabaey, A. Chandrakasan, B. Nikolic; Digital Integrated Circuits - A Design Perspective (second edition); Prentice Hall. Topic: Design methodologies; timing issues in digital circuit design
5. David A. Patterson, John L. Hennessy; Computer Organization and Design: The Hardware/Software Interface; Morgan Kaufmann, 2004. Topic: Microprocessor architecture
6. D. Culler, P. Singh, A. Gupta; Parallel Computer Architecture: a Hardware/Software Approach; Morgan Kaufmann, 2004. Topic: Design issues of multicore processors
Lab Schedule
Indicative dates for lab experiences
- March 21st from 16:30 to 18:30
- March 28th from 16:30 to 18:30
- April 11th from 16:30 to 18:30
- April 18th from 16:30 to 18:30
- May 9th from 16:30 to 18:30
- May 16th from 16:30 to 18:30
- May 23rd from 16:30 to 18:30 (Final Project Assignment)
These dates are indicative and may change based on my work commitments.
The course at a glance
- Architecture
- Technology
- Synthesis flow
- SystemC Hardware Description Language (HDL)
- Technology-aware design
Why take this course?
Breaking the abstraction layers and knowing what is underneath enables you to solve problems and design better future systems
Cooperation between multiple components and layers can enable more effective solutions and systems
[Figure: horizontal integration (microprocessor core, bus, memory, I/O, accelerators, off-chip memory) and vertical integration (application and libraries, language runtime, operating system, hypervisor, netlist of logic gates, circuits, layout, transistors)]
Let us start from our daily experience
I am ready!
When I push a button on the touchscreen, the smartphone recovers from sleep mode and starts working
iPhone 5s
Operation          Battery duration
Standby            250 hrs
3G talk time       10 hrs
3G browsing        8 hrs
LTE browsing       10 hrs
Wi-Fi browsing     10 hrs
Video              10 hrs
Music              40 hrs
Different kinds of workloads stress the smartphone to different extents
This has direct implications on the battery duration
Who does the actual computing inside the smartphone?
Electronic board
Opening the smartphone
Touch Screen Controller
Antennas and controllers
Electronic Board – Face up
Application Processor
Non-Volatile Memory
Gyroscope & Accelerometer
LTE/GSM Modem
NFC Controller
Image Processor
Electronic Board – Face down
Application Processor
The «brain» of the smartphone is its «Application Processor»: Snapdragon (Qualcomm), Exynos (Samsung), Helio (MediaTek), OMAP (Texas Instruments), Kirin (HiSilicon), Tegra (Nvidia), Ax (Apple), ...
What is a microprocessor?
Execute!
Yes Sir!
Programmer
Microprocessor
Send an email!
Yes Sir!
Programmer
Microprocessor
Microprocessors are not able to understand and process such «abstract» commands!
From now on, the terms «application processor» and «microprocessor» will be used interchangeably, although other kinds of microprocessors do exist (e.g., for power management, wireless control, display control, etc.).
Machine Language
Talking to a microprocessor goes through «electrical signals»
The basic thing a microprocessor does is to detect the following conditions:
- Presence of signal (symbol «1»)
- Lack of signal (symbol «0»)
Fundamentals of digital processing: binary numbers are used to communicate both instructions and data to a microprocessor
Microprocessors speak a language whose alphabet consists of 2 letters (the Italian language has 21 letters). As a result, machine language consists of binary numbers.
Programmer to microprocessor: «1000110010100000!»
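As a minimal sketch of what "speaking in binary" means, the word shown on the slide can be read symbol by symbol and as a single unsigned number. The slides do not give an instruction format, so no opcode or operand fields are decoded here; that would depend on the specific processor.

```python
# The 16-symbol word from the slide, written with the two-letter alphabet.
word = "1000110010100000"

# Each symbol is one bit: «1» = signal present, «0» = signal absent.
bits = [int(symbol) for symbol in word]

# Read the same word as an unsigned binary number (assumption: the slides
# give no instruction format, so the word is not split into fields).
value = int(word, 2)

print(len(bits), "bits ->", value, "=", hex(value))
```

Whether such a word is an instruction or a piece of data is decided by how the processor uses it, not by the bits themselves.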
With this language, what kind of orders can I give to a microprocessor?
A microprocessor can only execute simple «low-level» arithmetic-logic instructions
- E.g., sum/subtract/multiply/divide two integers
- E.g., carry out the logic AND/OR/EXOR of two bits or sets of bits
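The kind of «low-level» instruction listed above can be sketched as a toy arithmetic-logic unit. The function name `alu` and the operation names are illustrative assumptions, not a real instruction set.

```python
def alu(op, a, b):
    """Toy arithmetic-logic unit working on Python integers (illustrative only)."""
    operations = {
        "add": lambda x, y: x + y,
        "sub": lambda x, y: x - y,
        "mul": lambda x, y: x * y,
        "div": lambda x, y: x // y,   # integer division
        "and": lambda x, y: x & y,    # bitwise AND
        "or":  lambda x, y: x | y,    # bitwise OR
        "xor": lambda x, y: x ^ y,    # bitwise EXOR
    }
    return operations[op](a, b)

a, b = 0b1100, 0b1010   # 12 and 10 in binary
for op in ("add", "and", "or", "xor"):
    print(op, format(alu(op, a, b), "05b"))
```

Everything an application does, from a phone call to video playback, is eventually broken down into long sequences of operations of this kind.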
Applications consist of «complex» (or abstract) operations, which take for granted the capability to think/abstract/plan/structure of the human mind:
- Start a phone call!
- Play back a video!
- Send an email!
Abstractions hide details, but make it possible to cope with problem complexity
Several intermediate HW/SW layers are needed to interpret and translate high-level operations into the basic operations a microprocessor can do.
Moving to a «lower abstraction layer» increases the informative content
ABSTRACTION GAP
Application Processor: an Example
SAMSUNG EXYNOS 7420 OCTA-CORE
A microprocessor does NOT ONLY consist of a single CORE (computation unit), but rather of a (more or less regular) network of cores.
ANALOGY
Once upon a time (~50 years ago) it was not like that…
Microprocessors used to be «Monolithic processor cores»
Hardware capable of executing arithmetic and logic instructions
Options of this hardware:
- Processing speed (or clock frequency)
- Instruction throughput
- Instruction-level parallelism
- Out-of-order execution capability
- Branch prediction strategy
- Memory hierarchy and access speed
- Virtual memory
- ...
The «system-on-chip (SoC)» revolution
Trend to integrate more and more functions (computation units, controllers, modems, memory macros, etc.) on the same silicon die, thus building up «Systems-on-Chip».
- lower power, better performance, smaller size
Today, all application processors are «systems-on-chip»
What slows down this trend: Technology, Cost, Reliability
The long-term ideal asymptotic trend consists of the «smartphone-on-a-chip»
«Integrated» Application Processors
Microprocessors in the «system-on-chip (SoC)» era
[Figure: hardware capable of executing arithmetic and logic instructions, integrated with memory, peripheral units and I/O]
SoC for wireless applications
Computation unit from a third-party vendor (implementation made by system integrator – SOFT MACRO)
Computation unit from a third-party vendor (layout defined by the vendor as well – HARD MACRO)
Data Memory and Instruction Memory (from third-party vendor). Their layout comes from the vendor as well – HARD MACRO
A typical «SoC» consists of pre-designed and pre-verified blocks, which can be made in-house or bought from «third-party» vendors (against the payment of royalties)
New terminology coming up:
- Platform-based design
- Design reuse
- System integration task
Systems-on-chip: a different way of designing systems... and of doing business!
«ARM» PROCESSORS
You may have heard that «most smartphone processors» are ARM processors...
... but when talking about application processors ARM was not mentioned: Qualcomm, MediaTek, Nvidia, HiSilicon, Apple, ...??!?!
Generic Application Processors with «ARM core inside».
ARM core
In turn, ARM processors are systems-on-chip...
Application processors are Systems-on-chip!
Evolution: the Memory Hierarchy
1st-level memory
2nd-level memory
Peripheral units and I/O
Hardware capable of executing arithmetic and logic instructions
Microprocessors in the «system-on-chip (SoC)» era
Memory – the Problem
We would all like knowledge to be accessible in a single book!
It follows from this the need to select every time... the book containing the needed information at any given point in time!
Working Set
• We may identify a set of books containing all the information that we normally need (except for specific cases!) = Working set.
• What are the «habits» of the microprocessor «reader», so as to build the working set?
The Problem
The microprocessor would like to have infinite memory with extremely fast access times... But this is not feasible in practice:
- Fast memories are small.
- Large memories are slow.
- The amount of memory that can stay on a chip is limited!
- Fast memories are also very expensive, so they have to be small.
Key Insight
The microprocessor has a distinctive feature: it tends to reuse data and instructions that have been accessed recently, or that are close to the recently-accessed ones.
TEMPORAL LOCALITY
Recently-accessed elements are likely to be accessed again soon
SPATIAL LOCALITY
When accessing an element, the elements nearby are likely to be accessed soon
Example
Spatial locality: access to the elements A[i] in sequence
Temporal locality: at each iteration, the «sum» instruction is used
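The example above can be sketched as a simple summation loop. The array contents are made up for illustration, and the list of visited indices only stands in for memory addresses.

```python
A = [3, 1, 4, 1, 5, 9, 2, 6]   # illustrative data, not from the slides

total = 0
visited = []                   # indices touched, standing in for addresses
for i in range(len(A)):
    total += A[i]              # the same «sum» instruction at every iteration
    visited.append(i)          # A[i] is accessed in sequence

# Distance between consecutive accesses: always 1, i.e. neighbouring elements.
strides = [b - a for a, b in zip(visited, visited[1:])]
print(total, strides)
```

The constant stride of 1 is the spatial locality on the data side; the loop body being executed again and again is the temporal locality on the instruction side.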
How to exploit this?
Memory Hierarchy
The approach is to store in a small and fast memory «close to» the processor (i.e., a cache) the data/instructions that are currently being accessed, in addition to the «nearby» ones
The levels closest to the processor offer low latency and high throughput, at the price of small size and high cost
By implementing the memory system as a «hierarchy», the microprocessor is given the illusion of having a memory as large as the last-level one and as fast as the first-level one
Working set
Whether the working set is «good» or not depends on the number of MISSes in the first-level memory
How does it work? Hit and Miss
I am searching for a data item/instruction. Do you have it?
- YES (HIT!)
- NO (MISS!) Transfer from the lower level
Registers
L1 Cache
L2 Cache
The philosophy is as follows: fast access to data/instructions that are most commonly used or which can be foreseen to be accessed in the near future. For the exceptions, a temporary performance slowdown has to be accounted for.
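The hit/miss mechanism can be sketched with a tiny cache model. This is an illustration under stated assumptions, not the organization of any specific processor: the cache is fully associative with least-recently-used replacement, and the names `simulate`, `num_lines` and `line_size` are hypothetical.

```python
from collections import OrderedDict

def simulate(addresses, num_lines=4, line_size=4):
    """Count hits and misses for a small fully associative LRU cache."""
    cache = OrderedDict()            # line tag -> None, kept in LRU order
    hits = misses = 0
    for addr in addresses:
        tag = addr // line_size      # a whole line of neighbours is fetched
        if tag in cache:
            hits += 1                # HIT: the line is already in the cache
            cache.move_to_end(tag)   # mark the line as most recently used
        else:
            misses += 1              # MISS: transfer from the lower level
            cache[tag] = None
            if len(cache) > num_lines:
                cache.popitem(last=False)   # evict the least recently used line
    return hits, misses

sequential = simulate(range(32))             # good spatial locality
strided = simulate(range(0, 128, 4))         # one access per line, no reuse
repeated = simulate(list(range(16)) * 2)     # working set fits in the cache
print(sequential, strided, repeated)
```

The sequential scan misses only on the first element of each line, the strided scan misses on every access, and the repeated scan pays its misses only during the first pass because its working set fits in the cache.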
Trend: increasing chip size!
Growing to Super-cores!
Goal: meet the growing user expectations for advanced software services
[Figure: as chip size grows, a core with 1st-level memory, 2nd-level memory and a peripheral unit and/or I/O port evolves into a next-generation processor core with 1st-, 2nd- and 3rd-level memory and several peripheral units and I/O]
- Integration of tens or hundreds of «cores»
- More memory levels or same levels with more memory
- Higher performance, lower cost, etc.
Opposite trend: chip size reduction under the pressure of technology scaling
Super-cores! 90 nm, 65 nm, 45 nm
Today we are heading below 14 nm process nodes!
Below 10 nm, fundamental physical issues come to the forefront
Trend stabilization: constant chip size
Microprocessor area has today stabilized (rough numbers provided):
- 140 mm² for desktop computers
- 260 mm² for high-performance computing (e.g., scientific computing)
- 70–100 mm² for “embedded” microprocessors
Area split into LOGIC, MEMORY AND INTEGRATION OVERHEAD
There are limiting factors for chip area:
- power consumption
- manufacturing cost
- chip-wide transmission delay
- design cost
Then, around 2000, an epoch-making paradigm shift occurred...
SOMETHING WENT WRONG!
Designers realize that at each new generation of microprocessors, the cost to achieve a predefined performance increase skyrockets (if it is achievable at all)
Pollack’s rule: at a given feature size (process node), a new microprocessor generation takes 2-3x the area of the old one, while the performance speedup is only 1.4-1.6x
DECREASING MARGINAL UTILIZATION
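A worked check of these numbers: Pollack's rule is commonly summarized as single-core performance growing roughly with the square root of the core area. The square-root form is that common approximation rather than something stated on the slide, and the function name is hypothetical.

```python
import math

def pollack_speedup(area_ratio):
    """Approximate speedup of a core whose area grows by `area_ratio`."""
    return math.sqrt(area_ratio)

for area_ratio in (2, 3):
    speedup = pollack_speedup(area_ratio)
    # Performance per unit area shrinks as the core grows.
    perf_per_area = speedup / area_ratio
    print(f"{area_ratio}x area -> {speedup:.2f}x performance, "
          f"{perf_per_area:.2f}x performance per area")
```

The approximation gives about 1.4x for twice the area and about 1.7x for three times the area, roughly in line with the range quoted above, while performance per unit of area keeps dropping.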
WHAT EXACTLY WENT WRONG?
1. Few applications can expose more than 2 parallel instructions/cycle
   - The main limiter of the instruction throughput is the presence of dependencies in the instruction flow, which microarchitecture/compiler designers cannot completely get rid of
2. Sometimes, although instruction parallelism is there, the compiler and/or the hardware are not able to extract it
   - E.g., potentially parallel instructions that are thousands of cycles apart
3. Memory access latencies limit the utilization rate of the processor
   - Memory latency cannot be completely hidden
4. Beyond a 150W power envelope, it is not economically convenient to cool down any more
5. Beyond a given clock frequency, there are concerns:
   - A well-formed clock pulse becomes challenging
   - Within the clock cycle, there is an inactive time that does not scale
6. Although the processor is fast in processing data, data communication is overly slow and costly!
   - The communication bottleneck becomes more severe as technology scales down!
Equivalent to:
A NEW ERA: MULTI-CORE COMPUTING
[Figure: a large core with 1st-level and 2nd-level memory compared with four small cores C1-C4; bar charts of power and performance]
Small core compared with the large core: Power = 1/4, Performance = 1/2
Multi-Core: power efficient + better power and thermal management
Computation parallelism represents a more efficient and scalable way of delivering computing performance and power management!
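The arithmetic behind this comparison can be sketched as follows. The per-core figures (1/4 of the power, 1/2 of the performance of the large core) come from the slide; the `parallel_fraction` parameter and the Amdahl-style treatment of the serial part are added assumptions, included to show that the gain depends on how parallel the software is.

```python
from fractions import Fraction

SMALL_CORE_POWER = Fraction(1, 4)   # relative to the large core (from the slide)
SMALL_CORE_PERF = Fraction(1, 2)    # relative to the large core (from the slide)

def multicore(n_cores, parallel_fraction=Fraction(1)):
    """Return (power, performance) of n small cores relative to one large core.

    Assumption: the serial part of the work runs on a single small core,
    the parallel part is spread evenly over all small cores.
    """
    power = n_cores * SMALL_CORE_POWER
    serial_time = (1 - parallel_fraction) / SMALL_CORE_PERF
    parallel_time = parallel_fraction / (n_cores * SMALL_CORE_PERF)
    return power, 1 / (serial_time + parallel_time)

print(multicore(4))                    # fully parallel workload
print(multicore(4, Fraction(1, 2)))    # only half of the work is parallel
```

With a fully parallel workload, four small cores deliver twice the performance of the large core in the same power budget; with a half-serial workload the same four cores fall below the large core, which is why matching software parallelism to the hardware matters.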
Multi-core computing
[Figure: number of cores per chip (1 to 512, doubling at each tick) versus year (1970 to 20??). Processors shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045]
Multi-Core Architectures
Microprocessors in the era of parallelism
[Figure: replicated hardware capable of executing arithmetic and logic instructions (eight cores, each with its own 1st-level memory), shared L2 memory, shared L3 memory, peripheral units and I/O]
Power Management
The processor is split into voltage and frequency islands
[Figure: cores with 1st-level memory, 3rd-level memory, peripheral units and/or I/O ports; some cores switched OFF, OR per-core activation]
Each core (or «cluster» of cores) can be operated at different voltage and frequency settings, or selectively powered off
Power Management
Case Study: an Industrial Research Prototype
Intel Single-Chip Cloud Computer (SCC)
- 48 cores structured into 2-core clusters
- 24 frequency islands
- 8 voltage islands
- 15 speed settings from 100 to 800 MHz
- 7 voltage levels from 0.7V to 1.3V in steps of 0.1V
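To see why voltage and frequency islands matter, here is a rough sketch based on the standard first-order model of CMOS dynamic power, proportional to V² · f. This model is textbook background rather than something stated in the slides, and pairing the lowest voltage with the lowest frequency is an illustrative assumption, not a documented SCC operating point.

```python
# The 7 voltage levels quoted for the SCC: 0.7 V to 1.3 V in steps of 0.1 V.
VOLTAGES = [round(0.7 + 0.1 * i, 1) for i in range(7)]

def relative_dynamic_power(v, f_mhz, v_ref=1.3, f_ref_mhz=800.0):
    """Dynamic power relative to the (v_ref, f_ref_mhz) setting.

    First-order model: P_dyn is proportional to V^2 * f (leakage ignored).
    """
    return (v / v_ref) ** 2 * (f_mhz / f_ref_mhz)

print(VOLTAGES)
print(f"{relative_dynamic_power(0.7, 100.0):.3f}")   # slowest, lowest-voltage point
```

Under these assumptions the lowest setting draws only a few percent of the dynamic power of the highest one, which is what makes per-island scaling attractive when a cluster has little work to do.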
From Multicores to Manycores
There are applications where the hardware parallelism is perfectly matched to the software parallelism (this is not always the case!!)
- E.g., graphics
Single instruction («sum with 6») applied to Multiple Data (SIMD – Single Instruction Multiple Data)
Implementation
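The «sum with 6» example can be sketched as one operation applied to every element of a data set. Plain Python runs the lanes one after the other, so this only models the semantics of SIMD; real SIMD hardware processes the lanes at the same time. The function name and data values are illustrative.

```python
def simd_add(lanes, scalar):
    """Apply the same «sum with scalar» instruction to every data lane."""
    return [element + scalar for element in lanes]

pixels = [1, 2, 3, 4]        # illustrative data, e.g. four pixel values
print(simd_add(pixels, 6))   # one instruction, multiple data
```

Graphics fits this model well because the same operation is typically applied independently to a very large number of pixels or vertices.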
Graphics Processing Unit
Server of/with GPUs
Optimized for SIMD Workloads
NVIDIA TITAN V GPU
- 5120 cores
- 640 additional cores to support artificial intelligence
- Another 320 cores («texture» units)
- 21.1 billion transistors
- Maximum clock frequency: 1.5 GHz
- 12 GB of memory
- 12 nm technology
- 110 TFLOPS of compute performance
- Target applications: deep learning, supercomputing, financial services, high-end gaming, big data applications
- TDP: 250 W
- Price: roughly 3000 dollars
LATEST NEWS FROM THE WORLD OF MICROPROCESSORS
Heterogeneity
[Figure: System-on-Chip combining a low-performance, low-power multicore and a high-performance, high-power multicore (each core with its own 1st-level memory); the operating system scheduler switches the two clusters ON/OFF]
«big.LITTLE» ARM architecture: combination of high-end ARM A57 with low-end ARM A53
SAMSUNG EXYNOS 7420 OCTA-CORE
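The role of the operating system scheduler in such a heterogeneous system can be sketched with a toy policy. Everything here is a hypothetical illustration: the function name, the load threshold and the rule that exactly one cluster is ON at a time are assumptions, and real schedulers are considerably more elaborate.

```python
def plan_clusters(cpu_load, threshold=0.5):
    """Return the ON/OFF state of the two clusters for a load in [0, 1].

    Toy policy (assumption): light loads run on the low-power cluster,
    heavy loads switch to the high-performance cluster.
    """
    if not 0.0 <= cpu_load <= 1.0:
        raise ValueError("cpu_load must be between 0 and 1")
    use_big = cpu_load > threshold
    return {
        "big": "ON" if use_big else "OFF",
        "LITTLE": "OFF" if use_big else "ON",
    }

print(plan_clusters(0.2))   # e.g. background activity
print(plan_clusters(0.9))   # e.g. a demanding application
```

The point of the sketch is only that the choice is made in software at run time, so that the power-hungry cores are used when the workload actually needs them.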
The Accelerator Store
[Figure: System-on-Chip in which the host processor (a low-performance, low-power multicore plus a high-performance, high-power multicore, each core with its own 1st-level memory) is complemented by hardware accelerators]
Hardware accelerators:
- Image processing
- Video playback
- FFT
- Neuromorphic accelerator
- Graphics (embedded GPU)
Re-programmable accelerators
Specialized accelerators
Example: Huawei Kirin 970
- «big.LITTLE» host processor: 4-core A73 (2.4 GHz), 4-core A53 (1.8 GHz)
- Embedded GPU with 12 cores
- 5.5 billion transistors
- Area: 1 cm²
- 1 accelerator for machine learning (25x performance, 50x energy efficiency)
- 2005 classified images/minute (Samsung Galaxy S8: 95, iPhone 7 Plus: 487)
THE FUTURE
Deep Learning
Hardware for Deep Learning
UNIFE is playing the game!
Collaboration with Fabrizio Riguzzi, Department of Informatics
Dynamically Reconfigurable DNN
Courtesy of Rice University
Brain-inspired Computing
Courtesy of D.Querlioz
Brain-Inspired Computing: Why?
Courtesy of D.Querlioz
We are playing the game right now!
UNIFE is playing the game
Collaboration with Michele Favalli, Columbia University and AMD
Enabling Asynchronous Interconnect Technology
Integrated Photonics for Communication
- Laser source, (typically) off-chip
- Optical signal carried to the chip via optical fiber
- Tapered input
- Silicon waveguide
- Optical OOK modulation
- Photonic switching
- Photodetector
- Transimpedance amplifier
- Digital comparator
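The signal chain listed for the optical link can be sketched end to end: on-off keying (OOK) maps each bit to light on or off, the photodetector turns received light into a current, the transimpedance amplifier turns that current into a voltage, and the comparator decides between «0» and «1». All numeric values (power levels, responsivity, gain, threshold, attenuation) are arbitrary illustrative units, not measurements, and the function names are hypothetical.

```python
def ook_modulate(bits, power_on=1.0, power_off=0.0):
    """OOK: light on for a «1», light off for a «0» (arbitrary power units)."""
    return [power_on if bit else power_off for bit in bits]

def receive(optical_powers, responsivity=0.8, tia_gain=1000.0, threshold=400.0):
    """Recover bits from received optical power levels."""
    bits = []
    for power in optical_powers:
        current = responsivity * power     # photodetector: light -> current
        voltage = tia_gain * current       # transimpedance amplifier: current -> voltage
        bits.append(1 if voltage > threshold else 0)   # digital comparator
    return bits

message = [1, 0, 0, 0, 1, 1, 0, 0]
attenuated = [0.6 * p for p in ook_modulate(message)]  # assumed loss in the waveguide
print(receive(attenuated))
```

With these assumed values the message survives a 40% loss of optical power; a stronger attenuation would push the «1» level below the comparator threshold, which is why the losses along the optical path matter.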
3-D Integration
UNIFE is playing the game
[Figure: automatic topology synthesis framework for optical rings; full-custom design versus automatic place & route tool; irregular pattern versus regular pattern]
Collaboration with Maddalena Nonato, Marco Gavanelli and TU Munich (Germany)
UNIFE is playing the game
Fabrication of photonic integrated circuits and measurements on them
Collaboration with Gaetano Bellanca and Inphotec Pisa
UNIFE is playing the game
[Figure: transmitter (TX) and receiver (RX) interfaces of an optical network-on-chip (ONoC), in a fully CMOS version (CMOS 40nm) and a hybrid version (CMOS 40nm plus ECL 130nm). Blocks shown: DC-FIFOs, arbiters, credit counters, VC decoder, MUX/DEMUX stages, 32x1 binary tree serializer, 1x32 binary tree deserializer, PLL with divided clocks (f1/2, f1/4, f1/8, f1/16), drivers, photodetectors (PD) and transimpedance amplifiers (TIA)]
HOW TO DRIVE AN OPTICAL NETWORK?
Collaboration with IHP Microelectronics (Germany)