-
SoCrates
- A Scalable Multiprocessor System On Chip
Authors
Mikael Collin, Mladen Nikitovic, and Raimo Haukilahti
fmci,mnc,[email protected]
Supervisors
Johan Stärner and Joakim Adomat
Examiner
Lennart Lindh
Department of Computer Engineering
Computer Architecture Lab
Mälardalen University
Box 883, 721 23 Västerås
Abstract
This document is the result of a Master's thesis in Computer Engineering, describing the analysis,
specification, and implementation of the first prototype of SoCrates, a configurable, scalable, and
predictable System-on-Chip multiprocessor platform for real-time applications. The design time of a
System-on-a-Chip (SoC) is rapidly increasing today, due to high complexity and a lack of efficient
tools for development and verification. By combining all functions on one chip, the system becomes
smaller, faster, and less power consuming, but complexity increases. To decrease time-to-market,
SoCs are entirely or partially built from IP-components. Thanks to SoC, a whole new domain of
products, such as small hand-held devices, has emerged. The concept has been around for a few years
now, but there are still challenges that need to be resolved. There is a lack of standards enabling
fast mix and match of cores from different vendors. Further needs are new design methods, tools, and
verification techniques. SoC solutions need special kinds of CPUs that consume less power and are
cheaper and smaller, yet still meet high performance requirements. To fulfill all these demands, they
are getting more and more complex as the number of transistors grows rapidly, which has led to the
emergence of multiprocessor systems-on-a-chip. Our initial question is whether it is possible to
build these complex multiprocessor systems on a single FPGA and whether such solutions can lead to
shorter time-to-market. Consumer demand for cheaper and smaller products makes FPGA solutions
interesting. Our approach is to have multiple processing nodes, each containing a processing unit,
memory, and a network interface, all connected on a shared bus. A central in-house developed
hardware real-time unit handles scheduling and synchronization. We have designed and implemented
an MSoC that fits on a single FPGA in only 40 days, which to our supervisors' knowledge has not been
accomplished before. Our experience is that a tightly coupled group can produce results fast, since
information, new ideas, and bug reports propagate immediately.
SoCrates stands for SoC for Real-Time Systems
-
Introduction
This report describes the design of the first prototype of SoCrates, a generic, scalable platform
generator which creates a synthesizable HDL description of a multiprocessor system. The goal was to
build a predictable multiprocessor system on a single FPGA with mechanisms for prefetching data and
an in-house developed integrated hardware real-time unit.
The report consists of three parts. The first part, Computer Architecture for System on Chip, is a
state-of-the-art report introducing basic SoC terminology and practice, with a deeper analysis of
CPUs, interconnects, and memory hierarchies. The purpose of this analysis was to learn about
state-of-the-art techniques for designing complex multiprocessor SoCs. The design process resulted in
part two, SoCrates - Specifications, which describes the prototype, the functionality of all
individual parts, and their specific demands. Part three, SoCrates - Implementation Details,
describes the implementation of all parts, how to configure the system, and how to compile and link
the system software. We also present synthesis results and suggest future work that can be done to
improve the system.
-
SoCrates
- Document index
Document 1 Computer Architecture for System on Chip - A State of the Art Report
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2. Embedded CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3. Interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . .27
4. Memory System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Document 2 SoCrates Specifications
1. System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . .1
2. CPU Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3. CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7
4. Network Interface . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5. IO Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
6. Interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
7. Arbitration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
8. Boot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .29
9. Memory Wrapper . . . . . . . . . . . . . . . . . . . . . . . . . . . .30
Document 3 SoCrates - Implementation Details
1. CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1
2. Network Interface . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3. Arbiter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4. Compiling & Linking the System Software . . . . . . . . . . . . . . . 31
5. Configuring the SoCrates Platform . . . . . . . . . . . . . . . . . . 35
6. Current Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7. Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Document 4 Appendix
1. Demo Application . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2. I/O Routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
4. Task switch routines . . . . . . . . . . . . . . . . . . . . . . . . . 6
5. Linker scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
6. DATE 2001 Conference, Designers Forum, publication . . . . . . . . . .12
-
Computer Architecture for System on Chip
- A State of the Art Report
Revision: 1.0
Authors
Mikael Collin, Mladen Nikitovic, and Raimo Haukilahti
fmci,mnc,[email protected]
Supervisors
Johan Stärner and Joakim Adomat
Department of Computer Engineering
Computer Architecture Lab
Mälardalen University
Box 883, 721 23 Västerås
May 20, 2000
Abstract
This state-of-the-art report introduces basic SoC terminology and practice, with deeper analysis
of three architectural components: the CPU, the interconnect, and the memory hierarchy. A short
historical view is presented before going into today's trends in SoC architecture and development.
The SoC concept is not new, but there are challenges that have to be met to satisfy customer demands
for faster, smaller, cheaper, and less power-consuming products, today and in the future. This
document is the first of three documents that form a Master's thesis in Computer Engineering.
-
CONTENTS II
Contents
1 Introduction 1
1.1 What is SoC? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 SoC Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2.1 Intellectual Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2.2 An Example of a SoC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Why SoC? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.2 State of Practice and Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Introduction to Computer System Architecture . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4.1 Computer System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Research & Design Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5.1 Hydra: A next generation microarchitecture . . . . . . . . . . . . . . . . . . . . . . 6
1.5.2 Self-Test in Embedded Systems (STES) . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5.3 Socware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5.4 The Pittsburgh Digital Greenhouse . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5.5 Cadence SoC Design Centre . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Embedded CPU 8
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 The Building Blocks of an Embedded CPU . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Register File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Arithmetic Logic Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Control Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.4 Memory Management Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.5 Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.6 Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 The Microprocessor Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Design Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.1 Code Density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.2 Power Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.4 Predictability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Implementation Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6 State of Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6.1 ARM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6.2 Motorola . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.3 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.4 Patriot Scientific . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.5 AMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.6 Hitachi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.7 Intel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6.8 PowerPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6.9 Sparc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.7 Improving Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.7.1 Multiple-issue Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.7.2 Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7.3 Simultaneous Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7.4 Chip Multiprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7.5 Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.8 Measuring Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
-
2.8.1 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.8.2 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.9 Trends and Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.9.1 University . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.9.2 Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3 Interconnect 27
3.1 Introduction and basic definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Bus based architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 Arbitration mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2 Synchronous versus asynchronous buses . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.3 Performance metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.4 Pipelining and split transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.5 Direct Memory Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.6 Bus hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.7 Connecting multiprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Case studies of bus standards with multiprocessor support . . . . . . . . . . . . . . . . . . 31
3.3.1 FutureBus+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.2 VME . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.3 PCI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Point-to-point interconnections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.1 Interconnection topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Interconnect performance & scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5.1 Performance measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5.2 Shared buses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5.3 Point-to-point architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6 Interconnecting components in a SoC design . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.6.1 VSIA efforts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.6.2 Differences between standard and SoC interconnects . . . . . . . . . . . . . . . . . 36
3.7 Case studies of existing SoC-Interconnects . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.7.1 AMBA 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.7.2 CoreConnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.7.3 CoreFrame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.7.4 FPIbus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7.5 FISPbus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7.6 IPBus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7.7 MARBLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7.8 PI-Bus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.7.9 SiliconBackplane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.7.10 WISHBONE Interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.7.11 Motorola Unified Peripheral Bus . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.8 Case studies of SoC multiprocessor interconnects . . . . . . . . . . . . . . . . . . . . . . . 43
3.8.1 Hydra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.8.2 Silicon Magic's DVine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4 Memory System 45
4.1 Semiconductor memories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.1 ROM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.2 RAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Memory hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3 Cache memories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.1 Cache: the general case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3.2 The nature of cache misses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
-
4.3.3 Storage strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.4 Replacement policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3.5 Read policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3.6 Write policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3.7 Improve cache performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4 MMU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5 Multiprocessors architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5.1 Symmetric Multiprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5.2 Distributed memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5.3 COMA Cache Only Memory Access . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5.4 Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5.5 Coherence through bus-snooping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5.6 Directory-based coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.6 Hardware-driven prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.6.1 One-Block-Lookahead: (OBL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.6.2 Stream buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.6.3 Filter buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.6.4 Opcode-driven cache prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.6.5 Reference Prediction Table: (RPT) . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.6.6 Data preloading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.6.7 Prefetching in multiprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5 Summary 68
-
1 Introduction
This State of the Art Report covers computer architecture topics with an emphasis on System on Chip (SoC).
The reader is introduced to the basic ideas behind SoC and general computer architecture concepts before
an in-depth analysis of three important SoC components: the CPU, the interconnect, and the memory
architecture.
1.1 What is SoC?
SoC stands for System-on-Chip and is a term for putting a complete system on a single piece of silicon.
SoC has become a very popular word in the computer industry, but very few agree on a general definition
of SoC [19]. There are several alternative names for putting a system on a chip, such as system on
silicon, system-on-a-chip, system-LSI, system-ASIC, and system-level integration (SLI) device [33]. Some
might say a large design automatically makes it a SoC, but that would probably include every existing
design today. A better approach would be to say that a SoC should include different structures such as
a CPU-core, embedded memory, and peripheral cores. This is still a wide definition, which could imply
that any modern processor with an on-chip cache should be included in the SoC community. Therefore
a more suitable definition of SoC would be:
A complete system on a single piece of silicon, consisting of several types of modules including at least one
processing unit designated for software execution, where the system depends on no or very few external
components in order to execute its task.
1.2 SoC Designs
In the beginning, almost all SoCs were simply integrations of existing board-level designs [20]. This
way of designing a system loses many benefits that could otherwise be gained if the system were designed
from scratch. Another approach is to use already existing modules, called IP-components, and to
integrate them into a complete system suitable for a single die.
1.2.1 Intellectual Property
When something is protected through patents, copyrights, trademarks, or trade secrets it is considered
Intellectual Property (IP). Only patents and copyrights are relevant for IP-components [13] (also referred
to as macros, cores, and Virtual Components (VC) [10]). An IP-component is a pre-implemented, reusable
module, for example a DMA-controller or a CPU-core. There are several companies that make their
living by building, licensing, and selling IP-components, for which the semiconductor companies pay both
fees and royalties¹. There exist three classes of IP-components with different properties regarding
portability and protection characteristics. As portability decreases through the classes, protection
increases.
Soft This class of IP-components has its architecture specified at Register-Transfer Level (RTL),
which is synthesizable. Soft IPs are functionally validated and are very portable and modifiable.
Since they are not mapped to a specific technology, their behavior with respect to area, speed, and
power consumption is unpredictable. Much work still needs to be done before the component
can be utilized, and the end result depends on the synthesis tools used.
Firm The firm-class components are in general soft components that have been floorplanned and synthe-
sized into one or several different technologies to get better estimates of area, speed, and power
consumption.
¹ There are exceptions where one can acquire IP-components without any licensing or royalty fees. More information can
be found at http://www.openip.org/.
-
Hard Hard IPs are a further refinement of firm components. They are fully synthesized to mask level
and physically validated. Very little work has to be done in order to implement the functionality
in silicon. Hard IPs are neither modifiable nor portable, but the prediction of their area, speed, and
power consumption is very accurate.
1.2.2 An Example of a SoC
A typical SoC consists of a CPU-core, a Digital Signal Processor (DSP), some embedded memory, and a
few peripherals such as DMA, I/O, etc. (Figure 1). The CPU could perform several tasks with the assis-
tance of a DSP when needed. The DSP is usually responsible for off-loading the CPU by doing numerical
calculation on the incoming signals from the A/D-converter. The SoC could be built of only third-party
IP-components, or it could be a mixture of IP-components and custom-made solutions. More recently,
there have been efforts to implement a Multiprocessor System on Chip (MSoC) [6], which introduces new
challenges regarding cost, performance, and predictability.
Figure 1: An example of a SoC
1.3 Why SoC?
The first computer systems, consisting of relays and later vacuum tubes, used to occupy whole rooms, and
their performance was negligible compared to today's standard workstations. The advent of the transistor
in 1948 enabled engineers to shrink a functional block into an Integrated Circuit (IC). These ICs made it
possible to build complex functions by combining several ICs on a circuit board. Further development
of process technology increased the number of transistors on each IC, which led to the emergence of
systems on board. Since then, there has been a constant battle between semiconductor companies to
deliver the fastest, smallest, and cheapest products, resulting in today's multi-billion dollar industry.
Even though the SoC concept has been around for quite some time, it has not really been fully feasible
until recent years, due to advances like deep sub-micron CMOS process technology.
1.3.1 Motivation
There are several reasons why SoC is an attractive way to implement a system. Today's refined manu-
facturing processes make it possible to combine both logic and memory on a single die, thus decreasing
overall memory access times. Given that the application's memory requirement is small enough for the
on-chip embedded memory, memory latency will be reduced due to the elimination of data traffic between
separate chips. Since there is no need to access memory on external chips, the number of pins can
also be reduced, and the use of on-board buses becomes obsolete. Encapsulation accounts for over 50% of
the overall process cost of chip manufacturing [15]. In comparison to an ordinary system-on-board, a
SoC uses one or very few ICs, reducing total encapsulation cost and thereby total manufacturing cost.
These characteristics, as well as lower power consumption and shorter time-to-market, enable smaller,
better, and cheaper products reaching the consumers at an altogether faster rate.
-
1.3.2 State of Practice and Trends
Until now, much of SoC implementation has been about shrinking existing board-level systems onto a
single chip, with little or no consideration of all the benefits that could be gained from a chip-level
design. Another approach to SoC is to interconnect several dies and place them inside one package. Such
modules are called Multi Chip Modules (MCM). The Hydra Multiprocessor Project was at first implemented
as an MCM, which later evolved into a SoC [14, 6].
Today it is too time-consuming for companies to implement a system from scratch. Instead, a faster and
more reliable way is to use in-house or third-party pre-implemented IP-components [3], which makes
designing a whole system more about integrating components than designing them. There exist three
design methodologies, each with its own efficiency and cost regarding SoC design [16, 18]. The vendor
design approach, which shifts the design responsibilities from the system designers to the ASIC vendors,
can result in the lowest die cost, but it can also lead to higher engineering costs and longer
time-to-market. A more flexible method is the partial integration approach, which divides the
responsibilities of the design more equally. It lets the system designers produce the ASIC design, while
the semiconductor vendors are responsible for the core and the integration. This method gives the system
designers more control of the working process in comparison to the vendor method. Yet more flexible is
the desktop approach, which leaves the semiconductor vendors only to design the core. This reduces
time-to-market and requires low engineering costs. A key property for IP-components in the future is
parameterization of soft cores [16].
There is continuous growth in the demand for "smart products" that are expected to make our lives
better and simpler. Recently, SoC products have begun to emerge on several markets in the form of
Application Specific Standard Products (ASSP)² or Application Specific Instruction-set Processors
(ASIP)³:
Set-top boxes A Set-Top Box (STB) is a device that makes it possible for television viewers to access
the Internet and also watch digital television (D-TV) broadcasts. The user has access to several
services: weather and traffic updates, on-line shopping, sport statistics, news, e-commerce, etc.
Integrating the STB's different components into a SoC simplifies system design and yields a more
competitive product, with shorter time-to-market, lower cost, and lower power consumption.
The Geode SC1400 chip is an example of a SoC used in an STB that meets the demands of delivering
both high-quality DVD video and Internet accessibility [34].
Cell phones A SoC in a cell phone will reduce its size and weight and make it cheaper and less
power-consuming.
Home automation Many domestic appliances at home will be "smarter". For example, the refrigerator
will be able to notify its owner when a product is missing and place an order on the Internet.
Hand-held devices A new generation of hand-held devices is coming that can send and receive e-mail
and faxes, make calls, surf the Web, etc. A SoC solution is especially suited for portable applications
such as hand-held PCs, digital cameras, personal digital assistants, and other hand-held devices,
because its built-in peripheral functions minimize overall system cost and eliminate the need to
use and configure additional components.
1.3.3 Challenges
One of the emerging challenges is to standardize the interfaces of IP-components to make integration and
composition easier. Many different on-chip bus standards have been created by the different design houses
to enable fast integration of IP-components, which has resulted in incompatibility caused by the
differing interfaces. To solve this dilemma, the Virtual Socket Interface Alliance (VSIA) was founded to
enable the mix and match of IP-components from multiple sources by proposing a hierarchical solution
that enables multiple buses [17]. Still, some criticize VSIA for only addressing simple data flows [11].
More can be read about different on-chip bus standards in section 3.7.
² High-integration chip or chipset for a specific application [59].
³ A field- or mask-programmable processor whose architecture and instruction set are optimized for a specific
application domain [58].
-
Since time-to-market is decreasing, the testing and verification of a SoC must be done very fast.
When reusing IP-components, it is possible that test development actually takes longer than the
work to integrate the different functional parts [12]. The fact that the components come from different
sources and may have different test methodologies complicates testing of the whole system. In board-level
designs many of the components have their primary inputs visible, which makes testing easier, but
SoCs contain deeply embedded components where there is little or no possibility to observe signals
directly from an IP-component after manufacturing. Since the on-chip interconnect is inside the chip,
it is also hard to test due to the lack of observability.
As the future approaches, integration is not likely to stop with IP-components and
different memory technologies; we are also likely to see a variety of analog functionality. Analog blocks
are very layout- and process-dependent, require isolation, and utilize different voltages and grounds. All
these facts make them difficult to integrate in a design [10]. Are there limits to the integration
urge? As process technologies become more sophisticated, transistor switching speed will increase
and the voltage for logic levels will decrease. Dropping the voltages will make the units more sensitive
to noise. Analog devices with higher voltage needs can encounter problems working properly in such
environments [17].
Apart from the lack of effective design and test methodologies [29] and all the technical problems with
mapping a complex design consisting of several IP-components from different design houses onto a partic-
ular silicon platform, there are complex business issues dealing with licensing fees and royalty payments
[30].
1.4 Introduction to Computer System Architecture
SoC is about putting a whole system on a single piece of silicon. But what is a system? This section
serves as an introduction to computer system architecture and tries to give the reader a better
understanding of what is actually put onto a SoC.
1.4.1 Computer System
In general, a typical computer system (Figure 2) consists of one or more CPUs that execute a program by
fetching instructions and data from memory. To be able to access the memory, the CPU needs some kind
of interface and a connection to it. The interface is usually provided by the Memory Management Unit
(MMU), and the connection is handled by the interconnect. The local interconnect is often implemented
as a bus consisting of a number of electrical wires. Sometimes the CPU needs assistance in fetching large
amounts of data in order to be effective. This work can be done in parallel with the CPU by the Direct
Memory Access (DMA) component. The system also needs some means to communicate with the outside
world; this is provided by the I/O system. We proceed with a closer look at the important components
that comprise a computer system.
CPU The CPU is where the arithmetic, logic, branching and data transfer are implemented [8]. It consists
of registers, an Arithmetic Logic Unit (ALU) for computations, and a control unit. The CPU can
be classified as a Complex Instruction Set Computer (CISC) if the instruction set is complex (e.g.
it has a lot of instructions, several addressing modes, different word lengths on instructions, etc). The
idea behind a Reduced Instruction Set Computer (RISC) is to make use of a limited instruction set
to maximize performance on common instructions by working with a lot of registers, while taking
penalties on the load and store instructions. RISC has a uniform length on all instructions and
very few addressing modes. This uniformity is the main reason why this approach is suitable for
instruction pipelining, in order to increase performance. There are other architectures that further
increase performance, for example superscalar, VLIW, and vector computers. A machine is called an
n-bit machine if it operates internally on n-bit data [8]. Today a lot of the embedded processors
still work with 8- or 16-bit words while the majority of workstations and PCs are 32- or 64-bit
machines.
Figure 2: A typical computer system (CPU with cache, main memory, and a DMA controller with its
DMA device, connected by the address and data lines of the system bus)
Memory System The key to an effective computer system is to implement an efficient memory hierarchy,
since the latency of memory accesses has become a significant factor when executing a memory-accessing
instruction. In the last decade the gap between memory and CPU speed has been
growing. Memory sub-systems must be built to overcome the memory's shortcomings relative to the
processor, which otherwise result in computational time wasted waiting for memory operations to be
completed.
The memory is often organized in a primary and a secondary part, where the primary memory
usually consists of one or several RAM circuits. The secondary part is for long-term storage like Hard
Disc Drives (HDDs), disks, tapes, optical media, and sometimes FLASH memories. To bridge
the gap between the memory and CPU, nearly all modern processors have a cache that makes use
of the inherent locality of data and code, and that ideally can deliver data to the CPU in just one clock
cycle. Usually there exist several levels of cache between the main memory and CPU, each with
different sizes and optimizations. The memory system interfaces to the outside world (e.g. processor
and I/O) via the MMU, which has the responsibility to translate addresses and to fetch data from
memory via the memory bus. In multiprocessor systems there is the issue of whether the memory
should be local on every node or global.
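The trade-off a memory hierarchy makes can be summarized by the average memory access time, AMAT = hit time + miss rate x miss penalty. The following is a minimal sketch with illustrative numbers, not measurements from any system described here:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles, for a single cache level."""
    return hit_time + miss_rate * miss_penalty

# A 1-cycle cache with a 5% miss rate in front of a 60-cycle main memory:
print(amat(1, 0.05, 60))  # 4.0 cycles on average, versus 60 cycles without a cache
```

Even a modest hit rate thus hides most of the memory latency, which is one reason nearly all modern processors include at least one level of cache.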
Interconnect A computer system's internal components need to communicate in order to perform their
task. To make communication possible between components, an interconnect is usually used. An
interconnect can be designed in a variety of ways, called topologies. Examples of topologies are bus,
switch, crossbar, mesh, hypercube, torus, etc. Each topology has its own characteristics concerning
latency, scalability and performance.
I/O System The Input/Output system is the computer system's interface to the outside world, which
enables it to receive input and output results to the user. Examples of I/O devices include HDDs,
graphics and sound systems, serial ports, parallel ports, keyboard, mouse, etc. The transferring of
I/O data is usually taken care of by the DMA component to off-load the CPU from constant data
transfer.
1.5 Research & Design Clusters
There is a lot of research eort done on computer architecture, which of course is related in some degree
SoCs, since they all are actually computers. Unlike most research areas, SoC research is lead by the
industry and not by the universities. Of those universities that have SoC related research projects, very
few have reached the implementation stage.
1.5.1 Hydra: A next generation microarchitecture
The Stanford Hydra single-chip multiprocessor [6] started out as a MultiChip Module (MCM) in 1994
but evolved in 1997 to become a Chip MultiProcessor (CMP). The project is supervised by Associate
Professor Kunle Olukotun accompanied by Associate Professor Monica S. Lam and Mark Horowitz; also
incorporated in the project are a dozen students. Early development of the project was performed by
Basem A. Nayfeh, nowadays a Ph.D. The Hydra project focuses on combining shared-cache multiprocessor
architectures, innovative synchronization mechanisms, advanced integrated circuit technology and
parallelizing compiler technology to produce microprocessor cost/performance and parallel processor
programmability. The four integrated MIPS-based processors will demonstrate that it is feasible for a
multiprocessor to gain better cost/performance than achieved in wide superscalar architectures for
sequential applications. By using MCM, communication bandwidth and latency will be improved, resulting
in better parallelism. This makes Hydra a good platform to exploit fine-grained parallelism, hence a
parallelizing compiler for extracting this sort of parallelism is under development. The project is financed
by US Defense Advanced Research Projects Agency (DARPA) contracts DABT and MDA.
1.5.2 Self-Test in Embedded Systems (STES)
STES is a co-operational project between ESLAB, the Laboratory for Dependable Computing of Chalmers
University of Technology, the Electronic Design for Production Laboratory of Jonkoping University, the
Ericsson CadLab Research Center, FFV Test Systems, and SAAB Combitech Electronics AB. ESLAB
is responsible for developing a self-test strategy for system-level testing of complex embedded systems,
which utilizes the BIST (Built-In Self-Test) functionality at the device, board, and MCM level. Apart
from the involved commercial participants, the project is funded by NUTEK.
1.5.3 Socware
An international Swedish design center/cluster has recently been established that will be in close
cooperation with the technical universities in Linkoping/Norrkoping, Lund and Stockholm/Kista.
Socware, formerly known as the Acreo System Level Integration Center (SLIC), aims to have nearly 40
employees/specialists in the beginning, but this number is expected to grow to 1500 in the near future, with a
special research institute located in Norrkoping. The design center will serve as a bridge between
the industry and the research activities of the universities, enabling research results to be rapidly converted into
industrial products.
The focus of research and development will be directed to the design of system components within digital
media technology; initially, special focus will be on applications in mobile radio and broadband networking.
The project is financed by the government, the municipality of Norrkoping and other local and regional
agencies. More information can be found in [35].
1.5.4 The Pittsburgh Digital Greenhouse
The Pittsburgh Digital Greenhouse is a SoC design cluster that focuses on the digital video and networking
markets. The non-profit organization is an initiative taken by the U.S. government, universities, and
industry that started in June 1999. It involves Carnegie Mellon University, Penn State University, the
University of Pittsburgh, and several industry members like Sony, Oki, and Cadence.
Some ongoing research activities closely related to SoC are:
Configurable System on a Chip Design Methodologies with a Focus on Network Switching
This project focuses on the development of design tools for hardware/software co-design, such as those
required for next-generation switches on the Internet and cryptography.
Architecture and Compiler Power Issues in System on a Chip
The program is focused on creating a software system that characterizes the power of the major
components of a SoC design and allows the design to be optimized for the lowest possible power
consumption.
MediaWorm: A Single Chip Router Architecture with Quality of Service Support
The research focuses on the design, fabrication, and testing of a new high-performance switched
network router, called MediaWorm. It is aimed to be used in computer clusters where there are
demands on Quality of Service (QoS) level guarantees.
Lightweight Arithmetic IP: Customizable Computational Cores for Mobile Multimedia Appliances
Focus is on the complexity of multimedia algorithms and the development of mathematical software
functions that provide the required level of computational performance at lower power levels.
The long-range goal is to have SoC in a wide range of next-generation products, from "smart homes"
to hand-held devices that allow the user to surf the web, send faxes and receive e-mail. Further
goals are to provide venture capital, training and education, and to assist start-up companies that use the
research results and pre-designed chips created by the Digital Greenhouse in their products. More
information can be found at [37].
1.5.5 Cadence SoC Design Centre
In February 1998, there was an opening of the Cadence Design Centre with the purpose of creating one of
the electronic industry's largest and most advanced SoC design facilities. The centre is located on the
Alba Campus in Livingston, Scotland and is the largest European design centre. The centre offers expertise
within the spheres of Digital IC, Multimedia/Wireless, Analogue/Mixed Signal, Datacom/Telecom,
Silicon Technology Services, and Methodology Services. In 1999, the centre became authorized as the
first ARM approved design centre, through the ARM Technology Access Program (ATAP). Current
research projects conducted at the centre involve a single-chip processor for Internet telephony and audio,
a flexible receiver chip suitable for, among other things, pinpointing location by picking up high-frequency
radio waves transmitted by GPS satellites, and a fully customized wireless Local Area Network (LAN)
environment. There are three main parts of the centre: the Virtual Component Exchange (VCX), the
Institute for System Level Integration (SLI) and the Alba Centre. VCX opened in 1998 and is an
institution dedicated to establishing a structured framework of rules and regulations for inter-company
licensing of IP blocks. Members of VCX include ARM, Motorola, Toshiba, Hitachi, Mentor Graphics,
and Cadence. The SLI institute is an educational institution dedicated to system-level integration and
research. The institute was established by four of Scotland's leading universities: Edinburgh, Glasgow,
Heriot Watt and Strathclyde. Finally, the Alba Centre is the headquarters of the whole initiative and
provides a central point for information about the venture and assistance for interested firms.
2 Embedded CPU
There are several different interpretations of the term CPU. Some say it is "the brains of the computer"
or "where most calculations take place", and that it "acts as an information nerve center for a computer".
A more concrete definition is given by John L. Hennessy and David A. Patterson [8]:
Where arithmetic, logic, branching, and data transfer are implemented.
This chapter serves as an introduction to CPUs that are especially suitable for SoC solutions, namely
embedded CPUs. In this case, the term "embedded" does not only refer to how suitable these CPUs are
for embedded systems, or as stand-alone microprocessors, but also to how they are good candidates to be
"embedded" into a SoC. The purpose of this chapter is to look at the possibilities of embedded processors
as SoC components and what aspects need to be considered when designing and implementing a solution.
Techniques for improving and measuring performance are discussed, as well as where the research is today,
together with a look at the future of embedded processors.
The chapter begins with an introduction to embedded CPUs that explains some of the factors behind
their popularity. Section 2.2 is a presentation of the building blocks of a modern embedded CPU. Section
2.4 discusses the major factors affecting the design. Section 2.3 looks at which paradigm is currently in
front regarding embedded CPUs. Section 4.3.7 considers options on how to implement an embedded CPU.
Section 2.6 presents case studies of embedded CPUs available in the market today. Section 2.7 shows
several techniques for how to improve the performance. Next, section 2.8 considers how the performance
of an embedded processor could be measured. Finally, section 2.9 looks at where the research is today and
what the trends are in the embedded processor market.
2.1 Introduction
The latest advances in process technology have increased the number of available transistors on a single
die almost to the extent that today's battle between designers is not about how to fit it all on a single
piece of silicon, but how to make the most use of it. This evolution has also made it possible for designers
to put a complete processor, together with some or all of its peripheral components, on a single die, creating
a new class of products called Application Specific Standard Products (ASSPs). The demand for ASSPs
has also created a new domain of processors, embedded 32-bit CPUs, that are cheap, energy-efficient,
and especially designed for solving their domain of tasks.
Before getting into all the wonders of embedded CPUs, some clarifications should be made about what
they are and what they are not. When CPUs are discussed, the thoughts often go to the architectures
from Intel, Motorola, Sun, etc. These architectures are mainly designed for the desktop market and
have dominated it for a long time. In recent years, there has been an increasing demand for CPUs designed
for a specific domain of products. Among those noticing this trend was David Patterson [21]:
Intel specializes in designing microprocessors for the desktop PC, which in five years may
no longer be the most important type of computer. Its successor may be a personal mobile
computer that integrates the portable computer with a cellular phone, digital camera, and video
game player... Such devices require low-cost, energy-efficient microprocessors, and Intel is far
from a leader in that area.
The question of what the difference is between a desktop and an embedded processor is still unanswered.
Actually, some embedded platforms arose from desktop platforms (such as MIPS, Sparc, x86), so the
difference cannot be in register organization, the instruction set or the pipelining concept. Instead, the
factors that differentiate a desktop CPU from an embedded processor will be power consumption, cost,
integrated peripherals, interrupt response time, and the amount of on-chip RAM or ROM. The desktop world
values processing power, whereas an embedded processor must do the job for a particular application
at the lowest possible cost [22].
2.2 The Building Blocks of an Embedded CPU
This section serves as an introduction to the components of a modern embedded CPU. Readers that are
familiar with the basics of computer architecture and processor design might skip this section.
A CPU basically consists of three components: a register set, an ALU, and a control unit. Today, it is often
the case that the CPU includes an on-chip cache and a pipeline, in order to achieve an adequate level of
performance (figure 3). The following text will give a brief introduction to the components' function and
purpose in the CPU.
Figure 3: A typical embedded CPU (address register and incrementer, 32-bit registers, 32 x 8 multiplier,
barrel shifter and 32-bit ALU, write data register, instruction pipeline with instruction decoder and
control logic, and a cache, connected via the ALU, PC and increment buses to a 32-bit address bus and
a 32-bit data bus)
2.2.1 Register File
The organization of registers, i.e. how information is handled inside the computer, is part of a machine's
Instruction Set Architecture (ISA) [8, 9]. An ISA includes the instruction set, the machine's memory,
and all of the registers that are accessible by the programmer. ISAs are usually divided into three main
categories according to how information is stored in the CPU: stack architecture, accumulator architecture,
and general-purpose register (GPR) architecture. These architectures differ in how an operand is handled.
A stack architecture keeps its current operands on top of the stack, while an accumulator architecture keeps
one implicit operand in the accumulator, and a general-purpose register architecture has only explicit
operands, which can reside either in memory or registers. The following example shows how the expression A
= B + C would be evaluated in these three architectures.
stack architecture    accumulator architecture    general-purpose architecture
PUSH C                LOAD  R1,C                  LOAD  R1,C
PUSH B                ADD   R1,B                  LOAD  R2,B
ADD                   STORE A,R1                  ADD   R3,R2,R1
POP A                                             STORE A,R3
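The stack-architecture column of the example can be mimicked with a toy interpreter; this is only an illustration of how operands live on the stack, not a model of any real ISA:

```python
def run(program, memory):
    """Execute PUSH/ADD/POP instructions of a toy stack machine."""
    stack = []
    for op, *arg in program:
        if op == "PUSH":            # push a value from memory onto the stack
            stack.append(memory[arg[0]])
        elif op == "ADD":           # replace the two top-of-stack operands by their sum
            stack.append(stack.pop() + stack.pop())
        elif op == "POP":           # pop the result back into a memory location
            memory[arg[0]] = stack.pop()
    return memory

mem = {"A": 0, "B": 2, "C": 3}
run([("PUSH", "C"), ("PUSH", "B"), ("ADD",), ("POP", "A")], mem)
print(mem["A"])  # 5, i.e. A = B + C
```

Note how the ADD instruction names no operands at all: both are implicitly on top of the stack, which is exactly the property the text attributes to stack architectures.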
The machines in the early days used stack architectures and did not need any registers at all. Instead,
the operands are pushed onto the stack and popped off into a memory location. Some advantages were
that space could be saved because the register file was not needed, and no operand addresses were needed
during arithmetic operations. As the memories became slower compared to the CPUs, the stack architecture
also became ineffective, due to the fact that most time is spent fetching the operands from memory
and writing them back. This became a major bottleneck, which made the accumulator architecture a
more attractive choice.
The accumulator architecture was a step up in performance, letting the CPU hold one of
the operands in a register. Often, the accumulator machines only had one data accumulator register,
together with the other address registers. They are called accumulators due to their responsibility to act
as the source of one operand and the destination of arithmetic instructions, thus accumulating data. The
accumulator machine was a good idea at the time when memories were expensive, because only one address
operand had to be specified, while the other resided in the accumulator. Still, the accumulator machine
has its drawbacks when evaluating longer expressions, due to the limited number of accumulator registers.
The GPR machines solved many problems often related to stack and accumulator machines. They
could store variables in registers, thus reducing the number of accesses to main memory. Also, the
compiler could associate the variables of a complex expression in several different ways, making it more
flexible and efficient for pipelining. A stack machine needs to evaluate the same complex expression from
left to right, which might result in unnecessary stalling. Many embedded CPUs are RISC architectures,
which means that they have lots of registers (usually about 32).
2.2.2 Arithmetic Logic Unit
The Arithmetic Logic Unit (ALU) performs arithmetic and logic functions in the CPU. It is usually
capable of adding, subtracting, comparing, and shifting. The design can range from simple combinational
logic units that do ripple-carry addition, shift-and-add multiplication, and single-bit shifts,
to no-holds-barred units that do fast addition, hardware multiplication, and barrel shifts [9].
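As an illustration of what such a unit computes, the core operations can be sketched as a behavioural model on 32-bit values (a sketch only, not a gate-level design of any processor mentioned here):

```python
MASK = 0xFFFFFFFF  # model 32-bit registers by masking results

def alu(op, a, b):
    """Evaluate one of a few typical ALU operations on 32-bit operands."""
    if op == "ADD":
        return (a + b) & MASK
    if op == "SUB":
        return (a - b) & MASK
    if op == "AND":
        return a & b
    if op == "SHL":
        # A single-bit shifter does this one position per cycle;
        # a barrel shifter does the whole shift in one step.
        return (a << b) & MASK
    raise ValueError("unknown operation: " + op)

print(alu("ADD", 0xFFFFFFFF, 1))  # 0: addition wraps around at 32 bits
print(alu("SHL", 1, 4))           # 16
```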
2.2.3 Control Unit
The control unit is responsible for generating proper timing and control signals (usually implemented as
a state machine that performs the machine cycle: fetch, decode, execute, and store) to other logical blocks
in order to complete the execution of instructions.
2.2.4 Memory Management Unit
The Memory Management Unit (MMU) is located between the CPU and the main memory and is
responsible for translating virtual addresses into their corresponding physical addresses. The physical
address is then presented to the main memory. The MMU can also enforce memory protection when
needed.
2.2.5 Cache
There are few processors today that don't incorporate a cache. The cache acts as a buffer between the
CPU and the main memory to reduce access time, taking advantage of the locality of both code and
data. There are usually several levels of cache, each with their own purpose. The first level is usually
located on-chip, thus together with the CPU. The cache is often separated into an instruction- and a
data-cache. Cache is especially important in a RISC architecture with frequent loads and stores. For
example, Digital's StrongARM chip devotes about 90% of its die area to cache [89]. The reader can learn
more about cache and how it is used in section 4.2.
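How a cache turns locality into fast accesses can be sketched with the lookup of a direct-mapped cache. The sizes below are illustrative, not those of any processor mentioned here:

```python
LINES = 64   # number of cache lines (illustrative)
BLOCK = 16   # block size in bytes (illustrative)

cache = [None] * LINES  # each entry remembers the tag of the block it holds

def access(addr):
    """Return True on a cache hit, False on a miss (filling the line)."""
    block = addr // BLOCK
    index = block % LINES   # which line the block maps to
    tag = block // LINES    # identifies the block within that line
    if cache[index] == tag:
        return True
    cache[index] = tag      # miss: fetch the block and remember its tag
    return False

print(access(0x1000))  # False: compulsory miss
print(access(0x1004))  # True: same 16-byte block, so spatial locality pays off
```

A second access to a nearby address hits because it falls in the same block, which is exactly the locality of code and data that the text describes.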
2.2.6 Pipeline
As with the case of cache, there are very few processors today that don't use any kind of pipelining
in order to improve their performance. This section will serve as an introduction to pipelining and
the benefits and drawbacks of using it. Pipelining is an implementation technique that tries to achieve
Instruction Level Parallelism (ILP) by letting multiple instructions overlap in execution. The objective
is to increase throughput, the number of instructions completed per unit of time. By dividing the execution of
an instruction into several phases, called pipeline stages, an ideal speedup equal to the pipeline depth
could theoretically be achieved. Also, by dividing the pipeline into several stages, the workload per stage
will be smaller, letting the processor run at a higher frequency [8]. Figure 4 shows a typical pipeline together
with its stages. This particular pipeline has a length of five and consists of unique pipeline stages, each
with their own purpose.
Figure 4: A general pipeline (IF, ID, EX, MEM, WB)
The Instruction Fetch cycle (IF) consists of fetching the next instruction from memory.
The Instruction Decode cycle (ID) handles the decoding of the instruction and reads the register file
in case one or several of the instruction's operands are registers.
The Execution cycle (EX) evaluates ALU operations or calculates destination address in case of a
branch instruction.
The Memory Access cycle (MEM) is where memory is accessed when needed, or in case of a branch
the new program counter is set to the calculated destination address from the previous pipeline
stage.[4]
The Write-back cycle (WB) writes the result back to the register le.
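The ideal behaviour of such a pipeline, one instruction completed per cycle once the pipeline is full, can be checked with a small cycle-count model (a sketch, not a timing-accurate simulator):

```python
def cycles(n_instructions, depth=5):
    """Cycles needed to run n instructions through a hazard-free pipeline."""
    if n_instructions == 0:
        return 0
    # 'depth' cycles until the first instruction completes, then one per cycle.
    return depth + (n_instructions - 1)

n = 1000
print(cycles(n) / n)  # 1.004 cycles per instruction, approaching the ideal CPI of 1.0
```

The fixed cost of filling the pipeline is amortized over the whole program, which is why the CPI approaches but never quite reaches 1.0, as discussed below.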
The time it takes for an instruction to move from one pipeline stage to another is usually referred to
as a machine cycle. If one stage requires several cycles to complete, it could be decomposed into several
smaller stages, resulting in a superpipeline. Because instructions need to move at the same time, the length
of a machine cycle is dictated by the slowest stage of the pipeline. The designer's challenge is to reduce
the number of machine cycles per instruction (CPI). If one would execute one instruction at a time, the
CPI count would be equal to the pipeline length. The optimal result in a linear pipeline would be to
reach a CPI count of 1.0, which means that an instruction is completed every cycle and that every pipeline
stage is fully utilized. This is not entirely achievable, due to the fact that a program usually consists of
internal dependencies, branches, etc. These pipeline obstacles are usually referred to as hazards and can
cause delays in the pipeline, called stalls. An execution of a program completely without hazards would
execute its instructions with virtually no delays, resulting in a CPI count close to 1.0.[5] Those hazards
that do cause pipeline stalls are usually classified as: structural, data and control hazards.
structural hazards are caused by resource conflicts when the hardware cannot support certain combinations
of overlapped execution. It can result from having just one port to the register file, causing
conflicts between the ID and WB stages for register requests. Another source of conflict might be that the
memory is not divided into code and data, thus causing conflicts between the IF and MEM stages due to
simultaneous instruction fetching and memory writing.
data hazards are caused by an instruction being dependent on another instruction in a previous pipeline
stage, so that execution must be stalled, or else written data can be inconsistent. These instruction
dependencies come in three flavors: Read After Write (RAW), Write After Write (WAW), and
[4] In case of a conditional branch, the condition will be evaluated. If the instruction branches, the program counter is
set by the previous calculation; otherwise the program counter is incremented to point at the next instruction.
[5] The CPI count never reaches the ideal value of 1.0, since cycles are always lost in the beginning because the pipeline
is initially empty and needs to be filled with instructions. By the time the first instruction reaches the WB phase, several
cycles are lost.
Write After Read (WAR). RAW hazards are the most common ones and occur when a write
instruction is followed by a read instruction and both instructions operate on the same register,
causing the reading instruction to wait until the write has been issued in the WB stage. This can
be handled by forwarding, thus introducing "shortcuts" in the pipeline, so that instructions can
make use of results before the producing instruction reaches the WB stage [8]. WAW hazards cannot
occur in pipelines like the one shown earlier (figure 4). The reason for this is that in order for
a WAW hazard to occur, either the memory stage has to be divided into several stages, making
several simultaneous writes possible, or there must be some mechanism by which an instruction can bypass
another instruction in the pipeline. WAR hazards are rare and happen when an instruction tries to write
to a register read by an instruction that is ahead in the pipeline. As with WAW hazards, WAR
hazards cannot occur in a general pipeline because register contents are always read at the ID stage.
Some pipelines do read register contents late and can create a WAR hazard [8].
control hazards are caused by the instructions that change the path of execution, called branches.
By the time a branch instruction calculates its destination address in the EX stage, the instructions
following the branch have reached the IF and ID stages. If the branch was unconditional, the
instructions that are in the IF and ID stages have to be removed, because the branch changes the program
counter and the new instructions have to be fetched from a new address, namely the destination
address of the branch. On the other hand, if the branch was conditional, the condition
needs to be evaluated in order to decide if the branch should be taken or if the program counter
should only be incremented. One way of dealing with this problem is to automatically stall the
pipeline until the condition is evaluated. These stalls are issued in the ID stage, where the branch is
first identified. Also, in order to evaluate the condition of a conditional branch and calculate the
destination address simultaneously, extra logic for condition evaluation can be added together with the
ALU in the ID stage. This way, only one stall cycle will be wasted when a branch instruction
occurs.
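Of the hazard classes above, the RAW case is the one a forwarding unit must detect: it compares the destination register of an instruction with the source registers of the instructions close behind it in the pipeline. A sketch with a toy instruction encoding (not a real ISA):

```python
def raw_hazards(program, window=2):
    """Report (producer, consumer) index pairs that form RAW dependences.

    Each instruction is (dest, [sources]); 'window' is how many following
    instructions are still close enough in the pipeline to be affected.
    """
    hazards = []
    for i, (dest, _) in enumerate(program):
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if dest in program[j][1]:
                hazards.append((i, j))
    return hazards

prog = [("R1", ["R2", "R3"]),  # ADD R1,R2,R3
        ("R4", ["R1", "R5"]),  # SUB R4,R1,R5  <- reads R1 right after it is written
        ("R6", ["R7", "R8"])]
print(raw_hazards(prog))  # [(0, 1)]
```

Each reported pair is a place where the hardware must either stall or forward the result, or where the compiler can reschedule instructions to break the dependence.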
Most structural hazards can be prevented by adding more ports and dividing the memory into data
and instruction memory segments. The memory can also be improved by adding cache or increasing the
cache area. Data hazards can be handled by letting the compiler reschedule the instructions in order to
reduce the number of dependencies. Control hazards can be reduced by trying to predict the destination
of a branch. The prediction is based on tables storing historical information about whether the same
branch did or did not jump in earlier executions. Such tables, called Branch History Tables (BHT) or
Branch Prediction Buffers (BPB), do just that. Other tables, such as the Branch Target Buffer (BTB),
act as a cache storing the destination addresses of many previously executed branches. The interested
reader can continue reading in several books and articles addressing different branch penalty reduction
techniques [8, 64, 63].
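The per-branch history kept in such tables is commonly a 2-bit saturating counter, so that two mispredictions in a row are needed before the prediction flips. A sketch of one such counter (a common scheme, not tied to any particular processor described here):

```python
def predict(state):
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken."""
    return state >= 2

def update(state, taken):
    """Move the counter toward the actual outcome, saturating at 0 and 3."""
    return min(state + 1, 3) if taken else max(state - 1, 0)

# A loop branch that is taken nine times and finally falls through:
state, correct = 0, 0
outcomes = [True] * 9 + [False]
for taken in outcomes:
    correct += predict(state) == taken
    state = update(state, taken)
print(correct, "of", len(outcomes))  # 7 of 10: only warm-up and the loop exit mispredict
```

The saturation is what makes the scheme robust for loops: the single not-taken exit does not flip the prediction for the next execution of the loop.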
2.3 The Microprocessor Evolution
This section serves as a "walk-through" of the different phases in microprocessor evolution (figure 5).
Although this section may seem irrelevant to embedded processors, embedded processor design has always
been influenced by the microprocessor and may continue to be so in the future. The reader who feels
unfamiliar with the principles behind the RISC and CISC paradigms should reread section 1.4.1 before
proceeding with this text.
In the early days, there was a limited number of transistors available to the CPU designer. Usually,
the chips were filled with logic that was seldom used (e.g. decoding of seldom-used instructions). CISC
computers used microcoding, which made it easier to execute complex instructions. As the years went
by, it became harder for CISC designers to keep up with Moore's law.[6] Building more complex solutions
each year was not enough. Some designers realized that the rule of locality of reference is something that
needs to be taken into consideration. It states that a program executes about 90% of its instructions
in 10% of its code.

[6] The capability of a microprocessor doubles every 18 months.

Figure 5: The evolution of microprocessors. CISC (microcoding, more complex instructions, instructions
take variable time to complete) and RISC (pipelining, simple instructions for speed) merge into combined
RISC/CISC architectures, which evolve into superscalar/VLIW processors (execute multiple instructions),
multithreaded processors (duplicated HW resources: regs, PC, SP), simultaneous multithreading (any
context can execute each cycle), and single-chip multiprocessors (duplicated processors).

The RISC designers thought that if they could implement the 10% of most-used
instructions and throw out the other 90%, then there would be lots of free die area left for other ways of
increasing the performance. Some of the performance enhancing techniques are listed below.
Cache Memory references were becoming a serious bottleneck, and a way to reduce the access time is to
use the extra on-chip space for cache. With the on-chip cache, the processor does not need to access
the main memory for all memory references.
Pipelining By breaking down the handling of an instruction into several simpler stages, each stage becomes
shorter and the processor is able to run at a higher frequency.
More registers When compiling a program into machine code, the handling of variables is usually
taken care of by registers. Sometimes, there are stalls in the pipeline due to dependencies between
registers (e.g. one cannot use a register until it is available), which can be avoided by register
renaming. This is made possible by increasing the number of registers.
Computers using some or all of these advantages include the RISC I and the IBM 801 [2]. These enhancements
gave the RISC designers the upper hand for several generations in the 80's and 90's. But when the
number of available transistors on a chip passed the million mark, the number of transistors as a limiting
factor disappeared. The CISC designers could level the score by introducing more complex solutions that
increased their performance a couple of percent, with little concern for how much die area was used. Even
though the CISC processor was several factors more complex than the corresponding RISC processor,
it was still keeping up with the RISC. Nowadays, the RISC and CISC paradigms are merged together
and use techniques from both of the original paradigms. Now, when there are 10, 20 million or more
transistors available, the problem the designer is facing is more about making the most use of all the
transistors than how to fit it all on one die. A simple processor can now be realized on only a fraction of
the available space. There are limits to the performance gains from increasing the cache size, deepening
the pipeline and increasing the number of registers. So, the question is what to do with the available
space? To gain more performance, new architectures like Multithreading, Simultaneous Multithreading
(SMT), Very Long Instruction Word (VLIW), and Single Chip Multiprocessor (CMP) are emerging.
These architectures will be discussed in section 2.7.
2.4 Design Aspects
The designers of embedded processors are under market pressure to produce cheap, low-power,
fast processors [22]. To meet the market demand for SoC solutions, the designer of
an embedded processor needs to consider several design aspects, listed below.
2.4.1 Code Density
The size of a program may not be an issue in the desktop world, but it is a major challenge in embedded
systems. The embedded processor market is highly constrained by power, cost, and size. For control-
oriented embedded applications, a significant portion of the final circuitry is used for instruction memory.
Since the cost of an integrated circuit is strongly related to die size, smaller programs mean that
smaller, and thus cheaper, dies can be used for embedded systems [81, 82].
Thumb and MIPS16 are two approaches that try to reduce the code size of programs by compressing
the code. Thumb and MIPS16 are subsets of the ARM and MIPS-III architectures, respectively. The
instructions in the subset are either frequently used, do not require full 32 bits, or are important to the
compiler for generating small code. The original 32-bit instructions are re-encoded to be 16 bits wide.
Thumb and MIPS16 are reported to achieve code reductions of 30% and 40%, respectively. The 16-bit
instructions are fetched from instruction memory and decoded to equivalent 32-bit instructions that are
run by the core as usual. Both approaches have drawbacks:
Instruction widths are shrunk at the expense of reducing the number of bits used to represent
registers and immediate values
Conditional execution and zero-latency shifts are not possible in Thumb
Floating-point instructions are not available in MIPS16
The number of instructions in a program grows with compression
Thumb code runs 15-20% slower on systems with ideal instruction memories
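The size arithmetic behind these trade-offs is simple to sketch. The 1.4 growth factor below is an illustrative assumption, chosen so that halving the instruction width reproduces the reported ~30% reduction:

```python
def compressed_size(num_arm_instrs, growth_factor):
    """Each 4-byte (32-bit) ARM instruction becomes a 2-byte (16-bit)
    Thumb instruction, but the instruction count grows by growth_factor."""
    arm_bytes = num_arm_instrs * 4
    thumb_bytes = int(num_arm_instrs * growth_factor) * 2
    return arm_bytes, thumb_bytes

# Even with 40% more instructions, halving the instruction width still
# yields a net ~30% size reduction: 1 - (1.4 * 2) / 4 = 0.30
arm_bytes, thumb_bytes = compressed_size(1000, 1.4)
```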
Both Thumb and MIPS16 use the execution-based selection form of selective compression, a technique
that selects procedures to compress according to a profile of procedure execution frequency. The other
form is miss-based selection, where decompression is invoked only on an instruction cache miss, so all
performance loss occurs on the cache-miss path. Miss-based selection is thus driven by the number of
cache misses rather than by the number of executed instructions, as in execution-based selection.
Speedup can be achieved by letting the procedures with the most cache misses remain in native code.
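Miss-based selection as described above can be sketched as follows. This is a hypothetical illustration: the procedure names, miss counts, and the budget parameter are invented for the example.

```python
def select_native(procedures, miss_counts, native_budget):
    """Miss-based selection: keep the procedures with the most cache
    misses in native (uncompressed) code; compress the rest."""
    ranked = sorted(procedures, key=lambda p: miss_counts[p], reverse=True)
    native = set(ranked[:native_budget])
    compressed = [p for p in procedures if p not in native]
    return native, compressed

# The two procedures causing the most cache misses stay native, so the
# decompression penalty is paid only on the rarely missed procedures:
miss_counts = {"decode": 900, "init": 5, "log": 40, "blit": 700}
native, compressed = select_native(list(miss_counts), miss_counts,
                                   native_budget=2)
```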
Jim Turley has a different view on these techniques for reducing code size [89]: claimed advantages
in code density should be considered in light of factors such as compiler optimization (loop unrolling,
procedure inlining, etc.), the addressing (32-bit vs. 64-bit integers or pointers), and memory granularity.
Finally, code density does little or nothing to affect the size of the data space. Applications working with
large data sets require much more memory for data than for the executable, so code reduction does little
to help here.
2.4.2 Power Consumption
Many products using embedded processors run on batteries. To preserve as much power as
possible, embedded processors usually operate in three different modes: fully operational, standby mode,
and clock-off mode [22]. Fully operational means that the clock signal is propagated to the entire
processor, and all functional units are available to execute instructions. When the processor is in standby
mode, it is not actually executing instructions, but the DRAM is still refreshed and register contents
remain available. The processor returns to fully operational mode, without losing any information, upon
an activity that requires units not available in standby mode. Finally, in clock-off mode, the system
has to be restarted in order to continue, which takes almost as much time as an initial start-up. Power
consumption is often measured in milliwatts per megahertz (mW/MHz).
The simplest way to reduce power consumption is to reduce the voltage level. Today, CPU core
voltages have been reduced to about 1.8 V and are still decreasing. Embedded processors are starting to
incorporate dynamic power management into their designs. One example is a pipeline that can shut off
the clock to logic blocks that are not needed when executing the current instruction [98]. This type
of pipeline is usually referred to as a power-aware pipeline. Also, it is no longer sufficient to measure only
the power consumption of the CPU, as it is integrated with its peripherals in a SoC. Instead, the power
consumption of the entire system has to be measured.
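The mW/MHz figure makes back-of-the-envelope power and battery-life estimates straightforward. A sketch follows; the battery capacity is an assumed example value, and the linear power-vs-frequency model ignores static leakage:

```python
def avg_power_mw(mw_per_mhz, freq_mhz):
    # Dynamic power scales roughly linearly with clock frequency
    # at a fixed supply voltage.
    return mw_per_mhz * freq_mhz

def battery_life_hours(capacity_mah, voltage_v, power_mw):
    # Stored energy (mWh) divided by average draw (mW).
    return capacity_mah * voltage_v / power_mw

# Using ARM7TDMI-like figures (0.6 mW/MHz at 66 MHz, see section 2.6.1)
# and an assumed 750 mAh, 3.3 V battery:
power = avg_power_mw(0.6, 66)                  # ~39.6 mW
hours = battery_life_hours(750, 3.3, power)    # ~62.5 h
```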
2.4.3 Performance
Unlike in the desktop market, performance isn't everything in the embedded processor market; factors
like price and power consumption are equally important. A typical embedded processor executes
about one instruction per cycle. Today, performance is still measured in Million Instructions Per Second
(MIPS), which basically only reveals how many instructions were executed per second, not whether
any useful instructions were executed. MIPS is not a good way of measuring performance, and section 2.8.1
looks at other alternatives. Sometimes, the usual performance of one executed instruction per cycle for
an embedded processor is not enough, and alternative architectures must be considered in order to
increase the performance. Section 4.3.7 discusses possible alternative architectures.
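As a worked example of why MIPS is a crude metric: it is just clock rate divided by CPI, so it says nothing about how much useful work each instruction performs. The figures below are illustrative:

```python
def mips_rating(freq_mhz, cpi):
    # MIPS = (clock rate in Hz / CPI) / 10^6; with the frequency already
    # in MHz this reduces to freq / CPI. Note that nothing here captures
    # whether the executed instructions do useful work.
    return freq_mhz / cpi

# A 66 MHz core averaging 1.1 cycles per instruction rates at 60 MIPS,
# whether it is computing results or spinning in a busy-wait loop:
rating = mips_rating(66, 1.1)
```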
2.4.4 Predictability
Architectures that support real-time systems must be able to achieve predictability [84]. Predictability
depends on the Worst Case Execution Time (WCET), which in turn is dictated by the
underlying hardware. Much focus is on improving an architecture's performance, and little thought goes
into making it predictable. This has led to architectures that include caches, pipelines, virtual storage
management, etc., all of which have improved the average-case execution time but worsened the prospects
for predictable real-time performance.
Caches have not been popular in the real-time computing community, due to their unpredictable behavior.
This is true for multi-tasking, interrupt-driven environments, which are common in real-time applica-
tions [87]. Here, the execution time of an individual task can differ from run to run due to interactions
of real-time tasks with the external environment via the operating system. Preemptions may modify
the cache contents and thereby cause a nondeterministic cache hit ratio, resulting in unpredictable task
execution times.
Pipelines introduce problems similar to those of caches concerning worst-case execution time. There are
efforts to achieve predictable pipeline performance without using a cache and without the hazards
associated with one [88]. This approach, called Multiple Active Context System (MACS), uses multiple
processor contexts to achieve both increased performance and predictability. Here, a single pipeline is shared
among a number of threads, and the context of every thread is stored within the processor. On each cycle,
a single context is selected to issue a single instruction to the pipeline. While this instruction proceeds
through the pipeline, other contexts issue instructions to fill consecutive pipeline stages. Contexts are
selected in round-robin fashion. A key feature of the MACS architecture is that its memory model allows
the programmer to derive theoretical upper bounds on memory access times. The maximum number of
cycles a context will wait for a shared memory request is dictated by the number of contexts, the memory
issue latency, the number of threads competing for shared memory, and the number of contexts scheduled
between consecutive threads.
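The kind of bound MACS enables can be illustrated as follows. This is a hypothetical simplification, not the exact MACS formula, which also accounts for the scheduling distance between consecutive threads:

```python
def worst_case_wait_cycles(num_contexts, mem_issue_latency,
                           competing_threads):
    # Illustrative bound only: in the worst case every competing thread's
    # memory request is serviced first (each taking mem_issue_latency
    # cycles), and the context then waits for its own round-robin slot.
    return competing_threads * mem_issue_latency + num_contexts

# With 4 contexts, a 3-cycle memory issue latency, and 3 competing
# threads, a request waits at most 3*3 + 4 = 13 cycles under this model.
bound = worst_case_wait_cycles(4, 3, 3)
```

Whatever the exact formula, the key property is that every term is a design-time constant, so the bound can be computed statically for WCET analysis.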
2.5 Implementation Aspects
There are several options available for the designer who wants to integrate an embedded processor into
a SoC. Besides building a processor from "scratch", there are other options available. The first option
is to acquire the processor core as a hard IP-component7, which is tied to a specific semiconductor
fabrication process and is delivered as mask data. Several hard IP-cores will be examined in section 2.6.
The second option is to acquire the CPU as a firm IP-component, which is usually delivered in the form
of a netlist. The third and last option is to acquire a soft IP-component in the form of VHDL or Verilog
code, or to produce a synthesizable core with a parameterizable core generator. There have been sev-
eral research efforts to develop generators of parameterizable RISC cores [73, 76]. One, conducted at the
University of Hanover, has developed a parameterizable core generator that outputs fully synthesizable
VHDL code. The generated core is based on a standard 5-stage pipeline (Figure 4). The designer has
many choices when using the generator (e.g., pipeline length, ALU and data width, size of register file, etc.).
The generated cores are simple RISC processors with parameterizable word and instruction widths.
Instruction and data memories are provided as a VHDL template file for simulations, but they are not
suitable for synthesis; instead, they should be taken from a technology-specific library. Since the cores
are based on RISC principles, the instruction set consists of only a few instructions and addressing modes.
A typical 32-bit RISC core with a 32-bit data path and 8 32-bit registers can, with a 3LM 0.5-micron
standard-cell library, achieve a clock frequency of about 100 MHz.
Commercial core generators are also available from Tensilica, ARC, and Triscend [100, 101, 99].
2.6 State of Practice
The 4-, 8-, and 16-bit microprocessors were, and still are, dominating the embedded control market. In
fact, it was forecast that eight times more 8-bit than 32-bit CPUs would be shipped during 1999 [89].
The 32-bit embedded processor market differs from the desktop market in that there are about 100
vendors and a dozen instruction set architectures to choose from. What makes 32-bit embedded
CPUs attractive is their ability to handle emerging consumer demands in the form of filtering, artificial
intelligence, and multimedia, while still maintaining low power consumption, price, etc. Next follows
a brief presentation of embedded processors commonly used today.
2.6.1 ARM
The Advanced RISC Machines (ARM) company is a leading IP provider that licenses RISC processors,
peripherals, and system-on-chip designs to international electronics companies. The ARM7 family of
processors consists of the ARM7TDMI and ARM7TDMI-S processor cores, and the ARM710T, ARM720T,
and ARM740T cached processor macrocells.
An ARM7 processor consists of an ARM7TDMI or ARM7TDMI-S core (the S stands for Synthesizable,
meaning it can be acquired as VHDL or Verilog code) that can be augmented with one of
the available macrocells. The macrocells provide the core with an 8 KB cache, a write buffer, and memory
functions. The ARM710T also provides virtual memory support for operating systems such as Linux and
Symbian's EPOC32. The ARM720T is a superset of the ARM710T and also supports Windows CE.
When writing a 32-bit program for an embedded system, there may be a problem fitting the entire
program into the on-chip memory. This kind of problem is usually referred to as a code density problem.
To address the code size problem, ARM has developed Thumb, a new instruction set. Thumb is
an extension of the ARM architecture, containing 36 instruction formats drawn from the standard 32-bit
ARM instruction set that have been re-coded into 16-bit wide opcodes. Upon execution, the Thumb
codes are decompressed by the processor into their real ARM instruction set equivalents, which are then
run on the ARM as usual. This gives the designer the benefits of ARM's 32-bit instruction set while
reducing code size with Thumb.
7 Those who are not familiar with the different layers of IP-components can read the section SoC Design.
The ARM9 family is a newer, more powerful version of the ARM7, designed for system-on-chip
solutions thanks to its built-in DSP capabilities. The ARM9E-S solutions are macrocells intended for
integration into Application Specific Integrated Circuits (ASICs), Application Specific Standard Products
(ASSPs), and System-on-Chip (SoC) products.
CPU core    Die Area              Power               Frequency  Performance
ARM7TDMI    1.0 mm^2 on 0.25 µm   0.6 mW/MHz @ 3.3V   66 MHz     0.9 MIPS/MHz
ARM9E-S     2.7 mm^2 on 0.25 µm   1.6 mW/MHz @ 2.5V   160 MHz    1.1 MIPS/MHz
2.6.2 Motorola
The Motorola M-CORE microprocessor, introduced in 1997, targeted the market of analog cellular
phones, digital phones, PDAs, portable GPS systems, automobile braking systems, automobile engine
control, and automotive body electronics. The M-CORE architecture was designed
from the ground up to achieve the lowest milliwatts per MHz. It is a 32-bit RISC processor with a
16-bit fixed-length instruction format. The M-CORE minimizes power usage
by utilizing dynamic power management.
Motorola has also developed a modernized version of the 68K architecture, the ColdFire, which is positioned
between the 68K (low end) and the PowerPC (high end). This architecture is also known as VL-RISC
because, although the core is RISC-like, the instructions are of variable length (VL). VL instructions help
attain higher code density. The ColdFire has a four-stage pipeline consisting of two subpipelines: a two-stage
instruction prefetch pipeline and a two-stage operand execution pipeline.
2.6.3 MIPS
MIPS Technologies designs and licenses embedded 32- and 64-bit intellectual property (IP) and core
technology for the digital consumer and embedded systems markets. The MIPS32 architecture is a
superset of the previous MIPS I and MIPS II instruction set architectures.
2.6.4 Patriot Scientific
Patriot Scientific Corporation was one of the first to develop a Java microprocessor, the PSC1000. The
PSC1000 is targeted at high-performance, low-system-cost applications like network computers, set-top
boxes, cellular phones, Personal Digital Assistants (PDAs), and more. The PSC1000 microprocessor is
a 32-bit RISC processor that offers the ability to execute Java(tm) programs as well as C and FORTH
applications. It offers a unique architecture that is a blend of stack- and register-based designs, which
enables features like 8-bit instructions for reduced code size. The idea behind the PSC1000 is to bring
Internet connectivity to low-cost devices such as PDAs, set-top cable boxes, and "smart" cell phones.
2.6.5 AMD
Advanced Micro Devices' (AMD) 29K was an early leader, frequently used in laser print-
ers and network buses. The 29K family comprises three product lines, including three-bus Harvard-
architecture processors, two-bus processors, and a microprocessor with on-chip peripheral support. The
core is built around a simple four-stage pipeline: fetch, decode, execute, and write-back. The 29K has a
triple-ported register file of 192 32-bit registers. In 1995, AMD cancelled all further development of the
29K to concentrate its efforts on x86 chips.
2.6.6 Hitachi
The Hitachi SuperH (SH) became popular when Sega chose the SH7032 for its Genesis and Saturn video game
consoles. It then expanded to cover consumer-electronics markets. Its short, 16-bit instruction word
gives the SuperH one of the best code densities of almost any 32-bit processor. The SH family
uses a five-stage pipeline: fetch, decode, execute, memory access, and register write-back. The CPU
is built around 25 32-bit registers.
2.6.7 Intel
The Intel i960 emerged early in the embedded market, which made it successful in printer and networking
equipment. The i960 is well supported with development tools. It combines a Von Neumann
architecture with a load/store architecture centered on a core of 32 32-bit general-purpose registers.
All i960s have multistage pipelines and use resource scoreboarding to track resource usage.
2.6.8 PowerPC
The PowerPC is one of the best-known microprocessor names next to Pentium and is steadily gaining
ground in the embedded space. IBM and Motorola pursue different strategies with their embedded
PowerPC chips, with the former inviting customer designs and the latter leveraging its massive library
of peripheral I/O logic.
2.6.9 Sparc
Sun's SPARC was the first workstation processor to be openly licensed and is still popular with some
embedded users. The microSPARC is built around a large multiported register file that breaks down
into a small set of global registers, for holding global variables, and sets of overlapping register windows.
The microSPARC's pipeline consists of an instruction-fetch unit, two integer ALUs, a load/store unit,
and an FPU.
2.7 Improving Performance
Pipelining is a way of achieving a level of parallelism, resulting in a low CPI count. To be
even more effective, linear pipelining will not suffice and other techniques have to be considered. These
techniques can execute several instructions at once, resulting in a CPI below 1.0.
The most popular techniques include Multiple-issue Processors (such as Very Long Instruction Word
(VLIW) and Superscalar Processors), Multithreading, Simultaneous Multithreading (SMT), and Chip
Multiprocessor (CMP). Another technique, which tries to come to terms with the ever-growing
memory-CPU speed gap, will also be discussed: prefetching (or preloading), which hides
the memory latency by fetching and storing required data or instructions in a buffer before they are
actually needed.
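The CPI arithmetic these techniques target can be sketched as follows. The instruction mix is illustrative, and the ideal-width divisor ignores issue-slot waste:

```python
def effective_cpi(mix):
    """mix: list of (fraction_of_instructions, cycles_per_instruction)
    pairs; returns the weighted-average cycles per instruction."""
    return sum(fraction * cycles for fraction, cycles in mix)

def multiple_issue_cpi(base_cpi, issue_width):
    # An ideal w-wide multiple-issue machine divides CPI by w,
    # which is how CPI is pushed below 1.0.
    return base_cpi / issue_width

# E.g. 80% single-cycle ops and 20% two-cycle loads give CPI 1.2 on a
# scalar pipeline; an ideal 2-wide issue machine would halve that to 0.6.
scalar = effective_cpi([(0.8, 1), (0.2, 2)])
wide = multiple_issue_cpi(scalar, 2)
```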
2.7.1 Multiple-issue Processors
Although there are techniques that can remedy most of the stalls in an ordinary pipeline, the ideal
result is still only a CPI of 1.0, i.e., exactly one instruction executed every machine cycle.
This performance is not always enough, and other ways of achieving a higher level of parallelism need
to be considered. Multiple-issue processors try to execute several instructions per machine cycle, thus
achieving a higher rate of Instruction-Level Parallelism (ILP). There are mainly two types of processors
using these techniques, namely Very Long Instruction Word (VLIW) and superscalar processors. In
addition to these two architectures, a third alternative, the Multiple Instruction Stream
Computer (MISC), will be discussed.
As the name implies, a VLIW processor issues a very long instruction packet that consists of several
instructions. An example of an instruction packet can be seen in Figure 6, where there is room for two
integer/branch operations, one floating-point operation, and two memory references. In VLIW
processors, the task of finding independent instructions in the code is done by the compiler instead of by
dynamic hardware as in superscalar processors. Additional hardware is saved because the compiler always
I