
Upload: amos-lee

Post on 23-Dec-2015


Page 1: COMPUTER ARCHITECTURE Assoc.Prof. Stasys Maciulevičius Computer Dept. stasys.maciulevicius@ktu.lt

COMPUTER ARCHITECTURE

Assoc.Prof. Stasys Maciulevičius

Computer Dept.

stasys.maciulevicius@ktu.lt


©S.Maciulevičius 2

AMD road

The company began as a producer of logic chips, then entered the RAM chip business in 1975. That same year, it introduced a reverse-engineered clone of the Intel 8080 microprocessor

In February 1982, AMD became a licensed second-source manufacturer of Intel 8086 and 8088 processors

In 1991, AMD released the Am386, its clone of the Intel 386 processor

AMD's first in-house x86 processor was the K5, which was launched in 1996

2014 ©S.Maciulevičius 2


AMD road


Later came the K6 (1997), Athlon (K7, 1999) and Athlon XP (2001) processors

The first server processor was the Opteron (K8, 2003); dual-core Opterons followed in 2005

After K8 came K10. In 2007, AMD released the first K10 processors: quad-core 3rd generation Opteron processors. This was followed by the Phenom processor for desktop. K10 processors came in dual-core, triple-core, and quad-core versions, with all cores on a single die

In January 2009, AMD released a new processor line Phenom II, which came in dual-core, triple-core and quad-core variants


AMD K10 architecture

The new K10 architecture was based on the K8 architecture with some enhancements:

The fetch unit fetches 32 bytes (256 bits) of data per clock cycle from the L1 instruction cache. This is double what CPUs based on the K8 architecture could fetch per clock cycle (Intel CPUs based on the Core microarchitecture, such as Core 2 Duo, also fetch 32 bytes per clock cycle)

The use of a true 128-bit internal datapath. On previous CPUs based on the K8 microarchitecture the internal datapath was only 64 bits wide. This was a problem for SSE instructions, since the SSE registers, called XMM, are 128 bits long, so each 128-bit operation had to be handled as two 64-bit halves
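The effect of the wider datapath can be sketched with a small calculation. This is a deliberately simplified model, not a description of the real pipeline: the helper name `internal_passes` is hypothetical, and it assumes an operand wider than the datapath is simply split into datapath-sized pieces.

```python
import math

def internal_passes(operand_bits: int, datapath_bits: int) -> int:
    """Passes needed to move one operand through a datapath of the given width."""
    return math.ceil(operand_bits / datapath_bits)

XMM_BITS = 128  # width of an SSE (XMM) register

k8_passes = internal_passes(XMM_BITS, 64)    # K8: 64-bit internal datapath
k10_passes = internal_passes(XMM_BITS, 128)  # K10: 128-bit internal datapath

print(k8_passes, k10_passes)  # 2 1
```

Under this model a 128-bit SSE operand needs two passes on K8 and one on K10.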


Barcelona


AMD’s APU

An accelerated processing unit (APU) is a processing system that includes additional processing capability designed to accelerate one or more types of computations outside of a CPU

This may include a graphics processing unit (GPU) used for general-purpose computing (GPGPU), a field-programmable gate array (FPGA), or similar specialized processing system


AMD’s APU

At the most basic level, AMD’s new Accelerated Processing Units combine general-purpose x86 CPU cores with programmable vector processing engines on a single silicon die

AMD’s APUs also include a variety of critical system elements, including memory controllers, I/O controllers, specialized video decoders, display outputs, and bus interfaces


AMD view on APUs


AMD’s APU

AMD announced the first-generation APUs, Llano for high-performance and Brazos for low-power devices, in January 2011

The second-generation Trinity for high-performance and Brazos-2 for low-power devices were announced in June 2012

The third-generation Kaveri for high-performance devices was launched in January 2014, while Kabini and Temash for low-power devices were announced in summer 2013


AMD Fusion

AMD Fusion is the marketing name for a series of APUs by AMD, aimed at providing good performance with low power consumption, and integrating a CPU and a GPU based on a mobile stand-alone GPU


AMD Fusion

The first demonstration of AMD Fusion was at Computex 2010 (Taipei, Taiwan, June 2, 2010)


New AMD core - Bulldozer

Bulldozer is the codename AMD has given to one of the CPU cores based on the AMD family 15h microarchitecture

Bulldozer was designed from scratch rather than developed from earlier processors

AMD has introduced a new microarchitecture building block called a module

In terms of hardware complexity and functionality, a module is midway between a dual-core processor (in which each core is fully independent) and a single processor core that has two SMT threads (in which each thread shares most of the hardware resources with the other thread)


AMD Bulldozer core

A module consists of two tightly coupled, "conventional" x86 out-of-order processing engines

The two processing engines share the early pipeline stages (e.g. instruction fetch and decode), the FPUs, and the L2 cache


AMD Bulldozer core

Two dedicated integer cores: each consists of two ALUs and two AGUs, which are capable of a total of four independent arithmetic and memory operations per clock per core

Duplicating the integer schedulers and execution pipelines gives each of the two threads dedicated hardware, which significantly increases performance in multithreaded integer applications

The second integer core increases the Bulldozer module die area by around 12%, which at chip level adds about 5% of total die space
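The two area figures can be related with a line of arithmetic. The sketch below rests on an assumption the slide does not state: both percentages are taken relative to a hypothetical baseline design without the second integer core. Under that reading, the figures imply what share of the die the modules occupy.

```python
module_growth = 0.12  # second integer core adds ~12% to module area (slide figure)
chip_growth = 0.05    # which adds ~5% at chip level (slide figure)

# Assumption: both percentages refer to the same baseline (no second core), so
#   module_growth * module_share = chip_growth
module_share = chip_growth / module_growth

print(f"modules occupy about {module_share:.1%} of the baseline die")
```

The rest of the die would then be taken up by the L3 cache, memory controller, and I/O.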


AMD Bulldozer core

Two symmetrical 128-bit FMAC (fused multiply-add capability) floating-point pipelines per module, which can be unified into one large 256-bit-wide unit if one of the integer cores dispatches an AVX instruction, and two symmetrical x87/MMX/SSE-capable FPPs for backward compatibility with software not optimized for SSE2

Multiple modules share an L3 cache as well as an Advanced Dual-Channel Memory Sub-System (IMC - Integrated Memory Controller)

A dual-core Bulldozer processor has a single module, a quad-core processor has two modules and an octo-core processor has four modules
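The module-to-core relationship described above can be written down as a small model. This is only an illustrative sketch: the class name `BulldozerChip` and its attributes are hypothetical, and the per-module constants are taken from the slides (two integer cores and two 128-bit FMAC pipelines per module).

```python
from dataclasses import dataclass

INTEGER_CORES_PER_MODULE = 2  # two dedicated integer cores per module
FMAC_PIPES_PER_MODULE = 2     # two shared 128-bit FMAC pipelines per module
FMAC_PIPE_BITS = 128

@dataclass(frozen=True)
class BulldozerChip:
    modules: int

    @property
    def integer_cores(self) -> int:
        # The marketed "core count" counts the integer cores.
        return self.modules * INTEGER_CORES_PER_MODULE

    @property
    def fmac_pipelines(self) -> int:
        return self.modules * FMAC_PIPES_PER_MODULE

    @property
    def fused_fp_width_bits(self) -> int:
        # Width of one module's FP unit when both pipelines are unified for AVX.
        return FMAC_PIPES_PER_MODULE * FMAC_PIPE_BITS

for modules in (1, 2, 4):
    chip = BulldozerChip(modules)
    print(chip.modules, chip.integer_cores, chip.fmac_pipelines)
```

Note that an eight-core chip in this model has eight integer cores but only four shared floating-point units.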


AMD Bulldozer core

The first shipments of Bulldozer-based Opteron processors began in September 2011

On 12 October 2011, AMD released the first four FX-series processors of the Bulldozer line (FX-8150, FX-8120, FX-6100, FX-4100)

AMD stated on its blog that “there are some in our community who feel the product performance did not meet their expectations”

AMD said that the remaining FX series AMD processors would be released at the end of the first quarter of 2012


AMD Piledriver


Improvements in the Piledriver

Improved branch prediction accuracy through the use of a hybrid predictor augmented with a second-level predictor;

128- and 256-bit FMA3 instruction extensions (fused multiply-add) and F16C SSE5 instruction extensions (half-precision floating-point conversion);

Optimized schedulers;

Accelerated division through a modified execution unit;

Increased L1 TLB;

Improved L1 and L2 prefetchers that can work with variable-length patterns, including those on page boundaries;

Improved L2 cache efficiency through more aggressive removal of unused data that the prefetcher algorithms loaded into the cache by mistake.


New x86 micro-architecture - Steamroller

Steamroller is the third modular x86 architecture from AMD. It promises 15% to 20% higher performance per cycle/watt than the Piledriver micro-architecture released in Trinity, and comes with a new integrated DDR3-2133 memory controller plus PCI Express (PCIe) 3.0

Kaveri, which is based on Steamroller, has up to 2 modules (4 integer cores) and 2 third-generation Flex FP floating-point units


AMD Steamroller

The focus of Steamroller is greater parallelism. Improvements center on:

independent instruction decoders for each core within a module,

a 25% increase in the maximum dispatch width per thread,

better instruction schedulers,

an improved branch predictor,

larger and smarter caches, ...


AMD Steamroller …

up to 30% fewer instruction cache misses,

branch misprediction rate reduced by 20%,

dynamically resizable L2 cache,

micro-operations queue,

more internal register resources,

improved memory controller


From APU to HSA

AMD's first mainstream APU combined the CPU and a capable GPU — each with a separate slice of system memory — on the same chip

The Trinity APU added some new features: a memory management unit that allowed the GPU to see all of the physical system memory, shared power management, and support for OpenCL C++ and Microsoft C++ AMP

But the basic software model has remained the same; the CPU and GPU can't work together on the same data


From APU to HSA

The next step for HSA, heterogeneous Uniform Memory Access (hUMA), promises to solve this problem with three features:

the CPU and GPU use the same pointers (addresses) to access the entire memory space to read and write data;

they are cache coherent, so they can work on data at the same time without issues; and,

like the CPU, the GPU supports paged virtual memory, which makes it possible to work with larger datasets

The net result is that the CPU and GPU can work together much more efficiently, and it should be easier to write applications that take advantage of both
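The difference between the older copy-based model and a shared-address model such as hUMA can be illustrated with a toy analogy. No real GPU is involved: ordinary Python lists stand in for memory, `gpu_double` is a hypothetical stand-in for a GPU kernel, and cache coherence and paging are not modelled.

```python
def gpu_double(buffer):
    """Stand-in for a GPU kernel: doubles every element in place."""
    for i in range(len(buffer)):
        buffer[i] *= 2

# Copy-based model: the GPU works on its own copy of the data.
cpu_data = [1, 2, 3]
gpu_copy = list(cpu_data)               # copy CPU memory -> GPU memory
gpu_double(gpu_copy)
cpu_after_copy_model = list(gpu_copy)   # copy the results back
copies_needed = 2

# Shared-address model (hUMA-like): both sides use the same buffer.
shared = [1, 2, 3]
gpu_double(shared)                      # no copies; the CPU sees the result directly

print(cpu_after_copy_model, shared, copies_needed)
```

Both models produce the same result, but the shared model needs no transfers, which is where the efficiency gain described above comes from.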


AMD HSA


Kaveri APU implements the HSA


ARM architecture

ARM is a family of instruction set architectures for computer processors based on a RISC architecture

The primary business of ARM Holdings, a British multinational semiconductor and software design company, is selling IP cores, which licensees use to create microcontrollers and CPUs based on those cores. The original design manufacturer combines the ARM core with other parts to produce a complete CPU

Today, the ARM architecture is licensed for use by many companies, including Apple, Intel, LG, Microsoft, NEC, Nintendo, Nvidia, Sony, Samsung, Sharp, Texas Instruments, Yamaha, and many more


ARM architecture

Processors based on designs licensed from ARM are used in all classes of computing devices, from microcontrollers in embedded systems (including real-time safety systems, smart TVs and all modern smartwatches) up to smartphones, tablets, laptops, servers and supercomputers/HPC

According to ARM Holdings, in 2010 alone, producers of chips based on ARM architectures reported shipments of 6.1 billion ARM-based processors, representing 95% of smartphones, 35% of digital televisions and set-top boxes and 10% of mobile computers


ARM architecture

The ARM architecture is one of the most successful on the planet

The original ARM architecture was heavily influenced by the Berkeley RISC architecture

ARM has a number of RISC features, such as a large register set, fixed-length instructions, and a purely load-store architecture

A modern ARM chip supports several instruction sets (ARM, Thumb, or Thumb-2; this increases complexity of the instruction decoder)
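Why fixed-length instructions simplify decoding can be shown with a short sketch: instruction boundaries follow from the width alone, without inspecting the bytes. The helper `split_fixed` is hypothetical, and the byte stream is arbitrary filler rather than real machine code. Thumb-2, which mixes 16- and 32-bit encodings, is deliberately not covered.

```python
def split_fixed(code: bytes, width: int) -> list[bytes]:
    """Split a code stream into fixed-width instruction words."""
    if len(code) % width != 0:
        raise ValueError("stream length is not a multiple of the instruction width")
    return [code[i:i + width] for i in range(0, len(code), width)]

ARM_WIDTH = 4    # classic ARM instructions are 32 bits long
THUMB_WIDTH = 2  # original Thumb instructions are 16 bits long

stream = bytes(range(8))  # 8 bytes of placeholder data, not real instructions

print(len(split_fixed(stream, ARM_WIDTH)))    # 2
print(len(split_fixed(stream, THUMB_WIDTH)))  # 4
```

Supporting several such sets means the decoder must first know which mode it is in, which is the added complexity the slide mentions.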


ARM processor

Here we see a quad-core Cortex processor for a wide range of devices, from mobile devices to servers


ARM A15 MPCore

Main components of Cortex-A15 MPCore are:

floating-point unit, performing operations on single- and double-precision numbers; the NEON extended instruction set is implemented here, with media and signal processing operations and additional 64- and 128-bit registers; SIMD operations are carried out on 8-, 16- and 32-bit integers and 32-bit floating-point numbers;

integer ALU, which generates 40-bit physical addresses, making it possible to address up to 1 TB of memory (an individual thread uses only a 32-bit address);


ARM A15 MPCore

32 kB data and 32 kB instruction L1 caches on each core, designed for minimum latency and power consumption; they implement data transparency measures supporting multi-core environments, as well as error checking and correction (ECC);

SCU (Snoop Control Unit) is responsible for managing the interconnect, arbitration, communication, cache-to-cache and system memory transfers, cache coherence and other capabilities for the processor;


ARM A15 MPCore

128-bit CoreLink CCI-400 provides AMBA 4 AXI™ Coherency Extensions (ACE) compliant ports for full coherency between multiple Cortex-A15 MPCore processors, better utilizing caches and simplifying software development

This is essential for high bandwidth applications including gaming, servers and networking that require clusters of coherent single and multicore processors
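The address-space figures quoted for the Cortex-A15 (40-bit physical addresses, 32-bit addresses per thread) can be checked with a few lines of arithmetic. The constant names are illustrative only; sizes use binary units.

```python
PHYS_ADDR_BITS = 40  # physical address width quoted for Cortex-A15
VIRT_ADDR_BITS = 32  # address width seen by an individual thread

GIB = 2 ** 30
TIB = 2 ** 40

phys_bytes = 2 ** PHYS_ADDR_BITS
virt_bytes = 2 ** VIRT_ADDR_BITS

print(phys_bytes // TIB, "TiB physical")   # 1 TiB physical
print(virt_bytes // GIB, "GiB per thread") # 4 GiB per thread
```

The 8 extra address bits give 256 times the reach of a plain 32-bit address.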


ARM big.LITTLE

big.LITTLE is a heterogeneous computing architecture developed by ARM Holdings coupling (relatively) slower, low-power processor cores with (relatively) more powerful and power-hungry ones

Each 'big' core is paired with a 'LITTLE' core; each pair operates as one virtual core, and only one real core of the pair is (fully) powered up and running at a time

The 'big' core is used when demand is high, the 'LITTLE' core when demand is low
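The switching rule can be sketched as a simple threshold decision. This is a toy model: the function `select_core` and the 0.6 threshold are hypothetical, and a real scheduler would also use hysteresis and account for the cost of migrating between cores.

```python
def select_core(load: float, threshold: float = 0.6) -> str:
    """Pick which core of a big.LITTLE pair should run, given load in [0, 1]."""
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be between 0 and 1")
    return "big" if load > threshold else "LITTLE"

loads = [0.1, 0.3, 0.9, 0.7, 0.2]           # sampled demand over time
active = [select_core(load) for load in loads]

print(active)  # ['LITTLE', 'LITTLE', 'big', 'big', 'LITTLE']
```

Only the selected core of the pair is powered at each step, which is how the design saves energy at low demand.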


ARM big.LITTLE


Processors for servers

A server is a system (an operating system plus the appropriate hardware) designed to provide a variety of services over a network: data files, computing, email, etc.

Servers must meet extremely high requirements for reliability, speed, and external storage capacity

Although conventional processors can be used in servers, special server processors are also developed and produced


Processors for servers

Server processors are distinguished by: a larger number of cores, a larger L3 cache, support for hyperthreading, higher reliability, the ability to work in multi-processor systems, higher energy consumption, and a high price.


Processors for servers

Intel server processors are known as Xeons (recent processors: Xeon E3, E5, E7; for example, Xeon E7-2870: 2400 MHz, 10 cores, $4227, up to 30 MB L3, 130 W, 4-channel DDR3 support)

AMD server processors are known as Opterons (recent processors: series 4300, 6100, 6200, 6300, A1100; for example, Opteron™ 6180 SE: 2500 MHz, 12 cores, $1514, 2x6 MB L3, 140 W, 4-channel DDR3 support)

IBM server processors are known as Power processors (Power 6, Power 7)