COMPUTER ARCHITECTURE

Assoc.Prof. Stasys Maciulevičius

Computer Dept.

[email protected]

AMD road

The company began as a producer of logic chips, then entered the RAM chip business in 1975. That same year, it introduced a reverse-engineered clone of the Intel 8080 microprocessor

In February 1982, AMD became a licensed second-source manufacturer of Intel 8086 and 8088 processors

In 1991, AMD released the Am386, its clone of the Intel 386 processor

AMD's first in-house x86 processor was the K5, which was launched in 1996

2014 ©S.Maciulevičius


AMD road


Later came the K6 (1997), Athlon (K7, 1999) and Athlon XP (2001) processors

AMD's first dual-core server processor was the dual-core Opteron (2005)

After K8 came K10. In 2007, AMD released the first K10 processors: quad-core 3rd-generation Opteron processors. These were followed by the Phenom processor for the desktop. K10 processors came in dual-core, triple-core, and quad-core versions, with all cores on a single die

In January 2009, AMD released a new processor line, Phenom II, which came in dual-core, triple-core and quad-core variants


AMD K10 architecture

The new K10 architecture was based on the K8 architecture with some enhancements:

The fetch unit fetches 32 bytes (256 bits) per clock cycle from the L1 instruction cache, double what CPUs based on the K8 architecture could fetch per clock cycle (Intel CPUs based on the Core microarchitecture, such as Core 2 Duo, also fetch 32 bytes per clock cycle)

The use of a true 128-bit internal datapath. On previous CPUs based on the K8 microarchitecture the internal datapath was only 64 bits wide. This was a problem for SSE instructions, since the SSE registers, called XMM, are 128 bits long
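As a rough illustration of why the datapath width matters for SSE, the sketch below counts how many passes a 128-bit XMM operand needs through a datapath of a given width. The function name `passes_needed` and the one-pass-per-datapath-width model are simplifying assumptions made for this example, not a description of the real K8 or K10 pipelines.

```python
XMM_BITS = 128  # width of an SSE (XMM) register, as stated on the slide


def passes_needed(operand_bits: int, datapath_bits: int) -> int:
    """Passes needed to move one operand through the datapath (simplified model)."""
    # Integer ceiling division: a partial chunk still costs a full pass.
    return (operand_bits + datapath_bits - 1) // datapath_bits


if __name__ == "__main__":
    print("64-bit datapath (K8):  ", passes_needed(XMM_BITS, 64))
    print("128-bit datapath (K10):", passes_needed(XMM_BITS, 128))
```

Under this model a 64-bit datapath has to split every 128-bit operand in two, while a 128-bit datapath handles it in one pass.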



Barcelona



AMD’s APU

An accelerated processing unit (APU) is a processing system that includes additional processing capability designed to accelerate one or more types of computations outside of a CPU

This may include a graphics processing unit (GPU) used for general-purpose computing (GPGPU), a field-programmable gate array (FPGA), or similar specialized processing system


AMD’s APU

At the most basic level, AMD’s new Accelerated Processing Units combine general-purpose x86 CPU cores with programmable vector processing engines on a single silicon die

AMD’s APUs also include a variety of critical system elements, including memory controllers, I/O controllers, specialized video decoders, display outputs, and bus interfaces


AMD view on APUs


AMD’s APU

AMD announced the first-generation APUs, Llano for high-performance and Brazos for low-power devices, in January 2011

The second-generation Trinity for high-performance and Brazos-2 for low-power devices were announced in June 2012

The third-generation Kaveri for high performance devices was launched in January 2014, while Kabini and Temash for low-power devices were announced in summer 2013.


AMD Fusion

AMD Fusion is the marketing name for a series of APUs by AMD, aimed at providing good performance with low power consumption, and integrating a CPU and a GPU based on a mobile stand-alone GPU


AMD Fusion

The first demonstration of AMD Fusion was at Computex 2010 (Taipei, Taiwan, June 2, 2010)


New AMD core - Bulldozer

Bulldozer is the codename AMD has given to one of the CPU cores based on the AMD family 15h microarchitecture

Bulldozer was designed from scratch; it is not a development of earlier processors

AMD has introduced a new microarchitecture building block called a module

In terms of hardware complexity and functionality, a module is midway between a dual-core processor (in which each core is fully independent) and a single processor core that has two SMT threads (in which each thread shares most of the hardware resources with the other thread)


AMD Bulldozer core

A module consists of two tightly coupled, "conventional" x86 out-of-order processing engines

The two processing engines share the early pipeline stages (e.g. instruction fetch, decode), the FPUs, and the L2 cache


AMD Bulldozer core

Each of the two dedicated integer cores consists of two ALUs and two AGUs, which are capable of a total of 4 independent arithmetic and memory operations per clock per core

Duplicating the integer schedulers and execution pipelines gives each of the two threads dedicated hardware, which significantly increases performance in multithreaded integer applications

The second integer core increases the Bulldozer module die area by around 12%, which at chip level adds about 5% of total die space


AMD Bulldozer core

Each module has two symmetrical 128-bit FMAC (fused multiply-add capability) floating-point pipelines, which can be unified into one large 256-bit-wide unit when one of the integer cores dispatches an AVX instruction, and two symmetrical x87/MMX/SSE-capable FPPs for backward compatibility with software not optimized for SSE2

Multiple modules share an L3 cache as well as an Advanced Dual-Channel Memory Sub-System (IMC - Integrated Memory Controller)

A dual-core Bulldozer processor has a single module, a quad-core processor has two modules and an octo-core processor has four modules
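The module arithmetic from these slides can be sketched in a few lines. The class `BulldozerModule` and the helper `modules_for` are hypothetical names introduced for this example. The numbers are the ones quoted on the slides: two integer cores per module, two ALUs and two AGUs per core, and two shared 128-bit FMAC pipelines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BulldozerModule:
    """Simplified model of one module, using only the figures quoted on the slides."""
    integer_cores: int = 2       # two dedicated integer cores per module
    alus_per_core: int = 2
    agus_per_core: int = 2
    fmac_pipes: int = 2          # floating-point pipelines shared by both cores
    fmac_width_bits: int = 128

    def int_ops_per_clock(self) -> int:
        # One operation per ALU and per AGU, per core, per clock.
        return self.integer_cores * (self.alus_per_core + self.agus_per_core)

    def unified_fp_width_bits(self) -> int:
        # Both FMAC pipelines combined for one 256-bit AVX operation.
        return self.fmac_pipes * self.fmac_width_bits


def modules_for(cores: int, module: BulldozerModule = BulldozerModule()) -> int:
    """Number of modules in a processor with the given core count."""
    if cores <= 0 or cores % module.integer_cores != 0:
        raise ValueError("core count must be a positive multiple of cores per module")
    return cores // module.integer_cores


if __name__ == "__main__":
    for cores in (2, 4, 8):
        print(cores, "cores ->", modules_for(cores), "module(s)")
```

The model ignores everything the two cores share besides the FPU (fetch, decode, L2 cache), so it says nothing about achieved throughput.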


AMD Bulldozer core

The first shipments of Bulldozer-based Opteron processors began in September 2011

On 12 October 2011, AMD released the first four FX-series processors of the Bulldozer line (FX-8150, FX-8120, FX-6100, FX-4100)

AMD stated on its blog that “there are some in our community who feel the product performance did not meet their expectations”

AMD said that the remaining FX series AMD processors would be released at the end of the first quarter of 2012

AMD Piledriver


Improvements in the Piledriver

Improved branch prediction accuracy through a hybrid predictor augmented with a second-level predictor;

128- and 256-bit FMA3 instruction extensions (fused multiply-add) and F16C SSE5 instruction extensions (half-precision floating-point conversion);

Optimized schedulers;

Accelerated division through a modified execution unit;

Larger L1 TLB;

Improved L1 and L2 prefetchers that can work with variable-length patterns, including those crossing page boundaries;

Improved L2 cache efficiency through more aggressive eviction of unused data that the prefetcher algorithms loaded into the cache by mistake.
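To give a feel for what half-precision conversion does to a value, here is a minimal software sketch using Python's `struct` module, whose `e` format is the IEEE 754 16-bit format. It only imitates the effect of the conversion. F16C itself converts between half and single precision in hardware (the `VCVTPS2PH` and `VCVTPH2PS` instructions), whereas a Python `float` is double precision. The helper name `to_half_and_back` is an assumption of this example.

```python
import struct


def to_half_and_back(value: float) -> float:
    """Round a Python float to IEEE 754 half precision, then widen it again."""
    packed = struct.pack("<e", value)        # 'e' = 16-bit half-precision format
    return struct.unpack("<e", packed)[0]


if __name__ == "__main__":
    for x in (1.0, 0.1, 65504.0):
        print(x, "->", to_half_and_back(x))
```

Values that fit in the 10-bit half-precision mantissa, such as 1.0, survive the round trip unchanged. Values such as 0.1 come back slightly altered.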


New micro-architecture - x86 Steamroller

Steamroller is the third modular x86 architecture from AMD

It promises 15% to 20% higher performance per cycle and per watt than the Piledriver micro-architecture released in Trinity

It comes with a new integrated DDR3-2133 memory controller and PCI Express (PCIe) 3.0 support

The Steamroller-based Kaveri has up to 2 modules (4 integer cores) and 2 third-generation Flex FP floating-point units


AMD Steamroller

The focus of Steamroller is greater parallelism. Improvements center on:

independent instruction decoders for each core within a module,

25% greater maximum dispatch width per thread,

better instruction schedulers,

improved branch predictor,

larger and smarter caches, ….


AMD Steamroller …

up to 30% fewer instruction cache misses,

branch misprediction rate reduced by 20%,

dynamically resizable L2 cache,

micro-operations queue,

more internal register resources,

improved memory controller


From APU to HSA

AMD's first mainstream APU combined the CPU and a capable GPU — each with a separate slice of system memory — on the same chip

The Trinity APU added a memory management unit that allowed the GPU to see all of the physical system memory, shared power management, and support for OpenCL C++ and Microsoft C++ AMP

But the basic software model has remained the same; the CPU and GPU can't work together on the same data


From APU to HSA

The next step for HSA, heterogeneous Uniform Memory Access (hUMA), promises to solve this problem with three features:

the CPU and GPU use the same pointers (addresses) to access the entire memory space to read and write data;

they are cache coherent, so they can work on data at the same time without issues; and,

like the CPU, the GPU supports paged virtual memory, which makes it possible to work with larger datasets

The net result is that the CPU and GPU can work together much more efficiently, and it should be easier to write applications that take advantage of both
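The difference between the two software models can be imitated in plain Python. This is only an analogy, under the assumption that a `bytearray` stands for system memory and a loop stands for a GPU kernel. Nothing here touches a real GPU or any HSA API. The function names `run_with_copies` and `run_shared` are hypothetical.

```python
def run_with_copies(cpu_data: bytearray) -> bytearray:
    """Separate memories: data is copied to the 'GPU', processed, and copied back."""
    gpu_data = bytearray(cpu_data)            # copy 1: CPU memory -> GPU memory
    for i in range(len(gpu_data)):
        gpu_data[i] = (gpu_data[i] * 2) % 256
    return bytearray(gpu_data)                # copy 2: GPU memory -> CPU memory


def run_shared(shared: memoryview) -> None:
    """Shared address space: the 'GPU' updates the very buffer the CPU holds."""
    for i in range(len(shared)):
        shared[i] = (shared[i] * 2) % 256


if __name__ == "__main__":
    data = bytearray([1, 2, 3, 4])
    result = run_with_copies(data)            # 'data' itself is left unchanged
    run_shared(memoryview(data))              # 'data' is updated in place
    print(list(result), list(data))
```

In the copy model the result lives in a second buffer and two transfers are paid per call. In the shared model both sides refer to the same bytes through the same reference, which is the property hUMA aims to provide in hardware.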


AMD HSA


Kaveri APU implements the HSA


ARM architecture

ARM is a family of instruction set architectures for computer processors based on a RISC architecture

ARM Holdings is a British multinational semiconductor and software design company. Its primary business is selling IP cores, which licensees use to create microcontrollers and CPUs based on those cores. The original design manufacturer combines the ARM core with other parts to produce a complete CPU

Today, the ARM architecture is licensed for use by many companies, including Apple, Intel, LG, Microsoft, NEC, Nintendo, Nvidia, Sony, Samsung, Sharp, Texas Instruments, Yamaha, and many more


ARM architecture

Processors based on designs licensed from ARM are used in all classes of computing devices, from microcontrollers in embedded systems (including real-time safety systems, smart TVs and all modern smartwatches) up to smartphones, tablets, laptops, servers and supercomputers/HPC

According to ARM Holdings, in 2010 alone, producers of chips based on ARM architectures reported shipments of 6.1 billion ARM-based processors, representing 95% of smartphones, 35% of digital televisions and set-top boxes and 10% of mobile computers


ARM architecture

The ARM architecture is one of the most successful on the planet

The original ARM architecture was heavily influenced by the Berkeley RISC architecture

ARM has a number of RISC features, such as a large register set, fixed-length instructions, and a purely load-store architecture

A modern ARM chip supports several instruction sets (ARM, Thumb and Thumb-2), which increases the complexity of the instruction decoder


ARM processor

Here we see a quad-core Cortex processor for a wide range of devices, from mobile devices to servers

ARM A15 MPCore

Main components of the Cortex-A15 MPCore are:

floating-point unit, performing operations on single- and double-precision numbers;

NEON, an extended instruction set for media and signal processing, with additional 64- and 128-bit registers; its SIMD operations work on 8-, 16- and 32-bit integers and 32-bit floating-point numbers;

integer ALU, which generates 40-bit physical addresses, making it possible to address up to 1 TB of memory (an individual thread uses only 32-bit addresses);
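The address-space and SIMD figures above follow from simple arithmetic, sketched below. The helper names `addressable_bytes` and `simd_lanes` are introduced only for this example. The lane count assumes elements are packed with no padding.

```python
TIB = 2 ** 40  # bytes in one tebibyte (the slide's "1 TB")
GIB = 2 ** 30  # bytes in one gibibyte


def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with byte addressing and the given address width."""
    return 2 ** address_bits


def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Elements of a given size that fit in one SIMD register."""
    return register_bits // element_bits


if __name__ == "__main__":
    print("40-bit physical:", addressable_bytes(40) // TIB, "TiB")
    print("32-bit per thread:", addressable_bytes(32) // GIB, "GiB")
    for element in (8, 16, 32):
        print(f"128-bit register, {element}-bit elements:",
              simd_lanes(128, element), "lanes")
```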


ARM A15 MPCore

32 kB data and 32 kB instruction L1 caches on each core, designed for minimum latency and power consumption; they implement data coherency mechanisms for multi-core environments, as well as error checking and correction (ECC);

SCU (Snoop Control Unit), responsible for managing the interconnect, arbitration, communication, cache-to-cache and system memory transfers, cache coherence and other capabilities of the processor;


ARM A15 MPCore

128-bit CoreLink CCI-400 provides AMBA 4 AXI™ Coherency Extensions (ACE) compliant ports for full coherency between multiple Cortex-A15 MPCore processors, better utilizing caches and simplifying software development

This is essential for high bandwidth applications including gaming, servers and networking that require clusters of coherent single and multicore processors


ARM big.LITTLE

big.LITTLE is a heterogeneous computing architecture developed by ARM Holdings, coupling (relatively) slower, low-power processor cores with (relatively) more powerful and power-hungry ones

Each big/LITTLE pair operates as one virtual core, and only one real core of the pair is (fully) powered up and running at a time

The 'big' core is used when demand is high, the 'LITTLE' core when demand is low
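The switching rule for one big/LITTLE pair can be sketched as a small state machine. The two thresholds and the function names below are hypothetical values chosen for illustration. In a real system the decision is made by the operating system's power management, and the sketch does not model the cost of migrating state between cores.

```python
UP_THRESHOLD = 0.80    # hypothetical: move to the 'big' core above this load
DOWN_THRESHOLD = 0.30  # hypothetical: move back to 'LITTLE' below this load


def next_core(current: str, load: float) -> str:
    """Pick which real core of a big/LITTLE pair should run next."""
    if current == "LITTLE" and load > UP_THRESHOLD:
        return "big"
    if current == "big" and load < DOWN_THRESHOLD:
        return "LITTLE"
    return current  # between the thresholds: stay put to avoid flapping


def simulate(loads, start="LITTLE"):
    """Return the active core after each load sample."""
    active, history = start, []
    for load in loads:
        active = next_core(active, load)
        history.append(active)
    return history


if __name__ == "__main__":
    print(simulate([0.1, 0.9, 0.5, 0.2]))
```

Using two thresholds instead of one keeps the pair from switching back and forth when the load hovers around a single cut-off.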


ARM big.LITTLE




Processors for servers

A server is a system (an operating system plus the appropriate hardware) designed to provide a variety of services over the network: data files, computing, email, etc.

Servers have to meet extremely high requirements for reliability, speed and external storage capacity

Although conventional processors can be used in servers, special server processors are also developed and produced


Processors for servers

Server processors are distinguished by: a larger number of cores, a larger L3 cache, support for hyper-threading, higher reliability, the ability to work in multi-processor systems, higher energy consumption and a high price.


Processors for servers

Intel server processors are known as Xeons (recent processors: Xeon E3, E5, E7; e.g. Xeon E7-2870, 2400 MHz, 10 cores, $4227, up to 30 MB L3, 130 W, 4-channel DDR3 support)

AMD server processors are known as Opterons (recent processors: series 4300, 6100, 6200, 6300, A1100; e.g. Opteron™ 6180 SE, 2500 MHz, 12 cores, $1514, 2x6 MB L3, 140 W, 4-channel DDR3 support)

IBM server processors are known as Power processors (Power 6, Power 7)
