CPU Complex - NXP Semiconductors

Purpose
• The intent of this module is to provide you with an overview of the i.MX31 CPU complex.

Objectives
• Describe the ARM1136 core platform.
• Identify features of the ARM1136JF-S processor.
• Describe the two levels of caches in the CPU complex.
• Identify the purpose of the Smart Speed™ switch.

Content
• 15 pages
• 3 questions

Learning Time
• 25 minutes

Module Introduction

The intent of this module is to provide you with an overview of the CPU complex of the i.MX31 processor. You will learn about the ARM1136JF-S™ processor, the cache write strategy, and the Level 2 (L2) cache system. You will also learn about the Smart Speed™ switch and the Vector Floating-Point (VFP) co-processor. Unless specifically mentioned, all information in this module applies to both the i.MX31 and the i.MX31L.

ARM1136 Core Platform

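[Figure: ARM1136 core platform block diagram showing the 1136 core with a 16 Kbyte I cache and a 16 Kbyte D cache, the FPU, the ETM and a 4 Kbyte ETB, the L2 cache controller with 128 Kbytes of L2 cache, the interrupt controller, the patch block, and the Smart Speed™ switch (multi AHB crossbar) connecting the primary AHB and alternate bus masters to AHB 1, 2, 3, peripheral interfaces 1 and 2, and the memory controller.]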

Let’s start by looking at the Freescale ARM®1136 core platform. The CPU complex of the i.MX31 consists of the ARM1136JF-S processor, an L2 cache system, the Smart Speed switch, and an ARM11™ Vector Interrupt Controller (AVIC).

The multilevel cache system consists of a powerful L2 Cache Controller (L2CC) that has been optimized by ARM® to Freescale specifications with 128 Kbytes of unified L2 cache memory and an integrated L2 cache monitor. The L1 cache provides 16 Kbytes for instruction and 16 Kbytes for data.

The Smart Speed switch, otherwise known as the 6 × 5 Multi-Layer AHB Crossbar switch (MAX), allows for up to five simultaneous transactions to occur in parallel, giving the performance of up to a 665 MHz bus.
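The 665 MHz figure is consistent with five transfers proceeding in parallel on a 133 MHz bus. The following is a minimal sketch of that arithmetic. The 133 MHz AHB clock is an assumption inferred from 665 / 5, and the function names are hypothetical.

```python
# Assumption: each crossbar layer runs at a 133 MHz AHB clock (inferred from 665 / 5).
AHB_CLOCK_MHZ = 133
MAX_PARALLEL_TRANSFERS = 5   # up to five simultaneous transactions
DATA_WIDTH_BYTES = 4         # 32-bit data bus at every port

def equivalent_bus_mhz(clock_mhz, parallel):
    """Clock a single shared bus would need to match the crossbar's peak rate."""
    return clock_mhz * parallel

def peak_bandwidth_mbytes(clock_mhz, parallel, width_bytes):
    """Peak aggregate throughput, assuming one transfer per cycle per layer."""
    return clock_mhz * parallel * width_bytes

print(equivalent_bus_mhz(AHB_CLOCK_MHZ, MAX_PARALLEL_TRANSFERS))
print(peak_bandwidth_mbytes(AHB_CLOCK_MHZ, MAX_PARALLEL_TRANSFERS, DATA_WIDTH_BYTES))
```

The peak is reached only when five masters target five different slaves in the same cycle; two masters requesting the same slave must be arbitrated.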

The VFP11 Floating Point Unit (FPU) is an ARM-enhanced IEEE 754 numeric co-processor that can be used to support and enhance 3D graphics, gaming, high resolution audio, Java™ and other general-purpose applications.

ARM1136 Core Features

• High performance core platform

• ARM1136 core with:

– 8 stage pipeline

– 16 Kbyte instruction and 16 Kbyte data caches

– 64-bit data paths to memory offer increased bandwidth

– Jazelle hardware for Java acceleration

• Vector Floating Point Unit (VFP)

• Trace module with 4 Kbyte buffer for SW debug

• Freescale Smart Speed switch:

– 5 simultaneous 32-bit transfers increase performance

– Programmable priorities optimize system performance

• 128 Kbytes L2 cache for up to 30 percent increased system performance

– Freescale was the lead partner with ARM

• Enhanced hardware-assisted interrupts for faster response

• Flexible power management techniques

• Dynamic voltage frequency scaling modes:

– High speed: 532 MHz @ 1.45V

– Medium speed: 400 or 266 MHz @ 1.1V

– Idle speed: 133 MHz @ 1.1V


ARM1136JF-S Processor

Let’s look at the core of the ARM11 platform, which is the ARM1136JF-S processor. In this module, it is referred to as simply the “ARM11.”

The ARM11 incorporates an integer unit that implements the ARM V6 architecture. It supports the ARM and Thumb™ instruction sets, Jazelle™ technology to enable direct execution of Java byte codes, and a range of SIMD DSP instructions that operate on 16-bit or 8-bit data values in 32-bit registers.
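To illustrate what operating on 16-bit values in 32-bit registers means, here is a minimal sketch of a lane-wise add, similar in spirit to the ARM V6 UADD16 instruction. The function name is hypothetical, and status flags set by the real instruction are not modeled.

```python
MASK16 = 0xFFFF

def add_packed_halfwords(a, b):
    """Add two 32-bit values as two independent 16-bit lanes (modulo 2**16)."""
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo

# One operation performs two 16-bit additions. The carry out of the low
# lane is discarded instead of spilling into the high lane.
result = add_packed_halfwords(0x0001FFFF, 0x00010001)
print(hex(result))  # 0x20000
```

An ordinary 32-bit add of the same operands would give 0x30000, because the low-lane carry would propagate into the upper half.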

ARM1136JF-S Features

• Synthesizable
• ARM V6 architecture:
– ARM, Thumb, Jazelle
– Mixed endian support
– Unaligned data support
– Physically addressed caches
– Media extensions
• High performance core:
– 8-stage pipeline
– Branch prediction
– Return stack
• VFP co-processor
• Fast Interrupt mode
• 16 Kbyte I- and D- caches


ARM1136JF-S: Key Benefits

• ARM V6 architecture
– ARM and Thumb instruction sets
– Jazelle technology enabling direct execution of Java byte codes
– A range of SIMD DSP instructions which operate on 16-bit or 8-bit data values in 32-bit registers
• Power and area efficient
• Synthesizable design
• Complete set of supporting system IP
• Backwards compatible with previous ARM processors
• Provides full virtual memory capabilities
• Physical address tagging for caches and Application Space Identifiers (ASIDs)
– Reduces overhead on context switches
– Reduces cache invalidation and refill
– Saves cycles and power


ARM V6 Benefits

Improved:

• CPU efficiency and performance

• Multimedia performance

• Real-time performance

• Data sharing with non-ARM execution units

• Application portability from non-ARM processors

• Unaligned and mixed endian support

The ARM1136 is the first processor implementation of the ARM V6 architecture. Let’s look at how this architecture improves some CPU and multimedia functionalities.

The ARM V6 architecture improves CPU efficiency and performance. It improves multimedia performance through media processing extensions, with MPEG-4 encode/decode that is two times faster, and audio DSP that is faster, than on the ARM926. It improves real-time performance through faster exception and interrupt handling, vectored interrupt support, a reduced latency mode, and new stack and mode change instructions that make interrupt entry three times faster. The ARM V6 architecture also improves data sharing with non-ARM execution units, application portability from non-ARM processors, and unaligned and mixed endian support.

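ARM1136 Processor Block Diagram

[Figure: ARM1136 processor block diagram showing the ARM11 core, the prefetch unit, the Load Store Unit (LSU), the L1 instruction side and data side cache controllers with the I cache and D cache, the main TLB, the VFP, the external co-processor interface, debug (JTAG, ETM), the VIC, system metrics, and the instruction fetch, data read, data write, and peripheral ports.]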

Let’s continue to examine the ARM11 processor by looking at a functional block diagram. The features include an integer unit, an eight-stage pipeline, branch prediction with return stack, low interrupt latency, an external co-processor interface, co-processors 14 and 15, instruction and data MMUs (managed using a micro TLB structure backed by a unified main TLB), and instruction and data caches (including a non-blocking D cache with Hit-Under-Miss). Note that the caches are virtually indexed and physically addressed, and there is a 64-bit interface to both caches.

Other features include a write buffer that can be bypassed, a high-speed Advanced Microcontroller Bus Architecture (AMBA) L2 interface supporting prioritized multiprocessor implementations, an AMBA bus interface (AHB-lite protocol), a Floating Point co-processor, trace support, JTAG-based debug, and a Load Store Unit (LSU).

The ARM11 processor features an interrupt service to quickly determine the interrupt source and branch to the interrupt service routine. The ARM11 solution contains an Interrupt vector port, and a Vector Interrupt Controller (VIC).

Question

The ARM1136 is the first processor implementation of the ARM V6 architecture. What are some of the ARM V6 architecture improvements? Select all that apply and then click Done.

CPU efficiency and performance

Vectored interrupt support

Data sharing with non-ARM execution units

5-stage pipeline

Done

Consider this question concerning the ARM1136 processor.

Correct.

The V6 architecture improvements include CPU efficiency and performance, real-time performance (which includes vectored interrupt support), and data sharing with non-ARM execution units. The ARM11 processor has an eight-stage pipeline, not a five-stage pipeline.

ARM V6 Memory Model

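[Figure: ARM V6 memory model. Within the SOC, the ARM platform contains the ARM core (registers R0 to R15, instruction prefetch, load, and store paths), CP15 configuration/control, address translation from virtual address to physical address, the Level 1 caches, and the Level 2 cache. Additional processor(s), SRAM, and ROM sit alongside the platform, and the EMI connects to DRAM, SRAM, Flash, and ROM.]

• Level 1 cache memory fully defined in ARM V6
• Hierarchy and memory order support for Level 2 cache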

Now let’s look at the V6 memory model to explore cache in greater detail. There are two levels of caches in the CPU complex. Level 1 (L1) consists of separate instruction and data caches, a write buffer, two micro TLBs backed by a main TLB, Application Space Identifiers (ASIDs), and memory system attributes. The Level 1 cache memory subsystem is fully defined in ARM V6, and ARM V6 also has hierarchy and memory order support for the Level 2 cache. The Level 2 cache is unified, and will therefore hold both instruction and data elements.

The cache is virtually indexed and physically addressed. Line length is fixed at eight words. The ARM1136 cache is four-way set-associative. A particular address may be stored in one of four locations within the cache. To check for a cache hit for a non-sequential access, address comparisons must be performed with four different tag values. To prevent this comparison from reducing the maximum core clock frequency, there is a minimum one cycle latency between the comparison matching and the writing of data to that cache line. This requires a small Write buffer to be implemented in the cache, to allow written words to be held until they can be written.
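The geometry described above (16 Kbytes, four ways, eight-word lines) fixes how an address splits into tag, index, and offset. The following is a minimal sketch of that split. It treats the address as a single value and does not model the distinction between the virtual index and the physical tag; the names are illustrative.

```python
CACHE_BYTES = 16 * 1024   # 16 Kbyte L1 cache
WAYS = 4                  # four-way set-associative
LINE_BYTES = 8 * 4        # eight 32-bit words per line

SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # 128 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 5 bits select a byte in the line
INDEX_BITS = SETS.bit_length() - 1          # 7 bits select the set

def split_address(addr):
    """Return (tag, index, offset) for a 32-bit address."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x80001234)
print(hex(tag), index, offset)  # 0x80001 17 20
```

A lookup reads the four tags stored in the selected set and compares each with the tag of the access, which is the four-way comparison the paragraph describes.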

Cache-related Definitions

• Line: Smallest loadable unit of a cache that is always a block of contiguous words in memory.

• Tag: The portion of a memory address that is stored within the cache to identify the particular physical address located there.

• Set: The set of cache lines that can hold data from a particular memory location.

• Way: The number of cache lines in each set is the number of “ways” in the cache.

• Index: The portion of the memory address that determines the set in which the cache line may be stored.


Now, let’s look at cache write strategies. The write buffer is used to decouple memory writes from the core. Data is placed in the buffer at core speed and is written to memory at bus speed in parallel. A FIFO holds a set of addresses, a set of data words, and size information. A sequence of data words in the write buffer requires only the first address. The address of a new access may be compared against write buffer addresses. A separate FIFO is maintained for cache Write Back operations. This avoids complications associated with performing an external write while handling a write-through store operation.

With write-through, if the location is in the cache, the memory update is stored in the cache and in the write buffer, which performs the write so that the main processor does not have to slow down to main memory speed.

With Write Back, if the location is in the cache, only the cache is updated and the “dirty” bit is set to show that the cache line must be written back to main memory before the line is reused.

Please note that if the data location is not contained within the cache, the data will be written directly to memory. The write buffer will be used if the region is bufferable or cacheable.
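The difference between the two strategies can be sketched with a small model. This is a simplified illustration, not the hardware design: it uses a single FIFO where the text describes a separate one for Write Back operations, it assumes every region is bufferable, and the class and method names are hypothetical.

```python
from collections import deque

class TinyCache:
    """Toy cache with a write buffer, tracking one value per address."""

    def __init__(self, policy):
        self.policy = policy          # "WT" (Write Through) or "WB" (Write Back)
        self.lines = {}               # address -> [value, dirty]
        self.write_buffer = deque()   # FIFO of (address, value)
        self.memory = {}

    def fill(self, addr):
        """Bring a location into the cache, as a read allocate would."""
        self.lines[addr] = [self.memory.get(addr, 0), False]

    def write(self, addr, value):
        if addr not in self.lines:
            # Miss: the data goes to memory (through the buffer in this sketch).
            self.write_buffer.append((addr, value))
        elif self.policy == "WT":
            # Hit, Write Through: update the cache and queue the memory write.
            self.lines[addr] = [value, False]
            self.write_buffer.append((addr, value))
        else:
            # Hit, Write Back: update only the cache and mark the line dirty.
            self.lines[addr] = [value, True]

    def drain(self):
        """Empty the write buffer into memory at bus speed."""
        while self.write_buffer:
            addr, value = self.write_buffer.popleft()
            self.memory[addr] = value

    def evict(self, addr):
        """Reuse a line; a dirty line must be written back first."""
        value, dirty = self.lines.pop(addr)
        if dirty:
            self.write_buffer.append((addr, value))
```

With "WT", memory holds the new value as soon as the buffer drains. With "WB", memory is stale until the dirty line is evicted and the buffer drains.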

Cache Write Strategy

[Figure: Data paths between the CPU, the cache, the write buffer, and external memory (the L2 memory system) for Write Through (WT) and Write Back (WB).]

Write Through: If the location is within the cache, the cache is updated. The write is also sent to memory via the write buffer.

Write Back: If the location is within the cache, only the cache is updated.

C  B  Access Mode
0  0  Non cacheable, non bufferable
0  1  Non cacheable, bufferable
1  0  WT, Write Through
1  1  WB, Write Back
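The C and B bits in the access mode table select one of four behaviors. A minimal sketch of that decoding follows; the dictionary and function names are hypothetical, and only the four combinations listed in the table are covered.

```python
# (C, B) bit pairs mapped to the access modes listed in the table.
ACCESS_MODES = {
    (0, 0): "Non cacheable, non bufferable",
    (0, 1): "Non cacheable, bufferable",
    (1, 0): "Write Through",
    (1, 1): "Write Back",
}

def access_mode(c, b):
    """Return the access mode selected by the cacheable (C) and bufferable (B) bits."""
    return ACCESS_MODES[(c, b)]

for c in (0, 1):
    for b in (0, 1):
        print(c, b, access_mode(c, b))
```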

ARM L2 Cache

• Improves the performance of computer systems when significant memory traffic is generated by the CPU

• Fastest memory access is via the L1 cache, followed closely by the L210; access is significantly slower to the main memory (L3)

• Is 128 Kbytes on the ARM1136 core platform

• Has a fixed line length of 32 bytes, 8 words

• Supports lockdown format C

• Has eight-way associativity, which can be directly mapped

Now, moving on to ARM L2 cache, this cache improves the performance of computer systems when significant memory traffic is generated by the CPU.

Memory access is fastest to the L1 cache, followed closely by the ARM L210™. Memory access is significantly slower to the main memory (L3).

The L2 cache on the ARM1136 core platform is 128 Kbytes.

The L2 cache has a fixed line length of 32 bytes, or 8 words.

The L2 cache supports lockdown format C with separate way locking mechanisms for data and instructions.

The L2 cache has eight-way associativity, which can be directly mapped, depending on the use of lockdown registers.

ARM L2 Cache

• Data RAM is byte-writeable.

• L2 cache has support for:
– Write Through, read allocate
– Write Back, read allocate
– Write Back, read and write allocate

• Write allocate override option allows for allocation on write misses in the ARM L210.

• L2 cache performs critical word first refilling, with the option of refilling starting with word 0.

• A pseudo-random victim selection policy can be made deterministic with use of lockdown registers.

• L2 cache has increased performance by 25 to 75 percent, extended battery life, and reduced memory cost.

Continuing with the features of the L2 cache, data RAM is byte-writeable.

The L2 cache has support for the following cache modes: Write Through, read allocate; Write Back, read allocate; and Write Back, read and write allocate.

The write allocate override option allows for always having allocation on write misses in the ARM L210.

The L2 cache performs critical word first refilling, with the option of refilling starting with word 0.
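As an illustration of the two refill options, the sketch below lists the order in which the eight words of a line would arrive. The wrap-around order after the critical word is an assumption (the module only states that the critical word comes first), and the function name is hypothetical.

```python
WORDS_PER_LINE = 8  # 32-byte line, 8 words

def refill_order(critical_word, critical_first=True):
    """Return the order in which the words of a line are fetched."""
    if not critical_first:
        # Option: always start the refill at word 0.
        return list(range(WORDS_PER_LINE))
    # Assumed wrapping order: start at the requested word, then wrap around.
    return [(critical_word + i) % WORDS_PER_LINE for i in range(WORDS_PER_LINE)]

print(refill_order(5))                        # [5, 6, 7, 0, 1, 2, 3, 4]
print(refill_order(5, critical_first=False))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With critical word first, the word that caused the miss is available to the core at the start of the refill instead of after up to seven other words.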

The L2 cache has a pseudo-random victim selection policy, which can be made deterministic with the use of lockdown registers.
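A minimal sketch of how locking ways makes victim selection deterministic is shown below. It models lockdown as a set of way numbers excluded from replacement; the real lockdown register layout is not modeled, and the function name is hypothetical.

```python
import random

WAYS = 8  # eight-way associative L2 cache

def pick_victim(locked_ways, rng=random):
    """Choose a way to evict pseudo-randomly among the ways that are not locked."""
    candidates = [w for w in range(WAYS) if w not in locked_ways]
    if not candidates:
        raise ValueError("all ways are locked; nothing can be allocated")
    return rng.choice(candidates)

# With no ways locked, any of the eight ways may be chosen.
print(pick_victim(set()))

# With seven ways locked, only one candidate remains, so the choice is fixed.
print(pick_victim({0, 1, 2, 3, 4, 5, 6}))  # 7
```

Leaving a single way unlocked also means each set has only one place for new lines, which is how an eight-way cache can behave as a directly mapped one.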

The ARM L210 L2CC and the accompanying 128 Kbytes of memory, combined with the ARM1136JF-S processor, can increase performance by 25 to 75 percent and extend battery life while reducing memory cost. By bringing more data on-chip and closer to the CPU, the ARM L210 L2CC helps remove the performance-limiting bandwidth constraints associated with off-chip memory.

Now let’s move on to the VFP co-processor. The VFP co-processor is an ARM-enhanced IEEE 754 numeric co-processor that supports and enhances 3D graphics, gaming, high resolution audio, Java and other general-purpose applications.

The VFP co-processor supports high-performance, short-vector operations in registers that can be addressed as short vectors. The VFP co-processor also features a long pipeline for floating-point MAC operations, with the stages decode, issue, execute (E1 through E8), and Write Back. Also featured is a separate divide and square root pipeline that supports load, store, and arithmetic operations in parallel with a divide or square root operation. This reduces the latency impact of divide and square root operations.
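The idea of addressing registers as short vectors can be sketched as one multiply-accumulate applied across several consecutive registers. This is an illustrative model only: register banks, wrap-around within a bank, and how the hardware is told the vector length are not modeled, and the function name is hypothetical.

```python
def short_vector_mac(regs, dest, src_a, src_b, length):
    """Apply dest[i] += a[i] * b[i] across `length` consecutive registers."""
    for i in range(length):
        regs[dest + i] += regs[src_a + i] * regs[src_b + i]
    return regs

# Thirty-two registers modeled as a flat list of floats.
regs = [0.0] * 32
regs[8:12] = [1.0, 2.0, 3.0, 4.0]        # vector a
regs[16:20] = [10.0, 20.0, 30.0, 40.0]   # vector b
regs[24:28] = [1.0, 1.0, 1.0, 1.0]       # accumulator

short_vector_mac(regs, dest=24, src_a=8, src_b=16, length=4)
print(regs[24:28])  # [11.0, 41.0, 91.0, 161.0]
```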

The VFP includes a separate load/store pipeline that enables load and store operations to be done in parallel with data processing operations.

For VFP instruction throughput, most single precision and double precision data processing operations have single-cycle execution. Loads are bandwidth balanced to sustain FMAC operations. Two single precision values or one double precision value can be transferred each cycle.

Many calculation functions are supported in hardware, including multiplication, absolute value, and square root.

VFP Co-processor

High-performance, short-vector operations
• Registers can be addressed as short vectors

Separate load/store pipeline
• Load/store operations done in parallel with data processing operations

Separate divide/square root pipeline
• Supports load/store and arithmetic operations in parallel with divide/square root operation
• Reduces latency impact of these operations

Long pipeline for floating point MAC operation
• Decode, Issue, Execute (E1), E2, E3, E4, E5, E6, E7, E8, Write Back

Calculation functions supported in hardware
• Multiply, add, multiply-add, subtract, multiply-subtract, negate, negate multiply, negate multiply-add, negate multiply-subtract, absolute value, compare, convert, divide and square root, conversions

Single cycle execution
• Loads are bandwidth balanced to sustain FMAC operations

Smart Speed Switch

[Figure: Smart Speed switch within the ARM1136 core complex. Master ports 0 to 5 and slave ports 0 to 4 connect the ARM1136JF core, L2 cache, AVIC, ROMC with 32K ROM, RAMC with 16K RAM, eDMA, USB OTG/Hosts, ATA, MPEG4 encoder, IPU, GPU, and EMI, with 32-bit and 64-bit paths marked. AIPI #1 and AIPI #2 bridge to the peripherals: SSI, SIM, SD/MMC x 2, IIM, UART, UART x 4, CSPI, Mem Stick x 2, One Wire, AudioMux, Keypad, RTIC, ECT, I2C x 3, SCC, IOMUXC, GPIO x 3, PWM, EPIT x 2, FIRI, GPT, Watchdog, RNGA, RTC, CCM/CGM/PLL, and SJC.]

The purpose of the Smart Speed switch is to concurrently support up to five simultaneous connections between master devices 0 to 5 and slave devices 0 to 4. It supports 32-bit address bus width and 32-bit data bus width at all master and slave ports. The ARM11 platform implements a six master by five slave configuration. The Smart Speed switch supports two arbitration schemes that are independently programmable for each slave device: the simple fixed-priority algorithm and simple round-robin fairness algorithm.

The Smart Speed switch allows for concurrent transactions to occur from any master device to any slave device. It is possible for five master devices and all slave devices to be in use at the same time due to independent requests. The Smart Speed switch can gain control of the slave devices and prevent any masters from making any accesses to the slave devices. This is useful if the user wishes to turn off all the clocks and ensure that no bus activity will be interrupted. The Smart Speed switch can put each slave port in low power park mode so the slave will not dissipate any power when not being accessed by a master port.
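The two per-slave arbitration schemes can be sketched as follows. This is a simplified model with hypothetical function names; the priority order is passed in as a list, and the actual register programming is not modeled.

```python
def fixed_priority(requests, priority_order):
    """Grant the requesting master that appears first in priority_order."""
    for master in priority_order:
        if master in requests:
            return master
    return None  # no master is requesting this slave

def round_robin(requests, last_granted, num_masters=6):
    """Grant the next requesting master after last_granted, wrapping around."""
    for step in range(1, num_masters + 1):
        master = (last_granted + step) % num_masters
        if master in requests:
            return master
    return None

# Masters 2 and 4 both request the same slave port.
pending = {2, 4}
print(fixed_priority(pending, [0, 1, 2, 3, 4, 5]))  # 2
print(round_robin(pending, last_granted=2))         # 4
print(round_robin(pending, last_granted=4))         # 2
```

Under fixed priority, master 2 wins every contested cycle; under round-robin, the grant alternates between the two requesters, so neither can starve the other.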

Question

Which cache write strategy is illustrated in this graphic? Select the response that applies and click Done.

a. Write Back

b. Rewrite

c. Write Through

d. Read/Write

[Figure: Data paths between the CPU, the cache, the write buffer, and external memory (the L2 memory system).]

Done

Let’s see if you can remember the cache write strategies.

Correct.

Cache write strategies consist of Write Through and Write Back. The cache write strategy shown here is Write Through. For Write Through, if the location is within the cache, the cache is updated and the write is also sent to memory via the write buffer. For Write Back, if the location is within the cache, only the cache is updated.

Question

Which of the following statements about the CPU complex of the i.MX31 are correct? Select all that apply and then click Done.

The Smart Speed switch concurrently supports up to 5 simultaneous connections between master devices and slave devices.

The L2CC and the accompanying 128 Kbytes of memory, combined with the ARM1136JF-S processor, do not increase performance.

The VFP co-processor supports and enhances 3D graphics, gaming, high resolution audio, Java and other general-purpose applications.

The L1 cache improves the performance of computer systems when significant memory traffic is generated by the CPU.

Done

Please select all the statements that accurately describe aspects of the i.MX31 CPU complex.

Correct.

The purpose of the Smart Speed switch is to concurrently support up to 5 simultaneous connections between master devices and slave devices. The ARM L210 L2CC and the accompanying 128 Kbytes of memory, combined with the ARM1136JF-S processor, can increase performance by 25 to 75 percent and extend battery life. The VFP co-processor supports and enhances 3D graphics, gaming, high resolution audio, Java and other general-purpose applications. The L2 cache improves the performance of computer systems when significant memory traffic is generated by the CPU.

Module Summary

• ARM1136 core platform

• ARM1136JF-S processor

• ARM V6 architecture

• Caches in the CPU complex

• Cache write strategies

• ARM Level 2 cache

• VFP Co-processor

• Smart Speed switch

In this module, you learned about the various components of the i.MX31 CPU complex. First, you learned about the ARM1136 core platform, ARM1136JF-S processor, and the benefits of ARM V6 architecture, which include increased CPU efficiency and performance and multimedia performance. Next, you learned about the two levels of caches in the CPU complex: L1 and L2. Specifically, you learned about cache write strategies and the ARM Level 2 cache. Finally, you learned about the VFP co-processor, which supports and enhances 3D graphics, gaming, high resolution audio, Java and other general-purpose applications, and the Smart Speed switch, which can concurrently support up to five simultaneous connections between master devices and slave devices.
