sh-4r technical overview - dcemulationdcemulation.org/1-newsdump/qrandom/dc stuff/info/sh... · the...

SH-4R The High-End Multimedia architecture 2

SH-4R RISC Multimedia enhancements 2

Two-way Superscalar support - The leading performance per MHz 2

Integrated Floating Point Unit co-processor (FPU) - Embedding your PC Software 3

DSP/SIMD instruction - The true Multimedia acceleration 3

16-bit Instruction length - The smallest memory footprint 4

The SH-4R architecture features 5

Register configuration 5

Instruction set - The full compatibility 6

Flexible data format 6

Five-stage pipeline 6

Optimised cache architecture 7

Store queues - high speed burst transfers 7

Very large address space 8

Low power consumption 8

Optimised technology & quality 8

SH-4R Architecture

Technical Overview

2

SH-126 MIPS

SH-278 MIPS

CacheMAC

SH-3260 MIPS

MMU

SH-51000 MIPS

SIMPFPU

SH-DSP130 MIPS

DSP

SH-3-DSP260 MIPS

DSP

SH-62000 MIPS

VLIW

SH-4R>480 MIPS

Superscalar FPU

NEW

ROM FLASH

Deeply Embedded

MMU Cache

Multimedia Core

64-bit

Instruction Set Upward Compatibility

DSP S

uppo

rt

Instruction Queue

Decoder 1 Decoder 2

Integer Unit

LoadStoreUnit

FloatingPointUnit

BranchUnit

Dedicated Core for Ded i cated Markets

SH-4R Supersca lar Arch i tec ture

SH-4R – The High-end Multimedia architecture

The standard RISC (Reduced Instruction Set Computing) features

have been extended by Hitachi to create an enhanced microprocessor

core dedicated for embedded applications requiring DSP and

multimedia accelerations as well as high performance general

purpose functions.

The SH-4R core is a first-class choice for multimedia applications

such as car infotainment systems, game consoles, multimedia

terminals, video conferencing, interactive TV and many others.

Algorithms such as speech codecs, advanced audio MP3, voice

recognition, MPEG motion video processing and 3D-graphics

calculations benefit directly from this dedicated architecture.

SH-4R RISC Multimedia enhancements

The SH-4R architecture has been optimised to provide higher performances and lower power consumption at a reduced price, whilst keeping

the full software compatibility with the former SH4 core. This means that moving to the SH-4R is really the simplest way to boost your system

performance without adding additional cost associated with redesign of your hardware and software environment.

The main improvements in the SH-4R core are the higher frequency supported, and the change of the cache architecture (doubled in size and

two-way set associative). This considerably reduces the cache misses and significantly increases the overall system performance. Also the DMA

controllers have been extended to handle more media stream transfers.

The new technology reduces the power consumption considerably and allows running at a higher speed whist reducing the cost/performances

ratio.

Two-way Superscalar support – The leading performance per MHz

The Superscalar implementation allows the decoding and

execution of two instructions in parallel. This generates a

maximum performance of up to two instructions per cycle (1.5

instruction average) including FPU.

The Superscalar architecture allows an integer processing power of

more than 1.8MIPS/MHz (Dhrystone1.1)

Stage 1

A single DIF butterfly

Stage 4

16-tap, 40-sampleblock FIR(1.6 MACs/cycle)

1024-point,radix-2 FFT(35.4k cycles)

In1

In2

Out1

Out2-1 W=a+jb

= x + ... +

yi

yi+1

yi+2

yi+3

c0c1c2c3

xi

xi+1

xi+2

xi+3

xi+1

xi

xi+1

xi+2

xi-2

xi-1

xi

xi+1

xi-3

xi-2

xi-1

xi-

=x

In1_rIn1_iIn2_rIn2_i

Out1_rOut1_iOut2_rOut2_i

10ab

01-ba

10-a-b

01b-a

x

c12c13c14c15

xi-12

xi-11

xi-10

xi-9

xi-13

xi-12

xi-11

xi-10

xi-14

xi-13

xi-12

xi-11

xi-15

xi-14

xi-13

xi-12

rayd

Rnr1 r2 r3 0

d1

d2

d3

0

x=I

Multiplication3D graphicsgeometry

Multiplication

Surface judgementbrightness calculation source

Inner Product

A single SuperH instructionFTRV can perform thismatrix-vector multiplicationevery 4 cycles

A single SuperH instructionFIPR can perform thisinner product multiplicationevery cycle

= x

Y1Y2Y31

x1x2x31

c11c21c31c41

c12c22c32c42

c13c23c33c43

c14c24c34c44

3

Integrated FPU co-processor – Embedding your PC Software

The SH-4R Core includes an IEEE-754 compliant floating point co-processor, which supports single-precision (32-bits) and double-precision

(64-bits) numbers. The SH-4R FPU can process an impressive seven million polygons per MHz.

The FPU is a major building block required in systems running multimedia applications coming from the PC market. Most of the software

developers programming on a PC-based environment are using FPU variables for their convenience. The transfer of this software on the

SH-4R architecture will be straightforward and directly optimised by the compiler by using the FPU.

DSP / SIMD instruction – The true Multimedia acceleration

The matrix math enhancements contribute to the graphics quality and are also suited to accelerate certain signal-processing algorithms.

The SH-4R instruction set includes several SIMD (Single Instruction – Multiple Data) instructions which dramatically accelerate algorithms

based on vector and matrix arithmetic. Vectors and matrixes with a dimension of four are operands and coefficients of these single-issue

instructions. The 128-bit floating-point vector engine executes the highly efficient SIMD instructions like matrix operations (FTRV) and inner

product of two vectors (FIPR) required for 3D graphics processing and data (de)compression.

The other hardware DSP implementations supported are scalar square root, absolute value, divide, and multiply-and-accumulate operations.

SH-4R FPU used to accelerate MP3, MPEG-4, Speech Recognition, VoIP algorithms.SH-4R running at 240MHz decodes Windows Media, Audio and Video at 24 frames/sec, on a QVGA LCD and 16 bits per pixel.

Power fu l 4 -way FPU for 3 -D graph i c s

Versat i l i t y o f the SH-4R ar i thmet i c acce lerator

4

Lower memory footpr in t w i th SH

16-bit Instruction length - The smallest memory footprint

A fixed 16-bit instruction length offers a very high code density to save cost for storage memory (external RAM, ROM and cache) and solves

the bandwidth bottleneck problem of conventional 32-bit RISC architectures. On a 32-bit memory access, two instructions can be fetched in

parallel, reducing the necessary memory accesses by a factor of two.

There is an average code size ratio of 2:3 between a SuperH processor and other RISC processors. Providing a lower total system cost.

Memory footprint saved.

Effective memory footprintrequired with SH.

SDRAM

Flash

Standard RISC32-bit Opcode

SDRAM

Flash

SH RISC16-bit Opcode

The SH-4R architecture features

The SH-4R Core employs a common 32-bit RISC archi-

tecture designed by Hitachi to meet the requirement of

multimedia and high performance embedded applica-

tions.

The SH architecture has the following basic features:

• Load/Store

• Register orientation

• 32-bit internal data path

• RISC-type instruction set

• 4 Gbyte address space

• Five-stage RISC instruction pipeline

Register configuration

• General Purpose 32-bit Register bank

• 32-bit Control Registers

• 32-bit System Registers

• 32-bit Shadow registers

The SH architecture enables arithmetic and logical

instructions to operate normally on the 32-bit general-

purpose registers. Special load/store instructions are

provided to transfer data from memory to registers

and vice versa. The diagram on the right shows the

basic general purpose 32-bit register bank which is

used for source and destination operands. In addition

to the 16 registers the SH-4R has 8 x 32-bit shadow

registers, which can be accessed in the so-called

privilege mode. As well as the general purpose

registers, the SH architecture supports four system

registers providing a Program Counter (PC), Procedure

Register (PR), and two 32-bit Multiply and Accumulate

Registers (MACH/MACL). The block of control

registers contains the Status Register (SR), the Global

Base Register (GBR), the Vector Base Register (VBR),

the Saved Status Register (SSR) and the Saved Program

Counter (SPC).

5

I-Cache (16KB)

CPG

INTC

SCI(CH1,2)

RTC

TMU

26-bit Address 64-bit Data

SH-4 CPU Core FPUUBC

D-Cache (32KB)ITLB UTLBCache &TLB Controller

Lower 32-bit Data

Lower 32-bit Data

BSC DMAC

External BusInterface

32-b

it A

ddre

ss (

Inst

ruct

ion)

Per

iphe

ral A

ddre

ss

16-b

it P

erip

hera

l Dat

a B

us

32-b

it D

ata

(Ins

truc

tion

)

Upp

er 3

2-bi

t D

ata

32-b

it A

ddre

ss (

Dat

a)

29-b

it A

ddre

ss

32-b

it D

ata

32-b

it D

ata

Add

ress

64-b

it D

ata

64-b

it D

ata

64-b

it D

ata

(Sto

re)

32-b

it D

ata

(Sto

re)

32-b

it D

ata

(Loa

d)

29-bit Address

32-bit Data

32-bit Data

MACH (Multiply and Add Accumulator High)

MACL (Multiply and Add Accumulator Low)

PC (Procedure Register)

31

SR (Status Register)

GBR (Global Base Register)

VBR (Vector Base Register)

31

ROGENERAL PURPOSE

REGISTERS

CONTROLREGISTERS

SYSTEMREGISTERS

PROGRAMCOUNTER

R1

R13

R14

R2

31

PC (Program Counter)31

0

0

0

0

31

31

31

0

0

0

ALL

R15, SP (Stack Pointer)

SR (Saved Status Register)

RO (Shadow Registers)

R2

SPC (Saved Program Counter)

SH3, SH3-DSP, SH4

R7

SH7750R Funct iona l B lock Diagram

SH Arch i tec ture Genera l Purpose Reg i s ter Bank ,Contro l Reg i s ters and System Reg i s ters

6

IFIDEXMA

= Instruction Fetch= Instruction Decode= Execution= Memory Access

WBmm^)

= Write Back= Multiplier Operation= If instruction also uses the multiplier pipeline, contention can occur

Next instruction

Third instruction in series

: Slot

: Slot

IF ID EX .....

IF ID EX .....

Instruction A IF ID EX

Next instruction

Third instruction in series

.....

: Slot

IF ID EX .....

IF ID EX .....

IF ID EX .....

Instruction A IF ID EX

Next instruction^)

Third instruction^)

IF ID EX MA WB .....

IF ID EX MA WB .....

MULS.W IF ID EX MA mm mm

Inst ruc t ion P ipe l in ing Examples

Instruction set – The full upward compatibility

The instruction set has been carefully chosen to provide a high-

level language orientation, thus simplifying programming of the

individual devices.

A major strength of the SH-4R core is the instruction set’s

upward compatibility with SH1, SH2 and SH3, which allows

customers to move easily from one SH core to another. Each

SH core is designed and targeted for specific applications with

different requirements such as integration, performance and

cost.

The SH-4R supports up to 91 types of instructions including data

transfer, arithmetic operations (with on-chip multiplier), logic

operations, shift, branch, system control and DSP/FPU classes.

The SH-4R C and C++ compilers are optimised to use the

maximum benefit of each class of instructions.

Flexible data format

Memory data formats are classified into bytes,

words and longwords. The SH-4R supports big

and little endian mode. This choice is a major

advantage to simplify and ease the connection of

external devices around the SH-4R processors

Five-stage pipeline

The advantage of using a RISC approach can be seen by the

pipeline mechanism, allowing very high clock frequencies. The

pipelining mechanism of the SH provides a single cycle peak

throughput for the basic instructions (two in the case of the

Superscalar SH-4R). The pipeline is automatically reduced if an

instruction does not need all stages, and extended if an instruction

needs some more latency cycles to be completed or if pipeline

contention occurs. To reduce pipeline penalties, a delay-slot

mechanism has been provided, reducing pipeline-breakage.

SH architecture provides a full software compatibility from 10MIPS to more than 1000 MIPS

SH-156 types

SH-262 types

SH-368 types

SH-491 types

32-bit multiplier/

accumulator

MMU instructions

SH3-DSP 160 typesMMU & DSP instructions

SH-DSP 154 typesDSP instructions

FPU, Graphicsinstructions

Word0

Longword

Byte0 Byte1 Byte2 Byte3

Word1

Rn

31 23

AddressA

AddressA

Big endian

AddressA+4

AddressA+8

AddressA+8

AddressA+4

AddressA

AddressA+1

15

AddressA+2

7 0

AddressA+3

Word1

Longword

Byte3 Byte2 Byte1 Byte0

Word0

31 23

AddressA+11

Little endian

AddressA+10

15

AddressA+9

7 0

AddressA+8

Inst ruc t ion Set Upward Compat ib i l i t y

Memory data formats , Byte , Word and Longword Al ignment

Running at 200MHzWind River Systems Inc.,JWorks 4.0Operating System: VxWorks

SH7750R

SH7750S

800

Com

pres

sion

Tim

e

700

600

500

400

300

200

100

0

_202

_jess

_201

_com

pres

s

_202

_jack

_202

_java

c

_202

_db

_222

_mpe

gaud

io

PerformanceImprovements

27%13%

30% 16%

23%19%

7

Optimised cache architecture

The SH-4R cache architecture has been significantly optimised over the SH4 Core to boost the real time capabilities and overall system

performances.The instruction cache and operand cache are handled separately. The global cache architecture is a two-way set associative to

reduce cache misses, whilst keeping the latency to a minimum. The cache is doubled (16Kbyte instruction cache and 32KB operand cache).

The SH-4R includes large on-chip caches including copy-back and write-through

buffers to close the performance gap between the processing speed of the CPU and

the bandwidth of external SDRAMs.

This optimisation is especially important for applications using intensive memory

transfers, such as Java-based systems. Significant performance improvements are

achieved by optimising the SH4 cache architecture.

Impressive results of the new cache architecture running on operating systems suchas Linux, VxWorks™, Windows®CE or QNX

Direct

Two-way

Four-way

2 4 8 16 32Cache size

Miss rate

Fast data t rans fers to Frame buf fers us ing the SH-4R Store Queues

Spec JVM98 Benchmark - Compress ion t ime

Store queues – high speed burst transfers

The SH4(R) supports two 32-byte store

queues (SQ) to perform high-speed burst

writes to external memory. While the

contents of one SQ are being transferred to

external memory, the other SQ can be

written to without a penalty cycle. This

functionality is especially useful to transfer

video, graphic or display list data to the

frame buffer of the graphic processor.

GeometryProcessor

Direct Access

Display Controller

Display

Frame Buffer

SDRAM

SH-4R CP

U In

terf

ace

Loca

l SH

Bus

Fram

e B

uffe

rIn

terf

ace

Graphic Processor

Primitives of

Display List

SQ 1-32 Bytes

SQ 2-32 Bytes

Two-way se t -assoc ia t i ve cache :for opt imised la tency/cache miss ba lance

August 2002 Printed in Europe 19-040B

EUROPEAN HEADQUARTERS

Hitachi Europe Ltd.Whitebrook Park, Lower Cookham Road, Maidenhead, Berkshire SL6 8YA United KingdomTel: +44-1628 585000 Fax: +44-1628 585160 Email: [email protected]

(Please visit our website for contact details of our Hitachi Sales Offices in EMEA)

www.hitachi-eu.com/semiconductors/

Low power consumption

SH is a low power processor family designed to produce

best-in-class performance/watt ratio. The SH-4R core

features an advanced power saving mechanism with built-in

power down modes:

• Sleep, on-chip peripherals still run

• Standby, on-chip peripherals halt

• Module stand-by, specified module haltLatest technology further reduces the low power consumption ofthe SH-4R

Area 0

26-bitExternal Memory Space

32-bit32-bit

0x00000000

0x80000000

0xA0000000

0xC0000000

0xE0000000

2 GByte virtual space,

cacheable

0.5 GByte cacheable

0.5 GByte non-cacheable

0.5 GByte virtual cacheable

0.5 GByte control non-cacheable

2 GByte virtual space,

cacheableArea U0

0xE 000 0000

0xE 400 0000Address errorUser ModePrivileged Mode

Address error

Store queue area

Area 1

Area 2

Area 3

Area 4

Area 5

Area 6

Area 7

Area P0

Area P1

Area P2

Area P3

Area P4

Physical AddressSpace

Large SH-4R Memory Map

Very large address space

The large address space of the SH-4R facilitates the connection/mapping of external devices. The uniform and unsegmented address space

is managed by a Memory Management Unit (MMU) which translates addresses between the physical address space and the virtual one

required by operating systems. Up to 4Gbytes address space (448-Mbyte external memory space) is available.

The SH-4R architecture incorporates two processor modes: user mode and privileged mode. Normal program execution is done in user mode,

privileged mode is normally entered when an exception occurs.

Optimised technology and & quality

The SH-4R core from Hitachi is a leading architecture recognised

in applications running in difficult environments such as

automotive. Over the years, Hitachi has built a strong expertise by

providing high reliability devices in extended operating conditions

on the latest technology process. That is why the SH-4R is also

qualified for wide temperature ranges (-40C to + 85C).

sh-4r technical overview - dcemulationdcemulation.org/1-newsdump/qrandom/dc stuff/info/sh... · the...

Documents