Architecture and instruction set design of an ATM network processor
Gary Jones, Elias Stipidis*
Communications Research Group, School of Engineering and Information Technology, University of Sussex, Falmer, Brighton BN19QT, E. Sussex, UK
Received 6 December 2002; revised 1 March 2003; accepted 18 March 2003
Abstract
Microprocessor architectures are diversifying to support niche market requirements, with growing emphasis for performance delivery on
the architectural design rather than the silicon implementation. This paper outlines the architectural design, programmer’s model and
instruction set of a microprocessor, which adopts a novel approach to network data. In particular, Asynchronous Transfer Mode (ATM) cells
are delivered to a special FIFO cache memory, located at the heart of the processor. Cell input and output is conducted at wire speed using
dedicated streaming input and output hardware. Special read and write instructions then allow the cell payloads to be accessed directly, and
transferred from/to the register file. Multimedia applications have previously been identified as an important market for such a network
centric architecture. Therefore the paper ends with a demonstration of the power of some key instructions. A motion estimation kernel from
the MPEG standard is used to exercise the architecture and instruction set. Execution speed is shown to be comparable to today’s processors,
using only a 400 MHz clock for a full search. The minimally resourced design is therefore suited to embedded network applications from
both economic and performance standpoints.
© 2003 Elsevier B.V. All rights reserved.
Keywords: Network processor architecture; Asynchronous transfer mode
1. Introduction
This paper presents a novel microprocessor architecture
currently under development within the School Of Engin-
eering and Information Technology, at the University of
Sussex. It describes the architectural design work, pro-
grammer’s model, an overview of the instruction set and the
results of simulating code kernels from multimedia
applications.
The design captures the essence of the conceptual design
produced earlier by Neil Cashman [1]. This aimed to
revolutionise the internal architecture of a computer,
replacing the traditional shared busses with Asynchronous
Transfer Mode (ATM) network links. In doing so the
architecture would address both the limitations of bus based
systems and the perceived future requirement to process
increased quantities of multimedia data.
The original SMART design was an entire computer
architecture built of many components. These included the
Processor Node (PN), a custom designed processing
architecture fed directly with ATM data streams; it is the
PN that is the focus of this paper. The computer
architecture, called SMART, is named after its main
characteristics:
Smooth protocol boundaries
Multimedia processing
ATM bandwidth
Real-time processing
Transfer of processes
There are many others who see the internal busses as
weak links in the ever increasing performance of computers
[2–4]. We feel that this other work has not yet resulted in a
viable alternative computer because it all relies on
commercial microprocessors. This implies that bus
architectures and traditional memory arrays have to be used at
some stage. Therefore these designs have not replaced busses,
but merely moved them.
The Processing Unit (PU) described in this paper is the
first attempt to break free of the use of commercial
microprocessors for such a computer architecture. Analysis
of the previous conceptual design highlighted the benefits of
a network based computer architecture, although the actual
PU was excessively complex and limiting to the overall
0141-9331/03/$ - see front matter © 2003 Elsevier B.V. All rights reserved.
doi:10.1016/S0141-9331(03)00064-4
Microprocessors and Microsystems 27 (2003) 367–379
www.elsevier.com/locate/micpro
* Corresponding author.
E-mail address: [email protected] (E. Stipidis).
performance. In order to overcome the significant hurdles of
development, implementation and manufacture, the design
has been constrained by several key axioms:
1. Retain the proven architectural advantages of ATM as
the underlying transfer mechanism.
2. Retain the original aim of raising the network connection
to the highest priority.
3. Simplify the operation and architecture, specifically the
management and control.
4. Target the design at a System-On-Chip (SOC) solution,
without being implementation specific in the architec-
tural design.
5. Address the growing speed differential between pro-
cessor speeds and memory access times.
Some elaboration is required on these. Simplifying the
architecture reduces the resources that are required to
implement it. Keeping the design within the realms of
today’s leading edge ASIC and FPGA technology will also
keep it within tomorrow’s cost effective manufacture. Full
custom design is far too expensive to realise and would
further hinder the chances of implementing the design.
Small/medium scale designs aimed at the embedded market
stand a greater chance of success than trying to compete
with the likes of Intel and SUN Microsystems.
Modern FPGA technology offers an expedient, realistic
and attainable route to implementing this microprocessor
design. For example, the leading devices from Xilinx, the
Virtex-II family, currently includes one million program-
mable gates, and the road map extends to ten million gates
[5]. Furthermore, the support software that is used to
compile, synthesise and simulate the designs for such FPGAs
is available and affordable.
The growing disparity between the speeds of processor
clocks and memory access times results in performance-
crippling latencies for anything other than first level cache
accesses. This is widely seen as a significant threat to the
continued growth in real-world application performance
[6–10]. Major design compromises have been made by the
likes of Intel in order to retain the backwards compatibility
of their existing instruction sets whilst extending the
underlying hardware architectures. Our work presented
herein is an opportunity to start with a ‘clean slate’, and
therefore is also an opportunity to redress this speed
differential.
The following sections present an overview of the new
PU, which is composed of two distinct, but highly
interdependent subsections: the ATM Cell Processing
Engine (ATM-CPE) and the ATM Cell Cache (ATMCC).
Section 2 of this paper describes the overall architecture, its
main functional blocks and their general operation. It pays
particular attention to the novel features of the design.
Section 3 presents the Programmer’s Model. This is the
set of registers that the end user/programmer will ‘see’ and
use. These are closely related to the hardware operation, and
provide an insight into the functional operation of the
processor.
An overview of the instruction set is given in Section 4,
together with the performance results of a code kernel from
a real-world video compression algorithm. Section 5 is
a summary and conclusion.
2. Architecture
Essentially the PU is composed of two halves that are
closely interconnected. The first is the ATM Cell Processing
Engine (ATM-CPE) and the second is the ATMCC. These
can be seen in the diagram of the simplified SMART PU
shown as Fig. 1.
At this stage it is sufficient to consider the ATMCC as a
FIFO of ATM cells. Movement of cells through the FIFO is
completely asynchronous from the ATM-CPE, and uses
hardware to implement all the functions necessary for its
operation. This hardware assistance is the
key to obtaining the high performance for a network
processor.
As previously mentioned, the ATM-CPE is a micro-
processor, developed on a “clean sheet”. In comparison to
the original design it is considerably simplified and more
streamlined whilst retaining the key characteristics.
Fig. 2 depicts the main functional units and interconnec-
tions of the new SMART PN design. Further details of each
item are provided below.
If the limitations of today’s FPGA and ASIC technology
are embraced, the maximum clock speed must be accepted to
be significantly lower than that offered by leading edge full
custom design. Therefore, the design maximises the work
done in each instruction, reducing its reliance on brute force
clock speeds.
As can be seen from Fig. 2, the ATM-CPE is a
superscalar design. It can simultaneously execute three
instructions, contained in a 128-bit Very Long Instruction
Word (VLIW). Using this technology forces the onerous
task of instruction scheduling onto the compiler. However,
it also greatly simplifies the hardware design because the
out-of-order scheduling logic is removed entirely. This logic
has been identified as a restriction on clock speeds, and as a
difficult unit to design and test [11,12]. Such technology is now
being favoured by companies such as Intel and SUN
Microsystems for their new designs [13,14].
Fig. 1. Simplified diagram of the new SMART processor unit architecture.
The ATM-CPE uses multiple memory interfaces to
ensure high-speed operation can be maintained. The
architecture of the ATM-CPE can be termed a ‘modified
Harvard architecture’ in that it is based upon separate
instruction and data interfaces. However, the additional
ATMCC interface provides the main data input and output
path. This modification does not directly affect the Harvard
architecture operation, but can allow use of the data memory
interface for other tasks.
The implementation of memory interfaces and the
organisation of the data and instruction caches are not new
to microprocessors, so these topics will only be briefly
discussed. It is widely recognised that placing memory
closer to the processor core significantly helps to reduce
access latency. However, it is not possible to integrate all
of the desired memory onto a single die. The extent of the
integration is implementation specific, and as such only
generic interfaces are included in the architectural design.
2.1. Data cache and data memory
The data access port is used for all non-ATM memory
accesses. The hierarchical organisation and the implemen-
tation of the cache will both be implementation specific.
Providing on-chip cache would be advantageous although
not crucial.
Data memory organisation supports reading and writing
of execution unit registers; therefore, each addressable
location is 128-bits wide. Although wider than the data
units supported by today’s microprocessors, it is not
uncommon for those processors to use 128-bit busses for
cache accesses [15,16]. This is because pin speed is a
limiting factor to transferring data on and off the chip, which
forces the use of wider busses in order to support higher data
rates.
Only complete registers (memory locations) can be
transferred in DC bus cycles, as sub-word memory
operations increase the complexity of the memory interface.
Furthermore, the ATM-CPE is a load/store architecture,
preventing memory resident data from being used by
computation instructions. Finally, transfers are
restricted to only one location/register per cycle, and this
instruction must be executed on one particular
execution unit.
2.2. Instruction cache and instruction memory
The benefit of separating the instruction and data caches
is that the organisation of each can be individually tailored
to suit the different requirements. For the Instruction
memory and cache this is to deliver one VLIW per cycle.
Like the data cache described above, the design of the
cache, memory and memory interfaces is not new to
microprocessors. Indeed the issue widths of modern processors
such as the DEC Alpha 21264 and HP PA-8500 mean that
the instruction memory interface is often wider than the data
port. Both the Alpha 21264 and the PA-8500 supply 128-
bits of instruction in one cycle (four 32-bit instructions) to
feed the execution units [17,18]. The intended instruction
width and issue rate of the ATM-CPE will not extend this,
ensuring the instruction bus design is also not a risk to its
design and implementation.
A dedicated DMA channel exists to support transfers
between the ATMCC and instruction memory. This unloads
the ATM-CPE from having to facilitate the transfers in
software, thus dedicating more processing time to the
application execution. Transfers between instruction mem-
ory and the ATMCC facilitate network update/transfer of
applications. A second DMA channel exists for the data
cache memory, and provides similar services for data.
2.3. ATM cell cache
The ATMCC was a novel feature of the original SMART
PU design and remains so for the new PU. Data enters and
leaves the SMART architecture in streams of ATM cells. It is
therefore necessary to decouple the architecture from the
network, and the ATMCC performs this function. Using a
series of hardware pointers, the ATMCC forms a
content-addressable memory that stores complete cells organised
according to their header numbers. Control Registers are used
to facilitate control of the ATMCC functions, including
accessing the pointers and determining the cache status.
Hardware assistance is used for the wire speed cell input
and output functions, which are entirely separate from one
another. Both of these are repetitive operations, fixed by the
ATM standards and required for every cell. Effectively
these operations comprise a combination of SMART
functionality and lower layer network functions. For
example, the cell input hardware would perform the cell
delineation, header error checking, and storage of the cell in
the appropriate queue within the FIFO. The output functions
include header generation (VPI/VCI and HEC as appro-
priate), timely cell transmission, and multiplexing cells
from different queues within the FIFO.
A tight integration between the ATMCC and the ATM-
CPE is crucial for true wire speed processing. Therefore
Fig. 2. The new SMART processing unit architecture.
the ATMCC, above any other memory, should be
implemented on the same silicon as (or as closely combined
as possible to) the ATM-CPE and operated at the core clock
frequency. Combined with the fact that it only needs to be
relatively small by existing cache standards (tens of KB) it
can be designed such that all accesses are completed in one
cycle. This is aided by the pointer mechanisms and by the
reliable knowledge that the payload being accessed is
present in the cache.
The interface between the ATMCC and the execution
units is 384-bits wide. This is to support the transfer of a
whole cell payload in one operation. With careful instruc-
tion scheduling it is possible for this large transfer to occur
without any modifications to the general register file. Wide
internal buses have long been a part of microprocessors, and
do not pose significant problems if constrained to short
distances and point-to-point transfers. Given the large
amount of work completed in this one transfer, these
restrictions do not significantly limit overall
performance.
It will be necessary for the ATMCC to be multi-ported in
order to prevent bottlenecks to the flow of stream data. Due
to its small size this is also not a significant problem. With
careful organisation it is possible for the stream input and
output to share one port, whilst the ATM-CPE (register
file/I-DMA/D-DMA) uses another.
2.4. ATMCC input
The operation of the input stage is shown in Fig. 3. The
ATM-CPE programs the look-up table that determines
which queue the cell payload will join. For whichever queue
has been chosen, an associated hardware write pointer
ensures the payload joins the correct place in the queue, and
also discards the majority of the header bytes. This is
possible because a specific FIFO queue is associated with a
particular cell stream, identified by a specific VPI/VCI
number, and the header is reconstructed at the time when the
cell is output (see Section 2.5). Any number of these
individual FIFOs may exist, and they are set up by the
operating system or application software.
Cells that do not conform to the predetermined
assignments can either be dropped or passed through to the
ATMCC output. In the latter case, the header values must be
retained, and as such the storage requirements for the FIFO
labelled ‘default cells’ in Fig. 3 will be increased to include the
stream number (the HEC CRC need not be stored as this is
generated for each cell as it is output). These cells leave the
ATMCC in the same order in which they arrived, and with
minimal delay.
As hardware pointers are used to store the cells, the ATM-
CPE does not need to be involved whenever a cell arrives.
Interrupts may be programmed for certain events on an
individual pointer. The ATM-CPE has sight of the hardware
pointer values through its control registers. These also
provide a means by which the ATM-CPE can read and write
the ATM cell header bytes. Payload transfer to the register
file is accomplished via a read instruction, which moves all
384-bits of data.
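The classification path just described can be sketched in software. This is a behavioural illustration only: the class and method names (ATMCCInput, add_stream, receive) are assumptions made for the sketch, not part of the SMART design, and a Python dictionary stands in for the programmable look-up table and hardware write pointers.

```python
class ATMCCInput:
    """Behavioural sketch of the ATMCC input stage: a look-up table,
    programmed by the ATM-CPE, maps a cell's VPI/VCI to a FIFO queue."""

    def __init__(self):
        self.lookup = {}               # (vpi, vci) -> queue id, set by the ATM-CPE
        self.queues = {"default": []}  # non-conforming cells keep their header here

    def add_stream(self, vpi, vci, qid):
        """Programme the look-up table to direct a cell stream to a queue."""
        self.lookup[(vpi, vci)] = qid
        self.queues[qid] = []

    def receive(self, vpi, vci, payload):
        qid = self.lookup.get((vpi, vci))
        if qid is not None:
            # Header bytes are discarded: the queue itself implies the stream,
            # and the header is reconstructed on output (Section 2.5).
            self.queues[qid].append(payload)
        else:
            # 'Default cells' retain their stream number for later regeneration.
            self.queues["default"].append(((vpi, vci), payload))
```

The sketch shows why the hardware can discard most header bytes on input: once a queue is bound to one VPI/VCI, storing the header per cell would be redundant.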
2.5. ATMCC output
The output is very much a mirror image of the input.
Each of the FIFOs marked for output has an associated
hardware read pointer that automatically generates the
address of the next cell to be read.
As shown in Fig. 4, the Statistical Time Division
Multiplexing (STDM) unit re-assembles the output cell
stream, performing the lower layer network functions of the
ATM reference model. It selects the next cell to output
based on a user tuneable STDM algorithm, where the OS or
application code may tune the algorithm through the
adjustment of certain coefficients.
Factors that would affect the STDM algorithm include
the following:
1. The relative importance of one particular stream to all of
the others. All FIFOs (or cell queues) will be given a
weighting, which the OS or application software can
Fig. 3. ATM cell cache input function and hardware assistance.
Fig. 4. ATM cell cache output function and hardware assistance.
modify to police the network traffic and control the
internal buffer levels.
2. Buffer size. This will impact the STDM algorithm in
much the same way as the importance rating. However,
the OS or application software may choose to allow the
FIFO fullness to modify the importance weighting such
that an element of automatic traffic policing and buffer
control is implemented.
3. ‘Forced’ cell transmission. The OS or application code
may choose to force a cell to be transmitted from any
queue at any time. This will override the next STDM
selection.
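The three factors above can be combined into a selection policy along the following lines. The scoring formula (static weight scaled by buffer fullness) is an assumption for illustration; the paper states only that queue weighting, buffer levels and forced transmission influence the choice, not how they are combined.

```python
def stdm_select(queues, forced=None):
    """Sketch of an STDM next-cell selection policy.

    queues: dict qid -> {"weight": float, "fill": int, "capacity": int}
    forced: queue id whose cell must be sent now, overriding the algorithm.
    """
    if forced is not None:
        return forced                 # factor 3: forced transmission wins
    best, best_score = None, -1.0
    for qid, q in queues.items():
        if q["fill"] == 0:
            continue                  # nothing queued for output
        # Factor 1 (weight) scaled by factor 2 (fullness): fuller queues
        # gain priority, giving automatic buffer-level control.
        score = q["weight"] * (q["fill"] / q["capacity"])
        if score > best_score:
            best, best_score = qid, score
    return best
```

Tuning the per-queue weights is then the coefficient adjustment mentioned above.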
A major task of the STDM stream generation unit, Fig. 4,
is to regenerate the cell header data. For each FIFO except
the ‘default cells’ queue, the stream number (VPI/VCI) will
be directly related to the FIFO from which it is taken. After
compiling this header data the unit calculates the HEC
value and inserts it before transmitting the recombined
cell. Where the header value already exists (the ‘default cells’
FIFO) only the HEC value will need to be calculated. This
function, and the others mentioned above are commonplace
in ATM hardware as they represent a selection of the
functions performed by the ATM and Physical layers.
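The HEC calculation itself is fixed by the standards: ITU-T I.432 defines it as a CRC-8 over the four preceding header bytes, using the generator polynomial x^8 + x^2 + x + 1, with the remainder XORed with the coset value 0x55. A minimal bit-serial sketch of what the output hardware computes:

```python
def atm_hec(header4):
    """CRC-8 (polynomial x^8 + x^2 + x + 1) over the four ATM header
    bytes, XORed with 0x55, per ITU-T I.432 HEC generation."""
    crc = 0
    for byte in header4:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ 0x07) & 0xFF  # 0x07 encodes x^2 + x + 1
            else:
                crc = (crc << 1) & 0xFF
    return crc ^ 0x55
```

In hardware this is normally a small combinational or table-driven circuit applied once per outgoing cell, which is why regenerating the HEC on output is cheaper than storing it per cell.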
2.6. Example of ATMCC usage: ATM cell processing
Obviously there is a reason for subdividing the cells into
separate queues, and this is to enable processing of the
payloads. Not all the incoming FIFOs will be used for the
output stream generation, and neither will all of the output
stream be made up of cells that were input to the ATMCC.
In fact, three scenarios exist for the behaviour of the FIFOs,
described below in conjunction with Fig. 5.
Firstly, it is possible to change any of the cell header
information without direct involvement of the micropro-
cessor. The output hardware reconstructs the header using a
stream number linked to the FIFO, but not necessarily the
same as that of the incoming cells. Therefore the ATM-CPE
can program the input hardware to separate a stream, or
streams, and then program the output hardware to write
those stream(s) with different VPI/VCI values (e.g. header 1
in Fig. 5). This function is only useful as part of a greater
scheme to process data, but demonstrates the ease with
which the PU can delegate tasks to the hardware assistance,
leaving itself free for other tasks. Whilst cells are queued in
a FIFO they can be read or written without further overhead.
The output hardware will only transmit those cells that are
older than the write pointer, and therefore the ATM-CPE
can ensure that a cell is output only after it is finished with.
Secondly the contents of the FIFO may be processed by
the ATM-CPE and not used to form an output stream
(header 2 in Fig. 5). The cells will still be queued as a FIFO
buffer. However, the output hardware does not select cells
from this FIFO. After the cells are no longer required they
can be released to be overwritten, simply by moving the
read pointer to the next cell that is to be retained.
Lastly, a stream of cells may be generated by the ATM-
CPE for output (header 3 in Fig. 5). In a process that is the
reverse of that described above, the ATM-CPE writes cells
into a FIFO and releases them for transmission by moving
its write pointer. Cells older than the write pointer will then
be ‘seen’ by the output hardware and transmitted as normal.
This area of ATMCC will not be allocated to the input
hardware and is therefore safe from being overwritten by
cells received from the ATM input.
Permutations and mixtures of the above three schemes
allow total flexibility in manipulating and using the cell
streams. When combined with the flexibility of the wider
architecture, which dynamically arranges the processing
topology, the SMART PU architecture can be applied to any
processing requirement.
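The read/write pointer discipline underlying all three scenarios can be sketched as a circular buffer standing in for one ATMCC FIFO. The class and method names are illustrative assumptions, not the SMART control-register interface.

```python
class CellFIFO:
    """Behavioural sketch of one ATMCC FIFO and its hardware pointers."""

    def __init__(self, size):
        self.cells = [None] * size
        self.size = size
        self.read_ptr = 0    # oldest cell still retained
        self.write_ptr = 0   # next free slot; only older cells are output

    def write(self, cell):
        # Scenario 3: writing a cell and advancing the write pointer
        # releases it for transmission.
        self.cells[self.write_ptr % self.size] = cell
        self.write_ptr += 1

    def transmittable(self):
        # The output hardware only 'sees' cells older than the write pointer.
        return [self.cells[i % self.size]
                for i in range(self.read_ptr, self.write_ptr)]

    def release(self, n=1):
        # Scenario 2: advancing the read pointer frees processed cells
        # to be overwritten, with no per-cell software overhead.
        self.read_ptr += n
```

The key property is that releasing or publishing cells is a single pointer move, so the ATM-CPE never copies data to hand cells over to the output hardware.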
3. Programmer’s model
The programmer’s model is the interface between the
processing hardware and the programmer. From this point
of view the SMART ATM-CPE has a similar appearance to
that of other traditional processors. The full programmer’s
model is shown in Fig. 6, and is explained below. The
resources seen in Fig. 6 are global resources, shared between
the three execution units.
In keeping with load/store Reduced Instruction Set
Computer (RISC) principles the ATM-CPE has a unified
register file of General Registers (GR). Additionally, there
are a number of Control Registers (CR) that form the
interface to the various peripheral functions and hardware
units. Several of the special purpose registers are mapped
into these CRs. Whilst they appear as CRs and can be read
with CR-move instructions, they must be written with
special instructions—e.g. a CALL updates the instruction
pointer (IP).
Much of the definition of the architecture and
programmer’s model is independent of the underlying hardware
implementation. This allows each implementation to use the
most suitable technology and to scale the architecture to
match the requirements of the application or environment.
Only a minimum subset of the architecture is compulsory.
Fig. 5. ATM cell cache operation.
This is a similar approach to that which is increasingly
being adopted by processor designers and manufacturers in
an attempt to add longevity to their designs, whilst
maintaining leading edge performance from technology
advances [19]. Rapid technological advances mean that
specifications which do not scale with increasing resources
are quickly outdated, and require much effort before another
generation of product can be produced.
It is commonplace for a microprocessor to include
different execution states (or privilege levels). These are
used, for instance, to allow the OS access to resources that
must be kept from general application execution. The most
privileged execution state of the ATM-CPE has a reserved
set of registers, shown within the dotted outline of Fig. 6.
These provide a low overhead context switch for the
operating system.
3.1. General registers
The GR behave very much like the registers of other
load/store architectures, in that they can be used freely as
sources or destinations for operations. This includes using
the same register as source and destination in the same
instruction. They can also be accessed as groups of three,
referred to as Register Groups (RGs). The groupings are
pre-assigned and unchangeable. RGs are formed from
logical combinations of GR, and use the normal hardware
and data paths—thus requiring little extra design. Access to
RGs requires special instructions, which are different to
normal register instructions. This minimises the instruction
size required, as only one reference needs to be made in
order to access three registers. Although RG instructions are
limited in scope, they are completed in the normal cycle
time, thereby avoiding scheduling problems and pipeline
bubbles.
All of the GR are 128-bits wide. This is for two main
reasons: the first is an extrapolation of microprocessor
design trends, and the second is the convenience of dividing
an ATM cell payload across three registers. Microprocessor
designers have been continually increasing the width of the
registers and execution units, and SMART is effectively one
step ahead of the current generation of 64-bit machines.
Although there are specialised processors that include 128-
bit registers [15,20], they are based around narrower data
paths and the wide registers are generally additional to their
original architecture. Wide registers allow powerful
manipulation of data for graphics and multimedia appli-
cations, which are particularly effective when the execution
units are horizontally partitioned for Single-Instruction
Multiple-Data (SIMD) parallelism. The ATM-CPE uses
Fig. 6. Programmer’s model of the ATM-CPE.
this to a great extent, and it provides a significant advantage
for the multimedia applications that the architecture is
aimed at.
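The SIMD partitioning mentioned above can be illustrated by modelling a 128-bit register as an integer split into sixteen 8-bit lanes. This is a behavioural illustration of horizontal partitioning in general, not the ATM-CPE datapath; the lane width and operation are chosen for the example.

```python
def simd_add8(reg_a, reg_b):
    """Add two 128-bit register values as sixteen independent 8-bit
    lanes, with per-lane wrap-around (no carry between lanes)."""
    result = 0
    for lane in range(16):
        a = (reg_a >> (8 * lane)) & 0xFF
        b = (reg_b >> (8 * lane)) & 0xFF
        # Mask to 8 bits so an overflow in one lane cannot ripple
        # into its neighbour, unlike a full-width 128-bit add.
        result |= ((a + b) & 0xFF) << (8 * lane)
    return result
```

One such instruction performs sixteen byte additions per execution unit per cycle, which is the source of the multimedia advantage claimed for the wide registers.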
The argument against wide registers is that they are
wasteful when processing small quantities, particularly
when data of the same width is stored to memory. Dual
width registers or two register files of differing widths was
rejected for the SMART ATM-CPE on the grounds that the
additional complexities would impede the streamlined
design.
The number of physical GRs in the ATM-CPE is
implementation specific. There is a recommended minimum
of 256 GRs, which may be increased to a maximum of 1280.
These are sequential and contiguous, numbered incremen-
tally from zero. The only restriction is that they are
implemented in such a way so as to not split a register group.
Only 256 GR can be visible at any time, although this is
transparent to the execution of the application code.
3.2. Indirection register
The Indirection Register (IR) provides a simple and
effective way to provide single cycle context switching and
function calls. This provides indirect general register access,
transparently to the programmer. In order to access a
register the current IR value is added to the register number
contained within the instruction, which results in an easy
translation to physical registers. To prevent incorrect
register accesses the IR is only updated during a function
call or return, and is cleared to zero during reset.
By specifying an increment to the IR the calling routine
can furnish the callee with a ‘clean’ set of registers. This is
illustrated in Fig. 7, where the register mapping on the left
represents the state of the processor before the function call.
As can be seen, the GR are mapped to the physical registers
with an indirection value of 15 (general register 16 plus IR
of 15 equals physical register number 31). A function call is
made and the IR is incremented by three, now providing the
mapping seen on the right hand side of Fig. 7. The physical
registers numbered 34 and 35 are used to pass data to the
function, and all registers below number 34 are safe from
corruption during the execution of the subroutine. It is
important to differentiate between indirect register accesses,
and indirect memory addressing modes, which use an offset
from the current IP value.
All accesses to registers R0-R15 will be absolute
accesses. This is a binary boundary that coincides with the
register groups, because the IR equally affects RGs.
A similar set of operations to those accompanying a
CALL can be performed upon the execution of a RETURN
instruction. The indirection register needs to be returned to the
value that it had prior to the call, and this must be a hardware
function, to prevent misalignment. The increment supplied
with the call is stored on an internal stack, which is ‘popped’
when the return instruction is encountered. As the new IR
value does not need to be calculated the execution unit,
which executes the return instruction, is available for
another use.
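The translation and call/return behaviour can be summarised in a few lines, following the worked example of Fig. 7 (GR16 with an IR of 15 maps to physical register 31; a call increments the IR by three). The class is a behavioural sketch; in hardware the increments live on the internal stack described above.

```python
class IndirectionRegister:
    """Sketch of IR-based register translation with a call stack."""

    def __init__(self):
        self.ir = 0       # cleared to zero during reset
        self.stack = []   # call increments, popped on RETURN

    def physical(self, reg):
        if reg < 16:
            return reg        # R0-R15 are always absolute accesses
        return reg + self.ir  # instruction register number + IR

    def call(self, increment):
        # CALL: push the increment so RETURN can undo it in hardware.
        self.stack.append(increment)
        self.ir += increment

    def ret(self):
        # RETURN: restore the caller's mapping with no address arithmetic
        # needed on the execution unit.
        self.ir -= self.stack.pop()
```

Because the translation is a single add, each function call gets a ‘clean’ register window in one cycle, which is the low-overhead context switch the section describes.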
If there are insufficient registers available to provide a
full set, the privileged execution state steps in at some stage
and saves some early registers to memory before continuing
execution. These must then be restored when the appropriate
return instructions are called and the IR would otherwise
fall below zero.
3.3. Hardware loop assistance
The most basic form of code iteration is the counted loop.
Other variants are iteration until a condition is met or
changes, and iteration forever. All these forms are made up
of a section of code with a known start and end point that is
executed repeatedly. Very often it is not feasible to unroll
these loops, particularly when loop counts are high or when
the number of iterations is unknown.
With traditional architectures the last instruction in the
loop supplies a new IP, which redirects the fetch unit to
another area of memory, where it will find the first
instruction of the loop. As the new fetch address is not
available until after the execution stage of the pipeline this
re-direction causes problems, which degrade performance.
Modern processors from Intel, DEC, SUN, and others, all
include dedicated branch prediction logic which attempts to
guess the direction which a processor will be forced to take
by a conditional branch [13,16]. Much effort has been
applied to improving the prediction algorithms and some
high performance claims are made. However, these are
generally made using best-case code fragments, which is not
entirely representative of real applications. Furthermore,
multiple context support requires the branch history to be
preserved whilst other contexts are being serviced, adding to
the volume of information that must be stored and retrieved. If
the pre-fetch unit is sent in the wrong direction then there
are numerous instructions that must be nullified/flushed
from the pipeline, and execution must wait until the correct
instructions refill the pipeline. With long pipelines (14
stages for the Intel P6 [21]) this can severely inhibit
performance.
The ATM-CPE hardware loop assistance is similar in
principle to that found on the Analog Devices 16-bit DSP
family, the ADSP-21xx [22]. The aim is to remove the
overhead associated with counted loops. The hardware
Fig. 7. Indirection register operation during a function call.
ensures that upon fetching the last instruction the fetch unit
is updated with the address of the start of the loop. In this
way the last instruction does not need to be a branch
instruction, and there is no execution penalty to the iteration.
Additionally, the overhead of processing the conditional test
is also removed.
Essentially, the hardware is initialised before the loop
commences, storing the number of loop iterations, the loop
end address and the loop start address (the current IP value
when the initialisation instruction is executed). These values
are all stored on the loop stack, which allows multiple
outstanding loops to be maintained, making provision for
nested loops.
Each time the loop is executed the loop counter is
decremented, and upon reaching zero execution is allowed
to continue past the end of the loop. This does not require
any additional address calculation hardware, as the absolute
addresses are stored ready for instant use. It is the
initialisation instructions that calculate the address prior to
beginning the loop.
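The initialise-then-fetch behaviour described above can be sketched as a small behavioural model (Python; all names are hypothetical, and the real hardware holds this state in the loop stack):

```python
# Behavioural sketch of the loop-assist hardware (names hypothetical).
# Each loop-stack entry stores the iteration count and the absolute
# start/end addresses, computed once by the initialisation instruction
# before the loop body runs.

loop_stack = []

def loop_init(ip, body_len, iterations):
    # Current IP is the loop start; the end address is precomputed, so
    # no per-iteration address arithmetic is required.
    loop_stack.append({"start": ip, "end": ip + body_len - 1,
                       "count": iterations})

def next_fetch(ip):
    # Fetch-unit rule: on fetching the last instruction of the innermost
    # active loop, decrement the counter and redirect to the loop start.
    if loop_stack and ip == loop_stack[-1]["end"]:
        loop_stack[-1]["count"] -= 1
        if loop_stack[-1]["count"] > 0:
            return loop_stack[-1]["start"]
        loop_stack.pop()            # loop complete: fall through
    return ip + 1

# A three-instruction loop at address 100, executed four times:
loop_init(100, 3, 4)
ip, trace = 100, []
for _ in range(12):
    trace.append(ip)
    ip = next_fetch(ip)
```

The pipeline-length adjustment discussed next amounts to storing the end address minus the pipeline depth as the comparison address, leaving this rule otherwise unchanged.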
A benefit of this form of loop control is that it can be
adjusted to take account of the pipeline length, thus ensuring
the last instruction of a loop is followed immediately by the
first without any pipeline bubbles. Instead of comparing the
current IP value against the end address itself, the end
address minus the pipeline length can be used as the
comparison point.
The only other change is to the mechanism that determines
the start address from the current IP value.
This form of loop control works very well for
mathematical functions that have a known number of
iterations. As many of the multimedia and network
operations are fundamentally mathematical and iterative
this will benefit the general environment of SMART.
Additionally, SMART includes the ability to program the
loop assistance unit not to decrement the loop counter, and
also instructions to immediately clear the counter value
(predicated for conditional execution). These also enable
SMART to process the other two forms of iteration that
were previously mentioned without any overhead—con-
ditional exit loops and infinite loops. There are restrictions
on the placement of an instruction that clears the loop
counter, as its pipeline latency must be accounted for. Even
with these restrictions, however, hardware loop assistance
remains a powerful feature of the architecture.
4. Instruction set
Defining an instruction set is as much a project in itself
as any other single task in the overall development of a new
microprocessor. Whilst some types of instructions are
fundamental to the operation of all processors, there are
others, which are novel or innovative because of the
architectural design of the ATM-CPE. Some examples of
these are included here to illustrate the functional operation,
and to support the code kernel.
Instructions are grouped in threes to form a VLIW. They are
all self-contained, without interdependencies or conflicts.
There are certain restrictions on which execution units can
perform particular instructions (such as address
calculation for load/store operations), but essentially all
three are identical. This reduces the design time, as a single
instruction unit can be copied for units two and three. Self-
contained instructions further reduce the hardware compli-
cations, as they do not require communications or transfers
between the execution units. A VLIW packet is issued as a
single entity, with the position of the instructions within the
VLIW used to determine which execution unit they are
issued to.
Complex instruction scheduling logic is widely identified
as a limiting factor to the design of faster superscalar
processors [11,23]. VLIW technology relies on the compiler
to optimise the instruction scheduling, but removes this
large section of logic from the processor entirely. To further
aid the instruction scheduling, the ATM-CPE follows a
similar route to the latest designs from Intel (IA-64), SUN
Microsystems (MAJC) and others, where instruction
execution is predicated [24,25]. Therefore the compiler
can simultaneously schedule multiple execution paths, and
use the outcome of a conditional test to determine which one
is used.
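As a sketch of the idea (illustrative only; the register names and commit rule below are assumptions, not the ATM-CPE encoding), both arms of a branch are issued and each result is committed only if its predicate register is true, so the fetch stream is never redirected:

```python
# Sketch of predicated execution: both paths are scheduled, and the
# predicate determines which result is committed (names hypothetical).

def predicated_add(pred, dest, src_a, src_b, regs):
    if regs[pred]:                 # commit only when the predicate holds
        regs[dest] = regs[src_a] + regs[src_b]

regs = {"p0": True, "p1": False, "r1": 5, "r2": 7, "r3": 0, "r4": 0}
predicated_add("p0", "r3", "r1", "r2", regs)   # taken path commits
predicated_add("p1", "r4", "r1", "r2", regs)   # untaken path is a no-op
```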
Instructions comprise five fields, each 8 bits wide. These
identify the operation and the various registers: an
instruction operation code, two source registers, one
destination register, and a predicate register. Therefore
instructions such as Rdest = RA + RB are possible. Equally, the two
source registers may be used as a 16-bit immediate value, or
a combination of register identifier and 8-bit immediate
data.
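A possible packing of these five fields is sketched below; the field ordering is purely an assumption for illustration, as the paper does not specify the bit-level encoding:

```python
# Hypothetical packing of the five 8-bit instruction fields into one
# 40-bit word (field order assumed, not taken from the paper).

def pack(opcode, src_a, src_b, dest, pred):
    for field in (opcode, src_a, src_b, dest, pred):
        assert 0 <= field < 256          # every field is 8 bits
    return (opcode << 32) | (src_a << 24) | (src_b << 16) | (dest << 8) | pred

def unpack(word):
    return tuple((word >> shift) & 0xFF for shift in (32, 24, 16, 8, 0))

def immediate16(word):
    # The two source-register fields reused as one 16-bit immediate.
    _, src_a, src_b, _, _ = unpack(word)
    return (src_a << 8) | src_b
```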
The instruction set includes significant emphasis on
SIMD partitioned operations. This exploits the wide
registers, allowing one instruction to perform as much
work as possible. These instructions include arithmetic,
logical, shift and rotate. One example is the concatenated
shift. Whilst the operation is thought of as a shift, it can
actually be performed as a dual bit-field extract and
combine. Fig. 8 illustrates the operation, where two registers
are concatenated into a 256-bit word, and shifted by an
amount contained in the third register.
In this example the destination register is the most
significant operand. When the shift direction is reversed,
the two source registers are concatenated the other way
round, such that the destination register becomes the least
significant operand.
Fig. 8. Concatenated shift instruction operation.
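The equivalence between the concatenated shift and a dual bit-field extract and combine can be shown with Python's arbitrary-precision integers standing in for 128-bit registers (a behavioural sketch only):

```python
# The concatenated shift of Fig. 8: two registers form a 256-bit value,
# shifted right by m, keeping the low 128 bits -- equivalently a dual
# bit-field extract and combine.

WIDTH = 128
MASK = (1 << WIDTH) - 1

def cshiftr(hi, lo, m):
    # Right funnel shift of the concatenated pair hi:lo.
    assert 0 <= m < WIDTH
    return (((hi << WIDTH) | lo) >> m) & MASK
```

In the motion-estimation kernel this is what aligns sixteen pixels straddling two 128-bit memory words: m is the misalignment expressed in bits.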
This instruction is a powerful manipulation that greatly
enhances the capability of the ATM-CPE. It makes the 128-bit
registers and memory locations very efficient through a
reduction in the number of operations that are required for
pattern matching functions such as motion estimation.
To support this, another important partitioned instruction
is the Sum of Absolute Difference (SAD), which performs
the necessary steps to calculate the SAD of pixel data
contained in a single register. New microprocessor archi-
tectures include such instructions [24] to enhance the power
of their instruction sets without necessarily increasing the
demands on memory bandwidth. There is a fine line
separating complex instructions that help reduce the
memory bandwidth requirements from those whose
complexity slows down the critical paths within the
microarchitecture.
The SAD instruction, however, has been implemented on
other processors without detrimental effects and is also a
significant benefit in the highly computational motion
estimation kernel. The ATM-CPE supports two forms of
the SAD instruction, which are aimed at different pixel
accuracies—8-bit and 16-bit representations. Fig. 9 illus-
trates the operations performed for a SAD instruction. The
partitioned execution units are particularly well suited to
this type of calculation, and its inclusion in the instruction
set does not add much complexity to the overall design.
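A behavioural sketch of the 8-bit form of the SAD instruction follows; the sixteen lanes are assumed from the 128-bit register width:

```python
# Sketch of the 8-bit SAD instruction: a 128-bit register is treated as
# sixteen 8-bit lanes, and the absolute differences against a second
# register are summed to a single scalar.

def sad8(reg_a, reg_b):
    total = 0
    for lane in range(16):
        a = (reg_a >> (8 * lane)) & 0xFF
        b = (reg_b >> (8 * lane)) & 0xFF
        total += abs(a - b)
    return total
```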
4.1. Performance estimation—compressed video motion
estimation
The computational load of the motion estimation
algorithm used in MPEG and other video compression
algorithms is often regarded as a benchmark for processor
performance. The number of operations that must be
performed remains too great for all but a few of the top
processors to achieve real time results. Estimates of the
computational requirements of a full search and a ‘fast’
search have been given as 18.2 × 10⁹ and 621 × 10⁶ generic
RISC-like operations per second, respectively [26]. Using
the SIMD parallelism of its wide registers, the ATM-CPE
can perform the same functions with a speed up of 45 times
over the generic load/store architecture.
As each of SMART’s 128-bit registers or memory
locations can store sixteen 8-bit pixels, it is not efficient to
read individual pixels from memory. Therefore the motion
estimation code must take account of how the video frame is
stored in data memory in order to achieve the greatest
speed-up from its wide data path. If pixels are stored in
groups that are the same size as Macro Blocks (MBs), and
have the same borders as MBs, then they can be read, stored
and written with the maximum efficiency. Pointers are
linked to the MB number, and an entire MB is transferred
with 16 sequential memory read operations. MBs are stored
in memory in such a manner as to reduce the overhead of
pointer management during kernel functions such as Motion
Estimation, such that an area of the video frame, composed
of any 9 adjacent MBs, would be stored in memory as
shown in Fig. 10.
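Under this layout the address arithmetic is trivial. The word-addressed scaling below is an assumption consistent with Fig. 10, not taken verbatim from the paper:

```python
# Assumed address arithmetic for the MB-grouped layout of Fig. 10 in a
# word-addressed memory: each 16 x 16 MB occupies 16 consecutive
# 128-bit words, so a pointer is derived from the MB number directly.

WORDS_PER_MB = 16            # one 128-bit word per 16-pixel MB row

def mb_base(mb_number):
    return mb_number * WORDS_PER_MB

def mb_addresses(mb_number):
    # The 16 sequential reads that transfer an entire aligned MB.
    base = mb_base(mb_number)
    return [base + row for row in range(WORDS_PER_MB)]
```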
Search operations pay no heed to boundaries between
MBs, and a disadvantage of grouping pixels into a single
memory location is the additional memory accesses that are
required to search across such boundaries. For example, the
full extent of the search range for MB ‘Q’ is shown in Fig. 11
as the shaded area, whilst an arbitrary search location is
shown as a dotted line. This location requires 32 memory
accesses in order to overcome the misalignment, as opposed
to an optimum 16 memory fetches for a perfectly aligned
MB. However, nearly all search locations will be mis-
aligned, and therefore the overhead cannot be ignored.
The offset m must be used to shift the memory data in
order to compare it to the reference MB held in registers.
Fig. 9. Illustration of ‘sum of absolute difference’ instruction operation.
Fig. 10. MB storage organisation for optimum motion estimation
performance.
The vertical offset between different MBs does not present a
problem, as the fetches remain sequential across these
boundaries.
For the misaligned search shown in Fig. 11, the kernel
function only needs to know the memory location of the first
fetch, the difference between similar locations in adjacent
columns, n, and the required shift distance to align the
registers, m. The reference MB, for which the best match is
to be found, is loaded into 16 registers. This only needs to be
done once, as it remains in the registers until the search is
concluded. As this code fragment is the kernel of a
subroutine that would be called once for each search
location the 16 loads are not included in the subroutine code.
The source code for the motion estimation SAD is shown
in Fig. 12. It is written for the SMART ATM-CPE as it uses
128-bit data values (natural integers), which are not native
to the ‘C’ language. The code supporting the function call
SAD(16_pels) is not shown as it is later optimised to a
single SMART PN instruction, although a function call
allows alternative compiler optimisations for other
architectures.
The operations required for one pass through the loop
are: two loads (from the data memory location pointed to by
mb1_ptr and mb1_ptr þ n), a pointer increment, two shifts
and a logical OR (to generate a single 128-bit value from
across the MB boundary), the SADs, and finally the
accumulation of the SAD summation.
This assembles to the pseudo code shown in Table 1,
where the operations described above for one pass through
the loop have been converted into functions achievable with
ATM-CPE instructions.
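The per-pass operations listed above can be sketched in Python as a stand-in for the ‘C’ of Fig. 12; each helper models what a single ATM-CPE instruction performs, and all names are hypothetical:

```python
# One pass of the SAD inner loop from Table 1 (behavioural sketch).

WIDTH, MASK = 128, (1 << 128) - 1

def cshiftr(hi, lo, m):                  # concatenated shift
    return (((hi << WIDTH) | lo) >> m) & MASK

def sad8(a, b):                          # partitioned SAD over 16 lanes
    return sum(abs(((a >> 8 * i) & 0xFF) - ((b >> 8 * i) & 0xFF))
               for i in range(16))

def sad_row(mem, ptr, n, m, ref_row):
    # Two loads straddling the MB boundary, align, SAD, advance pointer.
    lo, hi = mem[ptr], mem[ptr + n]      # load from ptr_1 and ptr_2
    aligned = cshiftr(hi, lo, m)         # shift values to align
    return sad8(aligned, ref_row), ptr + 1

# Demo: rows misaligned by 4 pixels (m = 32 bits) across columns n = 5
# words apart; the reference row matches exactly, so the SAD is zero.
mem = {0: int.from_bytes(bytes(range(16)), "little"),
       5: int.from_bytes(bytes(range(16, 32)), "little")}
ref = int.from_bytes(bytes(range(4, 20)), "little")
row_sad, next_ptr = sad_row(mem, 0, 5, 32, ref)
```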
The operations of Table 1 clearly demonstrate the power
of the ATM-CPE instruction set and its wide registers.
Further speed enhancements can be made if the loop is
unrolled and the code fragment shown in Table 2 is
interleaved with two other loop iterations on the two other
execution units. This is an illustrative example only
showing the parallelism achieved, together with the
tessellation of the individual loop structures that have
been unrolled—the full code can be seen in Table 3. This
form of parallelisation ensures that the execution units are
kept fully occupied and no instruction slots are wasted. It is
possible to unroll the loops in their entirety because the
instruction set reduces the loop body to a short sequence of
instructions.
Laying this code out as true assembler level source code1
loses the obvious parallelism shown above. Throughout the
coding the register usage is: registers 16–31 contain the MB
which is to be compared with the video frame in memory,
register 32 contains the shift value m, register 33 contains
the pointer offset n, and register 34 is the pointer to the first
data memory location containing the reference pixels. The
total accumulates in register 35, whilst registers 36 through
to 41 are used for the temporary totals of each parallel
operation. In this way the calling routine passes the
necessary data in registers, which become those numbered
16 onwards after the update of the IR.
The architecture restricts execution unit number one to
the calculation of memory addresses for load/store oper-
ations to memory. This is the main reason that the source
code of Table 3 loses its obvious readability, as the
Fig. 11. Macro block search locations and memory organisation boundaries.
Table 1
ATM-CPE Instructions for a single pass through the motion estimation loop
Pseudo instruction Operation/C source code
load from ptr_1 load from mb1_ptr
load from ptr_2 load from n þ mb1_ptr
concatenated shift shift two registers to align pix
calculate the SAD SAD of all 16 pixels (SAD(16_pels))
accumulate total total = total + SAD(16_pels)
increment ptr pointer moved to next 16 pixels
Fig. 12. Source code for sum of absolute difference calculation.
1 Source code format is: opcode, destination, sourceA, sourceB. As every
instruction must be executed the predicate register has been omitted for
clarity.
instructions are scheduled around this unit. Upon return the
final total value will be accumulated in R35, ready for use
by the calling routine. Notice that, in adhering to the rules
restricting functions to particular execution units (load/
store and branch), this code has not become any longer than
it would without such limitations. This is due not only to careful
planning of the operation, but also to the architectural
resources and the powerful instruction set.
4.2. Summary of motion estimation performance
Careful instruction set design, a clean architecture and
horizontal register partitioning all serve to exploit the power
inherent in the 128-bit GR. In just 35 clock cycles, the new
ATM-CPE can perform the 256 comparisons and accumu-
late the SADs for a whole MB.
The ATM-CPE fares well in comparison to performance
claims by other processors. The VIS extensions to the SUN
Ultra-SPARC I and II allow it to perform a similar MB
comparison for motion estimation in 441 instruction cycles,
operating on similar 8-bit pixel data [27]. Hewlett Packard
show that their MAX-2 instruction set extensions can
compute the difference of 8 pixels in 2 instruction cycles of
their superscalar PA-8000. Their benchmark results claim
that the code kernel for the full SAD for a MB takes 130
instruction cycles [28]. It is the need to continually load data
from memory and transfer it between internal registers that
slows the Ultra-SPARC, whilst the narrower data path and
lack of a dedicated difference instruction prevent greater
speed in the PA-8000. At 35 cycles the ATM-CPE uses far
fewer instructions. Furthermore, the wider memory
locations and registers reduce the memory accesses, and
help prevent stalls due to memory access latencies. This is
an often-overlooked performance penalty that only becomes
apparent when a real-world application is executed.
A frame size of 18 by 22 MBs, at 30 frames per second
would require an ATM-CPE to run at 400 MHz in order to
perform a full motion estimation search. This represents a
speed up of 45 times over the generic RISC processor
Table 2
Pseudo code implementation of motion estimation loop
…    …    …
accumulate total shift values to align load from ptr_1
increment ptr_1 calculate the SAD load from ptr_2
load from ptr_1 accumulate total shift values to align
load from ptr_2 increment ptr_1 calculate the SAD
shift values to align load from ptr_1 accumulate total
calculate the SAD load from ptr_2 increment ptr_1
accumulate total shift values to align load from ptr_1
increment ptr_1 calculate the SAD load from ptr_2
load from ptr_1 accumulate total shift values to align
load from ptr_2 increment ptr_1 calculate the SAD
shift values to align load from ptr_1 accumulate total
calculate the SAD load from ptr_2 increment ptr_1
accumulate total shift values to align load from ptr_1
Table 3
Complete motion estimation code
#1 load-rr r36, r34, r00; Fetch the first 16 pixels
nop;
nop;
#2 load-rr r37, r34, r33; Fetch the second 16 pixels
add-rr r34, r34, i01; Increment memory pointer
nop;
#3 cshiftr r37, r36, r32; Concatenate and shift to align
load-rr r38, r34, r00; Fetch the third 16 pixels
nop;
#4 load-rr r39, r34, r33; Fetch the fourth 16 pixels
sad8 r37, r37, r16; First sum of absolute differences
add-rr r34, r34, i01; Increment memory pointer
#5 load-rr r40, r34, r00; Fetch the fifth 16 pixels
add-rr r35, r35, r37; Accumulate the first total
cshiftr r39, r38, r32; Concatenate and shift to align
#6 load-rr r41, r34, r33; Fetch the sixth 16 pixels
add-rr r34, r34, i01; Increment memory pointer
sad8 r39, r39, r17; Second sum of abs. differences
#7 load-rr r36, r34, r00; Fetch the seventh 16 pixels
add-rr r35, r35, r39; Accumulate the second total
cshiftr r41, r40, r32; Concatenate and shift to align
#8 load-rr r37, r34, r33; Fetch the eighth 16 pixels
sad8 r41, r41, r18; Third sum of absolute differences
add-rr r34, r34, i01; Increment memory pointer
#9 load-rr r38, r34, r00; Fetch the ninth 16 pixels
cshiftr r37, r36, r32; Concatenate and shift to align
add-rr r35, r35, r41; Accumulate the third total
#10 load-rr r39, r34, r33; Fetch the 10th 16 pixels
sad8 r37, r37, r19; Fourth sum of abs. differences
add-rr r34, r34, i01; Increment memory pointer
#11 load-rr r40, r34, r00; Fetch the 11th 16 pixels
add-rr r35, r35, r37; Accumulate the fourth total
cshiftr r39, r38, r32; Concatenate and shift to align
#12 load-rr r41, r34, r33; Fetch the 12th 16 pixels
add-rr r34, r34, i01; Increment memory pointer
sad8 r39, r39, r20; Fifth sum of absolute differences
#13 load-rr r36, r34, r00; Fetch the 13th 16 pixels
add-rr r35, r35, r39; Accumulate the fifth total
cshiftr r41, r40, r32; Concatenate and shift to align
#14 load-rr r37, r34, r33; Fetch the 14th 16 pixels
sad8 r41, r41, r21; 6th sum of absolute differences
add-rr r34, r34, i01; Increment memory pointer
#15 load-rr r38, r34, r00; Fetch the 15th 16 pixels
cshiftr r37, r36, r32; Concatenate and shift to align
add-rr r35, r35, r41; Accumulate the sixth total
#16 load-rr r39, r34, r33; Fetch the 16th 16 pixels
sad8 r37, r37, r22; 7th sum of absolute differences
add-rr r34, r34, i01; Increment memory pointer
#17 load-rr r40, r34, r00; Fetch the 17th 16 pixels
add-rr r35, r35, r37; Accumulate the seventh total
cshiftr r39, r38, r32; Concatenate and shift to align
#18 load-rr r41, r34, r33; Fetch the 18th 16 pixels
add-rr r34, r34, i01; Increment memory pointer
sad8 r39, r39, r23; 8th sum of absolute differences
#19 load-rr r36, r34, r00; Fetch the 19th 16 pixels
add-rr r35, r35, r39; Accumulate the 8th total
cshiftr r41, r40, r32; Concatenate and shift to align
#20 load-rr r37, r34, r33; Fetch the 20th 16 pixels
sad8 r41, r41, r24; 9th sum of absolute differences
add-rr r34, r34, i01; Increment memory pointer
#21 load-rr r38, r34, r00; Fetch the 21st 16 pixels
cshiftr r37, r36, r32; Concatenate and shift to align
operations. More importantly, a fast search algorithm could
reduce this requirement to under 14 MHz!
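The 400 MHz figure can be reproduced with simple arithmetic, provided a ±15-pixel full-search window (31 × 31 = 961 candidate locations per MB) is assumed; the paper does not state the window size:

```python
# Back-of-envelope check of the 400 MHz full-search figure.  The
# 31 x 31 = 961 candidate window is an assumption; the frame size,
# frame rate and 35-cycle SAD come from the text.

mbs_per_frame = 18 * 22
frames_per_s = 30
locations = 31 * 31         # assumed full-search window per MB
cycles_per_sad = 35         # from Section 4.2

clock_hz = mbs_per_frame * frames_per_s * locations * cycles_per_sad
# clock_hz is roughly 4.0e8, i.e. about 400 MHz
```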
5. Conclusion
This paper has outlined the design, operation and
performance of a highly novel microprocessor architecture
currently under development at the University of Sussex. It
directly supports the manipulation of ATM cells, and
includes a cell stream interface. A third small cache memory
allows the cells to be buffered and accessed by the
processing core.
The five main design axioms set out in the introduction
have been met. ATM has been retained as the network
interface, which is given a high priority route into the very
heart of the processor architecture. At all times the design
has tried to avoid complexity and the need for many
resources to implement it. This ensures the implementation
is affordable, and does not rely on fully customised silicon
design. Additionally, a smaller implementation makes a
System On Chip (SOC) implementation more plausible with
today’s technology. Finally, the design has strived to reduce
the impact of the disparity between memory and processing
speeds. Through the use of wide registers, powerful
instructions, and deliberately lower processor clock speeds
(from the accepted target of FPGA and ASIC technology)
the impact of memory latency has been diminished.
Promoting the importance of the network interface
leads to many possible deployments beyond the original
workstation CPU envisaged by Cashman. These include
processing of cells in switch buffers and queues, affordable
embedded network processors for ATM networks, and
intelligent network interface cards. As network speeds
continue to increase faster than processor or memory
speeds, such innovative architectures will become ever-
more important. In fact, one of the fathers of the
microprocessor, Federico Faggin, predicted that micro-
processor development would be divided into three phases,
each lasting for approximately 25 years [29]. During the
first phase the dominant factor over improvements in cost-
performance was semiconductor technology. This allowed
the implementation, and gradual speed enhancement of
many of the ideas originally developed for minicomputers
and mainframes.
Then during the second phase, it is predicted that
architectural developments will become as dominant as
those of semiconductor technology. In the words of
Federico Faggin [29]: “Innovation will also come about
from changing market requirements, primarily the need to
process and communicate multimedia information more
efficiently… It will stimulate novel computer structures
motivated by the new market needs and supported by an
ever more powerful semiconductor technology.”
It is forecast that this phase will be a period of vigorous
growth for the microprocessor. The trend towards systems
on a chip should begin to appear, and this is seen as the main
evolutionary direction of the microprocessor.
The prediction for the final phase says that architectural
advances alone will be the dominant factor in micropro-
cessor progress. This is because the physical limits of
semiconductor technology will have been reached. The
systems on a chip will become more complex and flexible,
drawing technology from re-configurable computing to
create dynamic architectures.
The second phase of Faggin’s prediction is beginning to
be realised, as many different architectural features are
being explored to overcome some fundamental limitations.
Incremental development is beginning to give way to
radical architectural changes, and market segments are
emerging based on different functional requirements. Some
of these are existing and proven, such as Digital Signal
Table 3 (continued)
add-rr r35, r35, r41; Accumulate the 9th total
#22 load-rr r39, r34, r33; Fetch the 22nd 16 pixels
sad8 r37, r37, r25; 10th sum of absolute differences
add-rr r34, r34, i01; Increment memory pointer
#23 load-rr r40, r34, r00; Fetch the 23rd 16 pixels
add-rr r35, r35, r37; Accumulate the 10th total
cshiftr r39, r38, r32; Concatenate and shift to align
#24 load-rr r41, r34, r33; Fetch the 24th 16 pixels
add-rr r34, r34, i01; Increment memory pointer
sad8 r39, r39, r26; 11th sum of absolute differences
#25 load-rr r36, r34, r00; Fetch the 25th 16 pixels
add-rr r35, r35, r39; Accumulate the 11th total
cshiftr r41, r40, r32; Concatenate and shift to align
#26 load-rr r37, r34, r33; Fetch the 26th 16 pixels
sad8 r41, r41, r27; 12th sum of absolute differences
add-rr r34, r34, i01; Increment memory pointer
#27 load-rr r38, r34, r00; Fetch the 27th 16 pixels
cshiftr r37, r36, r32; Concatenate and shift to align
add-rr r35, r35, r41; Accumulate the 12th total
#28 load-rr r39, r34, r33; Fetch the 28th 16 pixels
sad8 r37, r37, r28; 13th sum of absolute differences
add-rr r34, r34, i01; Increment memory pointer
#29 load-rr r40, r34, r00; Fetch the 29th 16 pixels
add-rr r35, r35, r37; Accumulate the 13th total
cshiftr r39, r38, r32; Concatenate and shift to align
#30 load-rr r41, r34, r33; Fetch the 30th 16 pixels
add-rr r34, r34, i01; Increment memory pointer
sad8 r39, r39, r29; 14th sum of absolute differences
#31 load-rr r36, r34, r00; Fetch the 31st 16 pixels
cshiftr r41, r40, r32; Concatenate and shift to align
add-rr r35, r35, r39; Accumulate the 14th total
#32 load-rr r37, r34, r33; Fetch the 32nd 16 pixels
sad8 r41, r41, r30; 15th sum of absolute differences
nop;
#33 add-rr r35, r35, r41; Accumulate the 15th total
cshiftr r37, r36, r32; Concatenate and shift to align
nop;
#34 return; Return, with delay slot
sad8 r37, r37, r31; 16th sum of absolute differences
nop;
#35 add-rr r35, r35, r37; Accumulate the 16th total
nop;
nop;
Processors (DSPs), whilst others are being created by new
markets, such as that which led to network processors.
The design of the SMART processor is very forward
looking. It is well placed to serve network connectivity, it
has the processing power to support multimedia appli-
cations, and through its design simplicity is affordable to
manufacture.
References
[1] N. Cashman, SMART: a proposed novel multimedia computer
architecture for processing ATM cells in real-time, PhD thesis,
University of Sussex, October 1998.
[2] R.H. Arpaci-Dusseau, A.C. Arpaci-Dusseau, D.E. Culler, J.M.
Hellerstein, D.A. Patterson, The architectural costs of streaming I/
O: a comparison of workstations, clusters and smps, Fourth
International Symposium on High Performance Computer Architec-
ture (HPCA), IEEE computer society technical committee on
computer architecture, Jan-Feb, 1998.
[3] M.D. Hayter, A workstation architecture to support multimedia, PhD
thesis, St John’s College, University of Cambridge, September 1993.
[4] A. Boxer, Where busses can’t go, IEEE Spectrum February (1995)
41–45.
[5] Xilinx, Virtex-II 1.5 V Field-Programmable Gate Arrays, Xilinx Inc,
April 2001, Data sheet from http://www.xilinx.com
[6] Wm.A. Wulf, S.A. McKee, Hitting the memory wall: implications of
the obvious, ACM SIGARCH Computer Architecture News 23 (1) (1995) 20–24.
[7] D. Burger, J.R. Goodman, A. Kagi, Limited bandwidth to affect
processor design, IEEE Micro 17 (6) (1997) 55–62.
[8] J.D. McCalpin, Memory bandwidth and machine balance in current
high performance computers, IEEE Computer Society Technical
Committee on Computer Architecture Newsletter December (1995)
19–25.
[9] R. Nass, Ring architecture connects up to 128 PCI buses, Electronic
Design November (1997) 85–88.
[10] W.J. Dally, S. Lacy, VLSI architecture: past, present and future,
Advanced Research In VLSI, 1999.
[11] T.R. Halfhill, Inside IA-64, BYTE June (1998) 81–88.
[12] S. Palacharla, Complexity-effective superscalar processors, PhD
thesis, School of Computer Sciences, University of Wisconsin-
Madison, 1998.
[13] H. Sharangpani, Intel itanium processor microarchitecture overview,
Proceedings of Microprocessor Forum, San Jose, California, October,
1999.
[14] S. Sudharsanan, MAJC-5200: a high performance microprocessor for
multimedia computing, White paper, June, 1999.
[15] Intel Corporation, IA-32 Intel architecture software developer’s
manual volume 1: Basic architecture, Intel Corporation, P.O. Box
7641 Mt. Prospect IL 60056-7641, 2001, Document order number
245470.
[16] L. Gwennap, Digital 21264 sets new standard, Microprocessor Report
October (1996).
[17] HP, PA-8500: The continuing evolution of the PA-8500 family, Web
page; http://www.hp.com/computing/-framed/technology/micropto/
pa-8500/docs/8500.html, 1997.
[18] D. Tabak, Advanced Microprocessors, Second ed., McGraw-Hill,
1995.
[19] M. Tremblay, MAJC: microprocessor architecture for Java comput-
ing, HotChips ’99, 1999.
[20] C. Hansen, Microunity’s media processor architecture, IEEE Micro 16
(4) (1996) 34–41.
[21] J. Bayko, Great microprocessors of the past and present, Web page;
http://infopad.eecs.berkeley.edu/CIC/-archive/cpu_history.html,
1997.
[22] Analog Devices, ADSP-2100 family user’s manual, Analog Devices,
Third ed., 1995.
[23] Intel Corporation and Hewlett Packard, IA-64 architectural inno-
vations, joint white paper, February, 1999.
[24] Intel Corporation, IA-64 Application Developer’s Architecture Guide
May (1999).
[25] M. Tremblay, MAJC-5200: A VLIW convergent MPSOC, Micro-
processor Forum, 1999.
[26] V. Bhaskaran, K. Konstantinides, Image and Video Compression
Standards: Algorithms and Architectures, Second ed., Kluwer
Academic Press, 1995.
[27] M. Tremblay, J.M. O’Connor, V. Narayanan, L. He, VIS speeds new
media processing, IEEE Micro 16 (4) (1996) 10–20.
[28] R.B. Lee, Subword parallelism with MAX-2, IEEE Micro 16 (4)
(1996) 51–59.
[29] F. Faggin, The microprocessor, IEEE Micro 16 (6) (1996) 7–9.
Gary Jones graduated from the University
of Brighton, England, in 1996 with a first
class Honours degree in Electronic Engin-
eering, and a Masters degree in Engineer-
ing with Business Management. He went
on to start his DPhil research work with the
Communications Research Group, at the
University of Sussex. During the last two
years Gary has worked as a consultant
design engineer for a local company. He
now runs his own engineering and soft-
ware consultancy company.
Elias Stipidis graduated with a first class
BEng Honours degree in Electronics and a
DPhil in Communications from the Uni-
versity of Sussex in 1995 and 1998,
respectively, and is currently leading the
Data Communications and Embedded
Systems research at the University of
Sussex, England.