Architecture and instruction set design of an ATM network processor
Gary Jones, Elias Stipidis*
Communications Research Group, School of Engineering and Information Technology, University of Sussex, Falmer, Brighton BN19QT, E. Sussex, UK
Received 6 December 2002; revised 1 March 2003; accepted 18 March 2003
Abstract
Microprocessor architectures are diversifying to support niche market requirements, with growing emphasis for performance delivery on
the architectural design rather than the silicon implementation. This paper outlines the architectural design, programmer’s model and
instruction set of a microprocessor, which adopts a novel approach to network data. In particular, Asynchronous Transfer Mode (ATM) cells
are delivered to a special FIFO cache memory, located at the heart of the processor. Cell input and output is conducted at wire speed using
dedicated streaming input and output hardware. Special read and write instructions then allow the cell payloads to be accessed directly, and
transferred from/to the register file. Multimedia applications have previously been identified as an important market for such a network
centric architecture. Therefore the paper ends with a demonstration of the power of some key instructions. A motion estimation kernel from
the MPEG standard is used to exercise the architecture and instruction set. Execution speed is shown to be comparable to today’s processors,
using only a 400 MHz clock for a full search. The minimally resourced design is therefore suited to embedded network applications from
both economic and performance standpoints.
© 2003 Elsevier B.V. All rights reserved.
Keywords: Network processor architecture; Asynchronous transfer mode
1. Introduction
This paper presents a novel microprocessor architecture
currently under development within the School Of Engin-
eering and Information Technology, at the University of
Sussex. It describes the architectural design work, pro-
grammer’s model, an overview of the instruction set and the
results of simulating code kernels from multimedia
applications.
The design captures the essence of the conceptual design
produced earlier by Neil Cashman [1]. This aimed to
revolutionise the internal architecture of a computer,
replacing the traditional shared busses with Asynchronous
Transfer Mode (ATM) network links. In doing so the
architecture would address both the limitations of bus based
systems and the perceived future requirement to process
increased quantities of multimedia data.
The original SMART design was an entire computer
architecture built of many components. These included the
Processor Node (PN), a custom designed processing
architecture fed directly with ATM data streams; it is the
PN that is the focus of this paper. The computer
architecture, called SMART, is named after its main
characteristics:
Smooth protocol boundaries
Multimedia processing
ATM bandwidth
Real-time processing
Transfer of processes
There are many others who see the internal busses as
weak links in the ever increasing performance of computers
[2–4]. We feel that this other work has not yet resulted in a
viable alternative computer because it all relies on
commercial microprocessors. This implies that bus
architectures and traditional memory arrays have to be used at
some stage. Therefore these designs have not replaced busses,
but merely moved them.
The Processing Unit (PU) described in this paper is the
first attempt to break free of the use of commercial
microprocessors for such a computer architecture. Analysis
of the previous conceptual design highlighted the benefits of
a network based computer architecture, although the actual
PU was excessively complex and limiting to the overall
0141-9331/03/$ - see front matter © 2003 Elsevier B.V. All rights reserved.
doi:10.1016/S0141-9331(03)00064-4
Microprocessors and Microsystems 27 (2003) 367–379
www.elsevier.com/locate/micpro
* Corresponding author.
E-mail address: [email protected] (E. Stipidis).
performance. In order to overcome the significant hurdles of
development, implementation and manufacture, the design
has been constrained by several key axioms:
1. Retain the proven architectural advantages of ATM as
the underlying transfer mechanism.
2. Retain the original aim of raising the network connection
to the highest priority.
3. Simplify the operation and architecture, specifically the
management and control.
4. Target the design at a System-On-Chip (SOC) solution,
without being implementation specific in the architec-
tural design.
5. Address the growing speed differential between pro-
cessor speeds and memory access times.
Some elaboration is required on these. Simplifying the
architecture reduces the resources that are required to
implement it. Keeping the design within the realms of
today’s leading edge ASIC and FPGA technology will also
keep it within tomorrow’s cost effective manufacture. Full
custom design is far too expensive to realise and would
further hinder the chances of implementing the design.
Small/medium scale designs aimed at the embedded market
stand a greater chance of success than trying to compete
with the likes of Intel and SUN Microsystems.
Modern FPGA technology offers an expedient, realistic
and attainable route to implementing this microprocessor
design. For example, the leading devices from Xilinx, the
Virtex-II family, currently includes one million program-
mable gates, and the road map extends to ten million gates
[5]. Furthermore, the support software that is used to
compile, synthesise and simulate the designs for such FPGAs
is available and affordable.
The growing disparity between the speeds of processor
clocks and memory access times results in performance-
crippling latencies for anything other than first level cache
accesses. This is widely seen as a significant threat to the
continued growth in real-world application performance
[6–10]. Major design compromises have been made by the
likes of Intel in order to retain the backwards compatibility
of their existing instruction sets whilst extending the
underlying hardware architectures. Our work presented
herein is an opportunity to start with a ‘clean slate’, and
therefore is also an opportunity to redress this speed
differential.
The following sections present an overview of the new
PU, which is composed of two distinct, but highly
interdependent subsections: the ATM Cell Processing
Engine (ATM-CPE) and the ATM Cell Cache (ATMCC).
Section 2 of this paper describes the overall architecture, its
main functional blocks and their general operation. It pays
particular attention to the novel features of the design.
Section 3 presents the Programmer’s Model. This is the
set of registers that the end user/programmer will ‘see’ and
use. These are closely related to the hardware operation, and
provide an insight into the functional operation of the
processor.
An overview of the instruction set is given in Section 4,
together with the performance results of a code kernel from
a real-world video compression algorithm. Section 5 is
a summary and conclusion.
2. Architecture
Essentially the PU is composed of two halves that are
closely interconnected. The first is the ATM Cell Processing
Engine (ATM-CPE) and the second is the ATMCC. These
can be seen in the diagram of the simplified SMART PU
shown as Fig. 1.
At this stage it is sufficient to consider the ATMCC as a
FIFO of ATM cells. Movement of cells through the FIFO is
completely asynchronous from the ATM-CPE, and uses
hardware to implement all the functions necessary for its
operation. This hardware assistance is the
key to obtaining the high performance for a network
processor.
As previously mentioned, the ATM-CPE is a micro-
processor, developed on a “clean sheet”. In comparison to
the original design it is considerably simplified and more
streamlined whilst retaining the key characteristics.
Fig. 2 depicts the main functional units and interconnec-
tions of the new SMART PN design. Further details of each
item are provided below.
If the limitations of today’s FPGA and ASIC technology
are embraced, the maximum clock speed must be accepted to
be significantly lower than that offered by leading edge full
custom design. Therefore, the design maximises the work
done in each instruction, reducing its reliance on brute force
clock speeds.
As can be seen from Fig. 2, the ATM-CPE is a
superscalar design. It can simultaneously execute three
instructions, contained in a 128-bit Very Long Instruction
Word (VLIW). Using this technology forces the onerous
task of instruction scheduling onto the compiler. However,
it also greatly simplifies the hardware design because the
out-of-order scheduling logic is removed entirely. This logic
has been identified as a restriction on clock speeds, and as a
difficult unit to design and test [11,12]. Such technology is now
being favoured by companies such as Intel and SUN
Microsystems for their new designs [13,14].
Fig. 1. Simplified diagram of the new SMART processor unit architecture.
The ATM-CPE uses multiple memory interfaces to
ensure high-speed operation can be maintained. The
architecture of the ATM-CPE can be termed a ‘modified
Harvard architecture’ in that it is based upon separate
instruction and data interfaces. However, the additional
ATMCC interface provides the main data input and output
path. This modification does not directly affect the Harvard
architecture operation, but can allow use of the data memory
interface for other tasks.
The implementation of memory interfaces and the
organisation of the data and instruction caches are not new
to microprocessors, so these topics will only be briefly
discussed. It is widely recognised that placing memory
closer to the processor core significantly helps to reduce
access latency. However, it is not possible to integrate all
of the desired memory onto a single die. The extent of the
integration is implementation specific, and as such only
generic interfaces are included in the architectural design.
2.1. Data cache and data memory
The data access port is used for all non-ATM memory
accesses. The hierarchical organisation and the implemen-
tation of the cache will both be implementation specific.
Providing on-chip cache would be advantageous although
not crucial.
Data memory organisation supports reading and writing
of execution unit registers; therefore, each addressable
location is 128-bits wide. Although wider than the data
units supported by today’s microprocessors, it is not
uncommon for those processors to use 128-bit busses for
cache accesses [15,16]. This is because pin speed is a
limiting factor to transferring data on and off the chip, which
forces the use of wider busses in order to support higher data
rates.
Only complete registers (memory locations) can be
transferred in DC bus cycles, as sub-word memory
operations increase the complexity of the memory interface.
Furthermore, the ATM-CPE is a load/store architecture,
preventing memory resident data from being used by
computation instructions. Finally, transfers are
restricted to only one location/register per cycle, and this
instruction must be executed on one particular
execution unit.
2.2. Instruction cache and instruction memory
The benefit of separating the instruction and data caches
is that the organisation of each can be individually tailored
to suit the different requirements. For the Instruction
memory and cache this is to deliver one VLIW per cycle.
Like the data cache described above, the design of the
cache, memory and memory interfaces is not new to
microprocessors. Indeed the issue widths of modern processors
such as the DEC Alpha 21264 and HP PA-8500 mean that
the instruction memory interface is often wider than the data
port. Both the Alpha 21264 and the PA-8500 supply 128-
bits of instruction in one cycle (four 32-bit instructions) to
feed the execution units [17,18]. The intended instruction
width and issue rate of the ATM-CPE will not extend this,
ensuring the instruction bus design is also not a risk to its
design and implementation.
A dedicated DMA channel exists to support transfers
between the ATMCC and instruction memory. This unloads
the ATM-CPE from having to facilitate the transfers in
software, thus dedicating more processing time to the
application execution. Transfers between instruction mem-
ory and the ATMCC facilitate network update/transfer of
applications. A second DMA channel exists for the data
cache memory, and provides similar services for data.
2.3. ATM cell cache
The ATMCC was a novel feature of the original SMART
PU design and remains so for the new PU. Data enters and
leaves the SMART architecture in streams of ATM cells. It is
therefore necessary to decouple the architecture from the
network, and the ATMCC performs this function. Using a
series of hardware pointers, the ATMCC forms a
content-addressable memory that stores complete cells organised
according to their header numbers. Control Registers are used
to facilitate control of the ATMCC functions, including
accessing the pointers and determining the cache status.
Hardware assistance is used for the wire speed cell input
and output functions, which are entirely separate from one
another. Both of these are repetitive operations, fixed by the
ATM standards and required for every cell. Effectively
these operations comprise a combination of SMART
functionality and lower layer network functions. For
example, the cell input hardware would perform the cell
delineation, header error checking, and storage of the cell in
the appropriate queue within the FIFO. The output functions
include header generation (VPI/VCI and HEC as appro-
priate), timely cell transmission, and multiplexing cells
from different queues within the FIFO.
A tight integration between the ATMCC and the ATM-
CPE is crucial for true wire speed processing. Therefore
Fig. 2. The new SMART processing unit architecture.
the ATMCC, above any other memory, should be
implemented on the same silicon as (or as closely combined
as possible to) the ATM-CPE and operated at the core clock
frequency. Combined with the fact that it only needs to be
relatively small by existing cache standards (tens of KB) it
can be designed such that all accesses are completed in one
cycle. This is aided by the pointer mechanisms and by the
reliable knowledge that the payload being accessed is
present in the cache.
The interface between the ATMCC and the execution
units is 384-bits wide. This is to support the transfer of a
whole cell payload in one operation. With careful instruc-
tion scheduling it is possible for this large transfer to occur
without any modifications to the general register file. Wide
internal buses have long been a part of microprocessors, and
do not pose significant problems if constrained to short
distances and point-to-point transfers. Given the large
amount of work completed in this one transfer, these
restrictions do not significantly limit overall
performance.
It will be necessary for the ATMCC to be multi-ported in
order to prevent bottlenecks to the flow of stream data. Due
to its small size this is also not a significant problem. With
careful organisation it is possible for the stream input and
output to share one port, whilst the ATM-CPE (register
file/I-DMA/D-DMA) uses another.
2.4. ATMCC input
The operation of the input stage is shown in Fig. 3. The
ATM-CPE programs the look-up table that determines
which queue the cell payload will join. For whichever queue
has been chosen, an associated hardware write pointer
ensures the payload joins the correct place in the queue, and
also discards the majority of the header bytes. This is
possible because a specific FIFO queue is associated with a
particular cell stream, identified by a specific VPI/VCI
number, and the header is reconstructed at the time when the
cell is output (see Section 2.5). Any number of these
individual FIFOs may exist, and they are set up by the
operating system or application software.
Cells that do not conform to the predetermined
assignments can either be dropped or passed through to the
ATMCC output. In the latter case, the header values must be
retained, and as such the storage requirements for the FIFO
labelled ‘default cells’ in Fig. 3 will be increased to include the
stream number (the HEC CRC need not be stored as this is
generated for each cell as it is output). These cells leave the
ATMCC in the same order in which they arrived, and with
minimal delay.
As hardware pointers are used to store the cells, the ATM-
CPE does not need to be involved whenever a cell arrives.
Interrupts may be programmed for certain events on an
individual pointer. The ATM-CPE has sight of the hardware
pointer values through its control registers. These also
provide a means by which the ATM-CPE can read and write
the ATM cell header bytes. Payload transfer to the register
file is accomplished via a read instruction, which moves all
384-bits of data.
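The classification path just described can be sketched in software. This is a behavioural illustration only: the class and method names (ATMCCInput, add_stream, receive) are assumptions made for the sketch, not part of the SMART design, and a Python dictionary stands in for the programmable look-up table and hardware write pointers.

```python
class ATMCCInput:
    """Behavioural sketch of the ATMCC input stage: a look-up table,
    programmed by the ATM-CPE, maps a cell's VPI/VCI to a FIFO queue."""

    def __init__(self):
        self.lookup = {}               # (vpi, vci) -> queue id, set by the ATM-CPE
        self.queues = {"default": []}  # non-conforming cells keep their header here

    def add_stream(self, vpi, vci, qid):
        """Programme the look-up table to direct a cell stream to a queue."""
        self.lookup[(vpi, vci)] = qid
        self.queues[qid] = []

    def receive(self, vpi, vci, payload):
        qid = self.lookup.get((vpi, vci))
        if qid is not None:
            # Header bytes are discarded: the queue itself implies the stream,
            # and the header is reconstructed on output (Section 2.5).
            self.queues[qid].append(payload)
        else:
            # 'Default cells' retain their stream number for later regeneration.
            self.queues["default"].append(((vpi, vci), payload))
```

The sketch shows why the hardware can discard most header bytes on input: once a queue is bound to one VPI/VCI, storing the header per cell would be redundant.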
2.5. ATMCC output
The output is very much a mirror image of the input.
Each of the FIFOs marked for output has an associated
hardware read pointer that automatically generates the
address of the next cell to be read.
As shown in Fig. 4, the Statistical Time Division
Multiplexing (STDM) unit re-assembles the output cell
stream, performing the lower layer network functions of the
ATM reference model. It selects the next cell to output
based on a user tuneable STDM algorithm, where the OS or
application code may tune the algorithm through the
adjustment of certain coefficients.
Factors that would affect the STDM algorithm include
the following:
1. The relative importance of one particular stream to all of
the others. All FIFOs (or cell queues) will be given a
weighting, which the OS or application software can
Fig. 3. ATM cell cache input function and hardware assistance.
Fig. 4. ATM cell cache output function and hardware assistance.
modify to police the network traffic and control the
internal buffer levels.
2. Buffer size. This will impact the STDM algorithm in
much the same way as the importance rating. However,
the OS or application software may choose to allow the
FIFO fullness to modify the importance weighting such
that an element of automatic traffic policing and buffer
control is implemented.
3. ‘Forced’ cell transmission. The OS or application code
may choose to force a cell to be transmitted from any
queue at any time. This will override the next STDM
selection.
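The three factors above can be combined into a selection policy along the following lines. The scoring formula (static weight scaled by buffer fullness) is an assumption for illustration; the paper states only that queue weighting, buffer levels and forced transmission influence the choice, not how they are combined.

```python
def stdm_select(queues, forced=None):
    """Sketch of an STDM next-cell selection policy.

    queues: dict qid -> {"weight": float, "fill": int, "capacity": int}
    forced: queue id whose cell must be sent now, overriding the algorithm.
    """
    if forced is not None:
        return forced                 # factor 3: forced transmission wins
    best, best_score = None, -1.0
    for qid, q in queues.items():
        if q["fill"] == 0:
            continue                  # nothing queued for output
        # Factor 1 (weight) scaled by factor 2 (fullness): fuller queues
        # gain priority, giving automatic buffer-level control.
        score = q["weight"] * (q["fill"] / q["capacity"])
        if score > best_score:
            best, best_score = qid, score
    return best
```

Tuning the per-queue weights is then the coefficient adjustment mentioned above.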
A major task of the STDM stream generation unit, Fig. 4,
is to regenerate the cell header data. For each FIFO except
the ‘default cells’ queue, the stream number (VPI/VCI) will
be directly related to the FIFO from which it is taken. After
compiling this header data the unit calculates the HEC
value and inserts it before transmitting the recombined
cell. Where the header value already exists (the ‘default cells’
FIFO) only the HEC value will need to be calculated. This
function, and the others mentioned above are commonplace
in ATM hardware as they represent a selection of the
functions performed by the ATM and Physical layers.
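The HEC calculation itself is fixed by the standards: ITU-T I.432 defines it as a CRC-8 over the four preceding header bytes, using the generator polynomial x^8 + x^2 + x + 1, with the remainder XORed with the coset value 0x55. A minimal bit-serial sketch of what the output hardware computes:

```python
def atm_hec(header4):
    """CRC-8 (polynomial x^8 + x^2 + x + 1) over the four ATM header
    bytes, XORed with 0x55, per ITU-T I.432 HEC generation."""
    crc = 0
    for byte in header4:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ 0x07) & 0xFF  # 0x07 encodes x^2 + x + 1
            else:
                crc = (crc << 1) & 0xFF
    return crc ^ 0x55
```

In hardware this is normally a small combinational or table-driven circuit applied once per outgoing cell, which is why regenerating the HEC on output is cheaper than storing it per cell.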
2.6. Example of ATMCC usage: ATM cell processing
Obviously there is a reason for subdividing the cells into
separate queues, and this is to enable processing of the
payloads. Not all the incoming FIFOs will be used for the
output stream generation, and neither will all of the output
stream be made up of cells that were input to the ATMCC.
In fact, three scenarios exist for the behaviour of the FIFOs,
described below in conjunction with Fig. 5.
Firstly, it is possible to change any of the cell header
information without direct involvement of the micropro-
cessor. The output hardware reconstructs the header using a
stream number linked to the FIFO, but not necessarily the
same as that of the incoming cells. Therefore the ATM-CPE
can program the input hardware to separate a stream, or
streams, and then program the output hardware to write
those stream(s) with different VPI/VCI values (e.g. header 1
in Fig. 5). This function is only useful as part of a greater
scheme to process data, but demonstrates the ease with
which the PU can delegate tasks to the hardware assistance,
leaving itself free for other tasks. Whilst cells are queued in
a FIFO they can be read or written without further overhead.
The output hardware will only transmit those cells that are
older than the write pointer, and therefore the ATM-CPE
can ensure that a cell is output only after it is finished with.
Secondly the contents of the FIFO may be processed by
the ATM-CPE and not used to form an output stream
(header 2 in Fig. 5). The cells will still be queued as a FIFO
buffer. However, the output hardware does not select cells
from this FIFO. After the cells are no longer required they
can be released to be overwritten, simply by moving the
read pointer to the next cell that is to be retained.
Lastly, a stream of cells may be generated by the ATM-
CPE for output (header 3 in Fig. 5). In a process that is the
reverse of that described above, the ATM-CPE writes cells
into a FIFO and releases them for transmission by moving
its write pointer. Cells older than the write pointer will then
be ‘seen’ by the output hardware and transmitted as normal.
This area of ATMCC will not be allocated to the input
hardware and is therefore safe from being overwritten by
cells received from the ATM input.
Permutations and mixtures of the above three schemes
allow total flexibility in manipulating and using the cell
streams. When combined with the flexibility of the wider
architecture, which dynamically arranges the processing
topology, the SMART PU architecture can be applied to any
processing requirement.
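The read/write pointer discipline underlying all three scenarios can be sketched as a circular buffer standing in for one ATMCC FIFO. The class and method names are illustrative assumptions, not the SMART control-register interface.

```python
class CellFIFO:
    """Behavioural sketch of one ATMCC FIFO and its hardware pointers."""

    def __init__(self, size):
        self.cells = [None] * size
        self.size = size
        self.read_ptr = 0    # oldest cell still retained
        self.write_ptr = 0   # next free slot; only older cells are output

    def write(self, cell):
        # Scenario 3: writing a cell and advancing the write pointer
        # releases it for transmission.
        self.cells[self.write_ptr % self.size] = cell
        self.write_ptr += 1

    def transmittable(self):
        # The output hardware only 'sees' cells older than the write pointer.
        return [self.cells[i % self.size]
                for i in range(self.read_ptr, self.write_ptr)]

    def release(self, n=1):
        # Scenario 2: advancing the read pointer frees processed cells
        # to be overwritten, with no per-cell software overhead.
        self.read_ptr += n
```

The key property is that releasing or publishing cells is a single pointer move, so the ATM-CPE never copies data to hand cells over to the output hardware.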
3. Programmer’s model
The programmer’s model is the interface between the
processing hardware and the programmer. From this point
of view the SMART ATM-CPE has a similar appearance to
that of other traditional processors. The full programmer’s
model is shown in Fig. 6, and is explained below. The
resources seen in Fig. 6 are global resources, shared between
the three execution units.
In keeping with load/store Reduced Instruction Set
Computer (RISC) principles the ATM-CPE has a unified
register file of General Registers (GR). Additionally, there
are a number of Control Registers (CR) that form the
interface to the various peripheral functions and hardware
units. Several of the special purpose registers are mapped
into these CRs. Whilst they appear as CRs and can be read
with CR-move instructions, they must be written with
special instructions—e.g. a CALL updates the instruction
pointer (IP).
Much of the definition of the architecture and
programmer’s model is independent of the underlying hardware
implementation. This allows each implementation to use the
most suitable technology and to scale the architecture to
match the requirements of the application or environment.
Only a minimum subset of the architecture is compulsory.
Fig. 5. ATM cell cache operation.
This is a similar approach to that which is increasingly
being adopted by processor designers and manufacturers in
an attempt to add longevity to their designs, whilst
maintaining leading edge performance from technology
advances [19]. Rapid technological advances mean that
specifications which do not scale with increasing resources
are quickly outdated, and require much effort before another
generation of product can be produced.
It is commonplace for a microprocessor to include
different execution states (or privilege levels). These are
used, for instance, to allow the OS access to resources that
must be kept from general application execution. The most
privileged execution state of the ATM-CPE has a reserved
set of registers, shown within the dotted outline of Fig. 6.
These provide a low overhead context switch for the
operating system.
3.1. General registers
The GR behave very much like the registers of other
load/store architectures, in that they can be used freely as
sources or destinations for operations. This includes using
the same register as source and destination in the same
instruction. They can also be accessed as groups of three,
referred to as Register Groups (RGs). The groupings are
pre-assigned and unchangeable. RGs are formed from
logical combinations of GR, and use the normal hardware
and data paths—thus requiring little extra design. Access to
RGs requires special instructions, which are different to
normal register instructions. This minimises the instruction
size required, as only one reference needs to be made in
order to access three registers. Although RG instructions are
limited in scope, they are completed in the normal cycle
time, thereby avoiding scheduling problems and pipeline
bubbles.
All of the GR are 128-bits wide. This is for two main
reasons: the first is an extrapolation of microprocessor
design trends, and the second is the convenience of dividing
an ATM cell payload across three registers. Microprocessor
designers have been continually increasing the width of the
registers and execution units, and SMART is effectively one
step ahead of the current generation of 64-bit machines.
Although there are specialised processors that include 128-
bit registers [15,20], they are based around narrower data
paths and the wide registers are generally additional to their
original architecture. Wide registers allow powerful
manipulation of data for graphics and multimedia appli-
cations, which are particularly effective when the execution
units are horizontally partitioned for Single-Instruction
Multiple-Data (SIMD) parallelism. The ATM-CPE uses
Fig. 6. Programmer’s model of the ATM-CPE.
this to a great extent, and it provides a significant advantage
for the multimedia applications that the architecture is
aimed at.
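The SIMD partitioning mentioned above can be illustrated by modelling a 128-bit register as an integer split into sixteen 8-bit lanes. This is a behavioural illustration of horizontal partitioning in general, not the ATM-CPE datapath; the lane width and operation are chosen for the example.

```python
def simd_add8(reg_a, reg_b):
    """Add two 128-bit register values as sixteen independent 8-bit
    lanes, with per-lane wrap-around (no carry between lanes)."""
    result = 0
    for lane in range(16):
        a = (reg_a >> (8 * lane)) & 0xFF
        b = (reg_b >> (8 * lane)) & 0xFF
        # Mask to 8 bits so an overflow in one lane cannot ripple
        # into its neighbour, unlike a full-width 128-bit add.
        result |= ((a + b) & 0xFF) << (8 * lane)
    return result
```

One such instruction performs sixteen byte additions per execution unit per cycle, which is the source of the multimedia advantage claimed for the wide registers.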
The argument against wide registers is that they are
wasteful when processing small quantities, particularly
when data of the same width is stored to memory. Dual
width registers or two register files of differing widths was
rejected for the SMART ATM-CPE on the grounds that the
additional complexities would impede the streamlined
design.
The number of physical GRs in the ATM-CPE is
implementation specific. There is a recommended minimum
of 256 GRs, which may be increased to a maximum of 1280.
These are sequential and contiguous, numbered incremen-
tally from zero. The only restriction is that they are
implemented in such a way so as to not split a register group.
Only 256 GR can be visible at any time, although this is
transparent to the execution of the application code.
3.2. Indirection register
The Indirection Register (IR) provides a simple and
effective way to provide single cycle context switching and
function calls. This provides indirect general register access,
transparently to the programmer. In order to access a
register the current IR value is added to the register number
contained within the instruction, which results in an easy
translation to physical registers. To prevent incorrect
register accesses the IR is only updated during a function
call or return, and is cleared to zero during reset.
By specifying an increment to the IR the calling routine
can furnish the callee with a ‘clean’ set of registers. This is
illustrated in Fig. 7, where the register mapping on the left
represents the state of the processor before the function call.
As can be seen, the GR are mapped to the physical registers
with an indirection value of 15 (general register 16 plus IR
of 15 equals physical register number 31). A function call is
made and the IR is incremented by three, now providing the
mapping seen on the right hand side of Fig. 7. The physical
registers numbered 34 and 35 are used to pass data to the
function, and all registers below number 34 are safe from
corruption during the execution of the subroutine. It is
important to differentiate between indirect register accesses,
and indirect memory addressing modes, which use an offset
from the current IP value.
All accesses to registers R0-R15 will be absolute
accesses. This is a binary boundary that coincides with the
register groups, because the IR equally affects RGs.
A similar set of operations to those accompanying a
CALL can be performed upon the execution of a RETURN
instruction. The indirection register needs to be returned to the
value that it had prior to the call, and this must be a hardware
function, to prevent misalignment. The increment supplied
with the call is stored on an internal stack, which is ‘popped’
when the return instruction is encountered. As the new IR
value does not need to be calculated the execution unit,
which executes the return instruction, is available for
another use.
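The translation and call/return behaviour can be summarised in a few lines, following the worked example of Fig. 7 (GR16 with an IR of 15 maps to physical register 31; a call increments the IR by three). The class is a behavioural sketch; in hardware the increments live on the internal stack described above.

```python
class IndirectionRegister:
    """Sketch of IR-based register translation with a call stack."""

    def __init__(self):
        self.ir = 0       # cleared to zero during reset
        self.stack = []   # call increments, popped on RETURN

    def physical(self, reg):
        if reg < 16:
            return reg        # R0-R15 are always absolute accesses
        return reg + self.ir  # instruction register number + IR

    def call(self, increment):
        # CALL: push the increment so RETURN can undo it in hardware.
        self.stack.append(increment)
        self.ir += increment

    def ret(self):
        # RETURN: restore the caller's mapping with no address arithmetic
        # needed on the execution unit.
        self.ir -= self.stack.pop()
```

Because the translation is a single add, each function call gets a ‘clean’ register window in one cycle, which is the low-overhead context switch the section describes.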
If there are insufficient registers available to provide a
full set, the privileged execution state steps in at some stage
and saves some early registers to memory before continuing
execution. These must then be restored when the appropriate
return instructions are called and the IR would otherwise
fall below zero.
3.3. Hardware loop assistance
The most basic form of code iteration is the counted loop.
Other variants are iteration until a condition is met or
changes, and iteration forever. All these forms are made up
of a section of code with a known start and end point that is
executed repeatedly. Very often it is not feasible to unroll
these loops, particularly when loop counts are high or when
the number of iterations is unknown.
With traditional architectures the last instruction in the
loop supplies a new IP, which redirects the fetch unit to
another area of memory, where it will find the first
instruction of the loop. As the new fetch address is not
available until after the execution stage of the pipeline this
re-direction causes problems, which degrade performance.
Modern processors from Intel, DEC, SUN, and others, all
include dedicated branch prediction logic which attempts to
guess the direction which a processor will be forced to take
by a conditional branch [13,16]. Much effort has been
applied to improving the prediction algorithms and some
high performance claims are made. However, these are
generally made using best-case code fragments, which is not
entirely representative of real applications. Furthermore,
multiple context support requires the branch history to be
preserved whilst other contexts are being serviced, adding to
the volume of information that must be stored and retrieved. If
the pre-fetch unit is sent in the wrong direction then there
are numerous instructions that must be nullified/flushed
from the pipeline, and execution must wait until the correct
instructions refill the pipeline. With long pipelines (14
stages for the Intel P6 [21]) this can severely inhibit
performance.
The ATM-CPE hardware loop assistance is similar in
principle to that found on the Analog Devices 16-bit DSP
family, the ADSP-21xx [22]. The aim is to remove the
overhead associated with counted loops. The hardware
Fig. 7. Indirection register operation during a function call.
ensures that upon fetching the last instruction the fetch unit
is updated with the address of the start of the loop. In this
way the last instruction does not need to be a branch
instruction, and there is no execution penalty to the iteration.
Additionally, the overhead of processing the conditional test
is also removed.
Essentially, the hardware is initialised before the loop
commences, storing the number of loop iterations, the loop
end address and the loop start address (the current IP value
when the initialisation instruction is executed). These values
are all stored on the loop stack, which allows multiple
outstanding loops to be maintained, making provision for
nested loops.
Each time the loop is executed the loop counter is
decremented, and upon reaching zero execution is allowed
to continue past the end of the loop. This does not require
any additional address calculation hardware, as the absolute
addresses are stored ready for instant use. It is the
initialisation instructions that calculate the address prior to
beginning the loop.
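The initialise-then-fetch behaviour described above can be sketched as a small behavioural model (Python; all names are hypothetical, and the real hardware holds this state in the loop stack):

```python
# Behavioural sketch of the loop-assist hardware (names hypothetical).
# Each loop-stack entry stores the iteration count and the absolute
# start/end addresses, computed once by the initialisation instruction
# before the loop body runs.

loop_stack = []

def loop_init(ip, body_len, iterations):
    # Current IP is the loop start; the end address is precomputed, so
    # no per-iteration address arithmetic is required.
    loop_stack.append({"start": ip, "end": ip + body_len - 1,
                       "count": iterations})

def next_fetch(ip):
    # Fetch-unit rule: on fetching the last instruction of the innermost
    # active loop, decrement the counter and redirect to the loop start.
    if loop_stack and ip == loop_stack[-1]["end"]:
        loop_stack[-1]["count"] -= 1
        if loop_stack[-1]["count"] > 0:
            return loop_stack[-1]["start"]
        loop_stack.pop()            # loop complete: fall through
    return ip + 1

# A three-instruction loop at address 100, executed four times:
loop_init(100, 3, 4)
ip, trace = 100, []
for _ in range(12):
    trace.append(ip)
    ip = next_fetch(ip)
```

The pipeline-length adjustment discussed next amounts to storing the end address minus the pipeline depth as the comparison address, leaving this rule otherwise unchanged.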
A benefit of this form of loop control is that it can be
adjusted to take account of the pipeline length, thus ensuring
the last instruction of a loop is followed immediately by the
first without any pipeline bubbles. Instead of comparing the
current IP value against the end address itself, the end
address minus the pipeline length can be used as the
comparison point.
The only other change is to the mechanism that determines
the start address from the current IP value.
This form of loop control works very well for
mathematical functions that have a known number of
iterations. As many of the multimedia and network
operations are fundamentally mathematical and iterative
this will benefit the general environment of SMART.
Additionally, SMART includes the ability to program the
loop assistance unit not to decrement the loop counter, and
also instructions to immediately clear the counter value
(predicated for conditional execution). These also enable
SMART to process the other two forms of iteration that
were previously mentioned without any overhead—con-
ditional exit loops and infinite loops. There are restrictions
on the placement of an instruction that clears the loop
counter, as its pipeline latency must be accounted for. Even
with these restrictions, however, hardware loop assistance
remains a powerful feature of the architecture.
4. Instruction set
Defining an instruction set is as much a project in itself
as any other single task in the overall development of a new
microprocessor. Whilst some types of instructions are
fundamental to the operation of all processors, there are
others, which are novel or innovative because of the
architectural design of the ATM-CPE. Some examples of
these are included here to illustrate the functional operation,
and to support the code kernel.
Instructions are grouped in threes to form a VLIW. They are
all self-contained, without interdependencies or conflicts.
There are certain restrictions on which execution units can
perform particular instructions (such as address
calculation for load/store operations), but essentially all
three are identical. This reduces the design time, as a single
instruction unit can be copied for units two and three. Self-
contained instructions further reduce the hardware compli-
cations, as they do not require communications or transfers
between the execution units. A VLIW packet is issued as a
single entity, with the position of the instructions within the
VLIW used to determine which execution unit they are
issued to.
Complex instruction scheduling logic is widely identified
as a limiting factor to the design of faster superscalar
processors [11,23]. VLIW technology relies on the compiler
to optimise the instruction scheduling, but removes this
large section of logic from the processor entirely. To further
aid the instruction scheduling, the ATM-CPE follows a
similar route to the latest designs from Intel (IA-64), SUN
Microsystems (MAJC) and others, where instruction
execution is predicated [24,25]. Therefore the compiler
can simultaneously schedule multiple execution paths, and
use the outcome of a conditional test to determine which one
is used.
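As a sketch of the idea (illustrative only; the register names and commit rule below are assumptions, not the ATM-CPE encoding), both arms of a branch are issued and each result is committed only if its predicate register is true, so the fetch stream is never redirected:

```python
# Sketch of predicated execution: both paths are scheduled, and the
# predicate determines which result is committed (names hypothetical).

def predicated_add(pred, dest, src_a, src_b, regs):
    if regs[pred]:                 # commit only when the predicate holds
        regs[dest] = regs[src_a] + regs[src_b]

regs = {"p0": True, "p1": False, "r1": 5, "r2": 7, "r3": 0, "r4": 0}
predicated_add("p0", "r3", "r1", "r2", regs)   # taken path commits
predicated_add("p1", "r4", "r1", "r2", regs)   # untaken path is a no-op
```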
Instructions comprise five fields, each 8 bits wide. These
identify the operation and the various registers: an
instruction operation code, two source registers, one
destination register, and a predicate register. Therefore
instructions such as Rdest = RA + RB are possible. Equally, the two
source registers may be used as a 16-bit immediate value, or
a combination of register identifier and 8-bit immediate
data.
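A possible packing of these five fields is sketched below; the field ordering is purely an assumption for illustration, as the paper does not specify the bit-level encoding:

```python
# Hypothetical packing of the five 8-bit instruction fields into one
# 40-bit word (field order assumed, not taken from the paper).

def pack(opcode, src_a, src_b, dest, pred):
    for field in (opcode, src_a, src_b, dest, pred):
        assert 0 <= field < 256          # every field is 8 bits
    return (opcode << 32) | (src_a << 24) | (src_b << 16) | (dest << 8) | pred

def unpack(word):
    return tuple((word >> shift) & 0xFF for shift in (32, 24, 16, 8, 0))

def immediate16(word):
    # The two source-register fields reused as one 16-bit immediate.
    _, src_a, src_b, _, _ = unpack(word)
    return (src_a << 8) | src_b
```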
The instruction set includes significant emphasis on
SIMD partitioned operations. This exploits the wide
registers, allowing one instruction to perform as much
work as possible. These instructions include arithmetic,
logical, shift and rotate. One example is the concatenated
shift. Whilst the operation is thought of as a shift, it can
actually be performed as a dual bit-field extract and
combine. Fig. 8 illustrates the operation, where two registers
are concatenated into a 256-bit word, and shifted by an
amount contained in the third register.
In this example the destination register is the most
significant operand. When the shift direction is reversed,
the two source registers are concatenated the other way
round, such that the destination register becomes the least
significant operand.
Fig. 8. Concatenated shift instruction operation.
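The equivalence between the concatenated shift and a dual bit-field extract and combine can be shown with Python's arbitrary-precision integers standing in for 128-bit registers (a behavioural sketch only):

```python
# The concatenated shift of Fig. 8: two registers form a 256-bit value,
# shifted right by m, keeping the low 128 bits -- equivalently a dual
# bit-field extract and combine.

WIDTH = 128
MASK = (1 << WIDTH) - 1

def cshiftr(hi, lo, m):
    # Right funnel shift of the concatenated pair hi:lo.
    assert 0 <= m < WIDTH
    return (((hi << WIDTH) | lo) >> m) & MASK
```

In the motion-estimation kernel this is what aligns sixteen pixels straddling two 128-bit memory words: m is the misalignment expressed in bits.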
This instruction is a powerful manipulation that greatly
enhances the capability of the ATM-CPE. It makes the 128-bit
registers and memory locations very efficient through a
reduction in the number of operations that are required for
pattern matching functions such as motion estimation.
To support this, another important partitioned instruction
is the Sum of Absolute Difference (SAD), which performs
the necessary steps to calculate the SAD of pixel data
contained in a single register. New microprocessor archi-
tectures include such instructions [24] to enhance the power
of their instruction sets without necessarily increasing the
demands on memory bandwidth. There is a fine line
separating complex instructions that help reduce the
memory bandwidth requirements from those whose
complexity slows down the critical paths within the
microarchitecture.
The SAD instruction, however, has been implemented on
other processors without detrimental effects and is also a
significant benefit in the highly computational motion
estimation kernel. The ATM-CPE supports two forms of
the SAD instruction, which are aimed at different pixel
accuracies—8-bit and 16-bit representations. Fig. 9 illus-
trates the operations performed for a SAD instruction. The
partitioned execution units are particularly well suited to
this type of calculation, and its inclusion in the instruction
set does not add much complexity to the overall design.
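A behavioural sketch of the 8-bit form of the SAD instruction follows; the sixteen lanes are assumed from the 128-bit register width:

```python
# Sketch of the 8-bit SAD instruction: a 128-bit register is treated as
# sixteen 8-bit lanes, and the absolute differences against a second
# register are summed to a single scalar.

def sad8(reg_a, reg_b):
    total = 0
    for lane in range(16):
        a = (reg_a >> (8 * lane)) & 0xFF
        b = (reg_b >> (8 * lane)) & 0xFF
        total += abs(a - b)
    return total
```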
4.1. Performance estimation—compressed video motion
estimation
The computational load of the motion estimation
algorithm used in MPEG and other video compression
algorithms is often regarded as a benchmark for processor
performance. The number of operations that must be
performed remains too great for all but a few of the top
processors to achieve real time results. Estimates of the
computational requirements of a full search and a ‘fast’
search have been given as 18.2 × 10⁹ and 621 × 10⁶ generic
RISC-like operations per second, respectively [26]. Using
the SIMD parallelism of its wide registers, the ATM-CPE
can perform the same functions with a speed up of 45 times
over the generic load/store architecture.
As each of SMART’s 128-bit registers or memory
locations can store sixteen 8-bit pixels, it is not efficient to
read individual pixels from memory. Therefore the motion
estimation code must take account of how the video frame is
stored in data memory in order to achieve the greatest
speed-up from its wide data path. If pixels are stored in
groups that are the same size as Macro Blocks (MBs), and
have the same borders as MBs, then they can be read, stored
and written with the maximum efficiency. Pointers are
linked to the MB number, and an entire MB is transferred
with 16 sequential memory read operations. MBs are stored
in memory in such a manner as to reduce the overhead of
pointer management during kernel functions such as Motion
Estimation, such that an area of the video frame, composed
of any 9 adjacent MBs, would be stored in memory as
shown in Fig. 10.
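Under this layout the address arithmetic is trivial. The word-addressed scaling below is an assumption consistent with Fig. 10, not taken verbatim from the paper:

```python
# Assumed address arithmetic for the MB-grouped layout of Fig. 10 in a
# word-addressed memory: each 16 x 16 MB occupies 16 consecutive
# 128-bit words, so a pointer is derived from the MB number directly.

WORDS_PER_MB = 16            # one 128-bit word per 16-pixel MB row

def mb_base(mb_number):
    return mb_number * WORDS_PER_MB

def mb_addresses(mb_number):
    # The 16 sequential reads that transfer an entire aligned MB.
    base = mb_base(mb_number)
    return [base + row for row in range(WORDS_PER_MB)]
```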
Search operations pay no heed to boundaries between
MBs, and a disadvantage of grouping pixels into a single
memory location is the additional memory accesses that are
required to search across such boundaries. For example, the
full extent of the search range for MB ‘Q’ is shown in Fig. 11
as the shaded area, whilst an arbitrary search location is
shown as a dotted line. This location requires 32 memory
accesses in order to overcome the misalignment, as opposed
to an optimum 16 memory fetches for a perfectly aligned
MB. However, nearly all search locations will be mis-
aligned, and therefore the overhead cannot be ignored.
The offset m must be used to shift the memory data in
order to compare it to the reference MB held in registers.
Fig. 9. Illustration of ‘sum of absolute difference’ instruction operation.
Fig. 10. MB storage organisation for optimum motion estimation
performance.
The vertical offset between different MBs does not present a
problem, as the fetches remain sequential across these
boundaries.
For the misaligned search shown in Fig. 11, the kernel
function only needs to know the memory location of the first
fetch, the difference between similar locations in adjacent
columns, n, and the required shift distance to align the
registers, m. The reference MB, for which the best match is
to be found, is loaded into 16 registers. This only needs to be
done once, as it remains in the registers until the search is
concluded. As this code fragment is the kernel of a
subroutine that would be called once for each search
location the 16 loads are not included in the subroutine code.
The source code for the motion estimation SAD is shown
in Fig. 12. It is written for the SMART ATM-CPE as it uses
128-bit data values (natural integers), which are not native
to the ‘C’ language. The code supporting the function call
SAD(16_pels) is not shown as it is later optimised to a
single SMART PN instruction, although a function call
allows alternative compiler optimisations for other
architectures.
The operations required for one pass through the loop
are: two loads (from the data memory location pointed to by
mb1_ptr and mb1_ptr þ n), a pointer increment, two shifts
and a logical OR (to generate a single 128-bit value from
across the MB boundary), the SADs, and finally the
accumulation of the SAD summation.
This assembles to the pseudo code shown in Table 1,
where the operations described above for one pass through
the loop have been converted into functions achievable with
ATM-CPE instructions.
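The per-pass operations listed above can be sketched in Python as a stand-in for the ‘C’ of Fig. 12; each helper models what a single ATM-CPE instruction performs, and all names are hypothetical:

```python
# One pass of the SAD inner loop from Table 1 (behavioural sketch).

WIDTH, MASK = 128, (1 << 128) - 1

def cshiftr(hi, lo, m):                  # concatenated shift
    return (((hi << WIDTH) | lo) >> m) & MASK

def sad8(a, b):                          # partitioned SAD over 16 lanes
    return sum(abs(((a >> 8 * i) & 0xFF) - ((b >> 8 * i) & 0xFF))
               for i in range(16))

def sad_row(mem, ptr, n, m, ref_row):
    # Two loads straddling the MB boundary, align, SAD, advance pointer.
    lo, hi = mem[ptr], mem[ptr + n]      # load from ptr_1 and ptr_2
    aligned = cshiftr(hi, lo, m)         # shift values to align
    return sad8(aligned, ref_row), ptr + 1

# Demo: rows misaligned by 4 pixels (m = 32 bits) across columns n = 5
# words apart; the reference row matches exactly, so the SAD is zero.
mem = {0: int.from_bytes(bytes(range(16)), "little"),
       5: int.from_bytes(bytes(range(16, 32)), "little")}
ref = int.from_bytes(bytes(range(4, 20)), "little")
row_sad, next_ptr = sad_row(mem, 0, 5, 32, ref)
```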
The operations of Table 1 clearly demonstrate the power
of the ATM-CPE instruction set and its wide registers.
Further speed enhancements can be made if the loop is
unrolled and the code fragment shown in Table 2 is
interleaved with two other loop iterations on the two other
execution units. This is an illustrative example only
showing the parallelism achieved, together with the
tessellation of the individual loop structures that have
been unrolled—the full code can be seen in Table 3. This
form of parallelisation ensures that the execution units are
kept fully occupied and no instruction slots are wasted. It is
possible to unroll the loops in their entirety because the
instruction set reduces the loop body to a short sequence of
instructions.
Laying this code out as true assembler level source code1
loses the obvious parallelism shown above. Throughout the
coding the register usage is: registers 16–31 contain the MB
which is to be compared with the video frame in memory,
register 32 contains the shift value m, register 33 contains
the pointer offset n, and register 34 is the pointer to the first
data memory location containing the reference pixels. The
total accumulates in register 35, whilst registers 36 through
to 41 are used for the temporary totals of each parallel
operation. In this way the calling routine passes the
necessary data in registers, which become those numbered
16 onwards after the update of the IR.
The architecture restricts execution unit number one to
the calculation of memory addresses for load/store oper-
ations to memory. This is the main reason that the source
code of Table 3 loses its obvious readability, as the
Fig. 11. Macro block search locations and memory organisation boundaries.
Table 1
ATM-CPE Instructions for a single pass through the motion estimation loop
Pseudo instruction Operation/C source code
load from ptr_1 load from mb1_ptr
load from ptr_2 load from n þ mb1_ptr
concatenated shift shift two registers to align pix
calculate the SAD SAD of all 16 pixels (SAD(16_pels))
accumulate total total = total + SAD(16_pels)
increment ptr pointer moved to next 16 pixels
Fig. 12. Source code for sum of absolute difference calculation.
1 Source code format is: opcode, destination, sourceA, sourceB. As every
instruction must be executed the predicate register has been omitted for
clarity.
instructions are scheduled around this unit. Upon return the
final total value will be accumulated in R35, ready for use
by the calling routine. Notice that, in adhering to the rules
restricting functions to particular execution units (load/
store and branch), this code has not become any longer than
it would without such limitations. This is due not only to careful
planning of the operation, but also to the architectural
resources and the powerful instruction set.
4.2. Summary of motion estimation performance
Careful instruction set design, a clean architecture and
horizontal register partitioning all serve to exploit the power
inherent in the 128-bit GR. In just 35 clock cycles, the new
ATM-CPE can perform the 256 comparisons and accumu-
late the SADs for a whole MB.
The ATM-CPE fares well in comparison to performance
claims by other processors. The VIS extensions to the SUN
Ultra-SPARC I and II allow it to perform a similar MB
comparison for motion estimation in 441 instruction cycles,
operating on similar 8-bit pixel data [27]. Hewlett Packard
show that their MAX-2 instruction set extensions can
compute the difference of 8 pixels in 2 instruction cycles of
their superscalar PA-8000. Their benchmark results claim
that the code kernel for the full SAD for a MB takes 130
instruction cycles [28]. It is the need to continually load data
from memory and transfer it between internal registers that
slows the Ultra-SPARC, whilst the narrower data path and
lack of a dedicated difference instruction prevent greater
speed in the PA-8000. At 35 cycles the ATM-CPE uses far
fewer instructions. Furthermore, the wider memory
locations and registers reduce the memory accesses, and
help prevent stalls due to memory access latencies. This is
an often-overlooked performance penalty that only becomes
apparent when a real-world application is executed.
A frame size of 18 by 22 MBs, at 30 frames per second
would require an ATM-CPE to run at 400 MHz in order to
perform a full motion estimation search. This represents a
speed up of 45 times over the generic RISC processor
Table 2
Pseudo code implementation of motion estimation loop
…    …    …
accumulate total shift values to align load from ptr_1
increment ptr_1 calculate the SAD load from ptr_2
load from ptr_1 accumulate total shift values to align
load from ptr_2 increment ptr_1 calculate the SAD
shift values to align load from ptr_1 accumulate total
calculate the SAD load from ptr_2 increment ptr_1
accumulate total shift values to align load from ptr_1
increment ptr_1 calculate the SAD load from ptr_2
load from ptr_1 accumulate total shift values to align
load from ptr_2 increment ptr_1 calculate the SAD
shift values to align load from ptr_1 accumulate total
calculate the SAD load from ptr_2 increment ptr_1
accumulate total shift values to align load from ptr_1
Table 3
Complete motion estimation code
#1 load-rr r36, r34, r00; Fetch the first 16 pixels
nop;
nop;
#2 load-rr r37, r34, r33; Fetch the second 16 pixels
add-rr r34, r34, i01; Increment memory pointer
nop;
#3 cshiftr r37, r36, r32; Concatenate and shift to align
load-rr r38, r34, r00; Fetch the third 16 pixels
nop;
#4 load-rr r39, r34, r33; Fetch the fourth 16 pixels
sad8 r37, r37, r16; First sum of absolute differences
add-rr r34, r34, i01; Increment memory pointer
#5 load-rr r40, r34, r00; Fetch the fifth 16 pixels
add-rr r35, r35, r37; Accumulate the first total
cshiftr r39, r38, r32; Concatenate and shift to align
#6 load-rr r41, r34, r33; Fetch the sixth 16 pixels
add-rr r34, r34, i01; Increment memory pointer
sad8 r39, r39, r17; Second sum of abs. differences
#7 load-rr r36, r34, r00; Fetch the seventh 16 pixels
add-rr r35, r35, r39; Accumulate the second total
cshiftr r41, r40, r32; Concatenate and shift to align
#8 load-rr r37, r34, r33; Fetch the eighth 16 pixels
sad8 r41, r41, r18; Third sum of absolute differences
add-rr r34, r34, i01; Increment memory pointer
#9 load-rr r38, r34, r00; Fetch the ninth 16 pixels
cshiftr r37, r36, r32; Concatenate and shift to align
add-rr r35, r35, r41; Accumulate the third total
#10 load-rr r39, r34, r33; Fetch the 10th 16 pixels
sad8 r37, r37, r19; Fourth sum of abs. differences
add-rr r34, r34, i01; Increment memory pointer
#11 load-rr r40, r34, r00; Fetch the 11th 16 pixels
add-rr r35, r35, r37; Accumulate the fourth total
cshiftr r39, r38, r32; Concatenate and shift to align
#12 load-rr r41, r34, r33; Fetch the 12th 16 pixels
add-rr r34, r34, i01; Increment memory pointer
sad8 r39, r39, r20; Fifth sum of absolute differences
#13 load-rr r36, r34, r00; Fetch the 13th 16 pixels
add-rr r35, r35, r39; Accumulate the fifth total
cshiftr r41, r40, r32; Concatenate and shift to align
#14 load-rr r37, r34, r33; Fetch the 14th 16 pixels
sad8 r41, r41, r21; 6th sum of absolute differences
add-rr r34, r34, i01; Increment memory pointer
#15 load-rr r38, r34, r00; Fetch the 15th 16 pixels
cshiftr r37, r36, r32; Concatenate and shift to align
add-rr r35, r35, r41; Accumulate the sixth total
#16 load-rr r39, r34, r33; Fetch the 16th 16 pixels
sad8 r37, r37, r22; 7th sum of absolute differences
add-rr r34, r34, i01; Increment memory pointer
#17 load-rr r40, r34, r00; Fetch the 17th 16 pixels
add-rr r35, r35, r37; Accumulate the seventh total
cshiftr r39, r38, r32; Concatenate and shift to align
#18 load-rr r41, r34, r33; Fetch the 18th 16 pixels
add-rr r34, r34, i01; Increment memory pointer
sad8 r39, r39, r23; 8th sum of absolute differences
#19 load-rr r36, r34, r00; Fetch the 19th 16 pixels
add-rr r35, r35, r39; Accumulate the 8th total
cshiftr r41, r40, r32; Concatenate and shift to align
#20 load-rr r37, r34, r33; Fetch the 20th 16 pixels
sad8 r41, r41, r24; 9th sum of absolute differences
add-rr r34, r34, i01; Increment memory pointer
#21 load-rr r38, r34, r00; Fetch the 21st 16 pixels
cshiftr r37, r36, r32; Concatenate and shift to align
operations. More importantly, a fast search algorithm could
reduce this requirement to under 14 MHz!
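The 400 MHz figure can be reproduced with simple arithmetic, provided a ±15-pixel full-search window (31 × 31 = 961 candidate locations per MB) is assumed; the paper does not state the window size:

```python
# Back-of-envelope check of the 400 MHz full-search figure.  The
# 31 x 31 = 961 candidate window is an assumption; the frame size,
# frame rate and 35-cycle SAD come from the text.

mbs_per_frame = 18 * 22
frames_per_s = 30
locations = 31 * 31         # assumed full-search window per MB
cycles_per_sad = 35         # from Section 4.2

clock_hz = mbs_per_frame * frames_per_s * locations * cycles_per_sad
# clock_hz is roughly 4.0e8, i.e. about 400 MHz
```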
5. Conclusion
This paper has outlined the design, operation and
performance of a highly novel microprocessor architecture
currently under development at the University of Sussex. It
directly supports the manipulation of ATM cells, and
includes a cell stream interface. A third small cache memory
allows the cells to be buffered and accessed by the
processing core.
The five main design axioms set out in the introduction
have been met. ATM has been retained as the network
interface, which is given a high priority route into the very
heart of the processor architecture. At all times the design
has tried to avoid complexity and the need for many
resources to implement it. This ensures the implementation
is affordable, and does not rely on fully customised silicon
design. Additionally, a smaller implementation makes a
System On Chip (SOC) implementation more plausible with
today’s technology. Finally, the design has strived to reduce
the impact of the disparity between memory and processing
speeds. Through the use of wide registers, powerful
instructions, and deliberately lower processor clock speeds
(from the accepted target of FPGA and ASIC technology)
the impact of memory latency has been diminished.
Promoting the importance of the network interface
leads to many possible deployments beyond the original
workstation CPU envisaged by Cashman. These include
processing of cells in switch buffers and queues, affordable
embedded network processors for ATM networks, and
intelligent network interface cards. As network speeds
continue to increase faster than processor or memory
speeds, such innovative architectures will become ever-
more important. In fact, one of the fathers of the
microprocessor, Federico Faggin, predicted that micro-
processor development would be divided into three phases,
each lasting for approximately 25 years [29]. During the
first phase the dominant factor over improvements in cost-
performance was semiconductor technology. This allowed
the implementation, and gradual speed enhancement of
many of the ideas originally developed for minicomputers
and mainframes.
Then during the second phase, it is predicted that
architectural developments will become as dominant as
those of semiconductor technology. In the words of
Federico Faggin [29]: “Innovation will also come about
from changing market requirements, primarily the need to
process and communicate multimedia information more
efficiently… It will stimulate novel computer structures
motivated by the new market needs and supported by an
ever more powerful semiconductor technology.”
It is forecast that this phase will be a period of vigorous
growth for the microprocessor. The trend towards systems
on a chip should begin to appear, and this is seen as the main
evolutionary direction of the microprocessor.
The prediction for the final phase says that architectural
advances alone will be the dominant factor in micropro-
cessor progress. This is because the physical limits of
semiconductor technology will have been reached. The
systems on a chip will become more complex and flexible,
drawing technology from re-configurable computing to
create dynamic architectures.
The second phase of Faggin’s prediction is beginning to
be realised, as many different architectural features are
being explored to overcome some fundamental limitations.
Incremental development is beginning to give way to
radical architectural changes, and market segments are
emerging based on different functional requirements. Some
of these are existing and proven, such as Digital Signal
Table 3 (continued)
add-rr r35, r35, r41; Accumulate the 9th total
#22 load-rr r39, r34, r33; Fetch the 22nd 16 pixels
sad8 r37, r37, r25; 10th sum of absolute differences
add-rr r34, r34, i01; Increment memory pointer
#23 load-rr r40, r34, r00; Fetch the 23rd 16 pixels
add-rr r35, r35, r37; Accumulate the 10th total
cshiftr r39, r38, r32; Concatenate and shift to align
#24 load-rr r41, r34, r33; Fetch the 24th 16 pixels
add-rr r34, r34, i01; Increment memory pointer
sad8 r39, r39, r26; 11th sum of absolute differences
#25 load-rr r36, r34, r00; Fetch the 25th 16 pixels
add-rr r35, r35, r39; Accumulate the 11th total
cshiftr r41, r40, r32; Concatenate and shift to align
#26 load-rr r37, r34, r33; Fetch the 26th 16 pixels
sad8 r41, r41, r27; 12th sum of absolute differences
add-rr r34, r34, i01; Increment memory pointer
#27 load-rr r38, r34, r00; Fetch the 27th 16 pixels
cshiftr r37, r36, r32; Concatenate and shift to align
add-rr r35, r35, r41; Accumulate the 12th total
#28 load-rr r39, r34, r33; Fetch the 28th 16 pixels
sad8 r37, r37, r28; 13th sum of absolute differences
add-rr r34, r34, i01; Increment memory pointer
#29 load-rr r40, r34, r00; Fetch the 29th 16 pixels
add-rr r35, r35, r37; Accumulate the 13th total
cshiftr r39, r38, r32; Concatenate and shift to align
#30 load-rr r41, r34, r33; Fetch the 30th 16 pixels
add-rr r34, r34, i01; Increment memory pointer
sad8 r39, r39, r29; 14th sum of absolute differences
#31 load-rr r36, r34, r00; Fetch the 31st 16 pixels
cshiftr r41, r40, r32; Concatenate and shift to align
add-rr r35, r35, r39; Accumulate the 14th total
#32 load-rr r37, r34, r33; Fetch the 32nd 16 pixels
sad8 r41, r41, r30; 15th sum of absolute differences
nop;
#33 add-rr r35, r35, r41; Accumulate the 15th total
cshiftr r37, r36, r32; Concatenate and shift to align
nop;
#34 return; Return, with delay slot
sad8 r37, r37, r31; 16th sum of absolute differences
nop;
#35 add-rr r35, r35, r37; Accumulate the 16th total
nop;
nop;
Processors (DSPs), whilst others are being created by new
markets, such as that which led to network processors.
The design of the SMART processor is very forward
looking. It is well placed to serve network connectivity, it
has the processing power to support multimedia appli-
cations, and through its design simplicity is affordable to
manufacture.
References
[1] N. Cashman, SMART: a proposed novel multimedia computer
architecture for processing ATM cells in real-time, PhD thesis,
University of Sussex, October 1998.
[2] R.H. Arpaci-Dusseau, A.C. Arpaci-Dusseau, D.E. Culler, J.M.
Hellerstein, D.A. Patterson, The architectural costs of streaming I/
O: a comparison of workstations, clusters and smps, Fourth
International Symposium on High Performance Computer Architec-
ture (HPCA), IEEE computer society technical committee on
computer architecture, Jan-Feb, 1998.
[3] M.D. Hayter, A workstation architecture to support multimedia, PhD
thesis, St John’s College, University of Cambridge, September 1993.
[4] A. Boxer, Where busses can’t go, IEEE Spectrum February (1995)
41–45.
[5] Xilinx, Virtex-II 1.5 V Field-Programmable Gate Arrays, Xilinx Inc,
April 2001, Data sheet from http://www.xilinx.com
[6] Wm.A. Wulf, S.A. McKee, Hitting the memory wall: implications of
the obvious, ACM SIGARCH Computer Architecture News 23 (1) (1995) 20–24.
[7] D. Burger, J.R. Goodman, A. Kagi, Limited bandwidth to affect
processor design, IEEE Micro 17 (6) (1997) 55–62.
[8] J.D. McCalpin, Memory bandwidth and machine balance in current
high performance computers, IEEE Computer Society Technical
Committee on Computer Architecture Newsletter December (1995)
19–25.
[9] R. Nass, Ring architecture connects up to 128 PCI buses, Electronic
Design November (1997) 85–88.
[10] W.J. Dally, S. Lacy, VLSI architecture: past, present and future,
Advanced Research In VLSI, 1999.
[11] T.R. Halfhill, Inside IA-64, BYTE June (1998) 81–88.
[12] S. Palacharla, Complexity-effective superscalar processors, PhD
thesis, School of Computer Sciences, University of Wisconsin-
Madison, 1998.
[13] H. Sharangpani, Intel itanium processor microarchitecture overview,
Proceedings of Microprocessor Forum, San Jose, California, October,
1999.
[14] S. Sudharsanan, MAJC-5200: a high performance microprocessor for
multimedia computing, White paper, June, 1999.
[15] Intel Corporation, IA-32 Intel architecture software developer’s
manual volume 1: Basic architecture, Intel Corporation, P.O. Box
7641 Mt. Prospect IL 60056-7641, 2001, Document order number
245470.
[16] L. Gwennap, Digital 21264 sets new standard, Microprocessor Report
October (1996).
[17] HP, PA-8500: The continuing evolution of the PA-8500 family, Web
page; http://www.hp.com/computing/-framed/technology/micropto/
pa-8500/docs/8500.html, 1997.
[18] D. Tabak, Advanced Microprocessors, Second ed., McGraw-Hill,
1995.
[19] M. Tremblay, MAJC: microprocessor architecture for Java comput-
ing, HotChips ’99, 1999.
[20] C. Hansen, Microunity’s media processor architecture, IEEE Micro 16
(4) (1996) 34–41.
[21] J. Bayko, Great microprocessors of the past and present, Web page;
http://infopad.eecs.berkeley.edu/CIC/-archive/cpu_history.html,
1997.
[22] Analog Devices, ADSP-2100 family user’s manual, Analog Devices,
Third ed., 1995.
[23] Intel Corporation and Hewlett Packard, IA-64 architectural inno-
vations, joint white paper, February, 1999.
[24] Intel Corporation, IA-64 Application Developer’s Architecture Guide
May (1999).
[25] M. Tremblay, MAJC-5200: A VLIW convergent MPSOC, Micro-
processor Forum, 1999.
[26] V. Bhaskaran, K. Konstantinides, Image and Video Compression
Standards: Algorithms and Architectures, Second ed., Kluwer
Academic Press, 1995.
[27] M. Tremblay, J.M. O’Connor, V. Narayanan, L. He, VIS speeds new
media processing, IEEE Micro 16 (4) (1996) 10–20.
[28] R.B. Lee, Subword parallelism with MAX-2, IEEE Micro 16 (4)
(1996) 51–59.
[29] F. Faggin, The microprocessor, IEEE Micro 16 (6) (1996) 7–9.
Gary Jones graduated from the University
of Brighton, England, in 1996 with a first
class Honours degree in Electronic Engin-
eering, and a Masters degree in Engineer-
ing with Business Management. He went
on to start his DPhil research work with the
Communications Research Group, at the
University of Sussex. During the last two
years Gary has worked as a consultant
design engineer for a local company. He
now runs his own engineering and soft-
ware consultancy company.
Elias Stipidis graduated with a first class
BEng Honours degree in Electronics and a
DPhil in Communications from the Uni-
versity of Sussex in 1995 and 1998,
respectively, and is currently leading the
Data Communications and Embedded
Systems research at the University of
Sussex, England.