
Programmable Peripheral Devices

Patrick Crowley
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98043

1 Introduction

Many important server applications are I/O bound. For example, large-scale database mining and

decision support applications are limited by the performance of the storage I/O subsystem, and,

likewise, web servers and Internet backbone routers are constrained by the capabilities of the

network I/O system.

For this reason, many proposals have been made in recent years to migrate functionality from

servers onto programmable storage and network devices in order to improve application

performance. This report surveys these proposals and evaluates research areas concerning these

programmable peripheral devices.

Programmable disks can be used to: scale processing power with the size of large scan-

intensive database problems, build scalable, secure, and cost-efficient storage systems, and

implement sophisticated storage optimizations at the disk.

Programmable network interfaces (NIs) can be used to: unburden the host from managing

data transfers in fast networks, scale processing power with the number of links in a network,

and enable complex packet processing, such as the aggressive program-in-a-packet Active

Network proposal, at network speeds.

Both types of programmable peripherals contain all the components found in computer

systems: processor, memory, and communications subsystem. Based on this observation, this

report concludes with a set of common issues, including technical vulnerabilities and areas for

future research.

This report is organized as follows. Section 2 contains background information and a

historical perspective concerning programmable peripherals. Sections 3 and 4 discuss the designs

for and applications of programmable disks and network interfaces, respectively. Examples of

other programmable peripherals are briefly discussed in Section 5. A set of issues common to


both programmable disks and network interfaces is presented in Section 6. The report concludes

with a summary and brief set of research proposals in Section 7.

2 Background

The seasoned reader will note that programmable I/O devices are, in fact, far from a new idea.

Programmable peripherals have been implemented and abandoned for good reasons in the past.

To consider the arguments against programmability in these devices, we first recall a Pitfall and

Fallacy from the storage and network I/O chapters, respectively, of a popular computer

architecture textbook [Hennessy and Patterson 1996].

Pitfall: Moving functions from the CPU to the I/O processor to improve performance.

An I/O processor, in this context, is a direct memory access (DMA) device that can do more than

shuffle data. The authors are recalling the programmable I/O processors found in classic

machines such as the IBM 360 which, in the 1960s, had programmable I/O channels [Amdahl et

al. 1964, Cormier et al. 1983]. One application of this programmability was support for linked-

list traversal at the I/O processor. (Interestingly, one group recently proposed the addition of

execution engines at each level in the CPU cache memory hierarchy to enable the overlap of

computation and communication in linked-list traversals [Yang and Lebeck 2000].) This

relieved the host CPU of the traversal task, so it was free to do other work. The argument

against this usage was that the advances in host CPU performance would prove far greater than

the performance advances of the I/O controller in the next generation. Thus, applications that

used and benefited from this optimization in generation N, actually saw decreased performance

when running on the machine of generation N+1. Put plainly, the host CPU was by far the most

expensive and powerful compute element in the system; to bet against it was folly.

Fallacy: Adding a processor to the network interface card improves performance.

In elaborating on this fallacy, the authors argue essentially the same point: the advantages of host

CPU speed.

The issues raised by H&P are answered by the state-of-the-art in embedded microprocessors

today. Rather than being far slower than high-performance desktop CPUs, embedded

microprocessors now deliver integer performance within a factor of two of their desktop counterparts

[Keeton et al. 1998] (note the co-author on this reference). This change in relative processor


performance has been a consequence of Moore’s Law [Moore 1965]; increasingly, and at all

levels of abstraction, communication is a far scarcer resource than computation.

Generally speaking, increasing I/O performance in a computer system is a matter of cost:

improvements can be achieved by spending more money. Improvements in I/O performance are

costly because I/O components are generally standards-based, with many companies offering

competing, compatible products. For example, any new PC interconnect technology must gain

wide acceptance to achieve the economies of scale necessary to be cost-effective. Thus, only the

most urgent and important problems get solved with expensive, customized I/O systems.

Furthermore, I/O subsystems comprise the bulk of the cost of modern computer systems, despite

being built with commodity components [Hill et al. 2000, editors' introduction to Ch. 7]. I/O

systems are relatively costly since they, generally speaking, do not benefit from Moore’s Law as

do semiconductor devices like processors and memories. Thus, many I/O advances aim to either

reduce the cost for a given level of performance, or improve performance for a given level of

cost.

Programmable microprocessors are an increasingly cost-effective solution for providing

sophisticated control in I/O devices. Advances in VLSI technology have made embedded microprocessors small, powerful, and relatively inexpensive. In fact, most peripherals

common to today's computer systems, including disks, graphics accelerators, and network

interface cards, are built around microprocessors. As mentioned previously, this situation has

sparked numerous research efforts attempting to exploit any excess compute power at peripheral

devices, particularly on devices related to storage I/O and network I/O, to speed I/O intensive

applications in a cost-effective manner. This report surveys the research efforts under way for

programmable disks and programmable networks, unifies and gives context for their common

problems, and identifies areas of future research.

3 Programmable Disks

Compared to modern processors, disks are slow as a result of the physical motion needed to

access data. This fact enables disk manufacturers to implement much of the disk control logic in

software/firmware executed on a microprocessor. Software-based control reduces the number of

electronic and ASIC components on the disk, and therefore reduces cost. In this section, we


consider the design of modern disks and survey the approaches taken by researchers to leverage

this programmability in order to increase application performance.

3.1 Basic Operation

Magnetic disk drives store information in the form of magnetic flux patterns. This encoded data

is arranged on sectors within tracks on a platter, as shown in Figure 1. To store information, the

drive receives blocks of digital data through a host interconnect channel, such as SCSI, maps

block addresses to physical sectors, moves the read/write head over the appropriate disk sector,

and encodes the data as flux patterns that are recorded onto the magnetic surface. Information

retrieval is similar, except data is sensed and decoded rather than encoded and written.

As noted by [Ruemmler and Wilkes 1994], modern disk drives contain a mechanism, which

includes the recording and positioning components shown in Figure 1, and a disk controller,

which consists of, among other things, a microprocessor, memory, and a host interface, as shown in

Figure 2.

Recording and positioning components. The overall performance of the disk is dominated by

the engineering tradeoffs found in the disk mechanism. Two different but intimately related

aspects contribute to disk performance: media transfer rate and storage density.

The media transfer rate for a fixed storage density is primarily determined by two common

performance measures: spindle rotation speed and seek time. Very fast spindle rotation requires a

powerful motor, which consumes more energy, and high-quality bearings, which are more

expensive. Seek time refers to the time needed to position the head over a particular cylinder.

This time is limited by the power of the motor that rotates the arm and the stiffness of the arm itself.

Figure 1. Mechanical components of a disk drive. Source: [Ruemmler and Wilkes 1994].

The storage density for a fixed media transfer rate is a consequence of two forms of density:

linear recording density and track density. The former is constrained by the maximum rate of

magnetic phase change that can be recorded and sensed. Track density refers to how closely

tracks may be packed together on the platter and is the primary source of density improvement.

Track density is influenced heavily by the precision provided by the head positioning and media

sensing mechanism. Both linear and track density are influenced by, and in turn influence, the

speed of the encoding process.
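To make the interplay of these mechanical parameters concrete, the expected time to service a random request can be approximated with a standard first-order model (a textbook sketch, not a formula drawn from the sources cited here):

    T_{\text{access}} \approx T_{\text{seek}} + \frac{1}{2} \cdot \frac{60\,\text{s}}{\text{RPM}} + \frac{B}{r_{\text{media}}}

For example, a drive with a 6 ms average seek, a 7,200 RPM spindle (4.2 ms per half rotation), and a 20 MBps media rate delivers an 8 KB block in roughly 10.6 ms; the two mechanical terms dominate, which is why, as discussed in Section 3.2.1, controller processor speed has little bearing on raw disk performance.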

The read-write data channel encodes and decodes the data stream into or from a pattern of

magnetic phase changes. Error correction is built into the encoded data stream (and DSP

techniques can be used to increase data channel speed), and positioning information is recorded

onto the disk surface by the manufacturer to help determine the location of the head.

3.2 Disk Controller

The disk controller governs the operation of the mechanism described above. The controller

receives and interprets SCSI requests, manages the media access mechanism, manages data

transfers, and controls the cache. The heart of the controller is the microprocessor. The current

trend is to reduce cost and improve performance by replacing electronic components with software/firmware, augmenting the processor with DSP capabilities, and tightly integrating the interfaces to hardware, which permits direct control.

Figure 2. The structure of a disk controller and integration trends. Source: [Riedel 1999].

3.2.1 Processor

Since disk performance is limited by the media access rate, which is slow relative to

microprocessor speeds, the control processor does not need to be particularly fast. However,

embedded microprocessor price/performance continues to improve, so, increasingly, disk control

logic, which controls spindle rotation and arm actuation, is being moved into software executed

by the control processor. [Adams and Ou 1997] describe their experience in doing so.

Chip-level system integration is also having an impact. Cirrus Logic sells a system-on-a-chip

disk controller, called 3Ci, that integrates: a 66MHz ARM7 32-bit RISC processor core, disk

control logic, a DSP-based integrated read/write channel (PRML), 48 KB SRAM, 128 KB ROM, and a memory controller for off-chip Flash, SRAM, and DRAM memory [Cirrus Logic]. The

next generation of this device will include a 200 MHz ARM core with more on-chip memory.

3.2.2 Memory System

There is considerable semiconductor-based buffer storage (between 64 KB and 1 MB) on disks

today, and future devices will have even more. Originally, buffer memory performed only rate

matching between the media access rate and the host transfer rate. Today, data caching is used

and, in some cases, provides excellent improvements. Read caching can be performed

optimistically since there is no on-disk penalty associated with reading unnecessary data as the

head moves across a platter; it can simply be discarded. Outside of the disk, however, host-based

prefetching may be affected if the host makes cache content assumptions based on its own

reference pattern. Write caching permits the disk to organize data before writing and to reorganize

disk blocks during operation without interrupting the host CPU. However, write caching, for

reliability, is generally implemented in non-volatile memory to avoid loss of information if

power fails before the cached data can be written. IRAM [Patterson et al. 1997] [Keeton et al.

1997] has been proposed [Patterson and Keeton 1998] as a good integrated processor and

memory system architecture due to its potential low-latency and high-bandwidth characteristics.

Cirrus, by virtue of their 3Ci device, agrees with this call for integration.


3.2.3 Communication

All high-performance disk drives use the Small Computer System Interface (SCSI). The SCSI

standard defines both an interconnection fabric and a programming interface. SCSI interconnects

are parallel busses shared by several devices. Historically, bus-based interconnects have been the

standard for connecting hosts and storage devices. Following the trend seen in LANs, however,

high-performance, serial, point-to-point interconnect technologies like Fibre Channel [FCIA

2000] are rapidly replacing SCSI in server systems. Fibre Channel is a serial interconnect

technology that uses fewer wires than SCSI (4 rather than the 25, 50, 68, or 80 used in various

SCSI generations) and, therefore, has a smaller connector, and is considerably faster (125 MBps

vs. 80 MBps for Ultra2 SCSI). SCSI, the programming interface, can be implemented on top of Fibre

Channel.

The SCSI interface has proven to be a successful abstraction between hosts and storage

devices. The SCSI interface frees programmers and the host from having to manage the storage

device and, furthermore, permits the storage device to implement optimizations beneath the

interface. RAID [Patterson et al. 1988] is an example of an optimized system that presents itself

to the host as a standard SCSI device.

However, SCSI is a low-level interface, and one recommendation of the network-attached

secure disk drive (NASD) [Gibson et al. 1997] project is to replace it with a higher-level, object-

based interface to permit devices to better manage data, meta-data and security. With an object-

based interface, the device would manage the storage of blocks that belong to a particular object.

Hence, when a request comes for that object, the drive has knowledge of all the blocks that are of

potential interest. Presently, the interface permits no expression of relationships between blocks.

The object-based interface also simplifies security concerns, which are paramount for disks that

can be accessed directly across a network by multiple hosts, by associating capabilities with

host/object pairs.
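The following sketch contrasts the two interfaces; the types and signatures are invented for exposition (they are not the NASD API), but they capture the shift in responsibility. A block device accepts raw sector addresses, while an object store accepts an object identifier plus a capability that the drive itself can verify:

    /* Hypothetical sketch contrasting block- and object-based drive
       interfaces; names and signatures are invented for exposition. */
    #include <stdint.h>
    #include <stddef.h>

    /* Block interface: the host manages layout and names raw sectors. */
    int block_read(uint64_t lba, uint32_t nsectors, void *buf);

    /* Object interface: the drive owns layout and enforces rights. */
    typedef struct {
        uint64_t object_id;   /* object covered by this capability     */
        uint32_t rights;      /* e.g., READ and WRITE bits             */
        uint8_t  mac[16];     /* keyed MAC issued by the file manager  */
    } capability_t;

    int object_read(uint64_t object_id, uint64_t offset, size_t len,
                    const capability_t *cap, void *buf);
    /* The drive checks cap->mac using a key it shares with the file
       manager, so clients can transfer data directly to and from the
       drive without a per-request trip through a central server. */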

3.3 Control Software

Modern disks are built around microprocessors, and, accordingly, a software control system is

responsible for governing the operation of the device. The control software is not exposed to the

programmer, and it generally resides in ROM or EEPROM on the disk. Disk-based operating

systems proposed in the research literature will be discussed in Section 3.4.1.


3.4 Applications of Programmability

We have examined the design of modern disks and the factors that have made them

programmable. In this section, we survey the manner in which this programmability has been

exploited to solve problems. These proposals fall into two categories: storage systems and

distributed disk-centric applications.

3.4.1 Storage Systems (NASD, Virtual Log-based FS)

The aforementioned NASD project describes a cost-effective scalable storage architecture with

network-attached and secure programmable disks [Gibson et al. 1997]. Disks directly attached to

the network require changes in the programming interface and security model, as mentioned in

Section 3.2.3. The NASD project addresses these issues, and proposes the following four

characteristics: direct data transfer between drive and client, a capability-based access control

system permitting asynchronous oversight by a centralized file manager, cryptographic integrity,

and an object-based interface. The NASD work culminates in the demonstration of a parallel

distributed file system, built on a prototype NASD, that provides file system support to a parallel

data mining application. Application performance in their prototype system scales linearly with

the number of NASDs.

Another proposal implements a virtual log-based file system on a programmable disk [Wang

et al. 1999]. Wang’s file system uses a virtual log, that is, a disk-based log with non-contiguous

entries, to achieve good performance for small synchronous writes while retaining all the

benefits of a log-based file system, including transactional semantics. The technique involves

migrating some of the low-level file system implementation to the disk and performing small

atomic writes near the location of the disk head when data arrives. The authors note that these

techniques do not necessitate a programmable disk; the technique only requires a file system

with precise knowledge of disk state.
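A minimal sketch of the core idea follows; the names are hypothetical, and the real system is considerably more involved. The allocator picks whichever free sector is rotationally closest to the current head position, and entries carry explicit back pointers since the log need not be contiguous:

    /* Sketch of a virtual-log append (hypothetical names): write each
       small synchronous update at the free sector closest to the head
       and chain entries with explicit pointers. */
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint64_t prev_sector;    /* link to the previous log entry     */
        uint64_t seqno;          /* ordering for crash recovery        */
        uint8_t  payload[480];   /* a small synchronous write          */
    } log_entry_t;

    extern uint64_t head_position(void);           /* current sector   */
    extern uint64_t nearest_free(uint64_t sector); /* free-space map   */
    extern int      write_sector(uint64_t sector, const void *data);

    static uint64_t tail = 0;      /* sector of the newest entry       */
    static uint64_t next_seq = 1;

    int vlog_append(const void *data, size_t len) {
        log_entry_t e = { .prev_sector = tail, .seqno = next_seq };
        if (len > sizeof e.payload) return -1;
        memcpy(e.payload, data, len);
        /* Near-zero seek and rotational delay: the chosen sector is
           whatever free space is about to pass under the head. */
        uint64_t s = nearest_free(head_position());
        if (write_sector(s, &e) != 0) return -1;
        tail = s;
        next_seq++;
        return 0;
    }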

3.4.2 Distributed Disk-bound Applications (Active Disks, IDISK)

A number of researchers [Acharya et al. 1998, Gray 1998, Keeton et al. 1998, Riedel et al. 1998]

have proposed executing application-level code on programmable disks, in particular NASDs or

IDISKs, as a means of scaling processing power with the size of very large data sets in certain

scan-intensive database problems. While database machines, which scaled processing power


with the number of read/write heads on a single disk, failed in the 1980s, these researchers contend

that there are now important applications that need to scale processing power with data set size.

Researchers from CMU describe target applications as: 1) leveraging the parallelism

available in systems with many disks, 2) operating with a small amount of state, processing data as

it "streams" past, and 3) executing few instructions per byte of data [Riedel et al. 1998]. Most of

these applications are scan-intensive database operations used in data-mining where the same

queries are run over all data, producing a result set that requires further processing. This

approach makes use of the processors in all disks, and does not require all data to be sent across a

host I/O bus for processing.

The Active Disk literature proposes a programming model [Acharya et al. 1998] and an

analytical model [Riedel et al. 1998]. The programming model proposed by the group from UC

Santa Barbara/Univ. of Maryland is simple and calls for stream-based message passing. In this

model, the disk runs DiskOS, a disk-resident OS which handles memory management and stream

communication, and the application developer partitions code between the host and the disks.
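In rough outline, a disklet under this model is a filter between streams; the stream API and names in this sketch are invented for illustration and are not the published interface:

    /* Sketch of a stream-based "disklet" in the spirit of the Active
       Disk programming model; the API here is hypothetical. */
    #include <string.h>

    typedef struct stream stream_t;          /* provided by the disk OS */
    extern long stream_read(stream_t *s, void *buf, long len);
    extern void stream_write(stream_t *s, const void *buf, long len);

    #define REC_SIZE 128

    /* Runs on the disk: records are examined as they stream off the
       platter, and only matches are forwarded to the host, which
       merges the result streams arriving from every disk. */
    void filter_disklet(stream_t *in, stream_t *out, const char *key) {
        char rec[REC_SIZE + 1];
        while (stream_read(in, rec, REC_SIZE) == REC_SIZE) {
            rec[REC_SIZE] = '\0';            /* treat record as text   */
            if (strstr(rec, key))
                stream_write(out, rec, REC_SIZE);
        }
    }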

This approach is fine for big problems that justify customization, as is the case for certain

problems solved by message-passing multiprocessors, but is far from a comprehensive

programming model. We return to this issue and the larger issue of software models in Section

6.2. The analytical model from CMU is designed to give intuition about the performance of an

active disks system compared to a traditional server [Riedel et al. 1998]. This model, in addition

to most of the arguments from the Active Disk literature, primarily speaks to and argues for

distributing these computations and, therefore, applies to other scalable approaches as well.
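In simplified form (a paraphrase of the model's bottleneck structure rather than its exact notation), the scan throughput of a system with $d$ disks, each delivering $r$ bytes/s off the media through an on-disk processor sustaining $s$ cycles/s on an application that needs $w$ cycles per byte, all feeding an interconnect of capacity $i$ bytes/s, is bounded by the slowest stage:

    \text{throughput} \approx \min\left( d \cdot r, \; \frac{d \cdot s}{w}, \; i \right)

Because both the media and processing terms scale with $d$, low-$w$ scan workloads remain disk-limited until the interconnect saturates; the same equation describes a cluster node managing several disks, which is why the comparison largely reduces to cost.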

Clusters of inexpensive machines are another way of doing this [Arpaci-Dusseau et al. 1998]. In

fact, the arguments for a distributed, serverless file system were laid out in xFS [Anderson et al.

1996], which organized all workstations on a network as peers providing file system services.

The question is whether it is more cost-effective to run the software at each disk, or on a PC that

manages a few disks. The Santa Barbara group compared clusters to active disks for a set of

target applications and found that they were equivalent in terms of performance [Uysal et al.

2000]. However, the active disk solution was 60% less expensive, given late ’99 prices and the

authors’ bargain-hunting skills.

The IDISK project from Berkeley specifically argues for independent disks with considerable

processing power and memory that are capable of autonomous communication; in particular,


they state the case against clusters with respect to IDISKs. They point out four weaknesses in

cluster architectures: 1) the I/O bus bottleneck, 2) system administration challenges, 3)

packaging and cost difficulties, and 4) inefficiency of desktop microprocessors for database

applications. The first three items are clear. The fourth points out that desktop microprocessors

are slightly more powerful than embedded microprocessors when executing database codes, but

are far more costly. We return to this point in discussion of the future of programmability in

Section 6.1. The IDISK work distinguishes itself from Active Disks primarily by arguing for

considerable resources on each disk; the current Active Disks proposals seek out applications

that require minimal computation per byte.

3.5 Summary

In this section, we have surveyed the motivations, designs and applications for programmable

disks. The most compelling motivation for executing application code on these devices is the

need to scale processing power with problem size; data-set sizes for important disk-bound

problems continue to grow rapidly. However, cluster-based systems are already in use for this

purpose, and they must be displaced for Active Disks to become a reality.

Special processor and memory architectures, other than IRAM, have not been investigated

for disks, as they have for network interfaces as we will see, because disk performance is already

limited by the mechanical speeds of the spindle and arm. Any microprocessor performance

increase will have a marginal effect on the overall performance of the disk, unless the device is

tailored to improve the physical operation of the disk, as in using DSP techniques to increase

effective density [Smotherman 1989]. The IDISK proposal contends that there is clear benefit,

however, in improving the computational resources afforded application level code executing at

the disk, presuming, as do the Active Disk proponents, that additional processing power can be

added for marginal cost.

4 Programmable Network Interfaces

Network performance is increasing dramatically, outpacing the increase in memory speeds, with

no end anticipated in the near future [Schoinas and Hill 1998]. This fact, coupled with the

limitations in server I/O bus performance described in the previous section, has motivated high-

performance NI designs built around powerful microprocessors that require minimal host CPU

interaction.


Network interface design issues have traditionally been categorized according to network

type: local-area networks (LANs), system-area networks (SANs), and massively parallel

processing (MPP) networks. Since the reliability, bandwidth, and latency characteristics of these network

types are converging, the primary distinction that remains, one that is a large performance factor,

is the location of the NI/host connection. LAN and SAN NIs typically connect to the host I/O

bus. MPP NIs ordinarily attach to the node processor’s memory bus or processor datapath [Dally

et al. 1992]. In this report, we focus on the LAN/SAN type NI. However, as we shall see, the

integration of network processors on these devices raises many of the same issues confronting

MPP NIs.

While no two NIs are identical, in the following discussion of NI operation and design, we

use the Myrinet [Boden and Cohen 1995] host interface, as depicted in Figure 3, as a running

example. The Myrinet system area network (SAN) was a ground-breaking advance in

interconnect technology. It was the product of two academic research projects, namely Caltech’s

Cosmic Cube [Seitz 1985] and USC’s Atomic LAN [Cohen et al. 1992]. The Myrinet host

interface is similar in the important ways to other high-performance interfaces, and the bulk of

the differences lie in the relative sophistication (or lack thereof) of the network processor.

4.1 Basic Operation

A switched network such as Myrinet consists of host network interfaces and switches. Myrinet

switches range in size from 4 to 16 ports. Myrinet is self-configuring and source routed – it uses

the blocking-cut-through (wormhole) routing technique found in MPP systems such as the Intel

Paragon and the Cray T3D. The switches have no hard state or software; they only steer the

source-routed packets. Moore's Law has helped make switch-based networks economical since

switches and crossbars can be implemented on a single chip.

Upon packet arrival, the link (or packet) interface handles framing, error detection, and the

media access protocol. The link interface accepts a frame via the media access protocol, checks

for errors via cyclic redundancy checks (CRCs), and writes the frame into the buffer memory.

Minimally, this buffer is used to cope with asynchrony between the network link and the host

interface. In many high-performance NIs, additional processing of the packet takes place, as

discussed further in Section 4.2.1. Once all processing is complete, the packet is moved via

DMA into the host CPU’s memory.
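The receive path just described can be summarized in C; the register names and helper routines in this sketch are invented, and real interfaces differ in detail:

    /* Sketch of the NI receive path: link interface -> buffer -> host. */
    #include <stdint.h>

    extern int      frame_arrived(void);           /* link-side poll   */
    extern uint32_t read_frame(void *buf);         /* framing/MAC done */
    extern int      crc_ok(const void *buf, uint32_t len);
    extern uint64_t next_host_rx_buffer(void);     /* host descriptor  */
    extern void     dma_to_host(const void *buf, uint32_t len,
                                uint64_t host_addr);

    static uint8_t pkt[2048];                      /* NI buffer memory */

    void rx_loop(void) {
        for (;;) {
            while (!frame_arrived())
                ;                                  /* wait on the link */
            uint32_t len = read_frame(pkt);
            if (!crc_ok(pkt, len))
                continue;                          /* drop bad frames  */
            /* Additional per-packet processing would run here. */
            dma_to_host(pkt, len, next_host_rx_buffer());
        }
    }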


Unlike disks, networks are fast compared to modern processors; high-performance NIs are too

fast for host I/O busses. For example, one link in a Myrinet LAN carries 1.28 Gbps (160 MBps)

in each direction, which exceeds the 133 MBps peak bandwidth of the standard 32-bit, 33 MHz

PCI host I/O bus, a bandwidth that is shared by all I/O devices.

increasing in speed and bandwidth faster than memory [Schoinas and Hill 1998]. Consequently,

the performance of the microprocessor, memory and operating system that run the interface card

can have tremendous influence on the capabilities of the device for certain applications. The

design of high-performance processors and execution environments has become a research issue,

and, in the case of network processors, a fledgling industry full of start-up companies and

established semiconductor vendors. A major push behind these efforts is the need to meet the

increasing bandwidth and functionality requirements of the Internet. Traditionally, the middle of

the network has been kept simple and fast, with sophistication being implemented at the edges.

However, to meet these demands, functionality is being pushed from servers on the edge of the

network onto internal network nodes. Sophistication is moving into the network in the form of

application data caching, tunneling, content distribution techniques, etc. There remain

proponents on both sides of this issue. However, it seems unlikely that the services being

deployed in the network today can ever be reined back in.

The tremendous momentum behind Internet-related technologies has inspired much research

in network interface design, communication-oriented operating systems, and large-scale

systems that implement network services. In this section, we first consider the design of

modern network interfaces, including the design alternatives investigated in the research

literature. Then, we survey various proposals for exploiting the programmability in network

interfaces.

4.2 NI Organization

As was the case with disks, modern NIs have all the components found in computer systems:

processor, memory, and a communications subsystem. Figure 3 depicts how the Myrinet host

interface and most NIs are organized. In this section, we consider each of the major components

individually.


4.2.1 Processor

The most important function of the processor is to manage packet delivery and protocol specific

tasks. In Myrinet, for example, each network has a manager (chosen manually or automatically

by the network) that is responsible for continuously mapping the network by sending messages

to all hosts. This mapping enables source routing. So, in addition to managing host packets, the

processor must also adhere to the control protocol of the network.

The LANai processor, found on the Myrinet NI, is a simple, 32-bit RISC processor clocked at

33 MHz with integrated link and host interfaces. The LANai is a relatively meager processor

compared to processors found on other high-end devices. For example, the 3Com 3CR990

ethernet network interface is built around a 200 MHz ARM9 processor core; this device

aggressively handles security (IPsec) and TCP segmentation and reassembly completely on the

NI [3Com 2000].

Research directions in NI processors can be grouped in two categories: communication

processors and network processors. The first category emphasizes low-cost interrupt and

message handling, and the second focuses on higher-level packet processing.

Communication Processors. The I/O processors described earlier unburdened the host CPU

from handling all the details of data transfers. Similarly, communication processors are used on

NIs to manage data movement. Rather than using polling or interrupts on the host (or network

processor), NIs use smaller, less powerful communication processors to poll the network for

data. Significant work on programmable communication processors has been done in the context

of message-passing MPPs. For example, the Stanford FLASH multiprocessor project developed

a communication processor called MAGIC to manage the movement of all data between host

CPU, memory and the network [Kuskin et al. 1994]. MAGIC managed all data movement on a

processor node; thus, in addition to unburdening the host CPU, cache coherence and communication mechanisms were implemented in software.

Figure 3. Myrinet NI structure. Source: [Boden and Cohen 1995].

consider: the effectiveness of communications processors, the feasibility of zero-overhead

application message handling on communication processors, and design concerns specific to

communication processors built around general-purpose microprocessors.

Recently, [Scheiman and Schauser 1998] evaluated network performance in an MPP both

with and without a communication processor using the Meiko CS-2 multiprocessor. Results

indicate that implementing application, or user, level message handlers on a communication

processor, despite being slower than the host CPU, improves latency. The authors report that the

improvement is due to (1) the faster response time of the communication processor, and (2)

the task offloading that frees the main processor from polling or handling interrupts. Similar

performance evaluations have been published concerning the CMU Nectar communication

processor [Menzilcioglu and Schlick 1991, Steenkiste 1992].

Much work has been done on the support needed to perform zero-copy application- and user-

level messaging on high-speed NIs [Chien et al. 1998, Mukherjee 1998, von Eicken and Vogels

1998]. In one case, [Schoinas and Hill 1998] show that it is possible to perform zero-copy

application-level messaging in software on a communications processor. This minimal

messaging attempts to move data directly from the NI into the host data structures in main

memory. (Here, host refers to either the host CPU or the network processor, depending on which

is receiving the data.) The key issue is providing efficient virtual/physical memory address

translation [Chen et al. 1998] on the network interface.
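A minimal sketch of that translation step follows; the structures and the miss policy are invented for illustration (see the cited work for real designs). The NI keeps a small table of pinned virtual-to-physical mappings so incoming data can be DMAed straight into application buffers:

    /* Sketch of NI-resident address translation for zero-copy delivery. */
    #include <stdint.h>

    #define NI_TLB_ENTRIES 64
    #define PAGE_SIZE      4096ULL

    typedef struct {
        uint64_t vpage;    /* page-aligned user virtual address        */
        uint64_t ppage;    /* corresponding pinned physical page       */
        int      valid;
    } ni_tlb_entry_t;

    static ni_tlb_entry_t ni_tlb[NI_TLB_ENTRIES];

    /* Returns the physical address for a user buffer, or 0 on a miss,
       in which case the NI falls back to a bounce buffer or asks the
       host driver to pin the page; that is the slow path this table
       exists to avoid. */
    uint64_t ni_translate(uint64_t vaddr) {
        uint64_t vpage = vaddr & ~(PAGE_SIZE - 1);
        for (int i = 0; i < NI_TLB_ENTRIES; i++)
            if (ni_tlb[i].valid && ni_tlb[i].vpage == vpage)
                return ni_tlb[i].ppage | (vaddr & (PAGE_SIZE - 1));
        return 0;
    }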

Finally, some recent work describes the support needed to implement low-level operations

[Cranor et al. 1999] involved with network-specific data transfer on microprocessor-based

communication processors. Specifically, they use multiple thread contexts to limit the overhead

involved with servicing DMA completion interrupts. This helps reduce overall message latency

when using a general-purpose embedded processor. This technique improves communication

processor performance by replacing costly polling with low-overhead interrupts. In cases where

additional packet processing requirements are low, this approach can remove the need for

separate communication and network processors.

Network Processor. The distinction between a communication processor and a network

processor is far from settled. A general description would be that communication processors

handle low-level data-link protocol details (e.g., ethernet or Myrinet specifics) and message


handling, while network processors perform high-level, network and transport layer processing

(e.g., IP and TCP/UDP processing). The tasks carried out by the communication processor are

the tasks traditionally performed by NIs. The network processor implements functionality found

in host device drivers and applications. This distinction is relatively new, but as the processors

found on NIs increase in power, the need for a communication processor to unburden the

network processor will grow for the very reasons discussed above.

Industry is producing numerous network processors, most of which employ chip-

multiprocessor or fine-grained multithreaded processor architectures, to provide high

performance on the NI [Crowley et al. 2000]. For example, the recently announced Prism

network processor from Sitera [Sitera 2000] is a four-processor chip multiprocessor with hardware

support for packet classification and quality of service.

Our work, performed here at UW, made the following contributions to network processor

research: 1) identified a set of network processor workloads, 2) showed that chip-multiprocessors

(CMP) [Nayfeh et al. 1996] and simultaneous multithreaded architectures (SMT) [Tullsen et al.

1995] can exploit packet-level parallelism, while aggressive superscalar and fine-grained

multithreaded architectures cannot, 3) showed that packet classification can be performed

economically in software on network processors, and 4) showed that SMT adapts better to the

variability in real multiprogrammed workloads [Crowley et al. 2000, a paper describing parts 3

and 4 is currently under review]. A problem with this work is that it only reports throughput. The

results give no intuition about how these processor designs impact latency, an oversight warned

against by [Hennessy and Patterson 1996]. It is likely that latency considerations are slight, given

the wide-area applications considered, but the subject should have been addressed.
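The packet-level parallelism these architectures exploit is simple to express, since each packet is an independent unit of work. A toy sketch using POSIX threads (the queue and handler are hypothetical):

    /* Toy sketch of packet-level parallelism: a pool of worker threads,
       each handling whole packets independently. */
    #include <pthread.h>

    typedef struct packet packet_t;
    extern packet_t *dequeue_packet(void);        /* blocks until work */
    extern void      handle_packet(packet_t *p);  /* classify, route   */

    static void *worker(void *arg) {
        (void)arg;
        for (;;)
            handle_packet(dequeue_packet());  /* packets independent   */
        return NULL;
    }

    void start_workers(int n) {
        for (int i = 0; i < n; i++) {
            pthread_t t;
            pthread_create(&t, NULL, worker, NULL);
            pthread_detach(t);
        }
    }

A CMP runs such workers on separate cores, while an SMT interleaves them within one pipeline, covering one thread's memory stalls with another's ready instructions; a single wide superscalar has no second packet to turn to.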

4.2.2 Memory System

Memory serves as the bridge between the processor and the network. Network interfaces

typically use high-speed SRAM to buffer packets. The bandwidth and latency characteristics of

the memory system figure prominently in the amount of processing that can be performed on

each packet at network speeds. Surprisingly, there has been little reported in the research

literature on memory systems for high-performance network interfaces.

The general question of how to architect the memory system is open, although there is

significant discussion of this in the trade news. For example, the Prism network processor from


Sitera uses optional SRAM and is the first network processor to integrate a RAMBUS memory

controller [Crisp 1997].

Recent proposals have appeared in the trade news for integrated DRAM memories on network

processors. The idea of using IRAM to buffer packets is somewhat obvious, but unexplored.

However, latency is a very important consideration, and standard approaches of integrating

processors and memory may not help [Cuppu et al. 1999].

One related proposal uses a standard CPU cache memory to implement single-cycle IP route

lookups [Chiueh and Pradhan 1999]. Following work by the same authors includes cache

modifications to increase the effectiveness of this technique [Chiueh and Pradhan 2000]. This

work is rooted in finding longest-matching prefixes with the dynamic prefix trie (as in retrieval)

data structure [Doeringer et al. 1996]. The technique uses the cache as a hardware assist for

performing fast matches between addresses and next hop values; in essence, IP addresses are

treated as virtual memory addresses.
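Stripped of the cache trick, the underlying operation is longest-prefix matching. A naive linear sketch (hypothetical structures, shown only for contrast with the cached lookup):

    /* Naive longest-prefix match over a linear route table. */
    #include <stdint.h>

    typedef struct {
        uint32_t prefix;    /* already masked to plen bits             */
        int      plen;      /* prefix length, 0..32                    */
        uint32_t next_hop;
    } route_t;

    uint32_t lookup(const route_t *tbl, int n, uint32_t dst) {
        uint32_t best_hop = 0;        /* 0 here means "no route"       */
        int best_len = -1;
        for (int i = 0; i < n; i++) {
            uint32_t mask = tbl[i].plen ? ~0u << (32 - tbl[i].plen) : 0;
            if ((dst & mask) == tbl[i].prefix && tbl[i].plen > best_len) {
                best_len = tbl[i].plen;
                best_hop = tbl[i].next_hop;
            }
        }
        return best_hop;
    }

The cache-based scheme effectively replaces this loop with a single tag-matched load, treating dst as an address whose cached "data" is the next hop.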

4.2.3 Communication

Network interfaces connect to the network medium on one side and to the host interface on the other.

With an eye toward large-scale routers, several companies are developing very fast back-end

interconnects that permit high-bandwidth, low-latency transfers between network interfaces. The

proposals from the common switch interface (CSIX) consortium [CSIX 2000] and IX

architecture forum [LevelOne 1999] are seeking to standardize these interfaces. Not

surprisingly, research on interconnects for message-passing multiprocessors has inspired these

efforts. For example, Avici Systems, a start-up founded by Bill Dally from Stanford/MIT,

basically uses the J-Machine, and in particular its interconnect, to do terabit routing [Avici

2000].

4.3 Control Software

Control on the Myrinet interface is the responsibility of the Myrinet control program (MCP) that

is loaded into device memory on boot-up. The MCP implements network specific control

processing, such as network mapping, and handles DMA requests both within the interface and

into host memory. The research community has embraced Myrinet due in large part to its open

interfaces and open, modifiable MCP [Bhoedjang et al. 1998].


Key challenges in the design of control software include low-overhead user-level messages,

because going through the OS for permissions checking is too slow, and utilizing a minimum

number of copy operations [von Eicken et al. 1995]. A good survey of efficient techniques for user-

level messages is provided by [Bhoedjang et al. 1998]. Communication in general is very latency

sensitive – in many cases it matters just as much as throughput. SPINE describes a safe,

extensible system for executing application-specific code on programmable NIs [Fiuczynski et

al. 1998].

4.4 Applications

We have considered the motivations for and designs of programmable NIs. In this section, we

survey the applications and techniques that have been proposed for exploiting this

programmability.

4.4.1 Fast LANs/System Area Networks

As mentioned previously, Myrinet employs host controllers built around

microprocessors. Since the introduction of Myrinet, manufacturers have produced solutions for

fast LANs employing the same technique. For example, Asanté’s GigaNIX gigabit ethernet NI is

built around two 32-bit embedded RISC processors [Asanté 2000]. Programmability is generally

required in these devices since the network-specific control (i.e., network mapping, flow control)

is more easily and economically implemented in software on cost-effective embedded

microprocessors.

4.4.2 Computing at Network Speeds

Emerging network applications and services require a fast path that does not involve the latency

penalty associated with crossing the I/O bus to get to the host CPU. Examples of such services

include: IPsec, routing, server load-balancing, and quality-of-service (QoS). The additional

latency to get to the host CPU makes these services infeasible at network speeds. Hence, a

network processor is included on the network interface to execute these applications.

These applications are indicative of the general trend of pushing more computation and

sophistication into the network, as discussed in Section 4.2.1. Other examples of this trend

include web caching, network-address translation (NAT), firewalls, and virtual private networks


(VPNs). This trend has the potential to radically increase the computational resources required at

each link in the network. The execution of many applications at network speeds requires a

significant amount of processing power that, furthermore, scales with the number of network

connections in a computer system. This trend has particular significance for services running at

the backbone of large internetworks.

A special class of machines, traditionally called routers, services many network links

simultaneously. It is particularly necessary to execute network services on network interfaces in

these devices so that processing power, and hence overall service performance, can scale with

the number of links. A number of researchers have proposed the use of high-performance

programmable network interfaces connected via a fast interconnect to implement large-scale

routing systems [Peterson et al. 1999, Walton et al. 1998]. This proposal closely matches what is

actually taking place in industry.

4.4.3 Active Networks

Active networking is a new approach to network design that provides a customizable infrastructure

to support the rapid evolution of new transport and application services by enabling users to

upload code into programmable network nodes. This is the most aggressive example of

computing at network speeds: each packet can contain a unique program. The last few years have

seen considerable coverage of active network research. Directions of inquiry have included

designs for: software platforms and programming models [Wetherall et al. 1999] [Hicks et al.

1999], active network node architectures [Decasper et al. 1999] [Nygren et al. 1999], and

operating systems for active nodes [Merugu et al. 2000] including emphases on QoS [Alexander

et al. 2000] and security [Campbell et al. 2000]. Two recent articles comment on the results thus

far [Smith et al. 1999] and lessons learned [Wetherall 1999]. This proposal poses big challenges

in safety, performance, and management.
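In the capsule style of active networking, for example, the packet itself names the code to run at each node; schematically (the layout and API here are hypothetical, not a published wire format):

    /* Schematic active-network capsule: the header identifies code
       that each node executes on the packet. */
    #include <stdint.h>

    typedef struct {
        uint8_t  code_id[16];  /* fingerprint naming the forwarding
                                  routine; nodes fetch and cache code
                                  on demand                            */
        uint16_t payload_len;
        uint8_t  payload[];    /* data plus per-packet state           */
    } capsule_t;

    typedef void (*forward_fn)(capsule_t *c, void *node_env);
    extern forward_fn lookup_code(const uint8_t id[16]); /* code cache */

    /* At each node: find the named routine and run it, sandboxed, on
       the capsule; the routine decides how (and whether) to forward. */
    void on_capsule(capsule_t *c, void *env) {
        forward_fn f = lookup_code(c->code_id);
        if (f)
            f(c, env);
        /* On a miss the node would request the code from the previous
           hop and queue the capsule; error handling is omitted. */
    }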

4.5 Summary

The preceding section surveyed the motivations, designs and applications for programmable

network interfaces. To a greater extent than with disks, innovative designs and research proposals

for programmable NIs are being investigated to meet the growing performance and functionality

requirements of next-generation networks.


5 Other Examples

There are other examples of peripheral devices that are now programmable, including graphic

display adapters and printers. Graphic display adapters for many years have implemented

graphics pipelines and other display primitives at the device. Considerable work has been done on

graphics and media-specific programmable architectures [Basoglu et al. 1999, Rixner et al.

1998]. The relatively high bandwidth required for graphics on consumer PCs led Intel to devise

the accelerated graphics port (AGP) [Intel 2000]. The AGP bus uses the same I/O “switch” as the

processor and main memory in order to “fatten and shorten the pipe” between the processor on

the graphics card and main memory. Intel had noticed that graphics cards were beginning to ship

with significant amounts of memory, forcing the primarily graphics-based multimedia processors

to manage memory in a fashion similar to the host CPU. This extension also helps Intel’s MMX

extensions to speed graphics processing in ways that were infeasible across the standard

peripheral I/O bus.

PostScript printers have been programmable I/O devices from the beginning [Tennenhouse

and Wetherall 1996]. A PostScript document is, in fact, a program generated by an application

that is sent to the printer and interpreted by the printer's control microprocessor.

6 Common Issues

In this section, we consider a set of issues common to both programmable disks and

programmable NIs.

6.1 The Future of Programmability

Is programmability here to stay? There are at least two reasons why these devices may cease to

be programmable: 1) ASICs become more cost-effective at implementing the required

functionality, or 2) vastly faster host CPUs connected to passive peripherals via fast, switched

I/O networks make relatively slow embedded processors a performance liability.

The first issue presumes that these devices do not need the flexibility offered by software, and

is, to a large extent, answered by the state of the industry today. The integration of a

microprocessor core with device specific hardware and interfaces seems to be the preferred

solution for the time being. However, if application-specific hardware design were to

unexpectedly become fast and cost-effective, this could change. A general framework for


intelligent I/O devices is gaining industry support, and will likely help keep these devices

programmable [I2O Special Interest Group 1997].

The second issue is a more serious challenge. If the performance gap between desktop CPUs

and embedded processors begins to widen, and desktop machines adopt a fast switched I/O

interconnect between the CPU and peripherals [Mukherjee and Hill 1997], then clustering

passive, low-end disks with powerful CPUs may be more cost-effective than a system of high-

end active peripherals. [Keeton et al. 1998] do not expect this to happen. They cite

cost/power/price differences between desktop and embedded microprocessors ranging between

5X and 20X, which translate to SPECint 95 performance differences of only 1.5X to 2X. Desktop

CPU markets can afford to pay heavily for marginal improvements in performance on SPECint

95. However, these improvements are not justified or beneficial in embedded systems. [Uysal et

al. 2000] show that performance is comparable between clusters and active disk systems on the

workloads that inspired active disks; the only difference is cost. These researchers report active

disk system costs at less than half of the cluster system cost. Regardless, since people the world

over are currently programming clusters to solve real problems, and active disks do not exist yet,

this issue remains a serious challenge.

In any case, the trend of system-level integration, which leads to systems-on-a-chip (SOCs), is

not likely to stop any time soon, particularly given that communication, and wires, will continue

to be the expensive resource going forward, as memory and compute resources become nearly

infinite. This trend seems to lend itself to any solution that involves a high computation-to-

communication ratio.

6.2 Research Directions

In this section, we propose a number of research issues that face programmable peripherals. As

distributed systems go, the environmental conditions for systems of network peripherals are

pleasant: a high-performance and reliable interconnect, reasonable processing and memory

resources, and a single administrative domain. Given this environment, the items listed here can

be considered properties necessary or desirable in systems comprised of programmable

peripherals.


6.2.1 Comprehensive programming model

Programming model concerns include safe extensions, reasonable host/peripheral interaction,

reasonable host/host interaction, and reasonable support for scalable software. A number of

proposals have recommended programming models for individual device functions. One active

disk proposal [Acharya et al. 1998] describes a programming model for partitioning certain

database applications between hosts and disks, but the model does not address interactions with

or support for other types of applications. Furthermore, the programmer manages all

communication. This recalls the programming models for message-passing multiprocessors,

which are hard to program since all applications require heavy customization. For network

interfaces, pattern-based languages [Begel et al. 1999, Engler and Kaashoek 1996] and object-

oriented systems [Morris et al. 1999] for packet-classification, filtering and routing have been

proposed. These are heavily used, but no proposals have been made to integrate these functions

into a comprehensive programming model for network interfaces.

The software system also needs to provide protection from untrusted, malicious, and faulty

code. The SPINE operating system [Fiuczynski et al. 1998] advocates the use of safely

extensible operating systems to govern the operation of these programmable peripherals. This

notion is descended from the work developed in user-extensible operating systems such as SPIN

[Bershad et al. 1995] and Exokernel [Engler et al. 1995].

6.2.2 Platform independence

Traditional disks and network interfaces are integrated into server and workstation operating

systems through device drivers. As application code migrates onto these peripherals, however,

application code compatibility becomes an issue. Across-the-board device compatibility is

necessary to keep the programmable peripheral market a commodity market, and therefore price

competitive with passive peripherals. This can be achieved via object-based interfaces, such as

CORBA and COM, or through the use of platform independent binary code executed on virtual

machines such as the Java VM [Sirer et al. 1999]. Platform independence is tightly integrated

with the overall design of the execution environment.


6.2.3 Support for unbalanced performance

Applications commonly exhibit "hot spots" in which certain portions of data require a relatively

greater amount of work. This issue has not been raised with the full-scan database workloads

initially considered for Active Disks; however, it will be a concern for more general-purpose

applications of programmable disks. Similarly, as components fail, it is likely that older devices

will be replaced with newer ones with greater performance. This introduces an imbalance in the

ideal partitioning of work onto devices. The general need is for execution environment load

balancing support for applications and devices with varying performance characteristics.

6.2.4 Support for multiprogrammed workloads

The execution environment must also support multiple tasks simultaneously. The initial Active

Disk proposals have avoided this completely, limiting their studies to single applications. SPINE

has started to address this for NIs. In addition to balancing between applications, in devices such

as disks and network interfaces there are elements of real-time constraints that must be managed.

A disk controller, for example, must be able to schedule disk arm movements along with client

requests and buffering tasks simultaneously. It is unclear that the resource-sharing techniques

implemented in standard operating systems will work well under these conditions. With network

interfaces, certain applications, such as guaranteeing a particular quality of service, may require a

tighter integration between the mechanisms allocating processor resources and the mechanisms

allocating network resources.
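
To illustrate one possible arrangement, the sketch below (hypothetical structures and policy, not SPINE's or any shipping controller's) dispatches deadline-bearing controller tasks earliest-deadline-first and lets best-effort application tasks run round-robin only when no deadline is pending.

    /* Minimal two-class dispatcher: real-time tasks (e.g., servicing
     * a pending seek) are chosen earliest-deadline-first; application
     * tasks run round-robin only when no deadline is pending.
     * Structures and task bodies are hypothetical. */
    #include <stddef.h>

    struct task {
        void (*run)(void);
        long  deadline_us;   /* 0 = best-effort, no deadline */
    };

    struct task *pick_next(struct task *tasks, size_t n,
                           size_t *rr_cursor)
    {
        struct task *best = NULL;
        /* First pass: the earliest pending deadline wins. */
        for (size_t i = 0; i < n; i++)
            if (tasks[i].deadline_us > 0 &&
                (best == NULL || tasks[i].deadline_us < best->deadline_us))
                best = &tasks[i];
        if (best)
            return best;
        /* Otherwise, round-robin over best-effort tasks. */
        for (size_t i = 0; i < n; i++) {
            size_t j = (*rr_cursor + i) % n;
            if (tasks[j].deadline_us == 0) {
                *rr_cursor = j + 1;
                return &tasks[j];
            }
        }
        return NULL;
    }

Whether such a simple policy suffices once application code, controller firmware, and QoS guarantees share one embedded processor is exactly the open question raised above.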

6.2.5 Support for centralized control

As noted in [Sirer et al. 1999], centralized control makes several difficult problems much easier, including managing software versions, security, auditing, and performance analysis. Centralized control that does not limit performance is a valuable property in distributed systems, and an execution environment for programmable devices should provide it.

[Keeton et al. 1998] contend that IDISKs will eliminate much of the administrative cost associated with clusters, on the assumption that IDISKs will incur the support and maintenance costs of disks rather than of cluster nodes. This may well be the case: diagnostic checks, for example, may be simpler in integrated devices than in desktop-like systems, where many components can fail and must be tested separately. In general, easing the administrative burden of distributed systems is a valuable line of research with enormous potential impact.

7 Conclusion

This report has surveyed the motivations, designs, and applications for programmable disks and network interfaces. These programmable peripherals contain all the basic components found in computer systems: a microprocessor, memory, and a communications subsystem. Given today's technology trends, embedded processors, and hence programmability, are likely to have a permanent place in these devices. To improve I/O performance, a number of proposals have been made for migrating data- and application-specific functions from servers onto these devices. Initial implementations have provided functionality and support for specific tasks, such as a decision-support database environment for disks and languages and programming systems for packet classification on network interfaces. Based on these findings, a remaining challenge for this general approach is to provide a software system that combines a comprehensive programming model with the right set of libraries and OS services, including security and resource management, needed in these peripheral environments.

This report concludes with some specific areas of future research. These directions focus on programmable network interfaces and, in particular, network processor design, which is the author's current field of research.

1. Memory system design for network processors. This study compares the performance of

modern and proposed memory technologies and cache hierarchies for a selection of network

interface organizations.

2. Analytical performance model for network processors. This model incorporates the parallelism found in network workloads and the parallelism exploited by modern network processor architectures, and includes memory system parameters. The purpose is to guide network interface provisioning and to give intuition about the relative importance of processor and memory improvements; a toy version of such a model is sketched after this list.

3. Thread scheduling on SMT for network processor workloads. Initial results have suggested

that more sophisticated thread scheduling policies on SMT may be beneficial for network

processor workloads. This study examines ideal resource allocation for these workloads and

investigates scheduling policies on SMT that approximate the ideal.
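
As a toy illustration of the kind of model direction 2 envisions (the form and parameters here are illustrative assumptions, not results), packet throughput T on a network processor with p processing elements might be bounded by

\[ T \approx \min\left( \frac{p}{C_{pkt}\, t_{cycle}}, \; \frac{B_{mem}}{D_{pkt}} \right), \]

where C_pkt is the cycles of computation per packet, t_cycle the processor cycle time, B_mem the sustainable memory bandwidth, and D_pkt the bytes of memory traffic per packet. Even this crude min-of-bottlenecks form conveys the intended intuition: adding or speeding up processing elements helps only until the memory term dominates.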


References

[3Com 2000] 3Com Corp. The EtherLink® 10/100 PCI NIC with 3XP Processor. http://www.3com.com/technology/tech_net/tech_briefs/500907.html, 2000.

[Acharya et al. 1998] A. Acharya, M. Uysal, and J. Saltz. Active Disks: Programming Model, Algorithms and Evaluation. Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 81-91. San Jose, November 1998.

[Adams and Ou 1997] L. Adams and M. Ou. Processor Integration in a Disk Controller. IEEE Micro vol. 14, no. 4, July 1997.

[Alexander et al. 2000] D.S. Alexander, W.A. Arbaugh, A.D. Keromytis, S. Muir, and J.M. Smith. Secure Quality of Service Handling: SQoSH. IEEE Communications vol. 38, no. 4, pp. 106-112, April 2000.

[Amdahl et al. 1964] G.M. Amdahl, G.A. Blaauw, and F.P. Brooks, Jr. Architecture of the IBM System/360. IBM Journal of Research and Development, April 1964.

[Anderson et al. 1996] T.E. Anderson, M.D. Dahlin, J.M. Neefe, D.A. Patterson, D.S. Roselli, and R.Y. Wang. Serverless Network File Systems. ACM Trans. on Computer Systems vol. 14, no. 1, pp. 41-79, Feb. 1996.

[Arpaci-Dusseau et al. 1998] R.H. Arpaci-Dusseau, A.C. Arpaci-Dusseau, D.E. Culler, J.M. Hellerstein, and D.A. Patterson. The Architectural Costs of Streaming I/O: A Comparison of Workstations, Clusters, and SMPs. Proceedings of HPCA. Las Vegas, 1998.

[Asanté 2000] Asanté Technologies. GigaNIX Gigabit Ethernet Adapter. http://www.asante.com/new/2000/GigaNIX.html, 2000.

[Avici 2000] Avici Systems. The Avici Terabit Switch Router. http://www.avici.com, 2000.

[Basoglu et al. 1999] C. Basoglu, R. Gove, K. Kojima, and J. O'Donnell. Single-Chip Processor for Media Applications: The MAP1000. International Journal of Imaging Systems and Technology, 1999.

[Begel et al. 1999] A. Begel, S. McCanne, and S.L. Graham. BPF+: Exploiting Global Data-Flow Optimization in a Generalized Packet Filter Architecture. Proceedings of the ACM Communication Architectures, Protocols, and Applications (SIGCOMM '99), 1999.

[Bershad et al. 1995] B.N. Bershad, S. Savage, P. Pardyak, E.G. Sirer, M.E. Fiuczynski, D. Becker, S. Eggers, and C. Chambers. Extensibility, Safety and Performance in the SPIN Operating System. Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (SOSP), December 1995.

[Bhoedjang et al. 1998] R.A.F. Bhoedjang, T. Ruhl, and H.E. Bal. User-Level Network Interface Protocols. Computer vol. 31, no. 11, pp. 53-60, Nov. 1998.

[Boden and Cohen 1995] N. Boden and D. Cohen. Myrinet -- A Gigabit-per-Second Local-Area Network. IEEE Micro, 15(1):29-36, 1995.

[Campbell et al. 2000] R.H. Campbell, L. Zhaoyu, M.D. Mickunas, P. Naldurg, and Y. Seung. Seraphim: Dynamic Interoperable Security Architecture for Active Networks. Proceedings of the IEEE 3rd Conf. on Open Arch. and Network Programming, pp. 55-64, 2000.

[Chen et al. 1998] Y. Chen, C. Dubnicki, S. Damianakis, A. Bilas, and K. Li. UTLB: A Mechanism for Address Translation on Network Interfaces. Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1998.

[Chien et al. 1998] A.A. Chien, M.D. Hill, and S.S. Mukherjee. Design Challenges for High-Performance Network Interfaces. IEEE Micro, pp. 42-44, 1998.

[Chiueh and Pradhan 1999] T.-c. Chiueh and P. Pradhan. High-Performance IP Routing Table Lookup Using CPU Caching. Proceedings of INFOCOM '99, pp. 1421-1428, 1999.

[Chiueh and Pradhan 2000] T.-C. Chiueh and P. Pradhan. Cache Memory Design for Network Processors. Proceedings of the 6th Int'l Symp. on High-Performance Computer Architecture, January 2000.

[Cirrus Logic 2000] Cirrus Logic, Inc. New Open-Processor Platform Enables Cost-Effective System-on-a-Chip Solutions for Hard Disk Drives. http://www.cirrus.com/3ci, 2000.

[Cohen et al. 1992] D. Cohen, G. Finn, R. Felderman, and A. DeSchon. The ATOMIC LAN. Proceedings of the IEEE Workshop on the Arch. and Impl. of High Performance Communication Subsystems, 1992.

[Cormier et al. 1983] R.L. Cormier, R.J. Dugan, and R.R. Guyette. System/370 Extended Architecture: The Channel Subsystem. IBM Journal of Research and Development, 27(3):206-218, 1983.

[Cranor et al. 1999] C.D. Cranor, R. Gopalakrishnan, and P.Z. Onufryk. Architectural Considerations for CPU and Network Interface Integration. Proceedings of Hot Interconnects. Stanford, CA, 1999.


[Crisp 1997] R. Crisp. Direct Rambus Technology: The New Main Memory Standard. IEEE Micro vol. 17, no. 6, pp. 18-28, March 1997.

[Crowley et al. 2000] P. Crowley, M.E. Fiuczynski, J.-L. Baer, and B.N. Bershad. Characterizing Processor Architectures for Programmable Network Interfaces. Proceedings of the International Conference on Supercomputing, pp. 54-65. Santa Fe, NM, May 8-11, 2000.

[CSIX 2000] CSIX: The Common Switch Interface Consortium. http://www.csix.org/, 2000.

[Cuppu et al. 1999] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. A Performance Comparison of Contemporary DRAM Architectures. Proceedings of the 26th Int'l Symp. on Computer Architecture, pp. 222-233, 1999.

[Dally et al. 1992] W.J. Dally, J.A.S. Fiske, J.S. Keen, R.A. Lethin, M.D. Noakes, P.R. Nuth, R.E. Davison, and G.A. Fyler. The Message-Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms. IEEE Micro, pp. 23-39, 1992.

[Decasper et al. 1999] D.S. Decasper, B. Plattner, G.M. Parulkar, C. Sumi, J.D. DeHart, and T. Wolf. A Scalable High-Performance Active Network Node. IEEE Network vol. 13, no. 1, pp. 8-19, Jan.-Feb. 1999.

[Doeringer et al. 1996] W. Doeringer, G. Karjoth, and M. Nassehi. Routing on Longest-Matching Prefixes. IEEE/ACM Trans. on Networking vol. 4, no. 1, pp. 86-97, Feb. 1996.

[Eicken et al. 1995] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. Proceedings of the 15th ACM Symp. on Operating Systems Principles, pp. 40-53, 1995.

[Engler and Kaashoek 1996] D.R. Engler and M.F. Kaashoek. DPF: Fast, Flexible Message Demultiplexing Using Dynamic Code Generation. Proceedings of the ACM Communication Architectures, Protocols, and Applications (SIGCOMM '96), 1996.

[Engler et al. 1995] D.R. Engler, M.F. Kaashoek, and J. O'Toole. Exokernel: An Operating System Architecture for Application-Level Resource Management. Proceedings of the 15th ACM Symp. on Operating Systems Principles, pp. 251-266, 1995.

[FCIA 2000] The Fibre Channel Industry Association. Fibre Channel Technology Overview. http://www.fibrechannel.org/, 2000.

[Fiuczynski et al. 1998] M.E. Fiuczynski, R.P. Martin, T. Owa, and B.N. Bershad. SPINE: An Operating System for Intelligent Network Adapters. Proceedings of the Eighth ACM SIGOPS European Workshop, pp. 7-12. Sintra, Portugal, September 1998.

[Gibson et al. 1997] G. Gibson, D. Nagle, K. Amiri, F.W. Chang, E. Feinberg, H. Gobioff, C. Lee, B. Ozceri, E. Riedel, D. Rochberg, and J. Zelenka. File Server Scaling with Network-Attached Secure Disks. Proceedings of SIGMETRICS, June 1997.

[Gray 1998] J. Gray. Put Everything in the Disk Controller. Talk at the '98 NASD workshop, http://research.microsoft.com/~gray/talks/Gray_NASD_Talk.ppt, 1998.

[Hennessy and Patterson 1996] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative Approach, Second Edition. Morgan Kaufmann Publishers, 1996.

[Hicks et al. 1999] M. Hicks, J.T. Moore, D.S. Alexander, C.A. Gunter, and S.M. Nettles. PLANet: An Active Internetwork. Proceedings of INFOCOM, pp. 1124-1133, 1999.

[Hill et al. 2000] M.D. Hill, N.P. Jouppi, and G.S. Sohi. Readings in Computer Architecture. First ed. Morgan Kaufmann, 2000.

[I2O Special Interest Group 1997] I2O Special Interest Group. Intelligent I/O (I2O) Architecture Specification v1.5. Available from www.i2osig.org, March 1997.

[Intel 2000] Intel Corp. Accelerated Graphics Port Technology. http://www.intel.com/technology/agp/index.htm, 2000.

[Keeton et al. 1997] K. Keeton, R. Arpaci-Dusseau, and D.A. Patterson. IRAM and SmartSIMM: Overcoming the I/O Bus Bottleneck. Proceedings of the Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, June 1997.

[Keeton et al. 1998] K. Keeton, D.A. Patterson, and J.M. Hellerstein. A Case for Intelligent Disks (IDISKs). SIGMOD Record vol. 27, no. 3, Nov. 1998.

[Kuskin et al. 1994] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH Multiprocessor. Proceedings of the 21st Int'l Symp. on Computer Architecture, pp. 302-313, April 1994.

[LevelOne 1999] LevelOne, an Intel Company. IX Architecture Whitepaper. 1999.

[Menzilcioglu and Schlick 1991] O. Menzilcioglu and S. Schlick. Nectar CAB: A High-Speed Network Processor. Proceedings of the 11th Int'l Conf. on Distributed Computing Systems, pp. 508-515, 1991.


[Merugu et al. 2000] S. Merugu, S. Bhattacharjee, E. Zegura, and K. Calvert. Bowman: A Node OS for Active Networks. Proceedings of INFOCOM 2000, pp. 1127-1136, 2000.

[Moore 1965] G.E. Moore. Cramming More Components onto Integrated Circuits. Electronics, pp. 114-117, April 1965.

[Morris et al. 1999] R. Morris, E. Kohler, J. Jannotti, and M.F. Kaashoek. The Click Modular Router. Proceedings of the 17th ACM Symp. on Operating Systems Principles, Dec. 1999.

[Mukherjee 1998] S.S. Mukherjee. Design and Evaluation of Network Interfaces for System Area Networks. Doctoral dissertation, Computer Sciences Department, University of Wisconsin, Madison, 1998.

[Mukherjee and Hill 1997] S.S. Mukherjee and M.D. Hill. A Case for Making Network Interfaces Less Peripheral. Proceedings of Hot Interconnects V. Stanford, August 1997.

[Nayfeh et al. 1996] B.A. Nayfeh, L. Hammond, and K. Olukotun. Evaluation of Design Alternatives for a Multiprocessor Microprocessor. Proceedings of the 23rd International Symposium on Computer Architecture, pp. 67-77, May 1996.

[Nygren et al. 1999] E.L. Nygren, S.J. Garland, and M.F. Kaashoek. PAN: A High-Performance Active Network Node Supporting Multiple Mobile Code Systems. Proceedings of the 2nd Conf. on Open Architectures and Network Programming, pp. 78-89, 1999.

[Patterson et al. 1997] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. A Case for Intelligent RAM: IRAM. IEEE Micro, pp. 34-44, 1997.

[Patterson and Keeton 1998] D. Patterson and K. Keeton. Hardware Technology Trends and Database Opportunities. Slides from the SIGMOD '98 Keynote Address, 1998.

[Patterson et al. 1988] D.A. Patterson, G. Gibson, and R.H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). Proceedings of the ACM SIGMOD Conference. Chicago, IL, June 1988.

[Peterson et al. 1999] L. Peterson, S. Karlin, and K. Li. OS Support for General-Purpose Routers. Proceedings of the HotOS Workshop, March 1999.

[Riedel 1999] E. Riedel. Active Disks - Remote Execution for Network-Attached Storage. Doctoral dissertation, Tech. Report CMU-CS-99-177, Carnegie Mellon University, Pittsburgh, PA, Nov. 1999.

[Riedel et al. 1998] E. Riedel, G. Gibson, and C. Faloutsos. Active Storage for Large-Scale Data Mining and Multimedia. Proceedings of VLDB, Aug. 1998.

[Rixner et al. 1998] S. Rixner, W.J. Dally, U.J. Kapasi, B. Khailany, A. López-Lagunas, P.R. Mattson, and J.D. Owens. A Bandwidth-Efficient Architecture for Media Processing. Proceedings of the 31st Int'l Symp. on Microarchitecture, pp. 3-13, Nov. 1998.

[Ruemmler and Wilkes 1994] C. Ruemmler and J. Wilkes. An Introduction to Disk Drive Modeling. IEEE Computer vol. 27, no. 3, pp. 17-28, 1994.

[Scheiman and Schauser 1998] C.J. Scheiman and K.E. Schauser. Evaluating the Benefits of Communication Coprocessors. Journal of Parallel and Distributed Computing, 57(2):236-256, 1998.

[Schoinas and Hill 1998] I. Schoinas and M.D. Hill. Address Translation Mechanisms in Network Interfaces. Proceedings of the 4th Int'l Symp. on High Performance Computer Architecture, 1998.

[Seitz 1985] C.L. Seitz. The Cosmic Cube. Communications of the ACM vol. 28, no. 1, pp. 22-33, 1985.

[Sirer et al. 1999] E.G. Sirer, R. Grimm, A.J. Gregory, and B.N. Bershad. Design and Implementation of a Distributed Virtual Machine for Networked Computers. Proceedings of the 17th ACM Symp. on Operating Systems Principles, pp. 202-216, Dec. 1999.

[Sitera 2000] Sitera Corp. The PRISM IQ2000 Network Processor Family. http://www.sitera.com, 2000.

[Smith et al. 1999] J.M. Smith, K.L. Calvert, S.L. Murphy, H.K. Orman, and L.L. Peterson. Activating Networks: A Progress Report. IEEE Computer Magazine vol. 32, no. 4, pp. 32-41, April 1999.

[Smotherman 1989] M. Smotherman. A Sequencing-Based Taxonomy of I/O Systems and Review of Historical Machines. Computer Architecture News, 17(5):5-15, 1989.

[Steenkiste 1992] P. Steenkiste. Analysis of the Nectar Communication Processor. Proceedings of the IEEE Workshop on the Arch. and Impl. of High Perf. Comm. Subsystems, pp. 1-3, 1992.

[Tennenhouse and Wetherall 1996] D.L. Tennenhouse and D.H. Wetherall. Towards an Active Network Architecture. ACM Computer Communications Review, 26(2):5-18, 1996.

[Tullsen et al. 1995] D. Tullsen, S. Eggers, and H. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 392-403. Santa Margherita Ligure, Italy, June 1995.

[Uysal et al. 2000] M. Uysal, A. Acharya, and J. Saltz. Evaluation of Active Disks for Decision Support Databases. Proceedings of the 6th Int'l Symp. on High-Performance Computer Architecture, pp. 337-348, 2000.


[von Eicken and Vogels 1998] T. von Eicken and W. Vogels. Evolution of the Virtual Interface Architecture. IEEE Micro, pp. 61-68, November 1998.

[Walton et al. 1998] S. Walton, A. Hutton, and J. Touch. Efficient High-Speed Data Paths for IP Forwarding Using Host Based Routers. Proceedings of the Ninth IEEE Workshop on Local and Metropolitan Area Networks, May 1998.

[Wang et al. 1999] R.Y. Wang, T.E. Anderson, and D.A. Patterson. Virtual Log Based File Systems for a Programmable Disk. Proceedings of the Third USENIX Operating System Design and Implementation Conference (OSDI). New Orleans, LA, February 1999.

[Wetherall 1999] D. Wetherall. Active Network Vision and Reality: Lessons from a Capsule-Based System. Proceedings of the 17th ACM Symp. on Operating Systems Principles, pp. 64-79, Dec. 1999.

[Wetherall et al. 1999] D. Wetherall, J. Guttag, and D. Tennenhouse. ANTS: Network Services Without the Red Tape. IEEE Computer Magazine vol. 32, no. 4, April 1999.

[Yang and Lebeck 2000] C.-L. Yang and A.R. Lebeck. Push vs. Pull: Data Movement for Linked Data Structures. Proceedings of the International Conference on Supercomputing, pp. 176-186. Santa Fe, NM, May 8-11, 2000.