uu.diva-portal.org/smash/get/diva2:167524/FULLTEXT01.pdf · Martin Karlsson



To my family


List of Papers

This thesis is based on the following papers, which are referred to in the text by the capital letters A through G.

A. Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution
Martin Karlsson and Erik Hagersten
Technical Report 2005-040, Department of Information Technology, Uppsala University, Sweden, December 2005. Submitted for publication.

B. Timestamp-Based Selective Cache Allocation
Martin Karlsson and Erik Hagersten
In High Performance Memory Systems, Chapter 4, edited by H. Hadimiouglu, D. Kaeli, J. Kuskin, A. Nanda, and J. Torrellas, pages 43-59, Springer-Verlag, 2003. ISBN 038700310X.

C. Skewed Caches from a Low-Power Perspective
Mathias Spjuth, Martin Karlsson and Erik Hagersten
In Proceedings of the ACM International Conference on Computing Frontiers, Ischia, Italy, May 2005.

D. TMA: A Trap-Based Memory Architecture
Håkan Zeffer, Zoran Radovic, Martin Karlsson and Erik Hagersten
Technical Report 2005-041, Department of Information Technology, Uppsala University, Sweden, December 2005. Submitted for publication.

E. Memory System Behaviour of Java-Based Middleware
Martin Karlsson, Kevin Moore, Erik Hagersten and David Wood
In Proceedings of the Ninth International Symposium on High Performance Computer Architecture, Anaheim, California, USA, February 2003.

F. Exploring Processor Design Options for Java-Based Middleware
Martin Karlsson, Kevin Moore, Erik Hagersten and David Wood
In Proceedings of the 34th International Conference on Parallel Processing, Oslo, Norway, June 2005.

G. VASA: A Simulator Infrastructure with Adjustable Fidelity
Dan Wallin, Håkan Zeffer, Martin Karlsson and Erik Hagersten
In Proceedings of the International Conference on Parallel and Distributed Computing and Systems, Phoenix, Arizona, USA, November 2005.


Other Publications

• Timestamp-Based Selective Cache Allocation
Martin Karlsson and Erik Hagersten
In Proceedings of the 1st Annual Workshop on Memory Performance Issues (WMPI 2001), held in conjunction with the 28th International Symposium on Computer Architecture (ISCA28), Göteborg, Sweden, May 2002.

• Memory Characterization of the ECperf Benchmark
Martin Karlsson, Kevin Moore, Erik Hagersten, and David Wood
In Proceedings of the 2nd Annual Workshop on Memory Performance Issues (WMPI 2002), held in conjunction with the 29th International Symposium on Computer Architecture (ISCA29), Anchorage, Alaska, USA, May 2002.

• Skewed Caches from a Low-Power Perspective
Mathias Spjuth, Martin Karlsson and Erik Hagersten
In Proceedings of the 4th Annual Workshop on Memory Performance Issues (WMPI 2004), held in conjunction with the 31st International Symposium on Computer Architecture (ISCA31), Munich, Germany, June 2004.

• Cache Memory Design Trade-offs for Current and Emerging Workloads
Martin Karlsson
Licentiate Thesis 2003-009, Department of Information Technology, Uppsala University, September 2003.


Comments on my Participation

A. Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution
I am the principal author and have performed all the experiments in this work.

B. Timestamp-Based Selective Cache Allocation
I am the principal author and have performed all the experiments in this work.

C. Skewed Caches from a Low-Power Perspective
I am the principal author of the paper, while Mathias Spjuth is the principal investigator of the project. The energy model was developed jointly, while the performance model was based on Mathias Spjuth's Master's thesis, which I advised.

D. TMA: A Trap-Based Memory Architecture
Håkan Zeffer is the principal investigator of the project and has performed all the experiments. I am the principal author of the article together with Håkan Zeffer, and we also co-designed the trap mechanism. I had no part in the protocol implementation, which was carried out by Håkan Zeffer and Zoran Radovic.

E. Memory System Behaviour of Java-Based Middleware
The paper was written jointly with Kevin Moore. From an experimental point of view, Kevin Moore was responsible for setting up and tuning the workload, while I designed most of the experiments and provided the initial benchmark setup inside Simics.

F. Exploring Processor Design Options for Java-Based Middleware
The paper was written jointly with Kevin Moore. I designed and performed all the experiments in this project.

G. VASA: A Simulator Infrastructure with Adjustable Fidelity
All authors have contributed to all aspects of this project. I am the principal author of the article together with Dan Wallin, who also designed and performed all the experiments. Håkan Zeffer is the main contributor to the VASA implementation.


Acknowledgments

My first attempt at research, during an internship at Los Alamos National Laboratory, was not much of a success; quite the opposite. The result was that I decided that getting a PhD (in databases), which I had been contemplating, was not for me. I am therefore very grateful to my advisor, Professor Erik Hagersten, who persuaded me to get a PhD and accepted me into his research group, which at that time was lacking members. I have never since that day regretted my decision and would do it all over again without hesitation.

Somewhere along the way, Erik's role blurred between advisor, mentor and friend. I consider myself very fortunate to have had Erik as an advisor during these years. I am especially grateful to Erik for allowing me to pursue my own ideas to a large extent. I have learned many things from him, a surprising amount in areas unrelated to computer architecture. Thank you! Apart from Erik, I have been very fortunate to have had an extraordinary set of teachers and mentors during my education, many of whom have indirectly been instrumental in reaching this goal. While the list is too long to be included here, my thanks go out to all of you.

The Uppsala Architecture Team (UART) members, Lars Albertson, Erik Berg, Henrik Löf, Zoran Radovic, Dan Wallin and Håkan Zeffer, have been excellent companions along this long and winding road. I will miss our daily interaction a great deal. The IT department has been an excellent workplace, and I have very much enjoyed the friendly atmosphere among faculty and staff. I would also like to thank the collection of wise men who have joined me at the grad student Friday pub.

Without the administrative staff, coffee breaks certainly would not have been such a pleasure. The computer administrators helped me out with numerous issues. I would like to thank the staff at Virtutech, particularly Andreas Moestedt, for countless hours of Simics support. Sverker Holmgren provided computational resources and shipped me off to Sri Lanka to install Linux boxes in the middle of the cold, dark winter. Thanks!

During the course of this work I have had three very rewarding internships at Sun Microsystems. I would like to thank the people at Sun Laboratories and the Processor Performance Group for making my visits there so enjoyable, especially Paul Caprioli and Shailender Chaudhry.


I would also like to express my gratitude to my fellow student co-authors and collaborators Kevin Moore, Mathias Spjuth, Håkan Zeffer, Dan Wallin and Zoran Radovic. I think that in each project the result was far greater than the sum of our efforts. I would especially like to thank Kevin Moore for proofreading this thesis.

Finally, I would like to thank my family, friends and other loved ones for all the support and encouragement I have received during these years. For that I will be forever grateful. I would like to extend a special thank-you to Josefin, who through her endless wisdom during the latter part of this journey has helped me keep things in perspective.

Thanks!

This research has been supported by a grant from the Swedish Foundation for Strategic Research (SSF) under the ARTES/PAMP program.


Introduction

The services provided by multiprocessor servers are increasingly becoming a part of our everyday lives. For example, when we make an online reservation or purchase, servers running databases and web applications keep track of products in stock or available seats, and are used to deliver the service to us. The cost and performance of these multiprocessor servers are therefore important. Multiprocessor servers are used in many domains. In the scientific computing arena, faster multiprocessor servers allow for more detailed and accurate modeling of, for example, weather and pharmaceuticals.

The discrepancy between the speed of processing and the time to access memory has made memory system performance one of the most significant bottlenecks preventing increased computational performance and capabilities. This trend is sometimes referred to as the memory wall problem [21] and shows no sign of abating in the foreseeable future.

Chip multiprocessors (CMPs) [12][2][16] introduce new resource trade-offs and further complicate the architectural task of chip design. While CMP technology is still in its infancy, it has already proven viable, and CMP design is likely to be the prevailing trend in processor design for many years to come. All major processor manufacturers have either released or announced CMP-based designs. Researchers at Intel have, for example, projected that 75 percent of all their processor products will be CMPs by the end of 2006 [5].

In this thesis we study multiple aspects of memory system design when moving to CMPs. In particular, we study cost, performance and power aspects.

Chip Multiprocessors

Processor performance improvements have been driven partly by advances in silicon fabrication technology. Reduced feature sizes have allowed more transistors to fit on a die, following the well-known Moore's law [8], which states that the number of devices per chip doubles every 18 months. The increased transistor budgets have enabled architects to design more elaborate microarchitectures. Processor structures such as register files, issue windows and caches have grown with each generation, improving the performance of a single monolithic processor core.

Technology projections show that the electrical characteristics of chips are changing [18]. While the transistors will get smaller, the relative speed of wires will decrease. The fact that transistor speed scales inversely with feature size leads to an increased cost of communication across the chip. In recent process generations, a signal could be propagated across an entire chip within one clock cycle. Projections for future process generations indicate that less than 10 percent of the chip will be reachable within a cycle, and that this will get progressively worse with each generation [1]. Therefore, the traditional approach of obtaining performance through larger and larger monolithic processor cores faces serious challenges.²

[Figure 1: Overview of a Chip Multiprocessor. Four cores (CPU0-CPU3), each with instruction and data L1 caches (iL1, dL1), connect through an intra-chip switch to L2 banks (L20-L23); a core together with its L1 caches and an L2 bank forms a processor slice.]

The introduction of chip multiprocessors provides an alternative approach. Instead of scaling up a monolithic core, multiple cores may be fitted on the same die. The CMP approach offers performance increases mainly through multi-processing rather than through increased single-thread performance.

The design of chip multiprocessors introduces new trade-offs, since architects now have to strike a balance between allocating resources to each processor core and fitting more cores on a die. For example, designers must decide how much cache should be allocated per core and how many cores should share each cache. An advantage of chip multiprocessors compared to conventional monolithic designs is the ability to scale up a processor design.

² One approach to mitigate the wire delay problem is to employ a clustered microarchitecture, where clusters of smaller structures replace larger structures [17].


Let us consider a processor core together with the L1 caches and a subset of the L2 banks, a processor slice, as illustrated in Figure 1. Once a process shrink provides more transistors, a CMP design can be scaled up by simply adding more processor slices, incurring a minimal redesign effort.

From a software perspective, the paradigm shift in processor design toward providing higher performance mainly through multiprocessing forces application writers to parallelize their applications in order to take advantage of the performance increase. Most commercial applications, such as databases, web or application servers, are throughput based and have already been parallelized. For such workloads, the interesting metric is not how fast one request is satisfied; instead, performance is measured by how many requests can be satisfied per unit of time. Of course, single-thread performance cannot be entirely ignored, since certain quality-of-service aspects, e.g. responsiveness in a web application, must be observed. Such applications, which often have an abundance of parallelism [20], increase the focus on processing-core complexity: processor designers seeking to maximize throughput per chip may find that a larger number of simpler cores outperforms fewer, more complex ones [6].

From a multiprocessing perspective, CMPs have a very important advantage. Since the processors are fitted on the same die, sharing resources such as an L2 cache can be done with a very limited latency penalty. Shared caches increase utilization and significantly reduce the inter-thread communication cost, as long as the communicating threads share the same cache. Since communication overhead often limits the scaling of multithreaded programs, the introduction of CMPs with shared caches is likely to have an important corollary for parallelization efforts: while it becomes easier to get an application to scale across cores within a chip, getting an application to scale across chips remains as challenging as it is today. Therefore, there exists a great incentive to build CMPs with as many processing cores as possible.

Processor Design Constraints

The basis for the resource constraints in current processor design lies in the economics of CMOS chip fabrication. When designing processors and the systems around them, cost and performance are the driving factors.

Since computer evolution has followed a pattern of roughly doubling performance every 18 months, the time to market for a new design is critical. Due to this performance trend, every extra month in development time leads to an expectation of roughly 4 percent higher performance. The overall development time of a new processor is often in the range of 3 to 7 years. Complex solutions lead to longer design, implementation and, foremost, verification time. Therefore, complexity within a design can be viewed as a limited resource. Apart from complexity, some of the most important constraints in chip design are area, power and packaging costs.
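The performance forfeited per month of schedule slip follows from the doubling period by compound growth. A short check of the arithmetic (a sketch; the 12- and 18-month doubling periods are the commonly quoted variants of Moore's-law-style scaling, not figures specific to this thesis):

```python
def monthly_growth(doubling_months: float) -> float:
    """Fractional performance gain expected per month when performance
    doubles every `doubling_months` months: 2**(1/T) - 1."""
    return 2 ** (1.0 / doubling_months) - 1

# A 12-month doubling implies roughly 6 percent expected gain per month;
# an 18-month doubling implies roughly 4 percent. Either way, each month
# of delay forfeits that much expected relative performance.
print(f"{monthly_growth(12):.1%}")  # → 5.9%
print(f"{monthly_growth(18):.1%}")  # → 3.9%
```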

[Figure 2: Example of die cost as a function of chip area (50-400 mm²); cost rises superlinearly with area.]

Processor chips are manufactured by imprinting a design on a wafer through lithography. A wafer today typically measures decimeters in diameter, while a chip is on the order of a few hundred mm². Larger chips simply mean that fewer will fit on the same wafer. In addition, a significant fraction of the chips on a wafer usually have to be discarded due to process variations. These two factors lead to a quadratic area cost in chip design [10], as illustrated in Figure 2. Chip multiprocessors alleviate this problem somewhat, since a processor where one or even a few of the cores are unusable can be priced and sold as a scaled-down version. Due to the high manufacturing costs associated with area, it is important to make large processor structures, such as caches, as area efficient as possible.
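The superlinear cost effect of the two factors above can be illustrated with a back-of-the-envelope yield model (a sketch with invented numbers; the Poisson yield model and the defect density are textbook assumptions, not figures from this thesis):

```python
import math

WAFER_DIAMETER_MM = 300.0       # assumed 30 cm wafer
WAFER_COST = 5000.0             # assumed cost per wafer, arbitrary units
DEFECT_DENSITY_PER_MM2 = 0.002  # assumed process defect density

def cost_per_good_die(die_area_mm2: float) -> float:
    """Approximate cost of one working die: a larger die means fewer
    dies per wafer AND a lower yield, hence superlinear cost growth."""
    wafer_area = math.pi * (WAFER_DIAMETER_MM / 2) ** 2
    dies_per_wafer = wafer_area / die_area_mm2           # ignores edge loss
    yield_fraction = math.exp(-DEFECT_DENSITY_PER_MM2 * die_area_mm2)
    return WAFER_COST / (dies_per_wafer * yield_fraction)

# Doubling the die area more than doubles the cost per good die:
small, large = cost_per_good_die(100), cost_per_good_die(200)
print(large / small)  # > 2, i.e. superlinear in area
```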

Power consumption, which until recently was considered an issue only in portable computing, is currently one of the most pressing processor design constraints also in high-performance computing [14]. Chip power consists of switching, short-circuit and leakage power. Switching power has traditionally made up the vast majority of chip power. As transistor sizes have kept decreasing, it has been possible to scale down the supply voltage. Since switching power depends quadratically on voltage, voltage scaling has enabled more and more transistors per chip without a corresponding power increase. However, in current and future process generations, leakage power effects will counter this trend and are predicted to make up an increasingly large portion of chip power [15].

[Figure 3: The Semiconductor Industry Association (SIA) projection of bandwidth per million transistors, normalized, for the years 2004 through 2016.]
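The quadratic voltage dependence mentioned above follows from the standard CMOS dynamic-power relation P = aCV²f (a textbook formula; the numbers below are illustrative, not from this thesis):

```python
def switching_power(activity: float, capacitance: float,
                    voltage: float, frequency: float) -> float:
    """Dynamic (switching) power of a CMOS circuit: P = a * C * V^2 * f."""
    return activity * capacitance * voltage ** 2 * frequency

# Scaling the supply voltage from 1.2 V down to 0.9 V at the same
# frequency cuts switching power to (0.9 / 1.2)^2, roughly 56 percent:
p_high = switching_power(0.2, 1e-9, 1.2, 2e9)
p_low = switching_power(0.2, 1e-9, 0.9, 2e9)
print(p_low / p_high)  # roughly 0.5625
```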

Power is limited in high-performance computing for multiple reasons. For large server facilities, the electricity and room cooling costs are significant parts of the total cost of ownership. Server density, i.e. how much processing power fits into a rack, is also an important design consideration. From a single-processor point of view, power is important because the likelihood that a chip will break increases with temperature. To guarantee that a chip is safely kept cool, processor designers state a power envelope under which the maximum power is kept. System designers must make sure that the system design is able to cool the processor and maintain its temperature within the thermal limit.

The off-chip bandwidth of a processor is determined by the processor packaging. The number of pins, as well as the pin frequency, dictates the bandwidth. Packaging costs grow superlinearly with pin count and often represent a significant fraction of the overall cost [3]. The 2004 edition of the International Technology Roadmap for Semiconductors, published by the Semiconductor Industry Association (SIA), predicts that the cost-efficient number of pins per chip will remain constant while the number of transistors will grow by 67 percent [19]. Hence, if the increase in transistor count is used to scale up CMP designs by adding more processor cores (or slices), it will lead to a decrease in bandwidth per processor core. Figure 3 shows the predicted cost-effective amount of bandwidth (pin count times pin frequency) per million transistors over time. As can be seen, the bandwidth per transistor is projected to decrease by 90 percent over the next decade.
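Under the stated assumptions of constant pin bandwidth and 67 percent transistor growth per generation, the projected decline can be reproduced (a sketch; the two-year generation cadence is my assumption, not a figure from the roadmap):

```python
TRANSISTOR_GROWTH_PER_GENERATION = 1.67  # from the SIA projection above

def bandwidth_per_transistor(generations: int) -> float:
    """Normalized off-chip bandwidth per transistor after a number of
    process generations, assuming total pin bandwidth stays constant."""
    return 1.0 / TRANSISTOR_GROWTH_PER_GENERATION ** generations

# Five two-year generations span roughly a decade:
remaining = bandwidth_per_transistor(5)
print(f"{1 - remaining:.0%} decrease")  # → 92% decrease, close to the
                                        #   roughly 90 percent in the text
```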

Memory System Design

The growing disparity between processor and memory speed has made the memory system the bottleneck of computer system performance. To overcome the ever-increasing latencies, high-speed SRAM caches are used extensively.

The principle of locality is the basis for the successful use of caches. Spatial locality (locality in space) and temporal locality (locality in time) refer to a repetitiveness exhibited by many programs. By exploiting locality, caches help to reduce the effects of the long memory latency. Spatial locality implies that after a piece of data has been accessed, there is an increased likelihood that nearby data will be used within a short period of time. Temporal locality means that a piece of data that has been accessed is likely to be requested again soon. Temporal locality is exploited by the use of caches holding a subset of the contents of the slow but large memory. Spatial locality is exploited by fetching more than one word from memory at a time. As the memory gap has widened, memory system designers have responded by designing larger and larger caches and fetching larger cache blocks at a time.
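The benefit of fetching whole blocks can be seen in a minimal cache model (a hypothetical sketch, not a simulator used in this thesis): on a sequential scan, every word of a fetched block hits except the one that triggered the miss.

```python
def sequential_hit_ratio(num_words: int, block_size: int) -> float:
    """Simulate a sequential scan through memory with a cache that
    fetches block_size words on each miss; return the hit ratio."""
    cached_block = None
    hits = 0
    for addr in range(num_words):
        block = addr // block_size
        if block == cached_block:
            hits += 1             # spatial locality: neighbor already fetched
        else:
            cached_block = block  # miss: fetch the whole block
    return hits / num_words

# Larger blocks exploit spatial locality better on a sequential scan:
print(sequential_hit_ratio(1024, 4))   # → 0.75
print(sequential_hit_ratio(1024, 16))  # → 0.9375
```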

Larger caches lead to longer access times. Therefore, cache hierarchies are employed, with smaller and smaller caches closer to the processing core. In today's multi-GHz processors, the size of the first-level caches is constrained in order to allow access within a few cycles. It is therefore very important to design caches that are as efficient as possible from a miss-ratio perspective. However, cache performance is not the only concern. As part of the chip infrastructure, caches are limited by the same resource constraints as other structures, forcing architects to consider, for example, the power aspects of different cache design alternatives.

It is desirable to share some of the resources, such as the L2 cache, in a CMP. This enables a resource to be allocated according to demand rather than statically partitioned. For example, if only a single application thread is running on a CMP, it would be preferable if that thread could make use of the entire L2 cache instead of just a fraction of it. First-level caches are rarely shared among different cores, for timing reasons.³ However, each core may contain multiple hardware threads [7], which often share an L1 cache. Also, the previously mentioned decrease in communication cost among software threads is a strong motivation for shared caches.

³ The MAJC-5200 is a notable exception to this [11].


While sharing enables more efficient use of caches, it is also prone to conflicts when multiple processors compete for the same cache space. A common approach to reducing this effect is to increase the cache associativity, i.e. the number of locations in which a piece of data may reside. This is costly, since power consumption scales with associativity.
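Conflict behavior can be illustrated with a small LRU set-associative model (a hypothetical sketch, not the simulator used in these papers): three blocks that map to the same two-way set thrash, while four ways absorb the conflict.

```python
from collections import OrderedDict

def run(trace, num_sets, ways):
    """Simulate an LRU set-associative cache; return the number of hits."""
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for block in trace:
        s = sets[block % num_sets]    # set index from low-order bits
        if block in s:
            hits += 1
            s.move_to_end(block)      # refresh LRU position
        else:
            if len(s) == ways:
                s.popitem(last=False) # evict the least recently used block
            s[block] = True
    return hits

# Three blocks (0, 8, 16) all map to set 0 of an 8-set cache:
trace = [0, 8, 16] * 10
print(run(trace, num_sets=8, ways=2))  # → 0 (LRU thrashing)
print(run(trace, num_sets=8, ways=4))  # → 27 (all hits after warm-up)
```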

Architectural Design Methodology

When designing a new computer system, it is common to use simulation in combination with benchmarks that represent the workload of the system. By comparing the simulation results of an existing design with those of a new design, computer architects are able to approximate the performance associated with various design options.

System Modeling and Simulation

Simulation is extensively used in the field of computer architecture to evaluate new ideas. Many independent simulations are often run in parallel, e.g. for different benchmarks. The task of simulation can therefore be viewed as containing plenty of application-level parallelism. Since the simulation results are needed within hours or days in the design process, simulation speed is very important.

Care is often taken to make simulators resemble the proposed designs as closely as possible, leading to extremely detailed simulators. Although this provides high accuracy, it leads to very slow simulators and limits the simulation coverage. For example, a detailed simulator capable of simulating 100,000 instructions per second on today's computers is considered relatively fast. Modeling a target architecture at 4 GHz, running an application spending on average two cycles per instruction, would require about 6 hours to simulate a single second of the target architecture. Full-system simulators, e.g. Virtutech Simics, capture operating system effects, which can have a significant impact when studying many workloads and applications. Simics has been used throughout this thesis.
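The slowdown figure above follows directly from the stated rates (a check of the arithmetic, using the numbers from the text):

```python
TARGET_CLOCK_HZ = 4e9          # 4 GHz target architecture
CYCLES_PER_INSTRUCTION = 2.0   # average CPI of the simulated application
SIMULATOR_IPS = 100_000        # instructions simulated per host second

# Instructions the target executes in one second of target time:
target_instructions = TARGET_CLOCK_HZ / CYCLES_PER_INSTRUCTION

# Host time needed to simulate one second of target time:
host_seconds = target_instructions / SIMULATOR_IPS
print(host_seconds / 3600)  # about 5.6, i.e. roughly 6 hours
```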

Benchmark Simulation

Since different applications have radically different memory characteristics, it is important to make design trade-offs based on the performance of the type of application that is likely to run on the target system. Server vendors typically emphasize commercial benchmarks, and in particular database workloads, since a large fraction of computer systems are employed as database servers. As previously mentioned, the design time of a processor or a system is long and counted in years. Therefore, it is important to use modern workloads while designing a system, so that it is optimized for the workloads that will be relevant when the system is deployed. It is also important from a research perspective to discover new performance bottlenecks that arise with new workloads. A workload that is gaining importance due to its rapid and widespread adoption is middleware. Middleware is an important building block in e-commerce systems, allowing the business logic to be glued together with databases and web servers. An illustration of this is shown in Figure 4, where browsers or thin clients are connected to two backend database servers through middleware. The fact that middleware applications are commonly transaction-oriented Java applications makes them quite different from conventional number-crunching or database-style workloads.

[Figure 4: Overview of a 3-tiered system. Tier 1: browsers/thin clients; Tier 2: middleware, comprising a web server and business logic, reached over a LAN/WAN; Tier 3: backend databases.]

An application server is a piece of software that provides an abstraction layer to N-tier application developers. Application servers typically provide services such as thread pooling, load balancing and database connection pooling, and allow the application writer to focus on the business rule implementation instead of the application performance. Application servers can therefore be viewed as an underlying software layer on which middleware software is implemented.

The rest of this thesis is organized as follows: Chapter 2 contains a summary of each paper included in this thesis. The contribution of this thesis is described in Chapter 3, followed by a summary of the thesis in Swedish.


Paper Summaries

The papers included in this thesis are verbatim copies of the previously published versions, but have been reformatted to fit the format of this thesis. Reprints were made with permission from the publishers. This chapter contains summaries of the included papers, together with a retrospective section per paper containing comments and some of the insights gained since the publication of each paper.

Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution

The cost of increasing the chip pin count, and thereby bandwidth, is high and predicted to increase considerably. As more transistors become available, the number of cores per chip will be dictated either by power or by bandwidth. In this paper we target the issue of conserving bandwidth and note a trend toward larger cache blocks in on-chip L2 caches. We find that large parts of the cache blocks are not being used, leading to a waste of the costly bandwidth.

Runahead execution is a promising technique for tolerating long memory latencies. In this paper we observe multiple characteristics of runahead execution that can be exploited to remove the implicit assumption of spatial locality. We propose a method of fine-grained fetching that reduces the amount of unnecessary data being fetched and achieves a performance speedup through the reduction in bandwidth.
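One way to picture the fine-grained fetch idea (a hypothetical sketch of the general concept, not the mechanism evaluated in Paper A; the block size and addresses are invented): during a runahead episode, record which words of a missing block are actually touched, and fetch only those words instead of the whole block.

```python
BLOCK_WORDS = 16  # assumed words per cache block

def words_to_fetch(runahead_accesses, block_number):
    """Given the word addresses touched during a runahead episode,
    return the set of word offsets needed from one missing block."""
    return {a % BLOCK_WORDS for a in runahead_accesses
            if a // BLOCK_WORDS == block_number}

# A pointer-chasing pattern touches only two words of a 16-word block:
accesses = [32, 33, 100, 205]           # word addresses seen in runahead
needed = words_to_fetch(accesses, 2)    # block 2 covers words 32..47
print(sorted(needed))                   # → [0, 1]
print(f"fetch {len(needed)}/{BLOCK_WORDS} words")  # → fetch 2/16 words
```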

Retrospective

This paper provides a first step toward understanding how characteristics of runahead execution can be exploited in memory system design. While the fine-grained fetch method proposed in Paper A only targets data fetched from memory, a similar technique could be applied between nodes in a cache-coherent system.

Timestamp-Based Selective Cache Allocation

In order to minimize miss ratio and maximize area and power utilization, it is important to carefully select which piece of data to evict every time a new piece of data is allocated in a cache. Therefore, a large number of replacement algorithms have been devised. In Paper B we take the orthogonal approach of proposing a selective allocation strategy. Instead of only selecting what to replace on a cache miss, we choose whether a piece of data should be allocated at all in the first place. The scheme is based on the observation that a large fraction of the data that is accessed does not benefit from cache allocation. By not allocating some data, the data that does get allocated is able to reside in the cache for a longer period of time, thus avoiding some of the capacity misses.

In Paper B we introduce the concept of Cache Allocation Ticks (CAT). CAT is a time metric for caches. The scheme is based on a global counter that is incremented on each cache allocation, in combination with a timestamp per cache item. Based on these timestamps, we propose an algorithm to detect which data to avoid allocating. We also demonstrate how a selective allocation algorithm, combined with a small staging cache, L0, can drastically increase the effectiveness of a cache, allowing a smaller cache to perform similarly to a larger one. We also show that CAT provides a fairly accurate and application-independent way of detecting good allocation candidates.
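The flavor of a CAT-style allocation filter can be sketched as follows (a hypothetical illustration of the concept only, not the algorithm evaluated in Paper B; the reuse-distance threshold is invented, and for simplicity the counter here ticks on every allocation decision):

```python
class CATClock:
    """Cache Allocation Ticks: time measured in cache allocations."""

    def __init__(self):
        self.ticks = 0
        self.last_seen = {}  # block -> CAT timestamp of last decision

    def should_allocate(self, block, threshold=1000):
        """Allocate only blocks whose reuse distance, measured in CAT
        ticks, is short enough to suggest reuse before eviction."""
        distance = self.ticks - self.last_seen.get(block, -threshold - 1)
        self.last_seen[block] = self.ticks
        self.ticks += 1
        return distance <= threshold

clock = CATClock()
print(clock.should_allocate(42))  # → False (no recent reuse observed)
print(clock.should_allocate(42))  # → True (reused after one tick)
```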

Retrospective
If the L0 staging cache used in the RASCAL cache is looked up before accessing the main cache, it can be expected to reduce power consumption. The reason is that many accesses are likely to be satisfied from the small L0 cache, which consumes less power than the main cache. This is, however, neither discussed nor quantified in Paper B.

Skewed Caches from a Low-Power Perspective
When more cache blocks concurrently map to a set than there are ways in a set-associative cache, conflict misses arise. The elbow cache presented in Paper C targets designs requiring a high degree of conflict tolerance. The shared caches often employed in chip multithreaded processors are a typical case where conflict tolerance is desirable. Increased associativity is the standard approach to achieving conflict tolerance. However, associativity comes at a high cost in dynamic power, a scarce resource in processors today. The elbow cache achieves the same conflict tolerance as highly associative caches with a fraction of the power.

The elbow cache extends the idea of a skewed cache by adding a relocation strategy, yielding an average miss ratio comparable to that of an 8-way set-associative cache. In this paper we also quantify the power consumption of a skewed cache organization as well as the proposed elbow cache and compare them to set-associative caches. We find that the elbow cache

consumes up to 48 percent less energy than an 8-way set-associative cache. Since caches often account for a significant fraction of the overall chip power [13][9], using the elbow cache design instead of an 8-way set-associative cache would lead to considerable overall chip power savings.
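The indexing idea behind skewed caches can be sketched as follows (the hash functions here are invented for illustration and are not those of Paper C): each way indexes with a different hash of the address, so two blocks that conflict in one way usually map to different sets in the other.

```python
SETS = 256  # sets per way (assumed)

def index_way0(addr):
    """Conventional set index: low-order bits above the 64-byte offset."""
    return (addr >> 6) % SETS

def index_way1(addr):
    """Skewed index: XOR higher tag bits into the index (made-up hash)."""
    return ((addr >> 6) ^ (addr >> 14)) % SETS

# Two addresses that collide in way 0...
a, b = 0x40040, 0x80040
assert index_way0(a) == index_way0(b)
# ...are, for this pair, separated in way 1:
print(index_way1(a), index_way1(b))  # 17 33
```

The elbow cache builds on this property: on a conflict, a victim block can be relocated to its alternative position in the other way instead of being evicted outright.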

Retrospective
As with the cache organization presented in Paper B, the elbow cache evaluation was made from a first-level cache perspective. However, nothing prevents either of them from being employed in other types of caches, such as TLBs, higher-level caches, trace caches, or directory structures.

We used the miss ratio metric to evaluate our proposals in Paper B and Paper C. The goal was to devise more efficient cache organizations. To what extent a miss ratio reduction implies an overall performance improvement depends on many other factors, such as latency and the ability of the processor to overlap a cache miss with useful execution. Since these factors tend to differ significantly between processor designs and over time, we have chosen not to include execution time approximations in this study.

TMA: A Trap-Based Memory Architecture
The cost of designing large cache-coherent systems from multiple chip multiprocessors is very high. Since errors in coherence protocols can lead to incorrect computations, they are unacceptable; hence a new multiprocessor system design must be extraordinarily well tested and verified. In today's system design, verification time represents a significant part of the overall product development time. Since the task of verification grows quadratically or worse with system complexity, the sophistication level of the coherence protocol is restricted by verification.

In almost all systems shipped today, the cache coherence protocol is implemented in hardware. While hardware solutions have a speed advantage, they have the drawback of lacking flexibility. The best-performing protocol must be verified and ready the day the machine leaves the assembly plant, and the same protocol implementation is used for banking systems and scientific computations. In a software-based solution, such as the one proposed in this paper, a simple protocol can be shipped with the system and more advanced versions can be made available later, as they become verified and tested. Similarly, different protocols can be supplied with systems targeted at different tasks.

In this paper we propose to extend the existing trap mechanism in processors with a small set of hardware extensions to support coherence traps. Based on these extensions, fine-grained software-based coherence can be implemented in a way that is transparent to software. The described

architecture relies on interconnect functionality existing in some commodity interconnects, such as InfiniBand. We show that our proposed trap-based memory architecture (TMA) can on average perform within 1 percent of a hardware-based system.
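A conceptual sketch of trap-based software coherence (heavily simplified; the sentinel-value detection and handler logic below are illustrative assumptions, not the exact TMA design):

```python
# Idea: an invalidated line is marked with a sentinel ("magic") value.
# A load that returns the sentinel raises a coherence trap, and a software
# handler fetches the valid copy from the home node over the interconnect.

MAGIC = 0xDEADBEEF           # sentinel marking coherence-invalid data

memory = {0x100: MAGIC}      # local copy was invalidated by a remote writer
home_node = {0x100: 42}      # valid data lives at the home node

def coherence_trap_handler(addr):
    """Software protocol handler: fetch the valid value remotely."""
    memory[addr] = home_node[addr]
    return memory[addr]

def load(addr):
    value = memory[addr]
    if value == MAGIC:       # hardware would raise a coherence trap here
        value = coherence_trap_handler(addr)
    return value

print(load(0x100))  # 42: resolved by the software protocol, not hardware
```

Because the protocol lives in the handler, it can be upgraded or specialized after the machine ships, which is precisely the flexibility argument made above.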

Retrospective
Two of the most revenue-important commercial applications for server manufacturers are databases and application servers. Both workloads have successfully been parallelized at the application level and do not require cache-coherent shared-memory machines to scale. This trend towards software clustering may significantly decrease the market for large shared-memory machines. If that is the case, the huge costs associated with cache-coherent shared-memory system design would become harder to amortize. The low costs associated with software-based solutions for maintaining coherence may provide an option for building large systems consisting of multiple single- or multi-CMP designs.

Memory System Behavior of Java-Based Middleware
Java middleware, and application servers in particular, is one of the most important workloads for the server processors and systems being designed today. Despite this, little is known about this new workload from an architectural point of view.

In Paper E, we present a detailed characterization of the memory system behavior of two Java middleware benchmarks, ECperf and SPECjbb. Our results show that it is important to isolate the behavior of the middle tier. We find that these workloads have small memory footprints and primary working sets compared with other commercial workloads (e.g., on-line transaction processing) and that a large fraction of the working sets is shared between processors. We observed two key differences between ECperf and SPECjbb. First, ECperf has a larger instruction footprint, resulting in much higher miss rates for intermediate-size instruction caches. Second, SPECjbb's data set size increases linearly as the benchmark scales up, while ECperf's remains roughly constant. This difference leads to opposite conclusions on the utility of moderately sized shared caches in a chip multiprocessor.

Retrospective
This is the first step towards understanding the characteristics of this new kind of workload.

As the workload matures and its performance-dependent factors are better understood, its characteristics are likely to change. Further advances in garbage collection and other JVM technology are also likely to change the memory behavior of the workload.

Exploring Processor Design Options for Java-Based Middleware

In Paper F we characterize the execution of Java middleware from a processor design perspective and compare and contrast it with more well-known workloads, such as online transaction processing (OLTP), static web serving (APACHE), and several SPEC benchmarks.

By idealizing a processor model, we are able to obtain an approximation of the single-thread performance attainable in a conventional out-of-order design. We then take a realistic processor model and idealize various aspects of it, in order to quantify the performance impact of each structure. The performance effect of idealizing branch prediction relative to today's predictors is quantified and isolated for directional branch prediction, branch target prediction, and return address prediction. We also measure the memory-level parallelism (MLP) of each workload. MLP is an increasingly important metric since it indicates how many of the costly memory accesses can be overlapped with each other.
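MLP is commonly defined as the average number of outstanding long-latency misses over the cycles in which at least one miss is outstanding; a small sketch of that computation (the exact definition used here is an assumption stated for illustration, not taken verbatim from Paper F):

```python
def mlp(miss_intervals, horizon):
    """miss_intervals: list of (start_cycle, end_cycle) per miss.
    Returns the average miss count over cycles with >= 1 miss in flight."""
    outstanding = [0] * horizon
    for start, end in miss_intervals:
        for cycle in range(start, end):
            outstanding[cycle] += 1
    busy = [n for n in outstanding if n > 0]
    return sum(busy) / len(busy) if busy else 0.0

# Two fully overlapped misses plus one isolated miss: the overlapped pair
# raises MLP above 1, meaning part of the miss latency is hidden.
intervals = [(0, 100), (0, 100), (200, 300)]
print(mlp(intervals, horizon=400))  # 1.5
```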

With the introduction of chip multiprocessors, the task of processor architects becomes more complicated. In particular, architects must now strike a balance between allocating resources such as die area and power to each processor and fitting more processors on a chip. In Paper F we combine instruction-level parallelism findings with results from Paper E to investigate the optimal trade-off between instruction- and thread-level parallelism on a CMP for the Java middleware workload.

Retrospective
The workload setup of the ECperf benchmark was based on a Solaris 8 installation, which does not use large TLB pages for code. This is likely to have inflated the number of ITLB misses we observed. Even when ignoring the ITLB misses, we observe a substantial trap rate in the commercial applications. This trap rate is likely to have severe performance implications for future processor designs with hundreds or thousands of instructions in flight [4], unless new, less costly trap mechanisms are devised.

VASA: A Simulator Infrastructure with Adjustable Fidelity
The arrival of chip-multiprocessor designs has exacerbated the simulation slowdown problem by increasing the need to simulate multiprocessor designs. In Paper G we present the VASA simulator framework, which offers three different fidelity levels, or modes, where the fastest mode is 260 times faster than the most detailed mode. VASA is an architectural-level simulator that is tightly integrated with the Virtutech Simics full-system simulator. It contains models of caches, store buffers, interconnects, and memory controllers, and is capable of modeling full multi-node systems with complex out-of-order SMT/CMP processors in great detail. The two faster modes use highly simplified processor models. In this paper we show that, for memory system studies targeting the lower levels of the memory system, the more simplified modes can be used and still yield the same conclusions as the most detailed mode.
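The fidelity/speed trade-off can be illustrated with a toy example (the structure is invented for illustration; VASA's actual modes are far more detailed): the same access stream is timed either by a model that tracks resource occupancy or by a cheap fixed-latency approximation.

```python
class DetailedMemory:
    """Detailed mode: models bank occupancy, so back-to-back requests
    queue behind each other and observe longer latencies."""
    def __init__(self, service_time=100):
        self.service_time = service_time
        self.busy_until = 0

    def access(self, now):
        start = max(now, self.busy_until)
        self.busy_until = start + self.service_time
        return self.busy_until - now   # latency including queuing delay

class FastMemory:
    """Fast mode: fixed latency, no contention -- cheaper, less accurate."""
    LATENCY = 100
    def access(self, now):
        return self.LATENCY

detailed, fast = DetailedMemory(), FastMemory()
# Two requests issued in the same cycle: only the detailed model sees queuing.
print([detailed.access(0), detailed.access(0)])  # [100, 200]
print([fast.access(0), fast.access(0)])          # [100, 100]
```

When the study's conclusions do not hinge on the effect being simplified away (here, contention), the fast mode can safely stand in for the detailed one, which is the essence of adjustable fidelity.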

Retrospective
The VASA simulator is the result of an attempt to create a simulator framework that would fulfill the requirements of each author's current, past, and future research projects. It was designed using the authors' experience from various other simulators and research projects. This paper contains some of the lessons learned and insights gained in the design of the VASA simulator framework. VASA was used in both Paper A and Paper D.

Contributions of this Thesis

The main contributions of this thesis are:

• It highlights a trend of decreasing bandwidth per transistor, which is likely to have severe implications for scaling up CMP designs. Furthermore, it identifies characteristics of runahead execution regarding accesses exhibiting spatial locality and presents a method, fine-grained fetching, to exploit them to save bandwidth and cache space.

• The Cache Allocation Tick (CAT) time metric is introduced and the potential of selective cache allocation is quantified. The RASCAL cache organization, targeted at capacity misses, is proposed.

• It presents the first dynamic power evaluation of skewed cache organizations, together with multiple implementation optimizations for skewed caches. It proposes the elbow cache organization, targeted at conflict misses.

• The Trap-Based Memory Architecture (TMA) is proposed, which provides application-transparent fine-grained coherence through software and performs on par with hardware-based solutions.

• It includes the first two papers analyzing the new Java middleware workload from a memory system and processor perspective. It identifies several performance bottlenecks in current designs.

Summary in Swedish

Thanks to advances in the chip manufacturing industry, several processor cores now fit on a single silicon chip. This, together with new rules of the game in silicon technology, which make it increasingly difficult to improve the performance of individual processor cores, has led to the development of chip multiprocessors (CMPs). This thesis is about the consequences this has for the memory system, and it proposes several new designs based on the resource constraints that apply to CMPs. Examples of such constraints are silicon area and the power a chip dissipates. In addition, the thesis contains studies of simulation and workload characterization, which are important tools when designing new processors and computer systems.

To bridge the speed gap between processors and main memory, so-called cache memories are used. Caches are fast but, compared with ordinary main memory, very expensive memories. Since the read and write time of a cache increases with its size, the capacity of caches is limited. Most of today's computers therefore contain several levels of caches, where the memory closest to the processor is the smallest and fastest. One of the cornerstones of cache design is a phenomenon exhibited by the great majority of programs, namely locality of reference. Locality can be divided into temporal and spatial locality. Temporal locality means that recently used data is likely to be used again soon. Spatial locality indicates that data adjacent to recently used data is likely to be used within a short period of time. In this thesis, several assumptions about these well-known concepts are analyzed and questioned, and methods to take advantage of the insights are presented.

When new processors and memory systems are designed, simulation is often used. A simulation model of the new computer is created and runs the intended type of programs in order to estimate how the new computer will perform. Since different programs exhibit different behaviors, it is important to use modern and representative programs when performing these simulations. This thesis contains the first memory system study, as well as the first processor characterization study, of a new type of application that is becoming increasingly important in our everyday lives, namely Java-based middleware. Such middleware forms the basis of many of the Internet-based services, such as ticket booking and record shopping, that

many of us use regularly. We study this new type of program especially from a chip multiprocessor perspective.

Paper-by-paper summaries of the memory system designs described in this thesis are given below:

• The cost of increasing chip bandwidth, i.e., the amount of data that can be transferred to a processor chip per unit of time, is high. In Paper A we show that this cost is likely to increase considerably in the future. We also show that a large part of the data read onto the chip is often not used. At the same time, we have identified a behavior among reads and writes for a new type of processor. In Paper A we show how this behavior can be exploited to reduce the amount of unnecessary data read onto a chip and thereby reduce the bandwidth requirement.

• Paper B describes a selective caching algorithm that avoids devoting cache space to certain data. Which data to avoid is determined by measuring the interval between uses of the data. Data judged to have a low probability of remaining in the cache until its next use is not allowed to occupy cache space. Thereby, the data that is allocated is allowed to stay longer. A new unit of time, Cache Allocation Ticks (CAT), is introduced and used to measure the distance between two uses.

• Paper C proposes an extension of a type of cache design known as skewed caches. The extension is based on the CAT concept from Paper B and on a relocation strategy for cache blocks competing for the same slot in the cache. One of the resources that is limited in a processor is power. This paper contains the first power evaluation of skewed caches, as well as several optimizations for how they can be implemented.

• Paper D presents a new type of architecture for keeping multiple caches coherent. Instead of making changes in the memory system to provide coherence, we propose making small changes in the processor, extending the fault-handling mechanism that exists in today's computers to also handle coherence faults in software. We show that a system based on these extensions can perform on par with a hardware-based system.

References

[1] Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, and Doug Burger. Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA'00), pages 248–259, 2000.

[2] L. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA'00), 2000.

[3] Doug Burger, James R. Goodman, and Alain Kägi. Limited Bandwidth to Affect Processor Design. IEEE Micro, 17(6), 1997.

[4] Adrian Cristal, Daniel Ortega, Josep Llosa, and Mateo Valero. Out-of-Order Commit Processors. In Proceedings of the 10th International Symposium on High Performance Computer Architecture (HPCA '04), page 48, 2004.

[5] Darrell Dunn. Intel Expects 75% Of Its Processors To Be Dual-Core Next Year. InformationWeek, March 2005.

[6] John D. Davis, James Laudon, and Kunle Olukotun. Maximizing CMP Throughput with Mediocre Cores. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT '05), pages 51–62, 2005.

[7] Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, Rebecca L. Stamm, and Dean M. Tullsen. Simultaneous Multithreading: A Platform for Next-Generation Processors. IEEE Micro, 17(5), 1997.

[8] Gordon E. Moore. Cramming More Components onto Integrated Circuits. Electronics Magazine, April 1965.

[9] M. Gowan, L. Biro, and D. Jackson. Power Considerations in the Design of the Alpha 21264 Microprocessor. In 35th Design Automation Conference, 1998.

[10] J. Hirase. Economical Importance of the Maximum Chip Area. In Proceedings of the Seventh Asian Test Symposium, 1998.

[11] A. Kowalczyk, V. Adler, C. Amir, F. Chiu, C. P. Chng, W. J. De Lange, Y. Ge, S. Ghosh, T. C. Hoang, B. Huang, S. Kant, Y. S. Kao, C. Khieu, S. Kumar, L. Lee, A. Liebermensch, X. Liu, N. G. Malur, A. A. Martin, H. Ngo, S. H. Oh, I. Orginos, L. Shih, B. Sur, M. Tremblay, A. Tzeng, D. Vo, S. Zambere, and J. Zong. The First MAJC Microprocessor: A Dual CPU System-on-a-Chip. IEEE Journal of Solid-State Circuits, 36(11):1609–1616, 2001.

[12] M. Tremblay, J. Chan, S. Chaudhry, A. W. Conigliam, and S. S. Tse. The MAJC Architecture: A Synthesis of Parallelism and Scalability. IEEE Micro, 20(6):12–25, 2000.

[13] S. Manne, A. Klauser, and D. Grunwald. Pipeline Gating: Speculation Control for Energy Reduction. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA'98), 1998.

[14] Trevor Mudge. Power: A First-Class Architectural Design Constraint. IEEE Computer, 34(4), April 2001.

[15] Siva Narendra, Vivek De, Shekhar Borkar, Dimitri Antoniadis, and Anantha Chandrakasan. Full-Chip Sub-threshold Leakage Power Prediction Model for Sub-0.18 Micron CMOS. In Proceedings of the 2002 International Symposium on Low Power Electronics and Design (ISLPED '02), pages 19–23, 2002.

[16] Kunle Olukotun, Basem A. Nayfeh, L. Hammond, K. Wilson, and K.-Y. Chang. The Case for a Single-Chip Multiprocessor. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), 1996.

[17] Ramon Canal, Joan-Manuel Parcerisa, and Antonio González. Dynamic Cluster Assignment Mechanisms. In Proceedings of the Sixth International Symposium on High Performance Computer Architecture (HPCA-6), Toulouse, France, January 2000.

[18] Semiconductor Industry Association (SIA). International Technology Roadmap for Semiconductors, 1999.

[19] Semiconductor Industry Association (SIA). International Technology Roadmap for Semiconductors, 2004 Update.

[20] Lawrence Spracklen and Santosh G. Abraham. Chip Multithreading: Opportunities and Challenges. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA '05), 2005.

[21] Wm. A. Wulf and S. A. McKee. Hitting the Memory Wall: Implications of the Obvious. Computer Architecture News, 23(1):20–24, March 1995.
