

Published by the IEEE Computer Society
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314

IEEE Computer Society Order Number P3014
ISBN 0-7695-3014-1
ISSN 1550-6533

19th International Symposium on Computer Architecture and High Performance Computing

Copyright © 2007 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.

Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 133, Piscataway, NJ 08855-1331. The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors’ opinions and, in the interests of timely dissemination, are published as presented and without change. Their inclusion in this publication does not necessarily constitute endorsement by the editors, the IEEE Computer Society, or the Institute of Electrical and Electronics Engineers, Inc.

IEEE Computer Society Order Number P3014

ISBN 0-7695-3014-1 ISBN 978-0-7695-3014-7

ISSN 1550-6533

Additional copies may be ordered from:

IEEE Computer Society Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
Tel: +1 800 272 6657
Fax: +1 714 821 4641
http://computer.org/cspress
[email protected]

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: +1 732 981 0060
Fax: +1 732 981 9667
http://shop.ieee.org/store/
[email protected]

IEEE Computer Society Asia/Pacific Office
Watanabe Bldg., 1-4-2 Minami-Aoyama
Minato-ku, Tokyo 107-0062 JAPAN
Tel: +81 3 3408 3118
Fax: +81 3 3408 3553
[email protected]

Individual paper REPRINTS may be ordered at: <[email protected]>

Editorial production by Silvia Ceballos

Cover art production by Joseph Daigle/Studio Productions

Printed in the United States of America by The Printing House

IEEE Computer Society

Conference Publishing Services (CPS) http://www.computer.org/cps

19th International Symposium on Computer Architecture and High Performance Computing

SBAC-PAD

Message from the General Chairs
Message from the Program Committee Chairs
Conference Organizers
Program Committee
Reviewers
Brazilian Computer Society (SBC)

Session 1: Applications I

Multi-level Parallelism in the Computational Modeling of the Heart
Carolina Xavier, Rafael Sachetto, Vinicius Vieira, Rodrigo Weber dos Santos, and Wagner Meira Jr.

Computational Characteristics of Production Seismic Migration and its Performance on Novel Processor Architectures
Jairo Panetta, Paulo R. P. de Souza Filho, Carlos A. da Cunha Filho, Fernando M. Roxo da Motta, Silvio S. Pinheiro, Ivan Pedrosa Junior, Andre L. R. Rosa, Luiz R. Monnerat, Leandro T. Carneiro, and Carlos H. B. de Albrecht

Voice Command Recognition with Dynamic Time Warping (DTW) using Graphics Processing Units (GPU) with Compute Unified Device Architecture (CUDA)
Gustavo Poli, Alexandre L. M. Levada, João F. Mari, and José Hiroki Saito

Exploring Novel Parallelization Technologies for 3-D Imaging Applications
Diego Rivera, Dana Schaa, Micha Moffie, and David Kaeli

Session 2: Microarchitecture

Low-cost Techniques for Reducing Branch Context Pollution in a Soft Realtime Embedded Multithreaded Processor
Emre Özer, Alastair Reid and Stuart Biles

Self-Imposed Temporal Redundancy: An Efficient Technique to Enhance the Reliability of Pipelined Functional Units
Elias Mizan, Tileli Amimeur, and Margarida F. Jacome

Predicting Loop Termination to Boost Speculative Thread-Level Parallelism in Embedded Applications
Mafijul Md. Islam

Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors
Rafael Ubal, Julio Sahuquillo, Salvador Petit, and Pedro López

Session 3: Applications II

Performance Improvement of the Parallel Lattice Boltzmann Method through Blocked Data Distributions
Claudio Schepke and Nicolas Maillard

A Scalable Parallel Deduplication Algorithm
Walter Santos, Thiago Teixeira, Carla Machado, Wagner Meira Jr., Altigran S. Da Silva, Renato Ferreira and Dorgival Guedes

A Multigrid-Schwarz Method for the Solution of Hydrodynamics and Heat Transfer Problems in Unstructured Meshes
Guilherme Galante, Rogério L. Rizzi, and Tiarajú A. Diverio

Session 4: Benchmarking, Performance Measurements and Analysis

Performance Evaluation of the Dual-Core Based SGI Altix 4700
Rod Fatoohi

Impacts of Multiprocessor Configurations on Workloads in Bioinformatics
Youfeng Wu, Mauricio Breternitz Jr., and Victor Ying

Session 5: Application-Specific Architectures

Efficient Hardware for Modular Exponentiation Using the Sliding-Window Method with Variable-Length Partitioning
Nadia Nedjah and Luiza de Macedo Mourelle

Optimized Math Functions for a Fixed-Point DSP Architecture
Karlo G. Lenzi and Osamu Saotome

Session 6: Grid Computing

A Component-Oriented Support for Hierarchical MPI Programming on Multi-cluster Grid Environments
Elton Nicoletti Mathias, Vincent Cave, Francoise Baude, and Nicolas Maillard

A Selector of Grid Resources based on the Semantic Integration of Multiple Ontologies
Alexandre P.C Silva and Mario A.R. Dantas

A Novel Algorithm for Indirect Reputation-Based Grid Resource Management
Javier Echaiz, Jorge R. Ardenghi, and Guillermo R. Simari

Session 7: Cache and Memory Architectures

Register File Energy Optimization for Snooping Based Clustered VLIW Architectures
Rahul Nagpal and Y. N. Srikant

Queue Register File Optimization Algorithm for QueueCore Processor
Arquimedes Canedo, Ben Abderazek, and Masahiro Sowa

An Intelligent Mechanism to Explore a Two-Level Cache Hierarchy Considering Energy Consumption and Time Performance
Abel G. Silva-Filho, Carmelo J. A. Bastos-Filho, Ricardo M.F. Lima, Davi M.A. Falcão, Filipe R. Cordeiro, and Marília P. Lima

A Code Compression Method to Cope with Security Hardware Overheads
Eduardo Wanderley Netto, Romain Vaslin, Guy Gogniat, and Jean-Philippe Diguet

Session 8: Interconnection Networks, Routing, and Communication

Architectural Breakdown of End-to-End Latency in a TCP/IP Network
Steen Larsen, Parthasarathy Sarangam, and Ram Huggahalli

Performance Analysis and Linear Optimization Modeling of All-to-all Collective Communication Algorithms
Hyacinthe N. Mamadou, Guilherme de Melo B. Domingues, Takeshi Nanri, and Kazuaki Murakami

Design of a Feasible On-Chip Interconnection Network for a Chip Multiprocessor (CMP)
Seung Eun Lee, Jun Ho Bahn, and Nader Bagherzadeh

Session 9: Tools for Parallel and Distributed Programming

Node Level Primitives for Parallel Exact Inference
Yinglong Xia and Viktor Prasanna

Fault-Tolerance in Filter-Labeled-Stream Applications
Bruno Coutinho, Dorgival Guedes, Wagner Meira Jr., and Renato A. Ferreira

High-Level Service Connectors for Component-Based High Performance Computing
Francisco H. de Carvalho-Junior, Ricardo C. Corrêa, Gisele A. Araújo, Jefferson C. Silva, and Rafael D. Lins

Session 10: Load Balancing and Scheduling

On-Line Scheduling of MPI-2 Programs with Hierarchical Work Stealing
Guilherme P. Pezzi, Márcia C. Cera, Elton Mathias, Nicolas Maillard, and Philippe O. A. Navaux

Exigency-Based Real-Time Scheduling Policy to Provide Absolute QoS for Web Services
Lucas S. Casagrande, Rodrigo F. de Mello, Ricardo Bertagna, José A. Andrade Filho, and Francisco J. Monaco

DTA-C: A Decoupled Multi-threaded Architecture for CMP Systems
Roberto Giorgi, Zdravko Popovic, and Nikola Puzovic

Automatic Constraint Partitioning to Speed up CLP Execution
Marluce R. Pereira, Patrícia K. Vargas, Maria Clícia S. de Castro, Felipe M. G. França, and Inês de Castro Dutra

Author Index

Message from the Program Committee Chairs

SBAC-PAD

On behalf of the Program Committee, we are pleased to welcome you to the 19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2007). SBAC-PAD is an annual international conference series, the first of which was held 20 years ago, that has traditionally presented the state of the art, latest trends and new developments in computer architecture design, parallel and distributed technologies and high performance applications.

We would first like to thank the Brazilian Computer Society, the IEEE Computer Society, the Technical Committees on Computer Architecture (TCCA) and Scalable Computing (TCSC), and the International Federation for Information Processing (IFIP) for their continued support and sponsorship of SBAC-PAD 2007. Thanks also to the 66 members of the Program Committee, all recognized experts in their fields, from around the world, who volunteered to participate in the selection of papers. Their help, together with that of the Organizing Committee, in initially publicizing the symposium had a significant impact on the number and quality of paper submissions.

Authors were invited to submit manuscripts that presented original unpublished research in all areas of computer architecture and high performance computing. Work focusing on applications or emerging technologies was especially welcome. Even with the plethora of coinciding conferences, SBAC-PAD 2007 received 107 submissions from industry and academia located in 24 countries, a reflection of its current international standing. We would thus like to thank the authors for their contributions and, of course, the 142 reviewers for the time and effort they took to diligently review the submissions.

After a rigorous peer-review process (most papers had five reviews, and every paper had at least three), we chose 32 high quality papers (less than a third of the submissions) on research work from institutions in 10 countries (15 full papers from Brazil, 7 from the USA, 2 each from Japan and France, and one each from Argentina, India, Italy, Spain, Sweden, and the UK). In addition to these regular papers, the scientific and technical program includes several keynote presentations from experts who have kindly agreed to share their knowledge and wisdom on a variety of state of the art issues. To further stimulate discussion among attendees, six workshops have been organized, covering a range of “hot” topics. Finally, a number of our industrial sponsors will also be presenting some of their insights on leading edge technologies. Thanks to all of the contributors and participants who together have created an excellent scientific and technical program.

A number of people have endeavored to help organize this outstanding program and to make SBAC-PAD 2007 a resounding success. We would like to offer our sincerest thanks to them all; in particular, thanks must go to Professor Philippe Navaux, this year’s General Chair, as well as the Steering Committee for their guidance and support, and to Lucas Schnorr for his substantial assistance with the development and configuration of the professional conference web pages.

Once again, welcome to Gramado and to SBAC-PAD 2007! We are sure that you will enjoy both the scientific and the social program of the conference. We look forward to the many stimulating discussions and presentations.

SBAC-PAD 2007 Co-chairs
Jean-Luc Gaudiot
Vinod Rebello


Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors

R. Ubal, J. Sahuquillo, S. Petit and P. Lopez
Universidad Politecnica de Valencia
Camino de Vera s/n, 46021 Valencia, [email protected]

Abstract

Current microprocessors are based on complex designs that integrate different components on a single chip, such as hardware threads, processor cores, memory hierarchy or interconnection networks. The constant need to evaluate new designs for each of these components motivates the development of tools that simulate the system working as a whole. In this paper, we present the Multi2Sim simulation framework, which models the major components of upcoming systems and is intended to overcome the limitations of existing simulators. A set of simulation examples is also included for illustrative purposes.

1 Introduction

The evolution of microprocessors, mainly enabled by technology advances, has led to complex designs that combine multiple physical processing units in a single chip. These designs give the operating system (OS) the view of having multiple processors, so different software processes can be scheduled at the same time.

This processor model consists of three major components: the microprocessor core, the cache hierarchy, and the interconnection network. A design improvement in any of these components can result in a performance gain for the whole system. Therefore, current processor architecture trends offer researchers many opportunities to investigate novel microarchitectural proposals. Some design issues concerning these components are outlined below.

Concerning processor cores, deep and wide pipelines have been designed, aimed at exploiting the high amount of instruction level parallelism (ILP) present in current workloads. On the other hand, thread level parallelism (TLP) makes it possible to exploit additional sources of independent instructions to increase the utilization of processor resources. This idea, together with the overcoming of hardware constraints, resulted in chip multiprocessors (CMPs), which include several cores in a single chip [1].

With respect to the memory hierarchy, its design is a major concern in current and upcoming microprocessors, since long memory latencies frequently act as a performance bottleneck. Current on-chip parallel processing models produce new cache access patterns and offer the possibility of either replicating or sharing caches among processing elements. This raises the need to evaluate trade-offs between the memory hierarchy configuration and the structure of processor cores and threads.

Finally, interconnection networks (or interconnects) serve as the communication medium for processor components (mainly processor cores). In an environment where caches from different processors share memory blocks, the interconnect is in charge of transmitting the coherence messages generated by the cache controllers. Research in this field tries to increase network performance by focusing on new topologies, switching and flow control mechanisms, routing algorithms or fault tolerance techniques.

In this paper we present Multi2Sim, which integrates processor cores, memory hierarchy and interconnection network in a tool that enables their evaluation. The rest of this paper is organized as follows. Section 2 presents an overview of existing processor simulators. Section 3 describes the Multi2Sim structure, while Section 4 discusses the integrated features to support multithreading and multicore simulation. Examples including simulation results are shown in Section 5. Finally, Section 6 presents some concluding remarks.

2 Related Work

Multiple simulation environments aimed at evaluating computer architecture proposals have been developed. The most widely used simulator in recent years has been SimpleScalar [2], which models an out-of-order superscalar processor. Many extensions have been applied to SimpleScalar to model certain aspects of superscalar processors more accurately. For example, the HotLeakage simulator [3] quantifies leakage energy consumption.

19th International Symposium on Computer Architecture and High Performance Computing
1550-6533/07 $25.00 © 2007 IEEE
DOI 10.1109/SBAC-PAD.2007.17

SimpleScalar is quite difficult to extend to model new parallel microarchitectures without significantly changing its structure. In spite of this, various SimpleScalar extensions to support multithreading have been implemented, e.g. SSMT [4], M-Sim [5], or SMTSim [6], but they are limited to executing a set of sequential workloads and implement a fixed resource sharing strategy among threads.

Multithread and multicore extensions have also been applied to the Turandot simulator [7] [8], which models a PowerPC architecture and has also been used for power measurement (PowerTimer [9]). Turandot extensions to parallel microarchitectures are cited in the literature (e.g., [10]) but are not publicly available.

Both SimpleScalar and Turandot are application-only tools, which directly simulate the behaviour of an application. Such tools have the advantage of isolating the workload execution, so statistics are not affected by the simulation of additional software. The tool proposed in this paper can also be classified as an application-only simulator.

In contrast to the application-only simulators, a set of so-called full-system simulators is available. Such tools are able to boot an unmodified operating system and run applications on top of it. Although this model provides higher simulation power, it involves a huge computational load and sometimes unnecessary simulation accuracy.

Simics [11] is an example of a generic full-system simulator, commonly used for multiprocessor systems simulation, but unfortunately not freely available. A variety of Simics-derived tools have been implemented for specific research purposes in this area. This is the case of GEMS [12], which introduces a timing simulation module to model a complete processor pipeline and a memory hierarchy supporting cache coherence. However, GEMS offers little flexibility for modelling multithreaded designs and integrates no interconnection network model.

An important feature included in some processor simulators is the timing-first approach, provided by GEMS and adopted in Multi2Sim. In such a scheme, a timing module traces the state of the processor pipeline while instructions traverse it, possibly in a speculative state. Then, a functional module is called to actually execute the instructions, so the correct execution paths are always guaranteed by a previously developed robust simulator. The timing-first approach confers efficiency, robustness, and the possibility of performing simulations at different levels of detail. Our proposal adopts timing-first simulation with a functional support that, unlike GEMS, need not simulate a whole operating system, but is still capable of executing parallel workloads with dynamic thread creation.

The last cited simulator is M5 [13], which provides support for out-of-order SMT-capable CPUs, multiprocessors and cache coherency, and runs in both full-system and application-only modes. Its limitations lie once again in the low flexibility of multithreaded pipeline designs.

3 Basic simulator description

Multi2Sim [14] has been developed integrating some significant characteristics of popular simulators, such as separate functional and timing simulation, SMT and multiprocessor support and cache coherence. Multi2Sim is an application-only tool intended to simulate final MIPS32 executable files. With a MIPS32 cross-compiler (or a MIPS32 machine) one can compile one's own program sources and test them under Multi2Sim. This section deals with the process of starting and running an application in a cross-platform environment, and briefly describes the three implemented simulation techniques (functional, detailed and event-driven simulation).

3.1 Program Loading

Program loading is the process in which an executable file is mapped into different virtual memory regions of a new software context, and its register file and stack are initialized to start execution. In a real machine, the operating system is in charge of these actions, but an application-only tool should manage program loading during its initialization.

Executable File Loading. The executable files output by gcc follow the ELF (Executable and Linkable Format) specification. An ELF file is made up of a header and a set of sections. Some Linux distributions include the library libbfd, which provides types and functions to list the sections of an ELF file and track their main attributes (starting address, size, flags and content). When the flags of an ELF section indicate that it is loadable, its contents are copied into memory at the corresponding starting address.
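The section-copy step can be sketched as follows. This is only an illustration in Python and not Multi2Sim's code: the `Section` record, the `FLAG_LOADABLE` bit and the dictionary used as a sparse byte-addressed memory are assumptions made for the example, and no real libbfd call is used.

```python
from dataclasses import dataclass

FLAG_LOADABLE = 0x1  # hypothetical flag bit meaning "copy this section to memory"


@dataclass
class Section:
    name: str
    addr: int       # starting virtual address
    flags: int
    content: bytes


def load_sections(sections, memory):
    """Copy every loadable section into a sparse byte-addressed memory."""
    for sec in sections:
        if not sec.flags & FLAG_LOADABLE:
            continue  # non-loadable sections never reach the memory image
        for offset, byte in enumerate(sec.content):
            memory[sec.addr + offset] = byte
    return memory


sections = [
    Section(".text", 0x00400000, FLAG_LOADABLE, bytes([0x01, 0x2A, 0x40, 0x20])),
    Section(".comment", 0x0, 0, b"GCC"),
]
memory = load_sections(sections, {})
```

Only the four bytes of the hypothetical `.text` section end up in `memory`; the `.comment` section is skipped because its flags do not mark it as loadable.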

Program Stack. The next step of the program loading process is to initialize the process stack. The aim of the program stack is to store function local variables and parameters. During program execution, the stack pointer ($sp register) is managed by the program's own code. However, when the program starts, it expects some data in the stack, namely the program arguments and environment variables, which must be placed there by the program loader.
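A minimal sketch of such a stack initialization is given below. The exact layout is an assumption (a Linux-like convention with the strings at the top and, below them, argc, the argv pointers, a null word, the environment pointers and a final null word), as is the little-endian byte order; the paper does not specify either, and the function names are hypothetical.

```python
import struct


def build_initial_stack(stack_top, argv, envp):
    """Return (sp, memory) for a downward-growing stack holding args and env.

    Assumed layout: strings first (highest addresses), then the word array
    [argc, argv pointers..., 0, envp pointers..., 0], with sp pointing at argc.
    """
    memory = {}
    addr = stack_top

    def push_string(text):
        nonlocal addr
        data = text.encode() + b"\0"
        addr -= len(data)
        for i, byte in enumerate(data):
            memory[addr + i] = byte
        return addr

    arg_ptrs = [push_string(a) for a in argv]
    env_ptrs = [push_string(e) for e in envp]

    words = [len(argv)] + arg_ptrs + [0] + env_ptrs + [0]
    addr &= ~0x3                 # align down to a 4-byte word boundary
    addr -= 4 * len(words)
    sp = addr
    for i, word in enumerate(words):
        # Little-endian 32-bit words are an assumption of this sketch.
        for j, byte in enumerate(struct.pack("<I", word)):
            memory[sp + 4 * i + j] = byte
    return sp, memory


def read_word(memory, addr):
    """Read one little-endian 32-bit word from the sparse memory."""
    return struct.unpack("<I", bytes(memory[addr + i] for i in range(4)))[0]


sp, memory = build_initial_stack(0x7FFF0000, ["prog", "-n"], ["HOME=/root"])
argc = read_word(memory, sp)
argv0 = read_word(memory, sp + 4)
```

The returned `sp` is the value the loader would later write into the $sp register, which is why the register file is initialized after the stack.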

Register File. The last step is the register file initialization. This includes the $sp register, which has been progressively updated during the stack initialization, and the PC and NPC registers. The initial value of the PC register is specified in the ELF header of the executable file as the program entry point. The NPC register is not explicitly defined in the MIPS32 architecture, but it is used internally by the simulator to handle the branch delay slot.
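One common way to handle a branch delay slot with a PC/NPC pair is sketched below. The paper does not describe how Multi2Sim updates NPC internally, so the update rule, the toy instruction encoding (`("op",)` and `("branch", target)`) and the function name are assumptions for illustration only.

```python
def run(program, entry, steps):
    """Trace executed addresses using a PC/NPC register pair.

    `program` maps an address to ("op",) or ("branch", target). A taken
    branch only redirects NPC, so the instruction following the branch
    (the delay slot) still executes before the branch target.
    """
    pc, npc = entry, entry + 4
    trace = []
    for _ in range(steps):
        inst = program[pc]
        trace.append(pc)
        next_npc = npc + 4              # default: sequential flow
        if inst[0] == "branch":
            next_npc = inst[1]          # takes effect after the delay slot
        pc, npc = npc, next_npc
    return trace


program = {
    0x00: ("op",),
    0x04: ("branch", 0x20),
    0x08: ("op",),      # delay slot: executed even though the branch is taken
    0x0C: ("op",),      # skipped
    0x20: ("op",),
    0x24: ("op",),
}
trace = run(program, 0x00, 4)
```

In this trace the instruction at 0x08 runs after the branch at 0x04 and before the target at 0x20, while 0x0C is never executed.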

3.2 Simulation Model

Multi2Sim uses three different simulation models, embodied in different modules: a functional simulation engine, a detailed simulator and an event-driven module (the latter two perform the timing simulation). To describe them, the term context will be used hereafter to denote a software entity, defined by the status of a virtual memory image and a logical register file. In contrast, the term thread will refer to a processor hardware entity comprising a physical register file, a set of physical memory pages, a set of entries in the pipeline queues, etc. The three main simulation techniques are described next.

Functional Simulation, also called the simulator kernel. It is built as an autonomous library and provides an interface to the rest of the simulator. This engine is unaware of hardware threads, and provides functions to create/destroy software contexts, perform program loading, enumerate existing contexts, consult their status, execute machine instructions and handle speculative execution. The supported machine instructions follow the MIPS32 specification [15] [16]. This choice was mainly motivated by its fixed instruction size and formats, which enable simple instruction decoding.
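To illustrate why fixed-size instructions make decoding simple, the sketch below extracts the standard MIPS32 fields from a 32-bit word with shifts and masks. The function name and the dictionary it returns are choices made for this example and are not taken from the Multi2Sim sources.

```python
def decode(word):
    """Split a 32-bit MIPS32 instruction word into its fixed-position fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31..26
        "rs": (word >> 21) & 0x1F,      # bits 25..21
        "rt": (word >> 16) & 0x1F,      # bits 20..16
        "rd": (word >> 11) & 0x1F,      # bits 15..11 (R-type)
        "shamt": (word >> 6) & 0x1F,    # bits 10..6  (R-type)
        "funct": word & 0x3F,           # bits 5..0   (R-type)
        "imm": word & 0xFFFF,           # bits 15..0  (I-type)
    }


# 0x012A4020 encodes "add $t0, $t1, $t2" ($t0 = 8, $t1 = 9, $t2 = 10).
fields = decode(0x012A4020)
```

Because every field sits at the same bit position in every instruction of a given format, no variable-length parsing is needed before the opcode is known.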

An important feature of the simulation kernel, inherited from SimpleScalar [2], is the checkpointing capability of the implemented memory module and register file, intended for an external module that needs to implement speculative execution. When a wrong execution path starts, both the register file and memory status are saved, and they are reloaded when the misprediction is detected.
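The save-and-reload idea can be sketched as below. This toy `Context` class copies the whole state on each checkpoint, which is a simplification chosen for clarity; the class, its method names and the full-copy strategy are assumptions and say nothing about how Multi2Sim or SimpleScalar actually store checkpoints.

```python
class Context:
    """Toy software context: a logical register file plus a sparse memory."""

    def __init__(self):
        self.regs = [0] * 32
        self.memory = {}
        self._checkpoint = None

    def start_speculation(self):
        # Save the state when a (possibly wrong) execution path begins.
        self._checkpoint = (list(self.regs), dict(self.memory))

    def recover(self):
        # Reload the saved state once the misprediction is detected.
        self.regs, self.memory = self._checkpoint
        self._checkpoint = None


ctx = Context()
ctx.regs[8] = 5
ctx.memory[0x1000] = 1

ctx.start_speculation()
ctx.regs[8] = 99          # wrong-path register write
ctx.memory[0x1000] = 2    # wrong-path store
ctx.memory[0x2000] = 3
ctx.recover()
```

After `recover()`, both the register and the memory contents match the state saved when speculation started, so wrong-path stores leave no trace.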

Detailed Simulation. The Multi2Sim detailed simulator uses the functional engine to perform a timing-first [12] simulation: in each cycle, a sequence of calls to the kernel updates the state of existing contexts. The detailed simulator analyzes the nature of the recently executed machine instructions and accounts for the operation latencies incurred by hardware structures.

The main simulated hardware consists of pipeline structures (stage resources, instruction queue, load-store queue, reorder buffer...), a branch predictor (modelling a combined bimodal-gshare predictor), cache memories (with variable size, associativity and replacement policy), a memory management unit, and pipelined functional units of configurable latency.

Event-Driven Simulation. In a scheme where functional and detailed simulation are independent, the implementation of the machine instructions' behaviour can be centralized in a single file (functional simulation), increasing the simulator modularity. Accordingly, function calls that activate hardware components (detailed simulation) have an interface that returns the latency required to complete their access.

Nevertheless, this latency is not a deterministic value in some situations, so it cannot be calculated when the function call is performed. Instead, it must be simulated cycle by cycle. This is the case of interconnects and caches, where an access can result in a message transfer whose delay cannot be computed a priori, justifying the need for an independent event-driven simulation engine.
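A generic event-driven engine of this kind can be sketched with a priority queue, as below. The `EventQueue` class, the callback names and the delays of 2 and 7 cycles are hypothetical; the sketch only shows why a completion time that depends on later events is scheduled rather than returned as a fixed latency.

```python
import heapq


class EventQueue:
    """Minimal discrete-event engine: callbacks fire in timestamp order."""

    def __init__(self):
        self.now = 0
        self._events = []
        self._seq = 0  # tie-breaker so equal-time events keep insertion order

    def schedule(self, delay, callback):
        heapq.heappush(self._events, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._events:
            self.now, _, callback = heapq.heappop(self._events)
            callback()


queue = EventQueue()
log = []


def cache_miss():
    log.append(("miss", queue.now))
    # The reply time depends on the (hypothetical) network delay, so it is
    # scheduled as a further event instead of being returned as a latency.
    queue.schedule(7, reply)


def reply():
    log.append(("reply", queue.now))


queue.schedule(2, cache_miss)
queue.run()
```

The caller of the cache access never learns a latency up front; it is notified when the `reply` event fires, at a time determined by whatever events occurred in between.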

4 Support for Multithreaded and Multicore Architectures

This section describes the basic simulator features that provide support for multithreaded and multicore processor modelling. They can be classified into two main groups: those that affect the functional simulation engine (enabling the execution of parallel workloads) and those that involve the detailed simulation module (enabling pipelines with several hardware threads on the one hand, and systems with several cores on the other).

4.1 Functional simulation: parallel workloads support

The functional engine has been extended to support the execution of parallel workloads. In this context, parallel workloads can be seen as tasks that dynamically create child processes at runtime, carrying out communication and synchronization operations. The supported parallel programming model is the shared memory model specified by the widely used POSIX Threads library (pthread) [17].

In a multithreaded environment, some studies suggest using a set of sequential workloads [18]. The reason is that multiple resources are shared among hardware threads, and processor throughput can be evaluated more accurately when no contention appears due to communication between contexts. In contrast, multicore processor pipelines are fully replicated, and an important contention point is the interconnection network. The execution of multiple sequential workloads exhibits only some interconnect activity in the form of L2-L1 cache transfers, but no coherence actions can occur between processes having disjoint memory maps. Thus, in order to evaluate multicore processors, it makes sense to support and run parallel workloads with shared memory locations, whose distributed access can stress the interconnection network.


Actual parallel workloads require special hardware support (machine instructions), as well as low-level software support (system calls) that enables thread spawning, synchronization and termination. Each of these issues is described below, together with a brief description of the POSIX threads management:

Instruction set support. When the processor hardware supports concurrent thread execution, the parallel programming requirement that directly affects its architecture is the existence of critical sections, which cannot be executed simultaneously by more than one thread. CMPs or multithreaded processors must stall the activity of a hardware thread when it tries to enter a critical section occupied by another thread.

The MIPS32 approach implements the mutual exclusion mechanism by means of two machine instructions (LL and SC), defining the concept of an RMW (read-modify-write) sequence [16]. An RMW sequence is a set of instructions, enclosed by an LL-SC pair, that runs atomically on a multiprocessor system. The cited machine instructions do not enforce an RMW sequence, but the output value of SC reports the success or failure of the RMW sequence.
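The success/failure behaviour of SC can be illustrated with the toy model below. The `Memory` class and its per-thread link table are assumptions made for the example, and the model is deliberately simplified: for instance, it only breaks links on a successful SC, not on ordinary stores.

```python
class Memory:
    """Toy shared memory with load-linked / store-conditional semantics."""

    def __init__(self):
        self.data = {}
        self.links = {}  # thread id -> address currently linked by LL

    def ll(self, tid, addr):
        self.links[tid] = addr
        return self.data.get(addr, 0)

    def sc(self, tid, addr, value):
        if self.links.get(tid) != addr:
            return 0                      # link was broken: report failure
        self.data[addr] = value
        # A successful store breaks every link to this address.
        self.links = {t: a for t, a in self.links.items() if a != addr}
        return 1


mem = Memory()
COUNTER = 0x100  # hypothetical shared address

old0 = mem.ll(0, COUNTER)             # thread 0 starts an RMW sequence
old1 = mem.ll(1, COUNTER)             # thread 1 starts one too
ok1 = mem.sc(1, COUNTER, old1 + 1)    # thread 1 finishes first and succeeds
ok0 = mem.sc(0, COUNTER, old0 + 1)    # thread 0's link is gone: SC fails
retry = mem.sc(0, COUNTER, mem.ll(0, COUNTER) + 1)  # thread 0 retries
```

The failed SC does not modify memory; software is expected to test its result and repeat the whole LL-SC sequence, which is what the final retry line does.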

Operating system support. Tracing the execution of a parallel workload, the operating system support required by pthread consists of system calls i) to spawn/destroy a thread (clone, exit_group), ii) to wait for child threads (waitpid), iii) to communicate and synchronize threads with system pipes (pipe, read, write, poll) and iv) to wake up suspended threads using system signals (sigaction, sigprocmask, sigsuspend, kill).

POSIX Threads parallelism management. Applications programmed with pthread can be simulated without changes using Multi2Sim. This library introduces user code that handles parallelism by means of the described subset of machine instructions and system calls. However, the fact that thread management code is mingled with application code must be taken into account, as it constitutes a certain overhead which could affect final results. Further details on this consideration can be found in [14].

4.2 Detailed simulation: Multithreading support

Multi2Sim supports a set of parameters that specify how stages are organized in a multithreaded design. Stages can be shared among threads or private per thread [19] (except the execute stage, which is shared by definition of multithreading). Moreover, when a stage is shared, there must be an algorithm that schedules a thread on the stage every cycle. The modelled pipeline is divided into five stages, described below.

Figure 1. Examples of pipeline organizations

The fetch stage takes instructions from the L1 instruction cache and places them into an IFQ (instruction fetch queue). The decode/rename stage takes instructions from an IFQ, decodes them, renames their registers and assigns them a slot in the ROB (reorder buffer) and IQ (instruction queue). Then, the issue stage consumes instructions from the IQ and sends them to the corresponding functional unit. During the execution stage, the functional units operate and write their results back into the register file. Finally, the commit stage retires instructions from the ROB in program order. This architecture is analogous to the one modelled by the SimpleScalar tool set [2], but uses a ROB, an IQ and a physical register file, instead of the RUU (register update unit).
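The flow of instructions through the five stages can be sketched for a single thread as follows. This is a simplified illustration, not the simulator's code: the function `simulate` and its arguments are hypothetical, register renaming and data dependences are not modelled, functional units are unlimited, and every latency is assumed to be at least one cycle.

```python
from collections import deque


def simulate(program, width=2):
    """Run (name, latency) pairs through a toy five-stage pipeline.

    Returns (commit order, completion order, cycles). Names must be unique
    and latencies must be at least 1.
    """
    pending = deque(program)   # instructions not yet fetched
    ifq = deque()              # instruction fetch queue
    rob = deque()              # reorder buffer, kept in program order
    iq = deque()               # instruction queue, waiting for issue
    executing = []             # [name, remaining cycles] per busy unit
    finished = set()           # names whose result has been written back
    committed, completed = [], []
    cycles = 0
    while pending or ifq or rob:
        cycles += 1
        # Stages run back to front so an instruction moves one stage per cycle.
        # Commit: retire finished instructions from the ROB head only.
        for _ in range(width):
            if rob and rob[0] in finished:
                committed.append(rob.popleft())
        # Execute: advance the functional units and write back results.
        for entry in executing:
            entry[1] -= 1
        for entry in [e for e in executing if e[1] == 0]:
            executing.remove(entry)
            finished.add(entry[0])
            completed.append(entry[0])
        # Issue: send instructions from the IQ to a functional unit.
        for _ in range(min(width, len(iq))):
            executing.append(list(iq.popleft()))
        # Decode/rename: allocate a ROB slot and an IQ slot.
        for _ in range(min(width, len(ifq))):
            name, latency = ifq.popleft()
            rob.append(name)
            iq.append((name, latency))
        # Fetch: bring instructions into the IFQ.
        for _ in range(min(width, len(pending))):
            ifq.append(pending.popleft())
    return committed, completed, cycles


commit_order, completion_order, cycles = simulate(
    [("load", 3), ("add", 1), ("mul", 2)])
```

With these made-up latencies, `add` finishes before `load`, yet the commit order stays `load`, `add`, `mul`, because the commit stage only retires the instruction at the head of the ROB.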

Figure 1 illustrates two possible pipeline organizations. In a) all stages are shared among threads, while in b) all stages (except execute) are replicated once per supported hardware thread. Multi2Sim allows the evaluation of different stage sharing strategies, as well as different algorithms that schedule stage resources in each cycle. Depending on the stage sharing and thread selection policies, a multithreaded processor can be classified as fine-grain (FGMT), coarse-grain (CGMT) or simultaneous multithreaded (SMT).

An FGMT processor switches threads on a fixed schedule, typically on every processor cycle. In contrast, a CGMT processor is characterized by a thread switch induced by a long latency operation or a thread quantum expiration. Finally, an SMT processor enhances the previous ones with a more aggressive instruction issue policy, which is able to issue instructions from different threads in a single cycle. The simulator parameters that specify the sharing strategy of pipeline stages among threads, and thus the kind of multithreading, are summarized in Table 1. Again, [14] gives a detailed description of all possible values these parameters may take.
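The three thread switching behaviours can be contrasted with a short sketch. The function names, the round-robin order and the way long-latency events are supplied (as a set of cycle numbers) are assumptions made for illustration. They do not correspond to the simulator's parameters or code.

```python
def fgmt_schedule(num_threads, cycles):
    """Fine-grain: switch on a fixed schedule, here every cycle."""
    return [cycle % num_threads for cycle in range(cycles)]


def cgmt_schedule(num_threads, cycles, stall_cycles, quantum):
    """Coarse-grain: keep the running thread until it hits a long-latency
    event (a cycle listed in stall_cycles) or its quantum expires."""
    schedule, current, used = [], 0, 0
    for cycle in range(cycles):
        schedule.append(current)
        used += 1
        if cycle in stall_cycles or used == quantum:
            current = (current + 1) % num_threads
            used = 0
    return schedule


def smt_schedule(num_threads, cycles, width):
    """Simultaneous: the slots of one cycle may go to different threads."""
    return [tuple((cycle + slot) % num_threads for slot in range(width))
            for cycle in range(cycles)]


fgmt = fgmt_schedule(num_threads=2, cycles=4)
cgmt = cgmt_schedule(num_threads=2, cycles=8, stall_cycles={1}, quantum=4)
smt = smt_schedule(num_threads=2, cycles=2, width=4)
```

Here `fgmt` alternates threads every cycle, `cgmt` leaves thread 0 after the event in cycle 1 and leaves thread 1 when its four-cycle quantum ends, and each entry of `smt` lists the threads served by the slots of a single cycle.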


Table 1. Combination of parameters for different multithread configurations

                 FGMT                          CGMT               SMT
fetch_kind       timeslice                     switchonevent      timeslice/multiple
fetch_priority   -                             -                  equal/icount
decode_kind      shared/timeslice/replicated   shared/timeslice   shared/timeslice/replicated
issue_kind       timeslice                     shared/timeslice   replicated
retire_kind      timeslice                     timeslice          timeslice/replicated

Figure 2. Evaluated cache distribution designs

4.3 Detailed simulation: Multicore support

A multicore simulation environment is basically achieved by replicating the data structures that represent a single processor core. The zone of shared resources in a multicore processor starts with the memory hierarchy. When caches are shared among cores, some contention can exist when they are accessed simultaneously. In contrast, when they are private per core, a coherence protocol (e.g. MOESI [20]) is implemented to guarantee memory consistency. In its current version, Multi2Sim implements a split-transaction bus as the interconnection network, extensible to any other topology of on-chip networks.

The number of interconnects and their location vary depending on the sharing strategy of data and instruction caches. Figure 2 shows three possible schemes of sharing L1 and L2 caches (t = private per thread, c = private per core, s = shared), and the resulting interconnects for a dual-core dual-thread processor.

5 Results

This section presents some simulation experiments using Multi2Sim, illustrating the simulator application on one hand, and checking its correctness on the other. These experiments i) test different multithread pipeline configurations, ii) explore different bus widths and iii) trace the network traffic executing a parallel workload. In all cases, the simulated machine includes separate 64KB L1 instruction and data caches, a unified 1MB L2 cache shared among threads, private physical register files of 128 entries, and fetch, decode, issue and commit widths of 8 instructions per cycle.

i) Multithread Pipeline Organizations. Figure 3 shows the results for four different multithreaded implementations: FGMT, CGMT, SMT with equal thread priorities and SMT with ICOUNT (giving priority to those threads with fewer instructions in the pipeline [21]). Figure 3a shows the average number of instructions issued per cycle, while Figure 3b represents the global IPC (i.e., the sum of the IPCs achieved by the different threads), executing benchmark 176.gcc from the SPEC2000 suite with one instance per hardware thread, and varying the number of threads.

Results are in accordance with those published by Tullsen et al. [18], where CGMT and FGMT processors perform slightly better as the number of threads increases up to four. In addition, an SMT processor shows not only higher performance for any number of threads, but also higher scalability, both with equal and variable thread priorities.
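Two quantities used in this experiment, the ICOUNT priority and the global IPC, can be written down in a few lines. The sketch below is only illustrative: the function names are hypothetical, ties in ICOUNT are broken by the lowest thread identifier as an assumption, and the numbers are made up rather than taken from the reported results.

```python
def icount_pick(in_flight):
    """Pick the thread with the fewest instructions in the pipeline.

    in_flight maps a thread id to its number of in-flight instructions.
    Ties go to the lowest thread id (an assumption of this sketch).
    """
    return min(sorted(in_flight), key=lambda thread: in_flight[thread])


def global_ipc(committed_per_thread, cycles):
    """Global IPC: the sum of the IPCs achieved by the different threads."""
    return sum(committed / cycles for committed in committed_per_thread)


chosen = icount_pick({0: 12, 1: 5, 2: 9})
throughput = global_ipc([1500, 1000, 500], cycles=1000)
```

In this example thread 1 is selected because it holds the fewest in-flight instructions, and the three threads together reach a global IPC of 3.0.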

ii) Bus Width Evaluation. This experiment shows how the bus width affects processor performance, resulting in a different number of contention cycles during data transfers. For this test, we assume MOESI requests of 8 bytes and cache blocks of 64 bytes, so network messages can be either 8 bytes (only a MOESI request) or 72 bytes (MOESI request + block data). The executed workload is fft, which belongs to the SPLASH2 suite, a set of parallel benchmarks.

Figure 4 represents the average contention cycles per transfer. Because no message larger than 72 bytes will be transferred, at least a 72-byte bus width is required to send any message in a single bus cycle and minimize contention. However, results show that a bus width more than three times smaller provides (for this workload) almost the same benefits.
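The relation between message size and bus width is simple arithmetic, sketched below. The helper `bus_cycles` is a hypothetical name, and the sketch only counts the cycles a message occupies the bus. It does not model the contention measured in the experiment.

```python
def bus_cycles(message_bytes, bus_width_bytes):
    """Bus cycles needed to transfer one message (ceiling division)."""
    return -(-message_bytes // bus_width_bytes)


# (cycles for an 8-byte request, cycles for a 72-byte request + block)
occupancy = {width: (bus_cycles(8, width), bus_cycles(72, width))
             for width in (8, 16, 24, 72)}
```

Under this model a 72-byte message takes 9 cycles on an 8-byte bus, 5 on a 16-byte bus, 3 on a 24-byte bus and a single cycle only when the bus is 72 bytes wide, while an 8-byte request always fits in one cycle.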

iii) Interconnect Traffic Evaluation. This experiment shows the activity of the interconnection network during the execution of the fft benchmark with the same processor configuration described above, for a 16-byte bus width. Figure 5a represents the fraction of total bus bandwidth used in the network connecting the L1 caches and the common L2 cache, taking intervals of 10^4 cycles. Figure 5b represents the same metric for the interconnect between L2 and main memory (MM). Traffic distribution is quite irregular, with peaks of interconnect activity at some execution intervals.
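One way to compute a per-interval bandwidth fraction from a transfer trace is sketched below. The definition used here (busy bus cycles divided by interval length), the function name `used_bandwidth` and the sample trace are assumptions for illustration. The text does not state how the simulator derives the metric.

```python
def used_bandwidth(transfers, bus_width, interval):
    """Fraction of bus cycles occupied in each interval.

    transfers is a list of (start_cycle, message_bytes) pairs. Each transfer
    is charged entirely to the interval in which it starts.
    """
    if not transfers:
        return []
    last_start = max(start for start, _ in transfers)
    busy = [0] * (last_start // interval + 1)
    for start, size in transfers:
        busy[start // interval] += -(-size // bus_width)  # ceil: cycles on bus
    return [cycles / interval for cycles in busy]


# Made-up trace: two messages in the first interval, one in the second.
trace = [(100, 72), (5_000, 8), (12_000, 72)]
fractions = used_bandwidth(trace, bus_width=16, interval=10_000)
```

With a 16-byte bus the first interval has 6 busy cycles out of 10,000 and the second has 5, so the fractions are 0.0006 and 0.0005.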


Figure 3. Issue rate and IPC with different multithreaded designs. a) Issue rate: instructions issued per cycle versus number of threads. b) IPC: throughput (IPC) versus number of threads. Plotted series: cgmt, fgmt, smt_equal, smt_icount.

6 Conclusions

In this paper, we presented Multi2Sim, a simulation framework that integrates important features of existing simulators and extends them to provide additional functionality. Among the features adopted from other tools, we can cite the basic pipeline architecture (SimpleScalar), timing-first simulation (Simics-GEMS) and support for cache coherence protocols.

Among the extensions of Multi2Sim, we find the simulation of sharing strategies of pipeline stages, memory hierarchy configurations, multicore-multithread combinations and an integrated interface with the on-chip interconnection network. These features make Multi2Sim suitable for the evaluation of state-of-the-art processors, covering hot topics in the computer architecture field. In this paper, we showed some guiding examples of how to use these simulator characteristics.

The source code of Multi2Sim is written in C and can be downloaded at [14].

Figure 4. Performance for different bus widths simulating fft. a) Bus contention: contention versus L1-L2 bus width (bytes). b) Processor performance: IPC versus L1-L2 bus width (bytes).

Figure 5. Traffic distribution in L1-L2 and L2-MM interconnects. a) L1-L2 interconnect. b) L2-MM interconnect. Both panels plot the fraction of used bandwidth versus processor cycles (millions).


Acknowledgements

This work was supported by CICYT under Grant TIN2006-15516-C04-01, by Consolider-Ingenio 2010 under Grant CSD2006-00046 and by the Generalitat Valenciana under Grant GV06/326.

References

[1] AMD Athlon™ 64 X2 Dual-Core Processor Product Data Sheet. www.amd.com, Sept. 2006.

[2] D.C. Burger and T.M. Austin. The SimpleScalar Tool Set, Version 2.0. Technical Report CS-TR-1997-1342, 1997.

[3] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan. HotLeakage: A Temperature-Aware Model of Subthreshold and Gate Leakage for Architects. Univ. of Virginia Dept. of Computer Science Technical Report CS-2003-05, 2003.

[4] D. Madon, E. Sanchez, and S. Monnier. A Study of a Simultaneous Multithreaded Processor Implementation. In European Conference on Parallel Processing, pages 716–726, 1999.

[5] J. Sharkey. M-Sim: A Flexible, Multithreaded Architectural Simulation Environment. Technical Report CS-TR-05-DP01, Department of Computer Science, State University of New York at Binghamton, 2005.

[6] D. M. Tullsen. Simulation and Modeling of a Simultaneous Multithreading Processor. 22nd Annual Computer Measurement Group Conference, December 1996.

[7] M. Moudgill, P. Bose, and J. Moreno. Validation of Turandot, a Fast Processor Model for Microarchitecture Exploration. IEEE International Performance, Computing, and Communications Conference, pages 451–457, 1999.

[8] M. Moudgill, J. Wellman, and J. Moreno. Environment for PowerPC Microarchitecture Exploration. IEEE Micro, pages 15–25, 1999.

[9] D. Brooks, P. Bose, V. Srinivasan, M. Gschwind, P. Emma, and M. Rosenfield. Microarchitecture-Level Power-Performance Analysis: The PowerTimer Approach. IBM J. Research and Development, 47(5/6), 2003.

[10] B. Lee and D. Brooks. Effects of Pipeline Complexity on SMT/CMP Power-Performance Efficiency. Workshop on Complexity Effective Design, 2005.

[11] P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. IEEE Computer, 35(2), 2002.

[12] M. R. Marty, B. Beckmann, L. Yen, A. R. Alameldeen, M. Xu, and K. Moore. GEMS: Multifacet's General Execution-driven Multiprocessor Simulator. International Symposium on Computer Architecture, 2006.

[13] N. L. Binkert, E. G. Hallnor, and S. K. Reinhardt. Network-Oriented Full-System Simulation Using M5. Sixth Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW), pages 36–43, Feb. 2003.

[14] www.gap.upv.es/~raurte/tools/multi2sim.html. R. Ubal Homepage – Tools – Multi2Sim.

[15] MIPS Technologies, Inc. MIPS32™ Architecture For Programmers, volume I: Introduction to the MIPS32™ Architecture. 2001.

[16] MIPS Technologies, Inc. MIPS32™ Architecture For Programmers, volume II: The MIPS32™ Instruction Set. 2001.

[17] D. R. Butenhof. Programming with POSIX® Threads. Addison Wesley Professional, 1997.

[18] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. Proceedings of the 22nd International Symposium on Computer Architecture, pages 392–403, June 1995.

[19] J. P. Shen and M. H. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. July 2004.

[20] P. Sweazey and A.J. Smith. A Class of Compatible Cache Consistency Protocols and Their Support by the IEEE Futurebus. 13th Int'l Symp. Computer Architecture, pages 414–423, June 1986.

[21] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In ISCA, pages 191–202, 1996.
