DPLASMA and hwloc

Co-operative Report

Integrating Hardware Architecture Knowledge into the Distributed PLASMA Project

by Krerkchai Kusolchu
Student Visitor

Innovative Computing Laboratory
Department of Electrical Engineering and Computer Science
University of Tennessee

2010




Table of Contents

Acknowledgement
Abstract
Objective
Introduction
    About the Organization
    Background
Hardware Locality
    What is Hardware Locality?
    Design and Interface
        A. Abstracting the Hardware Topology
        B. Exporting the Hardware Topology
    Application and Performance Example
        Affinity-aware Thread Scheduling
Implementation, Problem Solving & Results
Conclusion
References
Appendix


Acknowledgement

I would like to express my gratitude to all those who gave me the opportunity to come to the University of Tennessee for my internship and gain valuable experience. I want to thank the Department of Computer Engineering for giving me permission to work and broaden my knowledge in the area of Computer Science. I furthermore thank the Institute of Engineering, Suranaree University of Technology, for encouraging and supporting me throughout my time here. I am deeply indebted to my supervisor, Dr. George Bosilca, whose help, concern, and suggestions assisted me throughout my internship. My colleagues from the Department of Computer Engineering gave me beneficial guidance in my project work, and I also thank all of the people who helped me along the way. Last but not least, I would like to give my special thanks to Dr. Thara Angskun, who contributed greatly to making this visit to the University of Tennessee possible.

Abstract

This document introduces the hwloc software, explains why affinities are important in modern HPC hardware and applications, gives several use cases with MPI and OpenMP libraries, and shows how hwloc helps them achieve better performance. It also compares the performance of Distributed PLASMA with hardware architecture knowledge against other methods.

Objective


To apply and integrate hardware architecture knowledge into the Distributed PLASMA project using hwloc, in order to improve the performance of Distributed PLASMA, and to compare the performance of Distributed PLASMA with hardware architecture knowledge against other methods.


Chapter 1: Introduction

About the Organization

Name: Innovative Computing Laboratory
Alias: ICL
Contact:

University of Tennessee
Department of Electrical Engineering and Computer Science
Suite 413 Claxton
1122 Volunteer Blvd
Knoxville, TN 37996-3450

Located at the heart of the University of Tennessee campus in Knoxville, ICL continues to lead the way as one of the most respected academic enabling-technology research laboratories in the world. Our many contributions to technological discovery in the HPC community, as well as at UT, underscore our commitment to remain at the forefront of enabling-technology research.

Before his recent departure as Chancellor of the Knoxville campus, Dr. Loren Crabtree remarked about ICL’s prominent role at the University of Tennessee:

On behalf of the entire university, it is a privilege to recognize the importance of the Innovative Computing Laboratory to the university’s research mission. Led by Distinguished Professor Jack Dongarra, ICL continues to set the standard for academic research centers in the 21st century. As one of the university’s most respected centers, the students and staff of ICL continue to demonstrate the dedication, leadership, and accomplishments that embody the university’s ongoing efforts to remain one of the top publicly funded academic research institutions in the United States. Going forward, I also expect ICL to continue to play a major role in helping the university establish and foster national and international collaborations, including our ongoing partnerships with Oak Ridge National Laboratory and the construction in Tennessee of the NSF’s new petascale supercomputing center. The future of research demands that academic institutions raise the bar for instruction and exploration. The University of Tennessee is proud to be the home of world-class centers such as ICL and we look forward to its continued contributions to our nation’s research agenda.

Background

At the Innovative Computing Laboratory (ICL), our mission is simple. We intend to be a world leader in enabling technologies and software for scientific computing. Our vision is to provide leading edge tools to tackle science’s most challenging high performance computing problems and to play a major role in the development of standards for scientific computing in general.

ICL was founded in 1989 by Dr. Jack Dongarra, who came to the University of Tennessee from Argonne National Laboratory upon receiving a dual appointment as Distinguished Professor in the Computer Science Department and as Distinguished Scientist at nearby Oak Ridge National Laboratory (ORNL), two positions he holds today. What began with Dr. Dongarra and a single graduate assistant has evolved into a fully functional center, with a staff of more than 40 researchers, students, and administrators. Throughout the past 18 years, ICL has attracted many post-doctoral researchers and professors from multiple disciplines such as mathematics and chemistry. Many of these scientists came to UT specifically to work with Dr. Dongarra, beginning a long list of top research talent who passed through ICL and moved on to make exciting contributions at other institutions and organizations. Below we recognize just a few who have helped make ICL the respected center it has become.

Zhaojun Bai - University of California, Davis
Richard Barrett - Oak Ridge National Laboratory
Adam Beguelin - formerly of AOL, now retired
Susan Blackford - Myricom
Henri Casanova - University of Hawaii, Manoa
Jaeyoung Choi - Soongsil University, Korea
Andy Cleary - Lawrence Livermore National Laboratory
Frederic Desprez - ENS-Lyon, France
Victor Eijkhout - University of Texas, Austin
Graham Fagg - Microsoft
Edgar Gabriel - University of Houston
Robert van de Geijn - University of Texas, Austin
Julien Langou - University of Colorado at Denver
Antoine Petitet - ESI Group, France
Roldan Pozo - NIST
Erich Strohmaier - Lawrence Berkeley National Laboratory
Francoise Tisseur - Manchester University, England
Bernard Tourancheau - University of Lyon, France
Sathish Vadhiyar - Indian Institute of Science (IISC), India
Clint Whaley - University of Texas, San Antonio
Felix Wolf - Forschungszentrum Julich, Germany

Over the past 18 years, ICL has produced numerous high-value tools and applications that now compose the basic fabric of high-performance scientific computing. Some of the technologies that our research has produced include:

Active Netlib
ATLAS
BLAS
FT-MPI
HARNESS
LAPACK
LAPACK for Clusters
LINPACK Benchmark
MPI
NetBuild
Netlib
NetSolve
PAPI
PVM
RIB
ScaLAPACK
Top500

Our successes continue with current ICL efforts such as Fault Tolerant Linear Algebra, Generic Code Optimization (GCO), the HPC Challenge benchmark suite (HPCC), KOJAK, the multi-core and Cell effort (PLASMA), NetSolve/GridSolve, Open MPI, PAPI, SALSA, SCALASCA, and vGrADS. Many of our efforts have been recognized nationally and internationally, including four R&D 100 Awards: PVM in 1994, ATLAS and NetSolve in 1999, and PAPI in 2001.

Chapter 2: Hardware Locality

What is Hardware Locality?

Hardware Locality, or hwloc, is software that provides command line tools and a C API to gather hardware information about processors, caches, memory nodes and more, and exposes it to applications and runtime systems in an abstracted and portable hierarchical manner. hwloc's primary goal is to help high-performance computing (HPC) applications, but it is also applicable to any project seeking to exploit code and/or data locality on modern computing platforms. hwloc may significantly help performance by having runtime systems place their tasks or adapt their communication strategies depending on hardware affinities.
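As a minimal sketch of the C API (error checking omitted; the function names follow the early hwloc releases used elsewhere in this report), the following program detects the topology of the current machine and reports its number of cores:

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topology;

    /* Allocate a topology context, then detect and build the tree. */
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* Count the Core objects discovered in the tree. */
    printf("cores: %d\n",
           hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE));

    hwloc_topology_destroy(topology);
    return 0;
}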

Design and Interface

We now introduce the design and interface of hwloc. It aims at abstracting topology information in a portable manner so as to export it to applications and runtime systems in a convenient way.

A. Abstracting the Hardware Topology

Hardware Locality was designed from the idea that today's and next-generation architectures are highly hierarchical. Indeed, current machines consist of several processor sockets containing multiple cores composed of one or several threads. This led to representing the hardware architecture as a tree of resources. hwloc also includes NUMA memory nodes in its resource tree, as depicted in Figure 4. In the case of NUMA machines with dozens of memory nodes, such as SGI ALTIX systems [2], hwloc can also parse the matrix of distances between nodes (reported by the operating system) so as to exhibit the hierarchical organization of these memory nodes. hwloc was also designed with the idea that future architectures may be asymmetric (fewer cores in some sockets) or even heterogeneous (different processor types). Thus, the hierarchical tree is composed of generic objects containing a type (among Node, Socket, Cache, Core, and more) and various attributes such as the cache type and size, or the socket number. This design enables easy porting to future architectures, since no assumption is made about the presence of currently existing object types (such as sockets or cores) or their relative depth in the tree.
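Because every object carries its type, its children, and its attributes, the whole hierarchy can be printed with a simple recursive walk. The sketch below assumes the early hwloc field names used elsewhere in this report (arity, children):

#include <stdio.h>
#include <hwloc.h>

/* Print one object per line, indented according to its depth in the tree.
   Call as: print_tree(hwloc_get_obj_by_depth(topology, 0, 0), 0); */
static void print_tree(hwloc_obj_t obj, int depth)
{
    int i;
    printf("%*s%s\n", 2 * depth, "", hwloc_obj_type_string(obj->type));
    for (i = 0; i < obj->arity; i++)
        print_tree(obj->children[i], depth + 1);
}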

B. Exporting the Hardware Topology

hwloc gathers information about the underlying hardware at startup. It uses operating-system-specific strategies to do so: reading the sysfs pseudo-filesystem on Linux, or calling specific low-level libraries on AIX, Darwin, OSF, Solaris, or Windows. It can then display a graphical or textual output to the user. It can also save the topology to an XML file so as to reload it later instead of re-gathering it from scratch, for instance when both a launcher and the actual process use it.
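For instance, a process can reload a previously saved description instead of re-detecting the hardware. This is a sketch assuming hwloc was built with XML support and that topology.xml was saved earlier (for example with the lstopo tool):

#include <hwloc.h>

int load_saved_topology(hwloc_topology_t *topology)
{
    hwloc_topology_init(topology);

    /* Read the topology from the XML file instead of querying the
       operating system, then build the tree as usual. */
    if (hwloc_topology_set_xml(*topology, "topology.xml") < 0)
        return -1;
    hwloc_topology_load(*topology);
    return 0;
}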

The most interesting way to use hwloc is through its C programming interface. The hwloc interface not only abstracts OS-specific interfaces into a portable API; it also tries to leverage all their advantages through both a low-level detailed interface and a high-level conceptual interface. The former lets an advanced programmer directly traverse the object tree, following pointers to parents, children, siblings, etc., so as to find the relevant resource information using topology attributes such as depth or index. The latter API provides generic, higher-level helpers to find resources matching some properties. Once the application or runtime system has found the interesting objects in the topology tree, it can retrieve information from their attributes to adapt its behavior to the underlying hardware characteristics.
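As a sketch combining the two styles (using the early hwloc names that also appear in the Appendix, such as the father pointer and the memory_kB cache attribute, which were renamed in later hwloc releases): a high-level call locates a given core, and a low-level upward walk then finds the first cache above it and reads its attributes:

#include <stdio.h>
#include <hwloc.h>

/* Report the lowest cache above core 'id' and how many processors share it. */
static void describe_cache_above(hwloc_topology_t topology, int id)
{
    /* High-level lookup: the id-th Core object in the tree. */
    hwloc_obj_t obj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, id);

    /* Low-level traversal: follow father pointers toward the root. */
    while (obj && obj->type != HWLOC_OBJ_CACHE)
        obj = obj->father;

    if (obj)
        printf("core %d: %lu KB cache shared by %d processors\n", id,
               (unsigned long) obj->attr->cache.memory_kB,
               hwloc_cpuset_weight(obj->cpuset));
}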

Application and Performance Example

This section shows how hwloc can be used by some existing OpenMP and MPI runtime systems. We first look at scheduling OpenMP threads and placing MPI processes depending on their software affinities and on the hardware hierarchy. Then, we show how a predefined process placement can benefit from topology information by adapting its communication strategy to the hardware affinities between processes.

Affinity-aware Thread Scheduling

The OpenMP language consists of a set of compiler directives, library routines, and environment variables that help the programmer design parallel applications. It was originally designed for SMP architectures, and OpenMP runtime systems now have to evolve to deal with affinities on hierarchical NUMA machines.


FORESTGOMP is an extension of the GNU OpenMP runtime system (GOMP) in GCC that benefits from hwloc to be efficient on any kind of shared-memory architecture. It relies on the BUBBLESCHED scheduling framework to group related threads together into recursive bubble structures every time the application enters a parallel section, thus generating a tree of threads out of OpenMP applications.

BUBBLESCHED also decorates the topology provided by hwloc with thread queues called runqueues. Each runqueue is thus attached to a different object of the architecture topology. This way, the computer architecture is modeled by a tree of runqueues on which a tree of threads can be scheduled. For instance, scheduling a thread on a socket-level runqueue means that this thread can only be executed by the corresponding cores, and each core can run any thread that is placed on the runqueue of an object containing that core.

The problem of scheduling is then only a matter of mapping a dynamic tree of threads onto a tree of runqueues. FORESTGOMP provides several scheduling policies to fit different situations. One of them, called Cache, takes the topology into account to perform a thread distribution that accounts for cache memory affinities. Its main goal is to schedule related threads together in a portable way, consulting the topology to determine which processing units share cache memory. It also keeps track of the last runqueue a thread was scheduled on, so that it can move the thread back there during a new thread distribution to benefit from cache memory reuse. When a processor idles, the Cache scheduler browses the topology to steal work from the most local cores, to benefit from shared cache memory.
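The following is a generic sketch of this idea, not BUBBLESCHED's actual implementation: the data structures are hypothetical, and the point is only the order of the search, climbing the topology tree so that the nearest (cache-sharing) runqueues are scanned first:

/* Hypothetical topology node for illustration: each object of the
   architecture owns a runqueue holding ready threads. */
typedef struct node {
    struct node *father;     /* enclosing topology object */
    struct node **children;  /* contained objects */
    int arity;               /* number of children */
    int nb_threads;          /* threads waiting on this runqueue */
} node_t;

/* Take one thread from this runqueue if it is not empty. */
static int try_pop(node_t *rq)
{
    if (rq->nb_threads > 0) { rq->nb_threads--; return 1; }
    return 0;
}

/* Search the subtree below 'obj', skipping the already-searched child. */
static int steal_below(node_t *obj, node_t *skip)
{
    int i;
    if (obj == skip) return 0;
    if (try_pop(obj)) return 1;
    for (i = 0; i < obj->arity; i++)
        if (steal_below(obj->children[i], skip)) return 1;
    return 0;
}

/* When a core idles, climb the tree one level at a time so that work is
   stolen from the most local cores first. */
int steal_work(node_t *idle_core)
{
    node_t *obj = idle_core, *last = 0;
    while (obj) {
        if (steal_below(obj, last)) return 1;
        last = obj;
        obj = obj->father;
    }
    return 0;
}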

We experimented with the Cache policy on an implicit surface reconstruction application called MPU on a quad-socket quad-core Opteron host. The parallelism of this application is highly irregular and leads to the creation of a tree of more than 100,000 threads. Table I shows the results obtained by both the GOMP and the FORESTGOMP runtime systems. We also slightly modified FORESTGOMP to ignore the architecture topology for comparison. It behaves better than the GOMP runtime system thanks to the cheap user-level thread management in BUBBLESCHED. As re-using cache memory is crucial for this kind of divide-and-conquer application, the topology-aware Cache scheduling policy behaves much better here. The OpenMP parallelization on this 16-core host achieves a speedup of 14 over the sequential code thanks to proper hardware affinity knowledge, while GOMP and the non-topology-aware FORESTGOMP only reach speedups of 4.18 and 8.52.


4. Implementation

Here are the functions used to detect the hardware architecture using Hardware Locality (their implementations are listed in the Appendix).

int conf_topology(int set);

This function allocates, builds, and frees the topology context. If the input is 1 it allocates and builds the topology context; if the input is 0 it terminates and frees it; for anything else it just returns 0.

int dplasma_hwlock_nb_levels();

Returns the number of levels of the hardware architecture of the system, down to the cores. For example, the levels may be the System level (numbered 0), the L3 cache level (numbered 1), the L2 level (numbered 2), and the L1 level (numbered 3).

int dplasma_hwlock_master_id(int level, int processor_id);

This function returns the processor id of the "master" of the processor defined by processor_id at the given level. For example, consider the following topology:

System (126GB)
    L3 (5118KB)
        L2 (512KB) + L1 (64KB) + Core#0
        L2 (512KB) + L1 (64KB) + Core#1
        L2 (512KB) + L1 (64KB) + Core#2
        L2 (512KB) + L1 (64KB) + Core#3
    L3 (5118KB)
        L2 (512KB) + L1 (64KB) + Core#4
        L2 (512KB) + L1 (64KB) + Core#5
        L2 (512KB) + L1 (64KB) + Core#6
        L2 (512KB) + L1 (64KB) + Core#7


Level 0 is the system, so dplasma_hwlock_master_id(0, 0) = 0, dplasma_hwlock_master_id(0, 3) = 0, and so on. In short, the master is the first processor that appears in the object containing processor_id at the given level.

unsigned int dplasma_hwlock_nb_cores(int level, int master_id);

This function returns the number of processors that have the same master master_id at level level.

Using the same example as above, dplasma_hwlock_nb_cores(0, 0) = 8 and dplasma_hwlock_nb_cores(0, 4) = 8.

size_t dplasma_hwlock_cache_size(int level, int master_id);

This function returns the size of the cache at level level for the processors whose master is master_id.

Using the same example as above, dplasma_hwlock_cache_size(1, 4) = 5118 KB and dplasma_hwlock_cache_size(2, 4) = 512 KB.

int dplasma_hwloc_distance(int id1, int id2);

This function returns the distance between id1 and id2: how many jumps must be made to go from the core with id id1 to the core with id id2. Since the hierarchy is a tree, this number is normally even.

Using the same example as above, dplasma_hwloc_distance(0, 1) = 6 jumps and dplasma_hwloc_distance(0, 4) = 8 jumps.
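A hypothetical usage sketch tying these wrappers together (the prototypes repeat the declarations above; level numbers refer to the example topology):

#include <stdio.h>
#include <stddef.h>

int conf_topology(int set);
int dplasma_hwlock_nb_levels(void);
int dplasma_hwlock_master_id(int level, int processor_id);
unsigned int dplasma_hwlock_nb_cores(int level, int master_id);
size_t dplasma_hwlock_cache_size(int level, int master_id);
int dplasma_hwloc_distance(int id1, int id2);

int main(void)
{
    conf_topology(1);   /* allocate and build the topology context */

    printf("levels: %d\n", dplasma_hwlock_nb_levels());

    /* Which core leads core 5 at the L3 level (level 1), how many cores
       share that L3, and how big is it? */
    int master = dplasma_hwlock_master_id(1, 5);
    printf("master=%d cores=%u L3=%zu KB\n", master,
           dplasma_hwlock_nb_cores(1, master),
           dplasma_hwlock_cache_size(1, master));

    /* Prefer placing communicating tasks on the closest cores. */
    if (dplasma_hwloc_distance(0, 1) < dplasma_hwloc_distance(0, 4))
        printf("cores 0 and 1 are closer than cores 0 and 4\n");

    conf_topology(0);   /* free the topology context */
    return 0;
}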

5. Conclusion

By applying and integrating hardware architecture knowledge into the Distributed PLASMA project using hwloc, we can take advantage of the hardware architecture so that related threads can be scheduled in a portable way, consulting the topology to determine which processing units share cache memory. This allows the scheduler to benefit from cache memory reuse and to steal work from the most local cores, taking advantage of shared cache memory. In this way we can improve the performance of Distributed PLASMA.


References

François Broquedis, Jérôme Clet-Ortega, Stéphanie Moreaud, Nathalie Furmento, Brice Goglin, Guillaume Mercier, Samuel Thibault, and Raymond Namyst. (2010). hwloc: a Generic Framework for Managing Hardware Affinities in HPC Applications. http://hal.inria.fr/docs/00/42/98/89/PDF/main.pdf

Appendix

Build and Destroy Topology


#include <hwloc.h>

/* Global topology context shared by all functions below. */
static hwloc_topology_t topology;

int conf_topology(int set)
{
    if (set == 1) {
        /* Allocate and build the topology context. */
        hwloc_topology_init(&topology);
        /* Collapse NUMA-node and socket levels that add no structure. */
        hwloc_topology_ignore_type_keep_structure(topology, HWLOC_OBJ_NODE);
        hwloc_topology_ignore_type_keep_structure(topology, HWLOC_OBJ_SOCKET);
        hwloc_topology_load(topology);
    } else if (set == 0) {
        /* Terminate and free the topology context. */
        hwloc_topology_destroy(topology);
    }
    return 0;
}

Find the number of cores for master_id

unsigned int dplasma_hwlock_nb_cores(int level, int master_id)
{
    int i;
    /* Find the object at this level whose cpuset contains master_id;
       the weight of its cpuset is the number of cores below it. */
    for (i = 0; i < hwloc_get_nbobjs_by_depth(topology, level); i++) {
        hwloc_obj_t obj = hwloc_get_obj_by_depth(topology, level, i);
        if (hwloc_cpuset_isset(obj->cpuset, master_id)) {
            return hwloc_cpuset_weight(obj->cpuset);
        }
    }
    return 0;
}


Find the master id from the processor id

int dplasma_hwlock_master_id(int level, int processor_id)
{
    int count = 0, i, div, real_cores, cores;

    real_cores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
    cores = real_cores;
    div = cores;

    /* If processor_id exceeds the number of physical cores, fold it
       back onto a core index. */
    if (processor_id / cores > 0) {
        while (processor_id) {
            if (processor_id % div == 0) {
                processor_id = count;
                break;
            }
            count++;
            div++;
            if (real_cores == count) count = 0;
        }
    }

    /* The master is the first processor of the object at this level
       whose cpuset contains processor_id. */
    for (i = 0; i < hwloc_get_nbobjs_by_depth(topology, level); i++) {
        hwloc_obj_t obj = hwloc_get_obj_by_depth(topology, level, i);
        if (hwloc_cpuset_isset(obj->cpuset, processor_id)) {
            return hwloc_cpuset_first(obj->cpuset);
        }
    }
    return -1;
}


Find the cache size

size_t dplasma_hwlock_cache_size(int level, int master_id)
{
    /* Start from the processor object and walk up the tree until the
       requested level is reached. */
    hwloc_obj_t obj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PROC, master_id);
    while (obj) {
        if (obj->depth == level) {
            if (obj->type == HWLOC_OBJ_CACHE) {
                return obj->attr->cache.memory_kB;
            } else {
                return 0;
            }
        }
        obj = obj->father;
    }
    return 0;
}

Find the distance between two cores

int dplasma_hwloc_distance(int id1, int id2)
{
    int count = 0;
    hwloc_obj_t obj  = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, id1);
    hwloc_obj_t obj2 = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, id2);

    /* Walk up from both cores one level at a time; when the two paths
       meet at a common ancestor, the distance is the number of jumps
       up plus the number of jumps down. */
    while (obj) {
        if (obj == obj2)
            return count + count;
        obj = obj->father;
        obj2 = obj2->father;
        count++;
    }
    return -1;
}

Find the number of levels of the hardware architecture

int dplasma_hwlock_nb_levels(void)
{
    /* The cores sit at the deepest level; their depth equals the
       number of levels above them. */
    return hwloc_get_type_depth(topology, HWLOC_OBJ_CORE);
}