
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Parallelization of a thermal elastohydrodynamic lubricated contacts simulation using OpenMP

GHASSAN ALRHEIS

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Parallelization of a thermal elastohydrodynamic lubricated contacts simulation using OpenMP

Ghassan Alrheis
KTH Royal Institute of Technology

29th of May 2019

Industry supervisor: Erland Nordin
Academic supervisor: Carl-Magnus Everitt
Examiner: Erwin Laure

Sammanfattning

Multi-core computers sharing a common memory (SMP) have become the norm since Moore's law stopped holding. To exploit the performance that multiple cores offer, the software engineer needs to write programs that explicitly make use of several cores. In smaller projects this is easily overlooked, producing programs that only use a single core. In such cases there are therefore large gains to be had by parallelizing the code. This degree project has improved the performance of a computationally heavy simulation program, written to use only a single core, by finding regions of the code that are suitable for parallelization. These regions were identified with Intel's VTune Amplifier and parallelized with OpenMP. The work also replaced a particular calculation routine that was especially demanding, above all for larger problems. The end result is a simulation program that produces the same results as the original, but considerably faster and with fewer computer resources. The program will be used in future research projects.

    Abstract

Multi-core Shared Memory Parallel (SMP) systems became the norm ever since the performance trend prophesied by Moore's law ended. Correctly utilizing the performance benefits these systems offer usually requires a conscious effort from the software developer's side to enforce concurrency in the program. This is easy to disregard in small software projects and can lead to great amounts of unused potential parallelism in the produced code. This thesis attempted to improve the performance of a computationally demanding Thermal Elastohydrodynamic Lubrication (TEHL) simulation written in Fortran by finding such parallelism. The parallelization effort focused on the most demanding parts of the program identified using Intel's VTune Amplifier and was implemented using OpenMP. The thesis also documents an algorithm change that led to further improvements in terms of execution time and scalability with respect to problem size. The end result is a faster, lighter and more efficient TEHL simulator that can further support the research in its domain.

    Keywords: OpenMP; SMP; Parallelism; TEHL; Multi-core;

Nomenclature

    API Application Programmer Interface

    ARB Architecture Review Board

    cc-NUMA Cache Coherent NUMA

    DMP Distributed Memory Parallel

    DSM Distributed Shared Memory Parallel

    DS Direct Summation

    EHL Elastohydrodynamic Lubrication

    FFT Fast Fourier Transform

    IDE Integrated Development Environment

    ifort Intel Fortran Compiler

    MKL Intel’s Math Kernel Library

    MPI Message Passing Interface

    NUMA Non-Uniform Memory Access

    OpenMP Open Multi-Processing

    SMP Shared Memory Parallel

    SMT Simultaneous Multi-Threading

    TEHL Thermal Elastohydrodynamic Lubrication

    UMA Uniform Memory Access


Contents

1 Introduction
  1.1 Background
  1.2 Scope
  1.3 Outline

2 Theory
  2.1 Thermal Elastohydrodynamic Lubrication Simulation
    2.1.1 Modeling EHL
    2.1.2 Numerical setup
    2.1.3 Thermal model
    2.1.4 Overview of the simulation
  2.2 Parallelization and OpenMP
    2.2.1 Memory models
    2.2.2 OpenMP's programming model
    2.2.3 OpenMP's work-sharing constructs
    2.2.4 Potential performance
    2.2.5 Related work

3 Methodology
  3.1 Tools
    3.1.1 Hardware
    3.1.2 Software
  3.2 Performance data collection
    3.2.1 Single-step mode
    3.2.2 Simulation Test Cases
    3.2.3 Reference Step

4 Execution
  4.1 Initial performance
  4.2 Parallelizing VI
  4.3 Parallelizing TEMP_CALC_METAL
  4.4 Addressing libm_powf_l9
  4.5 Finalizing the parallelization
  4.6 FFT VI
    4.6.1 Motivation
    4.6.2 Background
    4.6.3 Implementation
    4.6.4 Performance
  4.7 Final remarks and reflection

5 Results and discussion
  5.1 Concurrency scaling
    5.1.1 Using 1 x Intel Core i7-7800x
    5.1.2 Using 2 x Intel Xeon E5-2690 V2
  5.2 Computational scaling of the elastic deformation subroutines
  5.3 Summary and final remarks

6 Conclusions and future work
  6.1 Summary and conclusion
  6.2 Recommended future work

Tables

1  Specifications highlights of the two computers used. Obtained from [3] and [5].
2  Software versions used on the two machines.
3  Summary of the initial code analysis.
4  Summary of the analysis of the code with parallel VI.
5  Summary of the analysis of the code with parallel TEMP_CALC_METAL.
6  Caller/callee report for libm_powf_l9.
7  Summary of the analysis of the code with concurrent usage of libm_powf_l9.
8  Summary of the analysis of the final threaded code.
9  Total CPU time and CPU hotspots of the final threaded version of the code.
10 Summary of the analysis of the threaded code using the FFT method for elastic deformation.

Figures

1  Conformal (a) versus non-conformal (b) contacts.
2  Illustration of the discretization of the contact metals and lubricant film.
3  Illustration of the execution of the simulation.
4  Illustration of SMP, DMP and DSM.
5  OpenMP's fork-join model.
6  The effects of Amdahl's law illustrated for different fractions of parallelized code.
7  Illustration of the data dependency between time steps.
8  Single step process flow illustration.
9  The nested loop in TEMP_CALC_METAL.
10 The nested loop in TEMP_CALC_METAL post parallelization.
11 The calls to the demanding subroutines in LUBRICATION_TEMP.
12 The parallelization of the nested loop that calls TEMP_CALC_IJ.
13 One of the segments of NEWTONIAN parallelized.
14 Illustration of the circular convolution in the deformation calculations.
15 Illustration of deformation calculations with expanded pressure.
16 Execution time changes on the PC.
17 Obtained speedups on the PC.
18 Threading efficiency on the PC.
19 Illustration of the NUMA node's motherboard.
20 Execution time changes on the NUMA node.
21 Speedups on the NUMA node.
22 Threading efficiency on the NUMA node.
23 DS versus FFT elastic deformation calculations.

1 Introduction

This section provides the reader with a brief historical context for multi-core parallel computing. The scope of the thesis is also described here. Finally, a brief outline of the thesis is presented for the convenience of the reader.

    1.1 Background

According to the popular and loosely defined Moore's law, the number of transistors that can fit on a single chip should double every year or two. This prophecy was used, for a long time, to predict the improvement in the processing power of general purpose processors [16]. Transistor count and performance were linked through the attainable clock frequencies of said processors, which corresponded to the size of the transistors. This relationship held for a long time, but around the early 2000s clock frequencies stagnated. This was not due to a slowdown in the improvements to transistor sizes and the scale of integration possible on a single die, but rather due to power and heat dissipation issues. Maintaining the trend of performance increases required using this possible increase in chip complexity in a different way. For that reason, hardware manufacturers moved to multi-core architectures which include several processors on a single chip [2]. This leads to a potential increase in performance, but only for software developed with that hardware in mind, that is, threaded software which can distribute its workload over multiple processors in an effective manner.

Automatic parallelization of code is still not widely adopted, which may be due to the focus of research on parallelization APIs and their implementations [15]. This implies that introducing concurrency to software requires a conscious effort from the developer. This, alongside the new concepts introduced by a concurrent programming model, can inhibit the development of parallel code when performance is not being prioritized. In turn, this leads to software with great amounts of potential parallelism that can be exploited in order to improve performance.

In its initial state, the Thermal Elastohydrodynamic Lubrication (TEHL) simulation software written by Carl-Magnus Everitt, a PhD candidate at KTH and one of the supervisors of this project, is entirely sequential and very computationally demanding. It can require more than a week of run time to complete some of the more complex cases. Concurrency was never explored during the development of the code, and that is the main reason why this thesis was proposed.

    1.2 Scope

This thesis explored the parallelization of the TEHL simulation. Threading was done using OpenMP in an incremental manner: OpenMP directives were used to parallelize a demanding segment of the program, resulting in a version which was then used to identify other demanding segments. When the remaining demanding segments had small execution times compared to the overhead parallelization would add, the parallelization was deemed complete. These demanding segments are later referred to as hotspots and were obtained from Intel's VTune Amplifier, a parallel code profiler described further in Section 3.1.2.

This incremental approach produced many intermediate versions of the code with increasingly better performance. The decision to parallelize a new segment of the code is always justified by referring to the profiler's results. Following that, the observed effect on performance is explained. The final version of the code is run with different numbers of threads, with and without Hyper-threading, in order to examine the efficiency of the threading and how the attained performance scales with the amount of hardware used.

In addition to traditional parallelization, some optimizations to the original sequential code were made. Also, a more efficient elastic deformation calculation algorithm was implemented to speed up the final code and make it less computationally heavy. This approach was obtained from experts in this area and is considered a tangent to the original aim of the thesis. However, it is mentioned here because it led to significant improvements to the program.


1.3 Outline

The report starts with Section 2, which offers the reader an overview of the theory relevant to this work as well as some related prior work found by surveying published literature. Section 3 presents the hardware and software tools used throughout this thesis along with the general approach taken to collect representative performance data from the program. Following that, Section 4 documents the investigation carried out to determine how the code should be threaded and improved. The results of that effort are then analyzed and discussed in Section 5. Finally, the thesis is concluded with a summary and a few recommended areas of future work in Section 6.


2 Theory

This section provides the reader with the theoretical knowledge needed to understand the rest of the work. This mainly concerns the mathematical models used to numerically estimate the parameters of interest. It also surveys what has been done before within the area of parallelization using OpenMP. This information is used to identify the range of speed-up that is attainable when parallelizing the application to a certain number of threads. Alongside this, the literature was used to develop a suitable style to present the work in this thesis.

    2.1 Thermal Elastohydrodynamic Lubrication Simulation

Elastohydrodynamic lubrication (EHL) is a type of hydrodynamic lubrication which occurs in concentrated contacts between solid materials. This concentrated nature of the interaction leads to high pressures which in turn cause significant elastic deformations of the solids. Concentrated contacts are called non-conformal, as opposed to conformal contacts in which the contact area is much larger and thus the pressure profile over the surface of the solid within the contact area is more uniform. For example, the contact between a plain bearing and its sleeve is conformal, while the contact between gears or between cam followers and cams is not [20]. Figure 1 illustrates the two types.

Figure 1: Conformal (a) versus non-conformal (b) contacts.

An EHL contact is described by the pressure profile over the interacting surfaces. Accurately approximating this can require the inclusion of many complex aspects of the contact. To limit the discussion to what is relevant to this thesis, this section focuses on the aspects of EHL described in Reference [8], which represent the model used in the subject simulation.

The rest of this section is organized as follows: the mathematical model of EHL as described in the simulation is presented; the discretization of the EHL model is shown; the thermal model added later is described; finally, an overview of the code is given.

    2.1.1 Modeling EHL

Carl-Magnus Everitt in Reference [8] describes EHL contacts with a system consisting of five equations. The first of these is Reynolds' equation, which can be used to obtain the pressure profile over the contact area. The equation relates the pressure to the thickness of the lubricant film, its density and its viscosity, and it is formulated as follows:

\[
\frac{\partial}{\partial x}\left(\frac{\rho h^3}{12\eta}\frac{\partial p}{\partial x}\right) + \frac{\partial}{\partial y}\left(\frac{\rho h^3}{12\eta}\frac{\partial p}{\partial y}\right) - u_m\frac{\partial}{\partial x}(\rho h) - \frac{\partial}{\partial t}(\rho h) = 0, \tag{1}
\]

where t is the time, x and y are the surface coordinates, p represents the pressure, h, η and ρ represent the thickness, viscosity and density of the lubricant respectively, and u_m represents the average speed at which the lubricant enters the contact area. This speed is called the "entrainment speed" and in most cases is the average speed of the two surfaces.

Correct pressure profiles should satisfy the load balance equation given by
\[
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p\,\mathrm{d}x\,\mathrm{d}y - f = 0, \tag{2}
\]

where f is the applied load normal to the contact surface. This load balance equation is the second equation in this EHL model. The thickness of the film is defined by the third equation of the system as

\[
h(x,y) = h_0 - a_{sh}(x,y,t) + \frac{x^2}{2 r_x} + \frac{2}{\pi E'}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{p\,\mathrm{d}x'\,\mathrm{d}y'}{\sqrt{(x-x')^2 + (y-y')^2}}, \tag{3}
\]

where h0 is the lubrication offset height, a_sh(x, y, t) is the expression defining the shape of the surface roughness or asperities on the rolling surface, r_x is the equivalent radius of the rolling cylinder and E' is the equivalent elastic modulus of the surfaces. Equivalent parameters are used here since the rolling contact problems are modeled as contact between a rolling cylinder or ball and a rigid flat surface. The equivalent radius and elastic modulus are calculated by the following.

\[
r_x = \frac{r_1 r_2}{r_1 + r_2}, \tag{4}
\]
\[
E' = \frac{2 E_1 E_2}{(1-\nu_2^2)E_1 + (1-\nu_1^2)E_2}, \tag{5}
\]

where r1 and r2 are the radii of curvature of the original surfaces, E1 and E2 are Young's moduli and ν1 and ν2 are Poisson's ratios for the surfaces. The elastic deformation of the solid is superimposed on the thickness of the film as the last term in Equation (3). The fourth equation of the system calculates the viscosity of the lubricant. It is called Roelands' equation and it is given as

\[
\eta = \eta_0(\Gamma)\exp\!\left(\left[\ln(\eta_0(\Gamma)) + 9.67\right]\left[-1 + (1 + 5.1\cdot 10^{-9} p)^{Z_R(\Gamma)}\right]\right), \tag{6}
\]

where Γ is the temperature and η0(Γ) is the dynamic viscosity at atmospheric pressure, given as
\[
\log(\eta_0(\Gamma)) = -4.2 + G_0\left(1 + \frac{\Gamma}{135}\right)^{-S_0}, \tag{7}
\]

and Z_R(Γ) is the temperature exponent, given as
\[
Z_R(\Gamma) = D_z + C_z \log\left(1 + \frac{\Gamma}{135}\right). \tag{8}
\]

The coefficients Cz, Dz, G0 and S0 are constants and depend on the lubricant. Finally, the last equation in this model is the pressure-density relationship given by the Dowson and Higginson relation, stated as
\[
\rho = \rho_0\left[1 + \frac{A_1 p}{1 + A_2 p}\right]\left[1 - \alpha(\Gamma - \Gamma_{40})\right], \tag{9}
\]
where ρ0 is the density of the lubricant at the reference temperature and Γ40 represents the reference temperature of 40°C.
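As an illustration of how these property relations fit together, the following Fortran routine is a direct transcription of Equations (6)-(9). It is only a sketch: the argument list and constant names are placeholders chosen here, not the simulation's actual interface, and the logarithms in Equations (7) and (8) are assumed to be base-10 as in the usual Roelands formulation.

! Sketch of Equations (6)-(9): Roelands viscosity and Dowson-Higginson density
! as functions of pressure P [Pa] and temperature GAMMA [deg C]. The lubricant
! constants CZ, DZ, G0, S0, RHO0, A1, A2 and ALPHA are placeholders.
SUBROUTINE LUBRICANT_PROPS(P, GAMMA, CZ, DZ, G0, S0, RHO0, A1, A2, ALPHA, ETA, RHO)
   IMPLICIT NONE
   REAL(8), INTENT(IN)  :: P, GAMMA, CZ, DZ, G0, S0, RHO0, A1, A2, ALPHA
   REAL(8), INTENT(OUT) :: ETA, RHO
   REAL(8) :: ETA0, ZR

   ! Equation (7): dynamic viscosity at atmospheric pressure
   ETA0 = 10.0D0**(-4.2D0 + G0*(1.0D0 + GAMMA/135.0D0)**(-S0))
   ! Equation (8): temperature-dependent pressure exponent
   ZR = DZ + CZ*LOG10(1.0D0 + GAMMA/135.0D0)
   ! Equation (6): Roelands viscosity-pressure-temperature relation
   ETA = ETA0*EXP((LOG(ETA0) + 9.67D0)*(-1.0D0 + (1.0D0 + 5.1D-9*P)**ZR))
   ! Equation (9): Dowson-Higginson density with thermal correction (reference 40 deg C)
   RHO = RHO0*(1.0D0 + A1*P/(1.0D0 + A2*P))*(1.0D0 - ALPHA*(GAMMA - 40.0D0))
END SUBROUTINE LUBRICANT_PROPS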

The system is solved numerically in most cases due to its non-linearity. The numerical model of the system is presented in the following section.

    2.1.2 Numerical setup

The numerical setup is built on the approach presented in Reference [13], which uses the Finite Difference Method (FDM) to discretize the differential equations in the system. The approach uses dimensionless parameters, some of which are given below.

\[
P = \frac{p}{p_{Hertz}}, \quad \bar{\rho} = \frac{\rho}{\rho_0}, \quad H = \frac{h\, r_x}{a^2}, \quad
A_{sh} = \frac{a_{sh}\, r_x}{a^2}, \quad X = \frac{x}{a}, \quad Y = \frac{y}{a}, \quad
T = \frac{t\, u_m}{a}, \quad \Delta T = \frac{u_m (X_e - X_0)}{N_t\, u_c}, \tag{10}
\]

where X0 and Xe are the normalized x-coordinates of the beginning and end points of the simulated area respectively, Nt is the total number of time steps to be simulated, which is determined at the beginning of the simulation based on the input parameters, and a and pHertz are the Hertz contact half width and the Hertzian pressure respectively, given as

\[
a = \sqrt{\frac{8 f r_x}{\pi E'}}, \qquad p_{Hertz} = \frac{2f}{\pi a}. \tag{11}
\]

Also, in order to simplify the discretized expressions of the equations, two more parameters are introduced, defined as

\[
\lambda = \frac{12\, u_m r_x^2}{a^3 p_{Hertz}}, \qquad \epsilon = \frac{\bar{\rho} H^3}{\eta\lambda}. \tag{12}
\]

With this, Reynolds' equation (Equation (1)), without the time derivative term, is discretized in space as
\[
\frac{\partial}{\partial X}\left(\epsilon_{i,j}\frac{\partial P_{i,j}}{\partial X}\right) + \frac{\partial}{\partial Y}\left(\epsilon_{i,j}\frac{\partial P_{i,j}}{\partial Y}\right) - \frac{\partial \bar{\rho}_{i,j} H_{i,j}}{\partial X} = 0. \tag{13}
\]

This is used to obtain the pressure profile of the time independent solution without the asperities in the contact area. The initial film thickness (h0 in Equation (3)) is also obtained during this phase. The derivative terms in Equation (13) are defined as follows.

\[
\frac{\partial}{\partial X}\left(\epsilon_{i,j}\frac{\partial P_{i,j}}{\partial X}\right) =
\frac{(\epsilon_{i+1,j}+\epsilon_{i,j})P_{i+1,j} - (2\epsilon_{i,j}+\epsilon_{i+1,j}+\epsilon_{i-1,j})P_{i,j} + (\epsilon_{i,j}+\epsilon_{i-1,j})P_{i-1,j}}{2\Delta X^2}
\]
\[
\frac{\partial}{\partial X}\left(\bar{\rho}_{i,j} H_{i,j}\right) =
\bar{\rho}_{i,j}\,\frac{3H_{i,j} - 4H_{i-1,j} + H_{i-2,j}}{\Delta X}
+ H_{i,j}\frac{\partial \bar{\rho}_{i,j}}{\partial P_{i,j}}\,\frac{3P_{i,j} - 4P_{i-1,j} + P_{i-2,j}}{\Delta X} \tag{14}
\]

The partial derivative with respect to Y is analogous to the X partial derivative. This discretization in space implies that the simulated contact area is split into nodes arranged on a 2D grid. This says nothing about the shape of the grid, but Reference [8] makes it a square for simplicity of representation.
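The stencils above translate almost directly into code. The following function is a sketch of how the left-hand side of Equation (13) could be evaluated at an interior node using Equation (14); the array names, the square N x N grid and the argument list are assumptions made here for illustration and do not mirror the simulation's own routines.

! Sketch: evaluating the left-hand side of Equation (13) at interior node (I,J)
! using the stencils of Equation (14). EPS, P, RHOB, H and DRHODP hold the
! dimensionless epsilon, pressure, density, film thickness and d(rho)/dP fields.
FUNCTION REYNOLDS_RESIDUAL(EPS, P, RHOB, H, DRHODP, DX, I, J, N) RESULT(RES)
   IMPLICIT NONE
   INTEGER, INTENT(IN) :: I, J, N
   REAL(8), INTENT(IN) :: EPS(N,N), P(N,N), RHOB(N,N), H(N,N), DRHODP(N,N), DX
   REAL(8) :: RES, DXX, DYY, DRHX

   ! Poiseuille terms of Equation (14), X and Y directions
   DXX = ((EPS(I+1,J)+EPS(I,J))*P(I+1,J)                          &
        - (2.0D0*EPS(I,J)+EPS(I+1,J)+EPS(I-1,J))*P(I,J)           &
        + (EPS(I,J)+EPS(I-1,J))*P(I-1,J)) / (2.0D0*DX**2)
   DYY = ((EPS(I,J+1)+EPS(I,J))*P(I,J+1)                          &
        - (2.0D0*EPS(I,J)+EPS(I,J+1)+EPS(I,J-1))*P(I,J)           &
        + (EPS(I,J)+EPS(I,J-1))*P(I,J-1)) / (2.0D0*DX**2)
   ! Wedge term, backward differences in X as in Equation (14)
   DRHX = RHOB(I,J)*(3.0D0*H(I,J)-4.0D0*H(I-1,J)+H(I-2,J))/DX     &
        + H(I,J)*DRHODP(I,J)*(3.0D0*P(I,J)-4.0D0*P(I-1,J)+P(I-2,J))/DX

   RES = DXX + DYY - DRHX           ! Equation (13): the residual should vanish
END FUNCTION REYNOLDS_RESIDUAL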

To obtain the time dependent solution, the full expression of Reynolds' equation is discretized using the Crank-Nicolson discretization scheme, which produces the following.
\[
\left(\frac{\partial}{\partial X}\left(\epsilon_{i,j}\frac{\partial P_{i,j}}{\partial X}\right)\right)^{t_{n+1}}
+ \left(\frac{\partial}{\partial Y}\left(\epsilon_{i,j}\frac{\partial P_{i,j}}{\partial Y}\right)\right)^{t_{n+1}}
- \left(\frac{\partial \bar{\rho}_{i,j} H_{i,j}}{\partial X}\right)^{t_{n+1}}
+ \left(\frac{\partial}{\partial X}\left(\epsilon_{i,j}\frac{\partial P_{i,j}}{\partial X}\right)\right)^{t_{n}}
+ \left(\frac{\partial}{\partial Y}\left(\epsilon_{i,j}\frac{\partial P_{i,j}}{\partial Y}\right)\right)^{t_{n}}
- \left(\frac{\partial \bar{\rho}_{i,j} H_{i,j}}{\partial X}\right)^{t_{n}}
- 2\left(\frac{(\bar{\rho}_{i,j}H_{i,j})^{t_{n+1}} - (\bar{\rho}_{i,j}H_{i,j})^{t_{n}}}{\Delta T}\right) = 0 \tag{15}
\]

The other equation that needs discretization is Equation (3), which models the thickness of the lubricant film. The focus of the discretization is the elastic deformation term, which includes a surface integral. The discretization is given as follows.
\[
H_{i,j} = H_0 + \frac{X_i^2}{2} - A_{sh}(i,j,t) + \frac{2}{\pi E'}\sum_{k=1}^{N_x}\ \sum_{l=j-N_x}^{j+N_x} P(k,l)\,(A_D + B_D + C_D + D_D) \tag{16}
\]
The terms A_D to D_D are added to simplify the expression and, due to their size, they are kept out of this section. They can be found in Equation (22) in [8]. The load balance equation is another surface integral over the x-y plane, but with a simpler expression to integrate. Reference [8] does not show this discretization and thus this section will not either. Finally, the viscosity and density expressions do not include any continuous calculus and are therefore not discretized. They do, however, use the dimensionless parameters shown above.

    2.1.3 Thermal model

In his later work, Carl-Magnus Everitt added a model of the thermal changes that can occur in a contact in order to more accurately simulate the lubricant film thickness. The model with this added detail is called Thermal Elastohydrodynamic Lubrication (TEHL) [9].

The main assumption made in this model is that the temperature is constant throughout the thickness of the lubricant film. This means that the fluid temperature term is calculated over a 2D grid. The same assumption is not made for the solids in the model, however, and therefore every metal in the contact has its own 3D grid. The number of nodes into the metals is set to 40 nodes for each node on the 2D grid of the lubricant. The discretization is shown in Figure 2 below.

Figure 2: Illustration of the discretization of the contact metals and lubricant film. The film consists of a single layer of nodes while the metals contain multiple layers.

Heat transfer is computed between every node and its direct neighbors, as illustrated by the arrows in the figure, in every iteration of the algorithm. The algorithm concludes when equilibrium is reached.
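In essence, the update described above is an explicit neighbor-exchange iteration over the metal grids. The following routine is a minimal sketch of that idea, assuming a simple explicit update with a hypothetical exchange coefficient ALPHA and convergence tolerance TOL; it is not the simulation's actual thermal solver.

! Minimal sketch: every interior node exchanges heat with its six direct
! neighbours until the largest change falls below a tolerance (equilibrium).
SUBROUTINE HEAT_RELAX(T, NX, NY, NZ, ALPHA, TOL)
   IMPLICIT NONE
   INTEGER, INTENT(IN) :: NX, NY, NZ
   REAL(8), INTENT(INOUT) :: T(NX,NY,NZ)
   REAL(8), INTENT(IN) :: ALPHA, TOL
   REAL(8) :: TNEW(NX,NY,NZ), DIFF
   INTEGER :: I, J, K

   DIFF = TOL + 1.0D0
   DO WHILE (DIFF .GT. TOL)         ! iterate until equilibrium is reached
      TNEW = T
      DO K = 2, NZ-1
         DO J = 2, NY-1
            DO I = 2, NX-1
               ! weighted neighbour exchange approximates the heat transfer
               TNEW(I,J,K) = T(I,J,K) + ALPHA * (                      &
                     T(I+1,J,K) + T(I-1,J,K) + T(I,J+1,K) + T(I,J-1,K) &
                   + T(I,J,K+1) + T(I,J,K-1) - 6.0D0*T(I,J,K) )
            END DO
         END DO
      END DO
      DIFF = MAXVAL(ABS(TNEW - T))  ! convergence measure
      T = TNEW
   END DO
END SUBROUTINE HEAT_RELAX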

    2.1.4 Overview of the simulation

An overview of the steps taken to solve for the pressure at every time step is shown in Figure 3 below. This description is based on [13], which describes the model on which this simulation was built. Also, some of the details were obtained from examining the source code. This does not serve as a comprehensive description of the code and is only shown to illustrate how the system is solved.

    2.2 Parallelization and OpenMP

The main scope of this work is to get the most out of a single shared memory parallel computing platform. Also, not reducing the portability of the original code by making it hardware specific is an important aspect of this thesis. With those conditions in mind, the tool chosen to parallelize the code was OpenMP.

OpenMP stands for Open Multi-Processing, and it is an Application Programmer Interface (API) which facilitates the creation of multi-threaded code. The API consists of compiler directives, environment variables and run time functions. It is developed by the OpenMP Architecture Review Board (ARB), which contains representatives from the biggest semiconductor vendors. The board represents a joint effort by the vendors to create a common tool to facilitate multi-core parallelism on a wide array of computing platforms [2]. The ARB produces the specification of the API, which is then implemented by compiler vendors or developers.

This section presents a survey of what is found in the literature regarding parallelization using OpenMP. The rest of the section is organized as follows: different memory models are disambiguated; OpenMP's programming model is discussed; the main work-sharing methods in OpenMP are briefly described; Amdahl's law is introduced to illustrate the potential speedups attainable theoretically; finally, published literature about parallelizing using OpenMP is briefly described.

Figure 3: Illustration of the execution of the simulation.

    2.2.1 Memory models

The first memory model discussed is the SMP model. SMP architectures are the most common type of parallel computer used today, where almost every modern personal computer contains two physical cores or more. In these architectures every processor has access to a single physical shared memory. This access can either be symmetric or asymmetric for every processor in the system. The former type of architecture is called Uniform Memory Access (UMA) SMP while the latter is called Non-Uniform Memory Access (NUMA) SMP. Asymmetrical access can be due to the memory being physically closer to some processors in the system than to others. Therefore, most big multi-processor architectures are NUMA-SMP. The communication mechanisms used in these types of systems usually ensure cache coherence (cc) across processors, and these types of architectures are then called cc-NUMA-SMP. OpenMP specifications assume SMP architectures, and while using it the programmer can usually ignore the differences between the subcategories in this model [2].

The second memory model is the Distributed Memory Parallel (DMP) model. These systems are made up of multiple independent computers connected by a network. These machines can then be used to collaboratively execute programs. In this case, every individual machine has its own memory and its portion of the data needed for the program. The machines can communicate and exchange data by passing messages over the network. The most common tool to use with these types of systems is the Message Passing Interface (MPI). If the system is made up of SMP machines then OpenMP can be used locally alongside MPI in order to exploit the potential computational power of each machine in the system [2].

The third memory model is derived from the first two and is called Distributed Shared Memory Parallel (DSM). These are distributed systems which allow every node to access the local memory of any other machine in the system. In this case the distributed memory is modeled as one big shared memory which is accessed non-uniformly. The communication scheme across the network can also ensure cache memory (or local memory) coherence, so that changes to the same data are recorded in every machine that has a copy of said data. Also, heterogeneous architectures which consist of SMP systems augmented with co-processors or accelerators are categorized under this model since multiple address spaces can be in use in these systems [11]. OpenMP can be used with DSM architectures if the memory sharing is guaranteed by the system. Moreover, OpenMP specification 4.0 (and later) provides explicit support for heterogeneous systems due to their popularity [18].

    The three memory models are illustrated in Figure 4 below.

Figure 4: Illustration of the three memory models. (a) depicts an SMP model, (b) is a DMP model and (c) is a DSM model. The figure was adapted from Figure 1.2 in [2].

    2.2.2 OpenMP’s programming model

The OpenMP specification has changed significantly over the years, with new features that support different architectures being constantly added. That being said, the core programming model is kept constant in order to allow for backward compatibility and easy portability. This model, along with the general terminology used with OpenMP applications and a brief introduction to some of its directives, is described below.

The programming model OpenMP uses is characterized by the shared memory parallelism which is assumed in OpenMP applications. The model enables the creation and high-level control of threads, which are streams of instructions assigned to processors. In OpenMP, threads are created and destroyed when needed using what is called the fork-join model, illustrated in Figure 5 below. In this model the program is always executed by a single thread initially and at the end. This thread is called the initial thread. More threads are created, and work is distributed among them, when a compiler directive indicating the beginning of a parallel region is encountered. This is referred to as the fork. Within the parallel region the initial thread is called the master thread. When the corresponding directive indicating the end of the parallel region is encountered, all the threads but the initial/master thread terminate. This is referred to as the join. The threads share the common address space but can, and usually do, have private data.

    In this process OpenMP allows the user to:

    • Create threads

    • Specify how work should be shared between the threads

• Define which of the variables in the scope are shared among threads and which should be private for each

• Synchronize threads

Figure 5: The fork-join model used in OpenMP. Fortran compiler directives that indicate the beginning and end of a parallel region are shown on the right.

All of these are done mainly through compiler directives. The one which is discussed further here is work-sharing control, since the rest are either intuitive or not necessarily done explicitly by the user (as in the case of thread synchronization).
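To make the fork-join description concrete, the following is a minimal, self-contained Fortran sketch of a single parallel region; it is a generic illustration, not code taken from the TEHL program.

PROGRAM FORK_JOIN_DEMO
   USE OMP_LIB                      ! OpenMP run time routines
   IMPLICIT NONE
   INTEGER :: TID

   PRINT *, 'Before the parallel region: only the initial thread runs'

   !$OMP PARALLEL PRIVATE(TID)      ! fork: a team of threads is created
   TID = OMP_GET_THREAD_NUM()       ! each thread queries its own id
   PRINT *, 'Hello from thread', TID, 'of', OMP_GET_NUM_THREADS()
   !$OMP END PARALLEL               ! join: only the master thread continues

   PRINT *, 'After the parallel region: back to a single thread'
END PROGRAM FORK_JOIN_DEMO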

    2.2.3 OpenMP’s work-sharing constructs

Work sharing is described within parallel regions, right before the block to be parallelized, through directives called work-sharing constructs. These include the loop construct, the sections construct, the single construct and the workshare construct in Fortran. The loop construct tells the compiler to distribute the iterations of the loop over the threads. The user can further guide how this should be done by specifying the method using optional clauses with the construct. Different work distribution methods can lead to a reduced imbalance between the threads at run time. The sections construct can be used to specify blocks that can be executed concurrently by multiple threads. This construct can be used to easily pipeline segments of a program. The single construct, as the name entails, is used inside parallel regions to limit the number of threads that can execute a block to one. Finally, the Fortran workshare construct is used with Fortran array operations to parallelize array manipulations.
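As a generic sketch (not code from the simulation), the four constructs could look as follows in Fortran; the array names and the tiny workloads are placeholders.

SUBROUTINE WORKSHARE_DEMO(A, B, N)
   IMPLICIT NONE
   INTEGER, INTENT(IN) :: N
   REAL(8), INTENT(INOUT) :: A(N), B(N)
   INTEGER :: I

   !$OMP PARALLEL PRIVATE(I) SHARED(A, B, N)

   !$OMP DO SCHEDULE(STATIC)        ! loop construct: iterations are split over the threads
   DO I = 1, N
      A(I) = SQRT(ABS(B(I)))
   END DO
   !$OMP END DO

   !$OMP SECTIONS                   ! sections construct: independent blocks run concurrently
   !$OMP SECTION
   A(1) = 0.0D0
   !$OMP SECTION
   B(N) = 0.0D0
   !$OMP END SECTIONS

   !$OMP SINGLE                     ! single construct: exactly one thread executes this block
   PRINT *, 'Loop and sections finished'
   !$OMP END SINGLE

   !$OMP WORKSHARE                  ! workshare construct: the array operation is shared out
   B = 2.0D0 * A
   !$OMP END WORKSHARE

   !$OMP END PARALLEL
END SUBROUTINE WORKSHARE_DEMO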

    2.2.4 Potential performance

In general, when N processors are used the execution time should reduce to 1/N of its original length. This represents the ideal situation, which is usually not achieved due to the overhead OpenMP adds to the execution of the program. Another reason is Amdahl's law, which states that the sequential part of a program dominates the execution time after a certain number of threads are used to execute the parallel part. This partitioning of the program is assumed because every program has parts that can be done concurrently and parts that must be done sequentially. Also, in big programs it can be difficult to identify every parallelizable part. Amdahl's law can be formulated as follows:

\[
S = \frac{1}{f_{par}/P + (1 - f_{par})}, \tag{17}
\]

where f_par is the fraction of the code which has been parallelized, P is the number of processors used and S is the expected speedup. This, therefore, places a limit on the linear speedup obtained when increasing the number of processors executing the code, as shown in Figure 6.

Figure 6: The effects of Amdahl's law illustrated for different fractions of parallelized code.

That being said, the speedup can also be affected by other factors such as aggregate cache memory and problem size. The first is caused by using multiple processors, each with their own cache memory. This leads to a bigger portion of the program's data residing in the faster cache memory. The latter is due to increasing the time a single processor spends on the parallelized region of the code, which then leads to a significantly smaller increase in the execution time of the same code when executed by a team of threads. With these two factors, super-linear speedups (where the speedup is greater than the number of processors used) can be encountered.

    2.2.5 Related work

Many examples from published literature examine the performance gained by parallelization using OpenMP in different areas of science. These articles were examined to create a base for the approach taken in this thesis. This section briefly summarizes the interesting points from the articles considered.

Reference [17] reports an attempt to parallelize FEAP, an open source finite element analysis program written by members of the department of civil and environmental engineering at the University of California at Berkeley. The authors targeted the subroutine which required the most execution time. Parallelization was done by splitting up the iterations of the main loop over a number of threads. The performance was analyzed with respect to the problem size (in this case the number of elements) and the number of threads. Increasing the problem size in this case led to an overall slower execution time for every number of threads. The trend in the speedup observed was linear, and equal to the number of threads used, up to eight threads, and sub-linear for 16 and 24 threads.

Reference [10] attempts to improve the performance of ClamAV, an open source anti-virus software, by partially parallelizing it using OpenMP. This work targeted the string search functions, which represented 52.86% of the execution time as reported in the paper. The functions were called from within loop nests to cover a collection of files, and thus the loop work-sharing construct was used. The analysis looked at the effect of using a different number of threads and of varying the method used to distribute the iterations of the loops over the threads. The speedups observed were always sub-linear and less than the number of threads used. The best speed-up was around 2.6 and occurred when using four threads.

Reference [12] explored the parallelization of the Modular Transport 3 Dimensional Multispecies Transport Model (MT3DMS) using OpenMP. Like the previous examples, this one also attempted to parallelize the most time consuming parts of the software. The segment modified represented 96% of the execution time. The analysis looked at the relationship between the number of threads and the speed-up obtained. The speedup was sub-linear for every thread count exceeding one and the maximum speed-up obtained was 4.15 at eight threads.

Reference [19] is about parallelizing a DNA sequence search algorithm. This reference examines different tiling methods to work around the data dependencies in the model and exploit possible parallelism. The work does not look at the effects of varying the problem size or the thread number. Also, it is not clear whether four or eight threads were used in this work since it does not mention if the Intel platform used supports hyper-threading. That being said, the speedup obtained decreased for bigger tile sizes and its maximum value was 7.5.


3 Methodology

This section of the thesis describes the tools used to achieve the results presented in the later sections. This information can be used, if access to the original code is possible, to recreate these results. Also, this section documents the approach taken to identify samples of performance data considered representative of the worst case total execution time of the program. Terms such as "Simulation Test Cases" and "Reference Time Steps" are related to the aforementioned approach and are defined and explained here as well.

    3.1 Tools

The platforms used to collect the performance data of the TEHL simulation are documented under the "Hardware" header. The threading of the code was also aided by several software tools, and those are described under the "Software" header.

It should be noted that in this section, and the rest of the thesis for that matter, the term "Processor" refers to the complete single chip die while the term "Core" refers to a single processing unit inside the chip. Therefore, a processor can contain multiple cores but not vice versa.

    3.1.1 Hardware

This project used two different machines to assess the performance of the code. The first was an Intel based personal desktop with an i7-7800X processor, which has 6 physical cores running at a base frequency of 3.5 GHz. The second was a single node of a supercomputer on Scania's premises. The node is Intel based with two E5-2690 V2 processors, which have 10 physical cores each and run at a base frequency of 3.0 GHz. Both processors support Hyper-threading, which means each physical core can be used as two logical cores. This means 12 and 40 logical cores on the desktop and the supercomputer node respectively. The specifications of the two machines are summarized in Table 1 below.

Property                    Desktop computer            Supercomputer node
Processor(s)                1 x Intel Core i7-7800X     2 x Intel Xeon E5-2690 V2
Number of physical cores    1 x 6                       2 x 10
Base frequency              3.5 GHz                     3.0 GHz
Maximum Turbo frequency     4.0 GHz                     3.6 GHz
Hyper-threading             Supported                   Supported
L3 cache memory size        8.25 MB                     25 MB

Table 1: Specifications highlights of the two computers used. Obtained from [3] and [5].

The logical cores added by Hyper-threading can lead to inaccurate performance results if one is not careful with how the threads are distributed over the hardware. This was dealt with by using the affinity interface offered by the Intel run time library, which is automatically linked by the compilers used [4, 6]. More information about this can be found in Section 5, which compares the performance obtained when using 1 thread per core and 2 threads per core.

The approach taken to ensure that the clock frequencies did not fluctuate significantly during a slightly periodic workload was to set the computer to high-performance mode. It was assumed that this way the clock frequency was maintained around the highest value attainable by the system, which is usually somewhere between the base frequency and the "Maximum Turbo frequency". Again, this was easy to change on the personal computer but not on the node. Luckily, however, the supercomputer node is always set to high-performance mode.

    3.1.2 Software

The software tools used in this project consist of compilers, APIs and a code profiler. The compiler used was Intel's Fortran Compiler (ifort), which was available for both Windows and Linux. The compiler implements OpenMP, which in turn is one of the APIs used in this project. The second API used was Intel's Math Kernel Library (MKL), which provides easy to use and highly optimized implementations of complex mathematical operations and algorithms. MKL was used to implement the Fast Fourier Transform (FFT) approach used to calculate the elastic deformation. Finally, the code profiler used was Intel's VTune Amplifier.

Since two computers were used, different versions of the tools were used as necessary. For instance, different versions of ifort were used. This is due to having access to different licenses on each machine. Also, the two compilers implement different OpenMP standards and therefore they differ in the version of the OpenMP API used. VTune Amplifier was not installed on the supercomputer node since it was not included in the license package of the tools used on that machine. The main software differences between the two machines are summarized in Table 2. This table also serves to highlight the specific versions of the main tools used.

Property            Desktop computer                Supercomputer node
Operating system    Windows 10 Enterprise           Red Hat Enterprise Linux release 6.4
Fortran compiler    ifort 19.0                      ifort 12.0
OpenMP standard     OpenMP 4.5 and 5 (partially)    OpenMP 3.1
VTune Amplifier     2019 update 3                   Not installed
Intel's MKL         2019 update 3                   2019 update 3

Table 2: Software versions used on the two machines.

The differing compilers meant that some of the newer Fortran language features could not be included in the project, so as not to affect the portability of the source code between the two machines. Therefore this was consciously avoided while making changes to the program. Similarly with OpenMP, most of the features added between the 3.1 and 4.5 standards had to be ignored. None of the new OpenMP features seemed necessary for this project, although it could be interesting to explore the explicit vectorization support introduced in OpenMP 4.5.

    Intel’s VTune Amplifier was selected because it is easy to use and provides a comprehensiveanalysis of the code in terms of hotspots, hardware utilization, threading efficiency and more.The hotspots and threading efficiency analyses were used the most during this project. Analyzinghotspots identifies areas worth parallelizing and analyzing threading efficiency analysis enablesfine tuning the work distribution among threads. Most of the performance data presented in thefollowing sections is collected using VTune Amplifier.

Finally, and for the sake of comprehensiveness, the Integrated Development Environment (IDE) used to debug and edit the code was Microsoft's Visual Studio 2017. Again, it was mainly selected due to its accessibility and because it integrates easily with ifort 19.0, which was the compiler used during the development of the code.

    3.2 Performance data collection

Since a single simulation can take more than a week of execution time on a single thread, it was decided that the performance of smaller parts of the simulation should be examined instead. The selected parts had to accurately represent the performance characteristics of the entire simulation so that improvements to the execution time of these parts would directly correspond to that of the simulation.

In the case of this thesis it was intuitive to break the simulation into smaller parts, since the "Time dependent solution" part of the simulation calculates a set of solutions that correspond to a set of time steps, as shown in Figure 3. Running the program to solve for a single time step required some changes to the code. This is described briefly under the "Single-step mode" header. The "Simulation Test Cases" subsection explains how the input file can affect the coverage of the code significantly. Finally, this section ends by describing the selected time step and the reasoning behind this selection under the "Reference Step" header.

    3.2.1 Single-step mode

Creating the single time-step solver (referred to henceforth as "Single-step mode") required understanding the data dependency between one individual time step and another. In this program, a time step uses data calculated in the previous and the second to previous time steps, in addition to the lubrication offset height calculated in the time independent part of the simulation. A simplified example of this is presented in Figure 7 below, which shows the "Time dependent solution" part of Figure 3.

Figure 7: Illustration of the data dependency between time steps.

In the figure, the subscripts "1" and "2" refer to data calculated in the previous and second to previous time steps respectively, while the subscript "0" refers to data calculated in the time independent part of the simulation. The data from the second to previous time step is used to calculate the right hand side residuals of Reynolds' equation (Equation (1)) when using a different time dependence method than the one shown in Equation (1).

The figure does not represent the approach taken to determine all the dependencies between the time steps and serves as an illustration only. In reality, the paths that can be taken through the code during a single step were examined and all the uninitialized data used (that is, data normally written in a different time step) was listed. Therefore, a good understanding of the underlying mathematical model was not necessary to achieve this.

Collecting the required data was the next step in making this mode. This was straightforward and involved outputting all the identified data to files once each time step finishes executing. This created another mode for the program, which can be called "data collection" mode. The entire simulation had to run completely in data collection mode before single-step mode could be used. To collect the needed data quickly, a partially parallelized version of the program was used. This version is highlighted later in Section 4.2.

Finally, after identifying and obtaining the dependency data, the single-step mode was as simple as reading the input parameters, reading the dependency data, jumping to the beginning of a time dependent time-step solution and letting the program continue until the step is completed, at which point a flag is checked to determine whether single-step mode was selected, in which case the program terminates. This is illustrated in Figure 8.

Figure 8: Illustration of the code path taken when single-step mode is selected. The "Time independent solution" section is minimized except for the "Start" and "Read input setup file" processes.

    3.2.2 Simulation Test Cases

As shown in the first illustration of the simulation in Figure 3, as well as in Figure 8, an input file is read initially. This file contains around a hundred input parameters which allow the user to customize the analysis to some extent. For example, through these parameters the resolution of the finite grid is defined, as well as the speed the asperity moves at inside the contact area. These two properties directly affect the execution time, since the first controls the size of the model and the second controls the number of time steps in the time dependent solution. Some of the other input parameters affect which parts of the code are visited in a given simulation. This can indirectly affect execution time since some of the more costly subroutines are optional or have multiple possible paths through them with varying lengths. For example, the entire dynamic thermal model is optional and is only enabled if the respective selector input parameter is set to a specific value. Another example is shown in Listing 1, where the parameter Geom determines whether the first path or the second path is taken through the subroutine, with the latter having a higher potential execution time since it contains triple the amount of nested DO loops. The main workload in the loops is truncated for the sake of clarity, but it should be noted that at the last level all the nested loops have similar workloads.

IF (Geom .EQ. 2 .OR. Geom .EQ. 3 .OR. Geom .EQ. 6) THEN
   DO J=1,NN,SS                     !====== Path 1 ======
      DO I=1,NX,SS
         ...
         DO L=1,NYs,SS
            ...
            DO K=1,NX,SS
               ...
            END DO
         END DO
         ...
      END DO
   END DO                           !====== End of Path 1 ======
ELSE
   DO J=1,NN,SS                     !====== Path 2 ======
      DO I=1,NX,SS
         ...
         DO L=-NX+J-1,1-SS,SS
            ...
            DO K=1,NX,SS
               ...
            END DO
         END DO
         DO L=NYs+SS,NX+J,SS
            ...
            DO K=1,NX,SS
               ...
            END DO
         END DO
         ...
      END DO
   END DO
   DO J=1,NN,SS
      DO I=1,NX,SS
         ...
         DO L=-(J-1),NYs-J,SS
            ...
            DO K=1,NX,SS
               ...
            END DO
         END DO
         ...
      END DO
   END DO                           !====== End of Path 2 ======
END IF

Listing 1: A shortened segment of the original elastic deformation calculation subroutine demonstrating the potential influence of input parameters on the execution time.

A "Simulation Test Case" is completely defined by an input file. The test cases were provided by the author of the code and they are simulation set-ups used by him either to recreate results found in the literature or to create original results for his research. Since blindly testing every possible combination of inputs is not feasible, the tests in this thesis focused on some of these test cases. Initially, one that does not use the thermal model was profiled and then used to create the initial parallel version of the code. Later, a test case which uses the thermal model, and which was being used by the author of the code to write the recently published paper (at the time of writing this report) cited as Reference [9] in this thesis, was taken as a comprehensive and representative test case and was used to create the final threaded version of the code. The selected test case is described further in the following subsection.

    3.2.3 Reference Step

As mentioned in the introduction to Section 3.2, the simulation was split into smaller parts of a single time step each. Some of these time steps were then studied as samples with behaviour representative of that of the full simulation. Not every time step is interesting to study, since some of them have a very short execution time due to various reasons. For example, the first few time steps of simulations with small localized asperities seem to have short execution times, since during those steps the asperity would still be behind the high pressure contact area and close to the edge of the simulated surface. This means that the input pressure would significantly resemble a correct output since no surface disturbance is introduced yet.

A good reference time step should ideally have the worst possible execution time in a simulation. This way, an upper bound on the execution time of the simulation can be placed given the total number of time steps in the simulation. Also, improvements to the execution of a long time step are easier to observe and measure, especially since speed-ups due to multi-core parallelism can be severely capped by Amdahl's law when the parallelized segment of the program is small, as shown in Figure 6.

Given the aforementioned reasons, the selected time step was a step that does not lead to a convergent solution. This means the program depletes all the allowed iterations attempting to find a solution that satisfies the convergence conditions without succeeding. However, the result is usually close enough to what is expected, and the divergence is usually due to the very conservative tolerance limits set for the entire simulation. Such steps have the longest execution times in the simulation since all the parts of the program that contribute to finding a solution are executed more often. Therefore, this selection was intuitive since a failed time step is very close to the ideal reference time step described above. The selected time step was the 423rd out of 514 of a test case which looks at the effect of a single asperity surface detail on the pressure, temperature and lubrication thickness of a TEHL contact area. This information is not intended to be comprehensive, since describing the step and the test case properly would require going into technicalities which are out of the scope of this report. However, it can still help any result recreation effort if the original code along with the test case setup file are obtained.


4 Execution

This section attempts to capture the exploratory nature of this study, introduced by the incremental parallelization of the code. The performance of the code is presented before, during and after the parallelization effort. The last part of this section documents the FFT elastic deformation implementation and presents the performance improvement which resulted from it. The code is segmented into subroutines, which are the smallest parts referred to in the analyses shown below. Unless stated otherwise, all of the following performance data is collected using Intel's VTune Amplifier from the reference time step described in Section 3.2.3 running on the "Desktop computer" machine described in Section 3.1.1. It should also be noted that all sequential optimizations carried out on the code, except for the FFT approach, are included in the initial performance analysis and not documented any further in this report, because they were deemed outside the scope of the study.

    4.1 Initial performance

The initial code was profiled to determine the starting performance, in terms of execution time, and the biggest hotspots in the code. The summary of the analysis is presented in Table 3. The format of this table is used throughout this section to document the analyses conducted on the code. The left segment of the table displays execution time information while the right segment illustrates the concurrency of the code at the current parallelization step. The execution time information is split into the "total wall time", which represents the total elapsed time from the beginning of the execution until its conclusion, and the parts of this total spent on the five biggest serial hotspots in the program at the current parallelization step. The serial hotspots represent the demanding parts of the program that are executed using a single core only. Therefore, adding up the wall time spent in the serial hotspots gives the amount of wall time spent executing the sequential parts of the program and not the total execution time. This difference is not apparent now but will become so in the later sections.

Serial hotspots       Wall time
VI                    908.563 s
TEMP_CALC_METAL       414.924 s
libm_powf_l9          104.725 s
TEMP_CALC_IJ           43.319 s
LUBRICATION_TEMP       41.749 s

Total wall time      1787.076 s

CPU histogram: (not reproduced)

Table 3: Summary of the initial code analysis.

As shown in the table, the subroutine VI represented around 51% of the execution time initially. This subroutine calculates the elastic deformation, caused in the contact area by the high pressure, using the direct summation method. The rightmost term in Equation (16) shows the calculation method described. For more information about the direct summation method for calculating the elastic deformation, the reader is referred to Reference [1].

    Following VI, the 2nd, 4th and 5th highest execution times are caused by temperature model subroutines. These subroutines are optional and only used when the dynamic temperature model is selected. This means that VI would represent an even larger share of the execution time in test cases that do not use this model. Therefore, VI was the starting point of the parallelization effort.

    Finally, it should be noted that __libm_powf_l9 is one of the implementations of the floating-point power intrinsic function used in Fortran, included in a standard math library linked automatically by ifort. The source of this implementation is not accessible, and for that reason the approach taken to address it was slightly different.


    4.2 Parallelizing VI

    Following the initial analysis results, VI was identified as the most significant hotspot in the program. To parallelize this subroutine, its source code had to be examined. Listing 1 in the previous section showed most of the subroutine, shortened for readability. Listing 2 below highlights one of the biggest nested loops in the subroutine, with the workload in the loops shown. While this loop is not necessarily executed, it is enough to demonstrate how this subroutine was parallelized.

    IF(Geom .EQ. 2 .OR. Geom .EQ. 3 .OR. Geom .EQ. 6) THEN
       ...
    ELSE
       DO J=1,NN,SS             !====== Path 2 ======
          DO I=1,NX,SS
             H0=0.0
             DO L=-NX+J-1,1-SS,SS
                LL=abs((J-L)/SS)
                DO K=1,NX,SS
                   IK=IABS(I-K)/SS
                   H0=H0+AK(IK,LL)*(P_line(K))
                END DO
             END DO
             DO L=NYs+SS,NX+J,SS
                LL=abs((L-J)/SS)
                DO K=1,NX,SS
                   IK=IABS(I-K)/SS
                   H0=H0+AK(IK,LL)*(P_line(K))
                END DO
             END DO
             Wside(i,j)=H0
          END DO
       END DO
       ...
    END IF

    Listing 2: Shortened segment of the original elastic deformation calculation subroutine showing one of the nested loops.

    Using OpenMP to parallelize such a segment is trivial, since every iteration of the topmost loop is independent from all other iterations. This means that, when multiple threads are used, each thread can execute a different iteration of the topmost loop concurrently. In OpenMP terms, this is easily done by creating a parallel region and using the loop work-sharing construct. This is demonstrated in Listing 3 below.

    IF(Geom .EQ. 2 .OR. Geom .EQ. 3 .OR. Geom .EQ. 6) THEN
       ...
    ELSE
    !$OMP PARALLEL DO &
    !$OMP& IF( use_multiple_cores ) &
    !$OMP& PRIVATE(J,I,L,LL,K,IK,H0) &
    !$OMP& SHARED(NN,SS,NX,NYs,P_line,Wside,AK)
       DO J=1,NN,SS
          DO I=1,NX,SS
             H0=0.0
             DO L=-NX+J-1,1-SS,SS
                LL=abs((J-L)/SS)
                DO K=1,NX,SS
                   IK=IABS(I-K)/SS
                   H0=H0+AK(IK,LL)*(P_line(K))
                END DO
             END DO
             DO L=NYs+SS,NX+J,SS
                LL=abs((L-J)/SS)
                DO K=1,NX,SS
                   IK=IABS(I-K)/SS
                   H0=H0+AK(IK,LL)*(P_line(K))
                END DO
             END DO
             Wside(i,j)=H0
          END DO
       END DO
    !$OMP END PARALLEL DO
       ...
    END IF

    Listing 3: Parallelization of one of the nested loops in VI using OpenMP.


    OpenMP's syntax has not been formally described previously, since it was deemed unnecessary for the reader to have a comprehensive overview of it. Instead, OpenMP constructs are described whenever they are encountered in this section. The first thing worth noting is that OpenMP directives are marked by "!$OMP" at the beginning of the line. This sentinel is what the compiler looks for if it implements OpenMP; otherwise, the directives are treated as comments and ignored. Next, the directive "PARALLEL DO" is called a combined parallel work-sharing construct: it both starts a parallel region and defines how the work should be shared among threads. The IF clause controls whether the parallel section is enabled or disabled. The latter means no threads are created to execute the section and, therefore, the section is executed sequentially by the original master thread. The "Private" and "Shared" clauses define which of the variables in the following parallel section are private to each thread and which are shared among threads. As shown, the loop indices are private while the arrays used are shared in this case. OpenMP guarantees that the ranges of topmost-loop indices assigned to the threads do not overlap. This way each thread has exclusive access to different elements of the shared arrays, since they are accessed through the loop indices. Finally, the end of the parallel region is marked by an "END" directive, which marks the point at which the threads synchronize and join again [2].
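    As a compact, self-contained illustration of these clauses (the array size and the loop workload below are placeholders and are unrelated to the TEHL code), the same directive can be applied to a generic loop nest:

    PROGRAM omp_clause_demo
       USE omp_lib
       IMPLICIT NONE
       INTEGER, PARAMETER :: N = 1000
       REAL    :: W(N), H0
       INTEGER :: I, K
       LOGICAL :: use_multiple_cores

       use_multiple_cores = .TRUE.

    !$OMP PARALLEL DO &
    !$OMP& IF( use_multiple_cores ) &
    !$OMP& PRIVATE(I,K,H0) &
    !$OMP& SHARED(W)
       DO I = 1, N
          H0 = 0.0
          DO K = 1, N
             H0 = H0 + REAL(I + K)   ! stand-in for the summation workload
          END DO
          W(I) = H0                  ! each thread writes distinct elements of W
       END DO
    !$OMP END PARALLEL DO

       PRINT *, 'W(1) =', W(1), ' max threads =', omp_get_max_threads()
    END PROGRAM omp_clause_demo

    If use_multiple_cores is set to .FALSE., the IF clause disables the parallel region and the loop is executed by the master thread alone, exactly as described above.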

    The VI subroutine contains more nested loops, but all of them have the same general structure. The parallelization illustrated in Listing 3 was applied to each of the remaining loops, which created the first parallel version of the code. For the reader's reference, this also represents the partially parallelized version of the code mentioned in Section 3.2.1 above and used with the data-collection mode.

    The analysis following this change is summarized in Table 4. As this is the first analysis that uses multiple cores, it should be mentioned that the maximum number of threads that can be used during the execution was kept at the default value defined by the OpenMP implementation, which in this case is the total number of physical cores in the system. As shown, this simple change cut around 750 seconds from the execution time. VI now runs concurrently and takes 136.9 seconds to complete, which matches the amount of time the program spends using 6 cores simultaneously. The other serial hotspots still have execution times similar to those shown initially; however, TEMP_CALC_METAL now represents around 40% of the execution time, compared to 23% initially. Therefore, the next logical step was to parallelize this subroutine.

    Serial hotspots        Wall time
    TEMP_CALC_METAL         410.993s
    __libm_powf_l9          104.197s
    TEMP_CALC_IJ             45.327s
    LUBRICATION_TEMP         43.150s
    MET_TEMP_UPD             37.295s
    Total wall time        1038.618s

    Table 4: Summary of the analysis of the code with parallel VI.

    4.3 Parallelizing TEMP_CALC_METAL

    Similar to VI above, most of the time spent in TEMP_CALC_METAL is within nested loops. The structure of the big nested loop, which represents the majority of the subroutine, is illustrated in Figure 9 below. The exact DO-statements are included in the figure to show the weight of the different parts of the loop. It should be noted that the variables NN_n and fini are proportional to the size of the problem, defined by the resolution of the contact area discretization. n_met is the number of nodes within the metal of one of the surfaces, and it is usually hard-coded to 39. SS represents the step size and is used by the multigrid numerical method employed during the initial steps of the simulation.

    Figure 9: The nested loop in TEMP_CALC_METAL.

    The l, Jm and K-index loops are all easily parallelizable, since their iterations can be executed out of order and still produce the same results: every iteration is independent of the others and uses only data calculated before the beginning of the loop or within that iteration. Using the loop work-sharing construct can yield different results depending on the level selected for the parallelization. The two options in this case are either parallelizing the l-index loop or parallelizing both the Jm and K-index loops. The former would split the block of code between at most two threads, since the l-index loop contains only two iterations and the loop work-sharing construct distributes whole loop iterations to threads. The latter can produce work for more threads, since the resolution of the discretization grid was large enough to require more than 10 iterations in every test case examined.

    A different approach, which uses nested parallelism, is also possible: parallelizing both the l-index loop and the Jm and K-index loops. This way two threads are created to execute the l-index loop, and each of them then creates more threads to execute its share of the Jm and K-index loops. This can be inefficient due to the repeated starting and stopping of parallel regions, which is accompanied by the execution overhead introduced when forking and joining threads. This was quickly implemented, and the resulting execution time was around 20% higher than the one obtained with the parallelization method described below.
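    As an illustration of the shape of this rejected nested variant only (the array, the loop bounds and the loop body below are placeholders, not the actual TEMP_CALC_METAL code):

    PROGRAM nested_parallel_demo
       USE omp_lib
       IMPLICIT NONE
       INTEGER, PARAMETER :: NN_n = 64        ! placeholder grid dimension
       REAL    :: T(NN_n, 2)
       INTEGER :: l, Jm

       CALL omp_set_nested(.TRUE.)            ! allow the inner regions to fork their own teams

    !$OMP PARALLEL DO &
    !$OMP& PRIVATE(l,Jm) NUM_THREADS(2) &
    !$OMP& SHARED(T)
       DO l = 1, 2                            ! one iteration per surface
    !$OMP PARALLEL DO PRIVATE(Jm) SHARED(T,l)
          DO Jm = 1, NN_n
             T(Jm, l) = REAL(Jm + l)          ! stand-in for the temperature update
          END DO
    !$OMP END PARALLEL DO
       END DO
    !$OMP END PARALLEL DO

       PRINT *, SUM(T)
    END PROGRAM nested_parallel_demo

    Each of the two outer threads repeatedly forks and joins an inner team, which is exactly where the overhead mentioned above comes from.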

    The code was parallelized as illustrated in Figure 10 below. The parameters of the "Private" and "Shared" clauses were removed for clarity. This introduces two clauses not used before: the "Firstprivate" clause and the "Reduction" clause. The first is used to initialize private data, since according to the OpenMP standard private variables are undefined upon entry to a parallel region [2]. This was necessary because the original value of K_oil, determined outside the loop, was used within the loop in some cases and rewritten in others. The "Reduction" clause is used with commutative and associative mathematical operations that occur recurrently within a loop [2]. In this case the sequential code guaranteed that dt_lim was minimized by checking whether a new minimum had been calculated at the end of every iteration. This is done in the parallel version using the OpenMP-defined "min" reduction operation, which compares the final private values of dt_lim from every thread and obtains the minimum. The performance of the code after this parallelization step is shown in Table 5 below.
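    As a self-contained sketch of how these two clauses combine (only the names K_oil and dt_lim are taken from the text above; the array, loop bounds and arithmetic are placeholders, not the actual TEMP_CALC_METAL loop):

    PROGRAM firstprivate_reduction_demo
       IMPLICIT NONE
       INTEGER, PARAMETER :: NN_n = 128, n_met = 39   ! placeholder problem sizes
       REAL    :: T_met(NN_n, n_met)
       REAL    :: K_oil, dt_lim
       INTEGER :: Jm, K

       K_oil  = 0.145           ! value computed before the loop and needed inside it
       dt_lim = HUGE(dt_lim)    ! running minimum of the time-step limit
       T_met  = 0.0

    !$OMP PARALLEL DO &
    !$OMP& PRIVATE(Jm,K) &
    !$OMP& FIRSTPRIVATE(K_oil) &
    !$OMP& REDUCTION(min:dt_lim) &
    !$OMP& SHARED(T_met)
       DO Jm = 1, NN_n
          DO K = 1, n_met
             T_met(Jm, K) = K_oil * REAL(Jm + K)        ! stand-in calculation using the copied-in value
             dt_lim = MIN(dt_lim, 1.0 / REAL(Jm * K))   ! per-thread minimum, combined at the join
          END DO
       END DO
    !$OMP END PARALLEL DO

       PRINT *, 'dt_lim =', dt_lim
    END PROGRAM firstprivate_reduction_demo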

    Serial hotspots        Wall time
    __libm_powf_l9           98.407s
    TEMP_CALC_IJ             41.931s
    LUBRICATION_TEMP         39.756s
    MET_TEMP_UPD             36.502s
    EPSILON_DERIVATIVE       33.298s
    Total wall time         712.782s

    Table 5: Summary of the analysis of the code with parallel TEMP_CALC_METAL.

    Figure 10: The nested loop in TEMP_CALC_METAL post parallelization.

    This step reduced the execution time by about 300 seconds. The program now runs sequentially for around 60% of the execution time, down from around 80% at the previous parallelization step. The costliest serial subroutine is now the floating-point power function from the standard math library, which cannot be parallelized in the usual way.

    4.4 Addressing __libm_powf_l9

    Since this subroutine cannot be threaded using OpenMP directives, the approach taken was to run the segments of the code that use this function in parallel. This means that independent power calculations, which would normally occur sequentially one after the other, are done concurrently on multiple cores. So instead of speeding up the subroutine, multiple instances of it are run in parallel to reduce its weight on the execution time of the program. To achieve this, the parts of the code that use __libm_powf_l9 had to be identified. This was done with the assistance of VTune Amplifier, since it provides a detailed caller/callee report as part of the code analysis. Following this, the source files of the caller subroutines were examined to identify possible parallelization approaches.

    The results of the caller/callee report that concern __libm_powf_l9 are summarized in Table 6 below. The caller/callee relationship is shown as a tree in which the child nodes are the callers and the parent nodes are the callees of a given node. The root node is __libm_powf_l9, and the tree is expanded until common callers are reached on every branch. The branches that are not followed represent an insignificant fraction of the subroutine's execution time. The interesting thing to note is that all significant branches eventually lead to LUBRICATION_TEMP. For that reason, LUBRICATION_TEMP's code had to be examined more closely.

    The interesting part of LUBRICATION_TEMP is the location of the calls to the subroutines that contain (or in turn call subroutines that contain) floating-point power calculations. These are TEMP_CALC_IJ, EDA_CALC and NEWTONIAN. This is shown in Figure 11, which highlights that TEMP_CALC_IJ is called from within a nested do-loop, while the other two subroutines are called once, outside the loops. This means that, intuitively, the nested loop should be parallelized in order to perform some of the floating-point power calculations concurrently, which should cut back around a third of the total execution time of __libm_powf_l9. For the other two thirds, EDA_CALC and NEWTONIAN must be examined and parallelized accordingly.


    Callee / Caller tree              Percentage   Wall time
    __libm_powf_l9                      100.0%      98.407s
      EPSILON_DERIVATIVE                 37%        36.410s
        TEMP_CALC_IJ                     36.5%      35.948s
          LUBRICATION_TEMP               36.5%      35.948s
        POISEUILLE_INCREMENT              0.5%       0.462s
      EDA_CALC                           31.8%      31.309s
        LUBRICATION_TEMP                 31.8%      31.309s
      NEWTONIAN                          31.2%      30.688s
        LUBRICATION_TEMP                 30.8%      30.337s
        HREE                              0.4%       0.351s

    Table 6: Caller/callee report for __libm_powf_l9. Indentation denotes the caller tree: each indented entry is a caller of the entry above it.


    Figure 11: The calls to the demanding subroutines in LUBRICATION_TEMP.

    The outer iterations in this nested loop were independent from one another, like those of every nested loop described so far, so parallelizing it was straightforward. What sets this loop apart, however, is the potentially varying workload of the iterations of the inner loop. This is due to the if-statement, which can bypass most of the body of the loop if its predicate evaluates to true. OpenMP provides tools that can be used in this case to ensure a better workload balance between threads: the schedule types, which are set using the "Schedule" clause. OpenMP provides three different schedule types, called Static, Dynamic and Guided. As its name implies, the first schedule splits the loop into chunks of iterations that are as equal as possible and assigns each chunk to a thread. The other two schedule types assign smaller chunks of iterations to each thread initially and hand out the remaining iterations to the threads that finish their initial shares, on a first-come, first-served basis. The last two can reduce work imbalance, but they come with more execution overhead due to assigning work at run-time. This overhead is worth it in some cases, since severe imbalance between threads can lead to accidental serialization of parallel regions, which in turn leads to execution times higher than those of the pre-parallel code because of OpenMP's thread creation overhead [2]. Using a non-static schedule was deemed necessary in this case, and the OpenMP directives used are shown in Figure 12. The second parameter of the "Schedule" clause is the size of the initial chunks assigned to each thread, which in this case was set to a value proportional to the number of threads used, stored in the variable cores. The last interesting thing to point out is that this parallelization required three reduction operations: a summation, a maximization and a minimization. As shown, multiple reductions are expressed using multiple "Reduction" clauses.

    Figure 12: The parallelization of the nested loop that calls TEMP_CALC_IJ.
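    As a sketch of such a directive (the loop body and the chunk-size formula below are placeholders; only the idea of a dynamic schedule sized from cores and the three reductions follows the text above):

    PROGRAM schedule_reduction_demo
       USE omp_lib
       IMPLICIT NONE
       INTEGER, PARAMETER :: NX = 256, NY = 256    ! placeholder grid
       REAL    :: total, tmax, tmin, work
       INTEGER :: I, J, cores, chunk

       cores = omp_get_max_threads()
       chunk = MAX(1, NX / (4 * cores))            ! initial chunk size derived from the thread count
       total = 0.0
       tmax  = -HUGE(tmax)
       tmin  =  HUGE(tmin)

    !$OMP PARALLEL DO &
    !$OMP& PRIVATE(I,J,work) &
    !$OMP& SCHEDULE(DYNAMIC,chunk) &
    !$OMP& REDUCTION(+:total) &
    !$OMP& REDUCTION(max:tmax) &
    !$OMP& REDUCTION(min:tmin)
       DO I = 1, NX
          DO J = 1, NY
             IF (MOD(I + J, 3) == 0) CYCLE         ! uneven work per iteration, as in the real loop
             work  = SIN(REAL(I)) * COS(REAL(J))
             total = total + work
             tmax  = MAX(tmax, work)
             tmin  = MIN(tmin, work)
          END DO
       END DO
    !$OMP END PARALLEL DO

       PRINT *, total, tmax, tmin
    END PROGRAM schedule_reduction_demo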

    Moving on to NEWTONIAN, a quick inspection of its code shows that it is segmented, by if-statements, into similar parts, each consisting of one big nested do-loop. The purpose of this arrangement is to calculate certain simulation parameters using different physical models described in the literature. The if-statements check the value of lub_param, an input parameter provided to the program by the user through the input file described earlier. Therefore, only one part of the subroutine is used during a given simulation run. The approach to parallelizing this subroutine was to parallelize every nested loop in every segment of the code. This was straightforward since, once again, the iterations of the outer loops do not depend on each other. One of the shorter segments is illustrated in Figure 13 below along with the OpenMP directives used with it. The two statements shown inside the loop are the ones that use the power operator "**".

    Now EDA_CALC remains. This is a short subroutine whose main body is a nested do-loop, so parallelizing the subroutine again meant parallelizing the nested loop. The loop did not require any clauses not encountered before, and nothing in the body of the subroutine is otherwise noteworthy. Therefore, it is not documented further.

    The performance after parallelizing these subroutines is shown in Table 7 below. This parallelization step led to a significant boost in performance because it does not only reduce the time spent in __libm_powf_l9, but also reduces the execution times of the three parallelized subroutines. The time spent executing the program serially is now around 30% of the total execution time, down from around 60% in the previous step. What is left is to finalize the parallelization by addressing the parts of the program that now appear on the top-five list. Note that ReadFile and for_get_s are not parts of the EHL code; they are implementations of other Fortran intrinsics.


    Figure 13: One of the segments of NEWTONIAN parallelized.

    Lastly, it can be noted that the usage of between 1 and 6 simultaneous cores is more significant now, which is due to some imbalance in the workload between the threads.

    Serial hotspots        Wall time
    MET_TEMP_UPD             37.799s
    LUBRICATION_TEMP         20.032s
    CP_CALC                  15.019s
    ReadFile                  2.337s
    for_get_s                 1.830s
    Total wall time         521.778s

    Table 7: Summary of the analysis of the code with concurrent usage of __libm_powf_l9.

    4.5 Finalizing the parallelization

    This section briefly documents how the remaining hotspots were addressed. The first subroutine tackled is MET_TEMP_UPD. Looking at its source, the subroutine consists of two alternative simple nested loops, selected by an if-statement based on the number of the time step currently being solved. Neither loop contains data dependencies between its outer iterations, and their parallelization required no new OpenMP clauses. The second subroutine, CP_CALC, consists of a single nested do-loop that executes some calculations. Again, parallelizing it only required the "Parallel Do" work-sharing construct, without any auxiliary clauses.

    The last hotspot on the list is LUBRICATION_TEMP again, or at least the remaining serial parts of it. These were not addressed in the previous parallelization step because the rest of the subroutine has nothing to do with __libm_powf_l9. What is not shown in Figure 11 are three other nested do-loops in the subroutine. Those loops only compute data without calling any external functions, and, like every loop parallelized so far, their outer iterations are independent. Parallelizing one of the loops required a couple of reduction clauses, but both kinds have been encountered before, so no new figure is introduced to illustrate this.


    The final performance achieved using parallelization is shown in Table 8 below. The time the program now spends in any of the serial hotspots is not significant enough to justify parallelizing them, since the execution overhead OpenMP adds would increase the total execution time in these cases. The performance shown could be improved further if a better workload balance between the threads were achieved: in an ideal case the program would log execution time under either 6 or 1 simultaneous cores only, which is not the case here. Balancing the workload in OpenMP is done using the "Schedule" clause mentioned previously. As also mentioned, the dynamic schedule types introduce more overhead and might do more harm than good if the imbalance is not significant enough [2]. Where dynamic schedules were applied to the identified regions of imbalance, they did lead to lower execution times. The remaining approach to dealing with the residual imbalance is rewriting parts of the code to better fit the default work-sharing mechanisms of OpenMP. This was not done in this study, however, and it is a good area for future work.

    Serial hotspots             Wall time
    ReadFile                       2.487s
    for_get_s                      1.703s
    __intel_avx_rep_memcpy         1.216s
    RES                            0.935s
    TEMP_CALC_METAL                0.561s
    Total wall time              497.809s

    Table 8: Summary of the analysis of the final threaded code.

    4.6 FFT VI

    This subsection documents the alternative implementation of the VI subroutine, which uses the Fast Fourier Transform (FFT) method instead of the Direct Summation (DS) method to calculate the elastic deformation. First, a justification for this attempt is given. Afterwards, the mathematics behind the method is briefly described. Following that, the implementation is outlined. Finally, the resulting performance enhancements are presented.

    4.6.1 Motivation

    So far in this section, only serial hotspots have been taken into consideration. "CPU hotspots" is another performance metric VTune Amplifier provides in its analyses. These hotspots represent the parts of the program that require the most "CPU time", which is the total collective time spent by all the cores used to execute a segment. The top five CPU hotspots of the final parallel version of the code are shown in Table 9 below.

    CPU hotspots                              CPU time
    VI$omp$parallel_for@138                   903.498s
    __kmp_fork_barrier                        346.769s
    EPSILON_DERIVATIVE                        346.285s
    __kmp_barrier_template                    312.307s
    TEMP_CALC_METAL$omp$parallel_for@373      266.505s
    Total CPU time                           2929.595s

    Table 9: Total CPU time and CPU hotspots of the final threaded version of the code.

    As shown in the table, the VI subroutine is still the heaviest hotspot even after parallelization, with around a third of the total CPU time spent in it. While it does not affect the total wall time as much, because it is well parallelized, it is still very computationally heavy and resource intensive. This was the main reason another implementation was sought. The FFT approach was selected because it has shown great promise in both execution time and scalability [1]. An efficient implementation should therefore not only reduce the CPU time spent on VI, but also reduce the performance penalty that comes with increasing the resolution of the simulated contact area.

    The strings appended to the names of the subroutines in the table show which specific block of code in a subroutine is the actual hotspot. In this case the blocks are equivalent to parallel regions, which are defined by the inserted OpenMP directives. Therefore, VI$omp$parallel_for@138 means that the parallel do-loop at line 138 of VI is the hotspot that required the shown amount of CPU time. This loop is the one shown earlier in Listing 3. What this loop does is described under the following sub-header.

    4.6.2 Background

    To understand why this approach can save CPU time, it is important to understand how the mathematics behind the two alternatives are related. This is done by showing how the DS method of calculating the elastic deformation resembles a circular discrete convolution, which in turn can be converted into a multiplication in the frequency domain using an FFT algorithm. This part was not included in Section 2 because it is not part of a general introduction to TEHL modeling.

    Isolating the elastic deformation surface integral from Equation 3 gives the following.

    w(x, y) = \frac{2}{\pi E'} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{p \, dx' \, dy'}{\sqrt{(x - x')^{2} + (y - y')^{2}}}    (18)

    In this case w is the profile of the elastic deformation over the surface. One way of representing the discretized version of this integral is the last term shown in Equation 16. However, it can also be represented as

    W(X, Y) = \sum_{L=-Y}^{N_y - Y} \sum_{K=-X}^{N_x - X} AK(|K|, |L|) \, P(K + X, L + Y)    (19)

    where W and P are the discrete elastic deformation and pressure profiles respectively, AK is the deformation kernel, Nx is the number of nodes on the X-axis minus one, and Ny is half the number of nodes on the Y-axis minus one. Only half the Y-axis nodes are taken because it is always assumed, in this model, that the problem is symmetric around a line half-way into the contact area in the Y domain. This representation is almost exactly what is used in the code to implement the DS method in VI, where every summation term is implemented using a do-loop, which creates the big nested loop with 4 levels shown in "Path 1" and in the second half of "Path 2" in Listing 1 from Section 3.2.2.

    By expanding the summation terms in Equation 19 it can be shown that the deformation kernel is simultaneously displaced over the pressure profile matrix and mirrored around its edges to calculate the deformation profile. For example, to calculate W(5, 6) the element AK(0, 0) is multiplied with P(5, 6), while P(4, 6) is multiplied with AK(1, 0), P(5, 5) is multiplied with AK(0, 1), and so on. This is, therefore, a standard element-by-element calculation of a two-dimensional circular convolution (illustrated in Figure 14 below). Alternatively, the frequency representation of the convolution of the two matrices can be obtained as the entry-wise product of the frequency representations of said matrices. According to [1], the DS method has a computational complexity of O(N^2), while the FFT method has a computational complexity of O(N \ln N), where N is the total number of nodes in the model (in this case N = Nx · Ny).
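    In symbols, the property being exploited is the discrete convolution theorem: writing \mathcal{F} for the two-dimensional FFT, \mathcal{F}^{-1} for its inverse and \odot for the entry-wise product, the deformation profile can be obtained, up to the padding and wrap-around ordering described under the next sub-header, as

    W = \mathcal{F}^{-1}\big( \mathcal{F}(AK) \odot \mathcal{F}(P) \big).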

    None of what is mentioned so far corresponds to the do-loop shown in Listing 3, which expands upon Equation 19 by also calculating the deformation due to the pressure profile outside of the contact area. This is needed because the original simulation code simulates one of two general types of surfaces: spheres, or cylinders of infinite length. The pressure acting on the parts of the infinite cylinder outside the simulated area in turn affects how the nodes closer to the edge of the simulation deform. In order to calculate this contribution, the pressure profile is expanded and multiplied with the parts of the deformation kernel that reach outside the simulation area. The influence of the rest of the cylinder is thus limited by the length of the deformation kernel, which is only mirrored a finite number of times. Figure 15 below illustrates this process. The rest of the implementation details are expanded upon under the following sub-header.


    Figure 14: Illustration of the circular convolution where the deformation kernel AK is mirrored and shifted over the pressure profile P to calculate elements of the elastic deformation profile matrix W.

    Figure 15: Illustration of the overlap between the circular deformation matrix AK and the extended pressure profile for infinite cylinders. Note that the pressure is only extended along one dimension (which represents the length in this case).

    4.6.3 Implementation

    Since some of the surface types require the expansion of the pressure profiles and some do not, it was necessary to split the implementation of the new subroutine the same way it was done in the original VI. Each path of the subroutine then carries out the following steps, which were adapted from [14]:

    1. Compute the size of the virtual domain of the calculation.

    2. Expand AK into ĀK with wrap around order and necessary zero padding.

    3. Expand P into P̄ with pressure extensions (if needed) and necessary zero padding.

    4. Apply FFT to ĀK and P̄ to get the complex matrices ÂK and P̂ respectively.

    5. Multiply ÂK and P̂ entry-wise to get Ŵ .

    6. Apply IFFT to Ŵ to get W̄ .

    7. Pick out the result of the convolution from W̄ to get W .

    The term "virtual domain" refers to the size of the matrices that hold the inputs, outputs and intermediate results. For example, since the deformation kernel must be made circular, the virtual domain must at least be big enough to fit this expanded matrix. The zero padding is used to make the sizes of the two matrices equal after the necessary extensions are made. For example, if the size of the circular deformation kernel ĀK is 2Nx · 2Ny while the size of the pressure matrix is Nx · Ny, then P is appended with three blocks of zeros of size Nx · Ny to double each of its dimensions. This was necessary because Fortran matrix operations were used to calculate the entry-wise product of the matrices. Another reason for the zero padding was to make sure the virtual domain was big enough to contain the result of the convolution.
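    As a minimal sketch of what steps 2 and 3 can look like (the array shapes, index bases and fill values below are placeholders, not the thesis implementation):

    PROGRAM pad_demo
       IMPLICIT NONE
       INTEGER, PARAMETER :: Nx = 8, Ny = 4            ! placeholder grid sizes
       REAL    :: AK(0:Nx, 0:Ny)                       ! kernel indexed by |dX|, |dY|
       REAL    :: P(1:Nx, 1:Ny)                        ! pressure profile
       REAL    :: AKbar(0:2*Nx-1, 0:2*Ny-1)            ! wrap-around (circular) kernel
       REAL    :: Pbar(0:2*Nx-1, 0:2*Ny-1)             ! zero-padded pressure
       INTEGER :: i, j

       ! Stand-in input values.
       DO j = 0, Ny
          DO i = 0, Nx
             AK(i, j) = 1.0 / REAL(1 + i + j)
          END DO
       END DO
       P = 1.0

       ! Step 2: expand AK with wrap-around order so the kernel is periodic
       ! over the virtual domain.
       DO j = 0, 2*Ny - 1
          DO i = 0, 2*Nx - 1
             AKbar(i, j) = AK(MIN(i, 2*Nx - i), MIN(j, 2*Ny - j))
          END DO
       END DO

       ! Step 3: zero-pad P so both operands share the virtual domain.
       Pbar = 0.0
       Pbar(0:Nx-1, 0:Ny-1) = P

       PRINT *, SUM(AKbar), SUM(Pbar)
    END PROGRAM pad_demo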

    Intel's MKL was used to calculate the FFT and IFFT of the operands. The library offers a highly efficient threaded version of the functions it includes, which did not negatively affect the concurrency of the code or require introducing any more OpenMP parallel regions.

    4.6.4 Performance

    Table 10 below shows the summary of the analysis of the program when using FFT VI. The table shows the CPU hotspots as well as the total wall time of the program. This implementation successfully removes the elastic deformation calculations from the list of CPU hotspots. It also led to a reduction of around 170 seconds in the total execution time of the program.

    CPU hotspots                   CPU time
    EPSILON_DERIVATIVE             319.358s
    __kmp_barrier_template         296.662s
    __kmp_fork_barrier             270.454s
    TEMP_CALC_METAL...@373         266.627s
    TEMP_CALC_IJ                   190.171s
    Total CPU time                1901.177s
    Total wall time                325.603s

    Table 10: Summary of the analysis of the threaded code using the FFT method for elastic deformation.

    4.7 Final remarks and reflection

    This parallelization effort reduced the execution time from 1787 seconds to 325.6 seconds. The program had a significant amount of potential parallelism that was easy to exploit. This was due to the nature of the subroutines, which operate on the simulated grid one node at a time with no data dependencies between computations. This type of parallelism is termed "parallel-by-point" in [7] and should in theory scale well with the number of cores used. However, since the program is not 100% parallel, some diminishing returns are expected due to Amdahl's Law. How the performance of the threaded version scales with problem size, number of cores and number of threads per core is shown in the following section.
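    For reference, Amdahl's Law quantifies this limit: if a fraction p of the execution can be parallelized and n cores are used, the attainable speedup is

    S(n) = \frac{1}{(1 - p) + p/n},

    which approaches 1/(1 - p) no matter how many cores are added.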


    5 Results and discussion

    This section offers a closer look at the performance characteristics of the final threaded version of the code. How the performance is affected by the amount of hardware it uses is presented and commented on. Two sets of performance charts were created, one for each machine described in Section 3.1.1 above. Some differences in the scaling were observed due to the differences in hardware architecture. Also, since Simultaneous Multi-Threading (SMT), or Hyper-Threading as Intel terms it, is a technology used in most modern processors, it was important to show how the number of threads per core affects the performance scaling as well. Oversubscription was not looked at, and the maximum number of threads per core used throughout this section was the maximum number of threads a core can handle simultaneously with SMT. Finally, how the performance of the FFT method of calculating the elastic deformation changes with the size of the problem is presented.

    5.1 Concurrency scaling

    This subsection is split into two parts, one for each machine used to run the final threaded version of the code. This distinction was necessary since