
Universitat Politècnica de Catalunya
AMPP Final Project Report

Parallelization of Smith-Waterman Algorithm using MPI

Authors: Iuliia Proskurnia, Arinto Murdopo, Muhammad Anis uddin Nasir

Supervisors: Josep R. Herrero, Dani Jimenez-Gonzalez

January 16, 2012


Contents

1 Introduction
2 Main Issues and Solutions
  2.1 Available Parallelization Techniques
  2.2 Blocking Technique
    2.2.1 Solution 1: Using Scatter and Gather
    2.2.2 Solution 1: Linear-array Model
    2.2.3 Solution 1: Optimum B for Linear-array Model
    2.2.4 Solution 1: 2-D Mesh Model
    2.2.5 Solution 1: Optimum B for 2-D Mesh Model
    2.2.6 Solution 2: Using Send and Receive
    2.2.7 Solution 2: Linear-array Model
    2.2.8 Solution 2: Optimum B for Linear-array Model
    2.2.9 Solution 2: 2-D Mesh Model
  2.3 Blocking-and-Interleave Technique
    2.3.1 Solution 1: Using Scatter and Gather
    2.3.2 Solution 1: Linear-Array Model
    2.3.3 Solution 1: Optimum B and I for Linear-array Model
    2.3.4 Solution 1: 2-D Mesh Model
    2.3.5 Solution 1: Optimum B and I for 2-D Mesh Model
    2.3.6 Solution 1: Improvement
    2.3.7 Solution 1: Optimum B and I for the Improved Solution
    2.3.8 Solution 2: Using Send and Receive
    2.3.9 Solution 2: Linear-array Model
    2.3.10 Solution 2: Optimum B and I for Linear-array Model
    2.3.11 Solution 2: 2-D Mesh Model
3 Performance Results
  3.1 Solution 1
    3.1.1 Performance of Sequential Code
    3.1.2 Find Out Optimum Number of Processors (P)
    3.1.3 Find Out Optimum Blocking Size (B)
    3.1.4 Find Out Optimum Interleave Factor (I)
  3.2 Solution 1-Improved
    3.2.1 Find Out Optimum Number of Processors (P)
    3.2.2 Find Out Optimum Blocking Size (B)
    3.2.3 Find Out Optimum Interleave Factor (I)
  3.3 Solution 2
    3.3.1 Find Out Optimum Number of Processors (P)
    3.3.2 Find Out Optimum Blocking Size (B)
    3.3.3 Find Out Optimum Interleave Factor (I)
  3.4 Putting All the Optimum Values Together
  3.5 Testing with Different GAP Penalties
4 Conclusions
A Source Code Compilation
B Execution on ALTIX
C Timing Diagram for Blocking Technique in Solution 2
D Timing Diagram for Blocking-and-Interleave Technique in Solution 2


List of Figures

1 Blocking Communication
2 Data Partitioning among processes
3 Blocking Communication
4 Blocking and interleave communication
5 Blocking and Interleave Communication
6 Sequential Code Performance Measurement Result
7 Measurement result when N is 5000, B is 100 and I is 1
8 Diagram of measurement result when N is 5000, B is 100, I is 1
9 Measurement result when N is 10000, B is 100 and I is 1
10 Diagram of measurement result when N is 10000, B is 100, I is 1
11 Performance measurement result when N is 10000, P is 8, I is 1
12 Diagram of measurement result when N is 10000, P is 8, I is 1
13 Diagram of measurement result when N is 10000, P is 8, B is 100
14 Measurement result when N is 10000, B is 100 and I is 1
15 Diagram of measurement result when N is 10000, B is 100, I is 1
16 Performance measurement result when N is 10000, P is 8, I is 1
17 Diagram of measurement result when N is 10000, P is 8, I is 1
18 Diagram of measurement result when N is 10000, P is 8, B is 200
19 Measurement result when N is 5000, B is 100 and I is 1
20 Diagram of measurement result when N is 5000, B is 100, I is 1
21 Measurement result when N is 10000, B is 100 and I is 1
22 Diagram of measurement result when N is 10000, B is 100, I is 1
23 Performance measurement result when N is 10000, P is 32, I is 1
24 Diagram of measurement result when N is 10000, P is 32, I is 1
25 Performance measurement result when N is 10000, P is 32, B is 50
26 Diagram of measurement result when N is 10000, P is 32, B is 50
27 Putting all of them together
28 Putting all of them together - the plot
29 Testing with different gap penalties
30 Gap penalty vs Time
31 Performance Model Solution 2
32 Performance Model with Interleave


1 Introduction

The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment, that is, for determining similar regions between two nucleotide or protein sequences. Proteins are made of amino acid sequences, and proteins with a similar structure have similar amino acid sequences. In this project we developed a parallel implementation of the Smith-Waterman algorithm using the Message Passing Interface (MPI).

To compare two amino acid sequences, we first have to align them. To find the best alignment between two sequences, the algorithm populates a matrix H of size N × N (where N is the sequence length) using a scoring criterion. It requires a scoring matrix (the cost of matching two symbols) and a gap penalty for a mismatch. After populating the matrix H, the optimum local alignment is obtained by tracing back through the matrix, starting from its highest value.
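Concretely, the cells are filled with the recurrence that the code in Section 2 implements (this is our restatement of what that code computes, with a linear gap penalty Δ, the similarity score sim(a_i, b_j) from the scoring matrix, and the first row and column initialized to 0):

$$H_{i,j} = \max\bigl(0,\; H_{i-1,j-1} + sim(a_i, b_j),\; H_{i-1,j} + \Delta,\; H_{i,j-1} + \Delta\bigr)$$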

In our implementation of the Smith-Waterman algorithm we populate the matrix H in parallel using multiple processes running on multicore machines. We use pipelined computation to achieve a certain degree of parallelism and compare different parallelization techniques to find the most suitable one for this problem. We started by parallelizing the code with different blocking sizes B at the column level. Furthermore, we also introduced parallelization with different interleave levels I at the row level.

For performance evaluation we created performance models of both implementations for two interconnection networks: a linear array and a 2-D mesh. We executed the code on the ALTIX machine with different values of the parameters Δ (gap penalty), B (column blocking factor) and I (row interleave factor) to empirically find the optimum B and I for the problem. We also calculated the optimum B and I analytically by finding the global minima of the performance-model equations.


2 Main Issues and Solutions

2.1 Available Parallelization Techniques

We can achieve pipelining with blocking at both the column and the row level. Blocking at the column level can be interpreted in several ways:

1. Each processor Pi processes B complete columns of the matrix before doing any communication.

2. Each processor Pi processes B complete columns; however, after processing B columns of one row of the matrix it communicates with the next processor.

3. Each processor Pi processes B complete columns; however, after processing B columns of a set of rows of those B columns it communicates.

4. Each processor Pi processes N/P complete rows. After processing B columns of those N/P rows, it communicates.

Among the techniques above, we chose the last one because it gives the best pipelined computation for this scheme, as sketched below.
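A minimal sketch of the chosen scheme, using hypothetical helper names (recv_boundary_row, compute_block, send_boundary_row) in place of the MPI calls shown later in this report; it only illustrates the loop structure implied by technique 4.

/* Technique 4 sketch: each process owns N/P consecutive rows and sweeps the
   columns in blocks of B, forming a pipeline across the processes.
   recv_boundary_row/send_boundary_row stand in for MPI_Recv/MPI_Send. */
for (int block = 0; block < N / B; block++) {
    if (rank != 0)
        recv_boundary_row(rank - 1, block, B);  /* last row of the rows above me */
    compute_block(block * B, B);                /* fill my N/P rows for these B columns */
    if (rank != total_processes - 1)
        send_boundary_row(rank + 1, block, B);  /* my last row for these B columns */
}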

2.2 Blocking Technique

2.2.1 Solution 1: Using Scatter and Gather

Based on the chosen parallelization technique, we developed the following solution. Note that the code below already incorporates the interleave factor I, but here I is set to 1.

In the first step, the process with rank 0 (the master process) reads the two protein sequence files. The reading results are stored in short* a and short* b. It also allocates enough memory to store the resulting matrix, as shown in the code snippet below.

{
    // Note that sizeA is the total number of rows that we need to process.
    // We round up N if N is not divisible by the number of processes. I is set to 1 here.
    if (N % (total_processes * I) != 0) {
        sizeA = N + (total_processes * I) - (N % (total_processes * I)); // handle the case where N is not divisible by (total_processes * I)
    } else {
        sizeA = N;
    }

    read_files(in1, in2, a, b, N - 1); // in1 = input file 1, in2 = input file 2, a = reading result from in1, b = reading result from in2
    chunk_size = sizeA / (total_processes * I); // number of rows that each process needs to work on
    CHECKNULL((h_all_ptr = (int *) calloc(N * (sizeA + 1), sizeof(int)))); // resulting data
    CHECKNULL((h_all = (int **) calloc((sizeA + 1), sizeof(int *)))); // contains the list of row pointers

    for (i = 0; i <= sizeA; i++)
        h_all[i] = h_all_ptr + i * N; // put the pointers in an array

    // initialize the first row of the resulting matrix with 0
    for (i = 0; i < N; i++) {
        h_all[0][i] = 0;
    }
}

Every process reads the PAM similarity matrix, and the master process broadcasts the chunk_size, N, B and I values.

MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD); // broadcast chunk_size, the number of rows calculated by each slave
MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD); // broadcast N
MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD); // broadcast B
MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD); // broadcast I

Each process then allocates enough memory to receive its chunk. Processes other than rank 0 also need to allocate memory to receive the whole of protein 2 (which has size N).

CHECKNULL((chunk_a = (short *) calloc(sizeof(short), chunk_size))); // slave processes will obtain it from the master process
if (rank != 0) {
    CHECKNULL((b = (short *) malloc(sizeof(short) * (N)))); // slave processes will obtain it from the master process
}

MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD); // broadcast protein 2 to every process

Now let's go to the parallel part. First we calculate how many blocks will be processed: the total_blocks variable, and the last_block variable, which holds the size of the last block when N is not divisible by B (N % B != 0).

int total_blocks = N / B + (N % B == 0 ? 0 : 1);
int last_block = N % B == 0 ? B : N % B;
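For example, using only the two expressions above: with N = 10000 and B = 100 we get total_blocks = 100 and last_block = 100 = B, while with N = 10050 and B = 100 we get total_blocks = 101 and a final, shorter block of last_block = 50.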


Then we scatter the 1st protein sequence (stored here in a), with each scattered part having size chunk_size. After each process receives its part, the computation begins at the process with rank 0: it does not wait for data from any other process and directly calculates the 1st block of data. Meanwhile, every other process with rank r waits for data from the process with rank r-1. The data sent between processes is the last row of the calculated block, which is an array of B integers.

After a process receives the required data, it performs the computation for that block. At the end, each process with rank r sends the last row of the calculated block, of size B, to the neighbouring process with rank r+1.

Finally, we perform a gather to combine the results. Note that the current_interleave variable stays at 0 and I is set to 1 here because we are not using the interleave factor. The code snippet below shows how this functionality is implemented.

for (int current_interleave = 0; current_interleave < I; current_interleave++) {
    MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                chunk_size, MPI_SHORT, chunk_a, chunk_size,
                MPI_SHORT, 0, MPI_COMM_WORLD); // chunk_a is the receiving buffer
    int current_column = 1;
    for (i = 0; i < chunk_size + 1; i++) h[i][0] = 0;

    for (int current_block = 0; current_block < total_blocks; current_block++) {
        // Receive
        int block_end = MIN2(current_column - (current_block == 0 ? 1 : 0) + B, N);
        if (rank == 0 && current_interleave == 0) { // if rank 0 is processing the first block, it doesn't need to receive anything
            for (int k = current_column; k < block_end; k++) {
                h[0][k] = 0; // init row 0
            }
        } else {
            int receive_from = rank == 0 ? total_processes - 1 : rank - 1; // receive from the neighbouring process
            int size_to_receive = current_block == total_blocks - 1 ? last_block : B;
            MPI_Recv(h[0] + current_block * B, size_to_receive, MPI_INT,
                     receive_from, 0, MPI_COMM_WORLD, &status);
        }
        // Process
        for (j = current_column; j < block_end; j++, current_column++) {
            for (i = 1; i < chunk_size + 1; i++) {
                diag = h[i-1][j-1] + sim[chunk_a[i-1]][b[j-1]];
                down = h[i-1][j] + DELTA;
                right = h[i][j-1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0) {
                    h[i][j] = 0;
                } else {
                    h[i][j] = max;
                }
            }
        }

        // Send
        if (current_interleave + 1 != I || rank + 1 != total_processes) {
            int send_to = rank + 1 == total_processes ? 0 : rank + 1;
            int size_to_send = current_block == total_blocks - 1 ? last_block : B;
            MPI_Send(h[chunk_size] + current_block * B, size_to_send,
                     MPI_INT, send_to, 0, MPI_COMM_WORLD);
            print_vector(h[chunk_size] + current_block * B, size_to_send);
        }
    }

    // Gathering the result (once per interleave stage)
    MPI_Gather(hptr + N, N * chunk_size, MPI_INT,
               h_all_ptr + N + current_interleave * chunk_size * total_processes * N,
               N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
}

Once the result is gathered, the process with rank 0 deallocates the memory and optionally verifies the result. The verification compares the parallel version of the h matrix (stored in h_all) with a serial version (stored in hverify).

if (rank == 0) {
    if (verifyResult == 1) {
        Max = 0;
        xMax = 0;
        yMax = 0;
        CHECKNULL((hverify_ptr = (int *) malloc(sizeof(int) * (N+1) * (N+1))));
        CHECKNULL((hverify = (int **) malloc(sizeof(int *) * (N+1))));
        /* Mount hverify[N][N] */
        for (i = 0; i <= N; i++)
            hverify[i] = hverify_ptr + i * (N+1);
        for (i = 0; i <= N; i++) hverify[i][0] = 0;
        for (j = 0; j <= N; j++) hverify[0][j] = 0;

        for (i = 1; i <= N; i++)
            for (j = 1; j <= N; j++) {
                diag = hverify[i-1][j-1] + sim[a[i-1]][b[j-1]];
                down = hverify[i-1][j] + DELTA;
                right = hverify[i][j-1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0) {
                    hverify[i][j] = 0;
                } else if (max == diag) {
                    hverify[i][j] = diag;
                } else if (max == down) {
                    hverify[i][j] = down;
                } else {
                    hverify[i][j] = right;
                }
                if (max > Max) {
                    Max = max;
                    xMax = i;
                    yMax = j;
                }
            }

        int verFailFlag = 0;
        for (i = 0; i <= N-1; i++) {
            for (j = 0; j <= N-1; j++) {
                if (h_all[i][j] != hverify[i][j]) {
                    printf("Verification fail!\n");
                    printf("h_all[i][j] = %d, hverify[i][j] = %d\n", h_all[i][j], hverify[i][j]);
                    verFailFlag = -1;
                    break;
                }
            }

            if (verFailFlag != 0) {
                break;
            }
        }

        if (verFailFlag == 0)
        {
            printf("Verification success!\n");
        }

    }

    free(hverify_ptr);
    free(hverify);
    free(a);
    free(h_all_ptr);
    free(h_all);
}

free(b);
free(chunk_a);
free(h);
free(hptr);

MPI_Finalize();

Figure 1: Blocking Communication

To summarize this technique, Figure 1 shows how the matrix is divided into blocks. The number inside each block indicates the step in which it is computed. The red portion of block 1 indicates the data (B integers) that process 0 sends to process 1 at the end of the calculation of block 1, in step 1.
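As a concrete instance of this pipeline: with p = 4 processes and N/B = 4 blocks per process, process 0 computes its blocks in steps 1-4, process 1 in steps 2-5, process 2 in steps 3-6 and process 3 in steps 4-7, so the whole matrix is finished after N/B + p - 1 = 7 steps, which is exactly the number of block-calculation stages used in the model of Section 2.2.2.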

2.2.2 Solution 1: Linear-array Model

First, we use a linear-array topology to model our solution. The model for the communication part of our chosen blocking technique is as follows.

1. Broadcasting chunk_size, N, B, and I

   $t_{comm-bcast-4-int} = 4 \times (t_s + t_w) \times \log_2(p)$

2. Broadcasting the 2nd protein sequence (vector b)

   $t_{comm-bcast-protein-seq} = (t_s + t_w \times N) \times \log_2(p)$

3. Scattering chunk_size rows to each process

   Note that the chunk size is $chunk\_size = \frac{N}{p}$. Therefore the communication time for scattering is

   $t_{comm-scatter-protein-seq} = t_s \times \log_2(p) + t_w \times \frac{N}{p} \times (p-1)$

4. Sending shared data

   To start the first block of computation, the process with rank 0 does not need to wait for data from any other process. That means we only have $\frac{N}{B} + p - 2$ stages for sending shared data. The shared data is the last row of the currently finished block, which consists of B items. Putting this together, the communication time to send shared data is

   $t_{comm-send-shared-data} = (\frac{N}{B} + p - 2) \times (t_s + B \times t_w)$

5. Gathering calculated data

   Finally, we need to perform a gather to combine all calculated data. Every process contributes $N \times chunk\_size$ values, which equals $N \times \frac{N}{p}$. Therefore the communication time for this step is

   $t_{comm-gather} = t_s \times \log_2(p) + t_w \times \frac{N}{p} \times N \times (p-1)$

6. Putting all the communication times together

   $t_{comm-all} = t_{comm-bcast-4-int} + t_{comm-bcast-protein-seq} + t_{comm-scatter-protein-seq} + t_{comm-send-shared-data} + t_{comm-gather}$

   $t_{comm-all}(B) = (6\log_2(p) + p - 1)\,t_s + \bigl((4+N)\log_2(p) + N + \frac{(N+N^2)(p-1)}{p}\bigr)\,t_w + \frac{N}{B}\,t_s + (p-2)\,B\,t_w$

Now we calculate the computation time for this blocking technique. Note that in our blocking technique there are $\frac{N}{B} + p - 1$ stages of block calculation, and in each block calculation we need to compute $\frac{N}{p} \times B$ points. Therefore, if we denote the time to compute one point by $t_c$, we obtain the following computation time model:

$t_{calc} = (\frac{N}{B} + p - 1) \times (\frac{N}{p} \times B) \times t_c$

$t_{calc} = (\frac{N^2}{p} + NB - \frac{NB}{p}) \times t_c$

$t_{calc} = (\frac{N^2}{p} + \frac{N(p-1)}{p} \times B) \times t_c$

The final model is obtained by adding the computation time and the communication time:

$t_{total} = t_{comm} + t_{calc}$

$t_{total}(B) = (6\log_2(p) + p - 1)\,t_s + \bigl((4+N)\log_2(p) + N + \frac{(N+N^2)(p-1)}{p}\bigr)\,t_w + \frac{N}{B}\,t_s + (p-2)\,B\,t_w + (\frac{N^2}{p} + \frac{N(p-1)}{p} \times B)\,t_c$
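To make the model concrete, the following self-contained C sketch evaluates $t_{total}(B)$ for the linear-array model of Solution 1 and scans a range of block sizes for the minimum. The parameter values (ts, tw, tc, N, p) are placeholders chosen only for illustration, not measured ALTIX constants.

#include <stdio.h>
#include <math.h>

/* t_total(B) for the linear-array model of Solution 1 (blocking only, I = 1).
   ts, tw and tc are assumed machine parameters, not measured ALTIX values. */
static double t_total(double B, double N, double p,
                      double ts, double tw, double tc) {
    double comm = (6.0 * log2(p) + p - 1.0) * ts
                + ((4.0 + N) * log2(p) + N + (N + N * N) * (p - 1.0) / p) * tw
                + (N / B) * ts + (p - 2.0) * B * tw;
    double calc = (N * N / p + (N * (p - 1.0) / p) * B) * tc;
    return comm + calc;
}

int main(void) {
    const double N = 10000.0, p = 8.0;
    const double ts = 1e-5, tw = 1e-8, tc = 1e-8; /* placeholder parameters */
    double bestB = 1.0, bestT = t_total(1.0, N, p, ts, tw, tc);
    for (double B = 2.0; B <= 1000.0; B += 1.0) {
        double t = t_total(B, N, p, ts, tw, tc);
        if (t < bestT) { bestT = t; bestB = B; }
    }
    printf("predicted optimum B = %.0f (t_total = %.4f s)\n", bestB, bestT);
    return 0;
}

With these placeholder numbers the scan lands near B ≈ 34, consistent with the $\sqrt{t_s/t_c}$ approximation derived in the next subsection.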


2.2.3 Solution 1: Optimum B for Linear-array Model

To find the optimum B for the linear-array model, we take the derivative of the final model with respect to B and set it to 0:

$\frac{d\,t_{total}(B)}{dB} = 0$

Using the model obtained in Section 2.2.2, this gives the following equation:

$-\frac{N}{B^2}\,t_s + (p-2)\,t_w + \frac{N(p-1)}{p}\,t_c = 0$

$(p-2)\,t_w + \frac{N(p-1)}{p}\,t_c = \frac{N}{B^2}\,t_s$

$B^2 = \frac{N t_s}{(p-2)\,t_w + \frac{N(p-1)}{p}\,t_c}$

$B^2 = \frac{p N t_s}{p(p-2)\,t_w + N(p-1)\,t_c}$

$B = \sqrt{\frac{p N t_s}{p(p-2)\,t_w + N(p-1)\,t_c}}$

Assuming that P is very small in comparison with N, this simplifies to

$B \approx \sqrt{\frac{t_s}{t_c}}$
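For a rough feel of the magnitude, assume (purely illustrative values, not ALTIX measurements) a message start-up time $t_s = 10\,\mu s$ and a per-cell computation time $t_c = 10\,ns$; then $B \approx \sqrt{t_s/t_c} = \sqrt{10^{-5}/10^{-8}} \approx 32$.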

2.2.4 Solution 1: 2-D Mesh Model

Using the same steps as in Section 2.2.2, the 2-D mesh model of Solution 1 is as follows.

1. Broadcasting chunk_size, N, B, and I

   $t_{comm-bcast-4-int} = 4 \times 2 \times (t_s + t_w) \times \log_2(\sqrt{p})$

2. Broadcasting the 2nd protein sequence (vector b)

   $t_{comm-bcast-protein-seq} = 2 \times (t_s + t_w \times N) \times \log_2(\sqrt{p})$

3. Scattering chunk_size rows to each process

   Note that the chunk size is $chunk\_size = \frac{N}{p}$. The communication time for scattering in the 2-D mesh model can be modeled as in the hypercube and is the same as the scattering time in the linear-array model [1]:

   $t_{comm-scatter-protein-seq} = t_s \times \log_2(p) + t_w \times \frac{N}{p} \times (p-1)$

4. Sending shared data

   Since sending shared data uses primitive send and receive, the communication time for this part also does not change in the 2-D mesh model:

   $t_{comm-send-shared-data} = (\frac{N}{B} + p - 2) \times (t_s + B \times t_w)$

5. Gathering calculated data

   The communication time for gathering uses the same formula as scattering, but with a different amount of data:

   $t_{comm-gather} = t_s \times \log_2(p) + t_w \times \frac{N}{p} \times N \times (p-1)$

6. Putting all the communication times together

   $t_{comm-all} = t_{comm-bcast-4-int} + t_{comm-bcast-protein-seq} + t_{comm-scatter-protein-seq} + t_{comm-send-shared-data} + t_{comm-gather}$

   $t_{comm-all}(B) = (10\log_2(\sqrt{p}) + \log_2(p) + p - 1)\,t_s + \bigl((8+2N)\log_2(\sqrt{p}) + \frac{N(p-1)}{p} + N + \frac{N^2(p-1)}{p}\bigr)\,t_w + \frac{N}{B}\,t_s + (p-2)\,B\,t_w$

The computation time does not change between the 2-D mesh model and the linear-array model, therefore it is

$t_{calc} = (\frac{N^2}{p} + \frac{N(p-1)}{p} \times B) \times t_c$

Putting it all together:

$t_{total} = t_{comm} + t_{calc}$

$t_{total}(B) = \frac{N^2}{p}\,t_c + (10\log_2(\sqrt{p}) + \log_2(p) + p - 1)\,t_s + \bigl((8+2N)\log_2(\sqrt{p}) + \frac{N(p-1)}{p} + N + \frac{N^2(p-1)}{p}\bigr)\,t_w + \frac{N}{B}\,t_s + (p-2)\,B\,t_w + \frac{N(p-1)}{p}\,B\,t_c$

2.2.5 Solution 1: Optimum B for 2-D Mesh Model

We take the derivative of the final 2-D mesh model with respect to B and set it to 0:

$\frac{d\,t_{total}(B)}{dB} = 0$

Using the model obtained in Section 2.2.4, this gives the following equation:

$-\frac{N}{B^2}\,t_s + (p-2)\,t_w + \frac{N(p-1)}{p}\,t_c = 0$

$(p-2)\,t_w + \frac{N(p-1)}{p}\,t_c = \frac{N}{B^2}\,t_s$

$B^2 = \frac{N t_s}{(p-2)\,t_w + \frac{N(p-1)}{p}\,t_c}$

$B^2 = \frac{p N t_s}{p(p-2)\,t_w + N(p-1)\,t_c}$

$B = \sqrt{\frac{p N t_s}{p(p-2)\,t_w + N(p-1)\,t_c}}$

$B \approx \sqrt{\frac{t_s}{t_c}}$

As we observe, the optimum B does not change when we use the 2-D mesh to model the communication. In our Solution 1, the 2-D mesh model only affects the broadcast time, and, referring to the total-time equation with respect to B, $t_{total}(B)$, the broadcast time is only a constant, so it disappears when we compute $\frac{d\,t_{total}(B)}{dB}$.

2.2.6 Solution 2: Using Send and Receive

In the second solution, we use the Send and Receive methods provided by the MPI library to communicate among the processes. In this implementation every process reads the input files, and every process also reads the similarity matrix.

After reading the files, each process calculates the number of rows that it has to process and allocates the required memory. The process with rank 0 allocates the matrix H of size N × N. In our implementation the data distribution is fair among all processes: when the number of rows is not evenly divisible among the processes, we give one extra row to each process, starting from the master process. Figure 2 shows the distribution of data in the case where the data is not equally divisible among the processes.
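A small sketch of this distribution rule, assuming r = N % p leftover rows (r is used but not defined in the snippets below, so this is our reading of it); s is the row count of process id and first_row the global index of its first row, matching the RowPosition arithmetic in the code.

/* Fair row distribution: the first r = N % p processes get one extra row. */
int r = N % p;
int s = (id < r) ? (N / p + 1) : (N / p);
int first_row = (id < r) ? id * (N / p + 1)
                         : r * (N / p + 1) + (id - r) * (N / p);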

Each process calculates the block size that it needs to communicate with its neighbour. Filling starts at the master process, while the other processes wait to receive a block before they can start processing. The master sends its first block to its neighbour after processing its own rows of that block. Below is the code snippet for filling the matrix at all processes.


Figure 2: Data Partitioning among processes

if (id == 0)
{
    for (i = 0; i < ColumnBlock; i++)
    {
        for (j = 1; j <= s; j++)
        {
            for (k = i*B + 1; k <= (i+1)*B && k <= b[0] && k <= N; k++)
            {
                int RowPosition;
                if (id < r)
                    RowPosition = id * ((N/p)+1) + j;
                else
                    RowPosition = (r * ((N/p)+1)) + ((id-r) * (N/p)) + j;

                diag = h[j-1][k-1] + sim[a[RowPosition]][b[k]];
                down = h[j-1][k] + DELTA;
                right = h[j][k-1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0) {
                    h[j][k] = 0;
                } else {
                    h[j][k] = max;
                }
                chunk[k - (i*B + 1)] = h[j][k];
            }
        }
        MPI_Send(chunk, B, MPI_SHORT, id+1, 0, MPI_COMM_WORLD);
    }
} else
{
    for (i = 0; i < ColumnBlock; i++)
    {
        MPI_Recv(chunk, B, MPI_SHORT, id-1, 0, MPI_COMM_WORLD, &status);
        for (z = 0; z < B; z++)
        {
            if ((i*B + z + 1) <= N)
                h[0][i*B + z + 1] = chunk[z];
        }
        for (j = 1; j <= s; j++)
        {
            int RowPosition;
            if (id < r)
                RowPosition = id * ((N/p)+1) + j;
            else
                RowPosition = (r * ((N/p)+1)) + ((id-r) * (N/p)) + j;

            for (k = i*B + 1; k <= (i+1)*B && k <= b[0] && k <= N; k++)
            {
                diag = h[j-1][k-1] + sim[a[RowPosition]][b[k]];
                down = h[j-1][k] + DELTA;
                right = h[j][k-1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0)
                    h[j][k] = 0;
                else
                    h[j][k] = max;

                chunk[k - (i*B + 1)] = h[j][k];
            }
        }
        if (id != p-1)
            MPI_Send(chunk, B, MPI_SHORT, id+1, 0, MPI_COMM_WORLD);
    }
}

At the end, every process sends its portion of the matrix H to the master process using the Send method of the MPI library. Below is the code snippet of the gathering process.

if (id == 0)
{
    int row, col;
    for (i = 1; i < p; i++)
    {
        MPI_Recv(&row, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
        CHECKNULL((recv_hptr = (int *) malloc(sizeof(int) * (row) * (N))));

        MPI_Recv(recv_hptr, row*N, MPI_INT, i, 0, MPI_COMM_WORLD, &status);

        for (j = 0; j < row; j++)
        {
            int RowPosition;
            if (i < r)
                RowPosition = (i * ((N/p)+1)) + j + 1;
            else
                RowPosition = (r * ((N/p)+1)) + ((i-r) * (N/p)) + j + 1;

            for (k = 0; k < N; k++)
                h[RowPosition][k+1] = recv_hptr[j*N + k];
        }
        free(recv_hptr);
    }
}
else
{
    MPI_Send(&s, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    CHECKNULL((recv_hptr = (int *) malloc(sizeof(int) * (s) * (N))));

    for (j = 0; j < s; j++)
    {
        for (k = 0; k < N; k++)
        {
            recv_hptr[j*N + k] = h[j+1][k+1];
        }
    }
    MPI_Send(recv_hptr, s*N, MPI_INT, 0, 0, MPI_COMM_WORLD);
    free(recv_hptr);
}

Once the result is gathered, the process with rank 0 deallocates the memory and optionally verifies the result.

Figure 3: Blocking Communication

As reflected in Figure 3, the division into blocks in Solution 2 is the same as in Solution 1, but instead of using scatter and gather to distribute the data, Solution 2 uses primitive sends and receives.

2.2.7 Solution 2: Linear-array Model

Initially we calculate the performance model for the linear interconnection network. The timing diagram can be found in Appendix C.

1. In Solution 2, every process calculates $\frac{N}{p} \times B$ values before communicating a chunk with the next process. It takes $\frac{N}{B} + p - 1$ steps in total for the computation:

   $t_{comp1} = (\frac{N}{B} + p - 1) \times (\frac{N}{p} \times B) \times t_c$

2. After each computation step, each process communicates a block with its neighbour process. There are $\frac{N}{B} + p - 2$ communication steps among all the processes:

   $t_{comm1} = (\frac{N}{B} + p - 2) \times (t_s + B \times t_w)$

3. After completing its part of matrix H, every process sends it to the master process:

   $t_{comm2} = t_s + \frac{N}{p} \times N \times t_w$

4. In the end, the master process puts all the partial results into the matrix H to finalize it:

   $t_{comp2} = t_s + \frac{N}{p} \times N \times t_w$

The total time is calculated by combining all of these:

$t_{total} = t_{comp1} + t_{comm1} + t_{comp2} + t_{comm2}$

$t_{total} = (\frac{N}{B} + p - 1) \times (\frac{N}{p} \times B) \times t_c + (\frac{N}{B} + p - 2) \times (t_s + B\,t_w) + (t_s + \frac{N}{p} N\,t_w) + (t_s + \frac{N}{p} N\,t_w)$

2.2.8 Solution 2: Optimum B for Linear-array Model

To find the optimum B for the linear-array model, we take the derivative of the final model with respect to B and set it to 0:

$\frac{d\,t_{total}(B)}{dB} = 0$

Using the model obtained in Section 2.2.7, this gives the following equation:

$-\frac{N}{B^2}\,t_s + (p-2)\,t_w + \frac{N(p-1)}{p}\,t_c = 0$

$(p-2)\,t_w + \frac{N(p-1)}{p}\,t_c = \frac{N}{B^2}\,t_s$

$B^2 = \frac{N t_s}{(p-2)\,t_w + \frac{N(p-1)}{p}\,t_c}$

$B^2 = \frac{p N t_s}{p(p-2)\,t_w + N(p-1)\,t_c}$

$B = \sqrt{\frac{p N t_s}{p(p-2)\,t_w + N(p-1)\,t_c}}$
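Using illustrative parameter values (assumed for this example, not measurements): $t_s = 10\,\mu s$, $t_w = 10\,ns$, $t_c = 10\,ns$, N = 10000 and p = 8, this evaluates to $B = \sqrt{0.8 / (8 \cdot 6 \cdot 10^{-8} + 10^4 \cdot 7 \cdot 10^{-8})} \approx 34$, close to the $\sqrt{t_s/t_c} \approx 32$ approximation of Section 2.2.3, since the $t_w$ term is negligible for small p.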

2.2.9 Solution 2: 2-D Mesh Model

We also calculated the performance model for the 2-D mesh interconnection network and found no difference from the linear-array model: the two models differ mainly in the time needed for broadcasting, and this solution does not involve any broadcast from the root to the other processes in the system.

2.3 Blocking-and-Interleave Technique

2.3.1 Solution 1: Using Scatter and Gather

Taking into account not only the blocking size B but also the interleave factor I, we developed the solution below. The first step is to allocate memory for all necessary variables in each process. The master process also allocates memory for the final matrix where all partial results will be stored. The slave processes allocate memory for the partial result matrices which will eventually be sent to the master process.

main(int argc, char *argv[]) {

    { ... }

    int B, I;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &total_processes);

    if (rank == 0) {
        chunk_size = sizeA / (total_processes * I);

        CHECKNULL((h_all_ptr = (int *) calloc(N * (sizeA+1), sizeof(int)))); // resulting data
        CHECKNULL((h_all = (int **) calloc((sizeA+1), sizeof(int *)))); // contains the list of row pointers

        for (i = 0; i < sizeA; i++)
            h_all[i] = h_all_ptr + i * N;

        // initialize the first row of the resulting matrix with 0
        for (i = 0; i < N; i++)
        {
            h_all[0][i] = 0;
        }

    }

    MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD);

    CHECKNULL((hptr = (int *) malloc(sizeof(int) * (N) * (chunk_size + 1))));
    CHECKNULL((h = (int **) malloc(sizeof(int *) * (chunk_size + 1))));
    for (i = 0; i < chunk_size + 1; i++)
        h[i] = hptr + i * N;

    CHECKNULL((chunk_a = (short *) calloc(sizeof(short), chunk_size)));
    if (rank != 0) {
        CHECKNULL((b = (short *) malloc(sizeof(short) * (N))));
    }
    MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD);

The master process scatters vector a to the processes piece by piece: in each interleave step, one part of vector a is sent. The code for interleave step 0 is the same as in the previous section, with one exception: the last process sends its results to the first process. Each process receives B values from the previous process before processing the next B columns, and sends its data after processing B columns to the next process; the last process sends its data to the first (master) process unless it is the last interleave stage.

Finally, after calculating each partial matrix, every process sends its result to the master process (this happens I times).

for (int current_interleave = 0; current_interleave < I; current_interleave++) {
    MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                chunk_size, MPI_SHORT, chunk_a, chunk_size, MPI_SHORT, 0,
                MPI_COMM_WORLD);
    int current_column = 1;
    for (i = 0; i < chunk_size + 1; i++) h[i][0] = 0;
    for (int current_block = 0; current_block < total_blocks; current_block++) {
        // Receive
        int block_end = MIN2(current_column - (current_block == 0 ? 1 : 0) + B, N);
        if (rank == 0 && current_interleave == 0) {
            for (int k = current_column; k < block_end; k++) {
                h[0][k] = 0;
            }
        } else {
            int receive_from = rank == 0 ? total_processes - 1 : rank - 1;
            int size_to_receive = current_block == total_blocks - 1 ? last_block : B;
            MPI_Recv(h[0] + current_block * B, size_to_receive,
                     MPI_INT, receive_from, 0, MPI_COMM_WORLD, &status);
        }
        // Process
        for (j = current_column; j < block_end; j++, current_column++) {
            for (i = 1; i < chunk_size + 1; i++) {
                diag = h[i-1][j-1] + sim[chunk_a[i-1]][b[j-1]];
                down = h[i-1][j] + DELTA;
                right = h[i][j-1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0) {
                    h[i][j] = 0;
                } else {
                    h[i][j] = max;
                }
            }
        }
        // Send
        if (current_interleave + 1 != I || rank + 1 != total_processes) {
            int send_to = rank + 1 == total_processes ? 0 : rank + 1;
            int size_to_send = current_block == total_blocks - 1 ? last_block : B;
            MPI_Send(h[chunk_size] + current_block * B, size_to_send,
                     MPI_INT, send_to, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Gather(hptr + N, N * chunk_size, MPI_INT,
               h_all_ptr + N + current_interleave * chunk_size * total_processes * N,
               N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
}
MPI_Finalize();
}
{ ... }

The interleave scheme is summarized in Figure 4.


Figure 4: Blocking and interleave communication

2.3.2 Solution 1: Linear-Array Model

Here is the linear-array model for the communication part of the blocking technique with interleaving.

1. Broadcasting chunk_size, N, B, and I

   $t_{comm-bcast-4-int} = 4 \times (t_s + t_w) \times \log_2(p)$

2. Broadcasting the 2nd protein sequence (vector b)

   $t_{comm-bcast-protein-seq} = (t_s + t_w \times N) \times \log_2(p)$

3. Scattering chunk_size rows to each process

   Note that the chunk size is now $chunk\_size = \frac{N}{p \times I}$, where I is the interleave factor, and the scatter is performed I times. Therefore the communication cost of scattering is

   $t_{comm-scatter-protein-seq} = I \times (t_s \times \log_2(p) + t_w \times \frac{N}{p \times I} \times (p-1))$

4. Sending shared data

   To start the first block of computation, the process with rank 0 does not need to wait for data from any other process. Note also that in each interleave stage except the last one, the last process needs to send its boundary data to process 0. Therefore, for I - 1 interleave stages we need $\frac{N}{B} + p - 1$ pipeline stages for sending data, and for the last interleave stage (the I-th) we have $\frac{N}{B} + p - 2$ stages. The shared data is the last row of the currently finished block, which consists of B items. Putting this together, the communication time to send shared data is

   $t_{comm-send-shared-data} = (I-1)(\frac{N}{B} + p - 1)(t_s + B\,t_w) + (\frac{N}{B} + p - 2)(t_s + B\,t_w)$

5. Gathering calculated data

   We perform a gather to combine all calculated data in every interleave step. Every process contributes $N \times chunk\_size$ values, which equals $N \times \frac{N}{p \times I}$, and the gather is repeated I times. Therefore the communication time for this step is

   $t_{comm-gather} = I \times (t_s \times \log_2(p) + t_w \times \frac{N}{p \times I} \times N \times (p-1))$

6. Putting all the communication times together

   $t_{comm-all} = t_{comm-bcast-4-int} + t_{comm-bcast-protein-seq} + t_{comm-scatter-protein-seq} + t_{comm-send-shared-data} + t_{comm-gather}$

   Simplifying the equation with respect to B (separating the constant part from the terms containing B, so that we can easily take the derivative later to obtain the optimum B), we obtain

   $t_{comm-all}(B) = ((5+2I)\log_2(p) + (p-1)(I-1) + (p-2))\,t_s + \bigl((4+N)\log_2(p) + \frac{N}{p}(p-1) + \frac{N^2}{p}(p-1) + I - 1 + N\bigr)\,t_w + \frac{IN}{B}\,t_s + ((I-1)(p-1) + p - 2)\,B\,t_w$

   Simplifying the equation with respect to I, we obtain

   $t_{comm-all}(I) = ((5+2I)\log_2(p) - 1)\,t_s + \bigl((4+N)\log_2(p) + \frac{N}{p}(p-1) + \frac{N^2}{p}(p-1) + B\bigr)\,t_w + (\frac{N}{B} + p - 1)(t_s + B\,t_w)\,I$

Now we calculate the computation time for this technique. Note that with interleaving we have $I \times (\frac{N}{B} + p - 1)$ stages of block calculation, and in each block calculation we need to compute $\frac{N}{p \times I} \times B$ points. Therefore, denoting the time to compute one point by $t_c$, we obtain the following computation time model:

$t_{calc} = I \times (\frac{N}{B} + p - 1) \times (\frac{N}{p \times I} \times B) \times t_c$

The factor I cancels, giving

$t_{calc} = (\frac{N^2}{p} + NB - \frac{NB}{p}) \times t_c$

$t_{calc} = (\frac{N^2}{p} + \frac{N(p-1)}{p} \times B) \times t_c$

The final model is obtained by adding the computation time to the communication time. With respect to B:

$t_{total} = t_{comm} + t_{calc}$

$t_{total}(B) = ((5+2I)\log_2(p) + (p-1)(I-1) + (p-2))\,t_s + \bigl((4+N)\log_2(p) + \frac{N}{p}(p-1) + \frac{N^2}{p}(p-1) + I - 1 + N\bigr)\,t_w + \frac{IN}{B}\,t_s + ((I-1)(p-1) + p - 2)\,B\,t_w + (\frac{N^2}{p} + \frac{N(p-1)}{p}\,B)\,t_c$

And the final equation with respect to I:

$t_{total}(I) = ((5+2I)\log_2(p) - 1)\,t_s + \bigl((4+N)\log_2(p) + \frac{N}{p}(p-1) + \frac{N^2}{p}(p-1) + B\bigr)\,t_w + (\frac{N}{B} + p - 1)(t_s + B\,t_w)\,I + (\frac{N^2}{p} + \frac{N(p-1)}{p}\,B)\,t_c$

2.3.3 Solution 1: Optimum B and I for Linear-array Model

The optimum B can be derived by calculating $\frac{d\,t_{total}(B)}{dB}$ and setting the derivative to 0:

$\frac{d\,t_{total}(B)}{dB} = 0$

Using the model obtained in the previous section, this gives the following equation:

$-\frac{IN}{B^2}\,t_s + ((I-1)(p-1) + (p-2))\,t_w + \frac{N(p-1)}{p}\,t_c = 0$

$((I-1)(p-1) + (p-2))\,t_w + \frac{N(p-1)}{p}\,t_c = \frac{IN}{B^2}\,t_s$

$B^2 = \frac{I N t_s}{((I-1)(p-1) + (p-2))\,t_w + \frac{N(p-1)}{p}\,t_c}$

$B^2 = \frac{p I N t_s}{((I-1)(p-1) + (p-2))\,p\,t_w + N(p-1)\,t_c}$

$B = \sqrt{\frac{p I N t_s}{((I-1)(p-1) + (p-2))\,p\,t_w + N(p-1)\,t_c}}$

$B \approx \sqrt{\frac{I N t_s}{N t_c + I}}$

However, we cannot find an optimum I for the blocking-and-interleave technique, because the derivative $\frac{d\,t_{total}(I)}{dI}$ is a constant, as shown below:

$\frac{d\,t_{total}(I)}{dI} = 0$

$(\frac{N}{B} + p - 1)(t_s + B\,t_w) = 0$

Looking at the equation for $t_{total}(I)$, the interleave factor only introduces more communication time when sending and receiving shared data. Therefore no optimum interleave level can be derived from this model.

2.3.4 Solution 1: 2-D Mesh Model

Using a similar technique to the one used for the linear-array model, here is the communication and computation model for the 2-D mesh.

1. Broadcasting chunk_size, N, B, and I

   $t_{comm-bcast-4-int} = 4 \times 2 \times (t_s + t_w) \times \log_2(\sqrt{p})$

2. Broadcasting the 2nd protein sequence (vector b)

   $t_{comm-bcast-protein-seq} = 2 \times (t_s + t_w \times N) \times \log_2(\sqrt{p})$

3. Scattering chunk_size rows to each process

   As discussed in Section 2.2.4, the scattering model is the same in the 2-D mesh and linear-array models:

   $t_{comm-scatter-protein-seq} = I \times (t_s \times \log_2(p) + t_w \times \frac{N}{p \times I} \times (p-1))$

4. Sending shared data

   The communication time for sending shared data is also the same as in the linear-array model:

   $t_{comm-send-shared-data} = (I-1)(\frac{N}{B} + p - 1)(t_s + B\,t_w) + (\frac{N}{B} + p - 2)(t_s + B\,t_w)$

5. Gathering calculated data

   The gathering formula equals the scattering one except for the amount of data being gathered:

   $t_{comm-gather} = I \times (t_s \times \log_2(p) + t_w \times \frac{N}{p \times I} \times N \times (p-1))$

6. Putting all the communication times together

   $t_{comm-all} = t_{comm-bcast-4-int} + t_{comm-bcast-protein-seq} + t_{comm-scatter-protein-seq} + t_{comm-send-shared-data} + t_{comm-gather}$

   Simplifying the equation with respect to B (separating the constant part from the terms containing B, so that we can easily take the derivative later to obtain the optimum B), we obtain

   $t_{comm-all}(B) = (10\log_2(\sqrt{p}) + 2I\log_2(p) + (p-1)(I-1) + (p-2))\,t_s + \bigl((8+2N)\log_2(\sqrt{p}) + \frac{N}{p}(p-1) + \frac{N^2}{p}(p-1) + I - 1 + N\bigr)\,t_w + \frac{IN}{B}\,t_s + ((I-1)(p-1) + p - 2)\,B\,t_w$

   Simplifying the equation with respect to I, we obtain

   $t_{comm-all}(I) = (10\log_2(\sqrt{p}) + 2I\log_2(p) - 1)\,t_s + \bigl((8+2N)\log_2(\sqrt{p}) + \frac{N}{p}(p-1) + \frac{N^2}{p}(p-1) + B\bigr)\,t_w + (\frac{N}{B} + p - 1)(t_s + B\,t_w)\,I$

2.3.5 Solution 1: Optimum B and I for 2-D Mesh Model

The optimum B can be derived by calculating $\frac{d\,t_{total}(B)}{dB}$ and setting the derivative to 0:

$\frac{d\,t_{total}(B)}{dB} = 0$

Using the model obtained in the previous section, this gives the following equation:

$-\frac{IN}{B^2}\,t_s + ((I-1)(p-1) + (p-2))\,t_w + \frac{N(p-1)}{p}\,t_c = 0$

$((I-1)(p-1) + (p-2))\,t_w + \frac{N(p-1)}{p}\,t_c = \frac{IN}{B^2}\,t_s$

$B^2 = \frac{I N t_s}{((I-1)(p-1) + (p-2))\,t_w + \frac{N(p-1)}{p}\,t_c}$

$B^2 = \frac{p I N t_s}{((I-1)(p-1) + (p-2))\,p\,t_w + N(p-1)\,t_c}$

$B = \sqrt{\frac{p I N t_s}{((I-1)(p-1) + (p-2))\,p\,t_w + N(p-1)\,t_c}}$


We observe that the resulting optimum B for the 2-D mesh model is equal to that of the linear-array model. As discussed in Section 2.2.5, the 2-D mesh model only differs in the broadcast time, which acts as a constant in the $t_{total}(B)$ equation and disappears when we take the derivative.

As with the linear-array model, we cannot find an optimum I for the blocking-and-interleave technique, because the derivative $\frac{d\,t_{total}(I)}{dI}$ is a constant, as shown below:

$\frac{d\,t_{total}(I)}{dI} = 0$

$(\frac{N}{B} + p - 1)(t_s + B\,t_w) = 0$

2.3.6 Solution 1: Improvement

Figure 5: Blocking and Interleave Communication

The main idea of this improvement is to move the gathering of the final data to the end of the whole calculation in each process. Referring to Figure 5, this means that gathering is performed after step 14.

To implement this improvement, we performed the following steps:

1. Allocate enough memory in each process to hold $I \times N \times chunk\_size$ values. Note that chunk_size in this case is $\frac{N}{P \times I}$ (a rough memory estimate follows after step 3).


CHECKNULL((hptr = (int *) malloc(sizeof(int) * (N) * I * (chunk_size + 1)))); // temporary resulting matrix for each process
CHECKNULL((h = (int **) malloc(sizeof(int *) * I * (chunk_size + 1)))); // list of pointers

int ***hfin;
CHECKNULL(hfin = (int ***) malloc(sizeof(int ***) * I));

for (i = 0; i < (chunk_size + 1) * I; i++) {
    h[i] = hptr + i * N; // put the pointers into the array
}

for (i = 0; i < I; i++) {
    hfin[i] = h + i * (chunk_size + 1);
}

2. Change the way each process manipulates the data. Each process stores the data through hfin, a variable of type int***, so we need to store the data as shown in the following code snippet.

for (int current_interleave = 0; current_interleave < I; current_interleave++) {

    MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                chunk_size, MPI_SHORT, chunk_a, chunk_size,
                MPI_SHORT, 0, MPI_COMM_WORLD); // chunk_a is the receiving buffer

    int current_column = 1;
    for (i = 0; i < chunk_size + 1; i++) hfin[current_interleave][i][0] = 0;

    for (int current_block = 0; current_block < total_blocks; current_block++) {
        // Receive
        int block_end = MIN2(current_column - (current_block == 0 ? 1 : 0) + B, N);
        if (rank == 0 && current_interleave == 0) { // if rank 0 is processing the first block, it doesn't need to receive anything
            for (int k = current_column; k < block_end; k++) {
                hfin[current_interleave][0][k] = 0; // init row 0
            }
        } else {
            int receive_from = rank == 0 ? total_processes - 1 : rank - 1; // receive from the neighbouring process
            int size_to_receive = current_block == total_blocks - 1 ? last_block : B;

            MPI_Recv(hfin[current_interleave][0] + current_block * B, size_to_receive,
                     MPI_INT, receive_from, 0, MPI_COMM_WORLD, &status);
        }
        for (j = current_column; j < block_end; j++, current_column++) {
            for (i = 1; i < chunk_size + 1; i++) {
                diag = hfin[current_interleave][i-1][j-1] + sim[chunk_a[i-1]][b[j-1]];
                down = hfin[current_interleave][i-1][j] + DELTA;
                right = hfin[current_interleave][i][j-1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0) {
                    hfin[current_interleave][i][j] = 0;
                } else {
                    hfin[current_interleave][i][j] = max;
                }
            }
        }

        // Send
        if (current_interleave + 1 != I || rank + 1 != total_processes) {
            int send_to = rank + 1 == total_processes ? 0 : rank + 1;
            int size_to_send = current_block == total_blocks - 1 ? last_block : B;
            MPI_Send(hfin[current_interleave][chunk_size] + current_block * B,
                     size_to_send, MPI_INT, send_to, 0, MPI_COMM_WORLD);
        }
    }
}

Note that hfin[i] holds the data of the i-th interleaving stage in each process.

3. Move the gathering process to the end of the whole calculation, as shown in the following code snippet.

for (i = 0; i < I; i++) {
    MPI_Gather(hptr + N + i * chunk_size * N, N * chunk_size, MPI_INT,
               h_all_ptr + N + i * chunk_size * total_processes * N,
               N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
}
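As a rough sanity check on the memory cost implied by step 1: each process now keeps about $I \times (chunk\_size + 1) \times N$ ints instead of $(chunk\_size + 1) \times N$, and since $chunk\_size = \frac{N}{P \times I}$ this is roughly $\frac{N^2}{P}$ values per process regardless of I. For example, with N = 10000 and P = 8 that is about $1.25 \times 10^7$ ints, i.e. roughly 50 MB at 4 bytes per int, so the improvement trades extra memory for the deferred gather.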

2.3.7 Solution 1: Optimum B and I for the Improved Solution

Here are the parts of the model that are affected by the improved solution.

1. Sending shared data

   For the first I - 1 interleaving stages the communication time is

   $(I-1) \times (t_s + t_w \times B) \times \frac{N}{B}$

   The last interleaving stage consists of the following amount of communication time:

   $(t_s + t_w \times B) \times (\frac{N}{B} + P - 2)$

   Putting them together, the communication time to send shared data is

   $(t_s + t_w \times B) \times (\frac{N}{B} + P - 2) + (I-1) \times (t_s + t_w \times B) \times \frac{N}{B}$

2. Computation time

   Along with the changes to sending and receiving, the computation time also changes:

   $\bigl(\frac{N}{B} \times B \times \frac{N}{P \times I} \times (I-1) + B \times \frac{N}{P \times I} \times (\frac{N}{B} + P - 1)\bigr) \times t_c$

Optimal B and I for the Improved Solution

To calculate the optimal values we ignore all the communication terms that do not influence the optimal B and I. For the optimal B, we only need the following expression:

$t_{total-improved}(B) = (t_s + t_w B)(\frac{N}{B} + P - 2) + (I-1)(t_s + t_w B)\frac{N}{B} + \bigl(\frac{N}{B} B \frac{N}{P \times I}(I-1) + B \frac{N}{P \times I}(\frac{N}{B} + P - 1)\bigr) t_c$

$\frac{d\,t_{total-improved}(B)}{dB} = 0$

$-\frac{(I-1)\,t_s N}{B^2} - \frac{N t_s}{B^2} + (P-2)\,t_w + (P-1)\,\frac{N}{P \times I}\,t_c = 0$

$B = \sqrt{\frac{I^2\,t_s\,N\,P}{(P-2)\,t_w\,P\,I + (P-1)\,N\,t_c}}$

$B \approx \sqrt{\frac{I\,N\,t_s}{(P-2)\,t_w}}$

For the optimal I, however, we also need to consider the scatter time. We therefore obtain the following expression for $t_{total-improved}(I)$:

$t_{total-improved}(I) = I\,t_s \log_2(p) + (t_s + t_w B)(\frac{N}{B} + P - 2) + (I-1)(t_s + t_w B)\frac{N}{B} + \frac{N}{B} B \frac{N}{P \times I}(I-1)\,t_c + B \frac{N}{P \times I}(\frac{N}{B} + P - 1)\,t_c$

$\frac{d\,t_{total-improved}(I)}{dI} = 0$

$t_s \log_2(p) + (t_s + t_w B)\frac{N}{B} + \frac{N^2 B}{B\,P\,I^2}\,t_c - \frac{B N}{P\,I^2}(\frac{N}{B} + P - 1)\,t_c = 0$

$I = \sqrt{\frac{B^2 N (\frac{N}{B} + P - 1)\,t_c - N^2 B\,t_c}{B\,P\,t_s \log_2(p) + (t_s + t_w B)\,N\,P}}$

$I \approx \sqrt{\frac{B\,N\,t_c}{t_s \log_2(p) + N\,t_w + \frac{N}{B}\,t_s}}$
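As a purely illustrative evaluation of the last approximation (with assumed, not measured, parameters $t_s = 10\,\mu s$, $t_w = 10\,ns$, $t_c = 10\,ns$, N = 10000, B = 100 and p = 8): the numerator is $B N t_c = 10^{-2}$ and the denominator is $3 \times 10^{-5} + 10^{-4} + 10^{-3} \approx 1.13 \times 10^{-3}$, giving $I \approx \sqrt{8.8} \approx 3$, i.e. the model predicts a small optimal interleave factor under these assumptions.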

2.3.8 Solution 2: Using Send and Receive

This implementation also takes into account the row interleave factor along with the column blocking. Every process calculates the number of rows it has to process in every interleave stage and initializes the memory. The master process allocates the matrix H and uses it for its own partial processing as well.

Each process processes N/(p*I) rows in every interleave stage and communicates each block with its neighbour process. The last process communicates its block with the master process, except in the last interleave stage, where it does not perform this communication.

if (id == 0)
{
    for (i = 0; i < ColumnBlock; i++)
    {
        CHECKNULL((chunk = (int *) malloc(sizeof(int) * (B))));

        for (j = 1; j <= s; j++)
        {
            for (k = i*B + 1; k <= (i+1)*B && k <= b[0] && k <= N; k++)
            {
                int RowPosition;

                if ((interleave*p + id) < r)
                    RowPosition = (interleave * (N/(p*I)+1) * p) + id * ((N/(p*I)+1)) + j;
                else
                    RowPosition = (r * (N/(p*I)+1)) + (interleave*p + id - r) * (N/(p*I)) + j;

                diag = h[RowPosition-1][k-1] + sim[a[RowPosition]][b[k]];
                down = h[RowPosition-1][k] + DELTA;
                right = h[RowPosition][k-1] + DELTA;
                max = MAX3(diag, down, right);

                if (max <= 0) {
                    h[RowPosition][k] = 0;
                } else {
                    h[RowPosition][k] = max;
                }
                chunk[k - (i*B + 1)] = h[RowPosition][k];
            }
        } // communicate the partial block to the next process
        MPI_Send(chunk, B, MPI_INT, id+1, 0, MPI_COMM_WORLD);
        free(chunk);
    }
    // end filling matrix H[][] at the master
} else if (id != p-1)
{ // filling the matrix at the other processes

    for (i = 0; i < ColumnBlock; i++)
    {
        CHECKNULL((chunk = (int *) malloc(sizeof(int) * (B))));

        MPI_Recv(chunk, B, MPI_INT, id-1, 0, MPI_COMM_WORLD, &status);
        for (z = 0; z < B; z++)
        {
            if ((i*B + z) <= N)
                h[0][i*B + z + 1] = chunk[z];
        }
        for (j = 1; j <= s; j++)
        {
            int RowPosition;

            if ((interleave*p + id) < r)
                RowPosition = (interleave * (N/(p*I)+1) * p) + id * ((N/(p*I)+1)) + j;
            else
                RowPosition = (r * (N/(p*I)+1)) + (interleave*p + id - r) * (N/(p*I)) + j;

            for (k = i*B + 1; k <= (i+1)*B && k <= b[0] && k <= N; k++)
            {
                diag = h[j-1][k-1] + sim[a[RowPosition]][b[k]];
                down = h[j-1][k] + DELTA;
                right = h[j][k-1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0)
                    h[j][k] = 0;
                else
                    h[j][k] = max;

                chunk[k - (i*B + 1)] = h[j][k];
            }
        }
        MPI_Send(chunk, B, MPI_INT, id+1, 0, MPI_COMM_WORLD);
        free(chunk);
    } // end filling the matrix at the other processes
} else // start filling the matrix at the last process
{
    for (i = 0; i < ColumnBlock; i++)
    {
        CHECKNULL((chunk = (int *) malloc(sizeof(int) * (B))));

        MPI_Recv(chunk, B, MPI_INT, id-1, 0, MPI_COMM_WORLD, &status);
        for (z = 0; z < B; z++)
        {
            if ((i*B + z) <= N)
                h[0][i*B + z + 1] = chunk[z];
        }

        free(chunk);
        for (j = 1; j <= s; j++)
        {
            int RowPosition;
            if ((interleave*p + id) < r)
                RowPosition = (interleave * (N/(p*I)+1) * p) + id * ((N/(p*I)+1)) + j;
            else
                RowPosition = (r * (N/(p*I)+1)) + (interleave*p + id - r) * (N/(p*I)) + j;

            for (k = i*B + 1; k <= (i+1)*B && k <= b[0] && k <= N; k++)
            {
                diag = h[j-1][k-1] + sim[a[RowPosition]][b[k]];
                down = h[j-1][k] + DELTA;
                right = h[j][k-1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0)
                    h[j][k] = 0;
                else
                    h[j][k] = max;
            }
        }
    }
}

After filling its partial matrix H, every process sends the partial result to the master process at every interleave stage. Below is the code snippet of the master gathering the partial results after every interleave stage.

if (id == 0)
{
    int row, col;
    for (i = 1; i < p; i++)
    {
        MPI_Recv(&row, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
        CHECKNULL((recv_hptr = (int *) malloc(sizeof(int) * (row) * (N))));

        MPI_Recv(recv_hptr, row*N, MPI_INT, i, 0, MPI_COMM_WORLD, &status);

        for (j = 0; j < row; j++)
        {
            int RowPosition;

            if ((interleave*p + i) < r)
                RowPosition = (interleave * (N/(p*I)+1) * p) + i * ((N/(p*I)+1)) + j + 1;
            else
                RowPosition = (r * (N/(p*I)+1)) + (interleave*p + i - r) * (N/(p*I)) + j + 1;

            for (k = 0; k < N; k++)
                h[RowPosition][k+1] = recv_hptr[j*N + k];

        }
        free(recv_hptr);
    }
}
else
{
    MPI_Send(&s, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    CHECKNULL((recv_hptr = (int *) malloc(sizeof(int) * (s) * (N))));

    for (j = 0; j < s; j++)
    {
        for (k = 0; k < N; k++)
            recv_hptr[j*N + k] = h[j+1][k+1];
    }
    MPI_Send(recv_hptr, s*N, MPI_INT, 0, 0, MPI_COMM_WORLD);

    free(recv_hptr);
}

The interleave scheme is illustrated in the timing diagram in Appendix D.

2.3.9 Solution 2: Linear-array Model

1. Every process calculates $\frac{N}{p \times I} \times B$ values in every interleave stage before communicating a chunk with the next process. It takes $(\frac{N}{B} + p - 1) \times I$ steps in total for the computation:

   $t_{comp1} = I \times (\frac{N}{B} + p - 1) \times (\frac{N}{p \times I} \times B) \times t_c$

2. After each computation step, each process communicates a block with its neighbour process:

   $t_{comm1} = (I-1)(\frac{N}{B} + p - 1)(t_s + B\,t_w) + (\frac{N}{B} + p - 2)(t_s + B\,t_w)$

3. After completing its part of matrix H, every process sends it to the master process (once per interleave stage):

   $t_{comm2} = (t_s + \frac{N}{p \times I} \times N \times t_w) \times I$

4. In the end, the master process puts all the partial results into the matrix H to finalize it:

   $t_{comp2} = I \times (t_s + \frac{N}{p \times I} \times N \times t_w)$

The total execution time is calculated by combining all of these:

$t_{total} = t_{comp1} + t_{comm1} + t_{comp2} + t_{comm2}$

$t_{total} = I(\frac{N}{B} + p - 1)(\frac{N}{p \times I} B)\,t_c + (I-1)(\frac{N}{B} + p - 1)(t_s + B\,t_w) + (\frac{N}{B} + p - 2)(t_s + B\,t_w) + (t_s + \frac{N}{p \times I} N\,t_w)\,I + I(t_s + \frac{N}{p \times I} N\,t_w)$

2.3.10 Solution 2: Optimum B and I for Linear-array Model

The optimum B can be derived by taking the derivative of t_total with respect to B and setting it to zero:

\[ \frac{d\,t_{total}(B)}{dB} = 0 \]

Using the model obtained in the previous section, we obtain the following equation:

\[ -\frac{I N}{B^2} t_s + \left((I-1)(p-1) + (p-2)\right) t_w + \frac{N(p-1)}{p} = 0 \]

\[ \left((I-1)(p-1) + (p-2)\right) t_w + \frac{N(p-1)}{p} = \frac{I N}{B^2} t_s \]

\[ B^2 = \frac{I N t_s}{\left((I-1)(p-1) + (p-2)\right) t_w + \frac{N(p-1)}{p}} \]

\[ B^2 = \frac{p I N t_s}{\left((I-1)(p-1) + (p-2)\right) p \, t_w + N(p-1)} \]

\[ B = \sqrt{\frac{p I N t_s}{\left((I-1)(p-1) + (p-2)\right) p \, t_w + N(p-1)}} \]

However, we cannot find an optimum I for the Blocking-and-Interleave technique, because the derivative of t_total with respect to I is a constant:

\[ \frac{d\,t_{total}(I)}{dI} = \left(\frac{N}{B} + p - 1\right)(t_s + B \times t_w) \]

Setting this expression to zero has no solution, since it is a positive constant. Looking at the derivative, the interleave factor only introduces more communication time when sending and receiving the shared data. Therefore, no optimum interleave level can be derived from this model.
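The closed-form expression for B can likewise be evaluated for concrete parameters. Again, this is only a sketch: the values of t_s and t_w passed in main are assumed placeholders rather than measured constants, and the resulting value depends strongly on them.

#include <math.h>
#include <stdio.h>

/* Optimum block size for the linear-array model, transcribed from the
   closed-form expression derived above. Compile with -lm. */
static double optimum_B(double N, double p, double I, double ts, double tw)
{
    double numerator   = p * I * N * ts;
    double denominator = ((I - 1.0) * (p - 1.0) + (p - 2.0)) * p * tw
                       + N * (p - 1.0);
    return sqrt(numerator / denominator);
}

int main(void)
{
    /* Hypothetical parameters: N = 10000, p = 8, I = 1, ts = 1e-5 s, tw = 1e-8 s. */
    printf("optimum B = %f\n", optimum_B(10000, 8, 1, 1e-5, 1e-8));
    return 0;
}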

2.3.11 Solution 2: 2-D Mesh Model

As discussed in section 2.2.9, the 2-D Mesh model is the same as the linear-array model for Solution 2, because the 2-D Mesh model only affects the broadcast procedure and Solution 2 does not include any broadcast in its implementation.


3 Performance Results

We measured the performance of the parallel versions on the ALTIX machine and compared the results against the sequential version.

3.1 Solution 1

3.1.1 Performance of Sequential Code

First, we measured the performance of the Smith-Waterman algorithm using the sequential code. Figure 6 shows the results.

Figure 6: Sequential Code Performance Measurement Result

Figure 6 shows that as N increases, the time taken to complete filling matrix H also increases almost linearly.


3.1.2 Find Out Optimum Number of Processor (P)

At first, we observe the performance by fixing the protein size (N) to 5000 and 10000, the block size (B) to 100, and the interleave factor (I) to 1. The result is shown in Figure 7.

1. Protein size equal to 5000 (N = 5000), block size (B) of 100, and interleave factor (I) of 1

Figure 7: Measurement result when N is 5000, B is 100 and I is 1

Plotting the result gives the diagram in Figure 8.

Figure 8: Diagram of measurement result when N is 5000, B is 100, I is 1

When the protein size (N) is 5000 and the number of processors (P) is 4, we obtain a speedup of t_serial / t_parallel = 3.3 / 1.454 = 2.26.


2. Protein size equal to 10000 (N = 10000). We obtain the result shown in Figure 9.

Figure 9: Measurement result when N is 10000, B is 100 and I is 1

Plotting the result gives the diagram in Figure 10.

Figure 10: Diagram of measurement result when N is 10000, B is 100, I is 1

When the protein size (N) is 10000 and the number of processors (P) is 8, we obtain a speedup of t_serial / t_parallel = 12.508 / 2.47 = 5.06.

Based on the results above, we found that the maximum speedup is achieved when the number of processors (P) is 8 and the protein size (N) is 10000. Therefore, for the subsequent experiments, we fix the number of processors to 8 and vary the other parameters.

3.1.3 Find Out Optimum Blocking Size (B)

In this subsection, we analyze the performance results and find the optimum blocking size (B). We fix the number of processors (P) to 8, the protein size (N) to 10000, and the interleave factor (I) to 1. The results are shown in Figure 11.


Figure 11: Performance measurement result when N is 10000, P is 8, I is 1

Figure 12: Diagram of measurement result when N is 10000, P is 8, I is 1

The results are plotted in Figure 12. We zoomed in on the right-hand side of Figure 12 to give a clearer picture of the performance when B is less than or equal to 500.

We found that the empirical optimum blocking size (B) for Solution 1 is 100, which yields a speedup of t_serial / t_parallel = 12.508 / 2.401 = 5.21.


3.1.4 Find Out Optimum Interleave Factor (I)

Using the optimum blocking size (B) from the previous section, we determine the optimum I. The result is shown in Figure 13.

Figure 13: Diagram of measurement result when N is 10000, P is 8, B is 100

We found that the optimum I is 1. Using I equal to 1, we obtain a 4.76 times speedup compared to sequential execution.

3.2 Solution 1-Improved

We ran the same experiments as for Solution 1 to obtain the corresponding data for our improved solution.

3.2.1 Find Out Optimum Number of Processor (P)

At first, we observe the performance by fixing the protein size (N) to 10000, the block size (B) to 100, and the interleave factor (I) to 1.

We obtain the result shown in Figure 14.

Figure 14: Measurement result when N is 10000, B is 100 and I is 1

Plotting the result gives the diagram in Figure 15. When the protein size (N) is 10000 and the number of processors (P) is 8, we obtain a speedup of t_serial / t_parallel = 12.508 / 2.977 = 4.201.


Figure 15: Diagram of measurement result when N is 10000, B is 100, I is 1

Based on the results above, we found that the maximum speedup is achieved when the number of processors (P) is 8 and the protein size (N) is 10000. Therefore, for the subsequent experiments, we fix the number of processors to 8 and vary the other parameters.


3.2.2 Find Out Optimum Blocking Size (B)

In this subsection, we analyze the performance results and find the optimum blocking size (B). We fix the number of processors (P) to 8, the protein size (N) to 10000, and the interleave factor (I) to 1. The results are shown in Figure 16.

Figure 16: Performance measurement result when N is 10000, P is 8, I is 1

The results are plotted in Figure 17. We zoomed in on the right-hand side of Figure 17 to give a clearer picture of the performance when B is less than or equal to 500.

Figure 17: Diagram of measurement result when N is 10000, P is 8, I is 1

We found that the empirical optimum blocking size (B) for the improved Solution 1 is 200, which yields a speedup of t_serial / t_parallel = 12.508 / 2.464 = 5.08.


3.2.3 Find Out Optimum Interleave Factor (I)

Using the optimum blocking size (B) from the previous section, we determine the optimum I. The result is shown in Figure 18.

Figure 18: Diagram of measurement result when N is 10000, P is 8, B is 200

We found that the optimum I is 2. Using I equal to 2, we obtain a speedup of t_serial / t_parallel = 12.508 / 2.613 = 5.08.

3.3 Solution 2

Using the sequential code performance results obtained during the Solution 1 evaluation, we measured the performance of Solution 2.

3.3.1 Find Out Optimum Number of Processor (P)

The first step was to observe the performance by fixing the protein size (N) and the block size (B), and setting the interleave factor (I) to 1.

1. Protein size equal to 5000 (N = 5000), block size (B) of 100, and interleave factor (I) of 1

Figure 19: Measurement result when N is 5000, B is 100 and I is 1

Plotting the result gives the diagram in Figure 20.


Figure 20: Diagram of measurement result when N is 5000, B is 100, I is 1

Using a protein size (N) of 5000 and a number of processors (P) of 32, we achieve a maximum speedup of 31.55% compared to the existing sequential code.

2. Protein size equal to 10000 (N = 10000), block size (B) of 100, and interleave factor (I) of 1

Figure 21: Measurement result when N is 10000, B is 100 and I is 1

Plotting the result gives the diagram in Figure 22.

Using a protein size (N) of 10000 and a number of processors (P) of 32, we achieve a 54.67% speedup compared to the existing sequential code.

Figure 22: Diagram of measurement result when N is 10000, B is 100, I is 1

Based on the results obtained in this section, we found that the parallel implementation of Solution 2 achieves the highest speedup when the number of processors is 32. In our subsequent performance evaluation, we therefore fix the number of processors to 32 and determine the optimum values of the other variables.

3.3.2 Find Out Optimum Blocking Size (B)

In this subsection, we analyze the performance results and find the optimum blocking size (B). We fix the number of processors (P) to 32, the protein size (N) to 10000, and the interleave factor (I) to 1. The results are shown in Figure 23.

Figure 23: Performance measurement result when N is 10000, P is 32, I is 1

The result is plotted in Figure 24 below.

Figure 24: Diagram of measurement result when N is 10000, P is 32, I is 1

We found that the empirical optimum blocking size (B) for our Solution 2 is 50. Interestingly, the performance using this optimum B is slightly worse than the result from section 3.3.1: using B equal to 50, we achieve a 53.91% speedup compared to sequential execution, but are 1.69% slower than the result from section 3.3.1.

3.3.3 Find Out Optimum Interleave Factor (I)

Using the optimum blocking size (B) from the previous section, we determine the optimum I. The results are shown in Figure 25 and Figure 26.

Figure 25: Performance measurement result when N is 10000, P is 32, B is50

Figure 26: Diagram of measurement result when N is 10000, P is 32, B is 50

We found that the optimum I is 30. Another interesting point is that the execution times are very close to each other when I ranges from 10 to 100. For the existing configuration (N = 10000, P = 32 and B = 50), the value of I therefore does not affect the execution time much in this range, and practically we can choose any I value from 10 to 100. Using the optimum I of 30, we obtain a 58.79% speedup compared to sequential execution and a 10.58% speedup compared to the result without interleaving.


3.4 Putting All the Optimum Values Together

Figure 27 and Figure 28 compare the execution times of all the solutions when the optimum parameters are used.

Figure 27: Putting all of them together

Figure 28: Putting all of them together - the plot

The improved Solution 1 has a slightly longer execution time than the original Solution 1. The measured time for the improved Solution 1 includes not only the cost of the main part (the interleave loop) but also all the accompanying communication, such as the initial broadcast and the final gather. Therefore, its result is quite close to that of the original Solution 1.


3.5 Testing with different GAP penalties

Using the optimum blocking size (B) of 50, the optimum interleave factor (I) of 30, and a protein size of 10000, we examined the results with different gap penalties. The result is shown in Figure 29.

Figure 29: Testing with different gap penalties

Figure 30: gap penalty vs Time

We found that changing the gap penalty has no effect, or only a very minor effect, on the overall execution time of the implementations.


4 Conclusions

We successfully implemented three different parallel solutions of the Smith-Waterman algorithm. Initially we provided a solution using Scatter and Gather. We found that the first version of Solution 1 exhibits an MPI-barrier-like behaviour of blocking all processes at a certain point. In general MPI_Gather does not have such a property, but in our pipelined realization, where the processes depend on each other, each process waits until the master is able to send the data. Therefore an improved realization was proposed: we optimized our first implementation so that it no longer exhibits this barrier behaviour. In the improved version, each process allocates enough memory for all chunks to store the results of the interleave stages, and the final gather is invoked only after all calculation work is completed. The second implementation used the primitive Send and Receive methods provided by MPI.

For all the implementations, we performed evaluation and testing on the ALTIX machine and empirically determined the optimum B and I. We created performance models for the implementations using two different interconnection networks, i.e. linear array and 2-D mesh. We also calculated the optimum B and I analytically by taking derivatives of the models.

We tested our implementations for different values of B, I, p and DELTA. The factor p, the number of processors, has the major effect on the execution time: increasing the number of processors decreases the execution time of the problem. The factor B also improves the performance of the code, as shown in the results. DELTA has no effect on the execution time of the implementations. We also found that the execution time has a certain deviation, so the choice of optimal parameters is tricky.


APPENDIX

A Source Code Compilation

We created a Makefile to automate the compilation process. To compile the source code, we use this command:

make

To remove the executables created by the compilation process, we use this command:

make clean

Here is the content of the Makefile:

CXX = icc

all: protein_free_par

clean:
	rm protein_free_par

protein_free_par: proteinFree.cpp
	${CXX} proteinFree.cpp -o protein_free_par -lmpi


B Execution on ALTIX

We used the Slurm+MOAB utility to submit the job on the ALTIX machine for execution of the code. The following is the script we used for submitting the job to Slurm.

#!/bin/bash
# @ job_name = test
# @ initialdir = .
# @ output = mpi_%j.out
# @ error = mpi_%j.err
# @ total_tasks = 4
# @ wall_clock_limit = 00:02:00

time mpirun -np 4 ./protein_free_par a_500k b_500k data.score 1 5000 100 1

To submit the script for execution, we used the mnsubmit command.

mnsubmit script

Our script can be found in the following directory:

/home/cursos/ampp/ampp03/Documents/AMPP-Final/ProteinFree/script


C Timing diagram for Blocking technique in Solution 2

Figure 31: Performance Model Solution 2

D Timing diagram for Blocking-and-Interleave technique in Solution 2

Figure 32: Performance Model with Interleave
