
Universitat Politècnica de Catalunya
Facultat d'Informàtica de Barcelona

AMPP Final Project

Smith-Waterman Algorithm Parallelization

Authors:
Mario Almeida
Zygimantas Bruzgys
Umit Cavus Buyuksahin

Supervisors:
Josep Ramon Herrero Zaragoza
Daniel Jimenez Gonzalez

Barcelona, 2012


Contents

1 Introduction

2 Main Issues and Solutions
  2.1 Parallelization Techniques
      2.1.1 Blocking Technique
      2.1.2 Blocking and Interleaving Technique
  2.2 Performance Model on Linear Network Topology
      2.2.1 Blocking Technique
      2.2.2 Blocking Technique: Optimum B
      2.2.3 Blocking and Interleaving Technique
      2.2.4 Blocking and Interleaving Technique: Optimum B
      2.2.5 Blocking and Interleaving Technique: Optimum I
  2.3 Performance Model on 2D Torus Network Topology
      2.3.1 Blocking Technique
      2.3.2 Blocking and Interleaving Technique
  2.4 Implementation

3 Performance Results
  3.1 Finding Optimal P and B
  3.2 Finding Optimal I

4 Conclusions

A How to Compile

B How to Execute on ALTIX

C Code


1 Introduction

In this project, a parallel implementation of the Smith-Waterman algorithm was developed using the Message Passing Interface (MPI). Smith-Waterman is a well-known algorithm for local sequence alignment, that is, for determining similar regions between two amino-acid sequences.

In order to find the best alignment between two amino-acid sequences, a matrix H of size N × N is computed, where N is the size of each sequence. Every element of this matrix is based on a score matrix (the cost of matching two symbols) and a gap penalty for mismatching symbols of the sequences. When matrix H is computed, the optimum alignment of the sequences can be obtained by tracing back through the matrix, starting from the highest value in the matrix.
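
For reference, the standard Smith-Waterman recurrence that defines each interior element of H can be written as follows (a sketch assuming a linear gap penalty; s(a_i, b_j) is the score-matrix entry for the two symbols and δ is the gap penalty, added to the neighbouring cells just as the DELTA parameter is in the implementation of Section 2.4, which implies δ ≤ 0):

H_{i,j} = \max\left(0,\ H_{i-1,j-1} + s(a_i, b_j),\ H_{i-1,j} + \delta,\ H_{i,j-1} + \delta\right)

The clamping at zero is what makes the alignment local rather than global.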

In our parallel implementation, only the H matrix calculation was parallelized, as it is our only interest. The trace-back part was removed from the code, and from the sequential code as well, in order to gather the most accurate computation times for comparison. For parallelization a pipelining method was used: each process communicates with the next one after calculating B columns of its N/P rows. This is called blocking, and we introduced a parameter that allows this value to be changed easily. Later an interleaving parameter I was added.

During this project several performance models were created: one for a linear interconnection network and another for a 2D torus network. The B and I parameters were included in the model calculations. The optimum B and I were then derived, and performance tests were executed to determine those two parameters empirically.

2 Main Issues and Solutions

In this section the parallelization solutions are described. The solution with blocking at column level is explained and its performance model is described. Then the solution with both blocking at column level and interleaving at row level is explained, together with its performance model. This section also provides the calculations for the optimum blocking factor B and the optimum interleaving factor I. The second part of the section describes the performance models for both solutions on a different network topology, and the calculations for finding the optimum B and I there. Finally, our implementation of these techniques in C++ is provided and explained.


2.1 Parallelization Techniques

2.1.1 Blocking Technique

Figure 1: Parallelization approach by introducing blocking at column level

The P processes share the matrix M in terms of consecutive rows. For calculating the matrix M of size N × N, each process Pi works with N/P consecutive rows of the matrix. When using the blocking technique for parallelization, the columns are divided into blocks of a defined size B, so each process has to calculate N/B blocks. These parameters are visualized in Figure 1. The top part of the figure shows how the elements of the matrix are divided between processes, and the bottom part visualizes the parallelization of the calculations between processes. It shows that once the first process has computed the first block of the matrix, which is of size N/P × B, it communicates with the next process. The next process then starts calculating its block of the matrix while the first process continues with its next block, and so on. This type of parallelization is called pipelining: the problem is divided into a series of tasks that have to be completed one after the other. Before explaining the parallelization in detail, we should analyze the data and task dependencies between processes for calculating the matrix.

Figure 2: Data dependency for calculating one matrix element

Figure 2 shows the data dependency for a particular matrix element. In order to calculate a matrix element M[i][j], the process Pi+1 needs the already calculated value M[i][j − 1] from the previous column and the elements M[i − 1][j − 1] and M[i − 1][j] from the previous row, as seen in the picture. If the previous row is calculated by the process Pi, then that row is sent only after process Pi has calculated a block of size N/P × B. This introduces data and task dependencies: the process Pi+1 cannot start its calculations until the process Pi sends the last row of the block, which is needed for calculating the block of process Pi+1. To calculate the first row and column of the matrix, the preceding row and column are considered to be filled with zeros.

Figure 3: Data dependencies between blocks of the matrix

Figure 3 shows the parallelism over the whole matrix. The squares represent the blocks of the matrix and the three arrows show the data dependencies between blocks. As mentioned before, an element needs its upper, left, and upper-left values to already be calculated; this is the data dependency. Therefore, blocks on the same minor diagonal are independent from each other, so these blocks can be, and are, calculated in parallel.

The steps of the calculation are as follows (a minimal sketch of this loop in MPI is given after the list):


1. The process waits until the previous process finishes the calculation of a block (if applicable);

2. The process receives the last row of the block that was calculated by the previous process;

3. After receiving the last row of the block calculated by the previous process, the process has all the information necessary to calculate its own block, so it performs the calculation of its block;

4. When the process finishes the calculation, it sends the last row of its block to the next process (if applicable);

5. The process repeats these steps until it finishes the calculation of all blocks, that is, until it has calculated all rows that are assigned to it.
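
A minimal sketch of one process's side of this receive-compute-send pipeline is shown below. It is our own simplification, not the project code: it assumes a single interleave part, N divisible by B and P, row-major storage of the N/P local rows, and a stub compute_block() standing in for the Smith-Waterman update (the full implementation is given in Section 2.4).

#include "mpi.h"

/* Stand-in for the Smith-Waterman update of one N/P x B block; the real
 * computation is shown in Section 2.4.                                  */
static void compute_block(int block, int *prev_row, int *my_rows) {
    (void)block; (void)prev_row; (void)my_rows;
}

/* One process's side of the pipeline: for every block of B columns,
 * receive the previous process's last row, compute the block, and send
 * our own last row onwards.                                             */
static void pipeline(int rank, int P, int N, int Bsize,
                     int *prev_row, int *my_rows) {
    MPI_Status status;
    for (int block = 0; block < N / Bsize; block++) {
        if (rank > 0)                      /* wait for the previous process */
            MPI_Recv(prev_row + block * Bsize, Bsize, MPI_INT,
                     rank - 1, 0, MPI_COMM_WORLD, &status);
        compute_block(block, prev_row, my_rows);   /* fill the N/P x B block */
        if (rank < P - 1)                  /* pass our last row onwards      */
            MPI_Send(my_rows + (N / P - 1) * N + block * Bsize, Bsize, MPI_INT,
                     rank + 1, 0, MPI_COMM_WORLD);
    }
}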

2.1.2 Blocking and Interleaving Technique

Figure 4: Matrix calculation with interleaving factor, when I = 2

This parallelization method adds an interleave factor to the blocking technique described above. With this method the matrix is divided into I parts, so that each part has N × N/I elements. Every part is then calculated as explained in the previous section, that is, using the blocking technique. As soon as a process finishes processing the rows assigned to it from the first interleave part, it continues with the blocks from the next interleave part. For example, in Figure 4, where the interleaving factor is I = 2, the matrix is divided into two smaller parts. Each process Pi calculates N/(P · I) rows of one part before moving on to the second part.
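
To make the data distribution concrete, the small helper below (our own illustration, not part of the project code) computes which process owns a given global row under this scheme, assuming N is divisible by P · I:

/* Row-to-process mapping for the blocking + interleaving scheme:
 * the matrix is split into I parts of N/I consecutive rows, and within
 * each part every process gets N/(P*I) consecutive rows.              */
int owner_of_row(int row, int N, int P, int I) {
    int rows_per_part    = N / I;         /* rows in one interleave part   */
    int rows_per_process = N / (P * I);   /* rows per process in each part */
    int row_within_part  = row % rows_per_part;
    return row_within_part / rows_per_process;
}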


The steps of the calculation are very similar to those of the blocking technique and are as follows:

1. The process waits until the previous process finishes the calculation of a block (if applicable);

2. The process receives the last row of the block that was calculated by the previous process;

3. After receiving the last row of the block calculated by the previous process, the process has all the information necessary to calculate its own block, so it performs the calculation of its block;

4. When the process finishes the calculation, it sends the last row of its block to the next process. If the process is the last one and there is another interleave part to calculate, then it sends the row to the first process; otherwise it does not send anything;

5. The process repeats these steps until it finishes the calculation of all blocks within the current interleave part, that is, until it has calculated all rows assigned to it within that part. If there is another interleave part to calculate, it moves to the next interleave part and repeats these steps until all blocks from all interleave parts are calculated.

2.2 Performance Model on Linear Network Topology

2.2.1 Blocking Technique

In this section we describe the performance model of our implementation with the blocking technique for a linear network topology. In later sections we compare it with the non-linear topology, taking into account the differences in the performance models.

In order to focus on the main objectives of this performance analysis, we only take into account the parallel algorithms used for the matrix calculation. This means that some parts of the code that are done sequentially on a single process, such as opening and reading the input files, were ignored in this model.

Some assumptions were made in the models for the different network topologies, such as the assumption that the creation of new processes is location aware with respect to their place in the network, in order to make it more efficient.

For all the performance models described in this section we will use the following notation:


• t_s: startup time (prepare message + routing algorithm + interface between the local node and the router).

• t_c: computation time for each value in the matrix.

• t_w: traversal time per word.

• T_comm: total communication time.

• T_comp: total computation time.

Figure 5: Communication and computation times of the parallel matrix calculations by process using the blocking technique.

The diagram in Figure 5 represents the steps of the matrix calculation performed by our algorithm, as well as the initial declarations and the required communications. These different steps are represented with different colors. The blue color represents the scattering of one protein sequence to all the processes. The green areas represent the computation time needed to do the matrix calculations in each block, and the yellow color represents the time taken to send the last row of a block to the next process.

In order to simplify the diagram, the time the last process needs to receive the last row of a block from the previous process is already taken into account in the upper yellow area. This explains why the last process does not have yellow areas in its time-line but still has to wait to receive the data needed to perform its matrix calculations. All of this is considered in this performance model.

As we can observe from the diagram, the communication time of this model is composed of the scattering of the protein sequence vector (blue area) and several communications to send the last row of each block to the next process (yellow). The scatter method [2] receives a vector of size N and delivers a vector of size N/P to each process. The scattering time is given by:

T_{scatter} = t_s \cdot \log(P) + \frac{N}{P} \cdot (P - 1) \cdot t_w    (1)

The sending of the last row of each block to the next process is composed of the communication startup time (t_s) and the traversal time of the B elements in this block's row. This is given by:

T_{rowComm} = t_s + B \cdot t_w    (2)

In the total communication time, this startup and traversal are done N/B times for the first process and an extra P − 1 times for the remaining pipeline stages of the remaining processes. In order to take into account the fact that the last process does not need to send its last row to another process, we consider that it takes P − 2 extra times. So the total communication time is given by:

T_{comm} = T_{scatter} + \left(\frac{N}{B} + P - 1 - 1\right) \cdot T_{rowComm}

T_{comm} = t_s \cdot \log(P) + \frac{N}{P} \cdot (P - 1) \cdot t_w + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w)    (3)

The next step is to calculate the total computation time. Bearing in mind that a block is composed of N/P rows and B columns, the total number of block elements is B · N/P. This means that the computation time of a single block is given by:

T_{comp\_block} = t_c \cdot B \cdot \frac{N}{P}    (4)

As we did for the total communication time, this computation time is multiplied by N/B + P − 1 to account for the computation of the blocks across all the pipeline stages:

T_{comp} = \left(\frac{N}{B} + P - 1\right) \cdot T_{comp\_block}

T_{comp} = \left(\frac{N}{B} + P - 1\right) \cdot \left(t_c \cdot B \cdot \frac{N}{P}\right)    (5)

To conclude this performance model, the total parallelization time is given by the sum of the total communication and computation times. So the total parallelization time is given by:


T_{parallel} = T_{comp} + T_{comm}

T_{parallel} = \left(\frac{N}{B} + P - 1\right) \cdot \left(t_c \cdot B \cdot \frac{N}{P}\right) + \left(t_s \cdot \log(P) + \frac{N}{P} \cdot (P - 1) \cdot t_w\right) + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w)    (6)
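
Equation (6) is easy to evaluate numerically for concrete machine parameters. The sketch below is our own addition; the function name is arbitrary and the logarithm is taken as base 2, which the report leaves unspecified:

#include <math.h>
#include <stdio.h>

/* Estimated parallel time of the blocking technique on a linear topology,
 * following equation (6): computation of all pipeline stages, plus the
 * scatter, plus the per-block row transfers.                              */
double t_parallel_blocking(double N, double P, double B,
                           double ts, double tw, double tc) {
    double stages = N / B + P - 1.0;                  /* pipeline stages */
    double t_comp = stages * (tc * B * N / P);        /* equation (5)    */
    double t_scat = ts * log2(P) + (N / P) * (P - 1.0) * tw;   /* eq (1) */
    double t_rows = (N / B + P - 2.0) * (ts + B * tw);         /* eq (3) */
    return t_comp + t_scat + t_rows;
}

int main(void) {
    /* Illustrative values only; the real ts, tw, tc depend on the machine. */
    printf("%g\n", t_parallel_blocking(10000, 8, 100, 1e-5, 1e-8, 1e-9));
    return 0;
}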

2.2.2 Blocking Technique: Optimum B

In order to find an optimum B for fixed values of N and P, and assuming N is much bigger than P, we need to find the value of B for which the total parallel time of computation and communication is smallest. This value can be found by differentiating the total parallelization time equation and finding the value of B for which the derivative is equal to zero.

\frac{dT_{parallel}}{dB} = 0 \Leftrightarrow

\Leftrightarrow -N \cdot \left(t_c \cdot B \cdot \frac{N}{P} + t_s + B \cdot t_w\right) \cdot B^{-2} + \left(\frac{N}{B} + P - 2\right) \cdot \left(t_c \cdot \frac{N}{P} + t_w\right) + t_c \cdot \frac{N}{P} = 0 \Leftrightarrow

\Leftrightarrow B = \sqrt{\frac{N \cdot t_s \cdot P}{P \cdot t_c \cdot N + P^2 \cdot t_w - t_c \cdot N - 2 \cdot t_w \cdot P}} \Leftrightarrow

\Leftrightarrow B = \sqrt{\frac{N \cdot t_s \cdot P}{t_c \cdot N \cdot (P - 1) + P \cdot t_w \cdot (P - 2)}} \Leftrightarrow

\Leftrightarrow B = \sqrt{\frac{t_s}{\frac{t_w \cdot (P - 2)}{N} + \frac{t_c \cdot (P - 1)}{P}}}

For N ≫ P:

B \approx \sqrt{\frac{t_s}{t_c}}    (7)
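
The exact expression above, before the N ≫ P simplification, can also be evaluated directly when choosing a block size for given machine parameters; a small sketch (the function names are ours):

#include <math.h>

/* Optimum block size for the blocking technique on a linear topology,
 * from B = sqrt( N*ts*P / (tc*N*(P-1) + P*tw*(P-2)) ).                */
double optimum_b(double N, double P, double ts, double tw, double tc) {
    return sqrt((N * ts * P) / (tc * N * (P - 1.0) + P * tw * (P - 2.0)));
}

/* Simplified form for N >> P: B ~ sqrt(ts / tc), as in equation (7). */
double optimum_b_large_n(double ts, double tc) {
    return sqrt(ts / tc);
}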


2.2.3 Blocking and Interleaving Technique

In this section we describe the performance model of our implementation with the blocking and interleaving techniques for a linear network topology. In later sections we compare it with the non-linear topology, taking into account the differences in the performance models. As in the previous model, we use the notation introduced above and only take into account the parallel algorithms used for the matrix calculation.

Figure 6: Communication and computation times of the parallel matrix calculations by process using the blocking and interleaving techniques.

The diagram in Figure 6 represents the steps of the matrix calculation performed by our algorithm, as well as the initial declarations and the required communications. These different steps are represented with different colors. The blue color represents the scattering of one protein sequence to all the processes. The green areas represent the computation time needed to do the matrix calculations in each block, and the yellow color represents the time taken to send the last row of a block to the next process.

In order to simplify the diagram, the time the last process in the last interleave needs to receive the last row of a block from the previous process is already taken into account in the upper yellow area. This explains why this last process does not have yellow areas in its time-line but still has to wait to receive the data needed to perform its matrix calculations. All of this is considered in this performance model.

As we can observe from the diagram, the communication time of this model is composed of the scattering of a part of the protein sequence vector (blue area) for each interleave, and several communications to send the last row of each block to the next process (yellow). The scatter method receives a vector of size N and delivers a vector of size N/(P · I) to each process per interleave. The scattering time is given by:

T_{scatter} = t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w    (8)

This scattering is done for each interleave. This means that we have to multiply this T_{scatter} by I:

T_{Tscatter} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right)

The sending of the last row of each block to the next process is composed of the communication startup time (t_s) and the traversal time of the B elements in this block's row. This is given by:

T_{rowComm} = t_s + B \cdot t_w    (9)

In order to clearly describe the calculation of the total communication time, we split it into the communication time of the first I − 1 interleaves and the special case of the last interleave. For the first I − 1 interleaves, one might notice that each interleave introduces N/B extra yellow areas. This means that the communication time for all the startups and traversals of the first I − 1 interleaves is given by:

T_{commInter} = (I - 1) \cdot \frac{N}{B} \cdot T_{rowComm}

T_{commInter} = (I - 1) \cdot \frac{N}{B} \cdot (t_s + B \cdot t_w)    (10)

The case of the last interleave is slightly different: we must take into account the typical extra P − 1 pipelining communications due to the different pipeline stages. Since in our implementation the last process does not need to send its last row to another process, there will be only P − 2 extra communications. So the communication time for all the startups and traversals is given by:

T_{commLastInterleave} = \left(\frac{N}{B} + P - 2\right) \cdot T_{rowComm}

T_{commLastInterleave} = \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w)    (11)


With these formulas we can finally describe the total communication time as being the sum of the scattering times and the startup and traversal times of all the interleaves. So the total communication time is given by:

T_{comm} = T_{Tscatter} + T_{commInter} + T_{commLastInterleave}

T_{comm} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right) + (I - 1) \cdot \frac{N}{B} \cdot (t_s + B \cdot t_w) + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w)

T_{comm} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right) + \left((I - 1) \cdot \frac{N}{B} + \frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w)    (12)

The next step is to calculate the total computation time. Bearing in mind that a block is composed of N/(P · I) rows and B columns, the total number of block elements is B · N/(P · I). This means that the computation time of a single block is given by:

T_{compBlock} = t_c \cdot B \cdot \frac{N}{P \cdot I}    (13)

As we did for the total communication time, we have to take into account how the interleaving affects the computation. For the first I − 1 interleaves the computation time is given by:

T_{compInter} = (I - 1) \cdot \frac{N}{B} \cdot t_c \cdot B \cdot \frac{N}{P \cdot I}    (14)

Differently from the communication time, the last interleave has exactly N/B + P − 1 block computations. This means that the total computation time is given by:

T_{comp} = \left((I - 1) \cdot \frac{N}{B} + \frac{N}{B} + P - 1\right) \cdot t_c \cdot B \cdot \frac{N}{P \cdot I}    (15)

To conclude this performance model, the total parallelization time is given by the sum of the total communication and computation times. So the total parallelization time is given by:

T_{parallel} = T_{comp} + T_{comm}

T_{parallel} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right) + \left((I - 1) \cdot \frac{N}{B} + \frac{N}{B} + P - 2\right) \cdot \left(t_c \cdot B \cdot \frac{N}{P \cdot I} + t_s + B \cdot t_w\right) + t_c \cdot B \cdot \frac{N}{P \cdot I}    (16)
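
As with equation (6), the blocking-and-interleaving model can be evaluated numerically; a minimal sketch under the same assumptions (our own function name, base-2 logarithm):

#include <math.h>

/* Estimated parallel time of the blocking + interleaving technique on a
 * linear topology, following equation (16).                              */
double t_parallel_interleaved(double N, double P, double B, double I,
                              double ts, double tw, double tc) {
    double t_scat    = I * (ts * log2(P) + (N / (P * I)) * (P - 1.0) * tw);
    double stages    = (I - 1.0) * (N / B) + (N / B) + P - 2.0;
    double per_stage = tc * B * N / (P * I) + ts + B * tw;
    return t_scat + stages * per_stage + tc * B * N / (P * I);
}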

2.2.4 Blocking and Interleaving Technique: Optimum B

In order to find an optimum B with respect to the N, P and I values, and assuming N is much bigger than P, we need to find the value of B for which the total parallel time of computation and communication is smallest. This value can be found by differentiating the total parallelization time equation and finding the value of B for which the derivative is equal to zero.

\frac{dT_{parallel}}{dB} = 0 \Leftrightarrow

\Leftrightarrow \left(-\frac{(I - 1) \cdot N}{B^2} - \frac{N}{B^2}\right) \cdot \left(t_c \cdot B \cdot \frac{N}{I \cdot P} + t_s + B \cdot t_w\right) + \left(\frac{(I - 1) \cdot N}{B} + \frac{N}{B} + P - 2\right) \cdot \left(t_c \cdot \frac{N}{I \cdot P} + t_w\right) + t_c \cdot \frac{N}{I \cdot P} = 0 \Leftrightarrow

\Leftrightarrow B = \sqrt{\frac{N \cdot t_s \cdot P \cdot I^2}{P \cdot t_c \cdot N + P^2 \cdot t_w \cdot I - t_c \cdot N - 2 \cdot t_w \cdot I \cdot P}}    (17)

For N ≫ P:

B \approx \sqrt{\frac{I \cdot N \cdot t_s}{t_w}}    (18)

2.2.5 Blocking and Interleaving Technique: Optimum I

In order to find an optimum I with respect to the N, P and B values, and assuming N is much bigger than P, we need to find the value of I for which the total parallel time of computation and communication is smallest. This value can be found by differentiating the total parallelization time equation and finding the value of I for which the derivative is equal to zero.


\frac{dT_{parallel}}{dI} = 0 \Leftrightarrow

\Leftrightarrow I = \sqrt{\frac{N \cdot t_c \cdot (P - 1) \cdot B^2}{P \cdot (B \cdot t_s \cdot \log(P) + N \cdot t_s + N \cdot B \cdot t_w)}}    (19)

I \approx \sqrt{\frac{N \cdot t_c \cdot B^2}{B \cdot t_s \cdot \log(P) + N \cdot t_s + N \cdot B \cdot t_w}}    (20)
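
Equation (20) can likewise be turned into a small helper for choosing I on a concrete machine (our own sketch; base-2 logarithm assumed, since the report does not specify the base):

#include <math.h>

/* Approximate optimum interleaving factor from equation (20),
 * valid under the report's N >> P assumption.                 */
double optimum_i(double N, double B, double P,
                 double ts, double tw, double tc) {
    return sqrt((N * tc * B * B) /
                (B * ts * log2(P) + N * ts + N * B * tw));
}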

2.3 Performance Model on 2D Torus Network Topology

2.3.1 Blocking Technique

Assuming that the spawning of processes is location aware in terms of the network topology, the only difference between the linear topology mentioned in the previous sections and the 2D torus network topology is in the scattering of data [1]. So the new performance model for this topology is given by:

T_{parallel} = \left(\frac{N}{B} + P - 1\right) \cdot \left(t_c \cdot B \cdot \frac{N}{P}\right) + 2 \cdot \left(t_s \cdot \log(\sqrt{P}) + \frac{N}{P} \cdot (P - 1) \cdot t_w\right) + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w)    (21)

Although the scattering of data is done faster, it is not affected by the variable B, so it does not change the calculation of the optimum B. The optimum B therefore remains the following:

B \approx \sqrt{\frac{t_s}{t_c}}    (22)

2.3.2 Blocking and Interleaving Technique

Let us also assume that the spawning of processes is location aware in terms of the network topology. This means that the only difference between the linear topology mentioned in the previous sections and the 2D torus network topology is in the scattering of data. So the new performance model for this topology is given by:

T_{parallel} = I \cdot 2 \cdot \left(t_s \cdot \log(\sqrt{P}) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right) + \left((I - 1) \cdot \frac{N}{B} + \frac{N}{B} + P - 2\right) \cdot \left(t_c \cdot B \cdot \frac{N}{P \cdot I} + t_s + B \cdot t_w\right) + t_c \cdot B \cdot \frac{N}{P \cdot I}    (23)

Just as with the blocking technique, the scattering is not affected by B, but it is affected by I. This means that the scattering is dependent on the level of interleaving. So the new equation for the optimum I is given by:

I \approx \sqrt{\frac{N \cdot t_c \cdot B^2}{2 \cdot B \cdot t_s \cdot \log(\sqrt{P}) + N \cdot t_s + N \cdot B \cdot t_w}}    (24)

The corresponding optimum B is given by:

B \approx \sqrt{\frac{I \cdot N \cdot t_s}{t_w}}    (25)

Taking into account the properties of logarithms (2 · log(√P) = log(P)), we deduce that the optimum I is the same for both network topologies. The only difference between the two is the time needed to perform the scattering.

2.4 Implementation

In this section, the implementation of our solution is provided and explained. Compared to the provided sequential one, our solution requires two extra parameters, B and I, where B is the blocking factor and I is the interleaving factor. Note that in order not to use interleaving, the I parameter should be set to 1.

In our solution, all required data is first read by the root process and later broadcast or scattered to the other processes. Vector A is scattered to all of the processes. How much information is scattered to every process depends on the I parameter and the number of processes: every process receives N/(I · P) rows before computing each of the interleave parts. Usually N is not divisible by I · P, so padding is introduced. The number of elements that each process will receive during the scatter procedure is calculated and stored as follows:


sizeA = N % (total_processes * I) != 0
        ? N + (total_processes * I) - (N % (total_processes * I))
        : N;
chunk_size = sizeA / (total_processes * I);

Then the root process reads the data and shares the data as follows:

// Broadcast the similarity matrix
MPI_Bcast(sim_ptr, AA * AA, MPI_INT, 0, MPI_COMM_WORLD);
// Broadcast the size of the portion of vector A that each process
// will receive during the scatter
MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD);
// Broadcast the N, B, I and DELTA parameters
MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&DELTA, 1, MPI_INT, 0, MPI_COMM_WORLD);

Later, each process allocates space for its portion of the H matrix, its portion of vector A, and the whole vector B. Note that in our solution a process does not allocate the full-sized H matrix, but only the portion of the matrix into which it writes its results. The sum of the sizes of the H matrix portions distributed throughout the processes is N × N + N + N · (P · I): the whole matrix, the initial column filled with zeros, and the extra rows where the processes receive information from other processes. The portions are stored in a three-dimensional array where the first dimension refers to an interleave ID and the other two refer to the row and the column. The memory is allocated and mapped, and vector B is broadcast, as follows:

CHECK_NULL((chunk_hptr = (int *) malloc(sizeof(int) * (N) *
                                        (chunk_size + 1) * I)));
CHECK_NULL((chunk_h = (int **) malloc(sizeof(int*) *
                                      (chunk_size + 1) * I)));
CHECK_NULL((chunk_ih = (int ***) malloc(sizeof(int*) * I)));
for (int i = 0; i < (chunk_size + 1) * I; i++)
    chunk_h[i] = chunk_hptr + i * N;
for (int i = 0; i < I; i++)
    chunk_ih[i] = chunk_h + i * (chunk_size + 1);
CHECK_NULL((chunk_a = (short *) malloc(sizeof(short) * (chunk_size))));
if (rank != 0) { // The root process already has the b vector
    CHECK_NULL((b = (short *) malloc(sizeof(short) * (N))));
}
MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD);

Later, each process calculates how many blocks there are in total and the size of the final block. This is needed since usually N is not divisible by B, so the final block is usually smaller than the rest. The time that marks the beginning of the computation is stored in the variable start. In the main loop, which iterates over the interleaves, each process receives a portion of vector A. The main loop is repeated I times, as explained earlier (in the section describing the blocking and interleaving technique).

int total_blocks = N / B + (N % B == 0 ? 0 : 1);
int last_block_size = N % B == 0 ? B : N % B;
MPI_Status status;
int start, end;
start = getTimeMilli();

for (int current_interleave = 0; current_interleave < I;
     current_interleave++) {
    MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                chunk_size, MPI_SHORT, chunk_a, chunk_size, MPI_SHORT,
                0, MPI_COMM_WORLD);
    int current_column = 1;
    // Fill the first column with 0
    for (int i = 0; i < chunk_size + 1; i++)
        chunk_ih[current_interleave][i][0] = 0;

Then the main calculations begin. First, the process checks whether it has to receive data from another process; if so, it receives the data required for the calculations. Then it processes the current block, storing the results in a separate array which will be gathered later. Next, the process checks whether it has to send data to another process; if so, it sends the last row of the current block to the next process. The process repeats these actions total_blocks times. Finally, it saves the time after execution in the end variable.

    for (int current_block = 0; current_block < total_blocks;
         current_block++) {
        // Receive
        int block_end = MIN2(current_column
                             - (current_block == 0 ? 1 : 0) + B, N);
        if (rank == 0 && current_interleave == 0) {
            for (int k = current_column; k < block_end; k++) {
                chunk_ih[current_interleave][0][k] = 0;
            }
        } else {
            int receive_from = rank == 0 ? total_processes - 1 : rank - 1;
            int size_to_receive = current_block == total_blocks - 1
                                  ? last_block_size : B;
            MPI_Recv(chunk_ih[current_interleave][0] + current_block * B,
                     size_to_receive, MPI_INT, receive_from, 0,
                     MPI_COMM_WORLD, &status);
            if (DEBUG) printf("[%d] Received from %d: ", rank, receive_from);
            if (DEBUG)
                print_vector(chunk_ih[current_interleave][0]
                             + current_block * B, size_to_receive);
        }

        // Process
        for (int j = current_column; j < block_end; j++, current_column++) {
            for (int i = 1; i < chunk_size + 1; i++) {
                int diag = chunk_ih[current_interleave][i - 1][j - 1]
                           + sim[chunk_a[i - 1]][b[j - 1]];
                int down = chunk_ih[current_interleave][i - 1][j] + DELTA;
                int right = chunk_ih[current_interleave][i][j - 1] + DELTA;
                int max = MAX3(diag, down, right);
                chunk_ih[current_interleave][i][j] = max < 0 ? 0 : max;
            }
        }

        // Send
        if (current_interleave != I - 1 || rank + 1 != total_processes) {
            int send_to = rank + 1 == total_processes ? 0 : rank + 1;
            int size_to_send = current_block == total_blocks - 1
                               ? last_block_size : B;
            MPI_Send(chunk_ih[current_interleave][chunk_size]
                     + current_block * B, size_to_send, MPI_INT,
                     send_to, 0, MPI_COMM_WORLD);
            if (DEBUG) printf("[%d] Sent to %d: ", rank, send_to);
            if (DEBUG)
                print_vector(chunk_ih[current_interleave][chunk_size]
                             + current_block * B, size_to_send);
        }
    }
}
end = getTimeMilli();

When all the calculations are finished, all processes start the gather operation. After the gather is executed, the root process has the whole H matrix. The root process then prints the execution time to the stderr stream and, if debugging is enabled, prints the H matrix.

for (int i = 0; i < I; i++) {
    MPI_Gather(chunk_hptr + N + i * chunk_size * N, N * chunk_size, MPI_INT,
               hptr + i * chunk_size * total_processes * N,
               N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
}

if (rank == 0) {
    fprintf(stderr, "Execution: %f s\n", (double) (end - start) / 1000000);
}

if (DEBUG) {
    if (rank == 0) {
        for (int i = 0; i < N - 1; i++) {
            print_vector(h[i], N);
        }
    }
}

MPI_Finalize();

The full code is provided in the Appendix section.


3 Performance Results

In this section, the performance results of our implementation on the ALTIX machine are provided. The results are also compared to the performance of the sequential code.

3.1 Finding Optimal P and B

In order to find the optimal P and B, we tested the application with different P and B parameters, where N = 10,000. Before that we tested the sequential code, which executed the calculations in 12.598 seconds. The execution times of the parallelized version are shown in Figure 7.

Figure 7: Performance results with different P and B where N = 10,000

From this it can be concluded that with parameters N = 10,000, B = 100, P = 8 and I = 1 the parallel code executed the calculations 9 times faster.

3.2 Finding Optimal I

In order to find the optimal I, we selected the best result from the previous test, where P = 8, and ran the test with different I and B parameters. The result is shown in Figure 8.

Figure 8: Performance results with different I and B where N = 10,000 and P = 8

Because environmental factors such as network congestion affect our performance tests, the results might not be completely accurate. From the results we deduced that the optimal parameter configuration for N = 10,000 is I = 2, B = 200, P = 8. With this configuration the parallel code calculates the matrix 8 times faster than the sequential code. Finally, we tested the parallel code with N = 25,000 and the parameters that we found to be optimal. The code executed the calculations in 11.822213 seconds, where the sequential code ran for 76.884 seconds. From this it can be concluded that the parallel code runs 6.5 times faster. The speedup is lower because, as stated earlier, the optimal B and I depend on N, so this parameter configuration is not optimal for computing the similarity of vectors of size N = 25,000.
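
For the N = 25,000 run this speedup follows directly from the two measured times:

S = \frac{T_{seq}}{T_{par}} = \frac{76.884}{11.822213} \approx 6.5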

4 Conclusions

During this project a parallel implementation of the Smith-Waterman algorithm was made using the blocking and interleaving techniques. The techniques and the code were explained in detail. The performance models for both the linear and the 2D torus topologies were calculated. Also, for each network topology, equations were found for the optimum blocking factor B when using the blocking technique, and for the optimum B and interleaving factor I when using the blocking and interleaving technique. After calculating the models, the conclusion was reached that the calculation of the B and I factors for our algorithm is the same on both of these network topologies.

Performance tests using multiple processes on different processors were carried out. It was found that the optimal configuration for calculating the sequence alignment of two vectors of size N = 10,000 using our implementation is I = 2, B = 200, P = 8. With this configuration the parallel code calculates the matrix 8 times faster than the sequential code. With the same parameter configuration the parallel code calculates the matrix of size N = 25,000 6.5 times faster than the sequential code.


References

[1] Peter Harrison, William Knottenbelt. Parallel Algorithms. Department of Computing, Imperial College London, 2009.

[2] Norm Matloff. Programming on Parallel Machines. University of California, Davis, 2011.


A How to Compile

all: seq par

seq:
	gcc SW.c -o seq.out

par:
	icc protein.cpp -o protein.out -lmpi

B How to Execute on ALTIX

#!/bin/bash
# @ job_name = ampp01parallel
# @ initialdir = .
# @ output = mpi_%j.out
# @ error = mpi_%j.err
# @ total_tasks = <number_of_process>
# @ wall_clock_limit = 00:01:00

mpirun -np <number_of_process> ./protein.out <vector_a> <vector_b> \
    <similarity_matrix> <gap_penalty> <N> <B> <I>

C Code

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>      // character handling
#include <stdlib.h>     // def of RAND_MAX
#include <sys/time.h>
#include "mpi.h"

#define DEBUG 1
#define MAX_SEQ 50

#define CHECK_NULL(_check) {\
    if ((_check)==NULL) {\
        fprintf(stderr, "Null Pointer allocating memory\n");\
        exit(-1);\
    }\
}

#define AA 20    // number of amino acids

#define MAX2(x,y)   ((x)<(y) ? (y) : (x))
#define MAX3(x,y,z) (MAX2(x,y)<(z) ? (z) : MAX2(x,y))
#define MIN2(x,y)   ((x)>(y) ? (y) : (x))

// function prototypes
int getTimeMilli();
void read_pam(FILE* pam);
void read_files(FILE* in1, FILE* in2);
void print_vector(int* vector, int size);
void print_short_vector(short* vector, int size);
void memcopy(int* src, int* dst, int count);

/* begin AMPP */
int char2AAmem[256];
int AA2charmem[AA];
void initChar2AATranslation(void);
/* end AMPP */

/* Define global variables */
int rank, total_processes;
int DELTA;
short *a, *b;
int *chunk_hptr;
int **chunk_h, ***chunk_ih;
int *sim_ptr, **sim;    // PAM similarity matrix
int N, sizeA, B, I, chunk_size;
short *chunk_a;
int* hptr;
int** h;
FILE *pam;

int main(int argc, char *argv[]) {
    /* begin AMPP */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &total_processes);

    CHECK_NULL((sim_ptr = (int *) malloc(AA * AA * sizeof(int))));
    CHECK_NULL((sim = (int **) malloc(AA * sizeof(int*))));
    for (int i = 0; i < AA; i++)
        sim[i] = sim_ptr + i * AA;

    if (rank == 0) {
        FILE *in1, *in2;

        /**** Error handling for input file ****/
        if (!(argc >= 5 && argc <= 8)) {
            fprintf(stderr, "%s protein1 protein2 PAM gapPenalty [N] [B] [I]\n",
                    argv[0]);
            exit(1);
        } else {
            in1 = fopen(argv[1], "r");
            in2 = fopen(argv[2], "r");
            N = (argc > 5 ? atoi(argv[5]) : MAX_SEQ) + 1;
            B = argc > 6 ? atoi(argv[6]) : total_processes;
            I = argc > 7 ? atoi(argv[7]) : 1;
            DELTA = atoi(argv[4]);
        }
        /* end AMPP */

        /* begin AMPP */
        sizeA = N % (total_processes * I) != 0
                ? N + (total_processes * I) - (N % (total_processes * I))
                : N;
        CHECK_NULL((a = (short *) calloc(sizeof(short), sizeA)));
        CHECK_NULL((b = (short *) malloc(sizeof(short) * (N))));
        initChar2AATranslation();
        read_files(in1, in2);
        chunk_size = sizeA / (total_processes * I);
        CHECK_NULL((hptr = (int *) calloc(N * sizeA, sizeof(int))));
        CHECK_NULL((h = (int **) calloc(sizeA, sizeof(int*))));
        for (int i = 0; i < sizeA; i++)
            h[i] = hptr + i * N;
        pam = fopen(argv[3], "r");
        read_pam(pam);
    }

    MPI_Bcast(sim_ptr, AA * AA, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&DELTA, 1, MPI_INT, 0, MPI_COMM_WORLD);

    CHECK_NULL((chunk_hptr = (int *) malloc(sizeof(int) * (N) *
                                            (chunk_size + 1) * I)));
    CHECK_NULL((chunk_h = (int **) malloc(sizeof(int*) *
                                          (chunk_size + 1) * I)));
    CHECK_NULL((chunk_ih = (int ***) malloc(sizeof(int*) * I)));
    for (int i = 0; i < (chunk_size + 1) * I; i++)
        chunk_h[i] = chunk_hptr + i * N;
    for (int i = 0; i < I; i++)
        chunk_ih[i] = chunk_h + i * (chunk_size + 1);
    CHECK_NULL((chunk_a = (short *) malloc(sizeof(short) * (chunk_size))));
    if (rank != 0) {
        CHECK_NULL((b = (short *) malloc(sizeof(short) * (N))));
    }
    MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD);

    /*** PARALLEL PART ***/
    /** compute "h" local similarity array **/
    int total_blocks = N / B + (N % B == 0 ? 0 : 1);
    int last_block_size = N % B == 0 ? B : N % B;
    MPI_Status status;
    int start, end;
    start = getTimeMilli();

    for (int current_interleave = 0; current_interleave < I;
         current_interleave++) {
        MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                    chunk_size, MPI_SHORT, chunk_a, chunk_size, MPI_SHORT,
                    0, MPI_COMM_WORLD);
        int current_column = 1;
        // Fill first column with 0
        for (int i = 0; i < chunk_size + 1; i++)
            chunk_ih[current_interleave][i][0] = 0;

        for (int current_block = 0; current_block < total_blocks;
             current_block++) {
            // Receive
            int block_end = MIN2(current_column
                                 - (current_block == 0 ? 1 : 0) + B, N);
            if (rank == 0 && current_interleave == 0) {
                for (int k = current_column; k < block_end; k++) {
                    chunk_ih[current_interleave][0][k] = 0;
                }
            } else {
                int receive_from = rank == 0 ? total_processes - 1 : rank - 1;
                int size_to_receive = current_block == total_blocks - 1
                                      ? last_block_size : B;
                MPI_Recv(chunk_ih[current_interleave][0] + current_block * B,
                         size_to_receive, MPI_INT, receive_from, 0,
                         MPI_COMM_WORLD, &status);
                if (DEBUG) printf("[%d] Received from %d: ", rank, receive_from);
                if (DEBUG)
                    print_vector(chunk_ih[current_interleave][0]
                                 + current_block * B, size_to_receive);
            }

            // Process
            for (int j = current_column; j < block_end; j++, current_column++) {
                for (int i = 1; i < chunk_size + 1; i++) {
                    int diag = chunk_ih[current_interleave][i - 1][j - 1]
                               + sim[chunk_a[i - 1]][b[j - 1]];
                    int down = chunk_ih[current_interleave][i - 1][j] + DELTA;
                    int right = chunk_ih[current_interleave][i][j - 1] + DELTA;
                    int max = MAX3(diag, down, right);
                    chunk_ih[current_interleave][i][j] = max < 0 ? 0 : max;
                }
            }

            // Send
            if (current_interleave != I - 1 || rank + 1 != total_processes) {
                int send_to = rank + 1 == total_processes ? 0 : rank + 1;
                int size_to_send = current_block == total_blocks - 1
                                   ? last_block_size : B;
                MPI_Send(chunk_ih[current_interleave][chunk_size]
                         + current_block * B, size_to_send, MPI_INT,
                         send_to, 0, MPI_COMM_WORLD);
                if (DEBUG) printf("[%d] Sent to %d: ", rank, send_to);
                if (DEBUG)
                    print_vector(chunk_ih[current_interleave][chunk_size]
                                 + current_block * B, size_to_send);
            }
        }
    }
    end = getTimeMilli();

    for (int i = 0; i < I; i++) {
        MPI_Gather(chunk_hptr + N + i * chunk_size * N, N * chunk_size, MPI_INT,
                   hptr + i * chunk_size * total_processes * N,
                   N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
    }

    if (rank == 0) {
        fprintf(stderr, "Execution: %f s\n", (double) (end - start) / 1000000);
    }

    if (DEBUG) {
        if (rank == 0) {
            for (int i = 0; i < N - 1; i++) {
                print_vector(h[i], N);
            }
        }
    }

    // Free everything!
    free(sim_ptr);
    free(sim);
    free(b);
    free(chunk_ih);
    free(chunk_h);
    free(chunk_hptr);
    free(chunk_a);
    if (rank == 0) {
        free(a);
        free(hptr);
        free(h);
    }

    MPI_Finalize();
}

void memcopy(int* src, int* dst, int count) {
    for (int i = 0; i < count; i++) {
        dst[i] = src[i];
    }
}

void print_vector(int* vector, int size) {
    for (int i = 0; i < size; i++) {
        printf("%2d ", vector[i]);
    }
    printf("\n");
}

void print_short_vector(short* vector, int size) {
    for (int i = 0; i < size; i++) {
        printf("%2d ", vector[i]);
    }
    printf("\n");
}

void read_pam(FILE* pam) {
    int i, j;
    int temp;

    /** read PAM250 similarity matrix **/
    /* begin AMPP */
    fscanf(pam, "%*s");
    /* end AMPP */
    for (i = 0; i < AA; i++)
        for (j = 0; j <= i; j++) {
            if (fscanf(pam, "%d ", &temp) == EOF) {
                fprintf(stderr, "PAM file empty\n");
                fclose(pam);
                exit(1);
            }
            sim[i][j] = temp;
        }
    fclose(pam);
    for (i = 0; i < AA; i++)
        for (j = i + 1; j < AA; j++)
            sim[i][j] = sim[j][i];    // symmetrify
}

void read_files(FILE* in1, FILE* in2) {
    int i = 0;
    int nc;
    char ch;

    do {
        nc = fscanf(in1, "%c", &ch);
        if (nc > 0 && char2AAmem[ch] >= 0) {
            a[i++] = char2AAmem[ch];
        }
    } while (nc > 0 && (i < N));
    fclose(in1);

    /** read second file in array "b" **/
    i = 0;
    do {
        nc = fscanf(in2, "%c", &ch);
        if (nc > 0 && char2AAmem[ch] >= 0) {
            b[i++] = char2AAmem[ch];
        }
    } while (nc > 0 && (i < N));
    fclose(in2);
}

/* Begin AMPP */
void initChar2AATranslation(void)
{
    int i;
    for (i = 0; i < 256; i++) char2AAmem[i] = -1;
    char2AAmem['c'] = char2AAmem['C'] = 0;
    AA2charmem[0] = 'c';
    char2AAmem['g'] = char2AAmem['G'] = 1;
    AA2charmem[1] = 'g';
    char2AAmem['p'] = char2AAmem['P'] = 2;
    AA2charmem[2] = 'p';
    char2AAmem['s'] = char2AAmem['S'] = 3;
    AA2charmem[3] = 's';
    char2AAmem['a'] = char2AAmem['A'] = 4;
    AA2charmem[4] = 'a';
    char2AAmem['t'] = char2AAmem['T'] = 5;
    AA2charmem[5] = 't';
    char2AAmem['d'] = char2AAmem['D'] = 6;
    AA2charmem[6] = 'd';
    char2AAmem['e'] = char2AAmem['E'] = 7;
    AA2charmem[7] = 'e';
    char2AAmem['n'] = char2AAmem['N'] = 8;
    AA2charmem[8] = 'n';
    char2AAmem['q'] = char2AAmem['Q'] = 9;
    AA2charmem[9] = 'q';
    char2AAmem['h'] = char2AAmem['H'] = 10;
    AA2charmem[10] = 'h';
    char2AAmem['k'] = char2AAmem['K'] = 11;
    AA2charmem[11] = 'k';
    char2AAmem['r'] = char2AAmem['R'] = 12;
    AA2charmem[12] = 'r';
    char2AAmem['v'] = char2AAmem['V'] = 13;
    AA2charmem[13] = 'v';
    char2AAmem['m'] = char2AAmem['M'] = 14;
    AA2charmem[14] = 'm';
    char2AAmem['i'] = char2AAmem['I'] = 15;
    AA2charmem[15] = 'i';
    char2AAmem['l'] = char2AAmem['L'] = 16;
    AA2charmem[16] = 'l';
    char2AAmem['f'] = char2AAmem['F'] = 17;
    AA2charmem[17] = 'f';
    char2AAmem['y'] = char2AAmem['Y'] = 18;
    AA2charmem[18] = 'y';
    char2AAmem['w'] = char2AAmem['W'] = 19;
    AA2charmem[19] = 'w';
}

// Returns the wall-clock time in microseconds (despite the name)
int getTimeMilli() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    int ret = tv.tv_usec;
    ret += (tv.tv_sec * 1000000);    // Add seconds
    return ret;
}
/* end AMPP */
