
PERFORMANCE ANALYSIS OF OCB (OFFSET CODEBOOK) USING TBB (THREADING BUILDING BLOCKS)

Parag Sheth
B.E., L. D. College of Engineering, 2006

PROJECT

Submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE

in

COMPUTER SCIENCE

at

CALIFORNIA STATE UNIVERSITY, SACRAMENTO

FALL 2010

PERFORMANCE ANALYSIS OF OCB (OFFSET CODEBOOK) USING TBB (THREADING BUILDING BLOCKS)

A Project

by

Parag Sheth

Approved by:

_______________________________, Committee Chair
Ted Krovetz, Ph.D.

_______________________________, Second Reader
Chung-E Wang, Ph.D.

____________________
Date


Student : Parag Sheth

I certify that this student has met the requirements for format contained in the University

format manual, and that this project is suitable for shelving in the Library and credit is to

be awarded for the project.

__________________________, Graduate Coordinator
Nikrouz Faroughi, Ph.D.

_____________________
Date

Department of Computer Science


Abstract

of

PERFORMANCE ANALYSIS OF OCB (OFFSET CODEBOOK) USING TBB (THREADING BUILDING BLOCKS)

by

Parag Sheth

This project explores Intel’s open source C++ library TBB (Threading Building Blocks) and uses it to analyze the performance gain available to an implementation of OCB (Offset Codebook). The analysis begins by identifying the parallel portions of the OCB algorithm, which are then implemented using TBB. Finally, I analyze the performance gain obtained while varying several parameters of the OCB algorithm.

TBB (Threading Building Blocks) :

It is Intel’s open source template library for C++. Its aim is to provide task level parallelism as opposed to thread level parallelism, which makes an implementation more portable and easier to understand. The TBB library internally keeps a pool of worker threads. The application developer specifies the parallel portions of the application, and the library takes care of most of the remaining work: it determines the number of threads required for the tasks and schedules them on the available processor cores. TBB uses a work-stealing scheduler design to distribute work among its threads.

For developers, the benefits of TBB are :

1. It reduces the length of the code for a multithreaded application.

2. It relieves the programmer of explicit thread management.

3. It automatically identifies the underlying system and determines a suitable number of threads. It also balances the work load between these threads automatically and uses all the available processor cores to maximize performance.

4. Applications developed using TBB become portable and scale to machines with any number of cores.

OCB (Offset Codebook) :

It is a shared key authenticated encryption scheme built from a block cipher. OCB achieves authenticated encryption in essentially the same amount of time as other modes, like CBC, achieve privacy alone. In other words, it takes about half the time that "conventional" modes, like CCM, need to achieve privacy and authenticity combined. In addition, OCB is simple and highly parallelizable, and it can be implemented easily in hardware and software. It can also be proved to be as secure as its underlying primitives.

Some of the key features of OCB are :

1. It can encrypt messages of any bit length; the message length does not have to be a multiple of the block length.

2. Encryption and decryption depend on an n-bit nonce N, which must be selected as a

new value for each encryption. The nonce need not be random or secret.

3. It is an on-line algorithm, meaning one need not know the length of the header or

message to proceed with encryption, and one need not know the length of header or

cipher text to proceed with decryption.

4. OCB is parallelizable : the bulk of its block cipher calls may be performed

simultaneously. Thus OCB is suitable for encrypting messages in hardware at the

highest network speeds.

5. It needs very little memory to run.

6. It is nearly endian-neutral.

___________________________________, Committee Chair
Ted Krovetz, Ph.D.

_______________________
Date

TABLE OF CONTENTS

List of Tables

List of Figures

Chapter

1. INTRODUCTION TO AUTHENTICATED ENCRYPTION
    Security Model
        Notions of Security
        Notions of Attacks
        Security of Message Authentication Code (MAC)
    Authenticated Encryption

2. INTRODUCTION TO OFFSET CODEBOOK (OCB)
    Overview
    Notation and Basic Operation
    OCB Parameters
    Header Authentication : PMAC
    Encryption : OCB-ENCRYPT
    Decryption : OCB-DECRYPT
    Parallel Portion of the Encryption / Decryption Algorithm
    Security Consideration of OCB

3. INTRODUCTION TO THREADING BUILDING BLOCKS (TBB)
    Overview
    Task Scheduling
    TBB Provided Algorithms
    Containers
    Scalable Memory Allocation

4. OCB IMPLEMENTATION USING TBB
    Class Definition
        Class Definition : OCB_With_TBB
        Class Definition : encryptBlockParallel
        Class Definition : xorBlockParallel
    Class Implementation

5. RESULTS
    Experiments
    Conclusion

References

LIST OF TABLES

Table 1  CPB Comparison at Different Processor Cores (Experiment A)
Table 2  CPB Comparison at Different Block Lengths (Experiment B)
Table 3  CPB Comparison at Different Chunk Sizes (Experiment C)

LIST OF FIGURES

Figure 1  Sample Task Graph
Figure 2  Sample Ready Pool
Figure 3  CPB Comparison at Different Processor Cores
Figure 4  CPB Comparison at Different Block Lengths
Figure 5  CPB Comparison at Different Chunk Sizes

Chapter 1

INTRODUCTION TO AUTHENTICATED ENCRYPTION

Authentication and encryption are two different objectives in the design of any secure communication system. Encryption concerns the privacy of the message itself, while authentication is the mechanism that proves that the sender of a message is who he or she claims to be. Formally, encryption is the process of transforming information (referred to as plaintext) using an algorithm (called a cipher) to make it unreadable to anyone except those possessing special knowledge, usually referred to as a key. [1] The concept of authenticity is similar to that of a signature in the real world.

1.1 Security Model

There are various algorithms available for encryption and authentication. One needs to make sure that the chosen algorithm gives enough security against all the types of attack that are possible in the intended scenario. To reason about that, we need to define the security model formally. In other words, we need to formalize the different types of attack on the system and the different levels of security that an algorithm can provide against them. It is quite possible for an encryption algorithm to be secure against one type of attack and yet be easily broken by another.

1.1.1 Notions of Security

There are essentially 3 notions of security that need to be defined - Perfect Security, Semantic Security and Polynomial Security.

1. Perfect Security : An algorithm is said to have perfect security, or information theoretic security, if an adversary with an infinite amount of computational power can learn nothing about the plaintext given the ciphertext. This is a very strong definition. Schemes that meet it, such as the one-time pad, need a key at least as long as the message, so they are rarely practical in the real world.

2. Semantic Security : This notion is similar to perfect security, but here the adversary is given only a polynomial amount of time. Polynomial time can be defined as t = f(|M|), where f is a polynomial and |M| is the length of the given message. In other words, an algorithm is said to have semantic security if an adversary can learn nothing about the plaintext given the ciphertext within polynomial time.

3. Polynomial Security : This is an extended concept of semantic security, and it is also provable. Here the adversary is allowed to select 2 messages M1 and M2 of the same length. The adversary is then given the ciphertext Cb of one of these messages, where b is chosen at random and kept hidden from the adversary. The algorithm is said to have polynomial security if the adversary cannot identify the message (either M1 or M2) that corresponds to Cb with probability significantly higher than 1/2. It can be proved that if an algorithm is polynomially secure then it is also semantically secure. Here the advantage of an adversary A is

$\mathrm{Adv}_A = \left| \Pr\left[ A(\mathrm{guess}, C_b, y, M_1, M_2) = b \right] - \tfrac{1}{2} \right|$,

where y is a secret key. The scheme is polynomially secure if $\mathrm{Adv}_A \le 1 / p(k)$ for all adversaries A, all polynomials p and sufficiently large k.

1.1.2 Notions of Attacks

There are mainly 3 different kinds of attack: passive attack, chosen ciphertext attack and adaptive chosen ciphertext attack.

1. Passive Attack : This is the weakest form of attack, in which the adversary can only observe ciphertexts. The adversary also has access to the encryption black box, to which he or she can submit plaintext blocks and observe the returned ciphertexts.

2. Chosen Ciphertext Attack (CCA) : Here the adversary is given access to the decryption box, to which she can submit any number of ciphertexts and observe the returned plaintext messages. In the next stage she is given a challenge ciphertext and is asked to recover the plaintext, or at least some information about it. In this later stage she is not allowed to use the decryption box.

3. Adaptive Chosen Ciphertext Attack : This is a very strong type of attack. Here, in addition to all the access given in CCA, the adversary may also use the decryption box during the challenge stage, for any ciphertext except the challenge ciphertext.

Based on the notions above we can say that “A public key encryption algorithm is said to be secured if it is polynomially secured against an adaptive chosen ciphertext attack.” [2] A similar approach defines security for a symmetric encryption algorithm. The main difference between them is that a public key encryption scheme needs to be probabilistic, while a symmetric key encryption scheme can use a deterministic algorithm.

1.1.3 Security of Message Authentication Code (MAC)

The security of a MAC can be defined in various ways, but selective forgery is a widely used notion. In this notion the adversary is asked to choose a plaintext message M1. The MAC generation algorithm returns the MAC S1 of M1 under some random key K. The challenge for the adversary is now to generate another valid pair (M2, S2), where M1 ≠ M2. If the adversary succeeds in generating such a pair, this is known as a selective forgery.

1.2 Authenticated Encryption

Having discussed the security model, we are now in a position to discuss Authenticated Encryption. Various practices for providing privacy and authenticity have been used for years. In the traditional approach, encryption and authentication algorithms are applied one after the other to achieve data privacy and authenticity. Schemes of this kind are known as “generic compositions”. The two algorithms can be applied in any order, and depending on that order the schemes are known as Encrypt then MAC (EtM), MAC then Encrypt (MtE) or Encrypt and MAC (E&M). One EtM generic composition scheme can be described as follows. Suppose Bob wants to send a message to Alice, and they share a secret key K. Bob first encrypts the message using this key K and possibly a nonce N. The nonce can be a random number or a counter value. The resulting ciphertext C is then passed to the authentication algorithm along with the key. This generates a message authentication tag T. Bob can now send the triplet (C, N, T) to Alice. On the other end, Alice applies the reverse process to verify the authenticity of the received message and to retrieve the original plaintext. There are various such schemes available, but the problem with them is that the encryption and authentication functions need to be applied separately, which takes almost double the time and processing power. Sometimes designers make the mistake of using a regular hash instead of a keyed hash (MAC). This approach is almost always broken. In conclusion, the best choice for any generic composition scheme is the ‘Encrypt then MAC’ (EtM) approach with a provably secure encryption scheme and a provably secure MAC, each with an independent key.
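The Encrypt then MAC flow described above can be sketched in a few lines of C++. This is only an illustration of the message flow, under loud assumptions: `mix`, `toy_crypt` and `toy_mac` are hypothetical toy stand-ins that are NOT secure, the names `Sealed`, `seal` and `unseal` are invented for this sketch, and the tag here also covers the nonce, which is a choice of the sketch rather than something the text prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Toy 64-bit mixing function. NOT cryptographically secure.
std::uint64_t mix(std::uint64_t x) {
    x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ull;
    x ^= x >> 27; x *= 0x94D049BB133111EBull;
    return x ^ (x >> 31);
}

// Toy stream "cipher": xor every byte with a keystream byte derived from
// (key, nonce, position). Applying it twice restores the input.
std::string toy_crypt(std::uint64_t key, std::uint64_t nonce, const std::string& in) {
    std::string out = in;
    for (std::size_t i = 0; i < out.size(); ++i) {
        const unsigned char ks = static_cast<unsigned char>(mix(key ^ mix(nonce + i)));
        out[i] = static_cast<char>(static_cast<unsigned char>(out[i]) ^ ks);
    }
    return out;
}

// Toy keyed tag over the nonce and the ciphertext. NOT a real MAC.
std::uint64_t toy_mac(std::uint64_t key, std::uint64_t nonce, const std::string& data) {
    std::uint64_t h = mix(key ^ mix(nonce));
    for (unsigned char byte : data) h = mix(h ^ byte);
    return h;
}

// The triplet (C, N, T) that Bob sends to Alice.
struct Sealed {
    std::string c;
    std::uint64_t n;
    std::uint64_t t;
};

// Bob: encrypt first, then authenticate the ciphertext.
Sealed seal(std::uint64_t enc_key, std::uint64_t mac_key,
            std::uint64_t nonce, const std::string& msg) {
    assert(enc_key != mac_key);  // sanity check: the two keys must at least differ
    Sealed s;
    s.c = toy_crypt(enc_key, nonce, msg);
    s.n = nonce;
    s.t = toy_mac(mac_key, nonce, s.c);
    return s;
}

// Alice: verify the tag first and decrypt only if it matches.
bool unseal(std::uint64_t enc_key, std::uint64_t mac_key,
            const Sealed& s, std::string& msg) {
    if (toy_mac(mac_key, s.n, s.c) != s.t) return false;
    msg = toy_crypt(enc_key, s.n, s.c);
    return true;
}
```

Note that the message is processed twice, once by `toy_crypt` and once by `toy_mac`, which is the doubled cost that generic composition carries.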

The concept of Authenticated Encryption (AE) is to provide a single method that achieves data privacy and authenticity in a single pass, thus improving efficiency. Some researchers have pointed out that even when the individual elements of a scheme are secure, the combined scheme - if not designed properly - may be insecure. A properly designed AE scheme can also provide security against the chosen ciphertext attack (CCA). In this kind of attack the adversary submits carefully chosen ciphertexts to the decryption oracle. By analyzing patterns in the returned plaintexts, the adversary may be able to get some information about the secret key being used. An ideal AE scheme instead refuses to decrypt a message that is not properly authenticated, and reveals little about why. Thus an adversary cannot submit an arbitrary ciphertext and expect its corresponding plaintext, which defeats CCA. Other important aspects in the design of an AE scheme are efficiency, parallelizability, simplicity and portability. One such scheme, known as Offset Codebook (OCB), is described in the next chapter.

The security of any AE scheme depends on its primitive algorithms. It is very difficult to provide a proof of security for such primitives. But once we accept that no known attack works against them, it is possible to show that schemes based on these primitives are as secure as the underlying algorithms. In particular, it can be proved that the OCB scheme is as secure as its underlying encryption algorithm.

Chapter 2

INTRODUCTION TO OFFSET CODEBOOK (OCB)

2.1 Overview

The Offset Codebook (OCB) mode of authenticated encryption was developed by Phillip Rogaway, who credits Mihir Bellare, John Black and Ted Krovetz for their assistance with the design. This mode is based on the IAPM (Integrity Aware Parallelizable Mode) scheme developed by Charanjit Jutla. OCB improves on the original IAPM scheme in several respects: 1) it minimizes the number of block cipher calls, 2) it handles messages whose length is not a multiple of the block length n, 3) it avoids multiple encryption keys, and 4) it uses a nonce that must be unique for each encryption but need not be secret or random. There are two versions of the OCB scheme. The initial version is 1.0 and the current version is 2.0, which is an improvement over version 1.0. The key differences are that version 2.0 allows associated data to be included with the message and uses a new method for generating the sequence of offsets. The associated data travels in plain text along with the cipher text, but it still needs to be authenticated. This is similar to the header requirement discussed in chapter 1.

OCB uses a block cipher - typically AES. It allows a predefined header to be authenticated along with the message. OCB also requires a unique nonce N for each encryption. It typically requires h + m + 2 block cipher calls in total, where h is the number of blocks in the header and m is the number of blocks in the original message. Once a header H has been authenticated, there is virtually no cost in authenticating the same H again, so for a fixed header OCB uses m + 2 block cipher calls. OCB is also highly parallelizable. As discussed later on, some parts of the OCB algorithm can be computed independently, which implies that the efficiency of OCB can be improved dramatically if the underlying hardware supports robust parallel processing. Another advantage of OCB is that it is an online scheme. In other words, the length of the complete message need not be known before encryption starts. Similarly, the length of the complete cipher text need not be known before decryption starts. The output generated by OCB has the same length as the original message plus the length of the authentication tag. This is a big advantage as it minimizes the amount of data actually transferred. It might, however, be a cause for concern in cases where traffic analysis is possible, because the cipher text reveals the length of the message. In such scenarios additional measures such as padding need to be used. The following sections give the pseudocode for OCB and its constructs.

2.2 Notation and Basic Operation [3]

c^i             The integer c raised to the i-th power.
ceil(x)         The smallest integer no smaller than x.
bitlength(S)    The length of string S in bits.
zeros(n)        The string made of n zero bits.
S xor T         The string that is the bitwise exclusive-or of S and T. Strings S and T must have the same length.
S[i]            The i-th bit of the string S (indices begin at 1).
S[i..j]         The substring of S consisting of bits i through j.
S || T          The string S concatenated with string T (eg, 000 || 111 = 000111).
S << n          The string S shifted left n bit positions. More formally, S << n = S[n+1..bitlength(S)] || zeros(n).
num2str(x, n)   The n-bit binary representation of the integer x. More formally, the n-bit string S where
                x = S[1] * 2^{n-1} + S[2] * 2^{n-2} + ... + S[n] * 2^{0}. Only used when 0 <= x < 2^n.
const(n)        The lexicographically first n-bit string C among all strings that have a minimal possible
                number of "1" bits and which name a polynomial
                x^n + C[1] * x^{n-1} + ... + C[n-1] * x^1 + C[n] * x^0
                that is irreducible over the field with two elements. In particular,
                const(128) = num2str(135, 128). For other values of n, refer to a standard table of
                irreducible polynomials.
times2(S)       S << 1 if S[1] = 0, and (S << 1) xor const(bitlength(S)) if S[1] = 1.
times3(S)       times2(S) xor S.
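The times2 and times3 operations defined above can be sketched in C++ for BLOCKLEN = 128. The sketch assumes a byte layout that the notation does not prescribe: the block is stored big-endian, so byte 0 holds bits S[1..8]. The type name `Block` is introduced only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// A 128-bit string stored big-endian: byte 0 holds bits S[1..8] (assumed layout).
using Block = std::array<std::uint8_t, 16>;

// times2(S): shift left by one bit. If the bit shifted out, S[1], was 1,
// xor in const(128) = num2str(135, 128), which is 0x87 in the last byte.
Block times2(const Block& s) {
    Block r{};
    const std::uint8_t carry = static_cast<std::uint8_t>(s[0] >> 7);  // S[1]
    for (std::size_t i = 0; i + 1 < r.size(); ++i) {
        r[i] = static_cast<std::uint8_t>((s[i] << 1) | (s[i + 1] >> 7));
    }
    r[15] = static_cast<std::uint8_t>(s[15] << 1);
    if (carry) r[15] ^= 0x87;
    return r;
}

// times3(S) = times2(S) xor S.
Block times3(const Block& s) {
    Block r = times2(s);
    for (std::size_t i = 0; i < r.size(); ++i) r[i] ^= s[i];
    return r;
}
```

For example, a block whose only set bit is S[1] doubles to the string num2str(135, 128), because the shifted-out bit triggers the xor with const(128).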

2.3 OCB Parameters [3]

BLOCKLEN        The length of the plaintext block that the block-cipher operates on.
KEYLEN          The blockcipher's key length, in bits.
ENCIPHER(K,P)   The application of the blockcipher on P (a string of BLOCKLEN bits) using key K (a string of KEYLEN bits).
DECIPHER(K,C)   The application of the inverse of the blockcipher on C (a string of BLOCKLEN bits) using key K (a string of KEYLEN bits).

2.4 Header Authentication : PMAC [3]

Function Name : PMAC
Input  : K, string of KEYLEN bits
         H, string of any length            // Header to co-authenticate
Output : Auth, string of BLOCKLEN bits      // Header authenticator

// Break H into blocks
m = max(1, ceil(bitlength(H) / BLOCKLEN))
Let H_1, H_2, ..., H_m be strings such that H = H_1 || H_2 || ... || H_m
    and bitlength(H_i) = BLOCKLEN for all 0 < i < m.
// Initialize strings used for offsets and checksums
Offset = ENCIPHER(K, zeros(BLOCKLEN))
Offset = times3(Offset)
Offset = times3(Offset)
Checksum = zeros(BLOCKLEN)
// Accumulate the first m - 1 blocks
for i = 1 to m - 1 do                       // Skip if m < 2
    Offset = times2(Offset)
    Checksum = Checksum xor ENCIPHER(K, H_i xor Offset)
end for
// Accumulate the final block
Offset = times2(Offset)
if bitlength(H_m) = BLOCKLEN then
    Offset = times3(Offset)
    Checksum = Checksum xor H_m
else
    Offset = times3(Offset)
    Offset = times3(Offset)
    Tmp = H_m || 1 || zeros(BLOCKLEN - (bitlength(H_m) + 1))
    Checksum = Checksum xor Tmp
end if
// Compute result
Auth = ENCIPHER(K, Offset xor Checksum)

2.5 Encryption : OCB-ENCRYPT [3]

Function Name : OCB-ENCRYPT
Input  : K, string of KEYLEN bits           // Key
         N, string of BLOCKLEN bits         // Nonce
         H, string of any length            // Header
         M, string of any length            // Plaintext
Output : C, string of length equal to M     // Cipher text core
         T, string of BLOCKLEN bits         // Authentication tag

// Break M into blocks
m = max(1, ceil(bitlength(M) / BLOCKLEN))
Let M_1, M_2, ..., M_m be strings such that M = M_1 || M_2 || ... || M_m
    and bitlength(M_i) = BLOCKLEN for all 0 < i < m.
// Initialize strings used for offsets and checksums
Offset = ENCIPHER(K, N)
Checksum = zeros(BLOCKLEN)
// Encrypt and accumulate first m - 1 blocks
for i = 1 to m - 1 do                       // Skip if m < 2
    Offset = times2(Offset)
    Checksum = Checksum xor M_i
    C_i = Offset xor ENCIPHER(K, M_i xor Offset)
end for
// Encrypt and accumulate final block
Offset = times2(Offset)
b = bitlength(M_m)                          // Value in 0..BLOCKLEN
Pad = ENCIPHER(K, num2str(b, BLOCKLEN) xor Offset)
C_m = M_m xor Pad[1..b]                     // Encrypt M_m
Tmp = M_m || Pad[b+1..BLOCKLEN]
Checksum = Checksum xor Tmp
// Compute authentication tag
Offset = times3(Offset)
T = ENCIPHER(K, Checksum xor Offset)
if bitlength(H) > 0 then
    T = T xor PMAC(K, H)
end if
// Assemble the ciphertext
C = C_1 || C_2 || ... || C_m

2.6 Decryption : OCB-DECRYPT [3]

Function Name : OCB-DECRYPT
Input  : K, string of KEYLEN bits           // Key
         N, string of BLOCKLEN bits         // Nonce
         H, string of any length            // Header
         C, string of any length            // Cipher text core
         T, string of at most BLOCKLEN bits // Authentication tag to verify
Output : M, string                          // Plaintext
         V, boolean                         // Validity indicator

m = max(1, ceil(bitlength(C) / BLOCKLEN))
Let C_1, C_2, ..., C_m be strings such that C = C_1 || C_2 || ... || C_m
    and bitlength(C_i) = BLOCKLEN for all 0 < i < m.
Offset = ENCIPHER(K, N)
Checksum = zeros(BLOCKLEN)
// Decrypt and accumulate
for i = 1 to m - 1 do                       // Skip if m < 2
    Offset = times2(Offset)
    M_i = Offset xor DECIPHER(K, C_i xor Offset)
    Checksum = Checksum xor M_i
end for
Offset = times2(Offset)
b = bitlength(C_m)                          // Value in 0..BLOCKLEN
Pad = ENCIPHER(K, num2str(b, BLOCKLEN) xor Offset)
M_m = C_m xor Pad[1..b]
Tmp = M_m || Pad[b+1..BLOCKLEN]
Checksum = Checksum xor Tmp
// Compute valid authentication tag
Offset = times3(Offset)
FullValidTag = ENCIPHER(K, Offset xor Checksum)
if bitlength(H) > 0 then
    FullValidTag = FullValidTag xor PMAC(K, H)
end if
if T = FullValidTag[1..bitlength(T)] then
    V = true
    M = M_1 || M_2 || ... || M_m
else
    V = false
    M = <emptystring>
end if
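To see how OCB-ENCRYPT and OCB-DECRYPT fit together, the following C++ sketch follows the two procedures step by step under several simplifying assumptions that are not part of the specification: BLOCKLEN is a toy 64 bits (so times2 uses 0x1B, the constant for n = 64 from standard tables), ENCIPHER and DECIPHER are replaced by a 4-round Feistel permutation that is NOT secure, messages are whole bytes, the header H is empty (so PMAC is skipped), and the tag is not truncated. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;
using Block = std::uint64_t;  // toy BLOCKLEN = 64 bits (assumption)

// Toy round function for the stand-in cipher. NOT secure.
static std::uint32_t round_fn(std::uint32_t x, Block key, int r) {
    x += static_cast<std::uint32_t>(key >> (16 * r));
    x *= 0x9E3779B1u;
    return x ^ (x >> 15);
}
// 4-round Feistel permutation standing in for ENCIPHER(K, P).
Block encipher(Block key, Block p) {
    std::uint32_t l = static_cast<std::uint32_t>(p >> 32), r = static_cast<std::uint32_t>(p);
    for (int i = 0; i < 4; ++i) { std::uint32_t t = r; r = l ^ round_fn(r, key, i); l = t; }
    return (static_cast<Block>(l) << 32) | r;
}
// Inverse permutation standing in for DECIPHER(K, C).
Block decipher(Block key, Block c) {
    std::uint32_t l = static_cast<std::uint32_t>(c >> 32), r = static_cast<std::uint32_t>(c);
    for (int i = 3; i >= 0; --i) { std::uint32_t t = l; l = r ^ round_fn(l, key, i); r = t; }
    return (static_cast<Block>(l) << 32) | r;
}
Block times2(Block s) { return (s << 1) ^ ((s >> 63) ? 0x1Bu : 0u); }
Block times3(Block s) { return times2(s) ^ s; }

// Load n <= 8 bytes big-endian into the high-order end of a block.
Block load(const Bytes& v, std::size_t pos, std::size_t n) {
    Block b = 0;
    for (std::size_t i = 0; i < n; ++i) b |= static_cast<Block>(v[pos + i]) << (56 - 8 * i);
    return b;
}
// Append the first n bytes of a block to v.
void store(Bytes& v, Block b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) v.push_back(static_cast<std::uint8_t>(b >> (56 - 8 * i)));
}

// Shared body of OCB-ENCRYPT / OCB-DECRYPT for an empty header.
// Returns the output text and writes the full-length tag to `tag`.
Bytes ocb_core(Block key, Block nonce, const Bytes& in, bool decrypt, Block& tag) {
    const std::size_t m = in.empty() ? 1 : (in.size() + 7) / 8;
    Block offset = encipher(key, nonce), checksum = 0;
    Bytes out;
    for (std::size_t i = 0; i + 1 < m; ++i) {            // first m - 1 full blocks
        offset = times2(offset);
        const Block x = load(in, 8 * i, 8);
        const Block y = offset ^ (decrypt ? decipher(key, x ^ offset)
                                          : encipher(key, x ^ offset));
        checksum ^= decrypt ? y : x;                      // checksum covers plaintext
        store(out, y, 8);
    }
    offset = times2(offset);                              // final, possibly short, block
    const std::size_t b = in.size() - 8 * (m - 1);        // bytes in final block, 0..8
    const Block pad = encipher(key, static_cast<Block>(8 * b) ^ offset);
    const Block x = load(in, 8 * (m - 1), b);
    const Block keep = b == 8 ? ~Block(0) : ~(~Block(0) >> (8 * b));  // first b bytes
    const Block y = (x ^ pad) & keep;
    checksum ^= (decrypt ? y : x) | (pad & ~keep);        // Tmp = M_m || Pad[b+1..BLOCKLEN]
    store(out, y, b);
    tag = encipher(key, checksum ^ times3(offset));
    return out;
}

Bytes ocb_encrypt(Block key, Block nonce, const Bytes& msg, Block& tag) {
    return ocb_core(key, nonce, msg, false, tag);
}
// Returns true and fills `msg` only if the recomputed tag matches `tag`.
bool ocb_decrypt(Block key, Block nonce, const Bytes& ct, Block tag, Bytes& msg) {
    Block expected = 0;
    const Bytes candidate = ocb_core(key, nonce, ct, true, expected);
    if (expected != tag) return false;
    msg = candidate;
    return true;
}
```

The cipher text has the same length as the message, and a modified cipher text is rejected without releasing any plaintext, which mirrors the V = false branch of OCB-DECRYPT.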


2.7 Parallel Portion of the Encryption / Decryption Algorithm

If we look carefully at the encryption algorithm, it is evident that the main loop that encrypts the plain text is highly parallelizable. Calculating the cipher text for message block ‘i’ requires only the offset value for that block; it does not depend on any other message block or cipher text block. The offsets can be calculated separately, and even in advance, as they do not depend on the actual plaintext values. Until the last message block is reached, n processes (depending on the available independent processing units) can therefore encrypt independent blocks of the message, each keeping track of its own checksum value. The checksum is calculated by XORing each message block into the current checksum value, and because XOR is associative and commutative, individual processing units can keep their own partial checksums. At the end, the final checksum is calculated by XORing the partial checksum values together. Similar parallelizability exists in the decryption algorithm. It is thus quite possible to take advantage of this highly parallel scheme and implement it so that the implementation is scalable and efficient in its use of all the available processing power. Threading Building Blocks (TBB) is one such mechanism, provided by Intel, which allows the developer to use all the available processor cores without making the implementation overly complicated. Subsequent chapters discuss TBB in detail, followed by the implementation of OCB using TBB.
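The decomposition described in this section can be sketched with standard C++ threads. This is not the TBB implementation presented later in the report: `std::thread` is swapped in so that the sketch needs only the standard library, the block is a toy 64 bits, `toy_encipher` is a hypothetical stand-in that is NOT a secure cipher, and only the full blocks of the main loop are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

using Block = std::uint64_t;  // toy 64-bit block (assumption)

// Stand-in for ENCIPHER(K, P). NOT a secure cipher.
Block toy_encipher(Block key, Block x) {
    x ^= key;
    x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ull;
    x ^= x >> 27; x *= 0x94D049BB133111EBull;
    return x ^ (x >> 31);
}
// times2 for 64-bit strings; 0x1B is const(64).
Block times2(Block s) { return (s << 1) ^ ((s >> 63) ? 0x1Bu : 0u); }

struct Result {
    std::vector<Block> cipher;
    Block checksum;
};

// Encrypt the full blocks msg[0..n) of the main OCB loop with `workers` threads.
Result encrypt_blocks(Block key, Block nonce, const std::vector<Block>& msg, unsigned workers) {
    assert(workers > 0);
    const std::size_t n = msg.size();

    // Offsets depend only on the key and nonce, so compute them in advance.
    std::vector<Block> offset(n);
    Block o = toy_encipher(key, nonce);
    for (std::size_t i = 0; i < n; ++i) { o = times2(o); offset[i] = o; }

    Result res{std::vector<Block>(n), 0};
    std::vector<Block> partial(workers, 0);   // one checksum slot per worker
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = n * w / workers;
        const std::size_t end = n * (w + 1) / workers;
        pool.emplace_back([&, w, begin, end] {
            Block local = 0;
            for (std::size_t i = begin; i < end; ++i) {
                local ^= msg[i];
                res.cipher[i] = offset[i] ^ toy_encipher(key, msg[i] ^ offset[i]);
            }
            partial[w] = local;               // each worker writes only its own slot
        });
    }
    for (std::thread& t : pool) t.join();

    // XOR is associative and commutative, so the order of combination is irrelevant.
    for (Block p : partial) res.checksum ^= p;
    return res;
}
```

Because each worker touches only its own slice of the output and its own checksum slot, no locking is needed, and the result is identical for any number of workers.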


2.8 Security Consideration of OCB

1. The OCB scheme is as secure as the underlying block cipher, so the designer should choose only a well trusted block cipher. Privacy and authenticity degrade as per s^2 / 2^BLOCKLEN, where s is the total number of blocks that the adversary acquires. Thus BLOCKLEN should be selected carefully: choosing a smaller value for BLOCKLEN results in a higher probability of the adversary’s success. Usually a BLOCKLEN of 128 is sufficient.

2. For secure operation, a nonce value must never be repeated with the same encryption key. If multiple parties communicate using the same key, then they should divide the nonce space among themselves so that their nonces do not overlap. The nonce is not required to be secret. A simple counter works fine.

3. The designer can also choose the length of the authentication tag, but choosing a small tag length increases the chance that an adversary can forge a message.

4. The OCB scheme (or any other authenticated encryption scheme, for that matter) can provide security against the chosen cipher text attack. For that, the designer needs to make sure that when decryption or authentication fails, the system does not reveal the details of the failure to the adversary.
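As a worked illustration of the bound in item 1, assume an adversary who acquires s = 2^32 blocks. This figure is chosen only for illustration and is not taken from the OCB specification.

```latex
\frac{s^2}{2^{\mathrm{BLOCKLEN}}}
  = \frac{(2^{32})^2}{2^{128}} = 2^{-64}
  \quad (\mathrm{BLOCKLEN} = 128),
\qquad
\frac{s^2}{2^{\mathrm{BLOCKLEN}}}
  = \frac{(2^{32})^2}{2^{64}} = 1
  \quad (\mathrm{BLOCKLEN} = 64).
```

With 128-bit blocks the bound stays negligible at this volume, whereas with 64-bit blocks the same volume already makes the bound meaningless.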


Chapter 3

INTRODUCTION TO THREADING BUILDING BLOCKS (TBB)

3.1 Overview

It is often quite challenging to develop a multi-threaded application that scales with the number of available processor cores. Ideally, if an application delivers a certain level of performance on a dual core machine, then its performance should improve on a quad core machine. A developer who tries to build this kind of scalable application using raw threads, such as POSIX or Windows threads, needs to manage a lot of thread overhead. In addition, he or she needs to take care of low level concerns such as load balancing, memory contention and cache performance. Threading Building Blocks (TBB) is a template library for C++, developed by Intel, which helps developers in such cases. It is a high level library in which the developer identifies the tasks in the application rather than the threads. TBB automatically maps the tasks onto an appropriate number of threads and runs them efficiently. TBB can identify the number of available processor cores and balance the load across them to get maximum performance. TBB offers a range of advantages over native threads, such as

1. It is platform independent and processor independent.

2. It can be seamlessly integrated with other threading libraries in the same application.

3. TBB targets data parallel programming. Instead of parallelizing independent tasks, it divides one data intensive task among multiple threads. This approach often works better in terms of efficiency.

4. Instead of relying on a global task queue, it uses a task stealing mechanism, thus avoiding a central point of contention. Task stealing is described in more detail in the following sub section.
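The data parallel idea in item 3 can be illustrated by recursively halving an index range until each piece is small enough to run as one task. The sketch below is a hypothetical helper written for this illustration, loosely in the spirit of how a range-based parallel loop divides its work; `Range`, `split` and the `grain` parameter are not TBB's API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open index range [begin, end) over the data to be processed.
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Recursively halve `r` until every piece holds at most `grain` items.
// Each piece appended to `out` is independent and could become one task.
void split(Range r, std::size_t grain, std::vector<Range>& out) {
    assert(grain > 0);  // a zero grain would never terminate
    const std::size_t size = r.end - r.begin;
    if (size <= grain) {
        out.push_back(r);
        return;
    }
    const std::size_t mid = r.begin + size / 2;
    split(Range{r.begin, mid}, grain, out);
    split(Range{mid, r.end}, grain, out);
}
```

A smaller grain yields more pieces, and therefore more opportunities for load balancing, at the cost of more per-task overhead.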

3.2 Task Scheduling

Task scheduling is the heart of TBB. The scheduler is the component that allocates tasks to the available worker threads and maintains the load balance. Whenever the task scheduler is initialized it creates a task graph. Each node in the graph represents a task. Each arrow points from a task to its parent task. Each node also keeps a count of its child tasks - a reference count. Nodes also keep one more counter called the depth, which is usually one more than that of the parent. One such task graph is shown in figure 1. There are 2 ways to traverse this task graph, breadth first and depth first. Depth first traversal is usually more efficient for sequential execution. There are two main reasons for that.

1. The deepest task is the last one created, so it is more likely to be in cache. Executing such a task improves performance.

2. It minimizes memory usage. If the graph is unfolded in breadth first fashion then many more tasks exist simultaneously, and they all occupy memory.

On the other hand, the depth first approach reduces parallelism. So the TBB task scheduler uses a mixture of both schemes.

Figure 1 : Sample Task Graph (image not reproduced; the nodes in the figure are labeled Depth = 0, RefCount = 2; Depth = 1, RefCount = 2; Depth = 1, RefCount = 0; Depth = 2, RefCount = 0; Depth = 2, RefCount = 2)

The scheduler creates a sufficient number of worker threads based on the number of available processing units. Each task has an Execute() method. Once a thread starts running a task’s Execute() method, that task is bound to that thread and cannot move to another thread. The thread can execute some other task if the current task is waiting. The task graph is searched in breadth first fashion and the tasks are assigned to the worker threads. After that each thread keeps its own ready pool. This pool is basically an array of lists. The array is indexed by the depth of the task node, and each list works as a stack (LIFO). A newly created task (in ready state) is pushed onto the list at the level of its depth. A task always goes to the ready pool of the thread that created it.

[Figure 2 shows a ready pool drawn as an array of lists ordered from Shallow to Deep, holding Task A, Task B, Task C and Task D.]

Figure 2 : Sample Ready Pool

When it comes to selecting the next task to execute, a thread applies the following rules in order.

1. Run the task returned by the Execute() method of the previous task, if any.

2. Otherwise, select the task at the top of the deepest non-empty list in the thread's own pool.

3. If the thread's own pool is empty, steal a task from the shallowest non-empty list of another thread's pool.

In summary, the TBB task scheduler combines breadth-first task stealing with depth-first work. Breadth-first stealing maximizes parallelism, while depth-first execution ensures that the threads work efficiently once they have enough work to do.
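The ready pool and the selection rules can be modelled in a few lines. This is a single-threaded toy sketch in plain C++, not the TBB scheduler: the class name ReadyPool and its methods are hypothetical, there is no locking, and taking the newest entry when stealing is an assumption, since the text only says which list a thief looks at.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model of one thread's ready pool: levels_[d] is a LIFO stack that
// holds the names of the ready tasks at depth d.
class ReadyPool {
public:
    // A newly created ready task goes onto the stack for its depth.
    void push(std::size_t depth, const std::string& task) {
        if (levels_.size() <= depth) levels_.resize(depth + 1);
        levels_[depth].push_back(task);
    }

    // Owner's rule: take the newest task from the deepest non-empty level.
    bool take_deepest(std::string& task) {
        for (std::size_t d = levels_.size(); d-- > 0;) {
            if (pop_from(d, task)) return true;
        }
        return false;
    }

    // Thief's rule: take a task from the shallowest non-empty level.
    bool steal_shallowest(std::string& task) {
        for (std::size_t d = 0; d < levels_.size(); ++d) {
            if (pop_from(d, task)) return true;
        }
        return false;
    }

private:
    bool pop_from(std::size_t depth, std::string& task) {
        if (levels_[depth].empty()) return false;
        task = levels_[depth].back();
        levels_[depth].pop_back();
        return true;
    }

    std::vector<std::vector<std::string> > levels_;
};
```

With Task A at depth 0, Tasks B and C at depth 1 and Task D at depth 2, the owning thread would pick D first, while a thief would take A. The thief therefore tends to get a task near the root, which is likely to unfold into a large amount of further work.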

3.3 TBB Provided Algorithms

TBB provides a number of constructs / algorithms that help parallelize the most common parallel structures in software. Some of them are listed below with a short description.

1. parallel_for and parallel_reduce : They are used to parallelize a loop with a fixed number of independent iterations.

2. parallel_scan : It is used to parallelize a loop in which each iteration depends on the result of the previous iteration. For example, a running sum calculation can be parallelized using this construct.

3. parallel_while : This is used to parallelize a continuous, unstructured stream of work. New work can be added on the go.

4. pipeline : This construct efficiently parallelizes a segment of code that has a typical pipeline structure.

5. parallel_sort : The complexity of this construct is no higher than O(n log n) on a single processor. As more processors become available, the complexity approaches O(n).
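It may not be obvious how a running sum can be parallelized at all, since every element depends on the one before it. The sketch below shows one common way to break the dependency, written as plain sequential C++ rather than with TBB's parallel_scan. The function name chunked_running_sum and the three-pass layout are assumptions made for this illustration, not a description of how TBB implements the construct.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Running sum computed chunk by chunk. Passes 1 and 3 treat every chunk
// independently, so they could run in parallel. Only pass 2 is sequential,
// and it handles one value per chunk rather than one value per element.
std::vector<long> chunked_running_sum(const std::vector<long>& in, std::size_t chunk) {
    const std::size_t n = in.size();
    if (chunk == 0) chunk = 1;
    const std::size_t num_chunks = (n + chunk - 1) / chunk;
    std::vector<long> chunk_total(num_chunks, 0);
    std::vector<long> chunk_start(num_chunks, 0);
    std::vector<long> out(n, 0);

    // Pass 1: total of every chunk.
    for (std::size_t c = 0; c < num_chunks; ++c) {
        const std::size_t end = std::min(n, (c + 1) * chunk);
        for (std::size_t i = c * chunk; i < end; ++i) chunk_total[c] += in[i];
    }

    // Pass 2: sum of everything that precedes each chunk.
    for (std::size_t c = 1; c < num_chunks; ++c) {
        chunk_start[c] = chunk_start[c - 1] + chunk_total[c - 1];
    }

    // Pass 3: running sum inside every chunk, starting from its offset.
    for (std::size_t c = 0; c < num_chunks; ++c) {
        const std::size_t end = std::min(n, (c + 1) * chunk);
        long running = chunk_start[c];
        for (std::size_t i = c * chunk; i < end; ++i) {
            running += in[i];
            out[i] = running;
        }
    }
    return out;
}
```

The price of the parallelism is that every element is read twice, so this form pays off only when several processors are available.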

In each of these constructs it is possible to provide the chunk size manually. In that case TBB divides the work accordingly, and each worker thread works on a chunk of data of the specified size. TBB also provides auto_partitioner functionality, with which it determines the chunk size based on the available resources and the parallelism of the code. The auto_partitioner works well in cases where the actual data size is not known in advance.
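To make the effect of a manually chosen chunk size concrete, here is a minimal sketch of range splitting in plain C++. It assumes a simple rule: a range is halved again and again until no piece is larger than the requested chunk size. The names Chunk and split_range are made up for this illustration, and the code is not TBB's partitioner.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

typedef std::pair<std::size_t, std::size_t> Chunk;   // half-open range [first, second)

// Halve [begin, end) until no piece is larger than grain, and collect the
// resulting pieces in ascending order.
void split_range(std::size_t begin, std::size_t end, std::size_t grain,
                 std::vector<Chunk>& out) {
    if (end - begin <= grain || end - begin < 2) {
        if (begin < end) out.push_back(Chunk(begin, end));
        return;
    }
    const std::size_t middle = begin + (end - begin) / 2;
    split_range(begin, middle, grain, out);
    split_range(middle, end, grain, out);
}
```

Splitting the range [0, 100) with a chunk size of 16 gives 8 pieces of 12 or 13 elements each. Under this rule the chunk size is an upper bound on the size of a piece, not its exact size.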

3.4 Containers

TBB provides highly concurrent containers. These containers are quite similar to the STL containers, but they are thread safe. STL containers are usually not thread safe, so the common practice is to hold a lock during every access, which defeats the purpose of parallelism. TBB currently provides only three such containers: a concurrent queue, a concurrent vector and a concurrent hash map. TBB uses two different kinds of locking mechanism to provide maximum parallelism.

1. Fine grained lock : It locks only the portion of the container that is being used. As long as two threads access different portions of the same container, their accesses proceed in parallel.

2. Lock-free algorithm : This algorithm allows concurrent access without locking the container, but it keeps track of any possible corruption and provides a correction for it.

These concurrent, thread-safe containers are not as fast as their STL counterparts. But if used properly, the gain from parallelism can outweigh the slowness of these containers.
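The idea of fine grained locking can be sketched with standard C++ alone. The class below is a hypothetical illustration, not a TBB container: the name StripedCounterMap, the choice of 16 stripes and the use of std::mutex are all assumptions. Each stripe has its own lock, so two threads wait for each other only when their keys happen to fall into the same stripe.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Counter map divided into stripes, each protected by its own mutex.
class StripedCounterMap {
public:
    // Add delta to the counter stored under key, locking one stripe only.
    void add(const std::string& key, int delta) {
        Stripe& s = stripe_for(key);
        std::lock_guard<std::mutex> guard(s.lock);
        s.counts[key] += delta;
    }

    // Read the counter stored under key (0 if the key is absent).
    int get(const std::string& key) {
        Stripe& s = stripe_for(key);
        std::lock_guard<std::mutex> guard(s.lock);
        std::unordered_map<std::string, int>::const_iterator it = s.counts.find(key);
        return it == s.counts.end() ? 0 : it->second;
    }

private:
    struct Stripe {
        std::mutex lock;
        std::unordered_map<std::string, int> counts;
    };

    // The hash of the key decides which stripe, and so which lock, is used.
    Stripe& stripe_for(const std::string& key) {
        return stripes_[std::hash<std::string>()(key) % kStripes];
    }

    static const std::size_t kStripes = 16;
    Stripe stripes_[kStripes];
};
```

A single lock around one std::unordered_map would serialize every access. With stripes, the extra hashing and the array of mutexes are the cost that buys the additional parallelism.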

3.5 Scalable Memory Allocation

Memory allocation is a huge bottleneck on multiprocessor systems. All the parallel threads try to allocate memory from the same heap, and that reduces parallelism. There is also a chance of false sharing in this case. False sharing is caused by the way a processor accesses memory. Even if it needs to read only one byte, it has to read the entire cache line. So if threads on more than one processor use different bytes in the same cache line, every write by one of them invalidates that line in the caches of the others. The cache miss ratio then becomes very high, and that hurts performance a lot. To avoid these problems, TBB offers two different allocators with which we can minimize the bottleneck.

1. scalable_allocator : Allocating memory using this allocator ensures that each thread is

given memory from a different pool.

2. cache_aligned_allocator : Using this allocator makes sure that besides using separate

pools for each thread, each memory allocation is aligned with the cache line. This is

likely to increase memory wastage, hence should be used very carefully.
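The trade-off between false sharing and memory wastage can be shown without any allocator. The following sketch uses the standard alignas specifier rather than cache_aligned_allocator, and it assumes a cache line of 64 bytes, which is common but not guaranteed. The struct names are made up for this illustration.

```cpp
#include <cstddef>

// Assumed cache line size in bytes (an assumption, not a value from TBB).
const std::size_t kCacheLine = 64;

// Two counters packed together: both normally fall into the same cache line,
// so two threads updating a and b separately would keep invalidating each
// other's cached copy of that line.
struct PackedCounters {
    long a;
    long b;
};

// alignas makes every PaddedCounter start on its own cache line. An array of
// them gives each thread a private line, at the cost of unused padding bytes.
struct alignas(kCacheLine) PaddedCounter {
    long value;
};
```

Each PaddedCounter occupies a full 64 bytes even though only sizeof(long) of them carry data. This is the kind of memory wastage that makes cache-line alignment something to use with care.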

Chapter 4

OCB IMPLEMENTATION USING TBB

This chapter presents the actual implementation of the OCB algorithm using the TBB template library. As discussed in chapter 1, this implementation parallelizes the main loop of the algorithm, which performs the actual encryption and authentication. The calculation of offset values is not parallelized, as the offsets can be calculated in advance.
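The reason the offsets are computed up front is that they form a chain: each offset is obtained by doubling the previous one, so offset i cannot be produced before offset i - 1. The sketch below illustrates this with std::array instead of the BLOCK type used in this chapter. The names Block, double_block and precompute_offsets are hypothetical, and the doubling follows the shape of the times2 function listed in section 4.2.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::array<std::uint8_t, 16> Block;

// Doubling of a 128-bit block: shift the block left by one bit (byte 0 is the
// most significant byte) and, if a bit fell off the top, xor 0x87 into the
// last byte.
Block double_block(const Block& in) {
    Block out;
    const std::uint8_t carry = in[0] >> 7;
    for (std::size_t i = 0; i + 1 < in.size(); ++i) {
        out[i] = static_cast<std::uint8_t>((in[i] << 1) | (in[i + 1] >> 7));
    }
    out[15] = static_cast<std::uint8_t>(in[15] << 1);
    if (carry) out[15] ^= 0x87;
    return out;
}

// Every offset is the double of the previous one, so the chain is computed
// sequentially, once, before the per-block work starts.
std::vector<Block> precompute_offsets(const Block& start, std::size_t count) {
    std::vector<Block> offsets;
    offsets.reserve(count);
    Block current = start;
    for (std::size_t i = 0; i < count; ++i) {
        current = double_block(current);
        offsets.push_back(current);
    }
    return offsets;
}
```

Once the offsets are stored in an array, the work for block i needs only offset i and plaintext block i, which is what makes the main loop safe to divide among threads.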

4.1 Class Definitions

The definition of class encryptBlockParallel in section 4.1.2 is the main place where the TBB operation is defined. Each worker thread created by TBB for the encryption process calls the function call operator defined in this class. Looking at the operator closely, we can see that TBB passes it a range of values described by the blocked_range template. The worker thread works through that specific range of values before taking the next task from the ready pool. Similarly, the class definition for xorBlockParallel in section 4.1.3 parallelizes the xor operation on the two blocks passed as its parameters. This can improve performance only if the block size is very large. TBB can weigh the cost of dividing the xor operation and will divide it only if that is worthwhile. So in our case, where the block size is 16 bytes, it is very unlikely that TBB will spread that operation across multiple worker threads.

4.1.1 Class Definition : OCB_With_TBB

#define BLOCK_LEN 16
typedef unsigned char byte;
typedef byte BLOCK[BLOCK_LEN];

using namespace tbb;
using namespace std;

class OCB_With_TBB
{
private:
    int nextBlockNum;
    HANDLE hThread1, hThread2, hSemaphore;
    DWORD dwThreadID1, dwThreadID2;
    BLOCK *pOffset;
    BLOCK key, nonce;
    BLOCK checksum;
    BLOCK currentOffset;
    CRijndael objAES;
    unsigned int lenHeader;
    unsigned int lenPlainText;

    bool times2(BLOCK* input, BLOCK* output);
    bool times3(BLOCK* input, BLOCK* output);
    void xorBlock(BLOCK* retBlock, BLOCK* leftBlock, BLOCK* rightBlock);
    void pmac(BLOCK* result);

public:
    BLOCK *pPlainText, *pCipherText, *pHeader;
    BLOCK authTag;
    OCB_With_TBB(void);
    ~OCB_With_TBB(void);
    void Initialize(BLOCK* pKey, BLOCK* pNonce, BLOCK* pPlainText, unsigned int lenPlainText, BLOCK* pHeader, unsigned int lenHeader);
    void OCBEncrypt();
    // int getNextBlockNum();
    void xorBlockWithOffset(BLOCK* retBlock, BLOCK* input, int blockNum);
    CRijndael* getAESObject();
};

4.1.2 Class Definition : encryptBlockParallel

class encryptBlockParallel
{
    OCB_With_TBB* const objOCB;

public:
    // Called by TBB with the sub-range of block indices assigned to one task.
    void operator()(const blocked_range<size_t>& range) const
    {
        BLOCK tempBlock1, tempBlock2;

        // The range is processed 100000 times so that the run is long enough
        // to be timed; OCBEncrypt divides the measured time by this count.
        for(int i = 0; i < 100000; ++i)
        {
            for(size_t index = range.begin(); index < range.end(); ++index)
            {
                // cipher[index] = AES(plain[index] xor offset[index]) xor offset[index]
                objOCB->xorBlockWithOffset(&tempBlock1, &((objOCB->pPlainText)[index]), index);
                (objOCB->getAESObject())->EncryptBlock((const char*)&tempBlock1, (char*)&tempBlock2);
                objOCB->xorBlockWithOffset(&((objOCB->pCipherText)[index]), &tempBlock2, index);
            }
        }
    }

    encryptBlockParallel(OCB_With_TBB* objOCB) : objOCB(objOCB) { }
};

4.1.3 Class Definition : xorBlockParallel

class xorBlockParallel
{
    BLOCK *ret;            // result block; it is written to, so it is not const
    BLOCK const *left;
    BLOCK const *right;

public:
    // Called by TBB with the sub-range of byte indices assigned to one task.
    void operator()(const blocked_range<size_t>& range) const
    {
        for(size_t index = range.begin(); index < range.end(); ++index)
        {
            ((unsigned char*)ret)[index] = ((unsigned char*)left)[index] ^ ((unsigned char*)right)[index];
        }
    }

    xorBlockParallel(BLOCK* retBlock, BLOCK* leftBlock, BLOCK* rightBlock)
    {
        ret = retBlock;
        left = leftBlock;
        right = rightBlock;
    }
};

4.2 Class Implementation

Below is the implementation of class OCB_With_TBB and its related functions. The main thing to notice here is the call to parallel_for() in function OCBEncrypt. This TBB-provided function divides the loop into tasks and gives each task to an available worker thread. The last parameter to parallel_for is specified as “auto_partitioner”, which indicates that TBB will divide the work on its own. TBB also provides a way to specify our own partitioner, with which a programmer can specify how to divide the loop. In other words, we can specify the grain size.

#define PROCESSOR_FREQUENCY 2680000000

void OCB_With_TBB::xorBlock(BLOCK* retBlock, BLOCK* leftBlock, BLOCK* rightBlock)
{
    int loopIndex = 0;
    /*
    parallel_for(blocked_range<size_t>(0, BLOCK_LEN),
                 xorBlockParallel(retBlock, leftBlock, rightBlock),
                 auto_partitioner());
    */

    for(loopIndex = 0; loopIndex < BLOCK_LEN; ++loopIndex)
    {
        ((unsigned char*)retBlock)[loopIndex] = ((unsigned char*)leftBlock)[loopIndex] ^ ((unsigned char*)rightBlock)[loopIndex];
    }
}

void OCB_With_TBB::xorBlockWithOffset(BLOCK* retBlock, BLOCK* input, int blockNum)
{
    xorBlock(retBlock, input, &(pOffset[blockNum]));
}

bool OCB_With_TBB::times2(BLOCK* input, BLOCK* output)
{
    int loopIndex = 0;
    if(NULL != input && NULL != output)
    {
        unsigned char carry = 0;
        carry = ((unsigned char*)input)[0] >> 7;
        for(loopIndex = 0; loopIndex < BLOCK_LEN - 1; ++loopIndex)
        {
            ((unsigned char*)output)[loopIndex] = (((unsigned char*)input)[loopIndex] << 1) | (((unsigned char*)input)[loopIndex + 1] >> 7);
        }
        ((unsigned char*)output)[BLOCK_LEN - 1] = (((unsigned char*)input)[BLOCK_LEN - 1] << 1);

        if(carry)
        {
            ((unsigned char*)output)[BLOCK_LEN - 1] ^= 0x87;
        }
        return true;
    }
    return false;
}

bool OCB_With_TBB::times3(BLOCK* input, BLOCK* output)
{
    if(NULL != input && NULL != output)
    {
        int loopIndex = 0;
        BLOCK temp;
        times2(input, output);
        xorBlock(&temp, output, input);
        memcpy(output, temp, BLOCK_LEN);
        return true;
    }
    return false;
}

void OCB_With_TBB::Initialize(BLOCK* pKey, BLOCK* pNonce, BLOCK* pPlainText, unsigned int lenPlainText, BLOCK* pHeader, unsigned int lenHeader)
{
    BLOCK temp;
    unsigned int loopIndex = 0;

    unsigned int numPlainTextBlocks = ceil((double)lenPlainText / (double)BLOCK_LEN);
    unsigned int numHeaderBlocks = ceil((double)lenHeader / (double)BLOCK_LEN);
    nextBlockNum = 0;

    hSemaphore = CreateSemaphore(NULL, 1, 1, NULL);

    // Check the caller's pointers; the member arrays key and nonce can never be NULL
    if(NULL != pKey && NULL != pNonce && NULL != pPlainText)
    {
        this->lenPlainText = lenPlainText;
        this->lenHeader = lenHeader;

        memcpy(&key, pKey, sizeof(key));
        memcpy(&nonce, pNonce, sizeof(nonce));
        this->pPlainText = (BLOCK*)calloc(numPlainTextBlocks, BLOCK_LEN);
        this->pCipherText = (BLOCK*)calloc(numPlainTextBlocks, BLOCK_LEN);
        this->pOffset = (BLOCK*)calloc(numPlainTextBlocks, BLOCK_LEN);
        this->pHeader = (BLOCK*)calloc(numHeaderBlocks, BLOCK_LEN);

        memcpy(this->pPlainText, pPlainText, lenPlainText);
        memcpy(this->pHeader, pHeader, lenHeader);

        memset(&authTag, 0, sizeof(authTag));

        // Initializing AES
        objAES.MakeKey((const char*)&key, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", sizeof(key), BLOCK_LEN);

        // Initializing Offset
        memset(temp, 0, sizeof(temp));
        memset(currentOffset, 0, sizeof(currentOffset));

        objAES.EncryptBlock((const char*)&nonce, (char*)&currentOffset);
        memset(&checksum, 0, sizeof(checksum));

        // Now pre-calculating offset and checksum values
        if(1 < numPlainTextBlocks)
        {
            times2(&currentOffset, &temp);
            memcpy(pOffset, &temp, BLOCK_LEN);
            xorBlock(&temp, &checksum, pPlainText);
            memcpy(&checksum, &temp, BLOCK_LEN);

            for(loopIndex = 1; loopIndex < numPlainTextBlocks - 1; ++loopIndex)
            {
                times2(&(pOffset[loopIndex - 1]), &temp);
                memcpy(&(pOffset[loopIndex]), &temp, BLOCK_LEN);
                xorBlock(&temp, &checksum, &(pPlainText[loopIndex]));
                memcpy(&checksum, &temp, sizeof(checksum));
            }
            memcpy(&currentOffset, pOffset[loopIndex - 1], BLOCK_LEN);
        }
    }
}

void OCB_With_TBB::OCBEncrypt()
{
    unsigned int bitLength = 0;
    unsigned int loopIndex = 0, numBlocks = 0, numPlainTextBlocks = 0;
    BLOCK tempBlock1, tempBlock2, pad;

    numPlainTextBlocks = ceil((double)lenPlainText / (double)BLOCK_LEN);

    long int startTime, endTime;
    double cpuCycles;

    startTime = clock();

    parallel_for(blocked_range<size_t>(0, numPlainTextBlocks), encryptBlockParallel(this), auto_partitioner());
    endTime = clock();
    cpuCycles = ((double)(endTime - startTime) / (double)CLOCKS_PER_SEC) * PROCESSOR_FREQUENCY;
    cpuCycles = cpuCycles / (double)100000; // loop count
    cpuCycles = (cpuCycles / (double)(numPlainTextBlocks * 16));

    // Now processing last block
    numBlocks = ceil((double)lenPlainText / (double)BLOCK_LEN);
    times2(&currentOffset, &tempBlock1);
    memcpy(&currentOffset, &tempBlock1, BLOCK_LEN);

    if(1 < numBlocks)
    {
        memcpy(&(pOffset[numBlocks - 1]), currentOffset, BLOCK_LEN);
    }
    memset(&tempBlock1, 0, sizeof(tempBlock1));

    //numPlainTextBlocks = ceil((double)lenPlainText / (double)BLOCK_LEN);
    if(1 < numPlainTextBlocks)
    {
        bitLength = ((this->lenPlainText) - ((numPlainTextBlocks - 1) * BLOCK_LEN)) * 8;
    }
    else
    {
        bitLength = this->lenPlainText * 8;
    }
    for(loopIndex = 0; loopIndex < sizeof(bitLength); ++loopIndex)
    {
        // following line is specific to a little endian machine
        tempBlock1[sizeof(tempBlock1) - sizeof(bitLength) + loopIndex] |= ((unsigned char*)(&bitLength))[(sizeof(bitLength) - loopIndex) - 1];
    }

    xorBlock(&tempBlock2, &tempBlock1, &currentOffset);
    (this->getAESObject())->EncryptBlock((const char*)&tempBlock2, (char*)&pad);

    for(loopIndex = 0; loopIndex < ceil((double)bitLength / 8); ++loopIndex)
    {
        ((unsigned char*)(&(pCipherText[numBlocks - 1])))[loopIndex] = ((unsigned char*)(&(pPlainText[numBlocks - 1])))[loopIndex] ^ ((unsigned char*)(&pad))[loopIndex];
    }

    memset(&tempBlock1, 0, sizeof(tempBlock1));
    memcpy(&tempBlock1, pPlainText[numBlocks - 1], (bitLength / 8));
    memcpy(&((unsigned char*)&tempBlock1)[(bitLength / 8)], &((unsigned char*)&pad)[(bitLength / 8)], (BLOCK_LEN - (bitLength / 8)));

    xorBlock(&tempBlock2, &checksum, &tempBlock1);
    memcpy(&checksum, &tempBlock2, BLOCK_LEN);

    // Computing authentication tag
    memset(&tempBlock1, 0, sizeof(tempBlock1));
    memset(&tempBlock2, 0, sizeof(tempBlock2));

    times3(&currentOffset, &tempBlock1);
    xorBlock(&tempBlock2, &checksum, &tempBlock1);
    (getAESObject())->EncryptBlock((const char*)&tempBlock2, (char*)&authTag);

    if(lenHeader > 0)
    {
        pmac(&tempBlock1);
        xorBlock(&tempBlock2, &authTag, &tempBlock1);
        memcpy(&authTag, &tempBlock2, sizeof(BLOCK));
    }
}

void OCB_With_TBB::pmac(BLOCK* result)
{
    unsigned int numHeaderBlocks, loopIndex;
    BLOCK offset, checksum, tempBlock1, tempBlock2;

    numHeaderBlocks = ceil((double)lenHeader / (double)BLOCK_LEN);
    memset(&offset, 0, sizeof(offset));
    memset(&checksum, 0, sizeof(checksum));
    memset(&tempBlock1, 0, sizeof(tempBlock1));
    memset(&tempBlock2, 0, sizeof(tempBlock2));

    objAES.EncryptBlock((const char*)&tempBlock1, (char*)&offset);

    times3(&offset, &tempBlock1);
    memcpy(&offset, &tempBlock1, sizeof(offset));
    times3(&offset, &tempBlock1);
    memcpy(&offset, &tempBlock1, sizeof(offset));

    for(loopIndex = 0; loopIndex < numHeaderBlocks - 1; ++loopIndex)
    {
        times2(&offset, &tempBlock1);
        memcpy(offset, &tempBlock1, sizeof(offset));

        xorBlock(&tempBlock1, &(pHeader[loopIndex]), &offset);
        objAES.EncryptBlock((const char*)&tempBlock1, (char*)&tempBlock2);
        xorBlock(&tempBlock1, &checksum, &tempBlock2);
        memcpy(&checksum, &tempBlock1, sizeof(checksum));
    }

    // Now processing last block
    times2(&offset, &tempBlock1);
    memcpy(&offset, &tempBlock1, sizeof(offset));

    if(0 == (lenHeader % BLOCK_LEN))
    {
        times3(&offset, &tempBlock1);
        memcpy(&offset, &tempBlock1, sizeof(offset));

        xorBlock(&tempBlock1, &checksum, &pHeader[numHeaderBlocks - 1]);
        memcpy(&checksum, &tempBlock1, BLOCK_LEN);
    }
    else
    {
        times3(&offset, &tempBlock1);
        memcpy(&offset, &tempBlock1, BLOCK_LEN);
        times3(&offset, &tempBlock1);
        memcpy(&offset, &tempBlock1, BLOCK_LEN);

        memset(&tempBlock1, 0, BLOCK_LEN);
        // assuming lenHeader in bytes and not in bits
        memcpy(&tempBlock1, &(pHeader[numHeaderBlocks - 1]), BLOCK_LEN);
        ((unsigned char*)&tempBlock1)[lenHeader % BLOCK_LEN] = 0x80;
        xorBlock(&tempBlock2, &checksum, &tempBlock1);
        memcpy(&checksum, &tempBlock2, BLOCK_LEN);
    }

    xorBlock(&tempBlock1, &offset, &checksum);
    objAES.EncryptBlock((const char*)&tempBlock1, (char*)result);
}

Chapter 5

RESULTS

5.1 Experiments

Experiment A : The first experiment was carried out on a machine with a 2.67 GHz Intel Xeon processor, 6 GB RAM and 8 MB cache memory. There was also a need to compare results for a machine with a 2 core processor and a machine with a 4 core processor. I used a setting in the Windows operating system to change the number of visible processor cores, so that the OS sees either 1, 2 or 4 cores. The advantage of this approach is that all the results are comparable, as they are taken on the same physical machine and differ only in the visible number of processor cores. All the results are reported in Cycles Per Byte (CPB). Below is a chart based on the numbers I collected from the experiment [Figure 3]. The results clearly indicate that performance improves when the number of visible cores is changed from 1 to 2 to 4. But if we compare the performance with and without TBB for the same number of processor cores, there is not much difference in the numbers. The major point to note here is that the “without TBB” implementation needs to change when it is to be executed optimally on a 1 core machine vs a 2 core machine vs a 4 core machine: we need to change the number of worker threads, and we also need to divide the data range to be processed. The “with TBB” implementation, in contrast, needs absolutely no change. The code, once compiled, can work on all 3 machine configurations and can give optimum performance.

Figure 3 : CPB Comparison at Different Processor Cores

Note : Corresponding CPB numbers are listed in table 1 at the end of the chapter.

[Chart: x-axis Number of Blocks (1 to 248); y-axis CPB (Cycles Per Byte), 0 to 40. Series: Without TBB - 1 Thread, Without TBB - 2 Thread, Without TBB 4 Thread, With TBB - 2 Core, With TBB - 4 Core.]

Experiment B : This experiment was carried out on a machine with an Intel Core 2 Quad 2.66 GHz processor, 2 GB RAM and 6 MB cache memory. Its purpose was to compare the performance of OCB execution at different block sizes. The block sizes compared were 16, 24 and 32 bytes. The chart in figure 4 is based on the results of this experiment. The results indicate that the performance for the 16 byte block length is nearly the same in both cases. For the 32 byte block length, TBB performs slightly better. But there is a striking difference in performance when the block length is 24 bytes. A 24 byte block is not aligned to the word boundary in actual memory, so the obvious suspect would be an unfavorable division of the data range by TBB. If the range is divided such that two adjacent blocks are given to two different worker threads, then there would be a lot of cache misses during the entire run, and the data would go back and forth between the caches of different processors. To test this suspicion, I changed the way parallel_for is called in the implementation: I statically divided the range instead of relying on the auto_partitioner object provided by TBB. Surprisingly, the results did not change. This indicates that the range division done by TBB is not the culprit. The only possibility that remains is that the task scheduler, which assigns tasks to the available worker threads, causes too much overhead and thus a significant drop in performance.
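One way to see why the 24 byte case looked suspicious at first is to count how many blocks straddle a cache line. The sketch below assumes a 64 byte cache line and an array that starts exactly on a line boundary; neither value is stated in this report, and the function name blocks_crossing_lines is made up for this illustration. It only illustrates the layout argument and says nothing about the scheduler overhead that the experiment finally points to.

```cpp
#include <cstddef>

// Count how many of num_blocks consecutive blocks cross a cache line
// boundary, assuming block 0 starts exactly on a boundary.
std::size_t blocks_crossing_lines(std::size_t block_len, std::size_t line_len,
                                  std::size_t num_blocks) {
    if (block_len == 0 || line_len == 0) return 0;
    std::size_t crossing = 0;
    for (std::size_t i = 0; i < num_blocks; ++i) {
        const std::size_t first_byte = i * block_len;
        const std::size_t last_byte = first_byte + block_len - 1;
        if (first_byte / line_len != last_byte / line_len) ++crossing;
    }
    return crossing;
}
```

Under these assumptions no 16 byte or 32 byte block ever crosses a line, while 2 out of every 8 consecutive 24 byte blocks do. Also, since 64 is not a multiple of 24, neighbouring 24 byte blocks end up sharing lines in an irregular pattern.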

Figure 4 : CPB Comparison at Different Block Lengths

Note : Corresponding CPB numbers are listed in table 2 at the end of the chapter.

[Chart: x-axis Number of Blocks (1 to 64); y-axis CPB (Cycles per Byte), 0 to 900. Series: Block Length 16 - With TBB, Block Length 16 - Without TBB, Block Length 24 - With TBB, Block Length 24 - Without TBB, Block Length 32 - With TBB, Block Length 32 - Without TBB.]

Experiment C : The third experiment was carried out on a machine with an Intel Core 2 Quad 2.66 GHz processor, 2 GB RAM and 6 MB cache memory. Its purpose was to observe TBB performance at various chunk sizes on a multicore vs a single core machine. In this experiment, I changed the partitioner parameter in the parallel_for call. Instead of using auto_partitioner, I statically divided the range into chunks of 1, 2, 4, 8, 16, 32, 64, 128 and 256 blocks (1 block = 16 bytes). So for a chunk size of n blocks, each TBB worker thread works on n blocks before asking for more work. The results of the experiment are displayed in figure 5. They indicate that increasing the chunk size on a multicore processor improves the overall performance, but it does not improve the performance on a single core processor. The reason is quite clear. On a multicore machine, more than one thread is active at the same time to take advantage of the increased chunk size, while on a single core machine only one thread is active at a time. So even though the chunk size is increased, there is only one active thread available to take advantage of it. Hence the line for the single core machine in the chart is almost horizontal. The results can also be explained the opposite way. On a multicore machine, performance degrades rapidly as the chunk size decreases, mainly because more threads are active at any given time and they can simultaneously ask for a new task. Simultaneous calls to the task manager cause some penalty to the performance.
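The relation between the chunk size and the number of requests for work is simple arithmetic, sketched below. The function name chunk_count is made up for this illustration, and the sketch assumes that every chunk corresponds to exactly one request to the task manager.

```cpp
#include <cstddef>

// Number of chunks, and therefore requests for work, needed to cover
// num_blocks blocks when every chunk holds at most chunk_size blocks.
std::size_t chunk_count(std::size_t num_blocks, std::size_t chunk_size) {
    if (chunk_size == 0) return 0;   // no valid chunking exists
    return (num_blocks + chunk_size - 1) / chunk_size;
}
```

For 256 blocks, a chunk size of 1 means 256 requests, while a chunk size of 256 means a single request. The smallest chunk sizes therefore put the most pressure on the task manager when several threads are asking at once.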

Figure 5 : CPB Comparison at Different Chunk Sizes

Note : Corresponding CPB numbers are listed in table 3 at the end of the chapter.

[Chart: x-axis Chunk Size (in blocks): 1, 2, 4, 8, 16, 32, 64, 128, 256; y-axis Cycles per Byte (CPB), 0 to 80. Series: 1 CPU Core - With TBB, 4 CPU Core - With TBB.]

5.2 Conclusion

Based on the above experiments, I can say that in most cases TBB performs optimally or nearly optimally. The major advantage of TBB is that we do not need to change or recompile the code for machines with different numbers of processor cores. Once the code is compiled with TBB, it can be executed on any of these machines and we can expect better utilization of the available resources. But there are certain cases where TBB does not perform well, for example OCB execution with the block length set to 24 bytes. In this case TBB performs worse than the implementation with native threads. So if we know the actual size of the input data in advance, it is better to compare the performance of both methods, and if TBB is not worse than the other, we should deploy the application developed using TBB.

Table 1 : CPB comparison at Different Processor Cores (Experiment A)

Num. of Blocks | Without TBB - 1 Thread - 4 Core System | Without TBB - 2 Threads - 4 Core System | Without TBB - 4 Threads - 4 Core System | With TBB - 2 Core System | With TBB - 4 Core System

1 25.031250 26.700000 25.031250 26.800000 25.125000
2 25.865625 12.515625 13.350000 13.400000 13.400000
3 26.143750 17.800000 8.343750 17.308333 17.308333
4 32.540625 12.932813 6.675000 12.981250 6.700000
5 26.032500 20.692500 15.686250 21.105000 10.385000
6 26.143750 17.521875 13.071875 13.120833 13.120833
7 29.799107 14.780357 7.390179 18.664286 7.417857
8 25.865625 16.270312 9.803906 12.981250 9.840625
9 28.925000 17.429167 8.529167 20.286111 8.561111
10 26.032500 13.016250 10.513125 13.065000 7.872500
11 28.520455 16.535795 9.405682 16.597727 9.593182
12 28.090625 12.932812 6.535938 15.354167 8.654167
13 28.112019 16.045673 10.012500 15.976923 7.988462
14 25.984821 14.899554 9.297321 14.955357 9.332143
15 27.812500 13.906250 8.677500 15.745000 7.035000
16 27.638672 14.601563 6.466406 14.656250 8.165625
17 27.583456 15.313235 9.227206 15.370588 9.163235
18 28.925000 14.462500 8.714583 14.516667 8.747222
19 27.402632 15.106579 6.850658 15.163158 8.286842
20 27.284062 13.016250 7.759688 14.321250 6.532500
21 27.335714 14.939286 8.661607 14.915476 8.694048
22 28.368750 12.970739 8.343750 14.313636 8.298864
23 27.135326 15.889402 7.908424 14.710870 7.938043
24 27.117188 14.045313 7.578906 13.120833 6.560417
25 28.168500 14.618250 8.343750 15.678000 8.375000
26 27.983654 13.991827 8.022837 14.044231 7.988462
27 27.009028 15.451389 7.663889 14.516667 7.754630
28 27.891964 13.945982 7.449777 13.998214 7.477679
29 27.850862 14.328233 8.113578 14.439655 8.086207
30 27.756875 14.796250 7.787500 14.795833 7.872500
31 26.861492 14.265121 7.536290 14.318548 6.754032
32 26.856445 14.653711 7.352930 14.708594 7.328125

33 28.368750 14.159091 7.888636 14.262879 7.918182
34 27.583456 13.791728 7.656618 13.794118 7.685294
35 27.510536 14.875714 7.437857 14.931429 8.231429
36 27.488021 14.462500 7.231250 14.516667 7.258333
37 27.421622 14.071622 7.712331 14.124324 7.741216
38 28.105263 14.403947 8.255921 14.457895 7.581579
39 27.341827 13.991827 7.316827 14.731410 7.344231
40 27.367500 14.351250 7.175625 13.735000 7.202500
41 27.921037 14.571037 7.611128 14.666463 7.639634
42 27.891964 13.667857 7.429911 14.277381 7.457738
43 27.243314 14.514244 7.257122 14.607558 7.907558
44 27.799858 14.184375 7.699006 13.666477 7.118750
45 27.775417 14.462500 7.527917 14.516667 7.556111
46 27.171603 14.148098 7.364266 14.201087 6.809239
47 27.694149 14.415160 7.740160 14.433511 6.664362
48 27.638672 14.114844 7.057422 14.167708 7.642187
49 27.653571 14.337628 7.969133 19.723980 7.452041
50 27.567750 14.050875 7.275750 14.639500 7.303000
51 28.074265 14.298897 7.656618 14.352451 7.192647
52 27.534375 14.023918 7.028005 14.559615 7.022115
53 27.518632 14.231604 7.839976 14.284906 7.900943
54 27.967014 14.462500 7.725694 14.051389 7.258333
55 27.458523 14.199545 7.099773 14.709545 7.126364
56 27.415179 13.945982 6.972991 13.998214 7.447768
57 27.871053 14.169737 7.758224 14.222807 7.346491
58 27.361746 13.896659 7.192888 15.768103 7.653017
59 27.803072 14.566208 7.495233 14.620763 7.523305
60 27.756875 13.878437 6.953125 13.930417 6.979167
61 27.329201 14.088627 7.249488 20.978689 7.715984
62 27.722782 14.265121 7.563206 23.612097 7.159274
63 27.680060 14.065179 7.019345 20.312698 7.471032
64 27.664746 13.819336 6.909668 17.168750 7.354297
65 27.624231 14.428269 7.599231 14.868846 7.215385

66 27.610227 13.805114 7.104830 14.237500 7.537500
67 27.596642 14.371175 7.397295 14.450000 7.025000
68 27.927022 13.791728 6.871324 14.212868 7.291176
69 27.546467 14.317391 7.545652 14.395290 7.185507
70 27.534375 14.136696 7.080268 14.165714 7.465714
71 27.851673 13.937588 7.333099 14.367254 7.360563
72 27.488021 14.091667 7.231250 14.144444 7.258333
73 27.820120 14.264384 7.132192 13.973630 7.158904
74 27.421622 13.733361 7.374071 14.463851 7.424324
75 27.768000 14.217750 7.298000 14.293333 7.325333
76 27.753947 14.052632 6.850658 14.435855 6.876316
77 27.718588 14.195211 7.433523 19.012338 7.461364
78 27.363221 14.013221 7.338221 14.387821 7.365705
79 27.692801 14.173813 6.928481 15.223418 7.272468
80 27.659531 13.996641 7.154766 14.363125 7.181563
81 27.627083 14.132870 7.396065 14.516667 7.113580
82 27.941387 13.980869 7.305869 14.033232 7.312805
83 27.604744 14.114006 7.217846 14.489759 7.244880
84 27.574107 13.945982 7.112054 13.998214 7.158631
85 27.563824 14.076397 7.362132 16.277059 7.370000
86 27.553779 13.932122 7.257122 14.295930 6.992151
87 27.831681 14.059698 7.192888 14.709195 7.508621
88 27.515412 13.918892 6.788778 14.256534 8.013352
89 27.787500 14.325000 7.312500 14.096348 7.641011
90 27.478750 13.887708 7.527917 14.218889 7.556111
91 27.745261 14.303571 7.151786 14.357143 7.178571
92 27.733899 13.857880 6.783832 14.201087 7.100543
93 27.704839 13.995968 7.572177 14.336559 7.294355
94 27.427859 14.131117 7.189827 14.166223 7.234574
95 27.666118 13.964803 7.131711 14.299211 7.422895
96 27.656055 14.114844 6.779297 14.167708 7.083854
97 27.646198 13.952126 7.242719 14.263402 7.269845
98 27.636543 14.082207 7.441263 14.134949 7.195663

99 27.610227 13.923106 7.096402 14.262879 7.122980
100 27.851438 14.067562 7.025438 14.371500 7.051750
101 27.575681 14.176114 7.220235 13.963861 7.512624
102 27.567096 13.775368 7.149449 14.352451 7.176225
103 27.801699 14.160073 7.080036 14.196845 7.090291
104 27.534375 14.007873 6.995913 17.345913 7.038221
105 27.526429 13.890357 7.199464 16.909524 7.226429
106 27.754776 13.995460 7.115802 14.300708 7.395283
107 27.495386 14.114194 7.064895 14.167056 7.326168
108 27.719792 13.983507 6.999479 14.035880 7.010185
109 27.695126 14.084862 7.164908 14.383486 7.191743
110 27.701250 13.971989 7.099773 14.009091 7.126364
111 27.662162 14.071622 7.035811 14.365766 7.062162
112 27.668471 13.945982 6.972991 13.998214 6.999107
113 27.645133 14.044082 7.369082 14.333850 7.159513
114 27.622204 13.935526 7.070230 14.208114 7.331798
115 27.628696 14.032011 7.023261 14.317609 7.049565
116 27.591918 14.141218 6.962716 13.963147 7.205388
117 27.598558 14.020353 7.331090 14.301923 7.143803
118 27.576801 13.887394 7.070975 14.166525 7.310381
119 27.569433 14.009086 6.997532 14.272689 8.783193
120 27.770781 13.878437 7.161719 13.944375 7.844583
121 27.541271 14.205062 7.102531 14.244421 8.859504
122 27.739549 13.869775 7.030635 14.141393 7.702254
123 27.717530 13.974085 6.987043 14.230691 9.137602
124 27.507460 14.063256 6.930696 14.115927 7.159274
125 27.701250 13.950750 7.289100 14.217400 7.316400
126 27.680060 14.051935 7.019345 14.104563 7.471032
127 27.672343 13.941289 7.174311 14.402362 7.201181
128 27.664746 13.819336 6.909668 14.080469 7.144922
129 27.644331 14.126163 7.063081 16.412403 7.089535
130 27.637067 14.017500 7.214135 14.070000 7.241154
131 27.617176 13.910496 6.955248 14.154389 7.173092

132 27.610227 14.007386 7.104830 14.262879 7.131439
133 27.603383 14.090273 7.038863 14.331955 7.468233
134 27.783442 13.798321 7.185588 14.050000 7.400000
135 27.565278 14.079306 6.946944 14.119630 6.972963
136 27.571186 13.963511 7.079917 14.225184 7.303493
137 27.747536 14.068659 7.223130 15.062774 7.054562
138 27.534375 13.954620 7.170788 21.775000 7.197645
139 27.720459 14.046313 6.927113 15.978777 7.145863
140 27.701250 13.945982 7.068348 14.369107 7.082857
141 27.516622 14.036436 7.195745 14.279078 7.234574
142 27.675396 13.925836 7.156822 14.166725 7.171831
143 27.680245 14.026836 6.908392 14.255070 7.859615
144 27.650260 13.917839 7.057422 14.330556 7.258333
145 27.655216 14.005991 7.181379 14.416552 7.208276
146 27.637243 13.898630 7.132192 14.145719 7.342466
147 27.619515 13.997066 7.083673 14.220408 7.281122
148 27.624578 14.071622 7.035811 14.294088 7.243243
149 27.595973 13.977181 7.156586 14.209396 7.183389
150 27.779125 13.884000 7.120000 14.103500 8.196333
151 27.584106 14.134644 7.072848 14.198675 7.088245
152 27.567311 13.876974 7.015337 14.094243 7.217928
153 27.736152 13.949877 7.143995 14.177288 7.181699
154 27.556047 14.032670 7.108442 14.248377 7.124188
155 27.712016 13.942137 7.051815 14.156452 7.251129
156 27.534375 14.013221 7.006611 14.237500 7.698558
157 27.688495 13.923965 7.132046 14.146815 7.158758
158 27.682239 14.004826 7.086907 14.057278 7.113449
159 27.665566 14.084670 7.031840 14.305975 7.226730
160 27.659531 13.996641 6.998320 14.038594 7.024531
161 27.653571 13.899340 7.120691 14.284317 7.136957
162 27.637384 13.978356 7.231250 14.030710 7.423765
163 27.631633 14.056403 7.023083 14.273466 7.213804
164 27.778582 13.807889 6.980259 14.012805 7.169817

165  27.610227  14.047841  7.099773  14.252727  7.126364
166  27.755535  13.953163  7.057003  14.015512  7.244880
167  27.589334  14.029491  7.014746  15.335778  7.191467
168  27.742969  13.945982  6.972991  17.418006  7.308185
169  27.568935  14.021450  7.089719  17.314941  7.274852
170  27.720882  13.929154  7.048015  15.370588  7.833088
171  27.549013  14.013596  6.997039  14.213012  7.033041
172  27.699310  13.922420  6.966061  14.130378  7.138227
173  27.693533  13.996279  7.070484  14.048699  7.251879
174  27.678233  13.915841  7.029849  14.112356  7.056178
175  27.663107  14.131929  6.989679  15.974714  7.915571
176  27.667116  13.899929  6.949964  14.104261  7.128267
177  27.793644  13.972246  7.061547  14.175989  7.087994
178  27.637500  13.893750  7.171875  18.933146  7.189326
179  27.632263  14.105133  6.973324  18.537291  7.879050
180  27.627083  13.887708  6.943854  14.088611  7.407222
181  27.760256  13.949275  7.191298  14.140331  7.218232
182  27.607727  14.019334  7.014251  14.219093  7.325824
183  27.739549  13.942725  7.112705  14.132240  7.139344
184  27.588791  14.002989  6.928940  14.201087  6.954891
185  27.719291  13.927297  7.180135  14.124324  8.049054
186  27.570262  13.995968  6.997984  14.192473  7.024194
187  27.708389  14.063904  7.094418  14.250936  7.129947
188  27.694149  13.847074  7.065559  14.041489  7.225665
189  27.680060  14.047520  7.019345  14.241931  7.329233
190  27.674901  13.973586  7.131711  14.158158  7.563947
191  27.538743  13.909162  6.945844  14.233115  7.936518
192  27.795117  13.958398  7.048730  14.019401  7.075130
193  27.651101  14.033063  7.020855  14.215803  7.177332
194  27.637597  13.952126  7.105090  14.142526  7.131701
195  27.641346  13.889135  7.077212  14.198846  7.103718
196  27.619515  13.945982  6.904879  14.134949  7.067474
197  27.623319  14.002253  7.140895  14.190736  7.159137

198  27.736648  13.939962  6.969981  14.119066  7.258333
199  27.605653  13.995697  7.060741  14.182789  7.095603
200  27.592781  13.925719  7.025438  14.111875  7.051750
201  27.588340  13.989272  6.998787  14.166667  7.150000
202  27.707859  14.052197  7.088057  14.104827  7.371658
203  27.694674  13.974754  7.053140  14.282882  7.079557
204  27.567096  13.906250  7.018566  14.089706  7.044853
205  27.684970  14.098902  6.984329  14.151707  7.141220
206  27.672087  13.900850  7.071936  14.204976  7.098422
207  27.667391  13.962681  7.045833  14.136353  7.072222
208  27.662740  14.015895  6.883594  14.197236  7.038221
209  27.650150  13.948834  7.098176  14.129306  7.244976
210  27.645625  14.009554  7.072321  14.181667  7.218452
211  27.759775  13.943158  7.030895  14.122393  7.930450
212  27.628833  13.995460  6.997730  14.174292  7.268868
213  27.624472  13.937588  7.090229  15.578286  7.360563
214  27.729322  13.989428  7.057097  14.284463  7.083528
215  27.608110  14.040785  7.016512  14.101163  7.050581
216  27.603906  13.859896  6.991753  14.152199  7.134259
217  27.707402  14.034418  7.082575  14.210484  7.101382
218  27.702781  13.977695  7.042431  14.145298  7.191743
219  27.583904  14.020548  7.010274  14.195434  7.044178
220  27.686080  13.964403  6.985994  14.130909  7.126364
221  27.681618  14.022031  7.067647  14.188235  7.094118
222  27.677196  13.951351  7.035811  14.124324  8.005293
223  27.665331  14.008520  7.116508  14.181166  7.030493
224  27.661021  13.945982  6.972991  14.117857  7.111272
225  27.649333  14.002667  7.409250  14.166778  7.198778
226  27.645133  14.051466  7.022041  14.104093  7.055752
227  27.751239  13.989565  7.116079  14.160022  7.135352
228  27.636842  13.928207  6.960444  14.097917  7.339145
229  27.618177  13.983979  7.163237  14.263100  7.073035
230  27.737527  14.039266  7.016005  14.084565  7.158804

231  27.610227  13.971266  7.101218  14.255628  7.120563
232  27.714197  14.026131  6.955523  14.078664  7.097091
233  27.709844  13.965933  7.154855  14.241094  7.181652
234  27.698397  14.020353  7.117147  14.072863  7.143803
235  27.580532  14.067207  6.980346  14.226809  7.227447
236  27.689936  13.894465  7.056833  14.173623  7.196822
237  27.678718  14.061155  7.034098  14.226899  7.053376
238  27.674606  14.002075  7.109716  14.047479  7.136345
239  27.670528  13.943488  7.079969  14.212971  7.113494
240  27.659531  13.989688  7.050469  14.160729  7.076875
241  27.648626  14.042427  7.125078  14.199274  7.151763
242  27.755036  13.984401  7.102531  14.036777  7.129132
243  27.640818  14.036728  7.073302  14.192695  7.092901
244  27.630123  13.869775  7.037474  14.031557  7.180533
245  27.626327  14.024311  7.117730  14.398163  7.144388
246  27.731098  13.967302  6.987043  16.246138  7.224289
247  27.612070  14.018851  7.060096  18.730162  7.086538
248  27.709325  13.854662  7.031628  14.115927  7.057964
249  27.604744  14.114006  7.110617  14.059237  7.238153
250  27.694575  13.950750  7.082175  14.217400  7.316400
251  27.690613  14.001544  7.153685  14.053984  7.080378
252  27.587351  13.945982  7.025967  14.097917  7.052282
253  27.676186  13.996393  7.097134  14.148123  7.123715
254  27.777461  13.941289  7.174311  14.197933  7.306693
255  27.661985  14.082941  7.048015  14.142255  7.172941
256  27.658228  14.034448  7.320337  14.191699  7.144922

Table 2 : CPB Comparison at Different Block Lengths (Experiment B)

Columns: Num. of Blocks; Block Length = 16 (With TBB, Without TBB); Block Length = 24 (With TBB, Without TBB); Block Length = 32 (With TBB, Without TBB)

1  78.1375  51.5375  172.9000  172.9000  194.5125  194.5125
2  39.0688  25.7688  346.3542  251.0375  370.3219  493.7625
3  26.0458  26.0458  496.5333  282.9944  554.1667  640.6167
4  19.5344  19.5344  593.2354  337.4875  714.2516  824.8078
5  30.9225  30.9225  512.4933  270.2117  610.4700  563.5875
6  26.0458  30.4792  513.7125  285.7653  614.8479  653.7781
7  25.8875  22.0875  554.1667  314.1333  649.4438  749.6688
8  19.5344  22.8594  597.5302  335.5479  714.3555  823.1453
9  26.0458  25.8611  548.3787  275.2361  656.5951  650.8688
10  23.2750  23.4413  550.7308  303.0183  662.4231  718.2831
11  21.3102  18.8920  571.4970  325.8500  677.7710  774.5739
12  21.6125  19.5344  588.8021  336.2868  706.8396  822.5911
13  24.0423  25.9606  562.0955  299.6763  672.3534  700.3601
14  22.2063  22.2063  570.3167  311.7583  667.9688  749.6094
15  20.8367  20.8367  574.9294  326.7367  687.4992  787.9696
16  21.0930  17.8719  589.8411  338.8036  716.7973  823.0934
17  21.4169  22.8838  568.4446  308.6382  688.3728  731.1577
18  20.2271  20.2271  581.1361  319.4463  676.1295  769.2295
19  20.4750  19.1625  576.9750  332.6750  693.1750  797.7375
20  19.5344  18.2044  592.2379  337.7092  713.0463  818.9059
21  23.5125  21.0583  572.3222  311.7583  694.5688  748.3625
22  21.2347  20.0256  580.1117  321.9205  683.0608  772.8358
23  21.4679  19.2272  580.5257  329.0304  694.5997  800.2046
24  19.4651  19.4651  593.1431  334.0701  715.4292  814.4518
25  21.8120  20.8145  576.3333  316.5843  694.0938  761.6578
26  20.9731  18.9269  580.8093  322.3971  688.3709  787.2897
27  20.1963  18.2875  582.4086  329.7086  697.5419  805.7583
28  19.5344  18.5250  590.6229  337.6854  714.3406  818.2766
29  20.5806  19.7207  578.0532  317.6713  697.7914  764.5207
30  20.7813  19.0633  585.9389  327.3278  692.7083  779.7125
31  20.1109  19.3065  583.7699  332.3927  696.3462  746.6234
32  19.4824  18.6512  592.5773  335.5133  713.5242  801.6107

33  21.2598  19.6981  580.4308  323.2639  698.2248  738.7545
34  20.6346  20.5857  584.7110  328.0341  696.0252  781.5950
35  20.0450  18.5725  584.3450  331.4867  702.4775  802.2988
36  19.4420  18.7493  592.1887  334.3472  715.4292  795.5293
37  20.3993  19.6804  581.3059  322.9444  698.2051  757.8978
38  20.4750  19.1188  566.9125  329.0292  691.4469  786.1219
39  19.9926  19.3106  557.7190  327.6972  701.7029  781.9718
40  18.8278  18.2044  553.3077  333.3867  692.2858  795.2153
41  20.2744  19.6256  554.5992  320.1732  705.4880  772.0082
42  19.7917  18.5646  554.1667  327.8028  702.6042  792.8938
43  19.9113  18.7128  568.2657  331.4432  704.9773  804.3794
44  20.1011  18.2875  587.2152  338.0920  715.2528  775.1595
45  20.2086  19.0633  583.7961  334.0270  704.2535  774.6881
46  19.7332  19.1910  569.5870  329.4159  701.6473  797.6386
47  21.0112  18.8181  561.5477  327.9252  703.5735  806.1003
48  18.9456  18.3914  567.1434  335.8943  713.5415  808.2521
49  20.1196  19.0679  560.8845  327.6369  683.6098  773.7411
50  19.7505  18.7198  559.3758  328.3327  700.0621  787.6094
51  19.3632  18.8417  565.7064  334.1299  705.6987  749.9994
52  18.9909  18.4793  580.4683  337.0399  712.1095  770.5528
53  19.6050  19.1031  578.6755  327.4184  706.0292  758.2255
54  19.2419  18.7801  579.8225  329.6676  703.5300  792.7662
55  19.8593  17.9550  588.1824  332.1776  705.3836  781.6622
56  19.0000  18.0797  594.6802  335.8448  713.1977  801.7852
57  20.0667  19.1333  588.4861  329.0389  705.0167  785.2250
58  19.6920  18.8321  590.5888  331.4299  704.2837  788.2543
59  19.3864  18.0339  589.1073  333.4393  705.5481  794.4918
60  19.4790  17.7610  593.9928  334.2364  714.5702  801.1588
61  20.0045  19.1596  589.3790  327.0492  705.8403  779.5081
62  19.7087  18.8506  588.5071  331.2665  705.3558  790.1970
63  19.7917  18.5514  575.8759  335.6315  709.3993  761.9792
64  19.4824  18.2615  595.0365  336.3445  713.5502  787.2067

Table 3 : CPB Comparison at Different Chunk Sizes (Experiment C)

1 CPU Core

Chunk Size (Blocks)  1  2  4  8  16  32  64  128  256
Cycles per Byte (CPB)  76.0983  75.6957  75.5009  75.5009  75.6957  75.6957  74.4748  73.8773  73.0591
Cycles per Byte (CPB)  74.4878  74.4748  74.28  74.28  74.4878  74.6826  73.2669  72.4486  72.4486
Cycles per Byte (CPB)  74.28  74.8904  74.28  74.28  74.267  74.4878  73.0591  72.2408  72.4486
Cycles per Byte (CPB)  74.4748  74.4748  74.4748  74.267  74.28  74.6826  73.0591  72.4616  72.4486
Cycles per Byte (CPB)  74.28  74.28  74.28  74.28  74.28  74.4748  73.0591  72.4486  72.4551
Cycles per Byte (CPB)  74.6826  74.4748  74.28  74.28  74.4748  74.4748  72.8513  72.4486  72.4486
Cycles per Byte (CPB)  74.4748  74.28  74.267  74.0722  74.28  74.4748  72.8513  72.4486  72.4486
Cycles per Byte (CPB)  74.28  74.28  74.28  74.267  74.28  74.6826  73.0591  72.4486  72.4486
Cycles per Byte (CPB)  74.28  74.4748  74.28  74.28  74.267  74.4878  73.0591  72.4486  72.4551
Cycles per Byte (CPB)  74.28  74.28  74.28  74.28  74.28  74.6826  73.0591  72.2538  72.3447
Avg  74.5618  74.5605  74.4203  74.3787  74.4592  74.6826  73.1799  72.5525  72.5006

4 CPU Cores

Chunk Size (Blocks)  1  2  4  8  16  32  64  128  256
Cycles per Byte (CPB)  29.6263  29.0288  28.4054  27.4053  25.3661  24.353  21.5151  NA  NA
Cycles per Byte (CPB)  29.6263  29.0158  28.2105  27.6001  25.5739  24.5608  21.5086  NA  NA
Cycles per Byte (CPB)  29.4314  29.0288  28.4054  27.3923  25.3661  24.5479  21.5151  NA  NA
Cycles per Byte (CPB)  29.6263  28.808  28.4184  27.3923  25.5609  24.5608  21.5086  NA  NA
Cycles per Byte (CPB)  29.6393  28.821  28.2105  27.6001  25.5739  24.7557  21.4112  NA  NA
Cycles per Byte (CPB)  29.4185  29.0158  28.4054  27.3923  25.5739  24.7687  21.5086  NA  NA
Cycles per Byte (CPB)  29.6263  28.821  28.2105  27.3923  25.3661  24.1452  21.4112  NA  NA
Cycles per Byte (CPB)  29.6393  29.0158  28.2105  27.4053  25.5739  24.5608  21.5151  NA  NA
Cycles per Byte (CPB)  29.4185  28.821  28.2105  27.4053  25.5739  24.5479  21.5086  NA  NA
Avg  29.5613  28.9307  28.2986  27.4428  25.5032  24.5334  21.4891
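
The Avg rows of Table 3 are the arithmetic mean of the per-run CPB values in each chunk-size column, and the gain from four cores can be read off as the ratio of the two averages. The following is a minimal C++ sketch of that arithmetic, not code from the project: the helper names mean_cpb and speedup and the vector one_core_chunk1 are hypothetical, and the vector holds the ten 1-core measurements for chunk size 1 copied from the table.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: arithmetic mean of per-run cycles-per-byte (CPB) values.
double mean_cpb(const std::vector<double>& runs) {
    assert(!runs.empty());
    return std::accumulate(runs.begin(), runs.end(), 0.0) / runs.size();
}

// Hypothetical helper: speedup relative to a baseline. Lower CPB is faster,
// so the baseline CPB is divided by the parallel CPB.
double speedup(double baseline_cpb, double parallel_cpb) {
    assert(parallel_cpb > 0.0);
    return baseline_cpb / parallel_cpb;
}

// The ten 1-core runs at chunk size 1, copied from Table 3.
const std::vector<double> one_core_chunk1 = {
    76.0983, 74.4878, 74.28, 74.4748, 74.28,
    74.6826, 74.4748, 74.28, 74.28,   74.28};
```

With these inputs, mean_cpb(one_core_chunk1) reproduces the tabulated average of 74.5618 after rounding. Using the Avg rows directly, speedup(74.5618, 29.5613) is roughly 2.52 at chunk size 1, and speedup(73.1799, 21.4891) is roughly 3.41 at chunk size 64.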

REFERENCES

[1] Definition of Encryption - www.wikipedia.org

[2] Nigel Smart, “Cryptography: An Introduction”.

[3] The OCB Authenticated-Encryption Algorithm <draft-krovetz-ocb-00.txt>. http://www.cs.ucdavis.edu/~rogaway/papers/draft-krovetz-ocb-00.txt
