Parallelized and Efficient
Implementations of Rate Dematching and Channel Deinterleaver
on a Symmetrical Multiprocessor Architecture
ARASH BAZRAFSHAN
Master of Science Thesis Stockholm, Sweden 2010
Master's Thesis in Computer Science (30 ECTS credits)
at the School of Computer Science and Engineering
Royal Institute of Technology, year 2010
Supervisor at CSC was Stefan Nilsson
Examiner was Stefan Arnborg
TRITA-CSC-E 2010:169
ISRN-KTH/CSC/E--10/169--SE
ISSN-1653-5715
Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.kth.se/csc
Praise be to God, Master of the Universe
Parallelized and Efficient Implementations of Rate Dematching and Channel Deinterleaver on a Symmetrical Multiprocessor Architecture
Abstract
In this master's thesis, parallelized implementations of Rate dematching and Channel
deinterleaver have been written for the symmetrical multiprocessor (SMP) architecture
used on Ericsson AB's base stations, which are designed for LTE, the new standard in
mobile communication. This report reviews the capabilities of the SMP architecture and
states how critical loops of a common kind should be written for the architecture so that
their execution times cannot be reduced further and thus are optimal. The report then
describes how the parallelized implementations work and how they have been
optimized, and it suggests how they could be optimized further to reduce execution
time. Some hardware changes to the SMP architecture are also proposed for
consideration. The report also reviews the efficiency of the assembler code produced by
the compiler used to compile implementations that are to run on the SMP architecture.
The conclusion is that the compiler uses the SMP architecture's capabilities
inadequately.
Parallelized and Efficient Implementations of Rate Dematching and Channel Deinterleaver for a Symmetrical Multiprocessor Architecture
Summary
In this degree project, parallelized implementations of Rate dematching and Channel
deinterleaver have been written for the symmetrical multiprocessor (SMP) architecture
used on the base stations that Ericsson AB has developed for LTE, the new standard in
mobile communication. This report examines the SMP architecture's capabilities and
explains how critical loops of a commonly occurring kind should be written for the
architecture so that their execution time cannot be reduced further; the loops are thereby
optimal. The report then explains how the parallelized implementations work and how
they have been optimized, and it also suggests how they can be optimized further.
Proposals for hardware changes to the SMP architecture are also given. The report also
examines the efficiency of the assembler code produced by the compiler used to
compile implementations that are to be executed on the SMP architecture. The
conclusion is that the compiler uses the SMP architecture's capabilities inadequately.
TABLE OF CONTENTS
1 Introduction ................................................................................................................... 1
1.1 Background information ......................................................................................... 1
1.1.1 Network of base stations.................................................................................. 1
1.1.2 Mobile communication standards and 3GPP................................................... 2
1.1.3 LTE and Ericsson ............................................................................................ 2
1.1.4 Processing steps performed on the mobile device ........................................... 2
1.1.5 Processing steps performed on the base station. ............................................. 4
1.2 Specification of thesis ............................................................................................. 5
1.3 Reading guidelines ................................................................................................. 5
1.3.1 Prior knowledge required by the reader .......................................................... 6
1.3.2 Sensitive information regarding EMPA .......................................................... 6
1.3.3 The report’s content ......................................................................................... 8
1.3.3.1 Chapters 1 and 2 ....................................................................................... 8
1.3.3.2 Chapters 3 and 4 ....................................................................................... 8
1.3.3.3 Chapter 5 .................................................................................................. 8
1.3.3.4 Chapters 6, 7 and 8 ................................................................................... 8
1.3.3.5 Appendices ............................................................................................... 9
1.3.4 Mathematical notations used ........................................................................... 9
1.3.4.1 Constants and variables ............................................................................ 9
1.3.4.2 Arrays ....................................................................................................... 9
1.3.4.3 Matrices .................................................................................................. 10
1.3.5 Abbreviations ................................................................................................ 11
1.3.6 Expressions .................................................................................................... 11
2 EMPA .......................................................................................................................... 16
2.1 The Digital Signal Processors .............................................................................. 16
2.1.1 The DSP’s LDM and LPM ............................................................................ 17
2.1.2 The DSP’s registers ....................................................................................... 18
2.1.2.1 Temporary registers ................................................................................ 18
2.1.2.2 Address registers .................................................................................... 18
2.1.2.3 Offset registers ........................................................................................ 18
2.1.3 Instruction execution on the DSP .................................................................. 18
2.1.3.1 The DSP’s instruction pipeline ............................................................... 18
2.1.3.2 Executing parallel instructions on the DSP ............................................ 19
2.1.3.3 LDM instructions and their capabilities ................................................. 20
2.1.4 Hardware support for executing loops on the DSP ....................................... 21
2.2 The Common Memory ......................................................................................... 21
2.3 The Command Bus ............................................................................................... 22
3 Execution methodologies used on EMPA ................................................................... 23
3.1 Locks .................................................................................................................... 23
3.2 To dispatch a JOB ................................................................................................ 23
3.3 Memory allocation ................................................................................................ 27
3.4 Barrier synchronization between the job DSPs .................................................... 27
4 Definition of an optimal loop ...................................................................................... 28
5 Specification of the processing steps ........................................................................... 32
5.1 Overview of the Uplink Transport Channel ......................................................... 32
5.2 Channel deinterleaver ........................................................................................... 33
5.2.1 Specification of Channel interleaver ............................................................. 33
5.2.2 Specification of Channel deinterleaver.......................................................... 36
5.3 Rate dematching ................................................................................................... 38
5.3.1 Specification of Rate matching...................................................................... 38
5.3.1.1 Padding ................................................................................................... 40
5.3.1.2 Permutation I .......................................................................................... 41
5.3.1.3 Permutation II ......................................................... 42
5.3.1.4 Bit collection .......................................................................................... 43
5.3.1.5 Bit selection ............................................................................................ 44
5.3.2 Specification of Rate dematching .................................................................. 44
6 Processing algorithm for Channel deinterleaver ......................................................... 47
6.1 Initial discussion ................................................................................................... 47
6.1.1 Regarding existing literature on matrix transposition ................................... 47
6.1.2 The q symbols ................................................................................................ 48
6.2 Suggestion for processing algorithm .................................................................... 48
6.2.1 How to parallelize .......................................................................................... 48
6.2.2 Clearing the q symbols .................................................................................. 56
6.3 Description of the processing algorithm ............................................................... 57
7 Processing algorithm for Rate dematching .................................................................. 60
7.1 Initial discussion ................................................................................................... 60
7.1.1 Redefinition of Permutation II ....................................................................... 60
7.1.2 Bit collection seen from a new perspective ................................................... 63
7.1.3 Bit selection seen from a new perspective..................................................... 63
7.1.4 Rate dematching seen from a new perspective .............................. 67
7.1.5 Remarks regarding SatFunc and soft combining ........................................... 70
7.1.6 Typical input parameters for Rate dematching.............................. 70
7.2 Suggestion for processing algorithm .................................................................... 70
7.2.1 The benefit of clearing U, V and W .............................................................. 70
7.2.2 Efficient soft-combining of multiple matrix repetitions ................................ 71
7.2.3 Working with bytes in words......................................................................... 71
7.2.4 How to parallelize .......................................................................................... 71
7.3 Description of the processing algorithm ............................................................... 75
8 Conclusion ................................................................................................................... 76
8.1 Implementation of Channel deinterleaver ............................................................ 76
8.2 Implementation of Rate dematching..................................................................... 77
8.3 The implementations’ correctness ........................................................................ 78
References ...................................................................................................................... 79
APPENDICES
Appendices ..................................................................................................................... 80
Appendix 1. Tools used to write implementations ......................................................... 81
Appendix 2. Channel deinterleaver ................................................................................ 82
2.1. Specification of the processing algorithm ........................................................... 82
2.1.1. Specification of deint_master ....................................................................... 82
2.1.2. Specification of deint_slave.......................................................................... 83
2.1.3. Specification of deint_section ...................................................................... 85
2.1.4. Specification of deint_segment .................................................................... 86
2.1.5. Specification of deint_column ...................................................................... 87
2.2. Test cases used for writing implementation ........................................................ 88
2.2.1. Test cases for testing correctness.................................................................. 88
2.2.2. Test cases for measuring performance ......................................................... 90
2.2.3. Test cases for measuring performance during optimization ......................... 91
2.3. Implementation of the processing algorithm ....................................................... 91
2.3.1. Performance of the first version of the implementation ............................... 92
2.3.2. Major optimizations performed in C code .................................................... 93
2.3.2.1. Using specific symbol sizes in deint_column ....................................... 93
2.3.2.2. Unrolling deint_column’s loop .............................................................. 94
2.3.2.3. Reading and writing more words per LDM instruction ........................ 96
2.3.2.4. Clearing the q symbols .......................................................................... 96
2.3.3. Performance of the implementation after C optimizations ........................... 98
2.3.4. Optimizations performed in assembler code .............................................. 100
2.3.4.1. Initial changes before optimization ..................................................... 100
2.3.4.2. Description of the assembler optimizations ........................................ 102
2.3.5. Performance of the implementation after assembler optimizations ........... 103
2.4. Future improvements ......................................................................................... 103
2.4.1. Changes of convenience ............................................................................. 103
2.4.2. Using hardware support to execute loops ................................................... 104
2.4.3. Making the critical loop of deint_column_4 optimal ................................. 104
2.4.4. Making the critical loops of deint_column_2 optimal ............................... 106
2.4.5. Making the critical loops of deint_column_6 optimal ............................... 108
Appendix 3. Rate dematching ...................................................................................... 109
3.1. Specification of the processing algorithm ......................................................... 109
3.1.1. Description of byte_array ........................................................................... 109
3.1.2. Description of rd_data ................................................................................ 109
3.1.3. Description of input_reader ........................................................................ 110
3.1.4. Description of column_traverser ................................................................ 110
3.1.5. Specification of dematch ............................................................................ 110
3.1.6. Specification of soft_comb_array ............................................................... 111
3.2. Test cases used for writing implementation ...................................................... 113
3.2.1. Test cases for testing correctness................................................................ 113
3.2.2. Test cases for measuring performance ....................................................... 113
3.3. Implementation of the processing algorithm ..................................................... 114
3.3.1. Major optimization steps performed .......................................................... 115
3.3.2. Description of assembler optimizations ..................................................... 117
3.3.3. The implementation’s performance after optimizations ............................. 118
3.4. Future improvements ......................................................................................... 120
3.4.1. Bypassing the structures ............................................................................. 120
3.4.2. Processing consecutive columns to obtain optimality ................................ 120
Appendix 4. Review of assembler code produced by EMC ......................................... 123
4.1. First critical loop ................................................................................................ 123
4.2. Second critical loop ........................................................................................... 125
4.3. Conclusion ......................................................................................................... 126
Appendix 5. Hardware changes to EMPA.................................................................... 127
5.1. Swapping contents of temporary register parts ................................................. 127
1 Introduction

This report presents the results of the thesis and explains how they were achieved. The
thesis' purpose must be stated first, which is done in this chapter. The thesis is
concerned with mobile communication, and so, to enable the reader to understand its
purpose, some background information is presented first in this chapter.
The chapter also helps the reader to understand the report. It provides reading
guidelines and presents two lists of special abbreviations and expressions used in the
report. Special mathematical notations used in the report are also explained.
1.1 Background information

The following subparagraphs give a brief introduction to mobile communication in
order to enable the reader to understand the thesis' purpose. Technical details that are
not directly relevant are avoided, and interested readers are referred to the
subparagraphs' sources, namely [4], [8] and [1]. On the other hand, some aspects are
relevant and need to be elaborated on further than is done here. This is covered by
Chapter 5.
1.1.1 Network of base stations
The underlying system that enables users to communicate via mobile phones and other
mobile devices includes a network of base stations, see Figure 1. When a mobile device
is turned on it registers itself with a base station that is sufficiently close to communicate
with. That base station is now responsible for relaying data to and from the device.
If mobile device A is to transmit data to device B, then the former first sends the data to
the base station that it has registered with. Device B may also have registered with the
same station, in which case the station transmits the data to B. If B on the other hand has
registered with another station then the data is first transmitted to that station, which in
turn relays it to B.
Of course mobile devices may be physically moving during communication. The device
always keeps track of what base stations are in its proximity and registers with a new one
if it finds it more suitable.1
1 This process is known as a handover.
Figure 1: A description of how a network of base stations enables mobile communication. The mobile devices A
and B have registered with one base station, while C has registered with another. Device A wishes to send data to
B and C, respectively. It is shown how data reaches its destination through the base stations.
1.1.2 Mobile communication standards and 3GPP
To enable mobile communication, a standard is required that enforces protocols on how
data is to be transmitted. The most widely used standard is the Global System for
Mobile Communications (GSM). Another standard, known as 3G, was originally
developed independently in several countries, such as Japan, the US and South Korea,
each using a different variant of the WCDMA technology. Several standard-developing
organizations from different countries were keeping WCDMA standardized in parallel.
This ended when the Third Generation Partnership Project (3GPP) was formed by many
such organizations from all regions of the world. Its initial objective was to maintain and
develop specifications for 3G mobile technology based on WCDMA. Its scope was later
extended to include maintenance of the GSM standard.
1.1.3 LTE and Ericsson
Long Term Evolution (LTE) is the successor of 3G and is currently being specified by
3GPP. Ericsson AB is a provider of telecommunication equipment and is currently
developing base stations capable of utilizing LTE. Throughout this report, the division
of Ericsson AB that works with LTE is referred to as ELTE. This is the division where
the thesis was performed.
1.1.4 Processing steps performed on the mobile device
Throughout this report, the expression processing step denotes what must be done to
some input to produce a desired output. The expression processing algorithm refers to
how a processing step may be performed (only the word "step" or "algorithm" is used
for short when it is obvious what is referred to). An implementation is, as usual, a
well-defined set of instructions that a processor can execute to carry out an algorithm.
An implementation may use multiple processors simultaneously to execute its work, in
which case it is a parallelized implementation. To clarify with an example, a processing
step may require that some elements are ordered in ascending order. The processing
algorithm can be any sorting algorithm, such as merge sort. An implementation of
merge sort can then be used to perform the processing step.
LTE specifies that the data must be processed in a certain way before it is transmitted
from a mobile device to a base station. A certain set of processing steps is performed in
a specific order on the data: the data is the input to the first step, whose output is the
input to the next step, and so on. See the upper half of Figure 2. This chain of steps is
called the Uplink Transport Channel (ULTC).2 The output of the final step is transmitted
to the base station.
2 Downlink Transport Channel is the chain of processing steps that must be applied before the data is to be
transmitted from the base station to the mobile device, but this is not relevant for the thesis.
[Figure 2 diagram: on the mobile device (ULTC), the original data passes through Processing step 1, Processing step 2, ..., Rate matching, Channel interleaver, ..., Processing step N-1, Processing step N, after which it is ready for transmission to the base station. On the base station, the data received over the transmission passes through Processing step 1, Processing step 2, ..., Channel deinterleaver, Rate dematching, ..., Processing step M-1, Processing step M to recover the original data.]
Figure 2: A comparison between the processing steps performed on the mobile device and the base station. The
names of four of the steps that are relevant to the thesis are shown. Dashed lines mean there are more steps
between. ULTC’s processing steps are performed on the mobile device. Another set of processing steps must be
applied to the received transmission on the base station to obtain the original data.
1.1.5 Processing steps performed on the base station
When ULTC’s processing steps have been applied to the original data, the output of the
last step is transmitted from the mobile device to the base station. The latter must obtain
the original data that was the first input of ULTC’s steps. This is achieved on the base
station by applying another set of processing steps in a specific order to the received
transmission. The lower half of Figure 2 clarifies this.
Some of these steps have been implemented as dedicated hardware components on
ELTE’s base station. The implementations of other steps are executed on
microprocessors. For the latter case, a symmetrical multiprocessor architecture is
provided on the base station. This specific multiprocessor architecture will throughout
this report be referred to as ELTE’s Multiprocessor Architecture (EMPA). EMPA can
execute parallelized implementations. A special compiler is used to compile
implementations that have been written in C code and are to run on EMPA. The compiler
will be referred to as the EMPA Compiler (EMC).
1.2 Specification of thesis

The following two processing steps must be implemented on EMPA to process received
data from mobile devices (Figure 2 shows them):
- Channel deinterleaver
- Rate dematching
The three main purposes of this thesis are to
- write one parallelized implementation for each of the processing steps Channel
  deinterleaver and Rate dematching,
- improve the implementations' memory usage and execution time as much as the
  duration of the thesis permits, and
- suggest how they can be further improved.
During the work, the author decided to extend the thesis' scope by
- suggesting hardware changes to EMPA that may make it easier to write efficient
  implementations in the future, and
- reviewing the efficiency of the assembler code produced by EMC.
1.3 Reading guidelines

The purpose of this report is not only to present the results of the thesis, but also to give
a good understanding of how those results were achieved. This requires a thorough
description of aspects of the thesis such as
- the hardware and the other tools that have been used,
- the specifications of the processing steps that the thesis is concerned with,
- how those specifications can be seen from a new perspective in order to write
  efficient processing algorithms,
- a definition of what characterizes an efficient implementation,
- the implementations that were written and how they have been improved,
- how the implementations can be further improved, and
- how the hardware and the tools that have been used can be further improved.
As the reader can see, a lot of ground needs to be covered. In addition, the author has
done his best to make it easy to understand the necessary information, arguments and
ideas presented throughout this report. Many figures are used, among other things, to
clarify concepts and ideas and to give illustrative examples of arguments and
descriptions. Many boxes of pseudocode are also used to further clarify nearly all the
algorithms that are described and discussed. The author hopes that the report's purpose
and these measures justify the large number of pages that the report requires, and it is
his sincere wish that its length does not deter the reader from reading it.
Having said that, some reading guidelines for this report will now be presented. The
following subparagraphs intend to inform the reader about what he can expect from this
report, and to help him to better understand its content.
1.3.1 Prior knowledge required by the reader
The thesis is concerned with computer science, so to understand this report the reader is
required to have some prior knowledge within certain fields. When it comes to
computer hardware, one should be familiar with basic concepts such as
- primary memory,
- secondary memory,
- data bus, and
- processor.

A deeper understanding of processors is required, but more of their functionality than
of their underlying architecture. Specifically, the reader should be familiar with key
words and phrases such as
- instruction set,
- instruction pipeline,
- clock frequency and cycles,
- memory accesses, and
- registers.
As has been stated, EMPA is a symmetrical multiprocessor architecture. It is believed
that this report describes such architectures in a manner that can serve as an
introduction to them if the reader is unfamiliar with them. However, the thesis is also
concerned with parallel programming, for which the reader should know of concepts
such as workload balancing and synchronization techniques such as locks and barrier
synchronization. One purpose of the thesis is to write parallelized implementations, and
the report will describe them extensively; for this the reader should have basic skills in
C and assembler programming. Finally, the reader should also know what a compiler is
and have a basic understanding of how it works.
1.3.2 Sensitive information regarding EMPA
As has been said, Ericsson AB is currently developing base stations capable of utilizing
LTE. It is of course the company's goal to convince customers that they benefit more
from investing in its base stations than in the competitors'. It is believed that certain
properties of EMPA's components will give Ericsson AB advantages over its
competitors in the market. Such properties may not be publicly revealed, but some of
them are relevant to the thesis, and the report describing the thesis, namely this one,
must be made publicly available. This paragraph describes exactly how such properties
have been concealed and how this affects the report's content. The properties fall into
two broad categories. The first is numbers that in some way describe EMPA's
components, for instance a memory component's latency or size. The second is
properties that, simply put, are more than numbers; they describe for instance the layout
of one of EMPA's components or how access to it is limited in some way.
Properties falling into the first category are simply replaced by named constants. The
constants can then be used in, for instance, formulas or when describing problems that
were encountered in the thesis. What such a constant's value represents is of course
explained the first time it is used. All of the constants are also presented in the table in
this paragraph.

Properties falling into the second category simply cannot be discussed in any way. In
fact, the reader may not even be given a hint as to whether some component of EMPA
has such a property. It has still been the author's goal to make sure that the reader's
understanding of the report is not too severely limited by this.
Table 1: The table shows constants that are used in the report to conceal some sensitive values regarding
EMPA. The meaning and unit of each constant is presented.

Constant name     Meaning                                                        Unit
avail_LDM         The amount of a Digital Signal Processor's Local Data          kilobytes
                  Memory that can be used by an implementation.
lat_CM_read       The minimum time it takes to initiate a Common Memory          cycles
                  instruction that reads from the Common Memory.
lat_CM_write      The minimum time it takes to initiate a Common Memory          cycles
                  instruction that writes to the Common Memory.
speed_CM          The maximum amount of data that can be read from or            words
                  written to the Common Memory by a Common Memory
                  instruction once the instruction has been initiated.
size_CM           The size of the Common Memory.                                 megabytes
nr_PAR_INSTR      The maximum number of instructions that a Digital Signal       instructions
                  Processor can execute in parallel.
nr_PAR_WITH_LDM   The maximum number of instructions that a Digital Signal       instructions
                  Processor can execute in parallel with two Local Data
                  Memory instructions.
bits_MSG_PAYLOAD  The size of the payload of messages that are transferred       bits
                  over the Command Bus.
cycles_MSG_TIME   The amount of time from when a process sends a message         cycles
                  over the Command Bus until the receiving process obtains it.
1.3.3 The report’s content
The following subparagraphs give an overview of the report’s content. The purpose of
each chapter and how the chapter relates to other chapters are described.
1.3.3.1 Chapters 1 and 2
The rest of this chapter continues providing guidelines for reading this report. Next, the
aspects of EMPA that are relevant to the thesis are thoroughly described in Chapter 2. This is
done by describing EMPA component by component and explaining how the components are related
to one another. It is some of these components' properties that can not be publicly
revealed, due to the reasons stated in Paragraph 1.3.2.
1.3.3.2 Chapters 3 and 4
There are some rules and execution methodologies that an implementation that is to run
on EMPA must comply with. These must be understood before it is discussed how any
processing step can be implemented. Chapter 3 covers this topic.
Having covered the relevant aspects of EMPA, it would be good to know whether an
implementation utilizes its capabilities maximally and thus is optimal. But one can hardly
claim to have written an implementation that can not be further improved; even
judging this is highly subjective. However, Chapter 4 shows that this sort of judgment
can be cast on a certain type of loop that is common in implementations that are to run
on EMPA. The chapter defines optimality for such a loop by stating a criterion that the
loop must meet. This criterion is later used when discussing the critical loops of the
implementations written in the thesis. It is also used to suggest hardware changes to
EMPA that could make it easier to write optimal loops of this kind.
1.3.3.3 Chapter 5
The specification of the processing steps that are to be implemented in the thesis must
be thoroughly understood. This becomes easier if one also understands ULTC and some
of its processing steps that are applied on the mobile device (see Paragraph 1.1.4).
Therefore Chapter 5 describes ULTC in more detail and specifies all the processing steps
that are relevant to the thesis.
1.3.3.4 Chapters 6, 7 and 8
Chapters 6 and 7 are the heart of this report. Each of them is devoted to one of the
processing steps, and suggests a processing algorithm that is believed to be efficient. The
algorithms are specified in further detail, and their implementations discussed, in
Appendix 2 and Appendix 3. The appendices also discuss how the implementations'
performance and correctness were verified.
Chapter 8 concludes the report by summarizing its results.
1.3.3.5 Appendices
Appendix 1 describes the tools that were used to write the implementations. Appendix 2
and Appendix 3 then, among other things, describe the processing algorithms and how
they were implemented. Appendix 4 discusses the efficiency of the assembler code
produced by the compiler EMC. Specifically, it reviews how the compiler makes use of
EMPA's capabilities when it compiles some simple critical loops written in C code.
Appendix 5 suggests some hardware changes to EMPA that might make it easier to
write efficient implementations.
1.3.4 Mathematical notations used
Variables, constants, arrays and matrices will be extensively referred to throughout this
report. This paragraph clarifies exactly how they are used and their notations.
1.3.4.1 Constants and variables
A constant or variable is as always a representation of a value. This may be for instance
an integer or a real number. A constant that is introduced will represent the same value
through the rest of the report. For instance it may be used in an algorithm, or can
represent some characteristic of a hardware component. The value of a variable on the
other hand may vary in the context that it is used. For instance it can be the input
parameter of an algorithm or function. Some literature distinguishes between the
two by using capital letters when naming a constant. However, this is not the case in this
report.
A constant or variable can be given a name with multiple characters. Because of this,
when multiplication is performed in formulas, the character "*" will always be used to
avoid confusion. When a sentence is concerned with the constant or variable itself as a
noun, it simply uses its name. But sometimes it is the value that is of interest. To avoid
confusion in the latter case, the name is prefixed by "#". On a final note, a variable that is
a number can be cleared, meaning that its value is set to zero.
The following sentences give examples of these policies.3
nrCAT is an integer variable that denotes the number of cats, while weightCAT = 2.839 is a
constant that denotes every cat's weight in grams. So there are # nrCAT cats, and each one
of them weighs # weightCAT grams. Together they weigh # nrCAT * # weightCAT grams.
1.3.4.2 Arrays
An array is an ordered sequence of elements, where all elements are of the same type.
For instance there can be an array of integers or an array of characters, but there can not
be an array that contains both integers and characters. Also the elements can be either
variables or constants, but not both. The elements are indexed by integers from zero in
ascending order of appearance in the sequence. An array named A that has n elements is
denoted by A(0..n-1). The element of A that has index i, where 0 <= i < n, is denoted by A(i).
3 Apart from this example, cats have absolutely no relevance to the thesis.
A subarray of A is an array in its own right. It consists of (i) all of A's elements that are
between two specified elements and (ii) the two specified elements themselves. The subarray's
elements have the same order as they have in A. The subarray that spans from element
A(i) to A(k) is denoted by A(i..k) if it is important to stress that it is a subarray. In that
case the elements maintain the index values that they have in A. But if it is more
interesting to regard it as an array in its own right, then it can also be given a new name,
say B. In that case it is denoted by B(0..k-i) as any array would be. Now the elements are
given new index values.
One can clear an array that contains number variables, which means that every variable
is set to zero.
1.3.4.3 Matrices
A matrix is as always a rectangular grid of ordered horizontal rows and vertical
columns, and in each row and column there is an element. All the elements are of the
same type and are either variables or constants. The rows are indexed from zero in
ascending order of appearance, and so are the columns. The orientation of a matrix is
such that the topmost row has index zero, and the leftmost column has index zero. The
combination of any row's index coupled with any column's index effectively becomes
the index of the element that is located in that row and column. A matrix M with #r rows
and #c columns is denoted by M(0..r-1, 0..c-1). The element of row s and column t is
denoted by M(s,t).
Some task can be performed on every element of a matrix, such as reading them or
changing them in some way. When it is said that a task is performed on a matrix row by
row, it means that the task is performed on every element one row at a time starting from
the topmost row to the bottommost row. The task is performed on the elements of each
row from left to right. Likewise when a task is performed on a matrix column by column,
it means that the task is performed on every element one column at a time starting from
the leftmost column to the rightmost. The task is performed on the elements of each
column from top to bottom. For instance a matrix’s elements can be read row by row, or
the elements of an array can be placed into a matrix column by column.
A matrix M(0..r-1, 0..c-1) can also be regarded as an array. In that case the array
will always have the same name as the matrix. The matrix's r * c elements appear row
by row in the array, and so it is denoted by M(0..r*c-1). To clarify, element M(i) is
located in row i/c (integer division) and column i%c in the matrix.
A matrix that contains number variables can be cleared, which means that every
variable is set to zero.
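To make the row-by-row mapping concrete, the following C helpers compute the flat index of element M(s,t) and recover the row and column of element M(i). The function names are illustrative and do not come from the thesis code.

```c
/* Illustrative helpers for a matrix M(0..r-1, 0..c-1) stored row by
 * row as the array M(0..r*c-1). Names are hypothetical. */
static int flat_index(int row, int col, int c)
{
    return row * c + col;   /* index of M(row, col) in the flat array */
}

static int row_of(int i, int c)
{
    return i / c;           /* integer division */
}

static int col_of(int i, int c)
{
    return i % c;
}
```

For example, in a matrix with c = 5 columns, element M(13) sits in row 2 and column 3.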
1.3.5 Abbreviations
This report will make use of some abbreviations. What expression each of them
abbreviates will be explained the first time it is presented. They are also presented in the
following table for convenience. For each abbreviation the expression that it represents is
listed. The expressions are further explained in the following paragraph.
Table 2: List of abbreviations that are used in this report.
Abbreviation Expression
3GPP Third Generation Partnership Project
CM Common Memory
DSP Digital Signal Processor
ELTE Ericsson LTE
EMC EMPA Compiler
EMPA Ericsson Multiprocessor Architecture
EMS EMPA Simulator
LDM Local Data Memory
LPM Local Program Memory
LTE Long Term Evolution
CB Command Bus
ULTC Uplink Transport Channel
1.3.6 Expressions
The following table presents expressions used in this report and their meanings. For
each of them an explanation is presented, but the explanation might depend on other
expressions or abbreviations. These have been marked in italics and can be found in the
same table or in Table 2 that lists abbreviations.
Table 3: List of special words and expressions used in this report.
Word or expression Explanation
a0, a1, a2, … These denote the temporary registers of any DSP.
a0l, a0h, a1l, a1h, a2l,
a2h, …, …
These denote the temporary register parts of the temporary
registers.
address register A type of registers in a DSP. Each such register can contain an
LDM address. The registers are then used by LDM instructions.
base station A device that enables mobile communication within its
proximity.
Channel deinterleaver A processing step that is to execute on a base station that
implements the standard LTE.
Channel interleaver One of the processing steps of ULTC.
Clearing The act of setting the value of a variable to zero, or setting every
variable in a matrix or an array to zero.
CM instruction A DSP instruction that is executed to read an array of words
from CM to LDM, or to write in the opposite direction.
column buffer The expression is used when discussing the processing algorithm
and implementation of Channel deinterleaver. It denotes a part of
a job DSP’s LDM that the DSP uses to store one column of its
current segment. The DSP then processes that column by placing
it into the segment.
column repetition This expression is used in the context of Rate matching and Rate
dematching. In the former, it denotes a subarray of the resulting
bit array e that originates from one column of one of the matrices
X, Y or Z. In the latter, it denotes a subarray of the input byte
array E that is to be soft combined with one column of one of the
matrices U, V or W.
Command Bus A component of EMPA that enables communication between any
pair of DSPs via short messages.
criterion for optimal
loop performance
A criterion that defines optimality for a loop that is executed by a
DSP and that reads or writes data to LDM. See the end of
Chapter 4 for a precise definition.
critical loop A loop that contains no other loops and that an implementation
spends a large part of its execution time executing.
Digital Signal
Processor
A component of EMPA, of which there are multiple instances. It is
a microprocessor.
dispatch DSP A DSP that has been dedicated to dispatching JOBs.
DSP instruction Any instruction in the DSP’s instruction set.
ELTE Multiprocessor
Architecture
A symmetrical multiprocessor architecture that will be provided
on the base stations that Ericsson AB is developing for the
standard LTE.
EMPA Compiler A compiler used to compile implementations that are to run on
EMPA.
EMPA Simulator A simulator that simulates all the capabilities of EMPA on an
ordinary UNIX computer.
entry function The first function of a JOB to be invoked on any job DSP that
participates in executing that JOB.
Ericsson AB A company that among other things provides telecommunication
equipment.
Ericsson LTE A division of Ericsson AB that is developing base stations that
utilize the standard LTE.
JOB A job that is dispatched by a dispatch DSP to be executed by
some job DSPs on EMPA. It specifies the functions and variables
that are to be used by the job DSPs.
job DSP A DSP that has been dedicated to executing JOBs that are
dispatched by dispatch DSPs.
LDM instruction A DSP instruction that is executed by a DSP to read 8, 16 or 32
bits from its LDM to one of its temporary registers, or to write in
the opposite direction.
Local Data Memory The local memory of a DSP that is used to store data that the
DSP is to read or modify.
Local Program
Memory
A local memory of a DSP that is used to store instructions that
the DSP is to execute.
Long Term Evolution A standard in mobile communication that is being specified by
3GPP.
m0, m1, m2, … These denote the offset registers of a DSP.
memory instruction A microprocessor instruction that either reads or writes data to a
memory component.
move instruction A DSP instruction that copies the contents of one temporary
register part to another.
offset register A register that can be used by an LDM instruction to modify the
value of the address register that the instruction is using.
parallelized
implementation
An implementation that is executed in parallel on more than one
processor.
processing algorithm An algorithm that specifies how a processing step can be
performed.
processing step A definition of what must be done to some input in order to
produce a valid output for that input. The processing step
therefore also specifies what a valid input is.
r0, r1, r2, … These denote the memory registers of a DSP.
Rate dematching A processing step that is to execute on a Base station that
implements the standard LTE.
Rate matching One of the processing steps of ULTC.
repetition This word has a special meaning for the processing steps Rate
matching and Rate dematching. In the former, it denotes a
subarray of the resulting bit array e that originates from one of
the matrices X, Y or Z. In the latter, it denotes a subarray of the
input byte array E that is to be soft combined with one of the
matrices U, V or W.
SatFunc A function that calculates the sum of two integer bytes. If the
sum is lower than -128 or greater than 127 then the result is
adjusted to the closest boundary.
Section A number of consecutive rows of the matrix in Channel
deinterleaver that one job DSP has been designated to process.
Segment A number of consecutive rows in a section that a job DSP can fit
in its LDM. The DSP processes its section one segment at a time.
The word also denotes the part of its LDM that a DSP has
allocated to store segments of its section.
soft combining The act of calculating the sum of two integer bytes. If the result
is smaller than -128 or larger than 127 then it must be adjusted
to the closest boundary. This is done in Rate dematching by
using the function SatFunc.
soft value An integer in the range -128 to 127. It denotes the probable
value of one bit that a base station has received from a mobile
device. A high or low soft value indicates that the received bit's
value is 1 or 0, respectively. A soft value close to zero indicates
that the bit's value is ambiguous.
temporary register A type of registers in a DSP that can be used by most DSP
instructions to retrieve data from or store results to.
temporary register part One of two parts of a temporary register. This is the register’s
either lowest 16 bits or next higher 16 bits.
Third Generation
Partnership Project
An organization tasked with developing and maintaining
specifications for the 3G and GSM standards.
Uplink Transport
Channel
This is specified by the standard LTE. It is a set of processing
steps that must be applied to data in a specific order before it is
transmitted from a mobile device to a base station.
Word This denotes as always a sequence of bits in the context of some
computer hardware or one of its memory components. How long
that sequence is depends on the word size of the hardware or
memory. The word size of EMPA’s CM and its DSPs’ LDM and
LPM is 16 bits, and so the expression denotes a sequence of 16
bits.
worker id A non-negative integer value that is given to each job DSP that
participates in executing one dispatched JOB. The value is
unique among those job DSPs.
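The SatFunc and soft combining entries above amount to a saturating addition of two signed bytes. A minimal C sketch follows; the name sat_func and the types are an illustration, not code from the thesis.

```c
#include <stdint.h>

/* Sketch of SatFunc as described in Table 3: add two soft values and
 * saturate the sum to the soft-value range [-128, 127]. */
static int8_t sat_func(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;   /* widen so the sum can not overflow */
    if (sum > 127)
        sum = 127;
    if (sum < -128)
        sum = -128;
    return (int8_t)sum;
}
```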
2 EMPA
EMPA is the symmetrical multiprocessor architecture provided on ELTE's base
stations. It is used to execute implementations that need to run on the base station, such as
implementations that apply processing steps to received transmissions as described by
Paragraph 1.1.5. EMPA can execute parallelized implementations.
How an implementation is executed on EMPA will be explained in Paragraph 3.2,
while this chapter is concerned with presenting the components of EMPA that are
relevant to the thesis and explaining how they are connected. These components are the
Digital Signal Processors, the Common Memory and the Command Bus. However some
of them have properties that can not be publicly revealed due to the reasons stated in
Paragraph 1.3.2. Figure 3 gives a simple overview of EMPA and shows how the relevant
components are connected. The chapter’s source is [11].
The word size of all of EMPA's relevant memory components is 16 bits. The individual
memory components will be described in this chapter, but because of this, throughout this
report the term word refers to a sequence of 16 bits. Also, the phrase memory instruction
refers to a processor instruction that either reads or writes data to the memory in question.
There are therefore Local Data Memory instructions as well as Common Memory
instructions, which will be frequently referred to throughout this report. The two expressions
will be further explained in this chapter.
Figure 3: An overview of EMPA and its relevant units: multiple Digital Signal Processors connected to the
Common Memory over a data bus and to one another over the Command Bus. A dashed line means there are
multiple instances of the adjacent component.
2.1 The Digital Signal Processors
The Digital Signal Processors (DSP) are the microprocessors of EMPA that are used to
execute implementations. A parallelized implementation is executed on EMPA by
executing it on multiple DSPs in parallel.
The DSP must be well understood if one is to write efficient implementations.4 In fact
this component of EMPA was the most important to the thesis, and it will be extensively
referred to in this report. Therefore a large part of this chapter has been dedicated to
discussing the DSP's components and capabilities. The topics that need to be covered for the
purpose of the thesis are (i) the DSP's local memories, (ii) the registers used by the
DSP’s instructions, (iii) the DSP’s capabilities for executing instructions and how some
specific instructions work and (iv) the DSP’s hardware support for executing loops more
quickly. These topics will be covered by the following subparagraphs. Figure 4 gives an
overview of the DSP and those components of it that are relevant to the thesis. The reader
can refer back to it as he reads on. Throughout the report the phrase DSP instruction
refers to any instruction from the DSP’s instruction set.
Figure 4: An overview of the DSP and its relevant units: the local program and data memories, the memory
control unit, the registers (temporary, address and offset registers), the program control unit and the
computational units.
2.1.1 The DSP’s LDM and LPM
Each DSP has a Local Data Memory (LDM) and a Local Program Memory (LPM).
LPM, as any program memory, is used to store instructions that the DSP is to execute,
while LDM is used to store data that is to be read from or written to. The word size of
each memory is 16 bits. The memories follow the big endian principle. A portion of both
memories is unavailable for memory allocations. For the thesis it is relevant to know
that at least # LDM_avail KB of LDM are available for use.
4 Sometimes the phrase "the DSP" is used to refer to any DSP of EMPA, but it can of course also be used
to refer to a specific DSP. Which is the case will be obvious from the context in which the phrase is used.
2.1.2 The DSP’s registers
The DSP has a large set of registers. Those of them that are relevant to the thesis are
described in the following subparagraphs.
2.1.2.1 Temporary registers
The temporary registers are used by almost every DSP instruction to read values from
and to store results in. Throughout this report they will be denoted by a0,a1,a2,…. Each
register has two temporary register parts. To clarify, a0 will be taken as an example. The
16 least significant bits of it are in the lower register part and it is denoted by a0l. The
second register part contains the next 16 bits of a0 and it is denoted by a0h. So the two
register parts span over a0’s 32 least significant bits. Many DSP instructions use these
parts as if they are individual registers in their own right. For instance, the assembler
instruction mv a0l,a1l copies a0l’s value to a1l.
2.1.2.2 Address registers
Throughout this report an LDM instruction is a DSP instruction that writes data from a
temporary register to LDM, or it reads data in the opposite direction. Also the DSP’s
address registers will be denoted by r0,r1,r2,…. An address register is used to store the
address to a byte in LDM. An LDM instruction then uses the register to read or write to
its address. This will be further explained by Paragraph 2.1.3.3 that covers LDM
instructions.
2.1.2.3 Offset registers
Throughout this report the DSP’s offset registers will be denoted by m0,m1,m2,…. An
offset register is used by certain DSP instructions to modify an address register’s value.
Each offset register can be used to modify only certain address registers. When the report
shows examples of a DSP instruction that uses an offset register for this purpose, it will
be shown which register is being used and which address register is being modified.
2.1.3 Instruction execution on the DSP
What needs to be highlighted regarding instruction execution on the DSP is (i) its
instruction pipeline, (ii) how the DSP can execute multiple instructions in parallel and (iii)
what capabilities LDM instructions have.
2.1.3.1 The DSP’s instruction pipeline
The DSP has a very short instruction pipeline consisting of only three steps, namely
fetch, decode and execute. Most DSP instructions complete each step in one cycle.
Because of this, when the DSP is in the process of executing a series of such instructions,
one of them finishes the pipeline each cycle. Therefore the effective execution time of
each of these instructions is one cycle.
2.1.3.2 Executing parallel instructions on the DSP
The DSP can execute up to # nrINSTRPAR _ instructions in parallel. This is because it
has multiple units that execute instructions. The instructions pass through each step of the
instruction pipeline simultaneously. What instructions are to be performed in parallel
must be specified for the DSP. Certain limitations apply when instructions are to be
executed in parallel. Those that are relevant to the thesis are:
Up to two LDM instructions can be executed in parallel.
Up to 1__ nrLDMWITHPAR instructions can be executed in parallel with
two LDM instructions.
Some instructions can not be executed in parallel with others.
To specify the execution time required to perform parallel instructions, suppose there is
some set of instructions, and that the one of them with the longest individual execution time
requires #c cycles. This means that if that instruction is not executed in parallel with other
instructions, then it requires #c cycles. If the mentioned set of instructions is
executed in parallel, #c cycles are required to finish the whole task. This means that when the
DSP is executing some instructions in parallel, all of them must finish before the DSP can
move on to the next set of parallel instructions.
Parallel instructions may not write to the same register, but one of them may write to a
register that the others read from. The result of this is that the register is first read from by
the other instructions and then written to. This does not prolong the execution time of the
parallel instructions.
To clarify the statements, an example with assembler instructions is in order. The syntax
that is used for parallel instructions will also be presented.
mv 0,a1h | copy a1,a2
is two instructions performed in parallel. The former clears a1h while the latter copies a1
to a2. The result equals that of first executing the latter instruction and then the former.
The execution time is the same as for the parallel instructions
mv 0,a1h | copy a2,a3
Note how in this example the two instructions do not use the same register. Also pay
attention to the syntax used for parallel instructions as this will be used in the report.
Specifically,
“task A | task B | task C”
means three parallel instructions perform the tasks A, B and C.
There is no way to specify that some instructions are to be performed in parallel when
writing C code. This is completely up to the compiler EMC when it produces the
resulting assembler code from the C code.
Throughout this report it is important to be mindful of the difference between several
DSPs that execute one parallelized implementation on EMPA, and one DSP that executes
several instructions in parallel.
2.1.3.3 LDM instructions and their capabilities
An LDM instruction can (read or) write 8, 16 or 32 bits (from or) to an address of LDM
specified by an address register. The instruction writes data from a temporary register to
the memory (or it reads in the opposite direction). If the instruction reads or writes 8 or
16 bits then it uses a temporary register part. On the other hand if the instruction reads or
writes 32 bits, then it uses a complete temporary register. In any case the data is read
from or written to a number of bytes in LDM, where the address register being used
points out the first byte.
LDM instructions will be referred to extensively throughout the report, and so some
assembler instructions will be shown as examples. st a0,*r0 writes 32 bits from a0 to
four bytes in LDM, where r0 addresses the first. ld *r0,a1 reads four bytes from LDM
where r0 addresses the first. On the other hand st a0l,*r0 writes only two bytes, while
ld *r0,a1h reads two bytes. This becomes obvious by the fact that the latter two
instructions use temporary register parts instead of complete temporary registers.
However LDM instructions that read or write 8 bits use temporary register parts too. It
must then be further specified that only 8 bits are being handled. Therefore stb a0l,*r0 and
ldb *r0,a0h write and read one byte, respectively.
An LDM instruction takes up to # LDM_TIME_cycles cycles to execute. Its execution
time is irrespective of the amount of data that it handles. However, when two LDM
instructions are executed in parallel there may be a conflict in the hardware resources used,
which is imposed by the architecture of the DSP. In such a case one of the two instructions
will have its execution time prolonged by # LDM_DELAY_cycles cycles. But as stated in
Paragraph 2.1.3.2 the DSP must finish executing all instructions that it is currently
executing in parallel before it begins with the next set of parallel instructions. An
example is in order to explain what the consequence of this may be. The following shows
two LDM instructions executed in parallel with some other instruction.
ld *r0, a1 | ld *r1, a2 | SOME_INSTR
If there is a conflict in resource usage between the two LDM instructions, then one of them
will have its execution time prolonged by # LDM_DELAY_cycles cycles. If its resulting
execution time c is greater than SOME_INSTR's execution time, then the DSP requires #c
cycles to execute the set of parallel instructions. However if SOME_INSTR requires at least
#c cycles to execute, then the conflict in resource usage has no effect on execution time.
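The timing rule above can be sketched as a small C function. The parameter names stand in for the concealed LDM timing and delay constants and for SOME_INSTR's execution time; the function itself is only an illustration, not code from the thesis.

```c
/* Illustration only: effective cycle count of a set of parallel
 * instructions containing two LDM instructions and one other
 * instruction. ldm_cycles and ldm_delay stand for the concealed LDM
 * timing constants; conflict is nonzero if the two LDM instructions
 * compete for the same hardware resource. */
static int parallel_set_cycles(int ldm_cycles, int ldm_delay,
                               int other_cycles, int conflict)
{
    int delayed = ldm_cycles + (conflict ? ldm_delay : 0);
    /* the whole set finishes only when its slowest member finishes */
    return delayed > other_cycles ? delayed : other_cycles;
}
```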
What makes an LDM instruction powerful is that it can also add or subtract a certain
offset to the address register. This is done after the instruction has read or written data to
the memory, but it does not prolong the execution time of the instruction. It is important
to understand that handling the data and modifying the address register are performed by
one LDM instruction, not by two parallel instructions. The offset must be stored in the
address register's offset register, and the latter is then used to modify the address register
(Paragraph 2.1.2.3 explains offset registers).
To clarify with some examples, st a0,*r0++m0 writes 32 bits to the address specified by
r0, and then increments the address register by the value of m0. ld *r0--m0,a0l reads 16
bits from the memory, and then decrements the address register. Offset registers can in
the same manner be used in conjunction with the instructions ldb and stb that handle
only 8 bits.
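At the C level, the access pattern that these post-modifying instructions support looks as follows; the pointer plays the role of an address register such as r0 and the stride parameter that of an offset register such as m0. The function is a hypothetical sketch, not code from the thesis.

```c
/* Hypothetical sketch: store n 16-bit values with a fixed stride.
 * Each store advances the destination pointer as a side effect, the
 * pattern that a single "st a0l,*r0++m0" instruction performs. */
static void strided_store(short *dst, const short *src, int n, int stride)
{
    for (int i = 0; i < n; i++) {
        *dst = src[i];   /* write 16 bits to the current address */
        dst += stride;   /* post-modify the address by the offset */
    }
}
```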
2.1.4 Hardware support for executing loops on the DSP
This report will extensively discuss efficient execution of loops. For this purpose, a loop's
overhead refers to the execution of every instruction in the loop that (i) modifies any
counter that keeps track of how many iterations of the loop have been executed, (ii)
compares any such counter with another value to determine if the loop should stop
iterating, or (iii) is a conditional branch instruction that either continues execution in
the next iteration or finishes the loop.
The DSP has hardware support to execute a loop without any overhead, but it is
required that the number of times that the loop iterates is known before it is initiated. The
hardware then keeps track of the number of times the loop has iterated. When the last set
of parallel instructions has been executed in the end of one iteration, the DSP continues
executing the first set of parallel instructions in the beginning of the next iteration. This is
done without any delay.
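To illustrate the definition of overhead, the following C loop marks which of its operations count as overhead in the above sense; on the DSP, hardware loop support with a known iteration count removes exactly these, leaving only the body. The function is an illustration, not from the thesis.

```c
/* Sum n 16-bit values. The commented operations are the loop's
 * overhead as defined above; the addition in the body is the only
 * useful work. */
static int sum_loop(const short *a, int n)
{
    int s = 0;
    int i = 0;              /* counter that tracks iterations       */
    while (i < n) {         /* (ii) compare, (iii) conditional jump */
        s += a[i];          /* useful work                          */
        i++;                /* (i) counter update                   */
    }
    return s;
}
```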
2.2 The Common Memory
The Common Memory (CM) is a shared memory that any DSP can access. Its size is
# CM_size MB, though a portion of it is not available for memory allocations. Its word size
is 16 bits, and it follows the big endian principle.
Throughout this report, a CM instruction is an instruction that the DSP executes to write
data from its LDM to CM (or it reads data in the opposite direction). Be mindful of the
difference that an LDM instruction transfers data between a temporary register and LDM,
while a CM instruction transfers data between LDM and CM.
A CM instruction writes an array of sequential words from LDM to CM (or it reads an
array of sequential words in the opposite direction). Depending on whether data is to be read
from or written to CM, the instruction takes at least # CM_read_lat or # CM_write_lat cycles to
initiate, respectively. It may take longer to initiate due to conflicts in the hardware resources
used when several DSPs access CM simultaneously. This is due to the architecture of
EMPA. Once a CM instruction has been initiated it can transfer up to # CM_speed words per
cycle. Again, at some cycles the number of words may be smaller due to conflicts in
resource usage. However, the speed is irrespective of whether data is being read from or
written to CM.
A byte can not be addressed in CM. If a CM instruction is to read from or write to a
byte in CM, then this must be done to the word that the byte is located in. Because of this
a CM instruction can only read from or write to an array of complete words in CM, and
not an arbitrary array of bytes. If the instruction is to handle an array of bytes in CM that
begins or ends at an odd byte address, then the array must first be expanded so that it
encompasses complete words. Thus the CM instruction must read from or write to up to
two bytes at the ends of the array that actually are not of interest.
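The expansion to complete words can be sketched as follows. This is a hypothetical helper (not from the thesis) that, given a byte range in CM, computes the smallest enclosing range of 16-bit words together with the unwanted padding bytes at either end:

```python
WORD_BYTES = 2  # CM word size is 16 bits

def enclosing_word_range(start_byte, num_bytes):
    """Expand a byte range to the smallest enclosing range of complete words.

    Returns (first_word, num_words, lead_pad, tail_pad), where lead_pad and
    tail_pad are the bytes at the ends that are transferred but not of interest.
    """
    assert num_bytes > 0
    first_word = start_byte // WORD_BYTES
    last_word = (start_byte + num_bytes - 1) // WORD_BYTES
    num_words = last_word - first_word + 1
    lead_pad = start_byte - first_word * WORD_BYTES
    tail_pad = (last_word + 1) * WORD_BYTES - (start_byte + num_bytes)
    return first_word, num_words, lead_pad, tail_pad
```

For example, a five-byte array starting at byte 3 must be transferred as three words, with one extra byte at the front. Note that lead_pad + tail_pad never exceeds 2, matching the "up to two bytes" remark above.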
2.3 The Command Bus
The Command Bus (CB) allows a pair of processes running on two DSPs to
communicate with one another via short messages. A message can carry a payload of
#MSG_PAYLOAD bits. Additionally, the message specifies the id of the receiving
process and a signal number. A process that sends a message specifies these two values.
A receiving process declares interest in receiving a message with a certain signal number.
The execution of the receiving process is halted until it receives a message with that
signal number.
The procedure, from when a process sends a message until the receiving process may
resume its work, takes on average #MSG_TIME cycles to complete.
3 Execution methodologies used on EMPA
There are rules and execution methodologies that an implementation that is to run on
EMPA must comply with. Those of them that are relevant to the thesis are presented in this
chapter. What is relevant is (i) how DSPs can synchronize their work by using locks, (ii)
how the execution of a parallelized implementation is dispatched to be performed by
multiple DSPs, and (iii) how the DSPs should perform barrier synchronization.
3.1 Locks
There are locks that can be used for synchronizing the work of the DSPs. A lock must
of course be accessible by all DSPs and so it uses a variable in CM. Because of this the
memory latencies of #CM_read_lat and #CM_write_lat cycles apply when reading and writing
to the lock, respectively (see Paragraph 2.2).
These locks may differ from how the reader would expect a lock to function. First of all,
when locking or unlocking a lock, what matters is which DSP performs the operation, not
which process running on the DSP requested it. This means that
when a lock is locked it is the DSP that becomes its holder, and the process that
requested the operation may very well completely finish its execution on the DSP. The
DSP will still be the holder of the lock. Its unlocking may very well be requested by a
different process that executes on the DSP at a later time. However, confusion still does
not arise because a process must allocate a lock before it can be used. Therefore that
process is the only one that “has knowledge” of the lock, unless it shares this information
with other processes in some manner.
There is another important way in which these locks differ from the ordinary notion of
a lock. Of course, if a DSP has locked a lock, then other DSPs that request it to be locked
will be put on hold. They must wait until it is unlocked, at which point one of the waiting
DSPs may proceed and becomes the holder of the lock. This is as expected. However, a
lock may be unlocked by any DSP, and not only by the one that is its holder.
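In this respect the locks behave like binary semaphores rather than mutexes. A loose analogy can be sketched in Python (an illustration only, not EMPA's API), using threading.Semaphore(1), which like these locks can be released by a thread other than the one that acquired it:

```python
import threading

# A binary semaphore mirrors the lock semantics described above:
# "locking" is acquire, "unlocking" is release, and the release may be
# performed by a different thread ("DSP") than the one that acquired it.
lock = threading.Semaphore(1)

lock.acquire()            # "DSP A" locks the lock

def other_dsp_unlocks():
    lock.release()        # "DSP B" unlocks it, even though it is not the holder

t = threading.Thread(target=other_dsp_unlocks)
t.start()
t.join()

# The lock is free again, so a non-blocking acquire succeeds.
acquired_again = lock.acquire(blocking=False)
```

An ownership-checked mutex would reject the release from "DSP B"; the semaphore, like the EMPA locks, allows it.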
3.2 To dispatch a JOB
When an implementation is requested to run on EMPA, a master function is invoked on
a DSP. The implementation’s input is given to the function. Most commonly the
input is very large, in which case it is placed at a location in CM and a pointer to that
location is provided to the function. The objective of the function is to dispatch the
implementation on a chosen number of other DSPs that will execute it. If more than one
DSP is used for this purpose then it is a parallelized implementation.
This procedure is called to dispatch a JOB.5 The DSP that dispatches it is referred to as
a dispatch DSP, while the DSPs that the JOB is dispatched to are called job DSPs. Each
DSP of EMPA has been dedicated to one of the two purposes.
To clarify, say that a processing step is to be applied to some data. The data is placed in
CM and an implementation of the processing step begins to execute when the master
function is invoked on a dispatch DSP. If it is a parallelized implementation, then the
function dispatches a JOB to multiple job DSPs. Those DSPs will cooperate to perform
the processing step over the data.
A JOB defines all the functions and variables that are to be used by a job DSP that is to
perform it. These have been placed in CM even before the master function was invoked.
They will be uploaded to the job DSP’s LPM and LDM when it begins to execute it.
However, the definition of a JOB includes neither the number of job DSPs that are to
execute it, nor the input that is provided for the master function. These may differ from
one dispatch of the JOB to the next. Also, a JOB is dispatched with one of three
possible priorities. What meaning the priorities have will be explained shortly, but the
reader must understand that the same JOB can be dispatched with different priorities.
Every JOB has an entry function, which is the first function to be invoked on the job
DSPs. The function can have two 32-bit variables as its input. If this is not enough for the
part of the input that the job DSP is to process, then one of these variables may be a
pointer to a location in CM where the input is stored. The entry function can not
explicitly return a value the way an ordinary function does. Its output is to either be stored in CM
at a location specified by the entry function’s input or sent back to the dispatch DSP via
CB.
Figure 5 illustrates what happens when a dispatch DSP dispatches a JOB, while Figure
6 shows how a job DSP cycles through executing one JOB after another. The reader
should review them as he reads on. When the master function performs the dispatch it
specifies (i) the input of the entry function, (ii) the JOB’s priority and (iii) the number of
job DSPs that are to execute the JOB. Upon dispatch, those job DSPs that announce
themselves as available receive a message over CB that instructs them to execute the
JOB. If there are not enough available DSPs, then the JOB is also placed in a queue in
CM. It is important to note that those DSPs that were available begin executing the JOB,
even if they are not sufficiently many. It is specified in the queue how many more DSPs
are required. Of course, if there are no available DSPs, then the JOB is only placed in the
queue.
There are three queues, one for each priority that a JOB may have. Each queue follows
the first in first out principle. When any job DSP completes executing its current JOB, it
reads the queues in order of priority. It picks a JOB from the first non-empty queue. If
there are no JOBs at all, then the DSP announces itself as available and awaits a message
5 Capital letters are used for this name so that the ordinary word “job” can be used freely without
confusion.
over CB as previously described. Note that completing a JOB does not make a DSP
announce itself as available, since if it finds a JOB in a queue then it begins executing it.
It is only if the queues are empty that it becomes available.
[Figure 5 flowchart: A dispatch DSP dispatches a JOB that is to execute on #N job
DSPs. If #N or more job DSPs announce themselves as available, #N of them are
instructed via CB to begin executing the JOB. If only 0 < X < N job DSPs announce
themselves as available, those #X job DSPs are instructed via CB to begin executing the
JOB, and the JOB is placed on a queue with a request that N−X more job DSPs
participate in executing it. If no job DSPs are available, the JOB is placed on a queue
with a request that #N job DSPs execute it.]
Figure 5: The figure shows how a JOB is dispatched by a dispatch DSP.
[Figure 6 flowchart: When a job DSP finishes executing its current JOB, it checks
whether there are any JOBs in the queue. If so, the JOB with the highest priority is
chosen; the job DSP uploads the JOB’s functions and data to its local memories and
begins executing it, all the while announcing itself as unavailable. If the queue is empty,
the job DSP announces itself as available and awaits a message via CB; once a dispatch
DSP instructs it via CB to execute a JOB, it uploads that JOB’s functions and data and
begins executing it.]
Figure 6: The figure shows how a job DSP cycles through executing JOBs, and it shows when the DSP is
announcing itself as available.
If the input to the entry function is to differ among the job DSPs, then multiple
separate dispatches of the same JOB are required. Compare this to dispatching it once,
which requires only one message over CB and is performed much quicker. But even if
the JOB is dispatched only once, each job DSP that participates in it is provided with a
worker id. This is a non-negative integer value that is unique among the job DSPs that are
participating. Each of them can use it to calculate what part of the JOB it has been
designated to perform.
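A common way to derive a work partition from the worker id is a balanced contiguous split. The following is a hypothetical helper (not from the thesis) sketching how a job DSP could map its worker id to a slice of an N-element input:

```python
def worker_slice(worker_id, num_workers, n):
    """Return the half-open range [start, end) of elements that the job DSP
    with the given worker id should process. Every element is covered exactly
    once, and slice sizes differ by at most one."""
    start = (n * worker_id) // num_workers
    end = (n * (worker_id + 1)) // num_workers
    return start, end
```

For three job DSPs over ten elements, the slices are [0, 3), [3, 6) and [6, 10).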
Once the master function has dispatched its JOB, it must await a message from the job
DSPs confirming that the JOB has been completed. Previously, each job DSP sent a
message over CB once it had completed executing. The master function would keep track
of the number of received messages and realize when all of the job DSPs were done. A
problem with this method is that it generates a lot of undue stress on the dispatch DSP.
The solution that every implementation is now required to use is that only the last
job DSP that completes executing the JOB sends a message. A lock is used to keep track
of this.
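The completion protocol can be sketched with Python threads standing in for job DSPs (an illustrative model, not EMPA code): a counter protected by a lock tracks how many job DSPs are still running, and only the last one to finish "sends the message" to the dispatch DSP.

```python
import threading

NUM_JOB_DSPS = 4
remaining = NUM_JOB_DSPS
counter_lock = threading.Lock()
messages_sent = []  # stands in for messages over CB

def job_dsp(worker_id):
    global remaining
    # ... the actual work of the JOB would happen here ...
    with counter_lock:
        remaining -= 1
        if remaining == 0:
            # Only the last job DSP to finish notifies the dispatch DSP.
            messages_sent.append("JOB done")

threads = [threading.Thread(target=job_dsp, args=(w,)) for w in range(NUM_JOB_DSPS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However many job DSPs participate, exactly one message is sent, which is what relieves the dispatch DSP.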
It is strongly requested that the dispatch DSP (where the master function runs) takes
part in the actual work of the JOB as little as possible. Execution time on dispatch DSPs
is solely to be used to perform dispatches and for overall maintenance work required by
EMPA. In fact, the thread running the master function on the dispatch DSP will most
likely be switched out once the JOB has been dispatched and the master function begins
to wait for a message.
3.3 Memory allocation
Both job and dispatch DSPs can allocate memory in CM. But only a dispatch DSP can
dynamically allocate memory in its LDM. Functions that are to run on job DSPs must
specify all the memory that they require at compile time. For simplicity, when the
report describes some work that a job DSP performs it may still be stated that it “allocates
memory in its LDM”. The reader should understand that this means
that the memory has been reserved at compile time.
3.4 Barrier synchronization between the job DSPs
It is possible for a set of job DSPs that require barrier synchronization to perform it by
using locks. During the thesis an algorithm was written for this purpose that uses three
locks to enforce mutual exclusion and to perform the actual barrier synchronization.
However, it was of no use, because it is not acceptable that a set of job DSPs use explicit
barriers to synchronize their work. Recall from Paragraph 3.2 how JOBs are
dispatched to the job DSPs. Say that a JOB is requested to execute on three DSPs, but
only two are available at the time of the dispatch. They begin to execute the JOB, and the
last DSP may begin much later. If the first two synchronize at a barrier, then they must
wait until the last one initiates the JOB and arrives at the same barrier.
An unacceptably large portion of execution time on all job DSPs is wasted if barriers are
widely used. In fact, it is theoretically possible that all job DSPs of EMPA fall into a
deadlock because each of them is waiting at a barrier in a JOB that has not been fully
dispatched to the required number of job DSPs.
If an algorithm requires barrier synchronization at a point in its calculations, then it is
not implemented as one JOB using an explicit barrier. Instead it is implemented as two
JOBs. The first JOB ends where the barrier synchronization takes place in the algorithm,
and the next JOB continues from that location. When the job DSPs have finished the first
JOB and the master function is notified, it proceeds by dispatching the second JOB. Each
DSP can finish executing the first JOB at any time irrespective of the other DSPs. The
disadvantage of this solution is that another costly dispatch must be performed, thus
prolonging the execution time of the algorithm’s implementation.
4 Definition of an optimal loop
Efficient execution of loops will be extensively discussed in this thesis. Therefore some
crucial definitions are in order. First, a critical loop is a loop that (i) does not contain any
loops and (ii) accounts for a large part of the implementation’s execution time.
The latter criterion is a bit subjective, but nevertheless the definition suffices for this
report. Obviously, optimizing critical loops can be a great way to reduce the
implementation’s overall execution time.
The next definition is a bit bold, namely whether a loop that a DSP is to execute is an
optimal loop or not. To answer this, one must first know (i) what the task of the loop is
and (ii) how the capabilities of the DSP limit that task. If the limit is met and the
capabilities of the DSP are thus optimally used by the loop, then it is said to be optimal.
This definition may seem subjective, but what is required is a precise definition of the
loop’s task and a complete understanding of the limitations of the DSP that apply to the
task.6
An example is in order. Say that a loop is tasked with adding 8 bit integers from two
arrays and storing the results in a third array. Specifically, there are three byte arrays X, Y
and Z in LDM with n bytes each. Z[i] = X[i] + Y[i] is to be calculated for every
0 ≤ i ≤ n − 1, and it is so that −128 ≤ X[i] + Y[i] ≤ 127.
The task of the loop has been defined. Now one must obtain a complete understanding
of the DSP’s capabilities that apply to the loop in order to make it optimal. First and
foremost, the loop’s overhead must be eliminated by using the DSP’s hardware support
for executing loops (see Paragraph 2.1.4 for a description of the hardware support).
Second, each LDM instruction must read or write as many bits as possible (namely 32
bits) to LDM and modify its address register if necessary (see Paragraph 2.1.3.3 for the
capabilities of LDM instructions). Separate instructions may not be used to update the
address registers. Last but not least, a careful review is required of which DSP instructions
can perform additions. Carefully reviewing [11] shows that the instruction add4 is very
suitable. add4 a0,a1,a2 treats the four lower bytes of each of the three temporary
registers as individual integers. The lowest byte of a0 is added to the lowest byte of a1 and the
result is stored in the lowest byte of a2, and so on. The instruction performs four additions in
this manner. Two add4 instructions can be performed in parallel, but only one add4
instruction can be performed in parallel with two LDM instructions. It is also guaranteed
that an add4 instruction will never require a longer execution time than an LDM
instruction.
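The lane-wise behaviour of add4 can be modelled as follows. This is an illustrative Python model only (the real instruction operates on the DSP's 32-bit temporary registers); it packs four signed bytes into a word and adds the lanes independently:

```python
def pack4(bytes4):
    """Pack four signed bytes (lowest lane first) into one 32-bit word."""
    word = 0
    for lane, b in enumerate(bytes4):
        word |= (b & 0xFF) << (8 * lane)
    return word

def add4(word_a, word_b):
    """Model of add4: add the four byte lanes of two words independently."""
    result = 0
    for lane in range(4):
        a = (word_a >> (8 * lane)) & 0xFF
        b = (word_b >> (8 * lane)) & 0xFF
        result |= ((a + b) & 0xFF) << (8 * lane)
    return result

def unpack4(word):
    """Unpack a 32-bit word into four signed bytes (lowest lane first)."""
    out = []
    for lane in range(4):
        b = (word >> (8 * lane)) & 0xFF
        out.append(b - 256 if b >= 128 else b)
    return out
```

Since the task guarantees −128 ≤ X[i] + Y[i] ≤ 127, no lane ever overflows its byte.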
In light of this, one might be tempted to think that the loop shown in Algorithm 1
performs the task optimally. It calculates eight integers of Z per iteration, so it must be
followed by another loop that calculates any remaining integers at the end of Z. But there
can not be more than seven remaining, and so the second loop has been omitted.
6 Throughout this paragraph, one should consider that a capability can also be regarded as a limitation.
Algorithm 1: A non-optimal loop for the task.
//The loop calculates eight bytes of Z in each iteration.
//X, Y and Z are in LDM.
//The variable i is only used for clarification,
//and no instruction is used to update it.
//The LDM instructions update their address registers accordingly.
//The loop has no overhead since the DSP’s hardware support is used.
//i = 0;
loop {
//Two parallel LDM instructions.
LDM instruction that reads X[i...i+3] to a0 |
LDM instruction that reads X[i+4...i+7] to a2;
//Two parallel LDM instructions.
LDM instruction that reads Y[i...i+3] to a1 |
LDM instruction that reads Y[i+4...i+7] to a3;
//Two parallel add4 instructions.
add4 a0, a1, a4 | add4 a2, a3, a5;
//Two parallel LDM instructions.
LDM instruction that writes a4 to Z[i...i+3] |
LDM instruction that writes a5 to Z[i+4...i+7];
//i += 8;
}
This loop is not optimal. It is true that every LDM instruction transfers the maximal
possible amount of data, which is 32 bits. Also, LDM instructions are performed in
parallel pair-wise, which is the maximal number of LDM instructions that can be
executed in parallel. Paragraph 2.1.3.3 explains how there may be a conflict in hardware
resource usage when parallel LDM instructions are executed. This would prolong the
execution of the LDM instructions by #LDM_DELAY cycles. However, it is
guaranteed that this does not happen in the loop.
Still, the loop fails to be optimal, for the simple reason that it does not execute the add4
instructions in parallel with the LDM instructions. This remark can be stated in a more
convenient manner. No matter how the loop’s task is done, in the end a certain amount of
data must be read from LDM (specifically from X and Y), and a certain amount of data
must be written to LDM (specifically to Z). This is not being done at the maximal
possible speed, because while the add4 instructions are being executed no data is
being read from or written to LDM. Another attempt is made:
Algorithm 2: An optimal loop for the task.
read X[0...3] to a0 | read Y[0...3] to a1;
read X[4...7] to a2 | read Y[4...7] to a3;
//i = 0;
loop {
read X[i...i+3] to a0 | read Y[i...i+3] to a1 | add4 a0, a1, a4;
read X[i+4...i+7] to a2 | read Y[i+4...i+7] to a3 | add4 a2, a3, a5;
write a4 to Z[i...i+3] | write a5 to Z[i+4...i+7];
//i += 8;
}
//The loop stopped at i = j.
add4 a0, a1, a4 | add4 a2, a3, a5;
write a4 to Z[j...j+3] | write a5 to Z[j+4...j+7];
The construction requires that some of the work is performed before and after the loop,
but which instructions are performed outside of the loop does not determine whether it is
optimal or not. For this loop it is easy to make sure that there is no conflict in resource usage
when parallel LDM instructions are executed. Also recall that it is guaranteed that the
execution time of the add4 instruction is not longer than an LDM instruction’s. So no
LDM instruction will have its execution time prolonged due to a resource conflict, nor
because it must wait until a parallel add4 instruction is done.
This makes the new loop optimal. To see this, understand that no matter how the loop’s
task is performed, in the end some data must be read from and written to LDM, and the
loop spends its entire execution time transferring data between the temporary registers and
LDM at the maximal possible speed.
Loops that are tasked with (i) reading data from LDM, (ii) processing it in some way
and (iii) writing the result to LDM occur frequently in the implementations that are
written by ELTE. The critical loops of the implementations written in the thesis are also
of this type. A simple yet instructive criterion can be stated that guarantees that such a loop is
optimal. The loop must uphold the following:
- By each cycle two parallel LDM instructions must be executing.
- There may be no conflicts in hardware resource usage when executing parallel
LDM instructions.
- Each LDM instruction must read or write 32 bits to LDM.
- Other instructions that are used may only be executed in parallel with LDM
instructions.
- Other instructions may not have a longer execution time than an LDM instruction.
Throughout this report the criterion will be referred to as the criterion for optimal loop
performance. If it is met then it is guaranteed that the loop is transferring data between its
temporary registers and LDM at the maximal possible speed. It does not matter what
the loop is required to do with the data and how this is done. In the end the data must at
least be read and written and the loop spends its entire execution time to do this as fast as
possible.
Note that a loop can be optimal even if it does not meet this criterion. Its task may be
either completely different or perhaps limited by some other DSP limitation than the
amount of data that can be read or written to LDM per cycle.
5 Specification of the processing steps
The purpose of this chapter is to specify the processing steps Channel deinterleaver and
Rate dematching that are to be implemented on EMPA in the thesis. But understanding
them becomes easier by first understanding how they are related to two processing steps
that are performed on the mobile device. The sources of the chapter are [4], [1] and [9].
5.1 Overview of the Uplink Transport Channel
The standard LTE specifies an Uplink Transport Channel (ULTC) that outlines how
data must be processed before being transmitted from the mobile device to the base
station. A set of processing steps must be applied to the data in a specific order. See the
upper half of Figure 2 for a description. The data is the input of the first processing step
and its output is the input of the next step, which in turn produces an output that the next
one will receive, and so on. The output of the last step is transmitted to the base station.
To explain each of these processing steps and what their purposes are is beyond the scope
of the thesis. Only those of them that are relevant to the thesis will be described in detail,
namely Rate matching and Channel interleaver.7
The base station must obtain the original data that was the first input of ULTC’s
processing steps. This is achieved on the base station by applying another set of
processing steps in a specific order to the received transmission. Figure 2 clarifies this.
Some of the base station’s steps are directly related to some of ULTC’s. Two examples
are Rate dematching and Channel deinterleaver that are to run on the base station and
were implemented in the thesis. They are directly related to the two steps of ULTC that
were mentioned above. Understanding the two processing steps that were implemented in
the thesis becomes easier by understanding their two counterparts in ULTC. That is why
all four processing steps will be specified in this chapter.
There is another subtle difference between ULTC’s and the base station’s processing
steps. When the base station has received the transmission it can not be certain whether the bits
that the mobile device transmitted have changed due to radio interference. For this
purpose, a method is used in mobile communication technology for estimating the
probability that a received bit has a certain value. Let P0(b) and P1(b) denote the
probabilities that a received bit b is zero or one, respectively. The following value
rounded to the closest integer is called a soft value:
7 Note that these are not the processing steps that were implemented in the thesis. Read further to
understand.
127, if log(P1(b)/P0(b)) ≥ 127
−128, if log(P1(b)/P0(b)) ≤ −128
log(P1(b)/P0(b)), otherwise
Formula 1: The value rounded to an integer is a soft value.
A soft value close to 127 or −128 says that the intended value of the bit is likely one or
zero, respectively. But if the soft value is close to zero then the bit’s value is ambiguous.
When the base station receives a transmission, it determines a soft value for each received
bit. Then it applies its processing steps to the soft values. Thus ULTC’s steps work with
bits, while the base station’s steps work with bytes.
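Formula 1 can be sketched as follows. This assumes the reconstructed reading above of the garbled original (a log-likelihood ratio, rounded and saturated to the signed byte range; the base of the logarithm is taken to be natural):

```python
import math

def soft_value(p1, p0):
    """Soft value of a received bit b, given P1(b) = p1 and P0(b) = p0."""
    llr = math.log(p1 / p0)
    if llr >= 127:
        return 127
    if llr <= -128:
        return -128
    return round(llr)
```

A bit that is equally likely to be zero or one (p1 = p0) gets the ambiguous soft value 0, while overwhelming evidence saturates at 127 or −128.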
5.2 Channel deinterleaver
In this paragraph Channel interleaver, which is one of ULTC’s processing steps, is
specified first. Channel deinterleaver, which is one of the base station’s processing steps,
is specified next.
5.2.1 Specification of Channel interleaver
The processing step receives three arrays of so called symbols. A symbol is #M bits
long, where M is also specified by the input. The following is the input:
- An array of symbols g[0...G−1].
- An array of symbols r[0...I−1].
- An array of symbols q[0...Q−1].
- A positive integer M that equals 2, 4 or 6. It denotes a symbol’s length in bits.
- A positive integer C that equals 10 or 12. It denotes the number of columns in a
matrix that is to be used in the processing step.
The lengths G, I and Q of the arrays are also specified by the input. The following will
be the processing step’s output:
- An array of symbols s[0...G+I−1].
A matrix X is used in the processing step. Each cell of the matrix can fit a symbol. It has
#C columns and R = (G+I)/C rows. G+I is a multiple of C. R can not be greater than
1200.
The processing step will now be explained. Figure 7 shows how it works step by step
for a small example. The reader should consult the figure as he reads on.
The symbols of the array r are first placed into the matrix as Algorithm 3 shows.
Starting from the last row and moving up, the symbols are inserted into four columns
designated by the array columnSet.
Algorithm 3: The algorithm shows how the symbols of r are placed into the matrix X.
//The symbols of r[0...I-1] are to be placed into the matrix X.
//columnSet[0..3] is an array of positive integers.
//R is the number of rows in the matrix X.
columnSet = [1, 10, 7, 4];
i = 0;
j = 0;
currRow = R - 1;
while (i < I) {
currCol = columnSet[j];
X[currRow, currCol] = r[i];
i++;
j++;
j = j % 4;
if (j == 0) {
currRow--;
}
}
The symbols of g are then inserted into the free cells of X as Algorithm 4 shows. The
symbols are inserted row by row, but cells that are already occupied by one of r’s
symbols are skipped.
Algorithm 4: The algorithm shows how the symbols of g are placed into the matrix X.
//The symbols of g[0...G-1] are to be placed in the matrix X.
//R and C are X’s number of rows and columns respectively.
i = 0;
currRow = 0;
currCol = 0;
while (i < G) {
if (X[currRow, currCol] is not occupied by a symbol of r) {
X[currRow, currCol] = g[i];
i++;
}
currCol++;
if (currCol == C) {
currCol = 0;
currRow++;
}
}
Now all the matrix’s cells are occupied by symbols, because it has exactly G + I cells.
Next, some of g’s symbols in the matrix are replaced by q’s according to Algorithm 5.
Note the similarity between it and Algorithm 3. The only difference is that columnSet
designates different columns in the two algorithms.
Algorithm 5: The algorithm shows how the symbols of q are placed into the matrix X.
//The symbols of q[0...Q-1] are to be placed in the matrix X.
//columnSet[0..3] is an array of positive integers.
//R is the number of rows in the matrix X.
columnSet = [2, 9, 8, 3];
i = 0;
j = 0;
currRow = R - 1;
while (i < Q) {
currCol = columnSet[j];
X[currRow, currCol] = q[i];
i++;
j++;
j = j % 4;
if (j == 0) {
currRow--;
}
}
The processing step’s final output is an array of symbols s[0...G+I−1]. It is obtained
by reading X’s symbols column by column.
Figure 7: The processing step is demonstrated for the matrix X with C = 12 columns and R = 6 rows. The
arrays g, r and q have G = 59, I = 13 and Q = 10 symbols respectively. X is shown after each step from top
to bottom. Cells that are modified have a darker shade.
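The whole Channel interleaver step can be sketched in Python by composing Algorithms 3–5 (symbols are represented as plain list elements; the columnSet values are those given above, which assume C = 12):

```python
def channel_interleave(g, r, q, C):
    """Place r (Algorithm 3), fill with g (Algorithm 4), overwrite with q
    (Algorithm 5), then read the matrix X column by column."""
    R = (len(g) + len(r)) // C
    X = [[None] * C for _ in range(R)]
    from_r = set()
    # Algorithm 3: place r into columns [1, 10, 7, 4], bottom row upwards.
    column_set = [1, 10, 7, 4]
    curr_row = R - 1
    for i, sym in enumerate(r):
        col = column_set[i % 4]
        X[curr_row][col] = sym
        from_r.add((curr_row, col))
        if i % 4 == 3:
            curr_row -= 1
    # Algorithm 4: fill the free cells with g, row by row.
    it = iter(g)
    for row in range(R):
        for col in range(C):
            if (row, col) not in from_r:
                X[row][col] = next(it)
    # Algorithm 5: replace g's symbols with q in columns [2, 9, 8, 3].
    column_set = [2, 9, 8, 3]
    curr_row = R - 1
    for i, sym in enumerate(q):
        X[curr_row][column_set[i % 4]] = sym
        if i % 4 == 3:
            curr_row -= 1
    # Read X column by column.
    return [X[row][col] for col in range(C) for row in range(R)]
```

With Figure 7's parameters (C = 12, G = 59, I = 13, Q = 10) the output has G + I = 72 symbols; the columns used by r and q are disjoint, so q only ever overwrites symbols of g.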
5.2.2 Specification of Channel deinterleaver
The processing step receives one array of symbols, but now a symbol is #M bytes long.
M is also specified by the input. The following is the input:
- A positive integer Q (its purpose will be explained shortly).
- An array of symbols S[0...N−1], where N is also specified by the input.
- A positive integer M that equals 2, 4 or 6. It denotes a symbol’s size in bytes.
- A positive integer C that equals 10 or 12. It denotes the number of columns in a
matrix that is to be used in the processing step.
The following will be the processing step’s output:
- An array of symbols T[0...N−1].
A matrix Y is used in the processing step. Each cell of the matrix can fit a symbol. It has
#C columns and R = N/C rows. N is a multiple of C. R can not be greater than 1200.
How the processing step works will now be explained. Figure 8 shows how it works
step by step for a small example. The reader should consult the figure as he reads.
First, the symbols of S are placed into the matrix Y column by column. The symbols are
placed into the matrix in incremental order, starting with S[0], then S[1], and so on. Every
cell of Y is now occupied by a symbol of S.
Next, #Q of the symbols in Y are cleared, i.e. they are set to zero. Algorithm 6 shows
which symbols. Note the similarity between it and Algorithm 5, which shows where Channel
interleaver places the array q’s symbols into the matrix X.
Algorithm 6: The algorithm shows what symbols of Y are cleared.
//#Q of Y’s symbols are cleared.
columnSet = [2, 9, 8, 3];
i = 0;
j = 0;
currRow = R - 1;
while (i < Q) {
currCol = columnSet[j];
Y[currRow, currCol] = 0;
i++;
j++;
j = j % 4;
if (j == 0) {
currRow--;
}
}
The processing step’s final output is an array of symbols T[0...N−1]. It is obtained by
reading the symbols of Y row by row.
Figure 8: The processing step is demonstrated for Y with C = 12 columns and R = 6 rows. The array S has
N = 72 symbols. Q equals 10. Y is shown after each step from top to bottom. Cells that are modified have a
darker shade.
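The Channel deinterleaver specification above can be sketched in Python (symbols are represented as plain list elements rather than M-byte values):

```python
def channel_deinterleave(S, Q, C):
    """Fill Y column by column with S, clear Q symbols as in Algorithm 6,
    then read Y row by row."""
    N = len(S)
    R = N // C
    # Place the symbols of S into Y column by column, starting with S[0].
    Y = [[None] * C for _ in range(R)]
    for i, sym in enumerate(S):
        Y[i % R][i // R] = sym
    # Clear #Q symbols, following Algorithm 6.
    column_set = [2, 9, 8, 3]
    curr_row = R - 1
    for i in range(Q):
        Y[curr_row][column_set[i % 4]] = 0
        if i % 4 == 3:
            curr_row -= 1
    # Read the symbols of Y row by row.
    return [Y[row][col] for row in range(R) for col in range(C)]
```

For N = 24 and C = 12 (so R = 2) with Q = 0, the output at position row*C + col is simply the input at position col*R + row, i.e. the column-major fill read back row-major.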
5.3 Rate dematching
In this paragraph Rate matching, which is one of ULTC’s processing steps, is specified
first. Rate dematching, which is one of the base station’s processing steps, is specified
next. The specification of the latter relies greatly on the former. The reader is therefore
advised to read the paragraph about Rate matching carefully before he attempts to
understand Rate dematching.
5.3.1 Specification of Rate matching
The input of the processing step is the following:
- Three bit arrays, namely a[0...D−1], b[0...D−1] and c[0...D−1]. The first #N
bits of each array are NULL. The length D of the arrays is also specified by the
input and is constrained by D ≤ 6148.
- A positive integer S, which denotes the length of the processing step’s output bit
array.
- A non-negative integer T ≤ 3 (its purpose will be explained shortly).
The output of the processing step is:
- A bit array e[0...S−1].
Obviously a bit’s value can be set to either 0 or 1, not NULL. But the #N first bits of the
three arrays are considered to be NULL. Apart from this, they will be treated just like all
the other bits in this processing step. If it is stated in the description of this processing
step that a bit is set to NULL, then the reader should understand that from there on
the bit’s value is considered to be NULL.
For simplicity of explanation, this processing step can be divided into five consecutive
steps. The reader should not be confused by the usage of the word “step” and
consider them to be individual processing steps. The steps are, in order, Padding,
Permutation I, Permutation II, Bit collection and Bit selection. Figure 9 gives an
overview of the entire processing step. The reader should often refer back to it as he reads
on. Padding is performed on the arrays a, b and c individually. Then the output of each
step is passed on as input to a following step. Finally, the array e is the output of Bit
selection. The integer T is used in the final step.
In the following subparagraphs, the five steps will be explained individually. A small
example of the entire processing step will be followed throughout each step. A figure in
each subparagraph will explain what that step does to the example. The reader should
also consult these figures often.
Two positive integers will be used throughout the processing step. R is the smallest
integer such that D ≤ 32·R. K equals 32·R. Also, the function P will be used. It is
defined as follows:
Table 4: Definition of the function P.8
i:     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
P(i):  0 16  8 24  4 20 12 28  2 18 10 26  6 22 14 30
i:    16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
P(i):  1 17  9 25  5 21 13 29  3 19 11 27  7 23 15 31
8 Note that the function is its own inverse. Also note that P(2i+1) = P(2i) + 16,
P(i+8) = P(i) + 2 and P(i+16) = P(i) + 1 for all 0 ≤ i ≤ 7. Longer such chains can be
observed but the details are omitted.
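The table need not be stored explicitly: P is the bit-reversal permutation on 5-bit indices. This is an observation about Table 4 (verified against it below; the thesis itself only tabulates the values):

```python
def P(i):
    """Reverse the 5 bits of i (0 <= i < 32)."""
    result = 0
    for _ in range(5):
        result = (result << 1) | (i & 1)
        i >>= 1
    return result

# The values of Table 4, in index order.
TABLE = [0, 16, 8, 24, 4, 20, 12, 28, 2, 18, 10, 26, 6, 22, 14, 30,
         1, 17, 9, 25, 5, 21, 13, 29, 3, 19, 11, 27, 7, 23, 15, 31]
```

Bit reversal also makes the footnote's observations immediate: the function is its own inverse, and adding 1 to the index (setting the lowest bit) adds 16 (the highest bit) to the value.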
[Figure 9 diagram: The arrays a[0...D−1], b[0...D−1] and c[0...D−1] each pass through
Padding, producing a′[0...K−1], b′[0...K−1] and c′[0...K−1]. Permutation I is applied to
a′ and b′, and Permutation II to c′, producing x[0...K−1], y[0...K−1] and z[0...K−1].
Bit collection combines these into w[0...3K−1], and Bit selection produces e[0...S−1].]
Figure 9: An overview of the entire processing step. For each step it is shown what is its input array and
what output array is produced.
5.3.1.1 Padding
This step is applied to the bit arrays a, b and c individually. The output for them will be
three bit arrays a'[0...K-1], b'[0...K-1] and c'[0...K-1], respectively. It will only be
explained for a, as it does the same thing to all three of them. The first K − D bits of a'
are set to NULL. The array a is then placed into the remaining D bits of a'. Figure 10
shows how the step works for a small example.
Figure 10: The figure demonstrates the step Padding when applied to the array a for D = 246 and N = 62.
This implies that K = 256, and so the first K − D = 10 bits of a' are set to NULL. The tables in the figure
show how a is modified through the step. They show only some parts of the array, and bits that are NULL have
a darker shade.
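A minimal sketch of Padding follows. Since a real bit cannot hold the value NULL, the sketch stores each bit in an int and represents NULL by -1; this convention is an assumption of the illustration, not the thesis implementation.

```c
#include <assert.h>

/* Sketch of the step Padding for one array. Each "bit" is stored in
 * an int, and NULL is represented by -1. D is the length of the
 * input array and K = 32*R is the padded length. */
void padding(const int *a, int D, int K, int *a_padded)
{
    /* The first K - D bits of a' are set to NULL (-1)... */
    for (int i = 0; i < K - D; i++)
        a_padded[i] = -1;
    /* ...and a is placed into the remaining D bits. */
    for (int i = 0; i < D; i++)
        a_padded[K - D + i] = a[i];
}
```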
5.3.1.2 Permutation I
This step is applied to the bit arrays a' and b' individually. It will only be explained for
a', as it does the same thing to both of them. The output for them will be two bit arrays
x[0...K-1] and y[0...K-1], respectively. A matrix of bits Y[0...R-1, 0...31] is used in
this step. It has R rows and 32 columns. Figure 11 illustrates the step for a small
example.
The bits of a' are placed in Y row by row. Next, Y's columns are permuted such that
the column that is column i after the permutation was column P(i) before the permutation.
The output bit array x is then obtained by reading the bits of Y column by column.
Figure 11: The figure demonstrates the step Permutation I when applied to the array a for D = 246 and
N = 62. This implies that R = 8 and K = 256. The tables in the figure show how the matrix Y is modified
through the step. Only some of the matrix's columns are shown. Each table shows how the matrix's bits originate
from the array a. Those cells that only say "NULL" are bits that were inserted during the step Padding and so do
not originate from a.
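A sketch of Permutation I in the same style (NULL bits as -1; an illustration, not the thesis implementation). The R-by-32 matrix is kept implicit: since a' is written row by row, the bit in row r, column j is a_padded[32*r + j], so permuting the columns with P and then reading column by column amounts to indexing column P(j) directly.

```c
#include <assert.h>

static const int P[32] = {
    0, 16,  8, 24,  4, 20, 12, 28,  2, 18, 10, 26,  6, 22, 14, 30,
    1, 17,  9, 25,  5, 21, 13, 29,  3, 19, 11, 27,  7, 23, 15, 31
};

/* Sketch of Permutation I: place the K = 32*R bits of a_padded in an
 * R-by-32 matrix row by row, permute the columns with P, and read
 * the result out column by column into x. */
void permutation_1(const int *a_padded, int R, int *x)
{
    int k = 0;
    for (int j = 0; j < 32; j++)      /* output column j ...    */
        for (int r = 0; r < R; r++)   /* ... read top to bottom */
            x[k++] = a_padded[32 * r + P[j]];
}
```

For R = 1 the matrix is a single row, so the output is simply a' permuted by P, which makes the function easy to check by hand.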
5.3.1.3 Permutation II
This step is applied to the bit array c'. The output will be the bit array z[0...K-1]. A
function is used that is defined as follows:

F(i) = (P(⌊i/R⌋) + 32·(i mod R) + 1) mod K, where 0 ≤ i ≤ K − 1

Formula 2: Definition of the function F.
The function P is defined as in Table 4. The bits of c' are moved to z as follows:

z[i] = c'[F(i)], for 0 ≤ i ≤ K − 1
The following figure demonstrates the step for a small example:
Figure 12: The figure demonstrates the result of applying the step Permutation II to the array c for D = 246
and N = 62. This implies that the length of the array is K = 256. Some parts of the resulting array z are
shown. It is shown how its bits originate from the array c. Those array positions that say "NULL" are bits that
were inserted during the step Padding and so do not originate from c.
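Formula 2 translates directly into code. The sketch below uses the same NULL-as-(-1) convention as before and is an illustration only, not the thesis implementation:

```c
#include <assert.h>

static const int P[32] = {
    0, 16,  8, 24,  4, 20, 12, 28,  2, 18, 10, 26,  6, 22, 14, 30,
    1, 17,  9, 25,  5, 21, 13, 29,  3, 19, 11, 27,  7, 23, 15, 31
};

/* Formula 2: F(i) = (P(floor(i/R)) + 32*(i mod R) + 1) mod K,
 * with K = 32*R. */
int F(int i, int R)
{
    return (P[i / R] + 32 * (i % R) + 1) % (32 * R);
}

/* Sketch of Permutation II (NULL bits as -1): z[i] = c'[F(i)]. */
void permutation_2(const int *c_padded, int R, int *z)
{
    for (int i = 0; i < 32 * R; i++)
        z[i] = c_padded[F(i, R)];
}
```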
5.3.1.4 Bit collection
This step is applied to the bit arrays x, y and z. The output will be a bit array
w[0...3K-1]. The three arrays' bits are placed into w as follows:

w[i] = x[i], for 0 ≤ i ≤ K − 1
w[K + 2j] = y[j], for 0 ≤ j ≤ K − 1
w[K + 2k + 1] = z[k], for 0 ≤ k ≤ K − 1

Formula 3: How the three arrays' bits are placed in w.
The following figure demonstrates the step for a small example.
Figure 13: The figure demonstrates the result of applying the step Bit collection to the arrays x, y and z for
D = 246 and N = 62. This implies that the length of each of the arrays is K = 256 and so the resulting
array w's length is 3K. Some parts of w are shown. It is shown how its bits originate from the arrays a, b and c.
Some of the bits are NULL bits that were placed into the arrays a', b' and c' during the step Padding. These
have been marked by "NULL_a'", "NULL_b'" and "NULL_c'". Note how Formula 3 places x into the first third
of w, while y and z are interlaced bit by bit and fill the rest of w. This becomes obvious in this example.
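Formula 3 can be sketched as follows (NULL bits as -1; an illustration, not the thesis implementation):

```c
#include <assert.h>

/* Sketch of Bit collection (Formula 3): x fills the first third of w,
 * while y and z are interlaced bit by bit into the remaining 2K
 * positions. w must have room for 3K entries. */
void bit_collection(const int *x, const int *y, const int *z,
                    int K, int *w)
{
    for (int i = 0; i < K; i++) {
        w[i] = x[i];                /* w[i]          = x[i] */
        w[K + 2 * i] = y[i];        /* w[K + 2i]     = y[i] */
        w[K + 2 * i + 1] = z[i];    /* w[K + 2i + 1] = z[i] */
    }
}
```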
5.3.1.5 Bit selection
This step is applied to the bit array w. The output will be a bit array e[0...S-1].
Algorithm 7 shows how e is produced. Figure 14 demonstrates the step for a small
example. Starting from bit w[offset], where offset = R·(24·T + 2), w is traversed circularly
until enough bits have been collected for e. However, bits that are NULL are skipped.
Algorithm 7: The algorithm shows how the bit array e is produced from w.
//The input is a bit array w[0...3K-1],
//a non-negative integer T<=3 and a positive integer S.
//The output is a bit array e[0...S-1].
offset = R*(24*T + 2);
j = 0;
k = 0;
while (k < S) {
if (w[(offset + j)%(3*K)] != NULL) {
e[k] = w[(offset + j)%(3*K)];
k++;
}
j++;
}
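Algorithm 7 can be made directly runnable with the NULL-as-(-1) convention used in the earlier sketches (an illustration, not the thesis implementation; the caller must guarantee that w contains at least one non-NULL bit so that the loop terminates):

```c
#include <assert.h>

/* Runnable version of Algorithm 7. NULL bits are represented by -1.
 * w has 3K entries; e receives S non-NULL bits, read circularly
 * starting from offset = R*(24*T + 2). */
void bit_selection(const int *w, int K, int R, int T, int S, int *e)
{
    int offset = R * (24 * T + 2);
    int j = 0, k = 0;
    while (k < S) {
        int bit = w[(offset + j) % (3 * K)];
        if (bit != -1)              /* NULL bits are skipped */
            e[k++] = bit;
        j++;
    }
}
```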
Figure 14: The figure demonstrates the result of applying the step Bit selection to the array w for D = 246,
N = 62, T = 1 and S = 498. This implies that the length of w is 3K = 768. Also, the processing step's
final output array e has length S = 498. Some parts of e are shown in the figure. It is shown how its bits
originate from the arrays a, b and c. NULL bits are not included in e, so every bit is from either a, b or c. Due to
the value of T, e begins with bits from a. 3(D − N) = 552 of w's bits are not NULL. The traversal starts at
w[offset] and wraps around the end of the array before sufficiently many bits have been read, re-entering the
part of w that holds bits from a. This is why e also ends with bits from a.
5.3.2 Specification of Rate dematching
The input of the processing step is:
A byte array E[0...S-1], where S is specified in the input.
Three byte arrays, namely A[0...D-1], B[0...D-1] and C[0...D-1]. The first
N bytes of each array are NULL. The length of the arrays is constrained by
D ≤ 6148.
A non-negative integer T ≤ 3 (its purpose will be explained shortly).
A boolean CLEAR.
The processing step has no explicit output. Its purpose is to modify A, B and C as will
shortly be described. The bytes of the arrays are soft values. Therefore no byte value can
be spared to indicate that the byte is NULL. But as in Rate matching, the first N bytes of
the three arrays are considered to be NULL. Apart from this, they will be treated as the
other bytes in the processing step.
A function SatFunc will be used in this specification and is defined as follows for any
integer x:

SatFunc(x) = 127,  if x > 127
SatFunc(x) = −128, if x < −128
SatFunc(x) = x,    otherwise

Formula 4: Definition of SatFunc for any integer x.
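SatFunc simply saturates its argument to the signed 8-bit range. A minimal sketch:

```c
#include <assert.h>

/* Formula 4: SatFunc saturates any integer to the signed 8-bit
 * range [-128, 127]. */
int sat_func(int x)
{
    if (x > 127)
        return 127;
    if (x < -128)
        return -128;
    return x;
}
```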
The processing step begins by clearing A, B and C if CLEAR is set. The rest of the
specification relies on Rate matching. Assume Rate matching is performed for three bit
arrays a[0...D-1], b[0...D-1] and c[0...D-1], where the first N bits of each array are
set to NULL. The values of the other bits are irrelevant. The inputs D, N, S and T of Rate
matching are to be the same as they are for Rate dematching. The output of the former
will be a bit array e[0...S-1]. It is important to understand that each of e's bits
originates from a bit of either a, b or c. It is true that e was produced from the array w in
the last step Bit selection, and some of w's bits are NULL bits that were introduced in or
before the step Padding. But NULL bits were not written to e, and apart from these all
the other bits of w originate from either a, b or c. Now, starting from E[0] and moving
forward, the values of E are used to modify A, B and C as Algorithm 8 shows.
Algorithm 8: The algorithm shows how the bytes of E are used to modify A, B and C.
//Input is four byte arrays E[0...S-1], A[0...D-1], B[0...D-1] and C[0...D-1]
//and two non-negative integers N and T <= 3.
//Rate matching is performed first.
e[0...S-1] = Rate_matching (a[0...D-1], b[0...D-1], c[0...D-1], N, T, S);
//The array E is then used to modify the arrays A, B and C as follows.
for (i = 0; i < S; i++) {
if (e[i] originates from a[j]) {
A[j] = SatFunc (A[j] + E[i]);
} else if (e[i] originates from b[k]) {
B[k] = SatFunc (B[k] + E[i]);
} else { //e[i] originates from c[l]
C[l] = SatFunc (C[l] + E[i]);
}
}
To give an example and introduce some terminology, if e[i] originates from a[j] then
E[i] is used to modify A[j] by using SatFunc. It is said that the latter two are soft
combined. Figure 14 illustrates an example of Rate matching, where the input parameters
D, N, T and S have some specific values. The resulting array e is shown. Suppose Rate
dematching is executed with the same values for the four input parameters. Then the same
figure also makes it obvious, for each byte of E, with what byte of A, B or C it is to be soft
combined.
6 Processing algorithm for Channel deinterleaver
This chapter presents the processing algorithm that was used for Channel deinterleaver.
Paragraph 6.1 first makes some relevant observations regarding the processing step.
Paragraph 6.2 suggests how a processing algorithm can perform the processing step. This
includes how the work can be divided among the job DSPs that are to perform the
processing step. The algorithm that was chosen as a base for the implementation is
presented by Paragraph 6.3 and further specified in Appendix 2.
An implementation of the processing algorithm was written during the thesis. For
details regarding the implementation and what steps were taken to verify its correctness
and to improve its performance, refer to Appendix 2.
6.1 Initial discussion
The following subparagraphs discuss the processing step beyond its specification in
Chapter 5. The purpose is to highlight interesting properties that it has before a
processing algorithm is discussed.
6.1.1 Regarding existing literature on matrix transposition
The major task of the processing step is to perform matrix transposition; the symbols
are placed into a matrix column by column and then read row by row (see Paragraph
5.2.2). The symbols are to be permuted from column major order to row major order. This
problem has been widely studied, and there are many algorithms in the literature that are
claimed to be efficient. The problem is that no algorithms suitable for the
processing step's matrix were found. Nor is the underlying hardware architecture that the
literature assumes a good description of EMPA.
To give examples, [3], [6] and [5] provide a good summary of several transposition
algorithms and review their performance. Most of these algorithms are concerned with
square matrices, which is clearly not suitable for the processing step, since the matrix has
up to 120 times more rows than columns. An approach where the non-square matrix is
extended to be square is clearly infeasible.
However, even the algorithms that do work with a non-square matrix still put special
restrictions on its dimensions that are not suitable for the processing step. For instance, one
of the algorithms of [3] first divides the matrix along its columns into multiple new
matrices, then divides each new matrix along its rows, then divides the resulting matrices
along their columns, and so on. This is clearly not applicable to the processing step's
matrix, as it can have many times more rows than columns.
Apart from issues regarding the shape of the matrix, the assumptions that have been
put on the underlying hardware do not match EMPA. Some of the algorithms of [3], [6]
and [5] assume that the matrix in question is so large that it must reside in secondary
memory while parts of it are moved to primary memory for processing. One might be
tempted to regard EMPA's CM as secondary memory and the DSP's LDM as the
primary memory. But even though the processing step's matrix can be so
large that it can not reside in LDM, the memory latencies of CM, which require only a few
cycles, can not be compared to the I/O operations that [3] and [6] assume take multiple
milliseconds.
On a final note, all the algorithms of [3], [6] and [5] are in-place. This is not required by
an implementation of the processing step; besides the input that it receives in CM, it is
allowed to use an equally large portion of the memory to produce the output.
All these issues also occur in literature that suggests how matrix transposition can be
parallelized (see [7]).
6.1.2 The q symbols
Some of the symbols that are placed into the processing step's matrix are to be cleared
(see Paragraph 5.2.2). These symbols will be discussed at length and will therefore be
referred to as q symbols. q symbols are in the matrix's last rows, and there can be at most
four in each row.
The number of q symbols is independent of the number of rows in the matrix, except
that there can be at most four q symbols in each row. It rarely exceeds 100, and is on average
no more than 40.
6.2 Suggestion for processing algorithm
A processing algorithm is suggested in this paragraph. What needs to be discussed is
how the work should be divided among the job DSPs, and how the q symbols can be
cleared.
6.2.1 How to parallelize
Consider a general case where a matrix is to be transposed on EMPA. Initially the
matrix’s elements are written column by column in CM, and they are to be permuted to
be in the order row by row. A part of the memory that is equally large as the input is to be
used to write the output. Every element is a number of complete words large, but a
specific size will not be mentioned to keep the discussion general. Also, the matrix can
have any number of columns and rows. Figure 15 demonstrates this for a small matrix.
Figure 15: The figure shows two ways a matrix M[0...7, 0...7] can appear in CM, namely column by column
and then row by row.
A number of job DSPs are to work together. Each of them will read some of the
matrix's elements from the input in CM to its LDM. When they have been rearranged
they are written back to CM where the output should be.
A DSP will perform a CM instruction to read or write an array of elements from the
memory. The memory latencies CM_read_lat and CM_write_lat apply for initiating such an
instruction (see Paragraph 2.2). Besides these, say that on average cm_cycles cycles
are required to read or write an element of the array. Also, the LDM of every DSP has
available memory for ldm_elements elements.
Each DSP can be assigned a rectangular part of the matrix, covering a number of
columns and rows. Its task is to read the part’s columns from the input, and write the
corresponding rows of the part to the correct positions of the output. The part of the
matrix that has been assigned to a DSP will be referred to as its section. Assume for this
discussion that the matrix can be divided into equally large sections, one for each DSP.
Figure 16 demonstrates the idea for a small matrix. It also highlights what elements one
of the DSPs must read from the input and write to the output.
The benefit of this method is that it is straightforward and no synchronization is
required. Each DSP writes to its own parts of the output, not interfering with the work of
the other DSPs.
Figure 16: The matrix M[0...7, 0...7] is divided into 8 sections. Each section spans over 2 columns and 4
rows. The figure first highlights one of the sections and the elements that it includes. It then shows for that
section what parts of the input must be read from CM, and finally what parts of the output must be written to.
Assume that the section does not fit into the DSP's LDM. One way to process the
section is to divide it into multiple rectangular parts. Such a part will be referred to as a
segment, and it spans over a number of rows and columns.9 The DSP processes one
segment at a time.10 It must read the segment's elements from the input in CM to LDM,
and write them to the output. It is desirable that the execution time spent on performing
these CM instructions is minimized. To achieve this, two things must be considered.
The first is in what manner the CM instructions are performed. The DSP should use
only one instruction to read each of the segment's columns from the input, and only one
instruction to write each of the segment's rows to the output. None of them should be
read or written partially or twice.11 But in order to write one of the segment's rows to the
output, all the elements of that row must be present in LDM. And since no column can be
read partially or twice, all the columns must be present in the memory to produce the row
and the other rows. This requires that the segment is smaller than LDM.
The second thing to consider is the segment's dimensions. Say that its width and height
are nr_cols and nr_rows elements respectively. The number of cycles required to perform
CM instructions to process the segment is:

nr_cols·CM_read_lat + nr_rows·CM_write_lat + 2·nr_cols·nr_rows·cm_cycles

To explain the formula, nr_cols instructions must be initiated to read the entire segment
from CM. nr_rows instructions are initiated to write the segment. Besides initiating
the instructions, the elements must of course be read and written. Now, the expression is
to be minimized under the following restriction:

nr_cols·nr_rows = ldm_elements

The restriction is too strict for practical purposes, as it uses all of the available memory in
LDM for the segment. But it suffices for this discussion. To make further mathematical
arguments more convenient, say that x = nr_rows and y = nr_cols and allow the two
variables to have real values. The following function

f(x, y) = y·CM_read_lat + x·CM_write_lat + 2·x·y·cm_cycles

Formula 5: Definition of the function f for any real values x and y.
9 Figure 16 already illustrates this if one considers the figure’s entire matrix to be a section, and the
highlighted section in the figure to be a segment.
10 The verb “process” will be frequently used to state that some task is being performed. Do not confuse
this with a processing step.
11 This argument may sound a bit too restrictive but it will suffice for the final conclusion of this paragraph.
In the mean time, the reader should recall that Channel deinterleaver’s matrices have a very special
dimension.
is to be minimized under the following constraint

x·y = ldm_elements

Formula 6: The restriction that must be met by f.
Note that f is a continuous function, but it is constrained to an endless curve even if only
positive values for x and y are considered.
The method of Lagrange multipliers states that if f has any minimal turning point (x, y)
while subject to the constraint, then there is a v such that (x, y, v) solves the following set
of equations:

CM_write_lat + 2·y·cm_cycles − v·y = 0
CM_read_lat + 2·x·cm_cycles − v·x = 0
x·y − ldm_elements = 0

Formula 7: The equations that any minimal turning point of f solves.
However, the converse implication does not have to be true. A triplet (x, y, v) that solves
these equations does not necessarily yield an extremum (x, y) of f under the
constraint. Thus, before solving the equations it must first be proved that there is at least
one minimal extremum. For this purpose the following evaluations of f are made
under the constraint of Formula 6:

If f is subject to the constraint and x → ∞, then f → ∞.
If f is subject to the constraint and y → ∞, then f → ∞.

Formula 8: Evaluations made of f under the constraint of Formula 6.
f grows to infinity as x or y grows to infinity along the curve in the first quadrant. Yet f is
a continuous function. This proves that f must have at least one minimal extremum in the first
quadrant when subject to the constraint. To find one, the three equations can be solved.
One solution is:

y = sqrt(ldm_elements · CM_write_lat / CM_read_lat)

Only positive values for y are of interest. For the purpose of the processing step, consider
the largest symbol size, which is six bytes. For this symbol size ldm_elements should be

ldm_elements = LDM_avail / 6
Given this, the solution suggests that a segment should span over y ≈ 31 columns and
x ≈ 65 rows. These numbers increase for smaller symbols.
The obvious problem is that the matrix of the processing step can not have that many
columns. In such case a segment should instead cover as many consecutive rows of the
matrix as LDM allows. A section too should cover a consecutive number of rows. Figure
17 demonstrates the idea for a small example. This approach is beneficial, because now
all the segment’s rows are to be written directly after one another in the output, and this
can be done using a single CM instruction.
Figure 17: A matrix M of the processing step with 12 columns and 36 rows. The symbol size is irrelevant. It is
first shown how it is divided into 3 sections, and a segment in the second section is highlighted. It is then shown
what parts of the input must be read to process the segment, and then what part of the output must be written to.
Recall that the input appears in CM column by column, but it is to be written row by row. 12 CM instructions are
required to read from the input, while only one is required to write to the output.
Note that this discussion only considers how much of an implementation's execution
time is spent on performing CM instructions. How the implementation otherwise
works was disregarded. But its purpose was to show that any implementation of the
processing step should give each job DSP a set of consecutive rows to process. Otherwise
an unacceptable amount of the execution time is spent on performing many CM
instructions, no matter how efficient the implementation is otherwise.
Dividing the processing step’s matrix into sections and segments in this way was
chosen as a base for an algorithm.
6.2.2 Clearing the q symbols
By dividing the matrix's rows into equally large sections, the workload may be
unevenly divided among the job DSPs, since only the sections containing the last rows
have q symbols that need to be cleared. This will hardly be noticeable if each section spans
over, say, 400 rows and there are only 25 rows with q symbols in the last section. And
even if the sections are smaller, an even workload may be maintained by not making the
sections equally large. Somehow a section that contains q symbols can differ in size from
other sections.
How they should differ depends entirely on the implementation. It might actually be
beneficial for a DSP to be assigned a section containing q symbols, since it does not have
to read those symbols from the input. For a column of the segment that contains q
symbols in the end, the implementation can read from the input in CM only the first part
of the column that does not contain any q symbols. It then clears the last part of the
column. Figure 18 illustrates the idea.
Figure 18: A matrix of the processing step is shown from top to bottom, but some rows have been skipped. q
symbols are marked with a “q” throughout the matrix. A segment that spans over 8 rows is shown in shaded
colors. The parts of the columns that must be read from the input in CM appear in a darker shade, while q
symbols that do not need to be read have a lighter shade.
On the other hand, each of the segment's columns may be read completely to LDM and
then the q symbols can be cleared (before the segment is written to the output in CM).
The latter strategy was chosen for an initial implementation, but it can be switched later.
Also, sections are to be equally large. If it turns out that q symbols cause too uneven a
workload balance, then the method of choosing section size can be changed to
compensate for this.
6.3 Description of the processing algorithm
The processing step's matrix is divided into equally large sections, one for each job
DSP. Each section is divided into many segments. How a segment's size is determined
will be described shortly. Each of the segment's columns is then read from the input in
CM to a column buffer in LDM. Each of the segment column's symbols is then placed
from the column buffer into the segment in LDM. Once all of the segment columns have
been placed into the segment in this manner, all of its q symbols are cleared. The segment
is then written to CM using only one instruction. The next segment can now be
processed. Figure 19 demonstrates the procedure for a small example.
Figure 19: The figure illustrates the processing of one segment. It shows how data is moved between CM and
LDM. Specifically, the last column of the segment is read from the input to the column buffer in LDM using one
CM instruction. The symbols are then moved from the buffer to the segment by multiple LDM instructions.
Then, q symbols are cleared using such instructions too. Finally, the segment is written to the output in CM by
one CM instruction.
So in LDM there must be room for the entire segment and the column buffer. Say that
the matrix has nr_cols columns and a symbol is symbol_size bytes large (Paragraph 5.2.2
shows the variable's possible values). LDM has LDM_avail bytes available (see
Paragraph 2.1.1). In that case the segment can span over nr_rows rows, where nr_rows
solves the following equation:

(nr_cols·nr_rows + nr_rows)·symbol_size = LDM_avail

Formula 9: The equation is used to determine the size of a segment.
The benefit of this approach is that the segments are as large as possible. This
minimizes the number of CM instructions that must be performed. A possible
disadvantage is that, due to the capabilities of a DSP, the implementation can perhaps be
improved by processing say two of the segment's columns at a time, which would require
a larger column buffer. However, this can be considered for later optimizations.
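Formula 9 can be solved for nr_rows with integer arithmetic. A minimal sketch (variable names follow the text; this is an illustration, not the thesis implementation):

```c
#include <assert.h>

/* Largest nr_rows satisfying Formula 9:
 * (nr_cols*nr_rows + nr_rows) * symbol_size <= LDM_avail,
 * i.e. the segment plus one column buffer of nr_rows symbols must
 * fit in the available LDM. */
int segment_rows(int nr_cols, int symbol_size, int ldm_avail)
{
    return ldm_avail / ((nr_cols + 1) * symbol_size);
}
```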
7 Processing algorithm for Rate dematching
This chapter presents the processing algorithm that was used for Rate dematching.
Paragraph 7.1 makes some initial observations regarding the processing step that are
relevant. Paragraph 7.2 suggests how a processing algorithm for Rate dematching can
work. The processing algorithm that was used as a base for the implementation is
presented by Paragraph 7.3 and further specified in Appendix 3.
An implementation of the processing algorithm was written during the thesis. For
details regarding the implementation and what steps were taken to verify its correctness
and to improve its performance, refer to Appendix 3.
7.1 Initial discussion
This paragraph discusses the processing step beyond its specification. The purpose is to
give a new idea of how it works that makes it easier to understand how an efficient
algorithm can be written. This will be done from Paragraph 7.1.4 and forward. But its
specification relies greatly on Rate matching, and because of this it will first be presented
how the latter processing step works from a new perspective. This will be done in
paragraphs 7.1.1 to 7.1.3. A good understanding of the processing steps' specifications is
recommended (see Paragraph 5.3).
7.1.1 Redefinition of Permutation II
Recall the function F(i) that is used to rearrange the bits of the array c' to z in
Permutation II (see Paragraph 5.3.1.3). Suppose c' is placed in a matrix
Y[0...R-1, 0...31] row by row just as Permutation I would. This enables one to see that
the function F permutes the bits of c' to z in a way very similar to how Permutation I
would do it.
Suppose the function

G(i) = ⌊i/R⌋ + 32·(i mod R)

is used in Permutation II instead of F. This equals placing the bits of c' in Y row by
row; z is then obtained by reading the matrix's bits column by column. To see this, note
how ⌊i/R⌋ chooses the current column while i mod R chooses the row.
Now, suppose the function

H(i) = P(⌊i/R⌋) + 32·(i mod R)
is used instead of F (recall the function P's definition from Paragraph 5.3.1). This would
make Permutation I and II equal. To see this, compare G(i) to H(i) for the same i in the
context of Permutation II. The former function assigns the bit in column ⌊i/R⌋ and row
i mod R of Y to z[i]. The latter function chooses the bit in the same row, but instead in
column P(⌊i/R⌋). So G produces the output by reading the matrix's bits column by
column, while H first permutes the columns by using P. This is exactly what Permutation I does.
A crucial observation can be made at this stage, namely

F(i) = (H(i) + 1) mod K

Recall that K = 32·R and that this is the number of bits in Y. The similarity of
Permutation I and II will be established in the following argument. The reader should
remember that using H instead of F in Permutation II is equal to Permutation I. Also keep
in mind that c' has been placed in Y row by row.
Suppose c'[H(i)] is a bit located in any of Y's columns but the rightmost. Then
c'[H(i) + 1] is in the same row but in the column to the right of it.
Suppose c'[H(i)] is in the rightmost column but not the bottom row. Then
c'[H(i) + 1] is the leftmost bit of the next row.
Finally, suppose c'[H(i)] is the rightmost bit of the bottom row. Then
c'[(H(i) + 1) mod K] is the leftmost bit of the top row.
Permutation II can now be defined in a more convenient way that reveals how similar it
is to Permutation I.
Redefinition of Permutation II: place the bits of c' in the matrix Y row by row. Each
row is filled from left to right. In the leftmost column, move each cell one row up. Move
the cell that was in the top row to the bottom of the column. Now, rearrange the columns
such that the column that is column i after the rearrangement was column Q(i) before the
rearrangement. Q is defined as follows:

Q(i) = (P(i) + 1) mod 32

Formula 10: Definition of the function Q.
The bit array z is produced by reading the matrix's bits column by column. Each
column is read from top to bottom. Figure 20 demonstrates the step for the same small
example that was followed throughout Paragraph 5.3.1 and its subparagraphs. Compare it
to Figure 11, which illustrates Permutation I.
This is the new definition of Permutation II from here on. The reader should consult
this one when the permutation is mentioned in the rest of this chapter. The old
definition should only be read for clarifying this one. From here on in the report,
when it is said that the matrix's columns are permuted as Permutation II would, then this
includes two things. Namely (i) how the cells of column zero are shifted up and then (ii)
the rearrangement of the order of the columns.
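The equivalence between the redefinition and the original definition by F can be checked mechanically. The following sketch builds z both ways for c' = 0, 1, ..., K−1 and compares the results (an illustration only; R is limited to 8 here to keep the matrix buffer small):

```c
#include <assert.h>
#include <string.h>

static const int P[32] = {
    0, 16,  8, 24,  4, 20, 12, 28,  2, 18, 10, 26,  6, 22, 14, 30,
    1, 17,  9, 25,  5, 21, 13, 29,  3, 19, 11, 27,  7, 23, 15, 31
};

/* z built directly from F, as in Paragraph 5.3.1.3. */
static void via_F(const int *c, int R, int *z)
{
    int K = 32 * R;
    for (int i = 0; i < K; i++)
        z[i] = c[(P[i / R] + 32 * (i % R) + 1) % K];
}

/* z built from the redefinition: fill Y row by row, shift column 0
 * up one step (top wraps to bottom), then read the columns in the
 * order given by Q(i) = (P(i) + 1) % 32. Requires R <= 8. */
static void via_redefinition(const int *c, int R, int *z)
{
    int Y[8][32], k = 0;
    for (int r = 0; r < R; r++)
        for (int j = 0; j < 32; j++)
            Y[r][j] = c[32 * r + j];
    /* shift the leftmost column up by one row */
    int top = Y[0][0];
    for (int r = 0; r < R - 1; r++)
        Y[r][0] = Y[r + 1][0];
    Y[R - 1][0] = top;
    /* new column i is old column Q(i); read column by column */
    for (int j = 0; j < 32; j++)
        for (int r = 0; r < R; r++)
            z[k++] = Y[r][(P[j] + 1) % 32];
}

/* Returns 1 if the two constructions agree for this R (R <= 8). */
int redefinition_matches(int R)
{
    int c[256], z1[256], z2[256], K = 32 * R;
    for (int i = 0; i < K; i++)
        c[i] = i;
    via_F(c, R, z1);
    via_redefinition(c, R, z2);
    return memcmp(z1, z2, K * sizeof(int)) == 0;
}
```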
Figure 20: The figure demonstrates the step Permutation II when applied to the array c for D = 246 and
N = 62.12 This implies that R = 8 and K = 256. The tables in the figure show how the matrix Y is modified
through the step. Only some of the matrix's columns are shown. Each table shows how the matrix's bits originate
from the array c. Those cells that only say "NULL" are bits that were inserted during the step Padding and so do
not originate from c.
12 Note that this is the same as the example that was followed throughout paragraph 5.3.1 and its
subparagraphs.
7.1.2 Bit collection seen from a new perspective
Before discussing the step Bit collection in Rate matching, one should recall from
where it obtains its input. Its input is the bit arrays x[0...K-1], y[0...K-1] and
z[0...K-1]. The two former were produced by Permutation I, while the last was
produced by Permutation II (see paragraphs 5.3.1.2 and 7.1.1). Each permutation
produced its output by reading the bits of its matrix column by column. The matrix's
columns had been permuted before that. Say that Permutation I used the matrices
X[0...R-1, 0...31] and Y[0...R-1, 0...31] while Permutation II used Z[0...R-1, 0...31].
Assume that the matrices' columns have already been permuted as the two permutations
do.
Now, in the step Bit collection the bit array x is placed in the beginning of the bit array
w (see Paragraph 5.3.1.4). This equals reading matrix X's bits column by column and
placing them in the beginning of w.
Next in the step, the arrays y and z are interlaced bit by bit in the order
y[0], z[0], y[1], z[1], .... The bits are then placed in the remaining part of w. Interlacing the
arrays y and z equals interlacing the bits of Y and Z column by column. The order of
the bits is

Y[0,0], Z[0,0], Y[1,0], Z[1,0], ..., Y[R-1,0], Z[R-1,0], Y[0,1], Z[0,1], ...

Formula 11: The order in which the bits of Y and Z are placed in w.
The bits are then placed in the remaining part of w instead of y and z. w is now full. The
important observation is that the matrices appear in w column by column, only that X's
columns come first and then the columns of Y and Z come interlaced. Figure 21 shows
how the columns of X, Y and Z appear in w for a small example.
Figure 21: The figure shows how the columns of the matrices X, Y and Z appear in w. The columns of Y and Z
have been interlaced bit by bit.
7.1.3 Bit selection seen from a new perspective
In the step Bit selection in Rate demtaching, the array w is traversed circularly until
sufficiently many non NULL bits have been collected for the output array e (see
64
Paragraph 5.3.1.5). The previous paragraph showed how the bits of the matrices X, Y and
Z were placed in w. The purpose of this paragraph is to show how traversing w is
equivalent to traversing the matrices. Recall that their columns have been permuted by
Permutation I and II.
Imagine that Bit selection traverses w from its beginning.13
The previous paragraph
shows that this is equivalent to first traversing X column by column. When the bottom bit of its
rightmost column has been reached, Y and Z are traversed in the order specified by
Formula 11. This means the matrices are traversed column by column, but first one of Y’s
bits, then one of Z’s bits, then one of Y’s bits and so on. When the last bit in this order has
been reached w has been fully traversed. In such case it is again traversed from its
beginning as described. This continues until sufficiently many non NULL bits have been
collected from the three matrices for the output array e.
However, w is not traversed from its beginning. The starting position is bit
offset = 2R(12T + 1)
where T is a non-negative integer. Note that offset is thus always an even multiple of R.
This means that beginning the traversal of w from bit #offset is equivalent to skipping a number of
complete matrix columns, but only in the first traversal of w. Specifically:
If offset ≤ 32R then offset / R of X's columns are skipped.
If offset > 32R then X is completely skipped and (offset − 32R) / (2R) columns of each
of the matrices Y and Z are skipped.
To clarify with an example, if offset = 46R then all of X's columns and the first 7
columns of Y and Z respectively are skipped in the first traversal of w. In the beginning of
the next traversal, one begins from the first column of X again.
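The case analysis above can be summarized in a small helper; the function name is hypothetical, and offset is assumed to be an even multiple of R, as stated.

```c
/* Sketch (hypothetical name): the number of complete columns skipped
 * in the FIRST traversal of w, given offset, which is assumed to be
 * an even multiple of R. x_cols is the number of X columns skipped;
 * yz_cols is the number of columns skipped in each of Y and Z. */
void skipped_columns(unsigned offset, unsigned R,
                     unsigned *x_cols, unsigned *yz_cols)
{
    if (offset <= 32 * R) {
        *x_cols  = offset / R;
        *yz_cols = 0;
    } else {
        *x_cols  = 32;                           /* X skipped entirely */
        *yz_cols = (offset - 32 * R) / (2 * R);
    }
}
```

For offset = 46R the sketch reports that all 32 of X's columns and 7 columns of each of Y and Z are skipped, matching the example above.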
Some crucial remarks will now be made regarding the manner in which the matrices'
columns appear in e. Figure 22 illustrates the observations for a small
example. The reader should consult the figure frequently while reading on.
First, one must know where the NULL bits appear in the matrices. There are equally
many in each matrix and it is possible that there are none at all. Assume that there are
some and take X as an example. Consider it before its columns were permuted. By
reading its bits row by row, all NULL bits appear first in this ordering. This means that in
any column only the topmost bits can be NULL, and the number of NULL bits in any two
columns can differ only by one. Now, Permutation I permutes the columns of X and Y. It
only switches place between the columns. So still the same observations can be made: in
13 This is actually not possible because the variable offset > 0 decides the starting position, but this will
be taken into consideration later.
any column only the topmost bits can be NULL, and the numbers of NULL bits in any two
columns can differ only by one. Finally, Permutation II is applied to Z. It first rearranges
the bits of column 0, and then switches place between the columns. After the permutation,
column 31 is the column that was originally column 0. Thus the following observations can be made: the numbers of
NULL bits in any two columns can differ only by one. In any column only the topmost bits can
be NULL. The bottommost bit of column 31 is NULL. No other bit anywhere in the matrix
can be NULL.
Now it can be discussed how the matrices’ bits appear in e. Remember that all of them
have been permuted, and how some columns may be skipped due to the variable offset.
Starting with X, its bits appear in e column by column. Each column appears from top to
bottom but the top NULL bits have been skipped.
Continuing with Y, its bits appear in e column by column, but in a special manner.
Consider any column i. Say that it has n NULL bits at the top. The column's remaining
R − n bits appear from top to bottom in e, only that they have been interlaced bit by bit with
some of Z's bits. Specifically, the bits of Y's column appear in e in the following order:
Y(n,i), one Z bit, Y(n+1,i), one Z bit, …, Y(R−2,i), one Z bit, Y(R−1,i)
Exactly which of Z's bits appear in the ordering will not be specified. It is noteworthy
though that they all belong to column i in Z, with one possible exception: the last one
may come from column i + 1.
Finally, how Z's bits appear in e will be shown. Consider any column i ≠ 31. Say that it
has n NULL bits at the top. The remaining bits appear in e in a fashion very similar to the
previous ordering presented, namely
Z(n,i), one Y bit, Z(n+1,i), one Y bit, …, Z(R−2,i), one Y bit, Z(R−1,i)
All the Y bits belong to column i in Y, with one possible exception: the first one may
come from column i − 1. The ordering is very similar for column 31 of Z. If the column
has no NULL bits then the ordering begins with Z(0,31) and ends with Z(R−1,31). If there
are n NULL bits then the ordering begins with Z(n−1,31) and ends with Z(R−2,31).
Finally, note that e may end abruptly after any bit, no matter which
matrix it belongs to. Also, one might be tempted to believe that all of the bits of Y and Z
are interlaced in e. There is however an exception, but only if the matrices have NULL
bits. Each time the two matrices are traversed, two Z bits will appear side by side in e.
The first is the bottommost bit of one column, and the second is the topmost non NULL
bit of the next column.
Figure 22: A small example of Rate matching is illustrated. The purpose of this figure is to show in what manner
the columns of X, Y and Z appear in e. Because of this, for every bit it is only shown which matrix and column it
belongs to. The matrices' columns have already been permuted. This is visible in how the NULL bits have been
rearranged in every matrix. The example's input parameters are D = 143, N = 36, T = 0 and S = 411.
This implies that every matrix shall have R = 5 rows and a total of K = 32R = 160 bits. K − D = 17 bits
of each matrix are NULL bits that were introduced in the step Padding. So each matrix has a total of 17 + N = 53
NULL bits, which leaves D − N = 107 non NULL bits. Note that 3 · 107 = 321 < S, so e traverses all of
the matrices once and then some of X. Five parts of e are shown; note that in the middle part two bits from Z
are adjacent.
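The caption's arithmetic can be checked with a few one-line helpers. The relation R = ⌈D/32⌉ is an assumption made for this sketch; it is consistent with the values quoted (R = 5 for D = 143).

```c
/* Checking the arithmetic of the example in Figure 22 (D = 143,
 * N = 36). The relation R = ceil(D / 32) is an assumption made for
 * this sketch; it is consistent with the values quoted in the caption. */
unsigned rows(unsigned D)                     { return (D + 31) / 32; }      /* R */
unsigned bits(unsigned D)                     { return 32 * rows(D); }       /* K */
unsigned null_bits(unsigned D, unsigned N)    { return bits(D) - D + N; }
unsigned nonnull_bits(unsigned D, unsigned N) { return D - N; }
```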
7.1.4 Rate dematching seen from a new perspective
In Rate dematching the bytes of the array E are used to modify the byte arrays A, B and
C. The specification relies greatly on Rate matching (see Paragraph 5.3.2). Specifically, as
shown by Algorithm 8, E(i) is soft combined with A(j) (or B(k) or C(l)) if e(i)
originates from a(j) (or b(k) or c(l)). But so far in this chapter it has been shown in
what manner the bits of a, b and c appear in e. A short summary follows.
First, the step Padding inserts some NULL bits in the beginning of each of a, b and c.
Then, the steps Permutation I and II place the arrays into the matrices X, Y and Z row by
row, and permute their columns. From there, it was shown how the matrices’ bits appear
in e column by column.
The same idea will now be applied to A, B and C but an obvious redefinition of the
steps of Rate matching is required. Padding, Permutation I and II actually work with bit
arrays, but suppose they would work in the same way with byte arrays. To clarify with an
example, if Padding places K − D NULL bits in the beginning of a bit array then it
places equally many NULL bytes in the beginning of a byte array.14
Now, suppose the steps are applied to A, B and C. Specifically Padding is applied to the
three arrays individually. Then Permutation I places A and B into the matrices
U(0…R−1, 0…31) and V(0…R−1, 0…31) respectively and permutes their columns.
Permutation II places C into W(0…R−1, 0…31) and permutes its columns.
Now, Algorithm 8 can be rephrased by saying that if e(i) originates from X(j,k) then
E(i) is to be soft combined with U(j,k). Y and Z are in the same way related to V and W
respectively. But it has been shown in the previous paragraphs in what manner the bits of
X, Y and Z appear in e column by column. In the same manner, the bytes of E are to be
soft combined with the bytes of U, V and W column by column.
This technique will be used extensively and therefore calls for some terminology. If e is
long enough then X can be traversed multiple times while producing e. One says that
there are multiple repetitions of X in e. In the same manner U is to be soft combined with
multiple repetitions in E. One also says that a column repetition of U is read from E and
soft combined with a column of the matrix. It is therefore important to be mindful of the
difference between a column repetition and a column. The former is in E, while the latter
is in a matrix. The word “column” will never be used as a substitution for “column
repetition”. Also a column repetition is to be soft combined with only the non NULL
bytes of its corresponding column.
But it would be inefficient to actually permute the columns of U, V and W. Instead the
bytes of A, B and C are placed into the matrices as described but the columns are not
14 One cannot argue that confusion arises since a bit array can be considered to be a byte array. The two
types of arrays are different. This is due to the strict definition of an array presented for the purpose of this
report in paragraph 1.3.4.2.
permuted. Now take U as an example and recall that Permutation I would use the function P
to permute its columns. When reading one of U's column repetitions from E, the function
P can be used to calculate which one of U's columns it should be soft combined
with.
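As a sketch, the lookup can use a 32-entry table for P. The table below is the inter-column permutation pattern of the LTE sub-block interleaver (3GPP TS 36.212), assumed here to be the function P used by Permutation I; its values agree with those quoted in the example that follows (P(2) = 8, P(3) = 24, P(30) = 15, P(31) = 31). The function name is hypothetical.

```c
/* The 32-column permutation pattern of the LTE sub-block interleaver
 * (3GPP TS 36.212), assumed here to be the function P used by
 * Permutation I. Because U's columns are NOT physically permuted, the
 * i-th column repetition read from E is soft combined with column
 * P(i mod 32) of U. The function name is hypothetical. */
static const unsigned char P[32] = {
     0, 16,  8, 24,  4, 20, 12, 28,  2, 18, 10, 26,  6, 22, 14, 30,
     1, 17,  9, 25,  5, 21, 13, 29,  3, 19, 11, 27,  7, 23, 15, 31
};

unsigned target_column(unsigned i)   /* i = index of a column repetition */
{
    return P[i % 32];
}
```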
The argument presented has been lengthy and based on other arguments from previous
paragraphs. This calls for a specific example that clarifies the final idea. Figure 22 will be
used for this purpose. Remember that Permutation I has used the function P to permute
the columns of X and Y (see Paragraph 5.3.1.2). Permutation II has used the function Q to
permute the columns of Z (see Paragraph 7.1.1). Suppose Rate dematching is to be
performed for three byte arrays A, B, C and E, and that the input parameters D, N, T and S
have the same values as in the figure’s example. What values the arrays contain is
irrelevant for the purpose of this example. A, B and C are used to form the matrices U, V
and W as previously stated. However their columns are not permuted.
Now, pay attention to the bit array e in the figure. Due to the value of T, the variable
offset equals 2R = 2 · 5 = 10. So the first repetition in e belongs to X but it is only partial
because it skips the matrix's first two columns. So the following can be said of the first
32 − 2 = 30 column repetitions in e:
The first column repetition belongs to column 2 of X.
The next column repetition belongs to column 3 of X.
The next column repetition belongs to column 4 of X.
…
The next column repetition belongs to column 30 of X.
The final column repetition of the repetition belongs to column 31 of X.
Because of this, the following can be said of the first 32 − 2 = 30 column repetitions in
E.
The first column repetition is to be soft combined with column P(2) = 8 of U.
The next column repetition is to be soft combined with column P(3) = 24 of U.
…
The next column repetition is to be soft combined with column P(30) = 15 of U.
The final column repetition of the repetition is to be soft combined with column
P(31) = 31 of U.
Next in e there comes a bit sequence that contains a repetition of Y, but it also contains a
repetition of Z. Y’s repetition will be regarded first. It is complete, so it contains 32
column repetitions that belong to Y. Each of them has been interlaced bit by bit with a
part of Z's repetition. Besides this, the following obvious observation can be made:
The first column repetition belongs to column 0 of Y.
The next column repetition belongs to column 1 of Y.
…
The final column repetition of the repetition belongs to column 31 of Y.
Therefore it can be deduced that next in E there comes a byte sequence that contains a
repetition of V and W. Beginning by regarding V’s repetition, it is also complete with 32
column repetitions. Each of them has been interlaced byte by byte with a part of W’s
repetition. So if iE and jE are the first and last bytes of one of V’s column
repetitions, then the subarray jiE contains the entire column repetition but only
every second byte is to be used for soft combination. Keeping this in mind, soft
combination can proceed as follows:
The first column repetition is to be soft combined with column P(0) = 0 of V.
The next column repetition is to be soft combined with column P(1) = 16 of V.
…
The final column repetition of the repetition is to be soft combined with column
P(31) = 31 of V.
Likewise in the same bit sequence of e there is also a complete repetition of Z. Its
column repetitions appear in increasing order of column index too, and each one has been
interlaced bit by bit with a part of Y’s repetition. Therefore in the same byte sequence of
E there is a complete repetition of W, and its column repetitions have been interlaced byte
by byte with a part of V's repetition. Keeping this in mind, soft combination can proceed
as follows:
The first column repetition is to be soft combined with column Q(0) = 1 of W.
The next column repetition is to be soft combined with column Q(1) = 17 of W.
The next column repetition is to be soft combined with column Q(2) = 9 of W.
…
The next column repetition is to be soft combined with column Q(30) = 16 of W.
The final column repetition of the repetition is to be soft combined with column
Q(31) = 0 of W.
Finally, in e there is a partial repetition of X. It covers the first 29 column repetitions, but
the last one is partial. These are to be soft combined with the columns
P(0) = 0, P(1) = 16, …, P(28) = 7 and P(29) = 23
of U, but only some of the topmost non NULL bytes of column 23 are soft
combined with.
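As a side note, the values of Q used above are consistent with the relation Q(i) = (P(i) + 1) mod 32, where P is Permutation I's column permutation. This relation is an observation made here, not a definition taken from the specification; the sketch below merely demonstrates it against the values quoted.

```c
/* Observation: the column order used for W above matches
 * Q(i) = (P(i) + 1) mod 32, where P is Permutation I's column
 * permutation (the LTE sub-block interleaver pattern). */
static const unsigned char P_pattern[32] = {
     0, 16,  8, 24,  4, 20, 12, 28,  2, 18, 10, 26,  6, 22, 14, 30,
     1, 17,  9, 25,  5, 21, 13, 29,  3, 19, 11, 27,  7, 23, 15, 31
};

unsigned Q(unsigned i)
{
    return (P_pattern[i] + 1u) % 32u;
}
```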
In all coming discussion the matrices U, V and W effectively replace A, B and C. The
reader should remember how the former can be produced from the latter, and that the
matrices' columns are not permuted. Each matrix has 32 columns and up to
⌈D/32⌉ = 193 rows.
7.1.5 Remarks regarding SatFunc and soft combining
The binary function SatFunc is used to perform soft combining. It is commutative but
not associative. If a series of bytes are to be soft combined, then this must be done in the
order that they appear.
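SatFunc's exact definition is not repeated in this chapter. As a hypothetical stand-in with exactly the stated properties, saturating signed addition can be used: it is commutative but not associative.

```c
#include <limits.h>

/* Hypothetical stand-in for SatFunc: saturating addition of signed
 * bytes. It is commutative, sat(a, b) == sat(b, a), but not
 * associative near the saturation limits: sat(sat(100, 100), -100)
 * gives 27 while sat(100, sat(100, -100)) gives 100. Hence bytes
 * must be combined in the order that they appear. */
signed char sat(signed char a, signed char b)
{
    int s = (int)a + (int)b;
    if (s > SCHAR_MAX) s = SCHAR_MAX;
    if (s < SCHAR_MIN) s = SCHAR_MIN;
    return (signed char)s;
}
```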
7.1.6 Typical input parameters for Rate dematching
Typical input parameters of Rate dematching are important to know before a processing
algorithm is discussed. The matrices are to be cleared if Rate dematching’s input
parameter CLEAR indicates this. This happens roughly nine out of ten times. The input
parameter T is hardly ever anything else but zero ([2], Table 8.6.1-1). This means that
mostly only the two first column repetitions of U are skipped in E.
The bit array w in Rate matching contains 3(D − N) non NULL bits, and so it is
traversed S / (3(D − N)) times when producing e. The same ratio multiplied by 32 tells how
many column repetitions of a matrix can be found in E, although some of them could be
in partial repetitions of the matrix. So, to clarify, if S / (3(D − N)) = 1 and T = 0 then there
are exactly 32 column repetitions in E belonging to U. Thirty of them appear first in E in
a partial repetition of U, while the last two appear last in E in another partial repetition.
The value of this ratio can vary greatly, making it difficult to predict the number of
column repetitions that must be processed for each matrix. Its value is typically within
0.5 to 1, but it can be as low as 0.38 or as high as 6 ([2], Table 7.1.7.2.1-1). However
higher values than 1.1 rarely occur and do so only for small matrices and E. This means
that it hardly ever happens that a column of a matrix is to be soft combined with more
than one column repetition.
7.2 Suggestion for processing algorithm
A processing algorithm for Rate dematching is suggested in this paragraph. First some
interesting observations are made that can be used to write an efficient algorithm. Then
different strategies for dividing the work among the job DSPs are discussed.
The algorithm will use the matrices U, V and W and place the byte arrays A, B and C
into them. Each array is initially located in CM when the processing step is to be
performed. K − D NULL bytes must first be inserted at the array's beginning, but this
happens to be an even number of bytes. Suppose a DSP is to process A. For this it can
reserve an array of K/2 words in its LDM. The first (K − D)/2 words are considered to
be NULL and A is read from CM into the rest of the words.
7.2.1 The benefit of clearing U, V and W
If each matrix is to be cleared due to the value of CLEAR, then once this has been done
its 32 first column repetitions in E can be directly inserted into it without performing soft
combination. This is because of the simple observation SatFunc(0, i) = i for any byte i.
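A sketch of how the observation can be exploited follows; all names are hypothetical, and sat() stands in for SatFunc (saturating signed addition is assumed).

```c
#include <limits.h>

/* sat() is a hypothetical stand-in for SatFunc (saturating signed
 * addition assumed). */
signed char sat(signed char a, signed char b)
{
    int s = (int)a + (int)b;
    if (s > SCHAR_MAX) s = SCHAR_MAX;
    if (s < SCHAR_MIN) s = SCHAR_MIN;
    return (signed char)s;
}

/* When the matrix has just been cleared, a column of its first
 * repetition in E can be copied straight in, because
 * SatFunc(0, i) = i for any byte i. All names are hypothetical. */
void combine_column(signed char *col, const signed char *rep,
                    unsigned n, int first_after_clear)
{
    unsigned r;
    for (r = 0; r < n; ++r)
        col[r] = first_after_clear ? rep[r] : sat(col[r], rep[r]);
}
```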
7.2.2 Efficient soft-combining of multiple matrix repetitions
Suppose that the first column repetition of U's column t in E is the subarray E(i…j).
The column is to be soft combined with it, but then it must also be soft combined with
E(i + 3(D − N) … j + 3(D − N)), E(i + 6(D − N) … j + 6(D − N)),
E(i + 9(D − N) … j + 9(D − N)), E(i + 12(D − N) … j + 12(D − N)), …
and so on until the end of the array. This is because there can be multiple repetitions of
U and between them there are repetitions of V and W. Then all the mentioned subarrays
of E can be read from CM to LDM. The benefit of this is that if U(k) is in column t and it
is to be soft combined with E(l), then it must also be soft combined with
E(l + 3(D − N)), E(l + 6(D − N)), E(l + 9(D − N)), E(l + 12(D − N)), …
but these bytes of E have already been read to LDM. Therefore U(k) can be read from
LDM, soft combined with them all, and then be written back to LDM. Compare this to
being forced to read and write the byte to the memory once per soft combination.
The method described needs to be elaborated a bit further before it is practically usable.
However, this will not be specified because it would still be of no use due to what
is stated in Paragraph 7.1.6. It rarely happens that any matrix has multiple repetitions in
E, and even then the matrices and E are very small. It is not feasible to take this into
consideration for an algorithm.
7.2.3 Working with bytes in words
The processing step works with bytes, but a byte cannot be addressed in CM (see
Paragraph 2.2). So when an array of bytes in CM is to be read to LDM, then the array
must first be expanded so that it encompasses complete words if this is not the case. Thus
when a CM instruction is executed that reads the resulting word array from CM to LDM,
it might be so that up to two bytes at the ends of the array are of no interest. This problem
does not make it more difficult to write an algorithm, but it makes the process of
implementing it more error-prone. It should therefore be taken into consideration when
specifying an algorithm.
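Assuming two bytes per CM word (consistent with the K/2-word LDM buffer mentioned in Paragraph 7.2), expanding a byte range to whole words can be sketched as follows; the function name is hypothetical.

```c
/* Sketch (hypothetical name): expand a byte range [first_byte,
 * last_byte] in CM to whole words, assuming two bytes per word. Up to
 * one byte at each end of the expanded range is of no interest and
 * must be ignored by the caller. */
void expand_to_words(unsigned first_byte, unsigned last_byte,
                     unsigned *first_word, unsigned *word_count)
{
    *first_word = first_byte / 2;
    *word_count = last_byte / 2 - *first_word + 1;
}
```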
7.2.4 How to parallelize
Two strategies were considered for dividing the workload among the job DSPs. They
differ in what parts of the matrices a DSP is designated to process. This includes clearing
those parts if necessary and then to soft combine them with bytes of E. In the first
strategy, each DSP is designated all the rows in one of the matrices. So only three DSPs
are used. In the second strategy, each DSP is given row i to j of each matrix. So there can
be as many DSPs as there are rows. Each DSP is then interested in a subarray of every
column repetition in E, no matter what matrix it belongs to. Figure 23 illustrates the two
strategies with three DSPs each.
Figure 23: The figure shows the two strategies that were considered for dividing the processing step’s workload
among the job DSPs. Both strategies are shown with three DSPs for easy comparison, although the second
strategy can use more.
Benefits of the first strategy are:
Efficient use of CM instructions and the memory in LDM.
A DSP can fit its entire matrix into LDM. A CM instruction can then read as
large part of the matrix’s repetition as LDM allows. The number of required CM
instructions is thus small.
No synchronization required.
A DSP processes its own matrix irrespective of what the others do. It reads the
repetition of the matrix from E and when it is done it writes the matrix back to
CM. The work does not need to be synchronized.
Simpler processing algorithm.
A simpler algorithm does not necessarily mean a more efficient implementation.
But the algorithm’s simplicity must be taken into consideration because writing
an implementation for Rate dematching is already very error-prone. Bytes cannot
be addressed in CM (see Paragraph 7.2.3), and the columns of V and W have
been interlaced in E in a special way (see 7.1.3). These are only two concerns
that make an implementation susceptible to errors. For this strategy, the required
CM instructions that transfer data between CM and LDM have been described.
It is relatively easy to calculate what parts of E are of interest for a DSP (namely
those where the matrix’s repetitions are located), and the DSP works with
complete columns.
Better improvements by optimizing critical loops.
If an implementation is to perform the soft combination column by column, then
this strategy is preferable for it works with complete columns. A critical loop
that soft combines a column will execute for a longer duration of time per
column it processes. This increases the amount of time that the implementation
spends in the loop. Optimizing the loop gives therefore greater reduction in the
implementation’s execution time.
The disadvantages of the first strategy are:
The number of job DSPs is restricted.
One DSP can first process one matrix and then another, but there cannot be
more than three of them sharing the workload.
Bad workload balancing due to the interlaced bytes of V and W.
A DSP that is to process U reads a repetition of it from E. It is interested in
every byte that it reads. But a DSP that reads a repetition of V or W from E is
only interested in roughly every second byte, due to how the matrices’ bytes
have been interlaced. The CM instructions of the latter case take more time to
execute, but the same problem occurs when considering LDM instructions.
When the DSP has placed a repetition of the matrix into its LDM and begins to
read sequences of bytes from it into its temporary registers, it is again only
interested in every second byte. The critical loop will therefore require more
time to execute.
Bad workload balancing due to where E ends.
E may end abruptly anywhere in the repetition of a matrix. If there is a complete
repetition of U but only one column repetition of V and W in E, then two of the
DSPs will finish executing much faster than the third.
Benefits of the second strategy are:
Only practical restrictions on the number of DSPs.
There can theoretically be as many of them sharing the workload as each matrix
has rows.
No synchronization required.
No synchronization is required for this strategy either.
Good workload balancing.
The previous strategy had workload balancing issues due to how the column
repetitions of V and W have been interlaced in E. But in this strategy a DSP is
given the rows i to j of U, V and W respectively. So each DSP must participate
in processing V and W, which is more laborious than processing U. Also, E can end anywhere
in a matrix's repetition but the workload will still be nearly equal for the DSPs.
This is because for each column repetition in E, every DSP is interested in a part
of it. But in the previous strategy only one DSP is interested in that column
repetition.
No unnecessary bytes read from CM.
The repetitions of V and W in E have been interlaced. Suppose the subarray
E(k…l) is retrieved from CM to be soft combined with the rows i to j of
column c in V. Only every second byte of the subarray is of interest for this task.
But Paragraph 7.1.3 shows how nearly all the other bytes are to be soft
combined with column d of W, and in fact most of them belong to rows i to j
too. The method needs to be carefully elaborated further if it is going to be used.
But in this strategy a DSP is to process the rows i to j in both of the columns c
and d. For one repetition of V and W in E, it can obtain the bytes that it requires
for this task by using a single CM instruction.
The disadvantages of the second strategy are as follows:
More CM instructions, and inefficient use of them.
A DSP must still process every column of the matrices, even if only rows i to j
are covered in each column. Consider a complete repetition E(k…l) of a matrix.
Since a DSP processes a part of each column, it needs 32 separate subarrays of
the repetition. 32 CM instructions must therefore be executed to retrieve them to
LDM. The subarrays become shorter the more DSPs share the matrices' rows.
This makes the read latency of each CM instruction more significant compared
to the time spent on reading actual data.
More complex processing algorithm.
For each repetition of a matrix in CM, it must be calculated where the 32
subarrays that a DSP requires are located. Only performing these calculations is
error-prone. But retrieving each of them to LDM makes it worse due to the
reasons stated in Paragraph 7.2.3. It was previously suggested that a single CM
instruction can obtain bytes that are to be soft combined with a column of V and
W respectively. However this too requires careful calculations.
More inefficient implementation and less beneficial optimizations.
As mentioned, this strategy requires more calculations to find the subarrays of E
that are of interest to a DSP. Also, a DSP covers only a part of every column.
Suppose that in the algorithm a critical loop is to soft combine one column each
time it is executed. If at least two DSPs are used then the execution time of this
critical loop is less than a half of the critical loop of the previous strategy. This
makes it less beneficial to optimize it. One can argue that the loop can therefore
process more than one column each time it is executed, but the same argument
can be made for the other strategy.
Perhaps the greatest benefit of the second strategy is its good workload balancing. But it is
believed that if it uses three DSPs, as the first strategy does, then the latter processes the same
input quicker. Estimates were made of the amount of time required to execute CM
instructions for each strategy. Longer execution time is not an acceptable price for better
workload balancing.
On the other hand, the second strategy can divide the work among multiple job DSPs
thus reducing execution time. But it will not be reduced significantly. The matrices’
columns are only 193 rows long. If three DSPs are used then each of them is to process
roughly 60 rows. It is true that the number reduces the more DSPs are used, but there is a
constant amount of work that must be performed by each DSP for each column repetition
in E. This becomes more notable as more DSPs are used.
Yet the most convincing reason speaking in favor of the first strategy is related to a
future version of EMPA. For reasons that are omitted here, a new processing step will be
required immediately after Rate dematching is done with A, B and C (see [10] Paragraph
3.3.4). The processing step permutes the three byte arrays individually. Specifically, it
places the bytes A(0 … D−5) in a matrix row by row, and the output is produced by
reading it column by column. The last bytes A(D−4 … D−1) that were omitted are then
appended to the output.
The matrix's dimensions in the processing step depend on D; the details are omitted.
It is notable though that its number of columns greatly exceeds 32 for all but very small
values of D. For example, if D = 6148 then it will have 384 columns and 16 rows.
By using the first strategy, once a job DSP has performed Rate dematching over its byte
array that is in its LDM, it can immediately proceed to perform this processing step. No
new JOB dispatch is required.
Using the second strategy, each job DSP must first write its rows of the matrices to CM
and then they all must synchronize at a barrier before they can proceed with this new
processing step. Barrier synchronization is very costly due to the reasons stated in
Paragraph 3.4 because it can only be implemented by performing a new JOB dispatch.
In light of these important differences, the first strategy was chosen as a base for a
processing algorithm.
7.3 Description of the processing algorithm
The algorithm uses three job DSPs. Each one is designated to process one of the byte
arrays A, B or C, and it must clear the array if necessary and then soft combine it with E.
All input parameters, including the four arrays, are in CM when the JOB begins.
Suppose a DSP is to process A. When it begins executing the JOB, it first allocates
memory in its LDM for the matrix U. It then reads A from CM and places it into U as
described in Paragraph 7.2. It proceeds by reading repetitions of U in E from CM into
LDM and performing the soft combination. The soft combination is done column by
column as described in Paragraph 7.1.4. Paragraph 7.2.1 shows how a more efficient
algorithm can be written if the matrix is to be cleared. However this was not considered
in this algorithm, but it was utilized later when the implementation was optimized.
8 Conclusion
This chapter concludes the report by summarizing the thesis' results.
8.1 Implementation of Channel deinterleaver
A parallelized implementation for the processing step Channel deinterleaver that can
run on EMPA has been written during the thesis work (Chapter 2 describes EMPA and
Chapter 6 describes the processing algorithm). Its correctness has been verified to the
extent stated in Paragraph 8.3. Its execution time is linear in the input size. Its
memory usage with respect to the Common Memory is linear in the input size
(Paragraph 2.2 describes the Common Memory of EMPA). Specifically, the
implementation requires a sequential portion of the memory that is as large as the
input. The complexity of its execution time and memory usage is irrespective of the
number of job DSPs that perform the processing step (Paragraph 3.2 describes EMPA’s
job DSPs and their role in executing processing steps).
There can be as many job DSPs executing the implementation as the matrix in the
implementation's input has rows (Paragraph 5.2.2 specifies Channel deinterleaver and
explains which matrix is meant, while Paragraph 6.2.1 explains how the work is divided among the
job DSPs). However, Appendix 2.3.3 shows that using more than eight job DSPs is
infeasible for even large input.
The implementation was written in C code. Its execution time was improved for all
possible inputs by optimizing C code (Appendix 2.3.2 describes the optimizations
performed in C code). The execution time was further improved for certain types of input
by optimizing assembler code (Appendix 2.3.4 explains what types, and describes the
optimizations that were performed in assembler code). To give an idea of to what extent
the implementation was optimized, the following table compares the performance of the
implementation’s first version to the final version. This spans over both the optimizations
performed in C code as well as assembler code. The two versions’ execution times have
been measured for some test cases that are specified by Appendix 2.2.2. The execution
times tell how long it would take to execute each test case on EMPA; however, their exact
values cannot be revealed and they have therefore been normalized to the smallest one in
the table.
Table 5: The table compares the performance of the first version of the implementation to the final version.
The execution times of some test cases are shown for respective version. However, all the execution times
have been normalized to the smallest of them.
Test case      Execution time,   Execution time,   Final execution time
               first version     final version     compared to first
perf_600_2     13.80             1                 7.25%
perf_600_6     20.85             2.33              11.15%
perf_600_6_q   29.79             2.07              6.95%
None of the implementation’s critical loops are optimal (Chapter 4 specifies what is
required by a loop that is executed by a DSP to be optimal). Appendix 2.4 suggests how
the implementation can be further improved, which would bring most of its critical loops
to optimality and the rest of them closer to optimality.
8.2 Implementation of Rate dematching
A parallelized implementation for the processing step Rate dematching that can run on
EMPA has been written throughout the thesis (Chapter 2 describes EMPA, while Chapter
7 describes the processing algorithm). Its correctness has been verified to the extent
stated in Paragraph 8.3. Its execution time is linear compared to the input size. Its
memory usage with respect to the Common Memory is constant compared to the input
size, if one disregards from the amount of memory required to store the input (Paragraph
2.2 describes the Common Memory of EMPA). Specifically the implementation requires
a small portion of the memory and its size is irrespective of the input’s size. The
complexity of its execution time and memory usage is irrespective of the number of job
DSPs that perform the processing step (Paragraph 3.2 describes EMPA’s job DSPs and
their role in executing processing steps).
No more than three job DSPs can be used to execute the implementation (Paragraph
7.2.4 explains how the work is divided among the job DSPs).
The implementation was written in C code. Its execution time was improved for all
possible inputs by optimizing C code and then assembler code (paragraphs 3.3.1 and
3.3.2 in Appendix 3 describe the optimizations). To give an idea of the extent to which
the implementation was optimized, the following table compares the performance of the
implementation's first version to the final version. The two versions' execution times
have been measured for some test cases that are specified in Appendix 3.2.2. The
execution times tell how long it would take to execute each test case on EMPA; however,
their exact values cannot be revealed and they have therefore been normalized to the
smallest one in the table.
Table 6: The table compares the performance of the first version of the implementation to the final version.
The execution times of some test cases are shown for the respective versions. All the execution times have
been normalized to the smallest of them.

Test case        | Execution time, first version | Execution time, final version | Final compared to first
matrix_cleared_U | 44.06                         | 1.80                          | 4.08%
matrix_cleared_V | 44.77                         | 2.21                          | 4.95%
matrix_cleared_W | 44.16                         | 2.23                          | 5.06%
matrix_U         | 42.26                         | 2.50                          | 5.92%
matrix_V         | 42.98                         | 2.96                          | 6.89%
matrix_W         | 42.37                         | 2.99                          | 7.06%
overhead_W       | 1.85                          | 1                             | 54.01%
None of the implementation's critical loops are optimal (Chapter 4 specifies what is
required of a loop executed by a DSP for it to be optimal). Appendix 3.4 suggests how
the implementation can be further improved, which would bring all of its critical loops
closer to optimality, but not fully.
8.3 The implementations' correctness

ELTE provides test cases that can be used to verify that an implementation of a
processing step complies with the step's specification. Why these tests are considered to
be correct and how they are produced is omitted, but it is noteworthy that the division of
ELTE where they are produced is different from the division where the implementations
are written.
Such tests have been used to verify that the implementation of Channel deinterleaver
works correctly. Additional tests were produced in the thesis as specified in Appendix
2.2.1. The reason is that the tests provided by ELTE were few in number and, worse,
they test the implementation only for large input. The author deemed that even though
ELTE's tests can be used to verify that the processing step's specification has not been
misunderstood while writing the processing algorithm, they could not be relied on to
verify whether any mistakes had been made while writing the implementation.15
Nevertheless, the implementation works correctly with respect to ELTE's tests.
However, the tests that ELTE provides for verifying an implementation of Rate
dematching could not be used. The reason is that every test's output is produced from
the input by applying Rate dematching and then a subsequent processing step. There was
not enough time to understand that step, let alone to write even a simple
implementation of it. The author could therefore only rely on the tests that he produced
as specified in Appendix 3.2.1. They are many in number and were produced by a
separate implementation of the processing step. It is not a parallelized implementation,
and it was developed by following the step's specification as closely as possible. Only on
a few occasions was its performance cautiously improved, since the amount of time it
would otherwise require to produce the tests was unacceptable.
There is still no guarantee that the author did not misunderstand the processing
step's specification when writing this simple implementation, let alone when writing
the parallelized implementation that is part of the thesis result. Thus it is up to ELTE to
further verify the correctness of the parallelized implementation and make corrections if
it does not follow the specification the way they believe to be correct.
15 Such mistakes are more informally known as “bugs”.
Appendices

Here follow the appendices of the report. Appendix 1 briefly explains what tools were
used to write implementations during the course of the thesis. Appendix 2 and Appendix
3 are devoted to specifying in detail each processing algorithm that was implemented,
and also discuss how the implementations were tested and optimized.
As Appendix 1 will explain, a certain compiler was used to produce implementations for
EMPA. Appendix 4 reviews the assembler code produced by the compiler and how
efficiently it uses EMPA's capabilities.
Appendix 5 suggests hardware changes to EMPA that can make it easier to write
efficient implementations.
To avoid confusion when referring to paragraphs in the appendices, if the paragraph is
in the appendices then its number is simply stated. However if the paragraph is in one of
the report’s main chapters, then the chapter number is stated first. For instance
“Paragraph 2.2” refers to a paragraph in Appendix 2, while “Chapter 2, Paragraph 2.2”
naturally refers to a paragraph in one of the report’s main chapters.
Appendix 1. Tools used to write implementations
ELTE has developed a compiler that compiles C code to execute on EMPA. This
compiler will from now on be referred to as the EMPA Compiler (EMC).
A copy of EMPA is not available for executing implementations and seeing if they work
correctly. Instead a simulator is used that runs on a host computer with UNIX. This
simulator will from now on be referred to as the EMPA Simulator (EMS). All the
capabilities of EMPA are simulated, including concurrent execution of multiple DSPs.
The number of DSPs can be chosen freely; the host computer is not required to have
multiple cores for this. The execution time between any two points of the implementation
can be precisely measured as if it were running on EMPA.
Data from files on the host computer can be written to CM in the simulator. This can
then be used as input to an implementation. To verify the correctness of the
implementation, the results that it produces in CM can be compared to contents of files
on the host computer. EMS constitutes the backbone of the debugger that is used to find
errors in implementations.
Therefore EMS was greatly used in the thesis to verify correctness and measure
performance of implementations that were written.
Appendix 2. Channel deinterleaver

This appendix is devoted to specifying the processing algorithm that was used as a base
for the implementation of Channel deinterleaver, and to explaining how the
implementation was developed.
It is Paragraph 2.1 that specifies the processing algorithm. Paragraph 2.2 describes the
test cases that were used for verifying correctness and performance while writing the
implementation.
Paragraph 2.3 explains what steps were taken to optimize the implementation. The
improvements are described step by step, and how they changed the performance is
measured with the test cases. The implementation's final performance is also presented.
The appendix is concluded by Paragraph 2.4 that explains how the implementation
should be improved in the future.
2.1 Specification of the processing algorithm

The processing algorithm will now be presented by specifying the functions that it uses
(refer to Chapter 6, Paragraph 6.2 to understand this paragraph). This is done in the
following subparagraphs. It is noteworthy that these functions are also used in the first
version of the implementation. The functions are deint_master, deint_slave,
deint_section, deint_segment and deint_column. The latter three process the section.
deint_master is the master function and deint_slave is the entry function (see Chapter 3,
Paragraph 3.2 for how a JOB is dispatched). None of the functions have an explicit return
value. Since this is the first processing algorithm to be presented, the dispatch procedure
performed by the master and entry functions will be thoroughly described.
2.1.1 Specification of deint_master
This is the master function. The function does not calculate how the matrix is divided
into sections. Instead each job DSP calculates where its own section is. This is done to
dispatch the JOB only once and to reduce the amount of work performed by the dispatch
DSP executing the master function. Once the dispatch has been performed, the function
awaits a message that notifies it that the JOB has been completed.
The function’s input is as follows:
CM_input, a pointer to the beginning of the matrix in CM.
CM_output, a pointer to the beginning of where the matrix is to be written in CM.
nr_cols, the number of columns the matrix has.
nr_rows, the number of rows the matrix has.
symb_size, the number of bytes per symbol.
nr_q_symbs, the number of q symbols in the matrix.
nr_DSPs, the number of job DSPs that the JOB is to be dispatched to.
The function works as follows:
Algorithm 9: The algorithm specifies how deint_master works.
deint_master (
CM_input, CM_output, nr_cols, nr_rows, symb_size, nr_q_symbs, nr_DSPs
) {
write CM_input, CM_output, nr_cols, symb_size, nr_rows,
nr_q_symbs, nr_DSPs to CM at location CM_deint_input;
dispatch the JOB deint_slave(CM_deint_input) to #nr_DSPs job DSPs;
await a message with signal job_done;
return;
}
2.1.2 Specification of deint_slave
This is the entry function that is the first function to be initiated on every job DSP that
participates in the JOB.
What section of the matrix the DSP is to work with is calculated by (i) reading the input
parameters stored in CM by the master function and (ii) reading the unique worker id that
has been assigned to the DSP and is located in the received message’s header.
The function uses two small functions, namely (i) calculate_section_position to
calculate the section’s position and size and (ii) calculate_segment_size to calculate the
maximum size a segment can have. These are specified by Algorithm 11.
At the end of the entry function the DSP checks whether it was the last to finish its
section. If so, a message is sent back to the dispatch DSP running the master function,
notifying it that the JOB has been completed. The job DSPs coordinate this by using a
lock that has been initiated by the master function. The details have been omitted.
The function’s input is as follows:
CM_deint_input, a pointer to CM where the master function stored the input
parameters.
The function works as follows:
Algorithm 10: The algorithm specifies how deint_slave works.
deint_slave (CM_deint_input) {
Read the variables CM_input, CM_output, nr_cols, symb_size, nr_rows,
nr_q_symbs and nr_dsps from CM at location CM_deint_input;
//See the input of deint_master for an explanation of each.
read the variable worker_id (0 <= worker_id < nr_dsps)
from the dispatch message;
//Calculate on what row the DSP’s section begins and
//how many rows it spans over.
(sec_start_row, sec_nr_rows) =
calculate_section_position (nr_rows, worker_id, nr_dsps);
//Calculate the maximum number of rows a segment can span over.
seg_nr_rows = calculate_segment_size (nr_cols, symb_size, ldm_size);
//Reserve memory in LDM for the segment.
reserve in LDM seg_nr_rows*nr_cols*symb_size bytes and
let LDM_seg be a pointer to the location;
//Reserve memory in LDM for the column buffer.
reserve in LDM seg_nr_rows*symb_size bytes and
let LDM_col_buff be a pointer to the location;
//Process the section.
deint_section (
CM_input, CM_output, nr_cols, symb_size, nr_rows, nr_q_symbs,
sec_start_row, sec_nr_rows, seg_nr_rows, LDM_seg, LDM_col_buff
);
if (this DSP is the last one to finish) {
send a message with signal job_done to
the dispatch DSP that dispatched this JOB;
}
return;
}
The following algorithm specifies the two functions that were used by deint_slave:
Algorithm 11: The two functions that are used by deint_slave are specified.
//The function calculates where the section of a job DSP begins and
//over how many rows it spans.
calculate_section_position (nr_rows, worker_id, nr_dsps) {
divide the rows of the matrix in #nr_dsps sections,
where the largest section is no more than
one row larger than the smallest;
choose section #worker_id by assigning
the variables sec_start_row and sec_nr_rows;
return sec_start_row, sec_nr_rows;
}
//The function calculates the maximum number of rows
//that a segment can span over.
calculate_segment_size (nr_cols, symb_size, ldm_size) {
seg_nr_rows = round_down (ldm_size/(symb_size*nr_cols + symb_size));
//This is a solution to the equation of Formula 9.
return seg_nr_rows;
}
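As a concrete illustration, the two helpers can be sketched in ordinary C. The signatures and the exact row-partitioning rule (the first nr_rows mod nr_dsps sections receive one extra row) are assumptions made for this sketch; the report does not show the real EMPA code.

```c
#include <assert.h>

/* Hypothetical C sketch of Algorithm 11 (names and types are assumptions).
 * The rows are split so that no section is more than one row larger than
 * any other: the first (nr_rows % nr_dsps) sections get one extra row. */
static void calculate_section_position(int nr_rows, int worker_id, int nr_dsps,
                                       int *sec_start_row, int *sec_nr_rows)
{
    int base  = nr_rows / nr_dsps;  /* rows that every section gets   */
    int extra = nr_rows % nr_dsps;  /* sections that get one row more */

    *sec_nr_rows   = base + (worker_id < extra ? 1 : 0);
    *sec_start_row = worker_id * base + (worker_id < extra ? worker_id : extra);
}

/* Largest seg_nr_rows such that the segment plus the column buffer fit in
 * LDM: seg_nr_rows*(nr_cols*symb_size + symb_size) <= ldm_size (Formula 9).
 * Integer division performs the round_down. */
static int calculate_segment_size(int nr_cols, int symb_size, int ldm_size)
{
    return ldm_size / (symb_size * nr_cols + symb_size);
}
```

For example, 13 rows divided over 4 job DSPs would yield sections of 4, 3, 3 and 3 rows starting at rows 0, 4, 7 and 10.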
2.1.3 Specification of deint_section
This function processes a section. It divides the section into multiple segments and
invokes the function deint_segment on each of them. When a segment has been
processed deint_section writes it to the output in CM. All segments will contain as many
rows as specified by the input, except the last one. It will be as small as required to cover
the end of the section.
Input parameters that have not previously been described are:
sec_start_row, the matrix’s row where the section starts.
sec_nr_rows, how many rows the section spans over.
seg_nr_rows, the maximum size of a segment.
LDM_seg, a pointer to the segment in LDM.
LDM_col_buff, a pointer to the column buffer in LDM.
The function works as follows:
Algorithm 12: The algorithm shows how deint_section works.
deint_section (
CM_input, CM_output, nr_cols, symb_size, nr_rows, nr_q_symbs,
sec_start_row, sec_nr_rows, seg_nr_rows, LDM_seg, LDM_col_buff
) {
divide the matrix from row #sec_start_row to
row (sec_start_row+sec_nr_rows-1) into multiple segments,
each of them with #seg_nr_rows rows except the last that may be smaller;
for (each segment) {
//The current segment starts at row #seg_start_row
//and spans over #curr_seg_nr_rows rows.
deint_segment (
CM_input, LDM_seg, LDM_col_buff, nr_cols, symb_size,
nr_rows, nr_q_symbs, seg_start_row, curr_seg_nr_rows
);
//deint_segment has processed the current segment
//and the result has been stored in LDM at LDM_seg.
write the segment located at LDM_seg to the correct position
of the output at CM_output using one CM instruction;
}
return;
}
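The segment walk of Algorithm 12 amounts to a simple chunked loop. A minimal C sketch follows; the names are assumptions, and the calls to deint_segment and the CM write are reduced to comments:

```c
#include <assert.h>

/* Sketch of the segment loop of deint_section: the section is walked in
 * chunks of seg_nr_rows rows, and only the last chunk may be smaller.
 * The segment starts and sizes are recorded so the walk can be checked. */
static int walk_segments(int sec_start_row, int sec_nr_rows, int seg_nr_rows,
                         int *starts, int *sizes)
{
    int end = sec_start_row + sec_nr_rows;
    int n = 0;
    for (int row = sec_start_row; row < end; row += seg_nr_rows) {
        int rows_left = end - row;
        starts[n] = row;
        sizes[n]  = rows_left < seg_nr_rows ? rows_left : seg_nr_rows;
        /* here: deint_segment(...) on this chunk, then one CM instruction
         * writing the finished segment to its place at CM_output */
        n++;
    }
    return n;
}
```

A section of 25 rows starting at row 100 with seg_nr_rows = 10 would, for instance, be walked as segments of 10, 10 and 5 rows.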
2.1.4 Specification of deint_segment
This function processes one segment. For each column of the segment, it invokes
deint_column. In a last step, all the segment’s q symbols are cleared.
Input parameters that have not previously been described are:
seg_start_row, the matrix's row where the segment begins.
curr_seg_nr_rows, how many rows the segment spans over.
The function works as follows:
Algorithm 13: The algorithm specifies how deint_segment works.
deint_segment (
CM_input, LDM_seg, LDM_col_buff, nr_cols, symb_size,
nr_rows, nr_q_symbs, seg_start_row, curr_seg_nr_rows
) {
for (col_id = 0; col_id < nr_cols; col_id++) {
deint_column (
CM_input, LDM_col_buff, LDM_seg, nr_cols, col_id,
symb_size, nr_rows, seg_start_row, curr_seg_nr_rows
);
}
for (each row in the segment containing q symbols) {
for (each q symbol in the row) {
for (each word of the q symbol) {
set the word to 0;
}
}
}
return;
}
2.1.5 Specification of deint_column
This function reads one of the segment’s columns from CM to the column buffer in
LDM. The column’s symbols are then placed into the segment.
Input parameters that have not previously been described are:
col_id, the column of the segment that is to be processed.
The function works as follows:
Algorithm 14: The algorithm specifies how deint_column works.
deint_column (
CM_input, LDM_col_buff, LDM_seg, nr_cols, col_id,
symb_size, nr_rows, seg_start_row, curr_seg_nr_rows
) {
//The input is CM_input, LDM_col_buff, LDM_seg, nr_cols, col_id,
//symb_size, nr_rows, seg_start_row and curr_seg_nr_rows.
read column #col_id from the input in CM (located at CM_input)
to the column buffer in LDM (located at LDM_col_buff);
for (each symbol in the column buffer) {
for (each word of the symbol) {
place the word at its position in the segment at LDM_seg;
}
}
return;
}
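To make the data movement concrete, here is a plain-C stand-in for Algorithm 14. It assumes the input in CM stores the matrix column by column (so one column is a single contiguous CM read) and that the segment in LDM is built row by row; both layouts are assumptions for the sketch, and memcpy stands in for the word-by-word placement.

```c
#include <assert.h>
#include <string.h>

/* Host-side sketch of deint_column: copy row r of column col_id from the
 * (assumed column-major) input to position [r][col_id] of the row-major
 * segment. symb_size is in bytes. */
static void deint_column_sketch(const unsigned char *cm_input,
                                unsigned char *ldm_seg,
                                int nr_cols, int col_id, int symb_size,
                                int nr_rows, int seg_start_row,
                                int curr_seg_nr_rows)
{
    for (int r = 0; r < curr_seg_nr_rows; r++) {
        const unsigned char *src =
            cm_input + ((size_t)col_id * nr_rows + seg_start_row + r) * symb_size;
        unsigned char *dst =
            ldm_seg + ((size_t)r * nr_cols + col_id) * symb_size;
        memcpy(dst, src, (size_t)symb_size);
    }
}
```

Running this once per column reassembles the segment's rows, which is exactly the deinterleaving step.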
2.2 Test cases used for writing the implementation

Throughout this report, a test case is a single execution of an implementation for a
certain input that the test specifies. The purpose of the test may be to verify the
implementation's correctness or to measure its performance. If it is the former then the
test also specifies the output that is believed to be correct.
Three kinds of test cases were used while writing the implementation of the processing
algorithm that has been suggested. These are tests that (i) verify an implementation’s
correctness, (ii) measure an implementation’s performance and (iii) measure how
intermediate optimizations change an implementation’s performance. These are described
in the following subparagraphs.
The execution times of the test cases that were used for measuring the implementation's
performance cannot be revealed. Throughout this appendix some tables discuss the
implementation's performance by using these tests. In these tables the tests' execution
times have been normalized, which means that every execution time in a table has been
divided by the shortest execution time in the same table.
2.2.1 Test cases for testing correctness
The following table shows the test cases that were used to verify the correctness of an
implementation. Each row is a “test series” that includes multiple test cases. The input
matrix is the same for all of the test cases of a series. The correctness of the
implementation is tested in two ways in each series.
In the first way a dispatch is performed by using the function deint_master (see
Paragraph 2.1 for a specification of the implementation's functions). A varying number
of job DSPs is used, and there is a separate test case for each number of job DSPs.
In the second way only the function deint_section is tested. No dispatch is performed
and the function is invoked on a DSP where the work is performed. The maximum
number of rows that a segment may span over is varied. A separate test case is used for
each segment size.
The main reason for using this approach was to be able to control the segment size in
some of the test cases. It was deemed that some errors in an implementation may surface
only for certain segment sizes. If a dispatch is performed then the entry function
deint_slave calculates what segment size deint_section should use, and this calculation
is solely based on the available memory in LDM.
Table 7: The test cases that were used to verify an implementation's correctness.

test series | number of rows | number of columns | symbol size (bytes) | number of q symbols | job DSPs used when performing a dispatch | segment sizes used when only testing deint_section
1  | 1    | 12 | 2 | 4    | 1, 2           | 1, 2
2  | 1    | 12 | 4 | 4    | 1, 2           | 1, 2
3  | 1    | 10 | 2 | 1    | 1, 2           | 1, 2
4  | 1    | 10 | 4 | 0    | 1, 2           | 1, 2
5  | 1    | 12 | 6 | 2    | 1              | 1
6  | 1    | 10 | 6 | 3    | 1              | 1
7  | 5    | 12 | 4 | 1    | 1, 2, 3, … 5   | 1, 2, 3, … 5
8  | 7    | 10 | 2 | 28   | 1, 2, 3, … 7   | 1, 2, 3, … 7
9  | 10   | 12 | 6 | 0    | 1, 2, 3, … 10  | 1, 2, 3, … 10
10 | 13   | 10 | 4 | 29   | 1, 2, 3, … 13  | 1, 2, 3, … 13
11 | 400  | 12 | 4 | 3999 | 1, 3, 5        | 47, 87, 121
12 | 400  | 10 | 2 | 1197 | 4, 7, 11       | 49, 99, 143
13 | 800  | 12 | 6 | 798  | 13, 16, 19     | 81, 111, 153
14 | 800  | 10 | 4 | 400  | 7, 17, 20      | 59, 96, 173
15 | 1200 | 12 | 6 | 4800 | 9, 15, 21      | 65, 103, 153
Each test series specifies an output that the implementation is expected to produce,
given the input. The output of each test series has been generated using a separate
implementation that was written in the thesis. It runs on an ordinary computer and
follows the processing step’s specification step by step. Performance was not taken into
consideration while writing it in order to make it easier to verify its correctness.
Test series 1 to 10 use small matrices. Due to this it was possible to perform the tests
using all possible segment sizes and number of job DSPs. In fact, the segment sizes and
requested number of DSPs in test series 1 to 4 are greater than the number of rows. It is
required that the implementation works despite this.
Test series 11 to 15 are large, and only specific segment sizes and numbers of DSPs
could be used; otherwise the tests would take too long to execute.
Paragraph 2.3.4 specifies optimizations that were performed by rewriting assembler
code. The optimizations were aimed at matrices with 12 columns and symbol sizes of 2
and 6 bytes. At that stage it was deemed that the test cases of Table 7 were not enough to
further verify the correctness of the implementation. Because of this, twelve new test
cases were introduced. Six of them use small matrices, but they will not be further
specified.
2.2.2 Test cases for measuring performance
The following table presents test cases that were used to measure the performance of an
implementation. Again each row is a “test series”, including multiple test cases. In a
series a varying number of job DSPs are used, and there is a separate test case for each
number of job DSP. The time each DSP requires to execute the function deint_section
for its section is measured.
Table 8: Test cases that were used for measuring an implementation's performance.

test series | number of rows | number of columns | symbol size (bytes) | number of q symbols to reset | number of job DSPs used
1 | 5    | 12 | 2 | 0   | 1
2 | 200  | 12 | 4 | 100 | 1, 2, 3, … 8
3 | 600  | 12 | 2 | 0   | 1, 2, 3, … 8
4 | 600  | 12 | 4 | 0   | 1, 2, 3, … 8
5 | 600  | 12 | 6 | 0   | 1, 2, 3, … 8
6 | 1200 | 12 | 2 | 0   | 1, 2, 3, … 8
7 | 1200 | 12 | 4 | 0   | 1, 2, 3, … 8
8 | 1200 | 12 | 6 | 0   | 1, 2, 3, … 8
The purpose of test series 1 is to see how the implementation performs for a small
matrix. Test series 2 has been designed to see how the resetting of q symbols disturbs the
workload. Only the last section contains q symbols in each test case. As the number of
DSPs is increased, the q symbols cover a larger share of the rows of the last section.
Ultimately with 8 DSPs, each row of the last section contains q symbols.
Test series 3 to 8 use large matrices and their purpose is to measure the
implementation’s performance for varying symbol sizes.
This is a total of 57 test cases. Using them after each optimization step would require
too much time, even disregarding the time required to analyze the results. Therefore they
were only used on two occasions: first when the first version of the implementation had
been completed (results presented in Paragraph 2.3.1), and then
when all the optimizations that were performed in C code had been completed (results
presented in Paragraph 2.3.3). The test cases presented in the next paragraph were
designed to be used more often to measure performance.
2.2.3 Test cases for measuring performance during optimization
The following table presents test cases that were used to measure how an
implementation’s performance changes due to an optimization. Each row is a test case. A
dispatch takes place in each test, and only one job DSP is used. The time it takes for the
DSP to execute the function deint_section for its section (which is the entire matrix) is
measured.
Table 9: Test cases that were used for measuring performance change due to an optimization.

test case name | number of rows | number of columns | symbol size (bytes) | number of q symbols | number of job DSPs used
perf_600_2   | 600 | 12 | 2 | 0    | 1
perf_600_2_q | 600 | 12 | 2 | 2400 | 1
perf_600_4   | 600 | 12 | 4 | 0    | 1
perf_600_6   | 600 | 12 | 6 | 0    | 1
perf_600_6_q | 600 | 12 | 6 | 2400 | 1
These test cases will be extensively referred to in the paragraphs that describe the
optimizations that were performed. Because of this the tests have been named.
The tests’ sole purpose is to clearly indicate how beneficial an attempted optimization
is, without requiring too much time to execute and to analyze the results. This is why the
test cases are few in number, use only large matrices and only one job DSP each.
perf_600_4 was introduced when an optimization was done in C that favored symbol
size 4 bytes, and so was not used from the beginning. perf_600_2_q was first used when
assembler optimizations were made.
The purpose of perf_600_2_q and perf_600_6_q is to see how the optimizations affect
the resetting of q symbols. The matrices have as many q symbols as they can have.
2.3 Implementation of the processing algorithm

The following subparagraphs describe the implementation of the processing algorithm
that has been specified in Paragraph 2.1. Paragraph 2.3.1 describes the performance of
the first version of the implementation. Paragraph 2.3.2 describes the optimizations that
were performed in C code and presents their respective impact on performance.
Paragraph 2.3.3 analyzes the implementation's performance after all the C optimizations.
Paragraph 2.3.4 describes the optimizations that were made in assembler code. Finally
Paragraph 2.3.5 shows how the execution time improved due to the assembler
optimizations.
2.3.1 Performance of the first version of the implementation
When the first version of the implementation had been completed, its performance was
measured by using the test series presented in Paragraph 2.2.2. The following table
presents the results. Each row in the table represents a test series, and each cell in a row
shows a test case in that series. The longest and shortest execution times of the job DSPs
are shown per test case. Recall that only the time required to execute deint_section is
measured for each DSP.
Table 10: For each test case the longest and the shortest execution times of the job DSPs are shown (as
longest / shortest); however, the values have been normalized. Every execution time in the table has been
divided by the shortest execution time.

test series | 1 DSP  | 2 DSPs          | 3 DSPs        | 4 DSPs        | 5 DSPs        | 6 DSPs        | 7 DSPs        | 8 DSPs
1 | 1      | -               | -             | -             | -             | -             | -             | -
2 | 43.71  | 22.89 / 20.85   | 15.91 / 14.08 | 12.62 / 10.58 | 10.56 / 8.53  | 9.13 / 7.09   | 8.10 / 6.06   | 7.48 / 5.45
3 | 83.21  | 41.62 / 41.62   | 27.85 / 27.85 | 20.97 / 20.97 | 16.83 / 16.83 | 14.08 / 14.08 | 12.15 / 12.02 | 10.64 / 10.64
4 | 103.80 | 52.05 / 52.05   | 34.62 / 34.62 | 26.05 / 26.05 | 20.90 / 20.90 | 17.47 / 17.47 | 15.06 / 14.89 | 13.18 / 13.18
5 | 124.40 | 62.22 / 62.22   | 41.67 / 41.67 | 31.13 / 31.13 | 24.96 / 24.96 | 20.85 / 20.85 | 17.98 / 17.77 | 15.72 / 15.72
6 | 166.11 | 83.21 / 83.21   | 55.40 / 55.40 | 41.62 / 41.62 | 33.36 / 33.36 | 27.85 / 27.85 | 24.00 / 23.86 | 20.97 / 20.97
7 | 207.56 | 103.80 / 103.80 | 69.21 / 69.21 | 52.05 / 52.05 | 41.76 / 41.76 | 34.62 / 34.62 | 29.82 / 29.65 | 26.05 / 26.05
8 | 248.75 | 124.40 / 124.40 | 83.03 / 83.03 | 62.22 / 62.22 | 49.89 / 49.89 | 41.67 / 41.67 | 35.92 / 35.71 | 31.13 / 31.13
The execution times of the job DSPs are equal in all test cases except for two kinds. The
first kind is tests where seven DSPs are used: the matrix's number of rows cannot be
evenly divided among the sections, and so some sections are one row longer than others.
The second kind is test cases belonging to series 2, where in each of them the DSP that
clears q symbols has more work to do than the others.
The only test case of series 1 requires many cycles per symbol. The number would be
worse if there were fewer symbols in the matrix. The test’s entire input and output both
fit in LDM. It is not necessary to read the input column by column from CM (as this
implementation does), and instead the entire matrix can be read with one single CM
instruction. If small matrices will be common, then the execution time for them can be
greatly improved by writing a special implementation for them.
Test series 2 shows that uneven load balancing due to q symbols is a practical problem.
In each of its test cases the longest execution time belongs to the DSP that clears q
symbols. The execution times among the remaining DSPs are nearly equal. When using
eight DSPs, the one that clears q symbols requires 37% more time than the others. This
must be addressed by incorporating one of the ideas suggested in Chapter 6, Paragraph
6.2.2.
In test series 3 to 8 the work is well divided among the job DSPs. Compare the
execution time of any test case where one DSP is used to the test case in the same series
where eight DSPs are used. The latter is only slightly more than one eighth of the former.
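This scaling claim can be checked with a small calculation; the helper below is illustrative only and uses the normalized values from Table 10, test series 3 (83.21 with one job DSP versus 10.64 with eight):

```c
#include <assert.h>

/* Parallel efficiency: measured speedup divided by the DSP count.
 * A value of 1.0 means perfect scaling. For series 3 of Table 10,
 * perfect scaling over 8 DSPs would give 83.21 / 8 = 10.40. */
static double parallel_efficiency(double t_one, double t_n, int n)
{
    return (t_one / t_n) / (double)n;
}
```

For series 3 this gives a speedup of roughly 7.82 out of 8, an efficiency of about 98%, which matches the observation that the eight-DSP time is only slightly more than one eighth of the one-DSP time.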
2.3.2 Major optimizations performed in C code
This paragraph describes how the implementation was optimized in C code. Some of the
test cases presented in Paragraph 2.2.3 were used to measure the performance changes.
The first version of the implementation performed as the table below shows. The
subsequent paragraphs describe the optimizations one by one.
Table 11: The table shows how the first version of the implementation performed for some test cases. The
implementation’s execution times for the tests were measured in cycles, but the values have been
normalized.
Test case    | Execution time
perf_600_2   | 1
perf_600_6   | 1.51
perf_600_6_q | 2.16
2.3.2.1 Using specific symbol sizes in deint_column
The innermost loop of deint_column currently places a symbol from the column buffer
into the segment, but it transfers only one word per iteration. There are only three symbol
sizes, and so the loop was replaced by a conditional statement with three outcomes. See
the following algorithm and compare to Algorithm 14:
Algorithm 15: The algorithm shows how the symbols are placed from the column buffer into the segment
after the optimization.
for (each symbol in the column buffer) {
if (symbol size is 2 bytes) {
place the word at its position in the segment in LDM;
} else if (symbol size is 4 bytes) {
place the first word at its position in the segment in LDM;
place the second word at its position in the segment in LDM;
} else { //symbol size is 3 words
place the first word at its position in the segment in LDM;
place the second word at its position in the segment in LDM;
place the third word at its position in the segment in LDM;
}
}
Depending on the symbol size, the conditional statement places the correct number of
words into the segment. This makes better use of the DSP’s short instruction pipeline, for
the conditional branch instruction that was performed once per iteration of the innermost
loop has been removed. Other instructions related to the loop’s overhead have also been
removed. The innermost loop of deint_segment that clears a q symbol one word per
iteration was also replaced by a conditional statement in the same manner.
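The structure of Algorithm 15 can be sketched in C as below. The sketch assumes 16-bit words (so the symbol sizes 2, 4 and 6 bytes are 1, 2 and 3 words) and a row-major segment; the function and type names are assumptions.

```c
#include <assert.h>

typedef unsigned short word_t; /* assumed 16-bit DSP word */

/* Sketch of Algorithm 15: one branch per symbol selects a fixed number of
 * word copies, replacing the word-by-word inner loop and its per-word
 * branch. col_buff holds the column's symbols; the segment is row-major. */
static void place_symbols(const word_t *col_buff, word_t *seg, int nr_symbols,
                          int nr_cols, int col_id, int words_per_symbol)
{
    for (int r = 0; r < nr_symbols; r++) {
        const word_t *src = col_buff + r * words_per_symbol;
        word_t *dst = seg + (r * nr_cols + col_id) * words_per_symbol;
        if (words_per_symbol == 1) {
            dst[0] = src[0];
        } else if (words_per_symbol == 2) {
            dst[0] = src[0];
            dst[1] = src[1];
        } else { /* 3 words */
            dst[0] = src[0];
            dst[1] = src[1];
            dst[2] = src[2];
        }
    }
}
```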
The following table shows how the implementation’s performance changed.
Table 12: The table shows how the optimization changed the implementation's performance. The test
cases' new execution times are compared to those measured in Table 11. All presented execution times
have been normalized to the smallest one.

Test case    | Execution time before the optimization | Execution time after the optimization | New execution time compared to before
perf_600_2   | 1.70                                   | 1                                     | 59%
perf_600_6   | 2.58                                   | 1.07                                  | 41%
perf_600_6_q | 3.68                                   | 1.84                                  | 50%
2.3.2.2 Unrolling deint_column’s loop
The new loop of deint_column, shown in Algorithm 15, should be unrolled so it
processes more symbols per iteration. This is to increase the amount of work performed
per iteration and thus make the loop’s overhead less significant. A new loop must be
placed after it that processes any remaining symbols. But leaving the conditional
statement shown in Algorithm 15 within the loop would delay the instruction pipeline
due to branch instructions and this is not necessary. Specifically, the loop was replaced
by a conditional statement which chooses among three outcomes depending on symbol
size. In each outcome, an unrolled loop that processes five symbols per iteration is
executed first. Then, a loop that processes the remaining (maximum four) symbols is
executed. Thus there are a total of six loops. The following algorithm clarifies this for
symbol size 4 bytes (2 words), and the same approach is used for the other symbol sizes.
Algorithm 16: The algorithm shows how the symbols are placed from the column buffer into the segment
after the optimization.
if (symbol size is 1 word) {
...
} else if (symbol size is 2 words) {
for (symbols 5*i to 5*i+4 in the column buffer, where i = 0, 1, 2, ...) {
place symbol 5*i into the segment word by word;
place symbol 5*i+1 into the segment word by word;
place symbol 5*i+2 into the segment word by word;
place symbol 5*i+3 into the segment word by word;
place symbol 5*i+4 into the segment word by word;
}
for (each remaining symbol in the column buffer) {
place the symbol into the segment word by word;
}
} else { //symbol size is 3 words
...
}
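Algorithm 16's structure for the 2-word case can be sketched in C as follows. The function and layout parameters are hypothetical; the point is the five-fold unrolled main loop followed by a cleanup loop for the at most four remaining symbols.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: place 2-word symbol i from the column buffer
 * into the segment at its row position. */
static void place_one_2w(uint16_t *segment, const uint16_t *col_buf,
                         size_t i, size_t row_stride, size_t col_off)
{
    uint16_t *dst = segment + i * row_stride + col_off;
    dst[0] = col_buf[2 * i];
    dst[1] = col_buf[2 * i + 1];
}

void place_column_2w(uint16_t *segment, const uint16_t *col_buf,
                     size_t nr_symbols, size_t row_stride, size_t col_off)
{
    size_t i = 0;
    /* Main loop, unrolled: five symbols per iteration. */
    for (; i + 5 <= nr_symbols; i += 5) {
        place_one_2w(segment, col_buf, i,     row_stride, col_off);
        place_one_2w(segment, col_buf, i + 1, row_stride, col_off);
        place_one_2w(segment, col_buf, i + 2, row_stride, col_off);
        place_one_2w(segment, col_buf, i + 3, row_stride, col_off);
        place_one_2w(segment, col_buf, i + 4, row_stride, col_off);
    }
    /* Cleanup loop: at most four remaining symbols. */
    for (; i < nr_symbols; i++)
        place_one_2w(segment, col_buf, i, row_stride, col_off);
}
```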
No changes were made to how q symbols are cleared. Execution time changed as the
following table shows. Test case perf_600_4 was introduced at this stage for the next
optimization.
Table 13: The table shows how the optimization changed the implementation’s performance. The test
cases’ new execution times are being compared to the latest measured in Table 12. All presented execution
times have been normalized to the smallest one. The test perf_600_4 has not previously been used.
Test case     Execution time          Execution time          New execution time
              before optimization     after optimization      compared to before
perf_600_2    4.43                    1                       23%
perf_600_4    First time of use       1.67                    First time of use
perf_600_6    4.73                    2.44                    52%
perf_600_6_q  8.14                    5.83                    72%
2.3.2.3 Reading and writing more words per LDM instruction
Two words can be read or written by a single LDM instruction, and this is usable when
working in C.16
This was used in the four loops that process symbol sizes 4 and 6 bytes.
For the former, each symbol is read and written from the column buffer to the segment
using two instructions. For the latter, the first two words of a symbol are read and written
using two instructions, followed by another two for the last word. Still no changes were
made to how q symbols are cleared.
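Whether the compiler emits double-word LDM accesses depends on the data type used (cf. footnote 16). One way to express the intent in portable C, sketched here under that assumption, is to move a 4-byte symbol as a single 32-bit unit; memcpy keeps the access well-defined regardless of alignment, and compilers typically lower it to one load and one store.

```c
#include <stdint.h>
#include <string.h>

/* Move a 4-byte (2-word) symbol as one 32-bit unit: one read and one
 * write instead of two of each. memcpy avoids alignment pitfalls. */
static inline void copy_symbol_4b(uint16_t *dst, const uint16_t *src)
{
    uint32_t tmp;
    memcpy(&tmp, src, sizeof tmp);   /* one 32-bit read (two words)  */
    memcpy(dst, &tmp, sizeof tmp);   /* one 32-bit write (two words) */
}

/* A 6-byte (3-word) symbol: the first two words as one 32-bit unit,
 * then the last word separately. */
static inline void copy_symbol_6b(uint16_t *dst, const uint16_t *src)
{
    copy_symbol_4b(dst, src);
    dst[2] = src[2];
}
```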
Execution time changed as follows:
Table 14: The table shows how the optimization changed the implementation’s performance. The test
cases’ new execution times are being compared to the latest measured in Table 13. All presented execution
times have been normalized to the smallest one.
Test case     Execution time          Execution time          New execution time
              before optimization     after optimization      compared to before
perf_600_2    1                       1.01                    101%
perf_600_4    1.67                    1.16                    69%
perf_600_6    2.44                    1.83                    75%
perf_600_6_q  5.83                    5.31                    91%
2.3.2.4 Clearing the q symbols
In the next optimization, the q symbols were dealt with. Chapter 6, Paragraph 6.2.2
discusses two strategies for clearing them, and states which one was chosen for the first
version of the implementation. A switch of strategy was made in this optimization.
Specifically, the clearing of q symbols was moved from deint_segment to deint_column.
Algorithm 17 demonstrates how deint_column was changed by using symbol size 4
bytes as an example. A new parameter nr_q_symbols was added to the input of the
function. It specifies how many symbols at the end of the matrix’s column are q symbols.
When the segment’s column is to be read from CM to the column buffer, only those
symbols that are not q symbols are read. For each symbol size there are then four loops
that process the segment column.
The first two place the symbols that are not q symbols from the column buffer into the
segment. They still work as the loops described in the previous paragraph. The following
two loops clear the remaining q symbols of the column. The first is unrolled and clears
three q symbols per iteration, and the next loop clears any remaining q symbols. The first
loop clears only three q symbols per iteration due to the low number of q symbols in a
matrix. If this number is increased then the following loop will have to do more of the
work.
16 It is a question of what data type is used.
Algorithm 17: How deint_column works after the optimization is described. nr_q_symbols is a new
input parameter. Two new loops are introduced for each symbol size. They clear the q symbols. The first is
unrolled by clearing three q symbols per iteration, while the second clears any remaining q symbols.
nr_non_q_symbols = nr_symbols_in_column – nr_q_symbols;
read the column’s first #nr_non_q_symbols from CM
to the column buffer in LDM;
if (symbol size is 1 word) {
...
} else if (symbol size is 2 words) {
//The old loops only place the non q symbols
//from the column buffer into the segment.
for (
symbols 5*i to 5*i+4 in the column buffer,
where i = 0, 1, 2, ..., nr_non_q_symbols/5-1
) {
place symbol 5*i into the segment;
place symbol 5*i+1 into the segment;
place symbol 5*i+2 into the segment;
place symbol 5*i+3 into the segment;
place symbol 5*i+4 into the segment;
}
for (each remaining non q symbol in the column) {
place the symbol in the segment;
}
//The new loops come next. They clear the q symbols
//in the end of the segment’s column.
for (
symbols (nr_non_q_symbols+3*i) to (nr_non_q_symbols+3*i+2)
in the segment’s column, where i = 0, 1, 2, ...
) {
clear symbol nr_non_q_symbols+3*i in the segment’s column;
clear symbol (nr_non_q_symbols+3*i+1) in the segment’s column;
clear symbol (nr_non_q_symbols+3*i+2) in the segment’s column;
}
for (each remaining q symbol in the column) {
clear the symbol in the segment’s column;
}
} else { //symbol size is 3 words
...
}
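A C sketch of the restructured function for the 2-word symbol size may clarify the control flow. All names and layout parameters are hypothetical, and the unrolling of the copy loop is omitted for brevity; what matters is that the non-q symbols are copied first and the trailing q symbols of the column are then cleared, three per iteration plus a cleanup loop.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of the restructured column processing for 2-word
 * symbols: copy the first nr_non_q symbols from the column buffer, then
 * clear the trailing q symbols of the column in the segment. */
void deint_column_2w(uint16_t *segment, const uint16_t *col_buf,
                     size_t nr_symbols, size_t nr_q_symbols,
                     size_t row_stride, size_t col_off)
{
    size_t nr_non_q = nr_symbols - nr_q_symbols;
    size_t i = 0;
    /* Place the non-q symbols (unrolling omitted in this sketch). */
    for (; i < nr_non_q; i++) {
        uint16_t *dst = segment + i * row_stride + col_off;
        dst[0] = col_buf[2 * i];
        dst[1] = col_buf[2 * i + 1];
    }
    /* Clear q symbols, three per iteration (inner loop left for the
     * compiler to unroll). */
    for (; i + 3 <= nr_symbols; i += 3) {
        for (size_t k = 0; k < 3; k++) {
            uint16_t *dst = segment + (i + k) * row_stride + col_off;
            dst[0] = 0;
            dst[1] = 0;
        }
    }
    /* Cleanup loop: remaining q symbols. */
    for (; i < nr_symbols; i++) {
        uint16_t *dst = segment + i * row_stride + col_off;
        dst[0] = 0;
        dst[1] = 0;
    }
}
```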
Execution time changed as the following table shows. All test cases but perf_600_6_q
have worsened slightly, because the function deint_column now deals with q symbols too
but none of these tests have such symbols. perf_600_6_q executes faster than perf_600_6
even though the matrices are equally large in both tests. This is because the former has
four columns that contain only q symbols. These columns are not read from CM; instead,
the corresponding symbols in the segment in LDM are simply cleared.
Table 15: The table shows how the optimization changed the implementation’s performance. The test
cases’ new execution times are being compared to the latest measured in Table 14. All presented execution
times have been normalized to the smallest one.
Test case     Execution time          Execution time          New execution time
              before optimization     after optimization      compared to before
perf_600_2    1                       1.01                    101%
perf_600_4    1.15                    1.17                    101%
perf_600_6    1.82                    1.85                    102%
perf_600_6_q  5.28                    1.75                    33%
2.3.3 Performance of the implementation after C optimizations
When the optimizations done in C had been completed, the implementation’s
performance was measured by using the test series presented in Paragraph 2.2.2. The
following table demonstrates the results. Each row in the table represents a test series,
and each cell in a row shows a test case in that series. The longest and shortest execution
times of the job DSPs are shown per test case.
Table 16: For each test case, the shortest and longest execution times of the job DSPs are shown, however
the values have been normalized. Every execution time in the table has been divided by the shortest
execution time. Each cell shows the two values as longest/shortest.
Test series \ Number of job DSPs   1       2            3            4            5            6            7            8
1                                  1
2                                  10.70   5.52/5.48    3.91/3.77    3.29/3.25    2.84/2.80    2.53/2.39    2.30/2.16    2.17/1.99
3                                  15.91   8.10/8.10    5.69/5.69    4.49/4.49    3.76/3.76    3.28/3.28    2.93/2.76    2.68/2.68
4                                  19.49   10.20/10.20  6.68/6.68    5.24/5.24    4.37/4.37    3.79/3.79    3.36/3.20    3.08/3.08
5                                  30.35   15.21/15.21  10.74/10.74  7.76/7.76    6.42/6.42    5.52/5.52    4.85/4.71    4.41/4.41
6                                  30.95   15.91/15.91  10.50/10.50  8.10/8.10    6.65/6.65    5.69/5.69    4.87/4.80    4.49/4.49
7                                  38.94   19.49/19.49  13.09/13.09  10.20/10.20  8.47/8.46    6.68/6.68    5.73/5.66    5.24/5.24
8                                  60.39   30.35/30.35  20.42/20.42  15.21/15.21  12.53/12.53  10.74/10.74  9.59/9.49    7.76/7.76
Still, the test case of series 1 shows that the implementation requires on average many
cycles per symbol. A separate implementation for small matrices should still be
considered if they are common.
In each of the test cases of series 2, the DSP that requires the shortest execution time
has been assigned the last section and therefore clears all q symbols. A DSP now benefits
from clearing q symbols. The execution times among the other DSPs are nearly equal.
When eight DSPs are used, the shortest execution time is 9% smaller than the longest. No
further load balancing is necessary for it would affect the longest execution time only
slightly.
Compare the execution times of the first and last tests of series 3. The latter is only about
one sixth of the former, even though eight DSPs are used. The same observation can be
made for the following series, though the situation is not as bad for them since the input
is larger in their cases. But one must recall that the presented execution times do not
include the dispatch procedure. The conclusion is that it is now less beneficial to use
more DSPs than it was before the optimizations were applied, and using more than eight
DSPs is probably a poor use of EMPA’s job DSPs.
The following table compares each test case’s longest execution time after the C
optimizations with the longest execution time before any optimizations were applied. The
values of Table 16 are compared to those of Table 10.
Table 17: The table shows how the C optimizations have improved the performance. The new longest
execution time of each test case is compared to the old before the optimizations.
Test series \ Number of job DSPs   1      2      3      4      5      6      7      8
1                                  66%
2                                  16%    16%    16%    17%    18%    18%    19%    19%
3                                  13%    13%    13%    14%    15%    15%    16%    17%
4                                  12%    13%    13%    13%    14%    14%    15%    15%
5                                  16%    16%    17%    16%    17%    17%    18%    18%
6                                  12%    13%    12%    13%    13%    13%    13%    14%
7                                  12%    12%    12%    13%    13%    13%    13%    13%
8                                  16%    16%    16%    16%    16%    17%    18%    16%
Great improvements can be seen for each test case. For each series, the general rule is
that the observed improvement diminishes as the number of DSPs increases. This is
because as their number grows, the time each of them spends executing the function
deint_column decreases, and all optimizations were aimed at this function. The time they
spend executing other parts of the implementation has not improved, because no
optimizations were aimed at those parts. Thus the observed improvement decreases as
more DSPs are used in each test series, with only a few exceptions. The exceptions arise
because sometimes the number of segments suddenly decreases by one when an
additional DSP is used, which reduces the time spent executing parts of the
implementation that have not been optimized.
2.3.4 Optimizations performed in assembler code
After the optimizations performed in C code, the assembler code produced for the
function deint_column by EMC was reviewed. It was inefficient and had many
deficiencies (Appendix 4 discusses the quality of EMC). It was deemed that the
performance can be greatly improved by performing further optimizations in assembler
code.
The amount of time that could be spent on these optimizations was limited. Therefore
the implementation was only partially optimized. Paragraph 2.3.4.1 explains this further.
It also describes initial changes that were made to the implementation in C code.
Paragraph 2.3.4.2 describes the assembler optimizations. Finally Paragraph 2.3.5 shows
how the execution time improved.
2.3.4.1 Initial changes before optimization
The function deint_column was replaced by six functions, one for each combination of
symbol size and number of columns. This was done in C code. To give an example and
establish a naming convention, deint_column_x_y is used when the matrix has #x
columns and a symbol is #y bytes long.
The division of deint_column into new functions by the three symbol sizes was made
to make the assembler code easier to survey when optimizing it. In deint_column, a
choice is made between twelve loops depending on symbol size (see Algorithm 17). The
four loops that are chosen are the only ones executed each time the function is
called in the same JOB. The assembler code produced would contain all twelve loops and
would be difficult to survey. The twelve loops were therefore divided among the new
functions that replaced deint_column. To give a precise example, the four loops shown in
Algorithm 17 were moved to deint_column_12_4 and deint_column_10_4.
Prior to performing optimizations in assembler code, the DSPs’ register set was not
well known to the author, and the amount of time left to work on the implementation was
very limited. It was deemed that the registers might not be numerous enough to optimize
deint_column’s loops to the required extent. In the loops, the number of columns must be
known to find the segment symbols that are to be modified. To avoid occupying a register
with this value in the loops, the function was also divided by the possible numbers of
columns. This makes the value explicitly known in each function that replaces
deint_column.
In hindsight, the author knows that the register set a DSP has at its disposal was
greatly underestimated. See Chapter 2, Paragraph 2.1.2 for an overview of the registers.17
Only two of the six new functions were optimized in assembler, namely
deint_column_12_2 and deint_column_12_6.
Prior to optimizing in assembler, the two unrolled loops in each function were slightly
changed in C. In deint_column_12_2, the first loop that places the column buffer’s
symbols into the segment (referred to as symb_loop_12_2) was changed so it processes
eight symbols per iteration. The second loop (q_symb_loop_12_2) now clears four q
symbols per iteration. In deint_column_12_6 the corresponding unrolled loops
(symb_loop_12_6 and q_symb_loop_12_6) each process four symbols per iteration.
These changes will be justified when the assembler optimizations are described.
When compiled by EMC, the implementation performs as the following table shows.
The results are compared to Table 15. At this stage the test case perf_600_2_q was
introduced. The clearing of q symbols of size 2 bytes is to be optimized in assembler, and
the test case will show how the time required to clear them changes.
17 Specifically, the author did not know about the address registers and their offset registers. It was believed
that only the temporary registers can be used to execute the instructions of the loop, including LDM
instructions.
Table 18: The implementation was changed in C code before performing assembler optimizations. The
table shows how this changed the implementation’s performance. The test cases’ new execution times are
being compared to the latest measured in Table 15. All presented execution times have been normalized to
the smallest one.
Test case      Execution time        Execution time        New execution time
               before the changes    after the changes     compared to before
perf_600_2     1.11                  1                     90%
perf_600_2_q   First time of use     1.04                  First time of use
perf_600_6     2.02                  2.08                  102%
perf_600_6_q   1.92                  1.93                  101%
2.3.4.2 Description of the assembler optimizations
The assembler code produced by EMC was reviewed and the following table shows the
average number of cycles spent per symbol by the unrolled loops. This includes their
overhead. By modifying the assembler code all of them were optimized. The same table
shows how much they were improved.
Table 19: Each loop’s performance before and after the assembler optimizations are presented. The average
number of cycles per symbol is shown for every loop, however the values have been normalized to the
smallest value.
Loop name           Average cycles per symbol   Average cycles per symbol   New performance
                    before assembler            after assembler             compared to before
                    optimizations               optimizations
symb_loop_12_2      2.1                         1                           48%
q_symb_loop_12_2    2.8                         1                           31%
symb_loop_12_6      4.2                         2.2                         46%
q_symb_loop_12_6    3.6                         1.6                         47%
To describe the assembler optimizations, take symb_loop_12_2 as an example. The
number of LDM instructions used to execute the loop was minimized. They were also
performed pairwise in parallel. In each iteration of the loop, four symbols are read from
the column buffer in LDM to temporary registers using two LDM instructions. They are
then written to the segment using four LDM instructions. Apart from the loop’s
overhead, no other instructions are performed. This motivates processing only four
symbols per iteration, not the eight to which the loop had been changed; the purpose of
that change was to make the loop’s overhead less significant compared to the actual
work. At that point, the DSP’s hardware support for executing loops without any
overhead was not known to the author.
The remaining loops q_symb_loop_12_2, symb_loop_12_6 and q_symb_loop_12_6 were
improved in a similar manner. There is no conflict in hardware resource usage when
executing parallel LDM instructions in any of the four loops (Chapter 2, Paragraph
2.1.3.3 explains such conflicts and their consequences). But still none of the loops meets
the criterion for optimal performance (Chapter 4 states the criterion).
2.3.5 Performance of the implementation after assembler optimizations
Execution time of the test cases improved as the following table shows. The
performance is compared to before the assembler optimizations.
Table 20: The table shows how the assembler optimizations changed the performance. The test cases’ new
execution times are being compared to the latest measured in Table 18. All presented execution times have
been normalized to the smallest one.
Test case      Execution time          Execution time          New execution time
               before optimization     after optimization      compared to before
perf_600_2     1.77                    1.06                    60%
perf_600_2_q   1.85                    1                       54%
perf_600_6     3.69                    2.46                    67%
perf_600_6_q   3.42                    2.19                    66%
The rightmost columns of Table 19 and Table 20 reflect each other well. The former
shows that q_symb_loop_12_2 benefited more than symb_loop_12_2 from the assembler
optimizations. Only the latter loop is used for perf_600_2, but both of them are used for
perf_600_2_q. Because of this, the execution time has improved more for perf_600_2_q
than for perf_600_2.
perf_600_6 has improved less than perf_600_2. This is because the former’s matrix is
three times larger in memory size than the latter’s. It must spend at least three times more
cycles to perform CM instructions. It also requires twice as many segments to cover its
matrix.
2.4 Future improvements
The following paragraphs show how the implementation should be further improved.
Paragraph 2.4.1 shows how the implementation should be changed to make it more
convenient for future changes. Paragraph 2.4.2 gives an idea how much the
implementation’s performance would improve by using the DSP’s hardware support for
executing loops. Paragraphs 2.4.3, 2.4.4 and 2.4.5 assume that the hardware support is
used to execute the critical loops, and then discuss how they can meet the criterion for
optimal performance.
2.4.1 Changes of convenience
The division of deint_column by possible number of columns for the reasons stated in
Paragraph 2.3.4.1 is not necessary. A DSP uses address registers to perform LDM
instructions and their value can be modified by using offset registers.
Therefore, only three functions are needed to perform the work of deint_column, one for
each symbol size. The number of columns must be in their inputs.18
These functions will
be referred to as deint_column_2, deint_column_4 and deint_column_6. By small
modifications of deint_column_12_2 and deint_column_12_6 (that were optimized in
assembler) one can obtain deint_column_2 and deint_column_6.
2.4.2 Using hardware support to execute loops
The DSP’s hardware support for executing loops without overhead should be used for
the critical loops. To give a comparison, the following table shows how the unrolled
loops of deint_column_12_2 and deint_column_12_6 that were optimized in assembler can
be further improved. The results are being compared to Table 19.
Table 21: The table shows how each loop’s performance will improve if the DSP’s hardware support for
executing loops is used. The average number of cycles per symbol is shown for every loop, however the
values have been normalized to the smallest one.
Loop name           Average cycles per symbol   Average cycles per symbol   Performance with
                    without hardware support    with hardware support       hardware support
                                                                            compared to without
symb_loop_12_2      2.5                         1.5                         60%
q_symb_loop_12_2    2.5                         1                           40%
symb_loop_12_6      5.5                         3.5                         64%
q_symb_loop_12_6    4                           2                           50%
It can clearly be seen that the two loops that clear q symbols will benefit more. This is
because their iterations are shorter than those of the two loops that must read from the
column buffer and write to the segment, so their overhead is more significant.
Nevertheless, this should greatly improve the execution times of all the four test cases
used during assembler optimizations.
2.4.3 Making the critical loop of deint_column_4 optimal
Consider the function deint_column_4 that will be used when the symbol size is four
bytes. If its critical loop is to process one segment column at a time as shown by
Algorithm 17, then there can be conflicts in hardware resource usage when executing
parallel LDM instructions (Chapter 2, Paragraph 2.1.3.3 describes such conflicts and
explains the consequences). The loop would not meet the criterion for optimal
performance (the criterion is stated in Chapter 4).
18 The division of deint_column by symbol sizes makes the assembler code easier to survey in the
author’s opinion, but this is just a matter of taste. It will also make further discussions easier to understand.
One solution to this problem is that deint_column_4’s critical loop processes two
consecutive columns of the segment each time it is called. So it begins by reading columns
2*i and 2*i+1 from CM to the column buffer in LDM. It then places the symbols from
the buffer into the segment in LDM. The problem is solved by the manner in which the
LDM instructions are performed. First, two symbols belonging to the same row are read
from the two columns in the column buffer into temporary registers. These symbols will
be consecutive in the segment, and so they can be written to it by two parallel LDM
instructions. A pair of q symbols can be cleared in the same manner. Algorithm 18
specifies deint_column_4 by using this approach.
Algorithm 18: The algorithm shows how deint_column_4 should process two segment columns at a
time.
//The algorithm processes two consecutive segment columns 2*i and 2*i+1.
//Two CM instructions are used to read the non q symbols of the columns
//to the column buffer in LDM.
read the segment column a[0...n-1] from CM to the column buffer in LDM;
read the segment column b[0...n-1] from CM to the column buffer in LDM;
//The following loop places the symbols
//from the column buffer into the segment.
loop {
//Two symbols are read from the column buffer to two temporary registers.
//This is done by two parallel LDM instructions.
LDM instruction that reads a[j] from the column buffer to a0 |
LDM instruction that reads b[j] from the column buffer to a1;
//The two symbols are written from the temporary registers to the segment.
//This is also done by two parallel LDM instructions.
LDM instruction that writes a[j] from a0 to the segment |
LDM instruction that writes b[j] from a1 to the segment;
}
//The following loop clears the q symbols in the end of the columns.
loop {
LDM instruction that clears the q symbol
of row j and column 2*i in the segment |
LDM instruction that clears the q symbol
of row j and column 2*i+1 in the segment;
}
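The pairing idea of Algorithm 18 can be sketched in C. The layout parameters and function name are hypothetical; the two reads and the two consecutive writes per row correspond to the pairs of parallel LDM instructions in the algorithm.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of Algorithm 18 for 4-byte symbols: a[j] and b[j]
 * belong to adjacent columns of row j, so they occupy consecutive 32-bit
 * positions in the segment and can be written back-to-back -- the C
 * analogue of two parallel LDM writes. */
void place_two_columns_4b(uint32_t *segment, const uint32_t *a,
                          const uint32_t *b, size_t nr_rows,
                          size_t row_stride, size_t col_pair_off)
{
    for (size_t j = 0; j < nr_rows; j++) {
        uint32_t s0 = a[j];   /* two parallel LDM reads */
        uint32_t s1 = b[j];
        uint32_t *dst = segment + j * row_stride + col_pair_off;
        dst[0] = s0;          /* two parallel LDM writes to */
        dst[1] = s1;          /* consecutive positions      */
    }
}
```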
If the DSP’s hardware support is used for executing the shown loops then it is
guaranteed that they will meet the criterion for optimal performance.
On the other hand, the column buffer must be larger if two segment columns are to be
read from CM to LDM. This means the segments must be smaller. Specifically, each
segment can span over nr_rows rows, where nr_rows solves the following equation:
AVAIL_LDM = size_symbol * nr_rows * 2 + nr_rows * nr_cols
But the maximum size of a segment is reduced only marginally compared to the case
where deint_column_4 processes one column at a time (compare to Formula 9). This
change does not necessarily mean that more segments are required to cover the DSP’s
section; in fact, the number of segments can increase by at most one. The time lost in
handling an additional segment is minimal compared to what is gained by having optimal
critical loops.
2.4.4 Making the critical loops of deint_column_2 optimal
If deint_column_2 is to process only one segment column each time it is called, as
deint_column_12_2 does, then the criterion for optimal performance will not be met by its
critical loops. They would work as symb_loop_12_2 and q_symb_loop_12_2. The former
reads eight bytes from the column buffer by two parallel LDM instructions, but then uses
four LDM instructions that are parallel pair-wise to write them to the segment. The latter
only writes to LDM but it too uses four LDM instructions that are parallel pair-wise.
The problem is that LDM instructions that write to the segment are limited by the
symbol size. Each of them writes two bytes, while it is required by the criterion to write
four.
To address this, deint_column_2 can, like deint_column_4, also process two consecutive
segment columns each time it is called. But there is a slight difference in how this must
be done. The reader should be patient throughout the following argument until Algorithm
19 is reached.
Four symbols can be read from the column buffer by using two parallel LDM
instructions. Two of the symbols belong to one row, and the other two belong to the next.
So their positions in the segment in LDM are pairwise consecutive. The four of them can
therefore be written to the segment using two parallel instructions. The problem is that
during this procedure the four symbols reside in two temporary registers. When the symbols
have been read from the column buffer, each register contains two symbols of the same
column. When they are to be written to the segment, each register must contain two
symbols of the same row. The contents of two register parts must be swapped.
To do this and maintain an optimal loop is difficult due to reasons that are mentioned in
Paragraph 5.1. Specifically, the instruction that can be used to easily swap two register
parts can not be performed in parallel with two LDM instructions. Instead three
instructions and an extra temporary register are required for each pair of register parts
that are to be swapped. This requires the loop to process more than four symbols per
iteration. The following algorithm embodies the lengthy argument:
Algorithm 19: The algorithm shows in what manner deint_column_2’s critical loop should move
symbols from the column buffer to the segment.
//The algorithm processes two consecutive segment columns 2*i and 2*i+1.
//Two CM instructions are used to read the non q symbols of the columns
//to the column buffer in LDM.
read column s[0...n-1] of the segment from CM to LDM;
read column t[0...n-1] of the segment from CM to LDM;
//The temporary registers a0, a1, a2 and a3 will be used.
LDM instruction that reads s[0] and s[1] from the column buffer to a2 |
LDM instruction that reads t[0] and t[1] from the column buffer to a3;
//The following loop places the symbols
//from the column buffer into the segment.
//The variable i is only used for clarification.
//Eight symbols per iteration are processed.
//Instructions required to swap s[1] and t[0].
Instructions that begin swapping s[1] and t[0] between a2h and a3l;
//i = 2;
loop {
LDM instruction that reads s[i] and s[i+1] from the column buffer to a0 |
LDM instruction that reads t[i] and t[i+1] from the column buffer to a1 |
Instructions that finish swapping s[i-1] and t[i-2] between a2h and a3l;
//Instructions required to swap s[i-1] and t[i-2].
LDM instruction that writes s[i-2] and t[i-2] from a2 to the segment |
LDM instruction that writes s[i-1] and t[i-1] from a3 to the segment |
Instructions that begin swapping s[i+1] and t[i] between a0h and a1l;
//Instructions required to swap s[i+1] and t[i].
LDM instruction that reads s[i+2] and s[i+3] from the column buffer to a2 |
LDM instruction that reads t[i+2] and t[i+3] from the column buffer to a3 |
Instructions that finish swapping s[i+1] and t[i] between a0h and a1l;
//Instructions required to swap s[i+1] and t[i].
LDM instruction that writes s[i] and t[i] from a0 to the segment |
LDM instruction that writes s[i+1] and t[i+1] from a1 to the segment |
Instructions that begin swapping s[i+3] and t[i+2] between a2h and a3l;
//Instructions required to swap s[i+3] and t[i+2].
//i = i + 4;
}
//The loop stopped at i = j.
Instructions that finish swapping s[j-1] and t[j-2] between a2h and a3l;
//Instructions required to swap s[j-1] and t[j-2].
LDM instruction that writes s[j-2] and t[j-2] from a2 to the segment |
LDM instruction that writes s[j-1] and t[j-1] from a3 to the segment;
The unrolled loop shown meets the criterion for optimal performance. However, the
loop has been unrolled and must be followed by another one that places any remaining
symbols from the column buffer into the segment. This has been omitted. By processing
two segment columns each time, it is easy to construct a loop that clears four q symbols
per iteration and is also optimal. This has also been omitted.
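The halfword exchange performed by the swap instructions can be expressed in portable C, assuming (hypothetically) that a 32-bit temporary holds two 16-bit symbols of one column with the earlier symbol in the high half:

```c
#include <stdint.h>

/* From a2 = s[i]:s[i+1] and a3 = t[i]:t[i+1] (column order, earlier
 * symbol in the high half -- an assumption of this sketch), form the
 * row words s[i]:t[i] and s[i+1]:t[i+1] by exchanging a2's low half
 * with a3's high half. */
static inline void swap_halves(uint32_t a2, uint32_t a3,
                               uint32_t *row0, uint32_t *row1)
{
    *row0 = (a2 & 0xFFFF0000u) | (a3 >> 16);          /* s[i]   : t[i]   */
    *row1 = (a2 << 16)         | (a3 & 0x0000FFFFu);  /* s[i+1] : t[i+1] */
}
```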
2.4.5 Making the critical loops of deint_column_6 optimal
Earlier it was suggested that the function deint_column_12_6 can be modified to obtain
deint_column_6. Recall that the former processes only one column at a time. Its critical
loop symb_loop_12_6 does not meet the criterion for optimal performance, but in a sense it
is not far from achieving this. If the loop is changed so it becomes optimal then the
number of cycles it spends per symbol is reduced by only 14%. The improvement of the
implementation’s total execution time for this symbol size would be even smaller.
Besides the limited benefit, it is very hard, if not impossible, to improve
deint_column_6 the way deint_column_2 has been improved. The former could be improved
by processing two columns each time it is called, as the latter does. But Algorithm 19 shows how
words must be swapped between temporary register parts in deint_column_2’s critical
loop. If deint_column_6 is to process two segment columns at a time, then this task
becomes increasingly difficult to perform in its critical loop that places symbols from the
column buffer into the segment. There are too many pairs of register parts that must swap
contents and too few cycles to do so. All attempts to bring the loop to optimal
performance failed. The details are omitted. It is noteworthy though that with the
hardware changes suggested in Paragraph 5.1, it would be possible to bring the loop to
optimal performance.
However, if deint_column_6 is to process two segment columns, then it is easy to
construct a loop that clears four q symbols per iteration and is optimal. Nevertheless, if
deint_column_2 and deint_column_4 are to process two segment columns, then
deint_column_6 should do so too; after all, the idea requires that the overall
implementation be restructured. However, one of this function’s critical loops does not
need to be optimal.
Appendix 3. Rate dematching

This appendix is devoted to specifying the processing algorithm that was used as a base
for the implementation of Rate dematching, and to explaining how the implementation
was developed.
It is Paragraph 3.1 that specifies the processing algorithm. Paragraph 3.2 describes the
test cases that were used for verifying correctness and performance while writing the
implementation.
Paragraph 3.3 explains what steps were taken to optimize the implementation. The
improvements are described step by step, and their effect on performance is measured
with the test cases. The implementation’s final performance is also presented.
The appendix is concluded by Paragraph 3.4 that explains how the implementation
should be improved in the future.
3.1 Specification of the processing algorithm

The algorithm is organized around four structures. There are functions that extract data
from the structures or operate on them. The structures are rd_data, input_reader,
column_traverser and byte_array. They will be described in the following paragraphs.
They are used by the functions dematch, soft_comb_matrix and soft_comb_col, which are
specified after the structures. The dispatch procedure of the JOB will be omitted.
3.1.1 Description of byte_array
byte_array is used to (i) transfer byte arrays between CM and LDM and to (ii) read or
write to single bytes in a byte array in LDM. The first task is complicated due to the
reasons stated in Chapter 7, Paragraph 7.2.3, while the second task is complicated
because a single byte in LDM can not be addressed when writing C code. The structure
addresses these two issues.
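As an illustration of the bit arithmetic this involves, the following C sketch models byte-level access on a word-addressed memory. The array ldm and the names get_byte and set_byte are invented here for illustration; they are not the structure’s actual interface.

```c
#include <stdint.h>

/* Hypothetical sketch of byte-level access on a word-addressed memory.
 * ldm[] models LDM as an array of 32-bit words; byte i lives in word
 * i/4 at byte position i%4. */
static uint8_t get_byte(const uint32_t *ldm, unsigned i) {
    unsigned shift = (i & 3u) * 8u;          /* byte position inside the word */
    return (uint8_t)(ldm[i >> 2] >> shift);  /* word index = i / 4 */
}

static void set_byte(uint32_t *ldm, unsigned i, uint8_t b) {
    unsigned shift = (i & 3u) * 8u;
    uint32_t word = ldm[i >> 2];
    word &= ~(0xFFu << shift);               /* clear the target byte */
    word |= (uint32_t)b << shift;            /* insert the new byte */
    ldm[i >> 2] = word;
}
```

Note that every byte write becomes a read-modify-write of the containing word, which is part of why compiled byte accesses are expensive on this architecture.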
3.1.2 Description of rd_data
rd_data provides necessary data. A few examples are listed here:
- Basic information such as the matrix’s number of rows, the length of E, and so on.
- The function P and its inverse.
- The function Q and its inverse.
- The number of NULL bytes in a column i before permutation.
- The number of NULL bytes in a column i after permutation.
Some of the values provided by the structure are obtained from calculations that have
been performed in advance and stored in LDM. Producing them thus requires only a
small number of lookups in word arrays.
3.1.3 Description of input_reader
input_reader is used to traverse E in CM and to read necessary parts of it to LDM. It
makes sure that only the repetitions of the DSP’s matrix are read. Each repetition is read
in as large parts as LDM allows, but no column repetition is read partially. This structure
uses byte_array.
3.1.4 Description of column_traverser
column_traverser traverses a part of a repetition of the DSP’s matrix that has been read
from E to LDM. It traverses the part one column repetition at a time, and for each of them
it tells what matrix column it is to be soft combined with. It makes use of byte_array.
3.1.5 Specification of dematch
This function is executed by each job DSP in order to perform the processing step for its
matrix. It expects that the DSP’s matrix has been placed in LDM when it begins. The
function clear_matrix is used to clear the entire matrix if required. It will not be further
specified. The function soft_comb_matrix is used to perform the soft combination.
The function’s input is as follows:
- LDM_matrix, a pointer to the matrix in LDM.
- CM_input_E, a pointer to the array E in CM.
- The integers D, S, N, T and the boolean CLEAR that are among the processing step’s input parameters (see Chapter 5, Paragraph 5.3.2).
It works as the following algorithm shows.
Algorithm 20: The algorithm specifies the function.
rd_data data;
//Initiate the structure data.
init_data (data, D, S, N, T);
input_reader reader;
//Initiate the structure reader.
init_reader (reader, data, CM_input_E);
if (CLEAR == TRUE) {
    clear_matrix (LDM_matrix, data);
}
//Perform the soft combination.
soft_comb_matrix (LDM_matrix, reader, data);
return;
3.1.6 Specification of soft_comb_matrix
This function performs the actual soft combination of the matrix column by column.
Every repetition of the matrix is read to LDM by using input_reader. A repetition in
LDM is traversed by a column_traverser. Each column repetition is then soft combined
to the matrix by the function soft_comb_col. The latter will not be further specified.
Its input is as follows:
- LDM_matrix, a pointer to the matrix in LDM.
- reader, the input_reader that is used to read repetitions in E from CM.
- data, the rd_data structure that provides data.
The function works as follows:
Algorithm 21: The algorithm specifies the function.
column_traverser traverser;
//Initiate the structure column_traverser.
init_traverser (traverser, data);
byte_array buf_array;
//Initiate buf_array that will be used to read repetitions to LDM.
init_array (buf_array);
byte_array col_array;
//Initiate col_array that will be used by column_traverser
//to point out the beginning of the current column repetition.
init_array (col_array);
//Iterate as long as there are more repetitions of the matrix in E in CM.
while (has_more_input (reader)) {
    //Read as much as possible of the current repetition from CM to LDM.
    //Use buf_array for this task.
    read_repetition_to_ldm (buf_array, reader);
    //Retrieve the index of the first column repetition
    //that has been read to LDM.
    i = first_col_id (reader);
    //Initiate the traverser to the first column repetition that has been read.
    start_traverser (traverser, buf_array, i);
    //Iterate as long as there are more column repetitions
    //that have been read to LDM.
    while (has_more_columns (traverser)) {
        //Point out the beginning of the current
        //column repetition by using col_array.
        point_to_curr_column (col_array, traverser);
        //Retrieve the index of the matrix column that
        //it is to be soft combined with.
        i = curr_col_id (traverser);
        //Perform the soft combination between the column and column repetition.
        soft_comb_col (LDM_matrix, col_array, i);
        //Move the column_traverser one column repetition forward.
        move_forward_one_col (traverser);
    }
    //Move forward the input_reader to the next part of the current repetition
    //or to the beginning of the next repetition.
    move_forward (reader);
}
return;
3.2 Test cases used for writing the implementation

This paragraph specifies the test cases that were used while writing an implementation
of the processing algorithm that has been suggested. There are tests for verifying
correctness and tests that show how performance changes due to optimizations. The
execution times of the tests that were used for measuring the implementation’s
performance can not be revealed. Throughout this appendix some tables discuss the
implementation’s performance by using these tests. In these tables the tests’ execution
times have been normalized: every execution time in a table has been divided by the
shortest execution time in the same table.
3.2.1 Test cases for testing correctness
More than 300 test cases were used to verify the correctness of the implementation,
approximately 100 for each matrix U, V and W. The large number of tests was motivated
by how error-prone an implementation of the processing step is on EMPA, and more tests
were introduced when the implementation was optimized in assembler. Only a few test
cases in each series use very large input. The parameters that have been varied over the
tests are:
- The size of the matrix.
- The number of NULL bytes.
- Whether the matrix is to be cleared or not.
- The length of the byte array E.
- The number of initial column repetitions skipped in E.
- The size of the buffer in LDM used to read the matrix’s repetitions in E from CM.
The input arrays of each test case were generated randomly. Each test specifies an
output that the implementation is expected to produce, given the input. The output of the
test has been generated using a separate implementation. This implementation is simple
and follows the processing step’s specification step by step. Performance was not taken
into consideration while writing it in order to make it easier to verify its correctness.
3.2.2 Test cases for measuring performance
The following test cases were used to measure the performance of the implementation.
In each test the execution time of the function dematch is measured in number of cycles.
Table 22: Test cases used for measuring performance.
Test case’s name     Rows in matrix   Length of E (bytes)   Cleared before soft combination   LDM buffer size (bytes)
matrix_cleared_U     193              18528                 TRUE                              6148
matrix_cleared_V     193              18528                 TRUE                              6148
matrix_cleared_W     193              18528                 TRUE                              6148
matrix_U             193              18528                 FALSE                             6148
matrix_V             193              18528                 FALSE                             6148
matrix_W             193              18528                 FALSE                             6148
overhead_W           1                96                    FALSE                             6148
There are no NULL bytes in the matrices. The first repetition in E belongs to U in every
test, and no column repetitions have been skipped (see Chapter 7, Paragraph 7.1.3).
Compare the length of E to the number of rows in the matrix in each test. There is
exactly one repetition of the matrix in E. The first six tests use large input arrays. The
matrix is to be cleared in the first three. Note that the buffer in LDM is not large enough
to contain the matrix’s entire repetition in any test; each repetition must therefore be read
in two or three parts.
It is interesting to know how much of a test’s execution time is spent on tasks other
than (i) clearing the matrix, (ii) executing CM instructions that read E and (iii) executing
the function soft_comb_col. What remains besides these tasks is, among other things,
initiating and managing the structures that the implementation uses (see Paragraph 3.1).
32 columns are processed in every test case, but in overhead_W each of them has only one
row. This test spends a minimal amount of time performing the three mentioned tasks and
can therefore be used to estimate the overhead for the other tests.
The function SatFunc that is used to perform soft combination requires that the sum is
adjusted if it is outside of a certain bound (see Chapter 5, Paragraph 5.3.2). In each test
every byte of the matrix and E has the value 127. This is to ensure that the sum is always
out of bounds, and the implementation is thus forced to make adjustments by each soft
combination.
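The saturating adjustment just described can be modelled in C as follows. sat_add is an illustrative name and the bound [-128, 127] is taken from the description above; the exact SatFunc definition is given in Chapter 5, Paragraph 5.3.2.

```c
/* Sketch of a saturating byte addition: the sum of two signed 8-bit
 * soft values is clamped to [-128, 127]. */
static signed char sat_add(signed char a, signed char b) {
    int sum = (int)a + (int)b;       /* compute in full precision */
    if (sum > 127)  sum = 127;       /* adjust to the closest boundary */
    if (sum < -128) sum = -128;
    return (signed char)sum;
}
```

With every byte of the matrix and E set to 127, every sum is 254 and is always clamped, which is exactly what the performance tests exploit.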
3.3 Implementation of the processing algorithm

The algorithm that has been described in Paragraph 3.1 was implemented in C
code and then further optimized. Paragraph 3.3.1 presents its initial performance and
describes the major optimization steps and their impact on performance. This includes
assembler optimizations, but they are further described in Paragraph 3.3.2. The
implementation’s performance after all the optimizations is discussed in Paragraph 3.3.3.
3.3.1 Major optimization steps performed
The first version of the implementation performed as follows.
Table 23: The performance of the first version of the implementation is presented. The test case’s execution
times are shown, but the values have been normalized to the smallest one.
Test case Execution time
matrix_cleared_U 23.80
matrix_cleared_V 24.18
matrix_cleared_W 23.85
matrix_U 22.83
matrix_V 23.21
matrix_W 22.89
overhead_W 1
The tests that need to clear the matrix require more time. It might be surprising that
these three tests require nearly equal time. The same observation can be made for
matrix_U, matrix_V and matrix_W. The repetitions of V and W have after all been
interlaced in E, so one might expect their tests to require considerably more time to
execute. The reason is that bytes in LDM can not be addressed when writing C code, and
in the function soft_comb_col bit arithmetic is used to obtain a word’s byte. It therefore
does not make much of a difference which matrix is being processed.
Chapter 7, Paragraph 7.2.1 suggests how the implementation can be improved for
matrices that are to be cleared. When the matrix has been cleared, its first repetition in E
can be used to fill it instead of performing soft combination. The implementation was
improved in this way by introducing the functions fill_matrix and fill_col. The former
reads the first 32 occurrences of the matrix’s column repetitions from E and inserts them
into the matrix. Each column is inserted by the latter function. fill_matrix will not be
further specified for it works almost exactly as soft_comb_matrix shown in Algorithm 21.
The only two differences are that fill_matrix’s inner loop is restricted to iterate no more
than 32 times, and that it calls fill_col instead of soft_comb_col. For clarification, the
following algorithm shows how the function dematch was modified:
Algorithm 22: The algorithm specifies dematch after the optimization.
rd_data data;
init_data (data, D, S, N, T);
input_reader reader;
init_reader (reader, data, CM_input_E);
if (CLEAR == TRUE) {
    clear_matrix (LDM_matrix, data);
    //The first 32 occurrences of the matrix’s column repetitions in E are read
    //and inserted into the matrix without soft combination.
    fill_matrix (LDM_matrix, reader, data);
}
//The remaining relevant column repetitions in E are read and soft combined.
//Note how fill_matrix and soft_comb_matrix use reader,
//which keeps track of what column repetitions have already been processed.
soft_comb_matrix (LDM_matrix, reader, data);
return;
The performance was improved as the following table shows. As expected those tests
where the matrix is cleared execute faster.
Table 24: The table shows how the optimization changed the performance. The test case’s execution times
are shown, but the values have been normalized to the smallest one.
Test case            Before optimization   After optimization   New time vs. before
matrix_cleared_U     23.80                 16.89                71%
matrix_cleared_V     24.18                 17.45                72%
matrix_cleared_W     23.85                 16.61                70%
matrix_U             22.83                 22.83                100%
matrix_V             23.21                 23.22                100%
matrix_W             22.89                 22.89                100%
overhead_W           1.00                  1.00                 100%
The assembler code produced by EMC was inefficient since in C code an LDM byte can
not be addressed. However, the assembler code had many other deficiencies due to
EMC’s quality. This is further discussed in Appendix 4. Because of this the functions
clear_matrix, fill_col and soft_comb_col were fully rewritten in assembler to optimize
the implementation further. Paragraph 3.3.2 describes the assembler optimizations. The
execution time improved as the following table shows. Pay attention to the improvement
for overhead_W. It has only one row in each column yet its execution time has been
reduced by nearly one seventh due to the optimization. This gives an idea of how
inefficient the assembler code was prior to the optimization. Now it also becomes
obvious that processing V or W requires more effort than U, no matter if the matrix is to
be cleared or not. Processing W requires notably more time than V due to the
difference between Permutation I and II.
Table 25: The table shows how the optimization changed the performance.
Test case            Before optimization   After optimization   New time vs. before
matrix_cleared_U     19.27                 1.47                 7.6%
matrix_cleared_V     19.91                 1.76                 8.8%
matrix_cleared_W     18.94                 1.78                 9.4%
matrix_U             26.05                 1.88                 7.2%
matrix_V             26.48                 2.21                 8.3%
matrix_W             26.11                 2.24                 8.6%
overhead_W           1.14                  1.00                 88%
But what is most alarming is the comparison of overhead_W’s execution time with the
others’. It shows that the time spent on tasks other than (i) clearing the matrix, (ii)
executing CM instructions that read E and executing the functions (iii) soft_comb_col or
(iv) fill_col now makes up a significant share of any test’s execution time. All other
tasks besides
the four mentioned are being performed by parts of the implementation that are still
written in C code and compiled by EMC. This suggests that improving those parts can
give a significant boost in performance, and this was done as a last optimization. The
execution times improved as follows.
Table 26: The table shows how the optimization changed the performance.
Test case            Before optimization   After optimization   New time vs. before
matrix_cleared_U     2.38                  1.80                 75%
matrix_cleared_V     2.85                  2.22                 78%
matrix_cleared_W     2.89                  2.23                 77%
matrix_U             3.06                  2.50                 82%
matrix_V             3.59                  2.96                 83%
matrix_W             3.63                  2.99                 82%
overhead_W           1.62                  1.00                 62%
3.3.2 Description of assembler optimizations
While optimizing the implementation it was decided that the tasks of the functions
clear_matrix, fill_col and soft_comb_col should be optimized by rewriting them in
assembler code. clear_matrix has only one loop, and improving it to meet the criterion
for optimal performance was easy.
However this was not doable for the other two functions as they currently work. Each
function’s critical loop processes one column at a time and the matrix is in row by row
order in LDM. So an LDM instruction in the loop that reads or writes to the matrix
transfers only one byte. Also there may be conflicts in hardware resource usage when
parallel LDM instructions are executed. Thus the loop fails to uphold two points of the
criterion for optimal performance.
Having said that, the functions were still optimized in assembler. fill_col was replaced
by two functions, namely fill_col_u and fill_col_vw. The former is used if fill_matrix
processes U, while the latter is used if either V or W is being processed. In the same way,
soft_comb_col_u and soft_comb_col_vw replaced soft_comb_col. The four functions were
directly written in assembler code.
Each of the four functions has an unrolled loop that processes 8 rows of the column per
iteration. The DSP’s hardware support is used for executing the loops (see Chapter 2,
Paragraph 2.1.4). To perform soft combination a very handy instruction is used in the
loops of soft_comb_col_u and soft_comb_col_vw, namely add4. add4 a0,a1,a2 treats each
of the four lower bytes of every temporary register as an individual integer. Each byte of
a0 is in this way added to the corresponding byte of a1. If the sum is larger than 127 or
smaller than -128 then it is adjusted to the closest boundary. The result is then stored in
the corresponding byte of a2. In this way soft_comb_col_u soft combines four rows of the
matrix’s column per add4 instruction. soft_comb_col_vw however soft combines only two
rows per instruction due to how the columns of V and W have been interlaced in E.
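The described behaviour of add4 can be modelled in plain C as below. add4_model and to_s8 are names invented for this sketch; it mirrors the semantics stated above, not necessarily the DSP’s exact implementation.

```c
#include <stdint.h>

/* Interpret byte k of word w as a signed 8-bit value. */
static int to_s8(uint32_t w, int k) {
    int v = (int)((w >> (8 * k)) & 0xFFu);
    return v > 127 ? v - 256 : v;
}

/* Byte-wise saturating addition: each of the four bytes of a0 is added
 * to the corresponding byte of a1, the sum is clamped to [-128, 127],
 * and the result byte is placed in the returned word (modelling a2). */
static uint32_t add4_model(uint32_t a0, uint32_t a1) {
    uint32_t a2 = 0;
    for (int k = 0; k < 4; k++) {
        int s = to_s8(a0, k) + to_s8(a1, k);
        if (s > 127)  s = 127;       /* adjust to the closest boundary */
        if (s < -128) s = -128;
        a2 |= ((uint32_t)s & 0xFFu) << (8 * k);
    }
    return a2;
}
```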
The critical loops have the following performance. It is only natural that the loops of
soft_comb_col_u and soft_comb_col_vw have twice the execution time of those of
fill_col_u and fill_col_vw: the former must additionally read bytes from the matrix
and perform soft combination.
Table 27: The performance of each function’s unrolled loop is presented. The average number of cycles
spent per row of the matrix’s column is shown, but the values have been normalized to the smallest one.
Function name        Average cycles per row of the matrix’s column (unrolled loop)
fill_col_u           1
fill_col_vw          1.11
soft_comb_col_u      2
soft_comb_col_vw     2.22
3.3.3 The implementation’s performance after optimizations
The following table presents the number of cycles required to execute the function
dematch for every test case. Depending on what test is executed, either the critical loop of
fill_col or soft_comb_col is executed. The table also shows how large share of the
execution time is spent in the loop.
Table 28: The table presents the test cases’ execution times and shows how large share of each test’s time is
spent in the critical loop. The execution times have been normalized to the smallest one.
Test case            Execution time   Share of execution time spent in the critical loop
matrix_cleared_U     1.75             42%
matrix_cleared_V     2.15             38%
matrix_cleared_W     2.17             37%
matrix_U             2.45             60%
matrix_V             2.90             56%
matrix_W             2.92             56%
overhead_W           1                0.85%
It becomes obvious that in the first six tests a large share of the execution time is not
spent in the critical loop. Yet the matrices in the tests are already as large as they can be
in the processing step. This can not be blamed on the CM instructions that read the
matrix’s repetition from CM to LDM. Nor is it because of the execution of clear_matrix
for the three first tests. These tasks do not make for a significant amount of the tests’
execution times.
The observations being made are due to how the implementation was developed. A
modularized approach was used, using multiple structures that perform individual tasks
(see Paragraph 3.1). This decision was made since writing an implementation for the
processing step was deemed to be very error-prone. In this way each structure’s functions
that operate on it could be tested for correctness individually. This benefit was very
important due to the limited time of the thesis that could be dedicated to resolving all the
issues that surfaced while implementing the processing step.
However the disadvantages of using a modularized approach instead of writing a
“seamless” implementation are now significant. They can be made more obvious by a
simple observation. This will also clarify the results in the table. Algorithm 21
demonstrates the function soft_comb_matrix before the optimizations. After the
optimizations, the call to soft_comb_col has been replaced by a conditional statement
with two outcomes. It chooses between either soft_comb_col_u or soft_comb_col_vw.
fill_matrix is nearly identical, and it suffices to refer to Algorithm 21 throughout the
following argument.
Now, consider one iteration of the inner loop of fill_matrix for the test
matrix_cleared_U after the optimizations. 62% of the iteration’s execution time is spent
on executing fill_col_u. This means that 38% of the time is spent on the constant
amount of work required per iteration. The latter number would be larger if the matrix
had fewer rows.
The conclusion is that the problem is mainly twofold. First, the columns of the matrix
are too short. Second, the constant amount of work per iteration is too large. While
nothing can be done about the former since it has to do with the processing step’s
specification, something can be done about the latter.
But the approach used to write the implementation can not solely be blamed. The
constant amount of work required per iteration is performed by parts of the
implementation that are still written in C code, and as Appendix 4 shows EMC is not of
high quality.
3.4 Future improvements

The future improvements suggested in this paragraph are mentioned in a specific order.
It is believed that those mentioned first are of more practical use than the others, due to
the typical input parameters when executing Rate dematching (see Chapter 7,
Paragraph 7.1.6). It is first suggested how the constant amount of work required for each
column repetition can be reduced. Next it is suggested how the critical loops of the
functions fill_col_u, fill_col_vw, soft_comb_col_u and soft_comb_col_vw can be
brought closer to being optimal. Recall that each of them currently processes only one
column repetition at a time. This means that the LDM instructions that write to the
matrix’s column write only one byte. Also there may be conflicts in hardware resource
usage when executing parallel LDM instructions.
3.4.1 Bypassing the structures
Now that the implementation has been completed, it becomes more obvious what roles
the implementation’s structures fill and how these roles can be performed more swiftly by
bypassing the structures completely. To give an example, consider the inner loop of
soft_comb_matrix (see Algorithm 21). It processes one column repetition each time it is
executed and uses column_traverser to do so. The structure can be replaced by an
unsigned integer that points out the beginning of the current column repetition in the
buffer in LDM that stores repetitions of the matrix. This integer can be passed to
soft_comb_col_u or soft_comb_col_vw depending on what matrix is being processed. At
the end of the loop, it is calculated where the next column repetition begins. This can be
done by incrementing the unsigned integer. How much it is to be incremented by depends
on what matrix column the current column repetition has been soft combined with. This
can be precalculated and stored in a word array in LDM, thus requiring only a lookup.
The inner loop of fill_matrix can be improved in the exact same way. By simplifying
the inner loops to this degree, it becomes an easy task to further optimize them in
assembler and thus bypass EMC completely. This should significantly reduce the
constant amount of work required per column repetition.
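The simplified inner loop described above can be sketched as follows, under assumed names: buf is the LDM buffer holding the repetition, col_id[c] gives the matrix column that column repetition c is soft combined with, and step[i] is the precalculated word array holding the increment to the next column repetition. None of these names come from the thesis, and soft_comb_col_u/soft_comb_col_vw are represented by a single function pointer.

```c
#include <stdint.h>

typedef void (*comb_fn)(signed char *matrix, const signed char *col, unsigned col_id);

/* Sketch: the column_traverser structure is replaced by one unsigned
 * integer (offset) and one table lookup (step[i]) per iteration. */
static void soft_comb_buffer(signed char *matrix, const signed char *buf,
                             const unsigned *col_id, const unsigned *step,
                             unsigned n_cols, comb_fn soft_comb_col)
{
    unsigned offset = 0;  /* beginning of the current column repetition */
    for (unsigned c = 0; c < n_cols; c++) {
        unsigned i = col_id[c];                   /* matrix column to combine with */
        soft_comb_col(matrix, buf + offset, i);   /* soft combine one column repetition */
        offset += step[i];                        /* precalculated increment: one lookup */
    }
}
```

In the real loop the function pointer would be replaced by the conditional choice between soft_comb_col_u and soft_comb_col_vw; the point of the sketch is how small the per-iteration bookkeeping becomes.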
3.4.2 Processing consecutive columns to obtain optimality
Let us see if the critical loop of fill_col_u can be made optimal by modifying it. Recall
that it is used to directly insert the first 32 column repetitions into the matrix U when it
has been cleared. It is chosen for the sake of discussion because it does not need to
perform soft combination as in soft_comb_col_u, and its task is not as complicated as in
fill_col_vw due to interlaced column repetitions. Yet it will be shown that making the
chosen loop optimal is probably impossible, which implies that the case is the same for
the other functions. Also, the difficulties stated here are only a few of all those
encountered in the attempt.
The critical loop of fill_col_u needs to read bytes from column repetitions and insert
them into U’s columns. If the loop is to be optimal then every LDM instruction that
writes to the matrix must write four bytes to LDM. The bytes must of course be
consecutive in LDM. Since the matrix appears in LDM row by row the bytes span over
four consecutive columns. So the loop must process four column repetitions per iteration.
Yet every LDM instruction that reads from a column repetition must read four bytes.
Therefore the procedure sums up to (i) executing four LDM instructions that read a total
of 4*4 = 16 bytes from LDM into temporary registers, (ii) processing the bytes by
permuting them and then (iii) executing four LDM instructions that write them to LDM.
In what order this must be done is not relevant for this discussion. What must be
understood is that there are a total of eight LDM instructions, and they must be executed
in parallel pair-wise. This gives four pairs of parallel LDM instructions. Pairing the
instructions to achieve this is not as simple as it may sound, but this is not the
main problem. The criterion for optimal loop performance requires that all other
instructions are executed in parallel with LDM instructions. Thus the instructions that
reorder the 16 bytes into some temporary registers (before they are written to LDM) must
be executed in parallel with the four pairs of LDM instructions. This means that the extra
instructions must be arranged into four sets of parallel instructions, in such manner that
each set can be executed in parallel with two LDM instructions. This limit could not be
met by the author.19
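The reordering in step (ii) amounts to a 4x4 byte transpose: in[j] holds four consecutive bytes of column repetition j, and out[r] must hold one byte from each of the four columns, i.e. four consecutive bytes of matrix row r. The following plain C sketch shows the permutation itself, with invented names, ignoring the parallel scheduling constraints that make it hard on the DSP.

```c
#include <stdint.h>

/* 4x4 byte transpose: in[j] = four bytes of column j (rows 0..3),
 * out[r] = four bytes of row r (columns 0..3). */
static void transpose4x4(const uint32_t in[4], uint32_t out[4]) {
    for (int r = 0; r < 4; r++) {
        uint32_t w = 0;
        for (int j = 0; j < 4; j++) {
            uint32_t byte = (in[j] >> (8 * r)) & 0xFFu; /* row r of column j */
            w |= byte << (8 * j);                       /* column j of row r */
        }
        out[r] = w;
    }
}
```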
Of course the critical loops do not need to be optimal; the stated technique can instead
be used to improve their performance significantly. But the improvement
would be of limited practical use due to the reasons stated in Chapter 7, Paragraph 7.1.6.
Most of the time when the processing step is to be performed the first repetition in E
belongs to U but the first two column repetitions have been skipped. Also there is hardly
ever a second repetition of U. This means that there are no column repetitions to soft
combine with the columns P(0) = 0 and P(1) = 16 of the matrix. The situation is worse
for V and W. For each of them there is hardly ever more than one repetition in E, and in it
some of the last column repetitions are missing. Take V as an example and say that the
last four column repetitions are missing. This means that there are no column repetitions
to soft combine with the columns P(28) = 7, P(30) = 15, P(29) = 23 and P(31) = 31 of
the matrix.
The technique is still useful if only a few columns of every matrix are missing column
repetitions. But if a critical loop is to process four consecutive columns each time it is
executed, then what should it do with those that have no column repetitions? A simple
19 The author even assumed that there is a fictional instruction that can copy any byte of one temporary
register to any byte of another, and that two such instructions can be performed in parallel with two LDM
instructions.
solution is to have three separate word arrays in LDM that are cleared. They can be used
as substitutions for column repetitions that are missing. Now if some of the four columns
are missing column repetitions when the loop is to be executed, then the corresponding
address registers can instead point to these word arrays that have been set to zero on
every position. The result will still be correct when performing soft combination due to
the simple observation that SatFunc(i + 0) = i, for any byte i.
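The substitution can be sketched as follows, with invented names: a statically cleared array stands in for a missing column repetition, so the soft combination adds zero everywhere and leaves the matrix bytes unchanged. The thesis suggests three such arrays; one suffices for the sketch.

```c
#include <stddef.h>

#define COL_ROWS 193  /* number of rows used by the performance tests above */

/* Zero-filled stand-in for a missing column repetition. Static storage
 * is zero-initialized, so no explicit clearing is needed here. */
static const signed char zero_col[COL_ROWS];

/* Pick the source for a column: the real column repetition if present,
 * otherwise the cleared substitute. */
static const signed char *col_source(const signed char *repetition) {
    return repetition != NULL ? repetition : zero_col;
}
```

In the critical loop the corresponding address register would simply be pointed at the cleared array instead of at a column repetition.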
Appendix 4. Review of assembler code produced by EMC
During the thesis it became obvious that EMC produces assembler code with many
deficiencies for the C code that it compiles. This had an unacceptable impact on the
performance of the thesis’ implementations when they were still written solely in C; the
most obvious reason was that the critical loops were compiled by EMC. The
following paragraphs give examples of critical loops that were written in C code during
the thesis. The assembler code produced for each of them by EMC is presented and the
code’s performance is discussed. The final paragraph draws some important conclusions
regarding EMC.
4.1 First critical loop

The following C code shows a loop that moves data from one array in LDM to another.
Every array has two words in each position. This was a critical loop used in Channel
deinterleaver’s implementation.
Code 1: A critical loop written in C code.
while (current_src < src_limit) {
    LDM_dst_32[current_dst] = LDM_src_32[current_src];
    LDM_dst_32[current_dst + 10] = LDM_src_32[current_src + 1];
    LDM_dst_32[current_dst + 2*10] = LDM_src_32[current_src + 2];
    LDM_dst_32[current_dst + 3*10] = LDM_src_32[current_src + 3];
    LDM_dst_32[current_dst + 4*10] = LDM_src_32[current_src + 4];
    LDM_dst_32[current_dst + 5*10] = LDM_src_32[current_src + 5];
    current_dst += 6*10;
    current_src += 6;
}
In each iteration LDM_src_32 can be read using six LDM instructions. They can be
performed in parallel pair-wise. Likewise six LDM instructions are required to write to
LDM_dst_32 and they too can be performed in parallel. The LDM instructions can update
their address registers accordingly (see Chapter 2, Paragraph 2.1.3.3). However, the
following is the assembler code produced by EMC for the loop.
Code 2: The assembler code produced by EMC for the C code shown by Code 1.
.begin_iteration_label:
ld *r1++, a2 | mv 20, m0 | add 6, a1
st a2, *r0++m0 | mv 20, m0
ld *r1++, a2
st a2, *r0++m0 | mv 20, m0
ld *r1++, a2
st a2, *r0++m0 | mv 20, m0
ld *r1++, a2
st a2, *r0++m0 | mv 20, m0
ld *r1++, a2
st a2, *r0++m0 | mv 20, m0
ld *r1++, a2
st a2, *r0++m0
.loop_condition_label:
cmp a0, a1
brr .begin_iteration_label, .a1:lt
The assembler code will not be explained in detail but only enough to later show its
deficiencies. The instruction add 6,a1 increments the temporary register a1 by 6.
ld *r1++,a2 reads two words from LDM_src_32 to a2. st a2,*r0++m0 then writes the
words to LDM_dst_32. These two LDM instructions also modify the address registers r0
and r1 that point out the current positions in the respective arrays. The former increments
r1 by one, while the latter increments r0 by the value of the offset register m0. mv 20,m0
sets it to 20.
cmp a0,a1 compares the value of a0 and a1 and then sets some flags. The next
instruction uses one of these flags to determine if the loop should stop executing or
continue.
The following observations can be made:
- m0 is set to the same value six times, even though it is not changed in between.
- Besides the first set of parallel instructions, every other set contains only one meaningful instruction (which happens to be an LDM instruction).
The loop’s execution time can be reduced by 47% by modifying only the assembler code shown in Code 2. By modifying the code in this manner and also using the DSP’s hardware support for executing loops, the time can be reduced by 65% (see Chapter 2, Paragraph 2.1.4 for a description of this hardware support). This gives an idea of the impact EMC has on performance.
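To give a feel for what such a modification could look like, the following is an illustrative hand-scheduling of Code 2. It is a sketch only: the redundant mv 20,m0 is hoisted out of the loop, each load is paired with the store of the previously loaded word pair (a simple form of software pipelining), and the loop therefore reads one word pair past the end of the array, which must be harmless or guarded. The scheduling has not been verified against the DSP’s exact issue rules, and the thesis’ actual optimized code may differ.

```asm
	mv 20, m0                      ; set the offset register once, outside the loop
	ld *r1++, a2                   ; prologue: load the first word pair
.begin_iteration_label:
	ld *r1++, a3 | st a2, *r0++m0 | add 6, a1
	ld *r1++, a2 | st a3, *r0++m0
	ld *r1++, a3 | st a2, *r0++m0
	ld *r1++, a2 | st a3, *r0++m0
	ld *r1++, a3 | st a2, *r0++m0
	ld *r1++, a2 | st a3, *r0++m0
	cmp a0, a1
	brr .begin_iteration_label, .a1:lt
```

With the DSP’s hardware loop support, the cmp and brr at the bottom could be removed as well, leaving only the six paired transfer cycles per iteration.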
The C code shown in Code 1 was modified in an attempt to make it more obvious to EMC that LDM instructions can be performed in parallel. In each iteration, the six positions of LDM_src_32 were first read into temporary variables, and these variables were then written to the six positions of LDM_dst_32. In this way it should be obvious that six consecutive read instructions are performed, followed by six write instructions. Nevertheless, the execution time of the new loop produced by EMC was worse than before the modification.20
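The modified loop may have looked roughly as follows. This is a sketch with ordinary C arrays standing in for the LDM arrays; the stride of 10 is taken from Code 1, while the function wrapper, names, and loop bound are illustrative.

```c
#include <stdint.h>

#define STRIDE 10 /* destination stride, as in Code 1 */

/* Read six source positions into temporaries, then write them out,
   hoping the compiler can pair the resulting LDM instructions. */
void copy_with_stride(uint32_t *dst, const uint32_t *src, int n)
{
    int current_dst = 0, current_src = 0;
    while (current_src < n) {
        uint32_t t0 = src[current_src];
        uint32_t t1 = src[current_src + 1];
        uint32_t t2 = src[current_src + 2];
        uint32_t t3 = src[current_src + 3];
        uint32_t t4 = src[current_src + 4];
        uint32_t t5 = src[current_src + 5];
        dst[current_dst]            = t0;
        dst[current_dst + STRIDE]   = t1;
        dst[current_dst + 2*STRIDE] = t2;
        dst[current_dst + 3*STRIDE] = t3;
        dst[current_dst + 4*STRIDE] = t4;
        dst[current_dst + 5*STRIDE] = t5;
        current_dst += 6*STRIDE;
        current_src += 6;
    }
}
```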
4.2 Second critical loop
The following C code shows a loop that clears an array in LDM. The array has two words in each position. This was a critical loop used in Channel deinterleaver’s implementation.
Code 3: A critical loop written in C code.
while (counter < counter_limit) {
LDM_dst_32[current_dst] = 0;
LDM_dst_32[current_dst + 10] = 0;
LDM_dst_32[current_dst + 2*10] = 0;
LDM_dst_32[current_dst + 3*10] = 0;
current_dst += 4*10;
counter += 4;
}
In every iteration the array’s four positions can be cleared by four LDM instructions, which can be performed pair-wise in parallel. However, the following is the assembler code EMC produced for it.
Code 4: The assembler code produced by EMC for the C code shown in Code 3.
.loop_condition_label:
cmp a0, a1
brr .end_loop_label, .a1:ge
.begin_iteration_label:
mv 0, a2l | mv 20, m0 | add 4, a1
st a2, *r0++m0 | mv 0, a2l
mv 20, m0
st a2, *r0++m0 | mv 0, a2l
mv 20, m0
st a2, *r0++m0 | mv 0, a2l
mv 20, m0
st a2, *r0++m0
brr .loop_condition_label
.end_loop_label:
The instruction mv 0,a2l clears the whole of a2.21 This loop has the same deficiencies as the previous one. Note that no LDM instructions are performed in parallel. In addition, the following can be observed:
- a2 is set to the same value multiple times, even though it is not changed in between.
- Two branch instructions are used when only one is necessary to construct the loop.

20 In one attempt the six variables were declared with the keyword “register” in C to clarify to EMC that their values are only stored temporarily.
The loop’s execution time can be reduced by 57% by modifying only the assembler code shown in Code 4. By modifying the code in this manner and also using the DSP’s hardware support for executing loops, the time can be reduced by 85%. Again, the compiler has a significant impact on performance.
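Again for illustration only, the clearing loop in Code 4 could be rewritten along the following lines. In this sketch the constant moves to a2l and m0 are hoisted out of the loop, the stores are paired by letting a second address register r1 start one position stride (20 words) after r0 while both step two strides (40 words) per store, and the loop test is moved to the bottom, assuming at least one iteration. The scheduling has not been verified against the DSP’s exact issue rules.

```asm
	mv 0, a2l                      ; clear a2 once, outside the loop
	mv 40, m0                      ; step two position strides per store
	; r1 is assumed to start 20 words after r0
.begin_iteration_label:
	st a2, *r0++m0 | st a2, *r1++m0 | add 4, a1
	st a2, *r0++m0 | st a2, *r1++m0
	cmp a0, a1
	brr .begin_iteration_label, .a1:lt
```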
4.3 Conclusion
In light of these observations, the following conclusions can be drawn regarding the assembler code produced by EMC:
- the DSP’s very short instruction pipeline is used inefficiently (unnecessary branch instructions delay it),
- unnecessary instructions are performed multiple times, and
- the DSP’s capability to execute multiple instructions in parallel is used inefficiently.
In ELTE it is common practice to compile implementations with EMC without further improving the result in assembler, owing to a constant shortage of time. It is therefore crucial that EMC is improved so that it uses EMPA’s capabilities to a greater extent.
21 This happens even if the instruction only addresses a register part. Sign extension occurs over the entire register due to the standard configuration of the DSP.
Appendix 5. Hardware changes to EMPA
During the thesis, some hardware aspects of EMPA made it notably difficult to write efficient implementations. This chapter highlights exactly how they became an obstacle, and suggests hardware changes that would remove it. However, a further discussion of whether the changes are feasible from the perspective of most implementations written by ELTE is still necessary, and a deeper understanding of EMPA’s underlying architecture is required to judge whether the changes can be carried out at all. These discussions are outside the scope of the thesis and are left to the discretion of ELTE.
5.1 Swapping contents of temporary register parts
Chapter 4 presents the criterion for optimal loop performance, which states that every cycle of the loop must be used to transfer the maximum possible amount of data between temporary registers and LDM. This requires that two parallel LDM instructions are executed every cycle; other instructions may only be executed in parallel with them.
It is possible that the loop’s task includes permuting the words that have been read from LDM. This means words must be swapped between the temporary register parts that hold the data read from LDM (see Chapter 2, Paragraph 2.1.2.1 for a description of temporary registers). For this purpose the instruction mv a0l,a1h comes in handy. It copies the contents of a0l to a1h, and can do this for any pair of register parts. Such an instruction will be referred to as a move instruction. Thus mv a0l,a1h|mv a1h,a0l is two move instructions in parallel, and the result is that the contents of a0l and a1h are swapped.
The problem is that the total number of LDM instructions and move instructions executed in parallel cannot exceed two. An LDM instruction and a move instruction can thus be executed in parallel, but not even one move instruction can be used in parallel with two LDM instructions. The conclusion is that a loop cannot use move instructions to perform its designated task and still be optimal.
The fact that move instructions cannot be executed in conjunction with LDM instructions has nothing to do with which temporary registers each instruction uses, for they may be totally different. There are certain limits that must be met by any set of instructions that is to be executed in parallel, and two LDM instructions together with two move instructions meet all of them. [11] states that the Data Address Arithmetic Unit of the DSP can execute two instructions in parallel, and that this is the unit that executes LDM instructions. A cautious conclusion is that this unit also executes move instructions, but this was neither confirmed nor denied by [11]. If this is true, an interesting question is why a unit of the DSP that is mainly responsible for executing LDM instructions also executes a purely arithmetical instruction. The DSP has other computational units that are tasked with many arithmetical instructions. Can they not be tasked with executing move instructions too? For instance, the instruction copy a0,a1 copies the content of a0 to a1, and it can be executed in parallel with two LDM instructions.
Paragraph 2.4.4 presented a loop with this exact task: reading data from LDM, swapping the contents of register parts, and writing the data back to LDM. The loop is optimal, and for that reason not a single move instruction could be used. For each pair of register parts that had to swap contents, three instructions and an extra temporary register were required. Algorithm 19 presents the loop and shows how these complications were barely managed. In Paragraph 2.4.5 the problem was encountered again, but this time no solution was found: there were too many pairs of register parts that had to swap contents, so the same technique could not be used again.
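To illustrate why avoiding move instructions is costly, the following C sketch shows the kind of shift-and-mask sequence needed to swap the low half of one 32-bit register with the high half of another. It illustrates the principle only; the thesis’ actual loop used the DSP’s own instructions, and the helper function below is hypothetical.

```c
#include <stdint.h>

/* Swap the low 16 bits of *x with the high 16 bits of *y without a
   part-to-part move instruction: each new value is rebuilt with
   shifts and masks, using an extra temporary along the way. */
void swap_low_high(uint32_t *x, uint32_t *y)
{
    uint32_t t = *x;                       /* extra temporary register */
    *x = (t & 0xFFFF0000u) | (*y >> 16);   /* keep x's high half, take y's high half */
    *y = (*y & 0x0000FFFFu) | (t << 16);   /* keep y's low half, take x's low half */
}
```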
It is very likely that a critical loop is tasked with reading data from LDM, permuting the words that have been read, and then writing them back to LDM. To make it easier to write such a loop so that it meets the criterion for optimal performance, the DSP should be able to execute at least one move instruction in parallel with two LDM instructions. Two are recommended, so that constructions like mv a0l,a1h|mv a1h,a0l can be used to swap the contents of two register parts in a single cycle without requiring extra temporary registers.