scenario: - technionsoftlab-pro-web.technion.ac.il/projects/oggvorbis/doc/… · web viewaverage...

Optimizing software performance using architectural considerations

Adir Abraham (adirab}{Technion.ac.il), Tal Abir (shabi}{t2.technion.ac.il)

Background:

System architecture can have a crucial impact on software performance. However, not all software projects fully exploit the knowledge of the target architecture, such as the cache size, the branch predictor mechanism, and the ability to hyperthread.

Overview:

Ogg Vorbis is a new audio compression format. It is roughly comparable to other formats used to store and play digital music, such as MP3, VQF, AAC, and other digital audio formats. It is different from these other formats because it is completely free, open, and unpatented.

Vorbis is a general purpose perceptual audio CODEC. The encoder allows maximum encoder flexibility, thus allowing it to scale competitively over an exceptionally wide range of bitrates. At the high quality/bitrate end of the scale (CD or DAT rate stereo, 16/24 bits), it is in the same league as MPEG-2 and MPC. Similarly, the 1.0 encoder can encode high-quality CD and DAT rate stereo at below 48kpbs without resampling to a lower rate. Vorbis is also intended for lower and higher sample rates (from 8kHz telephony to 192kHz digital masters) and a range of channel representations (monaural, polyphonic, stereo, quadraphonic, 5.1, ambisonic, or up to 255 discrete channels).

The Vorbis CODEC design assumes a complex, psychoacoustically-aware encoder and simple, low-complexity decoder. Vorbis decode is computationally simpler than mp3, although it does require more working memory as Vorbis has no static probability model; the vector codebooks used in the first stage of decoding from the bitstream are packed, in their entirety, into the Vorbis bitstream headers. In packed form, these codebooks occupy only a few kilobytes; the extent to which they are pre-decoded into a cache is the dominant factor in decoder memory usage.

Vorbis provides none of its own framing, synchronization or protection against errors; it is solely a method of accepting input audio, dividing it into individual frames and compressing these frames into raw, unformatted 'packets'. The decoder then accepts these raw packets in sequence, decodes them, synthesizes audio frames from them, and reassembles the frames into a facsimile of the original audio stream. Vorbis is a free-form VBR codec and packets have no minimum size, maximum size, or fixed/expected size. Packets are designed that they may be truncated (or padded) and remain decodable; this is not to be considered an error condition and is used extensively in bitrate management in peeling. Both the transport mechanism and decoder must allow that a packet may be any size, or end before or after packet decode expects.

Mapping

A mapping contains a channel coupling description and a list of 'submaps' that bundle sets of channel vectors together for grouped encoding and decoding. These submaps are not references to external components; the submap list is internal and specific to a mapping.

A 'submap' is a configuration/grouping that applies to a subset of floor and residue vectors within a mapping. The submap functions as a last layer of indirection such that specific special floor or residue settings can be applied not only to all the vectors in a given mode, but also specific vectors in a specific mode. Each submap specifies the proper floor and residue instance number to use for decoding that submap's spectral floor and spectral residue vectors.

Floor

Vorbis encodes a spectral 'floor' vector for each PCM channel. This vector is a low-resolution representation of the audio spectrum for the given channel in the current frame, generally used akin to a whitening filter. It is named a 'floor' because the Xiph.Org reference encoder has historically used it as a unit-baseline for spectral resolution.

A floor encoding may be of two types. Floor 0 uses a packed LSP representation on a dB amplitude scale and Bark frequency scale. Floor 1 represents the curve as a piecewise linear interpolated representation on a dB amplitude scale and linear frequency scale. The two floors are semantically interchangeable in encoding/decoding. However, floor type 1 provides more

stable inter-frame behavior, and so is the preferred choice in all coupled-stereo and high bitrate modes. Floor 1 is also considerably less expensive to decode than floor 0.

Residue

The spectral residue is the fine structure of the audio spectrum once the floor curve has been subtracted out. In simplest terms, it is coded in the bitstream using cascaded (multi-pass) vector quantization according to one of three specific packing/coding algorithms numbered 0 through 2. The packing algorithm details are configured by residue instance. As with the floor components, the final VQ/entropy encoding is provided by external codebook instances and each residue instance may choose from any and all available codebooks.

Codebooks

Codebooks are a self-contained abstraction that perform entropy decoding and, optionally, use the entropy-decoded integer value as an offset into an index of output value vectors, returning the indicated vector of values.

The entropy coding in a Vorbis I codebook is provided by a standard Huffman binary tree representation. This tree is tightly packed using one of several methods, depending on whether codeword lengths are ordered or unordered, or the tree is sparse.

High-level Decode Process

Decode setup

Before decoding can begin, a decoder must initialize using the bitstream headers matching the stream to be decoded. Vorbis uses three header packets; all are required, in-order, by this specification. Once set up, decode may begin at any audio packet belonging to the Vorbis stream. In Vorbis I, all packets after the three initial headers are audio packets.

The header packets are, in order, the identification header, the comments header, and the setup header.

Identification HeaderThe identification header identifies the bitstream as Vorbis, Vorbis version, and the simple audio characteristics of the stream such as sample rate and number of channels.

Comment HeaderThe comment header includes user text comments ["tags"] and a vendor string for the application/library that produced the bitstream. The encoding of the comment header is described within this document; the proper use of the comment fields is described in the Ogg Vorbis comment field specification.

Setup HeaderThe setup header includes extensive CODEC setup information as well as the complete VQ and Huffman codebooks needed for decode.

Decode Procedure

The decoding and synthesis procedure for all audio packets is fundamentally the same.

1. decode packet type flag 2. decode mode number 3. decode window shape [long windows only] 4. decode floor 5. decode residue into residue vectors 6. inverse channel coupling of residue vectors 7. generate floor curve from decoded floor data 8. compute dot product of floor and residue, producing audio spectrum

vector 9. inverse monolithic transform of audio spectrum vector, always an

MDCT in Vorbis I 10.overlap/add left-hand output of transform with right-hand output of

previous frame 11.store right hand-data from transform of current frame for future

lapping. 12.if not first frame, return results of overlap/add as audio result of

current frame

Target and motivation:In this project we will take Ogg Vorbis (v1.0.1), tune it for performance using VTune, and (hopefully) return it to the open-source community.

VTune

The VTune Performance Analyzer provides an integrated perforamance analysis and tuning environment that enables to analyze your code’s performance on Intel architecture processors of the IA-32 processor families. In our performance analysis, we used a Pentium 4 3.0GHz, with HT technology and 512MB of DDR-SDRAM. The VTune analyzer provides time- and event-based sampling, call-graph profiling, hotspot analysis, a tuning assistant, and many other features to assist performance tuning. It also has an integrated source viewer to link profiling data to precise locations in source code.

Thread Profiler

The Intel Thread Profiler facilitates tuning of OpenMP programs. It provides performance counters that are specific to OpenMP constructs. Intel Thread Profiler provides details on the time spent in serial regions, parallel regions, and critical sections and graphically displays performance bottlenecks due to load imbalance, lock contention, and parallel overhead. Performance data can be displayed for the whole program, by region, and even down to individual threads profiling data to precise locations in source code.

Thread Checker

The Intel Thread Checker facilitates debugging of multithreaded programs by automatically finding common errors such as storage conflicts, deadlock, API violations, inconsistent variable scope, thread stack overflows, etc. The non-deterministic nature of concurrency errors makes them particularly difficult to find with traditional debuggers. Thread Checker pinpoints error locations down to the source lines involved and provides stack traces showing the paths taken by the threads to reach the error. It also identifies the variables involved.

Scenario:

We encode a wav file called radio to an ogg file.The wav file is called radio.wav and it is ~100MB size.We run the encoder configured to get the highest bit rate. (“-q 10”)The output ogg file is radio.wav.ogg.

First run output:>oggenc.exe -q 10 radio.wav -o radio.wav.oggOpening with wav module: WAV file readerEncoding "radio.wav" to "radio.wav.ogg"at quality 10.00 [100.0%] [ 0m00s remaining] -

Done encoding file "radio.wav.ogg"

File length: 9m 14.0s Elapsed time: 0m 50.0s Rate: 11.0968 Average bitrate: 425.7 kb/s

1st run time: 50sec.

Sampling:

From the sampling we can see that ftol take 8.19% of the run time (The second hottest function in the run).

Considering pitfalls problems:

First optimization: _ftol.

_ftol is a small function used to converting float to int. Due to the ANSI standard, the usual code generated to convert a floating-point value to a long includes a call, floating point processor adjustments, and the actual conversion.

That _ftol function is a general-purpose call, which can handle both 32-bit and 64-bit conversion. It temporarily modifies the floating-point rounding mode to perform a straight truncation, does the roundoff and type conversion, and then sets the rounding mode back again to its previous state. This method, according to Intel research, is a long latency operation—much longer than you might think.

To say the least, this can eat a huge amount of CPU resources, as we can see from the results above.

As can be seen in the call graph bellow the potential of this optimization is great since there are ~3Milions calls to _ftol.

Solution: We’ve written an alternative function. Using call graph out

#define CONVERT_FLT2INT(MyFloat,MyInt){ \ int FltInt = *(int *)&MyFloat; \ int mantissa = (FltInt & 0x07fffff) | 0x800000; \ int exponent = ((FltInt >> 23) & 0xff) - 0x7f; \ int d = 23-exponent;\ if ((d) < 0) {\ MyInt = (mantissa << -(d)); \ d = -d;\ }\ else MyInt = (mantissa >> (d)); \ if (d>31)\ MyInt = 0;\ if (FltInt & 0x80000000) \ MyInt = -MyInt; \}

output: Opening with wav module: WAV file readerEncoding "radio.wav" to "radio.wav.ogg"at quality 10.00 [100.0%] [ 0m00s remaining] -



We can see now that the optimization Using VTune can show that the optimization potential is about 8% that in simple calculation will give 2 seconds out of the total of 50secs that is 4% improvement a half of the potential was fulfilled.

2# optimization: rint

As can be seen from the sampling, the third function _ctrlfp has 6.33% potential of improvement.

As we can see from the call graph it is called excessively from floor.After some “debugging” we came to the conclusion that rint calls _ctrlfp.

Rint() function rounds x to an integer value according to the prevalent rounding mode. The default rounding mode is to round to the nearest integer.

_ctrlfp has 2 performance issues — blocked store-forwarding and serialization. _ctrlfp is a macro which is used as part of the C function rint. _ctrlfp uses “fldcw”, which is an unpairable instruction, and we avoided using it, by writing an alternative code for rint.

Solution: we decided to rewrite this code and to avoid calling rint.

The rint() function rounds x to an integer value according to the prevalent rounding mode. The default rounding mode is to round to the nearest integer.

Here is the optimization code: #define RINT(MyFloat,MyInt) \{ \

int temp; \float out; \float in_ = (MyFloat) + 0.5; \if((MyFloat)>=0.){ \

CONVERT_FLT2INT(in_, temp); \(MyInt) = temp; \

}else{ \in_ = -in_ + 1; \CONVERT_FLT2INT(in_, temp); \temp = -temp; \out = (MyInt) = temp; \if(out == (MyFloat) - 0.5){ \

(MyInt)++; \} \

} \}output: Opening with wav module: WAV file readerEncoding "radio.wav" to "radio.wav.ogg"at quality 10.00 [100.0%] [ 0m00s remaining] -


File length: 9m 14.0s Elapsed time: 0m 45.0s Rate: 12.3298 Average bitrate: 425.7 kb/sAs we can see from the results, we’ve improved the benchmark in 6% out of 6.33% potential thus almost all potential was fulfilled.

3# optimization: 64k Aliasing

64k aliasing conflicts are cache evictions associated with mapping conflicts between multiple data regions. You want to avoid needing cache lines which are exactly 64k apart. To fix 64k aliasing conflicts you need to step through your code w/ a debugger and look at the different memory addresses you are loading from and storing to.

If you see addresses that are offset by a multiple of 64k (to be exact with address bits 7-15 the same since we really care about cache lines here not individual addresses) then you can have a potential aliasing problem.

Vtune indicated us that there may be a 64k Aliasing problem in bark_noise_hybridmp:

Description:Data could not be put in the first-level cache because there was already data in the cache with the same values for bits 0 through 15 of the linear address

float *N=alloca(n*sizeof(*N));float *X=alloca(n*sizeof(*N));float *XX=alloca(n*sizeof(*N));float *Y=alloca(n*sizeof(*N));float *XY=alloca(n*sizeof(*N));

Empirically, n is 1024 or 128 thus N array elements

Solution: we decided to add 1K to the X array so that …


We can see now that the optimization gives 2 seconds speedup out of the total of 50secs that is 4% improvement.

4# optimization:

Threading

Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. Hyper-Threading Technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.

Using Vtune sampling and the call graph, we saw that _vp_tonemask and _vp_noisemask take 3.7 seconds and 7.7 seconds. After reading the code we saw that these two functions are independent. Moreover, they are being called one after the other.

Solution: We’ve decided to parallel these functions. _vp_noisemask will remain at the main thread. _vp_tonemask will move to another thread which will be started before _vp_noisemask is called. We’ll use a Rendezvous in order to make sure that the program will make the sequenced calls after both function completed their work.

output: File length: 9m 14.0s Elapsed time: 0m 42.0s Rate: 13.2105 Average bitrate: 425.7 kb/s

We got a speedup of 1 second that is 2% out of 50 seconds. This is 28% of what we would have gotten in DP machine.

Verifying the solution : We ran Vtune Thread checker on the scenario:

It didn’t find any faulty points.We ran Vtune Thread profiler and got the following interaction between the threads:

Legend:

The impact time is the time that the current thread on the critical path delays the next thread on the critical path via some synchronization resource. Typically, the current thread holds a shared resource such as a lock while the next thread waits for this lock. The impact time begins when the next thread starts waiting for a shared resource. The impact time ends when the current thread releases that resource.

The cruise time is the time that the current thread on the critical path does not delay the next thread on the critical path by holding a synchronization resource or by invoking a blocking API. That is, there is no thread or synchronization interaction that affects the execution time during this portion of time.

We can see from the results the 2nd thread has a little impact on the main thread. The impact may be caused by the synchronization mechanism.

5# optimization:SIMD

The Streaming SIMD Extensions (SSE) enhance the Intel x86 architecture in four ways:

* 8 new 128-bit SIMD floating-point registers that can be directly addressed;

* 50 new instructions that work on packed floating-point data;

* 8 new instructions designed to control cacheability of all MMX and 32-bit x86 data types, including the ability to stream data to memory without polluting the caches, and to prefetch data before it is actually used;

* 12 new instructions that extend the MMX instruction set.

This set enables the programmer to develop algorithms that can mix packed, single-precision, floating-point and integer using both SSE and MMX instructions respectively. This approach was chosen because most media processing applications have the following characteristics:

* inherently parallel

* wide dynamic range, hence floating-point based

* regular memory access patterns

* data independent control flow.

We’ve decided to try and rewrite some of the code in this function into SIMD code. Since it is the major function in the application it seemed to have the best potential for optimization.We’ve tried to optimize the first loop in the function which was the heaviest loop in the function:

It takes around 3.8 seconds.The rewritten code take 4 floats out of the f array, computes w, w*x,…,w*x*y on all 4 floats and store these middle computations in N,…, XY arrays:

#ifdef USE_SSE {

const float k = 0.f;const float a[4]={0.f,1.f,2.f,3.f};const float four=4;const float one=1;

__m128 reg_1; __m128 reg_offset; __m128 reg_4;

__m128 reg_x;reg_1=_mm_load_ps1(&one);reg_offset=_mm_load_ps1(&offset);reg_4=_mm_load_ps1(&four);reg_x=_mm_loadu_ps(a);

for (; i < n; i+=4)

{ __m128 reg_w;#define reg_mask reg_w __m128 reg_wx; __m128 reg_wy;

__m128 reg_y; reg_x=_mm_add_ps(reg_x, reg_4);

#define reg_f reg_y reg_f =_mm_loadu_ps(f+i); reg_y = _mm_add_ps(reg_f, reg_offset); //y = f[i] + offset; reg_y = _mm_max_ps(reg_y,reg_1); //if (y < 1.f) y = 1.f;

//w = y * y; reg_w= _mm_mul_ps(reg_y, reg_y); _mm_store_ps(N+i, reg_w);

// X[i] = w*x; reg_wx= _mm_mul_ps(reg_w, reg_x); _mm_store_ps(X+i, reg_wx);

#define reg_wxx reg_wx reg_wxx= _mm_mul_ps (reg_wx, reg_x); _mm_store_ps(XX+i, reg_wxx); reg_wy= _mm_mul_ps (reg_w, reg_y); _mm_store_ps(Y+i, reg_wy);

#define reg_xy reg_w reg_xy= _mm_mul_ps (reg_wy, reg_x); NOPE _mm_store_ps(XY+i, reg_xy);

} }Then in each array we summarize all floats till the i cell and store it in the i cell.#define ST(N) N[i]+=N[i-1];

for(i=1;i<n;i++){ST(N);

} sse(X,XX,n); sse(Y,XY,n);The following function does the summarizing for two arrays:void sse(float *f,float* f2, int n)

{_asm { xorps xmm2,xmm2 // P P P P (first P is 0) xorps xmm3,xmm3 mov eax,f mov edx,n xorps xmm6,xmm6 // P P P P (first P is 0) xorps xmm7,xmm7 mov ebx,f2 xor ecx,ecx

L: movaps xmm0,[eax+ecx*4]

movaps xmm4,[ebx+ecx*4] movaps xmm1,xmm0 movaps xmm5,xmm4 pslldq xmm1,4 // xmm1 = 0, A, B, C pslldq xmm5,4 // xmm1 = 0, A, B, C addps xmm0,xmm1 // xmm0 = A, A+B, C+B, c+D addps xmm4,xmm5 // xmm0 = A, A+B, C+B, c+D movlhps xmm3,xmm0 // 0 0 , A B+ A movlhps xmm7,xmm4 // 0 0 , A B+ A addps xmm0,xmm3 // A, A+B A+B+C A+B+C+D addps xmm4,xmm7 // A, A+B A+B+C A+B+C+D addps xmm2,xmm0 // P + A, P + A+B P +A+B+C, P+A+B+C+D addps xmm6,xmm4 // P + A, P + A+B P +A+B+C, P+A+B+C+D movaps [eax+ecx*4],xmm2 movaps [ebx+ecx*4],xmm6 pshufd xmm2,xmm2,0xFF // new P, New P, new P, new P // fixed: 0xe4 pshufd xmm6,xmm6,0xFF // new P, New P, new P, new P // fixed: 0xe4 add ecx,4 cmp ecx,edx jne L }}

output: File length: 9m 14.0s Elapsed time: 0m 41.0s Rate: 13.5327 Average bitrate: 425.7 kb/s

We got a speedup of 1 second that is 2% out of 50 seconds.

Results analysis :

Now the loop takes only 1.4 seconds:

And the whole function takes 7.47 seconds:

Instead of 8.73 seconds:

6# optimization:Intel compiler

We used Intel compiler 8.0 instead of the visual compiler and got the following results: File length: 9m 14.0s Elapsed time: 0m 38.0s Rate: 14.6011 Average bitrate: 425.7 kb/s

This is a speedup of 3 second that is 6% out of 50 seconds.

A note about Intel compiler vs Visual C++ compiler

We realized that there was a difference between oggenc which was compiled with Intel compiler and oggenc which was compiled via Visual C++ compiler. We used the two different oggenc files to gain two different OGG files, using the original WAV file.We decoded the two OGG files which we got, back to a wav file, and the result was the same wav files from the two OGG files, although the two OGG files were not the same.The explanation for this situation is that FP rounding can cause the different results.2 compilers can get different rounding in FP code and as a result you can get different Hafman tables. After converting it to floating point you may get different numbers but if it is a small error the rounding to 16 bit fix point (as in WAV) will eliminate the rounding error.

Misc:

We added support to non-threading and non-SSE compilations, for users who might not want threading and might not have the SSE instruction set.

Checking the SSE support was done via checking the CPUID, and non-threading support was done via environment variable VORBIS_NO_THREADS.

Final results:

Ogg Vorbis’ performance before the optimization:

50 seconds.

Ogg Vorbis’ performance after the optimization, using Visual C++ 6 compiler:

41 seconds (18% of performance boost).

Ogg Vorbis’ performance after the optimization, using Intel Compiler 8:

38 seconds (24% of performance boost).

scenario: - technionsoftlab-pro-web.technion.ac.il/projects/oggvorbis/doc/… · web viewaverage...

Documents