Software Performance Tuning Project – Final Presentation Prepared By: Eyal Segal Koren Shoval Advisors: Liat Atsmon Koby Gottlieb

Post on 21-Dec-2015


Page 1

Software Performance Tuning Project – Final Presentation

Prepared By: Eyal Segal, Koren Shoval

Advisors: Liat Atsmon, Koby Gottlieb

Page 2

WavPack – Description

• WavPack is an open source audio compression format.
  – Allows lossless audio compression.
• Compresses WAV files to WV files.
  – Average compression ratio is 30-70%.
• Support for Windows and mobile devices.
  – Cowon A3 PMP, iRiver, iPod, Nokia phones, and more.

Page 3

Project Goals

• Enhance WavPack performance by:
  – Analyzing it with the Intel® VTune™ Performance Analyzer.
  – Studying and applying instructions of Intel®'s new processors.
  – Implementing multi-threading techniques in order to achieve high performance.
• Return the source code to the community.

Page 4

Algorithm Description

• Input file is processed in blocks of 512 KB.
  – A global context exists for all blocks.
  – Blocks are divided into sub-blocks.
    • 24,000 samples each, roughly 0.5 second of WAV audio at CD quality.
  – Encodes each block and writes to output.
  – Updates context data for the next block.
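The block and sub-block sizes above can be turned into durations with a little arithmetic. The sketch below assumes standard CD-quality parameters (44,100 Hz, 16-bit, stereo), reads "samples" as per-channel sample frames, and reads 512 KB as 512 × 1024 bytes; none of these are stated on the slide, and the helper names are illustrative.

```python
# Assumed CD-quality parameters (standard values, not stated on the slide).
SAMPLE_RATE_HZ = 44_100
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 2              # stereo


def seconds_for_frames(frames, sample_rate=SAMPLE_RATE_HZ):
    """Duration of `frames` sample frames at the given sample rate."""
    return frames / sample_rate


def frames_in_buffer(buffer_bytes, bytes_per_sample=BYTES_PER_SAMPLE,
                     channels=CHANNELS):
    """Number of whole sample frames that fit in a raw PCM buffer."""
    return buffer_bytes // (bytes_per_sample * channels)


sub_block_seconds = seconds_for_frames(24_000)       # one sub-block
buffer_frames = frames_in_buffer(512 * 1024)         # one 512 KB read
buffer_seconds = seconds_for_frames(buffer_frames)

print(f"sub-block: {sub_block_seconds:.3f} s")
print(f"512 KB buffer: {buffer_frames} frames, {buffer_seconds:.2f} s")
```

Under these assumptions a sub-block is a little over half a second of audio, and one 512 KB read holds several sub-blocks.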

Page 5

[Flowchart: WavPack encoding flow]

• Context: global information, passed down to each function.
• Configuration: stereo/mono, bps {8, 16, 24, 32}, pass count, etc.
• Read a buffer of 512 KB from the input file.
• Go over the buffer, taking a block of 24,000 samples at a time.
• Branch: "Is Lossless & Stereo" (other branches: more options…).
  – Transform the left and right channels to mid and diff.
• 1st part of the WavPack algorithm
  – Perform the WavPack decorrelation algorithm on the buffer.
• 2nd part of the WavPack algorithm
  – Write the resulting buffer to the output. This is the compression stage.
    • Calculate additional information for compression.
    • Perform the compression bit by bit.
    • Count ones and zeros until a change occurs.
  – Each subset of bytes depends on an indeterminate subset of the previous bytes.
    • This is why parallelizing the entire flow fails.
• Flow markers: Init, × pass count, Finish.

Page 6

Testing Environment

• Hardware
  – Intel Core i7 2.66 GHz and Core 2 Quad Q6600 2.4 GHz CPUs.
  – 4 GB of RAM.
• Software
  – Windows XP/Vista.
  – Visual Studio 2008.
  – Intel VTune Toolkit.
  – Compiled with the Microsoft compiler.
• Tests are done on a 330 MB WAV file.

Page 7

Original Implementation

• Single-threaded application
  – Read from disk.
  – Encode.
  – Write to disk directly.
• Old MMX instructions are used.
• Processing the 330 MB WAV file takes about 30 seconds.

Page 8

Optimizations: Parallel IO/CPU

Page 9

Optimizations: Parallel IO/CPU

• General
  – Separate read, write and processing operations into several threads.
• Flow
  – Use the main thread to read the input file.
    • Create "jobs" and submit them into a work queue.
  – Use an additional thread to process the "jobs".
    • Output is redirected to memory instead of disk.
  – Another thread writes the processed output to the disk.
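The three-stage flow above can be sketched as a small producer/consumer pipeline. This is a minimal illustration in Python rather than the project's actual code: `encode_block` is a hypothetical stand-in that only copies its input, and the queue sizes and the `_DONE` sentinel are assumptions made for the example.

```python
import io
import queue
import threading

BLOCK_SIZE = 512 * 1024   # read size taken from the slides
_DONE = object()          # sentinel telling a stage that no more work follows


def encode_block(block: bytes) -> bytes:
    """Hypothetical stand-in for the real encoder; it only copies the data."""
    return bytes(block)


def run_pipeline(src, dst, block_size=BLOCK_SIZE):
    jobs = queue.Queue(maxsize=8)      # reader -> processing thread
    results = queue.Queue(maxsize=8)   # processing thread -> writer thread

    def process():
        while (job := jobs.get()) is not _DONE:
            results.put(encode_block(job))   # output stays in memory
        results.put(_DONE)

    def write():
        while (chunk := results.get()) is not _DONE:
            dst.write(chunk)

    workers = [threading.Thread(target=process),
               threading.Thread(target=write)]
    for t in workers:
        t.start()

    # The calling (main) thread acts as the reader and creates the jobs.
    while block := src.read(block_size):
        jobs.put(block)
    jobs.put(_DONE)

    for t in workers:
        t.join()


source = io.BytesIO(bytes(range(256)) * 10)
sink = io.BytesIO()
run_pipeline(source, sink, block_size=100)
```

A single processing thread keeps the blocks in order, which matters here because each block depends on the context left by the previous one.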

Page 10

Optimizations: Parallel IO/CPU – cont.

• Benchmark
  – VTune analysis showed the following results
  – Average running time is about 29 seconds.
  – Speedup is 1.026.
    • Relative to the original results.
• Conclusions
  – No significant improvement.
  – I/O operations take considerably less time than block processing.
    • Reads complete long before the processing is done.
    • The writing thread is almost never busy.

Page 11

Optimizations: Multi-Threaded Processing

Page 12

Optimizations: Multi-Threaded Processing

• General
  – Obstacle: each block depends on the previously processed block.
    • Parallelizing the entire flow is impossible.
  – Multithreading parts of the algorithm.
    • Locate parts of the code where the program spends most of the time.
    • Parallelize several functions in these parts.
• Implementation
  – Using a "Thread Pool".
  – Work is separated into left and right channels.
    • Within each channel, each sample depends on the previous sample.
    • Can't use more than two threads.
  – Each thread uses a different memory area.
    • Results must be combined after the work is done.
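The per-channel split can be sketched as two tasks submitted to a pool, each working on its own copy of one channel, followed by a combine step. This is an illustration only: `delta_encode` is a hypothetical stand-in for the per-channel work (chosen because each output depends on the previous sample), not WavPack's decorrelation, and in CPython the threads show the structure rather than a real speedup for pure-Python loops.

```python
from concurrent.futures import ThreadPoolExecutor


def delta_encode(channel):
    """Stand-in for per-channel work: each output depends on the previous sample."""
    out, prev = [], 0
    for sample in channel:
        out.append(sample - prev)
        prev = sample
    return out


def process_stereo(interleaved, pool):
    """Split L/R, process each channel as its own task, then interleave again."""
    # Slicing copies the data, so each task works on a separate memory area.
    left = interleaved[0::2]
    right = interleaved[1::2]

    left_job = pool.submit(delta_encode, left)
    right_job = pool.submit(delta_encode, right)

    # Wait for both tasks, then combine the results into one output buffer.
    out = [0] * len(interleaved)
    out[0::2] = left_job.result()
    out[1::2] = right_job.result()
    return out


with ThreadPoolExecutor(max_workers=2) as pool:
    encoded = process_stereo([1, 10, 3, 20, 6, 40], pool)
```

Only two tasks exist per buffer, which mirrors the two-thread limit noted above: the sample-to-sample dependency prevents splitting a channel any further.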

Page 13

[Flowchart: multi-threaded processing flow]

• Branch: "Is Lossless & Stereo" (other branches: more options…).
• Processing thread
  – Fill two new "Thread Args" structures, one with the left channel data and one with the right.
  – Create duplicates of each shared data structure to avoid cache conflicts.
  – Submit each work item to the "Thread Pool".
  – Wait on the "OnComplete" mutex.
  – Interleave the left and right channel data into one output buffer.
  – Write the resulting buffer to the output. This is the compression stage.
    • Calculate additional information for compression.
    • Perform the compression bit by bit.
    • Count ones and zeros until a change occurs.
• Worker threads 1 and 2 (one for the left channel, one for the right channel)
  – Wait for work to arrive in the "Thread Pool" and start the work.
  – Perform the WavPack decorrelation algorithm on the buffer.
  – Return to the "Thread Pool".
• Flow marker: × pass count.

Page 14

Optimizations: Multi-Threaded Processing – cont.

• Benchmark
  – VTune analysis showed the following results
  – Average running time is about 25 seconds.
  – Speedup is 1.167.
    • Relative to the original results.
• Conclusions
  – About 17% of the running time is parallelized.
  – Total improvement: 0.17 × 30 sec = 5.1 sec.
    • Due to overhead, the improvement is a little bit smaller.
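The estimate above can be checked against the measured speedup with a few lines of arithmetic. The sketch follows the slide's own reasoning (17% of a 30-second run) and uses only the figures reported in the slides; the variable names are illustrative.

```python
BASELINE_SEC = 30.0          # original running time reported in the slides
PARALLEL_FRACTION = 0.17     # share of the running time that was parallelized
MEASURED_SPEEDUP = 1.167     # speedup reported in the benchmark

# The slide's estimate: the parallelized share of the original run.
estimated_saving = PARALLEL_FRACTION * BASELINE_SEC
estimated_time = BASELINE_SEC - estimated_saving
estimated_speedup = BASELINE_SEC / estimated_time

# Running time implied by the measured speedup factor.
measured_time = BASELINE_SEC / MEASURED_SPEEDUP

print(f"estimated: {estimated_time:.1f} s (speedup {estimated_speedup:.3f})")
print(f"measured:  {measured_time:.1f} s (speedup {MEASURED_SPEEDUP})")
```

The measured figure comes out slightly slower than the estimate, which is consistent with the overhead mentioned in the conclusions.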

Page 15

Optimizations: Moving to SIMD

Page 16

Optimizations: Moving to SIMD

• General
  – Locate mathematical calculations and loops.
    • Where the program spends most of the time.
  – Use 128-bit-wide instructions.
  – Convert four 32-bit operations into one 128-bit operation.
    • Theoretically, performance can be 4× faster.
    • In practice, there is overhead (load, store).
• Implementation
  – Refactor the code as a basis for adding SIMD operations.
  – Loop unrolling.
    • Make sure to complete the "leftovers" of the loop.
  – Re-implement using SIMD code.
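The unroll-by-four structure with a leftover loop can be sketched as follows. Python has no SSE intrinsics, so `add4` is a hypothetical scalar stand-in for one 128-bit operation over four 32-bit lanes; the point of the example is the loop shape, not the speed.

```python
LANES = 4  # four 32-bit values fit in one 128-bit register


def add_scalar(a, b):
    """Reference version: one element per iteration."""
    return [x + y for x, y in zip(a, b)]


def add4(va, vb):
    """Stand-in for a single 128-bit add over four 32-bit lanes."""
    return [va[0] + vb[0], va[1] + vb[1], va[2] + vb[2], va[3] + vb[3]]


def add_unrolled(a, b):
    """Process four elements per step, then finish the leftovers one by one."""
    n = len(a)
    out = [0] * n
    main = n - n % LANES                      # largest multiple of LANES <= n
    for i in range(0, main, LANES):
        # "load" two groups of four, apply one wide operation, "store" the result
        out[i:i + LANES] = add4(a[i:i + LANES], b[i:i + LANES])
    for i in range(main, n):                  # the "leftovers" of the loop
        out[i] = a[i] + b[i]
    return out


a = list(range(10))
b = [100] * 10
result = add_unrolled(a, b)
```

With ten elements the main loop covers the first eight and the leftover loop handles the last two, so the result matches the scalar version for any length.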

Page 17

Optimizations: Moving to SIMD – cont.

• Benchmark
  – VTune analysis showed the following results
  – Average running time is about 28 seconds.
  – Speedup is 1.043.
    • Relative to the original results.
• Conclusions
  – Mathematical calculations can mainly be done with SSE2 and SSE3.
  – SSE4 instructions were not useful for this application.
  – The improvement alone isn't significant.
    • More significant when combined with the multi-threading optimization.

Page 18

Optimizations: Implementation Improvements

Page 19

Optimizations: Implementation Improvements

• General
  – We found several hot spots in the program that we couldn't improve using the methods mentioned earlier.
    • Branch misprediction.
  – Re-implement in a more efficient way.
• Implementation
  – Focused on one main function.
    • Lots of branch mispredictions.
    • A 16-bit integer was used as the output buffer.
  – Removed most of the branch instructions.
  – Re-implemented the same logic with a 64-bit integer buffer.
    • Largest register size.
    • SIMD would require too much overhead.
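The idea of buffering output bits in a 64-bit integer can be sketched as a small bit writer that flushes a whole word at a time, so the flush check runs once per call instead of once per bit. This is an illustration under assumptions, not the project's code: the class name `BitWriter64`, the least-significant-bit-first order, and the 32-bit width limit are all hypothetical choices.

```python
MASK64 = (1 << 64) - 1


class BitWriter64:
    """Accumulate bits in a 64-bit word and flush whole words at once."""

    def __init__(self):
        self.acc = 0              # pending bits, least significant bit first
        self.nbits = 0            # number of pending bits (below 64 between calls)
        self.out = bytearray()

    def put(self, value, width):
        """Append the low `width` bits of `value` (1 <= width <= 32)."""
        self.acc |= (value & ((1 << width) - 1)) << self.nbits
        self.nbits += width
        # Python integers grow as needed; a C version would have to carry the
        # bits that spill past bit 63 into the next word explicitly.
        if self.nbits >= 64:
            self.out += (self.acc & MASK64).to_bytes(8, "little")
            self.acc >>= 64
            self.nbits -= 64

    def finish(self):
        """Flush the remaining bits, padded with zeros to a whole byte."""
        nbytes = (self.nbits + 7) // 8
        self.out += self.acc.to_bytes(nbytes, "little")
        self.acc = 0
        self.nbits = 0
        return bytes(self.out)


writer = BitWriter64()
writer.put(0b101, 3)
writer.put(0xFF, 8)
packed = writer.finish()
```

Compared with a 16-bit buffer, a 64-bit accumulator flushes a quarter as often for the same amount of output, which is one way the number of executed branches can drop.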

Page 20

Optimizations: Implementation Improvements – cont.

• Benchmark
  – VTune analysis showed the following results
  – Average running time is about 28 seconds.
  – Speedup is 1.06.
    • Relative to the original results.
• Conclusions
  – Branch instructions and branch mispredictions were reduced.
  – Improvement in performance: almost 2 seconds less.
  – The implementation is concentrated in one method.
    • Easy to refactor.
    • Requires no major architecture changes.

Page 21

Summary

• The most significant optimization was multi-threading code sections.
  – 16.7% speedup.
• The least significant was the multithreaded I/O.
  – 2.6% speedup.

Page 22

Summary – Cont.

• Benchmark
  – VTune analysis showed the following results
  – Average running time is about 22 seconds.
  – Total speedup we achieved is 1.335.
    • The program runs faster by 33.5%.
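The reported speedup factors can be put side by side with a short calculation. The sketch uses only figures from the slides (a 30-second baseline and the per-optimization speedups); treating the individual speedups as multiplicative is an assumption made here for comparison, not something the presentation claims.

```python
from math import prod

BASELINE_SEC = 30.0   # original running time reported in the slides
SPEEDUPS = {          # per-optimization speedups reported in the slides
    "parallel I/O": 1.026,
    "multi-threaded processing": 1.167,
    "SIMD": 1.043,
    "implementation improvements": 1.06,
}
TOTAL_SPEEDUP = 1.335  # reported with all optimizations applied

# Running time implied by each speedup factor: time = baseline / speedup.
implied_times = {name: BASELINE_SEC / s for name, s in SPEEDUPS.items()}
total_time = BASELINE_SEC / TOTAL_SPEEDUP

# Assumption: independent optimizations would combine multiplicatively.
combined = prod(SPEEDUPS.values())

for name, seconds in implied_times.items():
    print(f"{name}: {seconds:.1f} s")
print(f"all combined (reported): {total_time:.1f} s")
print(f"product of individual speedups: {combined:.3f}")
```

Under that assumption the product of the individual factors lands close to the reported total speedup.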

Page 23

Summary – Cont.

• Conclusions
  – Multithreading is something to be considered in the architectural stages of the application.
    • In this application, the performance improvement is not worth the development and maintenance effort.
  – SIMD optimizations should only be used in specific cases.
    • The code becomes harder to use and understand.
  – Decreasing branch mispredictions and cache misses is a better way to improve performance.
    • Refactoring only specific methods.
    • Easier to implement and usually simplifies the code.
    • Using VTune and similar analysis tools is a good practice.
  – Leveraging new CPU instructions should be the compiler's responsibility.
    • Don't really need a developer to do this job.
    • Code gets cluttered.

Page 24

Sources

• WavPack official website
  – http://www.wavpack.com
• Intel® VTune™ Performance Analyzer
• Sourceforge website
  – http://sourceforge.net/
• Software lab website
  – http://softlab.technion.ac.il/
• MSDN
  – http://msdn.microsoft.com
• Wikipedia
  – http://en.wikipedia.org/wiki/
• Intel website
  – http://www.intel.com/