Performance Tuning Panotools - PTMender
Layout
• Project Goal
• About Panotools
• Multi-threading
• SIMD, micro-architectural pitfalls
• Results
Project Goal
• Gaining performance on PanoTools
• This goal will be achieved through:
1. Multi-threading the application to exploit new multi-core machines, which offer the most significant performance promise.
2. Using SSE code.
3. Finding micro-architectural pitfalls and resolving them with the VTune tuning assistant.
About Panotools
• Panotools is the cross-platform library behind Panorama Tools and many other GUI photo stitchers.
• It is gaining popularity as the back-end engine for many panoramic applications.
• Selected to participate in the “Google Summer Of Code 2007”.
• We focused on the PTMender module of the library.
More details on Panotools at: http://panotools.sourceforge.net/
Multi-threading
• Two major approaches in multi-threading an existing single-threaded application:
1. Data decomposition – Dividing the data into smaller parts and performing parallel work on each part. This is not always possible due to algorithmic dependencies between the divided parts.
2. Functional decomposition – Dividing the work according to functional tasks. Each thread performs a unique predefined task. This is harder to perform and requires a deep understanding of the original algorithm.
Multi-threading – contd.
• Naturally we started looking for Data decomposition.
• In theory, because PTMender works on several files we could have processed a number of files simultaneously.
• Alternatively, we could have divided a single file and processed its parts simultaneously.
• In practice, using the Call Graph function in VTune, we noticed a native division of each file into independent parts on which the algorithm runs.
• Clearly, the chosen method was the latter because it provides better scalability.
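As a rough illustration of data decomposition within a single file, the sketch below splits an image into contiguous bands of rows and hands one band to each thread. Everything here is an assumption made for illustration: the slides do not say how PTMender partitions a file, `band_t`, `split_rows`, `process_band` and `process_image` are hypothetical names, the per-pixel work is a placeholder, and POSIX threads stand in for the Win32 threading used in the project.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 2   /* assumed thread count (thread0 and thread1) */

/* Hypothetical work item: one contiguous band of rows of an image. */
typedef struct {
    const unsigned char *src;
    unsigned char *dst;
    size_t width;
    size_t row_begin;   /* inclusive */
    size_t row_end;     /* exclusive */
} band_t;

/* Compute the half-open row range [begin, end) owned by thread idx.
   Leftover rows are given to the lowest-numbered threads. */
static void split_rows(size_t rows, size_t nthreads, size_t idx,
                       size_t *begin, size_t *end)
{
    assert(nthreads > 0 && idx < nthreads);
    size_t base = rows / nthreads;
    size_t extra = rows % nthreads;
    *begin = idx * base + (idx < extra ? idx : extra);
    *end = *begin + base + (idx < extra ? 1 : 0);
}

/* Placeholder per-pixel work (inverts each value); it only touches
   rows inside the band, so bands are independent of each other. */
static void *process_band(void *arg)
{
    band_t *b = arg;
    for (size_t y = b->row_begin; y < b->row_end; ++y)
        for (size_t x = 0; x < b->width; ++x)
            b->dst[y * b->width + x] =
                (unsigned char)(255 - b->src[y * b->width + x]);
    return NULL;
}

/* Process the whole image with one thread per band; returns 0 on success. */
static int process_image(const unsigned char *src, unsigned char *dst,
                         size_t width, size_t rows)
{
    pthread_t tid[NUM_THREADS];
    band_t band[NUM_THREADS];
    size_t started = 0;
    int rc = 0;

    for (size_t i = 0; i < NUM_THREADS; ++i) {
        band[i].src = src;
        band[i].dst = dst;
        band[i].width = width;
        split_rows(rows, NUM_THREADS, i, &band[i].row_begin, &band[i].row_end);
        if (pthread_create(&tid[i], NULL, process_band, &band[i]) != 0) {
            rc = -1;
            break;
        }
        ++started;
    }
    for (size_t i = 0; i < started; ++i)
        pthread_join(tid[i], NULL);
    return rc;
}
```

Because each thread writes only to its own rows, no locking is needed inside `process_band`; adding threads only changes `NUM_THREADS`, which is the scalability argument made above.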
VTune - Call graph
[Figure: VTune call graph; label: Serial task]
The Parallel model
[Figure: parallel model; labels: thread0, thread1]
Multi-threading – contd.
• Data sharing – We created arrays of thread-specific data structures:
typedef struct thread_vars {
    Image result;
    TrformStr transform;
    int pad[16];
} thread_vars_t;
thread_vars_t thread_private[NUM_THREADS];
And not:
Image result[NUM_THREADS];
TrformStr transform[NUM_THREADS];
Padding is used to create full cache line separation between array entries and prevent “false sharing”.
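The cache-line separation can also be expressed with C11 alignment instead of a manual pad array. The sketch below is a variation on the slide's approach, not the code used in the project: it assumes a 64-byte cache line, and the `Image` and `TrformStr` definitions are small hypothetical stand-ins for the real panotools types. A trailing `int pad[16]` adds 64 bytes to each entry, but whether an entry starts on a line boundary still depends on the alignment of the array; requesting the alignment explicitly covers both.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

#define NUM_THREADS 2
#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Hypothetical stand-ins for the panotools Image and TrformStr types. */
typedef struct { int width, height; } Image;
typedef struct { double scale; } TrformStr;

/* Aligning the first member forces the whole struct to cache-line
   alignment; the compiler then rounds sizeof up to a multiple of
   CACHE_LINE, so no two entries can share a line. */
typedef struct thread_vars {
    _Alignas(CACHE_LINE) Image result;
    TrformStr transform;
} thread_vars_t;

static_assert(sizeof(thread_vars_t) % CACHE_LINE == 0,
              "each entry must fill whole cache lines");

static thread_vars_t thread_private[NUM_THREADS];

/* Index of the cache line that holds the start of thread i's data. */
static uintptr_t first_line_of(int i)
{
    return (uintptr_t)&thread_private[i] / CACHE_LINE;
}
```

With this layout, `first_line_of(0)` and `first_line_of(1)` always differ, so a write by one thread never invalidates the line holding the other thread's data.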
Thread Checker
Thread Checker - Debug
• The effects of the data races were later obvious from observing the output.
[Figure: output image; label: Noise]
Thread Checker – Debug – contd.
• Adding synchronization around critical sections
#ifdef PROTECT_WRITE
// Request ownership of mutex.
dwWaitResult = WaitForSingleObject(
    hTiffWriteMutex, // handle to mutex
    5000L);          // five-second time-out interval
if (dwWaitResult == WAIT_OBJECT_0)
{
    __try {
        // Write to the database.
#endif
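The slide shows only the acquire side of the Win32 mutex. For a complete picture of the same pattern, here is a minimal sketch using POSIX threads as a portable stand-in. All names (`write_mutex`, `rows_written`, `writer`, `run_writers`) are hypothetical, and the shared counter merely stands in for the shared output that the project protects.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 2
#define WRITES_PER_THREAD 10000

static pthread_mutex_t write_mutex = PTHREAD_MUTEX_INITIALIZER;
static long rows_written = 0;   /* shared state standing in for the output file */

/* Each thread performs its writes one at a time under the mutex. */
static void *writer(void *arg)
{
    (void)arg;
    for (int i = 0; i < WRITES_PER_THREAD; ++i) {
        pthread_mutex_lock(&write_mutex);
        ++rows_written;          /* critical section: one writer at a time */
        pthread_mutex_unlock(&write_mutex);
    }
    return NULL;
}

/* Run all writers to completion and return the final count. */
static long run_writers(void)
{
    pthread_t tid[NUM_THREADS];
    rows_written = 0;
    for (int i = 0; i < NUM_THREADS; ++i) {
        int rc = pthread_create(&tid[i], NULL, writer, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(tid[i], NULL);
    return rows_written;
}
```

With the lock in place `run_writers()` returns exactly `NUM_THREADS * WRITES_PER_THREAD`; without it the increments race and the total can come up short, which is the kind of corruption a data-race checker reports.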
Thread Profiler
Thread Profiler – contd.
Image comparison
SIMD & uArchitecture
• Unfortunately we did not find good opportunities for vectorizing.
• The main micro-architectural issue is mispredicted indirect calls. This cannot be solved, since the panotools mechanism relies heavily on function pointers for flexibility.
• FP activity is significant. We changed the floating-point model in compilation from “precise” to “fast” and reduced the benchmark's instruction count to under 90% of that produced by the original code generation.
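One reason a "fast" floating-point model can emit fewer instructions is that it lets the compiler treat floating-point addition as associative and regroup expressions, which a "precise" model must not do because the result can change. The sketch below shows that the two groupings really differ under strict IEEE double arithmetic. The function names are hypothetical, and the slides do not say which compiler or option was used; options such as `/fp:fast` in MSVC or `-ffast-math` in GCC are examples of such a model.

```c
#include <assert.h>

/* Left-to-right grouping, as written in the source. */
static double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Regrouped form that a "fast" model is allowed to substitute. */
static double sum_right(double a, double b, double c)
{
    return a + (b + c);
}

/* Holds when compiled with strict (precise) floating-point semantics;
   under a fast model the compiler may fold both forms to the same value. */
static void check_example(void)
{
    /* (1e20 + -1e20) + 1.0: the large terms cancel first, leaving 1.0. */
    assert(sum_left(1e20, -1e20, 1.0) == 1.0);
    /* 1e20 + (-1e20 + 1.0): the 1.0 is absorbed by rounding, leaving 0.0. */
    assert(sum_right(1e20, -1e20, 1.0) == 0.0);
}
```

The trade-off is that results may differ slightly from the precise build, so output images should be compared after switching models.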
Results
Thank you