Tuning DiFX2 for performance Adam Deller ASTRON 6th DiFX workshop, CSIRO ATNF, Sydney AUS

Tuning DiFX2 for performance

Adam Deller, ASTRON

6th DiFX workshop, CSIRO ATNF, Sydney AUS

Adam Deller 6th DiFX workshop, CSIRO ATNF

Outline

• I/O bottlenecks and solutions: communication with the real world (reading raw data, writing visibilities)
• Interprocess communication
• Keeping out of memory trouble
• Minimizing CPU load in various corners of parameter space

For more information and pictures: http://cira.ivec.org/dokuwiki/doku.php/difx/mpifxcorr/

Getting data into DiFX

[Figure: DiFX data-flow diagram. Labels: Master Node; DataStream 1 … N; Core 1 … M; source data; large, segmented ring buffer; processing buffer; visibility buffer. Flows: timerange and destination; baseband data; visibilities.]

How to test? Use neutered_difx with a small number of channels.

Fundamental limit: native transfer speed (disk read, network pipe). If this is the problem, buy a RAID or get InfiniBand or …

Potential troublemaker: CPU utilisation on the datastream node (competition). This can come from Tsys estimation.

Tweaking: datastream databuffer

Datastream databuffer

[Figure not recovered; one surviving label reads “Subint”.]

Key parameters: dataBufferFactor, nDataSegments, subintNS

The only real potential problem I/O-wise: the buffer is too short (dataBufferFactor).

Getting visibilities out of DiFX

[Figure: DiFX data-flow diagram (Master Node, DataStream 1 … N, Core 1 … M and their buffers), with the visibilities going from the visibility buffer to disk.]

FxManager writes the visibilities to disk.

This is very rarely a problem unless you have a dying disk or very large and/or frequent visibility dumps.

Testing: neutered_difx + fake data source (ensures good input speeds)

Tweaking: none. If you want to write out visibilities faster, put a fast disk (probably RAID) on the manager node!

Interprocess @ the Datastream

[Figure: DiFX data-flow diagram (Master Node, DataStream 1 … N, Core 1 … M and their buffers; flows of timerange and destination, baseband data, visibilities).]

Generally not a problem.

Tweaking: dataBufferFactor; ensure a reasonable size (avoids latency issues).

The default (32) is generally OK, but it could usually be bigger without problems (increase nSegments also).

Interprocess @ the Core

[Figure: DiFX data-flow diagram (Master Node, DataStream 1 … N, Core 1 … M and their buffers; flows of timerange and destination, baseband data, visibilities).]

Tweaking:
• subintNS
• Output visibility size (nChan / nBaselines)

In terms of reducing data transmission, increasing subintNS is the only real knob to turn.

This is unimportant for continuum, single-phase-centre correlation; it is only relevant for very high spectral resolution and/or multiple phase centres.

In those cases, bigger is better; but be careful about memory (later).

Interprocess @ the FxManager

[Figure: DiFX data-flow diagram (Master Node, DataStream 1 … N, Core 1 … M and their buffers; flows of timerange and destination, baseband data, visibilities).]

The most common trouble point! The FxManager must aggregate data from all Core nodes, which can lead to high data rates.

To calculate the rate into FxManager, work out the rate for one Core node and scale.

Tweaking: maximise subintNS! Or (although this is usually not possible) reduce the visibility size (via nChan or the number of phase centres).
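As a rough illustration of the "work out one Core node and scale" estimate, the sketch below assumes that each visibility is an 8-byte complex value and that every Core returns one full set of visibilities per subint it processes. The function names and the parameters `n_pol_products` and `seconds_per_subint` are hypothetical, not DiFX identifiers, and the byte size and the model itself are assumptions rather than anything taken from the DiFX source.

```python
def vis_bytes_per_subint(n_baselines, n_chan, n_phase_centres=1,
                         n_pol_products=1, bytes_per_vis=8):
    """Approximate size of one Core's result for a single subint.

    bytes_per_vis=8 assumes one single-precision complex number per
    channel, baseline, polarisation product and phase centre.
    """
    return n_baselines * n_chan * n_phase_centres * n_pol_products * bytes_per_vis


def manager_input_rate(n_cores, result_bytes, seconds_per_subint):
    """Bytes per second arriving at FxManager.

    Assumes each Core finishes one subint (and sends one result) every
    `seconds_per_subint` wall-clock seconds, then scales by n_cores.
    """
    per_core = result_bytes / seconds_per_subint
    return n_cores * per_core


# Illustrative numbers only: a 10-station array with many phase centres.
size = vis_bytes_per_subint(n_baselines=45, n_chan=4096, n_phase_centres=100)
rate = manager_input_rate(n_cores=20, result_bytes=size, seconds_per_subint=0.5)
print(f"{size / 1e6:.1f} MB per subint, {rate / 1e9:.2f} GB/s into FxManager")
```

In this model the time a Core spends on a subint grows with subintNS while the result size does not, which is why a longer subint lowers the rate into FxManager.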

Memory @ the Datastream

Just don’t make the combination of dataBufferFactor and subintNS too big (can also control via “sendSize”)
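To see how quickly the two parameters combine, here is a minimal sketch that assumes the Datastream buffer holds dataBufferFactor sends and that one send carries one subint of baseband data. That sizing rule is an assumption made for illustration, and `data_rate_mbps` and the function name are hypothetical; check the actual buffer sizing in your DiFX version before relying on the numbers.

```python
def datastream_buffer_bytes(data_buffer_factor, subint_ns, data_rate_mbps):
    """Rough buffer size in bytes under the assumption stated above.

    Integer arithmetic: bits per second times nanoseconds, divided by
    8 bits per byte and 1e9 nanoseconds per second.
    """
    bytes_per_subint = (data_rate_mbps * 1_000_000 * subint_ns) // (8 * 1_000_000_000)
    return data_buffer_factor * bytes_per_subint


# Same dataBufferFactor, two subint lengths (20 ms and 200 ms) at 2048 Mbps.
for subint_ns in (20_000_000, 200_000_000):
    size = datastream_buffer_bytes(32, subint_ns, data_rate_mbps=2048)
    print(f"subintNS={subint_ns}: about {size / 2**20:.0f} MiB")
```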

Memory @ the Core

Usually the biggest problem, memory-wise

This never used to be a problem, but multi-field-centre jobs hit hard.

A bigger subint means more memory (storing datastream baseband).

More threads means more memory, at the pre-averaging spectral resolution.

Buffering more FFTs costs more (× the number of threads, too!).

Tweaking:
• subintNS
• nThreads (threads file)
• numBufferedFFTs

And be aware of:
• nFFTChans (for multiple phase centres/high spectral resolution)
• Number of phase centres

Memory @ the FxManager

Tweaking: visBufferLength. It multiplies the size of a single visibility (nChan, nBaselines, nPhaseCentres).

Generally not a problem.

Note: visBufferLength should not be too short, especially if you have many (especially heterogeneous) Core nodes, as the subints can arrive out of order.
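The multiplication can be written out as a short sketch. It assumes 8 bytes per complex visibility and ignores polarisation products and any per-record overhead; the function name and the example value of visBufferLength are illustrative choices, not DiFX defaults.

```python
def fxmanager_buffer_bytes(vis_buffer_length, n_chan, n_baselines,
                           n_phase_centres=1, bytes_per_vis=8):
    """visBufferLength times the (assumed) size of a single visibility set."""
    single_visibility = n_chan * n_baselines * n_phase_centres * bytes_per_vis
    return vis_buffer_length * single_visibility


# Continuum-style job versus a many-phase-centre, high-resolution job.
small = fxmanager_buffer_bytes(40, n_chan=256, n_baselines=45)
large = fxmanager_buffer_bytes(40, n_chan=4096, n_baselines=45, n_phase_centres=100)
print(f"{small / 2**20:.1f} MiB versus {large / 2**30:.1f} GiB")
```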

CPU @ the Datastream

Loading of the Datastream is usually pretty light. But the Datastream often runs on old hardware (e.g. Mk5 units) with limited CPU capacity.

A couple of options can cause problematically high loads:
• Tsys extraction (.v2d: tcalFreq = xx)
• Interlaced VDIF formats (used with multi-thread VDIF data, e.g. phased EVLA)

More efficient implementations are coming; for now, buy a faster CPU if needed!

CPU @ the Core

Many considerations here, including parameters usually fixed by the science:
• Number of phase centres
• Spectral resolution (nChan/nFFTChan)

Plus several on array management:
• strideLength
• numBufferedFFTs
• xmacLength

And then a few others as well:
• nThreads
• Fringe rotation order

Number of phase centres: for each phase centre, there is a phase rotation and a separate accumulation from thread to main buffer.

That costs CPU (proportional to the number of baselines and the number of phase centres), but also ensures that the results don’t fit in cache (more later).

Spectral resolution: more channels means a bigger FFT, and that costs CPU.

It doesn’t typically follow a log N law like it should: bigger gets worse fast beyond ~1024 due to cache performance.

Really big (>= 8192 channels/subband) gets very expensive.

Worst thing: it typically comes in combination with multiple phase centres! (required to avoid bandwidth smearing)
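For reference, the "log N law" is the ideal case: an N-point FFT costs on the order of N log N operations, so the cost per input sample grows only as log N. The sketch below prints that ideal scaling relative to a 1024-point FFT; it is a baseline to compare measured timings against, not a model of DiFX itself, and the function name is hypothetical.

```python
import math


def ideal_fft_cost_per_sample(n_points, reference_points=1024):
    """Ideal per-sample FFT cost relative to a reference FFT size.

    An N-point FFT costs about N * log2(N) operations, so the cost per
    sample scales as log2(N).
    """
    return math.log2(n_points) / math.log2(reference_points)


for n in (256, 1024, 8192, 65536):
    print(n, round(ideal_fft_cost_per_sample(n), 2))
```

If measured per-sample cost grows much faster than these ratios, the excess is the cache effect the slide describes.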

Array management #1: strideLength (auto setting usually best)

[Figure: phase between -180° and 180° across one FFT of data.]

Compute sin/cos for the first “strideLength” samples, and for every “strideLength”’th sample after that.
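The idea can be sketched in a few lines, assuming the rotation phase is linear across one FFT (a constant fringe rate). This is an illustration of the angle-addition trick only, not the DiFX implementation; the function names are hypothetical and the sketch requires the stride to divide the number of samples.

```python
import cmath
import math


def rotator_direct(n_samples, phase0, dphase):
    """One complex exponential per sample: n_samples trig evaluations."""
    return [cmath.exp(1j * (phase0 + k * dphase)) for k in range(n_samples)]


def rotator_strided(n_samples, phase0, dphase, stride):
    """Same rotator using only stride + n_samples/stride trig evaluations."""
    assert n_samples % stride == 0, "sketch assumes stride divides n_samples"
    # Fine steps: phases of the first `stride` samples.
    fine = [cmath.exp(1j * (phase0 + b * dphase)) for b in range(stride)]
    # Coarse steps: extra phase accumulated at every stride'th sample.
    coarse = [cmath.exp(1j * (a * stride * dphase))
              for a in range(n_samples // stride)]
    # exp(i(x + y)) = exp(ix) * exp(iy): one complex multiply per sample.
    return [c * f for c in coarse for f in fine]


n = 1024
stride = math.isqrt(n)  # stride + n/stride is smallest near sqrt(n)
direct = rotator_direct(n, 0.3, 0.001)
strided = rotator_strided(n, 0.3, 0.001, stride)
print(max(abs(x - y) for x, y in zip(direct, strided)))
```

The remaining samples cost a complex multiply each instead of a sin/cos pair, which is where the saving comes from.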

Array management #2: numBufferedFFTs (auto = 10 is usually OK)

Mitigates the cache-miss problem by ×10.

[Figure: Mode 1, Mode 2, Mode 3, …, Mode N and the visibility buffer.]

The visibility buffer is too big for cache, but one slot fits in cache! Precompute numBufferedFFTs FFT results, one station at a time.

Array management #3: xmacLength (auto setting of 128 usually fine; further subdivides XMAC step)

nThreads: usually, set nThreads = n(CPU cores) - 1.

Occasionally, it can be advantageous to use fewer threads (avoiding swap memory / cache contention).
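A minimal helper for the rule of thumb, under two assumptions that are not from the slides: `os.cpu_count()` reports logical CPUs (which may include hyperthreads, so the physical core count can be lower), and the result is clamped to at least one thread. The function name is hypothetical.

```python
import os


def suggested_n_threads(n_cpu_cores=None):
    """nThreads = n(CPU cores) - 1, but never less than 1."""
    if n_cpu_cores is None:
        # Logical CPU count on this machine; falls back to 1 if unknown.
        n_cpu_cores = os.cpu_count() or 1
    return max(1, n_cpu_cores - 1)


print(suggested_n_threads())
```

Run it on each Core node, since heterogeneous nodes will give different answers.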

Fringe rotation order: the default is 1, and this is almost always fine.

2nd order is only ever needed for very high fringe rates with very long FFTs (space VLBI of masers?).

BUT: 0th order could often be used, and almost never is; it can be about 25% faster.

[Figure: fringe rotation phase versus time across the 1st and 2nd FFTs.] Here, the fringe rate is too high for 0th order.

But at low fringe rate, 0th order approximation can be acceptable

.v2d: fringeRotOrder = [0, 1, 2]
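One rough way to judge whether 0th order might be acceptable is to ask how far the fringe phase drifts during a single FFT. The sketch below computes that drift and the amplitude retained when a linearly drifting phasor is averaged with the correction held fixed. It is a first-order indicator under stated assumptions (constant fringe rate, FFT duration equal to the number of samples divided by the sample rate); the function names are hypothetical, and the slides do not give a threshold for what counts as acceptable.

```python
import math


def phase_drift_turns(fringe_rate_hz, fft_points, sample_rate_hz):
    """Fringe phase change, in turns, over the duration of one FFT."""
    fft_duration_s = fft_points / sample_rate_hz
    return abs(fringe_rate_hz) * fft_duration_s


def coherence_if_held_constant(drift_turns):
    """Amplitude kept when averaging a phasor that drifts linearly by
    `drift_turns` turns: |sin(pi * d) / (pi * d)|."""
    if drift_turns == 0:
        return 1.0
    x = math.pi * drift_turns
    return abs(math.sin(x) / x)


# Illustrative values: 1 kHz fringe rate, 32 Msample/s, two FFT lengths.
for fft_points in (256, 16384):
    drift = phase_drift_turns(1000.0, fft_points, 32e6)
    print(fft_points, drift, coherence_if_held_constant(drift))
```

Short FFTs at modest fringe rates give a negligible drift, which matches the slide's point that 0th order is often usable; long FFTs at high fringe rates do not.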

CPU @ the FxManager

CPU load at the FxManager is typically light - it only does low-cadence accumulation and scaling of visibilities

Very short subintNS can potentially lead to problems (although network issues are more likely)

Questions?