1
Next Generation Correlators, June 26th−29th, 2006
The LOFAR Blue Gene/L Correlator
Stichting ASTRON (Netherlands Foundation for Research in Astronomy), Dwingeloo, the Netherlands
John W. Romein, P. Chris Broekema, Ellen van Meijeren, Kjeld van der Schaaf, Walther H. Zwart
2
LOFAR
distributed sensor network
simple receivers
20–240 MHz
37–77+ stations
virtual telescope
32 in central core
remote stations
central processing on supercomputer
Groningen
3
Outline
central processing
Blue Gene/L
work distribution
the correlator
performance
discussion
4
The LOFAR Central Processor
5
Signal Processing Steps
Delay, PolyPhase Filter, FX Correlator, Flagging
6
Characteristics
37–77+ stations
160 subbands; 32 MHz bandwidth
input: 195 kHz; 2 pols; i16complex; 10–20 GB/s
after PPF: 763 Hz; 256 channels; 2 pols; complex float
output after correlation: 703–3003+ baselines; 256 channels; 4 pols; 1 sec. integration; complex float; 1–4 GB/s
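The figures on this slide can be cross-checked with a little arithmetic. The sketch below is only an illustration: it assumes a subband sample rate of exactly 200 MHz / 1024 = 195312.5 Hz, 4 bytes per i16complex sample and 8 bytes per single-precision complex float, none of which is stated on the slide. The function names are hypothetical.

```python
# Back-of-the-envelope check of the baseline counts and data rates.
# Assumed (not on the slide): exact subband rate of 200 MHz / 1024,
# 4-byte i16complex samples, 8-byte complex floats.

SUBBANDS = 160
SAMPLE_RATE_HZ = 200e6 / 1024      # ~195 kHz per subband (assumed exact value)
CHANNELS = 256
BYTES_I16COMPLEX = 4
BYTES_COMPLEX_FLOAT = 8


def baselines(stations: int) -> int:
    """Number of station pairs, autocorrelations included."""
    return stations * (stations + 1) // 2


def input_rate_gbytes(stations: int) -> float:
    """Input rate in GB/s: stations x subbands x samples/s x 2 pols."""
    return stations * SUBBANDS * SAMPLE_RATE_HZ * 2 * BYTES_I16COMPLEX / 1e9


def output_rate_gbytes(stations: int) -> float:
    """Output rate in GB/s: one integration of every subband per second."""
    values = baselines(stations) * CHANNELS * 4 * SUBBANDS
    return values * BYTES_COMPLEX_FLOAT / 1e9


if __name__ == "__main__":
    for n in (37, 77):
        print(n, baselines(n), input_rate_gbytes(n), output_rate_gbytes(n))
    print("channel width in Hz:", SAMPLE_RATE_HZ / CHANNELS)
```

Under these assumptions, 37 and 77 stations land close to the two ends of the ranges quoted on the slide.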
7
The Blue Gene/L
700 MHz dual PowerPC 440
256 MB RAM per core
2 FPUs per core
complex numbers support
2 FMAs / cycle
2.8 GFLOP/s per core
Ethernet, tree, torus networks
synchronous communication!
12,288 cores
34.4 TFLOP/s & 768 Gb/s
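The peak numbers follow from the clock rate and the FMA throughput quoted above. A minimal sketch of that arithmetic, assuming one FMA counts as two floating-point operations (the constant and function names are illustrative):

```python
# Peak floating-point rate derived from the slide's figures.
CLOCK_HZ = 700e6        # 700 MHz PowerPC 440
FMAS_PER_CYCLE = 2      # from the slide
FLOPS_PER_FMA = 2       # assumption: one multiply plus one add
CORES = 12_288


def peak_per_core_flops() -> float:
    """Peak FLOP/s of a single core."""
    return CLOCK_HZ * FMAS_PER_CYCLE * FLOPS_PER_FMA


def peak_machine_flops() -> float:
    """Peak FLOP/s of all cores together."""
    return peak_per_core_flops() * CORES


if __name__ == "__main__":
    print(peak_per_core_flops() / 1e9, "GFLOP/s per core")
    print(peak_machine_flops() / 1e12, "TFLOP/s in total")
```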
8
External I/O
16 compute cores behind 1 Gb/s Ethernet interface
I/O node bridges between Ethernet and tree
create TCP socket on compute node
768 Psets
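The pset count and the aggregate external bandwidth follow directly from 12,288 cores grouped 16 to an Ethernet interface. A small sketch of that calculation; the even per-core split is an idealization for illustration, not a measured figure:

```python
# Psets and external bandwidth derived from the slide's figures.
CORES = 12_288
CORES_PER_PSET = 16
ETHERNET_GBIT_PER_PSET = 1.0


def pset_count() -> int:
    """Number of psets (one I/O node with a 1 Gb/s interface each)."""
    return CORES // CORES_PER_PSET


def aggregate_gbit() -> float:
    """Total external bandwidth in Gb/s if every interface is saturated."""
    return pset_count() * ETHERNET_GBIT_PER_PSET


def per_core_share_mbit() -> float:
    """Idealized share per core in Mb/s when all 16 cores communicate."""
    return ETHERNET_GBIT_PER_PSET * 1000 / CORES_PER_PSET


if __name__ == "__main__":
    print(pset_count(), aggregate_gbit(), per_core_share_mbit())
```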
9
Work Distribution (1/2)
parallel in subbands (160)
1 subband: too much work for 1 core
use specialized cores
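A rough estimate illustrates why one subband exceeds one core. The sketch below counts only the correlation itself (no filtering or delay compensation) and assumes 8 real floating-point operations per complex multiply-accumulate and a subband rate of 200 MHz / 1024; both are assumptions, not figures from the slides.

```python
# Estimated correlator load for one subband versus one core's peak.
SAMPLE_RATE_HZ = 200e6 / 1024       # assumed exact subband sample rate
FLOPS_PER_COMPLEX_MAC = 8           # assumption: 4 multiplies + 4 adds
CORE_PEAK_FLOPS = 2.8e9             # per-core peak quoted for Blue Gene/L


def correlator_flops(stations: int) -> float:
    """FLOP/s needed to correlate one subband in real time."""
    baselines = stations * (stations + 1) // 2
    polarization_pairs = 4
    macs_per_second = baselines * polarization_pairs * SAMPLE_RATE_HZ
    return macs_per_second * FLOPS_PER_COMPLEX_MAC


if __name__ == "__main__":
    need = correlator_flops(37)
    print(need / 1e9, "GFLOP/s needed;", need / CORE_PEAK_FLOPS, "cores at peak")
```

Even at 37 stations the correlation alone comes out above a single core's peak under these assumptions, before any filtering work is counted.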
10
Work Distribution (2/2)
parallel in subbands
distribute second of sampled data round-robin over cores
core filters, shifts phase, correlates
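The round-robin scheme can be sketched as follows. This is a toy model, not the actual scheduler: `assign_core` and `schedule` are hypothetical names, and the group size of 6 cores per subband is borrowed from the 37-station measurement later in the talk.

```python
from collections import defaultdict


def assign_core(second: int, cores_per_subband: int) -> int:
    """Core (within one subband's group) that processes the given second."""
    return second % cores_per_subband


def schedule(seconds: int, cores_per_subband: int) -> dict:
    """Map each core index to the list of seconds it handles."""
    plan = defaultdict(list)
    for second in range(seconds):
        plan[assign_core(second, cores_per_subband)].append(second)
    return dict(plan)


if __name__ == "__main__":
    # With 6 cores, each core gets one second of data every 6 seconds,
    # so it may spend up to 6 seconds of wall-clock time on it.
    for core, work in schedule(12, 6).items():
        print("core", core, "handles seconds", work)
```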
11
The Correlator
weigh partially flagged data
floating point

FOR stat2 IN 1 .. NrStations DO
  FOR stat1 IN 1 .. stat2 DO
    FOR pol1 IN [X,Y] DO
      FOR pol2 IN [X,Y] DO
        sum = (0,0)
        FOR time IN 1 .. IntegrationTime DO
          sum += samples[stat1][time][pol1] * ~samples[stat2][time][pol2]
        END
        correlation[baseline(stat1,stat2)][pol1][pol2] = sum
      END
    END
  END
END
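For readers who want to run the loop nest, here is a plain Python transcription. It is a reference sketch only: `~` is read as complex conjugation, indices are 0-based, and the triangular layout used by `baseline()` is an assumption, since the slide does not define that function.

```python
def baseline(stat1: int, stat2: int) -> int:
    """Triangular index for stat1 <= stat2 (0-based; layout is assumed)."""
    return stat2 * (stat2 + 1) // 2 + stat1


def correlate(samples):
    """Correlate samples[station][time][pol] (complex) over all time steps.

    Returns correlation[baseline][pol1][pol2], autocorrelations included.
    """
    nr_stations = len(samples)
    nr_times = len(samples[0])
    nr_baselines = nr_stations * (nr_stations + 1) // 2
    correlation = [[[0j, 0j], [0j, 0j]] for _ in range(nr_baselines)]
    for stat2 in range(nr_stations):
        for stat1 in range(stat2 + 1):
            for pol1 in range(2):
                for pol2 in range(2):
                    total = 0j
                    for time in range(nr_times):
                        a = samples[stat1][time][pol1]
                        b = samples[stat2][time][pol2]
                        total += a * b.conjugate()
                    correlation[baseline(stat1, stat2)][pol1][pol2] = total
    return correlation


if __name__ == "__main__":
    # Two stations, one time step, two polarizations each.
    data = [[[1 + 1j, 2j]], [[1j, 1 + 0j]]]
    print(correlate(data))
```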
12
Correlator Optimizations
written in assembly
correlate 3x2 stations
why? see next slide
treat autocorrelations differently
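One way to see the benefit of correlating a block of stations at once is to count memory loads per correlation. The sketch below is a simplified model, assuming one load per station per polarization per time step and ignoring register pressure and instruction scheduling; `loads_per_correlation` is a hypothetical helper, not part of the original code.

```python
from fractions import Fraction


def loads_per_correlation(rows: int, cols: int) -> Fraction:
    """Sample loads per polarized correlation for a rows x cols station tile.

    Each of the rows + cols stations contributes an X and a Y sample per
    time step; the tile then updates 4 polarization pairs per station pair.
    """
    loads = 2 * (rows + cols)
    correlations = 4 * rows * cols
    return Fraction(loads, correlations)


if __name__ == "__main__":
    for tile in ((1, 1), (2, 2), (3, 2)):
        print(tile, loads_per_correlation(*tile))
```

In this model a 1x1 tile needs one load per correlation, while a 3x2 tile needs fewer than half as many, which is consistent with the "minimize #loads" goal named on the next slide.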
13
Correlator Code
2 instructions per correlation/integration
hide FPU latencies
interleave with other correlations
minimize #loads
hide load latencies
use large register file
concurrent FPU ops & loads
…
fxcpnsma X0X2,X0,X2,X0X2
lfpsux X3,p3,inc
fxcpnsma X0Y2,X0,Y2,X0Y2
lfpsux Y3,p3,inc
fxcpnsma Y0X2,Y0,X2,Y0X2
fxcpnsma Y0Y2,Y0,Y2,Y0Y2
fxcpnsma X1X2,X1,X2,X1X2
fxcpnsma X1Y2,X1,Y2,X1Y2
fxcpnsma Y1X2,Y1,X2,Y1X2
fxcpnsma Y1Y2,Y1,Y2,Y1Y2
fxcxma X0X2,X0,X2,X0X2
fxcxma X0Y2,X0,Y2,X0Y2
fxcxma Y0X2,Y0,X2,Y0X2
fxcxma Y0Y2,Y0,Y2,Y0Y2
fxcxma X1X2,X1,X2,X1X2
fxcxma X1Y2,X1,Y2,X1Y2
fxcxma Y1X2,Y1,X2,Y1X2
fxcxma Y1Y2,Y1,Y2,Y1Y2
…
X0Y2 += X0 * ~Y2
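The update `X0Y2 += X0 * ~Y2` is a complex multiply-accumulate with a conjugated operand, and it can be split into two steps that each perform two real fused multiply-adds. The sketch below shows one such split in Python. It is meant to illustrate why two instructions can suffice; the mapping of the two steps onto `fxcpnsma` and `fxcxma` is an assumption and has not been checked against the instruction set documentation.

```python
def step_same(acc: complex, a: complex, b: complex) -> complex:
    """First half: both parts of a are scaled by Re(b) and accumulated."""
    return complex(acc.real + a.real * b.real,
                   acc.imag + a.imag * b.real)


def step_cross(acc: complex, a: complex, b: complex) -> complex:
    """Second half: the parts of a are swapped and scaled by Im(b).

    The signs are chosen so that both halves together add a * conj(b).
    """
    return complex(acc.real + a.imag * b.imag,
                   acc.imag - a.real * b.imag)


def conj_mac(acc: complex, a: complex, b: complex) -> complex:
    """acc += a * conj(b), built from the two halves above."""
    return step_cross(step_same(acc, a, b), a, b)


if __name__ == "__main__":
    a, b = 3 + 4j, 1 - 2j
    print(conj_mac(0j, a, b), a * b.conjugate())
```

Each half is two multiply-adds, so one complex update costs 8 real floating-point operations, which matches two instructions on a unit that issues 2 FMAs per cycle.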
14
Computational Performance
1 second of station samples, 1 subband, 1 core
correlator: 98% of FPU peak performance!
15
Network Performance
need multiple concurrently-communicating cores
one core does not achieve 1 Gbit/s
OS problem
16
Overall Performance
37 stations, 1 subband, 195 kHz → 256 channels on 6 cores
I/O limited
17
The EoR observation mode
computationally most-challenging mode
32–37 stations
160 subbands
~24 beams
i4complex input samples
10 second integration time
requires ~25 (!) TFLOP/s
need 6-rack capacity
need faster communication
18
Discussion & Conclusions
software: great flexibility
Blue Gene/L
excellent computational performance
correlator achieves 98% of FPU peak
need faster communication
estimated development time: < 1 man-year
paper: http://www.astron.nl/~romein/ [SPAA'06]