Using many-core processors to improve the performance of space computing platforms

Faculty of Informatics, Chair of Computer Architectures. Fisnik Kraja, PhD Candidate. 2011 IEEE Aerospace Conference, 5-12 March 2011, Big Sky, Montana

Upload: fisnik-kraja

Post on 19-Jun-2015


DESCRIPTION

IEEE Aerospace Conference 2011

TRANSCRIPT

  • 1. Faculty of Informatics, Chair of Computer Architectures. Fisnik Kraja, PhD Candidate. 2011 IEEE Aerospace Conference, 5-12 March 2011, Big Sky, Montana

2. Subject: New computing architecture for future satellites. Purpose: To introduce many-core and other COTS technologies in the design process. Main points will be: state of the art of space applications and computing platforms; proposed system architecture; performance estimations (benchmarking); discussions and conclusions. 3/12/2011

3. On-board computers offer minimal functionality. Constraints like power, size, heat. High-reliability requirements because of radiation effects: Total Ionizing Dose (TID), Single Event Upset (SEU), Single Event Transient (SET), Single Event Latch-up (SEL). New space applications ask for improved on-board processing abilities, in terms of high processing power and throughput, without losing the required reliability.

4. HRWS SAR (high-resolution wide-swath synthetic aperture radar). Used to reduce the amount of data to be transmitted to ground. Uses separate apertures to transmit and receive. Uses multiple phase centers in receive. Each panel represents an independent phase center. 7 panels are used, each consisting of 12 tiles.

5. Parallelism of the algorithm: 7 independent panel processing; 12x7=84 independent tile processing. Requirements: 1 Tera 16-bit fixed-point Ops/s (complex multiply and add); peak sample rate: 8 Gbps; full antenna average raw data rate: 603.1 Gbps. It is impossible to fulfill these requirements with currently available technology for space.

6. To efficiently apply the upcoming many-core processors and other COTS products to improve the on-board processing power. Reliability of the system should be addressed by: traditional hardware techniques (TMR); software-implemented fault-tolerant techniques; thread/process/service replication. This system should provide other important features: flexibility, scalability, portability.

7. [Figure; no text survives in the transcript]

8. [Block diagram labels: I/O, RHPU, Memory (three blocks), Reliable Local Bus, Bus interfacing]

9. A solution to the tradeoff between performance and reliability might be the rotating consistency check, in which only some processes are replicated and have their results checked for consistency at any one time, but over a longer period all of them get verified.

10. Why SSCA#3? Computationally taxing. Large block data transfers. Stressful memory access patterns. Scalable to mimic different problem sizes. (1) The Synthetic Data Generation stage is used to produce approximate raw SAR data, similar to what would be obtained from a real SAR system. (2) The SAR Sensor Processing stage reconstructs a SAR image using a wavefront spotlight SAR reconstruction method known as 2D Fourier Matched Filtering and Interpolation.

11. [Figure panels: SDG, synthetic SAR returns from a uniform grid of point reflectors; Kernel 1, reconstructed SAR image]

12. The symmetric SMA (UMA): 1 Nehalem CPU (Intel Core i7 CPU 920); 2.67 GHz frequency; 8 MB L3 Smart Cache; 4 cores (8 threads with Hyper-Threading); 130 W power consumption; 24 Gigabytes of DDR3 RAM; 4.8 Giga Transfers/s QPI. The distributed SMA (NUMA): 2 Nehalem CPUs (Intel Xeon CPU X5670); 2.93 GHz processor frequency; 12 MB L3 Smart Cache; 6 cores/CPU; 95 W power consumption; 36 (18x2) Gigabytes of DDR3 RAM; 6.4 Giga Transfers/s QPI.

13. UMA-SMA architectures offer flexibility, but they tend to have memory bottlenecks. NUMA-SMA architectures avoid bottleneck problems in memories, but require manual/pinned allocation of memory for each thread.

14. Sequential FFT; multithreaded FFT; parallelized loops with OpenMP; tiling technique; threaded FFT using OpenMP; GOMP_CPU_AFFINITY=0-11; more private variables.

15. Most important optimizations: thread pinning (first-touch policy of memory); private data (stack, local) versus shared data (remote, cached, evicted); scheduling: static for loops with regular workloads, dynamic for loops with non-regular ones. Outlook: The SAR data generation and image formation are scalable to 4 cores in UMA (Uniform Memory Access) and to 12 cores in NUMA-2x[6 cores, 16GB RAM]. Speedup is almost linear in these SMA architectures. This code is expected to scale to bigger numbers of cores. Further parallelization paradigms are planned: MPI (Message Passing Interface) for clusters; CUDA for GPGPUs.

16. By combining many-core processors and other COTS products with radiation-hardened specific components one can benefit from: a speedup by a factor of 10 to 100; improved reliability and robustness of the system; efficient and faster application development via already familiar programming models; the ability to port applications directly to the space environment; minimization of the non-recurring development time and costs for future missions; efficient, flexible and portable software fault-tolerance techniques that can be applied in the space environment; portability to future advances in technology.

17. Thank you for your attention! Fisnik Kraja, LRR - Lehrstuhl für Rechnertechnik und Rechnerorganisation, Technische Universität München. [email protected]
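Slide 6 lists TMR and thread/process/service replication as reliability techniques. Both rest on comparing redundant results. The following is a minimal Python sketch of the majority vote only, assuming replicas return hashable values that can be compared exactly; the names `tmr_vote` and `run_tmr` are illustrative and do not come from the presentation.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value that at least two of the three replicas agree on.

    Raises ValueError when all three results differ, because no
    majority exists in that case.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority among replicas")
    return value


def run_tmr(task, *args):
    """Software stand-in for TMR: run the same task three times and vote.

    Assumption: `task` is deterministic, so any disagreement signals a fault.
    """
    return tmr_vote(task(*args), task(*args), task(*args))


if __name__ == "__main__":
    print(tmr_vote(5, 5, 7))              # one corrupted replica is outvoted
    print(run_tmr(lambda x: x * 2, 21))   # three fault-free runs agree
```

Hardware TMR votes on signals rather than function results; the sketch only illustrates the voting logic that software replication shares with it.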
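The rotating consistency check on slide 9 can be sketched as follows. This is one possible reading, not the authors' implementation: every task runs each round, only a rotating subset is re-executed and compared, and after enough rounds every task has been verified once. The function name `rotating_check` and the parameter `replicas_per_round` are hypothetical, and tasks are assumed to be deterministic functions of their input.

```python
def rotating_check(tasks, inputs, rounds, replicas_per_round=1):
    """Run all tasks every round; re-run and compare a rotating subset.

    Returns (results of the last round, set of verified task indices,
    list of (round, task index) pairs where the replica disagreed).
    """
    n = len(tasks)
    verified, mismatches = set(), []
    results = [None] * n
    cursor = 0  # index of the next task to replicate
    for r in range(rounds):
        # Primary execution of every task.
        results = [task(x) for task, x in zip(tasks, inputs)]
        # Replicate only a few tasks this round, advancing the cursor
        # so that the checked subset rotates over all tasks.
        for _ in range(replicas_per_round):
            i = cursor % n
            if tasks[i](inputs[i]) != results[i]:
                mismatches.append((r, i))
            verified.add(i)
            cursor += 1
    return results, verified, mismatches


if __name__ == "__main__":
    tasks = [lambda x, k=k: x + k for k in range(4)]
    results, verified, mismatches = rotating_check(tasks, [10] * 4, rounds=4)
    print(results, sorted(verified), mismatches)
```

With one replica per round, the extra work per round is one task instead of all of them, at the price of a detection latency of up to `len(tasks)` rounds.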
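Slide 15 recommends static scheduling for loops with regular workloads and dynamic scheduling for irregular ones. The Python simulation below is only a rough model of that advice: it approximates OpenMP `schedule(static)` as contiguous, nearly equal blocks and `schedule(dynamic, 1)` as idle workers taking the next iteration, and it ignores scheduling overhead and memory effects. The function names and the example costs are made up for illustration.

```python
import heapq


def static_makespan(costs, workers):
    """Finish time when iterations are split into contiguous, equal blocks."""
    base, extra = divmod(len(costs), workers)
    finish, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        finish.append(sum(costs[start:start + size]))
        start += size
    return max(finish)


def dynamic_makespan(costs, workers):
    """Finish time when each idle worker grabs the next single iteration."""
    free_at = [0] * workers  # min-heap of times at which workers become idle
    for cost in costs:
        t = heapq.heappop(free_at)          # earliest idle worker
        heapq.heappush(free_at, t + cost)   # busy until t + cost
    return max(free_at)


if __name__ == "__main__":
    regular = [1] * 12                # every iteration costs the same
    irregular = list(range(1, 13))    # cost grows with the iteration index
    for name, costs in (("regular", regular), ("irregular", irregular)):
        print(name, static_makespan(costs, 4), dynamic_makespan(costs, 4))
```

In this model both policies finish the regular loop at the same time, so the cheaper static policy is preferable there, while the irregular loop finishes earlier under the dynamic policy because the last block no longer carries all the expensive iterations.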