TX-2500: An Interactive, On-Demand Rapid-Prototyping HPC System
MIT Lincoln Laboratory, HPEC 2007

TRANSCRIPT

TX-2500 AIR - Slide-1: TX-2500, An Interactive, On-Demand Rapid-Prototyping HPC System
Albert Reuther, William Arcand, Tim Currie, Andrew Funk, Jeremy Kepner, Matthew Hubbell, Andrew McCabe, and Peter Michaleas. HPEC 2007, September 18-20, 2007.
This work is sponsored by the Department of the Air Force under Air Force contract FA C. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

TX-2500 AIR - Slide-2: Outline
- Introduction: Motivation; HPEC Software Design Workflow
- Interactive, On-Demand
- High Performance
- Summary

TX-2500 AIR - Slide-3: LLGrid Applications

TX-2500 AIR - Slide-4: Example App: Prototype GMTI & SAR Signal Processing
[Diagram: airborne research sensor with on-board, real-time front-end processing (A/D at 180 MHz bandwidth, subband filter bank, delay & equalize); RAID disk recorder; non-real-time GMTI processing (Doppler process, STAP (clutter), target detection, subband combine, adaptive beamform, pulse compress; ~430 GOPS); new SAR processing; streaming sensor data to an analyst workstation running MATLAB.]
- Airborne research sensor data collected.
- Research analysts develop signal processing algorithms in MATLAB using the collected sensor data.
- Individual runs can last hours or days on a single workstation.

TX-2500 AIR - Slide-5: HPEC Software Design Workflow
[Diagram: workflow from developing serial code on the desktop through algorithm development, verification, and data analysis, with parallel code on the cluster and embedded code on the embedded computer at each stage.]
- Requirements: low barrier-to-entry, interactive, immediate execution, high performance.

TX-2500 AIR - Slide-6: Outline
- Introduction
- Interactive, On-Demand: Technology; Results
- High Performance
- Summary

TX-2500 AIR - Slide-7: Library Layer (pMatlab)
[Diagram: layered architecture; the application (input, analysis, output) sits on the parallel library (pMatlab: vector/matrix, computation, task, conduit), which sits on the kernel layer (math: MATLAB; messaging: MatlabMPI; cluster launch: gridMatlab) above the parallel hardware, with the user interface and hardware interface at the layer boundaries.]
- Layered architecture for parallel computing.
- The kernel layer does single-node math and parallel messaging.
- The library layer provides a parallel data and computation toolbox to MATLAB users.
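
As a rough illustration of what that library layer gives the MATLAB user, here is a minimal pMatlab-style sketch. It is not taken from the presentation; it assumes the standard pMatlab interface (map, local, put_local, agg, and the Np global) as described in the pMatlab documentation and examples, and the exact launch call through gridMatlab (for example, an eval(pRUN(...)) line) depends on the toolbox version.

    % Minimal pMatlab-style sketch (assumed interface: map, local, put_local, agg, Np).
    % The job itself would be launched onto LLGrid through gridMatlab, not run standalone.
    N    = 2^16;
    mapX = map([1 Np], {}, 0:Np-1);        % distribute columns across the Np processors
    X    = zeros(4, N, mapX);              % distributed matrix (dmat)
    Xloc = local(X);                       % this processor's piece: an ordinary MATLAB array
    Xloc = fft(rand(size(Xloc)), [], 2);   % kernel-layer math operates on the local piece
    X    = put_local(X, Xloc);             % write the piece back into the distributed matrix
    Y    = agg(X);                         % gather the full array on the leader process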

TX-2500 AIR - Slide-8: Batch vs. Interactive, On-Demand Scheduling Tradeoffs
- What is the goal of the system: capability computing or capacity computing?
- What types of algorithms and code will the system run: stable algorithms or algorithm development?
- What's more expensive: the system and its maintenance, or the researchers' time?
- Interactive, on-demand scheduling suits capacity computing, algorithm development, and/or high researcher cost.
- Batch queues suit capability computing, stable algorithms, and/or high system cost.

TX-2500 AIR - Slide-9: LLGrid User Queues
- Interactive, on-demand queue (high/medium/low priority): limit of 64 CPUs per user (more upon request).
- Batch (default) queue (high/medium/low priority): limit of 96/128 CPUs per user (days / nights and weekends; more upon request).
- Not using the scheduler's interactive features (lsrun, lsgrun, bsub -I).
- Certain CPUs are reserved for interactive, on-demand jobs only; the rest serve both interactive and batch jobs.
- CPU allotments will change when upgrading to a larger system.

TX-2500 AIR - Slide-10: LLGrid Usage, December 2003 - August 2007
[Scatter plot: processors used by each job versus job duration in seconds, color-coded by system (alphaGrid, betaGrid, TX-2500); regions marked ">8 CPU hours: infeasible on desktop" and ">8 CPUs: requires on-demand parallel computing".]
- Statistics: 226 users, 234,658 jobs, 130,160 CPU days.

TX-2500 AIR - Slide-11: Usage Statistics, December 2003 - August 2007
- Most jobs use fewer than 32 CPUs; many jobs use 2, 8, 24, 30, 32, 64, or 100 CPUs.
- Very low median job duration and modest mean job duration, but some jobs still take hours on 16 or 32 CPUs.
- Total jobs run: 234,658. Median CPUs per job: 11. Mean CPUs per job: 20. Maximum CPUs per job: 856.
- Total CPU time: 130,160 days 15 hours. Median job duration: 35 s. Mean job duration: 40 min 41 s. Maximum job duration: 18 days 7 h 6 min.
[Histograms: number of jobs versus processors used, and number of jobs versus job duration in seconds.]

TX-2500 AIR - Slide-12: Individual Usage Examples
- Developing a non-linear equalization ASIC: post-run processing from the overnight run, debug runs during the day, and preparation for long overnight runs.
- Simulating laser propagation through the atmosphere: simulation results direct subsequent algorithm development and parameters; many engineering iterations over the course of a day.
[Plots: processors used by jobs over time for the two examples (evening of Tuesday, 15-May; evening of Saturday, 5-May).]

TX-2500 AIR - Slide-13: Outline
- Introduction
- Interactive, On-Demand
- High Performance: Technology; Results
- Summary

TX-2500 AIR - Slide-14: TX-2500
- Dell PowerEdge compute nodes, each with dual 3.2 GHz EM64T Xeons (Irwindale CPUs on the Lindenhurst chipset), 8 GB RAM, two Intel Gig-E interfaces, an InfiniBand interface, and six 300-GB disk drives.
- System totals: 3.4 TB RAM, 0.78 PB of disk, 28 racks.
- Shared network storage; service nodes running Rocks management, 411, a web server, and Ganglia.
- LSF-HPC resource manager/scheduler; connected to the LLAN.

TX-2500 AIR - Slide-15: InfiniBand Topology
- Four Cisco SFS 7008P core switches (96 InfiniBand 10-Gbps 4X ports each).
- 28 Cisco SFS 7000P rack switches (24 InfiniBand 10-Gbps 4X ports each); 28 racks of 16 servers each.
- Two uplinks from each rack switch to each core switch, providing redundancy and minimizing contention.
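
A quick arithmetic check, using only the numbers on Slide-15, shows that this uplink count exactly fills each 24-port rack switch (16 server links plus 8 uplinks). The sketch below is illustrative and not from the presentation; the variable names are ours.

    % Port budget of one Cisco SFS 7000P rack switch (figures from Slide-15)
    servers_per_rack  = 16;   % one 4X InfiniBand link per server
    core_switches     = 4;    % Cisco SFS 7008P core switches
    uplinks_per_core  = 2;    % two uplinks from this rack switch to each core switch
    ports_used = servers_per_rack + core_switches*uplinks_per_core   % = 24, all ports of the 24-port switch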

TX-2500 AIR - Slide-16: Parallel Computing Architecture Issues
[Diagram: the standard parallel computer architecture (CPU/RAM/disk nodes joined by a network switch) and the corresponding memory hierarchy (registers, cache, local memory, remote memory, disk, traversed by instruction operands, blocks, pages, and messages), with arrows showing how bandwidth, latency, capacity, and programmability change across the levels.]
- The standard architecture produces a steep, multi-layered memory hierarchy.
- The programmer must manage this hierarchy to get good performance.
- We need to measure each level to determine system performance.

TX-2500 AIR - Slide-17: HPC Challenge Benchmarks
- HPC Challenge, together with Iozone, measures this memory hierarchy and benchmarks the performance of the architecture.
- Top500: solves a system Ax = b.
- STREAM: vector operations A = B + s x C.
- FFT: 1D Fast Fourier Transform Z = FFT(X).
- RandomAccess: random updates T(i) = XOR(T(i), r).
- Iozone: read and write to disk (not part of HPC Challenge).
[Diagram: each benchmark placed against the corresponding level of the memory hierarchy, annotated with bandwidth and latency.]

TX-2500 AIR - Slide-18: Example SAR Application
[Diagram: front-end sensor processing (raw SAR data; image formation via pulse compression, polar interpolation, and FFT/IFFT corner turns; adaptive beamforming by solving linear systems) and back-end detection and ID (SAR images; change detection by differencing and thresholding large images; target ID via many small correlations on random pieces of a large image), with the stages tied to the Top500, FFT, STREAM, and RandomAccess kernels.]
- HPC Challenge benchmarks are similar to pieces of real applications.
- Real applications are an average of many different operations.
- How do we correlate HPC Challenge with application performance?

TX-2500 AIR - Slide-19: Flops: Top500
[Plots: Ethernet vs. InfiniBand and a problem-size comparison on InfiniBand (problem size 4000 MB/node); performance versus CPUs and versus problem size (MB/node).]
- Top500 measures register-to-CPU communication (1.42 TFlops peak).
- Top500 does not stress the network subsystem.

TX-2500 AIR - Slide-20: Memory Bandwidth: STREAM
[Plots: same layout as Slide-19.]
- Memory-to-CPU bandwidth tested (~2.7 GB/s per node).
- Unaffected by the memory footprint of the data set.

TX-2500 AIR - Slide-21: Network Bandwidth: FFT
[Plots: same layout as Slide-19.]
- InfiniBand makes gains due to its higher bandwidth.
- Ethernet plateaus around 256 CPUs.

TX-2500 AIR - Slide-22: Network Latency: RandomAccess
[Plots: same layout as Slide-19.]
- The low latency of InfiniBand drives GUPS performance.
- Strong mapping to embedded interprocessor communication.

TX-2500 AIR - Slide-23: Iozone Results
- Command used: iozone -r 64k -s 16g -e -w -f /state/partition1/iozone_run
- 98% of the RAID arrays exceeded 50 MB/s throughput.
- Comparable performance to embedded storage.
[Histograms: write bandwidth (MB/s) and read bandwidth (MB/s).]

TX-2500 AIR - Slide-24: HPC Challenge Comparison
[Plot: effective bandwidth in words/second (mega through giga to tera) versus CPUs for Top500, STREAM, FFT, and RandomAccess; InfiniBand, problem size 4000 MB/node; shown against the memory hierarchy with its bandwidth and latency trends.]
- All results in words/second.
- Highlights the memory hierarchy.
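
For reference, the four HPC Challenge kernels compared above reduce to one-line MATLAB operations matching the formulas on Slide-17. The sketch below is a serial illustration only, not from the presentation; the real benchmarks are distributed, timed, and validated versions of these operations.

    % Serial sketches of the HPC Challenge kernel operations (illustration only).
    n = 1024;
    A = rand(n);  b = rand(n,1);
    x = A \ b;                                % Top500 (HPL): solve A x = b

    s = 3.0;  B = rand(n,1);  C = rand(n,1);
    A2 = B + s .* C;                          % STREAM triad: A = B + s x C

    X = complex(rand(n,1), rand(n,1));
    Z = fft(X);                               % FFT: Z = FFT(X)

    T = uint64(0:n-1)';                       % RandomAccess: T(i) = XOR(T(i), r)
    r = uint64(randi(2^32));
    i = double(bitand(r, uint64(n-1))) + 1;
    T(i) = bitxor(T(i), r);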

TX-2500 AIR - Slide-25: Summary
- Requirements met: low barrier-to-entry, interactive, immediate execution, high performance.
- Parallel MATLAB and user statistics; HPC Challenge and results.
[Diagram: the HPEC software design workflow from Slide-5 (algorithm development, verification, data analysis; serial, parallel, and embedded code).]

TX-2500 AIR - Slide-26: Back Up Slides

TX-2500 AIR - Slide-27: TX-2500: Analysis of Potential Bottlenecks
[Diagram: peak bandwidths through a Dell PowerEdge 2850 node and its networks. Two 3.2 GHz Irwindale CPUs on an 800-MHz front-side bus (6.4 GB/s) to the Lindenhurst MCH north bridge; two DDR2-400 memory channels (3.2 GB/s each) to 2-GB DIMMs; Hublink (1.5 GB/s) to the ICH5 south bridge (USB, video, etc.); PCIe x4 (2.0 GB/s) to the PERC4e/DC RAID controller with a 256-MB DDR2-400 cache and six 10k-RPM U320 SCSI drives (~40 MB/s each) on a 320-MB/s U320 SCSI bus; PCIe x8 (4.0 GB/s) to the InfiniBand HBA (1.8 GB/s); Gigabit Ethernet interfaces (1 Gbps, 0.18 GB/s). Networks: Nortel 5510 GigE rack switch stacks and Nortel 5530 GigE central switch stack (5 GB/s backplanes), and Cisco SFS 7000P InfiniBand rack switches (60 GB/s backplane) with Cisco SFS 7008P InfiniBand core switches (240 GB/s backplane), linked by 10-Gbps (1.8 GB/s) InfiniBand.]
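
Using only the peak figures on Slide-27, a back-of-the-envelope check of the per-node storage path (an illustrative calculation, not a measurement from the presentation) suggests the drives themselves, rather than the SCSI bus or the PCIe link, bound node disk I/O:

    % Per-node storage path, peak figures from Slide-27 (approximate)
    per_disk_bw  = 40;      % MB/s per 10k-RPM U320 SCSI drive
    n_disks      = 6;       % drives per node behind the PERC4e/DC RAID controller
    scsi_bus_bw  = 320;     % MB/s, U320 SCSI bus
    pcie_x4_bw   = 2000;    % MB/s, PCIe x4 link to the RAID controller
    aggregate_disk_bw = n_disks * per_disk_bw   % ~240 MB/s, below both scsi_bus_bw and pcie_x4_bw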

TX-2500 AIR - Slide-28: Spatial/Temporal Locality Results
[Scatter plot: spatial versus temporal locality (MetaSim data from Snavely et al., SDSC) for scientific applications, intelligence/surveillance/reconnaissance applications, and the HPC Challenge kernels HPL (Top500), FFT, STREAM, and RandomAccess.]
- HPC Challenge bounds real applications.
- This allows us to map between applications and benchmarks.

TX-2500 AIR - Slide-29: HPL (Top500) Benchmark
- Linear system solver (requires all-to-all communication); stresses local matrix-multiply performance.
- DARPA HPCS goal: 2 Petaflops (8x over current best).
- High Performance Linpack (HPL) solves a system Ax = b; the core operation is an LU factorization of a large M x M matrix.
- Results are reported in floating point operations per second (flops).
- A 2D block-cyclic distribution is used for load balancing.
[Diagram: parallel algorithm: LU factorization of A into L and U over a 2D block-cyclic layout of processors P0-P3.]

TX-2500 AIR - Slide-30: STREAM Benchmark
- Basic operations on large vectors (requires no communication); stresses local processor-to-memory bandwidth.
- DARPA HPCS goal: 6.5 Petabytes/second (40x over current best).
- Performs scalar multiply and add, A = B + s x C; results are reported in bytes/second.
[Diagram: parallel algorithm: the triad runs independently on each of the Np processors.]

TX-2500 AIR - Slide-31: FFT Benchmark
- FFT of a large complex vector (requires all-to-all communication); stresses interprocessor communication of large messages.
- DARPA HPCS goal: 0.5 Petaflops (200x over current best).
- 1D Fast Fourier Transform of an N-element complex vector, typically done as a parallel 2D FFT.
- Results are reported in floating point operations per second (flops).
[Diagram: parallel algorithm: FFT the rows, corner turn (all-to-all), then FFT the columns, across processors 0 through Np-1.]
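
Slide-31 describes the usual way a 1D FFT is parallelized: reshape the vector into a 2D array, FFT one dimension, corner turn, then FFT the other. Below is a serial MATLAB sketch of that pattern (not the benchmark code); the transpose is the corner turn, which becomes an all-to-all exchange when the rows are distributed, and the twiddle-factor step is required by the decomposition even though the slide does not list it.

    % 1D FFT of an N-point vector done as a 2D decomposition with a corner turn (serial sketch).
    N  = 2^12;  N1 = 2^6;  N2 = N/N1;
    x  = complex(randn(N,1), randn(N,1));
    X  = reshape(x, N1, N2);                      % element (n1+1, n2+1) holds x(n1 + N1*n2 + 1)
    X  = fft(X, [], 2);                           % "FFT rows": length-N2 FFTs
    X  = X .* exp(-2i*pi*(0:N1-1)'*(0:N2-1)/N);   % twiddle factors
    X  = X.';                                     % corner turn (all-to-all when distributed)
    X  = fft(X, [], 2);                           % "FFT columns": length-N1 FFTs after the turn
    y  = reshape(X, N, 1);                        % equals fft(x) up to round-off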

TX-2500 AIR - Slide-32: RandomAccess Benchmark
- Randomly updates memory (requires all-to-all communication); stresses interprocessor communication of small messages.
- DARPA HPCS goal: 64,000 GUPS (2000x over current best).
- Randomly updates an N-element table of unsigned integers; each processor generates indices, sends them to all other processors, and performs the XOR updates.
- Results are reported in Giga Updates Per Second (GUPS).
[Diagram: parallel algorithm: processors 0 through Np-1 generate random indices into the distributed table, then send, XOR, and update.]

TX-2500 AIR - Slide-33: IOzone File System Benchmark
- File system benchmark tool; it can measure a large variety of file system characteristics.
- We benchmarked read and write throughput (MB/s) with 16-GB files (to test the RAID, not the caches) and 64-kB blocks (best performance on this hardware), on the six-disk hardware RAID set, on all 432 compute nodes.
[Diagram: local-disk memory hierarchy (CPUs, local cache, local memory, disk cache, disks), with the levels exercised by our iozone tests marked.]
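
Returning to the RandomAccess kernel on Slide-32, its update rule can be written as a serial MATLAB sketch. This is illustrative only and not from the presentation: the real benchmark distributes the table, exchanges indices among all processors, and uses its own pseudo-random stream.

    % Serial sketch of the RandomAccess update T(i) = XOR(T(i), r), reported in GUPS.
    N = 2^16;
    T = uint64(0:N-1)';                           % table of unsigned integers
    Nupdates = 4*N;
    tic
    for k = 1:Nupdates
        r = uint64(randi(2^32));                  % stand-in for the benchmark's random stream
        i = double(bitand(r, uint64(N-1))) + 1;   % low log2(N) bits pick the table entry
        T(i) = bitxor(T(i), r);
    end
    gups = Nupdates/toc/1e9                       % Giga Updates Per Second (serial, illustrative)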