High Performance Computing meets Radio Astronomy
Willi Homberg
German SKA Science Meeting, 12-13 February 2014
Bielefeld
Jülich Supercomputing Centre (JSC)
SKA Observatory Diagram
Credit: Peter Dewdney, SKA Project Office
SDP Element Concept
● Ingest processor including routing capability
  ■ Data rate out of correlator: 4670 / 842 GBytes/s (SURVEY/LOW), 1800 GBytes/s (Mid)
  ■ Max data rate into SDP: 995 GBytes/s (SURVEY/LOW), 255 GBytes/s (Mid)
● A data-parallel processing system
  ■ Local data buffer linked to ingest processor
  ■ Multi-core CPU (multi-GPU) based compute system
  ■ Max computing load: 32 PFlop/s (SURVEY/LOW), 10 PFlop/s (Mid)
  ■ Emphasis is on the framework to manage the throughput
  ■ Max buffer size: 14 PBytes (SURVEY/LOW), 11 PBytes (Mid) (see the sketch below)
  ■ Hardware platform to be replaced on a short duty cycle
● Tiered data archive: fast-access (data < 1 year), higher-latency (tape) archive
● Master Controller and data archives
Credit: Bojan Nikolic, SDP Management Team
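To make the buffer and ingest figures above concrete, here is a back-of-envelope sketch (not from the slides) of how long the local data buffer can absorb data at the maximum ingest rate, assuming decimal units (1 PByte = 10^6 GBytes):

/* Back-of-envelope check (illustration only): how long the SDP local buffer
 * can absorb data at the maximum ingest rate quoted above. */
#include <stdio.h>

int main(void)
{
    /* figures from the slide: buffer size in PBytes, ingest rate in GBytes/s */
    const double buffer_pb[]  = { 14.0, 11.0 };    /* SURVEY/LOW, Mid */
    const double ingest_gbs[] = { 995.0, 255.0 };
    const char  *label[]      = { "SURVEY/LOW", "Mid" };

    for (int i = 0; i < 2; i++) {
        double seconds = buffer_pb[i] * 1e6 / ingest_gbs[i];  /* 1 PByte = 1e6 GBytes */
        printf("%-10s buffer fills in %.0f s (~%.1f h) at max ingest rate\n",
               label[i], seconds, seconds / 3600.0);
    }
    return 0;
}

Under these assumptions the SURVEY/LOW buffer covers roughly four hours of peak ingest and the Mid buffer roughly twelve, which is one way to read the slide's emphasis on the throughput-management framework.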
Top500 Trends: Peak Performance
● #1 in Nov 2013: Tianhe-2
  ■ Peak: 54,902.4 TFlop/s
  ■ Linpack: 33,862.7 TFlop/s (efficiency check below)
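A quick cross-check of the two figures above: Linpack efficiency is simply the measured Linpack result divided by the theoretical peak. A minimal sketch:

/* Cross-check of the Tianhe-2 numbers quoted above:
 * Linpack efficiency = Rmax / Rpeak. */
#include <stdio.h>

int main(void)
{
    const double rpeak_tflops = 54902.4;   /* Tianhe-2 peak, Nov 2013 list */
    const double rmax_tflops  = 33862.7;   /* Tianhe-2 Linpack result */

    printf("Linpack efficiency: %.1f %%\n", 100.0 * rmax_tflops / rpeak_tflops);
    return 0;
}

This comes out to about 61.7 %.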
SDP Processing Example Implementation
Bojan Nikolic, 12.12.2013
Processors: Increasing parallelism
● Limits of frequency increase have been reached
  ■ Typical clock rates of HPC processors: 2.5-4 GHz
● Increase of parallelism at various levels
  ■ Multiple to many cores
  ■ Simultaneous multi-threading
  ■ SIMD instructions: 128, 256, 512 bit operands, up to 16-way SP floating-point operations (see the sketch below)
● Relevant processor architectures
  ■ Intel Xeon: used by >80% of TOP500 systems; stable road-map
  ■ IBM POWER: small market share; new opportunities due to OpenPOWER
  ■ Other: AMD Opteron, Blue Gene/Q processor, SPARC64 VIIIfx
● Accelerators
  ■ NVIDIA Kepler: K20, K40
  ■ Intel Knights Corner: Xeon Phi 5110P, 7120
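As a concrete illustration of the 512-bit, 16-way single-precision SIMD mentioned above, the sketch below uses AVX-512 intrinsics. This is illustrative only: Knights Corner actually exposes its own 512-bit instruction set (IMCI) rather than AVX-512, and the code assumes an AVX-512-capable compiler and CPU (compile e.g. with -mavx512f).

/* One 512-bit fused multiply-add processes 16 single-precision values
 * per instruction. Illustrative sketch only. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float a[16], b[16], c[16], r[16];
    for (int i = 0; i < 16; i++) { a[i] = (float)i; b[i] = 2.0f; c[i] = 1.0f; }

    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);
    __m512 vc = _mm512_loadu_ps(c);
    __m512 vr = _mm512_fmadd_ps(va, vb, vc);   /* r = a*b + c, 16 lanes at once */
    _mm512_storeu_ps(r, vr);

    printf("r[15] = %.1f\n", r[15]);           /* 15*2 + 1 = 31.0 */
    return 0;
}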
NVIDIA Tesla K20X (1x Kepler GK110)
● Flops: 3.94 / 1.31 TFlop/s SP / DP
● Compute Units: 14
● Processing Elements: 192 per CU
● Total # PEs: 14 x 192 = 2688
● CU frequency: 732 MHz
● Memory: 6 GB (ECC), 384-bit interface
● Memory frequency: 5.2 GHz
● Memory bandwidth: 250 GB/s
● Power consumption: 235 W
Intel Xeon Phi (MIC) Coprocessor 5110P
● Flops: 2.02 / 1.01 TFlop/s SP / DP (see the peak cross-check below)
● Compute Units: 60 (cores)
● Processing Elements: 16 per core
● Total # PEs: 60 x 16 = 960
● Core frequency: 1.053 GHz
● Memory: 8 GB
● Memory bandwidth: 320 GB/s
● Power consumption: 225 W
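The single-precision peak figures for both accelerators above can be reproduced with the usual rule of thumb peak = #PEs x clock x 2, counting one fused multiply-add as two floating-point operations per cycle; the factor of 2 is an assumption, but it matches the quoted SP numbers.

/* Cross-check of the SP peak-flops figures above: peak = #PEs x clock x 2. */
#include <stdio.h>

int main(void)
{
    struct { const char *name; int pes; double ghz; } dev[] = {
        { "NVIDIA Tesla K20X",    2688, 0.732 },
        { "Intel Xeon Phi 5110P",  960, 1.053 },
    };

    for (int i = 0; i < 2; i++) {
        double tflops = dev[i].pes * dev[i].ghz * 2.0 / 1000.0;
        printf("%-22s %.2f TFlop/s SP peak\n", dev[i].name, tflops);
    }
    return 0;
}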
System Memory
● Top 10 ranks of Top500 (November 2013)
  ■ Max memory capacity stagnating (~1.5 PiBytes)
  ■ But: TOP500 provides a selective view
  ■ Increasing number of systems with accelerators
● GORDON@SDSC
  ■ Architecture integrating a large number of SSDs
  ■ Rank #129 on the Top500 list
[Chart: system memory capacity of the top-10 systems, in GiByte]
Storage Technology Parameters
● Storage capacity
  ■ HDD: O(2) TBytes
  ■ SSD: O(256) GBytes (laptop/desktop), O(1) TBytes (enterprise)
● Bandwidth
  ■ Data transfer rate from disk or storage system
  ■ Can be measured at different levels
  ■ HDD: O(150) MBytes/s
  ■ SSD: O(400) MBytes/s (laptop/desktop), O(2.5) GBytes/s (enterprise)
  ■ System level: JUST3 66 GB/s, JUST4-GSS 160 GB/s
● IOPS
  ■ Number of I/O operations per second
  ■ Linked to bandwidth by request size (see the sketch below)
  ■ HDD: 75-210 IOPS
  ■ SSD: 8.6k-1.2M IOPS
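The "linked to bandwidth by request size" relation above is simply bandwidth = IOPS x request size. A minimal sketch, assuming a typical 4 KiB random-I/O request size and round IOPS figures taken from the ranges quoted above:

/* Sketch of bandwidth = IOPS x request size (4 KiB request size assumed). */
#include <stdio.h>

int main(void)
{
    const double request_bytes = 4096.0;               /* assumed 4 KiB requests */
    const double iops[]  = { 200.0, 100000.0 };         /* HDD-like vs SSD-like */
    const char  *label[] = { "HDD (~200 IOPS)", "SSD (~100k IOPS)" };

    for (int i = 0; i < 2; i++) {
        double mbps = iops[i] * request_bytes / 1e6;
        printf("%-18s -> %.1f MB/s at 4 KiB requests\n", label[i], mbps);
    }
    return 0;
}

The same device can therefore look slow or fast depending on whether small random requests or large streaming requests dominate.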
Storage Issues
● RAID rebuild time
  ■ JUST 3: large disks plus standard RAID6
  ■ Long time to rebuild: observed rebuild times of about 22 hours
  ■ 3-4 disk failures per week
  ■ Risk of failure during rebuild: observed failure of a 2nd disk during rebuild (see the sketch below)
  ■ Performance penalties during rebuild: noticeable storage server performance degradation
  ■ JUST 4: GPFS Storage Server (GSS): uses de-clustered RAID for faster rebuild, end-to-end integrity checksum
● Availability, scalability, maintainability
● Silent data corruption
  ■ Model calculations assuming a UDE/IO rate of 10^-13 (estimated mean time to undetected error for a 1000-disk system over 5 years)
  ■ The JUST storage system comprises O(10,000) disks
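A rough, illustration-only estimate of the second-failure risk mentioned above, using the observed 22-hour rebuild time and 3-4 failures per week. The 10,000-disk total follows the O(10,000) figure; the 3.5 failures/week midpoint and the 10 surviving disks per RAID group are assumptions.

/* Rough illustration of the rebuild-window risk discussed above.
 * Inputs other than the 22 h rebuild time and the 3-4 failures/week
 * observation are assumptions. */
#include <stdio.h>

int main(void)
{
    const double disks_total   = 10000.0;   /* O(10,000) disks in JUST */
    const double failures_week = 3.5;       /* midpoint of observed 3-4 per week */
    const double rebuild_hours = 22.0;      /* observed rebuild time */
    const double group_disks   = 10.0;      /* assumed surviving disks per array */

    double fail_rate_per_disk_h = failures_week / (disks_total * 7.0 * 24.0);
    double p_second_failure     = group_disks * rebuild_hours * fail_rate_per_disk_h;

    printf("per-disk failure rate: %.2e per hour\n", fail_rate_per_disk_h);
    printf("chance of another failure in the same group during one rebuild: %.2e\n",
           p_second_failure);
    return 0;
}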
JSC tape system:
● Current capacity: 44.5 PBytes
● Maximum capacity: 100 PBytes
● Tapes: 16,600
● Libraries: 2 (at different locations in JSC)
● Transfer rates: T10000C up to 240 MB/s, T10000B/A up to 120 MB/s (see the sketch below)
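To put the per-drive transfer rate above in context, the sketch below estimates how long it would take to stream the current tape holdings through a single drive, assuming a sustained 240 MB/s and decimal units (1 PByte = 10^9 MBytes); in practice many drives and two libraries work in parallel.

/* Illustration only: single-drive streaming time for the current tape holdings. */
#include <stdio.h>

int main(void)
{
    const double capacity_pbytes = 44.5;
    const double drive_mb_per_s  = 240.0;   /* assumed sustained T10000C-class rate */

    double seconds = capacity_pbytes * 1e9 / drive_mb_per_s;
    printf("%.1f PBytes at %.0f MB/s: %.0f days with one drive\n",
           capacity_pbytes, drive_mb_per_s, seconds / 86400.0);
    return 0;
}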
Network link technologies and topologies
● InfiniBand
  ■ Top500 system share of 41.4 %
● Ethernet
  ■ Top500 system share of 42.4 %
  ■ Very large market
● Proprietary link technologies
  ■ TOFU
  ■ Blue Gene/Q
  ■ Aries
  ■ EXTOLL
● Processor network attachment
  ■ PCIe attached: most common approach; most systems still Gen2
  ■ Proprietary IO bus, e.g. Fujitsu HSIO
  ■ On-chip transceivers, e.g. BG/Q
● Key aspects
  ■ Nearest-neighbour connectivity
  ■ Network diameter
  ■ Bi-sectional bandwidth (see the torus sketch below)
● Popular topologies
  ■ Fat tree
  ■ D-dimensional torus
  ■ Toroidal topologies like TOFU
  ■ Dragonfly
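Two of the "key aspects" above, network diameter and bisection width, are easy to compute for a d-dimensional torus. The example dimensions (4 x 4 x 4 x 4 x 2, picked for illustration) and the simple cut across the widest dimension are assumptions, not a description of any specific machine.

/* Toy calculation: diameter and bisection link count of a d-dimensional torus. */
#include <stdio.h>

int main(void)
{
    int n[] = { 4, 4, 4, 4, 2 };              /* assumed torus dimensions */
    int d = (int)(sizeof n / sizeof n[0]);

    long nodes = 1;
    long diameter = 0;
    int  widest = 0;
    for (int i = 0; i < d; i++) {
        nodes    *= n[i];
        diameter += n[i] / 2;                  /* longest shortest path per dimension */
        if (n[i] > n[widest]) widest = i;
    }
    /* bisecting across the widest (even-sized) dimension cuts 2 links per column */
    long bisection_links = 2 * nodes / n[widest];

    printf("nodes = %ld, diameter = %ld hops, bisection = %ld links\n",
           nodes, diameter, bisection_links);
    return 0;
}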
HPC Energy Efficiency
● Energy efficiency in HPC
  ■ Top500
  ■ Green500
  ■ PUE
● Cooling
  ■ Air-cooling
  ■ Water-cooling
  ■ Warm-water cooling
  ■ Free cooling
● Energy-aware scheduling
  ■ LRZ SuperMUC: configuration dependent on application profile; reduce power vs. decrease execution time (see the sketch below)
  ■ eeClust
  ■ Fit4Green
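The "reduce power vs. decrease execution time" trade-off above comes down to energy = power x runtime: a lower clock draws less power but runs longer, and either effect can dominate. The two operating points below are invented purely for illustration.

/* Illustration of the energy = power x runtime trade-off (numbers invented). */
#include <stdio.h>

int main(void)
{
    /* assumed operating points for one job on one node */
    const double power_w[]   = { 300.0, 220.0 };   /* nominal vs reduced clock */
    const double runtime_s[] = { 3600.0, 4200.0 }; /* runtime at each setting */
    const char  *label[]     = { "nominal clock", "reduced clock" };

    for (int i = 0; i < 2; i++) {
        double energy_kwh = power_w[i] * runtime_s[i] / 3.6e6;  /* J -> kWh */
        printf("%-14s %.0f W x %.0f s = %.2f kWh per node\n",
               label[i], power_w[i], runtime_s[i], energy_kwh);
    }
    return 0;
}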
[Top500 performance development chart, courtesy of Erich Strohmaier, Lawrence Berkeley National Laboratory; growth factors of 5.04x, 3.13x, and 3.25x in 5 years]
System Software and Management
● Operating system
  ■ Support for fault tolerance and fault resiliency
● Interconnect management
  ■ Adaptive and dynamic routing
  ■ Congestion control
● Cluster management
  ■ On-the-fly analysis monitoring
  ■ Post-mortem data mining
  ■ Health checking
● Resource management and job scheduling
  ■ Load balancing
  ■ Flexible allocation coupled with applications
● Energy efficiency
● Programming environment / basic porting
  ■ Languages: C/C++, FORTRAN, Python
  ■ Standards: MPI, OpenMP, OpenACC, OpenCL, CUDA, OmpSs (see the hybrid MPI/OpenMP sketch below)
  ■ Compiler, debugger
● Tuning applications, performance analysis
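The standards listed above are commonly combined; a minimal hybrid MPI + OpenMP "hello" in C is sketched below (compile e.g. with mpicc -fopenmp; the MPI_THREAD_FUNNELED thread level is just one common choice).

/* Minimal hybrid MPI + OpenMP example: each MPI rank spawns OpenMP threads. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    {
        printf("rank %d/%d, thread %d/%d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}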
Performance Analysis on Extreme-Scale Systems
● Technical challenges
  ■ Heterogeneity
  ■ Extreme concurrency
  ■ Perturbation and data volume
  ■ Drawing insight from measurements
  ■ Quality information sources
● Steps (see the instrumentation sketch below)
  ■ Instrumentation
  ■ Measurement
  ■ Profiling (time, counts)
  ■ Tracing (events)
  ■ Filtering, reporting, examination
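To make the instrumentation / measurement / profiling steps concrete, here is a hand-instrumented sketch that collects a per-region time and call-count profile; the region name "gridding" and the dummy kernel are invented for illustration.

/* Hand-instrumented profiling sketch: accumulate time and call counts per region. */
#include <omp.h>
#include <stdio.h>

static double t_grid = 0.0;     /* accumulated time in the "gridding" region */
static long   calls_grid = 0;   /* number of times the region was entered */

static void gridding_kernel(int n)   /* placeholder for real work */
{
    volatile double s = 0.0;
    for (int i = 0; i < n; i++) s += i * 0.5;
}

int main(void)
{
    for (int iter = 0; iter < 10; iter++) {
        double t0 = omp_get_wtime();       /* instrumentation: enter region */
        gridding_kernel(1000000);          /* measurement happens around the call */
        t_grid += omp_get_wtime() - t0;    /* profiling: accumulate time */
        calls_grid++;                      /* profiling: accumulate counts */
    }
    printf("region 'gridding': %ld calls, %.3f s total, %.3f ms/call\n",
           calls_grid, t_grid, 1e3 * t_grid / calls_grid);
    return 0;
}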
Summary
● Processor technology
  ■ Increasing parallelism
  ■ Accelerator support
● Memory trends
  ■ DRAM, GDDR5, SSD
● Storage technology
  ■ Parameters
  ■ Issues
● Network link technologies and topologies
● Energy efficiency
● Software environment
  ■ System software and management
  ■ Performance analysis
End of Presentation
Jülich Supercomputing Centre (JSC)
● Supercomputer operation for
  ■ Centre – FZJ
  ■ Regional – JARA
  ■ Helmholtz & National – NIC, GCS
  ■ Europe – PRACE, EU committees
● Application support
  ■ SimLabs
  ■ Cross-Sectional Groups
  ■ Peer review coordination
● R&D work
  ■ Algorithms, performance analysis, and tools
  ■ Community data management service
  ■ Novel computer architectures: Exascale laboratories: EIC (IBM), ECL (Intel), NVIDIA
● Education and Training
Processors and Accelerators
Memory technologies
● DDR SDRAM
  ■ Mass market, clear road-map
● High-bandwidth memory
  ■ New, emerging solutions
  ■ Small, volatile market
● Dense memory
  ■ Technical limitations, unclear road-map
  ■ Large market
Network link technologies and topologies
[M.Gerndt, 2013]
[Faanes et al., 2012]
Comparison of selected Top500 systems