CyberShake Study 14.2 Technical Readiness Review
![Page 1: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/1.jpg)
CyberShake Study 14.2 Technical Readiness Review
![Page 2: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/2.jpg)
Study 14.2 Scientific Goals
• Compare impact of velocity models on Los Angeles-area hazard maps with various velocity models
  • CVM-S4.26, BBP 1D, CVM-H 11.9 (no GTL)
  • Compare to CVM-S, CVM-H 11.9 with GTL
• Investigate impact of GTL
• Compare 1D reference model
• Compare tomographic inversion results
• 286 sites (10 km mesh + points of interest)
![Page 3: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/3.jpg)
Study 14.2 Technical Goals
• Run both SGT and post-processing workflows on Blue Waters
• Plan to measure CyberShake application makespan
  • Equivalent to the makespan of all of the workflows
  • (All jobs complete) – (first workflow submitted)
  • Includes hazard curve calculation time
  • Includes system downtime, workflow stoppages
  • Will estimate time-to-solution by adding estimates of setup time and analysis time
• Compare performance, queue times, results of GPU and CPU AWP-ODC-SGT
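The makespan definition above can be sketched in a few lines. This is only an illustration: the function names and the timestamps are hypothetical, and the setup and analysis durations are placeholders rather than values from the study.

```python
from datetime import datetime, timedelta

def application_makespan(first_submit, job_end_times):
    """Makespan = (time the last job completes) - (time the first workflow was submitted).

    Downtime and workflow stoppages are included automatically because only
    the two endpoints are used.
    """
    if not job_end_times:
        raise ValueError("need at least one job completion time")
    return max(job_end_times) - first_submit

def time_to_solution(makespan, setup, analysis):
    """Time-to-solution = setup estimate + measured makespan + analysis estimate."""
    return setup + makespan + analysis

# Hypothetical timestamps, for illustration only.
first_submit = datetime(2014, 2, 18, 9, 0)
job_ends = [
    datetime(2014, 2, 20, 14, 30),
    datetime(2014, 3, 4, 23, 15),
    datetime(2014, 2, 25, 8, 0),
]
makespan = application_makespan(first_submit, job_ends)
tts = time_to_solution(makespan, setup=timedelta(days=2), analysis=timedelta(days=3))
print(makespan, tts)
```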
![Page 4: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/4.jpg)
Performance Enhancements
• New version of seismogram synthesis code to reduce read I/O
  • Reads in set of extracted SGTs
  • Synthesizes multiple RVs (using 5 in production)
• Reduce number of subworkflows to 6 (from 8)
  • Fewer jobs, less queuing time
• For CPU SGTs, increase core count
  • Each processor has ~64x50x50 chunk of grid points
• For GPU SGTs, decrease processor count
  • Volume must be multiple of 20 in X and Y
  • 10 x 10 x 1 GPUs, regardless of volume
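A rough sketch of how the two decompositions might be sized. It assumes (the slide does not say so) that the CPU core count is the number of ~64x50x50 chunks needed to tile the volume and that X and Y are rounded up to satisfy the multiple-of-20 rule. The volume dimensions and function names are hypothetical.

```python
def pad_to_multiple(n, multiple=20):
    """Round n up to the next multiple (integer ceiling division)."""
    return -(-n // multiple) * multiple

def cpu_core_count(nx, ny, nz, chunk=(64, 50, 50)):
    """Cores needed if every core owns at most one chunk of grid points."""
    cx, cy, cz = chunk
    return (-(-nx // cx)) * (-(-ny // cy)) * (-(-nz // cz))

GPU_GRID = (10, 10, 1)  # fixed GPU layout from the slide

def gpu_subgrid(nx, ny, nz):
    """Grid points per GPU after padding X and Y to a multiple of 20."""
    px, py, pz = GPU_GRID
    return pad_to_multiple(nx) // px, pad_to_multiple(ny) // py, nz // pz

# Hypothetical volume dimensions in grid points.
nx, ny, nz = 1590, 2745, 250
cores = cpu_core_count(nx, ny, nz)
per_gpu = gpu_subgrid(nx, ny, nz)
print(cores, per_gpu)
```

With a fixed 10 x 10 x 1 layout, the per-GPU subvolume grows with the volume while the GPU count stays at 100.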
![Page 5: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/5.jpg)
Proposed Study sites (286)
![Page 6: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/6.jpg)
Study 14.2 Data Products
• 2 CVM-S4.26 Los Angeles-area hazard models
• 1 BBP 1D Los Angeles-area hazard model
• 1 CVM-H 11.9, no GTL Los Angeles-area hazard model
• Hazard curves for 286 sites x 4 conditions, at 3s, 5s, 10s
• 1144 sets of 2-component SGTs
• Seismograms for all ruptures (~470M)
• Peak amplitudes in DB for 3s, 5s, 10s
![Page 7: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/7.jpg)
Study 14.2 Notables
• First CVM-S4.26 hazard models
• First CVM-H, no GTL hazard model
• First 1D hazard model
• First study using AWP-SGT-GPU
• First CyberShake Study using a single workflow on one system (Blue Waters)
![Page 8: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/8.jpg)
Study 14.2 Parameters
• 0.5 Hz, deterministic
  • 200 m spacing
• CVMs
  • Vs min = 500 m/s
• UCERF 2
  • Graves & Pitarka (2010) rupture variations
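As a quick consistency check on these parameters, the shortest wavelength is the minimum shear velocity divided by the maximum frequency. The slides do not state a sampling criterion, so the function name and the interpretation below are illustrative only.

```python
def points_per_wavelength(vs_min_m_s, f_max_hz, spacing_m):
    """Grid points per shortest wavelength, using wavelength = velocity / frequency."""
    shortest_wavelength_m = vs_min_m_s / f_max_hz
    return shortest_wavelength_m / spacing_m

# Values from the slide: Vs min = 500 m/s, 0.5 Hz, 200 m spacing.
ppw = points_per_wavelength(500.0, 0.5, 200.0)
print(ppw)
```

The shortest wavelength is 1000 m, so the 200 m grid samples it with 5 points.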
![Page 9: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/9.jpg)
Verification
• 4 sites (USC, PAS, WNGC, SBSM)
  • AWP-SGT-CPU, CVM-S4.26
  • AWP-SGT-GPU, CVM-S4.26
  • AWP-SGT-CPU, BBP 1D
  • AWP-SGT-GPU, CVM-H 11.9, no GTL
• Plotted with previously calculated curves
![Page 10: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/10.jpg)
CVM-S4.26 (CPU)
![Page 11: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/11.jpg)
CVM-H, no GTL (CPU)
![Page 12: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/12.jpg)
Changes to SGT Software Stack
• Velocity Mesh generation
  • Switched from 2 jobs (create, then merge) to 1 job
• SGTs
  • AWP-ODC-SGT CPU v14.2
    • Has wrapper because of issue with getting exit code back
  • AWP-ODC-SGT GPU v14.2
    • Has wrapper to read in parameter file and construct command-line arguments
• NaN Check
  • Always had NaN check for RWG SGTs, now for AWP SGTs also
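A minimal sketch of what a NaN check over an SGT file could look like. It assumes the file is a flat stream of 4-byte floats in native byte order, which is an assumption about the format and not taken from the slides. The function name and the temporary files are hypothetical, and this is not the study's actual checker.

```python
import math
import os
import tempfile
from array import array

def file_has_nan(path, chunk_floats=1 << 20):
    """Return True if any 4-byte float in the file is NaN, reading in chunks."""
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_floats * 4)
            if not data:
                return False
            usable = len(data) - (len(data) % 4)  # ignore a trailing partial float
            values = array("f")
            values.frombytes(data[:usable])
            if any(math.isnan(v) for v in values):
                return True

def _write_floats(values):
    """Write floats to a temporary file and return its path (demo helper)."""
    with tempfile.NamedTemporaryFile(suffix=".sgt", delete=False) as tmp:
        tmp.write(array("f", values).tobytes())
        return tmp.name

clean_path = _write_floats([0.0, 1.5, -2.25])
bad_path = _write_floats([0.0, float("nan"), 3.0])
clean_result = file_has_nan(clean_path)
bad_result = file_has_nan(bad_path)
os.remove(clean_path)
os.remove(bad_path)
print(clean_result, bad_result)
```

Reading in chunks keeps memory bounded, which matters at the roughly 40 GB/site SGT size quoted later in the deck.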
![Page 13: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/13.jpg)
Changes to PP Software Stack
• Seismogram Synthesis / PSA Calculation
  • Modified to synthesize multiple seismograms per invocation
  • Will use 5 rupture variations per invocation
  • Reduces read I/O by factor of 5
  • Needed to avoid congestion protection events
• All codes tagged in SVN before study begins
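The batching idea can be illustrated as follows: if each invocation reads the extracted SGTs once and then synthesizes every rupture variation in its batch, the number of SGT reads drops by roughly the batch size. The function name and the rupture-variation IDs are hypothetical.

```python
def batch_rupture_variations(rv_ids, batch_size=5):
    """Split rupture-variation IDs into consecutive batches, one batch per invocation."""
    return [rv_ids[i:i + batch_size] for i in range(0, len(rv_ids), batch_size)]

# Hypothetical rupture with 12 variations.
rv_ids = list(range(12))
batches = batch_rupture_variations(rv_ids)

reads_before = len(rv_ids)   # one SGT read per rupture variation
reads_after = len(batches)   # one SGT read per batch
print(batches, reads_before, reads_after)
```

The reduction reaches the full factor of 5 only when a rupture's variation count is a multiple of 5. The last batch here holds just 2 variations.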
![Page 14: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/14.jpg)
Changes to Workflows
• Changed workflow hierarchy
  • 1 integrated workflow per site, per model
  • Added ability to select SGT core count dynamically
  • Put volume creation job into top-level workflow to reduce hierarchy to 2 levels
  • Reduced number of post-processing sub-workflows to 6
    • Fewer jobs in queue
• Will not keep job output if job succeeds
  • Reduce size of workflow logs
![Page 15: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/15.jpg)
Workflow Hierarchy
Integrated Workflow (1 per model per site)
PreCVM (creates volume)
Generate SGT Workflow
SGT Workflow
PP Pre Workflow
PP subwf 0, PP subwf 1, …, PP subwf 5
DB workflow
More details on next slide
![Page 16: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/16.jpg)
[Diagram: workflow details; surviving labels: 6, 68000, 68000]
![Page 17: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/17.jpg)
Distributed Processing
• Cron job on shock.usc.edu creates/plans/runs full workflows
  • Pegasus 4.4, from Git repository
  • Condor 8.0.3
  • Globus 5.0.4
• Jobs submitted to Blue Waters via GRAM
• Results staged back to shock, DB populated, curves generated
• Alternate CPU and GPU workflows for best queue performance
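One way to picture the CPU/GPU alternation is as a simple interleaving of two submission lists, so that neither the XE nor the XK queue receives long bursts of requests. This is only a sketch. The run labels are hypothetical, and the actual cron job logic is not described in the slides.

```python
from itertools import chain, zip_longest

def interleave(cpu_runs, gpu_runs):
    """Alternate CPU and GPU runs; leftovers from the longer list go at the end."""
    pairs = zip_longest(cpu_runs, gpu_runs)
    return [run for run in chain.from_iterable(pairs) if run is not None]

# Hypothetical run labels.
order = interleave(["cpu-A", "cpu-B", "cpu-C"], ["gpu-A", "gpu-B"])
print(order)
```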
![Page 18: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/18.jpg)
Computational Requirements
• Computational time: 275K node-hrs
  • SGT Computational time: 180K node-hrs
    • CPU: 150 node-hrs/site x 286 sites x 2 models = 86K node-hrs (XE, 32 cores/node)
    • GPU: 90 node-hrs/site x 286 sites x 2 models = 52K node-hrs (XK)
    • Study 13.4 had 29% overrun on SGTs
  • PP Computational time: 95K node-hrs
    • 60 node-hrs/site x 286 sites x 4 models = 70K node-hrs (XE, 32 cores/node)
    • Study 13.4 had 35% overrun on PP
• Current allocation has 3.0M node-hrs remaining
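The budget arithmetic can be reproduced as below. It assumes the 180K and 95K totals are the per-site estimates padded by the Study 13.4 overrun percentages. The slide implies this but does not state it, and the slide figures appear to be rounded up.

```python
SITES = 286

def node_hours(per_site, models, sites=SITES):
    """Node-hours for a stage: per-site cost x sites x velocity models."""
    return per_site * sites * models

cpu_sgt = node_hours(150, 2)   # XE nodes
gpu_sgt = node_hours(90, 2)    # XK nodes
pp = node_hours(60, 4)         # XE nodes

# Assumed padding: apply the Study 13.4 overruns to the raw estimates.
sgt_padded = (cpu_sgt + gpu_sgt) * 1.29
pp_padded = pp * 1.35
total_padded = sgt_padded + pp_padded

# Share of the remaining 3.0M node-hr allocation, using the slide's 275K total.
allocation_share = 275_000 / 3_000_000

print(cpu_sgt, gpu_sgt, pp)
print(round(sgt_padded), round(pp_padded), round(total_padded))
print(f"{allocation_share:.1%}")
```

Under these assumptions the padded totals come to about 177K and 93K node-hrs, slightly below the slide's 180K and 95K. The planned 275K is roughly 9% of the remaining allocation.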
![Page 19: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/19.jpg)
Blue Waters Storage Requirements
• Planned unpurged disk usage: 45 TB
  • SGTs: 40 GB/site x 286 sites x 4 models = 45 TB, archived on Blue Waters
• Planned purged disk usage: 783 TB
  • Seismograms: 11 GB/site x 286 sites x 4 models = 12.3 TB, staged back to SCEC
  • PSA files: 0.2 GB/site x 286 sites x 4 models = 0.2 TB, staged back to SCEC
  • Temporary: 690 GB/site x 286 sites x 4 models = 771 TB
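The storage figures above can be checked with a short calculation. The 1024 GB per TB conversion is an inference: it reproduces the slide's 12.3 TB and 771 TB values, whereas a factor of 1000 would not.

```python
SITES = 286
MODELS = 4
GB_PER_TB = 1024  # assumed; chosen because it matches the slide's totals

def total_tb(gb_per_site):
    """Total size in TB for a per-site, per-model product."""
    return gb_per_site * SITES * MODELS / GB_PER_TB

sgt_tb = total_tb(40)          # unpurged, archived on Blue Waters
seismogram_tb = total_tb(11)   # purged, staged back to SCEC
psa_tb = total_tb(0.2)         # purged, staged back to SCEC
temp_tb = total_tb(690)        # purged
purged_tb = seismogram_tb + psa_tb + temp_tb

print(round(sgt_tb), round(seismogram_tb, 1), round(psa_tb, 1),
      round(temp_tb), round(purged_tb))
```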
![Page 20: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/20.jpg)
SCEC Storage Requirements
• Planned archival disk usage: 12.5 TB
  • Seismograms: 12.3 TB (scec-04 has 19 TB)
  • PSA files: 0.2 TB (scec-04)
  • Curves, disagg, reports: 93 GB (99% reports)
• Planned database usage: 210 GB
  • 3 rows/rupture variation x 410K rupture variations/site x 286 sites x 4 models = 1.4B rows
  • 1.4B rows x 151 bytes/row = 210 GB (880 GB free)
• Planned temporary disk usage: 5.5 TB
  • Workflow logs: 5.5 TB, possibly smaller since not all output is saved anymore (scec-02 has 12 TB free)
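The database estimate follows directly from the per-site counts. The sketch below reproduces it, taking 410K as exactly 410,000 and using decimal gigabytes (10^9 bytes), which is an assumption about the units. The same rupture-variation count also gives the seismogram total quoted in the data products.

```python
SITES = 286
MODELS = 4
RVS_PER_SITE = 410_000      # "410K" taken as exactly 410,000
ROWS_PER_RV = 3             # one row per period: 3s, 5s, 10s
BYTES_PER_ROW = 151

seismograms = RVS_PER_SITE * SITES * MODELS
rows = ROWS_PER_RV * seismograms
db_gb = rows * BYTES_PER_ROW / 1e9  # decimal GB (assumed)

print(seismograms, rows, round(db_gb, 1))
```

This gives about 469M seismograms, 1.41B rows, and 212 GB, consistent with the slides' rounded ~470M, 1.4B, and 210 GB.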
![Page 21: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/21.jpg)
Metrics Gathering
• Monitord for workflow metrics
  • Will run after workflows have completed
• Python scripts
  • Used to obtain some of the standard CyberShake metrics for comparison
• Cronjob on Blue Waters
  • Core usage over time
  • Jobs running and idle counts
• Will use start and end of workflow logs to perform makespan measurement
![Page 22: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/22.jpg)
Estimated Duration
• Limiting factors:
  • Queue time
    • Especially for XK nodes, could be substantial percentage of run time
  • Blue Waters -> SCEC transfer
    • If Blue Waters throughput is very high, transfer could be bottleneck
• With queues, estimated completion is 4 weeks
  • 1 hazard map/week
  • Requires average of 410 nodes
  • 603 nodes averaged during Study 13.4
• With a reservation, completion depends on the reservation size
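The node requirement follows from dividing the 275K node-hr budget by the wall-clock hours in the target duration. The same relation gives a rough duration for any sustained node count, such as a reservation. The function names are illustrative, and the estimate ignores queue time and transfer limits.

```python
TOTAL_NODE_HOURS = 275_000
HOURS_PER_WEEK = 7 * 24

def average_nodes_needed(weeks):
    """Average concurrently busy nodes needed to finish in the given number of weeks."""
    return TOTAL_NODE_HOURS / (weeks * HOURS_PER_WEEK)

def weeks_to_complete(sustained_nodes):
    """Weeks needed if the given node count is kept busy continuously."""
    return TOTAL_NODE_HOURS / (sustained_nodes * HOURS_PER_WEEK)

nodes_for_4_weeks = average_nodes_needed(4)
weeks_at_study_13_4_rate = weeks_to_complete(603)

print(round(nodes_for_4_weeks), round(weeks_at_study_13_4_rate, 1))
```

Four weeks needs about 409 nodes on average, which the slide rounds to 410. Sustaining the 603 nodes averaged in Study 13.4 would shorten the run to roughly 2.7 weeks.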
![Page 23: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/23.jpg)
Personnel Support
• Scientists
  • Tom Jordan, Kim Olsen, Rob Graves
• Technical Lead
  • Scott Callaghan
• SGT code support
  • Efecan Poyraz, Yifeng Cui
• Job Submission / Run Monitoring
  • Scott Callaghan, David Gill, Heming Xu, Phil Maechling
• NCSA Support
  • Omar Padron, Tim Bouvet
• Workflow Support
  • Karan Vahi, Gideon Juve
![Page 24: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/24.jpg)
Risks
• Queue times on Blue Waters
  • In tests, at times GPU queue times have been > 1 day
• Congestion protection events
  • If triggered consistently, will either need to throttle post-processing or suspend run until improvements are developed
![Page 25: CyberShake Study 14.2 Technical Readiness Review](https://reader035.vdocument.in/reader035/viewer/2022081513/5681645e550346895dd631c9/html5/thumbnails/25.jpg)
Thanks for your time!