DeepSense Computing Platform
Agenda
• System Overview
• File Systems
• Data Transfer
• IBM Spectrum LSF
• Conductor with Spark
• Technical Support
System Overview
• Compute Nodes
  o 20 large-memory nodes
    - 20 cores, 512 GB memory
  o 4 huge-memory nodes
    - 20 cores, 1 TB memory
  o 10 GPU nodes
    - 2 × NVIDIA P100, 20 cores, 512 GB memory
• Operating System
  o Red Hat Enterprise Linux 7.5
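The node counts above add up to the cluster's aggregate capacity. A small Python check of that arithmetic; the NODE_CLASSES table and the totals helper are illustrative names, and 1 TB is counted as 1024 GB:

```python
# Node classes as listed on the System Overview slide.
# Assumption: 1 TB is counted as 1024 GB.
NODE_CLASSES = {
    "large_memory": {"count": 20, "cores": 20, "mem_gb": 512, "gpus": 0},
    "huge_memory": {"count": 4, "cores": 20, "mem_gb": 1024, "gpus": 0},
    "gpu": {"count": 10, "cores": 20, "mem_gb": 512, "gpus": 2},
}


def totals(classes):
    """Sum nodes, CPU cores, memory (GB) and GPUs across all node classes."""
    result = {"nodes": 0, "cores": 0, "mem_gb": 0, "gpus": 0}
    for spec in classes.values():
        n = spec["count"]
        result["nodes"] += n
        result["cores"] += n * spec["cores"]
        result["mem_gb"] += n * spec["mem_gb"]
        result["gpus"] += n * spec["gpus"]
    return result


if __name__ == "__main__":
    print(totals(NODE_CLASSES))
    # {'nodes': 34, 'cores': 680, 'mem_gb': 19456, 'gpus': 20}
```

That is 34 compute nodes with 680 cores, about 19 TB of RAM in total and 20 P100 GPUs.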
Heterogeneous Computing with GPUs
NVIDIA P100 GPU
Cores                         3584
Memory                        16 GB HBM2
Memory bandwidth              720 GB/s
FLOPS (single precision)      9.3 TFLOPS
FLOPS (double precision)      4.7 TFLOPS
FLOPS (half precision)        18.7 TFLOPS
Power consumption             250 W
CUDA compute capability       6.0
IBM POWER8 with NVIDIA P100
• CPU-GPU systems connected via PCIe
• NVLink enables fast unified memory access between CPU and GPU memory
Applications
• Domain
  - Ocean data products, Shipbuilding, Fisheries and Aquaculture, Seaport and Logistics, Security and Defense, Marine Risk…
• Data Source
  - Sensor logs, text, image, video, web traffic, geospatial, AIS, …
• Analytics
  - Image Processing, Time-Series, Predictive Analytics, Machine Learning, Deep Learning, Distributed Computing, …
File System
Area     Directory                 Purpose       Quota                           Backed up?   Purged?
Home     /dshome/subdir/username   development   1 TB, 500k files per user       yes          no
Data     /data/projectname         development   2 TB, 500k files per project    yes          no
Scratch  /scratch/username         computation   2 TB, 1M files per user         no           yes
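To see how close a directory is to the quotas in the table, its files can be counted and sized locally. A minimal sketch; the helper names are hypothetical, "TB" is assumed to mean 2**40 bytes, and the result only approximates what the file system enforces (the real quota may count inodes or blocks differently), so treat it as an estimate:

```python
import os

# Quotas from the table above; "TB" is assumed to mean 2**40 bytes.
TB = 1024 ** 4
QUOTAS = {
    "home": {"bytes": 1 * TB, "files": 500_000},
    "data": {"bytes": 2 * TB, "files": 500_000},
    "scratch": {"bytes": 2 * TB, "files": 1_000_000},
}


def usage(root):
    """Return (total_bytes, file_count) for files under root.

    Symbolic links are counted by their own size and never followed;
    entries that vanish or cannot be stat'ed are skipped.
    """
    total_bytes = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue
            total_bytes += st.st_size
            file_count += 1
    return total_bytes, file_count


def quota_report(root, area):
    """Percentage of the byte and file quotas used under root for one area."""
    used_bytes, used_files = usage(root)
    quota = QUOTAS[area]
    return {
        "bytes_used_pct": 100.0 * used_bytes / quota["bytes"],
        "files_used_pct": 100.0 * used_files / quota["files"],
    }


if __name__ == "__main__":
    print(quota_report(os.path.expanduser("~"), "home"))
```

Walking a large tree such as a full home directory can take a while, so it is best run from a login node during development rather than inside a batch job.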
Data Transfer
• Two protocol nodes
  o protocol1.deepsense.ca
  o protocol2.deepsense.ca
• Connect using Samba (SMB):
  - smb://protocol1.deepsense.ca
  - log in with your DeepSense account
Data Backups
• DeepSense is a platform for data analytics. It is not meant for long-term storage.
  o Users should ensure their original data is backed up at their own site.
• We do have daily backups
  o /dshome, /data and /software are backed up every evening
  o The backup keeps 7 versions of each file
  o Once a file is deleted, its backup is kept for 30 days, after which it can no longer be restored
  o If you need to restore a file, please let us know
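The backup rules above reduce to two quick checks: whether a path lies in a backed-up tree, and until when a deleted file should still be restorable. A minimal sketch; the helper names are hypothetical, paths are assumed to be absolute and already normalized, and the 30-day window is assumed to count whole calendar days from the deletion date (the exact cut-off is not stated, so near the boundary it is safest to ask support):

```python
from datetime import date, timedelta
from pathlib import PurePosixPath

# Trees backed up every evening, per the Data Backups slide (/scratch is not).
BACKED_UP_ROOTS = ("/dshome", "/data", "/software")
# Retention window for deleted files (assumed to count whole calendar days).
DELETED_RETENTION = timedelta(days=30)


def is_backed_up(path):
    """True if path is one of the backed-up roots or lies underneath one.

    Comparing path components (not string prefixes) keeps /database
    from being mistaken for /data.
    """
    p = PurePosixPath(path)
    return any(p == root or root in p.parents
               for root in map(PurePosixPath, BACKED_UP_ROOTS))


def last_restorable_day(deleted_on):
    """Last day a file deleted on deleted_on should still be in the backup."""
    return deleted_on + DELETED_RETENTION


def still_restorable(deleted_on, today=None):
    """True if today falls within the retention window after deletion."""
    today = today or date.today()
    return today <= last_restorable_day(deleted_on)


if __name__ == "__main__":
    print(is_backed_up("/scratch/alice/tmp.dat"))   # False
    print(last_restorable_day(date(2019, 3, 1)))    # 2019-03-31
```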
IBM Spectrum LSF
• Workload management platform
• Maximize utilization for distributed high-performance computing
• GPU Support
• Execute batch/interactive jobs
• Containerized workloads
LSF Access and Login
• User account: your DeepSense account
• Login nodes:
  o login1.deepsense.ca
  o login2.deepsense.ca
• Example connection:
  o ssh <username>@login1.deepsense.ca
  o on Mac or Linux clients, use a terminal
  o on Windows clients, use PuTTY or MobaXterm
• If you are off campus, you need a Dalhousie VPN connection
Submitting Jobs to LSF
• Development/test jobs
  o For testing and development, use the login nodes
  o The login nodes are shared with all users
• Batch jobs
  o Command: bsub
  o With ‘bsub’ options, specify input/output files, GPU options, CPU/memory limits, etc.
• Interactive jobs
  o Command: bsub -I
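The batch-job options listed above are usually written as #BSUB lines at the top of a job script, which is then passed to bsub on standard input (bsub < job.sh). A minimal Python sketch that builds such a script and submits it; the helper names, the train.py command and the defaults are hypothetical, the memory value is assumed to be in MB (sites set this through LSF_UNIT_FOR_LIMITS), and the -gpu option assumes LSF 10.1 or later:

```python
import re
import subprocess


def make_job_script(command, job_name="myjob", cores=1, mem_mb=None,
                    walltime="01:00", gpus=0):
    """Build an LSF batch script whose #BSUB lines carry the bsub options.

    Assumptions (check against the site's LSF configuration):
      * rusage[mem=...] is interpreted in MB;
      * the -gpu option is available (LSF 10.1 or later).
    """
    lines = [
        "#!/bin/bash",
        f"#BSUB -J {job_name}",          # job name
        f"#BSUB -n {cores}",             # number of slots (cores)
        f"#BSUB -W {walltime}",          # wall-clock limit, [HH:]MM
        f"#BSUB -o {job_name}.%J.out",   # stdout file; %J expands to the job ID
        f"#BSUB -e {job_name}.%J.err",   # stderr file
    ]
    if mem_mb is not None:
        lines.append(f'#BSUB -R "rusage[mem={mem_mb}]"')  # memory reservation
    if gpus > 0:
        lines.append(f'#BSUB -gpu "num={gpus}"')          # GPUs per host
    lines.append(command)
    return "\n".join(lines) + "\n"


def submit(script):
    """Pipe the script to bsub (like `bsub < job.sh`) and return the job ID."""
    result = subprocess.run(["bsub"], input=script, capture_output=True,
                            text=True, check=True)
    # bsub replies with a line such as: Job <12345> is submitted to ...
    match = re.search(r"Job <(\d+)>", result.stdout)
    if match is None:
        raise RuntimeError(f"unexpected bsub output: {result.stdout!r}")
    return int(match.group(1))


if __name__ == "__main__":
    script = make_job_script("python train.py", job_name="train",
                             cores=4, mem_mb=16000, walltime="04:00", gpus=1)
    print(script)
    # print(submit(script))  # run this on a login node
```

For interactive work, bsub -I runs a command with its output sent back to the terminal, while bsub -Is bash opens an interactive shell on a compute node.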
LSF Monitoring/Cancelling Jobs
• Check running jobs
  o bjobs -l
  o bjobs -l <jobid> (details for one job)
• Control job execution
  o Job suspend: bstop <jobid>
  o Job resume: bresume <jobid>
  o Job kill: bkill <jobid>
• Check available hosts
  o bhosts
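The monitoring and control commands above can also be scripted. A minimal sketch, assuming LSF 10.1 or later for the -o custom-output and -noheader options of bjobs; the helper names are hypothetical, and job IDs are kept as strings exactly as bjobs prints them:

```python
import subprocess

# Slide commands for controlling a job.
CONTROL_COMMANDS = {"suspend": "bstop", "resume": "bresume", "kill": "bkill"}


def parse_jobs(text):
    """Parse lines of `bjobs -o "jobid stat job_name" -noheader` output.

    job_name is the last field, so names containing spaces stay intact.
    Lines whose first field does not start with a digit (blank lines or
    notices such as "No unfinished job found") are skipped.
    """
    jobs = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2 or not parts[0][0].isdigit():
            continue
        name = parts[2] if len(parts) == 3 else ""
        jobs.append({"id": parts[0], "stat": parts[1], "name": name})
    return jobs


def list_jobs():
    """Return the current user's unfinished jobs (ID, status, name)."""
    out = subprocess.run(["bjobs", "-o", "jobid stat job_name", "-noheader"],
                         capture_output=True, text=True).stdout
    return parse_jobs(out)


def control(action, job_id):
    """Suspend, resume or kill one job via bstop / bresume / bkill."""
    subprocess.run([CONTROL_COMMANDS[action], str(job_id)], check=True)


if __name__ == "__main__":
    for job in list_jobs():
        print(job["id"], job["stat"], job["name"])
```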
IBM Spectrum Conductor with Spark (CWS)
• Spark integration and lifecycle management platform
• Support for multiple Spark versions
• Integrated application platform
• Notebooks, Deep Learning packages
• Simplified administration
Accessing CWS
• Management Console
  o Go to https://ds-mgm-02.deepsense.cs.dal.ca:8443
  o Log in using your DS account
• Command-line option
  o From a login node, ssh to:
    ssh ds-cmhm-02.deepsense.cs.dal.ca
  o Source the environment
  o Log in to CWS using your DS account
CWS - Spark Instance Group
• From the dashboard, go to:
  o Workload -> Spark -> Spark Instance Group
  o Specify name, directory and user
• Choose Spark version
  o Spark 2.3.1, Spark 2.2.0, Spark 2.1.1, Spark 1.6.1
• Optional: choose a notebook
Technical Support
• Documentation
  o DeepSense computing platform wiki page
    https://docs.deepsense.ca
  o IBM Knowledge Center
    https://www.ibm.com/support/knowledgecenter/
• Troubleshooting/technical questions
  o Send email to [email protected]
Questions?