High Throughput Computing with Condor at Purdue
XSEDE ECSS Monthly Symposium
Condor
Topics
• What is Condor?
• What is High Throughput Computing?
• Why Condor? Why not Condor?
• Condor at Purdue
• Submitting and managing jobs
• Suitable jobs
What is Condor?
• A product of the University of Wisconsin-Madison
• A job scheduler
• A resource manager
• A workflow management system
• Focused on High Throughput Computing
What is High Throughput Computing (HTC)?
• Large amounts of processing
• Long period of time
HTC v. HPC
• FLOPS extracted v. FLOPS
• Distributed Ownership v. Central Ownership
• Capturing Idle Cycles v. Losing Idle Cycles
• Throughput v. Response Time
• Distributed Memory v. Tightly-coupled Memory
• 1,000 Jobs v. 1 Job
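The "Throughput v. Response Time" and "1,000 Jobs v. 1 Job" contrasts can be made concrete with a little arithmetic. The sketch below is only an illustration under simplifying assumptions: every job takes the same time, jobs are independent, and no job is ever evicted. The function names and the numbers in the usage note are hypothetical.

```python
import math


def makespan_hours(n_jobs: int, hours_per_job: float, slots: int) -> float:
    """Wall-clock hours to finish n_jobs independent jobs on `slots` cores.

    Assumes identical job lengths and no evictions, so the jobs run in
    ceil(n_jobs / slots) back-to-back waves.
    """
    waves = math.ceil(n_jobs / slots)
    return waves * hours_per_job


def throughput_per_day(n_jobs: int, hours_per_job: float, slots: int) -> float:
    """Average number of jobs completed per 24 hours over the whole run."""
    return n_jobs / makespan_hours(n_jobs, hours_per_job, slots) * 24
```

Under these assumptions, 1,000 two-hour jobs take 2,000 hours on a single core but 8 hours on 250 scavenged cores. No individual job finishes any sooner; only the number of jobs completed per day goes up, which is the quantity HTC tries to maximize.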
Why Condor?
• Wasted compute cycles
• Scheduling of related jobs
• Access to more cores
Advantages of Condor
• Many tasks running at once
• Access to more powerful computers
• Using wasted cycles
• Minimal impact on remote computers
• Security
• Little or no code modification
Disadvantages of Condor
• Compete for access
• Task may take longer to complete
• Processing can be lost
• Parallel jobs aren’t available
• Large files can impact the remote computer
• Heterogeneity of the remote computers
• Few compatible compilers
Condor at Purdue
• Installed on large cyberinfrastructure clusters
• Installed in distributed desktops
• Used as a scavenger of free cycles
• Parallel jobs not supported
• ~27K Linux cores and 1K Windows cores
• Several more kilocores at DiaGrid partner sites
Condor at Purdue
• Jobs are vacated when a PBS job starts
  – Long-running jobs may never complete
• Common home directory across clusters
• Scratch directories roughly per-cluster
• ~7 TB of checkpoint storage for standard universe jobs
Job Universes
• Vanilla universe
  – Doesn't require a recompile
  – No native checkpoint mechanism
• Standard universe
  – Streams I/O (can overload the submit node)
  – Supports checkpointing
  – No fork(), shared memory, pipes
File transfer
• A vanilla universe feature
• Allows jobs to flow to other sites
Compiling for Condor
• A standard universe requirement
• The condor_compile command wraps a limited compiler set
• Links against Condor libraries to add support for I/O streaming and checkpointing
Checkpointing
• Saves all state information
• Transfers state information to Condor management
• Deletes job from processor
• Restarts interrupted job on another unused processor
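The save-and-restart cycle above can be imitated inside an application, which is one option for vanilla universe jobs that have no native checkpoint mechanism. The sketch below is an application-level analogy only, not the mechanism Condor's standard universe uses. The function name, the state layout, and the `stop_after` parameter (which simulates an eviction) are all hypothetical.

```python
import os
import pickle


def run_with_checkpoints(total_steps, checkpoint_path, stop_after=None):
    """Sum the integers 0..total_steps-1, saving progress after every step.

    If a checkpoint file exists, work resumes from it. `stop_after` simulates
    an eviction: the function returns None once that many steps have been
    done in this run, leaving the checkpoint on disk for a later restart.
    """
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)          # resume saved state
    else:
        state = {"next_step": 0, "total": 0}

    steps_this_run = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return None                     # "evicted" before finishing

        state["total"] += state["next_step"]
        state["next_step"] += 1
        steps_this_run += 1

        # Write to a temporary file and rename it, so an interruption
        # during the write never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state["total"]
```

A first call with `stop_after=4` stops early and returns `None`; a second call with the same checkpoint path picks up at step 4 and returns the full sum. For this to work across machines, the checkpoint file has to live somewhere the restarted job can read it.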
Job lifecycle
• Job is submitted
• Scheduler process contacts negotiator process
• Negotiator matches job to an available slot
• If no slots are available, scheduler contacts remote negotiator
• Execute node runs job
• If job gets evicted, scheduler process contacts negotiator process again
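The matching order in this lifecycle (local slots first, then a remote pool) can be pictured with a toy model. This is a deliberately simplified sketch, not how the Condor negotiator is implemented: it ignores requirements, rank, and priorities, and the function name and slot names are hypothetical.

```python
from collections import deque


def negotiate(job_ids, local_slots, remote_slots):
    """Match each job to a free slot, trying the local pool first.

    Returns a dict {job_id: slot_name}. Jobs that find no free slot are
    simply left out of the dict, i.e. they stay idle in the queue.
    """
    local = deque(local_slots)
    remote = deque(remote_slots)
    matches = {}
    for job in job_ids:
        if local:
            matches[job] = local.popleft()    # local negotiator found a slot
        elif remote:
            matches[job] = remote.popleft()   # fall back to the remote pool
    return matches
```

With four jobs, one local slot, and two remote slots, three jobs are matched and the fourth waits. An evicted job would go back into `job_ids` for the next negotiation pass.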
Submitting a job
• Create a submit file:
# Simple Condor job file
Executable = bin/simpletest
Arguments = 600
Universe = standard
Log = log/$(Cluster).$(Process).log
Error = log/$(Cluster).$(Process).err
Output = log/$(Cluster).$(Process).out
+TGProject = TG-STA060013N
Queue 10
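`Queue 10` creates ten jobs in a single cluster, numbered with `$(Process)` values 0 through 9, so each job gets its own log, error, and output file. The sketch below only mimics that macro substitution to show which file names result. The helper name and the cluster number 42 are hypothetical, since the real cluster ID is assigned at submit time.

```python
def expand_names(template, cluster, queue_count):
    """Return the file names one submit-file line expands to.

    Mimics substitution of $(Cluster) and $(Process) for a `Queue N`
    statement: one name per process, with Process counting from 0.
    """
    return [
        template.replace("$(Cluster)", str(cluster)).replace("$(Process)", str(proc))
        for proc in range(queue_count)
    ]


if __name__ == "__main__":
    for name in expand_names("log/$(Cluster).$(Process).out", 42, 10):
        print(name)
```

Because the names differ per process, the ten jobs never overwrite each other's output. The `log/` directory has to exist before submission.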
Submitting a job
• With file transfer:
# Simple Condor job file
Executable = bin/process_files.sh
Universe = vanilla
ShouldTransferFiles = if_needed
Transfer_input_files = input.dat
Transfer_output_files = output.png
Log = log/$(Cluster).$(Process).log
+TGProject = TG-STA060013N
Queue
Submitting a job
• Job submitted with the condor_submit command:
condor_submit myjobfile.condor
Managing jobs
• Get all jobs in queue: condor_q
• Get only user's jobs: condor_q user
• Why isn't my job running? condor_q -better-analyze jobid
• Remove a job: condor_rm jobid
Getting the most cores: Requirements = ...
• Condor tries to be helpful by inserting automatic job requirements
  – OpSys
  – Arch
  – FileSystemDomain
  – Memory >= ImageSize
• This sometimes over-constrains jobs
Getting the most cores: Requirements = ...
• The Requirements attribute gives you the flexibility to add or remove execute nodes
• Example: job files are in your home directory
Requirements = regexp("rcac.purdue.edu", FileSystemDomain)
• Example: job executable is a Windows binary
Requirements = (OpSys == "WINNT61")
A special note about Memory
• Condor sometimes overestimates the memory usage of a job
• Condor reports totalmemory/cores, but jobs are not memory constrained
• It’s best to put a dummy memory requirement in the submission file
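The arithmetic behind this note can be sketched as follows. The numbers are hypothetical, and the sketch assumes both quantities are expressed in MB for simplicity; it only illustrates why a per-slot figure of total memory divided by cores can cause the automatic `Memory >= ImageSize` requirement to reject a node that physically has enough memory.

```python
def slot_memory_mb(total_memory_mb, cores):
    """Memory a single slot advertises: total node memory split evenly by core."""
    return total_memory_mb // cores


def matches_auto_requirement(slot_mb, job_image_mb):
    """The automatic requirement from the slides: Memory >= ImageSize."""
    return slot_mb >= job_image_mb


if __name__ == "__main__":
    slot = slot_memory_mb(32768, 16)             # hypothetical 32 GB, 16-core node
    print(slot)                                  # 2048
    print(matches_auto_requirement(slot, 3000))  # False: job estimated at 3000 MB
```

In this example a job whose size is estimated at 3000 MB matches no slot on the node, even though the node has 32 GB in total and the slides state that jobs are not actually memory constrained. A user-supplied memory requirement replaces the automatic one, which is why a dummy value helps.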
Getting the most out of your cores: Rank = ...
• You can express a preference for the nodes a job lands on
• Example: prefer 64-bit nodes with lots of memory
Rank = (Arch == "X86_64")*1000 + Memory
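To see how this Rank expression orders candidate machines, the sketch below evaluates the same formula in Python, treating a true comparison as 1 and a false one as 0. The machine names and attribute values are hypothetical, and the real expression is evaluated by Condor against machine ClassAds, not by code like this.

```python
def rank(machine):
    """Mirror of (Arch == "X86_64") * 1000 + Memory for one machine."""
    return int(machine["Arch"] == "X86_64") * 1000 + machine["Memory"]


def order_by_rank(machines):
    """Most-preferred machine first (higher rank is better)."""
    return sorted(machines, key=rank, reverse=True)


if __name__ == "__main__":
    candidates = [
        {"Name": "a", "Arch": "X86_64", "Memory": 2048},
        {"Name": "b", "Arch": "INTEL", "Memory": 2048},
        {"Name": "c", "Arch": "INTEL", "Memory": 4096},
    ]
    for m in order_by_rank(candidates):
        print(m["Name"], rank(m))
```

One consequence worth noticing: the 64-bit bonus is worth exactly 1000 units of Memory, so in this example the 32-bit machine with 4096 outranks the 64-bit machine with 2048. If 64-bit should always win, the multiplier needs to exceed the largest Memory value in the pool.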
Workflow management with DAGman
• Directed Acyclic Graph Manager
• Defines parent-child relationships among jobs
• Allows pre- and post-execution hooks
• Submit with condor_submit_dag
Diamond DAG
[Figure: diamond-shaped DAG. A is the parent of B1 and B2; B1 and B2 are both parents of C.]
Diamond DAG
# Diamond-shaped DAG
Job First p_00060.A.sub
Job Second_1 p_00060.B1.sub
Job Second_2 p_00060.B2.sub
Job Third p_00060.C.sub
PARENT First CHILD Second_1 Second_2
PARENT Second_1 Second_2 CHILD Third
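The PARENT/CHILD lines above determine which jobs may run at the same time. As a rough illustration of that ordering (not of DAGMan's own implementation), the sketch below feeds the same dependencies to Python's standard `graphlib` module (Python 3.9 or later) and groups the jobs into waves that could run concurrently. The `execution_waves` helper is hypothetical.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of parents it must wait for,
# transcribed from the PARENT ... CHILD ... lines of the diamond DAG.
DIAMOND = {
    "First": set(),
    "Second_1": {"First"},
    "Second_2": {"First"},
    "Third": {"Second_1", "Second_2"},
}


def execution_waves(parents):
    """Group jobs into successive waves; jobs in one wave have no
    unfinished parents and could run at the same time."""
    sorter = TopologicalSorter(parents)
    sorter.prepare()                       # raises CycleError if not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)                # mark the whole wave as finished
    return waves


if __name__ == "__main__":
    print(execution_waves(DIAMOND))
    # [['First'], ['Second_1', 'Second_2'], ['Third']]
```

The middle wave is where the diamond pays off: `Second_1` and `Second_2` are independent of each other, so they can occupy two slots at once, while `Third` waits for both.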
More complex DAGs
Who Benefits from Condor?
• Monte Carlo simulations
• Parameter sweeps
• “Embarrassingly parallel” jobs
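A parameter sweep maps naturally onto Condor because every parameter combination becomes one independent job. The sketch below generates a submit file with one `Arguments`/`Queue` pair per combination, in the same style as the submit files shown earlier. The executable path, the parameter names (temperatures and seeds), and the helper function are hypothetical.

```python
from itertools import product


def sweep_submit_file(executable, temperatures, seeds):
    """Build submit-file text that queues one job per (temperature, seed) pair."""
    lines = [
        "# Parameter sweep: one job per parameter combination",
        f"Executable = {executable}",
        "Universe = vanilla",
        "Log = log/$(Cluster).$(Process).log",
        "Output = log/$(Cluster).$(Process).out",
        "Error = log/$(Cluster).$(Process).err",
    ]
    for temperature, seed in product(temperatures, seeds):
        # Each Arguments line applies to the Queue statement that follows it.
        lines.append(f"Arguments = {temperature} {seed}")
        lines.append("Queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(sweep_submit_file("bin/simulate", [280, 300], [1, 2, 3]))
```

Two temperatures and three seeds yield six queued jobs in one cluster. Writing the returned text to a file and passing that file to condor_submit would submit the whole sweep at once.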
Purdue’s Condor Users
• Structural Biology
• Education
• Chemical Engineering
• Bioinformatics
• Climate Visualization
• Distributed Rendering
• High Energy Physics
For more information
• University of Wisconsin website: http://research.cs.wisc.edu/condor
• Email: