Post on 16-Feb-2016
Instrumenting Folding@Work
Badi Abdul-Wahid, RJ Nowling
CSE 60641 Operating Systems
Professor Striegel
Overview
• Problem Description
– Experimental Structure
– Folding@Work Workflow
• Benchmarks
• Results
– Weak Scaling (ns / day)
– Server Capacity
– Available Workers Over Time
– Variability of Computation Time
• Conclusions
Experimental Structure
Folding@Work Workflow
Benchmarks
• Tasks: 1 ns generations (approx. 2 hr on test machine)
• 10 consecutive generations / simulations
• Weak Scaling
– 10 simulations / 10 workers
– 100 simulations / 100 workers
– 1,000 simulations / 1,000 workers
• Condor, later added SGE jobs
• 1 trial of each; took ~2 days to run
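As a rough reference point for the weak-scaling benchmarks, the ideal throughput in ns / day follows from the task size above (1 ns per task, roughly 2 hr per task). The sketch below assumes every worker stays busy with no wait or transfer overhead; the function name and defaults are illustrative, not part of Folding@Work.

```python
def ideal_ns_per_day(workers, ns_per_task=1.0, hours_per_task=2.0):
    """Upper bound on simulated ns per day for a given number of workers.

    Assumes each worker runs tasks back to back with zero overhead,
    which real runs will not achieve.
    """
    if hours_per_task <= 0:
        raise ValueError("hours_per_task must be positive")
    tasks_per_worker_per_day = 24.0 / hours_per_task
    return workers * tasks_per_worker_per_day * ns_per_task


# Ideal throughput at the three benchmark sizes.
for n in (10, 100, 1000):
    print(n, "workers:", ideal_ns_per_day(n), "ns / day")
```

Measured ns / day can then be compared against this bound to see how much is lost to waiting and transfers.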
Weak Scaling of F@W
Server Capacity (Wait Time)
Available Workers over Time
Transfer Times
Variability of Computation Time
Example Execution Timeline
Performance Model
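\[
N_{wu} = \frac{\langle t_{exe} \rangle + \langle t_{W,wait} \rangle}{\langle t_{new} \rangle + \langle t_{trans} \rangle + \langle t_{M,wait} \rangle}
\]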
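The ratio above can be evaluated directly from measured averages. The sketch below is only an illustration: reading the numerator as average time per work unit on the worker side (execution plus worker wait) and the denominator as average time per work unit on the master side (creating a new unit, transfer, master wait) is an assumption, and the sample timings are hypothetical, not results from the benchmarks.

```python
def work_units_ratio(t_exe, t_w_wait, t_new, t_trans, t_m_wait):
    """Evaluate N_wu = (<t_exe> + <t_W,wait>) / (<t_new> + <t_trans> + <t_M,wait>).

    All arguments are average times in the same unit (e.g. seconds).
    """
    master_time = t_new + t_trans + t_m_wait
    if master_time <= 0:
        raise ValueError("master-side time must be positive")
    return (t_exe + t_w_wait) / master_time


# Hypothetical averages in seconds: a 2 hr task with small overheads.
n_wu = work_units_ratio(t_exe=7200, t_w_wait=60, t_new=1, t_trans=4, t_m_wait=1)
print(n_wu)
```

With these made-up numbers, shrinking the execution time per work unit while the master-side time stays fixed lowers the ratio, which is one way to see why very short work units put more load on the master.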
Weak Scaling (updated)
Wait Times
Tasks Waiting
Identified Areas of Improvement
• Availability of Resources
– Benchmarks limited by number of sustained workers available through Condor
– New feature: WorkQueue Worker Pool can be used to start new workers
• WorkQueue Limits Number of Workers
– Increasing number of file descriptors allowed up to 2,500 workers to connect
– Bad behavior occurring in calls to select()
– Working with WorkQueue developers to switch to poll()
• Long-Running Work Units Delay Completion of Trajectories
– Some work units not returned / taking very long time
– Prevents trajectories from finishing
– Use fast abort feature to re-assign work units that take longer than a specified time
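The idea behind re-assigning long-running work units can be sketched in a few lines. This is a standalone illustration of the selection logic only, not WorkQueue's fast abort implementation or API; `RunningUnit`, `units_to_reassign`, and the timings are hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass
class RunningUnit:
    unit_id: str
    started_at: float  # seconds since some common reference time


def units_to_reassign(running, now, max_runtime):
    """Return ids of work units that have run longer than max_runtime seconds."""
    return [u.unit_id for u in running if now - u.started_at > max_runtime]


# Hypothetical state: three units in flight, with a 3 hr limit for ~2 hr tasks.
running = [
    RunningUnit("traj1-gen4", started_at=0.0),
    RunningUnit("traj2-gen4", started_at=9000.0),
    RunningUnit("traj3-gen4", started_at=2000.0),
]
stale = units_to_reassign(running, now=13000.0, max_runtime=3 * 3600)
print(stale)
```

A master loop could cancel the units returned here and submit them again to other workers, so that one slow or lost worker does not hold up a whole trajectory.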
Conclusion
• Accomplished
– Identified key metrics (ns / day, wait time)
– Developed scaling model
– Tested model
• Conclusions
– Real scientific applications scale well
– Forcing short work units adds load to Master
– Performance model validated
– “Self-correcting” behavior