instrumenting folding@work

Post on 16-Feb-2016


Instrumenting Folding@Work

Badi Abdul-Wahid, RJ Nowling

CSE 60641 Operating Systems

Professor Striegel

Overview

• Problem Description
  – Experimental Structure
  – Folding@Work Workflow
• Benchmarks
• Results
  – Weak Scaling (ns / day)
  – Server Capacity
  – Available Workers Over Time
  – Variability of Computation Time
• Conclusions

Experimental Structure

Folding@Work Workflow

Benchmarks

• Tasks: 1 ns generations (approx. 2 hr each on the test machine)
• 10 consecutive generations per simulation
• Weak Scaling
  – 10 simulations / 10 workers
  – 100 simulations / 100 workers
  – 1,000 simulations / 1,000 workers
• Workers run as Condor jobs; SGE jobs added later
• 1 trial of each; took ~2 days to run
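The ns / day metric used for the weak-scaling results can be sketched from the benchmark numbers above. This is a minimal illustration, not the instrumentation used in the study: the function names are hypothetical, and the "ideal" figure assumes every worker is busy all the time and matches the ~2 hr per 1 ns task measured on the test machine.

```python
def ideal_ns_per_day(workers, ns_per_task=1.0, hours_per_task=2.0):
    """Aggregate throughput if every worker is always busy (assumption)."""
    tasks_per_worker_per_day = 24.0 / hours_per_task
    return workers * ns_per_task * tasks_per_worker_per_day


def measured_ns_per_day(total_ns, wall_clock_hours):
    """Throughput actually achieved: simulated ns scaled to a 24 hr day."""
    return total_ns * 24.0 / wall_clock_hours


# The three weak-scaling configurations: N simulations on N workers.
for n in (10, 100, 1000):
    print(n, "workers -> ideal", ideal_ns_per_day(n), "ns/day")
```

Under perfect weak scaling the ideal value grows linearly with the number of workers, so the ratio of measured to ideal throughput gives a rough efficiency for each configuration.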

Weak Scaling of F@W

Server Capacity (Wait Time)

Available Workers over Time

Transfer Times

Variability of Computation Time

Example Execution Timeline

Performance Model

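$$N_{wu} = \frac{\langle t_{exe} \rangle + \langle t_{W,wait} \rangle}{\langle t_{new} \rangle + \langle t_{trans} \rangle + \langle t_{M,wait} \rangle}$$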
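A small sketch of how the performance model could be evaluated, reading it as a ratio: mean execution time plus mean worker-side wait on top, and mean work-unit creation time, transfer time, and master-side wait on the bottom. The slides do not define the symbols, so these readings of the subscripts are assumptions, the function name is hypothetical, and the timings below are made-up placeholders (only the 7,200 s execution time echoes the ~2 hr benchmark tasks).

```python
def model_n_wu(t_exe, t_w_wait, t_new, t_trans, t_m_wait):
    """Evaluate N_wu = (t_exe + t_W,wait) / (t_new + t_trans + t_M,wait).

    All arguments are mean times in seconds. The interpretation of each
    term is an assumption; the slides give only the symbols.
    """
    worker_side = t_exe + t_w_wait
    master_side = t_new + t_trans + t_m_wait
    if master_side <= 0:
        raise ValueError("master-side time per work unit must be positive")
    return worker_side / master_side


# Placeholder timings, not measured values.
n_wu = model_n_wu(t_exe=7200.0, t_w_wait=30.0, t_new=1.0, t_trans=4.0, t_m_wait=1.0)
print(n_wu)
```

Read this way, shorter work units shrink the numerator while the per-unit cost in the denominator stays fixed, so the value of the ratio drops.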

Weak Scaling (updated)

Wait Times

Tasks Waiting

Identified Areas of Improvement

• Availability of Resources
  – Benchmarks limited by the number of sustained workers available through Condor
  – New feature: the WorkQueue Worker Pool can be used to start new workers
• WorkQueue Limits Number of Workers
  – Increasing the number of file descriptors allowed up to 2,500 workers to connect
  – Bad behavior occurring in calls to select()
  – Working with WorkQueue developers to switch to poll()
• Long-Running Work Units Delay Completion of Trajectories
  – Some work units are not returned or take a very long time
  – Prevents trajectories from finishing
  – Use the fast abort feature to re-assign work units that take longer than a specified time
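The idea behind re-assigning long-running work units can be sketched as follows. This is a hand-rolled illustration of the policy, not the WorkQueue fast abort API: the function, the data structures, and the 3 hr limit are all hypothetical.

```python
from collections import deque


def reassign_overdue(running, pending, now, time_limit):
    """Move work units that have run longer than time_limit back to pending.

    running: dict mapping work-unit id -> start time in seconds
    pending: deque of work-unit ids waiting for a worker
    Returns the ids that were re-queued, in their original order.
    """
    overdue = [wu for wu, started in running.items() if now - started > time_limit]
    for wu in overdue:
        del running[wu]      # give up on the slow or lost copy
        pending.append(wu)   # hand the work unit to the next free worker
    return overdue


# Hypothetical state: three work units started at different times (seconds).
running = {"wu-1": 0.0, "wu-2": 5000.0, "wu-3": 100.0}
pending = deque()
requeued = reassign_overdue(running, pending, now=11000.0, time_limit=3 * 3600.0)
print(requeued, list(pending), sorted(running))
```

A limit somewhat above the typical execution time (here 3 hr against ~2 hr tasks) avoids aborting work units that are merely a little slow.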

Conclusion

• Accomplished
  – Identified key metrics (ns / day, wait time)
  – Developed scaling model
  – Tested model
• Conclusions
  – Real scientific applications scale well
  – Forcing short work units adds load to the Master
  – Performance model validated
  – “Self-correcting” behavior
