TRANSCRIPT
December 12, 2009
Robust Asynchronous Optimization for Volunteer Computing Grids
Department of Computer Science; Department of Physics, Applied Physics and Astronomy
Rensselaer Polytechnic Institute
e-Science 2009, December 12, Oxford, UK
Travis Desell, Malik Magdon-Ismail, Boleslaw Szymanski, Carlos Varela,
Heidi Newberg, Nathan Cole
Overview
Introduction
• Motivation
• Driving Scientific Application

Asynchronous Genetic Search
• Why asynchronous?
• Methodology
• Recombination
• Particle Swarm Optimization

Generic Optimization Framework
• Approach
• Architecture

Results
• Convergence Rates
• Re-computation rates

Conclusions & Future Work

Questions?
Motivation
Scientists need easily accessible distributed optimization tools.

Distribution is essential for scientific computing:
• Scientific models are becoming increasingly complex
• Rates of data acquisition are far exceeding increases in computing power

Traditional optimization strategies are not well suited to large-scale computing:
• They lack scalability and fault tolerance
Astro-Informatics

What is the structure and origin of the Milky Way galaxy?

• Observing from inside the Milky Way provides 3D data: the SLOAN Digital Sky Survey has collected over 10 TB of data.
• We can determine its structure, which is not possible for other galaxies.
• Very expensive: evaluating a single model of the Milky Way with a single set of parameters can take hours or days on a typical high-end computer.
• Models determine where different star streams are in the Milky Way, which helps us better understand its structure and how it was formed.
Generic Optimization Framework

Separation of Concerns:
• Distributed Computing
• Optimization
• Scientific Modeling

“Plug-and-Play”:
• Simple & generic interfaces
Two Distribution Strategies
Asynchronous evaluations
• Results may not be reported, or reported late
• No processor dependencies
  o Faults can be ignored
• Environments: Grids & Internet

Single parallel evaluation
• Always uses the most evolved population
• Can use traditional methods
  o Faults require recalculation
  o Grids require load balancing
• Environments: Supercomputers & Grids
Asynchronous Architecture
[Architecture diagram: Search Routines (evolutionary methods — genetic search, particle swarm optimisation, …) exchange initial and optimised parameters with a Distributed Evaluation Framework, which creates and dispatches work to Evaluators (1…N) over BOINC (Internet) or SALSA/Java (RPI Grid). Scientific Models supply data initialisation, the integral function and composition, and the likelihood function and composition; evaluators request work and report results asynchronously.]
GMLE Architecture (Parallel-Asynchronous)
[Architecture diagram: search routines distribute parameters to Workers (1…Z); each worker distributes parameters to its Evaluators (1…N, 1…M) via MPI and combines their results before reporting back. The communication layer uses HTTP for BOINC, TCP/IP for the grid, and MPI for supercomputers.]
Issues With Traditional Optimization
Traditional global optimization techniques are evolutionary but iterative, with each step dependent on previous steps:
• The current population is used to generate the next population

Dependencies and iterations limit scalability and impact performance:
• With volatile hosts, what if an individual in the next generation is lost?
• Redundancy is expensive
• Scalability is limited by population size
Asynchronous Optimization Strategy
Use an asynchronous methodology:
• No dependencies on unknown results
• No iterations

Continuously updated population:
• N individuals are generated randomly for the initial population
• Work requests are fulfilled by applying recombination operators to the population
• The population is updated with reported results
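This strategy can be sketched as a simple server loop. This is a minimal illustration, not the MilkyWay@Home implementation: the class and parameter names, the population size, and the use of averaging as the only recombination operator are all assumptions.

```python
import random

POP_SIZE = 50          # N individuals in the population (illustrative)
NUM_PARAMS = 8         # dimensionality of the search space (illustrative)

def random_individual():
    return [random.uniform(-10.0, 10.0) for _ in range(NUM_PARAMS)]

def average(p1, p2):
    """Simple recombination: child is the average of two parents."""
    return [(a + b) / 2.0 for a, b in zip(p1, p2)]

class AsynchronousSearch:
    def __init__(self):
        # population: list of (fitness, parameters); fitness is None until evaluated
        self.population = [(None, random_individual()) for _ in range(POP_SIZE)]

    def _key(self, i):
        # unevaluated slots sort first, then lowest fitness
        f = self.population[i][0]
        return (f is not None, f if f is not None else 0.0)

    def fulfill_work_request(self):
        """Generate an unevaluated individual on demand -- no iteration barrier."""
        p1, p2 = random.sample(self.population, 2)
        return average(p1[1], p2[1])

    def report_result(self, fitness, params):
        """Fold a reported result into the population if it beats the worst
        member. Late or missing reports simply never call this; nothing
        blocks waiting for them."""
        worst = min(range(POP_SIZE), key=self._key)
        w_fit = self.population[worst][0]
        if w_fit is None or fitness > w_fit:
            self.population[worst] = (fitness, params)
```

Because work generation and result reporting are decoupled, any number of volatile hosts can request and report at their own pace.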
Asynchronous Search Strategy
[Diagram: the population holds n parameter sets with their fitnesses. Members generated from the population fill a work queue of m unevaluated parameter sets; work is generated when the queue is low. Workers request work, evaluate parameter sets, and report results back to update the population.]
Asynchronous Genetic Search Operators (1)
Average
• Simple operator for continuous problems
• Generated parameters are the average of two randomly selected parents

Mutation
• Takes a parent and generates a mutation by randomly selecting a parameter and mutating it
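Both operators are a few lines each for real-valued parameter vectors. This is an illustrative sketch; the mutation range bounds are assumptions, not values from the talk.

```python
import random

def average_op(parent1, parent2):
    """Child parameters are the average of two randomly selected parents."""
    return [(a + b) / 2.0 for a, b in zip(parent1, parent2)]

def mutation_op(parent, low=-10.0, high=10.0):
    """Copy the parent, then randomly pick one parameter and mutate it
    (here: redrawn uniformly within assumed bounds)."""
    child = list(parent)
    i = random.randrange(len(child))
    child[i] = random.uniform(low, high)
    return child
```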
Asynchronous Genetic Search Operators (2)
Double Shot: two parents generate three children
• The average of the parents
• A point outside the less fit parent, equidistant from that parent and the average
• A point outside the more fit parent, equidistant from that parent and the average
Asynchronous Genetic Search Operators (3)
Probabilistic Simplex
• N parents generate one or more children
• Points are placed randomly along the line through the worst parent and the centroid (average) of the remaining parents
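A sketch of this operator follows. The range of the random line parameter (here −1.5 to 1.5) is an illustrative assumption; the talk does not specify the limits used.

```python
import random

def probabilistic_simplex(parents, fitnesses, num_children=1):
    """parents: list of parameter vectors; higher fitness is better.
    Children are placed randomly along the line through the worst parent
    and the centroid of the remaining parents."""
    worst_i = min(range(len(parents)), key=lambda i: fitnesses[i])
    worst = parents[worst_i]
    rest = [p for i, p in enumerate(parents) if i != worst_i]
    # centroid (average) of the remaining parents
    centroid = [sum(vals) / len(rest) for vals in zip(*rest)]
    children = []
    for _ in range(num_children):
        t = random.uniform(-1.5, 1.5)  # random position along the line (assumed range)
        children.append([w + t * (c - w) for w, c in zip(worst, centroid)])
    return children
```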
Particle Swarm Optimization
Particles ‘fly’ around the search space. They move according to their previous velocity and are pulled towards the globally best found position and their own locally best found position.

Analogies:
• cognitive intelligence (local best knowledge)
• social intelligence (global best knowledge)
Particle Swarm Optimization
PSO update equations:

vi(t+1) = w * vi(t) + c1 * r1 * (li − pi(t)) + c2 * r2 * (g − pi(t))
pi(t+1) = pi(t) + vi(t+1)

w, c1, c2 = constants
r1, r2 = random floats between 0 and 1
vi(t) = velocity of particle i at iteration t
pi(t) = position of particle i at iteration t
li = best position found by particle i
g = global best position found by all particles
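These update equations translate directly into code, applied per dimension. The constant values below are illustrative defaults, not the ones used in the experiments.

```python
import random

W, C1, C2 = 0.8, 2.0, 2.0   # inertia weight and pull constants (assumed values)

def pso_update(position, velocity, local_best, global_best):
    """Apply the PSO update equations once; returns (new_velocity, new_position)."""
    new_velocity, new_position = [], []
    for p, v, l, g in zip(position, velocity, local_best, global_best):
        r1, r2 = random.random(), random.random()
        # v(t+1) = w*v(t) + c1*r1*(l - p(t)) + c2*r2*(g - p(t))
        nv = W * v + C1 * r1 * (l - p) + C2 * r2 * (g - p)
        new_velocity.append(nv)
        # p(t+1) = p(t) + v(t+1)
        new_position.append(p + nv)
    return new_velocity, new_position
```

When the particle already sits at both its local and the global best, the pull terms vanish and only the inertia-scaled velocity remains.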
Asynchronous PSO
Generating new positions does not necessarily require the fitness of the previous position:

1. Generate new particle or individual positions to fill the work queue
2. Update local and global best on results

PSO: if a result improves a particle’s local best, update the local best and set the particle’s position and velocity to those of the result.
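A sketch of this asynchronous update rule, assuming a simple dictionary representation for particle state (the field names are illustrative):

```python
def update_on_result(particle, result_fitness, result_position, result_velocity,
                     global_best):
    """particle: dict with 'best_fitness', 'best', 'position', 'velocity'.
    global_best: dict with 'fitness' and 'position'. Higher fitness is better."""
    if particle['best_fitness'] is None or result_fitness > particle['best_fitness']:
        # result improves the particle's local best: adopt the result's state
        particle['best_fitness'] = result_fitness
        particle['best'] = result_position
        particle['position'] = result_position
        particle['velocity'] = result_velocity
    if result_fitness > global_best['fitness']:
        global_best['fitness'] = result_fitness
        global_best['position'] = result_position
```

Results that do not improve the particle's local best are simply discarded, so slow or missing reports never stall the search.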
Particle Swarm Optimization (Example)
[Diagram: a particle at pi(t), with previous position pi(t−1) and velocity vi(t); its possible new positions lie in the region spanned by w * vi(t), the pull c1 * (li − pi(t)) toward its local best, and the pull c2 * (g − pi(t)) toward the global best.]
Particle Swarm Optimization (Example)
[Diagram: the particle moves to a new position that becomes both its new local best and the new global best; with the local-best pull gone, w * vi(t) and c2 * (g − pi(t)) determine the possible new positions.]
Particle Swarm Optimization (Example)
[Diagram: another particle finds the global best position; the first particle’s possible new positions are now determined by w * vi(t), the pull c1 * (li − pi(t)) toward its own local best, and the pull c2 * (g − pi(t)) toward the new global best.]
Asynchronous PSO

[Diagram: the population holds n individuals with their fitnesses. The individual to generate from is selected in round-robin manner; unevaluated individuals are generated when the work queue is low and sent to workers for fitness evaluation. Workers report results, which update the population; the local and global best are updated if a new individual has better fitness.]
Computing Environment: MilkyWay@Home

http://milkyway.cs.rpi.edu

BOINC (as used by Einstein@Home, SETI@home, etc.)

Over 50,000 users, 80,000 CPUs, and 600 teams from 99 countries; the second largest BOINC computation (among hundreds), at about 500 teraflops.

Donate your idle computer time to help perform our calculations.
Computing Environments - BOINC
• MilkyWay@Home: http://milkyway.cs.rpi.edu/
• Multiple Asynchronous Workers
  • Approximately 10,000 – 30,000 volunteered computers engaged at a time
  • Asynchronous architecture used

• Asynchronous Evaluation
  • Volunteered computers can queue up to 20 pending individuals
  • Population updated when results reported
  • Individuals may be reported slowly or not at all
User Participation
Users do more than volunteer computing resources (Citizen Science):
• Open-source code gives users access to the MilkyWay@Home application
• Users have submitted many bug reports, fixes, and performance enhancements
• A user even created an ATI GPU-capable version of the MilkyWay@Home application
• Forums provide opportunities for users to learn about astronomy and computer science
Malicious/Incorrect Result Verification
With open-source application code, users can compile their own compiler-optimized versions, and many do. However, there is also the possibility of users returning malicious results.

BOINC traditionally uses redundancy on every result to verify correctness. This requires at least two results for every work unit!

Asynchronous search doesn't require all work units to be verified, only those which improve the population. We reduce redundancy by comparing a result against the current partial results.
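The idea above can be sketched as a small decision routine: only results that could improve the population trigger redundant re-evaluation. The class, the comparison tolerance, and the accept/verify/discard protocol are illustrative assumptions, not the actual BOINC validator.

```python
class SelectiveVerifier:
    """Verify only results that would improve the population, instead of
    running redundant computation on every work unit."""

    def __init__(self, population_worst_fitness):
        self.worst = population_worst_fitness
        self.pending_verification = []   # results awaiting a second evaluation

    def on_result(self, fitness, params):
        """Return 'discard' or 'verify' for a freshly reported result."""
        if fitness <= self.worst:
            return 'discard'   # cannot improve the population: no redundancy needed
        self.pending_verification.append((fitness, params))
        return 'verify'        # re-issue the work unit to another host

    def on_verification(self, fitness, params, second_fitness, tolerance=1e-6):
        """Accept only if an independent evaluation agrees within a tolerance."""
        if abs(fitness - second_fitness) <= tolerance:
            return 'accept'
        return 'discard'       # mismatch: likely a faulty or malicious host
```

Since most reported results do not improve the population, this verifies only a small fraction of work units rather than doubling every computation.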
Limiting Redundancy (Genetic Search)
• 60% verification found the best solutions
• Increased verification reduces reliability
• Reliability and convergence by number of parents seem dependent on the verification rate
Limiting Redundancy (PSO)
• 30% verification found the best solutions
• Increased verification reduces reliability, though not as dramatically as for AGS
• Lower inertia weights give better results
Optimization Method Comparison
• APSO found better solutions than AGS
• APSO needed lower verification rates and was less affected by different verification rates
Conclusions
Asynchronous search is effective in large-scale computing environments:
• Fault tolerant without expensive redundancy
• Asynchronous evaluation on heterogeneous environments increases diversity
• BOINC converges almost as fast as the BlueGene, while offering more availability and computational power
• Even computers with slow result report rates are useful

Particle Swarm and Simplex-Genetic Hybrid methods provide significant improvements in convergence.
Future Work
Optimization:
• Use report times to determine how to generate individuals
• Simulate asynchrony for benchmarks
• Automate selection of parameters

Distributed Computing:
• Parallel asynchronous workers
• Handle malicious “volunteers”

Continued Collaboration
http://www.nasa.gov