NSF Site Visit 2-23-2006
HYDRA: Using Windows Desktop Systems in Distributed Parallel Computing
Introduction…
• Windows desktop systems at IUB student labs
  – 2300 systems, 3-year replacement cycle
  – Pentium IV (>= 1.6 GHz), 256/512/1024 MB memory, 10/100 Mbps/GigE, Windows XP
  – More than 1.5 TF
Possibly Utilize Idle Cycles?
[Condor usage plot: red = total owner, blue = total idle, green = total Condor]
Problem Description
• Once again... Windows desktop systems at IUB student labs:
– As a scientific resource
– Harvest idle cycles
Constraints
• Systems are dedicated to students running desktop office applications, not parallel scientific computing, making their availability unpredictable and sporadic
• Microsoft Windows environment
• Daily software rebuild (updates)
What could these systems be used for?
• Many small computations and a few small messages
  – Foreman-worker
  – Parameter studies
  – Monte Carlo
• Goal: High Throughput Computing (HTC), not HPC
  – Parallel runs of the aforementioned small computations to make better use of the resource
  – Parallel libraries (MPI, PVM, etc.) have constraints when the availability of resources is ephemeral, i.e., not predictable
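The workloads above are embarrassingly parallel, so a foreman-worker structure suffices. A minimal Python sketch of a Monte Carlo pi estimate (names are ours, and threads stand in for Condor-scheduled lab machines; this is not Hydra code):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def worker_task(seed, samples):
    # One small, independent computation: count random points that
    # land inside the unit quarter-circle.
    rng = random.Random(seed)
    return sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0
               for _ in range(samples))

def estimate_pi(n_tasks=8, samples_per_task=10_000):
    # The "foreman" farms out tasks and gathers one small result
    # message from each worker.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(worker_task, i, samples_per_task)
                   for i in range(n_tasks)]
        hits = sum(f.result() for f in futures)
    return 4.0 * hits / (n_tasks * samples_per_task)
```

Because each task is independent, a worker that disappears mid-run can simply be restarted elsewhere, which is what makes this workload shape a good fit for sporadic desktop cycles.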
Solution
• Simple Message Brokering Library (SMBL)
  – Limited replacement for MPI
  – Both server and client library based on a TCP socket abstraction
  – Porting from MPI is fairly straightforward
• Process and Port Manager (PPM)
• Plus …
  – Condor for job management and file transfer (no checkpointing or parallelism)
  – Web portal for job submission
The Big Picture
We’ll discuss each part in more detail next…
The shaded box indicates components hosted on multiple desktop computers.
SMBL (Server)
• SMBL server maintains a dynamic pool of client process connections
• Worker job manager hides details of ephemeral workers at the application level

SMBL Server Process Table for a 4-CPU parallel session:
SMBL Rank    Condor-Assigned Node
0 (Foreman)  Wrubel Computing Center, sacramento
1            Chemistry Student Lab, computer_14
2            CS Student Lab, computer_8
3            Library, computer_6
SMBL (Server)
• SMBL server maintains a dynamic pool of client process connections
• Worker job manager hides details of ephemeral workers at the application level

SMBL Server Process Table for a 4-CPU parallel session (note: the rank-2 worker has changed since the previous slide):
SMBL Rank    Condor-Assigned Node
0 (Foreman)  Wrubel Computing Center, sacramento
1            Chemistry Student Lab, computer_14
2            Physics Student Lab, computer_11
3            Library, computer_6
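The dynamic pool behind these process tables can be illustrated with a toy sketch. This is not SMBL code; it is a minimal Python broker (our names throughout) that assigns the next free rank to each worker that connects and drops ranks as ephemeral workers disappear:

```python
import socket
import threading

class ToyBrokerServer:
    """Toy stand-in for the SMBL server's dynamic pool: workers
    connect, announce the node they run on, and receive a rank."""

    def __init__(self, host="127.0.0.1", port=0):
        self.listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.listener.bind((host, port))
        self.listener.listen()
        self.port = self.listener.getsockname()[1]
        self.table = {}            # rank -> (node name, connection)
        self.lock = threading.Lock()
        self.next_rank = 0

    def accept_worker(self):
        conn, _ = self.listener.accept()
        node = conn.recv(1024).decode()       # worker announces its node
        with self.lock:
            rank = self.next_rank
            self.next_rank += 1
            self.table[rank] = (node, conn)
        conn.sendall(str(rank).encode())      # reply with the assigned rank
        return rank

    def drop_worker(self, rank):
        # An ephemeral worker went away (e.g. a student sat down at the PC).
        node, conn = self.table.pop(rank)
        conn.close()
```

The point of keeping the table on the server is that the application only ever sees stable ranks, while the machines behind them come and go.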
SMBL (Client)
• Client library implements selected MPI-like calls
  – MPI_Send() → SMBL_Send()
  – MPI_Recv() → SMBL_Recv()
• In charge of message delivery for each parallel process
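Since both server and client are built on a TCP socket abstraction, the heart of such calls is message framing over a byte stream. A hedged Python sketch (lowercase names are ours; the real SMBL_Send/SMBL_Recv signatures may differ) using a length prefix:

```python
import socket
import struct

def smbl_send(sock, payload: bytes) -> None:
    # Frame each message with a 4-byte big-endian length prefix so the
    # receiver can recover message boundaries from the TCP stream.
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def smbl_recv(sock) -> bytes:
    # Read the length header, then exactly that many payload bytes.
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return _recv_exact(sock, length)

def _recv_exact(sock, n: int) -> bytes:
    # TCP may deliver fewer bytes than requested; loop until complete.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf
```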
Process and Port Manager (PPM)
• Starts the SMBL server and application processes on demand
• Assigns a port/host to each parallel session
• Directs workers to their servers
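The port/host bookkeeping described above amounts to a small registry. A toy Python sketch (hostname and port range are illustrative assumptions, not taken from the real PPM):

```python
import itertools

class PortManager:
    """Toy PPM: assigns a (host, port) endpoint to each new parallel
    session and directs workers to the server of their session."""

    def __init__(self, host="hydra.example.edu", base_port=9000):
        self.host = host
        self.ports = itertools.count(base_port)   # next unused port
        self.sessions = {}                        # session id -> (host, port)

    def start_session(self, session_id):
        # Where the SMBL server for this parallel session will listen.
        endpoint = (self.host, next(self.ports))
        self.sessions[session_id] = endpoint
        return endpoint

    def direct_worker(self, session_id):
        # A newly scheduled worker asks which server to connect to.
        return self.sessions[session_id]
```

Keeping one endpoint per session is what lets several independent parallel sessions share the same pool of desktop machines, as the next slide shows.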
PPM (cont’d …)
PPM with two SMBL servers (two parallel sessions)

Parallel Session 1:
SMBL Rank    Condor-Assigned Node
0 (Foreman)  Wrubel Computing Center, sacramento
1            Chemistry Student Lab, computer_14
2            CS Student Lab, computer_8
3            Wells Library, computer_6

Parallel Session 2:
SMBL Rank    Condor-Assigned Node
0 (Foreman)  Wrubel Computing Center, sacramento
1            Wells Library, computer_27
2            Biology Student Lab, computer_4
3            CS Student Lab, computer_2
Once again … the big picture
The shaded box indicates components hosted on multiple desktop computers
Recent Development
• Hydra cluster TeraGrid-enabled! (Nov 2005)
  – Allows TeraGrid users to use the resource
  – Virtual-host-based solution: two different URLs for IU and TeraGrid users
  – TeraGrid users authenticate against PSC’s Kerberos server
System Layout
• PPM, SMBL server, Condor, and web portal run on a Linux server
  – Dual Intel Xeon 3.0 GHz, 4 GB memory, GigE
• A second Linux server runs Samba to serve the BLAST database
Portal
• Creates and submits Condor files; handles data files
• Apache/PHP based
• Kerberos authentication
• URLs:
  – http://hydra.indiana.edu (IU users)
  – http://hydra.iu.teragrid.org (TeraGrid users)
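For illustration, a Condor submit description file of the kind a portal might generate behind the scenes could look roughly like this (every name, path, and the requirements expression is hypothetical, not taken from the Hydra portal):

```
# Illustrative Condor submit description -- names and values are made up.
universe                = vanilla
executable              = worker.exe
arguments               = params_$(Process).txt
transfer_input_files    = params_$(Process).txt
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
requirements            = (OpSys == "WINDOWS")
output                  = out.$(Process)
error                   = err.$(Process)
log                     = session.log
queue 16
```

The `queue 16` line asks Condor to schedule 16 instances, each with its own `$(Process)` number, which matches the many-small-computations workload described earlier.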
Utilization of Idle Cycles
[Condor usage plot: red = total owner, blue = total idle, green = total Condor]
Summary
• Large parallel computing facility created at low cost
  – SMBL: a parallel message-passing library that can deal with ephemeral resources
  – PPM: a port broker that can handle multiple parallel sessions
• SMBL Homepage
  – http://smbl.sourceforge.net (Open Source)
Links and References
• Hydra Portal
  – http://hydra.indiana.edu (IU users)
  – http://hydra.iu.teragrid.org (TeraGrid users)
• SMBL home page: http://smbl.sourceforge.net
• Condor home page: http://www.cs.wisc.edu/condor/
• IU TeraGrid home page: http://iu.teragrid.org
Links and References (cont’d …)
• Parallel fastDNAml: http://www.indiana.edu/~rac/hpc/fastDNAml
• BLAST: http://www.ncbi.nlm.nih.gov/BLAST
• MEME: http://meme.sdsc.edu/meme/intro.html