FAST-OS: Petascale Single System Image
Presented by
FAST-OS: Petascale Single System Image
Jeffrey S. Vetter (PI) Computer Science and Mathematics
Computing & Computational Sciences Directorate
Team Members
Nikhil Bhatia, Collin McCurdy,
Hong Ong, Phil Roth, Weikuan Yu
Petascale single system image

What?
Make a collection of standard hardware look like one big machine, in as many ways as feasible

Why?
Provide a single solution to address all forms of clustering
Simultaneously address availability, scalability, manageability, and usability

Components
Virtual clusters using contemporary hypervisor technology
Performance and scalability of a parallel shared root file system
Paging behavior of applications, i.e., reducing TLB misses
Reducing “noise” from operating systems at scale
Virtual clusters using contemporary hypervisor technology

Goal: A virtual cluster of domUs running across multiple dom0s
Enable preemptive migration of OS images for reliability, maintenance, and load balancing
Provide virtual login node, virtual service node, and virtual compute node images (x86_64, FC5-based); initial experimentation on two x86_64 RHEL4 hosts
Collect per-node/per-domain high-level information about host and guest domains
Use a copy-on-write scheme at the NFS server: unionfs-based root file system trees, one per virtual cluster node
Investigate OProfile for active profiling; performance data should help drive migration decisions
Investigate Xentop and the LibVirt API for scalable management of domUs (see the sketch below)
Develop a management console for manual and algorithmic control
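As a concrete illustration of the LibVirt management path, here is a minimal Python sketch that inventories running domUs on one dom0 and live-migrates one of them to another. The connection URIs and the domain name vcn01 are hypothetical placeholders, not the project's actual configuration:

```python
import libvirt  # the LibVirt Python bindings named above

SRC_URI = "xen://host-a/"   # hypothetical dom0 hostnames
DST_URI = "xen://host-b/"

src = libvirt.open(SRC_URI)
dst = libvirt.open(DST_URI)

# Inventory the running guest domains on the source host (ID 0 is dom0).
for dom_id in src.listDomainsID():
    if dom_id == 0:
        continue
    dom = src.lookupByID(dom_id)
    state, max_mem, mem, vcpus, cpu_time = dom.info()
    print(f"{dom.name()}: {vcpus} vcpus, {mem} KiB, state {state}")

# Live-migrate one virtual compute node image to the other dom0.
dom = src.lookupByName("vcn01")   # hypothetical domU name
dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)
```

The same per-domain information feeds the management console, whether a human or an algorithm makes the migration decision.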
High-speed migration of virtual clusters with InfiniBand (IB)

Goal: Reduce the impact of virtualization on high-performance computing application performance
High bandwidth, low latency
Effective migration of OS images

Phase 1: Xen virtualization with IB
IBM and OSU have a prototype version, functioning over IA32; it supports only IB verbs and IPoIB
Need to port it to x86_64 and add support for other IB components

Phase 2: Xen migration over IB, prototype under development
Migrate IB-capable Xen domains over TCP: snapshot and close IB connections before migration, then reopen and reconnect after migration
Address issues in handling user-space IB information: protection keys; IB contexts, queue pair numbers, and LIDs; static domains must be coordinated, too (see the sketch below)

Phase 3: Live migration of IB-capable Xen domains over IB
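To make the Phase 2 ordering concrete, here is an illustrative Python sketch of the snapshot/close/migrate/reopen/reconnect sequence. The IBDomain class and every method on it are hypothetical stand-ins, not a real Xen or IB verbs API:

```python
# Illustrative walk-through of the Phase 2 ordering; all classes and
# methods here are hypothetical stand-ins, not a real Xen or IB API.

class IBDomain:
    """A Xen domU with open InfiniBand connections."""

    def __init__(self, name):
        self.name = name

    def snapshot_ib_state(self):
        # Protection keys, queue pair numbers, and the port LID are local
        # to the current HCA and do not survive the move (placeholders).
        return {"pkeys": [0xFFFF], "qpns": [42], "lid": 7}

    def close_ib(self):
        print(f"{self.name}: draining and closing queue pairs")

    def migrate_over_tcp(self, dest):
        print(f"{self.name}: migrating over TCP to {dest}")

    def reopen_ib(self):
        # Queue pair numbers and the LID are reassigned on the new HCA.
        return {"pkeys": [0xFFFF], "qpns": [99], "lid": 12}


def migrate_ib_domain(dom, dest, peers):
    old = dom.snapshot_ib_state()     # 1. snapshot user-space IB info
    dom.close_ib()                    # 2. close IB before migration
    dom.migrate_over_tcp(dest)        # 3. move the now IB-free domain
    new = dom.reopen_ib()             # 4. reopen on the destination
    for peer in peers:                # 5. peers, including static domains,
        peer.on_peer_moved(old, new)  #    must swap in the new QPNs/LID
```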
Paging behavior of DOE applications
Trace-based memory subsystem simulator
Performs dynamic tracing using Intel's Pin
For each data memory reference, the simulator executes code that looks up the data in caches/TLB/PTEs and estimates the latency of misses
Produces an estimate of cycles spent in the memory subsystem (a toy version is sketched below)
Advantage of simulation: allows evaluation of nonexistent configurations
What if paging were turned off?
What if the Opteron used the Xeon's smaller-L1-cache-for-larger-TLB tradeoff?
Potential weakness: inaccuracy
Sequential (simulator) vs. parallel memory accesses
Intel (simulator) vs. Portland Group (XT3) compiler
Different runtime libraries (MPI)

Goal: Understand the paging behavior of applications in large-scale systems
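A toy version of the estimator, to show the shape of the technique: each traced data address is looked up in a TLB model and an L1 model, and an assumed miss latency is charged. The sizes and latencies here are illustrative placeholders, not the simulator's actual parameters:

```python
# Toy trace-driven estimator; sizes and latencies are assumed
# placeholders, not the real simulator's parameters.

PAGE = 4096          # small-page size (bytes)
LINE = 64            # cache line size (bytes)
TLB_ENTRIES = 64     # direct-mapped TLB, for simplicity
L1_SETS = 512        # direct-mapped L1, for simplicity

TLB_MISS_CYCLES = 30   # assumed page-table-walk cost
L1_MISS_CYCLES = 100   # assumed memory-access cost
L1_HIT_CYCLES = 3      # assumed L1 hit cost

def simulate(trace):
    """Estimate memory-subsystem cycles for an iterable of addresses."""
    tlb = {}     # TLB slot  -> resident virtual page number
    l1 = {}      # cache set -> resident line number
    cycles = 0
    for addr in trace:
        vpn = addr // PAGE
        if tlb.get(vpn % TLB_ENTRIES) != vpn:    # TLB miss: walk PTEs
            tlb[vpn % TLB_ENTRIES] = vpn
            cycles += TLB_MISS_CYCLES
        line = addr // LINE
        if l1.get(line % L1_SETS) != line:       # L1 miss: go to memory
            l1[line % L1_SETS] = line
            cycles += L1_MISS_CYCLES
        else:
            cycles += L1_HIT_CYCLES
    return cycles

# "What if paging were turned off?" -> set TLB_MISS_CYCLES to 0;
# large pages -> raise PAGE to the large-page size.
print(simulate(range(0, 1 << 20, 8)))   # unit-stride 8-byte accesses
```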
Applications and methodology
Applications
HPC Challenge (HPCC) benchmarks
Sequential: STREAM, DGEMM, FFT, RandomAccess
Parallel: MPI_RandomAccess, HPL, FFT, PTRANS
Climate: POP, CAM, HYCOM
Fusion: GYRO, GTC
Molecular dynamics: AMBER, LAMMPS

Methodology
16-processor version of each benchmark
Unmeasured: initialization code and cache warm-up; measured: three time steps
Eight variants, with results normalized to HW-large (see the sketch below):
HW: large and small pages
Simulated Opteron: large, small, and no pages
Simulated Opteron/Xeon hybrid: large, small, and no pages
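The reporting step is simple enough to show directly: each variant's cycle count is divided by the HW-large baseline. The numbers below are placeholders, not measured results:

```python
# Normalize each variant's (measured or simulated) memory-subsystem
# cycle count to the HW-large baseline; values are placeholders.
variants = {
    "HW-large": 1.00e9,             "HW-small": 1.12e9,
    "sim-Opteron-large": 1.03e9,    "sim-Opteron-small": 1.15e9,
    "sim-Opteron-nopaging": 0.97e9,
    "sim-hybrid-large": 1.05e9,     "sim-hybrid-small": 1.20e9,
    "sim-hybrid-nopaging": 0.99e9,
}
base = variants["HW-large"]
for name, cycles in variants.items():
    print(f"{name:22s} {cycles / base:.2f}x of HW-large")
```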
Example results: HYCOM
Initial simulation results match measurement and illustrate differences in page configurations
Parallel root file system
Goals of study
Use a parallel file system to implement a shared root environment
Evaluate the performance of a parallel root file system
Understand root I/O access patterns and potential scaling limits

Current status
RootFS implemented using NFSv4, PVFS2, and Lustre 1.4.6
RootFS distributed using a ramdisk implementation
Modified the mkinitrd program locally (PVFS2 and Lustre)
Modified init scripts to mount the root at boot time (PVFS2 and Lustre)
Configured compute nodes to run a small Linux kernel
Released two Lustre 1.4.6.4 patches for kernel 2.6.15
File system performance measured using IOR, b_eff_io, and NPB I/O (an example IOR run is sketched below)
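For reference, a minimal driver for the kind of IOR run used here; the mount point, process count, and sizes are illustrative assumptions, not the study's actual configuration:

```python
# Hypothetical IOR driver; mount point, process count, and sizes are
# illustrative assumptions, not the study's configuration.
import subprocess

MOUNTPOINT = "/mnt/rootfs"        # parallel root file system under test

cmd = [
    "mpirun", "-np", "16",        # one IOR process per compute node
    "ior",
    "-a", "POSIX",                # POSIX I/O through the mounted root FS
    "-b", "64m",                  # per-process block size
    "-t", "1m",                   # transfer size per I/O call
    "-w", "-r",                   # measure write then read throughput
    "-o", f"{MOUNTPOINT}/ior.dat",  # shared test file on the root FS
]
subprocess.run(cmd, check=True)   # IOR reports aggregate throughput
```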
Example results: IOR read/write throughput
Contacts
Jeffrey S. Vetter
Principal Investigator
Computer Science and Mathematics
Computing & Computational Sciences Directorate
(865) [email protected]
Hong Ong
(865) [email protected]

Nikhil Bhatia
(865) [email protected]

Collin McCurdy
(865) [email protected]

Phil Roth
(865) [email protected]

Weikuan Yu
(865) [email protected]