FAST-OS: Petascale Single System Image
Presented by
FAST-OS: Petascale Single System Image
Jeffrey S. Vetter (PI) Computer Science and Mathematics
Computing & Computational Sciences Directorate
Team Members
Nikhil Bhatia, Collin McCurdy,
Hong Ong, Phil Roth, Weikuan Yu
Petascale single system image

What?
Make a collection of standard hardware look like one big machine, in as many ways as feasible

Why?
Provide a single solution to address all forms of clustering
Simultaneously address availability, scalability, manageability, and usability

Components
Virtual clusters using contemporary hypervisor technology
Performance and scalability of a parallel shared root file system
Paging behavior of applications, i.e., reducing TLB misses
Reducing “noise” from operating systems at scale
Virtual clusters using contemporary hypervisor technology

Goal: A virtual cluster of domUs running across multiple dom0s
Enable preemptive migration of OS images for reliability, maintenance, and load balancing
Provide virtual login node, virtual service node, and virtual compute node images (x86_64, FC5-based); initial experimentation on two x86_64 RHEL4 hosts
Collect per-node/per-domain high-level information about host and guest domains
Use a copy-on-write scheme at the NFS server: unionfs-based root file system trees, one per virtual cluster node
Investigate OProfile for active profiling; performance data should help drive migration decisions
Investigate Xentop and the LibVirt API for scalable management of domUs (see the sketch below)
Develop a management console for manual and algorithmic control
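As a concrete illustration of the LibVirt management path, here is a minimal Python sketch that inventories running domUs on one dom0 and live-migrates one of them to another. The connection URIs and the domain name vcn01 are hypothetical placeholders, not the project's actual configuration:

```python
import libvirt  # the LibVirt Python bindings named above

SRC_URI = "xen://host-a/"   # hypothetical dom0 hostnames
DST_URI = "xen://host-b/"

src = libvirt.open(SRC_URI)
dst = libvirt.open(DST_URI)

# Inventory the running guest domains on the source host (ID 0 is dom0).
for dom_id in src.listDomainsID():
    if dom_id == 0:
        continue
    dom = src.lookupByID(dom_id)
    state, max_mem, mem, vcpus, cpu_time = dom.info()
    print(f"{dom.name()}: {vcpus} vcpus, {mem} KiB, state {state}")

# Live-migrate one virtual compute node image to the other dom0.
dom = src.lookupByName("vcn01")   # hypothetical domU name
dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)
```

The same per-domain information feeds the management console, whether a human or an algorithm makes the migration decision.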
High-speed migration of virtual clusters with InfiniBand (IB)

Goal: Reduce the impact of virtualization on high-performance computing application performance
High bandwidth, low latency
Effective migration of OS images

Phase 1: Xen virtualization with IB
IBM and OSU have a prototype version, functioning over IA32; it supports only IB verbs and IPoIB
Need to port it to x86_64 and add support for other IB components

Phase 2: Xen migration over IB, prototype under development
Migrate IB-capable Xen domains over TCP: snapshot and close IB connections before migration, then reopen and reconnect after migration
Address issues in handling user-space IB information: protection keys; IB contexts, queue pair numbers, and LIDs; static domains must be coordinated, too (see the sketch below)

Phase 3: Live migration of IB-capable Xen domains over IB
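To make the Phase 2 ordering concrete, here is an illustrative Python sketch of the snapshot/close/migrate/reopen/reconnect sequence. The IBDomain class and every method on it are hypothetical stand-ins, not a real Xen or IB verbs API:

```python
# Illustrative walk-through of the Phase 2 ordering; all classes and
# methods here are hypothetical stand-ins, not a real Xen or IB API.

class IBDomain:
    """A Xen domU with open InfiniBand connections."""

    def __init__(self, name):
        self.name = name

    def snapshot_ib_state(self):
        # Protection keys, queue pair numbers, and the port LID are local
        # to the current HCA and do not survive the move (placeholders).
        return {"pkeys": [0xFFFF], "qpns": [42], "lid": 7}

    def close_ib(self):
        print(f"{self.name}: draining and closing queue pairs")

    def migrate_over_tcp(self, dest):
        print(f"{self.name}: migrating over TCP to {dest}")

    def reopen_ib(self):
        # Queue pair numbers and the LID are reassigned on the new HCA.
        return {"pkeys": [0xFFFF], "qpns": [99], "lid": 12}


def migrate_ib_domain(dom, dest, peers):
    old = dom.snapshot_ib_state()     # 1. snapshot user-space IB info
    dom.close_ib()                    # 2. close IB before migration
    dom.migrate_over_tcp(dest)        # 3. move the now IB-free domain
    new = dom.reopen_ib()             # 4. reopen on the destination
    for peer in peers:                # 5. peers, including static domains,
        peer.on_peer_moved(old, new)  #    must swap in the new QPNs/LID
```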
Paging behavior of DOE applications
Trace-based memory subsystem simulator
Performs dynamic tracing using Intel's Pin
For each data memory reference, the simulator executes code that looks up the data in caches/TLB/PTEs and estimates the latency of misses
Produces an estimate of cycles spent in the memory subsystem (a toy version is sketched below)
Advantage of simulation: allows evaluation of nonexistent configurations
What if paging were turned off?
What if the Opteron used the Xeon's smaller-L1-cache-for-larger-TLB tradeoff?
Potential weakness: inaccuracy
Sequential (simulator) vs. parallel memory accesses
Intel (simulator) vs. Portland Group (XT3) compiler
Different runtime libraries (MPI)

Goal: Understand the paging behavior of applications in large-scale systems
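A toy version of the estimator, to show the shape of the technique: each traced data address is looked up in a TLB model and an L1 model, and an assumed miss latency is charged. The sizes and latencies here are illustrative placeholders, not the simulator's actual parameters:

```python
# Toy trace-driven estimator; sizes and latencies are assumed
# placeholders, not the real simulator's parameters.

PAGE = 4096          # small-page size (bytes)
LINE = 64            # cache line size (bytes)
TLB_ENTRIES = 64     # direct-mapped TLB, for simplicity
L1_SETS = 512        # direct-mapped L1, for simplicity

TLB_MISS_CYCLES = 30   # assumed page-table-walk cost
L1_MISS_CYCLES = 100   # assumed memory-access cost
L1_HIT_CYCLES = 3      # assumed L1 hit cost

def simulate(trace):
    """Estimate memory-subsystem cycles for an iterable of addresses."""
    tlb = {}     # TLB slot  -> resident virtual page number
    l1 = {}      # cache set -> resident line number
    cycles = 0
    for addr in trace:
        vpn = addr // PAGE
        if tlb.get(vpn % TLB_ENTRIES) != vpn:    # TLB miss: walk PTEs
            tlb[vpn % TLB_ENTRIES] = vpn
            cycles += TLB_MISS_CYCLES
        line = addr // LINE
        if l1.get(line % L1_SETS) != line:       # L1 miss: go to memory
            l1[line % L1_SETS] = line
            cycles += L1_MISS_CYCLES
        else:
            cycles += L1_HIT_CYCLES
    return cycles

# "What if paging were turned off?" -> set TLB_MISS_CYCLES to 0;
# large pages -> raise PAGE to the large-page size.
print(simulate(range(0, 1 << 20, 8)))   # unit-stride 8-byte accesses
```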
Applications and methodology
Applications
HPC Challenge (HPCC) benchmarks
Sequential: STREAM, DGEMM, FFT, RandomAccess
Parallel: MPI_RandomAccess, HPL, FFT, PTRANS
Climate: POP, CAM, HYCOM
Fusion: GYRO, GTC
Molecular dynamics: AMBER, LAMMPS

Methodology
16-processor version of each benchmark
Unmeasured: initialization code and cache warm-up; measured: three time steps
Eight variants, with results normalized to HW-large (see the sketch below):
HW: large and small pages
Simulated Opteron: large, small, and no pages
Simulated Opteron/Xeon hybrid: large, small, and no pages
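The reporting step is simple enough to show directly: each variant's cycle count is divided by the HW-large baseline. The numbers below are placeholders, not measured results:

```python
# Normalize each variant's (measured or simulated) memory-subsystem
# cycle count to the HW-large baseline; values are placeholders.
variants = {
    "HW-large": 1.00e9,             "HW-small": 1.12e9,
    "sim-Opteron-large": 1.03e9,    "sim-Opteron-small": 1.15e9,
    "sim-Opteron-nopaging": 0.97e9,
    "sim-hybrid-large": 1.05e9,     "sim-hybrid-small": 1.20e9,
    "sim-hybrid-nopaging": 0.99e9,
}
base = variants["HW-large"]
for name, cycles in variants.items():
    print(f"{name:22s} {cycles / base:.2f}x of HW-large")
```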
Example results: HYCOM
Initial simulation results match measurement and illustrate differences in page configurations
Parallel root file system
Goals of study
Use a parallel file system to implement a shared root environment
Evaluate the performance of a parallel root file system
Understand root I/O access patterns and potential scaling limits

Current status
RootFS implemented using NFSv4, PVFS2, and Lustre 1.4.6
RootFS distributed using a ramdisk implementation
Modified the mkinitrd program locally (PVFS2 and Lustre)
Modified init scripts to mount the root at boot time (PVFS2 and Lustre)
Configured compute nodes to run a small Linux kernel
Released two Lustre 1.4.6.4 patches for kernel 2.6.15
File system performance measured using IOR, b_eff_io, and NPB I/O (an example IOR run is sketched below)
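For reference, a minimal driver for the kind of IOR run used here; the mount point, process count, and sizes are illustrative assumptions, not the study's actual configuration:

```python
# Hypothetical IOR driver; mount point, process count, and sizes are
# illustrative assumptions, not the study's configuration.
import subprocess

MOUNTPOINT = "/mnt/rootfs"        # parallel root file system under test

cmd = [
    "mpirun", "-np", "16",        # one IOR process per compute node
    "ior",
    "-a", "POSIX",                # POSIX I/O through the mounted root FS
    "-b", "64m",                  # per-process block size
    "-t", "1m",                   # transfer size per I/O call
    "-w", "-r",                   # measure write then read throughput
    "-o", f"{MOUNTPOINT}/ior.dat",  # shared test file on the root FS
]
subprocess.run(cmd, check=True)   # IOR reports aggregate throughput
```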
Example results: IOR read/write throughput
Contacts
Jeffrey S. Vetter
Principal Investigator
Computer Science and Mathematics
Computing & Computational Sciences Directorate
(865) [email protected]
Hong Ong
(865) [email protected]

Nikhil Bhatia
(865) [email protected]

Collin McCurdy
(865) [email protected]

Phil Roth
(865) [email protected]

Weikuan Yu
(865) [email protected]