Group File Operations: A New Idiom for Scalable Tools
![Page 1: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/1.jpg)
Group File Operations: A New Idiom for Scalable Tools
Michael J. Brim
Paradyn Project
Paradyn / Condor Week
Madison, Wisconsin
April 30 – May 3, 2007
![Page 2: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/2.jpg)
©2007 Michael J. Brim Group File Operations: A New Idiom for Scalable Tools 2 of 23
Talk Overview
•Group Process Control and Inspection
•A New Group File Idiom
•TBŌN-FS: Scalable Group File Operations
![Page 3: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/3.jpg)
Research Domain
•HPC Tools & Middleware
  •Middleware: run applications and manage system
  •Tools: diagnose and correct problems
•Large scale systems
  •Tools and middleware are CRUCIAL
  •More resources to manage
  •Many problems appear as scale increases
Tools/middleware that can be used on the largest current systems are scarce
![Page 4: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/4.jpg)
Example Tools & Middleware
•Parallel Application Runtime Environments
  •MPI, PVM, BProc, IBM POE, Sun CRE, Cplant yod
•Parallel Application Monitoring and Steering
  •Paradyn, Open|SpeedShop, MATE
•Distributed Application Debuggers
  •TotalView, DDT, Eclipse PTP, mpigdb
•Resource Monitoring and Management
  •SLURM, PBS, LoadLeveler, LSF, Ganglia
![Page 5: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/5.jpg)
Group Process Control and Inspection
•Modify or examine process state
  •Launch processes and manage stdin/out/err
  •Send job control signals (e.g., STOP, CONT, KILL)
  •Read and write memory, registers
  •Collect asynchronous events (e.g., breakpoints and signals)
  •Read process information files (i.e., Linux /proc)
  •And More!!!
•For groups of 10,000 – 100,000 processes
![Page 6: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/6.jpg)
New Idiom: Group File Operations
•Abstract all operations as file access
  •Natural, Intuitive, Portable
  •/proc
     8th edition UNIX (1985) Plan9 (1992) → 4.4BSD (1994), Solaris 2.6 (1997) Linux
•Global mount of remote files
  •Distributed OS: LOCUS (1983), …, BProc (2002)
  •Remote mount: UNIX United (1987), …, Xcpu (2006)
•Operate on groups of files (processes)
  •How to do so in a scalable manner?
![Page 7: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/7.jpg)
[Figure: a Tool Process on the User Host performs File Operations through a Global Mount, /ClusterX/, which holds one directory per ClusterX compute node (/cn0/…, /cn1/…, /cn2/…, …, /cn99999/…); each compute node exposes its own /proc.]
![Page 8: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/8.jpg)
Group Operations: Current Technology
[Figure: layers crossed by each read: User Level → System Calls, sys_read() → Virtual File System, vfs_read() → File System, fs_read() → /proc]
GroupRead(){ foreach(member) read(fd,…); }
User-level group operations iterate.
Cost ≈ G × ( T + L + R )
  T = User-Kernel Trap
  L = Local Processing
  R = Remote Communication & Processing
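The iterative pattern on this slide can be sketched in plain C with the standard POSIX `read()` call. The function name `group_read_iterative` and its buffer layout (member i's data at offset `i * len`) are assumptions made for illustration; they are not part of the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Iterative group read: one read() system call per member descriptor.
 * Member i's data lands at buf + i * len and its return code in rc[i].
 * Returns the number of members whose read() failed. */
size_t group_read_iterative(const int *fds, size_t group_size,
                            char *buf, size_t len, ssize_t *rc)
{
    assert(fds != NULL && buf != NULL && rc != NULL);
    size_t failures = 0;
    for (size_t i = 0; i < group_size; i++) {
        /* Every iteration pays the trap, local, and remote costs again. */
        rc[i] = read(fds[i], buf + i * len, len);
        if (rc[i] < 0)
            failures++;
    }
    return failures;
}
```

The caller must supply a buffer of at least `group_size * len` bytes. The loop body is where the factor G in the cost expression comes from: each member costs a full system call.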
![Page 9: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/9.jpg)
Scalable Group File Operations
•How to avoid iteration over files?
  •Explicit groups: gopen()
  •One OS interaction for each group operation
•How to provide scalable group operations?
  •Group-Aware File System: TBŌN-FS
![Page 10: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/10.jpg)
Group Operations: Scalable Approach
[Figure: layers crossed by one group read: User Level → System Calls, sys_read() → Virtual File System, vfs_read() → File System, fs_grp_read() → /proc on each group member]
GroupRead(){ read(gfd, …); }
With OS and File System support, group operations can use scalable techniques.
Cost ≈ T + L + ( log(G) × R )
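To get a feel for the two cost expressions, Cost ≈ G × (T + L + R) for iteration and Cost ≈ T + L + (log(G) × R) for the group-aware approach, the sketch below evaluates both. It assumes G is the number of group members and that log(G) is the depth of a tree with a fixed fan-out; the slides state neither, so `fanout` and all function names here are illustrative assumptions.

```c
#include <assert.h>

/* Levels needed for a tree with the given fan-out to reach g leaves,
 * i.e. the ceiling of log base `fanout` of g. */
unsigned tree_depth(unsigned long g, unsigned long fanout)
{
    assert(g >= 1 && fanout >= 2);
    unsigned depth = 0;
    unsigned long reach = 1;
    while (reach < g) {
        depth++;
        if (reach > g / fanout)   /* next level covers g; avoids overflow */
            break;
        reach *= fanout;
    }
    return depth;
}

/* Iterative model: every member pays trap, local, and remote cost. */
double cost_iterative(unsigned long g, double t, double l, double r)
{
    return (double)g * (t + l + r);
}

/* Group-aware model: one trap, one local step, one remote hop per level. */
double cost_tree(unsigned long g, unsigned long fanout,
                 double t, double l, double r)
{
    return t + l + (double)tree_depth(g, fanout) * r;
}
```

With made-up unit costs T = 1, L = 1, R = 8 and G = 100,000 members, the iterative model gives 1,000,000 units while a binary tree gives 138. The numbers are only a way to compare the shapes of the two formulas, not measurements.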
![Page 11: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/11.jpg)
Group File Operations
•Forming Groups
  •Directory = a natural file system group abstraction
     mkdir/rmdir : create/delete group
     mv,cp,ln : add members
     rm : delete members
•Accessing Groups
     gfd = gopen(char* gdir, int flags)
•Operating on Groups
  •Pass group file descriptor to file operations, e.g., read, write, lseek, chmod
  •Semantics: operation applied to each group member
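The directory-as-group idea can be approximated at user level with standard POSIX directory calls. The sketch below is an emulation only: `struct group`, `group_open`, and `group_close` are hypothetical names, and a real `gopen()` as proposed in the talk would return a single descriptor backed by OS and file system support instead of an array of ordinary descriptors.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical user-level stand-in for a group file descriptor. */
struct group {
    int   *fds;   /* one open descriptor per member */
    size_t size;  /* number of members */
};

/* Close every member descriptor and release the array. */
void group_close(struct group *g)
{
    for (size_t i = 0; i < g->size; i++)
        close(g->fds[i]);
    free(g->fds);
    g->fds = NULL;
    g->size = 0;
}

/* Open every entry of directory gdir with the given flags.
 * Returns 0 on success, -1 on failure (no descriptors are left open). */
int group_open(const char *gdir, int flags, struct group *g)
{
    assert(gdir != NULL && g != NULL);
    g->fds = NULL;
    g->size = 0;

    DIR *dir = opendir(gdir);
    if (dir == NULL)
        return -1;

    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        int *grown = realloc(g->fds, (g->size + 1) * sizeof *grown);
        if (grown == NULL)
            goto fail;
        g->fds = grown;
        /* openat() follows symlinks, so a link in gdir opens its target. */
        int fd = openat(dirfd(dir), ent->d_name, flags);
        if (fd < 0)
            goto fail;
        g->fds[g->size++] = fd;
    }
    closedir(dir);
    return 0;

fail:
    closedir(dir);
    group_close(g);
    return -1;
}
```

Because membership is just directory contents, the shell commands listed on the slide (`ln`, `rm`, and so on) change the group without any change to this code.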
![Page 12: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/12.jpg)
Group File Operations
int rc = read(gfd, databuf, 1024)
[Figure: the group read is applied as read(1024) to the /proc file of each of four members; each member produces a return code (rc) and data, which are collected into the caller's Return Code (status/error) and Data Output Buffer.]
![Page 13: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/13.jpg)
Data Aggregation
•Definition: construct a whole from parts
•Provides various levels of data resolution
  SUMMARY: min, max, average, sum
  PARTIAL: x > 0.9, y ∈ {…}, TopN(z)
  COMPLETE: concatenate, equiv. class
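Summary aggregations such as min, max, sum, and average have a useful property: partial results can be merged in any grouping, so they can be combined piecewise instead of at one collection point. The sketch below shows one way to represent such a mergeable summary; `struct summary` and its functions are illustrative names, not an API from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* A partial result that can be merged with other partial results. */
struct summary {
    double min, max, sum;
    size_t count;          /* values folded in so far; 0 means empty */
};

/* Summary of a single value. */
struct summary summary_of(double x)
{
    struct summary s = { x, x, x, 1 };
    return s;
}

/* Combine two partial summaries; the result does not depend on grouping. */
struct summary summary_merge(struct summary a, struct summary b)
{
    if (a.count == 0) return b;
    if (b.count == 0) return a;
    struct summary m;
    m.min = a.min < b.min ? a.min : b.min;
    m.max = a.max > b.max ? a.max : b.max;
    m.sum = a.sum + b.sum;
    m.count = a.count + b.count;
    return m;
}

/* The average is derived at the end from sum and count. */
double summary_average(struct summary s)
{
    assert(s.count > 0);
    return s.sum / (double)s.count;
}
```

Carrying `sum` and `count` instead of a running average is the design choice that keeps the merge exact: averages of averages are only correct when every part has the same size.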
![Page 14: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/14.jpg)
Aggregating Group Results
•Fit existing interfaces
  •Status → summary
     Need to choose appropriate default for each op
  •Data → concatenate
•New operations for controlling results
  •Retrieve individual status: gstatus(…)
  •Load custom aggregations: gloadaggr(…)
  •Bind aggregations to operations: gbindaggr(…)
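As one possible reading of "Status → summary", the sketch below collapses per-member `read()` return codes into a single value that fits the existing `read()` interface. The chosen default (fail if any member failed, otherwise report total bytes) is an assumption made for illustration; the slides only say that a default must be chosen for each operation, and `summarize_read_status` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Collapse per-member read() return codes into one summary value:
 * -1 if any member failed, otherwise the total number of bytes read. */
ssize_t summarize_read_status(const ssize_t *rc, size_t group_size)
{
    assert(rc != NULL || group_size == 0);
    ssize_t total = 0;
    for (size_t i = 0; i < group_size; i++) {
        if (rc[i] < 0)
            return -1;
        total += rc[i];
    }
    return total;
}
```

A caller that receives -1 would then need a per-member query, which is the role the slide assigns to gstatus(…).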
![Page 15: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/15.jpg)
Example: System Resource Monitor
•Collects 1-, 5-, 15-minute load averages
  •Reads /proc/loadavg from each node
•Calculates (for each granularity)
  •Minimum load across all nodes
  •Maximum load across all nodes
  •Average load across all nodes
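The computation this monitor performs can be sketched as follows, assuming each node's /proc/loadavg contents have already been read into a string. On Linux the line begins with the 1-, 5-, and 15-minute load averages, for example `0.20 0.18 0.12 1/80 11206`. The function names are illustrative, not from the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Parse the three load averages that begin a /proc/loadavg line.
 * Returns 0 on success, -1 if the line is malformed. */
int parse_loadavg(const char *line, double loads[3])
{
    assert(line != NULL && loads != NULL);
    int fields = sscanf(line, "%lf %lf %lf",
                        &loads[0], &loads[1], &loads[2]);
    return fields == 3 ? 0 : -1;
}

/* Min, max, and average of the 1-, 5-, and 15-minute loads across nodes.
 * Returns 0 on success, -1 on empty input or a malformed line. */
int load_min_max_avg(const char *const *lines, size_t nodes,
                     double min[3], double max[3], double avg[3])
{
    if (nodes == 0)
        return -1;
    double sum[3] = { 0.0, 0.0, 0.0 };
    for (size_t n = 0; n < nodes; n++) {
        double loads[3];
        if (parse_loadavg(lines[n], loads) != 0)
            return -1;
        for (int k = 0; k < 3; k++) {
            if (n == 0 || loads[k] < min[k]) min[k] = loads[k];
            if (n == 0 || loads[k] > max[k]) max[k] = loads[k];
            sum[k] += loads[k];
        }
    }
    for (int k = 0; k < 3; k++)
        avg[k] = sum[k] / (double)nodes;
    return 0;
}
```

In this form all lines arrive at one process before any arithmetic happens, which is the centralized pattern the talk argues against at scale.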
![Page 16: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/16.jpg)
BEFORE
  open()
  read(1min)
  read(5min)
  read(15min)
  close()
  ComputeMMA(…)

AFTER
  symlink()
  gfd = gopen("grp_dir")
  // Bind read to aggr
  gbindaggr(gfd, OP_READ, AGGR_MIN_MAX_AVG)
  // Read & Compute
  read(gfd, 1min)
  read(gfd, 5min)
  read(gfd, 15min)
  close(gfd)
  gdefine()
![Page 17: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/17.jpg)
Group File Operations: Other Uses?
•Distributed System Administration
  •Disk-full clusters
     System file patching
     Software installation
  •System log monitoring
•Utility programs that operate on file groups
  •e.g., ps, top, grep, chmod/chown
•Internet Applications
  •Peer2Peer: file retrieval a la BitTorrent
  •Search/Crawl: websites are really just files
![Page 18: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/18.jpg)
TBŌN-FS
•New distributed file system
  •Scalable group file operations
  •Efficient single file operations
  •Tens to hundreds of thousands of servers
     Single mount point
•Integrates Tree-Based Overlay Network
  •One-to-many multicast & gather communication
  •Distributed data aggregation
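The gather half of a tree-based overlay can be pictured as a level-by-level reduction. The sketch below simulates that inside one process: leaf values are combined `fanout` at a time until a single value remains, and the number of levels is reported. It is a toy model under assumed names (`tree_gather_sum`, `fanout`), not the TBŌN implementation, which runs the combining step on separate hosts.

```c
#include <assert.h>
#include <stddef.h>

/* Reduce n leaf values to their sum by combining up to `fanout`
 * neighbours per level, in place. Stores the sum in *result and
 * returns the number of levels that were needed. */
unsigned tree_gather_sum(double *vals, size_t n, size_t fanout,
                         double *result)
{
    assert(vals != NULL && result != NULL && n >= 1 && fanout >= 2);
    unsigned levels = 0;
    while (n > 1) {
        size_t parents = (n + fanout - 1) / fanout;
        for (size_t p = 0; p < parents; p++) {
            double sum = 0.0;
            /* Parent p aggregates children p*fanout .. (p+1)*fanout - 1. */
            for (size_t c = p * fanout; c < n && c < (p + 1) * fanout; c++)
                sum += vals[c];
            vals[p] = sum;   /* safe: p never exceeds the children just read */
        }
        n = parents;
        levels++;
    }
    *result = vals[0];
    return levels;
}
```

For 8 leaves and a fan-out of 2 the reduction finishes in 3 levels, so the root handles 2 inputs instead of 8. That is the effect the "distributed data aggregation" bullet relies on.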
![Page 19: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/19.jpg)
Scalable Group File Operations
•Why not use a distributed/parallel FS?
[Figure: panels labeled Distributed File System, TBON-FS, and Parallel File System]
![Page 20: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/20.jpg)
TBŌN-FS: Proposed Architecture
[Figure: proposed architecture. Labels: User; Tool Application; read(); System Calls, sys_read(); Virtual File System, vfs_read(); File Systems & Devices; TBON-FS, /dev/tbonfs; TBON-FS Client; TBON; TBON-FS Server; Standard File Access; File Systems]
![Page 21: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/21.jpg)
TBŌN-FS: Current Prototype
[Figure: current prototype. Labels: User; Tool Application; TBON-FS Library; TBON; TBON-FS Server; Standard File Access; File Systems]
TBON-FS Library functions:
  mount_tbonfs() unmount_tbonfs()
  gopen() gsize() gstatus() gbindaggr()
  grp_close() grp_lseek() grp_read() grp_write()
![Page 22: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/22.jpg)
Current & Future Research
•Group file operations
  •OS support
  •More file types & operations (e.g., sockets and pipes)
•Tool integrations
  •Ganglia wide-area system monitor (in progress)
  •TotalView debugger
•TBŌN Model Extensions
  •Topology-aware filters
  •Persistent host state
  •Multi-organization TBON
![Page 23: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/23.jpg)
Summary
•“Iteration is the bane of scalability.”
•Group File Operations
  •Are natural, intuitive, and portable
  •Eliminate iteration
  •Allow for custom data aggregation
•TBON-FS: scalable group file operations
![Page 24: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/24.jpg)
Distributed Debugger (BEFORE)
// Open all /proc/<pid>/mem
foreach file ( 'ClusterX/cn*/[1-9]*/mem' )
    fds[grp_size] = open(file, flags);
    grp_size++;

// Set breakpoint & wait
for i=0 to grp_size
    lseek(fds[i], brkpt_addr, SEEK_SET);
    write(fds[i], brkpt_code_buf, code_sz);
WaitForAll();

// Read variable & compute equivalence classes
for i=0 to grp_size
    lseek(fds[i], var_addr, SEEK_SET);
    var_buf = grp_var_buf[i];
    read(fds[i], var_buf, var_sz);
    close(fds[i]);
ComputeEquivClasses(grp_var_buf, var_classes_buf);
![Page 25: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/25.jpg)
Distributed Debugger (AFTER)
// Open all /proc/<pid>/mem
foreach file ( 'ClusterX/cn*/[1-9]*/mem' )
    // add link to file in group directory
    symlink(file, "grp_dir");
gfd = gopen("grp_dir", flags);
grp_size = gsize(gfd);

// Set breakpoint & wait
lseek(gfd, brkpt_addr, SEEK_SET);
write(gfd, brkpt_code_buf, code_sz);
WaitForAll();

// Read variable & compute equivalence classes
lseek(gfd, var_addr, SEEK_SET);
gbindaggr(gfd, OP_READ, AGGR_EQUIV_CLASS, var_sz);
read(gfd, var_classes_buf, var_sz);
close(gfd);
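The equivalence-class step used by the debugger example groups processes whose variable holds the same bytes. A minimal sketch of that computation is shown below; the function name, the first-seen class numbering, and the quadratic comparison are assumptions for illustration and say nothing about how AGGR_EQUIV_CLASS is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Partition n fixed-size values into classes of byte-identical values.
 * class_of[i] receives the class index of value i, numbered in order of
 * first appearance. Returns the number of distinct classes. */
size_t compute_equiv_classes(const void *values, size_t n,
                             size_t value_size, size_t *class_of)
{
    assert(values != NULL && class_of != NULL && value_size > 0);
    const unsigned char *v = values;
    size_t num_classes = 0;
    for (size_t i = 0; i < n; i++) {
        size_t j;
        /* Look for an earlier value with identical bytes. */
        for (j = 0; j < i; j++) {
            if (memcmp(v + i * value_size, v + j * value_size,
                       value_size) == 0)
                break;
        }
        class_of[i] = (j < i) ? class_of[j] : num_classes++;
    }
    return num_classes;
}
```

For a debugger, the payoff is that thousands of processes typically fall into a handful of classes, so the user inspects a few representative values instead of one per process.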
![Page 26: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/26.jpg)
System Monitor (BEFORE)
// Open all /proc/loadavg
foreach file ( 'ClusterX/cn*/loadavg' )
    fds[grp_size] = open(file, flags);
    grp_size++;

// Read 1-minute, 5-minute, 15-minute loads
for i=0 to grp_size
    read(fds[i], 1min_buf[i], load_sz);
    read(fds[i], 5min_buf[i], load_sz);
    read(fds[i], 15min_buf[i], load_sz);
    close(fds[i]);

// Compute min/max/avg for each granularity
ComputeMinMaxAvg(1min_buf, 5min_buf, 15min_buf);
![Page 27: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/27.jpg)
System Monitor (AFTER)
// Open all /proc/loadavg
foreach member_file ( 'ClusterX/cn*/loadavg' )
    // add link to member in group directory
    symlink(member_file, "grp_dir");
gfd = gopen("grp_dir", flags);

// Read 1-minute, 5-minute, 15-minute loads
// and calculate min/max/avg
gbindaggr(gfd, OP_READ, AGGR_MIN_MAX_AVG, load_sz);
read(gfd, 1min_buf, load_sz);
read(gfd, 5min_buf, load_sz);
read(gfd, 15min_buf, load_sz);
close(gfd);
![Page 28: Group File Operations: A New Idiom for Scalable Tools](https://reader036.vdocument.in/reader036/viewer/2022081512/56815964550346895dc6a1f1/html5/thumbnails/28.jpg)
Related Work
•Xcpu
  •File system interface for distributed process management
  •Uses Plan9 9P protocol and recent Linux support (V9FS)
•HEC POSIX I/O Extensions
  •Explicit sharing of files by process groups
  •openg and sutoc