Performance Analysis of HPC with Lmbench

Didem Unat Supervisor: Nahil Sobh

July 22nd 2005

netfiles.uiuc.edu/dunat2/www

Lmbench: Micro-Benchmark Suite

• Simple, portable benchmarks
• Compares the performance of different Unix systems
• Measures latency and bandwidth
• Analyzes only the performance of the processor, memory, network, file system and disk
• Free software

Compiler & optimization issues

• The GNU C compiler was used on all the resources except copper
• The IBM xlc compiler was used on copper
• All of the benchmarks were compiled with optimization -O, except the benchmarks that calculate clock speed and context switch times

Metrics in the Benchmark

Bandwidth
• Pipe/TCP
• Cached file read
• Memory copy
• Memory read/write

Latency
• System call
• Signal handling
• Process creation
• Basic CPU operations
• Context switching
• Inter process communication
• File and VM system
• Memory read latencies

Inter Process Communication Bandwidth

• Transfers 64 MB of data in 64 KB chunks through
  • Unix pipe
  • Unix sockets
  • TCP/IP sockets

[Chart: bandwidth in MB/sec (axis 0 to 3000) for Pipe, AF Unix and TCP on W, Co, Cu and Hg]
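The pipe case of this transfer can be sketched as follows. This is a minimal illustration in Python, not lmbench's own C code: the function name `pipe_bandwidth` is hypothetical, the timing includes some process start-up cost, and interpreter overhead means the result will not match the native benchmark.

```python
import os
import time

CHUNK = 64 * 1024          # 64 KB per write, as on the slide
TOTAL = 64 * 1024 * 1024   # 64 MB in total, as on the slide


def pipe_bandwidth(total=TOTAL, chunk=CHUNK):
    """Send `total` bytes through a Unix pipe; return (bytes_received, MB/sec)."""
    rfd, wfd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the writer. It keeps only the write end of the pipe.
        os.close(rfd)
        try:
            buf = memoryview(bytes(chunk))
            sent = 0
            while sent < total:
                # os.write returns the number of bytes actually written.
                sent += os.write(wfd, buf[:min(chunk, total - sent)])
        finally:
            os._exit(0)  # exiting closes the write end, so the reader sees EOF

    # Parent: the reader. It keeps only the read end of the pipe.
    os.close(wfd)
    received = 0
    start = time.perf_counter()
    while True:
        data = os.read(rfd, chunk)
        if not data:
            break
        received += len(data)
    elapsed = time.perf_counter() - start
    os.close(rfd)
    os.waitpid(pid, 0)
    return received, received / max(elapsed, 1e-9) / (1024 * 1024)
```

Calling `pipe_bandwidth()` with the defaults reproduces the 64 MB / 64 KB configuration; the Unix socket and TCP variants would swap the pipe for the corresponding socket type.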

Cached file read

• A reread benchmark, intended to be used on a file that is in memory
• File reread: copies data from the kernel's file system page cache into the process's buffer
• Mmap reread: maps the entire file (8 MB) into the process's address space
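The two reread paths can be contrasted with a small sketch. The helper names (`make_file`, `file_reread`, `mmap_reread`) are hypothetical, and touching one byte per 4 KB page in the mapped case is an assumption made here to keep the example short; it is not a claim about how lmbench walks the mapping.

```python
import mmap
import os
import tempfile


def make_file(size):
    """Create a temporary file of `size` zero bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(size))
    return path


def file_reread(path, chunk=64 * 1024):
    """Reread through read(): the kernel copies data into the process's buffer."""
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            total += len(data)
    return total


def mmap_reread(path, page=4096):
    """Reread through mmap: the file is mapped and touched in place, no read() copy."""
    touched = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for off in range(0, len(m), page):
                _ = m[off]  # touch one byte on each page
                touched += 1
    return touched
```

To time either path, wrap a second call (the first call warms the cache) in `time.perf_counter()` on an 8 MB file created with `make_file(8 * 1024 * 1024)`.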

Memory copy

• Measures how fast the system can bcopy data
• bcopy copies n bytes from a source string to a destination string
• An 8 MB to 8 MB copy, which does not fit in the cache
• Kernel bcopy and C library bcopy
• C library bcopy is shown in the next slide
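As a rough illustration of the measurement, the sketch below times an 8 MB to 8 MB buffer copy. It assumes that CPython performs the `bytearray` slice assignment as a bulk memory copy; the function name `copy_bandwidth` and the choice of reporting the best of several runs are assumptions of this sketch, not part of lmbench.

```python
import time


def copy_bandwidth(size=8 * 1024 * 1024, repeats=5):
    """Copy `size` bytes between two buffers; return the best MB/sec seen."""
    src = bytearray(size)
    dst = bytearray(size)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-length slice assignment copies src into dst
        best = min(best, time.perf_counter() - start)
    return size / max(best, 1e-9) / (1024 * 1024)
```

For example, `copy_bandwidth()` uses the 8 MB size from the slide, which is intended to exceed the cache so that the copy exercises main memory.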

Memory read/write

Read
• Measures the time to read data into the processor
• An unrolled loop that sums up a series of integers

Write
• Measures the time to write data to memory
• An unrolled loop that stores a value into an integer
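The shape of the two loops can be sketched as below. The four-way unrolling factor and the function names are assumptions chosen for illustration; in Python the interpreter overhead dominates, so this shows the loop structure only and is not a usable bandwidth measurement.

```python
from array import array


def unrolled_sum(data):
    """Read loop: sum the integers four per iteration, then the remainder."""
    total = 0
    n = len(data) - len(data) % 4
    for i in range(0, n, 4):
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3]
    for i in range(n, len(data)):
        total += data[i]
    return total


def unrolled_fill(data, value):
    """Write loop: store `value` into the integers four per iteration."""
    n = len(data) - len(data) % 4
    for i in range(0, n, 4):
        data[i] = value
        data[i + 1] = value
        data[i + 2] = value
        data[i + 3] = value
    for i in range(n, len(data)):
        data[i] = value
    return data
```

Unrolling reduces the share of loop-control instructions per element, so the measured time is closer to the cost of the loads or stores themselves.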

Operating System Entry / Signal Handling / Process Creation Costs

• Process-related latencies
• System call: null call, null I/O, stat, open/close
• Signal handling: signal installation, signal handling
• Process creation: fork + exit, fork + execve, fork + /bin/sh -c
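A few of these latencies can be approximated with a simple timing loop. In this sketch the helper names `time_call` and `fork_exit` are hypothetical, and using `os.getppid` as the cheap "null" system call is an assumption of the example; each figure also includes Python call overhead.

```python
import os
import time


def time_call(fn, n=10000):
    """Return the average wall-clock time in seconds of one call to `fn`."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n


def fork_exit():
    """Process creation, fork + exit: the child exits at once, the parent waits."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    os.waitpid(pid, 0)


if __name__ == "__main__":
    print("null call :", time_call(os.getppid))
    print("stat      :", time_call(lambda: os.stat(".")))
    print("fork+exit :", time_call(fork_exit, n=200))
```

The fork + execve and fork + /bin/sh -c cases would follow the same pattern, with the child replacing itself by another program instead of exiting.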

Context Switching

• The time to save the state of one process and restore the state of another process
• The processes are connected in a ring of Unix pipes
• A token is passed from process to process
• Each process allocates an array and sums the array
• Context-switch time doesn't include the overhead of doing the work
• Two parameters: the number of processes and the size of each process
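The ring described above can be sketched as follows. This is a simplified illustration with a hypothetical function name: it omits the per-process array summing and does not subtract the pipe overhead, so it reports the raw time per token hop and not a pure context-switch time.

```python
import os
import time


def ring_token_time(nproc=4, rounds=200):
    """Pass a one-byte token around a ring of pipes; return seconds per hop."""
    if nproc < 2:
        raise ValueError("the ring needs at least two processes")
    pipes = [os.pipe() for _ in range(nproc)]  # pipes[i] = (read_fd, write_fd)

    def keep_only(rfd, wfd):
        # Close every descriptor this process does not use, so that EOF
        # can propagate around the ring at shutdown.
        for r, w in pipes:
            if r != rfd:
                os.close(r)
            if w != wfd:
                os.close(w)

    children = []
    for i in range(1, nproc):
        pid = os.fork()
        if pid == 0:
            # Child i reads from pipe i and writes to pipe i+1 (wrapping).
            rfd, wfd = pipes[i][0], pipes[(i + 1) % nproc][1]
            try:
                keep_only(rfd, wfd)
                while True:
                    token = os.read(rfd, 1)
                    if not token:  # upstream closed: shut down
                        break
                    os.write(wfd, token)
            finally:
                os._exit(0)
        children.append(pid)

    # The parent is process 0: it reads from pipe 0 and writes to pipe 1.
    rfd, wfd = pipes[0][0], pipes[1][1]
    keep_only(rfd, wfd)
    start = time.perf_counter()
    for _ in range(rounds):
        os.write(wfd, b"x")
        os.read(rfd, 1)  # the token has travelled the whole ring
    elapsed = time.perf_counter() - start
    os.close(wfd)  # EOF reaches child 1, then cascades around the ring
    os.close(rfd)
    for pid in children:
        os.waitpid(pid, 0)
    return elapsed / (rounds * nproc)
```

Varying `nproc` corresponds to the first parameter on the slide; the second (process size) would be modelled by having each process sum an array of the chosen size before forwarding the token.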

Interprocess Communication Latencies

• Passing a small message back and forth between two processes
• The time reported is one round trip
• Message size: a byte or a word
• Metrics: Pipe, Unix socket, UDP and TCP, RPC/UDP-TCP, TCP connection latency
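The round-trip measurement can be sketched for the Unix socket case as below. The function name is hypothetical, the message is a single byte, and the result includes Python overhead on both sides, so it should be read as an illustration of the ping-pong pattern only.

```python
import os
import socket
import time


def socketpair_round_trip(rounds=1000):
    """Bounce a one-byte message between two processes; return seconds per round trip."""
    parent_sock, child_sock = socket.socketpair()  # connected Unix stream sockets
    pid = os.fork()
    if pid == 0:
        # Child: echo every byte back until the parent closes its end.
        parent_sock.close()
        try:
            while True:
                msg = child_sock.recv(1)
                if not msg:
                    break
                child_sock.sendall(msg)
        finally:
            os._exit(0)

    child_sock.close()
    start = time.perf_counter()
    for _ in range(rounds):
        parent_sock.sendall(b"x")
        parent_sock.recv(1)  # wait for the echo: one full round trip
    elapsed = time.perf_counter() - start
    parent_sock.close()
    os.waitpid(pid, 0)
    return elapsed / rounds
```

The pipe, UDP and TCP variants would keep the same loop and change only how the two endpoints are created.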

File & VM System

• File create/delete: creates a number of small files in the current working directory and then removes the files
• Mmap latency: the cost of mmapping and unmmapping varying file sizes
• Prot fault: the time to catch a protection fault
• Page fault: the cost of page faulting pages from a file
• 100 fd select: the time to do a select on n file descriptors
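The file create/delete measurement can be sketched as follows. Unlike the description above, this example works in a temporary directory and not the current working directory, to avoid leaving files behind; the function name and the default count are assumptions of the sketch.

```python
import os
import tempfile
import time


def create_delete_latency(count=1000):
    """Create then remove `count` empty files.

    Returns (seconds per create, seconds per delete, files left over).
    """
    with tempfile.TemporaryDirectory() as directory:
        paths = [os.path.join(directory, f"f{i}") for i in range(count)]

        start = time.perf_counter()
        for path in paths:
            os.close(os.open(path, os.O_CREAT | os.O_WRONLY, 0o600))
        created = time.perf_counter()

        for path in paths:
            os.unlink(path)
        deleted = time.perf_counter()

        leftover = len(os.listdir(directory))
    return (created - start) / count, (deleted - created) / count, leftover
```

Creation and deletion are timed separately because file systems often handle the two operations at very different costs.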

Memory Latencies

• Measures memory read latency for varying memory sizes and strides

• The size of the array starts from 512 bytes

• The stride varies from 16 to 1024

• Does not include the instruction execution time
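One common way to measure read latency for a given array size and stride is to chase a chain of dependent loads, so that each load must finish before the next can start. The sketch below builds such a chain; treating the stride as a byte count, the 8-byte element size and the function names are assumptions of this example, and in Python the interpreter cost per step hides the hardware latency, so it illustrates the access pattern only.

```python
import time
from array import array

WORD = 8  # bytes per element of the "q" (signed 64-bit) array


def build_chain(size_bytes, stride_bytes):
    """Each element holds the index of the element one stride ahead, wrapping around."""
    n = size_bytes // WORD
    step = max(1, stride_bytes // WORD)
    chain = array("q", [0]) * n
    for i in range(n):
        chain[i] = (i + step) % n
    return chain


def chase(chain, loads):
    """Follow the chain for `loads` dependent reads; return the final index."""
    p = 0
    for _ in range(loads):
        p = chain[p]  # the next address depends on the value just read
    return p


def time_per_load(size_bytes, stride_bytes, loads=100000):
    """Average seconds per dependent load for one (size, stride) pair."""
    chain = build_chain(size_bytes, stride_bytes)
    start = time.perf_counter()
    chase(chain, loads)
    return (time.perf_counter() - start) / loads
```

Sweeping `size_bytes` upward from 512 and `stride_bytes` from 16 to 1024 gives the grid of measurements described above; in a native implementation, jumps in the curve mark the cache and memory levels.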

Conclusion

Metric                  The best     Has problems
IPC bandwidth           Co           W, Cu
Cached I/O bandwidth    W            Co, Hg
Memory R/W bandwidth    W            Co, Hg
Process creation        Cu           Co
CPU ops                 W, Co, Hg    Cu
Network latency         W            Co, Cu
Memory latency          W, Co        Cu

THANK YOU !

Have a nice weekend !

