-
A Micro-Benchmark Evaluation of Catamount and Cray Linux
Environment (CLE) Performance
Jeff Larkin
Cray Inc.
Jeff Kuehn
ORNL
-
THE BIG QUESTION!
Does CLE waddle like a penguin, or run like a catamount?
-
Overview
Background
  Motivation
  Catamount and CLE
  Benchmarks
  Benchmark System
Benchmark Results
  IMB
  HPCC
Conclusions
-
BACKGROUND
-
Motivation
Last year at CUG, “CNL” was in its infancy.
Since CUG07:
  Significant effort has been spent scaling on large machines
  CNL reached GA status in Fall 2007
  Compute Node Linux (CNL) was renamed the Cray Linux Environment (CLE)
  A significant number of sites have already made the change
  Many codes have already been ported from Catamount to CLE
Catamount’s scalability has always been touted, so how does CLE compare?
  Fundamentals of communication performance
  HPCC
  IMB
What should sites/users know before they switch?
-
Background: Catamount
Developed by Sandia for Red Storm
Adopted by Cray for the XT3
Extremely lightweight:
  Simple memory model: no virtual memory, no mmap
  Reduced system calls
  Single-threaded
  No Unix sockets
  No dynamic libraries
  Few interrupts to user codes
Virtual Node (VN) mode added for Dual-Core
-
Background: CLE
First, we tried a full SUSE Linux Kernel.
Then, we “put Linux on a diet.”
With the help of ORNL and NERSC, we began running at
large scale
By Fall 2007, we released Linux for the compute nodes
What did we gain?
  Threading
  Unix sockets
  I/O buffering
-
Background: Benchmarks
HPCC
  Suite of several benchmarks, released as part of the DARPA HPCS program
  Measures MPI performance
  Measures performance across varied temporal and spatial localities
  Benchmarks are run in 3 modes:
    SP – one node runs the benchmark
    EP – every node runs a copy of the same benchmark
    Global – all nodes run the benchmark together
Intel MPI Benchmarks (IMB) 3.0
  Formerly the Pallas benchmarks
  Benchmarks standard MPI routines at varying scales and message sizes (a minimal ping-pong sketch follows below)
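To make concrete what an IMB measurement looks like, here is a minimal ping-pong timing loop in C. This is an illustrative sketch, not IMB source; the message size and repetition count are arbitrary assumptions.

/* Minimal ping-pong sketch (not IMB source): ranks 0 and 1 bounce a
 * message back and forth; half the average round-trip time
 * approximates the one-way latency for the chosen message size. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nbytes = 1024;   /* message size (assumed) */
    const int reps   = 1000;   /* repetitions (assumed) */
    char *buf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency ~ %.2f usec\n",
               (t1 - t0) / (2.0 * reps) * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}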
-
Background: Benchmark System
All benchmarks were run on the same system, “Shark,” using the latest OS versions available as of Spring 2008.
System basics:
  Cray XT4
  2.6 GHz dual-core Opterons (able to run up to 1280 cores)
  DDR2-667 memory, 2 GB per core
Software configurations tested:
  Catamount (1.5.61)
  CLE with MPT2 (2.0.50)
  CLE with MPT3 (2.0.50, xt-mpt 3.0.0.10)
-
BENCHMARK RESULTS
-
HPCC
-
Parallel Transpose (Cores)
[Line chart: bandwidth in GB/s (0–140) vs. processor cores (0–1500); series: Catamount SN, Catamount VN, CLE MPT2 N1, CLE MPT2 N2, CLE MPT3 N1, CLE MPT3 N2]
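For readers unfamiliar with the benchmark behind these curves: PTRANS (Parallel Transpose) stresses aggregate network bandwidth with a distributed matrix transpose. The sketch below is a hypothetical simplification, not the HPCC source (which computes A = A + B^T); it shows the pack / all-to-all / local-transpose pattern, with N, the test data, and the row-slab layout chosen only for illustration.

/* Hypothetical sketch of the communication pattern PTRANS stresses
 * (not HPCC source): distributed transpose of an N x N matrix stored
 * as row slabs, one slab per rank. Assumes N divisible by nprocs. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int N  = 1024;          /* global dimension (assumed) */
    const int nb = N / nprocs;    /* rows per rank = block width */

    double *A    = malloc((size_t)nb * N * sizeof *A);  /* my row slab */
    double *sbuf = malloc((size_t)nb * N * sizeof *sbuf);
    double *rbuf = malloc((size_t)nb * N * sizeof *rbuf);
    double *AT   = malloc((size_t)nb * N * sizeof *AT); /* my slab of A^T */
    for (int i = 0; i < nb * N; i++)
        A[i] = rank * nb * N + i; /* arbitrary test data */

    /* Pack the nb x nb block destined for each rank contiguously. */
    for (int p = 0; p < nprocs; p++)
        for (int i = 0; i < nb; i++)
            for (int j = 0; j < nb; j++)
                sbuf[(p * nb + i) * nb + j] = A[i * N + p * nb + j];

    /* All-to-all block exchange: this is the bandwidth stress test. */
    MPI_Alltoall(sbuf, nb * nb, MPI_DOUBLE,
                 rbuf, nb * nb, MPI_DOUBLE, MPI_COMM_WORLD);

    /* Locally transpose each received block into the output slab. */
    for (int p = 0; p < nprocs; p++)
        for (int i = 0; i < nb; i++)
            for (int j = 0; j < nb; j++)
                AT[j * N + p * nb + i] = rbuf[(p * nb + i) * nb + j];

    if (rank == 0)
        printf("AT[0][1] = %g (globally equals A[1][0])\n", AT[1]);

    free(A); free(sbuf); free(rbuf); free(AT);
    MPI_Finalize();
    return 0;
}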
-
Parallel Transpose (Sockets)
[Line chart: bandwidth in GB/s (0–120) vs. sockets (0–600); same six series as above]
-
MPI Random Access
[Line chart: GUP/s (0–3) vs. processor cores (0–1500); same six series as above]
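The GUP/s metric counts giga-updates per second: read-modify-write operations at pseudo-random locations in a large table. Here is a serial sketch in the spirit of the update kernel; the HPCC MPI version additionally batches and routes updates between ranks, and the table size and update count below are arbitrary assumptions.

/* Serial sketch of a RandomAccess-style update kernel (illustrative,
 * not HPCC source): XOR pseudo-random values into random table slots. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LOG2_SIZE 20                      /* 2^20-word table (assumed) */
#define TABLE_SIZE (1UL << LOG2_SIZE)

int main(void)
{
    uint64_t *table = malloc(TABLE_SIZE * sizeof *table);
    for (uint64_t i = 0; i < TABLE_SIZE; i++)
        table[i] = i;

    uint64_t ran = 1;
    for (uint64_t u = 0; u < 4 * TABLE_SIZE; u++) {
        /* LFSR step over the polynomial x^63 + x^2 + x + 1 */
        ran = (ran << 1) ^ (((int64_t)ran < 0) ? 7 : 0);
        /* random read-modify-write: the memory access being measured */
        table[ran & (TABLE_SIZE - 1)] ^= ran;
    }

    printf("table[0] = %llu\n", (unsigned long long)table[0]);
    free(table);
    return 0;
}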
-
MPI-FFT (Cores)
[Line chart: GFlop/s (0–250) vs. processor cores (0–1200); same six series as above]
-
MPI-FFT (Sockets)
[Line chart: GFlop/s (0–250) vs. sockets (0–600); same six series as above]
-
Naturally Ordered Latency (512 cores)

Configuration    Time (usec)
Catamount SN     6.41346
CLE MPT2 N1      9.08375
CLE MPT3 N1      9.41753
Catamount VN     12.3024
CLE MPT2 N2      13.8044
CLE MPT3 N2      9.799
-
Naturally Ordered Bandwidth (512 cores)

Configuration    Bandwidth (MB/s)
Catamount SN     1.07688
CLE MPT2 N1      0.900693
CLE MPT3 N1      0.81866
Catamount VN     0.171141
CLE MPT2 N2      0.197301
CLE MPT3 N2      0.329071
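The naturally ordered ring tests pass messages around a ring of MPI ranks in natural rank order, so neighbors in the ring tend to be neighbors on the machine. A hedged sketch of that pattern, not the HPCC source, using MPI_Sendrecv:

/* Sketch of a naturally ordered ring exchange (illustrative, not
 * HPCC source): each rank sends to its right neighbor and receives
 * from its left neighbor, in natural rank order. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int right = (rank + 1) % nprocs;
    int left  = (rank - 1 + nprocs) % nprocs;

    const int reps = 1000;            /* repetition count (assumed) */
    double sbuf = rank, rbuf;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Sendrecv(&sbuf, 1, MPI_DOUBLE, right, 0,
                     &rbuf, 1, MPI_DOUBLE, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg ring step: %.2f usec\n", (t1 - t0) / reps * 1e6);

    MPI_Finalize();
    return 0;
}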
-
IMB
-
IMB Ping Pong Latency (N1)
[Line chart: time in usec (0–12) vs. message size in bytes (0–1200); series: Catamount, CLE MPT2, CLE MPT3]
-
IMB Ping Pong Latency (N2)
[Line chart: average time in usec (0–10) vs. message size in bytes (0–1200); series: Catamount, CLE MPT2, CLE MPT3]
-
IMB Ping Pong Bandwidth
[Line chart: bandwidth in MB/s (0–600) vs. message size in bytes (0–1200); series: Catamount, CLE MPT2, CLE MPT3]
-
MPI Barrier (Lin/Lin)
[Line chart: time in usec (0–160, linear) vs. processor cores (0–1500, linear); series: Catamount, CLE MPT2, CLE MPT3]
-
MPI Barrier (Lin/Log)
[Line chart: time in usec (0–160, linear) vs. processor cores (1–10000, log); series: Catamount, CLE MPT2, CLE MPT3]
-
MPI Barrier (Log/Log)
[Line chart: time in usec (0.1–1000, log) vs. processor cores (1–10000, log); series: Catamount, CLE MPT2, CLE MPT3]
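The measurement behind these barrier curves is simple to reproduce; here is a minimal sketch, not the IMB source, with an assumed repetition count:

/* Minimal barrier-timing sketch (illustrative, not IMB source):
 * average the cost of MPI_Barrier over many repetitions. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 1000;            /* repetition count (assumed) */
    MPI_Barrier(MPI_COMM_WORLD);      /* warm-up / synchronize start */

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg barrier: %.2f usec\n", (t1 - t0) / reps * 1e6);

    MPI_Finalize();
    return 0;
}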
-
SendRecv (Catamount/CLE MPT2)
-
SendRecv (Catamount/CLE MPT3)
-
Broadcast (Catamount/CLE MPT2)
-
Broadcast (Catamount/CLE MPT3)
-
Allreduce (Catamount/CLE MPT2)
-
Allreduce (Catamount/CLE MPT3)
-
AlltoAll (Catamount/CLE MPT2)
-
AlltoAll (Catamount/CLE MPT3)
-
CONCLUSIONS
-
What we saw
Catamount:
  Handles single-core (SN/N1) runs slightly better
  Seems to handle small messages and small core counts slightly better
CLE:
  Does very well on dual-core (N2)
  Likes large messages and large core counts
  MPT3 helps performance and closes the gap between QK (the Catamount kernel) and CLE
-
What’s left to do?
We’d really like to try this again on a larger machine: does CLE continue to beat Catamount above 1024 cores, or will the lines converge or cross?
What about I/O? Linux adds I/O buffering; how does this affect I/O performance at scale?
How does this translate into application performance? See “Cray XT4 Quadcore: A First Look,” Richard Barrett et al., Oak Ridge National Laboratory (ORNL).
-
CLE RUNS LIKE A BIG CAT!
Does CLE waddle like a penguin, or run like a catamount?
-
Acknowledgements
This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Thanks to Steve, Norm, Howard, and others for help investigating and understanding these results.