COMP4300/COMP8300 Parallel Systems
Alistair Rendell and Joseph Antony
Research School of Computer Science
Australian National University
Concept and Rationale
The idea
– Split your program into bits that can be executed simultaneously
Motivation
– Speed, Speed, Speed… at a cost-effective price
– If we didn't want it to go faster we would not be bothered with the hassles of parallel programming!
Reduce the time to solution to acceptable levels
– No point waiting 1 week for tomorrow's weather forecast
– Simulations that take months to run are not useful in a design environment
Sample Application Areas
Fluid flow problems
– Weather forecasting/climate modelling
– Aerodynamic modelling of cars, planes, rockets, etc.
Structural mechanics
– Building, bridge, car, etc. strength analysis
– Car crash simulation
Speech and character recognition, image processing
Visualization, virtual reality
Semiconductor design, simulation of new chips
Structural biology, molecular-level design of drugs
Human genome mapping
Financial market analysis and simulation
Data mining, machine learning
Games programming
World Climate Modeling
Atmosphere divided into 3D regions or cells
Complex mathematical equations describe conditions in each cell, e.g. pressure, temperature, velocity
– Conditions change according to neighbouring cells
– Updates repeated frequently as time passes
– Cells are affected by more distant cells the longer-range the forecast
Assume
– Cells are 1x1x1 mile to a height of 10 miles: 5x10^8 cells
– 200 flops to update each cell per timestep
– 10-minute timesteps for a total of 10 days
100 days on a 100 Mflop machine
10 minutes on a Tflop machine
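The estimate above can be reproduced with a quick back-of-envelope script (the slide's headline figures are order-of-magnitude numbers):

```python
# Back-of-envelope cost of the climate-model example, using the
# slide's assumptions: 5e8 cells, 200 flops per cell per timestep,
# and 10-minute timesteps over 10 days.
cells = 5 * 10**8
flops_per_cell = 200
timesteps = 10 * 24 * 6                   # 10 days of 10-minute steps = 1440

flops_per_step = cells * flops_per_cell   # 1e11 flops per timestep
total_flops = flops_per_step * timesteps  # 1.44e14 flops in total

for name, rate in [("100 Mflop/s", 1e8), ("1 Tflop/s", 1e12)]:
    seconds = total_flops / rate
    print(f"{name}: {seconds:,.0f} s ({seconds / 86400:.2f} days)")
```

The point is the ratio, not the exact figures: four orders of magnitude in machine speed turns a wait of weeks into minutes.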
ParallelSystems@ANU: NCI
NCI: National Computational Infrastructure
– http://nci.org.au
History: APAC established in 1998 with a $19.5M grant from the federal government; NCI created in 2007
Current NCI collaboration agreement (2012–15)
– Major collaborators: ANU, CSIRO, BoM, GA
– Universities: Adelaide, Monash, UNSW, UQ, Sydney, Deakin, RMIT
– University consortia: Intersect (NSW), QCIF (Queensland)
Co-investment (for recurrent operations):
– 2007: $0M; 2008: $3.4M; 2009: $6.4M; 2011: $7.5M; 2012: $8.5M; 2013: $11M; 2014: $11+M (to provide for all recurrent operations)
Current infrastructure: Data Centre
• New Data Centre: $24M (opened Nov. 2012); machine room: 920 sq. m
• Power (after 2014 upgrades): 4.5 MW raw capacity; 1 MW UPS; 2 x 1.1 MVA Cummins generators
• Cooling in two loops:
– Server: 2 x 1.8 MW Carrier chillers; 3 x 0.8 MW "free cooling" heat exchangers; 18 deg C; 75 l/sec pump rate
– Data: 3 x 0.5 MW Carrier chillers; 15 deg C
• PUE: approx. 1.25
NCI: Raijin, a Petascale Supercomputer
Raijin supercomputer (commissioned June 2013)
– 57,472 cores (Intel Xeon Sandy Bridge, 2.6 GHz) in 3592 compute nodes
– approx. 160 TBytes of main memory
– Mellanox InfiniBand FDR interconnect (52 km of cable)
– approx. 10 PBytes of usable fast filesystem (for short-term scratch space, apps, home directories)
– Power: 1.5 MW max. load
– Cooling systems: 100 tonnes of water
– 24th fastest in the world on debut (November 2012); first petaflop system in Australia (November 2014: #52)
• Fastest filesystem in the southern hemisphere
• Custom monitoring and deployment
• Custom kernel
• Highly customised PBS Pro scheduler
NCI's integrated high-performance environment
[Diagram: Raijin HPC compute nodes, Raijin login/data-mover nodes, NCI data movers and a VMware cloud, connected by a Raijin 56Gb FDR IB fabric, a /g/data 56Gb FDR IB fabric and 10 GigE links. Storage comprises the Raijin high-speed filesystem /short (7.6 PB); the persistent global parallel filesystems /g/data1 (~7 PB) and /g/data2 (~6 PB); /home, /system, /images and /apps; and the Massdata tape archive (1.0 PB cache, 20 PB tape). External links run to the Huxley DC and the Internet.]
ParallelSystems@DCS
Bunyip: tsg.anu.edu.au/Projects/Bunyip
– 192-processor PC cluster
– winner of the 2000 Gordon Bell prize for best price/performance
High Performance Computing Group
– Jabberwocky cluster
– Saratoga cluster
– Sunnyvale cluster
The Rise of Parallel Computing

Year   Hardware                         Languages
1950   Early designs                    Fortran I (Backus, 57)
1960   Integrated circuits              Fortran 66
1970   Large-scale integration          C (72)
1980   RISC and PC                      C++ (83), Python 1.0 (89)
1990   Shared and distributed parallel  MPI, OpenMP, Java (95)
2000   Faster, better, hotter           Python 2.0 (00)
2010   Throughput oriented              CUDA, OpenCL

Parallelism became an issue for programmers from the late 80s
People began compiling lists of big parallel systems
November 2014 Top500 (NCI now number 52)
Planning the Future
Top500 Supercomputers
[Graphs: growth in ANU/NCI's computing performance (measured in TFlops) since 1987, with architecture and capability determined by research and innovation drivers; and international Top500 supercomputer growth since 1993 (red: the #1 machine each year; yellow: the #500 machine each year; blue: the sum of all machines). The graphs show growth factors of between 8 and 9 times every 3 years.]
Transitioning Australia to its HPC Future
Moore's Law: 'Transistor density will double approximately every two years.'
Dennard Scaling: 'As MOSFET features shrink, switching time and power consumption will fall proportionately.'
…which led to higher Hertz and faster flops
We also had increased node performance
Agarwal, Hrishikesh, Keckler and Burger, "Clock Rate versus IPC", ISCA 2000
[Figure data (after Agarwal et al.): feature size, die area, and fraction of the chip reachable in one clock cycle: 250 nm, 400 mm², 100%; 180 nm, 450 mm², 100%; 130 nm, 566 mm², 82%; 100 nm, 622 mm², 40%; 70 nm, 713 mm², 19%; 50 nm, 817 mm², 6.5%; 35 nm, 937 mm², 1.9%.]
Until the chips became too big…
…so multiple cores appeared on chip
…until we hit a bigger problem…
2004: Sun releases the SPARC IV with dual cores, heralding the start of multicore
…the end of Dennard scaling…
Dennard, Gaensslen, Yu, Rideout, Bassous and Leblanc, IEEE SSC, 1974
Dennard scaling: 'As MOSFET features shrink, switching time and power consumption will fall proportionately.' ✗
Moore's Law: 'Transistor density will double approximately every two years.' ✓
…ushering in…
…a new philosophy in processor design

1960-2010                    2010-?
Few transistors              No shortage of transistors
No shortage of power         Limited power
Maximize transistor utility  Minimize energy
Generalize                   Customize
…and a fundamentally new set of building blocks for our petascale systems
Petascale and Beyond: Challenges and Opportunities

As a whole: sheer number of nodes (Tianhe-2 has the equivalent of >3M cores)
• Programming language/environment
• Fault tolerance

Within a domain: heterogeneity (the Tianhe system uses CPUs and GPUs)
• What to use when
• Co-location of data with the unit processing it

On the chip: energy minimization (processors already have frequency and voltage scaling)
• Minimize data size and movement, including use of just enough precision
• Specialized cores
In RSCS we are working in all these areas
Other Important Parallelism
Multiple instruction units:
– Typical processors issue ~4 instructions per cycle
Instruction pipelining:
– Complicated operations are broken into simple operations that can be overlapped
Graphics engines:
– Use multiple rendering pipes and processing elements to render millions of polygons a second
Interleaved memory:
– Multiple paths to memory that can be used at the same time
Input/Output:
– Disks are striped, with different blocks of data written to different disks at the same time
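The payoff of instruction pipelining in the list above comes down to simple arithmetic: a k-stage pipeline that accepts a new operation every cycle finishes n operations in k + n - 1 cycles, versus n * k cycles if each operation must finish before the next starts. A small sketch (the stage and operation counts here are illustrative, not from the lecture):

```python
# Cycle counts for n operations on a k-stage functional unit:
# unpipelined, each operation takes k cycles back to back;
# pipelined, a new operation enters every cycle after a k-cycle fill.
def unpipelined_cycles(n, k):
    return n * k

def pipelined_cycles(n, k):
    return k + (n - 1)

n, k = 1000, 5
print(unpipelined_cycles(n, k), pipelined_cycles(n, k))  # 5000 1004
speedup = unpipelined_cycles(n, k) / pipelined_cycles(n, k)
print(f"speedup ~ {speedup:.2f}x")  # approaches k for large n
```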
Parallelisation
Split the program up and run parts simultaneously on different processors
– On N computers the time to solution should (ideally!) be 1/N
– Parallel Programming: the art of writing the parallel code!
– Parallel Computer: the hardware on which we run our parallel code!
COMP4300 will discuss both
Beyond raw compute power, other motivations include
– Enabling more accurate simulations in the same time (finer grids)
– Providing access to huge aggregate memories
– Providing more and/or better input/output capacity
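The "split up and run simultaneously" idea can be sketched with Python's standard multiprocessing module (a toy example, not course material; the workload and process count are arbitrary):

```python
# A toy sketch of splitting one job across N processes, assuming the work
# (summing squares over [0, n)) divides into independent chunks.
# Uses the Unix "fork" start method so workers inherit the functions here.
from multiprocessing import get_context

def partial_sum(lo, hi, out):
    # Each worker computes its own chunk and reports the partial result.
    out.put(sum(i * i for i in range(lo, hi)))

def parallel_sum(n, nprocs):
    ctx = get_context("fork")
    out = ctx.Queue()
    step = n // nprocs
    procs = []
    for p in range(nprocs):
        lo, hi = p * step, (n if p == nprocs - 1 else (p + 1) * step)
        procs.append(ctx.Process(target=partial_sum, args=(lo, hi, out)))
    for pr in procs:
        pr.start()
    total = sum(out.get() for _ in procs)   # combine the partial results
    for pr in procs:
        pr.join()
    return total

if __name__ == "__main__":
    # Ideally the time to solution is ~1/N of the serial time; in practice
    # process start-up and communication costs eat into that.
    print(parallel_sum(10**6, 4))
```

The split/compute/combine shape here is exactly the pattern the course develops at scale, whether the workers are processes on one node or ranks on thousands of them.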
Parallelism in a Single "CPU" Box
Multiple instruction units:
– Typical processors issue ~4 instructions per cycle
Instruction pipelining:
– Complicated operations are broken into simple operations that can be overlapped
Graphics engines:
– Use multiple rendering pipes and processing elements to render millions of polygons a second
Interleaved memory:
– Multiple paths to memory that can be used at the same time
Input/Output:
– Disks are striped, with different blocks of data written to different disks at the same time
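The striping idea in the last bullet can be sketched in a few lines (a toy model; real striped storage works on fixed-size byte blocks):

```python
# Toy disk striping: block i of a file goes to disk i % ndisks, so runs of
# consecutive blocks can be written to different disks simultaneously.
def stripe(blocks, ndisks):
    disks = [[] for _ in range(ndisks)]
    for i, block in enumerate(blocks):
        disks[i % ndisks].append((i, block))
    return disks

layout = stripe([f"block{i}" for i in range(8)], 4)
for d, contents in enumerate(layout):
    print(f"disk {d}: {contents}")
# disk 0 holds blocks 0 and 4, disk 1 holds 1 and 5, and so on
```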
Health Warning!
Course is run every other year
– Drop out this year and it won't be repeated until 2017
It's a 4000/8000-level course; it's supposed to:
– Be more challenging than a 3000-level course!
– Be less well structured
– Have greater expectations of you
– Have more student participation
– Be fun!
Nathan Robertson, 2002 honours student:
– "Parallel systems and thread safety at Medicare: 2/16 understood it - the other guy was a $70/hr contractor"
Learning Objectives
Parallel Architecture:
– Basic issues concerning the design and likely performance of parallel systems
Specific Systems:
– Will make extensive use of research systems in our group and also visit the NCI facilities
Programming Paradigms:
– Distributed and shared memory, things in between, Grid computing
Parallel Algorithms:
– Numeric and non-numeric
The future
Course Content
Discussion of schedule: http://cs.anu.edu.au/courses/COMP4300/schedule.html
Commitment and Assessment
The pieces
– 2 lectures per week (~30 core lecture hours)
– 6 labs (not marked, solutions provided)
– 2 assignments (40%)
– 1 mid-semester exam (1 hour, 15%)
– 1 final exam (3 hours, 45%)
Final mark is the sum of the assignment, mid-semester and final exam marks
Lectures
Two slots
– Mon 10:00-12:00 PSYC G6
– Thu 11:00-12:00 PSYC G6
Exact schedule on the web site
Partial notes will be posted on the web site
– bring a copy to the lecture
Attendance at lectures and labs is strongly recommended
– Attendance at labs will be recorded
Course Web Site
http://cs.anu.edu.au/courses/comp4300
We will use Wattle only for lecture recordings
Laboratories
Start in week 3 (March 2nd)
– See the web page for the detailed schedule
4 sessions available
– Mon 15:00-17:00 N113
– Tue 13:00-15:00 N114
– Wed 14:00-16:00 N113
– Fri 12:00-14:00 N113
Who cannot make any of these?
Not assessed, but will be examined
People
Alistair Rendell (convener)
– CSIT Bldg Rm N226 (and N338)
– [email protected]
– Phone 6125 4386
Joseph Antony (lecturer)
– Senior HPC Data Specialist, NCI
– NCI Bldg 143 (near JCSMR)
– [email protected]
– Phone 6125 5988
Gaurav Mitra (tutor)
– PhD student, Computer Systems
– CSIT Bldg Rm 230
– [email protected]
– Phone 6125 9658
Course Communication
Course web page
– cs.anu.edu.au/courses/comp4300
Bulletin board (forum, available from Streams)
– cs.anu.edu.au/streams
At lectures and in labs
Email
– [email protected]
In person
– Office hours (Alistair): Thu 12:00-13:00 (after the lecture)
– Email for an appointment if you want another time
Useful Books
Principles of Parallel Programming, Calvin Lin and Lawrence Snyder, Pearson International Edition, ISBN 978-0-321-54942-6
Introduction to Parallel Computing, 2nd ed., Grama, Gupta, Karypis and Kumar, Addison-Wesley, ISBN 0201648652 (electronic version accessible online from the ANU library; search for the title)
Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Barry Wilkinson and Michael Allen, Prentice Hall, 2nd edition, ISBN 0131405632
…and others on the web page
Questions so far!?