COMP4300/COMP8300 Parallel Systems
Alistair Rendell and Joseph Antony
Research School of Computer Science
Australian National University
Concept and Rationale
The idea
– Split your program into bits that can be executed simultaneously
Motivation
– Speed, Speed, Speed… at a cost-effective price
– If we didn't want it to go faster we would not be bothered with the hassles of parallel programming!
Reduce the time to solution to acceptable levels
– No point waiting 1 week for tomorrow's weather forecast
– Simulations that take months to run are not useful in a design environment
Sample Application Areas
Fluid flow problems
– Weather forecasting/climate modelling
– Aerodynamic modelling of cars, planes, rockets, etc.
Structural mechanics
– Building, bridge, car, etc. strength analysis
– Car crash simulation
Speech and character recognition, image processing
Visualization, virtual reality
Semiconductor design, simulation of new chips
Structural biology, molecular-level design of drugs
Human genome mapping
Financial market analysis and simulation
Data mining, machine learning
Games programming
World Climate Modeling
Atmosphere divided into 3D regions or cells
Complex mathematical equations describe conditions in each cell, e.g. pressure, temperature, velocity
– Conditions change according to neighbouring cells
– Updates repeated frequently as time passes
– Cells are affected by more distant cells the longer-range the forecast
Assume
– Cells are 1x1x1 mile to a height of 10 miles: 5x10^8 cells
– 200 flops to update each cell per timestep
– 10-minute timesteps for a total of 10 days
100 days on a 100 Mflop machine
10 minutes on a Tflop machine
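The estimate above can be reproduced with a quick back-of-envelope script (the slide's headline figures are order-of-magnitude numbers):

```python
# Back-of-envelope cost of the climate-model example, using the
# slide's assumptions: 5e8 cells, 200 flops per cell per timestep,
# and 10-minute timesteps over 10 days.
cells = 5 * 10**8
flops_per_cell = 200
timesteps = 10 * 24 * 6                   # 10 days of 10-minute steps = 1440

flops_per_step = cells * flops_per_cell   # 1e11 flops per timestep
total_flops = flops_per_step * timesteps  # 1.44e14 flops in total

for name, rate in [("100 Mflop/s", 1e8), ("1 Tflop/s", 1e12)]:
    seconds = total_flops / rate
    print(f"{name}: {seconds:,.0f} s ({seconds / 86400:.2f} days)")
```

The point is the ratio, not the exact figures: four orders of magnitude in machine speed turns a wait of weeks into minutes.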
ParallelSystems@ANU: NCI
NCI: National Computational Infrastructure
– http://nci.org.au
History: APAC established in 1998 with a $19.5M grant from the federal government; NCI created in 2007
Current NCI collaboration agreement (2012–15)
– Major collaborators: ANU, CSIRO, BoM, GA
– Universities: Adelaide, Monash, UNSW, UQ, Sydney, Deakin, RMIT
– University consortia: Intersect (NSW), QCIF (Queensland)
Co-investment (for recurrent operations):
– 2007: $0M; 2008: $3.4M; 2009: $6.4M; 2011: $7.5M; 2012: $8.5M; 2013: $11M; 2014: $11+M (to provide for all recurrent operations)
Current infrastructure: Data Centre
• New Data Centre: $24M (opened Nov. 2012); machine room: 920 sq. m
• Power (after 2014 upgrades): 4.5 MW raw capacity; 1 MW UPS; 2 x 1.1 MVA Cummins generators
• Cooling in two loops:
– Server: 2 x 1.8 MW Carrier chillers; 3 x 0.8 MW "free cooling" heat exchangers; 18 deg C; 75 l/sec pump rate
– Data: 3 x 0.5 MW Carrier chillers; 15 deg C
• PUE: approx. 1.25
NCI: Raijin, a Petascale Supercomputer
Raijin supercomputer (commissioned June 2013)
– 57,472 cores (Intel Xeon Sandy Bridge, 2.6 GHz) in 3592 compute nodes
– approx. 160 TBytes of main memory
– Mellanox InfiniBand FDR interconnect (52 km of cable)
– approx. 10 PBytes of usable fast filesystem (for short-term scratch space, apps, home directories)
– Power: 1.5 MW max. load
– Cooling systems: 100 tonnes of water
– 24th fastest in the world on debut (November 2012); first petaflop system in Australia (November 2014: #52)
• Fastest filesystem in the southern hemisphere
• Custom monitoring and deployment
• Custom kernel
• Highly customised PBS Pro scheduler
NCI's integrated high-performance environment
[Diagram: Raijin HPC compute nodes, Raijin login/data-mover nodes, NCI data movers and a VMware cloud, connected by a Raijin 56Gb FDR IB fabric, a /g/data 56Gb FDR IB fabric and 10 GigE links. Storage comprises the Raijin high-speed filesystem /short (7.6 PB); the persistent global parallel filesystems /g/data1 (~7 PB) and /g/data2 (~6 PB); /home, /system, /images and /apps; and the Massdata tape archive (1.0 PB cache, 20 PB tape). External links run to the Huxley DC and the Internet.]
ParallelSystems@DCS
Bunyip: tsg.anu.edu.au/Projects/Bunyip
– 192-processor PC cluster
– winner of the 2000 Gordon Bell prize for best price/performance
High Performance Computing Group
– Jabberwocky cluster
– Saratoga cluster
– Sunnyvale cluster
The Rise of Parallel Computing

Year   Hardware                         Languages
1950   Early designs                    Fortran I (Backus, 57)
1960   Integrated circuits              Fortran 66
1970   Large-scale integration          C (72)
1980   RISC and PC                      C++ (83), Python 1.0 (89)
1990   Shared and distributed parallel  MPI, OpenMP, Java (95)
2000   Faster, better, hotter           Python 2.0 (00)
2010   Throughput oriented              CUDA, OpenCL

Parallelism became an issue for programmers from the late 80s
People began compiling lists of big parallel systems
November 2014 Top500 (NCI now number 52)
Planning the Future
Top500 Supercomputers
[Graphs: growth in ANU/NCI's computing performance (measured in TFlops) since 1987, with architecture and capability determined by research and innovation drivers; and international Top500 supercomputer growth since 1993 (red: the #1 machine each year; yellow: the #500 machine each year; blue: the sum of all machines). The graphs show growth factors of between 8 and 9 times every 3 years.]
Transitioning Australia to its HPC Future
Moore's Law: 'Transistor density will double approximately every two years.'
Dennard Scaling: 'As MOSFET features shrink, switching time and power consumption will fall proportionately.'
…which led to higher Hertz and faster flops
We also had increased node performance
Agarwal, Hrishikesh, Keckler and Burger, "Clock Rate versus IPC", ISCA 2000
[Figure data (after Agarwal et al.): feature size, die area, and fraction of the chip reachable in one clock cycle: 250 nm, 400 mm², 100%; 180 nm, 450 mm², 100%; 130 nm, 566 mm², 82%; 100 nm, 622 mm², 40%; 70 nm, 713 mm², 19%; 50 nm, 817 mm², 6.5%; 35 nm, 937 mm², 1.9%.]
Until the chips became too big…
…so multiple cores appeared on chip
…until we hit a bigger problem…
2004: Sun releases the SPARC IV with dual cores, heralding the start of multicore
…the end of Dennard scaling…
Dennard, Gaensslen, Yu, Rideout, Bassous and Leblanc, IEEE SSC, 1974
Dennard scaling: 'As MOSFET features shrink, switching time and power consumption will fall proportionately.' ✗
Moore's Law: 'Transistor density will double approximately every two years.' ✓
…ushering in…
…a new philosophy in processor design

1960-2010                    2010-?
Few transistors              No shortage of transistors
No shortage of power         Limited power
Maximize transistor utility  Minimize energy
Generalize                   Customize
…and a fundamentally new set of building blocks for our petascale systems
Petascale and Beyond: Challenges and Opportunities

As a whole: sheer number of nodes (Tianhe-2 has the equivalent of >3M cores)
• Programming language/environment
• Fault tolerance

Within a domain: heterogeneity (the Tianhe system uses CPUs and GPUs)
• What to use when
• Co-location of data with the unit processing it

On the chip: energy minimization (processors already have frequency and voltage scaling)
• Minimize data size and movement, including use of just enough precision
• Specialized cores
In RSCS we are working in all these areas
Other Important Parallelism
Multiple instruction units:
– Typical processors issue ~4 instructions per cycle
Instruction pipelining:
– Complicated operations are broken into simple operations that can be overlapped
Graphics engines:
– Use multiple rendering pipes and processing elements to render millions of polygons a second
Interleaved memory:
– Multiple paths to memory that can be used at the same time
Input/Output:
– Disks are striped, with different blocks of data written to different disks at the same time
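The payoff of instruction pipelining in the list above comes down to simple arithmetic: a k-stage pipeline that accepts a new operation every cycle finishes n operations in k + n - 1 cycles, versus n * k cycles if each operation must finish before the next starts. A small sketch (the stage and operation counts here are illustrative, not from the lecture):

```python
# Cycle counts for n operations on a k-stage functional unit:
# unpipelined, each operation takes k cycles back to back;
# pipelined, a new operation enters every cycle after a k-cycle fill.
def unpipelined_cycles(n, k):
    return n * k

def pipelined_cycles(n, k):
    return k + (n - 1)

n, k = 1000, 5
print(unpipelined_cycles(n, k), pipelined_cycles(n, k))  # 5000 1004
speedup = unpipelined_cycles(n, k) / pipelined_cycles(n, k)
print(f"speedup ~ {speedup:.2f}x")  # approaches k for large n
```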
Parallelisation
Split the program up and run parts simultaneously on different processors
– On N computers the time to solution should (ideally!) be 1/N
– Parallel Programming: the art of writing the parallel code!
– Parallel Computer: the hardware on which we run our parallel code!
COMP4300 will discuss both
Beyond raw compute power, other motivations include
– Enabling more accurate simulations in the same time (finer grids)
– Providing access to huge aggregate memories
– Providing more and/or better input/output capacity
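The "split up and run simultaneously" idea can be sketched with Python's standard multiprocessing module (a toy example, not course material; the workload and process count are arbitrary):

```python
# A toy sketch of splitting one job across N processes, assuming the work
# (summing squares over [0, n)) divides into independent chunks.
# Uses the Unix "fork" start method so workers inherit the functions here.
from multiprocessing import get_context

def partial_sum(lo, hi, out):
    # Each worker computes its own chunk and reports the partial result.
    out.put(sum(i * i for i in range(lo, hi)))

def parallel_sum(n, nprocs):
    ctx = get_context("fork")
    out = ctx.Queue()
    step = n // nprocs
    procs = []
    for p in range(nprocs):
        lo, hi = p * step, (n if p == nprocs - 1 else (p + 1) * step)
        procs.append(ctx.Process(target=partial_sum, args=(lo, hi, out)))
    for pr in procs:
        pr.start()
    total = sum(out.get() for _ in procs)   # combine the partial results
    for pr in procs:
        pr.join()
    return total

if __name__ == "__main__":
    # Ideally the time to solution is ~1/N of the serial time; in practice
    # process start-up and communication costs eat into that.
    print(parallel_sum(10**6, 4))
```

The split/compute/combine shape here is exactly the pattern the course develops at scale, whether the workers are processes on one node or ranks on thousands of them.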
Parallelism in a Single "CPU" Box
Multiple instruction units:
– Typical processors issue ~4 instructions per cycle
Instruction pipelining:
– Complicated operations are broken into simple operations that can be overlapped
Graphics engines:
– Use multiple rendering pipes and processing elements to render millions of polygons a second
Interleaved memory:
– Multiple paths to memory that can be used at the same time
Input/Output:
– Disks are striped, with different blocks of data written to different disks at the same time
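The striping idea in the last bullet can be sketched in a few lines (a toy model; real striped storage works on fixed-size byte blocks):

```python
# Toy disk striping: block i of a file goes to disk i % ndisks, so runs of
# consecutive blocks can be written to different disks simultaneously.
def stripe(blocks, ndisks):
    disks = [[] for _ in range(ndisks)]
    for i, block in enumerate(blocks):
        disks[i % ndisks].append((i, block))
    return disks

layout = stripe([f"block{i}" for i in range(8)], 4)
for d, contents in enumerate(layout):
    print(f"disk {d}: {contents}")
# disk 0 holds blocks 0 and 4, disk 1 holds 1 and 5, and so on
```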
Health Warning!
Course is run every other year
– Drop out this year and it won't be repeated until 2017
It's a 4000/8000-level course; it's supposed to:
– Be more challenging than a 3000-level course!
– Be less well structured
– Have greater expectations of you
– Have more student participation
– Be fun!
Nathan Robertson, 2002 honours student:
– "Parallel systems and thread safety at Medicare: 2/16 understood it - the other guy was a $70/hr contractor"
Learning Objectives
Parallel Architecture:
– Basic issues concerning the design and likely performance of parallel systems
Specific Systems:
– Will make extensive use of research systems in our group and also visit the NCI facilities
Programming Paradigms:
– Distributed and shared memory, things in between, Grid computing
Parallel Algorithms:
– Numeric and non-numeric
The future
Course Content
Discussion of schedule: http://cs.anu.edu.au/courses/COMP4300/schedule.html
Commitment and Assessment
The pieces
– 2 lectures per week (~30 core lecture hours)
– 6 labs (not marked, solutions provided)
– 2 assignments (40%)
– 1 mid-semester exam (1 hour, 15%)
– 1 final exam (3 hours, 45%)
Final mark is the sum of the assignment, mid-semester and final exam marks
Lectures
Two slots
– Mon 10:00-12:00 PSYC G6
– Thu 11:00-12:00 PSYC G6
Exact schedule on the web site
Partial notes will be posted on the web site
– bring a copy to the lecture
Attendance at lectures and labs is strongly recommended
– Attendance at labs will be recorded
Course Web Site
http://cs.anu.edu.au/courses/comp4300
We will use Wattle only for lecture recordings
Laboratories
Start in week 3 (March 2nd)
– See the web page for the detailed schedule
4 sessions available
– Mon 15:00-17:00 N113
– Tue 13:00-15:00 N114
– Wed 14:00-16:00 N113
– Fri 12:00-14:00 N113
Who cannot make any of these?
Not assessed, but will be examined
People
Alistair Rendell (convener)
– CSIT Bldg Rm N226 (and N338)
– [email protected]
– Phone 6125 4386
Joseph Antony (lecturer)
– Senior HPC Data Specialist, NCI
– NCI Bldg 143 (near JCSMR)
– [email protected]
– Phone 6125 5988
Gaurav Mitra (tutor)
– PhD student, Computer Systems
– CSIT Bldg Rm 230
– [email protected]
– Phone 6125 9658
Course Communication
Course web page
– cs.anu.edu.au/courses/comp4300
Bulletin board (forum, available from Streams)
– cs.anu.edu.au/streams
At lectures and in labs
Email
– [email protected]
In person
– Office hours (Alistair): Thu 12:00-13:00 (after the lecture)
– Email for an appointment if you want another time
Useful Books
Principles of Parallel Programming, Calvin Lin and Lawrence Snyder, Pearson International Edition, ISBN 978-0-321-54942-6
Introduction to Parallel Computing, 2nd ed., Grama, Gupta, Karypis and Kumar, Addison-Wesley, ISBN 0201648652 (electronic version accessible online from the ANU library; search for the title)
Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Barry Wilkinson and Michael Allen, Prentice Hall, 2nd edition, ISBN 0131405632
…and others on the web page
Questions so far!?