scientific computing, high-performance computing and data …rzon/publ/39_2019... · 2019-03-25 ·...

8
Scientific Computing, High-Performance Computing and Data Science in Higher Education Lessons learned from the training program in a super-computer center in Canada Marcelo Ponce Erik Spence Ramses van Zon Daniel Gruner [email protected] [email protected] [email protected] [email protected] SciNet HPC Consortium, University of Toronto Toronto, ON, Canada ABSTRACT We present an overview of current academic curricula for Scientific Computing, High-Performance Computing and Data Science. After a survey of current academic and non-academic programs across the globe, we focus on Canadian programs and specifically on the education program of the SciNet HPC Consortium, using its detailed enrollment and course statistics for the past six to seven years. Not only do these data display a steady and rapid increase in the demand for research-computing instruction, they also show a clear shift from traditional (high performance) computing to data- oriented methods. It is argued that this growing demand warrants specialized research computing degrees. CCS CONCEPTS Social and professional topics Model curricula; Comput- ing education programs; Accreditation; KEYWORDS Training and Education, Scientific Computing, High-Performance Computing, Data Science, Master’s Program 1 INTRODUCTION The computational resources available to scientists and engineers have never been greater. The ability to conduct simulations and analyses on thousands of low-latency-connected computer proces- sors has opened up a world of computational research which was previously inaccessible. Researchers using these resources rely on scientific-computing and high-performance-computing techniques; a good understanding of computational science is no longer optional Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright ©JOCSE, a supported publication of the Shodor Education Foundation Inc. © 2019 Journal of Computational Science Education https://doi.org/10.22369/issn.2153-4136/10/1/5 for researchers in a variety of fields, ranging from bioinformatics to astrophysics. Similarly, the advent of the internet has resulted in a paradigm where information can be more easily captured, transmitted, stored, and accessed than ever before. Researchers, both in academia and industry [19], have been actively developing technologies and ap- proaches for dealing with data of previously-unimaginable scale. Researchers’ ability to analyze data has never been greater, and many branches of science are actively using these newly-developed techniques. Unfortunately, the skills needed to harness these computational and data-empowered resources are not systematically taught in university courses [20]. Some researchers, postdocs and students may find non-academic programs to fill this void, but others either do not have access to these courses or cannot commit the time to follow them. These researchers typically end up learning by trial and error, or by self-teaching, which is rarely optimal. A number of academic programs that aim to address this issue have emerged at universities across the world (a few examples are [11, 28]). Some of these grew out of the training efforts of High Performance Computing (HPC) centres and organizations (e.g.[27]). Recognizing the need for additional skills in their users, computing centres such as those in the XSEDE partnership [29] in the U.S., PRACE in Europe, and Compute/Calcul Canada have been providing local and online HPC training as part of their user support. Universities have also developed graduate programs in both Scientific and High-Performance Computing, to train scientists and engineers in the use of these computational resources. A more-recent complement to these graduate programs is the de- velopment of the degree in Data Science (DS), that is, degrees which focus on the analysis of data, especially at scale. These degrees come in a variety of forms, from multi-year academic graduate programs to specialized private-sector training. These programs are in strong demand at present, as large companies have discovered the value in thoroughly analysing the vast quantities of customer data which they collect. It is expected that this field will continue to grow, and academic programs will continue to be introduced to meet this demand. Volume 10, Issue 1 Journal of Computational Science Education 24 ISSN 2153-4136 January 2019

Upload: others

Post on 04-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Scientific Computing, High-Performance Computing and Data …rzon/publ/39_2019... · 2019-03-25 · Computing, High-Performance Computing and Data Science. After a survey of current

Scientific Computing, High-Performance Computing and DataScience in Higher Education

Lessons learned from the training program in a super-computer center in Canada

Marcelo PonceErik Spence

Ramses van ZonDaniel Gruner

[email protected]@[email protected]

[email protected] HPC Consortium, University of Toronto

Toronto, ON, Canada

ABSTRACTWe present an overview of current academic curricula for ScientificComputing, High-Performance Computing and Data Science. Aftera survey of current academic and non-academic programs acrossthe globe, we focus on Canadian programs and specifically onthe education program of the SciNet HPC Consortium, using itsdetailed enrollment and course statistics for the past six to sevenyears. Not only do these data display a steady and rapid increase inthe demand for research-computing instruction, they also show aclear shift from traditional (high performance) computing to data-oriented methods. It is argued that this growing demand warrantsspecialized research computing degrees.

CCS CONCEPTS• Social and professional topics→Model curricula;Comput-ing education programs; Accreditation;

KEYWORDSTraining and Education, Scientific Computing, High-PerformanceComputing, Data Science, Master’s Program

1 INTRODUCTIONThe computational resources available to scientists and engineershave never been greater. The ability to conduct simulations andanalyses on thousands of low-latency-connected computer proces-sors has opened up a world of computational research which waspreviously inaccessible. Researchers using these resources rely onscientific-computing and high-performance-computing techniques;a good understanding of computational science is no longer optional

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the fullcitation on the first page. To copy otherwise, or republish, to post on servers or toredistribute to lists, requires prior specific permission and/or a fee. Copyright ©JOCSE,a supported publication of the Shodor Education Foundation Inc.

© 2019 Journal of Computational Science Educationhttps://doi.org/10.22369/issn.2153-4136/10/1/5

for researchers in a variety of fields, ranging from bioinformaticsto astrophysics.

Similarly, the advent of the internet has resulted in a paradigmwhere information can be more easily captured, transmitted, stored,and accessed than ever before. Researchers, both in academia andindustry [19], have been actively developing technologies and ap-proaches for dealing with data of previously-unimaginable scale.Researchers’ ability to analyze data has never been greater, andmany branches of science are actively using these newly-developedtechniques.

Unfortunately, the skills needed to harness these computationaland data-empowered resources are not systematically taught inuniversity courses [20]. Some researchers, postdocs and studentsmay find non-academic programs to fill this void, but others eitherdo not have access to these courses or cannot commit the time tofollow them. These researchers typically end up learning by trialand error, or by self-teaching, which is rarely optimal.

A number of academic programs that aim to address this issuehave emerged at universities across the world (a few examplesare [11, 28]). Some of these grew out of the training efforts ofHigh Performance Computing (HPC) centres and organizations(e.g. [27]). Recognizing the need for additional skills in their users,computing centres such as those in the XSEDE partnership [29]in the U.S., PRACE in Europe, and Compute/Calcul Canada havebeen providing local and online HPC training as part of their usersupport. Universities have also developed graduate programs inboth Scientific andHigh-Performance Computing, to train scientistsand engineers in the use of these computational resources.

A more-recent complement to these graduate programs is the de-velopment of the degree in Data Science (DS), that is, degrees whichfocus on the analysis of data, especially at scale. These degrees comein a variety of forms, from multi-year academic graduate programsto specialized private-sector training. These programs are in strongdemand at present, as large companies have discovered the valuein thoroughly analysing the vast quantities of customer data whichthey collect. It is expected that this field will continue to grow, andacademic programs will continue to be introduced to meet thisdemand.

Volume 10, Issue 1 Journal of Computational Science Education

24 ISSN 2153-4136 January 2019

Page 2: Scientific Computing, High-Performance Computing and Data …rzon/publ/39_2019... · 2019-03-25 · Computing, High-Performance Computing and Data Science. After a survey of current

The SciNet HPC Consortium [15, 26] is the high-performance-computing center at the University of Toronto. SciNet provides bothcomputational resources and specialized user support for Canadianacademic researchers, and as members of its support team, we areresponsible for training researchers, postdocs and graduate stu-dents at the University of Toronto in HPC techniques. In this paperwe provide a review of the current state of graduate-level ScientificComputing, High-Performance Computing and Data Science aca-demic programs, and endeavour to share our training and teachingexperiences as they might result useful for other super-computercenters. The paper is organized as follows. In Sec. 2 we discuss howcomputation has become an essential ingredient in many academicresearch endeavours; in Sec. 3 we review the current status of edu-cation in the areas of High-Performance and Scientific Computing.In Sec. 4 we present the Data Science education efforts at the aca-demic and non-academic level. We conclude with final remarks andperspectives for the future in Sec. 5 and 6.

2 THE ROLE OF HPC IN CURRENTRESEARCH

It is essentially impossible to give an overview of all the uses ofcomputational methods in current scientific and academic research.Wewill nonetheless attempt a review of at least some computationalscientific research, since the way computers are used in research(and other realms of inquiry) influences what should be taught tostudents.

Astrophysical computational research inherently involves largescale computing, such as the simulation of gravitational systemswith many particles, magnetohydrodynamic systems, and bodiesinvolving general relativity. Atmospheric physics requires largeweather and climate models with many components to be simu-lated in a variety of scenarios. High-energy particle physics projects,such as the ATLAS project at CERN, require the analysis of manyrecorded events from large experiments, while other high-energyphysics projects have a need for large scale simulations (e.g. latticeQCD investigations). Condensed matter physics, quantum chem-istry and materials science projects must often numerically solvequantum mechanical problems in one approximation or another;the approximations make the calculations feasible but still rely onlarge computing resources. Soft condensed matter and chemicalbiophysics research often involve molecular dynamics or MonteCarlo simulations, and frequently require sampling a large parame-ter space. Engineering projects can involve optimizing or analyzingcomplex airflow or combustion, leading to large fluid dynamicscalculations. Bioinformatics often involves vast quantities of ge-nomic input data to be compared or assembled, requiring manysmall computations. Research in other data-driven fields such associal science, humanities, health care and biomedical science [13],is also starting to outgrow the capacity of individual workstationsand standard tools.

Examining these cases in more detail, one can distinguish differ-ent ways in which research relies on computational resources:

(1) Research that is inherently computational, i.e. it cannot rea-sonably be done without a computer, but which requiresrelatively minor resources (e.g. a single workstation).

(2) Research that investigates problems that do not fit on a singlecomputer, and therefore rely on multiple computing nodesattached through a low-latency network.

(3) Research that requires many relatively small computations.(4) Research that requires access to a large amount of storage,

but not necessarily a lot of other resources.(5) Research that requires access to a lot of storage, on which

many relatively small calculations are performed.

The distinction between the various types of research determinesthe appropriate systems and tools to use. Graduate students thatare just starting their research often do not have enough knowledgeto make the distinction (as nobody has taught them about this), letalone select and ask for the resources that they will need [20].

Note that all five categories fall under “Advanced Research Com-puting” (ARC). The categories are not mutually exclusive, but re-search of the second and third kind are usually associated with HPC,while the fourth and fifth, and sometimes the first, are associatedwith Data Science (DS). Although there is a lot of overlap betweenHPC and DS, these fields require somewhat different techniques.For that reason, we will consider separate programs for HPC andDS.

3 PROGRAMS IN HIGH-PERFORMANCECOMPUTING

Much of the research presented in the previous section falls in thecategory of Scientific Computing (SC). The growth in the compu-tational approach to research, both academic and industrial, hasprompted some institutions to develop graduate-level programscrafted to teach the skills needed to design, program, debug and runsuch calculations. These programs, having been in developmentfor more than two decades, are now fairly widespread and mature,and are known by the names of “Scientific Computing” or “Compu-tational Science and Engineering”. Scientific Computing graduatedegrees are offered internationally in several graduate educationhubs around the world (U.S., England, Germany, Switzerland, etc.,— lists of which can be found at the SIAM and HPC University [14]websites). Canada is no exception, with at least eight universitiesoffering graduate-level programs in Computational Science. Theseprograms include one-year and two-year Master’s programs, aswell as Ph.D. programs. Most of these programs (e.g. the ones shownin Tables 1 and 4) require a final thesis. The projects and thesesare faculty-guided research projects and are usually one-term long,though, as with all research, these projects sometimes take longer.

A typical curriculum for a two-year Master’s program in Scien-tific Computing (in this case from San Diego State University) ispresented in Table 1. It clearly shows that Scientific Computing hasits roots in research in the physical sciences; the program heavilyemphasizes numerical analysis and scientific modelling. In someways this is not surprising: computers are very apt at solving suchproblems, and the formalism of the physical sciences often lendsitself easily to computer programming. Other topics of study whichare also often encountered in these programs include finite elementanalysis, matrix computations, optimization, stochastic methods,differential equations and stability.

In contrast to Scientific Computing, HPC requires somewhatwider knowledge; its practitioners need to understand more than

Journal of Computational Science Education Volume 10, Issue 1

January 2019 ISSN 2153-4136 25

Page 3: Scientific Computing, High-Performance Computing and Data …rzon/publ/39_2019... · 2019-03-25 · Computing, High-Performance Computing and Data Science. After a survey of current

Course Name TypeIntroduction to Computational Science requiredComputational Methods for Scientists requiredComputational Modelling for Scientists requiredComputational Imaging requiredScientific Computing requiredApplied Mathematics for Computational Scientists requiredSeminar Problems in Computational Science requiredComputational and Applied Statistics electiveComputational Database Fundamentals electiveResearch requiredThesis required

Table 1: The curriculum for the two-year Master’s programat the Computational Science Research Center at San DiegoState University [9]; this forms a good example of a typicalScientific Computing graduate program.

just the theoretical and numerical principles. They require skillssuch as serial and parallel programming (often in several languages,and on different platforms) and scripting, as well familiarity withnumerics, data handling, statistics, and supercomputers and theirtechnical bottlenecks. In addition, these practitioners are usuallynot computer science students, so they must cope without thatbackground. This is somewhat unavoidable as they need to havesufficient domain knowledge as well. Much of the same holds truefor Data Science.

3.1 Academic HPC ProgramsThere are not many academic programs that focus on HPC. Partof the reason may be that such programs require access to a high-performance-computing machine so that students can develop theirskills on real hardware, in a real supercomputing environment.These machines require multiple computing nodes which are con-nected by a low-latency network. Fortunately, such systems donot need to be local: as long as the machine is accessible throughthe internet the machine could be used for teaching. Nonetheless,having the hardware local to the students lends advantages, sincemost of the administrators and analysts of the system are typicallyavailable to assist students with optimizing their codes and devel-oping good computational strategies. Not surprisingly, the majorityof the currently offered HPC graduate programs seem to have beendeveloped by or in conjunction with supercomputer centres. Avail-ability of HPC resources at the local institution or department levelvaries significantly by country or region. While most top-rankedinstitutions in the US have HPC facilities, the funding model inCanada [6] is such that all the HPC resources are concentrated in afew national centers (like SciNet).

As examples of High Performance Computing programs, the Uni-versity of Edinburgh (UK) offers an MSc in High Performance Com-puting, the Universitat Politècnica de Catalunya/Barcelona Tech(Spain) offers a Master in High Performance Computing and a Mas-ter program in Data Mining and Business Intelligence, SISSA/ICTPin Italy offers a Master in High Performance Computing, while acollaboration between the University ITMO (Russia) and the Uni-versity of Amsterdam (Netherlands) offers a Double-Degree Master

Programs in Applied Mathematics and Informatics (ComputationalScience). Note that many of these programs emerged from locationswith a very strong tradition and consolidated background in HPC.

3.2 SciNet’s HPC ProgramsManyHPC centers provide training for their users to fill the computational-skills gap for the wider scientific community, such as, SDSC, PSC,TACC, NCSA, BSC, EPCC, CSCS, SHARCNET, AceNet, CalculQuébec, among many others. In its capacity as an HPC and ARCcentre based at the University of Toronto, SciNet has developedseveral education and training programs [24] aimed at helping stu-dents and users obtain the skills and knowledge required to get themost out of advanced-research-computing resources. SciNet’s train-ing events and courses are currently taken by researchers, postdocs,and graduate students across many different departments and evenfrom outside of the University of Toronto (UofT). Some of thesecourses are considered part of the official curricula and count asgraduate level courses within the Ph.D. programs at UofT.

Initially SciNet provided training specifically oriented towardScientific Computing, with the purpose of maximizing user pro-ductivity. These early classes focused on parallel programming(MPI and OpenMP), best coding practices, debugging, and otherscientific computing needs. Over the years the breadth of courseshas grown, with classes offered in Linux shell programming, par-allel input/output, advanced C++ and Fortran coding, accelera-tor programming, and visualization. This is in addition to the an-nual HPC Summer School which SciNet runs in collaboration withtwo other HPC centres within Compute Ontario[4], CAC[1] andSHARCNET[2]. The summer school is a week-long intensive work-shop1 on HPC topics, and more recently, also Data Science andMedical/Bio-Informatics topics. Table 2 shows the curriculum ofour summer school for last year.

Table 3 shows the training events and courses that SciNet hasalready been teaching in the areas of HPC and Data Science. Thenumber and types of classes which SciNet teaches have grownsignificantly [25]. This can be seen in Figure 1, which presents thetotal student class-hours taught by SciNet over the last six years.This remarkable growth is a testament to the latent need for thismaterial to be taught. The need for this training is supported by theenrolment statistics: our students constitute 35% of SciNet’s totalusers, clearly showing that even in a specialized audience this kindof training is still needed.

For several years the four-week graduate-style classes offered bySciNet have been accepted for graduate class credit by the depart-ments of Physics, Chemistry and Astrophysics at UofT. This waspossible by accepting the classes as “modular” (or “mini”) courses,one-third semester long, and bundling three such classes into afull-semester course. This arrangement has been so popular withstudents and faculty that the Physics Department recently listedSciNet’s winter twelve-week HPC class in the course calendar [5],allowing graduate students from other departments in the univer-sity to take the class for university credit. Following exactly thesame path, a six-week module designed to teach students from Bio-logical and Medical Sciences, basics on data analysis with emphasis

1Similar initiatives and trends are being carried on by the International HPC SummerSchool [3] within the theme of HPC Challenges in Computational Sciences.

Volume 10, Issue 1 Journal of Computational Science Education

26 ISSN 2153-4136 January 2019

Page 4: Scientific Computing, High-Performance Computing and Data …rzon/publ/39_2019... · 2019-03-25 · Computing, High-Performance Computing and Data Science. After a survey of current

HPC Stream Data Science Stream BioInformatics/MedicalStream

Introduction toHPC & SciNet

Introduction toLinux Shell

PLINK

Shared MemoryProgrammingwith OpenMP

Introduction to R Next GenerationSequencing

ProgrammingClusters withMPI

Data Science withPython

RNASeq Analysis

ProgrammingGPUs withCUDA

Parallel R for DataScience

Python for MRI Analy-sis

Debugging andProfiling

Python for HPC(Parallel Python)

Image Analysis at Scale

Bring your owncode Labs

Visualization withPython

Machine Learning forNeuroImaging

Scientific Visualiza-tion Suites

R for MRI Analysis

Public Datasets for Neu-roimagingHCP with HPC: Sur-face Based Neuroimag-ing Analysis

Table 2: Curriculum for the 2017 SciNet’s summer school,showing three parallel streams: the traditional High-Performance Computing, Data Science and adding a newparallel stream on BioInformatics/Medical applications.This type of events not only benefit the students and partici-pants of the summer school, but also enables collaborationsbetween departments and consortia, as it was this particularcase, where part of the trainingwas delivered in partnershipwith colleagues from SHARCNET and the Center for Addi-tion and Mental Health (CAMH).

Figure 1: Attendance hours at SciNet training and educationevents, per year, for all SciNet classes and Data Science spe-cific classes.

Course Name Certificate CreditsQuantitative Applications for Data Analysis‡ DS/SC 36Introduction to Computational BioStatisticswith R‡2

DS/SC 36

Introduction to Neural Network Program-ming

DS/SC 4

Neural Network Programming DS/SC 16Advanced Neural Networks DS/SC 4Intro to Apache Spark DS 3Machine Learning Workshop DS/SC 6Hadoop Workshop DS 3Scalable Data Analysis Workshop DS 12Relational Database Basics DS/SC 6Storage and Input/Output in Large Scale Sci-entific Projects

DS/SC 6

Workflow Optimization for Large ScaleBioinformatics

DS/HPC/CS 6

Python for High Performance Computing DS/HPC/SC 12Parallel R DS/HPC/SC 3Python GUIs with Tkinter DS/SC 2Scientific Visualization DS/SC 6Visualizing Data with Paraview DS/SC 6Scientific Computing for Physicists‡3 HPC/SC 36Intro to Programming with Python‡ SC/DS 12Intro to Research Computing with Python‡ SC 8Intro to High Performance Computing HPC 3Advanced Parallel Scientific Computing HPC 12Intro to Scientific C++ HPC/SC 6Intro to Scientific Programming with Mod-ern FORTRAN

HPC/SC 7

Intro to Parallel Programming HPC/SC 7Programming Clusters with Message Pass-ing Interface

HPC/SC 12

Programming Shared Memory Systemswith OpenMP

HPC/SC 6

Practical Parallel Programming Intensive HPC/SC 32Intro to GPGPU with CUDA HPC/SC 9Programming GPUs with CUDA HPC/SC 12SciNet/CITA CUDA GPU Minicourse HPC/SC 12Coarray Fortran HPC/SC 2Parallel I/O HPC/SC 6Debugging, Optimization, Best Practices HPC/SC 6HPC Best Practices and Optimization HPC/SC 3HPC Debugging HPC/SC 3Intro to the Linux Shell HPC/SC 2Advanced Shell Programming HPC/SC 3Seminars in High Performance Computing HPC/SC 4Seminars in Scientific Computing HPC/SC 4

Table 3: Courses taught by SciNet onData Science (DS),High-Performance Computing (HPC), and Scientific Computing(SC). ‡ denotes courses already recognized at the Univer-sity of Toronto in several departments, such as, Physics, As-trophysics, Chemistry, Ecology and Evolutionary Biology,Institute of Medical Science, Physical and EnvironmentalSciences, Engineering, as graduate level credits. We shouldalso mention that students from other universities in theprovince of Ontario –e.g. Ryerson University– were allowedto enroll in some of these graduate courses for credit via aprovincial academic transfer program.

Journal of Computational Science Education Volume 10, Issue 1

January 2019 ISSN 2153-4136 27

Page 5: Scientific Computing, High-Performance Computing and Data …rzon/publ/39_2019... · 2019-03-25 · Computing, High-Performance Computing and Data Science. After a survey of current

in statistical analysis using the R Statistical Language [18], has nowbecome one of the most popular courses at the Institute of MedicalScience at UofT [7].

The skills that SciNet aims to transfer are rare and sought-after,and complement and enhance the skills students learn in regularcurricula. That is why SciNet has developed a set of CertificatePrograms [23], that users and students can pursue in Scientific Com-puting,High Performance Computing, and/orData Science, once theyhave completed enough credit-hours. As a document that provesthe holder has highly competitive skills, and in lieu of graduatecredit for most SciNet courses, the certificates are in high demand.In a resounding endorsement of our teaching, thus far studentshave completed a total of 161 certificates (116 in Scientific Comput-ing, 23 in High-Performance Computing and 22 in Data Science4).According to the current registration and trends, we are projectingto have over 250 certificates completed by mid-2018. Moreover,anecdotal feedback from some of our students suggests that thecourses and their SciNet certificates were instrumental in successfuljob applications, in industry and in the financial sector.

4 PROGRAMS IN DATA SCIENCEThe wide adoption of the internet in the professional and the per-sonal sphere ushered in the age of “Big Data”. The ease of recordingof people’s online behaviour, and the ability to rapidly move data,lead to a large, diffuse, complex amount of data waiting to be minedfor useful information. Because of the typically large size of thedata special hardware and training are often needed. In contrast toScientific Computing and HPC, there are many applications of DataScience in the private sector, e.g. in the medical, banking, retail,insurance, and internet industries.

Bioinformatics also has a large component in the academic world.Though a more-recent addition to the HPC disciplines, the bioinfor-matics field is well-populated with graduate programs, a testamentto its rapid growth and latent demand. Its emergence as a majoruser of HPC systems has resulted in the development of “Master’sof Bioinformatics”, and related degrees. A typical Master’s programis outlined in Table 4, this one from the Indiana-Purdue Univer-sity at Indianapolis. While having many features in common witha more-standard SC Master’s program, such as the study of pro-gramming and algorithms, it exhibits the particular needs of thebioinformatics community, stressing the importance of geneticsand biological processes, and a lesser emphasis on mathematicsand programming theory.

Degrees in Data Science are relatively new, with the first Mas-ter’s program only being introduced in the U.S. (by North CarolinaState University) in 2007. A sample of some of the classes offeredin one such program is given in Table 5. As can be seen, these pro-grams have a strong focus on data, with statistics, machine learning,and databases being their standard focus. Analyzing data that aretoo big to fit on a standard desktop computer requires specializedequipment; such training is also part of these graduate-level pro-grams, as indicated by the presence of the “Cloud Computing” and“Distributed Systems” classes. Like typical graduate-level programs,these degrees usually require a final project or thesis to be presentedby the student.

4This number has triplicated since the launch of the program in 2016.

Course Name TypeIntroduction to Bioinformatics requiredSeminar in Bioinformatics requiredBiological Database Management requiredProgramming for Life Science requiredHigh Throughput Data in Biology requiredMachine Learning in Bioinformatics electiveComputational System Biology electiveStructural Bioinformatics electiveTransitional Bioinformatics Applications electiveAlgorithms in Bioinformatics electiveStatistical Methods in Bioinformatics electiveComputational Methods for Bioinformatics electiveNext Generation Genomic Data Analytics electiveNext Generation Sequencing electiveBioinformatics Project required

Table 4: The curriculum for the “Project Track” two-year Master’s of Science in Bioinformatics at the IndianaUniversity-Purdue University in Indianapolis; this forms agood example of a typical Bioinformatics graduate program.

One could argue that the novelty of methods in Data Science isdue to its roots in Business Analytics (BA), where the objective isto make a decision. The field has certainly grown beyond that, andBA is now considered a sub-field of Data Science. Another more-recently developed sub-field is in the realm of health care (“HealthInformatics”). Because these sub-fields are directly applicable to theprivate sector (and the associated revenue streams these present)these have become the most-commonly implemented post-graduateprograms. The Business Analytics programs focus on using data torefine business administration, as well as develop marketing strate-gies. Health Informatics programs concentrate on using clinicaldata to optimize health care processes.

The practical focus of Data Science is reflected in the presenceof an internship in the Data Science curriculum listed in Table 5.

Course Name TypeAnalysis of Algorithms requiredMachine Learning requiredAdvanced Database Concepts requiredDistributed Systems electiveAdvanced Database Concepts electiveCloud Computing electiveInformation Retrieval electiveData Mining electiveWeb Mining electiveApplied Machine Learning electiveComplex Networks and Their Applications electiveRelational Probabilistic Models electiveInternship in Data Science elective

Table 5: A selection of the courses available for the Master’sof Data Science at the Indiana University.

Volume 10, Issue 1 Journal of Computational Science Education

28 ISSN 2153-4136 January 2019

Page 6: Scientific Computing, High-Performance Computing and Data …rzon/publ/39_2019... · 2019-03-25 · Computing, High-Performance Computing and Data Science. After a survey of current

Internships in such programs are similar to other co-op-type ar-rangements: the student works with an employer for a semester,allowing the student to gain hands-on experience applying theskills learnt during such period.

4.1 Academic Data Science ProgramsGraduate level programs in Data Science are not difficult to find. Forinstance, programs in bioinformatics (a data-driven field), can befound on theweb site of the International Society for ComputationalBiology. It speaks to the rapid rise of the field bioinformatics, thatthere are more bioinformatics programs available than ScientificComputing programs. Examples lists of other Data Science pro-grams can be found at the NCSU analytics web site, the online busi-ness analytics programs site of predictiveanalyticstoday.comand at online.coursereport.com. Initially these programs werenot as common as Scientific Computing, due to the fact that Data Sci-ence was a relatively new field of study. This situation has changeddramatically in the last few years. For instance in the United Statesalone there are hundred of Data Science degrees and certificates[8]. Among those programs about half are offered in the fields ofBusiness Analytics and Health Informatics, with the other half be-ing proper Data Science programs. There has also been a surge inMachine Learning and Artificial Intelligence programs, which spandata science and scientific computing.

4.2 Non-academic Data Science trainingThe demand for Data Science skills (or “Data Analytics” skills asthey are often called in the private sector) is so high [19] that theprivate sector has developed programs to meet the growing demand.See, for instance, on skilledup.com, which contains a list of data sci-ence boot-camps. The format of these classes is varied, though theyare all oriented toward a “boot-camp” format: some are in person,some online; some are one-week long, others twelve weeks. Theseprograms are very applied, often with one-on-one mentorship witha seasoned Data Analytics expert. They also include direct contactwith possible future employers.

Moreover, a great number of these training programs are notfocused on developing analytical thinking or problem-solving skills,[12] but rather are aimed at Ph.D.s and postdocs, whose problem-solving skills are assumed to have already developed. This allowsthem to focus on the technical training relevant to the job market.Some of these programs are free, some of them offer fellowships,and many of them charge on the order of 10-30 thousand US-dollarsfor a training period of, typically, three months. These programshave acquired such a level of popularity among young and recentgraduates that the companies offering these programs have startedto perform evaluation tests in order to assess which candidates aremore suitable to be accepted to their programs. Perhaps the mostappealing part for trainees is the networking platform offered bythese programs, as in most cases they provide the opportunity tointeract with actual companies looking for new talent and avoidrecruitment layers.

Institutions in the non-profit arena are also starting to offerprograms on Data Science. For instance the Fields Institute, a tra-ditional institution for mathematical research, has offered severalworkshops and courses, and developed a thematic program on Big

Data. Other examples include the International Centre for Theo-retical Physics (ICTP) and the International School for AdvancedStudies (SISSA), prestigious institutions with a well known tradi-tion in theoretical physics, now offering training in “Research DataScience”.

HPC centers are also venturing into Data Science training, offer-ing workshops on R, Hadoop, machine learning, etc. SciNet startedoffering classes with greater data-oriented content (cf. Table 3) in2013, with a four-week class in scientific analysis using Python.Having now finished its third year, the class remains popular, withabout twenty students taking the class each year. The 2015 fallsemester also inaugurated SciNet’s first “Data Science with R” class,focusing on data analysis techniques using the R language. Thisclass was very popular with over twenty-five students finishingthe course, and most students requesting a second installment withmore advanced material. Continuing its growth in the Data Sci-ence area, in the last year SciNet has held workshops in machinelearning, scalable data analysis, and Apache Spark.

Comparing the student- and taught-hours per year shown inFig. 2, one sees that the Data Science classes have been growingconsistently, both in absolute as well as relative numbers (DataScience related courses roughly constituted less than 2% (2012), 4%(2013), 21% (2014), 22% (2015), 27% (2016) and 28% (2017) of the totalclasses taught respectively in each year. While attendance to DataScience courses, follows even a more significant trend: 1% (2012),23% (2013 and 2014), 29% (2015), 40% (2016) and 41% (2017). Weproject for the current year a growth of at least 10%, positioning atequal levels the traditional Scientific Computing and Data Sciencetrainings.

5 DISCUSSIONAs mentioned above, scientific computing is used by scientists andengineers as never before, and graduate-level programs in ScientificComputing are numerous in Canada and around the world. Incontrast, the development of HPC and Data Science programs isin its early stages, both in academia and the private sector. Theseprograms are being developed to meet the continued shortfall inskill in these areas, with the McKinsey Global Institute estimatingthat the United States will be short 140,000 to 190,000 data analyticsprofessionals by 2018 [16].

One may wonder whether online learning could not satisfy thisneed. A few examples of MOOCs (Massively Open Online Courses)in HPC and Data Science do exist5. However, seeing the growth inenrolment in SciNet’s in-person courses and the summer schoolover the years (cf. Figs. 1, 2 and 3) shows that many students stillprefer the face-to-face format.

Similarly, one may wonder why certificate programs do not suf-fice for HPC and DS education. As successful as these programs are,they have a few disadvantages. Firstly, they are mostly collectionsof fairly specific technical training: this leaves no room for morefundamental material. Secondly, it is also hard to encorporate an in-ternship or thesis into such a certificate. Finally, certificates tend tocarry less weight than degrees, and, in line with this, the demand forfor-credit courses is larger than that for not-for-credit courses, as

5E.g. https://www.citutor.org/, http://www.hpc-training.org, https://www.futurelearn.com.

Journal of Computational Science Education Volume 10, Issue 1

January 2019 ISSN 2153-4136 29

Page 7: Scientific Computing, High-Performance Computing and Data …rzon/publ/39_2019... · 2019-03-25 · Computing, High-Performance Computing and Data Science. After a survey of current

Figure 2: Total student hours (top) and taught hours (bottom)per year, for SciNet’s Data Science and Scientific Computingrelated courses. The trends clearly show the need for train-ing courses in both fronts, at the level of university recog-nized courses.

Figure 3: Average class size trend for Scientific Computingand Data Science courses at SciNet.

our experience with SciNet’s Scientific Computing graduate coursehas shown.

A degree program in HPC or DS could offer more academic andfundamental education, which would leave the student with theanalytical skills and high-level knowledge to stay on top of theirfield regardless of changes in computational technology.

6 CONCLUSIONWe believe we have offered quantitative evidence demonstratingthe need for programs in higher education in High-PerformanceComputing and Data Science, in particular in our own institution.

If the qualitative evidence of this seems somewhat limited, it shouldbe understood that existing HPC and DS programs (academic andnon-academic) are still relatively new. While some such programsare already in existence, in many cases students must use non-academic options, or teach the material to themselves. Academicprograms would offer the benefit of not just teaching specific techni-cal skills, but an education in the fundamentals of HPC and DS andinstilling the analytical skills needed to adopt to an ever-changingtechnological landscape.

We have reviewed existing academic and non-academic educa-tion programs, in both HPC and DS. In light of this review, weinvite eager readers to look at a longer and complementary versionof this manuscript [17] where we present the design for a tenta-tive Master’s programs in HPC and DS, based on the examplesdiscussed here and drawing from the experience and enrollmentstatistics in not-for-credit training in HPC and DS by the SciNetHPC Consortium at the University of Toronto.

Getting well-founded graduate programs off the ground willnot be without challenges. It will likely involve partnerships anddiscussions with other departments and institutes in order to offera stronger and multi-disciplinary program. Existing HPC Centers,which already operate across multiple disciplines, can play a fun-damental role in bringing together such programs. Thus, we havedescribed SciNet’s path in developing and transgressing the usualrole of training events for users, into full credited graduate coursesrecognized at the university level for masters and doctorate de-grees. Moreover, the creation of these graduate courses allowed usto leverage two key elements:

(1) Several of our analysts involved in the educational effortsgot recognized by obtaining graduate teaching affiliationswith the corresponding institutes/departments that weresponsoring the courses.

(2) Interact with students at a more professional level, collabo-rating and participating in research projects, allowing us tolaunch research initiatives [22] for which we already havesuccessful cases [10, 21] and many more in the process ofbeing developed. Perhaps the most important point here isto note that these collaborations were catalysed by the directinteraction with our students and researchers.

We hope this might help other super-computer centers transition asimilar path and develop strategies to play a fundamental role inthe multidisciplinary fronts that are HPC, SC and DS.

REFERENCES[1] Centre for Advanced Computing (CAC) at Queen’s University, Kingston, Ontario,

Canada. https://cac.queens.ca/. (????). Accessed: 2016-06-14.[2] 2001. Shared Hierarchical Academic Research Computing Network (SHARCNET).

https://www.sharcnet.ca. (2001). Accessed: 2016-04-08.[3] 2010. International HPC Summer School. http://www.ihpcss.org/. (2010). Ac-

cessed: 2016-04-08.[4] 2014. Compute Ontario. http://computeontario.ca/. (2014). Accessed: 2016-04-08.[5] 2016. Scientific Computing for Physicists (PHY1610H). http://www.physics.

utoronto.ca/students/graduate-courses/current/phy1610h-f. (2016). Accessed:2016-04-06.

[6] 2017. Canada’s Fundamental Science Review. http://www.sciencereview.ca.(2017). Accessed: 2018-06-02.

[7] 2017. Introduction to Computational Biostatistics with R (MSC1090H). https://ims.utoronto.ca/wp-content/uploads/2014/04/IMS-syllabus-2017-REV-MSC1090H.pdf. (2017). Accessed: 2018-04-11.

[8] 2018. Masters’s in Data Science. http://www.mastersindatasience.org/. (2018).Accessed: 2018-06-02.

Volume 10, Issue 1 Journal of Computational Science Education

30 ISSN 2153-4136 January 2019

Page 8: Scientific Computing, High-Performance Computing and Data …rzon/publ/39_2019... · 2019-03-25 · Computing, High-Performance Computing and Data Science. After a survey of current

[9] Computational Science Research Center at San Diego State University. 2011.Master’s programs. http://www.csrc.sdsu.edu/masters.html. (2011). Accessed:2016-04-06.

[10] Philippe Berger and George Stein. 2018. A volumetric deep Convolutional NeuralNetwork for simulation of dark matter halo catalogues. (2018). arXiv:astro-ph.CO/1805.04537

[11] Helmar Burkhart, Danilo Guerrera, and Antonio Maffia. 2014. Trusted High-performance Computing in the Classroom. In Proceedings of the Workshop onEducation for High-Performance Computing (EduHPC ’14). IEEE Press, Piscataway,NJ, USA, 27–33. https://doi.org/10.1109/EduHPC.2014.13

[12] Ed Lazowska. November 2015. Dear GeekWire: A coding bootcamp is not areplacement for a computer science degree. http://www.geekwire.com/2015/dear-geekwire-a-coding-bootcamp-is-not-a-replacement-for-a-computer-science-degree.(November 2015). Accessed: 2016-04-06.

[13] Petri Ihantola, Arto Vihavainen, Alireza Ahadi, Matthew Butler, Jürgen Börstler,Stephen H. Edwards, Essi Isohanni, Ari Korhonen, Andrew Petersen, Kelly Rivers,Miguel Ángel Rubio, Judy Sheard, Bronius Skupas, Jaime Spacco, Claudia Szabo,and Daniel Toll. 2015. Educational Data Mining and Learning Analytics inProgramming: Literature Review and Case Studies. In Proceedings of the 2015ITiCSE on Working Group Reports (ITICSE-WGR ’15). ACM, New York, NY, USA,41–63. https://doi.org/10.1145/2858796.2858798

[14] Scott Lathrop, Ange Mason, Steven I. Gordon, and Marcio Faerman. 2013. HPCUniversity: Getting Information About Computational Science Professional andEducational Resources and Opportunities for Engagement. In Proceedings of theConference on Extreme Science and Engineering Discovery Environment: Gatewayto Discovery (XSEDE ’13). ACM, New York, NY, USA, Article 67, 4 pages. https://doi.org/10.1145/2484762.2484771

[15] Chris Loken, Daniel Gruner, Leslie Groer, Richard Peltier, Neil Bunn, MichaelCraig, Teresa Henriques, Jillian Dempsey, Ching-Hsing Yu, Joseph Chen,L Jonathan Dursi, Jason Chong, Scott Northrup, Jaime Pinto, Neil Knecht, andRamses Van Zon. 2010. SciNet: Lessons Learned from Building a Power-efficientTop-20 System and Data Centre. Journal of Physics: Conference Series 256, 1 (2010),012026. http://stacks.iop.org/1742-6596/256/i=1/a=012026

[16] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs,Charles Roxburgh, and Angela Hung Byers. 2011. Big data: The next frontier forinnovation, competition, and productivity. Technical Report. McKinsey GlobalInstitute.

[17] Marcelo Ponce, Erik Spence, Daniel Gruner, and Ramses van Zon. 2016. DesigningAdvanced Research Computing Academic Programs. CoRR abs/1604.05676 (2016).arXiv:1604.05676v2 http://arxiv.org/abs/1604.05676v2

[18] R Core Team. 2017. R: A Language and Environment for Statistical Computing.(2017). https://www.R-project.org/

[19] Alex Radermacher and Gursimran Walia. 2013. Gaps Between Industry Expec-tations and the Abilities of Graduates. In Proceeding of the 44th ACM TechnicalSymposium on Computer Science Education (SIGCSE ’13). ACM, New York, NY,USA, 525–530. https://doi.org/10.1145/2445196.2445351

[20] Mark Richards and Scott Lathrop. 2011. A Training Roadmap for New HPC Users.In Proceedings of the 2011 TeraGrid Conference: Extreme Digital Discovery (TG ’11).ACM, New York, NY, USA, Article 56, 7 pages. https://doi.org/10.1145/2016741.2016801

[21] Alejandro Saettone, Jyoti Garg, Jean-Philippe Lambert, Syed Nabeel-Shah,Marcelo Ponce, Alyson Burtch, Cristina Thuppu Mudalige, Anne-Claude Gingras,Ronald E. Pearlman, and Jeffrey Fillingham. 2018. The bromodomain-containingprotein Ibd1 links multiple chromatin-related protein complexes to highly ex-pressed genes in Tetrahymena thermophila. Epigenetics & Chromatin 11, 1 (09Mar 2018), 10. https://doi.org/10.1186/s13072-018-0180-6

[22] SciNet HPC Consortium. Research SciNet. https://www.scinethpc.ca/research-scinet/. (????). Accessed: 2018-04-13.

[23] SciNet HPC Consortium. SciNet’s Certificate Program. http://www.scinethpc.ca/scinet-certificate-program. (????). Accessed: 2016-04-06.

[24] SciNet HPC Consortium. SciNet’s Education website. https://support.scinet.utoronto.ca/education. (????). Accessed: 2016-04-06.

[25] SciNet HPC Consortium. SciNet’s Training, Outreach and Education. http://www.scinethpc.ca/training-outreach-and-education. (????). Accessed: 2016-04-06.

[26] SciNet HPC Consortium. SciNet’s website. http://www.scinet.utoronto.ca. (????).Accessed: 2016-04-06.

[27] Kris Stewart. 1995. HPC Undergraduate Curriculum Development at SDSUUsing SDSC Resources. In Proceedings of the 1995 ACM/IEEE Conference onSupercomputing (Supercomputing ’95). ACM, New York, NY, USA, Article 19.https://doi.org/10.1145/224170.224209

[28] Guy Tel-Zur. 2014. PDC Education in the BGU ECE Department. In Proceedings ofthe Workshop on Education for High-Performance Computing (EduHPC ’14). IEEEPress, Piscataway, NJ, USA, 15–20. https://doi.org/10.1109/EduHPC.2014.9

[29] J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood,S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. R. Scott, and N. Wilkins-Diehr.2014. XSEDE: Accelerating Scientific Discovery. Computing in Science Engineering

16, 5 (Sept 2014), 62–74. https://doi.org/10.1109/MCSE.2014.80

Journal of Computational Science Education Volume 10, Issue 1

January 2019 ISSN 2153-4136 31