hpc compass 2016_17


Upload: marco-van-der-hart

Post on 10-Jan-2017






Page 1: HPC Compass 2016_17

HIGH PERFORMANCE COMPUTING Technology Compass 2016/17








More Than 35 Years of Experience

in Scientific Computing


1980 marked the beginning of a decade where numerous startups

were created, some of which later transformed into big players in

the IT market. Technical innovations brought dramatic changes

to the nascent computer market. In Tübingen, close to one of

Germany’s oldest and most renowned universities, transtec was founded.

In the early days, transtec focused on reselling DEC computers

and peripherals, delivering high-performance workstations to

university institutes and research facilities. In 1987, SUN/Sparc

and storage solutions broadened the portfolio, enhanced by IBM/

RS6000 products in 1991. These were the typical workstations

and server systems for high performance computing then, used

by the majority of researchers worldwide. Today, transtec is the largest Germany-based, Europe-wide provider of comprehensive HPC solutions. Many of our HPC clusters have entered the TOP500 list of the world’s fastest computing systems.

Thus, given this background and history, it is fair to say that transtec looks back on more than 35 years of experience in scientific computing; our track record shows more than 1,000 HPC installations. With this experience, we know exactly what customers’

demands are and how to meet them. High performance and ease

of management – this is what customers require today. HPC systems are certainly required to deliver peak performance, as their name indicates, but that is not enough: they must also be easy to handle.

Unwieldy design and operational complexity must be avoided or

at least hidden from administrators and particularly users of HPC

computer systems.

High Performance Computing has changed in the course of the

last 10 years. Whereas traditional topics like simulation, large-

SMP systems or MPI parallel computing still are important cor-

nerstones of HPC, new areas of applicability like Business Intelli-

gence/Analytics, new technologies like in-memory or distributed

databases or new algorithms like MapReduce have entered the

arena which is now commonly called HPC & Big Data.

Familiarity with Hadoop or R is nowadays part of the basic skill set of any serious HPC player.

Dynamic deployment scenarios are becoming more and more important for realizing private (or even public) HPC cloud scenarios. The OpenStack ecosystem, object storage concepts like Ceph, application containers with Docker – all of these were once bleeding-edge technologies, but have since become state of the art.

Therefore, this brochure is now called “Technology Compass

HPC & Big Data”. It covers all of the above-mentioned currently

discussed core technologies. Besides those, more familiar tradi-

tional updates like the Intel Omni-Path architecture are covered

as well, of course. No matter which ingredient an HPC solution requires – transtec masterfully combines excellent, well-chosen existing components into a fine-tuned, customer-specific, and thoroughly designed HPC solution.

Choosing a transtec HPC solution means opting for the most intensive customer care and the best service in HPC. Our experts will

be glad to bring in their expertise and support to assist you at

any stage, from HPC design to daily cluster operations, to HPC

Cloud Services.

Last but not least, transtec HPC Cloud Services provide custom-

ers with the possibility to have their jobs run on dynamically pro-

vided nodes in a dedicated datacenter, professionally managed

and individually customizable. Numerous standard applications

like ANSYS, LS-Dyna, OpenFOAM, as well as lots of codes like

Gromacs, NAMD, VMD, and others are pre-installed, integrated

into an enterprise-ready cloud management environment, and

ready to run.

Have fun reading the

transtec HPC Compass 2016/17!


High Performance Computing for Everyone

Interview with Dr. Oliver Tennert

Director Technology Management

and HPC Solutions


Dr. Tennert, what’s the news in HPC?

Besides becoming more and more of a commodity, the HPC market has seen two major changes within the last couple of years. On the one hand, HPC has reached out to adjacent areas that would not have been recognized as part of HPC five years ago. By now, however, every serious HPC company is trying to position itself within the area of “big data”, which is often only vaguely understood but sometimes put more precisely as “big data analytics”.

Superficially, it’s all about analyzing large amounts of data, which has always been part of HPC. But the areas of application are multiplying: “big data analytics” is a vital application of business intelligence/business analytics. Along with this, new methods, new concepts, and new technologies are emerging. The other change has to do with the way HPC environments are going to be deployed and managed. Dynamic provisioning of resources for private-cloud scenarios is becoming more and more important, and OpenStack has contributed a lot to this. In the near future, dynamic deployment will be as much of a commodity as efficient but static deployment is today.

What’s the deal with “dynamic deployment” anyway?

Dynamic deployment is the core provisioning technology in

cloud environments, public or private ones. Virtual machines

with different OSes, web servers, compute nodes, or any kind

of resources are only deployed when they are needed and


Dynamic deployment makes sure that the available hardware

capacity will be prepared and deployed for exactly the purpose for which it is needed. The advantage is less idle capacity and better utilization than is possible in static environments, where, for example, inactive Windows servers block hardware capacity at a time when Linux-based compute nodes are needed.
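To put numbers on the idle-capacity argument, here is a minimal sketch. It assumes a hypothetical 16-node pool and a made-up demand; the function names and figures are illustrative only and do not describe any particular cloud platform.

```python
def idle_nodes_static(partitions, demand):
    """Idle nodes when every node type has a fixed number of nodes."""
    return sum(max(partitions[kind] - demand.get(kind, 0), 0) for kind in partitions)


def unmet_nodes_static(partitions, demand):
    """Requested nodes that a fixed partitioning cannot serve."""
    return sum(max(demand[kind] - partitions.get(kind, 0), 0) for kind in demand)


def idle_nodes_dynamic(total_nodes, demand):
    """Idle nodes when any node can be deployed for any purpose."""
    return max(total_nodes - sum(demand.values()), 0)


def unmet_nodes_dynamic(total_nodes, demand):
    """Requested nodes that cannot be served even with free reassignment."""
    return max(sum(demand.values()) - total_nodes, 0)


# Hypothetical pool: 4 nodes statically reserved for Windows, 12 for Linux.
partitions = {"windows": 4, "linux": 12}
# Hypothetical demand at one point in time.
demand = {"windows": 1, "linux": 15}

print("static :", idle_nodes_static(partitions, demand), "idle,",
      unmet_nodes_static(partitions, demand), "unmet")
print("dynamic:", idle_nodes_dynamic(16, demand), "idle,",
      unmet_nodes_dynamic(16, demand), "unmet")
```

In this made-up case the static layout leaves three Windows nodes idle while three Linux requests wait; with dynamic deployment the same 16 nodes cover the whole demand.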

Everyone is discussing OpenStack at the moment. What do you

think of the hype and the development of this platform? And

what is its importance for transtec?

OpenStack was indeed already being discussed at a time when it still

had little business relevance for an HPC solutions provider.

But this is the case with all innovations, be they hardware- or


But the fact that OpenStack is a joint development effort of several big IT players has added to the demand for private clouds in the market on the one hand, and to its relevance as the software solution for them on the other.

We at transtec have several customers who either ask us to

deploy a completely new OpenStack environment, or would

like us to migrate their existing one to OpenStack.

OpenStack has the reputation of being a very complex beast,

given the many components. Implementing it can take a while.

Is there any way to simplify this?

Indeed, life can be rough if you do it wrong. But there is no need for that: no one installs Linux from scratch nowadays, except to play around and learn. Instead, everyone uses professional distributions with commercial support.

There are special OpenStack distributions too, with commer-

cial support: especially within HPC, Bright Cluster Manager

is often the distribution of choice, but long-established players like Red Hat also offer simple deployment frameworks.

Our customers can rest assured that they get an individually

optimized OpenStack environment, which is not only de-

signed by us, but also installed and maintained throughout.

It has always been our job to hide any complexity underneath

a layer of “ease of use, ease of management”.



Who do you want to address with this topic?

Everyone who is interested in a highly flexible, but also effec-

tive utilization of their existing hardware capacity. We address

customers from academia, as well as SMB customers who, like big enterprises, have to turn to new concepts and new technologies.

As said before: in our opinion, this way of resource provision-

ing will become mainstream soon.

Let’s turn to the other big topic: What is the connection between

HPC and Big Data Analytics, and what does the market have to

say about this?

As already mentioned, the HPC market is reaching out towards

big data analytics. “HPC” and “Big Data” are mentioned in the same breath at

various key events like ISC or Supercomputing.

This is not without reason, of course: as different as the technologies in use or the goals to reach may be, there is one common denominator, namely a high demand for computational capacity. An HPC solution provider like transtec, who is capable of designing and deploying a one-petaflop HPC cluster, can naturally also handle the provisioning of, say, a ten-petabyte Hadoop cluster for big data analytics for a customer.

But the target group may be a different one: classical HPC

mostly takes place within the value chain of the customer. The

users are developers. Important areas for big data analytics,

however, can be found in business controlling, i.e. BI/BA, but

also in companies’ operations departments, when operation-

al decisions have to be made depending on certain vital state

data, in real time. This is what “internet of things” is all about.



What are the technical differences between HPC and Big Data

Analytics, and what do both have in common?

Both areas share the “scale-out” approach. It is not the single

server that gets more and more powerful; rather, acceleration is achieved via parallelization and massive load balancing across several servers.

In HPC, parallel computational jobs that run using the MPI

programming interface are state of the art. Simulations run on

many compute nodes at once, thus cutting down wall-clock

time. Within the context of big data analytics, distributed and

above all in-memory databases are becoming more and more

important. Database requests can thus be handled by several

servers, and with a low latency.

How do companies handle the backup and the archiving of

these huge amounts of data?

Given this enormous amount of data, standard backup strategies or archiving concepts indeed no longer apply. Instead, for big data analytics, data are stored redundantly as a matter of principle, to become independent of hardware failures. Storage concepts like Hadoop or object storage like Ceph provide this redundancy in data storage by design.
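The idea of redundant storage can be sketched in a few lines. The placement rule below (hash the block name, then take the next nodes in a ring) is a deliberately simple assumption for illustration; it is neither the HDFS placement policy nor Ceph's algorithm, and all names are hypothetical. Three copies per block is used here because it is a common default, for example in HDFS.

```python
import hashlib


def place_replicas(block_id, nodes, replicas=3):
    """Pick `replicas` distinct nodes for a block, starting at a hashed offset."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    digest = hashlib.sha256(block_id.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


def surviving_blocks(placement, failed_nodes):
    """Return the blocks that still have a replica on a healthy node."""
    failed = set(failed_nodes)
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


# Hypothetical six-node storage cluster holding 100 data blocks.
nodes = [f"node{i:02d}" for i in range(1, 7)]
placement = {f"block-{i}": place_replicas(f"block-{i}", nodes) for i in range(100)}

# Two nodes fail at once: every block still has at least one healthy copy.
alive = surviving_blocks(placement, ["node01", "node02"])
print(len(alive), "of", len(placement), "blocks still readable")
```

Because every block lives on three different nodes, any two simultaneous node failures leave at least one copy intact, which is why a separate backup of the raw data becomes less critical.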

Moreover, when it comes to big data analytics, the basic premise is that it really does not matter much if a tiny fraction of these huge amounts of data goes missing for some reason. The goal of big data analytics is not to get a precise analytical result out of these often unstructured data, but rather

the quickest possible extraction of relevant information and

a sensible interpretation, whereby statistical methods are of

great importance.

For which companies is it important to do this very quick extrac-

tion of information?

Holiday booking portals that use booking systems have to get an immediate overview of the hotel rooms that are available in any given town, sorted by price, in order to provide customers with this information. Energy or finance companies that trade their assets on exchanges have to know their current portfolio’s value within a fraction of a second to price correctly.

Search engine providers – there are more than just Google –

investigative authorities, credit card clearing houses that have to react quickly to attempted fraud: the list of companies that could make use of big data analytics grows along with technical progress.



Technology Compass: Table of Contents

High Performance Computing
  Performance Turns Into Productivity
  Flexible Deployment With xCAT
  Service and Customer Care From A to Z

OpenStack for HPC Cloud Environments
  What is OpenStack?
  Object Storage with Ceph
  Application Containers with Docker

Advanced Cluster Management Made Easy
  Easy-to-use, Complete and Scalable
  Cloud Bursting With Bright

Intelligent HPC Workload Management
  Moab HPC Suite
  IBM Platform LSF
  Slurm
  IBM Platform LSF 8

Intel Cluster Ready
  A Quality Standard for HPC Clusters
  Intel Cluster Ready Builds HPC Momentum
  The transtec Benchmarking Center

Remote Visualization and Workflow Optimization
  NICE EnginFrame: A Technical Computing Portal
  Remote Visualization
  Desktop Cloud Visualization
  Cloud Computing With transtec and NICE
  NVIDIA Grid Boards
  Virtual GPU Technology

Big Data Analytics with Hadoop
  Apache Hadoop
  HDFS Architecture

NVIDIA GPU Computing – The CUDA Architecture
  What is GPU Computing?
  Kepler GK110/210 GPU Computing Architecture
  A Quick Refresher on CUDA

Intel Xeon Phi Coprocessor
  The Architecture
  An Outlook on Knights Landing and Omni Path

Vector Supercomputing
  The NEC SX Architecture
  History of the SX Series

Parallel File Systems – The New Standard for HPC Storage
  Parallel NFS
  What’s New in NFS 4.1?
  Panasas HPC Storage
  BeeGFS
  General Parallel File System (GPFS)
  What’s new in GPFS Version 3.5
  Lustre

Intel Omni-Path Interconnect
  The Architecture
  Host Fabric Interface (HFI)
  Architecture Switch Fabric Optimizations
  Fabric Software Components

Numascale
  NumaConnect Background
  Technology
  Redefining Scalable OpenMP and MPI
  Big Data
  Cache Coherence
  NUMA and ccNUMA

Allinea Forge for parallel Software Development
  Application Performance and Development Tools
  What’s in Allinea Forge?

Glossary



High Performance Computing – Performance Turns Into Productivity

High Performance Computing (HPC) has been with us from the very beginning of the

computer era. High-performance computers were built to solve numerous problems

which the “human computers” could not handle.

The term HPC just hadn’t been coined yet. More important, some of the early principles

have changed fundamentally.

Engineering | Life Sciences | Automotive | Price Modelling | Aerospace | CAE | Data Analytics



Performance Turns Into Productivity

HPC systems in the early days were much different from those

we see today. First, we saw enormous mainframes from large

computer manufacturers, including a proprietary operating

system and job management system. Second, at universities

and research institutes, workstations made inroads and scien-

tists carried out calculations on their dedicated Unix or VMS

workstations. In either case, if you needed more computing

power, you scaled up, i.e. you bought a bigger machine. Today

the term High-Performance Computing has gained a fundamen-

tally new meaning. HPC is now perceived as a way to tackle

complex mathematical, scientific or engineering problems. The

integration of industry standard, “off-the-shelf” server hardware

into HPC clusters facilitates the construction of computer networks with a level of power that no single system could ever achieve.

The new paradigm for parallelization is scaling out.

Computer-supported simulation of realistic processes (so-called Computer Aided Engineering – CAE) has established itself as a third key pillar in the field of science and research alongside theory and experimentation. It is nowadays inconceivable that an aircraft manufacturer or a Formula One racing team would operate

without using simulation software. And scientific calculations,

such as in the fields of astrophysics, medicine, pharmaceuticals

and bio-informatics, will to a large extent be dependent on super-

computers in the future. Software manufacturers long ago rec-

ognized the benefit of high-performance computers based on

powerful standard servers and ported their programs to them.

The main advantage of scale-out supercomputers is just that:

they are infinitely scalable, at least in principle. Since they are

based on standard hardware components, such a supercomputer can be given more power whenever the computational capacity of the system is no longer sufficient, simply by adding nodes of the same kind. A cumbersome switch to a

different technology can be avoided in most cases.
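Why the scalability holds only "in principle" can be illustrated with Amdahl's law, a standard model that is not part of the original text: if only a fraction p of a program can run in parallel, the speedup on n nodes is 1 / ((1 - p) + p / n). The function name and the 95 % figure below are assumptions chosen for illustration.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


# Assumed example: 95 % of the run time can be parallelized.
for nodes in (1, 16, 256, 4096):
    print(f"{nodes:5d} nodes -> speedup {amdahl_speedup(0.95, nodes):6.2f}")
```

Under this assumption, adding nodes keeps helping, but the speedup can never exceed 20, however many nodes are added. How far a cluster can usefully grow therefore depends on the application as much as on the hardware.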

“transtec HPC solutions are meant to provide customers with

unparalleled ease-of-management and ease-of-use. Apart from

that, deciding for a transtec HPC solution means deciding for the

most intensive customer care and the best service imaginable.”

Dr. Oliver Tennert | Director Technology Management & HPC Solutions



Variations on the theme: MPP and SMP

Parallel computations exist in two major variants today. Applications running in parallel on multiple compute nodes are frequently so-called Massively Parallel Processing (MPP) applications. MPP indicates that the individual processes can each

utilize exclusive memory areas. This means that such jobs are

predestined to be computed in parallel, distributed across the

nodes in a cluster. The individual processes can thus utilize the

separate units of the respective node – especially the RAM, the

CPU power and the disk I/O.

Communication between the individual processes is implement-

ed in a standardized way through the MPI software interface

(Message Passing Interface), which abstracts the underlying

network connections between the nodes from the processes.

However, the MPI standard (version 3.1 at the time of writing) merely requires source code compatibility, not binary compatibility, so an off-the-shelf application usually needs specific versions of MPI libraries in order to run. Examples of MPI implementations are Open MPI, MPICH, MVAPICH2, Intel MPI or, for Windows clusters, MS-MPI.

If the individual processes engage in a large amount of communication, the response time of the network (latency) becomes important. Latency in a Gigabit Ethernet or a 10GE network is typically around 10 µs. High-speed interconnects such as InfiniBand reduce latency by a factor of 10, down to as low as 1 µs. Therefore,

high-speed interconnects can greatly speed up total processing.
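A rough back-of-the-envelope model shows when latency matters. The sketch below assumes that sending one message costs latency plus size divided by bandwidth, and it deliberately gives both networks the same assumed bandwidth (1.25 GB/s, i.e. 10 Gbit/s) so that only the latency figures quoted above differ. Real networks are more complex, and all names are illustrative.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Simple cost model: fixed latency plus size divided by bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def total_time(n_messages, message_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to send n messages one after another under the simple model."""
    return n_messages * transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s)


BANDWIDTH = 1.25e9           # assumed equal for both networks (10 Gbit/s)
ETHERNET_LATENCY = 10e-6     # about 10 microseconds, as quoted in the text
INFINIBAND_LATENCY = 1e-6    # about 1 microsecond, as quoted in the text

# Communication-heavy job: one million tiny 8-byte messages.
small_eth = total_time(1_000_000, 8, ETHERNET_LATENCY, BANDWIDTH)
small_ib = total_time(1_000_000, 8, INFINIBAND_LATENCY, BANDWIDTH)

# Bulk transfer: a single 1 GB message.
large_eth = total_time(1, 10**9, ETHERNET_LATENCY, BANDWIDTH)
large_ib = total_time(1, 10**9, INFINIBAND_LATENCY, BANDWIDTH)

print(f"many small messages: {small_eth:.2f} s vs {small_ib:.2f} s")
print(f"one large message:   {large_eth:.4f} s vs {large_ib:.4f} s")
```

With many small messages the low-latency network is almost ten times faster in this model, while for one large transfer latency is negligible. This is why tightly coupled MPI jobs profit most from a low-latency interconnect.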

The other frequently used variant is that of SMP applications.

SMP, in this HPC context, stands for Shared Memory Processing.

It involves the use of shared memory areas, the specific imple-

mentation of which is dependent on the choice of the underlying

operating system. Consequently, SMP jobs generally only run on

a single node, where they can in turn be multi-threaded and thus

be parallelized across the number of CPUs per node. For many

HPC applications, both the MPP and SMP variant can be chosen.
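The difference between the two variants can be sketched on a single machine. The example below is only an emulation under stated assumptions: Python threads stand in for workers, a shared list stands in for shared memory, and a queue stands in for message passing. A real MPP job would run separate processes on several nodes and exchange data through an MPI library; the function names are hypothetical.

```python
import queue
import threading


def smp_sum(data, workers=4):
    """Shared-memory style: all workers read one list and fill one shared result list."""
    partial = [0] * workers

    def work(rank):
        # Every worker reads the same `data` object directly.
        partial[rank] = sum(data[rank::workers])

    threads = [threading.Thread(target=work, args=(rank,)) for rank in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(partial)


def mpp_sum(data, workers=4):
    """Message-passing style: each worker owns a private chunk and sends back a message."""
    inbox = queue.Queue()

    def work(private_chunk):
        # The worker only sees its own copy and reports its result as a message.
        inbox.put(sum(private_chunk))

    threads = [
        threading.Thread(target=work, args=(list(data[rank::workers]),))
        for rank in range(workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(inbox.get() for _ in range(workers))


numbers = list(range(1000))
print(smp_sum(numbers), mpp_sum(numbers))
```

Both functions compute the same sum. The shared-memory version only works as long as all workers can reach the same memory, which is why SMP jobs stay on one node, whereas the message-passing version needs nothing but a way to deliver messages.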

Many applications are not inherently suitable for parallel execu-

tion. In such a case, there is no communication between the in-

dividual compute nodes, and therefore no need for a high-speed

network between them; nevertheless, multiple sequential computing jobs can be run simultaneously on each individual node, depending on the number of CPUs.

To ensure optimum computing performance for these applications, one must examine how many CPUs and cores deliver the best results.

We find applications of this sequential type of work typically in

the fields of data analysis or Monte-Carlo simulations.
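A Monte Carlo estimate of pi is a typical example of such independent jobs. In the sketch below every job gets its own seed and never talks to the others; a thread pool stands in for the CPU slots of a node. On a real cluster each job would more likely be a separate process submitted through the workload manager, so the names and the setup here are assumptions for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def monte_carlo_job(seed, samples):
    """One independent job: count random points that fall inside the unit circle."""
    rng = random.Random(seed)  # private generator, no shared state between jobs
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(seeds, samples_per_job, slots):
    """Run one job per seed on `slots` workers and combine the results at the end."""
    seeds = list(seeds)
    with ThreadPoolExecutor(max_workers=slots) as pool:
        hits = list(pool.map(lambda seed: monte_carlo_job(seed, samples_per_job), seeds))
    return 4.0 * sum(hits) / (len(seeds) * samples_per_job)


estimate = estimate_pi(range(8), 20_000, slots=4)
print(f"pi is approximately {estimate:.3f}")
```

The only communication happens when the partial results are added up at the end, so the number of slots changes the run time but not the result.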




xCAT as a Powerful and Flexible Deployment Tool

xCAT (Extreme Cluster Administration Tool) is an open source

toolkit for the deployment and low-level administration of HPC

cluster environments, small as well as large ones.

xCAT provides simple commands for hardware control, node discovery, the collection of MAC addresses, and node deployment with (diskful) or without (diskless) local installation. The

cluster configuration is stored in a relational database. Node

groups for different operating system images can be defined.

Also, user-specific scripts can be executed automatically at in-

stallation time.
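The notion of node groups mapped to operating system images can be illustrated with a small sketch. It is a simplified model written for this text: the bracket-style range syntax, the group names, and the image names are assumptions, and the code does not reproduce xCAT's own parser, commands, or database schema.

```python
import re


def expand_noderange(spec):
    """Expand a spec such as 'node[01-03],login1' into a list of node names."""
    names = []
    for part in spec.split(","):
        part = part.strip()
        match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", part)
        if not match:
            names.append(part)  # plain node name without a range
            continue
        prefix, first, last, suffix = match.groups()
        width = len(first)  # keep zero padding, e.g. 01, 02, ...
        for number in range(int(first), int(last) + 1):
            names.append(f"{prefix}{number:0{width}d}{suffix}")
    return names


def image_for(node, groups, images):
    """Look up the operating system image assigned to the node's group."""
    for group, members in groups.items():
        if node in members:
            return images[group]
    return None


# Hypothetical cluster definition: two node groups with different images.
groups = {
    "compute": expand_noderange("node[01-04]"),
    "gpu": expand_noderange("gpu[1-2]"),
}
images = {"compute": "centos7-diskless", "gpu": "centos7-diskful"}

print(groups["compute"])
print(image_for("gpu2", groups, images))
```

Keeping the definition as data, here in dictionaries and in a real deployment tool in a database, is what allows whole groups of nodes to be reinstalled or switched to another image with a single command.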

xCAT Provides the Following Low-Level Administrative


• Remote console support
• Parallel remote shell and remote copy commands
• Plugins for various monitoring tools like Ganglia or Nagios
• Hardware control commands for node discovery, collecting MAC addresses, remote power switching and resetting of
• Automatic configuration of syslog, remote shell, DNS, DHCP, and ntp within the cluster
• Extensive documentation and man pages

For cluster monitoring, we install and configure the open source

tool Ganglia or the even more powerful open source solution Na-

gios, according to the customer’s preferences and requirements.



Local Installation or Diskless Installation

We offer a diskful or a diskless installation of the cluster nodes.

A diskless installation means the operating system is hosted partially within the main memory; larger parts may or may not be included via NFS or other means. This approach allows for deploying large numbers of nodes very efficiently, and the cluster is up and running within a very short time. Also, updating

the cluster can be done in a very efficient way. For this, only the

boot image has to be updated, and the nodes have to be reboot-

ed. After this, the nodes run either a new kernel or even a new

operating system. Moreover, with this approach, partitioning the

cluster can also be very efficiently done, either for testing pur-

poses, or for allocating different cluster partitions for different

users or applications.

Development Tools, Middleware, and Applications

Depending on the application, optimization strategy, or underlying architecture, different compilers produce code with very different performance. Moreover, different applications, mainly commercial ones, require different MPI implementations. And even

when the code is self-developed, developers often prefer one MPI

implementation over another.

According to the customer’s wishes, we install various compilers,

MPI middleware, as well as job management systems like Para-

station, Grid Engine, Torque/Maui, or the very powerful Moab

HPC Suite for the high-level cluster management.




transtec HPC as a Service

You will get a range of applications like LS-Dyna, ANSYS, Gromacs, NAMD etc. from all kinds of areas pre-installed, integrated into an enterprise-ready cloud and workload management system, and ready to run. Is your application missing?

> Ask us: [email protected]

transtec Platform as a Service

You will be given dynamically provisioned compute nodes for running your individual code. The operating system will be pre-installed according to your requirements. Common Linux distributions like Red Hat, CentOS, or SLES are the standard. Do you need another distribution?

> Ask us: [email protected]

transtec Hosting as a Service

You will be provided with hosting space inside a professionally managed and secured datacenter where you can have your machines hosted, managed, and maintained according to your requirements. Thus, you can build up your own private cloud. What range of hosting and maintenance services do you need?

> Tell us: [email protected]

HPC @ transtec: Services and Customer Care from A to Z

transtec AG has more than 35 years of experience in scientific computing and is one of the earliest manufacturers of HPC clusters. For nearly a decade, transtec has delivered highly customized High Performance clusters based on standard components to academic and industry customers across Europe with all the high quality standards and the customer-centric approach that transtec is well known for. Every transtec HPC solution is more than just a rack full of hardware – it is a comprehensive solution with everything the HPC user, owner, and operator need.

In the early stages of any customer’s HPC project, transtec experts provide extensive and detailed consulting to the customer, who benefits from their expertise and experience. Consulting is followed by benchmarking of different systems with either specifically crafted customer code or generally accepted benchmarking routines; this aids customers in sizing and devising the optimal and detailed HPC configuration.

Each and every piece of HPC hardware that leaves our factory undergoes a burn-in procedure of 24 hours or more if necessary. We make sure that any hardware shipped meets our and our customers’ quality requirements. transtec HPC solutions are turnkey solutions. By default, a transtec HPC cluster has everything installed and configured – from hardware and operating system to important middleware components like cluster management or developer tools and the customer’s production applications.

[Figure: Services and Customer Care from A to Z. Recoverable stage labels: individual presales consulting; site-specific sizing of HPC solution; benchmarking of different systems; burn-in tests of systems; software & OS; maintenance, support & managed services.]

Onsite delivery means onsite integration into the customer’s pro-

duction environment, be it establishing network connectivity to the

corporate network, or setting up software and configuration parts.

transtec HPC clusters are ready-to-run systems – we deliver, you

turn the key, the system delivers high performance. Every HPC project entails transfer to production: IT operation processes and policies apply to the new HPC system. In practice, IT personnel are trained hands-on and introduced to the hardware components and software, along with all operational aspects of configuration management.

transtec services do not stop when the implementation project ends. Beyond transfer to production, transtec continues to take care of you. transtec

offers a variety of support and service options, tailored to the cus-

tomer’s needs. When you are in need of a new installation, a major

reconfiguration or an update of your solution – transtec is able to

support your staff and, if you lack the resources for maintaining the

cluster yourself, maintain the HPC solution for you.

From Professional Services to Managed Services for daily opera-

tions and required service levels, transtec will be your complete HPC

service and solution provider. transtec’s high standards of perfor-

mance, reliability and dependability assure your productivity and

complete satisfaction. transtec’s HPC Managed Services offer customers the possibility of having the complete management and administration of the HPC cluster handled by transtec service specialists, in an ITIL-compliant way. Moreover, transtec’s HPC on Demand services give customers access to HPC resources whenever they need them, for example because they are unable to own and run an HPC cluster themselves for lack of infrastructure, know-how, or admin staff.

transtec HPC Cloud Services

Last but not least transtec’s services portfolio evolves as customers’

demands change. Starting this year, transtec is able to provide

HPC Cloud Services. transtec uses a dedicated datacenter to provide

computing power to customers who are in need of more capacity

than they own, which is why this workflow model is sometimes

called computing-on-demand. With these dynamically provided

resources, customers have the possibility to have their jobs run on

HPC nodes in a dedicated datacenter, professionally managed and

secured, and individually customizable. Numerous standard ap-

plications like ANSYS, LS-Dyna, OpenFOAM, as well as lots of codes

like Gromacs, NAMD, VMD, and others are pre-installed, integrated

into an enterprise-ready cloud and workload management environ-

ment, and ready to run.

Alternatively, whenever customers are in need of space for host-

ing their own HPC equipment because they do not have the space

capacity or cooling and power infrastructure themselves, transtec

is also able to provide Hosting Services to those customers who’d

like to have their equipment professionally hosted, maintained, and

managed. Customers can thus build up their own private cloud!

Are you interested in any of transtec’s broad range of HPC related

services? Write us an email to [email protected]. We’ll be happy to

hear from you!



High Performance Computing

Services and Customer

Care from A to Z

Are you an end user?

Then doing your work as productively as possible is of utmost importance

to you! You are interested in an IT environment running

smoothly, with a fully functional infrastructure, in competent and

professional support by internal administrators and external ser-

vice providers, and a service desk with guaranteed availability at

working times. HPC cloud services ensure you are able to run your

simulation jobs even if computational requirements exceed the lo-

cal capacities.

• support services

• professional services

• managed services

• cloud services

Are you an administrator?

Then you are responsible for providing a well-performing IT infrastructure

and for supporting the end users when problems occur. In

situations where problem management fully exploits your capacity,

or in cases where you just don’t have time or maybe lack experience

in performing certain tasks, you might be glad to be assisted by spe-

cialized IT experts.

• support services

• professional services

Are you an R & D manager?

Then your job is to ensure that developers, researchers, and other

end users are able to do their job. You will welcome a partner at hand that

provides you with a full range of operations support, either on a

completely flexible, on-demand basis or in a fully managed way based

on ITIL best practices.

• professional services

• managed services



Diagram: Services Continuum (Paid to Support, Paid to Do, Paid to Operate, Paid to Use, Paid to Advise; recovered labels: Configuration Changes, Service Desk, Event Monitoring and Notification, Release Management)

Are you a CIO?

Then your main concern is a smoothly running IT operations environment

with a maximum of efficiency and effectiveness. Managed

Services to ensure business continuity are a main part of your considerations

regarding business objectives. HPC cloud services are

a vital part of your capacity planning to ensure the availability of

computational resources to your internal clients, the R & D departments.


• managed services

• cloud services

Are you a CEO?

Profitability, time-to-result, business continuity, and ensuring future

viability are topics that you continuously have to think about. You

are happy to have a partner at hand, who gives you insight into fu-

ture developments, provides you with technology consulting, and is

able to help you ensure business continuity and fall-back scenarios,

as well as scalability readiness, by means of IT capacity outsourcing


• cloud services

• consulting services


Life Sciences | CAE | High Performance Computing | Big Data Analytics | Simulation | CAD

OpenStack for HPC Cloud Environments

OpenStack is a set of software tools for building and managing cloud computing

platforms for public and private clouds. Backed by some of the biggest companies in

software development and hosting, as well as thousands of individual community

members, many think that OpenStack is the future of cloud computing.

OpenStack is managed by the OpenStack Foundation, a non-profit organization which

oversees both development and community-building around the project.



What is OpenStack?

Introduction to OpenStack

OpenStack lets users deploy virtual machines and other instances

which handle different tasks for managing a cloud environment

on the fly. It makes horizontal scaling easy, which means that

tasks which benefit from running concurrently can easily serve

more or fewer users on the fly by just spinning up more instances.

For example, a mobile application which needs to communicate

with a remote server might be able to divide the work of commu-

nicating with each user across many different instances, all

communicating with one another but scaling quickly and easily

as the application gains more users.
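
The horizontal scaling described above can be sketched in a few lines of Python. This is a toy model rather than OpenStack code: the `Pool` class and its `scale_to` and `route` methods are hypothetical names, and a real deployment would launch instances through the OpenStack compute service instead of resizing a list.

```python
import hashlib


class Pool:
    """Toy pool of identical application instances (hypothetical, not an OpenStack API)."""

    def __init__(self, size):
        self.instances = []
        self.scale_to(size)

    def scale_to(self, size):
        # "Spinning up" or removing instances is modeled as resizing a list.
        self.instances = [f"instance-{i}" for i in range(size)]

    def route(self, user_id):
        # Hash the user ID so each user consistently lands on one instance.
        digest = hashlib.sha256(user_id.encode()).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.instances)
        return self.instances[index]


pool = Pool(2)
users = [f"user-{n}" for n in range(1000)]
before = {pool.route(u) for u in users}

pool.scale_to(8)  # more users arrive: add instances instead of a bigger server
after = {pool.route(u) for u in users}
print(len(before), "instances in use before scaling,", len(after), "after")
```

The point of the sketch is that capacity grows by adding more equal workers, while the routing rule stays the same.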

And most importantly, OpenStack is open source software, which

means that anyone who chooses to can access the source code,

make any changes or modifications they need, and freely share

these changes back out to the community at large. It also means

that OpenStack has the benefit of thousands of developers all

over the world working in tandem to develop the strongest, most

robust, and most secure product that they can.

How is OpenStack used in a cloud environment?

The cloud is all about providing computing for end users in a

remote environment, where the actual software runs as a service

on reliable and scalable servers rather than on each end user’s

computer. Cloud computing can refer to a lot of different things,

but typically the industry talks about running different items “as

a service” – software, platforms, and infrastructure. OpenStack

falls into the latter category and is considered Infrastructure as

a Service (IaaS). Providing infrastructure means that OpenStack

makes it easy for users to quickly add new instances, upon which

other cloud components can run. Typically, the infrastructure

then runs a “platform” upon which a developer can create soft-

ware applications which are delivered to the end users.



What are the components of OpenStack?

OpenStack is made up of many different moving parts. Because

of its open nature, anyone can add additional components to

OpenStack to help it to meet their needs. But the OpenStack

community has collaboratively identified nine key components

that are a part of the “core” of OpenStack, which are distributed

as a part of any OpenStack system and officially maintained by

the OpenStack community.

• Nova is the primary computing engine behind OpenStack. It

is used for deploying and managing large numbers of virtual

machines and other instances to handle computing tasks.

• Swift is a storage system for objects and files. Rather than

the traditional idea of referring to files by their location on

a disk drive, developers can instead refer to a unique identifier

referring to the file or piece of information and let OpenStack

decide where to store this information. This makes

scaling easy, as developers don’t have to worry about the

capacity on a single system behind the software. It also

allows the system, rather than the developer, to worry about

how best to make sure that data is backed up in case of the

failure of a machine or network connection.

• Cinder is a block storage component, which is more analogous

to the traditional notion of a computer being able to

access specific locations on a disk drive. This more traditional

way of accessing files might be important in scenarios

in which data access speed is the most important consideration.


• Neutron provides the networking capability for OpenStack.

It helps to ensure that each of the components of an Open-

Stack deployment can communicate with one another

quickly and efficiently.

• Horizon is the dashboard behind OpenStack. It is the only

graphical interface to OpenStack, so for users wanting to

give OpenStack a try, this may be the first component they

actually “see.” Developers can access all of the components

of OpenStack individually through an application programming

interface (API), but the dashboard gives system

administrators a look at what is going on in the cloud and lets

them manage it as needed.

• Keystone provides identity services for OpenStack. It is

essentially a central list of all of the users of the OpenStack

cloud, mapped against all of the services provided by the

cloud which they have permission to use. It provides multiple

means of access, meaning developers can easily map their

existing user access methods against Keystone.

• Glance provides image services to OpenStack. In this case,

“images” refers to images (or virtual copies) of hard disks.

Glance allows these images to be used as templates when

deploying new virtual machine instances.

• Ceilometer provides telemetry services, which allow the

cloud to provide billing services to individual users of the

cloud. It also keeps a verifiable count of each user’s system

usage of each of the various components of an OpenStack

cloud. Think metering and usage reporting.

• Heat is the orchestration component of OpenStack, which

allows developers to store the requirements of a cloud appli-

cation in a file that defines what resources are necessary for

that application. In this way, it helps to manage the infra-

structure needed for a cloud service to run.
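
As a rough illustration of the Heat entry in the list above, the requirements of a small application can be written down as data. The sketch below builds a template in the style of Heat's HOT format as a Python dictionary and prints it as JSON. Heat templates are normally written in YAML; JSON is used here only to stay within the standard library. The resource name `compute_node`, the image name `centos-7`, and the flavor `m1.small` are placeholders that would have to exist in the target cloud, so this is a sketch under those assumptions and not a tested deployment.

```python
import json

# Minimal template in the style of Heat's HOT format, built as plain data.
# Resource, image, and flavor names are illustrative placeholders.
template = {
    "heat_template_version": "2015-04-30",
    "description": "One compute instance booted from a Glance image",
    "resources": {
        "compute_node": {
            "type": "OS::Nova::Server",   # instance managed by Nova
            "properties": {
                "image": "centos-7",      # Glance image used as the template
                "flavor": "m1.small",     # instance size
            },
        }
    },
}


def resource_types(tpl):
    """Return the sorted resource types a template asks for."""
    return sorted(r["type"] for r in tpl["resources"].values())


print(json.dumps(template, indent=2))
print(resource_types(template))
```

Keeping the requirements in one file means the same description can be reviewed, versioned, and reused to build the environment again.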


Object Storage with Ceph

Why “Ceph”?

“Ceph” is an odd name for a file system and breaks the typical

acronym trend that most follow. The name is a reference to the

mascot at UCSC (Ceph’s origin), which happens to be “Sammy,”

the banana slug, a shell-less mollusk in the cephalopods class.

Cephalopods, with their multiple tentacles, provide a great meta-

phor for a distributed file system.

Ceph goals

Developing a distributed file system is a complex endeavor, but

it’s immensely valuable if the right problems are solved. Ceph’s

goals can be simply defined as:

• Easy scalability to multi-petabyte capacity

• High performance over varying workloads (input/output

operations per second [IOPS] and bandwidth)

• Strong reliability

Unfortunately, these goals can compete with one another (for

example, scalability can reduce or inhibit performance or impact

reliability). Ceph has developed some very interesting concepts

(such as dynamic metadata partitioning and data distribution

and replication), which this article explores shortly. Ceph’s design

also incorporates fault-tolerance features to protect against

single points of failure, with the assumption that storage failures

on a large scale (petabytes of storage) will be the norm rather

than the exception. Finally, its design does not assume particular

workloads but includes the ability to adapt to changing distrib-

uted workloads to provide the best performance. It does all of

this with the goal of POSIX compatibility, allowing it to be trans-

parently deployed for existing applications that rely on POSIX

semantics (through Ceph-proposed enhancements). Finally, Ceph

is open source distributed storage and part of the mainline Linux

kernel (2.6.34).




Figure: OpenStack Overview (Advanced Services (consume IaaS), Cloud Management, Infrastructure as a Service; recovered labels: CLI (Python client), Image Management (Glance), Deployment (TripleO), Common Library (Oslo), Message Queue/Database(s))


Ceph architecture

Now, let’s explore the Ceph architecture and its core elements at

a high level. I then dig down another level to identify some of the

key aspects of Ceph to provide a more detailed exploration.

The Ceph ecosystem can be broadly divided into four segments

(see Figure 1): clients (users of the data), metadata servers (which

cache and synchronize the distributed metadata), an object

storage cluster (which stores both data and metadata as objects

and implements other key responsibilities), and finally the cluster

monitors (which implement the monitoring functions).

As Figure 1 shows, clients perform metadata operations (to iden-

tify the location of data) using the metadata servers. The meta-

data servers manage the location of data and also where to store

new data. Note that metadata is stored in the storage cluster (as

indicated by “Metadata I/O”).




Figure 1: Conceptual architecture of the Ceph ecosystem (arrows: Metadata ops, File (object) I/O, Metadata I/O)



Actual file I/O occurs between the client and object storage

cluster. In this way, higher-level POSIX functions (such as open,

close, and rename) are managed through the metadata servers,

whereas POSIX functions (such as read and write) are managed

directly through the object storage cluster.

Another perspective of the architecture is provided in Figure 2. A

set of servers access the Ceph ecosystem through a client inter-

face, which understands the relationship between metadata

servers and object-level storage. The distributed storage system

can be viewed in a few layers, including a format for the storage

devices (the Extent and B-tree-based Object File System [EBOFS]

or an alternative) and an overriding management layer designed

to manage data replication, failure detection, and recovery and

subsequent data migration called Reliable Autonomic Distrib-

uted Object Storage (RADOS). Finally, monitors are used to iden-

tify component failures, including subsequent notification.


Figure 2: Simplified layered view of the Ceph ecosystem (layers: Ceph client interface, Metadata servers, Object storage daemon)

Ceph components

With the conceptual architecture of Ceph under your belt, you

can dig down another level to see the major components imple-

mented within the Ceph ecosystem. One of the key differences

between Ceph and traditional file systems is that rather than

focusing the intelligence in the file system itself, the intelligence

is distributed around the ecosystem.

Figure 3 shows a simple Ceph ecosystem. The Ceph Client is the

user of the Ceph file system. The Ceph Metadata Daemon provides

the metadata services, while the Ceph Object Storage Daemon

provides the actual storage (for both data and metadata). Finally,

the Ceph Monitor provides cluster management. Note that

there can be many Ceph clients, many object storage endpoints,

numerous metadata servers (depending on the capacity of the

file system), and at least a redundant pair of monitors. So, how is

this file system distributed?


Figure 3: Simple Ceph ecosystem (labels: Ceph client, Linux kernel, (Enhanced) POSIX, cmds (Ceph metadata daemon), DRAM cache, cosd (Ceph object storage daemon), disks, cmon (Ceph monitor))



Kernel or user space

Early versions of Ceph utilized Filesystem in Userspace (FUSE),

which pushes the file system into user space and can greatly

simplify its development. But today, Ceph has been integrated

into the mainline kernel, making it faster, because user space

context switches are no longer necessary for file system I/O.

Ceph client

As Linux presents a common interface to the file systems

(through the virtual file system switch [VFS]), the user’s perspec-

tive of Ceph is transparent. The administrator’s perspective will

certainly differ, given the potential for many servers encom-

passing the storage system (see the Resources section for infor-

mation on creating a Ceph cluster). From the users’ point of view,

they have access to a large storage system and are not aware of

the underlying metadata servers, monitors, and individual object

storage devices that aggregate into a massive storage pool.

Users simply see a mount point, from which standard file I/O can

be performed.

The Ceph file system–or at least the client interface–is imple-

mented in the Linux kernel. Note that in the vast majority of

file systems, all of the control and intelligence is implemented

within the kernel’s file system source itself. But with Ceph, the

file system’s intelligence is distributed across the nodes, which

simplifies the client interface but also provides Ceph with the

ability to massively scale (even dynamically).

Rather than rely on allocation lists (metadata to map blocks on

a disk to a given file), Ceph uses an interesting alternative. A file

from the Linux perspective is assigned an inode number (INO)

from the metadata server, which is a unique identifier for the

file. The file is then carved into some number of objects (based

on the size of the file). Using the INO and the object number

(ONO), each object is assigned an object ID (OID). Using a simple

hash over the OID, each object is assigned to a placement group.



The placement group (identified as a PGID) is a conceptual

container for objects. Finally, the mapping of the placement

group to object storage devices is a pseudo-random mapping

using an algorithm called Controlled Replication Under Scalable

Hashing (CRUSH). In this way, mapping of placement groups (and

replicas) to storage devices does not rely on any metadata but

instead on a pseudo-random mapping function. This behavior is

ideal, because it minimizes the overhead of storage and simpli-

fies the distribution and lookup of data.

The final component for allocation is the cluster map. The cluster

map is an efficient representation of the devices that make up

the storage cluster. With a PGID and the cluster map, you can

locate any object.
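
The allocation chain described above (file, objects, placement group, storage devices) can be sketched as follows. This is a simplified model and not Ceph's code: CRUSH is replaced by plain rendezvous (highest-random-weight) hashing as a stand-in, and the object size, OID format, placement group count, replica count, and device names are all assumptions chosen for the example.

```python
import hashlib

OBJECT_SIZE = 4 * 1024 * 1024   # assumed object size for this sketch
PG_COUNT = 64                   # assumed number of placement groups
REPLICAS = 2                    # assumed replication factor


def h(text):
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def object_ids(ino, file_size):
    """Carve a file into objects; each OID combines the INO and object number (ONO)."""
    count = max(1, -(-file_size // OBJECT_SIZE))  # ceiling division
    return [f"{ino:x}.{ono:08x}" for ono in range(count)]


def placement_group(oid):
    """A simple hash over the OID selects the placement group (PGID)."""
    return h(oid) % PG_COUNT


def locate(pgid, cluster_map):
    """Stand-in for CRUSH: rendezvous hashing picks REPLICAS devices for a PG.

    Any client holding the same cluster map computes the same answer,
    so no per-object lookup table is needed.
    """
    ranked = sorted(cluster_map, key=lambda osd: h(f"{pgid}:{osd}"), reverse=True)
    return ranked[:REPLICAS]


cluster_map = ["osd0", "osd1", "osd2", "osd3"]   # hypothetical devices
for oid in object_ids(ino=0x1234, file_size=10 * 1024 * 1024):
    pgid = placement_group(oid)
    print(oid, "-> pg", pgid, "->", locate(pgid, cluster_map))
```

Because every step is a pure function of the identifier and the cluster map, a client can find an object's devices without asking a central directory.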

The Ceph metadata server

The job of the metadata server (cmds) is to manage the file

system’s namespace. Although both metadata and data are

stored in the object storage cluster, they are managed separately

to support scalability. In fact, metadata is further split among

a cluster of metadata servers that can adaptively replicate and

distribute the namespace to avoid hot spots. As shown in Figure

4, the metadata servers manage portions of the namespace and

can overlap (for redundancy and also for performance). The

mapping of metadata servers to namespace is performed in

Ceph using dynamic subtree partitioning, which allows Ceph to

adapt to changing workloads (migrating namespaces between

metadata servers) while preserving locality for performance.
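
A much reduced sketch of the idea behind subtree partitioning is shown below. It only models the mapping of namespace subtrees to metadata servers and the migration of one subtree; the load measurement that would trigger a migration, as well as replication, are left out. The class, method, and server names are hypothetical and do not reflect Ceph's implementation.

```python
class NamespacePartition:
    """Toy subtree-to-server map (hypothetical; not Ceph's implementation)."""

    def __init__(self, root_server):
        self.subtrees = {"/": root_server}

    def server_for(self, path):
        # The longest matching subtree prefix decides which server is responsible.
        best = "/"
        for prefix in self.subtrees:
            if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
                if len(prefix) > len(best):
                    best = prefix
        return self.subtrees[best]

    def migrate(self, subtree, server):
        # Hand a hot subtree to another server; everything else stays put.
        self.subtrees[subtree] = server


ns = NamespacePartition("mds0")
print(ns.server_for("/home/alice/run.log"))

ns.migrate("/home/alice", "mds1")   # assume /home/alice became a hot spot
print(ns.server_for("/home/alice/run.log"))
print(ns.server_for("/home/bob/notes.txt"))
```

Moving whole subtrees keeps related files on the same server, which is what preserves locality.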

But because each metadata server simply manages the

namespace for the population of clients, its primary applica-

tion is an intelligent metadata cache (because actual metadata

is eventually stored within the object storage cluster). Metadata

to write is cached in a short-term journal, which eventually is

pushed to physical storage. This behavior allows the meta-

data server to serve recent metadata back to clients (which is

common in metadata operations). The journal is also useful for

failure recovery: if the metadata server fails, its journal can be

replayed to ensure that metadata is safely stored on disk.
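
The journaling behavior described above can be illustrated with a small model. The `MetadataServer` class below is a hypothetical toy, not Ceph's metadata daemon: a Python list stands in for the persistent journal and a dictionary stands in for the object storage cluster.

```python
class MetadataServer:
    """Toy metadata cache with a journal (hypothetical; not Ceph's cmds)."""

    def __init__(self, journal=None):
        self.journal = journal if journal is not None else []  # survives a crash
        self.cache = {}                                         # lost on a crash

    def update(self, path, attrs):
        # Record the change in the journal first, then serve it from the cache.
        self.journal.append((path, dict(attrs)))
        self.cache[path] = dict(attrs)

    def flush(self, store):
        # Push journaled updates to long-term storage and trim the journal.
        for path, attrs in self.journal:
            store[path] = attrs
        self.journal.clear()

    def recover(self, store):
        # After a failure, replay whatever the journal still holds.
        self.flush(store)


store = {}      # stands in for the object storage cluster
journal = []    # stands in for the persistent journal
mds = MetadataServer(journal)
mds.update("/home/a.dat", {"size": 10})
mds.flush(store)
mds.update("/home/b.dat", {"size": 20})   # journaled but not yet flushed

# Simulate a crash: the in-memory cache is gone, the journal is not.
restarted = MetadataServer(journal)
restarted.recover(store)
print(sorted(store))
```

The update to the second file reaches the store only because the restarted server replays the journal.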

Metadata servers manage the inode space, converting file names

to metadata. The metadata server transforms the file name into

an inode, file size, and striping data (layout) that the Ceph client

uses for file I/O.

Ceph monitors

Ceph includes monitors that implement management of the

cluster map, but some elements of fault management are imple-

mented in the object store itself. When object storage devices fail

or new devices are added, monitors detect and maintain a valid

cluster map. This function is performed in a distributed fashion

where map updates are communicated with existing traffic.

Ceph uses Paxos, which is a family of algorithms for distributed

consensus.


Ceph object storage

Similar to traditional object storage, Ceph storage nodes include

not only storage but also intelligence. Traditional drives are

simple targets that only respond to commands from initiators.

But object storage devices are intelligent devices that act as both

targets and initiators to support communication and collabora-

tion with other object storage devices.


Figure 4: Partitioning of the Ceph namespace for metadata servers



From a storage perspective, Ceph object storage devices perform

the mapping of objects to blocks (a task traditionally done at

the file system layer in the client). This behavior allows the local

entity to best decide how to store an object. Early versions of

Ceph implemented a custom low-level file system on the local

storage called EBOFS. This system implemented a nonstandard

interface to the underlying storage tuned for object seman-

tics and other features (such as asynchronous notification of

commits to disk). Today, the B-tree file system (BTRFS) can be

used at the storage nodes, which already implements some of

the necessary features (such as embedded integrity).

Because the Ceph clients implement CRUSH and do not have

knowledge of the block mapping of files on the disks, the under-

lying storage devices can safely manage the mapping of objects

to blocks. This allows the storage nodes to replicate data (when

a device is found to have failed). Distributing the failure recovery

also allows the storage system to scale, because failure detec-

tion and recovery are distributed across the ecosystem. Ceph

calls this RADOS (see Figure 3).

Other features of interest

As if the dynamic and adaptive nature of the file system weren’t

enough, Ceph also implements some interesting features visible

to the user. Users can create snapshots, for example, in Ceph on

any subdirectory (including all of the contents). It’s also possible

to perform file and capacity accounting at the subdirectory level,

which reports the storage size and number of files for a given

subdirectory (and all of its nested contents).
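
To make the subdirectory accounting concrete, the sketch below computes the same two numbers (total size and file count, including nested contents) for a toy directory tree. It only shows what the reported figures mean; how Ceph computes and exposes them is not covered here, and the nested-dictionary representation is an assumption made for the example.

```python
def account(tree):
    """Return (total_bytes, file_count) for a directory and all nested contents.

    `tree` maps names to either an int (file size in bytes) or a nested dict
    (subdirectory). This structure is an assumption made for the sketch.
    """
    total_bytes, files = 0, 0
    for entry in tree.values():
        if isinstance(entry, dict):
            sub_bytes, sub_files = account(entry)
            total_bytes += sub_bytes
            files += sub_files
        else:
            total_bytes += entry
            files += 1
    return total_bytes, files


project = {
    "mesh.dat": 4096,
    "results": {"run1.out": 1024, "run2.out": 2048, "logs": {"solver.log": 512}},
}
print(account(project))             # whole tree
print(account(project["results"]))  # one subdirectory
```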

Ceph status and future

Although Ceph is now integrated into the mainline Linux kernel,

it’s properly noted there as experimental. File systems in this state

are useful to evaluate but are not yet ready for production envi-

ronments. But given Ceph’s adoption into the Linux kernel and

the motivation by its originators to continue its development,

it should be available soon to solve your massive storage needs.



Application Containers with Docker

Transforming Business Through Software

Gone are the days of private datacenters running off-the-shelf

software and giant monolithic code bases that you updated

once a year. Everything has changed. Whether it is moving to the

cloud, migrating between clouds, modernizing legacy or building

new apps and data structures, the desired result is always the

same – speed. How fast you can move defines your success as a


Software is the critical IP that defines your company even

if the actual product you are selling may be a t-shirt, a car,

or compounding interest. Software is how you engage your

customers, reach new users, understand their data, promote

your product or service and process their order.

To do this well, today’s software is going bespoke. Small pieces

of software that are designed for a very specific job are called

microservices. The design goal of microservices is to have each

service built with all of the necessary components to “run” a

specific job with just the right type of underlying infrastructure

resources. Then, these services are loosely coupled together so

they can be changed at any time, without having to worry about

the service that comes before or after it.

This methodology, while great for continuous improvement,

poses many challenges in reaching the end state. First it creates

a new, ever-expanding matrix of services, dependencies and

infrastructure making it difficult to manage. Additionally it does

not account for the vast amounts of existing legacy applications

in the landscape, the heterogeneity up and down the application

stack and the processes required to ensure this works in practice.

The Docker Journey and the Power of AND

In 2013, Docker entered the landscape with application

containers to build, ship and run applications anywhere. Docker

was able to take software and its dependencies and package them up

into a lightweight container. Much like shipping containers,

software containers are simply a standard unit of software

that looks the same on the outside regardless of what

code and dependencies are included on the inside. This enabled

developers and sysadmins to transport them across infrastruc-

tures and various environments without requiring any modi-

fications and regardless of the varying configurations in the

different environments. The Docker journey begins here.

Agility: The speed and simplicity of Docker were an

instant hit with developers and are what led to the

meteoric rise of the open source project. Developers

are now able to very simply package up any software and its

dependencies into a container. Developers are able to use any

language, version and tooling because they are all packaged in

a container, which “standardizes” all that heterogeneity without


Portability: Just by the nature of the Docker tech-

nology, these very same developers have realized

their application containers are now portable – in

ways not previously possible. They can ship their applications

from development, to test and production and the code will work

as designed every time. Differences in the environment do

not affect what is inside the container, nor do developers need to

change their application to work in production. This is also a

boon for IT operations teams, as they can now move applications

across datacenters or clouds to avoid vendor lock-in.



Control: As these applications move along the life-

cycle to production, new questions around security,

manageability and scale need to be answered. Docker

“standardizes” your environment while maintaining the hetero-

geneity your business requires. Docker provides the ability to set

the appropriate level of control and flexibility to maintain your

service levels, performance and regulatory compliance. IT oper-

ations teams can provision, secure, monitor and scale the infra-

structure and applications to maintain peak service levels. No

two applications or businesses are alike and Docker allows

you to decide how to control your application environment.

At the core of the Docker journey is the power of AND. Docker

is the only solution to provide agility, portability and control for

developers and IT operations teams across all stages of the application

lifecycle. From these core tenets, Containers as a Service

(CaaS) emerges as the construct by which these new applications

are built better and faster.

Docker Containers as a Service (CaaS)

What is Containers as a Service (CaaS)? It is an IT managed and

secured application environment of infrastructure and content

where developers can, in a self-service manner, build and deploy




Diagram: CaaS Workflow (Developers on the left, IT Operations on the right; partially recovered labels: “Secure Content &”, “Deploy, Manage,”)



In the CaaS diagram above, development and IT operations

teams collaborate through the registry. This is a service in which

a library of secure and signed images can be maintained. From

the registry, developers on the left are able to pull and build soft-

ware at their pace and then push the content back to the registry

once it passes integration testing to save the latest version.

Depending on the internal processes, the deployment step is

either automated with tools or can be manually deployed.
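
The registry workflow described above (push after integration testing, pull for deployment, signed content) can be modeled in a few lines. The `Registry` class is a hypothetical toy and not the API of Docker Trusted Registry; an HMAC over the image bytes stands in for image signing, and the key and image contents are placeholders.

```python
import hashlib
import hmac


class Registry:
    """Toy image registry (hypothetical; not the Docker Trusted Registry API)."""

    def __init__(self, signing_key):
        self._key = signing_key
        self._images = {}   # tag -> (content, signature)

    def _sign(self, content):
        return hmac.new(self._key, content, hashlib.sha256).hexdigest()

    def push(self, tag, content, tests_passed):
        # Only content that passed integration testing is stored and signed.
        if not tests_passed:
            raise ValueError(f"{tag}: integration tests failed, push rejected")
        self._images[tag] = (content, self._sign(content))

    def pull(self, tag):
        # Verify the signature before handing the image to a deployment step.
        content, signature = self._images[tag]
        if not hmac.compare_digest(signature, self._sign(content)):
            raise ValueError(f"{tag}: signature mismatch")
        return content


registry = Registry(signing_key=b"demo-key")   # key is a placeholder
registry.push("webapp:1.0", b"image layers v1.0", tests_passed=True)  # developer side

image = registry.pull("webapp:1.0")            # IT operations side
print(len(image), "bytes pulled and verified")
```

Both teams touch only the registry, which is what lets each side work at its own pace.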

The IT operations team on the right in the diagram above manages

the different vendor contracts for the production infrastructure

such as compute, networking and storage. These teams

are responsible for provisioning the compute resources needed for

the application and use the Docker Universal Control Plane to

monitor the clusters and applications over time. Then, they can

move the apps from one cloud to another or scale up or down a

service to maintain peak performance.

Key Characteristics and ConsiderationsThe Docker CaaS provides a framework for organizations to unify

the variety of systems, languages and tools in their environment

and apply the level of control, security or freedom required for

their business. As a Docker native solution with full support of

the Docker API, Docker CaaS can seamlessly take the application

from local development to production without changing the

code, which streamlines the deployment cycle.

The following characteristics form the minimum requirements

for any organization’s application environment. In this paradigm,

development and IT operations teams are empowered to use the

best tools for their respective jobs without worrying about breaking

systems, each other’s workflows or lock-in.



The needs of developers and operations.

Many tools specifically address the functional needs of only

one team; however, CaaS breaks the cycle for continuous

improvement. To truly gain acceleration in the development

to production timeline, you need to address both users along

a continuum. Docker provides unique capabilities for each

team as well as a consistent API across the entire platform

for a seamless transition from one team to the next.

All stages in the application lifecycle.

From continuous integration to delivery and devops, these

practices are about eliminating the waterfall development

methodology and the lagging innovation cycles with it. By

providing tools for both the developer and IT operations,

Docker is able to seamlessly support an application from

build, test, stage to production.

Any language.

Developer agility means the freedom to build with whatever

language, version and tooling required for the features they

are building at that time. Also, the ability to run multiple

versions of a language at the same time provides a greater

level of flexibility. Docker allows your team to focus on

building the app instead of thinking of how to build an app

that works in Docker.

Any operating system.

The vast majority of organizations have more than one oper-

ating system. Some tools just work better in Linux while

others in Windows. Application platforms need to account for

and support this diversity, otherwise they are solving only

part of the problem. Docker was originally created for the Linux

community; Docker and Microsoft are now bringing Windows

Server support forward to address the millions of enterprise applications

in existence today and future applications.





Any infrastructure.

When it comes to infrastructure, organizations want

choice, backup and leverage. Whether that means you have

multiple private data centers, a hybrid cloud or multiple

cloud providers, the critical component is the ability to

move workloads from one environment to another, without

causing application issues. The Docker technology architecture

abstracts the infrastructure away from the application,

allowing application containers to run anywhere and to be

portable across any other infrastructure.

Open APIs, pluggable architecture and ecosystem.

A platform isn’t really a platform if it is an island to itself.

Implementing new technologies is often not possible if you

need to re-tool your existing environment first. A funda-

mental guiding principle of Docker is a platform that is

open. Being open means APIs and plugins to make it easy for

you to leverage your existing investments and to fit Docker

into your environment and processes. This openness invites

a rich ecosystem to flourish and provide you with more

flexibility and choice in adding specialized capabilities to

your CaaS.

Although numerous, these characteristics are critical, because the new

bespoke application paradigms invite ever greater heterogeneity

into your technical architecture. The Docker CaaS platform

is fundamentally designed to support that diversity, while

providing the appropriate controls to manage it at any scale.



OpenStack for HPC

Cloud Environments

Application Containers with Docker

Page 35: HPC Compass 2016_17


Docker CaaS

The Docker CaaS platform is made possible by a suite of integrated

software solutions with a flexible deployment model to

meet the needs of your business.

• On-Premises/VPC: For organizations who need to keep their

IP within their network, Docker Trusted Registry and Docker

Universal Control Plane can be deployed on-premises or

in a VPC and connected to your existing infrastructure and

systems like storage, Active Directory/LDAP, monitoring and

logging solutions. Trusted Registry provides the ability to

store and manage images on your storage infrastructure

while also managing role based access control to the images.

Universal Control Plane provides visibility across your Docker

environment including Swarm clusters, Trusted Registry

repositories, containers and multi container applications.

• In the Cloud: For organizations who readily use SaaS solutions,

Docker Hub and Tutum by Docker provide a registry

service and control plane that is hosted and managed by

Docker. Hub is a cloud registry service to store and manage

your images and user permissions. Tutum by Docker provisions

and manages the deployment clusters, and also monitors

and manages the deployed applications. Connect to the

cloud infrastructure of your choice or bring your own phys-

ical node to deploy your application.

Your Docker CaaS can be designed to provide a centralized point

of control and management or allow for decentralized manage-

ment to empower individual application teams. The flexibility

allows you to create a model that is right for your business,

just like how you choose your infrastructure and implement

processes. CaaS is an extension of that to build, ship and run


Many IT initiatives are enabled, and in fact accelerated, by CaaS

due to its unifying nature across the environment. Each organization

has its own terms for these initiatives, but they range

from containerization, which may involve the “lift and shift”

of existing apps or the adoption of microservices, to

continuous integration, delivery and devops, and various flavors

of the cloud, including adoption, migration, hybrid and multi-cloud.

In each scenario, Docker CaaS brings the agility, portability and

control to enable the adoption of those use cases across the



Diagram: CaaS Workflow (components shown: Docker Toolbox, Docker Hub, Docker Trusted Registry, Tutum by Docker, Docker Universal Control Plane)

Page 36: HPC Compass 2016_17







tech BOX

Architecture of Docker

Docker is an open-source project that automates the deploy-

ment of applications inside software containers, by providing

an additional layer of abstraction and automation of operat-

ing-system-level virtualization on Linux. Docker uses the re-

source isolation features of the Linux kernel such as cgroups

and kernel namespaces, and a union-capable filesystem such as

AUFS and others to allow independent “containers” to run with-

in a single Linux instance, avoiding the overhead of starting and

maintaining virtual machines.

The Linux kernel’s support for namespaces mostly isolates an

application’s view of the operating environment, including pro-

cess trees, network, user IDs and mounted file systems, while

the kernel’s cgroups provide resource limiting, including the

CPU, memory, block I/O and network. Since version 0.9, Docker

includes the libcontainer library as its own way to directly use

virtualization facilities provided by the Linux kernel, in addi-

tion to using abstracted virtualization interfaces via libvirt, LXC

(Linux Containers) and systemd-nspawn.

As actions are done to a Docker base image, union filesystem lay-

ers are created and documented, such that each layer fully de-

scribes how to recreate an action. This strategy enables Docker’s

lightweight images, as only layer updates need to be propagated

(compared to full VMs, for example).
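
To make the layering idea more concrete, the following Python sketch models an image as a stack of layers, each recording only the files that one action wrote or deleted. It is a simplified illustration and not Docker's implementation; the `Layer` class, the `flatten` function and all file names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Layer:
    """One read-only layer: the files an action wrote and the paths it deleted."""
    action: str
    files: Dict[str, str] = field(default_factory=dict)
    deleted: Set[str] = field(default_factory=set)


def flatten(layers: List[Layer]) -> Dict[str, str]:
    """Merge layers bottom-up into the file view a container would see."""
    view: Dict[str, str] = {}
    for layer in layers:
        for path in layer.deleted:
            view.pop(path, None)   # a deletion hides the file from lower layers
        view.update(layer.files)   # upper layers shadow lower ones
    return view


base = Layer("base image", {"/etc/os-release": "linux", "/tmp/cache": "x"})
install = Layer("install app", {"/opt/app/run.sh": "v1"}, deleted={"/tmp/cache"})
update = Layer("update app", {"/opt/app/run.sh": "v2"})

image_v1 = [base, install]
image_v2 = [base, install, update]

# Only the layers missing on the target need to be shipped for an update.
to_ship = [layer.action for layer in image_v2 if layer not in image_v1]
```

In this toy model, moving a host from `image_v1` to `image_v2` transfers a single small layer, whereas a VM-style update would ship the whole disk image again.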


Page 37: HPC Compass 2016_17



Docker implements a high-level API to provide lightweight

containers that run processes in isolation.

Building on top of facilities provided by the Linux kernel (primar-

ily cgroups and namespaces), a Docker container, unlike a vir-

tual machine, does not require or include a separate operating

system. Instead, it relies on the kernel’s functionality and uses

resource isolation (CPU, memory, block I/O, network, etc.) and

separate namespaces to isolate the application’s view of the

operating system. Docker accesses the Linux kernel’s virtualiza-

tion features either directly using the libcontainer library, which

is available as of Docker 0.9, or indirectly via libvirt, LXC (Linux

Containers) or systemd-nspawn.

By using containers, resources can be isolated, services restrict-

ed, and processes provisioned to have an almost completely

private view of the operating system with their own process ID

space, file system structure, and network interfaces. Multiple

containers share the same kernel, but each container can be con-

strained to only use a defined amount of resources such as CPU,

memory and I/O.
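
As a rough illustration of how such constraints are expressed in practice, the sketch below assembles a `docker run` command line with resource limits. The options `--memory`, `--cpu-shares` and `--blkio-weight` belong to the `docker run` command; the helper `docker_run_args` and the image name `example/solver` are hypothetical and only serve the example.

```python
from typing import List, Optional


def docker_run_args(image: str,
                    memory: Optional[str] = None,
                    cpu_shares: Optional[int] = None,
                    blkio_weight: Optional[int] = None) -> List[str]:
    """Build an argument list for `docker run` with optional cgroup limits."""
    args = ["docker", "run", "--rm"]
    if memory is not None:
        args += ["--memory", memory]                   # memory limit, e.g. "2g"
    if cpu_shares is not None:
        args += ["--cpu-shares", str(cpu_shares)]      # relative CPU weight
    if blkio_weight is not None:
        args += ["--blkio-weight", str(blkio_weight)]  # relative block I/O weight
    args.append(image)
    return args


cmd = docker_run_args("example/solver", memory="2g", cpu_shares=512)
# The list could be passed to subprocess.run(cmd) on a host with Docker installed.
```

Building the argument list separately from executing it keeps the sketch testable on machines without Docker.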

Using Docker to create and manage containers may simplify

the creation of highly distributed systems by allowing multiple

applications, worker tasks and other processes to run autono-

mously on a single physical machine or across multiple virtual

machines. This allows the deployment of nodes to be performed

as the resources become available or when more nodes are need-

ed, allowing a platform as a service (PaaS)-style of deployment

and scaling for systems like Apache Cassandra, MongoDB or Riak.

Docker also simplifies the creation and operation of task or work-

load queues and other distributed systems.


Docker can be integrated into various infrastructure tools, in-

cluding Amazon Web Services, Ansible, CFEngine, Chef, Google

Cloud Platform, IBM Bluemix, Jelastic, Jenkins, Microsoft Azure,

OpenStack Nova, OpenSVC, HPE Helion Stackato, Puppet, Salt,

Vagrant, and VMware vSphere Integrated Containers.

The Cloud Foundry Diego project integrates Docker into the

Cloud Foundry PaaS.

The GearD project aims to integrate Docker into Red Hat’s

OpenShift Origin PaaS.




Diagram: Docker can use different interfaces to access virtualization features of the Linux kernel (labels shown: cgroups, namespaces, Linux kernel).

Page 38: HPC Compass 2016_17
Page 39: HPC Compass 2016_17


Advanced Cluster Management Made Easy

High Performance Computing (HPC) clusters serve a crucial role at research-intensive

organizations in industries such as aerospace, meteorology, pharmaceuticals, and oil

and gas exploration.

The thing they have in common is a requirement for large-scale computations. Bright’s

solution for HPC puts the power of supercomputing within the reach of just about any


Engineering | Life Sciences | Automotive | Price Modelling | Aerospace | CAE | Data Analytics

Page 40: HPC Compass 2016_17


The Bright Advantage

Bright Cluster Manager for HPC makes it easy to build and operate

HPC clusters using the server and networking equipment of

your choice, tying them together into a comprehensive, easy to

manage solution.

Bright Cluster Manager for HPC lets customers deploy complete

clusters over bare metal and manage them effectively. It pro-

vides single-pane-of-glass management for the hardware, the

operating system, HPC software, and users. With Bright Cluster

Manager for HPC, system administrators can get clusters up and

running quickly and keep them running reliably throughout their

life cycle – all with the ease and elegance of a fully featured, en-

terprise-grade cluster manager.

Bright Cluster Manager was designed by a group of highly expe-

rienced cluster specialists who perceived the need for a funda-

mental approach to the technical challenges posed by cluster

management. The result is a scalable and flexible architecture,

forming the basis for a cluster management solution that is ex-

tremely easy to install and use, yet is suitable for the largest and

most complex clusters.

Scale clusters to thousands of nodes

Bright Cluster Manager was designed to scale to thousands of

nodes. It is not dependent on third-party (open source) software

that was not designed for this level of scalability. Many advanced

features for handling scalability and complexity are included:

Management Daemon with Low CPU Overhead

The cluster management daemon (CMDaemon) runs on every

node in the cluster, yet has a very low CPU load – it will not cause

any noticeable slow-down of the applications running on your


Because the CMDaemon provides all cluster management func-

tionality, the only additional daemon required is the workload

manager (or queuing system) daemon. For example, no daemons

The cluster installer takes you through the installation process and offers advanced options such as “Express” and “Remote”.

Advanced Cluster

Management Made Easy

Easy-to-use, complete and scalable

Page 41: HPC Compass 2016_17


for monitoring tools such as Ganglia or Nagios are required. This

minimizes the overall load of cluster management software on

your cluster.

Multiple Load-Balancing Provisioning Nodes

Node provisioning – the distribution of software from the head

node to the regular nodes – can be a significant bottleneck in

large clusters. With Bright Cluster Manager, provisioning capabil-

ity can be off-loaded to regular nodes, ensuring scalability to a

virtually unlimited number of nodes.

When a node requests its software image from the head node,

the head node checks which of the nodes with provisioning ca-

pability has the lowest load and instructs the node to download

or update its image from that provisioning node.
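
The selection step described above can be pictured with a few lines of Python. This is only a sketch of the idea of choosing the least-loaded provisioning node, not Bright's actual algorithm; the function name, the node names and the load figures are made up for the example.

```python
from typing import Dict


def pick_provisioning_node(loads: Dict[str, float]) -> str:
    """Return the provisioning node with the lowest current load."""
    if not loads:
        raise ValueError("no provisioning nodes available")
    # Ties are broken by node name so the choice is deterministic.
    return min(sorted(loads), key=lambda name: loads[name])


loads = {"prov01": 0.72, "prov02": 0.15, "prov03": 0.40}
target = pick_provisioning_node(loads)
```

The requesting node would then be told to fetch or update its image from `target`.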

Synchronized Cluster Management Daemons

The CMDaemons are synchronized in time to execute tasks in

exact unison. This minimizes the effect of operating system jit-

ter (OS jitter) on parallel applications. OS jitter is the “noise” or

“jitter” caused by daemon processes and asynchronous events

such as interrupts. OS jitter cannot be totally prevented, but for

parallel applications it is important that any OS jitter occurs at

the same moment in time on every compute node.
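
One common way to make periodic work coincide across nodes is to align it to shared wall-clock boundaries. The sketch below assumes the node clocks are already synchronized (for example via NTP) and is not taken from the CMDaemon; `next_tick` and the sample timestamps are hypothetical.

```python
import math


def next_tick(now: float, period: float) -> float:
    """Return the first multiple of `period` strictly after `now` (seconds)."""
    return (math.floor(now / period) + 1) * period


# Daemons on three nodes wake up at slightly different moments in one period,
# yet all of them compute the same deadline for their next periodic task.
wakeup_times = [1000.02, 1000.31, 1000.77]
deadlines = {next_tick(t, period=5.0) for t in wakeup_times}
```

Because every node derives the deadline from the clock rather than from its own start-up time, the unavoidable interruption lands at the same instant everywhere.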

Built-In Redundancy

Bright Cluster Manager supports redundant head nodes and re-

dundant provisioning nodes that can take over from each other

in case of failure. Both automatic and manual failover configura-

tions are supported.

Other redundancy features include support for hardware and

software RAID on all types of nodes in the cluster.

Diskless Nodes

Bright Cluster Manager supports diskless nodes, which can be

useful to increase overall Mean Time Between Failure (MTBF),

particularly on large clusters. Nodes with only InfiniBand and no

Ethernet connection are possible too.

Bright Cluster Manager 7 for HPC

For organizations that need to easily deploy and manage HPC

clusters, Bright Cluster Manager for HPC lets customers deploy

complete clusters over bare metal and manage them effectively.

It provides single-pane-of-glass management for the hardware,

the operating system, the HPC software, and users.

With Bright Cluster Manager for HPC, system administrators can

quickly get clusters up and running and keep them running relia-

bly throughout their life cycle – all with the ease and elegance of

a fully featured, enterprise-grade cluster manager.


• Simple Deployment Process – Just answer a few questions

about your cluster and Bright takes care of the rest. It installs

all of the software you need and creates the necessary config-

uration files for you, so you can relax and get your new cluster

up and running right – first time, every time.

Page 42: HPC Compass 2016_17






tech BOX


Head Nodes & Regular Nodes

A cluster can have different types of nodes, but it will always

have at least one head node and one or more “regular” nodes.

A regular node is a node which is controlled by the head node

and which receives its software image from the head node or

from a dedicated provisioning node.

Regular nodes come in different flavors.

Some examples include:

• Failover Node – A node that can take over all functionalities

of the head node when the head node becomes unavailable.

• Compute Node – A node which is primarily used for computation.

• Login Node – A node which provides login access to users of

the cluster. It is often also used for compilation and job submission.

• I/O Node – A node which provides access to disk storage.

• Provisioning Node – A node from which other regular nodes

can download their software image.

• Workload Management Node – A node that runs the central

workload manager, also known as queuing system.

• Subnet Management Node – A node that runs an InfiniBand

subnet manager.

More types of nodes can easily be defined and configured

with Bright Cluster Manager. In simple clusters there is only

one type of regular node – the compute node – and the head

node fulfills all cluster management roles that are described

in the regular node types mentioned above.


Page 43: HPC Compass 2016_17


All types of nodes (head nodes and regular nodes) run the same

CMDaemon. However, depending on the role that has been as-

signed to the node, the CMDaemon fulfills different tasks. This

node has a ‘Login Role’, ‘Storage Role’ and ‘PBS Client Role’.

The architecture of Bright Cluster Manager is implemented by

the following key elements:

• The Cluster Management Daemon (CMDaemon)

• The Cluster Management Shell (CMSH)

• The Cluster Management GUI (CMGUI)

Connection Diagram

A Bright cluster is managed by the CMDaemon on the head node,

which communicates with the CMDaemons on the other nodes

over encrypted connections.

The CMDaemon on the head node exposes an API which can also

be accessed from outside the cluster. This is used by the cluster

management Shell and the cluster management GUI, but it can

also be used by other applications if they are programmed to ac-

cess the API. The API documentation is available on request.

The diagram below shows a cluster with one head node, three

regular nodes and the CMDaemon running on each node.

Elements of the Architecture

Page 44: HPC Compass 2016_17


• Can Install on Bare Metal – With Bright Cluster Manager there

is nothing to pre-install. You can start with bare metal serv-

ers, and we will install and configure everything you need. Go

from pallet to production in less time than ever.

• Comprehensive Metrics and Monitoring – Bright Cluster Manager

really shines when it comes to showing you what’s going

on in your cluster. Its beautiful graphical user interface lets

you monitor resources and services across the entire cluster

in real time.

• Powerful User Interfaces – With Bright Cluster Manager you

don’t just get one great user interface; we give you two. That

way, you can choose to provision, monitor, and manage your

HPC clusters with a traditional command line interface or our

beautiful, intuitive graphical user interface.

• Optimize the Use of IT Resources by HPC Applications – Bright

ensures HPC applications get the resources they require

according to the policies of your organization. Your cluster

will prioritize workloads according to your business priorities.

• HPC Tools and Libraries Included – Every copy of Bright Cluster

Manager comes with a complete set of HPC tools and libraries

so you will be ready to develop, debug, and deploy your HPC

code right away.

“The building blocks for transtec HPC solutions must be chosen

according to our goals ease-of-management and ease-of-use.

With Bright Cluster Manager, we are happy to have the technology

leader at hand, meeting these requirements, and our customers

value that.”

Jörg Walter | HPC Solution Engineer

Monitor your entire cluster at a glance with Bright Cluster Manager


Page 45: HPC Compass 2016_17


Quickly build and deploy an HPC cluster

When you need to get an HPC cluster up and running, don’t waste

time cobbling together a solution from various open source

tools. Bright’s solution comes with everything you need to set up

a complete cluster from bare metal and manage it with a beauti-

ful, powerful user interface.

Whether your cluster is on-premise or in the cloud, Bright is a

virtual one-stop-shop for your HPC project.

Build a cluster in the cloud

Bright’s dynamic cloud provisioning capability lets you build an

entire cluster in the cloud or expand your physical cluster into

the cloud for extra capacity. Bright allocates the compute re-

sources in Amazon Web Services automatically, and on demand.

Build a hybrid HPC/Hadoop cluster

Does your organization need HPC clusters for technical comput-

ing and Hadoop clusters for Big Data? The Bright Cluster Man-

ager for Apache Hadoop add-on enables you to easily build and

manage both types of clusters from a single pane of glass. Your

system administrators can monitor all cluster operations, man-

age users, and repurpose servers with ease.

Driven by research

Bright has been building enterprise-grade cluster management

software for more than a decade. Our solutions are deployed in

thousands of locations around the globe. Why reinvent the

wheel when you can rely on our experts to help you implement

the ideal cluster management solution for your organization?

Clusters in the cloud

Bright Cluster Manager can provision and manage clusters that

are running on virtual servers inside the AWS cloud as if they

were local machines. Use this feature to build an entire cluster

in AWS from scratch or extend a physical cluster into the cloud

when you need extra capacity.

Use native services for storage in AWS – Bright Cluster Manager

lets you take advantage of inexpensive, secure, durable, flexible

and simple storage services for data use, archiving, and backup

in the AWS cloud.

Make smart use of AWS for HPC – Bright Cluster Manager can

save you money by instantiating compute resources in AWS only

when they are needed. It uses built-in intelligence to create in-

stances only after the data is ready for processing and the back-

log in on-site workloads requires it.
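
The decision rule described above, starting cloud instances only when the data is ready and the on-site backlog calls for it, might be sketched as follows. This is an assumption-laden illustration rather than Bright's built-in logic; the function name, its parameters and the core-hour figures are hypothetical.

```python
def should_start_cloud_nodes(data_staged: bool,
                             queued_core_hours: float,
                             onsite_free_core_hours: float,
                             backlog_threshold: float = 0.0) -> bool:
    """Start cloud instances only if input data is ready and the on-site
    backlog exceeds what the local cluster can absorb."""
    backlog = queued_core_hours - onsite_free_core_hours
    return data_staged and backlog > backlog_threshold


burst = should_start_cloud_nodes(True, queued_core_hours=500.0,
                                 onsite_free_core_hours=200.0)
```

Requiring both conditions avoids paying for instances that would sit idle while waiting for data or that the local cluster could have replaced.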

You decide which metrics to monitor. Just drag a component into the display area and Bright Cluster Manager instantly creates a graph of the data you need.

Page 46: HPC Compass 2016_17



High performance meets efficiency

Initially, massively parallel systems constitute a challenge to

both administrators and users. They are complex beasts. Anyone

building HPC clusters will need to tame the beast, master the

complexity and present users and administrators with an easy-

to-use, easy-to-manage system landscape.

Leading HPC solution providers such as transtec achieve this

goal. They hide the complexity of HPC under the hood and match

high performance with efficiency and ease-of-use for both users

and administrators. The “P” in “HPC” gains a double meaning:

“Performance” plus “Productivity”.

Cluster and workload management software like Moab HPC

Suite, Bright Cluster Manager or Intel Fabric Suite provide the

means to master and hide the inherent complexity of HPC sys-

tems. For administrators and users, HPC clusters are presented

as single, large machines, with many different tuning parame-

ters. The software also provides a unified view of existing clus-

ters whenever unified management is added as a requirement by

the customer at any point in time after the first installation. Thus,

daily routine tasks such as job management, user management,

queue partitioning and management, can be performed easily

with either graphical or web-based tools, without any advanced

scripting skills or technical expertise required from the adminis-

trator or user.


Page 47: HPC Compass 2016_17






tech BOX


Bright Cluster Manager unleashes and manages the unlimited power of the cloud

Create a complete cluster in Amazon EC2, or easily extend your

onsite cluster into the cloud. Bright Cluster Manager provides a

“single pane of glass” to both your on-premise and cloud resourc-

es, enabling you to dynamically add capacity and manage cloud

nodes as part of your onsite cluster.

Cloud utilization can be achieved in just a few mouse clicks, with-

out the need for expert knowledge of Linux or cloud computing.

With Bright Cluster Manager, every cluster is cloud-ready, at no

extra cost. The same powerful cluster provisioning, monitoring,

scheduling and management capabilities that Bright Cluster

Manager provides to onsite clusters extend into the cloud.

The Bright advantage for cloud utilization

• Ease of use: Intuitive GUI virtually eliminates user learning

curve; no need to understand Linux or EC2 to manage system.

Alternatively, cluster management shell provides powerful

scripting capabilities to automate tasks.

• Complete management solution: Installation/initialization,

provisioning, monitoring, scheduling and management in

one integrated environment.

• Integrated workload management: Wide selection of workload

managers included and automatically configured with

local, cloud and mixed queues.

• Single pane of glass; complete visibility and control: Cloud

compute nodes managed as elements of the on-site cluster;

visible from a single console with drill-downs and historic data.

• Efficient data management via data-aware scheduling: Automatically

ensures data is in position at start of computation;

delivers results back when complete.

• Secure, automatic gateway: Mirrored LDAP and DNS services

inside Amazon VPC, connecting local and cloud-based nodes

for secure communication over VPN.

• Cost savings: More efficient use of cloud resources; support

for spot instances, minimal user intervention.

Two cloud scenarios

Bright Cluster Manager supports two cloud utilization scenarios:

“Cluster-on-Demand” and “Cluster Extension”.

Scenario 1: Cluster-on-Demand

The Cluster-on-Demand scenario is ideal if you do not have a clus-

ter onsite, or need to set up a totally separate cluster. With just a

few mouse clicks you can instantly create a complete cluster in

the public cloud, for any duration of time.

Page 48: HPC Compass 2016_17


Scenario 2: Cluster Extension

The Cluster Extension scenario is ideal if you have a cluster onsite

but you need more compute power, including GPUs. With Bright

Cluster Manager, you can instantly add EC2-based resources to

your onsite cluster, for any duration of time. Extending into the

cloud is as easy as adding nodes to an onsite cluster – there are

only a few additional, one-time steps after providing the public

cloud account information to Bright Cluster Manager.

The Bright approach to managing and monitoring a cluster in

the cloud provides complete uniformity, as cloud nodes are

managed and monitored the same way as local nodes:

• Load-balanced provisioning

• Software image management

• Integrated workload management

• Interfaces – GUI, shell, user portal

• Monitoring and health checking

• Compilers, debuggers, MPI libraries, mathematical libraries

and environment modules

Advanced Cluster

Management Made Easy

Cloud Bursting With Bright

Page 49: HPC Compass 2016_17

Bright Computing


Bright Cluster Manager also provides additional features that are

unique to the cloud.

Amazon spot instance support – Bright Cluster Manager enables

users to take advantage of the cost savings offered by Amazon’s

spot instances. Users can specify the use of spot instances, and

Bright will automatically schedule them as available, reducing the cost

to compute without the need to monitor spot prices and sched-

ule manually.

Hardware Virtual Machine (HVM) – Bright Cluster Manager auto-

matically initializes all Amazon instance types, including Cluster

Compute and Cluster GPU instances that rely on HVM virtualization.

Amazon VPC – Bright supports Amazon VPC setups, which allow

compute nodes in EC2 to be placed in an isolated network, there-

by separating them from the outside world. It is even possible to

route part of a local corporate IP network to a VPC subnet in EC2,

so that local nodes and nodes in EC2 can communicate easily.

Data-aware scheduling – Data-aware scheduling ensures that

input data is automatically transferred to the cloud and made

accessible just prior to the job starting, and that the output data

is automatically transferred back. There is no need to wait for the

data to load prior to submitting jobs (delaying entry into the job

queue), nor any risk of starting the job before the data transfer is

complete (crashing the job). Users submit their jobs, and Bright’s

data-aware scheduling does the rest.
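
The ordering guarantee behind data-aware scheduling can be sketched in a few lines of Python: the user submits at once, the job runs only after its input has arrived, and the results are copied back afterwards. This is an illustrative model, not Bright's implementation; `run_with_staging` and the three callables passed to it are hypothetical.

```python
from typing import Callable, List


def run_with_staging(stage_in: Callable[[], bool],
                     compute: Callable[[], None],
                     stage_out: Callable[[], None]) -> List[str]:
    """Run a job only after its input transfer has completed, then return results."""
    log = ["submitted"]            # the user submits immediately
    if not stage_in():             # input data is copied to the cloud first
        log.append("held: input transfer failed")
        return log
    log.append("input staged")
    compute()                      # the job starts only once the data is in place
    log.append("job finished")
    stage_out()                    # output data is copied back on site
    log.append("output returned")
    return log


events = run_with_staging(lambda: True, lambda: None, lambda: None)
```

Because the transfer check sits between submission and execution, a job can enter the queue early without ever starting on incomplete data.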

Bright Cluster Manager provides the choice:

“To Cloud, or Not to Cloud”

Not all workloads are suitable for the cloud. The ideal situation

for most organizations is to have the ability to choose between

onsite clusters for jobs that require low latency communication,

complex I/O or sensitive data; and cloud clusters for many other

types of jobs.

Bright Cluster Manager delivers the best of both worlds: a pow-

erful management solution for local and cloud clusters, with

the ability to easily extend local clusters into the cloud without

compromising provisioning, monitoring or managing the cloud


One or more cloud nodes can be configured under the Cloud Settings tab.

Page 50: HPC Compass 2016_17
Page 51: HPC Compass 2016_17


Intelligent HPC Workload Management

While all HPC systems face challenges in workload demand, workflow constraints,

resource complexity, and system scale, enterprise HPC systems face more stringent

challenges and expectations.

Enterprise HPC systems must meet mission-critical and priority HPC workload demands

for commercial businesses, business-oriented research, and academic organizations.

High Throughput Computing | CAD | Big Data Analytics | Simulation | Aerospace | Automotive

Page 52: HPC Compass 2016_17


Solutions to Enterprise High-Performance Computing Challenges

The Enterprise has requirements for dynamic scheduling, provisioning

and management of multi-step/multi-application services

across HPC, cloud, and big data environments. Their workloads

directly impact revenue, product delivery, and organizational objectives.


Enterprise HPC systems must process intensive simulation and

data analysis more rapidly, accurately and cost-effectively to

accelerate insights. By improving workflow and eliminating job

delays and failures, the business can more efficiently leverage

data to make data-driven decisions and achieve a competitive

advantage. The Enterprise is also seeking to improve resource

utilization and manage efficiency across multiple heterogene-

ous systems. User productivity must increase to speed the time

to discovery, making it easier to access and use HPC resources

and expand to other resources as workloads demand.

Moab HPC Suite

Moab HPC Suite accelerates insights by unifying data center resources,

optimizing the analysis process, and guaranteeing services

to the business. Moab 8.1 for HPC systems and HPC cloud

continually meets enterprise priorities through increased pro-

ductivity, automated workload uptime, and consistent SLAs. It

uses the battle-tested and patented Moab intelligence engine

to automatically balance the complex, mission-critical workload

priorities of enterprise HPC systems. Enterprise customers ben-

efit from a single integrated product that brings together key

Enterprise HPC capabilities, implementation services, and 24/7


It is imperative to automate workload workflows to speed time

to discovery. The following new use cases play a key role in im-

proving overall system performance:

Moab is the “brain” of an HPC system, intelligently optimizing workload throughput while balancing service levels and priorities.

Intelligent HPC

Workload Management

Moab HPC Suite

Page 53: HPC Compass 2016_17


• Elastic Computing – unifies data center resources by helping

the business manage resource expansion through bursting

to private/public clouds and other data center resources,

utilizing OpenStack as a common platform

• Performance and Scale – optimizes the analysis process by

dramatically increasing TORQUE and Moab throughput and

scalability across the board

• Tighter cooperation between Moab and TORQUE – harmonizing

these key structures to reduce overhead and improve

communication between the HPC scheduler and the resource

manager(s)

• More Parallelization – increasing the decoupling of Moab

and TORQUE’s network communication so Moab is less

dependent on TORQUE’s responsiveness, resulting in a 2x

speed improvement and cutting the duration of the average

scheduling iteration in half

• Accounting Improvements – allowing toggling between

multiple modes of accounting with varying levels of enforcement,

providing greater flexibility beyond strict allocation

options. Additionally, new up-to-date accounting

balances provide real-time insights into usage tracking.

• Viewpoint Admin Portal – guarantees services to the business

by allowing admins to simplify administrative reporting,

workload status tracking, and job resource viewing

Accelerate Insights

Moab HPC Suite accelerates insights by increasing overall

system, user and administrator productivity, achieving more

accurate results that are delivered faster and at a lower cost

from HPC resources. Moab provides the scalability, 90-99 per-

cent utilization, and simple job submission required to maximize

productivity. This ultimately speeds the time to discovery, aiding

the enterprise in achieving a competitive advantage due to its

HPC system. Enterprise use cases and capabilities include the


• OpenStack integration to offer virtual and physical resource

provisioning for IaaS and PaaS

• Performance boost to achieve 3x improvement in overall optimization

performance

• Advanced workflow data staging to enable improved cluster

utilization, multiple transfer methods, and new transfer types

Page 54: HPC Compass 2016_17


“With Moab HPC Suite, we can meet very demanding customers’

requirements as regards unified management of heterogeneous

cluster environments, grid management, and provide them with

flexible and powerful configuration and reporting options. Our

customers value that highly.”

Sven Grützmacher | HPC Solution Engineer

• Advanced power management with clock frequency control and additional power state options, reducing energy costs by 15-30 percent

• Workload-optimized allocation policies and provisioning to get more results out of existing heterogeneous resources and reduce costs, including topology-based allocation

• Unify workload management across heterogeneous clusters by managing them as one cluster, maximizing resource availability and administration efficiency

• Optimized, intelligent scheduling packs workloads and backfills around priority jobs and reservations while balancing SLAs to efficiently use all available resources

• Optimized scheduling and management of accelerators, both Intel Xeon Phi and GPGPUs, for jobs to maximize their utilization and effectiveness

• Simplified job submission and management with advanced job arrays and templates

• Showback or chargeback for pay-for-use so actual resource usage is tracked with flexible chargeback rates and reporting by user, department, cost center, or cluster

• Multi-cluster grid capabilities manage and share workload across multiple remote clusters to meet growing workload demand or surges
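
The idea of backfilling around priority jobs and reservations, mentioned in the list above, can be illustrated with a small simulation. The following is a minimal sketch of a generic textbook backfill pass, not Moab's actual algorithm; the `Job` class, the `backfill` function and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int      # number of nodes requested
    walltime: int   # requested run time in minutes


def backfill(total_nodes, running, queue, now=0):
    """Return the names of queued jobs that can start at time `now`.

    running: list of (nodes, end_time) tuples for executing jobs.
    queue:   list of Job, highest priority first; every request is
             assumed to fit on the cluster (nodes <= total_nodes).
    """
    free = total_nodes - sum(nodes for nodes, _ in running)
    running = list(running)
    pending = list(queue)
    started = []

    # 1. Start jobs strictly in priority order while they fit.
    while pending and pending[0].nodes <= free:
        job = pending.pop(0)
        free -= job.nodes
        running.append((job.nodes, now + job.walltime))
        started.append(job.name)
    if not pending:
        return started

    # 2. Reserve the earliest possible start time for the blocked top job.
    head = pending[0]
    avail, reserve_at = free, now
    for nodes, end in sorted(running, key=lambda r: r[1]):
        avail += nodes
        reserve_at = end
        if avail >= head.nodes:
            break
    spare = avail - head.nodes  # nodes still idle once `head` starts

    # 3. Backfill lower-priority jobs that do not delay that reservation.
    for job in pending[1:]:
        if job.nodes > free:
            continue
        ends_before_reservation = now + job.walltime <= reserve_at
        if ends_before_reservation or job.nodes <= spare:
            if not ends_before_reservation:
                spare -= job.nodes
            free -= job.nodes
            started.append(job.name)
    return started


queue = [Job("big", 8, 120), Job("short", 4, 30), Job("long", 2, 200)]
print(backfill(10, running=[(6, 60)], queue=queue))  # prints ['short']
```

In this example the 8-node job cannot start until the running 6-node job ends at minute 60, so the 4-node, 30-minute job is slipped in ahead of it without delaying it. A production scheduler additionally weighs SLAs, reservations and many other policies.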

Guarantee Services to the Business

Job and resource failures in enterprise HPC systems lead to delayed results, missed organizational opportunities, and missed objectives. Moab HPC Suite intelligently automates workload and resource uptime in the HPC system to ensure that workloads complete successfully and reliably, avoid failures, and guarantee services are delivered to the business.

Intelligent HPC

Workload Management

Moab HPC Suite

Page 55: HPC Compass 2016_17


The Enterprise benefits from these features:

• Intelligent resource placement prevents job failures with granular resource modeling, meeting workload requirements and avoiding at-risk resources

• Auto-response to failures and events with configurable actions to pre-failure conditions, amber alerts, or other metrics and monitors

• Workload-aware future maintenance scheduling that helps maintain a stable HPC system without disrupting workload productivity

• Real-world expertise for fast time-to-value and system uptime with included implementation, training and 24/7 remote support services

Auto SLA Enforcement

Moab HPC Suite uses the powerful Moab intelligence engine to optimally schedule and dynamically adjust workload to consistently meet service level agreements (SLAs), guarantees, or business priorities. This automatically ensures that the right workloads are completed at the optimal times, taking into account the many departments, priorities, and SLAs that have to be balanced. Moab provides the following benefits:

• Usage accounting and budget enforcement that schedules resources and reports on usage in line with resource sharing agreements and precise budgets (includes usage limits, usage reports, auto budget management, and dynamic fair share policies)

• SLA and priority policies that make sure the highest priority workloads are processed first (includes Quality of Service and hierarchical priority weighting)

• Continuous plus future scheduling that ensures priorities and guarantees are proactively met as conditions and workload change (e.g. future reservations, pre-emption)


transtec HPC solutions are designed for maximum flexibility and ease of management. We not only offer our customers the most powerful and flexible cluster management solution out there, but also provide them with customized setup and site-specific configuration. Whether a customer needs a dynamic Linux-Windows dual-boot solution, unified management of different clusters at different sites, or the fine-tuning of the Moab scheduler to implement fine-grained policy configuration – transtec not only gives you the framework at hand, but also helps you adapt the system according to your special needs. Needless to say, when customers are in need of special training, transtec will be there to provide customers, administrators, or users with specially adjusted Educational Services.

Many years of experience in High Performance Computing have enabled us to develop efficient concepts for installing and deploying HPC clusters. For this, we leverage well-proven third-party tools, which we assemble into a complete solution and adapt and configure according to the customer’s requirements.

We manage everything from the complete installation and configuration of the operating system, through necessary middleware components like job and cluster management systems, up to the customer’s applications.

Page 56: HPC Compass 2016_17


IBM Platform LSF

User-friendly, topology aware workload management

Platform HPC includes a robust workload scheduling capability, which is based on Platform LSF – the industry’s most powerful, comprehensive, policy driven workload management solution for engineering and scientific distributed computing environments. By scheduling workloads intelligently according to policy, Platform HPC improves end user productivity with minimal system administrative effort. In addition, it allows HPC user teams to easily access and share all computing resources, while reducing time between simulation iterations.

GPU scheduling – Platform HPC provides the capability to schedule jobs to GPUs as well as CPUs. This is particularly advantageous in heterogeneous hardware environments as it means that administrators can configure Platform HPC so that only those jobs that can benefit from running on GPUs are allocated to those resources. This frees up CPU-based resources to run other jobs. Using the unified management interface, administrators can monitor the GPU performance as well as detect ECC errors.

Unified management interface

Competing cluster management tools either do not have a web-based interface or require multiple interfaces for managing different functional areas. In comparison, Platform HPC includes a single unified interface through which all administrative tasks can be performed, including node management, job management, job and cluster monitoring, and reporting. Using the unified management interface, even cluster administrators with very little Linux experience can competently manage a state of the art HPC cluster.

Intelligent HPC

Workload Management

IBM Platform LSF

Page 57: HPC Compass 2016_17


Job management – While command line savvy users can continue using the remote terminal capability, the unified web portal makes it easy to submit, monitor, and manage jobs. As changes are made to the cluster configuration, Platform HPC automatically re-configures key components, ensuring that jobs are allocated to the appropriate resources.

The web portal is customizable and provides job data management, remote visualization and interactive job support.

Workload/system correlation – Administrators can correlate workload information with system load, so that they can make timely decisions and proactively manage compute resources against business demand. When it’s time for capacity planning, the management interface can be used to run detailed reports and analyses which quantify user needs and remove the guesswork from capacity expansion.

Simplified cluster management – The unified management console is used to administer all aspects of the cluster environment. It enables administrators to easily install, manage and monitor their cluster. It also provides an interactive environment to easily package software as kits for application deployment as well as pre-integrated commercial application support. One of the key features of the interface is an operational dashboard that provides comprehensive administrative reports. As the image illustrates, Platform HPC enables administrators to monitor and report on key performance metrics such as cluster capacity, available memory and CPU utilization. This enables administrators to easily identify and troubleshoot issues. The easy to use interface saves the cluster administrator time, and means that they do not need to become an expert in the administration of open-source software components. It also reduces the possibility of errors and time lost due to incorrect configuration. Cluster administrators enjoy the best of both worlds – easy access to a powerful, web-based cluster manager without the need to learn and separately administer all the tools that comprise the HPC cluster environment.

Resource monitoring

Page 58: HPC Compass 2016_17


Intelligent HPC

Workload Management


Slurm

The Simple Linux Utility for Resource Management (Slurm) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Optional plugins can be used for accounting, advanced reservation, gang scheduling (time sharing for parallel jobs), backfill scheduling, topology optimized resource selection, resource limits by user or bank account, and sophisticated multifactor job prioritization algorithms.

Architecture

Slurm has a centralized manager, slurmctld, to monitor resources and work. There may also be a backup manager to assume those responsibilities in the event of failure. Each compute server (node) has a slurmd daemon, which can be compared to a remote shell: it waits for work, executes that work, returns status, and waits for more work. The slurmd daemons provide fault-tolerant hierarchical communications. There is an optional slurmdbd (Slurm DataBase Daemon) which can be used to record accounting information for multiple Slurm-managed clusters in a single database. User tools include srun to initiate jobs, scancel to terminate queued or running jobs, sinfo to report system status, squeue to report the status of jobs, and sacct to get information about jobs and job steps that are running or have completed. The smap and sview commands graphically report system and job status, including network topology.
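
As a hedged illustration of how the user tools named above are typically driven from a script, the sketch below only builds the command lines and prints one of them; it does not contact a cluster. The helper function names and the partition name `debug` are hypothetical. The options used (`--nodes`, `--ntasks`, `--partition`, `--time`, `--user`, `--jobs`) are standard options of the respective Slurm commands.

```python
import shlex


def srun_cmd(program, nodes=1, ntasks=1, partition=None, time_limit=None):
    """Build an argv list that launches `program` through srun."""
    argv = ["srun", f"--nodes={nodes}", f"--ntasks={ntasks}"]
    if partition:
        argv.append(f"--partition={partition}")
    if time_limit:
        argv.append(f"--time={time_limit}")  # for example "00:10:00"
    return argv + shlex.split(program)


def squeue_cmd(user=None):
    """Build an argv list that reports job status, optionally for one user."""
    return ["squeue"] + ([f"--user={user}"] if user else [])


def sacct_cmd(job_id):
    """Build an argv list that queries accounting data for one job."""
    return ["sacct", f"--jobs={job_id}"]


def scancel_cmd(job_id):
    """Build an argv list that cancels a queued or running job."""
    return ["scancel", str(job_id)]


print(shlex.join(srun_cmd("hostname", nodes=2, ntasks=4, partition="debug")))
```

On a machine where Slurm is installed, each list could be passed to `subprocess.run(argv, check=True)`; keeping the arguments as a list avoids shell quoting problems.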

Page 59: HPC Compass 2016_17


There is an administrative tool, scontrol, available to monitor and/or modify configuration and state information on the cluster. The administrative tool used to manage the database is sacctmgr. It can be used to identify the clusters, valid users, valid bank accounts, etc. APIs are available for all functions.

Slurm has a general-purpose plugin mechanism available to easily support various infrastructures. This permits a wide variety of Slurm configurations using a building block approach. These plugins presently include:

• Accounting Storage: Primarily used to store historical data about jobs. When used with SlurmDBD (Slurm Database Daemon), it can also supply a limits based system along with historical system status.

• Account Gather Energy: Gather energy consumption data per job or nodes in the system. This plugin is integrated with the Accounting Storage and Job Account Gather plugins.





[Figure: Slurm components. User commands (partial list), controller daemons, and compute node daemons (slurmd).]

• Authentication of communications: Provides authentication mechanism between various components of Slurm.

• Checkpoint: Interface to various checkpoint mechanisms.

• Cryptography (Digital Signature Generation): Mechanism used to generate a digital signature, which is used to validate that a job step is authorized to execute on specific nodes. This is distinct from the plugin used for Authentication since the job step request is sent from the user’s srun command rather than directly from the slurmctld daemon, which generates the job step credential and its digital signature.

• Generic Resources: Provide interface to control generic resources like Graphics Processing Units (GPUs) and Intel® Many Integrated Core (MIC) processors.

• Job Submit: Custom plugin to allow site specific control over job requirements at submission and update.

• Job Accounting Gather: Gather job step resource utilization data.

• Job Completion Logging: Log a job’s termination data. This is typically a subset of data stored by an Accounting Storage Plugin.

• Launchers: Controls the mechanism used by the ‘srun’ command to launch the tasks.

• MPI: Provides different hooks for the various MPI implementations. For example, this can set MPI specific environment variables.

• Preempt: Determines which jobs can preempt other jobs and the preemption mechanism to be used.

• Priority: Assigns priorities to jobs upon submission and on an ongoing basis (e.g. as they age).

• Process tracking (for signaling): Provides a mechanism for identifying the processes associated with each job. Used for job accounting and signaling.

• Scheduler: Plugin determines how and when Slurm schedules jobs.


Page 60: HPC Compass 2016_17


• Node selection: Plugin used to determine the resources used for a job allocation.

• Switch or interconnect: Plugin to interface with a switch or interconnect. For most systems (Ethernet or InfiniBand) this is not needed.

• Task Affinity: Provides mechanism to bind a job and its individual tasks to specific processors.

• Network Topology: Optimizes resource selection based upon the network topology. Used for both job allocations and advanced reservation.

The entities managed by these Slurm daemons, shown in Figure 2, include nodes, the compute resource in Slurm; partitions, which group nodes into logical sets; jobs, or allocations of resources assigned to a user for a specified amount of time; and job steps, which are sets of (possibly parallel) tasks within a job. The partitions can be considered job queues, each of which has an assortment of constraints such as job size limit, job time limit, users permitted to use it, etc. Priority-ordered jobs are allocated nodes within a partition until the resources (nodes, processors, memory, etc.) within that partition are exhausted. Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps in any configuration within the allocation. For instance, a single job step may be started that utilizes all nodes allocated to the job, or several job steps may independently use a portion of the allocation. Slurm provides resource management for the processors allocated to a job, so that multiple job steps can be simultaneously submitted and queued until there are available resources within the job’s allocation.
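
The relationship between partitions, their constraints and priority-ordered allocation can be sketched in a few lines. This is a simplified model written for illustration only, not Slurm code; the class names, field names and the `allocate` function are assumptions, and real Slurm also accounts for processors, memory, permitted users and more.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    nodes: list              # names of idle nodes in this partition
    max_nodes_per_job: int   # job size limit
    max_minutes: int         # job time limit


@dataclass
class JobRequest:
    name: str
    priority: int
    nodes: int
    minutes: int


def allocate(partition, jobs):
    """Hand out nodes in priority order until the partition is exhausted.

    Returns (allocations, pending, rejected): allocations maps a job name
    to its nodes, pending lists jobs that must wait, and rejected lists
    jobs that violate the partition's limits.
    """
    idle = list(partition.nodes)
    allocations, pending, rejected = {}, [], []
    blocked = False
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        if job.nodes > partition.max_nodes_per_job or job.minutes > partition.max_minutes:
            rejected.append(job.name)
            continue
        if blocked or job.nodes > len(idle):
            blocked = True  # strict priority order: later jobs wait as well
            pending.append(job.name)
            continue
        allocations[job.name] = idle[:job.nodes]
        idle = idle[job.nodes:]
    return allocations, pending, rejected


part = Partition("batch", ["n1", "n2", "n3", "n4"], max_nodes_per_job=3, max_minutes=60)
jobs = [
    JobRequest("a", priority=10, nodes=2, minutes=30),
    JobRequest("b", priority=5, nodes=3, minutes=30),
    JobRequest("c", priority=1, nodes=1, minutes=30),
    JobRequest("d", priority=20, nodes=4, minutes=30),
]
print(allocate(part, jobs))
```

Here job `d` exceeds the size limit and is rejected, `a` receives two nodes, and `b` and `c` wait because strict priority ordering is assumed; a backfill scheduler plugin could let `c` run earlier.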

Intelligent HPC

Workload Management

IBM Platform LSF 8

Page 61: HPC Compass 2016_17






tech BOX


IBM Platform LSF 8

Written with Platform LSF administrators in mind, this brief provides a short explanation of significant changes in Platform’s latest release of Platform LSF, with a specific emphasis on scheduling and workload management features.

About IBM Platform LSF 8

Platform LSF is the most powerful workload manager for demanding, distributed high performance computing environments. It provides a complete set of workload management capabilities, all designed to work together to reduce cycle times and maximize productivity in mission-critical environments.

This latest Platform LSF release delivers improvements in performance and scalability while introducing new features that simplify administration and boost user productivity. This includes:

• Guaranteed resources – Aligns business SLAs with infrastructure configuration for simplified administration and configuration

• Live reconfiguration – Provides simplified administration and enables agility

• Delegation of administrative rights – Empowers line of business owners to take control of their own projects

• Fairshare & pre-emptive scheduling enhancements – Fine-tunes key production policies

Platform LSF 8 Features

Guaranteed Resources Ensure Deadlines are Met

In Platform LSF 8, resource-based scheduling has been extended to guarantee resource availability to groups of jobs. Resources can be slots, entire hosts or user-defined shared resources such as software licenses.

As an example, a business unit might guarantee that it has access to specific types of resources within ten minutes of a job being submitted, even while sharing resources between departments. This facility ensures that lower priority jobs using the needed resources can be pre-empted in order to meet the SLAs of higher priority jobs. Because jobs can be automatically attached to an SLA class via access controls, administrators can enable these guarantees without requiring that end-users change their job submission procedures, making it easy to implement this capability in existing environments.

Live Cluster Reconfiguration

Platform LSF 8 incorporates a new live reconfiguration capability, allowing changes to be made to clusters without the need to re-start LSF daemons. This is useful to customers who need to add hosts, adjust sharing policies or re-assign users between groups “on the fly”, without impacting cluster availability or running jobs.

Page 62: HPC Compass 2016_17


Changes to the cluster configuration can be made via the bconf command line utility, or via new API calls. This functionality can also be integrated via a web-based interface using Platform Application Center. All configuration modifications are logged for a complete audit history, and changes are propagated almost instantaneously. The majority of reconfiguration operations are completed in under half a second.

With Live Reconfiguration, down-time is reduced, and administrators are free to make needed adjustments quickly rather than wait for scheduled maintenance periods or non-peak hours. In cases where users are members of multiple groups, controls can be put in place so that a group administrator can only control jobs associated with their designated group rather than impacting jobs related to another group submitted by the same user.

Delegation of Administrative Rights

With Platform LSF 8, the concept of group administrators has been extended to enable project managers and line of business managers to dynamically modify group membership and fairshare resource allocation policies within their group. The ability to make these changes dynamically to a running cluster is made possible by the Live Reconfiguration feature.

These capabilities can be delegated selectively depending on the group and site policy. Different group administrators can manage jobs, control sharing policies or adjust group membership.


Page 63: HPC Compass 2016_17


More Flexible Fairshare Scheduling Policies

To enable better resource sharing flexibility with Platform LSF 8, the algorithms used to tune dynamically calculated user priorities can be adjusted at the queue level. These algorithms can vary based on department, application or project team preferences. The Fairshare parameters ENABLE_HIST_RUN_TIME and HIST_HOURS enable administrators to control the degree to which LSF considers prior resource usage when determining user priority. The flexibility of Platform LSF 8 has also been improved by allowing a similar “decay rate” to apply to currently running jobs (RUN_TIME_DECAY), either system-wide or at the queue level. This is most useful for customers with long-running jobs, where setting this parameter results in a more accurate view of real resource use for the fairshare scheduling to consider.
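
To make the idea of decaying historical usage concrete, here is a toy model. It is explicitly not LSF's priority formula: the decay law (usage shrinks to one tenth after `hist_hours`), the priority expression and both function names are assumptions made for illustration. Only the notion that older usage should weigh less, controlled by a parameter in the spirit of HIST_HOURS, comes from the text above.

```python
def decayed_hours(hours_used, age_hours, hist_hours=5.0):
    """Discount past run time so that old usage counts for less.

    Assumed decay law (illustrative only): usage shrinks to one tenth
    of its value once `hist_hours` have elapsed.
    """
    return hours_used * 0.1 ** (age_hours / hist_hours)


def dynamic_priority(shares, usage, hist_hours=5.0):
    """Toy dynamic priority: shares raise it, recent usage lowers it.

    usage: list of (hours_used, age_hours) records for one user.
    """
    effective = sum(decayed_hours(h, age, hist_hours) for h, age in usage)
    return shares / (1.0 + effective)


recent = dynamic_priority(shares=10, usage=[(4.0, 0.0)])   # 4 h used just now
older = dynamic_priority(shares=10, usage=[(4.0, 10.0)])   # 4 h used 10 h ago
print(recent < older)  # prints True
```

With equal shares, the user whose consumption lies further in the past ends up with the higher priority, which is the qualitative behavior a shorter or longer history window is meant to tune.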

Performance & Scalability Enhancements

Platform LSF has been extended to support an unparalleled scale of up to 100,000 cores and 1.5 million queued jobs for very high throughput EDA workloads. Even higher scalability is possible for more traditional HPC workloads. Specific areas of improvement include the time required to start the master batch daemon (MBD), bjobs query performance, job submission and job dispatching as well as impressive performance gains resulting from the new Bulk Job Submission feature. In addition, on very large clusters with large numbers of user groups employing fairshare scheduling, the memory footprint of the master batch scheduler in LSF has been reduced by approximately 70% and scheduler cycle time has been reduced by 25%, resulting in better performance and scalability.

More Sophisticated Host-based Resource Usage for Parallel Jobs

Platform LSF 8 provides several improvements to how resource use is tracked and reported with parallel jobs. Accurate tracking of how parallel jobs use resources such as CPUs, memory and swap is important for ease of management, optimal scheduling and accurate reporting and workload analysis. With Platform LSF 8, administrators can track resource usage on a per-host basis and an aggregated basis (across all hosts), ensuring that resource use is reported accurately. Additional details such as running PIDs and PGIDs are reported for distributed parallel jobs, which simplifies manual cleanup (if necessary) and the development of scripts for managing parallel jobs. These improvements in resource usage reporting are reflected in LSF commands including bjobs, bhist and bacct.
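
The difference between per-host and aggregated usage for a parallel job can be shown with a small data-processing sketch. The sample records, field names and both functions are hypothetical; this does not parse real bjobs, bhist or bacct output.

```python
from collections import defaultdict

# Hypothetical samples for one parallel job: (host, pid, cpu_seconds, mem_mb)
samples = [
    ("node01", 4101, 120.0, 512),
    ("node01", 4102, 118.5, 498),
    ("node02", 7710, 121.2, 530),
]


def usage_by_host(samples):
    """Group process samples by host and sum CPU time and memory."""
    per_host = defaultdict(lambda: {"pids": [], "cpu_seconds": 0.0, "mem_mb": 0})
    for host, pid, cpu, mem in samples:
        entry = per_host[host]
        entry["pids"].append(pid)
        entry["cpu_seconds"] += cpu
        entry["mem_mb"] += mem
    return dict(per_host)


def aggregate(per_host):
    """Sum the per-host figures into one job-wide total."""
    return {
        "cpu_seconds": sum(h["cpu_seconds"] for h in per_host.values()),
        "mem_mb": sum(h["mem_mb"] for h in per_host.values()),
    }


per_host = usage_by_host(samples)
print(per_host["node01"]["pids"])  # prints [4101, 4102]
print(aggregate(per_host)["mem_mb"])  # prints 1540
```

Keeping the PID list per host is what makes a targeted manual cleanup possible, while the aggregate is what accounting and reporting usually need.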

Improved Ease of Administration for Mixed Windows and Linux Clusters

The lspasswd command in Platform LSF enables Windows LSF users to advise LSF of changes to their Windows level passwords. With Platform LSF 8, password synchronization between environments has become much easier to manage because the Windows passwords can now be adjusted directly from Linux hosts using the lspasswd command. This allows Linux users to conveniently synchronize passwords on Windows hosts without needing to explicitly log in to the host.

Bulk Job Submission

When submitting large numbers of jobs with different resource requirements or job level settings, Bulk Job Submission allows for jobs to be submitted in bulk by referencing a single file containing job details.

Page 64: HPC Compass 2016_17
Page 65: HPC Compass 2016_17


Intel Cluster Ready – A Quality Standard for HPC Clusters

Intel Cluster Ready is designed to create predictable expectations for users and providers of HPC clusters, primarily targeting customers in the commercial and industrial sectors. These are not experimental “test-bed” clusters used for computer science and computer engineering research, or high-end “capability” clusters closely targeting their specific computing requirements that power the high-energy physics at the national labs or other specialized research organizations.

Intel Cluster Ready seeks to advance HPC clusters used as computing resources in production environments by providing cluster owners with a high degree of confidence that the clusters they deploy will run the applications their scientific and engineering staff rely upon to do their jobs. It achieves this by providing cluster hardware, software, and system providers with a precisely defined basis for their products to meet their customers’ production cluster requirements.

Life Sciences | Risk Analysis | Simulation | Big Data Analytics | High Performance Computing

Page 66: HPC Compass 2016_17


What are the Objectives of ICR?

The primary objective of Intel Cluster Ready is to make clusters easier to specify, easier to buy, easier to deploy, and make it easier to develop applications that run on them. A key feature of ICR is the concept of “application mobility”, which is defined as the ability of a registered Intel Cluster Ready application – more correctly, the same binary – to run correctly on any certified Intel Cluster Ready cluster. Clearly, application mobility is important for users, software providers, hardware providers, and system providers:

• Users want to know the cluster they choose will reliably run the applications they rely on today, and will rely on tomorrow

• Application providers want to satisfy the needs of their customers by providing applications that reliably run on their customers’ cluster hardware and cluster stacks

• Cluster stack providers want to satisfy the needs of their customers by providing a cluster stack that supports their customers’ applications and cluster hardware

• Hardware providers want to satisfy the needs of their customers by providing hardware components that support their customers’ applications and cluster stacks

• System providers want to satisfy the needs of their customers by providing complete cluster implementations that reliably run their customers’ applications

Without application mobility, each group above must either try to support all combinations, which they have neither the time nor resources to do, or pick the “winning combination(s)” that best supports their needs, and risk making the wrong choice.

The Intel Cluster Ready definition of application portability supports all of these needs by going beyond pure portability (re-compiling and linking a unique binary for each platform) to application binary mobility (running the same binary on multiple platforms), by more precisely defining the target system.

Intel Cluster Ready

A Quality Standard for HPC Clusters

Page 67: HPC Compass 2016_17









[Figure: ICR stack. Applications (Crash, Climate, QCD) on a single solution platform: Intel MPI Library (run-time), Intel MKL Cluster Edition (run-time), Linux cluster tools (Intel selected), fabrics (Gigabit Ethernet, 10Gbit Ethernet, InfiniBand (OFED)), Intel Xeon processor platform. Certified cluster platforms: Intel, OEM1, OEM2, PI1, PI2, ... Optional value add by individual providers and development tools (C++, Intel Trace Analyzer and Collector, MKL, etc.).]


A further aspect of application mobility is to ensure that registered Intel Cluster Ready applications do not need special programming or alternate binaries for different message fabrics. Intel Cluster Ready accomplishes this by providing an MPI implementation supporting multiple fabrics at runtime; through this, registered Intel Cluster Ready applications obey the “message layer independence property”. Stepping back, the unifying concept of Intel Cluster Ready is “one-to-many,” that is:

• One application will run on many clusters

• One cluster will run many applications

How is one-to-many accomplished? Looking at Figure 1, you see the abstract Intel Cluster Ready “stack” components that always exist in every cluster, i.e., one or more applications, a cluster software stack, one or more fabrics, and finally the underlying cluster hardware. The remainder of that picture (to the right) shows the components in greater detail.

Applications, on the top of the stack, rely upon the various APIs, utilities, and file system structure presented by the underlying software stack. Registered Intel Cluster Ready applications are always able to rely upon the APIs, utilities, and file system structure specified by the Intel Cluster Ready Specification; if an application requires software outside this “required” set, then Intel Cluster Ready requires the application to provide that software as a part of its installation.

To ensure that this additional per-application software doesn’t conflict with the cluster stack or other applications, Intel Cluster Ready also requires the additional software to be installed in application-private trees, so the application knows how to find that software while not interfering with other applications. While this may well cause duplicate software to be installed, the reliability provided by the duplication far outweighs the cost of the duplicated files.

ICR Stack

Page 68: HPC Compass 2016_17


A prime example supporting this comparison is the removal of a common file (library, utility, or other) that is unknowingly needed by some other application – such errors can be insidious to repair even when they cause an outright application failure.

Cluster platforms, at the bottom of the stack, provide the APIs, utilities, and file system structure relied upon by registered applications. Certified Intel Cluster Ready platforms ensure the APIs, utilities, and file system structure are complete per the Intel Cluster Ready Specification; certified clusters are able to provide them by various means as they deem appropriate. Because of the clearly defined responsibilities ensuring the presence of all software required by registered applications, system providers have a high confidence that the certified clusters they build are able to run any certified applications their customers rely on. In addition to meeting the Intel Cluster Ready requirements, certified clusters can also provide their added value, that is, other features and capabilities that increase the value of their products.

How Does Intel Cluster Ready Accomplish its Objectives?

At its heart, Intel Cluster Ready is a definition of the cluster as a parallel application platform, as well as a tool to certify an actual cluster to the definition. Let’s look at each of these in more detail, to understand their motivations and benefits.

A definition of the cluster as parallel application platform

The Intel Cluster Ready Specification is very much written as the requirements for, not the implementation of, a platform upon which parallel applications, more specifically MPI applications, can be built and run. As such, the specification doesn’t care whether the cluster is diskful or diskless, fully distributed or single system image (SSI), built from “Enterprise” distributions or community distributions, fully open source or not. Perhaps more importantly, with one exception, the specification doesn’t have any requirements on how the cluster is built; that one exception is that compute nodes must be built with automated tools, so that new, repaired, or replaced nodes can be rebuilt identically to the existing nodes without any manual interaction, other than possibly initiating the build process.

“The Intel Cluster Checker allows us to certify that our transtec HPC clusters are compliant with an independent high quality standard. Our customers can rest assured: their applications run as they expect.”

Marcus Wiedemann | HPC Solution Architect

Page 69: HPC Compass 2016_17

[Figure: Cluster Checker engine with a cluster definition and configuration XML file as input, test modules (parallel ops check), and pass/fail results and diagnostics written to STDOUT and a logfile.]

Some items the specification does care about include:

• The ability to run both 32- and 64-bit applications, including MPI applications and X-clients, on any of the compute nodes

• Consistency among the compute nodes’ configuration, capability, and performance

• The identical accessibility of libraries and tools across the cluster

• The identical access by each compute node to permanent and temporary storage, as well as users’ data

• The identical access to each compute node from the head node

• The MPI implementation provides fabric independence

• All nodes support network booting and provide a remotely accessible console

The specification also requires that the runtimes for specific Intel software products are installed on every certified cluster:

• Intel Math Kernel Library

• Intel MPI Library Runtime Environment

• Intel Threading Building Blocks

This requirement does two things. First and foremost, mainline Linux distributions do not necessarily provide a sufficient software stack to build an HPC cluster – such specialization is beyond their mission. Secondly, the requirement ensures that programs built with this software will always work on certified clusters and enjoy simpler installations. As these runtimes are directly available from the web, the requirement does not cause additional costs to certified clusters. It is also very important to note that this does not require certified applications to use these libraries nor does it preclude alternate libraries, e.g., other MPI implementations, from being present on certified clusters. Quite clearly, an application that requires, e.g., an alternate MPI, must also provide the runtimes for that MPI as a part of its installation.


A tool to certify an actual cluster to the definition

The Intel Cluster Checker, included with every certified Intel

Cluster Ready implementation, is used in four modes in the life

of a cluster:

• To certify a system provider’s prototype cluster as a valid implementation of the specification

• To verify to the owner that the just-delivered cluster is a “true copy” of the certified prototype

• To ensure the cluster remains fully functional, reducing service calls not related to the applications or the hardware

• To help software and system providers diagnose and correct actual problems in their code or their hardware.

While these are critical capabilities, in all fairness, this list greatly understates the capabilities of Intel Cluster Checker. The tool verifies that the cluster is performing as expected. To do this, it runs per-node and cluster-wide static and dynamic tests of the hardware and software.

Intel Cluster Checker

Page 70: HPC Compass 2016_17


The static checks ensure the systems are configured consist-

ently and appropriately. As one example, the tool will ensure

the systems are all running the same BIOS versions as well as

having identical configurations among key BIOS settings. This

type of problem – differing BIOS versions or settings – can be

the root cause of subtle problems such as differing memory con-

figurations that manifest themselves as differing memory band-

widths only to be seen at the application level as slower than ex-

pected overall performance. As is well-known, parallel program

performance can be very much governed by the performance

of the slowest components, not the fastest. In another static

check, the Intel Cluster Checker will ensure that the expected

tools, libraries, and files are present on each node, identically

located on all nodes, as well as identically implemented on all

nodes. This ensures that each node has the minimal software stack required by the specification, as well as an identical software stack across the compute nodes.
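The idea behind such a static check can be sketched in a few lines of Python. This is only an illustration of the principle, not the Intel Cluster Checker implementation: the node names, the attribute keys (`bios_version`, `hyperthreading`) and the function name are hypothetical, and the per-node values are assumed to have been collected already.

```python
from collections import Counter

def find_inconsistent_nodes(node_attrs, keys):
    """Report nodes whose value for a key differs from the majority value.

    node_attrs: {node_name: {attribute: value}} (hypothetical, pre-collected data)
    keys:       attributes that must be identical on all nodes
    Returns {key: {node_name: deviating_value}}; an empty dict means consistent.
    """
    report = {}
    for key in keys:
        values = {node: attrs.get(key) for node, attrs in node_attrs.items()}
        # The most common value is treated as the reference configuration.
        majority, _ = Counter(values.values()).most_common(1)[0]
        outliers = {n: v for n, v in values.items() if v != majority}
        if outliers:
            report[key] = outliers
    return report

# Illustrative data only: node03 runs an older BIOS than the rest.
nodes = {
    "node01": {"bios_version": "2.1a", "hyperthreading": "off"},
    "node02": {"bios_version": "2.1a", "hyperthreading": "off"},
    "node03": {"bios_version": "2.0c", "hyperthreading": "off"},
}
print(find_inconsistent_nodes(nodes, ["bios_version", "hyperthreading"]))
```

Taking the majority value as the reference is a simplification; a real tool would rather compare against the certified configuration.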

A typical dynamic check ensures consistent system perfor-

mance, e.g., via the STREAM benchmark. This particular test en-

sures processor and memory performance is consistent across

compute nodes, which, like the BIOS setting example above, can

be the root cause of overall slower application performance. An

additional check with STREAM can be made if the user config-

ures an expectation of benchmark performance; this check will

ensure that performance is not only consistent across the clus-

ter, but also meets expectations.
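The two STREAM-based checks described above (consistency across nodes, and optionally meeting a configured expectation) can be illustrated as follows. This is a minimal sketch under stated assumptions: the bandwidth figures are invented, the 5 % tolerance and the function name `check_stream` are hypothetical, and the actual thresholds used by Intel Cluster Checker are not given in this text.

```python
from statistics import median

def check_stream(bandwidth_mb_s, tolerance=0.05, expected_min=None):
    """Flag nodes whose measured memory bandwidth is out of line.

    bandwidth_mb_s: {node_name: measured bandwidth in MB/s} (assumed input)
    tolerance:      allowed fraction below the cluster median (assumption)
    expected_min:   optional user-configured minimum bandwidth in MB/s
    Returns {node_name: reason} for every failing node.
    """
    med = median(bandwidth_mb_s.values())
    failures = {}
    for node, bw in bandwidth_mb_s.items():
        if bw < med * (1 - tolerance):
            # Consistency check: the node is slower than its peers.
            failures[node] = "below cluster median"
        elif expected_min is not None and bw < expected_min:
            # Expectation check: consistent, but slower than configured.
            failures[node] = "below configured expectation"
    return failures

# Illustrative measurements only.
measured = {"node01": 41000.0, "node02": 40500.0, "node03": 33000.0}
print(check_stream(measured))
print(check_stream(measured, expected_min=42000.0))
```

The first call reports only the slow node; the second also reports nodes that are consistent with each other but miss the configured expectation.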

Going beyond processor performance, the Intel MPI Bench-

marks are used to ensure the network fabric(s) are performing

properly and, with a configuration that describes expected

performance levels, up to the cluster provider’s performance

expectations. Network inconsistencies due to poorly performing Ethernet NICs, InfiniBand HCAs, faulty switches, and loose or faulty cables can be identified. Finally, the Intel Cluster Checker

is extensible, enabling additional tests to be added supporting

additional features and capabilities.

Intel Cluster Ready

Intel Cluster Ready Builds

HPC Momentum

Page 71: HPC Compass 2016_17






tech BOX

[Figure: Intel Cluster Ready Program. Labels: Cluster Platform; Software Tools; Reference Designs; Demand Creation; Support & Training; ISV Enabling.]


Intel Cluster Ready Builds HPC Momentum

With the Intel Cluster Ready (ICR) program, Intel Corporation

set out to create a win-win scenario for the major constituen-

cies in the high-performance computing (HPC) cluster mar-

ket. Hardware vendors and independent software vendors

(ISVs) stand to win by being able to assure both buyers and

users that their products will work well together straight out

of the box.

System administrators stand to win by being able to meet

corporate demands to push HPC competitive advantages

deeper into their organizations while satisfying end users’

demands for reliable HPC cycles, all without increasing IT

staff. End users stand to win by being able to get their work

done faster, with less downtime, on certified cluster plat-

forms. Last but not least, with ICR, Intel has positioned itself

to win by expanding the total addressable market (TAM) and

reducing time to market for the company’s microprocessors,

chip sets, and platforms.

The Worst of Times

For a number of years, clusters were largely confined to

government and academic sites, where contingents of

graduate students and midlevel employees were availa-

ble to help program and maintain the unwieldy early sys-

tems. Commercial firms lacked this low-cost labor supply

and mistrusted the favored cluster operating system, open

source Linux, on the grounds that no single party could be

held accountable if something went wrong with it. Today,

cluster penetration in the HPC market is deep and wide,

extending from systems with a handful of processors to

This enables the Intel Cluster Checker to support not only the minimal requirements of the Intel Cluster Ready Specification, but also the full cluster as delivered to the customer.

Conforming hardware and software

The preceding was primarily related to the builders of certified

clusters and the developers of registered applications. For end

users that want to purchase a certified cluster to run registered

applications, the ability to identify registered applications and

certified clusters is most important, as that will reduce their ef-

fort to evaluate, acquire, and deploy the clusters that run their

applications, and then keep that computing resource operating

properly, with full performance, directly increasing their pro-


Intel, XEON and certain other trademarks and logos appearing in this brochure,

are trademarks or registered trademarks of Intel Corporation.

Page 72: HPC Compass 2016_17


some of the world’s largest supercomputers, and from un-

der $25,000 to tens or hundreds of millions of dollars in price.

Clusters increasingly pervade every HPC vertical market:

biosciences, computer-aided engineering, chemical engineer-

ing, digital content creation, economic/financial services,

electronic design automation, geosciences/geo-engineering,

mechanical design, defense, government labs, academia, and


But IDC studies have consistently shown that clusters remain

difficult to specify, deploy, and manage, especially for new and

less experienced HPC users. This should come as no surprise,

given that a cluster is a set of independent computers linked to-

gether by software and networking technologies from multiple

vendors. Clusters originated as do-it-yourself HPC systems. In

the late 1990s users began employing inexpensive hardware to

cobble together scientific computing systems based on the “Be-

owulf cluster” concept first developed by Thomas Sterling and

Donald Becker at NASA. From their Beowulf origins, clusters have

evolved and matured substantially, but the system manage-

ment issues that plagued their early years remain in force today.

The Need for Standard Cluster Solutions

The escalating complexity of HPC clusters poses a dilemma for

many large IT departments that cannot afford to scale up their

HPC-knowledgeable staff to meet the fast-growing end-user de-

mand for technical computing resources. Cluster management

is even more problematic for smaller organizations and busi-

ness units that often have no dedicated, HPC-knowledgeable

staff to begin with.

The ICR program aims to address burgeoning cluster complexity

by making available a standard solution (aka reference archi-

tecture) for Intel-based systems that hardware vendors can use


Page 73: HPC Compass 2016_17


to certify their configurations and that ISVs and other software

vendors can use to test and register their applications, system

software, and HPC management software. The chief goal of this

voluntary compliance program is to ensure fundamental hard-

ware-software integration and interoperability so that system

administrators and end users can confidently purchase and de-

ploy HPC clusters, and get their work done, even in cases where

no HPC-knowledgeable staff are available to help.

The ICR program wants to prevent end users from having to

become, in effect, their own systems integrators. In smaller or-

ganizations, the ICR program is designed to allow overworked

IT departments with limited or no HPC expertise to support HPC

user requirements more readily. For larger organizations with

dedicated HPC staff, ICR creates confidence that required user

applications will work, eases the problem of system adminis-

tration, and allows HPC cluster systems to be scaled up in size

without scaling support staff. ICR can help drive HPC cluster re-

sources deeper into larger organizations and free up IT staff to

focus on mainstream enterprise applications (e.g., payroll, sales,

HR, and CRM).

The program is a three-way collaboration among hardware ven-

dors, software vendors, and Intel. In this triple alliance, Intel pro-

vides the specification for the cluster architecture implementa-

tion, and then vendors certify the hardware configurations and

register software applications as compliant with the specifica-

tion. The ICR program’s promise to system administrators and

end users is that registered applications will run out of the box

on certified hardware configurations.

ICR solutions are compliant with the standard platform archi-

tecture, which starts with 64-bit Intel Xeon processors in an In-

tel-certified cluster hardware platform. Layered on top of this

foundation are the interconnect fabric (Gigabit Ethernet, Infini-

Band) and the software stack: Intel-selected Linux cluster tools,

an Intel MPI runtime library, and the Intel Math Kernel Library.

Intel runtime components are available and verified as part of

the certification (e.g., Intel tool runtimes) but are not required

to be used by applications. The inclusion of these Intel runtime

components does not exclude any other components a systems

vendor or ISV might want to use. At the top of the stack are In-

tel-registered ISV applications.

At the heart of the program is the Intel Cluster Checker, a vali-

dation tool that verifies that a cluster is specification compliant

and operational before ISV applications are ever loaded. After

the cluster is up and running, the Cluster Checker can function

as a fault isolation tool in wellness mode. Certification needs

to happen only once for each distinct hardware platform, while

verification – which determines whether a valid copy of the

specification is operating – can be performed by the Cluster

Checker at any time.

Cluster Checker is an evolving tool that is designed to accept

new test modules. It is a productized tool that ICR members ship

with their systems. Cluster Checker originally was designed for

homogeneous clusters but can now also be applied to clusters

with specialized nodes, such as all-storage sub-clusters. Cluster

Checker can isolate a wide range of problems, including network

or communication problems.

Page 74: HPC Compass 2016_17



transtec offers its customers a new and fascinating way to evaluate transtec’s HPC solutions in real-world scenarios. With the transtec Benchmarking Center, solutions can be explored in detail with the actual applications the customers will later run on them. Intel Cluster Ready makes this feasible by simplifying system maintenance and by allowing clean systems to be set up easily and as often as needed.

As High-Performance Computing (HPC) systems are utilized for

numerical simulations, more and more advanced clustering

technologies are being deployed. Because of their performance,

price/performance and energy efficiency advantages, clusters

now dominate all segments of the HPC market and continue

to gain acceptance. HPC computer systems have become far

more widespread and pervasive in government, industry, and

academia. However, rarely does the client have the possibility

to test their actual application on the system they are planning

to acquire.

The transtec Benchmarking Center

transtec HPC solutions are used by a wide variety of clients.

Among those are most of the large users of compute power at

German and other European universities and research centers

as well as governmental users like the German army’s compute

center and clients from the high tech, the automotive and oth-

er sectors. transtec HPC solutions have demonstrated their

value in more than 500 installations. Most of transtec’s cluster

systems are based on SUSE Linux Enterprise Server, Red Hat


Page 75: HPC Compass 2016_17


Enterprise Linux, CentOS, or Scientific Linux. With xCAT for ef-

ficient cluster deployment, and Moab HPC Suite by Adaptive

Computing for high-level cluster management, transtec is able

to efficiently deploy and ship easy-to-use HPC cluster solutions

with enterprise-class management features. Moab has proven

to provide easy-to-use workload and job management for small

systems as well as the largest cluster installations worldwide.

However, when selling clusters to governmental customers as

well as other large enterprises, it is often required that the cli-

ent can choose from a range of competing offers. Many times

there is a fixed budget available and competing solutions are

compared based on their performance on certain custom

benchmark codes.

So, in 2007 transtec decided to add another layer to their already

wide array of competence in HPC – ranging from cluster deploy-

ment and management, the latest CPU, board and network tech-

nology to HPC storage systems. In transtec’s HPC Lab the sys-

tems are being assembled. transtec is using Intel Cluster Ready

to facilitate testing, verification, documentation, and final test-

ing throughout the actual build process.

At the benchmarking center transtec can now offer a set of small

clusters with the “newest and hottest technology” through

Intel Cluster Ready. A standard installation infrastructure

gives transtec a quick and easy way to set systems up accord-

ing to their customers’ choice of operating system, compilers,

workload management suite, and so on. With Intel Cluster Ready

there are prepared standard set-ups available with verified per-

formance at standard benchmarks while the system stability is

guaranteed by our own test suite and the Intel Cluster Checker.

The Intel Cluster Ready program is designed to provide a com-

mon standard for HPC clusters, helping organizations design

and build seamless, compatible and consistent cluster config-

urations. Integrating the standards and tools provided by this

program can help significantly simplify the deployment and

management of HPC clusters.

Page 76: HPC Compass 2016_17
Page 77: HPC Compass 2016_17


Remote Visualization and Workflow Optimization

It’s human nature to want to ‘see’ the results from simulations, tests, and analyses. Up

until recently, this has meant ‘fat’ workstations on many user desktops. This approach

provides CPU power when the user wants it – but as dataset size increases, there can be

delays in downloading the results.

Also, sharing the results with colleagues means gathering around the workstation – not

always possible in this globalized, collaborative workplace.

Life Sciences | CAE | High Performance Computing | Big Data Analytics | Simulation | CAD

Page 78: HPC Compass 2016_17


Aerospace, Automotive and Manufacturing

The complex factors involved in CAE range from compute-intensive data analysis to worldwide collaboration between designers, engineers, OEMs and suppliers. To accommodate these

factors, Cloud (both internal and external) and Grid Computing

solutions are increasingly seen as a logical step to optimize IT

infrastructure usage for CAE. It is no surprise then, that automo-

tive and aerospace manufacturers were the early adopters of

internal Cloud and Grid portals.

Manufacturers can now develop more “virtual products” and

simulate all types of designs, fluid flows, and crash simulations.

Such virtualized products and more streamlined collaboration

environments are revolutionizing the manufacturing process.

With NICE EnginFrame in their CAE environment engineers can

take the process even further by connecting design and simula-

tion groups in “collaborative environments” to get even greater

benefits from “virtual products”. Thanks to EnginFrame, CAE engineers can have a simple, intuitive collaborative environment that can take care of issues related to:

• Access & Security - Where an organization must give access to external and internal entities such as designers, engineers and suppliers.

• Distributed collaboration - Simple and secure connection of design and simulation groups distributed worldwide.

• Time spent on IT tasks - By eliminating time and resources spent using cryptic job submission commands or acquiring knowledge of underlying compute infrastructures, engineers can spend more time concentrating on their core design tasks.

EnginFrame’s web based interface can be used to access the

compute resources required for CAE processes. This means ac-

cess to job submission & monitoring tasks, input & output data

associated with industry standard CAE/CFD applications for

Fluid-Dynamics, Structural Analysis, Electro Design and Design

Remote Visualization

NICE EnginFrame:

A Technical Computing Portal

Page 79: HPC Compass 2016_17


Collaboration (like Abaqus, Ansys, Fluent, MSC Nastran, PAM-

Crash, LS-Dyna, Radioss) without cryptic job submission com-

mands or knowledge of underlying compute infrastructures.

EnginFrame has a long history of usage in some of the most prestigious manufacturing organizations worldwide, including aerospace companies like AIRBUS, Alenia Space, CIRA, Galileo Avionica, Hamilton Sundstrand, Magellan Aerospace, MTU and automotive companies like Audi, ARRK, Brawn GP, Bridgestone, Bosch, Delphi, Elasis, Ferrari, FIAT, GDX Automotive, Jaguar-LandRover, Lear, Magneti Marelli, McLaren, P+Z, RedBull, Swagelok, Suzuki, Toyota, TRW.

Life Sciences and Healthcare

NICE solutions are deployed in the Life Sciences sector at companies like BioLab, Partners HealthCare, Pharsight and the M.D. Anderson Cancer Center, and in leading research projects like DEISA or LitBio, in order to allow easy and transparent use of computing resources without any insight into the HPC infrastructure.

The Life Science and Healthcare sectors have some very strict requirements when choosing an IT solution like EnginFrame, for example:


• Security - To meet the strict security & privacy requirements of the biomedical and pharmaceutical industry any solution needs to take account of multiple layers of security and au-


• Industry specific software - ranging from the simplest custom tool to the more general purpose free and open middleware.

EnginFrame’s modular architecture allows different Grid middleware and software (including leading Life Science applications like Schroedinger Glide, EPFL RAxML, the BLAST family, Taverna, and R) to be exploited. Users can compose elementary services into complex applications and “virtual experiments”, and run, monitor, and build workflows via a standard Web browser. EnginFrame also has highly tunable resource sharing and fine-grained access control where multiple authentication systems (like Active Directory, Kerberos/LDAP) can be exploited simultaneously.


Growing complexity in globalized teams

HPC systems and Enterprise Grids deliver unprecedented time-to-market and performance advantages to many research and corporate customers who struggle every day with compute- and data-intensive processes.

This often generates or transforms massive amounts of jobs and data, which need to be handled and archived efficiently to deliver timely information to users distributed in multiple locations, with different security concerns.

Poor usability of such complex systems often negatively im-

pacts users’ productivity, and ad-hoc data management often

increases information entropy and dissipates knowledge and

intellectual property.

The solution

Solving distributed computing issues for our customers, we understand that a modern, user-friendly web front-end to HPC and Grids can drastically improve engineering productivity, if properly designed to address the specific challenges of the Technical Computing market.

NICE EnginFrame overcomes many common issues in the areas

of usability, data management, security and integration, open-

ing the way to a broader, more effective use of the Technical

Computing resources.

The key components of the solution are:

• A flexible and modular Java-based kernel, with clear separation between customizations and core services

Page 80: HPC Compass 2016_17


• Powerful data management features, reflecting the typical needs of engineering applications

• Comprehensive security options and a fine-grained authorization system

• Scheduler abstraction layer to adapt to different workload and resource managers

• Responsive and competent Support services
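To make the idea of a scheduler abstraction layer more concrete, the following Python sketch maps one application-level job request onto the submission commands of two different workload managers. It is an illustration of the concept only and does not reflect EnginFrame’s actual (Java-based) interfaces: the class names, the `JobRequest` fields and the script name are hypothetical. Only the `bsub` (LSF) and `qsub` (Torque) options shown are standard ones.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List

@dataclass
class JobRequest:
    """Scheduler-independent description of a job (hypothetical fields)."""
    name: str
    queue: str
    cores: int
    script: str  # path to the job script to run

class Scheduler(ABC):
    """Common interface the portal layer would program against."""
    @abstractmethod
    def submit_command(self, job: JobRequest) -> List[str]:
        ...

class LsfScheduler(Scheduler):
    def submit_command(self, job: JobRequest) -> List[str]:
        # bsub: -J job name, -q queue, -n number of slots
        return ["bsub", "-J", job.name, "-q", job.queue,
                "-n", str(job.cores), job.script]

class TorqueScheduler(Scheduler):
    def submit_command(self, job: JobRequest) -> List[str]:
        # qsub: -N job name, -q queue, -l resource list (single node assumed)
        return ["qsub", "-N", job.name, "-q", job.queue,
                "-l", f"nodes=1:ppn={job.cores}", job.script]

job = JobRequest(name="crash_sim", queue="normal", cores=16, script="./run_case.sh")
for scheduler in (LsfScheduler(), TorqueScheduler()):
    print(" ".join(scheduler.submit_command(job)))
```

With such a layer, the web form that collects the job parameters stays the same when the underlying workload manager changes; only the adapter class differs.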

End users can typically enjoy the following improvements:

• User-friendly, intuitive access to computing resources, using a standard Web browser

• Application-centric job submission

• Organized access to job information and data

• Increased mobility and reduced client requirements


Page 81: HPC Compass 2016_17


On the other side, the Technical Computing Portal delivers signif-

icant added-value for system administrators and IT:

• Reduced training needs to enable users to access the


• Centralized configuration and immediate deployment of


• Comprehensive authorization to access services and


• Reduced support calls and submission errors

Coupled with our Remote Visualization solutions, our custom-

ers quickly deploy end-to-end engineering processes on their

Intranet, Extranet or Internet.

Page 82: HPC Compass 2016_17


Solving distributed computing issues for our customers, it is

easy to understand that a modern, user-friendly web front-end

to HPC and grids can drastically improve engineering productiv-

ity, if properly designed to address the specific challenges of the

Technical Computing market.

Remote Visualization

Increasing dataset complexity (millions of polygons, interacting components, MRI/PET overlays) means that when the time comes to upgrade and replace the workstations, the next generation of

hardware needs more memory, more graphics processing, more

disk, and more CPU cores. This makes the workstation expen-

sive, in need of cooling, and noisy.

Innovation in the field of remote 3D processing now allows companies to address these issues by moving applications away from the desktop into the data center. Instead of pushing data to the

application, the application can be moved near the data.

Instead of mass workstation upgrades, remote visualization

allows incremental provisioning, on-demand allocation, better

management and efficient distribution of interactive sessions

and licenses. Racked workstations or blades typically have lower maintenance, cooling, and replacement costs, and they can extend workstation (or laptop) life as “thin clients”.

The solution

Leveraging their expertise in distributed computing and web-

based application portals, NICE offers an integrated solution to

access, load balance and manage applications and desktop ses-

sions running within a visualization farm. The farm can include

both Linux and Windows resources, running on heterogeneous



Page 83: HPC Compass 2016_17


The core of the solution is the EnginFrame visualization plug-in,

that delivers web-based services to access and manage applica-

tions and desktops published in the farm. This solution has been

integrated with:

• NICE Desktop Cloud Visualization (DCV)

• HP Remote Graphics Software (RGS)

• RealVNC

• TurboVNC and VirtualGL

• NoMachine NX

Coupled with these third party remote visualization engines

(which specialize in delivering high frame-rates for 3D graphics),

the NICE offering for Remote Visualization solves the issues of

user authentication, dynamic session allocation, session man-

agement and data transfers.

End users can enjoy the following improvements:

• Intuitive, application-centric web interface to start, control and re-connect to a session

• Single sign-on for batch and interactive applications

• All data transfers from and to the remote visualization farm are handled by EnginFrame

• Built-in collaboration, to share sessions with other users

• The load and usage of the visualization cluster is monitored in the browser

The solution also delivers significant added-value for the sys-

tem administrators:

• No need for SSH / SCP / FTP on the client machine

• Easy integration into identity services, Single Sign-On (SSO), Enterprise portals

• Automated data life cycle management

• Built-in user session sharing, to facilitate support

• Interactive sessions are load balanced by a scheduler (LSF, Moab or Torque) to achieve optimal performance and resource usage

• Better control and use of application licenses

• Monitor, control and manage users’ idle sessions
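The principle of load-balanced session allocation can be illustrated with a small Python sketch. In a real deployment this decision is made by the scheduler (LSF, Moab or Torque), so the selection rule below (fewest running sessions, then most free GPU memory), the host names and the field names are assumptions made purely for illustration.

```python
def pick_host(hosts, gpu_mem_needed_mb):
    """Choose a visualization host for a new interactive session.

    hosts: list of dicts with hypothetical keys
           'name', 'sessions' (running sessions), 'free_gpu_mem_mb'
    Returns the chosen host name, or None if no host has enough GPU memory.
    """
    candidates = [h for h in hosts if h["free_gpu_mem_mb"] >= gpu_mem_needed_mb]
    if not candidates:
        return None
    # Prefer the host with the fewest sessions; break ties by free GPU memory.
    best = min(candidates, key=lambda h: (h["sessions"], -h["free_gpu_mem_mb"]))
    return best["name"]

# Illustrative farm state only.
farm = [
    {"name": "viz01", "sessions": 3, "free_gpu_mem_mb": 2048},
    {"name": "viz02", "sessions": 1, "free_gpu_mem_mb": 1024},
    {"name": "viz03", "sessions": 1, "free_gpu_mem_mb": 4096},
]
print(pick_host(farm, gpu_mem_needed_mb=1024))
print(pick_host(farm, gpu_mem_needed_mb=8192))
```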

Desktop Cloud Visualization

NICE Desktop Cloud Visualization (DCV) is an advanced technology that enables Technical Computing users to remotely access 2D/3D interactive applications over a standard network.

Engineers and scientists are immediately empowered by taking

full advantage of high-end graphics cards, fast I/O performance

and large memory nodes hosted in “Public or Private 3D Cloud”,

rather than waiting for the next upgrade of the workstations. The DCV protocol adapts to heterogeneous networking infrastructures like LAN, WAN and VPN, to deal with bandwidth and latency constraints. All applications run natively on the remote machines, which can be virtualized and share the same physical GPU.

Page 84: HPC Compass 2016_17


In a typical visualization scenario, a software application sends

a stream of graphics commands to a graphics adapter through

an input/output (I/O) interface. The graphics adapter renders

the data into pixels and outputs them to the local display as

a video signal. When using NICE DCV, the scene geometry and

graphics state are rendered on a central server, and pixels are

sent to one or more remote displays.

This approach requires the server to be equipped with one or

more GPUs, which are used for the OpenGL rendering, while the

client software can run on “thin” devices.
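A rough calculation shows why sending rendered pixels over the network makes the balance between image quality and frame rate so important. The Python sketch below computes the bandwidth of a pixel stream; the resolution, frame rate and the compression ratio of 50 are example values chosen for illustration, not figures published for NICE DCV.

```python
def stream_bandwidth_mbit_s(width, height, fps, bits_per_pixel=24,
                            compression_ratio=1.0):
    """Bandwidth of a pixel stream in Mbit/s.

    compression_ratio is a hypothetical average ratio achieved by the
    remote visualization protocol (1.0 means uncompressed).
    """
    raw_bits_per_second = width * height * bits_per_pixel * fps
    return raw_bits_per_second / compression_ratio / 1e6

# Full HD at 30 frames per second, uncompressed: roughly 1.5 Gbit/s.
print(round(stream_bandwidth_mbit_s(1920, 1080, 30), 1))

# The same stream with an assumed 50:1 compression: roughly 30 Mbit/s.
print(round(stream_bandwidth_mbit_s(1920, 1080, 30, compression_ratio=50), 1))
```

On a LAN the uncompressed figure is already demanding; on a WAN or VPN link, stronger compression or a lower frame rate is needed to stay interactive.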

The NICE DCV architecture consists of:

• DCV Server, equipped with one or more GPUs, used for OpenGL rendering

• one or more DCV end stations, running on “thin clients”, only used for visualization

• heterogeneous networking infrastructures (like LAN, WAN and VPN), optimized by balancing quality against frame rate

NICE DCV Highlights:

• Enables high-performance remote access to interactive 2D/3D software applications on low bandwidth/high latency

• Supports multiple heterogeneous OS (Windows, Linux)

• Enables GPU sharing

• Supports 3D acceleration for OpenGL applications running on virtual machines

• Supports multiple user collaboration via session sharing

• Enables attractive Return-on-Investment through resource sharing and consolidation to data centers (GPU, memory, CPU)

• Keeps the data secure in the data center, reducing data load and saving time

• Enables right sizing of system allocation based on users’ dynamic needs

• Facilitates application deployment: all applications, updates and patches are instantly available to everyone, without any changes to original code

“The amount of data resulting from e.g. simulations in CAE or other

engineering environments can be in the Gigabyte range. It is obvious

that remote post-processing is one of the most urgent topics to be

tackled. NICE EnginFrame provides exactly that, and our customers

are impressed that such great technology enhances their workflow so


Robin Kienecker | HPC Sales Specialist


Page 85: HPC Compass 2016_17


Business Benefits

The business benefits of adopting NICE DCV can be summarized in four categories:

Category: Productivity

• Increase business efficiency

• Improve team performance by ensuring real-time collaboration with colleagues and partners, anywhere.

• Reduce IT management costs by consolidating workstation resources to a single


• Save money and time on application


• Let users work from anywhere if there is an Internet connection



• Move graphics processing and data to the datacenter - not on laptop/desktop

• Cloud-based platform support enables you to scale the visualization solution “on-demand” to extend business, grow new revenue, manage costs.



• Guarantee secure and auditable use of remote resources (applications, data, infrastructure, licenses)

• Allow real-time collaboration with partners while protecting Intellectual Property and resources

• Restrict access by class of user, service, application, and resource



• Enable multiple users to follow application procedures alongside an instructor in real-time

• Enable collaboration and session sharing among remote users (employees, partners, and affiliates)

NICE DCV is perfectly integrated into EnginFrame Views, lever-

aging 2D/3D capabilities over the web, including the ability to

share an interactive session with other users for collaborative


Page 86: HPC Compass 2016_17



Cloud Computing With transtec and NICE

Cloud Computing is a style of computing in which dynamically

scalable and often virtualized resources are provided as a service

over the Internet. Users need not have knowledge of, expertise in,

or control over the technology infrastructure in the “cloud” that

supports them. The concept incorporates technologies that have

the common theme of reliance on the Internet for satisfying the

computing needs of the users. Cloud Computing services usually

provide applications online that are accessed from a web brows-

er, while the software and data are stored on the servers.

Companies or individuals engaging in Cloud Computing do not

own the physical infrastructure hosting the software platform

in question. Instead, they avoid capital expenditure by renting

usage from a third-party provider (except for the case of ‘Private

Cloud’ - see below). They consume resources as a service, paying only for the resources they use. Many Cloud Computing offerings have adopted the utility computing model, which

is analogous to how traditional utilities like electricity are con-

sumed, while others are billed on a subscription basis.
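The difference between the two billing models can be made tangible with a simple break-even calculation. The prices used below are invented for illustration and do not correspond to any real offering; the function name is hypothetical as well.

```python
def break_even_hours(subscription_per_month, rate_per_core_hour, cores):
    """Monthly usage (in hours) at which pay-per-use costs as much as a flat fee.

    All prices are hypothetical inputs in the same currency.
    """
    return subscription_per_month / (rate_per_core_hour * cores)

# Assumed prices: flat fee of 2000 per month versus 0.10 per core-hour on 16 cores.
hours = break_even_hours(2000.0, 0.10, 16)
print(round(hours))  # below this usage, utility billing is the cheaper model
```

Users with bursty, low average utilization tend to fall below the break-even point and benefit from utility billing, while steady heavy users are usually better served by a subscription.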

The main advantage offered by Cloud solutions is the reduction

of infrastructure costs, and of the infrastructure’s maintenance.

By not owning the hardware and software, Cloud users avoid

capital expenditure by renting usage from a third-party provider.

Customers pay for only the resources they use.

The advantage for the provider of the Cloud is that sharing computing
power among multiple tenants improves utilization rates,

as servers are not left idle, which can reduce costs and increase




Public cloud

Public cloud or external cloud describes cloud computing in the

traditional mainstream sense, whereby resources are dynam-

ically provisioned on a fine-grained, self-service basis over the

Internet, via web applications/web services, from an off-site

third-party provider who shares resources and bills on a fine-

grained utility computing basis.

Private cloud

Private cloud and internal cloud describe offerings deploying

cloud computing on private networks. These solutions aim to

“deliver some benefits of cloud computing without the pitfalls”,

capitalizing on data security, corporate governance, and relia-

bility concerns. On the other hand, users still have to buy, deploy,

manage, and maintain them, and as such do not benefit from

lower up-front capital costs and less hands-on management.


The majority of cloud computing infrastructure, today, consists

of reliable services delivered through data centers and built on

servers with different levels of virtualization technologies. The

services are accessible anywhere that has access to networking

infrastructure. The Cloud appears as a single point of access

for all the computing needs of consumers. The offerings need

to meet the quality of service requirements of customers and

typically offer service level agreements while, at the same time,
overcoming the typical limitations.

The architecture of the computing platform proposed by NICE

(fig. 1) differs from the others in some interesting ways:

• you can deploy it on an existing IT infrastructure, because it
  is completely decoupled from the hardware infrastructure
• it has a high level of modularity and configurability, which makes
  it easy to customize for the user’s needs
• based on the NICE EnginFrame technology, it is easy to build
  graphical web-based interfaces to provide several applications,
  as web services, without needing to code or compile source programs
• it utilises the existing methodology in place for authentication
  and authorization



Further, because the NICE Cloud solution is built on advanced

IT technologies, including virtualization and workload manage-

ment, the execution platform is dynamically able to allocate,

monitor, and configure a new environment as needed by the

application, inside the Cloud infrastructure. The NICE platform

offers these important properties:

• Incremental Scalability: the quantity of computing and storage
  resources provisioned to the applications changes dynamically
  depending on the workload
• Reliability and Fault-Tolerance: because of the virtualization of
  the hardware resources and the multiple redundant hosts, the
  platform adjusts the resources needed by the applications,
  without disruption during disasters or crashes
• Service Level Agreement: the use of advanced systems for the
  dynamic allocation of resources makes it possible to guarantee
  the service levels agreed across applications and services
• Accountability: the continuous monitoring of the resources used
  by each application (and user) allows the setup of services that
  users can access in a pay-per-use mode, or by subscribing to a
  specific contract. In the case of an Enterprise Cloud, this
  feature allows costs to be shared among the cost centers of the
  company.

transtec has long-term experience in engineering environments,
especially in the CAD/CAE sector. This allows us to provide
customers from this area with solutions that greatly enhance
their workflow and minimize time-to-result.

This, together with transtec’s offerings of all kinds of services,

allows our customers to fully focus on their productive work,

and have us do the environmental optimizations.



Graphics accelerated virtual desktops and applications

NVIDIA GRID for large corporations offers the ability to offload

graphics processing from the CPU to the GPU in virtualised envi-

ronments, allowing the data center manager to deliver true PC

graphics-rich experiences to more users for the first time.

Benefits of NVIDIA GRID for IT:

• Leverage industry-leading remote visualization solutions like
• Add your most graphics-intensive users to your virtual solutions
• Improve the productivity of all users

Benefits of NVIDIA GRID for users:

• Highly responsive windows and rich multimedia experiences
• Access to all critical applications, including the most 3D-intensive
• Access from anywhere, on any device

NVIDIA GRID Boards

NVIDIA’s Kepler architecture-based GRID boards are specifically

designed to enable rich graphics in virtualised environments.

GPU Virtualisation

GRID boards feature NVIDIA Kepler-based GPUs that, for the first

time, allow hardware virtualisation of the GPU. This means mul-

tiple users can share a single GPU, improving user density while

providing true PC performance and compatibility.

Low-Latency Remote Display

NVIDIA’s patented low-latency remote display technology greatly
improves the user experience by reducing the lag that users

feel when interacting with their virtual machine.

With this technology, the virtual desktop screen is pushed

directly to the remoting protocol.

H.264 Encoding

The Kepler GPU includes a high-performance H.264 encoding
engine capable of encoding simultaneous streams with superior
quality. This provides a giant leap forward in cloud server

efficiency by offloading the CPU from encoding functions and

allowing the encode function to scale with the number of GPUs

in a server.

Maximum User Density

GRID boards have an optimised multi-GPU design that helps to

maximise user density. The GRID K1 board features 4 GPUs and

16 GB of graphics memory, allowing it to support up to 100 users

on a single board.

Power Efficiency

GRID boards are designed to provide data center-class power
efficiency, including the revolutionary new streaming multiprocessor,
called “SMX”. The result is an innovative, proven solution
that delivers revolutionary performance-per-watt for the
enterprise data centre.

24/7 Reliability

GRID boards are designed, built, and tested by NVIDIA for 24/7
operation. Working closely with leading server vendors ensures
GRID cards perform optimally and reliably for the life of the system.


Widest Range of Virtualisation Solutions

GRID cards enable GPU-capable virtualisation solutions like

XenServer or Linux KVM, delivering the flexibility to choose from

a wide range of proven solutions.



Technical Specifications

                          GRID K1 Board            GRID K2 Board
Number of GPUs            4 x entry Kepler GPUs    2 x high-end Kepler
Total NVIDIA CUDA cores   768                      3072
Total memory size         16 GB DDR3               8 GB GDDR5
Max power                 130 W                    225 W
Board length              10.5”                    10.5”
Board height              4.4”                     4.4”
Board width               Dual slot                Dual slot
Display IO                None                     None
Aux power                 6-pin connector          8-pin connector
PCIe                      x16                      x16
PCIe generation           Gen3 (Gen2 compatible)   Gen3 (Gen2 compatible)
Cooling solution          Passive                  Passive

Virtual GPU Technology

NVIDIA GRID vGPU brings the full benefit of NVIDIA hardware-accelerated
graphics to virtualized solutions. This technology provides
exceptional graphics performance for virtual desktops

equivalent to local PCs when sharing a GPU among multiple users.

NVIDIA GRID vGPU is the industry’s most advanced technology

for sharing true GPU hardware acceleration between multiple vir-

tual desktops – without compromising the graphics experience.

Application features and compatibility are exactly the same as

they would be at the desk.



With GRID vGPU technology, the graphics commands of each vir-

tual machine are passed directly to the GPU, without translation

by the hypervisor. This allows the GPU hardware to be time-sliced

to deliver the ultimate in shared virtualised graphics performance.

vGPU Profiles Mean Customised, Dedicated Graphics Memory

Take advantage of vGPU Manager to assign just the right amount

of memory to meet the specific needs of each user. Every virtual

desktop has dedicated graphics memory, just like they would

at their desk, so they always have the resources they need to

launch and use their applications.

vGPU Manager enables up to eight users to share each physical

GPU, assigning the graphics resources of the available GPUs to

virtual machines in a balanced approach. Each NVIDIA GRID K1

graphics card has up to four GPUs, allowing 32 users to share a

single graphics card.

NVIDIA GRID     Virtual GPU  Graphics   Max Displays  Max Resolution  Max Users Per    Use Case
Graphics Board  Profile      Memory     Per User      Per Display     Graphics Board

GRID K2         K260Q        2,048 MB   4             2560×1600       4                Designer/Power User
                K240Q        1,024 MB   2             2560×1600       8                Designer/Power User
                K220Q        512 MB     2             2560×1600       16               Designer/Power User
                K200         256 MB     2             1900×1200       16               Knowledge

GRID K1         K140Q        1,024 MB   2             2560×1600       16               Power User
                K120Q        512 MB     2             2560×1600       32               Power User
                K100         256 MB     2             1900×1200       32               Knowledge

Up to 8 users are supported per physical GPU, depending on the vGPU profile.
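The user counts in the profile table are consistent with a simple rule of thumb: divide each GPU's share of the board memory by the profile's dedicated graphics memory, cap the result at eight users per physical GPU, and multiply by the number of GPUs. The sketch below only checks that arithmetic. The function name and the even split of board memory across GPUs are assumptions made for illustration, not NVIDIA's documented allocation logic.

```python
# Illustrative arithmetic only: reproduces the "Max Users Per Graphics Board"
# column from board memory, GPU count, and profile memory.
# Assumption: board memory is split evenly across the GPUs on the board.

MAX_USERS_PER_GPU = 8  # limit stated in the text for vGPU Manager

# Board data from the specification table: (number of GPUs, total memory in MB)
BOARDS = {"GRID K1": (4, 16 * 1024), "GRID K2": (2, 8 * 1024)}


def max_users_per_board(num_gpus: int, board_memory_mb: int, profile_memory_mb: int) -> int:
    """Return the number of vGPU users one board can host for a given profile."""
    memory_per_gpu_mb = board_memory_mb // num_gpus
    users_per_gpu = min(MAX_USERS_PER_GPU, memory_per_gpu_mb // profile_memory_mb)
    return num_gpus * users_per_gpu


if __name__ == "__main__":
    profiles = [
        ("GRID K2", "K260Q", 2048),
        ("GRID K2", "K240Q", 1024),
        ("GRID K2", "K220Q", 512),
        ("GRID K1", "K140Q", 1024),
        ("GRID K1", "K120Q", 512),
        ("GRID K1", "K100", 256),
    ]
    for board, profile, profile_mb in profiles:
        gpus, memory_mb = BOARDS[board]
        print(board, profile, max_users_per_board(gpus, memory_mb, profile_mb))
```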



Big Data Analytics with Hadoop

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run

on commodity hardware. It has many similarities with existing distributed file systems.

However, the differences from other distributed file systems are significant. HDFS is

highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS pro-

vides high throughput access to application data and is suitable for applications that

have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access

to file system data. HDFS was originally built as infrastructure for the Apache Nutch

web search engine project. HDFS is now an Apache Hadoop subproject.


Apache Hadoop

Overview

Apache Hadoop is an open-source software framework written

in Java for distributed storage and distributed processing of very

large data sets on computer clusters built from commodity hard-

ware. All the modules in Hadoop are designed with a fundamen-

tal assumption that hardware failures are common and should

be automatically handled by the framework.

The core of Apache Hadoop consists of a storage part, known as

Hadoop Distributed File System (HDFS), and a processing part

called MapReduce. Hadoop splits files into large blocks and dis-

tributes them across nodes in a cluster. To process data, Hadoop

transfers packaged code for nodes to process in parallel based

on the data that needs to be processed. This approach takes ad-

vantage of data locality – nodes manipulating the data they have

access to – to allow the dataset to be processed faster and more

efficiently than it would be in a more conventional supercomput-

er architecture that relies on a parallel file system where compu-

tation and data are distributed via high-speed networking.

The base Apache Hadoop framework is composed of the follow-

ing modules:

• Hadoop Common – contains libraries and utilities needed by
  other Hadoop modules;
• Hadoop Distributed File System (HDFS) – a distributed
  file-system that stores data on commodity machines,
  providing very high aggregate bandwidth across the cluster;
• Hadoop YARN – a resource-management platform responsible
  for managing computing resources in clusters and using
  them for scheduling of users’ applications; and
• Hadoop MapReduce – an implementation of the MapReduce
  programming model for large scale data processing.



The term Hadoop has come to refer not just to the base mod-

ules above, but also to the ecosystem, or collection of additional

software packages that can be installed on top of or alongside

Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache

Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala,

Apache Flume, Apache Sqoop, Apache Oozie, Apache Storm.

Apache Hadoop’s MapReduce and HDFS components were inspired
by Google papers on their MapReduce and Google File System.


The Hadoop framework itself is mostly written in the Java pro-

gramming language, with some native code in C and command

line utilities written as shell scripts. Though MapReduce Java

code is common, any programming language can be used with

“Hadoop Streaming” to implement the “map” and “reduce” parts

of the user’s program. Other projects in the Hadoop ecosystem

expose richer user interfaces.
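To make the “map” and “reduce” parts concrete, here is a minimal word-count sketch in the style of Hadoop Streaming, where mapper and reducer exchange plain "key<TAB>value" text lines. The function names and the local `run_locally` helper are illustrative assumptions; the in-process sort merely stands in for the shuffle-and-sort step that the framework performs between the two phases.

```python
# Word count in the style of Hadoop Streaming. In a real streaming job the
# mapper and reducer would be separate programs reading standard input and
# writing standard output; here they are chained in one process.
from itertools import groupby
from typing import Iterable, Iterator, List


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one "word<TAB>1" line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines: Iterable[str]) -> List[str]:
    """Chain map, sort, and reduce in one process for experimentation."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for result in run_locally(["Hadoop stores data", "Hadoop processes data"]):
        print(result)
```

The reducer relies on its input being sorted by key, which is why the sort step sits between the two functions.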

Architecture

Hadoop consists of the Hadoop Common package, which provides
filesystem and OS level abstractions, a MapReduce engine

(either MapReduce/MR1 or YARN/MR2) and the Hadoop Distrib-

uted File System (HDFS). The Hadoop Common package contains

the necessary Java ARchive (JAR) files and scripts needed to start Hadoop.


For effective scheduling of work, every Hadoop-compatible file

system should provide location awareness: the name of the rack

(more precisely, of the network switch) where a worker node is.

Hadoop applications can use this information to execute code

on the node where the data is, and, failing that, on the same rack/

switch to reduce backbone traffic. HDFS uses this method when

replicating data for data redundancy across multiple racks. This

approach reduces the impact of a rack power outage or switch

failure; if one of these hardware failures occurs, the data will re-

main available.
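The availability argument above can be checked with a few lines of code: a block stays readable after a rack outage as long as at least one replica lives on another rack. The sketch below is a toy model under stated assumptions; the node names, the rack map, and both helper functions are hypothetical and not part of any Hadoop API.

```python
# Toy check of rack-aware redundancy: does a block survive the loss of a rack?
from typing import Dict, List


def survives_rack_failure(replica_nodes: List[str], rack_of: Dict[str, str], failed_rack: str) -> bool:
    """True if at least one replica sits outside the failed rack."""
    return any(rack_of[node] != failed_rack for node in replica_nodes)


def tolerates_any_single_rack_failure(replica_nodes: List[str], rack_of: Dict[str, str]) -> bool:
    """True if the replicas are spread over at least two racks."""
    return len({rack_of[node] for node in replica_nodes}) >= 2


# Hypothetical cluster layout: node name -> rack name
RACK_OF = {"n1": "rack1", "n2": "rack1", "n3": "rack2", "n4": "rack2"}

if __name__ == "__main__":
    spread = ["n1", "n3", "n4"]   # replicas on two racks
    packed = ["n1", "n2"]         # replicas on one rack only
    print(tolerates_any_single_rack_failure(spread, RACK_OF))
    print(tolerates_any_single_rack_failure(packed, RACK_OF))
```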

A small Hadoop cluster includes a single master and multiple

worker nodes. The master node consists of a JobTracker, TaskTracker,
NameNode, and DataNode. A slave or worker node acts

as both a DataNode and TaskTracker, though it is possible to have

data-only worker nodes and compute-only worker nodes. These

are normally used only in nonstandard applications.

Hadoop requires Java Runtime Environment (JRE) 1.6 or higher.

The standard startup and shutdown scripts require that Secure

Shell (ssh) be set up between nodes in the cluster.

In a larger cluster, HDFS nodes are managed through a dedicated

NameNode server to host the file system index, and a second-

ary NameNode that can generate snapshots of the namenode’s

memory structures, thereby preventing file-system corruption

and loss of data. Similarly, a standalone JobTracker server can

manage job scheduling across nodes. When Hadoop MapReduce

is used with an alternate file system, the NameNode, secondary

NameNode, and DataNode architecture of HDFS are replaced by

the file-system-specific equivalents.


[Figure: A multi-node Hadoop cluster (master and slave nodes)]



Hadoop distributed file system (HDFS)

The Hadoop distributed file system (HDFS) is a distributed, scalable,
and portable file-system written in Java for the Hadoop

framework. A Hadoop cluster has nominally a single namenode

plus a cluster of datanodes, although redundancy options are

available for the namenode due to its criticality. Each datanode

serves up blocks of data over the network using a block protocol

specific to HDFS. The file system uses TCP/IP sockets for communication.
Clients use remote procedure calls (RPC) to communicate with each other.

HDFS stores large files (typically in the range of gigabytes to

terabytes) across multiple machines. It achieves reliability by

replicating the data across multiple hosts, and hence theoreti-

cally does not require RAID storage on hosts (but to increase I/O

performance some RAID configurations are still useful). With the

default replication value, 3, data is stored on three nodes: two on

the same rack, and one on a different rack. Data nodes can talk to

each other to rebalance data, to move copies around, and to keep

the replication of data high. HDFS is not fully POSIX-compliant,

because the requirements for a POSIX file-system differ from the

target goals for a Hadoop application. The trade-off of not having

a fully POSIX-compliant file-system is increased performance for

data throughput and support for non-POSIX operations such as


HDFS added the high-availability capabilities, as announced for

release 2.0 in May 2012, letting the main metadata server (the

NameNode) fail over manually to a backup. The project has also

started developing automatic fail-over.

The HDFS file system includes a so-called secondary namenode,

a misleading name that some might incorrectly interpret as a

backup namenode for when the primary namenode goes offline.

In fact, the secondary namenode regularly connects with the



primary namenode and builds snapshots of the primary nameno-

de’s directory information, which the system then saves to local

or remote directories. These checkpointed images can be used

to restart a failed primary namenode without having to replay

the entire journal of file-system actions, then to edit the log to

create an up-to-date directory structure. Because the namenode

is the single point for storage and management of metadata, it

can become a bottleneck for supporting a huge number of files,

especially a large number of small files. HDFS Federation, a new

addition, aims to tackle this problem to a certain extent by allow-

ing multiple namespaces served by separate namenodes.
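The checkpointing idea described in this paragraph can be sketched as follows: a snapshot of the namespace plus the journal entries written after it is enough to rebuild the current state, so a restart only replays the short tail of the journal. This is a toy model under stated assumptions. The dict-based namespace, the tuple-based journal entries, and all function names are hypothetical and do not reflect the actual HDFS on-disk formats.

```python
# Toy model of namenode checkpointing: snapshot + journal tail = current state.
from typing import Dict, List, Tuple

Namespace = Dict[str, int]            # path -> replication factor
JournalEntry = Tuple[str, str, int]   # (operation, path, replication factor)


def apply_entry(namespace: Namespace, entry: JournalEntry) -> None:
    """Apply a single journal entry to the namespace in place."""
    operation, path, replication = entry
    if operation == "create":
        namespace[path] = replication
    elif operation == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {operation}")


def replay(snapshot: Namespace, journal: List[JournalEntry]) -> Namespace:
    """Rebuild the namespace from a snapshot and the entries logged after it."""
    namespace = dict(snapshot)
    for entry in journal:
        apply_entry(namespace, entry)
    return namespace


def checkpoint(snapshot: Namespace, journal: List[JournalEntry]) -> Tuple[Namespace, List[JournalEntry]]:
    """Fold the journal into a new snapshot and start with an empty journal."""
    return replay(snapshot, journal), []


if __name__ == "__main__":
    journal = [("create", "/a", 3), ("create", "/b", 2), ("delete", "/a", 0)]
    snapshot, tail = checkpoint({}, journal[:2])   # checkpoint after two entries
    print(replay(snapshot, tail + journal[2:]))     # only one entry left to replay
```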

An advantage of using HDFS is data awareness between the job

tracker and task tracker. The job tracker schedules map or reduce

jobs to task trackers with an awareness of the data location. For

example: if node A contains data (x,y,z) and node B contains data

(a,b,c), the job tracker schedules node B to perform map or reduce

tasks on (a,b,c) and node A would be scheduled to perform map

or reduce tasks on (x,y,z). This reduces the amount of traffic that

goes over the network and prevents unnecessary data transfer.

When Hadoop is used with other file systems, this advantage is

not always available. This can have a significant impact on job-

completion times, which has been demonstrated when running

data-intensive jobs.
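The node A / node B example can be written down directly. The sketch below assigns each task to the node that already stores its input and counts how many tasks would otherwise have to read over the network. The function names and the dictionary layout are assumptions for illustration, not the job tracker's actual data structures.

```python
# Data-aware task assignment for the node A / node B example in the text.
from typing import Dict, List


def schedule_tasks(data_on_node: Dict[str, List[str]]) -> Dict[str, str]:
    """Assign the task for each data item to the node that already stores it."""
    return {item: node for node, items in data_on_node.items() for item in items}


def remote_reads(assignment: Dict[str, str], data_on_node: Dict[str, List[str]]) -> int:
    """Count tasks that would have to fetch their input over the network."""
    return sum(1 for item, node in assignment.items() if item not in data_on_node[node])


if __name__ == "__main__":
    placement = {"A": ["x", "y", "z"], "B": ["a", "b", "c"]}
    local = schedule_tasks(placement)
    everything_on_a = {item: "A" for item in local}  # a placement that ignores locality
    print(remote_reads(local, placement), remote_reads(everything_on_a, placement))
```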

HDFS was designed for mostly immutable files and may not be

suitable for systems requiring concurrent write-operations.

HDFS can be mounted directly with a Filesystem in Userspace

(FUSE) virtual file system on Linux and some other Unix systems.

File access can be achieved through the native Java application

programming interface (API), the Thrift API to generate a client in

the language of the users’ choosing (C++, Java, Python, PHP, Ruby,

Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the com-

mand-line interface, browsed through the HDFS-UI Web applica-

tion (webapp) over HTTP, or via 3rd-party network client libraries.

JobTracker and TaskTracker: the MapReduce Engine

Above the file systems comes the MapReduce Engine, which consists
of one JobTracker, to which client applications submit MapReduce
jobs. The JobTracker pushes work out to available TaskTracker
nodes in the cluster, striving to keep the work as close to

the data as possible. With a rack-aware file system, the JobTracker

knows which node contains the data, and which other machines

are nearby. If the work cannot be hosted on the actual node

where the data resides, priority is given to nodes in the same

rack. This reduces network traffic on the main backbone net-

work. If a TaskTracker fails or times out, that part of the job is re-

scheduled. The TaskTracker on each node spawns a separate Java

Virtual Machine process to prevent the TaskTracker itself from

failing if the running job crashes its JVM. A heartbeat is sent from

the TaskTracker to the JobTracker every few minutes to check its

status. The JobTracker and TaskTracker status and information are
exposed by Jetty and can be viewed from a web browser.

Known limitations of this approach are:

• The allocation of work to TaskTrackers is very simple. Every
  TaskTracker has a number of available slots (such as “4 slots”).
  Every active map or reduce task takes up one slot. The JobTracker
  allocates work to the tracker nearest to the data with
  an available slot. There is no consideration of the current
  system load of the allocated machine, and hence its actual
• If one TaskTracker is very slow, it can delay the entire MapReduce
  job, especially towards the end of a job, where
  everything can end up waiting for the slowest task. With
  speculative execution enabled, however, a single task can be
  executed on multiple slave nodes.
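The second limitation can be illustrated with a tiny model: a job ends when its slowest task ends, and with speculative execution a task ends when its fastest attempt ends. The sketch below is a simplification under stated assumptions. All times are made-up finish times measured from the start of the job, and the function name is hypothetical.

```python
# Straggler effect and speculative execution in one line of arithmetic.
from typing import List


def job_finish_time(attempt_finish_times: List[List[float]]) -> float:
    """Return when the job ends: the slowest task's fastest attempt."""
    return max(min(attempts) for attempts in attempt_finish_times)


if __name__ == "__main__":
    without_speculation = [[10.0], [12.0], [60.0]]     # third task is a straggler
    with_speculation = [[10.0], [12.0], [60.0, 25.0]]  # a duplicate attempt ends at t=25
    print(job_finish_time(without_speculation))
    print(job_finish_time(with_speculation))
```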


By default Hadoop uses FIFO scheduling, and optionally 5 sched-

uling priorities to schedule jobs from a work queue. In version

0.19 the job scheduler was refactored out of the JobTracker, while



adding the ability to use an alternate scheduler (such as the Fair

scheduler or the Capacity scheduler, described next).
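FIFO scheduling with priorities can be pictured as a queue ordered by priority first and by submission order second. The sketch below is a minimal illustration, not Hadoop's scheduler code. The five priority names mirror Hadoop's job priority levels, while the `JobQueue` class and its methods are hypothetical.

```python
# FIFO with priorities: a heap keyed on (priority, submission sequence).
import heapq
from itertools import count
from typing import List, Tuple

PRIORITIES = {"VERY_HIGH": 0, "HIGH": 1, "NORMAL": 2, "LOW": 3, "VERY_LOW": 4}


class JobQueue:
    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._sequence = count()  # tie-breaker that preserves FIFO order

    def submit(self, job_name: str, priority: str = "NORMAL") -> None:
        """Queue a job; lower priority numbers are served first."""
        heapq.heappush(self._heap, (PRIORITIES[priority], next(self._sequence), job_name))

    def next_job(self) -> str:
        """Remove and return the job that should run next."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    queue = JobQueue()
    queue.submit("nightly-report")
    queue.submit("urgent-fix", "HIGH")
    queue.submit("log-cleanup")
    while len(queue):
        print(queue.next_job())
```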

Fair scheduler

The fair scheduler was developed by Facebook. The goal of the

fair scheduler is to provide fast response times for small jobs and

QoS for production jobs. The fair scheduler has three basic concepts:


1. Jobs are grouped into pools.

2. Each pool is assigned a guaranteed minimum share.

3. Excess capacity is split between jobs.

By default, jobs that are uncategorized go into a default pool.

Pools have to specify the minimum number of map slots, reduce

slots, and a limit on the number of running jobs.
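The three concepts can be sketched as a two-step allocation: every pool first receives its guaranteed minimum share, and the excess capacity is then handed out evenly to pools that still have demand. This is a deliberately simplified model, not the actual fair scheduler algorithm. The function, the pool names, and the slot-by-slot round-robin are all assumptions made for illustration.

```python
# Simplified fair-share allocation: minimum shares first, then split the excess.
from typing import Dict


def allocate_slots(total_slots: int, min_share: Dict[str, int], demand: Dict[str, int]) -> Dict[str, int]:
    """Give each pool its guaranteed minimum, then split the excess evenly."""
    # Step 1: guaranteed minimum share, never more than the pool asks for.
    allocation = {pool: min(min_share.get(pool, 0), demand[pool]) for pool in demand}
    remaining = total_slots - sum(allocation.values())
    if remaining < 0:
        raise ValueError("minimum shares exceed the cluster capacity")
    # Step 2: hand out excess slots one at a time to pools that still want more.
    while remaining > 0:
        hungry = [pool for pool in sorted(demand) if allocation[pool] < demand[pool]]
        if not hungry:
            break
        for pool in hungry:
            if remaining == 0:
                break
            allocation[pool] += 1
            remaining -= 1
    return allocation


if __name__ == "__main__":
    print(allocate_slots(10, {"production": 4}, {"default": 10, "production": 5}))
```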

Capacity scheduler

The capacity scheduler was developed by Yahoo. The capacity

scheduler supports several features that are similar to the fair
scheduler:

• Queues are allocated a fraction of the total resource capacity.
• Free resources are allocated to queues beyond their total capacity.
• Within a queue a job with a high level of priority has access
  to the queue’s resources.

There is no preemption once a job is running.

Other applications

The HDFS file system is not restricted to MapReduce jobs. It can

be used for other applications, many of which are under devel-

opment at Apache. The list includes the HBase database, the

Apache Mahout machine learning system, and the Apache Hive

Data Warehouse system. Hadoop can in theory be used for any

sort of work that is batch-oriented rather than real-time, is very

data-intensive, and benefits from parallel processing of data.



It can also be used to complement a real-time system, such as

lambda architecture.

As of October 2009, commercial applications of Hadoop included:

• Log and/or clickstream analysis of various kinds
• Marketing analytics
• Machine learning and/or sophisticated data mining
• Image processing
• Processing of XML messages
• Web crawling and/or text processing
• General archiving, including of relational/tabular data, e.g.
  for compliance

HDFS Architecture

Assumptions and Goals

Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS

instance may consist of hundreds or thousands of server ma-

chines, each storing part of the file system’s data. The fact that

there are a huge number of components and that each compo-

nent has a non-trivial probability of failure means that some

component of HDFS is always non-functional. Therefore, detec-

tion of faults and quick, automatic recovery from them is a core

architectural goal of HDFS.

Streaming Data Access

Applications that run on HDFS need streaming access to their

data sets. They are not general purpose applications that typi-

cally run on general purpose file systems. HDFS is designed more

for batch processing rather than interactive use by users. The

emphasis is on high throughput of data access rather than low

latency of data access. POSIX imposes many hard requirements

that are not needed for applications that are targeted for HDFS.

POSIX semantics in a few key areas has been traded to increase

data throughput rates.

Large Data Sets

Applications that run on HDFS have large data sets. A typical file

in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to

support large files. It should provide high aggregate data band-

width and scale to hundreds of nodes in a single cluster. It should

support tens of millions of files in a single instance.

Simple Coherency Model

HDFS applications need a write-once-read-many access model

for files. A file once created, written, and closed need not be

changed. This assumption simplifies data coherency issues and

enables high throughput data access. A MapReduce application

or a web crawler application fits perfectly with this model. There

is a plan to support appending-writes to files in the future.

“Moving Computation is Cheaper than Moving Data”

A computation requested by an application is much more effi-

cient if it is executed near the data it operates on. This is espe-

cially true when the size of the data set is huge. This minimizes

network congestion and increases the overall throughput of the

system. The assumption is that it is often better to migrate the

computation closer to where the data is located rather than mov-

ing the data to where the application is running. HDFS provides

interfaces for applications to move themselves closer to where

the data is located.

Portability Across Heterogeneous Hardware and

Software Platforms

HDFS has been designed to be easily portable from one platform

to another. This facilitates widespread adoption of HDFS as a

platform of choice for a large set of applications.

NameNode and DataNodes

HDFS has a master/slave architecture. An HDFS cluster consists
of a single NameNode, a master server that manages the

file system namespace and regulates access to files by clients.



In addition, there are a number of DataNodes, usually one per

node in the cluster, which manage storage attached to the nodes

that they run on. HDFS exposes a file system namespace and al-

lows user data to be stored in files. Internally, a file is split into one

or more blocks and these blocks are stored in a set of DataNodes.

The NameNode executes file system namespace operations like

opening, closing, and renaming files and directories. It also de-

termines the mapping of blocks to DataNodes. The DataNodes

are responsible for serving read and write requests from the file

system’s clients. The DataNodes also perform block creation, de-

letion, and replication upon instruction from the NameNode.

The NameNode and DataNode are pieces of software designed

to run on commodity machines. These machines typically run

a GNU/Linux operating system (OS). HDFS is built using the Java

language; any machine that supports Java can run the NameNo-

de or the DataNode software. Usage of the highly portable Java

language means that HDFS can be deployed on a wide range of

machines. A typical deployment has a dedicated machine that

runs only the NameNode software. Each of the other machines in

the cluster runs one instance of the DataNode software. The ar-

chitecture does not preclude running multiple DataNodes on the

same machine but in a real deployment that is rarely the case.

[Figure: HDFS Architecture. NameNode metadata (name, replicas, ...), e.g. /home/foo/data, 3, ...; DataNodes on Rack 1 and Rack 2; write path]



The existence of a single NameNode in a cluster greatly simplifies

the architecture of the system. The NameNode is the arbitrator

and repository for all HDFS metadata. The system is designed in

such a way that user data never flows through the NameNode.

The File System Namespace

HDFS supports a traditional hierarchical file organization. A user

or an application can create directories and store files inside

these directories. The file system namespace hierarchy is similar

to most other existing file systems; one can create and remove

files, move a file from one directory to another, or rename a file.

HDFS does not yet implement user quotas. HDFS does not sup-

port hard links or soft links. However, the HDFS architecture does

not preclude implementing these features.

The NameNode maintains the file system namespace. Any change

to the file system namespace or its properties is recorded by the

NameNode. An application can specify the number of replicas of

a file that should be maintained by HDFS. The number of copies

of a file is called the replication factor of that file. This informa-

tion is stored by the NameNode.

Data Replication

HDFS is designed to reliably store very large files across machines

in a large cluster. It stores each file as a sequence of blocks; all

blocks in a file except the last block are the same size. The blocks

of a file are replicated for fault tolerance. The block size and rep-

lication factor are configurable per file. An application can spec-

ify the number of replicas of a file. The replication factor can be

specified at file creation time and can be changed later. Files in

HDFS are write-once and have strictly one writer at any time.
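The block layout described above is simple arithmetic: a file becomes a run of equally sized blocks plus one shorter final block, and every block is stored as many times as the replication factor says. The sketch below illustrates this. The function names and the sizes used in the example are assumptions chosen for readability, not HDFS defaults.

```python
# Splitting a file into fixed-size blocks and estimating raw storage use.
from typing import List


def block_sizes(file_size: int, block_size: int) -> List[int]:
    """Return the sizes of the blocks a file of file_size bytes is split into."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # only the last block may be smaller
    return sizes


def raw_bytes_stored(file_size: int, replication_factor: int) -> int:
    """Total bytes occupied across the cluster once all replicas exist."""
    return file_size * replication_factor


if __name__ == "__main__":
    # Illustrative numbers: a 200-unit file, 64-unit blocks, three replicas.
    print(block_sizes(200, 64), raw_bytes_stored(200, 3))
```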

The NameNode makes all decisions regarding replication of

blocks. It periodically receives a Heartbeat and a Blockreport

from each of the DataNodes in the cluster. Receipt of a Heartbeat

implies that the DataNode is functioning properly. A Blockreport

contains a list of all blocks on a DataNode.
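A sketch of how Blockreports could be combined into replication decisions: count how many DataNodes report each block and compare that with the block's target replication factor. This is an illustration under stated assumptions, not the NameNode's implementation. The DataNode names, block ids, report contents, and function names are all made up.

```python
# Aggregating Blockreports to find blocks with too few replicas.
from collections import Counter
from typing import Dict, List


def replica_counts(block_reports: Dict[str, List[int]]) -> Counter:
    """Count how many DataNodes report each block id."""
    counts: Counter = Counter()
    for blocks in block_reports.values():
        counts.update(set(blocks))  # a node counts once per block
    return counts


def under_replicated(block_reports: Dict[str, List[int]], expected: Dict[int, int]) -> Dict[int, int]:
    """Return {block id: number of missing replicas} for blocks below target."""
    counts = replica_counts(block_reports)
    return {
        block: target - counts[block]
        for block, target in expected.items()
        if counts[block] < target
    }


if __name__ == "__main__":
    # Hypothetical targets: blocks 1 and 3 want 2 replicas, blocks 2, 4, 5 want 3.
    expected = {1: 2, 3: 2, 2: 3, 4: 3, 5: 3}
    reports = {"dn1": [1, 2, 4], "dn2": [1, 3, 5], "dn3": [2, 3, 4], "dn4": [2, 4, 5]}
    print(under_replicated(reports, expected))
```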

Replica Placement: The First Baby Steps

The placement of replicas is critical to HDFS reliability and per-

formance. Optimizing replica placement distinguishes HDFS

from most other distributed file systems. This is a feature that

needs lots of tuning and experience. The purpose of a rack-aware

replica placement policy is to improve data reliability, availabil-

ity, and network bandwidth utilization. The current implemen-

tation for the replica placement policy is a first effort in this di-

rection. The short-term goals of implementing this policy are to

validate it on production systems, learn more about its behavior,

and build a foundation to test and research more sophisticated policies.


Large HDFS instances run on a cluster of computers that is commonly
spread across many racks. Communication between two
nodes in different racks has to go through switches. In most
cases, network bandwidth between machines in the same rack is
greater than network bandwidth between machines in different racks.


The NameNode determines the rack id each DataNode belongs

to via the process outlined in Hadoop Rack Awareness. A simple

but non-optimal policy is to place replicas on unique racks. This

prevents losing data when an entire rack fails and allows use of

bandwidth from multiple racks when reading data. This policy

evenly distributes replicas in the cluster, which makes it easy to
balance load on component failure. However, this policy increases
the cost of writes because a write needs to transfer blocks to
multiple racks.

[Figure: Block Replication. NameNode (filename, numReplicas, block-ids, ...): /users/sameerp/data/part-0, r:2, {1,3}, ...; /users/sameerp/data/part-1, r:3, {2,4,5}, ...]

For the common case, when the replication factor is three,

HDFS’s placement policy is to put one replica on one node in the

local rack, another on a node in a different (remote) rack, and the

last on a different node in the same remote rack. This policy cuts

the inter-rack write traffic which generally improves write per-

formance. The chance of rack failure is far less than that of node

failure; this policy does not impact data reliability and availabil-

ity guarantees. However, it does reduce the aggregate network

bandwidth used when reading data since a block is placed in

only two unique racks rather than three. With this policy, the rep-

licas of a file do not evenly distribute across the racks. One third

of replicas are on one node, two thirds of replicas are on one rack,

and the other third are evenly distributed across the remaining

racks. This policy improves write performance without compro-

mising data reliability or read performance.
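The placement rule for a replication factor of three can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `topology` dictionary, the node names and the function `place_replicas` are hypothetical, the writer is assumed to be a DataNode that holds the first replica, and the remote rack is picked at random. The real NameNode logic is considerably more involved.

```python
import random

def place_replicas(topology, writer, rng=random):
    """Pick three DataNodes for one block: the writer's node, then two
    different nodes that share one remote rack.

    topology: dict mapping rack id -> list of node names (hypothetical layout).
    writer:   node the client writes from; assumed to hold the first replica.
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    # The remote rack needs at least two nodes so replicas 2 and 3 can differ.
    remote_racks = [rack for rack, nodes in topology.items()
                    if rack != local_rack and len(nodes) >= 2]
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]

topology = {
    "rack-a": ["a1", "a2", "a3"],
    "rack-b": ["b1", "b2", "b3"],
    "rack-c": ["c1", "c2"],
}
rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}

replicas = place_replicas(topology, "a2", random.Random(0))
racks_used = {rack_of[node] for node in replicas}
print(replicas, sorted(racks_used))
```

Whatever the random choice, the block ends up on three distinct nodes but only two racks, which is exactly the trade-off described above: less inter-rack write traffic at the price of fewer racks to read from.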

The current, default replica placement policy described here is a

work in progress.

Replica Selection

To minimize global bandwidth consumption and read latency,

HDFS tries to satisfy a read request from a replica that is clos-

est to the reader. If there exists a replica on the same rack as the

reader node, then that replica is preferred to satisfy the read request.

If an HDFS cluster spans multiple data centers, then a

replica that is resident in the local data center is preferred over

any remote replica.
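A minimal sketch of this preference order follows. The three-level `distance` function and the dictionary layout of reader and replicas are assumptions made for illustration only; they are not the actual HDFS topology API.

```python
def distance(reader, replica):
    """Hypothetical distance: 0 = same rack, 1 = same data center, 2 = remote."""
    if reader["dc"] != replica["dc"]:
        return 2
    if reader["rack"] != replica["rack"]:
        return 1
    return 0

def select_replica(reader, replicas):
    """Return the replica closest to the reader."""
    return min(replicas, key=lambda replica: distance(reader, replica))

reader = {"dc": "dc1", "rack": "r1"}
replicas = [
    {"node": "dn7", "dc": "dc2", "rack": "r9"},  # remote data center
    {"node": "dn4", "dc": "dc1", "rack": "r2"},  # same data center, other rack
    {"node": "dn2", "dc": "dc1", "rack": "r1"},  # same rack as the reader
]
chosen = select_replica(reader, replicas)
print(chosen["node"])
```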


Safemode

On startup, the NameNode enters a special state called Safemode.

Replication of data blocks does not occur when the NameNode

is in the Safemode state. The NameNode receives Heartbeat

and Blockreport messages from the DataNodes. A Blockreport

contains the list of data blocks that a DataNode is hosting. Each

block has a specified minimum number of replicas. A block is con-

sidered safely replicated when the minimum number of replicas

of that data block has checked in with the NameNode. After a

configurable percentage of safely replicated data blocks checks

in with the NameNode (plus an additional 30 seconds), the Na-

meNode exits the Safemode state. It then determines the list of

data blocks (if any) that still have fewer than the specified num-

ber of replicas. The NameNode then replicates these blocks to

other DataNodes.
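The exit condition can be illustrated with a short Python sketch. The function names, the block counts and the threshold values are hypothetical, and the additional 30-second wait is left out for brevity.

```python
def safely_replicated_fraction(reported, min_replicas=1):
    """reported: dict block id -> number of replicas that have checked in."""
    if not reported:
        return 1.0
    safe = sum(1 for count in reported.values() if count >= min_replicas)
    return safe / len(reported)

def can_leave_safemode(reported, threshold, min_replicas=1):
    """True once the configurable share of blocks is safely replicated."""
    return safely_replicated_fraction(reported, min_replicas) >= threshold

def under_replicated(reported, target):
    """Blocks that still have fewer replicas than their specified number."""
    return sorted(block for block, count in reported.items() if count < target[block])

reported = {1: 3, 2: 1, 3: 0, 4: 2}   # replicas reported so far, per block
target = {1: 3, 2: 3, 3: 3, 4: 2}     # specified replication factor, per block

fraction = safely_replicated_fraction(reported)
print(fraction, can_leave_safemode(reported, threshold=0.75))
print(under_replicated(reported, target))
```

In this example three of four blocks have at least one replica, so a threshold of 0.75 is met, while blocks 2 and 3 would be queued for replication after Safemode is left.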

The Persistence of File System Metadata

The HDFS namespace is stored by the NameNode. The NameNode

uses a transaction log called the EditLog to persistently record

every change that occurs to file system metadata. For example,

creating a new file in HDFS causes the NameNode to insert a re-

cord into the EditLog indicating this. Similarly, changing the rep-

lication factor of a file causes a new record to be inserted into the

EditLog. The NameNode uses a file in its local host OS file system

to store the EditLog. The entire file system namespace, includ-

ing the mapping of blocks to files and file system properties, is

stored in a file called the FsImage. The FsImage is stored as a file

in the NameNode’s local file system too.

The NameNode keeps an image of the entire file system

namespace and file Blockmap in memory. This key metadata item

is designed to be compact, such that a NameNode with 4 GB of

RAM is plenty to support a huge number of files and directories.

When the NameNode starts up, it reads the FsImage and EditLog

from disk, applies all the transactions from the EditLog to the in-

memory representation of the FsImage, and flushes out this new

version into a new FsImage on disk. It can then truncate the old

EditLog because its transactions have been applied to the persis-

tent FsImage. This process is called a checkpoint. In the current

implementation, a checkpoint only occurs when the NameNode

starts up. Work is in progress to support periodic checkpointing

in the near future.
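The checkpoint procedure amounts to replaying a log onto a snapshot. The following sketch assumes a JSON-encoded FsImage and three made-up record types (`create`, `set_replication`, `delete`); the real on-disk formats are different, so this only shows the principle.

```python
import json

def apply_edit(namespace, edit):
    """Apply one EditLog record to the in-memory namespace."""
    op = edit["op"]
    if op == "create":
        namespace[edit["path"]] = {"replication": edit["replication"]}
    elif op == "set_replication":
        namespace[edit["path"]]["replication"] = edit["replication"]
    elif op == "delete":
        namespace.pop(edit["path"], None)
    else:
        raise ValueError("unknown operation: " + op)

def checkpoint(fsimage_text, editlog):
    """Replay the EditLog onto the FsImage, return the new image, truncate the log."""
    namespace = json.loads(fsimage_text)
    for edit in editlog:
        apply_edit(namespace, edit)
    new_fsimage = json.dumps(namespace, sort_keys=True)
    editlog.clear()  # safe to truncate: every transaction is now in the image
    return new_fsimage

fsimage = json.dumps({"/data/part-0": {"replication": 2}})
editlog = [
    {"op": "create", "path": "/data/part-1", "replication": 3},
    {"op": "set_replication", "path": "/data/part-0", "replication": 3},
    {"op": "create", "path": "/tmp/x", "replication": 1},
    {"op": "delete", "path": "/tmp/x"},
]
new_image = checkpoint(fsimage, editlog)
print(new_image, len(editlog))
```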

The DataNode stores HDFS data in files in its local file system.

The DataNode has no knowledge about HDFS files. It stores each

block of HDFS data in a separate file in its local file system. The

DataNode does not create all files in the same directory. Instead,

it uses a heuristic to determine the optimal number of files per

directory and creates subdirectories appropriately. It is not op-

timal to create all local files in the same directory because the

local file system might not be able to efficiently support a huge

number of files in a single directory. When a DataNode starts up,

it scans through its local file system, generates a list of all HDFS

data blocks that correspond to each of these local files and sends

this report to the NameNode: this is the Blockreport.

The Communication Protocols

All HDFS communication protocols are layered on top of the TCP/

IP protocol. A client establishes a connection to a configurable

TCP port on the NameNode machine. It talks the Client Protocol

with the NameNode. The DataNodes talk to the NameNode using

the DataNode Protocol. A Remote Procedure Call (RPC) abstrac-

tion wraps both the Client Protocol and the DataNode Protocol.

By design, the NameNode never initiates any RPCs. Instead, it

only responds to RPC requests issued by DataNodes or clients.

Robustness

The primary objective of HDFS is to store data reliably even in

the presence of failures. The three common types of failures are

NameNode failures, DataNode failures and network partitions.

Data Disk Failure, Heartbeats and Re-Replication

Each DataNode sends a Heartbeat message to the NameNode pe-

riodically. A network partition can cause a subset of DataNodes

to lose connectivity with the NameNode. The NameNode detects

this condition by the absence of a Heartbeat message. The Na-

meNode marks DataNodes without recent Heartbeats as dead

and does not forward any new IO requests to them. Any data that

was registered to a dead DataNode is not available to HDFS any

more. DataNode death may cause the replication factor of some



blocks to fall below their specified value. The NameNode con-

stantly tracks which blocks need to be replicated and initiates

replication whenever necessary. The necessity for re-replication

may arise due to many reasons: a DataNode may become unavail-

able, a replica may become corrupted, a hard disk on a DataNode

may fail, or the replication factor of a file may be increased.
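A simplified model of this bookkeeping is shown below. The timeout value, the node and block names and both helper functions are assumptions for the sake of the example; they do not reflect actual HDFS configuration defaults.

```python
def find_dead_nodes(last_heartbeat, now, timeout):
    """DataNodes whose most recent Heartbeat is older than the timeout."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}

def blocks_to_rereplicate(block_locations, dead, target):
    """Map each under-replicated block to the number of replicas it is missing."""
    needed = {}
    for block, nodes in block_locations.items():
        live = [node for node in nodes if node not in dead]
        if len(live) < target[block]:
            needed[block] = target[block] - len(live)
    return needed

last_heartbeat = {"dn1": 100, "dn2": 95, "dn3": 40}   # seconds, hypothetical clock
block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn1", "dn2"],
}
target = {"blk_1": 3, "blk_2": 2}

dead = find_dead_nodes(last_heartbeat, now=100, timeout=30)
needed = blocks_to_rereplicate(block_locations, dead, target)
print(sorted(dead), needed)
```

Here `dn3` has been silent for 60 seconds and is marked dead, which leaves `blk_1` one replica short of its replication factor.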

Cluster Rebalancing

The HDFS architecture is compatible with data rebalancing

schemes. A scheme might automatically move data from one

DataNode to another if the free space on a DataNode falls below

a certain threshold. In the event of a sudden high demand for a

particular file, a scheme might dynamically create additional rep-

licas and rebalance other data in the cluster. These types of data

rebalancing schemes are not yet implemented.

Data Integrity

It is possible that a block of data fetched from a DataNode ar-

rives corrupted. This corruption can occur because of faults in a

storage device, network faults, or buggy software. The HDFS cli-

ent software implements checksum checking on the contents

of HDFS files. When a client creates an HDFS file, it computes a

checksum of each block of the file and stores these checksums in

a separate hidden file in the same HDFS namespace. When a cli-

ent retrieves file contents it verifies that the data it received from

each DataNode matches the checksum stored in the associated

checksum file. If not, then the client can opt to retrieve that block

from another DataNode that has a replica of that block.
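The verify-and-fall-back behaviour can be sketched as follows. CRC32 from Python's `zlib` module stands in for the checksum here, which is an assumption of this example, and `read_block` is a hypothetical helper rather than part of the HDFS client API.

```python
import zlib

def checksum(block: bytes) -> int:
    """Checksum computed when the block is written (CRC32 as a stand-in)."""
    return zlib.crc32(block)

def read_block(replicas, expected):
    """Try each (datanode, data) pair in turn and return the first one
    whose data matches the stored checksum."""
    for node, data in replicas:
        if checksum(data) == expected:
            return node, data
    raise IOError("all replicas failed checksum verification")

good = b"hdfs block payload"
expected = checksum(good)           # stored alongside the file at creation time
replicas = [
    ("dn1", b"hdfs block pAyload"),  # corrupted copy: one flipped bit
    ("dn2", good),
]
node, data = read_block(replicas, expected)
print(node)
```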

Metadata Disk Failure

The FsImage and the EditLog are central data structures of HDFS.

A corruption of these files can cause the HDFS instance to be

non-functional. For this reason, the NameNode can be config-

ured to support maintaining multiple copies of the FsImage and

EditLog. Any update to either the FsImage or EditLog causes each

of the FsImages and EditLogs to get updated synchronously. This

synchronous updating of multiple copies of the FsImage and EditLog

may degrade the rate of namespace transactions per second

that a NameNode can support. However, this degradation

is acceptable because even though HDFS applications are very

data intensive in nature, they are not metadata intensive. When

a NameNode restarts, it selects the latest consistent FsImage and

EditLog to use.

The NameNode machine is a single point of failure for an HDFS

cluster. If the NameNode machine fails, manual intervention is

necessary. Currently, automatic restart and failover of the Na-

meNode software to another machine is not supported.


Snapshots

Snapshots support storing a copy of data at a particular instant

of time. One usage of the snapshot feature may be to roll back

a corrupted HDFS instance to a previously known good point in

time. HDFS does not currently support snapshots but will in a fu-

ture release.

Data Organization

Data Blocks

HDFS is designed to support very large files. Applications that are

compatible with HDFS are those that deal with large data sets.

These applications write their data only once but they read it one

or more times and require these reads to be satisfied at stream-

ing speeds. HDFS supports write-once-read-many semantics on

files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file

is chopped up into 64 MB chunks, and if possible, each chunk will

reside on a different DataNode.


Staging

A client request to create a file does not reach the NameNode immediately.

In fact, initially the HDFS client caches the file data

into a temporary local file. Application writes are transparently

redirected to this temporary local file. When the local file accu-

mulates data worth over one HDFS block size, the client contacts

the NameNode. The NameNode inserts the file name into the file

system hierarchy and allocates a data block for it. The NameNode

responds to the client request with the identity of the DataNode

and the destination data block. Then the client flushes the block

of data from the local temporary file to the specified DataNode.

When a file is closed, the remaining un-flushed data in the tempo-

rary local file is transferred to the DataNode. The client then tells

the NameNode that the file is closed. At this point, the NameNode

commits the file creation operation into a persistent store. If the

NameNode dies before the file is closed, the file is lost.

The above approach has been adopted after careful consider-

ation of target applications that run on HDFS. These applications

need streaming writes to files. If a client writes to a remote file

directly without any client side buffering, the network speed

and the congestion in the network impacts throughput consider-

ably. This approach is not without precedent. Earlier distributed

file systems, e.g. AFS, have used client side caching to improve

performance. A POSIX requirement has been relaxed to achieve

higher performance of data uploads.

Replication Pipelining

When a client is writing data to an HDFS file, its data is first writ-

ten to a local file as explained in the previous section. Suppose

the HDFS file has a replication factor of three. When the local

file accumulates a full block of user data, the client retrieves a

list of DataNodes from the NameNode. This list contains the

DataNodes that will host a replica of that block. The client then

flushes the data block to the first DataNode. The first DataNode

starts receiving the data in small portions (4 KB), writes each

portion to its local repository and transfers that portion to the

second DataNode in the list. The second DataNode, in turn starts

receiving each portion of the data block, writes that portion to its

repository and then flushes that portion to the third DataNode.

Finally, the third DataNode writes the data to its local reposi-

tory. Thus, a DataNode can be receiving data from the previous

one in the pipeline and at the same time forwarding data to the

next one in the pipeline. Thus, the data is pipelined from one

DataNode to the next.
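The pipeline can be modelled with a small Python class. This is a sequential sketch under simplifying assumptions: the class `DataNode` and the function `write_block` are hypothetical, and the nested calls only reproduce the order of store and forward steps, not the true concurrency of the network transfer.

```python
PORTION = 4 * 1024  # 4 KB portions, as described above

class DataNode:
    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream   # next DataNode in the pipeline, if any
        self.stored = bytearray()      # stands in for the local repository

    def receive(self, portion):
        self.stored += portion                 # write the portion locally
        if self.downstream is not None:
            self.downstream.receive(portion)   # forward it to the next node

def write_block(block, first):
    """Client side: flush the block to the first DataNode portion by portion."""
    sent = 0
    for offset in range(0, len(block), PORTION):
        first.receive(block[offset:offset + PORTION])
        sent += 1
    return sent

# Replication factor three: dn1 -> dn2 -> dn3
dn3 = DataNode("dn3")
dn2 = DataNode("dn2", downstream=dn3)
dn1 = DataNode("dn1", downstream=dn2)

block = bytes(range(256)) * 40         # 10,240 bytes of sample data
portions_sent = write_block(block, dn1)
print(portions_sent, [len(dn.stored) for dn in (dn1, dn2, dn3)])
```

The client only ever talks to the first DataNode; the other two replicas are filled by forwarding.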

Space Reclamation

File Deletes and Undeletes

When a file is deleted by a user or an application, it is not im-

mediately removed from HDFS. Instead, HDFS first renames it

to a file in the /trash directory. The file can be restored quickly

as long as it remains in /trash. A file remains in /trash for a con-

figurable amount of time. After the expiry of its life in /trash, the

NameNode deletes the file from the HDFS namespace. The dele-

tion of a file causes the blocks associated with the file to be freed.

Note that there could be an appreciable time delay between the

time a file is deleted by a user and the time of the corresponding

increase in free space in HDFS.

A user can Undelete a file after deleting it as long as it remains

in the /trash directory. If a user wants to undelete a file that he/

she has deleted, he/she can navigate the /trash directory and re-

trieve the file. The /trash directory contains only the latest copy

of the file that was deleted. The /trash directory is just like any

other directory with one special feature: HDFS applies specified

policies to automatically delete files from this directory. The cur-

rent default policy is to delete files from /trash that are more

than 6 hours old. In the future, this policy will be configurable

through a well defined interface.
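The expiry rule is easy to express in code. In this sketch the `trash` dictionary, the timestamps and the function `expired` are hypothetical; only the six-hour lifetime is taken from the text.

```python
TRASH_LIFETIME_S = 6 * 3600  # default policy from the text: 6 hours

def expired(trash, now, lifetime=TRASH_LIFETIME_S):
    """Paths in /trash whose deletion time lies more than `lifetime` seconds back.

    trash: dict path -> time of deletion in seconds (hypothetical bookkeeping).
    """
    return sorted(path for path, deleted_at in trash.items()
                  if now - deleted_at > lifetime)

trash = {"/trash/a.txt": 0, "/trash/b.txt": 20000}
to_purge = expired(trash, now=22000)
print(to_purge)
```

At time 22,000 s the first file is 22,000 s old and is purged, while the second has spent only 2,000 s in /trash and can still be undeleted.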

Decrease Replication Factor

When the replication factor of a file is reduced, the NameNode

selects excess replicas that can be deleted. The next Heartbeat

transfers this information to the DataNode. The DataNode then

removes the corresponding blocks and the corresponding free

space appears in the cluster. Once again, there might be a time

delay between the completion of the setReplication API call and

the appearance of free space in the cluster.

Accessibility

HDFS can be accessed from applications in many different ways.

Natively, HDFS provides a Java API for applications to use. A C lan-

guage wrapper for this Java API is also available. In addition, an

HTTP browser can also be used to browse the files of an HDFS

instance. Work is in progress to expose HDFS through the Web-

DAV protocol.

FS Shell

HDFS allows user data to be organized in the form of files and directories. It provides a command-line interface called FS shell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:

Action                                                 Command
Create a directory named /foodir                       bin/hadoop dfs -mkdir /foodir
Remove a directory named /foodir                       bin/hadoop dfs -rmr /foodir
View the contents of a file named /foodir/myfile.txt   bin/hadoop dfs -cat /foodir/myfile.txt

FS shell is targeted for applications that need a scripting language to interact with the stored data.

DFSAdmin

The DFSAdmin command set is used for administering an HDFS cluster. These are commands that are used only by an HDFS administrator. Here are some sample action/command pairs:

Action                                     Command
Put the cluster in Safemode                bin/hadoop dfsadmin -safemode enter
Generate a list of DataNodes               bin/hadoop dfsadmin -report
Recommission or decommission DataNode(s)   bin/hadoop dfsadmin -refreshNodes

Browser Interface

A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port.

This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.

NVIDIA GPU Computing – The CUDA Architecture

The CUDA parallel computing platform is now widely deployed with thousands of

GPU-accelerated applications and thousands of published research papers, and a complete range

of CUDA tools and ecosystem solutions is available to developers.

High Throughput Computing | CAD | Big Data Analytics | Simulation | Aerospace | Automotive

What is GPU Accelerated Computing?

GPU-accelerated computing is the use of a graphics processing

unit (GPU) together with a CPU to accelerate scientific, analytics,

engineering, consumer, and enterprise applications. Pioneered

in 2007 by NVIDIA, GPU accelerators now power energy-efficient

datacenters in government labs, universities, enterprises, and

small-and-medium businesses around the world. GPUs are accel-

erating applications in platforms ranging from cars, to mobile

phones and tablets, to drones and robots.

How GPUs Accelerate Applications

GPU-accelerated computing offers unprecedented application

performance by offloading compute-intensive portions of the

application to the GPU, while the remainder of the code still runs

on the CPU. From a user’s perspective, applications simply run

significantly faster.

CPU versus GPU

A simple way to understand the difference between a CPU and

GPU is to compare how they process tasks. A CPU consists of a

few cores optimized for sequential serial processing while a GPU

has a massively parallel architecture consisting of thousands

of smaller, more efficient cores designed for handling multiple

tasks simultaneously.

“We are very proud to be one of the leading providers of Tesla sys-

tems who are able to combine the overwhelming power of NVIDIA

Tesla systems with the fully engineered and thoroughly tested

transtec hardware to a total Tesla-based solution.”

Bernd Zell | Senior HPC System Engineer

NVIDIA GPU Computing

What is GPU Computing?

CPU (multiple cores) versus GPU (thousands of cores): GPUs have thousands of cores to process parallel workloads.

Kepler GK110/210 GPU Computing Architecture

As the demand for high performance parallel computing increases

across many areas of science, medicine, engineering, and finance,

NVIDIA continues to innovate and meet that demand with

extraordinarily powerful GPU computing architectures.

NVIDIA’s GPUs have already redefined and accelerated High Per-

formance Computing (HPC) capabilities in areas such as seismic

processing, biochemistry simulations, weather and climate mod-

eling, signal processing, computational finance, computer aided

engineering, computational fluid dynamics, and data analysis.

NVIDIA’s Kepler GK110/210 GPUs are designed to help solve the

world’s most difficult computing problems.

SMX Processing Core Architecture

Each of the Kepler GK110/210 SMX units features 192 single-precision

CUDA cores, and each core has fully pipelined floating-point

and integer arithmetic logic units. Kepler retains the full IEEE

754-2008 compliant single and double-precision arithmetic in-

troduced in Fermi, including the fused multiply-add (FMA) oper-

ation. One of the design goals for the Kepler GK110/210 SMX was

to significantly increase the GPU’s delivered double precision

performance, since double precision arithmetic is at the heart of

many HPC applications. Kepler GK110/210’s SMX also retains the

special function units (SFUs) for fast approximate transcenden-

tal operations as in previous-generation GPUs, providing 8x the

number of SFUs of the Fermi GF110 SM.

Similar to GK104 SMX units, the cores within the new GK110/210

SMX units use the primary GPU clock rather than the 2x shader

clock. Recall the 2x shader clock was introduced in the G80 Tes-

la architecture GPU and used in all subsequent Tesla and Fermi

architecture GPUs.

Running execution units at a higher clock rate allows a chip

to achieve a given target throughput with fewer copies of the

execution units, which is essentially an area optimization,

but the clocking logic for the faster cores is more power-hungry.

For Kepler, the priority was performance per watt. While NVIDIA

made many optimizations that benefitted both area and power,

they chose to optimize for power even at the expense of some

added area cost, with a larger number of processing cores run-

ning at the lower, less power-hungry GPU clock.

Quad Warp Scheduler

The SMX schedules threads in groups of 32 parallel threads

called warps. Each SMX features four warp schedulers and eight

instruction dispatch units, allowing four warps to be issued and

executed concurrently.

Kepler’s quad warp scheduler selects four warps, and two inde-

pendent instructions per warp can be dispatched each cycle. Un-

like Fermi, which did not permit double precision instructions to

be paired with other instructions, Kepler GK110/210 allows dou-

ble precision instructions to be paired with other instructions.

New ISA Encoding: 255 Registers per Thread

The number of registers that can be accessed by a thread has

been quadrupled in GK110, allowing each thread access to up to

255 registers. Codes that exhibit high register pressure or spill-

ing behavior in Fermi may see substantial speedups as a result

of the increased available per-thread register count. A compelling

example can be seen in the QUDA library for performing lattice

QCD (quantum chromodynamics) calculations using CUDA.

QUDA fp64-based algorithms see performance increases up to

5.3x due to the ability to use many more registers per thread and

to experience fewer spills to local memory.

GK210 further improves this, doubling the overall register file ca-

pacity per SMX as compared to GK110. In doing so, it allows appli-

cations to more readily make use of higher numbers of registers

per thread without sacrificing the number of threads that can

fit concurrently per SMX. For example, a CUDA kernel using 128

registers per thread on GK110 is limited to 512 out of a possible 2048

concurrent threads per SMX, limiting the available parallelism.

Each Kepler SMX contains 4 Warp Schedulers, each with dual Instruction Dispatch Units. A single Warp Scheduler Unit is shown above.



GK210 doubles the concurrency automatically in this case, which

can help to cover arithmetic and memory latencies, improving

overall efficiency.
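The occupancy figures quoted above follow from simple division. The sketch below assumes a register file of 65,536 32-bit registers per SMX on GK110 and twice that on GK210; these sizes are assumptions consistent with the 512-thread example in the text. The model also ignores allocation granularity and every other occupancy limit, so it is a back-of-the-envelope estimate, not an occupancy calculator.

```python
MAX_THREADS_PER_SMX = 2048

# Assumed 32-bit registers per SMX (GK210 doubles the GK110 register file).
REGISTER_FILE = {"GK110": 65536, "GK210": 131072}

def resident_threads(gpu, regs_per_thread):
    """Upper bound on concurrent threads per SMX imposed by register usage."""
    by_registers = REGISTER_FILE[gpu] // regs_per_thread
    return min(MAX_THREADS_PER_SMX, by_registers)

for gpu in ("GK110", "GK210"):
    print(gpu, resident_threads(gpu, 128), resident_threads(gpu, 32))
```

With 128 registers per thread the bound is 512 threads on GK110 and 1024 on GK210, while a lean kernel using 32 registers per thread reaches the full 2048 threads on both.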

Shuffle Instruction

To further improve performance, Kepler implements a new Shuffle

instruction, which allows threads within a warp to share data.

Previously, sharing data between threads within a warp required

separate store and load operations to pass the data through

shared memory. With the Shuffle instruction, threads within

a warp can read values from other threads in the warp in just

about any imaginable permutation.

Shuffle supports arbitrary indexed references – i.e. any thread

reads from any other thread. Useful shuffle subsets, including

next-thread (offset up or down by a fixed amount) and XOR “but-

terfly” style permutations among the threads in a warp, are also

available as CUDA intrinsics. Shuffle offers a performance advan-

tage over shared memory, in that a store-and-load operation is

carried out in a single step. Shuffle also can reduce the amount of

shared memory needed per thread block, since data exchanged

at the warp level never needs to be placed in shared memory. In

the case of FFT, which requires data sharing within a warp, a 6%

performance gain can be seen just by using Shuffle.
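To make the XOR "butterfly" pattern concrete, the following Python model simulates the 32 lanes of a warp as a list. It is not CUDA code: `shfl_xor` is a hypothetical stand-in for the hardware shuffle, and the sum reduction is just one common use of the pattern.

```python
WARP_SIZE = 32

def shfl_xor(values, lane_mask):
    """Model of an XOR shuffle: lane i reads the value held by lane i ^ lane_mask."""
    return [values[lane ^ lane_mask] for lane in range(WARP_SIZE)]

def warp_all_reduce_sum(values):
    """Butterfly reduction: after log2(32) = 5 exchange steps every lane
    holds the sum of all 32 inputs, without any shared memory."""
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset >= 1:
        partner = shfl_xor(vals, offset)
        vals = [own + other for own, other in zip(vals, partner)]
        offset //= 2
    return vals

lanes = list(range(WARP_SIZE))   # lane i starts with the value i
result = warp_all_reduce_sum(lanes)
print(result[0], len(set(result)))
```

Each step pairs a lane with the lane whose index differs in exactly one bit, so the partial sums double in coverage until all lanes agree on the total.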

Atomic Operations

Atomic memory operations are important in parallel programming,

allowing concurrent threads to correctly perform

read-modify-write operations on shared data structures. Atomic

operations such as add, min, max, and compare-and-swap are

atomic in the sense that the read, modify, and write operations

are performed without interruption by other threads.

Atomic memory operations are widely used for parallel sorting,

reduction operations, and building data structures in parallel

without locks that serialize thread execution. Throughput of

global memory atomic operations on Kepler GK110/210 is substantially

improved compared to the Fermi generation. Atomic

operation throughput to a common global memory address is

improved by 9x to one operation per clock.

Atomic operation throughput to independent global address-

es is also significantly accelerated, and logic to handle address

conflicts has been made more efficient. Atomic operations can

often be processed at rates similar to global load operations.

This speed increase makes atomics fast enough to use frequent-

ly within kernel inner loops, eliminating the separate reduction

passes that were previously required by some algorithms to con-

solidate results.

This example shows some of the variations possible using the new Shuffle instruction in Kepler.



Kepler GK110 also expands the native support for 64-bit atomic

operations in global memory. In addition to atomicAdd, atomic-

CAS, and atomicExch (which were also supported by Fermi and

Kepler GK104), GK110 supports the following:

• atomicMin

• atomicMax

• atomicAnd

• atomicOr

• atomicXor

Other atomic operations which are not supported natively (for

example 64-bit floating point atomics) may be emulated using

the compare-and-swap (CAS) instruction.
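The emulation technique is a retry loop around compare-and-swap. The Python model below illustrates it for a 64-bit floating point add; the class `Word64` is hypothetical, and its lock merely imitates the atomicity that the hardware CAS instruction provides. On the GPU the same loop would be written with the CUDA intrinsic atomicCAS on the 64-bit integer view of the value.

```python
import struct
import threading

class Word64:
    """One 64-bit memory word with a compare-and-swap primitive."""
    def __init__(self, bits=0):
        self._bits = bits
        self._lock = threading.Lock()   # models hardware atomicity only

    def load(self):
        return self._bits

    def compare_and_swap(self, expected, new):
        """Store `new` only if the word still equals `expected`; return the old value."""
        with self._lock:
            old = self._bits
            if old == expected:
                self._bits = new
            return old

def float_to_bits(x):
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_to_float(bits):
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def atomic_add_double(word, value):
    """Emulated atomic add: retry until no other thread changed the word in between."""
    old = word.load()
    while True:
        assumed = old
        new = float_to_bits(bits_to_float(assumed) + value)
        old = word.compare_and_swap(assumed, new)
        if old == assumed:
            return bits_to_float(old)   # value before the add

word = Word64(float_to_bits(0.0))
workers = [threading.Thread(target=lambda: [atomic_add_double(word, 1.0)
                                            for _ in range(250)])
           for _ in range(4)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
total = bits_to_float(word.load())
print(total)
```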

Texture Improvements

The GPU’s dedicated hardware Texture units are a valuable resource

for compute programs with a need to sample or filter

image data. The texture throughput in Kepler is significantly in-

creased compared to Fermi – each SMX unit contains 16 texture

filtering units, a 4x increase vs the Fermi GF110 SM.

In addition, Kepler changes the way texture state is managed. In

the Fermi generation, for the GPU to reference a texture, it had

to be assigned a “slot” in a fixed-size binding table prior to grid launch.

The number of slots in that table ultimately limits how many

unique textures a program can read from at run time. As a result,

a program was limited to accessing only 128 simultaneous textures

in Fermi.

With bindless textures in Kepler, the additional step of using

slots isn’t necessary: texture state is now saved as an object in

memory and the hardware fetches these state objects on de-

mand, making binding tables obsolete.



This effectively eliminates any limits on the number of unique

textures that can be referenced by a compute program. Instead,

programs can map textures at any time and pass texture handles

around as they would any other pointer.

Dynamic Parallelism

In a hybrid CPU-GPU system, enabling a larger amount of parallel

code in an application to run efficiently and entirely within

the GPU improves scalability and performance as GPUs increase

in perf/watt. To accelerate these additional parallel portions of

the application, GPUs must support more varied types of parallel

workloads. Dynamic Parallelism is introduced with Kepler GK110

and also included in GK210. It allows the GPU to generate new

work for itself, synchronize on results, and control the schedul-

ing of that work via dedicated, accelerated hardware paths, all

without involving the CPU.

Fermi was very good at processing large parallel data structures

when the scale and parameters of the problem were known at

kernel launch time. All work was launched from the host CPU,

would run to completion, and return a result back to the CPU.

The result would then be used as part of the final solution, or

would be analyzed by the CPU which would then send additional

requests back to the GPU for additional processing.

In Kepler GK110/210 any kernel can launch another kernel, and

can create the necessary streams, events and manage the de-

pendencies needed to process additional work without the need

for host CPU interaction. This architectural innovation makes it

easier for developers to create and optimize recursive and data-dependent

execution patterns, and allows more of a program to

be run directly on GPU. The system CPU can then be freed up for

additional tasks, or the system could be configured with a less

powerful CPU to carry out the same workload.

Dynamic Parallelism allows more varieties of parallel algorithms

to be implemented on the GPU, including nested loops with

differing amounts of parallelism, parallel teams of serial con-

trol-task threads, or simple serial control code offloaded to the

GPU in order to promote data-locality with the parallel portion

of the application. Because a kernel has the ability to launch ad-

ditional workloads based on intermediate, on-GPU results, pro-

grammers can now intelligently load-balance work to focus the

bulk of their resources on the areas of the problem that either

require the most processing power or are most relevant to the

solution. One example would be dynamically setting up a grid

for a numerical simulation – typically grid cells are focused in re-

gions of greatest change, requiring an expensive pre-processing

pass through the data.

Alternatively, a uniformly coarse grid could be used to prevent

wasted GPU resources, or a uniformly fine grid could be used to

ensure all the features are captured, but these options risk miss-

ing simulation features or “over-spending” compute resources

on regions of less interest. With Dynamic Parallelism, the grid

resolution can be determined dynamically at runtime in a da-

ta-dependent manner. Starting with a coarse grid, the simulation

can “zoom in” on areas of interest while avoiding unnecessary

calculation in areas with little change. Though this could be

accomplished using a sequence of CPU-launched kernels, it

would be far simpler to allow the GPU to refine the grid itself by

analyzing the data and launching additional work as part of a

single simulation kernel, eliminating interruption of the CPU and

data transfers between the CPU and GPU.

Dynamic Parallelism allows more parallel code in an application to be launched directly by the GPU onto itself (right side of image) rather than requiring CPU intervention (left side of image).

Hyper-Q

One of the challenges in the past has been keeping the GPU

supplied with an optimally scheduled load of work from multiple

streams. The Fermi architecture supported 16-way concurrency

of kernel launches from separate streams, but ultimately

the streams were all multiplexed into the same hardware work

queue. This allowed for false intra-stream dependencies, requir-

ing dependent kernels within one stream to complete before ad-

ditional kernels in a separate stream could be executed. While

this could be alleviated to some extent through the use of a

breadth-first launch order, as program complexity increases, this

can become more and more difficult to manage efficiently.

Kepler GK110/210 improve on this functionality with their

Hyper-Q feature. Hyper-Q increases the total number of connec-

tions (work queues) between the host and the CUDA Work Dis-

tributor (CWD) logic in the GPU by allowing 32 simultaneous,

hardware-managed connections (compared to the single con-

nection available with Fermi). Hyper-Q is a flexible solution that

allows connections from multiple CUDA streams, from multiple

Message Passing Interface (MPI) processes, or even from multiple

threads within a process. Applications that previously encountered

false serialization across tasks, thereby limiting GPU utilization,

can see up to a 32x performance increase without changing

any existing code. Each CUDA stream is managed within its own

hardware work queue, inter-stream dependencies are optimized,

and operations in one stream will no longer block other streams,

enabling streams to execute concurrently without needing to

specifically tailor the launch order to eliminate possible false dependencies.

Image attribution: Charles Reid
Hyper-Q offers significant benefits for use in MPI-based parallel

computer systems. Legacy MPI-based algorithms were often cre-

ated to run on multi-core CPU systems, with the amount of work

assigned to each MPI process scaled accordingly. This can lead to

a single MPI process having insufficient work to fully occupy the

GPU. While it has always been possible for multiple MPI process-

es to share a GPU, these processes could become bottlenecked

by false dependencies. Hyper-Q removes those false dependen-

cies, dramatically increasing the efficiency of GPU sharing across

MPI processes.
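The effect of launch order on a single shared work queue, and the benefit of one queue per stream, can be illustrated with a toy scheduling model. The sketch below is not NVIDIA's scheduler. It assumes unlimited execution resources, unit-length kernels, and a queue whose head entry holds back everything behind it until its own in-stream predecessor has finished. The function names are hypothetical.

```python
def single_queue_makespan(launch_order, duration=1.0):
    """Finish time when all streams share one in-order hardware queue.

    launch_order is a list of (stream, kernel_index) pairs in submission
    order, with kernel indices counting up from 0 inside each stream.
    Toy assumption: an entry can start only after the entry in front of it
    has started and after its own in-stream predecessor has finished.
    """
    finish = {}
    head_start = 0.0  # start time of the entry currently in front
    for stream, idx in launch_order:
        predecessor_done = finish.get((stream, idx - 1), 0.0)
        start = max(head_start, predecessor_done)
        finish[(stream, idx)] = start + duration
        head_start = start
    return max(finish.values())


def per_stream_queue_makespan(launch_order, duration=1.0):
    """Finish time when every stream has its own queue (Hyper-Q-like)."""
    streams = {stream for stream, _ in launch_order}
    return max(
        single_queue_makespan(
            [entry for entry in launch_order if entry[0] == stream], duration
        )
        for stream in streams
    )


streams, kernels = "ABC", 3
depth_first = [(s, k) for s in streams for k in range(kernels)]
breadth_first = [(s, k) for k in range(kernels) for s in streams]

print(single_queue_makespan(depth_first))      # 7.0
print(single_queue_makespan(breadth_first))    # 3.0
print(per_stream_queue_makespan(depth_first))  # 3.0
```

In this model, a depth-first launch order on the shared queue serializes most of the work, a breadth-first order recovers the concurrency, and per-stream queues make the launch order irrelevant.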

Grid Management Unit – Efficiently Keeping the GPU Utilized

New features introduced with Kepler GK110, such as the ability for CUDA kernels to launch work directly on the GPU with Dynamic Parallelism, required that the CPU-to-GPU workflow in Kepler offer increased functionality over the Fermi design. On Fermi, a grid of thread blocks would be launched by the CPU and would always run to completion, creating a simple unidirectional flow of work from the host to the SMs via the CUDA Work Distributor (CWD) unit. Kepler GK110/210 improve the CPU-to-GPU workflow by allowing the GPU to efficiently manage both CPU- and CUDA-created workloads.

Kepler GK110 allows kernels to launch work directly on the GPU, and it is important to understand the changes made in the Kepler GK110 architecture to facilitate these new functions. In Kepler GK110/210, a grid can be launched from the CPU just as was the case with Fermi; however, new grids can also be created programmatically by CUDA within the Kepler SMX unit. To manage both CUDA-created and host-originated grids, a new Grid Management Unit (GMU) was introduced in Kepler GK110. This control unit manages and prioritizes grids that are passed into the CWD to be sent to the SMX units for execution.

The CWD in Kepler holds grids that are ready to dispatch, and

it is able to dispatch 32 active grids, which is double the capac-

ity of the Fermi CWD. The Kepler CWD communicates with the

GMU via a bi-directional link that allows the GMU to pause the

dispatch of new grids and to hold pending and suspended grids

until needed. The GMU also has a direct connection to the Kepler

SMX units to permit grids that launch additional work on the

GPU via Dynamic Parallelism to send the new work back to the GMU

to be prioritized and dispatched. If the kernel that dispatched the

additional workload pauses, the GMU will hold it inactive until

the dependent work has completed.

The redesigned Kepler HOST to GPU workflow shows the new Grid Management Unit, which allows it to manage the actively dispatching grids, pause dispatch, and hold pending and suspended grids.

Page 118: HPC Compass 2016_17



transtec has strived to develop well-engineered GPU computing solutions from the very beginning of the Tesla era. From high-performance GPU workstations to rack-mounted Tesla server solutions, transtec has a broad range of specially designed systems available.

As an NVIDIA Tesla Preferred Provider (TPP), transtec is able to provide customers with the latest NVIDIA GPU technology as well as fully engineered hybrid systems and Tesla Preconfigured Clusters. Thus, customers can be assured that transtec’s extensive experience in HPC cluster solutions is seamlessly brought into the GPU computing world. Performance Engineering made by transtec.


Page 119: HPC Compass 2016_17


NVIDIA GPUDirect

When working with a large amount of data, increasing the data throughput and reducing latency is vital to increasing compute performance. Kepler GK110/210 support the RDMA feature in NVIDIA GPUDirect, which is designed to improve performance by allowing direct access to GPU memory by third-party devices such as IB adapters, NICs, and SSDs.

When using CUDA 5.0 or later, GPUDirect provides the following important features:

• Direct memory access (DMA) between NIC and GPU without the need for CPU-side data buffering.
• Significantly improved MPI_Send/MPI_Recv efficiency between GPU and other nodes in a network.
• Elimination of CPU bandwidth and latency bottlenecks.
• Compatibility with a variety of 3rd-party network, capture, and storage devices.
Applications like reverse time migration (used in seismic imaging for oil & gas exploration) distribute the large imaging data across several GPUs. Hundreds of GPUs must collaborate to crunch the data, often communicating intermediate results. GPUDirect enables much higher aggregate bandwidth for this GPU-to-GPU communication scenario within a server and across servers with the P2P and RDMA features. Kepler GK110 also supports other GPUDirect features such as Peer-to-Peer and GPUDirect for Video.

GPUDirect RDMA allows direct access to GPU memory from 3rd-party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well.

Kepler Memory Subsystem – L1, L2, ECC

Kepler’s memory hierarchy is organized similarly to Fermi. The Kepler architecture supports a unified memory request path for loads and stores, with an L1 cache per SMX multiprocessor. Kepler GK110 also enables compiler-directed use of an additional new cache for read-only data, as described below.

Configurable Shared Memory and L1 Cache

In the Kepler GK110 architecture, as in the previous generation Fermi architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Kepler now allows for additional flexibility in configuring the allocation of shared memory and L1 cache by permitting a 32 KB/32 KB split between shared memory and L1 cache.

Configurable Shared Memory and L1 Cache

Page 120: HPC Compass 2016_17


To support the increased throughput of each SMX unit, the shared memory bandwidth for 64-bit and larger load operations is also doubled compared to the Fermi SM, to 256 bytes per core clock. For the GK210 architecture, the total amount of configurable memory is doubled to 128 KB, allowing a maximum of 112 KB of shared memory and 16 KB of L1 cache. Other possible memory configurations are 32 KB of L1 cache with 96 KB of shared memory, or 48 KB of L1 cache with 80 KB of shared memory.

This increase allows a similar improvement in concurrency of threads as is enabled by the register file capacity improvement described above.
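The concurrency effect of the larger shared memory can be made concrete with a small calculation. The sketch below uses only the shared memory/L1 splits quoted above. It deliberately ignores every other occupancy limit (registers, thread counts, block limits per SMX), and the helper names are hypothetical.

```python
# Shared memory / L1 splits per SMX in KB, as listed in the text above.
SPLITS_KB = {
    "GK110": [(48, 16), (32, 32), (16, 48)],
    "GK210": [(112, 16), (96, 32), (80, 48)],
}


def blocks_limited_by_shared_memory(gpu, shared_kb_per_block):
    """Resident thread blocks per SMX if shared memory were the only limit."""
    largest_shared = max(shared for shared, _l1 in SPLITS_KB[gpu])
    return largest_shared // shared_kb_per_block


def split_with_most_l1(gpu, shared_kb_needed):
    """Pick the split with the largest L1 that still fits the request."""
    fitting = [s for s in SPLITS_KB[gpu] if s[0] >= shared_kb_needed]
    return max(fitting, key=lambda s: s[1]) if fitting else None


# A kernel whose thread blocks each use 24 KB of shared memory:
print(blocks_limited_by_shared_memory("GK110", 24))  # 2
print(blocks_limited_by_shared_memory("GK210", 24))  # 4

# A kernel that needs 20 KB in total and otherwise prefers a large L1:
print(split_with_most_l1("GK110", 20))  # (32, 32)
```

Under these assumptions, a block that uses 24 KB of shared memory fits twice per SMX on GK110 and four times on GK210.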

48 KB Read-Only Data Cache

In addition to the L1 cache, Kepler introduces a 48 KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only by the Texture unit. Expert programmers often found it advantageous to load data through this path explicitly by mapping their data as textures, but this approach had many limitations.

In Kepler, in addition to significantly increasing the capacity of this cache along with the texture horsepower increase, NVIDIA decided to make the cache directly accessible to the SM for general load operations. Use of the read-only path is beneficial because it takes both load and working set footprint off of the Shared/L1 cache path. In addition, the Read-Only Data Cache’s higher tag bandwidth supports full speed unaligned memory access patterns among other scenarios.

Use of the read-only path can be managed automatically by the compiler or explicitly by the programmer. Access to any variable or data structure that is known to be constant through programmer use of the “const __restrict__” qualifiers (CUDA’s counterpart of the C99 “restrict” keyword) may be tagged by the compiler to be loaded through the Read-Only Data Cache. The programmer can also explicitly use this path with the __ldg() intrinsic.

NVIDIA, GeForce, Tesla, CUDA, PhysX, GigaThread, NVIDIA Parallel Data Cache and certain other trademarks and logos appearing in this brochure are trademarks or registered trademarks of NVIDIA Corporation.


Page 121: HPC Compass 2016_17






tech BOX


Improved L2 Cache

The Kepler GK110/210 GPUs feature 1536 KB of dedicated L2 cache memory, double the amount of L2 available in the Fermi architecture. The L2 cache is the primary point of data unification between the SMX units, servicing all load, store, and texture requests and providing efficient, high speed data sharing across the GPU.

The L2 cache on Kepler offers up to 2x the bandwidth per clock available in Fermi. Algorithms for which data addresses are not known beforehand, such as physics solvers, ray tracing, and sparse matrix multiplication, especially benefit from the cache hierarchy. Filter and convolution kernels that require multiple SMs to read the same data also benefit.

Memory Protection Support

Like Fermi, Kepler’s register files, shared memories, L1 cache, L2 cache and DRAM memory are protected by a Single-Error Correct Double-Error Detect (SECDED) ECC code. In addition, the Read-Only Data Cache supports single-error correction through a parity check; in the event of a parity error, the cache unit automatically invalidates the failed line, forcing a read of the correct data from L2.
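The SECDED principle mentioned above can be sketched with a textbook extended Hamming code. The word width used by the GPU hardware is not given in the text, so this illustration protects only 4 data bits with 4 check bits for readability. The function names are hypothetical.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # codeword positions that are not powers of two


def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    code = [0] * 8
    for bit, pos in enumerate(DATA_POSITIONS):
        code[pos] = (nibble >> bit) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # overall parity enables double-error detection
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions points at a single bad bit
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1  # syndrome 0 means the overall parity bit flipped
        status = "corrected"
    else:
        return "uncorrectable", None  # two bits flipped: detect, do not correct
    nibble = sum(code[pos] << bit for bit, pos in enumerate(DATA_POSITIONS))
    return status, nibble


word = encode(0b1011)
word[5] ^= 1            # one bit flips in storage
print(decode(word))     # ('corrected', 11)
word[2] ^= 1            # a second bit flips
print(decode(word))     # ('uncorrectable', None)
```

A single flipped bit is located and repaired, while two flipped bits are reported without a (wrong) correction, which is the behaviour the SECDED name describes.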

ECC checkbit fetches from DRAM necessarily consume some amount of DRAM bandwidth, which results in a performance difference between ECC-enabled and ECC-disabled operation, especially on memory bandwidth-sensitive applications. Kepler GK110 implements several optimizations to ECC checkbit fetch handling based on Fermi experience. As a result, the ECC on-vs-off performance delta has been reduced by an average of 66%, as measured across NVIDIA’s internal compute application test suite.

CUDA

CUDA is a combination hardware/software platform that enables NVIDIA GPUs to execute programs written with C, C++, Fortran, and other languages. A CUDA program invokes parallel functions called kernels that execute across many parallel threads. The programmer or compiler organizes these threads into thread blocks and grids of thread blocks. Each thread within a thread block executes an instance of the kernel. Each thread also has thread and block IDs within its thread block and grid, a program counter, registers, per-thread private memory, inputs, and output results.

A thread block is a set of concurrently executing threads that

can cooperate among themselves through barrier synchro-

nization and shared memory. A thread block has a block ID

within its grid. A grid is an array of thread blocks that execute

the same kernel, read inputs from global memory, write re-

sults to global memory, and synchronize between dependent

kernel calls. In the CUDA parallel programming model, each

thread has a per-thread private memory space used for regis-

ter spills, function calls, and C automatic array variables. Each

thread block has a per-block shared memory space used for

inter-thread communication, data sharing, and result sharing in parallel algorithms. Grids of thread blocks share results in Global Memory space after kernel-wide global synchronization.


CUDA Hardware Execution

CUDA’s hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM on Fermi / SMX on Kepler) executes one or more thread blocks; and CUDA cores and other

Page 122: HPC Compass 2016_17


execution units in the SMX execute thread instructions. The SMX

executes threads in groups of 32 threads called warps.

While programmers can generally ignore warp execution for

functional correctness and focus on programming individual

scalar threads, they can greatly improve performance by having

threads in a warp execute the same code path and access memory

with nearby addresses.
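Why nearby addresses matter can be shown by counting how many memory segments one warp touches for different access strides. This is a simplified sketch: the warp size of 32 comes from the text, whereas the 128-byte segment size and the 4-byte element size are assumptions chosen for illustration, and the helper names are hypothetical.

```python
WARP_SIZE = 32        # threads per warp, from the text above
SEGMENT_BYTES = 128   # assumed size of one memory transaction


def warp_and_lane(thread_idx):
    """Split a thread's index within its block into (warp, lane)."""
    return divmod(thread_idx, WARP_SIZE)


def segments_touched(warp, stride, element_bytes=4):
    """Count distinct memory segments one warp touches.

    Thread t is assumed to read element t * stride of an array whose
    elements are element_bytes wide.
    """
    first_thread = warp * WARP_SIZE
    addresses = [
        (first_thread + lane) * stride * element_bytes
        for lane in range(WARP_SIZE)
    ]
    return len({address // SEGMENT_BYTES for address in addresses})


print(warp_and_lane(70))         # (2, 6)
print(segments_touched(0, 1))    # 1
print(segments_touched(0, 2))    # 2
print(segments_touched(0, 32))   # 32
```

With unit stride the whole warp is served from a single segment, while a stride of 32 elements spreads the same 32 loads over 32 segments.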

CUDA hierarchy of threads, blocks, and grids, with corresponding per-thread private, per-block shared, and per-application global memory spaces.

CUDA Toolkit

The NVIDIA CUDA Toolkit provides a comprehensive development

environment for C and C++ developers building GPU-accelerated

NVIDIA GPU Computing

A Quick Refresher on CUDA

Page 123: HPC Compass 2016_17


applications. The CUDA Toolkit includes a compiler for NVIDIA

GPUs, math libraries, and tools for debugging and optimizing the

performance of your applications. You’ll also find programming

guides, user manuals, API reference, and other documentation to

help you get started quickly accelerating your application with GPUs.


Dramatically simplify parallel programming

64-bit ARM Support

• Develop or recompile your applications to run on 64-bit ARM systems with NVIDIA GPUs.

Unified Memory

• Simplifies programming by enabling applications to access CPU and GPU memory without the need to manually copy data.

Drop-in Libraries and cuFFT callbacks

• Automatically accelerate applications’ BLAS and FFTW calculations by up to 8X by simply replacing the existing CPU libraries with the GPU-accelerated equivalents.
• Use cuFFT callbacks for higher performance custom processing on input or output data.

Multi-GPU scaling

• cublasXT – a new BLAS GPU library that automatically scales performance across up to eight GPUs in a single node, delivering over nine teraflops of double precision performance per node, and supporting larger workloads than ever before (up to 512 GB). The re-designed FFT GPU library scales up to 2 GPUs in a single node, allowing larger transform sizes and higher throughput.

Improved Tools Support

• Support for Microsoft Visual Studio 2013.
• Improved debugging for CUDA Fortran applications.
• Replay feature in Visual Profiler and nvprof.
• nvprune utility to optimize the size of object files.

The new Tesla K80 GPU Accelerator based on the GK210 GPU

A dual-GPU board that combines 24 GB of memory with blazing fast memory bandwidth and up to 2.91 Tflops double precision performance with NVIDIA GPU Boost, the Tesla K80 GPU is designed for the most demanding computational tasks. It’s ideal for single and double precision workloads that not only require leading compute performance but also demand high data throughput.

Page 124: HPC Compass 2016_17
Page 125: HPC Compass 2016_17

Intel Xeon Phi Coprocessor – The Architecture

Intel Many Integrated Core (Intel MIC) architecture combines many Intel CPU cores onto

a single chip. Intel MIC architecture is targeted for highly parallel, High Performance

Computing (HPC) workloads in a variety of fields such as computational physics, chem-

istry, biology, and financial services. Today such workloads are run as task parallel pro-

grams on large compute clusters.

Engineering | Life Sciences | Automotive | Price Modelling | Aerospace | CAE | Data Analytics

Page 126: HPC Compass 2016_17


What is the Intel Xeon Phi coprocessor?

Intel Xeon Phi coprocessors are PCI Express form factor add-in cards that work synergistically with Intel Xeon processors to enable dramatic performance gains for highly parallel code – up to 1.2 double-precision teraFLOPS (floating point operations per second) per coprocessor. Manufactured using Intel’s industry-leading 22 nm technology with 3-D Tri-Gate transistors, each coprocessor features more cores, more threads, and wider vector execution units than an Intel Xeon processor. The high degree of parallelism compensates for the lower speed of each core to deliver higher aggregate performance for highly parallel workloads.

What applications can benefit from the Intel Xeon Phi coprocessor?


While a majority of applications (80 to 90 percent) will contin-

ue to achieve maximum performance on Intel Xeon processors,

certain highly parallel applications will benefit dramatically by

using Intel Xeon Phi coprocessors. To take full advantage of Intel

Xeon Phi coprocessors, an application must scale well to over 100

software threads and either make extensive use of vectors or ef-

ficiently use more local memory bandwidth than is available on

an Intel Xeon processor. Examples of segments with highly paral-

lel applications include: animation, energy, finance, life sciences,

manufacturing, medical, public sector, weather, and more.

When do I use Intel Xeon processors and Intel Xeon Phi coprocessors?


Intel Xeon Phi Coprocessor

The Architecture

Page 127: HPC Compass 2016_17






tech BOX


Think “reuse” rather than “recode”

Since languages, tools, and applications are compatible for both Intel Xeon processors and coprocessors, now you can think “reuse” rather than “recode”.

Single programming model for all your code

The Intel Xeon Phi coprocessor gives developers a hardware de-

sign point optimized for extreme parallelism, without requiring

them to re-architect or rewrite their code. No need to rethink the

entire problem or learn a new programming model; simply rec-

ompile and optimize existing code using familiar tools, libraries,

and runtimes.

Performance multiplier

By maintaining a single source code between Intel Xeon proces-

sors and Intel Xeon Phi coprocessors, developers optimize once

for parallelism but maximize performance on both processor

and coprocessor.

Execution flexibility

Intel Xeon Phi is designed from the ground up for high-perfor-

mance computing. Unlike a GPU, a coprocessor can host an oper-

ating system, be fully IP addressable, and support standards such

as Message Passing Interface (MPI). Additionally, it can operate in

multiple execution modes:

• “Symmetric” mode: Workload tasks are shared between the host processor and coprocessor.
• “Native” mode: Workload resides entirely on the coprocessor, which essentially acts as a separate compute node.
• “Offload” mode: Workload resides on the host processor, and parts of the workload are sent out to the coprocessor as needed.

Multiple Intel Xeon Phi coprocessors can be installed in a sin-

gle host system. Within a single system, the coprocessors can

communicate with each other through the PCIe peer-to-peer

interconnect without any intervention from the host. Similarly,

A Family of Coprocessors for Diverse Needs

Intel Xeon Phi coprocessors provide up to 61 cores, 244 threads, and 1.2 teraflops of performance, and they come in a variety of configurations to address diverse hardware, software, workload, performance, and efficiency requirements. They also come in a variety of form factors, including a standard PCIe x16 form factor (with active, passive, or no thermal solution), and a dense form factor that offers additional design flexibility.

• The Intel Xeon Phi Coprocessor 3100 family provides outstanding parallel performance. It is an excellent choice for compute-bound workloads, such as Monte Carlo, Black-Scholes, HPL, life sciences, and many others. Active and passive cooling options provide flexible support for a variety of server and workstation systems.
• The Intel Xeon Phi Coprocessor 5100 family is optimized for high-density computing and is well-suited for workloads that are memory-bandwidth bound, such as STREAM, memory-capacity bound, such as ray-tracing, or both, such as reverse time migration (RTM). These coprocessors are passively cooled and have the lowest thermal design power (TDP) of the Intel Xeon Phi product family.
• The Intel Xeon Phi Coprocessor 7100 family provides the most features and the highest performance and memory capacity of the Intel Xeon Phi product family. This family supports Intel Turbo Boost Technology 1.0, which increases core frequencies during peak workloads when thermal conditions allow. Passive and no thermal solution options enable powerful and innovative computing solutions.

Page 128: HPC Compass 2016_17


the coprocessors can also communicate through a network card

such as InfiniBand or Ethernet, without any intervention from

the host.

The Intel Xeon Phi coprocessor is primarily composed of process-

ing cores, caches, memory controllers, PCIe client logic, and a very

high bandwidth, bidirectional ring interconnect. Each core comes

complete with a private L2 cache that is kept fully coherent by a

global-distributed tag directory. The memory controllers and the

PCIe client logic provide a direct interface to the GDDR5 mem-

ory on the coprocessor and the PCIe bus, respectively. All these

components are connected together by the ring interconnect.

The first generation Intel Xeon Phi product codenamed “Knights Corner”



Page 129: HPC Compass 2016_17


Each core in the Intel Xeon Phi coprocessor is designed to be pow-

er efficient while providing a high throughput for highly parallel

workloads. A closer look reveals that the core uses a short in-or-

der pipeline and is capable of supporting 4 threads in hardware.

It is estimated that the cost to support IA architecture legacy is a

mere 2% of the area costs of the core and is even less at a full chip

or product level. Thus the cost of bringing the Intel Architecture

legacy capability to the market is very marginal.

Vector Processing Unit

An important component of the Intel Xeon Phi coprocessor’s core is its vector processing unit (VPU). The VPU features a novel 512-bit SIMD instruction set, officially known as Intel Initial Many Core Instructions (Intel IMCI). Thus, the VPU can execute 16 single-precision (SP) or 8 double-precision (DP) operations per cycle. The VPU also supports Fused Multiply-Add (FMA) instructions and hence can execute 32 SP or 16 DP floating point operations per cycle. It also provides support for integers.
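The per-cycle figures above translate into peak performance with a short calculation. The sketch below takes the vector width, the FMA factor and the core count of 61 from this brochure. The clock frequency of 1.238 GHz is an assumption for a top-bin part and is not stated in the text.

```python
VECTOR_BITS = 512
FLOPS_PER_FMA = 2  # one multiply plus one add per lane


def flops_per_cycle(element_bits):
    """Floating point operations one core can issue per cycle with FMA."""
    lanes = VECTOR_BITS // element_bits
    return lanes * FLOPS_PER_FMA


def peak_gflops(cores, element_bits, clock_ghz):
    """Theoretical peak in GFLOPS for the whole coprocessor."""
    return cores * flops_per_cycle(element_bits) * clock_ghz


CORES = 61         # from the text
CLOCK_GHZ = 1.238  # assumed clock frequency, not given in the text

print(flops_per_cycle(64))                            # 16
print(flops_per_cycle(32))                            # 32
print(round(peak_gflops(CORES, 64, CLOCK_GHZ), 3))    # 1208.288
```

Under the assumed clock, the double-precision peak comes out at roughly 1.2 teraFLOPS, which is consistent with the figure quoted for the coprocessor family.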

Vector units are very power efficient for HPC workloads. A single

operation can encode a great deal of work and does not incur en-

ergy costs associated with fetching, decoding, and retiring many

instructions. However, several improvements were required to

support such wide SIMD instructions. For example, a mask register

was added to the VPU to allow per lane predicated execution. This

helped in vectorizing short conditional branches, thereby im-

proving the overall software pipelining efficiency. The VPU also

supports gather and scatter instructions, which are simply non-

unit stride vector memory accesses, directly in hardware. Thus

for codes with sporadic or irregular access patterns, vector scat-

ter and gather instructions help in keeping the code vectorized.
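Per-lane predication and gather/scatter can be mimicked in a few lines to show what the hardware does for a vectorized loop. This is a behavioural sketch on plain Python lists, not IMCI code. The lane count of 16 matches the single-precision width given above, and all function names are hypothetical.

```python
from operator import mul

LANES = 16  # single-precision lanes in a 512-bit vector


def masked_op(op, a, b, mask, passthrough):
    """Apply op lane by lane where mask is set; keep passthrough elsewhere."""
    return [op(x, y) if m else p for x, y, m, p in zip(a, b, mask, passthrough)]


def gather(memory, indices):
    """Load one element per lane from arbitrary (non-unit-stride) indices."""
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Store one element per lane to arbitrary indices."""
    for i, v in zip(indices, values):
        memory[i] = v


# Scalar loop being vectorized:  if x[i] > 0: y[i] = 2 * x[i]
x = [float(i - 8) for i in range(LANES)]
y = [0.0] * LANES
mask = [v > 0 for v in x]                 # the short branch becomes a mask
y = masked_op(mul, x, [2.0] * LANES, mask, y)

# Strided access: every third element of a table, updated in place.
table = [10 * i for i in range(100)]
indices = [3 * lane for lane in range(LANES)]
values = gather(table, indices)
scatter(table, indices, [v + 1 for v in values])
```

The conditional no longer needs a branch per element: all 16 lanes run the same instruction, and the mask decides which results are kept.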

The VPU also features an Extended Math Unit (EMU) that can ex-

ecute transcendental operations such as reciprocal, square root,

and log, thereby allowing these operations to be executed in a

vector fashion with high bandwidth. The EMU operates by calcu-

lating polynomial approximations of these functions.

The Interconnect

The interconnect is implemented as a bidirectional ring. Each direction comprises three independent rings. The first, largest, and most expensive of these is the data block ring. The data block ring is 64 bytes wide to support the high bandwidth requirement due to the large number of cores. The address ring is much smaller and is used to send read/write commands and memory addresses. Finally, the smallest and least expensive ring is the acknowledgement ring, which sends flow control and coherence messages.

Intel Xeon Phi Coprocessor Core

Vector Processing Unit

Page 130: HPC Compass 2016_17


When a core accesses its L2 cache and misses, an address request

is sent on the address ring to the tag directories. The memory ad-

dresses are uniformly distributed amongst the tag directories

on the ring to provide a smooth traffic characteristic on the ring.

If the requested data block is found in another core’s L2 cache,

a forwarding request is sent to that core’s L2 over the address

ring and the request block is subsequently forwarded on the

data block ring. If the requested data is not found in any caches,

a memory address is sent from the tag directory to the memory controller.


The figure below shows the distribution of the memory con-

trollers on the bidirectional ring. The memory controllers are

symmetrically interleaved around the ring. There is an all-to-all

mapping from the tag directories to the memory controllers. The

addresses are evenly distributed across the memory controllers,

The Interconnect

Distributed Tag Directories


Page 131: HPC Compass 2016_17


thereby eliminating hotspots and providing a uniform access

pattern which is essential for a good bandwidth response.
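How an even address distribution avoids hotspots can be sketched with a simple modulo mapping. The real hash used by the hardware is not described in the text, so the mapping below is only a stand-in. The number of tag directories (64), the number of memory controllers (8) and the 64-byte cache line are assumptions made for this illustration.

```python
from collections import Counter

LINE_BYTES = 64             # assumed cache-line size (matches the ring width)
NUM_TAG_DIRECTORIES = 64    # assumed, not given in the text
NUM_MEMORY_CONTROLLERS = 8  # assumed, not given in the text


def tag_directory(address):
    """Stand-in mapping from a byte address to a tag directory."""
    return (address // LINE_BYTES) % NUM_TAG_DIRECTORIES


def memory_controller(address):
    """Stand-in mapping from a byte address to a memory controller."""
    return (address // LINE_BYTES) % NUM_MEMORY_CONTROLLERS


# Stream through 1 MiB of memory, one cache line at a time.
addresses = range(0, 1 << 20, LINE_BYTES)
td_load = Counter(tag_directory(a) for a in addresses)
mc_load = Counter(memory_controller(a) for a in addresses)

print(set(td_load.values()))  # {256}
print(set(mc_load.values()))  # {2048}
```

Every tag directory and every memory controller receives the same number of requests for this streaming pattern, which is the uniform behaviour the interleaving is meant to provide.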

During a memory access, whenever an L2 cache miss occurs on a

core, the core generates an address request on the address ring

and queries the tag directories. If the data is not found in the tag

directories, the core generates another address request and que-

ries the memory for the data. Once the memory controller fetch-

es the data block from memory, it is returned back to the core

over the data ring. Thus during this process, one data block, two

address requests (and by protocol, two acknowledgement mes-

sages) are transmitted over the rings. Since the data block rings

are the most expensive and are designed to support the required

data bandwidth, it is necessary to increase the number of less ex-

pensive address and acknowledgement rings by a factor of two

to match the increased bandwidth requirement caused by the

higher number of requests on these rings.

Other Design Features

Other micro-architectural optimizations incorporated into the Intel Xeon Phi coprocessor include a 64-entry second-level Translation Lookaside Buffer (TLB), simultaneous data cache loads and stores, and 512 KB L2 caches. Lastly, the Intel Xeon Phi coprocessor implements a 16-stream hardware prefetcher to improve the cache hits and provide higher bandwidth. The figure below shows the net performance improvements for the SPECfp 2006 benchmark suite for single-core, single-thread runs. The results indicate an average improvement of over 80% per cycle, not including frequency.

Per-Core ST Performance Improvement (per cycle)

Interleaved Memory Access

Interconnect: 2x AD/AK

Page 132: HPC Compass 2016_17


Caches

The Intel MIC architecture invests more heavily in L1 and L2 caches compared to GPU architectures. The Intel Xeon Phi coprocessor implements a leading-edge, very high bandwidth memory subsystem. Each core is equipped with a 32 KB L1 instruction cache, a 32 KB L1 data cache, and a 512 KB unified L2 cache. These caches are fully coherent and implement the x86 memory order model. The L1 and L2 caches provide aggregate bandwidths that are approximately 15 and 7 times, respectively, the aggregate memory bandwidth. Hence, effective utilization of the caches is key to achieving peak performance on the Intel Xeon Phi coprocessor. In addition to improving bandwidth, the caches are also more energy efficient for supplying data to the cores than memory. The figure below shows the energy consumed per byte of data transferred from the memory and from the L1 and L2 caches. In the exascale compute era, caches will play a crucial role in achieving real performance under restrictive power constraints.

Stencils

Stencils are common in physics simulations and are classic examples of workloads which show a large performance gain through efficient use of caches.

Caches – For or Against?


Page 133: HPC Compass 2016_17


Stencils are typically employed in simulation of physical sys-

tems to study the behavior of the system over time. When these

workloads are not programmed to be cache-blocked, they will be

bound by memory bandwidth. Cache blocking promises substan-

tial performance gains given the increased bandwidth and ener-

gy efficiency of the caches compared to memory. Cache blocking

improves performance by blocking the physical structure or the

physical system such that the blocked data fits well into a core’s

L1 and/or L2 caches. For example, during each time step, the

same core can process the data which is already resident in the

L2 cache from the last time step, and hence does not need to be

fetched from the memory, thereby improving performance. Addi-

tionally, the cache coherence further aids the stencil operation

by automatically fetching the updated data from the nearest

neighboring blocks which are resident in the L2 caches of other

cores. Thus, stencils clearly demonstrate the benefits of efficient

cache utilization and coherence in HPC workloads.
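The loop structure behind cache blocking can be sketched for a 5-point stencil on a 2D grid. This example only shows how the iteration space is split into tiles and that the result is unchanged. Python itself will not show the cache benefit, the block size of 4 is an arbitrary choice, and the function names are hypothetical.

```python
def update(grid, out, i, j):
    """5-point Jacobi update of one interior cell."""
    out[i][j] = 0.25 * (
        grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
    )


def jacobi_step(grid):
    """One sweep over all interior cells, row by row."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            update(grid, out, i, j)
    return out


def jacobi_step_blocked(grid, block=4):
    """Same sweep, visiting the interior tile by tile.

    Each tile is meant to be small enough that its cells and their
    neighbours stay resident in a core's cache while it is processed.
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for bi in range(1, rows - 1, block):
        for bj in range(1, cols - 1, block):
            for i in range(bi, min(bi + block, rows - 1)):
                for j in range(bj, min(bj + block, cols - 1)):
                    update(grid, out, i, j)
    return out


grid = [[float((3 * i + 7 * j) % 11) for j in range(10)] for i in range(9)]
print(jacobi_step(grid) == jacobi_step_blocked(grid))  # True
```

On a real coprocessor each tile would typically be assigned to one core, so that the data of a tile stays in that core's L2 cache from one time step to the next.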

Power Management

Intel Xeon Phi coprocessors are not suitable for all workloads. In some cases, it is beneficial to run the workloads only on the host. In such situations where the coprocessor is not being used, it is necessary to put the coprocessor in a power-saving mode. To conserve power, as soon as all four threads on a core are halted, the clock to the core is gated.

Once the clock has been gated for some programmable time, the

core power gates itself, thereby eliminating any leakage. At any

point, any number of the cores can be powered down or powered

up as shown in the figure below. Additionally, when all the cores

are power gated and the uncore detects no activity, the tag direc-

tories, the interconnect, L2 caches and the memory controllers

are clock gated.

Clock Gate Core

Power Gate Core

Stencils Example

Page 134: HPC Compass 2016_17


At this point, the host driver can put the coprocessor into a deep-

er sleep or an idle state, wherein all the uncore is power gated,

the GDDR is put into a self-refresh mode, the PCIe logic is put in

a wait state for a wakeup and the GDDR-IO is consuming very lit-

tle power. These power management techniques help conserve

power and make Intel Xeon Phi coprocessor an excellent candi-

date for data centers.

Summary

The Intel Xeon Phi coprocessor provides high performance and performance per watt for highly parallel HPC workloads, while not requiring a new programming model, API, language or restrictive memory model. It is able to do this with an array of general purpose cores with multiple thread contexts, wide vector units, caches, and high-bandwidth on-die and memory interconnect. Knights Corner is the first Intel Xeon Phi product in the MIC architecture family of processors from Intel, aimed at enabling the exascale era of computing.

Package Auto C3

Intel Xeon Phi Coprocessor

An Outlook on Knights Landing

and Omni-Path

Page 135: HPC Compass 2016_17






tech BOX


Knights Landing

Knights Landing is the codename for Intel’s 2nd generation Intel Xeon Phi Product Family, which will deliver massive thread parallelism, data parallelism and memory bandwidth – with improved single-thread performance and Intel Xeon processor binary-compatibility in a standard CPU form factor.

Additionally, Knights Landing will offer integrated Intel Omni-Path fabric technology, and also be available in the traditional PCIe coprocessor form factor.

Performance

3+ TeraFLOPS of double-precision peak theoretical performance per single socket node.

Power Management: All On and Running

Integration

Intel Omni Scale fabric integration

High-performance on-package memory (MCDRAM)

• Over 5x STREAM vs. DDR4 (over 400 GB/s)
• Up to 16 GB at launch
• NUMA support
• Over 5x energy efficiency vs. GDDR5
• Over 3x density vs. GDDR5
• In partnership with Micron Technology
• Flexible memory modes including cache and flat

Server Processor

• Standalone bootable processor (running host OS) and a PCIe coprocessor (PCIe end-point device)
• Platform memory capacity comparable to Intel Xeon processors
• Reliability (“Intel server-class reliability”)
• Power efficiency (over 25% better than discrete coprocessor) with over 10 GF/W
• Density (3+ KNL with fabric in 1U)

Microarchitecture

• Based on Intel’s 14 nanometer manufacturing technology
• Binary compatible with Intel Xeon Processors
• Support for Intel Advanced Vector Extensions 512 (Intel AVX-512)
• 3x Single-Thread Performance compared to Knights Corner
• Cache-coherency
• 60+ cores in a 2D Mesh architecture

Page 136: HPC Compass 2016_17


• 4 Threads / Core
• Deep Out-of-Order Buffers
• Gather/scatter in hardware
• Advanced Branch Prediction
• High cache bandwidth
• Most of today’s parallel optimizations carry forward to KNL
• Multiple NUMA domain support per socket

Roadmap

• Knights Hill is the codename for the 3rd generation of the Intel Xeon Phi product family
• Based on Intel’s 10 nanometer manufacturing technology
• Integrated 2nd generation Intel Omni-Path Fabric

Omni-Path – the Next-Generation Fabric

Intel Omni-Path Architecture is designed to deliver the performance for tomorrow’s high performance computing (HPC) workloads and the ability to scale to tens – and eventually hundreds – of thousands of nodes at a cost-competitive price to today’s fabrics.


This next-generation fabric from Intel builds on the Intel True

Scale Fabric with an end-to-end solution, including PCIe adapters,

silicon, switches, cables, and management software. Customers

deploying Intel True Scale Fabric today will be able to migrate to

Intel Omni-Path Architecture through an Intel upgrade program.

In the near future, Intel will integrate the Intel Omni-Path Host

Fabric Interface onto future generations of Intel Xeon processors

and Intel Xeon Phi processors to address several key challenges

of HPC computing centers:


Page 137: HPC Compass 2016_17


• Performance: Processor capacity and memory bandwidth are scaling faster than system I/O. Intel Omni-Path Architecture will accelerate message passing interface (MPI) rates for tomorrow’s HPC.

• Cost and density: More components on a server limit density and increase fabric cost. An integrated fabric controller helps eliminate the additional costs and required space of discrete cards, enabling higher server density.

• Reliability and power: Discrete interface cards consume many watts of power. Integrated onto the processor, the Intel Omni-Path Host Fabric Interface will draw less power with fewer discrete components.

For software applications, Intel Omni-Path Architecture will

maintain consistency and compatibility with existing Intel True

Scale Fabric and InfiniBand APIs by working through the open

source OpenFabrics Alliance (OFA) software stack. The OFA soft-

ware stack works with leading Linux distribution releases.

Intel is providing a clear path to addressing the issues of tomor-

row’s HPC with Intel Omni-Path Architecture.


Page 138: HPC Compass 2016_17
Page 139: HPC Compass 2016_17


Vector Supercomputing – The NEC SX Architecture

Initially, in the 80s and 90s, the notion of supercomputing was almost a synonym for

vector-computing. And this concept is back now in different flavors. Now almost every

architecture is using SIMD, the acronym for “single instruction multiple data”.

Life Sciences | Risk Analysis | Simulation | Big Data Analytics | High Performance Computing

Page 140: HPC Compass 2016_17


History of HPC, and what we learn

During the last years systems based on the x86 architecture dominated the HPC market, with GPUs and many-core systems recently making inroads. But this period is coming to an end, and it is again time for a differentiated, complementary HPC-targeted system. NEC will develop such a system.

The Early Days

When did HPC really start? One would probably think of the time when Seymour Cray developed high-end machines at CDC and then started his own company, Cray Research, whose first product was the Cray-1 in 1976. At that time supercomputing was very special: systems were expensive, and access was extremely limited for researchers and students. But the business case was obvious in many fields, scientifically as well as commercially.

These Cray machines were vector computers, initially featuring only one CPU, as parallelization was far from mainstream and hardly even a matter of research. Code development was easier than today: a code of 20,000 lines was already considered “complex” at that time, and not many standard codes and algorithms were available anyway. For this reason code development could adapt to any kind of architecture to achieve optimal performance.

In 1981 NEC started marketing its first vector supercomputer, the SX-2, a single-CPU vector architecture like the Cray-1. The SX-2 was a milestone: it was the first CPU to exceed one GFlops of peak performance.


The Attack of the “Killer-Micros”

In the 1990s the situation changed drastically, and what we see

today is a direct consequence. The “killer-micros” took over, new

parallel, later “massively parallel” systems (MPP) were built based

on micro-processor architectures which were predominantly


Page 141: HPC Compass 2016_17


targeted at commercial sectors, or in the case of Intel, AMD or Motorola at PCs, products which were successfully invading daily life, in part more than even experts had anticipated. Clearly one could save a lot of cost by using these CPUs for high-end systems, at worst with slight adaptations, and by focusing investments on other ingredients such as interconnect technology. At the same time Linux was increasingly adopted and offered a cheap, generally available OS platform that allowed everybody to use HPC at a reasonable price.

These systems were cheaper for two reasons: the development of standard CPUs was sponsored by other businesses, and another expensive part, the memory architecture with many CPUs addressing the same flat memory space, was replaced by systems composed of many distinct so-called nodes, each a complete system by itself, plus some kind of interconnect technology, which could in part also be taken from standard equipment.

The impact on the user was the necessity to code for these “distributed memory systems”. Standards for coding evolved, first PVM and later MPI, plus some proprietary paradigms. MPI still dominates today.

In the high-end the special systems were still dominating, be-

cause they had architectural advantages and the codes were run-

ning with very good efficiency on such systems. Most codes were

not parallel at all or at least not well parallelized, let alone on

systems with distributed memory. The skills and experiences to

generate parallel versions of codes were lacking in those compa-

nies selling high-end-machines and also on the customers’ side.

So it took some time for the MPPs to take over. But it was inevi-

table, because this situation, sometimes called “democratization

of HPC”, was positive for both the commercial and the scientific

developments. From that time on researchers and students in

academia and industry got sufficient access to resources they

could not have dreamt of before.

The x86 Dominance

One could dispute whether the x86 architecture is the most

advanced or elegant, undoubtedly it became the dominant

one. It was developed for mass-markets, this was the strate-

gy, and HPC was a side-effect, sponsored by the mass-market.

This worked successfully and the proportion of x86 Linux clus-

ters in the Top500 list increased year by year. Because of the

mass-markets it made sense economically to invest a lot into

the development of LSI technology (large-scale integration).

The related progress, which led to more and more transistors

on the chips and increasing frequencies, helped HPC applica-

tions, which got more, faster and cheaper systems with each

generation. Software did not have to adapt a lot between gen-

erations and the software developments rather focused initial-

ly on the parallelization for distributed memory, later on new

functionalities and features. This worked for quite some years.

Systems of sufficient size to produce relevant results for typical

applications are cheap, and most codes do not really scale to a

level of parallelism to use thousands of processes. Some light-

house projects do, after a lot of development which is not easily

affordable, but the average researcher or student is often com-

pletely satisfied with something between a few and perhaps a

few hundred nodes of a standard Linux cluster.

In the meantime a typical Intel-based dual-socket node of a standard cluster already has 24 cores, soon a bit more. So parallelization is increasingly important. There is “Moore’s law”, formulated by Gordon E. Moore, a co-founder of the Intel Corporation, in 1965. Originally it stated that the number of transistors in an integrated circuit doubles about every year; in 1975 Moore revised this to about every two years. In either case the result is an exponential increase. This statement is often misquoted as saying that the performance of a CPU doubles every 18 months, which is misleading. We observe ever increasing core counts, the transistors are just there, so why not? That applications might not be able to

Page 142: HPC Compass 2016_17


use these many cores is another question, in principle the poten-

tial performance is there.
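The exponential growth behind Moore’s law can be made concrete with a little arithmetic. The following is a minimal sketch in Python; the function name and the sample doubling periods are chosen here purely for illustration and are not taken from any roadmap.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Return the factor by which a quantity grows in `years`
    if it doubles once every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Illustrative doubling periods (in years); not taken from any roadmap.
    for period in (1.0, 1.5, 2.0):
        factor = growth_factor(10.0, period)
        print(f"doubling every {period} years: x{factor:.0f} after 10 years")
```

The sketch only describes transistor counts. Whether an application can turn the additional transistors into performance is a separate question, as discussed above.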

The “standard benchmark” in the HPC-scene is the so-called HPL,

High Performance LINPACK, and the famous Top500 list is built

based on this. This HPL benchmark is only one of three parts of

the original LINPACK benchmark, this is often forgotten! But the

other parts did not show such a good progress in performance,

their scalability is limited, so they could not serve to spawn com-

petition even between nations, which leads to a lot of prestige.

Moreover there are at most a few real codes in the world with

similar characteristics as the HPL. In other words: it does not tell

a lot. This is a highly political situation and probably quite some

money is wasted which would better be invested in brain-ware.

Anyway it is increasingly doubtful that LSI technology will advance at the same pace in the future. The lattice constant of silicon is 5.43 Å. LSI technology is now using 14 nm processes (the Intel Broadwell generation), which is about 26 times the lattice constant, with 7 nm LSIs on the long-term horizon. Some scientists assume there is still some room to improve even more, but once the features on the LSI have a thickness of only a few atoms one cannot get another factor of ten. Sometimes the impact of some future “disruptive technology” is claimed …
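The comparison between process size and the silicon lattice constant is simple arithmetic. The sketch below assumes, as the text does, that the process name can be read as a feature size; the function name is hypothetical.

```python
SILICON_LATTICE_CONSTANT_NM = 0.543  # 5.43 Angstrom, the value quoted in the text


def lattice_constants_per_feature(feature_size_nm: float) -> float:
    """How many silicon lattice constants fit into one feature of the given size."""
    return feature_size_nm / SILICON_LATTICE_CONSTANT_NM


if __name__ == "__main__":
    for node_nm in (14.0, 7.0):
        ratio = lattice_constants_per_feature(node_nm)
        print(f"{node_nm} nm is about {ratio:.1f} lattice constants")
```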

Another obstacle is the cost to build a chip fab for a new genera-

tion of LSI technology. Certainly there are ways to optimize both

production and cost, and there will be bright ideas, but with cur-

rent generations a fab costs at least many billion $. There have

been ideas of slowing down the innovation cycle on that side to

merely be able to recover the investment.

Memory Bandwidth

For many HPC applications it is not the compute power of the


Page 143: HPC Compass 2016_17


CPU which counts, but the speed with which the CPU can exchange data with the memory, the so-called memory bandwidth. If it is neglected, the CPUs will wait for data and performance will decrease. To describe the balance between CPU performance and memory bandwidth one often uses the ratio between the number of bytes that can be transferred and the number of floating-point operations that can be executed in the same time, the “byte-per-flop ratio”. Traditionally vector machines were always great in this regard, and as far as current plans are known this is set to continue. This is the reason why a vector machine can realize a higher fraction of its theoretical performance on a real application than a scalar machine.

Power Consumption

One other but really overwhelming problem is the increasing

power consumption of systems. Looking at different genera-

tions of Linux clusters one can assume that a dual-socket-node,

depending on how it is equipped with additional features, will

consume between 350 Watt and 400 Watt.

This only slightly varies between different generations, perhaps it

will slightly increase. Even in Europe sites are talking about elec-

tricity budgets in excess of 1 MWatt, even beyond 10 MWatts. One

also has to cool the system, which adds something like 10-30%

to the electricity cost! The cost to run such a system over 5 years

will be in the same order of magnitude as the purchase price!
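To get a feeling for the running cost, the figures above can be put into a small calculation. The following Python sketch is an illustration only: the cluster size, the electricity price and the function name are assumptions made here, while the node power and the cooling overhead lie within the ranges quoted in the text.

```python
HOURS_PER_YEAR = 24 * 365


def electricity_cost(node_watts: float, nodes: int, years: float,
                     price_per_kwh: float, cooling_overhead: float) -> float:
    """Electricity cost of running `nodes` nodes continuously for `years` years.

    `cooling_overhead` is the extra fraction spent on cooling (0.2 means +20%).
    """
    energy_kwh = node_watts / 1000.0 * nodes * HOURS_PER_YEAR * years
    return energy_kwh * (1.0 + cooling_overhead) * price_per_kwh


if __name__ == "__main__":
    # Hypothetical cluster: 1000 nodes at 400 W, 20% cooling overhead,
    # 0.15 currency units per kWh, operated for 5 years.
    cost = electricity_cost(400.0, 1000, 5.0, 0.15, 0.20)
    print(f"electricity over 5 years: {cost:,.0f}")
```

Comparing the result with the purchase price of a cluster of that size shows why the operating cost can no longer be treated as a side issue.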

Power consumption grows with the frequency of the CPU, which is the dominant reason why frequencies are stagnating, if not decreasing, between product generations. There are more cores, but the individual cores of an architecture are getting slower, not faster! This has to be compensated on the application side by increased scalability, which is not always easy to achieve. In the past users easily got higher performance for almost every application with each new generation; this is no longer the case.

Architecture Development

So the basic facts clearly indicate limitations of LSI technology, and they have been known for quite some time. Consequently this leads to attempts to get additional performance by other means. And the developments in the market clearly indicate that high-end users are willing to consider the adaptation of codes.

To utilize available transistors CPUs with more complex func-

tionality are designed, utilizing SIMD instructions or providing

more operations per cycle than just an add and a multiply. SIMD

is “Single Instruction Multiple Data”. Initially at Intel the SIMD-in-

structions provided two results in 64-bit-precision per cycle rath-

er than one, in the meantime with AVX it is even four, on Xeon Phi

already eight.

This makes a lot of sense. The electricity consumed by the “administration” of the operations (code fetch and decode, configuration of paths on the CPU etc.) is a significant portion of the power consumption. If this energy is spent not for one operation only but for two or even four, the energy per operation is obviously reduced.


In essence and in order to overcome the performance-bottleneck

the architectures developed into the direction which to a larger

extent was already realized in the vector-supercomputers dec-

ades ago!
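The energy argument for SIMD can be written as a one-line formula: the “administration” cost is paid once per instruction, the arithmetic cost once per operation. The Python sketch below uses arbitrary units and made-up cost values; it is a simplified model for illustration, not a measurement.

```python
def energy_per_operation(admin_energy: float, op_energy: float, simd_width: int) -> float:
    """Energy per arithmetic operation when one instruction performs
    `simd_width` operations and pays the "administration" cost only once."""
    return (admin_energy + simd_width * op_energy) / simd_width


if __name__ == "__main__":
    # Arbitrary units; both costs are set to 1.0 purely for illustration.
    for width in (1, 2, 4, 8):
        print(width, energy_per_operation(1.0, 1.0, width))
```

In this model the energy per operation approaches the pure arithmetic cost as the width grows, which is the effect long vector instructions exploit.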

There is almost no architecture left without such features. But SIMD operations require application code to be vectorized. Previously, code development and programming paradigms did not care about fine-grain data parallelism, which is the basis for the applicability of SIMD. Thus vector machines could not show their strength: on scalar code they cannot beat a scalar CPU. But strictly speaking there is no scalar CPU left; standard x86 systems do not achieve their optimal performance on purely scalar code either!

Page 144: HPC Compass 2016_17


Alternative Architectures?

During recent years GPUs and many-core architectures are in-

creasingly used in the HPC market. Although these systems are

somewhat diverse in detail, in any case the degree of parallelism

needed is much higher than in standard CPUs. This points into

the same direction, a low-level parallelism is used to address the

bottlenecks. In the case of Nvidia GPUs the individual so-called

CUDA cores, relatively simple entities, are steered by a more

complex central processor, realizing a high level of SIMD paral-

lelism. In the case of Xeon Phi the individual cores are also less

powerful, but they are used in parallel using coding paradigms

which operate on the level of loop nests, an approach which was

already used on vector supercomputers during the 1980s!

So these alternative architectures not only go in the same direction, the utilization of fine-grain parallelism, they also indicate that coding styles which suit real vector machines will dominate future software developments. Note that at least some of these systems have an economic backing by mass markets: gaming!


Page 145: HPC Compass 2016_17






tech BOX


History of the SX Series

As already stated NEC started marketing the first commercially available generation of this series, the SX-2, in 1981. By now the latest generation is the SX-ACE.

In 1998 Dr. Tadashi Watanabe, the developer of the SX series and the major driver of NEC’s HPC business, received the Eckert-Mauchly Award, a renowned US award, for these achievements, and in 2006 also the “Seymour Cray Award”. Prof. Watanabe was a very pragmatic developer, a knowledgeable scientist and a visionary manager, all at the same time, and a great person to talk to. From the beginning the SX architecture was different from the competing Cray vector machines: it already integrated some of the features that RISC CPUs were using at that time, but combined them with vector registers and pipelines.

For example the SX could reorder instructions in hardware on

the fly, which Cray machines could not do. In effect this made

the system much easier to use. The single processor SX-2 was the

fastest CPU at that time, the first one to break the 1 GFlops bar-

rier. Naturally the compiler technology for automatic vectoriza-

tion and optimization was important.

With the SX-3 NEC developed a high-end shared-memory system.

It had up to four very strong CPUs. The resolution of memory ac-

cess conflicts by CPUs working in parallel, and the consistency of

data throughout the whole shared-memory system are non-triv-

ial, and the development of such a memory architecture was a

major breakthrough.

With the SX-4 NEC implemented a high-end system based on CMOS.

It was the direct competitor to the Cray T90, with a huge advantage

with regard to price and power consumption. The T90 was using

ECL technology and achieved 1.8 GFlops peak performance. The

SX-4 had a lower frequency because of CMOS, but because of the

higher density of the LSIs it featured a higher level of parallelism

on the CPU, leading to 2 GFlops per CPU with up to 32 CPUs.

The memory architecture to support those was a major leap

ahead. On this system one could typically reach an efficiency

of 20-40% of peak performance measured throughout complete

applications, far from what is achieved on standard components

today. With this system NEC also implemented the first genera-

tion of its own proprietary interconnect, the so-called IXS, which

Page 146: HPC Compass 2016_17


allowed connecting several of these strong nodes into one highly

powerful system. The IXS featured advanced functions like a

global barrier and some hardware fault-tolerance, things which

other companies tackled years later.

With the SX-5 this direction was continued; it was a very successful system pushing the limits of CMOS-based CPUs and multi-node configurations. At that time the efforts on the users’ side to parallelize their codes using “message passing”, then PVM and MPI, slowly started to surface, so the need for a huge shared-memory system slowly diminished. Still there were quite a few important codes in the scene which were parallel only in a shared-memory paradigm: initially “microtasking”, which was vendor-specific, then “autotasking” (NEC and some other vendors could parallelize code automatically), and later the standard OpenMP. So there was still sufficient need for a huge and highly capable shared-memory system.

But later it became clear that more and more codes could utilize

distributed memory systems, and moreover the memory-archi-

tecture of huge shared memory systems became increasingly

complex and that way costly. A memory is made of “banks”, in-

dividual pieces of memory which are used in parallel. In order

to provide the data at the necessary speed for a CPU a certain

minimum amount of banks per CPU is required, because a bank

needs some cycles to recover after every access. Consequently

the number of banks had to scale with the number of CPUs, so

that the complexity of the network connecting CPUs and banks

was growing with the square of the number of CPUs. And if the

CPUs became stronger they needed more banks, which made it

even worse. So this concept became too expensive.
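The quadratic growth mentioned above follows directly if one assumes a full crossbar in which every CPU can reach every bank. The Python sketch below is a simplified model under that assumption; the value of banks per CPU is invented for illustration and does not describe a particular SX generation.

```python
def crossbar_connections(cpus: int, banks_per_cpu: int) -> int:
    """Number of CPU-to-bank paths in a full crossbar where every CPU
    can reach every bank and the bank count scales with the CPU count."""
    banks = cpus * banks_per_cpu
    return cpus * banks


if __name__ == "__main__":
    # banks_per_cpu = 64 is a made-up value for illustration.
    for cpus in (2, 4, 8, 16):
        print(cpus, crossbar_connections(cpus, 64))
```

Doubling the number of CPUs quadruples the number of paths, and a stronger CPU that needs more banks multiplies the result once more.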


Page 147: HPC Compass 2016_17


At the same time NEC had a lot of experience in mixing MPI and OpenMP parallelization in applications, and it turned out that for the very fast CPUs the shared-memory parallelization was typically not really efficient for more than four or eight threads. So the concept was adapted accordingly, which led to the SX-6 and the famous Earth Simulator, which kept the #1 spot of the Top500 for some years.

The Earth Simulator was a different machine, but based

on the same technology level and architecture as the SX-6.

These systems were the first implementation ever of a vector CPU on one single chip. The CPU by itself was not the challenge, but rather the interface to the memory needed to support the bandwidth for this chip, basically the “pins”. This kind of packaging technology was always a strong point in NEC’s technology portfolio.

The SX-7 was a somewhat special machine, it was based on the SX-6

technology level, but featured 32 CPUs on one shared memory.

The SX-8 then was a new implementation again with some ad-

vanced features which directly addressed the necessities of cer-

tain code-structures frequently encountered in applications. The

communication between the world-wide organization of appli-

cation-analysts and the hard- and software developers in Japan

had proven very fruitful, the SX-8 was very successful in the market.

With the SX-9 the number of CPUs per node was pushed up again to 16 because of certain demands in the market. Just as an indication of the complexity of the memory architecture: the system had 32768 banks! The peak performance of one node was 1.6 TFlops.

Keep in mind that the efficiency achieved on real applications

was still in the range of 10% in bad cases, up to 20% or even more,

depending on the code. One should not compare this peak perfor-

mance with some standard chip today, this would be misleading.

The Current Product: SX-ACE

In November 2014 NEC announced the next generation of the SX vector product, the SX-ACE. Some systems are already installed and accepted, also in Europe.

The main design target was the reduction of power consumption per sustained performance, and measurements have proven this has been achieved. To this end, and in line with the observation that shared-memory parallelization with vector CPUs is useful on four cores, perhaps eight, but normally not beyond that, the individual node consists of a single-chip four-core processor, which integrates the memory controllers and the interface to the communication interconnect, a next generation of the proprietary IXS.

That way a node is a small board, leading to a very compact design. The complicated network between the memory and the CPUs, which took about 70% of the power consumption of an SX-9, was eliminated. And naturally a new generation of LSI and PCB technology also leads to reductions.

With a memory bandwidth of 256 GB/s this machine realizes one “byte per flop”, a measure of the balance between the pure compute power and the memory bandwidth of a chip. In comparison, typical standard processors nowadays have on the order of one eighth of this!
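The byte-per-flop balance is simply memory bandwidth divided by peak floating-point rate. In the Python sketch below the 256 GB/s figure comes from the text, and the peak rate of 256 GFlops is inferred here from the stated balance of one byte per flop. The “standard processor” values are hypothetical and were chosen only to reproduce the one-eighth ratio mentioned above.

```python
def bytes_per_flop(memory_bandwidth_gb_s: float, peak_gflops: float) -> float:
    """Balance between memory bandwidth and peak compute rate."""
    return memory_bandwidth_gb_s / peak_gflops


if __name__ == "__main__":
    # SX-ACE: bandwidth from the text, peak rate inferred from "one byte per flop".
    print("vector node:", bytes_per_flop(256.0, 256.0))
    # Hypothetical standard processor, values invented for illustration.
    print("standard processor:", bytes_per_flop(64.0, 512.0))
```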

Page 148: HPC Compass 2016_17


Memory bandwidth is increasingly a problem; this is known in the market as the “memory wall” and has been a hot topic for more than a decade. Standard components are built for a different market, where memory bandwidth is not as important as it is for HPC applications.

Two of these node boards are put into a module, eight such modules are combined in one cage, and four of the cages, i.e. 64 nodes, are contained in one rack. The standard configuration of 512 nodes consists of eight racks.
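The packaging figures can be checked with a few lines of arithmetic. The constants in this Python sketch are taken from the description above; the function names are chosen here for illustration.

```python
NODES_PER_MODULE = 2   # two node boards per module
MODULES_PER_CAGE = 8   # eight modules per cage
CAGES_PER_RACK = 4     # four cages per rack


def nodes_per_rack() -> int:
    """Number of nodes in one fully populated rack."""
    return NODES_PER_MODULE * MODULES_PER_CAGE * CAGES_PER_RACK


def racks_needed(nodes: int) -> int:
    """Smallest number of racks that can hold `nodes` nodes (rounded up)."""
    return -(-nodes // nodes_per_rack())


if __name__ == "__main__":
    print("nodes per rack:", nodes_per_rack())
    print("racks for 512 nodes:", racks_needed(512))
```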


Page 149: HPC Compass 2016_17


NEC already installed several SX-ACE systems in Europe, and

more installations are ongoing. The advantages of the system,

the superior memory bandwidth, the strong CPU with good effi-

ciency, will provide the scientists with performance levels on cer-

tain codes which otherwise could not be achieved. Moreover the

system has proven to be superior in terms of power consumption

per sustained application performance, and this is a real hot top-

ic nowadays.

Page 150: HPC Compass 2016_17






tech BOX

LSI technologies have improved drastically during the last two decades, but the ever increasing power consumption of systems now poses serious limitations. Tasks like fetching instructions, decoding them or configuring paths on the chip consume energy, so it makes sense to execute these activities not for one operation only, but for a set of such operations. This is meant by SIMD.

Data Parallelism

The underlying structure to enable the usage of SIMD instructions is called “data parallelism”. For example the following loop is data-parallel: the individual loop iterations could be executed in any arbitrary order, and the results would stay the same:

do i = 1, n
  a(i) = b(i) + c(i)
end do

The complete independence of the loop iterations allows executing

them in parallel, and this is mapped to hardware with SIMD support.

Naturally one could also use other means to parallelize such a

loop, e.g. for a very long loop one could execute chunks of loop

iterations on different processors of a shared-memory machine

using OpenMP. But this will cause overhead, so it will not be effi-

cient for small trip counts, while a SIMD feature in hardware can

deal with such a fine granularity.
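The order independence of a data-parallel loop such as a(i) = b(i) + c(i) can be demonstrated directly. The following Python sketch is an illustration only (the function name is hypothetical): it executes the same element-wise addition with the iterations visited in different orders and compares the results.

```python
def add_in_order(b, c, order):
    """Compute a[i] = b[i] + c[i], visiting the indices in the given order."""
    a = [None] * len(b)
    for i in order:
        a[i] = b[i] + c[i]
    return a


if __name__ == "__main__":
    b = [1.0, 2.0, 3.0, 4.0]
    c = [10.0, 20.0, 30.0, 40.0]
    forward = add_in_order(b, c, range(4))
    backward = add_in_order(b, c, reversed(range(4)))
    scrambled = add_in_order(b, c, [2, 0, 3, 1])
    # All three orders produce the same array.
    print(forward == backward == scrambled)
```

Because no iteration reads a value written by another iteration, the hardware is free to process several of them at once.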

As opposed to the example above the following loop, a recursion, is not data parallel:

Page 151: HPC Compass 2016_17

do i = 1, n-1
  a(i+1) = a(i) + c(i)
end do

Obviously a change of the order of the loop iterations, for example the special case of starting with a loop-index of n-1 and decreasing it afterwards, will lead to different results. Therefore these iterations cannot be executed in parallel.

Pipelining

The following description is very much simplified, but explains the basic principles: Operations like add and multiply consist of so-called segments, which are executed one after the other in a so-called pipeline. While each segment acts on the data for some kind of intermediate step, the final result is available only after the execution of the whole pipeline.

It is similar to an assembly line in the automotive industry. Now, the segments do not need to deal with the same operation at a given point in time; they could all be kept active if sufficient inputs are fed into the pipeline. Again, in the automotive industry the different stages of the assembly line are working on different cars, and nobody would wait for a car to be completed before starting the next one in the first stage of the production, that way keeping almost all stages inactive. Such a situation would be represented by the following picture:

The time or number of cycles extends to the right-hand side, assuming five stages (vertical direction); the red squares indicate which stage is busy at the given cycle. Obviously this is not efficient, because white squares indicate transistors which consume power without actively producing results. And the compute performance is not optimal: it takes five cycles to compute one result.

Similarly to the assembly line in the automotive industry this is what we want to achieve:

After the “latency”, five cycles, just the length of the pipeline, we get a result every cycle. Obviously it is easy to feed the pipeline with sufficient independent input data in the case of different loop iterations of a data-parallel loop.

If this is not possible one could also use different operations within the execution of a complicated loop body within one

Page 152: HPC Compass 2016_17


given loop iteration. But typically operations in such a loop body

need results from previous operations of the same iteration, so

one would rather arrive at a utilization of the pipeline like this:

Some operations can be calculated almost in parallel, others

have to wait for previous calculations to finish because they

need input. In total the utilization of the pipeline is better but

not optimal.
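The benefit of keeping a pipeline full can be expressed in cycle counts. The Python sketch below is a simplified model assuming one cycle per stage and no stalls; the function names are hypothetical. Without overlap every operation occupies the whole pipeline, while with a full pipeline the first result appears after the latency and one more result follows in every cycle.

```python
def cycles_unpipelined(n_ops: int, stages: int) -> int:
    """Each operation passes through all stages before the next one starts."""
    return n_ops * stages


def cycles_pipelined(n_ops: int, stages: int) -> int:
    """A new operation enters every cycle; the first result needs `stages` cycles."""
    if n_ops == 0:
        return 0
    return stages + n_ops - 1


if __name__ == "__main__":
    n, stages = 100, 5  # five stages as in the description above
    print("unpipelined:", cycles_unpipelined(n, stages))
    print("pipelined:  ", cycles_pipelined(n, stages))
```

For long loops the pipelined count approaches one cycle per result, so the speed-up tends towards the number of stages.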

This was the situation before the trend to SIMD computing was

revitalized. Before that the advances of LSI technology always

provided an increased performance with every new generation

of product. But then the race for increasing frequency stopped,

and now architectural enhancements are important again.

Single Instruction Multiple Data (SIMD)

Hardware realizations of SIMD have been present since the first Cray-1, but this was a vector architecture. Nowadays the instruction sets SSE and AVX in their different generations and flavors use it, and similarly, for example, CUDA coding is also a way to utilize data parallelism.

The “width” of the SIMD-instruction, the amount of loop itera-

tions to execute in parallel by one instruction, is kept relatively

small in the case of SSE (two for double precision) and AVX (four

for double precision). The implementations of SIMD in stand-

ard “scalar” processors typically involve add-and-multiply units

which can produce several results per clock-cycle. They also fea-

ture registers of the same “width” to keep several input parame-

ters or several results.
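The effect of the SIMD “width” can be sketched by processing a data-parallel loop in chunks, where each chunk stands for one instruction. The Python code below is a conceptual model only, since real SIMD execution happens in hardware registers; the function names are chosen here for illustration.

```python
def simd_instruction_count(n_iterations: int, width: int) -> int:
    """Number of SIMD instructions needed for n_iterations (rounded up)."""
    return -(-n_iterations // width)


def simd_add(b, c, width):
    """Element-wise addition, handling up to `width` iterations per step."""
    a = []
    for start in range(0, len(b), width):
        # One step models one SIMD instruction acting on a chunk of the data.
        chunk_b = b[start:start + width]
        chunk_c = c[start:start + width]
        a.extend(x + y for x, y in zip(chunk_b, chunk_c))
    return a


if __name__ == "__main__":
    for name, width in (("SSE", 2), ("AVX", 4)):
        print(name, simd_instruction_count(1000, width), "instructions for 1000 iterations")
```

The last chunk may be shorter than the width; compilers usually handle such a remainder with masked or scalar code.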


Page 153: HPC Compass 2016_17


In a way the pipelines are “doubled” for SSE, but both are activated by the same single instruction:

Vector Computing

The difference between those SIMD architectures and SIMD with a real vector architecture like the NEC SX-ACE, the only remaining vector system, lies in the existence of vector registers.

In order to provide sufficient input to the functional units the vector registers have proven to be very efficient. In a way they are a very natural match to the idea of a pipeline – or assembly line. Again thinking about the automotive industry, there one would have a stock of components to constantly feed the assembly line. The following viewgraph represents this situation:

When an instruction is issued it takes some cycles for the first result to appear, but after that one would get a new result every cycle, until the vector register, which has a certain length, is completely processed. Naturally this can only work if the loop iterations are independent, that is, if the loop is data parallel. It would not work if one computation needed a result of another computation as input while this is still being processed.

In fact the SX vector machines always combined both ideas, the vector register and parallel functional units, so-called “pipe-sets”, four in this picture:

There is another advantage of the vector architecture in the SX, and this proves to be very important. The scalar portion of the CPU is acting independently, and while the pipelines are kept busy with input from the vector registers, the inevitable scalar instructions which come with every loop can be executed. This involves checking of loop counters, updating of address pointers and the like. That way these instructions essentially do not consume time. The number of cycles available to “hide” those is the length of the vector registers divided by the number of pipe sets.

In fact the white space in the picture above can partially be hidden as well, because new vector instructions can be issued in advance to immediately take over once the vector register with the input data for the previous instruction is completely processed. The instruction-issue unit has the complete status information of all available resources, specifically pipes and vector registers.

Page 154: HPC Compass 2016_17
Page 155: HPC Compass 2016_17

Parallel File Systems – The New Standard for HPC Storage

HPC computation results in the terabyte range are not uncommon. The problem in this context is not so much storing the data at rest, but the performance of the necessary copying back and forth in the course of the computation job flow and the dependent job turn-around time.

For interim results during a job runtime or for fast storage of input and results data,

parallel file systems have established themselves as the standard to meet the ever-

increasing performance requirements of HPC storage systems.

Engineering | Life Sciences | Automotive | Price Modelling | Aerospace | CAE | Data Analytics

Page 156: HPC Compass 2016_17


Parallel NFS

Yesterday’s Solution: NFS for HPC Storage

The original Network File System (NFS), developed by Sun Microsystems

in the mid-eighties – now available in version 4.1

– has been established for a long time as a de-facto standard for

the provisioning of a global namespace in networked computing.

A very widespread HPC cluster solution includes a central master

node acting simultaneously as an NFS server, with its local file

system storing input, interim and results data and exporting

them to all other cluster nodes.

There is of course an immediate bottleneck in this method: When

the load of the network is high, or where there are large num-

bers of nodes, the NFS server can no longer keep up delivering or

receiving the data. In high-performance computing especially,

the nodes are interconnected at least once via Gigabit Ethernet,

so the sum total throughput is well above what an NFS server

with a Gigabit interface can achieve. Even a powerful network

connection of the NFS server to the cluster, for example with

10-Gigabit Ethernet, is only a temporary solution to this prob-

lem until the next cluster upgrade. The fundamental problem re-

mains – this solution is not scalable; in addition, NFS is a difficult

protocol to cluster in terms of load balancing: either you have

to ensure that multiple NFS servers accessing the same data are

constantly synchronised, the disadvantage being a noticeable


Figure: A classical NFS server is a bottleneck (cluster nodes = NFS clients, NAS head = NFS server)

Page 157: HPC Compass 2016_17


drop in performance, or you manually partition the global namespace,

which is also time-consuming. NFS is not suitable for dynamic

load balancing: on paper it appears to be stateless, but

in reality it is stateful.

Today’s Solution: Parallel File Systems

For some time, powerful commercial products have been

available to meet the high demands on an HPC storage system.

The open-source solutions BeeGFS, from the Fraunhofer Competence

Center for High Performance Computing, and Lustre are

widely used in the Linux HPC world, and several other free

as well as commercial parallel file system solutions exist.

What is new is that the time-honoured NFS is to be upgraded,

including a parallel version, into an Internet Standard with

the aim of interoperability between all operating systems.

The original problem statement for parallel NFS access was

written by Garth Gibson, a professor at Carnegie Mellon Uni-

versity and founder and CTO of Panasas. Gibson was already a

renowned figure being one of the authors contributing to the

original paper on RAID architecture from 1988.

The original statement from Gibson and Panasas is clearly

noticeable in the design of pNFS. The powerful HPC file sys-

tem developed by Gibson and Panasas, ActiveScale PanFS,

with object-based storage devices functioning as central

components, is basically the commercial continuation of the

“Network-Attached Secure Disk (NASD)” project also developed

by Garth Gibson at the Carnegie Mellon University.

Parallel NFS

Parallel NFS (pNFS) is gradually emerging as the future standard

to meet requirements in the HPC environment. From the indus-

try’s as well as the user’s perspective, the benefits of utilising

standard solutions are indisputable: besides protecting end

user investment, standards also ensure a defined level of inter-

operability without restricting the choice of products availa-

ble. As a result, less user and administrator training is required

which leads to simpler deployment and at the same time, a

greater acceptance.

As part of the NFS 4.1 Internet Standard, pNFS will not only

adopt the semantics of NFS in terms of cache consistency or

security, it also represents an easy and flexible extension of

the NFS 4 protocol. pNFS is optional, in other words, NFS 4.1

implementations do not have to include pNFS as a feature.

NFS 4.1 is published as IETF RFC 5661.

The pNFS protocol supports a separation of metadata and data:

a pNFS cluster comprises so-called storage devices which store

the data from the shared file system and a metadata server

(MDS), called Director Blade with Panasas – the actual NFS 4.1

server. The metadata server keeps track of which data is stored

on which storage devices and how to access the files, the so-

called layout. Besides these “striping parameters”, the MDS

also manages other metadata including access rights or similar,

which is usually stored in a file’s inode.
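To illustrate how a client could use such a layout, the following minimal Python sketch maps a byte offset within a file to a storage device and an offset on that device. It assumes plain round-robin striping; the names `Layout`, `stripe_unit` and `locate` are hypothetical and not taken from RFC 5661, and real layouts carry more information (file handles, device identifiers and so on).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Layout:
    """Hypothetical, simplified layout as handed out by a metadata server."""
    stripe_unit: int      # bytes placed on one device before moving on
    devices: List[str]    # storage devices in round-robin order


def locate(layout: Layout, offset: int) -> Tuple[str, int]:
    """Map a file offset to (device, offset within that device's data)."""
    stripe_no, within = divmod(offset, layout.stripe_unit)
    device = layout.devices[stripe_no % len(layout.devices)]
    # Each full pass over all devices adds one stripe unit per device.
    full_passes = stripe_no // len(layout.devices)
    return device, full_passes * layout.stripe_unit + within


layout = Layout(stripe_unit=64 * 1024, devices=["sd0", "sd1", "sd2", "sd3"])
print(locate(layout, 0))               # first stripe unit lives on sd0
print(locate(layout, 5 * 64 * 1024))   # sixth stripe unit wraps around to sd1
```

Because the mapping is pure arithmetic on a static layout, every client can compute it locally without asking the metadata server again.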

The layout types define which Storage Access Protocol is used

by the clients to access the storage devices. Up until now, three

potential storage access protocols have been defined for pNFS:

file, block and object-based layouts, the first being described

directly in RFC 5661, the latter two in RFC 5663 and RFC 5664, respectively.

Last but not least, a Control Protocol is also used by the MDS

and storage devices to synchronise status data. This protocol is

deliberately unspecified in the standard to give manufacturers

certain flexibility. The NFS 4.1 standard does however specify cer-

tain conditions which a control protocol has to fulfil, for exam-

ple, how to deal with the change/modify time attributes of files.

pNFS supports backwards compatibility with non-pNFS compat-

ible NFS 4 clients. In this case, the MDS itself gathers data from

the storage devices on behalf of the NFS client and presents the

data to the NFS client via NFS 4. The MDS acts as a kind of proxy

server – which is e.g. what the Director Blades from Panasas do.

Page 158: HPC Compass 2016_17






tech BOX


What’s New in NFS 4.1?

NFS 4.1 is a minor update to NFS 4, and adds new features to it.

One of the optional features is parallel NFS (pNFS) but there is

other new functionality as well.

One of the technical enhancements is the use of sessions, a

persistent server object, dynamically created by the client. By

means of sessions, the state of an NFS connection can be stored,

no matter whether the connection is live or not. Sessions sur-

vive temporary downtimes both of the client and the server.

Each session has a so-called fore channel, which is the connec-

tion from the client to the server for all RPC operations, and op-

tionally a back channel for RPC callbacks from the server that

now can also be realized through firewall boundaries. Sessions

can be trunked to increase the bandwidth. Besides session

trunking there is also a client ID trunking for grouping together

several sessions to the same client ID.

By means of sessions, NFS can be seen as a really stateful proto-

col with a so-called “Exactly-Once Semantics (EOS)”. Until now, a

necessary but unspecified reply cache within the NFS server is

implemented to handle identical RPC operations that have been

sent several times. This statefulness in reality is not very robust,

however, and sometimes leads to the well-known stale NFS han-

dles. In NFS 4.1, the reply cache is now a mandatory part of the

NFS implementation, storing the server replies to RPC requests

persistently on disk.
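The idea behind a reply cache can be sketched in a few lines of Python. This is a toy model under stated assumptions: an in-memory dictionary stands in for the persistent cache, and the `slot`/`seq` keys and the `Session` class are hypothetical names for this illustration, not an actual NFS server interface.

```python
from typing import Callable, Dict, Tuple


class Session:
    """Toy server-side session with a reply cache keyed by (slot, sequence)."""

    def __init__(self) -> None:
        self._replies: Dict[Tuple[int, int], object] = {}
        self.executions = 0   # how often an operation was really executed

    def call(self, slot: int, seq: int, op: Callable[[], object]) -> object:
        key = (slot, seq)
        if key in self._replies:
            # Retransmission: answer from the cache, do not execute again.
            return self._replies[key]
        self.executions += 1
        reply = op()
        self._replies[key] = reply
        return reply


counter = {"value": 0}


def increment() -> int:
    counter["value"] += 1     # a non-idempotent operation
    return counter["value"]


session = Session()
first = session.call(slot=0, seq=1, op=increment)
retry = session.call(slot=0, seq=1, op=increment)   # same request sent again
print(first, retry, counter["value"])
```

The retransmitted request receives the cached reply, so the non-idempotent operation takes effect exactly once.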


Page 159: HPC Compass 2016_17


Another new feature of NFS 4.1 is delegation for directories: NFS

clients can be given temporary exclusive access to directories.

Before, this has only been possible for simple files. With the

forthcoming version 4.2 of the NFS standard, federated filesys-

tems will be added as a feature, which represents the NFS coun-

terpart of Microsoft’s DFS (distributed filesystem).

Some Proposed NFSv4.2 Features

NFSv4.2 promises many features that end-users have been

requesting, and that makes NFS more relevant as not only an

“every day” protocol, but one that has application beyond the

data center.

Server Side Copy

Server-Side Copy (SSC) removes one leg of a copy operation. In-

stead of reading entire files or even directories of files from one

server through the client, and then writing them out to another,

SSC permits the destination server to communicate directly to

the source server without client involvement, and removes the

limitations on server to client bandwidth and the possible con-

gestion it may cause.

Application Data Blocks (ADB)

ADB allows definition of the format of a file; for example, a VM

image or a database. This feature will allow initialization of data

stores; a single operation from the client can create a 300GB da-

tabase or a VM image on the server.

Guaranteed Space Reservation & Hole Punching

As storage demands continue to increase, various efficiency

techniques can be employed to give the appearance of a large

virtual pool of storage on a much smaller storage system. Thin

provisioning (where space appears available and reserved, but

is not committed) is commonplace, but often problematic to

manage in fast growing environments. The guaranteed space

reservation feature in NFSv4.2 will ensure that, regardless of the

thin provisioning policies, individual files will always have space

available for their maximum extent.

While such guarantees are a reassurance for the end-user, they

don’t help the storage administrator in his or her desire to fully

utilize all available storage. In support of better storage efficiency,

NFSv4.2 will introduce support for sparse files. With this technique,

commonly called “hole punching”, deleted and unused parts of files

are returned to the storage system’s free space pool.

Page 160: HPC Compass 2016_17

Figure: pNFS architecture (pNFS clients, metadata server and storage devices, connected by NFS 4.1, the storage access protocol and the control protocol)


pNFS Layout Types

If storage devices act simply as NFS 4 file servers, the file layout

is used. It is the only storage access protocol directly specified

in the NFS 4.1 standard. Besides the stripe sizes and stripe lo-

cations (storage devices), it also includes the NFS file handles

which the client needs to use to access the separate file areas.

The file layout is compact and static: the striping information

does not change even if changes are made to the file, enabling

multiple pNFS clients to simultaneously cache the layout and

avoid synchronisation overhead between clients and the MDS

or the MDS and storage devices.

File system authorisation and client authentication can be well

implemented with the file layout. When using NFS 4 as the stor-

age access protocol, client authentication merely depends on

the security flavor used – when using the RPCSEC_GSS security

flavor, for example, client access is kerberized and the server

controls access authorization using specified ACLs and cryp-

tographic processes.

In contrast, the block/volume layout uses volume identifiers

and block offsets and extents to specify a file layout. SCSI block

commands are used to access storage devices. As the block

distribution can change with each write access, the layout

must be updated more frequently than with the file layout.


Page 161: HPC Compass 2016_17


Block-based access to storage devices does not offer any secure

authentication option for the accessing SCSI initiator. Secure

SAN authorisation is possible with host granularity only, based

on World Wide Names (WWNs) with Fibre Channel or iSCSI

Qualified Names (IQNs) with iSCSI. The server cannot enforce access

control governed by the file system. Instead, a pNFS

client abides by the access rights voluntarily, and the

storage device has to trust the pNFS client – a fundamental

access control problem that is a recurrent issue in the NFS pro-

tocol history.

The object layout is syntactically similar to the file layout, but it

uses the SCSI object command set for data access to so-called

Object-based Storage Devices (OSDs) and is heavily based on

the DirectFLOW protocol of the ActiveScale PanFS from Pana-

sas. From the very start, Object-Based Storage Devices were

designed for secure authentication and access. Object access

relies on so-called capabilities, which the MDS issues

to the pNFS clients. Possession of

such a capability represents the authoritative access right to

an object.
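A capability of this kind can be pictured as a small signed token. The Python sketch below is only an illustration: the text does not say how capabilities are protected, so signing them with an HMAC over a key shared between the MDS and the storage device is an assumption, and `issue_capability` and `check_capability` are hypothetical names.

```python
import hashlib
import hmac

SECRET = b"shared-between-mds-and-osd"   # hypothetical shared key


def issue_capability(object_id: str, rights: str, key: bytes = SECRET) -> tuple:
    """MDS side: sign (object, rights) so the storage device can verify it."""
    message = f"{object_id}:{rights}".encode()
    tag = hmac.new(key, message, hashlib.sha256).hexdigest()
    return object_id, rights, tag


def check_capability(capability: tuple, wanted: str, key: bytes = SECRET) -> bool:
    """Storage device side: recompute the tag, then check the requested right."""
    object_id, rights, tag = capability
    message = f"{object_id}:{rights}".encode()
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected) and wanted in rights


cap = issue_capability("obj-42", "r")
print(check_capability(cap, "r"))                        # granted right
print(check_capability(cap, "w"))                        # right never granted
print(check_capability(("obj-42", "rw", cap[2]), "w"))   # tampered rights
```

The storage device does not need to trust the client here: a client that alters the rights in its token can no longer present a matching signature.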

pNFS can be upgraded to integrate other storage access pro-

tocols and operating systems, and storage manufacturers also

have the option to ship additional layout drivers for their pNFS


The Panasas file system uses parallel and redundant access to

object storage devices (OSDs), per-file RAID, distributed metada-

ta management, consistent client caching, file locking services,

and internal cluster management to provide a scalable, fault tol-

erant, high performance distributed file system.

The clustered design of the storage system and the use of client-

driven RAID provide scalable performance to many concurrent

file system clients through parallel access to file data that is

striped across OSD storage nodes. RAID recovery is performed

in parallel by the cluster of metadata managers, and declus-

tered data placement yields scalable RAID rebuild rates as the

storage system grows larger.

Panasas HPC Storage

Panasas ActiveStor parallel storage takes a very different approach

by allowing compute clients to read and write directly

to the storage, entirely eliminating filer head bottlenecks and

allowing single file system capacity and performance to scale

linearly to extreme levels using a proprietary protocol called Di-

rectFlow. Panasas has actively shared its core knowledge with

a consortium of storage industry technology leaders to create

an industry standard protocol which will eventually replace the

need for DirectFlow. This protocol, called parallel NFS (pNFS), is

now an optional extension of the NFS v4.1 standard.

The Panasas system is a production system that provides file

service to some of the largest compute clusters in the world, in

scientific labs, in seismic data processing, in digital animation

studios, in computational fluid dynamics, in semiconductor

manufacturing, and in general purpose computing environ-

ments. In these environments, hundreds or thousands of file

system clients share data and generate very high aggregate I/O

load on the file system. The Panasas system is designed to sup-

port several thousand clients and storage capacities in excess

of a petabyte.

The unique aspects of the Panasas system are its use of per-file,

client-driven RAID, its parallel RAID rebuild, its treatment of dif-

ferent classes of metadata (block, file, system) and a commodity

parts based blade hardware with integrated UPS. Of course, the

system has many other features (such as object storage, fault

tolerance, caching and cache consistency, and a simplified

management model) that are not unique, but are necessary for a

scalable system implementation.

Page 162: HPC Compass 2016_17


Panasas File System Background

The storage cluster is divided into storage nodes and manager

nodes at a ratio of about 10 storage nodes to 1 manager node,

although that ratio is variable. The storage nodes implement an

object store, and are accessed directly from Panasas file system

clients during I/O operations. The manager nodes manage the

overall storage cluster, implement the distributed file system se-

mantics, handle recovery of storage node failures, and provide an

exported view of the Panasas file system via NFS and CIFS. The

following Figure gives a basic view of the system components.

Object Storage

An object is a container for data and attributes; it is analogous

to the inode inside a traditional UNIX file system implementa-

tion. Specialized storage nodes called Object Storage Devices

(OSD) store objects in a local OSDFS file system. The object in-

terface addresses objects in a two-level (partition ID/object ID)

namespace. The OSD wire protocol provides byte-oriented access

to the data, attribute manipulation, creation and deletion

of objects, and several other specialized operations. Panasas

uses an iSCSI transport to carry OSD commands that are very

similar to the OSDv2 standard currently in progress within SNIA

and ANSI-T10.

The Panasas file system is layered over the object storage. Each file

is striped over two or more objects to provide redundancy and high

bandwidth access. The file system semantics are implemented by

metadata managers that mediate access to objects from clients

of the file system. The clients access the object storage using the

iSCSI/OSD protocol for Read and Write operations. The I/O op-

erations proceed directly and in parallel to the storage nodes,

bypassing the metadata managers. The clients interact with the

out-of-band metadata managers via RPC to obtain access capa-

bilities and location information for the objects that store files.


Page 163: HPC Compass 2016_17


Object attributes are used to store file-level attributes, and di-

rectories are implemented with objects that store name to ob-

ject ID mappings. Thus the file system metadata is kept in the

object store itself, rather than being kept in a separate database

or some other form of storage on the metadata nodes.
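The following Python sketch illustrates the idea of keeping file system metadata inside the object store itself. It is a toy in-memory model, not the OSDFS interface: the classes `StoredObject` and `ObjectStore` are hypothetical, and for brevity the directory's name-to-object-ID mapping is held as an attribute of the directory object.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

ObjectKey = Tuple[int, int]   # (partition ID, object ID)


@dataclass
class StoredObject:
    data: bytes = b""
    attributes: Dict[str, object] = field(default_factory=dict)


class ObjectStore:
    """Toy in-memory stand-in for an object storage device."""

    def __init__(self) -> None:
        self.objects: Dict[ObjectKey, StoredObject] = {}
        self._next_id = 1

    def create(self, partition: int, **attributes: object) -> ObjectKey:
        key = (partition, self._next_id)
        self._next_id += 1
        self.objects[key] = StoredObject(attributes=dict(attributes))
        return key


store = ObjectStore()
root = store.create(partition=7, kind="directory", entries={})
report = store.create(partition=7, kind="file", owner="alice", mode=0o640)
store.objects[report].data = b"results"

# The directory is itself an object that maps names to object IDs.
store.objects[root].attributes["entries"]["report.txt"] = report

looked_up = store.objects[root].attributes["entries"]["report.txt"]
print(looked_up, store.objects[looked_up].attributes["owner"])
```

A path lookup therefore only touches objects; no separate metadata database is consulted in this model.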

System Software Components

The major software subsystems are the OSDFS object storage

system, the Panasas file system metadata manager, the Pana-

sas file system client, the NFS/CIFS gateway, and the overall

cluster management system.

• The Panasas client is an installable kernel module that runs inside

the Linux kernel. The kernel module implements the standard

VFS interface, so that the client hosts can mount the file system

and use a POSIX interface to the storage system.

• Each storage cluster node runs a common platform that is based

on FreeBSD, with additional services to provide hardware monitoring,

configuration management, and overall control.

Figure: Panasas System Components

• The storage nodes use a specialized local file system (OSDFS)

that implements the object storage primitives. They

implement an iSCSI target and the OSD command set. The

OSDFS object store and iSCSI target/OSD command proces-

sor are kernel modules. OSDFS is concerned with traditional

block-level file system issues such as efficient disk arm utili-

zation, media management (i.e. error handling), high through-

put, as well as the OSD interface.

• The cluster manager (SysMgr) maintains the global configuration,

and it controls the other services and nodes in the storage

cluster. There is an associated management application that pro-

vides both a command line interface (CLI) and an HTML interface

(GUI). These are all user level applications that run on a subset

of the manager nodes. The cluster manager is concerned with

membership in the storage cluster, fault detection, configuration

management, and overall control for operations like software

upgrade and system restart.

• The Panasas metadata manager (PanFS) implements the file

system semantics and manages data striping across the object

storage devices. This is a user level application that runs on every

manager node. The metadata manager is concerned with distrib-

uted file system issues such as secure multi-user access, main-

taining consistent file- and object-level metadata, client cache

coherency, and recovery from client, storage node, and metadata

server crashes. Fault tolerance is based on a local transaction log

that is replicated to a backup on a different manager node.

• The NFS and CIFS services provide access to the file system

for hosts that cannot use our Linux installable file system

client. The NFS service is a tuned version of the standard

FreeBSD NFS server that runs inside the kernel. The CIFS ser-

vice is based on Samba and runs at user level. In turn, these

services use a local instance of the file system client, which

runs inside the FreeBSD kernel. These gateway services run

on every manager node to provide a clustered NFS and CIFS


Page 164: HPC Compass 2016_17


Commodity Hardware Platform

The storage cluster nodes are implemented as blades that are

very compact computer systems made from commodity parts.

The blades are clustered together to provide a scalable plat-

form. The OSD StorageBlade module and metadata manager Di-

rectorBlade module use the same form factor blade and fit into

the same chassis slots.

Storage Management

Traditional storage management tasks involve partitioning

available storage space into LUNs (i.e., logical units that are one

or more disks, or a subset of a RAID array), assigning LUN own-

ership to different hosts, configuring RAID parameters, creating

file systems or databases on LUNs, and connecting clients to the

correct server for their storage. This can be a labor-intensive sce-

nario. Panasas provides a simplified model for storage manage-

ment that shields the storage administrator from these kinds of

details and allows a single, part-time admin to manage systems

that are hundreds of terabytes in size.

The Panasas storage system presents itself as a file system with

a POSIX interface, and hides most of the complexities of storage

management. Clients have a single mount point for the entire

system. The /etc/fstab file references the cluster manager, and

from that the client learns the location of the metadata service

instances. The administrator can add storage while the system

is online, and new resources are automatically discovered. To

manage available storage, Panasas introduces two basic stor-

age concepts: a physical storage pool called a BladeSet, and a

logical quota tree called a Volume. The BladeSet is a collection

of StorageBlade modules in one or more shelves that comprise a

RAID fault domain. Panasas mitigates the risk of large fault do-

mains with the scalable rebuild performance described below

in the text. The BladeSet is a hard physical boundary for the vol-

umes it contains. A BladeSet can be grown at any time, either by

adding more StorageBlade modules, or by merging two existing

BladeSets together.


Page 165: HPC Compass 2016_17


The Volume is a directory hierarchy that has a quota constraint

and is assigned to a particular BladeSet. The quota can be

changed at any time, and capacity is not allocated to the Vol-

ume until it is used, so multiple volumes compete for space

within their BladeSet and grow on demand. The files in those

volumes are distributed among all the StorageBlade modules

in the BladeSet. Volumes appear in the file system name space

as directories. Clients have a single mount point for the whole

storage system, and volumes are simply directories below the

mount point. There is no need to update client mounts when the

admin creates, deletes, or renames volumes.

Automatic Capacity Balancing

Capacity imbalance occurs when expanding a BladeSet (i.e., add-

ing new, empty storage nodes), merging two BladeSets, and re-

placing a storage node following a failure. In the latter scenario,

the imbalance is the result of the RAID rebuild, which uses spare

capacity on every storage node rather than dedicating a specif-

ic “hot spare” node. This provides better throughput during re-

build, but causes the system to have a new, empty storage node

after the failed storage node is replaced. The system automati-

cally balances used capacity across storage nodes in a BladeSet

using two mechanisms: passive balancing and active balancing.

Passive balancing changes the probability that a storage node

will be used for a new component of a file, based on its availa-

ble capacity. This takes effect when files are created, and when

their stripe size is increased to include more storage nodes. Ac-

tive balancing is done by moving an existing component object

from one storage node to another, and updating the storage

map for the affected file. During the transfer, the file is transpar-

ently marked read-only by the storage management layer, and

the capacity balancer skips files that are being actively written.

Capacity balancing is thus transparent to file system clients.
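Passive balancing as described above can be approximated by a weighted random choice. The Python sketch below is an assumption-laden illustration, not the Panasas algorithm: it simply uses the free capacity of each node as its selection weight, and `pick_nodes` is a hypothetical helper.

```python
import random
from typing import Dict, List


def pick_nodes(free_bytes: Dict[str, int], count: int,
               rng: random.Random) -> List[str]:
    """Pick `count` distinct nodes, favouring those with more free capacity."""
    candidates = dict(free_bytes)
    chosen: List[str] = []
    for _ in range(min(count, len(candidates))):
        names = list(candidates)
        weights = [candidates[name] for name in names]
        node = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(node)
        del candidates[node]      # a file uses each node at most once
    return chosen


free = {"old-1": 100, "old-2": 100, "new-1": 800}   # new node is nearly empty
rng = random.Random(0)
hits = {name: 0 for name in free}
for _ in range(10_000):
    hits[pick_nodes(free, 1, rng)[0]] += 1
print(hits)   # the nearly empty node receives most new components
```

Over time the emptier node fills faster than the others, which levels capacity without moving any existing data.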

Object RAID and Reconstruction

Panasas protects against loss of a data object or an entire storage

node by striping files across objects stored on different

storage nodes, using a fault-tolerant striping algorithm such as

RAID-1 or RAID-5. Small files are mirrored on two objects, and

larger files are striped more widely to provide higher bandwidth

and less capacity overhead from parity information. The per-file

RAID layout means that parity information for different files is

not mixed together, and easily allows different files to use differ-

ent RAID schemes alongside each other. This property and the

security mechanisms of the OSD protocol make it possible to

enforce access control over files even as clients access storage

nodes directly. It also enables what is perhaps the most novel

aspect of our system, client-driven RAID. That is, the clients are

responsible for computing and writing parity. The OSD security

mechanism also allows multiple metadata managers to manage

objects on the same storage device without heavyweight coor-

dination or interference from each other.

Client-driven, per-file RAID has four advantages for large-scale

storage systems. First, by having clients compute parity for

their own data, the XOR power of the system scales up as the

number of clients increases. We measured XOR processing dur-

ing streaming write bandwidth loads at 7% of the client’s CPU,

with the rest going to the OSD/iSCSI/TCP/IP stack and other file

system overhead. Moving XOR computation out of the storage

system into the client requires some additional work to handle

failures. Clients are responsible for generating good data and

good parity for it. Because the RAID equation is per-file, an er-

rant client can only damage its own data. However, if a client

fails during a write, the metadata manager will scrub parity to

ensure the parity equation is correct.
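The parity computation that the client takes over is a bytewise XOR across the data blocks of a stripe. The following Python sketch shows the principle for a single-parity (RAID-5 style) stripe; block sizes, function names and the in-memory representation are assumptions made for illustration only.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_stripe(data_blocks: List[bytes]) -> List[bytes]:
    """Client side: append the parity block before sending the stripe out."""
    return data_blocks + [xor_blocks(data_blocks)]


def rebuild(stripe: List[bytes], lost: int) -> bytes:
    """Recover one lost block from all surviving blocks of the stripe."""
    survivors = [blk for i, blk in enumerate(stripe) if i != lost]
    return xor_blocks(survivors)


stripe = write_stripe([b"AAAA", b"BBBB", b"CCCC"])
print(rebuild(stripe, lost=1))   # recovers the second data block
```

Since every client runs `write_stripe` for its own files, the available XOR capacity grows with the number of clients.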

The second advantage of client-driven RAID is that clients can

perform an end-to-end data integrity check. Data has to go

through the disk subsystem, through the network interface on

the storage nodes, through the network and routers, through

the NIC on the client, and all of these transits can introduce er-

rors with a very low probability. Clients can choose to read par-

ity as well as data, and verify parity as part of a read operation.

Page 166: HPC Compass 2016_17


If errors are detected, the operation is retried. If the error is per-

sistent, an alert is raised and the read operation fails. By check-

ing parity across storage nodes within the client, the system can

ensure end-to-end data integrity. This is another novel property

of per-file, client-driven RAID.
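The read-side check can be sketched as follows. This is a minimal Python model under assumptions: the stripe is single-parity, the parity block is stored last, and `fetch` is a hypothetical callable that returns all blocks of a stripe as read from the storage nodes.

```python
from typing import Callable, List


def parity_ok(stripe: List[bytes]) -> bool:
    """A single-parity stripe is consistent if all blocks XOR to zero."""
    acc = bytearray(len(stripe[0]))
    for block in stripe:
        for i, byte in enumerate(block):
            acc[i] ^= byte
    return not any(acc)


def verified_read(fetch: Callable[[], List[bytes]], retries: int = 2) -> List[bytes]:
    """Fetch data plus parity; retry on mismatch, then give up."""
    for _ in range(retries + 1):
        stripe = fetch()
        if parity_ok(stripe):
            return stripe[:-1]    # hand back the data blocks only
    raise IOError("persistent parity mismatch")


good = [b"\x01\x02", b"\x04\x08", b"\x05\x0a"]   # last block is the parity
bad = [b"\x01\x02", b"\xff\x08", b"\x05\x0a"]    # one corrupted byte
attempts = iter([bad, good])                     # a transient error, then success
print(verified_read(lambda: next(attempts)))
```

A transient error is hidden by the retry, while a persistent mismatch surfaces as a failed read instead of silently corrupt data.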

Third, per-file RAID protection lets the metadata managers re-

build files in parallel. Although parallel rebuild is theoretically

possible in block-based RAID, it is rarely implemented. This is due

to the fact that the disks are owned by a single RAID controller,

even in dual-ported configurations. Large storage systems have

multiple RAID controllers that are not interconnected. Since the

SCSI Block command set does not provide fine-grained synchro-

nization operations, it is difficult for multiple RAID controllers

to coordinate a complicated operation such as an online rebuild

without external communication. Even if they could, without

connectivity to the disks in the affected parity group, other RAID

controllers would be unable to assist. Even in a high-availability

configuration, each disk is typically only attached to two differ-

ent RAID controllers, which limits the potential speedup to 2x.

When a StorageBlade module fails, the metadata managers that

own Volumes within that BladeSet determine what files are af-

fected, and then they farm out file reconstruction work to every

other metadata manager in the system. Metadata managers re-

build their own files first, but if they finish early or do not own

any Volumes in the affected BladeSet, they are free to aid other

metadata managers. Declustered parity groups spread out the

I/O workload among all StorageBlade modules in the BladeSet.

The result is that larger storage clusters reconstruct lost data

more quickly.

The fourth advantage of per-file RAID is that unrecoverable

faults can be constrained to individual files. The most com-

monly encountered double-failure scenario with RAID-5 is an

unrecoverable read error (i.e., grown media defect) during the

reconstruction of a failed storage device. The second storage

device is still healthy, but it has been unable to read a sector,

©2013 Panasas Incorporated. All rights reserved. Panasas,

the Panasas logo, Accelerating Time to Results, ActiveScale,

DirectFLOW, DirectorBlade, StorageBlade, PanFS, PanActive

and MyPanasas are trademarks or registered trademarks of

Panasas, Inc. in the United States and other countries. All oth-

er trademarks are the property of their respective owners.

Information supplied by Panasas, Inc. is believed to be accu-

rate and reliable at the time of publication, but Panasas, Inc.

assumes no responsibility for any errors that may appear in

this document. Panasas, Inc. reserves the right, without no-

tice, to make changes in product design, specifications and

prices. Information is subject to change without notice.


Page 167: HPC Compass 2016_17


which prevents rebuild of the sector lost from the first drive and

potentially the entire stripe or LUN, depending on the design of

the RAID controller. With block-based RAID, it is difficult or im-

possible to directly map any lost sectors back to higher-level file

system data structures, so a full file system check and media

scan will be required to locate and repair the damage. A more

typical response is to fail the rebuild entirely. RAID control-

lers monitor drives in an effort to scrub out media defects and

avoid this bad scenario, and the Panasas system does media

scrubbing, too. However, with high capacity SATA drives, the

chance of encountering a media defect on drive B while rebuild-

ing drive A is still significant. With per-file RAID-5, this sort of

double failure means that only a single file is lost, and the spe-

cific file can be easily identified and reported to the administra-

tor. While block-based RAID systems have been compelled to

introduce RAID-6 (i.e., fault tolerant schemes that handle two

failures), the Panasas solution is able to deploy highly reliable

RAID-5 systems with large, high performance storage pools.

RAID Rebuild Performance

RAID rebuild performance determines how quickly the system

can recover data when a storage node is lost. Short rebuild

times reduce the window in which a second failure can cause

data loss. There are three techniques to reduce rebuild times: re-

ducing the size of the RAID parity group, declustering the place-

ment of parity group elements, and rebuilding files in parallel

using multiple RAID engines.

















Declustered parity groups

The rebuild bandwidth is the rate at which reconstructed data

is written to the system when a storage node is being reconstructed.

The system must read N times as much as it writes, where N

depends on the width of the RAID parity group, so the overall

throughput of the storage system is several times higher than

the rebuild rate. A narrower RAID parity group requires fewer

read and XOR operations to rebuild, so will result in a higher re-

build bandwidth.

However, it also results in higher capacity overhead for parity

data, and can limit bandwidth during normal I/O. Thus, selection

of the RAID parity group size is a trade-off between capacity

overhead, on-line performance, and rebuild performance. Un-

derstanding declustering is easier with a picture. In the figure

on the left, each parity group has 4 elements, which are indicat-

ed by letters placed in each storage device. They are distributed

among 8 storage devices. The ratio between the parity group

size and the available storage devices is the declustering ratio,

which in this example is ½. In the picture, capital letters repre-

sent those parity groups that all share the second storage node.

If the second storage device were to fail, the system would have

to read the surviving members of its parity groups to rebuild

the lost elements. You can see that the other elements of those

parity groups occupy about ½ of each other storage device. For

this simple example you can assume each parity element is the

same size so all the devices are filled equally. In a real system,

the component objects will have various sizes depending on the

overall file size, although each member of a parity group will be

very close in size. There will be thousands or millions of objects

on each device, and the Panasas system uses active balancing

to move component objects between storage nodes to level ca-

pacity. Declustering means that rebuild requires reading a sub-

set of each device, with the proportion being approximately the

same as the declustering ratio. The total amount of data read is

the same with and without declustering, but with declustering

it is spread out over more devices. When writing the reconstruct-

ed elements, two elements of the same parity group cannot be

Page 168: HPC Compass 2016_17


located on the same storage node. Declustering leaves many

storage devices available for the reconstructed parity element,

and randomizing the placement of each file’s parity group lets

the system spread out the write I/O over all the storage. Thus

declustering RAID parity groups has the important property of

taking a fixed amount of rebuild I/O and spreading it out over

more storage devices.
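The trade-offs described above reduce to a little arithmetic. The following is a minimal sketch, not Panasas code: the function names are invented for illustration, and it assumes single-parity (RAID-5 style) groups with uniformly random placement of group elements.

```python
from fractions import Fraction


def parity_overhead(group_width: int) -> Fraction:
    """Fraction of raw capacity spent on parity with one parity element per group."""
    return Fraction(1, group_width)


def rebuild_reads_per_write(group_width: int) -> int:
    """Surviving elements that must be read and XORed to rebuild one lost element."""
    return group_width - 1


def declustering_ratio(group_width: int, devices: int) -> Fraction:
    """Parity group size divided by the number of available storage devices."""
    return Fraction(group_width, devices)


def read_fraction_per_survivor(group_width: int, devices: int) -> Fraction:
    """Share of each surviving device that is read during a rebuild,
    assuming group elements are placed uniformly at random (an assumption)."""
    return Fraction(group_width - 1, devices - 1)
```

For the example of 4-element groups on 8 devices, each rebuilt element needs 3 reads, parity costs one quarter of raw capacity, and each surviving device contributes 3/7 of its contents, which matches the "about ½" visible in the figure.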

Having per-file RAID allows the Panasas system to divide the

work among the available DirectorBlade modules by assigning

different files to different DirectorBlade modules. This division

is dynamic with a simple master/worker model in which meta-

data services make themselves available as workers, and each

metadata service acts as the master for the volumes it imple-

ments. By doing rebuilds in parallel on all DirectorBlade mod-

ules, the system can apply more XOR throughput and utilize the

additional I/O bandwidth obtained with declustering.

Metadata Management

There are several kinds of metadata in the Panasas system.

These include the mapping from object IDs to sets of block ad-

dresses, mapping files to sets of objects, file system attributes

such as ACLs and owners, file system namespace information

(i.e., directories), and configuration/management information

about the storage cluster itself.

Block-level Metadata

Block-level metadata is managed internally by OSDFS, the file

system that is optimized to store objects. OSDFS uses a floating

block allocation scheme where data, block pointers, and object

descriptors are batched into large write operations. The write

buffer is protected by the integrated UPS, and it is flushed to

disk on power failure or system panics. Fragmentation was an

issue in early versions of OSDFS that used a first-fit block allo-

cator, but this has been significantly mitigated in later versions

that use a modified best-fit allocator.


Page 169: HPC Compass 2016_17


OSDFS stores higher level file system data structures, such as

the partition and object tables, in a modified BTree data struc-

ture. Block mapping for each object uses a traditional direct/

indirect/double-indirect scheme. Free blocks are tracked by

a proprietary bitmap-like data structure that is optimized for

copy-on-write reference counting, part of OSDFS’s integrated

support for object- and partition-level copy-on-write snapshots.

Block-level metadata management consumes most of the cycles

in file system implementations. By delegating storage manage-

ment to OSDFS, the Panasas metadata managers have an or-

der of magnitude less work to do than the equivalent SAN file

system metadata manager that must track all the blocks in the system.


File-level Metadata

Above the block layer is the metadata about files. This includes

user-visible information such as the owner, size, and modification

time, as well as internal information that identifies which ob-

jects store the file and how the data is striped across those

objects (i.e., the file’s storage map). Our system stores this file

metadata in object attributes on two of the N objects used to

store the file’s data. The rest of the objects have basic attrib-

utes like their individual length and modify times, but the high-

er-level file system attributes are only stored on the two attrib-

ute-storing components.

File names are implemented in directories similar to traditional

UNIX file systems. Directories are special files that store an array

of directory entries. A directory entry identifies a file with a tuple

of <serviceID, partitionID, objectID>, and also includes two <os-

dID> fields that are hints about the location of the attribute stor-

ing components. The partitionID/objectID is the two-level ob-

ject numbering scheme of the OSD interface, and Panasas uses

a partition for each volume. Directories are mirrored (RAID-1)

in two objects so that the small write operations associated

with directory updates are efficient.

Clients are allowed to read, cache and parse directories, or they

can use a Lookup RPC to the metadata manager to translate

a name to a <serviceID, partitionID, objectID> tuple and the

<osdID> location hints. The serviceID provides a hint about the

metadata manager for the file, although clients may be redirect-

ed to the metadata manager that currently controls the file.

The osdID hint can become out-of-date if reconstruction or

active balancing moves an object. If both osdID hints fail, the

metadata manager has to multicast a GetAttributes to the stor-

age nodes in the BladeSet to locate an object. The partitionID

and objectID are the same on every storage node that stores a

component of the file, so this technique will always work. Once

the file is located, the metadata manager automatically updates

the stored hints in the directory, allowing future accesses to

bypass this step.
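The hint mechanism can be sketched as a fast path with a broadcast fallback. This is an illustrative model only: `DirEntry`, `locate` and the `storage_nodes` dictionary are hypothetical names, and the dictionary scan merely stands in for the multicast GetAttributes described above.

```python
from dataclasses import dataclass


@dataclass
class DirEntry:
    service_id: int
    partition_id: int
    object_id: int
    osd_hints: list  # two osdID hints; either may be stale


def locate(entry, storage_nodes):
    """Return the osdID of a node holding the object, refreshing stale hints.

    storage_nodes maps osdID -> set of (partitionID, objectID) pairs and is a
    stand-in for the storage nodes of a BladeSet.
    """
    key = (entry.partition_id, entry.object_id)
    # Fast path: try the hints stored in the directory entry.
    for osd_id in entry.osd_hints:
        if key in storage_nodes.get(osd_id, set()):
            return osd_id
    # Slow path: ask every node. The same (partitionID, objectID) pair is
    # used on every node, so a full scan finds the object if it exists.
    holders = sorted(o for o, objs in storage_nodes.items() if key in objs)
    if not holders:
        raise LookupError(f"object {key} not found on any storage node")
    entry.osd_hints = holders[:2]  # update hints so the next lookup is fast
    return holders[0]
```

After one slow lookup the entry carries fresh hints, so a second call for the same file returns from the fast path.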















Creating a file

Page 170: HPC Compass 2016_17


File operations may require several object operations. The figure

on the left shows the steps used in creating a file. The metadata

manager keeps a local journal to record in-progress actions so it

can recover from object failures and metadata manager crashes

that occur when updating multiple objects. For example, creating

a file is a fairly complex task that requires updating the parent

directory as well as creating the new file. There are 2 Create

OSD operations to create the first two components of the file,

and 2 Write OSD operations, one to each replica of the parent

directory. As a performance optimization, the metadata serv-

er also grants the client read and write access to the file and

returns the appropriate capabilities to the client as part of the

FileCreate results. The server makes a record of these write

capabilities to support error recovery if the client crashes while

writing the file. Note that the directory update (step 7) occurs

after the reply, so that many directory updates can be batched

together. The deferred update is protected by the op-log

record that gets deleted in step 8 after the successful directory update.


The metadata manager maintains an op-log that records the

object create and the directory updates that are in progress.

This log entry is removed when the operation is complete. If the

metadata service crashes and restarts, or a failure event moves

the metadata service to a different manager node, then the op-

log is processed to determine what operations were active at

the time of the failure. The metadata manager rolls the opera-

tions forward or backward to ensure the object store is consist-

ent. If no reply to the operation has been generated, then the op-

eration is rolled back. If a reply has been generated but pending

operations are outstanding (e.g., directory updates), then the

operation is rolled forward.
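The recovery rule can be written down in a few lines. The sketch below is a simplified model under stated assumptions: `OpLogEntry`, `Action` and `recover` are hypothetical names, and a real op-log would carry far more state than a reply flag and a list of pending steps.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ROLL_BACK = "roll back"
    ROLL_FORWARD = "roll forward"


@dataclass
class OpLogEntry:
    op: str            # label for the logged operation
    reply_sent: bool   # has the client already received a reply?
    pending: tuple     # deferred steps still outstanding, e.g. directory updates


def recover(oplog):
    """Decide, per logged operation, which way to roll after a restart."""
    decisions = {}
    for entry in oplog:
        if not entry.reply_sent:
            # The client never saw a result, so undoing the work is safe.
            decisions[entry.op] = Action.ROLL_BACK
        elif entry.pending:
            # The client was told the operation succeeded, so finish it.
            decisions[entry.op] = Action.ROLL_FORWARD
    return decisions
```

Entries with a reply and nothing pending are complete and would normally have been removed from the log already, so the sketch ignores them.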

The write capability is stored in a cap-log so that when a meta-

data server starts it knows which of its files are busy. In addition

“Outstanding in the HPC world, the ActiveStor solutions provided

by Panasas are undoubtedly the only HPC storage solutions that

combine highest scalability and performance with a convincing

ease of management.”

Christoph Budziszewski | HPC Solution Architect


Page 171: HPC Compass 2016_17


to the “piggybacked” write capability returned by FileCreate,

the client can also execute a StartWrite RPC to obtain a separate

write capability. The cap-log entry is removed when the client

releases the write cap via an EndWrite RPC. If the client reports

an error during its I/O, then a repair log entry is made and the file

is scheduled for repair. Read and write capabilities are cached

by the client over multiple system calls, further reducing meta-

data server traffic.

System-level Metadata

The final layer of metadata is information about the overall sys-

tem itself. One possibility would be to store this information in

objects and bootstrap the system through a discovery protocol.

The most difficult aspect of that approach is reasoning about

the fault model. The system must be able to come up and be

manageable while it is only partially functional. Panasas chose

instead a model with a small replicated set of system managers,

each of which stores a replica of the system configuration metadata.

Each system manager maintains a local database, outside of

the object storage system. Berkeley DB is used to store tables

that represent our system model. The different system manager

instances are members of a replication set that use Lamport’s

part-time parliament (PTP) protocol to make decisions and up-

date the configuration information. Clusters are configured with

one, three, or five system managers so that the voting quorum

has an odd number and a network partition will cause a minori-

ty of system managers to disable themselves.
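The reason for an odd number of system managers follows from majority voting. A minimal sketch, assuming a simple strict-majority quorum (the function names are illustrative and not part of any Panasas interface):

```python
def quorum_size(managers: int) -> int:
    """Smallest number of system managers that forms a strict majority."""
    return managers // 2 + 1


def side_stays_active(reachable: int, managers: int) -> bool:
    """True if one side of a network partition can still reach a quorum."""
    return reachable >= quorum_size(managers)
```

With five managers split 3/2 by a network partition, only the side with three keeps operating; the other two disable themselves. An even count such as four could split 2/2 and leave neither side with a majority.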

System configuration state includes both static state, such

as the identity of the blades in the system, as well as dynam-

ic state such as the online/offline state of various services and

error conditions associated with different system components.

Each state update decision, whether it is updating the admin

password or activating a service, involves a voting round and an

update round according to the PTP protocol. Database updates

are performed within the PTP transactions to keep the data-

bases synchronized. Finally, the system keeps backup copies of

the system configuration databases on several other blades to

guard against catastrophic loss of every system manager blade.

Blade configuration is pulled from the system managers as part

of each blade’s startup sequence. The initial DHCP handshake

conveys the addresses of the system managers, and thereafter

the local OS on each blade pulls configuration information from

the system managers via RPC.

The cluster manager implementation has two layers. The lower

level PTP layer manages the voting rounds and ensures that par-

titioned or newly added system managers will be brought up-to-

date with the quorum. The application layer above that uses the

voting and update interface to make decisions. Complex system

operations may involve several steps, and the system manager

has to keep track of its progress so it can tolerate a crash and

roll back or roll forward as appropriate.

For example, creating a volume (i.e., a quota-tree) involves file

system operations to create a top-level directory, object oper-

ations to create an object partition within OSDFS on each Stor-

ageBlade module, service operations to activate the appropri-

ate metadata manager, and configuration database operations

to reflect the addition of the volume.

Recovery is enabled by having two PTP transactions. The initial

PTP transaction determines if the volume should be created, and

it creates a record about the volume that is marked as incom-

plete. Then the system manager does all the necessary service

activations, file and storage operations. When these all com-

plete, a final PTP transaction is performed to commit the oper-

ation. If the system manager crashes before the final PTP trans-

action, it will detect the incomplete operation the next time it

restarts, and then roll the operation forward or backward.
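The two-transaction pattern can be modeled as an "incomplete" marker that brackets the multi-step work. The sketch below is illustrative only: `ConfigDB`, `create_volume` and `recover` are hypothetical names, a plain dictionary stands in for the replicated configuration database, and recovery is shown rolling forward by re-running steps that are assumed to be idempotent.

```python
class ConfigDB:
    """Stand-in for the replicated configuration database."""

    def __init__(self):
        self.volumes = {}

    def begin(self, name):
        # First PTP transaction: record the volume, marked incomplete.
        self.volumes[name] = "incomplete"

    def commit(self, name):
        # Final PTP transaction: the operation is committed.
        self.volumes[name] = "complete"

    def incomplete(self):
        return sorted(n for n, s in self.volumes.items() if s == "incomplete")


def create_volume(db, name, steps):
    db.begin(name)
    for step in steps:
        step()  # service activation, file and storage operations; may fail
    db.commit(name)


def recover(db, steps_for):
    """On restart, finish every operation that was left incomplete."""
    for name in db.incomplete():
        for step in steps_for(name):  # assumed idempotent, so safe to repeat
            step()
        db.commit(name)
```

A crash between `begin` and `commit` leaves the marker in place, which is what lets the restarted manager find the unfinished operation.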

Page 172: HPC Compass 2016_17


BeeGFS

BeeGFS combines multiple storage servers to provide a highly

scalable shared network file system with striped file contents.

This way, it allows users to overcome the tight performance lim-

itations of single servers, single network interconnects, a limit-

ed number of hard drives etc. In such a system, high throughput

demands of large numbers of clients can easily be satisfied, but

even a single client can profit from the aggregated performance

of all the storage servers in the system.

Key Aspects:

• Maximum Flexibility

• BeeGFS supports a wide range of Linux distributions as well as Linux kernels

• Storage servers run on top of an existing local file system using the POSIX interface

• Clients and servers can be added to an existing system without downtime

• BeeGFS supports multiple networks and dynamic failover between different networks.

• BeeGFS provides support for metadata and file contents


This is made possible by a separation of metadata and file con-

tents. While storage servers are responsible for storing stripes of

the actual contents of user files, metadata servers do the coordi-

nation of file placement and striping among the storage servers

and inform the clients about certain file details when necessary.

When accessing file contents, BeeGFS clients directly contact the

storage servers to perform file I/O and communicate with multiple

servers simultaneously, giving your applications truly parallel

access to the file data. To keep the metadata access latency (e.g.

directory lookups) at a minimum, BeeGFS allows you to also dis-

tribute the metadata across multiple servers, so that each of the

metadata servers stores a part of the global file system namespace.
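Striping lets a client work out which storage server holds a given byte without asking anyone at I/O time. The following sketch assumes fixed-size chunks laid out round-robin across a file's storage targets; the chunk size, the target names and the function itself are illustrative assumptions, not BeeGFS defaults or APIs.

```python
def chunk_location(offset: int, chunk_size: int, targets: list):
    """Map a byte offset in a file to (storage target, offset inside the chunk).

    Assumes chunk i of the file lives on target i modulo the number of targets.
    """
    if chunk_size <= 0 or not targets:
        raise ValueError("need a positive chunk size and at least one target")
    chunk_index = offset // chunk_size
    return targets[chunk_index % len(targets)], offset % chunk_size
```

Because consecutive chunks land on different targets, a large sequential read touches all servers at once, which is where the aggregated throughput comes from.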




Page 173: HPC Compass 2016_17


The following picture shows the system architecture and roles

within a BeeGFS instance. Note that although all roles in the picture

are running on different hosts, it is also possible to run any

combination of client and servers on the same machine.

A very important differentiator for BeeGFS is the ease-of-use. The

philosophy behind BeeGFS is to make the hurdles as low as pos-

sible and allow the use of a parallel file system to as many people

as possible for their work. BeeGFS was designed to work with any

local file system (such as ext4 or xfs) for data storage – as long as

it is POSIX compliant. That way system administrators can choose

the local file system that they prefer and are experienced with

and therefore do not need to learn the use of new tools. This

makes it much easier to adopt the technology of parallel storage,

especially for sites with limited resources and manpower.

BeeGFS needs four main components to work:

• the Management Server

• the Object Storage Server

• the Metadata Server

• the file system client

There are two supporting daemons:

• The file system client needs a “helper-daemon” to run on the

• The Admon daemon might run in the storage cluster and give system administrators a better view about what is going on. It is not a necessary component and a BeeGFS system is fully operable without it.

Management Server (MS)

The MS is the component of the file system making sure that all

other processes can find each other. It is the first daemon that

has to be set up – and all configuration files of a BeeGFS instal-

lation have to point to the same MS. There is always exactly one

MS in a system. The MS maintains a list of all file system compo-

nents – this includes clients, Metadata Servers, Metadata Targets,

Storage Servers and Storage Targets.

Metadata Server (MDS)

The MDS contains the information about MetaData in the

system. BeeGFS implements a fully scalable architecture for

BeeGFS Architecture

Page 174: HPC Compass 2016_17


MetaData and allows a practically unlimited number of MDS.

Each MDS has exactly one MetaDataTarget (MDT). Each directory

in the global file system is attached to exactly one MDS handling

its content. Since the assignment of directories to MDS is random,

BeeGFS can make efficient use of a (very) large number of MDS.

As long as the number of directories is significantly larger than

the number of MDS (which is normally the case), there will be a

roughly equal share of load on each of the MDS.
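The load-balancing claim is easy to check with a small simulation. This sketch assumes each directory is assigned to an MDS uniformly at random and independently; the function name and the seed parameter are illustrative, and real metadata load also depends on how busy each directory is.

```python
import random
from collections import Counter


def assign_directories(num_dirs: int, num_mds: int, seed: int = 0) -> Counter:
    """Randomly assign each directory to one MDS and count directories per MDS."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return Counter(rng.randrange(num_mds) for _ in range(num_dirs))
```

With 100,000 directories on 8 MDS the counts cluster tightly around 12,500 each, while with only a handful of directories per MDS the shares can differ widely, which is why the number of directories needs to be much larger than the number of MDS.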

The MDS is a fully multithreaded daemon – the number of

threads to start has to be set to appropriate values for the used

hardware. Factors to consider here are number of CPU cores in

the MDS, number of clients, workload, number and type of disks

forming the MDT. In general, serving MetaData needs some CPU power,

and especially if one wants to make use of fast underlying storage

like SSDs, the CPUs should be powerful enough that they do not

become the bottleneck.

Object Storage Server (OSS)

The OSS is the main service to store the file contents. Each OSS

might have one or many ObjectStorageTargets (OST) – where

an OST is a RAID-Set (or LUN) with a local file system (such as xfs

or ext4) on top. A typical OST is between 6 and 12 hard drives in

RAID6 – so an OSS with 36 drives might be organized with 3 OSTs

each being a RAID6 with 12 drives.
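The 36-drive example works out as follows. The helper below is a hypothetical illustration that assumes every OST is a RAID6 set of equal size, with two drives' worth of capacity per set used for parity.

```python
def osts_and_data_drives(total_drives: int, drives_per_ost: int, parity_drives: int = 2):
    """Return (number of OSTs, total data drives) for equally sized RAID sets."""
    if total_drives % drives_per_ost:
        raise ValueError("drives do not divide evenly into OSTs")
    osts = total_drives // drives_per_ost
    return osts, osts * (drives_per_ost - parity_drives)
```

For 36 drives in sets of 12 this gives 3 OSTs and 30 drives' worth of usable capacity. Choosing 6-drive sets instead would give 6 OSTs but only 24 data drives, trading capacity for more, smaller RAID sets.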

The OSS is a user space daemon that is started on the appro-

priate machine. It will work with any local file system that

is POSIX compliant and supported by the Linux distribution.

The underlying file system may be picked according to the work-

load – or personal preference and experience.

Like the MDS, the OSS is a fully multithreaded daemon as well –

and so as with the MDS, the choice of the number of threads has

to be made with the underlying hardware in mind. The most im-

portant factor for the number of OSS-threads is the performance

“transtec has successfully deployed the first petabyte filesystem

with BeeGFS. And it is very impressive to see how BeeGFS combines

performance with reliability.”

Hossein Rouhani | HPC Solution Engineer



Page 175: HPC Compass 2016_17


and number of the OSTs that the OSS is serving. In contrast to

the MDS, the traffic on the OSTs should be (and normally is) to a

large extent sequential.

File System Client

BeeGFS comes with a native client that runs in Linux – it is a kernel

module that has to be compiled to match the used kernel. The

client is an open-source product made available under GPL. The

client has to be installed on all hosts that should access BeeGFS

with maximum performance.

The client needs two services to start:

fhgfs-helperd: This daemon provides certain auxiliary function-

alities for the fhgfs-client. The helperd does not require any

additional configuration – it is just accessed by the fhgfs-client

running on the same host. The helperd mainly handles DNS lookups for

the client kernel module and provides the capability to write the


fhgfs-client: This service loads the client kernel module – if nec-

essary it will (re-)compile the kernel module. The recompile is

done using an automatic build process that is started when (for

example) the kernel version changes. This helps to make the file

system easy to use since, after applying a (minor or major) kernel

update, the BeeGFS client will still start up after a reboot, re-com-

pile and the file system will be available.

Maximum Scalability

• BeeGFS supports distributed file contents with flexible striping across the storage servers as well as distributed

• Best in class client throughput of 3.5 GB/s write and 4.0 GB/s read with a single I/O stream on FDR Infiniband

• Best in class metadata performance with linear scalability through dynamic metadata namespace partitioning

• Best in class storage throughput with flexible choice of underlying file system to perfectly fit the storage hardware.

Maximum Usability

BeeGFS was designed with easy installation and administration in mind. BeeGFS requires no kernel patches (the client is a patchless kernel module, the server components are userspace daemons), comes with graphical cluster installation tools and allows you to add more clients and servers to the running system whenever you want. The graphical administration and monitoring system enables users to handle typical management tasks in an intuitive way: cluster installation, storage service management, live throughput and health monitoring, file browsing, striping configuration and



• Data and metadata for each file are separated into stripes and distributed across the existing servers to aggregate server

• Secure authentication process between client and server to protect sensitive data

• Fair I/O option on user level to prevent a single user with multiple requests from stalling other users’ requests

• Automatic network failover, i.e. if Infiniband is down, BeeGFS automatically uses Ethernet if available

• Optional server failover for external shared memory resources - high availability

• Runs on non-standard architectures, e.g. PowerPC or Intel Xeon Phi

• Provides a coherent mode, in which it is guaranteed that changes to a file or directory by one client are always immediately visible to other clients.

Page 176: HPC Compass 2016_17



transtec has an outstanding history of parallel filesystem implementations spanning more than 12 years. Starting with some of the first Lustre implementations, some of them extending to nearly 1 petabyte gross capacity, which constituted an enormous size 10 years ago, and continuing with the deployment of many enterprise-ready commercial solutions, we have now successfully deployed dozens of BeeGFS environments of various sizes. In particular, transtec was the first provider of a 1-petabyte filesystem based on BeeGFS, then known as FraunhoferFS, or FhGFS.

No matter what the customer’s requirements are: transtec has

knowledge in all parallel filesystem technologies and is able to

provide the customer with an individually sized and tailor-made

solution that meets their demands concerning capacity and per-

formance, and above all – because this is what customers take

as a matter of course – is stable and reliable at the same time.

Parallel File Systems

General Parallel File System (GPFS)

Page 177: HPC Compass 2016_17


General Parallel Filesystem

The IBM General Parallel File System (GPFS) has always been considered

a pioneer of big data storage and continues today to lead

in introducing industry leading storage technologies.

Since 1998 GPFS has led the industry with many technologies

that make the storage of large quantities of file data possible.

The latest version continues in that tradition: GPFS 3.5 represents

a significant milestone in the evolution of big data management.

GPFS 3.5 introduces some revolutionary new features that clear-

ly demonstrate IBM’s commitment to providing industry leading

storage solutions.

What is GPFS?

GPFS is more than clustered file system software; it is a full

featured set of file management tools. This includes advanced

storage virtualization, integrated high availability, automated

tiered storage management and the performance to effectively

manage very large quantities of file data.

GPFS allows a group of

computers concurrent access to a common set of file data over

a common SAN infrastructure, a network or a mix of connection

types. The computers can run any mix of AIX, Linux or Windows

Server operating systems. GPFS provides storage management,

information life cycle management tools, centralized adminis-

tration and allows for shared access to file systems from remote

GPFS clusters providing a global namespace.

A GPFS cluster can be a single node, two nodes providing a high availability platform supporting a database application, for example, or thousands of nodes used for applications like the modeling of weather patterns. The largest existing configurations exceed 5,000 nodes. GPFS has been available since 1998 and has been field proven for more than 14 years on some of the world’s most powerful supercomputers to provide reliability and efficient use of infrastructure bandwidth.

GPFS was designed from the beginning to support high perfor-

mance parallel workloads and has since been proven very effec-

tive for a variety of applications. Today it is installed in clusters

supporting big data analytics, gene sequencing, digital media

and scalable file serving. These applications are used across many

industries including financial, retail, digital media, biotechnolo-

gy, science and government. GPFS continues to push technology

limits by being deployed in very demanding large environments.

You may not need multiple petabytes of data today, but you will,

and when you get there you can rest assured GPFS has already

been tested in these environments. This leadership is what makes

GPFS a solid solution for any size application. Supported operat-

ing systems for GPFS Version 3.5 include AIX, Red Hat, SUSE and

Debian Linux distributions and Windows Server 2008.

The file system

A GPFS file system is built from a collection of arrays that contain

the file system data and metadata. A file system can be built from

a single disk or contain thousands of disks storing petabytes of

data. Each file system can be accessible from all nodes within the

cluster. There is no practical limit on the size of a file system. The

architectural limit is 2^99 bytes. As an example, current GPFS customers

are using single file systems up to 5.4PB in size and others

have file systems containing billions of files.
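To put the architectural limit in perspective, it helps to convert it into familiar units. The short sketch below takes the limit to be 2^99 bytes and uses decimal petabytes for the 5.4PB example; the constant and function names are illustrative only.

```python
ARCH_LIMIT_BYTES = 2 ** 99   # architectural limit, taken here as 2^99 bytes
YIB = 2 ** 80                # one yobibyte
PB = 10 ** 15                # one (decimal) petabyte


def limit_in_yib() -> int:
    """Architectural limit expressed in yobibytes."""
    return ARCH_LIMIT_BYTES // YIB


def headroom(fs_bytes: int) -> int:
    """How many file systems of the given size fit under the limit."""
    return ARCH_LIMIT_BYTES // fs_bytes
```

The limit works out to 2^19 YiB, so even a 5.4PB file system uses a vanishingly small fraction of the addressable space, which is why the text says there is no practical limit.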

Application interfaces

Applications access files through standard POSIX file system interfaces.

Since all nodes see all of the file data, applications can

scale-out easily. Any node in the cluster can concurrently read or

update a common set of files. GPFS maintains the coherency and

consistency of the file system using sophisticated byte range

locking, token (distributed lock) management and journaling.

This means that applications using standard POSIX locking se-

mantics do not need to be modified to run successfully on a GPFS

file system.
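An application that already uses POSIX byte-range locks looks the same whether the file lives on a local disk or on a shared file system that honors those semantics. The sketch below uses Python's standard `fcntl.lockf` wrapper on a Unix-like system; `write_locked` is a hypothetical helper, and nothing in it is specific to GPFS.

```python
import fcntl
import os


def write_locked(path: str, offset: int, data: bytes) -> None:
    """Write data at offset while holding an exclusive lock on just that byte range."""
    with open(path, "r+b") as f:
        # Lock only len(data) bytes starting at offset; other ranges stay free.
        fcntl.lockf(f, fcntl.LOCK_EX, len(data), offset, os.SEEK_SET)
        try:
            f.seek(offset)
            f.write(data)
            f.flush()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN, len(data), offset, os.SEEK_SET)
```

Two processes that lock disjoint ranges of the same file do not block each other, which is the property that lets many nodes update a common file concurrently.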

Page 178: HPC Compass 2016_17


In addition to standard interfaces GPFS provides a unique set

of extended interfaces which can be used to provide advanced

application functionality. Using these extended interfaces an

application can determine the storage pool placement of a file,

create a file clone and manage quotas. These extended interfaces

provide features in addition to the standard POSIX interface.

Performance and scalability

GPFS provides unparalleled performance for unstructured data. GPFS achieves high performance I/O by:

• Striping data across multiple disks attached to multiple nodes.

• High performance metadata (inode) scans.

• Supporting a wide range of file system block sizes to match I/O requirements.

• Utilizing advanced algorithms to improve read-ahead and write-behind I/O operations.

• Using block level locking based on a very sophisticated scalable token management system to provide data consistency while allowing multiple application nodes concurrent access to the files.

When creating a GPFS file system you provide a list of raw devices

and they are assigned to GPFS as Network Shared Disks (NSD). Once

an NSD is defined, all of the nodes in the GPFS cluster can access the

disk, using local disk connection, or using the GPFS NSD network

protocol for shipping data over a TCP/IP or InfiniBand connection.

GPFS token (distributed lock) management coordinates access to

NSDs, ensuring the consistency of file system data and metadata

when different nodes access the same file. Token management

responsibility is dynamically allocated among designated nodes

in the cluster. GPFS can assign one or more nodes to act as token

managers for a single file system. This allows greater scalabil-

ity when you have a large number of files with high transaction

workloads. In the event of a node failure the token management

responsibility is moved to another node.


Page 179: HPC Compass 2016_17


All data stored in a GPFS file system is striped across all of the disks

within a storage pool, whether the pool contains 2 LUNs or 2,000

LUNs. This wide data striping allows you to get the best performance

for the available storage. When disks are added to or removed from

a storage pool existing file data can be redistributed across the new

storage to improve performance. Data redistribution can be done

automatically or can be scheduled. When redistributing data you

can assign a single node to perform the task to control the impact

on a production workload or have all of the nodes in the cluster par-

ticipate in data movement to complete the operation as quickly as

possible. Online storage configuration is a good example of an en-

terprise class storage management feature included in GPFS.

To achieve the highest possible data access performance GPFS

recognizes typical access patterns including sequential, reverse

sequential and random, optimizing I/O access for these patterns.

Along with distributed token management, GPFS provides scalable

metadata management by allowing all nodes of the cluster access-

ing the file system to perform file metadata operations. This feature

distinguishes GPFS from other cluster file systems which typically

have a centralized metadata server handling fixed regions of the

file namespace. A centralized metadata server can often become a

performance bottleneck for metadata intensive operations, limiting

scalability and possibly introducing a single point of failure. GPFS

solves this problem by enabling all nodes to manage metadata.

Administration

GPFS provides an administration model that is easy to use and

is consistent with standard file system administration practices

while providing extensions for the clustering aspects of GPFS. These

functions support cluster management and other standard file sys-

tem administration functions such as user quotas, snapshots and

extended access control lists. GPFS administration tools simpli-

fy cluster-wide tasks. A single GPFS command can perform a file

system function across the entire cluster and most can be issued

from any node in the cluster. Optionally you can designate a group

of administration nodes that can be used to perform all cluster

administration tasks, or only authorize a single login session to per-

form admin commands cluster-wide. This allows for higher security

by reducing the scope of node to node administrative access.

Rolling upgrades allow you to upgrade individual nodes in the clus-

ter while the file system remains online. Rolling upgrades are sup-

ported between two major version levels of GPFS (and service levels

within those releases). For example you can mix GPFS 3.4 nodes with

GPFS 3.5 nodes while migrating between releases.

Quotas enable the administrator to manage file system usage by us-

ers and groups across the cluster. GPFS provides commands to gen-

erate quota reports by user, group and on a sub-tree of a file system

called a fileset. Quotas can be set on the number of files (inodes) and

the total size of the files. New in GPFS 3.5 you can now define user

and group per fileset quotas which allows for more options in quo-

ta configuration. In addition to traditional quota management, the

GPFS policy engine can be used to query the file system metadata and

generate customized space usage reports. An SNMP interface al-

lows monitoring by network management applications. The SNMP

agent provides information on the GPFS cluster and generates traps

when events occur in the cluster. For example, an event is generated

when a file system is mounted or if a node fails. The SNMP agent

runs on Linux and AIX. You can monitor a heterogeneous cluster as

long as the agent runs on a Linux or AIX node.

You can customize the response to cluster events using GPFS

callbacks. A callback is an administrator defined script that is

executed when an event occurs, for example, when a file system is

unmounted or a file system is low on free space. Callbacks can

be used to create custom responses to GPFS events and integrate

these notifications into various cluster monitoring tools. GPFS

provides support for the Data Management API (DMAPI) interface

which is IBM’s implementation of the X/Open data storage man-

agement API. This DMAPI interface allows vendors of storage man-

agement applications such as IBM Tivoli Storage Manager (TSM)

and High Performance Storage System (HPSS) to provide Hierarchi-

cal Storage Management (HSM) support for GPFS.



GPFS supports POSIX and NFSv4 access control lists (ACLs). NFSv4

ACLs can be used to serve files using NFSv4, but can also be used in

other deployments, for example, to provide ACL support to nodes

running Windows. To provide concurrent access from multiple

operating system types GPFS allows you to run mixed POSIX and

NFSv4 permissions in a single file system and map user and group

IDs between Windows and Linux/UNIX environments. File systems

may be exported to clients outside the cluster through NFS. GPFS is

often used as the base for a scalable NFS file service infrastructure.

The GPFS clustered NFS (cNFS) feature provides data availability to

NFS clients by providing NFS service continuation if an NFS server

fails. This allows a GPFS cluster to provide scalable file service

by providing simultaneous access to a common set of data from

multiple nodes. The clustered NFS tools include monitoring of file

services and IP address fail over. GPFS cNFS supports NFSv3 only.

You can export a GPFS file system using NFSv4 but not with cNFS.

Data availability

GPFS is fault tolerant and can be configured for continued access

to data even if cluster nodes or storage systems fail. This is ac-

complished through robust clustering features and support for

synchronous and asynchronous data replication.

GPFS software includes the infrastructure to handle data con-

sistency and availability. This means that GPFS does not rely on

external applications for cluster operations like node failover.

The clustering support goes beyond who owns the data or who

has access to the disks. In a GPFS cluster all nodes see all of the

data and all cluster operations can be done by any node in the

cluster with a server license. All nodes are capable of perform-

ing all tasks. What tasks a node can perform is determined by the

type of license and the cluster configuration.

As a part of the built-in availability tools GPFS continuously mon-

itors the health of the file system components. When failures

are detected appropriate recovery action is taken automatically.



Extensive journaling and recovery capabilities are provided

which maintain metadata consistency when a node holding

locks or performing administrative services fails.

Snapshots can be used to protect the file system’s contents

against a user error by preserving a point in time version of the

file system or a sub-tree of a file system called a fileset. GPFS im-

plements a space efficient snapshot mechanism that generates a

map of the file system or fileset at the time the snapshot is taken.

New data blocks are consumed only when the file system data

has been deleted or modified after the snapshot was created.

This is done using a redirect-on-write technique (sometimes

called copy-on-write). Snapshot data is placed in existing storage

pools simplifying administration and optimizing the use of ex-

isting storage. The snapshot function can be used with a backup

program, for example, to run while the file system is in use and

still obtain a consistent copy of the file system as it was when the

snapshot was created. In addition, snapshots provide an online

backup capability that allows files to be recovered easily from

common problems such as accidental file deletion.
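The snapshot behaviour described above can be pictured with a small toy model. This is a minimal sketch, not the GPFS on-disk format: the class, its methods and the block numbering are all invented for illustration. It shows that taking a snapshot copies only the file map, and that a new block is consumed only when data is modified afterwards.

```python
class ToyFileSystem:
    """Toy block store illustrating redirect-on-write (hypothetical model)."""

    def __init__(self):
        self.blocks = {}       # block id -> data
        self.live = {}         # file name -> list of block ids
        self.snapshots = {}    # snapshot name -> frozen copy of the file map
        self._next_id = 0

    def _alloc(self, data):
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        return block_id

    def write_file(self, name, chunks):
        self.live[name] = [self._alloc(chunk) for chunk in chunks]

    def snapshot(self, snap_name):
        # Only the map is copied; no data block is duplicated.
        self.snapshots[snap_name] = {n: list(ids) for n, ids in self.live.items()}

    def overwrite(self, name, index, data):
        # The write is redirected to a new block; the old block stays
        # reachable through any snapshot that still refers to it.
        self.live[name][index] = self._alloc(data)

    def read(self, name, snap_name=None):
        file_map = self.snapshots[snap_name] if snap_name else self.live
        return [self.blocks[i] for i in file_map[name]]


fs = ToyFileSystem()
fs.write_file("results.dat", ["a", "b", "c"])
blocks_before = len(fs.blocks)
fs.snapshot("monday")
blocks_after_snapshot = len(fs.blocks)
fs.overwrite("results.dat", 1, "B")
```

Reading `results.dat` through the snapshot name still returns the original three chunks, which is what allows a backup program to copy a consistent image while the live file keeps changing.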

Data Replication

For an additional level of data availability and protection, synchro-

nous data replication is available for file system metadata and

data. GPFS provides a very flexible replication model that allows

you to replicate a file, set of files, or an entire file system. The rep-

lication status of a file can be changed using a command or by us-

ing the policy based management tools. Synchronous replication

allows for continuous operation even if a path to an array, an array

itself or an entire site fails.

Synchronous replication is location aware which allows you to

optimize data access when the replicas are separated across a

WAN. GPFS has knowledge of what copy of the data is “local” so

read-heavy applications can get local data read performance even

when data is replicated over a WAN. Synchronous replication works

well for many workloads by replicating data across storage arrays

within a data center, within a campus or across geographical dis-

tances using high quality wide area network connections.
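The idea of location-aware reads can be sketched in a few lines. The site names, device names and the function below are hypothetical; the sketch only assumes that each replica is tagged with the site that holds it and that a reader prefers the copy at its own site.

```python
def pick_read_replica(replicas, local_site):
    """replicas: list of (site, device) pairs holding the same block (hypothetical model)."""
    for site, device in replicas:
        if site == local_site:
            return site, device      # a local copy avoids the WAN round trip
    return replicas[0]               # no local copy: fall back to any replica


REPLICAS = [("site-a", "array1"), ("site-b", "array7")]
```

Writes are not modelled here: with synchronous replication every write still has to reach all replicas, which is why the text singles out read-heavy applications.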

When wide area network connections are not high performance

or are not reliable, an asynchronous approach to data replication

is required. GPFS 3.5 introduces a feature called Active File Man-

agement (AFM). AFM is a distributed disk caching technology de-

veloped at IBM Research that allows the expansion of the GPFS

global namespace across geographical distances. It can be used to

provide high availability between sites or to provide local “copies”

of data distributed to one or more GPFS clusters. For more details

on AFM see the section entitled Sharing data between clusters.

For a higher level of cluster reliability GPFS includes advanced

clustering features to maintain network connections. If a network

connection to a node fails GPFS automatically tries to reestablish

the connection before marking the node unavailable. This can pro-

vide for better uptime in environments communicating across a

WAN or experiencing network issues.

Using these features along with a high availability infrastructure

ensures a reliable enterprise class storage solution.

GPFS Native RAID (GNR)

Larger disk drives and larger file systems are creating challeng-

es for traditional storage controllers. Current RAID 5 and RAID 6

based arrays do not address the challenges of Exabyte scale stor-

age performance, reliability and management. To address these

challenges GPFS Native RAID (GNR) brings storage device man-

agement into GPFS. With GNR GPFS can directly manage thou-

sands of storage devices. These storage devices can be individual

disk drives or any other block device eliminating the need for a

storage controller.

GNR employs a de-clustered approach to RAID. The de-clustered

architecture reduces the impact of drive failures by spreading

data over all of the available storage devices improving appli-

cation IO and recovery performance. GNR provides very high

reliability through an 8+3 Reed-Solomon based RAID code that

divides each block of a file into 8 parts and associated parity.



This algorithm scales easily starting with as few as 11 storage

devices and growing to over 500 per storage pod. Spreading the

data over many devices helps provide predictable storage perfor-

mance and fast recovery times measured in minutes rather than

hours in the case of a device failure.
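The numbers in this description can be checked with some simple arithmetic. The sketch below takes the 8 data parts and 3 parity parts from the text; the even-spreading assumption behind `rebuild_share` is an idealisation introduced here, not a statement about how GNR actually schedules rebuild work.

```python
DATA_STRIPS = 8      # from the text: each block is divided into 8 parts
PARITY_STRIPS = 3    # from the text: 8+3 Reed-Solomon code


def min_devices(data=DATA_STRIPS, parity=PARITY_STRIPS):
    """Each strip of a block needs its own device."""
    return data + parity


def usable_fraction(data=DATA_STRIPS, parity=PARITY_STRIPS):
    """Share of raw capacity that holds user data."""
    return data / (data + parity)


def rebuild_share(num_devices):
    """Idealised share of rebuild reads per surviving device after one failure,
    assuming strips are spread evenly over all devices (an assumption)."""
    return 1 / (num_devices - 1)
```

With these values `min_devices()` gives the 11 devices mentioned in the text and roughly 73% of the raw capacity is usable. Under the even-spreading assumption, each surviving device in a 500-device pod contributes only about 0.2% of the rebuild reads, which is the intuition behind recovery times measured in minutes.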

In addition to performance improvements GNR provides

advanced checksum protection to ensure data integrity.

Checksum information is stored on disk and verified all the way

to the NSD client.

Information lifecycle management (ILM) toolset

GPFS can help you to achieve data lifecycle management efficien-

cies through policy-driven automation and tiered storage man-

agement. The use of storage pools, filesets and user-defined poli-

cies provides the ability to better match the cost of your storage to

the value of your data.

Storage pools are used to manage groups of disks within a file sys-

tem. Using storage pools you can create tiers of storage by group-

ing disks based on performance, locality or reliability characteris-

tics. For example, one pool could contain high performance solid

state disks (SSDs) and another more economical 7,200 RPM disk

storage. These types of storage pools are called internal storage

pools. When data is placed in or moved between internal storage

pools all of the data management is done by GPFS. In addition to

internal storage pools GPFS supports external storage pools.

External storage pools are used to interact with an external stor-

age management application including IBM Tivoli Storage Man-

ager (TSM) and High Performance Storage System (HPSS). When

moving data to an external pool GPFS handles all of the metada-

ta processing then hands the data to the external application for

storage on alternate media, tape for example. When using TSM or

HPSS data can be retrieved from the external storage pool on de-

mand, as a result of an application opening a file or data can be



retrieved in a batch operation using a command or GPFS policy.

A fileset is a sub-tree of the file system namespace and provides

a way to partition the namespace into smaller, more manageable units.


Filesets provide an administrative boundary that can be used to

set quotas, take snapshots, define AFM relationships and be used

in user defined policies to control initial data placement or data

migration. Data within a single fileset can reside in one or more

storage pools. Where the file data resides and how it is managed

once it is created is based on a set of rules in a user defined policy.

There are two types of user defined policies in GPFS: file placement

and file management. File placement policies determine in which

storage pool file data is initially placed. File placement rules are

defined using attributes of a file known when a file is created such

as file name, fileset or the user who is creating the file. For example

a placement policy may be defined that states ‘place all files with

names that end in .mov onto the near-line SAS based storage pool

and place all files created by the CEO onto the SSD based storage

pool’ or ‘place all files in the fileset ‘development’ onto the SAS

based storage pool’.
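The placement examples above can be modelled as an ordered list of rules where the first match decides the pool. This is a sketch in Python rather than real GPFS policy syntax; the pool names, the `"system"` default and the data classes are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NewFile:
    name: str
    fileset: str
    owner: str


@dataclass
class PlacementRule:
    pool: str
    matches: Callable[[NewFile], bool]


# Hypothetical pool names mirroring the examples in the text.
RULES: List[PlacementRule] = [
    PlacementRule("nearline_sas", lambda f: f.name.endswith(".mov")),
    PlacementRule("ssd", lambda f: f.owner == "ceo"),
    PlacementRule("sas", lambda f: f.fileset == "development"),
]


def place(new_file, rules, default_pool="system"):
    """Return the pool of the first matching rule (default pool name is assumed)."""
    for rule in rules:
        if rule.matches(new_file):
            return rule.pool
    return default_pool
```

Only attributes known at creation time (name, fileset, owner) appear in `NewFile`, which reflects the restriction the text places on placement rules. Because the first match wins, a `.mov` file created by the CEO lands in the near-line pool in this model; reordering the rules would change that.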

Once files exist in a file system, file management policies can be

used for file migration, deletion, changing file replication status

or generating reports. You can use a migration policy to transpar-

ently move data from one storage pool to another without chang-

ing the file’s location in the directory structure. Similarly you can

use a policy to change the replication status of a file or set of files,

allowing fine grained control over the space used for data availa-

bility. You can use migration and replication policies together, for

example a policy that says: ‘migrate all of the files located in the

subdirectory /database/payroll which end in *.dat and are greater

than 1 MB in size to storage pool #2 and un-replicate these files’.

File deletion policies allow you to prune the file system, deleting

files as defined by policy rules. Reporting on the contents of a file

system can be done through list policies. List policies allow you

to quickly scan the file system metadata and produce information

listing selected attributes of candidate files. File management pol-

icies can be based on more attributes of a file than placement pol-

icies because once a file exists there is more known about the file.

For example, file management policies can utilize attributes such

as last access time, size of the file or a mix of user and file size. This

may result in policies like: ‘Delete all files with a name ending in

.temp that have not been accessed in the last 30 days’, or ‘Migrate

all files owned by Sally that are larger than 4GB to the SATA stor-

age pool’.
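The two example policies quoted above translate into a small decision function. This sketch is Python, not GPFS policy language; the record layout, the lower-case owner name and the reading of "4GB" as 4 GiB are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

GIB = 1024 ** 3   # "4GB" in the text is read as 4 GiB here (an assumption)


@dataclass
class FileInfo:
    name: str
    owner: str
    size_bytes: int
    last_access: datetime


def decide(file_info, now):
    """Return the action for one existing file (hypothetical helper)."""
    if file_info.name.endswith(".temp") and now - file_info.last_access > timedelta(days=30):
        return "delete"
    if file_info.owner == "sally" and file_info.size_bytes > 4 * GIB:
        return "migrate to sata"
    return "keep"
```

Unlike a placement rule, this function can look at size and last access time, because those attributes only exist once the file has been written and used.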

Rule processing can be further automated by including attrib-

utes related to a storage pool instead of a file using the thresh-

old option. Using thresholds you can create a rule that moves

files out of the high performance pool if it is more than 80%

full, for example. The threshold option comes with the ability to

set high, low and pre-migrate thresholds. Pre-migrated files are

files that exist on disk and have also been copied to tape. This method

is typically used to allow disk access to the data while allow-

ing disk space to be freed up quickly when a maximum space

threshold is reached. This means that GPFS begins migrating

data at the high threshold, until the low threshold is reached.

If a pre-migrate threshold is set GPFS begins copying data un-

til the pre-migrate threshold is reached. This allows the data to

continue to be accessed in the original pool until it is quickly

deleted to free up space the next time the high threshold is

reached. Thresholds allow you to fully utilize your highest per-

formance storage and automate the task of making room for

new high priority content.
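The high and low thresholds can be illustrated with a short simulation. In this sketch migration starts once the pool is more than `high_pct` full and continues until occupancy is at or below `low_pct`. The choice to move the least recently accessed files first is an assumption made for the example; the text does not say how candidates are ordered.

```python
def files_to_migrate(files, capacity, high_pct, low_pct):
    """files: list of (name, size, last_access) tuples in the fast pool.

    Returns the names to move out, oldest access first (assumed ordering).
    """
    used = sum(size for _, size, _ in files)
    if used * 100 <= high_pct * capacity:
        return []                       # high threshold not exceeded
    moved = []
    for name, size, _ in sorted(files, key=lambda f: f[2]):
        if used * 100 <= low_pct * capacity:
            break                       # low threshold reached
        moved.append(name)
        used -= size
    return moved


POOL = [("a.dat", 30, 1), ("b.dat", 30, 2), ("c.dat", 25, 3)]
```

With a capacity of 100 units, `POOL` is 85% full; using thresholds of 80 and 60, only `a.dat` has to move, because removing it brings the pool down to 55%.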

Policy rule syntax is based on the SQL 92 syntax standard and

supports multiple complex statements in a single rule enabling

powerful policies. Multiple levels of rules can be applied to a

file system, and rules are evaluated in order for each file when

the policy engine executes allowing a high level of flexibility.



GPFS provides unique functionality through standard interfaces;

an example of this is extended attributes. Extended attributes

are a standard POSIX facility.

GPFS has long supported the use of extended attributes, though

in the past they were not commonly used, in part because of per-

formance concerns. In GPFS 3.4, a comprehensive redesign of the

extended attributes support infrastructure was implemented,

resulting in significant performance improvements. In GPFS 3.5,

extended attributes are accessible by the GPFS policy engine al-

lowing you to write rules that utilize your custom file attributes.

Executing file management operations requires the ability to

efficiently process the file metadata. GPFS includes a high per-

formance metadata scan interface that allows you to efficiently

process the metadata for billions of files. This makes the GPFS ILM

toolset a very scalable tool for automating file management. This

high performance metadata scan engine employs a scale-out ap-

proach. The identification of candidate files and data movement

operations can be performed concurrently by one or more nodes

in the cluster. GPFS can spread rule evaluation and data move-

ment responsibilities over multiple nodes in the cluster pro-

viding a very scalable, high performance rule processing engine.

Cluster configurations

GPFS supports a variety of cluster configurations independent of

which file system features you use. Cluster configuration options

can be characterized into four basic categories:

• Shared disk

• Network block I/O

• Synchronously sharing data between clusters

• Asynchronously sharing data between clusters

Shared disk

A shared disk cluster is the most basic environment. In this con-

figuration, the storage is directly attached to all machines in the

cluster as shown in Figure 1. The direct connection means that



each shared block device is available concurrently to all of the

nodes in the GPFS cluster. Direct access means that the storage

is accessible using a SCSI or other block level protocol using a

SAN, InfiniBand, iSCSI, Virtual IO interface or other block level IO

connection technology. Figure 1 illustrates a GPFS cluster where

all nodes are connected to a common fibre channel SAN. This

example shows a fibre channel SAN though the storage attach-

ment technology could be InfiniBand, SAS, FCoE or any other.

The nodes are connected to the storage using the SAN and to

each other using a LAN. Data used by applications running on

the GPFS nodes flows over the SAN and GPFS control informa-

tion flows among the GPFS instances in the cluster over the LAN.

This configuration is optimal when all nodes in the cluster need

the highest performance access to the data. For example, this is

a good configuration for providing network file service to client

systems using clustered NFS, high-speed data access for digital

media applications or a grid infrastructure for data analytics.

Network-based block IO

As data storage requirements increase and new storage and con-

nection technologies are released a single SAN may not be a suf-

ficient or appropriate choice of storage connection technology.

In environments where every node in the cluster is not attached

to a single SAN, GPFS makes use of an integrated network block

device capability. GPFS provides a block level interface over

TCP/IP networks called the Network Shared Disk (NSD) protocol.

Whether using the NSD protocol or a direct attachment to the

SAN, the mounted file system looks the same to the application;

GPFS transparently handles I/O requests.

GPFS clusters can use the NSD protocol to provide high speed data

access to applications running on LAN-attached nodes. Data is

served to these client nodes from one or more NSD servers. In this

configuration, disks are attached only to the NSD servers. Each

NSD server is attached to all or a portion of the disk collection.

With GPFS you can define up to eight NSD servers per disk and

it is recommended that at least two NSD servers are defined for

each disk to avoid a single point of failure.

GPFS uses the NSD protocol over any TCP/IP capable network

fabric. On Linux GPFS can use the VERBS RDMA protocol on com-

patible fabrics (such as InfiniBand) to transfer data to NSD cli-

ents. The network fabric does not need to be dedicated to GPFS,

but should provide sufficient bandwidth to meet your GPFS

performance expectations and for applications which share the

bandwidth. GPFS has the ability to define a preferred network

subnet topology, for example designate separate IP subnets for

intra-cluster communication and the public network. This pro-

vides for a clearly defined separation of communication traffic

and allows you to increase the throughput and possibly the

number of nodes in a GPFS cluster. Allowing access to the same

disk from multiple subnets means that all of the NSD clients do

not have to be on a single physical network. For example you

can place groups of clients onto separate subnets that access

a common set of disks through different NSD servers so not all

NSD servers need to serve all clients. This can reduce the net-

working hardware costs and simplify the topology reducing

support costs, providing greater scalability and greater overall performance.


Figure 1: SAN attached Storage



An example of the NSD server model is shown in Figure 2. In this

configuration, a subset of the total node population is defined

as NSD server nodes. The NSD server is responsible for the ab-

straction of disk data blocks across a TCP/IP or InfiniBand VERBS

(Linux only) based network. The fact that the disks are remote

is transparent to the application. Figure 2 shows an example of

a configuration where a set of compute nodes are connected

to a set of NSD servers using a high-speed interconnect or an

IP-based network such as Ethernet. In this example, data to the

NSD servers flows over the SAN and both data and control infor-

mation to the clients flow across the LAN. Since the NSD servers

are serving data blocks from one or more devices data access is

similar to a SAN attached environment in that data flows from

all servers simultaneously to each client. This parallel data ac-

cess provides the best possible throughput to all clients. In ad-

dition it provides the ability to scale up the throughput even to

a common data set or even a single file.

The choice of how many nodes to configure as NSD servers is

based on performance requirements, the network architecture

and the capabilities of the storage subsystems. High bandwidth

LAN connections should be used for clusters requiring signifi-

cant data transfer rates. This can include 1 Gbit or 10 Gbit Ether-

net. For additional performance or reliability you can use link ag-

gregation (EtherChannel or bonding), networking technologies

like source based routing or higher performance networks such

as InfiniBand.

The choice between SAN attachment and network block I/O is a

performance and economic one. In general, using a SAN provides

the highest performance, but the cost and management com-

plexity of SANs for large clusters is often prohibitive. In these

cases network block I/O provides an option. Network block I/O

is well suited to grid computing and clusters with sufficient net-

work bandwidth between the NSD servers and the clients. For

example, an NSD protocol based grid is effective for web applica-

tions, supply chain management or modeling weather patterns.



Mixed Clusters

The last two sections discussed shared disk and network attached

GPFS cluster topologies. You can mix these storage attachment

methods within a GPFS cluster to better match the I/O require-

ments to the connection technology. A GPFS node always tries to

find the most efficient path to the storage. If a node detects a block

device path to the data it is used. If there is no block device path

then the network is used. This capability can be leveraged to pro-

vide additional availability. If a node is SAN attached to the storage

and there is an HBA failure, for example, GPFS can fail over to using

the network path to the disk. A mixed cluster topology can provide

direct storage access to non-NSD server nodes for high performance

operations including backups or data ingest.
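The path selection rule described here is simple enough to state as code. This is a simplified model with invented names: a node uses a block device path when it has one and otherwise falls back to the network.

```python
def choose_io_path(block_device_ok, network_ok):
    """Return the path a node would use in this simplified model (hypothetical helper)."""
    if block_device_ok:
        return "block device"     # direct SAN path is preferred
    if network_ok:
        return "network"          # e.g. after an HBA failure
    raise RuntimeError("no path to the disk")
```

Calling `choose_io_path(False, True)` models the HBA failure case from the text: the node loses its direct path but keeps access to the data through the network.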

Sharing data between clusters

There are two methods available to share data across GPFS clusters:

GPFS multi-cluster and a new feature called Active File Management (AFM).


Figure 2: Network block IO (NSD clients, I/O servers, SAN storage)

GPFS Multi-cluster allows you to utilize the native GPFS protocol to

share data across clusters. Using this feature you can allow other

clusters to access one or more of your file systems and you can

mount file systems that belong to other GPFS clusters for which

you have been authorized. A multi-cluster environment allows the

administrator to permit access to specific file systems from another

GPFS cluster. This feature is intended to allow clusters to share data

at higher performance levels than file sharing technologies like NFS

or CIFS. It is not intended to replace such file sharing technologies

which are optimized for desktop access or for access across unrelia-

ble network links. Multi-cluster capability is useful for sharing across

multiple clusters within a physical location or across locations. Clus-

ters are most often attached using a LAN, but in addition the cluster

connection could include a SAN. Figure 4 illustrates a multi-cluster

configuration with both LAN and mixed LAN and SAN connections.

In Figure 4, Cluster B and Cluster C need to access the data from

Cluster A. Cluster A owns the storage and manages the file system.

It may grant access to file systems which it manages to remote clus-

ters such as Cluster B and Cluster C. In this example, Cluster B and

Cluster C do not have any storage but that is not a requirement. They

could own file systems which may be accessible outside their clus-

ter. Commonly in the case where a cluster does not own storage, the

nodes are grouped into clusters for ease of management.

Figure 3: Mixed Cluster Architecture (NSD clients, NSD servers, SAN storage, application nodes)


When the remote clusters need access to the data, they mount the

file system by contacting the owning cluster and passing required

security checks. In Figure 4, Cluster B accesses the data through the

NSD protocol. Cluster C accesses data through an extension of the

SAN. For Cluster C, access to the data is similar to the nodes in the

host cluster. Control and administrative traffic flows through the IP

network shown in Figure 4 and data access is direct over the SAN.

Both types of configurations are possible and as in Figure 4 can be

mixed as required. Multi-cluster environments are well suited to

sharing data across clusters belonging to different organizations

for collaborative computing, grouping sets of clients for adminis-

trative purposes or implementing a global namespace across sep-

arate locations. A multi-cluster configuration allows you to connect

GPFS clusters within a data center, across campus or across reliable

WAN links. For sharing data between GPFS clusters across less re-

liable WAN links or in cases where you want a copy of the data in

multiple locations you can use a new feature introduced in GPFS 3.5

called Active File Management.

Active File Management (AFM) allows you to create associations be-

tween GPFS clusters. Now the location and flow of file data between

GPFS clusters can be automated. Relationships between GPFS clus-

ters using AFM are defined at the fileset level. A fileset in a file sys-

tem can be created as a “cache” that provides a view to a file system

in another GPFS cluster called the “home.” File data is moved into a

cache fileset on demand. When a file is read in the cache fileset

the file data is copied from the home into the cache fileset. Data

consistency and file movement into and out of the cache are man-

aged automatically by GPFS.

Figure 4: Multi-cluster (Cluster A, Cluster B, Cluster C)

Cache filesets can be read-only or writeable. Cached data is locally

read or written. On read if the data is not in the “cache” then

GPFS automatically creates a copy of the data. When data is written

into the cache the write operation completes locally then GPFS

asynchronously pushes the changes back to the home location.
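The read and write behaviour of a cache fileset can be sketched with a toy class. All names here are invented, a plain dictionary stands in for the home fileset, and the asynchronous push is modelled by an explicit `flush()` call rather than a background task.

```python
from collections import deque


class ToyCacheFileset:
    """Read-through, write-behind cache (hypothetical model, not the AFM implementation)."""

    def __init__(self, home):
        self.home = home           # dict standing in for the home fileset
        self.cache = {}            # locally held file data
        self.pending = deque()     # paths written locally but not yet pushed home

    def read(self, path):
        if path not in self.cache:
            self.cache[path] = self.home[path]   # copied from home on demand
        return self.cache[path]

    def write(self, path, data):
        self.cache[path] = data                  # the write completes locally
        self.pending.append(path)

    def flush(self):
        """Stand-in for the asynchronous push back to home."""
        while self.pending:
            path = self.pending.popleft()
            self.home[path] = self.cache[path]


home = {"/data1/a": "v1"}
cache = ToyCacheFileset(home)
```

Between `write()` and `flush()` the home copy is stale, which is the trade-off that lets the cache site keep local write performance over a slow or unreliable link.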

You can define multiple cache filesets for each home data source.

The number of cache relationships for each home is limited only by

the bandwidth available at the home location. Placing a quota on a

cache fileset causes the data to be cleaned (evicted) out of the cache

automatically based on the space available. If you do not set a quota

a copy of the file data remains in the cache until manually evicted

or deleted. You can use AFM to create a global namespace within

a data center, across a campus or between data centers located

around the world. AFM is designed to enable efficient data transfers

over wide area network (WAN) connections. When file data is read

from home into cache that transfer can happen in parallel within a

node called a gateway or across multiple gateway nodes.

Using AFM you can now create a truly world-wide global namespace.

Figure 5 is an example of a global namespace built using AFM. In

this example each site owns 1/3 of the namespace; store3 “owns”

/data5 and /data6 for example. The other sites have cache filesets

that point to the home location for the data owned by the other

clusters. In this example, on store3 /data1-4 cache data owned by the

other two clusters. This provides the same namespace within each

GPFS cluster providing applications one path to a file across the en-

tire organization.
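The layout in this example can be written down as a small lookup table. The text only states that store3 owns /data5 and /data6 and that each site owns a third of the namespace; the split of /data1 to /data4 between store1 and store2 below is an assumption, as are the function names.

```python
HOME_OF = {
    "/data1": "store1", "/data2": "store1",   # assumed split
    "/data3": "store2", "/data4": "store2",   # assumed split
    "/data5": "store3", "/data6": "store3",   # stated in the text
}


def fileset_role(store, path):
    """'local' if the store is the home for this path, otherwise 'cache'."""
    return "local" if HOME_OF[path] == store else "cache"


def client_view(store):
    """Paths a client sees at one site, with the role of each fileset there."""
    return {"/global" + path: fileset_role(store, path) for path in sorted(HOME_OF)}
```

Every site returns the same set of `/global/...` paths from `client_view`; only the role behind each path differs, which is what gives applications one path to a file across the organization.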

You can achieve this type of global namespace with multi-cluster or

AFM; which method you choose depends on the quality of the WAN

link and the application requirements.

Figure 5: Global Namespace using AFM (Store 1, Store 2 and Store 3; clients at every site access /global/data1 through /global/data6)


What’s new in GPFS Version 3.5

For those who are familiar with GPFS 3.4 this section provides a

list of what is new in GPFS Version 3.5.

Active File Management

When GPFS was introduced in 1998 it represented a revolution

in file storage. For the first time a group of servers could share

high performance access to a common set of data over a SAN or

network. The ability to share high performance access to file data

across nodes was the introduction of the global namespace.

Later GPFS introduced the ability to share data across multiple

GPFS clusters. This multi-cluster capability enabled data sharing

between clusters allowing for better access to file data.

This further expanded the reach of the global namespace from

within a cluster to across clusters spanning a data center or a

country. There were still challenges to building a multi-cluster

global namespace. The big challenge is working with unreliable

and high latency network connections between the servers.

Active File Management (AFM) in GPFS addresses the WAN band-

width issues and enables GPFS to create a world-wide global name-

space. AFM ties the global namespace together asynchronous-

ly providing local read and write performance with automated

namespace management. It allows you to create associations be-

tween GPFS clusters and define the location and flow of file data.

High Performance Extended Attributes

GPFS has long supported the use of extended attributes, though

in the past they were not commonly used, in part because of



performance concerns. In GPFS 3.4, a comprehensive redesign of

the extended attributes support infrastructure was implement-

ed, resulting in significant performance improvements. In GPFS

3.5, extended attributes are accessible by the GPFS policy engine

allowing you to write rules that utilize your custom file attributes.


Now an application can use standard POSIX interfaces to man-

age extended attributes and the GPFS policy engine can utilize

these attributes.
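As a minimal sketch of the POSIX-style interface mentioned above, the following Python snippet sets and reads back a user extended attribute. The attribute name `user.project` and its value are hypothetical, and nothing here is specific to GPFS. The helper returns `None` where the platform or file system does not support extended attributes.

```python
import os
import tempfile


def tag_file(path, name, value):
    """Set a user extended attribute and read it back.

    Returns the stored bytes, or None if the platform or the
    underlying file system does not support extended attributes.
    """
    if not hasattr(os, "setxattr"):  # os.setxattr is only available on Linux
        return None
    try:
        os.setxattr(path, name, value)
        return os.getxattr(path, name)
    except OSError:
        return None


with tempfile.NamedTemporaryFile() as f:
    # "user.project" is a hypothetical custom attribute name.
    result = tag_file(f.name, "user.project", b"climate-2016")
    print(result)
```

A policy engine could then select files by such an attribute. The rule syntax for that is product-specific and not shown here.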

Independent Filesets

Effectively managing a file system with billions of files requires advanced file management technologies. GPFS 3.5 introduces a new concept called the independent fileset. An independent fileset has its own inode space. This means that an independent fileset can be managed much like a separate file system while still allowing you to realize the benefits of storage consolidation.

An example of an efficiency introduced with independent file-

sets is improved policy execution performance. If you use an in-

dependent fileset as a predicate in your policy, GPFS only needs

to scan the inode space represented by that fileset, so if you have

1 billion files in your file system and a fileset has an inode space

of 1 million files, the scan only has to look at 1 million inodes. This

instantly makes the policy scan much more efficient. Independ-

ent filesets enable other new fileset features in GPFS 3.5.
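The saving in the example above can be written out as simple arithmetic. This is only an illustration using the figures quoted in the text; the function name is hypothetical and actual scan time depends on many other factors.

```python
def scan_reduction(total_inodes, fileset_inodes):
    """Factor by which a fileset-scoped scan reduces the inodes visited."""
    return total_inodes / fileset_inodes


# Figures from the text: 1 billion files in the file system,
# 1 million inodes in the independent fileset.
factor = scan_reduction(1_000_000_000, 1_000_000)
print(f"The fileset-scoped scan visits {factor:.0f}x fewer inodes")
```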

Fileset Level Snapshots

Snapshot granularity is now at the fileset level in addition to file

system level snapshots.

Fileset Level Quotas

User and group quotas can be set per fileset.

File Cloning

File clones are space efficient copies of a file where two instances

of a file share data they have in common and only changed

blocks require additional storage. File cloning is an efficient way

to create a copy of a file, without the overhead of copying all of

the data blocks.
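The block-sharing idea behind file clones can be sketched with a toy model. The classes below are hypothetical and do not reflect how GPFS implements cloning; they only show that a clone shares blocks with its source and that storage grows only when a block is changed.

```python
class BlockStore:
    """Toy block store that keeps every physical block in a dict."""

    def __init__(self):
        self.blocks = {}
        self.next_id = 0

    def put(self, data):
        block_id = self.next_id
        self.blocks[block_id] = data
        self.next_id += 1
        return block_id


class File:
    """A file is an ordered list of block ids in a shared store."""

    def __init__(self, store, block_ids):
        self.store = store
        self.block_ids = list(block_ids)

    @classmethod
    def create(cls, store, chunks):
        return cls(store, [store.put(c) for c in chunks])

    def clone(self):
        # Share the existing block ids; no data blocks are copied.
        return File(self.store, self.block_ids)

    def write_block(self, index, data):
        # Simplification: always allocate a new block for changed data,
        # leaving the block still referenced by the other file untouched.
        self.block_ids[index] = self.store.put(data)

    def read(self):
        return b"".join(self.store.blocks[i] for i in self.block_ids)


store = BlockStore()
original = File.create(store, [b"aaaa", b"bbbb", b"cccc"])
copy = original.clone()
blocks_after_clone = len(store.blocks)   # still only the original blocks

copy.write_block(1, b"BBBB")
blocks_after_write = len(store.blocks)   # one extra block for the change
print(blocks_after_clone, blocks_after_write)
```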

IPv6 Support

IPv6 support in GPFS means that nodes can be defined using multiple addresses, both IPv4 and IPv6.

GPFS Native RAID

GPFS Native RAID (GNR) brings storage RAID management into the GPFS NSD server. With GNR, GPFS directly manages JBOD-based storage. This feature provides greater availability, flexibility and performance for a variety of application workloads.

GNR implements a Reed-Solomon based de-clustered RAID tech-

nology that can provide high availability and keep drive failures

from impacting performance by spreading the recovery tasks

over all of the disks. Unlike standard Network Shared Disk (NSD) data access, GNR is tightly integrated with the storage hardware. For GPFS 3.5, GNR is available on the IBM Power 775 Supercomputer platform.

Page 192: HPC Compass 2016_17


Parallel File Systems


Lustre

Making Parallel Storage Simpler and More Productive

Not long ago, storage for high performance computing (HPC)

meant complexity and challenging management. HPC was the

domain of government-sponsored research or data-intensive ap-

plications, like weather prediction or large-scale manufacturing

in the aerospace and automotive sectors.

Today, HPC has expanded beyond just national laboratories and

research institutes to become a key technology for enterprises

of all sizes as they seek to develop improved products or entirely

new industries. Getting the maximum performance from HPC and

data-intensive applications requires fast and scalable storage

software. Simply put, today’s HPC workloads require storage infra-

structure that scales endlessly and delivers unmatched I/O levels.

Intel Enterprise Edition for Lustre software (Intel EE for Lustre software) unleashes the performance and scalability of the Lustre parallel file system for HPC workloads, including technical ‘big data’ applications common within today’s enterprises. It allows end-users that need the benefits of large-scale, high bandwidth storage to tap the power and scalability of Lustre, with the simplified installation, configuration, and management features provided by Intel Manager for Lustre software, a management solution purpose-built by the Lustre experts at Intel for the Lustre file system. Further, Intel EE for Lustre software is backed by Intel, a recognized technical support provider for Lustre, including 24/7 service level agreement (SLA) coverage.


> Product Components

Intel Manager for Lustre

Intel Manager for Lustre software includes simple, but pow-

erful, management tools that provide a unified, consistent

view of Lustre storage systems and simplify the installation,

configuration, monitoring, and overall management of Lustre.


Page 193: HPC Compass 2016_17


The manager consolidates all Lustre information in a central,

browser-accessible location for ease of use.

Integrated Apache Hadoop Adapter

When organizations operate both Lustre and Apache Hadoop

within a shared HPC infrastructure, there is a compelling use

case for using Lustre as the file system for Hadoop analytics, as

well as HPC storage.

Intel Enterprise Edition for Lustre includes an Intel-developed

adapter which allows users to run MapReduce applications di-

rectly on Lustre. This optimizes the performance of MapReduce

operations while delivering faster, more scalable, and easier to

manage storage.

> Benefits


Intel EE for Lustre software has been designed to enable ful-

ly parallel I/O throughput across thousands of clients, servers,

and storage devices. Metadata and data are stored on separate

servers to allow optimization of each system for the different

workloads they present. Improved metadata scalability through the Distributed Namespace Environment (DNE) feature is now integrated in Intel Manager for Lustre. Intel EE for Lustre can also scale down efficiently to provide fast parallel storage for smaller organizations.


The object-based storage architecture of Intel EE for Lustre soft-

ware can scale to tens of thousands of clients and petabytes of



Intel EE for Lustre software is based on the community release of

Lustre software, and is hardware, server, and network fabric neu-

tral. Enterprises can scale their storage deployments horizontal-

ly, yet continue to have simple-to-manage storage.


Lustre has been in use in the world’s largest datacenters for over

a decade and hardened in the harshest big data environments.

Intel EE for Lustre software is rigorously tested, reliable, and

backed by Intel, the leading provider of technical support for

Lustre software. Intel EE for Lustre software delivers commercial-ready Lustre in a package that can scale efficiently both up

and down to suit your business workloads, with built-in manage-


Key Features

Lustre powers 60% of the world’s top 100 fastest computers

(www.top500.org). Unleash the Lustre parallel file system as an

enterprise platform – offering higher throughput and helping to

prevent bottlenecks.

• Built on the community release of Lustre software
• Intel Manager for Lustre simplifies install and configuration
• Enormous storage capacity and I/O
• Open, documented interfaces for deep integration
• Throughput in excess of 1 terabyte per second
• Resilient, highly available storage
• Centralized, GUI-based administration for management
• Integrated support for Hadoop MapReduce applications with Lustre storage
• Rigorously tested, stable software proven across diverse
• Flexible storage solution based on Intel-enhanced community release software
• Global 24/7 technical support

> What’s New

Differentiated Storage Services (DSS)

Available only with Intel EE for Lustre software, this advanced

feature allows data “hints” to be passed through Lustre, enabling

cache mechanisms to prioritize data for optimal performance.

Page 194: HPC Compass 2016_17


Intel Manager for Lustre Monitoring for OpenZFS

Beginning with this release, storage solutions based on Intel EE

for Lustre software can exploit the data resiliency and volume

management capabilities available with OpenZFS.

Integrated HPC Scheduler “Connector” for MapReduce

Building upon the unique Intel software “connector” that allows

Lustre to replace HDFS, the newest connector couples SLURM

with YARN, and augments the Hadoop scheduler allowing

MapReduce applications to be scheduled like any HPC job.

Support for New System Software

Intel EE for Lustre adds support for Red Hat Enterprise Linux and CentOS 6.7 on both servers and clients, and for Red Hat Enterprise Linux and CentOS 7.1 on clients only.

Increasing Metadata Performance

More easily scale out the metadata performance of Lustre by taking advantage of the new DNE feature using Intel Manager for Lustre.


Figure: Intel Enterprise Edition for Lustre software components (Administrator’s UI and Dashboard, CLI, Management and Monitoring Services, Lustre File System, Storage Plug-In, tiered storage, Global Support and Services from Intel)

Page 195: HPC Compass 2016_17


Compare Editions

Editions: Intel Enterprise Edition for Lustre Software | Intel Foundation Edition for Lustre Software | Intel Cloud Edition for Lustre Software

Delivered with powerful storage solutions from Intel Lustre solution reseller | x x x
Easy-to-use web-based GUI to simplify deployment, management and monitoring of the Lustre file system | Uses Intel Manager for Lustre Software | Uses AWS embedded monitoring tools
Intel software connectors vital for MapReduce and analytics with Lustre and HPC | x
Stable Enterprise platform, built for reliability and performance – with long term support | x x
Full distribution of the community release of Lustre | x x
Intel training for reseller partners | x x
Intel backed support for debugging and patches | x x x
Developed and supported by Intel, the Lustre experts | x x x

Intel Enterprise Edition for Lustre Software couples the perfor-

mance and scalability of the most widely used file system for

HPC, with advanced storage management tools for enterprises

needing scale-out storage to fuel their technical applications or

Hadoop MapReduce workloads.

Intel Foundation Edition for Lustre Software delivers a scalable

Lustre storage solution with trusted Intel support. It is ideal for

organizations who prefer to design and deploy their own com-

munity configurations, and want to augment in-house manage-

ment and support capabilities with Intel expertise.

Intel Cloud Edition for Lustre Software brings Lustre to a cloud

environment via Amazon Web Services, allowing data scientists

and developers instant access to parallel clustered storage.

Page 196: HPC Compass 2016_17
Page 197: HPC Compass 2016_17

Intel Omni-Path Interconnect

Intel® Omni-Path Architecture (Intel® OPA), an element of Intel® Scalable System Frame-

work, delivers the performance for tomorrow’s high performance computing (HPC)

workloads and the ability to scale to tens of thousands of nodes – and eventually more

– at a price competitive with today’s fabrics. The Intel OPA 100 Series product line is an

end-to-end solution of PCIe* adapters, silicon, switches, cables, and management soft-

ware. As the successor to Intel® True Scale Fabric, this optimized HPC fabric is built upon

a combination of enhanced IP and Intel® technology.

For software applications, Intel OPA will maintain consistency and compatibility with

existing Intel True Scale Fabric and InfiniBand* APIs by working through the open source

OpenFabrics Alliance (OFA) software stack on leading Linux* distribution releases. Intel

True Scale Fabric customers will be able to migrate to Intel OPA through an upgrade program.


High Throughput Computing | CAD | Big Data Analytics | Simulation | Aerospace | Automotive

Page 198: HPC Compass 2016_17


Architecture

The Future of High Performance Fabrics

Current standards-based high performance fabrics, such as InfiniBand, were not originally designed for HPC, resulting in performance and scaling weaknesses that are currently impeding the path to Exascale computing. Intel Omni-Path Architecture is being designed specifically to address these issues and scale cost-effectively from entry level HPC clusters to larger clusters with 10,000 nodes or more. To improve on the InfiniBand specification and design, Intel is using the industry’s best technologies including those acquired from QLogic and Cray alongside Intel®


While both Intel OPA and InfiniBand Enhanced Data Rate (EDR) will run at 100 Gbps, there are many differences. The enhancements of Intel OPA will help enable the progression towards Exascale while cost-effectively supporting clusters of all sizes. Optimizations for HPC applications at both the host and fabric levels provide benefits that are not possible with standard InfiniBand-based designs.

Intel OPA is designed to provide the:

• Features and functionality at both the host and fabric levels to greatly raise levels of scaling
• CPU and fabric integration necessary for the increased computing density, improved reliability, reduced power, and lower costs required by significantly larger HPC deployments
• Fabric tools to readily install, verify, and manage fabrics at this level of complexity

Intel Omni-Path Key Fabric Features and Innovations

Adaptive Routing

Adaptive Routing monitors the performance of the possible

paths between fabric end-points and selects the least congest-

ed path to rebalance the packet load. While other technologies

also support routing, the implementation is vital. Intel’s imple-

mentation is based on cooperation between the Fabric Manager

Page 199: HPC Compass 2016_17


and the switch ASICs. The Fabric Manager – with a global view of

the topology – initializes the switch ASICs with several egress op-

tions per destination, updating these options as the fundamen-

tal fabric changes when links are added or removed. Once the

switch egress options are set, the Fabric Manager monitors the

fabric state, and the switch ASICs dynamically monitor and react

to the congestion sensed on individual links. This approach enables Adaptive Routing to scale as fabrics grow larger and more complex.
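The division of labor described above can be sketched as follows: a fabric manager supplies several egress options per destination, and the switch picks the least congested one from its local view. The class, port numbers and congestion values are hypothetical and do not describe Intel's actual implementation.

```python
class Switch:
    """Toy switch holding per-destination egress options."""

    def __init__(self, egress_options):
        # destination -> candidate egress ports, as a fabric manager with a
        # global view of the topology might program them (assumed format).
        self.egress_options = dict(egress_options)
        self.congestion = {}

    def report_congestion(self, port, level):
        # Local measurement sensed on an individual link (0.0 = idle).
        self.congestion[port] = level

    def select_port(self, destination):
        # Choose the least congested of the pre-programmed options.
        candidates = self.egress_options[destination]
        return min(candidates, key=lambda p: self.congestion.get(p, 0.0))


switch = Switch({"node42": [1, 2, 3]})
switch.report_congestion(1, 0.9)
switch.report_congestion(2, 0.2)
switch.report_congestion(3, 0.5)
chosen = switch.select_port("node42")

# Congestion shifts, and the switch reacts without consulting the manager.
switch.report_congestion(2, 0.95)
rerouted = switch.select_port("node42")
print(chosen, rerouted)
```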


Dispersive Routing

One of the critical roles of fabric management is the initialization

and configuration of routes through the fabric between pairs of

nodes. Intel Omni-Path Fabric supports a variety of routing meth-

ods, including defining alternate routes that disperse traffic

flows for redundancy, performance, and load balancing. Instead

of sending all packets from a source to a destination via a single

path, Dispersive Routing distributes traffic across multiple paths.

Once received, packets are reassembled in their proper order for

rapid, efficient processing. By leveraging more of the fabric to de-

liver maximum communications performance for all jobs, Disper-

sive Routing promotes optimal fabric efficiency.
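A minimal sketch of the idea, assuming a simple round-robin assignment and sequence numbers for reordering. Both are illustrative choices, not the documented Intel mechanism, and the path names are hypothetical.

```python
from itertools import cycle


def disperse(packets, paths):
    """Assign sequence-numbered packets to the available paths in turn."""
    in_flight = {path: [] for path in paths}
    for (seq, payload), path in zip(enumerate(packets), cycle(paths)):
        in_flight[path].append((seq, payload))
    return in_flight


def reassemble(in_flight):
    """Merge packets from all paths back into their original order."""
    arrived = [pkt for queue in in_flight.values() for pkt in queue]
    return [payload for _, payload in sorted(arrived)]


message = ["p0", "p1", "p2", "p3", "p4"]
in_flight = disperse(message, ["path-A", "path-B"])
restored = reassemble(in_flight)
print(in_flight)
print(restored)
```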

Traffic Flow Optimization

Traffic Flow Optimization optimizes the quality of service be-

yond selecting the priority – based on virtual lane or service level

– of messages to be sent on an egress port. At the Intel Omni-Path

Architecture link level, variable length packets are broken up into

fixed-sized containers that are in turn packaged into fixed-sized

Link Transfer Packets (LTPs) for transmitting over the link. Since

packets are broken up into smaller containers, a higher priority

container can request a pause and be inserted into the inter-switch link (ISL) data stream before the previous data has completed.

The key benefit is that Traffic Flow Optimization reduces the vari-

ation in latency seen through the network by high priority traffic

in the presence of lower priority traffic. It addresses a traditional

weakness of both Ethernet and InfiniBand in which a packet

must be transmitted to completion once the link starts even if

higher priority packets become available.
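The effect on latency can be illustrated with a toy slot model in which each container occupies one transmission slot. The function names and packet sizes are hypothetical. The sketch only contrasts preemption at container boundaries with transmit-to-completion behavior.

```python
def link_order(low_containers, high_containers, high_arrival, preemption):
    """Return the sequence of containers sent on one link.

    A low-priority packet of `low_containers` containers is in flight when a
    high-priority packet of `high_containers` containers arrives just before
    slot `high_arrival` (assumed: 0 <= high_arrival < low_containers).
    """
    sent = []
    high_pending = 0
    for slot in range(low_containers):
        if slot == high_arrival:
            high_pending = high_containers
        if preemption and high_pending:
            # Pause the low-priority packet at a container boundary.
            sent.extend("H" * high_pending)
            high_pending = 0
        sent.append("L")
    # Without preemption the high-priority packet waits for completion.
    sent.extend("H" * high_pending)
    return sent


def high_priority_delay(order, high_arrival):
    """Slots from arrival until the last high-priority container is sent."""
    last_high = max(i for i, c in enumerate(order) if c == "H")
    return last_high - high_arrival + 1


with_tfo = link_order(8, 2, 3, preemption=True)
without_tfo = link_order(8, 2, 3, preemption=False)
delay_with = high_priority_delay(with_tfo, 3)
delay_without = high_priority_delay(without_tfo, 3)
print("".join(with_tfo), delay_with)
print("".join(without_tfo), delay_without)
```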

Packet Integrity Protection

Packet Integrity Protection allows for rapid and transparent re-

covery of transmission errors between a sender and a receiver

on an Intel Omni-Path Architecture link. Given the very high Intel OPA signaling rate (25.78125 Gbps per lane) and the goal of supporting large scale systems of a hundred thousand or more links,

transient bit errors must be tolerated while ensuring that the

performance impact is insignificant. Packet Integrity Protec-

tion enables recovery of transient errors whether it is between

a host and switch or between switches. This eliminates the need

for transport level timeouts and end-to-end retries. This is done

without the heavy latency penalty associated with alternate er-

ror recovery approaches.
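Link-level recovery can be sketched as a check-and-retransmit loop between two neighbors on one link. The frame layout is hypothetical, and `zlib.crc32` is only a stand-in integrity check; the source does not specify the mechanism Intel OPA uses.

```python
import zlib


def make_frame(seq, payload):
    """Build a frame with a checksum over the payload (assumed layout)."""
    return {"seq": seq, "payload": payload, "crc": zlib.crc32(payload)}


def link_transfer(frames, corrupt_first_try):
    """Send frames across one link with local retransmission.

    Frames whose sequence number is in `corrupt_first_try` suffer a
    transient bit error on their first attempt. The receiver detects the
    mismatch and the sender on the same link retransmits, so no
    end-to-end retry is needed.
    Returns (delivered_payloads, retransmissions).
    """
    delivered, retries = [], 0
    for frame in frames:
        attempt = 0
        while True:
            payload = frame["payload"]
            if frame["seq"] in corrupt_first_try and attempt == 0:
                # Simulate a transient error by flipping one bit.
                payload = bytes([payload[0] ^ 0x01]) + payload[1:]
            if zlib.crc32(payload) == frame["crc"]:
                delivered.append(payload)
                break
            retries += 1
            attempt += 1
    return delivered, retries


frames = [make_frame(i, f"chunk-{i}".encode()) for i in range(4)]
delivered, retries = link_transfer(frames, corrupt_first_try={2})
print(delivered, retries)
```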

Dynamic Lane Scaling

Dynamic Lane Scaling allows an operation to continue even if

one or more lanes of a 4x link fail, saving the need to restart or go

to a previous checkpoint to keep the application running. The job

can then run to completion before taking action to resolve the is-

sue. Currently, InfiniBand typically drops the whole 4x link if any

of its lanes drops, costing time and productivity.
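As a rough illustration, assume link bandwidth scales linearly with the number of working lanes. That is an assumption made for this sketch, not a figure from the text. The comparison function models the whole-link drop the text attributes to typical InfiniBand behavior.

```python
LANES_PER_LINK = 4
LINK_RATE_GBPS = 100  # nominal rate of a full 4x link


def degraded_bandwidth(failed_lanes):
    """Bandwidth left when the link keeps running on the remaining lanes."""
    working = LANES_PER_LINK - failed_lanes
    if working <= 0:
        return 0.0
    return LINK_RATE_GBPS * working / LANES_PER_LINK


def whole_link_drop_bandwidth(failed_lanes):
    """Bandwidth left when any lane failure takes the whole link down."""
    return float(LINK_RATE_GBPS) if failed_lanes == 0 else 0.0


for failed in range(3):
    print(failed, degraded_bandwidth(failed), whole_link_drop_bandwidth(failed))
```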

Ask How Intel Omni-Path Architecture Can Meet Your

HPC Needs

Intel is clearing the path to Exascale computing and addressing

tomorrow’s HPC issues. Contact your Intel representative or any

authorized Intel True Scale Fabric provider to discuss how Intel

Omni-Path Architecture can improve the performance of your

future HPC workloads.

Page 200: HPC Compass 2016_17


Host Fabric Interface (HFI)

Designed specifically for HPC, the Intel Omni-Path Host Fabric

Interface (Intel OP HFI) uses an advanced connectionless design

that delivers performance that scales with high node and core

counts, making it the ideal choice for the most demanding application environments. Intel OP HFI supports 100 Gbps per port, which means each Intel OP HFI port can deliver up to 25 GBps of bidirectional bandwidth (12.5 GBps in each direction). The same ASIC utilized in the

Intel OP HFI will also be integrated into future Intel Xeon proces-

sors and used in third-party products.
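The bandwidth figures follow from a unit conversion, shown here as a short worked calculation. The function name is illustrative only.

```python
def gbps_to_gigabytes_per_second(gbps):
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


per_direction = gbps_to_gigabytes_per_second(100)  # one direction of a port
bidirectional = 2 * per_direction                  # send plus receive
print(per_direction, bidirectional)
```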

This device has not been authorized as required by the rules of

the Federal Communications Commission. This device is not, and

may not be, offered for sale or lease, or sold or leased, until autho-

rization is obtained.

Highlights

Each HFI supports:
• Multi-core scaling – support for up to 160 contexts
• 16 Send DMA engines (M2IO usage)
• Efficiency – large MTU support (4 KB, 8 KB, and 10 KB) for reduced per-packet processing overheads. Improved packet-level interfaces to improve utilization of on-chip resources.
• Receive DMA engine arrival notification
• Each HFI can map ~128 GB window at 64 byte granularity
• Up to 8 virtual lanes for differentiated QoS
• ASIC designed to scale up to 160M messages/second and 300M bidirectional messages/second

Intel Omni-Path Host Fabric Interface (HFI) Optimizations

Much of the improved HPC application performance and low

end-to-end latency at scale comes from the following enhancements:


Page 201: HPC Compass 2016_17


Enhanced Performance Scaled Messaging (PSM).

The application view of the fabric is derived heavily from – and

application-level software compatible with – the demonstrated

scalability of Intel True Scale Fabric architecture by leveraging

an enhanced next generation version of the Performance Scaled

Messaging (PSM) library. Major deployments by the US Department of Energy and others have proven this scalability advantage.

PSM is specifically designed for the Message Passing Interface

(MPI) and is very lightweight – one-tenth of the user space code

– compared to using verbs. This leads to extremely high MPI and

Partitioned Global Address Space (PGAS) message rates (short

message efficiency) compared to using InfiniBand verbs.

Intel Omni-Path Host Fabric Interface | PCIe x16 Adapter | PCIe x8 Adapter
Adapter Type | Low Profile PCIe Card (PCIe x16) | Low Profile PCIe Card (PCIe x8)
Ports | Single | Single
Connector | QSFP28 | QSFP28
Link Speed | 100Gb/s | ~58Gb/s on 100Gb/s Link
Power (Typ./Max), Copper | 7.4/11.7W | 6.3/8.3W
Power (Typ./Max), Optical, up to 3 Watts (Class 4) | 10.6/14.9W | 9.5/11.5W
Thermal/Temp. | Passive (55° C @ 200 LFM) | Passive (55° C @ 200 LFM)

“Connectionless” message routing.

Intel Omni-Path Architecture – based on a connectionless design

– does not establish connection address information between

nodes, cores, or processes while a traditional implementation

maintains this information in the cache of the adapter. As a re-

sult, the connectionless design delivers consistent latency inde-

pendent of the scale or messaging partners. This implementation

offers greater potential to scale performance across a large node

or core count cluster while maintaining low end-to-end latency

as the application is scaled across the cluster.

Intel Omni-Path Fabric Director Host Fabric Interface Adapters 100 Series

Page 202: HPC Compass 2016_17


Architecture Switch Fabric Optimizations

While similar to existing technology, Intel Omni-Path Architecture has been enhanced to overcome the scaling challenges of large-sized clusters. These enhancements include:

High Message Rate Throughput. Intel Omni-Path Architecture

is designed to support high message rate traffic from each node

through the fabric. With ever-increasing processing power and

core counts in Intel Xeon and Intel Xeon Phi processors, that

means the fabric has to support high bandwidth as well as high

message rate throughput.

48-port Switch ASIC. The 48-port design of the Intel OPA switch provides for improved fabric scalability, reduced latency, increased density, and reduced cost and power. In fact, the 48-port ASIC can enable 5-hop configurations of up to 27,648 nodes, or over 2.3x what’s possible with current InfiniBand solutions. Depending on fabric size, this can reduce fabric infrastructure requirements in a typical fat tree configuration by over 50 percent, since fewer switches, cables, racks, and power are needed as compared to today’s 36-port switch ASICs. Table 1 summarizes the Intel OPA advantages.
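The node counts quoted above are consistent with the textbook formula for a three-tier fat tree built from switches of radix k, which supports k³/4 end nodes with at most five switch hops. That this is the configuration the text has in mind is an assumption of the following sketch.

```python
def max_nodes_three_tier(radix):
    """End nodes in a full three-tier fat tree of radix-k switches: k^3 / 4."""
    return radix ** 3 // 4


opa_nodes = max_nodes_three_tier(48)  # 48-port switch ASIC
ib_nodes = max_nodes_three_tier(36)   # 36-port switch ASIC
ratio = opa_nodes / ib_nodes
print(opa_nodes, ib_nodes, round(ratio, 2))
```

Under this assumption the 48-port design yields 27,648 nodes against 11,664 for a 36-port design, a ratio of about 2.37.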

Deterministic Latency. Features in Intel OPA will help minimize

the negative performance impacts of large Maximum Transfer

Units (MTUs) on small messages and help maintain consistent

latency for interprocess communication (IPC) messages, such as

Message Passing Interface (MPI) messages, when large messag-

es – typically storage – are being simultaneously transmitted in

the fabric. This will allow higher priority small packets to bypass lower priority large packets, creating a lower and more predictable latency through the fabric.

Page 203: HPC Compass 2016_17


Enhanced End-to-End Reliability. Intel Omni-Path Architecture

will also deliver efficient detection and error correction, which

is expected to be much more efficient than forward error correction (FEC) defined in the InfiniBand standard. Enhancements

include zero load for detection, and if a correction is required,

packets only need to be retransmitted from the last link – not all

the way from the sending node – which enables near zero addi-

tional latency for a correction.

Intel Omni-Path Edge Switch 100 Series

Intel Omni-Path Edge Switches | 100 Series 48 Port | 100 Series 24 Port
Ports | 48 | 24
Rack Space Required | 1U (1.75”) | 1U (1.75”)
Capacity | 9.6Tb/s | 4.8Tb/s
Port Speed | 100Gb/s | 100Gb/s
Switch Latency | 100-110ns | 100-110ns
Interface | QSFP28 | QSFP28
Chassis Mgmt. | Yes | Yes
Embedded Management | Optional | Optional
Power Supply min/max | 1/2 | 1/2
Power (Typ./Max), Copper | 189/238 W | 146/179 W
Power (Typ./Max), All Optical | 356/408 W | 231/264 W
Power notes | Input 100-240 VAC 50-60 Hz; Optical Power w/Class 4 Optical 3 Watts max | Input 100-240 VAC 50-60 Hz; Optical Power w/Class 4 Optical 3 Watts max
Fans | N+1 (Speed Control) | N+1 (Speed Control)
Airflow (Customer Changeable) | Forward/Reverse | Forward/Reverse

Page 204: HPC Compass 2016_17


Fabric Software Components

Software Components

The Intel Omni-Path Architecture (Intel OPA) host software stack

and the Intel Fabric Suite are the components of Intel OPA software. Intel will open source all host fabric components, such as the driver, PSM, Libfabric providers, Fabric Manager, FastFabric tools, HFI diags, FM GUI, and OFA enhancements (patches, plugins, etc.). In addition, host components will be upstreamed to be delivered as a part of the operating system.

Intel OPA Host Software

Intel OPA runs today’s open source application software written

for existing OpenFabrics Alliance interfaces with no code chang-

es required. Intel OPA host software is also being open sourced

to enable an ecosystem of applications that “just work” together.

Utilizing Performance Scaled Messaging (PSM), Intel OPA host

software creates a fast data path with a lightweight, high per-

formance computing (HPC) optimized software driver layer. In

addition, standard I/O-focused protocols are supported by the

standard verbs layer.

Intel Omni-Path Fabric Suite

Intel Omni-Path Fabric (Intel OP Fabric) provides comprehensive control of administrative functions using a mature subnet

manager. With advanced routing algorithms, powerful diagnos-

tic tools, and full subnet manager failover, the Intel OP Fabric

manager simplifies subnet, fabric, and individual component

management, easing the deployment and optimization of large


Intel Omni-Path Fabric Manager GUI

The manager GUI provides an intuitive, scalable dashboard with

analysis tools for monitoring Intel OP Fabric status and configu-

ration. The GUI runs on a Linux or Windows OS with TCP/IP con-

nectivity to the Intel OP Fabric manager.

Page 205: HPC Compass 2016_17


Intel Omni-Path Fabric Suite FastFabric Toolset

Guided by an intuitive interface, the FastFabric Toolset in the Intel OP Fabric suite provides for rapid, error-free installation and configuration of Intel OPA host and management software tools, as well as simplified installation, configuration, validation, and optimization of HPC fabrics.

• Automated host, switch, and chassis software or firmware installation – fast and in parallel
• Powerful fabric deployment and verification tools
• Advanced fabric routing methods and analysis tools
• Quality of service (QoS) support with virtual fabrics

Page 206: HPC Compass 2016_17
Page 207: HPC Compass 2016_17


Numascale’s NumaConnect technology enables computer system vendors to build

scalable servers with the functionality of enterprise mainframes at the cost level of

clusters. The technology unites all the processors, memory and IO resources in the

system in a fully virtualized environment controlled by standard operating systems.


Page 208: HPC Compass 2016_17


NumaConnect Background

Systems based on NumaConnect will efficiently support all

classes of applications using shared memory or message pass-

ing through all popular high level programming models. System

size can be scaled to 4k nodes where each node can contain mul-

tiple processors. Memory size is limited by the 48-bit physical

address range provided by the Opteron processors resulting in a

total system main memory of 256 TBytes.
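The memory limit follows directly from the address width, as the short calculation below shows. The per-node figure assumes the address space is split evenly across the maximum node count, which is an illustrative assumption rather than a statement from the text.

```python
ADDRESS_BITS = 48
MAX_NODES = 4096  # "4k nodes"

total_bytes = 2 ** ADDRESS_BITS
total_tbytes = total_bytes // 2 ** 40  # binary terabytes

# Assumed even split of the address space across all nodes.
per_node_gbytes = total_bytes // MAX_NODES // 2 ** 30

print(total_tbytes, per_node_gbytes)
```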

At the heart of NumaConnect is NumaChip, a single chip that combines the cache coherent shared memory control logic with an on-chip 7-way switch. This eliminates the need for a separate,

central switch and enables linear capacity and cost scaling.

The continuing trend with multi-core processor chips is enabling

more applications to take advantage of parallel processing. Nu-

maChip leverages the multi-core trend by enabling applications

to scale seamlessly without the extra programming effort required for cluster computing. All tasks can access all memory and IO resources. This is of great value to users and the ultimate path to virtualization of all system resources.

All high speed interconnects now use the same kind of physical

interfaces resulting in almost the same peak bandwidth. The

differentiation is in latency for the critical short transfers, functionality and software compatibility. NumaConnect differs from all other interconnects through its ability to provide unified access to all resources in a system and to utilize caching techniques to obtain very low latency.

Key Facts:

• Scalable, directory based Cache Coherent Shared Memory interconnect for Opteron
• Attaches to coherent HyperTransport (cHT) through HTX connector, pick-up module or mounted directly on main board
• Configurable Remote Cache for each node

Page 209: HPC Compass 2016_17


• Full 48-bit physical address space (256 TBytes)
• Up to 4k (4096) nodes
• ≈1 microsecond MPI latency (ping-pong/2)
• On-chip, distributed switch fabric for 2 or 3 dimensional torus


Expanding the capabilities of multi-core processors

Semiconductor technology has reached a level where proces-

sor frequency can no longer be increased much due to power

consumption with corresponding heat dissipation and thermal

handling problems. Historically, processor frequency scaled at

approximately the same rate as transistor density and resulted

in performance improvements for almost all applications with no extra programming effort. Processor chips are now instead being equipped with multiple processors on a single die. Utilizing the added capacity requires software that is prepared for parallel processing. This is straightforward for individual, separate tasks that can be run independently, but is much more complex for speeding up single tasks.

The complexity of speeding up a single task grows with the logical distance between the resources needed to do the task, i.e. the fewer resources that can be shared, the harder it is. Multi-core processors share the main memory and some of the cache levels, i.e. they are classified as Symmetrical Multi Processors (SMP). Modern processor chips are also equipped with signals and logic that allow connecting to other processor chips while still maintaining the same logical sharing of memory. The practical limit is two to four processor sockets before the overheads reduce performance scaling instead of increasing it. This is normally restricted to a single motherboard.

Currently, scaling beyond the single/dual SMP motherboards is done through some form of network connection using Ethernet or a higher speed interconnect like InfiniBand. This requires processes running on the different compute nodes to communicate through explicit messages. With this model, programs that need to be scaled beyond a small number of processors have to be written in a more complex way, where the data can no longer be shared among all processes but needs to be explicitly decomposed and transferred between the different processors’ memories when required.
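The difference between the two programming models can be sketched with a parallel sum. Threads and in-process queues stand in for processors and the network here, purely as an illustration. The function names are hypothetical and the sketch says nothing about real cluster or NumaConnect performance.

```python
import queue
import threading


def shared_memory_sum(values, workers=4):
    """Every worker reads the one shared list directly."""
    partial = [0] * workers

    def work(rank):
        partial[rank] = sum(values[rank::workers])

    threads = [threading.Thread(target=work, args=(r,)) for r in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


def message_passing_sum(values, workers=4):
    """Data is explicitly decomposed and sent to each worker as a message."""
    inboxes = [queue.Queue() for _ in range(workers)]
    results = queue.Queue()

    def work(rank):
        chunk = inboxes[rank].get()   # receive this worker's share of the data
        results.put(sum(chunk))       # send the partial result back

    threads = [threading.Thread(target=work, args=(r,)) for r in range(workers)]
    for t in threads:
        t.start()
    for rank, inbox in enumerate(inboxes):
        inbox.put(values[rank::workers])  # explicit decomposition and transfer
    for t in threads:
        t.join()
    return sum(results.get() for _ in range(workers))


data = list(range(1, 101))
shared_total = shared_memory_sum(data)
message_total = message_passing_sum(data)
print(shared_total, message_total)
```

Both variants compute the same result. The message passing version carries the extra decomposition, send and receive steps that the text describes.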

NumaConnect uses a scalable approach to sharing all memory

based on distributed directories to store information about

shared memory locations. This means that programs can be

scaled beyond the limit of a single motherboard without any

changes to the programming principle. Any process running on

any processor in the system can use any part of the memory regardless of whether the physical location of the memory is on a different


NumaConnect Value Proposition

NumaConnect enables significant cost savings in three dimensions: resource utilization, system management and programmer productivity.

According to long time users of both large shared memory sys-

tems (SMPs) and clusters in environments with a variety of ap-

plications, the former provide a much higher degree of resource

utilization due to the flexibility of all system resources. They

indicate that large mainframe SMPs can easily be kept at more

than 90% utilization and that clusters seldom can reach more

than 60-70% in environments running a variety of jobs. Better

compute resource utilization also contributes to more efficient

use of the necessary infrastructure with power consumption

and cooling as the most prominent ones (accounting for approximately one third of the overall cost), with floor space as a secondary aspect.

Regarding system management, NumaChip can reduce the

number of individual operating system images significantly.

Page 210: HPC Compass 2016_17


In a system with 100Tflops computing power, the number of

system images can be reduced from approximately 1 400 to

40, a reduction factor of 35. Even if each of those 40 OS images

requires somewhat more resources for management than the

1 400 smaller ones, the overall savings are significant.

Parallel processing in a cluster requires explicit message pass-

ing programming whereas shared memory systems can utilize

compilers and other tools that are developed for multi-core pro-

cessors. Parallel programming is a complex task and programs

written for message passing normally contain 50%-100% more

code than programs written for shared memory processing.

Since all programs contain errors, the probability of errors in

message passing programs is 50%-100% higher than for shared

memory programs. A significant amount of software development

time is consumed by debugging, which further increases

the time to complete development of an application.
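The difference in programming effort described above can be illustrated with a minimal sketch. It is not taken from the NumaConnect material: the function names are hypothetical, and Python threads and queues merely stand in for shared-memory threads and MPI-style messages. The point is the structure of the two variants, not their speed.

```python
import queue
import threading


def sum_squares_shared(data, workers=4):
    """Shared-memory style: every thread reads the one list in place."""
    partial = [0] * workers

    def work(rank):
        # Each thread walks a strided slice of the shared data.
        partial[rank] = sum(x * x for x in data[rank::workers])

    threads = [threading.Thread(target=work, args=(r,)) for r in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


def sum_squares_messages(data, workers=4):
    """Message-passing style: data is decomposed and sent explicitly."""
    inboxes = [queue.Queue() for _ in range(workers)]
    results = queue.Queue()

    def work(rank):
        chunk = inboxes[rank].get()              # receive a private copy
        results.put(sum(x * x for x in chunk))   # send the partial result back

    threads = [threading.Thread(target=work, args=(r,)) for r in range(workers)]
    for t in threads:
        t.start()
    for rank in range(workers):
        # Explicit decomposition: each worker is sent its own copy of a slice.
        inboxes[rank].put(list(data[rank::workers]))
    total = sum(results.get() for _ in range(workers))
    for t in threads:
        t.join()
    return total
```

Both functions return the same result, but the second one needs extra code for decomposition, sending and receiving, which is where the additional development and debugging effort comes from.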

In principle, servers are multi-tasking, multi-user machines that

are fully capable of running multiple applications at any given

time. Small servers are very cost-efficient measured by a peak

price/performance ratio because they are manufactured in very

high volumes and use many of the same components as desk-

side and desktop computers. However, these small to medium

sized servers are not very scalable. The most widely used config-

uration has 2 CPU sockets that hold from 4 to 16 CPU cores each.

They cannot be upgraded without changing to a different main

board that also normally requires a larger power supply and a

different chassis. In turn, this means that careful capacity plan-

ning is required to optimize cost and if compute requirements

increase, it may be necessary to replace the entire server with a

bigger and much more expensive one since the price increase is

far from linear.

NumaChip contains all the logic needed to build Scale-Up sys-

tems based on volume manufactured server components. This





drives the cost per CPU core down to the same level while offer-

ing the same capabilities as the mainframe type servers. Where

IT budgets are in focus the price difference is obvious and Nu-

maChip represents a compelling proposition to get mainframe

capabilities at the cost level of high-end cluster technology. The

expensive mainframes still include some features for dynamic

system reconfiguration that NumaChip systems will not offer in-

itially. Such features depend on operating system software and

can also be implemented in NumaChip based systems.

Technology

Multi-core processors and shared memory

Shared memory programming for multi-processing boosts pro-

grammer productivity since it is easier to handle than the alter-

native message passing paradigms. Shared memory programs

are supported by compiler tools and require less code than the

alternatives resulting in shorter development time and fewer

program bugs. The availability of multi-core processors on all

major platforms including desktops and laptops is driving more

programs to take advantage of the increased performance

potential. NumaChip offers seamless scaling within the same

programming paradigm regardless of system size from a single

processor chip to systems with more than 1,000 processor chips.

Other interconnect technologies that do not offer cc-NUMA

capabilities require that applications are written for message

passing, resulting in larger programs with more bugs and cor-

respondingly longer development time while systems built with

NumaChip can run any program efficiently.


The strong trend of virtualization is driven by the desire to obtain

higher utilization of resources in the datacenter. In short,

it means that any application should be able to run on any serv-

er in the datacenter so that each server can be better utilized

by combining more applications on each server dynamically

according to user loads. Commodity server technology imposes

severe limitations in reaching this goal.

One major limitation is that the memory requirements of any

given application need to be satisfied by the physical server

that hosts the application at any given time. In turn, this means

that if any application in the datacenter shall be dynamically

executable on all of the servers at different times, all of the serv-

ers must be configured with the amount of memory required by

the most demanding application, but only the one running the

app will actually use that memory. This is where the mainframes

excel since these have a flexible shared memory architecture

where any processor can use any portion of the memory at any

given time, so they only need to be configured to be able to han-

dle the most demanding application in one instance.

NumaChip offers the exact same feature, by providing any appli-

cation with access to the aggregate amount of memory in the

system. In addition, it also offers all applications access to all I/O

devices in the system through the standard virtual view provid-

ed by the operating system.
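The provisioning argument can be made concrete with a small calculation. The application sizes and the server count below are hypothetical and only serve to show the shape of the comparison; the sketch also assumes that all listed applications run at the same time.

```python
def cluster_memory_gb(app_demands_gb, servers):
    """Every server must be able to host the most demanding application."""
    return servers * max(app_demands_gb)


def shared_memory_gb(app_demands_gb):
    """A shared-memory system only has to hold what runs concurrently."""
    return sum(app_demands_gb)


# Hypothetical workload: one large application and three small ones.
demands = [512, 64, 32, 32]
print(cluster_memory_gb(demands, servers=4))  # memory to install across the cluster
print(shared_memory_gb(demands))              # memory to install in one shared system
```

With these illustrative numbers the cluster needs more than three times as much installed memory, most of which sits idle on the servers that are not hosting the largest application.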

The two distinctly different architectures of clusters and mainframes

are shown in Figure 1. In clusters, processes are loosely

coupled through a network like Ethernet or InfiniBand. An appli-

cation that needs to utilize more processors or I/O than those

present in each server must be programmed to do so from the

beginning. In the mainframe, any application can use any resource

in the system as a virtualized resource and the compiler

can generate threads to be executed on any processor.

In a system interconnected with NumaChip, all processors can

access all the memory and all the I/O resources in the system

in the same way as on a mainframe. NumaChip provides a fully

virtualized hardware environment with shared memory and I/O

and with the same ability as mainframes to utilize compiler gen-

erated parallel processes and threads.


Clustered Architecture

Mainframe Architecture


Operating Systems

Systems based on NumaChip can run standard operating sys-

tems that handle shared memory multiprocessing. Numascale

provides a bootstrap loader that is invoked after power-up and

performs initialization of the system by setting up node address

routing tables. When the standard bootstrap loader is launched,

the system will appear as a large unified shared memory system.

Figure 1: Clustered vs Mainframe Architecture

Cache Coherent Shared Memory

The big differentiator for NumaConnect compared to other

high-speed interconnect technologies is the shared memory

and cache coherency mechanisms. These features allow programs

to access any memory location and any memory mapped

I/O device in a multiprocessor system with a high degree of efficiency.

They provide scalable systems with a unified programming

model that stays the same from the small multi-core machines

used in laptops and desktops to the largest imaginable single

system image machines that may contain thousands of processors

and tens to hundreds of terabytes of main memory.

There are a number of pros for shared memory machines that

lead experts to hold the architecture as the holy grail of computing

compared to clusters:

• Any processor can access any data location through direct

load and store operations – easier programming, less code to

write and debug

• Compilers can automatically exploit loop level parallelism –

higher efficiency with less human effort

• System administration relates to a unified system as opposed

to a large number of separate images in a cluster – less effort

to maintain

• Resources can be mapped and used by any processor in the

system – optimal use of resources in a virtualized environment

• Process scheduling is synchronized through a single, real-time

clock – avoids serialization of scheduling associated with

asynchronous operating systems in a cluster and the corresponding

loss of efficiency

Scalability and Robustness

The initial design aimed at scaling to very large numbers of processors

with a 64-bit physical address space, with 16 bits for the node

identifier and 48 bits of address within each node. The current

implementation for Opteron is limited by the global physical address

space of 48 bits, with 12 bits used to address 4 096 physical

nodes for a total physical address range of 256 Terabytes.

A directory based cache coherence protocol was developed to

handle scaling with a significant number of nodes sharing data, to

avoid overloading the interconnect between nodes with coherency

traffic, which would seriously reduce real data throughput.
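Why a directory based cache coherence protocol keeps coherency traffic low at high node counts can be shown with a toy model. This is a sketch for illustration only: the class and method names are hypothetical, and it does not describe the actual NumaConnect protocol. The directory records which nodes hold a copy of a cache line, so a write only has to invalidate those nodes instead of contacting every node in the system.

```python
class Directory:
    """Toy directory: tracks which nodes hold a copy of each cache line."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.sharers = {}     # line address -> set of node ids holding a copy
        self.messages = 0     # invalidation messages sent so far

    def read(self, node, line):
        # The reading node is recorded as a sharer of the line.
        self.sharers.setdefault(line, set()).add(node)

    def write(self, node, line):
        # Invalidate only the nodes the directory knows hold a copy.
        others = self.sharers.get(line, set()) - {node}
        self.messages += len(others)
        self.sharers[line] = {node}
        return others


def broadcast_invalidations(num_nodes):
    """Cost of one write if every other node had to be contacted instead."""
    return num_nodes - 1
```

In a 4 096 node system where a line is shared by two nodes, a write in this model sends one invalidation, whereas a broadcast scheme would send 4 095.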

NumaChip System Architecture



The basic ring topology with distributed switching allows a

number of different interconnect configurations that are more

scalable than most other interconnect switch fabrics. This also

eliminates the need for a centralized switch and includes inher-

ent redundancy for multidimensional topologies.

Functionality is included to manage robustness issues associ-

ated with high node counts and extremely high requirements

for data integrity with the ability to provide high availability for

systems managing critical data in transaction processing and

realtime control. All data that may exist in only one copy are ECC

protected with automatic scrubbing after detected single bit

errors and automatic background scrubbing to avoid accumula-

tion of single bit errors.

Integrated, distributed switching

NumaChip contains an on-chip switch to connect to other nodes

in a NumaChip based system, eliminating the need to use a

centralized switch. The on-chip switch can connect systems in

one, two or three dimensions. Small systems can use one, medium-sized

systems two, and large systems all three dimensions

to provide efficient and scalable connectivity between processors.


The two- and three-dimensional topologies (called Torus) also

have the advantage of built-in redundancy as opposed to sys-

tems based on centralized switches, where the switch repre-

sents a single point of failure.
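The wrap-around links of a torus can be sketched in a few lines. The functions below are hypothetical helpers, not part of any Numascale software; they assume nodes are addressed by integer coordinates and that each dimension forms a ring.

```python
def torus_neighbors(coord, dims):
    """Return the directly linked neighbours of `coord` in a torus of size `dims`."""
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size   # wrap around at the edges
            if tuple(n) != tuple(coord):
                neighbors.add(tuple(n))
    return neighbors


def torus_hops(a, b, dims):
    """Minimum hop count between two nodes, taking the shorter way around each ring."""
    return sum(
        min(abs(x - y), size - abs(x - y))
        for x, y, size in zip(a, b, dims)
    )
```

In a 4 x 4 x 4 torus every node has six direct neighbours, and because each ring can be traversed in either direction there is always a second route if one link fails, which is the built-in redundancy mentioned above.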

The distributed switching reduces the cost of the system since

there is no extra switch hardware to pay for. It also reduces the

amount of rack space required to hold the system as well as the

power consumption and heat dissipation from the switch hardware

and the associated power supply energy loss and cooling requirements.






Block Diagram of NumaChip

Multiple links allow flexible system configurations in multi-dimensional topologies







tech BOX


Redefining Scalable OpenMP and MPI

Shared Memory Advantages

Multi-processor shared memory processing has long been the

preferred method for creating and running technical computing

codes. Indeed, this computing model now extends from a user’s

dual core laptop to 16+ core servers. Programmers often add

parallel OpenMP directives to their programs in order to take

advantage of the extra cores on modern servers. This approach

is flexible and often preserves the “sequential” nature of the

program (pthreads can of course also be used, but OpenMP is

much easier to use). To extend programs beyond a single server,

however, users must use the Message Passing Interface (MPI) to

allow the program to operate across a high-speed interconnect.

Interestingly, the advent of multi-core servers has created a par-

allel asymmetric computing model, where programs must map

themselves to networks of shared memory SMP servers. This

asymmetric model introduces two levels of communication,

local within a node, and distant to other nodes. Programmers

often create pure MPI programs that run across multiple cores

on multiple nodes. While a pure MPI program does represent

the greatest common denominator, better performance may be

sacrificed by not utilizing the local nature of multi-core nodes.

Hybrid (combined MPI/OpenMP) models have been able to pull

more performance from cluster hardware, but often introduce

programming complexity and may limit portability.

Clearly, users prefer writing software for large shared memory systems to

MPI programming. This preference becomes more pronounced

when large data sets are used. In a large SMP system the data

are simply used in place, whereas in a distributed memory clus-

ter the dataset must be partitioned across compute nodes.




In summary, shared memory systems have a number of highly

desirable features that offer ease of use and cost reduction over

traditional distributed memory systems:

• Any processor can access any data location through direct load

and store operations, allowing easier programming (less time

and training) for end users, with less code to write and debug.

• Compilers, such as those supporting OpenMP, can automatically

exploit loop level parallelism and create more efficient codes,

increasing system throughput and improving resource utilization.

• System administration of a unified system (as opposed to

a large number of separate images in a cluster) results in

reduced effort and cost for system maintenance.

• Resources can be mapped and used by any processor in

the system, with optimal use of resources in a single image

operating system environment.

Shared Memory as a Universal Platform

Although the advantages of shared memory systems are clear,

the actual implementation of such systems “at scale” has been

difficult prior to the emergence of NumaConnect technolo-

gy. There have traditionally been limits to the size and cost of

shared memory SMP systems, and as a result the HPC commu-

nity has moved to distributed memory clusters that now scale

into the thousands of cores. Distributed memory programming

occurs within the MPI library, where explicit communication

pathways are established between processors (i.e., data is es-

sentially copied from machine to machine). A large number of

existing applications use MPI as a programming model. Fortu-

nately, MPI codes can run effectively on shared memory sys-

tems. Optimizations have been built into many MPI versions

that recognize the availability of shared memory and avoid full

message protocols when communicating between processes.

Shared memory programming using OpenMP has been useful on

small-scale SMP systems such as commodity workstations and

servers. Providing large-scale shared memory environments for

these codes, however, opens up a whole new world of perfor-

mance capabilities without the need for re-programming. Using

NumaConnect technology, scalable shared memory clusters are

capable of efficiently running both large-scale OpenMP and MPI

codes without modification.

Record-Setting OpenMP Performance

In the HPC community, the NAS Parallel Benchmarks (NPB) have

been used to test the performance of parallel computers

(http://www.nas.nasa.gov/publications/npb.html). The benchmarks are

a small set of programs derived from Computational Fluid Dy-

namics (CFD) applications that were designed to help evaluate

the performance of parallel supercomputers. Problem sizes in

NPB are predefined and indicated as different classes (currently

A through F, with F being the largest).

Reference implementations of NPB are available in common-

ly-used programming models such as MPI and OpenMP, which

make them ideal for measuring the performance of both distrib-

uted memory and SMP systems. These benchmarks were com-

piled with Intel ifort version 14.0.0. (Note: the current-generated

code is slightly faster, but Numascale is working on NumaCon-

nect optimizations for the GNU compilers and thus suggests

using gcc and gfortran for OpenMP applications.) For the follow-

ing tests, the NumaConnect Shared Memory benchmark system

has 1TB of memory and 256 cores. It utilizes eight servers, each

equipped with two 2.5 GHz AMD Opteron 6380 CPUs, each with

16 cores and 128GB of memory. Figure One shows the results for

running NPB-SP (Scalar Penta-diagonal solver) over a range of 16

to 121 cores using OpenMP for the Class D problem size.

Figure Two shows results for the NPB-LU benchmark (Lower-Up-

per Gauss-Seidel solver) over a range of 16 to 121 cores, using

OpenMP for the Class D problem size.



Figure Three shows the NAS-SP benchmark E-class scaling perfectly

from 64 processes (using affinity 0-255:4) to 121 processes

(using affinity 0-241:2). Results indicate that larger problems

scale better on NumaConnect systems, and it was noted that

NASA has never seen OpenMP E Class results with such a high

number of cores.
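The affinity strings quoted above can be read as first-last:stride. This is an assumption: the text does not say which mechanism was used, but the same range-with-stride notation is accepted, for example, by GNU OpenMP's GOMP_CPU_AFFINITY environment variable. The hypothetical helper below expands such a string and shows how the process counts follow from it.

```python
def expand_affinity(spec):
    """Expand 'first-last:stride' into the list of core numbers it selects."""
    span, _, stride = spec.partition(":")
    first, _, last = span.partition("-")
    step = int(stride) if stride else 1
    end = int(last) if last else int(first)
    return list(range(int(first), end + 1, step))


print(len(expand_affinity("0-255:4")))  # every fourth core of the 256-core system
print(len(expand_affinity("0-241:2")))  # every second core up to core 240
```

Under this reading, 0-255:4 selects 64 cores spread across the whole system and 0-241:2 selects 121 cores, matching the process counts given for Figure Three.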

OpenMP NAS Parallel results for NPB-SP (Class D)

OpenMP NAS Parallel results for NPB-LU (Class E)




OpenMP applications cannot run on InfiniBand clusters without

additional software layers and kernel modifications. The Numa-

Connect cluster runs a standard Linux kernel image.

Surprisingly Good MPI Performance

Despite the excellent OpenMP shared memory performance

that NumaConnect can deliver, applications have historically

been written using MPI. The performance of these applications

is presented below. As mentioned, the NumaConnect system

can easily run MPI applications.

Figure Four is a comparison of NumaConnect and FDR InfiniBand

for NPB-SP (Class D). The results indicate that NumaConnect

performance is superior to that of a traditional distributed

memory InfiniBand cluster. MPI tests were run with OpenMPI and

gfortran 4.8.1 using the same hardware mentioned above.

Both industry-standard OpenMPI and MPICH2 work in shared

memory mode. Numascale has implemented their own version

of the OpenMPI BTL (Byte Transfer Layer) to optimize the com-

munication by utilizing non-polluting store instructions. MPI

messages require data to be moved, and in a shared memory en-

vironment there is no reason to use standard instructions that

implicitly result in cache pollution and reduced performance.

This results in very efficient message passing and excellent MPI performance.


Similar results are shown in Figure Five for the NAS-LU (Class D).

NumaConnect’s performance over InfiniBand may be one of the

more startling results for the NAS benchmarks. Recall again that

OpenMP applications cannot run on InfiniBand clusters without

additional software layers and kernel modifications.

OpenMP NAS results for NPB-SP (Class E)

NPB-SP comparison of NumaConnect to FDR InfiniBand

NPB-LU comparison of NumaConnect to FDR InfiniBand







tech BOX


Big Data

Big Data applications – once limited to a few exotic disciplines

– are steadily becoming the dominant feature of modern com-

puting. In industry after industry, massive datasets are being

generated by advanced instruments and sensor technology.

Consider just one example, next generation DNA sequencing

(NGS). Annual NGS capacity now exceeds 13 quadrillion base

pairs (the As, Ts, Gs, and Cs that make up a DNA sequence). Each

base pair represents roughly 100 bytes of data (raw, analyzed,

and interpreted). Turning the swelling sea of genomic data into

useful biomedical information is a classic Big Data challenge,

one of many, that didn’t exist a decade ago.
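The scale of the sequencing example follows directly from the two figures quoted above. The short calculation below uses only those numbers; the function name is hypothetical.

```python
def dataset_bytes(base_pairs, bytes_per_base_pair):
    """Total data volume implied by a number of sequenced base pairs."""
    return base_pairs * bytes_per_base_pair


annual_base_pairs = 13 * 10**15          # 13 quadrillion base pairs per year
annual_bytes = dataset_bytes(annual_base_pairs, 100)
print(annual_bytes / 10**18, "exabytes per year")
```

At roughly 100 bytes per base pair, 13 quadrillion base pairs correspond to about 1.3 exabytes of raw, analyzed and interpreted data per year.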

This mainstreaming of Big Data is an important transformational

moment in computation. Datasets in the 10-to-20 Terabytes (TB)

range are increasingly common. New and advanced algorithms

for memory-intensive applications in Oil & Gas (e.g. seismic data

processing), finance (real-time trading), social media (database),

and science (simulation and data analysis), to name but a few,

are hard or impossible to run efficiently on commodity clusters.

The challenge is that traditional cluster computing based on

distributed memory – which was so successful in bringing down

the cost of high performance computing (HPC) – struggles when

forced to run applications where memory requirements exceed

the capacity of a single node. Increased interconnect latencies,

longer and more complicated software development, inefficient

system utilization, and additional administrative overhead are

all adverse factors. Conversely, traditional mainframes running

shared memory architecture and a single instance of the OS have

always coped well with Big Data Crunching jobs.




Numascale’s Solution

Until now, the biggest obstacle to wider use of shared memory

computing has been the high cost of mainframes and high-end

‘super-servers’. Given the ongoing proliferation of Big Data appli-

cations, a more efficient and cost-effective approach to shared

memory computing is needed. Now, Numascale, a four-year-old

spin-off from Dolphin Interconnect Solutions, has developed a

technology, NumaConnect, which turns a collection of standard

servers with separate memories and IO into a unified system

that delivers the functionality of high-end enterprise servers and

mainframes at a fraction of the cost.

Numascale Technology snapshot (board and chip):

NumaConnect links commodity servers together to form a single

unified system where all processors can coherently access and

share all memory and I/O. The combined system runs a single

instance of a standard operating system like Linux. At the heart

of NumaConnect is NumaChip – a single chip that combines the

cache coherent shared memory control logic with an on-chip

7-way switch. This eliminates the need for a separate, central

switch and enables linear capacity and cost scaling.

Systems based on NumaConnect support all classes of applica-

tions using shared memory or message passing through all pop-

ular high level programming models. System size can be scaled

to 4k nodes where each node can contain multiple processors.

Memory size is limited only by the 48-bit physical address range

provided by the Opteron processors resulting in a record-break-

ing total system main memory of 256 TBytes. (For details of Nu-

mascale technology see: The NumaConnect White Paper)
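The 4k node limit and the 256 TByte total both follow from the 48-bit physical address, of which 12 bits select the node. The sketch below checks the arithmetic. Two points are assumptions made for illustration: that the node identifier occupies the upper 12 bits, and that the remaining 36 bits are split evenly so that each node can address at most 64 GBytes.

```python
PHYS_ADDR_BITS = 48                              # Opteron physical address width
NODE_ID_BITS = 12                                # bits used to select a node
OFFSET_BITS = PHYS_ADDR_BITS - NODE_ID_BITS      # bits left within a node

NODE_COUNT = 2 ** NODE_ID_BITS                   # 4 096 nodes
TOTAL_BYTES = 2 ** PHYS_ADDR_BITS                # 256 TBytes (binary units)
BYTES_PER_NODE = 2 ** OFFSET_BITS                # 64 GBytes under the even split


def split_address(addr):
    """Split a physical address into (node id, offset within the node).

    Assumes the node id sits in the upper bits; the real layout may differ.
    """
    return addr >> OFFSET_BITS, addr & (BYTES_PER_NODE - 1)
```

For example, `split_address((5 << 36) | 0x1234)` yields node 5 and offset 0x1234 under this assumed layout.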

The result is an affordable, shared memory computing option to

tackle data-intensive applications. NumaConnect-based systems

running with entire data sets in memory are “orders of magni-

tude faster than clusters or systems based on any form of existing

mass storage devices and will enable data analysis and decision

support applications to be applied in new and innovative ways,”

says Kåre Løchsen, Numascale founder and CEO.

The big differentiator for NumaConnect compared to other high-

speed interconnect technologies is the shared memory and

cache coherency mechanisms. These features allow programs to

access any memory location and any memory mapped I/O device

in a multiprocessor system with high degree of efficiency. It pro-

vides scalable systems with a unified programming model that

stays the same from the small multi-core machines used in lap-

tops and desktops to the largest imaginable single system image

machines that may contain thousands of processors and tens to

hundreds of terabytes of main memory.

Early adopters are already demonstrating performance gains

and cost savings. A good example is Statoil, the global energy

company based in Norway. Processing seismic data requires

massive amounts of floating point operations and is normally

performed on clusters. Broadly speaking, this kind of processing

is done by programs developed for a message-passing paradigm

(MPI). Not all algorithms are suited to the message passing paradigm;

the amount of code required is huge, and the development

process and debugging task are complex.

Numascale’s implementation of shared memory has many inno-

vations. Its on-chip switch, for example, can connect systems in

one, two, or three dimensions (e.g. 2D and 3D Torus). Small systems

can use one, medium-sized systems two, and large systems

will use all three dimensions to provide efficient and scalable

connectivity between processors. The distributed switching re-

duces the cost of the system since there is no extra switch hard-

ware to pay for. It also reduces the amount of rack space required

to hold the system as well as the power consumption and heat

dissipation from the switch hardware and the associated power

supply energy loss and cooling requirements.







tech BOX


Cache Coherence

Multiprocessor systems with caches and shared memory space

need to resolve the problem of keeping shared data coherent.

This means that the most recently written data to a memory lo-

cation from one processor needs to be visible to other proces-

sors in the system immediately after a synchronization point.

The synchronization point is normally implemented as a barrier

through a lock (semaphore) where the process that stores data

releases a lock after finishing the store sequence.

When the lock is released, the most recent version of the data

that was stored by the releasing process can be read by another

process. If there were no individual caches or buffers in the data

paths between processors and the shared memory system, this

would be a relatively trivial task. The presence of caches chang-

es this picture dramatically. Even if the caches were so-called

write-through, in which case all store operations proceed all

the way to main memory, it would still mean that the proces-

sor wanting to read shared data could have copies of previous




versions of the data in the cache and such a read operation

would return a stale copy seen from the producer process un-

less certain steps are taken to avoid the problem.

Early cache systems would simply require a cache clear opera-

tion to be issued before reading the shared memory buffer in

which case the reading process would encounter cache misses

and automatically fetch the most recent copy of the data from

main memory. This could be done without too much loss of per-

formance when the speed difference between memory, cache

and processors was relatively small.

As these differences grew and write-back caches were intro-

duced, such coarse-grained coherency mechanisms were no

longer adequate. A write-back cache will only store data back

to main memory upon cache line replacements, i.e. when the

cache line will be replaced due to a cache miss of some sort. To

accommodate this functionality, a new cache line state, “dirty”,

was introduced in addition to the traditional “hit/miss” states.
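The behaviour of a write-back cache and its dirty state can be demonstrated with a toy model. This is a simplified sketch with hypothetical names, not a description of any real cache controller: each cache holds private copies of memory locations, marks a line dirty on a store, and writes it back to main memory only when the line is evicted.

```python
class WriteBackCache:
    """Toy cache with one entry per address and a dirty flag."""

    def __init__(self, memory):
        self.memory = memory    # shared main memory: address -> value
        self.lines = {}         # cached lines: address -> [value, dirty]

    def read(self, addr):
        if addr not in self.lines:                   # miss: fetch from memory
            self.lines[addr] = [self.memory[addr], False]
        return self.lines[addr][0]                   # hit: value may be stale

    def write(self, addr, value):
        self.lines[addr] = [value, True]             # dirty: memory not updated

    def evict(self, addr):
        value, dirty = self.lines.pop(addr)
        if dirty:                                    # write back on replacement
            self.memory[addr] = value
```

If a producer cache writes a value and a consumer cache already holds the old one, main memory stays unchanged until the producer's line is evicted, and the consumer keeps reading its stale copy until its own line is cleared. This is exactly the problem a coherence mechanism has to solve.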







tech BOX


NUMA and ccNUMA

The term Non-Uniform Memory Access means that memory

accesses from a given processor in a multiprocessor system will

have different access time depending on the physical location

of the memory. The “cc” means Cache Coherence, i.e. all memory

references from all processors will automatically return the latest

updated data from any cache in the system.

Modern multiprocessor systems where processors are con-

nected through a hierarchy of interconnects (or buses) will per-

form differently depending on the relative physical location of

processors and the memory being accessed. Most small multi-

processor systems used to be symmetric (SMP – Symmetric Mul-

ti-Processor) having a couple of processors hooked up to a main

memory system through a central bus where all processors had

equal priority and the same distance from the memory.

This was changed through AMD’s introduction of HyperTransport

with on-chip DRAM controllers. When more than one such

processor chip (normally also with multiple CPU cores per chip)

is connected, the criteria for being categorized as NUMA are

fulfilled. This is illustrated by the fact that reading or writing

data from or to the local memory is indeed quite a bit faster

than performing the same operations on the memory that is

controlled by the other processor chip.

Seen from a programming point of view, the uniform, global

shared address space is a great advantage since any operand or





variable can be directly referenced from the program through a

single load or store operation. This is both simple and incurs no

overhead. In addition, the large physical memory capacity elimi-

nates the need for data decomposition and the need for explicit

message passing between processes. It is also of great impor-

tance that the programming model can be maintained through-

out the whole range of system sizes from small desktop systems

with a couple of processor cores to huge supercomputers with

thousands of processors.

The Non-Uniformity seen from the program is only coupled with

the access time difference between different parts of the physi-

cal memory. Allocation of physical memory is handled by the op-

erating system (OS) and modern OS kernels include features for

optimizing allocation of memory to match with the scheduling

of processors to execute the program. If a programmer wishes

to influence memory allocation and process scheduling in any

way, OS calls to pin-down memory and reserve processors can

be inserted in the code.
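As an illustration of such OS calls, the sketch below restricts the calling process to a chosen set of cores using Python's `os.sched_setaffinity`, which is available on Linux and some other Unix systems. The helper name is hypothetical, and only the processor side is shown; pinning memory to a node needs platform-specific interfaces that are not covered here.

```python
import os


def pin_to_cores(cores):
    """Restrict the calling process to `cores`; return the resulting CPU set.

    Returns None where the scheduler affinity calls are not available.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)        # cores this process may use now
    wanted = set(cores) & allowed            # never ask for a forbidden core
    if wanted:
        os.sched_setaffinity(0, wanted)      # 0 means the calling process
    return os.sched_getaffinity(0)
```

Because these calls differ between operating systems, code that uses them is harder to port, which is the drawback discussed in the following paragraph.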

This is of course less desirable than to let the OS handle it auto-

matically because it increases the efforts required to port the

application between different operating systems. In normal

cases with efficient caching of remote accesses, the automatic

handling by the OS will be sufficient to ensure high utilization of

processors even if the memory is physically scattered between

the processing nodes.


Life Sciences | CAE | High Performance Computing | Big Data Analytics | Simulation | CAD

Allinea Forge for parallel Software Development

Allinea Forge is the complete toolsuite for software development – with everything

needed to debug, profile, optimize, edit and build C, C++ and Fortran applications on

Linux for high performance – from single threads through to complex parallel HPC codes

with MPI, OpenMP, threads or CUDA.

Leading enterprises and labs that need high performance computing rely on Allinea

Forge to help them develop fast, robust software and keep their development teams on

track. You’d be in good company with us – users on 70% of the world’s largest supercom-

puters and clusters rely on Allinea Forge.



Application Performance and Development Tools

At its heart, High Performance Computing is used to answer

scientific questions, to engineer better products, to simulate

and predict the world around us, to analyze data and to make


While the hardware is often the focus and grabs the head-

lines, the reality is that HPC software applications determine

everything: the only purpose of a High Performance Computing

system is to run software for “the mission”.

The whole of HPC depends on the software. For that reason it is

important to ensure that the software that is run is:

• Able to exploit the hardware

• Fast and efficient in terms of the resources and time used

• Robust and reliable

• Energy efficient

To take the first criterion as an example: procuring hardware that

has one component that is over-specified for the application pro-

file means that this part of the investment is wasted and could

have achieved more output elsewhere. If storage is the bottle-

neck of the workflow, then a faster and more expensive proces-

sor may not deliver any extra scientific or industrial throughput.

Increasingly the demand to fit energy or power usage within

physical or cost constraints is adding energy and power as well

as time into this “optimization challenge”.

Tools from Allinea Software are widely used in the HPC industry

to analyze, tune and develop applications in order to achieve the

scientific or business demands of HPC.



Analyzing the Applications

To increase the discovery provided by a current or future HPC system,

it is essential to analyze the applications that are run. This

assessment should be performed on a regular basis over the life-

time of a system – and also prior to system acquisition to drive

relevant system specification. Frequent analysis enables tem-

porary system issues and specific run-configuration issues to be

addressed quickly. The need for analysis is applicable where soft-

ware is provided by third parties such as commercial engineer-

ing software – as well as for in-house or community software.

Allinea Performance Reports provides performance characteri-

zation that helps to understand the limiting factors for an appli-

cation running on a system. Its one-page reports show the pro-

cessor, networking and I/O time balance, along with advanced

performance and resource information such as processor vector-

ization and memory access, memory usage, core usage and syn-

chronization losses. This helps users to tune runtime parameters

better and also helps developers to target optimization efforts.

Developing High Performance Software

For in-house and community software, the ability to both develop

and improve software increases the results and potential of

HPC systems.

However, today’s architectures, large code bases, higher node

and thread counts and the complexity of platform performance

place high demands on those that write software.

In addition to the best compilers (generating the fastest binaries

for a given source code) and fast libraries for core operations

such as dense matrix operations, it is necessary to have tools

that enable developers and scientists to apply their skills effec-

tively and to create more robust and efficient software.

Allinea Forge is a development tool suite that includes the

world’s leading debugging and performance profiling tools –

Allinea DDT and Allinea MAP. Using Allinea Forge enables

developers to create faster and more robust code, in a fraction

of the time.

The highly capable tools have ease-of-use at the fore: the suite

provides access to local, remote and cloud-based HPC systems

for developers – enabling their editing, building, debugging and

performance profiling of code on the systems. Designed for

developers working with HPC’s multi-process and multi-threaded

codes, it is deployed on systems ranging from technical work-

stations through to the world’s largest supercomputers.

Its debugger provides powerful insight into software defects

– controlling concurrent processes and threads, applying

advanced memory debugging to complex bugs, and visualizing

distributed application data and state.

Its performance profiler pinpoints application performance

bottlenecks and enables developers to analyze key performance

issues such as OpenMP or MPI synchronization, use of cache-un-

friendly memory access patterns, excessive I/O, vectorization – in

addition to power and energy use.

Page 230: HPC Compass 2016_17


What’s in Allinea Forge?

Allinea Forge includes:

• The world’s leading C, C++, Fortran and F90 debugger – Allinea DDT

• The fast profiler for high performance multi-threaded or multi-process code – Allinea MAP

• A single, clear interface that supports debugging, profiling, editing and building code on local and remote systems

Edit, build and commit

Whether debugging or profiling, Allinea Forge lets you make

those quick changes easily – with its built-in editor. There’s no

need for firing up a separate editor and breaking your workflow.

When the changes are done, simply rebuild and debug again, or

profile the code to see the impact of those changes.

Forge also supports the major source control systems and can

annotate your code to let you know when code was changed,

and who by. Ideal for those large multi-developer projects: find

out who broke the build easily!

• Supports Git, Mercurial, Subversion and CVS – update the code or commit to preserve your work, or annotate the code with version numbers and change messages.

• Full syntax highlighting and code folding – hide long and irrelevant code blocks so that you can see the structure of your code more clearly.

• Configurable build – supports any build command.

• Powerful search and navigation.

• In-built static analysis for C++ projects.

Access remote or cloud systems as seamlessly as your laptop

The workflow of a developer in high performance computing or working in the cloud places specific demands on tools: you need to work with your code and machine from a distance as often as you work on code running locally.


Page 231: HPC Compass 2016_17


The whole Allinea Forge toolkit has been built with you in mind.

It has unique remote connection support that brings editing,

debugging and profiling into your local desktop with minimal

network lag - enabling access to large scale machines from home

quickly, easily and securely.

• Connects via secure shell (SSH) – running the user interface locally and controlling the remote session using our unique scalable low-traffic control architecture.

• Known to work with most common one-time-password (OTP) authentication tokens.

Platform Support

Allinea Forge supports the platforms of technical and high performance computing.

• Hardware and O/S: Linux on Intel Xeon, Intel Xeon Phi, ARM 64-bit, OpenPOWER.

• Parallel processing frameworks: Almost every known implementation of the MPI standard, OpenSHMEM and OpenMP.

• Coprocessors: NVIDIA GPUs – and models/languages such as OpenACC and CUDA – and Intel Xeon Phi.

• Systems: Multicore laptops, through to supercomputers and clusters – including those provided by key vendors such as Bull, Cray, Dell, HP, IBM, Lenovo and SGI.

A remote client is available to allow access to supported remote

systems from OS/X and Windows laptops, in addition to Linux


Cross-platform Tools

It’s important to have great tools in order to develop great software. That’s why we ensure that everything from the latest compilers, C++11 standards, OpenMP and MPI versions to NVIDIA CUDA, Intel® Xeon Phi™, Intel Xeon, 64-bit ARM and OpenPOWER hardware is fully supported by our tools.

Debugging with Allinea DDT

Allinea DDT is the debugger for software engineers and scientists

developing C++, C or Fortran parallel and threaded applications

on CPUs, GPUs and Intel Xeon Phi coprocessors.

Its powerful, intuitive graphical interface, automatic detection of memory bugs and divergent behavior, and lightning-fast performance at all scales combine to make Allinea DDT the number one debugger in research, industry and academia.

Amongst DDT’s features:

• Powerful memory debugging – to catch problems like memory leaks, dangling pointers or reading beyond the bounds of arrays.

• Intelligent aggregation of state and data across multiple processes and threads – to simplify finding unusual behavior.

• Control multiple threads and processes simultaneously – set breakpoints, run to a line, or step forwards and run.

• Automatic change detection of variables and smart highlighting and graphs of values across threads and processes.

• Unique large and distributed arrays viewing and filtering for scientific codes.

• Extensible C++ STL and complex type viewing – to make viewing opaque C++/C++11 classes easy.

Allinea MAP – intuitive lightweight profiling

Allinea MAP is a profiler that shows you which lines of code are

slow and gets everything else out of your way.

Whether at one process or ten thousand, Allinea MAP is designed

to work out-of-the-box with no need for instrumentation and no

Page 232: HPC Compass 2016_17


danger of creating large, unmanageable data files. Software en-

gineers developing parallel and threaded applications on CPUs,

GPUs and Intel Xeon Phi coprocessors rely on Allinea MAP’s unri-

valled capability.

Is the compiler vectorizing this loop? Is this code CPU-bound, or

memory-bound? Why is my multi-process code slow? Which MPI

calls is it waiting at? Is multithreaded synchronization destroy-

ing performance? Allinea MAP answers all these questions and

more in a beautiful, intuitive graphical interface.

It is used from multicore Linux workstations through to the

largest supercomputers on the planet.

• Fast – typically under 5% runtime overhead, so you can profile the realistic test cases that you care most about.

• Easy – the interactive user interface is clear and intuitive, designed for developers and computational scientists.

• Scalable – architected for the biggest clusters and supercomputers – and the biggest and most complex codes!

• No fuss – profiles C++, C and Fortran with no relinking, instrumentation or code changes required.


Page 233: HPC Compass 2016_17







tech BOX

Allinea DDT – a debugger for parallel and threaded Linux code

Allinea DDT makes debugging faster and gives developers

the break they need to deliver software on time. The compre-

hensive debugger for C, C++, Fortran and F90 multi-process

and multi-threaded applications:

• Debugs native Linux applications, including those running on more than one server – such as on HPC clusters using MPI.

• Controls processes and threads – step through program execution, stop on variable changes, errors and at user-inserted breakpoints.

• Has market-leading memory debugging to detect hard-to-find or intermittent problems caused by memory leaks, dangling pointers and beyond-bounds array access.

• Has powerful and extensible C++ STL debugging to view STL containers intuitively.

• Shows variables and arrays across multiple processes and detects changes automatically.

• Provides in-built editing, building and version control integration – with outstanding support for working on remote systems.

• Has an offline mode for debugging non-interactively and recording application behavior and state.

Powerful, scalable, multi-process, multi-threaded debugging

If you work with multiple processes and threads, through

libraries like MPI or OpenMP, Allinea DDT makes concurrency


MAP exposes a wide set of performance problems and

bottlenecks by measuring:

• Computation – with self, child and call-tree representations over time.

• Thread activity – to identify over-subscribed cores and sleeping threads that waste available CPU time for OpenMP and pthreads.

• Instruction types (for x86_64) – to show use of, e.g., vector units or other performance extensions.

• Synchronization, communication and workload imbalance for MPI or multi-process usage.

• I/O performance and time spent in I/O – to identify bottlenecks in shared or local file systems.
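To illustrate what "workload imbalance" means in the list above, the following sketch computes two common imbalance figures from per-process compute times. It is a generic illustration, not MAP's actual metric; the function names and the timings are hypothetical.

```python
from statistics import mean


def load_imbalance(compute_seconds):
    """Return max/mean - 1; 0.0 means every process did the same amount of work."""
    if not compute_seconds:
        raise ValueError("need at least one process")
    average = mean(compute_seconds)
    if average <= 0:
        raise ValueError("compute times must be positive")
    return max(compute_seconds) / average - 1.0


def time_lost_waiting(compute_seconds):
    """Total seconds the faster processes idle at a barrier waiting for the slowest."""
    slowest = max(compute_seconds)
    return sum(slowest - t for t in compute_seconds)


# Hypothetical per-rank compute times between two synchronization points.
timings = [10.0, 10.0, 10.0, 14.0]
print(f"imbalance: {load_imbalance(timings):.1%}")
print(f"idle time: {time_lost_waiting(timings):.1f} s")
```

In this made-up case a single slow rank makes the other three wait four seconds each, so the synchronization time a profiler reports is really a symptom of uneven work distribution.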

Page 234: HPC Compass 2016_17


Control threads and processes individually and collectively

• Set breakpoints on groups of MPI processes, and all or individual threads (OpenMP and pthreads).

• Step or play processes and threads individually or en masse.

• Create groups of processes based on variable/expression values, current code location or process state.

See state and differences between processes and threads easily

DDT has unique data aggregation and difference-highlighting

views that automatically compare and contrast program state so

it’s quicker to handle multiple contexts.

The parallel stack view gives a scalable view of thread and pro-

cess stacks – from a handful of processes and threads to hun-

dreds of thousands – it groups stacks to highlight the differences

that can indicate divergence – ideal for MPI, OpenMP, pthreads,

CUDA and OpenACC debugging.
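The idea of grouping stacks to expose divergence can be sketched in a few lines. This is only an illustration of the principle, not how DDT implements its parallel stack view; the snapshot data and the function name are hypothetical.

```python
from collections import defaultdict


def group_by_stack(stacks):
    """Map each distinct call stack to the sorted list of ranks currently in it."""
    groups = defaultdict(list)
    for rank, frames in stacks.items():
        groups[tuple(frames)].append(rank)
    return {stack: sorted(ranks) for stack, ranks in groups.items()}


# Hypothetical snapshot: rank -> call stack, outermost frame first.
snapshot = {
    0: ["main", "solve", "MPI_Allreduce"],
    1: ["main", "solve", "MPI_Allreduce"],
    2: ["main", "solve", "MPI_Allreduce"],
    3: ["main", "write_checkpoint", "fwrite"],
}

grouped = group_by_stack(snapshot)
# List the smallest groups first: outliers are usually the interesting ones.
for stack, ranks in sorted(grouped.items(), key=lambda item: len(item[1])):
    print(f"{len(ranks)} process(es) {ranks}: {' > '.join(stack)}")
```

However many processes are running, the output has one line per distinct stack, so a single rank that has wandered off into a different code path stands out immediately.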

Unusual data values and changes are seen instantly with smart highlighting and sparklines – thumbnail plots that show values of variables across every process. Additional cross-process and thread comparison tools help to group or search data.

DDT also makes it easier to understand the types of bugs that

multi-process codes can introduce.

• Diagnose deadlocks, livelocks and message synchronization errors with both graphical and table-based message queue displays for MPI.

• Integration support for open-source MPI correctness tools.

Above all, it’s tried and tested at scale – with lightning-fast

performance even at the extreme: step and display 700,000

processes in a fraction of a second.


Page 235: HPC Compass 2016_17


Data Browsing for C, C++ and F90 structures and classes

We know it’s essential to see variables quickly and understand the most complex data structures – and with Allinea DDT you can.

• View variables and expressions – and watch as they change as you debug through your code with automatic change detection.

• Browse through variables, objects, modules and complex C++ classes, structs or F90 derived type data structures and multi-dimensional arrays.

• Dereference pointers or find addresses of objects – and use memory debugging to find out where a pointer was allocated and how large the allocation is.

• Extensible intelligent display of complex data structures such as C++ and C++11 STL, Boost and Qt data containers – including list, set, map, multimap, QString and many more.

• Add support for your own custom data types to show only the data you care about.

• Set watchpoints to stop when data is read or written by any
Memory Debugging

DDT’s built-in fast memory debugger reveals problems in heap (allocated) memory usage automatically.

• Instantly stops on exceptions and crashes – identify the source and data behind any crash with an easy-to-use, comprehensive interface to application state.

• Catch out-of-bounds data access as soon as it occurs.

• Consign memory leaks to history: leak reports detail the source locations that are responsible for the memory

• Supports custom allocators – so that allocation locations are reported at the level that matters to your application.

• Check for dangling pointers.

• Examine a pointer – check if it is valid and view the full stack of where it was allocated – and how large its allocation is.

• C, C++ memory allocated with malloc, new and similar.

• Fortran or F90 allocatable memory and pointers.

Powerful navigation, edit, build and version control – in a comprehensive tool suite

It’s natural to want to improve the speed of code after removing

the bugs. That’s why Allinea DDT is part of Allinea Forge. DDT is

available separately or in an Allinea Forge license that includes

the leading Linux parallel and threaded code profiler, Allinea

MAP. MAP shares the same user interface and installation as DDT

– which means it is easy to switch to profile a code once debug-

ging is complete.

Page 236: HPC Compass 2016_17


The powerful user interface allows developers to edit, build and commit whilst debugging – and much more!

Working on remote systems? Connect from your laptop with the free native remote client for Mac, Windows and Linux – no more laggy X connection forwarding!

Thinking smarter about solving bugs

We’ve put time into debugging so that you don’t have to. Allinea

DDT integrates the best practice to help you spot the causes of

problems more quickly.

• Catch common errors at source with built-in static analysis for C, C++ and Fortran – which warns about bugs before you’ve even hit them.

• Why does this revision of my code behave differently to the previous one? Version control integration highlights recent changes – and allows you to see why and where changes have been introduced.

• A single command in Allinea DDT will automatically log the values of variables across all processes at each changed section of code, allowing you to track down exactly how and why a particular change introduced problems to your code.

To help you manage your progress during debugging, our logbook provides a handy record of the steps you’ve taken and the things that have been observed.

• Automatic, always-on logging of your debugging activity so that it’s easy to go back and review the evidence for things you might have missed at the time.

• Share logbooks with others on your team, making collaborative debugging a reality.

• Compare two traces side-by-side, so that changes between systems, versions or process counts jump right out.


Page 237: HPC Compass 2016_17


Scalable printf and powerful command-line modes

Tracepoints provide scalable “printf” debugging. DDT can insert

tracepoints into your code, and log values scalably each time a

line of code is passed. The results are available interactively – or

in the offline debugging mode.

Offline mode debugs whilst you sleep. Submit an offline debug

run to your batch scheduler from the command line and receive

a text or HTML error and trace report from DDT when the job is

done.

Handles and visualizes huge data sets

DDT is designed to handle the large data sets of scientific and

technical computing. It can display or filter large arrays – even

when distributed on multiple processes.

• Export to industry standard data formats HDF5 and CSV.

• Browse arrays and gather statistics – over all your processes in parallel – with powerful filtering to search petabytes in parallel.

• Visualize and debug data simultaneously using Allinea DDT’s connector to the VisIt scientific visualization tool.

Cross-platform for the HPC architectures of today and tomorrow

Allinea DDT is cross-platform – supporting all of today’s leading

technical-computing platforms. It’s a CUDA debugger, a parallel

debugger, a multi-threaded debugger, an OpenMP debugger and

a memory debugger all at the same time – and supports mixed

hybrid programming models.


Intel Xeon, Intel Xeon Phi, NVIDIA CUDA, IBM BlueGene/Q,

OpenPOWER, ARMv8 (64-bit)


MPI, OpenMP, CUDA, OpenACC, UPC, CoArray Fortran, PGAS

Languages, pthread-based multithreading, SHMEM, OpenSHMEM


Fortran, C++, C++11, C, PGAS Languages, CoArray Fortran

Page 238: HPC Compass 2016_17







tech BOX

MPI, OpenMP, C, C++ and F90 profiling – Allinea MAP

Profiler Features and Benefits

We believe scientists and developers should be set free to spend

their time and energy doing great science and code – not bat-

tling with arcane or unnecessarily complex tools. Allinea MAP

is our way of giving time back to you – it is a profiler that shows

developers exactly where and why code is losing performance.

It provides:

• Effortless code profiling – without needing to change your code or – on most systems – the way you build it.

• Profiling for applications running on more than one server and multiple processes – such as on HPC clusters using MPI.

• Clear views of bottlenecks in I/O, in compute, in thread or in multi-process activity.

• Deep insight into actual processor instruction types that affect your performance such as vectorization and memory

• Memory usage over time to discover high-watermarks and changes in complete memory footprint.

• A powerful, navigable source browser in which you can edit, build and commit your changes – with outstanding support for working on remote systems.

More results, faster

Profiling your C++, C, Fortran or F90 code on Linux is as simple

as running “map --profile my_program.exe”. There are no extra

steps and no complicated instrumentation. MPI, OpenMP,

pthreads or unthreaded codes can all be profiled with Allinea

MAP. The graphical results are precise, straightforward to inter-

pret and bottlenecks are shown directly in the source code.


Page 239: HPC Compass 2016_17


Integration and a common interface with the debugger Allinea

DDT in the Allinea Forge tool suite makes moving between tools

a breeze, saving you time and energy throughout each stage of

the development cycle.

Low-overhead profiling:

for production and test workloads at any scale

Existing performance tools can give a powerful view, when you

run them and spend the time analyzing their output, but when

was the last time you ran a profiler on your production code?

We built Allinea MAP with less than 5% runtime overhead – and

the data files created by MAP remain small, for any size and du-

ration of run – so you can run it every day, on every change, giv-

ing you fascinating and powerful insights into the performance

of real codes under real conditions.

Fundamental to MAP is that it shows time alongside your source

code – so that bottlenecks are clearer to see, and the top-down

stack view of time across all processes means it’s easy to navigate

through the code to the parts that matter.

I/O profiling

As systems get larger, more and more codes are being affected by poor I/O performance. Often, this goes unnoticed or is mislabelled as poor application scaling. Allinea MAP shows you exactly where your file I/O bandwidth is being used, helping to diagnose overloaded shared filesystems, poor read/write patterns and even system misconfiguration issues.

Profiling threads and OpenMP code

Getting performance from multithreaded code can be a chal-

lenge – but Allinea MAP makes it easy to see where thread syn-

chronization is costing cycles and where threads are spending

their time.

With views of CPU core activity, and code profiling by actual per-

core walltime, Allinea MAP is the thread profiler that threaded

code has been waiting for – it’s unique in profiling threads accu-

rately and quickly on real workloads.

MPI Profiling

At the core of Allinea MAP is a scalable architecture that lets it profile hundreds or thousands of parallel processes – such as those launched through MPI, HPC’s main communication library. It gathers performance data from each process, and merges the information to present why and where your MPI or multi-process code is slow. MAP works with almost every MPI implementation so that your users have the fastest and most pain-free experience of profiling their MPI code.

Memory Profiling

As your application progresses, Allinea MAP can show you

the real memory usage across all processes in the application

and all compute nodes/servers. The memory usage helps

you identify imbalance, or changes caused by phases in your

application – and MAP shows this alongside your source code.

Page 240: HPC Compass 2016_17


The visible high-water mark of usage helps to track down applications that rely on 3rd party libraries which temporarily consume memory and push memory usage over the edge. Applications that use increasing memory over time – memory leaks – can then be addressed with Allinea DDT’s in-built memory debugger.


Additionally, the time spent in memory accesses is one of the

key metrics profiled so that poor memory access patterns and

cache use are found easily.

Accelerator Metrics

Allinea MAP supports the latest NVIDIA CUDA GPUs and helps

you to profile CUDA GPUs and the CPUs together. Profiling ena-

bles you to see how your CPU waits for GPU completion - and

view CUDA GPU time in global memory access, GPU utilization

and even GPU temperature.


Energy Profiling

Energy consumption and peak power usage are increasingly

important for high-performance applications and their users.

With Allinea MAP’s Energy Pack, developers can optimize for

time and energy.


Page 241: HPC Compass 2016_17


Intel processors from Sandy Bridge onwards (including Haswell and Broadwell chips) are supported for CPU power measurement via their in-built Intel RAPL power measurement capability. GPU power measurement is available on any NVIDIA GPU with power monitoring support. Node-level measurement is also available for systems supporting the Intel Energy Checker API or the Cray HSS energy counters (XK6 and XC30 and above).

Compare the performance of different clusters

and architectures

Allinea MAP is a cross-platform profiler supporting the major

Linux platforms. It provides its data in an open XML format,

making it ideal for post-processing to characterize and compare

the performance of key codes on different hardware platforms.

Even without access to the original source code, Allinea MAP

tracks and reports on CPU, memory and MPI performance met-

rics over time, giving you everything you need to evaluate and

compare potential new platforms.

Allinea MAP supports all of today’s leading technical-computing

platforms – which means you can be productive on any system.

It’s an MPI profiler, a multi-threaded profiler and an OpenMP

profiler – and supports mixed hybrid programming models.


Intel Xeon Phi, NVIDIA CUDA, OpenPOWER, ARMv8 (64-bit),

Intel Xeon


MPI, OpenMP, CUDA, OpenACC, UPC, PGAS Languages,

pthread-based multithreading, SHMEM, OpenSHMEM


Fortran, C++, C++11, C, PGAS Languages

Free up support staff time to solve key challenges

HPC consultants and support staff have a deep understanding

of performance and optimization tools. Yet again and again

they tell us much of their time is spent diagnosing the same ba-

sic mistakes new programmers make over and over.

We designed Allinea MAP so that new developers of MPI, OpenMP

and regular code can see the cause of common performance

problems at once, freeing up experts to dive deeper into complex

and leadership-class optimization problems.

Page 242: HPC Compass 2016_17



ACID

In computer science, ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction. For example, a transfer of funds from one bank account to another, even involving multiple changes such as debiting one account and crediting another, is a single transaction.
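The funds-transfer example can be tried out with the SQLite module from the Python standard library. This is a minimal sketch: the table layout, account names and the `transfer` helper are assumptions made for illustration, and it demonstrates atomicity and consistency only.

```python
import sqlite3


def transfer(conn, source, target, amount):
    """Move `amount` between accounts as one transaction: both updates or neither."""
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, target)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, source)
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

transfer(conn, "alice", "bob", 30)  # succeeds: both rows change together

try:
    transfer(conn, "alice", "bob", 500)  # debit would make the balance negative
except sqlite3.IntegrityError:
    pass  # the credit that was already applied has been rolled back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

The second transfer credits the target first and then fails on the debit; because both statements belong to one transaction, the credit is undone and the balances stay at their values after the first transfer.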

ACML (“AMD Core Math Library”)

A software development library released by AMD. This library

provides useful mathematical routines optimized for AMD pro-

cessors. Originally developed in 2002 for use in high-performance

computing (HPC) and scientific computing, ACML allows nearly

optimal use of AMD Opteron processors in compute-intensive

applications. ACML consists of the following main components:

• A full implementation of Level 1, 2 and 3 Basic Linear Algebra Subprograms (→ BLAS), with optimizations for AMD Opteron processors.

• A full suite of Linear Algebra (→ LAPACK) routines.

• A comprehensive suite of Fast Fourier Transforms (FFTs) in single-, double-, single-complex and double-complex data types.

• Fast scalar, vector, and array math transcendental library routines.

• Random Number Generators in both single- and double-precision.
AMD offers pre-compiled binaries for Linux, Solaris, and

Windows available for download. Supported compilers include

gfortran, Intel Fortran Compiler, Microsoft Visual Studio, NAG,

PathScale, PGI compiler, and Sun Studio.

BLAS (“Basic Linear Algebra Subprograms”)

Routines that provide standard building blocks for performing

basic vector and matrix operations. The Level 1 BLAS perform

scalar, vector and vector-vector operations, the Level 2 BLAS

perform matrix-vector operations, and the Level 3 BLAS perform

matrix-matrix operations. Because the BLAS are efficient,

portable, and widely available, they are commonly used in

the development of high quality linear algebra software, e.g.

→ LAPACK. Although a model Fortran implementation of the BLAS

is available from netlib in the BLAS library, it is not expected to

perform as well as a specially tuned implementation on most

high-performance computers – on some machines it may give

much worse performance – but it allows users to run → LAPACK

software on machines that do not offer any other implemen-

tation of the BLAS.
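The three BLAS levels can be illustrated with one representative operation each. The sketch below is plain Python for readability and deliberately unoptimized; the function names echo the BLAS routine families (AXPY, GEMV, GEMM), but the simplified signatures are assumptions and do not match the real BLAS interfaces.

```python
def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(a, x):
    """Level 2 (matrix-vector): return A x, with A given as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def gemm(a, b):
    """Level 3 (matrix-matrix): return A B, with both matrices as lists of rows."""
    columns = list(zip(*b))  # transpose B once so each column is a tuple
    return [[sum(aik * bkj for aik, bkj in zip(row, col)) for col in columns] for row in a]


print(axpy(2, [1, 2], [10, 20]))
print(gemv([[1, 2], [3, 4]], [1, 1]))
print(gemm([[1, 2], [3, 4]], [[0, 1], [1, 0]]))
```

For vectors and matrices of dimension n, these perform on the order of n, n² and n³ arithmetic operations respectively, which is why tuned Level 3 routines benefit most from cache blocking and vectorization.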


Ceph

In computing, Ceph is an → object storage based free software

storage platform that stores data on a single distributed comput-

er cluster, and provides interfaces for object-, block- and file-level

storage. Ceph aims primarily to be completely distributed without

a single point of failure, scalable to the exabyte level, and freely available.


Ceph replicates data and makes it fault-tolerant, using commod-

ity hardware and requiring no specific hardware support. As a

result of its design, the system is both self-healing and self-man-

aging, aiming to minimize administration time and other costs.

See the main article “OpenStack for HPC Cloud Environments”

for further details.


Page 243: HPC Compass 2016_17


Cg (“C for Graphics”)

A high-level shading language developed by Nvidia in close col-

laboration with Microsoft for programming vertex and pixel

shaders. It is very similar to Microsoft’s → HLSL. Cg is based on

the C programming language and although they share the same

syntax, some features of C were modified and new data types

were added to make Cg more suitable for programming graph-

ics processing units. This language is only suitable for GPU pro-

gramming and is not a general programming language. The Cg

compiler outputs DirectX or OpenGL shader programs.

CISC (“complex instruction-set computer”)

A computer instruction set architecture (ISA) in which each in-

struction can execute several low-level operations, such as

a load from memory, an arithmetic operation, and a memory

store, all in a single instruction. The term was retroactively

coined in contrast to reduced instruction set computer (RISC).

The terms RISC and CISC have become less meaningful with the

continued evolution of both CISC and RISC designs and imple-

mentations, with modern processors also decoding and split-

ting more complex instructions into a series of smaller internal

micro-operations that can thereby be executed in a pipelined

fashion, thus achieving high performance on a much larger sub-

set of instructions.


cluster

Aggregation of several, mostly identical or similar systems to

a group, working in parallel on a problem. Previously known

as Beowulf Clusters, HPC clusters are composed of commodity

hardware, and are scalable in design. The more machines are

added to the cluster, the more performance can in principle be


control protocol

Part of the → parallel NFS standard

CUDA driver API

Part of → CUDA


Part of → CUDA

CUDA toolkit

Part of → CUDA

CUDA (“Compute Unified Device Architecture”)

A parallel computing architecture developed by NVIDIA. CUDA

is the computing engine in NVIDIA graphics processing units

or GPUs that is accessible to software developers through in-

dustry standard programming languages. Programmers use

“C for CUDA” (C with NVIDIA extensions), compiled through a

PathScale Open64 C compiler, to code algorithms for execution

on the GPU. CUDA architecture supports a range of computation-

al interfaces including → OpenCL and → DirectCompute. Third

party wrappers are also available for Python, Fortran, Java and

Matlab. CUDA works with all NVIDIA GPUs from the G8X series

onwards, including GeForce, Quadro and the Tesla line. CUDA

provides both a low level API and a higher level API. The initial

CUDA SDK was made public on 15 February 2007, for Microsoft

Windows and Linux. Mac OS X support was later added in ver-

sion 2.0, which supersedes the beta released February 14, 2008.

CUDA is the hardware and software architecture that enables

NVIDIA GPUs to execute programs written with C, C++, Fortran,

→ OpenCL, → DirectCompute, and other languages. A CUDA pro-

gram calls parallel kernels. A kernel executes in parallel across

a set of parallel threads. The programmer or compiler organizes

these threads in thread blocks and grids of thread blocks. The

GPU instantiates a kernel program on a grid of parallel thread

blocks. Each thread within a thread block executes an instance

of the kernel, and has a thread ID within its thread block,

program counter, registers, per-thread private memory, inputs,

and output results.

A thread block is a set of concurrently executing threads that

can cooperate among themselves through barrier synchroniza-

tion and shared memory. A thread block has a block ID within

its grid. A grid is an array of thread blocks that execute the same

kernel, read inputs from global memory, write results to global

memory, and synchronize between dependent kernel calls. In

the CUDA parallel programming model, each thread has a per-

thread private memory space used for register spills, function

calls, and C automatic array variables. Each thread block has a

per-block shared memory space used for inter-thread communi-

cation, data sharing, and result sharing in parallel algorithms.

Grids of thread blocks share results in global memory space af-

ter kernel-wide global synchronization.

CUDA’s hierarchy of threads maps to a hierarchy of processors

on the GPU; a GPU executes one or more kernel grids; a stream-

ing multiprocessor (SM) executes one or more thread blocks;

and CUDA cores and other execution units in the SM execute

threads. The SM executes threads in groups of 32 threads called

a warp. While programmers can generally ignore warp ex-

ecution for functional correctness and think of programming

one thread, they can greatly improve performance by having

threads in a warp execute the same code path and access mem-

ory in nearby addresses. See the main article “GPU Computing”

for further details.
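The thread hierarchy described in this entry can be illustrated without a GPU. The following is a minimal Python sketch, not CUDA code: it emulates sequentially how a one-dimensional grid of thread blocks maps a block ID and a thread ID to a global element index. The names `global_thread_id`, `saxpy_kernel` and `launch`, as well as the block size of 64, are assumptions chosen for illustration; only the warp size of 32 is taken from the text.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def global_thread_id(block_id, block_dim, thread_id):
    """Global index of a thread in a 1-D grid: block ID * block size + thread ID."""
    return block_id * block_dim + thread_id


def saxpy_kernel(block_id, block_dim, thread_id, a, x, y, out):
    """Body executed by ONE thread: it handles a single array element."""
    i = global_thread_id(block_id, block_dim, thread_id)
    if i < len(x):  # guard: the grid may contain more threads than elements
        out[i] = a * x[i] + y[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Sequentially emulate a kernel launch over grid_dim blocks of block_dim threads."""
    for block_id in range(grid_dim):
        for thread_id in range(block_dim):
            kernel(block_id, block_dim, thread_id, *args)


n = 100
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n
block_dim = 64                                # threads per thread block (assumed)
grid_dim = (n + block_dim - 1) // block_dim   # enough blocks to cover all n elements
launch(saxpy_kernel, grid_dim, block_dim, 2.0, x, y, out)

warps_per_block = block_dim // WARP_SIZE      # each block is executed as this many warps
```

On a real GPU the two loops in `launch` do not exist; the hardware runs the threads of a block concurrently, which is why the bounds check inside the kernel is needed whenever the grid is larger than the data.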


DirectCompute

An application programming interface (API) that supports general-

purpose computing on graphics processing units (GPUs)

on Microsoft Windows Vista or Windows 7. DirectCompute is

part of the Microsoft DirectX collection of APIs and was initially

released with the DirectX 11 API but runs on both DirectX 10

and DirectX 11 GPUs. The DirectCompute architecture shares a

range of computational interfaces with → OpenCL and → CUDA.


Docker

An open-source project that automates the deployment of

applications inside software containers, by providing an

additional layer of abstraction and automation of operating-



system-level virtualization on Linux. Docker uses the resource

isolation features of the Linux kernel such as cgroups and ker-

nel namespaces, and a union-capable filesystem such as aufs

and others to allow independent “containers” to run within a

single Linux instance, avoiding the overhead of starting and

maintaining virtual machines.

The Linux kernel’s support for namespaces mostly isolates an

application’s view of the operating environment, including pro-

cess trees, network, user IDs and mounted file systems, while

the kernel’s cgroups provide resource limiting, including the

CPU, memory, block I/O and network. Since version 0.9, Docker

includes the libcontainer library as its own way to directly use

virtualization facilities provided by the Linux kernel, in addi-

tion to using abstracted virtualization interfaces via libvirt, LXC

(Linux Containers) and systemd-nspawn.

As actions are performed on a Docker base image, union filesystem

layers are created and documented, such that each layer fully

describes how to recreate an action. This strategy enables

Docker’s lightweight images, as only layer updates need to be

propagated (compared to full VMs, for example). See the main

article “OpenStack for HPC Cloud Environments” for further details.
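The layering idea can be pictured with a small analogy. The sketch below assumes that each layer can be modeled as a dictionary from file path to content and uses the standard library's `collections.ChainMap` to stack them. It is an analogy only, not how Docker or aufs is implemented, and all file names and contents are invented.

```python
from collections import ChainMap

# Each layer records only the files changed by one action (hypothetical contents).
base_layer = {"/etc/os-release": "base OS", "/bin/sh": "shell"}
install_layer = {"/usr/bin/app": "app v1"}
config_layer = {"/etc/app.conf": "threads=4", "/usr/bin/app": "app v2"}

# ChainMap returns a key from the first mapping that contains it,
# so the most recent layer is listed first and shadows older layers.
image = ChainMap(config_layer, install_layer, base_layer)

# A container adds an empty writable layer on top; the image layers stay untouched.
container = image.new_child()
container["/tmp/scratch"] = "runtime data"
```

Because every layer holds only its own changes, distributing an updated image in this model means shipping `config_layer` alone rather than the whole stack.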


ETL (“Extract, Transform, Load”)

A process in database usage and especially in data warehousing

that involves:

• Extracting data from outside sources

• Transforming it to fit operational needs (which can include

quality levels)

• Loading it into the end target (database or data warehouse)

The first part of an ETL process involves extracting the data from

the source systems. In many cases this is the most challenging

aspect of ETL, as extracting data correctly will set the stage for

how subsequent processes will go. Most data warehousing proj-

ects consolidate data from different source systems. Each sepa-

rate system may also use a different data organization/format.

Common data source formats are relational databases and flat

files, but may include non-relational database structures such

as Information Management System (IMS) or other data struc-

tures such as Virtual Storage Access Method (VSAM) or Indexed

Sequential Access Method (ISAM), or even fetching from outside

sources such as through web spidering or screen-scraping. The

streaming of the extracted data source and load on-the-fly to

the destination database is another way of performing ETL

when no intermediate data storage is required. In general, the

goal of the extraction phase is to convert the data into a single

format which is appropriate for transformation processing.

An intrinsic part of the extraction involves the parsing of ex-

tracted data, resulting in a check if the data meets an expected

pattern or structure. If not, the data may be rejected entirely or

in part.

The transform stage applies a series of rules or functions to the

extracted data from the source to derive the data for loading

into the end target. Some data sources will require very little or

even no manipulation of data. In other cases, one or more

transformation types may be required to meet the

business and technical needs of the target database.

The load phase loads the data into the end target, usually the

data warehouse (DW). Depending on the requirements of the

organization, this process varies widely. Some data warehouses

may overwrite existing information with cumulative information;

updating the extracted data is frequently done on a daily, weekly

or monthly basis. Other DWs (or even other parts of the same DW)

may add new data in a historicized form, for example, hourly.

To understand this, consider a DW that is required to maintain

sales records of the last year. Then, the DW will overwrite any

data that is older than a year with newer data. However, the

entry of data for any one year window will be made in a histo-

ricized manner. The timing and scope to replace or append are

strategic design choices dependent on the time available and

the business needs. More complex systems can maintain a his-

tory and audit trail of all changes to the data loaded in the DW.

As the load phase interacts with a database, the constraints de-

fined in the database schema – as well as in triggers activated

upon data load – apply (for example, uniqueness, referential in-

tegrity, mandatory fields), which also contribute to the overall

data quality performance of the ETL process.
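The three phases can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not a production pipeline: the CSV content, the `sales` table and the function names `extract`, `transform` and `load` are invented, and an in-memory SQLite database stands in for the data warehouse. It shows rows being rejected during parsing and a schema constraint rejecting a duplicate during load.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source with one malformed row and one duplicate.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,not-a-number
3,carol,7.25
3,carol,7.25
"""


def extract(text):
    """Extract: parse the flat-file source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types; reject rows that do not match the expected pattern."""
    good, rejected = [], []
    for row in rows:
        try:
            good.append((int(row["order_id"]), row["customer"].title(), float(row["amount"])))
        except ValueError:
            rejected.append(row)
    return good, rejected


def load(conn, rows):
    """Load: insert into the target; constraints in the schema enforce data quality."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales ("
        "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    loaded = 0
    for row in rows:
        try:
            conn.execute("INSERT INTO sales VALUES (?, ?, ?)", row)
            loaded += 1
        except sqlite3.IntegrityError:
            pass  # duplicate key rejected by the PRIMARY KEY constraint
    conn.commit()
    return loaded


conn = sqlite3.connect(":memory:")
good_rows, rejected_rows = transform(extract(RAW_CSV))
loaded_count = load(conn, good_rows)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Rejecting bad rows in the transform step and leaving uniqueness to the database is one possible division of labor; real ETL tools usually also log rejected rows for later inspection.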

FFTW (“Fastest Fourier Transform in the West”)

A software library for computing discrete Fourier transforms

(DFTs), developed by Matteo Frigo and Steven G. Johnson at the

Massachusetts Institute of Technology. FFTW is known as the

fastest free software implementation of the Fast Fourier trans-

form (FFT) algorithm (upheld by regular benchmarks). It can

compute transforms of real and complex-valued arrays of arbi-

trary size and dimension in O(n log n) time.
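FFTW's algorithms are far more elaborate than can be shown here. Purely to illustrate why a fast Fourier transform needs O(n log n) operations where the direct DFT definition needs O(n²), the following sketch implements a textbook recursive radix-2 Cooley-Tukey transform in plain Python and checks it against the direct definition. It is not FFTW's implementation and, unlike FFTW, it only accepts power-of-two lengths; the function names are chosen for this example.

```python
import cmath


def dft_naive(x):
    """Direct evaluation of the DFT definition: O(n^2) operations."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT: O(n log n); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])   # transform of the even-indexed samples
    odd = fft_radix2(x[1::2])    # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


signal = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, -3.0]
fast = fft_radix2(signal)
slow = dft_naive(signal)
max_error = max(abs(a - b) for a, b in zip(fast, slow))
```

Each recursion level does O(n) work and there are log₂ n levels, which is where the O(n log n) bound comes from.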

floating point standard (IEEE 754)

The most widely used standard for floating-point computation,

followed by many hardware (CPU and FPU) and software

implementations. Many computer languages allow or require

that some or all arithmetic be carried out using IEEE 754

formats and operations. The current version is IEEE 754-2008,

which was published in August 2008; the original IEEE 754-1985

was published in 1985. The standard defines arithmetic formats,

interchange formats, rounding algorithms, operations, and ex-

ception handling. The standard also includes extensive recom-

mendations for advanced exception handling, additional opera-

tions (such as trigonometric functions), expression evaluation,

and for achieving reproducible results. The standard defines

single-precision, double-precision, and 128-bit quadruple-precision

floating point numbers. In the proposed 754r version, the standard

also defines the 2-byte half-precision number
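The storage sizes of the formats can be checked with Python's standard `struct` module, whose format codes `e`, `f` and `d` correspond to the half-, single- and double-precision interchange formats. The sketch below is illustrative only; the helper names `round_trip` and `bits_of_double` are made up for this example, and quadruple precision is left out because the `struct` module offers no format code for it.

```python
import struct

# struct format codes: 'e' = half, 'f' = single, 'd' = double precision
SIZES = {
    name: struct.calcsize("<" + code)
    for name, code in [("half", "e"), ("single", "f"), ("double", "d")]
}


def round_trip(value, code):
    """Store a Python float (a double) in the given format and read it back."""
    return struct.unpack("<" + code, struct.pack("<" + code, value))[0]


def bits_of_double(value):
    """Split the 64-bit pattern of a double into (sign, biased exponent, fraction)."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    return raw >> 63, (raw >> 52) & 0x7FF, raw & ((1 << 52) - 1)


tenth_single = round_trip(0.1, "f")   # 0.1 rounded to single precision
tenth_half = round_trip(0.1, "e")     # 0.1 rounded to half precision
sign, exponent, fraction = bits_of_double(1.0)
sum_is_exact = (0.1 + 0.2 == 0.3)     # rounding makes this comparison fail
```

Values such as 0.1 have no exact binary representation, so each narrower format rounds them differently; values such as 0.5 or 1.0 survive the round trip unchanged.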


FraunhoferFS (FhGFS)

A high-performance parallel file system from the Fraunhofer

Competence Center for High Performance Computing. Built on

scalable multithreaded core components with native → Infini-

Band support, file system nodes can serve → InfiniBand and

Ethernet (or any other TCP-enabled network) connections at the

same time and automatically switch to a redundant connection

path in case any of them fails. One of the most fundamental

concepts of FhGFS is the strict avoidance of architectural bottlenecks.

Striping file contents across multiple storage servers is

only one part of this concept. Another important aspect is the

distribution of file system metadata (e.g. directory information)

across multiple metadata servers. Large systems and metadata

intensive applications in general can greatly profit from the lat-

ter feature.
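Striping can be pictured as a simple address calculation. The sketch below shows generic round-robin striping and should not be read as the actual FhGFS on-disk layout: the chunk size of 512 KiB, the number of four storage targets and the function names `locate` and `split_request` are all assumptions made for illustration.

```python
def locate(offset, chunk_size, num_targets):
    """Map a byte offset in a file to (target index, offset within that target's data)."""
    chunk_index = offset // chunk_size          # which chunk of the file
    target = chunk_index % num_targets          # round-robin choice of storage target
    stripe_row = chunk_index // num_targets     # chunks this target holds before this one
    return target, stripe_row * chunk_size + offset % chunk_size


def split_request(offset, length, chunk_size, num_targets):
    """Split a contiguous request into pieces: (target, target offset, piece length)."""
    pieces = []
    end = offset + length
    while offset < end:
        target, target_offset = locate(offset, chunk_size, num_targets)
        piece = min(chunk_size - offset % chunk_size, end - offset)
        pieces.append((target, target_offset, piece))
        offset += piece
    return pieces


CHUNK = 512 * 1024   # assumed chunk size of 512 KiB
TARGETS = 4          # assumed number of storage targets
request = split_request(offset=768 * 1024, length=1024 * 1024,
                        chunk_size=CHUNK, num_targets=TARGETS)
```

A single 1 MiB read in this example touches three different targets, so the pieces can be served in parallel instead of queuing on one server.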

FhGFS requires no dedicated file system partition on the servers

– it uses existing partitions, formatted with any of the standard

Linux file systems, e.g. XFS or ext4. For larger networks, it is also


possible to create several distinct FhGFS file system partitions

with different configurations. FhGFS provides a coherent mode,

in which it is guaranteed that changes to a file or directory by

one client are always immediately visible to other clients.

Global Arrays (GA)

A library developed by scientists at Pacific Northwest National

Laboratory for parallel computing. GA provides a friendly API

for shared-memory programming on distributed-memory com-

puters for multidimensional arrays. The GA library is a predeces-

sor to the GAS (global address space) languages currently being

developed for high-performance computing. The GA toolkit has

additional libraries including a Memory Allocator (MA), Aggre-

gate Remote Memory Copy Interface (ARMCI), and functionality

for out-of-core storage of arrays (ChemIO). Although GA was ini-

tially developed to run with TCGMSG, a message passing library

that came before the → MPI standard (Message Passing Inter-

face), it is now fully compatible with → MPI. GA includes simple

matrix computations (matrix-matrix multiplication, LU solve)

and works with → ScaLAPACK. Sparse matrices are available but

the implementation is not optimal yet. GA was developed by Jar-

ek Nieplocha, Robert Harrison and R. J. Littlefield. The ChemIO li-

brary for out-of-core storage was developed by Jarek Nieplocha,

Robert Harrison and Ian Foster.

The GA library is incorporated into many quantum chemistry

packages, including NWChem, MOLPRO, UTChem, MOLCAS, and

TURBOMOLE. The GA toolkit is free software, licensed under a

self-made license.

Globus Toolkit

An open source toolkit for building computing grids developed

and provided by the Globus Alliance, currently at version 5.

GMP (“GNU Multiple Precision Arithmetic Library”)

A free library for arbitrary-precision arithmetic, operating on

signed integers, rational numbers, and floating point numbers.

There are no practical limits to the precision except the ones

implied by the available memory in the machine GMP runs on

(operand dimension limit is 2³¹ bits on 32-bit machines and 2³⁷

bits on 64-bit machines). GMP has a rich set of functions, and

the functions have a regular interface. The basic interface is

for C but wrappers exist for other languages including C++, C#,

OCaml, Perl, PHP, and Python. In the past, the Kaffe Java virtual

machine used GMP to support Java built-in arbitrary precision

arithmetic. This feature has been removed from recent releases,

causing protests from people who claim that they used Kaffe

solely for the speed benefits afforded by GMP. As a result, GMP

support has been added to GNU Classpath. The main target ap-

plications of GMP are cryptography applications and research,

Internet security applications, and computer algebra systems.
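What arbitrary-precision arithmetic means in practice can be shown with Python's built-in integers and the standard `fractions` module, which behave similarly from the user's point of view. This is a stand-in for illustration only: the sketch does not call GMP, and the chosen numbers are arbitrary.

```python
from fractions import Fraction
from math import factorial

# Integers: precision is limited only by available memory.
big = factorial(50)        # far beyond the range of a 64-bit integer
mersenne = 2**127 - 1      # a 127-bit value, held exactly

# Modular exponentiation, a core operation in public-key cryptography.
# By Fermat's little theorem the result is 1 when the modulus is prime.
modpow = pow(3, mersenne - 1, mersenne)

# Rationals: exact arithmetic without rounding error.
harmonic = sum(Fraction(1, k) for k in range(1, 11))   # 1 + 1/2 + ... + 1/10
```

With fixed-width machine integers the first two values would overflow, and with floating point the sum of fractions would carry rounding error; arbitrary-precision types avoid both at the cost of speed and memory.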


Kazushige Goto’s implementation of → BLAS.

grid (in CUDA architecture)

Part of the → CUDA programming model


GridFTP

An extension of the standard File Transfer Protocol (FTP) for use

with Grid computing. It is defined as part of the → Globus toolkit,

under the organisation of the Global Grid Forum (specifically,

by the GridFTP working group). The aim of GridFTP is to

provide a more reliable and high performance file transfer for

Grid computing applications. This is necessary because of the

increased demands of transmitting data in Grid computing – it is

frequently necessary to transmit very large files, and this needs

to be done fast and reliably. GridFTP is the answer to the prob-

lem of incompatibility between storage and access systems.

Previously, each data provider would make their data available

in their own specific way, providing a library of access func-

tions. This made it difficult to obtain data from multiple sources,

requiring a different access method for each, and thus dividing

the total available data into partitions. GridFTP provides a uni-

form way of accessing the data, encompassing functions from

all the different modes of access, building on and extending the

universally accepted FTP standard. FTP was chosen as a basis

for it because of its widespread use, and because it has a well

defined architecture for extensions to the protocol (which may

be dynamically discovered).


Hadoop

An open-source software framework written in Java for distributed

storage and distributed processing of very large data sets

on computer clusters built from commodity hardware. All the

modules in Hadoop are designed with a fundamental assump-

tion that hardware failures are common and should be auto-

matically handled by the framework. See the main article “Big

Data Analytics with Apache Hadoop” for further details.

Hierarchical Data Format (HDF)

A set of file formats and libraries designed to store and orga-

nize large amounts of numerical data. Originally developed at

the National Center for Supercomputing Applications, it is cur-

rently supported by the non-profit HDF Group, whose mission

is to ensure continued development of HDF5 technologies, and

the continued accessibility of data currently stored in HDF. In

keeping with this goal, the HDF format, libraries and associated

tools are available under a liberal, BSD-like license for general

use. HDF is supported by many commercial and non-commer-

cial software platforms, including Java, MATLAB, IDL, and Py-

thon. The freely available HDF distribution consists of the li-

brary, command-line utilities, test suite source, Java interface,

and the Java-based HDF Viewer (HDFView). There currently exist

two major versions of HDF, HDF4 and HDF5, which differ signifi-

cantly in design and API.

HLSL (“High Level Shader Language”)

The High Level Shader Language or High Level Shading Lan-

guage (HLSL) is a proprietary shading language developed by

Microsoft for use with the Microsoft Direct3D API. It is analo-

gous to the GLSL shading language used with the OpenGL stan-

dard. It is very similar to the NVIDIA Cg shading language, as it

was developed alongside it.

HLSL programs come in three forms: vertex shaders, geometry

shaders, and pixel (or fragment) shaders. A vertex shader is ex-

ecuted for each vertex that is submitted by the application, and

is primarily responsible for transforming the vertex from object

space to view space, generating texture coordinates, and calcu-

lating lighting coefficients such as the vertex’s tangent, binor-

mal and normal vectors. When a group of vertices (normally 3,

to form a triangle) come through the vertex shader, their out-

put position is interpolated to form pixels within its area; this

process is known as rasterisation. Each of these pixels comes

through the pixel shader, whereby the resultant screen colour

is calculated. Optionally, an application using a Direct3D10 in-

terface and Direct3D10 hardware may also specify a geometry

shader. This shader takes as its input the three vertices of a tri-

angle and uses this data to generate (or tessellate) additional

triangles, which are each then sent to the rasterizer.



InfiniBand

Switched fabric communications link primarily used in HPC and

enterprise data centers. Its features include high throughput,

low latency, quality of service and failover, and it is designed to

be scalable. The InfiniBand architecture specification defines a

connection between processor nodes and high performance

I/O nodes such as storage devices. Like → PCI Express, and many

other modern interconnects, InfiniBand offers point-to-point

bidirectional serial links intended for the connection of processors

with high-speed peripherals such as disks. On top of

the point-to-point capabilities, InfiniBand also offers multicast

operations. It supports several signalling rates and, as

with PCI Express, links can be bonded together for additional


The SDR serial connection’s signalling rate is 2.5 gigabits per second

(Gbit/s) in each direction per connection. DDR is 5 Gbit/s

and QDR is 10 Gbit/s. FDR is 14.0625 Gbit/s and EDR is 25.78125

Gbit/s per lane. For SDR, DDR and QDR, links use 8B/10B encod-

ing – every 10 bits sent carry 8 bits of data – making the use-

ful data transmission rate four-fifths the raw rate. Thus single,

double, and quad data rates carry 2, 4, or 8 Gbit/s useful data,

respectively. For FDR and EDR, links use 64B/66B encoding – ev-

ery 66 bits sent carry 64 bits of data.
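The encoding arithmetic above can be written down directly. The following sketch uses only the per-lane signalling rates and encodings given in the text; the function names `raw_rate` and `useful_rate` are chosen for this example, and the link widths of 1, 4 and 12 lanes correspond to the 1X, 4X and 12X aggregates.

```python
# Per-lane signalling rates in Gbit/s and line encodings, as given in the text.
LANE_RATE = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0, "FDR": 14.0625, "EDR": 25.78125}
ENCODING = {"SDR": (8, 10), "DDR": (8, 10), "QDR": (8, 10),
            "FDR": (64, 66), "EDR": (64, 66)}


def raw_rate(speed, width=1):
    """Raw signalling rate in Gbit/s for a link of `width` lanes (1, 4 or 12)."""
    return LANE_RATE[speed] * width


def useful_rate(speed, width=1):
    """Useful data rate in Gbit/s after subtracting the encoding overhead."""
    data_bits, line_bits = ENCODING[speed]
    return raw_rate(speed, width) * data_bits / line_bits


# Useful data rates per link width, rounded to two decimals, in SDR..EDR order.
table = {
    width: [round(useful_rate(speed, width), 2) for speed in LANE_RATE]
    for width in (1, 4, 12)
}
```

For FDR the computed values (13.64, 54.55 and 163.64 Gbit/s) come out slightly below the rounded figures of 14, 56 and 168 Gbit/s that are usually quoted, while EDR works out to exactly 25, 100 and 300 Gbit/s.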

Implementers can aggregate links in units of 4 or 12, called 4X or

12X. A 12X QDR link therefore carries 120 Gbit/s raw, or 96 Gbit/s

of useful data. As of 2009 most systems use a 4X aggregate, im-

plying a 10 Gbit/s (SDR), 20 Gbit/s (DDR) or 40 Gbit/s (QDR) con-

nection. Larger systems with 12X links are typically used for

cluster and supercomputer interconnects and for inter-switch


The single data rate switch chips have a latency of 200 nano-

seconds, DDR switch chips have a latency of 140 nanoseconds

and QDR switch chips have a latency of 100 nanoseconds. The

end-to-end MPI latency ranges from 1.07 microseconds through

1.29 microseconds to 2.6 microseconds, depending on the adapter.

As of 2009 various InfiniBand host channel adapters (HCA) exist

in the market, each with different latency and bandwidth char-

acteristics. InfiniBand also provides RDMA capabilities for low

CPU overhead. The latency for RDMA operations is less than 1


See the main article “InfiniBand” for a further description of

InfiniBand features.

Intel Integrated Performance Primitives (Intel IPP)

A multi-threaded software library of functions for multimedia

and data processing applications, produced by Intel. The library

supports Intel and compatible processors and is available for

Windows, Linux, and Mac OS X operating systems. It is available

separately or as a part of Intel Parallel Studio. The library takes

advantage of processor features including MMX, SSE, SSE2,

SSE3, SSSE3, SSE4, AES-NI and multicore processors. Intel IPP is

divided into four major processing groups: Signal (with linear

array or vector data), Image (with 2D arrays for typical color

spaces), Matrix (with n x m arrays for matrix operations), and

Cryptography. Half the entry points are of the matrix type, a

third are of the signal type and the remainder are of the image

and cryptography types. Intel IPP functions are further divided

by data type: data types include 8u (8-bit unsigned), 8s

(8-bit signed), 16s, 32f (32-bit floating-point), 64f, etc. Typically, an

application developer works with only one dominant data type

for most processing functions, converting between input to

processing to output formats at the end points. Version 5.2 was

introduced June 5, 2007, adding code samples for data compres-

sion, new video codec support, support for 64-bit applications

on Mac OS X, support for Windows Vista, and new functions for

ray-tracing and rendering. Version 6.1 was released with the

Intel C++ Compiler on June 28, 2009 and Update 1 for version 6.1

was released on July 28, 2009.

Intel Threading Building Blocks (TBB)

A C++ template library developed by Intel Corporation for writ-

ing software programs that take advantage of multi-core pro-

cessors. The library consists of data structures and algorithms

that allow a programmer to avoid some complications arising

from the use of native threading packages such as POSIX →

threads, Windows → threads, or the portable Boost Threads in

which individual → threads of execution are created, synchro-

nized, and terminated manually. Instead the library abstracts

access to the multiple processors by allowing the operations

to be treated as “tasks”, which are allocated to individual cores

dynamically by the library’s run-time engine, and by automat-

ing efficient use of the CPU cache. A TBB program creates, syn-

chronizes and destroys graphs of dependent tasks according

to algorithms, i.e. high-level parallel programming paradigms

(a.k.a. Algorithmic Skeletons). Tasks are then executed respect-

ing graph dependencies. This approach groups TBB in a family

of solutions for parallel programming aiming to decouple the

programming from the particulars of the underlying machine.

Intel TBB is available commercially as a binary distribution with

support and in open source in both source and binary forms.

Version 4.0 was introduced on September 8, 2011.

iSER (“iSCSI Extensions for RDMA”)

A protocol that maps the iSCSI protocol over a network that

provides RDMA services (like → iWARP or → InfiniBand). This

permits data to be transferred directly into SCSI I/O buffers

without intermediate data copies. The Datamover Architecture

(DA) defines an abstract model in which the movement of data

between iSCSI end nodes is logically separated from the rest of

the iSCSI protocol. iSER is one Datamover protocol. The inter-

face between the iSCSI and a Datamover protocol, iSER in this

case, is called Datamover Interface (DI).

iWARP (“Internet Wide Area RDMA Protocol”)

An Internet Engineering Task Force (IETF) update of the RDMA

Consortium’s → RDMA over TCP standard. This later standard is

zero-copy transmission over legacy TCP. Because a kernel imple-

mentation of the TCP stack is a tremendous bottleneck, a few

vendors now implement TCP in hardware. This additional hard-

ware is known as the TCP offload engine (TOE). TOE itself does

not prevent copying on the receive side, and must be combined

with RDMA hardware for zero-copy results. The main component

is the Direct Data Placement protocol (DDP), which permits the actual

zero-copy transmission. The transmission itself is not performed

by DDP, but by TCP.


InfiniBand useful data rates by link width and signalling speed

SDR = Single Data Rate, DDR = Double Data Rate, QDR = Quadruple Data Rate,
FDR = Fourteen Data Rate, EDR = Enhanced Data Rate

       SDR         DDR         QDR         FDR          EDR
1X     2 Gbit/s    4 Gbit/s    8 Gbit/s    14 Gbit/s    25 Gbit/s
4X     8 Gbit/s    16 Gbit/s   32 Gbit/s   56 Gbit/s    100 Gbit/s
12X    24 Gbit/s   48 Gbit/s   96 Gbit/s   168 Gbit/s   300 Gbit/s


JOIN

An SQL join clause combines records from two or more tables

in a relational database. It creates a set that can be saved as a

table or used as it is. A JOIN is a means for combining fields from

two tables (or more) by using values common to each. ANSI-

standard SQL specifies five types of JOIN: INNER, LEFT OUTER,

RIGHT OUTER, FULL OUTER and CROSS. As a special case, a table

(base table, view, or joined table) can JOIN to itself in a self-join.
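Several of these join types can be tried out with Python's built-in `sqlite3` module. The sketch below uses two invented tables, `department` and `employee`, and shows an inner join, a left outer join, a cross join and a self-join. RIGHT OUTER and FULL OUTER joins are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT,
                       dept_id INTEGER, manager_id INTEGER);
INSERT INTO department VALUES (1, 'Research'), (2, 'Sales'), (3, 'Support');
INSERT INTO employee VALUES
    (1, 'Ada', 1, NULL), (2, 'Ben', 1, 1), (3, 'Cy', 2, 1), (4, 'Dee', NULL, 3);
""")

# INNER JOIN: only rows with a match in both tables.
inner = conn.execute(
    "SELECT e.name, d.dept_name FROM employee e "
    "INNER JOIN department d ON e.dept_id = d.dept_id ORDER BY e.emp_id"
).fetchall()

# LEFT OUTER JOIN: every employee, with NULL (None) where no department matches.
left = conn.execute(
    "SELECT e.name, d.dept_name FROM employee e "
    "LEFT OUTER JOIN department d ON e.dept_id = d.dept_id ORDER BY e.emp_id"
).fetchall()

# CROSS JOIN: Cartesian product of both tables.
cross_count = conn.execute(
    "SELECT COUNT(*) FROM employee CROSS JOIN department"
).fetchone()[0]

# Self-join: the employee table joined to itself to find each person's manager.
managers = conn.execute(
    "SELECT e.name, m.name FROM employee e "
    "INNER JOIN employee m ON e.manager_id = m.emp_id ORDER BY e.emp_id"
).fetchall()
```

The employee without a department appears only in the left outer join, and the employee without a manager drops out of the self-join, which shows how the join type decides what happens to unmatched rows.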

Kernel (in CUDA architecture)

Part of the → CUDA programming model


A high-quality open-source implementation of the → MPI speci-

fication, including all of MPI-1.2 and much of MPI-2. Superseded

by the → OpenMPI implementation.

LAPACK (“linear algebra package”)

Routines for solving systems of simultaneous linear equations,

least-squares solutions of linear systems of equations, eigen-

value problems, and singular value problems. The original goal

of the LAPACK project was to make the widely used EISPACK and

→ LINPACK libraries run efficiently on shared-memory vector

and parallel processors. LAPACK routines are written so that as

much as possible of the computation is performed by calls to

the → BLAS library. While → LINPACK and EISPACK are based on

the vector operation kernels of the Level 1 BLAS, LAPACK was

designed at the outset to exploit the Level 3 BLAS. Highly effi-

cient machine-specific implementations of the BLAS are avail-

able for many modern high-performance computers. The BLAS

enable LAPACK routines to achieve high performance with

portable software.


Part of the → parallel NFS standard. Currently three types of

layout exist: file-based, block/volume-based, and object-based,

the latter making use of → object-based storage devices


LINPACK

A collection of Fortran subroutines that analyze and solve linear

equations and linear least-squares problems. LINPACK was de-

signed for supercomputers in use in the 1970s and early 1980s.

LINPACK has been largely superseded by → LAPACK, which has

been designed to run efficiently on shared-memory, vector su-

percomputers. LINPACK makes use of the → BLAS libraries for

performing basic vector and matrix operations.

The LINPACK benchmarks are a measure of a system’s floating

point computing power and measure how fast a computer

solves a dense N by N system of linear equations Ax=b, which is a

common task in engineering. The solution is obtained by Gauss-

ian elimination with partial pivoting, with 2/3•N³ + 2•N² floating

point operations. The result is reported in millions of floating

point operations per second (MFLOP/s, sometimes simply called
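How such a figure is obtained can be sketched in plain Python. The example below is not the LINPACK benchmark code: it is a small, unoptimized Gaussian elimination with partial pivoting, timed once, with the nominal operation count 2/3·N³ + 2·N² from the text divided by the elapsed time. The matrix size, the test matrix and the function names are arbitrary choices for illustration.

```python
import time


def solve_dense(a, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting (inputs are copied)."""
    n = len(b)
    a = [row[:] for row in a]
    b = b[:]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution on the resulting upper triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(a[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (b[row] - s) / a[row][row]
    return x


def linpack_flops(n):
    """Nominal operation count from the text: 2/3*N^3 + 2*N^2."""
    return 2.0 / 3.0 * n**3 + 2.0 * n**2


n = 60
# Diagonally dominant test matrix (an arbitrary choice) whose exact solution is all ones.
a = [[float(n) if i == j else 1.0 / (1 + abs(i - j)) for j in range(n)] for i in range(n)]
b = [sum(row) for row in a]
start = time.perf_counter()
x = solve_dense(a, b)
elapsed = time.perf_counter() - start
mflops = linpack_flops(n) / elapsed / 1e6
residual = max(abs(xi - 1.0) for xi in x)
```

The rate measured this way in interpreted Python is orders of magnitude below what optimized → BLAS-based code achieves on the same machine, which is exactly the kind of difference the benchmark is meant to expose.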



Communication protocol in → Lustre

Logical object volume (LOV)

A logical entity in → Lustre


An object-based → parallel file system

Management server (MGS)

A functional component in → Lustre


MapReduce

A framework for processing highly distributable problems

across huge datasets using a large number of computers

(nodes), collectively referred to as a cluster (if all nodes use the

same hardware) or a grid (if the nodes use different hardware).

Computational processing can occur on data stored either in a

filesystem (unstructured) or in a database (structured).

“Map” step: The master node takes the input, divides it into

smaller sub-problems, and distributes them to worker nodes. A

worker node may do this again in turn, leading to a multi-level

tree structure. The worker node processes the smaller problem,

and passes the answer back to its master node.

“Reduce” step: The master node then collects the answers to all

the sub-problems and combines them in some way to form the

output – the answer to the problem it was originally trying to solve.


MapReduce allows for distributed processing of the map and

reduction operations. Provided each mapping operation is in-

dependent of the others, all maps can be performed in parallel

– though in practice it is limited by the number of independent

data sources and/or the number of CPUs near each source. Simi-

larly, a set of ‘reducers’ can perform the reduction phase - pro-

vided all outputs of the map operation that share the same key

are presented to the same reducer at the same time. While this

process can often appear inefficient compared to algorithms

that are more sequential, MapReduce can be applied to signifi-

cantly larger datasets than “commodity” servers can handle – a

large server farm can use MapReduce to sort a petabyte of data

in only a few hours. The parallelism also offers some possibility

of recovering from partial failure of servers or storage during

the operation: if one mapper or reducer fails, the work can be

rescheduled – assuming the input data is still available.
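The map and reduce steps can be sketched with the classic word-count example. The code below runs in a single Python process and only mimics the data flow; it is not Hadoop's API, and the input splits and function names are invented for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (key, value) pair for every word of one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all values by key so that each key reaches exactly one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share a key."""
    return key, sum(values)


splits = ["the quick brown fox", "the lazy dog", "The fox"]   # hypothetical input splits

# Each call to map_phase is independent, so a cluster could run them in parallel.
mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Because no map call depends on another and every key is reduced separately, a failed mapper or reducer can simply be rerun on its own input, which is the recovery property described above.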

metadata server (MDS)

A functional component in → Lustre

metadata target (MDT)

A logical entity in → Lustre

MKL (“Math Kernel Library”)

A library of optimized, math routines for science, engineering,

and financial applications developed by Intel. Core math func-

tions include → BLAS, → LAPACK, → ScaLAPACK, Sparse Solvers,

Fast Fourier Transforms, and Vector Math. The library supports

Intel and compatible processors and is available for Windows,

Linux and Mac OS X operating systems.

MPI, MPI-2 (“message-passing interface”)

A language-independent communications protocol used to

program parallel computers. Both point-to-point and collective

communication are supported. MPI remains the dominant mod-

el used in high-performance computing today. There are two

versions of the standard that are currently popular: version 1.2

(MPI-1 for short), which emphasizes message passing and

has a static runtime environment, and MPI-2.1 (MPI-2), which in-

cludes new features such as parallel I/O, dynamic process man-

agement and remote memory operations. MPI-2 specifies over

500 functions and provides language bindings for ANSI C, ANSI


Fortran (Fortran90), and ANSI C++. Interoperability of objects de-

fined in MPI was also added to allow for easier mixed-language

message passing programming. A side effect of MPI-2 stan-

dardization (completed in 1996) was clarification of the MPI-1

standard, creating the MPI-1.2 level. MPI-2 is mostly a superset

of MPI-1, although some functions have been deprecated. Thus

MPI-1.2 programs still work under MPI implementations com-

pliant with the MPI-2 standard. The MPI Forum reconvened in

2007, to clarify some MPI-2 issues and explore developments for

a possible MPI-3.


A freely available, portable → MPI 2.0 implementation,

maintained by Argonne National Laboratory

MPP (“massively parallel processing”)

So-called MPP jobs are computer programs with several parts

running on several machines in parallel, often calculating simu-

lation problems. The communication between these parts can

e.g. be realized by the → MPI software interface.


Microsoft → MPI 2.0 implementation shipped with Microsoft

HPC Pack 2008 SDK, based on and designed for maximum com-

patibility with the → MPICH2 reference implementation.


An → MPI 2.0 implementation based on → MPICH2 and devel-

oped by the Department of Computer Science and Engineer-

ing at Ohio State University. It is available under BSD licens-

ing and supports MPI over InfiniBand, 10GigE/iWARP and


NetCDF (“Network Common Data Form”)

A set of software libraries and self-describing, machine-inde-

pendent data formats that support the creation, access, and

sharing of array-oriented scientific data. The project homep-

age is hosted by the Unidata program at the University Cor-

poration for Atmospheric Research (UCAR). They are also the

chief source of NetCDF software, standards development,

updates, etc. The format is an open standard. NetCDF Clas-

sic and 64-bit Offset Format are an international standard

of the Open Geospatial Consortium. The project is actively

supported by UCAR. Version 4.0 (released in 2008)

greatly enhanced the data model by allowing the use of the

→ HDF5 data file format. Version 4.1 (2010) adds support for

C and Fortran client access to specified subsets of remote

data via OPeNDAP. The format was originally based on the

conceptual model of the NASA CDF but has since diverged

and is not compatible with it. It is commonly used in clima-

tology, meteorology and oceanography applications (e.g.,

weather forecasting, climate change) and GIS applications. It

is an input/output format for many GIS applications, and for

general scientific data exchange. The NetCDF C library, and

the libraries based on it (Fortran 77 and Fortran 90, C++, and

all third-party libraries) can, starting with version 4.1.1, read

some data in other data formats. Data in the → HDF5 format

can be read, with some restrictions. Data in the → HDF4 for-

mat can be read by the NetCDF C library if created using the

→ HDF4 Scientific Data (SD) API.


NetworkDirect

A remote direct memory access (RDMA)-based network interface

implemented in Windows Server 2008 and later. NetworkDirect

uses a more direct path from → MPI applications to networking

hardware, resulting in very fast and efficient networking. See the

main article “Windows HPC Server 2008 R2” for further details.

NFS (Network File System)

A network file system protocol originally developed by Sun Mi-

crosystems in 1984, allowing a user on a client computer to ac-

cess files over a network in a manner similar to how local stor-

age is accessed. NFS, like many other protocols, builds on the

Open Network Computing Remote Procedure Call (ONC RPC)

system. The Network File System is an open standard defined in

RFCs, allowing anyone to implement the protocol.

Sun used version 1 only for in-house experimental purposes.

When the development team added substantial changes to NFS

version 1 and released it outside of Sun, they decided to release

the new version as V2, so that version interoperation and RPC

version fallback could be tested. Version 2 of the protocol (de-

fined in RFC 1094, March 1989) originally operated entirely over

UDP. Its designers meant to keep the protocol stateless, with

locking (for example) implemented outside of the core protocol.

Version 3 (RFC 1813, June 1995) added:

• support for 64-bit file sizes and offsets, to handle files larger than 2 gigabytes (GB)

• support for asynchronous writes on the server, to improve write performance

• additional file attributes in many replies, to avoid the need to re-fetch them

• a READDIRPLUS operation, to get file handles and attributes along with file names when scanning a directory

• assorted other improvements
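The 2 GB figure in the first item can be checked with simple arithmetic. The sketch below assumes that a 32-bit offset is treated as a signed integer, which is a common explanation for the limit but is not stated in the text above; the function name `max_file_size` is invented for this illustration.

```python
GIB = 2 ** 30  # bytes per gibibyte

def max_file_size(offset_bits, signed=True):
    """Largest byte count addressable with a file offset of the given width."""
    # Assumption: a signed offset loses one bit to the sign.
    usable_bits = offset_bits - 1 if signed else offset_bits
    return 2 ** usable_bits

print(max_file_size(32) // GIB)  # size limit in GiB with a 32-bit signed offset
print(max_file_size(64) // GIB)  # size limit in GiB with a 64-bit signed offset
```

Under this assumption a 32-bit offset stops at 2 GiB, while a 64-bit offset moves the limit far beyond any practical file size.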

At the time of introduction of Version 3, vendor support for TCP

as a transport-layer protocol began increasing. While several


vendors had already added support for NFS Version 2 with TCP

as a transport, Sun Microsystems added support for TCP as a

transport for NFS at the same time it added support for Version 3.

Using TCP as a transport made using NFS over a WAN more feasible.


Version 4 (RFC 3010, December 2000; revised in RFC 3530, April

2003), influenced by AFS and CIFS, includes performance im-

provements, mandates strong security, and introduces a state-

ful protocol. Version 4 became the first version developed with

the Internet Engineering Task Force (IETF) after Sun Microsys-

tems handed over the development of the NFS protocols.

NFS version 4 minor version 1 (NFSv 4.1) was approved by the IESG and assigned an RFC number in January 2010. The NFSv 4.1 specification aims to provide protocol support to take advantage of clustered server deployments, including the ability

to provide scalable parallel access to files distributed among

multiple servers. NFSv 4.1 adds the parallel NFS (pNFS) capabil-

ity, which enables data access parallelism. The NFSv 4.1 protocol

defines a method of separating the filesystem meta-data from

the location of the file data; it goes beyond the simple name/

data separation by striping the data amongst a set of data serv-

ers. This is different from the traditional NFS server which holds

the names of files and their data under the single umbrella of

the server.
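As a rough illustration of the striping idea, the sketch below maps a byte offset within a file to a data server using a plain round-robin layout. The stripe size, the server names, and the function `locate_stripe` are assumptions made for this example; real pNFS layouts are handed out by the metadata server and can be more elaborate.

```python
def locate_stripe(offset, stripe_size, servers):
    """Return (server, offset_on_server) for a byte offset in a striped file."""
    stripe_index = offset // stripe_size            # which stripe unit holds the byte
    server = servers[stripe_index % len(servers)]   # round-robin placement
    # Stripe units already stored on that server, plus the position inside this unit.
    local_offset = (stripe_index // len(servers)) * stripe_size + offset % stripe_size
    return server, local_offset

servers = ["ds0", "ds1", "ds2"]  # hypothetical data servers
for off in (0, 65536, 200000):
    print(off, locate_stripe(off, 65536, servers))
```

Because consecutive stripe units land on different servers, a client reading a large range can fetch several units in parallel instead of queuing behind a single server.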

In addition to pNFS, NFSv 4.1 provides sessions, directory delegation and notifications, multi-server namespace, access control lists (ACL/SACL/DACL), retention attributes, and SECINFO_NO_NAME. See the main article “Parallel Filesystems” for further details.

Work is under way on a draft for a future version 4.2 of the NFS standard, including so-called federated filesystems, which constitute the NFS counterpart of Microsoft’s distributed filesystem (DFS).


NoSQL

A NoSQL (originally referring to “non SQL” or “non relational”) database provides a mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used

in relational databases. Such databases have existed since the

late 1960s, but did not obtain the “NoSQL” moniker until a surge

of popularity in the early twenty-first century.

Motivations for this approach include: simplicity of design, sim-

pler “horizontal” scaling to clusters of machines (which is a prob-

lem for relational databases), and finer control over availability.

The data structures used by NoSQL databases (e.g. key-value,

wide column, graph, or document) are different from those used

by default in relational databases, making some operations fast-

er in NoSQL. The particular suitability of a given NoSQL database

depends on the problem it must solve. Sometimes the data struc-

tures used by NoSQL databases are also viewed as “more flex-

ible” than relational database tables.
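To make the difference in data structures concrete, the sketch below stores the same order data once as two relational-style tables and once as a single document under a key. Plain Python containers stand in for the databases, and all table, field, and key names are invented for this illustration.

```python
# Relational style: two flat tables linked by customer_id.
customers = [{"customer_id": 1, "name": "Ada"}]
orders = [
    {"order_id": 10, "customer_id": 1, "item": "disk"},
    {"order_id": 11, "customer_id": 1, "item": "cable"},
]

def orders_by_join(customer_id):
    """Emulate a JOIN: scan both tables and match rows on customer_id."""
    return [
        (c["name"], o["item"])
        for c in customers
        for o in orders
        if c["customer_id"] == customer_id and o["customer_id"] == customer_id
    ]

# Document style: one key leads to a nested record that already holds the orders.
documents = {
    "customer:1": {"name": "Ada", "orders": [{"item": "disk"}, {"item": "cable"}]}
}

def orders_by_key(key):
    """A single key lookup returns the whole aggregate without a join."""
    doc = documents[key]
    return [(doc["name"], o["item"]) for o in doc["orders"]]

print(orders_by_join(1))
print(orders_by_key("customer:1"))
```

Both calls return the same pairs, but the document lookup touches one record, which is the kind of operation that tends to be faster in a NoSQL store. The price is that queries the document layout was not designed for become harder.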

NoSQL databases are increasingly used in big data and real-time

web applications. NoSQL systems are also sometimes called “Not

only SQL” to emphasize that they may support SQL-like query languages.


Many NoSQL stores compromise consistency (in the sense of the

CAP theorem) in favor of availability, partition tolerance, and

speed. Barriers to the greater adoption of NoSQL stores include

the use of low-level query languages instead of SQL (for instance, the inability to perform ad-hoc → JOINs across tables), lack

of standardized interfaces, and huge previous investments in ex-

isting relational databases.

Instead, most NoSQL databases offer a concept of “eventual

consistency” in which database changes are propagated to all

nodes “eventually” (typically within milliseconds) so queries for

data might not return updated data immediately or might result

in reading data that is not accurate, a problem known as stale

reads. Additionally, some NoSQL systems may exhibit lost writes

and other forms of data loss. Fortunately, some NoSQL systems

provide concepts such as write-ahead logging to avoid data loss.
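The stale-read behaviour described above can be reproduced with a toy model. The sketch below is not taken from any particular NoSQL product: the class names are hypothetical, and propagation is triggered by an explicit `sync()` call rather than by a background process.

```python
class Replica:
    """One node holding its own copy of the data."""
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    """Writes go to the first replica; the others catch up when sync() runs."""
    def __init__(self, n_replicas=2):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # updates not yet propagated

    def write(self, key, value):
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica=0):
        return self.replicas[replica].data.get(key)

    def sync(self):
        """Propagate all pending updates, in order, to the remaining replicas."""
        for key, value in self.pending:
            for r in self.replicas[1:]:
                r.data[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("x", 1)
store.sync()
store.write("x", 2)
print(store.read("x", replica=1))  # 1, a stale read: the update has not arrived yet
store.sync()
print(store.read("x", replica=1))  # 2, the replicas have converged
```

Between the second write and the second `sync()`, a client that happens to read from the second replica sees the old value, which is exactly the window in which stale reads occur.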

For distributed transaction processing across multiple databas-

es, data consistency is an even bigger challenge that is difficult

for both NoSQL and relational databases. Even current relational

databases do not allow referential integrity constraints to span

databases. There are few systems that maintain both → ACID

transactions and X/Open XA standards for distributed transac-

tion processing.

NUMA (“non-uniform memory access”)

A computer memory design used in multiprocessors, where the

memory access time depends on the memory location relative

to a processor. Under NUMA, a processor can access its own local

memory faster than non-local memory, that is, memory local to

another processor or memory shared between processors.

object storage server (OSS)

A functional component in → Lustre

object storage target (OST)

A logical entity in → Lustre

object-based storage device (OSD)

An intelligent evolution of disk drives that can store and serve

objects rather than simply place data on tracks and sectors.

This task is accomplished by moving low-level storage func-

tions into the storage device and accessing the device through

an object interface. Unlike a traditional block-oriented device

providing access to data organized as an array of unrelated

blocks, an object store allows access to data by means of stor-

age objects. A storage object is a virtual entity that groups data

together that has been determined by the user to be logically

related. Space for a storage object is allocated internally by the

OSD itself instead of by a host-based file system. OSDs man-

age all necessary low-level storage, space management, and

security functions. Because there is no host-based metadata

for an object (such as inode information), the only way for an

application to retrieve an object is by using its object identifier

(OID). The SCSI interface was modified and extended by the OSD

Technical Work Group of the Storage Networking Industry Asso-

ciation (SNIA) with varied industry and academic contributors,

resulting in a draft standard to T10 in 2004. This standard was

ratified in September 2004 and became the ANSI T10 SCSI OSD

V1 command set, released as INCITS 400-2004. The SNIA group

continues to work on further extensions to the interface, such

as the ANSI T10 SCSI OSD V2 command set.
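The contrast between block access and object access can be sketched in a few lines. The two classes below are a simplified model with invented names and methods; they do not reproduce the T10 SCSI OSD command set.

```python
import itertools

class BlockDevice:
    """Traditional view: a flat array of fixed-size blocks addressed by number."""
    def __init__(self, n_blocks, block_size=512):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(n_blocks)]

    def write_block(self, lba, data):
        # Pad or truncate so that every block keeps the fixed size.
        self.blocks[lba] = data.ljust(self.block_size, b"\0")[: self.block_size]

    def read_block(self, lba):
        return self.blocks[lba]

class ObjectStore:
    """Object view: the device allocates space itself and hands back an OID."""
    def __init__(self):
        self._objects = {}
        self._oids = itertools.count(1)

    def create(self, data, attributes=None):
        oid = next(self._oids)
        self._objects[oid] = (bytes(data), dict(attributes or {}))
        return oid

    def read(self, oid):
        return self._objects[oid][0]

    def attributes(self, oid):
        return dict(self._objects[oid][1])

store = ObjectStore()
oid = store.create(b"simulation output", {"owner": "alice"})
print(oid, store.read(oid), store.attributes(oid))
```

With the block device the host must remember which block numbers belong together, whereas the object store keeps data and attributes under one identifier and the caller only needs the OID.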

OLAP cube (“Online Analytical Processing”)

A set of data, organized in a way that facilitates non-predeter-

mined queries for aggregated information, or in other words,


online analytical processing. OLAP is one of the computer-

based techniques for analyzing business data that are collec-

tively called business intelligence. OLAP cubes can be thought

of as extensions to the two-dimensional array of a spreadsheet.

For example a company might wish to analyze some financial

data by product, by time-period, by city, by type of revenue and

cost, and by comparing actual data with a budget. These addi-

tional methods of analyzing the data are known as dimensions.

Because there can be more than three dimensions in an OLAP

system the term hypercube is sometimes used.
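A minimal sketch of aggregating along dimensions follows. The fact rows, the dimension names, and the function `roll_up` are made up for this example; a real OLAP engine would precompute and index such aggregates rather than scan the rows each time.

```python
from collections import defaultdict

# Each fact row: (product, period, city, amount)
facts = [
    ("laptop", "2016-Q1", "Berlin", 120.0),
    ("laptop", "2016-Q1", "Paris", 80.0),
    ("laptop", "2016-Q2", "Berlin", 100.0),
    ("server", "2016-Q1", "Berlin", 300.0),
]
DIMENSIONS = ("product", "period", "city")

def roll_up(rows, keep):
    """Sum the amount over every dimension that is not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

print(roll_up(facts, ["product"]))
print(roll_up(facts, ["period", "city"]))
```

Each call answers a different, non-predetermined question from the same data: the first collapses time and location to give totals per product, the second collapses the product dimension instead.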

OpenCL (“Open Computing Language”)

A framework for writing programs that execute across hetero-

geneous platforms consisting of CPUs, GPUs, and other proces-

sors. OpenCL includes a language (based on C99) for writing ker-

nels (functions that execute on OpenCL devices), plus APIs that

are used to define and then control the platforms. OpenCL pro-

vides parallel computing using task-based and data-based par-

allelism. OpenCL is analogous to the open industry standards

OpenGL and OpenAL, for 3D graphics and computer audio,

respectively. Originally developed by Apple Inc., which holds

trademark rights, OpenCL is now managed by the non-profit

technology consortium Khronos Group.

OpenMP (“Open Multi-Processing”)

An application programming interface (API) that supports

multi-platform shared memory multiprocessing programming

in C, C++ and Fortran on many architectures, including Unix and

Microsoft Windows platforms. It consists of a set of compiler di-

rectives, library routines, and environment variables that influ-

ence run-time behavior.

Jointly defined by a group of major computer hardware and

software vendors, OpenMP is a portable, scalable model that

gives programmers a simple and flexible interface for develop-

ing parallel applications for platforms ranging from the desk-

top to the supercomputer.

An application built with the hybrid model of parallel program-

ming can run on a computer cluster using both OpenMP and

Message Passing Interface (MPI), or more transparently through

the use of OpenMP extensions for non-shared memory systems.


Open MPI

An open source → MPI-2 implementation that is developed and maintained by a consortium of academic, research, and industry partners.


OpenStack

A free and open-source software platform for cloud computing,

mostly deployed as an infrastructure-as-a-service (IaaS). The

software platform consists of interrelated components that

control hardware pools of processing, storage, and network-

ing resources throughout a data center. Users either manage it

through a web-based dashboard, through command-line tools,

or through a RESTful API. OpenStack.org released it under the

terms of the Apache License.

OpenStack began in 2010 as a joint project of Rackspace Host-

ing and NASA. As of 2016 it is managed by the OpenStack Foun-

dation, a non-profit corporate entity established in September

2012 to promote OpenStack software and its community.

The OpenStack community collaborates around a six-month,

time-based release cycle with frequent development mile-

stones. During the planning phase of each release, the commu-

nity gathers for an OpenStack Design Summit to facilitate de-

veloper working sessions and to assemble plans. See the main

article “OpenStack for HPC Cloud Environments” for further details.


parallel NFS (pNFS)

A → parallel file system standard, optional part of the current →

NFS standard 4.1. See the main article “Parallel Filesystems” for

further details.

PCI Express (PCIe)

A computer expansion card standard designed to replace the

older PCI, PCI-X, and AGP standards. Introduced by Intel in 2004,

PCIe (or PCI-E, as it is commonly called) is the latest standard

for expansion cards that is available on mainstream computers. PCIe, unlike previous PC expansion standards, is structured around point-to-point serial links, a pair of which (one in each direction) makes up a lane, rather than a shared parallel bus.

Lanes   PCIe 1.x   PCIe 2.x   PCIe 3.0   PCIe 4.0
x1      256 MB/s   512 MB/s   1 GB/s     2 GB/s
x2      512 MB/s   1 GB/s     2 GB/s     4 GB/s
x4      1 GB/s     2 GB/s     4 GB/s     8 GB/s
x8      2 GB/s     4 GB/s     8 GB/s     16 GB/s
x16     4 GB/s     8 GB/s     16 GB/s    32 GB/s
x32     8 GB/s     16 GB/s    32 GB/s    64 GB/s

These lanes are routed by a hub on the mainboard acting as

a crossbar switch. This dynamic point-to-point behavior allows

more than one pair of devices to communicate with each other

at the same time. In contrast, older PC interfaces had all devices

permanently wired to the same bus; therefore, only one device

could send information at a time. This format also allows “chan-

nel grouping”, where multiple lanes are bonded to a single de-

vice pair in order to provide higher bandwidth. The number of

lanes is “negotiated” during power-up or explicitly during op-

eration. By making the lane count flexible a single standard can

provide for the needs of high-bandwidth cards (e.g. graphics

cards, 10 Gigabit Ethernet cards and multiport Gigabit Ethernet

cards) while also being economical for less demanding cards.

Unlike preceding PC expansion interface standards, PCIe is a

network of point-to-point connections. This removes the need

for “arbitrating” the bus or waiting for the bus to be free and

allows for full duplex communications. This means that while

standard PCI-X (133 MHz 64 bit) and PCIe x4 have roughly the

same data transfer rate, PCIe x4 will give better performance if

multiple device pairs are communicating simultaneously or if

communication within a single device pair is bidirectional.

Specifications of the format are maintained and developed by a

group of more than 900 industry-leading companies called the

PCI-SIG (PCI Special Interest Group). In PCIe 1.x, each lane carries

approximately 250 MB/s. PCIe 2.0, released in late 2007, adds a

Gen2-signalling mode, doubling the rate to about 500 MB/s. On

November 18, 2010, the PCI Special Interest Group officially published the finalized PCI Express 3.0 specification to its members

to build devices based on this new version of PCI Express, which

allows for a Gen3-signalling mode at 1 GB/s.
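The per-lane figures follow from the signalling rate and the line encoding of each generation. The sketch below uses 2.5, 5, 8, and 16 GT/s with 8b/10b encoding for the first two generations and 128b/130b for the later ones; apart from the 16 GT/s figure, these values come from the PCI Express specifications rather than from the text above, and the function names are illustrative.

```python
GENERATIONS = {
    # name: (transfer rate in GT/s, payload bits, encoded bits)
    "PCIe 1.x": (2.5, 8, 10),
    "PCIe 2.x": (5.0, 8, 10),
    "PCIe 3.0": (8.0, 128, 130),
    "PCIe 4.0": (16.0, 128, 130),
}

def lane_rate_mb_s(generation):
    """Usable bandwidth of one lane in one direction, in MB/s (10**6 bytes)."""
    gt_s, payload, encoded = GENERATIONS[generation]
    bits_per_s = gt_s * 1e9 * payload / encoded  # strip the encoding overhead
    return bits_per_s / 8 / 1e6

def link_rate_mb_s(generation, lanes):
    """Bandwidth of a link made of several bonded lanes, in MB/s."""
    return lane_rate_mb_s(generation) * lanes

for gen in GENERATIONS:
    print(gen, round(lane_rate_mb_s(gen)), round(link_rate_mb_s(gen, 16)))
```

This gives roughly 250, 500, 985, and 1969 MB/s per lane, which the table in this entry rounds to the nearest power of two.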

On November 29, 2011, PCI-SIG announced that it would proceed to

PCI Express 4.0 featuring 16 GT/s, still on copper technology.

Additionally, active and idle power optimizations are to be in-

vestigated. Final specifications are expected to be released in


PETSc (“Portable, Extensible Toolkit for Scientific Computation”)

A suite of data structures and routines for the scalable (parallel)

solution of scientific applications modeled by partial differential

equations. It employs the → Message Passing Interface (MPI) standard for all message-passing communication. The current version of PETSc is 3.2, released September 8, 2011. PETSc is intended for use in large-scale application projects, and many ongoing computational science projects are built around the PETSc libraries.

Its careful design allows advanced users to have detailed control

over the solution process. PETSc includes a large suite of parallel

linear and nonlinear equation solvers that are easily used in ap-

plication codes written in C, C++, Fortran and now Python. PETSc

provides many of the mechanisms needed within parallel appli-

cation code, such as simple parallel matrix and vector assembly

routines that allow the overlap of communication and computa-

tion. In addition, PETSc includes support for parallel distributed

arrays useful for finite difference methods.


→ thread

PTX (“parallel thread execution”)

Parallel Thread Execution (PTX) is a pseudo-assembly language

used in NVIDIA’s CUDA programming environment. The ‘nvcc’

compiler translates code written in CUDA, a C-like language, into

PTX, and the graphics driver contains a compiler which translates

the PTX into something which can be run on the processing cores.

RDMA (“remote direct memory access”)

Allows data to move directly from the memory of one computer

into that of another without involving either one’s operating

system. This permits high-throughput, low-latency networking,

which is especially useful in massively parallel computer clus-

ters. RDMA relies on a special philosophy in using DMA. RDMA

supports zero-copy networking by enabling the network adapt-

er to transfer data directly to or from application memory, elimi-

nating the need to copy data between application memory and

the data buffers in the operating system. Such transfers require

no work to be done by CPUs, caches, or context switches, and

transfers continue in parallel with other system operations.

When an application performs an RDMA Read or Write request,

the application data is delivered directly to the network, reduc-

ing latency and enabling fast message transfer. Common RDMA

implementations include → InfiniBand, → iSER, and → iWARP.

RISC (“reduced instruction-set computer”)

A CPU design strategy emphasizing the insight that simplified

instructions that “do less” may still provide for higher performance if this simplicity can be utilized to make instructions execute very quickly. Opposite: → CISC

ScaLAPACK (“scalable LAPACK”)

Library including a subset of → LAPACK routines redesigned

for distributed memory MIMD (multiple instruction, multiple

data) parallel computers. It is currently written in a Single-

Program-Multiple-Data style using explicit message passing

for interprocessor communication. ScaLAPACK is designed for

heterogeneous computing and is portable on any computer

that supports → MPI. The fundamental building blocks of the

ScaLAPACK library are distributed memory versions (PBLAS) of

the Level 1, 2 and 3 → BLAS, and a set of Basic Linear Algebra

Communication Subprograms (BLACS) for communication tasks

that arise frequently in parallel linear algebra computations. In

the ScaLAPACK routines, all interprocessor communication oc-

curs within the PBLAS and the BLACS. One of the design goals of

ScaLAPACK was to have the ScaLAPACK routines resemble their

→ LAPACK equivalents as much as possible.

service-oriented architecture (SOA)

An approach to building distributed, loosely coupled applica-

tions in which functions are separated into distinct services

that can be distributed over a network, combined, and reused.

See the main article “Windows HPC Server 2008 R2” for further details.


single precision/double precision

→ floating point standard

SMP (“shared memory processing”)

So-called SMP jobs are computer programs with several

parts running on the same system and accessing a shared

memory region. A usual implementation of SMP jobs is

→ multi-threaded programs. The communication between the

single threads can e.g. be realized by the → OpenMP software

interface standard, but also in a non-standard way by means of

native UNIX interprocess communication mechanisms.

SMP (“symmetric multiprocessing”)

A multiprocessor or multicore computer architecture where two

or more identical processors or cores can connect to a single

shared main memory in a completely symmetric way, i.e. each

part of the main memory has the same distance to each of the

cores. Opposite: → NUMA

storage access protocol

Part of the → parallel NFS standard


STREAM

A simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.
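One of the simple vector kernels used by the benchmark is the “triad”, a[i] = b[i] + q * c[i]. The sketch below shows the shape of such a measurement, not the benchmark itself. The real program is compiled C or Fortran, while in interpreted Python the loop overhead dominates, so the printed number says little about the memory system. The function name and the default sizes are assumptions.

```python
import time
from array import array

def triad_bandwidth(n=200_000, scalar=3.0):
    """Run a[i] = b[i] + scalar * c[i] once; return the result and MB/s."""
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    a = array("d", [0.0]) * n
    start = time.perf_counter()
    for i in range(n):
        a[i] = b[i] + scalar * c[i]
    elapsed = time.perf_counter() - start
    bytes_moved = 3 * n * a.itemsize  # two reads and one write per element
    return a, bytes_moved / elapsed / 1e6

result, mb_per_s = triad_bandwidth()
print(f"triad: {mb_per_s:.1f} MB/s (dominated by interpreter overhead)")
```

The bandwidth is the number of bytes the kernel has to move divided by the elapsed time, which is why the arrays must be much larger than the caches for the figure to reflect main memory.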

streaming multiprocessor (SM)

Hardware component within the → Tesla GPU series

subnet manager

Application responsible for configuring the local → InfiniBand

subnet and ensuring its continued operation.

superscalar processors

A superscalar CPU architecture implements a form of parallel-

ism called instruction-level parallelism within a single proces-

sor. It thereby allows faster CPU throughput than would other-

wise be possible at the same clock rate. A superscalar processor

executes more than one instruction during a clock cycle by si-

multaneously dispatching multiple instructions to redundant

functional units on the processor. Each functional unit is not a

separate CPU core but an execution resource within a single CPU

such as an arithmetic logic unit, a bit shifter, or a multiplier.


Tesla

NVIDIA’s third brand of GPUs, based on high-end GPUs from the G80 onward. Tesla is NVIDIA’s first dedicated General Purpose GPU. Because of their very high computational power (measured in floating point operations per second, or FLOPS) compared to recent microprocessors, the Tesla products are intended for the HPC market. The primary function of Tesla products is to aid

in simulations, large scale calculations (especially floating-point

calculations), and image generation for professional and scien-

tific fields, with the use of → CUDA. See the main article “NVIDIA

GPU Computing” for further details.


thread

A thread of execution is a fork of a computer program into two

or more concurrently running tasks. The implementation of

threads and processes differs from one operating system to an-

other, but in most cases, a thread is contained inside a process.

On a single processor, multithreading generally occurs by multi-

tasking: the processor switches between different threads. On

a multiprocessor or multi-core system, the threads or tasks will

generally run at the same time, with each processor or core run-

ning a particular thread or task. Threads are distinguished from

processes in that processes are typically independent, while

threads exist as subsets of a process. Whereas processes have

separate address spaces, threads share their address space,

which makes inter-thread communication much easier than

classical inter-process communication (IPC).
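The shared address space can be seen in a short sketch: several threads update the same object directly, with a lock to keep the updates from interleaving. The names `counter`, `work`, and `run` are invented for this example.

```python
import threading

counter = {"value": 0}   # lives in the address space shared by all threads
lock = threading.Lock()

def work(n):
    """Increment the shared counter n times."""
    for _ in range(n):
        with lock:       # serialize updates to the shared object
            counter["value"] += 1

def run(n_threads=4, n=1000):
    """Start n_threads workers, wait for them, and return the final count."""
    counter["value"] = 0
    threads = [threading.Thread(target=work, args=(n,)) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]

print(run())  # 4000
```

No data is copied or sent between the workers. Separate processes would need an explicit IPC mechanism such as a pipe or a shared-memory segment to achieve the same result.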

thread (in CUDA architecture)

Part of the → CUDA programming model

thread block (in CUDA architecture)

Part of the → CUDA programming model

thread processor array (TPA)

Hardware component within the → Tesla GPU series

10 Gigabit Ethernet

The fastest of the Ethernet standards, first published in 2002

as IEEE Std 802.3ae-2002. It defines a version of Ethernet with a

nominal data rate of 10 Gbit/s, ten times as fast as Gigabit Eth-

ernet. Over the years several 802.3 standards relating to 10GbE

have been published, which later were consolidated into the

IEEE 802.3-2005 standard. IEEE 802.3-2005 and the other amend-

ments have been consolidated into IEEE Std 802.3-2008. 10

Gigabit Ethernet supports only full duplex links which can be

connected by switches. Half Duplex operation and CSMA/CD

(carrier sense multiple access with collision detect) are not sup-

ported in 10GbE. The 10 Gigabit Ethernet standard encompasses

a number of different physical layer (PHY) standards. As of 2008

10 Gigabit Ethernet is still an emerging technology with only 1

million ports shipped in 2007, and it remains to be seen which of

the PHYs will gain widespread commercial acceptance.

warp (in CUDA architecture)

Part of the → CUDA programming model

Always keep in touch with the latest news

Visit us on the Web!

Here you will find comprehensive information about HPC, IT solutions for the datacenter, services and

high-performance, efficient IT systems.

Subscribe to our technology journals, E-News or the transtec newsletter and always stay up to date.


transtec Germany

Tel +49 (0) 7121/2678 - 400

[email protected]


transtec Switzerland

Tel +41 (0) 44 / 818 47 00

[email protected]


transtec United Kingdom

Tel +44 (0) 1295 / 756 500

[email protected]


ttec Netherlands

Tel +31 (0) 24 34 34 210

[email protected]


transtec France

Tel +33 (0)

[email protected]


Texts and concept:

Layout and design:

Dr. Oliver Tennert | [email protected]

Silke Lerche | [email protected]

Fanny Schwarz | [email protected]

Sebastian Friedmann | [email protected]

© transtec AG, March 2016

The graphics, diagrams and tables found herein are the intellectual property of transtec AG and may be reproduced or published only with its express permission.

No responsibility will be assumed for inaccuracies or omissions. Other names or logos may be trademarks of their respective owners.