HPC USER FORUM ISV PANEL
April 2010, Dearborn, MI
TRANSCRIPT

Page 1: HPC USER FORUM ISV PANEL

April 2010, Dearborn, MI

Page 2: Panel Members

• Moderators
  – Alex Akkerman, Ford Motor Company
  – Sharan Kalwani, KAUST
• Participants
  – Steve Feldman, CD-adapco
  – Matt Dunbar, Simulia
  – Uwe Schramm, Altair Engineering
  – Li Zhang, Livermore Software Technology Corporation
  – Barbara Hutchings, ANSYS, Inc.
  – Martin McNamee, MSC Software

Page 3: Panel Format

• 4 questions, provided ahead of time
• 2 minutes per question for each participant
• Follow-up and audience discussion after each participant had a chance to comment

Page 4: Q1. Applications Scalability...

• Please share with the audience, briefly, the issues surrounding applications scalability and how they are being addressed.

Page 5: Q1. Applications Scalability...

• Our solvers scale reasonably well to 512 cores or more on very large problems. Very few users actually use this many cores on one analysis today.
• Primary solver bottlenecks: memory bandwidth, unbalanced workloads.
• Untapped speed potential: parallel meshing, parallel post-processing, parallel I/O, and different algorithms or re-evaluation of methods.

Page 6: Q1. Applications Scalability...

• Fundamental limitations on scalability remain the scalar sections of code (Amdahl's law) and load balance (see the formula after this slide).
• The solution is still developer time and effort, which is being invested.
• Looking at ways to improve developer efficiency through better programming models (primarily from Intel and Microsoft). Past programming-model changes were either too limited (OpenMP), too immature, or simply ineffective. Newer models, driven by the need to bring multi-core execution to commodity applications, have more promise.
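For reference, Amdahl's law is the standard way to quantify how those scalar sections cap speedup; the formula below is textbook background, not something shown on the slide:

    % Amdahl's law: if a fraction p of the runtime parallelizes
    % perfectly across N cores, the best possible speedup is
    S(N) = \frac{1}{(1 - p) + p/N}
    % e.g. p = 0.95 caps the speedup at 1/0.05 = 20x as N grows.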

Page 7: Q1. Applications Scalability...

• We're talking about finite element solver applications. Two classes of solvers: iterative schemes and matrix inversion schemes.
• Issues: scalability, quality, repeatability, data transfer, hardware configurations, hardware access.
• Addressed by: optimal domain decomposition, computational methods that scale well, solver architecture, focus on certain hardware, partnerships.

Page 8: Q1. Applications Scalability...

• Consistency
  – Data summation order under different MPI configurations causes result differences; LS-DYNA uses a fixed summation order (see the sketch after this slide).
  – A modified or refined model decomposes in a different way, changing results; ‘cut lines’ are preserved from the first run.
• Scalability
  – Scaling to 128 processors is not always good.
  – Hybrid LS-DYNA runs SMP within a processor and MPP between processors.
  – Results stay consistent as the number of SMP threads increases.
  – Simple command line to execute.
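The consistency issue arises because floating-point addition is not associative, so an MPI reduction that combines partial sums in a different order can change the low-order bits of the result. A minimal C sketch of one fixed-order fix (illustrative only, not LS-DYNA's actual code):

    #include <mpi.h>

    /* MPI_Allreduce may combine partial sums in any order, so results can
     * drift with the process count. Gathering the partial sums and adding
     * them in rank order gives the same answer on every run. */
    double fixed_order_sum(double local, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        double parts[1024];              /* sketch assumes size <= 1024 */
        MPI_Allgather(&local, 1, MPI_DOUBLE, parts, 1, MPI_DOUBLE, comm);

        double sum = 0.0;
        for (int i = 0; i < size; i++)   /* fixed order: rank 0, 1, 2, ... */
            sum += parts[i];
        return sum;
    }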

Page 9: Q1. Applications Scalability...

• Solver scaling continues to expand: CFD to 1000 cores, structures to 100. Especially key to accelerating transients.
• Need to address the scalar bottlenecks across the full simulation process (meshing, I/O, certain solver physics, visualization).
• Hybrid parallel algorithms for multi-core/mixed-core machines: distributed vs. shared memory, OpenMP vs. MPI (a generic sketch of the pattern follows this slide).
• Support for the latest communication technologies: QDR InfiniBand, iWARP for 10GigE, etc.
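Hybrid parallelism here means MPI ranks across nodes with OpenMP threads inside each rank. A minimal, generic C sketch of the pattern (illustrative only, not any vendor's solver code):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        /* Request thread support so each MPI rank can run OpenMP threads. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Shared-memory (SMP) parallelism inside the rank... */
        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; i++)
            local += 1e-6;

        /* ...distributed-memory (MPP) parallelism between ranks. */
        double total;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total = %f\n", total);
        MPI_Finalize();
        return 0;
    }

A typical launch uses a few ranks per node with several threads each, e.g. OMP_NUM_THREADS=8 mpirun -np 4 ./a.out (exact flags vary by MPI implementation).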

Page 10: Q1. Applications Scalability...

Page 11: Q2. Licensing Model...

• As the hardware shift to multi-core processors continues and even accelerates, the licensing models of many ISV codes become a serious problem for your customers. Per-core licensing becomes exceedingly unaffordable and limits the ability to improve, or even maintain, the performance levels of the recent past.
• Panel participants: how can you help your customers become more competitive given current technology trends?

Page 12: Q2. Licensing Model...

Page 13: Q2. Licensing Model...

• Modified ANSYS HPC licensing in 2009, tied to the value of HPC. The new scalable licensing enables ‘extreme/unlimited parallel’ for high fidelity and minimizes the licensing “penalty” on higher-core-count processors.
• Enterprise access is key: hardware located anywhere, users located anywhere; owned, rented, or IaaS; interchangeable across physics.
• Buy once, deploy once.

Page 14: Q2. Licensing Model...

• One-code strategy: LS-OPT, LS-PrePost, and dummy, barrier, and head-form FEA models are all available as part of the LS-DYNA distribution, with no additional license keys required.
• Ultimate value: multi-physics and multi-stage capabilities in one scalable code.
• Flexibility: a 4-core license allows four one-core jobs or one 4-core job.
• Steeply decreasing licensing fees per core as the number of processors increases.
• Unlimited-core site license.

Page 15: HPC USER FORUM ISV PANEL April 2010 Dearborn, MI

Not have licensing based on number of cores

Per use token-licensing Addressing thru special license decay

Multi-run environmentsMassive computation

Q2. Licensing Model.....Q2. Licensing Model.....
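“License decay” presumably means that each additional core consumes fewer tokens than the last, so massively parallel runs are not priced linearly. A hypothetical C sketch of such a curve (the 0.8 factor and the function itself are invented for illustration, not Altair's actual pricing):

    #include <stdio.h>

    /* Hypothetical: each extra core costs `decay` times the previous
     * core's tokens, so the marginal cost of parallelism keeps falling. */
    double tokens_needed(int cores, double base, double decay)
    {
        double total = 0.0, cost = base;
        for (int i = 0; i < cores; i++) {
            total += cost;
            cost *= decay;
        }
        return total;
    }

    int main(void)
    {
        for (int n = 1; n <= 64; n *= 2)
            printf("%2d cores -> %.2f tokens\n", n, tokens_needed(n, 1.0, 0.8));
        return 0;
    }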

Page 16: Q2. Licensing Model...

• Have asked this question internally and am presenting the collective responses.
• Two factors in license price: parallel development and testing are more expensive than scalar. With SIMULIA the typical sale is an annual license, so, on the one hand, the sales force is motivated to maintain a good relationship with the customer, but on the other hand it fears “revenue erosion” from “free parallel”.
• The SIMULIA sales team views the existing licensing model as rewarding parallel execution with lower “per core hour” execution costs, though it requires a greater base token pool.
• Change? It requires a “revenue neutral” shift in the licensing model and a greater volume of sales (more customers).

Page 17: Q2. Licensing Model...

• Our “Power” session licenses are independent of the number of cores used for a single analysis.
• Our “Cloud license” model is also independent of the number of cores and of the number of simultaneous analyses. You pay only for what you use, and we do not care where you run.
• We make our clients more competitive by adding value with each release:
  1. Cut the total engineering time required for analysis. Engineering time is far more expensive than computer time.
  2. Enlarge the universe of problems that our tools can be employed to analyze, while working to make all our analyses more accurate.

Page 18: Q3. New Technology Adoption...

• We notice a considerable lag in the adoption of new technologies (e.g., FPGA, GPGPU) in the manufacturing CAE space. Please elaborate on the issues and your response.

Page 19: Q3. New Technology Adoption...

• New technologies come with lots of hype and little infrastructure. It takes time for languages, compilers, debuggers, and other tools to mature and standardize. We are not in a position to rewrite 1M lines in assembly language every time some new device appears.
• New technologies are not always applicable to our particular needs.
• When we see a technology with reasonable potential for return on investment, we partner with the technology providers, watch the literature, assign researchers... and it does not always pay off.

Page 20: HPC USER FORUM ISV PANEL April 2010 Dearborn, MI

Adoption of GPGPU, and to greater extent, FPGA has high programming cost for large, general purpose codes

Result is that GPGPU focus tends to be on acceleration of obvious bottlenecks, preferably, with low code line countsDrawback for parallel codes is that often greater parallel gains

are in same areas, so gains from GPGPU are considerably less for parallel codes than for scalar codes

Even where adoption is underway, keeping x86 and GPGPU code (CUDA/OpenCL) results in two code bases

SIMULIA is accelerating obvious code with GPGPU, and working internally and with partners to find better programming model

Q3. New Technology Adoption….…Q3. New Technology Adoption….…

Page 21: HPC USER FORUM ISV PANEL April 2010 Dearborn, MI

Technology need to be fit a for certain computational methods – memory, data transfer

We’re trying, but the gains do not justify the effort

Technology is not where it needs to be Lack of standards

Q3. New Technology Adoption….…Q3. New Technology Adoption….…

Page 22: HPC USER FORUM ISV PANEL April 2010 Dearborn, MI

LSTC currently is evaluating the impact of GPUs on the performance of implicit LS-DYNA. It is applied to the innermost computational kernel of the sparse matrix factorization.

GPUs offer high performance for certain computational kernels.

Performance is subjected to overhead cost of transferring the data to the GPU and results back from the GPU.

Performance will no longer degrade for REAL*8 arithmetic when the Nvidia Fermi GPUs become available.

LSTC hopes to have the GPU implementation in Implicit around mid-year.

Q3. New Technology Adoption….…Q3. New Technology Adoption….…
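The transfer-overhead caveat has a simple break-even form: offloading only pays when the GPU compute time plus bus transfer time beats the CPU time. A back-of-the-envelope C sketch (every number below is hypothetical, chosen only to illustrate the comparison):

    #include <stdio.h>

    /* Rough model: time to move `bytes` over the bus at `bw` bytes/sec,
     * plus the GPU compute time, versus just staying on the CPU. */
    double offload_time(double bytes, double bw, double t_gpu)
    {
        return bytes / bw + t_gpu;
    }

    int main(void)
    {
        double bytes = 2.0e9;   /* factor-kernel data moved, hypothetical */
        double bw    = 6.0e9;   /* effective PCIe bandwidth, hypothetical */
        double t_cpu = 10.0;    /* kernel time on the CPU, seconds        */
        double t_gpu = 2.0;     /* kernel time on the GPU, seconds        */

        double t_off = offload_time(bytes, bw, t_gpu);
        printf("offload %.2f s vs. cpu %.2f s -> %s\n", t_off, t_cpu,
               t_off < t_cpu ? "offload wins" : "stay on the CPU");
        return 0;
    }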

Page 23: Q3. New Technology Adoption...

• Establishing ROI is critical, and unclear: the technology target keeps moving (CPUs vs. GPGPUs).
• Substantial investment is required. Only a subset of operations maps to the GPU without significant algorithm changes; there is a bottleneck in memory access to and from off-CPU boards, and not enough memory to offload the “entire algorithm”.
• Lack of “off the shelf” vendor libraries; multiple development environments (OpenCL / CUDA).
• Some “low hanging fruit” (e.g., matrix factorization) is available now (beta) on GPGPUs.

Page 24: Q3. New Technology Adoption...

Page 25: Q4. Breakthrough Performance...

• Could you please comment on how your products could potentially evolve in the near or mid term, leading to substantially higher levels of performance for your customers?

Page 26: Q4. Breakthrough Performance...

Page 27: Q4. Breakthrough Performance...

• New, more scalable solvers, with promise to extend scaling to thousands of cores and beyond; robustness is key (and takes time).
• Vector processing paradigms (multicore, GPU).
• Parallel execution of multiple design points, with full automation of parametric updates, for both human productivity and compute throughput (a generic sketch of the pattern follows this slide).
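Running multiple design points in parallel is embarrassingly parallel: each parameter set is an independent solve. A generic C/OpenMP sketch of the pattern (evaluate() is a hypothetical stand-in for a full simulation):

    #include <omp.h>
    #include <stdio.h>

    /* Hypothetical stand-in for one full simulation at a design point. */
    static double evaluate(double param)
    {
        return param * param;   /* imagine hours of CFD or FEA here */
    }

    int main(void)
    {
        enum { NPOINTS = 8 };
        double results[NPOINTS];

        /* Design points are independent, so they can run concurrently;
         * throughput scales with how many solves the hardware can hold. */
        #pragma omp parallel for
        for (int i = 0; i < NPOINTS; i++)
            results[i] = evaluate(0.1 * i);

        for (int i = 0; i < NPOINTS; i++)
            printf("design point %d -> %f\n", i, results[i]);
        return 0;
    }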

Page 28: Q4. Breakthrough Performance...

• New features are continuously implemented (electromagnetics, acoustics, frequency response, compressible/incompressible fluids, isogeometric elements).
• Multiscale capabilities are under development, with an initial release this year.
• Hybrid MPI/OpenMP promises a major scalability boost at high processor counts, scaling to thousands of nodes for both explicit and implicit solvers.
• Replace prototype testing with simulation: strict modeling guidelines for analysts; a single FE model for crash, NVH, durability, etc.; advances in constitutive models, contact, FSI with SPH, ALE, and particle methods, sensors and control systems; and complete compatibility with NASTRAN.
• Manufacturing simulations (in LS-DYNA, Moldflow, etc.) to provide initial conditions for crash simulations.

Page 29: Q4. Breakthrough Performance...

• Not only solver runtime is important, but also how solver use impacts design decisions.
• New paradigms for designing products: integration of design methods with solvers, advancing use of multi-CPU, advancing numerical techniques.

Page 30: HPC USER FORUM ISV PANEL April 2010 Dearborn, MI

Abaqus/Explicit is unlikely to see major breakthrough in near to mid term, but will show steady incremental improvement

Customer base execution of Abaqus/Standard for large jobs exceeded customer adoption about 3 years ago (large model scaling to 128 to 256 cores)

Takes time for customers to get credible performance data and to change hardware available in order to adapt to a shift in scalability

For implicit FEA hard to get away from “Nastran node” for several yearsSIMULIA investigating next possible jumps in performance

For Abaqus/Standard working on “strong scaling” gains (i.e. deliver scalability throughout problem size range)

Beyond the “more cores” approach potential of GPGPU is great, but need a programming breakthrough

Q4. Breakthrough Performance…..Q4. Breakthrough Performance…..

Page 31: HPC USER FORUM ISV PANEL April 2010 Dearborn, MI

Our goal is to cut down the TOTAL simulation time.Meshing/CAD interfacing. We have gone from weeks of

preparation time to hours. We have made enormous breakthroughs in our ability to process “dirty” CAD.

Post-processing – recently cut the time to output a specific set of plots from 40 hours to 1 hour.

Strategies to deal with larger models and transients, including use of parallel I/O.

Customization and integration with the client’s own workflows and processes.

Solver efficiency alone is not the only important measure of performance.

Q4. Breakthrough Performance…..Q4. Breakthrough Performance…..