
Model Order Reduction via Matlab Parallel Computing Toolbox

E. Fatih Yetkin & Hasan Dağ

Istanbul Technical University
Computational Science & Engineering Department

September 21, 2009


1 Parallel Computation
  Why We Need Parallelism in MOR?
  What is Parallelism?
  Parallel Architectures

2 Tools of Parallelization
  Programming Models
  Parallel Matlab

3 Parallel Version of Rational Krylov Methods
  Rational Krylov Methods
  H2 optimality and Rational Krylov methods
  An Example System
  Parallelization of the Algorithm
  Results

4 Conclusions


Why We Need Parallelism in MOR?
Computational Complexity

Model reduction methods aim to build a model that is easy to handle. However, for some types of methods, such as balanced truncation or rational Krylov, the reduction process takes a long time for dense problems.

Computational Complexity of Rational Krylov Methods

The factorization of (A − σᵢE) required at each of the k interpolation points costs O(N³) for dense matrices.

Therefore, parallelism is a necessity, especially for dense problems.


What is Parallelism?
Sequential Programming

A single CPU (core) is available.

The problem is composed of a series of commands.

Each command is executed one after another.


What is Parallelism?
Parallel Programming

In the simplest sense, parallel computing is the simultaneous use of multiple computing resources to solve a computational problem:

with multiple CPUs or cores;

the problem is broken into discrete parts that can be solved concurrently;

each part is executed on a different CPU simultaneously.


Parallel Architectures
Shared Memory

Generally shared memory machines have in common the ability for all processors to access all memory as global address space.

Multiple processors can operate independently but share the same memory resources.

Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.


Parallel Architectures
UMA vs. NUMA

In the Uniform Memory Access (UMA) architecture, identical processors have equal access times to memory. Such a machine is also called a Symmetric Multiprocessor (SMP).

Non-Uniform Memory Access (NUMA) machines are often made by physically linking two or more SMPs; not all processors have equal access time to all memories.


Parallel Architectures
Distributed Memory

Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of global address space across all processors.

When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.


Parallel Architectures
Hybrid Memory

The largest and the fastest computers in the world today employ both shared and distributed memory architectures.

The shared memory component is usually a cache coherent SMP machine. Processors on a given SMP can address that machine's memory as global.

Network communications are required to move data from one SMP to another.


Parallel Programming Models: Threads
POSIX Threads & OpenMP

In the threads model of parallel programming, a single process can have multiple, concurrent execution paths.

Threads can come and go, but a.out remains present to provide the necessary shared resources until the application is completed.

Unrelated standardization efforts have resulted in two very different implementations of threads: POSIX Threads and OpenMP.


Parallel Programming Models: Message Passing Interface (MPI)

A set of tasks use their own local memory during computation. Multiple tasks can reside on the same physical machine, as well as across an arbitrary number of machines.

Tasks exchange data through communications by sending and receiving messages.


Matlab Distributed Computing Toolbox
Distributed or Parallel

In Matlab terminology, parallel jobs run on internal workers such as cores, while distributed jobs run on cluster nodes.


Basics of Parallel Computing Toolbox
parfor

In Matlab you can use parfor to make a parallel loop.

Message passing and other low-level communication issues are handled by Matlab itself.


Basics of Parallel Computing Toolbox
When can we use parfor?
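The content of this slide is not included in the transcript. As an illustration (not from the slide), a loop is suitable for parfor when its iterations are independent of each other, as in this minimal sketch (matlabpool is the 2009-era command for opening a worker pool; the loop body is a hypothetical workload):

matlabpool open 4                 % start a pool of 4 local workers (assumed available)
n = 200;
r = zeros(1, n);
parfor i = 1:n
    % each iteration touches only its own data, so iterations can run
    % on different workers in any order
    r(i) = max(svd(rand(100)));
end
matlabpool close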


Basics of Parallel Computing Toolbox
When can we not use parfor?
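Again, the slide content is not included in the transcript. As a hedged illustration, parfor cannot be used when an iteration depends on the result of a previous one (a loop-carried dependence), for example:

x = zeros(1, 100);
x(1) = 1;
for i = 2:100
    % x(i) depends on x(i-1): iterations must run in order, so this loop
    % cannot be converted to parfor
    x(i) = 0.5*x(i-1) + 1;
end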


Basics of Parallel Computing Toolbox
Single program, multiple data (spmd)

In Matlab you can use spmd blocks to run the same program on different data sets.
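A minimal spmd sketch (illustrative, not taken from the slide): each worker executes the same block on its own data, identified by labindex (the worker index in the 2009-era toolbox):

spmd
    % every worker runs this block; labindex distinguishes the workers
    localData = labindex * ones(1, 5);   % each worker builds its own data
    localSum  = sum(localData);          % and computes a partial result
end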



The master (client) process has the right to access all workers' data.
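Outside the spmd block, variables created by the workers are visible to the client as Composite objects, so the master can read each worker's copy (continuing the sketch above):

s1 = localSum{1};              % value computed by worker 1
s2 = localSum{2};              % value computed by worker 2
total = s1 + s2;               % combine partial results on the client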


Basics of Parallel Computing Toolbox
Distributed arrays

It is possible to distribute any array across the workers.
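A short sketch (an illustration with assumed sizes): a distributed array is partitioned over the workers, and many built-in operations on it, such as backslash, then run in parallel:

n = 2000;
A = distributed(rand(n));        % copy a client matrix into a distributed array
b = distributed(rand(n, 1));
x = A \ b;                       % the solve runs on the workers
xc = gather(x);                  % bring the result back to the client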



Matrix Transposing
MPI-Fortran vs. Matlab DCT


Rational Krylov Methods

If D is selected as zero, the system triple can be written as Σ = (A, B, C) for

ẋ = Ax + Bu
y = Cᵀx + Du

Two matrices V ∈ ℝ^{n×k} and W ∈ ℝ^{n×k} can be defined such that W*V = I_k and k ≪ n.

With these two matrices the reduced order system is found as

Â = W*AV,   B̂ = W*B,   Ĉ = CV        (1)


Rational Krylov Method

There are lots of ways to build the projection matrices.

One way is using rational Krylov subspace bases.

Assume that k distinct points in the complex plane are selected for interpolation.

Then the interpolation matrices V and W can be built as shown below:

V = [ (s₁I − A)⁻¹B   …   (sₖI − A)⁻¹B ]
W = [ (s₁I − Aᵀ)⁻¹C   …   (sₖI − Aᵀ)⁻¹C ]        (2)
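A minimal Matlab sketch of equations (2) and (3), assuming a SISO system with A (n×n), B (n×1), C (1×n) and k given interpolation points stored in a vector s; this is an illustration, not the authors' code:

k = length(s);
n = size(A, 1);
V = zeros(n, k);
W = zeros(n, k);
for i = 1:k
    V(:, i) = (s(i)*eye(n) - A)  \ B;     % columns of V, eq. (2)
    W(:, i) = (s(i)*eye(n) - A') \ C';    % columns of W, eq. (2)
end
W = W / (V'*W);                   % rescale: W*(V'*W)^(-1), so that W'*V = I
Ar = W'*A*V;                      % reduced system matrices, eq. (3)
Br = W'*B;
Cr = C*V;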


Rational Krylov Projectors

Assuming that det(W*V) ≠ 0, the projected reduced system can be built as

Â = WᵀAV,   B̂ = WᵀB,   Ĉ = CV        (3)

where W = W(V*W)⁻¹ to ensure W*V = I_k.

The basic problem is to find a strategy to select the interpolation points.

In the worst case, the interpolation points can simply be selected randomly from the operating frequency range of the system.



H2 norm of a system

This approach is not optimal. Several methods can be used to improve it. In this work we use the iterative rational Krylov approach to achieve an H2-norm optimal reduced model.

The H2 norm of a system is defined as below:

||G||₂ := [ ∫_{−∞}^{+∞} |G(jω)|² dω ]^{1/2}        (4)


H2 optimality

The reduced order system G_r(s) is H2 optimal if it minimizes the H2 norm of the approximation error:

G_r(s) = arg min_{deg(Ĝ)=r} ||G(s) − Ĝ(s)||_{H2}        (5)

There are two important theorems for obtaining an H2-optimal reduced model, given by Meier (1967) and Grimme (1997).

Antoulas et al. combined these two important results into the Iterative Rational Krylov Algorithm (IRKA) for obtaining an H2-optimal reduced order model.


Iterative Rational Krylov Algorithm (IRKA)
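The algorithm listing from this slide is not included in the transcript. As a rough illustration of the IRKA fixed-point idea, a hedged sketch follows; ratKrylovReduce is a hypothetical helper wrapping the projection step of equations (2) and (3) above, and the initial shifts, tolerance and iteration limit are assumptions:

maxIter = 50;
tol = 1e-4;
s = logspace(0, 4, k).';                        % initial interpolation points (an assumption)
for iter = 1:maxIter
    [Ar, Br, Cr] = ratKrylovReduce(A, B, C, s); % projection as in eqs. (2)-(3)
    sNew = sort(-eig(Ar));                      % new shifts: mirror images of the reduced poles
    if norm(sNew - sort(s)) < tol*norm(s)       % stop when the shifts have converged
        break;
    end
    s = sNew;
end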


Example RLC network

We use a ladder RLC network as the benchmark example for the numerical implementation of Alg. 1 and Alg. 2.

A minimal realization of the circuit is given in Fig. 1. For this circuit the order of the system is n = 5. On the other hand, the system matrices of this circuit can easily be extended to larger orders.


Frequency plots of the reduced and original systems
N = 201 and the order of the reduced system k = 20


Computational Cost of Methods

The computational cost of the rational Krylov methods can be given as O(N³) for dense problems.

In IRKA the rational Krylov method is used iteratively, so the computational complexity has to be multiplied by the number of iterations r.


Parallel Parts of Algorithms

Although both algorithms require k factorizations to compute (sᵢI − A)⁻¹B, these factorizations can be computed on different processors independently.

The matrix-matrix and matrix-vector multiplications in the algorithms are also amenable to parallel processing.
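A sketch of how the independent factorizations might be distributed over the workers with parfor, under the same assumptions as the sequential sketch above (an illustration, not the authors' exact implementation):

V = zeros(n, k);
W = zeros(n, k);
parfor i = 1:k
    % each interpolation point gives an independent linear solve, so the
    % k factorizations are farmed out to whichever worker is free
    V(:, i) = (s(i)*eye(n) - A)  \ B;
    W(:, i) = (s(i)*eye(n) - A') \ C';
end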


Parallel Version of Alg. 1



CPU times for Rational Krylov

Table: CPU times of the parallel version of Alg. 1 for different system orders, where the reduced system order is k = 200.

Proc. no.    time (n = 2000)    time (n = 5000)
    1              59.8              1485.3
    2              31.4               780.7
    4              21.2               451.4
    8              23.8               374.2


CPU times for IRKA

Table: CPU times of the parallel version of Alg. 2 for different system orders, where the reduced system order is k = 200.

Proc. no.    time (n = 2000)    time (n = 5000)
    1             512.6              2486.2
    2             410.7              1605.9
    4             203.9               810.8
    8             176.1               648.4


Speedup graph for RK

The speedup of a parallel algorithm is defined as

S_p = T₁ / T_p        (6)

where T₁ is the CPU time on one processor and T_p is the CPU time on p processors.
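For example, from the rational Krylov table above, the speedup on two processors is S₂ = 59.8/31.4 ≈ 1.9 for n = 2000, while on four processors S₄ = 1485.3/451.4 ≈ 3.3 for n = 5000.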



Speedup graph for IRKA



As can easily be seen from the figures, when we increase the number of processors the processing time decreases appreciably up to some point, after which it starts to increase.

This is because communication time becomes dominant over computation time. In both algorithms, however, better speedups are obtained as the system matrices get larger.


Conclusions

In this work, iterative rational Krylov based optimal H2-norm model reduction methods are parallelized.

These methods require heavy computation, but the algorithms themselves are suitable for parallel processing.

Therefore, the computational time decreases when the number of processors is increased.

Due to the communication needs of the processors, communication time dominates the overall process time when the system order is small.

But for larger orders, the parallel algorithms achieve better speedup values.
