
Domain Decomposition Method for a Finite Element Algorithm for Image Segmentation

André Gaul
[email protected]

Diploma Thesis

Chair of Applied Mathematics III, Department of Mathematics
Section Modeling, Simulation, Optimization, University of Erlangen-Nürnberg

Supervisors: Dr. J. M. Fried, Prof. Dr. E. Bänsch

Erlangen, June 2009

Abstract

Computer-based image segmentation is a common task when analyzing and classifying images in a broad range of applications. Problems arise when it comes to the computation of segmentations for huge datasets like high-resolution microscope or satellite scans or three-dimensional magnetic resonance images appearing in medical image processing. The computation may exceed constraints like available memory and time.

We will present a finite element algorithm for image segmentation based on a level set formulation combined with the domain decomposition method which enables us to compute segmentations of large datasets on multi-core CPUs and high-performance distributed parallel computers rapidly.

Contents

1 Introduction

2 Mathematical Model of Image Segmentation
  2.1 The Mumford-Shah Energy Functional
  2.2 The Chan-Vese Model
  2.3 The Level Set Formulation
  2.4 Heaviside Regularization
  2.5 Multiple Channels
  2.6 The Euler-Lagrange equation
  2.7 Gradient Descent
  2.8 Weak formulation
  2.9 Finite Element Space Discretization
  2.10 Time Discretization
  2.11 Matrix Formulation

3 Mathematical Model of Domain Decomposition Method
  3.1 Partitioning
    3.1.1 Naive Partitioning
    3.1.2 Load-Balancing Partitioning
  3.2 The Schur Complement Method
    3.2.1 Block Gaussian Elimination
    3.2.2 Decoupling of Subdomain Problems
    3.2.3 Iterative Solver for the Schur Complement System
    3.2.4 Subdomain Matrices and Subdomain Schur Complements
    3.2.5 Subdomain Solvers
    3.2.6 Condition Number

4 Implementation in Image
  4.1 Brief Introduction to the Image Framework
  4.2 Parallel Computing Programming Model
  4.3 Design Principles with MPI in Image
  4.4 Partitioning of Triangulations using ParMETIS
  4.5 Distribution of Subdomains
  4.6 Association of Global and Local Degrees of Freedom
  4.7 Handling of Interface Data
  4.8 Non-Blocking MPI Communication
  4.9 Distributed Iterative Solver
    4.9.1 Assembly of Matrices and Adaption of Right Hand Sides
    4.9.2 Schur Complement System Solver
    4.9.3 Backward Substitution

5 Numerical Results
  5.1 Segmentation
    5.1.1 Experimental Order of Convergence
    5.1.2 Artificial Images (Checkerboard, Grayscale Gradient)
    5.1.3 Real World Images (Multiple Channels, Large-Scale Image)
  5.2 Parallel Performance
    5.2.1 Computation Environments
    5.2.2 Scalability Benchmarks (Small-Sized Problem, Large-Scale)

6 Conclusion and Perspective

Acknowledgements

I wish to thank all of my friends for their assistance in my studies and especially in this work. Special thanks go to Jenny for all the love, care, fun and for constantly triggering thoughts on and actions in a philosophical and political world that matters beyond mathematics.

Concerning this work, I am very grateful to Michael Fried for the excellent supervision and the topic, perfectly matching my personal interests. I had great fun while attaining knowledge together with Kai Hertel, occasionally spending days and nights on program code. Furthermore, I want to thank the entire staff (and Saeco) at AM3 for making this a humane place to productively work at. Special thanks go to Eberhard Bänsch, Steffen Basting, Rodolphe Prignitz, Stephan Weller and Rolf Krahl for taking the time whenever mathematical problems arose and the latter two, being LaTeX experts, also for their support concerning typesetting. Thanks are also directed towards the high performance computing team at the university's computing center for operating the woody cluster and sharing their profound knowledge.

My parents deserve very special thanks for unconditionally supporting me in every way, enabling me to concentrate on my studies and this work. I deeply wish everyone to be able to study under similar circumstances and look forward to a time when income will no longer determine educational chances. Beyond that, I thank my brother Mirko for the humorous phone calls, often exhilarating me in times of heavy work load.

Thank you!

Notation

    Basic Notation

R      Set of the real numbers
R+     Set of the positive real numbers
N      Set of the natural numbers
N0     Set of the natural numbers including zero
Ω      Domain Ω ⊂ R^d with d ∈ {2, 3}

    Vectors and Matrices

For vectors x, y ∈ R^n and matrices A ∈ R^{n×n} we write:

(x_1, . . . , x_n)   Cartesian components of the vector x ∈ R^n
e_i       Canonical unit vector in direction of spatial axis i
x · y     Euclidean scalar product: x · y = Σ_{i=1}^n x_i y_i
‖x‖       Euclidean norm: ‖x‖ = (x · x)^{1/2}
A_{ij}    Entry of the matrix A in the i-th row and the j-th column
A^T       Transpose of matrix A with (A^T)_{ij} = A_{ji}
‖A‖       Matrix norm ‖A‖ := sup_{z ∈ R^n, z ≠ 0} ‖Az‖ / ‖z‖
κ(A)      Condition number of the matrix A: κ(A) := ‖A‖ ‖A^{−1}‖

    Operators

For a function f : Ω × R+ → R of space and time and a vector-valued function g : Ω × R+ → R^d we write:

∂_t f    Time derivative of f: ∂_t f := ∂f/∂t
∂_i f    Derivative of f with respect to the i-th spatial axis: ∂_i f := ∂f/∂x_i
∇f       Gradient with respect to spatial variables: ∇f = (∂_1 f, . . . , ∂_d f)^T
∇ · g    Divergence with respect to spatial variables: ∇ · g = Σ_{i=1}^d ∂_i g_i

Specific symbols

I         Vector-valued image I : Ω → R^m
BV(Ω)     Space of functions of bounded variation
diam(S)   Diameter of a simplex S
T_h       Triangulation T_h = {S_i}_{i=1}^{N_T} of Ω with N_T simplices and h = max_{i=1,...,N_T} diam(S_i)
X_T       Vector space of finite element functions with respect to T
v_h       Discrete function v_h ∈ X_T
Ω_i       i-th segment Ω_i of a segmentation
Γ         Segmentation interface separating the segments Ω_i
P_i       i-th subdomain of a partitioning P
I         Interface separating the subdomains P_i
R_i       Restriction operator R_i mapping unknowns in the global domain Ω to unknowns of the i-th subdomain P_i

1 Introduction

Image segmentation partitions a given image into multiple segments in such a way that similar regions are grouped together in one segment. More generally, the segmented image shares visual characteristics in each segment. The process aims at detecting objects and their boundaries or at simplifying the image in order to analyze it more easily. Computer-based image segmentation has become a vital method in several applications like locating objects in medical imaging or satellite images and enables many people to focus on the kind of work computers are not able to perform yet.

When it comes to the computational processing of huge datasets like high-resolution images in two or three dimensions, arising for example in magnetic resonance imaging, problems like high memory consumption and long computation times have to be addressed.

This work presents an image segmentation algorithm combined with the domain decomposition method which allows for the fast computation of large-scale image segmentations on parallel computers. Since the segmentation algorithm has been investigated thoroughly in the past, we will place emphasis on the domain decomposition technique in this study.

In chapter 2, we will present a mathematical model of image segmentation based on the Mumford-Shah functional. Input data may consist of multiple image channels, e.g. RGB color images. The presented algorithm generates a piecewise constant approximation of a given image with an arbitrary number of segments. The resulting partial differential equation is discretized in space by the finite element method.

The theoretical background of the employed domain decomposition method will be introduced in chapter 3. We will shed light on different partitioning approaches and present a Schur complement method, which is a straightforward approach to decouple groups of unknowns resulting from the finite element discretized equation. The decoupling is the key to our parallel implementation.

We will describe the most important details concerning the implementation in chapter 4. The algorithms have been embedded in the abstract image processing framework Image which is briefly introduced together with the used finite element toolbox ALBERTA. Because domain decomposition methods aim at speeding up computations, they are tightly coupled to computer science and we will describe the algorithms both from a mathematical and from a computational point of view where appropriate. Concepts used like the distributed memory approach and MPI are briefly discussed. We will demonstrate parallel partitioning with ParMetis before discussing the distributed parallel Schur complement solver, which is the core and workhorse of our implementation. Crucial points in the implementation are highlighted along with possible solutions, of which very few appear in the form of actual source code. For the sake of clarity the chapter closes with an overview of the work flow of the presented algorithms.

    Chapter 5 presents experiments addressing the segmentation of example images as wellas the analysis of the parallel performance of the domain decomposition implementation.

Computations have been performed with up to 384 processors on the high-performance cluster woody installed at the computing center of the University of Erlangen-Nürnberg. We finish this document with concluding remarks and perspectives for further research in chapter 6.


2 Mathematical Model of Image Segmentation

The process of segmentation aims at detecting objects and their boundaries in a given two- or three-dimensional image I : Ω → R^{N_C} consisting of N_C ∈ N channels. Here, Ω ⊂ R^n (n ∈ {2, 3}) denotes the open and bounded domain where the image resides. Each channel of the image is given by intensity and as such takes values in R.

We will start off with the Mumford-Shah energy functional and the Chan-Vese segmentation model for one image channel. In section 2.5 the algorithm will be extended to multiple channels before developing the functional towards the associated Euler-Lagrange equation. The equation's weak formulation will then be brought to a matrix formulation by using the finite element method for space and a semi-implicit scheme for time discretization.

    2.1 The Mumford-Shah Energy Functional

A common approach for segmenting images was proposed by Mumford and Shah in [16]. The basic idea is to find a piecewise smooth approximation u : Ω → R to the given image I and an interface Γ splitting the domain Ω into pairwise disjoint segments Ω_i with i = 0, . . . , N_S − 1 and Γ = ∂(⋃_{i=0}^{N_S−1} Ω_i) such that the Mumford-Shah energy functional

  F_MS(u, Γ) := μ ∫_Ω |u − I|² + λ ∫_{Ω\Γ} |∇u|² + ν H^{n−1}(Γ)   (2.1)

is minimized. The first condition ∫_Ω |u − I|² = ‖u − I‖²_{L²} forces the approximation u to be close to the given target image I, while the second condition ∫_{Ω\Γ} |∇u|² affects the smoothness of the approximation u in the interior of segments. The length (for n = 2), or generally the measure, of the interface Γ is controlled by the (n − 1)-dimensional Hausdorff measure H^{n−1}(Γ) in the third term. These three conditions are weighted against each other by the three parameters μ, λ and ν.

In contrast to other methods the minimization of the Mumford-Shah energy functional does not involve an edge detector function.

    2.2 The Chan-Vese Model

We will now describe an active contour method proposed by Chan and Vese [7] which is based upon the Mumford-Shah energy functional. The algorithm presented by Chan and Vese allows for the detection of N_S ∈ N segments in a given image. Instead of a piecewise smooth approximation u we will use a piecewise constant function u(x) = c_i for x ∈ Ω_i. We obviously obtain ∇u = 0 for x ∈ Ω_i and the second term vanishes. Mumford and Shah furthermore showed in [16] that the constants c_i are in fact the averages of the original image I in the respective segment Ω_i:

  c_i = (1 / |Ω_i|) ∫_{Ω_i} I.   (2.2)

For piecewise constant functions u the Mumford-Shah functional boils down to:

  F_CV(Γ) = μ Σ_{i=0}^{N_S−1} ∫_{Ω_i} |c_i − I|² + ν H^{n−1}(Γ),
  c_i = (1 / |Ω_i|) ∫_{Ω_i} I,  i = 0, . . . , N_S − 1   (2.3)

and its minimization leads to the minimal partition problem

  min_Γ F_CV(Γ).   (2.4)

A remaining issue is to find an adequate representation of the interface Γ.

    2.3 The Level Set Formulation

We now introduce a level set approach to handle the interface Γ as well as the segments Ω_i. For two segments (N_S = 2), the idea is to define a smooth function φ : Ω → R and to use the zero isoline level as interface

  Γ := {x ∈ Ω | φ(x) = 0}   (2.5)

and use the sign of φ to define the segments

  Ω_0 := {x ∈ Ω | φ(x) < 0},
  Ω_1 := {x ∈ Ω | φ(x) > 0}.   (2.6)

The level set method has several advantages over other methods, e.g. it allows for topology changes of the interface Γ. Following Fried [11] we can furthermore extend the level set approach to N_S = 2^{N_L} segments by using N_L level set functions Φ = (φ_0, . . . , φ_{N_L−1}). Using the Heaviside function H : R → R with

  H(z) := { 0 for z ≤ 0,  1 for z > 0 }

we define the Heaviside vector H(Φ) := (H(φ_{N_L−1}), . . . , H(φ_0)). In order to define the segments for N_L > 1 we use the segment index i ∈ {0, . . . , N_S − 1} and its unique binary representation

  b(i) := (b_{N_L−1}(i), . . . , b_0(i)) with b_j(i) ∈ {0, 1} ∀ j ∈ {0, . . . , N_L − 1}   (2.7)

with

  i = Σ_{j=0}^{N_L−1} b_j(i) 2^j.

Figure 2.1: Two level set functions and the resulting segmentation. (a) Graph of two level set functions φ_0, φ_1 along with the corresponding zero isoline levels on the bottom. (b) Resulting segments Ω_0, Ω_1, Ω_2, Ω_3 and interface Γ for the level set functions in (a).

Using the above, we can now define the interface Γ and the segments Ω_i as

  Γ_j := {x ∈ Ω | φ_j(x) = 0},
  Γ := ⋃_{j=0}^{N_L−1} Γ_j = {x ∈ Ω | ∏_{j=0}^{N_L−1} φ_j(x) = 0},
  Ω_i := {x ∈ Ω | H(Φ(x)) = b(i)}.   (2.8)

Figure 2.1 shows a simple example when two level set functions are used. For convenience we split the index set J := {0, . . . , N_L − 1} into two subsets for every segment index i:

  I(i) := {j ∈ J | b_j(i) = 1},
  Ī(i) := J \ I(i).   (2.9)

The indicator function χ_i(Φ) of the segment Ω_i then reads

  χ_i(Φ) := ∏_{j∈I(i)} H(φ_j) ∏_{j∈Ī(i)} (1 − H(φ_j)).   (2.10)
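To make the bit manipulation behind (2.7) and (2.10) concrete, here is a minimal C sketch (not taken from the thesis; the function names are made up) that extracts b_j(i) via bit shifts and evaluates the indicator function at a single point, given the values phi[j] of the N_L level set functions there:

/* Sketch (not thesis code): evaluate chi_i from (2.10) at one point. */
static double heaviside(double z) { return z > 0.0 ? 1.0 : 0.0; }

double indicator(int i, const double *phi, int n_levelsets)
{
    double chi = 1.0;
    for (int j = 0; j < n_levelsets; j++) {
        int b_ji = (i >> j) & 1;          /* b_j(i) from (2.7) */
        double H = heaviside(phi[j]);
        chi *= b_ji ? H : (1.0 - H);      /* j in I(i) vs. j in Ibar(i) */
    }
    return chi;
}

For N_L = 2 the segment index i = 2 has b(2) = (1, 0), so indicator(2, phi, 2) evaluates to H(φ_1)(1 − H(φ_0)), in agreement with (2.10).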

In order to reformulate the length of Γ in terms of level set functions we need some definitions from the theory of functions of bounded variation. We only present the basics and refer to the work of Ambrosio, Fusco and Pallara [1] for an in-depth analysis of the Mumford-Shah energy functional with respect to functions of bounded variation.

Definition 2.3.1 (Variation). Let f ∈ L¹(Ω). The variation V(f, Ω) of f in Ω is defined by

  V(f, Ω) := sup { ∫_Ω f ∇ · η dx | η ∈ C¹_0(Ω, R^n), ‖η‖_∞ ≤ 1 }.

Note 2.3.1. For continuously differentiable f ∈ C¹(Ω, R) integration by parts reveals that

  V(f, Ω) = ∫_Ω |∇f| dx.

This result will be of importance in section 2.4 where the discontinuous Heaviside function is going to be replaced by a regularized Heaviside function.

Definition 2.3.2 (Function of bounded variation). A function f ∈ L¹(Ω) is a function of bounded variation in Ω if V(f, Ω) < ∞.

Figure 2.2: (a) shows the regularized Heaviside function H_ε and (b) the regularized delta function δ_ε for ε = 0.1.

    2.4 Heaviside Regularization

Following Chan and Vese in [7] we replace the Heaviside function H appearing in (2.14) for technical reasons with the C^∞(R)-regularization

  H_ε(z) := 1/2 + (1/π) arctan(z/ε)   (2.15)

with ε > 0 as a regularization parameter. The derivative of H_ε then is

  δ_ε(z) := d/dz H_ε(z) = (1/π) · ε / (ε² + z²).   (2.16)

Note that lim_{ε→0} H_ε = H and lim_{ε→0} δ_ε = δ, where δ denotes the (distributional) derivative of the Heaviside function H in the bounded variation sense.
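For illustration, a direct C implementation of (2.15) and (2.16) could look as follows (a sketch, not code from the thesis):

#include <math.h>

static const double PI = 3.14159265358979323846;

/* Regularized Heaviside function H_eps from (2.15). */
double heaviside_eps(double z, double eps)
{
    return 0.5 + atan(z / eps) / PI;
}

/* Regularized delta function delta_eps = H_eps' from (2.16). */
double delta_eps(double z, double eps)
{
    return (1.0 / PI) * eps / (eps * eps + z * z);
}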

Chan and Vese also presented another regularization for H and δ:

  H̃_ε(z) := { 0                                   for z < −ε,
              (1/2)(1 + z/ε + (1/π) sin(πz/ε))     for |z| ≤ ε,
              1                                    for z > ε }

  δ̃_ε(z) := d/dz H̃_ε(z) = { 0                          for |z| > ε,
                             (1/(2ε))(1 + cos(πz/ε))     for |z| ≤ ε }

We will comment on resulting practical differences between the two regularization approaches later on and continue using the functions defined in (2.15) and (2.16) to obtain the regularized version of the interface length L(Φ) defined in (2.13):

  L_ε(Φ) = Σ_{j=0}^{N_L−1} V(H_ε(φ_j), Ω).   (2.17)


Because we are now dealing with continuously differentiable functions, we are able to use the result of note 2.3.1 and obtain by applying the chain rule:

  L_ε(Φ) = Σ_{j=0}^{N_L−1} ∫_Ω |∇(H_ε(φ_j))|
         = Σ_{j=0}^{N_L−1} ∫_Ω δ_ε(φ_j) |∇φ_j|.   (2.18)

We also define the regularized indicator function χ_{i,ε}(Φ) by using the regularized Heaviside function:

  χ_{i,ε}(Φ) := ∏_{j∈I(i)} H_ε(φ_j) ∏_{j∈Ī(i)} (1 − H_ε(φ_j)).   (2.19)

The regularized energy functional now reads:

  F_ε(Φ) = μ Σ_{i=0}^{N_S−1} ∫_Ω |c_i − I|² χ_{i,ε}(Φ) + ν L_ε(Φ),
  c_i = (1 / |Ω_i|) ∫_{Ω_i} I,  i = 0, . . . , N_S − 1.   (2.20)

    2.5 Multiple Channels

We do not only want to segment scalar valued images but multi-channel images like RGB color images or satellite images with an arbitrary number of channels. To segment a vector valued image we need to incorporate the information of all channels at once instead of segmenting the channels in sequence. Let N_C be the number of channels and I := (I^1, . . . , I^{N_C}) : Ω → R^{N_C} the vector-valued original image. We again follow an approach proposed by Chan, Sandberg and Vese [6] and adopted by Fried [11]. The idea is to use the arithmetic mean of the squared L² norms

  g^k := Σ_{i=0}^{N_S−1} ∫_{Ω_i} |c_i^k − I^k|²,  k = 1, . . . , N_C

in (2.20) to obtain the generalized multi-channel functional

  F_ε(Φ) = (μ / N_C) Σ_{k=1}^{N_C} Σ_{i=0}^{N_S−1} ∫_Ω |c_i^k − I^k|² χ_{i,ε}(Φ) + ν Σ_{j=0}^{N_L−1} ∫_Ω δ_ε(φ_j) |∇φ_j|,
  c_i^k = (1 / |Ω_i|) ∫_{Ω_i} I^k,  i = 0, . . . , N_S − 1.   (2.21)


    2.6 The Euler-Lagrange equation

In this section we are going to derive the Euler-Lagrange equation associated with the energy functional F_ε defined in (2.21) in order to find a solution Φ ∈ C²(Ω, R^{N_L}) to the minimization problem

  F_ε(Φ) = min_{Ψ ∈ C²(Ω, R^{N_L})} F_ε(Ψ).   (2.22)

The details of the method are described by Evans in [9]. The basic idea is that for a function Φ ∈ C²(Ω, R^{N_L}) satisfying (2.22) the following holds:

  d/ds [F_ε(Φ + sΨ)]|_{s=0} = 0  ∀ Ψ = (ψ_0, . . . , ψ_{N_L−1}) ∈ C^∞(Ω, R^{N_L}).

With e_l as the l-th unit vector, the above condition is equivalent to:

  d/ds [F_ε(Φ + sψ e_l)]|_{s=0} = 0  ∀ l ∈ J = {0, . . . , N_L − 1}   (2.23)

for all test functions ψ ∈ C^∞(Ω, R).

At first, we will compute the derivative of the first term appearing in F_ε. We do not take into account the dependence of the constants c_i, appearing in the function g^k, on the level set functions Φ. Thus, the computation of the derivative boils down to the computation of the derivative of χ_{i,ε}. For l ∈ I(i) we obtain

  d/ds χ_{i,ε}(Φ + sψ e_l) = [d/ds H_ε(φ_l + sψ)] ∏_{j∈I(i)\{l}} H_ε(φ_j) ∏_{j∈Ī(i)} (1 − H_ε(φ_j))
                           = δ_ε(φ_l + sψ) ψ ∏_{j∈I(i)\{l}} H_ε(φ_j) ∏_{j∈Ī(i)} (1 − H_ε(φ_j))

and analogously for l ∈ Ī(i)

  d/ds χ_{i,ε}(Φ + sψ e_l) = [d/ds (1 − H_ε(φ_l + sψ))] ∏_{j∈I(i)} H_ε(φ_j) ∏_{j∈Ī(i)\{l}} (1 − H_ε(φ_j))
                           = −δ_ε(φ_l + sψ) ψ ∏_{j∈I(i)} H_ε(φ_j) ∏_{j∈Ī(i)\{l}} (1 − H_ε(φ_j)).

For general l ∈ J we define

  χ^l_{i,ε}(Φ) := ∏_{j∈I(i)\{l}} H_ε(φ_j) ∏_{j∈Ī(i)\{l}} (1 − H_ε(φ_j))

and with the binary representation of the segment's index b(i) defined in (2.7) we arrive at

  d/ds [χ_{i,ε}(Φ + sψ e_l)]|_{s=0} = [(−1)^{1−b_l(i)} δ_ε(φ_l + sψ) ψ χ^l_{i,ε}(Φ)]|_{s=0}
                                    = (−1)^{1−b_l(i)} δ_ε(φ_l) ψ χ^l_{i,ε}(Φ).   (2.24)

We will now take care of the derivative of the length term L_ε(Φ):

  d/ds L_ε(Φ + sψ e_l) = d/ds ∫_Ω δ_ε(φ_l + sψ) |∇(φ_l + sψ)| + d/ds Σ_{j=0, j≠l}^{N_L−1} ∫_Ω δ_ε(φ_j) |∇φ_j|
                       = d/ds ∫_Ω δ_ε(φ_l + sψ) |∇φ_l + s∇ψ|
                       = ∫_Ω δ'_ε(φ_l + sψ) ψ |∇φ_l + s∇ψ| + ∫_Ω δ_ε(φ_l + sψ) (∇φ_l + s∇ψ) · ∇ψ / |∇φ_l + s∇ψ|.

We now evaluate at s = 0 in order to use integration by parts on the second term and apply homogeneous Neumann boundary conditions later on:

  d/ds [L_ε(Φ + sψ e_l)]|_{s=0}
  = ∫_Ω δ'_ε(φ_l) ψ |∇φ_l| + ∫_Ω δ_ε(φ_l) ∇φ_l · ∇ψ / |∇φ_l|
  = ∫_Ω δ'_ε(φ_l) ψ |∇φ_l| + ∫_{∂Ω} ψ δ_ε(φ_l) (∂_n φ_l) / |∇φ_l|  [= 0, homogeneous Neumann boundary]
    − ∫_Ω ψ ∇ · (δ_ε(φ_l) ∇φ_l / |∇φ_l|)
  = ∫_Ω δ'_ε(φ_l) ψ |∇φ_l| − ∫_Ω ψ ∇δ_ε(φ_l) · ∇φ_l / |∇φ_l|  [= ∫_Ω δ'_ε(φ_l) ψ |∇φ_l|, since ∇δ_ε(φ_l) = δ'_ε(φ_l) ∇φ_l]
    − ∫_Ω ψ δ_ε(φ_l) ∇ · (∇φ_l / |∇φ_l|)
  = −∫_Ω ψ δ_ε(φ_l) ∇ · (∇φ_l / |∇φ_l|).   (2.25)

Now it is time to combine (2.24) and (2.25) such that the derivative from (2.23) becomes:

  d/ds [F_ε(Φ + sψ e_l)]|_{s=0}
  = (μ / N_C) Σ_{k=1}^{N_C} Σ_{i=0}^{N_S−1} ∫_Ω |c_i^k − I^k|² d/ds [χ_{i,ε}(Φ + sψ e_l)]|_{s=0} + ν d/ds [L_ε(Φ + sψ e_l)]|_{s=0}
  = μ Σ_{i=0}^{N_S−1} ∫_Ω (1 / N_C) Σ_{k=1}^{N_C} ((−1)^{1−b_l(i)} |c_i^k − I^k|²) δ_ε(φ_l) χ^l_{i,ε}(Φ) ψ − ν ∫_Ω ψ δ_ε(φ_l) ∇ · (∇φ_l / |∇φ_l|).

With g_i^l(x) := (1 / N_C) Σ_{k=1}^{N_C} (−1)^{1−b_l(i)} |c_i^k − I^k(x)|² we now arrive at the weak variational formulation of the minimum condition (2.23):

  μ Σ_{i=0}^{N_S−1} ∫_Ω g_i^l δ_ε(φ_l) χ^l_{i,ε}(Φ) ψ − ν ∫_Ω ψ δ_ε(φ_l) ∇ · (∇φ_l / |∇φ_l|) = 0  ∀ l ∈ J.   (2.26)

Because we chose ψ ∈ C^∞(Ω, R) to be an arbitrary test function we now restrict ψ to be in the space C^∞_0(Ω, R) of differentiable functions with support on a compact set contained in Ω and apply the fundamental lemma of calculus of variations to obtain the classical formulation of the Euler-Lagrange equation:

  μ Σ_{i=0}^{N_S−1} g_i^l δ_ε(φ_l) χ^l_{i,ε}(Φ) − ν δ_ε(φ_l) ∇ · (∇φ_l / |∇φ_l|) = 0 in Ω,
  δ_ε(φ_l) (∂_n φ_l) / |∇φ_l| = 0 on ∂Ω,
  for all l ∈ J.   (2.27)

Note that the length L_ε of the interface in the energy functional (2.20) now appears in the second term, with ∇ · (∇φ_l / |∇φ_l|) being the curvature of the level set function φ_l and of the zero isoline level respectively.

Let us recall the two regularization approaches defined in section 2.4. Chan and Vese observed in [7] that only local minima of the non-convex functional may be found with the second regularizations H̃_ε and δ̃_ε, respectively. The small compact support supp(δ̃_ε) = [−ε, ε] would be responsible for making the algorithm depend on the initial level set function, and only local minima may be obtained. The first introduced regularization is not equal to zero everywhere and tends to compute global minima.

    2.7 Gradient Descent

Following Chan and Vese in [7] we interpret (2.27) as the resulting state of an evolutionary process. We therefore introduce an artificial time t ∈ [0, T] and choose our level set functions as φ_l ∈ C²(Ω × [0, T], R). Minimizing the functional is accomplished by letting the level set functions evolve over the time t in the negative direction of the gradient:

  ∂φ_l/∂t = ∂_t φ_l = −μ Σ_{i=0}^{N_S−1} g_i^l δ_ε(φ_l) χ^l_{i,ε}(Φ) + ν δ_ε(φ_l) ∇ · (∇φ_l / |∇φ_l|).

The complete system of evolution equations then is, ∀ l ∈ J:

  ∂_t φ_l / δ_ε(φ_l) − ν ∇ · (∇φ_l / |∇φ_l|) = −μ Σ_{i=0}^{N_S−1} g_i^l χ^l_{i,ε}(Φ) in Ω × (0, T],
  δ_ε(φ_l) (∂_n φ_l) / |∇φ_l| = 0 on ∂Ω × (0, T],
  φ_l(·, 0) = φ⁰_l(·) in Ω.   (2.28)

The equation is a degenerate parabolic partial differential equation similar to the level set formulation of the mean curvature flow. Replacing 1/δ_ε(φ_l) with 1/|∇φ_l| in the first equation of (2.28) would result in the level set formulation of the mean curvature flow with a special right hand side function.

    2.8 Weak formulation

If we take a look at (2.28), we see that problems may arise with vanishing gradients ∇φ_l = 0. According to the usual practice in the case of mean curvature flow (cf. Fried in [10]), we introduce another regularization Q_ε : R → R,

  Q_ε(z) := √(ε² + z²),   (2.29)

and reformulate the evolution equations to:

  ∂_t φ_l / δ_ε(φ_l) − ν ∇ · (∇φ_l / Q_ε(|∇φ_l|)) = −μ Σ_{i=0}^{N_S−1} g_i^l χ^l_{i,ε}(Φ) in Ω × (0, T],
  δ_ε(φ_l) (∂_n φ_l) / Q_ε(|∇φ_l|) = 0 on ∂Ω × (0, T],
  φ_l(·, 0) = φ⁰_l(·) in Ω.   (2.30)

The corresponding weak formulation of the first equation in (2.30) can now be written as: ∀ ψ ∈ C^∞(Ω × [0, T], R), ∀ l ∈ J

  ∫_Ω (∂_t φ_l / δ_ε(φ_l)) ψ − ν ∫_Ω ∇ · (∇φ_l / Q_ε(|∇φ_l|)) ψ = −μ Σ_{i=0}^{N_S−1} ∫_Ω g_i^l χ^l_{i,ε}(Φ) ψ.

Integration by parts,

  ∫_Ω (∂_t φ_l / δ_ε(φ_l)) ψ − ν ∫_{∂Ω} ψ (∂_n φ_l) / Q_ε(|∇φ_l|) + ν ∫_Ω ∇ψ · ∇φ_l / Q_ε(|∇φ_l|) = −μ Σ_{i=0}^{N_S−1} ∫_Ω g_i^l χ^l_{i,ε}(Φ) ψ,

and dropping of the second term because of the Neumann boundary conditions results in:

  ∫_Ω (∂_t φ_l / δ_ε(φ_l)) ψ + ν ∫_Ω ∇ψ · ∇φ_l / Q_ε(|∇φ_l|) = −μ Σ_{i=0}^{N_S−1} ∫_Ω g_i^l χ^l_{i,ε}(Φ) ψ.   (2.31)

    2.9 Finite Element Space Discretization

We will now develop the equations towards a computer-compatible formulation, which basically means to transfer the continuous problem into a suitable, discrete counterpart. In contrast to Chan and Vese [7], we will not use the finite difference method but the finite element method following Fried [11]. As we are about to use the ALBERTA finite element library in chapter 4, the definitions will follow the work of Schmidt and Siebert [18].

Definition 2.9.1 (Simplex). Let d ∈ N with 0 ≤ d ≤ n and a_0, . . . , a_d ∈ R^n vertices such that a_1 − a_0, . . . , a_d − a_0 are linearly independent vectors in R^n. The set

  S = { x = Σ_{i=0}^d λ_i a_i ∈ R^n | 0 ≤ λ_i ≤ 1 and Σ_{i=0}^d λ_i = 1 }

is called a d-simplex. For k ∈ N, k < d and a'_0, . . . , a'_k ∈ {a_0, . . . , a_d} the simplex

  S' = { x = Σ_{i=0}^k λ_i a'_i ∈ R^n | 0 ≤ λ_i ≤ 1 and Σ_{i=0}^k λ_i = 1 }

is called a k-sub-simplex of S.

Definition 2.9.2 (Conforming Triangulation). A conforming triangulation (or mesh) of Ω is a set of simplices T = {S_i}_{i=1,...,N_T} such that

(1) Ω̄ = ⋃_{i=1}^{N_T} S_i and

(2) the intersection S_i ∩ S_j of S_i, S_j ∈ T with i ≠ j is either empty or a complete k-sub-simplex of both S_i and S_j with 0 ≤ k < d.

Let from now on T be a conforming triangulation of Ω. We can now define the function space X_T by

  X_T := { ϕ ∈ C⁰(Ω̄) | ∀ S ∈ T : ϕ|_S ∈ P_p(S) }

where P_p(S) is the space of polynomials of order p on the simplex S. Let B_T := {ϕ_1, . . . , ϕ_{N_B}} be a corresponding Lagrange basis of X_T, described in detail in [18]. A finite element function v_h ∈ X_T then is uniquely determined by a vector (v_1, . . . , v_{N_B}) ∈ R^{N_B} with:

  v_h = Σ_{i=1}^{N_B} v_i ϕ_i(x).   (2.32)

We can now formulate the spatially discretized version of the weak evolution equation (2.31) as follows: ∀ j ∈ {1, . . . , N_B}

  ∫_Ω (∂_t φ_{h,l} / δ_ε(φ_{h,l})) ϕ_j + ν ∫_Ω ∇ϕ_j · ∇φ_{h,l} / Q_ε(|∇φ_{h,l}|) = −μ Σ_{i=0}^{N_S−1} ∫_Ω g_i^l χ^l_{i,ε}(Φ_h) ϕ_j.   (2.33)

    2.10 Time Discretization

Following Fried [11] we employ a semi-implicit Euler scheme for time discretization. Let τ := T/N be the time step size for some N ∈ N. For a function φ : Ω × [0, T] → R we define

  φ^m(x) := φ(x, mτ),  m = 0, . . . , N.

We will now turn to a linearization by treating all non-linear terms explicitly. Thus, a semi-implicit time discretization of (2.33) takes the following form: ∀ j ∈ {1, . . . , N_B}

  (1/τ) ∫_Ω ((φ^m_{h,l} − φ^{m−1}_{h,l}) / δ_ε(φ^{m−1}_{h,l})) ϕ_j + ν ∫_Ω ∇ϕ_j · ∇φ^m_{h,l} / Q_ε(|∇φ^{m−1}_{h,l}|) = −μ Σ_{i=0}^{N_S−1} ∫_Ω g_i^l χ^l_{i,ε}(Φ^{m−1}_h) ϕ_j.   (2.34)

    2.11 Matrix Formulation

Let us introduce the matrix representation of (2.34) because chapter 3 will make extensive use of it. Sorting the terms results in:

  ∫_Ω (φ^m_{h,l} / δ_ε(φ^{m−1}_{h,l})) ϕ_j + τν ∫_Ω ∇ϕ_j · ∇φ^m_{h,l} / Q_ε(|∇φ^{m−1}_{h,l}|)
  = ∫_Ω (φ^{m−1}_{h,l} / δ_ε(φ^{m−1}_{h,l})) ϕ_j − τμ Σ_{i=0}^{N_S−1} ∫_Ω g_i^l χ^l_{i,ε}(Φ^{m−1}_h) ϕ_j.   (2.35)

Using the function's representation φ^m_{h,l} = Σ_{k=1}^{N_B} φ^m_{k,l} ϕ_k in the basis B_T from (2.32) we reformulate (2.35) to: ∀ l ∈ J, ∀ j ∈ {1, . . . , N_B}

  Σ_{k=1}^{N_B} φ^m_{k,l} [ ∫_Ω ϕ_j ϕ_k / δ_ε(φ^{m−1}_{h,l}) + τν ∫_Ω ∇ϕ_j · ∇ϕ_k / Q_ε(|∇φ^{m−1}_{h,l}|) ]
  = Σ_{k=1}^{N_B} φ^{m−1}_{k,l} ∫_Ω ϕ_j ϕ_k / δ_ε(φ^{m−1}_{h,l}) − τμ Σ_{i=0}^{N_S−1} ∫_Ω g_i^l χ^l_{i,ε}(Φ^{m−1}_h) ϕ_j.   (2.36)

Defining our system matrices A^l ∈ R^{N_B×N_B} and the corresponding right hand sides f^l ∈ R^{N_B} by

  A^l_{jk} := ∫_Ω ϕ_j ϕ_k / δ_ε(φ^{m−1}_{h,l}) + τν ∫_Ω ∇ϕ_j · ∇ϕ_k / Q_ε(|∇φ^{m−1}_{h,l}|),

  f^l_j := Σ_{k=1}^{N_B} φ^{m−1}_{k,l} ∫_Ω ϕ_j ϕ_k / δ_ε(φ^{m−1}_{h,l}) − τμ Σ_{i=0}^{N_S−1} ∫_Ω g_i^l χ^l_{i,ε}(Φ^{m−1}_h) ϕ_j,

we end up with the matrix formulation of the problem: find φ^m_l = (φ^m_{1,l}, . . . , φ^m_{N_B,l})^T with

  A^l φ^m_l = f^l   (2.37)

for all l ∈ J.

The matrix A^l is symmetric and we note that for all v ∈ R^{N_B} with v ≠ 0 and v_h := Σ_{j=1}^{N_B} v_j ϕ_j ∈ X_T the following holds:

  v^T A^l v = Σ_{j,k=1}^{N_B} v_j A^l_{jk} v_k
  = Σ_{j,k=1}^{N_B} v_j ϕ_j v_k ϕ_k integrated against 1/δ_ε(φ^{m−1}_{h,l}) plus the corresponding gradient terms, i.e.
  = ∫_Ω (Σ_{j=1}^{N_B} v_j ϕ_j)(Σ_{k=1}^{N_B} v_k ϕ_k) / δ_ε(φ^{m−1}_{h,l}) + τν ∫_Ω (Σ_{j=1}^{N_B} v_j ∇ϕ_j) · (Σ_{k=1}^{N_B} v_k ∇ϕ_k) / Q_ε(|∇φ^{m−1}_{h,l}|)
  = ∫_Ω v_h² / δ_ε(φ^{m−1}_{h,l}) + τν ∫_Ω |∇v_h|² / Q_ε(|∇φ^{m−1}_{h,l}|) > 0.

Thus the matrix A^l is symmetric and positive definite. This will be important for the selection of an appropriate solver in chapter 3.

3 Mathematical Model of Domain Decomposition Method

Domain decomposition methods are devoted to mathematical and computational strategies in which the computational domain is split into several subdomains in order to solve a boundary value problem faster. The splitting allows us to compute parts of the solution on each subdomain independently and thus in parallel on multiple processors. Of course, independence is only possible over a specific period of time because some communication is needed in between for the transportation of information across subdomain boundaries.

    Basically the techniques can be classified into two categories:

Overlapping methods. In overlapping domain decomposition methods, the subdomains share a thin layer with their neighbors. The Schwarz alternating method and the additive Schwarz method, for example, are two popular overlapping domain decomposition approaches.

Non-overlapping methods. In these methods, adjacent subdomains only share an (n − 1)-dimensional part of the computational domain, the so-called interface. Several approaches like finite element tearing and interconnect (FETI) or balancing domain decomposition (BDDC) exist and are used widely.

In this work we will present a so-called Schur complement method, which belongs to the non-overlapping methods. We chose this method because it can be described in a very intuitive way by pure algebraic means and it furthermore can be implemented with high parallel efficiency. The first non-trivial task in domain decomposition techniques is the partitioning of the domain Ω into subdomains. The unknowns arising from a finite element discretization are split into groups in a very natural way by the partitioning of Ω. We will distinguish between unknowns belonging to the interior of subdomains and the ones belonging to the interface, which separates the subdomains from each other. The Schur complement method decouples the unknowns belonging to the interior of the subdomains from each other and introduces a problem that has to be solved on the interface unknowns only. The algorithms we shall present in this chapter can be applied to any linear system Ax = f arising from a finite element discretization as described in sections 2.9 and 2.11. We will, however, concentrate on symmetric, positive definite matrices and the image segmentation problem in particular.

Because the objective of the domain decomposition method is to speed up the solution process on a computer, we have split the domain decomposition algorithms into two parts. The necessary mathematical prerequisites will be described in this chapter while the details concerning computational issues and the actual implementation are presented in chapter 4.

Figure 3.1: Partition of Ω into two non-overlapping subdomains P_1 and P_2. The interface I = P̄_1 ∩ P̄_2 separates the subdomains from each other.

    3.1 Partitioning

    First of all we shall introduce the terms partition and interface:

Definition 3.1.1 (Partition, Interface). A set of subsets P_i ⊂ Ω, i = 1, . . . , N_P is called a partition P = {P_i}_{i=1,...,N_P} of Ω if

(1) P_i ≠ ∅  ∀ i ∈ {1, . . . , N_P},

(2) Ω̄ = ⋃_{i=1}^{N_P} P̄_i,

(3) P_i ∩ P_j = ∅  ∀ i, j ∈ {1, . . . , N_P}, i ≠ j.

Then P_i is called a subdomain of Ω for every i ∈ {1, . . . , N_P}. The induced interface I is defined by

  I := ⋃_{i,j ∈ {1,...,N_P}, i≠j} (P̄_i ∩ P̄_j).

    Figure 3.1 illustrates the definitions of partition and interface in a basic example.

Note 3.1.1. The partition interface I is not to be confused with the segmentation interface Γ defined in (2.8). They may, but usually will not, coincide. The same applies to the subdomains P_i, which are often denoted by Ω_i in domain decomposition literature. However, Γ and Ω_i will always refer to the segmentation algorithm in this work, whereas the subdomains P_i and the partition interface I are related to the domain decomposition technique.

As we are about to develop a domain decomposition method for an algorithm using the finite element discretization for the computational domain we only allow for partitions with subdomains consisting of complete simplices. We demand that for a partition P = {P_i}_{i=1,...,N_P} of Ω with respect to a conforming triangulation T = {S_j}_{j=1,...,N_T} the following holds for some index subsets J_i ⊂ {1, . . . , N_T}:

  ∀ i ∈ {1, . . . , N_P} : P̄_i = ⋃_{j ∈ J_i} S_j.

Figure 3.2: Partitioning of a rectangular domain Ω ⊂ R² into 4 rectangular subdomains with the naive approach (m_1 = m_2 = 2): (a) shows the partition based on a globally refined triangulation; (b) shows failing load-balancing when the underlying triangulation is locally refined.

The index sets J_i ⊂ {1, . . . , N_T} now uniquely define our partition. Partitioning a given triangulation plays a vital role in the development of an efficient domain decomposition algorithm. Image processing usually takes place on rectangular domains, e.g. Ω = [0, 1]^n, so trivial partitioning strategies come to mind.

    3.1.1 Naive Partitioning

A straightforward, geometrical approach is to cut the domain Ω into m_l ∈ N slices along every axis l ∈ {1, . . . , n}. The subdomains then are defined by the cartesian products

  P_i := [ (j_1 − 1)/m_1 , j_1/m_1 ] × ⋯ × [ (j_n − 1)/m_n , j_n/m_n ]

with j_l ∈ {1, . . . , m_l} and the partition index i ∈ {1, . . . , ∏_{k=1}^n m_k} satisfying

  i = 1 + Σ_{l=1}^n (j_l − 1) ∏_{k=1}^{l−1} m_k

(a small index helper is sketched below).
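As an illustration (a hypothetical helper, not from the thesis), the index formula translates directly into C:

/* Sketch: map slice indices (j_1, ..., j_n), each j[l] in {1,...,m[l]},
 * to the partition index i from section 3.1.1 (arrays are 0-based). */
int partition_index(const int *j, const int *m, int n)
{
    int i = 1, stride = 1;
    for (int l = 0; l < n; l++) {
        i += (j[l] - 1) * stride;   /* (j_l - 1) * prod_{k<l} m_k */
        stride *= m[l];
    }
    return i;
}

For n = 2 and m = (2, 2), the slices (j_1, j_2) = (1,1), (2,1), (1,2), (2,2) map to i = 1, 2, 3, 4, which matches the chessboard numbering in figure 3.2(a).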

This results in a chessboard-like partition of an associated triangulation T as illustrated in figure 3.2(a) for a globally refined triangulation where all simplices are of the same volume and arranged in a structured way.

Problems arise when it comes to locally refined triangulations like the one shown in figure 3.2(b). Here, the number of unknowns in the subdomain P_3 is much greater than the number in the subdomains P_1, P_2 and P_4. We will later on assign every subdomain P_i to one processor. An imbalance, as illustrated in figure 3.2(b), results in a disastrous parallel efficiency, because the processor dealing with P_3 would be still computing while the others would already have finished their task. Three processors would waste CPU cycles in idle mode. This behavior becomes even worse for larger numbers of CPUs.


Another drawback of this basic, geometrical partitioning strategy is the restriction to N_P = ∏_{k=1}^n m_k subdomains. To obtain an arbitrary number of subdomains one could set m_1 to the desired number of subdomains and m_k = 1 for k > 1. But this work-around introduces another problem concerning the size of the interface I. We will see later on that only the unknowns belonging to the interface need to be exchanged between subdomains. Simple partitioning strategies result in large interfaces producing time-consuming communication overhead or an imbalance between the subdomains. Both lead to poor scalability.

    3.1.2 Load-Balancing Partitioning

As seen in the previous section, a good partitioning algorithm for a scalable parallel application should ideally combine two features:

    (1) The variance of the number of simplices in each subdomain is minimal.

    (2) The size of the interface separating the subdomains from each other is minimal.

Karypis and Kumar presented a graph partitioning algorithm in [13], which resulted in their widely-used and well-tested open source software package Metis described in [14]. The task of partitioning a triangulation T can be transformed into a graph theory formulation easily by using the dual graph, defined as the undirected graph G = (T, E) with one vertex for each simplex S_i ∈ T = {S_i}_{i=1,...,N_T} and one edge for every two adjacent simplices:

  E := { {S_i, S_j} | i, j ∈ {1, . . . , N_T}, i ≠ j, S_i ∩ S_j is an (n − 1)-dimensional sub-simplex of S_i and S_j }.

Figure 3.3(b) gives an example of what the dual graph looks like for a small two-dimensional mesh.

Metis and its parallelized offspring ParMetis address exactly the mentioned demands and thus are perfect candidates for partitioning a given triangulation T into equal-sized subdomains with minimal interface size. In addition to the high quality of the obtained partitions, Metis and ParMetis are very fast. For further details on the algorithms used we refer to the work of Karypis and Kumar, particularly [13] and [14]. Figure 3.3 illustrates the typical workflow for partitioning a given triangulation with Metis. We will discuss the remaining implementation issues in 4.4.

Note 3.1.2. As of this writing, the partitioning routines implemented in Metis do not guarantee that the resulting subdomains P_i are contiguous. In practice we have only been able to observe non-contiguous subdomains with Metis in particularly unrealistic cases where the number of partitions was almost the number of simplices. However, our algorithm is prepared for non-contiguous subdomains.

Note 3.1.3. Metis tries to achieve a minimal edge-cut in the dual graph, which means a minimal size of the interface I in terms of adjacent simplices (and not in terms of the geometrical length), while trying to keep the number of graph vertices (simplices) in each partition equal. However, the sizes of the interface parts I_i := I ∩ ∂P_i touching one particular subdomain may vary. This has to be considered when designing and implementing the algorithms.

Figure 3.3: Evolution from a triangulation to a balanced partition with assistance of Metis: (a) shows a locally refined triangulation T of a rectangular domain Ω ⊂ R² and (b) illustrates the corresponding dual graph G with one vertex per simplex; Metis then assigns a partition number to each vertex of G as shown in (c), which finally results in the subdomains and the interface in (d).


    3.2 The Schur Complement Method

We will now investigate how to decouple the unknowns belonging to the several subdomains. Let us, for this purpose, recall the matrix formulation of the image segmentation problem from section 2.11. Our presentation is based on the work of Toselli and Widlund [19], Barth, Chan and Tang [2] and Saad [17]. The following considerations are not restricted to the image segmentation problem, but apply just as well to any symmetric positive definite matrix formulation arising from a finite element discretization. We will drop unnecessary indices from equation (2.37) in this section and work on the problem

  Ax = f   (3.1)

with a symmetric positive definite matrix A ∈ R^{N×N}, the unknowns x = (x_1, . . . , x_N)^T ∈ R^N and a right hand side f ∈ R^N. N denotes the number of global Lagrange basis functions of the underlying finite element space X_T defined in section 2.9. Let P = {P_i}_{i=1,...,M} from now on be a non-overlapping partition of Ω into M subdomains based on a triangulation T. The induced interface is again denoted by I.

    3.2.1 Block Gaussian Elimination

We will start off with a reordering of the variables in the vectors x and f such that unknowns belonging to the interior of P_1, . . . , P_M are arranged in order first and the ones belonging to the interface I are moved to the end. With N^i_P ∈ N and N_I denoting the number of unknowns belonging to the interior of the subdomain P_i (i = 1, . . . , M) and to the interface I respectively, we obtain

  x = (x_{P_1}, . . . , x_{P_M}, x_I)^T and f = (f_{P_1}, . . . , f_{P_M}, f_I)^T

with x_{P_i}, f_{P_i} ∈ R^{N^i_P} and x_I, f_I ∈ R^{N_I}.

Of course, the reordering affects the matrix since rows and columns have to be permuted accordingly. The small support of the Lagrange basis functions then is responsible for the following block structure of the reordered matrix:

      ( A_{P_1P_1}                              A_{P_1I} )
      (            A_{P_2P_2}                   A_{P_2I} )
  A = (                        ⋱                   ⋮     )   (3.2)
      (                          A_{P_MP_M}     A_{P_MI} )
      ( A_{IP_1}   A_{IP_2}   ⋯  A_{IP_M}       A_{II}   )

Figure 3.4 gives insight into the structure of the reordered matrix for a simple case with 4 partitions. We will group the matrices for the sake of clarity:

  A = ( A_{PP}  A_{PI} )
      ( A_{IP}  A_{II} )   (3.3)

Figure 3.4: (a) shows a partitioning of a triangulation with 584 simplices into 4 subdomains and (b) shows the block structure of the corresponding matrix A after reordering the unknowns. Here the interface I has been divided into parts I_1, . . . , I_5, I_X in order to illustrate the adjacency structure in the matrix more clearly.

Note that A_{PP} is a block diagonal matrix. Furthermore, the reordering conserves the sparseness, symmetry and positive definiteness of A because rows and columns are changed simultaneously.

Let us now perform a block Gaussian elimination to eliminate the block A_{IP} in (3.3). We therefore multiply equation (3.1) with

  L := ( I                    0 )
       ( −A_{IP} A_{PP}^{−1}  I )

and obtain

  ( A_{PP}   A_{PI}                             ) ( x_P )   ( f_P                          )
  ( 0        A_{II} − A_{IP} A_{PP}^{−1} A_{PI} ) ( x_I ) = ( f_I − A_{IP} A_{PP}^{−1} f_P ).   (3.4)

The matrix

  S := A_{II} − A_{IP} A_{PP}^{−1} A_{PI}   (3.5)

is called the Schur complement matrix of A associated with the interface variables x_I. Together with

  f̃_I := f_I − A_{IP} A_{PP}^{−1} f_P   (3.6)

we obtain the Schur complement system

  S x_I = f̃_I.   (3.7)

Solving (3.1) can now be performed in three steps:

1. Compute the adapted right hand side f̃_I = f_I − A_{IP} A_{PP}^{−1} f_P for the Schur complement system (3.6).

2. Solve the reduced Schur complement system S x_I = f̃_I to obtain the interface solution x_I (3.7).

3. Backward substitution by solving A_{PP} x_P = f_P − A_{PI} x_I for the interior unknowns x_P (3.4).

A small numerical example below illustrates these steps.
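A toy example (not from the thesis) with one interior and one interface unknown:

\[
A = \begin{pmatrix} 4 & 1 \\ 1 & 3 \end{pmatrix}, \quad
f = \begin{pmatrix} 1 \\ 2 \end{pmatrix}: \qquad
\tilde f_I = 2 - 1 \cdot \tfrac{1}{4} \cdot 1 = \tfrac{7}{4}, \quad
S = 3 - 1 \cdot \tfrac{1}{4} \cdot 1 = \tfrac{11}{4}, \quad
x_I = \tfrac{7}{11}, \quad
x_P = \tfrac{1}{4}\bigl(1 - \tfrac{7}{11}\bigr) = \tfrac{1}{11},
\]

and indeed \(A (x_P, x_I)^T = f\).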

    3.2.2 Decoupling of Subdomain Problems

We can now take advantage of the fact that the matrix A_{PP} is block diagonal with each block associated with a subdomain. Block diagonal matrices can be inverted blockwise,

  A_{PP}^{−1} = diag(A_{P_1P_1}, A_{P_2P_2}, . . . , A_{P_MP_M})^{−1} = diag(A_{P_1P_1}^{−1}, A_{P_2P_2}^{−1}, . . . , A_{P_MP_M}^{−1}),   (3.8)

and the solution of a system

  A_{PP} z_P = y_P

in fact naturally decouples into M systems

  A_{P_iP_i} z_{P_i} = y_{P_i},  i ∈ {1, . . . , M}.

These systems can therefore be solved independently in parallel.

    3.2.3 Iterative Solver for the Schur Complement System

The reduced system (3.7) can be solved by an iterative solver. One major advantage of iterative methods in this scenario is the option of abandoning the expensive explicit formation of the Schur complement matrix S, because only matrix-by-vector multiplications y_I = S x_I are required (see the sketch below). To be able to select an appropriate iterative solver we need the following theorem.
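To illustrate what such a matrix-free solver looks like, here is a generic Conjugate Gradient loop in C. This is a sketch built around an abstract matvec callback, not the ALBERTA implementation used later; S enters only through calls that compute S x:

#include <stdlib.h>
#include <math.h>

/* matvec(y, x, n, ctx) must compute y = S x; S itself is never formed. */
typedef void (*matvec_fn)(double *y, const double *x, int n, void *ctx);

void cg(matvec_fn matvec, void *ctx, const double *f, double *x, int n,
        double tol, int maxit)
{
    double *r = malloc(n * sizeof *r), *p = malloc(n * sizeof *p),
           *q = malloc(n * sizeof *q);
    matvec(q, x, n, ctx);                          /* q = S x_0 */
    double rr = 0.0;
    for (int i = 0; i < n; i++) {
        r[i] = f[i] - q[i]; p[i] = r[i]; rr += r[i] * r[i];
    }
    for (int it = 0; it < maxit && sqrt(rr) > tol; it++) {
        matvec(q, p, n, ctx);                      /* q = S p */
        double pq = 0.0;
        for (int i = 0; i < n; i++) pq += p[i] * q[i];
        double alpha = rr / pq;                    /* step length */
        double rr_new = 0.0;
        for (int i = 0; i < n; i++) {
            x[i] += alpha * p[i];                  /* update iterate */
            r[i] -= alpha * q[i];                  /* update residual */
            rr_new += r[i] * r[i];
        }
        double beta = rr_new / rr;                 /* new search direction */
        rr = rr_new;
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
    }
    free(r); free(p); free(q);
}

Convergence of this method requires S to be symmetric positive definite, which is exactly what the following theorem guarantees.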

Theorem 3.2.1 (Symmetric positive definiteness of the Schur complement matrix). Let for n, m ∈ N

  M = ( A    B )
      ( B^T  C ) ∈ R^{(n+m)×(n+m)}

be a symmetric positive definite matrix composed of blocks A ∈ R^{n×n}, B ∈ R^{n×m} and C ∈ R^{m×m}. Then the Schur complement matrix

  S := C − B^T A^{−1} B ∈ R^{m×m}

is also symmetric positive definite.


Proof. Symmetry directly follows from the symmetry of M and hence A = A^T, C = C^T and A^{−1} = (A^{−1})^T:

  S^T = C^T − (B^T A^{−1} B)^T = C − B^T (A^{−1})^T B = C − B^T A^{−1} B = S.

We will now show that S is positive definite. Let 0 ≠ z ∈ R^m be an arbitrary non-zero vector and y := −A^{−1} B z ∈ R^n. Since M is positive definite and (y, z)^T ≠ 0 we obtain

  0 < (y^T z^T) M (y, z)^T = y^T A y + y^T B z + z^T B^T y + z^T C z
    = z^T B^T A^{−1} B z − 2 z^T B^T A^{−1} B z + z^T C z
    = z^T (C − B^T A^{−1} B) z = z^T S z,

hence S is positive definite.

4.4 Partitioning of Triangulations using ParMETIS

static int img_alberta_mesh_tag(MESH *mesh)
{
  int count = 0;
  TRAVERSE_STACK *stack = get_traverse_stack();
  const EL_INFO *el_info;

  /* traverse all leaf elements and tag each with a unique id */
  for (el_info = traverse_first(stack, mesh, -1, CALL_LEAF_EL); el_info;
       el_info = traverse_next(stack, el_info))
  {
    img_el_parinfo *el_parinfo = get_el_parinfo(el_info->el);
    el_parinfo->id = count++;
  }
  free_traverse_stack(stack);

  return count;
}

    Listing 4.7: Definition of img_alberta_mesh_tag

After the triangulation has been tagged, the adjacency information is gathered by traversing the mesh another time in a similar way. In each element we iterate through all neighbors and fill xadj, adjncy and vtxdist accordingly. The MPI_Bcast function is used to transfer parameters for ParMetis to the worker processes. The arrays xadj and adjncy are distributed with help of the function MPI_Scatter which sends equal sized parts to all processes. ParMetis then is executed via a call to ParMETIS_V3_PartKway() with the parameter controlling the number of desired partitions set to the number of worker processes; a sketch of such a call is given below. The result of the partitioning process is stored in an array int *part which holds a partition number for each vertex of the dual graph and thus for each simplex of the triangulation. We iterate another time through the triangulation and store the partition number in each leaf element's member el_parinfo->part. Every simplex of the triangulation now is tagged with a partition index and we can begin to distribute the subdomains among the worker processes.
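A sketch of the ParMETIS invocation (the wrapper function and the chosen option values are assumptions for illustration, not the exact code in Image; vtxdist, xadj and adjncy hold the dual graph in distributed CSR form as described above):

#include <stdlib.h>
#include <mpi.h>
#include <parmetis.h>

void partition_dual_graph(idx_t *vtxdist, idx_t *xadj, idx_t *adjncy,
                          idx_t nparts, idx_t *part, MPI_Comm comm)
{
    idx_t wgtflag = 0, numflag = 0, ncon = 1, edgecut;
    idx_t options[3] = {0, 0, 0};            /* ParMETIS default options */
    real_t ubvec = 1.05;                     /* 5% load imbalance tolerance */
    real_t *tpwgts = malloc(nparts * sizeof *tpwgts);
    for (idx_t i = 0; i < nparts; i++)
        tpwgts[i] = 1.0 / (real_t)nparts;    /* equal-sized subdomains */

    ParMETIS_V3_PartKway(vtxdist, xadj, adjncy,
                         NULL, NULL,         /* no vertex/edge weights */
                         &wgtflag, &numflag, &ncon, &nparts,
                         tpwgts, &ubvec, options, &edgecut,
                         part,               /* out: partition per simplex */
                         &comm);
    free(tpwgts);
}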

    4.5 Distribution of Subdomains

The subdomains now have to make their way to the worker processes. The hierarchical mesh in ALBERTA assumes that every element of the binary tree has either two or no children, so we are not allowed to just copy leaf elements along with their parents to the worker processes because we could end up with a corrupt tree where some elements only have one child.

Our approach is to create a macro triangulation for each subdomain. ALBERTA ships with methods creating a valid triangulation from a structure called MACRO_DATA so we just fill its members:

struct macro_data {
  int     dim;               // dimension of the mesh
  int     n_total_vertices;  // number of vertices
  int     n_macro_elements;  // number of macro elements
  REAL_D *coords;            // vertex coordinates
  int    *mel_vertices;      // macro element vertices
  int    *neigh;             // macro element neighbors
  S_CHAR *boundary;          // boundary type if no neighbor
  U_CHAR *el_type;           // not needed by our implementation
};
typedef struct macro_data MACRO_DATA;

Listing 4.8: Declaration of MACRO_DATA


By using this approach, we lose the ability to coarsen the triangulation in the worker processes since it only consists of macro elements as leaf elements. Of course, we are able to adapt the triangulation in the master process, but it then has to be repartitioned and redistributed to the worker processes.

After building the macro triangulations each worker process receives its subdomain in form of a MACRO_DATA from the master via the MPI calls MPI_Send and MPI_Recv.

    4.6 Association of Global and Local Degrees of Freedom

Each worker process can now allocate its own local finite element space (FE_SPACE) for the corresponding subdomain triangulation. For the efficient exchange of finite element data between the master and the worker processes we need special structures associating the global degrees of freedom in the master process with the subdomains' local degrees of freedom. Especially the processing speed in the master process is critical for overall performance.

In order to distribute initial data, the master process first of all needs to know which DOFs belong to the subdomain of each worker process. For this reason the master process temporarily allocates a finite element space (FE_SPACE) for each subdomain in exactly the same manner the worker processes do. This way we are able to retrieve the association between worker and master DOFs. For each worker process, the master holds a DOF_INT_VEC of the corresponding subdomain's FE_SPACE storing the subdomain DOF's index in the global FE_SPACE. These association vectors are combined in the array DOF_INT_VEC **assoc_wa2ma. To be precise: the DOF indexed with i in the FE_SPACE of the worker process P_j is identified in the global FE_SPACE by the index assoc_wa2ma[j-1]->vec[i].

Note 4.6.1. wa2ma is an abbreviation for worker-all-to-master-all. We will introduce another association vector specialized for the interface in the next section.
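As a minimal illustration (a hypothetical loop following the description above, not code from Image), translating a whole worker vector into global DOF positions amounts to:

/* Sketch: scatter local DOF values of worker P_j into a global vector
 * using the wa2ma association vector; assoc_j = assoc_wa2ma[j-1]. */
void scatter_worker_dofs(REAL *global_vals, const REAL *worker_vals,
                         const DOF_INT_VEC *assoc_j, int n_dofs)
{
    for (int i = 0; i < n_dofs; i++)
        global_vals[assoc_j->vec[i]] = worker_vals[i];
}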

    4.7 Handling of Interface Data

The operations on interface data in the master process have to run as fast as possible, because even the smallest amounts of wasted time turn out to be fatal for scalability. In order to achieve the highest possible performance in the master, we turned away from ALBERTA's DOF_REAL_VECs for interface data and use plain C arrays combined with special interface mappings. These are again plain C arrays mapping only the subdomain's interface degrees of freedom to the index in the master interface array; they correspond to the projection matrices R^i_I introduced in section 3.2.4.

The values of a finite element function belonging to the interface are stored in REAL *iface_vals and the right hand side in REAL *iface_rhs. Before actually running the distributed solving process we copy the interface values from a DOF_REAL_VEC to the corresponding location in iface_vals, and vice versa upon completion. The right hand side vector iface_rhs is directly filled with the data obtained from the worker processes and does not need to be changed until completion of the solving process.

The interface mappings are organized in arrays int **assoc_wi2mi similar to assoc_wa2ma used for the mapping of all DOFs. An interface degree of freedom with the index i in the worker process P_j is identified in the master's interface array by the index assoc_wi2mi[j-1][i].

    Note 4.7.1. wi2mi is an abbreviation for worker-interface-to-master-interface.
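The mapping is used like this when interface contributions from worker P_j arrive in the master (a sketch consistent with the description above; the helper name is made up):

/* Sketch: accumulate a worker's interface contribution into the
 * master's plain C array via wi2mi (cf. R_I^i in section 3.2.4);
 * wi2mi_j = assoc_wi2mi[j-1]. */
void add_worker_iface(REAL *iface_vals, const REAL *worker_vals,
                      const int *wi2mi_j, int n_iface_dofs_j)
{
    for (int i = 0; i < n_iface_dofs_j; i++)
        iface_vals[wi2mi_j[i]] += worker_vals[i];
}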

    4.8 Non-Blocking MPI Communication

Communication between the master and worker processes initially had been implemented using the basic and easy-to-use MPI directives MPI_Send and MPI_Recv for point-to-point data exchange. These routines are blocking, which means that the functions wait and do not return before all data has been sent or received, respectively. Most notably the functions wait if their counterparts have not even been called. It turns out that blocking function calls cause serious performance loss when operations are performed with multiple processes in sequence. A piece of code exhibiting such runtime behavior is shown in listing 4.9.

If, for example, the worker process P_1 has not yet finished its computation and thus is not able to provide results with MPI_Send, the master process is stuck in line 8 until P_1 has initiated and completed the communication via a call to MPI_Send. Perhaps other worker processes have already initiated an MPI_Send but have to wait because the corresponding MPI_Recv in the master process has not yet been reached due to the late process P_1. All processes except P_1 may come to a halt even though data is ready for further processing, which would result in severe performance deterioration when using a large number of processes. Figure 4.10 illustrates the MPI communication in a worst case scenario.

The solution to this problem is to switch to non-blocking MPI communication with the commands MPI_Isend and MPI_Irecv. This requires some additional code presented in listing 4.11.

The master then calls MPI_Irecv for each worker process; each call returns immediately and just fills the corresponding entry in the array request. MPI_Waitany waits for any of these requests and, when entering the while-loop, the buffer buf[index] has already been filled with the data received from the worker process P_{index+1}. The result can be processed in the loop just like in the blocking version. Figure 4.12 shows the runtime behavior using non-blocking MPI communication in the master process.


1  REAL *iface_vals;   // result initialized with zeros
2  int **assoc_wi2mi;  // association vector (cf. section 4.7)
3  int *len;           // # of iface values for each subdomain
4  REAL *buf;          // buffer of size max_j {len[j]}
5  // [...]
6  for (int source = 1; source < size; source++)
7  {
8    MPI_Recv(buf, len[source-1], REAL_MPI, source, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
9    for (unsigned int i = 0; i < len[source-1]; i++)
       iface_vals[assoc_wi2mi[source-1][i]] += buf[i];
10 }

Listing 4.9: Receiving interface data with blocking MPI communication

// variables except buf as in listing 4.9
REAL **buf;  // buf[j] is of size len[j]
// [...]
MPI_Request *request = malloc((size - 1) * sizeof(MPI_Request));

for (int source = 1; source < size; source++)
  MPI_Irecv(buf[source-1], len[source-1], REAL_MPI, source,
            0, MPI_COMM_WORLD, request + (source - 1));

int index = 0;
while ((MPI_Waitany(size - 1, request, &index, MPI_STATUS_IGNORE) ==
        MPI_SUCCESS)
       && (index != MPI_UNDEFINED))
{
  // buf[index] now holds the data received from worker P_{index+1}
  for (unsigned int i = 0; i < len[index]; i++)
    iface_vals[assoc_wi2mi[index][i]] += buf[index][i];
}
free(request);

Listing 4.11: Receiving interface data with non-blocking MPI communication

    4.9 Distributed Iterative Solver

    This section will use the results of 3.2 in order to briefly describe the core of the Schurcomplement domain decomposition implementation in Image. The solution processinvolves the following three steps for each timestep:

    1. Assembly and right hand side adaption

    Assemble the subdomain matrices Ai and right hand sides f i.

    Compute adapted right hand sides f iI = fiI A

    iIP

    (AiPP

    )1f iP for the Schur

    complement system.

    2. Solve the Schur complement system SxI = fI to obtain the interface solution xIwith the help of an iterative solver involving only matrix-by-vector multiplicationsrI = SxI .

3. Solve $A^i_{PP}\, x^i_P = f^i_P - A^i_{PI}\, x^i_I$ for the interior unknowns $x^i_P$.

We will now discuss the most important parts of the implementation concerning these steps.

    4.9.1 Assembly of Matrices and Adaption of Right Hand Sides

Each worker process $P_i$ has its own finite element space (FE_SPACE), and we are thus able to assemble the local subdomain matrices $A^i$ as well as the right hand sides $f^i$ in the worker processes in parallel. This is realized with ALBERTA's standard mesh traversal routine, which visits each element and may run arbitrary code. We compute each element's local mass and stiffness matrix and add it to the subdomain's system matrix $A^i$. The subdomain matrices are stored in ALBERTA's standard DOF_MATRIX structure.

Immediately afterwards, the worker processes start to adapt the right hand side for the Schur complement system by first solving

$$A^i_{PP}\, z^i_P = f^i_P \qquad (4.1)$$

and then computing

$$\tilde{f}^i_I = f^i_I - A^i_{IP}\, z^i_P. \qquad (4.2)$$

Up to this point, all tasks have been carried out in parallel without any communication. Now each worker process sends $\tilde{f}^i_I$ to the master process, where the right hand side for the Schur complement system is obtained by summing up the subdomain contributions, $\tilde{f}_I = \sum_{i=1}^{M} (R^i_I)^T \tilde{f}^i_I$. Instead of multiplying with the prolongation matrices $(R^i_I)^T$, the association vectors assoc_wi2mi described in section 4.7 are used.

    4.9.2 Schur Complement System Solver

Our implementation is an extension of the Conjugate Gradient solver already implemented in ALBERTA. This extension replaces ALBERTA's standard matrix-vector product routine with one aware of the distributed domain decomposition structures. As



described in section 3.2.3, we will not form the matrices $S$ or $S^i$ explicitly because of the high computational costs for the inverses $(A^i_{PP})^{-1}$.

We recall the local subdomain Schur complements (3.13) and the relation to the global Schur complement from (3.12),

$$S^i = A^i_{II} - A^i_{IP}\,(A^i_{PP})^{-1} A^i_{PI} \qquad (4.3)$$

$$S = \sum_{i=1}^{M} (R^i_I)^T S^i R^i_I, \qquad (4.4)$$

which give us a recipe for implementing the matrix-by-vector multiplication with the Schur complement matrix in a distributed manner. For each iteration of the outer iterative solver, we have to compute a matrix-by-vector multiplication $r_I = S x_I$ by performing the following operations in our implementation:

1. First of all, the master process gathers the interface DOFs of $x_I$ for each worker process by using assoc_wi2mi (cf. section 4.7) and sends them accordingly via MPI. This corresponds to the application of the restriction operator $R^i_I$ in (4.4). Each worker process $P_i$ now holds the portion $x^i_I$ affecting its interface part.

2. Each worker process computes $y^i_P = A^i_{PI}\, x^i_I$.

3. Each worker process solves $A^i_{PP}\, z^i_P = y^i_P$ using the standard Conjugate Gradient solver implemented in ALBERTA. We need a high accuracy for this solution, as stated in note 3.2.1.

4. Each worker process computes the subdomain result $r^i_I = A^i_{II}\, x^i_I - A^i_{IP}\, z^i_P$ (steps 2–4 are sketched in code after this list).

5. The master process receives and sums up the subdomain results $r^i_I$ to obtain $r_I = \sum_{i=1}^{M} (R^i_I)^T r^i_I$. We again employ the efficient association vectors assoc_wi2mi instead of a multiplication with the matrices $(R^i_I)^T$. Additionally, the non-blocking MPI communication described in section 4.8 is used to receive the interface data $r^i_I$ from the worker processes. This allows us to process data as soon as it is available and thus prevents unnecessary delays in the master process when worker processes do not finish their computations in order.

The remaining steps besides this matrix-by-vector multiplication, like the computation of the descent direction and the updates of the residual and the solution, are all performed by ALBERTA's Conjugate Gradient solver. Because ALBERTA allows the matrix-by-vector multiplication to be exchanged easily for every implemented iterative solver (e.g. GMRes and BiCGstab), we would be able to use these as well in the case of non-symmetric, positive definite matrices.

Note that steps 2–4 can be carried out in parallel without communication. Only steps 1 and 5 involve communication via MPI, which has been optimized in our implementation in order to obtain better scalability.

Besides the matrix-by-vector multiplication, the Conjugate Gradient method only requires the computation of scalar products and the sum of two vectors in each iteration. These are computed serially on the master, but as outlined in section 4.7, the vectors are plain C arrays of the size of the interface and we are able to use optimized BLAS routines, e.g. from the AMD Core Math Library or the Intel Math Kernel Library. The computational cost of one Conjugate Gradient iteration is dominated by the distributed matrix-by-vector multiplication in real-world applications if the problem size is not too small. We will present details concerning the runtime behavior in chapter 5. Nevertheless, the serial parts in the master process, the condition of the Schur complement or a slight load imbalance would be responsible for decreasing scalability when radically increasing the number of processors.

    4.9.3 Backward Substitution

The last step performed in each timestep is the solution for the subdomains' interior variables $x^i_P$. Therefore, the solution of

$$A^i_{PP}\, x^i_P = f^i_P - A^i_{PI}\, x^i_I$$

is computed in parallel in the worker processes.

We only transfer the interior solutions from the worker processes to the master

process if additional computation or output is required in the master process. For the assembly of the local subdomain matrices $A^i$ of the next time step, only the values already present in the respective worker processes are needed. For the segmentation algorithm we have to sum up the mean values

$$c^k_i = \frac{1}{|\Omega_i|} \int_{\Omega_i} I^k = \frac{1}{\sum_{j=1}^{M} |\Omega_i \cap P_j|} \sum_{j=1}^{M} \int_{\Omega_i \cap P_j} I^k$$

for each channel $k$ and each segment $i$ at the end of a time step (cf. section 2.5). This is accomplished by computing the volumes and integrals locally in the worker processes and employing the function MPI_Allreduce() with the MPI reduce operation set to MPI_SUM in order to sum up the local contributions and distribute the result back to all processes.
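This reduction can be written down with plain MPI as in the following self-contained sketch; the segment and channel counts as well as the local values are dummy placeholders, and Image uses ALBERTA's REAL type with the matching REAL_MPI datatype instead of double/MPI_DOUBLE:

#include <mpi.h>
#include <stdio.h>

#define N_SEG  2   /* number of segments  (illustrative value) */
#define N_CHAN 3   /* number of channels  (illustrative value) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* local contributions of this process: volumes |Omega_i ∩ P_j| and
       integrals of I^k over Omega_i ∩ P_j (dummy values here) */
    double vol[N_SEG]           = { 1.0, 2.0 };
    double integ[N_SEG][N_CHAN] = { { 0.5, 0.25, 0.75 },
                                    { 1.0, 0.50, 1.50 } };

    /* sum the local contributions over all processes and distribute the
       result back to everyone: the MPI_SUM reduction described above */
    MPI_Allreduce(MPI_IN_PLACE, vol, N_SEG, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, integ, N_SEG * N_CHAN, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);

    /* every process can now form the mean values c_i^k itself */
    for (int i = 0; i < N_SEG; i++)
        for (int k = 0; k < N_CHAN; k++)
            printf("c_%d^%d = %f\n", i, k, integ[i][k] / vol[i]);

    MPI_Finalize();
    return 0;
}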

Figure 4.13 gives an impression of the parallel work flow for the initialization phase and one timestep.

Note 4.9.1. A major feature of this implementation is that no matrices have to be stored or assembled in the master process. All operations that have to be carried out in the master process are usually considered to be performance-critical. With the underlying domain decomposition approach based upon the finite element method, we are able to assemble the subdomain matrices in parallel in a very natural way without any communication.



[Figure 4.13 is a timeline diagram of the four processes P0–P3; only its recoverable content is summarized here. Per process over time: partitioning with ParMetis (cf. 4.4), distribution of the subdomains (cf. 4.5), construction of the FE_SPACEs with DOF association and transfer of the initial data (cf. 4.6); per timestep: assembly of $A^i$, $f^i$ and computation of the adapted right hand sides $\tilde f^i_I$ (cf. 4.9.1), the Schur complement solver iterations with repeated exchange of interface vectors until the tolerance is reached (cf. 4.9.2), the backward substitution for the interior solutions $x^i_P$ (cf. 4.9.3), and the transfer of the full solution to the master if needed.]

Figure 4.13: Timeline of the initialization and one timestep in our implementation with 4 processes: initialization (orange), assembly (green), Schur complement solver (yellow) and solving for the interior variables (blue). The arrows indicate MPI communication between the processes.


5 Numerical Results

In this chapter, we turn to numerical results of the presented algorithms. In the first part we present results obtained by the segmentation algorithm. The second part analyzes the runtime behavior of the parallelization with benchmarks. All presented results have been computed using the domain decomposition implementation of the segmentation algorithm in the Image project.

Linear Lagrange elements have been used for all calculations. The ALBERTA library allows us to use elements of higher order, but difficulties arise when it comes to the computation of the integrals in the segmentation equation's right hand side (cf. 2.34) and the mean values. A non-linear zero isoline of the level set functions would no longer split the elements into simplices, and the geometry would become hard to tackle.

    5.1 Segmentation

First of all, we verify the correctness of the parallel segmentation algorithm by presenting the experimental order of convergence in a case with a known solution. We also show examples where no exact solution is known but many features of the Mumford-Shah segmentation method can be recognized. The parallel performance, like timing and efficiency of the used domain decomposition technique, is omitted here and is the subject of the second part (section 5.2).

    5.1.1 Experimental Order of Convergence

In this section we numerically check the algorithm for convergence and compute the experimental order of convergence in the case of a known exact solution. The solution presented here follows Fried [11] with minor corrections.

We restrict ourselves to the case of one level set function $\phi: \Omega \to \mathbb{R}$ in this section and want to partition a two-dimensional image $I: \Omega \to \mathbb{R}$ with $\Omega := [-1,1]^2$ into the two segments

$$\Omega_0 = \{x \in \Omega \mid \phi(x) < 0\} \quad\text{and}\quad \Omega_1 = \{x \in \Omega \mid \phi(x) > 0\}$$

with a piecewise constant approximation $u: \Omega \to \mathbb{R}$,

$$u(x) = \begin{cases} c_0, & x \in \Omega_0, \\ c_1, & x \in \Omega_1. \end{cases}$$


  • 5 Numerical Results

We therefore have to solve the following evolution equations (cf. equation (2.28)):

$$\frac{\partial_t \phi}{\delta_\epsilon(\phi)} - \mu\,\nabla\cdot\!\left(\frac{\nabla\phi}{|\nabla\phi|}\right) = -\sum_{i=0}^{N_S-1} f_i\,\chi_{i,\phi}(\phi) \quad \text{in } \Omega\times(0,T], \qquad (5.1)$$

$$\frac{\delta_\epsilon(\phi)}{|\nabla\phi|}\,\partial_\nu\phi = 0 \quad \text{on } \partial\Omega\times(0,T],$$

$$\phi(\cdot,0) = \phi_0(\cdot) \quad \text{in } \Omega.$$

In the case of only one level set function, the right hand side simplifies to

$$-\sum_{i=0}^{1} f_i\,\chi_{i,\phi}(\phi) = \lambda\left((c_0 - I)^2 - (c_1 - I)^2\right).$$

We now turn to a very special case and assume that the initial data $\phi_0$ as well as the given image $I$ only depend on $x_1$. Then $\phi_0$ has straight isolines with curvature $\nabla\cdot\!\left(\frac{\nabla\phi_0}{|\nabla\phi_0|}\right) = 0$. We furthermore restrict ourselves to solutions $\phi(x_1,t)$ depending only on $x_1$ in space and exhibiting a non-vanishing gradient $\nabla\phi(x_1,t) \neq 0$ for all $t \in [0,T]$. The curvature of the isolines of such solutions vanishes analogously, and (5.1) reads:

$$\frac{\partial_t \phi}{\delta_\epsilon(\phi)} = \lambda\left((c_0 - I)^2 - (c_1 - I)^2\right) \quad \text{in } \Omega\times(0,T]. \qquad (5.2)$$

If we fix the parameter $\epsilon = 1$, we obtain, with the definition of the regularized delta function from (2.16):

$$\partial_t\phi\,\left(1 + \phi^2\right) = \lambda\left((c_0 - I)^2 - (c_1 - I)^2\right) \quad \text{in } \Omega\times(0,T]. \qquad (5.3)$$

With $f := \lambda\left((c_0 - I)^2 - (c_1 - I)^2\right)$, equation (5.3) reads:

$$\partial_t\phi\,\left(1 + \phi^2\right) = f \quad \text{in } \Omega\times(0,T]. \qquad (5.4)$$

We now require that the zero isoline of $\phi$, and thus the segments $\Omega_0$ and $\Omega_1$, do not change over time. Then $f$ depends neither on the level set function $\phi$ nor on the time $t$, and the mean values $c_0$ and $c_1$ are constants.

Under the above assumptions, equation (5.4) is an ordinary differential equation for each fixed $x_1 \in [-1,1]$: integrating (5.4) in time yields the cubic equation $\phi^3 + 3\phi = 3tf + \phi_0^3 + 3\phi_0$, whose unique real root is

$$\phi(t) = \frac{\left(A(t)\right)^{\frac13}}{2} - \frac{2}{\left(A(t)\right)^{\frac13}} \qquad (5.5)$$

with

$$A(t) = 4\left(3tf + \phi_0^3 + 3\phi_0 + \sqrt{4 + 9t^2f^2 + 6tf\left(\phi_0^3 + 3\phi_0\right) + \left(\phi_0^3 + 3\phi_0\right)^2}\right). \qquad (5.6)$$



(a) The given image $I$. (b) The initial level set function $\phi_0$ and its zero isoline.

(c) The discrete solution of the level set function $\phi_{h_7}$ and the zero isoline at $t = 1$, computed with triangulation $T_{h_7}$. (d) The segmented image $u$ remains unchanged over time because the sign of the level set function $\phi_h$ does not depend on time.

Figure 5.1: Computation of the experimental order of convergence in the case of a known solution



Let us now consider suitable initial conditions and an image $I$ we are able to use with Image. We define the original image as a grayscale image consisting of four stripes,

$$I(x_1, x_2) := \begin{cases} 0, & -1 \le x_1 < -0.5,\\ 0.25, & -0.5 \le x_1 < 0,\\ 0.75, & 0 \le x_1 < 0.5,\\ 1, & 0.5 \le x_1 \le 1, \end{cases}$$

and the initial level set function by

$$\phi_0(x_1, x_2) := 0.3\,\sin\!\left(\frac{\pi}{2}\, x_1\right),$$

which both fulfill the above assumptions. Figure 5.1 shows the image $I$ and the initial level set function $\phi_0$.

In order to compute the experimental order of convergence, we numerically compute

the discrete solution $\phi_h$ with Dirichlet boundary conditions $\phi_h|_{\partial\Omega} = \phi$ on a series $\{T_{h_j}\}_j$ of globally refined triangulations with mesh sizes $h_j = \frac{h_{j-1}}{2}$. We compute the errors in the $L^2$ norm in space and the $L^\infty$ norm in time,

$$\mathrm{err}_j = \sup_{t\in[0,T]}\left(\int_\Omega \left(\phi - \phi_{h_j}\right)^2\right)^{\frac12} = \left\|\phi - \phi_{h_j}\right\|_{L^\infty,L^2},$$

for each triangulation $T_{h_j}$. The experimental order of convergence then is

$$\mathrm{EOC}_j = \frac{\ln\left(\mathrm{err}_j / \mathrm{err}_{j+1}\right)}{\ln(2)}.$$

For example, the errors of refinement levels 4 and 5 in table 5.3 give $\ln(3.025\cdot 10^{-2} / 2.038\cdot 10^{-2}) / \ln(2) \approx 0.570$, the value reported for level 5.

The parameters used for the computations are listed in table 5.2.

Parameter                                      Value
Time step size τ                               h_j^2
End time T                                     0.5
Heaviside regularization ε                     1.0
Curvature weight μ                             1.0
Curvature regularization                       1.0 · 10^-8
Right hand side weight λ                       1.0
Subdomain solver tolerance tol_sub             1.0 · 10^-12
Schur complement solver tolerance tol_schur    1.0 · 10^-8

Table 5.2: Parameters for the computation of the experimental order of convergence

The computations have been run twice, once with 16 and once with 256 processors on the Woodcrest Cluster, to be able to observe potential disparities. We describe the computer cluster in detail in the second part, which concentrates on parallelization. The results for the experimental order of convergence are presented in table 5.3.



                                16 CPUs                          256 CPUs
j    h_j             err_j            EOC_j            err_j            EOC_j
3    2.5 · 10^-1     4.330 · 10^-2    –                –                –
4    1.25 · 10^-1    3.025 · 10^-2    5.173 · 10^-1    3.025 · 10^-2    –
5    6.25 · 10^-2    2.038 · 10^-2    5.698 · 10^-1    2.038 · 10^-2    5.698 · 10^-1
6    3.125 · 10^-2   1.390 · 10^-2    5.512 · 10^-1    1.390 · 10^-2    5.512 · 10^-1
7    1.562 · 10^-2   9.668 · 10^-3    5.245 · 10^-1    9.668 · 10^-3    5.245 · 10^-1
8    7.812 · 10^-3   6.805 · 10^-3    5.066 · 10^-1    6.805 · 10^-3    5.066 · 10^-1

Table 5.3: Experimental order of convergence (err_j = ‖φ − φ_{h_j}‖_{L∞,L2}). Note that we have not been able to compute the error for refinement level 3 on 256 CPUs because the triangulation consists of exactly 256 simplices in this case and ParMetis did not supply every worker process with a simplex, which our implementation requires.

The experimental order of convergence stabilizes around 1/2, which is the same result Fried obtained in [11]. Our segmentation algorithm with domain decomposition parallelization is thus able to reproduce the solutions of the original serial version of the code. Furthermore, there are no differences between the computations performed using 16 and 256 processors.

Note 5.1.1. In addition, we verified the correctness of the domain decomposition code with computations of the experimental order of convergence for the heat equation and mean curvature flow.

    5.1.2 Artificial Images

In order to gain more insight into the segmentation algorithm, we present some synthetic images and their segmentations. No exact solution is known for these model problems, but the images exhibit distinctive details one wishes to find in the segmentation. These examples demonstrate characteristic features of the Chan-Vese segmentation model and the underlying Mumford-Shah energy functional.

In contrast to the computations performed for the previous section, we use locally refined triangulations from now on. The $L^2$ interpolation error between the original image and its finite element representation is computed for each simplex in the triangulation. ALBERTA's built-in refinement routines then refine the grid based on the computed errors. The process is iterated until the error falls below a prescribed bound or a maximal refinement depth is reached (see the sketch below). This method allows more precise calculations in areas where the image exhibits many details while keeping the computational costs down by not introducing new degrees of freedom in regions with homogeneous image data. Figure 5.4(b) shows a locally refined mesh for a checkerboard image. The mesh adaption for multiple channels is accomplished by computing the $L^2$ error for each channel and using the arithmetic mean.
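The control flow of this adaption loop can be summarized by the following toy program; it is our illustration only, not ALBERTA code, and the halving of the error merely models the effect of one bisection step (real refinement also creates new elements):

#include <stdio.h>

#define N_EL 8   /* illustrative number of elements */

int main(void)
{
    double err[N_EL]    = { 0.9, 0.4, 0.2, 0.7, 0.05, 0.3, 0.8, 0.1 };
    const double bound  = 0.1;   /* prescribed error bound       */
    const int max_depth = 10;    /* maximal refinement depth     */

    for (int depth = 0; depth < max_depth; depth++) {
        int refined = 0;
        for (int e = 0; e < N_EL; e++) {
            if (err[e] > bound) {   /* mark element e for refinement   */
                err[e] *= 0.5;      /* model: bisection halves the error */
                refined = 1;
            }
        }
        if (!refined)
            break;                  /* all errors below the bound */
    }

    for (int e = 0; e < N_EL; e++)
        printf("element %d: final error %.4f\n", e, err[e]);
    return 0;
}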

The level set functions are initialized such that the zero isolines form circles. The first picture in figure 5.6 gives an impression of the initial level set function. Note that the zero isolines cannot form good approximations to circles in the corners due to the very coarse mesh in these regions of the example.

    Checkerboard

    Figure 5.4(a) shows the original checkerboard image and a corresponding adapted mesh.

(a) Original checkerboard image (b) Locally refined mesh adapted to the original image

Figure 5.4: Checkerboard image and mesh

We want to detect the four squares, which can be accomplished by employing just one level set function because the image only consists of two colors. The chosen parameters are listed in table 5.5.

Parameter                                      Value
Time step size τ                               1.0 · 10^-2
End time T                                     0.28
Heaviside regularization ε                     1.0
Curvature weight μ                             1.0 · 10^-2
Curvature regularization                       1.0 · 10^-8
Right hand side weight λ                       255.0
Subdomain solver tolerance tol_sub             1.0 · 10^-12
Schur complement solver tolerance tol_schur    1.0 · 10^-8

Table 5.5: Parameters for the checkerboard segmentation

Figure 5.6 shows the evolution of the level set function's zero isoline and the resulting segmented image at three time steps. The stationary state with respect to the interface and the induced segmentation was reached after 28 time steps, and the interface precisely matches the dividing lines between the black and white areas.



Figure 5.6: Three steps (t0 = 0, t1 = 0.14 and t2 = 0.28) of the segmentation evolution for the checkerboard image. The upper row shows the original image with the interface; the lower row shows the corresponding segmented images.

    Grayscale Gradient

We now turn to a more interesting scenario with a grayscale gradient in figure 5.7(a). The image consists of more than two color levels, and even for humans it is not clear where exactly the interface should be placed in the fading right part. Nevertheless, we expect a sane segmentation algorithm to recognize the circle's left boundary reliably.

We used exactly the same parameters as in the previous experiment and obtained the results depicted in figures 5.7(c) and 5.7(d). The interface front immediately moved to the hard line on the left side and stabilized in the fading part on the right side. The experiment was repeated with different parameters μ and λ. The resulting segmented images differed only marginally from the presented one. For a higher curvature parameter μ we obtained a slightly rounded interface where the interface leaves the full circle's boundary.



    (a) Original image (b) Mesh after adaption

    (c) Interface at t = 0.28 (d) Segmented image at t = 0.28

    Figure 5.7: Segmentation of a fading circle



    5.1.3 Real World Images

    Multiple Channels

In this experiment, we demonstrate the detection of objects in images consisting of multiple channels. Figure 5.8 shows a photograph of a road sign, given by three real-valued channels: red, green and blue.

    Figure 5.8: Original image: Australian wombat road sign

We wish to detect the yellow sign and the black wombat symbol on it. The background consists of a clear blue sky and very fine structures of trees. We started off with one level set function and added another one after ten time steps in order to be able to detect four segments (sign, symbol, trees and sky). The time step size and end time have been raised to τ = 0.1 and T = 2.0. The weight of the right hand side has been set to λ = 2550.0 in order to force the approximation closer to the original image. We computed segmentations for two different choices of the curvature weight parameter. In the first run the parameter was set to μ1 = 0.01 as in the examples before. The second run was done using μ2 = 0.1, thus penalizing irregular and long segment interfaces.

Figure 5.9 shows the results of the computations. The sky was separated from the sign and the forest by the first level set function in both cases. The second level set function then evolved to detect the symbol and parts of the trees in the background. Note that the higher weight of the curvature term effectively resulted in smoother segment boundaries. Especially the fine structures of the trees in the background were combined to form bigger areas with less detail. The small interruption of the black line surrounding the sign (below the wombat's head) was ignored by the segmentation algorithm for both choices of μ, and the line appears continuous in the segmentations.



(a) μ1 = 0.01

(b) μ2 = 0.1

Figure 5.9: Segmented road sign image using different curvature weights μ. In every row the left image shows the result at t = 1.0 before adding the second level set function; the right one shows the segmentation at t = 2.0 with two level set functions and thus four colors.



    Large-Scale Image

The next example is a high-resolution photograph consisting of 2000 × 2000 pixels.

    Figure 5.10: Original image: Coast with rocks

We now also wish to find finer structures appearing in the water and on the rocks. Therefore, the curvature parameter has been lowered to μ = 0.001 to allow more irregular and longer segment boundaries. We started off with one level set function and successively added new ones every 15 time steps. Before adding a fourth level set function after 45 time steps, we stopped the computation, thus obtaining 2³ = 8 segments with different colors. We reduced the L² interpolation error between the original image and its finite element representation to 0.045 by refining the mesh heavily. We ended up with a very fine mesh consisting of 1,423,026 simplices inducing 712,382 degrees of freedom for the linear Lagrange elements used. The remaining parameters have been left unchanged.

Figure 5.11 shows the final segmentation. As intended with the lowering of the curvature parameter, the segmentation depicts finer details in the lower part of the image while keeping the segments representing the sky quite smooth.

A major problem arising with the computation of segmentations for large-scale image data comprising many details is the enormous time and memory consumption, bringing single computer systems to or even beyond their limits. The key to the computation of such segmentations on very fine meshes for high-resolution images is parallelization, which is the subject of the next section.



    Figure 5.11: Segmentation into 8 segments

    5.2 Parallel Performance

The segmentation of high-resolution multi-channel datasets exhibiting many details is of interest in many fields like medical image processing or the analysis of microscope and satellite scans. This section is devoted to the runtime analysis of the Schur complement domain decomposition method in combination with the segmentation algorithm.

    5.2.1 Computation Environments

Development and testing were done on several computer systems ranging from conventional consumer computers to high-performance compute clusters. Here, we present results computed on the Woodcrest Cluster woody, which is installed at the computing center of the University of Erlangen-Nurnberg (RRZE). The Woodcrest Cluster is a distributed-memory platform consisting of:

- 217 compute nodes, each with two dual-core Intel Xeon 5160 "Woodcrest" CPUs (3.0 GHz, 4 MB shared level 2 cache), 868 CPU cores in total
- 8 GB of RAM per node
- an InfiniBand switched fabric network for MPI communication between nodes and for input/output operations

We have chosen the distributed-memory approach along with MPI communication for our implementation (cf. section 4.2); thus woody exactly meets the requirements of our code. In order to get maximal performance out of the woody machine, we employed the following compilers and libraries for Image and ALBERTA:



- Intel C compiler 10.1
- Intel Math Kernel Library 9.0 (MKL), providing BLAS routines for ALBERTA
- Intel MPI Library 3.1 for MPI-2 communication
- Intel Trace Analyzer and Collector 7.1 (ITAC) for detailed parallel profiling

Since the code conforms to the C99 and MPI-2 standards, we are not restricted to any of the above software, and the code runs fine with other compilers. For example, the open-source GNU compiler collection in combination with MPICH or OpenMPI has been used extensively for testing. The application furthermore behaves comparably on other hardware, e.g. with AMD CPUs, and also runs on standard off-the-shelf multi-core machines. But since our algorithms aim at the solution of very large systems with numerous processors, we will stick to the high-performance compute cluster woody in this work.

    5.2.2 Scalability Benchmarks

Gaining insight into the runtime behavior of a parallel application like Image is challenging, since concurrency and communication between processes add a new level of complexity. Timing in a parallel application depends even more on activities of the operating system than in a serial setting. A very short delay in one process may cause all other processes to wait and thus heavily affects parallel efficiency. We can prevent many potential sources of delays inside the application, but we usually cannot influence interruptions coming from the operating system. We shall now briefly describe which measurements have been used for our benchmarks.

Different measures for timing an application exist:

- user time is the time the CPU spent executing actual application code
- system time is the time the CPU spent executing operating system code, e.g. for I/O and networking
- real time is the elapsed wall clock time

For applications performing only computations and very little or no input/output, usually only the user time is considered. In our parallel application, we are interested in the overall runtime, which explicitly includes waiting times. The structure of our code is centralized because the master process controls the iterative solving process, as outlined in detail in chapter 4. We therefore always measured the real time elapsed in the master process from the first to the last time step. Initialization and post-processing, like writing resulting images to files, is not of interest for the efficiency of the domain decomposition method. However, the assembly of the matrices and right hand sides was taken into account, as this is part of the parallel algorithm.
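Such a measurement can be taken with MPI's portable wall clock timer MPI_Wtime; the following minimal program is our illustration of the pattern, not the actual Image code:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();     /* wall clock at the first time step */

    /* ... assembly, Schur complement solving and backward substitution
       for all time steps would run here ... */

    double R = MPI_Wtime() - t0; /* elapsed real time R_n */
    if (rank == 0)
        printf("real time in the master process: %f s\n", R);

    MPI_Finalize();
    return 0;
}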

Because our implementation is designed with a governing master process, we are not able to compute with only one process. The implementation furthermore requires a non-empty interface and thus a minimum of 3 processes. Because each node of the woody cluster combines 4 CPUs, we performed all of our experiments on full nodes. So let $n \in \mathbb{N}$ be the number of used compute nodes. Then the $p = 4n$ CPUs are assigned to one master and $p - 1$ worker processes.

Let $R_n$ be the real execution time of the solving process with $n$ nodes. We define the

relative speedup by

$$S_n := \frac{R_1}{R_n}.$$

The efficiency then is defined by

$$E_n := \frac{S_n}{n}.$$

For example, with the timings from table 5.13, $S_2 = 47.68\,\mathrm{s} / 14.42\,\mathrm{s} \approx 3.30$ and $E_2 = 3.30 / 2 = 1.65$.

An efficiency close to one indicates an ideal utilization of the processors (linear speedup). Values above one may also occur, for example in the following situations:

- when vectors entirely fit into the processors' caches
- if the interface I separating the subdomains happens to induce a Schur complement system which the Conjugate Gradient method is able to solve faster
- if the partitioning results in a better load balancing

On the other hand, we expect the absolute speedup, referring to the execution time of a corresponding serial implementation, to be below one for very low numbers of CPUs because of the communication and management overhead of the domain decomposition implementation.

Let us keep these considerations in mind and turn to benchmarks in the following two sections.

Small-Sized Problem

The described domain decomposition method and its implementation certainly aim at the solution of large-scale problems with respect to the spatial discretization. Nevertheless, we will show the characteristics of our implementation when applied to a small-sized problem.

We start off with the segmentation of the coastline (figure 5.10) with all parameters except for the mesh refinement set as above. The refinement process was stopped earlier to obtain a coarser mesh. In order to show the correlation between high local detail density and a locally refined mesh, figure 5.12(a) shows the mesh as an overlay on the original image. Figure 5.12(b) presents a partitioning produced with the help of ParMetis.

The adapted mesh clearly shows coarse areas in the upper right part and fine structures in the center and at the bottom. Note that the partitioning is based not on the geometrical size but on the number of simplices. For example, the upper right partition (red) covers a larger area than the one in the lower left corner (blue). Building the dual graph and partitioning the mesh with ParMetis took about 80 milliseconds in this experiment.

Besides the timing information, we also captured valuable data like the condition numbers of the Schur complement matrices and the number of iterations needed by the Schur complement CG solver. For the sake of clarity, we will only provide these additional data for the first time step of the computation. The timing, however, was measured for 10



(a) Original image and locally refined mesh (b) Partitioning into 7 subdomains for the use with 8 processors

    Figure 5.12: Mesh refinement and partitioning

time steps in order to equilibrate timing inaccuracies (e.g. caused by operating system jitter) and to obtain representative data for the complete evolution of a level set function until the stationary state of the zero isoline. Table 5.13 shows the benchmark data of the computation for a small mesh consisting of 121,335 simplices yielding 60,929 global degrees of freedom.

Nodes (CPUs)   N_P      N_I     κ(S)    CG iterations   time R_p [s]   speedup S_p   efficiency E_p
serial         –        –       –       –               40.72          –             –
1 (4)          20,204   318     130.1   41              47.68          1.00          1.00
2 (8)          8,605    692     338.7   59              14.42          3.30          1.65
4 (16)         3,984    1,174   421.8   75              8.34           5.71          1.42
8 (32