Fault Location Using Wide-Area Measurements and Sparse
Estimation
A Thesis Presented
by
Guangyu Feng
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Master of Science
in
Electrical and Computer Engineering
Northeastern University
Boston, Massachusetts
November 2015
NORTHEASTERN UNIVERSITY
Graduate School of Engineering
Thesis Signature Page
Thesis Title: Fault Location Using Wide-Area Measurements and Sparse Estimation
Author: Guangyu Feng NUID: 001916249
Department: Electrical and Computer Engineering
Approved for Thesis Requirements of the Master of Science Degree
Thesis Advisor
Dr. Ali Abur    Signature    Date
Thesis Committee Member or Reader
Dr. Hanoch Lev-Ari    Signature    Date
Thesis Committee Member or Reader
Dr. Bahram Shafai    Signature    Date
Department Chair
Dr. Sheila S. Hemami    Signature    Date
Associate Dean of Graduate School:
Dr. Sara Wadia-Fascetti    Signature    Date
To my parents.
Contents
List of Figures v
List of Tables vi
List of Acronyms vii
Acknowledgments viii
Abstract of the Thesis ix
1 Introduction 1
  1.1 Motivation of the Study 2
  1.2 Contribution of the Thesis 2
  1.3 Thesis Outline 3
2 Background 4
  2.1 State of the Art of Fault Location Methods 4
    2.1.1 Traveling Wave-based Methods 4
    2.1.2 Impedance-based Methods 5
    2.1.3 AI-based Methods 5
  2.2 Technical Background 5
    2.2.1 Wide Area Measurement System 5
    2.2.2 Sparse Estimation Techniques 6
3 Basic Methodology 7
  3.1 Transformation of Fault Location Problem 7
    3.1.1 Derivation of Equivalent Injection 7
    3.1.2 Determination of Fault Type and Fault Line Type 11
    3.1.3 Simulation Verification of Equivalent Injection Theory 12
    3.1.4 Building the Linear Estimation Equations 13
  3.2 Sparse Estimation Algorithm 15
  3.3 Measurement Placement for Unique Solution 16
4 Alterations Improving the Method 19
  4.1 Overlapping Group Lasso 19
  4.2 Second Stage OLS Estimation 20
    4.2.1 Criteria for Valid Solution 20
    4.2.2 OLS Backup Estimation Routine 20
  4.3 Extended Robust Formulation and Correction Routine 21
    4.3.1 Extended Robust Formulation 21
    4.3.2 Recursive Solution Routine with Correction of Erroneous Measurement(s) 21
  4.4 Parallel Implementation 22
    4.4.1 Problem Decomposition 22
    4.4.2 Parallelism Attempted 24
5 Simulation Results 27
  5.1 Faults in the Unbalanced Test Case 28
  5.2 Wide-area Cases 29
    5.2.1 Teed Circuit Fault 29
    5.2.2 Comparison of Basic LARS-Lasso and Overlapping Group Lasso 30
    5.2.3 Robustness of Extended LARS-Lasso Formulation Against Gross Error 33
  5.3 Parallel Implementation Results 34
    5.3.1 Sequential Run Time Benchmark 34
    5.3.2 Speed-up Using openMP 34
    5.3.3 Speed-up Using GPU 35
6 Conclusion 37
Bibliography 39
A Appendix 45
  A.1 Matlab Code Solving Lasso Problem 46
  A.2 Parallel C Code using openMP 51
  A.3 CUDA C Code for GPU implementation 81
List of Figures
3.1 Illustration of a fault on a two terminal line. 9
3.2 Illustration of a fault on a two terminal line with parallel branch. 9
3.3 Illustration of a fault on teed line. 10
3.4 Identification of fault type and faulted line type through inspecting equivalent injections. 11
3.5 11-bus distribution system diagram in ATP-EMTP. 12
3.6 Real part of equivalent injections for the 11-bus unbalanced case. 13
3.7 Imaginary part of equivalent injections for the 11-bus unbalanced case. 13
4.1 Flowchart for the recursive estimation 22
4.2 Three Scenarios of Data Distribution 23
4.3 Example of Column Block Distribution 24
4.4 Development process of different implementations 25
5.1 Computation Times Used by Basic LARS-Lasso and Overlapping Group Lasso. 31
5.2 Number of Iterations by Basic LARS-Lasso and Overlapping Group Lasso. 31
5.3 Error of Fault Distance by Basic LARS-Lasso and Overlapping Group Lasso. 32
5.4 Number of OLS Backup Estimations by Basic LARS-Lasso and Overlapping Group Lasso. 32
5.5 Comparison of Fault Distance Error under One or Two Unit Failures 33
5.6 Comparison of Speed-ups for Different Cases using openMP 35
List of Tables
5.1 Relative Error of Fault Distance for the Unbalanced 11-Bus Case 28
5.2 Platform Information 34
5.3 Sequential Run Times 34
5.4 Summary of GPU run times 35
5.5 Summary of Profiling Results 36
List of Acronyms
AI Artificial Intelligence.
WAMS Wide Area Measurement System.
PMU Phasor Measurement Unit.
Lasso Least Absolute Shrinkage and Selection Operator.
The Lasso is a shrinkage and selection method for linear regression. It minimizes the usual sum of squared errors, with a bound on the sum of the absolute values of the coefficients. It has connections to soft-thresholding of wavelet coefficients, forward stagewise regression, and boosting methods.
LARS Least Angle Regression. The “S” suggests “Lasso” and “Stagewise”.
OLS Ordinary Least Squares.
openMP Open Multi-Processing.
MPI Message Passing Interface.
GPU Graphics Processing Unit.
CUDA Compute Unified Device Architecture.
Acknowledgments
Here I wish to thank those who have supported me during and before the thesis work. Among them, I am especially grateful to Dr. Ali Abur, who has shared with me valuable research guidance and wisdom as well as unique insights about problem solving and debugging. I consider it a privilege to have taken the power system analysis and unbalanced power grid courses offered by Dr. Abur, since these courses not only made my knowledge of power systems more comprehensive but also left me with notes and code that are still of great help to my research. In addition, I very much appreciate the chances Dr. Abur gave me to publish and present my work, even when I was not feeling very confident about my results.
I would also like to thank Dr. Lev-Ari and Dr. Shafai on my committee, as well as Dr. Tomsovic from the University of Tennessee and Dr. Leeser, from whom I have taken important courses on signal processing, linear systems and control, constrained optimization and high performance computing. Besides enhancing my basic engineering skills, all these courses have contributed directly or indirectly to my research.
Furthermore, I really appreciate the encouragement and help from my lab mates (and former lab mates), especially Mert, Murat, Alireza, Chenxi, Pengxiang and Yuzhang. I wish them all the best in their lives and careers.
Last but not least, I cannot thank my parents enough for supporting me remotely, both financially and spiritually, over the past years. They always stand behind me unconditionally no matter what choice I make.
Abstract of the Thesis
Fault Location Using Wide-Area Measurements and Sparse Estimation
by
Guangyu Feng
Master of Science in Power System
Northeastern University, November 2015
Dr. Ali Abur, Adviser
This thesis describes a fault location method which relies on scattered wide-area synchronized phasor measurements and on sparse estimation techniques. The main contribution is the way fault location is reformulated as a sparse estimation problem. The faulted system is modeled by equivalent terminal bus injections which would cause the same changes in bus voltages as the fault current drawn at any point along the faulted line. Once these injections are estimated, the fact that the ratio between the equivalent injections at the two terminal buses depends only on the ratio of the series impedances on each side of the fault point can be used to locate the fault. It is shown that this formulation applies to two terminal lines as well as teed lines, regardless of fault type or resistance. Assuming availability of an accurate three-phase network model and a sufficient number of phasor measurements over the entire network, an underdetermined set of linear equations can be formed and then solved for the sparse equivalent bus injections. The problem then fits naturally into a Least Absolute Shrinkage and Selection Operator (Lasso) formulation and can be solved via the Least Angle Regression (LARS) algorithm. Based on the condition for a unique solution of the Lasso problem, a scheme for optimal phasor measurement placement is also derived. Furthermore, alterations have been made to the basic implementation of LARS so that the method's reliability, robustness and efficiency are improved.
Using extensive simulations on both an unbalanced three-phase test case and a wide-area network test case, the accuracy and efficiency of the proposed method with its alterations have been verified.
Chapter 1
Introduction
Power transmission lines are exposed to faults caused by natural or artificial factors such as lightning, short circuits, faulty equipment, mis-operation, overload and aging. Faults can be roughly categorized as open circuit faults and short circuit faults. While open circuit faults are generally rare, short circuit faults are more common in power systems and attract wide-spread attention. Faults are also divided into temporary and permanent ones. Temporary faults are known to be able to self-clear and are more common on power lines. For these faults, accurately locating the fault reveals the weak point on the line so that future faults can be prevented by repair. Permanent faults, on the other hand, cause major damage and require immediate action from the crew to return the faulted line to service. Untimely or inaccurate fault location can result in prolonged line outages as well as severe economic losses and reduced reliability of the service. Thus accurate and timely fault location has always been a critical issue in power system operation [1].
When a fault occurs on a transmission line, distance relays can provide some indication of the general area where the fault is, but they are designed to provide on-line protection as fast as possible rather than to pinpoint the fault location. The distance to the fault is instead estimated off-line from the recorded data. Depending on the types of data and models available to researchers, various methods have been proposed so far, most of them relying on local data from one or both ends of the faulted transmission line.
CHAPTER 1. INTRODUCTION
1.1 Motivation of the Study
While there are well-established fault locators using local measurements, chances are that local measurement units fail at the very time the fault occurs. In this regard, this work turns to wide-area measurements for a more reliable fault location method. Recent years have seen increasing deployment of synchronized phasor measurements in power systems, and of applications based on the Wide Area Measurement System (WAMS) such as wide-area control and linear state estimation. However, the benefit of WAMS for fault location has not yet been fully explored, since power grid protection systems have long been designed with local measurements as the decision variables.
Furthermore, previous formulations of the fault location problem often need prior information about the fault type or the faulted line to determine how the fault can be located. This adds to the purpose of this work: to derive a fault location method of universal applicability and straightforward solution regardless of fault type and faulted line type.
In formulating the proposed fault location method, advances in sparse estimation techniques are also taken into account, so that the fault location problem can be tackled with new mathematical tools, contributing to the diversity of the pool of fault locators.
1.2 Contribution of the Thesis
This work contributes to current fault location practice in the following aspects:
• Non-local, in other words wide-area, phasor measurements are used to locate the fault, making it possible to locate the fault even when local measurement units fail or break down.
• The fault location problem is transformed into a linear sparse estimation problem of general form, regardless of fault type and faulted line type, enabling the use of sparse estimation algorithms.
• Only partial observability of the system is required to ensure a unique fault location solution.
• Alterations to the basic sparse estimation implementation are made to improve the reliability and robustness of the method.
• In addition to the sequential implementation, parallel implementations of the proposed method are investigated and realized, improving the method's efficiency.
1.3 Thesis Outline
This thesis is composed of six chapters. The second chapter reviews past literature on fault location and provides technical background for the proposed method. The third chapter introduces the formulation and the basic proposed methodology. Chapter 4 then presents several alterations made to the basic method to obtain better performance in terms of accuracy, robustness and efficiency. Chapter 5 describes the test results addressing the universal applicability and performance of the proposed method. Finally, chapter 6 concludes the thesis and introduces future work.
Chapter 2
Background
2.1 State of the Art of Fault Location Methods
Fault location methods currently used for transmission system faults can be broadly classified as (1) traveling wave-based methods, (2) impedance-based methods and (3) Artificial Intelligence (AI)-based methods.
2.1.1 Traveling Wave-based Methods
Traveling wave-based methods [2–6] make use of the high frequency (several kHz to MHz) electromagnetic transients induced by the fault to determine the time of arrival of the traveling wave from the fault point. Modal domain wave velocities then need to be calculated in order to obtain the distance from the fault point to the measuring device where the traveling wave is recorded. While, thanks to the high frequency sampling, traveling wave based methods are independent of fault type, the wave velocities cannot always be accurately defined given their variation with frequency and tower configuration. In addition, there is usually a non-negligible cost associated with the required high-frequency sampling of the waveforms.
CHAPTER 2. BACKGROUND
2.1.2 Impedance-based Methods
Impedance-based methods use phasors to locate the fault. There is a large volume of papers describing such methods using single-ended or double-ended measurements (e.g. [7–10]). These methods require at least one terminal of the faulted line to be equipped with a monitoring device such as a digital fault recorder or another type of IED (intelligent electronic device) [11]. There are also multi-end algorithms [9, 12–17] that use measurements from multiple ends of the transmission line. Algorithms requiring measurements from more than one end can be further divided into those using synchronized and those using unsynchronized measurements. As stated in Sec. 1.1, many impedance methods require prior assumptions or knowledge about the faulted section as well as the fault type to determine the details of their algorithm, making them harder to implement.
2.1.3 AI-based Methods
AI-based methods [18–22] use artificial intelligence techniques such as pattern recognition and machine learning algorithms. Unlike impedance or traveling wave based methods, they do not rely on physical relationships between the measurements and the fault location, but rather look for underlying connections between certain features of the data and the fault. As the configuration of the system evolves, these methods often need to be retrained to adjust to the system.
2.2 Technical Background
2.2.1 Wide Area Measurement System
Recent years have seen major changes in power system operation due to the deployment of WAMS. Many applications, mostly regarding monitoring and control actions, have been improved. Yet it has not been clear how power system protection or fault location can benefit from this technical advance.
Recently, several studies have proposed the use of wide-area phasor measurements, locating the fault not just from single or double-ended measurements but from several measurements taken at various locations over the whole system. The work in [23] derives a fault location factor as a function of fault distance for every bus. The homogeneity of this factor over a network is used to identify first the fault region and then the exact location. In [24], the expected changes of phasors due to faults at different points of a network are matched with the measured phasors using an optimization scheme. Similar to [23], its diagnosis process is also hierarchical.
2.2.2 Sparse Estimation Techniques
Another important technical basis for this work is the use of sparse estimation techniques in power systems. In fact, such techniques have been used by various researchers for the detection and identification of faults. One recent work reported in [25] uses several compressive sensing algorithms to pinpoint the location of a faulted bus. In [26] the authors employ the greedy orthogonal matching pursuit (OMP) method and the least absolute shrinkage and selection operator (Lasso) to identify line outages at affordable complexity. In another study [27] the authors derive an algorithm to detect and localize malicious data attacks using L1 regularization for the generalized likelihood ratio test.
Chapter 3
Basic Methodology
3.1 Transformation of Fault Location Problem
3.1.1 Derivation of Equivalent Injection
The proposed method aims to use the pre-fault steady state model of a power system and to represent a fault, or multiple faults of any type on any combination of phases, by current injections at the fault points, so as to keep the system model consistent. To discover how a fault can be identified using the pre-fault system model, a classical fault analysis is conducted. Consider a single phase to ground fault in an N-bus system. First, an extended bus impedance matrix Z̃bus is built with the fault point as the (N + 1)th node. The change of bus voltages due to the fault is computed by multiplying the (N + 1)th column of Z̃bus by the fault current If. Then, using (3.1), where Ybus and ∆V represent the bus admittance matrix and the change in bus voltages of the original N-bus system, a sparse ∆I is derived; these sparse bus current injections can be regarded as equivalent to the fault current injection in terms of their effect on bus voltages.
Ybus∆V = ∆I (3.1)
The result unveils an interesting feature that the nonzero entries in ∆I only exist in the terminal
CHAPTER 3. BASIC METHODOLOGY
buses of the faulted line, and that their ratio is determined solely by the series impedances of the branches incident to the fault point. If the faulted line is homogeneous, the distance from the fault point to each terminal can be readily calculated from the ratio of the equivalent injections, which is a real number. Otherwise it takes some effort to look up the line parameters to translate the ratio of injections into a ratio of distances, which is not within the scope of this thesis. Note that the aforementioned feature does not require the system to be balanced, nor does it require the shunt capacitances to be neglected; the only prerequisite is accurate pi-models of the lines and load data at the time of the fault. Since the bus impedance matrix of any system during the fault instant can be seen as a constant matrix, superposition can be applied for multi-phase faults and multiple faults in the system; their equivalent bus injections will then appear in multiple phases or at multiple terminal buses instead of in a single pair as in the single phase to ground case. It has to be clarified that although the equivalent injection representation of faults has been built from scratch by the author, it also appears in the literature, e.g. [28], where it has been utilized in a different formulation.
To support the general applicability of the proposed method, detailed demonstrations of the equivalent injections are given separately for two and three terminal lines.
3.1.1.1 Two terminal lines
For conventional two terminal lines, consider the scenario shown in Fig. 3.1, where the terminal buses are indexed A and B and the series line impedances from the fault point to the terminals are named z1 and z2 respectively. Here the single phase representation is used for the sake of simplicity. Following the derivation in Sec. 3.1.1, the injections equivalent to the fault current If are:
CHAPTER 3. BASIC METHODOLOGY
Figure 3.1: Illustration of a fault on a two terminal line.
IA = If z2 / (z1 + z2) (3.2)

IB = If z1 / (z1 + z2) (3.3)

IA / IB = z2 / z1 (3.4)
Figure 3.2: Illustration of a fault on a two terminal line with parallel branch.
As shown in (3.4), the ratio of the derived equivalent injections is the inverse of the ratio of the series impedances on each side of the fault point and has nothing to do with the rest of the network.
CHAPTER 3. BASIC METHODOLOGY
3.1.1.2 Teed lines
The representation of faults by equivalent bus injections applies not only to conventional two terminal lines, but also to three terminal (teed) lines, where the T-node is typically not monitored. In Fig. 3.3 the terminal buses are named A, B and C, and the T-node T. z1 and z2 are the impedances between the fault point and bus A and the T-node, respectively. The branch impedances of lines B − T and C − T are denoted by zBT and zCT.
Figure 3.3: Illustration of a fault on teed line.
Given a fault between A and T with a fault current If , the equivalent terminal injections will be
given by:
IA = If (1 − z1 (zBT + zCT) / (zAT zBT + zAT zCT + zBT zCT)) (3.5)

IB = If z1 zCT / (zAT zBT + zAT zCT + zBT zCT) (3.6)

IC = If z1 zBT / (zAT zBT + zAT zCT + zBT zCT) (3.7)

IB / IC = zCT / zBT (3.8)
Although the relationships between the injections seem more complicated than in (3.4), they are still determined only by the ratios of the series branch impedances of the faulted line. By comparing IB and IC, the line segment A − T can be readily identified as the faulted segment, and the distance to the fault point can then be determined by comparing either IA and IB or IA and IC.
3.1.2 Determination of Fault Type and Fault Line Type
From the derivations in 3.1.1 it can be concluded that the number and locations of the nodes with equivalent injections inherently indicate the type of the faulted line. When a multi-phase fault occurs, the pairs of equivalent injections on nodes of the same phase still exhibit the proportional relationships in (3.4) and (3.8). The only difference is that the sparsity of the computed current injection vector in the two and three phase fault cases will be respectively two and three times that of the single phase fault case. Based on Kirchhoff's current law, if all elements of the equivalent injections sum to zero, the fault is ungrounded; otherwise it is a grounded fault. Therefore there is no need to know the type of the faulted line or of the fault, or to adjust the approach, before computing these equivalent injections. Instead, they can be determined by simple inspection following the decision making diagram below.
[Decision diagram: two buses with equivalent injections indicate a two terminal line fault, three buses a teed line fault; a sparsity of 2/4/6 (3/6/9 for teed lines) indicates a single/two/three phase fault; a nonzero sum of equivalent injections indicates a grounded fault, a zero sum an ungrounded one.]
Figure 3.4: Identification of fault type and faulted line type through inspecting equivalent injections.
Since the faulted branch, whatever its type, is replaced by a current source branch, the feasibility of the equivalent injection representation and of the decision making diagram is not affected by the value of the fault resistance, as long as it is not so large that the fault current can be neglected. The proposed method is therefore independent of the fault resistance, which is not commonly seen in the field of fault location.
3.1.3 Simulation Verification of Equivalent Injection Theory
To add to the credibility of the equivalent injection theory, simulations have been run in ATP-EMTP for a fault in an unbalanced system (Fig. 3.5) modified from the 13-bus feeder case in [29]. All 11 buses and 10 line sections are marked on the diagram. A phase A and B to ground fault at the 1/4 point of section 5, between buses 2 and 6, has been simulated. Using the simulated change of bus voltages, the equivalent current injections have been computed. Fig. 3.6 and Fig. 3.7 plot the real and imaginary parts of the current injection vector.
Figure 3.5: 11-bus distribution system diagram in ATP-EMTP.
Figure 3.6: Real part of equivalent injections for the 11-bus unbalanced case.
Figure 3.7: Imaginary part of equivalent injections for the 11-bus unbalanced case.
It can be readily seen from the plots that the non-zeros do appear at the terminal nodes (nodes 4, 5, 14 and 15 correspond to the A and B phases of buses 2 and 6) and that the proportional relationships between the pairs of equivalent injections hold. In addition, the decision making diagram readily identifies the fault as a double phase to ground fault.
3.1.4 Building the Linear Estimation Equations
As shown in (3.1) and the supporting simulation verification, it is possible to estimate the equivalent
injections by a simple matrix-vector product of Ybus and ∆V. This, however, requires full observability of the system, i.e. the pre- and post-fault voltages at all buses are assumed to be available and synchronized. Unfortunately, this is not a very realistic assumption for most power systems today. Thus, this assumption is relaxed by considering a partially observable system with voltage phasors known at only m out of N buses. The voltage phasors at these m buses can be measured by Phasor Measurement Units (PMUs) placed at those buses, or derived from voltages measured by PMUs at neighboring buses together with current flow measurements on the lines connecting these neighbors.
This will lead to the following underdetermined version of nodal equations for the system:
Zm ∆I = ∆Vm (3.9)
where Zm and ∆Vm are extracted from Zbus and ∆V such that their rows correspond to the m
observable buses. In order for this complex-valued equation to be solved by existing sparse recovery
algorithms, it is converted into a larger set of real equations as described in [30] and shown below.
X = [  real(Zm)   −imag(Zm)
       imag(Zm)    real(Zm) ]        (3.10)

y = [  real(∆Vm)
       imag(∆Vm) ]                   (3.11)
As a result, the following real-valued estimation problem is obtained:
y = Xβ (3.12)
The solution vector β will contain two sub-vectors, namely the real and imaginary parts of the
equivalent injections.
3.2 Sparse Estimation Algorithm
The linear equation given by (3.12) is underdetermined and would normally have infinitely many solutions. However, it can be proved that when the number of non-zeros in β is smaller than half the number of rows in X, its sparsest solution is unique [31]. For a fault between the line terminals, the sparsest solution can only be the one with equivalent bus injections, since any sparser injection vector would indicate a fault at a bus. Therefore the most straightforward formulation for solving (3.12) with a sparsity constraint is:
min ‖β‖0 s.t. y = Xβ (3.13)
where the L0 norm of the solution β is defined as:
‖β‖0 = #{i ∈ {1, 2, ..., N} : βi ≠ 0} (3.14)
Since this L0 minimization formulation is non-convex and NP-hard, as it involves combinatorial optimization, we turn to an L1 minimization formulation, the L1 norm being a proven convex relaxation of the L0 norm. A popular L1 norm minimization formulation is the Lasso (least absolute shrinkage and selection operator) [32]:

β̂ ∈ argmin_{β ∈ R^p} (1/2)‖y − Xβ‖₂² + λ‖β‖₁ (3.15)
where tuning the parameter λ > 0 leads to solutions of different sparsity. One efficient algorithm for solving this formulation is LARS (Least Angle Regression and Shrinkage), which iteratively selects predictors using an equal correlation criterion [33]. The merit of this algorithm is that it is parameter-free and requires little computation time. In the following sections, the formulation (3.15) and its solution through the LARS algorithm will be referred to as the basic LARS-Lasso method, to distinguish it from its alterations.
The basic procedure for LARS is as follows, it can be concluded that the complexity of the LARS
algorithm is O(N), that is, it features linear complexity.
1. Start with an all-zero initial coefficient vector, and find the predictor (column in X) most
correlated with the response;
2. Take the largest step possible in the direction of this predictor until some other predictor, has
as much correlation with the current residual;
3. Proceed in a direction equiangular between the two predictors until a third variable earns its
way into the “most correlated” set;
4. Then proceed equiangularly between the former three predictors, that is, along the “least angle
direction”, until a fourth variable enters, and so on.
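The steps above can be sketched in a few lines of numpy. This is an illustrative re-implementation of the LARS selection loop, not the thesis code; it returns the active (selected) set, and the coefficients on that support can then be recovered by ordinary least squares.

```python
import numpy as np

def lars_active_path(X, y, n_steps):
    """Sketch of LARS steps 1-4: track the most-correlated set and advance
    along the equiangular ("least angle") direction."""
    n, p = X.shape
    mu = np.zeros(n)                                # current fit
    active = [int(np.argmax(np.abs(X.T @ y)))]      # step 1: most correlated predictor
    for _ in range(n_steps - 1):
        c = X.T @ (y - mu)                          # correlations with current residual
        C = np.max(np.abs(c))
        s = np.sign(c[active])
        XA = X[:, active] * s                       # sign-adjusted active columns
        G = XA.T @ XA
        Ginv1 = np.linalg.solve(G, np.ones(len(active)))
        A = 1.0 / np.sqrt(np.sum(Ginv1))
        u = XA @ (A * Ginv1)                        # equiangular direction
        a = X.T @ u
        # smallest positive step until another predictor ties in correlation
        gamma, nxt = np.inf, None
        for k in range(p):
            if k in active:
                continue
            for g in ((C - c[k]) / (A - a[k]), (C + c[k]) / (A + a[k])):
                if 1e-12 < g < gamma:
                    gamma, nxt = g, k
        if nxt is None:
            break
        mu = mu + gamma * u                         # steps 2-4: advance equiangularly
        active.append(nxt)
    return active
```

For a response that is an exact combination of two orthonormal columns, the first two LARS steps recover exactly those two predictors.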
3.3 Measurement Placement for Unique Solution
The dictionary matrix in (3.15) should satisfy certain conditions for (3.15) to have a unique solution,
and this condition can be used in placing the voltage measurements. Based on the KKT optimality
conditions for (3.15), a sufficient condition for a unique solution is introduced in [34]. Firstly, the
concept of the equicorrelation set ε is defined as follows:
ε = {i ∈ {1, . . . , p} : |XiT(y − Xβ̂)| = λ} (3.16)
In other words, ε indexes the non-zero entries of the solution β̂, and β̂ε is the corresponding
subvector. Then the condition for a unique solution is given by the following lemma:
Lemma 1 For any y, X and λ > 0, if null(Xε) = 0, or equivalently rank(Xε) = |ε|, then the
Lasso solution is unique.
In order to apply this condition to the placement of PMUs, the columns of the dictionary matrix
X and the variables in β are divided into groups g1, g2, . . . , gL, where L is the total number of lines
and group gi contains the indices of columns in X (equivalently, of entries in β) corresponding to
the terminal buses of the ith line. Assuming that a fault occurs on one line at a time, the non-zero
entries in the solution vector will appear in only one of the groups. To guarantee that the Lasso
problem for each line has a unique solution, for all i ∈ {1, 2, . . . , L} the submatrix Xgi should have full
column rank. Note that if the positive-sequence network of a system is studied, the submatrix
extracted from Zbus according to the grouping g1, g2, . . . , gL will have only two column vectors (or
three for teed lines). If the full three-phase model is used, the submatrix will contain six column
vectors corresponding to the terminal buses in phases A, B, and C. However, even in this case, only
two single-phase columns need to be compared, due to the inherent independence between columns
of different phases.
Intuitively, two normalized column vectors can be considered independent of each other if their
entries in the same rows differ. However, the bus impedance matrix is usually ill-conditioned, with
highly correlated columns; thus the numerical threshold for declaring two entries different has to be
large enough for reliable performance of the Lasso solvers. This is accomplished by taking th = 50%
of the maximum difference between the two columns as the threshold.
two columns as the threshold. As an example, for the jth line from bus k to bus m the following
operation is performed:
d(:, j) = |Zbus(:, k)−Zbus(:,m)| (3.17)
s(:, j) = {i ∈ {1, ...N} : d(i, j) > th ·max(d(:, j))}. (3.18)
This relationship can be summarized in a binary matrix A1 in which each row corresponds to a
branch and each column to a bus in the system:
A1(i, j) = 1 if j ∈ s(:, i), 0 otherwise. (3.19)
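A sketch of (3.17)-(3.19), assuming the row-per-line, column-per-bus orientation of A1 that the constraint (3.22) requires; the 4-bus Zbus and the line list in the test are invented for illustration, not the thesis test system.

```python
import numpy as np

def build_A1(Zbus, lines, th=0.5):
    """For each line (k, m): d = |Zbus[:, k] - Zbus[:, m]|  (3.17), keep the
    buses whose difference exceeds th * max(d)  (3.18), and record them as a
    binary row of A1  (3.19).  th = 0.5 matches the 50% threshold in the text."""
    N = Zbus.shape[0]
    A1 = np.zeros((len(lines), N), dtype=int)
    for j, (k, m) in enumerate(lines):
        d = np.abs(Zbus[:, k] - Zbus[:, m])        # (3.17)
        A1[j, d > th * d.max()] = 1                # (3.18)-(3.19)
    return A1
```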
Since a given bus voltage can be measured either directly by a PMU installed at the bus or indirectly
by a PMU at one of its neighboring buses, similarly to the procedure for optimally placing PMUs
for state estimation [35], the N × N (N is the number of buses) bus incidence matrix A2 is defined as:
A2(i, j) = 1 if bus i is connected with bus j, 0 otherwise. (3.20)
In order to obtain a unique solution using the minimum number of PMUs, the following binary
integer optimization problem is formulated, where xi = 1 (i = 1, 2, . . . , N) implies an installed
PMU at bus i, and the non-zero entries in the product A2x represent the buses whose voltages can
be obtained under the chosen PMU placement:
min Σ(i=1..N) xi (3.21)
s.t. A1A2x ≥ 1̂, x binary. (3.22)
The solution of this optimization will yield only a partially observable system, unlike PMU-based
state estimation, which requires full network observability.
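The binary program (3.21)-(3.22) can be prototyped by exhaustive search on a toy system; a MILP solver would be used for real networks, and the A1/A2 matrices in the test are illustrative (A2 here includes the diagonal, i.e. a bus is considered connected to itself).

```python
import itertools
import numpy as np

def min_pmu_placement(A1, A2):
    """Smallest binary x with A1 @ (A2 @ x) >= 1 elementwise, per (3.21)-(3.22).
    Exhaustive search over PMU counts k = 1, 2, ... (toy-scale only)."""
    N = A2.shape[0]
    for k in range(1, N + 1):
        for combo in itertools.combinations(range(N), k):
            x = np.zeros(N, dtype=int)
            x[list(combo)] = 1
            if np.all(A1 @ (A2 @ x) >= 1):   # every line's Lasso uniquely solvable
                return x
    return None
```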
Chapter 4
Alterations Improving the Method
Following the description of the proposed method in Chapter 3, several alterations have been made,
aiming to improve the performance of the method in terms of reliability, robustness and efficiency.
4.1 Overlapping Group Lasso
In addition to the basic Lasso formulation, an alternative formulation is also considered, in which a
structural feature of the solution, besides its sparsity, is taken into account: the non-zero injections
in the solution appear only at the terminal buses of a line. Therefore the solution is sparse not only
element-wise but also group-wise. Recalling Section 3.3, where the buses in a system with L lines
are grouped into g1, g2, . . . , gL, these groups are overlapping, since a bus in a transmission system
is usually connected to more than one line. Therefore, incorporating a penalty on group sparsity
leads to the following formulation [36]:
β̂ ∈ argminβ∈ℝp (1/2)‖y − Xβ‖₂² + λ₁‖β‖₁ + λ₂ Σ(i=1..L) ‖βgi‖₂ (4.1)
This reformulated problem can be solved using CVX, a software package for specifying and solving
convex programs [37, 38].
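The structure of (4.1) can be made concrete by evaluating its objective; this is only the objective function, not a solver, and the group lists in the test (which overlap at index 1) are illustrative.

```python
import numpy as np

def group_lasso_objective(X, y, beta, groups, lam1, lam2):
    """Objective of the overlapping group Lasso (4.1):
    (1/2)||y - X beta||_2^2 + lam1 ||beta||_1 + lam2 * sum_i ||beta_{g_i}||_2,
    where `groups` holds the (possibly overlapping) index sets g1, ..., gL."""
    fit = 0.5 * np.sum((y - X @ beta) ** 2)
    l1 = lam1 * np.sum(np.abs(beta))
    grp = lam2 * sum(np.linalg.norm(beta[g]) for g in groups)
    return fit + l1 + grp
```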
4.2 Second Stage OLS Estimation
4.2.1 Criteria for Valid Solution
After the Lasso or overlapping group Lasso solver returns the solution vector for (3.15) or (4.1), the
following criteria are used to check the validity of the solution.
1. After thresholding the absolute values of the entries in the solution with a small number ε, the
sparsity of the solution vector belongs to either {2, 4, 6} or {3, 6, 9}, where the latter case indicates
a fault on a teed circuit line.
2. The indices of the “non-zero” entries in the solution belong to one of the groups of indices g1,
g2, . . . , gL.
3. The ratio between the non-zeros at the terminal buses satisfies either (3.4) or (3.8).
4.2.2 OLS Backup Estimation Routine
If the solution does not satisfy all the above conditions, it is necessary to fix the result based on
whatever Lasso or overlapping group Lasso yields. Fortunately, when these two solvers fail to give
the correct coefficients, the support of the invalid solution does contain the indices corresponding
to at least one of the faulted line terminals. As a result, the groups of indices from g1, g2, . . . , gL
that overlap with the support of the invalid solution are considered suspect. The valid solution can
then be computed by a second-stage unregularized ordinary least-squares estimation, which has
also been suggested in [33] under the name “LARS-OLS hybrid”.
In the case of the proposed method, this is done by extracting the columns of the dictionary matrix
X using the suspected group indices (denoted by s) and solving an overdetermined linear equation
via (4.2).
β̂s = (XsᵀXs)⁻¹Xsᵀy (4.2)
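A sketch of the backup routine: try the second-stage OLS fit (4.2) for each suspect group and keep the one with the smallest residual. The validity criteria of Section 4.2.1 are abstracted into this residual comparison, and the test data are invented.

```python
import numpy as np

def ols_backup(X, y, suspect_groups):
    """Second-stage OLS over each suspect group of column indices; returns
    the group and coefficients giving the least residual error."""
    best = None
    for g in suspect_groups:
        Xs = X[:, g]
        beta_s = np.linalg.lstsq(Xs, y, rcond=None)[0]   # solves (4.2)
        r = np.linalg.norm(y - Xs @ beta_s)              # residual of this fit
        if best is None or r < best[0]:
            best = (r, g, beta_s)
    return best[1], best[2]
```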
The resulting β̂s with the least residual error becomes the final solution. This second-stage backup
evaluation can be regarded as reliable protection for the proposed method, which might not work
well under certain noise conditions.
4.3 Extended Robust Formulation and Correction Routine
4.3.1 Extended Robust Formulation
Since individual measurement units in the system may fail or suffer from corruption, it is considered
necessary to investigate the robustness of the proposed method against such gross errors. This
leads to an extended Lasso formulation in [39], which incorporates sparse missing or grossly
corrupted measurements into (3.15). Below is the extended version of the underdetermined linear
equation (3.12), where e stands for the sparse gross errors added to the correct voltage measurements:
y = Xβ + e = [X I][β; e] (4.3)
Using (4.3), the only difference in the solution routine from the basic LARS-Lasso is the replacement
of the original dictionary matrix X with the extended matrix [X I]. In this way, if individual
measurement units fail or happen to be inaccurate, the e vector will have corresponding non-zero
entries; the locations and values of the gross errors, along with the correct equivalent injections,
can be identified simultaneously.
4.3.2 Recursive Solution Routine with Correction of Erroneous Measurement(s)
To achieve better performance using the extended robust formulation, the estimated gross error
vector can be fed back to correct the measurements; a corresponding recursive routine is designed
as in Fig. 4.1. The initially corrupted measurement vector is updated with the estimated e and then
used for the next round of the LARS routine, until no gross error is detected. This approach helps
improve the final accuracy of the fault location, and although it increases the computation time, this
is not critical, since fault location is an offline application and a single round of LARS takes little
time to run.
[Flowchart: start → run LARS with the extended formulation → if the error vector has non-zero
entries, correct the measurement vector with the estimated gross error and repeat; otherwise end
the estimation.]
Figure 4.1: Flowchart for the recursive estimation
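The extended dictionary (4.3) and the correction loop of Fig. 4.1 can be sketched as a skeleton with a pluggable solver; `sparse_solver` stands in for any Lasso-type solver returning the stacked [β; e] vector (an assumed interface, not the thesis code), and the stub solver in the test only exercises the mechanics of the loop.

```python
import numpy as np

def robust_solve(X, y, sparse_solver, tol=1e-6, max_rounds=5):
    """Recursive routine of Fig. 4.1: solve with the extended dictionary
    [X I] per (4.3), subtract the estimated gross error e from y, and repeat
    until no gross error is detected."""
    n, p = X.shape
    Xext = np.hstack([X, np.eye(n)])       # extended dictionary [X I]
    for _ in range(max_rounds):
        z = sparse_solver(Xext, y)
        beta, e = z[:p], z[p:]
        if np.max(np.abs(e)) < tol:        # error vector clean: done
            return beta, y
        y = y - e                          # correct the measurement vector
    return beta, y
```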
4.4 Parallel Implementation
4.4.1 Problem Decomposition
Regarding the efficiency of the proposed method, although LARS has been found to have linear
complexity, as the scale of the power system increases there is a concern that the computation
speed of the fault location program may drop below the limit required for the fault to be repaired in
time. Thus, investigations were made into how the sparse estimation problem can be decomposed
to allow solution by multiple processors running in parallel. Based on [40], three types of data
distribution for the dictionary matrix, namely row block distribution, column block distribution and
general block distribution, are illustrated in Fig. 4.2 below.
Figure 4.2: Three Scenarios of Data Distribution

Among the three distribution scenarios, column block distribution is considered the most appropriate
for parallelizing the original Lasso problem, because in the power system fault location application
each column of the dictionary matrix represents a bus in the network, and aggregations of buses
form sub-networks, which is a common way of decomposing large power networks.
To decompose the original Lasso problem into smaller problems, it is necessary to revisit the notion
of groups of indices. For each line in the network there are four corresponding column indices in
the dictionary matrix, and these four indices form a group. For a network with L lines, there will be
L groups of indices. Thus, as shown in the column block distribution scenario, each parallel
processor can take charge of a subset of the dictionary matrix's columns, namely the union of
several groups of indices corresponding to several lines. As with the overlapping group Lasso
formulation, the subsets of columns overlap, since the terminals of a line cannot be assigned to
different groups.
When the problem is decomposed using the column block distribution, it can easily be seen that
the sub-problems are independent of each other; in other words, they are embarrassingly parallel
[41]. Taking Fig. 4.3 as an example, there are three estimation problems to be solved in parallel:
(A1, b), (A2, b) and (A3, b). Each of {A1, A2, A3} covers roughly L/nWorker (here nWorker = 3)
groups of lines, and a tie-line between two areas corresponds to the overlapping columns between
two sub-matrices.
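The contiguous, overlapping column split described above can be sketched as follows; the chain-network line groups in the test are illustrative.

```python
def split_columns(line_groups, n_workers):
    """Assign a contiguous chunk of line groups to each worker; a worker's
    column set is the union of its groups.  Consecutive workers' sets overlap
    wherever a tie-line's terminal buses are shared across the split."""
    per = -(-len(line_groups) // n_workers)        # ceil(L / n_workers) lines each
    col_sets = []
    for w in range(n_workers):
        chunk = line_groups[w * per:(w + 1) * per]
        col_sets.append(sorted(set().union(*chunk)))
    return col_sets
```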
Assume the sparse injections exist in the x1 part; then the first sub-problem can be solved smoothly
by the first processor, resulting in the same sparse solution as the original problem, but in the
meantime the other sub-problems will run into exhaustive searches or exit halfway due to internal
detection of discrepancies, because there is no sparse solution for them. To achieve the goal of
reducing computation time, there should be a communication mechanism between processors so
that once a single processor finishes its estimation, the other processors are told to stop.

Figure 4.3: Example of Column Block Distribution
4.4.2 Parallelism Attempted
Based on the sequential Matlab code of the LARS algorithm [42] for solving Lasso, the author has
developed four parallel implementations of the algorithm, using the Matlab Parallel Computing
Toolbox (PCT) [43], the Message Passing Interface (MPI) [44], Open Multi-Processing (openMP) [45]
and a GPU [46], respectively. The MPI and openMP implementations are modified from the
sequential C code of LARS, which uses the GNU Scientific Library (GSL) [47] for defining and
manipulating vectors and matrices. The main reason for using the GSL library is that it is built into
the gnu-4.8.1 compiler and provides an interface to the Basic Linear Algebra Subprograms
(BLAS) [48] operations on vector and matrix objects. For set operations such as union and set
difference, the C++ Standard Template Library (STL) [49] has been used.
Fig. 4.4 identifies the development path and supporting libraries for each version. Among the four
parallel versions, the Matlab PCT, MPI and openMP versions are based on the column block
decomposition of the original problem; therefore they are all embarrassingly parallel: each
thread/worker applies the LARS algorithm to its part of the dictionary matrix, and one of them
eventually obtains the correct solution. The GPU version is written in Compute Unified Device
Architecture (CUDA) C [50], and GPU-accelerated libraries, namely cuBLAS [51] and Thrust [52],
which are analogous to the CPU-based GSL BLAS and STL libraries, are utilized.
[Diagram: the Matlab sequential code leads to the Matlab PCT version; the C++ sequential code
(supported by GSL, GSL BLAS and the STL) leads to the C++ MPI, C++ openMP and CUDA GPU
versions, the last supported by cuBLAS and Thrust. MPI: Message Passing Interface; openMP:
Open Multi-Processing; GPU: Graphics Processing Unit.]
Figure 4.4: Development process of different implementations
Among the four implementations, Matlab PCT is only used for prototyping the problem
decomposition and the SPMD (single program multiple data) [53] scheme, which is also used in the
MPI implementation. As for the MPI version, it was later found that the desired communication is
incompatible with MPI's broadcast function: MPI_Bcast() can only take effect when it is called
before the broadcast variable is used, and the root of the broadcast has to be assigned without
knowing which thread will generate the correct sparse solution. Therefore, in the MPI version the
overall timing is determined not by the “successful” thread but by the thread that takes the longest
time; nothing can be done to tell the other threads to stop once one thread has succeeded.
Consequently, neither the Matlab PCT nor the MPI implementation is used for the efficiency test.
On the other hand, due to the inherent difference between the architectures of multi-core CPUs
and GPUs, it is not promising to transplant the per-core solution routine of the MPI or openMP
implementation onto each GPU thread. In this light, the reasonable way is to use the massive
number of GPU threads to accelerate the matrix/vector operations while keeping the framework of
the sequential C version unchanged. The author has resorted to the cuBLAS library, a
GPU-accelerated version of the complete standard BLAS library. Similarly, the use of the STL has
been replaced by its GPU analogue, named Thrust. Both cuBLAS and Thrust functions implicitly
create objects on the GPU and perform the computation without user-defined kernel functions.
Unfortunately, although cuBLAS functions can be called from user-defined kernel functions, Thrust
does not allow this; hence, even though it is fairly convenient to use these two libraries, there is no
freedom in testing different grid and block sizes.
As for the openMP version, its most important merit is its use of shared memory instead of
protocol-based communication, even though this merit comes with a limit on computation
resources: a pure openMP program can only use the cores of one compute node, whereas there is
no such limit for MPI. The broadcast from the “successful” thread to the others is easily
implemented by changing the value of a shared variable so that the loops in the other threads
break. OpenMP can thus bound the overall run time by that of the “successful” thread, which is the
fastest thread of all.
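The shared-variable stop mechanism can be mimicked with Python threads as an analogy to the openMP shared flag; this is not the thesis C++ code, and the `has_solution` flag is a stand-in for a worker actually finding the valid sparse solution on its column block.

```python
import threading

def sub_solver(stop, wid, has_solution, results):
    """Each worker mimics a LARS run on its column block, polling the shared
    event the way an openMP thread would poll a shared stop variable."""
    for _ in range(100000):
        if stop.is_set():            # another worker already succeeded: break out
            return
        if has_solution:             # stand-in for "found the valid sparse solution"
            results.append(wid)
            stop.set()               # tell the other workers to stop
            return

stop, results = threading.Event(), []
workers = [threading.Thread(target=sub_solver, args=(stop, w, w == 1, results))
           for w in range(3)]
for t in workers:
    t.start()
for t in workers:
    t.join()
# Only the "successful" worker reports; the rest observe the flag and exit early.
```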
Chapter 5
Simulation Results
To generate an overall picture of the method's applicability and performance (accuracy, efficiency,
robustness), various tests have been performed from different perspectives. The accuracy of the
estimation is evaluated by the error in the fault distance. Let D be the total length of the faulted line,
d the distance from the fault point to the terminal bus where the equivalent injection is larger, and
resti and rtrue the estimated and “true” ratios between the equivalent injections; the relative fault
distance error is then derived by (5.3):
desti = D/(resti + 1) (5.1)
dtrue = D/(rtrue + 1) (5.2)
err = |desti − dtrue|/dtrue = |rtrue − resti|/(resti + 1) (5.3)
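Equations (5.1)-(5.3) can be checked numerically; the two routes to the relative error must agree (the sample values of D and the ratios are invented).

```python
def fault_distance_error(D, r_true, r_esti):
    """Relative fault distance error via (5.1)-(5.2) and via the closed
    form (5.3); both expressions are mathematically identical."""
    d_esti = D / (r_esti + 1)                      # (5.1)
    d_true = D / (r_true + 1)                      # (5.2)
    err_a = abs(d_esti - d_true) / d_true          # definition of err
    err_b = abs(r_true - r_esti) / (r_esti + 1)    # (5.3)
    return err_a, err_b
```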
In the following sections, firstly the universal applicability of the method to different fault types
and line types is verified through an unbalanced test case as well as a “teed” line in the IEEE
118-bus test case from [54]. Then the performance of the proposed method and its variations is
evaluated using extensive simulation in the 118-bus test case. A 300-bus test case as well as a
3395-bus test case will also be used for efficiency tests of the parallel implementation.
5.1 Faults in the Unbalanced Test Case
Using the same unbalanced ATP test case as in Section 3.1.3, the applicability of the proposed
method to identifying faults of various types on both balanced and unbalanced lines has been
validated. For the 10 line sections in the test system (Fig. 3.5), all five types of fault (single-phase to
three-phase, grounded and ungrounded) are simulated, and the equivalent injections are estimated
using full measurements of all buses, because the system is too small for the optimized PMU
placement to apply. Measurement noise and gross errors are not taken into consideration in this
case, since later subsections discuss these for larger test cases. The relative errors of the estimated
distance for all simulated faults are summarized in Table 5.1, where “SLG, DL, DLG, 3L, 3LG”
stand for single-phase-to-ground fault, double-phase short-circuit fault, double-phase-to-ground
fault, three-phase short-circuit fault and three-phase-to-ground fault, respectively. The fault
resistance is set to 0.1 ohm.
Table 5.1: Relative Error of Fault Distance for the Unbalanced 11-Bus Case

Section ID   SLG       DL        DLG       3L        3LG
Sec 1        3.86e-5   6.75e-5   1.12e-4   7.57e-5   2.41e-5
Sec 2        2.10e-5   1.06e-5   1.18e-5   N/A       N/A
Sec 3        3.63e-5   2.30e-5   7.10e-6   N/A       N/A
Sec 4        5.40e-5   5.12e-5   5.06e-5   8.95e-5   1.38e-4
Sec 5        2.13e-5   3.30e-5   1.68e-5   7.35e-5   5.41e-5
Sec 6        1.37e-4   1.18e-4   7.44e-5   9.62e-5   9.24e-5
Sec 7        5.20e-3   2.00e-3   2.40e-3   N/A       N/A
Sec 8        6.16e-5   N/A       N/A       N/A       N/A
Sec 9        3.80e-3   N/A       N/A       N/A       N/A
Sec 10       2.41e-5   5.84e-6   1.38e-5   3.49e-5   2.24e-5
From the statistics above it can be seen that the accuracy of the equivalent injection estimation is
comparatively better for grounded faults than for ungrounded faults. This is only a rough
observation, because the electromagnetic waveforms provided by ATP have to be fitted into
phasors, and the performance of the curve fitting also affects the final error. In any case, the
proposed method can be regarded as accurate from a practical point of view.
5.2 Wide-area Cases
In the following sections, the IEEE 118-bus test system, for which only the positive-sequence
model is available, will be used firstly to verify the applicability of the proposed method to teed
circuit faults, and then to examine its effectiveness in terms of accuracy, number of iterations and
number of OLS backup evaluations under different levels of measurement noise. After
implementing the binary optimization shown in (3.22), it turned out that 24 PMUs need to be
placed, making 96 of the 118 buses observable.
For faults on each of the 170 lines, the changes in bus voltages are simulated by injecting a pair of
proportional currents at the terminal buses, commensurate with the description of equivalent
injections in (3.4). The location results for all 170 lines under noise-free conditions as well as under
different levels of Gaussian measurement noise are collected to generate statistical performance
metrics. Both the LARS algorithm for Lasso and the CVX package are employed in the Matlab
environment on a 2.6 GHz Intel i5 processor.
5.2.1 Teed Circuit Fault
To verify the equivalent three-terminal injections described in Section 3.1.1.2, bus 71 in the 118-bus
system is reduced to a teed point, since it is a zero-injection bus with only buses 70, 72 and 73 as
its neighbors. Faults are simulated on line sections 70-71, 71-72 and 71-73, respectively, and each
fault can be located correctly. For instance, for a fault on line 70-71 at a point (2/7)th of the line
length away from bus 70, the equivalent injections estimated by basic LARS-Lasso are:
away from bus 70, the equivalent injections estimated by basic LARS-Lasso are:
I70 = 383.0554− j759.1415(p.u.) (5.4)
I72 = 13.1256− j24.9163(p.u.) (5.5)
I73 = 47.1500− j102.6877(p.u.) (5.6)
Let z70−71, z71−72 and z71−73 represent the series line impedances, and z70−f the impedance
from bus 70 to the fault point. Then, comparing the three injections leads to the following
relationships:
I72/I73 = z71−72/z71−73 = (0.0446 + j0.1800)/(0.0087 + j0.0454) (5.7)
zs = z70−71z71−72 + z70−71z71−73 + z71−72z71−73 (5.8)
If = I70 + I72 + I73 = 443.33 − j886.75 (p.u.) (5.9)
z70−f = I72zs/(If z71−73) = 0.0025 + j0.0102 (p.u.) (5.10)
z70−f/z70−71 = 0.2863 ≈ 2/7 (5.11)
Therefore, the location of the fault which is (2/7)th of the total distance from bus 70 to the teed
point (bus 71) is accurately estimated.
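As a quick sanity check, the sum in (5.9) can be reproduced directly from the injections reported in (5.4)-(5.6):

```python
# Phasor injections reported for the fault on line 70-71, (5.4)-(5.6), in p.u.
I70 = 383.0554 - 759.1415j
I72 = 13.1256 - 24.9163j
I73 = 47.1500 - 102.6877j

# Total fault current per (5.9); matches the reported 443.33 - j886.75 p.u.
If = I70 + I72 + I73
```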
5.2.2 Comparison of Basic LARS-Lasso and Overlapping Group Lasso
The performance of the method is examined by comparing the results of the basic LARS-Lasso
formulation with those of the overlapping group Lasso formulation in terms of four metrics:
average computation time, average number of iterations, average error, and the number of cases
requiring second-stage estimation among all fault cases. Since the CVX software package uses a
general-purpose interior-point algorithm rather than the LARS algorithm used for Lasso, it is much
slower than LARS, and the comparison of computation time is only included for completeness; a
later section will show how the efficiency varies with the number of computation cores. Graphic
representations of the comparison results are shown in Fig. 5.1 to Fig. 5.4.
It can be seen from Fig. 5.1 to Fig. 5.4 that the overlapping group Lasso works more stably, judging
from the number of iterations and the number of OLS backup evaluations under different levels of
noise, although it is not as highly accurate as the basic LARS implementation; this may be due to
the limited adjustment that can be made to the tolerance of CVX. However, if timing is important,
overlapping group Lasso will be less favorable than basic LARS-Lasso.

[Figure: average computation time vs. standard deviation of measurement noise, for Lasso and
overlapping group Lasso.]
Figure 5.1: Computation Times Used by Basic LARS-Lasso and Overlapping Group Lasso.

[Figure: average number of iterations vs. standard deviation of measurement noise, for Lasso and
overlapping group Lasso.]
Figure 5.2: Number of Iterations by Basic LARS-Lasso and Overlapping Group Lasso.
[Figure: average fault distance error vs. standard deviation of measurement noise, for Lasso and
overlapping group Lasso.]
Figure 5.3: Error of Fault Distance by Basic LARS-Lasso and Overlapping Group Lasso.
[Figure: average number of OLS backup estimations vs. standard deviation of measurement noise,
for Lasso and overlapping group Lasso.]
Figure 5.4: Number of OLS Backup Estimations by Basic LARS-Lasso and Overlapping Group
Lasso.
5.2.3 Robustness of Extended LARS-Lasso Formulation Against Gross Error
To evaluate the robustness of the extended formulation, the same set of fault cases under Gaussian
noise is solved using the extended formulation with one and two failed measurements (output
equal to zero), respectively, and each case is run twice: once with and once without the recursive
routine. Hence there are four records of performance to be compared with that of the
gross-error-free cases solved by basic LARS-Lasso. In each gross-error-corrupted case the error
vector can be accurately estimated and used for correcting the measurements in the recursive
routine. Fig. 5.5 shows the comparison of the estimation error under different levels of noise. It can
be concluded that the effect sparse gross errors have on the accuracy of the location method is
within acceptable limits, and applying the recursive routine helps push the accuracy towards that
of the gross-error-free cases.
[Figure: average fault distance error vs. standard deviation of measurement noise, for error-free
cases and for one or two unit failures, each with and without correction.]
Figure 5.5: Comparison of Fault Distance Error under One or Two Unit Failures
5.3 Parallel Implementation Results
To evaluate the efficiency of the proposed method, especially its parallel implementation, a 300-bus
as well as a 3395-bus test case are also used for simulations. The sequential and parallel programs
are run on the Discovery cluster offered by Information Technology Services, Research Computing,
at Northeastern University. The queue used belongs to the node group “nodes10g2”, whose
technical specifications are as follows:
Table 5.2: Platform Information
CPU Intel Xeon CPU E5-2680 2.8GHz
# logical cores on each node 40
RAM 64GB
5.3.1 Sequential Run Time Benchmark
Before examining the speed-up gained from the parallel implementations, the run times of the
Matlab and C++ sequential programs are recorded below to serve as benchmarks for the efficiency
tests.
Table 5.3: Sequential Run Times

nBus   Matlab Run Time   C++ Run Time
118    0.0245s           0.0116s
300    0.0981s           0.1153s
3395   89.4747s          213.5774s
5.3.2 Speed-up Using openMP
The openMP implementation is tested using numbers of cores ranging from 5 to 40, the maximum
number of usable cores on a compute node. The speed-ups are shown in Fig. 5.6 below.
[Figure: average speed-up vs. number of cores used in openMP, for the 118-bus, 300-bus and
3395-bus cases.]
Figure 5.6: Comparison of Speed-ups for Different Cases using openMP
It can be seen that for a large system such as the 3395-bus test system, the accelerating effect of
openMP is much more pronounced than for the smaller cases. As the number of cores grows, the
computation time for the 3395-bus case becomes more and more acceptable.
5.3.3 Speed-up Using GPU
Since the library functions launch the kernel functions automatically, it is not possible to choose the
grid dimensions; therefore there is only one way of executing the GPU version. The run times and
corresponding speed-ups are shown in Table 5.4.
Table 5.4: Summary of GPU run times

nBus   GPU Run Time   Speed-up
118    0.4733s        0.02139
300    1.1700s        0.09852
3395   10.1567s       21.0283
It seems odd that for smaller cases, like the 118- and 300-bus cases, the GPU version is much
slower than the sequential C program. A close look at the profiling results indicates that this is
mostly caused by the vector initialization and set operation functions of the Thrust library. Table 5.5
is an extract of the profiling result, from which it can be seen that even the most time-consuming
cuBLAS function (“void gemv2T kernel val”) is much faster than a Thrust function. Since there are
many set operations in the while loop, and this kind of operation involves data dependencies
between vector elements, it is not practical to write kernel functions to parallelize them. Therefore,
so far the most promising approach to parallelizing the proposed method is through openMP.
Table 5.5: Summary of Profiling Results

Total Time%   Total Time   Total Calls   Avg. Time   Function Name
53.61%        384.56ms     8496          45.263us    void::thrust::system......
9.92%         71.172ms     33960         2.0950us    CUDA memcpy DtoH (device to host)
0.08%         450.72us     30            15.023us    void gemv2T kernel val
Despite the non-ideal result, the profiling does verify the efficiency of the cuBLAS functions and the
dynamic kernel launches by the cuBLAS and Thrust function calls (the gridSize and blockSize vary
each time). It is fair to say that a GPU can provide significant speed-up for floating-point matrix
operations, while it is not guaranteed to help with integer set operations.
Chapter 6
Conclusion
The major contribution of this work is the transformation of the fault location problem, for both
two-terminal lines and teed circuit lines, into a sparse estimation problem, or more specifically a
Lasso problem, which can be solved effectively by the LARS algorithm. Using the condition for a
unique solution of the Lasso formulation, an optimal PMU placement scheme is designed, which
results in only partial observability of the system. Besides the basic LARS-Lasso implementation,
the overlapping group Lasso formulation, which customizes the sparsity penalty using structural
information about the solution, is also built and implemented. In addition, an extended robust
Lasso formulation designed to tackle sparse gross errors is applied and equipped with a recursive
routine for correcting the discrepant measurements. To improve the method's efficiency for large
systems, parallel computation techniques are explored and implemented.
Using specific cases, the proposed method's universal applicability to different types of faults and
lines is verified. The performance of the basic LARS-Lasso implementation, the overlapping group
Lasso and the Lasso with extended robust formulation is examined using wide-area test cases.
Furthermore, the results of the parallel implementation based on openMP reveal that for large
systems the proposed method can be accelerated so that the time required to locate a fault is
satisfactory.
Future research topics include investigating the method's robustness against high noise levels
using different amounts of redundant measurements. The issue of parameter errors in the
dictionary matrix is also worth exploring.
Bibliography
[1] M. M. Saha, J. Izykowski, and E. Rosołowski, Fault location on power networks. Springer,
2010.
[2] P. Gale, P. Crossley, X. Bingyin, Y. Ge, B. Cory, and J. Barker, “Fault location based on travelling
waves,” in Fifth International Conference on Developments in Power System Protection. IET,
1993, pp. 54–59.
[3] F. H. Magnago and A. Abur, “Fault location using wavelets,” IEEE Transactions on Power
Delivery, vol. 13, no. 4, pp. 1475–1480, 1998.
[4] C. Y. Evrenosoglu and A. Abur, “Travelling wave based fault location for teed circuits,” IEEE
Transactions on Power Delivery, vol. 20, no. 2, pp. 1115–1121, 2005.
[5] M. Korkali, H. Lev-Ari, and A. Abur, “Traveling-wave-based fault-location technique for
transmission grids via wide-area synchronized voltage measurements,” IEEE Transactions on
Power Systems, vol. 27, no. 2, pp. 1003–1011, 2012.
[6] R. Razzaghi, G. Lugrin, H. Manesh, C. Romero, M. Paolone, and F. Rachidi, “An efficient
method based on the electromagnetic time reversal to locate faults in power networks,” IEEE
Transactions on Power Delivery, vol. 28, no. 3, pp. 1663–1673, 2013.
[7] T. Takagi, Y. Yamakoshi, M. Yamaura, R. Kondow, and T. Matsushima, “Development of a
new type fault locator using the one-terminal voltage and current data,” IEEE Transactions on
Power Apparatus and Systems, no. 8, pp. 2892–2898, 1982.
[8] T. Kawady and J. Stenzel, “A practical fault location approach for double circuit transmission
lines using single end data,” IEEE Transactions on Power Delivery, vol. 18, no. 4, pp. 1166–
1173, 2003.
[9] S. M. Brahma, “Fault location scheme for a multi-terminal transmission line using synchronized
voltage measurements,” IEEE Transactions on Power Delivery, vol. 20, no. 2, pp. 1325–1331,
2005.
[10] D. Novosel, D. G. Hart, E. Udren, and J. Garitty, “Unsynchronized two-terminal fault location
estimation,” IEEE Transactions on Power Delivery, vol. 11, no. 1, pp. 130–138, 1996.
[11] M. Kezunovic, “Smart fault location for smart grids,” IEEE Transactions on Smart Grid, vol. 2,
no. 1, pp. 11–22, 2011.
[12] C.-W. Liu, K.-P. Lien, C.-S. Chen, and J.-A. Jiang, “A universal fault location technique for
N-terminal transmission lines,” IEEE Transactions on Power Delivery, vol. 23, no. 3, pp.
1366–1373, 2008.
[13] T. Funabashi, H. Otoguro, Y. Mizuma, L. Dube, and A. Ametani, “Digital fault location
for parallel double-circuit multi-terminal transmission lines,” IEEE Transactions on Power
Delivery, vol. 15, no. 2, pp. 531–537, 2000.
[14] G. Manassero Jr, E. C. Senger, R. M. Nakagomi, E. L. Pellini, and E. C. N. Rodrigues, “Fault-
location system for multiterminal transmission lines,” IEEE Transactions on Power Delivery,
vol. 25, no. 3, pp. 1418–1426, 2010.
[15] D. A. Tziouvaras, J. B. Roberts, and G. Benmouyal, “New multi-ended fault location design for
two-or three-terminal lines,” 2001.
[16] M. Abe, N. Otsuzuki, T. Emura, and M. Takeuchi, “Development of a new fault location system
for multi-terminal single transmission lines,” IEEE Transactions on Power Delivery, vol. 10,
no. 1, pp. 159–168, 1995.
[17] T. Nagasawa, M. Abe, N. Otsuzuki, T. Emura, Y. Jikihara, and M. Takeuchi, “Development
of a new fault location algorithm for multi-terminal two parallel transmission lines,” IEEE
Transactions on Power Delivery, vol. 7, no. 3, pp. 1516–1532, 1992.
[18] H. Livani and C. Y. Evrenosoglu, “A fault classification and localization method for three-
terminal circuits using machine learning,” IEEE Transactions on Power Delivery, vol. 28, no. 4,
pp. 2282 – 2290, 2013.
[19] R. Mahanty and P. D. Gupta, “Application of RBF neural network to fault classification and
location in transmission lines,” IEE Proceedings-Generation, Transmission and Distribution,
vol. 151, no. 2, pp. 201–212, 2004.
[20] J. Gracia, A. Mazon, and I. Zamora, “Best ANN structures for fault location in single-and
double-circuit transmission lines,” IEEE Transactions on Power Delivery, vol. 20, no. 4, pp.
2389–2395, 2005.
[21] Z. Chen and J.-C. Maun, “Artificial neural network approach to single-ended fault locator for
transmission lines,” IEEE Transactions on Power Systems, vol. 15, no. 1, pp. 370–375, 2000.
[22] A. Mazon, I. Zamora, J. Minambres, M. Zorrozua, J. Barandiaran, and K. Sagastabeitia, “A new
approach to fault location in two-terminal transmission lines using artificial neural networks,”
Electric Power Systems Research, vol. 56, no. 3, pp. 261–266, 2000.
[23] Q. Jiang, X. Li, B. Wang, and H. Wang, “PMU-based fault location using voltage measurements
in large transmission networks,” IEEE Transactions on Power Delivery, vol. 27, no. 3, pp.
1644–1652, 2012.
[24] A. Salehi-Dobakhshari and A. M. Ranjbar, “Application of synchronised phasor measurements
to wide-area fault diagnosis and location,” IET Generation, Transmission & Distribution, vol. 8,
no. 4, pp. 716–729, 2014.
[25] M. Majidi, A. Arabali, and M. Etezadi-Amoli, “Fault location in distribution networks by
compressive sensing,” IEEE Transactions on Power Delivery (Early Access), 2014.
[26] H. Zhu and G. B. Giannakis, “Sparse overcomplete representations for efficient identification
of power line outages,” IEEE Transactions on Power Systems, vol. 27, no. 4, pp. 2215–2224,
2012.
[27] O. Kosut, L. Jia, R. J. Thomas, and L. Tong, “Malicious data attacks on smart grid state estima-
tion: Attack strategies and countermeasures,” in 2010 First IEEE International Conference on
Smart Grid Communications (SmartGridComm). IEEE, 2010, pp. 220–225.
[28] C. Wang, Q.-Q. Jia, X.-B. Li, and C.-X. Dou, “Fault location using synchronized sequence
measurements,” International Journal of Electrical Power & Energy Systems, vol. 30, no. 2, pp.
134–139, 2008.
[29] IEEE Distribution System Analysis Subcommittee. (2013) Distribution test feeders. [Online].
Available: http://ewh.ieee.org/soc/pes/dsacom/testfeeders/index.html
[30] A. Sharif-Nassab, M. Kharratzadeh, M. Babaie-Zadeh, and C. Jutten, “How to use real-valued
sparse recovery algorithms for complex-valued sparse recovery?” in Proceedings of the 20th
European Signal Processing Conference (EUSIPCO). IEEE, 2012, pp. 849–853.
[31] D. L. Donoho, “For most large underdetermined systems of linear equations the minimal L1-
norm solution is also the sparsest solution,” Communications on Pure and Applied Mathematics,
vol. 59, no. 6, pp. 797–829, 2006.
[32] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical
Society. Series B (Methodological), pp. 267–288, 1996.
[33] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani et al., “Least angle regression,” The Annals of
statistics, vol. 32, no. 2, pp. 407–499, 2004.
[34] R. J. Tibshirani et al., “The lasso problem and uniqueness,” Electronic Journal of Statistics,
vol. 7, pp. 1456–1490, 2013.
[35] B. Xu and A. Abur, “Observability analysis and measurement placement for systems with
pmus,” in IEEE PES Power Systems Conference and Exposition. IEEE, 2004, pp. 943–946.
[36] L. Yuan, J. Liu, and J. Ye, “Efficient methods for overlapping group lasso,” in Advances in
Neural Information Processing Systems, 2011, pp. 352–360.
[37] M. Grant and S. Boyd, “Graph implementations for nonsmooth convex programs,” in Recent
Advances in Learning and Control, ser. Lecture Notes in Control and Information Sciences,
V. Blondel, S. Boyd, and H. Kimura, Eds. Springer-Verlag Limited, 2008, pp. 95–110,
http://stanford.edu/∼boyd/graph dcp.html.
[38] ——, “CVX: Matlab software for disciplined convex programming, version 2.1,” http://cvxr.
com/cvx, Mar. 2014.
[39] N. M. Nasrabadi, T. D. Tran, and N. Nguyen, “Robust lasso with missing and grossly corrupted
observations,” in Advances in Neural Information Processing Systems, 2011, pp. 1881–1889.
[40] Z. Peng, M. Yan, and W. Yin, “Parallel and distributed sparse optimization,” in 2013 Asilomar
Conference on Signals, Systems and Computers. IEEE, 2013, pp. 659–646.
[41] Argonne National Laboratory, “Parallel Algorithm Examples,” http://www.mcs.anl.gov/∼itf/
dbpp/text/node10.html.
[42] S. S. Kim, “Lars Algorithm,” http://www.mathworks.com/matlabcentral/fileexchange/
23186-lars-algorithm, Mar. 2009.
[43] T. Mathworks, “Matlab Parallel Computing Toolbox,” http://www.mathworks.com/help/
distcomp/index.html, 2015.
[44] Argonne National Laboratory, “The Message Passing Interface (MPI) Standard,” http://www.
mcs.anl.gov/research/projects/mpi/.
[45] Blaise Barney, Lawrence Livermore National Laboratory, “openMP,” https://computing.llnl.
gov/tutorials/openMP/.
[46] Nvidia, “What is GPU Accelerated Computing?” http://www.nvidia.com/object/
what-is-gpu-computing.html.
[47] “GSL - GNU Scientific Library,” http://www.gnu.org/software/gsl/, Aug. 2014.
[48] “BLAS (Basic Linear Algebra Subprograms),” http://www.netlib.org/blas/, Sep. 2014.
[49] “Standard C++ Library Reference,” http://www.cplusplus.com/reference/, Feb. 2015.
[50] Nvidia, “CUDA C Programming Guide,” https://docs.nvidia.com/cuda/
cuda-c-programming-guide/.
[51] “cuBLAS,” http://docs.nvidia.com/cuda/cublas/#axzz3XkIZYzaJ, Feb. 2015.
[52] “Thrust,” http://thrust.github.io/, Feb. 2015.
[53] “SPMD,” http://www.mathworks.com/help/distcomp/spmd.html, Feb. 2015.
[54] University of Washington, Seattle. (1993) Power systems test case archive. [Online]. Available:
http://www.ee.washington.edu/research/pstca/
Appendix A
Appendix
A.1 Matlab Code Solving Lasso Problem
This code implements LARS with the extended robust formulation. The codes for the basic LARS and
the overlapping group Lasso are similar and are thus omitted.
function stat = solver_robust_lars_fun(sig, errV_unit, err_level, lam_e, correct)
global nBus nBranch BranchTypeInfo MeasuredBus Aeq Aeq_noQR Q Neighbour Zbus_red Zbus_f nG G isOpt LengthRatio I0 Result Flag Iter_Err

nCase = 0;
noise_ratio = -1*ones(nBranch,1);
% Iter_Err = (-1)*ones(nBranch,2);
Com_time = (-1)*ones(nBranch,1);
Require_2nd_eval = false(nBranch,1);

true_inj = [real(I0), imag(I0), LengthRatio*real(I0), LengthRatio*imag(I0)];
true_inj = sort(true_inj); % increasing order

[Q,R] = qr(Aeq_noQR);
ext_Aeq = [R, lam_e*inv(Q)];
%ext_Aeq = [Aeq, lam_e*eye(2*length(MeasuredBus))];

stopCriterion = {};
stopCriterion{1,1} = 'MSE'; %'maxKernels';
MSE = 1e-6;
stopCriterion{1,2} = MSE; %1000
for fault_sec_id = 1:1:nBranch
    if BranchTypeInfo(fault_sec_id,3) > 0
        Flag(fault_sec_id) = -1;
        continue;
    end
    nCase = nCase + 1;
    % Flag(fault_sec_id) = 1;
    Fault_sec = BranchTypeInfo(fault_sec_id,1:2);
    I_s = sparse(Fault_sec.', [1;1], [LengthRatio*I0; I0], nBus, 1);
    I = full(I_s);
    deltaV = Zbus_f*I;
    V_noise = sig.*randn(nBus,1); % normal distribution with a mean of 0 and a standard deviation of sig
    deltaV = deltaV + V_noise;
    noise_ratio(nCase,1) = norm(V_noise)/norm(deltaV);

    deltaV2 = deltaV(MeasuredBus);

    beq_org = [real(deltaV2); imag(deltaV2)];

    deltaV2(errV_unit,1) = deltaV2(errV_unit,1)*err_level;

    beq = [real(deltaV2); imag(deltaV2)];

    err_set = beq - beq_org;

    beq = Q\beq;

    Result{nCase,1} = fault_sec_id;
    Result{nCase,2} = Fault_sec;

    isValid = false;
    unsolvableCase = 0;

    existErr = true;
    while existErr == true
        t0 = clock;
        sol = lars(beq, ext_Aeq, [], 'lasso', stopCriterion, [], 0, 1); % solve by LARS
        time1 = etime(clock, t0);

        [tttt,nIter] = size(sol);
        Result{nCase,5} = nIter;
        active_set = sol(1,nIter).active_set;
        beta = sol(1,nIter).beta;

        th = 2e-2*max(abs(beta));
        inj_beta_idx = find(abs(beta)>th);

        inj_nodes_idx = active_set(inj_beta_idx);
        inj = beta(inj_beta_idx);

        temp1 = find(inj_nodes_idx>2*nBus);
        if ~isempty(temp1)
            err_val = inj(temp1)*lam_e;
            err_node_idx = inj_nodes_idx(temp1)-2*nBus;
            temp2 = find(inj_nodes_idx<=2*nBus);
            inj_nodes_idx = inj_nodes_idx(temp2);
            inj = inj(temp2);
            Result{nCase,10} = err_val;
            Result{nCase,11} = err_node_idx;
            error_vector = zeros(2*length(MeasuredBus),1);
            error_vector(err_node_idx) = err_val.';
            err_of_err = norm(error_vector-err_set)/norm(err_set);
            Result{nCase,12} = err_of_err;

            existErr = false;

            % beq = beq - Q\error_vector;

            if correct == true
                beq = beq - Q\error_vector;
                existErr = true; % go back to do LARS again
            end
        else
            existErr = false;
        end

    end
    Result{nCase,3} = inj_nodes_idx;
    Result{nCase,4} = inj;
    %% see if the inj is valid
    if mod(length(inj_nodes_idx),2)==0
        isValid = checkInj(inj, inj_nodes_idx, nBus);
    else
        isValid = false;
    end
    %% 2nd stage OLS
    if ~isValid
        Flag(fault_sec_id) = -1;
        Require_2nd_eval(fault_sec_id) = true;
        [sel_idx, nOLS_test, inj_best, minResidual] = OLS_backup(inj_nodes_idx, nBus, beq, Aeq, Neighbour); % global Aeq Neighbour

        if sel_idx(1)~=Fault_sec(1) || sel_idx(2)~=Fault_sec(2)
        % if minResidual > norm(error_vector)
            disp(['case', num2str(nCase), ' for fault on ', num2str(Fault_sec(1)), '-', num2str(Fault_sec(2)), ' could not be solved through OLS!'])
            nIter = -1;
            unsolvableCase = -1;
            Flag(fault_sec_id) = -1;
        end
        Result{nCase,7} = sel_idx;
        Result{nCase,9} = nOLS_test;
        inj = inj_best.';
        Result{nCase,8} = inj;
    else
        Flag(fault_sec_id) = 1;
    end
    %% Record stats
    Com_time(nCase) = time1;
    Iter_Err(nCase,1) = nIter;
    if unsolvableCase > -1
        unsolvableCase = genError3(true_inj, inj, LengthRatio);
        Result{nCase,6} = unsolvableCase;
    end
    Iter_Err(nCase,2) = unsolvableCase;

end
%% Process stats
total_2nd_eval = sum(Require_2nd_eval)
% Iter_Err = Iter_Err(1:nCase,:);
Iter_Err_idx_1 = find(Iter_Err(:,1) > 0);
Iter_Err_idx_2 = find(Iter_Err(:,2) > 0);
valid_iter = Iter_Err(Iter_Err_idx_1,1);
valid_error = Iter_Err(Iter_Err_idx_2,2);
noise_ratio = noise_ratio(1:nCase,:);

ave_noise = mean(noise_ratio)
ave_iter = mean(valid_iter)
ave_err = mean(valid_error)
ave_time = mean(Com_time(1:nCase))
stat = [ave_noise, ave_iter, ave_err, ave_time, total_2nd_eval];
end
A.2 Parallel C Code Using OpenMP
The attached code is the parallel C implementation of LARS using OpenMP. The sequential C and
MPI implementations are similar and are thus omitted.
1 // FL_Lars.cpp : Defines the entry point for the console application.
2 //
3
4 //#include "stdafx.h"
5
6 //#include "Lars.h"
7 //#include "util.h"
8 #include <time.h>
9 #include <omp.h> // openmp header
10 #include <stdlib.h>
11
12 #include <iostream>
13 #include <stdio.h>
14 #include <stddef.h>
15 #include <math.h>
16 #include <algorithm>
17 #include <string>
18 #include <vector>
19
20 #include <gsl/gsl_math.h>
21 #include <gsl/gsl_vector.h>
22 #include <gsl/gsl_matrix.h>
23 #include <gsl/gsl_blas.h>
24 #include <gsl/gsl_linalg.h>
25 #include <gsl/gsl_errno.h>
26
27 //using namespace std;
51
APPENDIX A. APPENDIX
28 #define MM_MAX_LINE_LENGTH 1025
29 #define MM_PREMATURE_EOF 12
30 #define PI 3.141592653589793238462643383279
31 #define MSE_th 0.001
32 #define RESOLUTION_OF_LARS 1e-12
33 #define REGULARIZATION_FACTOR 1e-11
34 #define MAX_ITER 50
35 #define BETA_TH 1e-3
36
37 double sum_gsl_vec(gsl_vector *vv)
38 {
39 double result;
40 result = 0;
41 for (size_t k = 0; k < vv->size; k++)
42 {
43 result = result + gsl_vector_get (vv, k);
44 }
45 return result;
46 }
47
48 int new_mm_read_mtx_array_size(FILE *f, int *M, int *N)
49 {
50 char line[MM_MAX_LINE_LENGTH];
51 int num_items_read;
52 /* set return null parameter values, in case we exit with errors */
53 *M = *N = 0;
54
55 /* now continue scanning until you reach the end-of-comments */
56 do
57 {
58 if (fgets(line,MM_MAX_LINE_LENGTH,f) == NULL)
59 return MM_PREMATURE_EOF;
60 }while (line[0] == '%');
61
52
APPENDIX A. APPENDIX
62 /* line[] is either blank or has M,N, nz */
63 if (sscanf(line, "%d %d", M, N) == 2)
64 return 0;
65
66 else /* we have a blank line */
67 do
68 {
69 num_items_read = fscanf(f, "%d %d", M, N);
70 if (num_items_read == EOF) return MM_PREMATURE_EOF;
71 }
72 while (num_items_read != 2);
73
74 return 0;
75 }
76
77 int print_matrix(FILE *f, const gsl_matrix *m)
78 {
79 int status, n = 0;
80
81 for (size_t i = 0; i < m->size1; i++) {
82 for (size_t j = 0; j < m->size2; j++) {
83 if ((status = fprintf(f, "%.12f ", gsl_matrix_get(m, i,
j))) < 0)
84 return -1;
85 n += status;
86 }
87
88 if ((status = fprintf(f, "\n")) < 0)
89 return -1;
90 n += status;
91 }
92
93 return n;
94 }
53
APPENDIX A. APPENDIX
95
96 int print_vector(FILE *f, const gsl_vector *m)
97 {
98 int status, n = 0;
99
100 for (size_t j = 0; j < m->size; j++) {
101 if ((status = fprintf(f, "%.12f\n", gsl_vector_get(m, j))) < 0)
102 return -1;
103 n += status;
104 }
105 return n;
106 }
107
108 int main(int argc, char **argv)
109 {
110 int max_n_threads = omp_get_max_threads();
111 printf("max #threads = %d\n",max_n_threads);
112
113 int n_avail_procs = omp_get_num_procs();
114 printf("# of available processors = %d\n",n_avail_procs);
115
116 if (argc <2){printf("need exactly 1 argument for #threads\n");return 0;}
117
118 int caseID = atoi(argv[1]);
119
120 int startFaultCaseID = atoi(argv[2]);
121
122 int endFaultCaseID = atoi(argv[3]);
123 int nTotalFaultCases = endFaultCaseID-startFaultCaseID+1;
124
125 int n_assigned_threads = atoi(argv[4]);
126
127 omp_set_num_threads(n_assigned_threads);
128 printf("#threads assigned = %d\n",n_assigned_threads);
54
APPENDIX A. APPENDIX
129
130 double start_time, end_time, exe_time;
131
132 //bool toldToStop = 0;
133 int othersStop = 0;
134
135 int tid;
136 gsl_set_error_handler_off();
137 /*=========================Build the sub-cases======================== */
138 //BUILD CASE===Lars_prob Lars_sub(caseID);
139
140 size_t nRow;
141 size_t nBr;
142
143 gsl_vector *b;
144 gsl_matrix *groupIdx;
145
146 FILE *f;
147 int m, n;
148 int row, col;
149 double entry;
150 /* Read Original Am */
151 gsl_matrix *Am;
152 char s[20];
153 sprintf(s, "Data/ieee%d/Aeq_%d.dat", caseID, caseID);
154 printf("reading %s\n", s);
155
156 f = fopen(s, "r");
157 if(f == NULL) {
158 printf("ERROR: %s does not exist, exiting.\n", s);
159 exit(EXIT_FAILURE);
160 }
161 //mm_read_mtx_array_size(f, &m, &n);
162 new_mm_read_mtx_array_size(f, &m, &n);
55
APPENDIX A. APPENDIX
163 Am = gsl_matrix_calloc(m, n);
164 for(int i = 0; i < m*n; i++) {
165 row = i % m;
166 col = floor((double)i/m);
167 fscanf(f, "%lf", &entry);
168 gsl_matrix_set(Am, row, col, entry);
169 }
170 fclose(f);
171
172 /* Read groupIdx */
173 sprintf(s, "Data/ieee%d/groupIdx_%d.dat", caseID, caseID);
174 printf("reading %s\n", s);
175
176 f = fopen(s, "r");
177 if(f == NULL) {
178 printf("ERROR: %s does not exist, exiting.\n", s);
179 exit(EXIT_FAILURE);
180 }
181 //mm_read_mtx_array_size(f, &m, &n);
182 new_mm_read_mtx_array_size(f, &m, &n);
183 groupIdx = gsl_matrix_calloc(m, n);
184 for(int i = 0; i < m*n; i++) {
185 row = i % m;
186 col = floor((double)i/m);
187 fscanf(f, "%lf", &entry);
188 gsl_matrix_set(groupIdx, row, col, entry);
189 }
190 fclose(f);
191
192 nRow = Am->size1;
193 //nTotalCol = Am->size2;
194 nBr = groupIdx->size1;
195 size_t nGroupPerWorker = floor((double)nBr/n_assigned_threads);
196 //beta = gsl_vector_calloc(nCol);
56
APPENDIX A. APPENDIX
197 /*======== Decompose A =========*/
198
199 //#pragma omp barrier
200 //start_time = omp_get_wtime();
201 gsl_vector *time_array;
202 time_array = gsl_vector_calloc(nTotalFaultCases);
203
204 int iFaultCase ;
205 for(iFaultCase = startFaultCaseID; iFaultCase <= endFaultCaseID; iFaultCase
++ )
206 {
207 /* Read b */
208 sprintf(s, "Data/ieee%d/beq_%d.dat", caseID, iFaultCase);
209 printf("reading %s\n", s);
210
211 f = fopen(s, "r");
212 if(f == NULL) {
213 printf("ERROR: %s does not exist, exiting.\n", s);
214 exit(EXIT_FAILURE);
215 }
216 //mm_read_mtx_array_size(f, &m, &n);
217 new_mm_read_mtx_array_size(f, &m, &n);
218 b = gsl_vector_calloc(m);
219 for (int i = 0; i < m; i++) {
220 fscanf(f, "%lf", &entry);
221 gsl_vector_set(b, i, entry);
222 }
223 fclose(f);
224 #pragma omp parallel private(tid)
225 {
226 size_t nCol;
227 gsl_matrix *A;
228 bool no_xtx = 0;
229 gsl_vector *beta;
57
APPENDIX A. APPENDIX
230 std::vector<size_t> active;
231 int nIter = 0;
232 int canPrintSol = 0;
233 tid = omp_get_thread_num();
234
235 nCol = 0;
236
237 std::vector<size_t> temp_idx;
238 size_t groupIdx_temp;
239 for(size_t t = (tid)*nGroupPerWorker; t < std::min(nBr,(tid+1)*
nGroupPerWorker); t++)
240 {
241 for(size_t p = 0; p < 4; p++)
242 {
243 groupIdx_temp = (size_t)gsl_matrix_get(groupIdx, t, p)-1;
244 std::vector<size_t>::iterator it;
245 it = std::find(temp_idx.begin(), temp_idx.end(), groupIdx_temp);
246 if (it == temp_idx.end()) // not exist
247 {
248 temp_idx.push_back(groupIdx_temp);
249 nCol += 1;
250 }
251 }
252 }
253 std::sort(temp_idx.begin(),temp_idx.end());
254 A = gsl_matrix_calloc(nRow,nCol);
255 beta = gsl_vector_calloc(nCol);
256
257 gsl_vector *A_temp_col;
258 A_temp_col = gsl_vector_calloc(nRow);
259
260 for(size_t p = 0; p < nCol; p++)
261 {
262 gsl_matrix_get_col(A_temp_col, Am, temp_idx[p]);
58
APPENDIX A. APPENDIX
263 gsl_matrix_set_col (A, p, A_temp_col);//Aeq_temp = Aeq(:,Idx_worker{
k,1});
264 }
265 gsl_vector_free(A_temp_col);
266 temp_idx.clear();
267 //gsl_matrix_free(Am);
268 /*========================END OF CASE BUILDING
===========================*/
269
270 #pragma omp barrier
271 start_time = omp_get_wtime();
272
273 /*========================SOLVE THE SUB-PROB
==============================*/
274 std::vector <size_t> all_candidate;
275 for(size_t i=0; i<nCol; i++) all_candidate.push_back(i);
276 gsl_matrix *x;
277 x = gsl_matrix_calloc(nRow, nCol);
278 gsl_matrix_memcpy(x, A);
279 gsl_vector *temp_col;
280 temp_col = gsl_vector_calloc(nRow);
281 double temp_sum_mean;
282 double temp_access;
283 for(size_t i = 0; i < nCol; i++)// i th column
284 {
285 gsl_matrix_get_col(temp_col, x, i);//x_ith_col->temp_col
286 temp_sum_mean = sum_gsl_vec(temp_col);
287 temp_sum_mean = temp_sum_mean/nRow;
288
289 gsl_vector_add_constant(temp_col, -temp_sum_mean);//substract mean
290 gsl_matrix_set_col(x, i, temp_col);// copy temp_col back to x
291 }
292 // x = x./ sx_rep;
293 gsl_vector *sx;
59
APPENDIX A. APPENDIX
294 sx = gsl_vector_calloc(nCol);
295 gsl_matrix *x2;
296 x2 = gsl_matrix_calloc(nRow, nCol);
297 gsl_vector *temp_col2;
298 temp_col2 = gsl_vector_calloc(nRow);
299 gsl_matrix_memcpy(x2, x);
300 gsl_matrix_mul_elements(x2, x);//x2<-x2.*x;
301 for(size_t i = 0; i < nCol; i++)// i th column
302 {
303 gsl_matrix_get_col(temp_col, x, i);
304 gsl_matrix_get_col(temp_col2, x2, i);
305 temp_sum_mean = sum_gsl_vec(temp_col2);
306 temp_sum_mean = sqrt(temp_sum_mean);
307 //if (nCol*nCol > 1000000)
308 if (nCol*nCol > 100000000)
309 {
310 if(temp_sum_mean < RESOLUTION_OF_LARS)
311 {
312 temp_sum_mean = GSL_POSINF;
313 //printf("sx(%lu)<RESOLUTION_OF_LARS, need to be erased from
all_candidate\n",i);
314 all_candidate.erase(all_candidate.begin()+i);//all_candidate
= find(sx < GSL_POSINF);
315 }
316 }
317
318 gsl_vector_set(sx, i, temp_sum_mean);
319 gsl_vector_scale(temp_col, 1/temp_sum_mean);//x = x./ sx_rep;
320 gsl_matrix_set_col(x, i, temp_col);// copy temp_col back to x
321 }
322 gsl_matrix_free(x2);
323 gsl_vector_free(temp_col);
324 gsl_vector_free(temp_col2);
325
60
APPENDIX A. APPENDIX
326 //
327 gsl_matrix *xtx;
328 //xtx = gsl_matrix_calloc(nCol, nCol);
329 //if (nCol*nCol > 10000000)
330 if (nCol*nCol > 1000000000)
331 {
332 //printf("[%d]Too large matrix, lars will not pre-calculate xtx\n",
tid);
333 no_xtx = 1;
334 }
335 else
336 {
337 xtx = gsl_matrix_calloc(nCol, nCol);
338
339 //xtx = xˆT*x;
340
341 gsl_matrix *temp_c;
342 temp_c = gsl_matrix_alloc(nCol,nCol);
343 gsl_matrix_set_zero(temp_c);
344 gsl_blas_dgemm(CblasTrans, CblasNoTrans, 1.0, x, x, 1.0, temp_c);
345 gsl_matrix_memcpy(xtx, temp_c);
346
347 }
348 gsl_vector *y;
349 y = gsl_vector_calloc(nRow);
350
351 gsl_vector_memcpy(y, b);
352
353 double my;
354 my = sum_gsl_vec(b)/nRow;
355 gsl_vector_add_constant(y, -my);//y = b - my;
356 /*==========INITIALIZATION==========*/
357 active.clear();
358 //inactive = all_candidate;
61
APPENDIX A. APPENDIX
359 std::vector<size_t> inactive(all_candidate);
360 gsl_vector *mu_a;
361 mu_a = gsl_vector_calloc(nRow);
362 gsl_vector_set_zero(mu_a);
363
364 gsl_vector *mu_a_plus;
365 mu_a_plus = gsl_vector_calloc(nRow);
366 gsl_vector *mu_a_OLS;
367 mu_a_OLS = gsl_vector_calloc(nRow);
368
369 //gsl_vector *beta;
370 //beta = gsl_vector_calloc(nCol);
371 gsl_vector_set_zero(beta);
372 gsl_vector *beta_new;
373 beta_new = gsl_vector_calloc(nCol);
374 gsl_vector *beta_OLS;
375 beta_OLS = gsl_vector_calloc(nCol);
376
377 gsl_vector *y2;
378 y2 = gsl_vector_calloc(nRow);
379
380 gsl_vector_memcpy(y2, y);
381 gsl_vector_mul(y2, y);
382 double MSE = sum_gsl_vec(y2);
383 MSE = MSE/nRow;
384 gsl_vector_free(y2);
385
386 gsl_vector *c; //correlation vector
387 c = gsl_vector_calloc(nCol);
388 double C_max;
389 //C_max = max(abs(c));
390 size_t C_max_ind;
391 std::vector<size_t> C_max_ind_pl;
392 std::vector<size_t> drop;
62
APPENDIX A. APPENDIX
393 /*==========MAIN LOOP===============*/
394 //othersStop = 0;
395 //MPI_Bcast(&othersStop, 1, MPI_INT, tid, MPI_COMM_WORLD);
396 while(nIter < MAX_ITER )//&& othersStop==0)
397 {
398 if(othersStop == 1)
399 {
400 canPrintSol = 0;
401 break;
402 }
403
404 if(MSE <= MSE_th)
405 {
406 printf("[%d]MSE <= MSE_th, break.\n", tid);
407 canPrintSol = 1;
408 break;
409 }
410
411 nIter += 1;
412 //c = xˆT*(y-mu_a);
413 gsl_vector *y_mu_a;
414 y_mu_a = gsl_vector_calloc(nRow);
415
416 gsl_vector_memcpy(y_mu_a, y);
417 gsl_vector_sub(y_mu_a, mu_a);
418
419 gsl_vector_set_zero(c);
420 gsl_blas_dgemv(CblasTrans, 1.0, x, y_mu_a, 1.0, c);
421
422 gsl_vector_free(y_mu_a);
423
424 double c_temp;
425 gsl_vector *abs_c_inactive;
426 abs_c_inactive = gsl_vector_calloc((size_t)inactive.size());
63
APPENDIX A. APPENDIX
427 for(size_t k = 0; k < (size_t)inactive.size(); k++)
428 {
429 c_temp = gsl_vector_get(c,inactive[k]);
430 if(c_temp <0) c_temp = -c_temp;
431
432 gsl_vector_set(abs_c_inactive, k, c_temp);
433 }
434 C_max = gsl_vector_max(abs_c_inactive);
435 C_max_ind = gsl_vector_max_index(abs_c_inactive);
436 C_max_ind = inactive[C_max_ind];//C_max_ind = inactive(C_max_ind);
437 //active = sort(union(active,C_max_ind));
438 std::vector<size_t>::iterator it;
439 it = std::find(active.begin(), active.end(), C_max_ind);
440 if(it == active.end())
441 {
442 active.push_back(C_max_ind);
443 //if(tid==10) fprintf(f_act_record, "[%d] %lu added to active
set\n", nIter, C_max_ind);
444 }
445 std::sort(active.begin(), active.end());
446 //C_max_ind_pl = abs(c(inactive))>C_max-RESOLUTION_OF_LARS;
447 for(size_t k = 0; k < abs_c_inactive->size; k++)
448 {
449 if(gsl_vector_get(abs_c_inactive,k) > (C_max-RESOLUTION_OF_LARS)
)
450 {
451 C_max_ind_pl.push_back(inactive[k]);//C_max_ind_pl =
inactive(C_max_ind_pl);
452 it = std::find(active.begin(), active.end(), inactive[k]);
453 if (it == active.end())
454 {
455 active.push_back(inactive[k]);//active = sort(union(
active, C_max_ind_pl));
456 //if(tid==10)fprintf(f_act_record,"[%d]%lu added to
64
APPENDIX A. APPENDIX
active set\n", nIter, inactive[k]);
457 }
458 }
459 }
460 std::sort(active.begin(), active.end());
461 gsl_vector_free(abs_c_inactive);
462
463 std::vector<size_t> inactive_temp;
464 //inactive = setdiff(all_candidate, active);
465 std::set_difference(all_candidate.begin(),all_candidate.end(),active
.begin(),active.end(), std::inserter(inactive_temp,
inactive_temp.begin()));
466 inactive = inactive_temp;
467 std::vector<size_t>::iterator it2;
468 it2 = std::find(drop.begin(), drop.end(), C_max_ind);
469
470 if(!drop.empty() && it2 == drop.end())
471 {
472 //active(find(active==C_max_ind))=[];
473 //printf("[%d][%d]Dropped item and index of maximum correlation
is not the same. But it is being ignored here...\n", tid,
nIter);
474 canPrintSol = 0;
475 break;
476 }
477 if(!drop.empty())
478 {
479 C_max_ind = 0;
480 C_max_ind_pl.clear();
481 //printf("size of drop is %lu, drop has %lu.\n",drop.size(),
drop[0]);
482 }
483
484 //active = setdiff(active,drop);
65
APPENDIX A. APPENDIX
485 std::vector<size_t> active_temp;
486 std::set_difference(active.begin(),active.end(),drop.begin(),drop.
end(), std::inserter(active_temp, active_temp.begin()));
487 active = active_temp;
488
489 //inactive = sort(union(inactive,drop));
490 std::sort(drop.begin(), drop.end());
491 std::set_union(inactive.begin(),inactive.end(),drop.begin(),drop.end
(),inactive_temp.begin());
492 //
493 gsl_vector *x_active_col;
494 x_active_col = gsl_vector_calloc(nRow);
495 gsl_matrix *xa;
496 xa = gsl_matrix_calloc(nRow, (size_t)active.size());
497 gsl_matrix *ga;
498 ga = gsl_matrix_calloc((size_t)active.size(), (size_t)active.size())
;
499
500 double s_temp_0;
501 gsl_vector *s;
502 if(active.empty())
503 {
504 printf("[%d][%d] active is empty, maybe hard to allocate s\n",
tid, nIter);
505 }
506 s = gsl_vector_calloc(active.size());
507
508 for(size_t k = 0; k < (size_t)active.size(); k++)
509 {
510 double c_active_temp = gsl_vector_get(c, active[k]);
511 if(c_active_temp >= RESOLUTION_OF_LARS)//s = sign(c(active));
512 {gsl_vector_set(s,k,1); s_temp_0 = 1;}//s = 1;
513
514 if(c_active_temp <= -RESOLUTION_OF_LARS)
66
APPENDIX A. APPENDIX
515 {gsl_vector_set(s,k,-1); s_temp_0 = -1;}//s = -1;
516
517 if(c_active_temp < RESOLUTION_OF_LARS && c_active_temp > -
RESOLUTION_OF_LARS)
518 {gsl_vector_set(s,k,0); s_temp_0 = 0;}//s = 0;
519 //printf("s = sign(c(active)), s[k] = %f\n",s_temp_0);
520 gsl_matrix_get_col(x_active_col, x, active[k]);//xa = x(:,active
).*repmat(sˆT,n,1);
521 gsl_vector_scale(x_active_col, s_temp_0);
522 gsl_matrix_set_col(xa, k, x_active_col);
523 }
524 gsl_vector_free(x_active_col);
525
526 if(!no_xtx)
527 {
528 double ssT_temp;
529 double xtx_temp;
530 for(size_t i = 0; i < (size_t)active.size(); i++)//ga = xtx(
active,active).*(s*sˆT);
531 {
532 for(size_t j = 0; j < (size_t)active.size(); j++)
533 {
534 xtx_temp = gsl_matrix_get(xtx, active[i], active[j]);
535 //printf("xtx(active[%d],active[%d]) = %lf\n",active[i],
active[j],xtx_temp);
536 ssT_temp = gsl_vector_get(s,i)*gsl_vector_get(s,j);
537 xtx_temp = xtx_temp*ssT_temp;
538 //printf("after xtx_temp*s*s \n");
539 gsl_matrix_set(ga, i, j, xtx_temp);
540 //printf("ga(%d,%d) = %lf\n",i,j,xtx_temp);
541 }
542 }
543 }
544 gsl_vector_free(s);
67
APPENDIX A. APPENDIX
545
546 if(no_xtx)//ga = xaˆT*xa;
547 {
548 gsl_matrix_set_zero(ga);
549 gsl_blas_dgemm(CblasTrans, CblasNoTrans, 1.0, xa, xa, 1.0, ga);
550
551 }
552 if(ga->size2<1)
553 {
                printf("[%d][%d] the nCol of ga is zero, hard to allocate Sga.\n", tid, nIter);
            }
            //invga = ga\eye(size(ga,1));
            int ss;
            gsl_matrix *invga = gsl_matrix_calloc(ga->size2, ga->size2);
            gsl_permutation *perm = gsl_permutation_alloc(ga->size2);
            //if(tid == 10) printf("[%d][%d] before inverting ga.\n", tid, nIter);

            if(othersStop == 1)
            {
                canPrintSol = 0;
                break;
            }

            gsl_matrix *ga_copy = gsl_matrix_calloc(ga->size1, ga->size2);
            gsl_matrix_memcpy(ga_copy, ga);
            gsl_matrix *Vga = gsl_matrix_calloc(ga->size2, ga->size2);
            gsl_vector *Sga = gsl_vector_calloc(ga->size2);

            int status = gsl_linalg_SV_decomp_jacobi(ga_copy, Vga, Sga);

            if(status)
            { /* an error occurred */
                printf("[%d][%d] Jacobi error occurred\n", tid, nIter);
                canPrintSol = 0;
                break;
            }

            //gsl_linalg_SV_decomp_jacobi(ga_copy, Vga, Sga);
            if(gsl_vector_min(Sga) < 1e-9)
            {
                //printf("[%d][%d] ga is not invertible\n", tid, nIter);
                canPrintSol = 0;
                break;
            }
            else
            {
                double cond_ga;
                cond_ga = gsl_vector_max(Sga)/gsl_vector_min(Sga);
                if(cond_ga > 1e9)
                {
                    //printf("[%d][%d] cond(ga) is too large\n", tid, nIter);
                    canPrintSol = 0;
                    break;
                }
            }
            gsl_matrix_free(ga_copy);
            gsl_matrix_free(Vga);
            gsl_vector_free(Sga);

            gsl_linalg_LU_decomp(ga, perm, &ss);
            gsl_linalg_LU_invert(ga, perm, invga);// THIS MIGHT CHANGE GA?
            //if(tid == 10) printf("[%d][%d] after inverting ga.\n", tid, nIter);
            if(ga->size1 < 1)
            {
                printf("[%d][%d] the nRow of ga is zero, hard to allocate invga_col.\n", tid, nIter);
            }
            if(ga->size2 < 1)
            {
                printf("[%d][%d] the nCol of ga is zero, hard to allocate sum_invga_col&invga_row.\n", tid, nIter);
            }
            gsl_vector *invga_col;
            invga_col = gsl_vector_calloc(ga->size1);
            gsl_vector *sum_invga_col;
            sum_invga_col = gsl_vector_calloc(ga->size2);
            for(size_t k = 0; k < ga->size2; k++)
            {
                gsl_matrix_get_col(invga_col, invga, k);
                gsl_vector_set(sum_invga_col, k, sum_gsl_vec(invga_col));
            }
            double aa;//aa = sum(sum(invga))^(-1/2);
            aa = sum_gsl_vec(sum_invga_col);
            aa = sqrt(aa);
            aa = 1/aa;
            //printf("aa = %lf\n", aa);
            gsl_vector_free(invga_col);
            gsl_vector_free(sum_invga_col);
            //wa = aa*sum(invga,2);
            gsl_vector *invga_row;
            invga_row = gsl_vector_calloc(ga->size2);

            if(invga->size1 < 1)
            {
                printf("[%d][%d] the nRow of invga is zero, hard to allocate wa.\n", tid, nIter);
            }
            gsl_vector *wa;
            wa = gsl_vector_calloc(invga->size1);
            double wa_temp;
            for(size_t k = 0; k < ga->size1; k++)
            {
                gsl_matrix_get_row(invga_row, invga, k);
                wa_temp = aa*sum_gsl_vec(invga_row);

                gsl_vector_set(wa, k, wa_temp);
            }
            gsl_vector_free(invga_row);
            gsl_matrix_free(invga);
            gsl_permutation_free(perm);
            //ua = xa*wa;
            gsl_vector *ua;
            ua = gsl_vector_calloc(nRow);
            gsl_vector_set_zero(ua);
            gsl_blas_dgemv(CblasNoTrans, 1.0, xa, wa, 1.0, ua);
            // test using Eq 2.7
            if(xa->size2 < 1)
            {
                printf("[%d][%d] the nCol of xa is zero, hard to allocate test_1.\n", tid, nIter);
            }
            gsl_vector *test_1;//test_1 = xa^T*ua;
            test_1 = gsl_vector_calloc(xa->size2);
            gsl_vector_set_zero(test_1);
            gsl_blas_dgemv(CblasTrans, 1.0, xa, ua, 1.0, test_1);
            if(test_1->size < 1)
            {
                printf("[%d][%d] the size of test_1 is zero, hard to allocate test_2.\n", tid, nIter);
            }
            gsl_vector *test_2; //test_2 = aa*ones(size(test_1));
            test_2 = gsl_vector_calloc(test_1->size);

            double test_1_2_temp;
            for(size_t k = 0; k < xa->size2; k++)
            {
                test_1_2_temp = gsl_vector_get(test_1, k);
                test_1_2_temp = test_1_2_temp - aa;
                if(test_1_2_temp < 0) test_1_2_temp = -test_1_2_temp;
                gsl_vector_set(test_2, k, test_1_2_temp);
            }

            double test_1_2 = sum_gsl_vec(test_2);

            double test_3;

            gsl_blas_ddot(ua, ua, &test_3);

            test_3 = sqrt(test_3);
            test_3 = test_3 - 1;
            if(test_3 < 0) test_3 = -test_3;

            gsl_vector_free(test_1);
            gsl_vector_free(test_2);
            if(test_1_2 > RESOLUTION_OF_LARS*100 || test_3 > RESOLUTION_OF_LARS*100)
            {
                //printf("[%d][%d] Eq 2.7 test failure.\n", tid, nIter);
                canPrintSol = 0;
                break;
            }
            //a = x^T*ua;
            gsl_vector *a;
            a = gsl_vector_calloc(nCol);
            gsl_vector_set_zero(a);
            gsl_blas_dgemv(CblasTrans, 1.0, x, ua, 1.0, a);

            double c_inactive_temp;
            double a_inactive_temp;
            double tmp_1;
            double tmp_2;
            std::vector<double> tmp;
            for(size_t k = 0; k < (size_t)inactive.size(); k++)
            {
                c_inactive_temp = gsl_vector_get(c, inactive[k]);//tmp_1 = (C_max-c(inactive))./(aa-a(inactive));
                a_inactive_temp = gsl_vector_get(a, inactive[k]);//tmp_2 = (C_max+c(inactive))./(aa+a(inactive));
                tmp_1 = (C_max-c_inactive_temp)/(aa-a_inactive_temp);
                tmp_2 = (C_max+c_inactive_temp)/(aa+a_inactive_temp);
                //gsl_matrix_set(tmp_3, k, 1, tmp_1);//tmp_3 = [tmp_1,tmp_2];
                //gsl_matrix_set(tmp_3, k, 2, tmp_2);
                if(tmp_1 > 0) tmp.push_back(tmp_1);//tmp = tmp_3(find(tmp_3>0));
                if(tmp_2 > 0) tmp.push_back(tmp_2);
            }
            double gamm;
            if(tmp.empty())
                gamm = C_max/aa;
            else
                gamm = *std::min_element(tmp.begin(), tmp.end());//gamma = min(tmp);

            gsl_vector *d;//d = zeros(1,nCol);
            d = gsl_vector_calloc(nCol);
            gsl_vector_set_zero(d);
            gsl_vector *tmpp;//tmp = zeros(1,nCol);
            tmpp = gsl_vector_calloc(nCol);
            gsl_vector_set_zero(tmpp);
            double d_temp;
            double beta_temp;
            std::vector<double> tmpp2;
            double s_temp;
            for(size_t k = 0; k < (size_t)active.size(); k++)
            {
                double c_active_temp2 = gsl_vector_get(c, active[k]);
                if(c_active_temp2 >= RESOLUTION_OF_LARS)//s = sign(c(active));
                    s_temp = 1;
                if(c_active_temp2 <= -RESOLUTION_OF_LARS)
                    s_temp = -1;
                if(c_active_temp2 < RESOLUTION_OF_LARS && c_active_temp2 > -RESOLUTION_OF_LARS)
                    s_temp = 0;
                //printf("active[k]=%d, s_temp = %f\n", active[k], s_temp);
                d_temp = s_temp*gsl_vector_get(wa, k);
                //printf("d[active[k]] = s_temp.*wa = %lf\n", d_temp);
                if(d_temp == 0)//if (length(find(d(active)==0)))
                {
                    //printf("[%d][%d] Something wrong with vector d: Eq 3.4\n", tid, nIter);
                    canPrintSol = 0;
                    break;
                }
                gsl_vector_set(d, active[k], d_temp);//d(active) = s.*wa;
            }
            for(size_t k = 0; k < (size_t)active.size(); k++)
            {
                beta_temp = gsl_vector_get(beta, active[k]);
                d_temp = gsl_vector_get(d, active[k]);
                beta_temp = -beta_temp/d_temp;

                gsl_vector_set(tmpp, active[k], beta_temp);//tmp(active) = -1*beta(active)./d(active);
                if(beta_temp > 0)
                {
                    tmpp2.push_back(beta_temp);//tmp2 = tmp(find(tmp>0));
                    //printf("%lf goes to tmpp2 in tmp2 = tmp(find(tmp>0));\n", beta_temp);
                }
            }

            drop.clear();//drop = [];
            double gamm_tilde = GSL_POSINF;

            if(!tmpp2.empty())
            {
                std::vector<double>::iterator min_tmpp2_p;
                min_tmpp2_p = std::min_element(tmpp2.begin(), tmpp2.end());

                double & min_tmpp2 = *min_tmpp2_p;//????

                if(!tmpp2.empty() && gamm >= min_tmpp2)
                {
                    gamm_tilde = min_tmpp2; //gamma_tilde = min(tmp2);
                    for(size_t k = 0; k < nCol; k++)//drop = find(tmp==gamma_tilde);
                    {
                        if(gsl_vector_get(tmpp, k) == gamm_tilde)
                        {
                            drop.push_back(k);
                            //if(tid==10) fprintf(f_act_record, "[%d] %lu added to drop at line: 808\n", nIter, k);
                        }
                    }
                }
            }
            tmpp2.clear();
            gsl_vector_free(tmpp);

            double min_gamm = gamm;
            if(gamm > gamm_tilde) min_gamm = gamm_tilde;
            gsl_vector_memcpy(mu_a_plus, mu_a);
            gsl_blas_daxpy(min_gamm, ua, mu_a_plus);//mu_a_plus = mu_a + min(gamm,gamm_tilde)*ua;

            gsl_vector_memcpy(beta_new, beta);
            gsl_blas_daxpy(min_gamm, d, beta_new);//beta_new = beta + min(gamm,gamm_tilde)*d;

            //active = setdiff(active,drop);
            std::vector<size_t> active_temp2;
            std::set_difference(active.begin(), active.end(), drop.begin(), drop.end(), std::inserter(active_temp2, active_temp2.begin()));
            active = active_temp2;
            active_temp2.clear();

            //inactive = setdiff(all_candidate,active);
            std::vector<size_t> inactive_temp2;
            std::set_difference(all_candidate.begin(), all_candidate.end(), active.begin(), active.end(), std::inserter(inactive_temp2, inactive_temp2.begin()));
            inactive = inactive_temp2;
            inactive_temp2.clear();
            //printf("line 721: size of inactive is %lu.\n", inactive.size());
            for(size_t k = 0; k < (size_t)drop.size(); k++)//beta_new(drop)=0;
            {
                gsl_vector_set(beta_new, drop[k], 0);
            }
            double C_max_aa;
            C_max_aa = C_max/aa;
            gsl_vector_memcpy(mu_a_OLS, mu_a);
            gsl_blas_daxpy(C_max_aa, ua, mu_a_OLS);//mu_a_OLS = mu_a + C_max/aa*ua;

            gsl_vector_memcpy(beta_OLS, beta);
            gsl_blas_daxpy(C_max_aa, d, beta_OLS);//beta_OLS = beta + C_max/aa*d;

            //MSE = sum((y-mu_a_OLS).^2)/length(y);
            gsl_vector *y_minus_mu_a;
            y_minus_mu_a = gsl_vector_calloc(nRow);
            gsl_vector_memcpy(y_minus_mu_a, y);
            gsl_vector_sub(y_minus_mu_a, mu_a_OLS);
            gsl_vector_mul(y_minus_mu_a, y_minus_mu_a);
            double sum_y_mu_a = sum_gsl_vec(y_minus_mu_a);
            MSE = sum_y_mu_a/nRow;

            gsl_vector_free(y_minus_mu_a);
            gsl_vector_free(d);
            gsl_vector_free(ua);
            gsl_vector_free(wa);
            gsl_vector_free(a);
            gsl_matrix_free(xa);
            gsl_matrix_free(ga);

            //update
            gsl_vector_memcpy(mu_a, mu_a_plus);//mu_a = mu_a_plus;
            gsl_vector_memcpy(beta, beta_new);//beta = beta_new;
            double d_beta_temp;
            gsl_vector *beta_backup;
            beta_backup = gsl_vector_calloc(nCol);
            gsl_vector_memcpy(beta_backup, beta);
            if(MSE <= MSE_th)
            {
                //beta = beta(active)./sx(active);
                gsl_vector_set_zero(beta);
                for(size_t k = 0; k < (size_t)active.size(); k++)
                {
                    d_beta_temp = gsl_vector_get(beta_backup, active[k]);
                    d_beta_temp = d_beta_temp/gsl_vector_get(sx, active[k]);
                    gsl_vector_set(beta, active[k], d_beta_temp);
                }
                double temp_beta_element;
                std::vector<size_t> active_backup(active);
                for(size_t k = 0; k < (size_t)active_backup.size(); k++)
                {
                    temp_beta_element = gsl_vector_get(beta, active_backup[k]);
                    if (temp_beta_element >= -BETA_TH && temp_beta_element <= BETA_TH)
                    {
                        std::vector<size_t>::iterator it_act;
                        it_act = std::find(active.begin(), active.end(), active_backup[k]);

                        active.erase(it_act);
                        gsl_vector_set(beta, active_backup[k], 0);
                    }
                }
                active_backup.clear();
                if(active.size() > 4)
                {
                    //printf("[%d] Invalid result despite convergence.\n", tid);
                    canPrintSol = 0;
                    break;
                }
                canPrintSol = 1;
                //othersStop = 1;//need to be broadcasted!
                break;
            }
            gsl_vector_free(beta_backup);
        }
        //if(nIter >= MAX_ITER){printf("[%d] Reached nIter limit!\n", tid);}

        gsl_vector_free(mu_a_plus);
        gsl_vector_free(mu_a_OLS);
        gsl_vector_free(beta_new);
        gsl_vector_free(beta_OLS);
        gsl_vector_free(c);
        gsl_vector_free(mu_a);
        gsl_vector_free(y);
        gsl_matrix_free(x);
        gsl_vector_free(sx);

        if(!no_xtx) {gsl_matrix_free(xtx);}

        if(!canPrintSol)
        {
            gsl_matrix_free(A);
            //gsl_vector_free(b);
            gsl_vector_free(beta);
            //gsl_matrix_free(groupIdx);
            active.clear();
        }

        if(canPrintSol)// && !othersStop)
        {
            othersStop = 1;

            end_time = omp_get_wtime();
            exe_time = end_time - start_time;
            printf("========Solved successfully by tid %d========\n", tid);

            printf("[%d]Thread Runtime = %f\n", tid, exe_time);
            printf("============Solution by tid %d============\n", tid);
            //fprintf(f_record, "[%d]Runtime = %f\n", tid, exe_time);
            //printf("[%d]Runtime = %f\n", tid, exe_time);
            printf("[%d]====Number of Iteration: %d====\n", tid, nIter);
            for(size_t iii = 0; iii < active.size(); iii++)
            {
                printf("[%d]------active_idx:%lu------beta[active_idx]:%6.8f------\n", tid, active[iii], gsl_vector_get(beta, active[iii]));
            }

            gsl_matrix_free(A);
            //gsl_vector_free(b);
            gsl_vector_free(beta);
            //gsl_matrix_free(groupIdx);
            active.clear();
            printf("====************************************************************====\n");

        }
        #pragma omp barrier
        end_time = omp_get_wtime();
        if(tid == 0)
        {
            exe_time = end_time - start_time;
            printf("[%d] Runtime = %f\n", iFaultCase, exe_time);
            gsl_vector_set(time_array, iFaultCase-startFaultCaseID, exe_time);
        }
    }
    gsl_vector_free(b);
}
//end_time = omp_get_wtime();
//exe_time = end_time - start_time;

printf("=============Average Runtime = %f\n", sum_gsl_vec(time_array)/nTotalFaultCases);

gsl_matrix_free(Am);
gsl_matrix_free(groupIdx);
gsl_vector_free(time_array);

/*========================END OF SOLVING==============================*/

return 0;
}
A.3 CUDA C Code for GPU Implementation

The code attached below is the CUDA C code used for the GPU implementation.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <cuda_runtime.h>
// for std
#include <math.h>
#include <algorithm>
#include <string>
#include <vector>
// includes, cuda
//#include <cublas.h>
#include <cublas_v2.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>

#include <thrust/sort.h>
#include <thrust/merge.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/set_operations.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/device_ptr.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/extrema.h>

#include <thrust/functional.h>
#include <thrust/sequence.h>
#include <thrust/fill.h>


#define MM_MAX_LINE_LENGTH 1025
#define MM_PREMATURE_EOF 12
#define PI 3.141592653589793238462643383279
#define MSE_th 0.001
#define RESOLUTION_OF_LARS 1e-12
#define REGULARIZATION_FACTOR 1e-11
#define MAX_ITER 50
#define INF 1e12
#define BETA_TH 1e-3

// matrix indexing convention
#define id(m, n, ld) (((n) * (ld) + (m)))

int new_mm_read_mtx_array_size(FILE *f, int *M, int *N)
{
    char line[MM_MAX_LINE_LENGTH];
    int num_items_read;
    /* set return null parameter values, in case we exit with errors */
    *M = *N = 0;

    /* now continue scanning until you reach the end-of-comments */
    do
    {
        if (fgets(line, MM_MAX_LINE_LENGTH, f) == NULL)
            return MM_PREMATURE_EOF;
    } while (line[0] == '%');

    /* line[] is either blank or has M,N, nz */
    if (sscanf(line, "%d %d", M, N) == 2)
        return 0;

    else /* we have a blank line */
        do
        {
            num_items_read = fscanf(f, "%d %d", M, N);
            if (num_items_read == EOF) return MM_PREMATURE_EOF;
        }
        while (num_items_read != 2);

    return 0;
}

struct MinusC: public thrust::unary_function<double, double>
{
    const double offset;
    __host__ __device__ MinusC(double _offset) :
        offset(_offset)
    {
    }
    __host__ __device__ double operator()(double x)
    {
        return (double) x-offset;
    }
};
struct DivideByC: public thrust::unary_function<double, double>
{
    const double offset;
    __host__ __device__ DivideByC(double _offset) :
        offset(_offset)
    {
    }
    __host__ __device__ double operator()(double x)
    {
        return (double) x/offset;
    }
};
struct MultipleByC: public thrust::unary_function<double, double>
{
    const double offset;
    __host__ __device__ MultipleByC(double _offset) :
        offset(_offset)
    {
    }
    __host__ __device__ double operator()(double x)
    {
        return (double) x*offset;
    }
};
struct Inv: public thrust::unary_function<double, double>
{
    __host__ __device__ double operator()(double x)
    {
        return (double) 1.0 / x;
    }
};
struct Abs: public thrust::unary_function<double, double>
{
    __host__ __device__ double operator()(double x)
    {
        if(x < 0) x = -x;
        return (double) x;
    }
};
struct SquareRoot: public thrust::unary_function<double, double>
{
    __host__ __device__ double operator()(double x)
    {
        return (double) sqrt(x);
    }
};
struct Square: public thrust::unary_function<double, double>
{
    __host__ __device__ double operator()(double x)
    {
        return (double) x*x;
    }
};
struct FilterIgnores: public thrust::unary_function<double, double>
{
    __host__ __device__ double operator()(double x)
    {
        if(x < RESOLUTION_OF_LARS)
            return INF;
        else
            return x;
    }
};
// main
int main(int argc, char** argv)
{

    if (argc < 4){printf("need 3 arguments: caseID startFaultCaseID endFaultCaseID\n"); return 0;}

    //double start_time, end_time, exe_time;

    int caseID = atoi(argv[1]);
    printf("Will solve nBus = %d case.\n", caseID);

    int startFaultCaseID = atoi(argv[2]);

    int endFaultCaseID = atoi(argv[3]);
    int nTotalFaultCases = endFaultCaseID-startFaultCaseID+1;
    //int nProcs = atoi(argv[2]);
    /*=========================BUILD THE PROBLEM======================== */

    bool no_xtx = 0;

    FILE *f;
    int m, n;

    double entry;
    /* Read A */
    char s[64]; /* large enough for the formatted path */
    sprintf(s, "Data/ieee%d/Aeq_%d.dat", caseID, caseID);
    printf("reading %s\n", s);

    f = fopen(s, "r");
    if(f == NULL) {
        printf("ERROR: %s does not exist, exiting.\n", s);
        exit(EXIT_FAILURE);
    }
    //mm_read_mtx_array_size(f, &m, &n);
    new_mm_read_mtx_array_size(f, &m, &n);

    int nRow = m;
    int nCol = n;

    printf("nRow = %d, nCol = %d.\n", nRow, nCol);
    //A = (double*)malloc(nRow*nCol * sizeof(double));
    thrust::host_vector<double> A(nRow*nCol);

    int row, col;

    for(int i = 0; i < m*n; i++) {
        row = i % m;
        col = floor((double)i/m);
        fscanf(f, "%lf", &entry);
        //gsl_matrix_set(Am, row, col, entry);

        A[id(row, col, nRow)] = entry;
    }

    fclose(f);

    thrust::host_vector<double> timearray(nTotalFaultCases);
    int iFaultCase;
    for(iFaultCase = startFaultCaseID; iFaultCase <= endFaultCaseID; iFaultCase++)
    {

        /* Read b */
        sprintf(s, "Data/ieee%d/beq_%d.dat", caseID, iFaultCase);
        printf("reading %s\n", s);

        f = fopen(s, "r");
        if(f == NULL) {
            printf("ERROR: %s does not exist, exiting.\n", s);
            exit(EXIT_FAILURE);
        }
        //mm_read_mtx_array_size(f, &m, &n);
        new_mm_read_mtx_array_size(f, &m, &n);
        if(m != nRow) {
            printf("ERROR: length of b doesn't match with nRow of A.\n");
            exit(EXIT_FAILURE);
        }
        //b = gsl_vector_calloc(m);
        //b = (double*)malloc(nRow * sizeof(double));
        thrust::host_vector<double> b(nRow);

        for (int i = 0; i < nRow; i++) {
            fscanf(f, "%lf", &entry);
            b[id(i,0,nRow)] = entry;
        }
        fclose(f);

        /*====================INITIALIZATION OF TIMING AND CUBLAS=====================*/

        //clock_t start=clock();
        //double dtime;

        // initialize cublas
        cublasStatus_t status;
        cublasHandle_t hd;

        status = cublasCreate(&hd);


        if (status != CUBLAS_STATUS_SUCCESS)
        {
            fprintf(stderr, "!!!! CUBLAS initialization error\n");
            return EXIT_FAILURE;
        }
        else printf("CUBLAS initialization success\n");

        /*===========================SOLVE THE PROBLEM===============================*/

        // transfer host to device
        thrust::device_vector<double> dx(A.begin(), A.end());
        thrust::device_vector<double> dy(b.begin(), b.end());

        clock_t start=clock();
        double dtime;

        thrust::device_vector<int> all_candidate(nCol);
        // initialize all_candidate to 0,1,2,3,...: for(int i=0; i<nCol; i++) all_candidate.push_back(i);
        thrust::sequence(all_candidate.begin(), all_candidate.end(), 0);

        // mx = mean(xin); x = xin - repmat(mx,size(xin,1),1);
        thrust::device_vector<double> ones(nRow, 1);
        thrust::device_vector<double> sumA(nCol, 0);


        double* p_dx = thrust::raw_pointer_cast(&dx[0]);
        double* p_dy = thrust::raw_pointer_cast(&dy[0]);
        double* p_ones = thrust::raw_pointer_cast(&ones[0]);
        double* p_sumA = thrust::raw_pointer_cast(&sumA[0]);


        const double c1 = 1.0;
        const double c0 = 0.0;
        const double c_1 = -1.0;

        cublasDgemv(hd, CUBLAS_OP_T, nRow, nCol, &c1, p_dx, nRow, p_ones, 1, &c0, p_sumA, 1);

        thrust::transform(sumA.begin(), sumA.end(), sumA.begin(), DivideByC((double)nRow));

        // B nRow*nCol, make each row of B sumA[k]
        thrust::device_vector<double> BsumA(nRow*nCol);
        for(int k = 0; k < nCol; k++)// TOO SLOW!!!
        {
            thrust::fill(BsumA.begin()+k*nRow, BsumA.begin()+(k+1)*nRow, sumA[k]);
        }
        double* p_BsumA = thrust::raw_pointer_cast(&BsumA[0]);
        thrust::device_vector<double> dx_copy = dx;
        double* p_dxcopy = thrust::raw_pointer_cast(&dx_copy[0]);
        cublasDgeam(hd, CUBLAS_OP_N, CUBLAS_OP_N, nRow, nCol, &c1, p_dxcopy, nRow, &c_1, p_BsumA, nRow, p_dx, nRow);
        dx_copy.clear();

        // sx = sum(x.^2).^(1/2); ignores = find(sx < RESOLUTION_OF_LARS); sx(ignores)=inf;
        // x = x ./ repmat(sx,size(xin,1),1);
        thrust::device_vector<double> sx = dx;
        thrust::transform(sx.begin(), sx.end(), sx.begin(), Square());

        double* p_sx = thrust::raw_pointer_cast(&sx[0]);
        cublasDgemv(hd, CUBLAS_OP_T, nRow, nCol, &c1, p_sx, nRow, p_ones, 1, &c0, p_sumA, 1);

        thrust::transform(sumA.begin(), sumA.end(), sumA.begin(), SquareRoot());
        thrust::device_vector<double> ssx(sumA.begin(), sumA.end());

        //thrust::device_vector<int> all_candidate_backup = all_candidate;
        for(int k = 0; k < nCol; k++)
        {
            if(sumA[k] < RESOLUTION_OF_LARS)
            {
                sumA[k] = INF;
                all_candidate.erase(all_candidate.begin()+k);
            }
            //thrust::transform(dx.begin()+k*nRow, dx.begin()+(k+1)*nRow, dx.begin()+k*nRow, DivideByC(sumA[k]));
            //cublasDscal(hd, nCol, 1/sumA[k], p_dx, 1);
        }
        thrust::device_vector<double> forDivide(nCol*nCol, 0);
        for(int k = 0; k < nCol; k++)// TOO SLOW!!!
        {
            forDivide[id(k,k,nCol)] = 1/sumA[k];
        }
        double* p_forDivide = thrust::raw_pointer_cast(&forDivide[0]);
        thrust::device_vector<double> dx_copy2 = dx;
        double* p_dxcopy2 = thrust::raw_pointer_cast(&dx_copy2[0]);
        cublasDgemm(hd, CUBLAS_OP_N, CUBLAS_OP_N, nRow, nCol, nCol, &c1, p_dxcopy2, nRow, p_forDivide, nCol, &c0, p_dx, nRow);
        dx_copy2.clear();
        sumA.clear();
        ones.clear();
        sx.clear();

        thrust::device_vector<double> xtx(1, 0);
        //double* p_dxtx = thrust::raw_pointer_cast(&xtx[0]);
        if (nCol*nCol > 100000000)
        {
            printf("Too large matrix, lars will not pre-calculate xtx\n");
            no_xtx = 1;
        }
        else
        {
            xtx.resize(nCol*nCol);
            //thrust::device_vector<double> xtx(nCol*nCol, 0);
            double* p_dxtx = thrust::raw_pointer_cast(&xtx[0]);
            //xtx = x^T*x;
            cublasDgemm(hd, CUBLAS_OP_T, CUBLAS_OP_N, nCol, nCol, nRow, &c1, p_dx, nRow, p_dx, nRow, &c0, p_dxtx, nCol);
        }


        //y = yin - mean(yin);
        double mean_y;
        cublasDdot(hd, nRow, p_dy, 1, p_ones, 1, &mean_y);
        mean_y = mean_y/nRow;
        thrust::transform(dy.begin(), dy.end(), dy.begin(), MinusC(mean_y));
        ones.clear();

        /*==========INITIALIZATION==========*/
        thrust::device_vector<int> d_active;
        ////active.clear();
        //inactive = all_candidate;
        thrust::device_vector<int> inactive = all_candidate;
        thrust::device_vector<double> mu_a(nRow, 0);
        thrust::device_vector<double> mu_a_plus(nRow, 0);
        thrust::device_vector<double> mu_a_OLS(nRow, 0);
        thrust::device_vector<double> beta(nCol, 0);
        thrust::device_vector<double> beta_new(nCol, 0);
        thrust::device_vector<double> beta_OLS(nCol, 0);
        //MSE = sum(y.^2)/length(y);
        thrust::device_vector<double> y2 = dy;
        double* p_y2 = thrust::raw_pointer_cast(&y2[0]);
        thrust::transform(y2.begin(), y2.end(), y2.begin(), Square());
        double MSE;
        cublasDasum(hd, nRow, p_y2, 1, &MSE);
        MSE = MSE/nRow;
        y2.clear();

        thrust::device_vector<double> c(nCol);

        //double C_max;
        //C_max = max(abs(c));
        //int C_max_ind;
        thrust::device_vector<int> C_max_ind_pl;
        thrust::device_vector<int> drop;
        int nIter = 0;
        bool canPrintSol = 0;
        /*==============MAIN LOOP===============*/

        while(nIter < MAX_ITER)
        {

            if(MSE <= MSE_th)
            {
                printf("MSE <= MSE_th=%6.8f, break.\n", MSE_th);
                canPrintSol = 0;
                break;
            }
            const double cc1 = 1.0;
            const double cc0 = 0.0;
            nIter += 1;

            //c = x^T*(y-mu_a);
            thrust::device_vector<double> y_mu_a(nRow);
            thrust::transform(dy.begin(), dy.end(), mu_a.begin(), y_mu_a.begin(), thrust::minus<double>());


            double* p_y_mua = thrust::raw_pointer_cast(&y_mu_a[0]);
            thrust::device_vector<double> c(nCol, 0);
            double* p_c = thrust::raw_pointer_cast(&c[0]);

            cublasDgemv(hd, CUBLAS_OP_T, nRow, nCol, &cc1, p_dx, nRow, p_y_mua, 1, &cc0, p_c, 1);
            y_mu_a.clear();

            // [C_max,C_max_ind] = max(abs(c(inactive))); C_max_ind = inactive(C_max_ind);
            thrust::device_vector<int> C_max_ind(1);
            int* p_cmaxInd = thrust::raw_pointer_cast(&C_max_ind[0]);
            thrust::device_vector<double> c_inact(inactive.size());

            for(int k = 0; k < inactive.size(); k++)
            {
                c_inact[k] = c[inactive[k]];
                if(c_inact[k] < 0) c_inact[k] = -c_inact[k];
            }
            const double* p_cinact = thrust::raw_pointer_cast(&c_inact[0]);

            thrust::device_vector<double>::iterator iter_max = thrust::max_element(c_inact.begin(), c_inact.end());

            unsigned int position = iter_max - c_inact.begin();
            double C_max = *iter_max;

            C_max_ind[0] = inactive[position];

            for(int k = 0; k < inactive.size(); k++)
            {
                if(c_inact[k] >= C_max-RESOLUTION_OF_LARS && inactive[k] != C_max_ind[0]) C_max_ind.push_back(inactive[k]);
            }


            //active = sort(union(active,C_max_ind));
            // set_union returns an iterator C_end denoting the end of input
            thrust::device_vector<int>::iterator C_end;
            thrust::device_vector<int> new_active0(d_active.size()+C_max_ind.size());
            C_end = thrust::set_union(d_active.begin(), d_active.end(), C_max_ind.begin(), C_max_ind.end(), new_active0.begin());
            // shrink C to exactly fit output
            new_active0.erase(C_end, new_active0.end());
            d_active = new_active0;
            new_active0.clear();
            thrust::sort(d_active.begin(), d_active.end(), thrust::less<int>());


            //inactive = setdiff(all_candidate, active);
            thrust::device_vector<int>::iterator inact_end;
            inact_end = thrust::set_difference(all_candidate.begin(), all_candidate.end(), d_active.begin(), d_active.end(), inactive.begin());
            inactive.erase(inact_end, inactive.end());
            //if ~isempty(drop) & length(find(drop==C_max_ind))==0
            thrust::device_vector<int>::iterator iter;
            iter = thrust::find(drop.begin(), drop.end(), C_max_ind[0]);
            if(!drop.empty() && iter == drop.end())
            {
                printf("[%d] Dropped item and index of maximum correlation is not the same. But it is being ignored here...\n", nIter);
                canPrintSol = 0;
                break;
            }
            if(!drop.empty())
            {
                C_max_ind.clear();
            }
            //active = setdiff(active,drop);
            thrust::device_vector<int> new_active(d_active.size());
            thrust::device_vector<int>::iterator act_end;
            act_end = thrust::set_difference(d_active.begin(), d_active.end(), drop.begin(), drop.end(), new_active.begin());
            new_active.erase(act_end, new_active.end());
            d_active = new_active;
            new_active.clear();

            //inactive = sort(union(inactive,drop));
            thrust::device_vector<int> new_inactive(inactive.size()+drop.size());
            inact_end = thrust::set_union(inactive.begin(), inactive.end(), drop.begin(), drop.end(), new_inactive.begin());
            new_inactive.erase(inact_end, new_inactive.end());
            inactive = new_inactive;
            new_inactive.clear();
            thrust::sort(inactive.begin(), inactive.end(), thrust::less<int>());
            //
            if(d_active.empty())
            {
                printf("[%d] active is empty, maybe hard to allocate s\n", nIter);
            }
            //s = sign(c(active)); % eq 2.10   xa = x(:,active).*repmat(s',n,1); % eq 2.4
            thrust::device_vector<double> xa(nRow*d_active.size());
            thrust::device_vector<double> ga(d_active.size()*d_active.size());
            thrust::device_vector<double> s(d_active.size());
            double s_temp_0;
            //xa = x(:,active).*repmat(s^T,n,1);
            for(int k = 0; k < d_active.size(); k++)
            {
                double c_active_temp = c[d_active[k]];
                if(c_active_temp >= RESOLUTION_OF_LARS)//s = sign(c(active));
                {s[k] = 1; s_temp_0 = 1;}//s = 1;

                if(c_active_temp <= -RESOLUTION_OF_LARS)
                {s[k] = -1; s_temp_0 = -1;}//s = -1;

                if(c_active_temp < RESOLUTION_OF_LARS && c_active_temp > -RESOLUTION_OF_LARS)
                {s[k] = 0; s_temp_0 = 0;}//s = 0;
                thrust::copy(dx.begin()+d_active[k]*nRow, dx.begin()+(d_active[k]+1)*nRow, xa.begin()+k*nRow);
                thrust::transform(xa.begin()+k*nRow, xa.begin()+(k+1)*nRow, xa.begin()+k*nRow, MultipleByC(s_temp_0));
            }

            if(!no_xtx)//ga = xtx(active,active).*(s*s^T);
            {
                double ssT_temp;
                double xtx_temp;
                for(int i = 0; i < d_active.size(); i++)
                {
                    for(int j = 0; j < d_active.size(); j++)
                    {
                        xtx_temp = xtx[id(d_active[i], d_active[j], nCol)];
                        ssT_temp = s[i]*s[j];
                        xtx_temp = xtx_temp*ssT_temp;
                        ga[id(i,j,d_active.size())] = xtx_temp;
                    }
                }
            }
            double* p_xa = thrust::raw_pointer_cast(&xa[0]);
            double* p_ga = thrust::raw_pointer_cast(&ga[0]);

            if(no_xtx)//ga = xa^T*xa;
            {
                cublasDgemm(hd, CUBLAS_OP_T, CUBLAS_OP_N, d_active.size(), d_active.size(), nRow, &cc1, p_xa, nRow, p_xa, nRow, &cc0, p_ga, d_active.size());
            }
            s.clear();

            //invga = ga\eye(size(ga,1));
            thrust::device_vector<double> invga(d_active.size()*d_active.size());//invga
            if(ga.size() == 1)
            {
                //printf("ga is scalar, inversion is simple\n");
                if(ga[0] == 0)
                {
                    printf("ga is zero!!!\n");
                    canPrintSol = 0;
                    break;
                }
                else
                    invga[0] = 1/ga[0];
            }

            int batchSize = 1;
            int n = d_active.size();
            int lda = n;
            if(n < 1)
            {
                printf("[%d] the size of active is 0.\n", nIter);
                exit(EXIT_FAILURE);
            }

562 int *P, *INFO;
563 cudaMalloc<int>(&P,n * batchSize * sizeof(int));
564 cudaMalloc<int>(&INFO,batchSize * sizeof(int));
565
566 double* p_invga = thrust::raw_pointer_cast(&invga[0]);//p_invga
567 const double* p_gac = thrust::raw_pointer_cast(&ga[0]);
568
569 thrust::device_vector<double*> array_pA(1);
570 array_pA[0] = p_ga;
571 double** p_pA = thrust::raw_pointer_cast(&array_pA[0]);
572
573 cublasDgetrfBatched(hd,n,p_pA,lda,P,INFO,batchSize);
574 int INFOh = 0;
575 cudaMemcpy(&INFOh,INFO,sizeof(int),cudaMemcpyDeviceToHost);
576 if(INFOh == n)
577 {
578 fprintf(stderr, "Factorization Failed: Matrix is singular\n");
579 cudaDeviceReset();
580 exit(EXIT_FAILURE);
581 }
582
583 thrust::device_vector<const double*> array_pAc(1);
584 array_pAc[0] = p_gac;
585 const double** p_pAc = thrust::raw_pointer_cast(&array_pAc[0]);
586
587 thrust::device_vector<double*> array_pinvA(1);
588 array_pinvA[0] = p_invga;
589 double** p_pinvA = thrust::raw_pointer_cast(&array_pinvA[0]);
590
591 cublasDgetriBatched(hd,n,p_pAc,lda,P,p_pinvA,lda,INFO,batchSize);
592 cudaMemcpy(&INFOh,INFO,sizeof(int),cudaMemcpyDeviceToHost);
593 if(INFOh != 0)
594 {
595 fprintf(stderr, "Inversion Failed: Matrix is singular\n");
99
APPENDIX A. APPENDIX
596 cudaDeviceReset();
597 exit(EXIT_FAILURE);
598 }
599
600 cudaFree(P), cudaFree(INFO);
601
602 // aa= sum(sum(invga))ˆ(-1/2);% eq 2.5
603 double aa;
604 double sum_invga = 0;
605
606 thrust::device_vector<double> ones2(n, 1);
607 double* p_ones2 = thrust::raw_pointer_cast(&ones2[0]);
608 thrust::device_vector<double> sum_invga_col(n);
609 double *p_sum_invga_col = thrust::raw_pointer_cast(&sum_invga_col
[0]);
610 cublasDgemv(hd, CUBLAS_OP_T, n, n, &cc1, p_invga, n, p_ones2, 1, &
cc0, p_sum_invga_col, 1);
611 sum_invga = thrust::reduce( sum_invga_col.begin() , sum_invga_col.
end());
612 aa = 1/sqrt(sum_invga);
613
614 // wa= aa*sum(invga,2);% eq 2.6
615 thrust::device_vector<double> sum_invga_row(n);
616 double *p_sum_invga_row = thrust::raw_pointer_cast(&sum_invga_row
[0]);
617 cublasDgemv(hd, CUBLAS_OP_N, n, n, &cc1, p_invga, n, p_ones2, 1, &
cc0, p_sum_invga_row, 1);
618 thrust::device_vector<double> wa(n);
619 double *p_wa = thrust::raw_pointer_cast(&wa[0]);
620 thrust::transform(sum_invga_row.begin(), sum_invga_row.end(), wa.
begin(), MultipleByC(aa));
621
// ua= xa*wa;% eq 2.6
thrust::device_vector<double> ua(nRow);
double *p_ua = thrust::raw_pointer_cast(&ua[0]);
cublasDgemv(hd, CUBLAS_OP_N, nRow, n, &cc1, p_xa, nRow, p_wa, 1, &cc0, p_ua, 1);

// test using Eq 2.7
//test_1 = xa^T*ua;
thrust::device_vector<double> test_1(n,0);
double *p_test_1 = thrust::raw_pointer_cast(&test_1[0]);
cublasDgemv(hd, CUBLAS_OP_T, nRow, n, &cc1, p_xa, nRow, p_ua, 1, &cc0, p_test_1, 1);

//test_2 = aa*ones(size(test_1));
thrust::device_vector<double> test_2(n,0);
double *p_test_2 = thrust::raw_pointer_cast(&test_2[0]);
//test_1_2 = sum(sum(abs(test_1-test_2)));
thrust::transform(test_1.begin(), test_1.end(), test_2.begin(), MinusC(aa));
thrust::transform(test_2.begin(), test_2.end(), test_2.begin(), Abs());
double test_1_2 = thrust::reduce( test_2.begin() , test_2.end());

//test_3 = norm(ua) - 1;
double test_3;
const double *p_uac = thrust::raw_pointer_cast(&ua[0]);
cublasDnrm2(hd, nRow, p_uac, 1, &test_3);
test_3 = test_3-1;
if(test_3<0) test_3 = -test_3;

if(test_1_2 > RESOLUTION_OF_LARS*100 || test_3>RESOLUTION_OF_LARS*100)
{
printf("[%d]Eq 2.7 test failure.\n", nIter);
canPrintSol = 0;
break;
}
//a = x^T*ua;
thrust::device_vector<double> a(nCol);
double *p_a = thrust::raw_pointer_cast(&a[0]);
cublasDgemv(hd, CUBLAS_OP_T, nRow, nCol, &cc1, p_dx, nRow, p_ua, 1, &cc0, p_a, 1);

double c_inactive_temp;
double a_inactive_temp;
double tmp_1;
double tmp_2;
thrust::device_vector<double> tmp;
for(int k = 0; k < inactive.size(); k++)
{
c_inactive_temp = c[inactive[k]];//tmp_1 = (C_max-c(inactive))./(aa-a(inactive));
a_inactive_temp = a[inactive[k]];//tmp_2 = (C_max+c(inactive))./(aa+a(inactive));
tmp_1 = (C_max-c_inactive_temp)/(aa-a_inactive_temp);
tmp_2 = (C_max+c_inactive_temp)/(aa+a_inactive_temp);
if(tmp_1>0) tmp.push_back(tmp_1);//tmp = tmp_3(find(tmp_3>0));
if(tmp_2>0) tmp.push_back(tmp_2);
}

double gamm;
if(tmp.empty())
//if(tmp->empty())
gamm = C_max/aa;
else
{//gamma = min(tmp);
thrust::device_vector<double>::iterator iter4 = thrust::min_element(tmp.begin(), tmp.end());
gamm = *iter4;
}

thrust::device_vector<double> d(nCol,0);//d = zeros(1,nCol);
thrust::device_vector<double> tmpp(nCol,0);//tmp = zeros(1,nCol);
double d_temp;
double beta_temp;
thrust::device_vector<double> tmpp2;
double s_temp;
for(int k = 0; k < d_active.size(); k++)
{
double c_active_temp2 = c[d_active[k]];
if(c_active_temp2 >= RESOLUTION_OF_LARS)//s = sign(c(active));
s_temp = 1;
if(c_active_temp2 <= -RESOLUTION_OF_LARS)
s_temp = -1;
if(c_active_temp2 < RESOLUTION_OF_LARS && c_active_temp2 > -RESOLUTION_OF_LARS)
s_temp = 0;
d_temp = s_temp*wa[k];// d(active) = s.*wa;
if(d_temp==0)//if (length(find(d(active)==0)))
{
printf("[%d]Something wrong with vector d: Eq 3.4\n", nIter);
canPrintSol = 0;
break;
}
d[d_active[k]] = d_temp;
}

for(int k = 0; k < d_active.size(); k++)
{
beta_temp = beta[d_active[k]];
d_temp = d[d_active[k]];
beta_temp = -beta_temp/d_temp;
tmpp[d_active[k]] = beta_temp;//tmp(active) = -1*beta(active)./d(active);

if(beta_temp>0)
{
tmpp2.push_back(beta_temp);//tmp2 = tmp(find(tmp>0));
//tmpp2->push_back(beta_temp);//tmp2 = tmp(find(tmp>0));
//printf("%lf goes to tmpp2 in tmp2 = tmp(find(tmp>0));\n", beta_temp);
}
}

drop.clear();//drop = [];
double gamm_tilde = INF;

if(!tmpp2.empty())
{
thrust::device_vector<double>::iterator iter3 = thrust::min_element(tmpp2.begin(), tmpp2.end());

unsigned int position = iter3 - tmpp2.begin();
double min_tmpp2 = *iter3;

if(!tmpp2.empty() && gamm >= min_tmpp2)
{
gamm_tilde = min_tmpp2; //gamma_tilde = min(tmp2);
for(int k = 0; k < nCol; k++)//drop = find(tmp==gamma_tilde);
{
if(tmpp[k]==gamm_tilde)
{
drop.push_back(k);
//if(rank==10) fprintf(f_act_record, "[%d] %d added to drop at line: 808\n", nIter, k);
}
}
}
}
tmpp2.clear();
tmpp.clear();

double min_gamm = gamm;
if(gamm > gamm_tilde) min_gamm = gamm_tilde;
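The passage above is the Lasso modification of LARS: if a full step of length γ would carry an active coefficient across zero, the step is shortened to γ̃ = min⁺(−β_j/d_j) and the crossing coefficients are dropped from the active set, so `min_gamm` is min(γ, γ̃). A host-side sketch of this drop rule follows; the names `LassoStep` and `lassoStep` are illustrative, not from the thesis code.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Sketch of the Lasso drop rule: shorten the LARS step so no active
// coefficient changes sign; coefficients that reach exactly zero are
// recorded for removal from the active set.
struct LassoStep {
    double step;            // min(gamma, gamma_tilde)
    std::vector<int> drop;  // active positions driven to zero
};

LassoStep lassoStep(double gamma,
                    const std::vector<double>& beta_active,
                    const std::vector<double>& d_active) {
    double gt = std::numeric_limits<double>::infinity();
    for (std::size_t k = 0; k < beta_active.size(); ++k) {
        const double t = -beta_active[k] / d_active[k];
        if (t > 0.0 && t < gt) gt = t;     // gamma_tilde = min(tmp2)
    }
    LassoStep out;
    out.step = (gamma < gt) ? gamma : gt;
    if (gt <= gamma)                       // drop = find(tmp == gamma_tilde)
        for (std::size_t k = 0; k < beta_active.size(); ++k)
            if (-beta_active[k] / d_active[k] == gt)
                out.drop.push_back(static_cast<int>(k));
    return out;
}
```

For example, with β = (1, −1) and d = (−0.5, 1), the crossing times are 2 and 1, so a proposed γ = 3 is cut to 1 and the second coefficient is dropped, while γ = 0.5 goes through unchanged.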
//const double *p_min_gamm = &min_gamm;
mu_a_plus = mu_a;
//double *p_mu_a_plus = thrust::raw_pointer_cast(&mu_a_plus[0]);
//cublasDaxpy(hd, nRow, p_min_gamm, p_uac, 1, p_mu_a_plus, 1);//mu_a_plus = mu_a + min(gamm,gamm_tilde)*ua;
thrust::device_vector<double> new_ua1=ua;
thrust::transform(ua.begin(), ua.end(), new_ua1.begin(), MultipleByC(min_gamm));
thrust::transform(mu_a.begin(), mu_a.end(), new_ua1.begin(), mu_a_plus.begin(), thrust::plus<double>());
new_ua1.clear();

beta_new = beta;
//const double *p_dc = thrust::raw_pointer_cast(&d[0]);
//double *p_beta_new = thrust::raw_pointer_cast(&beta_new[0]);
//cublasDaxpy(hd, nCol, p_min_gamm, p_dc, 1, p_beta_new, 1);//beta_new = beta + min(gamm,gamm_tilde)*d;
thrust::device_vector<double> new_d1=d;
thrust::transform(d.begin(), d.end(), new_d1.begin(), MultipleByC(min_gamm));
thrust::transform(beta.begin(), beta.end(), new_d1.begin(), beta_new.begin(), thrust::plus<double>());
new_d1.clear();

//active = setdiff(active,drop);
thrust::device_vector<int> new_active2(d_active.size());
thrust::device_vector<int>::iterator act_end2;
act_end2 = thrust::set_difference(d_active.begin(), d_active.end(), drop.begin(), drop.end(), new_active2.begin());
new_active2.erase(act_end2, new_active2.end());
d_active = new_active2;
new_active2.clear();

//inactive = setdiff(all_candidate,active);
thrust::device_vector<int>::iterator inact_end2;
inact_end2 = thrust::set_difference(all_candidate.begin(), all_candidate.end(), d_active.begin(), d_active.end(), inactive.begin());
inactive.erase(inact_end2, inactive.end());

for(int k = 0; k < drop.size(); k++)//beta_new(drop)=0;
{
beta_new[drop[k]]= 0;
}

double C_max_aa;
C_max_aa = C_max/aa;
//const double * p_C_max_aa = &C_max_aa;
mu_a_OLS = mu_a;
//double *p_mu_a_OLS = thrust::raw_pointer_cast(&mu_a_OLS[0]);
//cublasDaxpy(hd, nRow, p_C_max_aa, p_uac, 1, p_mu_a_OLS, 1);//mu_a_OLS = mu_a + C_max/aa*ua;
thrust::device_vector<double> new_ua2=ua;
thrust::transform(ua.begin(), ua.end(), new_ua2.begin(), MultipleByC(C_max_aa));
thrust::transform(mu_a.begin(), mu_a.end(), new_ua2.begin(), mu_a_OLS.begin(), thrust::plus<double>());
new_ua2.clear();

beta_OLS = beta;
////const double *p_dc = thrust::raw_pointer_cast(&d[0]);
//double *p_beta_OLS = thrust::raw_pointer_cast(&beta_OLS[0]);
//cublasDaxpy(hd, nCol, p_C_max_aa, p_dc, 1, p_beta_new, 1);//beta_LS = beta + C_max/aa*d;
thrust::device_vector<double> new_d2=d;
thrust::transform(d.begin(), d.end(), new_d2.begin(), MultipleByC(C_max_aa));
thrust::transform(beta.begin(), beta.end(), new_d2.begin(), beta_OLS.begin(), thrust::plus<double>());
new_d2.clear();

//MSE = sum((y-mu_a_OLS).^2)/length(y);
thrust::device_vector<double> y_minus_mu_a(nRow);
y_minus_mu_a = dy;
//double *p_y_minus_mu_a = thrust::raw_pointer_cast(&y_minus_mu_a[0]);
//double* p_mu_a_OLSc = thrust::raw_pointer_cast(&mu_a_OLS[0]);
//const double c_m1 = -1;
//cublasDaxpy(hd, nRow, &c_m1, p_mu_a_OLSc, 1, p_y_minus_mu_a, 1);
thrust::transform(dy.begin(), dy.end(), mu_a_OLS.begin(), y_minus_mu_a.begin(), thrust::minus<double>());
thrust::transform(y_minus_mu_a.begin(), y_minus_mu_a.end(), y_minus_mu_a.begin(), Square());
MSE = thrust::reduce( y_minus_mu_a.begin() , y_minus_mu_a.end());
MSE = MSE/nRow;
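The stopping test above evaluates MSE = sum((y − mu_a_OLS).^2)/length(y) on the device with two `thrust::transform` passes and a `thrust::reduce`; the iteration halts once the value drops below the threshold `MSE_th`. An equivalent host-side check is sketched below; `meanSquaredError` is an illustrative name, not from the thesis code.

```cpp
#include <cstddef>
#include <vector>

// Host-side equivalent of the device-side stopping test:
// MSE = sum((y - mu).^2) / length(y).
double meanSquaredError(const std::vector<double>& y,
                        const std::vector<double>& mu) {
    double s = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        const double r = y[i] - mu[i];
        s += r * r;  // squared residual, as Square() does on the device
    }
    return s / static_cast<double>(y.size());
}
```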
y_minus_mu_a.clear();
d.clear();
ua.clear();
wa.clear();
a.clear();
xa.clear();
ga.clear();
invga.clear();

//update
mu_a = mu_a_plus;
beta = beta_new;
thrust::device_vector<double> beta_backup=beta;
double d_beta_temp;
if(MSE <= MSE_th)
{
//beta = beta(active)./sx(active);
thrust::fill(beta.begin(), beta.end(), 0);
for(int k = 0; k < d_active.size(); k++)
{
d_beta_temp = beta_backup[d_active[k]];
d_beta_temp = d_beta_temp/ssx[d_active[k]];
beta[d_active[k]] = d_beta_temp;
}

canPrintSol = 1;
//othersStop = 1;//need to be broadcasted!
break;
}
beta_backup.clear();
}
if(nIter>=MAX_ITER){printf(" Reached nIter limit!\n");}
mu_a_plus.clear();
mu_a_OLS.clear();
beta_new.clear();
beta_OLS.clear();
c.clear();
mu_a.clear();
dy.clear();
dx.clear();
//sx.clear();
ssx.clear();

if(!no_xtx) {xtx.clear();}
thrust::host_vector<double> h_beta(beta.begin(), beta.end());
thrust::host_vector<int> active(d_active.begin(), d_active.end());
/*========================END OF SOLVING==============================*/
dtime = ((double)clock() - start)/CLOCKS_PER_SEC;
if(canPrintSol ==1)
{
printf("========Solved successfully ========\n");
//fprintf(f_record, "[%d]Runtime = %f\n", rank, exe_time);
printf("Runtime = %f\n", dtime);

printf("====Number of Iteration: %d====\n", nIter);
for(int iii = 0; iii<active.size(); iii++)
{
printf("------active_idx:%d------beta[active_idx]:%6.8f------\n", active[iii], h_beta[active[iii]] );
}
}
else
{
printf("========Not Solved TAT ========\n");
}
timearray[iFaultCase-startFaultCaseID] = dtime;
// memory clean up

b.clear();
h_beta.clear();
active.clear();

// shutdown
status = cublasDestroy(hd);

if (status != CUBLAS_STATUS_SUCCESS)
{
fprintf(stderr, "!!!! shutdown error (A)\n");
return EXIT_FAILURE;
}
}
A.clear();

double t_ave = thrust::reduce(timearray.begin(),timearray.end());
t_ave = t_ave/nTotalFaultCases;
printf("//***********========[%d] Ave. Time = %lf ========*********\n", caseID, t_ave);
timearray.clear();

/*
if(argc <= 1 || strcmp(argv[1], "-noprompt"))
{
printf("\nPress ENTER to exit...\n");
getchar();
}*/
return EXIT_SUCCESS;
}
Index
L0 minimization problem, 15
L1 minimization problem, 15
Equivalent Injection, 7
LARS, 15
Lasso, 15
OLS Backup Estimation, 20
Overlapping Group Lasso, 19
Teed lines, 10