
Cascading Failures in Infrastructure Networks

David Alderson

Ph.D. Candidate

Dept. of Management Science and Engineering

Stanford University

April 15, 2002

Advisors: William J. Perry, Nicholas Bambos

IPAM 4/15/2002 David Alderson

Outline

• Background and Motivation

• Union Pacific Case Study

• Conceptual Framework

• Modeling Cascading Failures

• Ongoing Work


Background

• Most of the systems we rely on in our daily lives are designed and built as networks
– Voice and data communications
– Transportation
– Energy distribution

• Large-scale disruption of such systems can be catastrophic because of our dependence on them

• Large-scale failures in these systems
– Have already happened
– Will continue to happen


Recent Examples

• Telecommunications
– ATM network outage: AT&T (February 2001)
– Frame Relay outage: AT&T (April 1998), MCI (August 1999)

• Transportation
– Union Pacific Service Crisis (May 1997 - December 1998)

• Electric Power
– Northeast Blackout (November 1965)
– Western Power Outage (August 1996)

• All of the above
– Baltimore Tunnel Accident (July 2001)


Public Policy

• U.S. Government interest from 1996 (and earlier)

• Most national infrastructure systems are privately owned and operated
– Misalignment between business imperatives (efficiency) and public interest (robustness)

• Previously independent networks now tied together through common information infrastructure

• Current policy efforts directed toward building new public-private relationships
– Policy & Partnership (CIAO)
– Law Enforcement & Coordination (NIPC)
– Defining new roles (Homeland Security)


Research Questions

Broadly:

• Is there something about the network structure of these systems that contributes to their vulnerability?

More specifically:

• What is a cascading failure in the context of an infrastructure network?

• What are the mechanisms that cause it?

• What can be done to control it?

• Can we design networks that are robust to cascading failures?

• What are the implications for network-based businesses?




Union Pacific Railroad

• Largest RR in North America
– Headquartered in Omaha, Nebraska
– 34,000 track miles (west of Mississippi River)

• Transporting
– Coal, grain, cars, other manifest cargos
– 3rd party traffic (e.g. Amtrak passenger trains)

• 24x7 Operations:
– 1,500+ trains in motion
– 300,000+ cars in system

• More than $10B in revenue annually



• Four major resources constraining operations:
– Line capacity (# parallel tracks, speed restrictions, etc.)
– Terminal capacity (in/out tracks, yard capacity)
– Power (locomotives)
– Crew (train personnel, yard personnel)

• Ongoing control of operations is mainly by:
– Dispatchers
– Yardmasters
– Some centralized coordination, primarily through a predetermined transportation schedule



• Sources of network disruptions:
– Weather (storms, floods, rock slides, tornados, hurricanes, etc.)
– Component failures (signal outages, broken wheels/rails, engine failures, etc.)
– Derailments (~1 per day on average)
– Minor incidents (e.g. crossing accidents)

• Evidence for system-wide failures
– 1997-1998 Service Crisis

• Fundamental operating challenge


UPRR Fundamental Challenge

Two conflicting drivers:

• Business imperatives necessitate a lean operation that maximizes efficiency and drives the system toward high utilization of available network resources.

• An efficient operation that maximizes utilization is very sensitive to disruptions, particularly because of the effects of network congestion.


Railroad Congestion

There are several places where congestion may be seen within the railroad:

• Line segments

• Terminals

• Operating Regions

• The Entire Railroad Network

• (Probably not locomotives or crews)

Congestion is related to capacity.


UPRR Capacity Model Concepts

Factors Affecting Observed Performance:
• Dispatcher / Corridor Manager Expertise
• On Line Incidents / Equipment Failure
• Weather
• Temporary Speed Restrictions

[Figure: Empirically derived relationship between line segment velocity and volume (trains per day), showing the effect of forcing volume in excess of capacity.]


Implications of Congestion

Concepts of traffic congestion are important for two key aspects of network operations:
– Capacity Planning and Management
– Service Restoration

In the presence of service interruptions, the objective of Service Restoration is to:
– Minimize the propagation across the network of any disturbance caused by a service interruption
– Minimize the time to recovery to fluid operations


Modeling Congestion

We can model congestion using standard models from transportation engineering.

Define the relationships between:

• Number of items in the system (Density)

• Average processing rate (Velocity)

• Input Rate

• Output Rate (Throughput)


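Modeling Congestion

Velocity vs. Density: Assume that velocity decreases (linearly) with the traffic density.

$$v(n) = K\left(1 - \frac{n}{N}\right)$$

[Figure: velocity (v) vs. density (n); a straight line from v = K at n = 0 down to v = 0 at n = N.]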


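Modeling Congestion

Throughput vs. Density: Throughput = Velocity · Density

$$\mu(n) = n\,v(n) = K\left(n - \frac{n^{2}}{N}\right)$$

Throughput is maximized at $n = N/2$ with value $\mu^* = KN/4$, that is, $\mu^* = N/4$ for K=1.

[Figure: throughput ($\mu$) vs. density (n); the curve peaks at $\mu^*$ when n = N/2 and returns to zero at n = N.]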
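As a quick numerical check of the claim above, the following minimal sketch scans the linear model on a grid of densities and reports where throughput peaks. The choice N = 10 and the grid resolution are assumptions made for illustration; they are not taken from the talk.

```python
def velocity(n, N, K=1.0):
    """Linear velocity-density model: v(n) = K * (1 - n/N)."""
    return K * (1.0 - n / N)


def throughput(n, N, K=1.0):
    """Throughput = velocity * density: mu(n) = n * v(n)."""
    return n * velocity(n, N, K)


N = 10.0  # assumed maximum density, for illustration only
grid = [i * N / 1000 for i in range(1001)]  # densities 0, 0.01, ..., 10
best_n = max(grid, key=lambda n: throughput(n, N))

# The grid maximizer should sit at N/2 with throughput N/4 when K = 1.
print(best_n, throughput(best_n, N))
```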


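Modeling Congestion

[Figure: two panels. Velocity vs. density, falling from K at zero density to zero at density N; throughput vs. density, rising to the peak $\mu^*$ and falling back to zero at density N.]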


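Modeling Congestion

Let p represent the intensity of congestion onset.

$$v(n) = K\left(1 - \left(\frac{n}{N}\right)^{p}\right)$$

$$\mu(n) = n\,v(n) = K\,n\left(1 - \left(\frac{n}{N}\right)^{p}\right)$$

[Figure: v(n) = 1 - (n/10)^p plotted for n from 0 to 10, one curve for each of p = 0.1, 0.25, 0.5, 1, 2, 4, 10.]

[Figure: mu(n) = n*v(n) = n * { 1 - (n/10)^p } plotted for n from 0 to 10, one curve for each of p = 0.1, 0.25, 0.5, 1, 2, 4, 10.]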
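To see how p shifts the throughput peak, the sketch below evaluates the family for the p values shown in the plots (with N = 10, K = 1). The closed form in `peak_density` is not from the slides: it comes from setting the derivative of n(1 - (n/N)^p) to zero, which gives n* = N (p + 1)^(-1/p). Treat it as an illustrative derivation rather than the author's method.

```python
def velocity(n, N, p, K=1.0):
    """Velocity with congestion-onset exponent p: v(n) = K * (1 - (n/N)^p)."""
    return K * (1.0 - (n / N) ** p)


def throughput(n, N, p, K=1.0):
    """Throughput mu(n) = n * v(n)."""
    return n * velocity(n, N, p, K)


def peak_density(N, p):
    """Density that maximizes throughput, from d(mu)/dn = 0 (derived here)."""
    return N * (p + 1.0) ** (-1.0 / p)


N = 10.0
for p in (0.1, 0.25, 0.5, 1, 2, 4, 10):
    n_star = peak_density(N, p)
    # Larger p pushes the peak toward n = N and raises the peak throughput.
    print(p, round(n_star, 3), round(throughput(n_star, N, p), 3))
```

For p = 1 this reduces to the linear case, with the peak at N/2.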


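Modeling Congestion

It is clear that, for $0 \le n \le N$,

$$\lim_{p \to \infty} K\,n\left(1 - \left(\frac{n}{N}\right)^{p}\right)$$

becomes

$$K\,n\,\bigl(1 - \mathbf{1}_N(n)\bigr) \quad \text{where} \quad \mathbf{1}_N(n) = \begin{cases} 1 & n \ge N \\ 0 & \text{otherwise} \end{cases}$$

[Figure: mu(n) = n*v(n) = n * { 1 - (n/10)^p } plotted for n from 0 to 10, one curve for each of p = 0.1, 0.25, 0.5, 1, 2, 4, 10.]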




UP Service Crisis

• Initiating Event
– 5/97 derailment at a critical train yard outside of Houston

• Additionally
– Loss of BNSF route that was decommissioned for repairs
– Embargo at Laredo interchange point to Mexico

• Complicating Factors
– UP/SP merger and transition to consolidated operations
– Hurricane Danny, fall 1997
– Record rains and floods (esp. Kansas) in 1998

• Operational Issues
– Tightly optimized transportation schedule
– Traditional service priorities


UP Service Crisis

[Figure: Union Pacific Railroad total system inventory (cars), December 1996 - November 1998, on a scale of 280,000 to 370,000 cars. Regions labeled on the chart: Houston-Gulf Coast, Central Corridor (Kansas-Nebraska-Wyoming), Southern California.]

Source: UP Filings with Surface Transportation Board, September 1997 – December 1998


Case Study: Union Pacific

Completed Phase 1 of case study:

• Understanding of the factors affecting system capacity, system dynamics

• Investigation of the 1997-98 Service Crisis

• Project definition: detailed study of Sunset Route

• Data collection, preliminary analysis for the Sunset Route

Ongoing work:

• A detailed study of their specific network topology

• Development of real-time warning and analysis tools




Basic Network Concepts

• Networks allow the sharing of distributed resources

• Resource use ↔ resource load
– Total network usage = total network load

• Total network load is distributed among the components of the network
– Many networking problems are concerned with finding a “good” distribution of load

• Resource allocation ↔ load distribution


Infrastructure Networks

• Self-protection as an explicit design criterion

• Network components themselves are valuable
– Expensive
– Hard to replace
– Long lead times to obtain

• Willingness to sacrifice current system performance in exchange for future availability

• With protection as an objective, connectivity between neighboring nodes is
– Helpful
– Harmful


Cascading Failures

Cascading failures occur in networks where
– Individual network components can fail
– When a component fails, the natural dynamics of the system may induce the failure of other components

Network components can fail because of initiating events:
– Accident
– Internal failure
– Attack

A cascading failure is not
– A single point of failure
– The occurrence of multiple concurrent failures
– The spread of a virus


Related Work

Cascading Failures:
– Electric Power: Parrilo et al. (1998), Thorp et al. (2001)
– Social Networks: Watts (1999)
– Public Policy: Little (2001)

Other network research initiatives
– “Survivable Networks”
– “Fault-Tolerant Networks”

Large-Scale Vulnerability
– Self-Organized Criticality: Bak (1987), many others
– Highly Optimized Tolerance: Carlson and Doyle (1999)
– Normal Accidents: Perrow (1999)
– Influence Models: Verghese et al. (2001)


Our Approach

• Cascading failures in the context of flow networks
– conservation of flow within the network

• Overloading a resource leads to degraded performance and eventual failure

• Network failures are not independent
– Flow allocation pattern → resource interdependence

• Focus on the dynamics of network operation and control

• Design for robustness (not protection)


Taxonomy of Network Flow Models

Modeling Approach           Quantity of Interest
Static Flow Models          Long-Term Averages
Fluid Approximations        Time-Dependent Averages
Diffusion Approximations    Averages & Variances
Queueing Models             Probability Distributions
Simulation Models           Event Sequences

The approaches run from coarse-grained models (top) to fine-grained models (bottom).

Relevant Decisions: Capacity Planning; Failure & Recovery; Ongoing Operation (Processing & Routing)

Reference: Janusz Filipiak


Time Scales in Network Operations

Relevant Decisions                          Computer Routing           Railroad Transportation
Ongoing Operation (Processing & Routing)    milliseconds to seconds    minutes to hours
Failure & Recovery                          minutes to hours           days to weeks
Capacity Planning                           days to weeks              months to years

Rows run from short time scales (top) to long time scales (bottom).


What Are Network Dynamics?

Type of Network Dynamics    Underlying Assumption
Dynamics ON Networks        Network topology is STATIC
Dynamics OF Networks        Network topology is CHANGING

Failure & Recovery


Network Flow Optimization

• Original work by Ford and Fulkerson (1956)

• One of the most studied areas for optimization

• Three main problem types
– Shortest path problems
– Maximum flow problems
– Minimum cost flow problems

• Special interpretation for some of the most celebrated results in optimization theory

• Broad applicability to a variety of problems


Single Commodity Flow Problem

Notation:

N set of nodes, indexed i = 1, 2, …, N
A set of arcs, indexed j = 1, 2, …, M
$d_i$ demand (supply) at node i
$f_j$ flow along arc j
$u_j$ capacity along arc j
A node-arc incidence matrix, with entries

$$a_{ij} = \begin{cases} +1 & \text{if arc } j \text{ enters node } i \\ -1 & \text{if arc } j \text{ exits node } i \\ 0 & \text{otherwise} \end{cases}$$

A set of flows f is feasible if it satisfies the constraints:

$A_i f = d_i \quad \forall i \in N$ (flows balanced at node i, and supply/demand is satisfied; $A_i$ is row i of the incidence matrix)
$0 \le f_j \le u_j \quad \forall j \in A$ (flow on arc j less than capacity)
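The feasibility conditions can be checked mechanically. The sketch below builds a node-arc incidence matrix and tests node balance and arc capacity for a given flow. It assumes the sign convention +1 where an arc enters a node and -1 where it exits (so that positive $d_i$ means demand); the three-node network, its capacities, and the helper names are hypothetical.

```python
def incidence_matrix(num_nodes, arcs):
    """Node-arc incidence matrix for arcs given as (tail, head) pairs, 0-indexed.

    Assumed convention: +1 where the arc enters a node, -1 where it exits.
    """
    A = [[0] * len(arcs) for _ in range(num_nodes)]
    for j, (tail, head) in enumerate(arcs):
        A[tail][j] = -1
        A[head][j] = +1
    return A


def is_feasible(A, f, d, u, tol=1e-9):
    """True if A f = d (node balance) and 0 <= f_j <= u_j (arc capacity)."""
    balanced = all(
        abs(sum(a * x for a, x in zip(row, f)) - d_i) <= tol
        for row, d_i in zip(A, d)
    )
    within_capacity = all(-tol <= f_j <= u_j + tol for f_j, u_j in zip(f, u))
    return balanced and within_capacity


# Hypothetical network: node 0 supplies 2 units, node 2 demands 2 units.
arcs = [(0, 1), (0, 2), (1, 2)]
A = incidence_matrix(3, arcs)
d = [-2, 0, 2]
u = [1, 2, 1]

print(is_feasible(A, [1, 1, 1], d, u))  # balanced and within capacity
print(is_feasible(A, [2, 0, 2], d, u))  # balanced, but arcs 0 and 2 exceed capacity
```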


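Single Commodity Flow Problem

Feasible region, denoted $F(\phi)$, where $\phi$ is the amount of flow sent from source s to sink t (inflow at a node counted as positive):

$$A_i f = d_i = \begin{cases} -\phi & \text{if } i = s \\ \phi & \text{if } i = t \\ 0 & \text{otherwise} \end{cases}$$ (flows balanced at node i)

$0 \le f_j \le u_j \quad \forall j \in A$ (flow on arc j feasible)


Minimum Cost Problem

Let $c_j$ = cost on arc j

$$\text{Minimize}_f \; \sum_{j \in A} c_j f_j$$

subject to:

$$A_i f = d_i = \begin{cases} -\phi & \text{if } i = s \\ \phi & \text{if } i = t \\ 0 & \text{otherwise} \end{cases}$$

$0 \le f_j \le u_j \quad \forall j \in A$


Shortest Path Problem

Let costs $c_j$ correspond to “distances”, set $\phi = 1$

$$\text{Minimize}_f \; \sum_{j \in A} c_j f_j$$

subject to:

$$A_i f = d_i = \begin{cases} -1 & \text{if } i = s \\ 1 & \text{if } i = t \\ 0 & \text{otherwise} \end{cases}$$

$0 \le f_j \le u_j = 1 \quad \forall j \in A$ (flow on arc j feasible)


Maximum Flow Problem

$$\text{Maximize}_f \; \phi$$

subject to:

$$A_i f = d_i = \begin{cases} -\phi & \text{if } i = s \\ \phi & \text{if } i = t \\ 0 & \text{otherwise} \end{cases}$$

$0 \le f_j \le u_j \quad \forall j \in A$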
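For a concrete instance of the maximum flow problem, here is a short sketch using the Edmonds-Karp augmenting-path method (breadth-first search on the residual network). The talk does not name an algorithm, so this is one standard choice and not the author's method; the four-node network and its capacities are hypothetical.

```python
from collections import deque


def max_flow(num_nodes, arcs, s, t):
    """Maximum s-t flow value; arcs are (tail, head, capacity) triples."""
    # residual[i][j] holds the remaining capacity from node i to node j.
    residual = [dict() for _ in range(num_nodes)]
    for tail, head, cap in arcs:
        residual[tail][head] = residual[tail].get(head, 0) + cap
        residual[head].setdefault(tail, 0)

    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path from s to t.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            node = queue.popleft()
            for nxt, cap in residual[node].items():
                if cap > 0 and nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        if t not in parent:
            return total  # no augmenting path remains

        # Find the bottleneck capacity along the path, then push that much flow.
        bottleneck = float("inf")
        node = t
        while parent[node] is not None:
            bottleneck = min(bottleneck, residual[parent[node]][node])
            node = parent[node]
        node = t
        while parent[node] is not None:
            prev = parent[node]
            residual[prev][node] -= bottleneck
            residual[node][prev] += bottleneck
            node = prev
        total += bottleneck


# Hypothetical network: source 0, sink 3.
arcs = [(0, 1, 3), (0, 2, 2), (1, 2, 1), (1, 3, 2), (2, 3, 3)]
print(max_flow(4, arcs, 0, 3))

# Reducing the capacity of arc (2, 3) lowers the achievable flow.
perturbed = [(0, 1, 3), (0, 2, 2), (1, 2, 1), (1, 3, 2), (2, 3, 1)]
print(max_flow(4, perturbed, 0, 3))
```

Comparing the two calls shows how a change in a single arc capacity changes the largest feasible flow value.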


Network Optimization

Traditional Assumptions:
– Complete information
– Static network (capacities, demands, topology)
– Centralized decision maker

Solution obtained from global optimization algorithms

Relevant issues:
– Computational (time) complexity
  • Function of problem size (number of inputs)
  • Based on worst-case data
– Parallelization (decomposition)
– Synchronization (global clock)


New Challenges

Most traditional assumptions no longer hold…

• Modern networks are inherently dynamic
– Connectivity fluctuates, components fail, growth is ad hoc
– Traffic demands/patterns constantly change

• Explosive growth → massive size scale

• Faster technology → shrinking time scale

• Operating decisions are made with incomplete, incorrect information

• Claim: A global approach based on static assumptions is no longer viable


Cascading Failures & Flow Networks

• In general, we assume that network failures result from violations of network constraints
– Node feasibility (flow conservation)
– Arc feasibility (arc capacity)

• That is, failure ↔ infeasibility

• The network topology provides the means by which failures (infeasibilities) propagate

• In the optimization context, a cascading failure is a collapse of the feasible region of the optimization problem that results from the interaction of the constraints when a parameter is changed


Addressing New Challenges

• Extend traditional notions of network optimization to model cascading failures in flow networks
– Allow for node failures
– Include flow dynamics

• Consider solution approaches based on
– Decentralized control
– Local information

• Leverage ideas from dual problem formulation

• Identify dimensions along which there are explicit tensions and tradeoffs between vulnerability and performance


Dual Problem Formulation

Primal Problem

Min $c^T f$
s.t. $A f = d$
$f \ge 0$
$f \le u$

Dual Problem

Max $\pi^T d - u^T \lambda$
s.t. $\pi^T A - \lambda^T \le c^T$
$\pi$ unrestricted
$\lambda \ge 0$

• Dual variables $\pi$, $\lambda$ have interpretation as prices at nodes, arcs
• Natural decomposition as distributed problem
• e.g. Nodes set prices based on local information
• Examples:
– Kelly, Low and many others for TCP/IP congestion control
– Boyd and Xiao for dual decomposition of SRRA problem
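The dual can be recovered from the primal with a standard Lagrangian argument, sketched below. The names $\pi$ (node prices, for the balance constraints) and $\lambda$ (arc prices, for the capacity constraints) are assumed labels for the dual variables, chosen here for illustration.

```latex
\begin{align*}
L(f, \pi, \lambda)
  &= c^T f + \pi^T (d - A f) + \lambda^T (f - u), \qquad \lambda \ge 0 \\
  &= \pi^T d - u^T \lambda + \left( c^T - \pi^T A + \lambda^T \right) f \\[4pt]
\min_{f \ge 0} L(f, \pi, \lambda)
  &= \begin{cases}
       \pi^T d - u^T \lambda & \text{if } \pi^T A - \lambda^T \le c^T \\
       -\infty & \text{otherwise}
     \end{cases}
\end{align*}
```

Maximizing this lower bound over $\pi$ (unrestricted) and $\lambda \ge 0$ gives the dual problem. Each dual constraint involves only one arc and the prices at its two end nodes, which is what makes a decomposition based on local information natural.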




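Node Dynamics

• Consider each node as a simple input-output system running in discrete time…

[Figure: a node holding load n(k), with input a(k) and output d(k).]

• Let n(k) = flow being processed in interval k
• Node dynamics: n(k+1) = n(k) + a(k) – d(k)
• Processing capacity
• State-dependent output: $d(k) = \mu(n(k))$ while $n(k) \ge 0$ stays within the node's load limit, and $d(k) = 0$ otherwise

[Figure: output d(k) (performance) vs. load n(k), with a constant input a(k) drawn as a horizontal line and the equilibrium point n* marked.]

• System is feasible for constant a(k) below the processing capacity
• a(k) – d(k) indicates how n(k) is changing
• n* is equilibrium point
• Node “fails” if n(k) exceeds its load limit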
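The node recursion is easy to simulate. The sketch below iterates n(k+1) = n(k) + a(k) - d(k) with a constant input rate. It assumes the output curve is the p = 1 throughput curve from the congestion slides, mu(n) = n(1 - n/N), with N = 10 and K = 1, and it treats n > N as failure; those choices are assumptions for illustration, not parameters given in the talk.

```python
def throughput(n, N=10.0, K=1.0):
    """Assumed output curve: mu(n) = K*n*(1 - n/N) on [0, N], zero elsewhere."""
    if 0.0 <= n <= N:
        return K * n * (1.0 - n / N)
    return 0.0


def simulate_node(a, n0=0.0, N=10.0, steps=200):
    """Iterate n(k+1) = n(k) + a - d(k) with constant input rate a.

    Returns (final load, steps taken, failed), where failure means n > N.
    """
    n = n0
    for k in range(steps):
        d = throughput(n, N)
        n = n + a - d
        if n > N:
            return n, k + 1, True
    return n, steps, False


# Input below peak throughput (2.5): the load settles at an equilibrium.
print(simulate_node(1.6))
# Input above peak throughput: the load grows until the node fails.
print(simulate_node(3.0))
# Same feasible input, but starting in the congested region: the node fails.
print(simulate_node(1.6, n0=8.5))
```

With a = 1.6 the equilibria of this assumed curve are n = 2 and n = 8; only the lower one attracts nearby trajectories.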


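Network Dynamics

• The presence of an arc between adjacent nodes couples their behavior
• Arc capacities limit both outgoing and incoming flow

[Figure: two nodes in series. Node 1 has input a1(k), load n1(k), and output d1(k); its output feeds node 2 over an arc of capacity u1(k), so d1(k) = a2(k); node 2 has load n2(k) and output d2(k). The output curves d1(n1) and d2(n2) are drawn beneath the nodes.]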


Network Dynamics

• The failure of one node can lead to the failure of another

[Figure: two nodes in series, a1(k) → n1(k) → d1(k) = a2(k) → n2(k) → d2(k), with the connecting arc capacity u1 = 0 after node 2 fails.]

• When a node fails, the capacity of its incoming arcs drops effectively to zero.
• Upstream node loses capacity of arc
• In the absence of control, the upstream node fails too.

Result: Node failures propagate “upstream”…

Question:

• How will the network respond to perturbations?
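The upstream propagation described above can be reproduced with two nodes in series. In this sketch a limit `u_out` on the flow leaving node 2 plays the role of the perturbation. The output curve mu(n) = n(1 - n/N) with N = 10, the failure rule n > N, and all names are illustrative assumptions rather than details from the talk.

```python
def mu(n, N=10.0):
    """Assumed output curve: mu(n) = n*(1 - n/N) on [0, N], zero elsewhere."""
    return n * (1.0 - n / N) if 0.0 <= n <= N else 0.0


def simulate_chain(a1, u_out, N=10.0, steps=100):
    """Two nodes in series; u_out caps the flow leaving node 2.

    Returns [step at which node 1 failed, step at which node 2 failed],
    with None for a node that never failed.
    """
    n1 = n2 = 0.0
    failed_at = [None, None]
    for k in range(1, steps + 1):
        # A failed downstream node drives its incoming arc capacity to zero.
        u1 = 0.0 if failed_at[1] is not None else float("inf")
        d1 = 0.0 if failed_at[0] is not None else min(mu(n1, N), u1)
        d2 = 0.0 if failed_at[1] is not None else min(mu(n2, N), u_out)
        n1 += a1 - d1   # node 1 receives the external input
        n2 += d1 - d2   # node 2 receives whatever node 1 sends
        if failed_at[1] is None and n2 > N:
            failed_at[1] = k
        if failed_at[0] is None and n1 > N:
            failed_at[0] = k
    return failed_at


print(simulate_chain(1.6, float("inf")))  # no perturbation
print(simulate_chain(1.6, 1.0))           # outgoing capacity of node 2 reduced
```

In the perturbed run node 2 fails first, and node 1 follows once it can no longer send flow downstream, with no control action taken.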


Network Robustness

Consider the behavior of the network in response to a perturbation to arc capacity:
1. Does the disturbance lead to a local failure?
2. Does the failure propagate?
3. How far does it propagate?

Measure the consequences in terms of:
– Size of the resulting failure island
– Loss of network throughput

Key factors:
– Flow processing sensitivity to congestion
– Network topology
– Local routing and flow control policies
– Time scales


Congestion Sensitivity

In many real network systems, components are sensitive to congestion

[Figure: system performance vs. system load, with the region showing evidence of congestion marked.]

• Using the aforementioned family of functions we can tune the sensitivity to congestion
• Direct consequences on local dynamics, stability, and control
• Tradeoff between system efficiency and fragility
• Implications for local behavior


Qualitative Behavior

[Figure: output rate vs. total system load, with a horizontal line at the input rate. The line crosses the output curve at a stable equilibrium x1* and an unstable equilibrium x2*; beyond that lies congestion collapse. The load axis is divided into regions of fluid processing, mild congestion, and severe congestion.]

System response to changes in input rate is opposite in fluid vs. congested regions.

[Figure: the same output curve with the original input rate (equilibria x1*, x2*) and a new input rate (equilibria y1*, y2*), with the safety margin marked.]

“Efficiency” results in “Fragility”
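The equilibria in these figures can be computed directly once an output curve is fixed. The sketch below assumes the p = 1 curve mu(n) = K n (1 - n/N) with N = 10 and K = 1, and it measures the "safety margin" as the gap between the stable and unstable equilibria. That is one possible reading of the figure, not a definition given in the talk.

```python
import math


def equilibria(a, N=10.0, K=1.0):
    """Loads where the assumed output K*n*(1 - n/N) equals input rate a.

    Returns (stable, unstable), or None if a exceeds peak throughput K*N/4.
    """
    disc = 1.0 - 4.0 * a / (K * N)
    if disc < 0:
        return None
    root = math.sqrt(disc)
    return (N / 2) * (1 - root), (N / 2) * (1 + root)


for a in (1.6, 2.4, 2.6):
    eq = equilibria(a)
    if eq is None:
        print(a, "no equilibrium: input exceeds peak throughput")
    else:
        stable, unstable = eq
        # Gap between the two equilibria, read here as the safety margin.
        print(a, round(stable, 3), round(unstable, 3), round(unstable - stable, 3))
```

Under these assumptions, raising the input rate toward the peak throughput moves the two equilibria together, so a more heavily utilized node tolerates a smaller load disturbance before it tips into collapse.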


Ongoing Work

• Modeling behavior of flow networks
– Vulnerability to cascading failures
– Sensitivity to congestion

• Bringing together notions from network optimization, dynamical systems, and distributed control

• Exploring operating tradeoffs between
– efficiency vs. robustness
– global objectives vs. local behavior
– system performance vs. system vulnerability

• Collectively, these features provide a framework for study of real systems
– UPRR case study
– Computer networks


Future Directions

• Development of decision support tools to support real-time operations
– Warning systems
– Incident recovery

• Investigation of issues related to topology

• Notions from economics
– Network complements and substitutes
– Node cooperation and competition


Key Takeaways

• Large-scale failures happen
– Elements of vulnerability associated with connectivity
– But we are moving to connect everything together…

• Critical tradeoff for network-based businesses
– Business profitability from resource efficiency
– System robustness

• Two fundamental aspects to understanding large-scale failure behavior
– Networks
– Dynamics

• Relevance to a wide variety of applications

Thank You
