Cascading Failures in Infrastructure Networks
David Alderson
Ph.D. Candidate
Dept. of Management Science and Engineering
Stanford University
April 15, 2002
Advisors: William J. Perry, Nicholas Bambos
IPAM 4/15/2002 David Alderson
Outline
• Background and Motivation
• Union Pacific Case Study
• Conceptual Framework
• Modeling Cascading Failures
• Ongoing Work
Background
• Most of the systems we rely on in our daily lives are designed and built as networks
  – Voice and data communications
  – Transportation
  – Energy distribution
• Large-scale disruption of such systems can be catastrophic because of our dependence on them
• Large-scale failures in these systems
  – Have already happened
  – Will continue to happen
Recent Examples
• Telecommunications
  – ATM network outage: AT&T (February 2001)
  – Frame Relay outage: AT&T (April 1998), MCI (August 1999)
• Transportation
  – Union Pacific Service Crisis (May 1997 - December 1998)
• Electric Power
  – Northeast Blackout (November 1965)
  – Western Power Outage (August 1996)
• All of the above
  – Baltimore Tunnel Accident (July 2001)
Public Policy
• U.S. Government interest from 1996 (and earlier)
• Most national infrastructure systems are privately owned and operated
  – Misalignment between business imperatives (efficiency) and public interest (robustness)
• Previously independent networks now tied together through common information infrastructure
• Current policy efforts directed toward building new public-private relationships
  – Policy & Partnership (CIAO)
  – Law Enforcement & Coordination (NIPC)
  – Defining new roles (Homeland Security)
Research Questions
Broadly:
• Is there something about the network structure of these systems that contributes to their vulnerability?
More specifically:
• What is a cascading failure in the context of an infrastructure network?
• What are the mechanisms that cause it?
• What can be done to control it?
• Can we design networks that are robust to cascading failures?
• What are the implications for network-based businesses?
Outline
• Background and Motivation
• Union Pacific Case Study
• Conceptual Framework
• Modeling Cascading Failures
• Ongoing Work
Union Pacific Railroad
• Largest railroad in North America
  – Headquartered in Omaha, Nebraska
  – 34,000 track miles (west of Mississippi River)
• Transporting
  – Coal, grain, cars, other manifest cargos
  – 3rd party traffic (e.g. Amtrak passenger trains)
• 24x7 Operations:
  – 1,500+ trains in motion
  – 300,000+ cars in system
• More than $10B in revenue annually
Union Pacific Railroad
• Four major resources constraining operations:
  – Line capacity (# parallel tracks, speed restrictions, etc.)
  – Terminal capacity (in/out tracks, yard capacity)
  – Power (locomotives)
  – Crew (train personnel, yard personnel)
• Ongoing control of operations is mainly by:
  – Dispatchers
  – Yardmasters
  – Some centralized coordination, primarily through a predetermined transportation schedule
Union Pacific Railroad
• Sources of network disruptions:
  – Weather (storms, floods, rock slides, tornadoes, hurricanes, etc.)
  – Component failures (signal outages, broken wheels/rails, engine failures, etc.)
  – Derailments (~1 per day on average)
  – Minor incidents (e.g. crossing accidents)
• Evidence for system-wide failures
  – 1997-1998 Service Crisis
• Fundamental operating challenge
UPRR Fundamental Challenge
Two conflicting drivers:
• Business imperatives necessitate a lean operation that maximizes efficiency and drives the system toward high utilization of available network resources.
• An efficient operation that maximizes utilization is very sensitive to disruptions, particularly because of the effects of network congestion.
Railroad Congestion
There are several places where congestion may be seen within the railroad:
• Line segments
• Terminals
• Operating Regions
• The Entire Railroad Network
• (Probably not locomotives or crews)
Congestion is related to capacity.
UPRR Capacity Model Concepts
Factors Affecting Observed Performance:
• Dispatcher / Corridor Manager Expertise
• On Line Incidents / Equipment Failure
• Weather
• Temporary Speed Restrictions
[Figure: empirically derived relationship between line segment velocity and volume (trains per day), showing the effect of forcing volume in excess of capacity]
Implications of Congestion
Concepts of traffic congestion are important for two key aspects of network operations:
  – Capacity Planning and Management
  – Service Restoration
In the presence of service interruptions, the objective of Service Restoration is to:
  – Minimize the propagation across the network of any disturbance caused by a service interruption
  – Minimize the time to recovery to fluid operations
Modeling Congestion
We can model congestion using standard models from transportation engineering.
Define the relationships between:
• Number of items in the system (Density)
• Average processing rate (Velocity)
• Input Rate
• Output Rate (Throughput)
Modeling Congestion
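Velocity vs. Density: Assume that velocity decreases (linearly) with the traffic density.
$$v(n) = K\left(1 - \frac{n}{N}\right)$$
[Figure: velocity (v) vs. density (n), a straight line falling from v = K at n = 0 to v = 0 at n = N]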
Modeling Congestion
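Throughput vs. Density: Throughput = Velocity · Density
$$\mu(n) = n\,v(n) = K\left(n - \frac{n^2}{N}\right)$$
Throughput is maximized at n = N/2 with value μ* = N/4 (K = 1).
[Figure: throughput (μ) vs. density (n), rising from zero to the peak μ* at n = N/2 and falling back to zero at n = N]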
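A short worked derivation of the maximum, using only the linear model v(n) = K(1 − n/N) and the definition throughput = velocity · density:

```latex
\begin{align*}
\mu(n) &= n\,v(n) = K\left(n - \frac{n^2}{N}\right), \\
\frac{d\mu}{dn} &= K\left(1 - \frac{2n}{N}\right) = 0
  \;\Longrightarrow\; n^* = \frac{N}{2}, \\
\mu^* &= \mu(n^*) = K\left(\frac{N}{2} - \frac{N}{4}\right) = \frac{KN}{4}.
\end{align*}
```

Because the second derivative, −2K/N, is negative, this stationary point is a maximum. With K = 1 it reduces to μ* = N/4.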
Modeling Congestion
[Figure: three linked panels showing velocity vs. density (intercepts K and N), throughput vs. density (peak μ*), and velocity vs. throughput]
Modeling Congestion
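Let p represent the intensity of congestion onset.
$$v(n) = K\left[1 - \left(\frac{n}{N}\right)^p\right]$$
$$\mu(n) = n\,v(n) = K n\left[1 - \left(\frac{n}{N}\right)^p\right]$$
[Figure: v(n) = 1 - (n/10)^p for n from 0 to 10, one curve for each of p = 0.1, 0.25, 0.5, 1, 2, 4, 10]
[Figure: mu(n) = n*v(n) = n * { 1 - (n/10)^p } for n from 0 to 10, one curve for each of p = 0.1, 0.25, 0.5, 1, 2, 4, 10]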
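The family of curves can be tabulated with a few lines of Python. This is a minimal sketch, not the author's code: the function names are hypothetical, and the closed-form peak n* = N (p + 1)^(−1/p) is derived here by setting dμ/dn = K[1 − (p + 1)(n/N)^p] to zero. The defaults N = 10 and K = 1 match the plotted curves.

```python
def velocity(n, N=10.0, K=1.0, p=1.0):
    """Velocity at density n: v(n) = K * (1 - (n/N)**p)."""
    return K * (1.0 - (n / N) ** p)


def throughput(n, N=10.0, K=1.0, p=1.0):
    """Throughput at density n: mu(n) = n * v(n)."""
    return n * velocity(n, N, K, p)


def peak(N=10.0, K=1.0, p=1.0):
    """Density and value of the throughput maximum.

    From d(mu)/dn = K * (1 - (p + 1) * (n/N)**p) = 0.
    """
    n_star = N * (p + 1.0) ** (-1.0 / p)
    return n_star, throughput(n_star, N, K, p)


# Same values of p as in the plots.
for p in (0.1, 0.25, 0.5, 1, 2, 4, 10):
    n_star, mu_star = peak(p=p)
    print(f"p={p:<5} n*={n_star:6.3f}  mu*={mu_star:6.3f}")
```

Larger p pushes the peak toward n = N and raises it, so the node delivers more throughput but has less room between its best operating point and zero output.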
Modeling Congestion
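As p → ∞, it is clear that
$$\mu(n) = K n\left[1 - \left(\frac{n}{N}\right)^p\right]$$
becomes
$$\lim_{p \to \infty} \mu(n) = K n\,\mathbf{1}_N(n), \quad \text{where } \mathbf{1}_N(n) = \begin{cases} 1 & \text{if } n < N \\ 0 & \text{otherwise} \end{cases}$$
[Figure: mu(n) = n*v(n) = n * { 1 - (n/10)^p } for n from 0 to 10, one curve for each of p = 0.1, 0.25, 0.5, 1, 2, 4, 10]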
UP Service Crisis
• Initiating Event
  – 5/97 derailment at a critical train yard outside of Houston
• Additionally
  – Loss of BNSF route that was decommissioned for repairs
  – Embargo at Laredo interchange point to Mexico
• Complicating Factors
  – UP/SP merger and transition to consolidated operations
  – Hurricane Danny, fall 1997
  – Record rains and floods (esp. Kansas) in 1998
• Operational Issues
  – Tightly optimized transportation schedule
  – Traditional service priorities
UP Service Crisis
[Figure: Union Pacific Railroad Total System Inventory, December 1996 - November 1998. Vertical axis: inventory (cars), 280,000 to 370,000. Annotated regions: Houston-Gulf Coast; Central Corridor (Kansas-Nebraska-Wyoming); Southern California.]
Source: UP Filings with Surface Transportation Board, September 1997 – December 1998
Case Study: Union Pacific
Completed Phase 1 of case study:
• Understanding of the factors affecting system capacity, system dynamics
• Investigation of the 1997-98 Service Crisis
• Project definition: detailed study of Sunset Route
• Data collection, preliminary analysis for the Sunset Route
Ongoing work:
• A detailed study of their specific network topology
• Development of real-time warning and analysis tools
Outline
• Background and Motivation
• Union Pacific Case Study
• Conceptual Framework
• Modeling Cascading Failures
• Ongoing Work
Basic Network Concepts
• Networks allow the sharing of distributed resources
• Resource use ⇔ resource load
  – Total network usage = total network load
• Total network load is distributed among the components of the network
  – Many networking problems are concerned with finding a “good” distribution of load
• Resource allocation ⇔ load distribution
Infrastructure Networks
• Self-protection as an explicit design criterion
• Network components themselves are valuable
  – Expensive
  – Hard to replace
  – Long lead times to obtain
• Willingness to sacrifice current system performance in exchange for future availability
• With protection as an objective, connectivity between neighboring nodes is
  – Helpful
  – Harmful
Cascading Failures
Cascading failures occur in networks where
  – Individual network components can fail
  – When a component fails, the natural dynamics of the system may induce the failure of other components
Network components can fail because of initiating events:
  – Accident
  – Internal failure
  – Attack
A cascading failure is not
  – A single point of failure
  – The occurrence of multiple concurrent failures
  – The spread of a virus
Related Work
Cascading Failures:
  – Electric Power: Parrilo et al. (1998), Thorp et al. (2001)
  – Social Networks: Watts (1999)
  – Public Policy: Little (2001)
Other network research initiatives
  – “Survivable Networks”
  – “Fault-Tolerant Networks”
Large-Scale Vulnerability
  – Self-Organized Criticality: Bak (1987), many others
  – Highly Optimized Tolerance: Carlson and Doyle (1999)
  – Normal Accidents: Perrow (1999)
  – Influence Models: Verghese et al. (2001)
Our Approach
• Cascading failures in the context of flow networks
  – Conservation of flow within the network
• Overloading a resource leads to degraded performance and eventual failure
• Network failures are not independent
  – Flow allocation pattern ⇒ resource interdependence
• Focus on the dynamics of network operation and control
• Design for robustness (not protection)
Taxonomy of Network Flow Models
Modeling Approach            Quantity of Interest
(coarse-grained models)
Static Flow Models           Long-Term Averages
Fluid Approximations         Time-Dependent Averages
Diffusion Approximations     Averages & Variances
Queueing Models              Probability Distributions
Simulation Models            Event Sequences
(fine-grained models)
Relevant Decisions: Ongoing Operation (Processing & Routing); Failure & Recovery; Capacity Planning
Reference: Janusz Filipiak
Time Scales in Network Operations
Relevant Decisions                          Computer Routing           Railroad Transportation
Ongoing Operation (Processing & Routing)    milliseconds to seconds    minutes to hours
Failure & Recovery                          minutes to hours           days to weeks
Capacity Planning                           days to weeks              months to years
(Rows run from short time scales at the top to long time scales at the bottom.)
What Are Network Dynamics?
Type of Network Dynamics     Underlying Assumption
Dynamics ON Networks         Network topology is STATIC
Dynamics OF Networks         Network topology is CHANGING
[Diagram label: Failure & Recovery]
Network Flow Optimization
• Original work by Ford and Fulkerson (1956)
• One of the most studied areas for optimization
• Three main problem types
  – Shortest path problems
  – Maximum flow problems
  – Minimum cost flow problems
• Special interpretation for some of the most celebrated results in optimization theory
• Broad applicability to a variety of problems
Single Commodity Flow Problem
Notation:
  N     set of nodes, indexed i = 1, 2, …, N
  A     set of arcs, indexed j = 1, 2, …, M
  d_i   demand (supply) at node i
  f_j   flow along arc j
  u_j   capacity along arc j
  A     node-arc incidence matrix, with entries
$$a_{ij} = \begin{cases} 1 & \text{if arc } j \text{ enters node } i \\ -1 & \text{if arc } j \text{ exits node } i \\ 0 & \text{otherwise} \end{cases}$$
A set of flows f is feasible if it satisfies the constraints:
$$A_i f = d_i \quad \forall i \in N \quad \text{(flows balanced at node } i \text{, and supply/demand is satisfied)}$$
$$0 \le f_j \le u_j \quad \forall j \in A \quad \text{(flow on arc } j \text{ less than capacity)}$$
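The two feasibility constraints can be checked directly. The sketch below is illustrative only: the function names and the three-node network are hypothetical, and the sign convention (+1 where an arc enters a node, −1 where it exits, so that a positive d_i is a demand and a negative d_i is a supply) is an assumption.

```python
def incidence_matrix(num_nodes, arcs):
    """Build A with a[i][j] = +1 if arc j enters node i, -1 if it exits node i."""
    A = [[0] * len(arcs) for _ in range(num_nodes)]
    for j, (tail, head) in enumerate(arcs):
        A[tail][j] = -1  # arc j exits its tail node
        A[head][j] = +1  # arc j enters its head node
    return A


def is_feasible(A, f, d, u, tol=1e-9):
    """Check A f = d (node balance) and 0 <= f <= u (arc capacity)."""
    balanced = all(
        abs(sum(a_ij * f_j for a_ij, f_j in zip(row, f)) - d_i) <= tol
        for row, d_i in zip(A, d)
    )
    within_capacity = all(-tol <= f_j <= u_j + tol for f_j, u_j in zip(f, u))
    return balanced and within_capacity


arcs = [(0, 1), (0, 2), (1, 2)]  # hypothetical 3-node network
A = incidence_matrix(3, arcs)
d = [-2, 0, 2]                   # node 0 supplies 2 units, node 2 demands 2
u = [1, 1, 1]                    # unit capacity on every arc

print(is_feasible(A, [1, 1, 1], d, u))  # True: balanced and within capacity
print(is_feasible(A, [2, 0, 2], d, u))  # False: balanced, but arcs 0 and 2 exceed capacity
```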
Single Commodity Flow Problem
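Feasible region, denoted F(φ):
$$A_i f = d_i = \begin{cases} -\phi & \text{if } i = s \\ \phi & \text{if } i = t \\ 0 & \text{otherwise} \end{cases} \quad \forall i \in N \quad \text{(flows balanced at node } i \text{)}$$
$$0 \le f_j \le u_j \quad \forall j \in A \quad \text{(flow on arc } j \text{ feasible)}$$
[Diagram: flow φ enters the network at source s and leaves at sink t]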
Minimum Cost Problem
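Let c_j = cost on arc j
$$\min_{f} \sum_{j \in A} c_j f_j$$
subject to:
$$A_i f = d_i = \begin{cases} -\phi & \text{if } i = s \\ \phi & \text{if } i = t \\ 0 & \text{otherwise} \end{cases} \quad \forall i \in N \quad \text{(flows balanced at node } i \text{)}$$
$$0 \le f_j \le u_j \quad \forall j \in A \quad \text{(flow on arc } j \text{ feasible)}$$
[Diagram: flow φ enters the network at source s and leaves at sink t]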
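One standard way to solve a small instance of this problem is the successive shortest path method: repeatedly push flow along the cheapest path in the residual network until φ units have been sent. The sketch below is a plain-Python illustration of that technique, not an algorithm taken from the slides. The function name, the integer data, and the example network are assumptions.

```python
def min_cost_flow(num_nodes, arcs, s, t, phi):
    """Send phi units from s to t at minimum cost.

    arcs: list of (tail, head, capacity, cost) with non-negative costs.
    Returns (total_cost, flow_per_arc), or None if phi units cannot be routed.
    """
    to, cap, cost = [], [], []
    adj = [[] for _ in range(num_nodes)]

    def add_edge(v, w, u, c):
        adj[v].append(len(to))
        to.append(w)
        cap.append(u)
        cost.append(c)

    for tail, head, u, c in arcs:
        add_edge(tail, head, u, c)   # edge 2j: arc j itself
        add_edge(head, tail, 0, -c)  # edge 2j+1: residual reverse of arc j

    sent, total = 0, 0
    while sent < phi:
        # Bellman-Ford: cheapest residual path from s (reverse edges have negative cost).
        dist = [float("inf")] * num_nodes
        prev = [None] * num_nodes
        dist[s] = 0
        for _ in range(num_nodes - 1):
            for v in range(num_nodes):
                if dist[v] == float("inf"):
                    continue
                for e in adj[v]:
                    if cap[e] > 0 and dist[v] + cost[e] < dist[to[e]]:
                        dist[to[e]] = dist[v] + cost[e]
                        prev[to[e]] = e
        if prev[t] is None:
            return None  # no residual path left: the feasible region is empty for this phi
        # Find the bottleneck along the path, then augment.
        push, v = phi - sent, t
        while v != s:
            push = min(push, cap[prev[v]])
            v = to[prev[v] ^ 1]  # the reverse edge points back to the path's previous node
        v = t
        while v != s:
            e = prev[v]
            cap[e] -= push
            cap[e ^ 1] += push
            total += push * cost[e]
            v = to[e ^ 1]
        sent += push
    # Flow on arc j equals the capacity accumulated on its reverse edge.
    return total, [cap[2 * j + 1] for j in range(len(arcs))]


# Hypothetical network: (tail, head, capacity, cost)
arcs = [(0, 1, 2, 1), (0, 2, 1, 4), (1, 2, 1, 1), (1, 3, 1, 5), (2, 3, 2, 1)]
print(min_cost_flow(4, arcs, s=0, t=3, phi=2))  # (8, [1, 1, 1, 0, 2])
print(min_cost_flow(4, arcs, s=0, t=3, phi=4))  # None: at most 3 units fit
```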
Shortest Path Problem
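Let costs c_j correspond to “distances”, set φ = 1
$$\min_{f} \sum_{j \in A} c_j f_j$$
subject to:
$$A_i f = d_i = \begin{cases} -\phi & \text{if } i = s \\ \phi & \text{if } i = t \\ 0 & \text{otherwise} \end{cases} \quad \forall i \in N \quad \text{(flows balanced at node } i \text{)}$$
$$0 \le f_j \le u_j = 1 \quad \forall j \in A \quad \text{(flow on arc } j \text{ feasible)}$$
[Diagram: one unit of flow (φ = 1) enters the network at source s and leaves at sink t]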
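With a single unit of flow, an optimal solution can route that unit along one cheapest s-t path, so a dedicated path algorithm can stand in for the general flow formulation. The sketch below uses Dijkstra's algorithm, which assumes non-negative distances. The function name and example network are hypothetical.

```python
import heapq


def shortest_path(num_nodes, arcs, s, t):
    """Dijkstra's algorithm. arcs: list of (tail, head, distance), distance >= 0.

    Returns (length, node_path), or None if t is unreachable from s.
    """
    adj = [[] for _ in range(num_nodes)]
    for tail, head, c in arcs:
        adj[tail].append((head, c))
    dist = [float("inf")] * num_nodes
    prev = [None] * num_nodes
    dist[s] = 0
    heap = [(0, s)]
    while heap:
        d_v, v = heapq.heappop(heap)
        if d_v > dist[v]:
            continue  # stale heap entry, a shorter route to v was already found
        for w, c in adj[v]:
            if d_v + c < dist[w]:
                dist[w] = d_v + c
                prev[w] = v
                heapq.heappush(heap, (dist[w], w))
    if dist[t] == float("inf"):
        return None
    path, v = [t], t
    while v != s:
        v = prev[v]
        path.append(v)
    return dist[t], path[::-1]


# Hypothetical network: (tail, head, distance)
arcs = [(0, 1, 1), (0, 2, 4), (1, 2, 1), (1, 3, 5), (2, 3, 1)]
print(shortest_path(4, arcs, s=0, t=3))  # (3, [0, 1, 2, 3])
```

In flow terms, the returned path corresponds to setting f_j = 1 on its arcs and f_j = 0 elsewhere.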
Maximum Flow Problem
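$$\max_{f} \; \phi$$
subject to:
$$A_i f = d_i = \begin{cases} -\phi & \text{if } i = s \\ \phi & \text{if } i = t \\ 0 & \text{otherwise} \end{cases} \quad \forall i \in N \quad \text{(flows balanced at node } i \text{)}$$
$$0 \le f_j \le u_j \quad \forall j \in A \quad \text{(flow on arc } j \text{ feasible)}$$
[Diagram: flow φ enters the network at source s and leaves at sink t]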
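A compact way to compute the largest feasible φ is the Edmonds-Karp variant of the Ford-Fulkerson method, which augments along shortest residual paths found by breadth-first search. The sketch below is an illustration under assumed names and data, and its second call shows how removing one arc's capacity shrinks the set of feasible flows.

```python
from collections import deque


def max_flow(num_nodes, arcs, s, t):
    """Edmonds-Karp. arcs: list of (tail, head, capacity). Returns the max s-t flow value."""
    cap = [[0] * num_nodes for _ in range(num_nodes)]
    for tail, head, u in arcs:
        cap[tail][head] += u
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path in the residual graph.
        prev = [None] * num_nodes
        prev[s] = s
        queue = deque([s])
        while queue and prev[t] is None:
            v = queue.popleft()
            for w in range(num_nodes):
                if prev[w] is None and cap[v][w] > 0:
                    prev[w] = v
                    queue.append(w)
        if prev[t] is None:
            return total  # no augmenting path remains
        # Find the bottleneck, then push flow along the path.
        push, w = float("inf"), t
        while w != s:
            push = min(push, cap[prev[w]][w])
            w = prev[w]
        w = t
        while w != s:
            cap[prev[w]][w] -= push
            cap[w][prev[w]] += push
            w = prev[w]
        total += push


# Hypothetical network: (tail, head, capacity)
arcs = [(0, 1, 2), (0, 2, 1), (1, 2, 1), (1, 3, 1), (2, 3, 2)]
print(max_flow(4, arcs, s=0, t=3))  # 3

# Perturbation: arc (2, 3) loses all of its capacity.
damaged = [(0, 1, 2), (0, 2, 1), (1, 2, 1), (1, 3, 1), (2, 3, 0)]
print(max_flow(4, damaged, s=0, t=3))  # 1
```

Any demand φ between 1 and 3 that was feasible before the perturbation becomes infeasible afterward.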
Network Optimization
Traditional Assumptions:
  – Complete information
  – Static network (capacities, demands, topology)
  – Centralized decision maker
Solution obtained from global optimization algorithms
Relevant issues:
  – Computational (time) complexity
    • Function of problem size (number of inputs)
    • Based on worst-case data
  – Parallelization (decomposition)
  – Synchronization (global clock)
New Challenges
Most traditional assumptions no longer hold…
• Modern networks are inherently dynamic
  – Connectivity fluctuates, components fail, growth is ad hoc
  – Traffic demands/patterns constantly change
• Explosive growth ⇒ massive size scale
• Faster technology ⇒ shrinking time scale
• Operating decisions are made with incomplete, incorrect information
• Claim: A global approach based on static assumptions is no longer viable
Cascading Failures & Flow Networks
• In general, we assume that network failures result from violations of network constraints
  – Node feasibility (flow conservation)
  – Arc feasibility (arc capacity)
• That is, failure ⇔ infeasibility
• The network topology provides the means by which failures (infeasibilities) propagate
• In the optimization context, a cascading failure is a collapse of the feasible region of the optimization problem that results from the interaction of the constraints when a parameter is changed
Addressing New Challenges
• Extend traditional notions of network optimization to model cascading failures in flow networks
  – Allow for node failures
  – Include flow dynamics
• Consider solution approaches based on
  – Decentralized control
  – Local information
• Leverage ideas from dual problem formulation
• Identify dimensions along which there are explicit tensions and tradeoffs between vulnerability and performance
Dual Problem Formulation
Primal Problem
$$\min_{f} \; c^T f \quad \text{s.t.} \quad A f = d, \quad f \ge 0, \quad f \le u$$
Dual Problem
$$\max_{\lambda, \nu} \; \lambda^T d - u^T \nu \quad \text{s.t.} \quad \lambda^T A - \nu^T \le c^T, \quad \lambda \text{ unrestricted}, \quad \nu \ge 0$$
• Dual variables λ, ν have interpretation as prices at nodes, arcs
• Natural decomposition as distributed problem
  – e.g. Nodes set prices based on local information
• Examples:
  – Kelly, Low and many others for TCP/IP congestion control
  – Boyd and Xiao for dual decomposition of SRRA problem
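One way to see where the dual comes from is to attach a multiplier to each constraint and minimize over the flows. In the sketch below, λ (one entry per node) and ν (one entry per arc) are names assumed here for the node and arc prices.

```latex
\begin{align*}
L(f, \lambda, \nu)
  &= c^T f + \lambda^T (d - A f) + \nu^T (f - u),
  \qquad f \ge 0,\ \nu \ge 0 \\
  &= \lambda^T d - u^T \nu + \left(c - A^T \lambda + \nu\right)^T f, \\[4pt]
g(\lambda, \nu) = \min_{f \ge 0} L(f, \lambda, \nu)
  &= \begin{cases}
       \lambda^T d - u^T \nu & \text{if } A^T \lambda - \nu \le c, \\
       -\infty & \text{otherwise.}
     \end{cases}
\end{align*}
```

Maximizing g over λ (unrestricted) and ν ≥ 0 gives the dual problem. Each column of A has only two nonzero entries, so the constraint for an arc involves only the prices at its two endpoints and its own arc price, which is what makes a distributed, price-based solution natural.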
Outline
• Background and Motivation
• Union Pacific Case Study
• Conceptual Framework
• Modeling Cascading Failures
• Ongoing Work
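Node Dynamics
• Consider each node as a simple input-output system running in discrete time…
[Diagram: input a(k) enters a node with load n(k), which produces output d(k)]
• Let n(k) = flow being processed in interval k
• Node dynamics
  n(k+1) = n(k) + a(k) – d(k)
• Processing capacity
• State-dependent output
$$d(k) = \begin{cases} \mu(n(k)) & \text{if } 0 \le n(k) \le \text{(failure threshold)} \\ 0 & \text{otherwise} \end{cases}$$
[Figure: d(k) (performance) vs. n(k) (load), with a constant a(k) drawn as a horizontal line and the equilibrium point n* marked]
System is feasible for a(k) below the processing capacity
• a(k) – d(k) indicates how n(k) is changing
• n* is equilibrium point
• Node “fails” if n(k) exceeds the failure threshold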
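The node recursion is easy to simulate. The sketch below rests on assumptions that the slide does not fix: the output curve is μ(n) = K n (1 − (n/N)^p) with N = 10, K = 1, p = 1, the failure threshold equals N, and the input a(k) is constant. All names are hypothetical.

```python
def output_rate(n, N=10.0, K=1.0, p=1.0):
    """Assumed state-dependent output mu(n) = K*n*(1 - (n/N)**p); zero outside [0, N]."""
    if n < 0 or n > N:
        return 0.0
    return K * n * (1.0 - (n / N) ** p)


def simulate_node(a, n0, steps, N=10.0):
    """Iterate n(k+1) = n(k) + a - d(k) with d(k) = output_rate(n(k)).

    Returns (trajectory, failed); stops early once the load exceeds N.
    """
    n, trajectory = n0, [n0]
    for _ in range(steps):
        n = n + a - output_rate(n, N)
        trajectory.append(n)
        if n > N:
            return trajectory, True
    return trajectory, False


# Input below the peak output (2.5 here), starting empty: settles at the equilibrium.
traj, failed = simulate_node(a=2.0, n0=0.0, steps=200)
print(round(traj[-1], 3), failed)

# Same input, but starting past the unstable point: the load runs away.
print(simulate_node(a=2.0, n0=8.0, steps=200)[1])  # True

# Input above the peak output: no equilibrium exists, so the node eventually fails.
print(simulate_node(a=3.0, n0=0.0, steps=200)[1])  # True
```

The second run shows that the same input rate can be sustainable or fatal depending on the current load.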
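Network Dynamics
• The presence of an arc between adjacent nodes couples their behavior
[Diagram: two nodes in series. Node 1 has input a1(k), load n1(k), and output d1(k). Its output crosses an arc of capacity u1(k) and becomes the input a2(k) of node 2, which has load n2(k) and output d2(k). Each node has its own state-dependent output curve, d1(n1) and d2(n2).]
• Arc capacities limit both outgoing and incoming flow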
Network Dynamics
• The failure of one node can lead to the failure of another
[Diagram: the same two nodes in series; node 2 has failed, so the capacity of the arc between them is u1 = 0]
• When a node fails, the capacity of its incoming arcs drops effectively to zero.
• Upstream node loses capacity of arc
• In the absence of control, the upstream node fails too.
Result: Node failures propagate “upstream”…
Question:
• How will the network respond to perturbations?
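A minimal sketch of upstream propagation, under assumptions not stated on the slides: four nodes in series, each with output curve μ(n) = n(1 − n/N) and N = 10, a failure threshold equal to N, a constant external input a = 2 at the first node, and no control (upstream nodes keep accepting input). All names are hypothetical.

```python
def output_rate(n, N=10.0):
    """Assumed output curve mu(n) = n * (1 - n/N); zero outside [0, N]."""
    if n < 0 or n > N:
        return 0.0
    return n * (1.0 - n / N)


def simulate_chain(loads, a, steps, N=10.0):
    """Nodes in series: node 0 receives a, node i+1 receives the output of node i.

    A failed node accepts nothing, so its upstream neighbor's output is blocked.
    Returns the step at which each node failed (None if it never failed).
    """
    n = list(loads)
    fail_step = [None] * len(n)
    for k in range(steps):
        # Mark nodes whose load has passed the failure threshold.
        for i in range(len(n)):
            if fail_step[i] is None and n[i] > N:
                fail_step[i] = k
        # Output is zero for failed nodes and for nodes feeding a failed node.
        d = []
        for i in range(len(n)):
            downstream_failed = i + 1 < len(n) and fail_step[i + 1] is not None
            if fail_step[i] is not None or downstream_failed:
                d.append(0.0)
            else:
                d.append(output_rate(n[i], N))
        # Load update n(k+1) = n(k) + a(k) - d(k) at every node.
        for i in range(len(n)):
            inflow = a if i == 0 else d[i - 1]
            n[i] += inflow - d[i]
    return fail_step


# Every node starts at the stable equilibrium for a = 2, except the last,
# which is pushed past the threshold by a one-time disturbance.
x_stable = 5 - 5 ** 0.5
print(simulate_chain([x_stable, x_stable, x_stable, 11.0], a=2.0, steps=20))
# [12, 8, 4, 0]
```

The failure starts at the last node and reaches the first node twelve steps later, one node every four steps, which is the upstream propagation described above.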
Network Robustness
Consider the behavior of the network in response to a perturbation to arc capacity:
1. Does the disturbance lead to a local failure?
2. Does the failure propagate?
3. How far does it propagate?
Measure the consequences in terms of:
  – Size of the resulting failure island
  – Loss of network throughput
Key factors:
  – Flow processing sensitivity to congestion
  – Network topology
  – Local routing and flow control policies
  – Time scales
Congestion Sensitivity
In many real network systems, components are sensitive to congestion
[Figure: system performance vs. system load, with the region showing evidence of congestion marked]
• Using the aforementioned family of functions we can tune the sensitivity to congestion
• Direct consequences on local dynamics, stability, and control
• Tradeoff between system efficiency and fragility
• Implications for local behavior
Qualitative Behavior
[Figure: output rate vs. total system load. The input rate crosses the output curve at x1* and x2*. Labels: stable equilibrium, unstable equilibrium, congestion collapse.]
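Qualitative Behavior
[Figure: output rate vs. total system load, with the input rate crossing the output curve at x1* and x2*. The load axis is divided into regions of fluid processing, mild congestion, and severe congestion.]
System response to changes in input rate is opposite in fluid vs. congested regions.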
Qualitative Behavior
[Figure: output rate vs. total system load, showing the original input rate (equilibria x1*, x2*), a new input rate (equilibria y1*, y2*), and the safety margin]
“Efficiency” results in “Fragility”
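The effect of raising the input rate can be quantified by locating the two equilibria where output equals input. The sketch below assumes the output curve μ(n) = K n (1 − (n/N)^p) with N = 10 and K = 1, an input rate a > 0, and a safety margin defined as the distance between the stable and unstable equilibria. That definition and all names are assumptions.

```python
def mu(n, N=10.0, K=1.0, p=1.0):
    """Assumed output curve mu(n) = K*n*(1 - (n/N)**p)."""
    return K * n * (1.0 - (n / N) ** p)


def bisect(g, lo, hi, iters=200):
    """Root of g on [lo, hi], assuming g(lo) and g(hi) have opposite signs."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (g(lo) <= 0) == (g(mid) <= 0):
            lo = mid  # root lies in the upper half
        else:
            hi = mid  # root lies in the lower half
    return 0.5 * (lo + hi)


def equilibria(a, N=10.0, K=1.0, p=1.0):
    """Loads where output equals the input rate a, for a > 0.

    Returns (stable, unstable), or None if a is at or above the peak output.
    """
    n_peak = N * (p + 1.0) ** (-1.0 / p)
    if a >= mu(n_peak, N, K, p):
        return None

    def gap(n):
        return mu(n, N, K, p) - a

    stable = bisect(gap, 0.0, n_peak)   # rising side: extra load raises output
    unstable = bisect(gap, n_peak, N)   # falling side: extra load lowers output
    return stable, unstable


for a in (2.0, 2.4):
    x1, x2 = equilibria(a)
    print(f"input rate {a}: stable {x1:.3f}, unstable {x2:.3f}, margin {x2 - x1:.3f}")
```

With these parameters the peak output is 2.5. Raising the input rate from 2.0 to 2.4 uses more of that capacity but narrows the margin between the two equilibria, which is the efficiency versus fragility tradeoff in numerical form.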
Ongoing Work
• Modeling behavior of flow networks
  – Vulnerability to cascading failures
  – Sensitivity to congestion
• Bringing together notions from network optimization, dynamical systems, and distributed control
• Exploring operating tradeoffs between
  – efficiency and robustness
  – global objectives vs. local behavior
  – system performance vs. system vulnerability
• Collectively, these features provide a framework for study of real systems
  – UPRR case study
  – Computer networks
Future Directions
• Development of decision support tools to support real-time operations
  – Warning systems
  – Incident recovery
• Investigation of issues related to topology
• Notions from economics
  – Network complements and substitutes
  – Node cooperation and competition
Key Takeaways
• Large-scale failures happen
  – Elements of vulnerability associated with connectivity
  – But we are moving to connect everything together…
• Critical tradeoff for network-based businesses
  – Business profitability from resource efficiency
  – System robustness
• Two fundamental aspects to understanding large-scale failure behavior
  – Networks
  – Dynamics
• Relevance to a wide variety of applications
Thank You