When Technology Falters: The CareGroup Network Outage
John D. Halamka MD
CIO, CareGroup
CIO, Harvard Medical School
Agenda
In-depth overview of the Network Outage
Key Lessons
The Sequel – SQL Slammer
Questions and Answers
CareGroup Network as Built
[Network diagram: 5500, 5505, 5509, and 3500 XL switches at Renaissance Park, East Campus, and West Campus. Named switches: switch-rca, switch-rcb, switch-rcc, switch-ccell118, switch-rob05, switch-ly030, switch-br203, switch-spg06a, switch-spg06b, switch-ccw00m4. Link legend: ATM OC-3 (155 Mbps) over SONET; ATM OC-3 (155 Mbps) dark fiber; Fast EtherChannel (400 Mbps); Fast EtherChannel (800 Mbps); not active.]
Timeline
November 13, 2002, 1:45pm
– Napster-like internal attack
– Change begins, redundant links cut
– Callisma and Cisco on site
November 14, 2002
– Spanning tree issues
– WAN issues
– CAP declared at 4:00pm
Core Switch Utilization
Timeline
November 15, 2002
– PACS rebuild
– Research/Cardiology rebuild
– Reboot of core and distribution layer
November 16, 2002
– VLAN mismatch
– Redundant core built as contingency
Core Switch Utilization
Root Cause Analysis
The CareGroup network grew organically by merger and acquisition into a massive bridged, switched network that was not within Spanning Tree spec
Equipment was not life cycle managed
Router/switch configuration was not in accordance with best practices, i.e. multicast dense mode
Spanning Tree Problems
When TAC was first able to access and assess the network, we found the Layer 2 structure of the network to be unstable and out of specification with the 802.1D standard. The management VLAN (VLAN 1) was, in some locations, 10 Layer 2 hops from the root.
The conservative default values for the Spanning Tree Protocol (STP) impose a maximum network diameter of seven. This means that two distinct bridges in the network should not be more than seven hops away from one to the other.
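The hop-count limit described above can be checked mechanically against an inventory of switch-to-switch links. The following is a minimal sketch, not the tooling TAC used: the function names, the link list, and the `sw0`..`sw10` switch names are hypothetical. It counts shortest-path hops over physical links, which is a lower bound on the path Spanning Tree actually selects, and it measures distance from the root only rather than the full bridge-to-bridge diameter.

```python
from collections import deque

# Default 802.1D timers assume a maximum network diameter of seven.
STP_MAX_DIAMETER = 7


def hops_from_root(links, root):
    """Breadth-first search: Layer 2 hop count from the root to each switch."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    hops = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for nxt in sorted(neighbors.get(current, ())):
            if nxt not in hops:
                hops[nxt] = hops[current] + 1
                queue.append(nxt)
    return hops


def out_of_spec(links, root, limit=STP_MAX_DIAMETER):
    """Return the switches that sit more than `limit` hops from the root."""
    return {sw: h for sw, h in hops_from_root(links, root).items() if h > limit}


# Hypothetical chain of 11 switches: sw0 (root) through sw10.
chain = [(f"sw{i}", f"sw{i + 1}") for i in range(10)]
violations = out_of_spec(chain, "sw0")
print(violations)
```

In this made-up chain the last switch is 10 hops from the root, matching the worst case reported for the management VLAN, and every switch beyond the seventh hop is flagged.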
Key Lessons
Partner with your network vendor
– Encourage external audits of your network
– Engage advanced engineering services
– Avoid senior management blind spots
Key Lessons
Avoid flat topology bridged switched networks.
Best Practice | CareGroup Network
One VLAN per subnet per physical switch | VLANs span many physical switches
Limited or no bridging | Extensive use of bridging
Layer 2 switching limited to access layer | Layer 2 switching extended across core
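The first row of the comparison can be audited from a per-switch VLAN inventory. This is a rough sketch under assumed inputs: the `vlans_by_switch` mapping, the function name, and the switch names are all hypothetical, and a real audit would pull the VLAN lists from the switches themselves.

```python
def vlans_spanning_switches(vlans_by_switch):
    """Return each VLAN that is configured on more than one physical switch."""
    switches_by_vlan = {}
    for switch, vlans in vlans_by_switch.items():
        for vlan in vlans:
            switches_by_vlan.setdefault(vlan, set()).add(switch)
    return {
        vlan: sorted(switches)
        for vlan, switches in switches_by_vlan.items()
        if len(switches) > 1
    }


# Hypothetical inventory: VLAN 1 is present on every switch.
inventory = {
    "access-a": {1, 10},
    "access-b": {1, 20},
    "access-c": {1, 30},
}
spanning = vlans_spanning_switches(inventory)
print(spanning)
```

Any VLAN reported here extends a Layer 2 domain across multiple switches and is a candidate for being replaced by a routed boundary.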
Key Lessons
Re-evaluate the enterprise architecture of your network
– Routed core
– Switched distribution and access layers
– Robust firewall
Key Lessons
Life cycle manage your network
– Eliminate legacy protocols
– Recognize the value of new feature sets
– Hardware must keep up with the demands of a changing organization: video over IP, IP telephony, bioinformatics, image management
Key Lessons
Implement appropriate monitoring and diagnostic tools to maintain the health and hygiene of your network
– Concord
– NATKit
– CiscoWorks
– OpenView
Key Lessons
Have a robust downtime plan
– Out-of-band diagnostics
– Dial-up modems and computers in key clinical areas
– Overview of CareGroup Disaster Recovery plan
Service Objectives
Protection Features
Protection features
Protection Techniques Cost versus Benefit
Protection Techniques by Vulnerability
Key Lessons
Implement Strict Change Control
– Standards, configurations, devices, protocols, links, processes, procedures, or services
– Prior review and approval of all network infrastructure changes
– Multi-discipline membership
– Changes classed as substantial, moderate, or minimal impact
Key Lessons
Implement Strict Change Control (cont.)
– Substantial changes require Cisco AES review
– Changes scheduled 2am – 5am weekends
– Changes require baseline, testing, and recovery plans
– As-built documentation to include overall, physical, and logical diagrams
– NCCB recommends expense allocation
The Sequel – SQL Slammer
Released at 12:30am on January 25
Infected East Coast at 12:40am
Microsoft SQL Server 2000 was patched; however, Microsoft did not issue any patches or security warnings on Microsoft Data Engine 2000 (MSDE), which is included with numerous desktop products
Spread of the Worm
Exact effect on CareGroup
MSDE and non-IS maintained databases infected
Network saturated by worm activity
Shut off links to Research areas
Blocked all traffic from the public internet
Network traffic levels returned to normal
Cleanup
Restart of servers and desktops that were disrupted by the outage
Once all research areas had cleaned desktops, we restored port 1433 connectivity
Further Lessons learned
VPN as a security risk
Implement a scanning program to analyze research desktop and server vulnerabilities
Ensure you have modern network equipment that affords you the tools to control intra-VLAN traffic
Conclusions
Lifecycle manage your network just as you would your desktop
Ensure senior management understands the value of the network as a strategic asset
Build great downtime procedures including out of band connectivity just in case the technology falters