![Page 1: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/1.jpg)
Leveraging SDN Layering to Systematically Troubleshoot Networks
Brandon Heller★, Colin Scott, Nick McKeown⌘, Scott Shenker, Andreas Wundsam§, Hongyi Zeng⌘, Sam Whitlock, Vimalkumar Jeyakumar⌘, Nikhil Handigol★, James McCauley, Kyriakos Zarifis∞, Peyman Kazemian★
HotSDN 2013, Hong Kong
Affiliations: ⌘ Stanford, Berkeley, ∞ USC, ICSI, ★ SDN Academy, § Big Switch Networks
![Page 2: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/2.jpg)
Admin: skills + tools + knowledge
Network: Protocols, Configuration, Topology
Policy:
• connect hosts A + B
• quarantine virus-infected hosts
• route guest traffic to an HTTP proxy
• prioritize SSH
1: Configure
Ethane, overlays, consistency primitives, network programming languages, …
2: Troubleshoot (This Talk)
3: Fix Stuff!
![Page 3: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/3.jpg)
#1 request from network admins: Automatic Troubleshooting
Source: “Automatic Test Packet Generation”, CoNEXT ’12, Zeng et al.
This Talk
![Page 4: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/4.jpg)
How to automate troubleshooting?
Network ~? Policy
• isolate groups A + B
• route guest traffic to an HTTP proxy
• block a list of virus-infected hosts
Challenging in traditional networks.
Two requirements.
(1) Know the intended policy: impractical to infer
• confusing: different config format for each protocol
• distributed: configuration spread among all nodes
• hard: must understand all protocols & their interactions
(2) Check behavior against policy: difficult to check
• confusing: don’t know lowest-level forwarding behavior
• distributed: hard to get a meaningful snapshot
![Page 5: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/5.jpg)
Control-Plane Layering in SDN
State Layers (top to bottom): Policy, Logical View, Physical View, Device State, Hardware
Code Layers (top to bottom): App (×3), Network Hypervisor, Network OS, Firmware (×3), HW (×3)
![Page 6: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/6.jpg)
Systematically Troubleshooting an SDN
State Layers (top to bottom): Policy, Logical View, Physical View, Device State, Hardware
Code Layers (top to bottom): App (×3), Network Hypervisor, Network OS, Firmware (×3), HW (×3)
Observation: Each state layer fully specifies network behavior.
Insight: Bugs manifest as mistranslations between layers.
Systematic Approach:
(1) Binary search to isolate to a code layer.
(2) Leverage state to isolate within the code layer.
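Step (1) of the systematic approach can be sketched as an ordinary binary search. The Python below is a minimal illustration, not the authors' tooling: `matches_policy` is a hypothetical callback that reports whether a given state layer still agrees with the policy, and the search assumes that every layer above the faulty translation matches and every layer below it does not.

```python
from typing import Callable

STATE_LAYERS = ["Policy", "Logical View", "Physical View", "Device State", "Hardware"]
# CODE_LAYERS[i] translates STATE_LAYERS[i] into STATE_LAYERS[i + 1].
CODE_LAYERS = ["Apps", "Network Hypervisor", "Network OS", "Firmware"]


def localize_code_layer(matches_policy: Callable[[int], bool]) -> int:
    """Return the index of the code layer suspected of the mistranslation.

    Assumes (hypothetically) that layer 0 matches the policy, that the last
    layer does not (a symptom was observed), and that there is a single
    point where layers stop matching.
    """
    good, bad = 0, len(STATE_LAYERS) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if matches_policy(mid):
            good = mid  # translation is still correct down to this layer
        else:
            bad = mid   # the mistranslation happened at or above this layer
    return good  # the code layer between state layers `good` and `good + 1`


# Example: every state layer except Hardware matches the policy.
suspect = localize_code_layer(lambda layer: layer <= 3)
print(CODE_LAYERS[suspect])  # Firmware
```

With five state layers the search needs at most two checks, which matters when each check means running a separate verification tool.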
![Page 7: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/7.jpg)
Phase 1: Localizing to a code layer
Symptom: Hosts unable to communicate
State layers, from [Operator Intent] to [Actual Behavior]: Policy, Logical View, Physical View, Device State, Hardware
Code layers between them: Apps, NetHyperV, NetOS, Firmware
Checks between state layers (~?), each answered Yes or No
Tools: SOFT [CoNEXT ’12], Anteater [SIGCOMM ’11]
Cause: Firmware Bug
![Page 8: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/8.jpg)
Phase 1: Localizing to a code layer
Symptom: Tenant Isolation Breach
State layers, from [Operator Intent] to [Actual Behavior]: Policy, Logical View, Physical View, Device State, Hardware
Code layers between them: Apps, NetHyperV, NetOS, Firmware
Checks between state layers (~?), each answered Yes or No
Tools: HSA [NSDI ’12], OFRewind [ATC ’11], Correspondence Checking
Cause: NetHypervisor Bug
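The slide names correspondence checking without defining it. One plausible reading, sketched here in Python purely as an assumption, is a comparison of two adjacent state layers once both are expressed in the same form. `FlowEntry` and its fields are hypothetical stand-ins for whatever representation the layers share.

```python
from typing import Dict, NamedTuple, Set


class FlowEntry(NamedTuple):
    # Hypothetical common representation of one forwarding rule.
    switch: str
    match: str
    action: str


def correspondence_check(upper: Set[FlowEntry],
                         lower: Set[FlowEntry]) -> Dict[str, Set[FlowEntry]]:
    """Compare what the upper layer intended with what the lower layer holds."""
    return {
        "missing": upper - lower,      # intended but never installed
        "unexpected": lower - upper,   # installed but never intended
    }


physical_view = {
    FlowEntry("s1", "dst=10.0.0.2", "output:2"),
    FlowEntry("s2", "dst=10.0.0.2", "output:1"),
}
device_state = {
    FlowEntry("s1", "dst=10.0.0.2", "output:2"),
    FlowEntry("s2", "dst=10.0.0.2", "output:3"),  # differs from the intent
}
diff = correspondence_check(physical_view, device_state)
print(sorted(diff["missing"]))
print(sorted(diff["unexpected"]))
```

An empty result for both keys means the two layers correspond, so the code layer between them can be ruled out for this symptom.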
![Page 9: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/9.jpg)
How to automate troubleshooting?
Network ~? Policy
• isolate groups A + B
• route guest traffic to an HTTP proxy
• block a list of virus-infected hosts
Possible in Software-Defined Networks
Two requirements.
(1) Know the intended policy: directly provided (app)
• confusing: different config format for each protocol
• distributed: configuration spread among all nodes
• hard: must understand all protocols & their interactions
(2) Check behavior against policy: directly accessible
• confusing: don’t know lowest-level forwarding behavior
• distributed: hard to get a meaningful snapshot
Also annotated on the slide: fewer nodes
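When the policy is given explicitly and the forwarding state can be read directly, checking "isolate groups A + B" reduces to a reachability question. The Python below is a simplified sketch under stated assumptions: `next_hops` is a hypothetical snapshot mapping each node to the nodes it forwards to, and header-specific matching is ignored.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def reachable(next_hops: Dict[str, Set[str]], src: str) -> Set[str]:
    """Breadth-first search over the forwarding snapshot starting at src."""
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in next_hops.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def isolation_violations(next_hops: Dict[str, Set[str]],
                         group_a: Set[str],
                         group_b: Set[str]) -> List[Tuple[str, str]]:
    """List (a, b) pairs where a host in group A can reach a host in group B."""
    violations = []
    for a in sorted(group_a):
        hits = reachable(next_hops, a) & group_b
        violations.extend((a, b) for b in sorted(hits))
    return violations


snapshot = {"a1": {"s1"}, "s1": {"s2"}, "s2": {"b1"}, "b1": set()}
print(isolation_violations(snapshot, {"a1"}, {"b1"}))  # [('a1', 'b1')]
```

The function checks only the A-to-B direction; calling it again with the groups swapped covers the reverse.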
![Page 10: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/10.jpg)
Takeaways
• Control plane layering enables systematic troubleshooting
• Thinking about troubleshooting in terms of layers shows us where tools fit in
  – Reveals missing tools
  – Highlights choices between tools, with tradeoffs
• Plenty of opportunities left.
Operationalize!
![Page 11: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/11.jpg)
Leverage the layers in SDN.
Questions?
![Page 12: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/12.jpg)
How is this different than general distributed systems debugging?
• Simple answer: it’s not! SDN is an excellent opportunity to draw upon ideas from other distributed systems
• Subtlety: networks are solving a much more constrained problem than general distributed systems
![Page 13: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/13.jpg)
Limitations
• Correctness only, not performance
• Side effects not reflected in state
• No guarantee of finding single code layer
• No guarantee of individual layer correctness
• No guarantee of future correctness
• Layer visibility may be imperfect
![Page 14: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/14.jpg)
Plenty of Opportunities Remain
• Automatic Troubleshooting
  Actionable Bug Reports
  – Filtering the signal from the noise
  – Creating consistent views of state
• Improving Invariant Checkers
  – Scale
  – Flexible Policy Input
• Hybrid Traditional + SDN Debugging
![Page 15: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/15.jpg)
Plenty of Opportunities Remain
• Automatic Troubleshooting
  Actionable Bug Reports
  – Filtering the signal from the noise
  – Creating consistent views of state
Packet History: Path + Headers + Forwarding State
[Diagram: four elements, each labeled "Forwarding State"]
[HotSDN 2012: Where is the Debugger for My Software-Defined Network?]
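The slide describes a packet history as path + headers + forwarding state. A record with that shape might look like the Python sketch below. All class and field names are hypothetical and are not taken from the cited paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Hop:
    switch: str
    in_port: int
    out_port: int
    matched_rule: str  # forwarding state that handled the packet at this hop


@dataclass
class PacketHistory:
    headers: Dict[str, str]
    hops: List[Hop] = field(default_factory=list)

    def path(self) -> List[str]:
        """Switches the packet traversed, in order."""
        return [hop.switch for hop in self.hops]


def histories_through(histories: List[PacketHistory],
                      switch: str) -> List[PacketHistory]:
    """Filter histories down to packets that crossed the given switch."""
    return [h for h in histories if switch in h.path()]


h1 = PacketHistory({"dst": "10.0.0.2"},
                   [Hop("s1", 1, 2, "rule-1"), Hop("s2", 1, 3, "rule-7")])
h2 = PacketHistory({"dst": "10.0.0.3"}, [Hop("s1", 1, 4, "rule-2")])
print(h1.path())                               # ['s1', 's2']
print(len(histories_through([h1, h2], "s2")))  # 1
```

Keeping the matched rule next to each hop is what lets a query over histories point at the forwarding state responsible, which is the "filtering the signal from the noise" step.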
![Page 16: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/16.jpg)
Plenty of Opportunities Remain
• Automatic Troubleshooting
  Actionable Bug Reports
  – Filtering the signal from the noise
[Diagram: Controllers A, B, C; Switches 1 through 9]
Minimal Causal Sequence
[Berkeley Tech Report: How Did We Get Into This Mess? Isolating Fault-Inducing Inputs to SDN Control Software]
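One general technique for shrinking an event trace to a minimal causal sequence is delta debugging. The Python below is a simplified ddmin-style sketch, not the tool from the cited report: `triggers_bug` is a hypothetical oracle that replays a candidate event list and reports whether the bug still appears, and replay is assumed to be deterministic.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def minimize_events(events: Sequence[T],
                    triggers_bug: Callable[[List[T]], bool]) -> List[T]:
    """Return a smaller, order-preserving event list that still triggers the bug."""
    current = list(events)
    if not triggers_bug(current):
        raise ValueError("the full event sequence must trigger the bug")
    n = 2  # number of chunks to split into
    while len(current) >= 2:
        chunk = max(len(current) // n, 1)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            complement = [e for j, s in enumerate(subsets) if j != i for e in s]
            if triggers_bug(subset):
                current, n, reduced = subset, 2, True
                break
            if triggers_bug(complement):
                current, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(current):
                break  # already at single-event granularity
            n = min(n * 2, len(current))
    return current


# Hypothetical trace: the bug needs both event 2 and event 5.
trace = list(range(8))
print(minimize_events(trace, lambda ev: 2 in ev and 5 in ev))  # [2, 5]
```

In a real network, dropping events can change timing, so the determinism assumed here may not hold and replays may need to be repeated.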
![Page 17: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/17.jpg)
Isn’t this unnecessary with consistency primitives/languages/etc?
• No
• Catch/rule out bugs outside the framework
• Catch instances where the framework pushes config that breaks the policy
![Page 18: Leveraging SDN Layering to Systematically Troubleshoot Networks](https://reader035.vdocument.in/reader035/viewer/2022062815/56816931550346895de07e61/html5/thumbnails/18.jpg)
What’s novel about this work?
• Simple answer: nothing!