OpenStack Discovery and Networking Assurance - Koren Lev - Meetup
TRANSCRIPT
self marketing slide coming next …
OpenStack Discovery and Assurance
Koren Lev - DC Operator, IT Developer, Entrepreneur, DevOps manager, etc.
• I've been using OpenStack since Diablo (~6 years)
• I've been operating and supporting SP and ENT deployments in Europe and the Middle East
General observations and thoughts…
• I believe OpenStack infrastructure is not very easy to operate (post-installation, that is…)
• I believe it is a bit hard to maintain and troubleshoot
• The community focuses on fulfilment ("make it work"), provisioning ("configure it") and abstraction ("end users don't care about the details") - therein lies the problem (IMHO)
• We have neglected the cloud operator's operational needs (IMHO)
• According to Mirantis (for example): running 5,000 OpenStack nodes failed mostly because of issues around Neutron
• I'll use the networking charter to illustrate this; the points made fit all charters
Thoughts backed up by some investigation
Controllers and Agents vs Workers/Plugins
• Most OpenStack modules operate using controllers and agents.
• Here is an example: Controllers → Agents → Workers
APIs: for fulfilment and provisioning - abstracted
https://docs.openstack.org/developer/neutron/#neutron-stadium
Neutron controller data (current API): "instance", "port", "network", "router"
Very simple, abstracted, awesome for the cloud user…
…and be assured: the network is active!
The views of cloud operations team…
• Let's say a 'vm200' instance on 'network100' can't communicate (it happens…)
• Troubleshooting with premium knowledge (good support personnel)
• Assuming: Mirantis 8.0 (Liberty), mechanism: OVS and LXB, type: VXLAN
• Assuming: only RegionOne
• Assuming: you found the nova instance-to-host mapping (Nova API)
• Assuming: you found the nova instance-name-to-uuid mapping (Nova API)
Since Liberty *
• Running on host 'node-6', OVS agent there, host and agent reachable.
• We need more details before going down to the host level…
• A DHCP server and a gateway/router are running on this network; find out where:
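Locating those services goes through the Neutron agent APIs (e.g. the equivalent of "neutron dhcp-agent-list-hosting-net network100"). A minimal sketch of that lookup; the agent records below are invented sample data shaped like the agent-API responses, not output from a real cloud:

```python
# Sketch: locate the hosts running the DHCP and L3 agents for a network.
# The records are invented sample data shaped like Neutron agent-API output.
agents = [
    {"agent_type": "DHCP agent", "host": "node-6", "alive": True,
     "networks": ["network100"]},
    {"agent_type": "L3 agent", "host": "node-1", "alive": True,
     "routers": ["router100"]},
    {"agent_type": "Open vSwitch agent", "host": "node-6", "alive": True},
]

def hosts_for(agent_type, key, resource):
    """Hosts of live agents of the given type serving the given resource."""
    return [a["host"] for a in agents
            if a["agent_type"] == agent_type
            and a.get("alive")
            and resource in a.get(key, [])]

print(hosts_for("DHCP agent", "networks", "network100"))  # ['node-6']
print(hosts_for("L3 agent", "routers", "router100"))      # ['node-1']
```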
• More details are missing; they are available through MariaDB but not exposed in the API (partial list):
• Is this really important data for troubleshooting?
• Well… it depends on what's wrong in the network (if it's more than just not being 'active' ;-) )
Workers/plugins vendors place their details in MariaDB (no ops API)
• So, based on the findings so far, we move to the host level (yes, the MariaDB data is not enough!):
• Ever wondered what's going on in the hypervisor's interface list? (partial list here):
• Let's skip vNIC model type details for now and move down to the Linux bridge:
The instance-side representation of a network 'port' inside that specific hypervisor (assuming the linuxbridge plugin)
The bridge-side network 'port' inside that specific hypervisor (assuming the linuxbridge plugin)
Thought: is it 'active'?
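On the host itself, "brctl show" ties the instance's tap device to its Linux bridge. A minimal parser sketch; the output below is invented sample data, not captured from a real node:

```python
# Sketch: map tap interfaces to their Linux bridge from `brctl show`-style
# output. The text below is invented sample output for illustration.
sample = """\
bridge name     bridge id               STP enabled     interfaces
brq4a7c3f66-a1  8000.3a1f0a2b3c4d       no              tap8f2e11aa-bb
                                                        vxlan-100
"""

def bridge_of(interface, output):
    bridge = None
    for line in output.splitlines()[1:]:      # skip the header row
        cols = line.split()
        if len(cols) >= 4:                    # row that names a new bridge
            bridge, members = cols[0], cols[3:]
        else:                                 # continuation row: members only
            members = cols
        if interface in members:
            return bridge
    return None

print(bridge_of("tap8f2e11aa-bb", sample))  # brq4a7c3f66-a1
```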
• Let's skip monitoring details for now and move down to Open vSwitch:
The OVS-side network 'port' inside that specific hypervisor (assuming the OVS plugin)
The tunneling bridge inside OVS, in charge of isolation and segmentation
Tunneling used for this specific case (VXLAN)
The integration bridge inside OVS, in charge of isolation and encapsulation
The OVS-side representation of the instance 'port'
• Now, which communication is broken? To which destinations? Depending on the answers, we can go across to the specific tunnel destinations.
• Let's assume vm200 has no IP address assigned, so we investigate the tunnel to node-6 (the Neutron DHCP agent is over there; see slide 7):
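Picking the tunnel that leads to node-6 means finding the OVS port whose remote_ip option matches the destination host. A sketch over invented records shaped like "ovs-vsctl list interface" output:

```python
# Sketch: pick the vxlan port on br-tun whose remote_ip matches the
# destination host. Interface records are invented sample data shaped
# like `ovs-vsctl list interface` output.
interfaces = [
    {"name": "vxlan-c0a80202", "type": "vxlan",
     "options": {"remote_ip": "192.168.2.2", "local_ip": "192.168.2.1"}},
    {"name": "vxlan-c0a80203", "type": "vxlan",
     "options": {"remote_ip": "192.168.2.3", "local_ip": "192.168.2.1"}},
    {"name": "patch-int", "type": "patch", "options": {}},
]

def tunnel_to(remote_ip):
    """Name of the vxlan interface pointing at the given remote endpoint."""
    for iface in interfaces:
        if iface["type"] == "vxlan" and \
           iface["options"].get("remote_ip") == remote_ip:
            return iface["name"]
    return None

print(tunnel_to("192.168.2.2"))  # vxlan-c0a80202
```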
node-1 (192.168.2.1) as source and node-6 (192.168.2.2) as destination (assuming in this example no routing is needed between the source and destination of the tunnel)
• Finding the physical NICs used for the segmentation/tunneling from node-1 to node-6:
The "br-mesh" bridge in this hypervisor holds the IP for the vxlan-sys tunneling inside the OVS
The "br-mesh" bridge in this hypervisor is connected through pNIC ens160, sub-interface 103 (the VLAN for the tunnel endpoint)
vi /etc/network/interfaces.d/ifcfg-ens160.103:
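The file content itself isn't reproduced in the transcript; a hypothetical sketch of what such a Debian-style VLAN sub-interface definition typically looks like (the address is taken from the node-1 example above, everything else is assumed):

```
# /etc/network/interfaces.d/ifcfg-ens160.103 -- illustrative sketch only
auto ens160.103
iface ens160.103 inet static
    address 192.168.2.1
    netmask 255.255.255.0
    vlan-raw-device ens160
```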
• Moving to node-1 for the L3, DHCP and metadata investigations:
Find the UUID of the DHCP service run by that specific DHCP agent on that specific node
The DHCP server has this vNIC port connected down at node-1
• vService vNIC interface connections on node-1 (DHCP - a quick summary):
• vService vNIC interface connections on node-1 (L3 - a quick summary):
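Those vServices live in network namespaces whose names follow Neutron's convention: DHCP servers in qdhcp-&lt;network-uuid&gt;, routers in qrouter-&lt;router-uuid&gt;. A tiny sketch (the UUID below is invented):

```python
# Sketch: Neutron derives namespace names from resource UUIDs:
# DHCP servers live in qdhcp-<network-uuid>, routers in qrouter-<router-uuid>.
def dhcp_namespace(network_uuid):
    return "qdhcp-" + network_uuid

def router_namespace(router_uuid):
    return "qrouter-" + router_uuid

net_uuid = "4a7c3f66-a1b2-4c3d-8e9f-001122334455"  # invented example UUID
print(dhcp_namespace(net_uuid))
# inspect it on the node with: ip netns exec qdhcp-<uuid> ip addr
```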
• What if we change distribution/mechanism/types? (Guess what: different discovery/collection logic and different details per object.) A DPDK/fd.io example:
• What if there is more than one VM? What if HA? What if DVR?
• Discovery × VMs × 2, Discovery × 2, Discovery × Hosts
• Only post-discovery can you start finding a fix…
Yes, we are a small team that has spent the last year developing a possible offering to start solving the networking charter, focused on a 'Networking Operations API' (see next).
…not a cure for cancer, but it's pretty good, tested with real IT operations teams.
We call it 'Calipso'.
Point made (!?) Stop bitching… any solution?
Possible OpenStack attachments: 'Monasca', 'Vitrage', 'Ceilometer', 'Neutron', 'Tacker'. Others: 'Barometer'.
• OpenStack "Operations APIs" - let's get started…
• Exposing the needed details for the cloud operations team
• To be developed for any module suffering from a lack of workers/plugins visibility
Our 'Networking Operations API':
• Modeled for multi-distribution and any mechanism-driver / type-driver variances
• Includes smart discovery logic, a visualization solution, monitoring and analysis
Proposition: a possible starting point
Visibility = Predictability = Stability
[Diagram: OSDNA modules - Inventory Discovery, Graph, Monitor, Failure Detection, Failure Analysis, Report. Maintenance views show connections, dependencies, state and impact; troubleshooting views show the failure and its root cause. Interfaces: API, DB and CLI for hypervisor/container discovery. Users: Cloud Network Administrator (CNA) and Tenant Network Administrator (TNA).]
Project ‘Calipso’
Calipso objects - examples

OSDNA Object | Object Details | Example 1 | Example 2 | Example 3
vService | Services overlay (virtual) | DHCP (ip netns) | L3 GW (ip netns) | FWaaS
vNIC | Instance/vService NIC | Tap to linux-bridge | VPP Virtual-Ethernet | Container CNI
vConnector | L2 inside a host (isolation) | Linux Bridge | VPP bridge-domain | VMware Port-Group
vEdge | Virtual-to-physical edge | OVS | VPP | Midonet
pNIC / Bond | Physical underlay fabric | Edge ports | EPGs in ACI | Server Eth / Ether-channels
Network Segment | Virtual segments (for any tunneling overlay) | VLAN | VXLAN Segment-ID | GRE segments
OTEP | Overlay tunnel endpoint | VXLAN | Geneve | GRE
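The objects in the table above can be pictured as typed documents in a discovery inventory. A hypothetical sketch; the field names and values are ours for illustration, not Calipso's actual schema:

```python
# Sketch: Calipso-style inventory objects as typed documents.
# Field names and values here are illustrative, not the project's schema.
inventory = [
    {"type": "vservice", "name": "dhcp-network100", "host": "node-6"},
    {"type": "vnic", "name": "tap8f2e11aa-bb", "host": "node-6"},
    {"type": "vedge", "name": "OVS-node-6", "host": "node-6"},
    {"type": "pnic", "name": "ens160", "host": "node-6"},
]

def objects_of(obj_type):
    """Names of all inventory objects of a given modeled type."""
    return [o["name"] for o in inventory if o["type"] == obj_type]

print(objects_of("vnic"))  # ['tap8f2e11aa-bb']
```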
OSDNA View | Details | Example 1 | Example 2 | Example 3
Virtual Topology | Modular links graph in Calipso discovery | vService to Network | Instance to Network | All virtual-to-physical per network
Policy Topology | Data from the APP driving OpenStack | App VM to DB VM | VNF to end-user | VNF chaining
Calipso object model: adaptive, simple
[Diagram: the Calipso discovery logic connects to multiple environments (A, B, C); each environment has its own Environment_Config with initial scan logic and is reached through API, DB and CLI.]
Example Environment_Config:
{
  "configuration" : [
    {
      "name" : "MyENV3",
      "host" : "10.56.20.239",
      "port" : "5673",
      "user" : "nova",
      "password" : "YVWMiKMshZhlxxxxqFu5PdT9d"
    },
    {
      "name" : "Monitoring3",
      "type" : "Sensu",
      "host" : "korlev-nsxe1.cisco.com",
      "port" : "4567"
    }
    [removed]
  ],
  "distribution" : "Mirantis-8.0",
  "last_scanned" : "5/8/16",
  "name" : "Mirantis-Liberty",
  "mechanism_drivers" : ["OVS"],
  "type_drivers" : "vxlan",
  "operational" : "yes",
  "type" : "environment"
}
Calipso hierarchical, modeled
Inventory: regions, projects, hosts, aggregates/zones, networks, ports, instances, vNICs, vConnectors, vEdges, vServices, pNICs, OTEPs, etc.
Links and Relationships Analysis: Instance-vNIC, vNIC-vConnector, vConnector-vEdge, vEdge-pNIC, pNIC-OTEP, OTEP-vConnector, vService-vNIC, Network-Port, etc.
Calipso Cliques and Topologies (Cliques): Focal_point_type (ex: instance); Clique_type: [array of links]
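A clique is then a walk over typed links starting from the focal point. A minimal sketch with invented link data (the real clique_type definitions live in Calipso's configuration, not here):

```python
# Sketch: derive a 'clique' by chaining links from a focal point through
# an ordered clique_type (a list of link types). All data is invented.
links = [
    {"type": "instance-vnic", "source": "vm200", "target": "vnic1"},
    {"type": "vnic-vconnector", "source": "vnic1", "target": "brq4a7c"},
    {"type": "vconnector-vedge", "source": "brq4a7c", "target": "ovs-node6"},
    {"type": "vedge-pnic", "source": "ovs-node6", "target": "ens160"},
]
clique_type = ["instance-vnic", "vnic-vconnector",
               "vconnector-vedge", "vedge-pnic"]

def clique(focal_point):
    """Follow the clique_type link chain from the focal point outward."""
    path, node = [focal_point], focal_point
    for link_type in clique_type:
        nxt = next((l["target"] for l in links
                    if l["type"] == link_type and l["source"] == node), None)
        if nxt is None:
            break
        path.append(nxt)
        node = nxt
    return path

print(clique("vm200"))
# ['vm200', 'vnic1', 'brq4a7c', 'ovs-node6', 'ens160']
```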
Real-time updates:
[Diagram: RabbitMQ CRUD events feed per-environment listeners (Environment_Listener A/B/C) running event-based scan logic; an ObjectScan uses SSH, parsing and caching against the target (e.g. Environment A, Region X, Zone Y, Host 234).]
Calipso Monitoring
[Diagram: a Sensu server manager (configured by Calipso) runs Calipso Sensu checks against the Sensu Redis DB; Sensu client transports are configured and deployed by Calipso on the monitored elements. Checks are customized and modeled per object: VPP, vNIC, LXB, OTEP and pNIC stats/results, etc. Results flow through the Sensu API and UI and per-environment Calipso Sensu handlers, via an environment-aware Monitoring Configurator, onto the Calipso BUS, yielding real-time status and statistics for OTEPs, vNICs, pNICs and vEdges against the modeled hierarchical inventory.]
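A check that such a configurator could deploy per discovered object might look roughly like this Sensu-style definition; the check name and command are hypothetical illustrations, not Calipso's actual checks:

```json
{
  "checks": {
    "check_vnic_tap8f2e11aa-bb": {
      "command": "check_vnic.py --interface tap8f2e11aa-bb",
      "subscribers": ["node-6"],
      "interval": 60,
      "handlers": ["calipso"]
    }
  }
}
```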
Calipso porting to TSDB:
[Diagram: the Calipso discovery logic feeding a TSDB, possibly contributing to OpenStack health checks and historical reporting.]
Calipso visualization: modeled for complex virtual topologies
[Diagram: OpenStack Calipso Discovery feeds the Calipso UI and Calipso Graph, connecting physical and virtual elements of cloud networking. Cloud Networking Assurance: historical trends, root cause and impact analysis. Users: Cloud Network Administrator, Tenant Network Administrator. Exposes virtual network elements, dependencies, status and stats as API extensions for discovery/assurance.]
[Diagram: architecture - runs on Docker; supports ANY (*Open)Stack, ANY plugin. A model-driven discovery engine with Mongo DB inventory, Monitor, BUS, UI and API components. Flows: an Environment Config (init/setup) lets users run a scan; the discovery engine scans for all data (API, DB, CLI), or on a schedule for some data, producing full inventory and topology data; OpenStack CRUD events arriving over RabbitMQ drive live updates; a monitoring config installs Sensu clients and checks, whose state/statistics and check results flow back as messages/notifications; the UI and external apps (e.g. an analysis app) consume inventory and topology through the API; an agent for the 'Operations API'.]
* All container-based today
Discovery logic successfully running on:
OVS, VLANs, GREs, VXLANs:
• "Mirantis-6.0", "Mirantis-7.0", "Mirantis-8.0", "Mirantis-9.0", "Mirantis-9.1"
• "RDO-Juno", "RDO-Liberty", "RDO-Mitaka"
• "Devstack-liberty", "Devstack-Mitaka"
• "Canonical-icehouse", "Canonical-juno", "Canonical-liberty", "Canonical-mitaka"
• "Apex-Mitaka" (3-o)
• "packstack-7.0.0-0.10.dev1682"
• "Stratoscale-v2.1.6"
VPP, VLANs: "RDO-Mitaka", "Apex-Mitaka"
Pre-QA: Midonet, vSphere (vSwitch)
If your variance is not on this list, it means we didn't test/validate it.
We'd appreciate your help in adapting to more variances!
Adapting to multi-environment cases!!
Calipso objects for OpenStack, Containers, Bare Metal and VMware vSphere (discovery through each platform's API; Calipso monitoring via custom Sensu checks where applicable, otherwise N/A):

Objects in Calipso discovery | OpenStack | Containers | Bare Metal | vSphere
Region - ex: NYC, SJC | Region | N/A | N/A | DataCenter
Zone / Aggregate - ex: B16, Floor 2, etc. | Zone / Aggregate | Cluster | N/A | Cluster
Host - ex: compute node | Host | Server | N/A | Server
Project - ex: Coke | Project | Tenant | N/A | Tenant
Port | Port | Container veth | NIC | Port-group
Network | Network | Network | Network | Network

Calipso adapters (all through API): API - OpenStack; API - Contiv, Docker; API - Cisco UCS; API - vSphere.
Calipso objects for OpenStack, Containers, Bare Metal and VMware vSphere (continued):

Objects in Calipso discovery | OpenStack | Containers | Bare Metal | vSphere
Instance / vService - ex: a VM, a DHCP srv | Instance / vService | Container | A Server | VM
vNIC / Port | vNIC / Port | Container veth, CNI | N/A | vNIC
vConnector - ex: Bridge | vConnector | Bridge, BDomain | N/A | Port-group
vEdge - ex: OVS, fd.io etc. | vEdge | OVS, fd.io | N/A | vSwitch / NSX switch
pNIC - ex: TengigEth | pNIC | pNIC | pNIC | pNIC
OTEP - ex: VXLAN, GRE | OTEP | VXLAN | N/A | VXLAN
Network / Network Segment | Network / Network Segment | Network / Network Segment | Network / Network Segment | Network

Calipso adapters:
• OpenStack: API - OpenStack; DB - MySQL; CLI - Linux Bash / SSH
• Containers: API - Contiv, Docker; DB - ETCD; CLI - Linux Bash / SSH / Docker
• Bare Metal: API - Cisco UCS; DB - none; CLI - OS-specific / SSH
• vSphere: API - vSphere; DB - N/A; CLI - ESXi
Calipso monitoring: custom Sensu checks per object.