can you trust neutron?

Can you trust Neutron? A tour of scalability and reliability improvements from Havana to Juno Salvatore Orlando (@taturiello) Aaron Rosen (@aaronorosen)

Upload: salvorlando

Post on 14-Jun-2015


Category:

Software



DESCRIPTION

A tour of scalability improvements between Havana and Juno. The presentation discusses results from an experimental campaign and the various features that enabled the scalability improvements. Presentation by Aaron Rosen and Salvatore Orlando.

TRANSCRIPT

Page 1: Can you trust Neutron?

Can you trust Neutron?
A tour of scalability and reliability improvements from Havana to Juno

Salvatore Orlando (@taturiello)
Aaron Rosen (@aaronorosen)

Page 2: Can you trust Neutron?

From Havana to Juno

● 12 months
● 1672 commits
● +147765 -70127 lines of code

(excluding changes in neutron/locale/*)

But... did it really get any better?

Page 3: Can you trust Neutron?

Measuring scalability - Process

● Goal: Validate agent scalability under varying load
o In this talk we’ll discuss the L2 agent only, sorry!

● Testbed: single server OpenStack installation

● Methodology: run several experiments, increasing the number of servers created concurrently
o Number of servers ranging from 1 to 20
o Every experiment is repeated 20 times
o For each metric, study mean, median, and variance
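The per-metric analysis described above can be sketched in a few lines of Python. This is only an illustration: the sample values and the `summarize` helper are hypothetical, not taken from the experimental campaign.

```python
import statistics

def summarize(samples):
    """Return mean, median and sample variance for one metric
    measured across the repetitions of a single experiment."""
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "variance": statistics.variance(samples),  # sample variance (n - 1)
    }

# Hypothetical t_up measurements in seconds; the real campaign
# repeated every experiment 20 times.
t_up_samples = [2.0, 3.0, 4.0, 3.0]
summary = summarize(t_up_samples)
```

Looking at the median next to the mean helps spot runs where a few slow ports skew the average.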

Page 4: Can you trust Neutron?

Measuring scalability - Metrics

Instance metrics (t_start = instance created):
● t_active - time until the instance reaches active state
● t_ping - time until the instance can be pinged
● t_allocate_net - time spent configuring networking for the instance

Port metrics (t_start = VIF plugged):
● t_proc - time until the agent starts processing the port
● t_up - time until the port is wired
● t_dhcp - time for adding DHCP info for the new port

Page 5: Can you trust Neutron?

Measuring scalability - Results

t_up in Havana and Juno - a rather remarkable difference!

Page 6: Can you trust Neutron?

Measuring scalability - Results

t_allocate_net almost constant in Juno

Growth trend is only 15% of the one seen in Havana

Page 7: Can you trust Neutron?

Measuring scalability - Results

● VM failure rate analysis
o Failure == error while creating the VM, or unable to ping it within the 3 min timeout
● Juno is decently reliable (Havana not as much…)
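The failure criterion above can be written down as a small Python sketch. The result tuples and function names are hypothetical; only the 3 minute ping timeout comes from the slide.

```python
PING_TIMEOUT_S = 180  # the 3 minute timeout mentioned on the slide

def is_failure(create_error, t_ping):
    """A run fails if VM creation errored, or the instance could not be
    pinged within the timeout (t_ping is None if it never answered)."""
    return create_error or t_ping is None or t_ping > PING_TIMEOUT_S

def failure_rate(results):
    """Fraction of failed runs in a list of (create_error, t_ping) tuples."""
    failures = sum(1 for err, t_ping in results if is_failure(err, t_ping))
    return failures / len(results)

# Hypothetical runs: one success, one never pingable,
# one creation error, one pingable only after the timeout.
results = [(False, 12.5), (False, None), (True, None), (False, 200.0)]
rate = failure_rate(results)
```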

Page 8: Can you trust Neutron?

Analysing progress

[Figure: progress compared across releases: Folsom, Grizzly, Havana, Icehouse, Juno]

Page 9: Can you trust Neutron?

How the software improved

● Boot VMs only once network is wired

● Remove choke points from L2 agents

● Streamline security group RPC

● Better router processing in L3 agents

● Reporting floating IP processing status

● Many others… which unfortunately won’t fit into the time allocated to this talk

Page 10: Can you trust Neutron?

More results

Page 11: Can you trust Neutron?

● Virtually no improvement in the time to ping an instance

- As the tests are executed on a single host, IO contention between instances is the main bottleneck.

- “Time to ping” is slowed down by longer instance boot times

● Instances are slower to go to “ACTIVE” than they were in Havana

- This is actually a desired feature

- Indeed, it is the reason why the failure rate in Juno is 0 even with 20 concurrent instances

Page 12: Can you trust Neutron?

Nova/Neutron Event reporting

Problem: Nova displays cached IPAM info about instance from neutron. Cache is updated slowly…

nova-api

neutron-api

1. Associate floating IP to port

2. Show me instance!

Wat? No floating ip?

Page 13: Can you trust Neutron?

Nova/Neutron Event reporting

Solution: Neutron sends events to nova on IPAM changes, causing nova to update its cache.

neutron-api

1. Associate floating IP to port

nova-api

2. network-changed for instance X

nova-compute

3. dispatch event to compute host

4. update_network cache for instance X

5. Show me instance!

I haz floating ip
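The event-driven cache refresh above can be illustrated with a toy Python model. This is not Nova code: `NetworkInfoCache`, `fetch_from_neutron` and the data layout are assumptions made up for the example; only the `network-changed` event name comes from the slide.

```python
class NetworkInfoCache:
    """Toy per-instance network info cache, refreshed by events."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: instance_id -> network info dict
        self._cache = {}

    def get(self, instance_id):
        # Serve from the cache; only hit the backend on a miss.
        if instance_id not in self._cache:
            self._cache[instance_id] = self._fetch(instance_id)
        return self._cache[instance_id]

    def handle_event(self, name, instance_id):
        # A network-changed event forces a refresh for that instance.
        if name == "network-changed":
            self._cache[instance_id] = self._fetch(instance_id)

# Hypothetical stand-in for the networking backend state.
neutron_state = {"X": {"fixed_ip": "10.0.0.3", "floating_ip": None}}

def fetch_from_neutron(instance_id):
    return dict(neutron_state[instance_id])

cache = NetworkInfoCache(fetch_from_neutron)
cache.get("X")                                    # cache is populated
neutron_state["X"]["floating_ip"] = "172.24.4.10"  # floating IP associated
stale = cache.get("X")["floating_ip"]             # still the old view
cache.handle_event("network-changed", "X")
fresh = cache.get("X")["floating_ip"]             # refreshed by the event
```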

Page 14: Can you trust Neutron?

Nova/Neutron Event reporting

Problem: Instances would go active before the network was wired. Some DHCP clients (such as the one in cirros images) don’t keep retrying...

nova-api

1. Boot instance

W00T Active!

Timeout.. Hrm?!?

2. Ready?!?

3. ssh instance…..

Page 15: Can you trust Neutron?

Nova/Neutron Event reporting

Solution: Neutron sends events to nova when the network is ready.

Components: nova-api, nova-scheduler, nova-compute, neutron-api, Neutron Backend, VM

1. Boot instance
2. Allocate network for instance
3. VM started in paused state
1B. Port X active
2B. event: network-vif-plugged: port X
3B. VM unpaused

Page 16: Can you trust Neutron?

Enabling/disabling event reporting

Settings in nova.conf

vif_plugging_timeout = 300
vif_plugging_is_fatal = True
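One way to picture how these two settings interact is the following Python sketch. It is a simplified model, not Nova's implementation: `wait_for_vif_plugged` and `VifPluggingError` are hypothetical names, and a `threading.Event` stands in for the network-vif-plugged notification.

```python
import threading

class VifPluggingError(Exception):
    """Raised when the network is not wired in time and that is fatal."""

def wait_for_vif_plugged(event, timeout, is_fatal):
    """Wait up to `timeout` seconds for the vif-plugged notification.

    Returns True if the event arrived and False if the wait timed out
    but the timeout is not fatal (the boot proceeds anyway).
    """
    if event.wait(timeout):
        return True
    if is_fatal:
        raise VifPluggingError("timed out waiting for network-vif-plugged")
    return False

# Simulate the notification arriving shortly after the VM starts paused.
plugged = threading.Event()
threading.Timer(0.05, plugged.set).start()
ready = wait_for_vif_plugged(plugged, timeout=2.0, is_fatal=True)
```

With the fatal flag set to False, a timeout only means the instance is resumed without confirmation that its port is wired.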

Page 17: Can you trust Neutron?

Speeding up L2 interface processing

Problem - device processing delayed by:
- inefficient server/agent interface
- preemptive behaviour of security group callbacks
- pedantic polling of interfaces on integration bridge
- superficial analysis of devices to process

Solution:
- ovsdb-monitor triggers interface processing only when changes are detected
- Neutron server performs at most 2 RPC calls over AMQP for each API operation (only 1 call in most cases)
- The L2 agent queries the server only once to retrieve interface details
- Security group updates are processed in the same loop as interfaces, thus avoiding starvation
- The agent only processes interfaces which are ready to be used and, most importantly, processes them only once!
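The loop structure described above might look roughly like the following Python sketch. It is a toy model under stated assumptions, not the OVS agent's code: `agent_iteration`, its parameters and the port names are all hypothetical.

```python
def agent_iteration(changed_ports, sg_updates, processed, is_ready, fetch_details):
    """One iteration of a simplified L2 agent loop.

    Only ports reported as changed, ready, and not yet processed are
    handled; their details are fetched with a single bulk query, and
    pending security group updates are drained in the same iteration.
    """
    todo = [p for p in changed_ports if p not in processed and is_ready(p)]
    details = fetch_details(todo) if todo else {}  # one server round trip
    for port in todo:
        assert port in details   # details came from the single bulk query
        processed.add(port)      # never process the same port twice
    refreshed = sorted(sg_updates)  # security groups handled in this loop
    sg_updates.clear()
    return todo, refreshed

calls = []  # records every (hypothetical) server query

def fetch_details(ports):
    calls.append(list(ports))
    return {p: {"port_id": p} for p in ports}

processed = set()
# p3 is not ready yet, so it is skipped in the first iteration.
wired_1, sg_1 = agent_iteration(["p1", "p2", "p3"], {"p9"}, processed,
                                lambda p: p != "p3", fetch_details)
# p1 shows up again but was already processed; p3 is now ready.
wired_2, sg_2 = agent_iteration(["p1", "p3"], set(), processed,
                                lambda p: True, fetch_details)
```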

Page 18: Can you trust Neutron?

Streamlining security group RPCs

Problem - exponential complexity
The payload of the RPC call to retrieve security group rules grows exponentially as the number of devices increases.

Solution:
Restructure the format of the payload exchanged between agent and server, removing data redundancy. With the new payload format, security group rules are no longer repeated.
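To see why removing redundancy matters, compare two simplified payload shapes in Python. Both formats are illustrative assumptions, not the actual Neutron RPC schema: in the first, every device carries a full copy of its rules; in the second, rules are listed once per security group and devices only reference group IDs.

```python
import json

def old_payload(devices, groups):
    """Redundant shape: each device repeats all rules of its groups."""
    return {
        dev: {"security_group_rules": [r for g in sgs for r in groups[g]]}
        for dev, sgs in devices.items()
    }

def new_payload(devices, groups):
    """Deduplicated shape: rules appear once, devices reference groups."""
    used = {g for sgs in devices.values() for g in sgs}
    return {
        "security_groups": {g: groups[g] for g in sorted(used)},
        "devices": {dev: {"security_groups": list(sgs)}
                    for dev, sgs in devices.items()},
    }

# Hypothetical data: 50 ports that all belong to one 3-rule group.
groups = {"sg-web": [
    {"direction": "ingress", "protocol": "tcp", "port": 80},
    {"direction": "ingress", "protocol": "tcp", "port": 443},
    {"direction": "egress", "protocol": "any", "port": None},
]}
devices = {"port-%d" % i: ["sg-web"] for i in range(50)}

old = old_payload(devices, groups)
new = new_payload(devices, groups)
old_size = len(json.dumps(old))
new_size = len(json.dumps(new))
old_rule_copies = sum(len(d["security_group_rules"]) for d in old.values())
```

In this toy case the old shape serializes 150 rule copies while the new one serializes 3, so the gap widens as ports are added.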

Page 19: Can you trust Neutron?

Streamlining security group RPCs

Credits: Miguel Angel Ajo Pelayo
http://www.ajo.es/post/95269040924/neutron-security-group-rules-for-devices-rpc-rewrite

[Figures: RPC message payload size vs. number of ports; RPC execution time vs. number of ports]

Page 20: Can you trust Neutron?

Reducing router processing times

Problems:
● Router synchronization starves RPC handling
● Not enough parallelism in router and floating IP processing

Solution:
● Router synchronization tasks and RPC messages are added to a priority queue. Items pulled from the queue are processed in separate threads.
● Apply iptables commands in a non-blocking fashion
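The priority queue idea can be sketched with Python's standard library. This is a minimal model, not the L3 agent's code: the priority constants, `run_workers` and the router names are assumptions, and giving RPC-triggered updates precedence over periodic synchronization is one plausible choice of ordering.

```python
import queue
import threading

PRIORITY_RPC = 0   # assumed: RPC-triggered updates go first
PRIORITY_SYNC = 1  # assumed: periodic full synchronization goes last

def run_workers(work_items, num_workers=2):
    """Queue (priority, router_id) items and drain them with worker threads.

    Returns the handled items in completion order.
    """
    q = queue.PriorityQueue()
    for seq, (priority, router_id) in enumerate(work_items):
        # seq breaks ties so equal priorities keep their arrival order
        q.put((priority, seq, router_id))

    handled = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                priority, _, router_id = q.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            with lock:
                handled.append((priority, router_id))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled

items = [(PRIORITY_SYNC, "router-a"), (PRIORITY_RPC, "router-b"),
         (PRIORITY_SYNC, "router-c"), (PRIORITY_RPC, "router-d")]
ordered = run_workers(items, num_workers=1)
```

With a single worker the completion order matches the queue order exactly; with several workers the items are still dequeued by priority, but completion order can interleave.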

Page 21: Can you trust Neutron?

Know your floating IP status

Problem:
There was no way to know whether your floating IP is ready or not (beyond pinging it, obviously)

Solution:
- Introducing the concept of operational status for floating IPs.
- The L3 agent calls back the server to confirm successful floating IP creation (ACTIVE), or an error (DOWN)
- The state defaults to DOWN. Goes ACTIVE upon floating IP association, and DOWN when the floating IP is disassociated.
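The status transitions listed above amount to a small state machine, sketched here in Python. The `FloatingIP` class and its method names are hypothetical; the sketch also assumes that the status only flips to ACTIVE once the agent reports back, which is one reading of the slide.

```python
ACTIVE = "ACTIVE"
DOWN = "DOWN"

class FloatingIP:
    """Toy model of a floating IP with an operational status."""

    def __init__(self, address):
        self.address = address
        self.port_id = None
        self.status = DOWN  # the state defaults to DOWN

    def associate(self, port_id):
        # Association is requested; status changes once the agent reports.
        self.port_id = port_id

    def agent_report(self, success):
        # Callback from the (hypothetical) L3 agent after it applied
        # the floating IP: ACTIVE on success, DOWN on error.
        associated = self.port_id is not None
        self.status = ACTIVE if (success and associated) else DOWN

    def disassociate(self):
        self.port_id = None
        self.status = DOWN

fip = FloatingIP("172.24.4.10")
initial = fip.status
fip.associate("port-x")
fip.agent_report(success=True)
after_association = fip.status
fip.disassociate()
after_disassociation = fip.status
```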

Page 22: Can you trust Neutron?

Other enhancements (in brief)

● Multiple REST API workers

● Multiple RPC over AMQP workers

● Better IP address recycling

● Removal of several locking queries

o i.e. SELECT ... FOR UPDATE statements

● Removal of conditions triggering LOCK WAIT timeout errors

o bug triggered by eventlet yielding within a transaction

Page 23: Can you trust Neutron?

Where we are...

● L2 agent scalability improved considerably over the past 12 months

o Results measured with OVS only but the same considerations apply to Linux Bridge as well

● Security groups can now be used even in very large deployments

● Nova/Neutron interface much more reliable

o Boot a server only when the network for it is wired

o Faster, less chatty communication

● Some progress on resource status tracking

o Far from optimal, but at least you can now know when your floating IP is ready to use...

Page 24: Can you trust Neutron?

… and where we want to be

● There is still a lot of room for improvement in the agents

o E.g.: the OVS agent still scans all ports on the integration bridge at each iteration

● The Nova/Neutron interface is better, but still far from ideal

o Enhanced caching on the nova side can avoid a lot of round trips to neutron

● Little to nothing has been done for tracking async operations and resource status. For example:

o there is no way to know whether DHCP info is ready for a port

o security group updates are processed asynchronously, but it is impossible to know when processing completes

Page 25: Can you trust Neutron?

Final thoughts

● “Much better” is different from “ideal”

o ≅ 3 seconds for wiring an interface may not be ideal for many applications

o scalability limits should be addressed even if they involve architectural changes

● What about data plane scalability?

● What about API usability?