I/O Scalability in Xen

DESCRIPTION

The talk will cover SR-IOV scalability and challenges in Xen, as well as NUMA I/O support in Xen.

TRANSCRIPT

Page 1: I/O Scalability in Xen


I/O Scalability in Xen

Kevin Tian [email protected]

Eddie Dong [email protected]

Yang Zhang [email protected]

Page 2: I/O Scalability in Xen


Agenda

Overview of I/O Scalability Issues

• Excessive Interrupts Hurt

• I/O NUMA Challenge

Proposals

• Soft interrupt throttling in Xen

• Interrupt-Less NAPI (ILNAPI)

• Host I/O NUMA

• Guest I/O NUMA

Page 3: I/O Scalability in Xen


Retrospect…

2009 Xen Summit (Eddie Dong, …)

Extending I/O Scalability in Xen

Covered topics

• VNIF: multiple TX/RX tasklets, notification frequency

• VT-d: vEOI optimization, vIntr delivery

• SR-IOV: adaptive interrupt coalescing (AIC)

Interrupts are the hotspot!

Page 4: I/O Scalability in Xen


New Challenges Always Exist

Interrupt overhead is increasingly high

• One 10G Niantic NIC may incur 512k intr/s

• 64 (VFs + PF) x 8000 intr/s

• Similar for dom0 when multiple queues are used

• 40G NIC is coming

Prevalent NUMA architecture (even on 2-node low-end servers)

• The DMA distance to memory node matters (I/O NUMA)

• w/o I/O NUMA awareness, DMA accesses may be suboptimal

Need a breakthrough in software architecture

Page 5: I/O Scalability in Xen


Excessive Interrupts Hurt! (SR-IOV Rx Netperf)

[Chart: netperf Rx bandwidth (Mb/s) and CPU% vs. number of VMs (1vm-7vm), with curves for ITR = 8000, 4000, 2000, and 1000]

Page 6: I/O Scalability in Xen


Excessive Interrupts Hurt!

[Chart: the same bandwidth (Mb/s) and CPU% vs. number of VMs (1vm-7vm) data, with a linear trend line for CPU% at ITR = 8000]

CPU% increases rapidly with a high interrupt rate!

Bandwidth is not saturated with a low interrupt rate!

Page 7: I/O Scalability in Xen


Excessive Interrupts Hurt! (Cont.)

Excessive VM-exits (7vm as an example):

• External Interrupts: 35k/s

• APIC Access: 49k/s

• Interrupt Window: 7k/s

Excessive context switches

• “Tackling the Management Challenges of Server Consolidation on Multi-core System”, Hui Lv, Xen Summit 2011 SC

Excessive ISR/softirq overhead both in Xen and in the guest

Similar impact for dom0 using a multi-queue NIC

Page 8: I/O Scalability in Xen


NUMA Status in Xen

[Diagram: NUMA topology showing processor nodes with memory buffers and memory, an IOH/PCH with integrated PCI-e devices, and attached I/O devices]

Host CPU/Memory NUMA

• Administrable based on capacity plan

Guest CPU/Memory NUMA

• Not supported

• But extensively discussed

Lack of manageability for

• Host I/O NUMA

• Guest I/O NUMA

Page 9: I/O Scalability in Xen


NUMA Related Structures

An integral combination of CPU, memory, and I/O device information

• System Resource Affinity Table (SRAT)

• Associates CPUs and memory ranges with proximity domains

• System Locality Distance Information Table (SLIT)

• Distances among proximity domains

• _PXM (Proximity) object

• Standard way to describe proximity info for I/O devices

Acquiring the _PXM info of I/O devices alone is not enough to construct I/O NUMA knowledge!

Page 10: I/O Scalability in Xen


Host I/O NUMA Issues

No host I/O NUMA awareness in Dom0

• Dom0 owns the majority of I/O devices

• Dom0 memory is first allocated by skipping the DMA zone

• DMA memory is reallocated later for contiguity

• The above allocations are made round-robin within the node_affinity mask

• No consideration of the actual I/O NUMA topology

Complex and confusing if dom0 handles host I/O NUMA itself

• Implies physical CPU/memory awareness in dom0 too

• Virtual NUMA vs. Host NUMA?

Xen, however, has no knowledge of _PXM

Page 11: I/O Scalability in Xen


Guest I/O NUMA Issues

Guest needs I/O NUMA awareness to handle assigned devices

• Guest NUMA support is the prerequisite

Guest NUMA is not upstream yet!

• Extensive talks in previous Xen summits

• “VM Memory Allocation Schemes and PV NUMA Guests”, Dulloor Rao

• “Xen Guest NUMA: General Enabling Part”, Jun Nakajima

• Extensive discussions and work already exist…

• Now is the time to push it upstream!

No I/O NUMA information exposed to guest

Lack of I/O NUMA awareness in device assignment process

Page 12: I/O Scalability in Xen


Proposals

Per-interrupt overhead has been studied extensively!

Now we want to reduce the number of interrupts!

Page 13: I/O Scalability in Xen


The Effect of Dynamic Interrupt Rate

A manual tweak of the ITR based on the VM count (8000 / vm_num), sketched in code after the chart below

[Chart: bandwidth (Mb/s) and CPU% vs. number of VMs (1vm-7vm), comparing ITR = 8000, ITR = 1000, and dynamic ITR, with linear trend lines for CPU%]
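As a rough illustration of the 8000 / vm_num tweak, here is a minimal C sketch; the helper name pick_itr() and the 1000 intr/s floor (the lowest ITR shown in the chart) are assumptions, not an existing driver knob.

#include <stdint.h>

/* Sketch: scale the per-VF interrupt throttle rate with the number of
 * active VMs, following the 8000 / vm_num tweak above.  pick_itr() and
 * the 1000 intr/s floor are illustrative assumptions. */
static uint32_t pick_itr(uint32_t vm_num)
{
    uint32_t itr = 8000 / (vm_num ? vm_num : 1);

    return itr < 1000 ? 1000 : itr;   /* keep a minimal service rate */
}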

Page 14: I/O Scalability in Xen


Software Interrupt Throttling in Xen

Throttle virtual interrupts based on administrative policies

• Based on shared resources (e.g. bandwidth/VM_number)

• Based on priority and SLAs

• Applies to both PV and HVM guests

Fewer virtual interrupts reduce guest ISR/softirq overhead

It may further throttle physical interrupts too!

• If the device doesn’t trigger a new interrupt when an earlier request is still pending
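A minimal sketch of how such throttling might look inside Xen, assuming invented names (struct virq_throttle, throttle_virq()); the policy inputs above (bandwidth, VM number, SLAs) would determine window_len.

#include <stdbool.h>
#include <stdint.h>

/* Sketch: suppress virtual interrupt injection while inside the current
 * throttle window; a per-vCPU timer would inject one pending interrupt
 * when the window expires.  All names here are illustrative. */
struct virq_throttle {
    uint64_t window_end;   /* end of the current window, in ns */
    uint64_t window_len;   /* 1e9 / target virtual interrupt rate */
    bool     pending;      /* an injection was coalesced */
};

static bool throttle_virq(struct virq_throttle *t, uint64_t now_ns)
{
    if (now_ns >= t->window_end) {
        t->window_end = now_ns + t->window_len;
        return false;          /* inject this virtual interrupt now */
    }
    t->pending = true;         /* coalesce until the window expires */
    return true;
}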

Page 15: I/O Scalability in Xen


Interrupt-Less NAPI (ILNAPI)

NAPI itself doesn’t eliminate interrupts

• NAPI logic is scheduled by rx interrupt handler

• Mask interrupt when NAPI is scheduled

• Unmask interrupt when NAPI completes current poll

What about scheduling NAPI w/o interrupts?

• If we can piggyback NAPI schedule on other events…

• System calls, other interrupts, scheduling, …

• The overhead of an internal NAPI schedule is far lower than the heavy device->Xen->VM interrupt path

Yes, that’s … “Interrupt-Less NAPI (ILNAPI)”
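A minimal sketch of the piggyback idea, assuming a hypothetical per-queue hook ilnapi_kick(); napi_schedule() is the standard kernel API, but calling it from syscall or scheduler hooks is the proposal, not current ixgbevf behavior.

#include <linux/netdevice.h>

/* Hypothetical per-queue ILNAPI state. */
struct ilnapi_queue {
    struct napi_struct napi;
    bool               irqless;   /* device interrupt kept masked */
};

/* Called from convenient guest events (another ISR, syscall return,
 * context switch) instead of the NIC interrupt handler. */
static void ilnapi_kick(struct ilnapi_queue *q)
{
    if (q->irqless)
        napi_schedule(&q->napi);   /* no-op if this NAPI context is already scheduled */
}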

Page 16: I/O Scalability in Xen


Interrupt-Less NAPI (Cont.)

ILNAPI_HIGH watermark:

• When there are too many notifications within the guest

• Serves as the high watermark for NAPI schedule frequency

ILNAPI_LOW watermark:

• Activated when there are insufficient notifications

• Serves as the low watermark to ensure reasonable traffic

• May fall back to interrupt-driven mode
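A minimal sketch of the watermark decision; the sampling period, counter, and action names are assumptions, only the ILNAPI_HIGH and ILNAPI_LOW watermarks come from the slide.

/* Decide, once per sampling period, how ILNAPI should behave next,
 * based on how many piggybacked NAPI schedules were observed. */
enum ilnapi_action {
    ILNAPI_THROTTLE,   /* above the high watermark: cap further schedules */
    ILNAPI_FALLBACK,   /* below the low watermark: unmask the device interrupt */
    ILNAPI_KEEP        /* between the watermarks: keep the current mode */
};

static enum ilnapi_action
ilnapi_check_watermarks(unsigned int schedules, unsigned int ilnapi_high,
                        unsigned int ilnapi_low)
{
    if (schedules > ilnapi_high)
        return ILNAPI_THROTTLE;
    if (schedules < ilnapi_low)
        return ILNAPI_FALLBACK;
    return ILNAPI_KEEP;
}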

[Diagram: the IXGBEVF driver's NAPI context is scheduled from an event pool (NIC IRQ/ISR, other ISRs, syscalls, the scheduler) and polls the IXGBE NIC through the net core]

Page 17: I/O Scalability in Xen


Interrupt-Less NAPI (Cont.)

[Chart: bandwidth (Mb/s) and CPU% vs. number of VMs (1vm-7vm), comparing ITR = 8000, ITR = 1000, and ILNAPI, with linear trend lines for CPU%]

Page 18: I/O Scalability in Xen


Interrupt-Less NAPI (Cont.)

Watermarks can be adaptively chosen by the driver

• Based on bandwidth/buffer estimation

Or an enlightened scheme:

• Xen may provide guidance through shared buffer

• Resource utilization (e.g. VM number)

• Administrative policies

• SLA requirements

• ILNAPI can be turned on/off dynamically under Xen’s control

• E.g. when latency is a major concern
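One possible shape for the shared guidance buffer, as a sketch; the field names and page layout are assumptions, not an existing Xen interface.

#include <stdint.h>

/* Hypothetical page shared read-only with the guest driver: Xen updates
 * it, and the driver samples it when adapting its ILNAPI watermarks. */
struct ilnapi_guidance {
    uint32_t version;        /* bumped by Xen on every update */
    uint32_t active_vms;     /* resource-utilization hint (e.g. VM number) */
    uint32_t suggested_high; /* watermarks derived from policy / SLAs */
    uint32_t suggested_low;
    uint32_t ilnapi_enabled; /* 0: force interrupt mode (latency-sensitive case) */
};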

Page 19: I/O Scalability in Xen


Proposals

We need to close the Xen architecture gaps for both host I/O NUMA and guest I/O NUMA!

Page 20: I/O Scalability in Xen


Host I/O NUMA

Give Xen full NUMA information:

• Xen already sees SRAT/SLIT

• New hypercall to convey I/O proximity info (_PXM) from Dom0 (a layout sketch follows this slide)

• Xen needs to extend the _PXM info to all child devices

• Extend the DMA reallocation hypercall to carry the device ID

• May need a Xen counterpart of set_dev_node

• Xen reallocates DMA memory based on proximity info

CPU access in dom0 remains NUMA-unaware…

• E.g. the communication between backend and frontend drivers
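A hypothetical layout for the proposed hypercall; the op name, number, and struct below are illustrations of the idea, not an existing Xen ABI.

#include <stdint.h>

/* Dom0 translates a device's _PXM into a node ID and passes it down so
 * Xen can place (and later reallocate) DMA memory on the right node.
 * Everything below is a sketch, not a defined PHYSDEVOP. */
#define PHYSDEVOP_set_device_node  0x7f   /* placeholder op number */

struct physdev_set_device_node {
    /* IN: PCI device identified as segment:bus:devfn */
    uint16_t seg;
    uint8_t  bus;
    uint8_t  devfn;
    /* IN: NUMA node derived from the device's _PXM */
    uint32_t node;
};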

Page 21: I/O Scalability in Xen


Guest I/O NUMA

Okay, let’s help guest NUMA support in Xen!

IOMMUs may also span nodes

• ACPI defines the Remapping Hardware Static Affinity (RHSA) structure

• The association between IOMMU and proximity domain

• Allocate remapping table based on RHSA and proximity domain info
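A minimal sketch of RHSA-driven placement; rhsa_node_for_iommu() and alloc_on_node() are hypothetical helpers standing in for the DMAR/RHSA parsing and a node-aware allocator.

#include <stddef.h>

int rhsa_node_for_iommu(unsigned int iommu_id);   /* node from the ACPI RHSA entry */
void *alloc_on_node(int node, size_t size);       /* node-aware allocator */

/* Place each remapping unit's tables on the node that the RHSA
 * structure associates with that IOMMU. */
static void *alloc_remap_table(unsigned int iommu_id, size_t size)
{
    int node = rhsa_node_for_iommu(iommu_id);

    if (node < 0)
        node = 0;              /* no RHSA entry: fall back to node 0 */
    return alloc_on_node(node, size);
}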

Page 22: I/O Scalability in Xen


Guest I/O NUMA (Cont.)

Build up guest I/O NUMA awareness

• Construct a _PXM method for assigned devices in the device model

• Based on guest NUMA info (SRAT/SLIT)

• Extend the control panel to favor I/O NUMA (a placement check is sketched after this slide)

• Assign devices that are in the same proximity domain as the guest's specified nodes

• Or, bind the guest to the node to which the assigned device is affine

• The policy for SR-IOV may be more constrained

• E.g. all guests sharing the same SR-IOV device run on the same node

• Warn the user when optimal placement can't be assured
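A control-panel-style placement check, as a sketch; the node-mask representation and the warning text are assumptions.

#include <stdbool.h>
#include <stdio.h>

/* Warn when an assigned device's node is not among the guest's nodes,
 * as proposed above.  guest_node_mask has bit N set if the guest uses
 * node N; dev_node comes from the device's _PXM / host topology. */
static bool check_assignment_placement(unsigned long guest_node_mask,
                                       int dev_node)
{
    if (dev_node >= 0 && (guest_node_mask & (1UL << dev_node)))
        return true;

    fprintf(stderr,
            "warning: assigned device is on node %d but the guest is not "
            "placed there; DMA may cross NUMA nodes\n", dev_node);
    return false;
}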

Page 23: I/O Scalability in Xen


Summary

I/O scalability is challenging every time we re-examine it!

Excessive interrupts hurt I/O scalability, but there are means, both in Xen and in the guest, to mitigate them!

CPU/Memory NUMA has been well managed in Xen, but I/O NUMA awareness is still not in place!

Page 24: I/O Scalability in Xen


Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT.

Intel may make changes to specifications, product descriptions, and plans at any time, without notice.

All dates provided are subject to change without notice.

Intel is a trademark of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2007, Intel Corporation. All rights are protected.

Page 25: I/O Scalability in Xen
