Improving Scalability of Xen: The 3,000 Domains Experiment

Wei Liu <[email protected]>

Upload: xen-project

Posted on 05-Dec-2014


DESCRIPTION

Machines are getting more powerful these days, and more and more VMs will run on a single machine. This work started off with a simple goal: to run 3,000 domains on a single host and address any scalability issues encountered along the way. I will start with Xen internals, then state the problems and their solutions. Several improvements were made, from the hypervisor and the Linux kernel to user-space components such as the console backend and the xenstore backend. The main improvement for the hypervisor and the Dom0 Linux kernel is the new event channel infrastructure, which enables Dom0 to handle many more events simultaneously; the original implementation only allows 1,024 and 4,096 event channels for 32-bit and 64-bit domains respectively. The target audience is cloud developers, kernel developers, and anyone interested in Xen scalability and internals. General knowledge of Linux is needed; knowledge of Xen is not required but nice to have.

TRANSCRIPT

Page 1: Improving Scalability of Xen: The 3,000 Domains Experiment

Improving Scalability of Xen: the 3,000 domains

experiment

Wei Liu <[email protected]>

Page 2: Improving Scalability of Xen: The 3,000 Domains Experiment

Xen: the gears of the cloud
● large user base
  ○ estimated more than 10 million individual users
● powers the largest clouds in production
● not just for servers

Page 3: Improving Scalability of Xen: The 3,000 Domains Experiment

Xen: Open Source
GPLv2 with DCO (like Linux)
Diverse contributor community

source: Mike Day, http://code.ncultra.org

Page 4: Improving Scalability of Xen: The 3,000 Domains Experiment

Xen architecture: PV guests

[Diagram: Xen runs directly on the hardware. Dom0 contains the HW drivers and the PV backends; each DomU runs PV frontends that connect to Dom0's backends.]

Page 5: Improving Scalability of Xen: The 3,000 Domains Experiment

Xen architecture: PV protocol

[Diagram: a shared ring between backend and frontend. The frontend is the request producer and response consumer; the backend is the request consumer and response producer. An event channel is used for notification.]
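The split-ring handshake above can be sketched in a few lines of C. This is an illustration of the idea only, not the real `xen/io/ring.h` interface: indices only ever increase, the slot is `index % RING_SIZE`, and in the real protocol each producer-index update is followed by an event channel notification.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8  /* power of two, like real Xen rings */

struct ring {
    uint32_t req_prod, req_cons;  /* request producer/consumer indices */
    uint32_t rsp_prod, rsp_cons;  /* response producer/consumer indices */
    int slot[RING_SIZE];          /* stands in for the request/response union */
};

/* Frontend: queue a request if the ring is not full. */
static int put_request(struct ring *r, int req)
{
    if (r->req_prod - r->rsp_cons >= RING_SIZE)
        return 0;                              /* ring full */
    r->slot[r->req_prod % RING_SIZE] = req;
    r->req_prod++;                             /* then notify via event channel */
    return 1;
}

/* Backend: consume one request and queue a response for it. */
static int serve_one(struct ring *r)
{
    if (r->req_cons == r->req_prod)
        return 0;                              /* nothing pending */
    int req = r->slot[r->req_cons % RING_SIZE];
    r->req_cons++;
    r->slot[r->rsp_prod % RING_SIZE] = req + 100;  /* fake response payload */
    r->rsp_prod++;                             /* then notify via event channel */
    return 1;
}

/* Frontend: consume one response. */
static int get_response(struct ring *r, int *rsp)
{
    if (r->rsp_cons == r->rsp_prod)
        return 0;
    *rsp = r->slot[r->rsp_cons % RING_SIZE];
    r->rsp_cons++;
    return 1;
}
```

Note that both directions share one array: a response overwrites the slot of an already-consumed request, which is why a single ring page suffices.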

Page 6: Improving Scalability of Xen: The 3,000 Domains Experiment

Xen architecture: driver domains

[Diagram: Dom0 holds the toolstack. A disk driver domain runs the disk driver and BlockBack; a network driver domain runs the network driver and NetBack. The DomU runs NetFront and BlockFront, each connected to the corresponding driver domain.]

Page 7: Improving Scalability of Xen: The 3,000 Domains Experiment

Xen architecture: HVM guests

[Diagram: Dom0 contains the HW drivers and PV backends. Each HVM DomU is served by a QEMU instance for IO emulation, running either in Dom0 or in a stubdom; an HVM DomU may also run PV frontends.]

Page 8: Improving Scalability of Xen: The 3,000 Domains Experiment

Xen architecture: PVHVM guests

[Diagram: Dom0 contains the HW drivers and PV backends; each PVHVM DomU runs PV frontends on top of HVM.]

Page 9: Improving Scalability of Xen: The 3,000 Domains Experiment

Xen scalability: current status
Xen 4.2
● Up to 5TB host memory (64-bit)
● Up to 4095 host CPUs (64-bit)
● Up to 512 VCPUs per PV VM
● Up to 256 VCPUs per HVM VM
● Event channels
  ○ 1024 for 32-bit domains
  ○ 4096 for 64-bit domains

Page 10: Improving Scalability of Xen: The 3,000 Domains Experiment

Xen scalability: current status
Typical PV / PVHVM DomU
● 256MB to 240GB of RAM
● 1 to 16 virtual CPUs
● at least 4 inter-domain event channels:
  ○ xenstore
  ○ console
  ○ virtual network interface (vif)
  ○ virtual block device (vbd)

Page 11: Improving Scalability of Xen: The 3,000 Domains Experiment

Xen scalability: current status
● From a backend domain's (Dom0 / driver domain) PoV:
  ○ IPIs, PIRQs and VIRQs scale with the number of CPUs and devices; a typical Dom0 uses 20 to ~200
  ○ yielding fewer than 1024 guests supported for 64-bit backend domains, and even fewer for 32-bit backend domains
● 1K still sounds like a lot, right?
  ○ enough for the normal use case
  ○ not ideal for OpenMirage (OCaml on Xen) and other similar projects

Page 12: Improving Scalability of Xen: The 3,000 Domains Experiment

Start of the story
● Effort to run 1,000 DomUs (modified Mini-OS) on a single host *
● Want more? How about 3,000 DomUs?
  ○ definitely hits the event channel limit
  ○ toolstack limits
  ○ backend limits
  ○ open-ended question: is it practical to do so?

* http://lists.xen.org/archives/html/xen-users/2012-12/msg00069.html

Page 13: Improving Scalability of Xen: The 3,000 Domains Experiment

Toolstack limit
xenconsoled and cxenstored both use select(2)
● xenconsoled: not very critical; can be restarted
● cxenstored: critical to Xen and cannot be shut down without losing information
● oxenstored: uses libev, so there is no problem

Fix: switch from select(2) to poll(2); implement poll(2) for Mini-OS
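The reason select(2) is a scalability wall is that it operates on a fixed-size fd_set capped at FD_SETSIZE (commonly 1024), so file descriptors at or above that value cannot be watched at all; poll(2) takes a caller-sized array, so the only limit is the process's fd limit. A minimal standalone sketch of a poll(2)-based readiness check (this is illustrative, not the actual xenstored code):

```c
#include <assert.h>
#include <poll.h>
#include <sys/select.h>   /* only for FD_SETSIZE, to show select's cap */
#include <unistd.h>

/* Wait up to timeout_ms for fd to become readable.
 * Unlike select(2), this works for any fd value, not just fd < FD_SETSIZE. */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);
    return n == 1 && (pfd.revents & POLLIN);
}
```

With thousands of console and xenstore connections, one fd per DomU, a daemon built on select(2) simply stops accepting new guests around the FD_SETSIZE boundary; the poll(2) version degrades gracefully instead.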

Page 14: Improving Scalability of Xen: The 3,000 Domains Experiment

Event channel limit
Identified as a key feature for the 4.3 release. Two designs have come up so far:
● 3-level event channel ABI
● FIFO event channel ABI

Page 15: Improving Scalability of Xen: The 3,000 Domains Experiment

3-level ABI
Motivation: aimed at the 4.3 timeframe
● an extension of the default 2-level ABI, hence the name
● started in Dec 2012
● V5 draft posted Mar 2013
● almost ready

Page 16: Improving Scalability of Xen: The 3,000 Domains Experiment

Default (2-level) ABI

[Diagram: a per-CPU selector word, where each set bit points at one word of a shared pending bitmap; an upcall pending flag tells the VCPU that some event is pending.]
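The 2-level scan is easy to sketch: a set bit in the per-VCPU selector means "some bit is set in that word of the shared bitmap", so a whole word of events can be skipped with one test. For a 32-bit guest that gives 32 × 32 = 1024 events. A simplified illustration (the hypervisor's real code also handles masking and per-VCPU binding):

```c
#include <assert.h>
#include <stdint.h>

#define BITS 32
static uint32_t selector;        /* level 1: one bit per bitmap word (per VCPU) */
static uint32_t pending[BITS];   /* level 2: shared pending bitmap */

static void raise_event(unsigned port)
{
    pending[port / BITS] |= 1u << (port % BITS);
    selector |= 1u << (port / BITS);
}

/* Find and clear one pending event; return its port, or -1 if none. */
static int pop_event(void)
{
    for (unsigned w = 0; w < BITS; w++) {
        if (!(selector & (1u << w)))
            continue;                     /* skip 32 events with one test */
        for (unsigned b = 0; b < BITS; b++) {
            if (pending[w] & (1u << b)) {
                pending[w] &= ~(1u << b);
                if (!pending[w])
                    selector &= ~(1u << w);
                return (int)(w * BITS + b);
            }
        }
    }
    return -1;
}
```

Note the scan always starts from the lowest port, which is exactly the lack-of-priority problem called out later: low-numbered ports are implicitly favoured.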

Page 17: Improving Scalability of Xen: The 3,000 Domains Experiment

3-level ABI

[Diagram: a per-CPU first-level selector, a per-CPU second-level selector, and a shared pending bitmap; each set bit at one level points at a word in the level below, and an upcall pending flag signals delivery.]

Page 18: Improving Scalability of Xen: The 3,000 Domains Experiment

3-level ABI
Number of event channels:
● 32K for 32-bit guests
● 256K for 64-bit guests
Memory footprint:
● 2 bits per event (pending and mask)
● 2 / 16 pages for 32-bit / 64-bit guests
● NR_VCPUS pages for control structures

Limited to Dom0 and driver domains
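These figures follow directly from the structure: each level multiplies capacity by the word size (32³ = 32K, 64³ = 256K events), and at 2 bits per event the shared bitmaps cost 2 or 16 pages. A quick sanity check of the arithmetic, assuming 4096-byte pages:

```c
#include <assert.h>

#define PAGE_SIZE 4096L  /* assumed x86 page size */

/* Three selector levels, each word_bits wide, address word_bits^3 events. */
static long max_events(long word_bits)
{
    return word_bits * word_bits * word_bits;
}

/* Pending + mask bitmaps: 2 bits per event, rounded into whole pages. */
static long bitmap_pages(long nr_events)
{
    return (nr_events * 2 / 8) / PAGE_SIZE;
}
```

This is also why the footprint is tolerable despite tripling the levels: only Dom0 and driver domains opt in, so ordinary guests keep paying the 2-level cost.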

Page 19: Improving Scalability of Xen: The 3,000 Domains Experiment

3-level ABI
● Pros
  ○ general concepts and race conditions are fairly well understood and tested
  ○ envisioned for Dom0 and driver domains only, so small memory footprint
● Cons
  ○ lack of priority (inherited from the 2-level design)

Page 20: Improving Scalability of Xen: The 3,000 Domains Experiment

FIFO ABI
Motivation: designed from the ground up with gravy features
● design posted in Feb 2013
● first prototype posted in Mar 2013
● under development, close at hand

Page 21: Improving Scalability of Xen: The 3,000 Domains Experiment

FIFO ABI

[Diagram: a shared event array of 32-bit event words and a per-CPU control structure with a selector for picking up the event queue; queued events (Event 1, Event 2, Event 3, ...) are chained through the LINK field, shown for an empty queue and a non-empty queue.]
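The queueing idea can be sketched simply: each event channel is one 32-bit word in a shared array, and a pending event is appended to a queue by pointing the previous tail's LINK field at it, so events are delivered in arrival order rather than lowest-port-first. This is a simplification: the real design packs PENDING/MASKED/LINKED flag bits into the same word and keeps one queue per priority.

```c
#include <assert.h>
#include <stdint.h>

#define NR_EVENTS 128
static uint32_t link_of[NR_EVENTS];  /* stands in for the LINK bits of each event word */
static int head = -1, tail = -1;     /* one queue; the real ABI has one per priority */

/* Append a pending event to the tail of the queue. */
static void queue_event(int port)
{
    link_of[port] = 0;
    if (head < 0)
        head = port;                      /* queue was empty */
    else
        link_of[tail] = (uint32_t)port;   /* chain through LINK field */
    tail = port;
}

/* Pop the event at the head of the queue; -1 if empty. */
static int dequeue_event(void)
{
    if (head < 0)
        return -1;
    int port = head;
    head = (head == tail) ? -1 : (int)link_of[port];
    if (head < 0)
        tail = -1;
    return port;
}
```

The footprint figure on the next slide also checks out: 2^17 events × 4 bytes = 512KB, i.e. 128 four-KiB pages per guest at most.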

Page 22: Improving Scalability of Xen: The 3,000 Domains Experiment

FIFO ABI
Number of event channels:
● 128K (2^17) by design
Memory footprint:
● one 32-bit word per event
● up to 128 pages per guest
● NR_VCPUS pages for control structures

Use the toolstack to limit the maximum number of event channels a DomU can have

Page 23: Improving Scalability of Xen: The 3,000 Domains Experiment

FIFO ABI
● Pros
  ○ event priority
  ○ FIFO ordering
● Cons
  ○ relatively large memory footprint

Page 24: Improving Scalability of Xen: The 3,000 Domains Experiment

Community decision
● scalability issue not as urgent as we thought
  ○ only OpenMirage expressed interest in extra event channels
● delayed until the 4.4 release
  ○ better to maintain one more ABI than two
  ○ measure both and take one
● leaves time to test both designs
  ○ event handling is complex by nature

Page 25: Improving Scalability of Xen: The 3,000 Domains Experiment

Back to the story

3,000 DomUs experiment

Page 26: Improving Scalability of Xen: The 3,000 Domains Experiment

Hardware spec:
● 2 sockets, 4 cores, 16 threads
● 24GB RAM
Software config:
● Dom0: 16 VCPUs
● Dom0: 4GB RAM
● Mini-OS: 1 VCPU
● Mini-OS: 4MB RAM
● Mini-OS: 2 event channels

3,000 Mini-OS

DEMO

Page 27: Improving Scalability of Xen: The 3,000 Domains Experiment

3,000 Linux
Hydra-monster hardware spec:
● 8 sockets, 80 cores, 160 threads
● 512GB RAM
Software config:
● Dom0: 4 VCPUs (pinned)
● Dom0: 32GB RAM
● DomU: 1 VCPU
● DomU: 64MB RAM
● DomU: 3 event channels (2 + 1 VIF)

Page 28: Improving Scalability of Xen: The 3,000 Domains Experiment

Observation
Domain creation time:
● < 500: acceptable
● > 800: slow
● took hours to create 3,000 DomUs

Page 29: Improving Scalability of Xen: The 3,000 Domains Experiment

Observation
Backend bottleneck:
● network bridge limit in Linux
● PV backend drivers' buffer starvation
● I/O speed not acceptable
● Linux with 4GB RAM can only allocate ~45k event channels due to memory limitations

Page 30: Improving Scalability of Xen: The 3,000 Domains Experiment

Observation
CPU starvation:
● density too high: 1 PCPU vs ~20 VCPUs
● backend domain starvation
● should dedicate PCPUs to critical service domains

Page 31: Improving Scalability of Xen: The 3,000 Domains Experiment

Summary
Thousands of domains: doable, but not very practical at the moment
● hypervisor and toolstack
  ○ speed up creation
● hardware bottlenecks
  ○ VCPU density
  ○ network / disk I/O
● Linux PV backend drivers
  ○ buffer size
  ○ processing model

Page 32: Improving Scalability of Xen: The 3,000 Domains Experiment

Beyond?
A possible practical way to run thousands of domains: offload services to dedicated domains and trust the Xen scheduler.

Disaggregation

Page 33: Improving Scalability of Xen: The 3,000 Domains Experiment

Happy hacking and have fun!

Q&A

Page 34: Improving Scalability of Xen: The 3,000 Domains Experiment

Acknowledgement
Pictures used in slides:
thumbsup: http://primary3.tv/blog/uncategorized/cal-state-university-northridge-thumbs-up/
hydra: http://www.pantheon.org/areas/gallery/mythology/europe/greek_people/hydra.html