SOLARIS™ OPERATING SYSTEM HARDWARE VIRTUALIZATION PRODUCT ARCHITECTURE
Chien-Hua Yen, ISV Engineering
Sun BluePrints™ On-Line — November 2007
Part No 820-3703-10
Revision 1.0, 11/27/07
Edition: November 2007
Sun Microsystems, Inc.
Table of Contents
Introduction (page 1)
  Hardware Level Virtualization (page 2)
  Scope (page 4)
Section 1: Background Information (page 7)
Virtual Machine Monitor Basics (page 9)
  VMM Requirements (page 9)
  VMM Architecture (page 11)
The x86 Processor Architecture (page 21)
SPARC Processor Architecture (page 29)
Section 2: Hardware Virtualization Implementations (page 37)
Sun xVM Server (page 39)
  Sun xVM Server Architecture Overview (page 40)
  Sun xVM Server CPU Virtualization (page 45)
  Sun xVM Server Memory Virtualization (page 52)
  Sun xVM Server I/O Virtualization (page 56)
Sun xVM Server with Hardware VM (HVM) (page 63)
  HVM Operations and Data Structure (page 64)
  Sun xVM Server with HVM Architecture Overview (page 68)
Logical Domains (page 79)
  Logical Domains (LDoms) Architecture Overview (page 80)
  CPU Virtualization in LDoms (page 84)
  Memory Virtualization in LDoms (page 88)
  I/O Virtualization in LDoms (page 91)
VMware (page 97)
  VMware Infrastructure Overview (page 98)
  VMware CPU Virtualization (page 98)
  VMware Memory Virtualization (page 103)
  VMware I/O Virtualization (page 103)
Section 3: Additional Information (page 107)
VMM Comparison (page 109)
References (page 111)
Terms and Definitions (page 113)
Author Biography (page 117)
Chapter 1
Introduction
In the IT industry, virtualization is a mechanism of presenting a set of logical computing
resources over a fixed hardware configuration so that these logical resources can be
accessed in the same manner as the original hardware configuration. The concept of
virtualization is not new. First introduced in the late 1960s on mainframe computers,
virtualization has recently become popular as a means to consolidate servers and
reduce the costs of hardware acquisition, energy consumption, and space utilization.
The hardware resources that can be virtualized include computer systems, storage, and
the network.
Server virtualization can be implemented at different levels on the computing stack,
including the application level, operating system level, and hardware level:
• An example of application level virtualization is the Virtual Machine for the Java™
platform (Java Virtual Machine or JVM™ machine)¹. The JVM implementation
provides an application execution environment as a layer between the application
and the OS, removing application dependency on OS-specific APIs and hardware-
specific characteristics.
• OS level virtualization abstracts OS services such as file systems, devices,
networking, and security, and provides a virtualized operating environment to
applications. Typically, OS level virtualization is implemented by the OS kernel.
Only one instance of the kernel runs on the system, and it provides multiple
virtualized operating environments to applications. Examples of OS level
virtualization include Solaris™ Containers technology, Linux VServers, and FreeBSD
Jails. OS level virtualization has less performance overhead and better system
resource utilization than hardware level virtualization. Since one OS kernel is
shared among all virtual operating environments, isolation among all virtualized
operating environments is as good as the OS provides.
• Hardware level virtualization, discussed in detail in this paper, has become popular
recently because of increasing CPU power and low utilization of CPU resources in the
IT data center. Hardware level virtualization allows a system to run multiple OS
instances. With less sharing of system resources than OS level virtualization,
hardware virtualization provides stronger isolation of operating environments.
1. The terms "Java Virtual Machine" and "JVM" mean a Virtual Machine for the Java(TM) platform.
The Solaris OS includes bundled support for application and OS level virtualization with
its JVM software and Solaris Containers offerings. Sun first added support for hardware
virtualization in the Solaris 10 11/06 release with Sun Logical Domains (LDoms)
technology, supported on Sun servers which utilize UltraSPARC T1 or UltraSPARC T2
processors. VMware also supports the Solaris OS as a guest OS in its VMware Server and
Virtual Infrastructure products starting with the Solaris 10 1/06 release. In October
2007, Sun announced the Sun xVM family of products that includes the Sun xVM Server
and the Sun xVM Ops Center management system:
• Sun xVM Server — includes support for the Xen open source community work [6] on
the x86 platform and support for LDoms on the UltraSPARC T1/T2 platform
• Sun xVM Ops Center — a management suite for the Sun xVM Server
Note – In this paper, in order to distinguish the discussion of x86 and UltraSPARC T1/T2 processors, Sun xVM Server is specifically used to refer to the Sun hardware virtualization product for the x86 platform, and LDoms is used to refer to the Sun hardware virtualization product for the UltraSPARC T1 and T2 platforms.
The hardware virtualization technology and new products built around this technology
have expanded options and opportunities for deploying servers with better utilization,
more flexibility, and enhanced functionality. In reaping the benefits of hardware
virtualization, IT professionals also face the challenge of operating within the
limitations of a virtualized environment while delivering the same service levels
as in a physical operating environment. Meeting this challenge requires a
good understanding of virtualization technologies, CPU architecture, and software
implementations, and an awareness of their strengths and limitations.
Hardware Level Virtualization
Hardware level virtualization is a mechanism of virtualizing the system hardware
resources such as CPU, memory, and I/O, and creating multiple execution
environments on a single system. Each of these execution environments runs an
instance of the operating system.
A hardware level virtualization implementation typically consists of several virtual
machines (VMs), as shown in Figure 1. A layer of software, the virtual machine monitor
(VMM), manages system hardware resources and presents an abstraction of these
resources to each VM. The VMM runs in privileged mode and has full control of system
hardware. A guest operating system (GOS) runs in each VM. A GOS is to a VM
as a program is to a process, with the VMM playing the role that the OS plays for processes.
Figure 1. In hardware level virtualization, the VMM software manages hardware resources and presents an abstraction of these resources to one or more virtual machines.
Hardware resource virtualization can take the form of sharing, partitioning, or
delegating:
• Sharing — Resources are shared among VMs. The VMM coordinates the use of
resources by VMs. For example, the VMM may include a CPU scheduler to run threads
of VMs based on a pre-determined scheduling policy and VM priority.
• Partitioning — Resources are partitioned so that each VM gets the portion of
resources allocated to it. Partitioning can be dynamically adjusted by the VMM based
on the utilization of each VM. Examples of resource partitioning include the
ballooning memory technique employed in Sun xVM Server and VMware, and the
allocation of CPU resources in Logical Domains technology.
• Delegating — With delegating, resources are not directly accessible by a VM.
Instead, all resource accesses are made through a control VM that has direct access
to the resource. I/O device virtualization is normally implemented via delegation.
The distinction and boundaries between the virtualization methods are often not clear.
For example, sharing may be used for one component and partitioning used in others,
and together they make up an integral functional module.
Benefits of Hardware Level Virtualization
Hardware level virtualization allows multiple operating systems to run on a single
server system. This ability offers many benefits that are not available in a single OS
server. These benefits can be summarized in three functional categories:
• Workload Consolidation
According to Gartner [17], “Intel servers running at 10 percent to 15 percent
utilization are common.” Many IT organizations run out and buy a new server every
time they deploy a new application. With virtualization, computers no longer have to
be dedicated to a particular task. Applications and users can share computing
resources, remaining blissfully unaware that they are doing so. Companies can shift
computing resources around to meet demand at a given time, and get by with less
infrastructure overall. When used for consolidation, virtualization can also save
hardware and maintenance expenses, floor space, cooling costs, and power
consumption.
• Workload Migration
Hardware level virtualization decouples the OS from the underlying physical platform
resources. A guest OS state, along with the user applications running on top of it, can
be encapsulated into an entity and moved to another system. This capability is useful
for migrating a legacy OS from an old under-powered server to a more
powerful server while preserving the investment in software. When a server needs to
be maintained, a VM can be dynamically migrated to a new server with no down time,
further enhancing availability. Changes in workload intensity levels can be addressed
by dynamically shifting underlying resources to the starving VMs. Legacy applications
that ran natively on a server continue to run on the same OS running inside a VM,
leveraging the existing investment in applications and tools.
• Workload Isolation
Workload isolation includes fault isolation and security isolation. Multiple guest OSes run
independently, and thus a software failure in one VM does not affect other VMs.
However, the VMM layer introduces a single point of failure that can bring down all
VMs on the system. A VMM failure, although potentially catastrophic, is less probable
than a failure in the OS because the complexity of VMM is much less than that of an
OS.
Multiple VMs also provide strong security isolation among themselves with each VM
running an independent OS. Security intrusions are confined to the VM in which they
occur. The boundary around each VM is enforced by the VMM and the inter-domain
communication, if provided by the VMM, is restricted to specific kernel modules only.
One distinct feature of hardware level virtualization is the ability to run multiple
instances of heterogeneous operating systems on a single hardware platform. This
feature is important for the following reasons:
• Better security and fault containment among application services can be achieved
through OS isolation.
• Applications written for one OS can run on a system that supports a different OS.
• Better management of system resource utilization is possible among the virtualized
environments.
Scope
This paper explores the underlying hardware architecture and software implementation
for enabling hardware virtualization. Great emphasis has been placed on the CPU
hardware architecture limitations for virtualizing CPU services and their software
workarounds. In addition, this paper discusses in detail the software architecture for
implementing the following types of virtualization:
• CPU virtualization — uses processor privileged mode to control resource usage by
the VM, and relays hardware traps and interrupts to VMs
• Memory virtualization — partitions physical memory among multiple VMs and
handles page translations for each VM
• I/O virtualization — uses a dedicated VM with direct access to I/O devices to provide
device services
The paper is organized into three sections. Section I, Background Information, contains
information on VMMs and provides details on the x86 and SPARC processors:
• “Virtual Machine Monitor Basics” on page 9 discusses the core of hardware
virtualization, the VMM, as well as requirements for the VMM and several types of
VMM implementations.
• “The x86 Processor Architecture” on page 21 describes features of the x86 processor
architecture that are pertinent to virtualization.
• “SPARC Processor Architecture” on page 29 describes features of the SPARC processor
that affect virtualization implementations.
Section II, Hardware Virtualization Implementations, provides details on the Sun xVM
Server, Logical Domains, and VMware implementations:
• “Sun xVM Server” on page 39 discusses a paravirtualized Solaris OS that is based on
an open source VMM implementation [6] for x86 processors and is planned for
inclusion in a future Solaris release.
• “Sun xVM Server with Hardware VM (HVM)” on page 63 continues the discussion of
Sun xVM Server for the x86 processors that support hardware virtual machines:
Intel-VT and AMD-V.
• “Logical Domains” on page 79 discusses Logical Domains (LDoms), supported on Sun
servers that utilize UltraSPARC T1 or T2 processors, and describes Solaris OS support
for this feature.
• “VMware” on page 97 discusses the VMware implementation for the VMM.
Section III, Additional Information, contains a concluding comparison, references, and
appendices:
• “VMM Comparison” on page 109 presents a summary of the VMM implementations
discussed in this paper.
• “References” on page 111 provides a comprehensive listing of related references.
• “Terms and Definitions” on page 113 contains a glossary of terms.
• “Author Biography” on page 117 provides information on the author.
Section I
Background Information
• Chapter 2: Virtual Machine Monitor Basics (page 9)
• Chapter 3: The x86 Processor Architecture (page 21)
• Chapter 4: SPARC Processor Architecture (page 29)
Chapter 2
Virtual Machine Monitor Basics
At the heart of hardware level virtualization is the VMM. The VMM is a software layer
that abstracts computer hardware resources so that multiple OS instances can run on a
physical system. Hardware resources are normally controlled and managed by the OS.
In a virtualized environment the VMM takes this role, managing and coordinating
hardware resources. There is no clear boundary between an OS and the VMM from the
definition point of view. The division of functions between OS and the VMM can be
influenced by factors such as processor architecture, performance, OS, and non-
technical requirements such as ease of installation and migration.
Certain VMM requirements exist for running multiple OS instances on a system. These
requirements, discussed in detail in the next section, stem primarily from processor
architecture design that is inherently an impediment to hardware virtualization. Based
on these requirements, two types of VMMs have emerged, each with distinct
characteristics in defining the relationship between the VMM and an OS. This
relationship determines the privilege level of the VMM and an OS, and the control and
sharing of hardware resources.
VMM Requirements
A software program communicates with the computer hardware through instructions.
Instructions, in turn, operate on registers and memory. If any of the instructions,
registers, or memory involved in an action is privileged, that instruction results in a
privileged action. Sometimes an action, which is not necessarily privileged, attempts to
change the configuration of resources in the system. Subsequently, this action would
impact other actions whose behavior or result depends on the configuration of
resources. The instructions that result in such operations are called sensitive
instructions.
In the context of the virtualization discussion, a processor's instructions can be
classified into three groups:
• Privileged instructions are those that trap if the processor is in non-privileged mode
and do not trap if it is in privileged mode.
• Sensitive instructions are those that change or reference the configuration of
resources (memory), affect the processor mode without going through the memory
trap sequence (page fault), or reference the sensitive registers whose contents
change when the processor switches to run another VM.
• Non-privileged and non-sensitive instructions are those that do not fall into either
the privileged or sensitive categories described above.
Sensitive instructions have “a major bearing on the virtualizability of a machine” [1]
because of their system-wide impact. In a virtualized environment, a GOS should only
contain non-privileged and non-sensitive instructions.
If sensitive instructions are a subset of privileged instructions, it is relatively easy to
build a VM because all sensitive instructions will result in a trap. In this case a VMM can
be constructed to catch all traps that result from execution of sensitive instructions by a
GOS. All privileged and sensitive actions from VMs would be caught by the VMM, and
resources could be allocated and managed accordingly (a technique called trap-and-
emulate). A GOS's trap handler could then be called by the VMM trap handler to
perform the GOS-specific actions for the trap.
If a sensitive instruction is not privileged, it does not trap, and its execution by a
VM goes unnoticed by the VMM. Robin and Irvine [3] identified several x86 instructions in this
category. These instructions cannot be safely executed by a GOS as they can impact the
operations of other VMs or adversely affect the operation of its own GOS. Instead, these
instructions must be substituted by the VMM service. The substitution can be in the
form of an API for the GOS to call, or a dynamic conversion of these instructions to
explicit processor traps.
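The condition described above can be sketched in a few lines of Python. This is an illustrative model only: the instruction names and their classifications are hypothetical placeholders, not the instruction set of any real processor. The sketch checks whether every sensitive instruction is also privileged, in which case trap-and-emulate is sufficient, and lists the instructions that would need substitution by a VMM service.

```python
# Illustrative sketch; instruction names and classifications are hypothetical.

def unvirtualizable(sensitive, privileged):
    """Return sensitive instructions that do not trap in non-privileged mode."""
    return sorted(set(sensitive) - set(privileged))

def can_trap_and_emulate(sensitive, privileged):
    """Trap-and-emulate suffices only if sensitive is a subset of privileged."""
    return not unvirtualizable(sensitive, privileged)

def run_guest_instruction(instr, privileged, vmm_log):
    """Model a deprivileged GOS executing one instruction.

    A privileged instruction traps to the VMM, which emulates it on behalf
    of the VM; anything else runs directly on the processor.
    """
    if instr in privileged:
        vmm_log.append(instr)       # the VMM trap handler sees the instruction
        return "emulated by VMM"
    return "executed directly"

# Hypothetical machine A: all sensitive instructions are privileged.
PRIV_A = {"load_page_table_base", "halt", "disable_interrupts"}
SENS_A = {"load_page_table_base", "disable_interrupts"}

# Hypothetical machine B: one sensitive instruction does not trap.
PRIV_B = {"load_page_table_base", "halt"}
SENS_B = {"load_page_table_base", "read_mode_flags"}
```

On the hypothetical machine B, read_mode_flags runs directly without the VMM seeing it, which is the situation that calls for either an API for the GOS or a dynamic conversion to an explicit trap.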
Types of VMM
In a virtualized environment, the VMM controls the hardware resources. VMMs can be
categorized into two types, based on this control of resources:
• Type I — maintains exclusive control of hardware resources
• Type II —leverages the host OS by running inside the OS kernel
The Type I VMM [3] has several distinct characteristics: it is the first software to run
(besides BIOS and the boot loader), it has full and exclusive control of system hardware,
and it runs in privileged mode directly on the physical processor. The GOS on a Type I
VMM implementation runs in a less privileged mode than the VMM to avoid conflicts
managing the hardware resources.
An example of a Type I VMM is Sun xVM Server. Sun xVM Server includes a bundled
VMM, the Sun xVM Hypervisor for x86. The Sun xVM Hypervisor for x86 is the first
software, besides the BIOS and boot loader, to run during boot, as shown in the GRUB
menu.lst file:
    title Sun xVM Server
    kernel$ /boot/$ISADIR/xen.gz
    module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix
    module$ /platform/i86pc/$ISADIR/boot_archive
The GRUB bootloader first loads the Sun xVM Hypervisor for x86, xen.gz. After the
VMM gains control of the hardware, it loads the Solaris kernel,
/platform/i86xpv/kernel/$ISADIR/unix, to run as a GOS.
Sun's Logical Domains and VMware's Virtual Infrastructure 3 [4] (formerly known as
VMware ESX Server), described in detail in Chapter 7 “Logical Domains” on page 79 and
Chapter 8 “VMware” on page 97, are also Type I VMMs.
A Type II VMM typically runs inside a host OS kernel as an add-on module, and the host
OS maintains control of the hardware resources. The GOS in a Type II VMM is a process
of the host OS. A Type II VMM leverages the kernel services of the host OS to access
hardware, and intercepts a GOS's privileged operations and performs these operations
in the context of the host OS. Type II VMMs have the advantage of preserving the
existing installation by allowing a new GOS to be added to a running OS.
An example of a Type II VMM is VMware's VMware Server (formerly known as VMware
GSX Server).
Figure 2 illustrates the relationships among hardware, VMM, GOS, host OS, and user
application in virtualized environments.
Figure 2. Virtual machine monitors vary in how they support guest OS, host OS, and user applications in virtualized environments.
VMM Architecture
As discussed in “VMM Requirements” on page 9, the VMM performs some of the
functions that an OS normally does: namely, it controls and arbitrates CPU and memory
resources, and provides services to upper layer software for sensitive and privileged
operations. These functions require the VMM to run in privileged mode and the OS to
relinquish the privileged and sensitive operations to the VMM. In addition to processor
and memory operation, I/O device support also has a large impact on VMM
architecture.
[Figure 2 panels: Physical Server, Type I VMM Server, Type II VMM Server. Layers shown: user space applications, GOS, VMM, host OS, and platform hardware, with unprivileged and privileged modes marked.]
VMM in Privileged Mode
A processor typically has two or more privilege levels. The operating system kernel
runs in the privileged mode. The user applications run in a non-privileged mode and
trap to the kernel when they need to access system resources or services from the
kernel.
The GOS normally assumes it runs in the most privileged mode of the processor.
Running a VMM in a privileged mode can be accomplished with one of the following
three methods:
• Deprivileging the GOS — This method usually requires a modification to the OS to
run at a lower privilege level. For x86 systems, the OS normally runs at protected ring
0, the most privileged level. In Sun xVM Server, ring 0 is reserved to run the VMM.
This requires the GOS to be modified, or paravirtualized, to run outside of ring 0 at a
lower privilege level.
• Hyperprivileging the VMM — Instead of changing the GOS to run at lower privilege,
another approach taken by the chip vendors is to create a hyperprivileged processor
mode for the VMM. The Sun UltraSPARC T1 and T2 processors' hyperprivileged mode
[2], Intel-VT's VMX-root operation (see [7] Volume 3B, Chapter 19), and AMD-V’s
VMRUN-Exit state (see [9] Chapter 15) are examples of hyperprivileged processor modes for
VMM operations.
• Both VMM and GOS run in same privileged mode — It is possible to have both the
VMM and GOS run in the same privileged mode. In this case, the VMM intercepts all
privileged and sensitive operations of a GOS before passing them to the processor. For
example, VMware allows both the GOS and the VMM to run in privileged mode.
VMware dynamically examines each instruction to decide whether the processor
state and the segment reversibility (see “Segmented Architecture” on page 23) allow
the instruction to be executed directly without the involvement of the VMM. If the
GOS is in privileged mode or the code segment is non-reversible, the VMM performs
necessary conversions of the core execution path.
Removing Sensitive Instructions in the GOS
Privileged and sensitive operations are normally executed by the OS kernel. In a
virtualized environment, the GOS has to relinquish the privileged and sensitive
operations to the VMM. This is accomplished by one of the following approaches:
• Modifying the GOS source code to use the VMM services for handling sensitive
operations (paravirtualization)
This method is used by Sun xVM Server and Sun's Logical Domains (LDoms). Sun xVM
Server and LDoms provide a set of hypercalls for an OS to request VMM services. The
VMM-aware Solaris OS uses these hypercalls to replace its sensitive instructions.
• Dynamically translating the GOS sensitive instructions by software
As described in a previous section, VMware uses binary translation to replace the GOS
sensitive instructions with VMM instructions.
• Dynamically translating the GOS sensitive instructions by hardware
This method requires the processor to provide a special mode of operation that is
entered when a sensitive instruction is executed in reduced privileged mode.
The first approach, which involves modifying the GOS source code, is called
paravirtualization, because the VMM provides only partial virtualization of the
processor. The GOS must replace its sensitive and privileged operations with the VMM
service. The remaining two approaches provide full virtualization to the VM, enabling
the GOS to run without modification.
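The software translation approach can be illustrated with a toy model. The following Python sketch is an assumption-laden simplification: the instruction mnemonics, the set of sensitive opcodes, and the vmm_call form are all hypothetical and do not represent VMware's actual translator. It shows only the basic idea of scanning guest code and replacing each sensitive instruction with an explicit call into the VMM while leaving other instructions untouched.

```python
# Illustrative sketch of software translation; mnemonics and the "vmm_call"
# form are hypothetical, not an actual product's translator.

SENSITIVE = {"read_mode_flags", "load_page_table_base"}

def translate_block(block, sensitive=SENSITIVE):
    """Rewrite a list of guest instructions before they are executed.

    Non-sensitive instructions pass through unchanged; each sensitive
    instruction is replaced by an explicit call into the VMM.
    """
    translated = []
    for instr in block:
        opcode = instr.split()[0]
        if opcode in sensitive:
            translated.append("vmm_call " + instr)
        else:
            translated.append(instr)
    return translated

guest_block = ["add r1, r2", "read_mode_flags r3", "store r3, [r4]"]
```

Because the rewriting happens on the instruction stream rather than in the GOS source code, the GOS itself needs no modification, which is what distinguishes this approach from paravirtualization.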
In addition to OS modification, performance requirements, processor architecture
design, tolerance of a single point of failure, and support for legacy OS installations
have an impact on the design of VMM architecture.
Physical Memory Virtualization
Memory management by the VMM involves two tasks: partitioning physical memory
for VMs, and supporting page translations in a VM.
Each OS assumes physical memory starts from page frame number (PFN) 0 and is
contiguous to the size configured for that VM. An OS uses physical addresses in
operations like page table updates and Direct Memory Access (DMA). In reality, the
memory exported to a VM may not start from machine PFN 0 and may not be
contiguous.
The VMM virtualizes physical addresses by adding another layer of addressing,
the machine address (MA). Within a GOS, a virtual address
(VA) is used by applications, and a physical address (PA) is used by the OS in DMA and
page tables. The VMM maps a PA from a VM to an MA, which is used on hardware. The
VMM maintains translation tables, one for each VM, for mapping PAs to MAs.
Figure 3 depicts the scheme to partition machine memory to physical memory for each
VM.
Figure 3. Example physical-to-machine memory mapping.
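The per-VM translation table can be sketched as follows. This Python model is a minimal illustration under stated assumptions: the 4 KB page size, the frame numbers, and the class name P2MTable are chosen for the example and are not taken from any of the products discussed. Each VM sees physical memory that starts at PFN 0 and is contiguous, while the machine frames behind it may be scattered.

```python
# Illustrative sketch; page size, frame numbers, and class names are assumptions.

PAGE_SIZE = 4096  # assumed page size in bytes

class P2MTable:
    """Per-VM table mapping guest physical frame numbers (PFNs) to machine
    frame numbers (MFNs)."""

    def __init__(self, machine_frames):
        # Guest PFNs are contiguous from 0; machine frames need not be.
        self.pfn_to_mfn = dict(enumerate(machine_frames))

    def to_machine(self, physical_addr):
        """Translate a guest physical address (PA) to a machine address (MA)."""
        pfn, offset = divmod(physical_addr, PAGE_SIZE)
        if pfn not in self.pfn_to_mfn:
            raise ValueError("physical address outside the VM's memory")
        return self.pfn_to_mfn[pfn] * PAGE_SIZE + offset

# Two VMs, each seeing physical memory that starts at PFN 0.
vm0 = P2MTable([7, 8, 12])   # scattered machine frames
vm1 = P2MTable([3, 4])
```

Both vm0 and vm1 can use physical address 0 without conflict, because each VM's table resolves it to a different machine frame.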
A ballooning technique [5] has been used in some virtualization products to achieve
better utilization of physical memory among VMs. The idea behind the ballooning
technique is simple. The VMM controls a balloon module in a GOS. When the VMM
wants to reclaim memory, it inflates the balloon to increase pressure on memory,
forcing the GOS to page out memory to disk. If the demand for physical memory
decreases, the VMM deflates the balloon in a VM, enabling the GOS to claim more
memory.
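The ballooning mechanism can be modeled with simple page accounting. The Python sketch below is hypothetical throughout: the GuestMemory class, its method names, and the page counts are illustrative assumptions, and a real balloon driver works with the guest's memory allocator rather than with counters. The model shows how inflating the balloon consumes free pages first and then forces the GOS to page out.

```python
# Illustrative sketch; the class, page counts, and method names are hypothetical.

class GuestMemory:
    """Model of a GOS whose balloon module is controlled by the VMM."""

    def __init__(self, total_pages, used_pages):
        self.total_pages = total_pages   # physical pages configured for the VM
        self.used_pages = used_pages     # pages holding guest data
        self.balloon_pages = 0           # pages pinned by the balloon module
        self.paged_out = 0               # pages the GOS pushed to disk

    def free_pages(self):
        return self.total_pages - self.used_pages - self.balloon_pages

    def inflate(self, pages):
        """VMM reclaims memory: the balloon grows inside the GOS."""
        pages = min(pages, self.total_pages - self.balloon_pages)
        shortfall = max(0, pages - self.free_pages())
        # Memory pressure forces the GOS to page out to make room.
        self.used_pages -= shortfall
        self.paged_out += shortfall
        self.balloon_pages += pages
        return pages                     # pages handed back to the VMM

    def deflate(self, pages):
        """VMM returns memory: the GOS can claim the pages again."""
        pages = min(pages, self.balloon_pages)
        self.balloon_pages -= pages
        return pages
```

For example, a guest with 100 pages of which 80 are in use has 20 free pages; inflating the balloon by 30 pages in this model uses those 20 and pages out 10 more.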
Page Translations Virtualization
Access to a processor's page translation hardware is a privileged operation, and this
operation is performed by the privileged VMM. Exactly what the VMM needs to perform
depends on the processor architecture. For example, x86 hardware automatically loads
translations from the page table to the Translation Lookaside Buffer (TLB). The software
has no control of loading page translations to the TLB. Therefore, the VMM is
responsible for updating the page table that is seen by the hardware. The SPARC
processor relies on software, invoked through traps, to load page translations into the TLB. A GOS
maintains its page tables in its own memory, and the VMM gets page translations from
the VM and loads them to the TLB.
VMMs typically use one of the following two methods to support page translations:
• Hypervisor calls — The GOS makes a call to the VMM for page translation
operations. This method is commonly used by paravirtualized OSes, as it provides
better performance.
• Shadow page table — The VMM maintains an independent copy of page tables,
called shadow page tables, separate from the guest page tables. When a page fault occurs,
the VMM propagates the changes that the GOS made to its page table to the shadow page
table. This method is commonly used by VMMs that support full virtualization, as the
GOS continues to update its own page table and the synchronization of the guest
page table and the shadow page table is handled by the VMM when page faults
occur.
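The shadow page table method can be sketched as follows. This Python model is a simplification under stated assumptions: the ShadowPager class, the page and frame numbers, and the use of dictionaries as page tables are all illustrative. The GOS writes only to its guest page table, and the VMM fills in the shadow table, which holds machine frames, when a page fault occurs.

```python
# Illustrative sketch; frame numbers and the P2M mapping are hypothetical.

class ShadowPager:
    """VMM-side shadow page table for one VM under full virtualization."""

    def __init__(self, pfn_to_mfn):
        self.pfn_to_mfn = pfn_to_mfn  # physical-to-machine frame mapping
        self.guest_pt = {}            # virtual page -> guest physical frame (GOS-owned)
        self.shadow_pt = {}           # virtual page -> machine frame (seen by hardware)

    def guest_map(self, vpn, pfn):
        """The unmodified GOS updates only its own page table."""
        self.guest_pt[vpn] = pfn

    def translate(self, vpn):
        """Hardware walks the shadow table; a miss raises a page fault to the VMM."""
        if vpn not in self.shadow_pt:
            self.handle_page_fault(vpn)
        return self.shadow_pt[vpn]

    def handle_page_fault(self, vpn):
        """Propagate the guest entry into the shadow table, if one exists."""
        if vpn not in self.guest_pt:
            # True guest fault: the VMM would forward it to the GOS handler.
            raise KeyError("page fault delivered to GOS")
        self.shadow_pt[vpn] = self.pfn_to_mfn[self.guest_pt[vpn]]
```

The sketch omits invalidation of shadow entries when the GOS changes or removes an existing mapping, which a real VMM must also handle.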
Figure 4 shows three different page translation implementations in the Solaris OS on
x86 and SPARC platforms.
1. The paravirtualized Sun xVM Server uses the following approach on x86 platforms:
[1] The GOS uses the hypervisor call method to update the page tables
maintained by the VMM.
2. The Sun xVM Server with HVM and VMware use the following approach:
[2a] The GOS maintains its own guest page table. The synchronization between
the guest page table and the hardware page table (shadow page table) is
handled by the VMM when page faults occur.
[2b] The x86 CPU loads the page translation from the hardware page table to
the TLB.
3. On SPARC systems, the Solaris OS uses the following approach for Logical Domains:
[3a] The GOS maintains its own page table. The GOS takes an entry from the
page table as an argument to the hypervisor call that loads the translations
to the TLB.
[3b] The VMM gets the page translation from the GOS and loads the translation
to the TLB.
Figure 4. Page translation schemes used on x86 and SPARC architectures.
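Steps [3a] and [3b] of the SPARC scheme can be sketched in the same style. In this Python model the hypercall name hv_load_translation, the two-entry TLB, and the oldest-first eviction policy are assumptions made for illustration; they do not describe the actual Logical Domains hypervisor interface. The GOS handles a TLB miss by reading its own page table and passing the entry to the hypervisor, which converts the physical frame to a machine frame and loads the TLB.

```python
# Illustrative sketch; the hypercall name, TLB size, and mappings are hypothetical.
from collections import OrderedDict

class Hypervisor:
    def __init__(self, pfn_to_mfn, tlb_entries=2):
        self.pfn_to_mfn = pfn_to_mfn
        self.tlb = OrderedDict()      # virtual page -> machine frame
        self.tlb_entries = tlb_entries

    def hv_load_translation(self, vpn, pfn):
        """[3b] Convert the guest's physical frame to a machine frame and load the TLB."""
        if len(self.tlb) >= self.tlb_entries and vpn not in self.tlb:
            self.tlb.popitem(last=False)   # evict the oldest entry
        self.tlb[vpn] = self.pfn_to_mfn[pfn]

class Guest:
    def __init__(self, hv):
        self.hv = hv
        self.page_table = {}          # guest page table: virtual page -> physical frame

    def access(self, vpn):
        """Return the machine frame for vpn, handling a TLB miss in software."""
        if vpn not in self.hv.tlb:
            # [3a] TLB miss trap: the GOS looks up its own page table and
            # passes the entry to the hypervisor call.
            self.hv.hv_load_translation(vpn, self.page_table[vpn])
        return self.hv.tlb[vpn]
```

Unlike the shadow page table sketch, the hypervisor here keeps no copy of the guest's page table; it sees one translation at a time, when the GOS supplies it.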
The memory management implementation for Sun xVM Server, Sun xVM Server with
HVM, VMware, and Logical Domains using these mechanisms is discussed in detail in
later sections of this paper.
[Figure 4 panels: x86 Page Translations and SPARC Page Translations. Each panel shows the GOS, VMM, and hardware layers, with the guest page table, HV calls, hardware page table, and TLB marked by the step labels 1, 2a, 2b, 3a, and 3b.]
I/O Virtualization
I/O devices are typically managed by a special software module, the device driver,
running in kernel context. Because device types and device drivers vary so widely,
the VMM either includes only a few device drivers or leaves device
management entirely to the GOS. In the latter case, because of existing device
architecture limitations (discussed later in the section), a device can be managed
exclusively by only one VM.
This constraint creates some challenges for I/O access by a VM, and limits the
following:
• What devices are exported to a VM
• How devices are exported to a VM
• How each I/O transaction is handled by a VM and the VMM
Consequently, I/O has the most challenges in the areas of compatibility and
performance for virtual machines. In order to explain what devices are exported and
how they are exported, it is first necessary to understand the options available to
handle I/O transactions in a VM.
There are, in general, three approaches for I/O virtualization, as illustrated in Figure 5:
• Direct I/O (VM1 and VM3)
• Virtual I/O using I/O transaction emulation (VM2)
• Virtual I/O using device emulation (VM4)
Figure 5. Different I/O virtualization techniques used by virtual machine monitors.
For direct I/O, the VMM exports all or a portion of the physical devices attached to the
system to a VM, and relies on VMs to manage devices. The VM that has direct I/O
access uses the existing driver in the GOS to communicate directly with the device.
VM1 and VM3 in Figure 5 have direct I/O access to devices. VM1 is also a special I/O VM
that provides virtual I/O for other VMs, such as VM2, to access devices.
[Figure 5 diagram: on a Sun x64 server, VM1 (the I/O VM) and VM3 use native drivers with direct I/O to the network chip and SCSI controller. VM2 uses a virtual driver for virtual I/O through the I/O VM, which performs I/O transaction emulation. VM4 uses a native or virtual driver for virtual I/O through the VMM, which provides device emulation and a device driver.]
Virtual I/O is made possible by controlling the device types exported to a VM. There are
two different methods of implementing virtual I/O: I/O transaction emulation (shown
in VM2 in Figure 5) and device emulation (shown in VM4).
• I/O transaction emulation requires virtual drivers on both ends for each type of I/O
transaction (data and control functions). As shown in Figure 5, the virtual driver on
the client side (VM2) receives I/O requests from applications and forwards requests
through the VMM to the virtual driver on the server side (VM1); the virtual driver on
the server side then sends out the request to the device.
I/O transaction emulation is typically used in paravirtualization because the OS on
the client side needs to include the special drivers to communicate with its
corresponding driver in the OS on the server side, and needs to add kernel interfaces
for inter-domain communication using the VMM services. However, it is possible to
have PV drivers in an un-paravirtualized OS (full virtualization) for better I/O
performance. For example, Solaris 10, which is not paravirtualized, can include PV
drivers on an HVM-capable system to get better performance than that achieved using
device emulation such as that provided by QEMU. (See “Sun xVM Server with HVM I/O
Virtualization (QEMU)” on page 71.)
I/O transaction emulation may cause application compatibility issues if the virtual
driver does not provide all data and control functions (for example, ioctl(2)) that
the existing driver does.
• Device emulation provides an emulation of a device type, enabling the existing
driver for the emulated device in a GOS to be used. The VMM exports emulated
device nodes to a VM so that the existing drivers for the emulated devices in a GOS
are used. By doing this, the VMM controls the driver used by a GOS for a particular
device type; for example, using the e1000g driver for all network devices. Thus, the
VMM can focus on the emulation of underlying hardware using one driver interface.
Driver accesses to I/O registers and ports in a GOS, which trap because the addresses
are invalid in the VM, are caught by the VMM and converted into accesses to the real
device hardware. VM4 in Figure 5 uses native OS drivers to access emulated devices
exported by the VMM.
Device emulation is generally less efficient than I/O transaction emulation and more
limited in the platforms it supports. Device emulation does not require changes in the
GOS and, therefore, is typically used to provide full virtualization to a VM.
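The split-driver arrangement used by I/O transaction emulation can be sketched as follows. This is an illustrative model only: `FrontendDriver`, `BackendDriver`, the tuple request format, and the `deque` standing in for the VMM's inter-domain transport are all assumptions made for this example, not interfaces of any product discussed here.

```python
from collections import deque


class BackendDriver:
    """Server-side virtual driver in the I/O VM; it owns the native device."""

    def __init__(self, device):
        self.device = device  # stand-in for the native driver: block number -> data

    def handle(self, request):
        op, block, payload = request
        if op == "read":
            return self.device.get(block, b"")
        if op == "write":
            self.device[block] = payload
            return b""
        # Control functions the virtual driver does not implement fail here,
        # which is the compatibility risk described in the text.
        raise NotImplementedError(op)


class FrontendDriver:
    """Client-side virtual driver; it forwards requests instead of touching hardware."""

    def __init__(self, channel, backend):
        self.channel = channel  # stand-in for the VMM's inter-domain transport
        self.backend = backend

    def submit(self, op, block, payload=b""):
        self.channel.append((op, block, payload))            # client VM queues the request
        return self.backend.handle(self.channel.popleft())   # I/O VM services it


disk = {}
frontend = FrontendDriver(deque(), BackendDriver(disk))
frontend.submit("write", 3, b"hello")
data = frontend.submit("read", 3)
```

A request such as `frontend.submit("ioctl", 0)` raises `NotImplementedError` in this sketch, mirroring the case where the virtual driver lacks a control function that the existing driver provides.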
Virtual I/O, unlike direct I/O, requires additional drivers in either the I/O VM or the
VMM to provide I/O virtualization. This constraint:
• Limits the type of devices that are made available to a VM
• Limits device functionality
• Causes significant I/O performance overhead
While virtualization provides full application binary compatibility, I/O becomes a
trouble area in terms of application compatibility and performance in a VM. One
solution to the I/O virtualization issues is to allow VMs to directly access I/O, as shown
by VM3 in Figure 5.
Direct I/O access by VMs requires additional hardware support to ensure device
accesses by a VM are isolated and restricted to resources owned by the assigned VM. In
order to understand the industry effort to allow an I/O device to be shared among VMs,
it is necessary to examine device operations from an OS point of view.
The interactions between an OS and a device consist, in general, of three operations:
1. Programmed I/O (PIO): host-initiated data transfer. In PIO, a host OS maps a
virtual address to a piece of device memory and accesses the device memory using
CPU load/store instructions.
2. Direct Memory Access (DMA): device-initiated data transfer without CPU
involvement. In DMA, a host OS writes an address in its memory and the transfer
size to a device's DMA descriptor. After receiving an enable-DMA command from
the host driver, the device performs the data transfer at a time it chooses and uses
an interrupt to notify the host OS of DMA completion.
3. Interrupt: a device-generated asynchronous event notification.
Interrupts are already virtualized by all VMM implementations as is shown in the later
discussions for Sun xVM Server, Logical Domains, and VMware. The challenge of I/O
sharing among VMs therefore lies in the device handling for PIO and DMA. To meet the
challenges, PCI SIG has released a suite of IOV specifications for PCI Express (PCIe)
devices, in particular the “Single Root I/O Virtualization and Sharing Specification”
(SRIOV) [35] for device sharing and PIO operation, and the “Address
Translation Services (ATS)” specification [30] for DMA operation.
Device Configuration and PIO
A PCI device exports its memory to the host through Base Address Registers (BARs) in its
configuration space. A device's configuration space is identified in the PCI configuration
address space as shown in Figure 6.
Figure 6. PCI configuration address space.
A PCI device can have up to 8 physical functions (PFs). Each PF has its own 256-byte
configuration header. The BARs of a PCI function, which are 32 bits wide, are located at
offsets 0x10 through 0x24 in the configuration header. The host gets the size of the memory
region mapped by a BAR by writing a value of all 1's to the BAR and then reading the
value back. The address written to a BAR is the assigned starting address of the memory
region mapped to the BAR.
[Figure 6 diagram: PCI configuration address fields]
Bits 31-24   Reserved
Bits 23-16   Bus Number
Bits 15-11   Device Number
Bits 10-8    Function Number
Bits 7-2     Register Number
Bits 1-0     00
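The two mechanisms just described, addressing a function's configuration header and sizing a BAR, can be sketched in Python. The field positions follow the Figure 6 layout (bus in bits 23-16, device in bits 15-11, function in bits 10-8, register in bits 7-2). The `Bar32` class is a hypothetical software model of a 32-bit memory BAR, and the masking of the low four bits as type flags is an assumption drawn from common PCI practice rather than from this paper.

```python
def config_address(bus, device, function, register):
    """Pack a PCI configuration address using the Figure 6 field layout."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus/device/function out of range")
    if not 0 <= register < 256:
        raise ValueError("register offset must fit in the 256-byte header")
    return (bus << 16) | (device << 11) | (function << 8) | (register & 0xFC)


class Bar32:
    """Toy 32-bit memory BAR whose region size is a power of two."""

    def __init__(self, size):
        if size < 16 or size & (size - 1):
            raise ValueError("size must be a power of two, at least 16")
        self.size_mask = ~(size - 1) & 0xFFFFFFFF  # address bits the device implements
        self.value = 0

    def write(self, value):
        # Address bits below the region size are hard-wired to zero.
        self.value = value & self.size_mask

    def read(self):
        return self.value


def probe_bar_size(bar):
    """Size a BAR by writing all 1s and reading the value back."""
    saved = bar.read()
    bar.write(0xFFFFFFFF)
    readback = bar.read()
    bar.write(saved)                    # restore the assigned starting address
    base_bits = readback & 0xFFFFFFF0   # assumption: low 4 bits are type flags
    return (~base_bits & 0xFFFFFFFF) + 1


addr = config_address(2, 3, 1, 0x10)    # first BAR of bus 2, device 3, function 1
bar = Bar32(64 * 1024)
bar.write(0xFEB00000)                   # host assigns the region's starting address
size = probe_bar_size(bar)
```

With these inputs `addr` is 0x21910 and `size` is 65536 bytes. The probe restores the BAR afterward, because the write of all 1's overwrites the assigned address.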
To allow multiple VMs to share a PF, the SRIOV specification introduces the notion of a
Virtual Function (VF). Each VF shares some common configuration header fields with
the PF and other VFs. The VF BARs are defined in the PCIe's SRIOV extended capabilities
structure. A VF contains a set of non-shared physical resources, such as work queues and
data buffers, which are required to deliver function-specific services. These resources are
exported through the VF BARs and are directly accessible by a VM.
The starting address of a VF's memory space is derived from the first VF's memory
space address and the size of the VF's BAR. For any given VFx, the starting address of its
memory space mapped to BARa is calculated according to the following formula:

$$\mathrm{addr}(\mathrm{VF}_x, \mathrm{BAR}_a) = \mathrm{addr}(\mathrm{VF}_1, \mathrm{BAR}_a) + (x - 1) \times (\text{VF BAR}_a\ \text{aperture size})$$

where addr(VF1, BARa) is the starting address of BARa for the first VF and (VF BARa
aperture size) is the size of the VF BARa as determined by writing a value of all 1's to BARa
and reading the value back. Using this mechanism, a GOS in a VM is able to share the
device with other VMs while performing device operations that pertain only to the VM.
DMA
In many current implementations (especially on most x86 platforms), physical addresses
are used in DMA. Since a VM shares the same physical address space on the system
with other VMs, a VM might read from or write to another VM's memory through DMA. For
example, a device driver in a VM might write memory contents that belong to other
VMs to a disk and read the data back into the VM's memory. This creates a potential
breach of security and fault isolation among VMs.
To provide isolation during DMA operation, the ATS specification defines a scheme for a
VM to use the address mapped to its own physical memory for DMA operation. (This
approach is used in similar designs such as the IOMMU Specification [31] and DMA
Remapping [28].) DMA ATS enables DMA memory to be partitioned into multiple
domains, and keeps DMA transactions in one domain isolated from other domains.
Figure 7 shows device DMA with and without ATS. With DMA ATS, the DMA address is
like a virtual address that is associated with a context (VM). DMA transactions initiated
by a VM can only be associated with the memory owned by the VM. DMA ATS is a
chipset function that resides outside of the processor.
Figure 7. DMA with and without address translation service (ATS).
As shown in Figure 7, the physical address (PA) is used on the hardware platform
without hardware support for ATS. For platforms with hardware support for ATS, a GOS
in a VM writes either a device virtual address (DVA) or a guest physical address (GPA) to
the device’s DMA engine. The device driver in the GOS loads the mappings of either the
DVA or GPA to the host physical address (HPA) in the hardware IOMMU. The HPA is the
address understood by the memory controller.
Note – The distinction between the HPA and GPA is described in detail in later sections for Sun xVM Server (see “Physical Memory Management” on page 52), for UltraSPARC LDoms (see “Physical Memory Allocation” on page 88), and for VMware (see “Physical Memory Management” on page 103).
When the device performs a DMA operation, a DVA/GPA address appears on the PCI bus
and is intercepted by the hardware IOMMU. The hardware IOMMU looks up the
mapping for the DVA/GPA, finds the corresponding HPA, and moves the PCI data to
system memory pointed to by the HPA. Since either DVA or GPA of a VM has its own
address space, ATS allows system memory for DMA to be partitioned and, thus,
prevents a VM from accessing another VM’s DMA buffer.
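The per-domain lookup performed by the hardware IOMMU can be modeled with a short Python sketch. The `Iommu` class, the 4 KB page size, the domain names, and the addresses are assumptions made for this illustration. A real IOMMU keeps its tables in memory and is programmed through chipset-specific interfaces.

```python
class DmaFault(Exception):
    """Raised when a device uses an address its domain has not mapped."""


class Iommu:
    """Toy IOMMU holding one DVA/GPA -> HPA table per domain (VM)."""

    PAGE = 4096  # page size assumed for this sketch

    def __init__(self):
        self.tables = {}  # domain id -> {I/O page number: host page number}

    def map(self, domain, io_addr, hpa):
        """Install a page-granular mapping for one domain."""
        self.tables.setdefault(domain, {})[io_addr // self.PAGE] = hpa // self.PAGE

    def translate(self, domain, io_addr):
        """Translate an address seen on the bus into a host physical address."""
        page, offset = divmod(io_addr, self.PAGE)
        try:
            return self.tables[domain][page] * self.PAGE + offset
        except KeyError:
            raise DmaFault(f"domain {domain}: no mapping for {io_addr:#x}") from None


iommu = Iommu()
iommu.map("VM1", 0x1000, 0x7A000)   # VM1's GPA 0x1000 is backed by HPA 0x7A000
iommu.map("VM2", 0x1000, 0x9C000)   # the same GPA in VM2 maps to a different HPA
hpa1 = iommu.translate("VM1", 0x1010)
hpa2 = iommu.translate("VM2", 0x1010)
```

The same bus address resolves to 0x7A010 for VM1 and 0x9C010 for VM2, and an address that a domain never mapped raises `DmaFault`, which is the isolation property the text describes.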
[Figure 7 diagram: without ATS, the CPU and PCI devices reach the DMA buffers in system memory through the north bridge and south bridge using physical addresses (PA). With ATS, PCI devices issue DVA/GPA addresses, and the south bridge with IOMMU translates them to HPAs that reach the separate DMA buffers of VM1 and VM2.]
Legend: PA = Physical Address; HPA = Host Physical Address; DVA = Device Virtual Address; GPA = Guest Physical Address
21 The x86 Processor Architecture Sun Microsystems, Inc.
Chapter 3
The x86 Processor Architecture
This chapter provides background information on the x86 processor architecture that is
relevant to later discussions on Sun xVM Server (Chapter 5 on page 39), Sun xVM Server
with HVM (Chapter 6 on page 63), and VMware (Chapter 8 on page 97).
The x86 processor was not designed to run in a virtualized environment, and the x86
architecture presents some challenges for CPU and memory virtualization. This chapter
discusses the following x86 architecture features that are pertinent to virtualization:
• Protected Mode
The protected mode in the x86 processor utilizes two mechanisms, segmentation and
paging, to prevent a program from accessing a segment or a page with a higher
privilege level. Privilege level controls how the VMM and a GOS work together to
provide CPU virtualization.
• Segmented Architecture
The x86 segmented architecture converts a program's virtual addresses into linear
addresses that are used by the paging mechanism to map into physical memory.
During the conversion, the processor's privilege level is checked against the privilege
level of the segment for the address. Because of the segment cache technique
employed by the x86 processor, the VMM must ensure segment cache consistency
with the VM descriptor table updates. This x86 feature results in a significant amount
of work for the VMM of full virtualization products such as VMware.
• Paging Architecture
The x86 paging architecture provides page translations to the TLB and page tables.
Because the loading of page translations from page table to TLB is done
automatically by hardware on the x86 platform, page table updates have to be
performed by the privileged VMM. Several mechanisms are available for updating
this “hardware” page table by a VM.
• I/O and Interrupts
A device interacts with a host processor through PIO, DMA, and interrupts. PIO in the
x86 processor can be performed through either I/O ports using special I/O
instructions or through memory-mapped addresses with general-purpose MOV and
string instructions. DMA on most x86 platforms is performed with physical
addresses. This can cause a security and isolation breach in a virtualized environment
because a VM may read or write other VMs' memory contents. Interrupts and
exceptions are handled through the Interrupt Descriptor Table (IDT). There is only one
IDT on the system and access to the IDT is privileged. Therefore, interrupts have to be
handled by the VMM and virtualized to be delivered to a VM.
• Timer Devices
The x86 platform includes several timer devices for time keeping purposes.
Knowledge of the characteristics of these devices is important to fully understand
timekeeping in a VM: some timer devices are interrupt driven (and interrupts are virtualized
and may be delayed), and some require privileged access to update the device counter.
Protected Mode
The x86 architecture protected mode provides a protection mechanism to limit access
to certain segments or pages and prevent unprivileged access. The processor's
segment-protection mechanism recognizes 4 privilege levels, numbered from 0 to 3
(Figure 8). The greater the level number, the lesser the privileges provided.
The page-level protection mechanism restricts access to pages based on two privilege
levels: supervisor mode and user mode. If the processor is operating at a current
privilege level (CPL) 0, 1, or 2, it is in a supervisor mode and the processor can access all
pages. If the processor is operating at a CPL 3, it is in a user mode and the processor can
access only user level pages.
Figure 8. Privilege levels in the x86 architecture.
When the processor detects a privilege level violation, it generates a general-protection
exception (#GP). The x86 has more than 20 privileged instructions. These instructions
can be executed only when the current privilege level (CPL) is 0 (most privileged).
In addition to the CPL, the x86 has an I/O privilege level (IOPL) field in the EFLAGS
register that indicates the I/O privilege level of the currently running program. Some
instructions, while allowed to execute when the CPL is not 0, might generate a #GP
exception if the CPL value is numerically greater than the IOPL. These instructions include
CLI (clear interrupt flag), STI (set interrupt flag), IN/INS (input from port), and OUT/OUTS (output
to port).
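The IOPL rule above reduces to a single comparison, sketched here in Python. The function name and exception class are hypothetical, and the sketch deliberately models only the CPL-versus-IOPL comparison stated in the text, not any other check the processor may apply.

```python
IOPL_SENSITIVE = {"CLI", "STI", "IN", "INS", "OUT", "OUTS"}


class GeneralProtectionFault(Exception):
    """Stand-in for the x86 #GP exception."""


def check_io_privilege(instruction, cpl, iopl):
    """Raise #GP if an IOPL-sensitive instruction runs with CPL > IOPL."""
    if instruction in IOPL_SENSITIVE and cpl > iopl:
        raise GeneralProtectionFault(f"{instruction} at CPL {cpl} with IOPL {iopl}")
    return True


allowed = check_io_privilege("OUT", cpl=1, iopl=1)  # CPL equal to IOPL is permitted
```

Raising the IOPL to a guest kernel's privilege level therefore lets that kernel execute these instructions without a #GP, whereas `check_io_privilege("CLI", cpl=3, iopl=0)` faults.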
In addition to the above instructions, there are many instructions [3] that, while not
privileged, reference registers or memory locations that would allow a VM to access a
memory region not assigned to that VM. These sensitive instructions will not cause a
#GP exception. The trap-and-emulate method for virtualization of a GOS, as stated in
“VMM Requirements” on page 9, does not apply to these instructions. However, these
instructions may impact other VMs.
[Figure 8 diagram: privilege levels. Level 0: OS kernel (most privileged); Level 1; Level 2; Level 3: applications (least privileged).]
Segmented Architecture
In protected mode, all memory accesses must go through a logical address → linear
address (LA) → physical address (PA) translation scheme. The logical address to LA
translation is managed by the x86 segmentation architecture which divides a process's
address space into multiple protected segments.
A logical address, which is used as the address of an operand or of an instruction,
consists of a 16-bit segment selector and a 32-bit offset. A segment selector points to a
segment descriptor that defines the segment (see Figure 11 on page 24). The segment
base address is contained in the segment descriptor. The sum of the offset in a logical
address and the segment base address gives the LA. The Solaris OS directly maps an LA
to a process's Virtual Address (VA) by setting the segment base address to 0.
For each memory reference, a VA and a segment selector are provided to the processor
(Figure 9). The segment selector, which is loaded to the segment register, is used to
identify a segment descriptor for the address.
Figure 9. Segment Selector
Every segment register has a visible part and a hidden part, as illustrated in Figure 10
(see also [7], Volume 3A Section 3.4.3). The visible part is the segment selector, an
index that points into either the global descriptor table (GDT) or the local descriptor
table (LDT) to identify from which descriptor the hidden part of the segment register is
to be loaded. The hidden part includes portions containing segment descriptor
information loaded from the descriptor table.
Figure 10. Each segment register has a visible and a hidden part.
Segmentation: VA + segment base address (always 0 in Solaris) → linear address
Paging: linear address → physical address
[Figure 9 diagram: 16-bit segment selector layout]
Bits 15-3   Index: up to 8K descriptors
Bit 2       TI: Table Indicator; 0=GDT, 1=LDT
Bits 1-0    RPL: Requested Privilege Level
[Figure 10 diagram: segment register]
Visible part: Selector
Hidden part: Type, Base Address, Limit, CPL
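The selector layout in Figure 9 (index in bits 15-3, table indicator in bit 2, RPL in bits 1-0) can be decoded with a few shifts and masks. The function name and the sample value 0x4B are chosen only for this illustration.

```python
def decode_selector(selector):
    """Split a 16-bit segment selector into index, table indicator, and RPL."""
    if not 0 <= selector <= 0xFFFF:
        raise ValueError("selector must fit in 16 bits")
    return {
        "index": selector >> 3,                            # bits 15-3
        "table": "LDT" if (selector >> 2) & 1 else "GDT",  # bit 2 (TI)
        "rpl": selector & 0x3,                             # bits 1-0
    }


fields = decode_selector(0x4B)
```

For 0x4B the result is descriptor index 9 in the GDT with an RPL of 3. Because the index field is 13 bits wide, a table can hold up to 8K descriptors.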
The hidden fields of a segment register are loaded to the processor from a descriptor
table and are stored in the descriptor cache registers. The descriptor cache registers,
like the TLB, allow the hardware processor to refer to the contents of the segment
register's hidden part without further reference to the descriptor table. Each time a
segment register is loaded, the descriptor cache register gets fully loaded from the
descriptor table. Since each VM has its own descriptor table (for example, the GDT), the
VMM has to maintain a shadow copy of each VM’s descriptor table. A context switch to
a VM will cause the VM's shadow descriptor table to be loaded to the hardware
descriptor table. If the content of the descriptor table is changed by the VMM because
of a context switch to another VM, the segment is non-reversible, which means the
segment cannot be restored if an event such as a trap causes the segment to be saved
and replaced.
The Current Privilege Level (CPL) is stored in the hidden portion of the segment register.
The CPL is initially equal to the privilege level of the code segment from which it is
being loaded. The processor changes the CPL when program control is transferred to a
code segment with a different privilege level.
The segment descriptor contains the size, location, access control, and status
information of the segment that is stored in either the LDT or GDT. The OS sets segment
descriptors in the descriptor table and controls which descriptor entry to use for a
segment (Figure 11). See “CPU Privilege Mode” on page 45 for a discussion of setting
the segment descriptor in the Solaris OS.
Figure 11. Segment descriptor.
[Figure 11 diagram: 8-byte segment descriptor layout]
Upper 32 bits:
  Bits 31-24   Base 31:24
  Bit 23       G
  Bit 22       D/B
  Bit 21       L
  Bit 20       AVL
  Bits 19-16   SL (Segment Limit 19:16)
  Bit 15       P
  Bits 14-13   DPL
  Bit 12       S
  Bits 11-8    Type
  Bits 7-0     Base 23:16
Lower 32 bits:
  Bits 31-16   Base 15:00
  Bits 15-0    Segment Limit 15:00
Legend:
  L: 64-bit code segment
  AVL: Available for use by system software
  Base: Segment base address
  D/B: Default operation size (0=16-bit segment, 1=32-bit segment)
  DPL: Descriptor Privilege Level
  G: Granularity
  SL: Segment Limit 19:16
  P: Segment present
  S: Descriptor type (0=system, 1=code or data)
  Type: Segment type
The privilege check performed by the processor recognizes three types of privilege
levels: requested privilege level (RPL), current privilege level (CPL), and descriptor
privilege level (DPL). A segment can be loaded if the DPL of the segment is numerically
greater than or equal to both the CPL and the RPL. In other words, a segment can be
accessed only by code that has an equal or higher privilege level. Otherwise, a
general-protection exception, #GP, is generated and the segment register is not loaded.
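The load rule stated above can be written as a one-line predicate. This Python sketch covers only the comparison described in the text (DPL numerically greater than or equal to both CPL and RPL). The function name is hypothetical.

```python
def can_load_segment(cpl, rpl, dpl):
    """True if the segment's DPL is numerically >= both the CPL and the RPL."""
    for level in (cpl, rpl, dpl):
        if level not in (0, 1, 2, 3):
            raise ValueError("privilege levels range from 0 to 3")
    return dpl >= max(cpl, rpl)


kernel_loads_user_segment = can_load_segment(cpl=0, rpl=0, dpl=3)
user_loads_kernel_segment = can_load_segment(cpl=3, rpl=3, dpl=0)
```

Ring 0 code can load a DPL 3 segment, while ring 3 code asking for a DPL 0 segment is refused and would receive a #GP. A privileged caller that passes a selector with RPL 3 is also refused for a DPL 0 segment, since the weaker of CPL and RPL governs.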
On 64-bit systems, a linear address space (flat memory model) is used to create a
continuous, unsegmented address space for both kernel and application programs.
Segmentation is effectively disabled: because no VA to LA translation takes place,
no segment-level privilege checking is applied. The only protection left to prevent a
user application from accessing kernel memory is the page protection mechanism.
Because page-level protection distinguishes only supervisor mode (CPL 0, 1, or 2) from
user mode (CPL 3), the VMM can protect its own memory from a GOS kernel only by running
that kernel in user mode. This is why the kernel of a GOS has to run in ring 3 (user mode
in page level protection) on a 64-bit system.
Paging Architecture
When operating in protected mode, the LA → PA translation is performed by the
paging hardware of the x86 processor. To access data in memory, the processor requires
the presence of a VA → PA translation in the TLB (in Solaris, LA is equal to VA), the page
table backing up the TLB entry, and a page of physical memory. For the x86 processor,
loading the VA → PA page translation from the page table to the TLB is performed
automatically by the processor. The OS is responsible for allocating physical memory
and loading the VA → PA translation to the page table.
When the processor cannot load a translation from the page table, it generates a page
fault exception, #PF. A #PF exception on x86 processors usually means a physical page
has not been allocated, because the loading of the translation from the page table to
the TLB is handled by the processor (Figure 12).
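The division of labor between processor and OS can be sketched in Python. The `Mmu` class, the single-level dictionary used as a page table, the 4 KB page size, and the free-page list are simplifying assumptions for this illustration. Real x86 page tables are multi-level structures in memory.

```python
class PageFaultError(Exception):
    """Stand-in for the x86 #PF exception."""


class Mmu:
    """Toy MMU: 'hardware' fills the TLB from a page table the OS maintains."""

    PAGE = 4096  # 4 KB pages assumed

    def __init__(self):
        self.page_table = {}  # virtual page -> physical page, written by the OS
        self.tlb = {}         # cached translations, loaded automatically on a miss

    def access(self, va):
        vpn, offset = divmod(va, self.PAGE)
        if vpn not in self.tlb:                   # TLB miss
            if vpn not in self.page_table:        # the hardware walk finds nothing
                raise PageFaultError(hex(va))
            self.tlb[vpn] = self.page_table[vpn]  # hardware loads the TLB entry
        return self.tlb[vpn] * self.PAGE + offset


def os_page_fault_handler(mmu, va, free_pages):
    """OS side: allocate a physical page and install the translation."""
    mmu.page_table[va // mmu.PAGE] = free_pages.pop()


mmu = Mmu()
free_pages = [0x30, 0x20]
try:
    mmu.access(0x5008)
except PageFaultError:
    os_page_fault_handler(mmu, 0x5008, free_pages)
pa = mmu.access(0x5008)  # retry after the OS has installed the mapping
```

The first access faults because no physical page has been allocated. After the handler runs, the retry succeeds with `pa` equal to 0x20008, and the TLB entry is filled without further OS involvement.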
Figure 12. Translations through the TLB are accomplished in the processor itself, while translations through page tables are performed by the OS.
The x86 processor uses a control register, %cr3, to manage the loading of address
translations from the page table to the TLB. The base address of a process's page table
is kept by the OS and loaded to %cr3 when the process is switched in to run. On the
Solaris OS, %cr3 is kept in the kernel hat structure. Each address space, as, has one
hat structure. The mdb(1) command can be used to find the value of the %cr3
register of a process:
    % mdb -k
    > ::ps
    S    PID   PPID   PGID    SID   UID      FLAGS             ADDR NAME
    ....
    R   9352   9351   9352   9352 28155 0x4a014000 fffffffec2ae78c0 bash
    > fffffffec2ae78c0::print -t 'struct proc' ! grep p_as
        struct as *p_as = 0xfffffffed15ba7e0
    > 0xfffffffed15ba7e0::print -t 'struct as' ! grep a_hat
        struct hat *a_hat = 0xfffffffed1718e98
    > 0xfffffffed1718e98::print -t 'struct hat' ! grep hat_htable
        htable_t *hat_htable = 0xfffffffed0f67678
    > 0xfffffffed0f67678::print -t 'struct htable' ! grep ht_pfn
        pfn_t ht_pfn = 0x16d37 // %cr3
When multiple VMs are running, the automatic loading of page translations from the
page table to the TLB actually makes virtualization more difficult because all page
tables have to be accessible by the processor. As a result, page table updates can only
be performed by the VMM to enforce consistent memory usage on the system. “Page
Translations Virtualization” on page 14 discusses two mechanisms for managing page
tables by the VMM.
Another issue of the x86 paging architecture is related to the flushing of TLB entries.
Unlike many RISC processors, which support a tagged TLB, the x86 TLB is not tagged. A
TLB miss results in a walk of the page table by the processor to find and load the
translation to the TLB. Since the TLB is not tagged, a change in the %cr3 register due to
a virtual memory context switch invalidates all TLB entries. This adversely
affects performance if the VMM and VM are not in the same address space.
A typical solution to address the performance impact of TLB flushing is to reserve a
region of the VM address space for the VMM. With this solution, the VMM and VM can
run from the same address space and thus avoid a TLB flush when a VM memory
operation traps to the VMM. The latest CPUs from Intel and AMD with hardware
virtualization support include tagged TLBs, and consequently the translations of
different address spaces can co-exist in the TLB.
I/O and Interrupts
In general, x86 support for exceptions and I/O interrupts does not impose any
particular challenge to the implementation of a VMM. The x86 processor uses the
interrupt descriptor table (IDT) to provide a handler for a particular interrupt or
exception. Access to the IDT functions is privileged and, therefore, can only be
performed by the VMM. The Sun xVM Hypervisor for x86 provides a mechanism to relay
hardware interrupts to a VM through its event channel hypervisor calls (see “Event
Channels” on page 43).
The x86 processor allows device memory and registers to be accessed through either an
I/O address space or memory-mapped I/O. An I/O address space access is performed
using special I/O instructions such as IN and OUT. These instructions, while allowed to
execute when the CPL is not 0, will result in a #GP exception if the processor's CPL
value is higher than the I/O privilege level (IOPL). The Sun xVM Hypervisor for x86
provides a hypervisor call to set the IOPL, enabling a GOS to directly access I/O ports by
setting the IOPL to its privilege level.
When using memory-mapped I/O, any of the processor’s instructions that reference
memory can be used to access an I/O location with protection provided through
segmentation and paging. PIO, whether it is using I/O address space or memory-
mapped I/O, is normally uncacheable as device registers are usually accessed with
precise programming order. PIO uses addresses in a VM's address space and doesn't
cause any security and isolation issues.
The x86 processor uses physical addresses for DMA. DMA in a virtualized x86 system has
certain issues:
• A 32-bit, non-dual-address-cycle (DAC) PCI device cannot address beyond 4 GB of
memory.
• One domain's DMA can intrude into another domain's physical
memory, creating a risk of security violations.
The solution to the above issues is to have an I/O memory management unit (IOMMU)
as a part of an I/O bridge or north bridge that performs a translation of I/O addresses
(for example, an address that appears on the PCI bus) to machine memory addresses.
The I/O address can be any address that is recognized by the IOMMU. An IOMMU can
also improve the performance of large chunk data transfers by mapping a contiguous
I/O address to multiple physical pages in one DMA transaction. However, the IOMMU
may hurt the I/O performance for small data transfers because the DMA setup cost is
higher than that of DMA without an IOMMU.
For more details on the IOMMU, also known as hardware address translation service
(hardware ATS), see “I/O Virtualization” on page 16.
Timer Devices
An OS typically uses several timer devices for different purposes. Timer devices are
characterized by their frequency granularity, frequency reliability, and ability to
generate interrupts and receive counter input. Understanding the characteristics of
timer devices is important for the discussion of timekeeping in a virtualized
environment, as the VMM provides virtualized timekeeping of some timers to the
VMs running on top of it. Virtualized timekeeping has a significant impact on the accuracy
of time-related functions in the GOS and, thus, on the performance and results of
time-sensitive applications.
An x86 system typically includes the following timer devices:
• Programmable Interval Timer (PIT)
The PIT uses a 1.193182 MHz crystal oscillator and has a 16-bit counter and counter input
register. The PIT contains three timers. Timer 0 can generate interrupts and is used by
the Solaris OS as the system timer. Timer 1 was historically used for RAM refreshes
and timer 2 for the PC speaker.
• Time Stamp Counter (TSC)
The TSC is a feature of the x86 architecture that is accessed via the RDTSC
instruction. The TSC, a 64-bit counter, changes with the processor speed. The TSC
cannot generate interrupts and has no counter input register. The TSC is the finest
grained of all timers and is used in the Solaris OS as the high resolution timer. For
example, the gethrtime(3C) function uses the TSC to return the current high-
resolution real time.
• Real Time Clock (RTC)
The RTC is used as the time-of-day (TOD) clock in the Solaris OS. The RTC uses a battery
as an alternate power source, enabling it to continue to keep time while the primary
source of power is not available. The RTC can generate interrupts and has a counter
input register. It is the coarsest-grained timer on the system.
• Local Advanced Programmable Interrupt Controller (APIC) Timer
The local APIC timer, which is a part of the local APIC, has a 32-bit counter and
counter input register. It can generate interrupts and has the same frequency as the
front side bus. The Solaris OS supports the use of the local APIC timer as one of the
cyclic timers.
• High Precision Event Timer (HPET)
The HPET is a relatively new timer available in some new x86 systems. The HPET is
intended to replace the PIT and the RTC for generating periodic interrupts. The HPET
can generate interrupts, is 64 bits wide, and has a counter input register. The Solaris
OS currently does not use the HPET.
• Advanced Configuration and Power Interface (ACPI) Timer
The ACPI timer has a 24-bit counter, can generate interrupts, and has no counter
input register. The Solaris OS does not use the ACPI timer.
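The relationship between a timer's clock frequency and its counter width can be worked through for the PIT, using the 1.193182 MHz clock and 16-bit counter given above. The helper names and the 100 Hz tick rate are hypothetical choices for this sketch and are not taken from the Solaris OS.

```python
PIT_HZ = 1_193_182        # PIT input clock from the text (1.193182 MHz)
PIT_COUNTER_BITS = 16     # counter width from the text


def pit_divisor(tick_hz):
    """Counter value that makes the timer interrupt about tick_hz times a second."""
    divisor = round(PIT_HZ / tick_hz)
    if not 1 <= divisor <= 2 ** PIT_COUNTER_BITS:
        raise ValueError("tick rate not reachable with a 16-bit counter")
    return divisor


def longest_pit_period_ms():
    """Longest interval a 16-bit counter can time before it wraps."""
    return 2 ** PIT_COUNTER_BITS / PIT_HZ * 1000


divisor_100hz = pit_divisor(100)   # hypothetical 100 Hz system tick
longest_ms = longest_pit_period_ms()
```

A 100 Hz tick needs a counter value of 11932, and the longest interval the counter can time is just under 55 ms. Timer interrupts therefore recur at least about 18 times per second, and each one must be virtualized when the timer is presented to a VM.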
29 SPARC Processor Architecture Sun Microsystems, Inc.
Chapter 4
SPARC Processor Architecture
This chapter provides background information on the SPARC processor architecture that
is relevant to later discussions on Logical Domains (Chapter 7 on page 79).
The SPARC (Scalable Processor Architecture) processor, first introduced in 1987, is a big-
endian RISC processor ISA. SPARC International (SI), an industry organization, was
established in 1989 to promote the open SPARC architecture. In 1994, SI introduced a
64-bit version of the SPARC processor as SPARC v9. The UltraSPARC processor, which is a
Sun-specific implementation of SPARC v9, was introduced in 1996 and has been
incorporated into all Sun SPARC platforms shipping today.
In 2005, Sun's UltraSPARC architecture was open sourced as the UltraSPARC
Architecture 2005 Specification [2]. Included in this enhanced UltraSPARC 2005
specification is support for Chip-level Multithreading (CMT) for a highly threaded
processor architecture and a hyperprivileged mode that allows the hypervisor to
virtualize the processor to run multiple domains. The design of the UltraSPARC T1
processor, which is the first implementation of the UltraSPARC Architecture 2005
Specification, is also open sourced. The UltraSPARC T1 processor includes 8 cores with 4
strands in each core, providing a total of 32 strands per processor.
In August 2007 Sun announced the UltraSPARC T2 processor, the follow-up CMT
processor to the UltraSPARC T1 processor, and the OpenSPARC T2 architecture [33]
which is the open source version of the UltraSPARC T2 processor. Sun also released the
UltraSPARC Architecture 2007 specification [34] which adds a section for error handling
and expands the discussion for memory management. The UltraSPARC T2 processor has
several enhancements over the UltraSPARC T1 processor. These enhancements include
64 strands, per-core floating-point and graphics units, and integrated PCIe and 10 Gb
Ethernet (for more details see “Processor Components” on page 31).
The remainder of this chapter discusses the following features of the UltraSPARC T1/T2
processor architecture, and describes their effect on virtualization implementations:
• Processor privilege mode — The UltraSPARC 2005 specification defines a
hyperprivileged mode for the hypervisor operations.
• Sun4v Chip Multithreaded architecture — This feature enables the creation of up to
32 domains, each with its own dedicated strands, on an UltraSPARC T1 processor, and
up to 64 domains on an UltraSPARC T2 processor.
• Address Space Identifier (ASI)— The ASI provides functionality to control access to a
range of address spaces, similar to the segmentation used by x86 processors.
• Memory Management Unit (MMU) — The software-controlled MMU allows an
efficient redirection of page faults to the intended domain for loading translations.
• Trap and interrupt handling — Each strand (virtual processor) has its own trap and
interrupt priority registers. This functionality allows the hypervisor to re-direct traps
to the target CPU and enables the trap to be taken by the GOS's trap handler.
Note – The terms strand, hardware thread, logical processor, virtual CPU and virtual processor are used by various documents to refer to the same concept. For consistency, the term strand is used in this chapter.
Processor Mode of Operation
The UltraSPARC 2005 specification defines three privilege modes: non-privileged,
privileged, and hyperprivileged. In hyperprivileged mode, the processor can access all
registers and address spaces, and can execute all instructions. Instructions, registers,
and address spaces for privileged and non-privileged modes are restricted.
The processor operates in privileged mode when PSTATE.priv is set to 1 and
HPSTATE.hpriv is set to 0. The processor operates in hyperprivileged mode when
HPSTATE.hpriv is set to 1 (PSTATE.priv is ignored).
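The mode selection rule above can be written as a small decision function. The following is a minimal sketch in C, not hardware or hypervisor code: the enum and the function name current_mode() are hypothetical, and the two boolean parameters stand in for the PSTATE.priv and HPSTATE.hpriv bits.

```c
#include <assert.h>
#include <stdbool.h>

/* The three privilege modes named in the text. */
typedef enum {
    MODE_NONPRIVILEGED,
    MODE_PRIVILEGED,
    MODE_HYPERPRIVILEGED
} priv_mode_t;

/*
 * Derive the current mode from the two state bits.
 * pstate_priv and hpstate_hpriv model PSTATE.priv and HPSTATE.hpriv;
 * the function name is an assumption of this sketch.
 */
priv_mode_t current_mode(bool pstate_priv, bool hpstate_hpriv)
{
    if (hpstate_hpriv)
        return MODE_HYPERPRIVILEGED;   /* PSTATE.priv is ignored */
    if (pstate_priv)
        return MODE_PRIVILEGED;
    return MODE_NONPRIVILEGED;
}
```

Checking HPSTATE.hpriv first mirrors the rule that PSTATE.priv has no effect once the processor is in hyperprivileged mode.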
Table 1 lists the availability of instructions, registers, and address spaces for each of the
privilege modes, and includes information on where further details can be found in the
UltraSPARC Architecture 2005 Specification [2].
Table 1. Documentation describing the availability of components in the UltraSPARC processor.
Component | Location (a) | Comments
Instruction | Table 7-2 | All instructions except SIR, RDHPR, and WRHPR (which require hyperprivilege to execute) can be executed from the privileged mode.
Registers | Chapter 5 | There are seven hyperprivileged registers: HPSTATE, HTSTATE, HINTP, HTBA, HVER, HSTICK_CMPR, and STRAND_STS. These registers are used by the hypervisor in the hyperprivileged mode.
Address Space | Tables 9-1 and 10-1 | ASIs 0x30-0x7F are for hyperprivileged access only. These ASIs are mainly for CMT control, MMU, TLB, and hyperprivileged scratch registers.
a. Location in the UltraSPARC Architecture 2005 Specification [2].
Based on the availability of instructions, registers, and the ASI in hyperprivileged mode,
the following functions of the hypervisor can be deduced:
• Reset the processor: SIR instruction
• Control hyperprivileged traps and interrupts: HTSTATE, HTBA, HINTP registers
• Control strand operation: ASI 0x41, and HSTICK_CMPR and STRAND_STS registers
• Manage MMU: ASI 0x50-0x5F
Processor Components
The UltraSPARC T1 processor [10] contains eight cores, and each core has hardware
support for four strands. One FPU and one L2 cache are shared among all cores in the
processor. Each core has its own Level 1 instruction and data cache (L1 Icache and
Dcache) and TLB that are shared among all strands in the core. In addition, each strand
contains the following:
• A full register file with eight register windows and four sets of global registers (a total
of 160 registers: 8 * 16 registers per window + 4 * 8 global registers)
• Most of the ASIs
• Ancillary privileged registers
• Trap queue with up to 16 entries
This hardware support in each strand allows the hypervisor to partition the processor
into 32 domains, with one strand for each domain. Each strand can execute instructions
separately without requiring a software scheduler in the hypervisor to coordinate the
processor resources.
Table 2 summarizes the association of processor components to their location in the
processor, core and strand.
Table 2. Location of key processor components in the UltraSPARC T1 processor.
Location   Components
Processor  Floating-point unit; L2 cache crossbar; L2 cache
Core       6-stage instruction pipeline; L1 Icache and Dcache; TLB
Strand     Register file with 160 registers; most of the ASIs; ancillary state registers (ASR); trap registers; privileged registers
The UltraSPARC T2 processor [33] is built upon the UltraSPARC T1 architecture. It has the
following enhancements over the UltraSPARC T1 processor:
• Eight strands per core (for a total of 64 strands)
• Two integer pipelines per core, with each integer pipeline supporting 4 strands
• A floating-point and graphics unit (FGU) per core
• Integrated PCI-E and 10 Gb/Gb Ethernet (System-on-Chip)
• Eight banks of 4 MB L2 cache
The UltraSPARC T2 has a total of 64 strands in 8 cores, and each core has its own
floating-point and graphics unit (FGU). This allows up to 64 domains to be created on
the UltraSPARC T2 processor. This design also adds integrated support for industry
standard I/O interfaces such as PCI-Express and 10 Gb Ethernet.
Table 3 summarizes the association of processor components to physical processor, core
and strand.
Table 3. Location of key processor components in the UltraSPARC T2 processor.
Address Space Identifier
Unlike x86 processors in 32-bit mode, which use segmentation to divide a process's
address space into several segments of protected address spaces, the SPARC V9
processor has a flat 64-bit address space. An address in the SPARC V9 processor is a
tuple consisting of an 8-bit address space identifier (ASI) and a 64-bit byte-address offset
within the specified address space. The ASI provides attributes of an address space,
including the following:
• Privileged or non-privileged
• Register or memory
• Endianness (for example, little-endian or big-endian)
• Physical or virtual address
• Cacheable or non-cacheable
The SPARC processor's ASI allows different types of address spaces (user virtual address
space, kernel virtual address space, processor control and status registers, etc.) to co-
exist as separate and independent address spaces for a given context. Unlike x86
processors in which user processes and the kernel share the same address space, user
processes and the kernel have their own address space on SPARC processors.
Access to these address spaces is protected by the ASI associated with each address
space. ASIs in the range 0x00-0x2F may be accessed only by software running in
privileged or hyperprivileged mode; ASIs in the range 0x30-0x7F may be accessed
only by software running in hyperprivileged mode. An access to a restricted (privileged
or hyperprivileged) ASI (0x00-0x7F) by non-privileged software will result in a
privileged_action trap.
Table 9-1 and Table 10-1 of [2] provide a summary and description for each ASI.
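The ASI access rule can be expressed as a range check. The sketch below is illustrative only: the type and function names are hypothetical, and it models just the three ranges (0x00-0x2F, 0x30-0x7F, and the unrestricted range 0x80-0xFF), not the individual ASI assignments listed in the specification tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Ordered from least to most privileged so that modes can be compared. */
typedef enum { ASI_NONPRIV = 0, ASI_PRIV = 1, ASI_HYPERPRIV = 2 } asi_mode_t;

/* Minimum mode an ASI requires, per the ranges quoted in the text. */
asi_mode_t asi_required_mode(uint8_t asi)
{
    if (asi <= 0x2F)
        return ASI_PRIV;        /* 0x00-0x2F: privileged or hyperprivileged */
    if (asi <= 0x7F)
        return ASI_HYPERPRIV;   /* 0x30-0x7F: hyperprivileged only */
    return ASI_NONPRIV;         /* 0x80-0xFF: unrestricted */
}

/* True if software running in mode 'current' may use the given ASI. */
bool asi_access_allowed(asi_mode_t current, uint8_t asi)
{
    return current >= asi_required_mode(asi);
}
```

A false result for non-privileged software on ASIs 0x00-0x7F corresponds to the privileged_action trap described above.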
Memory Management Unit
The traditional UltraSPARC architecture supports two types of memory addressing:
• Virtual Address (VA) — managed by the GOS and used by user programs
• Physical address (PA) — passed by the processor to the system bus when accessing
physical memory
Table 3 contents (UltraSPARC T2 processor):
Location   Components
Processor  8 banks of 4 MB L2 cache; L2 cache crossbar; memory controller; PCI-E; 10 Gb/Gb Ethernet
Core       2 instruction pipelines (8 stages); L1 Icache and Dcache; TLB; FGU (12 stages)
Strand     Full register file with 8 windows; most of the ASIs; ancillary state registers (ASR); privileged registers
The Memory Management Unit (MMU) of the UltraSPARC processor provides the
translation of VAs to PAs. This translation enables user programs to use a VA to locate
data in physical memory.
The SpitFire Memory Management Unit (sfmmu) is Sun's implementation of the
UltraSPARC MMU. The sfmmu hardware consists of Translation Lookaside Buffers (TLBs)
and a number of MMU registers:
• Translation Lookaside Buffer (TLB)
The TLB provides virtual to physical address translations. Each entry of the TLB is a
Translation Table Entry (TTE) that holds information for a single page mapping of
virtual to physical addresses. The format of the TTE is shown in Figure 13. The TTE
consists of two 64-bit words, representing the tag and data of the translation. The
privileged field, P, controls whether or not the page can be accessed by non-
privileged software.
• MMU registers
A number of MMU registers are used for accessing TLB entries, removing TLB entries
(demap), context management, handling TLB misses, and support for Translation
Storage Buffer (TSB) access. The TSB, an array of TTE entries, is a cache of translation
tables used to quickly reload the TLB. The TSB resides in the system memory and is
managed entirely by the OS. The UltraSPARC processors include some MMU
hardware registers for speeding up TSB access. The TLB miss handler will first search
the TSB for the translation. If the translation is not found in the TSB, the TLB handler
calls a more sophisticated (and slower) TSB miss handler to load the translation
into the TSB.
Figure 13. The translation lookaside buffer (TLB) is an array of translation table entries containing tag and data portions.
[Figure 13 fields. TTE tag: context_id (bits 63-48), zero (bits 47-42), va (bits 41-0). TTE data, from bit 63 down to bit 0: v, nfo, soft2, taddr, ie, e, cp, cv, p, ep, w, soft, sz.]
A TLB hit occurs if both the context and virtual address match an entry in the TLB.
Address aliasing (multiple TLB entries with the same physical address) is permitted.
Unlike the x86 processor, the loading of page translations to the TLB is manually
managed by software through traps. In the event of a TLB miss, a trap is generated
trying first to get the translation from the Translation Storage Buffer (TSB) (Figure 14).
The TSB, an in-memory array of translations, acts like a direct-mapped cache for the
TLB. If the translation is not present in the TSB, a TSB miss trap is generated. The TSB
miss trap handler uses a software lookup mechanism based on the hash memory entry
block structure, hme_blk, to obtain the TTE. If a translation is still not found in
hme_blk, the kernel generic trap handler is invoked to call the kernel function
pagefault() to allocate physical memory for the virtual address and load the
translation into the hme_blk hash structure.
Figure 14 depicts the mechanism for handling TLB misses in an unvirtualized domain.
Figure 14. Handling a TLB miss in an unvirtualized domain, UltraSPARC T1/T2 processor architecture.
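The first stage of the miss path, the lookup in a direct-mapped TSB, can be sketched as follows. This is a simplified model and not the Solaris sfmmu code: the structure layout, the eight-entry table, the 8 KB page size, and the function names are assumptions made for illustration. A false return stands for the case in which the slower TSB miss handler (the hme_blk lookup) would have to run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TSB_ENTRIES 8     /* illustrative size; must be a power of two */
#define PAGE_SHIFT  13    /* assumes 8 KB pages */

/* Simplified TTE: the tag identifies the mapping, the data holds the target page. */
typedef struct {
    bool     valid;
    uint16_t context;     /* context ID from the TTE tag */
    uint64_t vpn;         /* virtual page number from the TTE tag */
    uint64_t ppn;         /* page number from the TTE data */
} tte_t;

static tte_t tsb[TSB_ENTRIES];

/* Direct-mapped: the low bits of the virtual page number select exactly one slot. */
static unsigned tsb_index(uint64_t vpn)
{
    return (unsigned)(vpn & (TSB_ENTRIES - 1));
}

/* Load a translation into the TSB, replacing whatever occupied the slot. */
void tsb_insert(uint16_t ctx, uint64_t va, uint64_t ppn)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    tsb[tsb_index(vpn)] = (tte_t){ true, ctx, vpn, ppn };
}

/* Returns true on a TSB hit; false means the TSB miss path must be taken. */
bool tsb_lookup(uint16_t ctx, uint64_t va, uint64_t *ppn_out)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    const tte_t *e = &tsb[tsb_index(vpn)];

    if (e->valid && e->context == ctx && e->vpn == vpn) {
        *ppn_out = e->ppn;
        return true;
    }
    return false;
}
```

Because each virtual page maps to a single slot, two pages whose page numbers share the same low bits evict each other, which is the usual trade-off of a direct-mapped cache: a lookup costs one memory access, at the price of conflict misses.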
Similarly, Figure 15 depicts how TLB misses are handled in a virtualized domain. In a
virtualized environment, the UltraSPARC T1/T2 processor adds a Real Address type, in
addition to the VA and PA, into the types of memory addressing (Figure 15). Real
addresses (RA), which are equivalent to the physical memory in Sun xVM Server (see
“Physical Memory Management” on page 52), are provided to the GOS as the
underlying physical memory allocated to it. The GOS-maintained TSBs are used to
translate VAs into RAs. The hypervisor manages the translation from RA to PA.
Figure 15. Handling a TLB miss in a virtualized domain, UltraSPARC T1/T2 processor architecture.
Applications, which are non-privileged software, use only VAs. The OS kernel, which is
privileged software, uses both VAs and RAs. The hypervisor, which is hyperprivileged
software, normally uses PAs. “Physical Memory Allocation” on page 88 discusses in
detail the types of memory addressing used in LDoms.
The UltraSPARC T2 processor adds a hardware table walk for loading TLB entries. The
hardware table walk accesses the TSBs to find TTEs that match the virtual address and
context ID of the request. Since a GOS cannot access or control physical memory, the
TTEs in the TSBs controlled by a GOS contain real page numbers, not physical page
numbers (see “Physical Memory Allocation” on page 88). TTEs in the TSBs controlled by
the hypervisor can contain real page numbers or physical page numbers. The
hypervisor performs the RA-to-PA translation within the hardware table walk to permit
the hardware table walk to load a GOS's TTEs into the TLB for VA-to-PA translation.
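The hypervisor's half of the translation, from real address to physical address, can be pictured as a lookup in a per-domain list of memory ranges. The sketch below is a hypothetical model, not the LDoms implementation: the ra_range_t layout and the function name ra_to_pa() are assumptions, and a real hypervisor may organize its mappings differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One contiguous block of real memory assigned to a domain (hypothetical layout). */
typedef struct {
    uint64_t ra_base;   /* first real address as seen by the guest */
    uint64_t pa_base;   /* physical address that backs ra_base */
    uint64_t size;      /* length of the block in bytes */
} ra_range_t;

/*
 * Translate a guest real address to a physical address.
 * Returns false if the RA does not belong to any range of the domain,
 * in which case the translation must not be loaded into the TLB.
 */
bool ra_to_pa(const ra_range_t *ranges, int n, uint64_t ra, uint64_t *pa)
{
    for (int i = 0; i < n; i++) {
        if (ra >= ranges[i].ra_base && ra - ranges[i].ra_base < ranges[i].size) {
            *pa = ranges[i].pa_base + (ra - ranges[i].ra_base);
            return true;
        }
    }
    return false;
}
```

In this picture a GOS TTE supplies the VA-to-RA step and the range lookup supplies the RA-to-PA step, so the entry that finally reaches the TLB maps a VA directly to a PA.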
[Figure 14 labels: processor MMU (TLB); TTE cache in memory (TSB); OS data structure (hme_blk); OS functions (hat_memload(), pagefault()). Flow: TLB miss, TTE load to TLB from the TSB; TSB miss, TTE load to TSB from hme_blk; allocate memory.]
[Figure 15 labels: as in Figure 14, with an added stage managed by the hypervisor. Flow: TLB miss, PA<-RA (hypervisor); TLB miss, RA<-VA (TSB); TSB miss, TTE load to TSB from hme_blk; allocate memory.]
Traps
In the SPARC processor, a trap transfers software execution from one privilege mode to
another privilege mode at the same or a higher level. The only exception is that
non-privileged mode cannot trap to non-privileged mode. A trap can be generated
by the following methods:
• Internally by the processor (memory faults, privileged exceptions, etc.)
• Externally generated by I/O devices (interrupts)
• Externally generated by another processor (cross calls)
• Software generated (for example, the Tcc instruction)
A trap is associated with a Trap Type (TT), a 9-bit value. (TT values 0x180-0x1FF are
reserved for future use.) The transfer of software execution occurs through a trap table
that contains an array of TT handlers indexed by the TT value. Each trap table entry is
32 bytes in length and contains the first eight instructions of the TT handler. When a
trap occurs, the processor gets the TT from the TT register and the trap table base
address (TBA) from the TBA register. After saving the current execution state and
updating some registers, the processor starts to execute the instructions in the trap
table handler.
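Because every trap table entry is 32 bytes long, the handler address follows from the table base and the trap type by simple arithmetic. The sketch below shows only that arithmetic. It is a simplification: the function names are hypothetical, and it ignores the way SPARC V9 hardware selects a separate half of the table for traps taken at TL > 0.

```c
#include <assert.h>
#include <stdint.h>

#define TRAP_ENTRY_BYTES 32u   /* eight 4-byte instructions per entry */

/* Byte offset of the entry for trap type tt from the start of a trap table. */
uint64_t trap_entry_offset(unsigned tt)
{
    assert(tt <= 0x1FF);       /* TT is a 9-bit value */
    return (uint64_t)tt * TRAP_ENTRY_BYTES;
}

/* Address of the first handler instruction, given the table base address. */
uint64_t trap_vector(uint64_t table_base, unsigned tt)
{
    return table_base + trap_entry_offset(tt);
}
```

For example, the vector interrupt (trap type 0x60) lies 0x60 * 32 = 0xC00 bytes into the table.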
The SPARC processors support nesting traps using a trap level (TL). The maximum TL
(MAXTL) value is typically in the range of 2-6, and depends on the processor; in
UltraSPARC T1/T2 processors, MAXTL is 6. Each trap level has one set of trap stack
control registers: trap type (TT), trap program counter (TPC), trap next program
counter (TNPC), and trap state (TSTATE). These registers provide trap software
execution state and control for the current TL. The ability to support nested traps in
SPARC processors makes the implementation of an OS trap handler easier and more
efficient, as the OS doesn't need to explicitly save the current trap stack information.
On UltraSPARC T1/T2 processors, each strand has a full set of trap control and stack
registers which include TT, TL, TPC, TNPC, TSTATE, HTSTATE (hyperprivileged trap
state), TBA, HTBA (hyperprivileged trap base address), and PIL (priority interrupt
level). This design feature allows each strand to receive traps independently of other
strands. This capability significantly helps trap handling and management by the
hypervisor, as traps are delivered to a strand without being queued up in the hypervisor.
Interrupts
On SPARC platforms, interrupt requests are delivered to the CPU as traps. Traps 0x041
through 0x04F are used for Priority Interrupt Level (PIL) interrupts, and trap 0x60 is
used for the vector interrupt. There are 15 interrupt levels for PIL interrupts. Interrupts
are serviced in accordance with their PIL, with higher PILs having higher priority. The
vector interrupt is used to support the data bearing vector interrupt which allows a
device to include its private data in the interrupt packet (also known as the mondo
vector). With vector interrupt, device CSR access can be eliminated and the complexity
of device hardware can be reduced.
PIL interrupts are delivered to the processor through the ASR's SOFTINT_REG register.
The SOFTINT_REG register contains a 15-bit int_level field. When a bit in this field
is set, a trap is generated and the PIL of the trap corresponds to the location of the bit
in that field. There is one SOFTINT_REG for each strand.
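The relationship between a bit in int_level, the resulting PIL, and the trap type can be illustrated with a short sketch. The packing used here is an assumption made for the example (bit n - 1 of a 15-bit value requests interrupt level n); the position of the field inside the real SOFTINT_REG register is not modeled, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PIL_TRAP_BASE 0x40u    /* interrupt level n is delivered as trap 0x40 + n */

/*
 * Return the highest interrupt level requested in int_level, or 0 if none.
 * Assumed packing: bit (n - 1) requests level n, for n = 1..15.
 */
unsigned highest_pending_pil(uint16_t int_level)
{
    for (unsigned pil = 15; pil >= 1; pil--) {
        if (int_level & (1u << (pil - 1)))
            return pil;
    }
    return 0;
}

/* Trap type used to deliver an interrupt at the given level. */
unsigned pil_trap_type(unsigned pil)
{
    assert(pil >= 1 && pil <= 15);
    return PIL_TRAP_BASE + pil;
}
```

Scanning from level 15 downward reflects the rule that higher PILs have higher priority.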
In LDoms, the interrupt delivery from an I/O device to a GOS is a two-step process:
• An I/O device sends an interrupt request using the vector interrupt (trap 0x60) to the
hypervisor. The hypervisor inserts the interrupt request into the interrupt queue of
the target virtual processor.
• The target processor receives the interrupt request on its interrupt queue through
trap 0x7D (for device) or 0x7C (for cross calls), and schedules an interrupt to itself to
be processed at a later time by setting bits in the privileged SOFTINT register which
causes a PIL interrupt (trap 0x41-0x4F). For more details on interrupt delivery, see
“Trap and Interrupt Handling” on page 85.
Section II
Hardware Virtualization Implementations
• Chapter 5: Sun xVM Server (page 39)
• Chapter 6: Sun xVM Server with Hardware VM (HVM) (page 63)
• Chapter 7: Logical Domains (page 79)
• Chapter 8: VMware (page 97)
39 Sun xVM Server Sun Microsystems, Inc.
Chapter 5
Sun xVM Server
Sun xVM Server is a paravirtualized Solaris OS that incorporates the Xen open source
community work. The open source VMM, Xen, was originally developed by the Systems
Research Group of the University of Cambridge Computer Laboratory, as part of the UK-
EPSRC funded XenoServers project. The first versions of Xen, targeted at the Linux
community for the x86 processor, required the Linux kernel to be specifically modified
to run on the Xen VMM. This OS paravirtualization made it impossible to run Windows
on early versions of Xen, because Microsoft did not permit the Windows software to be
modified.
In December 2005 the Xen development team released Xen 3.0, the first version of its
VMM that supported hardware-assisted virtual machines (HVM). With this new version,
an unmodified OS could be hosted on the Intel VT-x and AMD-V (Pacifica) processors.
Xen 3.0 eliminated the need for paravirtualization and enabled Microsoft Windows to
run in a Xen environment side-by-side with Linux and the Solaris OS.
Xen 3.0 supports the x86 CPU both with HVM and without HVM. Xen 3.0 also extends
support for symmetric multiprocessing, 64-bit operating systems, and up to 64 GB RAM
allowed by the x86 physical address extension (PAE) in 32-bit mode.
HVM technology affects the Xen implementation in many ways. This chapter discusses
the architecture and design of Sun xVM Server, which does not leverage the processor
HVM feature. Chapter 6 discusses Sun xVM Server for x86 processors with HVM support
(Sun xVM Server with HVM).
Note – Sun xVM Server includes support for the Xen open source community work on the x86 platform and support for LDoms on the UltraSPARC T1/T2 platform. In this paper, in order to distinguish the discussion of x86 and UltraSPARC T1/T2 processors, Sun xVM Server is specifically used to refer to the Sun hardware virtualization product for the x86 platform, and LDoms is used to refer to the Sun hardware virtualization product for the UltraSPARC T1 and T2 platforms.
This chapter is organized as follows:
• “Sun xVM Server Architecture Overview” on page 40 provides an overview of the Sun
xVM Server architecture.
• “Sun xVM Server CPU Virtualization” on page 45 discusses the CPU virtualization
employed by Sun xVM Server.
• “Sun xVM Server Memory Virtualization” on page 52 describes memory management
issues.
• “Sun xVM Server I/O Virtualization” on page 56 discusses the I/O virtualization used
in Sun xVM Server.
Sun xVM Server Architecture Overview
A Sun xVM Server virtualized system consists of an x86 system, a VMM, a control VM
running Sun xVM Server (Dom0), and zero or more VMs (DomU), as shown in Figure 16.
The Sun xVM Hypervisor for x86, the VMM of the Sun xVM Server system, manages
hardware resources and provides services to the VMs. Each VM, including Dom0, runs
an instance of a guest operating system (GOS) and is capable of communicating with
the VMM through a set of hypervisor calls.
Figure 16. A Sun xVM Server virtualized system consists of a VMM, a control VM (Dom0), and zero or more VMs (DomU).
The Dom0 VM has some unique characteristics not available in other VMs:
• First VM started by the VMM
• Able to directly access I/O devices
• Runs domain manager to create, start, stop, and configure other VMs
• Provides I/O access service to other VMs (DomU)
Each DomU VM runs an instance of a paravirtualized GOS, and gets VMM services
through a set of hypercalls. Access to I/O devices from each DomU VM is provided by
drivers in Dom0.
[Figure 16 labels: Dom 0 (guest applications, domain manager, console; guest OS) and three Dom U instances (guest applications; guest OS) run on the Sun xVM Hypervisor for x86, which provides the scheduler, hypercalls, event channel, grant tables, console interface, and XenStore, on a Sun x64 server.]
Sun xVM Hypervisor for x86 Services
The Sun xVM Hypervisor for x86, the VMM of the Sun xVM Server, provides several
communication channels between itself and overlying domains:
• Hypercalls: synchronous calls from a GOS to the VMM
• Event Channel: asynchronous notifications from the VMM to VMs
• Grant Table: shared memory communication between the VMM and VMs, and
among VMs
• XenStore: a hierarchical repository of control and status information
Each of these mechanisms is described in more detail in the following sections.
Hypercalls
The Sun xVM Server hypercalls are a set of interfaces used by a GOS to request service
from the VMM. The hypercalls are invoked in a manner similar to OS system calls: a
software interrupt is issued which vectors to an entry point within the VMM. Hypercalls
use INT $0x82 on a 32-bit system and SYSCALL on a 64-bit system, with the
particular hypercall contained in the %eax register.
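Once control reaches the VMM entry point, the call number taken from %eax selects the service routine, much as a system call number selects a kernel function. The following is a sketch of that dispatch step only. The table size, the error value, and the demo_add() handler are invented for illustration and do not correspond to real hypervisor services.

```c
#include <assert.h>
#include <stddef.h>

#define NR_HYPERCALLS 4         /* illustrative table size */
#define HCALL_ENOSYS  (-38L)    /* illustrative "no such call" result */

typedef long (*hypercall_fn)(unsigned long a1, unsigned long a2,
                             unsigned long a3, unsigned long a4);

/* Hypothetical handler standing in for a real VMM service. */
static long demo_add(unsigned long a1, unsigned long a2,
                     unsigned long a3, unsigned long a4)
{
    return (long)(a1 + a2 + a3 + a4);
}

/* Table indexed by call number; unused slots are NULL. */
static const hypercall_fn hypercall_table[NR_HYPERCALLS] = {
    [2] = demo_add,             /* the slot number is arbitrary */
};

/* Outline of the entry point: validate the call number, then call through the table. */
long dispatch_hypercall(unsigned long callnum, unsigned long a1,
                        unsigned long a2, unsigned long a3, unsigned long a4)
{
    if (callnum >= NR_HYPERCALLS || hypercall_table[callnum] == NULL)
        return HCALL_ENOSYS;
    return hypercall_table[callnum](a1, a2, a3, a4);
}
```

Bounds-checking the call number before indexing the table matters because the value comes from a less privileged guest.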
For example, the common routine for hypercalls with four arguments on a 64-bit Solaris
kernel is:
    long __hypercall4(ulong_t callnum, ulong_t a1, ulong_t a2, ulong_t a3, ulong_t a4);
The function in assembly is as follows:
    [0]> __hypercall4,7/ai
    __hypercall4:
    __hypercall4:       movl   %edi,%eax    /* %edi is the first argument */
    __hypercall4+2:     movq   %rsi,%rdi
    __hypercall4+5:     movq   %rdx,%rsi
    __hypercall4+8:     movq   %rcx,%rdx
    __hypercall4+0xb:   movq   %r8,%r10
    __hypercall4+0xe:   syscall
    __hypercall4+0x10:  ret
The calling convention is compliant with the AMD64 ABI [8].
The SYSCALL instruction is intended to enable unprivileged software (ring 3) to access
services from privileged software (ring 0). Solaris system calls also use SYSCALL to
allow user applications to access Solaris kernel services. Having SYSCALL used by both
Solaris system calls and the hypercalls means that the SYSCALL made by the user
process in Solaris is delivered indirectly by the VMM to the Solaris kernel. This causes a
slight overhead for each Solaris system call.
A complete list of Sun xVM Server hypercalls is provided in Table 4.
Table 4. Sun xVM Server hypercalls.
Privilege Operations:
    long set_trap_table(trap_info_t *table);
    long mmu_update(mmu_update_t *req, int count, int *success_count, domid_t domid);
    long set_gdt(ulong_t *frame_list, int entries);
    long stack_switch(ulong_t ss, ulong_t esp);
    long fpu_taskswitch(int set);
    long mmuext_op(struct mmuext_op *req, int count, int *success_count, domid_t domain_id);
    long update_descriptor(maddr_t ma, uint64_t desc);
    long update_va_mapping(ulong_t va, uint64_t new_pte, ulong_t flags);
    long set_timer_op(uint64_t timeout);
    long physdev_op(void *physdev_op);
    long vm_assist(uint_t cmd, uint_t type);
    long update_va_mapping_otherdomain(ulong_t va, uint64_t new_pte, ulong_t flags, domid_t domain_id);
    long iret();
    long set_segment_base(int reg, ulong_t value);
    long nmi_op(ulong_t op, void *arg);
    long hvm_op(int cmd, void *arg);
VMM Services:
    long set_callbacks(ulong_t event_address, ulong_t failsafe_address, ulong_t syscall_address);
    long grant_table_op(uint_t cmd, void *uop, uint_t count);
    long event_channel_op(void *op);
    long xen_version(int cmd, void *arg);
    long set_debugreg(int reg, ulong_t value);
    long get_debugreg(int reg);
    long multicall(void *call_list, int nr_calls);
    long console_io(int cmd, int count, char *str);
    long sched_op(int cmd, void *arg);
    long do_kexec_op(unsigned long op, int arg1, void *arg);
VM Control Operations:
    long sched_op_compat(int cmd, ulong_t arg);
    long platform_op(xen_platform_op_t *platform_op);
    long memory_op(int cmd, void *arg);
    long vcpu_op(int cmd, int vcpuid, void *extra_args);
    long sysctl(xen_sysctl_t *sysctl);
    long domctl(xen_domctl_t *domctl);
    long acm_op();
As Table 4 shows, the hypercalls provide a variety of functions for a GOS:
• Perform privileged operations such as setting the trap table, updating the page table,
loading the GDT, and setting the GS and FS segment registers
• Get services from the VMM such as using the event channel, grant table,
set_callbacks services, and scheduled operations
• Control VM operations such as platform_op, domain control, and virtual CPU control
An example use of a hypercall is to request a set of page table updates. For example, a
new process created by the fork(2) call requires the creation of page tables. The
hypercall HYPERVISOR_mmu_update(), which validates and applies a list of
updates, is called by the Solaris kernel to perform the page table updates. This routine
returns control to the calling domain when the operation is completed.
In the following example, a kmdb(1M) breakpoint is set at the mmu_update() call.
The stack trace illustrates how the mmu_update() function is called after a new
process is created by fork():
    [1]> set_pteval+0x4f:b          // set breakpoint at HYPERVISOR_mmu_update
    [1]> :c                         // continue
    kmdb: stop at set_pteval+0x4f   // the breakpoint reached
    kmdb: target stopped at:
    set_pteval+0x4f:   call   -0x5a34   <HYPERVISOR_mmu_update>
    [1]> $c                         // display the stack trace
    set_pteval+0x4f(c753000, 1fb, 3, f9c29027)
    x86pte_copy+0x73(fffffffec08115a8, fffffffec2a8a0d8, 1fb, 5)
    hat_alloc+0x228(fffffffec2fa88c0)
    as_alloc+0x99()
    as_dup+0x3f(fffffffec27b1d28, fffffffec2a11168)
    cfork+0x102(0, 1, 0)
    forksys+0x25(0, 0)
    sys_syscall32+0x13e()
    [1]>
The above example shows that the kernel doesn't maintain a copy of the page table. It
uses the mmu_update() hypercall to request the VMM to update the page table.
Event Channels
To a GOS, a VMM event is the equivalent of a hardware interrupt. Communication from
the VMM to a VM is provided through an asynchronous event mechanism, called an
event channel, which replaces the usual delivery mechanisms for device interrupts. A
VM creates an event channel to send and receive asynchronous event notifications.
Three classes of events are delivered by this event channel mechanism:
• Bi-directional inter- and intra-VM connections
A VM can bind an event-channel port to another domain or to another virtual CPU
within the VM.
• Physical interrupts
A VM with direct access to hardware (Dom0) can bind an event-channel port to a
physical interrupt source.
• Virtual interrupts
A VM can bind an event-channel port to a virtual interrupt source, such as the virtual-
timer device.
Event channels are addressed by a port. Each channel is associated with two bits of
information:
• unsigned long evtchn_pending[sizeof(unsigned long) * 8]
This value notifies the domain that there is a pending notification to be processed.
This bit is cleared by the GOS.
• unsigned long evtchn_mask[sizeof(unsigned long) * 8]
This value specifies if the event channel is masked. If this bit is clear and PENDING is
set, an asynchronous upcall will be scheduled. This bit is only updated by the GOS; it
is read-only within the VMM.
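The interaction of the pending and mask bits can be sketched as follows. This is a simplified model: the structure and function names are hypothetical, the two arrays use the dimensions quoted above, and further gating that a real implementation may apply (for example per virtual CPU) is left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define NR_WORDS      (sizeof(unsigned long) * 8)   /* array length used in the text */

/* Model of the two per-domain bitmaps, one bit per event-channel port. */
typedef struct {
    unsigned long evtchn_pending[NR_WORDS];
    unsigned long evtchn_mask[NR_WORDS];
} shared_bits_t;

static bool bit_test(const unsigned long *map, unsigned port)
{
    return (map[port / BITS_PER_LONG] >> (port % BITS_PER_LONG)) & 1UL;
}

static void bit_set(unsigned long *map, unsigned port)
{
    map[port / BITS_PER_LONG] |= 1UL << (port % BITS_PER_LONG);
}

static void bit_clear(unsigned long *map, unsigned port)
{
    map[port / BITS_PER_LONG] &= ~(1UL << (port % BITS_PER_LONG));
}

/* VMM side: mark the port pending and report whether an upcall is needed. */
bool evtchn_raise(shared_bits_t *s, unsigned port)
{
    assert(port < NR_WORDS * BITS_PER_LONG);
    bit_set(s->evtchn_pending, port);
    return !bit_test(s->evtchn_mask, port);
}

/* Guest side: mask a port so that events are recorded but not delivered. */
void evtchn_mask_port(shared_bits_t *s, unsigned port)
{
    bit_set(s->evtchn_mask, port);
}

/* Guest side: consume an event by clearing its pending bit. */
void evtchn_ack(shared_bits_t *s, unsigned port)
{
    bit_clear(s->evtchn_pending, port);
}
```

A masked port still records the event in its pending bit, so the guest can pick it up later; only the upcall is suppressed.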
Interrupts to a VM are virtualized by mapping them to event channels. These interrupts
are delivered asynchronously to the target domain using a callback supplied via the
set_callbacks hypercall. A guest OS can map these events onto its standard
interrupt dispatch mechanisms. The VMM is responsible for determining the target
domain that will handle each physical interrupt source.
“Interrupts and Exceptions” on page 49 provides a detailed discussion of how an
interrupt is handled by the VMM and delivered to a VM using an event channel.
Grant Tables
The Sun xVM Hypervisor for x86 allows sharing memory among VMs, and between the
VMM and a VM, through a grant table mechanism. Each VM makes some of its pages
available to other VMs by granting access to its pages. The grant table is a data
structure that a VM uses to expose some of its pages, specifying what permissions
other VMs have on its pages. The following example shows the information stored in a
grant table entry:
    struct grant_entry {
        /* GTF_xxx: various type and flag information. [XEN,GST] */
        uint16_t flags;
        /* The domain being granted foreign privileges. [GST] */
        domid_t domid;
        uint32_t frame;    // page frame number (PFN)
    };
The flags field stores the type and various flag information of the grant table. There
are three types of grant table entries:
• GTF_invalid: Grants no privileges.
• GTF_permit_access: Allows the domain domid to map/access the specified
frame.
• GTF_accept_transfer: Allows domid to transfer ownership of one page frame
to this guest; the VMM writes the page number to frame.
The type information acts as a capability which the grantee can use to perform
operations on the granter's memory. A grant reference also encapsulates the details of
a shared page, removing the need for a domain to know the real machine address of a
page it is sharing. This makes it possible to share memory correctly with domains
running in fully virtualized memory.
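The way a grant entry acts as a capability can be illustrated with a check that a grantee might have to pass before mapping a frame. This is a sketch under stated assumptions: the helper grant_allows_map() is hypothetical, domid_t is assumed to be a 16-bit type, and the numeric flag values, including the GTF_readonly flag that the text does not discuss, are taken from the Xen public grant table header as the author of this sketch recalls it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t domid_t;       /* assumed width for this sketch */

/* Entry types occupy the low two bits of flags; GTF_readonly is a separate flag. */
#define GTF_invalid          0
#define GTF_permit_access    1
#define GTF_accept_transfer  2
#define GTF_type_mask        3
#define GTF_readonly         (1u << 2)

struct grant_entry {
    uint16_t flags;             /* GTF_xxx type and flag information */
    domid_t  domid;             /* the domain being granted access */
    uint32_t frame;             /* page frame number */
};

/* Hypothetical helper: may domain 'who' map this entry, optionally for writing? */
bool grant_allows_map(const struct grant_entry *e, domid_t who, bool want_write)
{
    if ((e->flags & GTF_type_mask) != GTF_permit_access)
        return false;           /* invalid or transfer-only entry */
    if (e->domid != who)
        return false;           /* granted to a different domain */
    if (want_write && (e->flags & GTF_readonly))
        return false;           /* read-only grant */
    return true;
}
```

Because the grantee refers to the page only through its grant entry, it never needs to know the machine address of the shared frame.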
Device drivers in the Sun xVM Server (see “Sun xVM Server I/O Virtualization” on
page 56) use grant tables to send data between drivers of different domains, and use
event channels and callback services for asynchronous notification of data availability.
XenStore
XenStore [22] is a shared storage space used by domains to communicate and store
configuration information. XenStore is the mechanism by which control-plane activities,
including the following, occur:
• Setting up shared memory regions and event channels for use with split device
drivers
• Notifying the guest of control events (for example, balloon driver requests)
• Reporting status information from the guest (for example, performance-related
statistics)
The store is arranged as a hierarchical collection of key-value pairs. Each domain has a
directory hierarchy containing data related to its configuration. Domains are permitted
to register for notifications about changes in a subtree of the store, and to apply
changes to the store transactionally.
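Registering for notifications on a subtree implies a test of whether a changed key lies at or below the watched node. The function below is a minimal sketch of such a test on slash-separated paths. It is not the XenStore implementation, the name watch_matches() is hypothetical, and the example paths in the usage note are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * True if 'changed' is the watched node itself or lies beneath it.
 * Both arguments are slash-separated paths without a trailing slash.
 */
bool watch_matches(const char *watched, const char *changed)
{
    size_t n = strlen(watched);

    if (strncmp(watched, changed, n) != 0)
        return false;
    /* Require a component boundary so that ".../1" does not match ".../10". */
    return changed[n] == '\0' || changed[n] == '/';
}
```

With this test, a watch on /local/domain/1/device fires for a change to /local/domain/1/device/vbd/0, but a watch on /local/domain/1 does not fire for /local/domain/10/name.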
Sun xVM Server CPU Virtualization
The Sun xVM Hypervisor for x86 provides a paravirtualized environment to a VM. Full
CPU virtualization to a VM is achieved by a concerted coordination of CPU management
by the VMM, and CPU usage by the GOS within a VM.
The next sections discuss CPU virtualization employed by the Sun xVM Server for these
tasks:
• Deprivileging CPUs to run a VM
• Scheduling CPUs for VMs
• Handling and delivery of interrupts to a VM
• Providing timer services to a VM
CPU Privilege Mode
The Sun xVM Hypervisor for x86 operates at a higher privilege level than the GOS. On
32-bit x86 processors with protected mode enabled, a GOS may use rings 1, 2, and 3 as
it sees fit. The Sun xVM Server kernel uses ring 1 for its own operation and places
applications in ring 3.
On 64-bit systems, a linear address space (flat memory model) is used to create a
continuous, unsegmented address space for both the kernel and application programs.
Segmentation is disabled, and rings 1 and 2, which for practical purposes do not exist,
have the same page-level access privilege as ring 0 (see “Protected Mode” and following
sections beginning on page 22). To protect the VMM, the Sun xVM Server kernel is therefore
restricted to run in ring 3 in 64-bit mode and in ring 1 in 32-bit mode, as
seen in the definitions in segments.h:
If both kernel and user application run with the same privilege level, how does Sun
xVM Server protect the kernel from user applications? The answer is given as follows
[32]:
1. The VMM performs context switching between kernel mode and the currently
running application in user mode. The VMM tracks which mode the GOS is
running in, kernel or user.
2. The GOS maintains two top level (PML4) page tables per process, one each for
kernel and user. The GOS registers the two page tables with the VMM. The kernel
page table contains translations for both the kernel and user addresses, and the
user page table contains translations only for the user addresses. During the
context switch, the VMM switches the top level page table so the kernel addresses
are not visible to the user process. The mapping of a linear address to the paging
data structures of a 64-bit x86 processor is shown in Figure 17:
Figure 17. Linear address mapping to paging data structure for 64-bit x86 processor.
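The mapping in Figure 17 can be expressed as a few shifts and masks. The following is a minimal sketch, assuming 4 KB pages (a 12-bit offset and four 9-bit table indexes, with bits 48 through 63 sign extended from bit 47). The names la_fields, la_split(), and la_is_canonical() are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

struct la_fields {
    unsigned pml4;    /* bits 47..39: index into the PML4 table */
    unsigned pdp;     /* bits 38..30: index into the PDP table */
    unsigned pde;     /* bits 29..21: index into the page directory */
    unsigned pte;     /* bits 20..12: index into the page table */
    unsigned offset;  /* bits 11..0: byte offset within the page */
};

/* Split a 64-bit linear address into its paging indexes. */
struct la_fields
la_split(uint64_t la)
{
    struct la_fields f;

    f.pml4 = (unsigned)((la >> 39) & 0x1ff);
    f.pdp = (unsigned)((la >> 30) & 0x1ff);
    f.pde = (unsigned)((la >> 21) & 0x1ff);
    f.pte = (unsigned)((la >> 12) & 0x1ff);
    f.offset = (unsigned)(la & 0xfff);
    return (f);
}

/* An address is canonical when bits 63..48 are all copies of bit 47. */
int
la_is_canonical(uint64_t la)
{
    uint64_t top = la >> 47;    /* bits 63..47, 17 bits in total */

    return (top == 0 || top == 0x1ffff);
}
```

For example, la_split(0xFFFF880000000000) yields PML4 index 272 with all lower fields zero, so a top level page table that omits the upper PML4 entries cannot reach any address in that range.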
Switching the PML4 page tables between kernel and user mode enables a 64-bit
address space to be split into two logically separate address spaces. In this logical
separation of a 64-bit address space, the kernel can access both its address space and a
user address space while a user process can access only its own address space. The user
address space in this addressing scheme is therefore restricted to use the lower 48 bits
of the 64-bit address space. The resulting address space partition in the 64-bit Sun xVM
Server is shown as follows, in Figure 18:
% cat intel/sys/segments.h
....
#if defined(__amd64)
#define SEL_XPL 0 /* xen privilege level */
#define SEL_KPL 3 /* both kernel and user in ring 3 */
#elif defined(__i386)
#define SEL_XPL 0 /* xen privilege level */
#define SEL_KPL 1 /* kernel privilege level under xen */
#endif /* __i386 */
[Figure 17 bit fields, from bit 63 down to bit 0: Sign Extended (63-48), PML4 (47-39), PDP (38-30), PDE (29-21), PTE (20-12), Offset (11-0)]
Figure 18. Address space partitioning in the 64-bit Sun xVM Server.
As discussed previously (see “Segmented Architecture” on page 23), the processor
privilege level is set when a segment is loaded. The Solaris OS uses the GDT for user and
kernel segments. The segment index of each segment type is assigned as shown in
Table 5 on page 54.
The command kmdb(1M) can be used to examine the segment descriptor of kernel
code:
[0]> gdt0+30::print -t 'struct user_desc'    // 64-bit kernel code segment
{
    unsigned long usd_lolimit :16 = 0x7000
    unsigned long usd_lobase :16 = 0xe030
    unsigned long usd_midbase :8 = 0
    unsigned long usd_type :5 = 0xe
    unsigned long usd_dpl :2 = 0x3
    unsigned long usd_p :1 = 0x1
    unsigned long usd_hilimit :4 = 0x4
    unsigned long usd_avl :1 = 0
    unsigned long usd_long :1 = 0
    unsigned long usd_def32 :1 = 0
    unsigned long usd_gran :1 = 0x1
    unsigned long usd_hibase :8 = 0xfb
}
> gdt0+40::print -t 'struct user_desc'    // 32-bit user code segment
{
    unsigned long usd_lolimit :16 = 0xc450
    unsigned long usd_lobase :16 = 0xe030
    unsigned long usd_midbase :8 = 0xf8
    unsigned long usd_type :5 = 0xe
    unsigned long usd_dpl :2 = 0x3
    unsigned long usd_p :1 = 0x1
    unsigned long usd_hilimit :4 = 0x1
    unsigned long usd_avl :1 = 0
    unsigned long usd_long :1 = 0
    unsigned long usd_def32 :1 = 0
    unsigned long usd_gran :1 = 0x1
    unsigned long usd_hibase :8 = 0xfb
}
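[Figure 18 layout, from the top of the address space down:
Kernel (ring 3): above 0xFFFF8800 00000000
VMM (ring 0): 0xFFFF8000 00000000 to 0xFFFF8800 00000000
Reserved: 2^47 (0x7FFF FFFFFFFF) to 0xFFFF8000 00000000
User (ring 3): 0 to 2^47 (0x7FFF FFFFFFFF)]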
The descriptor privilege level (DPL) of both the kernel and 32-bit user code segments is set
to 3. At boot time, the Sun xVM Hypervisor for x86 is loaded into memory in ring 0.
After initialization, it loads the Solaris kernel to run as Dom0 in ring 3. The domain
Dom0 is permitted to use the VM control hypercall interfaces (see Table 4 on page 42),
and is responsible for hosting the application-level management software.
CPU Scheduling
The Sun xVM Hypervisor for x86 provides two schedulers for the user to choose
between: Credit and simple Earliest Deadline First (sEDF). The Credit scheduler is the
default scheduler; sEDF might be phased out and removed from the Sun xVM Server
implementation.
The Credit scheduler is a proportional fair share CPU scheduler. Each physical CPU
(PCPU) manages a queue of runnable virtual CPUs (VCPUs). This queue is sorted by
VCPU priority. A VCPU's priority can be either over or under, representing whether this
VCPU has exceeded its share of the PCPU or not.
A VCPU's share is determined by the weight assigned to the VM and the credit accumulated by
the VCPU in each accounting period.
The first equation determines the total credit of a VM and the second equation
determines the credit of a VCPU in a VM. Credit_total is a constant; Weight_total is the sum
of the weights of all domains. A VM's weight is assigned using xm(1M) (for
example, xm sched-credit -w weight). In each accounting period, a fixed amount
of credit is added to idle VCPUs and subtracted from running VCPUs.
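Both credit equations have the form (a + (b - 1)) / b, which in integer arithmetic is a division rounded up. The following sketch evaluates them under that reading; the function names and the sample values are assumptions chosen for illustration and are not taken from the scheduler source.

```c
#include <assert.h>

/* Integer division rounded up: the (a + (b - 1)) / b pattern. */
unsigned
div_round_up(unsigned a, unsigned b)
{
    assert(b != 0);
    return ((a + (b - 1)) / b);
}

/* Credit of one VM: its weighted portion of the constant total credit. */
unsigned
vm_credit(unsigned credit_total, unsigned weight_vm, unsigned weight_total)
{
    return (div_round_up(credit_total * weight_vm, weight_total));
}

/* Credit of one VCPU: the VM's credit divided evenly among its VCPUs. */
unsigned
vcpu_credit(unsigned credit_vm, unsigned total_vcpus)
{
    return (div_round_up(credit_vm, total_vcpus));
}
```

With a hypothetical total credit of 300 and two VMs of weight 256 and 512, vm_credit() returns 100 and 200 respectively. A three-VCPU VM holding 200 credits gives each VCPU 67, because the division rounds up.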
The VCPU has the priority under if the VCPU has not consumed all credits it possesses.
On each PCPU, at every scheduling decision (when a VCPU blocks, yields, completes its
time slice, or is awakened), the next VCPU to run is picked off the head of the run queue
of priority under. When a VM runs, it consumes the credits of its VCPUs. When a VCPU
uses all its allocated credits, the VCPU's priority is changed from under to over. When a
PCPU does not find a VCPU of priority under on its local run queue, it looks on other
PCPUs for one. This load balancing guarantees that each VM receives its fair share of PCPU
resources system-wide. Before a PCPU goes idle, it looks on other PCPUs to find any
runnable VCPU. This guarantees that no PCPU idles when there is runnable work in the
system.
Credit scheduler equations (the total credit of a VM, followed by the credit of each VCPU in that VM):
$$\mathrm{Credit}_{VM_i} = \frac{(\mathrm{Credit}_{total} \times \mathrm{Weight}_{VM_i}) + (\mathrm{Weight}_{total} - 1)}{\mathrm{Weight}_{total}}$$
$$\mathrm{Credit}_{VCPU_{j,i}} = \frac{\mathrm{Credit}_{VM_i} + (\mathrm{TotalVCPU}_{VM_i} - 1)}{\mathrm{TotalVCPU}_{VM_i}}$$
Earliest Deadline First (EDF) scheduling provides weighted CPU sharing by comparing
the deadline of scheduled periodic processes (or domains, in the case of Sun xVM
Server). This scheduler places domains in a priority queue. Each domain is associated
with two parameters: time requested to run, and an interval or deadline. Whenever a
scheduling event occurs, the queue is searched for the domain closest to its deadline.
This domain is then scheduled for execution next with the time requested. The EDF
scheduler gives better CPU utilization when the system is underloaded. When the
system is overloaded, the set of domains that will miss deadlines is largely
unpredictable (it is a function of the exact deadlines and time at which the overload
occurs).
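The selection step of EDF scheduling can be sketched as a search for the smallest deadline among runnable domains. The structure edf_dom and the function edf_pick() are hypothetical names for this illustration. A linear scan stands in for the priority queue, and the sketch does not model deadline renewal or overload handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct edf_dom {
    int      id;        /* domain identifier */
    uint64_t slice;     /* time requested to run in each period */
    uint64_t deadline;  /* absolute time at which the current period ends */
    int      runnable;  /* nonzero if the domain has work to do */
};

/* Return the runnable domain closest to its deadline, or NULL if none. */
struct edf_dom *
edf_pick(struct edf_dom *doms, size_t n)
{
    struct edf_dom *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (!doms[i].runnable)
            continue;
        if (best == NULL || doms[i].deadline < best->deadline)
            best = &doms[i];
    }
    return (best);
}
```

The domain returned by edf_pick() would then run for at most its slice before the next scheduling event.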
Interrupts and Exceptions
The x86 processor uses 256 vectors, each associated with an exception or interrupt.
The vector number is an index into the interrupt descriptor table (IDT). The IDT
associates each vector with a gate descriptor for the procedure that handles the
interrupt or exception. The IDT register (IDTR) contains the base address of the IDT.
When Sun xVM Server is booting up, it registers its own IDT with the VMM. During system
initialization, an early stage of Solaris boot, the Solaris kernel function
init_desctbls() is called to initialize the GDT and IDT:

void
init_desctbls(void)
{
    ....
    init_idt(&idt0[0]);
    for (vec = 0; vec < NIDT; vec++)
        xen_idt_write(&idt0[vec], vec);
    ....
}

The Solaris kernel function init_desctbls() passes each of its exception and
interrupt vectors to the VMM using the set_trap_table() hypercall:

void
xen_idt_write(gate_desc_t *sgd, uint_t vec)
{
    trap_info_t trapinfo[2];

    bzero(trapinfo, sizeof (trapinfo));
    if (xen_idt_to_trap_info(vec, sgd, &trapinfo[0]) == 0)
        return;
    if (xen_set_trap_table(trapinfo) != 0)
        panic("xen_idt_write: xen_set_trap_table() failed");
}

The set_trap_table() hypercall has one argument, trap_info, which contains
the privilege level of the GOS code segment, the code segment selector, and the
address of the handler, which is used to set the instruction pointer when the VMM
passes control back to the GOS (see following code segment). The value of
trap_info is set in the function xen_idt_to_trap_info() using the settings in
the kernel global variable array, idt0.
typedef struct trap_info {
    uint8_t vector; /* exception vector */
    uint8_t flags; /* 0-3: privilege level */
    uint16_t cs; /* code selector */
    unsigned long address; /* code offset */
} trap_info_t;

On a 64-bit system, the interrupt descriptor has the descriptor privilege level (DPL) 3,
similar to the segment descriptor:

[0]> idt0::print 'struct gate_desc'
{
    sgd_looffset = 0x4bf0
    sgd_selector = 0xe030
    sgd_ist = 0
    sgd_resv1 = 0
    sgd_type = 0xe
    sgd_dpl = 0x3
    sgd_p = 0x1
    sgd_hioffset = 0xfb84
    sgd_hi64offset = 0xffffffff
    sgd_resv2 = 0
    sgd_zero = 0
    sgd_resv3 = 0
}

When an interrupt or exception occurs, the VMM’s trap handler is invoked to handle
the interrupt or exception. If this is an exception caused by a GOS, the VMM’s trap
handler sets the pending bit (see “Event Channels” on page 43) and calls the GOS's
exception handler. Interrupts for the GOS are virtualized by mapping them to event
channels, which are delivered asynchronously to the target GOS via the callback
registered with the set_callbacks() hypercall.
In the following example, a kmdb(1M) breakpoint is set at the interrupt service routine
of the sd driver, sdintr(). The function xen_callback_handler(), the callback
function used for processing events from the VMM, is registered in the VMM by the
hypercall set_callbacks(). When an interrupt intended for sd arrives, the
hypercall HYPERVISOR_block() detects that an event is available and then invokes the
callback function:
sd`sdintr:
sd`sdintr: ec8b4855 = pushq %rbp
[0]> $c
sd`sdintr(fffffffec0670000)
mpt`mpt_intr+0xdb(fffffffec0670000, 0)
av_dispatch_autovect+0x78(1b)
dispatch_hardint+0x33(1b, 0)
switch_sp_and_call+0x13()
do_interrupt+0x9b(ffffff0001005ae0, 1)
xen_callback_handler+0x36c(ffffff0001005ae0, 1)
xen_callback+0xd9()
HYPERVISOR_sched_op+0x29(1, 0)
HYPERVISOR_block+0x11()
mach_cpu_idle+0x52()
cpu_idle+0xcc()
idle+0x10e()
thread_start+8()
[0]>

Pending events are stored in a per-domain bitmask (see “Event Channels” on page 43),
which is updated by the VMM before invoking an event-callback handler specified by the
GOS. The function xen_callback_handler() is responsible for resetting the set of
pending events and responding to the notifications in an appropriate manner. A VM
may explicitly defer event handling by setting a VMM-readable software flag; this is
analogous to disabling interrupts on a real processor.
Timer Services
“Timer Devices” on page 27 discusses several hardware timers available on x86 systems.
These hardware devices vary in their frequency reliability, granularity, counter size, and
ability to generate interrupts. The Solaris OS employs some of these timer devices for
running the OS clock and high resolution timer:
• OS system clock: The Solaris OS uses the local APIC timer on multiprocessor
systems to generate ticks for the system clock. On uniprocessor systems, the Solaris
OS uses the PIT to generate ticks for the system clock.
• High resolution timer: The Solaris OS uses the TSC timer for a high resolution
timer. The PIT counter is used to calibrate the TSC counter.
• Time-of-day clock: The time-of-day (TOD) clock is based on the RTC. Only Dom0 can
set the TOD clock. The DomU VMs don't have the permission to update the machine's
physical RTC. Therefore, any attempt by the date(1) command to set the date and
time on DomU will be quietly ignored.
In Sun xVM Server, the VMM provides the system time to each VCPU when it is
scheduled to run. The high resolution timer, gethrtime(), is still run through the
unprivileged RDTSC instruction; thus, the high resolution timer is not virtualized. The
virtualized system time relies on the current TSC to calculate the time in nanoseconds
since the VCPU was scheduled.
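The calculation described above can be sketched as follows, assuming that the VMM hands the VCPU a system time stamp together with the TSC value sampled at the same instant, and that the TSC runs at a constant known frequency. The structure vcpu_time and both function names are hypothetical; the actual interface might use a precomputed multiplier and shift rather than a division.

```c
#include <assert.h>
#include <stdint.h>

#define NANOSEC 1000000000ULL

struct vcpu_time {
    uint64_t system_time;    /* ns, supplied by the VMM at schedule time */
    uint64_t tsc_timestamp;  /* TSC value sampled at the same instant */
    uint64_t tsc_hz;         /* TSC frequency, assumed constant */
};

/* Convert a TSC tick count to nanoseconds without 64-bit overflow. */
uint64_t
tsc_delta_to_ns(uint64_t delta, uint64_t hz)
{
    assert(hz != 0);
    /* Whole seconds first, then the sub-second remainder. */
    return ((delta / hz) * NANOSEC + ((delta % hz) * NANOSEC) / hz);
}

/* System time now = time stamp at scheduling + elapsed TSC time. */
uint64_t
current_system_time(const struct vcpu_time *t, uint64_t tsc_now)
{
    return (t->system_time +
        tsc_delta_to_ns(tsc_now - t->tsc_timestamp, t->tsc_hz));
}
```

For a 2 GHz TSC, 1000 elapsed ticks correspond to 500 ns, so a VCPU scheduled at system time 5000 ns reads 5500 ns at that point.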
Sun xVM Server Memory Virtualization
Memory virtualization in Sun xVM Server deals with the following two memory
management issues:
• Physical memory sharing and partitioning
• Page table access
Physical Memory Management
Sun xVM Server introduces a distinction between machine memory and physical
memory. Machine memory refers to the entire amount of memory installed in the
machine. Physical memory is a per-VM abstraction that allows a GOS to envision its
memory as a contiguous range of physical pages starting at physical page frame
number (PFN) 0, despite the fact that the underlying machine PFN may be sparsely
allocated and in any order (see “Page Translations Virtualization” on page 14).
The VMM maintains a table of machine-to-physical memory mappings. The GOS
performs all page allocations and management based on physical memory. During
page table updates, a conversion from physical memory to machine memory is
performed before making the mmu_update() hypercall to update the page tables.
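The physical-to-machine conversion can be sketched as a table lookup. The sketch assumes 4 KB pages and a per-VM array indexed by physical page frame number; the names p2m_table and pa_to_ma_sketch() are hypothetical and do not reproduce the Solaris pa_to_ma() implementation.

```c
#include <assert.h>
#include <stdint.h>

#define PAGESHIFT   12
#define PAGEOFFSET  ((1ULL << PAGESHIFT) - 1)

/* Hypothetical per-VM table: index is the physical PFN, value is the MFN. */
struct p2m_table {
    const uint64_t *mfn;    /* machine frame number for each physical PFN */
    uint64_t        npages; /* number of physical pages owned by the VM */
};

/* Translate a guest physical address into a machine address. */
uint64_t
pa_to_ma_sketch(const struct p2m_table *t, uint64_t pa)
{
    uint64_t pfn = pa >> PAGESHIFT;

    assert(pfn < t->npages);
    /* Swap the frame number and keep the offset within the page. */
    return ((t->mfn[pfn] << PAGESHIFT) | (pa & PAGEOFFSET));
}
```

A VM whose physical pages 0, 1, and 2 are backed by machine frames 0x9a, 0x17, and 0x203 sees contiguous memory, while physical address 0x1abc actually resolves to machine address 0x17abc.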
Because VMs are created and destroyed at arbitrary times, the VMM employs
memory hotplug and ballooning schemes to optimize the memory usage in a machine.
Memory hotplug allows a GOS to dynamically add or remove physical memory to its
inventory. The memory ballooning technique allows a VMM to dynamically adjust the
usage of physical memory among VMs.
For example, consider a machine that has 8 GB of memory. Two VMs, VM-A and VM-B,
are initially created with 5 GB of memory each. Memory hotplug adds 5 GB memory to
both VMs after they are booted. The total memory committed to both VMs is greater
than the actual physical memory available. When VM-A needs more physical memory,
the memory ballooning technique increases memory pressure in VM-B by inflating the
balloon driver. This results in memory being paged out to free up the memory
consumed by VM-B, and thus more memory becoming available to VM-A.
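The effect of inflating and deflating a balloon can be sketched with simple page accounting. This is a simplified model, not the balloon driver's logic: the structure vm_mem, the two functions, and the split between backed and ballooned pages are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct vm_mem {
    uint64_t backed;    /* machine pages currently backing the VM */
    uint64_t balloon;   /* pages held by the VM's balloon driver */
};

/* Inflate: the GOS gives up pages, which return to the VMM free pool. */
uint64_t
balloon_inflate(struct vm_mem *vm, uint64_t *free_pool, uint64_t pages)
{
    if (pages > vm->backed)
        pages = vm->backed;
    vm->backed -= pages;
    vm->balloon += pages;
    *free_pool += pages;
    return (pages);
}

/* Deflate: pages from the free pool are handed back to the GOS. */
uint64_t
balloon_deflate(struct vm_mem *vm, uint64_t *free_pool, uint64_t pages)
{
    if (pages > vm->balloon)
        pages = vm->balloon;
    if (pages > *free_pool)
        pages = *free_pool;
    vm->balloon -= pages;
    vm->backed += pages;
    *free_pool -= pages;
    return (pages);
}
```

In this model, two VMs that each see 5 GB on an 8 GB machine start with 4 GB backed and 1 GB ballooned. Inflating the balloon in VM-B by 1 GB frees machine pages that a deflate in VM-A can then claim, leaving VM-A with 5 GB backed and VM-B with 3 GB.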
The GOS requests physical memory management services from the VMM through
the memory_op(cmd, ...) hypercall. The operations supported by the
memory_op() hypercall include the following:
• XENMEM_increase_reservation
• XENMEM_decrease_reservation
• XENMEM_populate_physmap
• XENMEM_maximum_ram_page
• XENMEM_current_reservation
• XENMEM_maximum_reservation
• XENMEM_machphys_mfn_list
• XENMEM_add_to_physmap
• XENMEM_translate_gpfn_list
• XENMEM_memory_map
• XENMEM_machine_memory_map
• XENMEM_set_memory_map
• XENMEM_machphys_mapping
• XENMEM_exchange
Page Translations
“Segmented Architecture” on page 23 describes two stages of address translation to
arrive at a physical address: virtual address (VA) to linear address (LA) translation using
segmentation, and LA to physical address (PA) translation using paging. Solaris x64 uses
a flat address space in which the VA and LA are equivalent, which means the base
address of the segment is 0. In Solaris 10, the Global Descriptor Table (GDT) contains the
segment descriptor for the code and data segments of both kernel and user processes,
as shown in Table 5 on page 54.
Since there is only one GDT in a system, the VMM maintains the GDT in its memory. If a
GOS wishes to use something other than the default segment mapping that the VMM
GDT provides, it must register a custom GDT with the VMM using the set_gdt()
hypercall. In the following code sample, frame_list is the physical address of the
page that contains the GDT and entries is the number of entries in the GDT.
xen_set_gdt(ulong_t *frame_list, int entries)
{
    ....
    if ((err = HYPERVISOR_set_gdt(frame_list, entries)) != 0) {
        ....
    }
    return (err);
}

The Solaris 32-bit thread library uses %gs to refer to the LWP state manipulated by the
internals of the thread library. The 64-bit thread library uses %fs to refer to the LWP
state as specified by the AMD64 ABI [8]. The 64-bit kernel still uses %gs for its CPU state
(%fs is never used in the kernel). The KernelGSbase MSR is used to store the
kernel %gs content while the processor runs the 32-bit user LWP. The privileged
instruction SWAPGS is used to restore the kernel %gs during the context switch to the
kernel context. So when the VMM performs a context switch between the guest kernel
mode and the guest user mode, it executes SWAPGS as part of the context switch (see
“CPU Privilege Mode” on page 45).
The GDT segment is given in Table 5 below:
Table 5. The GDT segment.

% cat intel/sys/segments.h
#define GDT_NULL 0 /* null */
#define GDT_B32DATA 1 /* dboot 32 bit data descriptor */
#define GDT_B32CODE 2 /* dboot 32 bit code descriptor */
#define GDT_B16CODE 3 /* bios call 16 bit code descriptor */
#define GDT_B16DATA 4 /* bios call 16 bit data descriptor */
#define GDT_B64CODE 5 /* dboot 64 bit code descriptor */
#define GDT_BGSTMP 7 /* kmdb descriptor only used in boot */

#if defined(__amd64)

#define GDT_KCODE 6 /* kernel code seg %cs */
#define GDT_KDATA 7 /* kernel data seg %ds */
#define GDT_U32CODE 8 /* 32-bit process on 64-bit kernel %cs */
#define GDT_UDATA 9 /* user data seg %ds (32 and 64 bit) */
#define GDT_UCODE 10 /* native user code seg %cs */
#define GDT_LDT 12 /* LDT for current process */
#define GDT_KTSS 14 /* kernel tss */
#define GDT_FS GDT_NULL /* kernel %fs segment selector */
#define GDT_GS GDT_NULL /* kernel %gs segment selector */
#define GDT_LWPFS 55 /* lwp private %fs segment selector (32-bit) */
#define GDT_LWPGS 56 /* lwp private %gs segment selector (32-bit) */
#define GDT_BRANDMIN 57 /* first entry in GDT for brand usage */
#define GDT_BRANDMAX 61 /* last entry in GDT for brand usage */
#define NGDT 62 /* number of entries in GDT */

Every LWP context switch requires an update to the GDT for the new LWP. The GOS uses
update_descriptor() for the task:

intel/ia32/os/desctbls.c
update_gdt_usegd(uint_t sidx, user_desc_t *udp)
{
    ....
    if (HYPERVISOR_update_descriptor(pa_to_ma(dpa), *(uint64_t *)udp))
        panic("xen_update_gdt_usegd: HYPERVISOR_update_descriptor");
}

On an x86 system, the base physical address of the page directory is contained in the
control register %cr3. In the Solaris OS, the value of %cr3 is stored in the process's
hat structure, proc->p_as->a_hat->hat_table->ht_pfn, as shown in “Paging
Architecture” on page 25. The loading of %cr3 is performed by the VMM for security
and coherency reasons.
“Page Translations Virtualization” on page 14 discusses two alternatives for updating
page tables in a virtualized environment: hypervisor calls to a read-only page table and
shadow page tables. The Sun xVM Hypervisor for x86 provides an additional alternative,
a writable page table, for the GOS to implement page translations. In the default mode
of operation, the VMM uses both read-only page tables and writable page tables to
manage page tables. The VMM allows the GOS to use a writable page table to update
the lowest level page tables (for example, the PTE). The higher levels, such as PDE, PDP,
and PML4, use a read-only page table and are updated using the hypercall
mmu_update(). Updates to higher level page tables are much less frequent compared
to the PTE page table updates.
• Read-only page table
The GOS has read-only access to page tables and uses the mmu_update() hypercall
to update page tables. As described in the previous section “Physical Memory
Management” on page 52, the GOS has a view of pseudo-physical memory, and a
translation from physical address to machine address is performed before the
mmu_update() call.
void
set_pteval(paddr_t table, uint_t index, uint_t level, x86pte_t pteval)
{
    ....
    ma = pa_to_ma(PT_INDEX_PHYSADDR(pfn_to_pa(ht->ht_pfn), entry));
    t[0].ptr = ma | MMU_NORMAL_PT_UPDATE;
    t[0].val = new;
    if (HYPERVISOR_mmu_update(t, cnt, &count, DOMID_SELF))
        panic("HYPERVISOR_mmu_update() failed");
    ....
}

• Writable page table
If a GOS attempts to write to a page table that is maintained by the VMM, this
attempt will result in a #PF fault to the VMM. In the VMM fault handling routine,
the following tasks are performed:
– Hold the lock for all further page table updates
– Disconnect the page that contains the updated page table by clearing the page
present bit of the page table entry in the parent page table
– Make the page writable by the GOS
The page will be reconnected to the paging hierarchy again automatically in a
number of situations, including when the guest modifies a different page-table page,
when the domain is preempted, and whenever the guest uses the VMM’s explicit
page-table update interfaces.
• Shadow page table
The VMM maintains an independent copy of page tables, called the shadow page
table, that is pointed to by the %cr3 register. If a page fault occurs when a GOS’s
page table is accessed, the VMM propagates changes made to the GOS’s page table
to the shadow page table. Shadow page mode can be set in the GOS by calling
dom0_op(DOM0_SHADOW_CONTROL).
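For the read-only page table case, the request handed to mmu_update() pairs the machine address of the entry with its new value. The sketch below shows one way such a request could be assembled. The structure mmu_update_req and both helper functions are hypothetical, and the command value 0 for MMU_NORMAL_PT_UPDATE is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Command code carried in the low bits of ptr (value assumed here). */
#define MMU_NORMAL_PT_UPDATE 0ULL

struct mmu_update_req {
    uint64_t ptr;   /* machine address of the entry, ORed with the command */
    uint64_t val;   /* new contents of the entry */
};

/* Machine address of entry `index` in the page-table page at `table_ma`. */
uint64_t
pte_machine_addr(uint64_t table_ma, unsigned index)
{
    assert(index < 512);    /* 512 eight-byte entries per 4 KB table */
    return (table_ma + (uint64_t)index * sizeof (uint64_t));
}

/* Build one update request; the hypercall itself is not modeled. */
struct mmu_update_req
make_pt_update(uint64_t table_ma, unsigned index, uint64_t new_pte)
{
    struct mmu_update_req r;

    /* Entries are 8-byte aligned, so the low bits are free for the command. */
    r.ptr = pte_machine_addr(table_ma, index) | MMU_NORMAL_PT_UPDATE;
    r.val = new_pte;
    return (r);
}
```

Because the VMM validates each request before applying it, the GOS never writes the page table page directly in this mode.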
In addition to creating a translation entry, the VMM also provides the mmuext_op()
hypercall for the GOS to flush, to invalidate, or to lock a page translation. For example,
it is necessary to lock the translations of a process when it is being created. The
mmuext_op() is invoked by the kernel during the fork(2) system call:
[3]> :c
kmdb: stop at xen_pin+0x3a
kmdb: target stopped at:
xen_pin+0x3a: call +0x208b1 <HYPERVISOR_mmuext_op>
[3]> $c
xen_pin+0x3a(ff2c, 3)
hat_alloc+0x285(fffffffec381b7e8)
as_alloc+0x99()
as_dup+0x3f(fffffffec381ba88, fffffffec3d0f8d0)
cfork+0x102(0, 1, 0)
forksys+0x25(0, 0)
sys_syscall32+0x13e()

Sun xVM Server I/O Virtualization
Sun xVM Server uses a split device driver architecture to provide device services to
DomU domains. The device services are provided by two co-operating drivers: the front-end
driver, which runs in a DomU, and the back-end driver, which runs in Dom0
(Figure 19). Sun xVM Server doesn't export any real devices to DomU domains. All
device access made by DomU domains must go through the back-end driver located in
Dom0.
Figure 19. The split device driver architecture employed by Sun xVM Server includes a front-end driver in DomU and a back-end driver in Dom0.
Dom0 is a special VM that has access to the real device hardware. The front-end driver
appears to a GOS in DomU as a real device. This driver receives I/O requests from
applications as usual. However, since this front-end driver does not have access to the
physical hardware of the system, it must then send requests to the back-end driver in
Dom0. The back-end driver is responsible for issuing I/O requests to the real device
hardware. When the I/O completes, the back-end notifies the front-end that the data is
ready for use; the front-end is then able to report I/O completion and unblock the I/O
call.
When the Solaris OS is initialized, devices identify themselves and are organized into
the device tree. This device tree depicts a hierarchy of nodes, with each node on the
tree representing a device. Sun xVM Server exports a complete device tree to domain
Dom0 so that it can directly access all physical devices on the system. For DomU
domains, the paravirtualized Solaris OS uses information passed to it by xm(1M) to
disable PCI bus probing and create virtual Sun xVM Server device nodes under the VMM
virtual device nexus driver, xpvd.
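[Figure 19 components: Dom0 (user level: system calls; kernel: back-end drivers and native driver) and DomU (user level: system calls; kernel: front-end drivers) run on the Sun xVM Hypervisor for x86, which provides grant tables, event channels, and Xen callbacks, on top of the Sun x64 server's x86 hardware (CPU, memory, devices).]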
Output from the prtconf(1M) command shows the device tree as exported by Sun
xVM Server to a VM in a DomU domain. As the prtconf(1M) output shows, there are
no physical devices of any kind on the device tree in DomU:
# prtconf
System Configuration: Sun Microsystems i86pc
Memory size: 2048 Megabytes
System Peripherals (Software Nodes):

i86xpv
    scsi_vhci, instance #0
    isa (driver not attached)
    xpvd, instance #0
        xencons, instance #0
        xenbus, instance #0
        domcaps, instance #0
        balloon, instance #0
        xdf, instance #0
        xnf, instance #0
    iscsi, instance #0
    pseudo, instance #0
    agpgart, instance #0
    options, instance #0
    xsvc, instance #0
    cpus (driver not attached)
        cpu, instance #0 (driver not attached)
        cpu, instance #1 (driver not attached)

A driver that provides services to other drivers is called a bus nexus driver and is shown
in the device tree hierarchy as a node with children. The nexus driver provides bus
mapping and translation services to subordinate devices in the device tree. The types of
services provided by the nexus driver include interrupt priority assignment, DMA
resource mapping, and device memory mapping. As seen in the previous
prtconf(1M) output, the xpvd driver is the root nexus driver for all Sun xVM Server
devices on DomU. An individual device driver is represented in the tree as a node with
no children. This type of node is referred to as a leaf driver. In the above example,
xenbus, domcaps, xencons, xdf, and xnf are leaf drivers.
The Sun xVM Server-related driver modules for Dom0 and DomU respectively are shown
below:
Sun xVM Server related device modules on Dom0:
xpvtod (TOD module for Xen)
xpvd (virtual device nexus driver)
xencons (virtual console driver)
privcmd (privcmd driver)
evtchn (Evtchn driver)
xenbus (virtual bus driver)
xdb (vbd backend driver)
xnb (xnb module)
xsvc (xsvc driver)
balloon (balloon driver)

Sun xVM Server related device modules on DomU:
xenbus (virtual bus driver)
xpvtod (TOD module for i86xpv)
xpvd (virtual device nexus driver)
xencons (virtual console driver)
xdf (Xen virtual block driver)
xnf (Virtual Ethernet driver)

• The xpvtod driver provides setting and getting the time-of-day for the VM. TOD
service is provided by the RTC timer. If the request to set the TOD comes from a
DomU domain, the request is silently ignored, as DomU doesn't have permission
to set the RTC timer.
• The nexus driver in Solaris provides bus mapping and translation services to
subordinate devices in the device tree. The xpvd driver is the nexus driver for all
virtual I/O drivers that don't directly access physical devices. This driver’s primary
functions are to provide interrupt mapping and to invoke the initialization routine
of its child devices.
• The xenbus driver provides a bus abstraction that drivers can use to communicate
between VMs. The bus is mainly used for configuration negotiation, leaving most
data transfer to be done via an interdomain channel composed of a grant table
and an event channel. The xenbus driver also makes the configuration data
available to the XenStore shared storage repository (see “XenStore” on page 45).
• The evtchn driver is used for receiving and demultiplexing event-channel signals
to userland.
• The balloon driver is controlled by the VMM to manage physical memory usage
by a VM. (See “Physical Memory Virtualization” on page 13 and “Physical Memory
Management” on page 52).
• The privcmd driver is used by the domain manager on Dom0 to get the VMM
service for VM management.
• The drivers xdf and xdb, the front-end and back-end block device drivers
respectively, are discussed in “Disk Driver” on page 60. The xnf and xnb drivers, the
front-end and back-end network drivers respectively, are discussed in “Network
Driver” on page 61.
Data transfer between interdomain drivers is mainly provided by the VMM grant table
and event-channel services. Most of the data transfer is handled in a similar fashion to
DMA transfer between host and device. Data is put in the grant table by the sending
VM, and notification is sent to the receiving VM through the event channel. Then, the
callback routine in the receiving VM is invoked to process the data.
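The notification half of this exchange can be sketched as a pending bitmask plus a callback that drains it. The structure evt_state, the 32-port limit, and all function names are assumptions made for this illustration; the shared data itself (the grant table side) is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPORTS 32

typedef void (*evt_handler_t)(unsigned port, void *arg);

struct evt_state {
    uint32_t      pending;          /* one bit per event-channel port */
    int           upcall_masked;    /* nonzero: delivery is deferred */
    evt_handler_t handler[NPORTS];  /* per-port driver callback */
    void         *arg[NPORTS];      /* per-port callback argument */
};

/* Sending side: mark a port pending (done through the VMM in reality). */
void
evt_notify(struct evt_state *s, unsigned port)
{
    assert(port < NPORTS);
    s->pending |= (uint32_t)1 << port;
}

/* Receiving side: reset the pending set, then run each port's handler. */
unsigned
evt_callback(struct evt_state *s)
{
    uint32_t todo;
    unsigned port, handled = 0;

    if (s->upcall_masked)
        return (0);         /* analogous to interrupts being disabled */
    todo = s->pending;
    s->pending = 0;
    for (port = 0; port < NPORTS; port++) {
        if ((todo & ((uint32_t)1 << port)) && s->handler[port] != NULL) {
            s->handler[port](port, s->arg[port]);
            handled++;
        }
    }
    return (handled);
}

/* Example handler: counts the notifications it receives. */
void
count_event(unsigned port, void *arg)
{
    (void)port;
    (*(int *)arg)++;
}
```

A front-end or back-end driver would register a handler for its port, and the handler would then read the data that the peer domain placed in the granted page.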
Disk Driver
The xdb driver, the back-end driver on Dom0, is used to provide services for block device
management. This driver receives I/O requests from DomU domains and sends them on
to the native driver. On DomU, xdf is the pseudo block driver that gets the I/O requests
from applications and sends them to the xdb driver in Dom0. The xdf driver provides
functions similar to those of the SCSI target disk driver, sd, on an unvirtualized Solaris
system.
On Solaris systems, the main interface between a file system and storage device is the
strategy(9E) driver entry point. The strategy(9E) entry point takes only one
argument, buf(9S), which is the basic data structure for block I/O transfer. The I/O
request made by a file system to the strategy(9E) entry point is called PAGEIO, as
the memory buffer for the I/O is allocated from the kernel page pool. An application
can also open the storage device as a raw device and perform read(2) and write(2)
operations directly on the raw device. Such an I/O request is called PHYSIO,
physio(9F), as the memory buffer for the I/O is allocated by the application.
In addition to the strategy(9E) driver entry point for supporting file system and raw
device access, a disk driver also supports a set of ioctl(2) operations for disk control
and management. The dkio(7I) disk control operations define a standard set of
ioctl(2) commands. Normally, support for dkio(7I) operations requires direct
access to the device. In DomU, xdf supports most ioctl(2) commands as defined in
dkio(7I) by emulating the disk control inside xdf. No communication is made by
xdf to the back-end driver for ioctl(2) operations.
The sequence of events for disk I/O data transfer is illustrated in Figure 20. The disk
control path, ioctl(2), is similar to the data path.
When a disk I/O request is issued by a DomU domain, the sequence is as follows:
1. The file system calls the xdf driver's strategy(9E) entry point as a result of a
read(2) or write(2) system call.
2. The xdf driver puts the I/O buffer, buf(9S), on the grant table. This buffer is
allocated from the DomU memory. Permission for other domain access is granted
to this memory.
3. The xdf driver notifies Dom0 of an event through the event channel.
4. The VMM event channel generates an interrupt to the xdb driver in Dom0.
5. The xdb driver in Dom0 gets the DomU I/O buffer through the grant table.
6. The xdb driver in Dom0 calls the native driver's strategy(9E) entry point.
7. The native driver performs DMA.
8. The VMM receives the device interrupt.
9. The VMM generates an event to Dom0.
10. The xdb driver's iodone() routine is called by biodone(9F).
11. The xdb driver’s iodone() routine generates an event to DomU.
12. The xdf driver in DomU receives an interrupt to free up the grant table and DMA
resources, and calls biodone(9F) to wake up anyone waiting for it.
When a disk I/O request is issued by the control domain DomO, the sequence is as
follows:
13. Block I/O requests are sent directly to the native driver.
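The DomU sequence above can be sketched as a small simulation of its two sharing mechanisms. This is a minimal sketch under stated assumptions, not the xdf/xdb implementation: the structures grant_entry, grant_table, and evtchn, and the functions grant_access(), grant_map(), grant_end_access(), evtchn_notify(), and evtchn_consume() are hypothetical names chosen for illustration, and real grant tables and event channels are managed by the VMM rather than by plain C structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GRANT_ENTRIES 8

/* Hypothetical grant entry: which domain may touch which memory frame. */
struct grant_entry {
    bool     in_use;
    uint16_t peer_domid;   /* domain that is granted access */
    uint64_t frame;        /* frame number of the I/O buffer */
    bool     readonly;
};

struct grant_table {
    struct grant_entry e[GRANT_ENTRIES];
};

/* Step 2: the front end grants a peer access to one frame.
 * Returns a grant reference, or -1 if the table is full. */
int grant_access(struct grant_table *gt, uint16_t peer, uint64_t frame,
                 bool readonly)
{
    assert(gt != NULL);
    for (int ref = 0; ref < GRANT_ENTRIES; ref++) {
        if (!gt->e[ref].in_use) {
            gt->e[ref] = (struct grant_entry){ true, peer, frame, readonly };
            return ref;
        }
    }
    return -1;
}

/* Step 5: the back end resolves a grant reference. The lookup fails
 * unless the frame was granted to the calling domain with enough rights. */
bool grant_map(const struct grant_table *gt, int ref, uint16_t domid,
               bool want_write, uint64_t *frame_out)
{
    assert(gt != NULL && frame_out != NULL);
    if (ref < 0 || ref >= GRANT_ENTRIES)
        return false;
    const struct grant_entry *g = &gt->e[ref];
    if (!g->in_use || g->peer_domid != domid)
        return false;
    if (want_write && g->readonly)
        return false;
    *frame_out = g->frame;
    return true;
}

/* Step 12: the front end revokes the grant once the I/O completes. */
void grant_end_access(struct grant_table *gt, int ref)
{
    assert(gt != NULL);
    if (ref >= 0 && ref < GRANT_ENTRIES)
        gt->e[ref].in_use = false;
}

/* Steps 3-4 and 11-12: an event channel reduced to a pending bit. */
struct evtchn {
    bool pending;
};

void evtchn_notify(struct evtchn *ch)
{
    ch->pending = true;
}

/* Returns true once per notification, then clears the pending bit. */
bool evtchn_consume(struct evtchn *ch)
{
    bool was_pending = ch->pending;
    ch->pending = false;
    return was_pending;
}
```

In this model the back end never learns a frame number unless the front end has granted it, which mirrors the permission step described in step 2.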
Figure 20. Sequence of events for an I/O request from a Sun xVM Server virtual machine.
Network Driver
The Sun xVM Server network drivers use a similar approach to the disk block driver for
handling network packets. On DomU, the pseudo network driver xnf gets the I/O
requests from the network stack and sends them to xnb on Dom0. The back-end
network driver xnb on Dom0 forwards packets sent by xnf to the native network driver.
[Figure 20 diagram: Dom0 (FS, xdb, native driver) and DomU (FS, xdf), each split into user and kernel levels, above the Sun xVM Hypervisor for x86 on a Sun x64 server; the domains are connected through the grant tables and the event channel, and the numbered arrows 1 through 13 correspond to the disk I/O steps listed above.]
The buffer management for packet receiving has more impact on network performance
than packet transmitting does. On the packet receiving end, the data is transferred via
DMA into the native driver receiving buffer on Dom0. Then, the packet is copied from
the native driver buffer to the VMM buffer. The VMM buffer is then mapped to the
DomU kernel address space without another copy of the data.
The sequence of operations for packet receiving is as follows:
1. Data is transferred via DMA into the native driver, bge, receive buffer ring.
2. The xnb driver gets a new buffer from the VMM and copies data from the bge
receive ring to the new buffer.
3. The xnb driver sends DomU an event through the event channel.
4. The xnf driver in DomU receives an interrupt.
5. The xnf driver maps an mblk(9S) to the VMM buffer and sends the mblk(9S) to
the upper stack.
Figure 21. Sequence of events for a network request from a Sun xVM Server virtual machine.
[Figure 21 diagram: dom0 and domU above the Sun xVM Hypervisor for x86 on a Sun x64 server with a network chip; the domains are connected through the grant tables and the event channel, and the numbered arrows 1 through 5 correspond to the packet receiving steps listed above. The figure also shows the following call stacks.]

dom0 call stack:
    xnb_to_peer
    xnbo`from_mac+0x1c
    mac`mac_do_rx+0x88
    mac`mac_rx+0x1b
    vnic`vnic_rx+0x59
    vnic`vnic_classifier_rx+0x6b
    mac`mac_do_rx+0x88
    mac`mac_rx+0x1b
    bge`bge_receive+0x564
    bge`bge_intr+0x182
    unix`av_dispatch_autovect+0x78
    unix`dispatch_hardint+0x33
    unix`switch_sp_and_call+0x13

domU call stack:
    xnf`xnf_process_recv+0x275
    xnf`xnf_intr+0x5e
    unix`av_dispatch_autovect+0x78
    unix`dispatch_hardint+0x33
    unix`switch_sp_and_call+0x13
63 Sun xVM Server with Hardware VM (HVM) Sun Microsystems, Inc.
Chapter 6
Sun xVM Server with Hardware VM (HVM)
Intel and AMD have independently developed extensions to the x86 architecture that
provide hardware support for virtualization. These extensions enable a VMM to provide
full virtualization to a VM, and support the running of unmodified guest operating
systems on a VM. This approach is in contrast to Sun xVM Server PV, which requires
modifications to the guest operating system.
Virtual machines that are supported by virtualization-capable processors are called
Hardware Virtual Machines (HVMs). An HVM environment includes the following
requirements:
• A processor that allows an OS with reduced privilege to execute sensitive instructions
• A memory management scheme for a VM to update its page tables without accessing
MMU hardware
• An I/O emulation scheme that enables a VM to use its native driver to access devices
through an I/O VM (see “I/O Virtualization” on page 16)
• An emulated BIOS to bootstrap the OS
The x86 processor for HVM meets the first requirement, allowing an OS with reduced
privilege to execute sensitive instructions. However, a processor alone is not enough to
provide full virtualization. The memory management, I/O emulation, and emulated
BIOS requirements necessitate enhancements in the VMM.
This chapter begins with a discussion of HVM operations that are applicable to both
Intel and AMD virtualization extensions, followed by Intel and AMD specific
enhancements for HVM.
After the introduction of processor extensions, Sun xVM Server enhancements in the
areas of BIOS emulation, memory management, and I/O virtualization for full
virtualization are discussed in detail.
Note – Intel's virtualization extension is called Virtual Machine Extensions (VMX), and is documented in the IA-32 Intel Architecture Software Developer's Manual (see [7] Volume 3B Chapters 19-23). AMD's extension is called Secure Virtual Machine (SVM), and is documented in the AMD64 Architecture Programmer's Manual Volume 2: System Programming (see [9] Chapter 15).
HVM Operations and Data Structure
The Intel and AMD extensions for HVM, though not compatible with each other, are
similar in basic concepts. Both create a special mode of operation that allows system
software running in a reduced privilege mode to execute sensitive instructions. In
addition, both implementations define state and control data structures that
enable the transition between modes of operation.
The processor for HVM has two operating modes: privileged mode and reduced
privilege mode. Processor behavior in the privileged mode is very much the same as the
processor running without the virtualization extension. Processor behavior in the
reduced privilege mode is restricted and modified to facilitate virtualization.
Table 6 summarizes the terms used by Intel and AMD for HVM. The extensions add
new instructions and an HVM control and state data structure (HVMCSDS) that the VMM
uses to manage transitions from one mode to another. The HVMCSDS is called VMCS on the
Intel processor and VMCB on the AMD processor. The VMM associates an
HVMCSDS with each VM. For a VM with multiple VCPUs, the VMM can associate an
HVMCSDS with each VCPU in the VM.
Table 6. Comparison of Intel and AMD processor support for virtualization.
After HVM is enabled, the processor operates in privileged mode. Transitions from
privileged mode to reduced privilege mode are called VM Entries. Transitions from
reduced privilege mode to privileged mode are called VM Exits. Figure 22 illustrates
entry and exit with the HVMCSDS.
                                                  Intel                AMD
Virtualization Operation                          VMX                  SVM
Privileged Mode                                   VMX Root             Host Mode
Reduced-privileged mode                           VMX non-Root         Guest Mode
HVM Control and State Data Structure (HVMCSDS)    VMCS                 VMCB
Entering non-privileged mode                      VMLAUNCH/VMRESUME    VMRUN
Exiting non-privileged mode                       Implicit             Implicit
Figure 22. Virtual machine entry and exit with hardware support on AMD and Intel processors.
VM entry is explicitly initiated by the VMM using an instruction (VMLAUNCH and
VMRESUME on Intel; VMRUN on AMD). The processor performs checks on the processor
state, VMM state, control fields, and the VM state before loading the VM state from the
HVMCSDS to launch the VM entry.
As a part of VM entry, the VMM can inject an event into the VM. The event injection
process is used to deliver virtualized external interrupts to a VM. A VM normally doesn't
get interrupts from I/O devices, because I/O devices are not exposed to VMs (with the
exception of Dom0). As will be shown in “Sun xVM Server with HVM I/O Virtualization
(QEMU)” on page 71, a VM's I/O is handled by a special domain (Dom0) that runs a
paravirtualized OS and has direct access to I/O devices. When an I/O operation
completes, Dom0 informs the VMM to send an interrupt through an hvm_op hypercall.
The VMM prepares the HVMCSDS for event injection and the VM's return instruction
pointer (RIP) is pushed on the stack.
VM exit occurs implicitly in response to certain instructions and events in a VM. The
VMM governs the conditions causing a VM exit through manipulating the control fields
in the HVMCSDS. The events that can be controlled to result in a VM exit include the
following (see [7] Chapter 20):
• External interrupts, non-maskable interrupts, and system management interrupts
• Executing certain instructions (such as RDPMC, RDTSC, or instructions that access
the CR)
• Exceptions
The exact conditions that cause a VM exit are defined in the HVMCSDS control fields.
Certain conditions may cause a VM exit for one VM but not for other VMs.
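The per-VM nature of these controls can be illustrated with a short sketch. It is a simplified model under stated assumptions, not the VMCS or VMCB layout: the structure hvmcsds_sketch, the EXIT_ON_* bit assignments, and the function causes_vm_exit() are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit assignments; the real VMCS/VMCB encodings differ. */
enum exit_ctl {
    EXIT_ON_EXT_INTR  = 1u << 0,  /* external interrupts */
    EXIT_ON_RDTSC     = 1u << 1,  /* RDTSC instruction */
    EXIT_ON_RDPMC     = 1u << 2,  /* RDPMC instruction */
    EXIT_ON_CR_ACCESS = 1u << 3   /* control register access */
};

/* Stand-in for the control fields of one VM's HVMCSDS. */
struct hvmcsds_sketch {
    uint32_t exit_controls;
};

/* Returns true if the event forces a VM exit for the VM that owns cs. */
bool causes_vm_exit(const struct hvmcsds_sketch *cs, enum exit_ctl event)
{
    assert(cs != NULL);
    return (cs->exit_controls & (uint32_t)event) != 0;
}
```

Because each VM has its own control fields, the VMM can, for example, intercept RDTSC in one VM while letting another VM execute it directly.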
VM exits behave like faults, meaning that the instruction causing the VM exit does
not execute and no processor state is updated by the instruction. The VM exit handler
in the VMM is responsible for taking appropriate actions for the VM exit. Unlike
exceptions, the VM exit handler is specified in the HVMCSDS host RIP field rather than
using the IDT:
Intel Virtualization Technology Specifics
Intel Virtualization Technology (Intel-VT), code name Vanderpool, provides the Intel virtual
machine extensions (VMX) for running unmodified guest OSes. Intel-VT has two
implementations: VT-x defines the extensions to the IA-32 Intel architecture, and VT-i defines the
extensions to the Intel Itanium architecture. This paper focuses on the Intel VT-x
implementation.
Table 6 on page 64 summarizes the terms used in Intel documents [7] for HVM. Intel
VT-x adds several new instructions to the existing IA-32 instruction set to facilitate HVM
operations (see Table 7).
In addition to the new VMX instructions and the VMCS, VT-x introduces a direct I/O architecture
for Intel-VT [28] to improve VM security, reliability, and performance through I/O
enhancements. As will be shown in “Sun xVM Server with HVM I/O Virtualization
(QEMU)” on page 71, the current I/O virtualization implementation for Sun xVM Server
with HVM, which is based on the QEMU project, is inefficient because all I/O transactions have
to go through Dom0, unreliable because the I/O virtualization layer on Dom0 becomes a
single point of failure, and insecure because a VM may access another VM's DMA memory by
manipulating the value written to an I/O port.
    static void construct_vmcs(struct vcpu *v)
    {
        ....
        /* Host CS:RIP. */
        __vmwrite(HOST_CS_SELECTOR, __HYPERVISOR_CS);
        __vmwrite(HOST_RIP, (unsigned long)vmx_asm_vmexit_handler);
        ....
    }
Table 7. Intel VT-x instructions that facilitate HVM operations.
Instruction           Description
VMLAUNCH/VMRESUME     launch/resume VM
VMCLEAR               clear VMCS
VMPTRLD/VMPTRST       load/store VMCS
VMREAD/VMWRITE        read/write VMCS
VMXON/VMXOFF          enable/disable VMX operation
VMCALL                call to the VMM
The Intel-VT direct I/O architecture specifies the following hardware capabilities to the
VMM:
• DMA remapping — This feature provides IOMMU support for I/O address translation
and caching capabilities. The IOMMU as specified in the architecture includes a page
table hierarchy similar to the processor page table, and an IOTLB for frequently
accessed I/O pages. Addresses used in the DMA transactions are allocated from
IOMMU address space, and the IOMMU hardware provides address translation from
the IOMMU address space to the system memory address space.
• I/O device assignment across VMs — This feature allows a PCI/PCI-X device that is
behind a PCI-E to PCI/PCI-X bridge or a PCI-E device to be assigned to a VM, regardless
of how the PCI bus is bound to a VM.
AMD Secure Virtual Machine Specifics
The AMD Secure Virtual Machine (SVM), code name Pacifica, is similar to Intel VT-x in
technology and design. The AMD SVM uses the instruction VMRUN to switch between a
GOS and the VMM. The instruction VMRUN takes, as a single argument, the physical
address of a 4KB-aligned page, the virtual machine control block (VMCB), which
describes a virtual machine (guest) to be executed.
In addition to functions that are equivalent to those in Intel VT-x, AMD SVM provides
features that are not available in Intel VT-x to improve HVM operations:
• Nested page table (NPT)
As an alternative to using a shadow page table for address translation (see “Shadow
Page Table” on page 69), AMD SVM uses two %cr3 registers, gCR3 and nCR3, to
point to guest page tables and nested page tables respectively. Guest page tables
map guest linear addresses to guest physical addresses. Nested page tables map
guest physical addresses to system physical addresses. Because each guest page table
entry itself holds a guest physical address, the table walker first
translates that entry’s guest physical address into a system physical address through
the nested page tables. The resulting translations from guest linear to system physical
addresses are then cached in the TLB for subsequent guest accesses.
• Tagged TLB
To avoid a TLB flush during context switch (see “Paging Architecture” on page 25),
AMD SVM provides a tagged TLB with Address Space Identifier (ASID) bits to
distinguish different address spaces. A tagged TLB allows the VMM to use shadow
page tables or multiple nested page tables for address translation during a context
switch without flushing the TLBs.
• IOMMU
The AMD64 IOMMU enables secure virtual machine guest operating system access to
selected I/O devices by providing address translation and access protection on DMA
transfers by peripheral devices. The IOMMU can be thought of as a combination and
generalization of two facilities included in the AMD64 architecture: the Graphics
Aperture Remapping Table (GART) and the Device Exclusion Vector (DEV). The GART
provides address translation of I/O device accesses to a small range of the system
physical address space, and the DEV provides a limited degree of I/O device
classification and memory protection.
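The nested page table and tagged TLB features described above can be sketched together as a toy model. This is a minimal sketch under stated assumptions, not the AMD hardware algorithm: it uses single-level arrays in place of multi-level page tables (so the nested walk of the guest page table entries themselves is not modeled), and the names guest, tlb_entry, translate(), and NO_FRAME are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)
#define NPAGES     16          /* pages per toy address space */
#define TLB_SLOTS  8
#define NO_FRAME   UINT32_MAX  /* marks an unmapped page */

/* One guest: an ASID plus two toy tables standing in for gCR3 and nCR3. */
struct guest {
    uint16_t asid;
    uint32_t gpt[NPAGES];  /* guest linear page -> guest physical frame */
    uint32_t npt[NPAGES];  /* guest physical frame -> system physical frame */
};

/* TLB entries are tagged with the ASID of the address space they belong to. */
struct tlb_entry {
    bool     valid;
    uint16_t asid;
    uint32_t vpn;
    uint32_t spfn;
};

static struct tlb_entry tlb[TLB_SLOTS];

/* Translates a guest linear address to a system physical address.
 * Returns false if either translation stage has no mapping. */
bool translate(const struct guest *g, uint32_t gla, uint64_t *spa,
               bool *tlb_hit)
{
    assert(g != NULL && spa != NULL && tlb_hit != NULL);

    uint32_t vpn = gla >> PAGE_SHIFT;
    struct tlb_entry *t = &tlb[(vpn ^ g->asid) % TLB_SLOTS];

    *tlb_hit = t->valid && t->asid == g->asid && t->vpn == vpn;
    if (!*tlb_hit) {
        /* Stage 1: guest linear -> guest physical (guest page table). */
        if (vpn >= NPAGES || g->gpt[vpn] == NO_FRAME)
            return false;
        uint32_t gpfn = g->gpt[vpn];

        /* Stage 2: guest physical -> system physical (nested page table). */
        if (gpfn >= NPAGES || g->npt[gpfn] == NO_FRAME)
            return false;

        /* Cache the combined translation, tagged with the guest's ASID. */
        *t = (struct tlb_entry){ true, g->asid, vpn, g->npt[gpfn] };
    }
    *spa = ((uint64_t)t->spfn << PAGE_SHIFT) | (gla & PAGE_MASK);
    return true;
}
```

Switching between two guests in this model requires no TLB flush: an entry cached for one ASID is simply never matched by a lookup made under another ASID.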
Sun xVM Server with HVM Architecture Overview
Sun xVM Server with HVM supports the running of unmodified operating systems in
DomU. However, Dom0 still requires a paravirtualized OS in order to provide full I/O
virtualization support for DomUs.
To support full virtualization, the Sun xVM Hypervisor for x86 has extended its
paravirtualized architecture with the following enhancements:
• A set of HVM functions (struct hvm_function_table) for processor dependent
implementation of HVM, and an hvm_op hypercall
• A shadow page table to virtualize memory management
• Device emulation based on the QEMU project for I/O virtualization
• Emulated BIOS, hvmloader, to bootstrap the GOS
These enhancements are discussed in more detail in the following sections.
Processor Dependent HVM Functions
The Sun xVM Hypervisor for x86 defines a set of foundational interfaces, struct
hvm_function_table, to abstract processor HVM specifics. The struct
hvm_function_table entries are:
The VMM uses hvm_function_table to provide a VCPU to a VM. The entry points in
hvm_function_table fall into two categories: setup and runtime. The setup entry
points are called when a VM is being created. The runtime entry points are called
before VM entry or after VM exit. Since the HVMCSDS data structure abstracts the
states and controls of a VCPU, the entry points in hvm_function_table are
primarily used to manipulate the data structure.
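The split between setup and runtime entry points can be illustrated with a cut-down dispatch table. This is a minimal sketch under stated assumptions, not the hypervisor code: the names hvm_ops_sketch, vcpu_sketch, hvm_select(), and the vmx_/svm_ stub functions are hypothetical, only two of the entry points are modeled, and the vmcs_cr3 and vmcb_cr3 members are plain variables standing in for fields of the HVMCSDS.

```c
#include <assert.h>
#include <stddef.h>

/* Toy VCPU: the guest's %cr3 plus stand-ins for the HVMCSDS fields. */
struct vcpu_sketch {
    unsigned long guest_cr3;  /* value the guest wants in %cr3 */
    unsigned long vmcs_cr3;   /* stand-in for the Intel VMCS field */
    unsigned long vmcb_cr3;   /* stand-in for the AMD VMCB field */
};

/* Cut-down, hypothetical version of the function table. */
struct hvm_ops_sketch {
    const char *name;
    int  (*vcpu_initialise)(struct vcpu_sketch *v);   /* setup */
    void (*update_guest_cr3)(struct vcpu_sketch *v);  /* runtime */
};

static int vmx_vcpu_initialise(struct vcpu_sketch *v)
{
    assert(v != NULL);
    v->vmcs_cr3 = 0;
    return 0;
}

static void vmx_update_guest_cr3(struct vcpu_sketch *v)
{
    assert(v != NULL);
    v->vmcs_cr3 = v->guest_cr3;
}

static int svm_vcpu_initialise(struct vcpu_sketch *v)
{
    assert(v != NULL);
    v->vmcb_cr3 = 0;
    return 0;
}

static void svm_update_guest_cr3(struct vcpu_sketch *v)
{
    assert(v != NULL);
    v->vmcb_cr3 = v->guest_cr3;
}

static const struct hvm_ops_sketch vmx_ops = {
    "VMX", vmx_vcpu_initialise, vmx_update_guest_cr3
};

static const struct hvm_ops_sketch svm_ops = {
    "SVM", svm_vcpu_initialise, svm_update_guest_cr3
};

enum cpu_vendor { VENDOR_INTEL, VENDOR_AMD };

/* Processor-independent code selects one table and calls through it. */
const struct hvm_ops_sketch *hvm_select(enum cpu_vendor vendor)
{
    return vendor == VENDOR_INTEL ? &vmx_ops : &svm_ops;
}
```

Callers hold only a pointer to the selected table, so the same processor-independent code path updates a VMCS on Intel and a VMCB on AMD.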
Shadow Page Table
Because the GOS is unmodified, the read-only page table scheme for page translation
as used in the Sun xVM Server is no longer applicable. The read-only page table scheme
requires the OS to make hypercalls into the VMM to update page tables. To support an
unmodified OS, the shadow page table scheme becomes the only option available. In
this scheme, the shadow page table (also known as the active page table hierarchy) is
the actual page table used by the processor.
    struct hvm_function_table {
        void (*disable)(void);
        int (*vcpu_initialise)(struct vcpu *v);
        void (*vcpu_destroy)(struct vcpu *v);
        void (*store_cpu_guest_regs)(
            struct vcpu *v, struct cpu_user_regs *r, unsigned long *crs);
        void (*load_cpu_guest_regs)(
            struct vcpu *v, struct cpu_user_regs *r);
        int (*paging_enabled)(struct vcpu *v);
        int (*long_mode_enabled)(struct vcpu *v);
        int (*pae_enabled)(struct vcpu *v);
        int (*guest_x86_mode)(struct vcpu *v);
        unsigned long (*get_guest_ctrl_reg)(struct vcpu *v, unsigned int num);
        unsigned long (*get_segment_base)(struct vcpu *v, enum x86_segment seg);
        void (*get_segment_register)(struct vcpu *v, enum x86_segment seg,
                                     struct segment_register *reg);
        void (*update_host_cr3)(struct vcpu *v);
        void (*update_guest_cr3)(struct vcpu *v);
        void (*stts)(struct vcpu *v);
        void (*set_tsc_offset)(struct vcpu *v, u64 offset);
        void (*inject_exception)(unsigned int trapnr, int errcode,
                                 unsigned long cr2);
        void (*init_ap_context)(struct vcpu_guest_context *ctxt,
                                int vcpuid, int trampoline_vector);
        void (*init_hypercall_page)(struct domain *d, void *hypercall_page);
    };
To support shadow page tables [29], the Sun xVM Hypervisor for x86 attempts to intercept
all updates to a guest page table, and updates both the VM's page table and the
shadow page table maintained by the VMM, keeping both page tables synchronized at
all times. This implementation results in two page faults, one due to faulting the actual
page and a second one due to page table access.
This shadow page table scheme has a significant impact on the VM performance. An
alternative such as nested page table (see “AMD Secure Virtual Machine Specifics” on
page 67) has been proposed to improve the memory virtualization performance.
Sun xVM Server Interrupt and Exception Handling for HVM
The VMM can specify processor behavior on specific exceptions and interrupts by
setting the appropriate control fields in the HVMCSDS. When a physical interrupt occurs, the
processor uses the settings in the HVMCSDS to determine whether this interrupt
results in the VM exit of a running VM. Upon VM exit, the VMM gets the interrupt vector
from the HVMCSDS, sets the control fields for event injection, and launches the VM entry of
the target VM.
The interrupt handling by the VMM is a two stage process: from physical device to the
VMM, and from a virtual device in Dom0 to the target VM. The VMM controls the IDT
for interrupts from physical devices. Each VM registers its own IDT with the VMM. When
a physical interrupt arrives, the VMM delivers the interrupt to a virtual device in Dom0.
The virtual device then generates a virtual interrupt to a VM.
A virtual interrupt is delivered to a VM through event injection by setting the VM entry
control field in the HVMCSDS for event injection. The VMM uses the
inject_exception entry point in hvm_function_table (see “Processor
Dependent HVM Functions” on page 69) to set the HVMCSDS event injection control
field. The event is delivered when the VM is entered.
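As an illustration of what setting the event injection control field involves, the following sketch composes a value in the style of the Intel VM-entry interruption-information field (vector in bits 7:0, type in bits 10:8, error code delivery in bit 11, valid in bit 31, as described in the Intel manual [7]). It is a sketch under stated assumptions rather than the hypervisor's inject_exception code: the function make_entry_intr_info() and the enum names are hypothetical, and only three of the event types are listed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Event types used in bits 10:8 of the interruption-information field. */
enum intr_type {
    INTR_TYPE_EXT_INTR     = 0,  /* external interrupt */
    INTR_TYPE_NMI          = 2,  /* non-maskable interrupt */
    INTR_TYPE_HW_EXCEPTION = 3   /* hardware exception */
};

#define INTR_INFO_VALID        (1u << 31)
#define INTR_INFO_DELIVER_CODE (1u << 11)

/* Builds the 32-bit field that asks the processor to inject an event
 * on the next VM entry. */
uint32_t make_entry_intr_info(uint8_t vector, enum intr_type type,
                              bool deliver_error_code)
{
    uint32_t info = INTR_INFO_VALID | ((uint32_t)type << 8) | vector;

    if (deliver_error_code) {
        /* Only hardware exceptions carry an error code. */
        assert(type == INTR_TYPE_HW_EXCEPTION);
        info |= INTR_INFO_DELIVER_CODE;
    }
    return info;
}
```

For example, a virtual device interrupt on vector 0x30 yields 0x80000030, while a page fault (vector 14) with an error code yields 0x80000B0E.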
Emulated BIOS
The PC BIOS provides hardware initialization, boot services, and runtime services to the
OS. There are some restrictions on VMX operation: an OS in HVM cannot operate in real
mode. Unlike a paravirtualized OS, which can change its bring-up sequence for an
environment without a BIOS, an unmodified OS requires an emulated BIOS to perform
some real mode operations before control is passed to the OS. Sun xVM Server includes
a BIOS emulator, hvmloader, as a surrogate for a real BIOS.
The hvmloader BIOS emulation contains three components: ROMBIOS, VGABIOS,
and VMXAssist. Both ROMBIOS and VGABIOS are based on the open source Bochs BIOS
[23]. The VMXAssist component is included in hvmloader to emulate real mode,
which is required by hvmloader and bootstrap loaders. The hvmloader BIOS
emulator is bootstrapped as any other 32-bit OS. After it is loaded, hvmloader copies
its three components to pre-assigned addresses (VGABIOS at C000:0000,
VMXAssist at D000:0000, and ROMBIOS at F000:0000) and transfers control to
VMXAssist.
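The pre-assigned addresses above are written in real-mode segment:offset notation, where the linear address is the segment value multiplied by 16 plus the offset. The following sketch shows the arithmetic; the table layout and the names bios_component, real_mode_linear(), and component_base() are hypothetical and are not taken from hvmloader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Real-mode address translation: linear = segment * 16 + offset. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* The three components and their segment:offset load addresses. */
struct bios_component {
    const char *name;
    uint16_t    segment;
    uint16_t    offset;
};

static const struct bios_component layout[] = {
    { "VGABIOS",   0xC000, 0x0000 },
    { "VMXAssist", 0xD000, 0x0000 },
    { "ROMBIOS",   0xF000, 0x0000 },
};

/* Returns the linear load address of the i-th component. */
uint32_t component_base(size_t i)
{
    assert(i < sizeof layout / sizeof layout[0]);
    return real_mode_linear(layout[i].segment, layout[i].offset);
}
```

With these values VGABIOS lands at linear address 0xC0000, VMXAssist at 0xD0000, and ROMBIOS at 0xF0000.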
The hvmloader BIOS emulator does not directly interface with physical devices. It
communicates with virtual devices as discussed in the following section “Sun xVM
Server with HVM I/O Virtualization (QEMU)”.
Sun xVM Server with HVM I/O Virtualization (QEMU)
Sun xVM Server I/O virtualization on an HVM-enabled environment is based on the
open source QEMU project [24]. QEMU is a machine emulator that uses dynamic binary
translation to run an unmodified OS and its applications in a virtual machine. QEMU
includes several components: CPU emulators, emulated devices, generic devices,
machine descriptions, user interface, and a debugger. The emulated devices and
generic devices in QEMU make up its device models for I/O virtualization. Sun xVM
Server uses QEMU's device models to provide full I/O virtualization to VMs.
For example, QEMU supports several emulated network interfaces, including ne2000,
PCNet, and Realtek 8139. The Solaris OS has the pcn driver for the PCNet NIC.
The Solaris OS running in DomU can use pcn and communicate with QEMU on a Solaris
Dom0 that has an e1000g NIC. The pcnet emulation in QEMU converts Solaris pcn
transactions to a generic virtual network interface (such as TAP), which forwards the
packet to the driver for the native network interface (such as e1000g).
QEMU I/O emulation is illustrated in Figure 23. The principle of operation for sending
out an I/O request is outlined as follows:
1. An OS interfaces with a device through I/O ports and/or memory-mapped device
memory. The device performs certain operations, such as DMA, in response to I/O
port/memory access by the OS. At the completion of the operation, the device
generates an interrupt to notify the OS (Steps 1 and 2 in Figure 23).
2. The VMM monitors and intercepts the device I/O ports and memory accesses
(Step 3 in Figure 23).
3. The VMM forwards the I/O port/memory data to an I/O virtualization layer such as
QEMU (Step 4 in Figure 23).
4. QEMU decodes the I/O port/memory data and performs necessary emulation for
the I/O request (Step 5 in Figure 23).
5. QEMU delivers the emulated I/O request to the OS native device interface (Steps 6
and 7 in Figure 23).
Figure 23. I/O emulation in Sun xVM Server using QEMU for dynamic binary translation.
Using the AMD PCNet LANCE PCI Ethernet controller as an example, the vendor ID and
device ID of the PCNet chip are 1022 and 2000, respectively. From prtconf(1M)
output, the PCI registers exported by the device are:
According to IEEE1275 OpenBoot Firmware [25], the reg property is generated by
reading the base address registers in the configuration address space. Each entry in the
reg property format consists of one 32-bit cell for register configuration, a 64-bit
address cell, and a 64-bit size cell [26]. As the prtconf(1M) output shows, the PCNet
chip has a 128-byte (0x00000080) register region in the I/O address space (01 in the first byte
of 0x01008810 denotes I/O address space). QEMU emulation for PCNet simply
monitors the Solaris driver's accesses to the 128-byte register region made with x86 IN/OUT
instructions.
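The decoding of a reg entry described above can be sketched in a few lines. The bit positions follow the phys.hi cell layout of the PCI bus binding to IEEE 1275 as the author of this sketch understands it (n in bit 31, address space code in bits 25:24, bus in bits 23:16, device in bits 15:11, function in bits 10:8, register in bits 7:0); the names pci_phys_hi, decode_phys_hi(), and reg_entry_size() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Address space codes carried in bits 25:24 of the first cell. */
enum pci_space {
    PCI_SPACE_CONFIG = 0,
    PCI_SPACE_IO     = 1,
    PCI_SPACE_MEM32  = 2,
    PCI_SPACE_MEM64  = 3
};

struct pci_phys_hi {
    bool           non_relocatable;  /* bit 31 */
    enum pci_space space;            /* bits 25:24 */
    uint8_t        bus;              /* bits 23:16 */
    uint8_t        device;           /* bits 15:11 */
    uint8_t        function;         /* bits 10:8 */
    uint8_t        reg;              /* bits 7:0, configuration register */
};

/* Splits the first 32-bit cell of a reg entry into its fields. */
struct pci_phys_hi decode_phys_hi(uint32_t cell)
{
    struct pci_phys_hi d;

    d.non_relocatable = ((cell >> 31) & 0x1) != 0;
    d.space    = (enum pci_space)((cell >> 24) & 0x3);
    d.bus      = (uint8_t)((cell >> 16) & 0xff);
    d.device   = (uint8_t)((cell >> 11) & 0x1f);
    d.function = (uint8_t)((cell >> 8) & 0x7);
    d.reg      = (uint8_t)(cell & 0xff);
    return d;
}

/* Each reg entry has five 32-bit cells; the last two hold the 64-bit size. */
uint64_t reg_entry_size(const uint32_t entry[5])
{
    assert(entry != NULL);
    return ((uint64_t)entry[3] << 32) | entry[4];
}
```

Applied to the second reg entry in the prtconf(1M) output, the first cell 0x01008810 decodes to I/O space, device 17, function 0, register 0x10, and the size cells give 128 bytes.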
    % prtconf -v
    ....
        pci1022,2000, instance #0
            Hardware properties:
                name='assigned-addresses' type=int items=5
                    value=81008810.00000000.00001400.00000000.00000080
                name='reg' type=int items=10
                    value=00008800.00000000.00000000.00000000.00000000.01008810.00000000.00000000.00000000.00000080
    ....
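[Figure 23 diagram: DomU (user application, kernel pcn driver) and Dom0 (user-level qemu-dm/ioemu/pcnet and socket(3c), kernel TAP and native NIC driver) above the Sun xVM Hypervisor for x86 (VM exit handler, hvm hypercall, event channel) and the x86 hardware (NIC, I/O port and device memory); numbered dots 1 through 11 mark the sequence of events described in the text.]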
The QEMU virtualization for transmitting and receiving a packet using the PCNet
emulation is illustrated in Figure 23 on page 72. The sequence of events
corresponding to the numbered dots in the figure is described below:
1. Applications make an I/O request to the driver through system calls.
2. The pcn driver writes to the DMA descriptor using the OUT instruction. In pcn,
pcn_send() calls pcn_OutCSR() to start the DMA transaction. Then,
pcn_OutCSR() calls ddi_put16() to write a value to an I/O address. Next,
ddi_put16() checks whether the mapping (io_handle) is for I/O space or
memory space. If the mapping is for the I/O space, it moves its third argument to
%rax and port ID to %rdx, and issues the OUTW instruction to the port referenced
by %dx.
The OUT instruction causes a VM exit. The CPU is set up by the VMM to take an
unconditional VM exit if the VM executes IN/OUT/INS/OUTS, as shown in the
setting of the CPU_BASED_UNCOND_IO_EXITING bit in the processor-based
VM-execution controls (see Table 20-6 in [7]).
    pcn_send()
    {
        ....
        pcn_OutCSR(pcnp, CSR0, CSR0_INEA | CSR0_TDMD);
        ...
    }

    static void
    pcn_OutCSR(struct pcninstance *pcnp, uintptr_t reg, ushort_t value)
    {
        ddi_put16(pcnp->io_handle, REG16(pcnp->io_reg, PCN_IO_RAP), reg);
        ddi_put16(pcnp->io_handle, REG16(pcnp->io_reg, PCN_IO_RDP), value);
    }

    ENTRY(ddi_put16)
        movl    ACC_ATTR(%rdi), %ecx
        cmpl    $_CONST(DDI_ACCATTR_IO_SPACE|DDI_ACCATTR_DIRECT), %ecx
        jne     8f
        movq    %rdx, %rax
        movq    %rsi, %rdx
        outw    (%dx)
        ret
The VM exit handler is set in the host RIP field in the HVMCSDS (see “HVM
Operations and Data Structure” on page 64). The VM exit handler examines the
exit reason and calls the I/O instruction function, vmx_io_instruction(), to
handle the VM exit.
3. The VM exit handler for I/O instructions in the VMM examines the exit
qualification, and gets the OUT information from the HVMCSDS. This information
includes:
– Size of the access (1, 2, or 4 bytes)
– Direction of the access (IN or OUT)
– Port number
– State of the direction flag (df), which applies to I/O string operations
– Size and address of string buffer if this is an I/O string operation
    #define MONITOR_CPU_BASED_EXEC_CONTROLS \
        ( MONITOR_CPU_BASED_EXEC_CONTROLS_SUBARCH | \
          CPU_BASED_HLT_EXITING | \
          CPU_BASED_INVDPG_EXITING | \
          CPU_BASED_MWAIT_EXITING | \
          CPU_BASED_MOV_DR_EXITING | \
          CPU_BASED_UNCOND_IO_EXITING | \
          CPU_BASED_USE_TSC_OFFSETING )

    void vmx_init_vmcs_config(void)
    {
        ....
        _vmx_vmexit_control = adjust_vmx_controls(MONITOR_VM_EXIT_CONTROLS,
            MSR_IA32_VMX_EXIT_CTLS_MSR);
        ....
    }
    asmlinkage void vmx_vmexit_handler(struct cpu_user_regs *regs)
    {
        ....
        case EXIT_REASON_IO_INSTRUCTION:
            exit_qualification = __vmread(EXIT_QUALIFICATION);
            inst_len = __get_instruction_length();
            vmx_io_instruction(exit_qualification, inst_len);
            break;
        ....
    }
The VM exit handler then fills in struct ioreq fields, and sends the I/O
request to its client by calling send_pio_req().
4. The client of the I/O request (qemu-dm) is blocked on the event channel device
node created by the evtchn module (see “Event Channels” on page 43). In the
VMM, hvm_send_assist_req() gets called by send_pio_req() to set the
event pending bit of the event channel and wake up the qemu-dm client waiting
on the event.
5. The QEMU emulator, qemu-dm, is a user process that contains the ioemu module
for I/O emulation. The ioemu module waits on one end of the event channel for
I/O requests from the VMM.
When an I/O request arrives, ioemu is unblocked and cpu_handle_ioreq()
is called to get the ioreq structure from the event channel. Based on the
information in ioreq, appropriate pcnet functions are invoked to handle the I/O request.
    static void vmx_io_instruction(unsigned long exit_qualification,
                                   unsigned long inst_len)
    {
        ....
        send_pio_req(port, count, size, addr, dir, df, 1);
        ....
    }

    void hvm_send_assist_req(struct vcpu *v)
    {
        ....
        p->state = STATE_IOREQ_READY;
        notify_via_xen_event_channel(v->arch.hvm_vcpu.xen_port);
    }

    int main_loop(void)
    {
        ....
        qemu_set_fd_handler(evtchn_fd, cpu_handle_ioreq, NULL, env);

        while (1) {
            ....
            main_loop_wait(10);
        }
        ....
    }
6. After pcnet decodes the ioreq structure, ioemu sends the packet to the TAP
network interface. The TAP network interface [27] is a virtual Ethernet network
device that provides two interfaces to applications:
– Character device —/dev/tapX
– Virtual network interface — tapX
where X is the instance number of the TAP interface. Applications can write Ethernet
frames to the /dev/tapX character interface, and the TAP driver will receive these
frames from the tapX network interface. In the same manner, a packet that the kernel
writes to the tapX network interface can be read by an application from the /dev/tapX character device node.
To continue the packet flow, pcnet_transmit() is called to send out the ioreq. In
pcnet_transmit(), qemu_send_packet() invokes tap_receive() to write
the packet to the TAP character interface, which forwards the packet to the native
driver interface.
7. The Dom0 native driver sends the packet to the network hardware. This marks the
end of transmitting a packet from DomU to the real network.
8. Dom0 receives an interrupt indicating a packet intended for DomU has arrived.
This marks the beginning of receiving a packet targeted to a DomU from the real
network. The native network driver forwards the packet through a bridge to the
TAP network interface, tapX.
9. Next, tap_send() is invoked when data is written to the file. The packet is read
from the character interface of /dev/tapX. Next, qemu_send_packet() calls
pcnet_receive() to send out the buffer.
    static void pcnet_transmit(PCNetState *s)
    {
        ....
        qemu_send_packet(s->vc, s->buffer, s->xmit_pos);
        ....
    }

    static void tap_receive(void *opaque, const uint8_t *buf, int size)
    {
        ....
        for(;;) {
            ret = write(s->fd, buf, size);
            ....
        }
    }
10. The pcnet_receive() function in ioemu copies data read from the TAP
character device to the VMM memory. The data can be either an I/O port value
from the IN instruction or a network packet. At the end of data transfer, pcnet
informs the VMM to generate an interrupt.
11. The ioemu module makes an hvm_op(set_pci_intx_level) hypercall to the
VMM to generate an interrupt to the target domain.
The VMM sets the guest HVMCSDS area to inject an event with the next VM entry. The
target VM will get an interrupt when the VMM launches a VM entry to the target
domain (see “Sun xVM Server Interrupt and Exception Handling for HVM” on page 70).
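The information that the VM exit handler extracts in step 3 can be illustrated by decoding an exit qualification value. The sketch below follows the Intel layout for I/O instruction exits as documented in [7] (access size minus one in bits 2:0, direction in bit 3, string instruction in bit 4, REP prefix in bit 5, port number in bits 31:16). It is an illustration under stated assumptions, not the vmx_io_instruction() code: the names io_exit and decode_io_exit() are hypothetical, and the direction flag and string buffer address, which come from other guest state, are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of an I/O instruction exit qualification. */
struct io_exit {
    unsigned size;       /* access size in bytes: 1, 2, or 4 */
    bool     is_in;      /* true for IN/INS, false for OUT/OUTS */
    bool     is_string;  /* true for INS/OUTS */
    bool     has_rep;    /* true if the instruction has a REP prefix */
    uint16_t port;       /* I/O port number */
};

struct io_exit decode_io_exit(uint64_t qualification)
{
    struct io_exit e;

    e.size      = (unsigned)(qualification & 0x7) + 1;
    e.is_in     = ((qualification >> 3) & 0x1) != 0;
    e.is_string = ((qualification >> 4) & 0x1) != 0;
    e.has_rep   = ((qualification >> 5) & 0x1) != 0;
    e.port      = (uint16_t)(qualification >> 16);

    /* Only 1-, 2-, and 4-byte accesses are architecturally defined. */
    assert(e.size == 1 || e.size == 2 || e.size == 4);
    return e;
}
```

For the 16-bit OUT issued by ddi_put16() to a port such as 0x1400, the qualification 0x14000001 decodes to a 2-byte, non-string OUT access.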
    static void tap_send(void *opaque)
    {
        ....
        size = read(s->fd, buf, sizeof(buf));
        if (size > 0) {
            qemu_send_packet(s->vc, buf, size);
        }
    }

    static void pcnet_receive(void *opaque, const uint8_t *buf, int size)
    {
        ....
        cpu_physical_memory_write(rbadr, src, count);
        ...
        pcnet_update_irq(s);
    }

    int xc_hvm_set_pci_intx_level(
        int xc_handle, domid_t dom,
        uint8_t domain, uint8_t bus, uint8_t device, uint8_t intx,
        unsigned int level)
    {
        ....
        hypercall.op = __HYPERVISOR_hvm_op;
        hypercall.arg[0] = HVMOP_set_pci_intx_level;
        hypercall.arg[1] = (unsigned long)&arg;
        ....
        rc = do_xen_hypercall(xc_handle, &hypercall);
        ....
    }

Sun xVM Server with HVM I/O Virtualization (PV Drivers)
As shown in the previous section, the QEMU I/O emulation used in Sun xVM Server
with HVM suffers significant performance overhead. An I/O packet has to go through
several context switches, including a switch to the user level at Dom0, to reach its
destination. One alternative for improving the performance is to use an I/O
virtualization model similar to the Sun xVM Server PV architecture (see “Sun xVM Server I/O
Virtualization” on page 56). Paravirtualized drivers (PV drivers) such as xdf and xnf are
included in the OS distribution. When a VM is created, Dom0 exports virtual I/O devices
(for example, xnf and xdf) instead of emulated I/O devices (for example, pcn and
mpt) to the GOS. PV drivers are subsequently bound to these virtual devices and used
for handling I/O. The I/O transactions follow the same path as described in Chapter 5,
“Sun xVM Server”. PV drivers will be provided for Solaris 10 and Windows so they can
run unmodified in the Sun xVM Server with better I/O performance.
79 Logical Domains Sun Microsystems, Inc.
Chapter 7
Logical Domains
The Logical Domains (LDoms) technology from Sun Microsystems allows a system's
resources, such as memory, CPUs, and I/O devices, to be allocated into logical
groupings. Multiple isolated systems, each with their own operating system, resources,
and identity within a single computer system, can then be created using these
partitioned resources.
Unlike Sun xVM Server, LDoms technology partitions a processor into multiple strands,
and assigns each strand its own hardware resources. (See “Terms and Definitions” on
page 113.) Each virtual machine, called a domain in LDoms terminology, is associated
with one or more dedicated strands. A thin layer of firmware, called the hypervisor, is
interposed between the hardware and the operating system (Figure 24). The hypervisor
abstracts the hardware resources and provides an interface to the operating system
software.
Figure 24. The hypervisor, a thin layer of firmware, abstracts hardware resources and presents them to the OS.
The LDoms implementation includes four components:
• UltraSPARC T1/T2 processor
• UltraSPARC hypervisor
• Logical Domain Manager (LDM)
• Paravirtualized Solaris OS
Note – The terms strand, hardware thread, logical processor, virtual CPU and virtual processor are used by various documents to refer to the same concept. For consistency, the term strand is used in this chapter.
[Figure 24 diagram: the hardware layer, partitioned into groups of CPU and memory resources, sits at the bottom with the hypervisor above it. Four domains run on top of the hypervisor: a control domain and Domains 1, 2, and 3. Three of the domains run Solaris 10 and one runs Linux.]
Note – In Sun documents, the term hypervisor is used to refer to the hyperprivileged software that performs the functions of the VMM and the term domain is used to refer to a VM. To accommodate Sun's terminologies, hypervisor and domain (instead of VMM and VM) are used in this chapter.
This chapter assumes a basic understanding of the UltraSPARC T1/T2 processor, which
plays a major role in the implementation of LDoms. (See Chapter 4, “SPARC Processor
Architecture” on page 29.) The remainder of the chapter is organized as follows:
• “Logical Domains (LDoms) Architecture Overview” on page 80 provides an overview of
the LDoms architecture and the other three components of LDoms: paravirtualized
Solaris, the UltraSPARC hypervisor, and the Logical Domain manager.
• “CPU Virtualization in LDoms” on page 84 discusses CPU virtualization including trap
and interrupt handling.
• “Memory Virtualization in LDoms” on page 88 discusses memory virtualization
including physical memory allocation and page translations.
• “I/O Virtualization in LDoms” on page 91 discusses I/O virtualization and describes
the operation of the disk block and network drivers.
Logical Domains (LDoms) Architecture Overview
Logical Domains (LDoms) technology supports CPU partitioning and enables multiple
OS instances to run on a single UltraSPARC T1/T2 system. The UltraSPARC T1/T2
architecture has been enhanced from the original UltraSPARC specification to
incorporate hypervisor technology that supports hardware level virtualization.
The hypervisor is delivered with the UltraSPARC T1/T2 platform, not with the OS. During
a boot, the OpenBoot PROM (OBP) loads the Solaris OS directly from the disk. After the
boot, a logical domain manager is enabled and initializes the first domain as the
control domain. From a control domain, the administrator can create, shut down,
configure, and destroy other domains. The control domain can also be configured as an
I/O domain, which has direct access to I/O devices and provides services for other
domains to access I/O devices (Figure 25).
Figure 25. A control domain, Solaris OS, and Linux guest domains running in logical domains on an UltraSPARC T1/T2 processor-powered server.
The UltraSPARC T1/T2 processor architecture is described earlier in Chapter 4, “SPARC
Processor Architecture” on page 29. In this section, the other three components of the
LDoms technology — paravirtualized Solaris OS, hypervisor, and logical domain
manager — are discussed.
Paravirtualized Solaris OS
The Solaris kernel implementation for the UltraSPARC T1/T2 hardware class (uname -m) is referred to as the Solaris sun4v architecture. In this implementation, the
Solaris OS is paravirtualized to replace operations that require hyperprivileged mode
with hypervisor calls. The Solaris OS communicates with the hypervisor through a set of
hypervisor APIs, and uses these APIs to request that the hypervisor perform
hyperprivileged operations.
Sun4v support for LDoms is a combination of partitioning the UltraSPARC T1/T2
processor into strands and virtualization of memory and I/O services. Unlike Sun xVM
Server and VMware, an LDoms domain does not share strands with other domains.
Each domain has one or more strands assigned to it, and each strand has its own
hardware resources so that it can execute instructions independently of other strands.
The virtualization of CPU functions to support CMT is implemented at the processor
rather than at the software level (that is, there is no software scheduler). A Solaris guest
OS can directly access strand-specific registers in a domain and can, for example,
perform operations such as setting an OS trap table to the trap base address register
(TBA).
[Figure 25 diagram: a Sun UltraSPARC T1/T2 server whose firmware layer contains OBP, ALOM, POST, and the hypervisor (hypervisor API and hypervisor services). Above the firmware sit a combined control domain and I/O domain, which hosts the LDM and the device drivers, and two guest domains. The domains run Solaris and Linux, and each domain's kernel reaches the hypervisor through hypercalls.]
The Solaris sun4v architecture assumes that the platform includes the hypervisor as
part of its firmware. The hypervisor runs in the hyperprivileged mode, and the Solaris
OS runs in the privileged mode of the processor. The Solaris kernel uses hypercalls to
request that the hypervisor perform hyperprivileged functions of the processor.
Like Intel's VT and AMD's Pacifica architectures, the sun4v architecture leverages CPU
support (hyperprivileged mode) for the implementation of the hypervisor. Unlike Intel's
VT and AMD's Pacifica architectures which provide a special mode of execution for the
hypervisor and thus make the hypervisor transparent to the GOS, the support for the
hypervisor in UltraSPARC T1/T2 is non-transparent to the GOS. The UltraSPARC T1/T2
processors provide a set of hypervisor APIs for the GOS to delegate the hyperprivileged
operations to the hypervisor.
Hypervisor Services
The hypervisor layer is a component of the UltraSPARC T1/T2 system's firmware. An
UltraSPARC system’s firmware consists of Open Boot PROM (OBP), Advanced Lights Out
Management (ALOM), Power-on Self Test (POST), and the hypervisor.
The hypervisor leverages the UltraSPARC T1/T2 hyperprivileged extensions to provide a
protection mechanism for running multiple guest domains on the system. The
hypervisor provides a number of services to the domains that run above it. These
services include hypervisor APIs that are the interfaces for a GOS to request hypervisor
services, and Logical Domain Channel (LDC) services which are used by virtual device
drivers for inter-domain communications.
Hypervisor API
The Sun4v hypervisor API [11] uses the Tcc instruction to cause the GOS to trap into
hyperprivileged mode, in a similar fashion to how OS system calls are implemented.
The function of the hypervisor API is equivalent to system calls in the OS that enable
user applications to request services from the OS. The Sun4v hypervisor API allows a
GOS to perform the following actions:
• Request services from the hypervisor
• Get and set CPU information through the hypervisor
The UltraSPARC Virtual Machine Specification [11] lists the complete set of services
and APIs for:
• API versioning — request and check for a version of the hypervisor APIs with which it
may be compatible
• Domain services — enable a control domain to request information about or to
affect other domains
• CPU services — control and configure a strand; includes operations such as
start/stop/suspend a strand, set/get the trap base address register, and configure the
interrupt queue
• MMU services — perform MMU related operations such as configure the TSB,
map/demap the TLB, and configure the fault status register
• Memory services — zero and flush data from cache to memory
• Interrupt services — get/set interrupt enabled, target strand, and state of the
interrupt
• Time-of-Day services — get/set time-of-day
• Console services — get/put a character to the console
• Channel Services — provide communication channels between domains (see
“Logical Domain Channel (LDC) Services” on page 83)
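The Tcc-based entry into the hypervisor can be made concrete with a little arithmetic. The sketch below assumes the SPARC V9 convention that a Tcc instruction raises trap type 0x100 plus its software trap number, and that each trap table entry holds eight 4-byte instructions (32 bytes), which is why trap table addresses in this chapter are computed as a base plus 0x20 times the trap type. The function and macro names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TT_SW_TRAP_BASE   0x100u  /* trap type of software trap number 0 */
#define TT_HTRAP_FIRST    0x180u  /* first trap type that enters hyperprivileged mode */
#define TT_HTRAP_LAST     0x1FFu  /* last trap type (trap types are 9 bits wide) */
#define TRAP_ENTRY_BYTES  32u     /* eight 4-byte instructions per entry */

/* Trap type raised by a Tcc instruction with the given software trap number. */
uint32_t tcc_trap_type(uint8_t sw_trap_number)
{
    return TT_SW_TRAP_BASE + sw_trap_number;
}

/* True when the trap type lies in the range used for hypervisor calls. */
bool is_hypervisor_trap(uint32_t tt)
{
    return tt >= TT_HTRAP_FIRST && tt <= TT_HTRAP_LAST;
}

/* Byte offset of a trap type's entry from the start of a trap table
 * (for traps taken at trap level 0). */
uint32_t trap_entry_offset(uint32_t tt)
{
    assert(tt <= TT_HTRAP_LAST);
    return tt * TRAP_ENTRY_BYTES;
}
```

Under these assumptions, software trap numbers 0x80 through 0xFF map to trap types 0x180 through 0x1FF, and the entry for trap type 0x68 sits 0xd00 bytes into a trap table.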
The following two examples, hv_mem_sync() and hv_api_set_version(),
show how hypervisor calls are implemented:

    % mdb -k
    > hv_mem_sync,6/ai
    hv_mem_sync:
    hv_mem_sync:                mov     %o2, %o4
    hv_mem_sync+4:              mov     0x32, %o5
    hv_mem_sync+8:              ta      %icc, %g0 + 0
    hv_mem_sync+0xc:            retl
    hv_mem_sync+0x10:           stx     %o1, [%o4]
    > hv_api_set_version,6/ai
    hv_api_set_version:
    hv_api_set_version:         mov     %o3, %o4
    hv_api_set_version+4:       clr     %o5
    hv_api_set_version+8:       ta      %icc, %g0 + 0x7f
    hv_api_set_version+0xc:     retl
    hv_api_set_version+0x10:    stx     %o1, [%o4]

Trap types in the range 0x180-0x1FF are used to transition from privileged mode
to hyperprivileged mode. In the two preceding examples, a TT value of 0x180 (offset
of 0) is used for hv_mem_sync(), and a TT value of 0x1FF (offset of 0x7f) is used for
hv_api_set_version().
Hypervisor calls are normally invoked during the startup of the kernel to set up strands
for the domain. Only a few hypercall functions are called during the runtime of the
kernel, including: hv_tod_set(), hv_tod_get(), hv_set_ctx0(),
hv_mmu_map_perm_addr(), hv_mmu_unmap_perm_addr(),
hv_set_ctxnon0(), and hv_mmu_set_stat_area().
Logical Domain Channel (LDC) Services
The hypervisor provides communication channels between domains. Each channel
is accessed within a domain through an endpoint. Two endpoints are connected together
to form a bi-directional point-to-point LDC.
All traffic sent to a local endpoint arrives at the corresponding endpoint at the other
end of the channel in the form of short fixed-length (64-byte) message packets. Each
endpoint is associated with one receive queue and one transmit queue. Messages from
a channel are deposited by the hypervisor at the tail of a queue, and the receiving
domain indicates receipt by moving the corresponding head pointer for the queue. To
send a packet down an LDC, a domain inserts the packet into its transmit queue, and
then uses a hypervisor API call to update the tail pointer for the transmit queue.
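The head and tail pointer protocol described above can be sketched as a small ring of fixed-size packets. This is a minimal single-threaded model, not the hypervisor's implementation: the queue depth, the structure layout, and the names ldc_queue, ldc_q_put(), and ldc_q_get() are assumptions made for illustration. Only the 64-byte packet size and the roles of the head and tail pointers come from the text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LDC_PKT_SIZE   64   /* fixed packet size stated in the text */
#define LDC_Q_ENTRIES  8    /* hypothetical queue depth */

/* One direction of a channel: a ring of fixed-size packets. */
struct ldc_queue {
    uint8_t  pkt[LDC_Q_ENTRIES][LDC_PKT_SIZE];
    unsigned head;   /* next packet the consumer will read */
    unsigned tail;   /* next free slot for the producer */
};

/* Producer side: copy a packet into the ring, then publish it by moving
 * the tail. In a real domain the tail update would be a hypervisor call. */
int ldc_q_put(struct ldc_queue *q, const void *data, size_t len)
{
    unsigned next;

    assert(q != NULL && data != NULL);
    next = (q->tail + 1) % LDC_Q_ENTRIES;
    if (len > LDC_PKT_SIZE || next == q->head)
        return -1;                       /* payload too large, or queue full */
    memset(q->pkt[q->tail], 0, LDC_PKT_SIZE);
    memcpy(q->pkt[q->tail], data, len);
    q->tail = next;
    return 0;
}

/* Consumer side: read the packet at the head, then indicate receipt by
 * advancing the head pointer. */
int ldc_q_get(struct ldc_queue *q, void *out)
{
    assert(q != NULL && out != NULL);
    if (q->head == q->tail)
        return -1;                       /* queue empty */
    memcpy(out, q->pkt[q->head], LDC_PKT_SIZE);
    q->head = (q->head + 1) % LDC_Q_ENTRIES;
    return 0;
}
```

One slot is deliberately left unused so that an empty queue (head equal to tail) can be told apart from a full one. A bi-directional channel would pair two such queues per endpoint, one for transmit and one for receive.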
In the Solaris OS, the hypervisor LDC service is used as a simulated I/O bus interface,
enabling a virtual device to communicate with a real device on the I/O domain. All
virtual devices that communicate with the I/O domain for device access are leaf
nodes on the LDC bus. For example, the virtual disk client driver, vdc, uses the LDC service to
communicate with the virtual disk server driver, vds, on the other side of the channel. Both
vdc and vds are leaf nodes on the channel bus (see “I/O Virtualization in LDoms” on
page 91).
Logical Domain Manager
The Logical Domain Manager (LDM) provides the following functionality:
• Provides a control point for managing a domain's configuration and operation
• Binds a domain to the resources of the underlying local physical machine
• Manages the integrity of the configuration in a persistent and consistent manner
The LDM is a software module that runs on a control domain (see “Logical Domains
(LDoms) Architecture Overview” on page 80). The LDM uses the LDC to communicate
with the hypervisor when binding a domain to hardware resources, and stores the
configuration in the service processor. The LDM is only required when a domain
reconfiguration operation is needed, such as during the creation, shutdown, or deletion
of a domain.
The LDM maintains two persistent databases: one for the currently defined domains,
and one for active domains. The active domain database is stored with the service
processor, and the currently defined database is held with LDM's own persistent
storage. The command line interface to the LDM is ldm(1M).
CPU Virtualization in LDoms
The hypervisor exposes strands to a domain. Each strand has its own registers and trap
queues; shares L1 caches, the TLB, and the instruction pipeline with other strands in
the same core; and shares the L2 cache with other strands in the socket. Strands on the
UltraSPARC T1 processor share the FPU with other strands, while each core in the
UltraSPARC T2 processor has its own floating-point and graphics unit (FGU). Each
domain has its own strands that are not shared with other domains. The software
threads (also known as kernel threads) are executed on the strands that are bound to
that domain. Unlike the VMM in Sun xVM Server or VMware, there is no software
scheduler in the hypervisor.
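Because strands are dedicated to a domain rather than time-shared, binding strands amounts to keeping disjoint sets. The following sketch models that bookkeeping with one 64-bit mask per domain. It is an illustration only: the limit of eight domains, the 64-strand mask width, and the names strand_map, bind_strands(), and strand_owner() are assumptions, not part of the LDoms software.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DOMAINS 8     /* hypothetical limit for this sketch */
#define MAX_STRANDS 64    /* one bit per strand in a 64-bit mask */

struct strand_map {
    uint64_t owned[MAX_DOMAINS];   /* bit i set: strand i belongs to that domain */
};

/* Set of strands already bound to any domain. */
uint64_t strands_in_use(const struct strand_map *m)
{
    uint64_t used = 0;
    int d;

    assert(m != NULL);
    for (d = 0; d < MAX_DOMAINS; d++)
        used |= m->owned[d];
    return used;
}

/* Bind the strands in 'mask' to domain 'dom'. Fails with -1 if any of them
 * is already bound, because strands are never shared between domains. */
int bind_strands(struct strand_map *m, int dom, uint64_t mask)
{
    if (dom < 0 || dom >= MAX_DOMAINS || mask == 0)
        return -1;
    if (strands_in_use(m) & mask)
        return -1;
    m->owned[dom] |= mask;
    return 0;
}

/* Domain that owns a strand, or -1 if the strand is unbound. */
int strand_owner(const struct strand_map *m, int strand)
{
    int d;

    assert(m != NULL);
    if (strand < 0 || strand >= MAX_STRANDS)
        return -1;
    for (d = 0; d < MAX_DOMAINS; d++)
        if (m->owned[d] & (UINT64_C(1) << strand))
            return d;
    return -1;
}
```

With this model there is nothing to schedule: a lookup by strand number always yields at most one domain.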
CPU virtualization in LDoms, from a software perspective, involves trap and interrupt
handling and timer services.
Trap and Interrupt Handling
Each strand has two trap tables for handling traps: the hyperprivileged trap table and
the privileged trap table. The trap table used for handling a trap depends on the
following criteria:
• Trap type (TT)
• Trap level at the time when the trap is taken
• Privilege mode at the time when the trap is taken
The UltraSPARC Architecture 2005 specification (see Table 12-4 in [2]) lists the mode in
which a trap is delivered based on a given TT and current privileged mode.
The hyperprivileged trap table and the privileged trap table are installed, respectively,
by the hypervisor and the GOS. For example, the Solaris OS installs the trap table for
sun4v in mach_cpu_startup():

    ENTRY_NP(mach_cpu_startup)
        ....
        set     trap_table, %g1
        wrpr    %g1, %tba           ! write trap_table to %tba
        ....

The hypervisor installs its trap table in start_master():

    ENTRY_NP(start_master)
        ....
        setx    htraptable, %g3, %g1
        wrhpr   %g1, %htba
        ....

Each strand has two interrupt queues: cpu_mondo and dev_mondo. The cpu_mondo
queue is used for CPU-to-CPU cross-call interrupts; the dev_mondo queue is used for I/O-to-CPU interrupts. The Solaris kernel allocates memory for each queue, and
registers these queues with the hv_cpu_qconf() hypercall. When a queue is
non-empty (that is, the queue head is not equal to the queue tail), a trap is generated to
the target CPU. The data of the interrupt received (mondo data) is stored in the queue.
The Solaris kernel function for registering the interrupt queues is
cpu_intrq_register(), as shown below:

    void
    cpu_intrq_register(struct cpu *cpu)
    {
        struct machcpu *mcpup = &cpu->cpu_m;
        uint64_t ret;

        ret = hv_cpu_qconf(INTR_CPU_Q, mcpup->cpu_q_base_pa, cpu_q_entries);
        ....
        ret = hv_cpu_qconf(INTR_DEV_Q, mcpup->dev_q_base_pa, dev_q_entries);
        ....
    }

The I/O and CPU cross-call interrupt delivery mechanism is as follows:
1. An I/O device asserts its interrupt line to generate an interrupt to the processor.
The I/O bridge chip receives the interrupt request and prepares a mondo packet to
be sent to the target processor whose CPU number is stored in the bridge chip
register by the OS. The mondo packet contains an interrupt number that uniquely
identifies the source of the interrupt.
2. The hypervisor receives an interrupt request from the hardware through the
interrupt vector trap (0x60). For example, the trap table for the T2000 firmware
has the following entries:

    ENTRY(htraptable)
        ....
        TRAP(tt0_05e, HSTICK_INTR)      /* HV: hstick match */
        TRAP(tt0_05f, NOT)              /* reserved */
        TRAP(tt0_060, VECINTR)          /* interrupt vector */
        ....

The CPU number and interrupt number are also delivered, along with the
interrupt trap. The interrupt vector trap handler, VECINTR, uses the interrupt
number to determine the source of the interrupt. If the interrupt is coming from
I/O, the trap handler uses the CPU number to find the dev_mondo queue
associated with the CPU and adds the interrupt to the tail of the dev_mondo
queue. When the head of the queue is not equal to the tail, a trap (0x7C for CPU
cross calls and 0x7D for I/O) is generated to the CPU that owns the queue.
3. Traps 0x7C and 0x7D are taken via the GOS trap table. For I/O interrupts,
dev_mondo() is the trap handler for 0x7D.
The corresponding entries in the Solaris trap table are:

    # mdb -k
    > trap_table+0x20*0x7c/ai
    0x1000f80:
    0x1000f80:      ba,a,pt   %xcc, +0xc784   <cpu_mondo>
    > trap_table+0x20*0x7d/ai
    0x1000fa0:
    0x1000fa0:      ba,a,pt   %xcc, +0xc800   <dev_mondo>

The dev_mondo() handler takes the interrupt out of the queue by
incrementing the queue head. It also finds the interrupt vector data, struct
intr_vec, from the system’s interrupt vector table. The struct intr_vec
data contains the priority interrupt level (PIL) and the driver's interrupt service
routine (ISR) for the interrupt. The dev_mondo() handler then sets the
SOFTINT register with the PIL of the interrupt.
4. Setting the SOFTINT register causes an interrupt_level_n trap, 0x41-0x4f,
to be generated, where n is the PIL of the interrupt. The GOS's trap handler
for the interrupt_level_1 interrupt, for example, is shown below:

    > trap_table+0x20*0x41,2/ai
    tt_pil1:
    tt_pil1:        ba,pt     %xcc, +0xc33c   <pil_interrupt>
    0x1000824:      mov       1, %g4

If the PIL of the interrupt is below the clock PIL, an interrupt thread is allocated
to handle the interrupt. Otherwise, the high level interrupt is handled by the
currently executing thread.
In summary, the interrupt delivery mechanism is a two-stage process. First, an
interrupt is delivered to the hypervisor as the interrupt vector trap, 0x60. Then the
interrupt is added to an interrupt queue, which causes another trap to the GOS.
LDoms Timer Service
The system time is provided by the programmable interrupt generator. Clock interrupts
are sent directly from the hardware to the domain, without being queued in the
hypervisor. Therefore, unlike Sun xVM Server domains, LDoms exhibit no “lost ticks”
issues.
The time of day (TOD) is maintained by the hypervisor on a per-domain basis. The
Solaris OS uses the tod_get() and tod_set() hypercalls to get and set the TOD,
respectively. Setting the TOD in one domain does not affect any other domain.
The high resolution timer is provided by the rdtick instruction, which reads the
counter field of the TICK register. The rdtick instruction is a privileged instruction
that can be executed by the Solaris OS without the hypervisor involvement.
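As a rough illustration of working with values read from the TICK register, the sketch below assumes the SPARC V9 layout in which bit 63 is the NPT (nonprivileged trap) flag and bits 62:0 hold the counter. The clock frequency passed to ticks_to_ns() is a hypothetical parameter; the helper names are not taken from the Solaris source.

```c
#include <assert.h>
#include <stdint.h>

#define TICK_NPT_BIT       (UINT64_C(1) << 63)   /* assumed: NPT flag in bit 63 */
#define TICK_COUNTER_MASK  (~TICK_NPT_BIT)       /* assumed: counter in bits 62:0 */

/* Extract the counter field from a raw TICK register value. */
uint64_t tick_counter(uint64_t raw)
{
    return raw & TICK_COUNTER_MASK;
}

/* Ticks elapsed between two raw reads, tolerating one wrap of the
 * 63-bit counter. */
uint64_t tick_elapsed(uint64_t start_raw, uint64_t end_raw)
{
    return (tick_counter(end_raw) - tick_counter(start_raw)) & TICK_COUNTER_MASK;
}

/* Convert a tick count to nanoseconds for a counter running at 'hz'.
 * The remainder is scaled separately to avoid overflowing 64 bits. */
uint64_t ticks_to_ns(uint64_t ticks, uint64_t hz)
{
    assert(hz != 0);
    return (ticks / hz) * UINT64_C(1000000000)
         + (ticks % hz) * UINT64_C(1000000000) / hz;
}
```

For a counter assumed to run at 1.2 GHz, 600 ticks correspond to 500 nanoseconds.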
Memory Virtualization in LDoms
Similar to Sun xVM Server, memory virtualization in LDoms deals with two memory
management issues:
• Physical memory sharing and partitioning
• Page translations
Physical Memory Allocation
The UltraSPARC T1/T2 processors support three types of memory addressing:
• Virtual Address (VA) — utilized by user programs
• Real Address (RA) — describes the underlying memory allocated to a GOS
• Physical Address (PA) — appears in the system bus for accessing physical memory
Multiple virtual address spaces within the same real address space are distinguished by
a context identifier (context ID). The context ID is included as a field in the TTE for VA to
PA translation (see “Memory Management Unit” on page 32). The GOS can create
multiple virtual address spaces, using the primary and secondary context registers to
associate a context ID with every virtual address. The GOS manages the allocation of
context IDs among the processes within the domain.
Multiple real address spaces within the same physical address space are distinguished
by a partition identifier (partition ID). The hypervisor can create multiple real address
spaces, using the partition register to associate a partition ID with every real address.
The hypervisor manages the allocation of partition IDs.
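The relationship between the three address types can be modeled as two lookups: one keyed by context ID, which the GOS manages, and one keyed by partition ID, which the hypervisor manages. The sketch below is a conceptual model only and does not reflect how the MMU or the TSB stores translations. The table layouts, the 8 KB page size, the field widths, and all names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT   13                               /* 8 KB pages, for illustration */
#define PAGE_MASK    ((UINT64_C(1) << PAGE_SHIFT) - 1)
#define NO_MAPPING   UINT64_MAX

/* GOS-managed mapping: (context ID, virtual page) -> real page. */
struct va_map { uint16_t context_id;   uint64_t vpage; uint64_t rpage; };
/* Hypervisor-managed mapping: (partition ID, real page) -> physical page. */
struct ra_map { uint8_t  partition_id; uint64_t rpage; uint64_t ppage; };

/* Translate a virtual address to a real address within one context. */
uint64_t va_to_ra(const struct va_map *t, size_t n, uint16_t ctx, uint64_t va)
{
    size_t i;

    assert(t != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (t[i].context_id == ctx && t[i].vpage == (va >> PAGE_SHIFT))
            return (t[i].rpage << PAGE_SHIFT) | (va & PAGE_MASK);
    return NO_MAPPING;
}

/* Translate a real address to a physical address within one partition. */
uint64_t ra_to_pa(const struct ra_map *t, size_t n, uint8_t part, uint64_t ra)
{
    size_t i;

    assert(t != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (t[i].partition_id == part && t[i].rpage == (ra >> PAGE_SHIFT))
            return (t[i].ppage << PAGE_SHIFT) | (ra & PAGE_MASK);
    return NO_MAPPING;
}
```

The same virtual address resolves to different real addresses under different context IDs, and the same real address resolves to different physical addresses under different partition IDs, which is what keeps processes and domains apart.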
Because of the new addressing scheme, a number of new ASIs are defined for RA and PA
addressing, as described in Table 8.
Table 8. New ASIs defined for real and physical addresses.

    ASI #   ASI Name              Description
    0x14    ASI_REAL              Real Address (memory)
    0x15    ASI_REAL_IO           Noncacheable Real Address
    0x1C    ASI_REAL_LITTLE       Real Address Little-endian
    0x1D    ASI_REAL_IO_LITTLE    Noncacheable Real Address Little-endian
    0x21    ASI_MMU_CONTEXTID     MMU context register
    0x52    ASI_MMU_REAL          MMU Register

The partition ID register is defined in ASI 0x58, VA 0x80 [2] with an 8-bit field for the
partition ID.
The full representation of each type of address is as follows:

    real_address     = context_ID :: virtual_address
    physical_address = partition_ID :: real_address

or:

    physical_address = partition_ID :: context_ID :: virtual_address

Figure 26 illustrates the type of addressing used in each mode of operation.
Figure 26. Different types of addressing are used in different modes of operation.
[Figure 26 diagram: three layers. Processes in user space run in unprivileged mode and use 64-bit virtual addressing. Kernel space in a logical domain runs in privileged mode and uses real addressing (64-bit VA + context ID). The physical system runs in hyperprivileged mode and uses physical addressing (64-bit VA + context ID + partition ID).]
Page Translations
Page translations in the UltraSPARC architecture are managed by software through
several different types of traps (see “Memory Management Unit” on page 32).
Depending on the trap type, traps may be handled by the hypervisor or the GOS. Table 9
summarizes the MMU-related trap types (see also Table 12-4 in [2]).
Table 9. MMU-related trap types in the UltraSPARC T1/T2 processor

    Trap name                           Trap cause                     TT          Handled by
    fast_instruction_access_MMU_miss    iTLB Miss                      0x64        Hypervisor
    fast_data_access_MMU_miss           dTLB Miss                      0x68        Hypervisor
    fast_data_access_protection         Protection Violation           0x6c        Hypervisor
    instruction_access_exception        Several                        0x08        Hypervisor
    data_access_exception               Several                        0x30        Hypervisor
    instruction_access_MMU_miss         iTSB Miss                      0x09        GOS
    data_access_MMU_miss                dTSB Miss                      0x31        GOS
    *mem_address_not_aligned            Misaligned memory operation    0x34-0x39   Hypervisor

In the hypervisor trap table, htraptable, the instructions for handling dTLB miss,
trap 0x68, are:

    % mdb ./ontario/release/q
    > htraptable+0x20*0x68,8/ai
    htraptable+0xd00:
    htraptable+0xd00:       rdpr    %priv_16, %g1
    htraptable+0xd04:       cmp     %g1, 3
    htraptable+0xd08:       bgu,pn  %xcc, +0x73b8   <watchdog_guest>
    htraptable+0xd0c:       mov     0x28, %g1
    htraptable+0xd10:       ba,pt   %xcc, +0x97a0   <dmmu_miss>
    htraptable+0xd14:       ldxa    [%g1] 0x4f, %g1
    htraptable+0xd18:       illtrap 0
    htraptable+0xd1c:       illtrap 0

The trap table transfers control to dmmu_miss() to load the page translation from the
TSB. If the translation doesn't exist in the TSB, dmmu_miss() calls dtsb_miss().
The handler dtsb_miss() sets the TT register to trap type 0x31
(data_access_MMU_miss), changes the PSTATE register to the privileged mode,
and transfers control to the GOS's trap handler for trap 0x31. The portion of
dtsb_miss() that performs this functionality is shown in the following example:

    > dtsb_miss,80/ai
        ....
        wrpr    %g0, 0x31, %tt      ! write 0x31 to %tt
        rdpr    %pstate, %g3        ! read %pstate to %g3
        or      %g3, 4, %g3
        wrpr    %g3, %pstate        ! write %g3 to %pstate
        rdpr    %tba, %g3           ! get privileged mode's trap
                                    ! table base address
        add     %g3, 0x620, %g3     ! set %g3 to the address of
                                    ! trap type 0x31
        ....
        jmp     %g3                 ! jump to 0x31 trap handler

In the Solaris OS, the trap handler for trap type 0x31 calls the handler
sfmmu_slow_dmmu_miss() to load the page translation from hme_blk. If no entry
is found in hme_blk for the virtual address, sfmmu_slow_dmmu_miss() calls
sfmmu_pagefault() to transfer control to Solaris's trap() handler.
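The escalation path for a data access (TLB, then TSB in the hypervisor, then hme_blk in the GOS, then a page fault) can be summarized with a toy model. The sketch below tracks only whether a translation is present at each level. The array-based tables, the tiny page count, and the refill of the faster levels after a hit are assumptions made for illustration, and the names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 16   /* tiny address space, for illustration only */

enum resolved_by { HIT_TLB, HIT_TSB, HIT_HME_BLK, PAGE_FAULT };

struct mmu_model {
    bool tlb[NPAGES];       /* translations cached in hardware */
    bool tsb[NPAGES];       /* searched by the hypervisor on trap 0x68 */
    bool hme_blk[NPAGES];   /* searched by the GOS on trap 0x31 */
};

/* Report which level resolves a data access to 'vpage'. */
enum resolved_by data_access(struct mmu_model *m, unsigned vpage)
{
    assert(vpage < NPAGES);

    if (m->tlb[vpage])
        return HIT_TLB;

    /* dTLB miss (trap 0x68): the hypervisor searches the TSB. */
    if (m->tsb[vpage]) {
        m->tlb[vpage] = true;            /* assumed refill of the TLB */
        return HIT_TSB;
    }

    /* dTSB miss: reflected to the GOS as trap 0x31, which searches hme_blk. */
    if (m->hme_blk[vpage]) {
        m->tsb[vpage] = true;            /* assumed refill of the TSB */
        m->tlb[vpage] = true;
        return HIT_HME_BLK;
    }

    /* No mapping anywhere: the GOS page fault path takes over. */
    return PAGE_FAULT;
}
```

In this model a second access to the same page is resolved by the TLB, which mirrors why only the first touch of a page pays for the trap into the hypervisor and the GOS.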
The Solaris trap table entry for trap type 0x31 and an excerpt of sfmmu_pagefault() are shown below:

    % mdb -k
    > trap_table+0x20*0x31,2/ai
    scb+0x620:
    scb+0x620:              ba,a    +0xc1b4 <sfmmu_slow_dmmu_miss>
    scb+0x624:              illtrap 0
    > sfmmu_pagefault,80/ai
        ....
    sfmmu_pagefault+0x78:   sethi   %hi(0x101d400), %g1
    sfmmu_pagefault+0x7c:   or      %g1, 0x364, %g1
    sfmmu_pagefault+0x80:   ba,pt   %xcc, -0x13f0   <sys_trap>

I/O Virtualization in LDoms
LDoms provide the ability to partition system PCI buses so that more than one domain
can directly access devices. (Currently, access by up to two domains is supported.) A
domain that has direct access to devices is called an I/O domain or service domain. A
domain that doesn't have direct access to devices uses the virtual I/O (VIO) framework
and goes through an I/O domain for access.
The device tree of a domain is determined by the OBP of that domain. The OBP device
tree of a typical non-I/O domain is shown in the following example:

    {0} ok show-devs
    /cpu@3
    /cpu@2
    /cpu@1
    /cpu@0
    /virtual-devices@100
    /virtual-memory
    /memory@m0,4000000
    /aliases
    /options
    /openprom
    /chosen
    /packages
    /virtual-devices@100/channel-devices@200
    /virtual-devices@100/console@1
    /virtual-devices@100/ncp@4
    /virtual-devices@100/channel-devices@200/disk@0
    /virtual-devices@100/channel-devices@200/network@0
    /openprom/client-services
    /packages/obp-tftp
    /packages/kbd-translator
    /packages/SUNW,asr
    /packages/dropins
    /packages/terminal-emulator
    /packages/disk-label
    /packages/deblocker
    /packages/SUNW,builtin-drivers

During the system boot, the OBP device tree information is passed to the Solaris OS and
used to create the system device nodes. Output from the following prtconf(1M)
command shows the system configuration of a typical non-I/O domain:

    # prtconf
    System Configuration: Sun Microsystems sun4v
    Memory size: 1024 Megabytes
    System Peripherals (Software Nodes):

    SUNW,Sun-Fire-T200
        scsi_vhci, instance #0
        packages (driver not attached)
            SUNW,builtin-drivers (driver not attached)
            deblocker (driver not attached)
            disk-label (driver not attached)
            terminal-emulator (driver not attached)
            dropins (driver not attached)
            SUNW,asr (driver not attached)
            kbd-translator (driver not attached)
            obp-tftp (driver not attached)
            ufs-file-system (driver not attached)
        chosen (driver not attached)
        openprom (driver not attached)
            client-services (driver not attached)
        options, instance #0
        aliases (driver not attached)
        memory (driver not attached)
        virtual-memory (driver not attached)
        virtual-devices, instance #0
            ncp (driver not attached)
            console, instance #0
            channel-devices, instance #0
                disk, instance #0
                network, instance #0
        cpu (driver not attached)
        cpu (driver not attached)
        cpu (driver not attached)
        cpu (driver not attached)
        iscsi, instance #0
        pseudo, instance #0

As this system configuration shows, no physical devices are exported to the domain.
The virtual-devices entry is the nexus node of all virtual devices. The
channel-devices entry is the bus node for the virtual devices that require communication with
the I/O domain. The disk and network entries are leaf nodes on the
channel-devices bus.
The Solaris drivers that are specific to the LDom configuration are listed below:

    LDOM drivers:
    vdc (virtual disk client 1.4)             non I/O domain only
    ldc (sun4v LDC module v1.5)
    ds (Domain Services 1.3)
    cnex (sun4v channel-devices nexus dri)
    vnex (sun4v virtual-devices nexus dri)
    dr_cpu (sun4v CPU DR 1.2)
    drctl (DR Control pseudo driver v1.1)
    qcn (sun4v console driver v1.5)
    vnet (vnet driver v1.4)                   non I/O domain only
    vds (virtual disk server v1.6)            I/O domain only
    vsw (sun4v Virtual Switch Driver 1.5)     I/O domain only

Similar to Sun xVM Server, the LDoms VIO on a non-I/O domain uses a split device driver
architecture for virtual disk and network devices. The vdc and vnet client drivers are
used in non-I/O domains. The vds and vsw server drivers are used in the I/O domain to
support the vdc and vnet drivers. The vnex nexus driver, the driver for the
virtual-devices nexus node, provides bus services to its child nodes, vnet and
vdc.
The VIO framework uses the hypervisor’s Logical Domain Channel (LDC) service for
driver communication between domains. The LDC forms bi-directional point-to-point
links between two endpoints. All traffic sent to a local endpoint arrives only at the
corresponding endpoint at the other end of the channel in the form of short
fixed-length (64-byte) message packets. Each endpoint is associated with one receive queue
and one transmit queue. Messages from a channel are deposited by the hypervisor at
the tail of a queue, and the receiving domain indicates receipt by moving the
corresponding head pointer for the queue. To send a packet down an LDC, a domain
inserts the packet into its transmit queue, and then uses a hypervisor API call to update
the tail pointer for the transmit queue.
Disk Block Driver
On non-I/O domains, the vdc client driver provides the disk interface. The vdc driver
receives I/O requests from the file system or raw device access, and sends these
requests to the hypervisor LDC. The vds server driver, located in the I/O domain, sits on
the other end of the LDC. The vds driver receives requests from the vdc driver and then
forwards these requests to the disk service to which the disk device on the client is
mapped.
The sequence of events for disk I/O is illustrated in Figure 27.
Figure 27. Sequence of events for disk I/O from a non-I/O domain to an I/O domain.
For non-I/O domains, the following events occur when applications use read(2) and
write(2) system calls to access a file:
1. The file system calls the vdc driver's strategy(9E) entry point.
2. The vdc driver sends the I/O buf, buf(9S), to the LDC. The vdc driver returns
after all data is successfully sent to the LDC.
3. The vds driver is notified by the hypervisor that messages are available on its
queue.
4. The vds driver retrieves data from the LDC and sends it to the device service that is
mapped to the client virtual disk.
5. The vds driver starts the block I/O by sending the I/O request to the native driver
and then dispatching a task queue to await I/O completion.
6. The native SCSI driver receives the device interrupt.
7. The vds driver's I/O completion is woken up by biodone(9F).
8. The vds driver sends a message to vdc indicating I/O completion.
9. The vdc driver receives the message from vds, and calls biodone(9F) to wake
up anyone waiting for it.
For I/O domains, the I/O path of data requests is simpler:
10. Block I/O requests are sent directly from the file system to the native driver.
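The numbered sequence for a non-I/O domain can be sketched as a request and completion exchange between a client and a server. The model below replaces the LDC with a pair of single-slot mailboxes and the native disk with an in-memory array, and it runs everything in one thread. The message layout and the names vd_msg, channel, vdc_strategy(), vds_service(), and vdc_complete() are hypothetical and do not reflect the real vdc and vds driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLK_SIZE 512
#define NBLOCKS  4

enum vd_op { VD_READ, VD_WRITE };

struct vd_msg {                 /* hypothetical message layout */
    bool       valid;
    enum vd_op op;
    uint32_t   blkno;
    int        status;          /* filled in by the server */
    uint8_t    data[BLK_SIZE];
};

/* Stand-in for an LDC: one slot in each direction. */
struct channel { struct vd_msg to_server, to_client; };

/* Stand-in for the native device behind the I/O domain. */
static uint8_t backing_disk[NBLOCKS][BLK_SIZE];

/* Steps 1-2: the client queues a request on the channel and returns. */
void vdc_strategy(struct channel *ch, enum vd_op op, uint32_t blkno,
                  const uint8_t *buf)
{
    struct vd_msg *m = &ch->to_server;

    assert(op == VD_READ || buf != NULL);
    m->valid = true;
    m->op = op;
    m->blkno = blkno;
    m->status = 0;
    if (op == VD_WRITE)
        memcpy(m->data, buf, BLK_SIZE);
}

/* Steps 3-8: the server picks up the request, performs the block I/O,
 * and posts a completion message for the client. */
bool vds_service(struct channel *ch)
{
    struct vd_msg *req = &ch->to_server;
    struct vd_msg *rsp = &ch->to_client;

    if (!req->valid)
        return false;                    /* nothing queued */
    *rsp = *req;
    if (req->blkno >= NBLOCKS)
        rsp->status = -1;
    else if (req->op == VD_WRITE)
        memcpy(backing_disk[req->blkno], req->data, BLK_SIZE);
    else
        memcpy(rsp->data, backing_disk[req->blkno], BLK_SIZE);
    req->valid = false;
    return true;
}

/* Step 9: the client consumes the completion. Returns the I/O status,
 * or -2 when no completion is pending. */
int vdc_complete(struct channel *ch, uint8_t *buf)
{
    struct vd_msg *rsp = &ch->to_client;

    if (!rsp->valid)
        return -2;
    if (rsp->status == 0 && rsp->op == VD_READ)
        memcpy(buf, rsp->data, BLK_SIZE);
    rsp->valid = false;
    return rsp->status;
}
```

A caller would invoke vdc_strategy(), then vds_service(), then vdc_complete(); in the real system these run in different domains and are driven by LDC notifications rather than direct calls.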
In addition to the strategy(9E) driver entry point for supporting file system and raw
device access, the vdc driver also supports most of the ioctl(2) commands
defined in dkio(7I) for disk control. The Solaris kernel variable dk_ioctl (see footnote 1) defines
the exact disk ioctl commands supported by the vdc driver.
[Figure 27 diagram: a non-I/O domain (client) and an I/O domain (server) run on an UltraSPARC T1 system (CPU, memory, devices) above the LDoms hypervisor and are connected by a Logical Domain Channel (LDC). In the client, read(2)/write(2) calls pass from user space through the file system (FS) in the kernel to vdc. In the server, vds sits on the other end of the LDC; a decision box labeled “File?” with Yes and No branches lies between vds, the server's file system, and the native driver. The numbers 1 through 10 mark the steps listed above.]
Network Driver
The Solaris LDoms network drivers include a client network driver, vnet, and a virtual
switch, vsw, on the server side. To transmit a packet, vnet sends a packet over the LDC
to vsw. The binding of vnet to vsw is defined in the vnet resource of the domain
when the domain is created. The vsw forwards the packet to the native driver, and
includes the IP address of vnet as the source address. The vnet driver returns as soon
as the packet has been put on a buffer and the buffer has been added to the tail of the
LDC queue.
When receiving packets from the network, if the native driver is configured as a virtual
switch in the vswitch resource of the domain, the packet is passed up from the native
driver to vsw. The vsw finds the MAC address associated with the destination IP
address from its ARP table. The vsw gets the target domain from the MAC address, and
gets the vnet interface from the vnet resource. The packet is then sent to the LDC of
the designated vnet driver.
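The receive-side lookup chain described above (destination IP address to MAC address, then MAC address to target domain and vnet interface) can be sketched with two small tables. The table layouts, the linear searches, and the names arp_entry, vnet_port, and vsw_select_port() are assumptions made for illustration and are not taken from the vsw driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

struct arp_entry { uint32_t ip; uint8_t mac[MAC_LEN]; };
struct vnet_port { uint8_t mac[MAC_LEN]; int domain_id; int vnet_instance; };

/* Destination IP address -> MAC address, as looked up in an ARP table. */
const uint8_t *arp_lookup(const struct arp_entry *arp, size_t n, uint32_t ip)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (arp[i].ip == ip)
            return arp[i].mac;
    return NULL;
}

/* MAC address -> the vnet port (target domain and interface) that owns it. */
const struct vnet_port *port_lookup(const struct vnet_port *ports, size_t n,
                                    const uint8_t *mac)
{
    size_t i;

    assert(mac != NULL);
    for (i = 0; i < n; i++)
        if (memcmp(ports[i].mac, mac, MAC_LEN) == 0)
            return &ports[i];
    return NULL;
}

/* Receive path: choose the vnet port whose LDC should carry a packet
 * addressed to dst_ip. Returns NULL when no vnet owns the address. */
const struct vnet_port *vsw_select_port(const struct arp_entry *arp, size_t n_arp,
                                        const struct vnet_port *ports, size_t n_ports,
                                        uint32_t dst_ip)
{
    const uint8_t *mac = arp_lookup(arp, n_arp, dst_ip);

    return mac != NULL ? port_lookup(ports, n_ports, mac) : NULL;
}
```

A NULL result covers both an unknown IP address and a MAC address that belongs to no vnet; how the real virtual switch treats those cases is not described in the text.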
The vnet driver uses Solaris GLD v3 interfaces and is fully compatible with the native
driver using the same GLD v3 interface.
Figure 28 depicts the flow of receiving a packet from the network through an I/O
domain to a guest domain. The sequence of operations for receiving packets is as
follows:
1. Data is stored via DMA into the receive buffer ring of the native driver, e1000g.
2. The vsw sends the packet to the client driver, vnet, through the LDC.
3. The LDC receiving worker thread gets the packet and sends it to the vnet driver.
Footnote 1. Information on the Solaris kernel variable dk_ioctl can be looked up at the Web site: http://www.opensolaris.org/.
96 Logical Domains Sun Microsystems, Inc.
Figure 28. Flow of control for receiving a network packet from an I/O domain to a guest domain.
[Figure 28 diagram: a Sun UltraSPARC T1/T2 server (CPU, memory, devices) running the LDoms Hypervisor, with a Network Chip, an I/O Domain, a Guest Domain, and a Logical Domain Channel. Markers 1, 2, and 3 correspond to the steps listed above.]

Call stack shown for the Guest Domain:
    vnet`vgen_handle_datamsg
    ldc`i_ldc_rx_hdlr+0x3c0
    cnex`cnex_intr_wrapper+0xc
    intr_thread+0x170
    idle+0x128
    thread_start+4

Call stack shown for the I/O Domain:
    ldc`ldc_write
    vsw`vsw_dringsend+0x234
    vsw`vsw_portsend+0x60
    vsw`vsw_forward_all+0x134
    vsw`vsw_switch_l2_frame+0x248
    mac`mac_rx+0x58
    e1000g`e1000g_intr_pciexpress+0xb8
    px`px_msiq_intr+0x1b8
    intr_thread+0x170
    cpu_halt+0xc0
    idle+0x128
    thread_start+4
Chapter 8
VMware
VMware, the current market leader in virtualization software for x86 platforms, offers
three virtual machine systems: VMware Workstation; no-cost VMware Server, formerly
known as VMware GSX Server; and VMware Infrastructure 3, a suite of virtualization
products based on VMware ESX Server Version 3.
The VMware Workstation and VMware Server products are add-on software modules
that run on a host OS such as Windows, Linux, or a BSD variant (Figure 29). In
these implementations the VMM is a part of, and has the same privilege as, the host OS
kernel. The guest OS runs as an application on the host OS. The Solaris OS can only run
as a guest OS on VMware Workstation and VMware Server.
The VMware Infrastructure suite of products is built around the VMware ESX Server.
VMware ESX Server runs on bare metal and uses a derived version of SimOS [18] as the
kernel for running the VMM and I/O services. All other operating systems run as a guest
OS. VMware Infrastructure supports Windows, Linux, and Solaris as guest OSes. VMware
ESX Server provides lower overhead and better control of system resources than
VMware Workstation and VMware Server. However, because it provides all device drivers
itself, it supports fewer devices than VMware Workstation and VMware Server.
Figure 29 shows the configuration of VMware ESX Server and GSX Server.
Figure 29. VMware GSX Server (VMware Workstation and VMware Server products) runs within a host operating system, while VMware ESX Server runs on the bare metal.
VMware ESX Server is a Type I VMM, and has exclusive control of hardware resources
(see “Types of VMM” on page 10). In contrast, VMware Workstation and VMware Server
are Type II VMMs, and leverage the host OS by running inside the OS kernel.
[Figure 29 diagram: GSX Server side: guest OSes (Linux, Solaris), each on its own VMM, and host applications run on a host operating system above the hardware. ESX Server side: guest OSes (Linux, Solaris, Windows) run on the VMM directly above the hardware.]
VMware Infrastructure Overview
VMware ESX Server, VMware's product for running enterprise applications in data
centers, serves as the foundation of the VMware Infrastructure solution. VMware ESX
Server includes the following components:
• Virtualization layer — abstracts the hardware resources including CPU, memory,
and I/O
• I/O interface — enables the delivery of file system services to VMs
• Service Console — provides an interface to manage resources and administer VMs
Figure 30 shows the functional components of the VMware ESX Server product. The
VMkernel, the core of the ESX Server, abstracts the underlying hardware resources and
implements the virtualization layer. The VMkernel includes multiple VMMs, one for
each VM. The VMM implements the virtual CPUs for each VM. The VMkernel also
includes modules for I/O driver emulation, the I/O stack, and device drivers for network
and storage devices. The service console, a Red Hat Linux-based component, serves as a
boot loader and provides a management interface to the VMkernel.
Figure 30. VMware ESX Server functional components.
The following sections discuss the functional components of VMware Infrastructure,
with particular emphasis on the virtualization layer which forms the core of all VMware
virtualization products.
VMware CPU Virtualization
ESX Server provides full virtualization, enabling an unmodified GOS to run on the
underlying x86 hardware. Full virtualization is achieved by the ESX virtualization
layer. The core of the ESX virtualization layer is the VMM, which includes three modules
(Figure 31) [12]:
• Execution decision module — decides whether VM instructions should be sent to
the direct execution module or the binary translation module
• Binary translation module — used to execute the VM whenever the hardware
processor is in a state in which direct execution cannot be used
• Direct execution module — enables the VM to directly execute its instruction
sequences on the underlying hardware processor
Figure 31. VMware ESX Server virtualizes the CPU hardware through binary translation whenever the processor itself cannot directly execute an instruction.
The decision to use binary translation or direct execution depends on the state of the
processor and whether the segment is reversible or not (see “Segmented Architecture”
on page 23). If the content of the descriptor table, for example the GDT, is changed by
the VMM because of a context switch to another VM, the segment is non-reversible.
Direct execution can be used only if the VM is running in an unprivileged mode and the
hidden descriptors of the segment register are reversible. In all other cases, the VMM
will switch to the binary translation module.
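The rule stated above reduces to a two-condition test. The following Python sketch shows it under the assumption that the processor state can be summarized by two booleans; the VcpuState type and the function name are hypothetical and are not taken from the ESX Server implementation.

```python
from dataclasses import dataclass


@dataclass
class VcpuState:
    """Hypothetical summary of the state the execution decision module inspects."""
    privileged: bool            # True when the VM is running in a privileged mode
    segments_reversible: bool   # True when all hidden segment descriptors are reversible


def choose_execution_module(state: VcpuState) -> str:
    # Direct execution requires BOTH an unprivileged VM and reversible segments.
    if not state.privileged and state.segments_reversible:
        return "direct_execution"
    # In all other cases the VMM falls back to binary translation.
    return "binary_translation"
```

For example, an unprivileged VM whose GDT was changed by a context switch (a non-reversible segment) is routed to binary translation.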
Binary Translation
The binary translation (BT) module is believed to be influenced by the machine simulators
Shade [13] and Embra [14]. Embra is part of SimOS [18] which was developed by a
Stanford team led by Mendel Rosenblum, one of the founders of VMware. While
extensive details of the BT module implementation have not been published, Agesen
[15], Embra [14], and Shade [13] provide some information on its implementation.
The BT module translates GOS instructions, which are running in a deprivileged VM,
into instructions that can run in the privileged VMM segment. The BT module receives
x86 binary instructions, including privileged instructions, as input. The output of the
module is a set of instructions that can be safely executed in the non-privileged mode.
Agesen [15] gives an example of how control flow is handled in the BT module.
To avoid frequently retranslating blocks of instructions, translated blocks are kept in a
Translation Cache (TC). The execution of a block of instructions is simulated by locating
the block’s translation in the TC and jumping to it. A hash table maintains the
mappings from a program counter to the address of the translated code in the TC.
The main loop of the dynamic binary translation simulator is shown in Figure 32. The
loop checks to see if the current simulated program counter address is present in the
TC. If it is present in the TC, the translated block is executed. If it is not, the translator is
called to add the block to the TC. Each block of translated code ends by loading the new
simulated program counter and jumping back to the main loop for dispatching.
Figure 32. Binary translation manages a translation cache to reduce the need to re-translate frequently executed blocks of instructions.
A more detailed description of binary translation is beyond the scope of this paper.
Readers should refer to Shade [13] and Embra [14] for more details about dynamic
binary translation.
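As a rough illustration of the dispatch loop and translation cache, the following Python sketch simulates a toy guest. It is not VMware's translator: the three-opcode instruction set, the GUEST_CODE program, and the use of Python closures as "translated blocks" are all assumptions chosen to keep the example runnable.

```python
# Toy guest program: program counter -> (opcode, operand). Entirely hypothetical.
GUEST_CODE = {
    0: ("add", 5),
    1: ("add", 7),
    2: ("jmp", 4),
    4: ("add", 1),
    5: ("halt", None),
}

translation_cache = {}   # hash table: guest pc -> translated block
translate_calls = []     # records which block start addresses were translated


def translate(pc):
    """Translate the basic block starting at pc into a Python callable."""
    translate_calls.append(pc)
    ops = []
    while True:
        op, operand = GUEST_CODE[pc]
        ops.append((op, operand))
        pc += 1
        if op in ("jmp", "halt"):        # a block ends at a control transfer
            break

    def block(acc):
        for op, operand in ops:
            if op == "add":
                acc += operand
            elif op == "jmp":
                return acc, operand      # load the new simulated pc
            elif op == "halt":
                return acc, None         # no next pc: stop the dispatch loop
        return acc, None                 # not reached: every block ends in jmp or halt

    return block


def run(start_pc):
    """Dispatch loop: look up the pc in the cache, translate on a miss, execute."""
    acc, pc = 0, start_pc
    while pc is not None:
        if pc not in translation_cache:
            translation_cache[pc] = translate(pc)
        acc, pc = translation_cache[pc](acc)   # block returns control to the loop
    return acc


first = run(0)
second = run(0)   # second run is served entirely from the translation cache
```

Running the program twice translates each block only once, which is the saving the translation cache is meant to provide.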
Some privileged instructions that have simple operations use in-TC sequences. For
example, a clear interrupt instruction (cli) can be replaced by setting a virtual
processor flag. Privileged instructions that have more complex operations (such as
setting cr3 during a context switch), require a call out of the TC to perform the
emulation work.
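The difference between an in-TC sequence and a callout can be sketched as follows. The VirtualCpu fields, the instruction names, and the helper functions are hypothetical, and the actual emulation work behind a cr3 write is elided.

```python
class VirtualCpu:
    """Hypothetical per-VM virtual processor state kept by the VMM."""

    def __init__(self):
        self.interrupts_enabled = True   # virtual interrupt flag
        self.cr3 = 0                     # virtual page-table base register
        self.callouts = 0                # number of trips out of the translation cache


def emulate_cr3_write(vcpu, value):
    """Callout: complex emulation performed outside the translation cache."""
    vcpu.callouts += 1
    vcpu.cr3 = value                     # the real emulation work is elided here


def translate_instruction(op, operand=None):
    """Return the translated form of one privileged guest instruction."""
    if op == "cli":
        # Simple case: an in-TC sequence that only clears a virtual flag.
        def in_tc(vcpu):
            vcpu.interrupts_enabled = False
        return in_tc
    if op == "mov_cr3":
        # Complex case: the translated code calls out of the TC.
        def callout(vcpu):
            emulate_cr3_write(vcpu, operand)
        return callout
    raise ValueError(f"unhandled instruction: {op}")


vcpu = VirtualCpu()
for op, operand in [("cli", None), ("mov_cr3", 0x1000)]:
    translate_instruction(op, operand)(vcpu)
```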
In addition to binary translation and logic for determining the code execution, the
virtualization layer employs other techniques to overcome x86 virtualization issues:
• Memory Tracing
The virtualization layer traces modifications on any given physical page of the virtual
machine, and is notified of all read and write accesses made to that page in a
transparent manner. This memory tracing ability in the VMM is enabled by page
faults and the ability to single-step the virtual machine via binary translation.
[Figure 32 diagram: pseudocode for the dispatch loop, the translator, and the translation cache]
Main {
    ....
    dispatch_loop:
    if (PC_not_in_TC(pc))
        tc = translate(pc);
    newpc = pc_to_tc(pc);
    jump_to_pc(newpc);
    ....
}
translate(pc) {
    ....
    blk = read_instructions(pc);
    perform_translation(blk);
    write_into_TC(blk);
    ....
}
Translation Cache: code fragments which end with a jump back to dispatch_loop
• Shadow Descriptor Tables
The x86's segmented architecture (see “Segmented Architecture” on page 23) has a
segment caching mechanism that allows the segment register's hidden fields to be
re-used. However, this approach can cause difficulty if the descriptor table is
modified in a non-coherent way. The virtualization layer supports the GOS system
descriptor tables using VMM shadow descriptor tables.
The VMM descriptor tables include shadow descriptors that correspond to
predetermined descriptors of the VM descriptor tables. The VMM also includes a
segment tracking mechanism that compares the shadow descriptors with their
corresponding VM segment descriptors. This mechanism detects any mismatch
between the shadow descriptor tables and the corresponding VM descriptor tables,
and updates the shadow descriptors so that they again match their VM segment
descriptors.
The ESX Server's VMM implementation is unique in that each GOS has an associated
VMM. The ESX Server may include any number of VMMs in a given physical system,
each supporting a corresponding VM; the number of VMMs is limited only by available
memory and speed requirements. The features in the virtualization layer mentioned in
the previous discussion allow multiple concurrent VMMs, with each VMM supporting
an unmodified GOS in the virtualization layer.
CPU Scheduling
The ESX Server implements a rate-based proportional-share scheduler [19] that is
similar to the fair-share-scheduler scheme used by the Solaris OS (see [21] Chapter 8) in
which each virtual machine is given a number of shares. The amount of CPU time given
to each VM is based on its fractional share of the total number of shares of active VMs
in the whole system.
The term share is used to define a portion of the system’s CPU resources that is
allocated to a VM. If a greater number of CPU shares is assigned to a VM, relative to
other VMs, then that VM receives more CPU resources from the scheduler. CPU shares
are not equivalent to percentages of CPU resources. Rather, shares are used to define
the relative weight of a CPU load in a VM in relation to CPU loads of other VMs.
The following formula shows how the scheduler calculates per-domain allocation of
CPU resources.

\[ \mathrm{Allocation}_{\mathrm{domain}_i} = \frac{\mathrm{Shares}_{\mathrm{domain}_i}}{\mathrm{TotalShares}} \]

The ESX scheduler allows specifying minimum (reservation) and maximum (limit) CPU
utilization for each virtual machine. A minimum CPU reservation guarantees that a
virtual machine always has this minimum percentage of a physical CPU’s time
allocated to it, regardless of the total number of shares. A maximum CPU limit ensures
that the virtual machine never uses more than this maximum percentage of a physical
CPU’s time, even if extra idle time is available. The proportional-share algorithm is only
applied if the VM CPU utilization falls within the range of reservation and limit CPU
utilization. Figure 33 shows how CPU resource allocation is calculated.
Figure 33. Calculation of CPU resources in VMware.
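A small worked example may help make the share arithmetic concrete. The Python sketch below computes each VM's fractional share and then bounds the resulting MHz figure by a reservation and a limit. The function names and the sample numbers are hypothetical, and simple clamping is a simplification: it does not redistribute the capacity freed or consumed by the bounds, which a real scheduler would have to account for.

```python
def proportional_share(shares):
    """Return each active VM's fraction of the total shares."""
    total = sum(shares.values())
    return {vm: s / total for vm, s in shares.items()}


def clamp_allocation(fraction, total_mhz, reservation_mhz=0.0, limit_mhz=None):
    """Bound a proportional allocation by the VM's reservation and limit."""
    mhz = fraction * total_mhz
    if limit_mhz is not None:
        mhz = min(mhz, limit_mhz)        # never exceed the configured limit
    return max(mhz, reservation_mhz)     # never fall below the reservation


# Hypothetical configuration: three active VMs on 3000 MHz of CPU capacity.
fractions = proportional_share({"vm_a": 2000, "vm_b": 1000, "vm_c": 1000})
alloc_a = clamp_allocation(fractions["vm_a"], 3000, limit_mhz=1000)
alloc_b = clamp_allocation(fractions["vm_b"], 3000, reservation_mhz=900)
alloc_c = clamp_allocation(fractions["vm_c"], 3000)
```

Here vm_a holds half of the shares but is capped by its limit, vm_b is lifted to its reservation, and vm_c receives its plain proportional share.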
In an SMP environment in which a VM could have more than one virtual CPU (VCPU), a
scalability issue arises when one VCPU is spinning on a lock held by another VCPU that
gets de-scheduled. The spinning VCPU wastes CPU cycles spinning on the lock until the
lock owner VCPU is finally scheduled again and releases the lock.
ESX implements co-scheduling to work around this problem. In co-scheduling (also
called gang scheduling), all virtual processors of a VM are mapped one-to-one onto the
underlying processors and simultaneously scheduled for an equal time slice. The ESX
scheduler guarantees that no VCPUs are spinning on a lock held by a VCPU that has
been preempted.
However, co-scheduling does introduce other problems. Because all VCPUs are
scheduled at the same time, co-scheduling activates a VCPU regardless of whether
there are jobs in the VCPU's run queue. Co-scheduling also precludes multiplexing
multiple VCPUs on the same physical processor.
Timer Services
Like Sun xVM Server, ESX Server faces the issue of getting clock interrupts
delivered to VMs at the configured interval [16]. This issue arises because the VM may
not be scheduled when interrupts are due to be delivered. ESX Server keeps track of the
clock interrupt backlog and tries to deliver clock interrupts at a higher rate when the
backlog gets large. However, the backlog can get so large that it is not possible for the
GOS to catch up with real time. ESX Server therefore stops attempting to catch
up if the clock interrupt backlog grows beyond 60 seconds. Instead, ESX Server sets its
record of the clock interrupt backlog to zero and synchronizes the GOS clock with the
host machine clock.
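The catch-up policy described above can be sketched as a small state machine. Only the 60-second threshold comes from the text; the GuestClock class, its method names, the 10 ms tick, and the per-call delivery cap are hypothetical.

```python
MAX_BACKLOG_MS = 60_000   # 60-second threshold quoted in the text


class GuestClock:
    """Hypothetical bookkeeping for one VM's clock interrupts."""

    def __init__(self, tick_ms):
        self.tick_ms = tick_ms       # configured clock interrupt interval
        self.guest_time_ms = 0       # time as seen by the GOS
        self.backlog_ticks = 0       # interrupts that are due but not yet delivered

    def miss_ticks(self, count):
        """Record interrupts that came due while the VM was not scheduled."""
        self.backlog_ticks += count

    def deliver(self, host_time_ms, max_ticks):
        """Deliver pending interrupts at a higher rate, or resynchronize."""
        if self.backlog_ticks * self.tick_ms > MAX_BACKLOG_MS:
            self.backlog_ticks = 0               # give up catching up
            self.guest_time_ms = host_time_ms    # synchronize with the host clock
            return "resync"
        delivered = min(self.backlog_ticks, max_ticks)
        self.backlog_ticks -= delivered
        self.guest_time_ms += delivered * self.tick_ms
        return "catching_up" if self.backlog_ticks else "in_sync"


clock = GuestClock(tick_ms=10)
clock.miss_ticks(500)                                    # 5 seconds behind
first = clock.deliver(host_time_ms=5_000, max_ticks=200)
clock.miss_ticks(7_000)                                  # now more than 60 seconds behind
second = clock.deliver(host_time_ms=80_000, max_ticks=200)
```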
ESX Server virtualizes the Time Stamp Counter (TSC) so that the virtualized TSC counter
matches with the GOS clock (see “Time Stamp Counter (TSC)” on page 28). When the
clock interrupt backlog is cleared due to catching up or due to reset when the backlog is
too large, the virtualized TSC catches up with the adjusted clock.
VMware Memory Virtualization
Similar to Sun xVM Server, memory virtualization in VMware ESX Server deals with two
memory management issues: physical memory management and page translations.
Physical Memory Management
Similar to Sun xVM Server, ESX Server virtualizes a VM's physical memory by adding an
extra level of address translation when mapping a VM's physical memory pages to the
physical memory pages on the underlying machine. Also like Sun xVM Server, the
underlying physical pages are referred to as machine pages, and the VM's physical
pages as physical pages. Each VM sees a contiguous, zero-based, addressable physical
memory space whereas the underlying machine memory used by each virtual machine
may not be contiguous.
ESX Server manages physical memory allocation and reclamation, similar to Sun xVM
Server, by using the memory ballooning technique. More detailed information on how
the memory ballooning technique manages the physical memory allocation and
reclamation is included in [5].
Page Translations
Each GOS in the ESX Server maintains page tables for virtual-to-physical address
mappings. The VMM also maintains shadow page tables for the virtual-to-machine
page mappings along with physical-to-machine mappings in its memory. The
processor's MMU uses the VMM's shadow page table. When a GOS updates its page
tables with a virtual-to-physical translation, the VMM intercepts the instruction, gets
the physical-to-machine mapping from its memory, and loads the shadow page table
with the virtual-to-machine mapping. This mechanism allows normal memory accesses
in the VM to execute without adding address translation overhead if the shadow page
tables are set up for that access.
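The relationship between the three mappings can be illustrated with dictionaries. This Python sketch is a model of the idea only; the ShadowMmu class, its method names, and the page numbers are hypothetical.

```python
class ShadowMmu:
    """Hypothetical model of one VM's guest, physical-to-machine, and shadow tables."""

    def __init__(self, phys_to_machine):
        self.phys_to_machine = dict(phys_to_machine)  # kept in VMM memory
        self.guest_page_table = {}    # virtual -> physical, maintained by the GOS
        self.shadow_page_table = {}   # virtual -> machine, used by the processor's MMU

    def guest_update(self, vpage, ppage):
        """Runs when the VMM intercepts a guest page-table update."""
        self.guest_page_table[vpage] = ppage
        mpage = self.phys_to_machine[ppage]           # physical -> machine lookup
        self.shadow_page_table[vpage] = mpage         # load virtual -> machine mapping

    def translate(self, vpage):
        """Normal access: a single lookup, with no extra translation step."""
        return self.shadow_page_table[vpage]


# The VM sees contiguous physical pages 0..2; the machine pages are scattered.
mmu = ShadowMmu({0: 0x8A, 1: 0x13, 2: 0x5F})
mmu.guest_update(vpage=0x400, ppage=1)
mmu.guest_update(vpage=0x401, ppage=2)
```

Once guest_update has populated the shadow table, translate resolves a virtual page directly to a machine page, which is why normal memory accesses carry no added translation overhead.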
VMware I/O Virtualization
Every VM is configured with a set of standard PC virtual devices: PS/2 keyboard and
mouse, IDE controller with ATA disk and ATAPI CD-ROM, serial port, parallel port, and
sound chip [20]. In addition, ESX Server provides virtual PCI emulation for PCI add-on
devices such as SCSI, Ethernet, and SVGA graphics (see Figure 30 on page 98).
The device tree as exported by the VMM to a GOS is shown in the following
prtconf(1M) output.
% prtconf
System Configuration: Sun Microsystems i86pc
Memory size: 1648 Megabytes
System Peripherals (Software Nodes):

i86pc
    scsi_vhci, instance #0
    isa, instance #0
        i8042, instance #0
            keyboard, instance #0
            mouse, instance #0
        lp (driver not attached)
        asy, instance #0 (driver not attached)
        asy, instance #1 (driver not attached)
        fdc, instance #0
            fd, instance #0
    pci, instance #0
        pci15ad,1976 (driver not attached)
        pci8086,7191, instance #0
        pci15ad,1976 (driver not attached)
        pci-ide, instance #0
            ide, instance #0
                sd, instance #16
            ide (driver not attached)
        pci15ad,1976 (driver not attached)
        display, instance #0
        pci1000,30, instance #0
            sd, instance #0
        pci15ad,750, instance #0
    iscsi, instance #0
    pseudo, instance #0
    options, instance #0
    agpgart, instance #0 (driver not attached)
    xsvc, instance #0
    objmgr, instance #0
    acpi (driver not attached)
    used-resources (driver not attached)
    cpus (driver not attached)
        cpu, instance #0 (driver not attached)
        cpu, instance #1 (driver not attached)
The PCI vendor ID of VMware is 15ad. The following entries are relevant to VMware I/O
virtualization:
For the example device tree shown here, the Solaris OS binds the e1000g driver to
pci15ad,750 and uses e1000g as the network driver. The actual network hardware
used on the system is a Broadcom NetXtreme Dual Gigabit Adapter with the PCI ID
pci14e4,1468. VMware translates the e1000g device interface operations issued by the
Solaris e1000g driver, and sends them to the Broadcom NetXtreme device.
For storage, unlike Sun xVM Server, ESX Server continues to use sd as the interface to
file systems. The disk interface is emulated at the SCSI bus adapter
interface (LSI Logic SCSI controller) instead of at the SCSI target interface (SCSI disk sd).
Device Emulation
Each storage device, regardless of the specific adapter, appears as a SCSI drive
connected to an LSI Logic SCSI adapter within the VM. For network I/O, ESX Server
emulates an AMD Lance/PCNet or Intel E1000 device, or uses a custom interface
called vmxnet for the physical network adapter.
VMware provides device emulation rather than the I/O emulation used by Sun xVM
Server and UltraSPARC LDoms (see “I/O Virtualization” on page 16). In a simple scenario,
consider an application within the VM making an I/O request to the GOS, as illustrated
in Figure 34:
Device entries relevant to VMware I/O virtualization:
Device Entry | Description
pci15ad,750 | VMware emulation of Intel's 100FX Gigabit Ethernet
pci15ad,1976 | VMware emulation of the Intel 440BX/ZX PCI bridge chip
pci1000,30 | The LSI Logic 53C1020/1030 SCSI controller
display | VMware virtual SVGA
Figure 34. Sequence of events for applications making an I/O request.
1. Applications perform I/O operations through the interface to the device as
exported by the VMware VMM (see “VMware I/O Virtualization” on page 103). The
virtual device interface uses the native drivers (for example, the e1000g for
network and mpt for the LSI SCSI HBA) in the Solaris kernel.
2. The Solaris native driver attempts to access the device via the IN/OUT instructions
(for example, by writing a DMA descriptor to the device's DMA engine).
3. The VMM intercepts the I/O instructions and then transfers control to the device-
independent module in the VMkernel for handling the I/O request.
4. The VMkernel converts the I/O request from the emulated device to one for the
real device, and sends the converted I/O request to the driver for the real device.
5. The VMware driver sends the I/O request to the real I/O device.
6. When an I/O request completion interrupt (for example, DMA completion
interrupt) arrives, the VMkernel device driver receives and processes the interrupt.
7. The VMkernel then notifies the VMM of the target virtual machine, which copies
data to the VM memory and then raises the interrupt to the GOS.
8. The Solaris driver’s interrupt service routine (ISR) is called.
9. The Solaris driver performs a sequence of I/O accesses (for example, reads the
transaction status, acknowledges receipt) to the I/O ports before passing the data
to its applications.
The VMkernel ensures that data intended for each virtual machine is isolated from
other VMs.
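The request path in the steps above can be modeled with three small classes. This is a schematic sketch rather than the VMware design: the class names, the request dictionaries, and the port number are hypothetical, and interrupt delivery is reduced to appending to a list.

```python
class RealDevice:
    """Stands in for the physical I/O device and its completion interrupt."""

    def __init__(self):
        self.transmitted = []

    def submit(self, request):
        self.transmitted.append(request["payload"])          # step 5: real I/O
        return {"status": "complete", "vm": request["vm"]}   # step 6: completion


class VMkernel:
    """Converts emulated-device requests into requests for the real device."""

    def __init__(self, device):
        self.device = device

    def handle_emulated_io(self, vm_name, port, value):
        # Step 4: convert the emulated request into one for the real device.
        real_request = {"vm": vm_name, "emulated_port": port, "payload": value}
        return self.device.submit(real_request)


class Vmm:
    """Per-VM monitor that intercepts the guest driver's I/O instructions."""

    def __init__(self, vm_name, vmkernel):
        self.vm_name = vm_name
        self.vmkernel = vmkernel
        self.pending_interrupts = []

    def guest_out(self, port, value):
        # Step 3: intercept the OUT instruction and hand it to the VMkernel.
        completion = self.vmkernel.handle_emulated_io(self.vm_name, port, value)
        # Step 7: raise a virtual interrupt for the guest's ISR (step 8).
        self.pending_interrupts.append(completion["status"])


device = RealDevice()
vmm = Vmm("solaris_guest", VMkernel(device))
vmm.guest_out(port=0xC000, value=b"frame-1")   # step 2: native driver accesses the device
```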
[Figure 34 diagram labels: Guest Application; VMware Supported Virtual Device Interface; Solaris Native Drivers; VMM; I/O Access Handler; Device Emulation; VMkernel; Device Independent Module; Real Device Driver; Hardware Interface Layer; I/O Device; Sun x64 Server. Markers 1 through 9 correspond to the steps listed above.]
Section III
Additional Information
• Appendix A: VMM Comparison (page 109)
• Appendix B: References (page 111)
• Appendix C: Terms and Definitions (page 113)
• Appendix D: Author Biography (page 117)
Appendix A
VMM Comparison
This appendix presents a summary comparison of the four virtual machine monitors
discussed in this paper: Sun xVM Server without HVM, Sun xVM Server with HVM,
VMware, and Logical Domains (LDoms). Table 10 summarizes their general
characteristics; provides information on their CPU, Memory, and I/O virtualization
implementation; and lists the management options available for each.
Table 10. Comparison of virtual machine monitors discussed in this paper.
General | Sun xVM Server w/o HVM | Sun xVM Server w/HVM | VMware | LDoms
VMM version | 3.0.4 | 3.0.4 | ESX 3.0.1 | LDoms 1.0.1
Supported ISA | x86 and IA-64 | x86 and IA-64 | x86 | UltraSPARC T1/T2
VMM Layer | Run on bare metal | Run on bare metal | Run on bare metal | Firmware
Virtualization Scheme | Paravirtualization | Full | Full | Paravirtualization
Supported GOS | Linux, NetBSD, FreeBSD, OpenBSD, Solaris | Linux, NetBSD, FreeBSD, OpenBSD, Windows | Windows, Linux, NetWare, Solaris | Solaris, Linux
SMP GOS | Yes | Yes | Yes | Yes
64-bit GOS | Yes | Yes | Yes | Yes
Max VMs | Limited by memory | Limited by memory | Limited by memory | 32 on UltraSPARC T1; 64 on UltraSPARC T2
Method of operation | Modified GOS | Hardware Virtualization | Binary Translation | Modified OS
License | GPL (free) | GPL (free) | Proprietary | CDDL (free)

CPU | Sun xVM Server w/o HVM | Sun xVM Server w/HVM | VMware | LDoms
CPU scheduling | Credit | Credit | Fair Share | N/A
VMM Privilege Mode | Privileged (ring 0) | Privileged (ring 0) | Privileged | Hyperprivileged
GOS Privileged Mode | Unprivileged (ring 3 for 64-bit kernel; ring 1 for 32-bit kernel) | Reduced privileged | Deprivileged | Privileged
CPU Granularity | Fractional | Fractional | Fractional | 1 strand
Interrupt | Queued and delivered when the VM is scheduled to run | Queued and delivered when the VM is scheduled to run | Queued and delivered when the VM is scheduled to run | Deliver directly to the VM

Memory | Sun xVM Server w/o HVM | Sun xVM Server w/HVM | VMware | LDoms
Page Translation | Hypercall to VMM | Shadow page | Shadow page | Hypercall to VMM
Physical Memory Allocation | Balloon driver | Balloon driver | Balloon driver | Hard Partition
Page Tables | Managed by VMM | Managed by VMM | Managed by VMM | Managed by GOS

I/O | Sun xVM Server w/o HVM | Sun xVM Server w/HVM | VMware | LDoms
I/O Granularity | Shared | Shared | Shared | PCI bus
I/O Virtualization | I/O emulation by Dom0 | Device emulation by QEMU or I/O emulation by Dom0 | Device emulation by VMkernel | I/O emulation by I/O domain
Device drivers | Virtual driver on DomU, native driver on Dom0 | Native driver on DomU and Dom0 (QEMU) | Native driver on guest supported by the VMM | Virtual driver on non I/O domain and native driver on I/O domain

Management | Sun xVM Server w/o HVM | Sun xVM Server w/HVM | VMware | LDoms
Management Model | Dom0 - SPOF | Dom0 - SPOF | Service console - SPOF | Control domains
Interface | CLI: xm(1M); GUI: virt-manager | CLI: xm(1); GUI: virt-manager | GUI: Virtual Center | CLI: ldm(1M), XML, and SNMP MIBs
Appendix B
References
1. Popek, Gerald J. and Goldberg, Robert P. “Formal Requirements for Virtualizable
Third Generation Architectures,” Communications of the ACM 17 (7), pages 412-
421, July 1974.
2. UltraSPARC Architecture 2005: One Architecture.... Multiple Innovative
Implementations, Draft D0.9, 15 May 2007.
3. Robin, John Scott and Irvine, Cynthia E. “Analysis of the Intel Pentium's Ability to
Support a Secure Virtual Machine Monitor,” Proceedings of the 9th USENIX
Security Symposium, August 2000.
4. VMware: http://www.vmware.com/vinfrastructure/
5. Waldspurger, Carl A. “Memory Resource Management in VMware ESX Server,”
Proceedings of the 5th Symposium on Operating Systems Design and
Implementation, Dec. 2002.
6. Xen, “The Xen virtual machine monitor,” University of Cambridge Computer
Laboratory: http://www.cl.cam.ac.uk/research/srg/netos/xen/
7. IA-32 Intel Architecture Software Developer's Manual, March 2006.
8. System V Application Binary Interface AMD64 Architecture Processor Supplement
Draft Version 0.98, September 27, 2006. http://www.x86-64.org/documentation/abi.pdf
9. AMD64 Architecture Programmer’s Manual, Volume 2: System Programming, Rev.
3.12, September 2006.
10. OpenSPARC T1 Microarchitecture Specification, Revision A, August 2006.
11. UltraSPARC Virtual Machine Specification (The sun4v architecture and Hypervisor
API specification), Revision 1.0, January 24, 2006.
12. Devine, Scott W.; Bugnion, Edouard; Rosenblum, Mendel. “Virtualization system
including a virtual machine monitor for a computer with a segmented
architecture,” U.S. Patent 6,397,242, October 26, 1998.
13. Cmelik, Robert F. and Keppel, David. “Shade: A Fast Instruction Set Simulator for
Execution Profiling,” ACM SIGMETRICS Performance Evaluation Review, pages 128-
137, May 1994.
14. Witchel, Emmett and Rosenblum, Mendel. “Embra: Fast and Flexible Machine
Simulation,” The Proceedings of ACM SIGMETRICS '96: Conference on
Measurement and Modeling of Computer Systems, 1996.
15. Adams, Keith and Agesen, Ole. “A Comparison of Software and Hardware
Techniques for x86 Virtualization,” ASPLOS 2006, San Jose, CA, USA, October 21-25,
2006.
16. “Timekeeping in VMware Virtual Machines,” VMware white paper, August 2005.
17. Bittman, T. “Gartner RAS Core Strategic Planning SPA-21-5502, Research Note 14,”
November 2003.
18. Rosenblum, Mendel; Herrod, Stephen A.; Witchel, Emmett; and Gupta, Anoop.
“Complete Computer Simulation: The SimOS Approach,” IEEE Parallel and
Distributed Technology, pages 34-43, Winter 1995.
19. “VMware ESX Server 2 Architecture and Performance Implication,” VMware white
paper, 2005.
20. Sugerman, Jeremy; Venkitachalam, Ganesh; and Lim, Beng-Hong. “Virtualizing I/O
Devices on VMware Workstation’s Hosted Virtual Machine Monitor,” Proceedings
of the 2001 USENIX Annual Technical Conference, Boston, Massachusetts, USA,
June 25-30, 2001.
21. System Administration Guide: Solaris Containers-Resource Management and
Solaris Zones, Part No: 817-1592-14, June 2007.
22. Drakos, Nikos; Hennecke, Marcus; Moore, Ross; and Swan, Herb. Xen Interface
manual: Xen v3.0 for x86.
23. Bochs IA-32 Emulator Project: http://bochs.sourceforge.net/
24. QEMU, Open Source Processor Emulator: http://fabrice.bellard.free.fr/qemu/
25. IEEE 1275-1994 Open Firmware: http://playground.sun.com/1275/
26. PCI Bus Binding to IEEE std. 1275-1994, Rev 2.1 August 29, 1998.
27. TAP — a Virtual Ethernet network device: http://vtun.sourceforge.net/tun/
28. Intel Virtualization Technology for Directed I/O Architecture Specification, May
2007, Order Number: D51397-002.
29. Shadow2 presentation at Xen Technical Summit, Summer 2006: http://www.xensource.com/files/summit_3/XenSummit_Shadow2.pdf
30. PCI SIG, “Address Translation Services,” Revision 1.0, March 8, 2007.
31. AMD I/O Virtualization Technology (IOMMU) Specification, Revision 1.20,
Publication# 34434, February 2007.
32. Jun Nakajima, Asit Mallick, Ian Pratt, Keir Fraser, “x86-64 XenLinux: Architecture,
Implementation, and Optimizations,” Proceedings of the Linux Symposium, July
19-22 2006. Ontario, Canada.
33. OpenSPARC T2 Core Microarchitecture Specification, July 2007, Revision 5.
34. UltraSPARC Architecture 2007, Hyperprivileged, Privileged, and Nonprivileged,
Draft D0.91, Aug 2007.
35. PCI SIG, “Single Root I/O Virtualization and Sharing Specification,” Revision 1.0,
September 11, 2007.
Appendix C
Terms and Definitions
Hardware level virtualization introduces several terms that are used throughout this
document. The following terms are defined in the context of hardware-level
virtualization.
Balloon driver
A method for dynamic sharing of physical memory among VMs [5].

Binary Translation
In computing, binary translation [13] [14] usually refers to the emulation of one instruction set by another through translation of instructions, which allows software programs (e.g., operating systems and applications) written for a particular processor architecture to run on another. In the context of VMware products, binary translation refers to the conversion of one set of instruction sequences that belongs to a VM and has been deprivileged, to another set of instruction sequences that can run in a privileged VMM segment. VMware uses binary translation [12] [15] to provide full virtualization of the x86 processor.

Domain
A running virtual machine within which a guest OS runs. Domain and virtual machine are used interchangeably in this document.

Full Virtualization
Full virtualization is an implementation of a virtual machine that doesn't require the guest OS to be modified to run in the VM. The techniques used for full virtualization can be a dynamic translation of software programs running in a VM (e.g., VMware products), or providing a complete emulation of the underlying processor (e.g., Xen with Intel-VT or AMD-V).

Guest Operating Systems (GOS)
A GOS is one of the OSes that the VMM can host in a VM. The relationship between VMM, VM, and GOS is analogous to the relationship between, respectively, OS, process, and program.

Hardware Level Virtualization
Hardware Level Virtualization is the technique of using a thin layer of software to abstract the system hardware resources in order to create multiple instances of a virtual execution environment, each of which runs a separate instance of an operating system.

Hardware Thread
See strand.

HVM
Hardware Virtual Machine, also known as hardware-assisted virtualization.

Hypervisor
Hypervisor is another term for VMM. Hypervisor is an extension of the term supervisor, which was commonly applied to operating system kernels.

Logical Domains (LDoms)
Logical domains are Sun's implementation of hardware level virtualization based on the UltraSPARC T1 processor technology. LDom technology allows multiple domains to be created on one processor; each domain runs an instance of an OS supported by one or more strands.

Operating System Level Virtualization
OS Level Virtualization is provided by an OS by virtualizing its services to allow multiple, separate operating environments to be created for applications. The services virtualized by the OS include the file system, devices, networking, security, and Inter Process Communication (IPC).
PacificaAMD's implementation for Hardware Virtualization, also known as AMD-V or AMD SVM.
ParavirtualizationParavirtualization is an implementation of virtual machine that requires the guest OS to be modified to run in the VM. Paravirtualization provides partial emulation of the underlying hardware to a VM and requires the guest OS to replace all sensitive instructions and passes the control to the VMM for handling these operations.
Privileged Instructions Privileged instructions are those that result in a trap if the processor is running in user mode and do not result in a trap if the processor is running in supervisor mode.
Secure Virtual Machine (SVM) AMD's implementation of Hardware Virtualization, also known as Pacifica or AMD-V (see [9] Chapter 15).
Sensitive Instructions Sensitive instructions [1] [12] are those that change the configuration of resources (memory), affect the processor mode without going through the memory trap sequence (page fault), or whose behavior changes with the processor mode or the contents of the relocation register. If sensitive instructions are a subset of privileged instructions, it is relatively easy to build a VM because all sensitive instructions will result in a trap, and the underlying VMM can process the trap and emulate the behavior of these sensitive instructions. If some sensitive instructions are not privileged instructions, special measures have to be taken to handle these sensitive instructions.
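The subset condition described above can be expressed as a simple set comparison. The following is a minimal sketch, assuming a small hand-picked sample of x86 instruction names; it is not a complete classification of any instruction set, and the function names are illustrative.

```python
# Illustrative sample only, not a full ISA description.
# Instructions that trap when executed in user mode:
PRIVILEGED = {"LGDT", "LIDT", "LMSW", "MOV_TO_CR3"}

# Instructions that are sensitive. On the classic x86 architecture,
# POPF, SGDT, and SMSW are sensitive but execute in user mode without a trap.
SENSITIVE = PRIVILEGED | {"POPF", "SGDT", "SMSW"}


def untrapped_sensitive(sensitive, privileged):
    """Return the sensitive instructions that do not trap in user mode."""
    return sorted(sensitive - privileged)


def is_trap_and_emulate_friendly(sensitive, privileged):
    """True when every sensitive instruction is also privileged."""
    return sensitive <= privileged


problem_instructions = untrapped_sensitive(SENSITIVE, PRIVILEGED)
easy_to_virtualize = is_trap_and_emulate_friendly(SENSITIVE, PRIVILEGED)
```

With this sample, `problem_instructions` is non-empty, which corresponds to the case in which special measures are needed.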
Shadow Page A technique for hiding the layout of machine memory from a virtual machine's operating system. A virtual page table is presented to the guest OS by the VMM, but not connected to the processor's memory management unit. The VMM is responsible for trapping accesses to the table, validating updates, and maintaining consistency with the real page table that is visible to the processor MMU. Shadow pages are typically used to provide full virtualization to a VM.
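The trap, validate, and synchronize steps can be sketched with two dictionaries standing in for the guest-visible page table and the real (shadow) page table. This is a simplified model under stated assumptions: the class and method names are hypothetical, page tables are flat maps, and permissions, multi-level tables, and TLB handling are ignored.

```python
class ShadowPaging:
    """Toy model of a VMM keeping a shadow page table in sync (hypothetical)."""

    def __init__(self, guest_to_machine):
        # Guest frame number -> machine frame number, owned by the VMM.
        self.p2m = dict(guest_to_machine)
        self.guest_table = {}   # virtual page -> guest frame (seen by the guest)
        self.shadow_table = {}  # virtual page -> machine frame (seen by the MMU)

    def guest_write_pte(self, vpage, guest_frame):
        """Invoked when a guest write to its page table traps into the VMM."""
        # Validate: the guest may only map frames that it owns.
        if guest_frame not in self.p2m:
            raise PermissionError("guest does not own frame %d" % guest_frame)
        # Apply the update to the table that the guest sees.
        self.guest_table[vpage] = guest_frame
        # Keep the real table consistent, using machine frame numbers.
        self.shadow_table[vpage] = self.p2m[guest_frame]

    def mmu_translate(self, vpage):
        """The hardware MMU only ever walks the shadow table."""
        return self.shadow_table[vpage]


vm = ShadowPaging({0: 512, 1: 77})
vm.guest_write_pte(0x10, 1)
machine_frame = vm.mmu_translate(0x10)
```

The guest sees frame 1 in its own table, while the MMU resolves the same virtual page to machine frame 77, so the machine memory layout stays hidden from the guest.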
Simple Earliest Deadline First (sEDF) One of the scheduling algorithms used in Sun xVM Hypervisor for x86 for scheduling domains. See section “CPU Scheduling” on page 48 for a detailed description of sEDF.
Strand Strand [2] refers to the state that hardware must maintain in order to execute a software thread. Specifically, a strand is the software-visible state (PC, NPC, general-purpose registers, floating-point registers, condition codes, status registers, ASRs, etc.) of a thread plus any microarchitecture state required by hardware for its execution. Strand replaces the ambiguous term hardware thread. The number of strands in a processor defines the number of threads that an operating system can schedule on that processor at any given time.
Sun xVM Hypervisor for x86 Sun xVM Hypervisor for x86 is the VMM of the Sun xVM Server.
Sun xVM Infrastructure Sun Cross Virtualization and Management Infrastructure is a complete solution offering for virtualizing and managing the data center. Sun xVM Infrastructure = Sun xVM Server + xVM Ops Center
Sun xVM Ops Center Sun xVM Ops Center is the management suite for the Sun xVM Server.
Sun xVM Server Sun xVM Server is a paravirtualized Solaris OS that includes support for the Xen open source community work on the x86 platform and support for LDoms on the UltraSPARC T1/T2 platform. In this paper, Sun xVM Server specifically refers to the Sun xVM Server for the x86 platform.
Vanderpool Intel's implementation of Hardware Virtualization, also known as Intel-VT.
Virtual CPU (VCPU) A VCPU is an entity that can be dispatched by the scheduler of a guest OS. For LDoms on UltraSPARC processors, a VCPU is also known as a strand, hardware thread, or logical processor.
Virtual Machine (VM) A virtual machine is a discrete execution environment that abstracts computer platform resources to an operating system. Each virtual machine runs an independent and separate instance of an operating system. Popek and Goldberg [1] also define a VM as an “efficient, isolated duplicate of a real machine.”
Virtual Machine Monitor (VMM) The VMM is a software layer that runs directly on top of the hardware and virtualizes all resources of the computer system. The VMM layer is situated between VMs and hardware resources. The VMM abstracts hardware resources to VMs and performs privileged and sensitive actions on behalf of the VMs.
Virtualization Technology (VT) Intel's implementation of Hardware Virtualization, also known as Vanderpool.
Xen Xen is an open source VMM for x86, IA-64, and PPC [6].
Appendix D
Author Biography
Chien-Hua Yen is currently a senior staff engineer in the ISV engineering group at Sun.
Before joining Sun more than 12 years ago, he had been with several Silicon Valley
companies working as a software development engineer on Unix file systems, real-time
embedded systems, and device drivers. His first job at Sun was with the kernel I/O
group developing a kernel virtual memory segment driver for device memory mapping.
After the kernel group, he worked with third party hardware vendors on developing PCI
drivers for the Solaris OS and high availability products for the Sun CompactPCI board.
In the last two years, Chien-Hua has been working with ISVs on application performance
tuning, Solaris 10 adoption, and Solaris virtualization.
Acknowledgements
The author would like to thank Honlin Su, Lodewijk Bonebakker, Thomas Bastian, Ray
Voight, and Joost Pronk for their invaluable comments; Patric Change for his
encouragement and support; Suzanne Zorn for her editorial work; and Kemer
Thompson for his constructive comments and his coordination of the reviews.
Solaris Operating System Hardware Virtualization Product Architecture On the Web sun.com
Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com
© 2007 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, Java, JVM, Solaris, and Sun BluePrints are trademarks or registered trademarks of Sun Microsystems, Inc. in the United
States and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks
are based upon architecture developed by Sun Microsystems, Inc. Information subject to change without notice. Printed in USA 11/07