
SOLARIS™ OPERATING SYSTEM HARDWARE VIRTUALIZATION PRODUCT ARCHITECTURE

Chien-Hua Yen, ISV Engineering

Sun BluePrints™ On-Line — November 2007

Part No 820-3703-10, Revision 1.0, 11/27/07, Edition: November 2007

Sun Microsystems, Inc.

Table of Contents

Introduction (page 1)
  Hardware Level Virtualization (page 2)
  Scope (page 4)

Section 1: Background Information (page 7)

Virtual Machine Monitor Basics (page 9)
  VMM Requirements (page 9)
  VMM Architecture (page 11)
The x86 Processor Architecture (page 21)
SPARC Processor Architecture (page 29)

Section 2: Hardware Virtualization Implementations (page 37)

Sun xVM Server (page 39)
  Sun xVM Server Architecture Overview (page 40)
  Sun xVM Server CPU Virtualization (page 45)
  Sun xVM Server Memory Virtualization (page 52)
  Sun xVM Server I/O Virtualization (page 56)
Sun xVM Server with Hardware VM (HVM) (page 63)
  HVM Operations and Data Structure (page 64)
  Sun xVM Server with HVM Architecture Overview (page 68)
Logical Domains (page 79)
  Logical Domains (LDoms) Architecture Overview (page 80)
  CPU Virtualization in LDoms (page 84)
  Memory Virtualization in LDoms (page 88)
  I/O Virtualization in LDoms (page 91)
VMware (page 97)
  VMware Infrastructure Overview (page 98)
  VMware CPU Virtualization (page 98)
  VMware Memory Virtualization (page 103)
  VMware I/O Virtualization (page 103)

Section 3: Additional Information (page 107)

VMM Comparison (page 109)
References (page 111)
Terms and Definitions (page 113)
Author Biography (page 117)


Chapter 1

Introduction

In the IT industry, virtualization is a mechanism of presenting a set of logical computing

resources over a fixed hardware configuration so that these logical resources can be

accessed in the same manner as the original hardware configuration. The concept of

virtualization is not new. First introduced in the late 1960s on mainframe computers,

virtualization has recently become popular as a means to consolidate servers and

reduce the costs of hardware acquisition, energy consumption, and space utilization.

The hardware resources that can be virtualized include computer systems, storage, and

the network.

Server virtualization can be implemented at different levels on the computing stack,

including the application level, operating system level, and hardware level:

• An example of application level virtualization is the Virtual Machine for the Java™

platform (Java Virtual Machine or JVM™ machine)¹. The JVM implementation

provides an application execution environment as a layer between the application

and the OS, removing application dependency on OS-specific APIs and hardware-

specific characteristics.

• OS level virtualization abstracts OS services such as file systems, devices,

networking, and security, and provides a virtualized operating environment to

applications. Typically, OS level virtualization is implemented by the OS kernel.

Only one instance of the kernel runs on the system, and it provides multiple

virtualized operating environments to applications. Examples of OS level

virtualization include Solaris™ Containers technology, Linux VServers, and FreeBSD

Jails. OS level virtualization has less performance overhead and better system

resource utilization than hardware level virtualization. Since one OS kernel is

shared among all virtual operating environments, isolation among all virtualized

operating environments is as good as the OS provides.

• Hardware level virtualization, discussed in detail in this paper, has become popular

recently because of increasing CPU power and low utilization of CPU resources in the

IT data center. Hardware level virtualization allows a system to run multiple OS

instances. With less sharing of system resources than OS level virtualization,

hardware virtualization provides stronger isolation of operating environments.

1. The terms "Java Virtual Machine" and "JVM" mean a Virtual Machine for the Java(TM) platform.

The Solaris OS includes bundled support for application and OS level virtualization with

its JVM software and Solaris Containers offerings. Sun first added support for hardware

virtualization in the Solaris 10 11/06 release with Sun Logical Domains (LDoms)

technology, supported on Sun servers which utilize UltraSPARC T1 or UltraSPARC T2

processors. VMware also supports the Solaris OS as a guest OS in its VMware Server and

Virtual Infrastructure products starting with the Solaris 10 1/06 release. In October

2007, Sun announced the Sun xVM family of products that includes the Sun xVM Server

and the Sun xVM Ops Center management system:

• Sun xVM Server — includes support for the Xen open source community work [6] on

the x86 platform and support for LDoms on the UltraSPARC T1/T2 platform

• Sun xVM Ops Center — a management suite for the Sun xVM Server

Note – In this paper, in order to distinguish the discussion of x86 and UltraSPARC T1/T2 processors, Sun xVM Server is specifically used to refer to the Sun hardware virtualization product for the x86 platform, and LDoms is used to refer to the Sun hardware virtualization product for the UltraSPARC T1 and T2 platforms.

The hardware virtualization technology and new products built around this technology

have expanded options and opportunities for deploying servers with better utilization,

more flexibility, and enhanced functionality. In reaping the benefits of hardware

virtualization, IT professionals also face the challenge of operating within the

limitations of a virtualized environment while delivering the same service levels

as in a physical operating environment. Meeting this challenge requires a

good understanding of virtualization technologies, CPU architecture, and software

implementations, and an awareness of their strengths and limitations.

Hardware Level Virtualization

Hardware level virtualization is a mechanism of virtualizing the system hardware

resources such as CPU, memory, and I/O, and creating multiple execution

environments on a single system. Each of these execution environments runs an

instance of the operating system.

A hardware level virtualization implementation typically consists of several virtual

machines (VMs), as shown in Figure 1. A layer of software, the virtual machine monitor

(VMM), manages system hardware resources and presents an abstraction of these

resources to each VM. The VMM runs in privileged mode and has full control of system

hardware. A guest operating system (GOS) runs in each VM. The relationship of a GOS to its VM is

analogous to that of a program to its process, with the VMM playing the role of the OS.


Figure 1. In hardware level virtualization, the VMM software manages hardware resources and presents an abstraction of these resources to one or more virtual machines.

Hardware resource virtualization can take the form of sharing, partitioning, or

delegating:

• Sharing — Resources are shared among VMs. The VMM coordinates the use of

resources by VMs. For example, the VMM may include a CPU scheduler to run threads

of VMs based on a pre-determined scheduling policy and VM priority.

• Partitioning — Resources are partitioned so that each VM gets the portion of

resources allocated to it. Partitioning can be dynamically adjusted by the VMM based

on the utilization of each VM. Examples of resource partitioning include the

ballooning memory technique employed in Sun xVM Server and VMware, and the

allocation of CPU resources in Logical Domains technology.

• Delegating — With delegating, resources are not directly accessible by a VM.

Instead, all resource accesses are made through a control VM that has direct access

to the resource. I/O devices are normally virtualized via delegation.

The distinction and boundaries between the virtualization methods are often not clear.

For example, sharing may be used for one component and partitioning used in others,

and together they make up an integral functional module.

Benefits of Hardware Level Virtualization

Hardware level virtualization allows multiple operating systems to run on a single

server system. This ability offers many benefits that are not available in a single OS

server. These benefits can be summarized in three functional categories:

• Workload Consolidation

According to Gartner [17], “Intel servers running at 10 percent to 15 percent

utilization are common.” Many IT organizations run out and buy a new server every

time they deploy a new application. With virtualization, computers no longer have to

be dedicated to a particular task. Applications and users can share computing

resources, remaining blissfully unaware that they are doing so. Companies can shift

computing resources around to meet demand at a given time, and get by with less

infrastructure overall. When used for consolidation, virtualization can also save

hardware and maintenance expenses, floor space, cooling costs, and power

consumption.

• Workload Migration

Hardware level virtualization decouples the OS from the underlying physical platform

resources. A guest OS state, along with the user applications running on top of it, can

be encapsulated into an entity and moved to another system. This capability is useful

for migrating a legacy OS system from an old under-powered server to a more

powerful server while preserving the investment in software. When a server needs to

be maintained, a VM can be dynamically migrated to a new server with no downtime,

further enhancing availability. Changes in workload intensity levels can be addressed

by dynamically shifting underlying resources to the starving VMs. Legacy applications

that ran natively on a server continue to run on the same OS running inside a VM,

leveraging the existing investment in applications and tools.

• Workload Isolation

Workload isolation includes fault and security isolation. Multiple guest OSes run

independently, and thus a software failure in one VM does not affect other VMs.

However, the VMM layer introduces a single point of failure that can bring down all

VMs on the system. A VMM failure, although potentially catastrophic, is less probable

than a failure in the OS because the VMM is much less complex than an

OS.

Multiple VMs also provide strong security isolation among themselves with each VM

running an independent OS. Security intrusions are confined to the VM in which they

occur. The boundary around each VM is enforced by the VMM and the inter-domain

communication, if provided by the VMM, is restricted to specific kernel modules only.

One distinct feature of hardware level virtualization is the ability to run multiple

instances of heterogeneous operating systems on a single hardware platform. This

feature is important for the following reasons:

• Better security and fault containment among application services can be achieved

through OS isolation.

• Applications written for one OS can run on a system that supports a different OS.

• Better management of system resource utilization is possible among the virtualized

environments.

Scope

This paper explores the underlying hardware architecture and software implementation

for enabling hardware virtualization. Great emphasis has been placed on the CPU

hardware architecture limitations for virtualizing CPU services and their software

workarounds. In addition, this paper discusses in detail the software architecture for

implementing the following types of virtualization:


• CPU virtualization — uses processor privileged mode to control resource usage by

the VM, and relays hardware traps and interrupts to VMs

• Memory virtualization — partitions physical memory among multiple VMs and

handles page translations for each VM

• I/O virtualization — uses a dedicated VM with direct access to I/O devices to provide

device services

The paper is organized into three sections. Section I, Background Information, contains

information on VMMs and provides details on the x86 and SPARC processors:

• “Virtual Machine Monitor Basics” on page 9 discusses the core of hardware

virtualization, the VMM, as well as requirements for the VMM and several types of

VMM implementations.

• “The x86 Processor Architecture” on page 21 describes features of the x86 processor

architecture that are pertinent to virtualization.

• “SPARC Processor Architecture” on page 29 describes features of the SPARC processor

that affect virtualization implementations.

Section II, Hardware Virtualization Implementations, provides details on the Sun xVM

Server, Logical Domains, and VMware implementations:

• “Sun xVM Server” on page 39 discusses a paravirtualized Solaris OS that is based on

an open source VMM implementation for x86 [6] processors and is planned for

inclusion in a future Solaris release.

• “Sun xVM Server with Hardware VM (HVM)” on page 63 continues the discussion of

Sun xVM Server for the x86 processors that support hardware virtual machines: Intel-

VT and AMD-V.

• “Logical Domains” on page 79 discusses Logical Domains (LDoms), supported on Sun

servers that utilize UltraSPARC T1 or T2 processors, and describes Solaris OS support

for this feature.

• “VMware” on page 97 discusses the VMware implementation for the VMM.

Section III, Additional Information, contains a concluding comparison, references, and

appendices:

• “VMM Comparison” on page 109 presents a summary of the VMM implementations

discussed in this paper.

• “References” on page 111 provides a comprehensive listing of related references.

• “Terms and Definitions” on page 113 contains a glossary of terms.

• “Author Biography” on page 117 provides information on the author.


Section I

Background Information

• Chapter 2: Virtual Machine Monitor Basics (page 9)

• Chapter 3: The x86 Processor Architecture (page 21)

• Chapter 4: SPARC Processor Architecture (page 29)


Chapter 2

Virtual Machine Monitor Basics

At the heart of hardware level virtualization is the VMM. The VMM is a software layer

that abstracts computer hardware resources so that multiple OS instances can run on a

physical system. Hardware resources are normally controlled and managed by the OS.

In a virtualized environment the VMM takes this role, managing and coordinating

hardware resources. In terms of definition, there is no clear boundary between an OS and

the VMM. The division of functions between the OS and the VMM can be

influenced by factors such as processor architecture, performance, OS, and non-

technical requirements such as ease of installation and migration.

Certain VMM requirements exist for running multiple OS instances on a system. These

requirements, discussed in detail in the next section, stem primarily from processor

architecture design that is inherently an impediment to hardware virtualization. Based

on these requirements, two types of VMMs have emerged, each with distinct

characteristics in defining the relationship between the VMM and an OS. This

relationship determines the privilege level of the VMM and an OS, and the control and

sharing of hardware resources.

VMM Requirements

A software program communicates with the computer hardware through instructions.

Instructions, in turn, operate on registers and memory. If any of the instructions,

registers, or memory involved in an action is privileged, that instruction results in a

privileged action. Sometimes an action, which is not necessarily privileged, attempts to

change the configuration of resources in the system. Subsequently, this action would

impact other actions whose behavior or result depends on the configuration of

resources. The instructions that result in such operations are called sensitive

instructions.

In the context of the virtualization discussion, a processor's instructions can be

classified into three groups:

• Privileged instructions are those that trap if the processor is in non-privileged mode

and do not trap if it is in privileged mode.

• Sensitive instructions are those that change or reference the configuration of

resources (memory), affect the processor mode without going through the memory

trap sequence (page fault), or reference the sensitive registers whose contents

change when the processor switches to run another VM.

• Non-privileged and non-sensitive instructions are those that do not fall into either

the privileged or sensitive categories described above.


Sensitive instructions have “a major bearing on the virtualizability of a machine” [1]

because of their system-wide impact. In a virtualized environment, a GOS should only

contain non-privileged and non-sensitive instructions.

If sensitive instructions are a subset of privileged instructions, it is relatively easy to

build a VM because all sensitive instructions will result in a trap. In this case a VMM can

be constructed to catch all traps that result from execution of sensitive instructions by a

GOS. All privileged and sensitive actions from VMs would be caught by the VMM, and

resources could be allocated and managed accordingly (a technique called trap-and-

emulate). A GOS's trap handler could then be called by the VMM trap handler to

perform the GOS-specific actions for the trap.
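The condition described above (every sensitive instruction is also privileged) can be expressed as a simple check over an instruction set. The following is a minimal sketch in Python; the instruction names and their classification are hypothetical and serve only to illustrate the test, not to describe any real processor.

```python
# Toy model: each instruction is tagged as privileged (traps in non-privileged
# mode) and/or sensitive (changes or references resource configuration).
# The names and tags below are illustrative assumptions, not an ISA reference.
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    privileged: bool
    sensitive: bool


def untrapped_sensitive(isa):
    """Return names of sensitive instructions that are not privileged."""
    return sorted(i.name for i in isa if i.sensitive and not i.privileged)


def trap_and_emulate_suffices(isa):
    """True when every sensitive instruction is also privileged."""
    return not untrapped_sensitive(isa)


CLEAN_ISA = [
    Instr("add", privileged=False, sensitive=False),
    Instr("load_page_table_base", privileged=True, sensitive=True),
    Instr("halt", privileged=True, sensitive=True),
]
# One sensitive instruction that executes without trapping breaks the property.
LEAKY_ISA = CLEAN_ISA + [Instr("read_status_flags", privileged=False, sensitive=True)]
```

With these toy sets, `trap_and_emulate_suffices(CLEAN_ISA)` holds, while `untrapped_sensitive(LEAKY_ISA)` lists the instruction that a VMM would have to handle by other means.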

If a sensitive instruction is non-privileged, its execution by a

VM does not trap and goes unnoticed by the VMM. Robin and Irvine [3] identified several x86 instructions in this

category. These instructions cannot be safely executed by a GOS as they can impact the

operations of other VMs or adversely affect the operation of its own GOS. Instead, these

instructions must be substituted by the VMM service. The substitution can be in the

form of an API for the GOS to call, or a dynamic conversion of these instructions to

explicit processor traps.
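The second form of substitution, converting sensitive non-privileged instructions into explicit traps, can be illustrated with a toy instruction stream. This is a sketch under stated assumptions: the mnemonics, the `TRAP` marker, and the stub VMM log are all hypothetical, and the code does not represent how any particular product performs the conversion.

```python
# Toy "dynamic conversion": rewrite a guest instruction stream so that every
# sensitive-but-unprivileged instruction becomes an explicit trap to the VMM.
# Mnemonics and the TRAP encoding are hypothetical.
UNTRAPPED_SENSITIVE = {"read_status_flags", "store_descriptor_table"}


def convert(stream):
    """Return a new stream with untrapped sensitive instructions replaced by traps."""
    out = []
    for op in stream:
        if op in UNTRAPPED_SENSITIVE:
            out.append(("TRAP", op))  # the VMM emulates `op` on the guest's behalf
        else:
            out.append(op)            # safe to run directly on the processor
    return out


def run(stream, vmm_log):
    """Execute a converted stream; traps go to a stub VMM that only logs them."""
    executed = []
    for op in stream:
        if isinstance(op, tuple) and op[0] == "TRAP":
            vmm_log.append(op[1])
        else:
            executed.append(op)
    return executed
```

After conversion, the guest never executes an untrapped sensitive instruction directly; each one reaches the VMM, which restores the trap-and-emulate property in software.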

Types of VMM

In a virtualized environment, the VMM controls the hardware resources. VMMs can be

categorized into two types, based on this control of resources:

• Type I — maintains exclusive control of hardware resources

• Type II — leverages the host OS by running inside the OS kernel

The Type I VMM [3] has several distinct characteristics: it is the first software to run

(besides BIOS and the boot loader), it has full and exclusive control of system hardware,

and it runs in privileged mode directly on the physical processor. The GOS on a Type I

VMM implementation runs in a less privileged mode than the VMM to avoid conflicts

managing the hardware resources.

An example of a Type I VMM is Sun xVM Server. Sun xVM Server includes a bundled

VMM, the Sun xVM Hypervisor for x86. The Sun xVM Hypervisor for x86 is the first

software, besides the BIOS and boot loader, to run during boot, as shown in the GRUB

menu.lst file:

title Sun xVM Server
kernel$ /boot/$ISADIR/xen.gz
module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix
module$ /platform/i86pc/$ISADIR/boot_archive


The GRUB bootloader first loads the Sun xVM Hypervisor for x86, xen.gz. After the

VMM gains control of the hardware, it loads the Solaris kernel,

/platform/i86xpv/kernel/$ISADIR/unix, to run as a GOS.

Sun's Logical Domains and VMware's Virtual Infrastructure 3 [4] (formerly known as

VMware ESX Server), described in detail in Chapter 7 “Logical Domains” on page 79 and

Chapter 8 “VMware” on page 97, are also Type I VMMs.

A Type II VMM typically runs inside a host OS kernel as an add-on module, and the host

OS maintains control of the hardware resources. The GOS in a Type II VMM is a process

of the host OS. A Type II VMM leverages the kernel services of the host OS to access

hardware, and intercepts a GOS's privileged operations and performs these operations

in the context of the host OS. Type II VMMs have the advantage of preserving the

existing installation by allowing a new GOS to be added to a running OS.

An example of a Type II VMM is VMware's VMware Server (formerly known as VMware

GSX Server).

Figure 2 illustrates the relationships among hardware, VMM, GOS, host OS, and user

application in virtualized environments.

Figure 2. Virtual machine monitors vary in how they support guest OS, host OS, and user applications in virtualized environments.

VMM Architecture

As discussed in “VMM Requirements” on page 9, the VMM performs some of the

functions that an OS normally does: namely, it controls and arbitrates CPU and memory

resources, and provides services to upper layer software for sensitive and privileged

operations. These functions require the VMM to run in privileged mode and the OS to

relinquish the privileged and sensitive operations to the VMM. In addition to processor

and memory operations, I/O device support also has a large impact on VMM

architecture.

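[Figure 2 diagram: three stacks shown side by side. Physical Server: applications in user space (unprivileged mode) over the OS (privileged mode) over platform hardware. Type I VMM Server: applications over GOS instances over the VMM over platform hardware. Type II VMM Server: applications and GOS instances over VMMs, together with native applications, over a host OS over platform hardware.]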


VMM in Privileged Mode

A processor typically has two or more privilege modes. The operating system kernel

runs in the privileged mode. The user applications run in a non-privileged mode and

trap to the kernel when they need to access system resources or services from the

kernel.

The GOS normally assumes it runs in the most privileged mode of the processor.

Running a VMM in a privileged mode can be accomplished with one of the following

three methods:

• Deprivileging the GOS — This method usually requires a modification to the OS to

run at a lower privilege level. For x86 systems, the OS normally runs at protected ring

0, the most privileged level. In Sun xVM Server, ring 0 is reserved to run the VMM.

This requires the GOS to be modified, or paravirtualized, to run outside of ring 0 at a

lower privilege level.

• Hyperprivileging the VMM — Instead of changing the GOS to run at lower privilege,

another approach taken by the chip vendors is to create a hyperprivileged processor

mode for the VMM. The hyperprivileged mode of the Sun UltraSPARC T1 and T2 processors

[2], Intel-VT's VMX-root operation (see [7] Volume 3B, Chapter 19), and AMD-V’s

VMRUN-Exit state (see [9] Chapter 15) are examples of hyperprivileged processor modes for

VMM operations.

• Both VMM and GOS run in same privileged mode — It is possible to have both the

VMM and GOS run in the same privileged mode. In this case, the VMM intercepts all

privileged and sensitive operations of a GOS before passing them to the processor. For

example, VMware allows both the GOS and the VMM to run in privileged mode.

VMware dynamically examines each instruction to decide whether the processor

state and the segment reversibility (see “Segmented Architecture” on page 23) allow

the instruction to be executed directly without the involvement of the VMM. If the

GOS is in privileged mode or the code segment is non-reversible, the VMM performs

necessary conversions of the core execution path.

Removing Sensitive Instructions in the GOS

Privileged and sensitive operations are normally executed by the OS kernel. In a

virtualized environment, the GOS has to relinquish the privileged and sensitive

operations to the VMM. This is accomplished by one of the following approaches:

• Modifying the GOS source code to use the VMM services for handling sensitive

operations (paravirtualization)

This method is used by Sun xVM Server and Sun's Logical Domains (LDoms). Sun xVM

Server and LDoms provide a set of hypercalls for an OS to request VMM services. The

VMM-aware Solaris OS uses these hypercalls to replace its sensitive instructions.


• Dynamically translating the GOS sensitive instructions by software

As described in a previous section, VMware uses binary translation to replace the GOS

sensitive instructions with VMM instructions.

• Dynamically translating the GOS sensitive instructions by hardware

This method requires the processor to provide a special mode of operation that is

entered when a sensitive instruction is executed in a reduced privilege mode.

The first approach, which involves modifying the GOS source code, is called

paravirtualization, because the VMM provides only partial virtualization of the

processor. The GOS must replace its sensitive and privileged operations with the VMM

service. The remaining two approaches provide full virtualization to the VM, enabling

the GOS to run without modification.

In addition to OS modification, performance requirements, processor architecture

design, tolerance of a single point of failure, and support for legacy OS installations

have an impact on the design of VMM architecture.

Physical Memory Virtualization

Memory management by the VMM involves two tasks: partitioning physical memory

for VMs, and supporting page translations in a VM.

Each OS assumes physical memory starts from page frame number (PFN) 0 and is

contiguous to the size configured for that VM. An OS uses physical addresses in

operations like page table updates and Direct Memory Access (DMA). In reality, the

memory exported to a VM may not start at PFN 0 and may not be

contiguous.

The VMM virtualizes physical addresses by adding another layer

of addressing, the machine address (MA). Within a GOS, a virtual address

(VA) is used by applications, and a physical address (PA) is used by the OS in DMA and

page tables. The VMM maps a PA from a VM to an MA, which is used by the hardware. The

VMM maintains translation tables, one for each VM, for mapping PAs to MAs.

Figure 3 depicts the scheme to partition machine memory to physical memory for each

VM.


Figure 3. Example physical-to-machine memory mapping.
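The per-VM translation table described above can be sketched as a simple array indexed by guest page frame number. This is an illustrative sketch only: the class name `P2MTable`, the 4 KB page size, and the frame numbers are assumptions chosen for the example.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class P2MTable:
    """Per-VM physical-to-machine table: index is the guest PFN, value is the machine PFN."""

    def __init__(self, machine_pfns):
        self.mfn_of = list(machine_pfns)

    def pa_to_ma(self, pa):
        """Translate a guest physical address to a machine address."""
        pfn, offset = divmod(pa, PAGE_SIZE)
        if pfn >= len(self.mfn_of):
            raise ValueError("physical address outside this VM's memory")
        return self.mfn_of[pfn] * PAGE_SIZE + offset


# Both VMs see physical memory starting at PFN 0. The machine frames behind
# them differ and need not be contiguous.
vm0 = P2MTable([8, 9, 12])
vm1 = P2MTable([3, 20])
```

Each GOS addresses its memory from PFN 0 upward, while the VMM is free to back those frames with any machine frames it has allocated to that VM.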

A ballooning technique [5] has been used in some virtualization products to achieve

better utilization of physical memory among VMs. The idea behind the ballooning

technique is simple. The VMM controls a balloon module in a GOS. When the VMM

wants to reclaim memory, it inflates the balloon to increase pressure on memory,

forcing the GOS to page out memory to disk. If the demand for physical memory

decreases, the VMM deflates the balloon in a VM, enabling the GOS to claim more

memory.
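The inflate and deflate behavior can be modeled with simple page accounting. The sketch below is a toy model, not the balloon driver of any product: the `Guest` class, its fields, and the page counts are hypothetical.

```python
class Guest:
    """Toy guest with a fixed page allocation and a balloon driver."""

    def __init__(self, total_pages, in_use):
        self.total = total_pages
        self.in_use = in_use   # pages holding guest data
        self.balloon = 0       # pages pinned by the balloon and handed to the VMM
        self.paged_out = 0     # pages the guest had to push out to disk

    def free_pages(self):
        return self.total - self.in_use - self.balloon

    def inflate(self, pages):
        """VMM reclaims `pages`; the guest pages out data if free memory is short."""
        pages = min(pages, self.total - self.balloon)
        shortfall = max(0, pages - self.free_pages())
        self.in_use -= shortfall
        self.paged_out += shortfall
        self.balloon += pages
        return pages           # pages now available to the VMM

    def deflate(self, pages):
        """VMM returns up to `pages` to the guest."""
        pages = min(pages, self.balloon)
        self.balloon -= pages
        return pages
```

In this model, a guest with 100 pages and 70 in use that is asked to give up 40 pages must page out 10 of them; deflating the balloon later makes those frames available to the guest again.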

Page Translations Virtualization

Access to the processor's page translation hardware is a privileged operation, and this

operation is performed by the privileged VMM. Exactly what the VMM needs to perform

depends on the processor architecture. For example, x86 hardware automatically loads

translations from the page table to the Translation Lookaside Buffer (TLB). The software

has no control of loading page translations to the TLB. Therefore, the VMM is

responsible for updating the page table that is seen by the hardware. The SPARC

processor uses software through traps to load page translations to the TLB. A GOS

maintains its page tables in its own memory, and the VMM gets page translations from

the VM and loads them to the TLB.

VMMs typically provide the following two methods to support page translations:

• Hypervisor calls — The GOS makes a call to the VMM for page translation

operations. This method is commonly used by paravirtualized OSes, as it provides

better performance.

• Shadow page table — The VMM maintains an independent copy of page tables,

called shadow page tables, derived from the guest page tables. When a page fault occurs,

the VMM propagates changes that the GOS made to its page table into the shadow page

table. This method is commonly used by VMMs that support full virtualization, as the

GOS continues to update its own page table and the synchronization of the guest

page table and the shadow page table is handled by the VMM when page faults

occur.
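The shadow page table method can be sketched as two dictionaries that the VMM synchronizes lazily on a fault. This is a simplified illustration: the `ShadowPager` class and its method names are hypothetical, page tables are flat rather than multi-level, and the sketch does not handle the case where the GOS changes or removes an existing mapping, which a real VMM must also detect.

```python
class ShadowPager:
    """Toy shadow paging: the guest table maps VPN to guest PFN,
    the shadow table maps VPN to machine PFN."""

    def __init__(self, p2m):
        self.p2m = p2m        # guest PFN -> machine PFN (per-VM table)
        self.guest_pt = {}    # written freely by the GOS
        self.shadow_pt = {}   # the table the hardware walks; owned by the VMM

    def guest_map(self, vpn, pfn):
        """GOS updates its own page table; the VMM is not involved at this point."""
        self.guest_pt[vpn] = pfn

    def access(self, vpn):
        """Return the machine PFN for vpn, syncing the shadow table on a fault."""
        if vpn not in self.shadow_pt:          # hardware page fault
            if vpn not in self.guest_pt:
                raise KeyError("true guest page fault: forward to the GOS")
            self.shadow_pt[vpn] = self.p2m[self.guest_pt[vpn]]
        return self.shadow_pt[vpn]
```

The GOS runs unmodified in this scheme: it writes guest PFNs into its own table, and the machine PFNs appear only in the shadow table that the VMM fills in when the first access faults.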

Figure 4 shows three different page translation implementations in the Solaris OS on

x86 and SPARC platforms.

1. The paravirtualized Sun xVM Server uses the following approach on x86 platforms:

[1] The GOS uses the hypervisor call method to update the page tables

maintained by the VMM.

2. The Sun xVM Server with HVM and VMware use the following approach:

[2a] The GOS maintains its own guest page table. The synchronization between

the guest page table and the hardware page table (shadow page table) is

handled by the VMM when page faults occur.

[2b] The x86 CPU loads the page translation from the hardware page table to

the TLB.

3. On SPARC systems, the Solaris OS uses the following approach for Logical Domains:

[3a] The GOS maintains its own page table. The GOS takes an entry from the

page table as an argument to the hypervisor call that loads the translations

to the TLB.

[3b] The VMM gets the page translation from the GOS and loads the translation

to the TLB.

Figure 4. Page translation schemes used on x86 and SPARC architectures.
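Steps 3a and 3b, in which the GOS passes a translation from its own page table to the VMM and the VMM loads it into the TLB, can be sketched as follows. The names `Hypervisor`, `hv_map`, and `guest_tlb_miss`, the two-entry TLB, and the oldest-first eviction are all assumptions made for illustration; they do not correspond to the actual hypervisor interface.

```python
class Hypervisor:
    """Toy VMM for a software-loaded TLB: the GOS supplies a translation
    and the VMM validates and installs it."""

    def __init__(self, p2m, tlb_size=2):
        self.p2m = p2m          # guest PFN -> machine PFN
        self.tlb = {}           # VPN -> machine PFN, in insertion order
        self.tlb_size = tlb_size

    def hv_map(self, vpn, guest_pfn):
        """Hypothetical hypercall: check ownership, then load the entry into the TLB."""
        if guest_pfn not in self.p2m:
            raise PermissionError("guest PFN not owned by this VM")
        if vpn not in self.tlb and len(self.tlb) >= self.tlb_size:
            self.tlb.pop(next(iter(self.tlb)))   # evict the oldest entry
        self.tlb[vpn] = self.p2m[guest_pfn]


def guest_tlb_miss(hv, guest_pt, vpn):
    """GOS miss handler: look up its own page table (3a), then call the VMM (3b)."""
    hv.hv_map(vpn, guest_pt[vpn])
    return hv.tlb[vpn]
```

The point of the sketch is the division of labor: the page table stays in guest memory, and the VMM touches only the privileged TLB state after checking that the requested frame belongs to the calling VM.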

The memory management implementation for Sun xVM Server, Sun xVM Server with

HVM, VMware, and Logical Domains using these mechanisms is discussed in detail in

later sections of this paper.

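[Figure 4 diagram: two panels, each spanning the GOS, VMM, and hardware layers. X86 Page Translations: paths 1, 2a, and 2b, with HV calls, guest page table, HW page table, and TLB. SPARC Page Translations: paths 3a and 3b, with guest page table, HV calls, TLB operations, and TLB.]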


I/O Virtualization

I/O devices are typically managed by a special software module called the device driver

running in the kernel context. Because device types and device drivers vary so widely,

the VMM either includes few device drivers or leaves device

management entirely to the GOS. In the latter case, because of existing device

architecture limitations (discussed later in the section), devices can only be exclusively

managed by one VM.

This constraint creates some challenges for I/O access by a VM, and limits the

following:

• What devices are exported to a VM

• How devices are exported to a VM

• How each I/O transaction is handled by a VM and the VMM

Consequently, I/O has the most challenges in the areas of compatibility and

performance for virtual machines. In order to explain what devices are exported and

how they are exported, it is first necessary to understand the options available to

handle I/O transactions in a VM.

There are, in general, three approaches for I/O virtualization, as illustrated in Figure 5:

• Direct I/O (VM1 and VM3)

• Virtual I/O using I/O transaction emulation (VM2)

• Virtual I/O using device emulation (VM4)

Figure 5. Different I/O virtualization techniques used by virtual machine monitors.

For direct I/O, the VMM exports all or a portion of the physical devices attached to the

system to a VM, and relies on VMs to manage devices. The VM that has direct I/O

access uses the existing driver in the GOS to communicate directly with the device.

VM1 and VM3 in Figure 5 have direct I/O access to devices. VM1 is also a special I/O VM

that provides virtual I/O for other VMs, such as VM2, to access devices.

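[Figure 5 diagram: four VMs above a VMM on a Sun x64 server with a network chip and a SCSI controller. VM1 (the I/O VM) runs I/O transaction emulation and a native driver with direct I/O; VM2 runs a virtual driver and reaches devices through the I/O VM; VM3 runs a native driver with direct I/O; VM4 runs a native or virtual driver and reaches devices through device emulation and a device driver in the VMM.]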


Virtual I/O is made possible by controlling the device types exported to a VM. There are

two different methods of implementing virtual I/O: I/O transaction emulation (shown

in VM2 in Figure 5) and device emulation (shown in VM4).

• I/O transaction emulation requires virtual drivers on both ends for each type of I/O

transaction (data and control functions). As shown in Figure 5, the virtual driver on

the client side (VM2) receives I/O requests from applications and forwards requests

through the VMM to the virtual driver on the server side (VM1); the virtual driver on

the server side then sends out the request to the device.

I/O transaction emulation is typically used in paravirtualization because the OS on

the client side needs to include the special drivers to communicate with its

corresponding driver in the OS on the server side, and needs to add kernel interfaces

for inter-domain communication using the VMM services. However, it is possible to

have PV drivers in an un-paravirtualized OS (full virtualization) for better I/O

performance. For example, Solaris 10, which is not paravirtualized, can include PV

drivers on an HVM-capable system to get better performance than that achieved using

device emulation drivers such as QEMU. (See “Sun xVM Server with HVM I/O

Virtualization (QEMU)” on page 71.)

I/O transaction emulation may cause application compatibility issues if the virtual

driver does not provide all data and control functions (for example, ioctl(2)) that

the existing driver does.

• Device emulation provides an emulation of a device type, enabling the existing

driver for the emulated device in a GOS to be used. The VMM exports emulated

device nodes to a VM so that the existing drivers for the emulated devices in a GOS

are used. By doing this, the VMM controls the driver used by a GOS for a particular

device type; for example, using the e1000g driver for all network devices. Thus, the

VMM can focus on the emulation of underlying hardware using one driver interface.

Driver accesses to I/O registers and ports in a GOS, which result in a trap due to an invalid address, are caught and converted into accesses to the real device hardware. VM4 in Figure 5 uses native OS drivers to access emulated devices exported by the VMM.

Device emulation is in general less efficient than I/O transaction emulation and more limited in the platforms it supports. Device emulation does not require changes in the GOS and, therefore, is typically used to provide full virtualization to a VM.
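The I/O transaction emulation path can be sketched as a pair of cooperating virtual drivers. The Python model below is illustrative only: the class names (FrontendDriver, BackendDriver, Channel, NativeDriver) and the synchronous call from front end to back end are assumptions, and no real inter-domain transport is modeled. It shows a request travelling from the client-side driver to the server-side driver and on to the native driver, and how an unimplemented control function surfaces as a compatibility problem.

```python
from collections import deque


class NativeDriver:
    """Stands in for the existing device driver in the I/O VM."""

    def __init__(self):
        self.blocks = {}

    def write(self, lba, data):
        self.blocks[lba] = data
        return "ok"

    def read(self, lba):
        return self.blocks.get(lba, b"")


class Channel:
    """Stands in for the inter-domain transport provided by the VMM."""

    def __init__(self):
        self.requests = deque()


class BackendDriver:
    """Server-side virtual driver (VM1 in Figure 5)."""

    SUPPORTED = {"read", "write"}  # control functions such as ioctl are not implemented here

    def __init__(self, channel, native):
        self.channel = channel
        self.native = native

    def service(self):
        op, args = self.channel.requests.popleft()
        if op not in self.SUPPORTED:
            return "unsupported"
        return getattr(self.native, op)(*args)


class FrontendDriver:
    """Client-side virtual driver (VM2 in Figure 5)."""

    def __init__(self, channel, backend):
        self.channel = channel
        self.backend = backend

    def submit(self, op, *args):
        self.channel.requests.append((op, args))
        # A real system would notify the other VM asynchronously through the VMM.
        return self.backend.service()


channel = Channel()
frontend = FrontendDriver(channel, BackendDriver(channel, NativeDriver()))
write_status = frontend.submit("write", 0, b"data")
read_back = frontend.submit("read", 0)
ioctl_status = frontend.submit("ioctl", 0x1234)
```

With device emulation, by contrast, the GOS would keep its unmodified driver and the VMM would intercept the driver's register accesses instead of receiving explicit requests.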

Virtual I/O, unlike direct I/O, requires additional drivers in either the I/O VM or the

VMM to provide I/O virtualization. This constraint:

• Limits the type of devices that are made available to a VM

• Limits device functionality

• Causes significant I/O performance overhead

While virtualization provides full application binary compatibility, I/O becomes a

trouble area in terms of application compatibility and performance in a VM. One


solution to the I/O virtualization issues is to allow VMs to directly access I/O, as shown

by VM3 in Figure 5.

Direct I/O access by VMs requires additional hardware support to ensure device

accesses by a VM are isolated and restricted to resources owned by the assigned VM. In

order to understand the industry effort to allow an I/O device to be shared among VMs,

it is necessary to examine device operations from an OS point of view.

The interactions between an OS and a device consist, in general, of three operations:

1. Programmed I/O (PIO): host-initiated data transfer. In PIO, a host OS maps a virtual address to a piece of device memory and accesses the device memory using CPU load/store instructions.

2. Direct Memory Access (DMA): device-initiated data transfer without CPU involvement. In DMA, a host OS writes an address in its memory and the transfer size to a device's DMA descriptor. After receiving an enable-DMA instruction from the host driver, the device performs the data transfer at a time it chooses and uses an interrupt to notify the host OS of DMA completion.

3. Interrupt: a device-generated asynchronous event notification.

Interrupts are already virtualized by all VMM implementations, as shown in the later discussions of Sun xVM Server, Logical Domains, and VMware. The challenge of I/O sharing among VMs therefore lies in device handling for PIO and DMA. To meet these challenges, the PCI-SIG has released a suite of I/O virtualization (IOV) specifications for PCI Express (PCIe) devices, in particular the “Single Root I/O Virtualization and Sharing Specification” (SRIOV) [35] for device sharing and PIO operation, and the “Address Translation Services (ATS)” specification [30] for DMA operation.

Device Configuration and PIO

A PCI device exports its memory to the host through Base Address Registers (BARs) in its

configuration space. A device's configuration space is identified in the PCI configuration

address space as shown in Figure 6.

Figure 6. PCI configuration address space.
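As a rough illustration of the layout in Figure 6, a configuration address can be packed from its bus, device, function, and register fields. The function name pci_config_address and the choice to pass the register as a byte offset are assumptions for this sketch; the field positions (bus in bits 23-16, device in bits 15-11, function in bits 10-8, register in bits 7-2, with bits 1-0 zero) follow the figure.

```python
def pci_config_address(bus, device, function, register):
    """Pack a PCI configuration address using the Figure 6 field layout.

    `register` is a byte offset into the 256-byte configuration header and
    must be dword aligned, so bits 1-0 of the result are always 00.
    """
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus, device, or function out of range")
    if not 0 <= register < 256 or register % 4:
        raise ValueError("register must be a dword-aligned offset below 256")
    return (bus << 16) | (device << 11) | (function << 8) | register


BAR0_OFFSET = 0x10  # first BAR in the configuration header
address = pci_config_address(bus=2, device=3, function=1, register=BAR0_OFFSET)
```

The three-bit function field is what limits a device to 8 functions.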

A PCI device can have up to 8 physical functions (PFs). Each PF has its own 256-byte configuration header. The BARs of a PCI function, which are 32 bits wide, are located at offsets 0x10-0x24 in the configuration header. The host gets the size of the memory region mapped by a BAR by writing a value of all 1's to the BAR and then reading the value back. The address written to a BAR is the assigned starting address of the memory region mapped to the BAR.
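The BAR sizing handshake can be sketched with a toy device model. SimulatedBar and probe_bar_size are hypothetical names, and the model assumes a 32-bit memory BAR whose low four bits hold type flags rather than address bits. The sketch shows why writing all 1's reveals the region size: the device implements only the address bits above its region size, so the bits that read back as zero encode the size.

```python
MASK32 = 0xFFFFFFFF
FLAG_BITS = 0xF  # low four bits of a memory BAR carry type flags, not address bits


class SimulatedBar:
    """Toy 32-bit memory BAR for a region of `size` bytes (a power of two, at least 16)."""

    def __init__(self, size, flags=0):
        self.size = size
        self.flags = flags & FLAG_BITS
        self.value = 0

    def write(self, value):
        # Address bits below the region size are hardwired to zero.
        self.value = value & MASK32 & ~(self.size - 1)

    def read(self):
        return self.value | self.flags


def probe_bar_size(bar):
    """Return the size of the region behind `bar`, preserving its assigned address."""
    original = bar.read()
    bar.write(MASK32)          # write all 1's
    readback = bar.read()
    bar.write(original)        # restore the assigned starting address
    return (~(readback & ~FLAG_BITS) & MASK32) + 1


bar = SimulatedBar(size=0x4000, flags=0x8)
bar.write(0xF0000000)          # host assigns the starting address
size = probe_bar_size(bar)
```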

    Bits 31-24   Reserved
    Bits 23-16   Bus Number
    Bits 15-11   Device Number
    Bits 10-8    Function Number
    Bits 7-2     Register Number
    Bits 1-0     00


To allow multiple VMs to share a PF, the SRIOV specification introduces the notion of a

Virtual Function (VF). Each VF shares some common configuration header fields with

the PF and other VFs. The VF BARs are defined in the PCIe's SRIOV extended capabilities

structure. A VF contains a set of non-shared physical resources, such as work queue and

data buffer, which are required to deliver function specific services. These resources are

exported through the VF BARs and are directly accessible by a VM.

The starting address of a VF's memory space is derived from the first VF's memory

space address and the size of VF's BAR. For any given VFx, the starting address of its

memory space mapped to BARa is calculated according to the following formula:

where addr(VF1, BARa) is the starting address of BARa for the first VF and (VF BARa aperture size) is the size of the VF BARa as determined by writing a value of all 1's to BARa and reading the value back. Using this mechanism, a GOS in a VM is able to share the device with other VMs while performing device operations that pertain only to that VM.
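The address calculation is simple arithmetic: the memory space of VFx starts (x - 1) apertures above that of the first VF. A minimal Python sketch follows; the function name vf_bar_address and the example addresses are assumptions chosen for illustration.

```python
def vf_bar_address(vf1_bar_address, vf_index, aperture_size):
    """Starting address of BARa for VF number `vf_index` (VFs are numbered from 1)."""
    if vf_index < 1:
        raise ValueError("VF numbering starts at 1")
    return vf1_bar_address + (vf_index - 1) * aperture_size


# Example: first VF at 0xF0000000 with a 16 KB aperture per VF.
first = vf_bar_address(0xF0000000, 1, 0x4000)
third = vf_bar_address(0xF0000000, 3, 0x4000)
```

Because each VF's region is a fixed-size, non-overlapping window, the VMM can map exactly one window into each VM.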

DMA

In many current implementations (especially in most x86 platforms), physical addresses

are used in DMA. Since a VM shares the same physical address space on the system

with other VMs, a VM might read/write to another VM's memory through DMA. For

example, a device driver in a VM might write the memory contents that belong to other

VMs to a disk and read the data back into the VM's memory. This causes a potential

breach in security and fault isolation among VMs.

To provide isolation during DMA operation, the ATS specification defines a scheme for a

VM to use the address mapped to its own physical memory for DMA operation. (This

approach is used in similar designs such as IOMMU Specification [31] and DMA

Remapping [28].) This DMA ATS enables DMA memory to be partitioned into multiple

domains, and keeps DMA transactions on one domain isolated from other domains.

Figure 7 shows device DMA with and without ATS. With DMA ATS, the DMA address is

like a virtual address that is associated with a context (VM). DMA transactions initiated

by a VM can only be associated with the memory owned by the VM. DMA ATS is a

chipset function that resides outside of the processor.

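VF BAR starting address (the formula referenced in “Device Configuration and PIO”):

$$\mathrm{addr}(\mathrm{VF}_x, \mathrm{BAR}_a) = \mathrm{addr}(\mathrm{VF}_1, \mathrm{BAR}_a) + (x - 1) \times (\text{VF BAR}_a\ \text{aperture size})$$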


Figure 7. DMA with and without address translation service (ATS).

As shown in Figure 7, the physical address (PA) is used on the hardware platform

without hardware support for ATS. For platforms with hardware support for ATS, a GOS

in a VM writes either a device virtual address (DVA) or a guest physical address (GPA) to

the device’s DMA engine. The device driver in the GOS loads the mappings of either the

DVA or GPA to the host physical address (HPA) in the hardware IOMMU. The HPA is the

address understood by the memory controller.

Note – The distinction between the HPA and GPA is described in detail in later sections for Sun xVM Server (see “Physical Memory Management” on page 52), for UltraSPARC LDoms (see “Physical Memory Allocation” on page 88), and for VMware (see “Physical Memory Management” on page 103).

When the device performs a DMA operation, a DVA/GPA address appears on the PCI bus

and is intercepted by the hardware IOMMU. The hardware IOMMU looks up the

mapping for the DVA/GPA, finds the corresponding HPA, and moves the PCI data to

system memory pointed to by the HPA. Since either DVA or GPA of a VM has its own

address space, ATS allows system memory for DMA to be partitioned and, thus,

prevents a VM from accessing another VM’s DMA buffer.
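The isolation property can be modeled in a few lines. In the Python sketch below, the Iommu class, the per-domain dictionaries, and the 4 KB page size are assumptions made for illustration rather than a description of any particular chipset. Each domain has its own table from I/O pages (DVA or GPA) to host physical pages, so the same I/O address resolves to different HPAs in different VMs, and an unmapped address is rejected.

```python
PAGE = 4096  # assumed page size


class Iommu:
    """Toy IOMMU with one translation table per domain (VM)."""

    def __init__(self):
        self.tables = {}  # domain -> {I/O page number: host physical page number}

    def map(self, domain, io_addr, hpa):
        self.tables.setdefault(domain, {})[io_addr // PAGE] = hpa // PAGE

    def translate(self, domain, io_addr):
        table = self.tables.get(domain, {})
        page = io_addr // PAGE
        if page not in table:
            # The DMA targets memory that this domain does not own.
            raise PermissionError(f"{domain}: no mapping for {io_addr:#x}")
        return table[page] * PAGE + io_addr % PAGE


iommu = Iommu()
iommu.map("VM1", 0x1000, 0x7A000)
iommu.map("VM2", 0x1000, 0x9C000)
vm1_hpa = iommu.translate("VM1", 0x1010)
vm2_hpa = iommu.translate("VM2", 0x1010)
```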

Legend for Figure 7: PA = Physical Address; HPA = Host Physical Address; DVA = Device Virtual Address; GPA = Guest Physical Address.

[Figure 7 diagram: two panels. In “DMA without ATS”, the PCI devices, south bridge, north bridge, and system memory all use PAs to reach the DMA buffers. In “DMA with ATS”, the PCI devices issue DVA/GPA addresses, and a south bridge with an IOMMU translates them to HPAs for the DMA buffers of VM1 and VM2 in system memory.]


Chapter 3

The x86 Processor Architecture

This chapter provides background information on the x86 processor architecture that is

relevant to later discussions on Sun xVM Server (Chapter 5 on page 39), Sun xVM Server

with HVM (Chapter 6 on page 63), and VMware (Chapter 8 on page 97).

The x86 processor was not designed to run in a virtualized environment, and the x86

architecture presents some challenges for CPU and memory virtualization. This chapter

discusses the following x86 architecture features that are pertinent to virtualization:

• Protected Mode

The protected mode in the x86 processor utilizes two mechanisms, segmentation and

paging, to prevent a program from accessing a segment or a page with a higher

privilege level. Privilege level controls how the VMM and a GOS work together to

provide CPU virtualization.

• Segmented Architecture

The x86 segmented architecture converts a program's virtual addresses into linear

addresses that are used by the paging mechanism to map into physical memory.

During the conversion, the processor's privilege level is checked against the privilege

level of the segment for the address. Because of the segment cache technique

employed by the x86 processor, the VMM must ensure segment cache consistency

with the VM descriptor table updates. This x86 feature results in a significant amount

of work for the VMM of full virtualization products such as VMware.

• Paging Architecture

The x86 paging architecture provides page translations to the TLB and page tables.

Because the loading of page translations from page table to TLB is done

automatically by hardware on the x86 platform, page table updates have to be

performed by the privileged VMM. Several mechanisms are available for updating

this “hardware” page table by a VM.

• I/O and Interrupts

A device interacts with a host processor through PIO, DMA, and interrupts. PIO in the

x86 processor can be performed through either I/O ports using special I/O

instructions or through memory-mapped addresses with general purpose MOVE and

String instructions. DMA in most x86 platforms is performed with physical

addresses. This can cause a security and isolation breach in a virtualized environment

because a VM may read/write other VMs' memory contents. Interrupts and exceptions are handled through the Interrupt Descriptor Table (IDT). There is only one IDT on the system and access to the IDT is privileged. Therefore, interrupts have to be handled by the VMM and virtualized to be delivered to a VM.


• Timer Devices

The x86 platform includes several timer devices for time keeping purposes.

Knowledge of the characteristics of these devices is important to fully understand time keeping in a VM: some timer devices are interrupt driven (and their interrupts are virtualized and can be delayed), and some require privileged access to update the device counter.

Protected Mode

The x86 architecture protected mode provides a protection mechanism to limit access

to certain segments or pages and prevent unprivileged access. The processor's

segment-protection mechanism recognizes 4 privilege levels, numbered from 0 to 3

(Figure 8). The greater the level number, the lesser the privileges provided.

The page-level protection mechanism restricts access to pages based on two privilege

levels: supervisor mode and user mode. If the processor is operating at a current

privilege level (CPL) 0, 1, or 2, it is in a supervisor mode and the processor can access all

pages. If the processor is operating at a CPL 3, it is in a user mode and the processor can

access only user level pages.

Figure 8. Privilege levels in the x86 architecture.

When the processor detects a privilege level violation, it generates a general-protection

exception (#GP). The x86 has more than 20 privileged instructions. These instructions

can be executed only when the current privilege level (CPL) is 0 (most privileged).

In addition to the CPL, the x86 has an I/O privilege level (IOPL) field in the EFLAGS

register that indicates the I/O privilege level of the currently running program. Some

instructions, while allowed to execute when the CPL is not 0, might generate a #GP

exception if the CPL value is numerically higher than the IOPL. These instructions include CLI (clear

interrupt flag), STI (set interrupt flag), IN/INS (input from port), and OUT/OUTS (output

to port).
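The two checks described above, page-level protection and the IOPL test, reduce to simple comparisons. The following Python sketch uses hypothetical function names and deliberately ignores other mechanisms, such as the I/O permission bitmap in the TSS, that also affect I/O instructions on real processors.

```python
def page_access_allowed(cpl, page_is_user):
    """Page-level protection: CPL 0, 1, and 2 are supervisor mode; CPL 3 is user mode."""
    supervisor = cpl in (0, 1, 2)
    return supervisor or page_is_user


def io_sensitive_instruction_faults(cpl, iopl):
    """CLI, STI, IN/INS, and OUT/OUTS raise #GP when CPL is numerically greater than IOPL.

    Simplified model: the TSS I/O permission bitmap is not considered.
    """
    return cpl > iopl


# A GOS kernel running at ring 1 with IOPL set to 1 can use IN/OUT directly,
# while an application at ring 3 cannot.
kernel_faults = io_sensitive_instruction_faults(cpl=1, iopl=1)
app_faults = io_sensitive_instruction_faults(cpl=3, iopl=1)
```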

In addition to the above instructions, there are many instructions [3] that, while not

privileged, reference registers or memory locations that would allow a VM to access a

memory region not assigned to that VM. These sensitive instructions will not cause a

#GP exception. The trap-and-emulate method for virtualization of a GOS, as stated in

“VMM Requirements” on page 9, does not apply to these instructions. However, these

instructions may impact other VMs.

[Figure 8 diagram: Level 0 (OS kernel), Level 1, Level 2, and Level 3 (applications).]


Segmented Architecture

In protected mode, all memory accesses must go through a logical address → linear address (LA) → physical address (PA) translation scheme. The logical address to LA translation is managed by the x86 segmentation architecture, which divides a process's address space into multiple protected segments.

A logical address, which is used as the address of an operand or of an instruction,

consists of a 16-bit segment selector and a 32-bit offset. A segment selector points to a

segment descriptor that defines the segment (see Figure 11 on page 24). The segment

base address is contained in the segment descriptor. The sum of the offset in a logical

address and the segment base address gives the LA. The Solaris OS directly maps an LA

to a process's Virtual Address (VA) by setting the segment base address to 0.

For each memory reference, a VA and a segment selector are provided to the processor

(Figure 9). The segment selector, which is loaded to the segment register, is used to

identify a segment descriptor for the address.

Figure 9. Segment Selector

Every segment register has a visible part and a hidden part, as illustrated in Figure 10 (see also [7], Volume 3A Section 3.4.3). The visible part is the segment selector, an index that points into either the global descriptor table (GDT) or the local descriptor table (LDT) to identify the descriptor from which the hidden part of the segment register is to be loaded. The hidden part holds the segment descriptor information loaded from the descriptor table.
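A selector can be decoded with shifts and masks, using the field positions given for the segment selector (index in bits 15-3, table indicator in bit 2, RPL in bits 1-0). The Python sketch below is illustrative: the names Selector, decode_selector, and linear_address are assumptions, and the 32-bit wraparound in linear_address reflects the 32-bit offsets discussed in this section.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int   # descriptor table index (up to 8K descriptors)
    table: str   # "GDT" or "LDT", from the table indicator bit
    rpl: int     # requested privilege level, 0 to 3


def decode_selector(selector):
    """Split a 16-bit segment selector into its index, table, and RPL fields."""
    if not 0 <= selector <= 0xFFFF:
        raise ValueError("a segment selector is 16 bits wide")
    return Selector(
        index=selector >> 3,
        table="LDT" if selector & 0x4 else "GDT",
        rpl=selector & 0x3,
    )


def linear_address(segment_base, offset):
    """LA = segment base + offset, truncated to 32 bits."""
    return (segment_base + offset) & 0xFFFFFFFF


decoded = decode_selector(0x2B)
flat = linear_address(0, 0x08048000)   # with a zero base, LA equals the offset (VA)
```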

Figure 10. Each segment register has a visible and a hidden part.

Address translation summary:

    Segmentation: VA + segment base address (always 0 in Solaris) → linear address
    Paging: linear address → physical address

Segment selector fields (Figure 9):

    Bits 15-3   Index (up to 8K descriptors)
    Bit 2       TI: Table Indicator (0=GDT, 1=LDT)
    Bits 1-0    RPL: Requested Privilege Level

Segment register parts (Figure 10):

    Visible     Selector
    Hidden      Type, Base Address, Limit, CPL


The hidden fields of a segment register are loaded to the processor from a descriptor

table and are stored in the descriptor cache registers. The descriptor cache registers,

like the TLB, allow the hardware processor to refer to the contents of the segment

register's hidden part without further reference to the descriptor table. Each time a

segment register is loaded, the descriptor cache register gets fully loaded from the

descriptor table. Since each VM has its own descriptor table (for example, the GDT), the

VMM has to maintain a shadow copy of each VM’s descriptor table. A context switch to

a VM will cause the VM's shadow descriptor table to be loaded to the hardware

descriptor table. If the content of the descriptor table is changed by the VMM because

of a context switch to another VM, the segment is non-reversible, which means the

segment cannot be restored if an event such as a trap causes the segment to be saved

and replaced.

The Current Privilege Level (CPL) is stored in the hidden portion of the segment register.

The CPL is initially equal to the privilege level of the code segment from which it is

being loaded. The processor changes the CPL when program control is transferred to a

code segment with a different privilege level.

The segment descriptor contains the size, location, access control, and status

information of the segment that is stored in either the LDT or GDT. The OS sets segment

descriptors in the descriptor table and controls which descriptor entry to use for a

segment (Figure 11). See “CPU Privilege Mode” on page 45 for a discussion of setting

the segment descriptor in the Solaris OS.

Figure 11. Segment descriptor.

Segment descriptor fields (Figure 11):

    Upper 32 bits
    Bits 31-24   Base 31:24
    Bit 23       G
    Bit 22       D/B
    Bit 21       L
    Bit 20       AVL
    Bits 19-16   Segment Limit 19:16
    Bit 15       P
    Bits 14-13   DPL
    Bit 12       S
    Bits 11-8    Type
    Bits 7-0     Base 23:16

    Lower 32 bits
    Bits 31-16   Base 15:00
    Bits 15-0    Segment Limit 15:00

    L: 64-bit code segment
    AVL: Available for use by system software
    Base: Segment base address
    D/B: Default operation size (0=64-bit segment, 1=32-bit segment)
    DPL: Descriptor Privilege Level
    G: Granularity
    SL: Segment Limit 19:16
    P: Segment present
    S: Descriptor type (0=system, 1=code or data)
    Type: Segment type

The privilege check performed by the processor recognizes three types of privilege levels: requested privilege level (RPL), current privilege level (CPL), and descriptor privilege level (DPL). A segment can be loaded if the DPL of the segment is numerically greater than or equal to both the CPL and the RPL. In other words, a segment can be accessed only by code that has an equal or higher privilege level. Otherwise, a general-protection fault exception, #GP, is generated and the segment register is not loaded.
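The load rule stated above can be written as a single comparison. The function name can_load_segment is an assumption for this sketch, which covers only the rule as given in the text and not the additional cases the processor distinguishes for code segments and gates.

```python
def can_load_segment(cpl, rpl, dpl):
    """A segment loads only if its DPL is numerically >= both the CPL and the RPL.

    Lower numbers mean more privilege, so the effective privilege of the
    request is the weaker (numerically larger) of CPL and RPL.
    """
    return dpl >= max(cpl, rpl)


user_loads_user = can_load_segment(cpl=3, rpl=3, dpl=3)
user_loads_kernel = can_load_segment(cpl=3, rpl=3, dpl=0)      # would raise #GP
kernel_with_user_rpl = can_load_segment(cpl=0, rpl=3, dpl=0)   # RPL weakens the request
```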

On 64-bit systems, linear address space (flat memory model) is used to create a

continuous, unsegmented address space for both kernel and application programs.

Segmentation is effectively disabled: because the VA to LA translation does not exist, segment privilege

checking cannot be applied to it. The only protection left to prevent a user application

from accessing kernel memory is through the page protection mechanism. This is why

the kernel of a GOS has to run in ring 3 (user mode in page level protection) on a 64-bit

system.

Paging Architecture

When operating in the protected mode, the LA → PA translation is performed by the paging hardware of the x86 processor. To access data in memory, the processor requires the presence of a VA → PA translation in the TLB (in Solaris, LA is equal to VA), the page table backing up the TLB entry, and a page of physical memory. For the x86 processor, loading the VA → PA page translation from the page table to the TLB is performed automatically by the processor. The OS is responsible for allocating physical memory and loading the VA → PA translation to the page table.

When the processor cannot load a translation from the page table, it generates a page

fault exception, #PF. A #PF exception on x86 processors usually means a physical page

has not been allocated, because the loading of the translation from the page table to

the TLB is handled by the processor (Figure 12).
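The division of labor between the processor and the OS can be sketched with a toy model. The Python code below assumes the classic 32-bit, two-level x86 layout with 4 KB pages (10-bit directory index, 10-bit table index, 12-bit offset); the names Cpu, walk, and PageFault are hypothetical, and dictionaries stand in for the in-memory tables. A TLB miss triggers an automatic walk, and only a missing mapping produces a fault for the OS to handle.

```python
PAGE_SHIFT = 12
OFFSET_MASK = 0xFFF


class PageFault(Exception):
    """Models #PF: the walk found no translation for the address."""


def walk(page_directory, va):
    """Two-level walk: directory index, then table index, then page offset."""
    pde_index = va >> 22
    pte_index = (va >> PAGE_SHIFT) & 0x3FF
    page_table = page_directory.get(pde_index)
    if page_table is None or pte_index not in page_table:
        raise PageFault(hex(va))
    return (page_table[pte_index] << PAGE_SHIFT) | (va & OFFSET_MASK)


class Cpu:
    def __init__(self, page_directory):
        self.page_directory = page_directory   # maintained by the OS
        self.tlb = {}                          # virtual page number -> physical frame number

    def access(self, va):
        vpn = va >> PAGE_SHIFT
        if vpn not in self.tlb:
            # TLB miss: the processor walks the page table by itself.
            self.tlb[vpn] = walk(self.page_directory, va) >> PAGE_SHIFT
        return (self.tlb[vpn] << PAGE_SHIFT) | (va & OFFSET_MASK)


# The OS has mapped one page: directory entry 1, table entry 2, frame 0x1234.
cpu = Cpu({1: {2: 0x1234}})
pa = cpu.access(0x00402034)
```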

Figure 12. Translations through the TLB are accomplished in the processor itself, while translations through page tables are performed by the OS.

The x86 processor uses a control register, %cr3, to manage the loading of address

translations from the page table to the TLB. The base address of a process's page table

is kept by the OS and loaded to %cr3 when the process is switched in to run. On the

Solaris OS, %cr3 is kept in the kernel hat structure. Each address space, as, has one

hat structure. The mdb(1) command can be used to find the value of the %cr3

register of a process:



When multiple VMs are running, the automatic loading of page translations from the page table to the TLB actually makes virtualization more difficult because all page tables have to be accessible by the processor. As a result, page table updates can only be performed by the VMM, to enforce consistent memory usage on the system. “Page Translations Virtualization” on page 14 discusses two mechanisms for managing page tables by the VMM.

Another issue of the x86 paging architecture is related to the flushing of TLB entries.

Unlike many RISC processors which support a tagged TLB, the x86 TLB is not tagged. A

TLB miss results in a walk of the page table by the processor to find and load the

translation to the TLB. Since the TLB is not tagged, a change in the %cr3 register due to

a virtual memory context switch will result in invalidating all TLB entries. This adversely

affects performance if the VMM and VM are not in the same address space.

A typical solution to address the performance impact of TLB flushing is to reserve a

region of the VM address space for the VMM. With this solution, the VMM and VM can

run from the same address space and thus avoid a TLB flush when a VM memory

operation traps to the VMM. The latest CPUs from Intel and AMD with hardware

virtualization support include tagged TLBs, and consequently the translation of

different address spaces can co-exist in the TLB.
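The cost of an untagged TLB can be seen in a small model. The Tlb class below is hypothetical and treats an address-space switch as the equivalent of a %cr3 write; with tagged=False every switch discards all cached translations, while with tagged=True the entries of different address spaces coexist.

```python
class Tlb:
    """Toy TLB keyed by (address-space tag, virtual page number)."""

    def __init__(self, tagged):
        self.tagged = tagged
        self.entries = {}
        self.current_space = None
        self.flushes = 0

    def switch_address_space(self, space):
        # Models a write to %cr3 on a context switch.
        if not self.tagged:
            self.entries.clear()   # untagged: every cached translation is invalidated
            self.flushes += 1
        self.current_space = space

    def _key(self, vpn):
        return (self.current_space if self.tagged else None, vpn)

    def insert(self, vpn, pfn):
        self.entries[self._key(vpn)] = pfn

    def lookup(self, vpn):
        return self.entries.get(self._key(vpn))   # None models a TLB miss


def trap_to_vmm_and_return(tlb):
    """Warm one VM translation, switch to a separate VMM address space, then return."""
    tlb.switch_address_space("vm")
    tlb.insert(0x10, 0x99)
    tlb.switch_address_space("vmm")
    tlb.switch_address_space("vm")
    return tlb.lookup(0x10)


untagged_result = trap_to_vmm_and_return(Tlb(tagged=False))
tagged_result = trap_to_vmm_and_return(Tlb(tagged=True))
```

Reserving part of the VM address space for the VMM avoids the switch altogether, which is why it helps on processors without tagged TLBs.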

I/O and Interrupts

In general, x86 support for exceptions and I/O interrupts does not impose any

particular challenge to the implementation of a VMM. The x86 processor uses the

interrupt descriptor table (IDT) to provide a handler for a particular interrupt or

exception. Access to the IDT functions is privileged and, therefore, can only be

performed by the VMM. The Sun xVM Hypervisor for x86 provides a mechanism to relay

hardware interrupts to a VM through its event channel hypervisor calls (see “Event

Channels” on page 43).

Finding the %cr3 value of a process with mdb(1), as referenced in “Paging Architecture”:

    % mdb -k
    > ::ps
    S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
    ....
    R   9352   9351   9352   9352  28155 0x4a014000 fffffffec2ae78c0 bash
    > fffffffec2ae78c0::print -t 'struct proc' ! grep p_as
        struct as *p_as = 0xfffffffed15ba7e0
    > 0xfffffffed15ba7e0::print -t 'struct as' ! grep a_hat
        struct hat *a_hat = 0xfffffffed1718e98
    > 0xfffffffed1718e98::print -t 'struct hat' ! grep hat_htable
        htable_t *hat_htable = 0xfffffffed0f67678
    > 0xfffffffed0f67678::print -t 'struct htable' ! grep ht_pfn
        pfn_t ht_pfn = 0x16d37 // %cr3


The x86 processor allows device memory and registers to be accessed through either an

I/O address space or memory-mapped I/O. An I/O address space access is performed

using special I/O instructions such as IN and OUT. These instructions, while allowed to

execute when the CPL is not 0, will result in a #GP exception if the processor's CPL

value is higher than the I/O privilege level (IOPL). The Sun xVM Hypervisor for x86

provides a hypervisor call to set the IOPL, enabling a GOS to directly access I/O ports by

setting the IOPL to its privilege level.

When using memory-mapped I/O, any of the processor’s instructions that reference

memory can be used to access an I/O location with protection provided through

segmentation and paging. PIO, whether it is using I/O address space or memory-

mapped I/O, is normally uncacheable as device registers are usually accessed with

precise programming order. PIO uses addresses in a VM's address space and doesn't

cause any security and isolation issues.

DMA on most x86 platforms uses physical addresses. DMA in a virtualized x86 system has certain issues:

• A 32-bit, non-dual-address-cycle (DAC) PCI device cannot address beyond 4 GB of memory.

• It is possible for one domain’s DMA to intrude into another domain's physical memory, creating a risk of security violations.

The solution to the above issues is to have an I/O memory management unit (IOMMU)

as a part of an I/O bridge or north bridge that performs a translation of I/O addresses

(for example, an address that appears on the PCI bus) to machine memory addresses.

The I/O address can be any address that is recognized by the IOMMU. An IOMMU can

also improve the performance of large chunk data transfers by mapping a contiguous

I/O address to multiple physical pages in one DMA transaction. However, the IOMMU

may hurt the I/O performance for small data transfers because the DMA setup cost is

higher than that of DMA without an IOMMU.

For more details on the IOMMU, also known as hardware address translation service

(hardware ATS), see “I/O Virtualization” on page 16.

Timer Devices

An OS typically uses several timer devices for different purposes. Timer devices are

characterized by their frequency granularity, frequency reliability, and ability to

generate interrupts and receive counter input. Understanding the characteristics of

timer devices is important for the discussion of timekeeping in a virtualized

environment, as the VMM provides virtualized timekeeping of some timers to its

overlaying VMs. Virtualized timekeeping has significant impact on the accuracy of time

related functions in the GOS and, thus, on the performance and results of time sensitive

applications.


An x86 system typically includes the following timer devices:

• Programmable Interval Timer (PIT)

PITs use a 1.193182 MHz crystal oscillator and have a 16-bit counter and counter input

register. The PIT contains three timers. Timer 0 can generate interrupts and is used by

the Solaris OS as the system timer. Timer 1 was historically used for RAM refreshes

and timer 2 for the PC speaker.

• Time Stamp Counter (TSC)

The TSC is a feature of the x86 architecture that is accessed via the RDTSC

instruction. The TSC, a 64-bit counter, changes with the processor speed. The TSC

cannot generate interrupts and has no counter input register. The TSC is the finest

grained of all timers and is used in the Solaris OS as the high resolution timer. For

example, the gethrtime(3C) function uses the TSC to return the current high-

resolution real time.

• Real Time Clock (RTC)

The RTC is used as the time-of-day (TOD) clock in the Solaris OS. The RTC uses a battery

as an alternate power source, enabling it to continue to keep time while the primary

source of power is not available. The RTC can generate interrupts and has a counter

input register. It is the coarsest-grained timer on the system.

• Local Advanced Programmable Interrupt Controller (APIC) Timer

The local APIC timer, which is a part of the local APIC, has a 32-bit counter and

counter input register. It can generate interrupts and has the same frequency as the

front side bus. The Solaris OS supports the use of the local APIC timer as one of the

cyclic timers.

• High Precision Event Timer (HPET)

The HPET is a relatively new timer available in some new x86 systems. The HPET is intended to replace the PIT and the RTC for generating periodic interrupts. The HPET can generate interrupts, is 64 bits wide, and has a counter input register. The Solaris OS currently does not use the HPET.

• Advanced Configuration and Power Interface (ACPI) Timer

The ACPI timer has a 24-bit counter, can generate interrupts, and has no counter input register. The Solaris OS does not use the ACPI timer.
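The counter widths and frequencies listed above bound what each timer can do. As a worked example for the PIT, the Python sketch below derives the counter value for a desired interrupt rate and the longest period a 16-bit counter can express at 1.193182 MHz. The function names are assumptions, and the sketch assumes the largest usable count is 2^16 (65536).

```python
PIT_HZ = 1_193_182      # PIT input clock from the text (1.193182 MHz)
COUNTER_BITS = 16       # PIT counter width from the text


def pit_divisor(interrupt_hz):
    """Counter value that makes timer 0 interrupt roughly `interrupt_hz` times per second."""
    divisor = round(PIT_HZ / interrupt_hz)
    if not 1 <= divisor <= 2 ** COUNTER_BITS:
        raise ValueError("rate cannot be expressed with a 16-bit counter")
    return divisor


def longest_period_seconds():
    """Longest interval between interrupts with a full 16-bit count."""
    return 2 ** COUNTER_BITS / PIT_HZ


divisor_100hz = pit_divisor(100)       # a 100 Hz system tick
max_period = longest_period_seconds()  # roughly 55 ms
```

A rate slower than about 18.2 interrupts per second therefore cannot be programmed directly and has to be derived in software from a faster tick.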


Chapter 4

SPARC Processor Architecture

This chapter provides background information on the SPARC processor architecture that

is relevant to later discussions on Logical Domains (Chapter 7 on page 79).

The SPARC (Scalable Processor Architecture) processor, first introduced in 1987, is a big-

endian RISC processor ISA. SPARC International (SI), an industry organization, was

established in 1989 to promote the open SPARC architecture. In 1994, SI introduced a

64-bit version of the SPARC processor as SPARC v9. The UltraSPARC processor, which is a

Sun-specific implementation of SPARC v9, was introduced in 1996 and has been

incorporated into all Sun SPARC platforms shipping today.

In 2005, Sun's UltraSPARC architecture was open sourced as the UltraSPARC

Architecture 2005 Specification [2]. Included in this enhanced UltraSPARC 2005

specification is support for Chip-level Multithreading (CMT) for a highly threaded

processor architecture and a hyperprivileged mode that allows the hypervisor to

virtualize the processor to run multiple domains. The design of the UltraSPARC T1

processor, which is the first implementation of the UltraSPARC Architecture 2005

Specification, is also open sourced. The UltraSPARC T1 processor includes 8 cores with 4

strands in each core, providing a total of 32 strands per processor.

In August 2007 Sun announced the UltraSPARC T2 processor, the follow-up CMT

processor to the UltraSPARC T1 processor, and the OpenSPARC T2 architecture [33]

which is the open source version of the UltraSPARC T2 processor. Sun also released the

UltraSPARC Architecture 2007 specification [34] which adds a section for error handling

and expands the discussion for memory management. The UltraSPARC T2 processor has

several enhancements over the UltraSPARC T1 processor. These enhancements include

64 strands, per-core floating-point and graphics units, and integrated PCIe and 10 Gb

Ethernet (for more details see “Processor Components” on page 31).

The remainder of this chapter discusses the following features of the UltraSPARC T1/T2

processor architecture, and describes their effect on virtualization implementations:

• Processor privilege mode — The UltraSPARC 2005 specification defines a

hyperprivileged mode for the hypervisor operations.

• Sun4v Chip Multithreaded architecture — This feature enables the creation of up to

32 domains, each with its own dedicated strands, on an UltraSPARC T1 processor, and

up to 64 domains on an UltraSPARC T2 processor.

• Address Space Identifier (ASI)— The ASI provides functionality to control access to a

range of address spaces, similar to the segmentation used by x86 processors.

• Memory Management Unit (MMU) — The software-controlled MMU allows an

efficient redirection of page faults to the intended domain for loading translations.


• Trap and interrupt handling — Each strand (virtual processor) has its own trap and

interrupt priority registers. This functionality allows the hypervisor to re-direct traps

to the target CPU and enables the trap to be taken by the GOS's trap handler.

Note – The terms strand, hardware thread, logical processor, virtual CPU and virtual processor are used by various documents to refer to the same concept. For consistency, the term strand is used in this chapter.

Processor Mode of Operation

The UltraSPARC 2005 specification defines three privilege modes: non-privileged,

privileged, and hyperprivileged. In hyperprivileged mode, the processor can access all

registers and address spaces, and can execute all instructions. Instructions, registers,

and address spaces for privileged and non-privileged modes are restricted.

The processor operates in privileged mode when PSTATE.priv is set to 1 and

HPSTATE.hpriv is set to 0. The processor operates in hyperprivileged mode when

HPSTATE.hpriv is set to 1 (PSTATE.priv is ignored).
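The mode selection described above can be sketched as a small decision function. This is only an illustration: the enum and function names are hypothetical, and the non-privileged case (both bits clear) is inferred from the two cases the text states explicitly.

```c
#include <assert.h>
#include <stdbool.h>

/* The three privilege modes named in the text. */
enum priv_mode {
    MODE_NONPRIVILEGED,
    MODE_PRIVILEGED,
    MODE_HYPERPRIVILEGED
};

/*
 * Hypothetical helper: derive the current mode from the two state bits.
 * HPSTATE.hpriv takes precedence; PSTATE.priv is ignored when it is set.
 */
enum priv_mode current_mode(bool pstate_priv, bool hpstate_hpriv)
{
    if (hpstate_hpriv)
        return MODE_HYPERPRIVILEGED;
    /* Assumption: priv = 0 and hpriv = 0 means non-privileged mode. */
    return pstate_priv ? MODE_PRIVILEGED : MODE_NONPRIVILEGED;
}
```

For example, current_mode(true, true) yields hyperprivileged mode, because the priv bit is not consulted once hpriv is set.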

Table 1 lists the availability of instructions, registers, and address spaces for each of the

privilege modes, and includes information on where further details can be found in the

UltraSPARC Architecture 2005 Specification [2].

Table 1. Documentation describing the availability of components in the UltraSPARC processor.

Component | Location (a) | Comments
Instruction | Table 7-2 | All instructions except SIR, RDHPR, and WRHPR (which require hyperprivilege to execute) can be executed from the privileged mode.
Registers | Chapter 5 | There are seven hyperprivileged registers: HPSTATE, HTSTATE, HINTP, HTBA, HVER, HSTICK_CMPR, and STRAND_STS. These registers are used by the hypervisor in the hyperprivileged mode.
Address Space | Tables 9-1 and 10-1 | ASIs 0x30-0x7F are for hyperprivileged access only. These ASIs are mainly for CMT control, MMU, TLB, and hyperprivileged scratch registers.

a. Location in the UltraSPARC Architecture 2005 Specification [2].

Based on the availability of instructions, registers, and the ASI in hyperprivileged mode, the following functions of the hypervisor can be deduced:

• Reset the processor: SIR instruction

• Control hyperprivileged traps and interrupts: HTSTATE, HTBA, HINTP registers

• Control strand operation: ASI 0x41, and HSTICK_CMPR and STRAND_STS registers

• Manage MMU: ASI 0x50-0x5F


Processor Components

The UltraSPARC T1 processor [10] contains eight cores, and each core has hardware

support for four strands. One FPU and one L2 cache are shared among all cores in the

processor. Each core has its own Level 1 instruction and data cache (L1 Icache and

Dcache) and TLB that are shared among all strands in the core. In addition, each strand

contains the following:

• A full register file with eight register windows and four sets of global registers (a total

of 160 registers: 8 * 16 registers per window, + 4 * 8 global registers)

• Most of the ASIs

• Ancillary privileged registers

• Trap queue with up to 16 entries

This hardware support in each strand allows the hypervisor to partition the processor

into 32 domains, with one strand for each domain. Each strand can execute instructions

separately without requiring a software scheduler in the hypervisor to coordinate the

processor resources.
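The strand and register counts quoted above follow from simple arithmetic, sketched here. The function names are hypothetical; the constants 16 (registers per window) and 8 (registers per global set) are the ones used in the text.

```c
#include <assert.h>

/* Total hardware strands on a chip-multithreaded processor. */
unsigned total_strands(unsigned cores, unsigned strands_per_core)
{
    return cores * strands_per_core;
}

/*
 * Size of one strand's register file, using the text's accounting:
 * 16 registers per window plus 8 registers per global set.
 */
unsigned register_file_size(unsigned windows, unsigned global_sets)
{
    return windows * 16u + global_sets * 8u;
}
```

With 8 cores of 4 strands each this gives the 32 strands (and therefore up to 32 single-strand domains) of the UltraSPARC T1 processor, and 8 windows with 4 global sets gives the 160-register file.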

Table 2 summarizes the association of processor components to their location in the

processor, core and strand.

Table 2. Location of key processor components in the UltraSPARC T1 processor.

Location | Components
Processor | Floating Point Unit; L2 cache crossbar; L2 cache
Core | 6 stage instruction pipeline; L1 Icache and Dcache; TLB
Strand | Register file with 160 registers; most of the ASIs; ancillary state register (ASR); trap registers; privileged registers

The UltraSPARC T2 processor [33] is built upon the UltraSPARC T1 architecture. It has the following enhancements over the UltraSPARC T1 processor:

• Eight strands per core (for a total of 64 strands)

• Two integer pipelines per core, with each integer pipeline supporting 4 strands

• A floating-point and graphics unit (FGU) per core

• Integrated PCI-E and 10 Gb/Gb Ethernet (System-on-Chip)

• Eight banks of 4 MB L2 cache

The UltraSPARC T2 has a total of 64 strands in 8 cores, and each core has its own floating-point and graphics unit (FGU). This allows up to 64 domains to be created on the UltraSPARC T2 processor. This design also adds integrated support for industry standard I/O interfaces such as PCI-Express and 10 Gb Ethernet.

Table 3 summarizes the association of processor components to physical processor, core and strand.

Table 3. Location of key processor components in the UltraSPARC T2 processor.

Address Space Identifier

Unlike x86 processors in 32-bit mode, which use segmentation to divide a process's

address space into several segments of protected address spaces, the SPARC v9

processor has a flat 64-bit address space. An address in the SPARC V9 processor is a

tuple consisting of an 8-bit address space identifier (ASI) and a 64-bit byte-address offset

within the specified address space. The ASI provides attributes of an address space,

including the following:

• Privileged or non-privileged

• Register or memory

• Endianness (for example, little-endian or big-endian)

• Physical or virtual address

• Cacheable or non-cacheable

The SPARC processor's ASI allows different types of address spaces (user virtual address

space, kernel virtual address space, processor control and status registers, etc.) to co-

exist as separate and independent address spaces for a given context. Unlike x86

processors in which user processes and the kernel share the same address space, user

processes and the kernel have their own address space on SPARC processors.

Access to these address spaces is protected by the ASI associated with each address

space. ASIs in the range 0x00-0x2F may be accessed only by software running in

privileged or hyperprivileged mode; ASIs in the range 0x30-0x7F may be accessed

only by software running in hyperprivileged mode. An access to a restricted (privileged

or hyperprivileged) ASI (0x00-0x7F) by non-privileged software will result in a

privileged_action trap.
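The ASI ranges above amount to a simple access check, sketched below. The type and function names are hypothetical, and treating ASIs 0x80-0xFF as unrestricted is an assumption drawn from the statement that the restricted range is 0x00-0x7F.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Privilege modes, ordered from least to most privileged. */
enum asi_priv { ASI_NONPRIV, ASI_PRIV, ASI_HYPERPRIV };

/* Minimum mode required to use a given ASI, per the ranges in the text. */
enum asi_priv asi_required_mode(uint8_t asi)
{
    if (asi <= 0x2F)
        return ASI_PRIV;        /* 0x00-0x2F: privileged or hyperprivileged */
    if (asi <= 0x7F)
        return ASI_HYPERPRIV;   /* 0x30-0x7F: hyperprivileged only */
    return ASI_NONPRIV;         /* assumed unrestricted: 0x80-0xFF */
}

/* A false result stands for the access being refused with a trap. */
bool asi_access_allowed(uint8_t asi, enum asi_priv mode)
{
    return mode >= asi_required_mode(asi);
}
```

For instance, ASI 0x41 (used for strand control) is usable only when the mode is ASI_HYPERPRIV, which is why a guest kernel cannot manipulate strands directly.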

Table 9-1 and Table 10-1 of [2] provide a summary and description for each ASI.

Memory Management Unit

The traditional UltraSPARC architecture supports two types of memory addressing:

• Virtual Address (VA) — managed by the GOS and used by user programs

• Physical address (PA) — passed by the processor to the system bus when accessing

physical memory

Location (UltraSPARC T2) | Components
Processor | 8 banks 4 MB L2 cache; L2 cache crossbar; memory controller; PCI-E; 10 Gb/Gb Ethernet
Core | 2 instruction pipelines (8 stages); L1 Icache and Dcache; TLB; FGU (12 stages)
Strand | Full register file with 8 windows; most of the ASIs; ancillary state register (ASR); privileged registers


The Memory Management Unit (MMU) of the UltraSPARC processor provides the

translation of VAs to PAs. This translation enables user programs to use a VA to locate

data in physical memory.

The SpitFire Memory Management Unit (sfmmu) is Sun's implementation of the

UltraSPARC MMU. The sfmmu hardware consists of Translation Lookaside Buffers (TLBs)

and a number of MMU registers:

• Translation Lookaside Buffer (TLB)

The TLB provides virtual to physical address translations. Each entry of the TLB is a

Translation Table Entry (TTE) that holds information for a single page mapping of

virtual to physical addresses. The format of the TTE is shown in Figure 13. The TTE

consists of two 64-bit words, representing the tag and data of the translation. The

privileged field, P, controls whether or not the page can be accessed by non-

privileged software.

• MMU registers

A number of MMU registers are used for accessing TLB entries, removing TLB entries

(demap), context management, handling TLB misses, and support for Translation

Storage Buffer (TSB) access. The TSB, an array of TTE entries, is a cache of translation

tables used to quickly reload the TLB. The TSB resides in the system memory and is

managed entirely by the OS. The UltraSPARC processors include some MMU

hardware registers for speeding up TSB access. The TLB miss handler will first search

the TSB for the translation. If the translation is not found in the TSB, the TLB handler

calls to a more sophisticated (and slower) TSB miss handler to load the translation

table to the TSB.

Figure 13. The translation lookaside buffer (TLB) is an array of translation table entries containing tag and data portions.

[Figure 13: TTE format. TTE Tag fields: context_id, va. TTE Data fields: v, nfo, soft2, taddr, ie, e, cp, cv, p, ep, w, soft, sz.]

A TLB hit occurs if both the context and virtual address match an entry in the TLB. Address aliasing (multiple TLB entries with the same physical address) is permitted.

Unlike the x86 processor, the loading of page translations to the TLB is manually managed by software through traps. In the event of a TLB miss, a trap is generated, and the trap handler first tries to get the translation from the Translation Storage Buffer (TSB) (Figure 14). The TSB, an in-memory array of translations, acts like a direct-mapped cache for the TLB. If the translation is not present in the TSB, a TSB miss trap is generated. The TSB miss trap handler uses a software lookup mechanism based on the hash memory entry block structure, hme_blk, to obtain the TTE. If a translation is still not found in hme_blk, the kernel generic trap handler is invoked to call the kernel function pagefault() to allocate physical memory for the virtual address and load the translation into the hme_blk hash structure.

Figure 14 depicts the mechanism for handling TLB misses in an unvirtualized domain.

Figure 14. Handling a TLB miss in an unvirtualized domain, UltraSPARC T1/T2 processor architecture.
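The TSB stage of this lookup can be sketched as a direct-mapped cache indexed by virtual page number. This is a simplified model, not the Solaris implementation: the structure layout, the 8 KB page size, the 512-entry table size, and all names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT  13u    /* assumed 8 KB base page size */
#define TSB_ENTRIES 512u   /* assumed table size; must be a power of two */

/* Simplified TTE: the tag is (context, virtual page), the data is the target page. */
struct tte {
    bool     valid;
    uint16_t context;
    uint64_t vpn;   /* virtual page number (tag) */
    uint64_t ppn;   /* physical page number (data) */
};

/* Direct mapping: each virtual page has exactly one candidate slot. */
size_t tsb_index(uint64_t va)
{
    return (size_t)((va >> PAGE_SHIFT) & (TSB_ENTRIES - 1u));
}

/* Returns true on a TSB hit and stores the translated address in *pa. */
bool tsb_lookup(const struct tte *tsb, uint16_t context, uint64_t va, uint64_t *pa)
{
    assert(tsb != NULL && pa != NULL);
    const struct tte *e = &tsb[tsb_index(va)];
    uint64_t vpn = va >> PAGE_SHIFT;

    if (!e->valid || e->context != context || e->vpn != vpn)
        return false;   /* TSB miss: the caller falls back to the hme_blk lookup */
    *pa = (e->ppn << PAGE_SHIFT) | (va & ((1ull << PAGE_SHIFT) - 1u));
    return true;
}

/* Loading a TTE overwrites whatever previously occupied the slot. */
void tsb_insert(struct tte *tsb, uint16_t context, uint64_t va, uint64_t ppn)
{
    assert(tsb != NULL);
    struct tte *e = &tsb[tsb_index(va)];
    e->valid = true;
    e->context = context;
    e->vpn = va >> PAGE_SHIFT;
    e->ppn = ppn;
}
```

Because the slot is chosen from the address alone, a lookup is a single memory access and compare, which is what makes the TSB a fast refill path for the TLB miss handler.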

Similarly, Figure 15 depicts how TLB misses are handled in a virtualized domain. In a

virtualized environment, the UltraSPARC T1/T2 processor adds a Real Address type, in

addition to the VA and PA, into the types of memory addressing (Figure 15). Real

addresses (RA), which are equivalent to the physical memory in Sun xVM Server (see

“Physical Memory Management” on page 52) are provided to the GOS as the

underlying physical memory allocated to it. The GOS-maintained TSBs are used to

translate VAs into RAs. The hypervisor manages the translation from RA to PA.

Figure 15. Handling a TLB miss in a virtualized domain, UltraSPARC T1/T2 processor architecture.

Applications, which are non-privileged software, use only VAs. The OS kernel, which is

privileged software, uses both VAs and RAs. The hypervisor, which is hyperprivileged

software, normally uses PAs. “Physical Memory Allocation” on page 88 discusses in

detail the types of memory addressing used in LDoms.

The UltraSPARC T2 processor adds a hardware table walk for loading TLB entries. The

hardware table walk accesses the TSBs to find TTEs that match the virtual address and

context ID of the request. Since a GOS cannot access or control physical memory, the

TTEs in the TSBs controlled by a GOS contain real page numbers, not physical page

numbers (see “Physical Memory Allocation” on page 88). TTEs in the TSBs controlled by

the hypervisor can contain real page numbers or physical page numbers. The

hypervisor performs the RA-to-PA translation within the hardware table walk to permit

the hardware table walk to load GOS TTEs into the TLB for VA-to-PA translation.

[Figure 14 diagram: a TLB miss in the processor MMU is first looked up in the TSB (TTE cache in memory); a TSB miss is looked up in hme_blk (OS data structure); if no translation exists, the OS functions pagefault() and hat_memload() allocate memory. The TTE is then loaded into the TSB and from the TSB into the TLB.]

[Figure 15 diagram: the same flow in a virtualized domain, with the TSB holding VA-to-RA translations and a hypervisor-managed RA-to-PA step before the TTE is loaded into the TLB.]


Traps

In the SPARC processor, a trap transfers software execution from one privilege mode to

another privilege mode at the same or higher level. The only exception is that

non-privileged mode cannot trap to another non-privileged mode. A trap can be generated

by the following methods:

• Internally by the processor (memory faults, privileged exceptions, etc.)

• Externally generated by I/O devices (interrupts)

• Externally generated by another processor (cross calls)

• Software generated (for example, the Tcc instruction)

A trap is associated with a Trap Type (TT), a 9-bit value. (TT values 0x180-0x1FF are

reserved for future use.) The transfer of software execution occurs through a trap table

that contains an array of TT handlers indexed by the TT value. Each trap table entry is

32 bytes in length and contains the first eight instructions of the TT handler. When a

trap occurs, the processor gets the TT from the TT register and the trap table base

address (TBA) from the TBA register. After saving the current executing states and

updating some registers, the processor starts to execute the instructions in the trap

table handler.
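The trap table indexing described above reduces to a base-plus-offset calculation, sketched below. It is a simplified model under the assumptions stated in the text (one 32-byte entry per TT value); it deliberately ignores any further selection the hardware makes, for example by trap level, and the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TRAP_ENTRY_BYTES 32u      /* eight 4-byte instructions per entry */
#define TT_MASK          0x1FFu   /* the trap type is a 9-bit value */

/* Address of the first instruction of the handler for trap type tt. */
uint64_t trap_handler_addr(uint64_t tba, unsigned tt)
{
    return tba + (uint64_t)(tt & TT_MASK) * TRAP_ENTRY_BYTES;
}
```

Because each entry holds only eight instructions, a handler that needs more room typically uses its entry to branch to a longer routine elsewhere.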

The SPARC processors support nesting traps using a trap level (TL). The maximum TL

(MAXTL) value is typically in the range of 2-6, and depends on the processor; in

UltraSPARC T1/T2 processors, MAXTL is 6. Each trap level has one set of trap stack

control registers: trap type (TT), trap program counter (TPC), trap next program

counter (TNPC), and trap state (TSTATE). These registers provide trap software

execution state and control for the current TL. The ability to support nested traps in

SPARC processors makes the implementation of an OS trap handler easier and more

efficient, as the OS doesn't need to explicitly save the current trap stack information.

On UltraSPARC T1/T2 processors, each strand has a full set of trap control and stack

registers which include TT, TL, TPC, TNPC, TSTATE, HTSTATE (hyperprivileged trap

state), TBA, HTBA (hyperprivileged trap base address), and PIL (priority interrupt

level). This design feature allows each strand to receive traps independently of other

strands. This capability significantly helps trap handling and management by the

hypervisor, as traps are delivered to a strand without being queued up in the hypervisor.

Interrupts

On SPARC platforms, interrupt requests are delivered to the CPU as traps. Traps 0x041

through 0x04F are used for Priority Interrupt Level (PIL) interrupts, and trap 0x60 is

used for the vector interrupt. There are 15 interrupt levels for PIL interrupts. Interrupts

are serviced in accordance with their PIL, with higher PILs having higher priority. The

vector interrupt is used to support the data bearing vector interrupt which allows a

device to include its private data in the interrupt packet (also known as the mondo


vector). With vector interrupt, device CSR access can be eliminated and the complexity

of device hardware can be reduced.

PIL interrupts are delivered to the processor through the ASR's SOFTINT_REG register.

The SOFTINT_REG register contains a 15-bit int_level field. When a bit in this field

is set, a trap is generated and the PIL of the trap corresponds to the location of the bit

in that field. There is one SOFTINT_REG for each strand.
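The relationship between int_level bits, PILs, and trap numbers can be sketched as follows. The function names are hypothetical, and the bit numbering (bit n-1 of the extracted 15-bit field requests PIL n) is an assumption made for this sketch; the trap range 0x041-0x04F is taken from the text.

```c
#include <assert.h>
#include <stdint.h>

#define PIL_TRAP_BASE 0x040u   /* PIL n is delivered as trap 0x040 + n */

/*
 * int_level is the 15-bit field extracted from SOFTINT_REG.
 * Returns the highest pending PIL (1..15), or 0 if nothing is pending.
 */
unsigned highest_pending_pil(uint16_t int_level)
{
    unsigned bits = int_level & 0x7FFFu;

    for (unsigned pil = 15; pil >= 1; pil--) {
        if (bits & (1u << (pil - 1)))
            return pil;
    }
    return 0;
}

/* Trap type used to deliver an interrupt at the given PIL. */
unsigned pil_trap_type(unsigned pil)
{
    assert(pil >= 1 && pil <= 15);
    return PIL_TRAP_BASE + pil;
}
```

Scanning from the top reflects the rule that higher PILs are serviced first when several bits are set at once.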

In LDoms, the interrupt delivery from an I/O device to a GOS is a two-step process:

• An I/O device sends an interrupt request using the vector interrupt (trap 0x60) to the

hypervisor. The hypervisor inserts the interrupt request into the interrupt queue of

the target virtual processor.

• The target processor receives the interrupt request on its interrupt queue through

trap 0x7D (for device) or 0x7C (for cross calls), and schedules an interrupt to itself to

be processed at a later time by setting bits in the privileged SOFTINT register which

causes a PIL interrupt (trap 0x41-0x4F). For more details on interrupt delivery, see

“Trap and Interrupt Handling” on page 85.


Section II

Hardware Virtualization Implementations

• Chapter 5: Sun xVM Server (page 39)

• Chapter 6: Sun xVM Server with Hardware VM (HVM) (page 63)

• Chapter 7: Logical Domains (page 79)

• Chapter 8: VMware (page 97)

39 Sun xVM Server Sun Microsystems, Inc.

Chapter 5

Sun xVM Server

Sun xVM Server is a paravirtualized Solaris OS that incorporates the Xen open source

community work. The open source VMM, Xen, was originally developed by the Systems

Research Group of the University of Cambridge Computer Laboratory, as part of the UK-

EPSRC funded XenoServers project. The first versions of Xen, targeted at the Linux

community for the x86 processor, required the Linux kernel to be specifically modified

to run on the Xen VMM. This OS paravirtualization made it impossible to run Windows

on early versions of Xen, because Microsoft did not permit the Windows software to be

modified.

In December 2005 the Xen development team released Xen 3.0, the first version of its

VMM that supported hardware-assisted virtual machines (HVM). With this new version,

an unmodified OS could be hosted on the Intel VT-x and AMD-V (Pacifica) processors.

Xen 3.0 eliminated the need for paravirtualization and enabled Microsoft Windows to

run in a Xen environment side-by-side with Linux and the Solaris OS.

Xen 3.0 supports the x86 CPU both with HVM and without HVM. Xen 3.0 also extends

support for symmetric multiprocessing, 64-bit operating systems, and up to 64 GB RAM

allowed by the x86 physical address extension (PAE) in 32-bit mode.

HVM technology affects the Xen implementation in many ways. This chapter discusses

the architecture and design of Sun xVM Server, which does not leverage the processor

HVM feature. Chapter 6 discusses Sun xVM Server for x86 processors with HVM support

(Sun xVM Server with HVM).

Note – Sun xVM Server includes support for the Xen open source community work on the x86 platform and support for LDoms on the UltraSPARC T1/T2 platform. In this paper, in order to distinguish the discussion of x86 and UltraSPARC T1/T2 processors, Sun xVM Server is specifically used to refer to the Sun hardware virtualization product for the x86 platform, and LDoms is used to refer to the Sun hardware virtualization product for the UltraSPARC T1 and T2 platforms.

This chapter is organized as follows:

• “Sun xVM Server Architecture Overview” on page 40 provides an overview of the Sun

xVM Server architecture.

• “Sun xVM Server CPU Virtualization” on page 45 discusses the CPU virtualization

employed by Sun xVM Server.

• “Sun xVM Server Memory Virtualization” on page 52 describes memory management

issues.


• “Sun xVM Server I/O Virtualization” on page 56 discusses the I/O virtualization used

in Sun xVM Server.

Sun xVM Server Architecture Overview

A Sun xVM Server virtualized system consists of an x86 system, a VMM, a control VM

running Sun xVM Server (Dom0), and zero or more VMs (DomU), as shown in Figure 16.

The Sun xVM Hypervisor for x86, the VMM of the Sun xVM Server system, manages

hardware resources and provides services to the VMs. Each VM, including Dom0, runs

an instance of a guest operating system (GOS) and is capable of communicating with

the VMM through a set of hypervisor calls.

Figure 16. A Sun xVM Server virtualized system consists of a VMM, a control VM (Dom0), and zero or more VMs (DomU).

The Dom0 VM has some unique characteristics not available in other VMs:

• First VM started by the VMM

• Able to directly access I/O devices

• Runs domain manager to create, start, stop, and configure other VMs

• Provides I/O access service to other VMs (DomU)

Each DomU VM runs an instance of a paravirtualized GOS, and gets VMM services

through a set of hypercalls. Access to I/O devices from each DomU VM is provided by

drivers in Dom0.

[Figure 16 diagram: Dom 0 (guest OS with guest applications, domain manager, and console) and three Dom U VMs (each a guest OS with guest applications) above the Sun xVM Hypervisor for x86 on a Sun x64 server. Other labeled components: console interface, XenStore, grant tables, scheduler, hypercalls, and event channel.]


Sun xVM Hypervisor for x86 Services

The Sun xVM Hypervisor for x86, the VMM of the Sun xVM Server, provides several

communication channels between itself and overlying domains:

• Hypercalls — synchronous calls from a GOS to the VMM

• Event Channel — asynchronous notifications from the VMM to VMs

• Grant Table — shared memory communication between the VMM and VMs, and

among VMs

• XenStore — a hierarchical repository of control and status information

Each of these mechanisms is described in more detail in the following sections.

Hypercalls

The Sun xVM Server hypercalls are a set of interfaces used by a GOS to request service

from the VMM. The hypercalls are invoked in a manner similar to OS system calls: a

software interrupt is issued which vectors to an entry point within the VMM. Hypercalls

use INT $0x82 on a 32-bit system and SYSCALL on a 64-bit system, with the

particular hypercall contained in the %eax register.
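Selecting a service by the number passed in a register is commonly implemented as a table lookup. The sketch below illustrates that idea only; the table, the two placeholder handlers, and the error value are hypothetical and do not reflect the actual entry-point code of the VMM.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ENOSYS 38   /* illustrative "no such call" error value */

/* Every handler takes up to four arguments, as __hypercall4 would pass them. */
typedef long (*hypercall_fn)(unsigned long a1, unsigned long a2,
                             unsigned long a3, unsigned long a4);

/* Placeholder handler 0: reports a version number. */
long hc_version(unsigned long a1, unsigned long a2,
                unsigned long a3, unsigned long a4)
{
    (void)a1; (void)a2; (void)a3; (void)a4;
    return 3;
}

/* Placeholder handler 1: adds its first two arguments. */
long hc_add(unsigned long a1, unsigned long a2,
            unsigned long a3, unsigned long a4)
{
    (void)a3; (void)a4;
    return (long)(a1 + a2);
}

static const hypercall_fn hypercall_table[] = { hc_version, hc_add };

/* callnum plays the role of the value the guest placed in %eax. */
long dispatch_hypercall(unsigned long callnum, unsigned long a1,
                        unsigned long a2, unsigned long a3, unsigned long a4)
{
    size_t n = sizeof hypercall_table / sizeof hypercall_table[0];

    if (callnum >= n || hypercall_table[callnum] == NULL)
        return -SKETCH_ENOSYS;   /* reject unknown call numbers */
    return hypercall_table[callnum](a1, a2, a3, a4);
}
```

Bounds-checking the call number before indexing matters in this setting because the value comes from a less privileged guest.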

For example, the common routine for hypercalls with four arguments on a 64-bit Solaris kernel is:

    long __hypercall4(ulong_t callnum, ulong_t a1, ulong_t a2, ulong_t a3, ulong_t a4);

The function in assembly is as follows:

    [0]> __hypercall4,7/ai
    __hypercall4:
    __hypercall4:       movl   %edi,%eax    /* %edi is the first argument */
    __hypercall4+2:     movq   %rsi,%rdi
    __hypercall4+5:     movq   %rdx,%rsi
    __hypercall4+8:     movq   %rcx,%rdx
    __hypercall4+0xb:   movq   %r8,%r10
    __hypercall4+0xe:   syscall
    __hypercall4+0x10:  ret

The calling convention is compliant with the AMD64 ABI [8].

The SYSCALL instruction is intended to enable unprivileged software (ring 3) to access services from privileged software (ring 0). Solaris system calls also use SYSCALL to allow user applications to access Solaris kernel services. Having SYSCALL used by both Solaris system calls and the hypercalls means that the SYSCALL made by the user process in Solaris is delivered indirectly by the VMM to the Solaris kernel. This causes a slight overhead for each Solaris system call.


A complete list of Sun xVM Server hypercalls is provided in Table 4:

Table 4. Sun xVM Server hypercalls.

Privilege Operations:

    long set_trap_table(trap_info_t *table);
    long mmu_update(mmu_update_t *req, int count, int *success_count, domid_t domid);
    long set_gdt(ulong_t *frame_list, int entries);
    long stack_switch(ulong_t ss, ulong_t esp);
    long fpu_taskswitch(int set);
    long mmuext_op(struct mmuext_op *req, int count, int *success_count, domid_t domain_id);
    long update_descriptor(maddr_t ma, uint64_t desc);
    long update_va_mapping(ulong_t va, uint64_t new_pte, ulong_t flags);
    long set_timer_op(uint64_t timeout);
    long physdev_op(void *physdev_op);
    long vm_assist(uint_t cmd, uint_t type);
    long update_va_mapping_otherdomain(ulong_t va, uint64_t new_pte, ulong_t flags, domid_t domain_id);
    long iret();
    long set_segment_base(int reg, ulong_t value);
    long nmi_op(ulong_t op, void *arg);
    long hvm_op(int cmd, void *arg);

VMM Services:

    long set_callbacks(ulong_t event_address, ulong_t failsafe_address, ulong_t syscall_address);
    long grant_table_op(uint_t cmd, void *uop, uint_t count);
    long event_channel_op(void *op);
    long xen_version(int cmd, void *arg);
    long set_debugreg(int reg, ulong_t value);
    long get_debugreg(int reg);
    long multicall(void *call_list, int nr_calls);
    long console_io(int cmd, int count, char *str);
    long sched_op(int cmd, void *arg);
    long do_kexec_op(unsigned long op, int arg1, void *arg);

VM Control Operations:

    long sched_op_compat(int cmd, ulong_t arg);
    long platform_op(xen_platform_op_t *platform_op);
    long memory_op(int cmd, void *arg);
    long vcpu_op(int cmd, int vcpuid, void *extra_args);
    long sysctl(xen_sysctl_t *sysctl);
    long domctl(xen_domctl_t *domctl);
    long acm_op();

As Table 4 shows, the hypercalls provide a variety of functions for a GOS:

• Perform privileged operations such as setting the trap table, updating the page table, loading the GDT, and setting the GS and FS segment registers

• Get services from the VMM such as using the event channel, grant table, set_callbacks services, and scheduled operations

• Control VM operations such as platform_op, domain control, and virtual CPU control

An example use of a hypercall is to request a set of page table updates. For example, a new process created by the fork(2) call requires the creation of page tables. The hypercall HYPERVISOR_mmu_update(), which validates and applies a list of updates, is called by the Solaris kernel to perform the page table updates. This routine returns control to the calling domain when the operation is completed.

In the following example, a kmdb(1M) breakpoint is set at the mmu_update() call. The stack trace illustrates how the mmu_update() function is called after a new process is created by fork():

    [1]> set_pteval+0x4f:b            // set breakpoint at HYPERVISOR_mmu_update
    [1]> :c                           // continue
    kmdb: stop at set_pteval+0x4f     // the breakpoint reached
    kmdb: target stopped at:
    set_pteval+0x4f:call -0x5a34 <HYPERVISOR_mmu_update>
    [1]> $c                           // display the stack trace
    set_pteval+0x4f(c753000, 1fb, 3, f9c29027)
    x86pte_copy+0x73(fffffffec08115a8, fffffffec2a8a0d8, 1fb, 5)
    hat_alloc+0x228(fffffffec2fa88c0)
    as_alloc+0x99()
    as_dup+0x3f(fffffffec27b1d28, fffffffec2a11168)
    cfork+0x102(0, 1, 0)
    forksys+0x25(0, 0)
    sys_syscall32+0x13e()
    [1]>

The above example shows that the kernel doesn't maintain a copy of the page table. It uses the mmu_update() hypercall to request the VMM to update the page table.

Event Channels

To a GOS, a VMM event is the equivalent of a hardware interrupt. Communication from the VMM to a VM is provided through an asynchronous event mechanism, called an event channel, which replaces the usual delivery mechanisms for device interrupts. A VM creates an event channel to send and receive asynchronous event notifications. Three classes of events are delivered by this event channel mechanism:

• Bi-directional inter- and intra-VM connections

A VM can bind an event-channel port to another domain or to another virtual CPU within the VM.

• Physical interrupts

A VM with direct access to hardware (Dom0) can bind an event-channel port to a physical interrupt source.

• Virtual interrupts

A VM can bind an event-channel port to a virtual interrupt source, such as the virtual-timer device.


Event channels are addressed by a port. Each channel is associated with two bits of

information:

• unsigned long evtchn_pending[sizeof(unsigned long) * 8]

This value notifies the domain that there is a pending notification to be processed.

This bit is cleared by the GOS.

• unsigned long evtchn_mask[sizeof(unsigned long) * 8]

This value specifies if the event channel is masked. If this bit is clear and PENDING is

set, an asynchronous upcall will be scheduled. This bit is only updated by the GOS; it

is read-only within the VMM.

Interrupts to a VM are virtualized by mapping them to event channels. These interrupts

are delivered asynchronously to the target domain using a callback supplied via the

set_callbacks hypercall. A guest OS can map these events onto its standard

interrupt dispatch mechanisms. The VMM is responsible for determining the target

domain that will handle each physical interrupt source.

“Interrupts and Exceptions” on page 49 provides a detailed discussion of how an

interrupt is handled by the VMM and delivered to a VM using an event channel.

Grant Tables

The Sun xVM Hypervisor for x86 allows sharing memory among VMs, and between the

VMM and a VM, through a grant table mechanism. Each VM makes some of its pages

available to other VMs by granting access to its pages. The grant table is a data

structure that a VM uses to expose some of its pages, specifying what permissions

other VMs have on its pages. The following example shows the information stored in a grant table entry:

    struct grant_entry {
        /* GTF_xxx: various type and flag information. [XEN,GST] */
        uint16_t flags;
        /* The domain being granted foreign privileges. [GST] */
        domid_t domid;
        uint32_t frame;    // page frame number (PFN)
    };

The flags field stores the type and various flag information of the grant table entry. There are three types of grant table entries:

• GTF_invalid — Grants no privileges.

• GTF_permit_access — Allows the domain domid to map/access the specified frame.

• GTF_accept_transfer — Allows domid to transfer ownership of one page frame to this guest; the VMM writes the page number to frame.


The type information acts as a capability which the grantee can use to perform

operations on the granter's memory. A grant reference also encapsulates the details of

a shared page, removing the need for a domain to know the real machine address of a

page it is sharing. This makes it possible to share memory correctly with domains

running in fully virtualized memory.
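The capability check implied by a grant entry can be sketched as follows. The structure repeats the fields described in the text; the numeric values of the GTF_ constants, the GTF_type_mask name, and both helper functions are assumptions made for this sketch, and a real granter would also need memory barriers so that flags is written last.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t domid_t;

/* Entry types; the numeric values are assumed for illustration. */
#define GTF_invalid         0u
#define GTF_permit_access   1u
#define GTF_accept_transfer 2u
#define GTF_type_mask       3u

struct grant_entry {
    uint16_t flags;   /* GTF_xxx type and flag information */
    domid_t  domid;   /* the domain being granted privileges */
    uint32_t frame;   /* page frame number */
};

/* Granter side: publish one frame to one specific domain. */
void grant_access(struct grant_entry *e, domid_t grantee, uint32_t frame)
{
    assert(e != NULL);
    e->frame = frame;
    e->domid = grantee;
    e->flags = GTF_permit_access;   /* written last: this makes the entry valid */
}

/* Check performed before letting `requester` map the frame. */
bool grant_permits_access(const struct grant_entry *e, domid_t requester)
{
    assert(e != NULL);
    return (e->flags & GTF_type_mask) == GTF_permit_access
        && e->domid == requester;
}
```

The grantee only ever presents a reference to an entry like this one, which is why it never needs to learn the machine address of the shared page.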

Device drivers in the Sun xVM Server (see “Sun xVM Server I/O Virtualization” on

page 56) use grant tables to send data between drivers of different domains, and use

event channels and callback services for asynchronous notification of data availability.

XenStore

XenStore [22] is a shared storage space used by domains to communicate and store

configuration information. XenStore is the mechanism by which control-plane activities,

including the following, occur:

• Setting up shared memory regions and event channels for use with split device

drivers

• Notifying the guest of control events (for example, balloon driver requests)

• Reporting status information from the guest (for example, performance-related

statistics)

The store is arranged as a hierarchical collection of key-value pairs. Each domain has a

directory hierarchy containing data related to its configuration. Domains are permitted

to register for notifications about changes in a subtree of the store, and to apply

changes to the store transactionally.

Sun xVM Server CPU Virtualization

The Sun xVM Hypervisor for x86 provides a paravirtualized environment to a VM. Full

CPU virtualization to a VM is achieved by a concerted coordination of CPU management

by the VMM, and CPU usage by the GOS within a VM.

The next sections discuss CPU virtualization employed by the Sun xVM Server for these

tasks:

• Deprivileging CPUs to run a VM

• Scheduling CPUs for VMs

• Handling and delivery of interrupts to a VM

• Providing timer services to a VM

CPU Privilege Mode

The Sun xVM Hypervisor for x86 operates at a higher privilege level than the GOS. On

32-bit x86 processors with protected mode enabled, a GOS may use rings 1, 2 and 3 as

it sees fit. The Sun xVM Server kernel uses ring 1 for its own operation and places

applications in ring 3.


On 64-bit systems, linear address space (flat memory model) is used to create a

continuous, unsegmented address space for both the kernel and application programs.

Segmentation is disabled and rings 1 and 2, which practically do not exist, have the

same privilege to access paging as ring 0 (see “Protected Mode” and following sections

beginning on page 22). To protect the VMM, the Sun xVM Server kernel is therefore

restricted to run in ring 3 for the 64-bit mode and in ring 1 for the 32-bit mode only, as

seen in the definitions in segments.h:

If both kernel and user application run with the same privilege level, how does Sun

xVM Server protect the kernel from user applications? The answer is given as follows

[32]:

1. The VMM performs context switching between kernel mode and the currently

running application in user mode. The VMM tracks which mode the GOS is

running in, kernel or user.

2. The GOS maintains two top level (PML4) page tables per process, one each for

kernel and user. The GOS registers the two page tables with the VMM. The kernel

page table contains translations for both the kernel and user addresses, and the

user page table contains translations only for the user addresses. During the

context switch, the VMM switches the top level page table so the kernel addresses

are not visible to the user process. The linear address mapping to paging data

structure for 64-bit x86 processor is shown below in Figure 17:

Figure 17. Linear address mapping to paging data structure for 64-bit x86 processor.
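The field boundaries in Figure 17 (offset in bits 0 to 11, then four 9-bit table indexes, then sign extension from bit 48 upward) can be expressed as shifts and masks. This is a sketch for illustration; the structure and function names are assumptions and do not come from the Solaris source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Indexes extracted from a 64-bit linear address (names are illustrative). */
struct la_fields {
    unsigned pml4;      /* bits 47..39 */
    unsigned pdp;       /* bits 38..30 */
    unsigned pde;       /* bits 29..21 */
    unsigned pte;       /* bits 20..12 */
    unsigned offset;    /* bits 11..0  */
};

static struct la_fields
la_split(uint64_t la)
{
    struct la_fields f;

    f.offset = (unsigned)(la & 0xfff);
    f.pte = (unsigned)((la >> 12) & 0x1ff);
    f.pde = (unsigned)((la >> 21) & 0x1ff);
    f.pdp = (unsigned)((la >> 30) & 0x1ff);
    f.pml4 = (unsigned)((la >> 39) & 0x1ff);
    return (f);
}

/* An address is canonical when bits 63..48 are copies of bit 47. */
static bool
la_is_canonical(uint64_t la)
{
    uint64_t top = la >> 47;    /* bits 63..47, 17 bits in total */

    return (top == 0 || top == 0x1ffff);
}
```

Each 9-bit index selects one of 512 entries in the corresponding table, which is why the VMM can hide the kernel from a user process simply by switching the top-level (PML4) table.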

Switching the PML4 page tables between kernel and user mode enables a 64-bit

address space to be split into two logically separate address spaces. In this logical

separation of a 64-bit address space, the kernel can access both its address space and a

user address space while a user process can access only its own address space. The user

address space in this addressing scheme is therefore restricted to the lower half of the

48-bit canonical address range (addresses below 2^47) within the 64-bit address space. The resulting address space partition in the 64-bit Sun xVM

Server is shown as follows, in Figure 18:

% cat intel/sys/segments.h
....
#if defined(__amd64)
#define SEL_XPL 0 /* xen privilege level */
#define SEL_KPL 3 /* both kernel and user in ring 3 */
#elif defined(__i386)
#define SEL_XPL 0 /* xen privilege level */
#define SEL_KPL 1 /* kernel privilege level under xen */
#endif /* __i386 */

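[Figure 17 bit fields of the 64-bit linear address: Offset (bits 0-11), PTE (bits 12-20), PDE (bits 21-29), PDP (bits 30-38), PML4 (bits 39-47), Sign Extended (bits 48-63)]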


Figure 18. Address space partitioning in the 64-bit Sun xVM Server.
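The partition in Figure 18 can be read as a simple range check. The sketch below assumes the boundaries printed in the figure (2^47, 0xFFFF8000 00000000, and 0xFFFF8800 00000000) and assumes that each region extends up to the next boundary; the enum and function names are hypothetical.

```c
#include <stdint.h>

/* Regions as labeled in Figure 18 (identifiers are illustrative). */
enum va_region { VA_USER, VA_RESERVED, VA_VMM, VA_KERNEL };

#define USER_LIMIT  0x0000800000000000ULL   /* 2^47: first address above user space */
#define VMM_BASE    0xffff800000000000ULL   /* start of the VMM (ring 0) region */
#define KERNEL_BASE 0xffff880000000000ULL   /* start of the guest kernel (ring 3) region */

static enum va_region
va_classify(uint64_t va)
{
    if (va < USER_LIMIT)
        return (VA_USER);
    if (va < VMM_BASE)
        return (VA_RESERVED);   /* non-canonical hole between the two halves */
    if (va < KERNEL_BASE)
        return (VA_VMM);
    return (VA_KERNEL);
}
```

Under this reading, a kernel text address such as 0xfffffffffb844bf0 falls in the kernel region even though the kernel, like user code, runs in ring 3.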

As discussed previously (see “Segmented Architecture” on page 23), the processor

privilege level is set when a segment is loaded. The Solaris OS uses the GDT for user and

kernel segments. The segment index of each segment type is assigned as shown in

Table 5 on page 54.
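Because code and data segment descriptors occupy 8 bytes each, a GDT index maps directly to a byte offset from the start of the table, and to a selector once a requested privilege level is added. The sketch below uses the generic x86 selector encoding and the GDT_KCODE and GDT_U32CODE values from segments.h; the helper names are hypothetical, and selectors actually handed out under the VMM (such as the 0xe030 value seen in kmdb output) come from the VMM's own GDT rather than from this calculation.

```c
#include <stdint.h>

#define GDT_KCODE   6   /* kernel code segment index, from segments.h */
#define GDT_U32CODE 8   /* 32-bit user code segment index, from segments.h */
#define DESC_SIZE   8   /* bytes per code or data segment descriptor */

/* Byte offset of a descriptor from the base of the GDT. */
static unsigned
gdt_byte_offset(unsigned index)
{
    return (index * DESC_SIZE);
}

/* Generic x86 selector: index in bits 15..3, TI = 0 (GDT), RPL in bits 1..0. */
static uint16_t
make_selector(unsigned index, unsigned rpl)
{
    return ((uint16_t)((index << 3) | (rpl & 0x3)));
}
```

This is why the kmdb(1M) expressions gdt0+30 and gdt0+40 address the kernel code descriptor (index 6) and the 32-bit user code descriptor (index 8).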

The command kmdb(1M) can be used to examine the segment descriptor of kernel

code:

[0]> gdt0+30::print -t 'struct user_desc' // 64-bit kernel code segment
{
    unsigned long usd_lolimit :16 = 0x7000
    unsigned long usd_lobase :16 = 0xe030
    unsigned long usd_midbase :8 = 0
    unsigned long usd_type :5 = 0xe
    unsigned long usd_dpl :2 = 0x3
    unsigned long usd_p :1 = 0x1
    unsigned long usd_hilimit :4 = 0x4
    unsigned long usd_avl :1 = 0
    unsigned long usd_long :1 = 0
    unsigned long usd_def32 :1 = 0
    unsigned long usd_gran :1 = 0x1
    unsigned long usd_hibase :8 = 0xfb
}
> gdt0+40::print -t 'struct user_desc' // 32-bit user code segment
{
    unsigned long usd_lolimit :16 = 0xc450
    unsigned long usd_lobase :16 = 0xe030
    unsigned long usd_midbase :8 = 0xf8
    unsigned long usd_type :5 = 0xe
    unsigned long usd_dpl :2 = 0x3
    unsigned long usd_p :1 = 0x1
    unsigned long usd_hilimit :4 = 0x1
    unsigned long usd_avl :1 = 0
    unsigned long usd_long :1 = 0
    unsigned long usd_def32 :1 = 0
    unsigned long usd_gran :1 = 0x1
    unsigned long usd_hibase :8 = 0xfb
}

[Figure 18 layout, from low to high addresses: User (ring 3) from 0 up to 2^47 (0x7FFF FFFFFFFF); Reserved; VMM (ring 0) starting at 0xFFFF8000 00000000; Kernel (ring 3) starting at 0xFFFF8800 00000000]


The descriptor privilege levels (DPLs) of both the kernel and 32-bit user code segments are set

to 3. At boot time, the Sun xVM Hypervisor for x86 is loaded into memory in ring 0.

After initialization, it loads the Solaris kernel to run as Dom0 in ring 3. The domain

Dom0 is permitted to use the VM control hypercall interfaces (see Table 4 on page 42),

and is responsible for hosting the application-level management software.

CPU Scheduling

The Sun xVM Hypervisor for x86 provides two schedulers for the user to choose

between: Credit and simple Earliest Deadline First (sEDF). The Credit scheduler is the

default scheduler; sEDF might be phased out and removed from the Sun xVM Server

implementation.

The Credit scheduler is a proportional fair share CPU scheduler. Each physical CPU

(PCPU) manages a queue of runnable virtual CPUs (VCPUs). This queue is sorted by

VCPU priority. A VCPU's priority can be either over or under, representing whether this

VCPU has exceeded its share of the PCPU or not.

A VCPU's share is determined by weight assigned to the VM and credit accumulated by

the VCPU in each accounting period.

The first equation determines the total credit of a VM and the second equation

determines the credit of a VCPU in a VM. Credit_total is a constant; Weight_total is the sum

of the weights of all domains. A weight is assigned to a VM using xm(1M) (for

example, xm sched-credit -w weight). In each accounting period, a fixed amount

of credit is added to idle VCPUs and subtracted from running VCPUs.

The VCPU has the priority under if the VCPU has not consumed all credits it possesses.

On each PCPU, at every scheduling decision (when a VCPU blocks, yields, completes its

time slice, or is awakened), the next VCPU to run is picked off the head of the run queue

of priority under. When a VM runs, it consumes credits of its VCPU[s]. When a VCPU

uses all its allocated credits, the VCPU's priority is changed from under to over. When a

PCPU doesn't find a VCPU of priority under on its local run queue, it will look on other

PCPUs for one. This load balancing guarantees each VM receives its fair share of PCPU

resources system-wide. Before a PCPU goes idle, it will look on other PCPUs to find any

runnable VCPU. This guarantees that no PCPU idles when there is runnable work in the

system.
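The two credit equations referenced in this section are integer ceiling divisions: adding the divisor minus one before dividing rounds the share up. The sketch below shows that arithmetic along with the under/over test; the function names and the sample weights in the usage note are illustrative assumptions, not values from the Sun xVM Server source.

```c
#include <stdint.h>

enum vcpu_prio { PRIO_UNDER, PRIO_OVER };

/* Credit of one VM: ceil(credit_total * weight_vm / weight_total). */
static uint64_t
credit_vm(uint64_t credit_total, uint64_t weight_vm, uint64_t weight_total)
{
    return ((credit_total * weight_vm + (weight_total - 1)) / weight_total);
}

/* Credit of one VCPU: ceil(credit_vm / total_vcpu). */
static uint64_t
credit_vcpu(uint64_t vm_credit, uint64_t total_vcpu)
{
    return ((vm_credit + (total_vcpu - 1)) / total_vcpu);
}

/* A VCPU stays "under" while it still holds unconsumed credit. */
static enum vcpu_prio
vcpu_priority(int64_t credit_left)
{
    return (credit_left > 0 ? PRIO_UNDER : PRIO_OVER);
}
```

For example, with a total of 300 credits and two VMs weighted 256 and 512, the VMs receive 100 and 200 credits; a 100-credit VM with three VCPUs gives each VCPU 34 credits because of the rounding.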

$$\mathit{Credit}_{VM_i} = \frac{\mathit{Credit}_{total} \times \mathit{Weight}_{VM_i} + (\mathit{Weight}_{total} - 1)}{\mathit{Weight}_{total}}$$

$$\mathit{Credit}_{VCPU_{j,i}} = \frac{\mathit{Credit}_{VM_i} + (\mathit{TotalVCPU}_{VM_i} - 1)}{\mathit{TotalVCPU}_{VM_i}}$$

Earliest Deadline First (EDF) scheduling provides weighted CPU sharing by comparing

the deadline of scheduled periodic processes (or domains, in the case of Sun xVM

Server). This scheduler places domains in a priority queue. Each domain is associated

with two parameters: time requested to run, and an interval or deadline. Whenever a

scheduling event occurs, the queue is searched for the domain closest to its deadline.

This domain is then scheduled for execution next with the time requested. The EDF

scheduler gives better CPU utilization when a system is underloaded. When the

system is overloaded, the set of domains that will miss deadlines is largely

unpredictable (it is a function of the exact deadlines and time at which the overload

occurs).
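The EDF selection step described above can be sketched as a linear scan for the runnable domain with the smallest deadline. The structure layout and names are assumptions made for illustration; a real scheduler would typically keep the queue sorted rather than scanning it.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-domain scheduling parameters (field names are illustrative). */
struct edf_dom {
    int      id;            /* domain identifier */
    uint64_t slice_ns;      /* time requested to run */
    uint64_t deadline_ns;   /* absolute deadline of the current period */
    int      runnable;      /* nonzero when the domain can run */
};

/*
 * Return the index of the runnable domain closest to its deadline,
 * or -1 when no domain is runnable.
 */
static int
edf_pick(const struct edf_dom *doms, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (!doms[i].runnable)
            continue;
        if (best < 0 || doms[i].deadline_ns < doms[best].deadline_ns)
            best = (int)i;
    }
    return (best);
}
```

The chosen domain would then run for its `slice_ns` before the next scheduling event.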

Interrupts and Exceptions

The x86 processor identifies each exception and interrupt by one of 256 vector numbers.

The vector number is an index into the interrupt descriptor table (IDT). The IDT

associates each vector with a gate descriptor for the procedure for handling the

interrupt or exception. The IDT register (IDTR) contains the base address of the IDT.
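In a 64-bit gate descriptor the handler address is stored in three pieces, so locating the handler for a vector means reassembling them. The sketch below does that and copies the result into a structure with the same members as the trap_info structure quoted in this section. The type and function names are hypothetical, only a subset of the descriptor fields is modeled, and the address member is widened to uint64_t so the sketch behaves the same on any platform.

```c
#include <stdint.h>

/* Subset of the 64-bit gate descriptor fields (sgd_* in kmdb output). */
struct gate_fields {
    uint16_t looffset;      /* handler offset bits 15..0 */
    uint16_t selector;      /* code segment selector */
    uint8_t  dpl;           /* descriptor privilege level */
    uint16_t hioffset;      /* handler offset bits 31..16 */
    uint32_t hi64offset;    /* handler offset bits 63..32 */
};

/* Same members as trap_info, with a fixed-width address. */
struct trap_info_sketch {
    uint8_t  vector;
    uint8_t  flags;         /* bits 0..1: privilege level */
    uint16_t cs;
    uint64_t address;
};

static uint64_t
gate_handler_address(const struct gate_fields *g)
{
    return (((uint64_t)g->hi64offset << 32) |
        ((uint64_t)g->hioffset << 16) | g->looffset);
}

static struct trap_info_sketch
gate_to_trap_info(uint8_t vec, const struct gate_fields *g)
{
    struct trap_info_sketch ti;

    ti.vector = vec;
    ti.flags = g->dpl & 0x3;
    ti.cs = g->selector;
    ti.address = gate_handler_address(g);
    return (ti);
}
```

Feeding in the idt0 values printed by kmdb(1M) in this section (low offset 0x4bf0, high offset 0xfb84, upper 32 bits 0xffffffff) yields the handler address 0xfffffffffb844bf0.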

When Sun xVM Server is booting up, it registers its own IDT to the VMM. During system initialization, an early stage of Solaris boot, the Solaris kernel function init_desctbls() is called to initialize the GDT and IDT:

void
init_desctbls(void)
{
    ....
    init_idt(&idt0[0]);
    for (vec = 0; vec < NIDT; vec++)
        xen_idt_write(&idt0[vec], vec);
    ....
}

The Solaris kernel function init_desctbls() passes each of its exception and interrupt vectors to the VMM using the set_trap_table() hypercall:

void
xen_idt_write(gate_desc_t *sgd, uint_t vec)
{
    trap_info_t trapinfo[2];

    bzero(trapinfo, sizeof (trapinfo));
    if (xen_idt_to_trap_info(vec, sgd, &trapinfo[0]) == 0)
        return;
    if (xen_set_trap_table(trapinfo) != 0)
        panic("xen_idt_write: xen_set_trap_table() failed");
}

The set_trap_table() hypercall has one argument, trap_info, which contains the privilege level of the GOS code segment, the code segment selector, and the address of the handler which will be used to set the instruction pointer when the VMM passes the control back to the GOS (see following code segment). The value of trap_info is set in the function xen_idt_to_trap_info() using the setting in the kernel global variable array, idt0.

typedef struct trap_info {
    uint8_t vector; /* exception vector */
    uint8_t flags; /* 0-3: privilege level */
    uint16_t cs; /* code selector */
    unsigned long address; /* code offset */
} trap_info_t;

On a 64-bit system, the interrupt descriptor has the descriptor privilege level (DPL) 3, similar to the segment descriptor:

[0]> idt0::print 'struct gate_desc'
{
    sgd_looffset = 0x4bf0
    sgd_selector = 0xe030
    sgd_ist = 0
    sgd_resv1 = 0
    sgd_type = 0xe
    sgd_dpl = 0x3
    sgd_p = 0x1
    sgd_hioffset = 0xfb84
    sgd_hi64offset = 0xffffffff
    sgd_resv2 = 0
    sgd_zero = 0
    sgd_resv3 = 0
}

When an interrupt or exception occurs, the VMM’s trap handler is invoked to handle the interrupt or exception. If this is an exception caused by a GOS, the VMM’s trap handler sets the pending bit (see “Event Channels” on page 43) and calls the GOS's exception handler. Interrupts for the GOS are virtualized by mapping them to event channels, which are delivered asynchronously to the target GOS through the callback that the GOS registers with the set_callbacks() hypercall.

In the following example, a kmdb(1M) breakpoint is set at the interrupt service routine of the sd driver, sdintr(). The function xen_callback_handler(), the callback function used for processing events from the VMM, is registered in the VMM by the hypercall set_callbacks(). When an interrupt intended for sd arrives, the hypercall HYPERVISOR_block() detects an event is available and then invokes the callback function:

Pending events are stored in a per-domain bitmask (see “Event Channels” on page 43),

that is updated by the VMM before invoking an event-callback handler specified by the

GOS. The function xen_callback_handler() is responsible for resetting the set of

pending events and responding to the notifications in an appropriate manner. A VM

may explicitly defer event handling by setting a VMM-readable software flag; this is

analogous to disabling interrupts on a real processor.
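The pending bitmask and the deferral flag described above can be modeled in a few lines. This is a simplified sketch limited to 64 event ports in a single word; the structure, the function names, and the handler signature are assumptions and do not reproduce the real shared-info layout or the xen_callback_handler() implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified per-domain event state (layout is illustrative). */
struct evt_state {
    uint64_t pending;   /* one bit per event port, set by the VMM */
    int      masked;    /* guest-set flag: defer event delivery */
};

typedef void (*evt_handler_t)(unsigned port);

/* VMM side: mark an event pending. Requires port < 64 in this sketch. */
static void
vmm_send_event(struct evt_state *s, unsigned port)
{
    s->pending |= 1ULL << port;
}

/*
 * Guest side: the callback resets the pending set and dispatches each
 * event. Returns the number of events handled; nothing is handled while
 * the guest has delivery masked.
 */
static unsigned
guest_callback(struct evt_state *s, evt_handler_t handler)
{
    uint64_t p;
    unsigned handled = 0;

    if (s->masked)
        return (0);
    p = s->pending;
    s->pending = 0;
    for (unsigned port = 0; port < 64; port++) {
        if (p & (1ULL << port)) {
            if (handler != NULL)
                handler(port);
            handled++;
        }
    }
    return (handled);
}
```

Events raised while `masked` is set simply accumulate in `pending` and are handled on the first callback after the flag is cleared, which mirrors re-enabling interrupts on a real processor.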

Timer Services

“Timer Devices” on page 27 discusses several hardware timers available on x86 systems.

These hardware devices vary in their frequency reliability, granularity, counter size, and

ability to generate interrupts. The Solaris OS employs some of these timer devices for

running the OS clock and high resolution timer:

• OS system clock — The Solaris OS uses the local APIC timer on multiprocessor

systems to generate ticks for the system clock. On uniprocessor systems, the Solaris

OS uses the PIT to generate ticks for the system clock.

• High resolution timer — The Solaris OS uses the TSC timer for a high resolution

timer. The PIT counter is used to calibrate the TSC counter.

• Time-of-day clock — The time-of-day (TOD) clock is based on the RTC. Only Dom0 can

set the TOD clock. The DomU VMs don't have the permission to update the machine's

physical RTC. Therefore, any attempt by the date(1) command to set the date and

time on DomU will be quietly ignored.

The kmdb(1M) output for the sdintr() breakpoint example in “Interrupts and Exceptions” is as follows:

sd`sdintr:
sd`sdintr: ec8b4855 = pushq %rbp
[0]> $c
sd`sdintr(fffffffec0670000)
mpt`mpt_intr+0xdb(fffffffec0670000, 0)
av_dispatch_autovect+0x78(1b)
dispatch_hardint+0x33(1b, 0)
switch_sp_and_call+0x13()
do_interrupt+0x9b(ffffff0001005ae0, 1)
xen_callback_handler+0x36c(ffffff0001005ae0, 1)
xen_callback+0xd9()
HYPERVISOR_sched_op+0x29(1, 0)
HYPERVISOR_block+0x11()
mach_cpu_idle+0x52()
cpu_idle+0xcc()
idle+0x10e()
thread_start+8()
[0]>

In Sun xVM Server, the VMM provides the system time to each VCPU when it is scheduled to run. The high resolution timer, gethrtime(), is still run through the unprivileged RDTSC instruction, thus the high resolution timer is not virtualized. The virtualized system time relies on the current TSC to calculate the time in nanoseconds since the VCPU was scheduled.
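The calculation described above, system time at scheduling plus the nanoseconds implied by the TSC cycles elapsed since then, can be sketched as follows. The structure and its field names are assumptions, and the plain divide-by-frequency conversion is a simplification chosen for clarity; it is not presented as the scaling method the VMM actually uses.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Snapshot handed to a VCPU when it is scheduled (names are illustrative). */
struct vcpu_time {
    uint64_t system_time_ns;    /* system time at the moment of scheduling */
    uint64_t tsc_stamp;         /* TSC value captured at that moment */
    uint64_t tsc_hz;            /* TSC frequency in cycles per second */
};

/*
 * Current system time in nanoseconds for a given TSC reading.
 * The delta is split into whole seconds and a remainder so that the
 * multiplication cannot overflow for frequencies below about 18 GHz.
 */
static uint64_t
vcpu_now_ns(const struct vcpu_time *t, uint64_t tsc_now)
{
    uint64_t delta = tsc_now - t->tsc_stamp;
    uint64_t sec = delta / t->tsc_hz;
    uint64_t rem = delta % t->tsc_hz;

    return (t->system_time_ns + sec * NSEC_PER_SEC +
        rem * NSEC_PER_SEC / t->tsc_hz);
}
```

With a 2 GHz TSC, a reading 3,000,000,000 cycles past the stamp advances the system time by 1.5 seconds.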

Sun xVM Server Memory Virtualization

Memory virtualization in Sun xVM Server deals with the following two memory

management issues:

• Physical memory sharing and partitioning

• Page table access

Physical Memory Management

Sun xVM Server introduces a distinction between machine memory and physical

memory. Machine memory refers to the entire amount of memory installed in the

machine. Physical memory is a per-VM abstraction that allows a GOS to envision its

memory as a contiguous range of physical pages starting at physical page frame

number (PFN) 0, despite the fact that the underlying machine PFN may be sparsely

allocated and in any order (see “Page Translations Virtualization” on page 14).

The VMM maintains a table of machine-to-physical memory mappings. The GOS

performs all page allocations and management based on physical memory. During

page table updates, a conversion from physical memory to machine memory is

performed before making the mmu_update() hypercall to update the page tables.
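The physical-to-machine conversion that precedes a page table update is a table lookup on the page frame number with the page offset carried over unchanged. The sketch below assumes 4 KB pages and a guest-visible array indexed by physical frame number; `sketch_pa_to_ma()` and the structure are hypothetical stand-ins, not the Solaris pa_to_ma() implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define PAGE_OFFSET ((1ULL << PAGE_SHIFT) - 1)
#define MA_INVALID  UINT64_MAX

/* Per-VM physical-to-machine frame table (layout is illustrative). */
struct p2m {
    const uint64_t *pfn_to_mfn; /* indexed by physical frame number */
    size_t          npages;     /* number of physical pages in the VM */
};

/* Translate a guest physical address into a machine address. */
static uint64_t
sketch_pa_to_ma(const struct p2m *m, uint64_t pa)
{
    uint64_t pfn = pa >> PAGE_SHIFT;

    if (pfn >= m->npages)
        return (MA_INVALID);
    return ((m->pfn_to_mfn[pfn] << PAGE_SHIFT) | (pa & PAGE_OFFSET));
}
```

The guest sees frames 0, 1, 2, and so on, while the machine frames behind them may be scattered anywhere in machine memory.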

Since VMs get created and deleted randomly throughout time, the VMM employs

memory hotplug and ballooning schemes to optimize the memory usage in a machine.

Memory hotplug allows a GOS to dynamically add or remove physical memory to its

inventory. The memory ballooning technique allows a VMM to dynamically adjust the

usage of physical memory among VMs.

For example, consider a machine that has 8 GB of memory. Two VMs, VM-A and VM-B,

are initially created with 5 GB of memory each. Memory hotplug adds 5 GB memory to

both VMs after they are booted. The total memory committed to both VMs is greater

than the actual physical memory available. When VM-A needs more physical memory,

the memory ballooning technique increases memory pressure in VM-B by inflating the

balloon driver. This results in memory being paged out to free up the memory

consumed by VM-B, and thus more memory becoming available to VM-A.
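The page accounting behind ballooning can be sketched as two operations: inflating a balloon hands pages from a VM back to the VMM, and deflating it takes pages from the VMM's free pool. The structures, names, and page counts below are illustrative assumptions and leave out the paging activity inside the guest.

```c
#include <stdint.h>

/* Per-VM and host page accounting (names are illustrative). */
struct vm_mem {
    uint64_t reserved_pages;    /* pages the VM believes it owns */
    uint64_t balloon_pages;     /* pages currently held by the balloon */
};

struct host_mem {
    uint64_t free_pages;        /* machine pages not backing any VM */
};

/* Inflate: the VM gives up pages. Returns the number actually released. */
static uint64_t
balloon_inflate(struct vm_mem *vm, struct host_mem *host, uint64_t pages)
{
    uint64_t usable = vm->reserved_pages - vm->balloon_pages;

    if (pages > usable)
        pages = usable;
    vm->balloon_pages += pages;
    host->free_pages += pages;
    return (pages);
}

/* Deflate: the VM reclaims pages, limited by what the host has free. */
static uint64_t
balloon_deflate(struct vm_mem *vm, struct host_mem *host, uint64_t pages)
{
    if (pages > vm->balloon_pages)
        pages = vm->balloon_pages;
    if (pages > host->free_pages)
        pages = host->free_pages;
    vm->balloon_pages -= pages;
    host->free_pages -= pages;
    return (pages);
}
```

In terms of the example above, VM-A cannot deflate its balloon while the host has no free pages; inflating the balloon in VM-B is what makes those pages available.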

The GOS requests physical memory management services from the VMM through

the memory_op(cmd, ...) hypercall. The operations supported by the

memory_op() hypercall include the following:

• XENMEM_increase_reservation

• XENMEM_decrease_reservation


• XENMEM_populate_physmap

• XENMEM_maximum_ram_page

• XENMEM_current_reservation

• XENMEM_maximum_reservation

• XENMEM_machphys_mfn_list

• XENMEM_add_to_physmap

• XENMEM_translate_gpfn_list

• XENMEM_memory_map

• XENMEM_machine_memory_map

• XENMEM_set_memory_map

• XENMEM_machphys_mapping

• XENMEM_exchange

Page Translations

“Segmented Architecture” on page 23 describes two stages of address translation to

arrive at a physical address: virtual address (VA) to linear address (LA) translation using

segmentation, and LA to physical address (PA) translation using paging. Solaris x64 uses

a flat address space in which the VA and LA are equivalent, which means the base

address of the segment is 0. In Solaris 10, the Global Descriptor Table (GDT) contains the

segment descriptor for the code and data segments of both kernel and user processes,

as shown in Table 5 on page 54.

Since there is only one GDT in a system, the VMM maintains the GDT in its memory. If a GOS wishes to use something other than the default segment mapping that the VMM GDT provides, it must register a custom GDT with the VMM using the set_gdt() hypercall. In the following code sample, frame_list is the physical address of the page that contains the GDT and entries is the number of entries in the GDT.

xen_set_gdt(ulong_t *frame_list, int entries)
{
    ....
    if ((err = HYPERVISOR_set_gdt(frame_list, entries)) != 0) {
        ....
    }
    return (err);
}

The Solaris 32-bit thread library uses %gs to refer to the LWP state manipulated by the internals of the thread library. The 64-bit thread library uses %fs to refer to the LWP state as specified by the AMD64 ABI [8]. The 64-bit kernel still uses %gs for its CPU state (%fs is never used in the kernel). The KernelGSbase model-specific register (MSR) is used to store the kernel %gs content while it switches to run the 32-bit user LWP. The privileged instruction SWAPGS is used to restore the kernel %gs during the context switch to the kernel context. So when the VMM performs a context switch between the guest kernel mode and the guest user mode, it executes SWAPGS as part of the context switch (see “CPU Privilege Mode” on page 45).

The GDT segment is given in Table 5 below:

Table 5. The GDT segment.

% cat intel/sys/segments.h
#define GDT_NULL 0 /* null */
#define GDT_B32DATA 1 /* dboot 32 bit data descriptor */
#define GDT_B32CODE 2 /* dboot 32 bit code descriptor */
#define GDT_B16CODE 3 /* bios call 16 bit code descriptor */
#define GDT_B16DATA 4 /* bios call 16 bit data descriptor */
#define GDT_B64CODE 5 /* dboot 64 bit code descriptor */
#define GDT_BGSTMP 7 /* kmdb descriptor only used in boot */

#if defined(__amd64)

#define GDT_KCODE 6 /* kernel code seg %cs */
#define GDT_KDATA 7 /* kernel data seg %ds */
#define GDT_U32CODE 8 /* 32-bit process on 64-bit kernel %cs */
#define GDT_UDATA 9 /* user data seg %ds (32 and 64 bit) */
#define GDT_UCODE 10 /* native user code seg %cs */
#define GDT_LDT 12 /* LDT for current process */
#define GDT_KTSS 14 /* kernel tss */
#define GDT_FS GDT_NULL /* kernel %fs segment selector */
#define GDT_GS GDT_NULL /* kernel %gs segment selector */
#define GDT_LWPFS 55 /* lwp private %fs segment selector (32-bit)*/
#define GDT_LWPGS 56 /* lwp private %gs segment selector (32-bit)*/
#define GDT_BRANDMIN 57 /* first entry in GDT for brand usage */
#define GDT_BRANDMAX 61 /* last entry in GDT for brand usage */
#define NGDT 62 /* number of entries in GDT */

Every LWP context switch requires an update to the GDT for the new LWP. The GOS uses update_descriptor() for the task:

intel/ia32/os/desctbls.c

update_gdt_usegd(uint_t sidx, user_desc_t *udp)
{
    ....
    if (HYPERVISOR_update_descriptor(pa_to_ma(dpa), *(uint64_t *)udp))
        panic("xen_update_gdt_usegd: HYPERVISOR_update_descriptor");
}

On an x86 system, the base physical address of the page directory is contained in the control register %cr3. In the Solaris OS, the value of %cr3 is stored in the process's hat structure, proc->p_as->a_hat->hat_table->ht_pfn, as shown in “Paging Architecture” on page 25. The loading of %cr3 is performed by the VMM for security and coherency reasons.


“Page Translations Virtualization” on page 14 discusses two alternatives for updating

page tables in a virtualized environment: hypervisor calls to a read-only page table and

shadow page tables. The Sun xVM Hypervisor for x86 provides an additional alternative,

a writable page table, for the GOS to implement page translations. In the default mode

of operation, the VMM uses both read-only page tables and writable page tables to

manage page tables. The VMM allows the GOS to use a writable page table to update

the lowest level page tables (for example, the PTE). The higher levels, such as PDE, PDP,

and PML4, use a read-only page table and are updated using the hypercall

mmu_update(). Updates to higher level page tables are much less frequent compared

to the PTE page table updates.

• Read-only page table

The GOS has read-only access to page tables and uses the mmu_update() hypercall to update page tables. As described in the previous section “Physical Memory Management” on page 52, the GOS has a view of pseudo-physical memory, and a translation from physical address to machine address is performed before the mmu_update() call.

void
set_pteval(paddr_t table, uint_t index, uint_t level, x86pte_t pteval)
{
    ....
    ma = pa_to_ma(PT_INDEX_PHYSADDR(pfn_to_pa(ht->ht_pfn), entry));
    t[0].ptr = ma | MMU_NORMAL_PT_UPDATE;
    t[0].val = new;
    if (HYPERVISOR_mmu_update(t, cnt, &count, DOMID_SELF))
        panic("HYPERVISOR_mmu_update() failed");
    ....
}

• Writable page table

If a GOS attempts to write to a page table that is maintained by the VMM, this attempt will result in a #PF fault to the VMM. In the VMM fault handling routine, the following tasks are performed:

– Hold the lock for all further page table updates

– Disconnect the page that contains the updated page table by clearing the page present bit of the page table entry in the parent page table

– Make the page writable by the GOS

The page will be reconnected to the paging hierarchy again automatically in a number of situations, including when the guest modifies a different page-table page, when the domain is preempted, and whenever the guest uses the VMM’s explicit page-table update interfaces.

• Shadow page table

The VMM maintains an independent copy of page tables, called the shadow page table, that is pointed to by the %cr3 register. If a page fault occurs when a GOS’s page table is accessed, the VMM propagates changes made to the GOS’s page table to the shadow page table. Shadow page mode can be set in the GOS by calling dom0_op(DOM0_SHADOW_CONTROL).
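The writable page table behavior listed above can be modeled as a two-state machine for each page-table page: the first guest write traps, the VMM disconnects the page and makes it writable, and later writes proceed without trapping until the page is reconnected. This is a sketch under simplifying assumptions; the lock step is omitted, and every name in it is hypothetical.

```c
#include <stdint.h>

#define PT_ENTRIES 512

enum pt_state { PT_CONNECTED_RO, PT_DISCONNECTED_RW };

/* One page-table page as tracked in this model (names are illustrative). */
struct pt_page {
    enum pt_state state;
    int           parent_present;   /* present bit of the parent's entry */
    unsigned      faults;           /* writes that trapped to the VMM */
    uint64_t      pte[PT_ENTRIES];
};

/* VMM side: runs on the #PF raised by a write to a read-only PT page. */
static void
vmm_handle_pt_write_fault(struct pt_page *p)
{
    p->faults++;
    p->parent_present = 0;          /* disconnect from the hierarchy */
    p->state = PT_DISCONNECTED_RW;  /* make the page writable by the GOS */
}

/* VMM side: reattach the page, for example when the domain is preempted. */
static void
vmm_reconnect(struct pt_page *p)
{
    p->state = PT_CONNECTED_RO;
    p->parent_present = 1;
}

/* Guest side: a PTE store; only the first store after a reconnect traps. */
static void
guest_write_pte(struct pt_page *p, unsigned idx, uint64_t val)
{
    if (p->state == PT_CONNECTED_RO)
        vmm_handle_pt_write_fault(p);
    p->pte[idx % PT_ENTRIES] = val;
}
```

The point of the design is visible in the fault counter: a burst of PTE writes to the same page costs one trap instead of one hypercall per entry.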

In addition to creating a translation entry, the VMM also provides the mmuext_op() hypercall for the GOS to flush, to invalidate, or to lock a page translation. For example, it is necessary to lock the translations of a process when it is being created. The mmuext_op() is invoked by the kernel during the fork(2) system call:

[3]> :c
kmdb: stop at xen_pin+0x3a
kmdb: target stopped at:
xen_pin+0x3a: call +0x208b1 <HYPERVISOR_mmuext_op>
[3]> $c
xen_pin+0x3a(ff2c, 3)
hat_alloc+0x285(fffffffec381b7e8)
as_alloc+0x99()
as_dup+0x3f(fffffffec381ba88, fffffffec3d0f8d0)
cfork+0x102(0, 1, 0)
forksys+0x25(0, 0)
sys_syscall32+0x13e()

Sun xVM Server I/O Virtualization

Sun xVM Server uses a split device driver architecture to provide device services to DomU domains. The device services are provided by two co-operating drivers: the front-end driver, which runs in a DomU, and the back-end driver, which runs in Dom0 (Figure 19). Sun xVM Server doesn't export any real devices to DomU domains. All device access made by DomU domains must go through the back-end driver located in Dom0.


Figure 19. The split device driver architecture employed by Sun xVM Server includes a front-end driver in DomU and a back-end driver in Dom0.

Dom0 is a special VM that has access to the real device hardware. The front-end driver

appears to a GOS in DomU as a real device. This driver receives I/O requests from

applications as usual. However, since this front-end driver does not have access to the

physical hardware of the system, it must then send requests to the back-end driver in

Dom0. The back-end driver is responsible for issuing I/O requests to the real device

hardware. When the I/O completes, the back-end notifies the front-end that the data is

ready for use; the front-end is then able to report I/O completion and unblock the I/O

call.

When the Solaris OS is initialized, devices identify themselves and are organized into

the device tree. This device tree depicts a hierarchy of nodes, with each node on the

tree representing a device. Sun xVM Server exports a complete device tree to domain

Dom0 so that it can directly access all physical devices on the system. For DomU

domains, the paravirtualized Solaris OS uses information passed to it by xm(1M) to

disable PCI bus probing and create virtual Sun xVM Server device nodes under the VMM

virtual device nexus driver, xpvd.

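[Figure 19 diagram: Dom0 and DomU run on a Sun x64 server above the x86 hardware (CPU, memory, devices). In Dom0, user-level system calls reach the kernel, which holds the back-end drivers and the native driver. In DomU, user-level system calls reach the kernel, which holds the front-end drivers. The two domains communicate through grant tables, event channels, and Xen callbacks.]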


Output from the prtconf(1M) command shows the device tree as exported by Sun xVM Server to a VM in a DomU domain. As the prtconf(1M) output shows, there are no physical devices of any kind on the device tree in DomU:

# prtconf
System Configuration: Sun Microsystems i86pc
Memory size: 2048 Megabytes
System Peripherals (Software Nodes):

i86xpv
    scsi_vhci, instance #0
    isa (driver not attached)
    xpvd, instance #0
        xencons, instance #0
        xenbus, instance #0
        domcaps, instance #0
        balloon, instance #0
        xdf, instance #0
        xnf, instance #0
    iscsi, instance #0
    pseudo, instance #0
    agpgart, instance #0
    options, instance #0
    xsvc, instance #0
    cpus (driver not attached)
        cpu, instance #0 (driver not attached)
        cpu, instance #1 (driver not attached)

A driver that provides services to other drivers is called a bus nexus driver and is shown in the device tree hierarchy as a node with children. The nexus driver provides bus mapping and translation services to subordinate devices in the device tree. The types of services provided by the nexus driver include interrupt priority assignment, DMA resource mapping, and device memory mapping. As seen in the previous prtconf(1M) output, the xpvd driver is the root nexus driver for all Sun xVM Server devices on DomU. An individual device driver is represented in the tree as a node with no children. This type of node is referred to as a leaf driver. In the above example, xenbus, domcaps, xencons, xdf, and xnf are leaf drivers.


The Sun xVM Server-related driver modules for Dom0 and DomU respectively are shown below:

Sun xVM Server related device modules on dom0:
xpvtod (TOD module for Xen)
xpvd (virtual device nexus driver)
xencons (virtual console driver)
privcmd (privcmd driver)
evtchn (Evtchn driver)
xenbus (virtual bus driver)
xdb (vbd backend driver)
xnb (xnb module)
xsvc (xsvc driver)
balloon (balloon driver)

Sun xVM Server related device modules on domU:
xenbus (virtual bus driver)
xpvtod (TOD module for i86xpv)
xpvd (virtual device nexus driver)
xencons (virtual console driver)
xdf (Xen virtual block driver)
xnf (Virtual Ethernet driver)

• The xpvtod driver provides services for setting and getting the time-of-day for the VM. TOD service is provided by the RTC timer. If the request to set the TOD comes from a DomU domain, the request is silently ignored, as DomU doesn't have permission to set the RTC timer.

• The nexus driver in Solaris provides bus mapping and translation services to subordinate devices in the device tree. The xpvd driver is the nexus driver for all virtual I/O drivers which don't directly access physical devices. This driver’s primary functions are to provide interrupt mapping and to invoke the initialization routines of its child devices.

• The xenbus driver provides a bus abstraction that drivers can use to communicate between VMs. The bus is mainly used for configuration negotiation, leaving most data transfer to be done via an interdomain channel composed of a grant table and an event channel. The xenbus driver also makes the configuration data available to the XenStore shared storage repository (see “XenStore” on page 45).

• The evtchn driver is used for receiving and demultiplexing event-channel signals to userland.

• The balloon driver is controlled by the VMM to manage physical memory usage by a VM. (See “Physical Memory Virtualization” on page 13 and “Physical Memory Management” on page 52).

• The privcmd driver is used by the domain manager on Dom0 to get the VMM service for VM management.

• The drivers xdf and xdb, the front-end and back-end block device drivers respectively, are discussed in “Disk Driver” on page 60. The xnf and xnb drivers, the front-end and back-end network drivers respectively, are discussed in “Network Driver” on page 61.

Data transfer between interdomain drivers is mainly provided by the VMM grant table

and event-channel services. Most of the data transfer is handled in a similar fashion to

DMA transfer between host and device. Data is put in the grant table by the sending

VM, and notification is sent to the receiving VM through the event channel. Then, the

callback routine in the receiving VM is invoked to process the data.
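The grant step of this transfer can be sketched as a small table in which the sending domain records which peer may access which frame, and the receiving side is checked against that record before it can use the frame. The table layout and all function names here are assumptions for illustration; they are not the VMM's grant table interface, and the event channel notification is left out.

```c
#include <stdint.h>

#define NGRANTS 8

/* One grant: permission for a single peer domain to access one frame. */
struct grant_entry {
    int      in_use;
    uint16_t peer;      /* domain allowed to access the frame */
    uint64_t frame;     /* frame being shared */
    int      readonly;  /* nonzero: peer may only read */
};

struct grant_table {
    struct grant_entry e[NGRANTS];
};

/* Sender: publish a frame for `peer`. Returns a grant reference or -1. */
static int
grant_access(struct grant_table *gt, uint16_t peer, uint64_t frame,
    int readonly)
{
    for (int ref = 0; ref < NGRANTS; ref++) {
        if (!gt->e[ref].in_use) {
            gt->e[ref] = (struct grant_entry){ 1, peer, frame, readonly };
            return (ref);
        }
    }
    return (-1);
}

/* Receiver: look up a granted frame. Returns 0 on success, -1 on refusal. */
static int
grant_map(const struct grant_table *gt, int ref, uint16_t caller,
    int want_write, uint64_t *frame_out)
{
    const struct grant_entry *g;

    if (ref < 0 || ref >= NGRANTS)
        return (-1);
    g = &gt->e[ref];
    if (!g->in_use || g->peer != caller)
        return (-1);
    if (want_write && g->readonly)
        return (-1);
    *frame_out = g->frame;
    return (0);
}

/* Sender: revoke the grant once the transfer is complete. */
static void
grant_end_access(struct grant_table *gt, int ref)
{
    if (ref >= 0 && ref < NGRANTS)
        gt->e[ref].in_use = 0;
}
```

In the flow described above, the sender would pass the grant reference to the peer and then signal the event channel; the peer's callback would map the frame, process the data, and the sender would end access afterward.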

Disk Driver

The xdb driver, the back-end driver on Dom0, is used to provide services for block device

management. This driver receives I/O requests from DomU domains and sends them on

to the native driver. On DomU, xdf is the pseudo block driver that gets the I/O requests

from applications and sends them to the xdb driver in Dom0. The xdf driver provides

functions similar to those of the SCSI target disk driver, sd, on an unvirtualized Solaris

system.

On Solaris systems, the main interface between a file system and storage device is the

strategy(9E) driver entry point. The strategy(9E) entry point takes only one

argument, buf(9S), which is the basic data structure for block I/O transfer. The I/O

request made by a file system to the strategy(9E) entry point is called PAGEIO, as

the memory buffer for the I/O is allocated from the kernel page pool. An application

can also open the storage device as a raw device and perform read(2) and write(2)

operations directly on the raw device. Such an I/O request is called PHYSIO,

physio(9F), as the memory buffer for the I/O is allocated by the application.

In addition to the strategy(9E) driver entry point for supporting file system and raw

device access, a disk driver also supports a set of ioctl(2) operations for disk control

and management. The dkio(7I) disk control operations define a standard set of

ioctl(2) commands. Normally, support for dkio(7I) operations requires direct

access to the device. In DomU, xdf supports most ioctl(2) commands as defined in

dkio(7I) by emulating the disk control inside xdf. No communication is made by

xdf to the back-end driver for ioctl(2) operations.

The sequence of events for disk I/O data transfer is illustrated in Figure 20. The disk

control path, ioctl(2), is similar to the data path.

When a disk I/O request is issued by a DomU domain, the sequence is as follows:

1. The file system calls the xdf driver's strategy(9E) entry point as a result of a

read(2) or write(2) system call.


2. The xdf driver puts the I/O buffer, buf(9S), on the grant table. This buffer is

allocated from the DomU memory. Permission for other domain access is granted

to this memory.

3. The xdf driver notifies Dom0 of an event through the event channel.

4. The VMM event channel generates an interrupt to the xdb driver in Dom0.

5. The xdb driver in Dom0 gets the DomU I/O buffer through the grant table.

6. The xdb driver in Dom0 calls the native driver's strategy(9E) entry point.

7. The native driver performs DMA.

8. The VMM receives the device interrupt.

9. The VMM generates an event to Dom0.

10. The xdb driver's iodone() routine is called by biodone(9F).

11. The xdb driver’s iodone() routine generates an event to DomU.

12. The xdf driver in DomU receives an interrupt to free up the grant table and DMA

resources, and calls biodone(9F) to wake up anyone waiting for it.

When a disk I/O request is issued by the control domain Dom0, the sequence is as

follows:

13. Block I/O requests are sent directly to the native driver.

Figure 20. Sequence of events for an I/O request from a Sun xVM Server virtual machine.

Network Driver

The Sun xVM Server network drivers use a similar approach to the disk block driver for

handling network packets. On DomU, the pseudo network driver xnf gets the I/O

requests from the network stack and sends them to xnb on Dom0. The back-end

network driver xnb on Dom0 forwards packets sent by xnf to the native network driver.

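[Figure 20 diagram: Dom0 (user-level read(2)/write(2); kernel-level FS, xdb, and native driver) and DomU (user-level read(2)/write(2); kernel-level FS and xdf) run on a Sun x64 server and are connected through grant tables, the event channel, and the Xen callback. Arrows numbered 1 through 13 correspond to the steps of the disk I/O sequence.]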


The buffer management for packet receiving has more impact on network performance

than packet transmitting does. On the packet receiving end, the data is transferred via

DMA into the native driver receiving buffer on Dom0. Then, the packet is copied from

the native driver buffer to the VMM buffer. The VMM buffer is then mapped to the

DomU kernel address space without another copy of the data.

The sequence of operations for packet receiving is as follows:

1. Data is transferred via DMA into the native driver, bge, receive buffer ring.

2. The xnb driver gets a new buffer from the VMM and copies data from the bge

receive ring to the new buffer.

3. The xnb driver sends DomU an event through the event channel.

4. The xnf driver in DomU receives an interrupt.

5. The xnf driver maps an mblk(9S) to the VMM buffer and sends the mblk(9S) to

the upper stack.

Figure 21. Sequence of events for a network request from a Sun xVM Server virtual machine.

[Figure 21 diagram: the network chip, Dom0, and DomU on the Sun x64 server, with the grant tables, the event channel, and the Xen callback. Numbered arrows 1 through 5 correspond to the packet receiving steps listed above. The figure includes the following call stacks.]

Dom0 call stack:

    xnb_to_peer
    xnbo`from_mac+0x1c
    mac`mac_do_rx+0x88
    mac`mac_rx+0x1b
    vnic`vnic_rx+0x59
    vnic`vnic_classifier_rx+0x6b
    mac`mac_do_rx+0x88
    mac`mac_rx+0x1b
    bge`bge_receive+0x564
    bge`bge_intr+0x182
    unix`av_dispatch_autovect+0x78
    unix`dispatch_hardint+0x33
    unix`switch_sp_and_call+0x13

DomU call stack:

    xnf`xnf_process_recv+0x275
    xnf`xnf_intr+0x5e
    unix`av_dispatch_autovect+0x78
    unix`dispatch_hardint+0x33
    unix`switch_sp_and_call+0x13


Chapter 6

Sun xVM Server with Hardware VM (HVM)

Intel and AMD have independently developed extensions to the x86 architecture that

provide hardware support for virtualization. These extensions enable a VMM to provide

full virtualization to a VM, and support the running of unmodified guest operating

systems on a VM. This approach is in contrast to Sun xVM Server PV, which requires

modifications to the guest operating system.

Virtual machines that are supported by virtualization-capable processors are called

Hardware Virtual Machines (HVMs). An HVM environment includes the following

requirements:

• A processor that allows an OS with reduced privilege to execute sensitive instructions

• A memory management scheme for a VM to update its page tables without accessing

MMU hardware

• An I/O emulation scheme that enables a VM to use its native driver to access devices

through an I/O VM (see “I/O Virtualization” on page 16)

• An emulated BIOS to bootstrap the OS

The x86 processor for HVM meets the first requirement, allowing an OS with reduced

privilege to execute sensitive instructions. However, a processor alone is not enough to

provide full virtualization. The memory management, I/O emulation, and emulated

BIOS requirements necessitate enhancements in the VMM.

This chapter begins with a discussion of HVM operations that are applicable to both

Intel and AMD virtualization extensions, followed by Intel and AMD specific

enhancements for HVM.

After the introduction of processor extensions, Sun xVM Server enhancements in the

areas of BIOS emulation, memory management, and I/O virtualization for full

virtualization are discussed in detail.

Note – Intel's virtualization extension is called Virtual Machine Extensions (VMX), and is documented in the IA-32 Intel Architecture Software Developer's Manual (see [7] Volume 3B Chapters 19-23). AMD's extension is called Secure Virtual Machine (SVM), and is documented in the AMD64 Architecture Programmer's Manual Volume 2: System Programming (see [9] Chapter 15).


HVM Operations and Data Structure

Both the Intel and AMD extensions for HVM, though not compatible with each other, are

similar in basic concepts. Both create a special mode of operation that allows system

software running in a reduced privileged mode to execute sensitive instructions. In

addition, both implementations also define state and control data structures that

enable the transition between modes of operation.

The processor for HVM has two operating modes: privileged mode and reduced

privilege mode. Processor behavior in the privileged mode is very much the same as the

processor running without the virtualization extension. Processor behavior in the

reduced privilege mode is restricted and modified to facilitate virtualization.

Table 6 summarizes the terms used by Intel and AMD for HVM. The extension creates

new instructions, and an HVM control and state data structure (HVMCSDS) for the VMM

to manage transition from one mode to another. The HVMCSDS is called VMCS on the

Intel processor and is called VMCB on the AMD processor. The VMM associates a

HVMCSDS with each VM. For a VM with multiple VCPUs, the VMM can associate a

HVMCSDS with each VCPU in the VM.

Table 6. Comparison of Intel and AMD processor support for virtualization.

                                                      Intel                AMD
    Virtualization operation                          VMX                  SVM
    Privileged mode                                   VMX root             Host mode
    Reduced privilege mode                            VMX non-root         Guest mode
    HVM control and state data structure (HVMCSDS)    VMCS                 VMCB
    Entering reduced privilege mode                   VMLAUNCH/VMRESUME    VMRUN
    Exiting reduced privilege mode                    Implicit             Implicit

After HVM is enabled, the processor operates in privileged mode. Transitions from
privileged mode to reduced privilege mode are called VM entries. Transitions from
reduced privilege mode to privileged mode are called VM exits. Figure 22 illustrates
entry and exit with the HVMCSDS.


Figure 22. Virtual machine entry and exit with hardware support on AMD and Intel processors.

VM entry is explicitly initiated by the VMM using an instruction (VMLAUNCH and

VMRESUME on Intel; VMRUN on AMD). The processor performs checks on the processor

state, VMM state, control fields, and the VM state before loading the VM state from the

HVMCSDS to launch the VM entry.

As a part of VM entry, the VMM can inject an event into the VM. The event injection

process is used to deliver virtualized external interrupts to a VM. A VM normally doesn't

get interrupts from I/O devices, because I/O devices are not exposed to VMs (with the

exception of Dom0). As will be shown in “Sun xVM Server with HVM I/O Virtualization

(QEMU)” on page 71, a VM's I/O is handled by a special domain (Dom0) that runs a

paravirtualized OS and has direct access to I/O devices. When an I/O operation

completes, Dom0 informs the VMM to send an interrupt through an hvm_op hypercall.

The VMM prepares the HVMCSDS for event injection and the VM's return instruction

pointer (RIP) is pushed on the stack.

VM exit occurs implicitly in response to certain instructions and events in a VM. The

VMM governs the conditions causing a VM exit through manipulating the control fields

in the HVMCSDS. The events that can be controlled to result in a VM exit include the

following (see [7] Chapter 20):

• External interrupts, non-maskable interrupts, and system management interrupts

• Executing certain instructions (such as RDPMC, RDTSC, or instructions that access

the CR)

• Exceptions

The exact conditions that cause a VM exit are defined in the HVMCSDS control fields.

Certain conditions may cause a VM exit for one VM but not for other VMs.
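The per-VM nature of these controls can be illustrated with a small C model. It is only a sketch: the event classes, the hvmcsds_controls structure, and causes_vm_exit() are hypothetical names, and a real VMCS or VMCB spreads these settings over several control fields and bitmaps.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical event classes that a VMM might choose to intercept. */
enum vm_event {
    EV_EXTERNAL_INTERRUPT = 1u << 0,
    EV_RDTSC              = 1u << 1,
    EV_CR_ACCESS          = 1u << 2,
    EV_IO_INSTRUCTION     = 1u << 3
};

/* Stand-in for the control fields of one HVMCSDS. */
struct hvmcsds_controls {
    uint32_t exit_on;   /* a set bit means the event forces a VM exit */
};

/* Returns nonzero when the event must trap to the VMM for this VM. */
int causes_vm_exit(const struct hvmcsds_controls *c, enum vm_event ev)
{
    assert(c);
    return (c->exit_on & (uint32_t)ev) != 0;
}
```

Two VMs with different exit_on masks then behave differently for the same instruction: RDTSC traps in a VM whose mask includes EV_RDTSC and runs natively in one whose mask does not.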

VM exits behave like faults: the instruction causing the VM exit does not execute, and
it updates no processor state. The VM exit handler in the VMM is responsible for taking
appropriate actions for the VM exit. Unlike exception handlers, the VM exit handler is
specified in the HVMCSDS host RIP field rather than through the IDT:

    static void construct_vmcs(struct vcpu *v)
    {
        ....
        /* Host CS:RIP. */
        __vmwrite(HOST_CS_SELECTOR, __HYPERVISOR_CS);
        __vmwrite(HOST_RIP, (unsigned long)vmx_asm_vmexit_handler);
        ....
    }

Intel Virtualization Technology Specifics

Intel Virtualization Technology (Intel-VT), code name Vanderpool, provides the Intel
virtual machine extensions (VMX) for running unmodified guest OSes. Intel-VT has two
implementations: VT-x defines the extensions to the IA-32 Intel architecture, and VT-i
defines the extensions to the Intel Itanium architecture. This paper focuses on the
Intel VT-x implementation.

Table 6 on page 64 summarizes the terms used in Intel documents [7] for HVM. Intel VT-x
adds several new instructions to the existing IA-32 instruction set to facilitate HVM
operations (see Table 7):

Table 7. Intel VT-x instructions that facilitate HVM operations.

    Instruction          Description
    VMLAUNCH/VMRESUME    Launch/resume VM
    VMCLEAR              Clear VMCS
    VMPTRLD/VMPTRST      Load/store VMCS pointer
    VMREAD/VMWRITE       Read/write VMCS
    VMXON/VMXOFF         Enable/disable VMX operation
    VMCALL               Call to the VMM

In addition to the new VMX instructions and the VMCS, Intel-VT introduces a direct I/O
architecture (VT-d) [28] to improve VM security, reliability, and performance through
I/O enhancements. As will be shown in “Sun xVM Server with HVM I/O Virtualization
(QEMU)” on page 71, the current I/O virtualization implementation for Sun xVM Server
with HVM, which is based on the QEMU project, is inefficient because all I/O
transactions have to go through Dom0; unreliable because the I/O virtualization layer on
Dom0 becomes a single point of failure; and insecure because a VM may access another
VM's DMA memory by manipulating the value written to an I/O port.


The Intel-VT direct I/O architecture specifies the following hardware capabilities to the

VMM:

• DMA remapping — This feature provides IOMMU support for I/O address translation

and caching capabilities. The IOMMU as specified in the architecture includes a page

table hierarchy similar to the processor page table, and an IOTLB for frequently

accessed I/O pages. Addresses used in the DMA transactions are allocated from

IOMMU address space, and the IOMMU hardware provides address translation from

the IOMMU address space to the system memory address space.

• I/O device assignment across VMs — This feature allows a PCI/PCI-X device that is

behind a PCI-E to PCI/PCI-X bridge or a PCI-E device to be assigned to a VM, regardless

of how the PCI bus is bound to a VM.
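DMA remapping can be pictured as a page table lookup applied to every address a device places on the bus. The following sketch assumes a single-level I/O page table with 4 KB pages and omits the IOTLB and the multi-level hierarchy; the names io_pte, io_page_table, and iommu_translate() are illustrative and are not taken from the Intel specification.

```c
#include <assert.h>
#include <stdint.h>

#define IO_PAGE_SHIFT 12
#define IO_PAGE_SIZE  (1u << IO_PAGE_SHIFT)
#define IO_PAGES      16            /* size of the toy I/O address space */
#define IO_FAULT      UINT64_MAX    /* returned when translation fails */

struct io_pte {
    uint64_t frame;          /* system physical frame number */
    unsigned present  : 1;
    unsigned writable : 1;
};

/* One I/O page table per assigned device (or per VM). */
struct io_page_table {
    struct io_pte pte[IO_PAGES];
};

/* Translate a DMA address issued by a device into a system physical address. */
uint64_t iommu_translate(const struct io_page_table *pt,
                         uint64_t dma_addr, int is_write)
{
    assert(pt);
    uint64_t page = dma_addr >> IO_PAGE_SHIFT;
    if (page >= IO_PAGES)
        return IO_FAULT;                       /* outside the I/O address space */

    const struct io_pte *e = &pt->pte[page];
    if (!e->present || (is_write && !e->writable))
        return IO_FAULT;                       /* blocked: the device has no access */

    return (e->frame << IO_PAGE_SHIFT) | (dma_addr & (IO_PAGE_SIZE - 1));
}
```

Because each device (or VM) gets its own table, a DMA address that is unmapped in that table faults instead of reaching memory owned by another VM.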

AMD Secure Virtual Machine Specifics

The AMD Secure Virtual Machine (SVM), code name Pacifica, is similar to Intel VT-x in

technology and design. The AMD SVM uses the instruction VMRUN to switch between a

GOS and the VMM. The instruction VMRUN takes, as a single argument, the physical

address of a 4KB-aligned page, the virtual machine control block (VMCB), which

describes a virtual machine (guest) to be executed.

In addition to functions that are equivalent to those in Intel VT-x, AMD SVM provides

additional features that are not available in Intel VT-x to improve HVM operations:

• Nested page table (NPT)

As an alternative to using a shadow page table for address translation (see “Shadow

Page Table” on page 69), AMD SVM uses two %cr3 registers, gCR3 and nCR3, to

point to guest page tables and nested page tables respectively. Guest page tables

map guest linear addresses to guest physical addresses. Nested page tables map

guest physical addresses to system physical addresses. When the table walker reads a
guest page table entry, it first translates that entry’s guest physical address into a
system physical address through the nested page tables. The resulting translations from
guest linear to system physical addresses are cached in the TLB for subsequent guest
accesses.

• Tagged TLB

To avoid a TLB flush during context switch (see “Paging Architecture” on page 25),

AMD SVM provides a tagged TLB with Address Space Identifier (ASID) bits to

distinguish different address spaces. A tagged TLB allows the VMM to use shadow

page tables or multiple nested page tables for address translation during a context

switch without flushing the TLBs.

• IOMMU

The AMD64 IOMMU enables secure virtual machine guest operating system access to

selected I/O devices by providing address translation and access protection on DMA

transfers by peripheral devices. The IOMMU can be thought of as a combination and


generalization of two facilities included in the AMD64 architecture: the Graphics

Aperture Remapping Table (GART) and the Device Exclusion Vector (DEV). The GART

provides address translation of I/O device accesses to a small range of the system

physical address space, and the DEV provides a limited degree of I/O device

classification and memory protection.
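The interaction of nested page tables and the tagged TLB can be sketched as follows. The model is deliberately reduced: each table is a flat array rather than a multi-level hierarchy, the TLB is direct mapped, and the nested translation of the guest page table's own addresses is not modeled. The names vm, tlb_entry, and translate() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  0xfffu
#define NPAGES     8
#define TLB_SLOTS  4

struct tlb_entry {
    int      valid;
    uint16_t asid;   /* tag identifying the address space */
    uint64_t vpn;    /* guest linear page number */
    uint64_t spn;    /* system physical page number */
};

struct vm {
    uint16_t asid;
    uint64_t guest_pt[NPAGES];   /* reached through gCR3: linear -> guest physical */
    uint64_t nested_pt[NPAGES];  /* reached through nCR3: guest physical -> system physical */
};

static struct tlb_entry tlb[TLB_SLOTS];
static unsigned tlb_hits;

/* Translate a guest linear address to a system physical address. */
uint64_t translate(const struct vm *vm, uint64_t gla)
{
    uint64_t vpn = gla >> PAGE_SHIFT;
    assert(vpn < NPAGES);
    struct tlb_entry *slot = &tlb[vpn % TLB_SLOTS];

    /* A hit requires both the page number and the ASID to match. */
    if (slot->valid && slot->asid == vm->asid && slot->vpn == vpn) {
        tlb_hits++;
    } else {
        uint64_t gpn = vm->guest_pt[vpn];    /* guest page table lookup */
        assert(gpn < NPAGES);
        uint64_t spn = vm->nested_pt[gpn];   /* nested page table lookup */
        *slot = (struct tlb_entry){1, vm->asid, vpn, spn};
    }
    return (slot->spn << PAGE_SHIFT) | (gla & PAGE_MASK);
}
```

Switching from one VM to another requires no flush in this model: a stale entry simply fails the ASID comparison and is refilled from the new VM's tables.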

Sun xVM Server with HVM Architecture Overview

Sun xVM Server with HVM supports the running of unmodified operating systems in

DomU. However, Dom0 still requires a paravirtualized OS in order to provide full I/O

virtualization support for DomUs.

To support full virtualization, the Sun xVM Hypervisor for x86 has extended its

paravirtualized architecture with the following enhancements:

• A set of HVM functions (struct hvm_function_table) for processor dependent

implementation of HVM, and an hvm_op hypercall

• A shadow page table to virtualize memory management

• Device emulation based on the QEMU project for I/O virtualization

• Emulated BIOS, hvmloader, to bootstrap the GOS

These enhancements are discussed in more detail in the following sections.


Processor Dependent HVM Functions

The Sun xVM Hypervisor for x86 defines a set of foundational interfaces, struct
hvm_function_table, to abstract processor HVM specifics. The struct
hvm_function_table entries are:

    struct hvm_function_table {
        void (*disable)(void);
        int (*vcpu_initialise)(struct vcpu *v);
        void (*vcpu_destroy)(struct vcpu *v);
        void (*store_cpu_guest_regs)(
            struct vcpu *v, struct cpu_user_regs *r, unsigned long *crs);
        void (*load_cpu_guest_regs)(
            struct vcpu *v, struct cpu_user_regs *r);
        int (*paging_enabled)(struct vcpu *v);
        int (*long_mode_enabled)(struct vcpu *v);
        int (*pae_enabled)(struct vcpu *v);
        int (*guest_x86_mode)(struct vcpu *v);
        unsigned long (*get_guest_ctrl_reg)(struct vcpu *v, unsigned int num);
        unsigned long (*get_segment_base)(struct vcpu *v, enum x86_segment seg);
        void (*get_segment_register)(struct vcpu *v, enum x86_segment seg,
                                     struct segment_register *reg);
        void (*update_host_cr3)(struct vcpu *v);
        void (*update_guest_cr3)(struct vcpu *v);
        void (*stts)(struct vcpu *v);
        void (*set_tsc_offset)(struct vcpu *v, u64 offset);
        void (*inject_exception)(unsigned int trapnr, int errcode,
                                 unsigned long cr2);
        void (*init_ap_context)(struct vcpu_guest_context *ctxt,
                                int vcpuid, int trampoline_vector);
        void (*init_hypercall_page)(struct domain *d, void *hypercall_page);
    };

The VMM uses hvm_function_table to provide a VCPU to a VM. The entry points in
hvm_function_table fall into two categories: setup and runtime. The setup entry
points are called when a VM is being created. The runtime entry points are called
before VM entry or after VM exit. Since the HVMCSDS data structure abstracts the
states and controls of a VCPU, the entry points in hvm_function_table are
primarily used to manipulate the data structure.

Shadow Page Table

Because the GOS is unmodified, the read-only page table scheme for page translation
as used in the Sun xVM Server is no longer applicable. The read-only page table scheme
requires the OS to make hypercalls into the VMM to update page tables. To support an
unmodified OS, the shadow page table scheme becomes the only option available. In
this scheme, the shadow page table (also known as the active page table hierarchy) is
the actual page table used by the processor.


In supporting shadow page tables [29], the Sun xVM Hypervisor for x86 attempts to intercept

all updates to a guest page table, and updates both the VM's page table and the

shadow page table maintained by the VMM, keeping both page tables synchronized at

all times. This implementation results in two page faults, one due to faulting the actual

page and a second one due to page table access.

This shadow page table scheme has a significant impact on the VM performance. An

alternative such as nested page table (see “AMD Secure Virtual Machine Specifics” on

page 67) has been proposed to improve the memory virtualization performance.
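The synchronization between the guest page table and the shadow page table can be sketched in C. This is a toy model under stated assumptions: page tables are flat arrays, the intercept is an ordinary function call, and the names shadow_vm, shadow_init(), and shadow_guest_pte_write() are hypothetical rather than taken from the hypervisor source.

```c
#include <assert.h>
#include <stdint.h>

#define NPAGES      8
#define NOT_PRESENT UINT64_MAX

struct shadow_vm {
    uint64_t guest_pt[NPAGES];   /* guest's view: virtual page -> guest physical page */
    uint64_t shadow_pt[NPAGES];  /* used by the MMU: virtual page -> system physical page */
    uint64_t p2m[NPAGES];        /* VMM map: guest physical page -> system physical page */
};

void shadow_init(struct shadow_vm *vm)
{
    for (int i = 0; i < NPAGES; i++)
        vm->guest_pt[i] = vm->shadow_pt[i] = NOT_PRESENT;
}

/* Called by the VMM when it intercepts a guest write to its page table. */
void shadow_guest_pte_write(struct shadow_vm *vm, unsigned vpn, uint64_t gpn)
{
    assert(vpn < NPAGES);
    vm->guest_pt[vpn] = gpn;     /* complete the update the guest asked for */
    /* Mirror it into the shadow table, translated to a system physical page. */
    vm->shadow_pt[vpn] = (gpn < NPAGES) ? vm->p2m[gpn] : NOT_PRESENT;
}
```

The guest only ever sees guest_pt, while the processor only ever walks shadow_pt; every intercepted write costs a trip into the VMM, which is the overhead that nested page tables aim to remove.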

Sun xVM Server Interrupt and Exception Handling for HVM

The VMM can specify processor behavior on specific exceptions and interrupts by
setting the appropriate control fields in the HVMCSDS. When a physical interrupt occurs,
the processor uses the settings in the HVMCSDS to determine whether the interrupt
results in a VM exit of the running VM. Upon VM exit, the VMM gets the interrupt vector
from the HVMCSDS, sets the control fields for event injection, and launches the VM entry
of the target VM.

The interrupt handling by the VMM is a two stage process: from physical device to the

VMM, and from a virtual device in Dom0 to the target VM. The VMM controls the IDT

for interrupts from physical devices. Each VM registers its own IDT with the VMM. When

a physical interrupt arrives, the VMM delivers the interrupt to a virtual device in Dom0.

The virtual device then generates a virtual interrupt to a VM.

A virtual interrupt is delivered to a VM through event injection by setting the VM entry

control field in the HVMCSDS for event injection. The VMM uses the

inject_exception entry point in hvm_function_table (see “Processor

Dependent HVM Functions” on page 69) to set the HVMCSDS event injection control

field. The event is delivered when the VM is entered.
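On Intel processors, the event injection control is the 32-bit VM-entry interruption-information field of the VMCS. The sketch below encodes that field following the layout in the Intel manual (vector in bits 7:0, type in bits 10:8, error-code delivery in bit 11, valid in bit 31). The macro and function names are local to this example and are not the hypervisor's own; the AMD VMCB has its own analogous field, which is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout of the Intel VM-entry interruption-information field. */
#define INTR_INFO_TYPE_SHIFT    8
#define INTR_INFO_DELIVER_CODE  (1u << 11)
#define INTR_INFO_VALID         (1u << 31)

enum intr_type {
    INTR_TYPE_EXTERNAL  = 0,   /* external (device) interrupt */
    INTR_TYPE_NMI       = 2,   /* non-maskable interrupt */
    INTR_TYPE_HW_EXCEPT = 3    /* hardware exception, such as a page fault */
};

/* Build the value a VMM would write before the next VM entry. */
uint32_t make_injection(uint8_t vector, enum intr_type type, int has_error_code)
{
    uint32_t info = INTR_INFO_VALID | vector |
                    ((uint32_t)type << INTR_INFO_TYPE_SHIFT);
    if (has_error_code) {
        /* Only hardware exceptions carry an error code. */
        assert(type == INTR_TYPE_HW_EXCEPT);
        info |= INTR_INFO_DELIVER_CODE;
    }
    return info;
}
```

For example, injecting a virtual device interrupt on vector 0x20 produces 0x80000020, and injecting a page fault (vector 14) with an error code produces 0x80000B0E.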

Emulated BIOS

The PC BIOS provides hardware initialization, boot services, and runtime services to the

OS. There are some restrictions on VMX operation. An OS in HVM cannot operate in real

mode. Unlike a paravirtualized OS that can change its bring up sequence for an

environment without BIOS, an unmodified OS requires an emulated BIOS to perform

some real mode operations before control is passed to the OS. Sun xVM Server includes

a BIOS emulator, hvmloader, as a surrogate to real BIOS.

The hvmloader BIOS emulation contains three components: ROMBIOS, VGABIOS,

and VMXAssist. Both ROMBIOS and VGABIOS are based on the open source Bochs BIOS

[23]. The VMXAssist component is included in hvmloader to emulate real mode,

which is required by hvmloader and bootstrap loaders. The hvmloader BIOS

emulator is bootstrapped as any other 32-bit OS. After it is loaded, hvmloader copies


its three components to pre-assigned addresses (VGABIOS at C000:0000,

VMXAssist at D000:0000, and ROMBIOS at F000:0000) and transfers control to

VMXAssist.
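The pre-assigned addresses are written in real-mode segment:offset notation. As a reminder of how they map to physical memory, the following sketch applies the standard real-mode rule (segment shifted left by four bits, plus offset); the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode address translation: physical = segment * 16 + offset. */
uint32_t real_mode_phys(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Sanity check: the three hvmloader components are placed in ascending order. */
void check_hvmloader_layout(void)
{
    uint32_t vgabios   = real_mode_phys(0xC000, 0x0000);
    uint32_t vmxassist = real_mode_phys(0xD000, 0x0000);
    uint32_t rombios   = real_mode_phys(0xF000, 0x0000);
    assert(vgabios < vmxassist && vmxassist < rombios);
}
```

Under this rule, VGABIOS at C000:0000 starts at physical address 0xC0000, VMXAssist at D000:0000 starts at 0xD0000, and ROMBIOS at F000:0000 starts at 0xF0000.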

The hvmloader BIOS emulator does not directly interface with physical devices. It

communicates with virtual devices as discussed in the following section “Sun xVM

Server with HVM I/O Virtualization (QEMU)”.

Sun xVM Server with HVM I/O Virtualization (QEMU)

Sun xVM Server I/O virtualization on an HVM-enabled environment is based on the

open source QEMU project [24]. QEMU is a machine emulator that uses dynamic binary

translation to run an unmodified OS and its applications in a virtual machine. QEMU

includes several components: CPU emulators, emulated devices, generic devices,

machine descriptions, user interface, and a debugger. The emulated devices and

generic devices in QEMU make up its device models for I/O virtualization. Sun xVM

Server uses QEMU's device models to provide full I/O virtualization to VMs.

For example, QEMU supports several emulated network interfaces, including ne2000,

PCNet, and Realtek 8139. The Solaris OS has the pcn driver for the PCNet NIC.

The Solaris OS running in DomU can use pcn and communicate to QEMU on a Solaris

Dom0 that has an e1000g NIC. The pcnet emulation in QEMU converts Solaris pcn

transactions to a generic virtual network interface (such as TAP), which forwards the

packet to the driver for the native network interface (such as e1000g).

QEMU I/O emulation is illustrated in Figure 23. The principle of operation for sending

out an I/O request is outlined as follows:

1. An OS interfaces with a device through I/O ports and/or memory-mapped device

memory. The device performs certain operations, such as DMA, in response to I/O

port/memory access by the OS. At the completion of the operation, the device

generates an interrupt to notify the OS (Steps 1 and 2 in Figure 23).

2. The VMM monitors and intercepts the device I/O ports and memory accesses

(Step 3 in Figure 23).

3. The VMM forwards the I/O port/memory data to an I/O virtualization layer such as

QEMU (Step 4 in Figure 23).

4. QEMU decodes the I/O port/memory data and performs necessary emulation for

the I/O request (Step 5 in Figure 23).

5. QEMU delivers the emulated I/O request to the OS native device interface (Steps 6

and 7 in Figure 23).


Figure 23. I/O emulation in Sun xVM Server using QEMU for dynamic binary translation.

Using the AMD PCNet LANCE PCI Ethernet controller as an example, the vendor ID and
device ID of the PCNet chip are 1022 and 2000, respectively. From prtconf(1M)
output, the PCI registers exported by the device are:

    % prtconf -v
    ....
    pci1022,2000, instance #0
        Hardware properties:
            name='assigned-addresses' type=int items=5
                value=81008810.00000000.00001400.00000000.00000080
            name='reg' type=int items=10
                value=00008800.00000000.00000000.00000000.00000000.01008810.00000000.00000000.00000000.00000080
    ....

According to IEEE1275 OpenBoot Firmware [25], the reg property is generated by
reading the base address registers in the configuration address space. Each entry in the
reg property format consists of one 32-bit cell for register configuration, a 64-bit
address cell, and a 64-bit size cell [26]. As the prtconf(1M) output shows, the PCNet
chip has a 128-byte (0x00000080) register in the I/O address space (01 in the first byte
of 0x01008810 denotes I/O address space). QEMU emulation for PCNet simply
monitors the Solaris driver's accesses to the 128-byte register using x86 IN/OUT
instructions.
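The first cell of each reg entry can be decoded mechanically. The sketch below follows the phys.hi layout defined by the IEEE 1275 PCI bus binding (npt000ss bbbbbbbb dddddfff rrrrrrrr); the structure and function names are illustrative and do not come from the Solaris source.

```c
#include <assert.h>
#include <stdint.h>

/* Address space codes carried in the "ss" bits of phys.hi. */
enum pci_space { PCI_CONFIG = 0, PCI_IO = 1, PCI_MEM32 = 2, PCI_MEM64 = 3 };

struct pci_phys_hi {
    enum pci_space space;
    unsigned bus, device, function, reg;
};

/* Decode the phys.hi cell: npt000ss bbbbbbbb dddddfff rrrrrrrr. */
struct pci_phys_hi decode_phys_hi(uint32_t cell)
{
    struct pci_phys_hi d;

    assert((cell & 0x1c000000u) == 0);   /* the three reserved bits are zero */
    d.space    = (enum pci_space)((cell >> 24) & 0x3);
    d.bus      = (cell >> 16) & 0xff;
    d.device   = (cell >> 11) & 0x1f;
    d.function = (cell >> 8) & 0x7;
    d.reg      = cell & 0xff;            /* configuration register offset */
    return d;
}
```

Applied to 0x01008810 from the output above, the decoder reports I/O space, bus 0, device 17, function 0, and register offset 0x10, which is the first base address register in PCI configuration space.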

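[Figure 23 diagram: DomU (user and kernel levels, with socket(3C), the pcn driver, and the I/O port or device memory), the VM exit handler in the VMM, and Dom0 (qemu-dm/ioemu/pcnet at user level; TAP and the native NIC driver in the kernel) above the NIC. The hvm hypercall and the event channel connect Dom0 and the VMM. Numbered dots 1 through 11 mark the events described below.]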


The QEMU virtualization for transmitting and receiving a packet using the PCNet

emulation is illustrated in Figure 23 on page 72. The sequence of events

corresponding to the numbered dots in the figure is described below:

1. Applications make an I/O request to the driver through system calls.

2. The pcn driver writes to the DMA descriptor using the OUT instruction. In pcn,
pcn_send() calls pcn_OutCSR() to start the DMA transaction. Then,
pcn_OutCSR() calls ddi_put16() to write a value to an I/O address. Next,
ddi_put16() checks whether the mapping (io_handle) is for I/O space or
memory space. If the mapping is for the I/O space, it moves its third argument to
%rax and the port ID to %rdx, and issues the OUTW instruction to the port referenced
by %dx.

    pcn_send()
    {
        ....
        pcn_OutCSR(pcnp, CSR0, CSR0_INEA | CSR0_TDMD);
        ...
    }

    static void
    pcn_OutCSR(struct pcninstance *pcnp, uintptr_t reg, ushort_t value)
    {
        ddi_put16(pcnp->io_handle, REG16(pcnp->io_reg, PCN_IO_RAP), reg);
        ddi_put16(pcnp->io_handle, REG16(pcnp->io_reg, PCN_IO_RDP), value);
    }

    ENTRY(ddi_put16)
        movl    ACC_ATTR(%rdi), %ecx
        cmpl    $_CONST(DDI_ACCATTR_IO_SPACE|DDI_ACCATTR_DIRECT), %ecx
        jne     8f
        movq    %rdx, %rax
        movq    %rsi, %rdx
        outw    (%dx)
        ret

The OUT instruction causes a VM exit. The VMM sets up the CPU to exit
unconditionally when the VM executes IN/OUT/INS/OUTS, as shown by the
setting of the CPU_BASED_UNCOND_IO_EXITING bit in the processor-based
VM-execution controls (see Table 20-6 in [7]).

    #define MONITOR_CPU_BASED_EXEC_CONTROLS \
        ( MONITOR_CPU_BASED_EXEC_CONTROLS_SUBARCH | \
          CPU_BASED_HLT_EXITING | \
          CPU_BASED_INVDPG_EXITING | \
          CPU_BASED_MWAIT_EXITING | \
          CPU_BASED_MOV_DR_EXITING | \
          CPU_BASED_UNCOND_IO_EXITING | \
          CPU_BASED_USE_TSC_OFFSETING )

    void vmx_init_vmcs_config(void)
    {
        ....
        _vmx_vmexit_control = adjust_vmx_controls(MONITOR_VM_EXIT_CONTROLS,
                                                  MSR_IA32_VMX_EXIT_CTLS_MSR);
        ....
    }

The VM exit handler is set in the host RIP field in the HVMCSDS (see “HVM
Operations and Data Structure” on page 64). The VM exit handler examines the
exit reason and calls the I/O instruction function, vmx_io_instruction(), to
handle the VM exit.

    asmlinkage void vmx_vmexit_handler(struct cpu_user_regs *regs)
    {
        ....
        case EXIT_REASON_IO_INSTRUCTION:
            exit_qualification = __vmread(EXIT_QUALIFICATION);
            inst_len = __get_instruction_length();
            vmx_io_instruction(exit_qualification, inst_len);
            break;
        ....
    }

3. The VM exit handler for I/O instructions in the VMM examines the exit
qualification, and gets the OUT information from the HVMCSDS. This information
includes:

– Size of the access (1, 2, or 4 bytes)

– Direction of the access (IN or OUT)

– Port number

– Direction flag (DF) setting, which gives the direction of a string operation

– Size and address of the string buffer if this is an I/O string operation
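On Intel processors, most of this information is packed into the exit qualification value read from the VMCS. The decoder below is a sketch that follows the bit layout for I/O instructions in the Intel manual (access size minus one in bits 2:0, direction in bit 3, string in bit 4, REP prefix in bit 5, port in bits 31:16); the io_exit structure and decode_io_exit() are hypothetical names, and the direction flag is not covered because it is part of the guest flags register rather than the exit qualification.

```c
#include <assert.h>
#include <stdint.h>

struct io_exit {
    unsigned size;       /* access size in bytes: 1, 2, or 4 */
    int      is_in;      /* 1 for IN, 0 for OUT */
    int      is_string;  /* 1 for INS/OUTS */
    int      is_rep;     /* 1 when a REP prefix is present */
    uint16_t port;       /* I/O port number */
};

/* Decode the VM exit qualification for an I/O instruction. */
struct io_exit decode_io_exit(uint64_t q)
{
    struct io_exit e;

    e.size      = (unsigned)(q & 0x7) + 1;   /* encoded as size - 1 */
    e.is_in     = (int)((q >> 3) & 1);
    e.is_string = (int)((q >> 4) & 1);
    e.is_rep    = (int)((q >> 5) & 1);
    e.port      = (uint16_t)(q >> 16);
    assert(e.size == 1 || e.size == 2 || e.size == 4);
    return e;
}
```

An OUTW to a port such as 0x1410 (an example value chosen for illustration) would arrive as 0x14100001 and decode to a 2-byte OUT on that port.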


The VM exit handler then fills in the struct ioreq fields and sends the I/O
request to its client by calling send_pio_req().

    static void vmx_io_instruction(unsigned long exit_qualification,
                                   unsigned long inst_len)
    {
        ....
        send_pio_req(port, count, size, addr, dir, df, 1);
        ....
    }

4. The client of the I/O request (qemu-dm) is blocked on the event channel device
node created by the evtchn module (see “Event Channels” on page 43). In the
VMM, hvm_send_assist_req() gets called by send_pio_req() to set the
event pending bit of the event channel and wake up the qemu-dm client waiting
on the event.

    void hvm_send_assist_req(struct vcpu *v)
    {
        ....
        p->state = STATE_IOREQ_READY;
        notify_via_xen_event_channel(v->arch.hvm_vcpu.xen_port);
    }

5. The QEMU emulator, qemu-dm, is a user process that contains the ioemu module
for I/O emulation. The ioemu module waits on one end of the event channel for
I/O requests from the VMM.

When an I/O request arrives, ioemu is unblocked and cpu_handle_ioreq()
is called to get the ioreq structure from the event channel. Based on the
information in ioreq, appropriate pcnet functions are invoked to handle the I/O request.

    int main_loop(void)
    {
        ....
        qemu_set_fd_handler(evtchn_fd, cpu_handle_ioreq, NULL, env);

        while (1) {
            ....
            main_loop_wait(10);
        }
        ....
    }
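The hand-off between the VMM and the device model can be reduced to a shared request record plus a port-based dispatch. The sketch below is a simplified stand-in, not the QEMU or hypervisor code: the ioreq layout, the state names, the 0x1400 port window, and toy_nic_io() are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum ioreq_state { IOREQ_NONE, IOREQ_READY, IOREQ_DONE };
enum ioreq_dir   { IOREQ_WRITE, IOREQ_READ };   /* OUT and IN, respectively */

/* Simplified request record shared by the VMM and the device model. */
struct ioreq {
    uint16_t         port;
    uint8_t          size;
    enum ioreq_dir   dir;
    uint32_t         data;
    enum ioreq_state state;
};

/* A registered emulated device: a port range and its handler. */
typedef uint32_t (*port_handler)(uint16_t port, enum ioreq_dir dir, uint32_t data);
struct port_range { uint16_t base, len; port_handler handler; };

static uint32_t last_written;   /* single register of the toy device */

static uint32_t toy_nic_io(uint16_t port, enum ioreq_dir dir, uint32_t data)
{
    (void)port;
    if (dir == IOREQ_WRITE)
        last_written = data;
    return last_written;
}

static const struct port_range devices[] = {
    { 0x1400, 0x80, toy_nic_io },   /* 128-byte I/O window */
};

/* Device-model side: runs when the event channel signals a request. */
int handle_ioreq(struct ioreq *req)
{
    assert(req != NULL);
    if (req->state != IOREQ_READY)
        return -1;
    for (size_t i = 0; i < sizeof devices / sizeof devices[0]; i++) {
        const struct port_range *r = &devices[i];
        if (req->port >= r->base && req->port < r->base + r->len) {
            uint32_t v = r->handler(req->port, req->dir, req->data);
            if (req->dir == IOREQ_READ)
                req->data = v;
            req->state = IOREQ_DONE;   /* the VMM can now resume the VCPU */
            return 0;
        }
    }
    return -1;   /* no emulated device claims this port */
}
```

The state field plays the role that STATE_IOREQ_READY plays in the listing above: the VMM marks a request ready before signaling the event channel, and the device model marks it done after the emulated device has handled it.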


6. After pcnet decodes the ioreq structure, ioemu sends the packet to the TAP

network interface. The TAP network interface [27] is a virtual ethernet network

device that provides two interfaces to applications:

– Character device —/dev/tapX

– Virtual network interface — tapX

where X is the instance number of the TAP interface. Applications can write Ethernet
frames to the /dev/tapX character interface, and the TAP driver will receive these
frames from the tapX network interface. In the same manner, a packet that the kernel
writes to the tapX network interface can be read by an application from the /dev/tapX
character device node.

To continue the packet flow, pcnet_transmit() is called to send out the ioreq. In
pcnet_transmit(), qemu_send_packet() invokes tap_receive() to write
the packet to the TAP character interface, which forwards the packet to the native
driver interface.

    static void pcnet_transmit(PCNetState *s)
    {
        ....
        qemu_send_packet(s->vc, s->buffer, s->xmit_pos);
        ....
    }

    static void tap_receive(void *opaque, const uint8_t *buf, int size)
    {
        ....
        for(;;) {
            ret = write(s->fd, buf, size);
            ....
        }
    }

7. The Dom0 native driver sends the packet to the network hardware. This marks the
end of transmitting a packet from DomU to the real network.

8. Dom0 receives an interrupt indicating a packet intended for DomU has arrived.
This marks the beginning of receiving a packet targeted to a DomU from the real
network. The native network driver forwards the packet through a bridge to the
TAP network interface, tapX.

9. Next, tap_send() is invoked when data is written to the file. The packet is read
from the character interface of /dev/tapX. Next, qemu_send_packet() calls
pcnet_receive() to send out the buffer.


    static void tap_send(void *opaque)
    {
        ....
        size = read(s->fd, buf, sizeof(buf));
        if (size > 0) {
            qemu_send_packet(s->vc, buf, size);
        }
    }

10. The pcnet_receive() function in ioemu copies data read from the TAP
character device to the VMM memory. The data can be either an I/O port value
from the IN instruction or a network packet. At the end of data transfer, pcnet
informs the VMM to generate an interrupt.

    static void pcnet_receive(void *opaque, const uint8_t *buf, int size)
    {
        ....
        cpu_physical_memory_write(rbadr, src, count);
        ...
        pcnet_update_irq(s);
    }

11. The ioemu module makes an hvm_op(set_pci_intx_level) hypercall to the
VMM to generate an interrupt to the target domain.

    int xc_hvm_set_pci_intx_level(
        int xc_handle, domid_t dom,
        uint8_t domain, uint8_t bus, uint8_t device, uint8_t intx,
        unsigned int level)
    {
        ....
        hypercall.op = __HYPERVISOR_hvm_op;
        hypercall.arg[0] = HVMOP_set_pci_intx_level;
        hypercall.arg[1] = (unsigned long)&arg;
        ....
        rc = do_xen_hypercall(xc_handle, &hypercall);
        ....
    }

The VMM sets the guest HVMCSDS area to inject an event with the next VM entry. The
target VM will get an interrupt when the VMM launches a VM entry to the target
domain (see “Sun xVM Server Interrupt and Exception Handling for HVM” on page 70).

Sun xVM Server with HVM I/O Virtualization (PV Drivers)

As shown in the previous section, the QEMU I/O emulation used in Sun xVM Server
with HVM suffers significant performance overhead. An I/O packet has to go through
several context switches, including a switch to the user level at Dom0, to reach its
destination. One alternative for improving the performance is to use an I/O
virtualization model similar to that of the Sun xVM Server PV architecture (see “Sun
xVM Server I/O Virtualization” on page 56). Paravirtualized drivers (PV drivers) such
as xdf and xnf are included in the OS distribution. When a VM is created, Dom0 exports
virtual I/O devices (for example, xnf and xdf) instead of emulated I/O devices (for
example, pcn and mpt) to the GOS. PV drivers are subsequently bound to these virtual
devices and used for handling I/O. The I/O transactions follow the same path as
described in Chapter 5, “Sun xVM Server”. PV drivers will be provided for Solaris 10
and Windows so they can run unmodified in the Sun xVM Server with better I/O
performance.


Chapter 7

Logical Domains

The Logical Domains (LDoms) technology from Sun Microsystems allows a system's

resources, such as memory, CPUs, and I/O devices, to be allocated into logical

groupings. Multiple isolated systems, each with their own operating system, resources,

and identity within a single computer system, can then be created using these

partitioned resources.

Unlike Sun xVM Server, LDoms technology partitions a processor into multiple strands,

and assigns each strand its own hardware resources. (See “Terms and Definitions” on

page 113.) Each virtual machine, called a domain in LDoms terminology, is associated

with one or more dedicated strands. A thin layer of firmware, called the hypervisor, is

interposed between the hardware and the operating system (Figure 24). The hypervisor

abstracts the hardware resources and provides an interface to the operating system

software.

Figure 24. The hypervisor, a thin layer of firmware, abstracts hardware resources and presents them to the OS.
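The dedication of strands to domains can be modeled with bitmasks. The following sketch is not the Logical Domain Manager's algorithm; it only illustrates the invariant that a strand belongs to at most one domain. The names strand_pool, pool_init(), and assign_strands() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DOMAINS 4

/* Strand ownership tracked as bitmasks; bit n stands for strand n. */
struct strand_pool {
    uint64_t free_mask;            /* strands not yet assigned */
    uint64_t owned[MAX_DOMAINS];   /* strands dedicated to each domain */
};

void pool_init(struct strand_pool *p, unsigned nstrands)
{
    assert(nstrands >= 1 && nstrands <= 64);
    p->free_mask = (nstrands == 64) ? UINT64_MAX : ((1ULL << nstrands) - 1);
    for (int d = 0; d < MAX_DOMAINS; d++)
        p->owned[d] = 0;
}

/* Dedicate `count` free strands to a domain; returns 0 on success, -1 on failure. */
int assign_strands(struct strand_pool *p, unsigned domain, unsigned count)
{
    if (domain >= MAX_DOMAINS)
        return -1;

    uint64_t picked = 0;
    for (unsigned s = 0; s < 64 && count > 0; s++) {
        uint64_t bit = 1ULL << s;
        if (p->free_mask & bit) {
            picked |= bit;
            count--;
        }
    }
    if (count > 0)
        return -1;                 /* not enough free strands; nothing changes */

    p->free_mask &= ~picked;       /* a strand is never shared between domains */
    p->owned[domain] |= picked;
    return 0;
}
```

With a 32-strand pool, giving four strands to the control domain and eight to a guest leaves twenty free, and a request for more than that is rejected rather than overlapping an existing domain.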

The LDoms implementation includes four components:

• UltraSPARC T1/T2 processor

• UltraSPARC hypervisor

• Logical Domain Manager (LDM)

• Paravirtualized Solaris OS

Note – The terms strand, hardware thread, logical processor, virtual CPU and virtual processor are used by various documents to refer to the same concept. For consistency, the term strand is used in this chapter.

[Figure 24 diagram: the hardware layer (CPU and memory resources) sits beneath the hypervisor, which supports a control domain and Domains 1 through 3 running Solaris 10 and Linux.]


Note – In Sun documents, the term hypervisor is used to refer to the hyperprivileged software that performs the functions of the VMM and the term domain is used to refer to a VM. To accommodate Sun's terminologies, hypervisor and domain (instead of VMM and VM) are used in this chapter.

This chapter assumes a basic understanding of the UltraSPARC T1/T2 processor, which

plays a major role in the implementation of LDoms. (See Chapter 4, “SPARC Processor

Architecture” on page 29.) The remainder of the chapter is organized as follows:

• “Logical Domains (LDoms) Architecture Overview” on page 80 provides an overview of

the LDoms architecture and the other three components of LDoms: paravirtualized

Solaris, the UltraSPARC hypervisor, and the Logical Domain manager.

• “CPU Virtualization in LDoms” on page 84 discusses CPU virtualization including trap

and interrupt handling.

• “Memory Virtualization in LDoms” on page 88 discusses memory virtualization

including physical memory allocation and page translations.

• “I/O Virtualization in LDoms” on page 91 discusses I/O virtualization and describes

the operation of the disk block and network drivers.

Logical Domains (LDoms) Architecture Overview

Logical Domains (LDoms) technology supports CPU partitioning and enables multiple

OS instances to run on a single UltraSPARC T1/T2 system. The UltraSPARC T1/T2

architecture has been enhanced from the original UltraSPARC specification to

incorporate hypervisor technology that supports hardware level virtualization.

The hypervisor is delivered with the UltraSPARC T1/T2 platform, not with the OS. During

a boot, the OpenBoot PROM (OBP) loads the Solaris OS directly from the disk. After the

boot, a logical domain manager is enabled and initializes the first domain as the

control domain. From a control domain, the administrator can create, shut down,

configure, and destroy other domains. The control domain can also be configured as an

I/O domain, which has direct access to I/O devices and provides services for other

domains to access I/O devices (Figure 25).


Figure 25. A control domain, Solaris OS, and Linux guest domains running in logical domains on an UltraSPARC T1/T2 processor-powered server.

The UltraSPARC T1/T2 processor architecture is described earlier in Chapter 4, “SPARC

Processor Architecture” on page 29. In this section, the other three components of the

LDoms technology — paravirtualized Solaris OS, hypervisor, and logical domain

manager — are discussed.

Paravirtualized Solaris OS

The Solaris kernel implementation for the UltraSPARC T1/T2 hardware class (uname -m) is referred to as the Solaris sun4v architecture. In this implementation, the

Solaris OS is paravirtualized to replace operations that require hyperprivileged mode

with hypervisor calls. The Solaris OS communicates with the hypervisor through a set of

hypervisor APIs, and uses these APIs to request that the hypervisor perform

hyperprivileged operations.

Sun4v support for LDoms is a combination of partitioning the UltraSPARC T1/T2

processor into strands and virtualization of memory and I/O services. Unlike Sun xVM

Server and VMware, an LDoms domain does not share strands with other domains.

Each domain has one or more strands assigned to it, and each strand has its own

hardware resources so that it can execute instructions independently of other strands.

The virtualization of CPU functions to support CMT is implemented at the processor

rather than at the software level (that is, there is no software scheduler). A Solaris guest

OS can directly access strand-specific registers in a domain and can, for example, write the address of the OS trap table to the trap base address (TBA) register.

The Solaris sun4v architecture assumes that the platform includes the hypervisor as part of its firmware. The hypervisor runs in the hyperprivileged mode, and the Solaris OS runs in the privileged mode of the processor. The Solaris kernel uses hypercalls to request that the hypervisor perform hyperprivileged functions of the processor.

[Figure 25 diagram: on a Sun UltraSPARC T1/T2 server, the firmware layer contains OBP, ALOM, POST, and the hypervisor (hypervisor API and hypervisor services). Above it, a control domain that also acts as the I/O domain runs Solaris with the LDM, device drivers, and direct access to devices, while guest domains run Solaris and Linux. Each kernel reaches the hypervisor through hypercalls.]

Like Intel's VT and AMD's Pacifica architectures, the sun4v architecture leverages CPU

support (hyperprivileged mode) for the implementation of the hypervisor. Unlike Intel's

VT and AMD's Pacifica architectures which provide a special mode of execution for the

hypervisor and thus make the hypervisor transparent to the GOS, the support for the

hypervisor in UltraSPARC T1/T2 is non-transparent to the GOS. The UltraSPARC T1/T2

processors provide a set of hypervisor APIs for the GOS to delegate the hyperprivileged

operations to the hypervisor.

Hypervisor Services

The hypervisor layer is a component of the UltraSPARC T1/T2 system's firmware. An

UltraSPARC system’s firmware consists of Open Boot PROM (OBP), Advanced Lights Out

Management (ALOM), Power-on Self Test (POST), and the hypervisor.

The hypervisor leverages the UltraSPARC T1/T2 hyperprivileged extensions to provide a

protection mechanism for running multiple guest domains on the system. The

hypervisor provides a number of services to the domains that run on top of it. These

services include hypervisor APIs that are the interfaces for a GOS to request hypervisor

services, and Logical Domain Channel (LDC) services which are used by virtual device

drivers for inter-domain communications.

Hypervisor API

The Sun4v hypervisor API [11] uses the Tcc instruction to cause the GOS to trap into

hyperprivileged mode, in a similar fashion to how OS system calls are implemented.

The function of the hypervisor API is equivalent to system calls in the OS that enable

user applications to request services from the OS. The Sun4v hypervisor API allows a

GOS to perform the following actions:

• Request services from the hypervisor

• Get and set CPU information through the hypervisor

The UltraSPARC Virtual Machine Specification [11] lists the complete set of services

and APIs for:

• API versioning — request and check for a version of the hypervisor APIs with which it

may be compatible

• Domain services — enable a control domain to request information about or to

affect other domains

• CPU services — control and configure a strand; includes operations such as

start/stop/suspend a strand, set/get the trap base address register, and configure the

interrupt queue

• MMU services — perform MMU related operations such as configure the TSB,

map/demap the TLB, and configure the fault status register


• Memory services — zero and flush data from cache to memory

• Interrupt services — get/set interrupt enabled, target strand, and state of the

interrupt

• Time-of-Day services — get/set time-of-day

• Console services — get/put a character to the console

• Channel Services — provide communication channels between domains (see

“Logical Domain Channel (LDC) Services” on page 83)

The following two examples, hv_mem_sync() and hv_api_set_version(), show how hypervisor calls are implemented:

    % mdb -k
    > hv_mem_sync,6/ai
    hv_mem_sync:
    hv_mem_sync:                mov     %o2, %o4
    hv_mem_sync+4:              mov     0x32, %o5
    hv_mem_sync+8:              ta      %icc, %g0 + 0
    hv_mem_sync+0xc:            retl
    hv_mem_sync+0x10:           stx     %o1, [%o4]
    > hv_api_set_version,6/ai
    hv_api_set_version:
    hv_api_set_version:         mov     %o3, %o4
    hv_api_set_version+4:       clr     %o5
    hv_api_set_version+8:       ta      %icc, %g0 + 0x7f
    hv_api_set_version+0xc:     retl
    hv_api_set_version+0x10:    stx     %o1, [%o4]

A trap type in the range 0x180-0x1FF is used to transition from privileged mode to hyperprivileged mode. In the two preceding examples, a TT value of 0x180 (offset of 0) is used for hv_mem_sync(), and a TT value of 0x1FF (offset of 0x7f) is used for hv_api_set_version().

Hypervisor calls are normally invoked during the startup of the kernel to set up strands for the domain. Only a few hypercall functions are called during the runtime of the kernel, including: hv_tod_set(), hv_tod_get(), hv_set_ctx0(), hv_mmu_map_perm_addr(), hv_mmu_unmap_perm_addr(), hv_set_ctxnon0(), and hv_mmu_set_stat_area().

Logical Domain Channel (LDC) Services

The hypervisor provides communication channels between domains. These channels are accessed within a domain as an endpoint. Two endpoints are connected together, forming a bi-directional point-to-point LDC.

All traffic sent to a local endpoint arrives at the corresponding endpoint at the other end of the channel in the form of short fixed-length (64-byte) message packets. Each endpoint is associated with one receive queue and one transmit queue. Messages from a channel are deposited by the hypervisor at the tail of a queue, and the receiving domain indicates receipt by moving the corresponding head pointer for the queue. To

send a packet down an LDC, a domain inserts the packet into its transmit queue, and

then uses a hypervisor API call to update the tail pointer for the transmit queue.
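The head and tail discipline described above can be pictured as a ring of 64-byte slots. The following is a minimal sketch only, not the hypervisor's or the Solaris ldc module's actual data structure: the names ldc_queue, ldc_q_put(), and ldc_q_get() and the queue depth are hypothetical, and the hypercall that publishes a new tail is reduced to a plain store.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LDC_PKT_SIZE  64u  /* fixed packet size stated in the text */
#define LDC_Q_ENTRIES 8u   /* hypothetical queue depth */

/* Hypothetical queue layout: head and tail are slot indexes. */
struct ldc_queue {
    uint8_t  pkt[LDC_Q_ENTRIES][LDC_PKT_SIZE];
    uint32_t head;  /* next packet to consume; moved by the receiver */
    uint32_t tail;  /* next free slot; moved by the sender */
};

/* The queue is empty when head and tail are equal. */
static int ldc_q_empty(const struct ldc_queue *q)
{
    return q->head == q->tail;
}

/* Sender side: copy one packet into the tail slot, then advance the tail.
 * Returns 0 on success, -1 if the queue is full. */
static int ldc_q_put(struct ldc_queue *q, const uint8_t pkt[LDC_PKT_SIZE])
{
    uint32_t next = (q->tail + 1u) % LDC_Q_ENTRIES;

    if (next == q->head)
        return -1;  /* one slot stays unused to tell full from empty */
    memcpy(q->pkt[q->tail], pkt, LDC_PKT_SIZE);
    q->tail = next;  /* a real domain would update this through a hypercall */
    return 0;
}

/* Receiver side: copy the head packet out, then advance the head to
 * indicate receipt. Returns 0 on success, -1 if the queue is empty. */
static int ldc_q_get(struct ldc_queue *q, uint8_t pkt[LDC_PKT_SIZE])
{
    if (ldc_q_empty(q))
        return -1;
    memcpy(pkt, q->pkt[q->head], LDC_PKT_SIZE);
    q->head = (q->head + 1u) % LDC_Q_ENTRIES;
    return 0;
}
```

Keeping one slot unused is a common way to distinguish a full ring from an empty one using only the two indexes; whether the hypervisor uses the same convention is not stated in the text.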

In the Solaris OS, the hypervisor LDC service is used as a simulated I/O bus interface,

enabling a virtual device to communicate with a real device on the I/O domain. All

virtual devices that communicate with the I/O domain for device access are leaf nodes on the LDC bus. For example, the virtual disk client driver, vdc, uses the LDC service to communicate with the virtual disk server driver, vds, on the other side of the channel. Both

vdc and vds are leaf nodes on the channel bus (see “I/O Virtualization in LDoms” on

page 91).

Logical Domain Manager

The Logical Domain Manager (LDM) provides the following functionality:

• Provides a control point for managing a domain's configuration and operation

• Binds a domain to the resources of the underlying local physical machine

• Manages the integrity of the configuration in a persistent and consistent manner

The LDM is a software module that runs on a control domain (see “Logical Domains

(LDoms) Architecture Overview” on page 80). The LDM uses the LDC to communicate

with the hypervisor when binding a domain to hardware resources, and stores the

configuration in the service processor. The LDM is only required when a domain

reconfiguration operation is needed, such as during the creation, shutdown, or deletion

of a domain.

The LDM maintains two persistent databases: one for the currently defined domains,

and one for active domains. The active domain database is stored with the service

processor, and the currently defined database is held with LDM's own persistent

storage. The command line interface to the LDM is ldm(1M).

CPU Virtualization in LDoms

The hypervisor exposes strands to a domain. Each strand has its own registers and trap

queues; shares L1 caches, the TLB, and the instruction pipeline with other strands in

the same core; and shares the L2 cache with other strands in the socket. Strands on the

UltraSPARC T1 processor share the FPU with other strands, while each core in the

UltraSPARC T2 processor has its own floating-point and graphics unit (FGU). Each

domain has its own strands that are not shared with other domains. The software

threads (also known as kernel threads) are executed on the strands that are bound to

that domain. Unlike the VMM in Sun xVM Server or VMware, there is no software

scheduler in the hypervisor.

CPU virtualization in LDoms, from a software perspective, involves trap and interrupt

handling and timer services.


Trap and Interrupt Handling

Each strand has two trap tables for handling traps: the hyperprivileged trap table and

the privileged trap table. The trap table used for handling a trap depends on the

following criteria:

• Trap type (TT)

• Trap level at the time when the trap is taken

• Privilege mode at the time when the trap is taken

The UltraSPARC Architecture 2005 specification (see Table 12-4 in [2]) lists the mode in

which a trap is delivered based on a given TT and current privileged mode.

The hyperprivileged trap table and the privileged trap table are installed by the hypervisor and the GOS, respectively. For example, the Solaris OS installs the trap table for sun4v in mach_cpu_startup():

    ENTRY_NP(mach_cpu_startup)
        ....
        set     trap_table, %g1
        wrpr    %g1, %tba           ! write trap_table to %tba
        ....

The hypervisor installs its trap table in start_master():

    ENTRY_NP(start_master)
        ....
        setx    htraptable, %g3, %g1
        wrhpr   %g1, %htba
        ....

Each strand has two interrupt queues: cpu_mondo and dev_mondo. The cpu_mondo queue is used for CPU-to-CPU cross-call interrupts; the dev_mondo queue is used for I/O-to-CPU interrupts. The Solaris kernel allocates memory for each queue, and registers these queues with the hv_cpu_qconf() hypercall. When the queue is non-empty (that is, the queue head is not equal to the queue tail), a trap is generated to the target CPU. The data of the interrupt received (mondo data) is stored in the queue.
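The non-empty test on the two mondo queues can be sketched as follows. This is an illustration only: the type and function names are hypothetical, and the trap numbers 0x7C and 0x7D are the ones this chapter gives for CPU cross calls and I/O interrupts in the interrupt delivery steps.

```c
#include <assert.h>
#include <stdint.h>

#define TT_CPU_MONDO 0x7Cu  /* trap type for the cpu_mondo queue */
#define TT_DEV_MONDO 0x7Du  /* trap type for the dev_mondo queue */

enum mondo_queue { CPU_MONDO_Q, DEV_MONDO_Q };

/* Hypothetical view of one interrupt queue: only head and tail matter here. */
struct mondo_q_state {
    uint64_t head;
    uint64_t tail;
};

/* Returns the trap type raised for a queue, or 0 when the queue is empty
 * (head equal to tail) and no trap is pending. */
static unsigned mondo_pending_trap(enum mondo_queue which,
                                   const struct mondo_q_state *q)
{
    if (q->head == q->tail)
        return 0;
    return (which == CPU_MONDO_Q) ? TT_CPU_MONDO : TT_DEV_MONDO;
}
```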


The Solaris kernel function for registering the interrupt queues is cpu_intrq_register(), as shown below:

    void
    cpu_intrq_register(struct cpu *cpu)
    {
        struct machcpu *mcpup = &cpu->cpu_m;
        uint64_t ret;

        ret = hv_cpu_qconf(INTR_CPU_Q, mcpup->cpu_q_base_pa, cpu_q_entries);
        ....

        ret = hv_cpu_qconf(INTR_DEV_Q, mcpup->dev_q_base_pa, dev_q_entries);
        ....
    }

The I/O and CPU cross-call interrupt delivery mechanism is as follows:

1. An I/O device asserts its interrupt line to generate an interrupt to the processor. The I/O bridge chip receives the interrupt request and prepares a mondo packet to be sent to the target processor whose CPU number is stored in the bridge chip register by the OS. The mondo packet contains an interrupt number that uniquely identifies the source of the interrupt.

2. The hypervisor receives an interrupt request from the hardware through the interrupt vector trap (0x60). For example, the trap table for the T2000 firmware has the following entries:

       ENTRY(htraptable)
           ....
           TRAP(tt0_05e, HSTICK_INTR)  /* HV: hstick match */
           TRAP(tt0_05f, NOT)          /* reserved */
           TRAP(tt0_060, VECINTR)      /* interrupt vector */
           ....

   The CPU number and interrupt number are also delivered, along with the interrupt trap. The interrupt vector trap handler, VECINTR, uses the interrupt number to determine the source of the interrupt. If the interrupt is coming from I/O, the trap handler uses the CPU number to find the dev_mondo queue associated with the CPU and adds the interrupt to the tail of the dev_mondo queue. When the head of the queue is not equal to the tail, a trap (0x7C for CPU cross calls and 0x7D for I/O) is generated to the CPU that owns the queue.

3. Traps 0x7C and 0x7D are taken via the GOS trap table. For I/O interrupts, dev_mondo() is the trap handler for 0x7D.


   The GOS trap table entries for traps 0x7C and 0x7D are:

       # mdb -k
       > trap_table+0x20*0x7c/ai
       0x1000f80:
       0x1000f80:      ba,a,pt %xcc, +0xc784 <cpu_mondo>
       > trap_table+0x20*0x7d/ai
       0x1000fa0:
       0x1000fa0:      ba,a,pt %xcc, +0xc800 <dev_mondo>
       >

   The dev_mondo() handler takes the interrupt out of the queue by incrementing the queue head. It also finds the interrupt vector data, struct intr_vec, from the system’s interrupt vector table. The struct intr_vec data contains the priority interrupt level (PIL) and the driver's interrupt service routine (ISR) for the interrupt. The dev_mondo() handler then sets the SOFTINT register with the PIL of the interrupt.

4. Setting the SOFTINT register causes an interrupt_level_n trap, 0x41-0x4f, to be generated, where n is the PIL of the interrupt. The GOS's trap handler for the interrupt_level_1 interrupt, for example, is shown below:

       > trap_table+0x20*0x41,2/ai
       tt_pil1:
       tt_pil1:        ba,pt   %xcc, +0xc33c <pil_interrupt>
       0x1000824:      mov     1, %g4
       >

   If the PIL of the interrupt is below the clock PIL, an interrupt thread is allocated to handle the interrupt. Otherwise, the high-level interrupt is handled by the currently executing thread.

In summary, the interrupt delivery mechanism is a two-stage process. First, an interrupt is delivered to the hypervisor as the interrupt vector trap, 0x60. Then the interrupt is added to an interrupt queue, which causes another trap to the GOS.

LDoms Timer Service

The system time is provided by the programmable interrupt generator. Clock interrupts are sent directly from the hardware to the domain, without being queued in the hypervisor. Therefore, unlike Sun xVM Server domains, LDoms exhibit no “lost ticks” issues.

The time of day (TOD) is maintained by the hypervisor on a per-domain basis. The Solaris OS uses the tod_get() and tod_set() hypercalls to get and set the TOD, respectively. Setting the TOD in one domain does not affect any other domain.


The high resolution timer is provided by the rdtick instruction, which reads the

counter field of the TICK register. The rdtick instruction is a privileged instruction

that can be executed by the Solaris OS without hypervisor involvement.

Memory Virtualization in LDoms

Similar to Sun xVM Server, memory virtualization in LDoms deals with two memory

management issues:

• Physical memory sharing and partitioning

• Page translations

Physical Memory Allocation

The UltraSPARC T1/T2 processors support three types of memory addressing:

• Virtual Address (VA) — utilized by user programs

• Real Address (RA) — describes the underlying memory allocated to a GOS

• Physical Address (PA) — appears in the system bus for accessing physical memory

Multiple virtual address spaces within the same real address space are distinguished by

a context identifier (context ID). The context ID is included as a field in the TTE for VA to

PA translation (see “Memory Management Unit” on page 32). The GOS can create

multiple virtual address spaces, using the primary and secondary context registers to

associate a context ID with every virtual address. The GOS manages the allocation of

context IDs among the processes within the domain.

Multiple real address spaces within the same physical address space are distinguished

by a partition identifier (partition ID). The hypervisor can create multiple real address

spaces, using the partition register to associate a partition ID with every real address.

The hypervisor manages the allocation of partition IDs.

Because of the new addressing scheme, a number of new ASIs are defined for RA and PA

addressing, as described in Table 8.

Table 8. New ASIs defined for real and physical addresses.

ASI #   ASI Name             Description
0x14    ASI_REAL             Real Address (memory)
0x15    ASI_REAL_IO          Noncacheable Real Address
0x1C    ASI_REAL_LITTLE      Real Address Little-endian
0x1D    ASI_REAL_IO_LITTLE   Noncacheable Real Address Little-endian
0x21    ASI_MMU_CONTEXTID    MMU context register
0x52    ASI_MMU_REAL         MMU Register


The partition ID register is defined in ASI 0x58, VA 0x80 [2] with an 8-bit field for the

partition ID.
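One way to picture the context ID and partition ID is as tags that qualify an address, so that the same virtual address in two contexts, or the same context in two partitions, never refers to the same mapping. The sketch below is conceptual only: the structure and function names are hypothetical, and the 16-bit width chosen for the context ID is an assumption (only the 8-bit partition ID field is stated in the text).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A virtual address qualified by the context that owns it. */
struct va_key {
    uint16_t context_id;  /* width is an assumption for illustration */
    uint64_t va;
};

/* The same key further qualified by the partition (domain) it belongs to. */
struct pa_key {
    uint8_t       partition_id;  /* 8-bit field, as stated in the text */
    struct va_key inner;
};

/* Build a fully qualified key from its three components. */
static struct pa_key qualify(uint8_t partition_id, uint16_t context_id,
                             uint64_t va)
{
    struct pa_key k = { partition_id, { context_id, va } };
    return k;
}

/* Two keys name the same mapping only if all three components match. */
static bool same_mapping_key(struct pa_key a, struct pa_key b)
{
    return a.partition_id == b.partition_id &&
           a.inner.context_id == b.inner.context_id &&
           a.inner.va == b.inner.va;
}
```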

The full representation of each type of address is as follows:

    real_address     = context_ID :: virtual_address
    physical_address = partition_ID :: real_address

or:

    physical_address = partition_ID :: context_ID :: virtual_address

Figure 26 illustrates the type of addressing used in each mode of operation.

[Figure 26 diagram: a process in user space uses virtual addressing (64-bit addressing) in unprivileged mode; the kernel space of a logical domain uses real addressing (64-bit VA + context ID) in privileged mode; the physical system uses physical addressing (64-bit VA + context ID + partition ID) in hyperprivileged mode.]

Figure 26. Different types of addressing are used in different modes of operation.

Page Translations

Page translations in the UltraSPARC architecture are managed by software through several different types of traps (see “Memory Management Unit” on page 32). Depending on the trap type, traps may be handled by the hypervisor or the GOS. Table 9 summarizes the MMU-related trap types (see also Table 12-4 in [2]).

Table 9. MMU-related trap types in the UltraSPARC T1/T2 processor

Trap name                          Trap cause                    TT          Handled by
fast_instruction_access_MMU_miss   iTLB Miss                     0x64        Hypervisor
fast_data_access_MMU_miss          dTLB Miss                     0x68        Hypervisor
fast_data_access_protection        Protection Violation          0x6c        Hypervisor
instruction_access_exception       Several                       0x08        Hypervisor
data_access_exception              Several                       0x30        Hypervisor
instruction_access_MMU_miss        iTSB Miss                     0x09        GOS
data_access_MMU_miss               dTSB Miss                     0x31        GOS
*mem_address_not_aligned           Misaligned memory operation   0x34-0x39   Hypervisor

In the hypervisor trap table, htraptable, the instructions for handling a dTLB miss, trap 0x68, are:

    % mdb ./ontario/release/q
    > htraptable+0x20*0x68,8/ai
    htraptable+0xd00:
    htraptable+0xd00:       rdpr    %priv_16, %g1
    htraptable+0xd04:       cmp     %g1, 3
    htraptable+0xd08:       bgu,pn  %xcc, +0x73b8 <watchdog_guest>
    htraptable+0xd0c:       mov     0x28, %g1
    htraptable+0xd10:       ba,pt   %xcc, +0x97a0 <dmmu_miss>
    htraptable+0xd14:       ldxa    [%g1] 0x4f, %g1
    htraptable+0xd18:       illtrap 0
    htraptable+0xd1c:       illtrap 0

The trap table transfers control to dmmu_miss() to load the page translation from the TSB. If the translation doesn't exist in the TSB, dmmu_miss() calls dtsb_miss(). The handler dtsb_miss() sets the TT register to trap type 0x31 (data_access_MMU_miss), changes the PSTATE register to the privileged mode, and transfers control to the GOS's trap handler for trap 0x31. The portion of dtsb_miss() that performs this functionality is shown in the following example:

    > dtsb_miss,80/ai
    ....
    wrpr    %g0, 0x31, %tt      ! write 0x31 to %tt
    rdpr    %pstate, %g3        ! read %pstate to %g3
    or      %g3, 4, %g3
    wrpr    %g3, %pstate        ! write %g3 to %pstate
    rdpr    %tba, %g3           ! get privileged mode's trap table base address
    add     %g3, 0x620, %g3     ! set %g3 to the address of trap type 0x31
    ....
    jmp     %g3                 ! jump to 0x31 trap handler

In the Solaris OS, the trap handler for trap type 0x31 calls the handler sfmmu_slow_dmmu_miss() to load the page translation from hme_blk. If no entry is found in hme_blk for the virtual address, sfmmu_slow_dmmu_miss() calls sfmmu_pagefault() to transfer control to Solaris's trap() handler.
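The offset 0x620 added to %tba in dtsb_miss(), and the 0x20*TT expressions used in the mdb sessions in this chapter, follow from each trap table entry holding eight 4-byte instructions (32 bytes). A minimal sketch of that arithmetic is shown here, assuming traps taken at trap level 0; the function name is hypothetical, and the base address 0x1000000 used in the usage note is inferred from the mdb output for trap 0x7c rather than stated in the text.

```c
#include <assert.h>
#include <stdint.h>

#define TRAP_ENTRY_SIZE 0x20u  /* 8 instructions x 4 bytes per entry */

/* Address of the trap table entry for trap type tt, given the table base
 * (the value held in %tba or %htba), for traps taken at trap level 0. */
static uint64_t trap_entry_addr(uint64_t table_base, unsigned tt)
{
    return table_base + (uint64_t)tt * TRAP_ENTRY_SIZE;
}
```

For example, trap_entry_addr(0, 0x31) gives 0x620 and trap_entry_addr(0, 0x68) gives 0xd00, which match the scb+0x620 and htraptable+0xd00 offsets in the listings, and trap_entry_addr(0x1000000, 0x7c) gives 0x1000f80.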


The Solaris trap table entry for trap type 0x31 and the end of sfmmu_pagefault() are shown below:

    % mdb -k
    > trap_table+0x20*0x31,2/ai
    scb+0x620:
    scb+0x620:              ba,a    +0xc1b4 <sfmmu_slow_dmmu_miss>
    scb+0x624:              illtrap 0
    > sfmmu_pagefault,80/ai
    ....
    sfmmu_pagefault+0x78:   sethi   %hi(0x101d400), %g1
    sfmmu_pagefault+0x7c:   or      %g1, 0x364, %g1
    sfmmu_pagefault+0x80:   ba,pt   %xcc, -0x13f0 <sys_trap>

I/O Virtualization in LDoms

LDoms provide the ability to partition system PCI buses so that more than one domain can directly access devices. (Currently, access by up to two domains is supported.) A domain that has direct access to devices is called an I/O domain or service domain. A domain that doesn't have direct access to devices uses the virtual I/O (VIO) framework and goes through an I/O domain for access.

The device tree of a domain is determined by the OBP of that domain. The OBP device tree of a typical non-I/O domain is shown in the following example:

    {0} ok show-devs
    /cpu@3
    /cpu@2
    /cpu@1
    /cpu@0
    /virtual-devices@100
    /virtual-memory
    /memory@m0,4000000
    /aliases
    /options
    /openprom
    /chosen
    /packages
    /virtual-devices@100/channel-devices@200
    /virtual-devices@100/console@1
    /virtual-devices@100/ncp@4
    /virtual-devices@100/channel-devices@200/disk@0
    /virtual-devices@100/channel-devices@200/network@0
    /openprom/client-services
    /packages/obp-tftp
    /packages/kbd-translator
    /packages/SUNW,asr
    /packages/dropins
    /packages/terminal-emulator
    /packages/disk-label
    /packages/deblocker
    /packages/SUNW,builtin-drivers


During the system boot, the OBP device tree information is passed to the Solaris OS and used to create the system device nodes. Output from the following prtconf(1M) command shows the system configuration of a typical non-I/O domain:

    # prtconf
    System Configuration: Sun Microsystems sun4v
    Memory size: 1024 Megabytes
    System Peripherals (Software Nodes):

    SUNW,Sun-Fire-T200
        scsi_vhci, instance #0
        packages (driver not attached)
            SUNW,builtin-drivers (driver not attached)
            deblocker (driver not attached)
            disk-label (driver not attached)
            terminal-emulator (driver not attached)
            dropins (driver not attached)
            SUNW,asr (driver not attached)
            kbd-translator (driver not attached)
            obp-tftp (driver not attached)
            ufs-file-system (driver not attached)
        chosen (driver not attached)
        openprom (driver not attached)
            client-services (driver not attached)
        options, instance #0
        aliases (driver not attached)
        memory (driver not attached)
        virtual-memory (driver not attached)
        virtual-devices, instance #0
            ncp (driver not attached)
            console, instance #0
            channel-devices, instance #0
                disk, instance #0
                network, instance #0
        cpu (driver not attached)
        cpu (driver not attached)
        cpu (driver not attached)
        cpu (driver not attached)
        iscsi, instance #0
        pseudo, instance #0

As this system configuration shows, no physical devices are exported to the domain. The virtual-devices entry is the nexus node of all virtual devices. The channel-devices entry is the bus node for the virtual devices that require communication with the I/O domain. The disk and network entries are leaf nodes on the channel-devices bus.


The Solaris drivers that are specific to the LDom configuration are listed below:

    LDOM drivers:
    vdc (virtual disk client 1.4)              non I/O domain only
    ldc (sun4v LDC module v1.5)
    ds (Domain Services 1.3)
    cnex (sun4v channel-devices nexus dri)
    vnex (sun4v virtual-devices nexus dri)
    dr_cpu (sun4v CPU DR 1.2)
    drctl (DR Control pseudo driver v1.1)
    qcn (sun4v console driver v1.5)
    vnet (vnet driver v1.4)                    non I/O domain only
    vds (virtual disk server v1.6)             I/O domain only
    vsw (sun4v Virtual Switch Driver 1.5)      I/O domain only

Similar to Sun xVM Server, the LDoms VIO on a non-I/O domain uses a split device driver architecture for virtual disk and network devices. The vdc and vnet client drivers are used in non-I/O domains. The vds and vsw server drivers are used in the I/O domain to support the vdc and vnet drivers. The vnex nexus driver, the driver for the virtual-devices nexus node, provides bus services to its child nodes, vnet and vdc.

The VIO framework uses the hypervisor’s Logical Domain Channel (LDC) service for driver communication between domains. The LDC forms bi-directional point-to-point links between two endpoints. All traffic sent to a local endpoint arrives only at the corresponding endpoint at the other end of the channel in the form of short fixed-length (64-byte) message packets. Each endpoint is associated with one receive queue and one transmit queue. Messages from a channel are deposited by the hypervisor at the tail of a queue, and the receiving domain indicates receipt by moving the corresponding head pointer for the queue. To send a packet down an LDC, a domain inserts the packet into its transmit queue, and then uses a hypervisor API call to update the tail pointer for the transmit queue.

Disk Block Driver

On non-I/O domains, the vdc client driver provides the disk interface. The vdc driver receives I/O requests from the file system or raw device access, and sends these requests to the hypervisor LDC. The vds server driver, located in the I/O domain, sits on the other end of the LDC. The vds driver receives requests from the vdc driver and then forwards these requests to the disk service to which the disk device on the client is mapped.

The sequence of events for disk I/O is illustrated in Figure 27.


Figure 27. Sequence of events for disk I/O from a non-I/O domain to an I/O domain.

For non-I/O domains, the following events occur when applications use read(2) and

write(2) system calls to access a file:

1. The file system calls the vdc driver's strategy(9E) entry point.

2. The vdc driver sends the I/O buf, buf(9S), to the LDC. The vdc driver returns

after all data is successfully sent to the LDC.

3. The vds driver is notified by the hypervisor that messages are available on its

queue.

4. The vds driver retrieves data from the LDC and sends it to the device service that is

mapped to the client virtual disk.

5. The vds driver starts the block I/O by sending the I/O request to the native driver

and then dispatching a task queue to await I/O completion.

6. The native SCSI driver receives the device interrupt.

7. The vds driver's I/O completion is woken up by biodone(9F).

8. The vds driver sends a message to vdc indicating I/O completion.

9. The vdc driver receives the message from vds, and calls biodone(9F) to wake

up anyone waiting for it.

For I/O domains, the I/O path of data requests is simpler:

10. Block I/O requests are sent directly from the file system to the native driver.

In addition to the strategy(9E) driver entry point for supporting file system and raw device access, the vdc driver also supports most of the ioctl(2) commands defined in dkio(7I) for disk control. The Solaris kernel variable dk_ioctl (see footnote 1) defines the exact disk ioctl commands supported by the vdc driver.

[Figure 27 diagram: on an UltraSPARC T1 system, a non-I/O domain (client) passes read(2)/write(2) requests from user space through the file system to vdc in the kernel; vdc communicates over the Logical Domain Channel (LDC) in the LDOM hypervisor with vds in the I/O domain (server), which hands requests to the file system or the native driver depending on whether the backing store is a file. The numbers 1 to 10 mark the steps listed in the text.]

Network Driver

The Solaris LDoms network drivers include a client network driver, vnet, and a virtual

switch, vsw, on the server side. To transmit a packet, vnet sends a packet over the LDC

to vsw. The binding of vnet to vsw is defined in the vnet resource of the domain

when the domain is created. The vsw forwards the packet to the native driver, and

includes the IP address of vnet as the source address. The vnet driver returns as soon

as the packet has been put on a buffer and the buffer has been added to the tail of the

LDC queue.

When receiving packets from the network, if the native driver is configured as a virtual

switch in the vswitch resource of the domain, the packet is passed up from the native

driver to vsw. The vsw finds the MAC address associated with the destination IP

address from its ARP table. The vsw gets the target domain from the MAC address, and

gets the vnet interface from the vnet resource. The packet is then sent to the LDC of

the designated vnet driver.

The vnet driver uses Solaris GLD v3 interfaces and is fully compatible with the native

driver using the same GLD v3 interface.

Figure 28 depicts the flow of receiving a packet from the network through an I/O

domain to a guest domain. The sequence of operations for receiving packets is as

follows:

1. Data is stored via DMA into the receive buffer ring of the native driver, e1000g.

2. The vsw sends the packet to the client driver, vnet, through the LDC.

3. The LDC receiving worker thread gets the packet and sends it to the vnet driver.

Footnote 1: Information on the Solaris kernel variable dk_ioctl can be found at http://www.opensolaris.org/.


Figure 28. Flow of control for receiving a network packet from an I/O domain to a guest domain.

[Figure 28 diagram: on a Sun UltraSPARC T1/T2 server, the network chip delivers the packet to the I/O domain (1), the I/O domain writes it to the Logical Domain Channel in the LDoms hypervisor (2), and the guest domain receives it (3). The figure shows the following call stacks.]

I/O domain:

    ldc`ldc_write
    vsw`vsw_dringsend+0x234
    vsw`vsw_portsend+0x60
    vsw`vsw_forward_all+0x134
    vsw`vsw_switch_l2_frame+0x248
    mac`mac_rx+0x58
    e1000g`e1000g_intr_pciexpress+0xb8
    px`px_msiq_intr+0x1b8
    intr_thread+0x170
    cpu_halt+0xc0
    idle+0x128
    thread_start+4

Guest domain:

    vnet`vgen_handle_datamsg
    ldc`i_ldc_rx_hdlr+0x3c0
    cnex`cnex_intr_wrapper+0xc
    intr_thread+0x170
    idle+0x128
    thread_start+4


Chapter 8

VMware

VMware, the current market leader in virtualization software for x86 platforms, offers

three virtual machine systems: VMware Workstation; no-cost VMware Server, formerly

known as VMware GSX Server; and VMware Infrastructure 3, a suite of virtualization

products based on VMware ESX Server Version 3.

The VMware Workstation and VMware Server products are add-on software modules

that run on a host OS such as Windows, Linux, or BSD variants (Figure 29). In

these implementations the VMM is a part of, and has the same privilege as, the host OS

kernel. The guest OS runs as an application on the host OS. The Solaris OS can only run

as a guest OS on VMware Workstation and VMware Server.

The VMware Infrastructure suite of products is built around the VMware ESX Server.

VMware ESX Server runs on bare metal and uses a derived version of SimOS [18] as the

kernel for running the VMM and I/O services. All other operating systems run as a guest

OS. VMware Infrastructure supports Windows, Linux, and Solaris as guest OS. VMware

ESX Server provides lower overhead and better control of system resources than

VMware Workstation and Server. However, because it provides all device drivers itself, it

supports fewer devices than VMware Workstation and VMware Server.

Figure 29 shows the configuration of VMware ESX Server and GSX Server.

Figure 29. VMware GSX Server (VMware Workstation and VMware Server products) runs within a host operating system, while VMware ESX Server runs on the bare metal.

VMware ESX Server is a Type I VMM, and has exclusive control of hardware resources

(see “Types of VMM” on page 10). In contrast, VMware Workstation and VMware Server

are Type II VMMs, and leverage the host OS by running inside the OS kernel.

[Figure 29 diagram labels: ESX Server; GSX Server; Guest OS (Linux, Solaris, Windows); VMM; Host Apps; Host Operating System; Hardware.]


VMware Infrastructure Overview

VMware ESX Server, VMware's product for running enterprise applications in data

centers, serves as the foundation of the VMware Infrastructure solution. VMware ESX

Server includes the following components:

• Virtualization layer — abstracts the hardware resources including CPU, memory,

and I/O

• I/O interface — enables the delivery of file system services to VMs

• Service Console — provides an interface to manage resources and administer VMs

Figure 30 shows the functional components of the VMware ESX Server product. The

VMkernel, the core of the ESX Server, abstracts the underlying hardware resources and

implements the virtualization layer. The VMkernel includes multiple VMMs, one for

each VM. The VMM implements the virtual CPUs for each VM. The VMkernel also

includes modules for I/O driver emulation, the I/O stack, and device drivers for network

and storage devices. The service console, a Red Hat Linux-based component, serves as a

boot loader and provides a management interface to the VMkernel.

Figure 30. VMware ESX Server functional components.

The following sections discuss the functional components of VMware Infrastructure,

with particular emphasis on the virtualization layer which forms the core of all VMware

virtualization products.

[Figure 30 diagram labels: Guest Application; Service Console; Management Interface; VMkernel; VMM; Execution Mode; Network Emulation; Storage Emulation; Network Stack; Storage Stack; Network Driver; SCSI Driver; Storage Driver; Hardware Interface Layer; CPU; Network; Storage; Sun x64 Server.]

VMware CPU Virtualization

ESX Server provides full virtualization, enabling an unmodified GOS to run on the

underlying x86 hardware. Full virtualization is achieved by the ESX virtualization

layer. The core of the ESX virtualization layer is the VMM, which includes three modules

(Figure 31) [12]:

• Execution decision module — decides whether VM instructions should be sent to

the direct execution module or the binary translation module

• Binary translation module — used to execute the VM whenever the hardware

processor is in a state in which direct execution cannot be used

• Direct execution module — enables the VM to directly execute its instruction

sequences on the underlying hardware processor

Figure 31. VMware ESX Server virtualizes the CPU hardware through binary translation whenever the processor itself cannot directly execute an instruction.

The decision to use binary translation or direct execution depends on the state of the

processor and whether the segment is reversible or not (see “Segmented Architecture”

on page 23). If the content of the descriptor table, for example the GDT, is changed by

the VMM because of a context switch to another VM, the segment is non-reversible.

Direct execution can be used only if the VM is running in an unprivileged mode and the

hidden descriptors of the segment register are reversible. In all other cases, the VMM

will switch to the binary translation module.
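The decision rule described above can be summarized in a few lines. The following Python sketch is illustrative only: the `VcpuState` type, its field names, and the `choose_execution_module` function are hypothetical and are not taken from any published ESX Server interface.

```python
from dataclasses import dataclass


@dataclass
class VcpuState:
    """Hypothetical snapshot of the state the execution decision module inspects."""
    privileged: bool           # the VM is currently running privileged (kernel) code
    segments_reversible: bool  # hidden segment descriptors can be restored from the tables


def choose_execution_module(state: VcpuState) -> str:
    """Return which module should run the VM's next instructions."""
    # Direct execution is allowed only for unprivileged code whose
    # segment state is reversible; everything else is binary translated.
    if not state.privileged and state.segments_reversible:
        return "direct_execution"
    return "binary_translation"
```

With this rule, user-level guest code with reversible segments runs directly on the processor, while guest kernel code, or any code running with a non-reversible segment, is handed to the binary translation module.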

Binary Translation

The binary translation (BT) module is believed to have been influenced by the machine simulators

Shade [13] and Embra [14]. Embra is part of SimOS [18], which was developed by a

Stanford team led by Mendel Rosenblum, one of the founders of VMware. While

extensive details of the BT module implementation have not been published, Agesen

[15], Embra [14], and Shade [13] provide some information on its implementation.

The BT module translates GOS instructions, which are running in a deprivileged VM,

into instructions that can run in the privileged VMM segment. The BT module receives

x86 binary instructions, including privileged instructions, as input. The output of the

module is a set of instructions that can be safely executed in the non-privileged mode.

Agesen [15] gives an example of how control flow is handled in the BT module.

[Figure 31 diagram labels: VM; GOS; VMM; Execution Decision; Binary Translation; Direct Execution; Execution Module.]


To avoid frequently retranslating blocks of instructions, translated blocks are kept in a

Translation Cache (TC). The execution of a block of instructions is simulated by locating

the block’s translation in the TC and jumping to it. A hash table maintains the

mappings from a program counter to the address of the translated code in the TC.

The main loop of the dynamic binary translation simulator is shown in Figure 32. The

loop checks to see if the current simulated program counter address is present in the

TC. If it is present in the TC, the translated block is executed. If it is not, the translator is

called to add the block to the TC. Each block of translated code ends by loading the new

simulated program counter and jumping back to the main loop for dispatching.

Figure 32. Binary translation manages a translation cache to reduce the need to re-translate frequently executed blocks of instructions.

A more detailed description of binary translation is beyond the scope of this paper.

Readers should refer to Shade [13] and Embra [14] for more details about dynamic

binary translation.
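The dispatch loop and translation cache described above can be sketched as a toy simulator. This is not VMware's implementation: the guest "instruction set" (a block is a list of integer increments plus a successor address), the `ToyTranslator` class, and the `acc` register are all assumptions made for illustration. A Python dictionary stands in for the hash table that maps a program counter to translated code.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy guest block: (increments to apply, next pc or None to halt).
GuestBlock = Tuple[List[int], Optional[int]]
TranslatedBlock = Callable[[Dict[str, int]], Optional[int]]


class ToyTranslator:
    def __init__(self, guest_code: Dict[int, GuestBlock]) -> None:
        self.guest_code = guest_code
        self.tc: Dict[int, TranslatedBlock] = {}  # translation cache: pc -> translated block
        self.translations = 0                     # how many times the translator ran

    def translate(self, pc: int) -> TranslatedBlock:
        """Translate the guest block at pc and store the result in the TC."""
        increments, next_pc = self.guest_code[pc]
        total = sum(increments)  # the "translation": fold the block into one operation

        def block(cpu: Dict[str, int]) -> Optional[int]:
            cpu["acc"] += total
            return next_pc       # each translated block ends by yielding the new pc

        self.tc[pc] = block
        self.translations += 1
        return block

    def run(self, pc: Optional[int], cpu: Dict[str, int], max_blocks: int) -> Dict[str, int]:
        """Main dispatch loop: look up the pc in the TC, translate on a miss, execute."""
        executed = 0
        while pc is not None and executed < max_blocks:
            block = self.tc.get(pc)
            if block is None:
                block = self.translate(pc)
            pc = block(cpu)
            executed += 1
        return cpu
```

Running a two-block loop for six block executions translates each block only once; the remaining four executions are served from the translation cache.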

Some privileged instructions that have simple operations use in-TC sequences. For

example, a clear interrupt instruction (cli) can be replaced by setting a virtual

processor flag. Privileged instructions that have more complex operations (such as

setting cr3 during a context switch), require a call out of the TC to perform the

emulation work.
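The difference between an in-TC sequence and a call-out can be illustrated with a short sketch. The `VirtualCpu` class, the instruction names `"cli"` and `"mov_cr3"`, and the `emulate_load_cr3` helper are hypothetical; a real translator emits machine code rather than Python closures.

```python
class VirtualCpu:
    """Hypothetical per-VM virtual processor state kept by the VMM."""

    def __init__(self) -> None:
        self.interrupts_enabled = True  # virtual interrupt flag
        self.cr3 = 0                    # virtual page-table base register
        self.callouts = 0               # number of times execution left the TC


def emulate_load_cr3(vcpu: VirtualCpu, value: int) -> None:
    """Call-out handler: complex emulation performed outside the TC."""
    vcpu.callouts += 1
    vcpu.cr3 = value  # a real VMM would also switch shadow page tables here


def translate_privileged(op: str, operand: int = 0):
    """Return the translated form of a privileged guest instruction."""
    if op == "cli":
        # Simple operation: handled inline by clearing a virtual processor flag.
        def in_tc(vcpu: VirtualCpu) -> None:
            vcpu.interrupts_enabled = False
        return in_tc
    if op == "mov_cr3":
        # Complex operation: translated into a call out of the TC.
        def call_out(vcpu: VirtualCpu) -> None:
            emulate_load_cr3(vcpu, operand)
        return call_out
    raise ValueError(f"unhandled instruction: {op}")
```

The design point is cost: the inline sequence touches only VMM-held state and stays in the translation cache, while the call-out pays for a transfer of control to the emulation code.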

In addition to binary translation and logic for determining the code execution, the

virtualization layer employs other techniques to overcome x86 virtualization issues:

• Memory Tracing

The virtualization layer traces modifications on any given physical page of the virtual

machine, and is notified of all read and write accesses made to that page in a

transparent manner. This memory tracing ability in the VMM is enabled by page

faults and the ability to single-step the virtual machine via binary translation.

Figure 32 (pseudocode shown in the diagram; components: Translator, Translation Cache):

    Main {
        ....
        dispatch_loop:
            if (PC_not_in_TC(pc))
                tc = translate(pc);
            newpc = pc_to_tc(pc);
            jump_to_pc(newpc);
        ....
    }

    translate(pc) {
        ....
        blk = read_instructions(pc);
        perform_translation(blk);
        write_into_TC(blk);
        ....
    }

    Translation Cache: code fragments which end with a jump back to dispatch_loop


• Shadow Descriptor Tables

The x86's segmented architecture (see “Segmented Architecture” on page 23) has a

segment caching mechanism that allows the segment register's hidden fields to be

re-used. However, this approach can cause difficulty if the descriptor table is

modified in a non-coherent way. The virtualization layer supports the GOS system

descriptor tables using VMM shadow descriptor tables.

The VMM descriptor tables include shadow descriptors that correspond to

predetermined descriptors of the VM descriptor tables. The VMM also includes a

segment tracking mechanism that compares the shadow descriptors with their

corresponding VM segment descriptors. This mechanism detects any mismatch

between the shadow descriptor tables and their corresponding VM

descriptor tables, and updates the shadow descriptors so that they again match

their respective corresponding VM segment descriptors.
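The segment tracking mechanism described above amounts to a compare-and-update pass over the tracked descriptors. The sketch below is a simplification under stated assumptions: descriptors are modeled as raw integers, tables as dictionaries indexed by selector, and the function name `sync_shadow_descriptors` is hypothetical.

```python
from typing import Dict, Iterable, List


def sync_shadow_descriptors(
    vm_table: Dict[int, int],
    shadow_table: Dict[int, int],
    tracked: Iterable[int],
) -> List[int]:
    """Bring shadow descriptors back in line with the VM's descriptors.

    Returns the indices whose shadow copy was stale and had to be updated.
    """
    # Compare each tracked shadow descriptor with its VM counterpart.
    stale = [i for i in tracked if shadow_table.get(i) != vm_table[i]]
    # Update only the entries that no longer correspond.
    for i in stale:
        shadow_table[i] = vm_table[i]
    return stale
```

A second call immediately after a sync returns an empty list, because every tracked shadow descriptor then matches its VM descriptor.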

The ESX Server's VMM implementation is unique in that each GOS has an associated

VMM. The ESX Server may include any number of VMMs in a given physical system,

each supporting a corresponding VM; the number of VMMs is limited only by available

memory and speed requirements. The features in the virtualization layer mentioned in

the previous discussion allow multiple concurrent VMMs, with each VMM supporting

an unmodified GOS in the virtualization layer.

CPU Scheduling

The ESX Server implements a rate-based proportional-share scheduler [19] that is

similar to the fair-share-scheduler scheme used by the Solaris OS (see [21] Chapter 8) in

which each virtual machine is given a number of shares. The amount of CPU time given

to each VM is based on its fractional share of the total number of shares of active VMs

in the whole system.

The term share is used to define a portion of the system’s CPU resources that is

allocated to a VM. If a greater number of CPU shares is assigned to a VM, relative to

other VMs, then that VM receives more CPU resources from the scheduler. CPU shares

are not equivalent to percentages of CPU resources. Rather, shares are used to define

the relative weight of a CPU load in a VM in relation to CPU loads of other VMs.

The following formula shows how the scheduler calculates per-domain allocation of

CPU resources.

\[
\mathrm{Allocation}_{\mathrm{domain}_i} = \frac{\mathrm{Shares}_{\mathrm{domain}_i}}{\mathrm{TotalShares}}
\]

The ESX scheduler allows specifying minimum (reservation) and maximum (limit) CPU

utilization for each virtual machine. A minimum CPU reservation guarantees that a

virtual machine always has this minimum percentage of a physical CPU’s time

allocated to it, regardless of the total number of shares. A maximum CPU limit ensures

that the virtual machine never uses more than this maximum percentage of a physical

CPU’s time, even if extra idle time is available. The proportional-share algorithm is only

applied if the VM CPU utilization falls within the range of reservation and limit CPU

utilization. Figure 33 shows how CPU resource allocation is calculated.
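As a worked example of the share formula combined with reservation and limit, consider the following sketch. It is a simplification and not the ESX scheduler: the function name `cpu_allocation_mhz` and its parameters are hypothetical, capacity is expressed in MHz as in Figure 33, and capacity freed by a limit is not redistributed to the other VMs.

```python
from typing import Dict, Optional


def cpu_allocation_mhz(
    shares: Dict[str, int],
    total_mhz: float,
    reservation: Optional[Dict[str, float]] = None,
    limit: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Proportional-share allocation clamped to each VM's [reservation, limit] range."""
    reservation = reservation or {}
    limit = limit or {}
    total_shares = sum(shares.values())
    allocation = {}
    for vm, vm_shares in shares.items():
        # Allocation_i = Shares_i / TotalShares, scaled to the machine's capacity.
        proportional = total_mhz * vm_shares / total_shares
        floor = reservation.get(vm, 0.0)      # guaranteed minimum
        ceiling = limit.get(vm, total_mhz)    # hard maximum
        allocation[vm] = min(max(proportional, floor), ceiling)
    return allocation
```

For example, with shares of 2000, 1000, and 1000 on a 4000 MHz host, the first VM is entitled to half of the capacity; a 1500 MHz limit on that VM caps it below its proportional entitlement.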

Figure 33. Calculation of CPU resources in VMware.

In an SMP environment in which a VM could have more than one virtual CPU (VCPU), a

scalability issue arises when one VCPU is spinning on a lock held by another VCPU that

gets de-scheduled. The spinning VCPU wastes CPU cycles spinning on the lock until the

lock owner VCPU is finally scheduled again and releases the lock.

ESX implements co-scheduling to work around this problem. In co-scheduling (also

called gang scheduling), all virtual processors of a VM are mapped one-to-one onto the

underlying processors and simultaneously scheduled for an equal time slice. The ESX

scheduler guarantees that no VCPUs are spinning on a lock held by a VCPU that has

been preempted.

However, co-scheduling does introduce other problems. Because all VCPUs are

scheduled at the same time, co-scheduling activates a VCPU regardless of whether

there are jobs in the VCPU's run queue. Co-scheduling also precludes multiplexing

multiple VCPUs on the same physical processor.
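The all-or-nothing nature of co-scheduling can be shown with a minimal sketch. The greedy selection policy and the name `co_schedule` are assumptions for illustration; the actual ESX Server selection algorithm is not described in this paper.

```python
from typing import Dict, List


def co_schedule(vcpus_per_vm: Dict[str, int], physical_cpus: int) -> List[str]:
    """Pick the VMs that run in the next time slice under strict co-scheduling.

    A VM runs only if every one of its VCPUs can be placed on its own
    physical CPU for the slice; VCPUs are never multiplexed on one CPU.
    """
    free = physical_cpus
    running = []
    for vm, vcpus in vcpus_per_vm.items():
        if vcpus <= free:   # all VCPUs fit: schedule the whole gang
            running.append(vm)
            free -= vcpus
        # otherwise the VM waits, even if some of its VCPUs would fit
    return running
```

On a four-CPU host, a two-VCPU VM followed by a four-VCPU VM and a one-VCPU VM results in the first and third being dispatched: the four-VCPU VM must wait for a slice in which four physical CPUs are free at once.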

[Figure 33 diagram labels: a CPU utilization scale from 0 MHz to Total MHz, with Reservation and Limit marked; the CPU utilization range where proportional-share is applied.]

Timer Services

Like Sun xVM Server, ESX Server faces the issue of getting clock interrupts

delivered to VMs at the configured interval [16]. This issue arises because the VM may

not be scheduled when interrupts are due to be delivered. ESX Server keeps track of the

clock interrupt backlog and tries to deliver clock interrupts at a higher rate when the

backlog gets large. However, the backlog can get so large that it is not possible for the

GOS to catch up with real time. In such cases, ESX Server stops attempting to catch

up if the clock interrupt backlog grows beyond 60 seconds. Instead, ESX Server sets its

record of the clock interrupt backlog to zero and synchronizes the GOS clock with the

host machine clock.

ESX Server virtualizes the Time Stamp Counter (TSC) so that the virtualized TSC

matches the GOS clock (see “Time Stamp Counter (TSC)” on page 28). When the

clock interrupt backlog is cleared due to catching up or due to reset when the backlog is

too large, the virtualized TSC catches up with the adjusted clock.
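The backlog handling described in this section can be modeled with integer tick counts. The sketch below is an assumption-laden illustration: the `VirtualClock` class, the `max_burst` parameter (how many ticks may be delivered in one scheduling opportunity), and the tick-based bookkeeping are hypothetical; only the 60-second threshold comes from the text.

```python
class VirtualClock:
    """Hypothetical model of clock interrupt backlog tracking for one VM."""

    def __init__(self, tick_hz: int, reset_seconds: int = 60) -> None:
        self.reset_ticks = reset_seconds * tick_hz  # backlog threshold, in ticks
        self.delivered_ticks = 0                    # ticks the GOS has seen so far

    def on_scheduled(self, host_ticks: int, max_burst: int) -> int:
        """Called when the VM gets to run; returns how many ticks to deliver now."""
        backlog = host_ticks - self.delivered_ticks
        if backlog > self.reset_ticks:
            # Too far behind: give up catching up, clear the backlog,
            # and resynchronize the guest clock with the host clock.
            self.delivered_ticks = host_ticks
            return 0
        # Otherwise deliver ticks at a higher rate, bounded by max_burst.
        burst = min(backlog, max_burst)
        self.delivered_ticks += burst
        return burst
```

With a 100 Hz guest timer, a backlog of 50 ticks is worked off over successive scheduling opportunities, while a backlog of 7000 ticks (70 seconds) triggers a resynchronization instead.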

VMware Memory Virtualization

Similar to Sun xVM Server, memory virtualization in VMware ESX Server deals with two

memory management issues: physical memory management and page translations.

Physical Memory Management

Similar to Sun xVM Server, ESX Server virtualizes a VM's physical memory by adding an

extra level of address translation when mapping a VM's physical memory pages to the

physical memory pages on the underlying machine. Also like Sun xVM Server, the

underlying physical pages are referred to as machine pages, and the VM's physical

pages as physical pages. Each VM sees a contiguous, zero-based, addressable physical

memory space whereas the underlying machine memory used by each virtual machine

may not be contiguous.

ESX Server manages physical memory allocation and reclamation, similar to Sun xVM

Server, by using the memory ballooning technique. More detailed information on how

the memory ballooning technique manages the physical memory allocation and

reclamation is included in [5].

Page Translations

Each GOS in the ESX Server maintains page tables for virtual-to-physical address

mappings. The VMM also maintains shadow page tables for the virtual-to-machine

page mappings along with physical-to-machine mappings in its memory. The

processor's MMU uses the VMM's shadow page table. When a GOS updates its page

tables with a virtual-to-physical translation, the VMM intercepts the instruction, gets

the physical-to-machine mapping from its memory, and loads the shadow page table

with the virtual-to-machine mapping. This mechanism allows normal memory accesses

in the VM to execute without adding address translation overhead if the shadow page

tables are set up for that access.
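The relationship between the three mappings can be sketched as follows. The `ShadowMmu` class and its method names are hypothetical, and page tables are reduced to flat dictionaries keyed by page number; real page tables are multi-level structures walked by hardware.

```python
from typing import Dict


class ShadowMmu:
    """Hypothetical model of shadow page table maintenance for one VM."""

    def __init__(self, physical_to_machine: Dict[int, int]) -> None:
        self.p2m = dict(physical_to_machine)   # physical page -> machine page (VMM)
        self.guest_pt: Dict[int, int] = {}     # virtual page -> physical page (GOS view)
        self.shadow_pt: Dict[int, int] = {}    # virtual page -> machine page (used by MMU)

    def guest_map(self, virtual_page: int, physical_page: int) -> None:
        """Intercepted GOS page table update."""
        self.guest_pt[virtual_page] = physical_page
        # The VMM composes the guest's mapping with its own physical-to-machine
        # mapping and loads the result into the shadow page table.
        self.shadow_pt[virtual_page] = self.p2m[physical_page]

    def translate(self, virtual_page: int) -> int:
        """What the MMU does on a memory access: one lookup, no extra level."""
        return self.shadow_pt[virtual_page]
```

Once `guest_map` has populated the shadow entry, `translate` resolves a virtual page directly to a machine page, which is why ordinary memory accesses in the VM incur no additional translation overhead.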

VMware I/O Virtualization

Every VM is configured with a set of standard PC virtual devices: PS/2 keyboard and

mouse, IDE controller with ATA disk and ATAPI CD-ROM, serial port, parallel port, and

sound chip [20]. In addition, ESX Server provides virtual PCI emulation for PCI add-on

devices such as SCSI, Ethernet, and SVGA graphics (see Figure 30 on page 98).

The device tree as exported by the VMM to a GOS is shown in the following

prtconf(1M) output.

    % prtconf
    System Configuration: Sun Microsystems i86pc
    Memory size: 1648 Megabytes
    System Peripherals (Software Nodes):

    i86pc
        scsi_vhci, instance #0
        isa, instance #0
            i8042, instance #0
                keyboard, instance #0
                mouse, instance #0
            lp (driver not attached)
            asy, instance #0 (driver not attached)
            asy, instance #1 (driver not attached)
            fdc, instance #0
                fd, instance #0
        pci, instance #0
            pci15ad,1976 (driver not attached)
            pci8086,7191, instance #0
            pci15ad,1976 (driver not attached)
            pci-ide, instance #0
                ide, instance #0
                    sd, instance #16
                ide (driver not attached)
            pci15ad,1976 (driver not attached)
            display, instance #0
            pci1000,30, instance #0
                sd, instance #0
            pci15ad,750, instance #0
        iscsi, instance #0
        pseudo, instance #0
        options, instance #0
        agpgart, instance #0 (driver not attached)
        xsvc, instance #0
        objmgr, instance #0
        acpi (driver not attached)
        used-resources (driver not attached)
        cpus (driver not attached)
            cpu, instance #0 (driver not attached)
            cpu, instance #1 (driver not attached)


The PCI vendor ID of VMware is 15ad. The following entries are relevant to VMware I/O

virtualization:

| Device Entry | Description |
|---|---|
| pci15ad,750 | VMware emulation of Intel's 100FX Gigabit Ethernet |
| pci15ad,1976 | VMware emulation of the Intel 440BX/ZX PCI bridge chip |
| pci1000,30 | The LSI Logic 53C1020/1030 SCSI controller |
| display | VMware virtual SVGA |

For the example device tree shown here, the Solaris OS binds the e1000g driver to

pci15ad,750 and uses e1000g as the network driver. The actual network hardware

used on the system is a Broadcom NetXtreme Dual Gigabit Adapter with the PCI ID

pci14e4,1468. VMware translates the e1000g device accesses issued by the

Solaris e1000g driver and forwards them to the Broadcom NetXtreme device.

For storage, unlike Sun xVM Server, ESX Server continues to use sd as the interface to

file systems. The emulation of the disk interface is provided at the SCSI bus adapter

interface (LSI Logic SCSI controller) instead of at the SCSI target interface (SCSI disk sd).

Device Emulation

Each storage device, regardless of the specific adapter, appears as a SCSI drive

connected to an LSI Logic SCSI adapter within the VM. For network I/O, ESX Server

emulates an AMD Lance/PCNet or Intel E1000 network device, or uses a custom interface

called vmxnet for the physical network adapter.

VMware provides device emulation rather than the I/O emulation used by Sun xVM

Server and UltraSPARC LDoms (see “I/O Virtualization” on page 16). In a simple scenario,

consider an application within the VM making an I/O request to the GOS, as illustrated

in Figure 34:


Figure 34. Sequence of events for applications making an I/O request.

1. Applications perform I/O operations through the interface to the device as

exported by the VMware VMM (see “VMware I/O Virtualization” on page 103). The

virtual device interface uses the native drivers (for example, the e1000g for

network and mpt for the LSI SCSI HBA) in the Solaris kernel.

2. The Solaris native driver attempts to access the device via the IN/OUT instructions

(for example, by writing a DMA descriptor to the device's DMA engine).

3. The VMM intercepts the I/O instructions and then transfers control to the device-

independent module in the VMkernel for handling the I/O request.

4. The VMkernel converts the I/O request from the emulated device to one for the

real device, and sends the converted I/O request to the driver for the real device.

5. The VMware driver sends the I/O request to the real I/O device.

6. When an I/O request completion interrupt (for example, DMA completion

interrupt) arrives, the VMkernel device driver receives and processes the interrupt.

7. The VMkernel then notifies the VMM of the target virtual machine, which copies

data to the VM memory and then raises the interrupt to the GOS.

8. The Solaris driver’s interrupt service routine (ISR) is called.

9. The Solaris driver performs a sequence of I/O accesses (for example, reads the

transaction status, acknowledges receipt) to the I/O ports before passing the data

to its applications.

The VMkernel ensures that data intended for each virtual machine is isolated from

other VMs.

[Figure 34 diagram labels: Guest Application; VMware Supported Virtual Device Interface; Solaris Native Drivers; VMM; I/O Access Handler; Device Emulation; Device Independent Module; VMkernel; Real Device Driver; Hardware Interface Layer; Sun x64 Server; I/O Device; step markers 1 through 9.]


Section III

Additional Information

• Appendix A: VMM Comparison

• Appendix B: References

• Appendix C: Terms and Definitions

• Appendix D: Author Biography


Appendix A

VMM Comparison

This appendix presents a summary comparison of the four virtual machine monitors

discussed in this paper: Sun xVM Server without HVM, Sun xVM Server with HVM,

VMware, and Logical Domains (LDoms). Table 10 summarizes their general

characteristics; provides information on their CPU, Memory, and I/O virtualization

implementation; and lists the management options available for each.

Table 10. Comparison of virtual machine monitors discussed in this paper.

| General | Sun xVM Server w/o HVM | Sun xVM Server w/HVM | VMware | LDoms |
|---|---|---|---|---|
| VMM version | 3.0.4 | 3.0.4 | ESX 3.0.1 | LDoms 1.0.1 |
| Supported ISA | x86 and IA-64 | x86 and IA-64 | x86 | UltraSPARC T1/T2 |
| VMM Layer | Run on bare metal | Run on bare metal | Run on bare metal | Firmware |
| Virtualization Scheme | Paravirtualization | Full | Full | Paravirtualization |
| Supported GOS | Linux, NetBSD, FreeBSD, OpenBSD, Solaris | Linux, NetBSD, FreeBSD, OpenBSD, Windows | Windows, Linux, Netware, Solaris | Solaris, Linux |
| SMP GOS | Yes | Yes | Yes | Yes |
| 64-bit GOS | Yes | Yes | Yes | Yes |
| Max VMs | Limited by memory | Limited by memory | Limited by memory | 32 on UltraSPARC T1; 64 on UltraSPARC T2 |
| Method of operation | Modified GOS | Hardware Virtualization | Binary Translation | Modified OS |
| License | GPL (free) | GPL (free) | Proprietary | CDDL (Free) |

| CPU | Sun xVM Server w/o HVM | Sun xVM Server w/HVM | VMware | LDoms |
|---|---|---|---|---|
| CPU scheduling | Credit | Credit | Fair Share | N/A |
| VMM Privilege Mode | Privileged (ring 0) | Privileged (ring 0) | Privileged | Hyperprivileged |
| GOS Privileged Mode | Unprivileged (ring 3 for 64-bit kernel; ring 1 for 32-bit kernel) | Reduced privileged | Deprivileged | Privileged |
| CPU Granularity | Fractional | Fractional | Fractional | 1 strand |
| Interrupt | Queued and delivered when the VM is scheduled to run | Queued and delivered when the VM is scheduled to run | Queued and delivered when the VM is scheduled to run | Deliver directly to the VM |

| Memory | Sun xVM Server w/o HVM | Sun xVM Server w/HVM | VMware | LDoms |
|---|---|---|---|---|
| Page Translation | Hypercall to VMM | Shadow page | Shadow page | Hypercall to VMM |
| Physical Memory Allocation | Balloon driver | Balloon driver | Balloon driver | Hard Partition |
| Page Tables | Managed by VMM | Managed by VMM | Managed by VMM | Managed by GOS |

| I/O | Sun xVM Server w/o HVM | Sun xVM Server w/HVM | VMware | LDoms |
|---|---|---|---|---|
| I/O Granularity | Shared | Shared | Shared | PCI bus |
| I/O Virtualization | I/O emulation by Dom0 | Device emulation by QEMU or I/O emulation by Dom0 | Device emulation by vmkernel | I/O emulation by I/O domain |
| Device drivers | Virtual driver on DomU, native driver on Dom0 | Native driver on DomU and Dom0 (QEMU) | Native driver on guest supported by the VMM | Virtual driver on non I/O domain and native driver on I/O domain |

| Management | Sun xVM Server w/o HVM | Sun xVM Server w/HVM | VMware | LDoms |
|---|---|---|---|---|
| Management Model | Dom0 - SPOF | Dom0 - SPOF | Service console - SPOF | Control domains |
| Interface | CLI: xm(1M); GUI: virt-manager | CLI: xm(1); GUI: virt-manager | GUI: Virtual Center | CLI: ldm(1M), XML, and SNMP MIBs |


Appendix B

References

1. Popek, Gerald J. and Goldberg, Robert P. “Formal Requirements for Virtualizable

Third Generation Architectures,” Communications of the ACM 17 (7), pages 412-

421, July 1974.

2. UltraSPARC Architecture 2005: One Architecture.... Multiple Innovative

Implementations, Draft D0.9, 15 May 2007.

3. Robin, John Scott and Irvine, Cynthia E. “Analysis of the Intel Pentium's Ability to

Support a Secure Virtual Machine Monitor,” Proceedings of the 9th USENIX

Security Symposium, August 2000.

4. VMware: http://www.vmware.com/vinfrastructure/

5. Waldspurger, Carl A. “Memory Resource Management in VMware ESX Server,”

Proceedings of the 5th Symposium on Operating Systems Design and

Implementation, Dec. 2002.

6. Xen, “The Xen virtual machine monitor,” University of Cambridge Computer

Laboratory: http://www.cl.cam.ac.uk/research/srg/netos/xen/

7. IA-32 Intel Architecture Software Developer's Manual, March 2006.

8. System V Application Binary Interface AMD64 Architecture Processor Supplement

Draft Version 0.98, September 27, 2006. http://www.x86-64.org/documentation/abi.pdf

9. AMD64 Architecture Programmer’s Manual, Volume 2: System Programming, Rev.

3.12, September 2006.

10. OpenSPARC T1 Microarchitecture Specification, Revision A, August 2006.

11. UltraSPARC Virtual Machine Specification (The sun4v architecture and Hypervisor

API specification), Revision 1.0, January 24, 2006.

12. Devine, Scott W.; Bugnion, Edouard; Rosenblum, Mendel. “Virtualization system

including a virtual machine monitor for a computer with a segmented

architecture,” U.S. Patent 6,397,242, October 26, 1998.

13. Cmelik, Robert F. and Keppel, David. “Shade: A Fast Instruction Set Simulator for

Execution Profiling,” ACM SIGMETRICS Performance Evaluation Review, pages 128-

137, May 1994.

14. Witchel, Emmett and Rosenblum, Mendel. “Embra: Fast and Flexible Machine

Simulation,” The Proceedings of ACM SIGMETRICS '96: Conference on

Measurement and Modeling of Computer Systems, 1996.

15. Adams, Keith and Agesen, Ole. “A Comparison of Software and Hardware

Techniques for x86 Virtualization,” ASPLOS 2006, San Jose, CA, USA, October 21-25,

2006.


16. “Timekeeping in VMware Virtual Machines,” VMware white paper, August 2005.

17. Bittman, T. “Gartner RAS Core Strategic Planning SPA-21-5502, Research Note 14,”

November 2003.

18. Rosenblum, Mendel; Herrod, Stephen A.; Witchel, Emmett; and Gupta, Anoop.

“Complete Computer Simulation: The SimOS Approach,” IEEE Parallel and

Distributed Technology, pages 34-43, Winter 1995.

19. “VMware ESX Server 2 Architecture and Performance Implication,” VMware white

paper, 2005.

20. Sugerman, Jeremy; Venkitachalam, Ganesh; and Lim, Beng-Hong. “Virtualizing I/O

Devices on VMware Workstation’s Hosted Virtual Machine Monitor,” Proceedings

of the 2001 USENIX Annual Technical Conference, Boston, Massachusetts, USA,

June 25-30, 2001.

21. System Administration Guide: Solaris Containers-Resource Management and

Solaris Zones, Part No: 817-1592-14, June 2007.

22. Drakos, Nikos; Hennecke, Marcus; Moore, Ross; and Swan, Herb. Xen Interface

manual: Xen v3.0 for x86.

23. Bochs IA-32 Emulator Project: http://bochs.sourceforge.net/

24. QEMU, Open Source Processor Emulator: http://fabrice.bellard.free.fr/qemu/

25. IEEE 1275-1994 Open Firmware: http://playground.sun.com/1275/

26. PCI Bus Binding to IEEE std. 1275-1994, Rev 2.1 August 29, 1998.

27. TAP — a Virtual Ethernet network device: http://vtun.sourceforge.net/tun/

28. Intel Virtualization Technology for Directed I/O Architecture Specification, May

2007, Order Number: D51397-002.

29. Shadow2 presentation at Xen Technical Summit, Summer 2006: http://www.xensource.com/files/summit_3/XenSummit_Shadow2.pdf

30. PCI SIG, “Address Translation Services,” Revision 1.0, March 8, 2007.

31. AMD I/O Virtualization Technology (IOMMU) Specification, Revision 1.20,

Publication# 34434, February 2007.

32. Jun Nakajima, Asit Mallick, Ian Pratt, Keir Fraser, “x86-64 XenLinux: Architecture,

Implementation, and Optimizations,” Proceedings of the Linux Symposium, July

19-22 2006. Ontario, Canada.

33. OpenSPARC T2 Core Microarchitecture Specification, July 2007, Revision 5.

34. UltraSPARC Architecture 2007, Hyperprivileged, Privileged, and Nonprivileged,

Draft D0.91, Aug 2007.

35. PCI SIG, “Single Root I/O Virtualization and Sharing Specification,” Revision 1.0,

September 11, 2007.


Appendix C

Terms and Definitions

Hardware level virtualization introduces several terms that are used throughout this

document. The following terms are defined in the context of hardware-level

virtualization.

Balloon driver
A method for dynamic sharing of physical memory among VMs [5].

Binary Translation
In computing, binary translation [13] [14] usually refers to the emulation of one instruction set by another through translation of instructions to allow software programs (e.g., operating systems and applications) written for a particular processor architecture to run on another. In the context of VMware products, binary translation refers to the conversion of one set of instruction sequences that belongs to a VM and has been deprivileged, to another set of instruction sequences that can run in a privileged VMM segment. VMware uses binary translation [12] [15] to provide full virtualization of the x86 processor.

Domain
A running virtual machine within which a guest OS runs. Domain and virtual machine are used interchangeably in this document.

Full Virtualization
Full virtualization is an implementation of a virtual machine that does not require the guest OS to be modified to run in the VM. The techniques used for full virtualization can be a dynamic translation of software programs running in a VM (e.g., VMware products), or providing a complete emulation of the underlying processor (e.g., Xen with Intel-VT or AMD-V).

Guest Operating Systems (GOS)
A GOS is one of the OSes that the VMM can host in a VM. The relationship between VMM, VM, and GOS is analogous to the relationship between, respectively, OS, process, and program.

Hardware Level Virtualization
Hardware Level Virtualization is the technique of using a thin layer of software to abstract the system hardware resources for creating multiple instances of virtual execution environments, each of which runs a separate instance of an operating system.

Hardware Thread
See strand.

HVM
Hardware Virtual Machine, also known as hardware-assisted virtualization.

Hypervisor
Hypervisor is another term for VMM. Hypervisor is an extension of the term supervisor, which was commonly applied to the operating system kernel.

Logical Domains (LDoms)
Logical domains are Sun's implementation of hardware level virtualization based on the UltraSPARC T1 processor technology. LDom technology allows multiple domains to be created on one processor; each domain runs an instance of an OS supported by one or more strands.

Operating System Level Virtualization
OS Level Virtualization is provided by an OS by virtualizing its services to allow multiple and separate operating environments to be created for applications. The services virtualized by the OS include the file system, devices, networking, security, and Inter Process Communication (IPC).

Pacifica
AMD's implementation of Hardware Virtualization, also known as AMD-V or AMD SVM.

Paravirtualization
Paravirtualization is an implementation of a virtual machine that requires the guest OS to be modified to run in the VM. Paravirtualization provides partial emulation of the underlying hardware to a VM and requires the guest OS to replace all sensitive instructions and pass control to the VMM for handling these operations.

Privileged Instructions
Privileged instructions are those that result in a trap if the processor is running in user mode and do not result in a trap if the processor is running in supervisor mode.

Secure Virtual Machine (SVM)
AMD's implementation of Hardware Virtualization, also known as Pacifica or AMD-V (see [9] Chapter 15).

Sensitive Instructions
Sensitive instructions [1] [12] are those that change the configuration of resources (memory), affect the processor mode without going through the memory trap sequence (page fault), or whose behavior changes with the processor mode or the contents of the relocation register. If sensitive instructions are a subset of privileged instructions, it is relatively easy to build a VM because all sensitive instructions will result in a trap and the underlying VMM can process the trap and emulate the behavior of these sensitive instructions. If some sensitive instructions are not privileged instructions, special measures have to be taken to handle these sensitive instructions.

Shadow PageA technique for hiding the layout of machine memory from a virtual machine's operating system. A virtual page table is presented to the guest OS by the VMM, but not connected to the processor's memory management unit. The VMM is responsible for trapping accesses to the table, validating updates and maintaining consistency with the real page table that is visible to the processor MMU. Shadow page is typically used to provide full virtualization to a VM.

Simple Earliest Deadline First (sEDF) One of the scheduling algorithms used in Sun xVM Hypervisor for x86 for scheduling domains. See section “CPU Scheduling” on page 48 for a detailed description of sEDF.

Strand
Strand [2] refers to the state that hardware must maintain in order to execute a software thread. Specifically, a strand is the software-visible state (PC, NPC, general-purpose registers, floating-point registers, condition codes, status registers, ASRs, etc.) of a thread plus any microarchitecture state required by hardware for its execution. Strand replaces the ambiguous term hardware thread. The number of strands in a processor defines the number of threads that an operating system can schedule on that processor at any given time.

Sun xVM Hypervisor for x86
Sun xVM Hypervisor for x86 is the VMM of the Sun xVM Server.

Sun xVM Infrastructure
Sun Cross Virtualization and Management Infrastructure is a complete solution offering for virtualizing and managing the data center. It consists of Sun xVM Server and Sun xVM Ops Center.

Sun xVM Ops Center
Sun xVM Ops Center is the management suite for the Sun xVM Server.


Sun xVM Server
Sun xVM Server is a paravirtualized Solaris OS that includes support for the Xen open source community work on the x86 platform and support for LDoms on the UltraSPARC T1/T2 platform. In this paper, Sun xVM Server specifically refers to the Sun xVM Server for the x86 platform.

Vanderpool
Intel's implementation of hardware virtualization, also known as Intel-VT.

Virtual CPU (VCPU)
A VCPU is an entity that can be dispatched by the scheduler of a guest OS. For LDoms on UltraSPARC processors, a VCPU is also known as a strand, hardware thread, or logical processor.

Virtual Machine (VM)
A virtual machine is a discrete execution environment that abstracts computer platform resources to an operating system. Each virtual machine runs an independent and separate instance of an operating system. Popek and Goldberg [1] also define a VM as an “efficient, isolated duplicate of a real machine.”

Virtual Machine Monitor (VMM)
The VMM is a software layer that runs directly on top of the hardware and virtualizes all resources of the computer system. The VMM layer is situated between VMs and hardware resources. The VMM abstracts hardware resources to VMs and performs privileged and sensitive actions on behalf of the VMs.

Virtualization Technology (VT)
Intel's implementation of hardware virtualization, also known as Vanderpool.

Xen
Xen is an open source VMM for x86, IA-64, and PPC [6].

Appendix D

Author Biography

Chien-Hua Yen is currently a senior staff engineer in the ISV engineering group at Sun. Before joining Sun more than 12 years ago, he worked at several Silicon Valley companies as a software development engineer on Unix file systems, real-time embedded systems, and device drivers. His first job at Sun was with the kernel I/O group, developing a kernel virtual memory segment driver for device memory mapping. After the kernel group, he worked with third-party hardware vendors on developing PCI drivers for the Solaris OS and high availability products for the Sun CompactPCI board. In the last two years, Chien-Hua has been working with ISVs on application performance tuning, Solaris 10 adoption, and Solaris virtualization.

Acknowledgements

The author would like to thank Honlin Su, Lodewijk Bonebakker, Thomas Bastian, Ray Voight, and Joost Pronk for their invaluable comments; Patric Change for his encouragement and support; Suzanne Zorn for her editorial work; and Kemer Thompson for his constructive comments and his coordination of the reviews.

Solaris Operating System Hardware Virtualization Product Architecture On the Web sun.com

Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com

© 2007 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, Java, JVM, Solaris, and Sun BluePrints are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon architecture developed by Sun Microsystems, Inc. Information subject to change without notice. Printed in USA 11/07
