This is a 'Paper Reading' post for Course ECE1759. The topic is 'Virtualization'. The paper list is:

Edouard Bugnion, Scott Devine, and Mendel Rosenblum, Disco: Running Commodity Operating Systems on Scalable Multiprocessors, Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles (SOSP), October 1997, Saint Malo, France.

Carl A. Waldspurger, Memory Resource Management in VMware ESX Server, In Proceedings of 5th Symposium on Operating Systems Design and Implementation (OSDI), Dec. 2002

Keith Adams and Ole Agesen, A Comparison of Software and Hardware Techniques for x86 Virtualization, In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, October 2006.

[Optional reading] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, Xen and the Art of Virtualization, In Proceedings of the 19th Symposium on Operating System Principles, October, 2003.

Disco: Running Commodity Operating Systems on Scalable Multiprocessors

Introduction

Virtual machine monitors were a popular idea in the 1970s.
Disco uses virtual machines to run multiple commodity operating systems on a scalable multiprocessor.
Features:
- overhead is small
- provides scalability
- able to deal with non-uniform memory access (NUMA) times
Problem: system software for scalable machines is complex, and hardware innovations require extensive OS changes. -> System software trails hardware in functionality and reliability. -> Late, incompatible, and possibly even buggy system software.
Solution: insert an additional layer of software between the hardware and the operating system: a virtual machine monitor, on which multiple OSes can run on a single scalable computer.
Features compared with traditional VM monitors:
- minimizes the overhead of virtual machines
- enhances the resource sharing between virtual machines running on the same system
- allows OSes in different VMs to communicate through NFS or TCP/IP
- allows efficient sharing of memory and disk
- has a global buffer cache shared by VMs.
The basic overhead of virtualization is at most 16%.
Page placement, dynamic page migration, and page replication allow Disco to hide the NUMA-ness of the memory system.

Core Ideas

A Return to Virtual Machine Monitors

The VMM supports two possible solutions for applications whose resource needs exceed the scalability of commodity operating systems:
- Modify commodity OS to support sharing memory regions across virtual machine boundaries
- Use specialized operating systems for resource-intensive applications
An OS failure is contained within its VM instead of spreading to the rest of the machine.
NUMA memory management issues can also be handled by the monitor, effectively hiding the entire problem from the operating systems.
Multiple versions of an OS can run on a multiprocessor simultaneously.

Challenges Facing Virtual Machines

Overheads

Privileged instructions must be emulated by the monitor, and I/O must be intercepted and remapped by the monitor.
VMs cost extra memory.
- Large memory structures such as the file system buffer cache are replicated in each VM, resulting in an increase in memory usage.
- Replicating file systems for each VM wastes disk space.

Resource Management

The monitor lacks the information needed to make good policy decisions.
- The monitor must make resource management decisions without the high-level knowledge that an operating system would have.

Communication and Sharing

Previously, files could not easily be shared among VMs.

Disco: A Virtual Machine Monitor

Disco's Interface

Processors

To match what the guest OS expects, the VMM must emulate a processor, including its instructions, memory management unit, and trap architecture.

Physical Memory

Disco provides an abstraction of main memory
- residing in a contiguous physical address space starting at address zero.
Dynamic page migration and replication are used to make a NUMA machine look like a UMA machine.

I/O Devices

Disco intercepts all I/O operations in order to translate or emulate them.
Disco exports special abstractions for the SCSI disk and network devices.
- Disco virtualizes disks by providing a set of virtual disks that any virtual machine can mount.
- the monitor virtualizes access to the networking devices of the underlying system.
  - Each virtual machine is assigned a distinct link-level address on an internal virtual subnet handled by Disco.

Implementation of Disco

Careful attention has been given to NUMA memory placement, cache-aware data structures, and interprocessor communication patterns.
- Example: Disco avoids linked lists, which have poor cache behavior.
To improve NUMA locality:
- The Disco code is replicated into the memory of every node, so that Disco's instruction cache misses can be satisfied locally.
- Machine-wide data structures are partitioned so that each part resides on the node that accesses it exclusively or most often.
For shared data structures:
- Few locks are used, and wait-free synchronization using the MIPS LL/SC instruction pair is heavily employed.

Virtual CPUs

Disco emulates the execution of the virtual CPU by using direct execution on the real CPU.
Operations that cannot be safely exported to the virtual machine must be emulated:
- privileged instructions performed by the operating system, such as TLB modification, and direct access to physical memory and I/O devices
For each virtual CPU:
- Disco keeps a data structure, similar to a process table entry, that contains the saved registers and other state of the virtual CPU.
- To emulate privileged instructions, Disco additionally maintains the privileged registers and TLB contents of the virtual CPU.
Disco contains a simple scheduler that allows the virtual processors to be time-shared across the physical processors of the machine.

Virtual Physical Memory

Virtual addresses -> physical addresses -> machine addresses
Disco keeps a pmap data structure for each VM.
- Each pmap entry contains a pre-computed TLB entry that references the machine page backing that physical page.
On a normal MIPS processor
- all user mode memory references need to be translated
- kernel mode references can directly access physical memory and I/O devices through the unmapped segment of the kernel virtual address space
- Such unmapped accesses bypass the TLB, so Disco cannot intercept and translate them. To solve the problem:
  - The operating system code and data are relinked to a mapped region of the address space.
Disco flushes the machine's TLB when scheduling a different virtual CPU on a physical processor.
Overhead:
- More TLB misses, because the OS itself now runs mapped and the TLB is flushed on every virtual CPU switch.
- Slower TLB misses, because the miss handling has to be emulated.
To mitigate the overhead, Disco caches recent virtual-to-machine translations in a second-level software TLB.
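
The two-step mapping and the second-level software TLB can be sketched as follows. This is a minimal illustration, not Disco's code: the class `VirtualCpu`, its dictionaries, and the 4 KB page size are assumptions made for the example, and a genuine guest TLB miss (which would be forwarded to the guest OS) is not modeled.

```python
PAGE_SIZE = 4096  # assumed page size for the example


class VirtualCpu:
    def __init__(self, guest_tlb, pmap):
        self.guest_tlb = guest_tlb  # virtual page -> physical page (set by the guest OS)
        self.pmap = pmap            # physical page -> machine page (kept by the monitor)
        self.l2_tlb = {}            # software cache: virtual page -> machine page
        self.hw_tlb = {}            # stands in for the real hardware TLB
        self.slow_path_misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.hw_tlb:
            if vpn not in self.l2_tlb:
                # Slow path: combine the guest mapping with the pmap.
                self.slow_path_misses += 1
                self.l2_tlb[vpn] = self.pmap[self.guest_tlb[vpn]]
            self.hw_tlb[vpn] = self.l2_tlb[vpn]
        return self.hw_tlb[vpn] * PAGE_SIZE + offset

    def deschedule(self):
        # The hardware TLB is flushed on a virtual CPU switch,
        # but the second-level software TLB survives.
        self.hw_tlb.clear()
```

After a flush, a repeated access is refilled from `l2_tlb` without taking the slow path again, which is the saving the software TLB is meant to provide.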

NUMA Memory Management

Disco targets machines that maintain cache-coherence in hardware.
Disco uses a robust policy that moves only pages that will likely result in an eventual performance benefit:
- pages heavily used by one node are migrated to that node
- read-shared pages are replicated
- write-shared pages are not moved
- the number of times a page can move is limited to avoid excessive overhead
Hardware counts cache misses to each page from every physical processor.
Disco maintains a memmap data structure that contains an entry for each real machine memory page.
- the memmap entry contains a list of the virtual machines using the page and the virtual addresses used to access them. A memmap entry also contains pointers to any replicated copies of the page.
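
The policy above can be sketched as a small decision function. The function name, the miss-count threshold, and the move limit are hypothetical values chosen for illustration; the notes do not give Disco's actual parameters.

```python
def placement_action(misses_by_node, home_node, is_write_shared,
                     moves_so_far, hot_threshold=64, max_moves=4):
    """Return "leave", "replicate", or ("migrate", node) for one page.

    hot_threshold and max_moves are assumed values, not Disco's.
    """
    if sum(misses_by_node.values()) < hot_threshold:
        return "leave"                  # not hot enough to be worth moving
    users = [node for node, misses in misses_by_node.items() if misses > 0]
    if len(users) == 1:                 # page is used by a single node
        node = users[0]
        if node == home_node or moves_so_far >= max_moves:
            return "leave"
        return ("migrate", node)
    if is_write_shared:
        return "leave"                  # moving a write-shared page does not help
    return "replicate"                  # read-shared page


# Example: a hot page on node 1 that only node 0 misses on is migrated.
print(placement_action({0: 100, 1: 0}, home_node=1,
                       is_write_shared=False, moves_so_far=0))
```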

Virtual I/O Devices

Disco uses special device drivers in the guest instead of trapping and emulating accesses to real devices.
Each Disco device defines a monitor call used by the device driver to pass all command arguments in a single trap.
Devices such as disks and network interfaces include a DMA map as part of their arguments.
- Disco intercepts the DMA request, translates the physical addresses into machine addresses, and then interacts directly with the physical device.
Disco's copy-on-write disks allow virtual machines to share both main memory and disk storage resources.

Copy-on-write Disks

Disco intercepts every disk request that DMAs data into memory.
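- If the requested data is already in memory (a shared-data request), Disco maps the existing page read-only into the requesting VM instead of reading the disk again; a later write triggers copy-on-write.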
All the virtual machines can share the same root disk containing the kernel and application programs.
To preserve the isolation of the virtual machines, disk writes must be kept private to the virtual machine that issues them.
- Disco logs the modified sectors so that the copy-on-write disk is never actually modified.
- In the current implementation, copy-on-write is applied only to non-persistent disks.
Two data structures for memory and disk sharing:
- For each disk device:
  - maintains a B-Tree indexed by the range of disk sectors being requested. It's used to find the machine memory address of the sectors in the global disk cache.
- A second B-Tree is kept for each disk and VM to find any modifications to the block made by that VM.
COW is for disks whose modifications should be neither shared nor persistent.
A persistent disk containing user files can be mounted by only one VM at a time.
NFS can be used to share files between VMs.
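
A minimal sketch of the copy-on-write disk idea follows. It assumes plain dictionaries in place of the B-Trees mentioned above, and the class and attribute names are invented for the example.

```python
class CopyOnWriteDisk:
    def __init__(self, base_sectors):
        self.base = dict(base_sectors)  # on-disk contents, never modified
        self.global_cache = {}          # sector -> data, one copy shared by all VMs
        self.private = {}               # vm_id -> {sector: data} written by that VM
        self.disk_reads = 0

    def read(self, vm_id, sector):
        modified = self.private.get(vm_id, {})
        if sector in modified:          # the VM sees its own writes first
            return modified[sector]
        if sector not in self.global_cache:
            self.disk_reads += 1        # only the first reader goes to disk
            self.global_cache[sector] = self.base[sector]
        return self.global_cache[sector]

    def write(self, vm_id, sector, data):
        # Writes stay private to the issuing VM; the base disk is untouched.
        self.private.setdefault(vm_id, {})[sector] = data
```

Two VMs reading the same sector cause a single disk read, and a write by one VM is invisible to the other.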

Virtual Network Interface

A virtual subnet is designed to allow virtual machines to communicate with each other, while avoiding replicated data whenever possible.
The virtual subnet and networking interfaces of Disco also use copy-on-write mappings to reduce copying and to allow for memory sharing.
Message transfer between VMs -> instead of copying, Disco maps the page read-only into both the sending and receiving virtual machines' physical address spaces.
NFS file data can be shared between servers and clients.
- global buffer cache, built from:
  - copy-on-write disks
  - access to persistent data through the specialized network device
- all read-only pages can be shared between virtual machines.
  - to avoid remote accesses to a shared page, use replication

Running Commodity Operating Systems

Hardware abstraction level (HAL): a layer that allows the operating system to be ported effectively to new platforms.
- Typically the HAL of modern operating systems changes with each new version of a machine while the rest of the system can remain unchanged.
Most of the changes made to IRIX were in the HAL.

Necessary Changes for MIPS Architecture

These changes solve the direct-mapping problem of MIPS kernel mode.
- The unmapped segment of the virtual machine is relocated into a portion of the mapped supervisor segment of the MIPS processor.

Device Drivers

Disco-aware device drivers (UART, SCSI disk, and Ethernet) were written to use Disco's monitor call interface.

Changes to the HAL

Frequently used privileged instructions are converted into non-trapping load and store instructions to a special page of the address space that contains these registers.
- only applied to instructions that read and write privileged registers without causing other side effects
- reduces the number of traps for privileged register accesses
To help the monitor make better resource management decisions, the authors added code to the HAL that passes hints to the monitor, giving it higher-level knowledge of resource utilization.
- Monitor calls were inserted in the physical memory management module.
Disco adds semantics to the HAL for the idle (reduced power consumption) mode of MIPS processors. The operating system enters this mode whenever the system is idle, and Disco deschedules the virtual CPU until the mode is cleared or an interrupt is posted.

Other Changes to IRIX

The virtual network device can only take advantage of the remapping techniques if the packets contain properly aligned, complete pages that are not written.
- The IRIX mbuf management did not meet these requirements, so the data structure was modified.
A call to bcopy was specialized into a new remap function offered by the HAL, to avoid NFS copying data from mbufs into the buffer cache. With this change, data is shared rather than copied wherever possible.

SPLASHOS: A Specialized Operating System

The authors designed a small OS without virtual memory that defers all page-faulting responsibilities directly to Disco.
- The application is linked with the library operating system and runs in the same address space as the operating system.
- It shows that Disco suits small, specialized applications that do not require a full-function OS.

Reference

Memory Resource Management in VMware ESX Server

Introduction

Techniques:
- ballooning:
  - reclaims the pages considered least valuable by the operating system running in a virtual machine.
- idle memory tax
  - achieves efficient memory utilization while maintaining performance isolation guarantees
- content-based page sharing & hot I/O page remapping
  - exploit transparent page remapping to eliminate redundancy and reduce copying overheads.
- Combined, they efficiently support virtual machine workloads that overcommit memory.
In many computing environments, individual servers are underutilized.
VMware Workstation:
- uses a hosted virtual machine architecture.
- relies on a pre-existing host OS for portable I/O device support.
Requirement: run existing OSes without modification.

Core Ideas

Memory Virtualization

Like Disco, ESX Server maintains a pmap data structure for each VM to translate physical page numbers (PPNs) to machine page numbers (MPNs).
ESX can remap a PPN to a different MPN transparently to the guest.
ESX can monitor or interpose on guest memory accesses.

Reclamation Mechanisms

ESX Server supports overcommitment of memory:
- the total size configured for all running virtual machines exceeds the total amount of actual machine memory
Each VM is given the illusion of having a fixed amount of physical memory, its constant "max size".

Page Replacement Issues

Overcommitment needs a reclamation mechanism.
Earlier virtual machine systems:
- Introduce another level of paging, moving some VM physical pages to a swap area on disk.
- Problem: it is hard for the monitor to decide which pages to reclaim.

Ballooning

The balloon is a module loaded into the guest OS as a pseudo-device driver or kernel service.
It has no external interface within the guest, and communicates with ESX Server via a private channel.
To reclaim memory, the server inflates the balloon, which allocates pinned physical pages within the VM.
When memory is plentiful, the guest OS satisfies the allocation from its free list.
When memory is scarce, the guest must reclaim space to satisfy the driver's allocation request.
The guest OS decides which particular pages to reclaim and, if necessary, pages them out to its own virtual disk.
When a guest PPN is ballooned, the system annotates its pmap entry and deallocates the associated MPN.
- If a ballooned page is accessed again (which should not happen), a new MPN is allocated for the PPN.
Balloon drivers poll the server once per second to obtain a target balloon size, and limit their allocation rates to avoid stressing the guest OS.
Concerns:
- the driver may be uninstalled
- the driver may be disabled
- the driver is unavailable while the guest OS is booting
- the guest OS may impose an upper bound on the balloon size
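
The inflate and deflate behavior described in this section can be sketched with a toy guest model. Page counts stand in for real pages, and the `Guest` class and its fields are assumptions for illustration only.

```python
class Guest:
    def __init__(self, total_pages, used_pages):
        self.free = total_pages - used_pages  # pages on the guest free list
        self.resident = used_pages            # pages holding guest data
        self.paged_out = 0                    # pages the guest wrote to its own disk
        self.balloon = 0                      # pages pinned by the balloon driver

    def set_balloon_target(self, target):
        while self.balloon < target:          # inflate
            if self.free == 0:
                if self.resident == 0:
                    break                     # nothing left to reclaim
                # Memory is scarce: the guest's own replacement policy
                # picks a victim and pages it out to its virtual disk.
                self.resident -= 1
                self.paged_out += 1
                self.free += 1
            self.free -= 1
            self.balloon += 1                 # server can now reclaim this page's MPN
        while self.balloon > target:          # deflate
            self.balloon -= 1
            self.free += 1
        return self.balloon
```

With 30 free pages, a target of 20 is met from the free list alone; a target of 40 forces the guest to page out 10 pages of its own choosing.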

Demand Paging

When ballooning is not possible or insufficient, the system falls back to a paging mechanism.
- Memory is reclaimed by paging out to an ESX Server swap area on disk, without any guest involvement.
ESX Server uses a swap daemon and a higher-level policy module to manage swapping.
Currently, a randomized page replacement policy is used.

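Sharing Memory

Transparent Page Sharing

Transparent page sharing was introduced in Disco.
Problem: Disco required several guest OS modifications.

Content-Based Page Sharing

Basic idea: identify page copies by their contents.
Pages with identical contents can be shared regardless of when, where, or how those contents were generated.
Hashing is used to identify candidate pages.
- False matches are possible, so a hash match is followed by a full comparison of the page contents.
- Once a match with an existing shared page is found, the page is mapped copy-on-write.
Higher-level page sharing policies control when and where to scan for copies.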
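
The hash, compare, and copy-on-write steps can be sketched as below. This is a simplified model under several assumptions: SHA-1 from `hashlib` stands in for the hash function, dictionaries stand in for the frame table, and the `PageSharer` class and its method names are invented for the example.

```python
import hashlib


class PageSharer:
    def __init__(self):
        self.machine = {}   # MPN -> page contents
        self.refs = {}      # MPN -> number of guest pages mapped to it
        self.by_hash = {}   # content hash -> MPN of a candidate page
        self.pmap = {}      # (vm, PPN) -> MPN
        self.next_mpn = 0

    def _alloc(self, data):
        mpn, self.next_mpn = self.next_mpn, self.next_mpn + 1
        self.machine[mpn], self.refs[mpn] = data, 1
        return mpn

    def map_page(self, vm, ppn, data):
        self.pmap[(vm, ppn)] = self._alloc(data)

    def read(self, vm, ppn):
        return self.machine[self.pmap[(vm, ppn)]]

    def scan(self, vm, ppn):
        """Try to share one guest page; return True if a copy was eliminated."""
        mpn = self.pmap[(vm, ppn)]
        data = self.machine[mpn]
        digest = hashlib.sha1(data).digest()
        match = self.by_hash.setdefault(digest, mpn)
        if match == mpn or self.machine[match] != data:
            return False                      # first sighting, or a false hash match
        self.pmap[(vm, ppn)] = match
        self.refs[match] += 1
        del self.machine[mpn], self.refs[mpn]  # reclaim the redundant copy
        return True

    def write(self, vm, ppn, data):
        mpn = self.pmap[(vm, ppn)]
        if self.refs[mpn] > 1:                # shared: copy-on-write to a private page
            self.refs[mpn] -= 1
            self.pmap[(vm, ppn)] = self._alloc(data)
        else:                                 # private: update in place, drop stale hash
            self.by_hash = {h: m for h, m in self.by_hash.items() if m != mpn}
            self.machine[mpn] = data
```

The full comparison in `scan` is what makes a hash collision harmless: a false match simply leaves the page unshared.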

Implementation

Each frame is encoded compactly in 16 bytes.
A shared frame contains:
- a hash value, the MPN, a 16-bit reference count, and a link for chaining.
A hint frame contains:
- a truncated hash value (to make room for a reference back to the corresponding guest page), a VM identifier, and the PPN.
The space overhead of page sharing is less than 0.5% of system memory.
A separate overflow table stores extended frames whose reference counts do not fit in 16 bits.
- For example, the zero page, filled completely with zero bytes.
The current ESX Server page sharing implementation scans guest pages randomly.
- Configuration options limit the CPU overhead of scanning.

Shares vs. Working Sets

An explicit parameter is introduced that allows system administrators to control the relative importance of two conflicting goals: performance isolation and efficient memory utilization.

Share-Based Allocation

Resource rights are encapsulated by shares, which are owned by clients that consume resources.
A client is entitled to consume resources proportional to its share allocation.
When one client demands more space, a replacement algorithm selects a victim client that relinquishes some of its previously allocated space. Memory is revoked from the client that owns the fewest shares per allocated page (min-funding revocation).

Reclaiming Idle Memory

Limitation of pure proportional-share allocation:
- it ignores actual memory usage and working sets
ESX Server resolves this problem by introducing an idle memory tax.
- The basic idea is to charge a client more for an idle page than for one it is actively using.
- Min-funding revocation is extended to use an adjusted shares-per-page ratio.
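
A sketch of the adjusted ratio, as I understand it from the paper: for a client with S shares, P allocated pages, and active fraction f, the ratio is S / (P * (f + k * (1 - f))), where the idle page cost is k = 1 / (1 - tax rate). The function names and the `clients` layout are assumptions for the example, and `tax_rate=0.75` is an assumed default.

```python
def adjusted_shares_per_page(shares, pages, active_fraction, tax_rate=0.75):
    """Adjusted shares-per-page ratio; tax_rate must be below 1."""
    idle_cost = 1.0 / (1.0 - tax_rate)  # k: how much more an idle page costs
    weighted = active_fraction + idle_cost * (1.0 - active_fraction)
    return shares / (pages * weighted)


def pick_victim(clients, tax_rate=0.75):
    """clients maps name -> (shares, pages, active_fraction).

    Memory is revoked from the client with the lowest adjusted ratio.
    """
    return min(
        clients,
        key=lambda name: adjusted_shares_per_page(*clients[name], tax_rate=tax_rate),
    )


clients = {"busy": (100, 50, 1.0), "idle": (100, 50, 0.2)}
print(pick_victim(clients))
```

With equal shares and allocations, the mostly idle client has the lower adjusted ratio and is chosen as the victim. With a tax rate of 0, the idle cost is 1 and the ratio reduces to plain shares per page.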

Measuring Idle Memory

Specific active and idle pages need not be identified individually.
ESX Server uses a statistical sampling approach to obtain aggregate VM working set estimates directly, without any guest involvement.
Sample pages are selected randomly using a uniform distribution, and the fraction found active estimates the VM's active ratio.
By default, ESX Server samples 100 pages for each 30 second period.
To avoid sudden changes, and inspired by work on balancing stability and agility from the networking domain, ESX maintains separate exponentially-weighted moving averages with different gain parameters.
How a sampled page is classified as active or idle: ESX invalidates its cached mapping; if the guest re-establishes the mapping within the sampling period, the page counts as active, otherwise as idle.
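
The sampling and smoothing steps can be sketched as follows. The gain values are hypothetical, the set `touched` stands in for "pages whose mapping was re-established during the period", and taking the larger of the two averages is a simplification of how the estimates are combined.

```python
import random


def sample_active_fraction(touched, total_pages, n_samples=100, rng=None):
    """Estimate the active fraction from a uniform random sample of pages."""
    rng = rng or random.Random()
    sample = rng.sample(range(total_pages), n_samples)
    return sum(page in touched for page in sample) / n_samples


class ActiveMemoryEstimate:
    def __init__(self, slow_gain=0.1, fast_gain=0.5):  # assumed gains
        self.slow_gain, self.fast_gain = slow_gain, fast_gain
        self.slow = self.fast = 0.0

    def update(self, fraction):
        # Two exponentially-weighted moving averages with different gains.
        self.slow += self.slow_gain * (fraction - self.slow)
        self.fast += self.fast_gain * (fraction - self.fast)
        # The fast average reacts quickly to growth; the slow one decays slowly.
        return max(self.slow, self.fast)
```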

Allocation Policies

This section describes how these various mechanisms are coordinated in response to specified allocation parameters and system load.

Parameters

System administrators use three basic parameters to control the allocation of memory to each VM:
- a min size, a max size, and memory shares.

Admission Control

An admission control policy ensures that sufficient unreserved memory and server swap space is available before a VM is allowed to power on.
Typical VMs reserve 32 MB for overhead, of which 4 to 8 MB is devoted to the frame buffer, and the remainder contains implementation-specific data structures. Additional memory is required for VMs larger than 1 GB.
Disk swap space must be reserved for the remaining VM memory; i.e. max - min.
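
The admission check can be written as a small predicate. It assumes, following the paper as I read it, that machine memory is reserved for min plus overhead; the function and parameter names are invented for the example.

```python
def can_power_on(min_mb, max_mb, overhead_mb,
                 unreserved_memory_mb, unreserved_swap_mb):
    """Return True if a VM with these parameters may be admitted."""
    needed_memory = min_mb + overhead_mb  # guaranteed size plus virtualization overhead
    needed_swap = max_mb - min_mb         # the remainder must be backed by swap
    return (needed_memory <= unreserved_memory_mb
            and needed_swap <= unreserved_swap_mb)


# A VM with min 256 MB, max 512 MB and 32 MB overhead needs
# 288 MB of unreserved memory and 256 MB of unreserved swap.
print(can_power_on(256, 512, 32, 300, 256))
```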

Dynamic Reallocation

ESX Server recomputes memory allocations dynamically in response to various events. For example:
- changes to system-wide or per-VM allocation parameters.
- addition or removal of a VM
- changes in the amount of free memory
ESX uses four thresholds to reflect different reclamation states: high, soft, hard, and low, which default to 6%, 4%, 2%, and 1% of system memory, respectively.
- High state: free memory is sufficient and no reclamation is performed.
- Soft state: the system reclaims memory using ballooning, and resorts to paging only in cases where ballooning is not possible.
- Hard state: the system relies on paging to forcibly reclaim memory.
- Below the low threshold: the system continues to reclaim memory via paging, and additionally blocks the execution of all VMs that are above their target allocations.

Questions

Advantages

Disadvantages

Page sharing causes overhead. Sometimes performance is more important than utilization.

Reference

A Comparison of Software and Hardware Techniques for x86 Virtualization

Introduction

Until recently, the x86 architecture has not permitted classical trap-and-emulate virtualization.
- Recently, the major x86 CPU manufacturers have announced architectural extensions to directly support virtualization in hardware.
Surprisingly, the hardware VMM often suffers lower performance than the pure software VMM.
Problems with the hardware support:
- it offers no support for MMU virtualization
- it fails to co-exist with existing software techniques for MMU virtualization
Software virtualization: binary translation.
Contributions of this paper:
- a review of VMware Workstation's software VMM, focusing on performance properties of the virtual instruction execution engine;
- a review of the emerging hardware support, identifying performance trade-offs;
- a quantitative performance comparison of a software and a hardware VMM.
First-generation hardware support rarely offers performance advantages over existing software techniques.
- Reason: high VMM/guest transition costs and a rigid programming model lacking flexibility.

Core Ideas

Classical virtualization

Essential characteristics of a VMM:
- Fidelity: software on the VMM executes identically to its execution on hardware.
- Performance: the majority of guest instructions are executed by the hardware without VMM intervention.
- Safety: the VMM manages all hardware resources.
1974: trap-and-emulate
The most important ideas from classical VMM implementations: de-privileging, shadow structures, and traces.

De-privileging

A classical VMM executes guest operating systems directly, but at a reduced privilege level.
- intercepts traps from the de-privileged guest, and emulates the trapping instruction against the virtual machine state.

Primary and shadow structures

VMM derives shadow structures from guest-level primary structures
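- On-CPU privileged state (e.g. registers): the VMM keeps an image of it and emulates accesses against that image when the guest's privileged instructions trap.
- Off-CPU privileged data (e.g. page tables in memory): accesses do not naturally coincide with trapping instructions, so they are harder to intercept.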

Memory traces

VMMs typically use hardware page protection mechanisms to trap accesses to in-memory primary structures.
- Basically, the VMM write-protects the guest's in-memory primary structure. A guest write then causes a fault; the VMM catches it, emulates the write on the primary structure, and propagates the change to the shadow structure.

Tracing example: x86 page tables

To protect the host from guest memory accesses, VMMs typically construct shadow page tables in which to run the guest.
VMware Workstation's VMM manages its shadow page tables as a cache of the guest page tables.
- As the guest accesses previously untouched regions of its virtual address space, hardware page faults transfer control to the VMM.
- The VMM distinguishes true page faults from hidden page faults.
- True faults (caused by the guest's own page table state) are forwarded to the guest.
- Hidden faults (misses in the shadow page table) cause the VMM to construct an appropriate shadow PTE and resume guest execution.
The VMM uses traces to prevent its shadow PTEs from becoming incoherent with the guest PTEs, though tracing adds overhead.
There is a three-way trade-off among trace costs, hidden page faults, and context switch costs.
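
The "shadow page table as a cache" idea, including the distinction between true and hidden faults, can be sketched like this. The class `ShadowMMU`, the exception `GuestPageFault`, and the dictionary-based page tables are assumptions for illustration; real shadow PTEs also carry protection bits that are not modeled here.

```python
class GuestPageFault(Exception):
    """A true fault that the VMM forwards to the guest OS."""


class ShadowMMU:
    def __init__(self, guest_page_table, pmap):
        self.guest_pt = guest_page_table  # primary: guest virtual -> guest physical page
        self.pmap = pmap                  # guest physical -> machine page
        self.shadow = {}                  # shadow: guest virtual -> machine page
        self.hidden_faults = 0
        self.true_faults = 0

    def access(self, vpn):
        if vpn in self.shadow:            # shadow hit: no VMM involvement
            return self.shadow[vpn]
        if vpn not in self.guest_pt:      # the guest itself has no mapping
            self.true_faults += 1
            raise GuestPageFault(vpn)
        self.hidden_faults += 1           # guest mapping exists, shadow entry missing
        self.shadow[vpn] = self.pmap[self.guest_pt[vpn]]
        return self.shadow[vpn]

    def guest_writes_pte(self, vpn, ppn):
        # A trace fires on the guest's PTE write: emulate it on the primary
        # structure and invalidate the now-stale shadow entry.
        self.guest_pt[vpn] = ppn
        self.shadow.pop(vpn, None)
```

Dropping the shadow entry on a traced write keeps the two tables coherent, at the price of one more hidden fault on the next access.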

Refinements to classical virtualization

Previously, the VMM, the hardware, and the guest OSes were commonly produced by a single company.
- This allowed researchers and practitioners to refine classical virtualization using two orthogonal approaches.
One approach adds flexibility to the VMM/guest OS interface:
- guest OSes are modified to provide higher-level information to the VMM
  - relaxes the fidelity requirement
  - gives better performance
The other approach adds flexibility to the hardware/VMM interface:
- The VMM encodes much of the guest privileged state in a hardware-defined format, then executes the SIE instruction to start interpretive execution. Many guest operations which would trap in a de-privileged environment directly access shadow fields in interpretive execution. (Basically, new hardware and corresponding instructions reduce traps.)
The first approach has been revived as paravirtualization.
For the second one, x86 vendors are introducing hardware facilities inspired by interpretive execution.

Software virtualization

x86 obstacles to virtualization

The x86 protected modes are not classically virtualizable
- Visibility of privileged state.
- Lack of traps when privileged instructions run at user-level.

Simple binary translation

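Running the guest on an interpreter overcomes the semantic obstacles to x86 virtualization, because the interpreter separates virtual state (the VCPU) from physical state (the CPU).
Interpretation ensures Fidelity and Safety but fails to meet the Performance bar, so binary translation is used instead.
Properties of the translator:
- Binary: the input is binary x86 code, not source code.
- Dynamic: translation happens at run time, interleaved with execution.
- On demand: code is translated only when it is about to execute.
- System level: the translator makes no assumptions about the guest code.
- Subsetting: the input is the full x86 instruction set; the output is a safe subset.
- Adaptive: translated code is adjusted in response to guest behavior.
The translator does not attempt to improve the translated code. The assumption is that if guest code is performance critical, it has already been optimized.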
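
A toy model of on-demand translation with a translation cache follows. It is heavily simplified: instructions are bare mnemonics, `SENSITIVE` is a hypothetical subset of instructions that need emulation, and control-flow instructions are copied unchanged here even though a real translator has to rewrite them as well.

```python
SENSITIVE = {"cli", "sti", "popf"}  # hypothetical subset that needs emulation


class Translator:
    def __init__(self, guest_blocks):
        self.guest_blocks = guest_blocks  # guest PC -> list of mnemonics
        self.cache = {}                   # translation cache: guest PC -> translated block
        self.translations = 0

    def translated_block(self, pc):
        if pc not in self.cache:          # on demand: translate only when about to run
            self.translations += 1
            self.cache[pc] = [
                # Innocuous instructions are copied as-is ("ident");
                # sensitive ones are replaced by a call into emulation code.
                ("emulate", op) if op in SENSITIVE else ("ident", op)
                for op in self.guest_blocks[pc]
            ]
        return self.cache[pc]
```

A block is translated once and then served from the cache, which is how translation cost gets amortized over repeated execution.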

Hardware virtualization

AMD: SVM
Intel: VT

x86 architecture extensions

A virtual machine control block (VMCB) combines control state with a subset of the state of a guest virtual CPU.
A new, less privileged execution mode, guest mode, supports direct execution of guest code, including privileged code.
A new instruction, vmrun, transfers from host mode (the previously architected x86 execution environment) to guest mode.
- Upon execution of vmrun, the hardware loads guest state from the VMCB and continues execution in guest mode.
- Guest execution proceeds until some condition, expressed by the VMM using control bits of the VMCB, is reached.
- The hardware then performs an exit operation, leaving guest mode and returning to host mode.

Hardware VMM implementation

When running a protected mode guest, the VMM fills in a VMCB with the current guest state and executes vmrun.
On guest exits, the VMM reads the VMCB fields describing the conditions for the exit, and vectors to appropriate emulation code.
Most of this emulation code is shared with the software VMM.
Since current virtualization hardware does not include explicit support for MMU virtualization, the hardware VMM inherits the software VMM's implementation of the shadowing technique.
The VT and SVM extensions make classical virtualization possible on x86. The resulting performance depends primarily on the frequency of exits.

Qualitative comparison

Where BT(Binary Translation) wins:
- Trap elimination
- Emulation speed
- Callout avoidance
Where hardware VMM wins:
- Code density
- Precise exceptions
- System calls run without VMM intervention

Lectures

On MIPS, the TLB is software-managed: a TLB miss raises a trap, and software refills the TLB.
On x86, the TLB is hardware-managed: a hit returns without a trap; on a miss, the hardware walks the page tables and raises a page fault (trap) only if no valid mapping is found.
- When the TLB hits, the performance is the same.

Reference

Xen and the Art of Virtualization
