1759 'Virtualization' NOV-18

This is a 'Paper Reading' post for Course ECE1759. The topic is 'Virtualization'. The paper list is:

  • Edouard Bugnion, Scott Devine, and Mendel Rosenblum, Disco: Running Commodity Operating Systems on Scalable Multiprocessors, Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles (SOSP), October 1997, Saint Malo, France.
  • Carl A. Waldspurger, Memory Resource Management in VMware ESX Server, In Proceedings of 5th Symposium on Operating Systems Design and Implementation (OSDI), Dec. 2002
  • Keith Adams and Ole Agesen, A Comparison of Software and Hardware Techniques for x86 Virtualization, In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, October 2006.
  • [Optional reading] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, Xen and the Art of Virtualization, In Proceedings of the 19th Symposium on Operating System Principles, October, 2003.

Disco: Running Commodity Operating Systems on Scalable Multiprocessors


  • Virtual machine monitors were a popular idea in the 1970s.
  • Disco uses virtual machines to run multiple commodity operating systems on a scalable multiprocessor.
  • Features:
    • overhead is small
    • provides scalability
    • able to handle non-uniform memory access (NUMA) times
  • System software is complex and must be heavily adapted to new hardware. -> Hardware advances faster than software in functionality and reliability. -> The system software ends up late, incompatible, and possibly even buggy.
  • Insert an additional layer of software between the hardware and operating system: virtual machine monitor, in which multiple OS can run on a single scalable computer.
  • features compared with traditional VM Monitors:
    • minimizes the overhead of virtual machines
    • enhances the resource sharing between virtual machines running on the same system
    • allows OSes in different VMs to communicate through NFS or TCP/IP
    • allows efficient sharing of memory and disk
    • has a global buffer cache shared by VMs.
  • the basic overhead of virtualization is at most 16%
  • page placement and dynamic page migration and replication allow Disco to hide the NUMA-ness of the memory system

Core Ideas

A Return to Virtual Machine Monitors

  • The VMM supports two possible solutions for applications whose resource needs exceed the scalability of commodity operating systems:
    • Modify commodity OS to support sharing memory regions across virtual machine boundaries
    • Use specialized operating systems for resource-intensive applications
  • An OS failure is contained within its own VM instead of spreading to the rest of the system.
  • NUMA memory management issues can also be handled by the monitor, effectively hiding the entire problem from the operating systems.
  • Multiple versions of OS can run on a multiprocessor simultaneously.

Challenges Facing Virtual Machines


  • Overheads: privileged instructions must be emulated by the monitor, and I/O must be intercepted and remapped by the monitor.
  • VMs cost extra memory.
    • Large memory structures such as the file system buffer cache are replicated in each VM, which increases memory usage.
    • Replicating file systems across VMs wastes disk space.

Resource Management

  • the lack of information available to the monitor to make good policy decisions.
    • the monitor must make resource management decisions without the high-level knowledge that an operating system would have.

Communication and Sharing

  • In earlier virtual machine systems, files could not easily be shared among VMs.

Disco: A Virtual Machine Monitor

Disco's Interface

  • To run the target OSes unmodified, the VMM needs to emulate a processor, including its instructions, the memory management unit, and the trap architecture.
Physical Memory
  • Disco provides an abstraction of main memory residing in a contiguous physical address space starting at address zero.
  • Use dynamic page migration and replication to imitate UMA on NUMA.
I/O Devices
  • Disco intercepts all I/O operations to translate or emulate them.
  • Disco exports special abstractions for the SCSI disk and network devices.
    • Disco virtualizes disks by providing a set of virtual disks that any virtual machine can mount.
    • the monitor virtualizes access to the networking devices of the underlying system.
      • Each virtual machine is assigned a distinct link-level address on an internal virtual subnet handled by Disco.

Implementation of Disco

  • careful attention has been given to: NUMA memory placement, cache-aware data structures, and interprocessor communication patterns
    • example: do not use cache-unfriendly linked lists
  • To improve NUMA locality,
    • The Disco code is replicated into the memory of every node, so that Disco's instruction cache misses can be satisfied locally.
    • Machine-wide data structures are partitioned so that each part resides on the node that accesses it exclusively or most often.
  • For shared data structure,
    • few locks are used and wait-free synchronization using the MIPS LL/SC instruction pair is heavily employed.
Virtual CPUs
  • Disco emulates the execution of the virtual CPU by using direct execution on the real CPU.
  • Some operations cannot be safely exported to the virtual machine and must be emulated by the monitor:
    • privileged instructions performed by the operating system, such as TLB modification, and direct access to physical memory and I/O devices
  • For each virtual CPU:
    • Disco keeps a data structure, much like a process table entry, that contains the saved registers and other state of the virtual CPU.
    • To emulate privileged instructions, Disco additionally maintains the privileged registers and TLB contents of the virtual CPU.
  • Disco contains a simple scheduler that allows the virtual processors to be time-shared across the physical processors of the machine.

Virtual Physical Memory
  • Virtual addresses -> physical addresses -> machine addresses
  • Disco keeps a pmap data structure for each VM.
    • Each pmap entry contains a pre-computed TLB entry that references the physical page location in real memory.
  • On a normal MIPS processor
    • all user mode memory references need to be translated
    • kernel mode references can directly access physical memory and I/O devices through the unmapped segment of the kernel virtual address space
    • Accesses through the unmapped segment would bypass Disco's physical-to-machine mapping. To solve this:
      • Disco re-links the operating system code and data to a mapped region of the address space.
  • Disco flushes the machine’s TLB when scheduling a different virtual CPU on a physical processor.
  • Overhead:
    • More TLB misses, because the TLB is flushed on every virtual CPU switch.
    • Each TLB miss is slower, because it has to be emulated by the monitor.
  • To mitigate the overhead, Disco caches recent virtual-to-machine translations in a second-level software TLB.
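
The two-step translation and the second-level software TLB described above can be sketched roughly as follows. This is a toy model, not Disco's implementation: the class name, the dictionary-based tables, the LRU eviction, and the 4 KB page size are all assumptions made for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size for this sketch


class VirtualMachineMemory:
    """Toy model of virtual -> physical -> machine translation."""

    def __init__(self, pmap, l2_capacity=4):
        self.pmap = pmap              # physical page -> machine page (per VM)
        self.guest_tlb = {}           # virtual page -> physical page (set by the guest)
        self.l2_tlb = OrderedDict()   # virtual page -> machine page (monitor's cache)
        self.l2_capacity = l2_capacity
        self.slow_path_misses = 0

    def guest_tlb_insert(self, vpn, ppn):
        # The guest believes it installs a virtual -> physical TLB entry.
        self.guest_tlb[vpn] = ppn
        self.l2_tlb.pop(vpn, None)    # drop any stale cached translation

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.l2_tlb:        # fast path: software TLB hit
            self.l2_tlb.move_to_end(vpn)
            mpn = self.l2_tlb[vpn]
        else:                         # slow path: guest mapping, then pmap
            self.slow_path_misses += 1
            mpn = self.pmap[self.guest_tlb[vpn]]
            self.l2_tlb[vpn] = mpn
            if len(self.l2_tlb) > self.l2_capacity:
                self.l2_tlb.popitem(last=False)  # evict least recently used
        return mpn * PAGE_SIZE + offset
```

A repeated access to the same virtual page takes the slow path only once; later accesses are served from the cached virtual-to-machine translation.
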
NUMA Memory Management
  • Disco targets machines that maintain cache-coherence in hardware.
  • Disco uses a robust policy that moves only pages that will likely result in an eventual performance benefit
    • Pages used heavily by one node are migrated to that node.
    • Read-shared pages are replicated to the nodes that access them most.
    • Write-shared pages are not moved.
    • limits the number of times a page can move to avoid excessive overheads
  • Hardware counts cache misses to each page from every physical processor.
  • Disco maintains a memmap data structure that contains an entry for each real machine memory page.
    • the memmap entry contains a list of the virtual machines using the page and the virtual addresses used to access them. A memmap entry also contains pointers to any replicated copies of the page.
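
The placement policy above could be expressed as a small decision function. This is only a sketch under stated assumptions: the function name, the `hot_threshold` and `max_moves` values, and the idea of feeding it per-node remote miss counts are hypothetical, not taken from the paper.

```python
def place_page(miss_counts, is_written, moves_so_far,
               hot_threshold=64, max_moves=4):
    """Return ("migrate", node), ("replicate", nodes) or ("leave", None).

    miss_counts maps node id -> remote cache misses to this page from that node.
    """
    hot_nodes = [n for n, c in miss_counts.items() if c >= hot_threshold]
    if not hot_nodes:
        return ("leave", None)                # not hot enough to be worth moving
    if len(hot_nodes) == 1:
        if moves_so_far >= max_moves:
            return ("leave", None)            # cap the number of moves per page
        return ("migrate", hot_nodes[0])      # heavily used by one node
    if is_written:
        return ("leave", None)                # write-shared: keep in place
    return ("replicate", sorted(hot_nodes))   # read-shared: copy to hot nodes
```
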
Virtual I/O Devices
  • Disco uses special device drivers in the guest instead of trapping and emulating every device access.
  • Each Disco device defines a monitor call used by the device driver to pass all command arguments in a single trap.
  • Requests to disks and network interfaces include a DMA map as part of their arguments.
    • Disco intercepts the request, translates the physical addresses into machine addresses, then interacts directly with the physical device.
  • Disco's copy-on-write disks allow virtual machines to share both main memory and disk storage resources.
Copy-on-write Disks
  • Disco intercepts every disk request that DMAs data into memory.
    • If the requested data is already in memory, Disco maps the existing page read-only into the requesting VM; a later write triggers copy-on-write.
  • All the virtual machines can share the same root disk containing the kernel and application programs.
  • To preserve the isolation of the virtual machines, disk writes must be kept private to the virtual machine that issues them.
    • Disco logs the modified sectors so that the copy-on-write disk is never actually modified.
    • For simplicity, copy-on-write is applied only to non-persistent disks.
  • Two data structures for memory and disk sharing:
    • For each disk device:
      • maintains a B-Tree indexed by the range of disk sectors being requested. It's used to find the machine memory address of the sectors in the global disk cache.
    • A second B-Tree is kept for each disk and VM to find any modifications to the block made by that VM.
  • COW is for the disk whose modifications should not be shared or persistent.
  • For persistent disks containing user files, only one VM can mount it.
  • NFS can be used to share files across VMs.
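
A minimal sketch of the copy-on-write disk idea: a shared base image that is never modified, plus a private modification log per VM. Plain dictionaries stand in for the B-Trees mentioned above, and the class and method names are hypothetical.

```python
class CowDisk:
    """Shared read-only base image plus a private per-VM modification log."""

    def __init__(self, base_sectors):
        self.base = dict(base_sectors)   # sector number -> bytes, never modified
        self.logs = {}                   # vm id -> {sector number -> bytes}

    def write(self, vm, sector, data):
        # Writes go to the issuing VM's private log, not to the shared image.
        self.logs.setdefault(vm, {})[sector] = data

    def read(self, vm, sector):
        # A VM sees its own modifications first, then the shared base data.
        private = self.logs.get(vm, {})
        return private.get(sector, self.base[sector])
```

With this layout, a write by one VM is invisible to every other VM, which is the isolation property the notes describe.
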
Virtual Network Interface
  • A virtual subnet is designed to allow virtual machines to communicate with each other, while avoiding replicated data whenever possible.
  • The virtual subnet and networking interfaces of Disco also use copy-on-write mappings to reduce copying and to allow for memory sharing.
  • Message transfer between VMs: the virtual DMA unit maps the page read-only into both the sending and the receiving virtual machine's physical address space.
  • The file data of NFS can be shared among servers and clients.
    • global buffer cache:
      • copy-on-write disks
      • the access to persistent data through the specialized network device
    • all read-only pages can be shared between virtual machines.
      • To avoid remote accesses to shared pages, Disco replicates them.

Running Commodity Operating Systems

  • hardware abstraction level (HAL): a layer that allows the operating system to be ported effectively to new platforms.
    • Typically the HAL of modern operating systems changes with each new version of a machine while the rest of the system can remain unchanged.
  • Most of the changes made in IRIX were part of the HAL
Necessary Changes for MIPS Architecture
  • These changes address the direct-mapping problem of MIPS kernel mode.
    • relocate the unmapped segment of the virtual machines into a portion of the mapped supervisor segment of the MIPS processor.
Device Drivers
  • The authors wrote drivers for the UART, SCSI disks, and Ethernet that use Disco's monitor call interface.
Changes to the HAL
  • Frequently used privileged instructions are converted into non-trapping load and store instructions to a special page of the address space that contains the privileged registers.

    • only applied to instructions that read and write privileged registers without causing other side-effects
    • This reduces the number of traps for privileged register accesses.
  • To help the monitor make better resource management decisions, the authors added code to the HAL that passes hints to the monitor, giving it higher-level knowledge of resource utilization.

    • inserted some monitor calls in physical memory management module
  • Disco attaches additional semantics to the idle mode (reduced power consumption mode) of MIPS processors. The operating system enters this mode whenever the system is idle. Disco then deschedules the virtual CPU until the mode is cleared or an interrupt is posted.

Other Changes to IRIX
  • The virtual network device can only take advantage of the remapping techniques if the packets contain properly aligned, complete pages that are not written.
    • The IRIX mbuf structure did not meet these requirements, so the authors modified the data structure.
  • A call to bcopy was specialized into a new remap function offered by the HAL, which avoids NFS copying data from mbufs into the buffer cache. With this call, pages are remapped and shared instead of copied wherever possible.

SPLASHOS: A Specialized Operating System

  • The authors designed a small OS without virtual memory, deferring all page faulting responsibilities directly to Disco.
    • The application is linked with the library operating system and runs in the same address space as the operating system.
    • SPLASHOS shows that Disco suits small, specialized applications that do not require a full-function OS.


Memory Resource Management in VMware ESX Server


  • Techniques:
    • ballooning:
      • reclaims the pages considered least valuable by the operating system running in a virtual machine.
    • idle memory tax
      • achieves efficient memory utilization while maintaining performance isolation guarantees
    • content-based page sharing & hot I/O page remapping
      • exploit transparent page remapping to eliminate redundancy and reduce copying overheads.
    • They are combined to efficiently support virtual machine workloads that overcommit memory
  • In many computing environments, individual servers are underutilized.
  • VMware Workstation:
    • uses a hosted virtual machine architecture.
    • relies on a pre-existing OS for portable I/O device support.
  • The need to run existing OSs without modification.

Core Ideas

Memory Virtualization

  • Same as Disco, ESX Server maintains a pmap data structure for each VM to translate physical page numbers (PPNs) to machine page numbers (MPNs).
  • ESX can remap PPN -> MPN relations
  • ESX can monitor or interpose on guest memory accesses

Reclamation Mechanisms

  • supports overcommitment of memory
    • the total size configured for all running virtual machines exceeds the total amount of actual machine memory
  • Each VM is given the illusion of having a fixed amount of physical memory: "max size", constant.

Page Replacement Issues

  • need a reclaim mechanism
  • Earlier virtual machine systems:
    • Introduce another level of paging, moving some VM physical pages to a swap area on disk.
    • Problem: hard to decide which pages to reclaim


Ballooning

  • The balloon works like a pseudo-device driver or kernel service in the guest OS.
  • It has no external interface within the guest, and communicates with ESX Server via a private channel.
  • To reclaim memory, the server inflates the balloon, which allocates pinned physical pages within the VM.
  • When memory is plentiful, the guest OS satisfies the allocation from its free list.
  • When memory is scarce, the guest OS must reclaim space to satisfy the driver's allocation request.
  • The guest OS decides which particular pages to reclaim and, if necessary, pages them out to its own virtual disk.
  • When a guest PPN is ballooned, the system annotates its pmap entry and deallocates the associated MPN.
    • If a ballooned page is accessed again (which should not happen), the server allocates a new MPN for the PPN.
  • balloon drivers poll the server once per second to obtain a target balloon size, and limit the allocation rates to avoid stressing the guest OS.
  • Concerns:
    • the balloon driver may be uninstalled
    • the balloon driver may be disabled
    • the balloon driver is unavailable while the guest OS is booting
    • the guest OS may impose an upper bound on the balloon size
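
The inflate and deflate behaviour can be illustrated with a toy simulation. Everything here is an assumption for illustration: the `GuestOS` model, the page counts, and the `max_rate` limit are hypothetical and much simpler than a real balloon driver.

```python
class GuestOS:
    """Toy guest with a free list and a set of in-use pages."""

    def __init__(self, total_pages, used_pages):
        self.free = total_pages - used_pages
        self.used = used_pages
        self.paged_out = 0   # pages the guest pushed to its own virtual disk
        self.balloon = 0     # pages pinned by the balloon driver

    def allocate_pinned(self, n):
        # Plentiful memory: satisfy the request from the free list.
        from_free = min(n, self.free)
        self.free -= from_free
        # Scarce memory: the guest's own policy picks pages to page out.
        evicted = min(n - from_free, self.used)
        self.used -= evicted
        self.paged_out += evicted
        self.balloon += from_free + evicted
        return from_free + evicted


def balloon_tick(guest, target, max_rate=8):
    """One polling step: move the balloon toward target, rate limited."""
    if guest.balloon < target:
        guest.allocate_pinned(min(max_rate, target - guest.balloon))
    elif guest.balloon > target:
        released = min(max_rate, guest.balloon - target)
        guest.balloon -= released
        guest.free += released   # deflate: pages return to the guest
    return guest.balloon
```

The point of the design is visible in `allocate_pinned`: the server only sets a target, while the guest itself decides which pages to give up.
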

Demand Paging

  • When ballooning is not possible or insufficient, the system falls back to a paging mechanism.
    • Memory is reclaimed by paging out to an ESX Server swap area on disk, without any guest involvement.
  • ESX Server uses a swap daemon and a higher-level policy module to manage swapping.
  • Currently a randomized page replacement policy is used.

Sharing Memory

Transparent Page Sharing

  • Mentioned in Disco
  • Problem: Disco required several guest OS modifications

Content-Based Page Sharing

  • Basic Idea: identify page copies by their contents
  • Pages with identical contents can be shared regardless of when, where, or how those contents were generated.
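  • Hashing is used to identify candidate pages with identical contents.
    • False matches are possible, so a hash match is followed by a full comparison of the page contents.
    • Once a matching page is found, the two pages are shared using copy-on-write (COW).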
  • Higher-level page sharing policies control when and where to scan for copies.
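
A rough sketch of the hash, compare, then share-with-COW flow. The use of a truncated SHA-1 digest, the dictionary tables, and all names are assumptions for illustration, not the hash function or data layout ESX Server uses.

```python
import hashlib


class PageSharer:
    def __init__(self):
        self.by_hash = {}   # content hash -> machine page number (MPN)
        self.machine = {}   # MPN -> page contents
        self.refs = {}      # MPN -> reference count
        self.next_mpn = 0

    def _alloc(self, contents):
        mpn = self.next_mpn
        self.next_mpn += 1
        self.machine[mpn] = contents
        self.refs[mpn] = 1
        return mpn

    def share(self, contents):
        """Return an MPN for this content, sharing an identical page if one exists."""
        key = hashlib.sha1(contents).digest()[:8]  # truncated, so collisions can occur
        mpn = self.by_hash.get(key)
        # A hash match is only a hint: confirm with a full comparison.
        if mpn is not None and self.machine[mpn] == contents:
            self.refs[mpn] += 1
            return mpn
        mpn = self._alloc(contents)
        self.by_hash.setdefault(key, mpn)
        return mpn

    def write(self, mpn, contents):
        """Copy-on-write: writing to a shared page yields a private copy."""
        if self.refs[mpn] > 1:
            self.refs[mpn] -= 1
            return self._alloc(contents)
        self.machine[mpn] = contents
        return mpn
```

Stale hash entries are tolerated in this sketch because every match is re-verified by the full comparison.
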


  • Each frame is encoded compactly in 16 bytes.
  • A shared frame:
    • a hash value, the MPN, a 16-bit reference count, and a link for chaining.
  • A hint frame:
    • a truncated hash value to make room for a reference back to the corresponding guest page, a VM identifier, PPN.
  • Overhead of page sharing < 0.5% of system memory.
  • A separate overflow table stores extended frames for pages whose reference count does not fit in 16 bits.
    • For example, the zero page (filled completely with zero bytes) is shared very widely.
  • The current ESX Server page sharing implementation scans guest pages randomly.
    • Configuration options limit the scan rate to bound the CPU overhead of scanning.

Shares vs. Working Sets

  • ESX Server introduces an explicit parameter that lets system administrators trade off two conflicting goals: share-based performance isolation and efficient memory utilization based on working sets.

Share-Based Allocation

  • resource rights are encapsulated by shares, which are owned by clients that consume resources
  • A client is entitled to consume resources proportional to its share allocation
  • When one client demands more space, a replacement algorithm selects a victim client that relinquishes some of its previously-allocated space. Memory is revoked from the client that owns the fewest shares per allocated page.
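
Victim selection by fewest shares per allocated page (min-funding revocation) can be sketched in a few lines. The function name and the tuple layout of `clients` are hypothetical.

```python
def pick_victim(clients):
    """clients maps name -> (shares, allocated_pages).

    Returns the client that owns the fewest shares per allocated page.
    """
    ratios = {name: shares / pages
              for name, (shares, pages) in clients.items() if pages > 0}
    return min(ratios, key=ratios.get)
```

For two clients with equal shares, the one holding more pages has the lower ratio and is reclaimed from first.
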

Reclaiming Idle Memory

  • Limitation of pure proportional-share allocation:
    • it ignores actual memory usage and working sets
  • ESX Server resolves this problem by introducing an idle memory tax.
    • The basic idea is to charge a client more for an idle page than for one it is actively using.
    • Min-funding revocation is extended to use an adjusted shares-per-page ratio.
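
The adjusted ratio from the paper is ρ = S / (P · (f + k · (1 − f))), where S is the client's shares, P its allocated pages, f the fraction of active pages, and k = 1 / (1 − τ) the idle page cost for tax rate τ. A small sketch, where the function and parameter names are my own and the default of 0.75 follows the paper's default tax rate:

```python
def adjusted_shares_per_page(shares, pages, active_fraction, tax_rate=0.75):
    """rho = S / (P * (f + k * (1 - f))) with k = 1 / (1 - tax_rate)."""
    if not 0 <= tax_rate < 1:
        raise ValueError("tax_rate must be in [0, 1)")
    k = 1.0 / (1.0 - tax_rate)   # cost of an idle page relative to an active one
    return shares / (pages * (active_fraction + k * (1.0 - active_fraction)))
```

A fully idle client at the 75% tax rate ends up with a quarter of the ratio of a fully active one, so its memory is reclaimed first; a tax rate of 0 reduces to plain shares per page.
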

Measuring Idle Memory

  • Specific active and idle pages need not be identified individually.
  • ESX Server uses a statistical sampling approach to obtain aggregate VM working set estimates directly, without any guest involvement.
  • Sample pages are selected uniformly at random to estimate the fraction of active pages.
  • By default, ESX Server samples 100 pages for each 30 second period.
  • To avoid sudden change, inspired by work on balancing stability and agility from the networking domain, ESX maintains separate exponentially-weighted moving averages with different gain parameters.
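  • Deciding whether a sampled page is active or idle: the server invalidates the page's cached mappings; if the guest touches the page and the mapping is re-established within the sampling period, the page counts as active, otherwise as idle.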
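
The sampling and smoothing steps might look like the following. The gain values, the use of only two averages, and all names are assumptions for illustration; the paper combines its averages by taking the maximum, which this sketch imitates.

```python
import random


def sample_active_fraction(touched_pages, total_pages, n=100, rng=random):
    """Estimate the active fraction from n uniformly sampled page numbers."""
    sample = [rng.randrange(total_pages) for _ in range(n)]
    return sum(1 for p in sample if p in touched_pages) / n


class WorkingSetEstimate:
    """Slow and fast exponentially weighted moving averages."""

    def __init__(self, slow_gain=0.1, fast_gain=0.5):  # illustrative gains
        self.slow = 0.0
        self.fast = 0.0
        self.slow_gain = slow_gain
        self.fast_gain = fast_gain

    def update(self, sample):
        self.slow += self.slow_gain * (sample - self.slow)
        self.fast += self.fast_gain * (sample - self.fast)
        # The maximum reacts quickly to growth but decays slowly.
        return max(self.slow, self.fast)
```
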

Allocation Policies

  • This section describes how these various mechanisms are coordinated in response to specified allocation parameters and system load.


  • System administrators use three basic parameters to control the allocation of memory to each VM:
    • a min size, a max size, and memory shares.

Admission Control

  • An admission control policy ensures that sufficient unreserved memory and server swap space is available before a VM is allowed to power on.
  • Typical VMs reserve 32 MB for overhead, of which 4 to 8 MB is devoted to the frame buffer, and the remainder contains implementation-specific data structures. Additional memory is required for VMs larger than 1 GB.
  • Disk swap space must be reserved for the remaining VM memory; i.e. max - min.
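
As a hedged illustration of the admission check: the sketch below assumes machine memory must cover min plus overhead and swap must cover max minus min. The function name, parameter names, and the 32 MB default are taken from or modelled on the notes above, not from an actual ESX interface.

```python
def can_power_on(min_mb, max_mb, free_unreserved_mb, free_swap_mb,
                 overhead_mb=32):
    """Return True if a VM with these parameters may be admitted."""
    needed_memory = min_mb + overhead_mb   # guaranteed memory plus overhead
    needed_swap = max_mb - min_mb          # remainder must fit in swap
    return (needed_memory <= free_unreserved_mb
            and needed_swap <= free_swap_mb)
```
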

Dynamic Reallocation

  • ESX Server recomputes memory allocations dynamically in response to various events. For example:
    • changes to system-wide or per-VM allocation parameters.
    • addition or removal of a VM
    • changes in the amount of free memory
  • ESX Server uses four thresholds to reflect different reclamation states: high, soft, hard, and low, which default to 6%, 4%, 2%, and 1% of system memory, respectively.
    • high state, free memory is sufficient and no reclamation is performed.
    • soft state, the system reclaims memory using ballooning, and resorts to paging only in cases where ballooning is not possible.
    • hard state, the system relies on paging to forcibly reclaim memory
    • below the low threshold, the system continues to reclaim memory via paging, and additionally blocks the execution of all VMs that are above their target allocations.
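
One way to read the four thresholds is as a small state machine with hysteresis. The transition rules below are my interpretation (drop to a lower state when free memory falls below that state's threshold, climb back only after exceeding the next higher threshold); they are an assumption, and the names are hypothetical.

```python
THRESHOLDS = {"high": 0.06, "soft": 0.04, "hard": 0.02, "low": 0.01}
ORDER = ["high", "soft", "hard", "low"]


def next_state(current, free_fraction):
    """Return the reclamation state after observing free_fraction."""
    i = ORDER.index(current)
    # Move down while free memory is below the next lower state's threshold.
    while i < len(ORDER) - 1 and free_fraction < THRESHOLDS[ORDER[i + 1]]:
        i += 1
    # Move up only after exceeding the next higher state's threshold.
    while i > 0 and free_fraction > THRESHOLDS[ORDER[i - 1]]:
        i -= 1
    return ORDER[i]
```

With these rules, 5% free memory leaves the system in whichever of the high or soft states it was already in, which avoids rapid flapping between states.
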




  • Page sharing causes overhead. Sometimes performance matters more than utilization.


A Comparison of Software and Hardware Techniques for x86 Virtualization


  • Until recently, the x86 architecture has not permitted classical trap-and-emulate virtualization.
    • Recently, the major x86 CPU manufacturers have announced architectural extensions to directly support virtualization in hardware.
  • Surprisingly, the hardware VMM often suffers lower performance than the pure software VMM.
  • hardware support problems:
    • it offers no support for MMU virtualization
    • it fails to co-exist with existing software techniques for MMU virtualization
  • Software virtualization: binary translation.
  • Contribution of this paper:
    • a review of VMware Workstation's software VMM, focusing on performance properties of the virtual instruction execution engine;
    • a review of the emerging hardware support, identifying performance trade-offs;
    • a quantitative performance comparison of a software and a hardware VMM.
  • First-generation hardware support rarely offers performance advantages.
    • Reason: high VMM/guest transition costs and a rigid programming model that leaves little room for software flexibility.

Core Ideas

Classical virtualization

  • essential characteristics of VMM:
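    • Fidelity: software runs on the VMM identically to how it runs on hardware.
    • Performance: the majority of guest instructions execute on hardware without VMM intervention.
    • Safety: the VMM manages all hardware resources.
  • 1974 (Popek and Goldberg): trap-and-emulate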
  • the most important ideas from classical VMM implementations: deprivileging, shadow structures and traces.


De-privileging

  • A classical VMM executes guest operating systems directly, but at a reduced privilege level.
    • It intercepts traps from the de-privileged guest, and emulates the trapping instruction against the virtual machine state.
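
Trap-and-emulate can be illustrated with a toy instruction loop. The three-instruction "ISA" (`add`, `cli`, `sti`) and all names are hypothetical; the point is only that privileged instructions update virtual state inside the VMM while everything else runs directly.

```python
PRIVILEGED = {"cli", "sti"}  # hypothetical privileged instructions


class VCPU:
    """Virtual CPU state kept by the VMM."""

    def __init__(self):
        self.interrupts_enabled = True   # virtual privileged state
        self.acc = 0                     # unprivileged register
        self.traps = 0


def emulate(vcpu, op):
    # Apply the privileged instruction to the virtual state, not the real CPU.
    vcpu.interrupts_enabled = (op == "sti")


def run_guest(vcpu, program):
    for op, *args in program:
        if op in PRIVILEGED:
            vcpu.traps += 1              # the de-privileged guest traps to the VMM
            emulate(vcpu, op)
        elif op == "add":
            vcpu.acc += args[0]          # innocuous instruction: direct execution
        else:
            raise ValueError(f"unknown instruction: {op}")
    return vcpu
```
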

Primary and shadow structures

  • VMM derives shadow structures from guest-level primary structures
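    • On-CPU privileged state (e.g. registers): the VMM keeps an in-memory image of the guest's registers and refers to it when emulating instructions.
    • Off-CPU privileged data (e.g. page tables in memory): accesses do not naturally coincide with trapping instructions, so they are harder to intercept.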

Memory traces

  • VMMs typically use hardware page protection mechanisms to trap accesses to in-memory primary structures.
    • The VMM write-protects the primary structure. A guest write then causes a fault; the VMM catches it, emulates the write in the primary structure, and propagates the change to the shadow structure.

Tracing example: x86 page tables

  • To protect the host from guest memory accesses, VMMs typically construct shadow page tables in which to run the guest.
  • VMware Workstation's VMM manages its shadow page tables as a cache of the guest page tables.
    • As the guest accesses previously untouched regions of its virtual address space, hardware page faults transfer control to the VMM.
    • The VMM distinguishes true page faults from hidden page faults.
    • True faults are forwarded to the guest;
    • hidden faults cause the VMM to construct an appropriate shadow PTE, and resume guest execution.
  • The VMM uses traces to prevent its shadow PTEs from becoming incoherent with the guest PTEs, though tracing can cause overhead.
  • three-way trade-off among trace costs, hidden page faults and context switch costs
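
The distinction between true and hidden page faults, and the role of traces, can be sketched as follows. Dictionaries stand in for page tables, and the class and method names are hypothetical.

```python
class ShadowPager:
    def __init__(self, guest_page_table, pmap):
        self.guest_pt = guest_page_table  # guest virtual page -> guest physical page
        self.pmap = pmap                  # guest physical page -> machine page
        self.shadow = {}                  # virtual page -> machine page (used by hardware)

    def access(self, vpn):
        if vpn in self.shadow:
            return ("hit", self.shadow[vpn])
        if vpn not in self.guest_pt:
            return ("true_fault", None)   # no guest mapping: forward to the guest
        mpn = self.pmap[self.guest_pt[vpn]]
        self.shadow[vpn] = mpn            # hidden fault: build the shadow PTE
        return ("hidden_fault", mpn)

    def guest_pte_write(self, vpn, ppn):
        """Trace handler: keep the shadow coherent when the guest edits a PTE."""
        self.guest_pt[vpn] = ppn
        self.shadow.pop(vpn, None)        # invalidate; refilled on the next access
```

Invalidating on every traced write keeps the shadow coherent but costs a later hidden fault, which is one side of the three-way trade-off noted above.
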

Refinements to classical virtualization

  • Previously, the VMM, hardware, and OS were all designed by one company.
    • This let researchers and practitioners refine classical virtualization using two orthogonal approaches.
  • One approach, add flexibility in the VMM/guest OS interface:
    • modified guest OSs to provide higher-level information to the VMM.
      • relax fidelity requirement
      • better performance
  • Other approach, add flexibility in the hardware/VMM interface:
    • VMM encodes much of the guest privileged state in a hardware-defined format, then executes the SIE instruction to start interpretive execution. Many guest operations which would trap in a de-privileged environment directly access shadow fields in interpretive execution. (Basically use new hardware and corresponding instructions to reduce traps)
  • The first approach has been revived as paravirtualization.
  • For the second one, x86 vendors are introducing hardware facilities inspired by interpretive execution.

Software virtualization

x86 obstacles to virtualization

  • The x86 protected modes are not classically virtualizable
    • Visibility of privileged state.
    • Lack of traps when privileged instructions run at user-level.

Simple binary translation

  • An interpreter separates virtual state (the VCPU) from physical state (the CPU), which overcomes the semantic obstacles to x86 virtualization.
  • Interpretation ensures Fidelity and Safety but fails to meet the Performance bar; binary translation keeps the semantic precision while improving performance.
  • translator's properties:
    • Binary
    • Dynamic
    • On demand
    • System level
    • Subsetting
    • Adaptive
  • The translator does not attempt to improve the translated code. The assumption is that performance-critical guest code has already been optimized by its authors.

Hardware virtualization

  • AMD: SVM
  • Intel: VT

x86 architecture extensions

  • virtual machine control block(VMCB) combines control state with a subset of the state of a guest virtual CPU.
  • A new, less privileged execution mode, guest mode, supports direct execution of guest code, including privileged code.
  • A new instruction, vmrun, transfers from host(previously architected x86 execution environment) to guest mode.
    • Upon execution of vmrun, the hardware loads guest state from the VMCB and continues execution in guest mode.
    • Guest execution proceeds until some condition, expressed by the VMM using control bits of the VMCB, is reached.
    • Performs an exit operation to quit guest mode.

Hardware VMM implementation

  • When running a protected mode guest, the VMM fills in a VMCB with the current guest state and executes vmrun.
  • On guest exits, the VMM reads the VMCB fields describing the conditions for the exit, and vectors to appropriate emulation code.
  • Most of this emulation code is shared with the software VMM.
  • Since current virtualization hardware does not include explicit support for MMU virtualization, the hardware VMM inherits the software VMM's implementation of the shadowing technique.
  • The VT and SVM extensions make classical virtualization possible on x86. The resulting performance depends primarily on the frequency of exits.

Qualitative comparison

  • Where BT(Binary Translation) wins:
    • Trap elimination
    • Emulation speed
    • Callout avoidance
  • Where hardware VMM wins:
    • Code density
    • Precise exceptions
    • System calls run without VMM intervention


  • On MIPS, the TLB is managed by software: a TLB miss raises a trap.
  • On x86, the TLB is managed by hardware: on a hit the translation is returned without a trap; on a miss the hardware walks the page tables and raises a page fault (trap) only if no valid mapping is found.
    • On a TLB hit, the performance is the same.


Xen and the Art of Virtualization

