1759 'Kernel Structure' OCT-4

This is a 'Paper Reading' post for Course ECE1759. The topic is 'Kernel Structure'. This paper list is here:

  • D. M. Ritchie and K. Thompson, The UNIX Time-Sharing System, Communications of the ACM, Vol. 17, No. 7, July 1974, pp. 365-375.
  • D. R. Engler, M. F. Kaashoek, and J. O'Toole Jr. Exokernel: An Operating System Architecture For Application-specific Resource Management, Proceedings of the 15th SOSP, December 1995, pp 251-266.
  • A. Baumann, J. Appavoo, O. Krieger, T. Roscoe, A fork() in the road, In Workshop on Hot Topics in Operating Systems (HotOS'19), May 13-15, 2019, Bertinoro, Italy.
  • R. Pike, D. Presotto, S. Dorward, B. Flandrena, K. Thompson, H. Trickey, and P. Winterbottom, Plan 9 From Bell Labs, USENIX Computing Systems, Vol. 8, No. 3, Summer 1995, pp. 221-254.

The UNIX Time-Sharing System


Main Features

  • a hierarchical file system incorporating demountable volumes;
  • compatible file, device, and inter-process I/O;
  • the ability to initiate asynchronous processes;
  • system command language selectable on a per-user basis; and
  • over 100 subsystems including a dozen languages.

"The Greatest Part"

C language

functional improvements:

  • multiprogramming
  • the ability to share reentrant code among several user programs

Core Ideas

The File System

  • "." and ".." have been introduced
  • hardlink has already been introduced
  • I/O device can be seen as special files
    • file and device I/O are as similar as possible.
    • file and device names have the same syntax and meaning, so that a program expecting a file name as a parameter can be passed a device name.
    • use to the same protection mechanism.
  • Current mount mechanism has been introduced.
    • No across-mounted-point link is allowed.
    • In the root directories of all file systems, removable or not, the name “..” refers to the directory itself instead of to its parent.
  • The group concept has not been introduced in protection mechanism. (only 6 bits)
  • Set-User-ID bit is the 7th bit.
  • A File Descriptor is returned when open a file.
  • No file locker to prevent multi-write.
  • The files are also divided into blocks(512B), larger or small files:
    • Less or equal than 8 blocks: store block address.
    • More than 8 blocks: store the the address of an indirect block of 256 addresses of blocks.
  • A system table is maintained for mount: (i-number, device)-pair
  • Buffer cache has been introduced

Processes and Images


  • An image is a computer execution environment.
  • The program text segment begins at location 0 in the virtual address space.(write-protected)
  • The memory framework is introduced.
    • 0: Text Segment
    • 8KB (adjustable) above Text Segment: non-shared, writable data (Sounds like Heap Segment)
    • Highest address: Stack Segment


  • An image is a computer execution environment.
  • fork is introduced
  • pipes is introduced
    • the pipe must be set up by a common ancestor of the processes involved.
  • execute is introduced
  • wait is introduced
    • It causes its caller to suspend execution until one of its children has completed execution. Then wait returns the processid of the terminated process.
  • exit is introduced

The Shell

  • Command line interpreter
  • 0 and 1 are opened at the start of the shell.
  • Redirection is introduced
  • Filters is introduced
  • ";" and "&" are introduced
  • It is also possible to execute commands conditionally on character string comparisons or on existence of given files and to perform transfers of control within filed command sequences.
  • When execute a command, the parent process waits until child process die.


  • When an illegal action is caught, unless other arrangements have been made, the system terminates the process and writes the user’s image on file core in the current directory. A debugger can be used to determine the state of the program at the time of the fault.
  • interrupt signal and quit signal are introduced


What is i-list, and the defects


  • Simple
  • Unique abstraction
    • Everything is a file
    • Process management(fork(), exec())
      • Concurrency
      • Simple
      • Between fork and exec, you can do a lot, for example, you can redirect. That is what the redirection symbol implemented. Pipe's shard-memory is also initialized in this phase.
  • No error, then keep silent. With errors, say loudly.
  • Execute different software in Pipeline. Like flink, K8S.
  • Portability


  • The group level protection has not been used.
  • No virtual memory.(PDP-11 does not support)

Differences with current systems

  1. 7bit protection
  2. large and small files
  3. i-list in not used anymore
  4. No networking
  5. No virtual memory


Exokernel: An Operating System Architecture For Application-specific Resource Management


The exokernel operating system architecture addresses this problem by providing application-level management of physical resources. In the exokernel architecture, a small kernel securely exports all hardware resources through a lowlevel interface to untrusted library operating systems. Library operating systems use this interface to implement system objects and policies.

  • Idea: Separation of resource protection from management
  • Virtual memory (VM) and interprocess communication (IPC), are implemented entirely at application level by untrusted software.
  • Greater flexibility and better performance

The cost of fixed High-level Abstractions

  • hurt application performance
  • hide information
  • limit the functionality

Library Operating System

  • Not trusted by Kernel
  • Provide as much portability and compatibility as is desirable
  • Can be replaced easily

Core Ideas

Exokernel Design

  • In separating protection from management, an exokernel performs three important tasks:
    • tracking ownership of resources
    • ensuring protection by guarding all resource usage or binding points
    • revoking access to resources.
  • To achieve the tasks, exokernel employs three techs:
    • Secure bindings: library operating systems can securely bind to machine resources.
    • Visible revocation: allows library operating systems to participate in a resource revocation protocol.
    • Abort protocol: break secure bindings of uncooperative library operating systems by force.
  • Design Principles
    • Securely expose hardware
    • Expose allocation
    • Expose Names
    • Expose Revocation
  • An exokernel hands over resource policy decisions to library operating systems.

Secure Bindings

One of the primary tasks of an exokernel is to multiplex resources securely, providing protection for mutually distrustful applications. To implement protection an exokernel must guard each resource. To perform this task efficiently an exokernel allows library operating systems to bind to resources using secure bindings.

A secure binding is a protection mechanism that decouples authorization from the actual use of a resource.

  • three basic techniques to implement secure bindings
    • hardware mechanisms
      • For example, some hardware has ownership tag in its resource.
    • software caching
      • An exokernel can use a large software TLB to cache address translations that do not fit in the hardware TLB.
    • downloading application code

Multiplexing Physical Memory

When a library operating system allocates a physical memory page, the exokernel creates a secure binding for that page by recording the owner and the read and write capabilities specified by the library operating system.

  • When TLB(in CPU) Miss, to avoid that the library operating systems need to regain secure bindings, ExoKernel mains a software TLB cache virtual-to-physical mappings.
  • If the hardware defines a page-table interface, Then that hardware page-table should be guarded instead of above software TLB.
  • The basic principle is straightforward: privileged machine operations such as TLB loads and DMA must be guarded by an exokernel.
  • To break a secure binding, kernel need to clear capabilities and resource and set resource as free.

Multiplexing the Network

Multiplexing(Use one thread to multiple socket connection, so kernel need to know which socket to wake up) the network efficiently is challenging, since protocol-specific knowledge is required to interpret the contents of incoming messages and identify the intended recipient.

  • hardware-based mechanism is the use of the virtual circuit in ATM cells to securely bind streams to applications
  • Software support for message demultiplexing can be provided by packet filters
    • Packet filters can be viewed as an implementation of secure bindings in which application code—the filters— are downloaded into the kernel.
Downloading Code
  • to improve performance
    • elimination of kernel crossings.
    • the execution time of downloaded code can be readily bounded

Visible Resource Revocation

  • Can be visible or invisible, invisible revocation perform better.
  • An exokernel needs to reveal each revocation to the relevant library operating system

The Abort Protocol

An exokernel must also be able to take resources from library operating systems that fail to respond satisfactorily to revocation requests.

  • "Break secure bindings by force"
  • If a library operating system fails to comply with the revocation protocol, an exokernel simply breaks all existing secure bindings to the resource and informs the library operating system.
  • Use a repossession vector to record the forced loss of a resource.
    • when kernel add a resource to the vector, the library operating system receives a “repossession” exception so that it can update any mappings that use the resource
  • Some vital resource will be guaranteed not to be repossessed. Even when must be repossessed, Kernel should tell the library operating system to submit itself to another server.


  • Why memory IO do not (always) need to trap in kernel like disk IO?
    • Disk IO use trap because it's simple. Memory IO do not use because efficiency.
    • TLB has some protection bits, memory access can be checked in it on hardware.


  • Similar with Hypervisor
  • In the end become Virtualization -> Big Impact


  • Hard to use for normal users.
  • Though performance enhancement of exokernel is remarkable, the most of bottlenecks of current application are not in the Kernel.


A fork() in the road


  • Fork worked well in 1970s, but now it becomes a bad abstraction.
  • More modern posix_spawn is rarely used.
  • Fork must be removed from the kernel.

Core Ideas


  • Genie' fork compared with Unix's fork: it permitted the parent process to specify the address space and machine context for the new child process.
  • TENEX' fork support
    • sharing the address space between parent and child
    • creating the child with an empty address space


  • Simple: no parameters.
  • Eased concurrency
    • A program could initialise, parse its configuration files, and then fork multiple copies of itself that ran either different functions from the same binary or processed different inputs.


  • Fork is no longer simple.
    • New APIs need to consider Fork, they must be prepared for their state to be forked at any time.
    • Numerous system call flags control fork’s behaviours with respect to memory mappings, file descriptors and threads.
  • Fork doesn’t compose.
    • Not friendly for user-mode kernel bypass features, for example, Buffered IO.
  • Fork isn’t thread-safe.
    • Childs may have inconsistent snapshot. For example, one thread doing memory allocation and holding a heap lock, while another thread forks. Following allocating will cause deadlock.
  • Fork is insecure.
    • The child may hold some unneeded privileges, which breaks the least privilege principle.
  • Fork is slow.
    • In the past, it worked well as the memory size and relative access cost were small.
    • Now even the time to establish copy-on-write mappings is a problem: Chrome experiences delays of up to 100 ms in fork
  • Fork doesn’t scale.
    • Setup fork’s copy-on-write mappings hurt scalability
  • Fork encourages memory overcommit.
    • Small process may inherent the large parent, which may cause fork fail, since it need to double the virtual size of the process.
    • A solution is Overcommit. However, Overcommit may cause OOM.

Best scenario: a single threaded process with a small memory footprint and simple memory layout that requires fine-grained control over the execution environment of its children but does not need to be strongly isolated from them.


Evidence that supporting fork limits changes in OS architecture, and restricts the ability of OSes to adapt with hardware evolution:

  • Fork is incompatible with a single address space.
    • Some new OS architecture only have one universal address space. Fork is incompatible.
  • Fork is incompatible with heterogeneous hardware.
    • OS cannot duplicate the process state on the heterogeneous hardware like NIC/GPU.
  • Fork infects an entire system.
    • Every layers of the system need to implement fork.


  • High-level: Spawn
    • Significant performance, reliability and portability advantages
    • Some less-common operations are not yet supported
    • Still lacks an effective error-reporting mechanism
  • Alternative: vfork()
    • unsafe
  • Low-level: Cross-process operations
    • There is an alternative model where system calls that modify per-process state are not constrained to merely the current process, but rather can manipulate any process to which the caller has access.
  • Alternative: clone()
    • It provide more flags, like fork, behavior of which is implicit or undefined for many abstractions.
    • Still suffer most of fork problems.

Fork-only use-cases

There exist special cases where fork is not followed by exec, that rely on duplicating the parent

  • Multi-process servers
  • Copy-on-write memory


to rectify the situation:

  • Deprecate fork
    • should use a spawn API
  • Improve the alternatives
  • Fix our teaching


  • New perspective.


  • Too many complaints.
  • Given no solutions.
  • Did not explain why even if there are so many problems, people still use this function.


Welcome to my other publishing channels