1759 'Distribution' OCT-21

This is a 'Paper Reading' post for Course ECE1759. The topic is 'Distribution'. This paper list is here:

  • Leslie Lamport, Time, Clocks, and the Ordering of Events in a Distributed System, Communications of the ACM 21(7):558-565, July 1978.
  • D. R. Cheriton and W. Zwaenepoel, The Distributed V Kernel and its Performance for Diskless Workstations, Proceedings of the 9th Symposium on Operating Systems Principles, pp. 129-140, November 1983.
  • Cary G. Gray and David R. Cheriton, Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency, Proceedings of the Twelfth ACM Symposium on Operating Systems Priciples (SOSP), December 1989, Litchfield Park, AZ, USA.

Time, Clocks, and the Ordering of Events in a Distributed System


  • Lamport Timestamp/Lamport Clock
  • A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.
  • A system is distributed if the message transmission delay is not negligible compared to the time between events in a single process
  • The relation "happened before" is therefore only a partial ordering of the events in the system.

Core Ideas

The Partial Ordering

  • Computers do not always have a clock and clocks can be inaccurate. --> define the "happened before" relation without using physical clocks.
  • Just like we learn in undergraduate courses.
  • About concurrent: Two events are concurrent if neither can causally affect the other.

Logical Clocks

  • A clock is just a way of assigning a number to an event, where the number is thought of as the time at which the event occurred.
  • May be implemented by counters with no actual timing mechanism.
  • Clock Condition. For any events a, b: if a---> b then C(a) < C(b).
    • C1. If a and b are events in process P~, and a comes before b, then Ci(a) < Ci(b).
    • C2. If a is the sending of a message by process Pi and b is the receipt of that message by process Pi, then Ci(a) < Ci(b).
  • Implementation rule of the system of clocks:
    • Each process P~ increments Ci between any two successive events.
    • If event a is the sending of a message m by process P~, then the message m contains a timestamp Tm= Ci(a). (b) Upon receiving a message m, process Pi sets Ci greater than or equal to its present value and greater than Tin.

Ordering the Events Totally

  • We can use a system of clocks satisfying the Clock Condition to place a total ordering on the set of all system events.
  • The total ordering varies. But the partial ordering is uniquely determined by the system of events.
  • To simplify the implementation of a clock system determined by the system of events, the paper make three assumptions
    • for any two processes Pi and Pj, the messages sent from Pi to Pi are received in the same order as they are sent.
    • every message is eventually received.
    • a process can send messages directly to every other process.

Anomalous Behavior

  • The actual order may be different from the order based on clock systems.
  • Two possible ways to avoid such anomalous behavior.
    • Explicitly introduce into the system the necessary information about the ordering
    • Use physical clocks to eliminate anomalous behavior. construct a system of clocks which satisfies the following condition.
      • Strong Clock Condition. For any events a, b in the set: if a --> b then C(a} < C(b).
      • set is some set of "real" events in physical space-time, and let --> be the partial ordering of events defined by special relativity.

Physical Clocks

Too abstract


  • Meaningful as a ideal algorithm. The timestamp is the base of all the distributed systems.


  • Not practical: no perfect network, the machine can crash


The Distributed V Kernel and its Performance for Diskless Workstations


  • Introduce a message-oriented kernel that provides uniform local and network interprocess communication called the distributed V kernel.

  • The system is used in an environment of diskless workstations connected by a high-speed local network to a set of file servers. Over a local network

    • Diskless workstations can access remote files with minimal performance penalty.
    • The V message facility can be used to access remote files at comparable cost to any well-tuned specialized file access protocol.
  • Of particular interest are the controversial aspects of our approach, namely:

    • The use of diskless workstations with all secondary storage provided by backend file servers.
    • The use of a general purpose network interprocess communication facility (as opposed to special-purpose file access protocols) and, in particular, the use of a Thoth-like interprocess communication mechanism.
  • Diskless workstations' advantages:

    • Lower hardware cost per workstation.
    • Simpler maintenance and economies of scale with shared file servers.
    • Little or no memory or processing overhead on the workstation for file system and disk handling.
    • Fewer problems with replication, consistency and distribution of files.
  • Diskless workstations' disadvantage: the overhead of the network.

  • The V did not use stream protocol, on the contrary it adopted a synchronous "request-response" model of message-passing and data transfer which does not support application-level use of streaming.

Core Ideas

V Kernel Interprocess Communication


  • Message exchange of Send, Receive, Reply
    • any number of MoveTo or MoveFrom between message being received and reply being sent.
  • Primitives
    • Send(message, pid)
    • pid = Recieve(message)
    • (pid, count) = ReceiveWithSegment(message, segptr, segsize)
    • Reply(message, pid)
    • ReplyWithSegment(message, pid. destptr, segptr, segsize)
    • MoveFrom(srcpid, desk sre, count)
    • MoveTo(destpid, desk src, count)
    • SetPid(Iogicalid. pid, scope)
    • pid = GetPid(logicalid, scope)


  • Discussion of synchronous message-passing model
    • Synchronous request-response message communication make it easy to use for programmers.
    • The distinction between small messages and a separate data transfer facility ties in well with common usage pattern.
    • Finally, synchronous communication and small, fixed-size messages reduce queuing and buffering problems in the kernel. (large amounts of data are transferred directly between users' address spaces without extra copies)
  • The paper found that the semantics of the primitives facilitated an efficient distributed implementation. The only major departure from Thoth was the explicit specification of segments in messages and the addition of the primitives ReceivettqthSegment and ReplyWithSegment.

Implementation Issues

  • Remote operations are implemented directly in the kernel
  • lnterkernel packets use the "raw" Ethernet data link level. (IP has 20% increase of overhead)
  • Request-response pattern helps implement reliability with raw ethernet
  • Mapping from process id to location is aided by a host specification
  • No per-packet acknowledgements for large data transfers
  • File page-level transfers require the minimal number of packets

Process Naming

  • Uses a global (fiat) naming space for specifying processes.
    • Unique within local network
  • Currently, in the 10 Mb implementation, the top 8 bits of the logical host identifier are the physical network address of the workstation.
  • In the 10 Mb implementation, a table maps logical hosts to network addresses

Remote Message implementation

  • If process is not local uses the NonLocalSend
    • remote machine receives a packet containing the request
    • remote kernel creates an alien process descriptor to represent the remote sending process as a standard in-kernel descriptor
  • original sender re-transmits packet after a timeout, T
  • remote kernel can respond with packet, or a reply-pending packet in the case no descriptors are available.

Remote Data Transfer

  • MoveTo/MoveFrom transmits the data in a sequence of maximally-sized packets to the destination workstation.
    • single ack packet.
  • MoveTo and MoveFrom, there is no queueing or buffering
    • basically one copy from memory to NIC, then NIC, to destination.
  • full retransmission since last correctly received packet.
  • page-level file access request involve intermediate amount of data.

Remote Segment Access

  • Previous protocol good for small amounts of data, or tens of packets
  • Larger sized require a better protocol
  • File access implemented with IO protocol originally developed for Verex kernel
    • For a two-packet (data sized) transfer, it requires 4 network messages.
      • Double of a specialized protocol
  • Receive/ReplyWithSegment
    • To put request with data.


  • The system is fast only because its network is fast. (10M)


Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency


  • Leases are proposed as a time-based mechanism that provides efficient consistent access to cached data in distributed systems. Non-Byzantine failures affect performance, not correctness, with their effect minimized by short leases.

  • Try to solve the problem of ensuring consistency between the cached data and its primary location of storage.

  • Definition of consistency: By consistent, we mean that the behavior is equivalent to there being only a single (uncached) copy of the data except for the performance benefit of the cache

Core Ideas

Leases and Cache Consistency

  • A cache using leases requires a valid lease on the datum (in addition to holding the datum) before it returns the datum in response to a read, or modifies the datum in response to a write.
  • Short lease terms have several advantage
    • Minimize the delay resulting from client and server failures.
    • Minimize the false write-sharing that occurs.
    • Reduce the storage requirements at the server
  • Long lease terms have several advantage
    • More efficient both for the client and server on files that are accessed repeatedly and have relatively little write-sharing.


  • Cache:
    • Invalidate VS Update
    • Write-Through VS Write-Back
  • High performance: Invalidate & Write-Back
  • Leases: Invalidate & Write-Through


  • Cannot cope with the clock drift problems.


Welcome to my other publishing channels