1759 'File System' OCT-28

This is a 'Paper Reading' post for the course ECE1759. The topic is 'File System'. The paper list is:

  • Marshall K. McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry, "A Fast File System for Unix," ACM Transactions on Computer Systems, 2(3), August 1984, pp. 181-197.
  • Mendel Rosenblum and John K. Ousterhout, "The Design and Implementation of a Log-Structured File System," Proceedings of the 13th ACM Symposium on Operating Systems Principles, December 1991.
  • Gregory R. Ganger, Marshall Kirk McKusick, Craig A.N. Soules, and Yale N. Patt. "Soft Updates: A Solution to the Metadata Update Problem in File Systems," ACM Transactions on Computer Systems, Vol. 18, No. 2, May 2000, Pages 127-153.

A Fast File System for Unix


  • Original UNIX file system:
    • File system I/O is buffered by the kernel; there are no alignment constraints on data transfers, and all operations are made to appear synchronous.
    • A file system never spans multiple partitions.
  • The original 512-byte UNIX file system is incapable of providing the data throughput rates that many applications require.
  • A file system providing higher bandwidth is needed.
  • Several improvements from previous work:
    • Changing the basic block size from 512 to 1024 bytes.
      • Most files could be described without needing to access indirect blocks, since the direct blocks contained twice as much data.
    • Have a process that periodically reorganized the data on the disk to restore locality
      • Although the free list was initially ordered for optimal access, it quickly became scrambled as files were created and removed.

Core Ideas


  • The superblock is replicated to protect against catastrophic loss.
    • Since the superblock data does not change, the copies need not be referenced unless a hard error occurs.
  • To create files as large as 2^32 bytes with only two levels of indirection, the minimum file system block size is 4096 bytes.
    • Cannot be changed without rebuilding the file system.
  • Divides a disk partition into one or more areas called cylinder groups.
    • Each consists of one or more consecutive cylinders on a disk.
    • At a varying offset from the beginning of each cylinder group, there are
      • a redundant copy of the superblock
      • space for inodes
      • a bit map describing available blocks in the cylinder group
      • summary information describing the usage of data blocks within the cylinder group
    • The bit map of available blocks in the cylinder group replaces the traditional file system's free list.
    • For each cylinder group, a static number of inodes is allocated at file system creation time.
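The cylinder-group bookkeeping above can be sketched as a toy model. Everything here is illustrative (field names and sizes are invented, not the real on-disk layout), assuming only that the bit map marks each block as free or allocated:

```python
from dataclasses import dataclass

@dataclass
class CylinderGroup:
    superblock_copy: dict   # redundant copy of the (static) superblock
    inodes: list            # statically allocated at file system creation
    block_bitmap: list      # True = block free; replaces the old free list

    def free_blocks(self) -> int:
        # the summary information can be derived from the bit map
        return sum(self.block_bitmap)

# A toy group with 8 blocks, 3 of them already allocated:
cg = CylinderGroup(superblock_copy={}, inodes=[None] * 16,
                   block_bitmap=[True] * 5 + [False] * 3)
print(cg.free_blocks())  # -> 5
```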

Larger Blocks

  • More data can be transferred in a single disk transaction.
  • But larger blocks waste space for small files.
    • Solution: allow a single file system block to be divided into fragments. A block can optionally be broken into 2, 4, or 8 fragments, each of which is addressable.
    • The block map associated with each cylinder group records the space available in a cylinder group at the fragment level.
    • For performance, whole blocks are allocated when possible.
  • Since file systems with different block sizes may reside on the same system, the file system interface has been extended to provide application programs the optimal size for a read or write:
    • For files, the optimal size is the block size of the file system on which the file is being accessed.
    • For other objects, such as pipes and sockets, the optimal size is the underlying buffer size.
  • A file system cannot be kept completely full.
    • A free space reserve gives the minimum acceptable percentage of file system blocks that should be free.
  • The percentage of wasted space is comparable to that of the old file system.
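The block/fragment arithmetic can be illustrated with a toy calculation. The sizes are the paper's example (4096-byte blocks split into 1024-byte fragments); the function is illustrative, not FFS code:

```python
BLOCK = 4096   # file system block size
FRAG = 1024    # a block may be split into 2, 4, or 8 fragments; here 4

def layout(size: int) -> tuple:
    """Return (full_blocks, fragments) needed to hold `size` bytes:
    whole blocks for the bulk of the file, fragments for the partial tail."""
    full_blocks, rest = divmod(size, BLOCK)
    frags = -(-rest // FRAG)  # ceiling division for the partial last block
    return full_blocks, frags

print(layout(11000))  # -> (2, 3): 2 full blocks plus 3 fragments for 2808 bytes
print(layout(4096))   # -> (1, 0): exactly one block, no fragments wasted
```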

File System Parameterization

  • The allocator tries to place a file's data on consecutive blocks on the same cylinder.
  • On a processor with an I/O channel that does not require any processor intervention between mass storage transfer requests, two consecutive disk blocks can often be accessed without suffering lost time because of an intervening disk revolution.
  • For processors without I/O channels, the main processor must field an interrupt and prepare for a new disk transfer. The expected time to service this interrupt and schedule a new disk transfer depends on the speed of the main processor.
  • Based on the physical characteristics of each disk, the allocation routines can calculate the number of milliseconds required to skip over a block.
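A hedged sketch of that calculation: given the disk's rotation time and the processor's interrupt-service latency, the allocator can compute how many blocks to skip so the head arrives at the next block just as the transfer is issued. All numbers below are invented for illustration:

```python
import math

def blocks_to_skip(ms_per_rev: float, blocks_per_track: int,
                   service_ms: float) -> int:
    """Blocks to skip so a full rotation is not lost while the CPU
    services the interrupt and schedules the next transfer."""
    ms_per_block = ms_per_rev / blocks_per_track
    return math.ceil(service_ms / ms_per_block)

# A 3600 RPM disk (16.7 ms/rev), 8 blocks per track, 4 ms interrupt service:
print(blocks_to_skip(16.7, 8, 4.0))  # -> 2
```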

Layout Policies

  • Two distinct parts:
    • Global policies: use file system wide summary information to make decisions regarding the placement of new inodes and data blocks
    • Local allocation routines: use a locally optimal scheme to lay out data blocks
  • Two methods for improving file system performance
    • Increase the locality of reference to minimize seek latency
    • Improve the layout of data to make larger transfers possible
  • Global policies try to balance between:
    • localizing data that are concurrently accessed
      • Too much localization degenerates into the old file system's behavior.
    • spreading out unrelated data
      • Too much spreading causes long seeks and high latency.
  • The policy routines try to place all data blocks for a file in the same cylinder group, but redirect block allocation to a different cylinder group when a file exceeds 48 kilobytes.
    • This prevents any single cylinder group from becoming completely full.
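The global placement policy above can be sketched in a few lines. The 48 KB threshold is from the paper, but the redirection target here (the group with the most free blocks) is a stand-in for the paper's actual choice of a group with an above-average amount of free space:

```python
REDIRECT_THRESHOLD = 48 * 1024  # redirect allocation past this file size

def pick_group(file_size: int, home_group: int, free_per_group: list) -> int:
    if file_size <= REDIRECT_THRESHOLD:
        return home_group  # keep the file's blocks local to its inode's group
    # Redirect: choose a different group; "most free blocks" is illustrative.
    others = [g for g in range(len(free_per_group)) if g != home_group]
    return max(others, key=lambda g: free_per_group[g])

print(pick_group(10 * 1024, 1, [5, 9, 20]))  # -> 1 (small file stays home)
print(pick_group(64 * 1024, 1, [5, 9, 20]))  # -> 2 (large file redirected)
```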


Performance

  • Uses much more of the disk bandwidth (47% vs. 3-5% for the old file system).
  • Faster reads and writes.
  • The reading rate is always at least as fast as the writing rate.
    • In contrast, the old file system is about 50 percent faster at writing files than reading them.
      • This is because the write system call is asynchronous, so the kernel can generate disk transfer requests much faster than they can be serviced.
  • Performance is limited by memory-to-memory copy operations.


Functional Enhancements

  • Long File Names
    • Up to 255 characters.
    • Implemented with variable-length directory entries.
  • File Locking
    • Use a separate "lock" file.
  • Symbolic Links
  • Rename
  • Quotas


Summary

  • Fast.
  • Larger files are supported.


The Design and Implementation of a Log-Structured File System

  • The log is the only structure on disk.
  • Uses a segment cleaner to compact the log and reclaim free space.
  • Uses a high fraction of the disk bandwidth (70% vs. 5-10%).


Introduction

  • Assumptions:
    • files are cached in main memory and increasing memory sizes will make the caches more and more effective at satisfying read requests
      • -> disk traffic will become dominated by writes
  • Benefits of the log's sequential nature:
    • high write performance
    • faster crash recovery
  • Difference from others:
    • Other systems use a log only as an aid for speeding up writes and crash recovery, with a separate home location for data.
    • LFS stores data permanently in the log; it is the only structure on disk.
  • Challenge: garbage collection (the segment cleaner).

Core Ideas

Design for file systems of the 1990s

Problems with existing file systems

  • FFS physically spreads related information around the disk:
    • file data is separated from its inode
    • the inode is separated from the directory entry containing the file's name
    • as a result, it takes at least five separate disk I/Os, each preceded by a seek, to create a new file
  • Existing file systems tend to write metadata synchronously, so the application waits on the disk.

Log-structured file systems

  • Fundamental idea
    • buffering a sequence of file system changes in the file cache and then writing all the changes to disk sequentially in a single disk write operation.
  • LFS converts the many small synchronous random writes of traditional file systems into large asynchronous sequential transfers that can utilize nearly 100% of the raw disk bandwidth.
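The fundamental idea can be shown with a toy model: dirty blocks accumulate in the file cache and are flushed to disk as one large sequential segment write instead of many small random writes. The class and the tiny segment size are invented (real segments are 512 KB or 1 MB):

```python
class LogFS:
    def __init__(self, segment_size: int = 4):
        self.cache = []               # buffered dirty blocks
        self.log = []                 # the on-disk log, append-only
        self.segment_size = segment_size

    def write(self, block: str) -> None:
        self.cache.append(block)
        if len(self.cache) >= self.segment_size:
            self.flush()

    def flush(self) -> None:
        # one sequential write of the whole segment to the end of the log
        self.log.append(tuple(self.cache))
        self.cache.clear()

fs = LogFS()
for b in ["inode", "data1", "data2", "dirent"]:
    fs.write(b)
print(fs.log)  # -> [('inode', 'data1', 'data2', 'dirent')]: one segment write
```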

File location and reading

  • Inode
    • 10 direct addresses.
    • one or more indirect blocks for larger files.
    • Sprite LFS doesn't place inodes at fixed positions; they are written to the log, and an inode map records each inode's current location.
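Because inodes have no fixed home, Sprite LFS keeps an inode map from inode number to the inode's current position in the log; a plain dictionary makes a minimal stand-in (the log addresses below are invented):

```python
inode_map = {}  # inode number -> current log address of the inode

def write_inode(inum: int, log_addr: int) -> None:
    # Each rewrite of an inode lands at a new log position; the map
    # always points at the newest copy.
    inode_map[inum] = log_addr

write_inode(7, 100)   # inode 7 first written at log address 100
write_inode(7, 260)   # rewritten later, further along the log
print(inode_map[7])   # -> 260: lookups always find the latest copy
```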

Free space management: segments

  • Two choices
    • Threading
      • Leave live data in place and thread the log through the free blocks.
      • Free space becomes fragmented, so writes are no longer contiguous.
    • Copying
      • Copy live data out of the log to compact it.
      • Incurs copying overhead.
  • LFS uses a combination of both:
    • Any given segment (512 KB or 1 MB) is always written sequentially from its beginning to its end, and all live data must be copied out of a segment before the segment can be rewritten.

Segment cleaning mechanism

  • Three-step algorithm:
    • Read a number of segments into memory.
    • Identify the live blocks within them.
    • Write the live data back to a smaller number of clean segments.
  • A segment summary block is written as part of each segment.
    • Identifies each file appearing in the segment and each block of each file.
    • Imposes little overhead during writing and is also useful during crash recovery.
  • Checking liveness:
    • segment summary information -> block's identity (file number and block number) -> file's inode -> check whether the inode still points at this block
    • a version number in the inode map entry, incremented whenever the file is deleted or truncated to length zero, allows dead blocks to be detected quickly without examining the inode
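The version-number shortcut above can be sketched directly. Each segment summary records (inode number, version) for its blocks; if the inode map's current version differs, the file was deleted or truncated since the block was written, so the block is dead without consulting the inode at all. The data below is invented:

```python
# Current version per inode number, as the inode map would record it:
inode_map_version = {7: 2, 9: 5}

def maybe_live(summary_inum: int, summary_version: int) -> bool:
    """Cheap first check: a mismatched version proves the block is dead;
    a match means the inode's pointers must still be examined."""
    return inode_map_version.get(summary_inum) == summary_version

print(maybe_live(7, 2))  # -> True: versions match, fall through to the inode
print(maybe_live(9, 4))  # -> False: file deleted/truncated since this write
```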

Segment cleaning policies

  • When should the cleaner run?
  • How many segments should be cleaned at once?
  • Which segments should be cleaned? (more important)
  • How should cleaned blocks be grouped? (more important)
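The paper answers the "which segments" question with a cost-benefit policy: clean the segment that maximizes benefit/cost = (1 - u) * age / (1 + u), where u is the fraction of the segment still live and age is the time since its data was last modified. A direct transcription of the formula:

```python
def cost_benefit(u: float, age: float) -> float:
    """Benefit: free space generated (1 - u), weighted by how long it is
    likely to stay free (age). Cost: read the whole segment (1) plus
    write back the live fraction (u)."""
    return (1 - u) * age / (1 + u)

# An old, half-live segment beats a young, mostly-empty one:
print(round(cost_benefit(0.5, 100), 2))  # -> 33.33
print(round(cost_benefit(0.1, 5), 2))    # -> 4.09
```

This is why cold (rarely modified) segments are worth cleaning even at fairly high utilization, while hot segments are left alone until they empty out on their own.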

Simulation results

Segment usage table

  • For each segment, the segment usage table records the number of live bytes in the segment and the most recent modified time of any block in the segment.
    • These two values are used by the segment cleaner when choosing segments to clean.

Crash recovery

  • LFS knows where the last modifications were made (the end of the log), so it knows where any inconsistencies might be.
  • Use checkpoints and roll-forward to recover.


Checkpoints

  • Flush all modified data, then write the inode map, segment usage table, and end-of-log pointer to the checkpoint region.
  • At reboot, read both checkpoint regions and initialize in-memory structures from the more recent complete one (a timestamp written last marks a complete checkpoint).
  • Checkpoints are taken every 30 seconds (ideally they would instead be taken after a given amount of new data has been written).
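The two-region scheme above can be sketched as a toy: LFS alternates between two checkpoint regions and timestamps each one, so after a crash it can pick the newer region that was written completely. The field names are invented:

```python
regions = [None, None]  # the two on-disk checkpoint regions
turn = 0                # which region the next checkpoint overwrites

def checkpoint(state: dict, time: int) -> None:
    global turn
    regions[turn] = {"state": state, "time": time}
    turn = 1 - turn  # alternate, so one valid region always survives a crash

def recover() -> dict:
    # Take the most recent region that exists; a real system would also
    # verify the region was written completely before trusting it.
    valid = [r for r in regions if r is not None]
    return max(valid, key=lambda r: r["time"])["state"]

checkpoint({"end_of_log": 100}, time=1)
checkpoint({"end_of_log": 250}, time=2)
print(recover())  # -> {'end_of_log': 250}
```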


Roll-forward

  • Read the segments written after the last checkpoint and recover the data they contain.
  • Recently created files can be lost if their data blocks were written but the inode was not.
  • Use write-ahead logging to maintain consistency between directory entries and inodes.


Thoughts

  • The paper's assumption is not entirely reasonable: the cache cannot absorb all reads, and write performance is hard to measure because most writes go to a cache (the system's buffer cache or the drive's firmware cache) rather than directly to disk.
  • Although it did not foresee the characteristics of SSDs, LFS happens to fit them well.

