This is a multipart series where I will go over the support for Large Block Sizes (LBS) in Linux.

This article covers the previous attempts at enabling LBS in the Linux kernel. There were three major efforts, which I will go through in turn:

  • 2007: Christoph Lameter posted Large Block Size support
  • 2007 & 2009: Nick Piggin posted fsblock & fsblock v2.
  • 2018: Dave Chinner xfs: Block size > PAGE_SIZE support

If you only care about the final attempt that made it upstream, then please refer to the next part.

This post requires some understanding of Linux kernel internals. Before reading it, I would highly recommend checking out the Part 1 section: Why the limitation on block sizes?.

2007: Christoph Lameter posted Large Block Size support

The initial use case for LBS came from the CD/DVD world, where block sizes were typically in the range of 32k/64k. Lameter sent patches to enable LBS by making changes mainly in the page cache.

The main idea was to use compound page allocations in the page cache that match the block size of the device. The crux of the change is that we record the allocation order on the page cache mapping and make sure to always allocate with that order:

static inline void set_mapping_order(struct address_space *a, int order)
{
	a->order = order;                        /* allocation order for this mapping */
	a->shift = order + PAGE_SHIFT;
	a->offset_mask = (1UL << a->shift) - 1;
	if (order)
		a->flags |= __GFP_COMP;          /* higher-order allocations must be compound pages */
}

static inline int mapping_order(struct address_space *a)
{
	return a->order;
}
...
static inline struct page *__page_cache_alloc(gfp_t gfp, int order)
{
	return alloc_pages(gfp, order);          /* before this change, order was always 0 */
}

This is a simplification of the patchset, but it is the main meat of the changes. We will see in the next post that the current implementation resembles the approach taken 17 years ago.
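To make this concrete, here is a sketch of how a filesystem could have used these hooks so that every page cache allocation for an inode matches its block size. This is not code from the patchset; the helper name and the 64k example are mine.

/*
 * Illustrative sketch only -- not from the original patchset.
 * A filesystem with a 64k block size on a 4k PAGE_SIZE system would
 * pass blocksize_bits = 16, giving an allocation order of 4.
 */
static void example_setup_mapping(struct inode *inode,
				  unsigned int blocksize_bits)
{
	int order = blocksize_bits - PAGE_SHIFT;

	/* Block sizes <= PAGE_SIZE keep the historic order-0 behaviour. */
	if (order < 0)
		order = 0;

	set_mapping_order(inode->i_mapping, order);
}

With the order recorded on the mapping, every __page_cache_alloc() call for that file allocates a compound page covering exactly one filesystem block.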

The patchset was rejected because it added more complexity to the core VM subsystem, and the implementation could not handle faults on larger pages, which is needed to make mmap() work. So the patchset simply disabled mmap() when LBS was enabled, which is not great.

2007 & 2009: Nick Piggin posted fsblock & fsblock v2

To understand the motivation behind fsblock, we first need to understand struct buffer_head in the kernel. The buffer_head structure tracks buffers in memory; buffers are in-memory copies of disk blocks from a block device, and a logical disk block can correspond to multiple sectors on disk. The main motivation of this series was to completely rip out struct buffer_head, as it is some of the oldest code in the kernel, yet many filesystems still depend on it. fsblock was an attempt to improve the “buffer” layer that sits between a filesystem and a block device.

One of the promises of the fsblock rewrite was the ability to support large block sizes in filesystems. The maximum block size supported by struct buffer_head is limited by the system's PAGE_SIZE. This limitation is embedded in the design of buffer heads: a single buffer can never be larger than a page, and the number of buffers a page can hold is tied to PAGE_SIZE.

The following is struct buffer_head [2]:

struct buffer_head {
        unsigned long        b_state;          /* buffer state flags */
        atomic_t             b_count;          /* buffer usage counter */
        struct buffer_head   *b_this_page;     /* buffers using this page */
        struct page          *b_page;          /* page storing this buffer */
        sector_t             b_blocknr;        /* logical block number */
        ...
};

The buffer's data lives in the page pointed to by the b_page field. Even though b_page could in principle point to a compound page, buffer_head was designed with b_page holding a single page in mind, so a single buffer can be at most one page in size (see MAX_BUF_PER_PAGE).
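For reference, the cap looks roughly like this in include/linux/buffer_head.h (older kernels spelled it with PAGE_CACHE_SIZE):

/*
 * Roughly as defined in include/linux/buffer_head.h: the most buffers
 * a single page can carry, assuming the smallest 512-byte sector size.
 * On a 4k page that is 8 buffers, and since one buffer maps at most one
 * page, a filesystem block can never exceed PAGE_SIZE.
 */
#define MAX_BUF_PER_PAGE (PAGE_SIZE / 512)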

Many filesystems at that time used the buffer_head structure to cache block device reads in memory. This led to the limitation that the logical block size of the underlying block device could be at most the host PAGE_SIZE.

Similar to buffer_head, the fsblock struct holds a disk block in memory, but it adds the concept of a “superpage block” that can span multiple pages. Unlike buffer_head, fsblock does not limit the size of a block: each disk block can be larger than PAGE_SIZE, because a single fsblock struct can map multiple pages. This enables filesystems to have LBS support.
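Conceptually, an fsblock-style descriptor is keyed on the block rather than on the page. The sketch below is purely illustrative; the structure name and fields are hypothetical and not taken from Piggin's patches.

/*
 * Hypothetical illustration of the idea behind fsblock; the real
 * structure in the patches has different fields and locking.
 */
struct example_fsblock {
	unsigned long	flags;		/* block state: dirty, uptodate, ... */
	unsigned int	block_bits;	/* log2(block size); may exceed PAGE_SHIFT */
	sector_t	block_nr;	/* logical block number on the device */
	struct page	*first_page;	/* first of the 1 << (block_bits - PAGE_SHIFT)
					 * pages backing a "superpage block" */
};

Because the descriptor knows how many pages back the block, a block larger than PAGE_SIZE is simply more pages behind the same fsblock.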

This patchset did not get any traction, probably because it was proposed as a complete replacement for buffer_head instead of an incremental improvement.

2018: Dave Chinner xfs: Block size > PAGE_SIZE support

This was the first patchset that came very close to adding block size > PAGE_SIZE support in XFS. The VFS iomap library came out of XFS as a generic library that provides helpers to interact with the page cache and the storage device. iomap was designed in a way that could support block sizes larger than the page size.

This patchset extended iomap to deal with block size > page size and circumvent the limitation of the page cache. It adds a new flag, IOMAP_F_ZERO_AROUND, to iomap. This flag tells the iomap layer to zero the whole block even when the IO covers only part of it (a sub-block IO).

[Figure: sub-block IO requiring zeroing around]

Direct IO: Minimal changes were required to support LBS in the direct IO path. All that needs to be done is pad the IO with zeroes so that it occupies the entire block.
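A sketch of the idea (the helper is hypothetical and assumes a power-of-two block size, as the kernel's round_down()/round_up() macros require):

/*
 * Hypothetical helper: expand a sub-block direct IO to full block
 * granularity. The ranges [start, pos) and [pos + len, end) are
 * submitted as zeroes so the device only ever sees whole blocks.
 */
static void example_expand_dio_to_block(loff_t pos, size_t len,
					unsigned int blocksize,
					loff_t *start, loff_t *end)
{
	*start = round_down(pos, blocksize);
	*end = round_up(pos + len, blocksize);
}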

Writeback: Writeback is the process Linux uses to write dirty pages that have been modified in the page cache back to the backing device. For example, before doing a direct IO on a file range, iomap should first write back any dirty pages that overlap with that range.

This patchset removes the writepage callback and forces the memory management (MM) layer to use the writepages callback instead. These callbacks are used by the kernel to write dirty pages to the backing device, for example under memory pressure. This is the first step towards ensuring we can write back multiple pages that belong to one FS block. The minimum data unit a filesystem works with is a filesystem block (FSB), so it is important that the whole block gets written to disk instead of partial block writes. The patch also changes the writepages callback by modifying the writeback range to cover whole “blocks”:

/*
 * If the block size is larger than page size, extend the incoming write
 * request to fsb granularity and alignment. This is a requirement for
 * data integrity operations and it doesn't hurt for other write
 * operations, so do it unconditionally.
 */
if (wbc->range_start)
	wbc->range_start = round_down(wbc->range_start, bsize);
if (wbc->range_end != LLONG_MAX)
	wbc->range_end = round_up(wbc->range_end, bsize);
if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
	wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);

Buffered IO: IOMAP_F_ZERO_AROUND was mainly added to support buffered IO. From the commit message (as Dave Chinner is much more expressive than me):

For block size > page size, a single page write is a sub-block
write. Hence they have to be treated differently when these writes
land in a hole or unwritten extent. The underlying block is going to
be allocated, but if we only write a single page to it the rest of
the block is going to be uninitialised. This creates a stale data
exposure problem.

To avoid this, when we write into the middle of a new block, we need
to instantiate and zero the pages in the block around the current
page. When writeback occurs, all the pages will get written back and
the block will be fully initialised.
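A rough illustration of what zero-around means for the page cache (a hypothetical helper, not the actual patch; the real series wires this through iomap's buffered write path):

/*
 * Hypothetical sketch: when a buffered write lands in a newly
 * allocated block, instantiate and zero every other page of that
 * block so that writeback later writes a fully initialised block.
 */
static int example_zero_around(struct address_space *mapping,
			       pgoff_t index, unsigned int pages_per_block)
{
	pgoff_t start = round_down(index, pages_per_block);
	pgoff_t i;

	for (i = start; i < start + pages_per_block; i++) {
		struct page *page;

		if (i == index)				/* the page being written */
			continue;
		page = grab_cache_page(mapping, i);	/* find or create the page, locked */
		if (!page)
			return -ENOMEM;
		zero_user(page, 0, PAGE_SIZE);		/* fill with zeroes */
		SetPageUptodate(page);
		set_page_dirty(page);
		unlock_page(page);
		put_page(page);
	}
	return 0;
}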

IOMAP_F_ZERO_AROUND did not make it mainline because the kernel started the folio conversion around this time, which solves the zero-around problem from the memory management layer. The final implementation that got upstreamed relies heavily on large folios.

Even though this patchset did not get LBS support upstream, many XFS bugs blocking this feature were fixed as part of this series, which made it easier to support LBS in XFS later.

Conclusion:

All the previous efforts in supporting LBS revolved around two things:

  • Each allocation in the page cache matches the FSB.
  • No partial FSB should be written to the disk at any point.

In the next post, I will cover the final LBS support that is currently in the process of being mainlined.

Happy reading!

References:

[1] Large blocksize support
[2] This is a snapshot from the 2.6 kernel; newer versions have more fields.