A small history on Large block sizes in Linux: Part 1

This is a multipart series in which I will go over the support for large block sizes in Linux.

What is a Large block anyway?

Before writing articles about it, it is important to know what a Large Block Size (LBS) is. In the context of Linux, LBS refers to any block size that is greater than the page size of the system. Block size here can mean either the logical block size of a block device or the filesystem block size (FSB) of a filesystem.

Linux has traditionally supported only block sizes that are less than or equal to the system page size. We shall discuss the rationale later in this article. For example, on an x86_64 system, where the page size is 4k, the block size of a filesystem or block device can never be greater than 4k bytes. Is that an issue for most people? The most likely answer is no, but sometimes having a system that supports LBS can be helpful.
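To make the definition concrete, here is a minimal Python sketch that compares a few block sizes against the page size of the machine it runs on. The helper name is_large_block is made up for illustration; os.sysconf("SC_PAGE_SIZE") is the usual way to query the page size from Python on Unix-like systems.

```python
import os


def is_large_block(block_size, page_size):
    """Return True when block_size falls under the LBS definition (block > page)."""
    return block_size > page_size


# Page size of the running system, e.g. 4096 on a typical x86_64 machine.
PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

for bs in (512, 4096, 16384, 65536):
    kind = "LBS" if is_large_block(bs, PAGE_SIZE) else "not LBS"
    print(f"block size {bs:>6} with page size {PAGE_SIZE}: {kind}")
```

Note that the same 64k block size counts as LBS on a 4k-page system but not on a system whose page size is already 64k.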

Why Large block size?

One of the earliest use cases for LBS, dating back to 2007, came from CDs and DVDs, which have larger block sizes of around 32k and 64k [1]. People worked around the limitation with a shim layer, but this came at a cost in I/O speed.
Another use case for LBS is the growing capacity of SSDs (high capacity SSDs). Because these SSDs need a bigger mapping table, which leads to increased RAM costs, device manufacturers are increasing the granularity at which they do the mapping (the indirection unit) to reduce that cost. I wrote a detailed article about the indirection unit and its effect on WAF here.
A third use case is mounting a filesystem that was formatted with a block size larger than what the current system supports. Let’s say a drive was formatted with a 64k block size on a PowerPC system (where the page size is 64k), but the drive needs to be analyzed on an x86_64 system. This is currently impossible, as x86_64 cannot mount a filesystem with a 64k block size.
Finally, a database might have a bigger “page size” than the underlying filesystem’s block size due to the LBS limitation. It would be much more useful if the filesystem and the database shared the same notion of a block size, which might simplify database operations.
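The RAM argument for high capacity SSDs can be sketched with some back-of-the-envelope arithmetic. The numbers here are assumptions for illustration only: a flat mapping table with one 4-byte entry per indirection unit, and a hypothetical 16 TiB drive. They do not describe any specific device.

```python
ENTRY_BYTES = 4  # assumption: one 4-byte mapping entry per indirection unit

KiB, GiB, TiB = 2**10, 2**30, 2**40


def mapping_table_bytes(capacity_bytes, iu_bytes, entry_bytes=ENTRY_BYTES):
    """Size of a flat logical-to-physical table with one entry per indirection unit."""
    entries = capacity_bytes // iu_bytes
    return entries * entry_bytes


capacity = 16 * TiB  # hypothetical drive capacity

for iu in (4 * KiB, 16 * KiB, 64 * KiB):
    table = mapping_table_bytes(capacity, iu)
    print(f"IU {iu // KiB:>2} KiB -> mapping table of {table / GiB:g} GiB")
```

Under these assumptions, every 4x increase in the indirection unit shrinks the mapping table by 4x, which is the cost saving that pushes manufacturers toward larger units.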

Why the limitation on block sizes?

TL;DR: the Linux page cache.

The page cache is an integral part of Linux when accessing a filesystem. It can be thought of as a simple buffer cache that the kernel manages to speed up access to files. For example, a simple pread/pwrite will go through the page cache unless the file was opened with O_DIRECT.
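As a small illustration of the buffered path, the sketch below writes to a temporary file with pwrite and reads it back with pread. The function name buffered_roundtrip is made up for this example. The file descriptor is opened without O_DIRECT, so on a regular Linux filesystem both calls would normally be served by the page cache.

```python
import os
import tempfile


def buffered_roundtrip(payload):
    """Write payload with pwrite and read it back with pread, both buffered."""
    # mkstemp opens the file read/write without O_DIRECT.
    fd, path = tempfile.mkstemp()
    try:
        # The data is copied into the cache first; it reaches the device
        # later, when the kernel writes the modified pages back.
        written = os.pwrite(fd, payload, 0)
        # This read can be satisfied from the cached copy.
        return os.pread(fd, written, 0)
    finally:
        os.close(fd)
        os.unlink(path)


print(buffered_roundtrip(b"hello page cache"))
```

Opening the file with os.O_DIRECT would bypass the cache, but direct I/O generally requires aligned buffers, offsets and lengths, so it is left out of this sketch.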

The kernel regularly flushes the parts of the cache that have been modified through a mechanism called writeback. The minimum unit of flushing is a PAGE_SIZE-sized chunk, since the Linux page cache is strongly tied to the system page size.

In filesystems, a block is the minimum allocation unit and cannot be split during writeback. Therefore, the only way to make sure the whole “block” can be written to the device without splitting is by making sure the block size doesn’t go over the minimum writeback unit size.
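The constraint can be illustrated with a simplified model. The function below is a hypothetical helper, not kernel code: it splits one filesystem block into the page-sized pieces that a page-granular writeback would handle independently. Real writeback is considerably more involved; the sketch only shows the arithmetic.

```python
def writeback_chunks(block_offset, block_size, page_size):
    """Split one filesystem block into page-bounded (offset, length) chunks."""
    end = block_offset + block_size
    chunks = []
    pos = block_offset
    while pos < end:
        # Each chunk stops at the next page boundary or at the end of the block.
        next_boundary = (pos // page_size + 1) * page_size
        stop = min(end, next_boundary)
        chunks.append((pos, stop - pos))
        pos = stop
    return chunks


PAGE = 4096
for block_size in (1024, 4096, 16384):
    pieces = writeback_chunks(0, block_size, PAGE)
    print(f"{block_size:>5}-byte block -> {len(pieces)} chunk(s)")
```

A block of 4k or less always fits inside a single page in this model, so it is written out as one unit. A 16k block spans four pages, each of which could be flushed or evicted on its own, and that is exactly the partial-block situation a filesystem cannot tolerate.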

In summary:

Historically, the page cache was closely tied to the system page size.
There was no support for tracking blocks larger than the page size as a single unit in the page cache, which is needed to avoid evicting partial blocks.

Conclusion:

This article talked about LBS, why it’s important, and why Linux doesn’t support it yet. The next part will talk about previous attempts at adding LBS support to Linux.

Happy reading!

References:

[1] Large blocksize support