Atomic Writes in Linux (Part 1) - Why Databases Need Atomic I/O

A write operation to a storage device is considered atomic if the entire I/O transfer from userspace to the device completes successfully, or not at all. This means there’s no partial data transfer; either the full I/O is present on the device, or none of it is.

Atomic I/O is crucial for critical applications like databases, where data integrity and reliability are paramount. Without this guarantee, databases must implement additional mechanisms, such as extensive logging, to ensure reliability. This article will primarily focus on atomicity within the context of databases, given their stringent requirements.

Understanding the torn I/O issue in databases:

      [ Database Application (MySQL/PG) ]
                  |  
                  |  1. 16KB/8KB Page Dirty in Buffer Pool
                  V
      [ Linux VFS / Filesystem Layer ]
                  |
                  |  2. Page fragmented into 4KB segments (FS Blocks)
                  V
      [ Linux Block Layer / Scheduler ]
                  |
                  |  3. Segments merged/split into BIOs
                  V
      [ NVMe/SCSI Driver & Controller ]
                  |
                  |  4. Data lands in SSD DRAM Cache (Volatile!)
                  V
      =================================== <--- Power Loss Zone
      [ Physical NAND Flash Media ]

Relational databases like MySQL and PostgreSQL work on database pages (DB pages). Each DB page stores rows of data along with metadata about that data. DB pages can also carry a checksum to detect corruption (InnoDB enables this by default; in PostgreSQL, data checksums are a cluster-level option). It is important to understand that databases depend on every DB page being intact. If a DB page is corrupted, the database has to run recovery to replace the corrupted page. Default DB page sizes differ between databases: MySQL uses 16KB, while PostgreSQL uses 8KB.

As illustrated above, I/O travels through several layers from the database application to the non-volatile storage. A single database page can be split along the way, and a crash or power loss can leave only part of it on the media. Such a page is said to be torn. To guard against this, databases employ different torn write protection mechanisms.
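To make the failure concrete, here is a minimal sketch of a torn page being caught by a checksum. The page layout (a 4-byte CRC32 followed by the body), the 4KB block size, and all function names are assumptions made for illustration; real databases use their own page formats and checksum algorithms.

```python
import zlib

PAGE_SIZE = 16 * 1024   # MySQL-style DB page
BLOCK_SIZE = 4 * 1024   # filesystem block size (assumed)
BODY_SIZE = PAGE_SIZE - 4

def make_page(payload: bytes) -> bytes:
    """Build a page: 4-byte CRC32 header followed by a fixed-size body."""
    body = payload[:BODY_SIZE].ljust(BODY_SIZE, b"\0")
    return zlib.crc32(body).to_bytes(4, "little") + body

def page_is_intact(page: bytes) -> bool:
    """Recompute the checksum over the body and compare it with the header."""
    return zlib.crc32(page[4:]) == int.from_bytes(page[:4], "little")

def write_with_crash(old: bytes, new: bytes, blocks_written: int) -> bytes:
    """Simulate a crash after only `blocks_written` blocks reached the media."""
    cut = blocks_written * BLOCK_SIZE
    return new[:cut] + old[cut:]

old_page = make_page(b"old!" * (PAGE_SIZE // 4))
new_page = make_page(b"new!" * (PAGE_SIZE // 4))

# Only the first two 4KB blocks of the new page made it to the device.
torn_page = write_with_crash(old_page, new_page, blocks_written=2)

print(page_is_intact(old_page), page_is_intact(new_page), page_is_intact(torn_page))
# prints: True True False
```

The checksum only detects the torn page. Repairing it requires an intact copy from somewhere else, which is what the mechanisms described next provide.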

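PostgreSQL uses a technique called Log Page on First Write, better known as full-page writes (the full_page_writes setting): the first modification of a page after a checkpoint logs the entire page image to the WAL. MySQL uses a technique called Double-Write Buffering: each page is written to a dedicated doublewrite area before it is written to its final location. In essence, both techniques write the same data twice. This double-write overhead can reduce device lifetime and consume additional bandwidth.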
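The idea behind double-write buffering can be sketched in a few lines. This is a simplified model, not InnoDB's actual implementation: the page layout, the 8KB page size, and the names seal, intact, and recover are hypothetical.

```python
import zlib

def seal(body: bytes) -> bytes:
    """Prefix the body with its CRC32 so tearing can be detected later."""
    return zlib.crc32(body).to_bytes(4, "little") + body

def intact(page: bytes) -> bool:
    """Check the stored CRC32 against the body."""
    return zlib.crc32(page[4:]) == int.from_bytes(page[:4], "little")

def recover(home: bytes, scratch: bytes) -> bytes:
    """Pick the page to keep after a crash.

    The page is written to the scratch (double-write) area first and to its
    home location only afterwards, so at most one of the two copies is torn.
    """
    if intact(home):
        return home        # home write finished, or never started
    if intact(scratch):
        return scratch     # home copy is torn: restore the staged copy
    raise RuntimeError("both copies are torn; the write ordering was violated")

old = seal(b"A" * 8188)    # 8KB page: 4-byte checksum + 8188-byte body
new = seal(b"B" * 8188)

# Crash while writing the home location: the scratch copy is already complete.
torn_home = new[:4096] + old[4096:]
assert recover(torn_home, scratch=new) == new

# Crash while writing the scratch area: the home location still holds old data.
torn_scratch = new[:4096] + old[4096:]
assert recover(old, scratch=torn_scratch) == old
```

Either way the database ends up with a complete page, but every page update costs two device writes. That is the overhead atomic I/O aims to remove.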

If both the storage device and the operating system expose atomic I/O APIs, databases can leverage them to offload the responsibility of torn write protection, eliminating the need for redundant writes. This is beneficial for database users, as it can reduce the number of writes to a device, thereby increasing its lifespan and decreasing I/O overhead.

Atomic I/O in NVMe:

Achieving end-to-end atomic support requires both operating system and device capabilities. While operating system support will be detailed in the next article, this section focuses on atomicity within NVMe devices.

Unlike SCSI, which uses a distinct command (WRITE ATOMIC), NVMe support is implicit. If a write request adheres to specific size and alignment rules defined by the controller, the drive guarantees atomicity.

NVMe devices expose two critical parameters related to atomic writes:

AWUN (Atomic Write Unit Normal): The maximum size guaranteed to be atomic during normal operation.
AWUPF (Atomic Write Unit Power Fail): The maximum size guaranteed to remain atomic even during a power loss. This is the critical value for databases.
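Both values are reported in logical blocks and are 0's based, so a raw value of 0 means one logical block. The sketch below converts a raw AWUPF value into bytes and checks whether a DB page fits. The function names are made up for illustration, and the check covers size only; it ignores the alignment rules and any per-namespace overrides a real device may report.

```python
def atomic_limit_bytes(awupf: int, lba_size: int) -> int:
    """Convert a 0's based AWUPF value (in logical blocks) into bytes."""
    return (awupf + 1) * lba_size

def page_fits_power_fail_atomic(page_size: int, awupf: int, lba_size: int) -> bool:
    """Size-only check: can one DB page be written as a single atomic unit?"""
    return page_size <= atomic_limit_bytes(awupf, lba_size)

# A device with 4KB logical blocks reporting AWUPF = 3 covers 16KB.
print(atomic_limit_bytes(awupf=3, lba_size=4096))                  # 16384
print(page_fits_power_fail_atomic(16 * 1024, awupf=3, lba_size=4096))  # True

# A device with 512-byte logical blocks reporting AWUPF = 0 covers one sector.
print(page_fits_power_fail_atomic(8 * 1024, awupf=0, lba_size=512))    # False
```

On the second device, neither an 8KB PostgreSQL page nor a 16KB MySQL page could rely on the hardware alone for torn write protection.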

AWUN:

An NVMe controller may execute commands in parallel even when their LBA ranges overlap. AWUN’s primary purpose is to ensure the atomicity of command execution in the presence of other concurrent commands. By enforcing inter-command serialization, AWUN prevents a situation where multiple parallel writes from different threads interleave, which could lead to logical blocks containing a partial mix of data from various commands.

It’s important to note that AWUN specifically governs normal, active operations and does not provide protection against torn writes caused by power failures or error conditions; that scenario is handled by AWUPF.

The AWUN support in QEMU serializes commands when LBA ranges overlap. (Inspecting QEMU’s NVMe emulation code can offer valuable insights into how NVMe controllers operate.)
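The serialization rule can be illustrated with a toy model. This is not QEMU's code: the WriteCmd type, the must_wait helper, and the policy of letting oversized writes proceed without serialization are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass(frozen=True)
class WriteCmd:
    slba: int   # starting LBA
    nlb: int    # number of logical blocks (a plain count in this sketch)

def overlaps(a: WriteCmd, b: WriteCmd) -> bool:
    """True if the half-open LBA ranges [slba, slba + nlb) intersect."""
    return a.slba < b.slba + b.nlb and b.slba < a.slba + a.nlb

def must_wait(incoming: WriteCmd, in_flight: Iterable[WriteCmd], awun_blocks: int) -> bool:
    """Hold back an atomic-sized write while an overlapping write is in flight."""
    if incoming.nlb > awun_blocks:
        return False   # larger than the atomic unit: no guarantee to enforce
    return any(overlaps(incoming, cmd) for cmd in in_flight)

running = [WriteCmd(slba=0, nlb=8)]

print(must_wait(WriteCmd(slba=4, nlb=8), running, awun_blocks=8))   # True: LBAs 4..7 overlap
print(must_wait(WriteCmd(slba=8, nlb=8), running, awun_blocks=8))   # False: ranges are adjacent
print(must_wait(WriteCmd(slba=0, nlb=16), running, awun_blocks=8))  # False: exceeds the atomic unit
```

Holding back the overlapping command until the in-flight one completes is what keeps any single logical block from ending up with data from both.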

AWUPF:

AWUPF is an NVMe specification parameter that indicates the maximum size of a write operation guaranteed to be atomic to non-volatile media, even during a power failure or error.

While AWUN governs the interleaving of parallel commands during normal operations, AWUPF specifically dictates the drive’s behavior during catastrophic interruptions. Given that operating systems and databases rely on this feature for data integrity and crash recovery, the Linux kernel prioritizes the AWUPF value when reporting atomic limits to applications.

Many enterprise SSDs include capacitors that store sufficient residual electrical charge to power the drive briefly after a main power loss. This enables the drive’s controller to complete any in-flight data writes or flush its volatile DRAM cache to non-volatile NAND flash media before a full shutdown.

Visualizing the Guarantee: If a database page size is less than or equal to AWUPF, the hardware guarantees one of the following outcomes during a crash:

Scenario: Overwriting "Old Data" with "New Data" (16KB) during Power Loss

       [ Physical Media Sectors ]
       | 0 | 1 | 2 | 3 |
       -----------------

WITHOUT ATOMICITY (Torn Write):
Result: | New | New | Old | Old |  <-- CORRUPTION (Mixed State)

WITH NVMe AWUPF (Atomic Write):
Result: | New | New | New | New |  <-- SUCCESS (Fully Written)
             OR
Result: | Old | Old | Old | Old |  <-- ROLLBACK (Nothing Written)

The application will never see a partial mix; it is strictly all-or-nothing.

In the next article, we will delve into how atomic I/O APIs are defined within the Linux kernel.

Happy reading!