The Uncommitted Changes Blog

A small history on Large block sizes in Linux: Part 0

2024-03-14T00:00:00+00:00

This is a multipart series where I will be going over the support of Large block sizes in Linux.

What is a Large block anyway?

It is important to define what a Large Block Size(LBS) is before writing a three part article about it. In the context of Linux, LBS is defined as a scenario when the block size is greater than page size of the system. Block size can refer to both logical block size of a block device or a filesystem block size of a filesystem.

Linux has historically only supported block sizes which are less than or equal to system page size. We shall get into why later in this article. This means that block size of a filesystem or a block device can never be greater than 4096 in an x86_64 system. Is that a problem for most people? The answer is likely no, but there are use cases where having a system that supports LBS is useful.

Why Large block size?

One of the earliest use case from 2007 for LBS was from CDs and DVDs which have bigger block sizes around 32k and 64k[1]. People dealt it with having a shim layer to overcome this limitation, but it had an effect on I/O speed.
Another use case for LBS is the growing size of SSDs (High capacity SSDs). As these SSDs need a bigger mapping table leading increased RAM costs, device manufacturers are increasing the block size in which they do mapping (Indirection unit) to reduce the cost. I wrote a detailed article about Indirection unit and its effect on WAF here.
Mounting a filesystem that was formatted with larger blocksizes than it is supported in a different system. Let’s say a drive was formatted with 64k block size on a PowerPC system (as the page size is 64k) but the drive needs to be analyzed on a x86_64 system. This is currently impossible as x86_64 cannot mount a filesystem with 64k block size.
A database might have a bigger “page size” than the underlying filesystem’s block size due to the LBS limitation. It is much more useful if both the filesystem and database have the same notion of a block size which might simplifify the database and reduce the number of operations to ensure consistency.

Why the limitation on block sizes?

TL;DR, Linux Page cache.

Page cache is an integral part in Linux when accessing a filesystem. Page cache can be thought of as a simple buffer cache that kernel manages to speed up access to a file. For example, a simple pread/pwrite will go through the page cache if the file was not opened with O_DIRECT.

Kernel will flush the cache that was modified regularly through a mechanism called writeback. The minimum unit of flushing is PAGE_SIZE as the linux page cache is strongly tied to the system page size.

In case of filesystems, a block represents minimum allocation, and it cannot be split during writeback. Therefore, during writeback, the only way to guarantee the whole “block” can be written to the device is to make sure the block size does not go beyond the minimum writeback unit size, which is a PAGE_SIZE.

Conclusion:

This article discussed what is LBS, why LBS and why LBS is not currently supported in Linux. The next part will discuss the first effort that was attempted to solve the LBS issue in 2007.

Happy reading!

References:

[1] Large blocksize support

Adding MSI(x) interrupt support to SerenityOS

2024-03-13T00:00:00+00:00

Traditional PCI devices use a shared interrupt line to signal the CPU when they need attention. This can lead to performance issues as all the devices that are connected to the interrupt line need to invoke their interrupt handler. Message Signalled Interrupts (MSI) was developed to address these problems by providing a more efficient and scalable way of handling interrupts.

MSI were introduced as a part of PCI 2.2, and it works by allowing the device to send an interrupt message directly to the CPU through the PCI bus. When the CPU receives the message, it knows exactly which device generated the interrupt and can handle it accordingly. This reduces interrupt latency and helps to avoid conflicts between devices that share the same interrupt line.

The following image shows how E1000NetworkAdapter and NVMe device are sharing the same interrupt line (10) when using the traditional pin-based interrupts:


E1000NetworkAdapter and NVMe sharing the same interrupt line

MSI can solve this problem by not sharing the same interrupt line. I decided add support for MSI and MSIx interrupt mechanism as SerenityOS was lacking those features.

In pin-based interrupt mechanism, the driver reads the interrupt line field in the PCI header and uses that to program the interrupt handler. For MSI based interrupts, the driver has to program the device with an IRQ number that it wants the device to trigger when an interrupt occurs. As serenity always used pin-based interrupts, new APIs were introduced to make MSI(x) work.

The PRs can be found here:

Pull Request: MSIx

Pull Request: MSI. Check out MSIx PR before seeing this PR as I added MSIx support first.


NVMe using MSIx without sharing the interrupt line

Additional resources:

Please check out the osdev article and the intel software manual chapter 11 for more information. Even though they were good documentation, I missed some information when I was implementing this feature. I also used Linux source code and Haiku OS source code to reverse engineer how this feature is implemented.

Happy Hacking!

Impact of Indirection Unit on Write Amplification in SSDs

2023-12-18T00:00:00+00:00

Developers typically think of SSDs as a black box which will store any IO that is coming its way into a non-volatile memory (such as NAND). Even though the part about storing the IO to the non-volatile memory is true, the way it achieves it depends on various implementation details and parameters. These parameters can have different side effects on performance, endurance, etc.

One such parameter we will explore in this article is the Indirection Unit and how it impacts the device’s endurance based on Write Amplification.

First, let us see what is Write Amplification and then discuss about Indirection Unit and its impact.

Write Amplification:

Write Amplification (WA) happens in SSD when the actual amount of written physical data is more than the amount of logical data that is written by the host.

WAF is the mathematical representation of this phenomenon, describing the ratio of physical writes to logical writes. Let us say the host wrote 4KB, and the SSD has to write 16KB to accommodate that operation; then the WAF will be 4. WAF has a direct impact on the lifetime of the SSDs as more WAF leads to more writes to the underlying media.

WAF on the device can be calculated as follows:

# io_len: size of the IO from the host
# io_extra: extra IO incurred by the SSD for a given io_len

WAF = (io_len + io_extra) / io_len

End-End WAF is a culmination of different WAFs⁰:

WAFTotal = WAF_App * WAF_Device
# Splitting WAF_Device further:
WAFTotal = WAF_App * WAF_SSD * WAF_IU

WAF contribution happens because of different factors. We will look into the impact of the Indirection unit(WAF_IU) on WAF as part of this article.

Indirection Unit:

SSDs maintain a Logical to Physical mapping(L2P) table to map a logical block to an underlying physical NAND block in its RAM(device RAM and not host RAM¹). Logically contiguous blocks do not translate to physical contiguous blocks. This is similar to how virtual memory works in an Operating system.

If the mapping granularity is 4KB, then a 4KB logical block corresponds to a 4KB physical block. Whenever a block is written to the device, a new mapping is created in RAM, and it is used again to find the corresponding physical block when a read happens.


SSD Logical to Physical mapping

Having a mapping table will incur some RAM costs for the device. Assuming a 1:1 L2P with 4KB Logical block size will require at least 256MB of RAM for an SSD of size 256GB.

4KB LBA size in 256 GB SSD = 64M entries (256GB / 4KB)
Each entry could be 32 bits.
64M * 4 bytes = 256 MB

The amount of RAM is directly proportional to the size of the SSD. Extrapolating the same math for 64TB SSD will result in having a whooping 64GB of RAM in the device to hold the mapping table.

High-capacity SSDs have already started to appear in the market³, and device vendors need to use new techniques to keep the RAM under control for mapping table to reduce cost, etc.

One technique that device vendors actively use to reduce RAM footprint is to increase the mapping ratio or the Indirection Unit. Instead of having 1:1 mapping, device could have n:1 L2P mapping, where n > 1. RAM footprint is inversely proportional to n i.e., multiple logical blocks could have 1 physical mapping as follows:


Mapping table with 16k Indirection Unit (4:1)

Even though increasing the logical block size above 4KB is an option, backward compatibility with the host will not make the transition easy⁴. Solidigm’s high-capacity drive has an Indirection Unit of 16k for an SSD with 61.44TB capacity³.

The following section discusses the impact of increasing Indirection Unit(IU) on WAF.

Indirection Unit impact on WAF:

As increasing the IU is inevitable for high-capacity SSDs, evaluating its effects on WAF is essential. As multiple LBAs map to a single physical block, IO writes that are smaller than IU will incur a Read-Modify-Write(RMW) that leads to WAF. RMW has to read the old data, merge the new data, and write it back to the media.


Read-Modify-Write operation²

Optimal write should align and be a multiple of IUs. RMW negatively impacts the performance and lifetime of the SSD due to extra writes incurred.

Quantifying IU WAF:

WAF_IU can be easily quantified by monitoring the IO write patterns coming from the host.

On a 16k IU drive, io spanning from offset 12k to 32k (io_len of 20k) will incur an extra_io of 12k due to RMW as explained before. The resulting WAF_IU is 1.6. ASCII art explaining the workload:

0        4        8       12       16       20       24       28       32
|--------|--------|--------|--------|--------|--------|--------|--------|..  LBA space
                           <-------------------------------------------->    io_len
<-#######################->                                                  extra_io
<----------------------------------------------------------------------->    total_io

WAF_IU can be calculated as follows(code gist here):

# io_offset: IO offset from the host
# io_len: size of the IO from the host
# IU: Indirection Unit

total_io = (round_up((io_offset + io_len), IU) - round_down((io_offset), IU))

WAF = total_io / io_len

One interesting observation that the above formula indicates that the extra IO due to IU is caused due to the unalignment in the either ends of an IO.


Worst case WAF_IU for different IO length

The plot above shows the worst case WAF_IU for different IO length on a 16k IU device. If the IO size from the host much greater than the IU, then the impact on WAF due to IU drastically reduces. The biggest impact on WAF due to IU happens when the IO size is smaller than the IU of the device.

Takeaways:

Indirection units will increase to reduce cost for high-capacity SSDs.
The indirection unit of the device has an impact on total WAF.
Impact of Indirection Unit on WAF is highest when IO size is smaller than the Indirection unit and lowest when the IO size is higher than the Indirection unit.
The host can avoid the WAF due to IU by aligning and sending IO writes that are a multiple of IU to the device.

References:

⁰ Real Life workloads allow more efficient data granularity and enable very large SSD capacities

¹ There are HMB SSDs which store the L2P table in Host RAM but they are not yet widely used

² Achieving Optimal Performance & Endurance on Coarse Indirection Unit SSDs

³Solidigm Launches 61.44TB PCIe SSD

⁴Transition to Advanced Format 4K Sector Hard Drives

Writing a RAM-backed block driver in the Linux Kernel

2022-11-30T00:00:00+00:00

Linux block layer stack is a complicated beast as it needs to cater to all use cases, but it also allows a block device driver writer to focus only on dealing with the complexity of the device. This article explores a simple RAM-backed block device driver module in the Linux Kernel. The main idea of this article is to show the framework the block layer provides to write a device driver in the kernel land.

A simple block driver: blkram that lives in the RAM will be written from scratch as a part of this article. I decided to do this to focus on the block layer stack with a practical example without having to deal with the complexity of an actual block device such as a SATA or an NVMe drive. Maybe in the future, I will explore writing an NVMe driver from scratch in the kernel.

Linux Block layer is constantly being innovated and modified. As there are no API/ABI guarantees within the kernel, the code that is shown in this article which is based on Linux 6.1.0-rc6 might be outdated in a year.

Linux Block layer

The Linux Block layer introduced blk-mq framework around 2013. All the new block drivers are required to use this framework. Some drivers still use older frameworks in the kernel, but most of the drivers have been modified to be consistent. The following picture taken from paper shows how the block layer stack with blk-mq works¹


Block layer stack

The blk-mq uses a two-layer multi-queue design where software queues based on the number of cpu cores are mapped to a hardware queue/queues. The primary rationale behind this design is to allow a block device driver to fully use the multiple hardware queues present in modern devices such as NVMe SSD. Older devices with a single queue can map all the software queues to a single hardware queue. The blkram driver will also use a single hardware queue. The reader can find more information about blk-mq from this paper and this LWN article.

BLKRAM driver

blkram is an out-of-tree kernel module using the blk-mq framework and do read & writes in the memory(RAM) that will be written as a part of this article. The code can be found here on Github.

Module:

Before talking about the initialization, we need to talk about a kernel module. module_init and module_exit needs to be defined that are automatically called when a module is loaded and unloaded respectively:

module_init(blk_ram_init);
module_exit(blk_ram_exit);

To store the relevant information of the driver, a new structure blk_ram_dev is introduced, which has the following members:

struct blk_ram_dev_t {
	sector_t capacity;
	u8 *data;
	struct blk_mq_tag_set tag_set;
	struct gendisk *disk;
};

The capacity holds the capacity of the block device in sectors (512 bytes), and data will contain the pointer to the actual block of memory backing the block device. The blk_mq_tag_set and gendisk structure will be explained in more depth later.

The capacity/size of the driver is exported as a module parameter, and it can be set while loading the module:

// To change the default: insmod blkram.ko capacity_mb=80
unsigned long capacity_mb = 40;
module_param(capacity_mb, ulong, 0644);
MODULE_PARM_DESC(capacity_mb, "capacity of the block device in MB");

As this driver is not associated with a lower-level driver such as PCI, a pointer to the struct blk_ram_dev_t needs to be stored as a static variable in the module:

static struct blk_ram_dev_t *blk_ram_dev = NULL;

Initialization:

The initialization code of the driver goes under the blk_ram_init function.

The register_blkdev function is first called to get a major number for the block device. This is an optional function to call. We store the major number as a static module parameter as it will be used again in blk_ram_exit function to clean up.

Memory is allocated for the struct blk_ram_dev_t using kzalloc. kzalloc allocates the memory in RAM and initializes it with zero (similar to kmalloc with memset(0)). After this, memory needs to be allocated for the RAM memory backing the block device. A default value of 40 MB is chosen here. kvmalloc is used to allocate that memory as the value is considerable. kvmalloc function tries to allocate physically contiguous memory and if that fails, then it allocates virtually contiguous memory which might not be physically contiguous. Having a physically discontiguous memory should not be an issue for this driver. Besides, the kernel does not allow using kmalloc for requested capacity than a certain limit.³

// Omitted error handling

blk_ram_dev = kzalloc(sizeof(struct blk_ram_scratch_dev), GFP_KERNEL);
blk_ram_dev->data = kvmalloc(data_size_bytes, GFP_KERNEL);
...

Setting up the request queue:

Request queue must be configured before setting up the disk parameters. I think of Request queue as the data plane where the actual data is transferred to the device and disk abstraction is the control plane(struct gendisk) of a block device.

struct blk_mq_tag_set is used by the block driver to configure request queue with the number of hardware queues, queue depth, callbacks, etc. This structure does a lot more than just store these parameters. It also has tags, which track requests sent to a block device. The code below sets up the tag_set data structure:

// Omitted error handling
memset(&blk_ram_dev->tag_set, 0, sizeof(blk_ram_dev->tag_set));
blk_ram_dev->tag_set.ops = &blk_ram_mq_ops;
blk_ram_dev->tag_set.queue_depth = 128;
blk_ram_dev->tag_set.numa_node = NUMA_NO_NODE;
blk_ram_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
blk_ram_dev->tag_set.cmd_size = 0;
blk_ram_dev->tag_set.driver_data = blk_ram_dev;
blk_ram_dev->tag_set.nr_hw_queues = 1;

ret = blk_mq_alloc_tag_set(&blk_ram_dev->tag_set);
disk = blk_ram_dev->disk =
	blk_mq_alloc_disk(&blk_ram_dev->tag_set, blk_ram_dev);

blk_queue_logical_block_size(disk->queue, PAGE_SIZE);
blk_queue_physical_block_size(disk->queue, PAGE_SIZE);
blk_queue_max_segments(disk->queue, 32);

tag_set.ops provides the callbacks to blk_mq. One important callback that needs to be set for this driver is queue_rq. This callback is called whenever a request is ready to be processed by the device driver. More about queue_rq later in the article.

tag_set.flags is used to set certain request queue properties. BLK_MQ_F_SHOULD_MERGE flag is set to let the block layer to merge contiguous requests together:

if (!(hctx->flags & BLK_MQ_F_SHOULD_MERGE) ||
    list_empty_careful(&ctx->rq_lists[type]))
	goto out_put;
...
/*
 * Reverse check our software queue for entries that we could
 * potentially merge with. Currently includes a hand-wavy stop
 * count of 8, to not spend too much time checking for merges.
 */
if (blk_bio_list_merge(q, &ctx->rq_lists[type], bio, nr_segs))
	ret = true;

tag_set.nr_hw_queues is an important parameter that is used to inform the block layer about the number of hardware queues this device can support. In the case of blkram, only one hardware queue is chosen. For NVMe devices which can physically support multiple hardware queue, tag_set.nr_hw_queues can be given a higher value and blk_mq_map_queues can map SW queues to the HW queues.

A tag_set is allocated with blk_mq_alloc_tag_set call with the respective parameters. A request queue can be created with the corresponding tag_set by calling blk_mq_alloc_disk function. This function only allocates a disk but does not “add” it to the system. struct gendisk contains a reference to the request queue that can be used to configure parameters such as logical_block_size, physical block size, etc.(block settings can be explored in this file block/blk-settings.c).

Setting up the disk:

The gendisk structure stores the relevant context about a block device with its bookkeeping information such as name, major/minor number, partitions, etc. The struct gendisk can be found in blkdev.h. As mentioned earlier, one could think of it as the control plane of a block device.

disk->major = major;
disk->first_minor = minor;
disk->minors = 1;
snprintf(disk->disk_name, DISK_NAME_LEN, "blkram", minor);
disk->fops = &blk_ram_rq_ops;
disk->flags = GENHD_FL_NO_PART;
set_capacity(disk, blk_ram_dev->capacity);

ret = add_disk(disk);

Major number identifies the driver associated with a device, and minor identifies the exact device that belongs to the driver so that the device can be differentiated. For example in block devices, different partitions are given a different minor number, but the major number will remain the same.

As it is just a simple block driver, I decided not to support any partitions. GENHD_FL_NO_PART flag is set to the disk to tell the block layer not to scan for any partitions. Similarly, minors is set to 1 as there will be no partitions. Block layer code that checks for GENHD_FL_NO_PART and skip scanning for partitions:

static int blk_add_partitions(struct gendisk *disk)
{
	if (disk->flags & GENHD_FL_NO_PART)
		return 0;

	state = check_partition(disk);
...

disk->fops contains all the callbacks for the block device that is used to perform open, release, ioctl, etc. The following snippet should be enough for the blkram driver as we don’t need to do anything special:

static const struct block_device_operations blk_ram_rq_ops = {
	.owner = THIS_MODULE,
};

Finally, calling add_disk should create a block device /dev/blkram.

Request processing:

queue_rq callback is called by the block layer to process a request by the device driver. Typically, queue_rq callback is used by a driver to send the commands to a device, and the command completion is notified by an interrupt request. As this block driver is dealing with RAM, which has low latency, requests can be completed synchronously in the queue_rq callback.

static blk_status_t blk_ram_queue_rq(...)
{
	loff_t pos = blk_rq_pos(rq) << SECTOR_SHIFT;
	struct bio_vec bv;
	struct req_iterator iter;
	blk_status_t err = BLK_STS_OK;
        ....

	blk_mq_start_request(rq);

	rq_for_each_segment(bv, rq, iter) {
		unsigned int len = bv.bv_len;
		void *buf = page_address(bv.bv_page) + bv.bv_offset;
		...
		switch (req_op(rq)) {
		case REQ_OP_READ:
			memcpy(buf, blkram->data + pos, len);
			break;
		case REQ_OP_WRITE:
			memcpy(blkram->data + pos, buf, len);
			break;
		default:
			err = BLK_STS_IOERR;
			goto end_request;
		}
		pos += len;
	}

end_request:
	blk_mq_end_request(rq, err);
	return BLK_STS_OK;
}

blk_mq_start_request is called first to inform the block layer that the driver has started processing the request. This is important for the block layer to do accounting and keep track of each request for a potential timeout.

rq_for_each_segment is used to iterate over all the segments in a request and perform any operation on a bio_vec (block IO vector). Only read and write are supported by the blkram driver. When the request is REQ_OP_READ, then a memcpy is performed from the data (backing store of this block device) to the page given the bio_vec, and vice versa for REQ_OP_WRITE.

blk_mq_end_request is called with the appropriate err to mark that the request is now completed. In NVMe devices, this function is called as a part of the interrupt request when the device signals its completion of a command.

Testing:

The driver is now ready to be tested. The module can be loaded as follows:

$ insmod blkram.ko capacity_mb=80
$ lsblk | grep blkram
blkram  253:0    0   80M  0 disk

A quick and easy way to test if read and write is working is through fio. Install fio and run the following command:

$ fio --name=randomwrite  --ioengine=io_uring --iodepth=16 --rw=randwrite \
                 --size=80M --verify=crc32 --filename=/dev/blkram
randomwrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=16
fio-3.31-8-g7a7bc
Starting 1 process

randomwrite: (groupid=0, jobs=1): err= 0: pid=968: Mon Dec  5 18:23:19 2022
  read: IOPS=62.1k, BW=242MiB/s (254MB/s)(80.0MiB/330msec)
    slat (usec): min=6, max=195, avg= 7.80, stdev= 2.08
    clat (usec): min=8, max=444, avg=241.21, stdev=10.61
     lat (usec): min=15, max=452, avg=249.01, stdev=10.85
     ....
  write: IOPS=78.8k, BW=308MiB/s (323MB/s)(80.0MiB/260msec); 0 zone resets
    slat (usec): min=9, max=194, avg=12.22, stdev= 5.38
    clat (usec): min=20, max=1165, avg=190.17, stdev=68.66
     lat (usec): min=31, max=1176, avg=202.39, stdev=73.00
     ....
   bw (  KiB/s): min=163840, max=163840, per=52.00%, avg=163840.00, stdev= 0.00, samples=1
   iops        : min=40960, max=40960, avg=40960.00, stdev= 0.00, samples=1
  lat (usec)   : 10=0.01%, 50=0.01%, 100=0.02%, 250=89.26%, 500=10.23%
  lat (usec)   : 750=0.48%
  lat (msec)   : 2=0.01%
  cpu          : usr=48.15%, sys=46.11%, ctx=1390, majf=0, minf=573
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=20480,20480,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=242MiB/s (254MB/s), 242MiB/s-242MiB/s (254MB/s-254MB/s), io=80.0MiB (83.9MB), run=330-330msec
  WRITE: bw=308MiB/s (323MB/s), 308MiB/s-308MiB/s (323MB/s-323MB/s), io=80.0MiB (83.9MB), run=260-260msec

Disk stats (read/write):
  blkram: ios=11148/661, merge=0/19819, ticks=18/1238, in_queue=1256, util=42.53%

The above fio command will send random writes to the device, and at the end verifies (by reading) if the device contains what was written.

Conclusion:

A simple RAM-backed block device driver was explored as a part of this article. The main idea behind writing this is to understand the blk-mq framework provided by the block layer stack and write a device driver using it. There is a lot of knobs that blk-mq offers which is not covered in this article that could be utilized to optimize the driver depending on the device.

I highly recommend the reader to clone the example from github and play with it in QEMU. I already have an article about using QEMU for NVMe development, and it can be used to easily create a virtual machine with QEMU. The best way to explore is by using the trace-cmd² utility or just with debug prints in the kernel to see how different blk-settings affect the request sent to this device.

I hope you enjoyed the article. Happy Hacking!

¹ LWN article about block layer part1 & part2

² Learning the linux kernel with tracing video

³ what happens when kmalloc is used instead of kvmalloc:

Kernel panics when kmalloc is used instead of kvmalloc for 40 MB of data_size_bytes:

WARNING: CPU: 0 PID: 3467 at mm/page_alloc.c:5527 __alloc_pages+0x48b/0x5a0

__alloc_pages+0x48b/0x5a0 corresponds to the following line in the kernel:

$ addr2line --exe=vmlinux --functions __alloc_pages+0x48b
__alloc_pages
linux/mm/page_alloc.c:5527 (discriminator 9)

Looking at the code at mm/page_alloc:5527:

#define MAX_ORDER 11
....
....
/*
  * There are several places where we assume that the order value is sane
  * so bail out early if the request is out of bound.
  */
 if (WARN_ON_ONCE_GFP(order >= MAX_ORDER, gfp))
         return NULL;

Any request of kmalloc with order 11 or above: 2^10 * PAGE_SIZE (4096 for x86)= 4 MB will fail this check.

QEMU setup for NVMe development

2022-11-08T00:00:00+00:00

QEMU is an emulator that can be used during the development of an NVMe driver. It offers NVMe 1.4 spec-compliant controller emulation. The neat part about using QEMU is that it only emulates the controller and not the device itself, thereby allowing the driver writer to focus solely on writing a spec-compliant driver without initially worrying about the quirks that come along with an actual NVMe device. On top of that, QEMU offers tracing capabilities, making debugging very easy during initial development. And, last but not least, an actual NVMe device is not needed for development, and the host machine will not be affected in any way during the development. That is enough marketing as to why QEMU is excellent for NVMe driver development.

vmctl will be used to set up the QEMU development environment. It makes life a bit easy by automating the creation and management of QEMU as one of the primary usecase it specifically targets is NVMe development. However, it is not necessary to use this tool to create and manage QEMU. Vagrant with libvirt is a possible alternative

If you already have a QEMU setup for Linux development, only a few lines of setup command are required. Please go ahead and skip VMCTL section, as I will cover that at the end of the article.

VMCTL:

The official github page has an excellent README which should be good enough to get started. I will reiterate certain parts of the README in a different order and add a bit more context for readers who are entirely new to this topic.

Make sure to clone the official repo and make sure the vmctl is added to the path via a symlink as suggested in the official README. Before we use vmctl, a boot image needs to be created.

Here is an Ansible role to automate the steps described in this article for readers who prefer IaC.

Ubuntu boot image

Download a ubuntu cloud image from the official site. Resize the ubuntu image as follows:

>$ qemu-img resize ubuntu--server-cloudimg-amd64.img 8G

Create a new folder called vms to hold all the VM related data and copy the ubuntu qcow into a folder named img (inside vms) as base.qcow2.

>$ mkdir -p vms/img
>$ cp ubuntu--server-cloudimg-amd64.img vms/img/base.qcow2

cloud-init

After creating a Ubuntu based qcow image, the image can be configured using cloud-init script provided by vmctl. Running this helps set some defaults that will be useful when we boot the system. Also we will set it up to accept ssh connections from our host by providing it our ssh public key.

>$ ./vmctl/contrib/generate-cloud-config-seed.sh ~/.ssh/.pub
>$mv seed.img vms/img/

Using vmctl to boot the image in QEMU

The official repo provides a set of example configuration files to boot Linux with NVMe storage. One thing to note is that even though the guest OS running in QEMU sees an NVMe drive, QEMU only emulates the NVMe controller, but underneath, it uses the storage media of the host. For more details on how QEMU emulates the NVMe controller, do check out this video by the current maintainer of the QEMU NVMe subsystem.

Copy the relevant files to the config subfolder inside vms folder.

>$ mkdir vms/config
>$ cp vmctl/examples/vm/nvme.conf vms/config
>$ cp vmctl/examples/vm/q35-base.conf vms/config
>$ cp vmctl/examples/vm/common.conf vms/config

Add just one line QEMU_PARAMS+=("-s") in the nvme.conf file as follows:

_setup_nvme() {
# setup basevm
  _setup_q35_base

  QEMU_PARAMS+=("-s")

The reason to add -s option to qemu is to enable debugging with gdb from the host machine. This will be handy to do remote debugging with gdb as it opens up port:1234 for that purpose.

Firstly, the image needs to be configured with the seed.img that was created. Run the following to do that:

>vms$ vmctl -c nvme.conf run -c

Once the configuration is complete from the previous step, the image can be booted by running the following command:

>vms$ vmctl -c nvme.conf run

Now check whether the build worked by sshing into the VM as follows from another terminal:

ssh -p 2222 'root@localhost'

Inside the VM, make sure that there is an NVMe driver attached to our VM by running lsblk inside the VM:

[root@archlinux ~]# lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
vda     254:0    0   8G  0 disk
└─vda1  254:1    0   8G  0 part /
nvme0n1 259:0    0   1G  0 disk

As we can see, Linux detects the NVMe drive and creates a block device nvme0n1.

If you encounter any issues, ensure you are inside the vms folder. If you want to run the command from a different folder, set the VMCTL_VMROOT environment variable pointing to the vms directory.

Using a custom kernel and tracing

We can use Linus’s tree to use the latest mainline kernel version. And QEMU also has the option to use a custom kernel during boot. To compile a custom kernel, do the following:

>$ git clone https://github.com/torvalds/linux.git 
>$ cd linux
>$ make menuconfig
>$ make -j$(nproc)

Grab a cup of coffee while the kernel builds.

Once the build is complete, run the vmctl tool by pointing to the kernel build dir as follows:

>vms$ vmctl -c nvme.conf run -k

It is also a good idea to enable pci_nvme tracing in QEMU to help with the debugging.

The final command, which does everything, is as follows:

>vms$ vmctl -c nvme.conf run -t pci_nvme -k

Note that vmctl uses a systemd service to mount the kernel directory from host, enabling the use of modules compiled in the host to be consumed inside the Guest. This is a nice feature to avoid building all modules into a single kernel binary, thereby significantly reducing kernel build time when modifying a module. Also, some test suites, such as blktest, require some drivers to be dynamically loadable modules.

To view the QEMU trace, run the following in another terminal:

>vms$ vmctl -c nvme.conf log -f

Tmux could be used to run these commands in different panes in the same window.

VFIO device passthrough

VFIO passthrough can be used to access the NVMe device from the Guest OS. VFIO-PCI module will pass the NVMe device to the Guest, and the Guest OS’s NVMe driver can be used to talk to the device.

Pre and post hooks from vmctl config can be used to bind/unbind the respective device before starting QEMU. This article explains the setup needed to do vfio-pci passthrough.

The following pre hook will detach the host’s NVMe driver from 01:00.0 PCI port and attach it to the vfio-pci module before starting QEMU, and the post hook in nvme.conf will restore the original state after exiting QEMU :

_pre() {
		# Pre hook to run before starting QEMU
	 
		# unbind 0000:01:00.0 from nvme kernel module
		echo '0000:01:00.0' > /sys/bus/pci/drivers/nvme/unbind
	 
		# bind 0000:01:00.0 to vfio-pci kernel module
		echo '0000:01:00.0' > /sys/bus/pci/drivers/vfio-pci/bind
	}
	 
_post() {
		# Post hook to run after exiting QEMU
	 
		# unbind 0000:01:00.0 from vfio-pci kernel module
		echo '0000:01:00.0' > /sys/bus/pci/drivers/vfio-pci/unbind
	 
		# bind 0000:01:00.0 to xhci_hcd kernel module
		echo '0000:01:00.0' > /sys/bus/pci/drivers/nvme/bind
	}

Without using VMCTL

As mentioned before, vmctl tool makes life a bit easier to manage QEMU NVMe development. But it is an optional tool.

If you already have a workflow with QEMU, then it can be easily extended.

To create a QEMU instance with an NVMe driver, add the following lines¹ while running your QEMU command:

-drive file=nvm.img,if=none,id=nvm
-device nvme,serial=deadbeef,drive=nvm

This requires nvm.img raw image, which can be easily created as follows:

>$ qemu-img create nvm.img 1G

Apart from the nvme changes to the qemu command, make sure to use the -kernel option to point to the custom kernel and -s option to enable gdb remote debugging.

Conclusion

This article showed how to set up a QEMU-based environment for an NVMe driver development. I have also added an Ansible role to automate the vmctl setup. It also builds and installs the upstream QEMU for VMCTL.

I hope you enjoyed the article. Happy Hacking!

¹ Taken from the official QEMU documentation here

² NVMe VFIO passthrough with QEMU link

My Homelab hardware for self-hosting

2022-08-21T00:00:00+00:00

I recently started to look into self-hosting certain services to increase privacy and, most importantly, have fun along the way. I generally work on the Linux kernel for my day job; while exciting, I don’t get to use Linux where it shines the most: as a server.

This article will cover my Hardware setup for my Homelab and its bringup.

Humble beginning

Before going all in by buying fancy hardware, I wanted to do a trial run with old hardware and see if it was something I wanted to do. My friend donated her old HP EliteBook 8470p as she had no use for it anymore. So, like any sane person, I removed Windows and installed Linux in it. I went with Arch Linux as it would only be my test server. I know some people run their server with Arch Linux, but that will probably not be me in the future.


HP Elitbook 8470p

I just ran a Samba share for network file sharing and installed paperless-ngx in docker. paperless-ngx app came in handy many times to quickly access my personal documents. I used this server once for web scraping to find an appointment in my local municipality. So, I could already see the potential of self-hosting services and see myself tinkering with it.

As it might not be a great idea to add extra storage via USB, I decided to get better hardware with horsepower and expandability.

Homelab Hardware specification

I bought a used Supermicro X11SSL-CF Motherboard with Intel Xeon 1245 v6, 32GB ECC RAM and 64GB of SATA DOM. The Motherboard has 6 SATA slots and 2 Mini SAS HD slots. In addition, it has an LSI 3008 RAID controller for the SAS ports, which I plan to use in IT mode in the future so that all the disks connected via the SAS ports appear as individual HDDs. Unfortunately, the RAM that came with the Motherboard had issues, so I had to buy my RAM. More about how I discovered this issue later.

Storage:

I will use the 64GB SATA DOM for running my primary OS. My spare 500 GB Samsung SATA SSD will be used for fast storage and used as a cache layer with dm-cache. For now, I just got one 4TB Ironwolf NAS HDD as my primary storage. I am planning to buy one more and use one of them in RAID1. I am not a big data hoarder, so for now, 4TB of storage should be enough to get started. I have enough expansion ports in the Motherboard for future expansion.

Component	Model
CPU	Intel(R) Xeon(R) CPU E3-1245 v6 @ 3.70GHz
RAM	2 x 8GB Samsung ECC RAM (M391A1G43EB1-CRC)
Storage (OS)	SuperDOM 64GB
Storage (fast)	Samsung 870 EVO SSD - 500GB
Storage (slow)	2 x IronWolf Harddisk ST4000VN008 4TB
Case	Fractal Node 804

Bringup

One issue with buying used components is component reliability. Generally, these server-grade Motherboards are designed to last long; still, there can be issues.

Before connecting the peripherals, I tried to do a bring-up test to see if I could reach the BIOS option. The board did nothing, and I went into panic mode that the board might be kaput, and I lost my money. I tried all the debugging steps mentioned in the Supermicro installation guide, and everything pointed towards replacing my Motherboard. Finally, after some hours of debugging, I discovered I didn’t connect my power supply properly (:facepalm:).

Once the system started functioning as intended, I tried booting into a Live ISO to check everything was properly detected by the OS. Then came the next issue; there were random kernel panics. Again, I panicked that the board might be kaput.

I was very confused because the kernel panics were random, and it was not reproducible. So, I decided to do a memtest that is part of Live USB ISO installer. To my surprise, it emitted a lot of memory errors as below:


Memtest errors (Image not from my server as I forgot to take a picture. Taken from here)

So once I replaced the RAM, the random kernel panics did not occur anymore. Only one of the two 16G RAM sticks had an issue. I will probably reuse the other one later when I expand the memory.

I like the built-in IPMI in the Motherboard that allows me to operate the server without having it connect to the monitor, keyboard, etc. Even though there are alternatives such as Pi-KVM and TinyPilot that are DIY KVM, each cost at least 100 Euros (assembled version might cost around 300 Euros) with extra components lying around the server. So I am happy with the inbuilt IPMI to control my server remotely.


IP-KVM

Future plans

Hardware:

Change the LSI 3008 SAS controller from IR mode to IT mode (link)

This change disables the RAID functionality of the SAS controller, and it presents each drive individually to the host. It comes in handy when I want later use something like Proxmox and pass through the complete controller for some storage OS, such as TrueNAS to manage drives.
Add more RAM and storage

I didn’t want to oversize my hardware with RAM and storage up front. I also want to choose filesystem and software in such a way that I can gradually add more disks.

System software:

openSUSE as the primary OS

I am planning to install openSUSE Leap as my primary OS. I have heard good reviews about openSUSE Leap for servers. openSUSE’s default filesystem is BTRFS, which will be my primary filesystem to store data.

I initially thought of going with Proxmox as my hypervisor instead of installing OS in the bare metal. I might do that in the future, but I will go with the bare metal install now. Anyway, I am planning to deploy most of my software via docker.
Filesystem and redundancy

I will use a 2 x 4TB Harddrive in RAID1 with 100GB of SSD as a fast cache using dm-cache. I will be using BTRFS as my primary FS to store data. I am familiar with BTRFS, and its snapshot functionality is incredible. If I decide to add extra storage, I will go with Mergerfs and Snapraid combo, with each drive having its fast cache and formatted with BTRFS. It allows me to scale as I need.

I am also keeping an eye on RAIDz expansion feature that might be coming to ZFS soon which might give me motivation to try a storage OS such as TrueNAS.

Conclusion

I had a lot of fun researching, finding, and assembling my server. But, I suppose the real fun will begin once I start installing applications. I initially intend to try out different stacks: with and without virtualization, different filesystem layouts, different backup strategies, etc.

I will probably have more blog articles in the future about my software setup and the applications I will deploy.

That is it for now. Happy Homelabbing!!


Fully built supermicro server

Taming mailing lists with NeoMutt and notmuch

2022-01-22T00:00:00+00:00

Still, a lot of Free and Open Source Software(FOSS) projects use mailing lists as the primary form of communication and more importantly: sending patches. As my new job required following certain Linux mailing lists, it all seemed very new as I never had a mailing list-based workflow before. The key differentiator in a mailing list-based workflow is that there is going to be a huge influx of emails but not all of them are relevant to you. So the challenge that many people face when they start is filtering emails. Letting all emails go to the inbox and then filtering them manually is an arduous task that will soon prove to be futile.

In this article, I will try to cover some tips and tricks to handle mailing lists and make it as usable as a part of the daily workflow.

I use Gmail to subscribe to mailing lists and it was almost impossible to follow things via the web-based Gmail client. I even tried setting up filters but it still didn’t feel “right” to me. After trying multiple clients such as Thunderbird, and talking to people, I finally found the one: NeoMutt + notmuch based on my colleague’s suggestion. Initially, it felt like a daunting task to set it up but the final result gave me that peace of mind that I am never planning to look back. I hope explaining my workflow can help others and make parsing through the mailing list a bit less scary.

NeoMutt is a command-line mail reader (and a fork of Mutt) and notmuch is an awesome Mail Indexer. Mbsync(iSync) will be used as a way to synchronize mails with the mail server through IMAP. As mbsync downloads the email locally, it is very useful for offline viewing. We will not go over mbsync setup here but there is a good ArchWiki page. As a prerequisite, make sure the mails are synced using mbsync locally to a folder in MailDir format.

The basic idea is to filter and tag the incoming emails with notmuch and use Neomutt’s inbuilt notmuch integration to view the emails on a per tag basis.

Notmuch

Notmuch allows to compose logical rules to add tags to an email. My notmuch configuration is very basic and it is as follows:

[database]
path=

[user]
name=
primary_email=

[new]
tags=unread;

[search]
exclude_tags=spam;

[maildir]
synchronize_flags=true

Once this is wired, tags can be applied to the emails based on our custom rules.

My strategy is to apply a unique tag per mailing list. But there could be cases where there is a single mailing list for different topics. I will also show a technique where we can add some filters to only extract relevant emails from that mailing list.

Syncing script

An easy way to manage and apply tags with notmuch is to create a custom script that can run every time we want to download emails locally. The script will run mbsync to synchronize emails and run notmuch to apply tags on them.

Syncing script example:

#!/bin/sh

/usr/bin/mbsync 

notmuch new

#Example tagging for the linux block layer mailing list
notmuch tag +block -- cc:linux-block@vger.kernel.org or to:linux-block@vger.kernel.org
notmuch tag +io_uring -- cc:io-uring@vger.kernel.org or to:io-uring@vger.kernel.org

Syncing script also allows us to mark some unwanted mails to be marked as read. I follow the development of NVMe emulation in QEMU but there is a single mailing list in QEMU for the whole block layer. notmuch has tagging also based on Subject. As I know the email that is of my interest has the hw/nvme in the Subject line, I do the following in my syncing script to only extract the relevant emails:

notmuch tag +qemu-nvme -- cc:qemu-block@nongnu.org or to:qemu-block@nongnu.org and subject:"hw/nvme"

# I am not interested in anything that is not related to hw/nvme
notmuch tag -unread -- cc:qemu-block@nongnu.org or to:qemu-block@nongnu.org not tag:qemu-nvme

The mails coming from qemu-block@nongnu.org and do not contain the subject:hw/nvme are automatically marked as read. As the reader might notice, this is a very useful feature to declutter the Mailbox from unwanted emails.

Github notifications

This technique can also be extended to following PRs for projects that are hosted on Github. Once the notification via email is enabled for PRs in Github, a custom notmuch tag can be applied on the Subject to filter out PRs that are not interesting.

For example, to follow only the developments in the Kernel in SerenityOS, the following rule can be added:

notmuch tag +serenity-kernel -- to:serenity@noreply.github.com and subject:"Kernel"
notmuch tag +serenity -- tag:personal and to:serenity@noreply.github.com        
                                                                                
# I am not interested in anything that is not related to Kernel and not directed to me
notmuch tag -unread -- to:serenity@noreply.github.com and not tag:serenity-kernel and not tag:serenity

Of course, the project that is being followed should have some sort of a contributing guideline that enforces the subsystem to be mentioned in the Subject or the Body.

Neomutt + Notmuch integration

Neomutt has this concept of virtual-mailboxes that takes description or a notmuch URL as its input. Neomutt has a custom syntax for notmuch search arguments which is used to construct the notmuch URL.

Ideally, it is better to create a virtual mailbox per notmuch tag in neomutt.

This is my configuration for the virtual-mailbox which goes in neomutt config:

#neomuttrc

virtual-mailboxes "block" "notmuch://?query=tag:block%20and%20tag:unread"
virtual-mailboxes "io_uring" "notmuch://?query=tag:io_uring%20and%20tag:unread"

The “block” virtual-mailbox will show only new mails that come from the linux-block mailing list (the same applies to “io_uring” virtual-mailbox). I prefer to only see the new emails from the respective mailing list instead of seeing it along with the old emails from that mailing list.

Pruning and Reading

Not all emails in a mailing list might be of importance so my workflow involves a pruning step where I get rid of uninteresting patches and I mark patches that I want to revisit later with a todo tag. Neomutt allows to easily tag a mail with a shortcut as follows:

macro index,pager tt "!todo\n" "Toggle the 'todo' tag"

The above macro binds tt with toggling the todo tag on an email.

So while pruning the mails, I tag the important emails with a todo tag so that I can go through them later. The virtual-mailbox configuration with todo tag is as follows:

virtual-mailboxes "todo" "notmuch://?query=tag:todo"
virtual-mailboxes "block" "notmuch://?query=tag:block%20and%20tag:unread%20and%20not%20tag:todo"
virtual-mailboxes "io_uring" "notmuch://?query=tag:io_uring%20and%20tag:unread%20and%20not%20tag:todo"

The above virtual-mailbox rule allows ignoring emails with todo label that are there in the block and io_uring virtual-mailbox. This is useful while pruning the mailing list as I don’t want to see the email with todo tag even if it is unread.

TIP: Add a macro that calls the syncing script from NeoMutt. I have configured S to run the syncing script:

macro index S " 2>&1" "sync email and notmuch"

TIP: Github sends the link of the patch/diff file in the email. A simple macro can be added to download the diff file locally as follows:

bind index,pager O noop
macro index,pager O "| grep \"\^https\.\*diff\$\" | xargs wget "

Conclusion

That is it for this article. I will try to keep this article up-to-date as I learn new tricks to optimize my workflow.

This setup avoids the need to go to the main Inbox folder and rather directly jump into the mails with relevant tags. Personally, this setup creates less cognitive overhead and less context switching because the relevant emails are filtered and organized.


My neomutt + notmuch setup

Useful articles:

Hope you enjoyed the article and found it helpful. Happy mailing!

Creating a stopwatch in Pinetime

2021-04-03T00:00:00+00:00

The Pinetime is a free (as in freedom) and open-source smartwatch that is a completely community-driven side-project. Enthusiasts all around the world are writing multiple firmware for the watch. Currently, Infinitime is one of the most popular firmware for the Pinetime¹ project and the latest watches shipped use it as the default firmware. I also like this project because I get to use Modern C++ on a tiny nrf52 microcontroller that is onboard in Pinetime.

Recently I created a Pull Request for a stopwatch and it got merged upstream. So in this short article, I am going to cover the basics of the “framework” for writing applications (Apps) in Infinitime, and how I designed a basic stopwatch.

Apps Interface:

There is no/minimal documentation as to what to do to write an app in Infinitime. So I decided to see how other apps are implemented and just use the power of grep to figure out.

Overall the author and the community have done a great job in creating a well abstracted, event-driven code base to develop apps.

So all the Apps in Infinitime inherits from the base class Screen. These are the functions that must/shall be implemented in our App class that inherits from Screen (added comments in the code for clarity):

// This function is called every 20 ticks (approx to 20 ms).
// The part that needs to be updated continuously should be inside this function.
// This function needs to be implemented by the App (as it is a pure virtual function)
// The return value should be true if this App needs to continue to run.
virtual bool Refresh() = 0; 

// Pinetime has a button that is typically used to wake up the watch. 
// This function will implement the code in the event of a button push
// and it is optional to implement.
virtual bool OnbuttonPushed(); 

// The two functions below are triggered for touchevents
virtual bool OnTouchEvent(TouchEvents event);
virtual bool OnTouchEvent(uint16_t x, uint16_t y);

Of course, there are more things apart from implementing the Screen interface to get the application to work. The best way to figure that out right now is to do a grep of some application class that has already been implemented and replicate it for your application.

Stopwatch

Now that we know the basic framework on how to develop an app in Infinitime, let’s start chalking out the logic of a stopwatch before jumping into the implementation.

Logic

As we want the stopwatch to be event-driven, the easiest way to describe its logic is via a good old state machine/diagram as shown below:


State diagram for a stopwatch

(If you understood the logic of the state machine by just looking at the diagram, then skip ahead to the next subsection: Implementation)

The INIT state has everything reset and the only event possible from the INIT state is to press play (event).

Once play is pressed, the state transitions to a RUNNING state. There are two possible events in the RUNNING state: press lap or press pause. If the press lap event is triggered, the current elapsed time should be displayed (aka as split) and the clock should keep on running. If the press pause event is triggered, the state transitions to the HALTED state.

In the HALTED state, the stop clock timer is paused and information already displayed on the watch should stay as is. There are two possible events in the HALTED state: press play or press stop. The press play event should restart the stop clock from where it got paused, and the press stop should reset the stop clock and transition to the INIT state.

Implementation

The two important things to figure out for the implementation is how to display something on the watch, and how to calculate the time elapsed while the stop clock is running. So Infinitime uses LVGL to create graphics on the watch display and FreeRTOS as the real-time operating system (which could also be used to retrieve the time elapsed).

Time elapsed:

FreeRTOS provides a convenience function xTaskGetTickCount() to get current tick count. Ticks are some sort of a scaled unit of time that is used by the FreeRTOS. Now that there is a way to get the current time in ticks, the time elapsed can be easily calculated.

Whenever the user triggers the play event, the xTaskGetTickCount() function is called to store the start time in ticks. So each time the Refresh() event is called by the Infinitime framework, the xTaskGetTickCount() can be called again to find the difference between the start time to get the time elapsed. Very easy…., except the story doesn’t end there.

In microcontrollers, the time can be measured by different timers present in the system which pulses at a certain frequency. The RTC (real time counter) that FreeRTOS uses in the Infinitime is the RTC1 which has a frequency of 32,768 kHz. The counter just keeps increasing for each tick and FreeRTOS stores this counter in a 32-bit value². Whenever the counter reaches a value of 4,294,967,295 (2³² - 1), it resets backs to zero. So this is generally called an overflow in the counter.

The counter flow can lead to erroneous calculation in time elapsed. For e.g., assume our start time tick is equal to 4,294,967,291 (2³² - 4) and next tick we receive in the Refresh function is 5 due to an overflow in the counter. Even though the actual time elapsed is 10 ticks: 4 ticks until overflow + 6 ticks after overflow (0 is also a tick), but when the difference is directly calculated, it turns out to be 4,294,967,286 (4,294,967,291 - 5)ticks. That is completely wrong and will introduce an error in the calculation of the stopwatch.

To avoid the error with counter overflow, a simple check could be done as follows:

TickType_t delta = 0;
 // Take care of overflow
 if (startTime > currentTime) {
   // Could be done in one line but made it separate for 
   // clarity
   delta = 0xffffffff - startTime;
   delta += (currentTime + 1);
 } else {
   delta = currentTime - startTime;
 }

This simple check verifies if the startTime is greater than the currentTime (when we find a time machine this would not be valid anymore :P ) and calculates the delta appropriately.

Even though the above solution works in case of a timer overflow, I found an another neat solution which could do the same calculation in one line without any if/else condition³:

TickType_t delta = 0;
delta = (currentTime - startTime) & MAX_COUNTER_VALUE; // MAX_COUNTER_VALUE in our case is 0xFFFFFFFF(2^32 - 1)

I am not going to explain why the above solution works because it is deviating from the original intent of the article. But here is a small hint for the curious ones: this solution works because of the magic of the 2s complement. In case if some of you forgot how twos complement is used to store negative numbers, I highly recommend watching this amazing video from Ben Eater.

Display & State transitions:

As I already mentioned, Pinetime uses the LVGL library to display, create buttons, etc on Pinetime. It is a very popular graphics library used in many embedded platforms.

Label text and buttons widget are all that are needed to implement the stopwatch. The label text widget is used to display the time elapsed, laps, etc. The button widget will be used to start, pause and stop the stopwatch. Pressing the appropriate button will create the event in the state machine. LVGL’s button widget will take a callback and it will be invoked when the button press takes place.

As LVGL is a C library, the callback is a function pointer instead of a C++ std::function. As I wanted to pass the current object of the class StopWatch into the callback, I had to use a simple static dispatcher function. It was done as follows:

static void play_pause_event_handler(lv_obj_t* obj, lv_event_t event) {
  auto stopWatch = static_cast<StopWatch*>(obj->user_data);
  stopWatch->playPauseBtnEventHandler(event);
}

The play_pause_event_handler is the callback for the play/pause button. As LVGL allows to pass some user-defined data as a part of the callback, this pointer of the object was used there (obj->user_data). Using the this pointer, a private member function of the class StopWatch is called in-turn which also has access to all the class private variables.

These button’s callback serve an important purpose to through which we can indicate an event to the state machine. For example:

void StopWatch::playPauseBtnEventHandler(lv_event_t event) {
  if (event == LV_EVENT_CLICKED) {
    if (currentState == States::Init) {
      currentEvent = Events::Play;
    } else {
      // Simple Toggle for play/pause
      currentEvent = (currentEvent == Events::Play ? Events::Pause : Events::Play);
    }
  }
}

bool StopWatch::Refresh() {
  switch (currentState) {
    // Init state when an user first opens the app
    // and when a stop/reset button is pressed
    case States::Init: {
     .....
     ....

      if (currentEvent == Events::Play) {
		....
        currentState = States::Running;
      }
      break;
    }
   case States::Running: {
		....
   }
  ....
}

Let’s assume the currentState shown in the above snippet is set to State::Init. Whenever the play/pause button is pressed, StopWatch::playPauseBtnEventHandler is invoked. In that callback function, the currentEvent is set Events::Play. As the Refresh function is called every 20ms, in the next cycle we change the currentState to States::Running (check also the state diagram shown before), therefore, successfully making a state transition.

As play & pause are mutually independent and stop & lap are mutually independent only two buttons were used in total.

Lap/Split:

The final feature to implement is the lap (aka as split) feature. As the space in the watch is limited, we can only display the latest two splits. A simple circular buffer was used to hold the latest two splits and display them on the watch.

The callback for the lap button was as follows:

void StopWatch::stopLapBtnEventHandler(lv_event_t event) {
  if (event == LV_EVENT_CLICKED) {
    // If running, then this button is used to save laps
    if (currentState == States::Running) {
      lapBuffer.addLaps(currentTimeSeparated);
      lapNr++;
      lapPressed = true;
     .....

Whenever a lap button is pressed in the States::Running, the currentTime is added to the buffer and a boolean lapPressed is set to true. The Refresh function will check for any available data in the buffer and display it in the appropriate location.


Splits highlighted in the stopwatch app

Final result:

The final PR that got merged can be viewed here in case if someone wants to see the complete code.


Init state: Only play button available


Running state: Only pause or split button available	Halted state: Only stop or play button available


Stopwatch in action

Conclusion

Writing an app for Pinetime watch in a watch firmware like Infinitime was not as difficult as I thought it would be. All I needed was the logic and some embedded software knowledge to implement a stopwatch. The main credit goes to the author and the community that has done the leg work of setting up everything: from toolchains to libraries.

I believe the Pinetime watch provides a perfect playground if you are an embedded software enthusiast or someone with software knowledge but wants to dig deep into the embedded side of things. There is an active community from all over the world who are always willing to help. And all the firmware that is currently available for Pinetime is far from done. I believe everyone can contribute to them in some way and make this community-driven project a success.

Now, what are you waiting for? Get a Pinetime and start hacking! :)

Hope you enjoyed the article. Happy coding!

¹ Other firmwares that I know are wasp-OS, Pinetime Lite (A fork of Infinitime), pinetime-rust for all the rust lovers.

² The inbuilt RTC1 of the board only has a 24 bit counter but the FreeRTOS wrapper has a 32-bit wrapper on top of it.

³ I found this neat trick in FreeRTOS’s overflow protection difference calculation here

Porting CHIP8 to an ESP32

2020-12-12T00:00:00+00:00

I am really fascinated by how computers or a certain piece of electronic hardware work. Building emulators is an easy way to look under the veil of a piece of hardware or some processor using the software. Of course, it is not easy to develop an emulator for modern-day hardware as they are much more complex. One easy way to get started is to build an emulator for older hardware as processors used to be much simpler than what we have today in 2020. The best part is, some of the core concepts remain the same.

Building a CHIP8 “emulator” is one of the very popular ways of getting started with emulation programming. I put quotes around the emulator in the previous sentence because CHIP8 is a virtual machine. The CHIP8 programming language( sort of the “Byte code” ) is used to write programs that target the CHIP8 virtual machine. As the CHIP8 virtual machine has opcodes(operation code) that are similar to the modern processor instruction set and there is only a small number of them, building a CHIP8 emulator is one of the best ways to get started in the world of emulation.

I wrote a CHIP8 emulator using C++17 with SFML to handle IO(input/output) for desktop Linux. But this article is not about building a CHIP8 emulator for desktop. There are already a plethora of articles which does a much better job in explaining than what I could possibly do(article). I wanted a bit of extra challenge so I decided to port the existing CHIP8 emulator I wrote for the desktop to a smaller, resource-constrained microcontroller, ESP32¹. ESP32 is very cheap and it comes with a built-in Bluetooth Low Energy (BLE), which I will be using for giving inputs to my emulator.

I will not go through every single detail on how to build an ESP32 based CHIP8 emulator in this article. But this article will give you the overall picture plus some of the issues I faced while building the emulator that might come in handy for someone deciding to build their own. This might be a fun project for someone who already has some basic experience with embedded systems and looking to improve their skills with more advanced topics such as BLE, RTOS, filesystems (FS), etc.

Let’s get rolling!

Prerequisites

Hardware

ESP32 DevKitC (or any other varieties of ESP32 dev board)
ILI9341 based 2.4 inch 240x320 SPI TFT (link)
Bread board
Jumper wires
Android phone (Will be using an (Android App) as the Keyboard to interact with ESP32)

Software

C++17 with the native ESP-IDF (based on FreeRTOS) framework is used to program the ESP32. It was definitely nice to use modern C++ to write the firmware. I decided not to use Arduino. Arduino is great to get started and to quickly prototype something, but probably not great if you want to learn what is going on underneath and have total control due to its abstraction model. The best way to learn embedded software is to create your own abstraction with the native SDK provided by the vendor.

Bluetooth Low Energy(BLE) is used to communicate with the ESP32. ESP-IDF comes with a bluedroid based Bluetooth stack. The display driver for ILI9341 is based on this repo. I made a fork of it because I had to adapt the code a bit. It is added as a submodule in my main repo.

Optional

I used the Flutter SDK to develop the Android app. This is totally optional. You can use the android app I made to control the ESP32-CHIP8. It was a lot of fun developing the android app as I have never developed one before. And, especially, it is really easy to quickly prototype something with Flutter.

Implementation


Class diagram of the ESP32 CHIP8 system

The class diagram shown above roughly indicates different components and how they are related. The CHIP8 class in the diagram is sort of an orchestrator that glues different components together.

In the following subsections, we will discuss each class more in-depth. I have split the following subsections into Concept and Porting. The Concept part will mainly give the reader a bit of background on what I am trying to achieve, and it applies to any platform. In the Porting part, ESP32 specific changes that were needed to be incorporated is discussed.

Virtual Machine (VM)

Concept

The VM class implements the core logic of the CHIP8 VM in our system such as evaluating OPCODES and taking the necessary action. The CHIP8 has 4KB RAM and 16 8-bit general-purpose registers. The registers and RAM are emulated by using std::array. The main job of a VM can be summarized as follows:


Virtual Machine’s state diagram

The fetch part retrieves the OPCODE from the RAM based on the current Program Counter. The decode part looks for the current OPCODE implementation, which in our case is a simple switch statement. The execute part just executes the implementation corresponding to that OPCODE.

Porting

It was a bit surprising but I almost had no issues with this part when I ported from desktop to ESP32. But in a way, it makes sense because the idea of having a VM is to have the portability across multiple platforms. I did have to change the debug print statements from using fmt in desktop to ESP_LOG functions for ESP32. But it was a very minor change.

The main takeaway is to not introduce platform-specific dependency such as access to the display, keyboard in this class for easy portability.

Filesystem

Concept

The CHIP8 ROM contains the OPCODES in binary format and it needs to be stored in a place where the VM can access it. In Linux (or any other big operating OS) it is pretty easy to store the ROM in a folder. But ESP32 microcontroller has a Real-time operating system (FreeRTOS in this case) which is not a full-blown operating system.

Porting

ESP32 supports SPIFFS, which is a file system for SPI NOR flash devices for embedded systems. SPIFFS could be used as a filesystem to load the ROM into the VM.

ESP32 has a partition table for its flash memory. It can be found in the partitions.csv file, usually present in the root folder of the project. The SPIFFS storage details should be filled in the partitions.csv file. The csv file in my project looks as follows:

# Name,   Type, SubType, Offset,  Size, Flags
nvs,      data, nvs,     ,        0x6000,
phy_init, data, phy,     ,        0x1000,
factory,  app,  factory, ,        2M,
storage,  data,  spiffs, ,        1M,

As seen from the csv table, the last entry is reserved for data storage that has a capacity of 1 Megabyte (more than sufficient for a CHIP8 ROM). Now that the flash space is reserved for storage, the ROM needs to be written to that memory space.

ESP32 provides a convenience function: spiffs_create_partition_image in CMake to convert a particular folder into binary in correct data format, and also allows to be written onto ESP32 flash along with uploading(“flashing”) your code. So a folder needs to be created for the ROMs and can be easily used in our program in ESP32 using SPIFFS. This is the code I added in CMakeLists.txt file:

if(FLASH_SPIFFS)
    message("Flashing rom along with the app")
    spiffs_create_partition_image(storage ../../externals/rom FLASH_IN_PROJECT)
else()
    spiffs_create_partition_image(storage ../../externals/rom)
endif()

Enabling the FLASH_SPIFFS option while building the project will allow uploading both the application and the ROMs onto ESP32.

By doing all this, the ROM file can be easily accessed in the program similar to how it is done in Linux as follows:

...
std::ifstream file;
file.open("/spiffs/test_rom.ch8", std::ios::binary | std::ios::ate); 

Keyboard

Concept

The original CHIP8 keyboard had a 16-key hexadecimal keypad to interact with the console. The keyboard layout is shown below:


CHIP8 Keyboard layout (Image courtesy: link)

In CHIP8 there is no concept of interrupts. So CHIP8 VM manually polls for a keypress when it needs to receive an input.

Porting

In the desktop implementation, I used SFML library to get the key that is pressed on the keyboard. But for ESP32, I decided to use Bluetooth Low Energy (more on this in the next section) as a way to give keyboard inputs. The Keyboard object runs in core 1 and the BLE object runs in core 0 of ESP32 and is created as part of different FreeRTOS tasks. So the easiest way to pass a message between two tasks is using the FreeRTOS Queue. The BLE class puts the value it receives via bluetooth on to a queue using xQueueSend function and the Keyboard class receives the value using xQueueReceive function.

BLE

As indicated in the previous Keyboard subsection, BLE is used to give inputs to the ESP32-CHIP8 system. ESP32 will run a BLE GATT(Generic Attribute Profile) server. In BLE terms, ESP32 will act as a peripheral that advertises itself to be connected to a central (our smartphone in this case).

Making BLE work took me the most time in this complete project as I had zero experience with it before and I didn’t want to use any third-party library for it. There is a third-party library by Neil Kolban if you don’t want to write your own GATT server. But in case you are starting the BLE journey from scratch like me, writing a simple GATT server would be a nice way to understand some basics about it. The following are some of the articles/videos that helped me in learning the basics about BLE:

The important thing about the BLE stack is that everything works on callbacks. This is something that took me some time to wrap my head around. So, whenever a central(computer or a smartphone in this case) writes to the peripheral(ESP32), a callback is called with a write event that contains the value that was written. As ESP32 has two cores and has an RTOS, these asynchronous events are gracefully handled. So whenever a value is written to the ESP32 BLE server, the BLE class object will push that value to a FreeRTOS queue that will be used by the Keyboard class later. A simple sequence diagram of that flow is shown below:


BLE and Keyboard class sequence diagram

Display

Concept

The original CHIP8 machine uses a 64x32 pixel display in this format:


Chip8 display layout (64 32 pixels)*

A simple method to store the pixel map of the display is to have an array (std::bitset if you want to have a lower memory footprint) of length 2048( 64 * 32). Only two instructions(DRW and CLS) in CHIP8 VM modifies the display pixels. One easy optimization would be to draw on the display only if either of the two instruction was executed in that cycle. By doing this small optimization, we avoid the expensive draw on the display operation when it is not needed. I used this optimization even in my desktop version of my CHIP8.

Porting

I am using a 2.4 inch 240x320 SPI TFT for display based on the ILI9341 driver. The first problem to tackle is the default orientation of the TFT display. The picture below depicts what happens if a pixel is set at (0,0):


Pixel (0,0) set in TFT display

When we want to play, the display should be in landscape mode rather than portrait mode. So we need to perform a transformation from the default orientation of the TFT display to the actual CHIP8 display orientation. And as CHIP8 uses only 64 * 32 display pixels, we can scale it in our physical TFT display as the available display pixel is 240 * 320.

We can scale the CHIP8 pixel map by 4 times in our physical TFT display (64 * 4 = 256 & 32 * 4 = 128). And the way we scale up by 4 times is very simple: each pixel in the CHIP8 pixel map will correspond to a square of width 4 in the physical display. The transformation we want to achieve is as follows:


Left: Untransformed and unscaled projection of CHIP8 display pixel map on physical TFT display Right: Transformed and scaled projection of CHIP8 display pixel map on physical TFT display

So as shown in the figure above, we need to do two things:

Transform the CHIP8 pixel coordinates to the TFT display coordinates in landscape mode
Scale each CHIP8 pixel into a square of scale width in the TFT display.

Using simple coordinate geometry, we can make the transformation as follows:


Transformation equation from CHIP8 pixel map to the actual TFT display orientation

X_CHIP8 and Y_CHIP8 is the CHIP8 VM coordinates of a pixel. X_TFT and Y_TFT is the coordinates of our TFT display that we need to send SPI commands to display the pixel.

All righty! Let’s now try out in real hardware how our transformation looks so far.


Output in the TFT display on test rom

Hurray! The transformation is a success, but the display output is terribly slow. In my desktop implementation, this test rom would take less than a second, and with my esp32, it is taking almost 20 seconds! I expected my ESP32 CHIP8 to be slightly slower than my desktop, but not this slow. It is time for some more optimization!!

As I mentioned before, we need to draw the graphics on the TFT display only for two instructions:DRW and CLS. In my naive implementation, whenever I get an instruction to draw, I redraw the complete CHIP8 pixel map onto the screen. This means at least 2048 SPI transactions are happening to the screen when a draw instruction is received. But if only some pixels are changing in a cycle, instead we could update only the pixels that have changed from the previous frame. Luckily with the 512 kB RAM, caching could be used to improve the performance.

The caching that is used here is pretty straightforward. I have a local cache of the previous frame using std::bitset (Using bitset here is memory efficient. It takes only 256 bytes to store the information of 2048 pixels). Whenever a draw instruction is received from the VM, only the pixels that have changed from the previous frame are drawn. Pseudocode might look something like this:

void drawGfx() 
{
    for (auto [x, y] : chip8_display_coordinates) {
        // First transform the coordinates to TFT display coordinates
         auto [new_x, new_y] = transformCoordinates(x,y);
        // Check if the pixels have changed in the new frame
        if (chip_display_pixels(x,y) ^ disp_cache(x,y)) {
         // Update the cache
         disp_cache(x,y) = chip_display_pixels(x,y);
         //Draw the pixel on the display
         TFT_draw(new_x,new_y);
        }
    }
}

Now that the caching is applied, let’s see the improvement in the performance:


Output in the TFT display with caching to improve perf on test rom

Perfect! It takes less than a second similar to the desktop implementation. That is more than a 10x improvement in performance compared to my naive implementation.

CHIP8 (class)

Concept

As seen in the SW Class Diagram, this CHIP8 class initiates and ties everything together. The main responsibility of the CHIP8 class is to initialize all the classes and orchestrate coordination between them to run the CHIP8 VM.

Porting


Sequence diagram of the CHIP8 class interaction with other classes

First, the CHIP8 class spawns a BLE task with an appropriate service to process the incoming request. It is done by calling a few member functions of the BLE class. Then the CHIP8 class initializes the TFT display and the SPIFFS filesystem. And finally, it spawns a new task with a function that is indicated as start() for running the emulator itself (To create a new task, a function needs to be provided to the FreeRTOS task create function). Running the emulator as a part of a separate task allows specifying the stack size and the core it needs to run (Core 1 was chosen here as BLE runs on Core 0)

The start function is implemented as a simple state machine as follows:


State diagram of the start function

The new task created waits for an option to be selected, and once the new option is selected, the game starts by loading the correct CHIP8 ROM into the virtual machine. There is also an exit button observer (aka Observer Pattern) to exit the game and go back to Select Option state when the user wants to play a different game.

A short video snippet of the completed project:


A short video snippet of the final working ESP32-CHIP8 with CHIP8 numpad app

Conclusion

This work hits the perfect sweet spot as a side project in terms of difficulty and learning opportunities. Of course, I have mentioned only the happy flow here, but trust me, there were several moments I doubted myself as an embedded software engineer during the project. Moments like these reassure me that I am learning something new.

Doing a CHIP8 emulator in an operating system like Linux is also a good learning experience and a nice introduction to the world of emulation. But things that are trivial in Linux was not so anymore when I started porting them to a microcontroller. The caching I described in the Display subsection is one such example where I didn’t have to do such optimization in Linux but I had to do it in ESP32. That is why I feel it is a very good project for someone who has already played around with Arduino and wants to go beyond their comfort zone.

The complete ESP32 source code can be found here and the Android app can be found here. Feel free to have a look at it and also report if you find something weird.

Hope you enjoyed the article. Happy coding!

¹ Some hardcore embedded folks might say ESP32 is not really resource-constrained because of its 512 kB RAM and 4 MB flash, but compared to my laptop, an ESP32 is definitely resource-constrained.

But I was helping the compiler!

2020-08-16T00:00:00+00:00

Compilers are getting better with each release. Sometimes a noticeable difference can be observed in the assembly output for the same piece code in a different version of the same compiler (can be easily done via compiler explore).

I have gotten into the practice of checking the assembly output lately to analyze the overhead of various implementations. Beware, sometimes it can get addictive. But I think it is a nice way of learning to read assembly and also be amazed at how clever the compilers are these days.

In this article, I am going to cover one such incident that happened when I was looking at the assembly output of a function during my CHIP8 implementation.

But the context of the problem first!

The magic of move semantics

Move semantics was introduced in C++11. We can think of move semantics as a way of transferring ownership of an object. If you are really new to move semantics, consider the following example:

So your colleague has a document that you also want. We have two options here. The first option, you take that document to a copier, take a copy of the document for yourself, and return the original document to your colleague. The second option, assuming your colleague doesn’t need the document anymore, instead of throwing it away, your colleague can give it to you, thereby, saving paper.

Replace the document with a memory resource in a program, then the first option is doing a copy, and the second option is doing a move, where you transfer the ownership instead of wasting the resource.

Putting what I learned in action

While implementing my CHIP8 emulator, I saw an opportunity to replace an expensive copy operation into a cheap move operation (at least that is what I thought).

To give a bit of context: In each frame cycle, I had to return an array containing 2048 integers that will be used to draw the graphics on the screen. The pseudo-C++ code is shown below:

// chip8.cpp
static constexpr display_size = 2048;
class Chip8 {
    ... 
    public : 
        std::array<uint8_t, display_size> get_display_pixels() {
              // Do some computation
              return gfx;
          }
    ... 
    private : 
        std::array<uint8_t, display_size> gfx{};
};

// main.cpp
while (displayOn) {
    ... 
    const auto disp_pixels = emulator.get_display_pixels();
    ...
    // Use disp_pixels to draw pixels on the screen
}

This is what I assumed was going on when I did the call to get_display_pixels() member function:

The compiler copies the gfx private variable of the Chip8 class to the return value of the get_display_pixels() member function.
The compiler calls the copy constructor to copy the return value of the function call to the disp_pixels variable.

So, I concluded that I could use a move constructor to transfer the contents to my local variable disp_pixels to avoid a copy in the second step as described above.

So I changed my code in the main function as follows:

// main.cpp
while (displayOn) {
    ... 
    const auto disp_pixels = std::move(emulator.get_display_pixels());
    ...
    // Use disp_pixels to draw pixels on the screen
}

Before you get furious and stop reading the article further because what I assumed was completely wrong, I realized that too, and the rest of the article is about that.

As soon as I used a std::move as shown in my previous code snippet, I observed the compiler was generating more assembly code than my initial code without a std::move(with std::move: link, without std::move: link).

What went wrong? Hmm….

NRVO to the rescue

NRVO stands for Named Return Value Optimization. It is a nice trick that the compiler uses to omit unnecessary copy or move if certain conditions are met. Compilers have been using this trick for a long time. If NRVO takes place in our function call, then effectively we just do one copy instead of two. Let’s see how it works.

Even though the function signature of get_display_pixels indicates that it does not take any parameters, the compiler will pass one extra parameter behind the scenes from the caller (initialization call of disp_pixels from main.cpp) to the callee (get_display_pixels function in chip8.cpp). The caller will allocate the memory for the return value and pass the address of that memory to the callee. The callee will use that memory to construct the object and copy the value of the private variable gfx (in this case). As the memory of the caller (disp_pixels) was used by the callee, there is no need to copy the return value again, thereby, saving one unnecessary copy/move operation.

We should see the assembly output to really understand how NRVO is happening under the hood. The assembly code from the caller side is as follows:

 lea  rax, [rbp-16384]
 lea  rdx, [rbp-8192]
 mov  rsi, rdx
 mov  rdi, rax
 call Chip8::get_display_pixels()

Before the function call, rsi and rdi registers are loaded with upper and lower bound of the memory address of the disp_pixels variable. And, the trimmed assembly output from the callee side is as follows:

 Chip8::get_display_pixels():
 push rbp
 mov  rbp, rsp
 mov  QWORD PTR [rbp-8], rdi
 mov  QWORD PTR [rbp-16], rsi
...

As seen from the callee side, the rdi and rsi values are moved to the stack, and further operations are performed with that memory address. Pretty neat!

A simple analogy for NRVO I like to think of is when you are asking a friend to fill in water inside a water bottle, you would give your bottle to fill water from the tap directly. It would be inefficient to first fill the water in a temporary bottle and transfer the contents again to your bottle. In the C++ context, bottle is the memory space and the water it holds is the return value.

If we assume that the NRVO will take place, then the most efficient way of writing my function call is:

// main.cpp
while (displayOn) {
    ... 
    const auto disp_pixels = emulator.get_display_pixels();
    ...
    // Use disp_pixels to draw pixels on the screen
}

GCC and Clang even have an extra warning flag -Wpessimizing-move which detects when we are trying to use a move where compiler-generated NRVO is much more efficient.

Even though we can assume in many situations that a NRVO will take place, especially if optimizations are turned on, C++ standard does not guarantee NRVO in all situations¹. But what if compiler does not perform a NRVO?

Lvalues and Rvalues (and all other value categories in between)

Even though there are some proposals to guarantee NRVO, it is not yet guaranteed by the standard. As it is not guaranteed, should we explicitly indicate a move operation to save a copy just in case the compiler doesn’t do a NRVO? To answer that, I added the flag -fno-elide-constructors that disables copy elision (the super-set of NRVO) in our code, thereby, allowing to see what the compiler does otherwise.

I was surprised to see that compiler was still performing a NRVO for C++17 standard with -fno-elide-constructors enabled. But this was not the case for C++14, the compiler generated different assembly with -fno-elide-constructors enabled. If someone knows the reason why this difference occurs between C++17 and C++14 even though NRVO is not guaranteed, please email me about it. Godbolt link.

Let’s use C++14 with -fno-elide-constructors flag to simulate the scenario where the compiler fails to apply NRVO so that we can check whether we needed to do something extra to avoid superfluous copies.

So I added -fno-elide-constructors to disable any NRVO to the final code of the previous section. The caller generated the following assembly code:

 lea  rdx, [rbp-8192]
 lea  rax, [rbp-16384]
 mov  rsi, rdx
 mov  rdi, rax
 call Chip8::get_display()
 lea  rdx, [rbp-8192]
 lea  rax, [rbp-16384]
 mov  rsi, rdx
 mov  rdi, rax
call std::array<unsigned char, 32ul>::array(std::array<unsigned char, 32ul>&&)

As we can notice, the first 5 assembly instructions are the same as the version with NRVO enabled, and there are 5 more assembly instructions in this version as we disabled NRVO. The most important instruction we need to focus on is line number 10 where a move constructor(notice && in the function signature). Wait, a move constructor is invoked? I did not use a std::move but the compiler decided to do it anyway. To really comprehend the reason, we need to understand value categories in C++.

In these two articles: Understanding lvalues and rvalues in C and C++ and basic.lval#1, value categories are explained in detail². In brief, quoting from the first article: “An lvalue (locator value) represents an object that occupies some identifiable location in memory. Rvalue is an expression that does not represent an object occupying some identifiable location in memory.”. Of course, there are more categories than just a lvalue and a rvalue. I would highly recommend reading both articles. Though you don’t need a perfect grasp of them to understand what comes later in this article. Let’s get back to our original example and analyze why the move constructor was called.

// main.cpp
while (displayOn) {
    ... 
    const auto disp_pixels = emulator.get_display_pixels();
    ...
    // Use disp_pixels to draw pixels on the screen
}

In the above code snippet, the function call to get_display_pixels belongs to the rvalue (more precisely a prvalue) category and it generates a temporary. The compiler can now safely move that temporary into the disp_pixels variable because that temporary will be destroyed anyway after this statement. If the type that is being returned does not have a move constructor (in our case std::array has a move constructor), then the compiler will call the copy constructor.

In principle, if any of the moveable types (standard or user-defined) is returned from a function by value, we can safely assume either NRVO or move operation will take place resulting in no superfluous copies for standard compilers that support C++11 and above.

Conclusion

Probably you already knew about what I discussed in this article and you might think why did I ramble about things that I did not understand properly in the first place. That is the point of this article I guess.

It is always nice to use a new concept that we read from a blog or a book and use them immediately without understanding the implications of it. Sometimes, it can happen even to experienced programmers. So, whenever you learn a new concept, especially in C++, check the assembly that it generates to really verify what is happening, and what you expected to have happened. Trust me, learning a bit of assembly will definitely pay off because it lets us peek into what is ultimately going to run on the computer.

I really thought move semantics worked a certain way and I started prematurely optimizing (pessimizing in this case) without really understanding them in a broader context. Looking into the assembly definitely deepened my understanding of move semantics and some compiler optimizations.

Above all, I am starting to trust the compiler to do the right thing for me and respect the compiler writers even more. Check out this video by Matt Godbolt which stresses the point I am trying to make here.

Hope you enjoyed the article. Happy coding!

¹ RVO is guaranteed since C++17. To understand the difference between RVO and NRVO, refer to this link

² I also found this article by Sy brand to be extremely useful while writing this article.