But I was helping the compiler!

Compilers are getting better with each release. Sometimes a noticeable difference can be observed in the assembly output for the same piece of code between versions of the same compiler (this is easy to check with Compiler Explorer).

I have gotten into the practice of checking the assembly output lately to analyze the overhead of various implementations. Beware, sometimes it can get addictive. But I think it is a nice way of learning to read assembly and also be amazed at how clever the compilers are these days.

In this article, I am going to cover one such incident that happened when I was looking at the assembly output of a function during my CHIP8 implementation.

But the context of the problem first!

The magic of move semantics

Move semantics was introduced in C++11. We can think of move semantics as a way of transferring ownership of an object. If you are really new to move semantics, consider the following example:

Say your colleague has a document that you also want. There are two options. With the first, you take the document to a copier, make a copy for yourself, and return the original to your colleague. With the second, assuming your colleague doesn’t need the document anymore, they can simply give it to you instead of throwing it away, thereby saving paper.

Replace the document with a memory resource in a program, then the first option is doing a copy, and the second option is doing a move, where you transfer the ownership instead of wasting the resource.
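To make the analogy concrete, here is a minimal sketch using std::vector, which owns a heap buffer and therefore has something to hand over. The function names are made up for illustration. Checking data() after the move relies on how standard library implementations behave in practice: the moved-to vector takes over the source's buffer.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Option 1 from the analogy: the photocopier.
// Copying allocates a second buffer and duplicates every byte.
bool copy_allocates_new_buffer() {
    std::vector<std::uint8_t> original(2048, 0xFF);
    std::vector<std::uint8_t> duplicate = original;
    return duplicate.data() != original.data() && duplicate == original;
}

// Option 2 from the analogy: the colleague hands the document over.
// Moving lets the new vector take over the existing buffer.
bool move_reuses_buffer() {
    std::vector<std::uint8_t> original(2048, 0xFF);
    const std::uint8_t* buffer = original.data();
    std::vector<std::uint8_t> receiver = std::move(original);
    // `original` is now in a valid but unspecified state; do not read from it.
    return receiver.data() == buffer && receiver.size() == 2048;
}
```

Both functions return true on the mainstream standard libraries. The saving comes from the vector owning its memory indirectly; a type that stores its elements inline has no buffer to hand over.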

Putting what I learned in action

While implementing my CHIP8 emulator, I saw an opportunity to replace an expensive copy operation with a cheap move operation (at least that is what I thought).

To give a bit of context: In each frame cycle, I had to return an array containing 2048 integers that will be used to draw the graphics on the screen. The pseudo-C++ code is shown below:

// chip8.cpp
#include <array>
#include <cstddef>
#include <cstdint>

static constexpr std::size_t display_size = 2048;

class Chip8 {
    ...
public:
    // Returns the frame buffer by value
    std::array<std::uint8_t, display_size> get_display_pixels() {
        // Do some computation
        return gfx;
    }
    ...
private:
    std::array<std::uint8_t, display_size> gfx{};
};

// main.cpp
while (displayOn) {
    ... 
    const auto disp_pixels = emulator.get_display_pixels();
    ...
    // Use disp_pixels to draw pixels on the screen
}

This is what I assumed was going on when I did the call to get_display_pixels() member function:

The compiler copies the gfx private variable of the Chip8 class to the return value of the get_display_pixels() member function.
The compiler calls the copy constructor to copy the return value of the function call to the disp_pixels variable.

So, I concluded that I could use a move constructor to transfer the contents to my local variable disp_pixels to avoid a copy in the second step as described above.

So I changed my code in the main function as follows:

// main.cpp
while (displayOn) {
    ... 
    const auto disp_pixels = std::move(emulator.get_display_pixels());
    ...
    // Use disp_pixels to draw pixels on the screen
}

Before you get furious and stop reading because what I assumed was completely wrong: I realized that too, and the rest of the article is about exactly that.

As soon as I used std::move as shown in the previous code snippet, I observed that the compiler was generating more assembly code than for my initial code without std::move (with std::move: link, without std::move: link).

What went wrong? Hmm….

NRVO to the rescue

NRVO stands for Named Return Value Optimization. It is a nice trick that the compiler uses to omit an unnecessary copy or move when certain conditions are met. Compilers have been using this trick for a long time. If NRVO takes place in our function call, then effectively we do just one copy instead of two. Let’s see how it works.

Even though the function signature of get_display_pixels indicates that it does not take any parameters, the compiler will pass one extra parameter behind the scenes from the caller (initialization call of disp_pixels from main.cpp) to the callee (get_display_pixels function in chip8.cpp). The caller will allocate the memory for the return value and pass the address of that memory to the callee. The callee will use that memory to construct the object and copy the value of the private variable gfx (in this case). As the memory of the caller (disp_pixels) was used by the callee, there is no need to copy the return value again, thereby, saving one unnecessary copy/move operation.
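The hidden parameter can be imitated by hand. The sketch below is only an illustration of the idea, not what the compiler literally emits: Chip8Sketch and get_display_pixels_lowered are hypothetical names, and the "lowered" version takes the address of the caller's storage explicitly.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t display_size = 2048;
using Display = std::array<std::uint8_t, display_size>;

// Hypothetical stand-in for the emulator class.
struct Chip8Sketch {
    Display gfx{};

    // What we write: return by value.
    Display get_display_pixels() const { return gfx; }

    // Roughly what happens behind the scenes: the caller passes the address
    // of the storage it reserved, and the callee fills that storage directly.
    // (The real lowering constructs the object in place; for a trivially
    // copyable std::array, assigning through the pointer is equivalent.)
    void get_display_pixels_lowered(Display* result) const { *result = gfx; }
};

// Returns true if both ways of calling yield the same pixels.
bool lowered_matches_plain() {
    Chip8Sketch emulator;
    emulator.gfx[0] = 1;
    emulator.gfx[display_size - 1] = 7;

    Display disp_pixels;                               // caller owns the memory
    emulator.get_display_pixels_lowered(&disp_pixels); // callee writes into it
    return disp_pixels == emulator.get_display_pixels();
}
```

In both versions gfx is copied exactly once, straight into the memory the caller set aside.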

Let’s look at the assembly output to really understand how NRVO happens under the hood. The assembly code from the caller side is as follows:

 lea  rax, [rbp-16384]
 lea  rdx, [rbp-8192]
 mov  rsi, rdx
 mov  rdi, rax
 call Chip8::get_display_pixels()

Before the function call, rdi is loaded with the address of the memory reserved for the return value (our disp_pixels variable) and rsi with the address of the emulator object (the implicit this pointer). The trimmed assembly output from the callee side is as follows:

 Chip8::get_display_pixels():
 push rbp
 mov  rbp, rsp
 mov  QWORD PTR [rbp-8], rdi
 mov  QWORD PTR [rbp-16], rsi
...

On the callee side, the values of rdi and rsi are first saved to the stack, and the function then writes its result through the address it received in rdi. Pretty neat!

A simple analogy for NRVO I like to think of: when you ask a friend to fill your water bottle, you hand over the bottle so it can be filled from the tap directly. It would be inefficient to first fill a temporary bottle and then pour its contents into yours. In the C++ context, the bottle is the memory space and the water it holds is the return value.

If we assume that the NRVO will take place, then the most efficient way of writing my function call is:

// main.cpp
while (displayOn) {
    ... 
    const auto disp_pixels = emulator.get_display_pixels();
    ...
    // Use disp_pixels to draw pixels on the screen
}

GCC and Clang even have a warning flag, -Wpessimizing-move, which detects when we use std::move in a place where it prevents the compiler from eliding the copy.

Even though we can assume in many situations that NRVO will take place, especially if optimizations are turned on, the C++ standard does not guarantee NRVO in all situations¹. But what if the compiler does not perform NRVO?
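The extra work caused by the std::move can be made visible without reading assembly. The sketch below uses a hypothetical Tracer type and a global counter (both invented for this illustration) to count how often the move constructor runs in each variant. It assumes the compiler's default settings, that is, without -fno-elide-constructors.

```cpp
#include <utility>

// Hypothetical counter: number of Tracer move constructions so far.
int g_move_count = 0;

// Hypothetical type whose move constructor records each call.
struct Tracer {
    Tracer() = default;
    Tracer(const Tracer&) = default;
    Tracer(Tracer&&) noexcept { ++g_move_count; }
};

Tracer make_tracer() { return Tracer{}; }

// Plain initialization: the returned temporary is built directly in `t`.
int moves_without_std_move() {
    g_move_count = 0;
    const Tracer t = make_tracer();
    (void)t;
    return g_move_count;
}

// Wrapping the call in std::move forces a temporary to exist first,
// which then has to be moved into `t`.
// Compilers with -Wpessimizing-move enabled warn about this line.
int moves_with_std_move() {
    g_move_count = 0;
    const Tracer t = std::move(make_tracer());
    (void)t;
    return g_move_count;
}
```

With default settings the first function reports zero moves and the second reports one, so the std::move adds work instead of removing it.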

Lvalues and Rvalues (and all other value categories in between)

Even though there are some proposals to guarantee NRVO, it is not yet guaranteed by the standard. Since it is not guaranteed, should we explicitly request a move to save a copy just in case the compiler doesn’t perform NRVO? To answer that, I added the flag -fno-elide-constructors, which disables copy elision (the superset of NRVO) in our code, thereby allowing us to see what the compiler does otherwise.

I was surprised to see that the compiler still elided the copy in C++17 mode with -fno-elide-constructors enabled, while in C++14 mode the same flag produced different assembly. The most likely explanation is the one in footnote ¹: since C++17, initializing a variable directly from a function's returned temporary is guaranteed to be elided (strictly speaking this is RVO rather than NRVO), so the flag can no longer switch it off in that mode. If you know of anything more to it, please email me. Godbolt link.
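One way to see the C++17 guarantee from footnote ¹ in action is a type that can be neither copied nor moved. The sketch below uses a hypothetical Pinned type and requires -std=c++17 or newer; under C++14 the same code is rejected because returning by value formally needs an accessible copy or move constructor there.

```cpp
// Hypothetical type with both copy and move constructors deleted.
struct Pinned {
    int value = 42;
    Pinned() = default;
    Pinned(const Pinned&) = delete;
    Pinned(Pinned&&) = delete;
};

// Compiles only because C++17 guarantees that the returned temporary
// is constructed directly in the caller's object.
Pinned make_pinned() { return Pinned(); }

int read_pinned() {
    const Pinned p = make_pinned();  // no copy or move constructor involved
    return p.value;
}
```

Since no constructor call exists to be skipped here, a flag that disables optional elision has nothing to turn off in this situation.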

Let’s use C++14 with -fno-elide-constructors flag to simulate the scenario where the compiler fails to apply NRVO so that we can check whether we needed to do something extra to avoid superfluous copies.

So I compiled the final code of the previous section with -fno-elide-constructors to disable NRVO. The caller generated the following assembly code:

 lea  rdx, [rbp-8192]
 lea  rax, [rbp-16384]
 mov  rsi, rdx
 mov  rdi, rax
 call Chip8::get_display()
 lea  rdx, [rbp-8192]
 lea  rax, [rbp-16384]
 mov  rsi, rdx
 mov  rdi, rax
 call std::array<unsigned char, 32ul>::array(std::array<unsigned char, 32ul>&&)

As we can see, the first five instructions match the version with NRVO enabled, and five more follow because we disabled NRVO. The most important one is the last (the tenth), where a move constructor is called (notice the && in the function signature). Wait, a move constructor is invoked? I did not use std::move, but the compiler decided to do it anyway. To really comprehend the reason, we need to understand value categories in C++.

Value categories are explained in detail in these two references: Understanding lvalues and rvalues in C and C++, and basic.lval#1². In brief, quoting from the first: “An lvalue (locator value) represents an object that occupies some identifiable location in memory. Rvalue is an expression that does not represent an object occupying some identifiable location in memory.” Of course, there are more categories than just lvalue and rvalue. I would highly recommend reading both, though you don’t need a perfect grasp of them to understand what comes later in this article. Let’s get back to our original example and analyze why the move constructor was called.

// main.cpp
while (displayOn) {
    ... 
    const auto disp_pixels = emulator.get_display_pixels();
    ...
    // Use disp_pixels to draw pixels on the screen
}

In the above code snippet, the function call to get_display_pixels belongs to the rvalue (more precisely a prvalue) category and it generates a temporary. The compiler can now safely move that temporary into the disp_pixels variable because that temporary will be destroyed anyway after this statement. If the type that is being returned does not have a move constructor (in our case std::array has a move constructor), then the compiler will call the copy constructor.
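The way the compiler chooses between copying and moving can be observed with a pair of overloads. This is a minimal sketch with made-up names (Pixels, which_overload, CopyOnly); it shows which overload each kind of expression selects, and that a type without a move constructor quietly falls back to its copy constructor.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for the returned pixel buffer.
struct Pixels { int dummy = 0; };

Pixels make_pixels() { return Pixels{}; }

// One overload per value category we care about.
std::string which_overload(const Pixels&) { return "lvalue"; }
std::string which_overload(Pixels&&) { return "rvalue"; }

std::string classify_named_variable() {
    Pixels p;
    return which_overload(p);               // a name is an lvalue
}

std::string classify_function_result() {
    return which_overload(make_pixels());   // a by-value result is a prvalue
}

std::string classify_moved_variable() {
    Pixels p;
    return which_overload(std::move(p));    // std::move casts to an rvalue
}

// Hypothetical type with a user-declared copy constructor and therefore
// no implicitly generated move constructor.
struct CopyOnly {
    int* copy_counter;
    explicit CopyOnly(int* counter) : copy_counter(counter) {}
    CopyOnly(const CopyOnly& other) : copy_counter(other.copy_counter) {
        ++*copy_counter;
    }
};

// Counts the copies made when we ask for a move of a copy-only type.
int copies_when_moving_copy_only() {
    int copies = 0;
    CopyOnly a(&copies);
    CopyOnly b = std::move(a);  // the rvalue binds to const CopyOnly&
    (void)b;
    return copies;
}
```

The three classify functions return "lvalue", "rvalue", and "rvalue" respectively, and copies_when_moving_copy_only() returns 1: the request to move compiles fine but performs a copy.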

In principle, whenever a movable type (standard or user-defined) is returned from a function by value, we can safely assume that either NRVO or a move operation will take place, so there are no superfluous copies on standard-conforming compilers that support C++11 and above.

Conclusion

You probably already knew what I discussed in this article, and you might wonder why I rambled about things that I did not understand properly in the first place. That is the point of this article, I guess.

It is always tempting to take a new concept from a blog or a book and use it immediately without understanding its implications. It can happen even to experienced programmers. So, whenever you learn a new concept, especially in C++, check the assembly it generates to verify that what is happening matches what you expected. Trust me, learning a bit of assembly will definitely pay off because it lets us peek into what is ultimately going to run on the computer.

I really thought move semantics worked a certain way, and I started prematurely optimizing (pessimizing, in this case) without understanding it in a broader context. Looking into the assembly definitely deepened my understanding of move semantics and some compiler optimizations.

Above all, I am starting to trust the compiler to do the right thing for me and respect the compiler writers even more. Check out this video by Matt Godbolt which stresses the point I am trying to make here.

Hope you enjoyed the article. Happy coding!

¹ RVO is guaranteed since C++17. To understand the difference between RVO and NRVO, refer to this link

² I also found this article by Sy Brand to be extremely useful while writing this article.