But I was helping the compiler!
Compilers are getting better with each release. Sometimes a noticeable difference can be observed in the assembly output for the same piece code in a different version of the same compiler (can be easily done via compiler explore).
I have gotten into the practice of checking the assembly output lately to analyze the overhead of various implementations. Beware, sometimes it can get addictive. But I think it is a nice way of learning to read assembly and also be amazed at how clever the compilers are these days.
In this article, I am going to cover one such incident that happened when I was looking at the assembly output of a function during my CHIP8 implementation.
But the context of the problem first!
The magic of move semantics
Move semantics was introduced in C++11. We can think of move semantics as a way of transferring ownership of an object. If you are really new to move semantics, consider the following example:
So your colleague has a document that you also want. We have two options here. The first option, you take that document to a copier, take a copy of the document for yourself, and return the original document to your colleague. The second option, assuming your colleague doesn’t need the document anymore, instead of throwing it away, your colleague can give it to you, thereby, saving paper.
Replace the document with a memory resource in a program, then the first option is doing a copy, and the second option is doing a move, where you transfer the ownership instead of wasting the resource.
Putting what I learned in action
While implementing my CHIP8 emulator, I saw an opportunity to replace an expensive copy operation into a cheap move operation (at least that is what I thought).
To give a bit of context: In each frame cycle, I had to return an array containing 2048 integers that will be used to draw the graphics on the screen. The pseudo-C++ code is shown below:
// chip8.cpp
static constexpr display_size = 2048;
class Chip8 {
...
public :
std::array<uint8_t, display_size> get_display_pixels() {
// Do some computation
return gfx;
}
...
private :
std::array<uint8_t, display_size> gfx{};
};
// main.cpp
while (displayOn) {
...
const auto disp_pixels = emulator.get_display_pixels();
...
// Use disp_pixels to draw pixels on the screen
}
This is what I assumed was going on when I did the call to get_display_pixels()
member function:
- The compiler
copies
thegfx
private variable of theChip8
class to the return value of theget_display_pixels()
member function. - The compiler calls the
copy constructor
to copy the return value of the function call to thedisp_pixels
variable.
So, I concluded that I could use a move constructor
to transfer the contents to my local variable disp_pixels
to avoid a copy in the second step as described above.
So I changed my code in the main
function as follows:
// main.cpp
while (displayOn) {
...
const auto disp_pixels = std::move(emulator.get_display_pixels());
...
// Use disp_pixels to draw pixels on the screen
}
Before you get furious and stop reading the article further because what I assumed was completely wrong, I realized that too, and the rest of the article is about that.
As soon as I used a std::move
as shown in my previous code snippet, I observed the compiler was generating more assembly code than my initial code without a std::move
(with std::move: link, without std::move: link).
What went wrong? Hmm….
NRVO to the rescue
NRVO stands for Named Return Value Optimization. It is a nice trick that the compiler uses to omit unnecessary copy or move if certain conditions are met. Compilers have been using this trick for a long time. If NRVO
takes place in our function call, then effectively we just do one copy instead of two. Let’s see how it works.
Even though the function signature of get_display_pixels
indicates that it does not take any parameters, the compiler will pass one extra parameter behind the scenes from the caller (initialization call of disp_pixels
from main.cpp
) to the callee (get_display_pixels
function in chip8.cpp
). The caller will allocate the memory for the return value and pass the address of that memory to the callee. The callee will use that memory to construct the object and copy the value of the private variable gfx
(in this case). As the memory of the caller (disp_pixels
) was used by the callee, there is no need to copy the return value again, thereby, saving one unnecessary copy/move operation.
We should see the assembly output to really understand how NRVO
is happening under the hood. The assembly code from the caller side is as follows:
1 lea rax, [rbp-16384]
2 lea rdx, [rbp-8192]
3 mov rsi, rdx
4 mov rdi, rax
5 call Chip8::get_display_pixels()
Before the function call, rsi
and rdi
registers are loaded with upper and lower bound of the memory address of the disp_pixels
variable. And, the trimmed assembly output from the callee side is as follows:
1 Chip8::get_display_pixels():
2 push rbp
3 mov rbp, rsp
4 mov QWORD PTR [rbp-8], rdi
5 mov QWORD PTR [rbp-16], rsi
...
As seen from the callee side, the rdi
and rsi
values are moved to the stack, and further operations are performed with that memory address. Pretty neat!
A simple analogy
for NRVO
I like to think of is when you are asking a friend to fill in water inside a water bottle, you would give your bottle to fill water from the tap directly. It would be inefficient to first fill the water in a temporary bottle and transfer the contents again to your bottle. In the C++ context, bottle
is the memory space
and the water
it holds is the return value
.
If we assume that the NRVO
will take place, then the most efficient way of writing my function call is:
// main.cpp
while (displayOn) {
...
const auto disp_pixels = emulator.get_display_pixels();
...
// Use disp_pixels to draw pixels on the screen
}
GCC and Clang even have an extra warning flag -Wpessimizing-move
which detects when we are trying to use a move
where compiler-generated NRVO is much more efficient.
Even though we can assume in many situations that a NRVO
will take place, especially if optimizations are turned on, C++ standard does not guarantee NRVO
in all situations1. But what if compiler does not perform a NRVO
?
Lvalues and Rvalues (and all other value categories in between)
Even though there are some proposals to guarantee NRVO
, it is not yet guaranteed by the standard. As it is not guaranteed, should we explicitly indicate a move operation to save a copy just in case the compiler doesn’t do a NRVO
? To answer that, I added the flag -fno-elide-constructors
that disables copy elision (the super-set of NRVO) in our code, thereby, allowing to see what the compiler does otherwise.
I was surprised to see that compiler was still performing a NRVO
for C++17
standard with -fno-elide-constructors
enabled. But this was not the case for C++14
, the compiler generated different assembly with -fno-elide-constructors
enabled. If someone knows the reason why this difference occurs between C++17
and C++14
even though NRVO
is not guaranteed, please email me about it. Godbolt link.
Let’s use C++14
with -fno-elide-constructors
flag to simulate the scenario where the compiler fails to apply NRVO
so that we can check whether we needed to do something extra to avoid superfluous copies.
So I added -fno-elide-constructors
to disable any NRVO
to the final code of the previous section. The caller generated the following assembly code:
1 lea rdx, [rbp-8192]
2 lea rax, [rbp-16384]
3 mov rsi, rdx
4 mov rdi, rax
5 call Chip8::get_display()
6 lea rdx, [rbp-8192]
7 lea rax, [rbp-16384]
8 mov rsi, rdx
9 mov rdi, rax
10 call std::array<unsigned char, 32ul>::array(std::array<unsigned char, 32ul>&&)
As we can notice, the first 5 assembly instructions are the same as the version with NRVO enabled, and there are 5 more assembly instructions in this version as we disabled NRVO. The most important instruction we need to focus on is line number 10 where a move constructor
(notice &&
in the function signature). Wait, a move constructor
is invoked? I did not use a std::move
but the compiler decided to do it anyway. To really comprehend the reason, we need to understand value categories
in C++.
In these two articles: Understanding lvalues and rvalues in C and C++ and basic.lval#1, value categories are explained in detail2. In brief, quoting from the first article: “An lvalue (locator value) represents an object that occupies some identifiable location in memory. Rvalue is an expression that does not represent an object occupying some identifiable location in memory.”. Of course, there are more categories than just a lvalue
and a rvalue
. I would highly recommend reading both articles. Though you don’t need a perfect grasp of them to understand what comes later in this article. Let’s get back to our original example and analyze why the move constructor was called.
// main.cpp
while (displayOn) {
...
const auto disp_pixels = emulator.get_display_pixels();
...
// Use disp_pixels to draw pixels on the screen
}
In the above code snippet, the function call to get_display_pixels
belongs to the rvalue
(more precisely a prvalue
) category and it generates a temporary. The compiler can now safely move
that temporary into the disp_pixels
variable because that temporary will be destroyed anyway after this statement. If the type that is being returned does not have a move constructor (in our case std::array
has a move constructor), then the compiler will call the copy constructor
.
In principle, if any of the moveable
types (standard or user-defined) is returned from a function by value
, we can safely assume either NRVO
or move operation
will take place resulting in no superfluous copies for standard compilers that support C++11
and above.
Conclusion
Probably you already knew about what I discussed in this article and you might think why did I ramble about things that I did not understand properly in the first place. That is the point of this article I guess.
It is always nice to use a new concept that we read from a blog or a book and use them immediately without understanding the implications of it. Sometimes, it can happen even to experienced programmers. So, whenever you learn a new concept, especially in C++, check the assembly that it generates to really verify what is happening, and what you expected to have happened. Trust me, learning a bit of assembly will definitely pay off because it lets us peek into what is ultimately going to run on the computer.
I really thought move semantics worked a certain way and I started prematurely optimizing (pessimizing in this case) without really understanding them in a broader context. Looking into the assembly definitely deepened my understanding of move semantics and some compiler optimizations.
Above all, I am starting to trust the compiler to do the right thing for me and respect the compiler writers even more. Check out this video by Matt Godbolt which stresses the point I am trying to make here.
Hope you enjoyed the article. Happy coding!
1 RVO
is guaranteed since C++17
. To understand the difference between RVO
and NRVO
, refer to this link
2 I also found this article by Sy brand to be extremely useful while writing this article.