Memory Ordering and Atomics
We talked about memory locations in Memory Model; now let's discuss their role in concurrency in more detail. If two threads access the same location and both only read from it, there's no problem, but once one of them tries to update the data, there's a potential for a race condition. To avoid the race condition we need to enforce some kind of memory ordering. One way to achieve this is, of course, a mutex: if the same mutex is locked before both accesses (one from Thread A, one from Thread B), only one thread can proceed at a time. The other way, which is our topic today, is using atomic operations. Just as a small brush-up: a data race happens when two or more threads access the same memory location at the same time, at least one of them is writing, and the accesses are unsynchronized (so different threads may see different values for a single variable); a race condition happens when the result of an operation depends on the relative order of certain operations. Atomic operations are a solution for data races, but not necessarily for race conditions:
```cpp
#include <atomic>
#include <iostream>

std::atomic<bool> initialized{false};

void worker() {
    if (!initialized.load()) {       // (1) Check
        // Suppose two threads pass here at the same time!
        // Both see false, so both perform the initialization.
        initialized.store(true);     // (2) Set
        std::cout << "Initialized by thread\n";
    }
}
```
Atomics
An atomic operation is an indivisible operation. What does this mean? We all know what an atom means in science: the smallest possible particle that cannot be divided into smaller particles (though that's not completely true anymore). Still, in concurrent programming, when we say atomicity we really mean an operation that cannot be divided: it's either done completely or not done at all.
Types
There are some standard atomic types in C++ that can be found in the <atomic> header, and all operations on these types are atomic. But it's very interesting to know what's under the hood, because operations can also be made to look atomic by using mutexes internally. That's why these types have a method called is_lock_free(), which tells you whether operations are performed with atomic instructions or with an internal lock. Since C++17 there is also a static constexpr member, is_always_lock_free, which is true if and only if that type is lock-free on every supported hardware configuration. For example, std::atomic<int64_t> might be lock-free on one platform but not on another.
```cpp
#include <atomic>
#include <cstdio>

int main()
{
    std::atomic_bool atom;
    printf("Is lock free: %s\n", atom.is_lock_free() ? "True" : "False");

    return 0;
}
```
which returns:
Is lock free: True
Now let’s look at another example:
```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>

struct Big
{
    uint64_t var1;
    uint64_t var2;
};

int main()
{
    std::atomic<Big> atom;
    printf("Is lock free: %s\n", atom.is_lock_free() ? "True" : "False");

    return 0;
}
```
But this time it returns:
Is lock free: False
So, why isn't std::atomic<Big> lock-free? It's because the hardware atomic instructions are limited in the operand sizes they support, and this 16-byte struct is wider than what a single atomic instruction handles on many targets. I highly recommend taking a look at the Intel 64 and IA-32 Architectures Software Developer's Manual (intel Manual). Also, it's worth mentioning that atomic operations that are lock-free are, by design, also address-free, which means their atomicity doesn't depend on any per-process state, so they even work on memory shared between processes (C++ draft).
We can put all atomic types into these categories:
std::atomic_flag
std::atomic<bool>
std::atomic<T*>
std::atomic<integral-types>
std::atomic<other-types>
Among these types, one of them is certainly different: std::atomic_flag. It's the simplest form of atomic type and is basically a boolean flag: it can be set or cleared. Because of its simple design it's more like a building block, so you rarely see it in actual use. One thing that's important to note is the initialization of this type, which should be done with ATOMIC_FLAG_INIT. This value initializes the flag to the clear state.
std::atomic_flag flag = ATOMIC_FLAG_INIT;
This makes it special, because it’s the only atomic type that needs to be initialized this way.
Operations
Now that we have a basic understanding of atomic types, let's see which operations can be performed on them. When I write bool, T*, and so on, I mean std::atomic<bool>, std::atomic<T*>, and so on, just to save a little space.
| Operation | std::atomic_flag | bool | T* | integral-types | other-types |
|---|---|---|---|---|---|
| test_and_set | ✓ | | | | |
| clear | ✓ | | | | |
| is_lock_free | | ✓ | ✓ | ✓ | ✓ |
| load | | ✓ | ✓ | ✓ | ✓ |
| store | | ✓ | ✓ | ✓ | ✓ |
| exchange | | ✓ | ✓ | ✓ | ✓ |
| compare_exchange_weak, compare_exchange_strong | | ✓ | ✓ | ✓ | ✓ |
| fetch_add, += | | | ✓ | ✓ | |
| fetch_sub, -= | | | ✓ | ✓ | |
| ++, -- | | | ✓ | ✓ | |
| fetch_or, \|= | | | | ✓ | |
| fetch_and, &= | | | | ✓ | |
| fetch_xor, ^= | | | | ✓ | |
Now let’s look at what happens under the hood.
```cpp
#include <atomic>

int main()
{
    std::atomic<int> a(25);
    int res = a.fetch_add(2);

    return res;
}
```
The compiler turns this into something like:
```asm
main:
        mov     DWORD PTR [rsp-4], 25
        mov     eax, 2
        lock xadd DWORD PTR [rsp-4], eax
        ret
```
But before we go any further, let’s look at another thing:
```cpp
#include <atomic>

int main()
{
    std::atomic<int> a(25);
    a.fetch_add(2);

    return 0;
}
```
The result is interesting indeed; this time the compiler turns it into:
```asm
main:
        mov     DWORD PTR [rsp-4], 25
        lock add DWORD PTR [rsp-4], 2
        xor     eax, eax
        ret
```
First, what is Lock and what does it do? According to the Intel manual, an x86-64 instruction consists of optional prefixes, the opcode, and then the ModR/M, SIB, displacement, and immediate fields. We don’t need to go into too much detail, but it’s good to know that the Lock prefix is one of those prefixes that comes before the opcode of an instruction like ADD. The Lock prefix is generally used with read-modify-write operations, and according to the Intel manual, it can be used with these instructions:
ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, CMPXCHG16B, DEC, INC, NEG, NOT, OR, SBB,
SUB, XOR, XADD, and XCHG.
Historically, there was a signal called Lock# (Lock Bar). If Lock#
was asserted, any access to memory had to wait until
Lock#
was de-asserted. However, this is not the case with modern CPUs. Modern CPUs are more complex and use
cache coherency protocols. This means the Lock#
signal isn’t used if the cache line can be put into an exclusive
state instead.
Now that we know what the Lock prefix does, let’s look at the code examples above. In the first example, the generated code uses lock xadd, while the second example uses lock add. What’s the difference? You might be surprised that XADD is actually slower. Why? Because it first saves the original value somewhere and then performs the ADD operation. But if we don’t need the original value (as in the second example), the compiler just uses ADD.
Speaking of speed, can we tell how much faster one is than the other? Yes! Thanks to Agner Fog’s
instruction tables, we can see how many CPU cycles each instruction takes. The extra cycle for lock xadd
is because
it’s a slightly more complex instruction: it needs to swap the original value out as well as add, so it performs a
read-modify-write and returns the previous value.
Wrapping up
In this post, we explored the basics of atomic operations in C++, how they help with concurrency, and what happens under the hood. We saw the difference between lock-free and non-lock-free atomics, learned about different atomic types, and took a look at some real examples from assembly instructions. Understanding these fundamentals is important for writing safe and efficient multithreaded programs.
If you are interested in more details, you can check the manuals and resources mentioned above. In future posts, we will go deeper into memory ordering and advanced concurrency topics.
Thanks!