Memory Ordering and Atomics

We talked about memory locations in Memory Model; now let's look at their role in concurrency in more detail. If two threads access the same location and both only read from it, there is no problem. But once one of them tries to update the data, there is a potential for a race condition. To avoid it we need to enforce some kind of memory ordering. One way, of course, is to use a mutex: if the same mutex is locked before both accesses (one from thread A, one from thread B), only one thread can proceed at a time. The other way, which is our topic today, is to use atomic operations.

As a small refresher: a data race occurs when two or more threads access the same memory location at the same time, at least one of them is writing, and the accesses are unsynchronized (so different threads can observe inconsistent values). A race condition, on the other hand, happens when the result of an operation depends on the relative order in which certain operations execute. Atomic operations eliminate data races, but they do not necessarily prevent race conditions:

#include <atomic>
#include <iostream>

std::atomic<bool> initialized{false};

void worker() {
    if (!initialized.load()) {      // (1) Check
        // Suppose two threads pass here at the same time!
        // Both see false
        // Both perform initialization
        initialized.store(true);    // (2) Set
        std::cout << "Initialized by thread\n";
    }
}

Atomics

An atomic operation is an indivisible operation. What does this mean? We all know what an atom means in science: the smallest possible particle, one that cannot be divided into smaller particles (though that is not entirely true anymore). Still, in concurrent programming, when we say atomicity we really mean an operation that cannot be divided: it is either done completely or not done at all, and no thread can ever observe it half-done.

Types

There are standard atomic types in C++, found in the <atomic> header, and all operations on these types are atomic. But it is interesting to know what is under the hood, because a mutex can also be used internally to make operations look atomic. That is why these types have a member function, is_lock_free(), which tells you whether operations are performed with atomic hardware instructions or with an internal lock. Since C++17 there is also a static constexpr member, is_always_lock_free, which is true if and only if the type is lock-free for all possible objects on the target hardware. For example, std::atomic<int64_t> might be lock-free on one platform but not on another.

#include <atomic>
#include <cstdio>

int main()
{
    std::atomic_bool atom;
    printf("Is lock free: %s\n", atom.is_lock_free() ? "True" : "False");

    return 0;
}

which returns:

Is lock free: True

Now let's look at another example:

#include <atomic>
#include <cstdint>
#include <cstdio>

struct Big
{
    uint64_t var1;
    uint64_t var2;
};

int main()
{
    std::atomic<Big> atom;
    printf("Is lock free: %s\n", atom.is_lock_free() ? "True" : "False");

    return 0;
}

But this time it returns:

Is lock free: False

So, why isn't std::atomic<Big> lock-free? Because the hardware's atomic instructions are limited in the operand sizes they support, and this 16-byte struct exceeds what the compiler emits lock-free instructions for here. I highly recommend taking a look at the Intel 64 and IA-32 Architectures Software Developer's Manual (Intel Manual). It is also worth mentioning that atomic operations that are lock-free are, by design, also address-free: they do not depend on the address of the object or any per-process state, so they work even on memory shared between processes (C++ draft).

We can put all atomic types into these categories:

  • std::atomic_flag
  • std::atomic<bool>
  • std::atomic<T*>
  • std::atomic<integral-types>
  • std::atomic<other-types>

Among these types, one is certainly different: std::atomic_flag. It is the simplest form of atomic type and is basically a boolean flag. It can be set or cleared. Because of its simple design it is more of a building block, so you rarely see it in actual use. One important thing to note is the initialization of this type, which should be done with ATOMIC_FLAG_INIT. This value initializes the flag to the clear state. (Since C++20, a default-initialized std::atomic_flag starts in the clear state and the macro is deprecated.)

std::atomic_flag flag = ATOMIC_FLAG_INIT;

This makes it special, because it’s the only atomic type that needs to be initialized this way.

Operations

Now that we have a basic understanding of atomic types, let’s see what operations can be done on them.

When I say bool, T*, and so on, I mean std::atomic<bool>, std::atomic<T*>, and so on, just to save a little space.

std::atomic_flag | bool                     | T*             | integral-types | other-types
-----------------|--------------------------|----------------|----------------|------------
test_and_set     | is_lock_free             | bool ¹         | T* ²           | bool ³
clear            | load                     | fetch_add, +=  | fetch_or, |=   |
                 | store                    | fetch_sub, -=  | fetch_and, &=  |
                 | exchange                 | ++             | fetch_xor, ^=  |
                 | compare_exchange_strong  | --             |                |
                 | compare_exchange_weak    |                |                |

Now let’s look at what happens under the hood.

First Example

#include <atomic>

int main()
{
    std::atomic<int> a(25);
    int res = a.fetch_add(2);

    return res;
}

The compiler turns this into something like this:

main:
        mov     DWORD PTR [rsp-4], 25
        mov     eax, 2
        lock xadd       DWORD PTR [rsp-4], eax
        ret

But before going any further, let's look at another thing:

Second Example

#include <atomic>

int main()
{
    std::atomic<int> a(25);
    a.fetch_add(2);

    return 0;
}
The result is interesting indeed; this time the compiler turns it into:

main:
        mov     DWORD PTR [rsp-4], 25
        lock add        DWORD PTR [rsp-4], 2
        xor     eax, eax
        ret

First, what is Lock and what does it do? According to the Intel manual, the x86-64 instruction format is:

Intel Instruction Set(x64)

We don’t need to go into too much detail, but it’s good to know that the Lock prefix is one of those prefixes that comes before the Opcode, like the ADD instruction. The Lock prefix is generally used with read-modify-write operations, and according to the Intel manual, it can be used with these instructions:

ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, CMPXCHG16B, DEC, INC, NEG, NOT, OR, SBB,
SUB, XOR, XADD, and XCHG.

Historically, there was a signal called Lock# (Lock Bar). If Lock# was asserted, any access to memory had to wait until Lock# was de-asserted. However, this is not the case with modern CPUs. Modern CPUs are more complex and use cache coherency protocols. This means the Lock# signal isn’t used if the cache line can be put into an exclusive state instead.

Now that we know what the Lock prefix does, let’s look at the code examples above. In the First example, the generated code uses lock xadd, while the second example uses lock add. What’s the difference? You might be surprised that XADD is actually slower. Why? Because it first saves the original value somewhere and then performs the ADD operation. But if we don’t need the original value (as in the Second example), the compiler just uses ADD.

Speaking of speed, can we tell how much faster one is than the other? Yes! Thanks to Agner Fog’s instruction tables, we can see how many CPU cycles each instruction takes. The extra cycle for lock xadd is because it’s a slightly more complex instruction: it needs to swap the original value out as well as add, so it performs a read-modify-write and returns the previous value.

Agner Fog instruction tables


Wrapping up

In this post, we explored the basics of atomic operations in C++, how they help with concurrency, and what happens under the hood. We saw the difference between lock-free and non-lock-free atomics, learned about different atomic types, and took a look at some real examples from assembly instructions. Understanding these fundamentals is important for writing safe and efficient multithreaded programs.

If you are interested in more details, you can check the manuals and resources mentioned above. In future posts, we will go deeper into memory ordering and advanced concurrency topics.

Thanks!


  1. In addition to what std::atomic<bool> has. ↩︎

  2. In addition to what std::atomic<T*> has. ↩︎

  3. In addition to what std::atomic<bool> has. ↩︎