What are the CPU-internal characteristics of CAS collision?

I'm trying to understand the low-level mechanics of CAS on x86/x64 and I'd really appreciate some help/insight.

The reason I've been thinking about this is that I'm trying to reason about exponential backoff, and figure out in principle what the correct single unit of backoff delay should be.

If I look at a lock-free freelist benchmark, without exponential backoff, I see as the number of threads increase, performance rapidly flatlines.

Release 7 Lock-Free Freelist Benchmark #1

       M
       N
       S
      L3U
  L2U     L2U
  L1D     L1D
  L1I     L1I
   P       P

L L L L  total ops    mean ops/sec per thread  standard deviation  scalability
0 0 0 1  310134488    31013449                 0                   1.00
0 1 0 1  136313300    6815665                  38365               0.22
0 1 0 1  136401284    6820064                  50706               0.22
1 1 1 1  111134328    2778358                  23851               0.09
0 0 1 1  334747444    16737372                 2421                0.54
1 1 1 1  111105898    2777647                  40399               0.09

As we know, live-lock can occur, where each thread prevents the others from progressing.

My original - and I believe now mistaken - thought was that CAS was interfering with CAS. By that I mean, the CAS instruction itself would destructively collide with another CAS, should they occur concurrently, and both would fail. (Probably because, in the back of my mind, I was thinking about Ethernet.)

This 'obviously' explains the results - all those CAS instructions operating concurrently, very few have a chance to fully execute before being destructively interrupted.

Having thought about it some more, I believe now this cannot be the case. The CAS instruction does not in fact HAVE a failure mode. It will tell you the destination is equal or not equal to the comparand. That's all. It doesn't come back and say "oh, sorry, bumped into someone else".

Destructive interference IS occurring, but it's occurring at a higher level, in the data structure algorithm itself. When we push to or pop from the freelist, we are actually TRYING to swap. We need the destination to be stable for long enough that we can read it, do whatever work we need to do, and then find it unchanged so we can complete our push/pop.

If other threads keep CASing, the destination isn't stable - it keeps changing - and we keep having to retry our operation.
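A minimal push loop for such a freelist makes clear where the retrying lives (a sketch with made-up names; the real benchmark code will differ, and a production pop would also need ABA protection):

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *next;
};

/* Head of the freelist. */
static _Atomic(struct node *) head = NULL;

/* Push: read the head, point the new node at it, then try to swing
   head from the value we read to the new node. If another thread
   CASed in between, 'old' no longer matches and we loop. This retry
   loop - not the CAS instruction itself - is where the destructive
   interference happens. */
static void push(struct node *n)
{
    struct node *old = atomic_load(&head);
    do {
        n->next = old;
        /* on failure, 'old' is refreshed with the current head */
    } while (!atomic_compare_exchange_weak(&head, &old, n));
}
```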

But now I'm confused.

What we see is that a single thread performs about 30 million push/pop operations per second. The destination has to be stable for the duration of one of these operations for the operation to succeed, so there are 30 million 'slots'. If we have two threads, the maximum theoretical performance is 15 million operations per thread, each thread using half the slots.

Now let's come back to CAS. CAS has no failure mode. So what happens when a second thread tries to CAS while another thread is already CASing? Well, the second thread will fail at the data-structure level, since the swap couldn't occur, and so it will retry the swap.

But now imagine we have LOTS of threads. The first thread to begin a CAS will succeed (assuming each CAS takes exactly the same time - not true, but that assumption doesn't change anything fundamental, so it's okay to reason with). All the others will fail.

But once the first thread has finished, the very next thread to read the new destination value will have its CAS succeed (and all the other threads, still executing their CASes or now beginning new CASes, will fail).

So why do we not see perfect scaling? Every 'slot' should still be in use!

I think therefore I do not understand CAS properly.

Reading Intel's Architecture Software Developer's Manual, I find that if all data is present in cache (which is the situation I'm interested in), the cache coherency protocol takes care of CAS.

Drepper in his white paper describes LL/SC and how it works using MESI.

It seems reasonable to me for CAS to operate in a similar manner.

Let's consider the two-thread case. The first thread begins its CAS. The cache line containing the destination is in its cache, marked exclusive.

The second thread begins its CAS. The first core sends the cache line over to the second core, and both cores now have that line marked shared.

The first thread completes its CAS and writes to the cache line (the write always occurs on x86/x64, even if the compare was false; in that case it simply writes back the original value).

The write marks the cache line as modified; an RFO occurs, causing the second core to mark its copy of the line as invalid.

The second thread comes to complete its CAS and notices its cache line is invalid... and then what? I find it hard to believe the instruction is internally looped in the CPU until it succeeds - although I wonder, because LL/SC on ARM requires you to write that loop yourself in assembly. The CAS instruction knows the value of the destination has changed, so the result of its compare is invalid. But no error is possible with CAS; it always returns true or false for the compare. And even if the instruction does loop internally until it completes, I'd still expect perfect scaling: each 'slot' should still be used.

So what happens? what is happening with CAS?

What I do see is that as the thread count rises, less and less work gets done - all available 'slots' certainly are not being used. Something is causing this. Is it destructive interference between CAS instructions? Or is it a large number of RFOs hogging the CPU-to-northbridge bus?

What I do note with considerable interest is that two threads on the same physical core scale perfectly. Something special and different is happening in that case - two threads on separate physical cores scale half as well. But it's not enough of a clue to explain it all.

What you're seeing here is the cost of moving the data between the L1 caches of the two physical cores. When only one core is in use, the data sits in that core's L1 cache and each CAS operates at full speed. With two cores active, on the other hand, each time a core succeeds in writing to the data, it invalidates the other core's cache line, so the data has to be copied between the caches before the other core can do anything (generally, that core blocks waiting for the load preceding the CAS to complete). This transfer is much more expensive than the actual CAS (the data has to move up to at least the L3 cache and then back down to the other L1 cache), and it produces the slowdown you see, as the data ends up ping-ponging back and forth between the two L1 caches.
