Re: MemoryFence()


Andrew Fish <afish@...>
 

On Feb 8, 2021, at 9:40 AM, Laszlo Ersek <lersek@...> wrote:

On 02/04/21 21:04, Paolo Bonzini wrote:
Il gio 4 feb 2021, 20:46 Ard Biesheuvel <ardb@...> ha scritto:

(1) We should introduce finer-grained fence primitives:

                       ARM       AARCH64     i386

CompilerFence()        asm("")   asm("")     asm("")
AcquireMemoryFence()   dmb ish   dmb ishld   asm("")
ReleaseMemoryFence()   dmb ish   dmb ish     asm("")
MemoryFence()          dmb ish   dmb ish     mfence
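
As a rough illustration of the table, here is a minimal sketch of how the four primitives could be spelled for GCC/clang. The instruction choices follow the table; the "memory" clobber, the fallback branch using the __atomic_thread_fence builtin, and the helper FenceSmokeTest are assumptions made for illustration and are not the edk2 implementation.

```c
/* Compiler-only barrier: emits no instruction, but the "memory" clobber
   stops the compiler from moving memory accesses across it. */
#define CompilerFence()  __asm__ __volatile__ ("" ::: "memory")

#if defined(__aarch64__)
  #define AcquireMemoryFence()  __asm__ __volatile__ ("dmb ishld" ::: "memory")
  #define ReleaseMemoryFence()  __asm__ __volatile__ ("dmb ish" ::: "memory")
  #define MemoryFence()         __asm__ __volatile__ ("dmb ish" ::: "memory")
#elif defined(__arm__)
  /* Assumes an ARMv7 or later target, where "dmb ish" is available. */
  #define AcquireMemoryFence()  __asm__ __volatile__ ("dmb ish" ::: "memory")
  #define ReleaseMemoryFence()  __asm__ __volatile__ ("dmb ish" ::: "memory")
  #define MemoryFence()         __asm__ __volatile__ ("dmb ish" ::: "memory")
#elif defined(__i386__) || defined(__x86_64__)
  /* Per the table, acquire and release only need to restrain the
     compiler on x86; the full fence needs a real instruction. */
  #define AcquireMemoryFence()  CompilerFence ()
  #define ReleaseMemoryFence()  CompilerFence ()
  #define MemoryFence()         __asm__ __volatile__ ("mfence" ::: "memory")
#else
  /* Assumption: on any other target, defer to the compiler builtins. */
  #define AcquireMemoryFence()  __atomic_thread_fence (__ATOMIC_ACQUIRE)
  #define ReleaseMemoryFence()  __atomic_thread_fence (__ATOMIC_RELEASE)
  #define MemoryFence()         __atomic_thread_fence (__ATOMIC_SEQ_CST)
#endif

/* Hypothetical smoke test: executes every fence once and checks that
   ordinary data is left intact. Returns 4. */
int
FenceSmokeTest (
  void
  )
{
  int  Value = 1;

  CompilerFence ();
  Value += 1;
  AcquireMemoryFence ();
  Value += 1;
  ReleaseMemoryFence ();
  Value += 1;
  MemoryFence ();
  return Value;
}
```

Each implementation is allowed to be stronger than its name requires, which is why the 32-bit ARM column can use the same "dmb ish" for all three hardware fences.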

"where AcquireMemoryFence() is used on the read side (i.e. between
reads) and ReleaseMemoryFence() is used on the write side (i.e.
between writes)".
Acquire semantics typically order writes before reads, not /between/
reads. Similarly, release semantics imply that all outstanding writes
complete before a barrier with acquire semantics is permitted to
complete.
Acquire fences are barriers between earlier loads and subsequent loads
and stores; those earlier loads then synchronize with release stores
in other threads.

Release fences are barriers between earlier loads and stores against
subsequent stores, and those subsequent stores synchronize with
acquire loads in other threads.
[*]


In both cases, however, fences only make sense between memory
operations. So something like "after reads" and "before writes" would
have been more precise in some sense, but in practice the usual idiom
is "between" reads/writes as Laszlo wrote.
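
The "between writes" / "between reads" idiom can be sketched with the C11 fences. This is only an illustration: the names mPayload, mReady, Publish and TryConsume are hypothetical, and C11 <stdatomic.h> stands in for whatever primitives edk2 ends up with.

```c
#include <stdatomic.h>

static int         mPayload;  /* ordinary data (hypothetical name) */
static atomic_int  mReady;    /* "data is valid" flag (hypothetical name) */

/* Write side: the release fence sits between the data store and the
   flag store, so the data cannot become visible after the flag. */
void
Publish (
  int  Value
  )
{
  mPayload = Value;
  atomic_thread_fence (memory_order_release);
  atomic_store_explicit (&mReady, 1, memory_order_relaxed);
}

/* Read side: the acquire fence sits between the flag load and the data
   load. Returns 1 and fills *Value if data was published, else 0. */
int
TryConsume (
  int  *Value
  )
{
  if (atomic_load_explicit (&mReady, memory_order_relaxed) == 0) {
    return 0;
  }

  atomic_thread_fence (memory_order_acquire);
  *Value = mPayload;
  return 1;
}
```

The flag store after the release fence is the store that synchronizes with the flag load before the acquire fence on the other CPU.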

Note that reasoning about this only makes sense in the context of
concurrency, i.e., different CPUs operating on the same memory (or
coherent DMA masters)

For non-coherent DMA, the 'ish' variants are not appropriate, and
given the low likelihood that any of this creates a performance
bottleneck, I would suggest using only full barriers on ARM.
Sure, that's a matter of how to implement the primitives. If you think
that non-coherent DMA is important, a full dmb can be used too.

As far as the compiler is concerned, an asm in the macros *should*
block enough optimizations, even without making the accesses volatile.
CompilerFence (or the edk2 equivalent of cpu_relax, whose name escapes
me right now) would be necessary in the body of busy-waiting loops.
However somebody should check the MSVC docs for asm, too.
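
A minimal sketch of the busy-waiting case, assuming GCC/clang inline asm; the MSVC spelling is exactly the open question here. SpinUntilSet and its bounded spin count are hypothetical, chosen so the loop terminates even when nobody sets the flag.

```c
/* Compiler-only barrier (GCC/clang syntax, an assumption). */
#define CompilerFence()  __asm__ __volatile__ ("" ::: "memory")

/* Polls a plain, non-volatile flag at most MaxSpins times.
   Returns 1 if the flag was seen non-zero, 0 if the loop gave up. */
int
SpinUntilSet (
  const unsigned  *Flag,
  unsigned        MaxSpins
  )
{
  unsigned  Spin;

  for (Spin = 0; Spin < MaxSpins; Spin++) {
    if (*Flag != 0) {
      return 1;
    }

    /* Without this, the compiler may read *Flag once and reuse the
       value for every iteration, since the access is not volatile. */
    CompilerFence ();
  }

  return 0;
}
```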
Doesn't look too specific:

https://docs.microsoft.com/en-us/cpp/assembler/inline/optimizing-inline-assembly?view=msvc-160
“inline assembly is not supported on the ARM and x64 processors.” [1].

Kind of looks like MSVC replaced inline assembly with intrinsics [2].

Looks like _ReadBarrier, _ReadWriteBarrier [3] are cross arch. But as you point out when you look them up [4]:

The _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier compiler intrinsics and the MemoryBarrier macro are all deprecated and should not be used. For inter-thread communication, use mechanisms such as atomic_thread_fence and std::atomic<T> that are defined in the C++ Standard Library. For hardware access, use the /volatile:iso compiler option together with the volatile keyword.

The scary statement there is /volatile:iso compiler option + volatile keyword for hardware access.

[1] https://docs.microsoft.com/en-us/cpp/assembler/inline/inline-assembler?view=msvc-160
[2] https://docs.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list?view=msvc-160
[3] https://docs.microsoft.com/en-us/cpp/intrinsics/intrinsics-available-on-all-architectures?view=msvc-160
[4] https://docs.microsoft.com/en-us/cpp/intrinsics/readbarrier?view=msvc-160


It is very important to be *aware* of the acquire/release semantics,
but I don't think it is necessary to use all the fine grained barrier
types in EDK2.
I agree as long as the primitives are self-documenting. A single
MemoryFence() does not make it clear in which direction the data is
flowing (whether from other threads to this one, or vice versa).
I've found this article very relevant:

https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming


(1) It provides an example for a store-load (Dekker's algorithm) where
any combination of read-acquire + write-release is insufficient. Thus it
would need an MFENCE (hence we need all four APIs in edk2).

(... If we jump back to the part I marked with [*], then we can see
Paolo's description of read-acquire and store-release covers load-load,
load-store, load-store (again), and store-store. What's not covered is
store-load, which Paolo said elsewhere in this thread is exactly what
x86 does reorder. So the MemoryFence() API's use would be "exceptional",
in future code, but it should exist for supporting patterns like
Dekker's algorithm.)
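
To make the store-load case concrete, here is a sketch of only the "announce intent, then check the other side" step that Dekker's algorithm is built around. It is not the full algorithm (there is no turn variable, so both sides can back off at once). The names mWants, TryEnter and Leave are hypothetical, and the C11 seq_cst fence stands in for MemoryFence().

```c
#include <stdatomic.h>

static atomic_int  mWants[2];  /* one intent flag per CPU (hypothetical) */

/* Stores our own flag, then loads the other side's flag. The store and
   the load target different locations, so acquire/release ordering is
   not enough to keep them in order; a full fence is needed.
   Returns 1 if the caller may proceed, 0 if it backed off. */
int
TryEnter (
  int  Self
  )
{
  int  Other = 1 - Self;

  atomic_store_explicit (&mWants[Self], 1, memory_order_relaxed);
  atomic_thread_fence (memory_order_seq_cst);  /* store-load barrier */
  if (atomic_load_explicit (&mWants[Other], memory_order_relaxed) != 0) {
    atomic_store_explicit (&mWants[Self], 0, memory_order_relaxed);
    return 0;
  }

  return 1;
}

/* Withdraws the caller's intent flag. */
void
Leave (
  int  Self
  )
{
  atomic_store_explicit (&mWants[Self], 0, memory_order_release);
}
```

If the fence in TryEnter were weakened to release or acquire, both CPUs could load the other's flag before their own store became visible, and both would proceed.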


(2) I think the "Recommendations" section:

https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming#recommendations

highlights the very problem we have. It recommends

When doing lockless programming, be sure to use volatile flag
variables and memory barrier instructions as needed.
*after* explaining why "volatile" is generally insufficient:

https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming#volatile-variables-and-reordering

and *after* describing the compiler barriers.

So this recommendation should recommend compiler barriers rather than
volatile. :/


(3) The article recommends _ReadWriteBarrier, _ReadBarrier and
_WriteBarrier, for compiler fences. I think _ReadWriteBarrier should
suffice for edk2's purposes.

However, the following reference deprecates those intrinsics:

https://docs.microsoft.com/en-us/cpp/intrinsics/readbarrier?view=msvc-160

while offering *only* C++ language replacements.

Could we implement CompilerFence() for all edk2 architectures as
*non-inline* assembly? The function would consist of a return
instruction only. For x86, we could use a NASM source; for ARM, separate
MS and GNU assembler sources would be needed.

I totally want to get rid of "volatile" at least in future code, but
that's only possible if one of the following options can be satisfied:

- we find a supported replacement method for _ReadWriteBarrier when
using the MSFT toolchain family (such as the *non-inline*, empty
assembly function),

- or we accept that CompilerFence() is not possible to implement
portably, and we only offer the heavier-weight acquire / release /
full fences, which *include* a compiler fence too.

In the latter case, the body of a busy-waiting loop would have to use
the heavier read-acquire API.


--*--


So the structure of the solution we're looking for is:

- exactly *one* of:
  - volatile
  - compiler fence
  - acquire fence used as a heavy substitute for compiler fence,
- and *all* of
  - acquire fence (load-load, load-store)
  - release fence (load-store, store-store)
  - full fence (load-load, load-store, store-store, store-load)

The implementation of each fence would have to be *at least* as safe as
required; it could be stronger.

I feel that we have to reach an agreement on the "exactly one of" part;
subsequent to that, maybe I can try an RFC patch for <BaseLib.h> (just
the interface contracts, at first).
I think we have a good mapping for GCC/clang on x86, but I’m still not 100% clear on what to do for MSVC++.

The VC++ docs seem to point you toward:
1) volatile + /volatile:iso compiler flag for MMIO.
2) For synchronization: inline void atomic_thread_fence(memory_order Order) noexcept;

It looks like memory_order maps into the primitives you are proposing?

memory_order_relaxed   The fence has no effect.
memory_order_consume   The fence is an acquire fence.
memory_order_acquire   The fence is an acquire fence.
memory_order_release   The fence is a release fence.
memory_order_acq_rel   The fence is both an acquire fence and a release fence.
memory_order_seq_cst   The fence is both an acquire fence and a release fence, and is sequentially consistent.
But it kind of seems like we may be falling down a C++ rabbit hole…..
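
For what it's worth, the same memory_order values also exist in C11 via <stdatomic.h>, so the mapping can be sketched without C++. Treat the following as an assumption-laden illustration. The wrapper names are hypothetical, atomic_signal_fence is used as the closest standard spelling of a compiler-only barrier, and whether a given MSVC toolchain offers <stdatomic.h> for C is not established in this thread.

```c
#include <stdatomic.h>

/* Hypothetical wrappers mapping the proposed primitives onto C11. */

/* Restrains the compiler only; emits no hardware fence instruction. */
void
C11CompilerFence (
  void
  )
{
  atomic_signal_fence (memory_order_seq_cst);
}

void
C11AcquireFence (
  void
  )
{
  atomic_thread_fence (memory_order_acquire);
}

void
C11ReleaseFence (
  void
  )
{
  atomic_thread_fence (memory_order_release);
}

void
C11FullFence (
  void
  )
{
  atomic_thread_fence (memory_order_seq_cst);
}

/* Hypothetical check: runs every wrapper once and returns its input
   unchanged. */
int
FenceRoundTrip (
  int  Value
  )
{
  int  Copy = Value;

  C11CompilerFence ();
  C11AcquireFence ();
  C11ReleaseFence ();
  C11FullFence ();
  return Copy;
}
```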


As I pointed out, my experience with clang is that volatile is not enough and we needed the compiler fence.

Thanks,

Andrew Fish

Thanks
Laszlo
