
Re: MemoryFence()

Andrew Fish <afish@...>
 

On Feb 8, 2021, at 9:40 AM, Laszlo Ersek <lersek@...> wrote:

On 02/04/21 21:04, Paolo Bonzini wrote:
Il gio 4 feb 2021, 20:46 Ard Biesheuvel <ardb@...> ha scritto:

(1) We should introduce finer-grained fence primitives:

                       ARM       AARCH64    i386

CompilerFence()        asm("")   asm("")    asm("")
AcquireMemoryFence()   dmb ish   dmb ishld  asm("")
ReleaseMemoryFence()   dmb ish   dmb ish    asm("")
MemoryFence()          dmb ish   dmb ish    mfence

"where AcquireMemoryFence() is used on the read side (i.e. between
reads) and ReleaseMemoryFence() is used on the write side (i.e.
between writes)".
Acquire semantics typically order writes before reads, not /between/
reads. Similarly, release semantics imply that all outstanding writes
complete before a barrier with acquire semantics is permitted to
complete.
Acquire fences are barriers between earlier loads and subsequent loads
and stores; those earlier loads then synchronize with release stores
in other threads.

Release fences are barriers between earlier loads and stores and
subsequent stores, and those subsequent stores synchronize with
acquire loads in other threads.
[*]


In both cases, however, fences only make sense between memory
operations. So something like "after reads" and "before writes" would
have been more precise in some sense, but in practice the usual idiom
is "between" reads/writes as Laszlo wrote.
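For GCC/Clang toolchains, the table above could be sketched roughly as
follows. This is a hypothetical mapping for illustration only, not the
actual edk2 BaseLib implementation; the macro names follow the proposal:

```c
/* Hypothetical GCC/Clang mapping of the proposed primitives.
   Sketch only; not actual edk2 code. */
#if defined (__aarch64__)
  #define CompilerFence()       __asm__ __volatile__ ("" ::: "memory")
  #define AcquireMemoryFence()  __asm__ __volatile__ ("dmb ishld" ::: "memory")
  #define ReleaseMemoryFence()  __asm__ __volatile__ ("dmb ish" ::: "memory")
  #define MemoryFence()         __asm__ __volatile__ ("dmb ish" ::: "memory")
#elif defined (__arm__)
  #define CompilerFence()       __asm__ __volatile__ ("" ::: "memory")
  #define AcquireMemoryFence()  __asm__ __volatile__ ("dmb ish" ::: "memory")
  #define ReleaseMemoryFence()  __asm__ __volatile__ ("dmb ish" ::: "memory")
  #define MemoryFence()         __asm__ __volatile__ ("dmb ish" ::: "memory")
#elif defined (__i386__) || defined (__x86_64__)
  /* x86 only reorders store-load, so the acquire/release fences reduce
     to compiler barriers; only the full fence emits an instruction. */
  #define CompilerFence()       __asm__ __volatile__ ("" ::: "memory")
  #define AcquireMemoryFence()  __asm__ __volatile__ ("" ::: "memory")
  #define ReleaseMemoryFence()  __asm__ __volatile__ ("" ::: "memory")
  #define MemoryFence()         __asm__ __volatile__ ("mfence" ::: "memory")
#else
  /* Fallback so the sketch compiles elsewhere. */
  #define CompilerFence()       __asm__ __volatile__ ("" ::: "memory")
  #define AcquireMemoryFence()  __sync_synchronize ()
  #define ReleaseMemoryFence()  __sync_synchronize ()
  #define MemoryFence()         __sync_synchronize ()
#endif
```

Note that every variant carries the "memory" clobber, so each processor
fence is also a compiler fence.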

Note that reasoning about this only makes sense in the context of
concurrency, i.e., different CPUs operating on the same memory (or
coherent DMA masters)

For non-coherent DMA, the 'ish' variants are not appropriate, and
given the low likelihood that any of this creates a performance
bottleneck, I would suggest using only full barriers on ARM.
Sure, that's a matter of how to implement the primitives. If you think
that non-coherent DMA is important, a full dmb can be used too.

As far as the compiler is concerned, an asm in the macros *should*
block enough optimizations, even without making the accesses volatile.
CompilerFence (or the edk2 equivalent of cpu_relax, whose name escapes
me right now) would be necessary in the body of busy-waiting loops.
However somebody should check the MSVC docs for asm, too.
Doesn't look too specific:

https://docs.microsoft.com/en-us/cpp/assembler/inline/optimizing-inline-assembly?view=msvc-160
“Inline assembly is not supported on the ARM and x64 processors.” [1]

Kind of looks like MSVC replaced inline assembly with intrinsics [2].

Looks like _ReadBarrier, _ReadWriteBarrier [3] are cross arch. But as you point out when you look them up [4]:

The _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier compiler intrinsics and the MemoryBarrier macro are all deprecated and should not be used. For inter-thread communication, use mechanisms such as atomic_thread_fence <https://docs.microsoft.com/en-us/cpp/standard-library/atomic-functions?view=msvc-160#atomic_thread_fence> and std::atomic<T> <https://docs.microsoft.com/en-us/cpp/standard-library/atomic?view=msvc-160> that are defined in the C++ Standard Library <https://docs.microsoft.com/en-us/cpp/standard-library/cpp-standard-library-reference?view=msvc-160>. For hardware access, use the /volatile:iso <https://docs.microsoft.com/en-us/cpp/build/reference/volatile-volatile-keyword-interpretation?view=msvc-160> compiler option together with the volatile <https://docs.microsoft.com/en-us/cpp/cpp/volatile-cpp?view=msvc-160> keyword.

The scary statement there is /volatile:iso compiler option + volatile keyword for hardware access.

[1] https://docs.microsoft.com/en-us/cpp/assembler/inline/inline-assembler?view=msvc-160
[2] https://docs.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list?view=msvc-160
[3] https://docs.microsoft.com/en-us/cpp/intrinsics/intrinsics-available-on-all-architectures?view=msvc-160
[4] https://docs.microsoft.com/en-us/cpp/intrinsics/readbarrier?view=msvc-160


It is very important to be *aware* of the acquire/release semantics,
but I don't think it is necessary to use all the fine grained barrier
types in EDK2.
I agree as long as the primitives are self-documenting. A single
MemoryFence() does not make it clear in which direction the data is
flowing (whether from other threads to this one, or vice versa).
I've found this article very relevant:

https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming


(1) It provides an example for a store-load (Dekker's algorithm) where
any combination of read-acquire + write-release is insufficient. Thus it
would need an MFENCE (hence we need all four APIs in edk2).

(... If we jump back to the part I marked with [*], then we can see
Paolo's description of read-acquire and store-release covers load-load,
load-store, load-store (again), and store-store. What's not covered is
store-load, which Paolo said elsewhere in this thread is exactly what
x86 does reorder. So the MemoryFence() API's use would be "exceptional",
in future code, but it should exist for supporting patterns like
Dekker's algorithm.)
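The store-load pattern at the core of Dekker's algorithm can be
illustrated as below. This is a hypothetical sketch (not edk2 code,
and expressed with C11 fences in place of the proposed names): each
thread first stores its own flag, then loads the other's, and only the
full fence forbids the store from being reordered after the load.

```c
/* Dekker-style store-load sketch. On x86, the store may pass the
   subsequent load unless a full fence (MFENCE) sits between them;
   release/acquire fences do not prevent this reordering. */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int  flag0, flag1;

bool
Thread0TryEnter (void)
{
  atomic_store_explicit (&flag0, 1, memory_order_relaxed);
  atomic_thread_fence (memory_order_seq_cst);   /* MemoryFence() */
  return atomic_load_explicit (&flag1, memory_order_relaxed) == 0;
}

bool
Thread1TryEnter (void)
{
  atomic_store_explicit (&flag1, 1, memory_order_relaxed);
  atomic_thread_fence (memory_order_seq_cst);   /* MemoryFence() */
  return atomic_load_explicit (&flag0, memory_order_relaxed) == 0;
}
```

Without the seq_cst fences, both threads could observe the other's flag
as 0 and enter the critical section simultaneously.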


(2) I think the "Recommendations" section:

https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming#recommendations

highlights the very problem we have. It recommends

When doing lockless programming, be sure to use volatile flag
variables and memory barrier instructions as needed.
*after* explaining why "volatile" is generally insufficient:

https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming#volatile-variables-and-reordering

and *after* describing the compiler barriers.

So this recommendation should recommend compiler barriers rather than
volatile. :/


(3) The article recommends _ReadWriteBarrier, _ReadBarrier and
_WriteBarrier, for compiler fences. I think _ReadWriteBarrier should
suffice for edk2's purposes.

However, the following reference deprecates those intrinsics:

https://docs.microsoft.com/en-us/cpp/intrinsics/readbarrier?view=msvc-160

while offering *only* C++ language replacements.

Could we implement CompilerFence() for all edk2 architectures as
*non-inline* assembly? The function would consist of a return
instruction only. For x86, we could use a NASM source; for ARM, separate
MS and GNU assembler sources would be needed.

I totally want to get rid of "volatile" at least in future code, but
that's only possible if one of the following options can be satisfied:

- we find a supported replacement method for _ReadWriteBarrier when
using the MSFT toolchain family (such as the *non-inline*, empty
assembly function),

- or we accept that CompilerFence() is not possible to implement
portably, and we only offer the heavier-weight acquire / release /
full fences, which *include* a compiler fence too.

In the latter case, the body of a busy-waiting loop would have to use
the heavier read-acquire API.
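Such a busy-wait loop might look like the following hypothetical sketch
(C11 fences stand in for the proposed edk2 names; the function and
variable names are made up for illustration):

```c
/* Hypothetical sketch: wait for another CPU to publish Data and then
   set Ready. The acquire fence in the loop body includes a compiler
   barrier, so Ready is re-read on every iteration even without
   "volatile"; it also orders the Data read after the Ready read. */
#include <stdatomic.h>
#include <stdint.h>

static uint32_t  Ready;
static uint32_t  Data;

uint32_t
WaitForData (void)
{
  while (Ready == 0) {
    atomic_thread_fence (memory_order_acquire);  /* AcquireMemoryFence() */
  }
  atomic_thread_fence (memory_order_acquire);
  return Data;
}
```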


--*--


So the structure of the solution we're looking for is:

- exactly *one* of:
- volatile
- compiler fence
- acquire fence used as a heavy substitute for compiler fence,
- and *all* of
- acquire fence (load-load, load-store)
- release fence (load-store, store-store)
- full fence (load-load, load-store, store-store, store-load)

The implementation of each fence would have to be *at least* as safe as
required; it could be stronger.

I feel that we have to reach an agreement on the "exactly one of" part;
subsequent to that, maybe I can try an RFC patch for <BaseLib.h> (just
the interface contracts, at first).
I think we have a good mapping for GCC/clang on x86, but I’m still not 100% clear on what to do with MSVC++.

The VC++ docs seem to point you toward:
1) volatile + the /volatile:iso compiler flag for MMIO.
2) For synchronization: inline void atomic_thread_fence(memory_order Order) noexcept;

It looks like memory_order maps into the primitives you are proposing?

memory_order_relaxed   The fence has no effect.
memory_order_consume   The fence is an acquire fence.
memory_order_acquire   The fence is an acquire fence.
memory_order_release   The fence is a release fence.
memory_order_acq_rel   The fence is both an acquire fence and a release fence.
memory_order_seq_cst   The fence is both an acquire fence and a release fence, and is sequentially consistent.
But it kind of seems like we may be falling down a C++ rabbit hole...
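For what it's worth, plain C11 offers the same fences without the C++
machinery. A hypothetical mapping of the proposed names onto
<stdatomic.h> might read (assuming the toolchain ships that header,
which MSVC only gained in recent versions):

```c
/* Hypothetical C11-based mapping of the proposed primitives.
   atomic_signal_fence is a pure compiler barrier; atomic_thread_fence
   additionally emits the processor barrier where required. */
#include <stdatomic.h>

#define CompilerFence()       atomic_signal_fence (memory_order_seq_cst)
#define AcquireMemoryFence()  atomic_thread_fence (memory_order_acquire)
#define ReleaseMemoryFence()  atomic_thread_fence (memory_order_release)
#define MemoryFence()         atomic_thread_fence (memory_order_seq_cst)
```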


As I pointed out, my experience with clang is that volatile is not enough and we needed the compiler fence.

Thanks,

Andrew Fish

Thanks
Laszlo


Re: MemoryFence()

Paolo Bonzini <pbonzini@...>
 

On 08/02/21 18:40, Laszlo Ersek wrote:
(3) The article recommends _ReadWriteBarrier, _ReadBarrier and
_WriteBarrier, for compiler fences. I think _ReadWriteBarrier should
suffice for edk2's purposes.
However, the following reference deprecates those intrinsics:
https://docs.microsoft.com/en-us/cpp/intrinsics/readbarrier?view=msvc-160
while offering*only* C++ language replacements.
In theory, atomic_thread_fence (and in this case, more specifically, atomic_signal_fence) is also available in C11. But if it is possible to assume that _ReadWriteBarrier is present on Microsoft toolchains (even though it's deprecated), I would go for it, as it seems to be the best match for CompilerFence.

In fact, even Microsoft's own STL does

#define _Compiler_barrier() \
  _STL_DISABLE_DEPRECATED_WARNING _ReadWriteBarrier() \
  _STL_RESTORE_DEPRECATED_WARNING

extern "C" inline void
atomic_signal_fence(const memory_order _Order) noexcept
{
  if (_Order != memory_order_relaxed) {
    _Compiler_barrier();
  }
}

(https://github.com/microsoft/STL/blob/master/stl/inc/atomic).

For fences that require an assembly instruction, it is possible to use intrinsics too: _mm_mfence() for x86, __dmb(_ARM_BARRIER_SY) for ARM.
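Putting these suggestions together, the MSVC side could hypothetically
look like the following. The GCC/Clang branch is only there so the
sketch is self-contained and compiles anywhere; the intrinsic names are
as documented by Microsoft:

```c
/* Hypothetical MSVC-oriented sketch of two of the proposed primitives. */
#if defined (_MSC_VER)
  #include <intrin.h>
  #pragma warning (disable: 4996)       /* _ReadWriteBarrier deprecation */
  #define CompilerFence()  _ReadWriteBarrier ()
  #if defined (_M_IX86) || defined (_M_X64)
    #define MemoryFence()  _mm_mfence ()
  #else
    #define MemoryFence()  __dmb (0xF)  /* assumed _ARM_BARRIER_SY value */
  #endif
#else
  #define CompilerFence()  __asm__ __volatile__ ("" ::: "memory")
  #define MemoryFence()    __sync_synchronize ()
#endif
```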

Paolo


Re: MemoryFence()

Ard Biesheuvel <ardb@...>
 

On Mon, 8 Feb 2021 at 17:11, Laszlo Ersek <lersek@...> wrote:

On 02/06/21 09:37, Ard Biesheuvel wrote:

On ARM, 'inner shareable' is [generally] the subset of nodes and
edges that make up the CPUs, caches and cache-coherent DMA masters
(but not main memory)
This seems consistent with the fact that Paolo recommended variants of
"dmb ish" for all the memory (not compiler) fences -- I guess we don't
care about actual RAM content as long as all CPUs and cache-coherent DMA
masters see the same data.

Generally, yes. But I already mentioned non-cache coherent DMA
devices, and we also have some ARM platforms that use non-shareable
mappings for SRAM, which is used as temporary PEI memory in some
cases.

So the safe option is really to just use system-wide barriers - if
anyone has a use case where this is a performance bottleneck, we can
always revisit this later.


Re: MemoryFence()

Ard Biesheuvel <ardb@...>
 

On Fri, 5 Feb 2021 at 20:35, Laszlo Ersek <lersek@...> wrote:

rant warning

On 02/05/21 19:29, Ni, Ray wrote:

I read a bit about c memory model.
https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync
Thanks.

I think what happened here is a catastrophic impedance mismatch.

The underlying, physical architecture is a message-passing one.

Yet the mental abstraction that is built atop, for programmers to deal
with, is "shared memory". Which suggests a global shared state.

This is *suicidal*.
<snip>

I don't think the message passing paradigm is 100% accurate but it is
very close.

You should realize that a modern system is a network of CPUs, caches,
DRAM controllers and other bus masters, which is being traversed by
packets that represent the memory transactions (request/response pairs
for reads and non-posted writes). There is no natural 'grand scheme'
ordering of things, and when a certain memory ordering is required,
additional work is needed, which translates into additional packets,
packet types and silicon gates. (Disclaimer: I'm not a HW engineer)

A barrier is a packet that is broadcast on this network to instruct
all nodes that all transactions of a certain type need to be completed
before servicing or initiating new ones. On ARM, 'inner shareable' is
[generally] the subset of nodes and edges that make up the CPUs,
caches and cache-coherent DMA masters (but not main memory)

This is why it is more difficult for x86 to scale to 100s of cores:
there are simply more of these broadcasts being sent for ordinary
memory accesses, which gets out of hand quite quickly when scaling
out.


Re: MemoryFence()

Ni, Ray
 


The shared memory model is a *wreck*, given how the hardware operates
underneath.
I am sorry for redirecting to the C++11 shared memory model.
It has nothing to do with EDK2 here because EDK2 is C based.

The good news is EFI does not have threads and the PI MpServices protocol is indirectly messaging based as you wait for a
function to return or an event. Like C it can be abused since the memory is shared.
That's not entirely true today.
If you look into UefiCpuPkg (MpInitLib, RegisterCpuFeaturesLib, PiSmmCpu), these modules heavily depend on the sync primitives.
I don't like today's situation:
1. no semaphore standard library
2. no mutex standard library (there is a BaseSynchronization library that provides spinlock, cmpxchg, lock inc/dec primitives. I prefer a mutex lib over this BaseSync lib.)
3. directly using volatile over SPIN_BLOCK
4. using volatile, which can be avoided, as concluded by the discussion.

With the above wheels invented, one could, even with the MP service, write a complex long-running daemon procedure and sync with the BSP properly.


Ironically, the complexity of writing thread-safe code in C, and the fact that back in the day a big pool of the x86 firmware
writers were hardware folks who wrote in x86 assembler, is one of the reasons EFI ended up with no threads and a
cooperative event model...

Thanks,

Andrew Fish

Laszlo


Re: MemoryFence()

Andrew Fish <afish@...>
 

On Feb 5, 2021, at 11:35 AM, Laszlo Ersek <lersek@...> wrote:

rant warning

On 02/05/21 19:29, Ni, Ray wrote:

I read a bit about c memory model.
https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync
Thanks.

I think what happened here is a catastrophic impedance mismatch.

The underlying, physical architecture is a message-passing one.

Yet the mental abstraction that is built atop, for programmers to deal
with, is "shared memory". Which suggests a global shared state.

This is *suicidal*.

There has forever been a debate of tastes; there's a camp that prefers
the message passing API, and another camp that prefers the shared memory
/ locking API. There are pros and cons; usually it is pointed out that
message passing is safer but less performant, and also that the message
passing APIs are not difficult to implement on top of shared memory /
locking. What people (IME) don't spend many words on however is an
attempt to implement shared memory / locking on top of message passing.
"Why would you want to do that, in an *application*?", is how the
thinking would go, I guess.

But that's *exactly* what's being sold to us with the shared memory /
locking API *itself*. The hardware underneath is message-passing!

All this complexity about "non-commutativity" in the release/acquire
pattern -- see: "There has been no synchronization between threads 1 and
3" -- *only* makes sense if the programmer thinks about *messages in
transit* between the threads (well, the processing units that execute
them). Whereas, the concepts that the programmer deals with *in the
source code* are called "x" and "y" -- the same names in each thread,
the same addresses, the same storage.

So you've got a message-passing architecture at the bottom, then build a
shared memory abstraction on top, which leaks like a sieve, and then you
ask the programmer to mentally deconstruct that abstraction, while he or
she still has to stare at code that uses the shared memory terminology.
It's *insane*. It's self-defeating. If an alien came by and you claimed
this was all invented to deceive programmers on purpose, they'd have no
problem believing it.

Sequentially Consistent is the *only* model that halfway makes sense
(whatever the execution cost), as long as we have a shared-anything
programming model.

If we replaced "x" and "y" in the "Overall Summary" example, with the
following objects:

- thread1_thread2_channel_a
- thread1_thread2_channel_b
- thread2_thread3_channel_c
- thread2_thread3_channel_d
- thread3_thread1_channel_e
- thread3_thread1_channel_f

in other words, if we explicitly instantiated both "x" and "y" in every
particular inter-thread relationship, and we considered "store" as
"send", and "load" as "receive", and each such channel were
bi-directional, then folks like me might actually get a fleeting chance
at scratching the surface.

The shared memory model is a *wreck*, given how the hardware operates
underneath.
Looks like we are heading for a crash too. So good thing we are thinking about it.

The good news is EFI does not have threads and the PI MpServices protocol is indirectly messaging based as you wait for a function to return or an event. Like C it can be abused since the memory is shared.

Ironically, the complexity of writing thread-safe code in C, and the fact that back in the day a big pool of the x86 firmware writers were hardware folks who wrote in x86 assembler, is one of the reasons EFI ended up with no threads and a cooperative event model...

Thanks,

Andrew Fish

Laszlo


Re: MemoryFence()

Paolo Bonzini <pbonzini@...>
 

Il ven 5 feb 2021, 19:47 Laszlo Ersek <lersek@...> ha scritto:

On 02/05/21 19:32, Paolo Bonzini wrote:

Unaligned pointers are already undefined behavior so you can ignore how
they are dealt with at the processor level.
My question was unclearly asked, sorry. Let's say we have a UINT32 at an
address that's not a multiple of 4, but a multiple of 2. A pointer to
that UINT32 is "acceptably aligned" on x86, but not "naturally aligned".
Dereferencing the pointer is not undefined (my reading of C99 suggests
that alignment requirements are implementation-defined), but I don't
know if the atomicity guarantee holds.
Got it, it's a very good question. All edk2 porting targets should
naturally align all types not larger than pointers. I can only think of old
m68k that had 2-byte aligned 32-bit pointers.

Another example; we may have a pointer to a packed structure, and we
might want to poke at a UINT32 field in that structure. Not through a
naked pointer-to-UINT32 of course, which would throw away the
packed-ness, but really through the pointer-to-the-whole-packed-struct.
Packed types indeed should not be considered atomic.
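A quick illustration of the packed-structure case (hypothetical sketch;
the type and field names are made up, and #pragma pack is used since
GCC, Clang, and MSVC all accept it):

```c
#include <stddef.h>
#include <stdint.h>

/* A UINT32 field at offset 2: "acceptably aligned" for x86 accesses,
   but not naturally aligned, so no atomicity guarantee should be
   assumed when reading or writing it concurrently. */
#pragma pack (push, 1)
typedef struct {
  uint16_t  Tag;
  uint32_t  Value;   /* lands at offset 2 because of the packing */
} PACKED_RECORD;
#pragma pack (pop)
```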

Paolo


Thanks
Laszlo


Re: MemoryFence()

Andrew Fish <afish@...>
 

On Feb 5, 2021, at 10:47 AM, Laszlo Ersek <lersek@...> wrote:

On 02/05/21 19:32, Paolo Bonzini wrote:

Unaligned pointers are already undefined behavior so you can ignore how
they are dealt with at the processor level.
My question was unclearly asked, sorry. Let's say we have a UINT32 at an
address that's not a multiple of 4, but a multiple of 2. A pointer to
that UINT32 is "acceptably aligned" on x86, but not "naturally aligned".
Dereferencing the pointer is not undefined (my reading of C99 suggests
that alignment requirements are implementation-defined), but I don't
know if the atomicity guarantee holds.
Laszlo,

Clang treats unaligned pointers as undefined behavior (UB). I’m not sure if this is due to C11? What I know is that clang chooses not to optimize away UB from alignment errors, but if you run the address sanitizer you get errors.

For fun one day I turned on the clang address sanitizer with our edk2 firmware and had it emit UD2 on faults so it did not require a runtime. It turned out our debugger stub had lots of alignment issues :( so I kind of gave up at that point :).

Thanks,

Andrew Fish

Another example; we may have a pointer to a packed structure, and we
might want to poke at a UINT32 field in that structure. Not through a
naked pointer-to-UINT32 of course, which would throw away the
packed-ness, but really through the pointer-to-the-whole-packed-struct.

Thanks
Laszlo


Re: MemoryFence()

Andrew Fish <afish@...>
 

On Feb 5, 2021, at 10:11 AM, Ni, Ray <ray.ni@...> wrote:

I saw the proposal of fences in the first mail by Laszlo. Please forgive my ignorance. What is asm("") in x86? A nop? Then how can a nop help as a processor-level load/store barrier?
Ray,

In C, calling out to assembly is also a barrier/fence operation from the compiler's point of view. Actually, making an indirect procedure call (gBS->*, Protocol->*) is also a barrier. The compiler has no idea what the assembly code is doing across the boundary, so all the operations need to complete prior to calling the assembly code (or making the indirect procedure call). I guess a binary static lib is in this list too. In reality it is anything the C compiler can't link-time optimize through.

For gcc-flavored compilers, any __asm__ call is a compiler read/write barrier. That is why you see __asm__ __volatile__ ("":::"memory"); in the memory fence. That means for gcc/clang, any synchronization primitive implemented in inline assembly is also a compiler barrier/fence wrapping that assembly operation.

The VC++ inline assembler seems to be "more integrated" with the compiler, so I'm not sure what the rules are for that. Someone should really investigate that. Something tells me you will not find those answers on the Linux mailing list :).

The reason the MMIO operations are wrapped in _ReadWriteBarrier()/__asm__ __volatile__ ("":::"memory") has to do with the order of operations:
1) The leading barrier forces all the code before the call to complete its reads and writes prior to the MMIO operation.
2) The trailing barrier forces the MMIO operation to complete before unrelated reads and writes that come after the call, which could otherwise get reordered by optimization.
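The pattern just described might be sketched as follows. This is a
hypothetical illustration modeled on, but not copied from, the edk2
IoLib style; the function name is made up:

```c
#include <stdint.h>

/* Compiler read/write barrier (GCC/Clang spelling). */
#define COMPILER_FENCE()  __asm__ __volatile__ ("" ::: "memory")

/* Hypothetical MMIO write: barrier (1) keeps earlier accesses from
   sinking below the device write; barrier (2) keeps later, unrelated
   accesses from being hoisted above it by the optimizer. */
static inline void
HypotheticalMmioWrite32 (volatile uint32_t *Address, uint32_t Value)
{
  COMPILER_FENCE ();   /* (1) leading barrier */
  *Address = Value;
  COMPILER_FENCE ();   /* (2) trailing barrier */
}
```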

As I mentioned in this thread, I'm wondering if VC++ has extra, compiler-dependent behaviors around volatile, and whether that is why we have not seen a lot of issues to date?

I’m really glad this topic came up. It really seems like something we need to work through …...

Thanks,

Andrew Fish


thanks,
ray
From: rfc@edk2.groups.io <rfc@edk2.groups.io> on behalf of Paolo Bonzini <pbonzini@...>
Sent: Saturday, February 6, 2021 2:01:14 AM
To: Ni, Ray <ray.ni@...>; Laszlo Ersek <lersek@...>; Ard Biesheuvel <ardb@...>
Cc: Andrew Fish <afish@...>; edk2 RFC list <rfc@edk2.groups.io>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address) <leif@...>; Dong, Eric <eric.dong@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...>
Subject: Re: [edk2-rfc] MemoryFence()

On 05/02/21 18:53, Ni, Ray wrote:
Without calling _ReadWriteBarrier, is it possible that compiler
generates the assembly in the wrong location? I mean the compiler may
in-line the LibWaitForSemaphore and call cmpxchg earlier than the
desired location.
Similar to LibReleaseSemaphore.

So my understanding is the _ReadWriteBarrier in ReleaseSpinLock is required.
The proposed ReleaseMemoryFence() should already have that effect. All
the proposed fences except CompilerFence() are both compiler
optimization barriers and processor barriers.
InterlockedCompareExchange() is also both a compiler optimization
barrier and a processor barrier

CompilerFence() is just a better name for _ReadWriteBarrier(), it blocks
optimizations but it has no effect at the processor level. It should
only be used (instead of volatile) in busy-waiting loops that do not
always go through an InterlockedCompareExchange.
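
(A minimal, single-threaded illustration of that busy-wait pattern, with a hypothetical CompilerFence() built on the GCC/clang equivalent of _ReadWriteBarrier; the fence is what forces the compiler to re-load the flag from memory on every iteration instead of caching it in a register:)

```c
#include <stdint.h>

/* Hypothetical CompilerFence(): compiler reordering barrier only,
   no effect at the processor level. */
#define CompilerFence()  __asm__ __volatile__ ("" ::: "memory")

uint32_t  mFlag;  /* deliberately not volatile */

/* Busy-wait on mFlag; returns the number of spins taken. */
uint32_t
WaitForFlag (
  void
  )
{
  uint32_t  Spins;

  Spins = 0;
  while (mFlag == 0) {
    CompilerFence ();
    Spins++;
  }
  return Spins;
}

/* Single-threaded demo: set the flag first, so the wait returns at once. */
uint32_t
SetAndWait (
  void
  )
{
  mFlag = 1;
  return WaitForFlag ();
}
```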

Paolo

234 _ReadWriteBarrier ();
235 *SpinLock = SPIN_LOCK_RELEASED;
236 _ReadWriteBarrier ();
237
238 return SpinLock;
it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store. Instead it should be just this:

ReleaseMemoryFence ();
*SpinLock = SPIN_LOCK_RELEASED;
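
(The corrected unlock above, as a compilable stand-alone sketch; the type, constants and fence are hypothetical stand-ins for the MdePkg definitions and the proposed primitive, with "dmb ish" on ARM per the proposal table and a compiler-only barrier on x86:)

```c
#include <stdint.h>

#if defined (__aarch64__) || defined (__arm__)
  #define ReleaseMemoryFence()  __asm__ __volatile__ ("dmb ish" ::: "memory")
#elif defined (__x86_64__) || defined (__i386__)
  #define ReleaseMemoryFence()  __asm__ __volatile__ ("" ::: "memory")
#else
  #define ReleaseMemoryFence()  __sync_synchronize ()
#endif

#define SPIN_LOCK_RELEASED  1u
#define SPIN_LOCK_ACQUIRED  2u

typedef volatile uintptr_t SPIN_LOCK;

SPIN_LOCK *
ReleaseSpinLock (
  SPIN_LOCK  *SpinLock
  )
{
  ReleaseMemoryFence ();            /* order earlier accesses before the store */
  *SpinLock = SPIN_LOCK_RELEASED;   /* plain aligned store; no trailing barrier */
  return SpinLock;
}

/* Demo: acquire then release a lock, returning its final value. */
uintptr_t
ReleaseSpinLockDemo (
  void
  )
{
  SPIN_LOCK  Lock = SPIN_LOCK_ACQUIRED;

  ReleaseSpinLock (&Lock);
  return Lock;
}
```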





Re: MemoryFence()

Laszlo Ersek
 

On 02/05/21 19:17, Paolo Bonzini wrote:

On x86 load-load, load-store and store-store ordering is already guaranteed
by the processor. Therefore on x86 the AcquireMemoryFence and
ReleaseMemoryFence are just like CompilerFence: they only have to block
compiler-level reordering. MemoryFence is the only one that blocks
store-load reordering and needs to emit an MFENCE instruction.
This is exactly what the fence API implementations should have in their
comments, preferably with Intel / AMD manual references...

On ARM (either 32- or 64-bit) the processor-level guarantees are weaker,
and you need to emit a "dmb" instruction for acquire and release fences as
well.
Same, just with the ARM ARM.
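
(A sketch of what those commented implementations might look like, with hypothetical macro names and GCC/clang inline asm; the manual references are the kind of comments asked for above, not quotations:)

```c
#include <stdint.h>

#if defined (__aarch64__)
  /* ARM ARM: DMB ISHLD orders earlier loads against later loads/stores
     (acquire); DMB ISH is a full inner-shareable barrier (release). */
  #define AcquireMemoryFence()  __asm__ __volatile__ ("dmb ishld" ::: "memory")
  #define ReleaseMemoryFence()  __asm__ __volatile__ ("dmb ish"   ::: "memory")
#elif defined (__x86_64__) || defined (__i386__)
  /* Intel SDM / AMD APM: x86 already orders load-load, load-store and
     store-store, so only compiler reordering must be blocked here;
     only the full MemoryFence() would need MFENCE (store-load). */
  #define AcquireMemoryFence()  __asm__ __volatile__ ("" ::: "memory")
  #define ReleaseMemoryFence()  __asm__ __volatile__ ("" ::: "memory")
#else
  #define AcquireMemoryFence()  __sync_synchronize ()
  #define ReleaseMemoryFence()  __sync_synchronize ()
#endif

uint32_t  mSharedData;
uint32_t  mSharedFlag;

/* Writer publishes data, then the flag; reader checks the flag, then
   reads the data. Single-threaded here, so it simply returns Value. */
uint32_t
PublishAndConsume (
  uint32_t  Value
  )
{
  mSharedData = Value;
  ReleaseMemoryFence ();        /* data visible before the flag */
  mSharedFlag = 1;

  if (mSharedFlag != 0) {
    AcquireMemoryFence ();      /* flag read before the data read */
    return mSharedData;
  }
  return 0;
}
```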

Thanks!
Laszlo


Re: MemoryFence()

Laszlo Ersek
 

On 02/05/21 19:32, Paolo Bonzini wrote:

Unaligned pointers are already undefined behavior so you can ignore how
they are dealt with at the processor level.
My question was unclearly asked, sorry. Let's say we have a UINT32 at an
address that's not a multiple of 4, but a multiple of 2. A pointer to
that UINT32 is "acceptably aligned" on x86, but not "naturally aligned".
Dereferencing the pointer is not undefined (my reading of C99 suggests
that alignment requirements are implementation-defined), but I don't
know if the atomicity guarantee holds.

Another example; we may have a pointer to a packed structure, and we
might want to poke at a UINT32 field in that structure. Not through a
naked pointer-to-UINT32 of course, which would throw away the
packed-ness, but really through the pointer-to-the-whole-packed-struct.

Thanks
Laszlo


Re: MemoryFence()

Paolo Bonzini <pbonzini@...>
 

Il ven 5 feb 2021, 19:21 Laszlo Ersek <lersek@...> ha scritto:

ReleaseMemoryFence ();
*SpinLock = SPIN_LOCK_RELEASED;
No concern that the store might not be atomic?
Not as long as it's a pointer or smaller.
This is why I'd like someone else to write the primitives :)
To expand: it is an assumption on the implementation of the C compiler; it
admittedly goes beyond the C standard but it is a reasonable assumption,
compilers may *merge* loads and stores but have no reason to split aligned
loads and stores. Likewise for processors (the pointer limitation is
because sizes larger than the pointer's may not have an instruction that
reads or writes them atomically).

The standard would require you to use C11 atomic loads and stores, but as
far as I understand you cannot assume that they exist on all edk2 compilers.
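
(For reference, this is roughly what the standard-conforming form would look like with C11 atomics, if <stdatomic.h> could be assumed on every edk2 toolchain; names are hypothetical:)

```c
#include <stdatomic.h>
#include <stdint.h>

_Atomic uint32_t  mC11Lock;

/* Acquire then release a lock using C11 atomic stores/loads:
   1u ~ SPIN_LOCK_RELEASED, 2u ~ SPIN_LOCK_ACQUIRED. */
uint32_t
C11ReleaseDemo (
  void
  )
{
  atomic_store_explicit (&mC11Lock, 2u, memory_order_relaxed);
  /* release store: orders all earlier accesses before the unlock */
  atomic_store_explicit (&mC11Lock, 1u, memory_order_release);
  /* acquire load: pairs with a release store from another thread */
  return atomic_load_explicit (&mC11Lock, memory_order_acquire);
}
```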

- Is it architecture-dependent?
No.

- Is it alignment-dependent?
Unaligned pointers are already undefined behavior so you can ignore how
they are dealt with at the processor level.

Paolo


etc... :)

(I'm not challenging your statement of course; I know I should have
known, just not sure where from...)

Thanks,
Laszlo


Re: MemoryFence()

Ni, Ray
 

Wow! Thank you Paolo for explaining everything!

I read a bit about c memory model.
https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync

Do you have any documentation that can explain how the different models are implemented (using what instructions)?

I would like to know if there is a better primitive api other than the fences because MSVC deprecates the barrier apis.
https://docs.microsoft.com/en-us/cpp/intrinsics/readwritebarrier?view=msvc-160
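
(One non-deprecated alternative, assuming C11 <stdatomic.h> is available, would be atomic_thread_fence plus relaxed accesses; a hedged sketch of the same publish/consume pattern, with hypothetical names:)

```c
#include <stdatomic.h>
#include <stdint.h>

uint32_t          mPayloadData;
_Atomic uint32_t  mPayloadReady;

/* Publish mPayloadData behind a release fence, then consume it behind
   an acquire fence. Single-threaded here, so it simply returns Value. */
uint32_t
FencePublishDemo (
  uint32_t  Value
  )
{
  mPayloadData = Value;
  atomic_thread_fence (memory_order_release);   /* ~ ReleaseMemoryFence() */
  atomic_store_explicit (&mPayloadReady, 1u, memory_order_relaxed);

  if (atomic_load_explicit (&mPayloadReady, memory_order_relaxed) != 0) {
    atomic_thread_fence (memory_order_acquire); /* ~ AcquireMemoryFence() */
    return mPayloadData;
  }
  return 0u;
}
```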



thanks,
ray
________________________________
From: Paolo Bonzini <pbonzini@...>
Sent: Saturday, February 6, 2021 2:17:53 AM
To: Ni, Ray <ray.ni@...>
Cc: edk2-rfc-groups-io <rfc@edk2.groups.io>; Laszlo Ersek <lersek@...>; Ard Biesheuvel <ardb@...>; Andrew Fish <afish@...>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address) <leif@...>; Dong, Eric <eric.dong@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...>
Subject: Re: [edk2-rfc] MemoryFence()

Il ven 5 feb 2021, 19:12 Ni, Ray <ray.ni@...<mailto:ray.ni@...>> ha scritto:
I saw the proposal of fences in the first mail by Laszlo. Please forgive my ignorance. What is asm("") in x86? A nop? Then how can a nop help as a processor-level load/store barrier?

On x86 load-load, load-store and store-store ordering is already guaranteed by the processor. Therefore on x86 the AcquireMemoryFence and ReleaseMemoryFence are just like CompilerFence: they only have to block compiler-level reordering. MemoryFence is the only one that blocks store-load reordering and needs to emit an MFENCE instruction.

On ARM (either 32- or 64-bit) the processor-level guarantees are weaker, and you need to emit a "dmb" instruction for acquire and release fences as well.

Paolo



thanks,
ray
________________________________
From: rfc@edk2.groups.io<mailto:rfc@edk2.groups.io> <rfc@edk2.groups.io<mailto:rfc@edk2.groups.io>> on behalf of Paolo Bonzini <pbonzini@...<mailto:pbonzini@...>>
Sent: Saturday, February 6, 2021 2:01:14 AM
To: Ni, Ray <ray.ni@...<mailto:ray.ni@...>>; Laszlo Ersek <lersek@...<mailto:lersek@...>>; Ard Biesheuvel <ardb@...<mailto:ardb@...>>
Cc: Andrew Fish <afish@...<mailto:afish@...>>; edk2 RFC list <rfc@edk2.groups.io<mailto:rfc@edk2.groups.io>>; Kinney, Michael D <michael.d.kinney@...<mailto:michael.d.kinney@...>>; Leif Lindholm (Nuvia address) <leif@...<mailto:leif@...>>; Dong, Eric <eric.dong@...<mailto:eric.dong@...>>; Liming Gao (Byosoft address) <gaoliming@...<mailto:gaoliming@...>>; Ankur Arora <ankur.a.arora@...<mailto:ankur.a.arora@...>>
Subject: Re: [edk2-rfc] MemoryFence()

On 05/02/21 18:53, Ni, Ray wrote:
Without calling _ReadWriteBarrier, is it possible that compiler
generates the assembly in the wrong location? I mean the compiler may
in-line the LibWaitForSemaphore and call cmpxchg earlier than the
desired location.
Similar to LibReleaseSemaphore.

So my understanding is the _ReadWriteBarrier in ReleaseSpinLock is required.
The proposed ReleaseMemoryFence() should already have that effect. All
the proposed fences except CompilerFence() are both compiler
optimization barriers and processor barriers.
InterlockedCompareExchange() is also both a compiler optimization
barrier and a processor barrier

CompilerFence() is just a better name for _ReadWriteBarrier(), it blocks
optimizations but it has no effect at the processor level. It should
only be used (instead of volatile) in busy-waiting loops that do not
always go through an InterlockedCompareExchange.

Paolo

234 _ReadWriteBarrier ();
235 *SpinLock = SPIN_LOCK_RELEASED;
236 _ReadWriteBarrier ();
237
238 return SpinLock;
it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store. Instead it should be just this:

ReleaseMemoryFence ();
*SpinLock = SPIN_LOCK_RELEASED;


Re: MemoryFence()

Laszlo Ersek
 

On 02/05/21 18:34, Paolo Bonzini wrote:
On 05/02/21 18:31, Laszlo Ersek wrote:
On 02/05/21 18:21, Paolo Bonzini wrote:
On 05/02/21 17:56, Laszlo Ersek wrote:
     221  SPIN_LOCK *
     222  EFIAPI
     223  ReleaseSpinLock (
     224    IN OUT  SPIN_LOCK                 *SpinLock
     225    )
     226  {
     227    SPIN_LOCK    LockValue;
     228
     229    ASSERT (SpinLock != NULL);
     230
     231    LockValue = *SpinLock;
     232    ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue ==
SPIN_LOCK_RELEASED);
     233
     234    _ReadWriteBarrier ();
     235    *SpinLock = SPIN_LOCK_RELEASED;
     236    _ReadWriteBarrier ();
     237
     238    return SpinLock;
     239  }

Fishy. I would have implemented it with another
InterlockedCompareExchangePointer(), and maybe ASSERT()ed on the
original value returned by the InterlockedCompareExchangePointer().
That would be overkill.  However, it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store.  Instead it should be just this:

     ReleaseMemoryFence ();
     *SpinLock = SPIN_LOCK_RELEASED;
No concern that the store might not be atomic?
Not as long as it's a pointer or smaller.
This is why I'd like someone else to write the primitives :)

- Where is this documented?

- Is it architecture-dependent?

- Is it alignment-dependent?

etc... :)

(I'm not challenging your statement of course; I know I should have
known, just not sure where from...)

Thanks,
Laszlo


Re: MemoryFence()

Paolo Bonzini <pbonzini@...>
 

Il ven 5 feb 2021, 19:12 Ni, Ray <ray.ni@...> ha scritto:

I saw the proposal of fences in the first mail by Laszlo. Please forgive my
ignorance. What is asm("") in x86? A nop? Then how can a nop help as a
processor-level load/store barrier?
On x86 load-load, load-store and store-store ordering is already guaranteed
by the processor. Therefore on x86 the AcquireMemoryFence and
ReleaseMemoryFence are just like CompilerFence: they only have to block
compiler-level reordering. MemoryFence is the only one that blocks
store-load reordering and needs to emit an MFENCE instruction.

On ARM (either 32- or 64-bit) the processor-level guarantees are weaker,
and you need to emit a "dmb" instruction for acquire and release fences as
well.

Paolo



thanks,
ray
------------------------------
*From:* rfc@edk2.groups.io <rfc@edk2.groups.io> on behalf of Paolo Bonzini <
pbonzini@...>
*Sent:* Saturday, February 6, 2021 2:01:14 AM
*To:* Ni, Ray <ray.ni@...>; Laszlo Ersek <lersek@...>; Ard
Biesheuvel <ardb@...>
*Cc:* Andrew Fish <afish@...>; edk2 RFC list <rfc@edk2.groups.io>;
Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia
address) <leif@...>; Dong, Eric <eric.dong@...>; Liming
Gao (Byosoft address) <gaoliming@...>; Ankur Arora <
ankur.a.arora@...>
*Subject:* Re: [edk2-rfc] MemoryFence()

On 05/02/21 18:53, Ni, Ray wrote:
Without calling _ReadWriteBarrier, is it possible that compiler
generates the assembly in the wrong location? I mean the compiler may
in-line the LibWaitForSemaphore and call cmpxchg earlier than the
desired location.
Similar to LibReleaseSemaphore.

So my understanding is the _ReadWriteBarrier in ReleaseSpinLock is
required.

The proposed ReleaseMemoryFence() should already have that effect. All
the proposed fences except CompilerFence() are both compiler
optimization barriers and processor barriers.
InterlockedCompareExchange() is also both a compiler optimization
barrier and a processor barrier

CompilerFence() is just a better name for _ReadWriteBarrier(), it blocks
optimizations but it has no effect at the processor level. It should
only be used (instead of volatile) in busy-waiting loops that do not
always go through an InterlockedCompareExchange.

Paolo

234 _ReadWriteBarrier ();
235 *SpinLock = SPIN_LOCK_RELEASED;
236 _ReadWriteBarrier ();
237
238 return SpinLock;
it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store. Instead it should be just this:

ReleaseMemoryFence ();
*SpinLock = SPIN_LOCK_RELEASED;






Re: MemoryFence()

Andrew Fish <afish@...>
 

Ray,

Yes, I was writing a mail when I wrote the bugzilla…

Ray,

Function calls are not compiler fences. The compiler just optimized the code away as dead code. For non-VC++ compilers we have a real compiler fence/barrier in MemoryFence(). I filed a bugzilla [1] so we can investigate this some more.

I hit a real bug with clang using the MMIO functions a long time ago and I think that might be why the GCC version of the MemoryFence() got the compiler fence/barrier.

If I had to guess, VC++ is probably more conservative about volatiles and that is why we have not seen issues with VC++, but that depends on non-standard behavior.

Well, this probably explains why we don’t have many MemoryFence() calls in the x86 CPU code, given they don’t do anything on VC++. I have noticed from debugging MP code that was developed using VC++ that we seem to hit more issues when it is compiled via clang. I wonder if the behavior difference with volatile would explain some of those?

VOID
EFIAPI
MemoryFence (
  VOID
  )
{
  return;
}

And:

VOID
EFIAPI
MemoryFence (
  VOID
  )
{
  // This is a little bit of overkill and it is more about the compiler than it is
  // actually about processor synchronization. This is like the _ReadWriteBarrier
  // Microsoft specific intrinsic.
  __asm__ __volatile__ ("":::"memory");
}

So I think for VC++ it should be…

VOID
EFIAPI
MemoryFence (
  VOID
  )
{
  _ReadWriteBarrier ();
  return;
}

[1] https://bugzilla.tianocore.org/show_bug.cgi?id=3213

Thanks,

Andrew Fish

On Feb 5, 2021, at 9:53 AM, Ni, Ray <ray.ni@...> wrote:

Without calling _ReadWriteBarrier, is it possible that compiler generates the assembly in the wrong location? I mean the compiler may in-line the LibWaitForSemaphore and call cmpxchg earlier than the desired location.
Similar to LibReleaseSemaphore.

So my understanding is the _ReadWriteBarrier in ReleaseSpinLock is required.

I think Andrew also has the same thoughts.

thanks,
ray
From: Paolo Bonzini <pbonzini@...>
Sent: Saturday, February 6, 2021 1:34:15 AM
To: Laszlo Ersek <lersek@...>; Ni, Ray <ray.ni@...>; Ard Biesheuvel <ardb@...>
Cc: Andrew Fish <afish@...>; edk2 RFC list <rfc@edk2.groups.io>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address) <leif@...>; Dong, Eric <eric.dong@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...>
Subject: Re: MemoryFence()

On 05/02/21 18:31, Laszlo Ersek wrote:
On 02/05/21 18:21, Paolo Bonzini wrote:
On 05/02/21 17:56, Laszlo Ersek wrote:
221 SPIN_LOCK *
222 EFIAPI
223 ReleaseSpinLock (
224 IN OUT SPIN_LOCK *SpinLock
225 )
226 {
227 SPIN_LOCK LockValue;
228
229 ASSERT (SpinLock != NULL);
230
231 LockValue = *SpinLock;
232 ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue ==
SPIN_LOCK_RELEASED);
233
234 _ReadWriteBarrier ();
235 *SpinLock = SPIN_LOCK_RELEASED;
236 _ReadWriteBarrier ();
237
238 return SpinLock;
239 }

Fishy. I would have implemented it with another
InterlockedCompareExchangePointer(), and maybe ASSERT()ed on the
original value returned by the InterlockedCompareExchangePointer().
That would be overkill. However, it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store. Instead it should be just this:

ReleaseMemoryFence ();
*SpinLock = SPIN_LOCK_RELEASED;
No concern that the store might not be atomic?
Not as long as it's a pointer or smaller.

Paolo


Re: MemoryFence()

Ni, Ray
 

I saw the proposal of fences in the first mail by Laszlo. Please forgive my ignorance. What is asm("") in x86? A nop? Then how can a nop help as a processor-level load/store barrier?


thanks,
ray
________________________________
From: rfc@edk2.groups.io <rfc@edk2.groups.io> on behalf of Paolo Bonzini <pbonzini@...>
Sent: Saturday, February 6, 2021 2:01:14 AM
To: Ni, Ray <ray.ni@...>; Laszlo Ersek <lersek@...>; Ard Biesheuvel <ardb@...>
Cc: Andrew Fish <afish@...>; edk2 RFC list <rfc@edk2.groups.io>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address) <leif@...>; Dong, Eric <eric.dong@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...>
Subject: Re: [edk2-rfc] MemoryFence()

On 05/02/21 18:53, Ni, Ray wrote:
Without calling _ReadWriteBarrier, is it possible that compiler
generates the assembly in the wrong location? I mean the compiler may
in-line the LibWaitForSemaphore and call cmpxchg earlier than the
desired location.
Similar to LibReleaseSemaphore.

So my understanding is the _ReadWriteBarrier in ReleaseSpinLock is required.
The proposed ReleaseMemoryFence() should already have that effect. All
the proposed fences except CompilerFence() are both compiler
optimization barriers and processor barriers.
InterlockedCompareExchange() is also both a compiler optimization
barrier and a processor barrier

CompilerFence() is just a better name for _ReadWriteBarrier(), it blocks
optimizations but it has no effect at the processor level. It should
only be used (instead of volatile) in busy-waiting loops that do not
always go through an InterlockedCompareExchange.

Paolo

234 _ReadWriteBarrier ();
235 *SpinLock = SPIN_LOCK_RELEASED;
236 _ReadWriteBarrier ();
237
238 return SpinLock;
it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store. Instead it should be just this:

ReleaseMemoryFence ();
*SpinLock = SPIN_LOCK_RELEASED;
