Re: MemoryFence()


Paolo Bonzini <pbonzini@...>
 

Il gio 4 feb 2021, 20:46 Ard Biesheuvel <ardb@...> ha scritto:

(1) We should introduce finer-grained fence primitives:

ARM AARCH64 i386

CompilerFence() asm("") asm("") asm("")
AcquireMemoryFence() dmb ish dmb ishld asm("")
ReleaseMemoryFence() dmb ish dmb ish asm("")
MemoryFence() dmb ish dmb ish mfence

"where AcquireMemoryFence() is used on the read side (i.e. between
reads) and ReleaseMemoryFence() is used on the write side (i.e. between
writes)".
Acquire semantics typically order writes before reads, not /between/
reads. Similarly, release semantics imply that all outstanding writes
complete before a barrier with acquire semantics is permitted to
complete.
Acquire fences are barriers between earlier loads and subsequent loads and
stores; those earlier loads then synchronize with release stores in other
threads.

Release fences are barriers been earlier loads and stores against
subsequent stores, and those subsequent stores synchronize with acquire
loads in other threads.

In both cases, however, fences only make sense between memory operations.
So something like "after reads" and "before writes" would have been more
precise in some sense, but in practice the usual idiom is "between"
reads/writes as Laszlo wrote.

Note that reasoning about this only makes sense in the context of
concurrency, i.e., different CPUs operating on the same memory (or
coherent DMA masters)

For non-coherent DMA, the 'ish' variants are not appropriate, and
given the low likelihood that any of this is creates a performance
bottleneck, I would suggest to only use full barriers on ARM.
Sure, that's a matter of how to implement the primitives. If you think that
non-coherent DMA is important, a full dmb can be used too.

As far as the compiler is concerned, an asm in the macros *should* block
enough optimizations, even without making the accesses volatile.
CompilerFence (or the edk2 equivalent of cpu_relax, whose name escapes me
right now) would be necessary in the body of busy-waiting loops. However
somebody should check the MSVC docs for asm, too.

It is very important to be *aware* of the acquire/release semantics,
but I don't think it is necessary to use all the fine grained barrier
types in EDK2.
I agree as long as the primitives are self-documenting. A single
MemoryFence() does not make it clear in which direction the data is flowing
(whether from other threads to this one, or vice versa).

Paolo


Less importantly, this would restore the x86 behavior (performance) to
the one seen with the current MemoryFence() implementation.

More importantly, it would fix what the code *means*: "it's very
important to stick to known lockless patterns, and having matching
acquire/release fences makes it much more intuitive to understand what's
going on".

--*--

I'm not proposing this as a pre-requisite to merging Ankur's series
(i.e., I don't expect the series to use acquire/release -- it can be
converted later with the rest of the audit).

However, I'd really like us to stop wasting more time on MemoryFence()
doubts such as:

- Hey it's not a memory fence at all, is it safe?
Who decides what 'memory fence' means anyway?

- But on x86 it kinda is, *dependent on* the spinlocks / lock prefixes
we use for synchronization, between the other accesses.

- Is it interchangeable with volatile? Yes? No? which one should we use?
'volatile' affects code generation only, i.e., which loads and stores
are issued. The *order* in which those loads observe stores done by
other masters is what is managed by barrier instructions. So they are
two sides of the same coin: 'volatile' may help to guarantee that
every assignment through a pointer variable results in a store
instruction, but how these are observed by other CPUs still needs to
be managed if it needs to occur in a certain way.

I think volatile is the wrong tool in most cases: the ordering of
accesses are not a property of the type or of the variable, but of the
code sequence. So using asm("") to ensure that an assignment through a
pointer is emitted as a store instruction, followed by a barrier
instruction is a typical pattern. And given the fact that barrier
instructions require asm sequences in most cases anyway, the asm ("")
will often be implied.


In the longer term, multi-processing code like MpInitLib (CpuDxe,
CpuMpPei) and PiSmmCpuDxeSmm should likely abandon all use of
"volatile", and use the explicit barriers instead, for *clarity*.


Step (2) would take quite a bit of thinking (if not much code).

Comments?

Would there be interest in reviewing such work?

Thanks
Laszlo

Join rfc@edk2.groups.io to automatically receive all group messages.