Re: MemoryFence()


Ankur Arora
 

On 2021-02-09 5:20 a.m., Laszlo Ersek wrote:
On 02/08/21 21:44, Ankur Arora wrote:
On 2021-02-08 9:40 a.m., Laszlo Ersek wrote:
So the structure of the solution we're looking for is:

- exactly *one* of:
   - volatile
   - compiler fence
   - acquire fence used as a heavy substitute for compiler fence,
Since the first two are of unequal power -- volatile for specific
accesses and the compiler fence being a general barrier across which
the compiler does not optimize memory accesses I think we should have
both instead of volatile or compiler fence.
I disagree, as long as we talk about CPU synchronization through
variables that live in RAM. "volatile" has a use case, yes, but not for
this purpose. The Linux document Paolo referenced earlier explains this
well. The compiler fence is a superset of volatile (for this use case),
feature-wise, *and* it has better performance.

However, instead of volatile, how about read_once(), write_once()
primitives? These, respectively ensure that the compiler emits a
single load or a store and does not combine or mutilate the load/store
in any way.

Internally, these are implemented via volatile (at least the Linux
kernel implementation that I looked at is) but they would explicitly
lay out the guarantees (and non-guarantees) that the C/C++ standard
makes:

 "However, the guarantees of the standard are not sufficient for using
 volatile for multi-threading. The C++ Standard does not stop the compiler
 from reordering non-volatile reads and writes relative to volatile reads
 and writes, and it says nothing about preventing CPU reordering."

(https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming#volatile-variables-and-reordering)


I see at least two cases (essentially where these are useful on their
own without needing any other fences for ordering (compiler or cpu)):

* Device side communication:

MdeModulePkg/Bus/Ata/AhciPei/AhciMode.c::AhciWaitMemSet()
271     // The system memory pointed by Address will be updated by the
272     // SATA Host Controller, "volatile" is introduced to prevent
273     // compiler from optimizing the access to the memory address
274     // to only read once.
275     //
276     Value  = *(volatile UINT32 *) (UINTN) Address;
277     Value &= MaskValue;
278

So use read_once() instead of the volatile cast:
276     Value  = read_once (Address);
This seems to overlap with MmioRead32() from <IoLib.h>, at least in
purpose. I think the code should have used MmioRead32() in the first place.
AFAICS, the various versions of MmioRead, MmioWrite defined in
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c or
MdePkg/Library/BaseIoLibIntrinsic/IoLibNoIo.c

are effectively what I was proposing as well: a typecheck and
a volatile cast to ensure that the compiler emits a load or
a store.


* For handling cases where we don't need a fence, but only
need to ensure that the compiler does not mutilate a store in any
way. As an example, the code below clears the Handler and the
block below it calls the Handler if the pointer is non-NULL.

(From:
https://patchew.org/EDK2/20210129005950.467638-1-ankur.a.arora@oracle.com/20210129005950.467638-7-ankur.a.arora@oracle.com/,

https://patchew.org/EDK2/20210129005950.467638-1-ankur.a.arora@oracle.com/20210129005950.467638-9-ankur.a.arora@oracle.com/)



CpuEject():
 99 +    //
100 +    // We are done until the next hot-unplug; clear the handler.
101 +    //
102 +    mCpuHotEjectData->Handler = NULL;

SmmCpuFeaturesRendezvousExit():
140 +  if (mCpuHotEjectData == NULL ||
141 +      mCpuHotEjectData->Handler == NULL) {
142 +    return;
143 +  }
144 +
145 +  mCpuHotEjectData->Handler (CpuIndex);

If the code on line 102 were instead,

102 +    write_once (&mCpuHotEjectData->Handler, NULL);

and that on lines 141-145 were:
140 +  if (mCpuHotEjectData == NULL ||
141 +      ((handler = read_once (mCpuHotEjectData->Handler)) == NULL) {
142 +    return;
143 +  }
144 +
145 +  handler (CpuIndex);
Hmm wait let me gather my thoughts on this...
We have this loop in SmiRendezvous()
[UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c]:
//
// Wait for BSP's signal to exit SMI
//
while (*mSmmMpSyncData->AllCpusInSync) {
CpuPause ();
}
(there are two instances of it in SmiRendezvous(), both executed before
reaching the Exit label).
This is the loop that separtes CpuEject() (on the BSP) from
SmmCpuFeaturesRendezvousExit() (on the AP).
The object accessed is a "volatile BOOLEAN", so conceptually this is a
loop body with a compiler fence.
The goal is to highlight the data flow (i.e., release -> acquire) in the
code, so we should put a release fence in CpuEject() and an acquire
fence in SmmCpuFeaturesRendezvousExit().
* Until we can do that -- until we have the better-named APIs -- we
should use MemoryFence() for both fences. However, because MemoryFence()
doesn't do anything on VS, we should also keep the volatile, which (on
x86) happens to force the same practical effect that we expect of the
release and acquire fences too.
* However IIUC you're saying that we shouldn't even aim at placing the
release and acquire fences in CpuEject / SmmCpuFeaturesRendezvousExit
respectively, but use write_once() / read_once().
Yes, exactly. My point was that at least in the restricted case that
I was describing -- where we are only concerned with,
mCpuHotEjectData->Handler and not it's ordering with respect to
other stores. Where that would be necessary we would need a compiler
and/or memory-fence.

My comment on that is: I don't know.
I don't know how read_once() / write_once() in Linux compares to the
acquire / release fences.
I don't know if read_once / write_once are just as expressive as the
acquire / release fences (in Paolo's opinion, for example). I also don't
know if their implementations (their effects on the CPUs / visibility)
would be interchangeable on all edk2 architectures.
So I think this is something that Paolo should please comment on.
*Personally*, I would be satisfied with the following "minimum version" too:
- update the MemoryFence() specification to say it is a full compiler
and CPU barrier (covering store-load as well)
- annotate MemoryFence() call sites with comments, to highlight the data
flow direction
- make MemoryFence() emit an MFENCE on x86, and waste a minuscule amount
of CPU time in response -- that's an acceptable price for covering
store-load (Paolo mentioned ~50 clock cycles as penalty)
- fix MemoryFence()'s implementation for VS (currently it does nothing
at all)
- use MemoryFence(), and do not use volatile, in the v7 unplug series
For the last item, I will assume that the MemoryFence() will act as both
a compiler and a CPU barrier while annotating my code to say what specific
ordering I'm banking on.

*However*, Paolo strongly recommends that we introduce and use the
directional barriers, for highlighting data flow, and for matching known
lockless patterns. Thus, I understand your read_once / write_once
suggestion to be an *alternative* to Paolo's. And I'm not equipped to
compare both recommendations, alas.
Oh, no. I wasn't suggesting read_once/write_once as an alternative to
the primitives that Paolo was talking about. Those would handle
ordering while read_once/write_once were only intended to ensure
that the compiler does not in any way mangle the stores and loads.

In general I would say that fences are -- except for one case -- strictly
greater in power than read_once/write_once. That one case is alignment
checks (which we do for instance in MmioRead32() and friends.)

Given that difference in power, I agree with the approach Paolo
recommended elsewhere that for CPU memory access syncrhonization we
start with Fences and then switch to lesser powered variants
as and when needed:

"However, these changes do not have to be done in a single step.
Starting with the rationalization of memory fences makes sense
because, in addition to enabling the annotation of code rather
than data, it is essentially a bugfix. "

I don't know why Linux has fences and read_once / write_once separately.
I know they're all documented, I think I tried to read those docs in the
past (unsuccessfully). I just feel that *all that* would be too much for
me to use edk2.
Yeah I'm not sure we need to go wild with these in EDK2 either. That's
just a recipe for bugs.

Thanks
Ankur

I have to defer this to Paolo and Ard.
Thanks
Laszlo

The use of read_once()/write_once() also document the implicit assumptions
being made in the code at the point of use.

Thanks
Ankur

- and *all* of
   - acquire fence (load-load, load-store)
   - release fence (load-store, store-store)
   - full fence    (load-load, load-store, store-store, store-load)

The implementation of each fence would have to be *at least* as safe as
required; it could be stronger.

I feel that we have to reach an agreement on the "exactly one of" part;
subsequent to that, maybe I can try an RFC patch for <BaseLib.h> (just
the interface contracts, at first).

Thanks
Laszlo

Join {rfc@edk2.groups.io to automatically receive all group messages.