Date   

Re: MemoryFence()

Andrew Fish <afish@...>
 

On Feb 5, 2021, at 12:25 AM, Ni, Ray <ray.ni@...> wrote:

So if I understand correctly
1) volatile basically tells C to always read the memory. So it impacts the C memory model.
2) CompilerFence() tells the C serialize and complete everything before the barrier, and read memory again the 1st time after the barrier.
3) MemoryFence() is really dealing with speculative fetches and maybe coherency. Basically the CPU reordering things.
I agree with this summary.
Volatile using in MpInitLib or PiSmmCpu driver cannot be emitted because different CPU threads change the same memory content to synchronize with each other.
I don’t quite understand if removing “volatile” what mechanism can be used to force the compiler generating code that always reads from memory?
Ray,

I agree the C definition of volatile is not thread safe.

I was doing some research on volatile and I realize the Linux kernel guides recommend against using volatile [1]. I think it has to do with people confusing volatile for being thread safe. So my guess is it is more of a coding standard to encourage people to use synchronization primitives. The primitives imply compiler fences which indirectly enforces the same behavior as volatile.

[1] https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html

Thanks,

Andrew Fish


Thanks,
Ray


Re: MemoryFence()

Paolo Bonzini <pbonzini@...>
 

On 05/02/21 18:31, Laszlo Ersek wrote:
On 02/05/21 18:21, Paolo Bonzini wrote:
On 05/02/21 17:56, Laszlo Ersek wrote:
    221  SPIN_LOCK *
    222  EFIAPI
    223  ReleaseSpinLock (
    224    IN OUT  SPIN_LOCK                 *SpinLock
    225    )
    226  {
    227    SPIN_LOCK    LockValue;
    228
    229    ASSERT (SpinLock != NULL);
    230
    231    LockValue = *SpinLock;
    232    ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue ==
SPIN_LOCK_RELEASED);
    233
    234    _ReadWriteBarrier ();
    235    *SpinLock = SPIN_LOCK_RELEASED;
    236    _ReadWriteBarrier ();
    237
    238    return SpinLock;
    239  }

Fishy. I would have implemented it with another
InterlockedCompareExchangePointer(), and maybe ASSERT()ed on the
original value returned by the InterlockedCompareExchangePointer().
That would be overkill.  However, it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store.  Instead it should be just this:

    ReleaseMemoryFence ();
    *SpinLock = SPIN_LOCK_RELEASED;
No concern that the store might not be atomic?
Not as long as it's a pointer or smaller.

Paolo


Re: MemoryFence()

Andrew Fish <afish@...>
 

On Feb 4, 2021, at 11:48 PM, Laszlo Ersek <lersek@...> wrote:

On 02/04/21 22:31, Andrew Fish wrote:


On Feb 4, 2021, at 10:09 AM, Laszlo Ersek <lersek@...> wrote:

Hi All,

here's the declaration of MemoryFence(), from <BaseLib.h>:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
);
The IA32 and X64 implementations of MemoryFence() are not memory
barriers, only compiler barriers, however.

This is a problem for two reasons:

- code that believes to add a memory barrier only adds a compiler
barrier, possibly creating an unsafe situation,

- code that knowingly calls MemoryFence() for the purposes of a compiler
barrier -- while it will hopefully work safely, relying on tacit
knowledge / assumptions about the x86 architecture -- results in a
source reading experience that mismatches known lockless patterns.


This topic has repeatedly come up in the past; most recently in the
review of Ankur's v6 series for
<https://bugzilla.tianocore.org/show_bug.cgi?id=3132>, where some
inter-processor communication needs to happen.

I asked Paolo for help with understanding some of the synchronziation
patterns in PiSmmCpuDxeSmm, the reason being that Ankur's series is
going to rely on, and/or imitate, some of those patterns. From this
discussion with Paolo, the following have emerged (I'm going to
liberally steal / paraphrase Paolo's words):


(1) We should introduce finer-grained fence primitives:

ARM AARCH64 i386

CompilerFence() asm("") asm("") asm("")
AcquireMemoryFence() dmb ish dmb ishld asm("")
ReleaseMemoryFence() dmb ish dmb ish asm("")
MemoryFence() dmb ish dmb ish mfence

"where AcquireMemoryFence() is used on the read side (i.e. between
reads) and ReleaseMemoryFence() is used on the write side (i.e. between
writes)".

(Note that this step would make MemoryFence() heavier-weight on x86 than
it currently is.)


(2) We should audit and rework existent uses (combinations) of
MemoryFence() into matched acquire / release pairs.

Less importantly, this would restore the x86 behavior (performance) to
the one seen with the current MemoryFence() implementation.

More importantly, it would fix what the code *means*: "it's very
important to stick to known lockless patterns, and having matching
acquire/release fences makes it much more intuitive to understand what's
going on".

--*--

I'm not proposing this as a pre-requisite to merging Ankur's series
(i.e., I don't expect the series to use acquire/release -- it can be
converted later with the rest of the audit).

However, I'd really like us to stop wasting more time on MemoryFence()
doubts such as:

- Hey it's not a memory fence at all, is it safe?

- But on x86 it kinda is, *dependent on* the spinlocks / lock prefixes
we use for synchronization, between the other accesses.

- Is it interchangeable with volatile? Yes? No? which one should we use?

In the longer term, multi-processing code like MpInitLib (CpuDxe,
CpuMpPei) and PiSmmCpuDxeSmm should likely abandon all use of
"volatile", and use the explicit barriers instead, for *clarity*.


Step (2) would take quite a bit of thinking (if not much code).

Comments?
Laszlo,

Thanks for looking into this. I noticed most of the MemoryFence() usage[1] is in the virtual platforms so changes seem manageable. I guess 3rd party code that defines its own low level primitive could be impacted, but they chose to reinvent the wheel.

It looks like the generic platform performance risk is BaseIoLibIntrinsic. I know from debugging crashes a long time ago the the MemoryFence() for x86 is really a CompilerFence() in the new scheme. If the new AcquireMemoryFence()/ReleaseMemoryFence()/CompilerFence() replace MemoryFence() this likely just works. If not then we might need an architecture specific pattern to maintain current performance, we could always make a function internal to the lib that does the right thing for the given architecture. Also making it more precise might actually help ARM/AARCH64 performance?

So if I understand correctly
1) volatile basically tells C to always read the memory. So it impacts the C memory model.
2) CompilerFence() tells the C serialize and complete everything before the barrier, and read memory again the 1st time after the barrier.
3) MemoryFence() is really dealing with speculative fetches and maybe coherency. Basically the CPU reordering things.

So for example in BaseIoLibIntrinsic() the volatile is used to force the memory read and the CompilerFence() is used to enforce the order of that read relative to other operations. Thus wrapping the volatile access in CompilerFence() is in some ways redundant as long as there is only a single access inside the CompilerFence() wrapper.
Right, I agree with everything you said thus far. (I don't claim to be
an expert on this, of course.) CompilerFence() is a "better volatile"
(for RAM only, not for MMIO), where we can allow the compiler to
optimize, up to a certain location.

I like the idea of properly annotating the code. Given x86 defaults to so much cache coherency for compatibility with older simpler times it can tend to hide issues. I’m not sure removing the volatile keyword is required, and it does imply this variable has special rules. I guess we could add a REQUIRE_FENCE or some such? But it it feels like defining the variables in some way to imply the require special handling has some value?
Objects that are shared between processors, or between processor and
some device, should always be highlighted in some way (naming, comments,
...).


The only reason you would need a volatile is if you had more than a single read inside of the fences. Something like:

UINT8 *Address;
UINT8. Value

MemoryFence ();
do {
Value = *Address
} while (Value == 0xFF) ;
MemoryFence ();

The compiler could optimize the above code (current MemoryFence() model) to a single read and infinite loop if 0xFF. Yes I realize you could move the MemoryFence () into the loop, but what if you did not care about that? Mostly just trying to think this through we some examples….
I think it depends on what Address refers to. If it refers to an MMIO
register, then volatile is needed, IMO (and the MemoryFence()s are
unnecessary). If Address points to a byte in RAM that's manipulated by
multiple CPUs, then I think:

- volatile is not needed
OK I see Linux recommends not using volatile [1], so I think this makes sense to me. It is not that volatile is not useful, but it is not thread safe, so not using volatile encourages people to use primitives to make things thread safe. So not using volatile is more of a coding convention.

- the two MemoryFence() calls should be removed
They may need to turn into CompilerFence().

- an AcquireMemoryFence() call should be inserted into the loop, after
the read access. Because, presumably, once the flag assumes the
appropriate value (it has been "released")), we'll proceed to reading
some memory resource that was protected by the flag until then.
CompilerFence ();
do {
Value = *Address;
AcquireMemoryFence();
} while (Value == 0xFF) ;



Regarding the many MemoryFence() calls in the virt platforms, most of
them do not exist for inter-CPU communication, but for communication
with devices through RAM. I admit I'm unsure what we should do with them.
I think this is where x86 makes things confusing with its coherency all the time motto. On x86 I think they are really CompilerFences() to force C memory model order of operations. They got added to fix real bugs on x86. Remember the compiler can still reorder a volatile access withs surrounding code as an optimization.

So for example if MmioWrite8() has MemoryFence() replaced with CompilerFence() these operations happen in order. If there is no CompilerFence() the order is implementation dependent.

UINT8 *A, *B, *C;

*A = 1;
MmioWrite8 (B, 2);
*C = 3;

So that is why we want the CompilerFence() + volatile for the MMIO. For example you could end up writing to a Common DMA buffer and then doing an MMIO write. You might want those to happen in sequence and you need the CompilerFence()

At least the CPU code is CPU specific so the rules should be more well defined :).

[1] https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html

Thanks,

Andrew Fish


Thanks
Laszlo


[1] /Volumes/Case/edk2-github(master)>git grep MemoryFence | grep ';'
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:320: MemoryFence ();
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:334: MemoryFence ();
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:344: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:1493: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:1497: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4702: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4707: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4712: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4765: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4770: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4775: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4828: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4833: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4838: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4892: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4894: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:86: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:88: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:115: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:117: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:147: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:149: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:179: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:181: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:211: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:213: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:243: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:245: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:275: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:277: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:305: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:307: MemoryFence ();
MdePkg/Library/BaseLib/AArch64/MemoryFence.asm:3:; MemoryFence() for AArch64
MdePkg/Library/BaseLib/AArch64/MemoryFence.asm:23:;MemoryFence (
MdePkg/Library/BaseLib/Arm/MemoryFence.asm:3:; MemoryFence() for AArch64
MdePkg/Library/BaseLib/Arm/MemoryFence.asm:24:;MemoryFence (
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:412: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:425: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:438: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:173: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:186: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:199: MemoryFence ();
OvmfPkg/Library/VirtioLib/VirtioLib.c:305: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:312: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:327: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:334: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:337: MemoryFence();
OvmfPkg/Library/VmgExitLib/VmgExitLib.c:122: MemoryFence ();
OvmfPkg/Library/VmgExitLib/VmgExitLib.c:124: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:117: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:122: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:171: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:177: MemoryFence ();
OvmfPkg/PvScsiDxe/PvScsi.c:595: MemoryFence ();
OvmfPkg/PvScsiDxe/PvScsi.c:625: MemoryFence ();
OvmfPkg/VirtioNetDxe/Events.c:56: MemoryFence ();
OvmfPkg/VirtioNetDxe/Events.c:58: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpGetStatus.c:99: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpGetStatus.c:102: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:257: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:384: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:426: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:437: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:101: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:103: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:173: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:176: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpTransmit.c:169: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpTransmit.c:172: MemoryFence ();
OvmfPkg/XenBusDxe/GrantTable.c:86: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:496: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:503: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:569: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:580: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:587: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:473: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:528: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:578: MemoryFence ();
UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c:204: MemoryFence ();


Thanks,

Andrew Fish

Would there be interest in reviewing such work?

Thanks
Laszlo


Re: MemoryFence()

Laszlo Ersek
 

On 02/05/21 18:21, Paolo Bonzini wrote:
On 05/02/21 17:56, Laszlo Ersek wrote:
    221  SPIN_LOCK *
    222  EFIAPI
    223  ReleaseSpinLock (
    224    IN OUT  SPIN_LOCK                 *SpinLock
    225    )
    226  {
    227    SPIN_LOCK    LockValue;
    228
    229    ASSERT (SpinLock != NULL);
    230
    231    LockValue = *SpinLock;
    232    ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue ==
SPIN_LOCK_RELEASED);
    233
    234    _ReadWriteBarrier ();
    235    *SpinLock = SPIN_LOCK_RELEASED;
    236    _ReadWriteBarrier ();
    237
    238    return SpinLock;
    239  }

Fishy. I would have implemented it with another
InterlockedCompareExchangePointer(), and maybe ASSERT()ed on the
original value returned by the InterlockedCompareExchangePointer().
That would be overkill.  However, it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store.  Instead it should be just this:

    ReleaseMemoryFence ();
    *SpinLock = SPIN_LOCK_RELEASED;
No concern that the store might not be atomic?

Thanks
Laszlo


Re: MemoryFence()

Paolo Bonzini <pbonzini@...>
 

On 05/02/21 17:56, Laszlo Ersek wrote:
On 02/05/21 17:04, Ni, Ray wrote:
798 **/
799 VOID
800 LibWaitForSemaphore (
801 IN OUT volatile UINT32 *Sem
802 )
I think LibWaitForSemaphore() is safe. (Which is not identical to saying
"it's in the best style".)
Agreed. I would prefer an explicit compiler fence before the "Value = *Sem" assignment and no volatile anywhere, but it is okay.

193 LockValue = *SpinLock;
194 ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue == SPIN_LOCK_RELEASED);
195
196 _ReadWriteBarrier ();
197 Result = InterlockedCompareExchangePointer (
198 (VOID**)SpinLock,
199 (VOID*)SPIN_LOCK_RELEASED,
200 (VOID*)SPIN_LOCK_ACQUIRED
201 );
202
203 _ReadWriteBarrier ();
204 return (BOOLEAN) (Result == (VOID*) SPIN_LOCK_RELEASED);
205 }
Having the LockValue / ASSERT() stuff seems a bit pointless to me.
The _ReadWriteBarrier (aka compiler fence) is pointless too; it is implicit in the InterlockedCompareExchangePointer.

781 VOID
782 LibReleaseSemaphore (
783 IN OUT volatile UINT32 *Sem
784 )
785 {
786 InterlockedIncrement (Sem);
787 }
This seems OK to me.
Yep.

221 SPIN_LOCK *
222 EFIAPI
223 ReleaseSpinLock (
224 IN OUT SPIN_LOCK *SpinLock
225 )
226 {
227 SPIN_LOCK LockValue;
228
229 ASSERT (SpinLock != NULL);
230
231 LockValue = *SpinLock;
232 ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue == SPIN_LOCK_RELEASED);
233
234 _ReadWriteBarrier ();
235 *SpinLock = SPIN_LOCK_RELEASED;
236 _ReadWriteBarrier ();
237
238 return SpinLock;
239 }
Fishy. I would have implemented it with another
InterlockedCompareExchangePointer(), and maybe ASSERT()ed on the
original value returned by the InterlockedCompareExchangePointer().
That would be overkill. However, it *is* buggy because it is missing a (processor) barrier on non-x86 architectures and has a useless barrier after the store. Instead it should be just this:

ReleaseMemoryFence ();
*SpinLock = SPIN_LOCK_RELEASED;

Paolo


Re: MemoryFence()

Laszlo Ersek
 

On 02/05/21 17:04, Ni, Ray wrote:
Maybe that explains why the MemoryFence was implemented so
confidently.
I trust Mike:)

In UefiCpuPkg/Library/RegisterCpuFeaturesLib/CpuFeaturesInitialize.c,
the LibWaitForSemaphore, LibReleaseSemaphore are coded without the
read write barrier.
Let's focus on the "down" operations first.

At commit c6be6dab9c4b:

789 /**
790 Decrement the semaphore by 1 if it is not zero.
791
792 Performs an atomic decrement operation for semaphore.
793 The compare exchange operation must be performed using
794 MP safe mechanisms.
795
796 @param Sem IN: 32-bit unsigned integer
797
798 **/
799 VOID
800 LibWaitForSemaphore (
801 IN OUT volatile UINT32 *Sem
802 )
803 {
804 UINT32 Value;
805
806 do {
807 Value = *Sem;
808 } while (Value == 0 ||
809 InterlockedCompareExchange32 (
810 Sem,
811 Value,
812 Value - 1
813 ) != Value);
814 }

But the AcquireSpinLockOrFail and ReleaseSpinLock in
MdePkg/Library/BaseSynchronizationLib/SynchronizationMsc.c contain
read write barrier.
Is the former implementation wrong because without the barrier the
cmpxchg can happen earlier or later chosen by the compiler?

Does volatile keyword help here?
I think LibWaitForSemaphore() is safe. (Which is not identical to saying
"it's in the best style".)

The volatile ensures there is a load instruction, and that the compiler
generates the load instruction in the right place.

The volatile does not ensure that the data fetched is neither corrupt
nor stale -- the load could fetch a partially updated value or a stale
value, into "Value". But in this particular instance that should be no
problem:

- if we see a corrupt or stale zero value, we'll just continue spinning;
the value will stabilize eventually

- if we see a corrupt or stale *nonzero* value, then when we attempt to
decrement it with a safe InterlockedCompareExchange32(), we'll just
fail.

So the volatile fetch can be considered an "initial guess" that works
most of the time.

(BTW, PiSmmCpuDxeSmm has a more recent variant that is less
power-hungry; see commit 9001b750df64 ("UefiCpuPkg/PiSmmCpuDxeSmm: pause
in WaitForSemaphore() before re-fetch", 2020-07-31).)

The implementation seen in
"MdePkg/Library/BaseSynchronizationLib/SynchronizationMsc.c" is:

165 /**
166 Attempts to place a spin lock in the acquired state.
167
168 This function checks the state of the spin lock specified by SpinLock. If
169 SpinLock is in the released state, then this function places SpinLock in the
170 acquired state and returns TRUE. Otherwise, FALSE is returned. All state
171 transitions of SpinLock must be performed using MP safe mechanisms.
172
173 If SpinLock is NULL, then ASSERT().
174 If SpinLock was not initialized with InitializeSpinLock(), then ASSERT().
175
176 @param SpinLock A pointer to the spin lock to place in the acquired state.
177
178 @retval TRUE SpinLock was placed in the acquired state.
179 @retval FALSE SpinLock could not be acquired.
180
181 **/
182 BOOLEAN
183 EFIAPI
184 AcquireSpinLockOrFail (
185 IN OUT SPIN_LOCK *SpinLock
186 )
187 {
188 SPIN_LOCK LockValue;
189 VOID *Result;
190
191 ASSERT (SpinLock != NULL);
192
193 LockValue = *SpinLock;
194 ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue == SPIN_LOCK_RELEASED);
195
196 _ReadWriteBarrier ();
197 Result = InterlockedCompareExchangePointer (
198 (VOID**)SpinLock,
199 (VOID*)SPIN_LOCK_RELEASED,
200 (VOID*)SPIN_LOCK_ACQUIRED
201 );
202
203 _ReadWriteBarrier ();
204 return (BOOLEAN) (Result == (VOID*) SPIN_LOCK_RELEASED);
205 }

I think the explicit barriers are better style (more helpful for the
programmer, and in general, possibly better performance).

On the other hand (independently of what you asked), the
AcquireSpinLockOrFail() logic seems a bit fishy to me.

First, the ASSERT() deserves a comment, at the bare minimum. ("Even if
the data is stale or partially updated, we can't conjure up a completely
bogus value", or some such.)

Second, the rest of the function doesn't even depend on LockValue.
Having the LockValue / ASSERT() stuff seems a bit pointless to me.

--*--

Now, regarding the "up" operations:

[UefiCpuPkg/Library/RegisterCpuFeaturesLib/CpuFeaturesInitialize.c]

775 /**
776 Increment semaphore by 1.
777
778 @param Sem IN: 32-bit unsigned integer
779
780 **/
781 VOID
782 LibReleaseSemaphore (
783 IN OUT volatile UINT32 *Sem
784 )
785 {
786 InterlockedIncrement (Sem);
787 }

This seems OK to me.

(In an ideal world, InterlockedIncrement() should internally enforce a
compiler barrier too, so the volatile should not be needed.)

[MdePkg/Library/BaseSynchronizationLib/SynchronizationMsc.c]

207 /**
208 Releases a spin lock.
209
210 This function places the spin lock specified by SpinLock in the release state
211 and returns SpinLock.
212
213 If SpinLock is NULL, then ASSERT().
214 If SpinLock was not initialized with InitializeSpinLock(), then ASSERT().
215
216 @param SpinLock A pointer to the spin lock to release.
217
218 @return SpinLock released the lock.
219
220 **/
221 SPIN_LOCK *
222 EFIAPI
223 ReleaseSpinLock (
224 IN OUT SPIN_LOCK *SpinLock
225 )
226 {
227 SPIN_LOCK LockValue;
228
229 ASSERT (SpinLock != NULL);
230
231 LockValue = *SpinLock;
232 ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue == SPIN_LOCK_RELEASED);
233
234 _ReadWriteBarrier ();
235 *SpinLock = SPIN_LOCK_RELEASED;
236 _ReadWriteBarrier ();
237
238 return SpinLock;
239 }

Fishy. I would have implemented it with another
InterlockedCompareExchangePointer(), and maybe ASSERT()ed on the
original value returned by the InterlockedCompareExchangePointer().

Laszlo


Re: MemoryFence()

Ni, Ray
 

Maybe that explains why the MemoryFence was implemented so confidently.
I trust Mike:)

In UefiCpuPkg/Library/RegisterCpuFeaturesLib/CpuFeaturesInitialize.c, the LibWaitForSemaphore, LibReleaseSemaphore are coded without the read write barrier.
But the AcquireSpinLockOrFail and ReleaseSpinLock in MdePkg/Library/BaseSynchronizationLib/SynchronizationMsc.c contain read write barrier.
Is the former implementation wrong because without the barrier the cmpxchg can happen earlier or later chosen by the compiler?

Does volatile keyword help here?

thanks,
ray
________________________________
发件人: Paolo Bonzini <pbonzini@...>
发送时间: Friday, February 5, 2021 11:42:36 PM
收件人: Ard Biesheuvel <ardb@...>
抄送: Laszlo Ersek <lersek@...>; Ni, Ray <ray.ni@...>; Andrew Fish <afish@...>; edk2 RFC list <rfc@edk2.groups.io>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address) <leif@...>; Dong, Eric <eric.dong@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...>
主题: Re: MemoryFence()

On 05/02/21 16:39, Ard Biesheuvel wrote:
On Fri, 5 Feb 2021 at 16:38, Paolo Bonzini <pbonzini@...> wrote:

On 05/02/21 16:34, Laszlo Ersek wrote:
And your code is wrong because the*original*, namely the MSFT
implementation of MemoryFence() in edk2, is bugged:

* MdePkg/Library/BaseLib/X86MemoryFence.c:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
)
{
return;
}
I think it's okay-ish (it works as long as you don't do link-time
optimization, but it's less efficient than a compiler intrinsic) because
X86MemoryFence.c is a separate file.
We rely heavily on LTO for both Visual Studio and GNU/Clang toolchains.
Oops... Well, it *was* okay at the time MemoryFence() was first
implemented. :(

Paolo


Re: MemoryFence()

Paolo Bonzini <pbonzini@...>
 

On 05/02/21 16:39, Ard Biesheuvel wrote:
On Fri, 5 Feb 2021 at 16:38, Paolo Bonzini <pbonzini@...> wrote:

On 05/02/21 16:34, Laszlo Ersek wrote:
And your code is wrong because the*original*, namely the MSFT
implementation of MemoryFence() in edk2, is bugged:

* MdePkg/Library/BaseLib/X86MemoryFence.c:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
)
{
return;
}
I think it's okay-ish (it works as long as you don't do link-time
optimization, but it's less efficient than a compiler intrinsic) because
X86MemoryFence.c is a separate file.
We rely heavily on LTO for both Visual Studio and GNU/Clang toolchains.
Oops... Well, it *was* okay at the time MemoryFence() was first implemented. :(

Paolo


Re: MemoryFence()

Ard Biesheuvel <ardb@...>
 

On Fri, 5 Feb 2021 at 16:38, Paolo Bonzini <pbonzini@...> wrote:

On 05/02/21 16:34, Laszlo Ersek wrote:
And your code is wrong because the*original*, namely the MSFT
implementation of MemoryFence() in edk2, is bugged:

* MdePkg/Library/BaseLib/X86MemoryFence.c:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
)
{
return;
}
I think it's okay-ish (it works as long as you don't do link-time
optimization, but it's less efficient than a compiler intrinsic) because
X86MemoryFence.c is a separate file.
We rely heavily on LTO for both Visual Studio and GNU/Clang toolchains.


Re: MemoryFence()

Paolo Bonzini <pbonzini@...>
 

On 05/02/21 16:34, Laszlo Ersek wrote:
And your code is wrong because the*original*, namely the MSFT
implementation of MemoryFence() in edk2, is bugged:
* MdePkg/Library/BaseLib/X86MemoryFence.c:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
)
{
return;
}
I think it's okay-ish (it works as long as you don't do link-time optimization, but it's less efficient than a compiler intrinsic) because X86MemoryFence.c is a separate file.

Paolo


Re: MemoryFence()

Laszlo Ersek
 

On 02/05/21 14:15, Ni, Ray wrote:
This is very interesting. I am trying to involve the discussion to
learn.
I did some experiment.
----a.c----
// Copy EDKII MemoryFence() and as Laszlo said it's just a compiler fence.
void CompilerFence () {
return;
}

void main () {
// I copied Paolo's code in below.
int *Address = &main;
int Value;

do {
// Force re-reading *Address on every iteration
CompilerFence ();
Value = *Address;
} while (Value == 0xFF);
// As you explained above.
__asm mfence;
}
----END of a.c----

When I use "Cl.exe /O2 /FAcs a.c" to compile it.
I got a.cod as below
-----a.cod----
; Listing generated by Microsoft (R) Optimizing Compiler Version 19.27.29112.0

TITLE E:\Work\edk2\a.c
.686P
.XMM
include listing.inc
.model flat

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

PUBLIC _CompilerFence
PUBLIC _main
; Function compile flags: /Ogtpy
; File E:\Work\edk2\a.c
; COMDAT _main
_TEXT SEGMENT
_main PROC ; COMDAT

; 6 : int *Address = &main;

00000 b8 00 00 00 00 mov eax, OFFSET _main
00005 8b 00 mov eax, DWORD PTR [eax]
$LL4@main:

; 7 : int Value;
; 8 :
; 9 : do {
; 10 : // Force re-reading *Address on every iteration
; 11 : CompilerFence ();
; 12 : Value = *Address;
; 13 : } while (Value == 0xFF);

00007 3d ff 00 00 00 cmp eax, 255 ; 000000ffH
0000c 74 f9 je SHORT $LL4@main

; 14 : // As you explained above.
; 15 : __asm mfence;

0000e 0f ae f0 mfence
00011 c3 ret 0
_main ENDP
_TEXT ENDS
; Function compile flags: /Ogtpy
; File E:\Work\edk2\a.c
; COMDAT _CompilerFence
_TEXT SEGMENT
_CompilerFence PROC ; COMDAT

; 2 : return;
; 3 : }

00000 c2 00 00 ret 0
_CompilerFence ENDP
_TEXT ENDS
END
-----END of a.cod-----

Check the assembly:
00007 3d ff 00 00 00 cmp eax, 255 ; 000000ffH

The CompilerFence() call doesn't force to generate code to always read
the memory pointed by Address.
That's because your CompilerFence() function is not correct. You did:

void CompilerFence () {
return;
}
which is wrong. It enforces a C-language sequence point, but it does not
implement a compiler barrier.

And your code is wrong because the *original*, namely the MSFT
implementation of MemoryFence() in edk2, is bugged:

* MdePkg/Library/BaseLib/X86MemoryFence.c:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
)
{
return;
}
This code comes from historical commit e1f414b6a7d8 ("Import some basic
libraries instances for Mde Packages.", 2007-06-22).

* MdePkg/Library/BaseLib/BaseLib.inf:

[Sources.Ia32]
X86MemoryFence.c | MSFT

[Sources.X64]
X86MemoryFence.c | MSFT
Back to your email:

If I replace CompilerFence() with _ReadBarrier(), compiler can
generate assembly that always reads the memory pointed by Address.

It seems the EDKII MemoryFence() even cannot be used as a read
barrier.
Does it mean that the MSVC version of EDKII MemoryFence() is
implemented wrongly?
That's precisely the case.

This actually explains why MpInitLib and PiSmmCpuDxeSmm use "volatile",
rather than MemoryFence(). (This is one of those info bits that I've
been missing.) The reason is that, when built with Visual Studio,
MemoryFence() does absolutely nothing. So if MpInitLib and
PiSmmCpuDxeSmm used MemoryFence() rather than "volatile", things would
break immediately.

Thank you Ray for highlighting this, I couldn't have imagined.

The current status means that, considering all toolchains and all
architectures that edk2 supports, together, it's been *impossible* to
use MemoryFence() correctly.

And so my defensive thinking "whenever in doubt, use 'volatile' PLUS
MemoryFence()" apparently paid off, because when built with Visual
Studio, OVMF is left with 'volatile' *only*; in effect.

It also means that in Ankur's upcoming v7 series, relying *just* on
MemoryFence() will not suffice. We need 'volatile' (given the current
state of primitives); otherwise we'll have no barriers at all.

... I don't even know where to start changing the code. :(

Laszlo


Re: MemoryFence()

Laszlo Ersek
 

On 02/05/21 09:25, Ni, Ray wrote:
So if I understand correctly
1) volatile basically tells C to always read the memory. So it impacts the C memory model.
2) CompilerFence() tells the C serialize and complete everything before the barrier, and read memory again the 1st time after the barrier.
3) MemoryFence() is really dealing with speculative fetches and maybe coherency. Basically the CPU reordering things.
I agree with this summary.
Volatile using in MpInitLib or PiSmmCpu driver cannot be emitted because different CPU threads change the same memory content to synchronize with each other.
I don’t quite understand if removing “volatile” what mechanism can be used to force the compiler generating code that always reads from memory?
A compiler barrier (in case that's sufficient), or even a memory fence
-- seems more appropriate for inter-CPU synchronization -- which I think
should be implemented to automatically force a compiler barrier too.

Laszlo


Re: MemoryFence()

Laszlo Ersek
 

On 02/05/21 10:12, Paolo Bonzini wrote:
On 05/02/21 08:48, Laszlo Ersek wrote:
   MemoryFence ();
   do {
     Value = *Address
   } while (Value == 0xFF) ;
   MemoryFence ();
If Address points to a byte in RAM that's manipulated by
multiple CPUs, then I think:

- volatile is not needed

- the two MemoryFence() calls should be removed

- an AcquireMemoryFence() call should be inserted into the loop, after
the read access. Because, presumably, once the flag assumes the
appropriate value (it has been "released")), we'll proceed to reading
some memory resource that was protected by the flag until then.
This is correct, alternatively you could have this:

    do {
      // Force re-reading *Address on every iteration
      CompilerFence ();
      Value = *Address;
    } while (Value == 0xFF);
    // As you explained above.
    AcquireMemoryFence ();
Ah, thanks; also good example where a CompilerFence() would be justified.

Laszlo


Re: MemoryFence()

Paolo Bonzini <pbonzini@...>
 

On 05/02/21 14:15, Ni, Ray wrote:
I did some experiment.
----a.c----
// Copy EDKII MemoryFence() and as Laszlo said it's just a compiler fence.
void CompilerFence () {
return;
This should be

#define CompilerFence _ReadWriteBarrier

Paolo

}
void main () {
// I copied Paolo's code in below.
int *Address = &main;
int Value;
do {
// Force re-reading *Address on every iteration
CompilerFence ();
Value = *Address;
} while (Value == 0xFF);
// As you explained above.
__asm mfence;
}
----END of a.c----


Re: MemoryFence()

Ni, Ray
 

This is very interesting. I am trying to involve the discussion to learn.
I did some experiment.
----a.c----
// Copy EDKII MemoryFence() and as Laszlo said it's just a compiler fence.
void CompilerFence () {
return;
}

void main () {
// I copied Paolo's code in below.
int *Address = &main;
int Value;

do {
// Force re-reading *Address on every iteration
CompilerFence ();
Value = *Address;
} while (Value == 0xFF);
// As you explained above.
__asm mfence;
}
----END of a.c----

When I use "Cl.exe /O2 /FAcs a.c" to compile it.
I got a.cod as below
-----a.cod----
; Listing generated by Microsoft (R) Optimizing Compiler Version 19.27.29112.0

TITLE E:\Work\edk2\a.c
.686P
.XMM
include listing.inc
.model flat

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

PUBLIC _CompilerFence
PUBLIC _main
; Function compile flags: /Ogtpy
; File E:\Work\edk2\a.c
; COMDAT _main
_TEXT SEGMENT
_main PROC ; COMDAT

; 6 : int *Address = &main;

00000 b8 00 00 00 00 mov eax, OFFSET _main
00005 8b 00 mov eax, DWORD PTR [eax]
$LL4@main:

; 7 : int Value;
; 8 :
; 9 : do {
; 10 : // Force re-reading *Address on every iteration
; 11 : CompilerFence ();
; 12 : Value = *Address;
; 13 : } while (Value == 0xFF);

00007 3d ff 00 00 00 cmp eax, 255 ; 000000ffH
0000c 74 f9 je SHORT $LL4@main

; 14 : // As you explained above.
; 15 : __asm mfence;

0000e 0f ae f0 mfence
00011 c3 ret 0
_main ENDP
_TEXT ENDS
; Function compile flags: /Ogtpy
; File E:\Work\edk2\a.c
; COMDAT _CompilerFence
_TEXT SEGMENT
_CompilerFence PROC ; COMDAT

; 2 : return;
; 3 : }

00000 c2 00 00 ret 0
_CompilerFence ENDP
_TEXT ENDS
END
-----END of a.cod-----

Check the assembly:
00007 3d ff 00 00 00 cmp eax, 255 ; 000000ffH

The CompilerFence() call doesn't force to generate code to always read the memory pointed by Address.

If I replace CompilerFence() with _ReadBarrier(), compiler can generate assembly that always
reads the memory pointed by Address.

It seems the EDKII MemoryFence() even cannot be used as a read barrier.
Does it mean that the MSVC version of EDKII MemoryFence() is implemented wrongly?

Thanks,
Ray

-----Original Message-----
From: Paolo Bonzini <pbonzini@...>
Sent: Friday, February 5, 2021 5:13 PM
To: Laszlo Ersek <lersek@...>; Andrew Fish <afish@...>
Cc: edk2 RFC list <rfc@edk2.groups.io>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address)
<leif@...>; Ard Biesheuvel (kernel.org address) <ardb@...>; Dong, Eric <eric.dong@...>; Ni, Ray
<ray.ni@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...>
Subject: Re: MemoryFence()

On 05/02/21 08:48, Laszlo Ersek wrote:
MemoryFence ();
do {
Value = *Address
} while (Value == 0xFF) ;
MemoryFence ();
If Address points to a byte in RAM that's manipulated by
multiple CPUs, then I think:

- volatile is not needed

- the two MemoryFence() calls should be removed

- an AcquireMemoryFence() call should be inserted into the loop, after
the read access. Because, presumably, once the flag assumes the
appropriate value (it has been "released")), we'll proceed to reading
some memory resource that was protected by the flag until then.
This is correct, alternatively you could have this:

do {
// Force re-reading *Address on every iteration
CompilerFence ();
Value = *Address;
} while (Value == 0xFF);
// As you explained above.
AcquireMemoryFence ();

Paolo


Re: MemoryFence()

Paolo Bonzini <pbonzini@...>
 

On 05/02/21 08:48, Laszlo Ersek wrote:
MemoryFence ();
do {
Value = *Address
} while (Value == 0xFF) ;
MemoryFence ();
If Address points to a byte in RAM that's manipulated by
multiple CPUs, then I think:
- volatile is not needed
- the two MemoryFence() calls should be removed
- an AcquireMemoryFence() call should be inserted into the loop, after
the read access. Because, presumably, once the flag assumes the
appropriate value (it has been "released")), we'll proceed to reading
some memory resource that was protected by the flag until then.
This is correct, alternatively you could have this:

do {
// Force re-reading *Address on every iteration
CompilerFence ();
Value = *Address;
} while (Value == 0xFF);
// As you explained above.
AcquireMemoryFence ();

Paolo


Re: MemoryFence()

Laszlo Ersek
 

On 02/04/21 21:04, Paolo Bonzini wrote:
Il gio 4 feb 2021, 20:46 Ard Biesheuvel <ardb@...> ha scritto:
It is very important to be *aware* of the acquire/release semantics,
but I don't think it is necessary to use all the fine grained barrier
types in EDK2.
I agree as long as the primitives are self-documenting. A single
MemoryFence() does not make it clear in which direction the data is flowing
(whether from other threads to this one, or vice versa).
I can say about the MemoryFence() calls I've added, and about the ones
Ankur is about to add, that we're aware of the data flow directions. We
could certainly annotate every such location with either "we're
publishing information here" or "we're consuming information here".

So for example:

(a) publish:

- prepare and expose the data
- "release fence"
- release the spinlock with InterlockedCompareExchange, or push the MMIO
doorbell register

(b1) consume:

- spin on a spinlock with InterlockedCompareExchange,
- "acquire fence"
- gather / consume the data

(b2) consume:

- spin on an MMIO latch, using "volatile"
- "acquire fence"
- gather / consume the data

(b3) consume:

- spin on a normal memory location with an "acquire fence" at the end of
the loop body -- still inside the loop body,
- gather / consume the data, after the loop


I believe these general patterns / semantics are "more or less" clear to
us all; I personally have not had to do anything fancier than the above,
programming fw_cfg, virtio, or even some inter-processor communication.

*However*, how those natural-language "fence" statements map to compiler
fences, CPU/memory fences, especially with regard to architecture
specifics -- I honestly admit that's where my eyes glaze over.

The MemoryFence() API has been very convenient for *both* of those
directions, but:

- it does not inherently express the intended data flow direction, as
Paolo notes

- it's unclear whether "volatile" (or a compiler barrier) is needed *in
addition* to MemoryFence() or not -- after all, if the compiler didn't
produce the necessary store/load instructions yet, "fencing" the effects
of those instructions makes no sense (paraphrasing Ard),

- the x86 implementation of MemoryFence() openly admits to breaking the
interface contract ("this is a compiler fence only"), without explaining
why -- i.e., for what architectural reasons -- it might *still* suffice
for upholding the contract.

So in the end one just goes for: "whenever in doubt, use volatile *PLUS*
MemoryFence()".


Here's what I'd prefer:

- someone more versed in the *implementation* of these primitivies than
myself should please add the "directional" barriers,

- if possible, I would prefer if a naked compiler barrier API didn't
exist at all (because (a) the above-noted directional barriers should
*guarantee* in the contracts to imply the necessary compiler barrier(s),
and (b) I wouldn't know where I'd be content with *just* a compiler barrier)


Then I could go over the MemoryFence() calls I've added, and replace
each with the proper directional variant, while simultaneously killing
*all* related "volatile" qualifiers. (I'd only keep volatile where it
referred to a known MMIO register, for example for pressing or reading
an MMIO doorbell / latch.)

Then I think someone from Intel should do the same for PiSmmCpuDxeSmm
and MpInitLib -- because every time I need to look at the swaths of
"volatile" in those drivers, I must start reasoning from square#1. And
for the life of me, I cannot squeeze a bunch of the patterns I see there
into any one of the patterns (a), (b1), (b2), and (b3) that I outlined
above. In every such case, I'm left wondering, "does it work by chance
only, or by silently exploiting architecture guarantees that I don't
know about".

In summary, I'm mostly sure what I'd do about the volatile's and
MemoryFence()s that *I* have written, provided the directional barrier
variants existed, but I don't have the first clue about cleaning up
PiSmmCpuDxeSmm and MpInitLib. And ultimately, it would be awesome if,
when discussing a patch series like Ankur's v6, we didn't have to go
look for examples and tacit intentions in PiSmmCpuDxeSmm. :/

Thanks,
Laszlo


Re: MemoryFence()

Ni, Ray
 

So if I understand correctly
1) volatile basically tells C to always read the memory. So it impacts the C memory model.
2) CompilerFence() tells the C serialize and complete everything before the barrier, and read memory again the 1st time after the barrier.
3) MemoryFence() is really dealing with speculative fetches and maybe coherency. Basically the CPU reordering things.
I agree with this summary.
Volatile using in MpInitLib or PiSmmCpu driver cannot be emitted because different CPU threads change the same memory content to synchronize with each other.
I don’t quite understand if removing “volatile” what mechanism can be used to force the compiler generating code that always reads from memory?


Thanks,
Ray


Re: MemoryFence()

Laszlo Ersek
 

On 02/04/21 22:31, Andrew Fish wrote:


On Feb 4, 2021, at 10:09 AM, Laszlo Ersek <lersek@...> wrote:

Hi All,

here's the declaration of MemoryFence(), from <BaseLib.h>:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
);
The IA32 and X64 implementations of MemoryFence() are not memory
barriers, only compiler barriers, however.

This is a problem for two reasons:

- code that believes to add a memory barrier only adds a compiler
barrier, possibly creating an unsafe situation,

- code that knowingly calls MemoryFence() for the purposes of a compiler
barrier -- while it will hopefully work safely, relying on tacit
knowledge / assumptions about the x86 architecture -- results in a
source reading experience that mismatches known lockless patterns.


This topic has repeatedly come up in the past; most recently in the
review of Ankur's v6 series for
<https://bugzilla.tianocore.org/show_bug.cgi?id=3132>, where some
inter-processor communication needs to happen.

I asked Paolo for help with understanding some of the synchronziation
patterns in PiSmmCpuDxeSmm, the reason being that Ankur's series is
going to rely on, and/or imitate, some of those patterns. From this
discussion with Paolo, the following have emerged (I'm going to
liberally steal / paraphrase Paolo's words):


(1) We should introduce finer-grained fence primitives:

ARM AARCH64 i386

CompilerFence() asm("") asm("") asm("")
AcquireMemoryFence() dmb ish dmb ishld asm("")
ReleaseMemoryFence() dmb ish dmb ish asm("")
MemoryFence() dmb ish dmb ish mfence

"where AcquireMemoryFence() is used on the read side (i.e. between
reads) and ReleaseMemoryFence() is used on the write side (i.e. between
writes)".

(Note that this step would make MemoryFence() heavier-weight on x86 than
it currently is.)


(2) We should audit and rework existent uses (combinations) of
MemoryFence() into matched acquire / release pairs.

Less importantly, this would restore the x86 behavior (performance) to
the one seen with the current MemoryFence() implementation.

More importantly, it would fix what the code *means*: "it's very
important to stick to known lockless patterns, and having matching
acquire/release fences makes it much more intuitive to understand what's
going on".

--*--

I'm not proposing this as a pre-requisite to merging Ankur's series
(i.e., I don't expect the series to use acquire/release -- it can be
converted later with the rest of the audit).

However, I'd really like us to stop wasting more time on MemoryFence()
doubts such as:

- Hey it's not a memory fence at all, is it safe?

- But on x86 it kinda is, *dependent on* the spinlocks / lock prefixes
we use for synchronization, between the other accesses.

- Is it interchangeable with volatile? Yes? No? which one should we use?

In the longer term, multi-processing code like MpInitLib (CpuDxe,
CpuMpPei) and PiSmmCpuDxeSmm should likely abandon all use of
"volatile", and use the explicit barriers instead, for *clarity*.


Step (2) would take quite a bit of thinking (if not much code).

Comments?
Laszlo,

Thanks for looking into this. I noticed most of the MemoryFence() usage[1] is in the virtual platforms so changes seem manageable. I guess 3rd party code that defines its own low level primitive could be impacted, but they chose to reinvent the wheel.

It looks like the generic platform performance risk is BaseIoLibIntrinsic. I know from debugging crashes a long time ago the the MemoryFence() for x86 is really a CompilerFence() in the new scheme. If the new AcquireMemoryFence()/ReleaseMemoryFence()/CompilerFence() replace MemoryFence() this likely just works. If not then we might need an architecture specific pattern to maintain current performance, we could always make a function internal to the lib that does the right thing for the given architecture. Also making it more precise might actually help ARM/AARCH64 performance?

So if I understand correctly
1) volatile basically tells C to always read the memory. So it impacts the C memory model.
2) CompilerFence() tells the C serialize and complete everything before the barrier, and read memory again the 1st time after the barrier.
3) MemoryFence() is really dealing with speculative fetches and maybe coherency. Basically the CPU reordering things.

So for example in BaseIoLibIntrinsic() the volatile is used to force the memory read and the CompilerFence() is used to enforce the order of that read relative to other operations. Thus wrapping the volatile access in CompilerFence() is in some ways redundant as long as there is only a single access inside the CompilerFence() wrapper.
Right, I agree with everything you said thus far. (I don't claim to be
an expert on this, of course.) CompilerFence() is a "better volatile"
(for RAM only, not for MMIO), where we can allow the compiler to
optimize, up to a certain location.

I like the idea of properly annotating the code. Given x86 defaults to so much cache coherency for compatibility with older simpler times it can tend to hide issues. I’m not sure removing the volatile keyword is required, and it does imply this variable has special rules. I guess we could add a REQUIRE_FENCE or some such? But it it feels like defining the variables in some way to imply the require special handling has some value?
Objects that are shared between processors, or between processor and
some device, should always be highlighted in some way (naming, comments,
...).


The only reason you would need a volatile is if you had more than a single read inside of the fences. Something like:

UINT8 *Address;
UINT8. Value

MemoryFence ();
do {
Value = *Address
} while (Value == 0xFF) ;
MemoryFence ();

The compiler could optimize the above code (current MemoryFence() model) to a single read and infinite loop if 0xFF. Yes I realize you could move the MemoryFence () into the loop, but what if you did not care about that? Mostly just trying to think this through we some examples….
I think it depends on what Address refers to. If it refers to an MMIO
register, then volatile is needed, IMO (and the MemoryFence()s are
unnecessary). If Address points to a byte in RAM that's manipulated by
multiple CPUs, then I think:

- volatile is not needed

- the two MemoryFence() calls should be removed

- an AcquireMemoryFence() call should be inserted into the loop, after
the read access. Because, presumably, once the flag assumes the
appropriate value (it has been "released")), we'll proceed to reading
some memory resource that was protected by the flag until then.


Regarding the many MemoryFence() calls in the virt platforms, most of
them do not exist for inter-CPU communication, but for communication
with devices through RAM. I admit I'm unsure what we should do with them.

Thanks
Laszlo


[1] /Volumes/Case/edk2-github(master)>git grep MemoryFence | grep ';'
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:320: MemoryFence ();
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:334: MemoryFence ();
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:344: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:1493: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:1497: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4702: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4707: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4712: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4765: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4770: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4775: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4828: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4833: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4838: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4892: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4894: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:86: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:88: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:115: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:117: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:147: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:149: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:179: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:181: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:211: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:213: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:243: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:245: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:275: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:277: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:305: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:307: MemoryFence ();
MdePkg/Library/BaseLib/AArch64/MemoryFence.asm:3:; MemoryFence() for AArch64
MdePkg/Library/BaseLib/AArch64/MemoryFence.asm:23:;MemoryFence (
MdePkg/Library/BaseLib/Arm/MemoryFence.asm:3:; MemoryFence() for AArch64
MdePkg/Library/BaseLib/Arm/MemoryFence.asm:24:;MemoryFence (
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:412: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:425: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:438: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:173: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:186: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:199: MemoryFence ();
OvmfPkg/Library/VirtioLib/VirtioLib.c:305: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:312: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:327: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:334: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:337: MemoryFence();
OvmfPkg/Library/VmgExitLib/VmgExitLib.c:122: MemoryFence ();
OvmfPkg/Library/VmgExitLib/VmgExitLib.c:124: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:117: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:122: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:171: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:177: MemoryFence ();
OvmfPkg/PvScsiDxe/PvScsi.c:595: MemoryFence ();
OvmfPkg/PvScsiDxe/PvScsi.c:625: MemoryFence ();
OvmfPkg/VirtioNetDxe/Events.c:56: MemoryFence ();
OvmfPkg/VirtioNetDxe/Events.c:58: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpGetStatus.c:99: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpGetStatus.c:102: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:257: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:384: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:426: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:437: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:101: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:103: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:173: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:176: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpTransmit.c:169: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpTransmit.c:172: MemoryFence ();
OvmfPkg/XenBusDxe/GrantTable.c:86: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:496: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:503: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:569: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:580: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:587: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:473: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:528: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:578: MemoryFence ();
UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c:204: MemoryFence ();


Thanks,

Andrew Fish

Would there be interest in reviewing such work?

Thanks
Laszlo


Re: [RFC] Request for new git repository for EdkRepo

Michael D Kinney
 

Hi Nate,

I have created the following Tianocore repositories:

* edk2-edkrepo
* edk2-edkrepo-manifest

I have added you and Ashley to the EDK II Tools Maintainers team with write access to both of these new repositories.

The EDK II Maintainers team has also been grated write access to the edk2-edkrepo-manifest repo.

Best regards,

Mike

-----Original Message-----
From: rfc@edk2.groups.io <rfc@edk2.groups.io> On Behalf Of Nate DeSimone
Sent: Friday, November 13, 2020 3:18 PM
To: Bjorge, Erik C <erik.c.bjorge@...>; devel@edk2.groups.io; rfc@edk2.groups.io
Cc: Desimone, Ashley E <ashley.e.desimone@...>
Subject: Re: [edk2-rfc] [RFC] Request for new git repository for EdkRepo

Hi Everyone,

This RFC has been sitting for a while and it seems like we have reached a conclusion. I would like to make a request to
the TianoCore stewards to create 2 new git repositories in the project:

1. edkrepo
2. edkrepo-manifest

Based on the current Maintainers.txt content, the following individuals should initially be given write access to these
repositories:

Nate DeSimone <nathaniel.l.desimone@...>
Ashley DeSimone <ashley.e.desimone@...>

I have taken a poll between myself and the other maintainer, and given that the project is currently in transition to
using GitHub pull requests we both prefer to start out using the pull request method of code review for these new
repositories. If there is any setup needed to enable pull request code review please do so.

Let me know if the stewards need any more information to complete this request.

Thanks!
Nate

-----Original Message-----
From: Desimone, Nathaniel L
Sent: Wednesday, September 30, 2020 4:21 PM
To: Bjorge, Erik C <erik.c.bjorge@...>; devel@edk2.groups.io; rfc@edk2.groups.io
Cc: Desimone, Ashley E <ashley.e.desimone@...>
Subject: RE: [RFC] Request for new git repository for EdkRepo

Hi Erik,

A separate manifest repository would probably be a good idea. The manifest repo is used for a lot of EdkRepo's operations
so having a smaller repo dedicated repo could be beneficial from a performance standpoint.

Thanks,
Nate

-----Original Message-----
From: Bjorge, Erik C <erik.c.bjorge@...>
Sent: Wednesday, September 30, 2020 3:29 PM
To: Desimone, Nathaniel L <nathaniel.l.desimone@...>; devel@edk2.groups.io; rfc@edk2.groups.io
Cc: Desimone, Ashley E <ashley.e.desimone@...>
Subject: RE: [RFC] Request for new git repository for EdkRepo

I am fine with a new repo. This also supports a good workflow to get a tool that then lets you pull full platforms. In
theory you would only ever really need to clone a single repo manually (assuming reasonable manifest support).

Are you also looking at creating a separate manifest repo as well or just creating a manifest branch in the new EdkRepo
repository?

Thanks,
-Erik

-----Original Message-----
From: Desimone, Nathaniel L <nathaniel.l.desimone@...>
Sent: Wednesday, September 30, 2020 2:56 PM
To: devel@edk2.groups.io; rfc@edk2.groups.io
Cc: Desimone, Ashley E <ashley.e.desimone@...>; Bjorge, Erik C <erik.c.bjorge@...>
Subject: [RFC] Request for new git repository for EdkRepo

Hi Everyone,

Given that EdkRepo has existed in the project for over a year now (in edk2-staging) I think it is time to get out of
staging. I have considered multiple possible landing zones:

1. edk2
2. edk2-pytool-library
3. A new repository

Edk2 does not seem like a good location as EdkRepo isn't strictly necessary to build EDK II, and I think all of us would
prefer that edk2 not become a dumping ground. I have talked with the other maintainers of edk2-pytool-library and they
would prefer that EdkRepo not enter that repository because EdkRepo does not have a robust set of unit tests yet and they
don't want their test coverage metrics to decline. Therefore, the best choice seems to be a new repository. As always if
anyone has comments they are welcome!

Thanks,
Nate



281 - 300 of 789