MemoryFence()


Laszlo Ersek
 

On 02/05/21 17:04, Ni, Ray wrote:
Maybe that explains why the MemoryFence was implemented so
confidently.
I trust Mike:)

In UefiCpuPkg/Library/RegisterCpuFeaturesLib/CpuFeaturesInitialize.c,
the LibWaitForSemaphore and LibReleaseSemaphore functions are coded
without read/write barriers.
Let's focus on the "down" operations first.

At commit c6be6dab9c4b:

789 /**
790   Decrement the semaphore by 1 if it is not zero.
791
792   Performs an atomic decrement operation for semaphore.
793   The compare exchange operation must be performed using
794   MP safe mechanisms.
795
796   @param  Sem  IN:  32-bit unsigned integer
797
798 **/
799 VOID
800 LibWaitForSemaphore (
801   IN OUT volatile UINT32  *Sem
802   )
803 {
804   UINT32  Value;
805
806   do {
807     Value = *Sem;
808   } while (Value == 0 ||
809            InterlockedCompareExchange32 (
810              Sem,
811              Value,
812              Value - 1
813              ) != Value);
814 }

But AcquireSpinLockOrFail and ReleaseSpinLock in
MdePkg/Library/BaseSynchronizationLib/SynchronizationMsc.c contain
read/write barriers.
Is the former implementation wrong because, without the barrier, the
compiler is free to move the cmpxchg earlier or later?

Does volatile keyword help here?
I think LibWaitForSemaphore() is safe. (Which is not identical to saying
"it's in the best style".)

The volatile ensures there is a load instruction, and that the compiler
generates the load instruction in the right place.

The volatile does not ensure that the fetched data is free of corruption
or staleness -- the load could fetch a partially updated value or a stale
value into "Value". But in this particular instance that should be no
problem:

- if we see a corrupt or stale zero value, we'll just continue spinning;
the value will stabilize eventually

- if we see a corrupt or stale *nonzero* value, then when we attempt to
decrement it with a safe InterlockedCompareExchange32(), we'll just
fail.

So the volatile fetch can be considered an "initial guess" that works
most of the time.

(BTW, PiSmmCpuDxeSmm has a more recent variant that is less
power-hungry; see commit 9001b750df64 ("UefiCpuPkg/PiSmmCpuDxeSmm: pause
in WaitForSemaphore() before re-fetch", 2020-07-31).)
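
(That variant looks roughly like this -- a sketch from the commit, with
the return value elided; the point is the CpuPause() in the spin path:)

VOID
WaitForSemaphore (
  IN OUT volatile UINT32  *Sem
  )
{
  UINT32  Value;

  for (;;) {
    Value = *Sem;
    if (Value != 0 &&
        InterlockedCompareExchange32 (
          (UINT32 *)Sem,
          Value,
          Value - 1
          ) == Value) {
      break;           // decremented the semaphore successfully
    }
    CpuPause ();       // be nicer to the sibling hardware thread while spinning
  }
}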

The implementation seen in
"MdePkg/Library/BaseSynchronizationLib/SynchronizationMsc.c" is:

165 /**
166   Attempts to place a spin lock in the acquired state.
167
168   This function checks the state of the spin lock specified by SpinLock. If
169   SpinLock is in the released state, then this function places SpinLock in the
170   acquired state and returns TRUE. Otherwise, FALSE is returned. All state
171   transitions of SpinLock must be performed using MP safe mechanisms.
172
173   If SpinLock is NULL, then ASSERT().
174   If SpinLock was not initialized with InitializeSpinLock(), then ASSERT().
175
176   @param  SpinLock  A pointer to the spin lock to place in the acquired state.
177
178   @retval TRUE   SpinLock was placed in the acquired state.
179   @retval FALSE  SpinLock could not be acquired.
180
181 **/
182 BOOLEAN
183 EFIAPI
184 AcquireSpinLockOrFail (
185   IN OUT  SPIN_LOCK                 *SpinLock
186   )
187 {
188   SPIN_LOCK    LockValue;
189   VOID         *Result;
190
191   ASSERT (SpinLock != NULL);
192
193   LockValue = *SpinLock;
194   ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue == SPIN_LOCK_RELEASED);
195
196   _ReadWriteBarrier ();
197   Result = InterlockedCompareExchangePointer (
198              (VOID**)SpinLock,
199              (VOID*)SPIN_LOCK_RELEASED,
200              (VOID*)SPIN_LOCK_ACQUIRED
201              );
202
203   _ReadWriteBarrier ();
204   return (BOOLEAN) (Result == (VOID*) SPIN_LOCK_RELEASED);
205 }

I think the explicit barriers are better style (more helpful for the
programmer, and in general, possibly better performance).

On the other hand (independently of what you asked), the
AcquireSpinLockOrFail() logic seems a bit fishy to me.

First, the ASSERT() deserves a comment, at the bare minimum. ("Even if
the data is stale or partially updated, we can't conjure up a completely
bogus value", or some such.)

Second, the rest of the function doesn't even depend on LockValue.
Having the LockValue / ASSERT() stuff seems a bit pointless to me.

--*--

Now, regarding the "up" operations:

[UefiCpuPkg/Library/RegisterCpuFeaturesLib/CpuFeaturesInitialize.c]

775 /**
776   Increment semaphore by 1.
777
778   @param  Sem  IN:  32-bit unsigned integer
779
780 **/
781 VOID
782 LibReleaseSemaphore (
783   IN OUT volatile UINT32  *Sem
784   )
785 {
786   InterlockedIncrement (Sem);
787 }

This seems OK to me.

(In an ideal world, InterlockedIncrement() should internally enforce a
compiler barrier too, so the volatile should not be needed.)
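
(For illustration, a GCC-style sketch of such an InterlockedIncrement() --
not the actual BaseSynchronizationLib code -- where the "memory" clobber
provides the compiler barrier and the LOCK prefix the processor-level
atomicity:)

UINT32
EFIAPI
InterlockedIncrement (
  IN volatile UINT32  *Value
  )
{
  UINT32  Result;

  //
  // LOCK XADD atomically adds 1 and leaves the original value in Result;
  // the "memory" clobber doubles as a compiler optimization barrier.
  //
  __asm__ __volatile__ (
    "lock; xaddl %0, %1"
    : "=r" (Result), "+m" (*Value)
    : "0" (1)
    : "memory", "cc"
    );
  return Result + 1;
}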

[MdePkg/Library/BaseSynchronizationLib/SynchronizationMsc.c]

207 /**
208   Releases a spin lock.
209
210   This function places the spin lock specified by SpinLock in the release state
211   and returns SpinLock.
212
213   If SpinLock is NULL, then ASSERT().
214   If SpinLock was not initialized with InitializeSpinLock(), then ASSERT().
215
216   @param  SpinLock  A pointer to the spin lock to release.
217
218   @return SpinLock released the lock.
219
220 **/
221 SPIN_LOCK *
222 EFIAPI
223 ReleaseSpinLock (
224   IN OUT  SPIN_LOCK                 *SpinLock
225   )
226 {
227   SPIN_LOCK    LockValue;
228
229   ASSERT (SpinLock != NULL);
230
231   LockValue = *SpinLock;
232   ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue == SPIN_LOCK_RELEASED);
233
234   _ReadWriteBarrier ();
235   *SpinLock = SPIN_LOCK_RELEASED;
236   _ReadWriteBarrier ();
237
238   return SpinLock;
239 }

Fishy. I would have implemented it with another
InterlockedCompareExchangePointer(), and maybe ASSERT()ed on the
original value returned by the InterlockedCompareExchangePointer().

Laszlo


Paolo Bonzini <pbonzini@...>
 

On 05/02/21 17:56, Laszlo Ersek wrote:
On 02/05/21 17:04, Ni, Ray wrote:
798 **/
799 VOID
800 LibWaitForSemaphore (
801   IN OUT volatile UINT32  *Sem
802   )
I think LibWaitForSemaphore() is safe. (Which is not identical to saying
"it's in the best style".)
Agreed. I would prefer an explicit compiler fence before the "Value = *Sem" assignment and no volatile anywhere, but it is okay.

193   LockValue = *SpinLock;
194   ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue == SPIN_LOCK_RELEASED);
195
196   _ReadWriteBarrier ();
197   Result = InterlockedCompareExchangePointer (
198              (VOID**)SpinLock,
199              (VOID*)SPIN_LOCK_RELEASED,
200              (VOID*)SPIN_LOCK_ACQUIRED
201              );
202
203   _ReadWriteBarrier ();
204   return (BOOLEAN) (Result == (VOID*) SPIN_LOCK_RELEASED);
205 }
Having the LockValue / ASSERT() stuff seems a bit pointless to me.
The _ReadWriteBarrier (aka compiler fence) is pointless too; it is implicit in the InterlockedCompareExchangePointer.

781 VOID
782 LibReleaseSemaphore (
783   IN OUT volatile UINT32  *Sem
784   )
785 {
786   InterlockedIncrement (Sem);
787 }
This seems OK to me.
Yep.

221 SPIN_LOCK *
222 EFIAPI
223 ReleaseSpinLock (
224   IN OUT  SPIN_LOCK                 *SpinLock
225   )
226 {
227   SPIN_LOCK    LockValue;
228
229   ASSERT (SpinLock != NULL);
230
231   LockValue = *SpinLock;
232   ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue == SPIN_LOCK_RELEASED);
233
234   _ReadWriteBarrier ();
235   *SpinLock = SPIN_LOCK_RELEASED;
236   _ReadWriteBarrier ();
237
238   return SpinLock;
239 }
Fishy. I would have implemented it with another
InterlockedCompareExchangePointer(), and maybe ASSERT()ed on the
original value returned by the InterlockedCompareExchangePointer().
That would be overkill. However, it *is* buggy because it is missing a (processor) barrier on non-x86 architectures and has a useless barrier after the store. Instead it should be just this:

ReleaseMemoryFence ();
*SpinLock = SPIN_LOCK_RELEASED;
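
Spelled out as a full function (a sketch, assuming the proposed
ReleaseMemoryFence() primitive exists):

SPIN_LOCK *
EFIAPI
ReleaseSpinLock (
  IN OUT  SPIN_LOCK  *SpinLock
  )
{
  ASSERT (SpinLock != NULL);

  //
  // Order all earlier loads and stores before the releasing store; on x86
  // this is a pure compiler barrier, on ARM/AARCH64 it emits "dmb ish".
  //
  ReleaseMemoryFence ();
  *SpinLock = SPIN_LOCK_RELEASED;
  return SpinLock;
}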

Paolo


Laszlo Ersek
 

On 02/05/21 18:21, Paolo Bonzini wrote:
On 05/02/21 17:56, Laszlo Ersek wrote:
    221  SPIN_LOCK *
    222  EFIAPI
    223  ReleaseSpinLock (
    224    IN OUT  SPIN_LOCK                 *SpinLock
    225    )
    226  {
    227    SPIN_LOCK    LockValue;
    228
    229    ASSERT (SpinLock != NULL);
    230
    231    LockValue = *SpinLock;
    232    ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue ==
SPIN_LOCK_RELEASED);
    233
    234    _ReadWriteBarrier ();
    235    *SpinLock = SPIN_LOCK_RELEASED;
    236    _ReadWriteBarrier ();
    237
    238    return SpinLock;
    239  }

Fishy. I would have implemented it with another
InterlockedCompareExchangePointer(), and maybe ASSERT()ed on the
original value returned by the InterlockedCompareExchangePointer().
That would be overkill.  However, it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store.  Instead it should be just this:

    ReleaseMemoryFence ();
    *SpinLock = SPIN_LOCK_RELEASED;
No concern that the store might not be atomic?

Thanks
Laszlo


Andrew Fish <afish@...>
 

On Feb 4, 2021, at 11:48 PM, Laszlo Ersek <lersek@...> wrote:

On 02/04/21 22:31, Andrew Fish wrote:


On Feb 4, 2021, at 10:09 AM, Laszlo Ersek <lersek@...> wrote:

Hi All,

here's the declaration of MemoryFence(), from <BaseLib.h>:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
);
The IA32 and X64 implementations of MemoryFence() are not memory
barriers, only compiler barriers, however.

This is a problem for two reasons:

- code that believes to add a memory barrier only adds a compiler
barrier, possibly creating an unsafe situation,

- code that knowingly calls MemoryFence() for the purposes of a compiler
barrier -- while it will hopefully work safely, relying on tacit
knowledge / assumptions about the x86 architecture -- results in a
source reading experience that mismatches known lockless patterns.


This topic has repeatedly come up in the past; most recently in the
review of Ankur's v6 series for
<https://bugzilla.tianocore.org/show_bug.cgi?id=3132>, where some
inter-processor communication needs to happen.

I asked Paolo for help with understanding some of the synchronization
patterns in PiSmmCpuDxeSmm, the reason being that Ankur's series is
going to rely on, and/or imitate, some of those patterns. From this
discussion with Paolo, the following have emerged (I'm going to
liberally steal / paraphrase Paolo's words):


(1) We should introduce finer-grained fence primitives:

                      ARM        AARCH64     i386

CompilerFence()       asm("")    asm("")     asm("")
AcquireMemoryFence()  dmb ish    dmb ishld   asm("")
ReleaseMemoryFence()  dmb ish    dmb ish     asm("")
MemoryFence()         dmb ish    dmb ish     mfence

"where AcquireMemoryFence() is used on the read side (i.e. between
reads) and ReleaseMemoryFence() is used on the write side (i.e. between
writes)".

(Note that this step would make MemoryFence() heavier-weight on x86 than
it currently is.)
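
As an illustration of the intended pairing, here is a minimal sketch that
assumes the new primitives (mData and mFlag are hypothetical objects
shared by two processors):

UINT32  mData;
UINT32  mFlag;

VOID
Producer (
  VOID
  )
{
  mData = 42;             // publish the payload first
  ReleaseMemoryFence ();  // order the mData store before the mFlag store
  mFlag = 1;              // then raise the flag
}

VOID
Consumer (
  VOID
  )
{
  do {
    CompilerFence ();     // force a re-fetch of mFlag on each iteration
  } while (mFlag == 0);
  AcquireMemoryFence (); // order the mFlag load before the mData load
  ASSERT (mData == 42);  // the payload is guaranteed to be visible
}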


(2) We should audit and rework existent uses (combinations) of
MemoryFence() into matched acquire / release pairs.

Less importantly, this would restore the x86 behavior (performance) to
the one seen with the current MemoryFence() implementation.

More importantly, it would fix what the code *means*: "it's very
important to stick to known lockless patterns, and having matching
acquire/release fences makes it much more intuitive to understand what's
going on".

--*--

I'm not proposing this as a pre-requisite to merging Ankur's series
(i.e., I don't expect the series to use acquire/release -- it can be
converted later with the rest of the audit).

However, I'd really like us to stop wasting more time on MemoryFence()
doubts such as:

- Hey it's not a memory fence at all, is it safe?

- But on x86 it kinda is, *dependent on* the spinlocks / lock prefixes
we use for synchronization, between the other accesses.

- Is it interchangeable with volatile? Yes? No? which one should we use?

In the longer term, multi-processing code like MpInitLib (CpuDxe,
CpuMpPei) and PiSmmCpuDxeSmm should likely abandon all use of
"volatile", and use the explicit barriers instead, for *clarity*.


Step (2) would take quite a bit of thinking (if not much code).

Comments?
Laszlo,

Thanks for looking into this. I noticed most of the MemoryFence() usage[1] is in the virtual platforms so changes seem manageable. I guess 3rd party code that defines its own low level primitive could be impacted, but they chose to reinvent the wheel.

It looks like the generic platform performance risk is BaseIoLibIntrinsic. I know from debugging crashes a long time ago that the MemoryFence() for x86 is really a CompilerFence() in the new scheme. If the new AcquireMemoryFence()/ReleaseMemoryFence()/CompilerFence() replace MemoryFence(), this likely just works. If not, then we might need an architecture specific pattern to maintain current performance; we could always make a function internal to the lib that does the right thing for the given architecture. Also, making it more precise might actually help ARM/AARCH64 performance?

So if I understand correctly
1) volatile basically tells C to always read the memory. So it impacts the C memory model.
2) CompilerFence() tells the C compiler to serialize and complete everything before the barrier, and to read memory again the first time after the barrier.
3) MemoryFence() is really dealing with speculative fetches and maybe coherency. Basically the CPU reordering things.

So for example in BaseIoLibIntrinsic() the volatile is used to force the memory read and the CompilerFence() is used to enforce the order of that read relative to other operations. Thus wrapping the volatile access in CompilerFence() is in some ways redundant as long as there is only a single access inside the CompilerFence() wrapper.
Right, I agree with everything you said thus far. (I don't claim to be
an expert on this, of course.) CompilerFence() is a "better volatile"
(for RAM only, not for MMIO), where we can allow the compiler to
optimize, up to a certain location.

I like the idea of properly annotating the code. Given x86 defaults to so much cache coherency for compatibility with older, simpler times, it can tend to hide issues. I’m not sure removing the volatile keyword is required, and it does imply this variable has special rules. I guess we could add a REQUIRE_FENCE or some such? But it feels like defining the variables in some way to imply they require special handling has some value?
Objects that are shared between processors, or between processor and
some device, should always be highlighted in some way (naming, comments,
...).


The only reason you would need a volatile is if you had more than a single read inside of the fences. Something like:

UINT8  *Address;
UINT8  Value;

MemoryFence ();
do {
  Value = *Address;
} while (Value == 0xFF);
MemoryFence ();

The compiler could optimize the above code (current MemoryFence() model) to a single read and an infinite loop if 0xFF. Yes, I realize you could move the MemoryFence () into the loop, but what if you did not care about that? Mostly just trying to think this through with some examples....
I think it depends on what Address refers to. If it refers to an MMIO
register, then volatile is needed, IMO (and the MemoryFence()s are
unnecessary). If Address points to a byte in RAM that's manipulated by
multiple CPUs, then I think:

- volatile is not needed
OK I see Linux recommends not using volatile [1], so I think this makes sense to me. It is not that volatile is not useful, but it is not thread safe, so not using volatile encourages people to use primitives to make things thread safe. So not using volatile is more of a coding convention.

- the two MemoryFence() calls should be removed
They may need to turn into CompilerFence().

- an AcquireMemoryFence() call should be inserted into the loop, after
the read access. Because, presumably, once the flag assumes the
appropriate value (it has been "released")), we'll proceed to reading
some memory resource that was protected by the flag until then.
CompilerFence ();
do {
  Value = *Address;
  AcquireMemoryFence ();
} while (Value == 0xFF);



Regarding the many MemoryFence() calls in the virt platforms, most of
them do not exist for inter-CPU communication, but for communication
with devices through RAM. I admit I'm unsure what we should do with them.
I think this is where x86 makes things confusing with its coherency-all-the-time motto. On x86 I think they are really CompilerFences() to force C memory model order of operations. They got added to fix real bugs on x86. Remember the compiler can still reorder a volatile access with surrounding code as an optimization.

So for example if MmioWrite8() has MemoryFence() replaced with CompilerFence() these operations happen in order. If there is no CompilerFence() the order is implementation dependent.

UINT8  *A, *B, *C;

*A = 1;
MmioWrite8 ((UINTN)B, 2);
*C = 3;

So that is why we want the CompilerFence() + volatile for the MMIO. For example you could end up writing to a Common DMA buffer and then doing an MMIO write. You might want those to happen in sequence and you need the CompilerFence()
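
In other words, the accessor would look something like this (a sketch of
the IoLib pattern, with MemoryFence() renamed to CompilerFence() per the
proposal):

UINT8
EFIAPI
MmioWrite8 (
  IN UINTN  Address,
  IN UINT8  Value
  )
{
  CompilerFence ();                    // complete all earlier reads/writes first
  *(volatile UINT8 *)Address = Value;  // the MMIO store itself
  CompilerFence ();                    // keep later accesses after the store
  return Value;
}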

At least the CPU code is CPU specific so the rules should be more well defined :).

[1] https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html

Thanks,

Andrew Fish


Thanks
Laszlo


[1] /Volumes/Case/edk2-github(master)>git grep MemoryFence | grep ';'
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:320: MemoryFence ();
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:334: MemoryFence ();
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:344: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:1493: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:1497: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4702: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4707: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4712: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4765: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4770: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4775: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4828: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4833: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4838: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4892: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4894: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:86: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:88: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:115: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:117: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:147: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:149: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:179: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:181: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:211: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:213: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:243: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:245: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:275: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:277: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:305: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:307: MemoryFence ();
MdePkg/Library/BaseLib/AArch64/MemoryFence.asm:3:; MemoryFence() for AArch64
MdePkg/Library/BaseLib/AArch64/MemoryFence.asm:23:;MemoryFence (
MdePkg/Library/BaseLib/Arm/MemoryFence.asm:3:; MemoryFence() for AArch64
MdePkg/Library/BaseLib/Arm/MemoryFence.asm:24:;MemoryFence (
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:412: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:425: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:438: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:173: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:186: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:199: MemoryFence ();
OvmfPkg/Library/VirtioLib/VirtioLib.c:305: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:312: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:327: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:334: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:337: MemoryFence();
OvmfPkg/Library/VmgExitLib/VmgExitLib.c:122: MemoryFence ();
OvmfPkg/Library/VmgExitLib/VmgExitLib.c:124: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:117: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:122: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:171: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:177: MemoryFence ();
OvmfPkg/PvScsiDxe/PvScsi.c:595: MemoryFence ();
OvmfPkg/PvScsiDxe/PvScsi.c:625: MemoryFence ();
OvmfPkg/VirtioNetDxe/Events.c:56: MemoryFence ();
OvmfPkg/VirtioNetDxe/Events.c:58: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpGetStatus.c:99: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpGetStatus.c:102: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:257: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:384: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:426: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:437: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:101: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:103: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:173: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:176: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpTransmit.c:169: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpTransmit.c:172: MemoryFence ();
OvmfPkg/XenBusDxe/GrantTable.c:86: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:496: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:503: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:569: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:580: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:587: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:473: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:528: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:578: MemoryFence ();
UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c:204: MemoryFence ();


Thanks,

Andrew Fish

Would there be interest in reviewing such work?

Thanks
Laszlo


Paolo Bonzini <pbonzini@...>
 

On 05/02/21 18:31, Laszlo Ersek wrote:
On 02/05/21 18:21, Paolo Bonzini wrote:
On 05/02/21 17:56, Laszlo Ersek wrote:
    221  SPIN_LOCK *
    222  EFIAPI
    223  ReleaseSpinLock (
    224    IN OUT  SPIN_LOCK                 *SpinLock
    225    )
    226  {
    227    SPIN_LOCK    LockValue;
    228
    229    ASSERT (SpinLock != NULL);
    230
    231    LockValue = *SpinLock;
    232    ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue ==
SPIN_LOCK_RELEASED);
    233
    234    _ReadWriteBarrier ();
    235    *SpinLock = SPIN_LOCK_RELEASED;
    236    _ReadWriteBarrier ();
    237
    238    return SpinLock;
    239  }

Fishy. I would have implemented it with another
InterlockedCompareExchangePointer(), and maybe ASSERT()ed on the
original value returned by the InterlockedCompareExchangePointer().
That would be overkill.  However, it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store.  Instead it should be just this:

    ReleaseMemoryFence ();
    *SpinLock = SPIN_LOCK_RELEASED;
No concern that the store might not be atomic?
Not as long as it's a pointer or smaller.

Paolo


Andrew Fish <afish@...>
 

On Feb 5, 2021, at 12:25 AM, Ni, Ray <ray.ni@...> wrote:

So if I understand correctly
1) volatile basically tells C to always read the memory. So it impacts the C memory model.
2) CompilerFence() tells the C compiler to serialize and complete everything before the barrier, and to read memory again the first time after the barrier.
3) MemoryFence() is really dealing with speculative fetches and maybe coherency. Basically the CPU reordering things.
I agree with this summary.
The volatile in MpInitLib or the PiSmmCpu driver cannot be omitted, because different CPU threads change the same memory content to synchronize with each other.
I don’t quite understand: if “volatile” is removed, what mechanism can be used to force the compiler to generate code that always reads from memory?
Ray,

I agree the C definition of volatile is not thread safe.

I was doing some research on volatile and I realized the Linux kernel guides recommend against using volatile [1]. I think it has to do with people confusing volatile for being thread safe. So my guess is it is more of a coding standard to encourage people to use synchronization primitives. The primitives imply compiler fences, which indirectly enforce the same behavior as volatile.

[1] https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html

Thanks,

Andrew Fish


Thanks,
Ray


Andrew Fish <afish@...>
 

On Feb 5, 2021, at 7:38 AM, Paolo Bonzini <pbonzini@...> wrote:

On 05/02/21 16:34, Laszlo Ersek wrote:
And your code is wrong because the *original*, namely the MSFT
implementation of MemoryFence() in edk2, is bugged:
* MdePkg/Library/BaseLib/X86MemoryFence.c:
/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
)
{
return;
}
I think it's okay-ish (it works as long as you don't do link-time optimization, but it's less efficient than a compiler intrinsic) because X86MemoryFence.c is a separate file.
We do link time optimization on VC++ and clang.

Thanks,

Andrew Fish

Paolo


Ni, Ray
 

Without calling _ReadWriteBarrier, is it possible that the compiler generates the assembly in the wrong location? I mean, the compiler may inline LibWaitForSemaphore and call cmpxchg earlier than the desired location.
Similarly for LibReleaseSemaphore.

So my understanding is the _ReadWriteBarrier in ReleaseSpinLock is required.

I think Andrew also has the same thoughts.

thanks,
ray
________________________________
From: Paolo Bonzini <pbonzini@...>
Sent: Saturday, February 6, 2021 1:34:15 AM
To: Laszlo Ersek <lersek@...>; Ni, Ray <ray.ni@...>; Ard Biesheuvel <ardb@...>
Cc: Andrew Fish <afish@...>; edk2 RFC list <rfc@edk2.groups.io>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address) <leif@...>; Dong, Eric <eric.dong@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...>
Subject: Re: MemoryFence()

On 05/02/21 18:31, Laszlo Ersek wrote:
On 02/05/21 18:21, Paolo Bonzini wrote:
On 05/02/21 17:56, Laszlo Ersek wrote:
221 SPIN_LOCK *
222 EFIAPI
223 ReleaseSpinLock (
224   IN OUT  SPIN_LOCK                 *SpinLock
225   )
226 {
227   SPIN_LOCK    LockValue;
228
229   ASSERT (SpinLock != NULL);
230
231   LockValue = *SpinLock;
232   ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue == SPIN_LOCK_RELEASED);
233
234   _ReadWriteBarrier ();
235   *SpinLock = SPIN_LOCK_RELEASED;
236   _ReadWriteBarrier ();
237
238   return SpinLock;
239 }

Fishy. I would have implemented it with another
InterlockedCompareExchangePointer(), and maybe ASSERT()ed on the
original value returned by the InterlockedCompareExchangePointer().
That would be overkill. However, it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store. Instead it should be just this:

ReleaseMemoryFence ();
*SpinLock = SPIN_LOCK_RELEASED;
No concern that the store might not be atomic?
Not as long as it's a pointer or smaller.

Paolo


Paolo Bonzini <pbonzini@...>
 

On 05/02/21 18:53, Ni, Ray wrote:
Without calling _ReadWriteBarrier, is it possible that the compiler generates the assembly in the wrong location? I mean, the compiler may inline LibWaitForSemaphore and call cmpxchg earlier than the desired location.
Similarly for LibReleaseSemaphore.
So my understanding is the _ReadWriteBarrier in ReleaseSpinLock is required.
The proposed ReleaseMemoryFence() should already have that effect. All the proposed fences except CompilerFence() are both compiler optimization barriers and processor barriers. InterlockedCompareExchange() is also both a compiler optimization barrier and a processor barrier.

CompilerFence() is just a better name for _ReadWriteBarrier(), it blocks optimizations but it has no effect at the processor level. It should only be used (instead of volatile) in busy-waiting loops that do not always go through an InterlockedCompareExchange.

Paolo

234 _ReadWriteBarrier ();
235 *SpinLock = SPIN_LOCK_RELEASED;
236 _ReadWriteBarrier ();
237
238 return SpinLock;
it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store. Instead it should be just this:

ReleaseMemoryFence ();
*SpinLock = SPIN_LOCK_RELEASED;


Ni, Ray
 

Andrew,
I just read the link you provided. It recommends using mutexes, semaphores, and spin locks over volatile.
But the funny thing here is that we are talking about how to implement the mutexes, semaphores, and spin locks in edk2.

Maybe we need to rely on the built-in atomic functions, but I am not sure whether these are gcc-only:
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
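
For example (a sketch; these builtins exist in gcc and clang, but not in
VC++):

VOID
LibReleaseSemaphore (
  IN OUT UINT32  *Sem
  )
{
  //
  // Atomic increment with both compiler-level and processor-level
  // acquire/release ordering.
  //
  __atomic_fetch_add (Sem, 1, __ATOMIC_ACQ_REL);
}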


thanks,
ray
________________________________
From: Andrew Fish <afish@...>
Sent: Saturday, February 6, 2021 1:48:50 AM
To: Ni, Ray <ray.ni@...>
Cc: Laszlo Ersek <lersek@...>; edk2 RFC list <rfc@edk2.groups.io>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address) <leif@...>; Ard Biesheuvel (kernel.org address) <ardb@...>; Dong, Eric <eric.dong@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...>; Paolo Bonzini <pbonzini@...>
Subject: Re: MemoryFence()



On Feb 5, 2021, at 12:25 AM, Ni, Ray <ray.ni@...> wrote:

So if I understand correctly
1) volatile basically tells C to always read the memory. So it impacts the C memory model.
2) CompilerFence() tells the C compiler to serialize and complete everything before the barrier, and to read memory again the first time after the barrier.
3) MemoryFence() is really dealing with speculative fetches and maybe coherency. Basically the CPU reordering things.
I agree with this summary.
The volatile in MpInitLib or the PiSmmCpu driver cannot be omitted, because different CPU threads change the same memory content to synchronize with each other.
I don’t quite understand: if “volatile” is removed, what mechanism can be used to force the compiler to generate code that always reads from memory?


Ray,

I agree the C definition of volatile is not thread safe.

I was doing some research on volatile and I realized the Linux kernel guides recommend against using volatile [1]. I think it has to do with people confusing volatile for being thread safe. So my guess is it is more of a coding standard to encourage people to use synchronization primitives. The primitives imply compiler fences, which indirectly enforce the same behavior as volatile.

[1] https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html

Thanks,

Andrew Fish


Thanks,
Ray


Ni, Ray
 

I saw the proposal of fences in the first mail by Laszlo. Please forgive my ignorance. What is asm(“”) in x86? A nop? Then how can a nop help as a processor-level load/store barrier?


thanks,
ray
________________________________
From: rfc@edk2.groups.io <rfc@edk2.groups.io> on behalf of Paolo Bonzini <pbonzini@...>
Sent: Saturday, February 6, 2021 2:01:14 AM
To: Ni, Ray <ray.ni@...>; Laszlo Ersek <lersek@...>; Ard Biesheuvel <ardb@...>
Cc: Andrew Fish <afish@...>; edk2 RFC list <rfc@edk2.groups.io>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address) <leif@...>; Dong, Eric <eric.dong@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...>
Subject: Re: [edk2-rfc] MemoryFence()

On 05/02/21 18:53, Ni, Ray wrote:
Without calling _ReadWriteBarrier, is it possible that the compiler
generates the assembly in the wrong location? I mean, the compiler may
inline LibWaitForSemaphore and call cmpxchg earlier than the
desired location.
Similarly for LibReleaseSemaphore.

So my understanding is the _ReadWriteBarrier in ReleaseSpinLock is required.
The proposed ReleaseMemoryFence() should already have that effect. All
the proposed fences except CompilerFence() are both compiler
optimization barriers and processor barriers.
InterlockedCompareExchange() is also both a compiler optimization
barrier and a processor barrier

CompilerFence() is just a better name for _ReadWriteBarrier(), it blocks
optimizations but it has no effect at the processor level. It should
only be used (instead of volatile) in busy-waiting loops that do not
always go through an InterlockedCompareExchange.

Paolo

234 _ReadWriteBarrier ();
235 *SpinLock = SPIN_LOCK_RELEASED;
236 _ReadWriteBarrier ();
237
238 return SpinLock;
it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store. Instead it should be just this:

ReleaseMemoryFence ();
*SpinLock = SPIN_LOCK_RELEASED;


Andrew Fish <afish@...>
 

Ray,

Yes I was writing a mail when I wrote the bugzilla….

Ray,

Function calls are not compiler fences. The compiler just optimized the code away as dead code. For non-VC++ we have a real compiler fence/barrier for MemoryFence(). I filed a bugzilla [1] so we can investigate this some more.

I hit a real bug with clang using the MMIO functions a long time ago and I think that might be why the GCC version of the MemoryFence() got the compiler fence/barrier.

If I had to guess, VC++ is probably more conservative about volatiles and that is why we have not seen issues with VC++, but that is depending on non-standard behavior.

Well this probably explains why we don’t have many MemoryFence() calls in the x86 CPU code, given they don’t do anything on VC++. I have noticed from debugging MP code that was developed using VC++, that we seem to hit more issues when compiled via clang. I wonder if the behavior difference with volatile would explain some of those?

VOID
EFIAPI
MemoryFence (
  VOID
  )
{
  return;
}

And:

VOID
EFIAPI
MemoryFence (
  VOID
  )
{
  // This is a little bit of overkill and it is more about the compiler that it is
  // actually processor synchronization. This is like the _ReadWriteBarrier
  // Microsoft specific intrinsic
  __asm__ __volatile__ ("":::"memory");
}

So I think for VC++ it should be…

VOID
EFIAPI
MemoryFence (
  VOID
  )
{
  _ReadWriteBarrier();
  return;
}

[1] https://bugzilla.tianocore.org/show_bug.cgi?id=3213

Thanks,

Andrew Fish

On Feb 5, 2021, at 9:53 AM, Ni, Ray <ray.ni@...> wrote:

Without calling _ReadWriteBarrier, is it possible that the compiler generates the assembly in the wrong location? I mean, the compiler may inline LibWaitForSemaphore and call cmpxchg earlier than the desired location.
Similarly for LibReleaseSemaphore.

So my understanding is the _ReadWriteBarrier in ReleaseSpinLock is required.

I think Andrew also has the same thoughts.

thanks,
ray
From: Paolo Bonzini <pbonzini@...>
Sent: Saturday, February 6, 2021 1:34:15 AM
To: Laszlo Ersek <lersek@...>; Ni, Ray <ray.ni@...>; Ard Biesheuvel <ardb@...>
Cc: Andrew Fish <afish@...>; edk2 RFC list <rfc@edk2.groups.io>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address) <leif@...>; Dong, Eric <eric.dong@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...>
Subject: Re: MemoryFence()

On 05/02/21 18:31, Laszlo Ersek wrote:
On 02/05/21 18:21, Paolo Bonzini wrote:
On 05/02/21 17:56, Laszlo Ersek wrote:
221 SPIN_LOCK *
222 EFIAPI
223 ReleaseSpinLock (
224   IN OUT  SPIN_LOCK                 *SpinLock
225   )
226 {
227   SPIN_LOCK    LockValue;
228
229   ASSERT (SpinLock != NULL);
230
231   LockValue = *SpinLock;
232   ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue == SPIN_LOCK_RELEASED);
233
234   _ReadWriteBarrier ();
235   *SpinLock = SPIN_LOCK_RELEASED;
236   _ReadWriteBarrier ();
237
238   return SpinLock;
239 }

Fishy. I would have implemented it with another
InterlockedCompareExchangePointer(), and maybe ASSERT()ed on the
original value returned by the InterlockedCompareExchangePointer().
That would be overkill. However, it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store. Instead it should be just this:

ReleaseMemoryFence ();
*SpinLock = SPIN_LOCK_RELEASED;
No concern that the store might not be atomic?
Not as long as it's a pointer or smaller.

Paolo


Paolo Bonzini <pbonzini@...>
 

On Fri, Feb 5, 2021, at 19:12, Ni, Ray <ray.ni@...> wrote:

I saw the proposal of fences in the first mail by Laszlo. Please forgive my
ignorance. What is asm(“”) in x86? A nop? Then how can a nop help as a
processor-level load/store barrier?
On x86 load-load, load-store and store-store ordering is already guaranteed
by the processor. Therefore on x86 the AcquireMemoryFence and
ReleaseMemoryFence are just like CompilerFence: they only have to block
compiler-level reordering. MemoryFence is the only one that blocks
store-load reordering and needs to emit an MFENCE instruction.

On ARM (either 32- or 64-bit) the processor-level guarantees are weaker,
and you need to emit a "dmb" instruction for acquire and release fences as
well.
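
Concretely, GCC-style implementations of the one-way fences could look
roughly like this (a sketch; MDE_CPU_* are the usual edk2 architecture
macros):

VOID
EFIAPI
AcquireMemoryFence (
  VOID
  )
{
#if defined (MDE_CPU_AARCH64)
  __asm__ __volatile__ ("dmb ishld" : : : "memory");
#elif defined (MDE_CPU_ARM)
  __asm__ __volatile__ ("dmb ish" : : : "memory");
#else
  //
  // x86 does not reorder loads with other loads, so a compiler
  // barrier is sufficient.
  //
  __asm__ __volatile__ ("" : : : "memory");
#endif
}

VOID
EFIAPI
ReleaseMemoryFence (
  VOID
  )
{
#if defined (MDE_CPU_AARCH64) || defined (MDE_CPU_ARM)
  __asm__ __volatile__ ("dmb ish" : : : "memory");
#else
  __asm__ __volatile__ ("" : : : "memory");
#endif
}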

Paolo



thanks,
ray
------------------------------
*From:* rfc@edk2.groups.io <rfc@edk2.groups.io> on behalf of Paolo Bonzini <
pbonzini@...>
*Sent:* Saturday, February 6, 2021 2:01:14 AM
*To:* Ni, Ray <ray.ni@...>; Laszlo Ersek <lersek@...>; Ard
Biesheuvel <ardb@...>
*Cc:* Andrew Fish <afish@...>; edk2 RFC list <rfc@edk2.groups.io>;
Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia
address) <leif@...>; Dong, Eric <eric.dong@...>; Liming
Gao (Byosoft address) <gaoliming@...>; Ankur Arora <
ankur.a.arora@...>
*Subject:* Re: [edk2-rfc] MemoryFence()

On 05/02/21 18:53, Ni, Ray wrote:
Without calling _ReadWriteBarrier, is it possible that the compiler
generates the assembly in the wrong location? I mean, the compiler may
inline LibWaitForSemaphore and call cmpxchg earlier than the
desired location.
Similarly for LibReleaseSemaphore.

So my understanding is the _ReadWriteBarrier in ReleaseSpinLock is
required.

The proposed ReleaseMemoryFence() should already have that effect. All
the proposed fences except CompilerFence() are both compiler
optimization barriers and processor barriers.
InterlockedCompareExchange() is also both a compiler optimization
barrier and a processor barrier

CompilerFence() is just a better name for _ReadWriteBarrier(), it blocks
optimizations but it has no effect at the processor level. It should
only be used (instead of volatile) in busy-waiting loops that do not
always go through an InterlockedCompareExchange.

Paolo

234 _ReadWriteBarrier ();
235 *SpinLock = SPIN_LOCK_RELEASED;
236 _ReadWriteBarrier ();
237
238 return SpinLock;
it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store. Instead it should be just this:

ReleaseMemoryFence ();
*SpinLock = SPIN_LOCK_RELEASED;






Laszlo Ersek
 

On 02/05/21 18:34, Paolo Bonzini wrote:
On 05/02/21 18:31, Laszlo Ersek wrote:
On 02/05/21 18:21, Paolo Bonzini wrote:
On 05/02/21 17:56, Laszlo Ersek wrote:
     221  SPIN_LOCK *
     222  EFIAPI
     223  ReleaseSpinLock (
     224    IN OUT  SPIN_LOCK                 *SpinLock
     225    )
     226  {
     227    SPIN_LOCK    LockValue;
     228
     229    ASSERT (SpinLock != NULL);
     230
     231    LockValue = *SpinLock;
     232    ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue ==
SPIN_LOCK_RELEASED);
     233
     234    _ReadWriteBarrier ();
     235    *SpinLock = SPIN_LOCK_RELEASED;
     236    _ReadWriteBarrier ();
     237
     238    return SpinLock;
     239  }

Fishy. I would have implemented it with another
InterlockedCompareExchangePointer(), and maybe ASSERT()ed on the
original value returned by the InterlockedCompareExchangePointer().
That would be overkill.  However, it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store.  Instead it should be just this:

     ReleaseMemoryFence ();
     *SpinLock = SPIN_LOCK_RELEASED;
No concern that the store might not be atomic?
Not as long as it's a pointer or smaller.
This is why I'd like someone else to write the primitives :)

- Where is this documented?

- Is it architecture-dependent?

- Is it alignment-dependent?

etc... :)

(I'm not challenging your statement of course; I know I should have
known, just not sure where from...)

Thanks,
Laszlo


Ni, Ray
 

Wow! Thank you Paolo for explaining everything!

I read a bit about the C memory model.
https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync

Do you have any documentation that can explain how the different models are implemented (using what instructions)?

I would like to know if there is a better primitive API other than the fences, because MSVC deprecates the barrier APIs.
https://docs.microsoft.com/en-us/cpp/intrinsics/readwritebarrier?view=msvc-160
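
(For what it's worth, a sketch of an MSVC-side full fence that does not
rely on the deprecated barrier intrinsic alone; whether _mm_mfence() by
itself already blocks MSVC's compiler-level reordering is something that
would need verifying:)

#include <intrin.h>

VOID
EFIAPI
MemoryFence (
  VOID
  )
{
  _ReadWriteBarrier ();  // compiler-level barrier (deprecated, but available)
  _mm_mfence ();         // processor-level store-load barrier (MFENCE)
  _ReadWriteBarrier ();
}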



thanks,
ray
________________________________
From: Paolo Bonzini <pbonzini@...>
Sent: Saturday, February 6, 2021 2:17:53 AM
To: Ni, Ray <ray.ni@...>
Cc: edk2-rfc-groups-io <rfc@edk2.groups.io>; Laszlo Ersek <lersek@...>; Ard Biesheuvel <ardb@...>; Andrew Fish <afish@...>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address) <leif@...>; Dong, Eric <eric.dong@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...>
Subject: Re: [edk2-rfc] MemoryFence()

On Fri, Feb 5, 2021, at 19:12, Ni, Ray <ray.ni@...> wrote:
I saw the proposal of fences in the first mail by Laszlo. Please forgive my ignorance. What is asm(“”) in x86? A nop? Then how can a nop help as a processor-level load/store barrier?

On x86 load-load, load-store and store-store ordering is already guaranteed by the processor. Therefore on x86 the AcquireMemoryFence and ReleaseMemoryFence are just like CompilerFence: they only have to block compiler-level reordering. MemoryFence is the only one that blocks store-load reordering and needs to emit an MFENCE instruction.

On ARM (either 32- or 64-bit) the processor-level guarantees are weaker, and you need to emit a "dmb" instruction for acquire and release fences as well.

Paolo



thanks,
ray
________________________________
From: rfc@edk2.groups.io <rfc@edk2.groups.io> on behalf of Paolo Bonzini <pbonzini@...>
Sent: Saturday, February 6, 2021 2:01:14 AM
To: Ni, Ray <ray.ni@...>; Laszlo Ersek <lersek@...>; Ard Biesheuvel <ardb@...>
Cc: Andrew Fish <afish@...>; edk2 RFC list <rfc@edk2.groups.io>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address) <leif@...>; Dong, Eric <eric.dong@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...>
Subject: Re: [edk2-rfc] MemoryFence()

On 05/02/21 18:53, Ni, Ray wrote:
Without calling _ReadWriteBarrier, is it possible that the compiler
generates the assembly in the wrong location? I mean, the compiler may
inline LibWaitForSemaphore and call cmpxchg earlier than the
desired location.
Similarly for LibReleaseSemaphore.

So my understanding is the _ReadWriteBarrier in ReleaseSpinLock is required.
The proposed ReleaseMemoryFence() should already have that effect. All
the proposed fences except CompilerFence() are both compiler
optimization barriers and processor barriers.
InterlockedCompareExchange() is also both a compiler optimization
barrier and a processor barrier

CompilerFence() is just a better name for _ReadWriteBarrier(), it blocks
optimizations but it has no effect at the processor level. It should
only be used (instead of volatile) in busy-waiting loops that do not
always go through an InterlockedCompareExchange.

Paolo

234 _ReadWriteBarrier ();
235 *SpinLock = SPIN_LOCK_RELEASED;
236 _ReadWriteBarrier ();
237
238 return SpinLock;
it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store. Instead it should be just this:

ReleaseMemoryFence ();
*SpinLock = SPIN_LOCK_RELEASED;


Paolo Bonzini <pbonzini@...>
 

On Fri, Feb 5, 2021, at 19:21, Laszlo Ersek <lersek@...> wrote:

ReleaseMemoryFence ();
*SpinLock = SPIN_LOCK_RELEASED;
No concern that the store might not be atomic?
Not as long as it's a pointer or smaller.
This is why I'd like someone else to write the primitives :)
To expand: it is an assumption on the implementation of the C compiler; it
admittedly goes beyond the C standard, but it is a reasonable assumption:
compilers may *merge* loads and stores, but they have no reason to split
aligned loads and stores. Likewise for processors (the pointer limitation
is because sizes larger than the pointer's may not have an instruction
that reads or writes them atomically).

The standard would require you to use C11 atomic loads and stores, but as
far as I understand you cannot assume that they exist on all edk2 compilers.
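
For comparison, the release store would be written like this with C11
atomics (a sketch only, precisely because <stdatomic.h> cannot be
assumed):

#include <stdatomic.h>
#include <stdint.h>

void
ReleaseLock (
  _Atomic uintptr_t  *Lock,
  uintptr_t          Released
  )
{
  //
  // A release store: all earlier loads and stores are ordered before
  // the store to *Lock.
  //
  atomic_store_explicit (Lock, Released, memory_order_release);
}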

- Is it architecture-dependent?
No.

- Is it alignment-dependent?
Unaligned pointers are already undefined behavior so you can ignore how
they are dealt with at the processor level.

Paolo


etc... :)

(I'm not challenging your statement of course; I know I should have
known, just not sure where from...)

Thanks,
Laszlo


Laszlo Ersek
 

On 02/05/21 19:32, Paolo Bonzini wrote:

Unaligned pointers are already undefined behavior so you can ignore how
they are dealt with at the processor level.
My question was unclearly asked, sorry. Let's say we have a UINT32 at an
address that's not a multiple of 4, but a multiple of 2. A pointer to
that UINT32 is "acceptably aligned" on x86, but not "naturally aligned".
Dereferencing the pointer is not undefined (my reading of C99 suggests
that alignment requirements are implementation-defined), but I don't
know if the atomicity guarantee holds.

Another example; we may have a pointer to a packed structure, and we
might want to poke at a UINT32 field in that structure. Not through a
naked pointer-to-UINT32 of course, which would throw away the
packed-ness, but really through the pointer-to-the-whole-packed-struct.

Thanks
Laszlo


Laszlo Ersek
 

On 02/05/21 19:17, Paolo Bonzini wrote:

On x86 load-load, load-store and store-store ordering is already guaranteed
by the processor. Therefore on x86 the AcquireMemoryFence and
ReleaseMemoryFence are just like CompilerFence: they only have to block
compiler-level reordering. MemoryFence is the only one that blocks
store-load reordering and needs to emit an MFENCE instruction.
This is exactly what the fence API implementations should have in their
comments, preferably with Intel / AMD manual references...

On ARM (either 32- or 64-bit) the processor-level guarantees are weaker,
and you need to emit a "dmb" instruction for acquire and release fences as
well.
Same, just with the ARM ARM.

Thanks!
Laszlo


Andrew Fish <afish@...>
 

On Feb 5, 2021, at 10:11 AM, Ni, Ray <ray.ni@...> wrote:

I saw the proposal of fences in the first mail by Laszlo. Please forgive my ignorance. What is asm(“”) in x86? A nop? Then how can a nop help as a processor-level load/store barrier?
Ray,

In C, calling out to assembly is also a barrier/fence operation from the compiler's point of view. Actually, calling an indirect procedure call (gBS->*, Protocol->*) is also a barrier. The compiler has no idea what the assembly code is doing across the boundary, so all the operations need to complete prior to calling the assembly code (indirect procedure call). I guess a binary static lib is in this list too. In reality it is anything the C compiler can’t link-time optimize through.

For gcc-flavored compilers, any __asm__ call is a compiler read/write barrier. That is why you see __asm__ __volatile__ ("":::"memory”); in the memory fence. That means for gcc/clang any synchronization primitive implemented in inline assembler is also a compiler barrier/fence wrapping that assembly operation.

The VC++ inline assembly seems to be “more integrated” with the compiler, so I’m not sure what the rules are for that. Someone should really investigate that. Something tells me you will not find those answers on the Linux mailing list :).

The reason the MMIO operations are wrapped in _ReadWriteBarrier ()/__asm__ __volatile__ ("":::"memory”) has to do with the order of operations.
1) The leading barrier forces all the code before the call to complete the read/writes prior to the operation.
2) The trailing barrier forces the MMIO operation to complete before unrelated read writes that come after the call that could get reordered in optimization.

As I mentioned in this thread I’m wondering if VC++ has extra behaviors around volatile that are compiler dependent and that is why we have not seen a lot of issues to date?

I’m really glad this topic came up. It really seems like something we need to work through...

Thanks,

Andrew Fish


thanks,
ray
From: rfc@edk2.groups.io <rfc@edk2.groups.io> on behalf of Paolo Bonzini <pbonzini@...>
Sent: Saturday, February 6, 2021 2:01:14 AM
To: Ni, Ray <ray.ni@...>; Laszlo Ersek <lersek@...>; Ard Biesheuvel <ardb@...>
Cc: Andrew Fish <afish@...>; edk2 RFC list <rfc@edk2.groups.io>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address) <leif@...>; Dong, Eric <eric.dong@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...>
Subject: Re: [edk2-rfc] MemoryFence()

On 05/02/21 18:53, Ni, Ray wrote:
Without calling _ReadWriteBarrier, is it possible that the compiler
generates the assembly in the wrong location? I mean, the compiler may
inline LibWaitForSemaphore and call cmpxchg earlier than the
desired location.
Similarly for LibReleaseSemaphore.

So my understanding is the _ReadWriteBarrier in ReleaseSpinLock is required.
The proposed ReleaseMemoryFence() should already have that effect. All
the proposed fences except CompilerFence() are both compiler
optimization barriers and processor barriers.
InterlockedCompareExchange() is also both a compiler optimization
barrier and a processor barrier

CompilerFence() is just a better name for _ReadWriteBarrier(), it blocks
optimizations but it has no effect at the processor level. It should
only be used (instead of volatile) in busy-waiting loops that do not
always go through an InterlockedCompareExchange.

Paolo

234 _ReadWriteBarrier ();
235 *SpinLock = SPIN_LOCK_RELEASED;
236 _ReadWriteBarrier ();
237
238 return SpinLock;
it *is* buggy because it is missing a
(processor) barrier on non-x86 architectures and has a useless barrier
after the store. Instead it should be just this:

ReleaseMemoryFence ();
*SpinLock = SPIN_LOCK_RELEASED;





Andrew Fish <afish@...>
 

On Feb 5, 2021, at 10:47 AM, Laszlo Ersek <lersek@...> wrote:

On 02/05/21 19:32, Paolo Bonzini wrote:

Unaligned pointers are already undefined behavior so you can ignore how
they are dealt with at the processor level.
My question was unclearly asked, sorry. Let's say we have a UINT32 at an
address that's not a multiple of 4, but a multiple of 2. A pointer to
that UINT32 is "acceptably aligned" on x86, but not "naturally aligned".
Dereferencing the pointer is not undefined (my reading of C99 suggests
that alignment requirements are implementation-defined), but I don't
know if the atomicity guarantee holds.
Laszlo,

Clang treats unaligned pointers as undefined behavior (UB). I’m not sure if this is due to C11? What I know is clang chooses NOT to optimize away UB from alignment errors, but if you run the address sanitizer you get errors.

For fun one day I turned on the clang address sanitizer with our edk2 firmware and had it emit UD2 on faults so it did not require a runtime. It turned out our debugger stub had lots of alignment issues :( so I kind of gave up at that point :).

Thanks,

Andrew Fish

Another example; we may have a pointer to a packed structure, and we
might want to poke at a UINT32 field in that structure. Not through a
naked pointer-to-UINT32 of course, which would throw away the
packed-ness, but really through the pointer-to-the-whole-packed-struct.

Thanks
Laszlo