Re: MemoryFence()
Andrew Fish <afish@...>
Ray,
toggle quoted message
Show quoted text
Yes I was writing a mail when I wrote the bugzilla…. Ray, Function calls are not compiler fences. The compiler just optimized the code way as dead code. For non VC++ we have a real compiler fence/barrier for MemoryFence(). I filed a bugzilla [1] so we can investigate this some more. I hit a real bug with clang using the MMIO functions a long time ago and I think that might be why the GCC version of the MemoryFence() got the compiler fence/barrier. If I had to guess VC++ is probably more conservative about volatiles and that is why we have not seen issue with VC++, but that is depending on non standard behavior. Well this probably explains why we don’t have many MemoryFence() calls in the x86 CPU code, given they don’t do anything on VC++. I have noticed from debugging MP code that was developed using VC++, that we seem to hit more issues when compiled via clang. I wonder if the behavior difference with volatile would explain some of those? VOID EFIAPI MemoryFence ( VOID ) { return; } And: VOID EFIAPI MemoryFence ( VOID ) { // This is a little bit of overkill and it is more about the compiler that it is // actually processor synchronization. This is like the _ReadWriteBarrier // Microsoft specific intrinsic __asm__ __volatile__ ("":::"memory"); } So I think for VC++ it should be… VOID EFIAPI MemoryFence ( VOID ) { _ReadWriteBarrier(); return; } [1] https://bugzilla.tianocore.org/show_bug.cgi?id=3213 Thanks, Andrew Fish On Feb 5, 2021, at 9:53 AM, Ni, Ray <ray.ni@...> wrote: |
|
Re: MemoryFence()
Ni, Ray
I saw the proposal of fences in first mail by Laszlo. Please forgive my ignorance. What is asm(“”) in x86? A nop? The how a nop can help as a processor level load store barrier?
thanks, ray ________________________________ 发件人: rfc@edk2.groups.io <rfc@edk2.groups.io> 代表 Paolo Bonzini <pbonzini@...> 发送时间: Saturday, February 6, 2021 2:01:14 AM 收件人: Ni, Ray <ray.ni@...>; Laszlo Ersek <lersek@...>; Ard Biesheuvel <ardb@...> 抄送: Andrew Fish <afish@...>; edk2 RFC list <rfc@edk2.groups.io>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address) <leif@...>; Dong, Eric <eric.dong@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...> 主题: Re: [edk2-rfc] MemoryFence() On 05/02/21 18:53, Ni, Ray wrote: Without calling _ReadWriteBarrier, is it possible that compilerThe proposed ReleaseMemoryFence() should already have that effect. All the proposed fences except CompilerFence() are both compiler optimization barriers and processor barriers. InterlockedCompareExchange() is also both a compiler optimization barrier and a processor barrier CompilerFence() is just a better name for _ReadWriteBarrier(), it blocks optimizations but it has no effect at the processor level. It should only be used (instead of volatile) in busy-waiting loops that do not always go through an InterlockedCompareExchange. Paolo it *is* buggy because it is missing a234 _ReadWriteBarrier (); |
|
Re: MemoryFence()
Ni, Ray
Andrew,
I just read the link you provided. It recommends to use mutex semaphore spin lock over volatile. But the funny thing here is we are talking about how to implement the mutex semaphore spin lock in edk2 Maybe we need to rely on built in atomic functions but not sure these are gcc only? https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html thanks, ray ________________________________ 发件人: Andrew Fish <afish@...> 发送时间: Saturday, February 6, 2021 1:48:50 AM 收件人: Ni, Ray <ray.ni@...> 抄送: Laszlo Ersek <lersek@...>; edk2 RFC list <rfc@edk2.groups.io>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address) <leif@...>; Ard Biesheuvel (kernel.org address) <ardb@...>; Dong, Eric <eric.dong@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...>; Paolo Bonzini <pbonzini@...> 主题: Re: MemoryFence() On Feb 5, 2021, at 12:25 AM, Ni, Ray <ray.ni@...<mailto:ray.ni@...>> wrote: So if I understand correctlyI agree with this summary. Volatile using in MpInitLib or PiSmmCpu driver cannot be emitted because different CPU threads change the same memory content to synchronize with each other. I don’t quite understand if removing “volatile” what mechanism can be used to force the compiler generating code that always reads from memory? Ray, I agree the C definition of volatile is not thread safe. I was doing some research on volatile and I realize the Linux kernel guides recommend against using volatile [1]. I think it has to do with people confusing volatile for being thread safe. So my guess is it is more of a coding standard to encourage people to use synchronization primitives. The primitives imply compiler fences which indirectly enforces the same behavior as volatile. [1] https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html Thanks, Andrew Fish Thanks, Ray |
|
Re: MemoryFence()
Paolo Bonzini <pbonzini@...>
On 05/02/21 18:53, Ni, Ray wrote:
Without calling _ReadWriteBarrier, is it possible that compiler generates the assembly in the wrong location? I mean the compiler may in-line the LibWaitForSemaphore and call cmpxchg earlier than the desired location.The proposed ReleaseMemoryFence() should already have that effect. All the proposed fences except CompilerFence() are both compiler optimization barriers and processor barriers. InterlockedCompareExchange() is also both a compiler optimization barrier and a processor barrier CompilerFence() is just a better name for _ReadWriteBarrier(), it blocks optimizations but it has no effect at the processor level. It should only be used (instead of volatile) in busy-waiting loops that do not always go through an InterlockedCompareExchange. Paolo it *is* buggy because it is missing a234 _ReadWriteBarrier (); |
|
Re: MemoryFence()
Ni, Ray
Without calling _ReadWriteBarrier, is it possible that compiler generates the assembly in the wrong location? I mean the compiler may in-line the LibWaitForSemaphore and call cmpxchg earlier than the desired location.
Similar to LibReleaseSemaphore. So my understanding is the _ReadWriteBarrier in ReleaseSpinLock is required. I think Andrew also has the same thoughts. thanks, ray ________________________________ 发件人: Paolo Bonzini <pbonzini@...> 发送时间: Saturday, February 6, 2021 1:34:15 AM 收件人: Laszlo Ersek <lersek@...>; Ni, Ray <ray.ni@...>; Ard Biesheuvel <ardb@...> 抄送: Andrew Fish <afish@...>; edk2 RFC list <rfc@edk2.groups.io>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address) <leif@...>; Dong, Eric <eric.dong@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...> 主题: Re: MemoryFence() On 05/02/21 18:31, Laszlo Ersek wrote: On 02/05/21 18:21, Paolo Bonzini wrote:Not as long as it's a pointer or smaller.On 05/02/21 17:56, Laszlo Ersek wrote:No concern that the store might not be atomic?221 SPIN_LOCK *That would be overkill. However, it *is* buggy because it is missing a Paolo |
|
Re: MemoryFence()
Andrew Fish <afish@...>
On Feb 5, 2021, at 7:38 AM, Paolo Bonzini <pbonzini@...> wrote:We do link time optimization on VC++ and clang. Thanks, Andrew Fish Paolo |
|
Re: MemoryFence()
Andrew Fish <afish@...>
On Feb 5, 2021, at 12:25 AM, Ni, Ray <ray.ni@...> wrote:Ray,So if I understand correctlyI agree with this summary. I agree the C definition of volatile is not thread safe. I was doing some research on volatile and I realize the Linux kernel guides recommend against using volatile [1]. I think it has to do with people confusing volatile for being thread safe. So my guess is it is more of a coding standard to encourage people to use synchronization primitives. The primitives imply compiler fences which indirectly enforces the same behavior as volatile. [1] https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html Thanks, Andrew Fish
|
|
Re: MemoryFence()
Paolo Bonzini <pbonzini@...>
On 05/02/21 18:31, Laszlo Ersek wrote:
On 02/05/21 18:21, Paolo Bonzini wrote:Not as long as it's a pointer or smaller.On 05/02/21 17:56, Laszlo Ersek wrote:No concern that the store might not be atomic?221 SPIN_LOCK *That would be overkill. However, it *is* buggy because it is missing a Paolo |
|
Re: MemoryFence()
Andrew Fish <afish@...>
On Feb 4, 2021, at 11:48 PM, Laszlo Ersek <lersek@...> wrote:OK I see Linux recommends not using volatile [1], so I think this makes sense to me. It is not that volatile is not useful, but it is not thread safe, so not using volatile encourages people to use primitives to make things thread safe. So not using volatile is more of a coding convention. - the two MemoryFence() calls should be removedThey may need to turn into CompilerFence(). - an AcquireMemoryFence() call should be inserted into the loop, afterCompilerFence (); do { Value = *Address; AcquireMemoryFence(); } while (Value == 0xFF) ; I think this is where x86 makes things confusing with its coherency all the time motto. On x86 I think they are really CompilerFences() to force C memory model order of operations. They got added to fix real bugs on x86. Remember the compiler can still reorder a volatile access withs surrounding code as an optimization. So for example if MmioWrite8() has MemoryFence() replaced with CompilerFence() these operations happen in order. If there is no CompilerFence() the order is implementation dependent. UINT8 *A, *B, *C; *A = 1; MmioWrite8 (B, 2); *C = 3; So that is why we want the CompilerFence() + volatile for the MMIO. For example you could end up writing to a Common DMA buffer and then doing an MMIO write. You might want those to happen in sequence and you need the CompilerFence() At least the CPU code is CPU specific so the rules should be more well defined :). [1] https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html Thanks, Andrew Fish Thanks |
|
Re: MemoryFence()
Laszlo Ersek
On 02/05/21 18:21, Paolo Bonzini wrote:
On 05/02/21 17:56, Laszlo Ersek wrote: No concern that the store might not be atomic?221 SPIN_LOCK *That would be overkill. However, it *is* buggy because it is missing a Thanks Laszlo |
|
Re: MemoryFence()
Paolo Bonzini <pbonzini@...>
On 05/02/21 17:56, Laszlo Ersek wrote:
On 02/05/21 17:04, Ni, Ray wrote:Agreed. I would prefer an explicit compiler fence before the "Value = *Sem" assignment and no volatile anywhere, but it is okay. 193 LockValue = *SpinLock;The _ReadWriteBarrier (aka compiler fence) is pointless too; it is implicit in the InterlockedCompareExchangePointer. 781 VOIDYep. 221 SPIN_LOCK *That would be overkill. However, it *is* buggy because it is missing a (processor) barrier on non-x86 architectures and has a useless barrier after the store. Instead it should be just this: ReleaseMemoryFence (); *SpinLock = SPIN_LOCK_RELEASED; Paolo |
|
Re: MemoryFence()
Laszlo Ersek
On 02/05/21 17:04, Ni, Ray wrote:
Maybe that explains why the MemoryFence was implemented soLet's focus on the "down" operations first. At commit c6be6dab9c4b: 789 /** 790 Decrement the semaphore by 1 if it is not zero. 791 792 Performs an atomic decrement operation for semaphore. 793 The compare exchange operation must be performed using 794 MP safe mechanisms. 795 796 @param Sem IN: 32-bit unsigned integer 797 798 **/ 799 VOID 800 LibWaitForSemaphore ( 801 IN OUT volatile UINT32 *Sem 802 ) 803 { 804 UINT32 Value; 805 806 do { 807 Value = *Sem; 808 } while (Value == 0 || 809 InterlockedCompareExchange32 ( 810 Sem, 811 Value, 812 Value - 1 813 ) != Value); 814 } But the AcquireSpinLockOrFail and ReleaseSpinLock inI think LibWaitForSemaphore() is safe. (Which is not identical to saying "it's in the best style".) The volatile ensures there is a load instruction, and that the compiler generates the load instruction in the right place. The volatile does not ensure that the data fetched is neither corrupt nor stale -- the load could fetch a partially updated value or a stale value, into "Value". But in this particular instance that should be no problem: - if we see a corrupt or stale zero value, we'll just continue spinning; the value will stabilize eventually - if we see a corrupt or stale *nonzero* value, then when we attempt to decrement it with a safe InterlockedCompareExchange32(), we'll just fail. So the volatile fetch can be considered an "initial guess" that works most of the time. (BTW, PiSmmCpuDxeSmm has a more recent variant that is less power-hungry; see commit 9001b750df64 ("UefiCpuPkg/PiSmmCpuDxeSmm: pause in WaitForSemaphore() before re-fetch", 2020-07-31).) The implementation seen in "MdePkg/Library/BaseSynchronizationLib/SynchronizationMsc.c" is: 165 /** 166 Attempts to place a spin lock in the acquired state. 167 168 This function checks the state of the spin lock specified by SpinLock. If 169 SpinLock is in the released state, then this function places SpinLock in the 170 acquired state and returns TRUE. Otherwise, FALSE is returned. All state 171 transitions of SpinLock must be performed using MP safe mechanisms. 172 173 If SpinLock is NULL, then ASSERT(). 174 If SpinLock was not initialized with InitializeSpinLock(), then ASSERT(). 175 176 @param SpinLock A pointer to the spin lock to place in the acquired state. 177 178 @retval TRUE SpinLock was placed in the acquired state. 179 @retval FALSE SpinLock could not be acquired. 180 181 **/ 182 BOOLEAN 183 EFIAPI 184 AcquireSpinLockOrFail ( 185 IN OUT SPIN_LOCK *SpinLock 186 ) 187 { 188 SPIN_LOCK LockValue; 189 VOID *Result; 190 191 ASSERT (SpinLock != NULL); 192 193 LockValue = *SpinLock; 194 ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue == SPIN_LOCK_RELEASED); 195 196 _ReadWriteBarrier (); 197 Result = InterlockedCompareExchangePointer ( 198 (VOID**)SpinLock, 199 (VOID*)SPIN_LOCK_RELEASED, 200 (VOID*)SPIN_LOCK_ACQUIRED 201 ); 202 203 _ReadWriteBarrier (); 204 return (BOOLEAN) (Result == (VOID*) SPIN_LOCK_RELEASED); 205 } I think the explicit barriers are better style (more helpful for the programmer, and in general, possibly better performance). On the other hand (independently of what you asked), the AcquireSpinLockOrFail() logic seems a bit fishy to me. First, the ASSERT() deserves a comment, at the bare minimum. ("Even if the data is stale or partially updated, we can't conjure up a completely bogus value", or some such.) Second, the rest of the function doesn't even depend on LockValue. Having the LockValue / ASSERT() stuff seems a bit pointless to me. --*-- Now, regarding the "up" operations: [UefiCpuPkg/Library/RegisterCpuFeaturesLib/CpuFeaturesInitialize.c] 775 /** 776 Increment semaphore by 1. 777 778 @param Sem IN: 32-bit unsigned integer 779 780 **/ 781 VOID 782 LibReleaseSemaphore ( 783 IN OUT volatile UINT32 *Sem 784 ) 785 { 786 InterlockedIncrement (Sem); 787 } This seems OK to me. (In an ideal world, InterlockedIncrement() should internally enforce a compiler barrier too, so the volatile should not be needed.) [MdePkg/Library/BaseSynchronizationLib/SynchronizationMsc.c] 207 /** 208 Releases a spin lock. 209 210 This function places the spin lock specified by SpinLock in the release state 211 and returns SpinLock. 212 213 If SpinLock is NULL, then ASSERT(). 214 If SpinLock was not initialized with InitializeSpinLock(), then ASSERT(). 215 216 @param SpinLock A pointer to the spin lock to release. 217 218 @return SpinLock released the lock. 219 220 **/ 221 SPIN_LOCK * 222 EFIAPI 223 ReleaseSpinLock ( 224 IN OUT SPIN_LOCK *SpinLock 225 ) 226 { 227 SPIN_LOCK LockValue; 228 229 ASSERT (SpinLock != NULL); 230 231 LockValue = *SpinLock; 232 ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue == SPIN_LOCK_RELEASED); 233 234 _ReadWriteBarrier (); 235 *SpinLock = SPIN_LOCK_RELEASED; 236 _ReadWriteBarrier (); 237 238 return SpinLock; 239 } Fishy. I would have implemented it with another InterlockedCompareExchangePointer(), and maybe ASSERT()ed on the original value returned by the InterlockedCompareExchangePointer(). Laszlo |
|
Re: MemoryFence()
Ni, Ray
Maybe that explains why the MemoryFence was implemented so confidently.
I trust Mike:) In UefiCpuPkg/Library/RegisterCpuFeaturesLib/CpuFeaturesInitialize.c, the LibWaitForSemaphore, LibReleaseSemaphore are coded without the read write barrier. But the AcquireSpinLockOrFail and ReleaseSpinLock in MdePkg/Library/BaseSynchronizationLib/SynchronizationMsc.c contain read write barrier. Is the former implementation wrong because without the barrier the cmpxchg can happen earlier or later chosen by the compiler? Does volatile keyword help here? thanks, ray ________________________________ 发件人: Paolo Bonzini <pbonzini@...> 发送时间: Friday, February 5, 2021 11:42:36 PM 收件人: Ard Biesheuvel <ardb@...> 抄送: Laszlo Ersek <lersek@...>; Ni, Ray <ray.ni@...>; Andrew Fish <afish@...>; edk2 RFC list <rfc@edk2.groups.io>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address) <leif@...>; Dong, Eric <eric.dong@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...> 主题: Re: MemoryFence() On 05/02/21 16:39, Ard Biesheuvel wrote: On Fri, 5 Feb 2021 at 16:38, Paolo Bonzini <pbonzini@...> wrote:Oops... Well, it *was* okay at the time MemoryFence() was firstWe rely heavily on LTO for both Visual Studio and GNU/Clang toolchains. implemented. :( Paolo |
|
Re: MemoryFence()
Paolo Bonzini <pbonzini@...>
On 05/02/21 16:39, Ard Biesheuvel wrote:
On Fri, 5 Feb 2021 at 16:38, Paolo Bonzini <pbonzini@...> wrote:Oops... Well, it *was* okay at the time MemoryFence() was first implemented. :(We rely heavily on LTO for both Visual Studio and GNU/Clang toolchains. Paolo |
|
Re: MemoryFence()
Ard Biesheuvel <ardb@...>
On Fri, 5 Feb 2021 at 16:38, Paolo Bonzini <pbonzini@...> wrote:
We rely heavily on LTO for both Visual Studio and GNU/Clang toolchains. |
|
Re: MemoryFence()
Paolo Bonzini <pbonzini@...>
On 05/02/21 16:34, Laszlo Ersek wrote:
And your code is wrong because the*original*, namely the MSFTI think it's okay-ish (it works as long as you don't do link-time optimization, but it's less efficient than a compiler intrinsic) because X86MemoryFence.c is a separate file. Paolo |
|
Re: MemoryFence()
Laszlo Ersek
On 02/05/21 14:15, Ni, Ray wrote:
This is very interesting. I am trying to involve the discussion toThat's because your CompilerFence() function is not correct. You did: void CompilerFence () {which is wrong. It enforces a C-language sequence point, but it does not implement a compiler barrier. And your code is wrong because the *original*, namely the MSFT implementation of MemoryFence() in edk2, is bugged: * MdePkg/Library/BaseLib/X86MemoryFence.c: /**This code comes from historical commit e1f414b6a7d8 ("Import some basic libraries instances for Mde Packages.", 2007-06-22). * MdePkg/Library/BaseLib/BaseLib.inf: [Sources.Ia32]Back to your email: If I replace CompilerFence() with _ReadBarrier(), compiler canThat's precisely the case. This actually explains why MpInitLib and PiSmmCpuDxeSmm use "volatile", rather than MemoryFence(). (This is one of those info bits that I've been missing.) The reason is that, when built with Visual Studio, MemoryFence() does absolutely nothing. So if MpInitLib and PiSmmCpuDxeSmm used MemoryFence() rather than "volatile", things would break immediately. Thank you Ray for highlighting this, I couldn't have imagined. The current status means that, considering all toolchains and all architectures that edk2 supports, together, it's been *impossible* to use MemoryFence() correctly. And so my defensive thinking "whenever in doubt, use 'volatile' PLUS MemoryFence()" apparently paid off, because when built with Visual Studio, OVMF is left with 'volatile' *only*; in effect. It also means that in Ankur's upcoming v7 series, relying *just* on MemoryFence() will not suffice. We need 'volatile' (given the current state of primitives); otherwise we'll have no barriers at all. ... I don't even know where to start changing the code. :( Laszlo |
|
Re: MemoryFence()
Laszlo Ersek
On 02/05/21 09:25, Ni, Ray wrote:
A compiler barrier (in case that's sufficient), or even a memory fenceSo if I understand correctlyI agree with this summary. -- seems more appropriate for inter-CPU synchronization -- which I think should be implemented to automatically force a compiler barrier too. Laszlo |
|
Re: MemoryFence()
Laszlo Ersek
On 02/05/21 10:12, Paolo Bonzini wrote:
On 05/02/21 08:48, Laszlo Ersek wrote:Ah, thanks; also good example where a CompilerFence() would be justified.This is correct, alternatively you could have this:MemoryFence ();If Address points to a byte in RAM that's manipulated by Laszlo |
|
Re: MemoryFence()
Paolo Bonzini <pbonzini@...>
On 05/02/21 14:15, Ni, Ray wrote:
I did some experiment.This should be #define CompilerFence _ReadWriteBarrier Paolo } |
|