Hi All,

here's the declaration of MemoryFence(), from <BaseLib.h>:

/**
  Used to serialize load and store operations.

  All loads and stores that proceed calls to this function are guaranteed to
  be globally visible when this function returns.
**/
VOID
EFIAPI
MemoryFence (
  VOID
  );

The IA32 and X64 implementations of MemoryFence() are not memory barriers, only compiler barriers, however.

This is a problem for two reasons:

- code that believes to add a memory barrier only adds a compiler barrier, possibly creating an unsafe situation,

- code that knowingly calls MemoryFence() for the purposes of a compiler barrier -- while it will hopefully work safely, relying on tacit knowledge / assumptions about the x86 architecture -- results in a source reading experience that mismatches known lockless patterns.

This topic has repeatedly come up in the past; most recently in the review of Ankur's v6 series for <https://bugzilla.tianocore.org/show_bug.cgi?id=3132>, where some inter-processor communication needs to happen.

I asked Paolo for help with understanding some of the synchronization patterns in PiSmmCpuDxeSmm, the reason being that Ankur's series is going to rely on, and/or imitate, some of those patterns. From this discussion with Paolo, the following have emerged (I'm going to liberally steal / paraphrase Paolo's words):

(1) We should introduce finer-grained fence primitives:

                       ARM       AARCH64    i386
CompilerFence()        asm("")   asm("")    asm("")
AcquireMemoryFence()   dmb ish   dmb ishld  asm("")
ReleaseMemoryFence()   dmb ish   dmb ish    asm("")
MemoryFence()          dmb ish   dmb ish    mfence

"where AcquireMemoryFence() is used on the read side (i.e. between reads) and ReleaseMemoryFence() is used on the write side (i.e. between writes)".

(Note that this step would make MemoryFence() heavier-weight on x86 than it currently is.)

(2) We should audit and rework existent uses (combinations) of MemoryFence() into matched acquire / release pairs.

Less importantly, this would restore the x86 behavior (performance) to the one seen with the current MemoryFence() implementation.

More importantly, it would fix what the code *means*: "it's very important to stick to known lockless patterns, and having matching acquire/release fences makes it much more intuitive to understand what's going on".

--*--

I'm not proposing this as a pre-requisite to merging Ankur's series (i.e., I don't expect the series to use acquire/release -- it can be converted later with the rest of the audit).

However, I'd really like us to stop wasting more time on MemoryFence() doubts such as:

- Hey it's not a memory fence at all, is it safe?

- But on x86 it kinda is, *dependent on* the spinlocks / lock prefixes we use for synchronization, between the other accesses.

- Is it interchangeable with volatile? Yes? No? Which one should we use?

In the longer term, multi-processing code like MpInitLib (CpuDxe, CpuMpPei) and PiSmmCpuDxeSmm should likely abandon all use of "volatile", and use the explicit barriers instead, for *clarity*.

Step (2) would take quite a bit of thinking (if not much code).

Comments?

Would there be interest in reviewing such work?

Thanks
Laszlo
Ard Biesheuvel <ardb@...>
On Thu, 4 Feb 2021 at 19:09, Laszlo Ersek <lersek@...> wrote:

Hi All,
here's the declaration of MemoryFence(), from <BaseLib.h>:
/** Used to serialize load and store operations.
All loads and stores that proceed calls to this function are guaranteed to be globally visible when this function returns.
**/
VOID
EFIAPI
MemoryFence (
  VOID
  );

The IA32 and X64 implementations of MemoryFence() are not memory barriers, only compiler barriers, however.
This is a problem for two reasons:
- code that believes to add a memory barrier only adds a compiler barrier, possibly creating an unsafe situation,
- code that knowingly calls MemoryFence() for the purposes of a compiler barrier -- while it will hopefully work safely, relying on tacit knowledge / assumptions about the x86 architecture -- results in a source reading experience that mismatches known lockless patterns.
This topic has repeatedly come up in the past; most recently in the review of Ankur's v6 series for <https://bugzilla.tianocore.org/show_bug.cgi?id=3132>, where some inter-processor communication needs to happen.
I asked Paolo for help with understanding some of the synchronization patterns in PiSmmCpuDxeSmm, the reason being that Ankur's series is going to rely on, and/or imitate, some of those patterns. From this discussion with Paolo, the following have emerged (I'm going to liberally steal / paraphrase Paolo's words):
(1) We should introduce finer-grained fence primitives:
                       ARM       AARCH64    i386
CompilerFence()        asm("")   asm("")    asm("")
AcquireMemoryFence()   dmb ish   dmb ishld  asm("")
ReleaseMemoryFence()   dmb ish   dmb ish    asm("")
MemoryFence()          dmb ish   dmb ish    mfence
"where AcquireMemoryFence() is used on the read side (i.e. between reads) and ReleaseMemoryFence() is used on the write side (i.e. between writes)".
Acquire semantics typically order writes before reads, not /between/ reads. Similarly, release semantics imply that all outstanding writes complete before a barrier with acquire semantics is permitted to complete.

Note that reasoning about this only makes sense in the context of concurrency, i.e., different CPUs operating on the same memory (or coherent DMA masters).

For non-coherent DMA, the 'ish' variants are not appropriate, and given the low likelihood that any of this creates a performance bottleneck, I would suggest using only full barriers on ARM.

(Note that this step would make MemoryFence() heavier-weight on x86 than it currently is.)
(2) We should audit and rework existent uses (combinations) of MemoryFence() into matched acquire / release pairs.
It is very important to be *aware* of the acquire/release semantics, but I don't think it is necessary to use all the fine-grained barrier types in EDK2.

Less importantly, this would restore the x86 behavior (performance) to the one seen with the current MemoryFence() implementation.
More importantly, it would fix what the code *means*: "it's very important to stick to known lockless patterns, and having matching acquire/release fences makes it much more intuitive to understand what's going on".
--*--
I'm not proposing this as a pre-requisite to merging Ankur's series (i.e., I don't expect the series to use acquire/release -- it can be converted later with the rest of the audit).
However, I'd really like us to stop wasting more time on MemoryFence() doubts such as:
- Hey it's not a memory fence at all, is it safe?
Who decides what 'memory fence' means anyway?

- But on x86 it kinda is, *dependent on* the spinlocks / lock prefixes we use for synchronization, between the other accesses.
- Is it interchangeable with volatile? Yes? No? which one should we use?
'volatile' affects code generation only, i.e., which loads and stores are issued. The *order* in which those loads observe stores done by other masters is what is managed by barrier instructions. So they are two sides of the same coin: 'volatile' may help to guarantee that every assignment through a pointer variable results in a store instruction, but how these are observed by other CPUs still needs to be managed if it needs to occur in a certain way.

I think volatile is the wrong tool in most cases: the ordering of accesses is not a property of the type or of the variable, but of the code sequence. So using asm("") to ensure that an assignment through a pointer is emitted as a store instruction, followed by a barrier instruction, is a typical pattern. And given the fact that barrier instructions require asm sequences in most cases anyway, the asm("") will often be implied.

In the longer term, multi-processing code like MpInitLib (CpuDxe, CpuMpPei) and PiSmmCpuDxeSmm should likely abandon all use of "volatile", and use the explicit barriers instead, for *clarity*.
Step (2) would take quite a bit of thinking (if not much code).
Comments?
Would there be interest in reviewing such work?
Thanks Laszlo
Paolo Bonzini <pbonzini@...>
On Thu, 4 Feb 2021 at 20:46, Ard Biesheuvel <ardb@...> wrote:

(1) We should introduce finer-grained fence primitives:
                       ARM       AARCH64    i386
CompilerFence()        asm("")   asm("")    asm("")
AcquireMemoryFence()   dmb ish   dmb ishld  asm("")
ReleaseMemoryFence()   dmb ish   dmb ish    asm("")
MemoryFence()          dmb ish   dmb ish    mfence
"where AcquireMemoryFence() is used on the read side (i.e. between reads) and ReleaseMemoryFence() is used on the write side (i.e. between writes)".
Acquire semantics typically order writes before reads, not /between/ reads. Similarly, release semantics imply that all outstanding writes complete before a barrier with acquire semantics is permitted to complete.
Acquire fences are barriers between earlier loads and subsequent loads and stores; those earlier loads then synchronize with release stores in other threads. Release fences are barriers between earlier loads and stores and subsequent stores, and those subsequent stores synchronize with acquire loads in other threads. In both cases, however, fences only make sense between memory operations. So something like "after reads" and "before writes" would have been more precise in some sense, but in practice the usual idiom is "between" reads/writes, as Laszlo wrote.

Note that reasoning about this only makes sense in the context of concurrency, i.e., different CPUs operating on the same memory (or coherent DMA masters)
For non-coherent DMA, the 'ish' variants are not appropriate, and given the low likelihood that any of this creates a performance bottleneck, I would suggest using only full barriers on ARM.
Sure, that's a matter of how to implement the primitives. If you think that non-coherent DMA is important, a full dmb can be used too.

As far as the compiler is concerned, an asm in the macros *should* block enough optimizations, even without making the accesses volatile. CompilerFence (or the edk2 equivalent of cpu_relax, whose name escapes me right now) would be necessary in the body of busy-waiting loops. However, somebody should check the MSVC docs for asm, too.

It is very important to be *aware* of the acquire/release semantics, but I don't think it is necessary to use all the fine grained barrier types in EDK2.
I agree as long as the primitives are self-documenting. A single MemoryFence() does not make it clear in which direction the data is flowing (whether from other threads to this one, or vice versa).

Paolo

Less importantly, this would restore the x86 behavior (performance) to the one seen with the current MemoryFence() implementation.
More importantly, it would fix what the code *means*: "it's very important to stick to known lockless patterns, and having matching acquire/release fences makes it much more intuitive to understand what's going on".
--*--
I'm not proposing this as a pre-requisite to merging Ankur's series (i.e., I don't expect the series to use acquire/release -- it can be converted later with the rest of the audit).
However, I'd really like us to stop wasting more time on MemoryFence() doubts such as:
- Hey it's not a memory fence at all, is it safe?
Who decides what 'memory fence' means anyway?
- But on x86 it kinda is, *dependent on* the spinlocks / lock prefixes we use for synchronization, between the other accesses.
- Is it interchangeable with volatile? Yes? No? which one should we use?
'volatile' affects code generation only, i.e., which loads and stores are issued. The *order* in which those loads observe stores done by other masters is what is managed by barrier instructions. So they are two sides of the same coin: 'volatile' may help to guarantee that every assignment through a pointer variable results in a store instruction, but how these are observed by other CPUs still needs to be managed if it needs to occur in a certain way.
I think volatile is the wrong tool in most cases: the ordering of accesses is not a property of the type or of the variable, but of the code sequence. So using asm("") to ensure that an assignment through a pointer is emitted as a store instruction, followed by a barrier instruction, is a typical pattern. And given the fact that barrier instructions require asm sequences in most cases anyway, the asm("") will often be implied.
In the longer term, multi-processing code like MpInitLib (CpuDxe, CpuMpPei) and PiSmmCpuDxeSmm should likely abandon all use of "volatile", and use the explicit barriers instead, for *clarity*.
Step (2) would take quite a bit of thinking (if not much code).
Comments?
Would there be interest in reviewing such work?
Thanks Laszlo
I agree with Ard (I'm not a contributor to EDK2, but thought I'd add my 2 cents anyway). `volatile` should only be used if a pointer load/store is utilized in a context where you expect the external environment (e.g. hardware) to alter the result of such an op. In the case of fences, volatile, I think, doesn't really do much in terms of helping.

For fences, x86 has the LFENCE instruction for load fences and SFENCE for store fences. If I'm not mistaken, MFENCE, then, is just a combination of the other two. Would it not then be more appropriate to use an LFENCE for a load (release?) fence, and SFENCE for a store (acquire?) fence? (I ask for release/acquire because I might have these mixed up.) I might of course be entirely wrong too.

I'd love to, perhaps, become a contributor to EDK2 sometime, but I honestly have no idea where I'd even begin. :-)
-- Signed, Ethin D. Probst
Paolo Bonzini <pbonzini@...>
On Thu, 4 Feb 2021 at 21:09, Ethin Probst <harlydavidsen@...> wrote:

For fences x86 has the LFENCE instruction for load fences and SFENCE for store fences. If I'm not mistaken, MFENCE, then, is just a combination of the other two. Would it not then be more appropriate to use an LFENCE for a load (release?) fence, and SFENCE for a store (acquire?) fence? (I ask for release/acquire because I might have these mixed up.) I might of course be entirely wrong too. I'd love to, perhaps, become a contributor of EDK2 sometime, but I honestly have no idea where I'd even begin. :-)

No, LFENCE and SFENCE are only useful with special load and store instructions and are not needed in edk2. In addition, neither of the two orders earlier stores against subsequent loads; MFENCE is needed for that.

Paolo
On Feb 4, 2021, at 10:09 AM, Laszlo Ersek <lersek@...> wrote:
Laszlo,

Thanks for looking into this. I noticed most of the MemoryFence() usage[1] is in the virtual platforms, so changes seem manageable. I guess 3rd party code that defines its own low-level primitive could be impacted, but they chose to reinvent the wheel.

It looks like the generic platform performance risk is BaseIoLibIntrinsic. I know from debugging crashes a long time ago that the MemoryFence() for x86 is really a CompilerFence() in the new scheme. If the new AcquireMemoryFence()/ReleaseMemoryFence()/CompilerFence() replace MemoryFence(), this likely just works. If not, then we might need an architecture-specific pattern to maintain current performance; we could always make a function internal to the lib that does the right thing for the given architecture. Also, making it more precise might actually help ARM/AARCH64 performance?

So if I understand correctly:

1) volatile basically tells C to always read the memory. So it impacts the C memory model.

2) CompilerFence() tells the compiler to serialize and complete everything before the barrier, and read memory again the 1st time after the barrier.

3) MemoryFence() is really dealing with speculative fetches and maybe coherency. Basically the CPU reordering things.

So for example in BaseIoLibIntrinsic() the volatile is used to force the memory read, and the CompilerFence() is used to enforce the order of that read relative to other operations. Thus wrapping the volatile access in CompilerFence() is in some ways redundant, as long as there is only a single access inside the CompilerFence() wrapper.

I like the idea of properly annotating the code. Given x86 defaults to so much cache coherency for compatibility with older, simpler times, it can tend to hide issues. I'm not sure removing the volatile keyword is required, and it does imply this variable has special rules. I guess we could add a REQUIRE_FENCE or some such? But it feels like defining the variables in some way to imply they require special handling has some value?
The only reason you would need a volatile is if you had more than a single read inside of the fences. Something like:

  UINT8  *Address;
  UINT8  Value;

  MemoryFence ();
  do {
    Value = *Address;
  } while (Value == 0xFF);
  MemoryFence ();

The compiler could optimize the above code (current MemoryFence() model) into a single read and an infinite loop if 0xFF. Yes, I realize you could move the MemoryFence () into the loop, but what if you did not care about that?

Mostly just trying to think this through with some examples....

[1] /Volumes/Case/edk2-github(master)>git grep MemoryFence | grep ';'
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:320: MemoryFence ();
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:334: MemoryFence ();
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:344: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:1493: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:1497: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4702: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4707: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4712: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4765: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4770: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4775: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4828: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4833: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4838: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4892: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4894: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:86: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:88: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:115: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:117: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:147: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:149: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:179: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:181: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:211: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:213: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:243: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:245: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:275: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:277: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:305: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:307: MemoryFence ();
MdePkg/Library/BaseLib/AArch64/MemoryFence.asm:3:; MemoryFence() for AArch64
MdePkg/Library/BaseLib/AArch64/MemoryFence.asm:23:;MemoryFence (
MdePkg/Library/BaseLib/Arm/MemoryFence.asm:3:; MemoryFence() for AArch64
MdePkg/Library/BaseLib/Arm/MemoryFence.asm:24:;MemoryFence (
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:412: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:425: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:438: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:173: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:186: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:199: MemoryFence ();
OvmfPkg/Library/VirtioLib/VirtioLib.c:305: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:312: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:327: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:334: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:337: MemoryFence();
OvmfPkg/Library/VmgExitLib/VmgExitLib.c:122: MemoryFence ();
OvmfPkg/Library/VmgExitLib/VmgExitLib.c:124: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:117: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:122: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:171: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:177: MemoryFence ();
OvmfPkg/PvScsiDxe/PvScsi.c:595: MemoryFence ();
OvmfPkg/PvScsiDxe/PvScsi.c:625: MemoryFence ();
OvmfPkg/VirtioNetDxe/Events.c:56: MemoryFence ();
OvmfPkg/VirtioNetDxe/Events.c:58: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpGetStatus.c:99: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpGetStatus.c:102: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:257: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:384: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:426: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:437: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:101: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:103: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:173: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:176: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpTransmit.c:169: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpTransmit.c:172: MemoryFence ();
OvmfPkg/XenBusDxe/GrantTable.c:86: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:496: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:503: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:569: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:580: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:587: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:473: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:528: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:578: MemoryFence ();
UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c:204: MemoryFence ();

Thanks,

Andrew Fish
Thanks Laszlo
|
|
Yep, the point on volatile is pretty much correct. An object that has the volatile type qualifier may be manipulated in ways unknown to the implementation (the compiler) and, therefore, the implementation shall not optimize away actions on that object or reorder those actions except as permitted by expression evaluation rules. An implementation needn't evaluate an expression (or part of one) if it determines that its value is unused and that no needed side effects are produced. This page ( https://en.cppreference.com/w/c/language/volatile) explains the volatile type qualifier in a lot more detail than I will here (otherwise we'd get way off topic and I don't think we want that to happen). I can't really provide much input to the other two points though.
On 2/4/21, Andrew Fish via groups.io <afish@...> wrote:
On Feb 4, 2021, at 10:09 AM, Laszlo Ersek <lersek@...> wrote:
Hi All,
here's the declaration of MemoryFence(), from <BaseLib.h>:
/** Used to serialize load and store operations.
All loads and stores that proceed calls to this function are guaranteed to be globally visible when this function returns.
**/ VOID EFIAPI MemoryFence ( VOID ); The IA32 and X64 implementations of MemoryFence() are not memory barriers, only compiler barriers, however.
This is a problem for two reasons:
- code that believes to add a memory barrier only adds a compiler barrier, possibly creating an unsafe situation,
- code that knowingly calls MemoryFence() for the purposes of a compiler barrier -- while it will hopefully work safely, relying on tacit knowledge / assumptions about the x86 architecture -- results in a source reading experience that mismatches known lockless patterns.
This topic has repeatedly come up in the past; most recently in the review of Ankur's v6 series for <https://bugzilla.tianocore.org/show_bug.cgi?id=3132>, where some inter-processor communication needs to happen.
I asked Paolo for help with understanding some of the synchronization patterns in PiSmmCpuDxeSmm, the reason being that Ankur's series is going to rely on, and/or imitate, some of those patterns. From this discussion with Paolo, the following have emerged (I'm going to liberally steal / paraphrase Paolo's words):
(1) We should introduce finer-grained fence primitives:
                       ARM       AARCH64     i386
CompilerFence()        asm("")   asm("")     asm("")
AcquireMemoryFence()   dmb ish   dmb ishld   asm("")
ReleaseMemoryFence()   dmb ish   dmb ish     asm("")
MemoryFence()          dmb ish   dmb ish     mfence
"where AcquireMemoryFence() is used on the read side (i.e. between reads) and ReleaseMemoryFence() is used on the write side (i.e. between writes)".
(Note that this step would make MemoryFence() heavier-weight on x86 than it currently is.)
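As an illustrative aside, the table above can be approximated with C11 fences. The function names below mirror the proposal (they are not existing BaseLib APIs), and a real EDK2 implementation would be per-architecture assembly as shown in the table:

```c
#include <stdatomic.h>

//
// Illustrative sketch only: a portable C11 analogue of the proposed
// primitives; not the actual EDK2 implementation.
//
static inline void CompilerFence (void)
{
  // Compiler-only barrier: no instruction is emitted, but the
  // compiler may not reorder memory accesses across this point.
  atomic_signal_fence (memory_order_seq_cst);
}

static inline void AcquireMemoryFence (void)
{
  // Read-side barrier: maps to "dmb ishld" on AArch64, and to a
  // plain compiler barrier on x86 (loads are not reordered with
  // other loads there).
  atomic_thread_fence (memory_order_acquire);
}

static inline void ReleaseMemoryFence (void)
{
  // Write-side barrier: "dmb ish" on ARM/AArch64, compiler barrier
  // on x86 (stores are not reordered with other stores there).
  atomic_thread_fence (memory_order_release);
}

static inline void MemoryFence (void)
{
  // Full barrier: "mfence" on x86, "dmb ish" on ARM/AArch64.
  atomic_thread_fence (memory_order_seq_cst);
}
```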
(2) We should audit and rework existent uses (combinations) of MemoryFence() into matched acquire / release pairs.
Less importantly, this would restore the x86 behavior (performance) to the one seen with the current MemoryFence() implementation.
More importantly, it would fix what the code *means*: "it's very important to stick to known lockless patterns, and having matching acquire/release fences makes it much more intuitive to understand what's going on".
--*--
I'm not proposing this as a pre-requisite to merging Ankur's series (i.e., I don't expect the series to use acquire/release -- it can be converted later with the rest of the audit).
However, I'd really like us to stop wasting more time on MemoryFence() doubts such as:
- Hey it's not a memory fence at all, is it safe?
- But on x86 it kinda is, *dependent on* the spinlocks / lock prefixes we use for synchronization, between the other accesses.
- Is it interchangeable with volatile? Yes? No? which one should we use?
In the longer term, multi-processing code like MpInitLib (CpuDxe, CpuMpPei) and PiSmmCpuDxeSmm should likely abandon all use of "volatile", and use the explicit barriers instead, for *clarity*.
Step (2) would take quite a bit of thinking (if not much code).
Comments?
Laszlo,
Thanks for looking into this. I noticed most of the MemoryFence() usage[1] is in the virtual platforms so changes seem manageable. I guess 3rd party code that defines its own low level primitive could be impacted, but they chose to reinvent the wheel.
It looks like the generic platform performance risk is BaseIoLibIntrinsic. I know from debugging crashes a long time ago that the MemoryFence() for x86 is really a CompilerFence() in the new scheme. If the new AcquireMemoryFence()/ReleaseMemoryFence()/CompilerFence() replace MemoryFence() this likely just works. If not, then we might need an architecture specific pattern to maintain current performance; we could always make a function internal to the lib that does the right thing for the given architecture. Also, making it more precise might actually help ARM/AARCH64 performance?
So if I understand correctly: 1) volatile basically tells C to always read the memory, so it impacts the C memory model. 2) CompilerFence() tells the compiler to serialize and complete everything before the barrier, and to read memory again the first time after the barrier. 3) MemoryFence() is really dealing with speculative fetches and maybe coherency -- basically the CPU reordering things.
So for example in BaseIoLibIntrinsic() the volatile is used to force the memory read and the CompilerFence() is used to enforce the order of that read relative to other operations. Thus wrapping the volatile access in CompilerFence() is in some ways redundant as long as there is only a single access inside the CompilerFence() wrapper.
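The volatile-plus-compiler-fence shape described above can be sketched as follows; the helper name is illustrative, not the literal IoLib.c code, with atomic_signal_fence standing in for the compiler fence:

```c
#include <stdint.h>
#include <stdatomic.h>

//
// Sketch of the BaseIoLibIntrinsic shape Andrew describes: a
// volatile access bracketed by compiler fences. The name
// MmioRead8Sketch is hypothetical.
//
static uint8_t
MmioRead8Sketch (volatile uint8_t *Address)
{
  uint8_t  Value;

  atomic_signal_fence (memory_order_seq_cst); // order against earlier accesses
  Value = *Address;                           // volatile forces the actual load
  atomic_signal_fence (memory_order_seq_cst); // order against later accesses
  return Value;
}
```

With only the single volatile access between the fences, the fences add ordering relative to surrounding accesses, which is why the wrapping is only "in some ways" redundant.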
I like the idea of properly annotating the code. Given x86 defaults to so much cache coherency for compatibility with older, simpler times, it can tend to hide issues. I’m not sure removing the volatile keyword is required, and it does imply this variable has special rules. I guess we could add a REQUIRE_FENCE or some such? But it feels like defining the variables in some way to imply they require special handling has some value?
The only reason you would need a volatile is if you had more than a single read inside of the fences. Something like:
UINT8 *Address; UINT8 Value;
MemoryFence (); do { Value = *Address; } while (Value == 0xFF); MemoryFence ();
The compiler could optimize the above code (current MemoryFence() model) into a single read and an infinite loop if the value is 0xFF. Yes, I realize you could move the MemoryFence () into the loop, but what if you did not care about that? Mostly just trying to think this through with some examples….
[1] /Volumes/Case/edk2-github(master)>git grep MemoryFence | grep ';' ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:320: MemoryFence (); ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:334: MemoryFence (); ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:344: MemoryFence (); MdeModulePkg/Universal/EbcDxe/EbcExecute.c:1493: MemoryFence (); MdeModulePkg/Universal/EbcDxe/EbcExecute.c:1497: MemoryFence (); MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4702: MemoryFence (); MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4707: MemoryFence (); MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4712: MemoryFence (); MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4765: MemoryFence (); MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4770: MemoryFence (); MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4775: MemoryFence (); MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4828: MemoryFence (); MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4833: MemoryFence (); MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4838: MemoryFence (); MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4892: MemoryFence (); MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4894: MemoryFence (); MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:86: MemoryFence (); MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:88: MemoryFence (); MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:115: MemoryFence (); MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:117: MemoryFence (); MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:147: MemoryFence (); MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:149: MemoryFence (); MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:179: MemoryFence (); MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:181: MemoryFence (); MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:211: MemoryFence (); MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:213: MemoryFence (); MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:243: MemoryFence (); MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:245: MemoryFence (); MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:275: MemoryFence (); MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:277: MemoryFence (); 
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:305: MemoryFence (); MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:307: MemoryFence (); MdePkg/Library/BaseLib/AArch64/MemoryFence.asm:3:; MemoryFence() for AArch64 MdePkg/Library/BaseLib/AArch64/MemoryFence.asm:23:;MemoryFence ( MdePkg/Library/BaseLib/Arm/MemoryFence.asm:3:; MemoryFence() for AArch64 MdePkg/Library/BaseLib/Arm/MemoryFence.asm:24:;MemoryFence ( OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:412: MemoryFence (); OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:425: MemoryFence (); OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:438: MemoryFence (); OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:173: MemoryFence (); OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:186: MemoryFence (); OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:199: MemoryFence (); OvmfPkg/Library/VirtioLib/VirtioLib.c:305: MemoryFence(); OvmfPkg/Library/VirtioLib/VirtioLib.c:312: MemoryFence(); OvmfPkg/Library/VirtioLib/VirtioLib.c:327: MemoryFence(); OvmfPkg/Library/VirtioLib/VirtioLib.c:334: MemoryFence(); OvmfPkg/Library/VirtioLib/VirtioLib.c:337: MemoryFence(); OvmfPkg/Library/VmgExitLib/VmgExitLib.c:122: MemoryFence (); OvmfPkg/Library/VmgExitLib/VmgExitLib.c:124: MemoryFence (); OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:117: MemoryFence (); OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:122: MemoryFence (); OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:171: MemoryFence (); OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:177: MemoryFence (); OvmfPkg/PvScsiDxe/PvScsi.c:595: MemoryFence (); OvmfPkg/PvScsiDxe/PvScsi.c:625: MemoryFence (); OvmfPkg/VirtioNetDxe/Events.c:56: MemoryFence (); OvmfPkg/VirtioNetDxe/Events.c:58: MemoryFence (); OvmfPkg/VirtioNetDxe/SnpGetStatus.c:99: MemoryFence (); OvmfPkg/VirtioNetDxe/SnpGetStatus.c:102: MemoryFence (); OvmfPkg/VirtioNetDxe/SnpInitialize.c:257: MemoryFence (); OvmfPkg/VirtioNetDxe/SnpInitialize.c:384: MemoryFence (); 
OvmfPkg/VirtioNetDxe/SnpInitialize.c:426: MemoryFence (); OvmfPkg/VirtioNetDxe/SnpInitialize.c:437: MemoryFence (); OvmfPkg/VirtioNetDxe/SnpReceive.c:101: MemoryFence (); OvmfPkg/VirtioNetDxe/SnpReceive.c:103: MemoryFence (); OvmfPkg/VirtioNetDxe/SnpReceive.c:173: MemoryFence (); OvmfPkg/VirtioNetDxe/SnpReceive.c:176: MemoryFence (); OvmfPkg/VirtioNetDxe/SnpTransmit.c:169: MemoryFence (); OvmfPkg/VirtioNetDxe/SnpTransmit.c:172: MemoryFence (); OvmfPkg/XenBusDxe/GrantTable.c:86: MemoryFence (); OvmfPkg/XenBusDxe/XenStore.c:496: MemoryFence (); OvmfPkg/XenBusDxe/XenStore.c:503: MemoryFence (); OvmfPkg/XenBusDxe/XenStore.c:569: MemoryFence (); OvmfPkg/XenBusDxe/XenStore.c:580: MemoryFence (); OvmfPkg/XenBusDxe/XenStore.c:587: MemoryFence (); OvmfPkg/XenPvBlkDxe/BlockFront.c:473: MemoryFence (); OvmfPkg/XenPvBlkDxe/BlockFront.c:528: MemoryFence (); OvmfPkg/XenPvBlkDxe/BlockFront.c:578: MemoryFence (); UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c:204: MemoryFence ();
Thanks,
Andrew Fish
Would there be interest in reviewing such work?
Thanks Laszlo
-- Signed, Ethin D. Probst
On 02/04/21 22:31, Andrew Fish wrote:
On Feb 4, 2021, at 10:09 AM, Laszlo Ersek <lersek@...> wrote:
[snip -- full quote of Laszlo's original message]
Laszlo,
Thanks for looking into this. I noticed most of the MemoryFence() usage[1] is in the virtual platforms so changes seem manageable. I guess 3rd party code that defines its own low level primitive could be impacted, but they chose to reinvent the wheel.
It looks like the generic platform performance risk is BaseIoLibIntrinsic. I know from debugging crashes a long time ago that the MemoryFence() for x86 is really a CompilerFence() in the new scheme. If the new AcquireMemoryFence()/ReleaseMemoryFence()/CompilerFence() replace MemoryFence() this likely just works. If not, then we might need an architecture specific pattern to maintain current performance; we could always make a function internal to the lib that does the right thing for the given architecture. Also, making it more precise might actually help ARM/AARCH64 performance?
So if I understand correctly: 1) volatile basically tells C to always read the memory, so it impacts the C memory model. 2) CompilerFence() tells the compiler to serialize and complete everything before the barrier, and to read memory again the first time after the barrier. 3) MemoryFence() is really dealing with speculative fetches and maybe coherency -- basically the CPU reordering things.
So for example in BaseIoLibIntrinsic() the volatile is used to force the memory read and the CompilerFence() is used to enforce the order of that read relative to other operations. Thus wrapping the volatile access in CompilerFence() is in some ways redundant as long as there is only a single access inside the CompilerFence() wrapper.
Right, I agree with everything you said thus far. (I don't claim to be an expert on this, of course.) CompilerFence() is a "better volatile" (for RAM only, not for MMIO), where we can allow the compiler to optimize, up to a certain location.

I like the idea of properly annotating the code. Given x86 defaults to so much cache coherency for compatibility with older simpler times it can tend to hide issues. I’m not sure removing the volatile keyword is required, and it does imply this variable has special rules. I guess we could add a REQUIRE_FENCE or some such? But it feels like defining the variables in some way to imply they require special handling has some value?

Objects that are shared between processors, or between a processor and some device, should always be highlighted in some way (naming, comments, ...).

The only reason you would need a volatile is if you had more than a single read inside of the fences. Something like:
UINT8 *Address; UINT8 Value;
MemoryFence (); do { Value = *Address; } while (Value == 0xFF); MemoryFence ();
The compiler could optimize the above code (current MemoryFence() model) into a single read and an infinite loop if the value is 0xFF. Yes, I realize you could move the MemoryFence () into the loop, but what if you did not care about that? Mostly just trying to think this through with some examples….
I think it depends on what Address refers to. If it refers to an MMIO register, then volatile is needed, IMO (and the MemoryFence()s are unnecessary). If Address points to a byte in RAM that's manipulated by multiple CPUs, then I think:

- volatile is not needed
- the two MemoryFence() calls should be removed
- an AcquireMemoryFence() call should be inserted into the loop, after the read access.

Because, presumably, once the flag assumes the appropriate value (it has been "released"), we'll proceed to reading some memory resource that was protected by the flag until then.

Regarding the many MemoryFence() calls in the virt platforms, most of them do not exist for inter-CPU communication, but for communication with devices through RAM. I admit I'm unsure what we should do with them.

Thanks Laszlo
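The rework Laszlo describes (no volatile, no surrounding full fences, an acquire fence after the read inside the loop) can be sketched with C11 atomics standing in for the proposed AcquireMemoryFence(); all names are illustrative:

```c
#include <stdint.h>
#include <stdatomic.h>

//
// Hedged sketch of the RAM-flag case: spin on the flag, with an
// acquire fence after the read, inside the loop body.
//
static uint8_t
WaitForFlag (_Atomic uint8_t *Address)
{
  uint8_t  Value;

  do {
    Value = atomic_load_explicit (Address, memory_order_relaxed);
    //
    // Acquire fence: reads of the data protected by the flag cannot
    // be hoisted above the flag read that observed the new value.
    //
    atomic_thread_fence (memory_order_acquire);
  } while (Value == 0xFF);

  return Value;  // safe to read the protected resource now
}
```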
So if I understand correctly: 1) volatile basically tells C to always read the memory, so it impacts the C memory model. 2) CompilerFence() tells the compiler to serialize and complete everything before the barrier, and to read memory again the first time after the barrier. 3) MemoryFence() is really dealing with speculative fetches and maybe coherency -- basically the CPU reordering things. I agree with this summary. The volatile usage in MpInitLib or the PiSmmCpu driver cannot be omitted, because different CPU threads change the same memory content to synchronize with each other. I don’t quite understand: if we remove “volatile”, what mechanism can be used to force the compiler to generate code that always reads from memory? Thanks, Ray
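For illustration, one such mechanism (on GCC/Clang) is an inline-asm statement with a "memory" clobber inside the loop body: the clobber tells the compiler that any memory may have changed, so a cached value of *Address cannot be reused. The helper name below is hypothetical:

```c
#include <stdint.h>

//
// Sketch: forcing a reload without "volatile", using a compiler
// barrier inside the loop. GCC/Clang syntax; on MSVC the analogous
// intrinsic would be _ReadWriteBarrier().
//
static uint8_t
SpinUntilChanged (uint8_t *Address)
{
  uint8_t  Value;

  do {
    __asm__ __volatile__ ("" ::: "memory");  // compiler fence
    Value = *Address;                        // must be re-read each pass
  } while (Value == 0xFF);

  return Value;
}
```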
On 02/04/21 21:04, Paolo Bonzini wrote: On Thu, Feb 4, 2021, 20:46 Ard Biesheuvel <ardb@...> wrote: It is very important to be *aware* of the acquire/release semantics, but I don't think it is necessary to use all the fine grained barrier types in EDK2.
I agree as long as the primitives are self-documenting. A single MemoryFence() does not make it clear in which direction the data is flowing (whether from other threads to this one, or vice versa).
I can say about the MemoryFence() calls I've added, and about the ones Ankur is about to add, that we're aware of the data flow directions. We could certainly annotate every such location with either "we're publishing information here" or "we're consuming information here". So for example:

(a) publish:
- prepare and expose the data
- "release fence"
- release the spinlock with InterlockedCompareExchange, or push the MMIO doorbell register

(b1) consume:
- spin on a spinlock with InterlockedCompareExchange,
- "acquire fence"
- gather / consume the data

(b2) consume:
- spin on an MMIO latch, using "volatile"
- "acquire fence"
- gather / consume the data

(b3) consume:
- spin on a normal memory location with an "acquire fence" at the end of the loop body -- still inside the loop body,
- gather / consume the data, after the loop

I believe these general patterns / semantics are "more or less" clear to us all; I personally have not had to do anything fancier than the above, programming fw_cfg, virtio, or even some inter-processor communication.

*However*, how those natural-language "fence" statements map to compiler fences, CPU/memory fences, especially with regard to architecture specifics -- I honestly admit that's where my eyes glaze over.

The MemoryFence() API has been very convenient for *both* of those directions, but:

- it does not inherently express the intended data flow direction, as Paolo notes,
- it's unclear whether "volatile" (or a compiler barrier) is needed *in addition* to MemoryFence() or not -- after all, if the compiler didn't produce the necessary store/load instructions yet, "fencing" the effects of those instructions makes no sense (paraphrasing Ard),
- the x86 implementation of MemoryFence() openly admits to breaking the interface contract ("this is a compiler fence only"), without explaining why -- i.e., for what architectural reasons -- it might *still* suffice for upholding the contract.
So in the end one just goes for: "whenever in doubt, use volatile *PLUS* MemoryFence()".

Here's what I'd prefer:

- someone more versed in the *implementation* of these primitives than myself should please add the "directional" barriers,
- if possible, I would prefer if a naked compiler barrier API didn't exist at all (because (a) the above-noted directional barriers should *guarantee* in the contracts to imply the necessary compiler barrier(s), and (b) I wouldn't know where I'd be content with *just* a compiler barrier).

Then I could go over the MemoryFence() calls I've added, and replace each with the proper directional variant, while simultaneously killing *all* related "volatile" qualifiers. (I'd only keep volatile where it referred to a known MMIO register, for example for pressing or reading an MMIO doorbell / latch.)

Then I think someone from Intel should do the same for PiSmmCpuDxeSmm and MpInitLib -- because every time I need to look at the swaths of "volatile" in those drivers, I must start reasoning from square #1. And for the life of me, I cannot squeeze a bunch of the patterns I see there into any one of the patterns (a), (b1), (b2), and (b3) that I outlined above. In every such case, I'm left wondering, "does it work by chance only, or by silently exploiting architecture guarantees that I don't know about".

In summary, I'm mostly sure what I'd do about the volatile's and MemoryFence()s that *I* have written, provided the directional barrier variants existed, but I don't have the first clue about cleaning up PiSmmCpuDxeSmm and MpInitLib.

And ultimately, it would be awesome if, when discussing a patch series like Ankur's v6, we didn't have to go look for examples and tacit intentions in PiSmmCpuDxeSmm. :/

Thanks, Laszlo
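Patterns (a) "publish" and (b3) "consume" above can be sketched with C11 fences standing in for the proposed Release/AcquireMemoryFence(); all names are illustrative:

```c
#include <stdint.h>
#include <stdatomic.h>

static uint32_t        Data;   // the protected resource
static _Atomic uint8_t Flag;   // the flag / "doorbell"

//
// Pattern (a): prepare the data, release fence, then set the flag.
//
static void
Publish (uint32_t Value)
{
  Data = Value;                                             // prepare and expose the data
  atomic_thread_fence (memory_order_release);               // "release fence"
  atomic_store_explicit (&Flag, 1, memory_order_relaxed);   // release the flag
}

//
// Pattern (b3): spin on a normal memory location, acquire fence
// inside/at the end of the loop body, consume the data after.
//
static uint32_t
Consume (void)
{
  while (atomic_load_explicit (&Flag, memory_order_relaxed) == 0) {
    ;                                                       // spin on the flag
  }
  atomic_thread_fence (memory_order_acquire);               // "acquire fence"
  return Data;                                              // gather / consume the data
}
```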
Paolo Bonzini <pbonzini@...>
On 05/02/21 08:48, Laszlo Ersek wrote: MemoryFence (); do { Value = *Address; } while (Value == 0xFF); MemoryFence (); If Address points to a byte in RAM that's manipulated by multiple CPUs, then I think: - volatile is not needed - the two MemoryFence() calls should be removed - an AcquireMemoryFence() call should be inserted into the loop, after the read access. Because, presumably, once the flag assumes the appropriate value (it has been "released"), we'll proceed to reading some memory resource that was protected by the flag until then.
This is correct; alternatively you could have this:

do {
  // Force re-reading *Address on every iteration
  CompilerFence ();
  Value = *Address;
} while (Value == 0xFF);
// As you explained above.
AcquireMemoryFence ();

Paolo
This is very interesting. I am trying to get involved in the discussion to learn. I did some experiment.

----a.c----
// Copy EDKII MemoryFence() and as Laszlo said it's just a compiler fence.
void CompilerFence () { return; }

void main () {
  // I copied Paolo's code in below.
  int *Address = &main;
  int Value;

  do {
    // Force re-reading *Address on every iteration
    CompilerFence ();
    Value = *Address;
  } while (Value == 0xFF);
  // As you explained above.
  __asm mfence;
}
----END of a.c----
When I use "Cl.exe /O2 /FAcs a.c" to compile it, I got a.cod as below:

----a.cod----
; Listing generated by Microsoft (R) Optimizing Compiler Version 19.27.29112.0

TITLE E:\Work\edk2\a.c
.686P
.XMM
include listing.inc
.model flat

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

PUBLIC _CompilerFence
PUBLIC _main
; Function compile flags: /Ogtpy
; File E:\Work\edk2\a.c
; COMDAT _main
_TEXT SEGMENT
_main PROC ; COMDAT

; 6 : int *Address = &main;

  00000 b8 00 00 00 00  mov eax, OFFSET _main
  00005 8b 00           mov eax, DWORD PTR [eax]
$LL4@main:

; 7 : int Value;
; 8 :
; 9 : do {
; 10 : // Force re-reading *Address on every iteration
; 11 : CompilerFence ();
; 12 : Value = *Address;
; 13 : } while (Value == 0xFF);

  00007 3d ff 00 00 00  cmp eax, 255 ; 000000ffH
  0000c 74 f9           je SHORT $LL4@main

; 14 : // As you explained above.
; 15 : __asm mfence;

  0000e 0f ae f0        mfence
  00011 c3              ret 0
_main ENDP
_TEXT ENDS
; Function compile flags: /Ogtpy
; File E:\Work\edk2\a.c
; COMDAT _CompilerFence
_TEXT SEGMENT
_CompilerFence PROC ; COMDAT

; 2 : return;
; 3 : }

  00000 c2 00 00        ret 0
_CompilerFence ENDP
_TEXT ENDS
END
----END of a.cod----
Check the assembly:

  00007 3d ff 00 00 00         cmp     eax, 255         ; 000000ffH

The CompilerFence() call doesn't force the compiler to generate code that always reads the memory pointed to by Address.

If I replace CompilerFence() with _ReadBarrier(), the compiler generates assembly that always reads the memory pointed to by Address.

It seems the EDKII MemoryFence() cannot even be used as a read barrier. Does it mean that the MSVC version of the EDKII MemoryFence() is implemented wrongly?
Thanks, Ray
-----Original Message-----
From: Paolo Bonzini <pbonzini@...>
Sent: Friday, February 5, 2021 5:13 PM
Subject: Re: MemoryFence()
[quoted text trimmed]
Paolo Bonzini <pbonzini@...>
On 05/02/21 14:15, Ni, Ray wrote:
> I did some experiment.
>
> ----a.c----
> // Copy EDKII MemoryFence() and as Laszlo said it's just a compiler fence.
> void CompilerFence () {
>   return;

This should be

  #define CompilerFence _ReadWriteBarrier

Paolo
On 02/05/21 10:12, Paolo Bonzini wrote:
> This is correct; alternatively you could have this:
>
>   do {
>     // Force re-reading *Address on every iteration
>     CompilerFence ();
>     Value = *Address;
>   } while (Value == 0xFF);
>   // As you explained above.
>   AcquireMemoryFence ();

Ah, thanks; also a good example where a CompilerFence() would be justified.

Laszlo
On 02/05/21 09:25, Ni, Ray wrote:
> So if I understand correctly
> 1) volatile basically tells C to always read the memory. So it impacts the C memory model.
> 2) CompilerFence() tells the C compiler to serialize and complete everything before the barrier, and read memory again the first time after the barrier.
> 3) MemoryFence() is really dealing with speculative fetches and maybe coherency. Basically the CPU reordering things.
>
> I agree with this summary. The volatile usages in MpInitLib and the PiSmmCpu driver cannot be omitted, because different CPU threads change the same memory content to synchronize with each other. I don't quite understand: if we remove "volatile", what mechanism can be used to force the compiler to generate code that always reads from memory?

A compiler barrier (in case that's sufficient), or even a memory fence -- the latter seems more appropriate for inter-CPU synchronization -- which I think should be implemented to automatically force a compiler barrier too.

Laszlo
On 02/05/21 14:15, Ni, Ray wrote:
> This is very interesting. I am trying to join the discussion to learn. I did some experiment.
>
> ----a.c----
> // Copy EDKII MemoryFence() and as Laszlo said it's just a compiler fence.
> void CompilerFence () {
>   return;
> }
> [snip]
> ----END of a.c----
>
> The CompilerFence() call doesn't force the compiler to generate code that always reads the memory pointed to by Address.

That's because your CompilerFence() function is not correct. You did:

  void CompilerFence () {
    return;
  }

which is wrong. It enforces a C-language sequence point, but it does not implement a compiler barrier. And your code is wrong because the *original*, namely the MSFT implementation of MemoryFence() in edk2, is bugged:

* MdePkg/Library/BaseLib/X86MemoryFence.c:

  /**
    Used to serialize load and store operations.

    All loads and stores that proceed calls to this function are
    guaranteed to be globally visible when this function returns.
  **/
  VOID
  EFIAPI
  MemoryFence (
    VOID
    )
  {
    return;
  }

This code comes from historical commit e1f414b6a7d8 ("Import some basic libraries instances for Mde Packages.", 2007-06-22).

* MdePkg/Library/BaseLib/BaseLib.inf:

  [Sources.Ia32]
    X86MemoryFence.c | MSFT

  [Sources.X64]
    X86MemoryFence.c | MSFT

Back to your email:

> If I replace CompilerFence() with _ReadBarrier(), the compiler generates assembly that always reads the memory pointed to by Address.
>
> It seems the EDKII MemoryFence() cannot even be used as a read barrier. Does it mean that the MSVC version of the EDKII MemoryFence() is implemented wrongly?

That's precisely the case.

This actually explains why MpInitLib and PiSmmCpuDxeSmm use "volatile" rather than MemoryFence(). (This is one of those info bits that I've been missing.) The reason is that, when built with Visual Studio, MemoryFence() does absolutely nothing. So if MpInitLib and PiSmmCpuDxeSmm used MemoryFence() rather than "volatile", things would break immediately.

Thank you Ray for highlighting this; I couldn't have imagined it. The current status means that, considering all toolchains and all architectures that edk2 supports together, it's been *impossible* to use MemoryFence() correctly.

And so my defensive thinking "whenever in doubt, use 'volatile' PLUS MemoryFence()" apparently paid off, because when built with Visual Studio, OVMF is left with 'volatile' *only*, in effect.

It also means that in Ankur's upcoming v7 series, relying *just* on MemoryFence() will not suffice. We need 'volatile' (given the current state of primitives); otherwise we'll have no barriers at all.

... I don't even know where to start changing the code. :(

Laszlo
Paolo Bonzini <pbonzini@...>
On 05/02/21 16:34, Laszlo Ersek wrote:
> And your code is wrong because the *original*, namely the MSFT implementation of MemoryFence() in edk2, is bugged:
>
> * MdePkg/Library/BaseLib/X86MemoryFence.c:
>
>   /**
>     Used to serialize load and store operations.
>
>     All loads and stores that proceed calls to this function are
>     guaranteed to be globally visible when this function returns.
>   **/
>   VOID
>   EFIAPI
>   MemoryFence (
>     VOID
>     )
>   {
>     return;
>   }

I think it's okay-ish (it works as long as you don't do link-time optimization, but it's less efficient than a compiler intrinsic) because X86MemoryFence.c is a separate file.

Paolo
Ard Biesheuvel <ardb@...>
On Fri, 5 Feb 2021 at 16:38, Paolo Bonzini <pbonzini@...> wrote:
> I think it's okay-ish (it works as long as you don't do link-time optimization, but it's less efficient than a compiler intrinsic) because X86MemoryFence.c is a separate file.

We rely heavily on LTO for both the Visual Studio and GNU/Clang toolchains.
Paolo Bonzini <pbonzini@...>
On 05/02/21 16:39, Ard Biesheuvel wrote:
> We rely heavily on LTO for both the Visual Studio and GNU/Clang toolchains.

Oops... Well, it *was* okay at the time MemoryFence() was first implemented. :(

Paolo
Maybe that explains why MemoryFence() was implemented so confidently. I trust Mike :)

In UefiCpuPkg/Library/RegisterCpuFeaturesLib/CpuFeaturesInitialize.c, LibWaitForSemaphore and LibReleaseSemaphore are coded without the read/write barrier, but AcquireSpinLockOrFail and ReleaseSpinLock in MdePkg/Library/BaseSynchronizationLib/SynchronizationMsc.c contain the read/write barrier.

Is the former implementation wrong, because without the barrier the cmpxchg can happen earlier or later as chosen by the compiler? Does the volatile keyword help here?

thanks,
ray

________________________________
From: Paolo Bonzini <pbonzini@...>
Sent: Friday, February 5, 2021 11:42:36 PM
Subject: Re: MemoryFence()
[quoted text trimmed]