MemoryFence()


Laszlo Ersek
 

Hi All,

here's the declaration of MemoryFence(), from <BaseLib.h>:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
);
The IA32 and X64 implementations of MemoryFence() are not memory
barriers, only compiler barriers, however.

This is a problem for two reasons:

- code that believes to add a memory barrier only adds a compiler
barrier, possibly creating an unsafe situation,

- code that knowingly calls MemoryFence() for the purposes of a compiler
barrier -- while it will hopefully work safely, relying on tacit
knowledge / assumptions about the x86 architecture -- results in a
source reading experience that mismatches known lockless patterns.


This topic has repeatedly come up in the past; most recently in the
review of Ankur's v6 series for
<https://bugzilla.tianocore.org/show_bug.cgi?id=3132>, where some
inter-processor communication needs to happen.

I asked Paolo for help with understanding some of the synchronziation
patterns in PiSmmCpuDxeSmm, the reason being that Ankur's series is
going to rely on, and/or imitate, some of those patterns. From this
discussion with Paolo, the following have emerged (I'm going to
liberally steal / paraphrase Paolo's words):


(1) We should introduce finer-grained fence primitives:

ARM AARCH64 i386

CompilerFence() asm("") asm("") asm("")
AcquireMemoryFence() dmb ish dmb ishld asm("")
ReleaseMemoryFence() dmb ish dmb ish asm("")
MemoryFence() dmb ish dmb ish mfence

"where AcquireMemoryFence() is used on the read side (i.e. between
reads) and ReleaseMemoryFence() is used on the write side (i.e. between
writes)".

(Note that this step would make MemoryFence() heavier-weight on x86 than
it currently is.)


(2) We should audit and rework existent uses (combinations) of
MemoryFence() into matched acquire / release pairs.

Less importantly, this would restore the x86 behavior (performance) to
the one seen with the current MemoryFence() implementation.

More importantly, it would fix what the code *means*: "it's very
important to stick to known lockless patterns, and having matching
acquire/release fences makes it much more intuitive to understand what's
going on".

--*--

I'm not proposing this as a pre-requisite to merging Ankur's series
(i.e., I don't expect the series to use acquire/release -- it can be
converted later with the rest of the audit).

However, I'd really like us to stop wasting more time on MemoryFence()
doubts such as:

- Hey it's not a memory fence at all, is it safe?

- But on x86 it kinda is, *dependent on* the spinlocks / lock prefixes
we use for synchronization, between the other accesses.

- Is it interchangeable with volatile? Yes? No? which one should we use?

In the longer term, multi-processing code like MpInitLib (CpuDxe,
CpuMpPei) and PiSmmCpuDxeSmm should likely abandon all use of
"volatile", and use the explicit barriers instead, for *clarity*.


Step (2) would take quite a bit of thinking (if not much code).

Comments?

Would there be interest in reviewing such work?

Thanks
Laszlo


Ard Biesheuvel <ardb@...>
 

On Thu, 4 Feb 2021 at 19:09, Laszlo Ersek <lersek@...> wrote:

Hi All,

here's the declaration of MemoryFence(), from <BaseLib.h>:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
);
The IA32 and X64 implementations of MemoryFence() are not memory
barriers, only compiler barriers, however.

This is a problem for two reasons:

- code that believes to add a memory barrier only adds a compiler
barrier, possibly creating an unsafe situation,

- code that knowingly calls MemoryFence() for the purposes of a compiler
barrier -- while it will hopefully work safely, relying on tacit
knowledge / assumptions about the x86 architecture -- results in a
source reading experience that mismatches known lockless patterns.


This topic has repeatedly come up in the past; most recently in the
review of Ankur's v6 series for
<https://bugzilla.tianocore.org/show_bug.cgi?id=3132>, where some
inter-processor communication needs to happen.

I asked Paolo for help with understanding some of the synchronziation
patterns in PiSmmCpuDxeSmm, the reason being that Ankur's series is
going to rely on, and/or imitate, some of those patterns. From this
discussion with Paolo, the following have emerged (I'm going to
liberally steal / paraphrase Paolo's words):


(1) We should introduce finer-grained fence primitives:

ARM AARCH64 i386

CompilerFence() asm("") asm("") asm("")
AcquireMemoryFence() dmb ish dmb ishld asm("")
ReleaseMemoryFence() dmb ish dmb ish asm("")
MemoryFence() dmb ish dmb ish mfence

"where AcquireMemoryFence() is used on the read side (i.e. between
reads) and ReleaseMemoryFence() is used on the write side (i.e. between
writes)".
Acquire semantics typically order writes before reads, not /between/
reads. Similarly, release semantics imply that all outstanding writes
complete before a barrier with acquire semantics is permitted to
complete.

Note that reasoning about this only makes sense in the context of
concurrency, i.e., different CPUs operating on the same memory (or
coherent DMA masters)

For non-coherent DMA, the 'ish' variants are not appropriate, and
given the low likelihood that any of this is creates a performance
bottleneck, I would suggest to only use full barriers on ARM.

(Note that this step would make MemoryFence() heavier-weight on x86 than
it currently is.)


(2) We should audit and rework existent uses (combinations) of
MemoryFence() into matched acquire / release pairs.
It is very important to be *aware* of the acquire/release semantics,
but I don't think it is necessary to use all the fine grained barrier
types in EDK2.

Less importantly, this would restore the x86 behavior (performance) to
the one seen with the current MemoryFence() implementation.

More importantly, it would fix what the code *means*: "it's very
important to stick to known lockless patterns, and having matching
acquire/release fences makes it much more intuitive to understand what's
going on".

--*--

I'm not proposing this as a pre-requisite to merging Ankur's series
(i.e., I don't expect the series to use acquire/release -- it can be
converted later with the rest of the audit).

However, I'd really like us to stop wasting more time on MemoryFence()
doubts such as:

- Hey it's not a memory fence at all, is it safe?
Who decides what 'memory fence' means anyway?

- But on x86 it kinda is, *dependent on* the spinlocks / lock prefixes
we use for synchronization, between the other accesses.

- Is it interchangeable with volatile? Yes? No? which one should we use?
'volatile' affects code generation only, i.e., which loads and stores
are issued. The *order* in which those loads observe stores done by
other masters is what is managed by barrier instructions. So they are
two sides of the same coin: 'volatile' may help to guarantee that
every assignment through a pointer variable results in a store
instruction, but how these are observed by other CPUs still needs to
be managed if it needs to occur in a certain way.

I think volatile is the wrong tool in most cases: the ordering of
accesses are not a property of the type or of the variable, but of the
code sequence. So using asm("") to ensure that an assignment through a
pointer is emitted as a store instruction, followed by a barrier
instruction is a typical pattern. And given the fact that barrier
instructions require asm sequences in most cases anyway, the asm ("")
will often be implied.


In the longer term, multi-processing code like MpInitLib (CpuDxe,
CpuMpPei) and PiSmmCpuDxeSmm should likely abandon all use of
"volatile", and use the explicit barriers instead, for *clarity*.


Step (2) would take quite a bit of thinking (if not much code).

Comments?

Would there be interest in reviewing such work?

Thanks
Laszlo


Paolo Bonzini <pbonzini@...>
 

Il gio 4 feb 2021, 20:46 Ard Biesheuvel <ardb@...> ha scritto:

(1) We should introduce finer-grained fence primitives:

ARM AARCH64 i386

CompilerFence() asm("") asm("") asm("")
AcquireMemoryFence() dmb ish dmb ishld asm("")
ReleaseMemoryFence() dmb ish dmb ish asm("")
MemoryFence() dmb ish dmb ish mfence

"where AcquireMemoryFence() is used on the read side (i.e. between
reads) and ReleaseMemoryFence() is used on the write side (i.e. between
writes)".
Acquire semantics typically order writes before reads, not /between/
reads. Similarly, release semantics imply that all outstanding writes
complete before a barrier with acquire semantics is permitted to
complete.
Acquire fences are barriers between earlier loads and subsequent loads and
stores; those earlier loads then synchronize with release stores in other
threads.

Release fences are barriers been earlier loads and stores against
subsequent stores, and those subsequent stores synchronize with acquire
loads in other threads.

In both cases, however, fences only make sense between memory operations.
So something like "after reads" and "before writes" would have been more
precise in some sense, but in practice the usual idiom is "between"
reads/writes as Laszlo wrote.

Note that reasoning about this only makes sense in the context of
concurrency, i.e., different CPUs operating on the same memory (or
coherent DMA masters)

For non-coherent DMA, the 'ish' variants are not appropriate, and
given the low likelihood that any of this is creates a performance
bottleneck, I would suggest to only use full barriers on ARM.
Sure, that's a matter of how to implement the primitives. If you think that
non-coherent DMA is important, a full dmb can be used too.

As far as the compiler is concerned, an asm in the macros *should* block
enough optimizations, even without making the accesses volatile.
CompilerFence (or the edk2 equivalent of cpu_relax, whose name escapes me
right now) would be necessary in the body of busy-waiting loops. However
somebody should check the MSVC docs for asm, too.

It is very important to be *aware* of the acquire/release semantics,
but I don't think it is necessary to use all the fine grained barrier
types in EDK2.
I agree as long as the primitives are self-documenting. A single
MemoryFence() does not make it clear in which direction the data is flowing
(whether from other threads to this one, or vice versa).

Paolo


Less importantly, this would restore the x86 behavior (performance) to
the one seen with the current MemoryFence() implementation.

More importantly, it would fix what the code *means*: "it's very
important to stick to known lockless patterns, and having matching
acquire/release fences makes it much more intuitive to understand what's
going on".

--*--

I'm not proposing this as a pre-requisite to merging Ankur's series
(i.e., I don't expect the series to use acquire/release -- it can be
converted later with the rest of the audit).

However, I'd really like us to stop wasting more time on MemoryFence()
doubts such as:

- Hey it's not a memory fence at all, is it safe?
Who decides what 'memory fence' means anyway?

- But on x86 it kinda is, *dependent on* the spinlocks / lock prefixes
we use for synchronization, between the other accesses.

- Is it interchangeable with volatile? Yes? No? which one should we use?
'volatile' affects code generation only, i.e., which loads and stores
are issued. The *order* in which those loads observe stores done by
other masters is what is managed by barrier instructions. So they are
two sides of the same coin: 'volatile' may help to guarantee that
every assignment through a pointer variable results in a store
instruction, but how these are observed by other CPUs still needs to
be managed if it needs to occur in a certain way.

I think volatile is the wrong tool in most cases: the ordering of
accesses are not a property of the type or of the variable, but of the
code sequence. So using asm("") to ensure that an assignment through a
pointer is emitted as a store instruction, followed by a barrier
instruction is a typical pattern. And given the fact that barrier
instructions require asm sequences in most cases anyway, the asm ("")
will often be implied.


In the longer term, multi-processing code like MpInitLib (CpuDxe,
CpuMpPei) and PiSmmCpuDxeSmm should likely abandon all use of
"volatile", and use the explicit barriers instead, for *clarity*.


Step (2) would take quite a bit of thinking (if not much code).

Comments?

Would there be interest in reviewing such work?

Thanks
Laszlo


Ethin Probst
 

I agree with Ard (I'm not a contributor of EDK2 but thought I'd add my
2 sense anyway). `volatile` should only be used if a pointer
load/store is utilized in a context where you expect the external
environment (e.g.: hardware) to alter the result of such an op. In the
case of fences, volatile, I think, doesn't really do much in terms of
helping. For fences x86 has the LFENCE instruction for load fences and
SFENCE for store fences. If I'm not mistaken, MFENCE, then, is just a
combination of the other two. Would it not then be more appropriate to
use an LFENCE for a load (release?) fence, and SFENCE for a store
(acquire?) fence? (I ask for release/acquire because I might have
these mixed up.) I might of course be entirely wrong too.
I'd love to, perhaps, become a contributor of EDK2 sometime, but I
honestly have no idea where I'd even begin. :-)

On 2/4/21, Ard Biesheuvel <ardb@...> wrote:
On Thu, 4 Feb 2021 at 19:09, Laszlo Ersek <lersek@...> wrote:

Hi All,

here's the declaration of MemoryFence(), from <BaseLib.h>:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are
guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
);
The IA32 and X64 implementations of MemoryFence() are not memory
barriers, only compiler barriers, however.

This is a problem for two reasons:

- code that believes to add a memory barrier only adds a compiler
barrier, possibly creating an unsafe situation,

- code that knowingly calls MemoryFence() for the purposes of a compiler
barrier -- while it will hopefully work safely, relying on tacit
knowledge / assumptions about the x86 architecture -- results in a
source reading experience that mismatches known lockless patterns.


This topic has repeatedly come up in the past; most recently in the
review of Ankur's v6 series for
<https://bugzilla.tianocore.org/show_bug.cgi?id=3132>, where some
inter-processor communication needs to happen.

I asked Paolo for help with understanding some of the synchronziation
patterns in PiSmmCpuDxeSmm, the reason being that Ankur's series is
going to rely on, and/or imitate, some of those patterns. From this
discussion with Paolo, the following have emerged (I'm going to
liberally steal / paraphrase Paolo's words):


(1) We should introduce finer-grained fence primitives:

ARM AARCH64 i386

CompilerFence() asm("") asm("") asm("")
AcquireMemoryFence() dmb ish dmb ishld asm("")
ReleaseMemoryFence() dmb ish dmb ish asm("")
MemoryFence() dmb ish dmb ish mfence

"where AcquireMemoryFence() is used on the read side (i.e. between
reads) and ReleaseMemoryFence() is used on the write side (i.e. between
writes)".
Acquire semantics typically order writes before reads, not /between/
reads. Similarly, release semantics imply that all outstanding writes
complete before a barrier with acquire semantics is permitted to
complete.

Note that reasoning about this only makes sense in the context of
concurrency, i.e., different CPUs operating on the same memory (or
coherent DMA masters)

For non-coherent DMA, the 'ish' variants are not appropriate, and
given the low likelihood that any of this is creates a performance
bottleneck, I would suggest to only use full barriers on ARM.

(Note that this step would make MemoryFence() heavier-weight on x86 than
it currently is.)


(2) We should audit and rework existent uses (combinations) of
MemoryFence() into matched acquire / release pairs.
It is very important to be *aware* of the acquire/release semantics,
but I don't think it is necessary to use all the fine grained barrier
types in EDK2.

Less importantly, this would restore the x86 behavior (performance) to
the one seen with the current MemoryFence() implementation.

More importantly, it would fix what the code *means*: "it's very
important to stick to known lockless patterns, and having matching
acquire/release fences makes it much more intuitive to understand what's
going on".

--*--

I'm not proposing this as a pre-requisite to merging Ankur's series
(i.e., I don't expect the series to use acquire/release -- it can be
converted later with the rest of the audit).

However, I'd really like us to stop wasting more time on MemoryFence()
doubts such as:

- Hey it's not a memory fence at all, is it safe?
Who decides what 'memory fence' means anyway?

- But on x86 it kinda is, *dependent on* the spinlocks / lock prefixes
we use for synchronization, between the other accesses.

- Is it interchangeable with volatile? Yes? No? which one should we use?
'volatile' affects code generation only, i.e., which loads and stores
are issued. The *order* in which those loads observe stores done by
other masters is what is managed by barrier instructions. So they are
two sides of the same coin: 'volatile' may help to guarantee that
every assignment through a pointer variable results in a store
instruction, but how these are observed by other CPUs still needs to
be managed if it needs to occur in a certain way.

I think volatile is the wrong tool in most cases: the ordering of
accesses are not a property of the type or of the variable, but of the
code sequence. So using asm("") to ensure that an assignment through a
pointer is emitted as a store instruction, followed by a barrier
instruction is a typical pattern. And given the fact that barrier
instructions require asm sequences in most cases anyway, the asm ("")
will often be implied.


In the longer term, multi-processing code like MpInitLib (CpuDxe,
CpuMpPei) and PiSmmCpuDxeSmm should likely abandon all use of
"volatile", and use the explicit barriers instead, for *clarity*.


Step (2) would take quite a bit of thinking (if not much code).

Comments?

Would there be interest in reviewing such work?

Thanks
Laszlo




--
Signed,
Ethin D. Probst


Paolo Bonzini <pbonzini@...>
 

Il gio 4 feb 2021, 21:09 Ethin Probst <harlydavidsen@...> ha scritto:

For fences x86 has the LFENCE instruction for load fences and
SFENCE for store fences. If I'm not mistaken, MFENCE, then, is just a
combination of the other two. Would it not then be more appropriate to
use an LFENCE for a load (release?) fence, and SFENCE for a store
(acquire?) fence?

No, LFENCE and SFENCE are only useful with special load and store
instructions and are not needed in edk2. In addition, neither of the two
orders earlier stores against subsequent loads; MFENCE is needed for that.

Paolo

(I ask for release/acquire because I might have
these mixed up.) I might of course be entirely wrong too.
I'd love to, perhaps, become a contributor of EDK2 sometime, but I
honestly have no idea where I'd even begin. :-)

On 2/4/21, Ard Biesheuvel <ardb@...> wrote:
On Thu, 4 Feb 2021 at 19:09, Laszlo Ersek <lersek@...> wrote:

Hi All,

here's the declaration of MemoryFence(), from <BaseLib.h>:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are
guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
);
The IA32 and X64 implementations of MemoryFence() are not memory
barriers, only compiler barriers, however.

This is a problem for two reasons:

- code that believes to add a memory barrier only adds a compiler
barrier, possibly creating an unsafe situation,

- code that knowingly calls MemoryFence() for the purposes of a compiler
barrier -- while it will hopefully work safely, relying on tacit
knowledge / assumptions about the x86 architecture -- results in a
source reading experience that mismatches known lockless patterns.


This topic has repeatedly come up in the past; most recently in the
review of Ankur's v6 series for
<https://bugzilla.tianocore.org/show_bug.cgi?id=3132>, where some
inter-processor communication needs to happen.

I asked Paolo for help with understanding some of the synchronziation
patterns in PiSmmCpuDxeSmm, the reason being that Ankur's series is
going to rely on, and/or imitate, some of those patterns. From this
discussion with Paolo, the following have emerged (I'm going to
liberally steal / paraphrase Paolo's words):


(1) We should introduce finer-grained fence primitives:

ARM AARCH64 i386

CompilerFence() asm("") asm("") asm("")
AcquireMemoryFence() dmb ish dmb ishld asm("")
ReleaseMemoryFence() dmb ish dmb ish asm("")
MemoryFence() dmb ish dmb ish mfence

"where AcquireMemoryFence() is used on the read side (i.e. between
reads) and ReleaseMemoryFence() is used on the write side (i.e. between
writes)".
Acquire semantics typically order writes before reads, not /between/
reads. Similarly, release semantics imply that all outstanding writes
complete before a barrier with acquire semantics is permitted to
complete.

Note that reasoning about this only makes sense in the context of
concurrency, i.e., different CPUs operating on the same memory (or
coherent DMA masters)

For non-coherent DMA, the 'ish' variants are not appropriate, and
given the low likelihood that any of this is creates a performance
bottleneck, I would suggest to only use full barriers on ARM.

(Note that this step would make MemoryFence() heavier-weight on x86 than
it currently is.)


(2) We should audit and rework existent uses (combinations) of
MemoryFence() into matched acquire / release pairs.
It is very important to be *aware* of the acquire/release semantics,
but I don't think it is necessary to use all the fine grained barrier
types in EDK2.

Less importantly, this would restore the x86 behavior (performance) to
the one seen with the current MemoryFence() implementation.

More importantly, it would fix what the code *means*: "it's very
important to stick to known lockless patterns, and having matching
acquire/release fences makes it much more intuitive to understand what's
going on".

--*--

I'm not proposing this as a pre-requisite to merging Ankur's series
(i.e., I don't expect the series to use acquire/release -- it can be
converted later with the rest of the audit).

However, I'd really like us to stop wasting more time on MemoryFence()
doubts such as:

- Hey it's not a memory fence at all, is it safe?
Who decides what 'memory fence' means anyway?

- But on x86 it kinda is, *dependent on* the spinlocks / lock prefixes
we use for synchronization, between the other accesses.

- Is it interchangeable with volatile? Yes? No? which one should we use?
'volatile' affects code generation only, i.e., which loads and stores
are issued. The *order* in which those loads observe stores done by
other masters is what is managed by barrier instructions. So they are
two sides of the same coin: 'volatile' may help to guarantee that
every assignment through a pointer variable results in a store
instruction, but how these are observed by other CPUs still needs to
be managed if it needs to occur in a certain way.

I think volatile is the wrong tool in most cases: the ordering of
accesses are not a property of the type or of the variable, but of the
code sequence. So using asm("") to ensure that an assignment through a
pointer is emitted as a store instruction, followed by a barrier
instruction is a typical pattern. And given the fact that barrier
instructions require asm sequences in most cases anyway, the asm ("")
will often be implied.


In the longer term, multi-processing code like MpInitLib (CpuDxe,
CpuMpPei) and PiSmmCpuDxeSmm should likely abandon all use of
"volatile", and use the explicit barriers instead, for *clarity*.


Step (2) would take quite a bit of thinking (if not much code).

Comments?

Would there be interest in reviewing such work?

Thanks
Laszlo





--
Signed,
Ethin D. Probst


Andrew Fish <afish@...>
 

On Feb 4, 2021, at 10:09 AM, Laszlo Ersek <lersek@...> wrote:

Hi All,

here's the declaration of MemoryFence(), from <BaseLib.h>:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
);
The IA32 and X64 implementations of MemoryFence() are not memory
barriers, only compiler barriers, however.

This is a problem for two reasons:

- code that believes to add a memory barrier only adds a compiler
barrier, possibly creating an unsafe situation,

- code that knowingly calls MemoryFence() for the purposes of a compiler
barrier -- while it will hopefully work safely, relying on tacit
knowledge / assumptions about the x86 architecture -- results in a
source reading experience that mismatches known lockless patterns.


This topic has repeatedly come up in the past; most recently in the
review of Ankur's v6 series for
<https://bugzilla.tianocore.org/show_bug.cgi?id=3132>, where some
inter-processor communication needs to happen.

I asked Paolo for help with understanding some of the synchronziation
patterns in PiSmmCpuDxeSmm, the reason being that Ankur's series is
going to rely on, and/or imitate, some of those patterns. From this
discussion with Paolo, the following have emerged (I'm going to
liberally steal / paraphrase Paolo's words):


(1) We should introduce finer-grained fence primitives:

ARM AARCH64 i386

CompilerFence() asm("") asm("") asm("")
AcquireMemoryFence() dmb ish dmb ishld asm("")
ReleaseMemoryFence() dmb ish dmb ish asm("")
MemoryFence() dmb ish dmb ish mfence

"where AcquireMemoryFence() is used on the read side (i.e. between
reads) and ReleaseMemoryFence() is used on the write side (i.e. between
writes)".

(Note that this step would make MemoryFence() heavier-weight on x86 than
it currently is.)


(2) We should audit and rework existent uses (combinations) of
MemoryFence() into matched acquire / release pairs.

Less importantly, this would restore the x86 behavior (performance) to
the one seen with the current MemoryFence() implementation.

More importantly, it would fix what the code *means*: "it's very
important to stick to known lockless patterns, and having matching
acquire/release fences makes it much more intuitive to understand what's
going on".

--*--

I'm not proposing this as a pre-requisite to merging Ankur's series
(i.e., I don't expect the series to use acquire/release -- it can be
converted later with the rest of the audit).

However, I'd really like us to stop wasting more time on MemoryFence()
doubts such as:

- Hey it's not a memory fence at all, is it safe?

- But on x86 it kinda is, *dependent on* the spinlocks / lock prefixes
we use for synchronization, between the other accesses.

- Is it interchangeable with volatile? Yes? No? which one should we use?

In the longer term, multi-processing code like MpInitLib (CpuDxe,
CpuMpPei) and PiSmmCpuDxeSmm should likely abandon all use of
"volatile", and use the explicit barriers instead, for *clarity*.


Step (2) would take quite a bit of thinking (if not much code).

Comments?
Laszlo,

Thanks for looking into this. I noticed most of the MemoryFence() usage[1] is in the virtual platforms so changes seem manageable. I guess 3rd party code that defines its own low level primitive could be impacted, but they chose to reinvent the wheel.

It looks like the generic platform performance risk is BaseIoLibIntrinsic. I know from debugging crashes a long time ago the the MemoryFence() for x86 is really a CompilerFence() in the new scheme. If the new AcquireMemoryFence()/ReleaseMemoryFence()/CompilerFence() replace MemoryFence() this likely just works. If not then we might need an architecture specific pattern to maintain current performance, we could always make a function internal to the lib that does the right thing for the given architecture. Also making it more precise might actually help ARM/AARCH64 performance?

So if I understand correctly
1) volatile basically tells C to always read the memory. So it impacts the C memory model.
2) CompilerFence() tells the C serialize and complete everything before the barrier, and read memory again the 1st time after the barrier.
3) MemoryFence() is really dealing with speculative fetches and maybe coherency. Basically the CPU reordering things.

So for example in BaseIoLibIntrinsic() the volatile is used to force the memory read and the CompilerFence() is used to enforce the order of that read relative to other operations. Thus wrapping the volatile access in CompilerFence() is in some ways redundant as long as there is only a single access inside the CompilerFence() wrapper.

I like the idea of properly annotating the code. Given x86 defaults to so much cache coherency for compatibility with older simpler times it can tend to hide issues. I’m not sure removing the volatile keyword is required, and it does imply this variable has special rules. I guess we could add a REQUIRE_FENCE or some such? But it it feels like defining the variables in some way to imply the require special handling has some value?

The only reason you would need a volatile is if you had more than a single read inside of the fences. Something like:

UINT8 *Address;
UINT8. Value

MemoryFence ();
do {
Value = *Address
} while (Value == 0xFF) ;
MemoryFence ();

The compiler could optimize the above code (current MemoryFence() model) to a single read and infinite loop if 0xFF. Yes I realize you could move the MemoryFence () into the loop, but what if you did not care about that? Mostly just trying to think this through we some examples….

[1] /Volumes/Case/edk2-github(master)>git grep MemoryFence | grep ';'
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:320: MemoryFence ();
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:334: MemoryFence ();
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:344: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:1493: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:1497: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4702: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4707: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4712: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4765: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4770: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4775: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4828: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4833: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4838: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4892: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4894: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:86: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:88: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:115: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:117: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:147: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:149: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:179: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:181: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:211: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:213: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:243: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:245: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:275: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:277: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:305: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:307: MemoryFence ();
MdePkg/Library/BaseLib/AArch64/MemoryFence.asm:3:; MemoryFence() for AArch64
MdePkg/Library/BaseLib/AArch64/MemoryFence.asm:23:;MemoryFence (
MdePkg/Library/BaseLib/Arm/MemoryFence.asm:3:; MemoryFence() for AArch64
MdePkg/Library/BaseLib/Arm/MemoryFence.asm:24:;MemoryFence (
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:412: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:425: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:438: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:173: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:186: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:199: MemoryFence ();
OvmfPkg/Library/VirtioLib/VirtioLib.c:305: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:312: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:327: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:334: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:337: MemoryFence();
OvmfPkg/Library/VmgExitLib/VmgExitLib.c:122: MemoryFence ();
OvmfPkg/Library/VmgExitLib/VmgExitLib.c:124: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:117: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:122: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:171: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:177: MemoryFence ();
OvmfPkg/PvScsiDxe/PvScsi.c:595: MemoryFence ();
OvmfPkg/PvScsiDxe/PvScsi.c:625: MemoryFence ();
OvmfPkg/VirtioNetDxe/Events.c:56: MemoryFence ();
OvmfPkg/VirtioNetDxe/Events.c:58: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpGetStatus.c:99: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpGetStatus.c:102: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:257: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:384: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:426: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:437: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:101: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:103: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:173: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:176: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpTransmit.c:169: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpTransmit.c:172: MemoryFence ();
OvmfPkg/XenBusDxe/GrantTable.c:86: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:496: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:503: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:569: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:580: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:587: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:473: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:528: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:578: MemoryFence ();
UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c:204: MemoryFence ();


Thanks,

Andrew Fish

Would there be interest in reviewing such work?

Thanks
Laszlo


Ethin Probst
 

Yep, the point on volatile is pretty much correct. An object that has
the volatile type qualifier may be manipulated in ways unknown to the
implementation (the compiler) and, therefore, the implementation shall
not optimize away actions on that object or reorder those actions
except as permitted by expression evaluation rules. An implementation
needn't evaluate an expression (or part of one) if it determines that
its value is unused and that no needed side effects are produced. This
page (https://en.cppreference.com/w/c/language/volatile) explains the
volatile type qualifier in a lot more detail than I will here
(otherwise we'd get way off topic and I don't think we want that to
happen). I can't really provide much input to the other two points
though.

On 2/4/21, Andrew Fish via groups.io <afish@...> wrote:


On Feb 4, 2021, at 10:09 AM, Laszlo Ersek <lersek@...> wrote:

Hi All,

here's the declaration of MemoryFence(), from <BaseLib.h>:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed
to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
);
The IA32 and X64 implementations of MemoryFence() are not memory
barriers, only compiler barriers, however.

This is a problem for two reasons:

- code that believes to add a memory barrier only adds a compiler
barrier, possibly creating an unsafe situation,

- code that knowingly calls MemoryFence() for the purposes of a compiler
barrier -- while it will hopefully work safely, relying on tacit
knowledge / assumptions about the x86 architecture -- results in a
source reading experience that mismatches known lockless patterns.


This topic has repeatedly come up in the past; most recently in the
review of Ankur's v6 series for
<https://bugzilla.tianocore.org/show_bug.cgi?id=3132>, where some
inter-processor communication needs to happen.

I asked Paolo for help with understanding some of the synchronziation
patterns in PiSmmCpuDxeSmm, the reason being that Ankur's series is
going to rely on, and/or imitate, some of those patterns. From this
discussion with Paolo, the following have emerged (I'm going to
liberally steal / paraphrase Paolo's words):


(1) We should introduce finer-grained fence primitives:

ARM AARCH64 i386

CompilerFence() asm("") asm("") asm("")
AcquireMemoryFence() dmb ish dmb ishld asm("")
ReleaseMemoryFence() dmb ish dmb ish asm("")
MemoryFence() dmb ish dmb ish mfence

"where AcquireMemoryFence() is used on the read side (i.e. between
reads) and ReleaseMemoryFence() is used on the write side (i.e. between
writes)".

(Note that this step would make MemoryFence() heavier-weight on x86 than
it currently is.)


(2) We should audit and rework existent uses (combinations) of
MemoryFence() into matched acquire / release pairs.

Less importantly, this would restore the x86 behavior (performance) to
the one seen with the current MemoryFence() implementation.

More importantly, it would fix what the code *means*: "it's very
important to stick to known lockless patterns, and having matching
acquire/release fences makes it much more intuitive to understand what's
going on".

--*--

I'm not proposing this as a pre-requisite to merging Ankur's series
(i.e., I don't expect the series to use acquire/release -- it can be
converted later with the rest of the audit).

However, I'd really like us to stop wasting more time on MemoryFence()
doubts such as:

- Hey it's not a memory fence at all, is it safe?

- But on x86 it kinda is, *dependent on* the spinlocks / lock prefixes
we use for synchronization, between the other accesses.

- Is it interchangeable with volatile? Yes? No? which one should we use?

In the longer term, multi-processing code like MpInitLib (CpuDxe,
CpuMpPei) and PiSmmCpuDxeSmm should likely abandon all use of
"volatile", and use the explicit barriers instead, for *clarity*.


Step (2) would take quite a bit of thinking (if not much code).

Comments?
Laszlo,

Thanks for looking into this. I noticed most of the MemoryFence() usage[1]
is in the virtual platforms so changes seem manageable. I guess 3rd party
code that defines its own low level primitive could be impacted, but they
chose to reinvent the wheel.

It looks like the generic platform performance risk is BaseIoLibIntrinsic. I
know from debugging crashes a long time ago the the MemoryFence() for x86 is
really a CompilerFence() in the new scheme. If the new
AcquireMemoryFence()/ReleaseMemoryFence()/CompilerFence() replace
MemoryFence() this likely just works. If not then we might need an
architecture specific pattern to maintain current performance, we could
always make a function internal to the lib that does the right thing for the
given architecture. Also making it more precise might actually help
ARM/AARCH64 performance?

So if I understand correctly
1) volatile basically tells C to always read the memory. So it impacts the C
memory model.
2) CompilerFence() tells the C serialize and complete everything before the
barrier, and read memory again the 1st time after the barrier.
3) MemoryFence() is really dealing with speculative fetches and maybe
coherency. Basically the CPU reordering things.

So for example in BaseIoLibIntrinsic() the volatile is used to force the
memory read and the CompilerFence() is used to enforce the order of that
read relative to other operations. Thus wrapping the volatile access in
CompilerFence() is in some ways redundant as long as there is only a single
access inside the CompilerFence() wrapper.

I like the idea of properly annotating the code. Given x86 defaults to so
much cache coherency for compatibility with older simpler times it can tend
to hide issues. I’m not sure removing the volatile keyword is required, and
it does imply this variable has special rules. I guess we could add a
REQUIRE_FENCE or some such? But it it feels like defining the variables in
some way to imply the require special handling has some value?

The only reason you would need a volatile is if you had more than a single
read inside of the fences. Something like:

UINT8 *Address;
UINT8. Value

MemoryFence ();
do {
Value = *Address
} while (Value == 0xFF) ;
MemoryFence ();

The compiler could optimize the above code (current MemoryFence() model) to
a single read and infinite loop if 0xFF. Yes I realize you could move the
MemoryFence () into the loop, but what if you did not care about that?
Mostly just trying to think this through we some examples….

[1] /Volumes/Case/edk2-github(master)>git grep MemoryFence | grep ';'
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:320: MemoryFence ();
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:334: MemoryFence ();
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:344: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:1493: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:1497: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4702: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4707: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4712: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4765: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4770: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4775: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4828: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4833: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4838: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4892: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4894: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:86: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:88: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:115: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:117: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:147: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:149: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:179: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:181: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:211: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:213: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:243: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:245: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:275: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:277: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:305: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:307: MemoryFence ();
MdePkg/Library/BaseLib/AArch64/MemoryFence.asm:3:; MemoryFence() for
AArch64
MdePkg/Library/BaseLib/AArch64/MemoryFence.asm:23:;MemoryFence (
MdePkg/Library/BaseLib/Arm/MemoryFence.asm:3:; MemoryFence() for AArch64
MdePkg/Library/BaseLib/Arm/MemoryFence.asm:24:;MemoryFence (
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:412: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:425: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:438: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:173: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:186: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:199: MemoryFence ();
OvmfPkg/Library/VirtioLib/VirtioLib.c:305: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:312: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:327: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:334: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:337: MemoryFence();
OvmfPkg/Library/VmgExitLib/VmgExitLib.c:122: MemoryFence ();
OvmfPkg/Library/VmgExitLib/VmgExitLib.c:124: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:117:
MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:122:
MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:171:
MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:177:
MemoryFence ();
OvmfPkg/PvScsiDxe/PvScsi.c:595: MemoryFence ();
OvmfPkg/PvScsiDxe/PvScsi.c:625: MemoryFence ();
OvmfPkg/VirtioNetDxe/Events.c:56: MemoryFence ();
OvmfPkg/VirtioNetDxe/Events.c:58: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpGetStatus.c:99: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpGetStatus.c:102: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:257: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:384: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:426: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:437: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:101: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:103: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:173: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:176: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpTransmit.c:169: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpTransmit.c:172: MemoryFence ();
OvmfPkg/XenBusDxe/GrantTable.c:86: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:496: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:503: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:569: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:580: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:587: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:473: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:528: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:578: MemoryFence ();
UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c:204:
MemoryFence ();


Thanks,

Andrew Fish

Would there be interest in reviewing such work?

Thanks
Laszlo





--
Signed,
Ethin D. Probst


Laszlo Ersek
 

On 02/04/21 22:31, Andrew Fish wrote:


On Feb 4, 2021, at 10:09 AM, Laszlo Ersek <lersek@...> wrote:

Hi All,

here's the declaration of MemoryFence(), from <BaseLib.h>:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
);
The IA32 and X64 implementations of MemoryFence() are not memory
barriers, only compiler barriers, however.

This is a problem for two reasons:

- code that believes to add a memory barrier only adds a compiler
barrier, possibly creating an unsafe situation,

- code that knowingly calls MemoryFence() for the purposes of a compiler
barrier -- while it will hopefully work safely, relying on tacit
knowledge / assumptions about the x86 architecture -- results in a
source reading experience that mismatches known lockless patterns.


This topic has repeatedly come up in the past; most recently in the
review of Ankur's v6 series for
<https://bugzilla.tianocore.org/show_bug.cgi?id=3132>, where some
inter-processor communication needs to happen.

I asked Paolo for help with understanding some of the synchronziation
patterns in PiSmmCpuDxeSmm, the reason being that Ankur's series is
going to rely on, and/or imitate, some of those patterns. From this
discussion with Paolo, the following have emerged (I'm going to
liberally steal / paraphrase Paolo's words):


(1) We should introduce finer-grained fence primitives:

ARM AARCH64 i386

CompilerFence() asm("") asm("") asm("")
AcquireMemoryFence() dmb ish dmb ishld asm("")
ReleaseMemoryFence() dmb ish dmb ish asm("")
MemoryFence() dmb ish dmb ish mfence

"where AcquireMemoryFence() is used on the read side (i.e. between
reads) and ReleaseMemoryFence() is used on the write side (i.e. between
writes)".

(Note that this step would make MemoryFence() heavier-weight on x86 than
it currently is.)


(2) We should audit and rework existent uses (combinations) of
MemoryFence() into matched acquire / release pairs.

Less importantly, this would restore the x86 behavior (performance) to
the one seen with the current MemoryFence() implementation.

More importantly, it would fix what the code *means*: "it's very
important to stick to known lockless patterns, and having matching
acquire/release fences makes it much more intuitive to understand what's
going on".

--*--

I'm not proposing this as a pre-requisite to merging Ankur's series
(i.e., I don't expect the series to use acquire/release -- it can be
converted later with the rest of the audit).

However, I'd really like us to stop wasting more time on MemoryFence()
doubts such as:

- Hey it's not a memory fence at all, is it safe?

- But on x86 it kinda is, *dependent on* the spinlocks / lock prefixes
we use for synchronization, between the other accesses.

- Is it interchangeable with volatile? Yes? No? which one should we use?

In the longer term, multi-processing code like MpInitLib (CpuDxe,
CpuMpPei) and PiSmmCpuDxeSmm should likely abandon all use of
"volatile", and use the explicit barriers instead, for *clarity*.


Step (2) would take quite a bit of thinking (if not much code).

Comments?
Laszlo,

Thanks for looking into this. I noticed most of the MemoryFence() usage[1] is in the virtual platforms so changes seem manageable. I guess 3rd party code that defines its own low level primitive could be impacted, but they chose to reinvent the wheel.

It looks like the generic platform performance risk is BaseIoLibIntrinsic. I know from debugging crashes a long time ago the the MemoryFence() for x86 is really a CompilerFence() in the new scheme. If the new AcquireMemoryFence()/ReleaseMemoryFence()/CompilerFence() replace MemoryFence() this likely just works. If not then we might need an architecture specific pattern to maintain current performance, we could always make a function internal to the lib that does the right thing for the given architecture. Also making it more precise might actually help ARM/AARCH64 performance?

So if I understand correctly
1) volatile basically tells C to always read the memory. So it impacts the C memory model.
2) CompilerFence() tells the C serialize and complete everything before the barrier, and read memory again the 1st time after the barrier.
3) MemoryFence() is really dealing with speculative fetches and maybe coherency. Basically the CPU reordering things.

So for example in BaseIoLibIntrinsic() the volatile is used to force the memory read and the CompilerFence() is used to enforce the order of that read relative to other operations. Thus wrapping the volatile access in CompilerFence() is in some ways redundant as long as there is only a single access inside the CompilerFence() wrapper.
Right, I agree with everything you said thus far. (I don't claim to be
an expert on this, of course.) CompilerFence() is a "better volatile"
(for RAM only, not for MMIO), where we can allow the compiler to
optimize, up to a certain location.

I like the idea of properly annotating the code. Given x86 defaults to so much cache coherency for compatibility with older simpler times it can tend to hide issues. I’m not sure removing the volatile keyword is required, and it does imply this variable has special rules. I guess we could add a REQUIRE_FENCE or some such? But it it feels like defining the variables in some way to imply the require special handling has some value?
Objects that are shared between processors, or between processor and
some device, should always be highlighted in some way (naming, comments,
...).


The only reason you would need a volatile is if you had more than a single read inside of the fences. Something like:

UINT8 *Address;
UINT8. Value

MemoryFence ();
do {
Value = *Address
} while (Value == 0xFF) ;
MemoryFence ();

The compiler could optimize the above code (current MemoryFence() model) to a single read and infinite loop if 0xFF. Yes I realize you could move the MemoryFence () into the loop, but what if you did not care about that? Mostly just trying to think this through we some examples….
I think it depends on what Address refers to. If it refers to an MMIO
register, then volatile is needed, IMO (and the MemoryFence()s are
unnecessary). If Address points to a byte in RAM that's manipulated by
multiple CPUs, then I think:

- volatile is not needed

- the two MemoryFence() calls should be removed

- an AcquireMemoryFence() call should be inserted into the loop, after
the read access. Because, presumably, once the flag assumes the
appropriate value (it has been "released")), we'll proceed to reading
some memory resource that was protected by the flag until then.


Regarding the many MemoryFence() calls in the virt platforms, most of
them do not exist for inter-CPU communication, but for communication
with devices through RAM. I admit I'm unsure what we should do with them.

Thanks
Laszlo


[1] /Volumes/Case/edk2-github(master)>git grep MemoryFence | grep ';'
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:320: MemoryFence ();
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:334: MemoryFence ();
ArmVirtPkg/Library/QemuFwCfgLib/QemuFwCfgLib.c:344: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:1493: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:1497: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4702: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4707: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4712: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4765: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4770: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4775: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4828: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4833: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4838: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4892: MemoryFence ();
MdeModulePkg/Universal/EbcDxe/EbcExecute.c:4894: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:86: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:88: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:115: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:117: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:147: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:149: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:179: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:181: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:211: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:213: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:243: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:245: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:275: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:277: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:305: MemoryFence ();
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:307: MemoryFence ();
MdePkg/Library/BaseLib/AArch64/MemoryFence.asm:3:; MemoryFence() for AArch64
MdePkg/Library/BaseLib/AArch64/MemoryFence.asm:23:;MemoryFence (
MdePkg/Library/BaseLib/Arm/MemoryFence.asm:3:; MemoryFence() for AArch64
MdePkg/Library/BaseLib/Arm/MemoryFence.asm:24:;MemoryFence (
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:412: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:425: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgDxe.c:438: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:173: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:186: MemoryFence ();
OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPei.c:199: MemoryFence ();
OvmfPkg/Library/VirtioLib/VirtioLib.c:305: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:312: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:327: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:334: MemoryFence();
OvmfPkg/Library/VirtioLib/VirtioLib.c:337: MemoryFence();
OvmfPkg/Library/VmgExitLib/VmgExitLib.c:122: MemoryFence ();
OvmfPkg/Library/VmgExitLib/VmgExitLib.c:124: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:117: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:122: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:171: MemoryFence ();
OvmfPkg/Library/XenConsoleSerialPortLib/XenConsoleSerialPortLib.c:177: MemoryFence ();
OvmfPkg/PvScsiDxe/PvScsi.c:595: MemoryFence ();
OvmfPkg/PvScsiDxe/PvScsi.c:625: MemoryFence ();
OvmfPkg/VirtioNetDxe/Events.c:56: MemoryFence ();
OvmfPkg/VirtioNetDxe/Events.c:58: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpGetStatus.c:99: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpGetStatus.c:102: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:257: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:384: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:426: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpInitialize.c:437: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:101: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:103: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:173: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpReceive.c:176: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpTransmit.c:169: MemoryFence ();
OvmfPkg/VirtioNetDxe/SnpTransmit.c:172: MemoryFence ();
OvmfPkg/XenBusDxe/GrantTable.c:86: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:496: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:503: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:569: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:580: MemoryFence ();
OvmfPkg/XenBusDxe/XenStore.c:587: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:473: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:528: MemoryFence ();
OvmfPkg/XenPvBlkDxe/BlockFront.c:578: MemoryFence ();
UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c:204: MemoryFence ();


Thanks,

Andrew Fish

Would there be interest in reviewing such work?

Thanks
Laszlo


Ni, Ray
 

So if I understand correctly
1) volatile basically tells C to always read the memory. So it impacts the C memory model.
2) CompilerFence() tells the C serialize and complete everything before the barrier, and read memory again the 1st time after the barrier.
3) MemoryFence() is really dealing with speculative fetches and maybe coherency. Basically the CPU reordering things.
I agree with this summary.
Volatile using in MpInitLib or PiSmmCpu driver cannot be emitted because different CPU threads change the same memory content to synchronize with each other.
I don’t quite understand if removing “volatile” what mechanism can be used to force the compiler generating code that always reads from memory?


Thanks,
Ray


Laszlo Ersek
 

On 02/04/21 21:04, Paolo Bonzini wrote:
Il gio 4 feb 2021, 20:46 Ard Biesheuvel <ardb@...> ha scritto:
It is very important to be *aware* of the acquire/release semantics,
but I don't think it is necessary to use all the fine grained barrier
types in EDK2.
I agree as long as the primitives are self-documenting. A single
MemoryFence() does not make it clear in which direction the data is flowing
(whether from other threads to this one, or vice versa).
I can say about the MemoryFence() calls I've added, and about the ones
Ankur is about to add, that we're aware of the data flow directions. We
could certainly annotate every such location with either "we're
publishing information here" or "we're consuming information here".

So for example:

(a) publish:

- prepare and expose the data
- "release fence"
- release the spinlock with InterlockedCompareExchange, or push the MMIO
doorbell register

(b1) consume:

- spin on a spinlock with InterlockedCompareExchange,
- "acquire fence"
- gather / consume the data

(b2) consume:

- spin on an MMIO latch, using "volatile"
- "acquire fence"
- gather / consume the data

(b3) consume:

- spin on a normal memory location with an "acquire fence" at the end of
the loop body -- still inside the loop body,
- gather / consume the data, after the loop


I believe these general patterns / semantics are "more or less" clear to
us all; I personally have not had to do anything fancier than the above,
programming fw_cfg, virtio, or even some inter-processor communication.

*However*, how those natural-language "fence" statements map to compiler
fences, CPU/memory fences, especially with regard to architecture
specifics -- I honestly admit that's where my eyes glaze over.

The MemoryFence() API has been very convenient for *both* of those
directions, but:

- it does not inherently express the intended data flow direction, as
Paolo notes

- it's unclear whether "volatile" (or a compiler barrier) is needed *in
addition* to MemoryFence() or not -- after all, if the compiler didn't
produce the necessary store/load instructions yet, "fencing" the effects
of those instructions makes no sense (paraphrasing Ard),

- the x86 implementation of MemoryFence() openly admits to breaking the
interface contract ("this is a compiler fence only"), without explaining
why -- i.e., for what architectural reasons -- it might *still* suffice
for upholding the contract.

So in the end one just goes for: "whenever in doubt, use volatile *PLUS*
MemoryFence()".


Here's what I'd prefer:

- someone more versed in the *implementation* of these primitivies than
myself should please add the "directional" barriers,

- if possible, I would prefer if a naked compiler barrier API didn't
exist at all (because (a) the above-noted directional barriers should
*guarantee* in the contracts to imply the necessary compiler barrier(s),
and (b) I wouldn't know where I'd be content with *just* a compiler barrier)


Then I could go over the MemoryFence() calls I've added, and replace
each with the proper directional variant, while simultaneously killing
*all* related "volatile" qualifiers. (I'd only keep volatile where it
referred to a known MMIO register, for example for pressing or reading
an MMIO doorbell / latch.)

Then I think someone from Intel should do the same for PiSmmCpuDxeSmm
and MpInitLib -- because every time I need to look at the swaths of
"volatile" in those drivers, I must start reasoning from square#1. And
for the life of me, I cannot squeeze a bunch of the patterns I see there
into any one of the patterns (a), (b1), (b2), and (b3) that I outlined
above. In every such case, I'm left wondering, "does it work by chance
only, or by silently exploiting architecture guarantees that I don't
know about".

In summary, I'm mostly sure what I'd do about the volatile's and
MemoryFence()s that *I* have written, provided the directional barrier
variants existed, but I don't have the first clue about cleaning up
PiSmmCpuDxeSmm and MpInitLib. And ultimately, it would be awesome if,
when discussing a patch series like Ankur's v6, we didn't have to go
look for examples and tacit intentions in PiSmmCpuDxeSmm. :/

Thanks,
Laszlo


Paolo Bonzini <pbonzini@...>
 

On 05/02/21 08:48, Laszlo Ersek wrote:
MemoryFence ();
do {
Value = *Address
} while (Value == 0xFF) ;
MemoryFence ();
If Address points to a byte in RAM that's manipulated by
multiple CPUs, then I think:
- volatile is not needed
- the two MemoryFence() calls should be removed
- an AcquireMemoryFence() call should be inserted into the loop, after
the read access. Because, presumably, once the flag assumes the
appropriate value (it has been "released")), we'll proceed to reading
some memory resource that was protected by the flag until then.
This is correct, alternatively you could have this:

do {
// Force re-reading *Address on every iteration
CompilerFence ();
Value = *Address;
} while (Value == 0xFF);
// As you explained above.
AcquireMemoryFence ();

Paolo


Ni, Ray
 

This is very interesting. I am trying to involve the discussion to learn.
I did some experiment.
----a.c----
// Copy EDKII MemoryFence() and as Laszlo said it's just a compiler fence.
void CompilerFence () {
return;
}

void main () {
// I copied Paolo's code in below.
int *Address = &main;
int Value;

do {
// Force re-reading *Address on every iteration
CompilerFence ();
Value = *Address;
} while (Value == 0xFF);
// As you explained above.
__asm mfence;
}
----END of a.c----

When I use "Cl.exe /O2 /FAcs a.c" to compile it.
I got a.cod as below
-----a.cod----
; Listing generated by Microsoft (R) Optimizing Compiler Version 19.27.29112.0

TITLE E:\Work\edk2\a.c
.686P
.XMM
include listing.inc
.model flat

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

PUBLIC _CompilerFence
PUBLIC _main
; Function compile flags: /Ogtpy
; File E:\Work\edk2\a.c
; COMDAT _main
_TEXT SEGMENT
_main PROC ; COMDAT

; 6 : int *Address = &main;

00000 b8 00 00 00 00 mov eax, OFFSET _main
00005 8b 00 mov eax, DWORD PTR [eax]
$LL4@main:

; 7 : int Value;
; 8 :
; 9 : do {
; 10 : // Force re-reading *Address on every iteration
; 11 : CompilerFence ();
; 12 : Value = *Address;
; 13 : } while (Value == 0xFF);

00007 3d ff 00 00 00 cmp eax, 255 ; 000000ffH
0000c 74 f9 je SHORT $LL4@main

; 14 : // As you explained above.
; 15 : __asm mfence;

0000e 0f ae f0 mfence
00011 c3 ret 0
_main ENDP
_TEXT ENDS
; Function compile flags: /Ogtpy
; File E:\Work\edk2\a.c
; COMDAT _CompilerFence
_TEXT SEGMENT
_CompilerFence PROC ; COMDAT

; 2 : return;
; 3 : }

00000 c2 00 00 ret 0
_CompilerFence ENDP
_TEXT ENDS
END
-----END of a.cod-----

Check the assembly:
00007 3d ff 00 00 00 cmp eax, 255 ; 000000ffH

The CompilerFence() call doesn't force to generate code to always read the memory pointed by Address.

If I replace CompilerFence() with _ReadBarrier(), compiler can generate assembly that always
reads the memory pointed by Address.

It seems the EDKII MemoryFence() even cannot be used as a read barrier.
Does it mean that the MSVC version of EDKII MemoryFence() is implemented wrongly?

Thanks,
Ray

-----Original Message-----
From: Paolo Bonzini <pbonzini@...>
Sent: Friday, February 5, 2021 5:13 PM
To: Laszlo Ersek <lersek@...>; Andrew Fish <afish@...>
Cc: edk2 RFC list <rfc@edk2.groups.io>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address)
<leif@...>; Ard Biesheuvel (kernel.org address) <ardb@...>; Dong, Eric <eric.dong@...>; Ni, Ray
<ray.ni@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...>
Subject: Re: MemoryFence()

On 05/02/21 08:48, Laszlo Ersek wrote:
MemoryFence ();
do {
Value = *Address
} while (Value == 0xFF) ;
MemoryFence ();
If Address points to a byte in RAM that's manipulated by
multiple CPUs, then I think:

- volatile is not needed

- the two MemoryFence() calls should be removed

- an AcquireMemoryFence() call should be inserted into the loop, after
the read access. Because, presumably, once the flag assumes the
appropriate value (it has been "released")), we'll proceed to reading
some memory resource that was protected by the flag until then.
This is correct, alternatively you could have this:

do {
// Force re-reading *Address on every iteration
CompilerFence ();
Value = *Address;
} while (Value == 0xFF);
// As you explained above.
AcquireMemoryFence ();

Paolo


Paolo Bonzini <pbonzini@...>
 

On 05/02/21 14:15, Ni, Ray wrote:
I did some experiment.
----a.c----
// Copy EDKII MemoryFence() and as Laszlo said it's just a compiler fence.
void CompilerFence () {
return;
This should be

#define CompilerFence _ReadWriteBarrier

Paolo

}
void main () {
// I copied Paolo's code in below.
int *Address = &main;
int Value;
do {
// Force re-reading *Address on every iteration
CompilerFence ();
Value = *Address;
} while (Value == 0xFF);
// As you explained above.
__asm mfence;
}
----END of a.c----


Laszlo Ersek
 

On 02/05/21 10:12, Paolo Bonzini wrote:
On 05/02/21 08:48, Laszlo Ersek wrote:
   MemoryFence ();
   do {
     Value = *Address
   } while (Value == 0xFF) ;
   MemoryFence ();
If Address points to a byte in RAM that's manipulated by
multiple CPUs, then I think:

- volatile is not needed

- the two MemoryFence() calls should be removed

- an AcquireMemoryFence() call should be inserted into the loop, after
the read access. Because, presumably, once the flag assumes the
appropriate value (it has been "released")), we'll proceed to reading
some memory resource that was protected by the flag until then.
This is correct, alternatively you could have this:

    do {
      // Force re-reading *Address on every iteration
      CompilerFence ();
      Value = *Address;
    } while (Value == 0xFF);
    // As you explained above.
    AcquireMemoryFence ();
Ah, thanks; also good example where a CompilerFence() would be justified.

Laszlo


Laszlo Ersek
 

On 02/05/21 09:25, Ni, Ray wrote:
So if I understand correctly
1) volatile basically tells C to always read the memory. So it impacts the C memory model.
2) CompilerFence() tells the C serialize and complete everything before the barrier, and read memory again the 1st time after the barrier.
3) MemoryFence() is really dealing with speculative fetches and maybe coherency. Basically the CPU reordering things.
I agree with this summary.
Volatile using in MpInitLib or PiSmmCpu driver cannot be emitted because different CPU threads change the same memory content to synchronize with each other.
I don’t quite understand if removing “volatile” what mechanism can be used to force the compiler generating code that always reads from memory?
A compiler barrier (in case that's sufficient), or even a memory fence
-- seems more appropriate for inter-CPU synchronization -- which I think
should be implemented to automatically force a compiler barrier too.

Laszlo


Laszlo Ersek
 

On 02/05/21 14:15, Ni, Ray wrote:
This is very interesting. I am trying to involve the discussion to
learn.
I did some experiment.
----a.c----
// Copy EDKII MemoryFence() and as Laszlo said it's just a compiler fence.
void CompilerFence () {
return;
}

void main () {
// I copied Paolo's code in below.
int *Address = &main;
int Value;

do {
// Force re-reading *Address on every iteration
CompilerFence ();
Value = *Address;
} while (Value == 0xFF);
// As you explained above.
__asm mfence;
}
----END of a.c----

When I use "Cl.exe /O2 /FAcs a.c" to compile it.
I got a.cod as below
-----a.cod----
; Listing generated by Microsoft (R) Optimizing Compiler Version 19.27.29112.0

TITLE E:\Work\edk2\a.c
.686P
.XMM
include listing.inc
.model flat

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

PUBLIC _CompilerFence
PUBLIC _main
; Function compile flags: /Ogtpy
; File E:\Work\edk2\a.c
; COMDAT _main
_TEXT SEGMENT
_main PROC ; COMDAT

; 6 : int *Address = &main;

00000 b8 00 00 00 00 mov eax, OFFSET _main
00005 8b 00 mov eax, DWORD PTR [eax]
$LL4@main:

; 7 : int Value;
; 8 :
; 9 : do {
; 10 : // Force re-reading *Address on every iteration
; 11 : CompilerFence ();
; 12 : Value = *Address;
; 13 : } while (Value == 0xFF);

00007 3d ff 00 00 00 cmp eax, 255 ; 000000ffH
0000c 74 f9 je SHORT $LL4@main

; 14 : // As you explained above.
; 15 : __asm mfence;

0000e 0f ae f0 mfence
00011 c3 ret 0
_main ENDP
_TEXT ENDS
; Function compile flags: /Ogtpy
; File E:\Work\edk2\a.c
; COMDAT _CompilerFence
_TEXT SEGMENT
_CompilerFence PROC ; COMDAT

; 2 : return;
; 3 : }

00000 c2 00 00 ret 0
_CompilerFence ENDP
_TEXT ENDS
END
-----END of a.cod-----

Check the assembly:
00007 3d ff 00 00 00 cmp eax, 255 ; 000000ffH

The CompilerFence() call doesn't force to generate code to always read
the memory pointed by Address.
That's because your CompilerFence() function is not correct. You did:

void CompilerFence () {
return;
}
which is wrong. It enforces a C-language sequence point, but it does not
implement a compiler barrier.

And your code is wrong because the *original*, namely the MSFT
implementation of MemoryFence() in edk2, is bugged:

* MdePkg/Library/BaseLib/X86MemoryFence.c:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
)
{
return;
}
This code comes from historical commit e1f414b6a7d8 ("Import some basic
libraries instances for Mde Packages.", 2007-06-22).

* MdePkg/Library/BaseLib/BaseLib.inf:

[Sources.Ia32]
X86MemoryFence.c | MSFT

[Sources.X64]
X86MemoryFence.c | MSFT
Back to your email:

If I replace CompilerFence() with _ReadBarrier(), compiler can
generate assembly that always reads the memory pointed by Address.

It seems the EDKII MemoryFence() even cannot be used as a read
barrier.
Does it mean that the MSVC version of EDKII MemoryFence() is
implemented wrongly?
That's precisely the case.

This actually explains why MpInitLib and PiSmmCpuDxeSmm use "volatile",
rather than MemoryFence(). (This is one of those info bits that I've
been missing.) The reason is that, when built with Visual Studio,
MemoryFence() does absolutely nothing. So if MpInitLib and
PiSmmCpuDxeSmm used MemoryFence() rather than "volatile", things would
break immediately.

Thank you Ray for highlighting this, I couldn't have imagined.

The current status means that, considering all toolchains and all
architectures that edk2 supports, together, it's been *impossible* to
use MemoryFence() correctly.

And so my defensive thinking "whenever in doubt, use 'volatile' PLUS
MemoryFence()" apparently paid off, because when built with Visual
Studio, OVMF is left with 'volatile' *only*; in effect.

It also means that in Ankur's upcoming v7 series, relying *just* on
MemoryFence() will not suffice. We need 'volatile' (given the current
state of primitives); otherwise we'll have no barriers at all.

... I don't even know where to start changing the code. :(

Laszlo


Paolo Bonzini <pbonzini@...>
 

On 05/02/21 16:34, Laszlo Ersek wrote:
And your code is wrong because the*original*, namely the MSFT
implementation of MemoryFence() in edk2, is bugged:
* MdePkg/Library/BaseLib/X86MemoryFence.c:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
)
{
return;
}
I think it's okay-ish (it works as long as you don't do link-time optimization, but it's less efficient than a compiler intrinsic) because X86MemoryFence.c is a separate file.

Paolo


Ard Biesheuvel <ardb@...>
 

On Fri, 5 Feb 2021 at 16:38, Paolo Bonzini <pbonzini@...> wrote:

On 05/02/21 16:34, Laszlo Ersek wrote:
And your code is wrong because the*original*, namely the MSFT
implementation of MemoryFence() in edk2, is bugged:

* MdePkg/Library/BaseLib/X86MemoryFence.c:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
)
{
return;
}
I think it's okay-ish (it works as long as you don't do link-time
optimization, but it's less efficient than a compiler intrinsic) because
X86MemoryFence.c is a separate file.
We rely heavily on LTO for both Visual Studio and GNU/Clang toolchains.


Paolo Bonzini <pbonzini@...>
 

On 05/02/21 16:39, Ard Biesheuvel wrote:
On Fri, 5 Feb 2021 at 16:38, Paolo Bonzini <pbonzini@...> wrote:

On 05/02/21 16:34, Laszlo Ersek wrote:
And your code is wrong because the*original*, namely the MSFT
implementation of MemoryFence() in edk2, is bugged:

* MdePkg/Library/BaseLib/X86MemoryFence.c:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
)
{
return;
}
I think it's okay-ish (it works as long as you don't do link-time
optimization, but it's less efficient than a compiler intrinsic) because
X86MemoryFence.c is a separate file.
We rely heavily on LTO for both Visual Studio and GNU/Clang toolchains.
Oops... Well, it *was* okay at the time MemoryFence() was first implemented. :(

Paolo


Ni, Ray
 

Maybe that explains why the MemoryFence was implemented so confidently.
I trust Mike:)

In UefiCpuPkg/Library/RegisterCpuFeaturesLib/CpuFeaturesInitialize.c, the LibWaitForSemaphore, LibReleaseSemaphore are coded without the read write barrier.
But the AcquireSpinLockOrFail and ReleaseSpinLock in MdePkg/Library/BaseSynchronizationLib/SynchronizationMsc.c contain read write barrier.
Is the former implementation wrong because without the barrier the cmpxchg can happen earlier or later chosen by the compiler?

Does volatile keyword help here?

thanks,
ray
________________________________
发件人: Paolo Bonzini <pbonzini@...>
发送时间: Friday, February 5, 2021 11:42:36 PM
收件人: Ard Biesheuvel <ardb@...>
抄送: Laszlo Ersek <lersek@...>; Ni, Ray <ray.ni@...>; Andrew Fish <afish@...>; edk2 RFC list <rfc@edk2.groups.io>; Kinney, Michael D <michael.d.kinney@...>; Leif Lindholm (Nuvia address) <leif@...>; Dong, Eric <eric.dong@...>; Liming Gao (Byosoft address) <gaoliming@...>; Ankur Arora <ankur.a.arora@...>
主题: Re: MemoryFence()

On 05/02/21 16:39, Ard Biesheuvel wrote:
On Fri, 5 Feb 2021 at 16:38, Paolo Bonzini <pbonzini@...> wrote:

On 05/02/21 16:34, Laszlo Ersek wrote:
And your code is wrong because the*original*, namely the MSFT
implementation of MemoryFence() in edk2, is bugged:

* MdePkg/Library/BaseLib/X86MemoryFence.c:

/**
Used to serialize load and store operations.

All loads and stores that proceed calls to this function are guaranteed to be
globally visible when this function returns.

**/
VOID
EFIAPI
MemoryFence (
VOID
)
{
return;
}
I think it's okay-ish (it works as long as you don't do link-time
optimization, but it's less efficient than a compiler intrinsic) because
X86MemoryFence.c is a separate file.
We rely heavily on LTO for both Visual Studio and GNU/Clang toolchains.
Oops... Well, it *was* okay at the time MemoryFence() was first
implemented. :(

Paolo