MemoryFence()
Laszlo Ersek
On 02/11/21 10:07, Paolo Bonzini wrote:
> On 10/02/21 20:07, Andrew Fish wrote:
>> Laszlo,
>>
>> I think it makes sense to “break this up”. Seems like we need correct
>> primitives and documentation on how to use them. It is easy enough
>> after that to use code review to make new code “correct”, but as you
>> say fixing the general volatile usage (especially given some of the
>> VC++ behavior) is a big undertaking. Let us not have perfection block
>> being able to do it correctly going forward.
>>
>> Changing the CPU drivers (MP libraries) is a complex undertaking as
>> we really need to test against the various compilers. I’ve tracked
>> down quite a few MP bugs in proprietary CPU drivers that did not show
>> up under VC++, but did under clang.
>
> I think the first step should be to introduce the new fence primitives
> and fixing MemoryFence to be what it says on the tin.

Right, I hope to propose some patches after the stable tag.

> I can volunteer for the work of removing volatile once those are in.

Thanks, Paolo -- should not be your job, but I'll gladly take your help.
Seeing a few examples from you will hopefully teach me some patterns.

Laszlo
Paolo Bonzini <pbonzini@...>
On 10/02/21 20:07, Andrew Fish wrote:
> Laszlo,
>
> I think it makes sense to “break this up”. Seems like we need correct
> primitives and documentation on how to use them. It is easy enough
> after that to use code review to make new code “correct”, but as you
> say fixing the general volatile usage (especially given some of the
> VC++ behavior) is a big undertaking. Let us not have perfection block
> being able to do it correctly going forward.
>
> Changing the CPU drivers (MP libraries) is a complex undertaking as
> we really need to test against the various compilers. I’ve tracked
> down quite a few MP bugs in proprietary CPU drivers that did not show
> up under VC++, but did under clang.

I think the first step should be to introduce the new fence primitives
and fixing MemoryFence to be what it says on the tin.

I can volunteer for the work of removing volatile once those are in.

Paolo
Ankur Arora
On 2021-02-10 7:55 a.m., Laszlo Ersek wrote:
> On 02/10/21 08:21, Ankur Arora wrote:
>> In general I would say that fences are -- except for one case -- strictly
>> greater in power than read_once/write_once. That one case is alignment
>> checks (which we do for instance in MmioRead32() and friends.)
>>
>> Given that difference in power, I agree with the approach Paolo
>> recommended elsewhere that for CPU memory access synchronization we
>> start with Fences and then switch to lesser powered variants
>> as and when needed:
>>
>> "However, these changes do not have to be done in a single step.
>> Starting with the rationalization of memory fences makes sense
>> because, in addition to enabling the annotation of code rather
>> than data, it is essentially a bugfix."
>
> Given the actual amount of MemoryFence() calls and "volatile" uses in
> edk2, it's actually a pretty large undertaking to "fix" these issues. I
> don't foresee a quick resolution here, especially if I'm supposed to
> work on it alone (considering actual patches).
>
> I've not seen any feedback from Mike or Liming in this thread yet, for
> example, so I'm unsure about the MdePkg (BaseLib) maintainer buy-in.
>
> This is just to state the context in which I interpret "first step" and
> "essentially a bugfix". The issue is systemic in edk2, as every such
> occasion as we've run into now has only added to the proliferation of
> "volatile".

Yeah, also given that the bugs around this are subtle. And, it is
a non-trivial effort to test all the affected pieces of logic.

Just thinking out loud -- maybe the thing to do might be to write
some validation code that compares before and after generated code to
flag cases when there are unexpected changes.

I'm not the most experienced with low level barriers and the like (except
for having used them once in a while) but happy to help any way I can.

> The next stable tag is coming up too. I recommend we focus on merging
> the VCPU hotplug series. I don't intend to let this thread silently
> peter out (like many similar threads must have, in the past); I hope to
> do something about actual patches in the next development cycle. I do
> admit it looks like a daunting task, considering the multiple
> architectures and toolchains.
>
> (NB: we've not spent a single word on RISC-V yet -- AFAICS, there RISC-V
> doesn't even have *any* implementation of MemoryFence(), yet. Refer to
> commit 7601b251fd5c ("MdePkg/BaseLib: BaseLib for RISCV64 architecture",
> 2020-05-07), for example.)
>
> "You touch it, you own it" and "learn as you go" should not be a
> surprise to me in an open source project, but I admit I don't know if I
> should even attempt patching UefiCpuPkg content (MdePkg and OvmfPkg look
> more approachable). The risk of regressions runs high, reproducibility
> is always a mess in multiprocessing code, and the code is also
> privileged. I want it fixed everywhere, but I feel uncertain about the
> apparent lack of commitment.
>
> I have some hands-on multiprocessing background, but I've always made a
> very conscious decision to stay away from lockless programming, and use
> safe and *simple* POSIX Threads-based schemes (mutexes, condition
> variables with *broadcasts* etc). The "network of IP blocks" that Ard
> used for (correctly) describing modern computer architecture is way
> below the programming abstraction level where I feel comfortable. Thus
> I'm very much out of my comfort zone here. (My initial decision against
> lockless programming goes back to when <http://www.1024cores.net/> was
> published originally.)
>
> Anyway: I have several messages in this thread starred for revising.
>
> Ankur, is the picture clear enough for you to return to work on the v7
> unplug series? The soft feature freeze is coming: 2021-02-22.
>
> https://github.com/tianocore/tianocore.github.io/wiki/EDK-II-Release-Planning
>
> Thanks
> Laszlo

Thanks for the heads up. Yeah, I think things are clear enough for me to
send out v7. I plan to send that out sometime early next week so there's
enough time to review before the soft freeze date.

Thanks
Ankur
Andrew Fish <afish@...>
On Feb 10, 2021, at 7:55 AM, Laszlo Ersek <lersek@...> wrote:
> On 02/10/21 08:21, Ankur Arora wrote:
>> In general I would say that fences are -- except for one case -- strictly
>> greater in power than read_once/write_once. That one case is alignment
>> checks (which we do for instance in MmioRead32() and friends.)
>>
>> Given that difference in power, I agree with the approach Paolo
>> recommended elsewhere that for CPU memory access synchronization we
>> start with Fences and then switch to lesser powered variants
>> as and when needed:
>>
>> "However, these changes do not have to be done in a single step.
>> Starting with the rationalization of memory fences makes sense
>> because, in addition to enabling the annotation of code rather
>> than data, it is essentially a bugfix."
>
> Given the actual amount of MemoryFence() calls and "volatile" uses in
> edk2, it's actually a pretty large undertaking to "fix" these issues. I
> don't foresee a quick resolution here, especially if I'm supposed to
> work on it alone (considering actual patches).
>
> I've not seen any feedback from Mike or Liming in this thread yet, for
> example, so I'm unsure about the MdePkg (BaseLib) maintainer buy-in.
>
> This is just to state the context in which I interpret "first step" and
> "essentially a bugfix". The issue is systemic in edk2, as every such
> occasion as we've run into now has only added to the proliferation of
> "volatile".

Laszlo,

I think it makes sense to “break this up”. Seems like we need correct primitives and documentation on how to use them. It is easy enough after that to use code review to make new code “correct”, but as you say fixing the general volatile usage (especially given some of the VC++ behavior) is a big undertaking. Let us not have perfection block being able to do it correctly going forward.

Changing the CPU drivers (MP libraries) is a complex undertaking as we really need to test against the various compilers. I’ve tracked down quite a few MP bugs in proprietary CPU drivers that did not show up under VC++, but did under clang.

Thanks,
Andrew Fish

> The next stable tag is coming up too. I recommend we focus on merging
> the VCPU hotplug series. I don't intend to let this thread silently
> peter out (like many similar threads must have, in the past); I hope to
> do something about actual patches in the next development cycle. I do
> admit it looks like a daunting task, considering the multiple
> architectures and toolchains.
>
> (NB: we've not spent a single word on RISC-V yet -- AFAICS, there RISC-V
> doesn't even have *any* implementation of MemoryFence(), yet. Refer to
> commit 7601b251fd5c ("MdePkg/BaseLib: BaseLib for RISCV64 architecture",
> 2020-05-07), for example.)
>
> "You touch it, you own it" and "learn as you go" should not be a
> surprise to me in an open source project, but I admit I don't know if I
> should even attempt patching UefiCpuPkg content (MdePkg and OvmfPkg look
> more approachable). The risk of regressions runs high, reproducibility
> is always a mess in multiprocessing code, and the code is also
> privileged. I want it fixed everywhere, but I feel uncertain about the
> apparent lack of commitment.
>
> I have some hands-on multiprocessing background, but I've always made a
> very conscious decision to stay away from lockless programming, and use
> safe and *simple* POSIX Threads-based schemes (mutexes, condition
> variables with *broadcasts* etc). The "network of IP blocks" that Ard
> used for (correctly) describing modern computer architecture is way
> below the programming abstraction level where I feel comfortable. Thus
> I'm very much out of my comfort zone here. (My initial decision against
> lockless programming goes back to when <http://www.1024cores.net/> was
> published originally.)
>
> Anyway: I have several messages in this thread starred for revising.
>
> Ankur, is the picture clear enough for you to return to work on the v7
> unplug series? The soft feature freeze is coming: 2021-02-22.
>
> https://github.com/tianocore/tianocore.github.io/wiki/EDK-II-Release-Planning
>
> Thanks
> Laszlo
Laszlo Ersek
On 02/10/21 08:21, Ankur Arora wrote:
> In general I would say that fences are -- except for one case -- strictly
> greater in power than read_once/write_once. That one case is alignment
> checks (which we do for instance in MmioRead32() and friends.)
>
> Given that difference in power, I agree with the approach Paolo
> recommended elsewhere that for CPU memory access synchronization we
> start with Fences and then switch to lesser powered variants
> as and when needed:
>
> "However, these changes do not have to be done in a single step.
> Starting with the rationalization of memory fences makes sense
> because, in addition to enabling the annotation of code rather
> than data, it is essentially a bugfix."

Given the actual amount of MemoryFence() calls and "volatile" uses in
edk2, it's actually a pretty large undertaking to "fix" these issues. I
don't foresee a quick resolution here, especially if I'm supposed to
work on it alone (considering actual patches).

I've not seen any feedback from Mike or Liming in this thread yet, for
example, so I'm unsure about the MdePkg (BaseLib) maintainer buy-in.

This is just to state the context in which I interpret "first step" and
"essentially a bugfix". The issue is systemic in edk2, as every such
occasion as we've run into now has only added to the proliferation of
"volatile".

The next stable tag is coming up too. I recommend we focus on merging
the VCPU hotplug series. I don't intend to let this thread silently
peter out (like many similar threads must have, in the past); I hope to
do something about actual patches in the next development cycle. I do
admit it looks like a daunting task, considering the multiple
architectures and toolchains.

(NB: we've not spent a single word on RISC-V yet -- AFAICS, there RISC-V
doesn't even have *any* implementation of MemoryFence(), yet. Refer to
commit 7601b251fd5c ("MdePkg/BaseLib: BaseLib for RISCV64 architecture",
2020-05-07), for example.)

"You touch it, you own it" and "learn as you go" should not be a
surprise to me in an open source project, but I admit I don't know if I
should even attempt patching UefiCpuPkg content (MdePkg and OvmfPkg look
more approachable). The risk of regressions runs high, reproducibility
is always a mess in multiprocessing code, and the code is also
privileged. I want it fixed everywhere, but I feel uncertain about the
apparent lack of commitment.

I have some hands-on multiprocessing background, but I've always made a
very conscious decision to stay away from lockless programming, and use
safe and *simple* POSIX Threads-based schemes (mutexes, condition
variables with *broadcasts* etc). The "network of IP blocks" that Ard
used for (correctly) describing modern computer architecture is way
below the programming abstraction level where I feel comfortable. Thus
I'm very much out of my comfort zone here. (My initial decision against
lockless programming goes back to when <http://www.1024cores.net/> was
published originally.)

Anyway: I have several messages in this thread starred for revising.

Ankur, is the picture clear enough for you to return to work on the v7
unplug series? The soft feature freeze is coming: 2021-02-22.

https://github.com/tianocore/tianocore.github.io/wiki/EDK-II-Release-Planning

Thanks
Laszlo
Ankur Arora
On 2021-02-09 10:40 p.m., Paolo Bonzini wrote:
> On Wed, Feb 10, 2021 at 07:37, Ankur Arora <ankur.a.arora@...> wrote:
>> So I don't quite see what would make "memory_order_seq_cst" harder?
>> From the spec (https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering):
>>
>> "Atomic operations tagged memory_order_seq_cst not only order
>> memory the same way as release/acquire ordering (everything
>> that happened-before a store in one thread becomes a visible
>> side effect in the thread that did a load), but also establish
>> a single total modification order of all atomic operations
>> that are so tagged."
>>
>> Is the total modification order the problem?
>
> The problem is that the ordering does not extend to relaxed (or even
> acquire/release) loads and stores. Therefore *every* store that needs
> to be ordered before a seq_cst load must also be seq_cst. This is hard
> enough to guarantee that using a fence is preferable.
>
> Paolo

Thanks for the explanation. That does sound like a pain to work with.

I wonder if there are CPUs with ordering primitives fine-grained
enough where this would be useful.

Ankur
Ankur Arora
On 2021-02-09 5:20 a.m., Laszlo Ersek wrote:
> On 02/08/21 21:44, Ankur Arora wrote:
>> On 2021-02-08 9:40 a.m., Laszlo Ersek wrote:
>>> So the structure of the solution we're looking for is:
>>> - exactly *one* of:
>>>   - volatile
>>>   - compiler fence
>>>   - acquire fence used as a heavy substitute for compiler fence,
>>
>> Since the first two are of unequal power -- volatile for specific
>> accesses and the compiler fence being a general barrier across which
>> the compiler does not optimize memory accesses I think we should have
>> both instead of volatile or compiler fence.
>
> I disagree, as long as we talk about CPU synchronization through
> variables that live in RAM. "volatile" has a use case, yes, but not for
> this purpose. The Linux document Paolo referenced earlier explains this
> well. The compiler fence is a superset of volatile (for this use case),
> feature-wise, *and* it has better performance.
>
>> However, instead of volatile, how about read_once(), write_once()
>> primitives? These, respectively, ensure that the compiler emits a
>> single load or a store and does not combine or mutilate the load/store
>> in any way.
>>
>> Internally, these are implemented via volatile (at least the Linux
>> kernel implementation that I looked at is) but they would explicitly
>> lay out the guarantees (and non-guarantees) that the C/C++ standard
>> makes:
>>
>> "However, the guarantees of the standard are not sufficient for using
>> volatile for multi-threading. The C++ Standard does not stop the compiler
>> from reordering non-volatile reads and writes relative to volatile reads
>> and writes, and it says nothing about preventing CPU reordering."
>> (https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming#volatile-variables-and-reordering)
>>
>> I see at least two cases (essentially where these are useful on their
>> own without needing any other fences for ordering (compiler or cpu)):
>>
>> * Device side communication:
>>   MdeModulePkg/Bus/Ata/AhciPei/AhciMode.c::AhciWaitMemSet()
>>
>>     271   // The system memory pointed by Address will be updated by the
>>     272   // SATA Host Controller, "volatile" is introduced to prevent
>>     273   // compiler from optimizing the access to the memory address
>>     274   // to only read once.
>>     275   //
>>     276   Value = *(volatile UINT32 *) (UINTN) Address;
>>     277   Value &= MaskValue;
>>     278
>>
>>   So use read_once() instead of the volatile cast:
>>
>>     276   Value = read_once (Address);
>
> This seems to overlap with MmioRead32() from <IoLib.h>, at least in
> purpose. I think the code should have used MmioRead32() in the first place.

AFAICS, the various versions of MmioRead, MmioWrite defined in
MdePkg/Library/BaseIoLibIntrinsic/IoLib.c or
MdePkg/Library/BaseIoLibIntrinsic/IoLibNoIo.c
are effectively what I was proposing as well: a typecheck and
a volatile cast to ensure that the compiler emits a load or
a store.
>> * For handling cases where we don't need a fence, but only
>>   need to ensure that the compiler does not mutilate a store in any
>>   way. As an example, the code below clears the Handler and the
>>   block below it calls the Handler if the pointer is non-NULL.
>>
>>   (From:
>>   https://patchew.org/EDK2/20210129005950.467638-1-ankur.a.arora@oracle.com/20210129005950.467638-7-ankur.a.arora@oracle.com/,
>>   https://patchew.org/EDK2/20210129005950.467638-1-ankur.a.arora@oracle.com/20210129005950.467638-9-ankur.a.arora@oracle.com/)
>>
>>   CpuEject():
>>
>>      99 +  //
>>     100 +  // We are done until the next hot-unplug; clear the handler.
>>     101 +  //
>>     102 +  mCpuHotEjectData->Handler = NULL;
>>
>>   SmmCpuFeaturesRendezvousExit():
>>
>>     140 +  if (mCpuHotEjectData == NULL ||
>>     141 +      mCpuHotEjectData->Handler == NULL) {
>>     142 +    return;
>>     143 +  }
>>     144 +
>>     145 +  mCpuHotEjectData->Handler (CpuIndex);
>>
>>   If the code on line 102 were instead,
>>
>>     102 +  write_once (&mCpuHotEjectData->Handler, NULL);
>>
>>   and that on lines 141-145 were:
>>
>>     140 +  if (mCpuHotEjectData == NULL ||
>>     141 +      ((handler = read_once (mCpuHotEjectData->Handler)) == NULL)) {
>>     142 +    return;
>>     143 +  }
>>     144 +
>>     145 +  handler (CpuIndex);
>
> Hmm wait let me gather my thoughts on this...
>
> We have this loop in SmiRendezvous()
> [UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c]:
>
>   //
>   // Wait for BSP's signal to exit SMI
>   //
>   while (*mSmmMpSyncData->AllCpusInSync) {
>     CpuPause ();
>   }
>
> (there are two instances of it in SmiRendezvous(), both executed before
> reaching the Exit label).
>
> This is the loop that separates CpuEject() (on the BSP) from
> SmmCpuFeaturesRendezvousExit() (on the AP).
>
> The object accessed is a "volatile BOOLEAN", so conceptually this is a
> loop body with a compiler fence.
>
> The goal is to highlight the data flow (i.e., release -> acquire) in the
> code, so we should put a release fence in CpuEject() and an acquire
> fence in SmmCpuFeaturesRendezvousExit().
>
> * Until we can do that -- until we have the better-named APIs -- we
> should use MemoryFence() for both fences. However, because MemoryFence()
> doesn't do anything on VS, we should also keep the volatile, which (on
> x86) happens to force the same practical effect that we expect of the
> release and acquire fences too.
>
> * However IIUC you're saying that we shouldn't even aim at placing the
> release and acquire fences in CpuEject / SmmCpuFeaturesRendezvousExit
> respectively, but use write_once() / read_once().

Yes, exactly. My point was that at least in the restricted case that
I was describing -- where we are only concerned with
mCpuHotEjectData->Handler and not its ordering with respect to
other stores. Where that would be necessary we would need a compiler
and/or memory-fence.
> My comment on that is: I don't know.
>
> I don't know how read_once() / write_once() in Linux compares to the
> acquire / release fences.
>
> I don't know if read_once / write_once are just as expressive as the
> acquire / release fences (in Paolo's opinion, for example). I also don't
> know if their implementations (their effects on the CPUs / visibility)
> would be interchangeable on all edk2 architectures.
>
> So I think this is something that Paolo should please comment on.
>
> *Personally*, I would be satisfied with the following "minimum version" too:
>
> - update the MemoryFence() specification to say it is a full compiler
>   and CPU barrier (covering store-load as well)
>
> - annotate MemoryFence() call sites with comments, to highlight the data
>   flow direction
>
> - make MemoryFence() emit an MFENCE on x86, and waste a minuscule amount
>   of CPU time in response -- that's an acceptable price for covering
>   store-load (Paolo mentioned ~50 clock cycles as penalty)
>
> - fix MemoryFence()'s implementation for VS (currently it does nothing
>   at all)
>
> - use MemoryFence(), and do not use volatile, in the v7 unplug series

For the last item, I will assume that the MemoryFence() will act as both
a compiler and a CPU barrier while annotating my code to say what specific
ordering I'm banking on.
> *However*, Paolo strongly recommends that we introduce and use the
> directional barriers, for highlighting data flow, and for matching known
> lockless patterns. Thus, I understand your read_once / write_once
> suggestion to be an *alternative* to Paolo's. And I'm not equipped to
> compare both recommendations, alas.

Oh, no. I wasn't suggesting read_once/write_once as an alternative to
the primitives that Paolo was talking about. Those would handle
ordering while read_once/write_once were only intended to ensure
that the compiler does not in any way mangle the stores and loads.
In general I would say that fences are -- except for one case -- strictly
greater in power than read_once/write_once. That one case is alignment
checks (which we do for instance in MmioRead32() and friends.)
Given that difference in power, I agree with the approach Paolo
recommended elsewhere that for CPU memory access synchronization we
start with Fences and then switch to lesser powered variants
as and when needed:
"However, these changes do not have to be done in a single step.
Starting with the rationalization of memory fences makes sense
because, in addition to enabling the annotation of code rather
than data, it is essentially a bugfix. "
> I don't know why Linux has fences and read_once / write_once separately.
> I know they're all documented, I think I tried to read those docs in the
> past (unsuccessfully). I just feel that *all that* would be too much for
> me to use edk2.

Yeah I'm not sure we need to go wild with these in EDK2 either. That's
just a recipe for bugs.

The use of read_once()/write_once() also documents the implicit assumptions
being made in the code at the point of use.

> I have to defer this to Paolo and Ard.
>
> Thanks
> Laszlo

Thanks
Ankur

>>> - and *all* of:
>>>   - acquire fence (load-load, load-store)
>>>   - release fence (load-store, store-store)
>>>   - full fence (load-load, load-store, store-store, store-load)
>>>
>>> The implementation of each fence would have to be *at least* as safe as
>>> required; it could be stronger.
>>>
>>> I feel that we have to reach an agreement on the "exactly one of" part;
>>> subsequent to that, maybe I can try an RFC patch for <BaseLib.h> (just
>>> the interface contracts, at first).
>>>
>>> Thanks
>>> Laszlo
Paolo Bonzini <pbonzini@...>
On Wed, Feb 10, 2021 at 07:37, Ankur Arora <ankur.a.arora@...> wrote:
> So I don't quite see what would make "memory_order_seq_cst" harder?
> From the spec (https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering):
>
> "Atomic operations tagged memory_order_seq_cst not only order
> memory the same way as release/acquire ordering (everything
> that happened-before a store in one thread becomes a visible
> side effect in the thread that did a load), but also establish
> a single total modification order of all atomic operations
> that are so tagged."
>
> Is the total modification order the problem?
>
> Thanks
> Ankur

The problem is that the ordering does not extend to relaxed (or even
acquire/release) loads and stores. Therefore *every* store that needs to be
ordered before a seq_cst load must also be seq_cst. This is hard enough to
guarantee that using a fence is preferable.

Paolo
Ankur Arora
On 2021-02-09 5:00 a.m., Paolo Bonzini wrote:
> On 08/02/21 21:44, Ankur Arora wrote:
>> Since the first two are of unequal power -- volatile for specific
>> accesses and the compiler fence being a general barrier across which
>> the compiler does not optimize memory accesses I think we should have
>> both instead of volatile or compiler fence.
>>
>> However, instead of volatile, how about read_once(), write_once()
>> primitives? These, respectively, ensure that the compiler emits a
>> single load or a store and does not combine or mutilate the load/store
>> in any way.
>
> Well, that could certainly be a next step, but I intentionally left it
> out of my proposal to Laszlo.
>
> The reason for that is that the fences are already enough to wean the
> edk2 codebase from volatile declarations and pointers, and that is the
> big paradigm change---annotating code rather than data.
>
> Once you've done that switch I would certainly agree with (basically)
> an edk2 flavor of C11-like atomic primitives:
>
> - at the very least, AtomicLoadRelaxed and AtomicStoreRelaxed
>   corresponding to Linux's READ_ONCE/WRITE_ONCE
>
> - and if it is possible to implement them easily on MSVC,
>   AtomicLoadAcquire and AtomicStoreRelease
>
> However, these changes do not have to be done in a single step.
> Starting with the rationalization of memory fences makes sense
> because, in addition to enabling the annotation of code rather than
> data, it is essentially a bugfix.

I see. Yeah, this ordering does seem easier to work with.

> As an aside, I always stay away from the "memory_order_seq_cst" loads
> and stores. Those are much more tricky than what their name suggests,
> and their use cases are much better served by something like
>
>   AtomicStoreRelaxed(&a, 1);  // atomic_store(mo_relaxed)
>   MemoryFence();              // atomic_thread_fence(mo_seq_cst)
>   y = AtomicLoadRelaxed(&b);  // atomic_load(mo_relaxed)

So I don't quite see what would make "memory_order_seq_cst" harder?
From the spec (https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering):

"Atomic operations tagged memory_order_seq_cst not only order
memory the same way as release/acquire ordering (everything
that happened-before a store in one thread becomes a visible
side effect in the thread that did a load), but also establish
a single total modification order of all atomic operations
that are so tagged."

Is the total modification order the problem?

Thanks
Ankur

> Paolo
Andrew Fish <afish@...>
On Feb 9, 2021, at 5:20 AM, Laszlo Ersek <lersek@...> wrote:
Well for the IoLib that is + ~100 clock cycles for every MMIO read/write so that makes me a little nervous.
On 02/08/21 21:44, Ankur Arora wrote:
On 2021-02-08 9:40 a.m., Laszlo Ersek wrote:
So the structure of the solution we're looking for is:
- exactly *one* of:
- volatile
- compiler fence
- acquire fence used as a heavy substitute for compiler fence,
Since the first two are of unequal power -- volatile for specific
accesses and the compiler fence being a general barrier across which
the compiler does not optimize memory accesses I think we should have
both instead of volatile or compiler fence.
I disagree, as long as we talk about CPU synchronization through
variables that live in RAM. "volatile" has a use case, yes, but not for
this purpose. The Linux document Paolo referenced earlier explains this
well. The compiler fence is a superset of volatile (for this use case),
feature-wise, *and* it has better performance.
However, instead of volatile, how about read_once(), write_once()
primitives? These, respectively ensure that the compiler emits a
single load or a store and does not combine or mutilate the load/store
in any way.
Internally, these are implemented via volatile (at least the Linux
kernel implementation that I looked at is) but they would explicitly
lay out the guarantees (and non-guarantees) that the C/C++ standard
makes:
"However, the guarantees of the standard are not sufficient for using
volatile for multi-threading. The C++ Standard does not stop the compiler
from reordering non-volatile reads and writes relative to volatile reads
and writes, and it says nothing about preventing CPU reordering."
(https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming#volatile-variables-and-reordering)
I see at least two cases (essentially where these are useful on their
own without needing any other fences for ordering (compiler or cpu)):
* Device side communication:
MdeModulePkg/Bus/Ata/AhciPei/AhciMode.c::AhciWaitMemSet()
271 // The system memory pointed by Address will be updated by the
272 // SATA Host Controller, "volatile" is introduced to prevent
273 // compiler from optimizing the access to the memory address
274 // to only read once.
275 //
276 Value = *(volatile UINT32 *) (UINTN) Address;
277 Value &= MaskValue;
278
So use read_once() instead of the volatile cast:
276 Value = read_once (Address);
This seems to overlap with MmioRead32() from <IoLib.h>, at least in
purpose. I think the code should have used MmioRead32() in the first place.
* For handling cases where we don't need a fence, but only
Hmm wait let me gather my thoughts on this...
need to ensure that the compiler does not mutilate a store in any
way. As an example, the code below clears the Handler and the
block below it calls the Handler if the pointer is non-NULL.
(From:
https://patchew.org/EDK2/20210129005950.467638-1-ankur.a.arora@oracle.com/20210129005950.467638-7-ankur.a.arora@oracle.com/,
https://patchew.org/EDK2/20210129005950.467638-1-ankur.a.arora@oracle.com/20210129005950.467638-9-ankur.a.arora@oracle.com/)
CpuEject():
99 + //
100 + // We are done until the next hot-unplug; clear the handler.
101 + //
102 + mCpuHotEjectData->Handler = NULL;
SmmCpuFeaturesRendezvousExit():
140 + if (mCpuHotEjectData == NULL ||
141 + mCpuHotEjectData->Handler == NULL) {
142 + return;
143 + }
144 +
145 + mCpuHotEjectData->Handler (CpuIndex);
If the code on line 102 were instead,
102 + write_once (&mCpuHotEjectData->Handler, NULL);
and that on lines 141-145 were:
140 + if (mCpuHotEjectData == NULL ||
141 + ((handler = read_once (mCpuHotEjectData->Handler)) == NULL)) {
142 + return;
143 + }
144 +
145 + handler (CpuIndex);
We have this loop in SmiRendezvous()
[UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c]:
//
// Wait for BSP's signal to exit SMI
//
while (*mSmmMpSyncData->AllCpusInSync) {
CpuPause ();
}
(there are two instances of it in SmiRendezvous(), both executed before
reaching the Exit label).
This is the loop that separates CpuEject() (on the BSP) from
SmmCpuFeaturesRendezvousExit() (on the AP).
The object accessed is a "volatile BOOLEAN", so conceptually this is a
loop body with a compiler fence.
The goal is to highlight the data flow (i.e., release -> acquire) in the
code, so we should put a release fence in CpuEject() and an acquire
fence in SmmCpuFeaturesRendezvousExit().
* Until we can do that -- until we have the better-named APIs -- we
should use MemoryFence() for both fences. However, because MemoryFence()
doesn't do anything on VS, we should also keep the volatile, which (on
x86) happens to force the same practical effect that we expect of the
release and acquire fences too.
* However IIUC you're saying that we shouldn't even aim at placing the
release and acquire fences in CpuEject / SmmCpuFeaturesRendezvousExit
respectively, but use write_once() / read_once().
My comment on that is: I don't know.
I don't know how read_once() / write_once() in Linux compares to the
acquire / release fences.
I don't know if read_once / write_once are just as expressive as the
acquire / release fences (in Paolo's opinion, for example). I also don't
know if their implementations (their effects on the CPUs / visibility)
would be interchangeable on all edk2 architectures.
So I think this is something that Paolo should please comment on.
*Personally*, I would be satisfied with the following "minimum version" too:
- update the MemoryFence() specification to say it is a full compiler
and CPU barrier (covering store-load as well)
- annotate MemoryFence() call sites with comments, to highlight the data
flow direction
- make MemoryFence() emit an MFENCE on x86, and waste a minuscule amount
of CPU time in response -- that's an acceptable price for covering
store-load (Paolo mentioned ~50 clock cycles as penalty)
Thanks,
Andrew Fish
- fix MemoryFence()'s implementation for VS (currently it does nothing
at all)
- use MemoryFence(), and do not use volatile, in the v7 unplug series
*However*, Paolo strongly recommends that we introduce and use the
directional barriers, for highlighting data flow, and for matching known
lockless patterns. Thus, I understand your read_once / write_once
suggestion to be an *alternative* to Paolo's. And I'm not equipped to
compare both recommendations, alas.
I don't know why Linux has fences and read_once / write_once separately.
I know they're all documented, I think I tried to read those docs in the
past (unsuccessfully). I just feel that *all that* would be too much for
me to use edk2.
I have to defer this to Paolo and Ard.
Thanks
Laszlo
The use of read_once()/write_once() also documents the implicit assumptions
being made in the code at the point of use.
Thanks
Ankur
- and *all* of
- acquire fence (load-load, load-store)
- release fence (load-store, store-store)
- full fence (load-load, load-store, store-store, store-load)
The implementation of each fence would have to be *at least* as safe as
required; it could be stronger.
I feel that we have to reach an agreement on the "exactly one of" part;
subsequent to that, maybe I can try an RFC patch for <BaseLib.h> (just
the interface contracts, at first).
Thanks
Laszlo
Ard Biesheuvel <ardb@...>
On Tue, 9 Feb 2021 at 15:27, Laszlo Ersek <lersek@...> wrote:
On 02/09/21 14:00, Paolo Bonzini wrote:
annotating code rather than data
(
Slightly off-topic, but I wonder why this approach (code annotations)
has proved more effective / successful, with atomics, than annotating
data. Usually it's recommended to put emphasis on data structure design
(with invariants etc), and then the code "writes itself". (I feel like
there's a famous quote to insert here, but I can't remember it.)
So I'm not questioning that annotating code is superior in this case, I
just wonder why it diverges from the usual guideline.
)
I'd say it is because we are reasoning about how accesses to the data
are ordered, not about the data itself.
Laszlo Ersek
On 02/09/21 14:00, Paolo Bonzini wrote:
annotating code rather than data
(
Slightly off-topic, but I wonder why this approach (code annotations)
has proved more effective / successful, with atomics, than annotating
data. Usually it's recommended to put emphasis on data structure design
(with invariants etc), and then the code "writes itself". (I feel like
there's a famous quote to insert here, but I can't remember it.)
So I'm not questioning that annotating code is superior in this case, I
just wonder why it diverges from the usual guideline.
)
Laszlo
Laszlo Ersek
On 02/08/21 21:44, Ankur Arora wrote:
I disagree, as long as we talk about CPU synchronization through
variables that live in RAM. "volatile" has a use case, yes, but not for
this purpose. The Linux document Paolo referenced earlier explains this
well. The compiler fence is a superset of volatile (for this use case),
feature-wise, *and* it has better performance.
This seems to overlap with MmioRead32() from <IoLib.h>, at least in
purpose. I think the code should have used MmioRead32() in the first place.
We have this loop in SmiRendezvous()
[UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c]:
//
// Wait for BSP's signal to exit SMI
//
while (*mSmmMpSyncData->AllCpusInSync) {
CpuPause ();
}
(there are two instances of it in SmiRendezvous(), both executed before
reaching the Exit label).
This is the loop that separates CpuEject() (on the BSP) from
SmmCpuFeaturesRendezvousExit() (on the AP).
The object accessed is a "volatile BOOLEAN", so conceptually this is a
loop body with a compiler fence.
The goal is to highlight the data flow (i.e., release -> acquire) in the
code, so we should put a release fence in CpuEject() and an acquire
fence in SmmCpuFeaturesRendezvousExit().
* Until we can do that -- until we have the better-named APIs -- we
should use MemoryFence() for both fences. However, because MemoryFence()
doesn't do anything on VS, we should also keep the volatile, which (on
x86) happens to force the same practical effect that we expect of the
release and acquire fences too.
* However IIUC you're saying that we shouldn't even aim at placing the
release and acquire fences in CpuEject / SmmCpuFeaturesRendezvousExit
respectively, but use write_once() / read_once().
My comment on that is: I don't know.
I don't know how read_once() / write_once() in Linux compares to the
acquire / release fences.
I don't know if read_once / write_once are just as expressive as the
acquire / release fences (in Paolo's opinion, for example). I also don't
know if their implementations (their effects on the CPUs / visibility)
would be interchangeable on all edk2 architectures.
So I think this is something that Paolo should please comment on.
*Personally*, I would be satisfied with the following "minimum version" too:
- update the MemoryFence() specification to say it is a full compiler
and CPU barrier (covering store-load as well)
- annotate MemoryFence() call sites with comments, to highlight the data
flow direction
- make MemoryFence() emit an MFENCE on x86, and waste a minuscule amount
of CPU time in response -- that's an acceptable price for covering
store-load (Paolo mentioned ~50 clock cycles as penalty)
- fix MemoryFence()'s implementation for VS (currently it does nothing
at all)
- use MemoryFence(), and do not use volatile, in the v7 unplug series
*However*, Paolo strongly recommends that we introduce and use the
directional barriers, for highlighting data flow, and for matching known
lockless patterns. Thus, I understand your read_once / write_once
suggestion to be an *alternative* to Paolo's. And I'm not equipped to
compare both recommendations, alas.
I don't know why Linux has fences and read_once / write_once separately.
I know they're all documented, I think I tried to read those docs in the
past (unsuccessfully). I just feel that *all that* would be too much for
me to use edk2.
I have to defer this to Paolo and Ard.
Thanks
Laszlo
On 2021-02-08 9:40 a.m., Laszlo Ersek wrote:
So the structure of the solution we're looking for is:
- exactly *one* of:
- volatile
- compiler fence
- acquire fence used as a heavy substitute for compiler fence,
Since the first two are of unequal power -- volatile for specific
accesses and the compiler fence being a general barrier across which
the compiler does not optimize memory accesses I think we should have
both instead of volatile or compiler fence.
I disagree, as long as we talk about CPU synchronization through
variables that live in RAM. "volatile" has a use case, yes, but not for
this purpose. The Linux document Paolo referenced earlier explains this
well. The compiler fence is a superset of volatile (for this use case),
feature-wise, *and* it has better performance.
However, instead of volatile, how about read_once(), write_once()
primitives? These, respectively ensure that the compiler emits a
single load or a store and does not combine or mutilate the load/store
in any way.
Internally, these are implemented via volatile (at least the Linux
kernel implementation that I looked at is) but they would explicitly
lay out the guarantees (and non-guarantees) that the C/C++ standard
makes:
"However, the guarantees of the standard are not sufficient for using
volatile for multi-threading. The C++ Standard does not stop the compiler
from reordering non-volatile reads and writes relative to volatile reads
and writes, and it says nothing about preventing CPU reordering."
(https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming#volatile-variables-and-reordering)
I see at least two cases (essentially where these are useful on their
own without needing any other fences for ordering (compiler or cpu)):
* Device side communication:
MdeModulePkg/Bus/Ata/AhciPei/AhciMode.c::AhciWaitMemSet()
271 // The system memory pointed by Address will be updated by the
272 // SATA Host Controller, "volatile" is introduced to prevent
273 // compiler from optimizing the access to the memory address
274 // to only read once.
275 //
276 Value = *(volatile UINT32 *) (UINTN) Address;
277 Value &= MaskValue;
278
So use read_once() instead of the volatile cast:
276 Value = read_once (Address);
This seems to overlap with MmioRead32() from <IoLib.h>, at least in
purpose. I think the code should have used MmioRead32() in the first place.
* For handling cases where we don't need a fence, but only
Hmm wait let me gather my thoughts on this...
need to ensure that the compiler does not mutilate a store in any
way. As an example, the code below clears the Handler and the
block below it calls the Handler if the pointer is non-NULL.
(From:
https://patchew.org/EDK2/20210129005950.467638-1-ankur.a.arora@oracle.com/20210129005950.467638-7-ankur.a.arora@oracle.com/,
https://patchew.org/EDK2/20210129005950.467638-1-ankur.a.arora@oracle.com/20210129005950.467638-9-ankur.a.arora@oracle.com/)
CpuEject():
99 + //
100 + // We are done until the next hot-unplug; clear the handler.
101 + //
102 + mCpuHotEjectData->Handler = NULL;
SmmCpuFeaturesRendezvousExit():
140 + if (mCpuHotEjectData == NULL ||
141 + mCpuHotEjectData->Handler == NULL) {
142 + return;
143 + }
144 +
145 + mCpuHotEjectData->Handler (CpuIndex);
If the code on line 102 were instead,
102 + write_once (&mCpuHotEjectData->Handler, NULL);
and that on lines 141-145 were:
140 + if (mCpuHotEjectData == NULL ||
141 + ((handler = read_once (mCpuHotEjectData->Handler)) == NULL)) {
142 + return;
143 + }
144 +
145 + handler (CpuIndex);
We have this loop in SmiRendezvous()
[UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c]:
//
// Wait for BSP's signal to exit SMI
//
while (*mSmmMpSyncData->AllCpusInSync) {
CpuPause ();
}
(there are two instances of it in SmiRendezvous(), both executed before
reaching the Exit label).
This is the loop that separates CpuEject() (on the BSP) from
SmmCpuFeaturesRendezvousExit() (on the AP).
The object accessed is a "volatile BOOLEAN", so conceptually this is a
loop body with a compiler fence.
The goal is to highlight the data flow (i.e., release -> acquire) in the
code, so we should put a release fence in CpuEject() and an acquire
fence in SmmCpuFeaturesRendezvousExit().
* Until we can do that -- until we have the better-named APIs -- we
should use MemoryFence() for both fences. However, because MemoryFence()
doesn't do anything on VS, we should also keep the volatile, which (on
x86) happens to force the same practical effect that we expect of the
release and acquire fences too.
* However IIUC you're saying that we shouldn't even aim at placing the
release and acquire fences in CpuEject / SmmCpuFeaturesRendezvousExit
respectively, but use write_once() / read_once().
My comment on that is: I don't know.
I don't know how read_once() / write_once() in Linux compares to the
acquire / release fences.
I don't know if read_once / write_once are just as expressive as the
acquire / release fences (in Paolo's opinion, for example). I also don't
know if their implementations (their effects on the CPUs / visibility)
would be interchangeable on all edk2 architectures.
So I think this is something that Paolo should please comment on.
*Personally*, I would be satisfied with the following "minimum version" too:
- update the MemoryFence() specification to say it is a full compiler
and CPU barrier (covering store-load as well)
- annotate MemoryFence() call sites with comments, to highlight the data
flow direction
- make MemoryFence() emit an MFENCE on x86, and waste a minuscule amount
of CPU time in response -- that's an acceptable price for covering
store-load (Paolo mentioned ~50 clock cycles as penalty)
- fix MemoryFence()'s implementation for VS (currently it does nothing
at all)
- use MemoryFence(), and do not use volatile, in the v7 unplug series
*However*, Paolo strongly recommends that we introduce and use the
directional barriers, for highlighting data flow, and for matching known
lockless patterns. Thus, I understand your read_once / write_once
suggestion to be an *alternative* to Paolo's. And I'm not equipped to
compare both recommendations, alas.
I don't know why Linux has fences and read_once / write_once separately.
I know they're all documented, I think I tried to read those docs in the
past (unsuccessfully). I just feel that *all that* would be too much for
me to use edk2.
I have to defer this to Paolo and Ard.
Thanks
Laszlo
The use of read_once()/write_once() also documents the implicit assumptions
being made in the code at the point of use.
Thanks
Ankur
- and *all* of
- acquire fence (load-load, load-store)
- release fence (load-store, store-store)
- full fence (load-load, load-store, store-store, store-load)
The implementation of each fence would have to be *at least* as safe as
required; it could be stronger.
I feel that we have to reach an agreement on the "exactly one of" part;
subsequent to that, maybe I can try an RFC patch for <BaseLib.h> (just
the interface contracts, at first).
Thanks
Laszlo
Paolo Bonzini <pbonzini@...>
On 08/02/21 21:44, Ankur Arora wrote:
The reason for that is that the fences are already enough to wean the edk2 codebase from volatile declarations and pointers, and that is the big paradigm change---annotating code rather than data.
Once you've done that switch I would certainly agree with (basically) an edk2 flavor of C11-like atomic primitives:
- at the very least, AtomicLoadRelaxed and AtomicStoreRelaxed corresponding to Linux's READ_ONCE/WRITE_ONCE
- and if it is possible to implement them easily on MSVC, AtomicLoadAcquire and AtomicStoreRelease
However, these changes do not have to be done in a single step. Starting with the rationalization of memory fences makes sense because, in addition to enabling the annotation of code rather than data, it is essentially a bugfix.
As an aside, I always stay away from the "memory_order_seq_cst" loads and stores. Those are much more tricky than what their name suggests, and their use cases are much better served by something like
AtomicStoreRelaxed(&a, 1); // atomic_store(mo_relaxed)
MemoryFence(); // atomic_thread_fence(mo_seq_cst)
y = AtomicLoadRelaxed(&b); // atomic_load(mo_relaxed)
Paolo
Since the first two are of unequal power -- volatile for specific
accesses and the compiler fence being a general barrier across which
the compiler does not optimize memory accesses I think we should have
both instead of volatile or compiler fence.
However, instead of volatile, how about read_once(), write_once()
primitives? These, respectively ensure that the compiler emits a
single load or a store and does not combine or mutilate the load/store
in any way.
Well, that could certainly be a next step, but I intentionally left it out of my proposal to Laszlo.
The reason for that is that the fences are already enough to wean the edk2 codebase from volatile declarations and pointers, and that is the big paradigm change---annotating code rather than data.
Once you've done that switch I would certainly agree with (basically) an edk2 flavor of C11-like atomic primitives:
- at the very least, AtomicLoadRelaxed and AtomicStoreRelaxed corresponding to Linux's READ_ONCE/WRITE_ONCE
- and if it is possible to implement them easily on MSVC, AtomicLoadAcquire and AtomicStoreRelease
However, these changes do not have to be done in a single step. Starting with the rationalization of memory fences makes sense because, in addition to enabling the annotation of code rather than data, it is essentially a bugfix.
As an aside, I always stay away from the "memory_order_seq_cst" loads and stores. Those are much more tricky than what their name suggests, and their use cases are much better served by something like
AtomicStoreRelaxed(&a, 1); // atomic_store(mo_relaxed)
MemoryFence(); // atomic_thread_fence(mo_seq_cst)
y = AtomicLoadRelaxed(&b); // atomic_load(mo_relaxed)
Paolo
Laszlo Ersek
On 02/08/21 20:18, Andrew Fish wrote:
But Paolo's find:
#define _Compiler_barrier() \
_STL_DISABLE_DEPRECATED_WARNING _ReadWriteBarrier() \
_STL_RESTORE_DEPRECATED_WARNING
suggests we should still stick with _ReadWriteBarrier (usable in plain
C), even if it's officially deprecated.
Thanks
Laszlo
The VC++ docs seem to point you toward:
1) volatile + /volatile:iso compiler flag for MMIO.
3) For synchronization: inline void atomic_thread_fence(memory_order Order) noexcept;
It looks like memory_order maps into the primitives you are proposing?
memory_order_relaxed The fence has no effect.
memory_order_consume The fence is an acquire fence.
memory_order_acquire The fence is an acquire fence.
memory_order_release The fence is a release fence.
memory_order_acq_rel The fence is both an acquire fence and a release fence.
memory_order_seq_cst The fence is both an acquire fence and a release fence, and is sequentially consistent.
But it kind of seems like we may be falling down a C++ rabbit hole…..
Yes, this points too much toward C++.
But Paolo's find:
#define _Compiler_barrier() \
_STL_DISABLE_DEPRECATED_WARNING _ReadWriteBarrier() \
_STL_RESTORE_DEPRECATED_WARNING
suggests we should still stick with _ReadWriteBarrier (usable in plain
C), even if it's officially deprecated.
Thanks
Laszlo
Ankur Arora
On 2021-02-08 9:40 a.m., Laszlo Ersek wrote:
accesses and the compiler fence being a general barrier across which
the compiler does not optimize memory accesses I think we should have
both instead of volatile or compiler fence.
However, instead of volatile, how about read_once(), write_once()
primitives? These, respectively ensure that the compiler emits a
single load or a store and does not combine or mutilate the load/store
in any way.
Internally, these are implemented via volatile (at least the Linux
kernel implementation that I looked at is) but they would explicitly
lay out the guarantees (and non-guarantees) that the C/C++ standard
makes:
"However, the guarantees of the standard are not sufficient for using
volatile for multi-threading. The C++ Standard does not stop the compiler
from reordering non-volatile reads and writes relative to volatile reads
and writes, and it says nothing about preventing CPU reordering."
(https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming#volatile-variables-and-reordering)
I see at least two cases (essentially where these are useful on their
own without needing any other fences for ordering (compiler or cpu)):
* Device side communication:
MdeModulePkg/Bus/Ata/AhciPei/AhciMode.c::AhciWaitMemSet()
271 // The system memory pointed by Address will be updated by the
272 // SATA Host Controller, "volatile" is introduced to prevent
273 // compiler from optimizing the access to the memory address
274 // to only read once.
275 //
276 Value = *(volatile UINT32 *) (UINTN) Address;
277 Value &= MaskValue;
278
So use read_once() instead of the volatile cast:
276 Value = read_once (Address);
* For handling cases where we don't need a fence, but only
need to ensure that the compiler does not mutilate a store in any
way. As an example, the code below clears the Handler and the
block below it calls the Handler if the pointer is non-NULL.
(From:
https://patchew.org/EDK2/20210129005950.467638-1-ankur.a.arora@oracle.com/20210129005950.467638-7-ankur.a.arora@oracle.com/,
https://patchew.org/EDK2/20210129005950.467638-1-ankur.a.arora@oracle.com/20210129005950.467638-9-ankur.a.arora@oracle.com/)
CpuEject():
99 + //
100 + // We are done until the next hot-unplug; clear the handler.
101 + //
102 + mCpuHotEjectData->Handler = NULL;
SmmCpuFeaturesRendezvousExit():
140 + if (mCpuHotEjectData == NULL ||
141 + mCpuHotEjectData->Handler == NULL) {
142 + return;
143 + }
144 +
145 + mCpuHotEjectData->Handler (CpuIndex);
If the code on line 102 were instead,
102 + write_once (&mCpuHotEjectData->Handler, NULL);
and that on lines 141-145 were:
140 + if (mCpuHotEjectData == NULL ||
141 + ((handler = read_once (mCpuHotEjectData->Handler)) == NULL)) {
142 + return;
143 + }
144 +
145 + handler (CpuIndex);
The use of read_once()/write_once() also documents the implicit assumptions
being made in the code at the point of use.
Thanks
Ankur
On 02/04/21 21:04, Paolo Bonzini wrote:
Il gio 4 feb 2021, 20:46 Ard Biesheuvel <ardb@...> ha scritto:
(1) We should introduce finer-grained fence primitives:
ARM AARCH64 i386
CompilerFence() asm("") asm("") asm("")
AcquireMemoryFence() dmb ish dmb ishld asm("")
ReleaseMemoryFence() dmb ish dmb ish asm("")
MemoryFence() dmb ish dmb ish mfence
"where AcquireMemoryFence() is used on the read side (i.e. between
reads) and ReleaseMemoryFence() is used on the write side (i.e.
between writes)".
Acquire semantics typically order writes before reads, not /between/
reads. Similarly, release semantics imply that all outstanding writes
complete before a barrier with acquire semantics is permitted to
complete.
[*]
Acquire fences are barriers between earlier loads and subsequent loads
and stores; those earlier loads then synchronize with release stores
in other threads.
Release fences are barriers between earlier loads and stores and
subsequent stores, and those subsequent stores synchronize with
acquire loads in other threads.
Doesn't look too specific:
In both cases, however, fences only make sense between memory
operations. So something like "after reads" and "before writes" would
have been more precise in some sense, but in practice the usual idiom
is "between" reads/writes as Laszlo wrote.
Note that reasoning about this only makes sense in the context of
concurrency, i.e., different CPUs operating on the same memory (or
coherent DMA masters)
For non-coherent DMA, the 'ish' variants are not appropriate, and
given the low likelihood that any of this creates a performance
bottleneck, I would suggest to only use full barriers on ARM.
Sure, that's a matter of how to implement the primitives. If you think
that non-coherent DMA is important, a full dmb can be used too.
As far as the compiler is concerned, an asm in the macros *should*
block enough optimizations, even without making the accesses volatile.
CompilerFence (or the edk2 equivalent of cpu_relax, whose name escapes
me right now) would be necessary in the body of busy-waiting loops.
However somebody should check the MSVC docs for asm, too.
https://docs.microsoft.com/en-us/cpp/assembler/inline/optimizing-inline-assembly?view=msvc-160
It is very important to be *aware* of the acquire/release semantics,
but I don't think it is necessary to use all the fine grained barrier
types in EDK2.
I agree as long as the primitives are self-documenting. A single
MemoryFence() does not make it clear in which direction the data is
flowing (whether from other threads to this one, or vice versa).
I've found this article very relevant:
https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming
(1) It provides an example for a store-load (Dekker's algorithm) where
any combination of read-acquire + write-release is insufficient. Thus it
would need an MFENCE (hence we need all four APIs in edk2).
(... If we jump back to the part I marked with [*], then we can see
Paolo's description of read-acquire and store-release covers load-load,
load-store, load-store (again), and store-store. What's not covered is
store-load, which Paolo said elsewhere in this thread is exactly what
x86 does reorder. So the MemoryFence() API's use would be "exceptional",
in future code, but it should exist for supporting patterns like
Dekker's algorithm.)
(2) I think the "Recommendations" section:
https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming#recommendations
highlights the very problem we have. It recommends
When doing lockless programming, be sure to use volatile flag
variables and memory barrier instructions as needed.
*after* explaining why "volatile" is generally insufficient:
https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming#volatile-variables-and-reordering
and *after* describing the compiler barriers.
So this recommendation should recommend compiler barriers rather than
volatile. :/
(3) The article recommends _ReadWriteBarrier, _ReadBarrier and
_WriteBarrier, for compiler fences. I think _ReadWriteBarrier should
suffice for edk2's purposes.
However, the following reference deprecates those intrinsics:
https://docs.microsoft.com/en-us/cpp/intrinsics/readbarrier?view=msvc-160
while offering *only* C++ language replacements.
Could we implement CompilerFence() for all edk2 architectures as
*non-inline* assembly? The function would consist of a return
instruction only. For x86, we could use a NASM source; for ARM, separate
MS and GNU assembler sources would be needed.
I totally want to get rid of "volatile" at least in future code, but
that's only possible if one of the following options can be satisfied:
- we find a supported replacement method for _ReadWriteBarrier when
using the MSFT toolchain family (such as the *non-inline*, empty
assembly function),
- or we accept that CompilerFence() is not possible to implement
portably, and we only offer the heavier-weight acquire / release /
full fences, which *include* a compiler fence too.
In the latter case, the body of a busy-waiting loop would have to use
the heavier read-acquire API.
--*--
So the structure of the solution we're looking for is:
- exactly *one* of:
- volatile
- compiler fence
- acquire fence used as a heavy substitute for compiler fence,
Since the first two are of unequal power -- volatile for specific
accesses and the compiler fence being a general barrier across which
the compiler does not optimize memory accesses I think we should have
both instead of volatile or compiler fence.
However, instead of volatile, how about read_once() and write_once()
primitives? These, respectively, ensure that the compiler emits a
single load or a single store, and does not combine or mutilate the
load/store in any way.
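A minimal sketch of such primitives, following the Linux kernel's READ_ONCE()/WRITE_ONCE() approach (GNU C __typeof__; the lowercase names and the pointer-argument convention mirror the examples below and are assumptions, not an existing edk2 API):

```c
#include <stdint.h>

//
// read_once(p) / write_once(p, v) force exactly one load/store of *p
// by performing the access through a volatile-qualified pointer. They
// constrain the compiler only; they are not CPU memory barriers.
//
#define read_once(p)      (*(volatile __typeof__ (*(p)) *) (p))
#define write_once(p, v)  (*(volatile __typeof__ (*(p)) *) (p) = (v))

//
// Example: poll a device-updated status word, in the spirit of
// AhciWaitMemSet(). The load cannot be combined with other loads or
// hoisted out of a surrounding loop.
//
static uint32_t  mStatus;

uint32_t
ReadStatus (
  uint32_t  Mask
  )
{
  uint32_t  Value;

  Value  = read_once (&mStatus);  // single, non-optimized load
  Value &= Mask;
  return Value;
}
```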
Internally, these are implemented via volatile (at least the Linux
kernel implementation that I looked at is) but they would explicitly
lay out the guarantees (and non-guarantees) that the C/C++ standard
makes:
"However, the guarantees of the standard are not sufficient for using
volatile for multi-threading. The C++ Standard does not stop the compiler
from reordering non-volatile reads and writes relative to volatile reads
and writes, and it says nothing about preventing CPU reordering."
(https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming#volatile-variables-and-reordering)
I see at least two cases (essentially where these are useful on their
own without needing any other fences for ordering (compiler or cpu)):
* Device side communication:
MdeModulePkg/Bus/Ata/AhciPei/AhciMode.c::AhciWaitMemSet()
  //
  // The system memory pointed by Address will be updated by the
  // SATA Host Controller, "volatile" is introduced to prevent
  // compiler from optimizing the access to the memory address
  // to only read once.
  //
  Value = *(volatile UINT32 *) (UINTN) Address;
  Value &= MaskValue;

So use read_once() instead of the volatile cast:

  Value = read_once (Address);
* For handling cases where we don't need a fence, but only
need to ensure that the compiler does not mutilate a store in any
way. As an example, the code below clears the Handler and the
block below it calls the Handler if the pointer is non-NULL.
(From:
https://patchew.org/EDK2/20210129005950.467638-1-ankur.a.arora@oracle.com/20210129005950.467638-7-ankur.a.arora@oracle.com/,
https://patchew.org/EDK2/20210129005950.467638-1-ankur.a.arora@oracle.com/20210129005950.467638-9-ankur.a.arora@oracle.com/)
CpuEject():
99 + //
100 + // We are done until the next hot-unplug; clear the handler.
101 + //
102 + mCpuHotEjectData->Handler = NULL;
SmmCpuFeaturesRendezvousExit():
140 + if (mCpuHotEjectData == NULL ||
141 + mCpuHotEjectData->Handler == NULL) {
142 + return;
143 + }
144 +
145 + mCpuHotEjectData->Handler (CpuIndex);
If the code on line 102 were instead,
102 + write_once (&mCpuHotEjectData->Handler, NULL);
and that on lines 141-145 were:
140 + if (mCpuHotEjectData == NULL ||
141 +       ((handler = read_once (&mCpuHotEjectData->Handler)) == NULL)) {
142 + return;
143 + }
144 +
145 + handler (CpuIndex);
The use of read_once()/write_once() also documents the implicit assumptions
being made in the code at the point of use.
Thanks
Ankur
- and *all* of
- acquire fence (load-load, load-store)
- release fence (load-store, store-store)
- full fence (load-load, load-store, store-store, store-load)
The implementation of each fence would have to be *at least* as safe as
required; it could be stronger.
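For reference, C11 already provides exactly this taxonomy. A sketch of how the names discussed in this thread could map onto <stdatomic.h> (an illustration under the assumption that a C11 toolchain is available, not a proposed edk2 implementation):

```c
#include <stdatomic.h>

//
// Acquire fence: orders earlier loads against later loads and stores.
//
static inline void
AcquireMemoryFence (
  void
  )
{
  atomic_thread_fence (memory_order_acquire);
}

//
// Release fence: orders earlier loads and stores against later stores.
//
static inline void
ReleaseMemoryFence (
  void
  )
{
  atomic_thread_fence (memory_order_release);
}

//
// Full fence: all four orderings, including store-load.
//
static inline void
MemoryFence (
  void
  )
{
  atomic_thread_fence (memory_order_seq_cst);
}

//
// Compiler-only fence: no instruction emitted on most targets.
//
static inline void
CompilerFence (
  void
  )
{
  atomic_signal_fence (memory_order_seq_cst);
}
```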
I feel that we have to reach an agreement on the "exactly one of" part;
subsequent to that, maybe I can try an RFC patch for <BaseLib.h> (just
the interface contracts, at first).
Thanks
Laszlo
Andrew Fish <afish@...>
On Feb 8, 2021, at 9:40 AM, Laszlo Ersek <lersek@...> wrote:
"inline assembly is not supported on the ARM and x64 processors." [1]
On 02/04/21 21:04, Paolo Bonzini wrote:
As far as the compiler is concerned, an asm in the macros *should*
block enough optimizations, even without making the accesses volatile.
CompilerFence (or the edk2 equivalent of cpu_relax, whose name escapes
me right now) would be necessary in the body of busy-waiting loops.
However somebody should check the MSVC docs for asm, too.
https://docs.microsoft.com/en-us/cpp/assembler/inline/optimizing-inline-assembly?view=msvc-160
Kind of looks like MSVC replaced inline assembly with intrinsics [2].
Looks like _ReadBarrier, _ReadWriteBarrier [3] are cross arch. But as you point out when you look them up [4]:
The _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier compiler intrinsics and the MemoryBarrier macro are all deprecated and should not be used. For inter-thread communication, use mechanisms such as atomic_thread_fence and std::atomic<T> that are defined in the C++ Standard Library. For hardware access, use the /volatile:iso compiler option together with the volatile keyword.
The scary statement there is /volatile:iso compiler option + volatile keyword for hardware access.
[1] https://docs.microsoft.com/en-us/cpp/assembler/inline/inline-assembler?view=msvc-160
[2] https://docs.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list?view=msvc-160
[3] https://docs.microsoft.com/en-us/cpp/intrinsics/intrinsics-available-on-all-architectures?view=msvc-160
[4] https://docs.microsoft.com/en-us/cpp/intrinsics/readbarrier?view=msvc-160
I think we have a good mapping for GCC/clang on x86, but I'm still not
100% clear on what to do for MSVC++.
The VC++ docs seem to point you toward:
1) volatile + /volatile:iso compiler flag for MMIO.
3) For synchronization: inline void atomic_thread_fence(memory_order Order) noexcept;
It looks like memory_order maps into the primitives you are proposing?
memory_order_relaxed The fence has no effect.
memory_order_consume The fence is an acquire fence.
memory_order_acquire The fence is an acquire fence.
memory_order_release The fence is a release fence.
memory_order_acq_rel The fence is both an acquire fence and a release fence.
memory_order_seq_cst The fence is both an acquire fence and a release fence, and is sequentially consistent.
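The direction-of-data-flow point can be illustrated with the classic message-passing pairing. This sketch uses C11 <stdatomic.h> and pthreads directly (the Producer/Consumer names are illustrative assumptions): a release fence on the writer side pairs with an acquire fence on the reader side, so the consumer is guaranteed to see the payload once it sees the flag.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int         mData;   // payload, plain (non-atomic) access
static atomic_int  mReady;  // publication flag

static void *
Producer (
  void  *Arg
  )
{
  (void)Arg;
  mData = 42;                                               // 1. write payload
  atomic_thread_fence (memory_order_release);               // 2. release fence
  atomic_store_explicit (&mReady, 1, memory_order_relaxed); // 3. publish flag
  return NULL;
}

static void *
Consumer (
  void  *Arg
  )
{
  while (atomic_load_explicit (&mReady, memory_order_relaxed) == 0) {
    ;                                              // busy-wait for the flag
  }
  atomic_thread_fence (memory_order_acquire);      // pairs with release fence
  *(int *)Arg = mData;                             // guaranteed to read 42
  return NULL;
}

int
RunMessagePassing (
  void
  )
{
  pthread_t  P;
  pthread_t  C;
  int        Result = 0;

  atomic_store (&mReady, 0);
  pthread_create (&C, NULL, Consumer, &Result);
  pthread_create (&P, NULL, Producer, NULL);
  pthread_join (P, NULL);
  pthread_join (C, NULL);
  return Result;
}
```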
But it kind of seems like we may be falling down a C++ rabbit hole...
As I pointed out, my experience with clang is that volatile is not
enough, and we needed the compiler fence.
Thanks,
Andrew Fish
Thanks
Laszlo
Paolo Bonzini <pbonzini@...>
On 08/02/21 18:40, Laszlo Ersek wrote:
(3) The article recommends _ReadWriteBarrier, _ReadBarrier and
_WriteBarrier, for compiler fences. I think _ReadWriteBarrier should
suffice for edk2's purposes.
However, the following reference deprecates those intrinsics:
https://docs.microsoft.com/en-us/cpp/intrinsics/readbarrier?view=msvc-160
while offering *only* C++ language replacements.
In theory, atomic_thread_fence (and in this case, more specifically,
atomic_signal_fence) is available also in C11. But if it is possible
to assume that _RWB is present on Microsoft toolchains (even though
it's deprecated), I would go for it, as it seems to be the best match
for CompilerFence.
In fact, even Microsoft's own STL does
#define _Compiler_barrier() \
_STL_DISABLE_DEPRECATED_WARNING _ReadWriteBarrier() \
_STL_RESTORE_DEPRECATED_WARNING
extern "C" inline void
atomic_signal_fence(const memory_order _Order) noexcept
{
if (_Order != memory_order_relaxed) {
_Compiler_barrier();
}
}
(https://github.com/microsoft/STL/blob/master/stl/inc/atomic).
For fences that require an assembly instruction, it is possible to use intrinsics too: _mm_mfence() for x86, __dmb(_ARM_BARRIER_SY) for ARM.
Paolo
Laszlo Ersek
On 02/04/21 21:04, Paolo Bonzini wrote:
On Thu, 4 Feb 2021 at 20:46, Ard Biesheuvel <ardb@...> wrote:
(1) We should introduce finer-grained fence primitives:

                      ARM      AARCH64    i386
CompilerFence()       asm("")  asm("")    asm("")
AcquireMemoryFence()  dmb ish  dmb ishld  asm("")
ReleaseMemoryFence()  dmb ish  dmb ish    asm("")
MemoryFence()         dmb ish  dmb ish    mfence

"where AcquireMemoryFence() is used on the read side (i.e. between
reads) and ReleaseMemoryFence() is used on the write side (i.e.
between writes)".
Acquire semantics typically order writes before reads, not /between/
reads. Similarly, release semantics imply that all outstanding writes
complete before a barrier with acquire semantics is permitted to
complete.
[*] Acquire fences are barriers between earlier loads and subsequent
loads and stores; those earlier loads then synchronize with release
stores in other threads.
Release fences are barriers between earlier loads and stores against
subsequent stores, and those subsequent stores synchronize with
acquire loads in other threads.
In both cases, however, fences only make sense between memory
operations. So something like "after reads" and "before writes" would
have been more precise in some sense, but in practice the usual idiom
is "between" reads/writes as Laszlo wrote.
Note that reasoning about this only makes sense in the context of
concurrency, i.e., different CPUs operating on the same memory (or
coherent DMA masters).
For non-coherent DMA, the 'ish' variants are not appropriate, and
given the low likelihood that any of this creates a performance
bottleneck, I would suggest to only use full barriers on ARM.
Sure, that's a matter of how to implement the primitives. If you think
that non-coherent DMA is important, a full dmb can be used too.
As far as the compiler is concerned, an asm in the macros *should*
block enough optimizations, even without making the accesses volatile.
CompilerFence (or the edk2 equivalent of cpu_relax, whose name escapes
me right now) would be necessary in the body of busy-waiting loops.
However somebody should check the MSVC docs for asm, too.
Doesn't look too specific:
https://docs.microsoft.com/en-us/cpp/assembler/inline/optimizing-inline-assembly?view=msvc-160
It is very important to be *aware* of the acquire/release semantics,
but I don't think it is necessary to use all the fine grained barrier
types in EDK2.
I agree as long as the primitives are self-documenting. A single
MemoryFence() does not make it clear in which direction the data is
flowing (whether from other threads to this one, or vice versa).
I've found this article very relevant:
https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming
(1) It provides an example for a store-load (Dekker's algorithm) where
any combination of read-acquire + write-release is insufficient. Thus it
would need an MFENCE (hence we need all four APIs in edk2).
(... If we jump back to the part I marked with [*], then we can see
Paolo's description of read-acquire and store-release covers load-load,
load-store, load-store (again), and store-store. What's not covered is
store-load, which Paolo said elsewhere in this thread is exactly what
x86 does reorder. So the MemoryFence() API's use would be "exceptional",
in future code, but it should exist for supporting patterns like
Dekker's algorithm.)
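The store-load case underlying Dekker's algorithm can be sketched with the store-buffering litmus test. With full (seq_cst) fences between each thread's store and its subsequent load, the outcome where both threads read zero is forbidden; acquire/release fences alone would not forbid it, because they do not order a store against a later load. C11 fences stand in here for the proposed MemoryFence(); this is an illustrative sketch, not edk2 code.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int  mX;
static atomic_int  mY;
static int         mR1;
static int         mR2;

static void *
Thread1 (
  void  *Arg
  )
{
  (void)Arg;
  atomic_store_explicit (&mX, 1, memory_order_relaxed);  // store own flag
  atomic_thread_fence (memory_order_seq_cst);            // store-load fence
  mR1 = atomic_load_explicit (&mY, memory_order_relaxed);
  return NULL;
}

static void *
Thread2 (
  void  *Arg
  )
{
  (void)Arg;
  atomic_store_explicit (&mY, 1, memory_order_relaxed);
  atomic_thread_fence (memory_order_seq_cst);
  mR2 = atomic_load_explicit (&mX, memory_order_relaxed);
  return NULL;
}

//
// With the full fences in place, mR1 == 0 && mR2 == 0 is impossible:
// at least one thread must observe the other's store.
//
int
BothSawZero (
  void
  )
{
  pthread_t  T1;
  pthread_t  T2;

  atomic_store (&mX, 0);
  atomic_store (&mY, 0);
  pthread_create (&T1, NULL, Thread1, NULL);
  pthread_create (&T2, NULL, Thread2, NULL);
  pthread_join (T1, NULL);
  pthread_join (T2, NULL);
  return (mR1 == 0) && (mR2 == 0);
}
```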
(2) I think the "Recommendations" section:
https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming#recommendations
highlights the very problem we have. It recommends
When doing lockless programming, be sure to use volatile flag
variables and memory barrier instructions as needed.
*after* explaining why "volatile" is generally insufficient:
https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming#volatile-variables-and-reordering
and *after* describing the compiler barriers.
So this recommendation should recommend compiler barriers rather than
volatile. :/
(3) The article recommends _ReadWriteBarrier, _ReadBarrier and
_WriteBarrier, for compiler fences. I think _ReadWriteBarrier should
suffice for edk2's purposes.
However, the following reference deprecates those intrinsics:
https://docs.microsoft.com/en-us/cpp/intrinsics/readbarrier?view=msvc-160
while offering *only* C++ language replacements.
Could we implement CompilerFence() for all edk2 architectures as
*non-inline* assembly? The function would consist of a return
instruction only. For x86, we could use a NASM source; for ARM, separate
MS and GNU assembler sources would be needed.
I totally want to get rid of "volatile" at least in future code, but
that's only possible if one of the following options can be satisfied:
- we find a supported replacement method for _ReadWriteBarrier when
using the MSFT toolchain family (such as the *non-inline*, empty
assembly function),
- or we accept that CompilerFence() is not possible to implement
portably, and we only offer the heavier-weight acquire / release /
full fences, which *include* a compiler fence too.
In the latter case, the body of a busy-waiting loop would have to use
the heavier read-acquire API.
--*--
So the structure of the solution we're looking for is:
- exactly *one* of:
- volatile
- compiler fence
- acquire fence used as a heavy substitute for compiler fence,
- and *all* of
- acquire fence (load-load, load-store)
- release fence (load-store, store-store)
- full fence (load-load, load-store, store-store, store-load)
The implementation of each fence would have to be *at least* as safe as
required; it could be stronger.
I feel that we have to reach an agreement on the "exactly one of" part;
subsequent to that, maybe I can try an RFC patch for <BaseLib.h> (just
the interface contracts, at first).
Thanks
Laszlo
Ard Biesheuvel <ardb@...>
On Mon, 8 Feb 2021 at 17:11, Laszlo Ersek <lersek@...> wrote:
On 02/06/21 09:37, Ard Biesheuvel wrote:
On ARM, 'inner shareable' is [generally] the subset of nodes and
edges that make up the CPUs, caches and cache-coherent DMA masters
(but not main memory)
This seems consistent with the fact that Paolo recommended variants of
"dmb ish" for all the memory (not compiler) fences -- I guess we don't
care about actual RAM content as long as all CPUs and cache-coherent DMA
masters see the same data.
Generally, yes. But I already mentioned non-cache coherent DMA
devices, and we also have some ARM platforms that use non-shareable
mappings for SRAM, which is used as temporary PEI memory in some
cases.
So the safe option is really to just use system-wide barriers - if
anyone has a use case where this is a performance bottleneck, we can
always revisit this later.