Re: Cross referencing EDKII code
I have found that opening the edk2 folder in VSCode does a decent job of resolving symbols, but you still have to understand edk2 specifics like library resolution, public/private files, packages, and the "platform" DSC file.
On 3/4/2021 10:15 AM, wonderfly@... wrote:

Started looking at EDKII code about two months ago. One of the things I still haven't figured out is how to read code effectively, possibly with the help of cross referencing tools. Haven't got much luck with cscope, and I don't see any compile_commands.json being generated either. What do you folks use for cross referencing? What's your development setup like?
Re: Cross referencing EDKII code
Rebecca Cran <rebecca@...>
It's very specific for my use of building it for my site, but I've put the Doxyfile at https://bsdio.com/edk2/docs/master.doxyfile . You'll want to change things like the output directory, which is hard-coded to /home/bcran/sites/bsdio.com/edk2/docs . But then you should be able to generate a directory of html files by running "doxygen master.doxyfile". There _is_ a doxyfile task in the edk2 repo at BaseTools/Scripts/PackageDocumentTools/plugins/EdkPlugins/basemodel/doxygen.py, but from looking at it a while ago I found it to be too simple for my use, for example lacking the search function. -- Rebecca Cran
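A minimal shell sketch of that workflow (an assumption-laden example, not part of the original mail: it presumes doxygen is installed and the master.doxyfile above has been downloaded into the current directory, and the ./edk2-docs path is arbitrary). Doxygen's documented "-" argument reads the configuration from stdin, so the hard-coded OUTPUT_DIRECTORY can be overridden without editing the file:

```shell
OUT=./edk2-docs
mkdir -p "$OUT"
if command -v doxygen >/dev/null 2>&1 && [ -f master.doxyfile ]; then
  # Doxygen reads its config from stdin when invoked as "doxygen -",
  # and later assignments override earlier ones, so appending a line
  # replaces the hard-coded output directory.
  ( cat master.doxyfile ; echo "OUTPUT_DIRECTORY = $OUT" ) | doxygen -
else
  echo "doxygen or master.doxyfile missing; would generate HTML in $OUT"
fi
```

The generated HTML lands in $OUT/html, which you can then open locally in a browser.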
On 3/4/21 3:19 PM, Daniel Wang wrote:

Very nice! Thank you for the pointers. I wonder how the cross links were created and whether it's possible to integrate them into my editor (vim). If not, is there a build target for doxygen, so that I could generate these html files for the branch I am on?

On Thu, Mar 4, 2021 at 2:12 PM Rebecca Cran <rebecca@...> wrote:

On 3/4/21 11:15 AM, wonderfly@... wrote:
> Started looking at EDKII code about two months ago. One of the things I still haven't figured out is how to read code effectively, possibly with the help of cross referencing tools. Haven't got much luck with cscope, and I don't see any compile_commands.json being generated either.
> What do you folks use for cross referencing? What's your development setup like?

Personally, I've found Doxygen a great help in understanding EDK2. I've been maintaining a build of the documentation at https://bsdio.com/edk2/docs/master/index.html for a few years. I think I've enabled all the search and browsing features, so for example you can use the search bar in the top-right corner to find the EFI_BOOT_SERVICES docs (https://bsdio.com/edk2/docs/master/struct_e_f_i___b_o_o_t___s_e_r_v_i_c_e_s.html).

-- Rebecca Cran

-- Best, Daniel
Re: Cross referencing EDKII code
Rebecca Cran <rebecca@...>
On 3/4/21 11:15 AM, wonderfly@... wrote:

Started looking at EDKII code about two months ago. One of the things I still haven't figured out is how to read code effectively, possibly with the help of cross referencing tools. Haven't got much luck with cscope, and I don't see any compile_commands.json being generated either. What do you folks use for cross referencing? What's your development setup like?

Personally, I've found Doxygen a great help in understanding EDK2. I've been maintaining a build of the documentation at https://bsdio.com/edk2/docs/master/index.html for a few years. I think I've enabled all the search and browsing features, so for example you can use the search bar in the top-right corner to find the EFI_BOOT_SERVICES docs (https://bsdio.com/edk2/docs/master/struct_e_f_i___b_o_o_t___s_e_r_v_i_c_e_s.html).

-- Rebecca Cran
Cross referencing EDKII code
Started looking at EDKII code about two months ago. One of the things I still haven't figured out is how to read code effectively, possibly with the help of cross referencing tools. Haven't got much luck with cscope, and I don't see any compile_commands.json being generated either.
What do you folks use for cross referencing? What's your development setup like?
Re: [edk2-devel] [RFC] UnitTestFrameworkPkg cmocka submodule alternatives
Re: [edk2-devel] RFC: Adding support for ARM (RNDR etc.) to RngDxe
Ard Biesheuvel <ardb@...>
On Wed, 10 Feb 2021 at 23:49, Rebecca Cran <rebecca@...> wrote: On 1/15/21 7:51 PM, Sami Mujawar wrote:
I have shared some initial thoughts on the RNG implementation updates at https://edk2.groups.io/g/devel/files/Designs/2021/0116/EDKII%20-%20Proposed%20update%20to%20RNG%20implementation.pdf
Kindly let me know your feedback or if you have any queries.

The ARMv8.5 RNDRRS instruction appears to be missing from the diagram on page 11 - it has RngLib|RNDR, which is listed under PRNG, but RNDRRS returns a true random number. From the Arm ARM:
"Returns a 64-bit random number which is reseeded from the True Random Number source immediately before the read of the random number."
This is an unfortunate oversight in the architecture, but RNDRRS most certainly does not return a true random number. RNDR and RNDRRS both return the output of a DRBG (pseudo RNG), and the only difference is the reseed rate: RNDRRS triggers a reseed on every invocation, whereas RNDR triggers a reseed at an IMPDEF rate.
On 02/11/21 10:07, Paolo Bonzini wrote:

On 10/02/21 20:07, Andrew Fish wrote:
Laszlo,
I think it makes sense to “break this up”. Seems like we need correct primitives and documentation on how to use them. It is easy enough after that to use code review to make new code “correct”, but as you say fixing the general volatile usage (especially given some of the VC++ behavior) is a big undertaking. Let us not have perfection block being able to do it correctly going forward.
Changing the CPU drivers (MP libraries) is a complex undertaking as we really need to test against the various compilers. I’ve tracked down quite a few MP bugs in proprietary CPU drivers that did not show up under VC++, but did under clang. I think the first step should be to introduce the new fence primitives and fixing MemoryFence to be what it says on the tin.

Right, I hope to propose some patches after the stable tag. I can volunteer for the work of removing volatile once those are in.

Thanks,
Paolo

-- should not be your job, but I'll gladly take your help. Seeing a few examples from you will hopefully teach me some patterns.

Laszlo
Paolo Bonzini <pbonzini@...>
On 10/02/21 20:07, Andrew Fish wrote:

Laszlo,

I think it makes sense to “break this up”. Seems like we need correct primitives and documentation on how to use them. It is easy enough after that to use code review to make new code “correct”, but as you say fixing the general volatile usage (especially given some of the VC++ behavior) is a big undertaking. Let us not have perfection block being able to do it correctly going forward. Changing the CPU drivers (MP libraries) is a complex undertaking as we really need to test against the various compilers. I’ve tracked down quite a few MP bugs in proprietary CPU drivers that did not show up under VC++, but did under clang. I think the first step should be to introduce the new fence primitives and fixing MemoryFence to be what it says on the tin.

I can volunteer for the work of removing volatile once those are in.

Paolo
On 2021-02-10 7:55 a.m., Laszlo Ersek wrote: On 02/10/21 08:21, Ankur Arora wrote:
In general I would say that fences are -- except for one case -- strictly greater in power than read_once/write_once. That one case is alignment checks (which we do for instance in MmioRead32() and friends.)
Given that difference in power, I agree with the approach Paolo recommended elsewhere that for CPU memory access synchronization we start with Fences and then switch to lesser powered variants as and when needed:
"However, these changes do not have to be done in a single step. Starting with the rationalization of memory fences makes sense because, in addition to enabling the annotation of code rather than data, it is essentially a bugfix."

Given the actual amount of MemoryFence() calls and "volatile" uses in edk2, it's actually a pretty large undertaking to "fix" these issues. I don't foresee a quick resolution here, especially if I'm supposed to work on it alone (considering actual patches). I've not seen any feedback from Mike or Liming in this thread yet, for example, so I'm unsure about the MdePkg (BaseLib) maintainer buy-in. This is just to state the context in which I interpret "first step" and "essentially a bugfix". The issue is systemic in edk2, as every such occasion as we've run into now has only added to the proliferation of "volatile". The next stable tag is coming up too. I recommend we focus on merging the VCPU hotplug series. I don't intend to let this thread silently peter out (like many similar threads must have, in the past); I hope to do something about actual patches in the next development cycle. I do admit it looks like a daunting task, considering the multiple architectures and toolchains.

Yeah, also given that the bugs around this are subtle. And, it is a non-trivial effort to test all the affected pieces of logic. Just thinking out loud -- maybe the thing to do might be to write some validation code that compares before and after generated code to flag cases when there are unexpected changes. I'm not the most experienced with low level barriers and the like (except for having used them once in a while) but happy to help any way I can.

(NB: we've not spent a single word on RISC-V yet -- AFAICS, RISC-V doesn't even have *any* implementation of MemoryFence() yet. Refer to commit 7601b251fd5c ("MdePkg/BaseLib: BaseLib for RISCV64 architecture", 2020-05-07), for example.)
"You touch it, you own it" and "learn as you go" should not be a surprise to me in an open source project, but I admit I don't know if I should even attempt patching UefiCpuPkg content (MdePkg and OvmfPkg look more approachable). The risk of regressions runs high, reproducibility is always a mess in multiprocessing code, and the code is also privileged. I want it fixed everywhere, but I feel uncertain about the apparent lack of commitment.

I have some hands-on multiprocessing background, but I've always made a very conscious decision to stay away from lockless programming, and use safe and *simple* POSIX Threads-based schemes (mutexes, condition variables with *broadcasts* etc). The "network of IP blocks" that Ard used for (correctly) describing modern computer architecture is way below the programming abstraction level where I feel comfortable. Thus I'm very much out of my comfort zone here. (My initial decision against lockless programming goes back to when <http://www.1024cores.net/> was published originally.)

Anyway: I have several messages in this thread starred for revising. Ankur, is the picture clear enough for you to return to work on the v7 unplug series? The soft feature freeze is coming: 2021-02-22.

https://github.com/tianocore/tianocore.github.io/wiki/EDK-II-Release-Planning

Thanks for the heads up. Yeah, I think things are clear enough for me to send out v7. I plan to send that out sometime early next week so there's enough time to review before the soft freeze date.

Thanks
Ankur

Thanks
Laszlo
Re: [edk2-devel] RFC: Adding support for ARM (RNDR etc.) to RngDxe
Rebecca Cran <rebecca@...>
On 1/15/21 7:51 PM, Sami Mujawar wrote:

I have shared some initial thoughts on the RNG implementation updates at https://edk2.groups.io/g/devel/files/Designs/2021/0116/EDKII%20-%20Proposed%20update%20to%20RNG%20implementation.pdf Kindly let me know your feedback or if you have any queries.

The ARMv8.5 RNDRRS instruction appears to be missing from the diagram on page 11 - it has RngLib|RNDR, which is listed under PRNG, but RNDRRS returns a true random number. From the Arm ARM:

"Returns a 64-bit random number which is reseeded from the True Random Number source immediately before the read of the random number."

-- Rebecca Cran
On Feb 10, 2021, at 7:55 AM, Laszlo Ersek <lersek@...> wrote:
On 02/10/21 08:21, Ankur Arora wrote:
In general I would say that fences are -- except for one case -- strictly greater in power than read_once/write_once. That one case is alignment checks (which we do for instance in MmioRead32() and friends.)
Given that difference in power, I agree with the approach Paolo recommended elsewhere that for CPU memory access synchronization we start with Fences and then switch to lesser powered variants as and when needed:
"However, these changes do not have to be done in a single step. Starting with the rationalization of memory fences makes sense because, in addition to enabling the annotation of code rather than data, it is essentially a bugfix. " Given the actual amount of MemoryFence() calls and "volatile" uses in edk2, it's actually a pretty large undertaking to "fix" these issues. I don't foresee a quick resolution here, especially if I'm supposed to work on it alone (considering actual patches).
I've not seen any feedback from Mike or Liming in this thread yet, for example, so I'm unsure about the MdePkg (BaseLib) maintainer buy-in.
This is just to state the context in which I interpret "first step" and "essentially a bugfix". The issue is systemic in edk2, as every such occasion as we've run into now has only added to the proliferation of "volatile".
Laszlo,

I think it makes sense to “break this up”. Seems like we need correct primitives and documentation on how to use them. It is easy enough after that to use code review to make new code “correct”, but as you say fixing the general volatile usage (especially given some of the VC++ behavior) is a big undertaking. Let us not have perfection block being able to do it correctly going forward. Changing the CPU drivers (MP libraries) is a complex undertaking as we really need to test against the various compilers. I’ve tracked down quite a few MP bugs in proprietary CPU drivers that did not show up under VC++, but did under clang.

Thanks,

Andrew Fish

The next stable tag is coming up too. I recommend we focus on merging the VCPU hotplug series. I don't intend to let this thread silently peter out (like many similar threads must have, in the past); I hope to do something about actual patches in the next development cycle. I do admit it looks like a daunting task, considering the multiple architectures and toolchains.
(NB: we've not spent a single word on RISC-V yet -- AFAICS, RISC-V doesn't even have *any* implementation of MemoryFence() yet. Refer to commit 7601b251fd5c ("MdePkg/BaseLib: BaseLib for RISCV64 architecture", 2020-05-07), for example.)
"You touch it, you own it" and "learn as you go" should not be a surprise to me in an open source project, but I admit I don't know if I should even attempt patching UefiCpuPkg content (MdePkg and OvmfPkg look more approachable). The risk of regressions runs high, reproducibility is always a mess in multiprocessing code, and the code is also privileged. I want it fixed everywhere, but I feel uncertain about the apparent lack of commitment.
I have some hands-on multiprocessing background, but I've always made a very conscious decision to stay away from lockless programming, and use safe and *simple* POSIX Threads-based schemes (mutexes, condition variables with *broadcasts* etc). The "network of IP blocks" that Ard used for (correctly) describing modern computer architecture is way below the programming abstraction level where I feel comfortable. Thus I'm very much out of my comfort zone here. (My initial decision against lockless programming goes back to when <http://www.1024cores.net/> was published originally.)
Anyway: I have several messages in this thread starred for revising. Ankur, is the picture clear enough for you to return to work on the v7 unplug series? The soft feature freeze is coming: 2021-02-22.
https://github.com/tianocore/tianocore.github.io/wiki/EDK-II-Release-Planning
Thanks Laszlo
On 02/10/21 08:21, Ankur Arora wrote: In general I would say that fences are -- except for one case -- strictly greater in power than read_once/write_once. That one case is alignment checks (which we do for instance in MmioRead32() and friends.)
Given that difference in power, I agree with the approach Paolo recommended elsewhere that for CPU memory access synchronization we start with Fences and then switch to lesser powered variants as and when needed:
"However, these changes do not have to be done in a single step. Starting with the rationalization of memory fences makes sense because, in addition to enabling the annotation of code rather than data, it is essentially a bugfix."

Given the actual amount of MemoryFence() calls and "volatile" uses in edk2, it's actually a pretty large undertaking to "fix" these issues. I don't foresee a quick resolution here, especially if I'm supposed to work on it alone (considering actual patches). I've not seen any feedback from Mike or Liming in this thread yet, for example, so I'm unsure about the MdePkg (BaseLib) maintainer buy-in. This is just to state the context in which I interpret "first step" and "essentially a bugfix". The issue is systemic in edk2, as every such occasion as we've run into now has only added to the proliferation of "volatile".

The next stable tag is coming up too. I recommend we focus on merging the VCPU hotplug series. I don't intend to let this thread silently peter out (like many similar threads must have, in the past); I hope to do something about actual patches in the next development cycle. I do admit it looks like a daunting task, considering the multiple architectures and toolchains.

(NB: we've not spent a single word on RISC-V yet -- AFAICS, RISC-V doesn't even have *any* implementation of MemoryFence() yet. Refer to commit 7601b251fd5c ("MdePkg/BaseLib: BaseLib for RISCV64 architecture", 2020-05-07), for example.)

"You touch it, you own it" and "learn as you go" should not be a surprise to me in an open source project, but I admit I don't know if I should even attempt patching UefiCpuPkg content (MdePkg and OvmfPkg look more approachable). The risk of regressions runs high, reproducibility is always a mess in multiprocessing code, and the code is also privileged. I want it fixed everywhere, but I feel uncertain about the apparent lack of commitment.

I have some hands-on multiprocessing background, but I've always made a very conscious decision to stay away from lockless programming, and use safe and *simple* POSIX Threads-based schemes (mutexes, condition variables with *broadcasts* etc). The "network of IP blocks" that Ard used for (correctly) describing modern computer architecture is way below the programming abstraction level where I feel comfortable. Thus I'm very much out of my comfort zone here. (My initial decision against lockless programming goes back to when <http://www.1024cores.net/> was published originally.)

Anyway: I have several messages in this thread starred for revising. Ankur, is the picture clear enough for you to return to work on the v7 unplug series? The soft feature freeze is coming: 2021-02-22.

https://github.com/tianocore/tianocore.github.io/wiki/EDK-II-Release-Planning

Thanks
Laszlo
On 2021-02-09 10:40 p.m., Paolo Bonzini wrote:

On Wed, Feb 10, 2021 at 07:37, Ankur Arora <ankur.a.arora@...> wrote:

So I don't quite see what would make "memory_order_seq_cst" harder? From the spec (https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering): "Atomic operations tagged memory_order_seq_cst not only order memory the same way as release/acquire ordering (everything that happened-before a store in one thread becomes a visible side effect in the thread that did a load), but also establish a single total modification order of all atomic operations that are so tagged." Is the total modification order the problem?

The problem is that the ordering does not extend to relaxed (or even acquire/release) loads and stores. Therefore *every* store that needs to be ordered before a seq_cst load must also be seq_cst. This is hard enough to guarantee that using a fence is preferable.

Paolo

Thanks for the explanation. That does sound like a pain to work with. I wonder if there are CPUs with ordering primitives fine-grained enough where this would be useful.

Ankur
On 2021-02-09 5:20 a.m., Laszlo Ersek wrote: On 02/08/21 21:44, Ankur Arora wrote:
On 2021-02-08 9:40 a.m., Laszlo Ersek wrote: So the structure of the solution we're looking for is:
- exactly *one* of:
  - volatile
  - compiler fence
  - acquire fence used as a heavy substitute for compiler fence

Since the first two are of unequal power -- volatile for specific accesses and the compiler fence being a general barrier across which the compiler does not optimize memory accesses -- I think we should have both instead of volatile or compiler fence.

I disagree, as long as we talk about CPU synchronization through variables that live in RAM. "volatile" has a use case, yes, but not for this purpose. The Linux document Paolo referenced earlier explains this well. The compiler fence is a superset of volatile (for this use case), feature-wise, *and* it has better performance.
However, instead of volatile, how about read_once(), write_once() primitives? These, respectively ensure that the compiler emits a single load or a store and does not combine or mutilate the load/store in any way.
Internally, these are implemented via volatile (at least the Linux kernel implementation that I looked at is) but they would explicitly lay out the guarantees (and non-guarantees) that the C/C++ standard makes:
"However, the guarantees of the standard are not sufficient for using volatile for multi-threading. The C++ Standard does not stop the compiler from reordering non-volatile reads and writes relative to volatile reads and writes, and it says nothing about preventing CPU reordering."
(https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming#volatile-variables-and-reordering)
I see at least two cases (essentially where these are useful on their own without needing any other fences for ordering (compiler or cpu)):
* Device side communication:
MdeModulePkg/Bus/Ata/AhciPei/AhciMode.c::AhciWaitMemSet()

  271   // The system memory pointed by Address will be updated by the
  272   // SATA Host Controller, "volatile" is introduced to prevent
  273   // compiler from optimizing the access to the memory address
  274   // to only read once.
  275   //
  276   Value = *(volatile UINT32 *) (UINTN) Address;
  277   Value &= MaskValue;
  278
So use read_once() instead of the volatile cast:

  276   Value = read_once (Address);

This seems to overlap with MmioRead32() from <IoLib.h>, at least in purpose. I think the code should have used MmioRead32() in the first place.

AFAICS, the various versions of MmioRead, MmioWrite defined in MdePkg/Library/BaseIoLibIntrinsic/IoLib.c or MdePkg/Library/BaseIoLibIntrinsic/IoLibNoIo.c are effectively what I was proposing as well: a typecheck and a volatile cast to ensure that the compiler emits a load or a store.
* For handling cases where we don't need a fence, but only need to ensure that the compiler does not mutilate a store in any way. As an example, the code below clears the Handler and the block below it calls the Handler if the pointer is non-NULL.
(From: https://patchew.org/EDK2/20210129005950.467638-1-ankur.a.arora@oracle.com/20210129005950.467638-7-ankur.a.arora@oracle.com/,
https://patchew.org/EDK2/20210129005950.467638-1-ankur.a.arora@oracle.com/20210129005950.467638-9-ankur.a.arora@oracle.com/)
CpuEject():

   99 +  //
  100 +  // We are done until the next hot-unplug; clear the handler.
  101 +  //
  102 +  mCpuHotEjectData->Handler = NULL;
SmmCpuFeaturesRendezvousExit():

  140 +  if (mCpuHotEjectData == NULL ||
  141 +      mCpuHotEjectData->Handler == NULL) {
  142 +    return;
  143 +  }
  144 +
  145 +  mCpuHotEjectData->Handler (CpuIndex);
If the code on line 102 were instead,
102 + write_once (&mCpuHotEjectData->Handler, NULL);
and that on lines 141-145 were:

  140 +  if (mCpuHotEjectData == NULL ||
  141 +      ((handler = read_once (mCpuHotEjectData->Handler)) == NULL)) {
  142 +    return;
  143 +  }
  144 +
  145 +  handler (CpuIndex);

Hmm wait let me gather my thoughts on this...

We have this loop in SmiRendezvous() [UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c]:

  //
  // Wait for BSP's signal to exit SMI
  //
  while (*mSmmMpSyncData->AllCpusInSync) {
    CpuPause ();
  }

(there are two instances of it in SmiRendezvous(), both executed before reaching the Exit label). This is the loop that separates CpuEject() (on the BSP) from SmmCpuFeaturesRendezvousExit() (on the AP). The object accessed is a "volatile BOOLEAN", so conceptually this is a loop body with a compiler fence.

The goal is to highlight the data flow (i.e., release -> acquire) in the code, so we should put a release fence in CpuEject() and an acquire fence in SmmCpuFeaturesRendezvousExit().

* Until we can do that -- until we have the better-named APIs -- we should use MemoryFence() for both fences. However, because MemoryFence() doesn't do anything on VS, we should also keep the volatile, which (on x86) happens to force the same practical effect that we expect of the release and acquire fences too.

* However IIUC you're saying that we shouldn't even aim at placing the release and acquire fences in CpuEject / SmmCpuFeaturesRendezvousExit respectively, but use write_once() / read_once().
Yes, exactly. My point was that at least in the restricted case that I was describing -- where we are only concerned with mCpuHotEjectData->Handler and not its ordering with respect to other stores. Where that would be necessary we would need a compiler and/or memory-fence.

My comment on that is: I don't know. I don't know how read_once() / write_once() in Linux compares to the acquire / release fences. I don't know if read_once / write_once are just as expressive as the acquire / release fences (in Paolo's opinion, for example). I also don't know if their implementations (their effects on the CPUs / visibility) would be interchangeable on all edk2 architectures. So I think this is something that Paolo should please comment on.

*Personally*, I would be satisfied with the following "minimum version" too:

- update the MemoryFence() specification to say it is a full compiler and CPU barrier (covering store-load as well)
- annotate MemoryFence() call sites with comments, to highlight the data flow direction
- make MemoryFence() emit an MFENCE on x86, and waste a minuscule amount of CPU time in response -- that's an acceptable price for covering store-load (Paolo mentioned ~50 clock cycles as penalty)
- fix MemoryFence()'s implementation for VS (currently it does nothing at all)
- use MemoryFence(), and do not use volatile, in the v7 unplug series

For the last item, I will assume that the MemoryFence() will act as both a compiler and a CPU barrier while annotating my code to say what specific ordering I'm banking on.

*However*, Paolo strongly recommends that we introduce and use the directional barriers, for highlighting data flow, and for matching known lockless patterns. Thus, I understand your read_once / write_once suggestion to be an *alternative* to Paolo's. And I'm not equipped to compare both recommendations, alas.

Oh, no. I wasn't suggesting read_once/write_once as an alternative to the primitives that Paolo was talking about. Those would handle ordering, while read_once/write_once were only intended to ensure that the compiler does not in any way mangle the stores and loads. In general I would say that fences are -- except for one case -- strictly greater in power than read_once/write_once. That one case is alignment checks (which we do for instance in MmioRead32() and friends.) Given that difference in power, I agree with the approach Paolo recommended elsewhere that for CPU memory access synchronization we start with Fences and then switch to lesser powered variants as and when needed: "However, these changes do not have to be done in a single step. Starting with the rationalization of memory fences makes sense because, in addition to enabling the annotation of code rather than data, it is essentially a bugfix."

I don't know why Linux has fences and read_once / write_once separately. I know they're all documented, I think I tried to read those docs in the past (unsuccessfully). I just feel that *all that* would be too much for me to use in edk2.

Yeah, I'm not sure we need to go wild with these in EDK2 either. That's just a recipe for bugs.

Thanks
Ankur

I have to defer this to Paolo and Ard.

Thanks
Laszlo
The use of read_once()/write_once() also document the implicit assumptions being made in the code at the point of use.
Thanks Ankur
- and *all* of:
  - acquire fence (load-load, load-store)
  - release fence (load-store, store-store)
  - full fence (load-load, load-store, store-store, store-load)
The implementation of each fence would have to be *at least* as safe as required; it could be stronger.
I feel that we have to reach an agreement on the "exactly one of" part; subsequent to that, maybe I can try an RFC patch for <BaseLib.h> (just the interface contracts, at first).
Thanks Laszlo
Paolo Bonzini <pbonzini@...>
On Wed, Feb 10, 2021 at 07:37, Ankur Arora <ankur.a.arora@...> wrote:

So I don't quite see what would make "memory_order_seq_cst" harder?
From the spec (https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering):
"Atomic operations tagged memory_order_seq_cst not only order memory the same way as release/acquire ordering (everything that happened-before a store in one thread becomes a visible side effect in the thread that did a load), but also establish a single total modification order of all atomic operations that are so tagged."
Is the total modification order the problem?

Thanks
Ankur

The problem is that the ordering does not extend to relaxed (or even acquire/release) loads and stores. Therefore *every* store that needs to be ordered before a seq_cst load must also be seq_cst. This is hard enough to guarantee that using a fence is preferable.

Paolo
On 2021-02-09 5:00 a.m., Paolo Bonzini wrote: On 08/02/21 21:44, Ankur Arora wrote:
Since the first two are of unequal power -- volatile for specific accesses and the compiler fence being a general barrier across which the compiler does not optimize memory accesses I think we should have both instead of volatile or compiler fence.
However, instead of volatile, how about read_once(), write_once() primitives? These, respectively ensure that the compiler emits a single load or a store and does not combine or mutilate the load/store in any way. Well, that could certainly be a next step, but I intentionally left it out of my proposal to Laszlo. The reason for that is that the fences are already enough to wean the edk2 codebase from volatile declarations and pointers, and that is the big paradigm change---annotating code rather than data. Once you've done that switch I would certainly agree with (basically) an edk2 flavor of C11-like atomic primitives: - at the very least, AtomicLoadRelaxed and AtomicStoreRelaxed corresponding to Linux's READ_ONCE/WRITE_ONCE - and if it is possible to implement them easily on MSVC, AtomicLoadAcquire and AtomicStoreRelease However, these changes do not have to be done in a single step. Starting with the rationalization of memory fences makes sense because, in addition to enabling the annotation of code rather than data, it is essentially a bugfix. I see. Yeah, this ordering does seem like easier to work with. As an aside, I always stay away from the "memory_order_seq_cst" loads and stores. Those are much more tricky than what their name suggests, and their use cases are much better served by something like AtomicStoreRelaxed(&a, 1); // atomic_store(mo_relaxed) MemoryFence(); // atomic_thread_fence(mo_seq_cst) y = AtomicLoadRelaxed(&b); // atomic_load(mo_relaxed) So I don't quite see what would make "memory_order_seq_cst" harder? From the spec ( https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering): "Atomic operations tagged memory_order_seq_cst not only order memory the same way as release/acquire ordering (everything that happened-before a store in one thread becomes a visible side effect in the thread that did a load), but also establish a single total modification order of all atomic operations that are so tagged." 
Is the total modification order the problem?

Thanks,
Ankur

Paolo
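The relaxed-accesses-plus-fence pattern quoted above can be sketched in portable C11. The AtomicStoreRelaxed/AtomicLoadRelaxed names are only the proposed edk2 spellings; this sketch uses the standard <stdatomic.h> calls they would wrap, and the variables a/b mirror the pseudo-code in the mail.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t a;  /* the "a" from the example above */
static _Atomic uint32_t b;  /* the "b" from the example above */

uint32_t
StoreFenceLoad (void)
{
  atomic_store_explicit (&a, 1, memory_order_relaxed);    /* AtomicStoreRelaxed(&a, 1) */
  atomic_thread_fence (memory_order_seq_cst);             /* MemoryFence()             */
  return atomic_load_explicit (&b, memory_order_relaxed); /* y = AtomicLoadRelaxed(&b) */
}
```

The point of the pattern is that the ordering obligation lives in the one explicit fence, not in every individual load and store.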
On Feb 9, 2021, at 5:20 AM, Laszlo Ersek <lersek@...> wrote:
On 02/08/21 21:44, Ankur Arora wrote:
On 2021-02-08 9:40 a.m., Laszlo Ersek wrote: So the structure of the solution we're looking for is:
- exactly *one* of: - volatile - compiler fence - acquire fence used as a heavy substitute for compiler fence, Since the first two are of unequal power -- volatile for specific accesses and the compiler fence being a general barrier across which the compiler does not optimize memory accesses I think we should have both instead of volatile or compiler fence. I disagree, as long as we talk about CPU synchronization through variables that live in RAM. "volatile" has a use case, yes, but not for this purpose. The Linux document Paolo referenced earlier explains this well. The compiler fence is a superset of volatile (for this use case), feature-wise, *and* it has better performance.
However, instead of volatile, how about read_once(), write_once() primitives? These, respectively ensure that the compiler emits a single load or a store and does not combine or mutilate the load/store in any way.
Internally, these are implemented via volatile (at least the Linux kernel implementation that I looked at is) but they would explicitly lay out the guarantees (and non-guarantees) that the C/C++ standard makes:
"However, the guarantees of the standard are not sufficient for using volatile for multi-threading. The C++ Standard does not stop the compiler from reordering non-volatile reads and writes relative to volatile reads and writes, and it says nothing about preventing CPU reordering."
(https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming#volatile-variables-and-reordering)
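For illustration, a read_once()/write_once() pair along the lines described above (modeled on Linux's READ_ONCE()/WRITE_ONCE(); the names and the fixed 32-bit width are assumptions, and these primitives give no CPU-ordering guarantee by themselves) might look like:

```c
#include <stdint.h>

static inline uint32_t
ReadOnce32 (const uint32_t *Ptr)
{
  /* volatile access: the compiler emits exactly one load and cannot
     tear, fuse, or re-materialize it */
  return *(const volatile uint32_t *)Ptr;
}

static inline void
WriteOnce32 (uint32_t *Ptr, uint32_t Val)
{
  /* volatile access: exactly one store */
  *(volatile uint32_t *)Ptr = Val;
}

uint32_t
ReadWriteOnceDemo (void)
{
  uint32_t x = 0;
  WriteOnce32 (&x, 7);
  return ReadOnce32 (&x);
}
```

Note that, exactly as the Microsoft quote warns, nothing here stops the CPU from reordering these accesses against other memory operations; that remains the job of the fences.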
I see at least two cases (essentially where these are useful on their own without needing any other fences for ordering (compiler or cpu)):
* Device side communication:
MdeModulePkg/Bus/Ata/AhciPei/AhciMode.c::AhciWaitMemSet()

  271   // The system memory pointed by Address will be updated by the
  272   // SATA Host Controller, "volatile" is introduced to prevent
  273   // compiler from optimizing the access to the memory address
  274   // to only read once.
  275   //
  276   Value = *(volatile UINT32 *) (UINTN) Address;
  277   Value &= MaskValue;
So use read_once() instead of the volatile cast:
  276   Value = read_once (Address);

This seems to overlap with MmioRead32() from <IoLib.h>, at least in purpose. I think the code should have used MmioRead32() in the first place.
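As a hedged sketch, that polling read could be expressed through a single-read accessor in the spirit of MmioRead32() (Mmio32Sketch and PollOnce below are illustrative names, not the real IoLib API):

```c
#include <stdint.h>

static inline uint32_t
Mmio32Sketch (uintptr_t Address)
{
  /* one volatile 32-bit read per call, like MmioRead32() */
  return *(volatile uint32_t *)Address;
}

/* Returns nonzero once (*Address & MaskValue) == TestValue;
   a caller would invoke this in its wait loop. */
int
PollOnce (uintptr_t Address, uint32_t MaskValue, uint32_t TestValue)
{
  uint32_t Value;

  Value  = Mmio32Sketch (Address);   /* replaces the open-coded volatile cast */
  Value &= MaskValue;
  return Value == TestValue;
}

int
PollOnceDemo (void)
{
  static uint32_t Reg = 0x0000ABCD;  /* stands in for a device register */
  return PollOnce ((uintptr_t)&Reg, 0xFF, 0xCD);
}
```

Routing the access through one accessor keeps the "exactly one read per iteration" property in a single, auditable place.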
* For handling cases where we don't need a fence, but only need to ensure that the compiler does not mutilate a store in any way. As an example, the code below clears the Handler and the block below it calls the Handler if the pointer is non-NULL.
(From: https://patchew.org/EDK2/20210129005950.467638-1-ankur.a.arora@oracle.com/20210129005950.467638-7-ankur.a.arora@oracle.com/,
https://patchew.org/EDK2/20210129005950.467638-1-ankur.a.arora@oracle.com/20210129005950.467638-9-ankur.a.arora@oracle.com/)
CpuEject():

   99 +  //
  100 +  // We are done until the next hot-unplug; clear the handler.
  101 +  //
  102 +  mCpuHotEjectData->Handler = NULL;
SmmCpuFeaturesRendezvousExit():

  140 +  if (mCpuHotEjectData == NULL ||
  141 +      mCpuHotEjectData->Handler == NULL) {
  142 +    return;
  143 +  }
  144 +
  145 +  mCpuHotEjectData->Handler (CpuIndex);
If the code on line 102 were instead,
102 + write_once (&mCpuHotEjectData->Handler, NULL);
and that on lines 141-145 were:

  140 +  if (mCpuHotEjectData == NULL ||
  141 +      ((handler = read_once (mCpuHotEjectData->Handler)) == NULL)) {
  142 +    return;
  143 +  }
  144 +
  145 +  handler (CpuIndex);

Hmm, wait, let me gather my thoughts on this...
We have this loop in SmiRendezvous() [UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c]:
  //
  // Wait for BSP's signal to exit SMI
  //
  while (*mSmmMpSyncData->AllCpusInSync) {
    CpuPause ();
  }
(there are two instances of it in SmiRendezvous(), both executed before reaching the Exit label).
This is the loop that separates CpuEject() (on the BSP) from SmmCpuFeaturesRendezvousExit() (on the AP).
The object accessed is a "volatile BOOLEAN", so conceptually this is a loop body with a compiler fence.
The goal is to highlight the data flow (i.e., release -> acquire) in the code, so we should put a release fence in CpuEject() and an acquire fence in SmmCpuFeaturesRendezvousExit().
* Until we can do that -- until we have the better-named APIs -- we should use MemoryFence() for both fences. However, because MemoryFence() doesn't do anything on VS, we should also keep the volatile, which (on x86) happens to force the same practical effect that we expect of the release and acquire fences too.
* However IIUC you're saying that we shouldn't even aim at placing the release and acquire fences in CpuEject / SmmCpuFeaturesRendezvousExit respectively, but use write_once() / read_once().
My comment on that is: I don't know.
I don't know how read_once() / write_once() in Linux compares to the acquire / release fences.
I don't know if read_once / write_once are just as expressive as the acquire / release fences (in Paolo's opinion, for example). I also don't know if their implementations (their effects on the CPUs / visibility) would be interchangeable on all edk2 architectures.
So I think this is something that Paolo should please comment on. *Personally*, I would be satisfied with the following "minimum version" too:
- update the MemoryFence() specification to say it is a full compiler and CPU barrier (covering store-load as well)
- annotate MemoryFence() call sites with comments, to highlight the data flow direction
- make MemoryFence() emit an MFENCE on x86, and waste a minuscule amount of CPU time in response -- that's an acceptable price for covering store-load (Paolo mentioned ~50 clock cycles as penalty)
Well, for the IoLib that is + ~100 clock cycles for every MMIO read/write, so that makes me a little nervous.

Thanks,
Andrew Fish

- fix MemoryFence()'s implementation for VS (currently it does nothing at all)
- use MemoryFence(), and do not use volatile, in the v7 unplug series
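Under the "minimum version" above, where MemoryFence() becomes a full compiler-and-CPU barrier, a portable C11 sketch is a single seq_cst thread fence, which on x86 compiles to MFENCE or an equivalent LOCK'ed instruction. A real edk2 fix would still need per-toolchain variants (e.g. _mm_mfence() plus a compiler barrier on VS); this is only an illustration of the intended semantics.

```c
#include <stdatomic.h>

void
MemoryFenceSketch (void)
{
  /* full barrier: orders load-load, load-store, store-store and
     store-load, and is also a compiler barrier */
  atomic_thread_fence (memory_order_seq_cst);
}

int
FenceDemo (void)
{
  static int Flag;

  Flag = 1;
  MemoryFenceSketch ();   /* the store above cannot sink below the fence */
  return Flag;
}
```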
*However*, Paolo strongly recommends that we introduce and use the directional barriers, for highlighting data flow, and for matching known lockless patterns. Thus, I understand your read_once / write_once suggestion to be an *alternative* to Paolo's. And I'm not equipped to compare both recommendations, alas.
I don't know why Linux has fences and read_once / write_once separately. I know they're all documented, I think I tried to read those docs in the past (unsuccessfully). I just feel that *all that* would be too much for me to use edk2.
I have to defer this to Paolo and Ard.
Thanks Laszlo
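The release -> acquire hand-off described above for CpuEject() / SmmCpuFeaturesRendezvousExit() can be sketched in C11. This is single-threaded purely for illustration, and the names BspPublish/ApConsume/mAllCpusInSync are stand-ins, not the real PiSmmCpuDxeSmm code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static uint32_t      mPayload;               /* data the BSP publishes     */
static _Atomic bool  mAllCpusInSync = true;  /* the synchronization flag   */

void
BspPublish (uint32_t Value)
{
  mPayload = Value;                          /* plain store of the data    */
  atomic_store_explicit (&mAllCpusInSync, false,
                         memory_order_release);  /* release: publish it    */
}

uint32_t
ApConsume (void)
{
  /* acquire: once the flag reads false, mPayload is guaranteed visible */
  while (atomic_load_explicit (&mAllCpusInSync, memory_order_acquire)) {
    /* CpuPause () would go here on real hardware */
  }
  return mPayload;
}

uint32_t
ReleaseAcquireDemo (void)
{
  BspPublish (42);
  return ApConsume ();
}
```

The directional names make the data flow visible at the call sites, which is exactly the "annotate code, not data" point.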
The use of read_once()/write_once() also documents the implicit assumptions being made in the code at the point of use.
Thanks Ankur
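A minimal sketch of the read_once() handler hand-off proposed above (the types and names are illustrative, not the actual patch):

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*HOT_EJECT_HANDLER) (uintptr_t CpuIndex);

static HOT_EJECT_HANDLER mHandler;   /* stands in for mCpuHotEjectData->Handler */
static uintptr_t         mCalls;

static void
CountingHandler (uintptr_t CpuIndex)
{
  mCalls += CpuIndex;
}

void
RendezvousExitSketch (uintptr_t CpuIndex)
{
  HOT_EJECT_HANDLER Handler;

  /* read_once: one volatile load into a local, so the compiler cannot
     re-read the pointer between the NULL check and the call */
  Handler = *(HOT_EJECT_HANDLER volatile *)&mHandler;
  if (Handler == NULL) {
    return;
  }
  Handler (CpuIndex);
}

uintptr_t
HandlerDemo (void)
{
  RendezvousExitSketch (3);     /* handler still NULL: no call           */
  mHandler = CountingHandler;   /* install the handler                   */
  RendezvousExitSketch (3);     /* called once with CpuIndex 3           */
  mHandler = NULL;              /* "CpuEject" clears it (write_once in   */
                                /* the real proposal)                    */
  RendezvousExitSketch (3);     /* cleared: no call                      */
  return mCalls;
}
```

The local copy is the key: testing and calling the same snapshot rules out the check/use race on the shared pointer.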
- and *all* of:
  - acquire fence (load-load, load-store)
  - release fence (load-store, store-store)
  - full fence (load-load, load-store, store-store, store-load)
The implementation of each fence would have to be *at least* as safe as required; it could be stronger.
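As a hedged mapping, the three fences listed above correspond directly to C11 atomic_thread_fence() orderings (the *MemoryFenceSketch names are placeholders, not a proposed edk2 API):

```c
#include <stdatomic.h>

void
AcquireMemoryFenceSketch (void)
{
  atomic_thread_fence (memory_order_acquire);  /* load-load, load-store */
}

void
ReleaseMemoryFenceSketch (void)
{
  atomic_thread_fence (memory_order_release);  /* load-store, store-store */
}

void
FullMemoryFenceSketch (void)
{
  atomic_thread_fence (memory_order_seq_cst);  /* all four, incl. store-load */
}

int
FenceTaxonomyDemo (void)
{
  static int X;

  X = 1;
  ReleaseMemoryFenceSketch ();  /* X's store cannot sink below this */
  AcquireMemoryFenceSketch ();
  FullMemoryFenceSketch ();
  return X;
}
```

Each C11 fence is allowed to be stronger than required, which matches the "at least as safe as required" rule above.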
I feel that we have to reach an agreement on the "exactly one of" part; subsequent to that, maybe I can try an RFC patch for <BaseLib.h> (just the interface contracts, at first).
Thanks Laszlo
Ard Biesheuvel <ardb@...>
On Tue, 9 Feb 2021 at 15:27, Laszlo Ersek <lersek@...> wrote:

On 02/09/21 14:00, Paolo Bonzini wrote:
annotating code rather than data (
Slightly off-topic, but I wonder why this approach (code annotations) has proved more effective / successful, with atomics, than annotating data. Usually it's recommended to put emphasis on data structure design (with invariants etc), and then the code "writes itself". (I feel like there's a famous quote to insert here, but I can't remember it.)
So I'm not questioning that annotating code is superior in this case, I just wonder why it diverges from the usual guideline.
)
I'd say it is because we are reasoning about how accesses to the data are ordered, not about the data itself.