[edk2-devel] A problem with live migration of UEFI virtual machines
Laszlo Ersek
On 02/24/20 16:28, Daniel P. Berrangé wrote:
> On Tue, Feb 11, 2020 at 05:39:59PM +0000, Alex Bennée wrote:

Following up here *too*, just for completeness.

> I don't believe we are that strict for firmware in general. The [...]

The query in this thread has been posted three times now (and I have zero idea why). Each time it generated a different set of responses. For completeness, I'm now going to link the other two threads here (because the present thread seems to have gotten the most feedback).

To the OP:

- please do *NOT* repost the same question once you get an answer. It only fragments the discussion and creates confusion. It also doesn't hurt if you *confirm* that you understood the answer.

- Yet further, if your email address has @gmail.com for domain, but your msgids contain "tencent", that raises some eyebrows (mine for sure). You say "we" in the query, but never identify the organization behind the plural pronoun.

(I've been fuming about the triple-posting of the question for a while now, but it's only now that, upon seeing how much work Dan has put into his answer, I've decided that dishing out a bit of netiquette would be in order.)

* First posting:
  - msgid: <tencent_F1295F826E46EDFF3D77812B@...>
  - edk2-devel: https://edk2.groups.io/g/devel/message/54146
  - qemu-devel: https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg02419.html

  * my response:
    - msgid: <12553.1581366059422195003@groups.io>
    - edk2-devel: https://edk2.groups.io/g/devel/message/54161
    - qemu-devel: none, because (as an exception) I used the stupid groups.io web interface to respond, and so my response never reached qemu-devel

* Second posting (~4 hours after the first)
  - msgid: <tencent_3CD8845EC159F0161725898B@...>
  - edk2-devel: https://edk2.groups.io/g/devel/message/54147
  - qemu-devel: https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg02415.html

  * Dave's response:
    - msgid: <20200220154742.GC2882@work-vm>
    - edk2-devel: https://edk2.groups.io/g/devel/message/54681
    - qemu-devel: https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg05632.html

* Third posting (next day, present thread) -- cross posted to yet another list (!), because apparently Dave's feedback and mine had not been enough:
  - msgid: <tencent_BC7FD00363690990994E90F8@...>
  - edk2-devel: https://edk2.groups.io/g/devel/message/54220
  - edk2-discuss: https://edk2.groups.io/g/discuss/message/135
  - qemu-devel: https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg02735.html

Back on topic: see my response again. The answer is, you can't solve the problem (specifically with OVMF), and QEMU in fact does you a service by preventing the migration.

Laszlo

Laszlo Ersek
Hi Andrew,
On 02/25/20 19:56, Andrew Fish wrote:
> Laszlo, [...]

Yes.

> The legacy BIOS used fixed magic address ranges, but UEFI uses dynamically allocated memory so addresses are not fixed. While the UEFI firmware does try to keep S3 and S4 layouts consistent between boots, I'm not aware of any mechanism to keep the memory map address the same between versions of the firmware?

It's not about RAM, but platform MMIO.

The core of the issue here is that the -D FD_SIZE_4MB and -D FD_SIZE_2MB build options (or more directly, the different FD_SIZE_IN_KB macro settings) set a bunch of flash-related build-time constant macros, and PCDs, differently, in the following files:

- OvmfPkg/OvmfPkg.fdf.inc
- OvmfPkg/VarStore.fdf.inc
- OvmfPkg/OvmfPkg*.dsc

As a result, the OVMF_CODE.fd firmware binary will have different hard-coded references to the variable store pflash addresses. (Guest-physical MMIO addresses that point into the pflash range.)

If someone tries to combine an OVMF_CODE.fd firmware binary from e.g. the 4MB build, with a variable store file that was originally instantiated from an OVMF_VARS.fd varstore template from the 2MB build, then the firmware binary's physical address references and various size references will not match the contents / layout of the varstore pflash chip, which maps an incompatibly structured varstore file.

For example, "OvmfPkg/VarStore.fdf.inc" describes two incompatible EFI_FIRMWARE_VOLUME_HEADER structures (which "build" generates for the OVMF_VARS.fd template) between the 4MB (total size) build, and the 1MB/2MB (total size) build.

The commit message below summarizes the internal layout differences, from 1MB/2MB -> 4MB:

https://github.com/tianocore/edk2/commit/b24fca05751f

Excerpt (relevant for OVMF_VARS.fd):

    Description                Compression type   Size [KB]
    -------------------------  -----------------  ----------------------
    Non-volatile data storage  open-coded binary  128 ->  528 ( +400)
                               data
      Variable store                               56 ->  256 ( +200)
      Event log                                     4 ->    4 (   +0)
      Working block                                 4 ->    4 (   +0)
      Spare area                                   64 ->  264 ( +200)

Thanks
Laszlo

> On Feb 25, 2020, at 9:53 AM, Laszlo Ersek <lersek@...> wrote:

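To make the excerpt in the message above more concrete, here is a minimal sketch (Python, not part of edk2) that lays the four sub-regions out back to back and reports where each one would start in the two varstore layouts. The sizes are copied from the commit excerpt; the back-to-back ordering and every name in the code (`LAYOUT_2MB`, `LAYOUT_4MB`, `region_offsets`, `total_kb`) are assumptions made purely for illustration.

```python
# Sizes in KB, copied from the commit excerpt quoted in the message above.
LAYOUT_2MB = [("variable store", 56), ("event log", 4),
              ("working block", 4), ("spare area", 64)]
LAYOUT_4MB = [("variable store", 256), ("event log", 4),
              ("working block", 4), ("spare area", 264)]


def region_offsets(layout):
    """Return {name: (offset_kb, size_kb)}.

    Assumption: the regions are packed back to back, in the listed order,
    starting at offset 0 of the varstore file.
    """
    offsets = {}
    cursor = 0
    for name, size_kb in layout:
        offsets[name] = (cursor, size_kb)
        cursor += size_kb
    return offsets


def total_kb(layout):
    """Total size of the non-volatile data storage, in KB."""
    return sum(size_kb for _, size_kb in layout)


if __name__ == "__main__":
    old = region_offsets(LAYOUT_2MB)
    new = region_offsets(LAYOUT_4MB)
    for name in old:
        print(f"{name:15} 1MB/2MB build: {old[name]}  4MB build: {new[name]}")
    print("totals [KB]:", total_kb(LAYOUT_2MB), "->", total_kb(LAYOUT_4MB))
```

Under these assumptions, every region after the variable store starts at a different offset in the two layouts, which is one way to picture why a firmware binary built for one layout cannot find its data in a varstore file created for the other.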
Laszlo Ersek
Hi Andrew,
On 02/25/20 22:35, Andrew Fish wrote:
> Laszlo, [...]

With live migration, the running guest doesn't notice anything. This is a general requirement for live migration (regardless of UEFI or flash).

You are very correct to ask about "skipping" the NVRAM region. With the approach that OvmfPkg originally supported, live migration would simply be unfeasible. The "build" utility would produce a single (unified) OVMF.fd file, which would contain both NVRAM and executable regions, and the guest's variable updates would modify the one file that would exist. This is inappropriate even without considering live migration, because OVMF binary upgrades (package updates) on the virtualization host would force guests to lose their private variable stores (NVRAMs).

Therefore, the "build" utility produces "split" files too, in addition to the unified OVMF.fd file. Namely, OVMF_CODE.fd and OVMF_VARS.fd. OVMF.fd is simply the concatenation of the latter two.

    $ cat OVMF_VARS.fd OVMF_CODE.fd | cmp - OVMF.fd
    [prints nothing]

When you define a new domain (VM) on a virtualization host, the domain definition saves a reference (pathname) to the OVMF_CODE.fd file. However, the OVMF_VARS.fd file (the variable store *template*) is not directly referenced; instead, it is *copied* into a separate (private) file for the domain.

Furthermore, once booted, the guest has two pflash chips, one that maps the firmware executable OVMF_CODE.fd read-only, and another pflash chip that maps its private varstore file read-write.

This makes it possible to upgrade OVMF_CODE.fd and OVMF_VARS.fd (via package upgrades on the virt host) without messing with varstores that were earlier instantiated from OVMF_VARS.fd. What's important here is that the various constants in the new (upgraded) OVMF_CODE.fd file remain compatible with the *old* OVMF_VARS.fd structure, across package upgrades.

If that's not possible for introducing e.g. a new feature, then the package upgrade must not overwrite the OVMF_CODE.fd file in place, but must provide an additional firmware binary. This firmware binary can then only be used by freshly defined domains (old domains cannot be switched over). Old domains can be switched over manually -- and only if the sysadmin decides it is OK to lose the current variable store contents. Then the old varstore file for the domain is deleted (manually), the domain definition is updated, and then a new (logically empty, pristine) varstore can be created from the *new* OVMF_2_VARS.fd that matches the *new* OVMF_2_CODE.fd.

During live migration, the "RAM-like" contents of both pflash chips are migrated (the guest-side view of both chips remains the same, including the case when the writeable chip happens to be in "programming mode", i.e., during a UEFI variable write through the Fault Tolerant Write and Firmware Volume Block(2) protocols).

Once live migration completes, QEMU dumps the full contents of the writeable chip to the backing file (on the destination host). Going forward, flash writes from within the guest are reflected to said host-side file on-line, just like it happened on the source host before live migration. If the file backing the r/w pflash chip is on NFS (shared by both src and dst hosts), then this one-time dumping when the migration completes is superfluous, but it's also harmless.

The interesting question is, what happens when you power down the VM on the destination host (= post migration), and launch it again there, from zero. In that case, the firmware executable file comes from the *destination host* (it was never persistently migrated from the source host, i.e. never written out on the dst). It simply comes from the OVMF package that had been installed on the destination host, by the sysadmin. However, the varstore pflash does reflect the permanent result of the previous migration. So this is where things can fall apart, if both firmware binaries (on the src host and on the dst host) don't agree about the internal structure of the varstore pflash.

Thanks
Laszlo

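The `cat ... | cmp` check in the message above can also be written as a short script. The sketch below is only an illustration in Python: `is_concatenation` and `same_size` are hypothetical helper names, and the size comparison is merely a crude first sanity check (two varstores of equal size could still be structured differently), not a test that edk2, QEMU or libvirt is known to perform.

```python
import tempfile
from pathlib import Path


def is_concatenation(vars_path, code_path, unified_path):
    """True if the unified image equals VARS followed by CODE, byte for byte.

    Mirrors: cat OVMF_VARS.fd OVMF_CODE.fd | cmp - OVMF.fd
    """
    combined = Path(vars_path).read_bytes() + Path(code_path).read_bytes()
    return combined == Path(unified_path).read_bytes()


def same_size(varstore_path, template_path):
    """Crude heuristic (an assumption of this sketch): a private varstore
    whose size differs from the template's cannot match that template's
    layout. Equal sizes do NOT prove compatibility."""
    return Path(varstore_path).stat().st_size == Path(template_path).stat().st_size


if __name__ == "__main__":
    # Self-contained demo with tiny stand-in files instead of real images.
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        (tmp / "VARS.fd").write_bytes(b"\x00" * 8)
        (tmp / "CODE.fd").write_bytes(b"\xff" * 16)
        (tmp / "UNIFIED.fd").write_bytes(b"\x00" * 8 + b"\xff" * 16)
        print(is_concatenation(tmp / "VARS.fd", tmp / "CODE.fd",
                               tmp / "UNIFIED.fd"))
```

Note the argument order: the varstore comes first because it sits at the lower address in the unified image, exactly as in the `cat` command line.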
Laszlo Ersek
On 02/28/20 04:20, Zhoujian (jay) wrote:
> Hi Laszlo, [...]

Yes, exactly.

> -----Original Message-----
> Hi Laszlo, [...]

I'm unaware of any VMs running in clouds that use "-bios" with OVMF. It certainly seems a terrible idea, regardless of live migration.

You're mixing up small details. OVMF_CODE.fd is already heavily padded, internally. We've grown the *internal* DXEFV firmware volume repeatedly over *years*, without *any* disruption to users. Please see:

- da78c88f4535 ("OvmfPkg: raise DXEFV size to 8 MB", 2014-03-05)
- 08df58ec3043 ("OvmfPkg: raise DXEFV size to 9 MB", 2015-10-07)
- 2f7b34b20842 ("OvmfPkg: raise DXEFV size to 10 MB", 2016-05-31)
- d272449d9e1e ("OvmfPkg: raise DXEFV size to 11 MB", 2018-05-29)

To this day, i.e., with edk2 master @ edfe16a6d9f8, you can build OVMF in the default feature configuration [*] for -D FD_SIZE_2MB.

[*]
    DEFINE SECURE_BOOT_ENABLE = FALSE
    DEFINE SMM_REQUIRE = FALSE
    DEFINE SOURCE_DEBUG_ENABLE = FALSE
    DEFINE TPM2_ENABLE = FALSE
    DEFINE TPM2_CONFIG_ENABLE = FALSE
    DEFINE NETWORK_TLS_ENABLE = FALSE
    DEFINE NETWORK_IP6_ENABLE = FALSE
    DEFINE NETWORK_HTTP_BOOT_ENABLE = FALSE

For example:

    $ build \
        -a IA32 -a X64 \
        -b DEBUG \
        -p OvmfPkg/OvmfPkgIa32X64.dsc \
        -t GCC48 \
        -D FD_SIZE_2MB

Note that this build will contain DEBUG messages (at least DEBUG_INFO level ones) and ASSERT()s too.

The final usage report at the end of the command is:

    SECFV [14%Full] 212992 total, 31648 used, 181344 free
    PEIFV [31%Full] 917504 total, 284584 used, 632920 free
    DXEFV [44%Full] 11534336 total, 5113688 used, 6420648 free
    FVMAIN_COMPACT [73%Full] 1753088 total, 1284216 used, 468872 free

What does that mean? It means that the largest firmware volume, DXEFV, uses just 44% of the 11MB allotted size. And FVMAIN_COMPACT, which embeds (among other things) DXEFV in LZMA-compressed format, only uses 73% of its allotted size, which is 1712 KB.

All this means that in the default feature config, there's still a bunch of room free in the 2MB build, even with DEBUGs and ASSERT()s enabled, and with an old compiler that does not do link-time optimization.

I think you must have misunderstood the purpose of the 4MB build. The 4MB build was solely introduced for enlarging the *varstore*. That was motivated by passing an SVVP check. This is described in detail in the relevant commit, which I may have linked earlier.

https://github.com/tianocore/edk2/commit/b24fca05751f

(Please consult the diagram in the commit message carefully. It shows you how the various firmware volumes / flash devices are nested; it will help you understand where the 1712 KB FVMAIN_COMPACT firmware volume is placed in the final image, and how FVMAIN_COMPACT embeds / compresses DXEFV.)

And *given that* we had to introduce an incompatible change (for enlarging the varstore, for SVVP's sake), it made sense to *also* enlarge the other parts of the flash content. But the motivation was strictly the varstore change, and that was inevitably an incompatible change.

In fact, you can see in the commit message that while the *outer* container FVMAIN_COMPACT was enlarged from 1712 to 3360 kilobytes, the embedded PEIFV and DXEFV firmware volumes didn't put that extra space to use. The SECFV firmware volume runs directly from flash, so it's not compressed, but even that firmware volume got no "space injection". So basically all the size increase that *could* have been exploited for executable code size was spent on padding.

As far as I can tell, we have never broken compatibility due to executable code size increases.

Sorry if I over-explained this; I simply don't know how to express this any better.

> Things are a little different here, [...]

No, this doesn't make any sense.

On both the source host and the destination host, the same pathname (for example, "/usr/share/OVMF/OVMF_CODE.fd") must point to same-size (compatible) firmware binaries. Both must be built with the same -D FD_SIZE_2MB flag, or with the same -D FD_SIZE_4MB flag. Then you can migrate.

You can offer a 4MB build too on the destination host, but it must be under a different pathname.

So that after the domain has been migrated in from the source host, and then re-launched against the firmware binary that's on the destination host, there is an incompatibility between the domain's *original* varstore, and the domain's *new* firmware binary.

Sorry, my brain just cannot cope with the idea of even *running* OVMF in production with "-bios" -- let alone migrate it.

But anyway... if you are dead set on this, you can try the following:

- On the destination host, rename the 4MB build to a different filename.

- On the destination host, update all your domain definitions to refer to the renamed filename with "-bios".

- On the destination host, rebuild your current (more modern) firmware package, using the -D FD_SIZE_2MB flag. If you have not enabled a bunch of features meanwhile, it will actually succeed.

- On the destination host, put this fresh build (with unified size 2MB) in the original place (using the original pathname).

- Now you can migrate domains from your source host. The pathname they refer to with "-bios" will exist, and it will be a 2MB build. And the contents of that build will be more modern (presumably) than what you are migrating away from.

Please understand this: when you *allowed* OVMF to build with 4MB size, and installed it under the exact same pathname (on the destination host) where you previously used to keep a 2MB binary, *that* is when you broke compatibility.

What's quite unfathomable to me is that the 2MB->4MB change in upstream was *solely* motivated by varstore enlargement (for passing SVVP with *flash*-based variables), but you're still using the ancient and non-conformant \NvVars emulation that comes with "-bios". Please, flash based variables with OVMF and QEMU have been supported since QEMU v1.6.

I've attempted to remove -bios support from OVMF multiple times, I've always been prevented from doing that, and the damage is obvious only now.

Laszlo

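The arithmetic behind the usage report in the message above can be re-checked with a few lines of Python. This is only a sketch: the report text is copied verbatim from the message, while the regular expression and the names `REPORT`, `LINE_RE` and `parse_report` are assumptions of this example and not part of the edk2 build tools.

```python
import re

# Usage report copied from the message above (sizes are in bytes).
REPORT = """\
SECFV [14%Full] 212992 total, 31648 used, 181344 free
PEIFV [31%Full] 917504 total, 284584 used, 632920 free
DXEFV [44%Full] 11534336 total, 5113688 used, 6420648 free
FVMAIN_COMPACT [73%Full] 1753088 total, 1284216 used, 468872 free
"""

LINE_RE = re.compile(
    r"(\w+) \[(\d+)%Full\] (\d+) total, (\d+) used, (\d+) free"
)


def parse_report(text):
    """Return {volume: {"pct", "total", "used", "free"}} with integer values."""
    rows = {}
    for match in LINE_RE.finditer(text):
        name = match.group(1)
        pct, total, used, free = (int(g) for g in match.groups()[1:])
        rows[name] = {"pct": pct, "total": total, "used": used, "free": free}
    return rows


if __name__ == "__main__":
    for name, row in parse_report(REPORT).items():
        # Each line should be internally consistent: total = used + free.
        assert row["total"] - row["used"] == row["free"]
        print(f"{name:15} {row['used'] * 100 // row['total']:3d}% used, "
              f"{row['free'] / 1024:8.1f} KB free of {row['total'] // 1024} KB")
```

Parsed this way, FVMAIN_COMPACT has a total of 1753088 bytes, i.e. exactly 1712 KB, and DXEFV has a total of 11534336 bytes, i.e. exactly 11 MB, matching the figures quoted in the prose.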
Laszlo Ersek
On 02/28/20 05:04, Andrew Fish wrote:
> Maybe I was overcomplicating this. Given your explanation I think the part I'm missing is OVMF is implying FLASH layout, in this split model, based on the size of the OVMF_CODE.fd and OVMF_VARS.fd. Given that if OVMF_CODE.fd gets bigger the variable address changes from a QEMU point of view. So basically it is the QEMU API that is making assumptions about the relative layout of the FD in the split model that makes a migration to larger ROM not work.

No, QEMU does not make any assumptions here. QEMU simply grabs both pflash chips (the order is not random, it can be specified on the command line -- in fact the QEMU user is expected to specify them in the right order), and then QEMU maps them in decreasing address order from 4GB in guest-phys address space.

If we enlarge OVMF_CODE.fd, then the base address of the varstore (PcdOvmfFlashNvStorageVariableBase) will sink. That's not a problem per se, because QEMU doesn't know about PcdOvmfFlashNvStorageVariableBase at all. QEMU will simply map the varstore, automatically, where the enlarged OVMF_CODE.fd will look for it.

> Basically the -pflash API does not support changing the size of the ROM without moving NVRAM given the way it is currently defined.

Let me put it like this: the NVRAM gets moved by virtue of how OVMF is built, and by how QEMU maps the pflash chips into guest-phys address space. They are in sync, automatically.

The problem is when the NVRAM is internally restructured, or resized -- the new OVMF_CODE.fd binary will reflect this with changed PCDs, and look for "stuff" at those addresses. But if you still try to use an old (differently sized, or differently structured) varstore file, while QEMU will happily map it, parts of the NVRAM will just not end up in places where OVMF_CODE.fd expects them.

> Given the above it seems like the 2 options are: [...]

There's already room to grow, *inside* OVMF_CODE.fd. As I've shown elsewhere in this thread, even the 2MB build has approx. 457 KB free in the DXEFV volume, even without link-time optimization and without DEBUG/ASSERT stripping, if you don't enable additional features.

> 2) Add some feature to QEMU that allows the variable store address to not be based on OVMF_CODE.fd size.

Yes, this has been proposed over time. It wouldn't help with the case when you change the internal structure of the NVRAM, and try to run an incompatible OVMF_CODE.fd against that.

> I did see this [1] and combined with your email I either understand, or I'm still confused? :)

I think the most interesting function for you could be pc_system_flash_map(), in "hw/i386/pc_sysfw.c", in the QEMU source.

Thanks
Laszlo

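The "decreasing address order from 4GB" mapping described in the message above can be pictured with a small calculation. The sketch below is Python, not QEMU code; `map_top_down`, `BUILD_2MB` and `BUILD_4MB` are hypothetical names. It assumes that the firmware executable is the first chip and therefore ends exactly at 4 GB, and that the OVMF_CODE.fd size is the unified image size (2 MB or 4 MB) minus the varstore size (128 KB or 528 KB) quoted earlier in the thread.

```python
KB = 1024
FOUR_GB = 1 << 32


def map_top_down(chips):
    """Place chips back to back, growing downward from 4 GB.

    chips: list of (name, size_in_bytes); the first entry ends at 4 GB,
    each following entry sits directly below the previous one.
    Returns {name: base_address}.
    """
    bases = {}
    top = FOUR_GB
    for name, size in chips:
        top -= size
        bases[name] = top
    return bases


# Assumption: code size = unified size - varstore size (see lead-in).
BUILD_2MB = [("code", (2048 - 128) * KB), ("vars", 128 * KB)]
BUILD_4MB = [("code", (4096 - 528) * KB), ("vars", 528 * KB)]

if __name__ == "__main__":
    for label, build in (("2MB", BUILD_2MB), ("4MB", BUILD_4MB)):
        for name, base in map_top_down(build).items():
            print(f"{label} build: {name} base = {base:#x}")
```

Under these assumptions the varstore base simply falls out of the chip sizes, which is the sense in which firmware and mapping stay "in sync, automatically". What such a calculation cannot capture is the internal structure of the varstore, which is where the incompatibility discussed in this thread arises.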
Laszlo Ersek
On 02/28/20 12:47, Laszlo Ersek wrote:
> On 02/28/20 05:04, Andrew Fish wrote:
>> Given the above it seems like the 2 options are: [...]
>
> There's already room to grow, *inside* OVMF_CODE.fd. As I've shown [...]

Typo; I meant FVMAIN_COMPACT, not DXEFV.

Laszlo