Eduardo Habkost <ehabkost@...>
(+Jiri, +libvir-list)
On Fri, Nov 22, 2019 at 04:58:25PM +0000, Dr. David Alan Gilbert wrote:
* Laszlo Ersek (lersek@...) wrote:
(+Dave, +Eduardo)
On 11/22/19 00:00, dann frazier wrote:
On Tue, Nov 19, 2019 at 06:06:15AM +0100, Laszlo Ersek wrote:
On 11/19/19 01:54, dann frazier wrote:
On Fri, Nov 15, 2019 at 11:51:18PM +0100, Laszlo Ersek wrote:
On 11/15/19 19:56, dann frazier wrote:
Hi, I'm trying to pass through an Nvidia GPU to a q35 KVM guest, but UEFI is failing to allocate resources for it. I have no issues if I boot w/ a legacy BIOS, and it works fine if I tell the Linux guest to do the allocation itself - but I'm looking for a way to make this work w/ OVMF by default.
I posted a debug log here: https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1849563/+attachment/5305740/+files/q35-uefidbg.log
Linux guest lspci output is also available for both seabios/OVMF boots here: https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1849563
By default, OVMF exposes a 64-bit MMIO aperture for PCI MMIO BAR allocation that is 32GB in size. The generic PciBusDxe driver collects, orders, and assigns / allocates the MMIO BARs, but it can work only out of the aperture that platform code advertises.
Your GPU's region 1 is itself 32GB in size. Given that there are further PCI devices in the system with further 64-bit MMIO BARs, the default aperture cannot accommodate everything. In such an event, PciBusDxe avoids assigning the largest BARs (to my knowledge), in order to conserve as much of the aperture as possible for other devices -- hence breaking the fewest possible PCI devices.
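(For reference, this is how the oversized BAR can be seen from inside a Linux guest; the PCI address below is only a placeholder for the GPU's actual one:)

# Replace 0000:01:00.0 with the GPU's address from "lspci | grep -i nvidia";
# region 1 should be reported with "[size=32G]" for this card.
lspci -vvs 0000:01:00.0 | grep -i region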
You can control the aperture size from the QEMU command line. You can also do it from the libvirt domain XML, technically speaking. The knob is experimental, so no stability or compatibility guarantees are made. (That's also the reason why it's a bit of a hack in the libvirt domain XML.)
The QEMU cmdline option is described in the following edk2 commit message:
https://github.com/tianocore/edk2/commit/7e5b1b670c38
Hi Laszlo,
Thanks for taking the time to describe this in detail! The -fw_cfg option did avoid the problem for me.
Good to hear, thanks.
I also noticed that the above commit message mentions the existence of a 24GB card as a reasoning behind choosing the 32GB default aperture. From what you say below, I understand that bumping this above 64GB could break hosts w/ <= 37 physical address bits.
Right.
What would be the downside of bumping the default aperture to, say, 48GB?
The placement of the aperture is not trivial (please see the code comments in the linked commit). The base address of the aperture is chosen so that the largest BAR that can fit in the aperture may be naturally aligned. (BARs are whole powers of two.)
The largest BAR that can fit in a 48 GB aperture is 32 GB. Therefore such an aperture would be aligned at 32 GB -- the lowest base address (dependent on guest RAM size) would be 32 GB. Meaning that the aperture would end at 32 + 48 = 80 GB. That still breaches the 36-bit phys address width.
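In shell arithmetic, the same back-of-envelope calculation looks like this (just a sketch of the placement rule; the real logic lives in OVMF's platform code referenced by the linked commit, and it also accounts for where guest RAM ends):

aperture_gb=48
align_gb=32      # largest power-of-two BAR that fits in 48 GB
echo "48 GB aperture ends at $(( align_gb + aperture_gb )) GB"   # 80 GB -> needs 37 bits

aperture_gb=32
align_gb=32      # largest power-of-two BAR that fits in 32 GB
echo "32 GB aperture ends at $(( align_gb + aperture_gb )) GB"   # 64 GB -> exactly 2^36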
32 GB is the largest aperture size that can work with 36-bit phys address width; that's the aperture that ends at 64 GB exactly.
Thanks, yeah - now that I read the code comments that is clear (as clear as it can be w/ my low level of base knowledge). In the commit you mention, Gerd (CC'd) had suggested a heuristic-based approach for sizing the aperture. When you say "PCPU address width" - is that a function of the available physical bits?
"PCPU address width" is not a "function" of the available physical bits -- it *is* the available physical bits. "PCPU" simply stands for "physical CPU".
IOW, would that approach allow OVMF to automatically grow the aperture to the max power of two supported by the host CPU?
Maybe.
The current logic in OVMF works from the guest-physical address space size -- as deduced from multiple factors, such as the 64-bit MMIO aperture size, and others -- towards the guest-CPU (aka VCPU) address width. The VCPU address width is important for a bunch of other purposes in the firmware, so OVMF has to calculate it no matter what.
Again, the current logic is to calculate the highest guest-physical address, and then deduce the VCPU address width from that (and then expose it to the rest of the firmware).
Your suggestion would require passing the PCPU (physical CPU) address width from QEMU/KVM into the guest, and reversing the direction of the calculation. The PCPU address width would determine the VCPU address width directly, and then the 64-bit PCI MMIO aperture would be calculated from that.
However, there are two caveats.
(1) The larger your guest-phys address space (as exposed through the VCPU address width to the rest of the firmware), the more guest RAM you need for page tables. Because, just before entering the DXE phase, the firmware builds 1:1 mapping page tables for the entire guest-phys address space. This is necessary e.g. so you can access any PCI MMIO BAR.
Now consider that you have a huge beefy virtualization host with say 46 phys address bits, and a wimpy guest with say 1.5GB of guest RAM. Do you absolutely want tens of *terabytes* for your 64-bit PCI MMIO aperture? Do you really want to pay for the necessary page tables with that meager guest RAM?
(Such machines do exist BTW, for example:
http://mid.mail-archive.com/9BD73EA91F8E404F851CF3F519B14AA8036C67B5@DGGEMI521-MBX.china.huawei.com )
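To put a rough number on that: assuming the identity map is built with 2 MiB pages (OVMF can use 1 GiB pages where the CPU supports them, which shrinks this considerably), the page directories alone for a 46-bit space come to roughly 256 MiB:

phys_bits=46
space=$(( 1 << phys_bits ))                   # 64 TiB of guest-phys address space
pde_count=$(( space / (2 * 1024 * 1024) ))    # one PDE per 2 MiB page
pd_pages=$(( pde_count / 512 ))               # 512 PDEs per 4 KiB page-directory page
echo "$(( pd_pages * 4096 / 1024 / 1024 )) MiB of page directories"   # ~256 MiB
                                              # (higher-level tables are comparatively tiny)

Compare that against 1.5GB of guest RAM and the problem is obvious.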
In other words, you'd need some kind of knob anyway, because otherwise your aperture could grow too *large*.
(2) Exposing the PCPU address width to the guest may have nasty consequences at the QEMU/KVM level, regardless of guest firmware. For example, that kind of "guest enlightenment" could interfere with migration.
If you boot a guest let's say with 16GB of RAM, and tell it "hey friend, have 40 bits of phys address width!", then you'll have a difficult time migrating that guest to a host with a CPU that only has 36-bits wide physical addresses -- even if the destination host has plenty of RAM otherwise, such as a full 64GB.
There could be other QEMU/KVM / libvirt issues that I'm unaware of (hence the CC to Dave and Eduardo).
Host physical address width gets messy. There are differences as well between upstream QEMU behaviour and some downstreams. I think the story is that:
a) QEMU default: 40 bits on any host
b) -cpu blah,host-phys-bits=true to follow the host.
c) RHEL has host-phys-bits=true by default.
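For reference, on the QEMU command line those translate roughly into the x86 CPU properties below (a sketch; the remaining options are elided, and the exact defaults depend on the QEMU version / downstream):

# (b)/(c): follow the host's physical address width (needs KVM for -cpu host):
qemu-system-x86_64 -accel kvm -cpu host,host-phys-bits=true ...
# or pin an explicit guest-visible width instead, e.g. the traditional 40 bits:
qemu-system-x86_64 -accel kvm -cpu host,host-phys-bits=false,phys-bits=40 ...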
As you say, the only real problem with host-phys-bits is migration - between, say, an E3 and an E5 Xeon with different widths. The magic 40 is generally wrong as well - I think it came from some ancient AMD CPU, but it's the default on QEMU TCG as well.
Yes, and because it affects live migration ability, we have two constraints: 1) it needs to be exposed in the libvirt domain XML; 2) QEMU and libvirt can't choose a value that works for everybody (because neither QEMU nor libvirt knows where the VM might be migrated later). Which is why the BZ below is important:
I don't think there's a way to set it in libvirt; https://bugzilla.redhat.com/show_bug.cgi?id=1578278 is a BZ asking for that.
IMHO host-phys-bits is actually pretty safe; and makes most sense in a lot of cases.
Yeah, it is mostly safe and makes sense, but messy if you try to migrate to a host with a different size.
Dave
Thanks, Laszlo
-dann
For example, to set a 64GB aperture, pass:
-fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536
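One rough way to confirm the larger window from inside a Linux guest is to look at the PCI host bridge windows in /proc/iomem (the exact ranges depend on RAM size and machine type):

# Root is needed to see real addresses; with the enlarged aperture the 64-bit
# PCI window should extend above 64 GiB (0x1000000000).
sudo grep -i 'pci bus' /proc/iomem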
The libvirt domain XML syntax is a bit tricky (and it might "taint" your domain, as it goes outside of the QEMU features that libvirt directly maps to):
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <qemu:commandline>
    <qemu:arg value='-fw_cfg'/>
    <qemu:arg value='opt/ovmf/X-PciMmio64Mb,string=65536'/>
  </qemu:commandline>
</domain>
Some notes:
(1) The "xmlns:qemu" namespace definition attribute in the <domain> root element is important. You have to add it manually when you add <qemu:commandline> and <qemu:arg> too. Without the namespace definition, the latter elements will make no sense, and libvirt will delete them immediately.
(2) The above change will grow your guest's physical address space to more than 64GB. As a consequence, on your *host*, *if* your physical CPU supports nested paging (called "ept" on Intel and "npt" on AMD), *then* the CPU will have to support at least 37 physical address bits too, for the guest to work. Otherwise, the guest will break, hard.
Here's how to verify (on the host):
(2a) run "egrep -w 'npt|ept' /proc/cpuinfo" --> if this does not produce output, then stop reading here; things should work. Your CPU does not support nested paging, so KVM will use shadow paging, which is slower, but at least you don't have to care about the CPU's phys address width.
(2b) otherwise (i.e. when you do have nested paging), run "grep 'bits physical' /proc/cpuinfo" --> if the physical address width is >=37, you're good.
(2c) if you have nested paging but exactly 36 phys address bits, then you'll have to forcibly disable nested paging (assuming you want to run a guest with larger than 64GB guest-phys address space, that is). On Intel, issue:
rmmod kvm_intel
modprobe kvm_intel ept=N
On AMD, go with:
rmmod kvm_amd
modprobe kvm_amd npt=N
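For convenience, the checks in (2a)-(2c) can be bundled into a small host-side script (just a sketch of the steps above, not an official tool):

#!/bin/sh
# Re-implements checks (2a)-(2c): nested paging support and phys address width.
if ! grep -qwE 'npt|ept' /proc/cpuinfo; then
    echo "no nested paging -> shadow paging; host phys width is not a concern"
    exit 0
fi
bits=$(grep -m1 'bits physical' /proc/cpuinfo | sed 's/.*: *//;s/ bits physical.*//')
echo "nested paging present; host physical address width: $bits bits"
if [ "$bits" -ge 37 ]; then
    echo "OK for a guest-phys address space larger than 64GB"
else
    echo "only $bits bits -> disable ept/npt as shown above before running such a guest"
fi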
Hope this helps, Laszlo
-- Dr. David Alan Gilbert / dgilbert@... / Manchester, UK
-- Eduardo