Windows 2019 VM fails to boot from vhost-scsi with UEFI mode
annie li <annie.li@...>
Hello,
I have been trying to boot a Windows 2019 VM from a vhost-scsi device in UEFI mode in a KVM environment, but I keep getting a boot failure. The Win2019 VM goes straight into automatic recovery mode; a Win2016 VM does not have this issue.

Originally I thought the issue was related to the vioscsi driver. I limited the maximum transfer length of I/O in the vioscsi driver, but it didn't help. WinDbg debugging shows that the vioscsi device driver hasn't even had a chance to be loaded yet: the failure happens at a very early stage of loading the Windows kernel.

After analyzing the logs of both vhost-scsi and OVMF, it turns out OVMF is sending out large I/O (>8M) for the Windows 2019 VM. This I/O size exceeds the max SCSI I/O limitation (8M) of vhost-scsi in KVM. Loading the Windows 2019 kernel fails because vhost-scsi cannot handle these large I/Os. See the following log printed by vhost-scsi:

    [3199901.817872] vhost_scsi_calc_sgls: requested sgl_count: 2368 exceeds pre-allocated max_sgls: 2048
    [3199901.839181] vhost_scsi_calc_sgls: requested sgl_count: 2368 exceeds pre-allocated max_sgls: 2048

The following is the DiskIo log of OVMF; it shows the large I/O being sent out:

    DiskIo: Create subtasks for task:
        Offset/BufferSize/Buffer = 00000000F3020000/0093F400/03C00000
      R:Lba/Offset/Length/WorkingBuffer/Buffer = 0000000000798100/00000000/0093F400/00000000/03C00000

Here, the length is 0x0093F400, which is bigger than 8M.

So I am wondering: is this a known issue? Is there any configuration that can limit the size of OVMF's disk I/O? Or is it something related to Windows 2019 itself? Any feedback is greatly appreciated.

Thanks
Annie
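As a quick cross-check of the two log excerpts above: later messages in this thread confirm that vhost-scsi counts scatter-gather entries in 4 KiB pages (via iov_iter_npages()), so the 0x93F400-byte transfer from the DiskIo log needs 2368 entries, exactly the number the kernel rejects against its pre-allocated 2048. A standalone C sketch of that arithmetic (the 4 KiB page size and a page-aligned buffer are assumptions about the host):

    /* Check how many 4 KiB scatter-gather entries the 0x93F400-byte
     * transfer from the OVMF DiskIo log needs, and whether that exceeds
     * vhost-scsi's pre-allocated 2048. Constants come from the logs above;
     * a page-aligned buffer is assumed.
     */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long transfer_bytes = 0x93F400; /* from the OVMF DiskIo log    */
        unsigned long long page_size      = 4096;     /* host page size (assumption) */
        unsigned long long max_sgls       = 2048;     /* pre-allocated in vhost-scsi */

        unsigned long long sgl_count =
            (transfer_bytes + page_size - 1) / page_size;   /* rounds up to 2368 */

        printf("sgl_count = %llu, max_sgls = %llu -> %s\n",
               sgl_count, max_sgls,
               sgl_count > max_sgls ? "request rejected" : "request fits");
        return 0;
    }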
Laszlo Ersek
Hello Annie,
thank you for the comprehensive write-up.

On 05/22/20 00:11, annie li wrote:
> Hello,

OvmfPkg/VirtioScsiDxe does not set the transfer size. This driver implements the EFI_EXT_SCSI_PASS_THRU_PROTOCOL, and the transfer size comes from the caller. The (ultimate) caller in this case, likely through a number of other protocol layers, is the Windows 2019 boot loader (or another UEFI component of Windows 2019). The actual limit is in the host kernel (see more details below).

> This I/O size exceeds the max SCSI I/O limitation(8M) of vhost-scsi in KVM.

This is helpful!

In the host kernel (Linux), commit 3aee26b4ae91 ("vhost/scsi: Add pre-allocation for tv_cmd SGL + upages memory", 2013-09-09) introduced PREALLOC_SGLS, with value 2048. Furthermore, commit b1935f687bb9 ("vhost/scsi: Add preallocation of protection SGLs", 2014-06-02) introduced PREALLOC_PROT_SGLS, with value 512. Later, PREALLOC_PROT_SGLS was bumped to 2048 in commit 864d39df09b4 ("vhost/scsi: increase VHOST_SCSI_PREALLOC_PROT_SGLS to 2048", 2018-08-22). From the commit message, it seems that others have encountered a symptom very similar to yours before.

Therefore I would suggest:

(1) Narrowing down which constant needs bumping (PREALLOC_SGLS or PREALLOC_PROT_SGLS). I can't tell that from the host kernel message, because vhost_scsi_calc_sgls() is called with both constants (as of commit e8de56b5e76a, "vhost/scsi: Add ANY_LAYOUT iov -> sgl mapping prerequisites", 2015-02-04), and the error message printed for both is the same.

(2) Submitting a patch (similar to commit 864d39df09b4) to the following addresses:

    "Michael S. Tsirkin" <mst@...> (maintainer:VIRTIO BLOCK AND SCSI DRIVERS)
    Jason Wang <jasowang@...> (maintainer:VIRTIO BLOCK AND SCSI DRIVERS)
    Paolo Bonzini <pbonzini@...> (reviewer:VIRTIO BLOCK AND SCSI DRIVERS)
    Stefan Hajnoczi <stefanha@...> (reviewer:VIRTIO BLOCK AND SCSI DRIVERS)
    virtualization@... (open list:VIRTIO BLOCK AND SCSI DRIVERS)
    kvm@... (open list:VIRTIO HOST (VHOST))
    netdev@... (open list:VIRTIO HOST (VHOST))
    linux-kernel@... (open list)

The new value should likely be something "nice and round", for example 2048+512 = 2560 or 2048+1024 = 3072. That shouldn't increase memory consumption a lot, but it would still accommodate the 2368 value that Windows 2019 needs. Of course, the maintainers will tell you the value they deem best.

I'm CC'ing my colleagues from the above address list at once.

Thanks
Laszlo
Laszlo Ersek
On 05/26/20 19:25, Paolo Bonzini wrote:
> On 26/05/20 15:18, Laszlo Ersek wrote:
>> OvmfPkg/VirtioScsiDxe does not set the transfer size. This driver
>> implements the EFI_EXT_SCSI_PASS_THRU_PROTOCOL, and the transfer size
>> comes from the caller.
>
> Does EFI_EXT_SCSI_PASS_THRU_PROTOCOL lack a way to specify the maximum
> number of SG entries supported by the HBA?

The "virtio_scsi_config" structure has the following fields, from the virtio-1.0 spec (I'm looking at "CS04" anyway):

    le32 num_queues;
    le32 seg_max;
    le32 max_sectors;
    le32 cmd_per_lun;
    le32 event_info_size;
    le32 sense_size;
    le32 cdb_size;
    le16 max_channel;
    le16 max_target;
    le32 max_lun;

The only fields that appear related to the symptom at hand are "seg_max" and "max_sectors".

(1) "seg_max" has the simpler story, so let's start with that. The spec says:

    seg_max is the maximum number of segments that can be in a command. A
    bidirectional command can include seg_max input segments and seg_max
    output segments.

OvmfPkg/VirtioScsiDxe does not check the "seg_max" field. That's because:

(1.1) VirtioScsiDxe considers the VIRTIO_SCSI_F_INOUT feature bit, and rejects bidirectional requests from the EFI_EXT_SCSI_PASS_THRU_PROTOCOL's caller with EFI_UNSUPPORTED immediately, if the feature bit is clear on the device.

(1.2) VirtioScsiDxe never composes a virtio request (= descriptor chain) with more than 4 descriptors:

(1.2.1) request header -- "virtio_scsi_req_cmd" up to and including the "cdb" field,

(1.2.2) data block to transfer from the driver to the device (if any),

(1.2.3) response header -- "virtio_scsi_req_cmd" starting at the "Device-writable part",

(1.2.4) data block to transfer from the device to the driver (if any).

The queue size is checked to be at least 4 in VirtioScsiInit(). And neither (1.2.2) nor (1.2.4) require "seg_max" to be larger than 1. So I don't think "seg_max" plays any role for the current symptom.

(1.3) Assuming some CDB (= SCSI command) exists that has another layer of indirection, i.e., it transfers a list of pointers in the (1.2.2) or (1.2.4) data blocks, then parsing such a CDB and list of pointers is not the job of EFI_EXT_SCSI_PASS_THRU_PROTOCOL. It says "passthru" in the name. (Now I surely don't know if such a SCSI command exists at all, but if it does, and "seg_max" in the virtio-scsi config header intends to limit that, then an EFI_EXT_SCSI_PASS_THRU_PROTOCOL implementation cannot do anything about it; it can't even expose "seg_max" to higher-level callers.)

(2) Regarding "max_sectors", the spec says:

    max_sectors is a hint to the driver about the maximum transfer size to
    use.

OvmfPkg/VirtioScsiDxe honors and exposes this field to higher level protocols, as follows:

(2.1) In VirtioScsiInit(), the field is read and saved. It is also checked to be at least 2 (due to the division quoted in the next bullet).

(2.2) PopulateRequest() contains the following logic:

      //
      // Catch oversized requests eagerly. If this condition evaluates to false,
      // then the combined size of a bidirectional request will not exceed the
      // virtio-scsi device's transfer limit either.
      //
      if (ALIGN_VALUE (Packet->OutTransferLength, 512) / 512
            > Dev->MaxSectors / 2 ||
          ALIGN_VALUE (Packet->InTransferLength, 512) / 512
            > Dev->MaxSectors / 2) {
        Packet->InTransferLength  = (Dev->MaxSectors / 2) * 512;
        Packet->OutTransferLength = (Dev->MaxSectors / 2) * 512;
        Packet->HostAdapterStatus =
                  EFI_EXT_SCSI_STATUS_HOST_ADAPTER_DATA_OVERRUN_UNDERRUN;
        Packet->TargetStatus      = EFI_EXT_SCSI_STATUS_TARGET_GOOD;
        Packet->SenseDataLength   = 0;
        return EFI_BAD_BUFFER_SIZE;
      }

That is, VirtioScsiDxe only lets such requests reach the device that do not exceed *half* of "max_sectors" *per direction*. Meaning that, for uni-directional requests, the check is stricter than "max_sectors" requires, and for bi-directional requests, it is exactly as safe as "max_sectors" requires. (VirtioScsiDxe will indeed refuse to drive a device that has just 1 in "max_sectors", per (2.1), but that's not a *practical* limitation, I would say.)

(2.3) When the above EFI_BAD_BUFFER_SIZE branch is taken, the maximum transfer sizes that the device supports are exposed to the caller (per direction), in accordance with the UEFI spec.

(2.4) The ScsiDiskRead10(), ScsiDiskWrite10(), ScsiDiskRead16(), ScsiDiskWrite16() functions in "MdeModulePkg/Bus/Scsi/ScsiDiskDxe/ScsiDisk.c" set the "NeedRetry" output param to TRUE upon seeing EFI_BAD_BUFFER_SIZE. (I take the blame for implementing that, in commit fc3c83e0b355, "MdeModulePkg: ScsiDiskDxe: recognize EFI_BAD_BUFFER_SIZE", 2015-09-10.)

(2.5) The ScsiDiskReadSectors() and ScsiDiskWriteSectors() functions, which call the functions listed in (2.4), adjust the request size, and resubmit the request, when "NeedRetry" is set on output. (I take part of the blame for this as well, in commit 5abc2a70da4f, "MdeModulePkg: ScsiDiskDxe: adapt SectorCount when shortening transfers", 2015-09-10. I recommend reading the commit message on this commit, as it describes a symptom somewhat similar to the current one.)

(3.1) Looking at "drivers/vhost/scsi.c" in the kernel, it doesn't seem to fill in "max_sectors" at all.

(3.2) However, the QEMU part of the same device model does seem to populate it; see "hw/scsi/vhost-scsi.c":

    DEFINE_PROP_UINT32("max_sectors", VirtIOSCSICommon, conf.max_sectors,
                       0xFFFF),

This field dates back to the original introduction of vhost-scsi, namely QEMU commit 5e9be92d7752 ("vhost-scsi: new device supporting the tcm_vhost Linux kernel module", 2013-04-19). The default value is almost 64K sectors, making the default transfer limit (from the device's perspective) almost 32 MB.

(3.3) And this QEMU-side limit looks orthogonal to the PREALLOC_SGLS and PREALLOC_PROT_SGLS kernel macros. IOW, it looks possible to exceed PREALLOC_SGLS / PREALLOC_PROT_SGLS without exceeding "max_sectors".

(4) Annie: can you try launching QEMU with the following flag:

    -global vhost-scsi-pci.max_sectors=2048

If that works, then I *guess* the kernel-side vhost device model could interrogate the virtio-scsi config space for "max_sectors", and use the value seen there in place of PREALLOC_SGLS / PREALLOC_PROT_SGLS.

(5) PS: referring back to (1) "seg_max": given that I'm looking at "hw/scsi/vhost-scsi.c" in QEMU anyway, git-blame fingers commit 1bf8a989a566 ("virtio: make seg_max virtqueue size dependent", 2020-01-06). This commit seems to confirm that "seg_max" stands basically for the same thing as "virtqueue size", and so my argument (1.2) is valid, and (1.3) is irrelevant. Put differently, the commit confirms that, in (1.2.2) and (1.2.4), VirtioScsiDxe indeed only relies on "seg_max" being >= 1, and therefore VirtioScsiDxe can safely ignore the actual (positive) value of "seg_max".

Thanks,
Laszlo
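To make the EFI_BAD_BUFFER_SIZE flow from points (2.2)-(2.5) concrete, here is a conceptual sketch of the shrink-and-retry pattern, not the actual edk2 code: SubmitRead() is a hypothetical stand-in for EFI_EXT_SCSI_PASS_THRU_PROTOCOL.PassThru() plus the ScsiDiskDxe bookkeeping, and the EFI types are assumed to come from the usual edk2 headers.

    //
    // Conceptual sketch only: retry a read with a smaller transfer size
    // whenever the pass-thru layer reports EFI_BAD_BUFFER_SIZE and echoes
    // back its per-direction limit (for VirtioScsiDxe, (MaxSectors/2)*512).
    // SubmitRead() is hypothetical; on EFI_BAD_BUFFER_SIZE it lowers the
    // value it was given to the limit reported by the device driver.
    //
    EFI_STATUS SubmitRead (IN OUT UINT32 *ByteCount); // hypothetical wrapper

    EFI_STATUS
    ReadWithShrinkOnBadBufferSize (
      IN OUT UINT32  *ByteCount    // requested transfer size, in bytes
      )
    {
      EFI_STATUS  Status;
      UINT32      Granted;

      for (;;) {
        Granted = *ByteCount;
        Status  = SubmitRead (&Granted);
        if (Status != EFI_BAD_BUFFER_SIZE) {
          return Status;                     // success, or a hard error
        }
        if (Granted == 0 || Granted >= *ByteCount) {
          return EFI_DEVICE_ERROR;           // no forward progress possible
        }
        *ByteCount = Granted;                // shrink and resubmit
      }
    }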
annie li <annie.li@...>
Hi Laszlo,
Thanks for the feedback. I added more logging in OVMF and got more info; see below.

On 5/26/2020 9:18 AM, Laszlo Ersek wrote:
> OvmfPkg/VirtioScsiDxe does not set the transfer size. This driver
> implements the EFI_EXT_SCSI_PASS_THRU_PROTOCOL, and the transfer size
> comes from the caller.

Nods, VirtioScsiDxe doesn't set the transfer size. My recent debugging shows that ScsiDiskDxe sets the max transfer size. I added more logging to the modules that call the DiskIo Read functions, and narrowed it down to MdeModulePkg/Bus/Scsi/ScsiDiskDxe/ScsiDisk.c, which has a maximum setting related to the max SCSI I/O size. For the Read(10) command, MaxBlock is 0xFFFF and BlockSize is 0x200, so the max ByteCount is 0xFFFF * 0x200 = 0x1FFFE00 (bigger than 8M). After setting MaxBlock to 0x4000 to limit the max ByteCount to 8M, Windows 2019 can boot from vhost-scsi in my local environment. However, this change is only for testing, not a fix.

> (1) Narrowing down which constant needs bumping (PREALLOC_SGLS or
> PREALLOC_PROT_SGLS).

Is it VHOST_SCSI_PREALLOC_SGLS? I'll make sure of it.

This issue seems related to both OVMF (ScsiDiskDxe) and vhost-scsi, since both of them limit the max SCSI I/O size. I am wondering where the fix should best go. If the fix goes into OVMF (ScsiDiskDxe), big I/O will be split into small pieces, which may slow down the boot procedure. If the fix goes into vhost-scsi, it may involve more memory consumption. Any more suggestions on this? Thank you!

Thanks
Annie
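The ByteCount arithmetic behind that experiment, spelled out as a standalone C check (values taken from the message above; this is only a checking aid, not edk2 code):

    /* READ(10) ByteCount ceiling in ScsiDiskDxe is MaxBlock * BlockSize:
     * 0xFFFF blocks of 0x200 bytes is 0x1FFFE00, while lowering MaxBlock
     * to 0x4000 caps it at exactly 8 MiB (0x800000).
     */
    #include <stdio.h>

    int main(void)
    {
        unsigned int block_size     = 0x200;    /* 512-byte blocks                  */
        unsigned int max_block_orig = 0xFFFF;   /* READ(10) MaxBlock in ScsiDiskDxe */
        unsigned int max_block_test = 0x4000;   /* value used for the test boot     */

        printf("original cap: 0x%X bytes\n", max_block_orig * block_size); /* 0x1FFFE00 */
        printf("test cap:     0x%X bytes\n", max_block_test * block_size); /* 0x800000  */
        return 0;
    }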
Paolo Bonzini <pbonzini@...>
On 26/05/20 15:18, Laszlo Ersek wrote:
> OvmfPkg/VirtioScsiDxe does not set the transfer size. This driver
> implements the EFI_EXT_SCSI_PASS_THRU_PROTOCOL, and the transfer size
> comes from the caller.

Does EFI_EXT_SCSI_PASS_THRU_PROTOCOL lack a way to specify the maximum number of SG entries supported by the HBA? Or also, though it's not related to this bug, the maximum size of each SG entry?

This information should be in the virtio-scsi configuration space. (But I haven't checked whether vhost-scsi fills it in correctly.)

Thanks,

Paolo
Paolo Bonzini <pbonzini@...>
On 27/05/20 13:43, Laszlo Ersek wrote:
Yes, that would do! Thanks for the investigation, Laszlo.

Or alternatively, QEMU could change the default max_sectors.

Paolo
annie li <annie.li@...>
Hi Laszlo,
(I sent out this email yesterday, but it somehow doesn't show up in https://edk2.groups.io/g/discuss. So re-sending it here...)

Thanks for the feedback. I added more logging in OVMF and got more info; see below.

On 5/26/2020 9:18 AM, Laszlo Ersek wrote:
> OvmfPkg/VirtioScsiDxe does not set the transfer size. This driver
> implements the EFI_EXT_SCSI_PASS_THRU_PROTOCOL, and the transfer size
> comes from the caller.

Nods, VirtioScsiDxe doesn't set the transfer size. My recent debugging shows that ScsiDiskDxe sets the max transfer size. I added more logging to the modules that call the DiskIo Read functions, and narrowed it down to MdeModulePkg/Bus/Scsi/ScsiDiskDxe/ScsiDisk.c, which has a maximum setting related to the max SCSI I/O size. For the Read(10) command, MaxBlock is 0xFFFF and BlockSize is 0x200, so the max ByteCount is 0xFFFF * 0x200 = 0x1FFFE00 (bigger than 8M). After setting MaxBlock to 0x4000 to limit the max ByteCount to 8M, Windows 2019 can boot from vhost-scsi in my local environment. However, this change is only for testing, not a fix.

> (1) Narrowing down which constant needs bumping (PREALLOC_SGLS or
> PREALLOC_PROT_SGLS).

Is it VHOST_SCSI_PREALLOC_SGLS? I'll make sure of it.

This issue seems related to both OVMF (ScsiDiskDxe) and vhost-scsi, since both of them limit the max SCSI I/O size. I am wondering where the fix should best go. If the fix goes into OVMF (ScsiDiskDxe), big I/O will be split into small pieces, which may slow down the boot procedure. If the fix goes into vhost-scsi, it may involve more memory consumption. Any more suggestions on this? Thank you!

Thanks
Annie
annie li <annie.li@...>
Hi Laszlo,
(I sent out my reply to your original response twice, but my reply somehow doesn't show up in https://edk2.groups.io/g/discuss. It is confusing. Anyway, re-sending it here, hope you can get it...)

On 5/27/2020 7:43 AM, Laszlo Ersek wrote:
> (2.4) The ScsiDiskRead10(), ScsiDiskWrite10(), ScsiDiskRead16(),
> ScsiDiskWrite16() functions in "MdeModulePkg/Bus/Scsi/ScsiDiskDxe/ScsiDisk.c"
> set the "NeedRetry" output param to TRUE upon seeing EFI_BAD_BUFFER_SIZE.

I recently added more logging in MdeModulePkg/Bus/Scsi/ScsiDiskDxe/ScsiDisk.c, which has a maximum setting related to the max SCSI I/O size. For example, for the Read(10) command, MaxBlock is 0xFFFF and BlockSize is 0x200, so the max ByteCount is 0xFFFF * 0x200 = 0x1FFFE00 (32M). After setting MaxBlock to 0x4000 to limit the max ByteCount to 8M, Windows 2019 can boot from vhost-scsi in my local environment. It looks like this 32M setting in ScsiDiskDxe is consistent with the one you mentioned in (3.2) for QEMU?

> (4) Annie: can you try launching QEMU with the following flag:
>
>     -global vhost-scsi-pci.max_sectors=2048

Cool! I can boot the Win2019 VM from vhost-scsi with the flag above.

Thanks
Annie
Laszlo Ersek
On 05/27/20 17:58, annie li wrote:
> Hi Laszlo,
>
> (I sent out my reply to your original response twice, but my reply
> somehow doesn't show up in https://edk2.groups.io/g/discuss. It is
> confusing.

Apologies for that -- while I'm one of the moderators on edk2-devel (I get moderation notifications with the other mods, and we distribute the mod workload the best we can), I'm not one of the edk2-discuss mods.

Hmm, wait a sec -- it seems like I am? And I just don't get mod notifications for edk2-discuss? Let me poke around in the settings :/

edk2-devel:
- Spam Control
  - Messages are not moderated
  - New Members moderated
  - Unmoderate after 1 approved message
- Message Policies
  - Allow Nonmembers to post (messages from nonmembers will be moderated instead of rejected)

edk2-discuss:
- Spam Control
  - Messages are not moderated
  - New Members ARE NOT moderated
- Message Policies
  - Allow Nonmembers to post (messages from nonmembers will be moderated instead of rejected)

So I think the bug in our configuration is that nonmembers are moderated on edk2-discuss just the same (because of the identical "Allow Nonmembers to post" setting), *however*, mods don't get notified because of the "New Members ARE NOT moderated" setting.

So let me tweak this -- I'm setting the same

- Spam Control
  - New Members moderated
  - Unmoderate after 1 approved message

for edk2-discuss as we have on edk2-devel, *plus* I'm removing the following from the edk2-discuss list description: "Basically unmoderated". (I mean I totally agree that it *should* be unmoderated, but fully open posting doesn't seem possible on groups.io at all!)

> Anyway, re-sending it here, hope you can get it...)

Thanks -- in case you CC me personally in addition to messaging the list (which is the common "best practice" for mailing lists), then I'll surely get it.

Following up below:

> I recently added more log in MdeModulePkg/Bus/Scsi/ScsiDiskDxe/ScsiDisk.c
> that has maximum setting related to MAX SCSI I/O size. For example, in
> Read(10) command, the MaxBlock is 0xFFFF, and the BlockSize is 0x200. So
> the max ByteCount is 0xFFFF*0x200 = 0x1FFFE00 (32M). After setting
> MaxBlock as 0x4000 to limit the max ByteCount to 8M, Windows 2019 can
> boot up from vhost-scsi in my local environment. Looks this 32M setting
> in ScsiDiskDxe is consistent with the one you mentioned in (3.2) in QEMU?

Yes, that's possible -- maybe the caller starts with an even larger transfer size, and then the EFI_BAD_BUFFER_SIZE logic is already at work, but it only reduces the transfer size to 32MB (per "max_sectors" from QEMU). And then all the protocols expect that to succeed, and when it fails, the failure is propagated to the outermost caller.

> Cool! I can boot Win2019 VM up from vhost-scsi with the flag above.

Thank you for confirming!

Laszlo
annie li <annie.li@...>
Hi Laszlo,
On 5/27/2020 2:00 PM, Laszlo Ersek wrote:
> So let me tweak this -- I'm setting the same
>
> - Spam Control
>   - New Members moderated
>   - Unmoderate after 1 approved message
>
> for edk2-discuss as we have on edk2-devel, *plus* I'm removing the
> following from the edk2-discuss list description: "Basically
> unmoderated".

Thank you for looking at it.

> Thanks -- in case you CC me personally in addition to messaging the list
> (which is the common "best practice" for mailing lists), then I'll
> surely get it.

Nods.

> (4) Annie: can you try launching QEMU with the following flag:
>
>     -global vhost-scsi-pci.max_sectors=2048
>
> If that works, then I *guess* the kernel-side vhost device model could
> interrogate the virtio-scsi config space for "max_sectors", and use the
> value seen there in place of PREALLOC_SGLS / PREALLOC_PROT_SGLS.

I am a little confused here. Both VHOST_SCSI_PREALLOC_SGLS(2048) and TCM_VHOST_PREALLOC_PROT_SGLS(512) are hard coded in vhost/scsi.c:

    ...
    sgl_count = vhost_scsi_calc_sgls(prot_iter, prot_bytes,
                                     TCM_VHOST_PREALLOC_PROT_SGLS);
    ...
    sgl_count = vhost_scsi_calc_sgls(data_iter, data_bytes,
                                     VHOST_SCSI_PREALLOC_SGLS);

In vhost_scsi_calc_sgls(), an error is printed out if sgl_count is more than TCM_VHOST_PREALLOC_PROT_SGLS or VHOST_SCSI_PREALLOC_SGLS:

    sgl_count = iov_iter_npages(iter, 0xffff);
    if (sgl_count > max_sgls) {
        pr_err("%s: requested sgl_count: %d exceeds pre-allocated"
               " max_sgls: %d\n", __func__, sgl_count, max_sgls);
        return -EINVAL;
    }

Looks like vhost-scsi doesn't interrogate the virtio-scsi config space for "max_sectors". The guest virtio-scsi driver may read this configuration out, though. So the following flag reduces the transfer size to 8M on the QEMU side.

Thanks
Annie
annie li
On 5/27/2020 2:00 PM, Laszlo Ersek wrote:
> Apologies for that -- while I'm one of the moderators on edk2-devel (I
> get moderation notifications with the other mods, and we distribute the
> mod workload the best we can), I'm not one of the edk2-discuss mods.

Thanks for addressing it. Another email I sent out yesterday didn't reach edk2-discuss either. I have joined this group and hope the email can show up this time. See my following comments.

Thanks for the detailed explanation, it is very helpful.

> (4) Annie: can you try launching QEMU with the following flag:
>
>     -global vhost-scsi-pci.max_sectors=2048

This limits the I/O size to 1M. The EFI_BAD_BUFFER_SIZE logic reduces the I/O size to 512K for uni-directional requests. To send the biggest I/O (8M) allowed by the current vhost-scsi setting, I adjusted the value to 0x3FFF; the EFI_BAD_BUFFER_SIZE logic then reduces the I/O size to 4M for uni-directional requests:

    -global vhost-scsi-pci.max_sectors=0x3FFF

0x4000 doesn't survive here.

> If that works, then I *guess* the kernel-side vhost device model could
> interrogate the virtio-scsi config space for "max_sectors", and use the
> value seen there in place of PREALLOC_SGLS / PREALLOC_PROT_SGLS.

You mean the vhost device on the guest side here, right? The Windows virtio-scsi driver does read out max_sectors. Even though the driver doesn't make use of it later, it can be used to adjust the transfer length of I/O. I guess you are not referring to the vhost-scsi on the host?

Both VHOST_SCSI_PREALLOC_SGLS(2048) and TCM_VHOST_PREALLOC_PROT_SGLS(512) are hard coded in vhost/scsi.c:

    ...
    sgl_count = vhost_scsi_calc_sgls(prot_iter, prot_bytes,
                                     TCM_VHOST_PREALLOC_PROT_SGLS);
    ...
    sgl_count = vhost_scsi_calc_sgls(data_iter, data_bytes,
                                     VHOST_SCSI_PREALLOC_SGLS);

In vhost_scsi_calc_sgls(), an error is printed out if sgl_count is more than TCM_VHOST_PREALLOC_PROT_SGLS or VHOST_SCSI_PREALLOC_SGLS:

    sgl_count = iov_iter_npages(iter, 0xffff);
    if (sgl_count > max_sgls) {
        pr_err("%s: requested sgl_count: %d exceeds pre-allocated"
               " max_sgls: %d\n", __func__, sgl_count, max_sgls);
        return -EINVAL;
    }

Looks like vhost-scsi doesn't interrogate the virtio-scsi config space for "max_sectors".

Although Win2019 boots from vhost-scsi with the above flag, I assume we still need to enlarge the value of VHOST_SCSI_PREALLOC_SGLS in vhost-scsi for the final fix, instead of setting max_sectors through QEMU options? Adding a specific QEMU command-line option just for booting Win2019 from vhost-scsi seems inappropriate. Suggestions?

Thanks
Annie
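For reference, the per-direction figures quoted above follow from VirtioScsiDxe's (MaxSectors / 2) * 512 check shown earlier in the thread; a standalone C check using the two max_sectors values discussed (this is only a checking aid):

    /* Per-direction transfer cap imposed by VirtioScsiDxe's
     * EFI_BAD_BUFFER_SIZE logic, for the two max_sectors values above:
     *   2048   -> 512 KiB per direction
     *   0x3FFF -> ~4 MiB per direction
     */
    #include <stdio.h>

    int main(void)
    {
        unsigned int settings[] = { 2048, 0x3FFF };

        for (unsigned int i = 0; i < 2; i++) {
            unsigned int max_sectors   = settings[i];
            unsigned int per_direction = (max_sectors / 2) * 512;  /* bytes */

            printf("max_sectors=0x%X -> per-direction cap 0x%X bytes (~%u KiB)\n",
                   max_sectors, per_direction, per_direction / 1024);
        }
        return 0;
    }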
Laszlo Ersek
On 05/28/20 00:04, annie li wrote:
> I am a little confused here. Both VHOST_SCSI_PREALLOC_SGLS(2048) and
> TCM_VHOST_PREALLOC_PROT_SGLS(512) are hard coded in vhost/scsi.c.

Yes.

> Looks like vhost-scsi doesn't interrogate the virtio-scsi config space
> for "max_sectors".

Yes.

The transfer size that ultimately reaches the device is the minimum of three quantities:

(a) the transfer size requested by the caller (i.e., the UEFI application),

(b) the limit set by the READ(10) / READ(16) decision (i.e., MaxBlock),

(c) the transfer size limit enforced / reported by EFI_EXT_SCSI_PASS_THRU_PROTOCOL.PassThru(), with EFI_BAD_BUFFER_SIZE.

Whichever is the smallest of the three determines the transfer size that the device ultimately sees in the request. And then *that* transfer size must satisfy PREALLOC_SGLS and/or PREALLOC_PROT_SGLS (2048 4K pages: 0x80_0000 bytes).

In your original use case, (a) is 0x93_F400 bytes, (b) is 0x1FF_FE00 bytes, and (c) is 0x1FF_FE00 too. Therefore the minimum is 0x93_F400, so that is what reaches the device. And because 0x93_F400 exceeds 0x80_0000, the request fails.

When you set "-global vhost-scsi-pci.max_sectors=2048", that lowers (c) to 0x10_0000. (a) and (b) remain unchanged. Therefore the new minimum (which finally reaches the device) is 0x10_0000. This does not exceed 0x80_0000, so the request succeeds.

In my prior email, I think I missed a detail: while the unit for QEMU's "vhost-scsi-pci.max_sectors" property is a "sector" (512 bytes), the unit for PREALLOC_SGLS and PREALLOC_PROT_SGLS in the kernel device model seems to be a *page*, rather than a sector. (I don't think I've ever checked iov_iter_npages() before.) Therefore the QEMU flag that I recommended previously was too strict. Can you try this instead, please?:

    -global vhost-scsi-pci.max_sectors=16384

This should set (c) to 0x80_0000 bytes. And so the minimum of {(a), (b), (c)} will be 0x80_0000 bytes -- exactly what PREALLOC_SGLS and PREALLOC_PROT_SGLS require.

> Although Win2019 boots from vhost-scsi with above flag, I assume we still
> need to enlarge the value of VHOST_SCSI_PREALLOC_SGLS in vhost-scsi for
> final fix instead of setting max_sectors through QEMU options?

There are multiple ways (alternatives) to fix the issue:

- use larger constants for PREALLOC_SGLS and PREALLOC_PROT_SGLS in the kernel;
- or replace the PREALLOC_SGLS and PREALLOC_PROT_SGLS constants in the kernel altogether, with logic that dynamically calculates them from the "max_sectors" virtio-scsi config header field;
- or change the QEMU default for "vhost-scsi-pci.max_sectors" from 0xFFFF to 16384.

Either should work.

Thanks,
Laszlo
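The unit conversion behind the new suggestion can be double-checked with a short standalone C snippet: QEMU's max_sectors is counted in 512-byte sectors, the kernel's PREALLOC_SGLS in 4 KiB pages, and 16384 sectors and 2048 pages both come out to the same 0x80_0000-byte ceiling (constants are taken from the messages above):

    /* Sanity check: max_sectors=16384 (512-byte sectors) and
     * PREALLOC_SGLS=2048 (4 KiB pages) describe the same 8 MiB limit.
     */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long qemu_limit   = 16384ULL * 512;  /* vhost-scsi-pci.max_sectors=16384 */
        unsigned long long kernel_limit = 2048ULL * 4096;  /* PREALLOC_SGLS pages of 4 KiB     */

        printf("QEMU limit:   0x%llX bytes\n", qemu_limit);
        printf("kernel limit: 0x%llX bytes\n", kernel_limit);
        printf("%s\n", qemu_limit == kernel_limit ? "limits line up" : "limits differ");
        return 0;
    }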
Laszlo Ersek
On 05/28/20 18:39, annie li wrote:
> This limits the I/O size to 1M.

Indeed -- as I just pointed out under your other email, I previously missed that the host kernel-side unit was not "sector" but "4K page". So yes, the value 2048 above is too strict.

> The EFI_BAD_BUFFER_SIZE logic reduces the I/O size to 512K for
> uni-directional requests. To send the biggest I/O (8M) allowed by the
> current vhost-scsi setting, I adjusted the value to 0x3FFF; the
> EFI_BAD_BUFFER_SIZE logic then reduces the I/O size to 4M for
> uni-directional requests.

OK!

> 0x4000 doesn't survive here.

That's really interesting. I'm not sure why that happens. ... Is it possible that vhost_scsi_handle_vq() -- in the host kernel -- puts stuff in the scatter-gather list *other* than the transfer buffers? Some headers and such? Maybe those headers need an extra page.

> You mean the vhost device on the guest side here, right? The Windows
> virtio-scsi driver does read out max_sectors. Even though the driver
> doesn't make use of it later, it can be used to adjust the transfer
> length of I/O. I guess you are not referring to the vhost-scsi on the
> host?

With vhost, the virtio-scsi device model is split between QEMU and the host kernel. While QEMU manages the "max_sectors" property (= accepts it from the command line, and exposes it to the guest driver), the host kernel (i.e., the other half of the device model) ignores the same property. Consequently, although the guest driver obeys "max_sectors" for limiting the transfer size, the host kernel's constants may prove *stricter* than that, because the host kernel ignores "max_sectors".

So one idea is to make the host kernel honor the "max_sectors" limit that QEMU manages. The other two ideas are: use larger constants in the kernel, or use a smaller "max_sectors" default in QEMU. The goal behind all three alternatives is the same: the limit that QEMU exposes to the guest driver should satisfy the host kernel.

Thanks
Laszlo
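As a rough illustration of the first idea (the host kernel deriving its SGL pre-allocation from the same "max_sectors" value that QEMU exposes), here is a hedged, standalone sketch; the helper name is hypothetical and this is not actual vhost-scsi code:

    /* Hypothetical helper: how many 4 KiB scatter-gather entries would be
     * needed to cover a transfer of max_sectors 512-byte sectors.
     */
    #include <stdio.h>

    static unsigned int sgls_for_max_sectors(unsigned int max_sectors)
    {
        unsigned long long bytes = (unsigned long long)max_sectors * 512;
        return (unsigned int)((bytes + 4096 - 1) / 4096);   /* round up to whole pages */
    }

    int main(void)
    {
        /* QEMU's current default (0xFFFF sectors) would need 8192 entries;
         * the 16384-sector value suggested in this thread needs 2048.
         */
        printf("max_sectors=0xFFFF -> %u SGL entries\n", sgls_for_max_sectors(0xFFFF));
        printf("max_sectors=16384  -> %u SGL entries\n", sgls_for_max_sectors(16384));
        return 0;
    }

Under QEMU's current 0xFFFF default this comes out to 8192 entries per command, which is where the memory-consumption concern mentioned earlier in the thread would come in.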
annie li
On 5/28/2020 6:08 PM, Laszlo Ersek wrote:
> Indeed -- as I just pointed out under your other email, I previously
> missed that the host kernel-side unit was not "sector" but "4K page". So
> yes, the value 2048 above is too strict.

Yup.

> That's really interesting. I'm not sure why that happens.

Then I found out it is related to operations on this VM; see the following.

> Is it possible that vhost_scsi_handle_vq() -- in the host kernel -- puts
> stuff in the scatter-gather list *other* than the transfer buffers? Some
> headers and such? Maybe those headers need an extra page.

I ran more tests, and found the boot failure happens randomly when I boot the VM right after it was previously terminated by Ctrl+C directly from the QEMU monitor, no matter whether max_sectors is 2048, 16383 or 16384. The failure rate is about 7 out of 20. So my previous statement about 0x4000 and 0x3FFF isn't accurate: it is just that booting happened to succeed with 0x3FFF (16383), but not with 0x4000 (16384). Also, when this failure happens, dmesg doesn't print the following error:

    vhost_scsi_calc_sgls: requested sgl_count: 2368 exceeds pre-allocated max_sgls: 2048

This new failure is a totally different issue from the one caused by max-sized I/O. In my OVMF debug log, the biggest I/O size is only about 1M, which means Windows 2019 hadn't sent out big I/O yet. The interesting part is that I don't see this new failure happen if I boot a VM that was previously shut down gracefully from inside the Windows guest.

> So one idea is to make the host kernel honor the "max_sectors" limit that
> QEMU manages. The other two ideas are: use larger constants in the
> kernel, or use a smaller "max_sectors" default in QEMU.

This involves changes in both the kernel and QEMU. I guess maybe it is more straightforward for the kernel to control the transfer size based on the memory consumed. I prefer fixing it by using larger constants in the kernel; this also avoids splitting big I/O, which a smaller "max_sectors" default in QEMU would cause. Following is the code change I did in the kernel code vhost/scsi.c:

    -#define VHOST_SCSI_PREALLOC_SGLS 2048
    -#define VHOST_SCSI_PREALLOC_UPAGES 2048
    +#define VHOST_SCSI_PREALLOC_SGLS 2560
    +#define VHOST_SCSI_PREALLOC_UPAGES 2560

Thanks
Annie
annie li
On 5/28/2020 5:51 PM, Laszlo Ersek wrote:
> The transfer size that ultimately reaches the device is the minimum of
> three quantities: [...]

Much clearer now, thank you!

> Can you try this instead, please?:
>
>     -global vhost-scsi-pci.max_sectors=16384

It works, but I ran into another failure. I put the details in another email.

> There are multiple ways (alternatives) to fix the issue.

I prefer fixing it on the kernel side; details are in another email too. :-)

Thanks
Annie
Laszlo Ersek
On 05/29/20 16:47, annie li wrote:
> I ran more tests, and found the boot failure happens randomly when I
> boot the VM right after it was previously terminated by Ctrl+C directly
> from the QEMU monitor, no matter whether max_sectors is 2048, 16383 or
> 16384.

Can you build the host kernel with "CONFIG_VHOST_SCSI=m", and repeat your Ctrl-C test such that you remove and re-insert "vhost_scsi.ko" after every Ctrl-C?

My guess is that, when you kill QEMU with Ctrl-C, "vhost_scsi.ko" might not clean up something, and that could break the next guest boot. If you re-insert "vhost_scsi.ko" for each QEMU launch, and that ends up masking the symptom, then there's likely some resource leak in "vhost_scsi.ko".

Just a guess.

Thanks
Laszlo
annie li
On 6/2/2020 7:44 AM, Laszlo Ersek wrote:
> Can you build the host kernel with "CONFIG_VHOST_SCSI=m", and repeat
> your Ctrl-C test such that you remove and re-insert "vhost_scsi.ko"
> after every Ctrl-C?

I am using targetcli to create the SCSI LUN that the VM boots from. The vhost_scsi module gets loaded right after I create the target in /vhost. However, I cannot remove the vhost_scsi module after that. It always complains "Module vhost_scsi is in use" (the same even after I delete the target in targetcli). Maybe it is related to targetcli, but I didn't try other tools yet.

> My guess is that, when you kill QEMU with Ctrl-C, "vhost_scsi.ko" might
> not clean up something, and that could break the next guest boot.

Nods, it is possible.

Thanks
Annie
Laszlo Ersek
On 06/03/20 00:19, annie li wrote:
> I am using targetcli to create the SCSI LUN that the VM boots from. The
> vhost_scsi module gets loaded right after I create the target in /vhost.
> However, I cannot remove the vhost_scsi module after that. It always
> complains "Module vhost_scsi is in use" (the same even after I delete
> the target in targetcli).

Can you check with "lsmod" if other modules use vhost_scsi?

If you shut down QEMU gracefully, can you rmmod vhost_scsi in that case?

I wonder if the failure to remove the vhost_scsi module is actually another sign of the same (as yet unknown) leaked reference.

Thanks
Laszlo
annie li
On 6/3/2020 9:33 AM, Laszlo Ersek wrote:
> Can you check with "lsmod" if other modules use vhost_scsi?

lsmod shows vhost_scsi is used by 4 programs; I assume these 4 are related to targetcli.

    lsmod | grep vhost_scsi
    vhost_scsi             36864  4
    vhost                  53248  1 vhost_scsi
    target_core_mod       380928  14 target_core_file,target_core_iblock,iscsi_target_mod,vhost_scsi,target_core_pscsi,target_core_user

I was thinking maybe these target_* modules are using vhost_scsi, so I removed the following modules with modprobe -r:

    target_core_file,target_core_iblock,vhost_scsi,target_core_pscsi,target_core_user

Then lsmod shows "used by" down to 3 programs:

    vhost_scsi             36864  3
    vhost                  53248  1 vhost_scsi
    target_core_mod       380928  6 iscsi_target_mod,vhost_scsi

However, the others cannot be removed. "rmmod --force" doesn't help either. "dmesg | grep vhost_scsi" doesn't show much useful information either.

> If you shut down QEMU gracefully, can you rmmod vhost_scsi in that case?

No, I cannot rmmod these modules right after I create the target in targetcli, no matter whether I start a VM or not. Deleting the target in targetcli doesn't help either.

Before I create the target in targetcli, I can add and remove the vhost_scsi module, and the "used by" count of vhost_scsi is 0. See the following steps I did right after I rebooted my host:

    # modprobe vhost_scsi
    vhost_scsi             36864  0
    vhost                  53248  1 vhost_scsi
    target_core_mod       380928  1 vhost_scsi
    # modprobe -r vhost_scsi

Right after I set up the LUNs in targetcli, the "used by" count is always 4, no matter whether I stop the VM by "CTRL-C" or a graceful shutdown, and no matter whether the VM is running or not. So targetcli is the suspect for these 4 "used by".

Thanks
Annie