Windows 2019 VM fails to boot from vhost-scsi with UEFI mode


annie li <annie.li@...>
 

Hello,

I have been trying to boot a Windows 2019 VM from a vhost-scsi device in UEFI mode in a KVM environment, but kept getting a boot failure. The Win2019 VM goes directly into automatic recovery mode; a Win2016 VM doesn't have this issue.

Originally, I thought this issue was related to the vioscsi driver. I limited the max transfer length of I/O in the vioscsi driver, but it didn't help. Windbg debugging shows the vioscsi device driver hasn't had a chance to be loaded yet; the failure happens at a very early stage of loading the Windows kernel.

After analyzing the logs of both vhost-scsi and OVMF, it turns out OVMF is sending out large I/O (>8 MB) for the Windows 2019 VM. This I/O size exceeds the max SCSI I/O limit (8 MB) of vhost-scsi in KVM. The Windows 2019 kernel fails to load because vhost-scsi cannot handle these large I/O requests. See the following log printed by vhost-scsi:

[3199901.817872] vhost_scsi_calc_sgls: requested sgl_count: 2368 exceeds pre-allocated max_sgls: 2048
[3199901.839181] vhost_scsi_calc_sgls: requested sgl_count: 2368 exceeds pre-allocated max_sgls: 2048

Following is the DiskIo log of OVMF; it shows that large I/O is sent out:

DiskIo: Create subtasks for task: Offset/BufferSize/Buffer = 00000000F3020000/0093F400/03C00000
R:Lba/Offset/Length/WorkingBuffer/Buffer = 0000000000798100/00000000/0093F400/00000000/03C00000

Here, the length is 0x0093F400, which is bigger than 8 MB.

So I am wondering: is this a known issue? Is there any configuration that can limit the size of OVMF's disk I/O? Or is it something related to Windows 2019 itself? Any feedback is greatly appreciated.

Thanks
Annie


Laszlo Ersek
 

Hello Annie,

thank you for the comprehensive write-up.

On 05/22/20 00:11, annie li wrote:
Hello,

I have been trying to boot up a Windows 2019 VM from vhost-scsi device
with UEFI mode in KVM environment, but kept getting boot failure. The
Win2019 VM directly goes into automatic recovery mode, but Win2016 VM
doesn't have this issue.

Originally, I thought this issue is related to vioscsi driver. I
limited the max transfer length of I/O in vioscsi driver, but it
didn't help. Through Windbg debug, it shows vioscsi device driver
doesn't get chance to be loaded yet. This failure happens in very
early stage of loading Windows kernel.

After analyzing the log of both vhost-scsi and OVMF, it turns out OVMF
is sending out big sized I/O(>8M) for Windows 2019 VM.
OvmfPkg/VirtioScsiDxe does not set the transfer size. This driver
implements the EFI_EXT_SCSI_PASS_THRU_PROTOCOL, and the transfer size
comes from the caller.

The (ultimate) caller in this case, likely through a number of other
protocol layers, is the Windows 2019 boot loader (or another UEFI
component of Windows 2019).

The actual limit is in the host kernel (see more details below).

This I/O size
exceeds the max SCSI I/O limitation(8M) of vhost-scsi in KVM. Windows
2019 kernel fails getting loaded due to failure of handling these big
sized I/O in vhost-scsi. See following log printed out in vhost-scsi,

[3199901.817872] vhost_scsi_calc_sgls: requested sgl_count: 2368
exceeds pre-allocated max_sgls: 2048
[3199901.839181] vhost_scsi_calc_sgls: requested sgl_count: 2368
exceeds pre-allocated max_sgls: 2048
This is helpful!


Following is the diskio log of OVMF, it shows big sized I/O is sent
out.

DiskIo: Create subtasks for task: Offset/BufferSize/Buffer =
00000000F3020000/0093F400/03C00000
R:Lba/Offset/Length/WorkingBuffer/Buffer =
0000000000798100/00000000/0093F400/00000000/03C00000

Here, length is 0x0093F400 that is bigger than 8M.

So I am wondering if this is a known issue? or is there any configure
can limit the size of disk I/O of OVMF? or it is something related to
Windows 2019 itself? Any feedback is greatly appreciated.
In the host kernel (Linux), commit 3aee26b4ae91 ("vhost/scsi: Add
pre-allocation for tv_cmd SGL + upages memory", 2013-09-09) introduced
PREALLOC_SGLS, with value 2048.

Furthermore, commit b1935f687bb9 ("vhost/scsi: Add preallocation of
protection SGLs", 2014-06-02) introduced PREALLOC_PROT_SGLS, with value
512.

Later, PREALLOC_PROT_SGLS was bumped to 2048 in commit 864d39df09b4
("vhost/scsi: increase VHOST_SCSI_PREALLOC_PROT_SGLS to 2048",
2018-08-22). From the commit message, it seems that others have
encountered a symptom very similar to yours before.

Therefore I would suggest:

(1) Narrowing down which constant needs bumping (PREALLOC_SGLS or
PREALLOC_PROT_SGLS).

I can't tell that from the host kernel message, because
vhost_scsi_calc_sgls() is called with both constants (as of commit
e8de56b5e76a, "vhost/scsi: Add ANY_LAYOUT iov -> sgl mapping
prerequisites", 2015-02-04), and the error message printed for both is
the same.

(2) Submitting a patch (similar to commit 864d39df09b4) to the following
addresses:

"Michael S. Tsirkin" <mst@redhat.com> (maintainer:VIRTIO BLOCK AND SCSI DRIVERS)
Jason Wang <jasowang@redhat.com> (maintainer:VIRTIO BLOCK AND SCSI DRIVERS)
Paolo Bonzini <pbonzini@redhat.com> (reviewer:VIRTIO BLOCK AND SCSI DRIVERS)
Stefan Hajnoczi <stefanha@redhat.com> (reviewer:VIRTIO BLOCK AND SCSI DRIVERS)
virtualization@lists.linux-foundation.org (open list:VIRTIO BLOCK AND SCSI DRIVERS)
kvm@vger.kernel.org (open list:VIRTIO HOST (VHOST))
netdev@vger.kernel.org (open list:VIRTIO HOST (VHOST))
linux-kernel@vger.kernel.org (open list)

The new value should likely be something "nice and round", for example
2048+512 = 2560 or 2048+1024 = 3072. That shouldn't increase memory
consumption a lot, but it would still accommodate the 2368 value that
Windows 2019 needs. Of course, the maintainers will tell you the value
they deem best.

I'm CC'ing my colleagues from the above address list at once.

Thanks
Laszlo


Laszlo Ersek
 

On 05/26/20 19:25, Paolo Bonzini wrote:
On 26/05/20 15:18, Laszlo Ersek wrote:
OvmfPkg/VirtioScsiDxe does not set the transfer size. This driver
implements the EFI_EXT_SCSI_PASS_THRU_PROTOCOL, and the transfer size
comes from the caller.

The (ultimate) caller in this case, likely through a number of other
protocol layers, is the Windows 2019 boot loader (or another UEFI
component of Windows 2019).

The actual limit is in the host kernel (see more details below).
Does EFI_EXT_SCSI_PASS_THRU_PROTOCOL lack a way to specify the maximum
number of SG entries supported by the HBA? Or also, though it's not
related to this bug, the maximum size of each SG entry?

This information should be in the virtio-scsi configuration space.
(But I haven't checked if vhost-scsi fills it in correctly).
The "virtio_scsi_config" structure has the following fields, from the
virtio-1.0 spec (I'm looking at "CS04" anyway):

le32 num_queues;
le32 seg_max;
le32 max_sectors;
le32 cmd_per_lun;
le32 event_info_size;
le32 sense_size;
le32 cdb_size;
le16 max_channel;
le16 max_target;
le32 max_lun;

The only fields that appear related to the symptom at hand are "seg_max"
and "max_sectors".

(1) "seg_max" has the simpler story, so let's start with that. The spec
says:

seg_max is the maximum number of segments that can be in a command. A
bidirectional command can include seg_max input segments and
seg_max output segments.

OvmfPkg/VirtioScsiDxe does not check the "seg_max" field. That's
because:

(1.1) VirtioScsiDxe considers the VIRTIO_SCSI_F_INOUT feature bit, and
rejects bidirectional requests from the
EFI_EXT_SCSI_PASS_THRU_PROTOCOL's caller with EFI_UNSUPPORTED
immediately, if the feature bit is clear on the device.

(1.2) VirtioScsiDxe never composes a virtio request (= descriptor chain)
with more than 4 descriptors:

(1.2.1) request header -- "virtio_scsi_req_cmd" up to and including the
"cdb" field,

(1.2.2) data block to transfer from the driver to the device (if any),

(1.2.3) response header -- "virtio_scsi_req_cmd" starting at the
"Device-writable part",

(1.2.4) data block to transfer from the device to the driver (if any).

The queue size is checked to be at least 4 in VirtioScsiInit(). And
neither (1.2.2) nor (1.2.4) require "seg_max" to be larger than 1.

So I don't think "seg_max" plays any role for the current symptom.

(1.3) Assuming some CDB (= SCSI command) exists that has another layer
of indirection, i.e., it transfers a list of pointers in the (1.2.2) or
(1.2.4) data blocks, then parsing such a CDB and list of pointers is not
the job of EFI_EXT_SCSI_PASS_THRU_PROTOCOL. It says "passthru" in the
name.

(Now I surely don't know if such a SCSI command exists at all, but if it
does, and "seg_max" in the virtio-scsi config header intends to limit
that, then an EFI_EXT_SCSI_PASS_THRU_PROTOCOL implementation cannot do
anything about it; it can't even expose "seg_max" to higher-level
callers.)


(2) Regarding "max_sectors", the spec says:

max_sectors is a hint to the driver about the maximum transfer size to
use.

OvmfPkg/VirtioScsiDxe honors and exposes this field to higher level
protocols, as follows:

(2.1) in VirtioScsiInit(), the field is read and saved. It is also
checked to be at least 2 (due to the division quoted in the next
bullet).

(2.2) PopulateRequest() contains the following logic:

//
// Catch oversized requests eagerly. If this condition evaluates to false,
// then the combined size of a bidirectional request will not exceed the
// virtio-scsi device's transfer limit either.
//
if (ALIGN_VALUE (Packet->OutTransferLength, 512) / 512
      > Dev->MaxSectors / 2 ||
    ALIGN_VALUE (Packet->InTransferLength, 512) / 512
      > Dev->MaxSectors / 2) {
  Packet->InTransferLength  = (Dev->MaxSectors / 2) * 512;
  Packet->OutTransferLength = (Dev->MaxSectors / 2) * 512;
  Packet->HostAdapterStatus =
    EFI_EXT_SCSI_STATUS_HOST_ADAPTER_DATA_OVERRUN_UNDERRUN;
  Packet->TargetStatus      = EFI_EXT_SCSI_STATUS_TARGET_GOOD;
  Packet->SenseDataLength   = 0;
  return EFI_BAD_BUFFER_SIZE;
}

That is, VirtioScsiDxe only lets such requests reach the device that do
not exceed *half* of "max_sectors" *per direction*. Meaning that, for
uni-directional requests, the check is stricter than "max_sectors"
requires, and for bi-directional requests, it is exactly as safe as
"max_sectors" requires. (VirtioScsiDxe will indeed refuse to drive a
device that has just 1 in "max_sectors", per (2.1), but that's not a
*practical* limitation, I would say.)

(2.3) When the above EFI_BAD_BUFFER_SIZE branch is taken, the maximum
transfer sizes that the device supports are exposed to the caller (per
direction), in accordance with the UEFI spec.

(2.4) The ScsiDiskRead10(), ScsiDiskWrite10(), ScsiDiskRead16(),
ScsiDiskWrite16() functions in
"MdeModulePkg/Bus/Scsi/ScsiDiskDxe/ScsiDisk.c" set the "NeedRetry"
output param to TRUE upon seeing EFI_BAD_BUFFER_SIZE.

(I take the blame for implementing that, in commit fc3c83e0b355,
"MdeModulePkg: ScsiDiskDxe: recognize EFI_BAD_BUFFER_SIZE", 2015-09-10.)

(2.5) The ScsiDiskReadSectors() and ScsiDiskWriteSectors() functions,
which call the functions listed in (2.4), adjust the request size, and
resubmit the request, when "NeedRetry" is set on output.

(I take part of the blame for this as well, in commit 5abc2a70da4f,
"MdeModulePkg: ScsiDiskDxe: adapt SectorCount when shortening
transfers", 2015-09-10. I recommend reading the commit message on this
commit, as it describes a symptom somewhat similar to the current one.)


(3.1) Looking at "drivers/vhost/scsi.c" in the kernel, it doesn't seem
to fill in "max_sectors" at all.

(3.2) However, the QEMU part of the same device model does seem to
populate it; see "hw/scsi/vhost-scsi.c":

    DEFINE_PROP_UINT32("max_sectors", VirtIOSCSICommon, conf.max_sectors,
                       0xFFFF),

This field dates back to the original introduction of vhost-scsi, namely
QEMU commit 5e9be92d7752 ("vhost-scsi: new device supporting the
tcm_vhost Linux kernel module", 2013-04-19).

The default value is almost 64K sectors, making the default transfer
limit (from the device's perspective) almost 32 MB.

(3.3) And this QEMU-side limit looks orthogonal to the PREALLOC_SGLS and
PREALLOC_PROT_SGLS kernel macros.

IOW, it looks possible to exceed PREALLOC_SGLS / PREALLOC_PROT_SGLS
without exceeding "max_sectors".


(4) Annie: can you try launching QEMU with the following flag:

-global vhost-scsi-pci.max_sectors=2048

If that works, then I *guess* the kernel-side vhost device model could
interrogate the virtio-scsi config space for "max_sectors", and use the
value seen there in place of PREALLOC_SGLS / PREALLOC_PROT_SGLS.


(5) PS: referring back to (1) "seg_max":

given that I'm looking at "hw/scsi/vhost-scsi.c" in QEMU anyway,
git-blame fingers commit 1bf8a989a566 ("virtio: make seg_max virtqueue
size dependent", 2020-01-06). This commit seems to confirm that
"seg_max" stands basically for the same thing as "virtqueue size", and
so my argument (1.2) is valid, and (1.3) is irrelevant.

Put differently, the commit confirms that, in (1.2.2) and (1.2.4),
VirtioScsiDxe indeed only relies on "seg_max" being >=1, and therefore
VirtioScsiDxe can safely ignore the actual (positive) value of
"seg_max".

Thanks,
Laszlo


annie li <annie.li@...>
 

Hi Laszlo,

Thanks for the feedback.
I added more log in OVMF and got more info, see following.

On 5/26/2020 9:18 AM, Laszlo Ersek wrote:
Hello Annie,

thank you for the comprehensive write-up.

On 05/22/20 00:11, annie li wrote:
Hello,

I have been trying to boot up a Windows 2019 VM from vhost-scsi device
with UEFI mode in KVM environment, but kept getting boot failure. The
Win2019 VM directly goes into automatic recovery mode, but Win2016 VM
doesn't have this issue.

Originally, I thought this issue is related to vioscsi driver. I
limited the max transfer length of I/O in vioscsi driver, but it
didn't help. Through Windbg debug, it shows vioscsi device driver
doesn't get chance to be loaded yet. This failure happens in very
early stage of loading Windows kernel.

After analyzing the log of both vhost-scsi and OVMF, it turns out OVMF
is sending out big sized I/O(>8M) for Windows 2019 VM.
OvmfPkg/VirtioScsiDxe does not set the transfer size. This driver
implements the EFI_EXT_SCSI_PASS_THRU_PROTOCOL, and the transfer size
comes from the caller.
Nods, VirtioScsiDxe doesn't set the transfer size.
My recent debugging shows that ScsiDiskDxe sets the max transfer size.

The (ultimate) caller in this case, likely through a number of other
protocol layers, is the Windows 2019 boot loader (or another UEFI
component of Windows 2019).
I added more logging in the modules that call the DiskIo read functions, and
narrowed it down to MdeModulePkg/Bus/Scsi/ScsiDiskDxe/ScsiDisk.c. It seems
to have a maximum setting related to the max SCSI I/O size.

In the Read(10) command, MaxBlock is 0xFFFF and BlockSize is 0x200,
so the max ByteCount is 0xFFFF * 0x200 = 0x1FFFE00 (bigger than 8 MB).
After setting MaxBlock to 0x4000 to limit the max ByteCount to 8 MB,
Windows 2019 can boot from vhost-scsi in my local environment.
However, this change is only for testing, not a fix.


The actual limit is in the host kernel (see more details below).

This I/O size
exceeds the max SCSI I/O limitation(8M) of vhost-scsi in KVM. Windows
2019 kernel fails getting loaded due to failure of handling these big
sized I/O in vhost-scsi. See following log printed out in vhost-scsi,

[3199901.817872] vhost_scsi_calc_sgls: requested sgl_count: 2368
exceeds pre-allocated max_sgls: 2048
[3199901.839181] vhost_scsi_calc_sgls: requested sgl_count: 2368
exceeds pre-allocated max_sgls: 2048
This is helpful!

Following is the diskio log of OVMF, it shows big sized I/O is sent
out.

DiskIo: Create subtasks for task: Offset/BufferSize/Buffer =
00000000F3020000/0093F400/03C00000
R:Lba/Offset/Length/WorkingBuffer/Buffer =
0000000000798100/00000000/0093F400/00000000/03C00000

Here, length is 0x0093F400 that is bigger than 8M.

So I am wondering if this is a known issue? or is there any configure
can limit the size of disk I/O of OVMF? or it is something related to
Windows 2019 itself? Any feedback is greatly appreciated.
In the host kernel (Linux), commit 3aee26b4ae91 ("vhost/scsi: Add
pre-allocation for tv_cmd SGL + upages memory", 2013-09-09) introduced
PREALLOC_SGLS, with value 2048.

Furthermore, commit b1935f687bb9 ("vhost/scsi: Add preallocation of
protection SGLs", 2014-06-02) introduced PREALLOC_PROT_SGLS, with value
512.

Later, PREALLOC_PROT_SGLS was bumped to 2048 in commit 864d39df09b4
("vhost/scsi: increase VHOST_SCSI_PREALLOC_PROT_SGLS to 2048",
2018-08-22). From the commit message, it seems that others have
encountered a symptom very similar to yours before.

Therefore I would suggest:

(1) Narrowing down which constant needs bumping (PREALLOC_SGLS or
PREALLOC_PROT_SGLS).

I can't tell that from the host kernel message, because
vhost_scsi_calc_sgls() is called with both constants (as of commit
e8de56b5e76a, "vhost/scsi: Add ANY_LAYOUT iov -> sgl mapping
prerequisites", 2015-02-04), and the error message printed for both is
the same.
Is it VHOST_SCSI_PREALLOC_SGLS?
I'll confirm it.


(2) Submitting a patch (similar to commit 864d39df09b4) to the following
addresses:

"Michael S. Tsirkin" <mst@redhat.com> (maintainer:VIRTIO BLOCK AND SCSI DRIVERS)
Jason Wang <jasowang@redhat.com> (maintainer:VIRTIO BLOCK AND SCSI DRIVERS)
Paolo Bonzini <pbonzini@redhat.com> (reviewer:VIRTIO BLOCK AND SCSI DRIVERS)
Stefan Hajnoczi <stefanha@redhat.com> (reviewer:VIRTIO BLOCK AND SCSI DRIVERS)
virtualization@lists.linux-foundation.org (open list:VIRTIO BLOCK AND SCSI DRIVERS)
kvm@vger.kernel.org (open list:VIRTIO HOST (VHOST))
netdev@vger.kernel.org (open list:VIRTIO HOST (VHOST))
linux-kernel@vger.kernel.org (open list)

The new value should likely be something "nice and round", for example
2048+512 = 2560 or 2048+1024 = 3072. That shouldn't increase memory
consumption a lot, but it would still accommodate the 2368 value that
Windows 2019 needs. Of course, the maintainers will tell you the value
they deem best.
This issue seems related to both OVMF (ScsiDiskDxe) and vhost-scsi, since both
of them limit the max SCSI I/O size.
I am wondering where the fix best belongs. If the fix goes into
OVMF (ScsiDiskDxe), large I/O will be split into small pieces, which may
slow down the boot procedure. If the fix goes into vhost-scsi, it may involve
more memory consumption.
Any more suggestions on this?

I'm CC'ing my colleagues from the above address list at once.
Thank you!

Thanks
Annie


Thanks
Laszlo


Paolo Bonzini <pbonzini@...>
 

On 26/05/20 15:18, Laszlo Ersek wrote:
OvmfPkg/VirtioScsiDxe does not set the transfer size. This driver
implements the EFI_EXT_SCSI_PASS_THRU_PROTOCOL, and the transfer size
comes from the caller.

The (ultimate) caller in this case, likely through a number of other
protocol layers, is the Windows 2019 boot loader (or another UEFI
component of Windows 2019).

The actual limit is in the host kernel (see more details below).
Does EFI_EXT_SCSI_PASS_THRU_PROTOCOL lack a way to specify the maximum
number of SG entries supported by the HBA? Or also, though it's not
related to this bug, the maximum size of each SG entry?

This information should be in the virtio-scsi configuration space. (But
I haven't checked if vhost-scsi fills it in correctly).

Thanks,

Paolo


Paolo Bonzini <pbonzini@...>
 

On 27/05/20 13:43, Laszlo Ersek wrote:

(4) Annie: can you try launching QEMU with the following flag:

-global vhost-scsi-pci.max_sectors=2048

If that works, then I *guess* the kernel-side vhost device model could
interrogate the virtio-scsi config space for "max_sectors", and use the
value seen there in place of PREALLOC_SGLS / PREALLOC_PROT_SGLS.
Yes, that would do! Thanks for the investigation, Laszlo.

Or alternatively, QEMU could change the default max_sectors.

Paolo



annie li <annie.li@...>
 

Hi Laszlo,

(I sent out my reply to your original response twice, but my reply somehow
doesn't show up in https://edk2.groups.io/g/discuss. It is confusing.
Anyway, re-sending it here, hope you can get it...)

On 5/27/2020 7:43 AM, Laszlo Ersek wrote:
On 05/26/20 19:25, Paolo Bonzini wrote:
On 26/05/20 15:18, Laszlo Ersek wrote:
OvmfPkg/VirtioScsiDxe does not set the transfer size. This driver
implements the EFI_EXT_SCSI_PASS_THRU_PROTOCOL, and the transfer size
comes from the caller.

The (ultimate) caller in this case, likely through a number of other
protocol layers, is the Windows 2019 boot loader (or another UEFI
component of Windows 2019).

The actual limit is in the host kernel (see more details below).
Does EFI_EXT_SCSI_PASS_THRU_PROTOCOL lack a way to specify the maximum
number of SG entries supported by the HBA? Or also, though it's not
related to this bug, the maximum size of each SG entry?

This information should be in the virtio-scsi configuration space.
(But I haven't checked if vhost-scsi fills it in correctly).
The "virtio_scsi_config" structure has the following fields, from the
virtio-1.0 spec (I'm looking at "CS04" anyway):

le32 num_queues;
le32 seg_max;
le32 max_sectors;
le32 cmd_per_lun;
le32 event_info_size;
le32 sense_size;
le32 cdb_size;
le16 max_channel;
le16 max_target;
le32 max_lun;

The only fields that appear related to the symptom at hand are "seg_max"
and "max_sectors".

(1) "seg_max" has the simpler story, so let's start with that. The spec
says:

seg_max is the maximum number of segments that can be in a command. A
bidirectional command can include seg_max input segments and
seg_max output segments.

OvmfPkg/VirtioScsiDxe does not check the "seg_max" field. That's
because:

(1.1) VirtioScsiDxe considers the VIRTIO_SCSI_F_INOUT feature bit, and
rejects bidirectional requests from the
EFI_EXT_SCSI_PASS_THRU_PROTOCOL's caller with EFI_UNSUPPORTED
immediately, if the feature bit is clear on the device.

(1.2) VirtioScsiDxe never composes a virtio request (= descriptor chain)
with more than 4 descriptors:

(1.2.1) request header -- "virtio_scsi_req_cmd" up to and including the
"cdb" field,

(1.2.2) data block to transfer from the driver to the device (if any),

(1.2.3) response header -- "virtio_scsi_req_cmd" starting at the
"Device-writable part",

(1.2.4) data block to transfer from the device to the driver (if any).

The queue size is checked to be at least 4 in VirtioScsiInit(). And
neither (1.2.2) nor (1.2.4) require "seg_max" to be larger than 1.

So I don't think "seg_max" plays any role for the current symptom.

(1.3) Assuming some CDB (= SCSI command) exists that has another layer
of indirection, i.e., it transfers a list of pointers in the (1.2.2) or
(1.2.4) data blocks, then parsing such a CDB and list of pointers is not
the job of EFI_EXT_SCSI_PASS_THRU_PROTOCOL. It says "passthru" in the
name.

(Now I surely don't know if such a SCSI command exists at all, but if it
does, and "seg_max" in the virtio-scsi config header intends to limit
that, then an EFI_EXT_SCSI_PASS_THRU_PROTOCOL implementation cannot do
anything about it; it can't even expose "seg_max" to higher-level
callers.)


(2) Regarding "max_sectors", the spec says:

max_sectors is a hint to the driver about the maximum transfer size to
use.

OvmfPkg/VirtioScsiDxe honors and exposes this field to higher level
protocols, as follows:

(2.1) in VirtioScsiInit(), the field is read and saved. It is also
checked to be at least 2 (due to the division quoted in the next
bullet).

(2.2) PopulateRequest() contains the following logic:

//
// Catch oversized requests eagerly. If this condition evaluates to false,
// then the combined size of a bidirectional request will not exceed the
// virtio-scsi device's transfer limit either.
//
if (ALIGN_VALUE (Packet->OutTransferLength, 512) / 512
      > Dev->MaxSectors / 2 ||
    ALIGN_VALUE (Packet->InTransferLength, 512) / 512
      > Dev->MaxSectors / 2) {
  Packet->InTransferLength  = (Dev->MaxSectors / 2) * 512;
  Packet->OutTransferLength = (Dev->MaxSectors / 2) * 512;
  Packet->HostAdapterStatus =
    EFI_EXT_SCSI_STATUS_HOST_ADAPTER_DATA_OVERRUN_UNDERRUN;
  Packet->TargetStatus      = EFI_EXT_SCSI_STATUS_TARGET_GOOD;
  Packet->SenseDataLength   = 0;
  return EFI_BAD_BUFFER_SIZE;
}

That is, VirtioScsiDxe only lets such requests reach the device that do
not exceed *half* of "max_sectors" *per direction*. Meaning that, for
uni-directional requests, the check is stricter than "max_sectors"
requires, and for bi-directional requests, it is exactly as safe as
"max_sectors" requires. (VirtioScsiDxe will indeed refuse to drive a
device that has just 1 in "max_sectors", per (2.1), but that's not a
*practical* limitation, I would say.)

(2.3) When the above EFI_BAD_BUFFER_SIZE branch is taken, the maximum
transfer sizes that the device supports are exposed to the caller (per
direction), in accordance with the UEFI spec.

(2.4) The ScsiDiskRead10(), ScsiDiskWrite10(), ScsiDiskRead16(),
ScsiDiskWrite16() functions in
"MdeModulePkg/Bus/Scsi/ScsiDiskDxe/ScsiDisk.c" set the "NeedRetry"
output param to TRUE upon seeing EFI_BAD_BUFFER_SIZE.
I recently added more logging in MdeModulePkg/Bus/Scsi/ScsiDiskDxe/ScsiDisk.c,
which has a maximum setting related to the max SCSI I/O size.

For example, in the Read(10) command, MaxBlock is 0xFFFF and BlockSize is 0x200,
so the max ByteCount is 0xFFFF * 0x200 = 0x1FFFE00 (almost 32 MB).
After setting MaxBlock to 0x4000 to limit the max ByteCount to 8 MB,
Windows 2019 can boot from vhost-scsi in my local environment.
Does this 32 MB setting in ScsiDiskDxe match the one you mention
in (3.2) below for QEMU?

(I take the blame for implementing that, in commit fc3c83e0b355,
"MdeModulePkg: ScsiDiskDxe: recognize EFI_BAD_BUFFER_SIZE", 2015-09-10.)

(2.5) The ScsiDiskReadSectors() and ScsiDiskWriteSectors() functions,
which call the functions listed in (2.4), adjust the request size, and
resubmit the request, when "NeedRetry" is set on output.

(I take part of the blame for this as well, in commit 5abc2a70da4f,
"MdeModulePkg: ScsiDiskDxe: adapt SectorCount when shortening
transfers", 2015-09-10. I recommend reading the commit message on this
commit, as it describes a symptom somewhat similar to the current one.)


(3.1) Looking at "drivers/vhost/scsi.c" in the kernel, it doesn't seem
to fill in "max_sectors" at all.

(3.2) However, the QEMU part of the same device model does seem to
populate it; see "hw/scsi/vhost-scsi.c":

    DEFINE_PROP_UINT32("max_sectors", VirtIOSCSICommon, conf.max_sectors,
                       0xFFFF),

This field dates back to the original introduction of vhost-scsi, namely
QEMU commit 5e9be92d7752 ("vhost-scsi: new device supporting the
tcm_vhost Linux kernel module", 2013-04-19).

The default value is almost 64K sectors, making the default transfer
limit (from the device's perspective) almost 32 MB.

(3.3) And this QEMU-side limit looks orthogonal to the PREALLOC_SGLS and
PREALLOC_PROT_SGLS kernel macros.

IOW, it looks possible to exceed PREALLOC_SGLS / PREALLOC_PROT_SGLS
without exceeding "max_sectors".


(4) Annie: can you try launching QEMU with the following flag:

-global vhost-scsi-pci.max_sectors=2048

If that works, then I *guess* the kernel-side vhost device model could
interrogate the virtio-scsi config space for "max_sectors", and use the
value seen there in place of PREALLOC_SGLS / PREALLOC_PROT_SGLS.
Cool!

I can boot Win2019 VM up from vhost-scsi with the flag above.

Thanks

Annie


(5) PS: referring back to (1) "seg_max":

given that I'm looking at "hw/scsi/vhost-scsi.c" in QEMU anyway,
git-blame fingers commit 1bf8a989a566 ("virtio: make seg_max virtqueue
size dependent", 2020-01-06). This commit seems to confirm that
"seg_max" stands basically for the same thing as "virtqueue size", and
so my argument (1.2) is valid, and (1.3) is irrelevant.

Put differently, the commit confirms that, in (1.2.2) and (1.2.4),
VirtioScsiDxe indeed only relies on "seg_max" being >=1, and therefore
VirtioScsiDxe can safely ignore the actual (positive) value of
"seg_max".

Thanks,
Laszlo


Laszlo Ersek
 

On 05/27/20 17:58, annie li wrote:
Hi Laszlo,

(I sent out my reply to your original response twice, but my reply
somehow doesn't show up in https://edk2.groups.io/g/discuss. It is
confusing.
Apologies for that -- while I'm one of the moderators on edk2-devel (I
get moderation notifications with the other mods, and we distribute the
mod workload the best we can), I'm not one of the edk2-discuss mods.

Hmm, wait a sec -- it seems like I am? And I just don't get mod
notifications for edk2-discuss? Let me poke around in the settings :/

edk2-devel:

- Spam Control
- Messages are not moderated
- New Members moderated
- Unmoderate after 1 approved message
- Message Policies
- Allow Nonmembers to post (messages from nonmembers will be moderated
instead of rejected)

edk2-discuss:

- Spam Control
- Messages are not moderated
- New Members ARE NOT moderated
- Message Policies
- Allow Nonmembers to post (messages from nonmembers will be moderated
instead of rejected)

So I think the bug in our configuration is that nonmembers are moderated
on edk2-discuss just the same (because of the identical "Allow
Nonmembers to post" setting), *however*, mods don't get notified because
of the "New Members ARE NOT moderated" setting.

So let me tweak this -- I'm setting the same

- Spam Control
- New Members moderated
- Unmoderate after 1 approved message

for edk2-discuss as we have on edk2-devel, *plus* I'm removing the
following from the edk2-discuss list description: "Basically
unmoderated". (I mean I totally agree that it *should* be unmoderated,
but fully open posting doesn't seem possible on groups.io at all!)

Anyway, re-sending it here, hope you can get it...)
Thanks -- in case you CC me personally in addition to messaging the list
(which is the common "best practice" for mailing lists), then I'll
surely get it.

Following up below:

On 5/27/2020 7:43 AM, Laszlo Ersek wrote:
(2) Regarding "max_sectors", the spec says:

max_sectors is a hint to the driver about the maximum transfer
size to use.

OvmfPkg/VirtioScsiDxe honors and exposes this field to higher level
protocols, as follows:

(2.1) in VirtioScsiInit(), the field is read and saved. It is also
checked to be at least 2 (due to the division quoted in the next
bullet).

(2.2) PopulateRequest() contains the following logic:

  //
  // Catch oversized requests eagerly. If this condition evaluates to false,
  // then the combined size of a bidirectional request will not exceed the
  // virtio-scsi device's transfer limit either.
  //
  if (ALIGN_VALUE (Packet->OutTransferLength, 512) / 512
        > Dev->MaxSectors / 2 ||
      ALIGN_VALUE (Packet->InTransferLength,  512) / 512
        > Dev->MaxSectors / 2) {
    Packet->InTransferLength  = (Dev->MaxSectors / 2) * 512;
    Packet->OutTransferLength = (Dev->MaxSectors / 2) * 512;
    Packet->HostAdapterStatus =
      EFI_EXT_SCSI_STATUS_HOST_ADAPTER_DATA_OVERRUN_UNDERRUN;
    Packet->TargetStatus      = EFI_EXT_SCSI_STATUS_TARGET_GOOD;
    Packet->SenseDataLength   = 0;
    return EFI_BAD_BUFFER_SIZE;
  }

That is, VirtioScsiDxe only lets such requests reach the device that
do not exceed *half* of "max_sectors" *per direction*. Meaning that,
for uni-directional requests, the check is stricter than
"max_sectors" requires, and for bi-directional requests, it is
exactly as safe as "max_sectors" requires. (VirtioScsiDxe will indeed
refuse to drive a device that has just 1 in "max_sectors", per (2.1),
but that's not a *practical* limitation, I would say.)

(2.3) When the above EFI_BAD_BUFFER_SIZE branch is taken, the maximum
transfer sizes that the device supports are exposed to the caller
(per direction), in accordance with the UEFI spec.

(2.4) The ScsiDiskRead10(), ScsiDiskWrite10(), ScsiDiskRead16(),
ScsiDiskWrite16() functions in
"MdeModulePkg/Bus/Scsi/ScsiDiskDxe/ScsiDisk.c" set the "NeedRetry"
output param to TRUE upon seeing EFI_BAD_BUFFER_SIZE.
I recently added more logging in
MdeModulePkg/Bus/Scsi/ScsiDiskDxe/ScsiDisk.c, which
holds the maximum settings related to the max SCSI I/O size.

For example, for the Read(10) command, MaxBlock is 0xFFFF and
BlockSize is 0x200,
so the max ByteCount is 0xFFFF*0x200 = 0x1FFFE00 (~32M).
After setting MaxBlock to 0x4000 to limit the max ByteCount to 8M,
Windows 2019 can boot from vhost-scsi in my local environment.
It looks like this 32M setting in ScsiDiskDxe is consistent with the one you
mention for QEMU in (3.2) below?
Yes, that's possible -- maybe the caller starts with an even larger
transfer size, and then the EFI_BAD_BUFFER_SIZE logic is already at
work, but it only reduces the transfer size to 32MB (per "max_sectors"
from QEMU). And then all the protocols expect that to succeed, and when
it fails, the failure is propagated to the outermost caller.

(4) Annie: can you try launching QEMU with the following flag:

-global vhost-scsi-pci.max_sectors=2048

If that works, then I *guess* the kernel-side vhost device model
could interrogate the virtio-scsi config space for "max_sectors", and
use the value seen there in place of PREALLOC_SGLS /
PREALLOC_PROT_SGLS.
Cool!

I can boot Win2019 VM up from vhost-scsi with the flag above.
Thank you for confirming!

Laszlo


annie li <annie.li@...>
 

Hi Laszlo,

On 5/27/2020 2:00 PM, Laszlo Ersek wrote:
On 05/27/20 17:58, annie li wrote:
Hi Laszlo,

(I sent out my reply to your original response twice, but my reply
somehow doesn't show up in https://edk2.groups.io/g/discuss. It is
confusing.
Apologies for that -- while I'm one of the moderators on edk2-devel (I
get moderation notifications with the other mods, and we distribute the
mod workload the best we can), I'm not one of the edk2-discuss mods.

Hmm, wait a sec -- it seems like I am? And I just don't get mod
notifications for edk2-discuss? Let me poke around in the settings :/

edk2-devel:

- Spam Control
  - Messages are not moderated
  - New Members moderated
    - Unmoderate after 1 approved message
- Message Policies
  - Allow Nonmembers to post (messages from nonmembers will be moderated
    instead of rejected)

edk2-discuss:

- Spam Control
  - Messages are not moderated
  - New Members ARE NOT moderated
- Message Policies
  - Allow Nonmembers to post (messages from nonmembers will be moderated
    instead of rejected)

So I think the bug in our configuration is that nonmembers are moderated
on edk2-discuss just the same (because of the identical "Allow
Nonmembers to post" setting), *however*, mods don't get notified because
of the "New Members ARE NOT moderated" setting.

So let me tweak this -- I'm setting the same

- Spam Control
  - New Members moderated
    - Unmoderate after 1 approved message

for edk2-discuss as we have on edk2-devel, *plus* I'm removing the
following from the edk2-discuss list description: "Basically
unmoderated". (I mean I totally agree that it *should* be unmoderated,
but fully open posting doesn't seem possible on groups.io at all!)

Thank you for looking at it.
See my comments below,


      
Anyway, re-sending it here, hope you can get it...)
Thanks -- in case you CC me personally in addition to messaging the list
(which is the common "best practice" for mailing lists), then I'll
surely get it.

Following up below:

On 5/27/2020 7:43 AM, Laszlo Ersek wrote:
(2) Regarding "max_sectors", the spec says:

   max_sectors is a hint to the driver about the maximum transfer
               size to use.

OvmfPkg/VirtioScsiDxe honors and exposes this field to higher level
protocols, as follows:

(2.1) in VirtioScsiInit(), the field is read and saved. It is also
checked to be at least 2 (due to the division quoted in the next
bullet).

(2.2) PopulateRequest() contains the following logic:

   //
   // Catch oversized requests eagerly. If this condition evaluates to
false,
   // then the combined size of a bidirectional request will not
exceed the
   // virtio-scsi device's transfer limit either.
   //
   if (ALIGN_VALUE (Packet->OutTransferLength, 512) / 512
         > Dev->MaxSectors / 2 ||
       ALIGN_VALUE (Packet->InTransferLength,  512) / 512
         > Dev->MaxSectors / 2) {
     Packet->InTransferLength  = (Dev->MaxSectors / 2) * 512;
     Packet->OutTransferLength = (Dev->MaxSectors / 2) * 512;
     Packet->HostAdapterStatus =

EFI_EXT_SCSI_STATUS_HOST_ADAPTER_DATA_OVERRUN_UNDERRUN;
     Packet->TargetStatus      = EFI_EXT_SCSI_STATUS_TARGET_GOOD;
     Packet->SenseDataLength   = 0;
     return EFI_BAD_BUFFER_SIZE;
   }

That is, VirtioScsiDxe only lets such requests reach the device that
do not exceed *half* of "max_sectors" *per direction*. Meaning that,
for uni-directional requests, the check is stricter than
"max_sectors" requires, and for bi-directional requests, it is
exactly as safe as "max_sectors" requires. (VirtioScsiDxe will indeed
refuse to drive a device that has just 1 in "max_sectors", per (2.1),
but that's not a *practical* limitation, I would say.)

(2.3) When the above EFI_BAD_BUFFER_SIZE branch is taken, the maximum
transfer sizes that the device supports are exposed to the caller
(per direction), in accordance with the UEFI spec.

(2.4) The ScsiDiskRead10(), ScsiDiskWrite10(), ScsiDiskRead16(),
ScsiDiskWrite16() functions in
"MdeModulePkg/Bus/Scsi/ScsiDiskDxe/ScsiDisk.c" set the "NeedRetry"
output param to TRUE upon seeing EFI_BAD_BUFFER_SIZE.

      
I recently added more logging in
MdeModulePkg/Bus/Scsi/ScsiDiskDxe/ScsiDisk.c, which
holds the maximum settings related to the max SCSI I/O size.

For example, for the Read(10) command, MaxBlock is 0xFFFF and
BlockSize is 0x200,
so the max ByteCount is 0xFFFF*0x200 = 0x1FFFE00 (~32M).
After setting MaxBlock to 0x4000 to limit the max ByteCount to 8M,
Windows 2019 can boot from vhost-scsi in my local environment.
It looks like this 32M setting in ScsiDiskDxe is consistent with the one you
mention for QEMU in (3.2) below?
Yes, that's possible -- maybe the caller starts with an even larger
transfer size, and then the EFI_BAD_BUFFER_SIZE logic is already at
work, but it only reduces the transfer size to 32MB (per "max_sectors"
from QEMU). And then all the protocols expect that to succeed, and when
it fails, the failure is propagated to the outermost caller.
Nods.

(4) Annie: can you try launching QEMU with the following flag:

   -global vhost-scsi-pci.max_sectors=2048

If that works, then I *guess* the kernel-side vhost device model
could interrogate the virtio-scsi config space for "max_sectors", and
use the value seen there in place of PREALLOC_SGLS /
PREALLOC_PROT_SGLS.
I am a little confused here,
Both VHOST_SCSI_PREALLOC_SGLS(2048) and
TCM_VHOST_PREALLOC_PROT_SGLS(512) are hard coded in vhost/scsi.c.
...
sgl_count = vhost_scsi_calc_sgls(prot_iter, prot_bytes,
                                 TCM_VHOST_PREALLOC_PROT_SGLS);
...
sgl_count = vhost_scsi_calc_sgls(data_iter, data_bytes,
                                 VHOST_SCSI_PREALLOC_SGLS);

In vhost_scsi_calc_sgls, error is printed out if sgl_count is more than
TCM_VHOST_PREALLOC_PROT_SGLS or VHOST_SCSI_PREALLOC_SGLS.

    sgl_count = iov_iter_npages(iter, 0xffff);
    if (sgl_count > max_sgls) {
        pr_err("%s: requested sgl_count: %d exceeds pre-allocated"
               " max_sgls: %d\n", __func__, sgl_count, max_sgls);
        return -EINVAL;

    }

Looks like vhost-scsi doesn't interrogate the virtio-scsi config space for
"max_sectors". The guest virtio-scsi driver may read this configuration
out though.

So the following flag reduces the transfer size to 8M on QEMU side.
"-global vhost-scsi-pci.max_sectors=2048"
Due to this setting, even though the max ByteCount of the Read(10) command in
ScsiDiskDxe/ScsiDisk.c is 0xFFFF*0x200 = 0x1FFFE00 (~32M), under the
EFI_BAD_BUFFER_SIZE logic, ScsiDiskDxe/ScsiDisk.c retries and
adjusts the request size to <= 8M?

Although Win2019 boots from vhost-scsi with the above flag, I assume we still
need to enlarge the value of VHOST_SCSI_PREALLOC_SGLS in vhost-scsi for the
final fix, instead of setting max_sectors through QEMU options?

Thanks
Annie

Cool!

I can boot Win2019 VM up from vhost-scsi with the flag above.
Thank you for confirming!

Laszlo


annie li
 

On 5/27/2020 2:00 PM, Laszlo Ersek wrote:
On 05/27/20 17:58, annie li wrote:
Hi Laszlo,

(I sent out my reply to your original response twice, but my reply
somehow doesn't show up in https://edk2.groups.io/g/discuss. It is
confusing.
Apologies for that -- while I'm one of the moderators on edk2-devel (I
get moderation notifications with the other mods, and we distribute the
mod workload the best we can), I'm not one of the edk2-discuss mods.

Hmm, wait a sec -- it seems like I am? And I just don't get mod
notifications for edk2-discuss? Let me poke around in the settings :/

edk2-devel:

- Spam Control
- Messages are not moderated
- New Members moderated
- Unmoderate after 1 approved message
- Message Policies
- Allow Nonmembers to post (messages from nonmembers will be moderated
instead of rejected)

edk2-discuss:

- Spam Control
- Messages are not moderated
- New Members ARE NOT moderated
- Message Policies
- Allow Nonmembers to post (messages from nonmembers will be moderated
instead of rejected)

So I think the bug in our configuration is that nonmembers are moderated
on edk2-discuss just the same (because of the identical "Allow
Nonmembers to post" setting), *however*, mods don't get notified because
of the "New Members ARE NOT moderated" setting.

So let me tweak this -- I'm setting the same

- Spam Control
- New Members moderated
- Unmoderate after 1 approved message

for edk2-discuss as we have on edk2-devel, *plus* I'm removing the
following from the edk2-discuss list description: "Basically
unmoderated". (I mean I totally agree that it *should* be unmoderated,
but fully open posting doesn't seem possible on groups.io at all!)
Thanks for addressing it.
Another email I sent out yesterday didn't reach edk2-discuss.
I have joined this group and hope this email shows up this time.
See my following comments.
Anyway, re-sending it here, hope you can get it...)
Thanks -- in case you CC me personally in addition to messaging the list
(which is the common "best practice" for mailing lists), then I'll
surely get it.

Following up below:

On 5/27/2020 7:43 AM, Laszlo Ersek wrote:
(2) Regarding "max_sectors", the spec says:

max_sectors is a hint to the driver about the maximum transfer
size to use.

OvmfPkg/VirtioScsiDxe honors and exposes this field to higher level
protocols, as follows:

(2.1) in VirtioScsiInit(), the field is read and saved. It is also
checked to be at least 2 (due to the division quoted in the next
bullet).

(2.2) PopulateRequest() contains the following logic:

  //
  // Catch oversized requests eagerly. If this condition evaluates to false,
  // then the combined size of a bidirectional request will not exceed the
  // virtio-scsi device's transfer limit either.
  //
  if (ALIGN_VALUE (Packet->OutTransferLength, 512) / 512
        > Dev->MaxSectors / 2 ||
      ALIGN_VALUE (Packet->InTransferLength,  512) / 512
        > Dev->MaxSectors / 2) {
    Packet->InTransferLength  = (Dev->MaxSectors / 2) * 512;
    Packet->OutTransferLength = (Dev->MaxSectors / 2) * 512;
    Packet->HostAdapterStatus =
      EFI_EXT_SCSI_STATUS_HOST_ADAPTER_DATA_OVERRUN_UNDERRUN;
    Packet->TargetStatus      = EFI_EXT_SCSI_STATUS_TARGET_GOOD;
    Packet->SenseDataLength   = 0;
    return EFI_BAD_BUFFER_SIZE;
  }

That is, VirtioScsiDxe only lets such requests reach the device that
do not exceed *half* of "max_sectors" *per direction*. Meaning that,
for uni-directional requests, the check is stricter than
"max_sectors" requires, and for bi-directional requests, it is
exactly as safe as "max_sectors" requires. (VirtioScsiDxe will indeed
refuse to drive a device that has just 1 in "max_sectors", per (2.1),
but that's not a *practical* limitation, I would say.)

(2.3) When the above EFI_BAD_BUFFER_SIZE branch is taken, the maximum
transfer sizes that the device supports are exposed to the caller
(per direction), in accordance with the UEFI spec.

(2.4) The ScsiDiskRead10(), ScsiDiskWrite10(), ScsiDiskRead16(),
ScsiDiskWrite16() functions in
"MdeModulePkg/Bus/Scsi/ScsiDiskDxe/ScsiDisk.c" set the "NeedRetry"
output param to TRUE upon seeing EFI_BAD_BUFFER_SIZE.
Thanks for the detailed explanation, it is very helpful.
I recently added more logging in
MdeModulePkg/Bus/Scsi/ScsiDiskDxe/ScsiDisk.c, which
holds the maximum settings related to the max SCSI I/O size.

For example, for the Read(10) command, MaxBlock is 0xFFFF and
BlockSize is 0x200,
so the max ByteCount is 0xFFFF*0x200 = 0x1FFFE00 (~32M).
After setting MaxBlock to 0x4000 to limit the max ByteCount to 8M,
Windows 2019 can boot from vhost-scsi in my local environment.
It looks like this 32M setting in ScsiDiskDxe is consistent with the one you
mention for QEMU in (3.2) below?
Yes, that's possible -- maybe the caller starts with an even larger
transfer size, and then the EFI_BAD_BUFFER_SIZE logic is already at
work, but it only reduces the transfer size to 32MB (per "max_sectors"
from QEMU). And then all the protocols expect that to succeed, and when
it fails, the failure is propagated to the outermost caller.

(4) Annie: can you try launching QEMU with the following flag:

-global vhost-scsi-pci.max_sectors=2048
This limits the I/O size to 1M; the EFI_BAD_BUFFER_SIZE logic reduces the
I/O size to 512K for uni-directional requests.
To send the biggest I/O (8M) allowed by the current vhost-scsi setting, I adjusted
the value to 0x3FFF; the EFI_BAD_BUFFER_SIZE logic then reduces the I/O size
to 4M for uni-directional requests.
   -global vhost-scsi-pci.max_sectors=0x3FFF
0x4000 doesn't survive here.

If that works, then I *guess* the kernel-side vhost device model
could interrogate the virtio-scsi config space for "max_sectors", and
use the value seen there in place of PREALLOC_SGLS /
PREALLOC_PROT_SGLS.
You mean the vhost device on the guest side here, right? The Windows
virtio-scsi driver does read out max_sectors. Even though the driver
doesn't make use of it later, it could be used to adjust the transfer length
of I/O.

I guess you are not mentioning the vhost-scsi on the host?
Both VHOST_SCSI_PREALLOC_SGLS(2048) and
TCM_VHOST_PREALLOC_PROT_SGLS(512) are hard coded in vhost/scsi.c.
...
sgl_count = vhost_scsi_calc_sgls(prot_iter, prot_bytes,
TCM_VHOST_PREALLOC_PROT_SGLS);
....
sgl_count = vhost_scsi_calc_sgls(data_iter, data_bytes,
VHOST_SCSI_PREALLOC_SGLS);


In vhost_scsi_calc_sgls, error is printed out if sgl_count is more than
TCM_VHOST_PREALLOC_PROT_SGLS or VHOST_SCSI_PREALLOC_SGLS.

    sgl_count = iov_iter_npages(iter, 0xffff);
    if (sgl_count > max_sgls) {
        pr_err("%s: requested sgl_count: %d exceeds pre-allocated"
               " max_sgls: %d\n", __func__, sgl_count, max_sgls);
        return -EINVAL;

    }
Looks like vhost-scsi doesn't interrogate the virtio-scsi config space for
"max_sectors".

Although Win2019 boots from vhost-scsi with the above flag, I assume we still
need to enlarge the value of VHOST_SCSI_PREALLOC_SGLS in vhost-scsi for the
final fix, instead of setting max_sectors through QEMU options?
Adding a specific QEMU command-line option just for booting Win2019 from
vhost-scsi doesn't seem appropriate.
Suggestions?

Thanks
Annie
Cool!

I can boot Win2019 VM up from vhost-scsi with the flag above.
Thank you for confirming!

Laszlo


Laszlo Ersek
 

On 05/28/20 00:04, annie li wrote:
On 5/27/2020 2:00 PM, Laszlo Ersek wrote:
(4) Annie: can you try launching QEMU with the following flag:

    -global vhost-scsi-pci.max_sectors=2048

If that works, then I *guess* the kernel-side vhost device model
could interrogate the virtio-scsi config space for "max_sectors", and
use the value seen there in place of PREALLOC_SGLS /
PREALLOC_PROT_SGLS.
I am a little confused here,
Both VHOST_SCSI_PREALLOC_SGLS(2048) and
TCM_VHOST_PREALLOC_PROT_SGLS(512) are hard coded in vhost/scsi.c.
...
sgl_count = vhost_scsi_calc_sgls(prot_iter, prot_bytes,
TCM_VHOST_PREALLOC_PROT_SGLS);
....
sgl_count = vhost_scsi_calc_sgls(data_iter, data_bytes,
VHOST_SCSI_PREALLOC_SGLS);

In vhost_scsi_calc_sgls, error is printed out if sgl_count is more than
TCM_VHOST_PREALLOC_PROT_SGLS or VHOST_SCSI_PREALLOC_SGLS.

    sgl_count = iov_iter_npages(iter, 0xffff);
    if (sgl_count > max_sgls) {
        pr_err("%s: requested sgl_count: %d exceeds pre-allocated"
               " max_sgls: %d\n", __func__, sgl_count, max_sgls);
        return -EINVAL;

    }

Looks like vhost-scsi doesn't interrogate the virtio-scsi config space for
"max_sectors". The guest virtio-scsi driver may read this configuration
out though.
Yes.


So the following flag reduces the transfer size to 8M on QEMU side.
"-global vhost-scsi-pci.max_sectors=2048"
Due to this setting, even though the max ByteCount of the Read(10) command in
ScsiDiskDxe/ScsiDisk.c is 0xFFFF*0x200 = 0x1FFFE00 (~32M), under the
EFI_BAD_BUFFER_SIZE logic, ScsiDiskDxe/ScsiDisk.c retries and
adjusts the request size to <= 8M?
Yes.

The transfer size that ultimately reaches the device is the minimum of
three quantities:

(a) the transfer size requested by the caller (i.e., the UEFI application),

(b) the limit set by the READ(10) / READ(16) decision (i.e., MaxBlock),

(c) the transfer size limit enforced / reported by
EFI_EXT_SCSI_PASS_THRU_PROTOCOL.PassThru(), with EFI_BAD_BUFFER_SIZE

Whichever is the smallest from the three, determines the transfer size
that the device ultimately sees in the request.

And then *that* transfer size must satisfy PREALLOC_SGLS and/or
PREALLOC_PROT_SGLS (2048 4K pages: 0x80_0000 bytes).

In your original use case, (a) is 0x93_F400 bytes, (b) is 0x1FF_FE00
bytes, and (c) is 0x1FF_FE00 too. Therefore the minimum is 0x93_F400, so
that is what reaches the device. And because 0x93_F400 exceeds
0x80_0000, the request fails.

When you set "-global vhost-scsi-pci.max_sectors=2048", that lowers (c)
to 0x10_0000. (a) and (b) remain unchanged. Therefore the new minimum
(which finally reaches the device) is 0x10_0000. This does not exceed
0x80_0000, so the request succeeds.

... In my prior email, I think I missed a detail: while the unit for
QEMU's "vhost-scsi-pci.max_sectors" property is a "sector" (512 bytes),
the unit for PREALLOC_SGLS and PREALLOC_PROT_SGLS in the kernel device
model seems to be a *page*, rather than a sector. (I don't think I've
ever checked iov_iter_npages() before.)

Therefore the QEMU flag that I recommended previously was too strict.
Can you try this instead, please?:

-global vhost-scsi-pci.max_sectors=16384

This should set (c) to 0x80_0000 bytes. And so the minimum of {(a), (b),
(c)} will be 0x80_0000 bytes -- exactly what PREALLOC_SGLS and
PREALLOC_PROT_SGLS require.

Although Win2019 boots from vhost-scsi with the above flag, I assume we still
need to enlarge the value of VHOST_SCSI_PREALLOC_SGLS in vhost-scsi for the
final fix, instead of setting max_sectors through QEMU options?
There are multiple ways (alternatives) to fix the issue.

- use larger constants for PREALLOC_SGLS and PREALLOC_PROT_SGLS in the
kernel;

- or replace the PREALLOC_SGLS and PREALLOC_PROT_SGLS constants in the
kernel altogether, with such logic that dynamically calculates them from
the "max_sectors" virtio-scsi config header field;

- or change the QEMU default for "vhost-scsi-pci.max_sectors", from
0xFFFF to 16384.

Either should work.

Thanks,
Laszlo


Laszlo Ersek
 

On 05/28/20 18:39, annie li wrote:
On 5/27/2020 2:00 PM, Laszlo Ersek wrote:
(4) Annie: can you try launching QEMU with the following flag:

    -global vhost-scsi-pci.max_sectors=2048
This limits the I/O size to 1M.
Indeed -- as I just pointed out under your other email, I previously
missed that the host kernel-side unit was not "sector" but "4K page". So
yes, the value 2048 above is too strict.

The EFI_BAD_BUFFER_SIZE logic reduces
I/O size to 512K for uni-directional requests.
To send biggest I/O(8M) allowed by current vhost-scsi setting, I adjust the
value to 0x3FFF. The EFI_BAD_BUFFER_SIZE logic reduces I/O size to 4M
for uni-directional requests.
   -global vhost-scsi-pci.max_sectors=0x3FFF
OK!

0x4000 doesn't survive here.
That's really interesting. I'm not sure why that happens.

... Is it possible that vhost_scsi_handle_vq() -- in the host kernel --
puts stuff in the scatter-gather list *other* than the transfer buffers?
Some headers and such? Maybe those headers need an extra page.

If that works, then I *guess* the kernel-side vhost device model
could interrogate the virtio-scsi config space for "max_sectors", and
use the value seen there in place of PREALLOC_SGLS /
PREALLOC_PROT_SGLS.
You mean the vhost device on the guest side here, right? The Windows
virtio-scsi driver does read out max_sectors. Even though the driver
doesn't make use of it later, it could be used to adjust the transfer length
of I/O.
With vhost, the virtio-scsi device model is split between QEMU and the
host kernel. While QEMU manages the "max_sectors" property (= accepts it
from the command line, and exposes it to the guest driver), the host
kernel (i.e., the other half of the device model) ignores the same property.

Consequently, although the guest driver obeys "max_sectors" for limiting
the transfer size, the host kernel's constants may prove *stricter* than
that. Because, the host kernel ignores "max_sectors". So one idea is to
make the host kernel honor the "max_sectors" limit that QEMU manages.

The other two ideas are: use larger constants in the kernel, or use a
smaller "max_sectors" default in QEMU.

The goal behind all three alternatives is the same: the limit that QEMU
exposes to the guest driver should satisfy the host kernel.

Thanks
Laszlo


annie li
 

On 5/28/2020 6:08 PM, Laszlo Ersek wrote:
On 05/28/20 18:39, annie li wrote:
On 5/27/2020 2:00 PM, Laszlo Ersek wrote:
(4) Annie: can you try launching QEMU with the following flag:

    -global vhost-scsi-pci.max_sectors=2048
This limits the I/O size to 1M.
Indeed -- as I just pointed out under your other email, I previously
missed that the host kernel-side unit was not "sector" but "4K page". So
yes, the value 2048 above is too strict.

The EFI_BAD_BUFFER_SIZE logic reduces
I/O size to 512K for uni-directional requests.
To send biggest I/O(8M) allowed by current vhost-scsi setting, I adjust the
value to 0x3FFF. The EFI_BAD_BUFFER_SIZE logic reduces I/O size to 4M
for uni-directional requests.
   -global vhost-scsi-pci.max_sectors=0x3FFF
OK!

0x4000 doesn't survive here.
That's really interesting.
Yup
I'm not sure why that happens.
Then I found out it is related to operations on this VM, see following.
... Is it possible that vhost_scsi_handle_vq() -- in the host kernel --
puts stuff in the scatter-gather list *other* than the transfer buffers?
Some headers and such? Maybe those headers need an extra page.
I ran more tests and found that the boot failure happens randomly when I boot the VM
right after it was terminated by Ctrl+C directly from the QEMU monitor, no matter
whether max_sectors is 2048, 16383 or 16384. It fails about 7 times out of 20.

So my previous statement about 0x4000 and 0x3FFF isn't accurate.
It is just that booting happened to succeed with 0x3FFF (16383), but not with 0x4000 (16384).

Also, when this failure happens, dmesg doesn't print out following errors,
vhost_scsi_calc_sgls: requested sgl_count: 2368 exceeds pre-allocated max_sgls: 2048

This new failure is a totally different issue from the one caused by max-sized I/O. Per my
OVMF debug log, the biggest I/O size is only about 1M, meaning Windows 2019
hadn't sent out big-sized I/O yet.

The interesting part is that I didn't see this new failure happen if I boot a VM that
was previously shut down gracefully from inside the Windows guest.

If that works, then I *guess* the kernel-side vhost device model
could interrogate the virtio-scsi config space for "max_sectors", and
use the value seen there in place of PREALLOC_SGLS /
PREALLOC_PROT_SGLS.
You mean the vhost device on the guest side here, right? The Windows
virtio-scsi driver does read out max_sectors. Even though the driver
doesn't make use of it later, it could be used to adjust the transfer length
of I/O.
With vhost, the virtio-scsi device model is split between QEMU and the
host kernel. While QEMU manages the "max_sectors" property (= accepts it
from the command line, and exposes it to the guest driver), the host
kernel (i.e., the other half of the device model) ignores the same property.

Consequently, although the guest driver obeys "max_sectors" for limiting
the transfer size, the host kernel's constants may prove *stricter* than
that. Because, the host kernel ignores "max_sectors". So one idea is to
make the host kernel honor the "max_sectors" limit that QEMU manages.
This involves changes in both the kernel and QEMU. I guess it may be more
straightforward for the kernel to control the transfer size based on the memory consumed.

The other two ideas are: use larger constants in the kernel, or use a
smaller "max_sectors" default in QEMU.
I prefer fixing it by using larger constants in the kernel; this also avoids splitting
big-sized I/O, as a smaller "max_sectors" default in QEMU would.
Following is the code change I made in the kernel code vhost/scsi.c:
-#define VHOST_SCSI_PREALLOC_SGLS 2048
-#define VHOST_SCSI_PREALLOC_UPAGES 2048
+#define VHOST_SCSI_PREALLOC_SGLS 2560
+#define VHOST_SCSI_PREALLOC_UPAGES 2560

Thanks
Annie

The goal behind all three alternatives is the same: the limit that QEMU
exposes to the guest driver should satisfy the host kernel.

Thanks
Laszlo


annie li
 

On 5/28/2020 5:51 PM, Laszlo Ersek wrote:
On 05/28/20 00:04, annie li wrote:
On 5/27/2020 2:00 PM, Laszlo Ersek wrote:
(4) Annie: can you try launching QEMU with the following flag:

    -global vhost-scsi-pci.max_sectors=2048

If that works, then I *guess* the kernel-side vhost device model
could interrogate the virtio-scsi config space for "max_sectors", and
use the value seen there in place of PREALLOC_SGLS /
PREALLOC_PROT_SGLS.
I am a little confused here,
Both VHOST_SCSI_PREALLOC_SGLS(2048) and
TCM_VHOST_PREALLOC_PROT_SGLS(512) are hard coded in vhost/scsi.c.
...
sgl_count = vhost_scsi_calc_sgls(prot_iter, prot_bytes,
TCM_VHOST_PREALLOC_PROT_SGLS);
....
sgl_count = vhost_scsi_calc_sgls(data_iter, data_bytes,
VHOST_SCSI_PREALLOC_SGLS);

In vhost_scsi_calc_sgls, error is printed out if sgl_count is more than
TCM_VHOST_PREALLOC_PROT_SGLS or VHOST_SCSI_PREALLOC_SGLS.

    sgl_count = iov_iter_npages(iter, 0xffff);
    if (sgl_count > max_sgls) {
        pr_err("%s: requested sgl_count: %d exceeds pre-allocated"
               " max_sgls: %d\n", __func__, sgl_count, max_sgls);
        return -EINVAL;

    }

Looks like vhost-scsi doesn't interrogate the virtio-scsi config space for
"max_sectors". The guest virtio-scsi driver may read this configuration
out though.
Yes.

So the following flag reduces the transfer size to 8M on the QEMU side:
"-global vhost-scsi-pci.max_sectors=2048"
Due to this setting, even though the max ByteCount of the Read(10) command in
ScsiDiskDxe/ScsiDisk.c is 0xFFFF*0x200 = 0x1FFFE00 (32M), under the
EFI_BAD_BUFFER_SIZE logic ScsiDiskDxe/ScsiDisk.c retries and
adjusts the request size to <= 8M?
Yes.

The transfer size that ultimately reaches the device is the minimum of
three quantities:

(a) the transfer size requested by the caller (i.e., the UEFI application),

(b) the limit set by the READ(10) / READ(16) decision (i.e., MaxBlock),

(c) the transfer size limit enforced / reported by
EFI_EXT_SCSI_PASS_THRU_PROTOCOL.PassThru(), with EFI_BAD_BUFFER_SIZE

Whichever is the smallest from the three, determines the transfer size
that the device ultimately sees in the request.

And then *that* transfer size must satisfy PREALLOC_SGLS and/or
PREALLOC_PROT_SGLS (2048 4K pages: 0x80_0000 bytes).

In your original use case, (a) is 0x93_F400 bytes, (b) is 0x1FF_FE00
bytes, and (c) is 0x1FF_FE00 too. Therefore the minimum is 0x93_F400, so
that is what reaches the device. And because 0x93_F400 exceeds
0x80_0000, the request fails.

When you set "-global vhost-scsi-pci.max_sectors=2048", that lowers (c)
to 0x10_0000. (a) and (b) remain unchanged. Therefore the new minimum
(which finally reaches the device) is 0x10_0000. This does not exceed
0x80_0000, so the request succeeds.
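Laszlo's minimum-of-three rule can be sketched as a toy model (illustrative Python, not OVMF code; the numbers are the ones from this thread):

```python
def effective_transfer_size(caller_request, cmd_limit, passthru_limit):
    # The device ultimately sees the smallest of the three limits:
    # (a) caller's request, (b) READ(10)/READ(16) MaxBlock limit,
    # (c) the PassThru() / EFI_BAD_BUFFER_SIZE limit.
    return min(caller_request, cmd_limit, passthru_limit)

# Original failing case: the caller's 0x93F400 bytes reach the device,
# exceeding the kernel's 0x800000-byte (8M) preallocation ceiling.
print(hex(effective_transfer_size(0x93F400, 0x1FFFE00, 0x1FFFE00)))  # 0x93f400

# With "-global vhost-scsi-pci.max_sectors=2048", (c) drops to
# 2048 * 512 = 0x100000, which becomes the new minimum.
print(hex(effective_transfer_size(0x93F400, 0x1FFFE00, 2048 * 512)))  # 0x100000
```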
Much clear now, thank you!

... In my prior email, I think I missed a detail: while the unit for
QEMU's "vhost-scsi-pci.max_sectors" property is a "sector" (512 bytes),
the unit for PREALLOC_SGLS and PREALLOC_PROT_SGLS in the kernel device
model seems to be a *page*, rather than a sector. (I don't think I've
ever checked iov_iter_npages() before.)

Therefore the QEMU flag that I recommended previously was too strict.
Can you try this instead, please?:

-global vhost-scsi-pci.max_sectors=16384
It works, but I ran into another failure. I put the details in another email.

This should set (c) to 0x80_0000 bytes. And so the minimum of {(a), (b),
(c)} will be 0x80_0000 bytes -- exactly what PREALLOC_SGLS and
PREALLOC_PROT_SGLS require.
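The unit mismatch is easy to double-check: QEMU's "max_sectors" counts 512-byte sectors, while the kernel's preallocation counts 4 KiB pages (sketch arithmetic only):

```python
SECTOR = 512   # unit of QEMU's vhost-scsi-pci.max_sectors
PAGE = 4096    # unit of VHOST_SCSI_PREALLOC_SGLS (per iov_iter_npages)

# max_sectors=16384 caps transfers at 16384 * 512 = 0x800000 bytes,
# exactly what 2048 preallocated SGL pages can hold.
print(16384 * SECTOR == 2048 * PAGE)  # True

# The earlier suggestion of max_sectors=2048 was too strict:
print(hex(2048 * SECTOR))  # 0x100000, only 1M
```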

Although Win2019 boots from vhost-scsi with the above flag, I assume we still
need to enlarge the value of VHOST_SCSI_PREALLOC_SGLS in vhost-scsi for the
final fix, instead of setting max_sectors through QEMU options?
There are multiple ways (alternatives) to fix the issue.

- use larger constants for PREALLOC_SGLS and PREALLOC_PROT_SGLS in the
kernel;

- or replace the PREALLOC_SGLS and PREALLOC_PROT_SGLS constants in the
kernel altogether, with such logic that dynamically calculates them from
the "max_sectors" virtio-scsi config header field;

- or change the QEMU default for "vhost-scsi-pci.max_sectors", from
0xFFFF to 16384.
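For the second alternative, the derivation could look roughly like this (a hypothetical sketch of the calculation, in Python for brevity; vhost/scsi.c would of course do this in C):

```python
SECTOR = 512
PAGE = 4096

def sgls_for_max_sectors(max_sectors):
    """Worst-case page count needed for a transfer of max_sectors sectors."""
    max_bytes = max_sectors * SECTOR
    # A buffer that is not page-aligned can straddle one extra page.
    return max_bytes // PAGE + 1

print(sgls_for_max_sectors(16384))  # 2049: 8M plus alignment slack
print(sgls_for_max_sectors(0xFFFF)) # 8192: what QEMU's current default needs
```

The second print also illustrates why QEMU's default of 0xFFFF sectors cannot be satisfied by the hard-coded 2048-page preallocation.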
I prefer fixing it on the kernel side; details are in another email too. :-)

Thanks
Annie

Either should work.

Thanks,
Laszlo



Laszlo Ersek
 

On 05/29/20 16:47, annie li wrote:

I ran more tests and found that the boot failure happens randomly when I
boot the VM right after it was terminated by Ctrl+C directly from the
QEMU monitor, no matter whether max_sectors is 2048, 16383 or 16384. The
failure rate is about 7 out of 20.

So my previous statement about 0x4000 and 0x3FFF isn't accurate. It is
just that booting happened to succeed with 0x3FFF (16383), but not
with 0x4000 (16384).

Also, when this failure happens, dmesg doesn't print out the following
error: vhost_scsi_calc_sgls: requested sgl_count: 2368 exceeds
pre-allocated max_sgls: 2048

This new failure is a totally different issue from the one caused by
max-sized I/O. Per my OVMF debug log, the biggest I/O size is only
about 1M. This means Windows 2019 hadn't sent out big I/O yet.

The interesting part is that I didn't see this new failure happen if I
boot a VM that was previously shut down gracefully from inside the
Windows guest.
Can you build the host kernel with "CONFIG_VHOST_SCSI=m", and repeat
your Ctrl-C test such that you remove and re-insert "vhost_scsi.ko"
after every Ctrl-C?

My guess is that, when you kill QEMU with Ctrl-C, "vhost_scsi.ko" might
not clean up something, and that could break the next guest boot. If you
re-insert "vhost_scsi.ko" for each QEMU launch, and that ends up masking
the symptom, then there's likely some resource leak in "vhost_scsi.ko".

Just a guess.

Thanks
Laszlo


annie li
 

On 6/2/2020 7:44 AM, Laszlo Ersek wrote:
On 05/29/20 16:47, annie li wrote:

I ran more tests and found that the boot failure happens randomly when I
boot the VM right after it was terminated by Ctrl+C directly from the
QEMU monitor, no matter whether max_sectors is 2048, 16383 or 16384. The
failure rate is about 7 out of 20.

So my previous statement about 0x4000 and 0x3FFF isn't accurate. It is
just that booting happened to succeed with 0x3FFF (16383), but not
with 0x4000 (16384).

Also, when this failure happens, dmesg doesn't print out the following
error: vhost_scsi_calc_sgls: requested sgl_count: 2368 exceeds
pre-allocated max_sgls: 2048

This new failure is a totally different issue from the one caused by
max-sized I/O. Per my OVMF debug log, the biggest I/O size is only
about 1M. This means Windows 2019 hadn't sent out big I/O yet.

The interesting part is that I didn't see this new failure happen if I
boot a VM that was previously shut down gracefully from inside the
Windows guest.
Can you build the host kernel with "CONFIG_VHOST_SCSI=m", and repeat
your Ctrl-C test such that you remove and re-insert "vhost_scsi.ko"
after every Ctrl-C?
I am using targetcli to create the SCSI LUN that the VM boots from. The vhost_scsi
module gets loaded right after I create the target in /vhost. However, I cannot remove
the vhost_scsi module after that; it always complains "Module vhost_scsi is in use"
(the same even after I delete the target in targetcli).
Maybe it is related to targetcli, but I didn't try other tools yet.

My guess is that, when you kill QEMU with Ctrl-C, "vhost_scsi.ko" might
not clean up something, and that could break the next guest boot. If you
re-insert "vhost_scsi.ko" for each QEMU launch, and that ends up masking
the symptom, then there's likely some resource leak in "vhost_scsi.ko".
Nods, it is possible.

Thanks
Annie


Just a guess.

Thanks
Laszlo



Laszlo Ersek
 

On 06/03/20 00:19, annie li wrote:
On 6/2/2020 7:44 AM, Laszlo Ersek wrote:
On 05/29/20 16:47, annie li wrote:

I ran more tests and found that the boot failure happens randomly when I
boot the VM right after it was terminated by Ctrl+C directly from the
QEMU monitor, no matter whether max_sectors is 2048, 16383 or 16384. The
failure rate is about 7 out of 20.

So my previous statement about 0x4000 and 0x3FFF isn't accurate. It is
just that booting happened to succeed with 0x3FFF (16383), but not
with 0x4000 (16384).

Also, when this failure happens, dmesg doesn't print out the following
error: vhost_scsi_calc_sgls: requested sgl_count: 2368 exceeds
pre-allocated max_sgls: 2048

This new failure is a totally different issue from the one caused by
max-sized I/O. Per my OVMF debug log, the biggest I/O size is only
about 1M. This means Windows 2019 hadn't sent out big I/O yet.

The interesting part is that I didn't see this new failure happen if I
boot a VM that was previously shut down gracefully from inside the
Windows guest.
Can you build the host kernel with "CONFIG_VHOST_SCSI=m", and repeat
your Ctrl-C test such that you remove and re-insert "vhost_scsi.ko"
after every Ctrl-C?
I am using targetcli to create the SCSI LUN that the VM boots from. The
vhost_scsi module gets loaded right after I create the target in /vhost.
However, I cannot remove the vhost_scsi module after that; it always
complains "Module vhost_scsi is in use"
(the same even after I delete the target in targetcli).
Maybe it is related to targetcli, but I didn't try other tools yet.
Can you check with "lsmod" if other modules use vhost_scsi?

If you shut down QEMU gracefully, can you rmmod vhost_scsi in that case?

I wonder if the failure to remove the vhost_scsi module is actually
another sign of the same (as yet unknown) leaked reference.

Thanks
Laszlo


My guess is that, when you kill QEMU with Ctrl-C, "vhost_scsi.ko" might
not clean up something, and that could break the next guest boot. If you
re-insert "vhost_scsi.ko" for each QEMU launch, and that ends up masking
the symptom, then there's likely some resource leak in "vhost_scsi.ko".
Nods, it is possible.

Thanks
Annie


Just a guess.

Thanks
Laszlo




annie li
 

On 6/3/2020 9:33 AM, Laszlo Ersek wrote:
On 06/03/20 00:19, annie li wrote:
On 6/2/2020 7:44 AM, Laszlo Ersek wrote:
On 05/29/20 16:47, annie li wrote:

I ran more tests and found that the boot failure happens randomly when I
boot the VM right after it was terminated by Ctrl+C directly from the
QEMU monitor, no matter whether max_sectors is 2048, 16383 or 16384. The
failure rate is about 7 out of 20.

So my previous statement about 0x4000 and 0x3FFF isn't accurate. It is
just that booting happened to succeed with 0x3FFF (16383), but not
with 0x4000 (16384).

Also, when this failure happens, dmesg doesn't print out the following
error: vhost_scsi_calc_sgls: requested sgl_count: 2368 exceeds
pre-allocated max_sgls: 2048

This new failure is a totally different issue from the one caused by
max-sized I/O. Per my OVMF debug log, the biggest I/O size is only
about 1M. This means Windows 2019 hadn't sent out big I/O yet.

The interesting part is that I didn't see this new failure happen if I
boot a VM that was previously shut down gracefully from inside the
Windows guest.
Can you build the host kernel with "CONFIG_VHOST_SCSI=m", and repeat
your Ctrl-C test such that you remove and re-insert "vhost_scsi.ko"
after every Ctrl-C?
I am using targetcli to create the SCSI LUN that the VM boots from. The
vhost_scsi module gets loaded right after I create the target in /vhost.
However, I cannot remove the vhost_scsi module after that; it always
complains "Module vhost_scsi is in use"
(the same even after I delete the target in targetcli).
Maybe it is related to targetcli, but I didn't try other tools yet.
Can you check with "lsmod" if other modules use vhost_scsi?
lsmod shows vhost_scsi has a use count of 4; I assume these 4 references are
related to targetcli.
lsmod |grep vhost_scsi
vhost_scsi             36864  4
vhost                      53248  1 vhost_scsi
target_core_mod       380928  14 target_core_file,target_core_iblock,iscsi_target_mod,vhost_scsi,target_core_pscsi,target_core_user

I was thinking maybe these target_* modules were using vhost_scsi, so I removed
the following modules with modprobe -r:
target_core_file,target_core_iblock,vhost_scsi,target_core_pscsi,target_core_user
Then lsmod shows the "used by" count down to 3:
vhost_scsi             36864  3
vhost                  53248  1 vhost_scsi
target_core_mod       380928  6 iscsi_target_mod,vhost_scsi
However, the others cannot be removed. "rmmod --force" doesn't help either,
and "dmesg | grep vhost_scsi" doesn't show much useful information.

If you shut down QEMU gracefully, can you rmmod vhost_scsi in that case?
No, I cannot rmmod these modules right after I create the target in targetcli, no matter
whether I start a VM or not. Deleting the target in targetcli doesn't help either.
Before I create the target in targetcli, I can add and remove the vhost_scsi module;
its "used by" count is 0.
Following are the steps I took right after rebooting my host:
# modprobe vhost_scsi
# lsmod |grep vhost
vhost_scsi             36864  0
vhost                  53248  1 vhost_scsi
target_core_mod       380928  1 vhost_scsi
# modprobe -r vhost_scsi
# lsmod |grep vhost
#
Right after I set up the LUNs in targetcli, the "used by" count is always 4, no matter
whether I stop the VM by Ctrl-C or a graceful shutdown, and no matter whether the VM
is running or not. So targetcli is the suspect for these 4 references.

Thanks
Annie

I wonder if the failure to remove the vhost_scsi module is actually
another sign of the same (as yet unknown) leaked reference.

Thanks
Laszlo

My guess is that, when you kill QEMU with Ctrl-C, "vhost_scsi.ko" might
not clean up something, and that could break the next guest boot. If you
re-insert "vhost_scsi.ko" for each QEMU launch, and that ends up masking
the symptom, then there's likely some resource leak in "vhost_scsi.ko".
Nods, it is possible.

Thanks
Annie

Just a guess.

Thanks
Laszlo