PR fails due to OVMF time out


Ni, Ray
 

Hi,

I found several of my PRs cannot pass the CI due to OVMF boot timeout.

e.g.: Mktme fix by niruiyu · Pull Request #4072 · tianocore/edk2 (github.com)

 

I also one PR from Ard (X64 text reloc fixes by ardbiesheuvel · Pull Request #4216 · tianocore/edk2 (github.com)) failed as well.

 

But I remember Mike increased the boot timeout from 1min to 2 mins.

Why does OVMF boot require more time than before?

 

Is that because Gerd enabled the SMP 4 boot?

 

Thanks,

Ray


Gerd Hoffmann
 

On Fri, Mar 31, 2023 at 03:00:37PM +0000, Ni, Ray wrote:
Hi,
I found several of my PRs cannot pass the CI due to OVMF boot timeout.
e.g.: Mktme fix by niruiyu * Pull Request #4072 * tianocore/edk2 (github.com)<https://github.com/tianocore/edk2/pull/4072>

I also one PR from Ard (X64 text reloc fixes by ardbiesheuvel * Pull Request #4216 * tianocore/edk2 (github.com)<https://github.com/tianocore/edk2/pull/4216>) failed as well.

But I remember Mike increased the boot timeout from 1min to 2 mins.
Why does OVMF boot require more time than before?

Is that because Gerd enabled the SMP 4 boot?
That could be the reason. It happened to work in my CI test run,
but maybe I was just lucky.

take care,
Gerd


Michael D Kinney
 

Hi Gerd,

I have investigated this failure with enabling -smp 4. I think this is an
important feature that should be on by default.

I did find that the results where inconsistent at 2 or 4 cpus on my laptop,
but going 8 or higher would fail consistently. I used -smp 32 for all the
testing below.

First, this failure mode is only seen with SMM_REQUIRE=1 builds. If I
set SMM_REQUIRE=0, then QEMU can boot with different smp settings.
This appears to be a QEMU MP SMM related issue.

I first tried going back in edk2 history looking for a point where smp
boot would work using the version of QEMU used by EDK II CI agents which
is 2021.5.5 (QEMU 6.0.0)
* https://github.com/tianocore/edk2/blob/fc00ff286a541c047b7d343e66ec10890b80d3ea/OvmfPkg/PlatformCI/.azurepipelines/Windows-VS2019.yml#L142

I went back in edk2 history to the 2017/2018 timeframe and the issue was
still present, and I recall testing this feature back then myself, so I
do not think it is related to edk2 changes.

I then did a binary search through the QEMU releases:
PASS: https://qemu.weilnetz.de/w64/2017/qemu-w64-setup-20170113.exe
PASS: https://qemu.weilnetz.de/w64/2017/qemu-w64-setup-20170808.exe 2.10.0-rc2
PASS: https://qemu.weilnetz.de/w64/2017/qemu-w64-setup-20171122.exe 2.11.0-rc2
PASS: https://qemu.weilnetz.de/w64/2017/qemu-w64-setup-20171211.exe 2.11.0-rc5
PASS: https://qemu.weilnetz.de/w64/2017/qemu-w64-setup-20171217.exe 2.11.0
FAIL: https://qemu.weilnetz.de/w64/2018/qemu-w64-setup-20180321.exe 2.12.0-rc0 - AioContextPolling - Unrelated failure
PASS: https://qemu.weilnetz.de/w64/2018/qemu-w64-setup-20180404.exe 2.12.0-rc1
PASS: https://qemu.weilnetz.de/w64/2018/qemu-w64-setup-20180711.exe 3.0.0-rc0
PASS: https://qemu.weilnetz.de/w64/2018/qemu-w64-setup-20180807.exe 3.0.0-rc4
PASS: https://qemu.weilnetz.de/w64/2018/qemu-w64-setup-20180815.exe 3.0.0
FAIL: https://qemu.weilnetz.de/w64/2018/qemu-w64-setup-20181108.exe 3.1.0-rc0 - Long Delay at BSD Entry
FAIL: https://qemu.weilnetz.de/w64/2019/qemu-w64-setup-20181211.exe 3.1.0 - Long Delay at BDS Entry
FAIL: https://qemu.weilnetz.de/w64/2019/qemu-w64-setup-20190815.exe 4.1.0 - Long Delay at BDS Entry

It appears the failure was introduced in the transition from 3.0.0 -> 3.1.0.

QEMU 3.1 introduced Multi-Threaded TCG for x86 CPUs
* https://www.qemu.org/2018/12/12/qemu-3-1-0/
* https://qemu.readthedocs.io/en/latest/devel/multi-thread-tcg.html#multi-threaded-tcg

The feature can be disabled with the following QEMU command line option:
* -accel tcg,thread=single
* https://doc.ycharbi.fr/fichiers/virtualisation/qemu/documentation/qemu-doc-3.1.0.html

When I apply -accel tcg,thread-single to the QEMU command line, QEMU 3.1.0 passes.
When I apply -accel tcg,thread-single to the QEMU command line, QEMU 6.0.0 passes.

I think we can create a new PR with: -smp 4 -accel tcg,thead-single

I consider this a workaround, and we should investigate why Multi-Threaded
TCG breaks QEMU MP SMM.

Best regards,

Mike

-----Original Message-----
From: 'Gerd Hoffmann' <kraxel@...>
Sent: Friday, March 31, 2023 12:47 PM
To: Ni, Ray <ray.ni@...>
Cc: devel@edk2.groups.io; Kinney, Michael D <michael.d.kinney@...>; Ard Biesheuvel <ardb@...>
Subject: Re: PR fails due to OVMF time out

On Fri, Mar 31, 2023 at 03:00:37PM +0000, Ni, Ray wrote:
Hi,
I found several of my PRs cannot pass the CI due to OVMF boot timeout.
e.g.: Mktme fix by niruiyu * Pull Request #4072 * tianocore/edk2
(github.com)<https://github.com/tianocore/edk2/pull/4072>

I also one PR from Ard (X64 text reloc fixes by ardbiesheuvel * Pull Request #4216 * tianocore/edk2
(github.com)<https://github.com/tianocore/edk2/pull/4216>) failed as well.

But I remember Mike increased the boot timeout from 1min to 2 mins.
Why does OVMF boot require more time than before?

Is that because Gerd enabled the SMP 4 boot?
That could be the reason. It happened to work in my CI test run,
but maybe I was just lucky.

take care,
Gerd


Ard Biesheuvel
 

On Sun, 2 Apr 2023 at 20:23, Kinney, Michael D
<michael.d.kinney@...> wrote:

Hi Gerd,

I have investigated this failure with enabling -smp 4. I think this is an
important feature that should be on by default.
Agreed. And given your investigation below (thanks!), we should be
able to revert the revert as long as we add '-accel
tcg,thread-single', right?

It would be nice if we could also get some coverage for KVM, but for
the time being, this seems to be an improvement already (AFAIU, we
want -smp 4 primarily for the logic used in the code, while there is
very little code that would actually run concurrently, if any)

--
Ard.


Gerd Hoffmann
 

On Mon, Apr 03, 2023 at 10:21:27AM +0200, Ard Biesheuvel wrote:
On Sun, 2 Apr 2023 at 20:23, Kinney, Michael D
<michael.d.kinney@...> wrote:

Hi Gerd,

I have investigated this failure with enabling -smp 4. I think this is an
important feature that should be on by default.
Agreed. And given your investigation below (thanks!), we should be
able to revert the revert as long as we add '-accel
tcg,thread-single', right?
That is how I read it, yes.

It would be nice if we could also get some coverage for KVM, but for
the time being, this seems to be an improvement already (AFAIU, we
want -smp 4 primarily for the logic used in the code, while there is
very little code that would actually run concurrently, if any)
Yes, it's to improve MpInit test coverage. Most code would continue to
run single-threaded.

take care,
Gerd


Michael D Kinney
 

Hi Gerd,

Do you want to submit a new email review/PR with this suggested change?

Thanks,

Mike

-----Original Message-----
From: devel@edk2.groups.io <devel@edk2.groups.io> On Behalf Of Gerd Hoffmann
Sent: Monday, April 3, 2023 3:38 AM
To: devel@edk2.groups.io; ardb@...
Cc: Kinney, Michael D <michael.d.kinney@...>; Laszlo Ersek <lersek@...>; Ni, Ray <ray.ni@...>
Subject: Re: [edk2-devel] PR fails due to OVMF time out

On Mon, Apr 03, 2023 at 10:21:27AM +0200, Ard Biesheuvel wrote:
On Sun, 2 Apr 2023 at 20:23, Kinney, Michael D
<michael.d.kinney@...> wrote:

Hi Gerd,

I have investigated this failure with enabling -smp 4. I think this is an
important feature that should be on by default.
Agreed. And given your investigation below (thanks!), we should be
able to revert the revert as long as we add '-accel
tcg,thread-single', right?
That is how I read it, yes.

It would be nice if we could also get some coverage for KVM, but for
the time being, this seems to be an improvement already (AFAIU, we
want -smp 4 primarily for the logic used in the code, while there is
very little code that would actually run concurrently, if any)
Yes, it's to improve MpInit test coverage. Most code would continue to
run single-threaded.

take care,
Gerd