On Mar 18, 2021, at 6:22 AM, Laszlo Ersek <email@example.com> wrote:
On 03/18/21 02:48, Annie Li wrote:
Hello,
In addition to what Andrew said, I suggest the following:
I ran into a Windows boot failure (a page fault exception) and narrowed it down to the following patch:
MdeModulePkg/DxeIpl: support more NX related PCDs
This issue always happens after QMP is terminated by <ctrl-C> twice; see the following steps.
1. Boot the Windows VM, then press <ctrl-C> to exit the QMP
2. Repeat step 1
3. Boot the Windows VM again, and the page fault happens. (Note: Windows should boot into recovery mode in this round because of the previous two consecutive boot failures, see https://docs.microsoft.com/en-us/windows-hardware/manufacture/desktop/windows-recovery-environment--windows-re--technical-reference#entry-points-into-winre)
During the three Windows boots above, the values of the following variables are always the same:
However, the Windows guest fails to boot into recovery mode in the 3rd round because of the patch above (5267926). I changed the return value to "PcdGetBool (PcdSetNxForStack)" in the function "IsEnableNonExecNeeded" in MdeModulePkg/Core/DxeIplPeim/X64/VirtualMemory.c, and the page fault goes away with this change. Patch 5267926 fixes bug https://bugzilla.tianocore.org/show_bug.cgi?id=1116, where the comments say that PcdImageProtectionPolicy also needs NXE to be enabled. But it does cause the page fault exception in this scenario. Any suggestions?
The page fault exception is pasted here:
!!!! X64 Exception Type - 0E(#PF - Page-Fault) CPU Apic ID - 00000000 !!!!
ExceptionData - 0000000000000009 I:0 R:1 U:0 W:0 P:1 PK:0 SS:0 SGX:0
RIP - 000000003E0A7C75, CS - 0000000000000038, RFLAGS - 0000000000010202
RAX - 8000000000000003, RCX - 0000000000000001, RDX - 0000000001040001
RBX - 0000000000000001, RSP - 00000000001A6AA0, RBP - 0000000001040001
RSI - 000000003F2E2010, RDI - 0000000000000001
R8 - 0000000000000000, R9 - 000000003E0AEC90, R10 - 0000FFFFFFFFF000
R11 - 00000000001A6E90, R12 - 0000000000000000, R13 - 000000003E0AEC90
R14 - 00000000001A6B28, R15 - 00000000001AB000
DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030
GS - 0000000000000030, SS - 0000000000000030
CR0 - 0000000080010033, CR2 - 000000003F2E2010, CR3 - 000000003F401000
CR4 - 0000000000040668, CR8 - 0000000000000000
DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
GDTR - 000000003F1EE698 0000000000000047, LDTR - 0000000000000000
IDTR - 000000003ECCA018 0000000000000FFF, TR - 0000000000000000
FXSAVE_STATE - 00000000001A6700
!!!! Find image based on IP(0x3E0A7C75) /builddir/build/BUILD/edk2-1.4.3/Build/OvmfX64/DEBUG_GCC48/X64/MdeModulePkg/Universal/Console/TerminalDxe/TerminalDxe/DEBUG/TerminalDxe.dll (ImageBase=000000003E0A5000, EntryPoint=000000003E0A86E8) !!!!
(1) Please rebuild OVMF *locally*, using the same edk2 tree, and the same toolchain, and the same "build" flags.
(2) Reproduce the issue, capture the register dump.
(3) Run the following command:
objdump -f -S Build/OvmfX64/DEBUG_GCC48/X64/MdeModulePkg/Universal/Console/TerminalDxe/TerminalDxe/DEBUG/TerminalDxe.debug
WARNING: Wish List, off topic...
It would be nice to have a debug script that could post-process a serial log file and append the extra information. That tool would need to be toolchain aware: for gcc you would run `objdump -f -S TerminalDxe.debug`, and for Xcode you would run `lldb -o <lldbCommand> Terminal.dll`. I guess it could also decode the exception and point out that CR2 is the fault address and what ExceptionData means.
We could hook something like that into the CI and capture more detailed error reports.
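As a rough illustration of the wish-list item above, the following Python sketch pulls ExceptionData and CR2 out of a serial log excerpt and names the error-code bits that are set. The function name `decode_page_fault` and the bit descriptions are made up for this example. The bit positions follow the x86 page-fault error code layout, and the short labels match the ones printed in the dump. It is only a sketch under those assumptions, not an existing edk2 tool.

```python
import re

# Page-fault error code bits (x86 architecture). The short names follow
# the labels printed in the register dump; the descriptions are ours.
PF_BITS = [
    (0, "P", "fault on a present page (protection violation)"),
    (1, "W", "access was a write"),
    (2, "U", "access came from user mode"),
    (3, "R", "reserved bit set in a paging-structure entry"),
    (4, "I", "access was an instruction fetch"),
    (5, "PK", "protection-key violation"),
    (6, "SS", "shadow-stack access"),
    (15, "SGX", "SGX access-control violation"),
]


def decode_page_fault(log_text):
    """Return (CR2, [(bit name, description), ...]) for a #PF dump."""
    code = int(re.search(r"ExceptionData - ([0-9A-Fa-f]+)", log_text).group(1), 16)
    cr2 = int(re.search(r"CR2 - ([0-9A-Fa-f]+)", log_text).group(1), 16)
    set_bits = [(name, desc) for bit, name, desc in PF_BITS if code & (1 << bit)]
    return cr2, set_bits


# Two lines copied from the register dump in this thread.
sample = """\
ExceptionData - 0000000000000009 I:0 R:1 U:0 W:0 P:1 PK:0 SS:0 SGX:0
CR0 - 0000000080010033, CR2 - 000000003F2E2010, CR3 - 000000003F401000
"""

cr2, set_bits = decode_page_fault(sample)
print(f"faulting linear address (CR2): {cr2:#x}")
for name, desc in set_bits:
    print(f"  {name}: {desc}")
```

For the dump above, this reports CR2 = 0x3f2e2010 with the P and R bits set, which agrees with the "R:1 ... P:1" summary that the exception handler already prints.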
The point of this exercise is to reproduce the issue with an OVMF build for which you have a matching "TerminalDxe.debug" file. Once you do that, you can run "objdump" on the ".debug" file, and get a disassembly of the TerminalDxe driver, interleaved with the C language source code.
Then, we can do two things:
- we can verify whether (EntryPoint - ImageBase), from the register dump, matches the (relative) "start address" that "objdump -f" reports,
- we can take the crash offset (RIP - ImageBase), from the register dump, and use that offset into the "objdump -S" disassembly, to narrow down what the terminal driver may have been doing to trigger the crash.
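The two checks above boil down to hex arithmetic on values from the register dump. Here is a minimal Python sketch, assuming the dump format shown earlier in this thread; the helper name `image_offsets` is hypothetical. Whether the resulting offsets can be compared directly with the addresses that objdump prints depends on how the ".debug" file was linked, so treat the comparison step as something to verify against your own build.

```python
import re


def image_offsets(dump_text):
    """Return (EntryPoint - ImageBase, RIP - ImageBase) from a register dump."""
    rip = int(re.search(r"RIP\s+-\s+([0-9A-Fa-f]+)", dump_text).group(1), 16)
    base = int(re.search(r"ImageBase=([0-9A-Fa-f]+)", dump_text).group(1), 16)
    entry = int(re.search(r"EntryPoint=([0-9A-Fa-f]+)", dump_text).group(1), 16)
    return entry - base, rip - base


# Values copied from the dump in this thread (image path shortened).
dump = """\
RIP - 000000003E0A7C75, CS - 0000000000000038, RFLAGS - 0000000000010202
!!!! Find image based on IP(0x3E0A7C75) TerminalDxe.dll (ImageBase=000000003E0A5000, EntryPoint=000000003E0A86E8) !!!!
"""

entry_off, crash_off = image_offsets(dump)
print(f"entry point offset: {entry_off:#x}")  # compare with "objdump -f"
print(f"crash offset:       {crash_off:#x}")  # look up in "objdump -S"
```

With the numbers from this dump, the entry point offset is 0x36e8 and the crash offset is 0x2c75.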
It's not necessarily the terminal driver's fault that we encounter a crash, but knowing what TerminalDxe was up to might shed light on the actual reason. It's of course also possible that TerminalDxe *is* at fault. We'll see.
If possible, please post:
- your precise edk2 version (if you have local patches, it would be best to reproduce with an upstream-only tree),
- your full firmware log (feel free to compress it),
- the register dump from serial,
- the objdump (disassembly) output (feel free to compress it).