[PATCH V7 36/37] UefiCpuPkg: Setting initial-count register as the last step


Min Xu
 

BZ: https://bugzilla.tianocore.org/show_bug.cgi?id=3711

Per SDM, changing the mode of APIC timer (from one-shot to periodic or
vice versa) by writing to the timer LVT entry does not start the timer.
To start the timer, it is necessary to write to the initial-count
register.

If initial-count is wrote before mode change, it's possible that timer
expired before the mode change. Thus failing the periodic mode.

Cc: Jiewen Yao <jiewen.yao@...>
Cc: Gerd Hoffmann <kraxel@...>
Cc: Anthony Perard <anthony.perard@...>
Cc: Julien Grall <julien@...>
Cc: Eric Dong <eric.dong@...>
Cc: Ray Ni <ray.ni@...>
Acked-by: Gerd Hoffmann <kraxel@...>
Signed-off-by: Min Xu <min.m.xu@...>
---
.../Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c b/UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c
index 2d17177df12b..f26d9c93894f 100644
--- a/UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c
+++ b/UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c
@@ -967,11 +967,6 @@ InitializeApicTimer (
//
InitializeLocalApicSoftwareEnable (TRUE);

- //
- // Program init-count register.
- //
- WriteLocalApicReg (XAPIC_TIMER_INIT_COUNT_OFFSET, InitCount);
-
if (DivideValue != 0) {
ASSERT (DivideValue <= 128);
ASSERT (DivideValue == GetPowerOfTwo32 ((UINT32)DivideValue));
@@ -996,6 +991,11 @@ InitializeApicTimer (
LvtTimer.Bits.Mask = 0;
LvtTimer.Bits.Vector = Vector;
WriteLocalApicReg (XAPIC_LVT_TIMER_OFFSET, LvtTimer.Uint32);
+
+ //
+ // Program init-count register.
+ //
+ WriteLocalApicReg (XAPIC_TIMER_INIT_COUNT_OFFSET, InitCount);
}

/**
--
2.29.2.windows.2


Ni, Ray
 

Thanks for fixing this potential bug. Reviewed-by: Ray Ni ray.ni@...


Lendacky, Thomas
 

On 2/28/22 01:21, Min Xu via groups.io wrote:
BZ: https://bugzilla.tianocore.org/show_bug.cgi?id=3711
Per SDM, changing the mode of APIC timer (from one-shot to periodic or
vice versa) by writing to the timer LVT entry does not start the timer.
To start the timer, it is necessary to write to the initial-count
register.
If initial-count is wrote before mode change, it's possible that timer
expired before the mode change. Thus failing the periodic mode.
I'm replying to this patch since I can't find patch V12 46/47 anywhere in
my email.

I've bisected a regression in the Linux kernel to this patch when an
SEV-SNP guest is booted. The following message is issued in the kernel for
every AP being brought online:

APIC: Stale IRR: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000020 ISR: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000

Possibly a timing issue involving the mode switch with the interrupt
unmasked. If I leave the interrupt masked and only un-mask it
after the programming of the init-count, then the message goes away.

Thoughts?

Thanks,
Tom

Cc: Jiewen Yao <jiewen.yao@...>
Cc: Gerd Hoffmann <kraxel@...>
Cc: Anthony Perard <anthony.perard@...>
Cc: Julien Grall <julien@...>
Cc: Eric Dong <eric.dong@...>
Cc: Ray Ni <ray.ni@...>
Acked-by: Gerd Hoffmann <kraxel@...>
Signed-off-by: Min Xu <min.m.xu@...>
---
.../Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c b/UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c
index 2d17177df12b..f26d9c93894f 100644
--- a/UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c
+++ b/UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c
@@ -967,11 +967,6 @@ InitializeApicTimer (
//
InitializeLocalApicSoftwareEnable (TRUE);
- //
- // Program init-count register.
- //
- WriteLocalApicReg (XAPIC_TIMER_INIT_COUNT_OFFSET, InitCount);
-
if (DivideValue != 0) {
ASSERT (DivideValue <= 128);
ASSERT (DivideValue == GetPowerOfTwo32 ((UINT32)DivideValue));
@@ -996,6 +991,11 @@ InitializeApicTimer (
LvtTimer.Bits.Mask = 0;
LvtTimer.Bits.Vector = Vector;
WriteLocalApicReg (XAPIC_LVT_TIMER_OFFSET, LvtTimer.Uint32);
+
+ //
+ // Program init-count register.
+ //
+ WriteLocalApicReg (XAPIC_TIMER_INIT_COUNT_OFFSET, InitCount);
}
/**


Min Xu
 

On May 11, 2022 4:30 AM, Tom Lendacky wrote:
On 2/28/22 01:21, Min Xu via groups.io wrote:
BZ: https://bugzilla.tianocore.org/show_bug.cgi?id=3711

Per SDM, changing the mode of APIC timer (from one-shot to periodic or
vice versa) by writing to the timer LVT entry does not start the timer.
To start the timer, it is necessary to write to the initial-count
register.

If initial-count is wrote before mode change, it's possible that timer
expired before the mode change. Thus failing the periodic mode.
I'm replying to this patch since I can't find patch V12 46/47 anywhere in my
email.

I've bisected a regression in the Linux kernel to this patch when an SEV-SNP
guest is booted. The following message is issued in the kernel for every AP
being brought online:

APIC: Stale IRR:
00000000,00000000,00000000,00000000,00000000,00000000,00000000,000
00020 ISR:
00000000,00000000,00000000,00000000,00000000,00000000,00000000,000
00000

Possibly a timing issue involving the mode switch with the interrupt
unmasked. If I leave the interrupt masked and only un-mask it after the
programming of the init-count, then the message goes away.
Do you mean in InitializeApicTimer, it should follow below steps:
1. mask LvtTimer. (set LvtTimer.Bits.Mask = 1)
2. Do other stuff, including programing the init-count register.
3. un-mask LvtTimer (set LvtTimer.Bit.Mask = 0)

Thanks
Min


Lendacky, Thomas
 

On 5/10/22 21:00, Xu, Min M wrote:
On May 11, 2022 4:30 AM, Tom Lendacky wrote:
On 2/28/22 01:21, Min Xu via groups.io wrote:
BZ: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.tianocore.org%2Fshow_bug.cgi%3Fid%3D3711&;data=05%7C01%7Cthomas.lendacky%40amd.com%7Cce8cd76e1b054fe277d808da32f20976%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637878312370633666%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=C%2F0BMTV%2FoZdUxLwRbqzjNVdlMvEKfl20z6RwKeMzr2c%3D&amp;reserved=0

Per SDM, changing the mode of APIC timer (from one-shot to periodic or
vice versa) by writing to the timer LVT entry does not start the timer.
To start the timer, it is necessary to write to the initial-count
register.

If initial-count is wrote before mode change, it's possible that timer
expired before the mode change. Thus failing the periodic mode.
I'm replying to this patch since I can't find patch V12 46/47 anywhere in my
email.

I've bisected a regression in the Linux kernel to this patch when an SEV-SNP
guest is booted. The following message is issued in the kernel for every AP
being brought online:

APIC: Stale IRR:
00000000,00000000,00000000,00000000,00000000,00000000,00000000,000
00020 ISR:
00000000,00000000,00000000,00000000,00000000,00000000,00000000,000
00000

Possibly a timing issue involving the mode switch with the interrupt
unmasked. If I leave the interrupt masked and only un-mask it after the
programming of the init-count, then the message goes away.
Do you mean in InitializeApicTimer, it should follow below steps:
1. mask LvtTimer. (set LvtTimer.Bits.Mask = 1)
2. Do other stuff, including programing the init-count register.
3. un-mask LvtTimer (set LvtTimer.Bit.Mask = 0)
Yes, I believe so. I'm not an expert on the APIC timer, but that seems reasonable to me.

Thanks,
Tom

Thanks
Min


Min Xu
 

On May 11, 2022 10:06 PM, Lendacky, Thomas wrote:
On 5/10/22 21:00, Xu, Min M wrote:
On May 11, 2022 4:30 AM, Tom Lendacky wrote:
I'm replying to this patch since I can't find patch V12 46/47
anywhere in my email.

I've bisected a regression in the Linux kernel to this patch when an
SEV-SNP guest is booted. The following message is issued in the
kernel for every AP being brought online:

APIC: Stale IRR:
00000000,00000000,00000000,00000000,00000000,00000000,00000000,000
00020 ISR:
00000000,00000000,00000000,00000000,00000000,00000000,00000000,000
00000

Possibly a timing issue involving the mode switch with the interrupt
unmasked. If I leave the interrupt masked and only un-mask it after
the programming of the init-count, then the message goes away.
Do you mean in InitializeApicTimer, it should follow below steps:
1. mask LvtTimer. (set LvtTimer.Bits.Mask = 1) 2. Do other stuff,
including programing the init-count register.
3. un-mask LvtTimer (set LvtTimer.Bit.Mask = 0)
Yes, I believe so. I'm not an expert on the APIC timer, but that seems
reasonable to me.
I tested this fix in Td guest and it has no side effect.
I check the Intel SDM (Vol.3A Chap 10.5 Handling Local Interrupts) but it doesn't describe the actual sequence of LvtTimer.Bits.Mask and programming of init-count register.
@ Ni, Ray, What's your thought about it?

Thanks
Min


Lendacky, Thomas
 

On 5/11/22 19:52, Min Xu via groups.io wrote:
On May 11, 2022 10:06 PM, Lendacky, Thomas wrote:
On 5/10/22 21:00, Xu, Min M wrote:
On May 11, 2022 4:30 AM, Tom Lendacky wrote:
I'm replying to this patch since I can't find patch V12 46/47
anywhere in my email.

I've bisected a regression in the Linux kernel to this patch when an
SEV-SNP guest is booted. The following message is issued in the
kernel for every AP being brought online:

APIC: Stale IRR:
00000000,00000000,00000000,00000000,00000000,00000000,00000000,000
00020 ISR:
00000000,00000000,00000000,00000000,00000000,00000000,00000000,000
00000

Possibly a timing issue involving the mode switch with the interrupt
unmasked. If I leave the interrupt masked and only un-mask it after
the programming of the init-count, then the message goes away.
Do you mean in InitializeApicTimer, it should follow below steps:
1. mask LvtTimer. (set LvtTimer.Bits.Mask = 1) 2. Do other stuff,
including programing the init-count register.
3. un-mask LvtTimer (set LvtTimer.Bit.Mask = 0)
Yes, I believe so. I'm not an expert on the APIC timer, but that seems
reasonable to me.
I tested this fix in Td guest and it has no side effect.
I check the Intel SDM (Vol.3A Chap 10.5 Handling Local Interrupts) but it doesn't describe the actual sequence of LvtTimer.Bits.Mask and programming of init-count register.
@ Ni, Ray, What's your thought about it?
I guess you can theoretically miss an interrupt if your initial count is expires before you unmask the interrupt, so I think your fix is correct and no changes are needed.

I need to double check whether I'm properly resetting the APIC when APs are booted multiple times. Since this only occurs with SNP, I think this is on my end.

Thanks,
Tom

Thanks
Min


Henz, Patrick
 

Hi all,

We (Hewlett Packard Enterprise) are also running into a race condition due to how InitializeApicTimer initializes the APIC timers, we figured this might be a good place to report our findings. On the occasion we notice that APs get stuck in the timer interrupt handling code after getting woken up by the BSP. It appears that if the CurrentCount timer value provided by the BSP is sufficiently small, the brief amount of time between an AP calling InitializeApicTimer and calling DisableApicTimerInterrupt (see SyncLocalApicTimerSetting as an example) leaves enough room for an APIC timer interrupt to occur. This seems to become more frequent on larger systems with higher processor counts, from what we've gathered the increase in the number of locking sequence invocations appears to be making this condition far more likely to occur. We work on scaled systems with node controllers and we've really only seen this on larger systems, but it seems to us this could feasibly happen on smaller systems too. Our current solution is to add an additional argument to InitializeApicTimer, allowing the caller to specify whether or not APIC timer interrupts are to be enabled for the current thread.

Thanks,
Patrick Henz

Enterprise X86 Labs
Hewlett Packard Enterprise
patrick.henz@...

-----Original Message-----
From: devel@edk2.groups.io [mailto:devel@edk2.groups.io] On Behalf Of Lendacky, Thomas via groups.io
Sent: Friday, May 13, 2022 5:13 PM
To: devel@edk2.groups.io; min.m.xu@...; Ni, Ray <ray.ni@...>
Cc: Yao, Jiewen <jiewen.yao@...>; Gerd Hoffmann <kraxel@...>; Anthony Perard <anthony.perard@...>; Julien Grall <julien@...>; Dong, Eric <eric.dong@...>
Subject: Re: [edk2-devel] [PATCH V7 36/37] UefiCpuPkg: Setting initial-count register as the last step

On 5/11/22 19:52, Min Xu via groups.io wrote:
On May 11, 2022 10:06 PM, Lendacky, Thomas wrote:
On 5/10/22 21:00, Xu, Min M wrote:
On May 11, 2022 4:30 AM, Tom Lendacky wrote:
I'm replying to this patch since I can't find patch V12 46/47
anywhere in my email.

I've bisected a regression in the Linux kernel to this patch when
an SEV-SNP guest is booted. The following message is issued in the
kernel for every AP being brought online:

APIC: Stale IRR:
00000000,00000000,00000000,00000000,00000000,00000000,00000000,000
00020 ISR:
00000000,00000000,00000000,00000000,00000000,00000000,00000000,000
00000

Possibly a timing issue involving the mode switch with the
interrupt unmasked. If I leave the interrupt masked and only
un-mask it after the programming of the init-count, then the message goes away.
Do you mean in InitializeApicTimer, it should follow below steps:
1. mask LvtTimer. (set LvtTimer.Bits.Mask = 1) 2. Do other stuff,
including programing the init-count register.
3. un-mask LvtTimer (set LvtTimer.Bit.Mask = 0)
Yes, I believe so. I'm not an expert on the APIC timer, but that
seems reasonable to me.
I tested this fix in Td guest and it has no side effect.
I check the Intel SDM (Vol.3A Chap 10.5 Handling Local Interrupts) but it doesn't describe the actual sequence of LvtTimer.Bits.Mask and programming of init-count register.
@ Ni, Ray, What's your thought about it?
I guess you can theoretically miss an interrupt if your initial count is expires before you unmask the interrupt, so I think your fix is correct and no changes are needed.

I need to double check whether I'm properly resetting the APIC when APs are booted multiple times. Since this only occurs with SNP, I think this is on my end.

Thanks,
Tom


Thanks
Min





Jeff Fan
 

Ray,

Could we add one API (like InitializeApicTimerEx) to intialize CPU local APCI with additional parameter to indicate if interrupt is to be enabled or not? Just like HPE's solution. 

Jeff

fanjianfeng@...

 
Date: 2022-05-20 05:54
Subject: Re: [edk2-devel] [PATCH V7 36/37] UefiCpuPkg: Setting initial-count register as the last step
Hi all,
 
We (Hewlett Packard Enterprise) are also running into a race condition due to how InitializeApicTimer initializes the APIC timers, we figured this might be a good place to report our findings. On the occasion we notice that APs get stuck in the timer interrupt handling code after getting woken up by the BSP. It appears that if the CurrentCount timer value provided by the BSP is sufficiently small, the brief amount of time between an AP calling InitializeApicTimer and calling DisableApicTimerInterrupt (see SyncLocalApicTimerSetting as an example) leaves enough room for an APIC timer interrupt to occur. This seems to become more frequent on larger systems with higher processor counts, from what we've gathered the increase in the number of locking sequence invocations appears to be making this condition far more likely to occur. We work on scaled systems with node controllers and we've really only seen this on larger systems, but it seems to us this could feasibly happen on smaller systems too. Our current solution is to add an additional argument to InitializeApicTimer, allowing the caller to specify whether or not APIC timer interrupts are to be enabled for the current thread.
 
Thanks,
Patrick Henz
 
Enterprise X86 Labs
Hewlett Packard Enterprise
patrick.henz@...
 
-----Original Message-----
From: devel@edk2.groups.io [mailto:devel@edk2.groups.io] On Behalf Of Lendacky, Thomas via groups.io
Sent: Friday, May 13, 2022 5:13 PM
To: devel@edk2.groups.io; min.m.xu@...; Ni, Ray <ray.ni@...>
Cc: Yao, Jiewen <jiewen.yao@...>; Gerd Hoffmann <kraxel@...>; Anthony Perard <anthony.perard@...>; Julien Grall <julien@...>; Dong, Eric <eric.dong@...>
Subject: Re: [edk2-devel] [PATCH V7 36/37] UefiCpuPkg: Setting initial-count register as the last step
 
On 5/11/22 19:52, Min Xu via groups.io wrote:
> On May 11, 2022 10:06 PM, Lendacky, Thomas wrote:
>> On 5/10/22 21:00, Xu, Min M wrote:
>>> On May 11, 2022 4:30 AM, Tom Lendacky wrote:
>>>> I'm replying to this patch since I can't find patch V12 46/47
>>>> anywhere in my email.
>>>>
>>>> I've bisected a regression in the Linux kernel to this patch when
>>>> an SEV-SNP guest is booted. The following message is issued in the
>>>> kernel for every AP being brought online:
>>>>
>>>> APIC: Stale IRR:
>>>>
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,000
>>>> 00020 ISR:
>>>>
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,000
>>>> 00000
>>>>
>>>> Possibly a timing issue involving the mode switch with the
>>>> interrupt unmasked. If I leave the interrupt masked and only
>>>> un-mask it after the programming of the init-count, then the message goes away.
>>>
>>> Do you mean in InitializeApicTimer, it should follow below steps:
>>> 1. mask LvtTimer. (set LvtTimer.Bits.Mask = 1) 2. Do other stuff,
>>> including programing the init-count register.
>>> 3. un-mask LvtTimer (set LvtTimer.Bit.Mask = 0)
>>
>> Yes, I believe so. I'm not an expert on the APIC timer, but that
>> seems reasonable to me.
> I tested this fix in Td guest and it has no side effect.
> I check the Intel SDM (Vol.3A Chap 10.5 Handling Local Interrupts) but it doesn't describe the actual sequence of LvtTimer.Bits.Mask and  programming of init-count register.
> @ Ni, Ray,  What's your thought about it?
 
I guess you can theoretically miss an interrupt if your initial count is expires before you unmask the interrupt, so I think your fix is correct and no changes are needed.
 
I need to double check whether I'm properly resetting the APIC when APs are booted multiple times. Since this only occurs with SNP, I think this is on my end.
 
Thanks,
Tom
 
>
> Thanks
> Min
>
>
>
>
>