On 09/02/19 10:45, Igor Mammedov wrote:
On Fri, 30 Aug 2019 20:46:14 +0200 Laszlo Ersek <lersek@...> wrote:
On 08/30/19 16:48, Igor Mammedov wrote:
(01) On boot, firmware maps and initializes an SMI handler at the default SMBASE (30000) (using dedicated SMRAM at 30000 would allow us to avoid the save/restore steps and make the SMM handler pointer not vulnerable to DMA attacks)
(02) QEMU hotplugs a new CPU in the reset state and sends an SCI
(03) on receiving the SCI, the host CPU calls the GPE CPU hotplug handler, which writes to IO port 0xB2 (broadcast SMI)
(04) firmware waits for all existing CPUs to rendezvous in SMM mode; new CPU(s) have an SMI pending but do nothing yet
(05) the host CPU wakes up one new CPU (INIT-SIPI-SIPI); the SIPI vector points to an HLT loop in RO flash. (how will the host CPU know which new CPUs to relocate? possibly reuse the QEMU CPU hotplug MMIO interface???)
(06) the new CPU does the relocation. (if an attacker sends SIPI to several new CPUs, it is an open question how to detect the collision of several CPUs at the same default SMBASE)
(07) once the new CPU is relocated, the host CPU completes initialization, returns from the IO port write, and executes the rest of the GPE handler, telling the OS to online the new CPU.

In step (03), it is the OS that handles the SCI; it transfers control to ACPI. The AML can write to IO port 0xB2 only because the OS allows it.
If the OS decides to omit that step, and sends an INIT-SIPI-SIPI directly to the new CPU, can it steal the CPU?

It sure can, but this way it won't get access to privileged SMRAM, so the OS can't subvert the firmware. The next time an SMI broadcast is sent, the CPU will use the SMI handler at the default 30000 SMBASE. It's up to us to define the behavior here (for example, the relocation handler can put such a CPU in shutdown state).
It's in the best interest of the OS to cooperate and execute the AML provided by firmware; if it does not follow the proper CPU hotplug flow, we can't guarantee that the stolen CPU will work.

This sounds convincing enough, for the hotplugged CPU; thanks.

So now my concern is with step (01). While preparing for the initial relocation (of cold-plugged CPUs), the code assumes the memory at the default SMBASE (0x30000) is normal RAM. Is it not a problem that the area is written initially while running in normal 32-bit or 64-bit mode, but then executed (in response to the first, synchronous, SMI) as SMRAM?

Basically I'm confused by the alias. TSEG (and presumably, A/B seg) work like this:
- when open, looks like RAM to normal mode and SMM
- when closed, looks like a black hole to normal mode, and like RAM to SMM

The generic edk2 code knows this, and manages the SMRAM areas accordingly. The area at 0x30000 is different:
- looks like RAM to both normal mode and SMM

If we set up the alias at 0x30000 into A/B seg,
- will that *permanently* hide the normal RAM at 0x30000?
- will 0x30000 start behaving like A/B seg?

Basically my concern is that the universal code in edk2 might or might not keep A/B seg open while initially populating the area at the default SMBASE. Specifically, I can imagine two issues:
- if the alias into A/B seg is inactive during the initial population, then the initial writes go to RAM, but the execution (the first SMBASE relocation) will occur from A/B seg through the alias
- alternatively, if the alias is always active, but A/B seg is closed during the initial population (which happens in normal mode), then the initial writes go to the black hole, and execution will occur from a "blank" A/B seg.

Am I seeing things? (Sorry, I keep feeling dumber and dumber in this thread.) Anyway, I guess we could try and see if OVMF still boots with the alias...

Thanks Laszlo
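[Editor's sketch, for concreteness: how such an alias might be wired up with QEMU's memory API. The smbase_alias field and the function are hypothetical; the region names are modeled on hw/pci-host/q35.c, and a real patch would also have to gate the alias so it is only visible in SMM -- which is exactly the open/closed question raised above.]

    /* Hypothetical sketch, not an actual QEMU patch: alias 128 KiB of the
     * A/B-seg SMRAM into the default SMBASE range at 0x30000.  Field and
     * region names are assumptions modeled on hw/pci-host/q35.c. */
    static void q35_alias_default_smbase(MCHPCIState *mch)
    {
        /* mch->smram is the SMM-only view; offset 0xa0000 is the A seg.
         * smbase_alias is a hypothetical new MemoryRegion field. */
        memory_region_init_alias(&mch->smbase_alias, OBJECT(mch),
                                 "smbase-alias", &mch->smram,
                                 0xa0000, 0x20000);
        /* Priority 1 so the alias wins over normal RAM at 0x30000.
         * A real implementation must restrict this view to SMM only;
         * otherwise 0x30000 still looks like RAM to both worlds. */
        memory_region_add_subregion_overlap(mch->system_memory, 0x30000,
                                            &mch->smbase_alias, 1);
    }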
Igor Mammedov <imammedo@...>
On Thu, 29 Aug 2019 19:01:35 +0200 Laszlo Ersek <lersek@...> wrote:
On 08/27/19 20:31, Igor Mammedov wrote:
On Sat, 24 Aug 2019 01:48:09 +0000 "Yao, Jiewen" <jiewen.yao@...> wrote:

(05) Host CPU: (OS) Port 0xB2 write, all CPUs enter SMM (NOTE: New CPU will not enter SMM because SMI is disabled)

I think only the CPU that does the write will enter SMM

That used to be the case (and it is still the default QEMU behavior, if broadcast SMI is not negotiated). However, OVMF does negotiate broadcast SMI whenever QEMU offers the feature. Broadcast SMI is important for the stability of the edk2 SMM infrastructure on QEMU/KVM, we've found.
https://bugzilla.redhat.com/show_bug.cgi?id=1412313
https://bugzilla.redhat.com/show_bug.cgi?id=1412327
and we might not need to pull in all already initialized CPUs into SMM.

That, on the other hand, could be a valid idea. But then the CPU should use a different method for raising a synchronous SMI for itself (not a write to IO port 0xB2). Is a "directed SMI for self" possible?

theoretically, depending on the argument in 0xb3, it should be possible to raise a directed SMI even if broadcast ones are negotiated.

[...] I've tried to read through the procedure with your suggested changes, but I'm failing at composing a coherent mental image, in this email response format.
If you have the time, can you write up the suggested list of steps in a "flat" format? (I believe you are suggesting to eliminate some steps completely.)
if I'd sum it up:
(01) On boot, firmware maps and initializes an SMI handler at the default SMBASE (30000) (using dedicated SMRAM at 30000 would allow us to avoid the save/restore steps and make the SMM handler pointer not vulnerable to DMA attacks)
(02) QEMU hotplugs a new CPU in the reset state and sends an SCI
(03) on receiving the SCI, the host CPU calls the GPE CPU hotplug handler, which writes to IO port 0xB2 (broadcast SMI)
(04) firmware waits for all existing CPUs to rendezvous in SMM mode; new CPU(s) have an SMI pending but do nothing yet
(05) the host CPU wakes up one new CPU (INIT-SIPI-SIPI); the SIPI vector points to an HLT loop in RO flash. (how will the host CPU know which new CPUs to relocate? possibly reuse the QEMU CPU hotplug MMIO interface???)
(06) the new CPU does the relocation (a sketch of the per-CPU relocation target appears after this message). (if an attacker sends SIPI to several new CPUs, it is an open question how to detect the collision of several CPUs at the same default SMBASE)
(07) once the new CPU is relocated, the host CPU completes initialization, returns from the IO port write, and executes the rest of the GPE handler, telling the OS to online the new CPU.

... jumping to another point:
2) Let trusted software (SMM and init code) guarantee SMREBASE one by one (including any code that runs before SMREBASE)

that would mean pulling all present CPUs into SMM mode so no attack code could be executing before doing the hotplug. With a lot of present CPUs it could be quite expensive, and unlike physical hardware, a guest's CPUs can be preempted arbitrarily long, causing long delays.

I agree with your analysis, but I slightly disagree about the impact:
- CPU hotplug is not a frequent administrative action, so the CPU load should be temporary (it should be a spike). I don't worry that it would trip up OS kernel code. (SMI handling is known to take long on physical platforms too.) In practice, all "normal" SMIs are broadcast already (for example when calling the runtime UEFI variable services from the OS kernel).
- The fact that QEMU/KVM introduces some jitter into the execution of multi-core code (including SMM code) has proved useful in the past, for catching edk2 regressions.
Again, this is not a strong disagreement from my side. I'm open to better ways for synching CPUs during multi-CPU hotplug.
(Digression:
I expect someone could be curious why (a) I find it acceptable (even beneficial) that "some jitter" injected by the QEMU/KVM scheduling exposes multi-core regressions in edk2, but at the same time (b) I found it really important to add broadcast SMI to QEMU and OVMF. After all, both "jitter" and "unicast SMIs" are QEMU/KVM platform specifics, so why the different treatment?
The reason is that the "jitter" does not interfere with normal operation, and it has been good for catching *regressions*. IOW, there is a working edk2 state, someone posts a patch, works on physical hardware, but breaks on QEMU/KVM --> then we can still reject or rework or revert the patch. And we're back to a working state again (in the best case, with a fixed feature patch).
With the unicast SMIs however, it was impossible to enable the SMM stack reliably in the first place. There was no functional state to return to.

I don't really get the last statement, but then I know nothing about OVMF. I don't insist on unicast SMI being used; it's just some ideas about what we could do. It could be done later; broadcast SMI (though maybe not the best) is sufficient to implement CPU hotplug.

Digression ends.)
let's first see if we can ignore the race

Makes me uncomfortable, but if this is the consensus, I'll go along.

same here; as mentioned in another reply, it's only possible in the attack case (multiple SMIs + multiple SIPIs), so it could be fine to just explode in case it happens (the point is that the fw is not leaking anything from SMRAM, and the OS did something illegal).

and if it's not then we probably end up with implementing some form of #1

OK.
Thanks! Laszlo
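[Editor's sketch, to make step (06) concrete: the collision question only arises at the shared default SMBASE; after relocation, each CPU's SMBASE is derived from its APIC ID, so correctly relocated CPUs cannot overlap. The stride and layout below are illustrative assumptions, not the actual SMM tile layout used by edk2's PiSmmCpuDxeSmm.]

    #include <stdint.h>

    #define SMBASE_STRIDE 0x2000u   /* per-CPU spacing inside TSEG, assumption */

    /* Each new CPU derives a unique SMBASE inside TSEG from its initial
     * APIC ID; the SMI entry point then lands at SMBASE + 0x8000 and the
     * save state near SMBASE + 0xFC00. */
    static inline uint32_t relocated_smbase(uint32_t tseg_base, uint32_t apic_id)
    {
        return tseg_base + apic_id * SMBASE_STRIDE;
    }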
On 08/28/19 14:01, Igor Mammedov wrote:
On Tue, 27 Aug 2019 22:11:15 +0200 Laszlo Ersek <lersek@...> wrote:
On 08/27/19 18:23, Igor Mammedov wrote:
On Mon, 26 Aug 2019 17:30:43 +0200 Laszlo Ersek <lersek@...> wrote:
On 08/23/19 17:25, Kinney, Michael D wrote:
Hi Jiewen,
If a hot add CPU needs to run any code before the first SMI, I would recommend it only executes code from a write protected FLASH range, without a stack, and then waits for the first SMI.

"without a stack" looks very risky to me. Even if we manage to implement the guest code initially, we'll be trapped without a stack, should we ever need to add more complex stuff there.

Do we need anything complex in the relocation handler, though? From what I'd imagine, a minimum handler should:
1: get the address of TSEG, possibly reading it from the chipset

The TSEG base calculation is not trivial in this environment. The 32-bit RAM size needs to be read from the CMOS (IO port accesses). Then the extended TSEG size (if any) needs to be detected from PCI config space (IO port accesses). Both CMOS and PCI config space require IO port writes too (not just reads). Even if there are enough registers for the calculations, can we rely on these unprotected IO ports? (See the sketch below.)
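[Editor's sketch of the CMOS part of that calculation, to show how much unprotected IO it needs. The CMOS 0x34/0x35 convention (RAM above 16 MiB in 64 KiB units) matches what QEMU and OVMF use, but treat the details as assumptions; the extended TSEG probe would additionally need 0xcf8/0xcfc PCI config accesses, omitted here.]

    #include <stdint.h>

    /* IO-port plumbing a stack-less relocation handler would rely on. */
    static inline void outb(uint16_t port, uint8_t v)
    {
        __asm__ __volatile__("outb %0, %1" : : "a"(v), "Nd"(port));
    }
    static inline uint8_t inb(uint16_t port)
    {
        uint8_t v;
        __asm__ __volatile__("inb %1, %0" : "=a"(v) : "Nd"(port));
        return v;
    }
    static inline uint8_t cmos_read(uint8_t idx)
    {
        outb(0x70, idx);            /* unprotected index port */
        return inb(0x71);           /* unprotected data port  */
    }

    /* End of low RAM below 4G; TSEG sits directly below this. */
    static uint32_t low_ram_end(void)
    {
        uint32_t units = cmos_read(0x34) | ((uint32_t)cmos_read(0x35) << 8);
        return 0x1000000u + units * 0x10000u;
    }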
Also, can we switch to 32-bit mode without a stack? I assume it would be necessary to switch to 32-bit mode for 32-bit arithmetic.

from SDM vol 3:
"34.5.1 Initial SMM Execution Environment
After saving the current context of the processor, the processor initializes its core registers to the values shown in Table 34-4. Upon entering SMM, the PE and PG flags in control register CR0 are cleared, which places the processor in an environment similar to real-address mode. The differences between the SMM execution environment and the real-address mode execution environment are as follows:
• The addressable address space ranges from 0 to FFFFFFFFH (4 GBytes).
• The normal 64-KByte segment limit for real-address mode is increased to 4 GBytes.
• The default operand and address sizes are set to 16 bits, which restricts the addressable SMRAM address space to the 1-MByte real-address mode limit for native real-address-mode code. However, operand-size and address-size override prefixes can be used to access the address space beyond the 1-MByte."

That helps. Thanks for the quote!

Getting the initial APIC ID needs some CPUID instructions, which clobber EAX through EDX, if I understand correctly. Given the register pressure, CPUID might have to be one of the first instructions to call.

we could map at 30000 not just the 64K required for the save area but 128K, and use the 2nd half as secure RAM for the stack and intermediate data.
Firmware could put a pre-calculated pointer to TSEG there after it's configured and locked down; this way the relocation handler won't have to figure out the TSEG address on its own. (A possible layout is sketched below.)
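[Editor's sketch of a possible shape for that pre-calculated data, assuming the 128K window idea above (64K save area + 64K secure RAM). The layout is purely illustrative, not a proposed ABI.]

    #include <stdint.h>

    /* Illustrative layout for the 2nd 64 KiB half of a 128 KiB SMRAM
     * window at 0x30000.  Firmware fills it in after TSEG is configured
     * and locked down; the relocation handler only reads it. */
    typedef struct {
        uint32_t tseg_base;            /* pre-computed by firmware        */
        uint32_t tseg_size;
        uint32_t smbase_stride;        /* per-CPU spacing inside TSEG     */
        _Atomic uint32_t reloc_lock;   /* guards the default save area    */
        uint8_t  stack[0x8000];        /* temporary stack for the handler */
    } RELOC_MAILBOX;

    #define RELOC_MAILBOX_ADDR 0x40000u   /* 0x30000 + 64 KiB, assumption */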
Sounds like a great idea.

2: calculate its new SMBASE offset based on its APIC ID
3: save the new SMBASE
For this OVMF use case, is any CPU init required before the first SMI?

I expressed a preference for that too: "I wish we could simply wake the new CPU [...] with an SMI".
http://mid.mail-archive.com/398b3327-0820-95af-a34d-1a4a1d50cf35@redhat.com
From Paolo's list of steps, are steps (8a) and (8b) really required?

07b implies 08b

I agree about that implication, yes. *If* we send an INIT/SIPI/SIPI to the new CPU, then the new CPU needs a HLT loop, I think.

It could also execute an INIT reset, which leaves the initialized SMM state untouched, but otherwise the CPU would be inactive.
8b could be a trivial hlt loop, and we most likely could skip 08a and the host-CPU signaling steps, but we need the INIT/SIPI/SIPI sequence to wake up the AP so it can handle the pending SMI before handling the SIPI (so the behavior would follow the SDM).
See again my message linked above -- just after the quoted sentence, I wrote, "IOW, if we could excise steps 07b, 08a, 08b".
But, I obviously defer to Paolo and Igor on that.
(I do believe we have a dilemma here. In QEMU, we probably prefer to emulate physical hardware as faithfully as possible. However, we do not have Cache-As-RAM (nor do we intend to, IIUC). Does that justify other divergences from physical hardware too, such as waking just by virtue of an SMI?)

So far we should be able to implement it per spec (at least the SDM one), but we would still need to invent chipset hardware, i.e. like adding to Q35 non-exiting SMRAM and means to map/unmap it to the non-SMM address space. (and I hope we could avoid adding a "parked CPU" thingy)

I think we'll need a separate QEMU tree for this. I'm quite in the dark -- I can't tell if I'll be able to do something in OVMF without actually trying it. And for that, we'll need some proposed QEMU code that is testable, but not upstream yet. (As I might realize that I'm unable to make it work in OVMF.)

Let me prepare a QEMU branch with something usable for you.
To avoid inventing a mgmt API for configuring SMRAM at 30000, I'm suggesting to steal/alias the top or bottom 128K of the TSEG window to 30000. This way OVMF would be able to set the SMI relocation handler by modifying TSEG, and pass the TSEG base/other data to it as well. Would that work for you, or should we try a more elaborate approach?
I believe this change may not be cross-compatible between QEMU and OVMF. OVMF platform code would have to hide the stolen part of the TSEG from core edk2 SMM code. If old OVMF were booted on new QEMU, I believe things could break -- the SMM core would be at liberty to use any part of the TSEG (advertised by OVMF platform code to the full extent), and the SMM core would continue expecting 0x30000 to be normal (and distinct) RAM. If QEMU suddenly aliased both ranges to the same contents (in System Management Mode), I think that would confuse the SMM core.

We already negotiate (or at least, detect) two features in this area: "extended TSEG" and "broadcast SMI". I believe we need a CPU hotplug controller anyway -- is that still the case? If it is, we could use registers on that device for managing the alias.

If the default SMBASE area is corrupted due to concurrent access, could that lead to invalid relocated SMBASE values? Possibly pointing into normal RAM?

in case of broadcast SMI (btw, does OVMF use broadcast SMIs?) several CPUs could end up with the same SMBASE within SMRAM:
1: the default one: in case the 2nd CPU enters SMM after the 1st CPU saved its new SMBASE but before it called RSM
2: a duplicated SMBASE: where the 2nd CPU saves its new SMBASE before the 1st calls RSM
while the 2nd could be counteracted by using locks, I don't see how the 1st one could be avoided. Maybe the host CPU can send a 2nd SMI, so the just-relocated CPU could send an ACK from the relocated SMBASE/with the new SMI handler?

Broadcast SMI is very important for OVMF. The Platform Init spec basically defines an abstract interface for runtime UEFI drivers for submitting an "SMM request". Part of that is raising an SMI (also abstracted). *How* an SMI is raised is platform-dependent, and edk2 provides two implementations for synching APs in SMM (broadcast ("traditional") and relaxed). In our testing on QEMU/KVM, the broadcast/traditional sync mode worked very robustly (with QEMU actually broadcasting the SMI in response to IO port 0xB2 writes), but the relaxed sync mode was unstable / brittle (in particular during S3 resume). Therefore broadcast SMI is negotiated by OVMF whenever it is available -- it makes a big difference in stability. Now, whether broadcast SMI needs to be part of CPU hotplug specifically, that's a different question. The CPU hotplug logic may not necessarily have to go through the same (standardized) interfaces that runtime UEFI drivers do.

I don't have any better idea. We could protect the default SMBASE with a semaphore (spinlock?) in SMRAM, but that would have to be released with the owning CPU executing code at the new SMBASE. Basically, what you say, just "ACK" meaning "release the spinlock". (A sketch follows.)

Thanks, Laszlo
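[Editor's sketch of that semaphore idea, reusing the hypothetical RELOC_MAILBOX from earlier; write_saved_smbase() is a stand-in for poking the SMBASE field of the save state, and the release from the *new* SMBASE is the ACK.]

    #include <stdatomic.h>
    #include <stdint.h>

    extern void write_saved_smbase(uint32_t smbase); /* hypothetical helper */

    /* Runs from the default SMBASE (0x30000): hold the lock for the whole
     * save/relocate sequence, then RSM with the lock still held. */
    void default_smbase_handler(RELOC_MAILBOX *mb, uint32_t apic_id)
    {
        while (atomic_exchange(&mb->reloc_lock, 1)) {
            /* another CPU is mid-relocation through 0x30000; spin */
        }
        write_saved_smbase(mb->tseg_base + apic_id * mb->smbase_stride);
        /* RSM happens here; the lock is deliberately not released yet. */
    }

    /* Runs at the relocated SMBASE on the next SMI: releasing the lock is
     * the "ACK" -- proof that this CPU really left the default SMBASE. */
    void relocated_smbase_ack(RELOC_MAILBOX *mb)
    {
        atomic_store(&mb->reloc_lock, 0);
    }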
Igor Mammedov <imammedo@...>
On Sat, 24 Aug 2019 01:48:09 +0000 "Yao, Jiewen" <jiewen.yao@...> wrote:

I give my thoughts. Paolo may add more.

Here are some ideas I have on the topic.
-----Original Message-----
From: Kinney, Michael D
Sent: Friday, August 23, 2019 11:25 PM
To: Yao, Jiewen <jiewen.yao@...>; Paolo Bonzini <pbonzini@...>; Laszlo Ersek <lersek@...>; rfc@edk2.groups.io; Kinney, Michael D <michael.d.kinney@...>
Cc: Alex Williamson <alex.williamson@...>; devel@edk2.groups.io; qemu devel list <qemu-devel@...>; Igor Mammedov <imammedo@...>; Chen, Yingwen <yingwen.chen@...>; Nakajima, Jun <jun.nakajima@...>; Boris Ostrovsky <boris.ostrovsky@...>; Joao Marcal Lemos Martins <joao.m.martins@...>; Phillip Goerl <phillip.goerl@...>
Subject: RE: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
Hi Jiewen,
If a hot add CPU needs to run any code before the first SMI, I would recommend it only executes code from a write protected FLASH range, without a stack, and then waits for the first SMI.

[Jiewen] Right.
Another option from Paolo: the new CPU will not run until (07b). To mitigate the DMA threat, someone needs to guarantee the low-memory SIPI vector is DMA protected.
NOTE: The LOW memory *could* be mapped to a write protected FLASH AREA via the PAM registers. The Host CPU may set that up in SMM. If that is the case, we don't need to worry about DMA.
I copied the detailed steps here, because I found it hard to dig them out again.
*) In light of using dedicated SMRAM at 30000 with a pre-configured relocation vector for the initial relocation, which is not reachable from non-SMM mode:
====================
(01a) QEMU: create new CPU. The CPU already exists, but it does not start running code until unparked by the CPU hotplug controller.

we might not need the parked CPU (if we ignore an attacker's attempt to send SMI to several new CPUs; see below for the issue it causes)

(01b) QEMU: trigger SCI
(02-03) no equivalent
(04) Host CPU: (OS) execute GPE handler from DSDT
(05) Host CPU: (OS) Port 0xB2 write, all CPUs enter SMM (NOTE: New CPU will not enter SMM because SMI is disabled)

I think only the CPU that does the write will enter SMM, and we might not need to pull in all already initialized CPUs into SMM. At this step we could also send a directed SMI to the new CPU from the host CPU that entered SMM on the write.

(06) Host CPU: (SMM) Save 38000, Update 38000 -- fill simple SMM rebase code.

could skip this step as well (*)

(07a) Host CPU: (SMM) Write to CPU hotplug controller to enable new CPU

ditto

(07b) Host CPU: (SMM) Send INIT/SIPI/SIPI to new CPU.

we need to wake up the new CPU somehow so it would process the pending SMI from (09)/(05) before jumping to the SIPI vector

(08a) New CPU: (Low RAM) Enter protected mode.
(08b) New CPU: (Flash) Signals host CPU to proceed and enter cli;hlt loop.

both of these steps could be changed to just a cli;hlt loop, or an INIT reset. If the SMI relocation handler and/or the host CPU pulls the new CPU into OVMF, we actually don't care about the SIPI vector, as all firmware initialization for the new CPU is done in SMM mode (07b triggers 10). Thus eliminating one attack vector we would have to protect against.

(09) Host CPU: (SMM) Send SMI to the new CPU only.

could be done at (05)

(10) New CPU: (SMM) Run SMM code at 38000, and rebase SMBASE to TSEG.

it could also pull itself into the other OVMF structures (assuming it can use TSEG as stack, though that's rather complex), or just do the relocation and let the host CPU fill in the OVMF structures for the new CPU (12).

(11) Host CPU: (SMM) Restore 38000.

could skip this step as well (*)

(12) Host CPU: (SMM) Update located data structure to add the new CPU information. (This step will involve CPU_SERVICE protocol)

(13) New CPU: (Flash) do whatever other initialization is needed

do we actually need it?

(14) New CPU: (Flash) Deadloop, and wait for INIT-SIPI-SIPI.
(15) Host CPU: (OS) Send INIT-SIPI-SIPI to pull new CPU in..
====================
For this OVMF use case, is any CPU init required before the first SMI?

[Jiewen] I am not sure what the detailed action in 08b is. And I am not sure what your "init" means here? Personally, I don't think we need too much init work, such as Microcode or MTRR. But we need detailed info.

Wouldn't it be preferable to do it in SMM mode?

From Paolo's list of steps, are steps (8a) and (8b) really required? Can the SMI monarch use the Local APIC to send a directed SMI to the hot added CPU? The SMI monarch needs to know the APIC ID of the hot added CPU.

[Jiewen] I think it depends upon the virtual hardware design. Leave question to Paolo.
it's not really needed as described in (8x); it could be just a cli;hlt loop so that our SIPI could land at sensible code and stop the new CPU. It could even be an attacker's code, if we do all initialization in SMM mode.

Do we also need to handle the case where multiple CPUs are added at once? I think we would need to serialize the use of 3000:8000 for the SMM rebase operation on each hot added CPU. It would be simpler if we can guarantee that only one CPU can be added or removed at a time, and the complete flow of adding a CPU to SMM and the OS needs to be completed before another add/remove event needs to be processed.

[Jiewen] Right. I treat multiple CPU hot-adds at the same time as a potential threat.

the problem I see here is the race of saving/restoring to/from the SMBASE at 30000, so a CPU exiting SMM can't be sure whether it restores its own saved area or another CPU's saved state. (I couldn't find in the SDM what would happen in this case) If we consider the non-attack flow, then we can serialize sending SMIs to new CPUs (one at a time) from the GPE handler and ensure that only one CPU can do the relocation at a time (i.e. non-enforced serialization). In the attack case, the attacker would only be able to trigger the above race.

We don't want to trust the end user. The solution could be:
1) Let trusted hardware guarantee hot-add one by one.

so far in QEMU it's not possible. We might be able to implement a "parking/unparking" chipset feature, but that would mean inventing and maintaining an ABI for it, which I'd like to avoid if possible. That's why I'm curious about what happens if a CPU exits SMM mode with another CPU's saved register state in case of the race, and whether we could ignore the consequences of it. (it's fine for the guest OS to crash or for the new CPU to not work; the attacker would only affect itself)

2) Let trusted software (SMM and init code) guarantee SMREBASE one by one (including any code that runs before SMREBASE)

that would mean pulling all present CPUs into SMM mode so no attack code could be executing before doing the hotplug. With a lot of present CPUs it could be quite expensive, and unlike physical hardware, a guest's CPUs can be preempted arbitrarily long, causing long delays. (A sketch of this serialization appears below.)

3) Let trusted software (SMM and init code) support SMREBASE simultaneously (including any code that runs before SMREBASE).

Is it really possible to do in software? Potentially it could be done in hardware (QEMU/KVM) if each CPU had its own SMRAM at 30000, so CPUs relocated in parallel wouldn't trample over each other. But KVM has only 2 address spaces (normal RAM and SMM), so it won't just work out of the box (and I recall that Paolo had some reservations versus adding more). Also it would mean adding an ABI for initializing those SMRAM blocks from another CPU, which could be complicated.

Solution #1 or #2 are the simple solutions.

let's first see if we can ignore the race, and if we can't, then we probably end up with implementing some form of #1
Mike
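[Editor's sketch of the non-enforced serialization in solution #2, as seen from the host-CPU (monarch) side; every function here is a placeholder, not an existing edk2 or QEMU API.]

    #include <stdbool.h>
    #include <stdint.h>

    extern void send_init_sipi_sipi(uint32_t apic_id); /* hypothetical */
    extern void send_directed_smi(uint32_t apic_id);   /* hypothetical */
    extern bool relocation_acked(uint32_t apic_id);    /* hypothetical */
    extern void cpu_pause(void);                       /* hypothetical */

    /* One hot-added CPU at a time through the shared default SMBASE. */
    void relocate_hotplugged_cpus(const uint32_t *new_apic_ids, int count)
    {
        for (int i = 0; i < count; i++) {
            send_init_sipi_sipi(new_apic_ids[i]); /* park it in the flash hlt loop */
            send_directed_smi(new_apic_ids[i]);   /* ensure an SMI is pending      */
            while (!relocation_acked(new_apic_ids[i])) {
                cpu_pause();  /* wait for the ACK before waking the next CPU */
            }
        }
    }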
-----Original Message-----
From: Yao, Jiewen
Sent: Thursday, August 22, 2019 10:00 PM
To: Kinney, Michael D <michael.d.kinney@...>; Paolo Bonzini <pbonzini@...>; Laszlo Ersek <lersek@...>; rfc@edk2.groups.io
Cc: Alex Williamson <alex.williamson@...>; devel@edk2.groups.io; qemu devel list <qemu-devel@...>; Igor Mammedov <imammedo@...>; Chen, Yingwen <yingwen.chen@...>; Nakajima, Jun <jun.nakajima@...>; Boris Ostrovsky <boris.ostrovsky@...>; Joao Marcal Lemos Martins <joao.m.martins@...>; Phillip Goerl <phillip.goerl@...>
Subject: RE: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
Thank you Mike!
That is good reference on the real hardware behavior. (Glad it is public.)
For the threat model, the unique part in a virtual environment is temp RAM. The temp RAM in a real platform is per-CPU cache, while the temp RAM in a virtual platform is global memory. That brings one more potential attack surface in the virtual environment, if a hot-added CPU needs to run code with stack or heap before the SMI rebase.
Other threats, such as SMRAM or DMA, are same.
Thank you Yao Jiewen
-----Original Message-----
From: Kinney, Michael D
Sent: Friday, August 23, 2019 9:03 AM
To: Paolo Bonzini <pbonzini@...>; Laszlo Ersek <lersek@...>; rfc@edk2.groups.io; Yao, Jiewen <jiewen.yao@...>; Kinney, Michael D <michael.d.kinney@...>
Cc: Alex Williamson <alex.williamson@...>; devel@edk2.groups.io; qemu devel list <qemu-devel@...>; Igor Mammedov <imammedo@...>; Chen, Yingwen <yingwen.chen@...>; Nakajima, Jun <jun.nakajima@...>; Boris Ostrovsky <boris.ostrovsky@...>; Joao Marcal Lemos Martins <joao.m.martins@...>; Phillip Goerl <phillip.goerl@...>
Subject: RE: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
Paolo,
I find the following links related to the discussions here, along with one example feature called GENPROTRANGE.

https://csrc.nist.gov/CSRC/media/Presentations/The-Whole-is-Greater/images-media/day1_trusted-computing_200-250.pdf
https://cansecwest.com/slides/2017/CSW2017_Cuauhtemoc-Rene_CPU_Hot-Add_flow.pdf
https://www.mouser.com/ds/2/612/5520-5500-chipset-ioh-datasheet-1131292.pdf
Best regards,
Mike
-----Original Message-----
From: Paolo Bonzini [mailto:pbonzini@...]
Sent: Thursday, August 22, 2019 4:12 PM
To: Kinney, Michael D <michael.d.kinney@...>; Laszlo Ersek <lersek@...>; rfc@edk2.groups.io; Yao, Jiewen <jiewen.yao@...>
Cc: Alex Williamson <alex.williamson@...>; devel@edk2.groups.io; qemu devel list <qemu-devel@...>; Igor Mammedov <imammedo@...>; Chen, Yingwen <yingwen.chen@...>; Nakajima, Jun <jun.nakajima@...>; Boris Ostrovsky <boris.ostrovsky@...>; Joao Marcal Lemos Martins <joao.m.martins@...>; Phillip Goerl <phillip.goerl@...>
Subject: Re: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
On 23/08/19 00:32, Kinney, Michael D wrote:

Paolo,

It is my understanding that real HW hot plug uses the SDM defined methods. Meaning the initial SMI is to 3000:8000 and they rebase to TSEG in the first SMI. They must have chipset specific methods to protect 3000:8000 from DMA.

It would be great if you could check.

Can we add a chipset feature to prevent DMA to the 64KB range from 0x30000-0x3FFFF, and update the UEFI Memory Map and ACPI content so the Guest OS knows not to use that range for DMA?

If real hardware does it at the chipset level, we will probably use Igor's suggestion of aliasing A-seg to 3000:0000. Before starting the new CPU, the SMI handler can prepare the SMBASE relocation trampoline at A000:8000, and the hot-plugged CPU will find it at 3000:8000 when it receives the initial SMI. Because this is backed by RAM at 0xA0000-0xAFFFF, DMA cannot access it and would still go through to RAM at 0x30000.
Paolo
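[Editor's sketch of the trampoline preparation Paolo describes, assuming the A-seg alias is in place; the stub symbols are illustrative.]

    #include <stdint.h>
    #include <string.h>

    /* Illustrative symbols delimiting the 16-bit relocation stub. */
    extern const uint8_t smbase_reloc_stub_start[], smbase_reloc_stub_end[];

    /* Copy the stub to A000:8000.  With A-seg aliased to 3000:0000, the
     * hot-plugged CPU fetches it from 3000:8000 on its first SMI, while
     * DMA to 0x38000 still hits the ordinary RAM underneath. */
    void prepare_default_smi_entry(void)
    {
        uint8_t *entry = (uint8_t *)0xA8000u;   /* A000:8000 */
        memcpy(entry, smbase_reloc_stub_start,
               (size_t)(smbase_reloc_stub_end - smbase_reloc_stub_start));
    }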
Igor Mammedov <imammedo@...>
On Mon, 26 Aug 2019 17:30:43 +0200 Laszlo Ersek <lersek@...> wrote:
On 08/23/19 17:25, Kinney, Michael D wrote:
Hi Jiewen,
If a hot add CPU needs to run any code before the first SMI, I would recommend it only executes code from a write protected FLASH range, without a stack, and then waits for the first SMI.

"without a stack" looks very risky to me. Even if we manage to implement the guest code initially, we'll be trapped without a stack, should we ever need to add more complex stuff there.

Do we need anything complex in the relocation handler, though? From what I'd imagine, a minimum handler should:
1: get the address of TSEG, possibly reading it from the chipset
2: calculate its new SMBASE offset based on its APIC ID
3: save the new SMBASE

For this OVMF use case, is any CPU init required before the first SMI?

I expressed a preference for that too: "I wish we could simply wake the new CPU [...] with an SMI".
http://mid.mail-archive.com/398b3327-0820-95af-a34d-1a4a1d50cf35@redhat.com
From Paolo's list of steps, are steps (8a) and (8b) really required?
07b implies 08b

8b could be a trivial hlt loop, and we most likely could skip 08a and the host-CPU signaling steps, but we need the INIT/SIPI/SIPI sequence to wake up the AP so it can handle the pending SMI before handling the SIPI (so the behavior would follow the SDM).

See again my message linked above -- just after the quoted sentence, I wrote, "IOW, if we could excise steps 07b, 08a, 08b".
But, I obviously defer to Paolo and Igor on that.
(I do believe we have a dilemma here. In QEMU, we probably prefer to emulate physical hardware as faithfully as possible. However, we do not have Cache-As-RAM (nor do we intend to, IIUC). Does that justify other divergences from physical hardware too, such as waking just by virtue of an SMI?)

So far we should be able to implement it per spec (at least the SDM one), but we would still need to invent chipset hardware, i.e. like adding to Q35 non-exiting SMRAM and means to map/unmap it to the non-SMM address space. (and I hope we could avoid adding a "parked CPU" thingy)

Can the SMI monarch use the Local APIC to send a directed SMI to the hot added CPU? The SMI monarch needs to know the APIC ID of the hot added CPU. Do we also need to handle the case where multiple CPUs are added at once? I think we would need to serialize the use of 3000:8000 for the SMM rebase operation on each hot added CPU.

I agree this would be a huge help.
We can serialize it (for the normal hotplug flow) from the ACPI handler in the guest (i.e. non-enforced serialization). The only reason for serialization I see is not to allow a bunch of new CPUs to trample over the default SMBASE save area at the same time.

There is a consideration though: an OS level attacker could send broadcast SMI and INIT-SIPI-SIPI sequences to trigger the race, but I don't see it as a threat, since the attacker shouldn't be able to exploit anything, and in the worst case the guest OS would crash (taking into account that SMIs are privileged, an OS attacker has plenty of other means to kill itself).

It would be simpler if we can guarantee that only one CPU can be added or removed at a time, and the complete flow of adding a CPU to SMM and the OS needs to be completed before another add/remove event needs to be processed.

I don't know if the QEMU monitor command in question can guarantee this serialization. I think such a request/response pattern is generally implementable between QEMU and guest code.
But, AIUI, the "device-add" monitor command is quite generic, and used for hot-plugging a number of other (non-CPU) device models. I'm unsure if the pattern in question can be squeezed into "device-add". (It's not a dedicated command for CPU hotplug.)
... Apologies that I didn't add much information to the thread, just now. I'd like to keep the discussion going.
Thanks Laszlo
Paolo Bonzini <pbonzini@...>
On 22/08/19 22:06, Kinney, Michael D wrote:

The SMBASE register is internal and cannot be directly accessed by any CPU. There is an SMBASE field that is a member of the SMM Save State area; it can only be modified from SMM, and it requires the execution of an RSM instruction from SMM for the SMBASE register to be updated from the current SMBASE field value. The new SMBASE register value is only used on the next SMI.

Actually there is also an SMBASE MSR, even though in current silicon it's read-only and its use is theoretically limited to SMM-transfer monitors. If that MSR could be made accessible somehow outside SMM, that would be great.

Once all the CPUs have been initialized for SMM, the CPUs that are not needed can be hot removed. As noted above, the SMBASE value does not change on an INIT. So as long as the hot add operation does not do a RESET, the SMBASE value must be preserved.

IIRC, hot-remove + hot-add will unplug/plug a completely different CPU.

Another idea is to emulate this behavior: have the hot plug controller provide registers (only accessible from SMM) to assign the SMBASE address for every CPU. When a CPU is hot added, QEMU can set the internal SMBASE register value from the hot plug controller register value. If the SMM Monarch sends an INIT or an SMI from the Local APIC to the hot added CPU, then the SMBASE register should not be modified, and the CPU starts execution within TSEG the first time it receives an SMI.

Yes, this would work. But again -- if the issue is real on current hardware too, I'd rather have a matching solution for virtual platforms. If current hardware, for example, remembers the INIT-preserved SMBASE across hot-remove/hot-add, we could emulate that.

I guess the fundamental question is: how do bare metal platforms avoid this issue, or plan to avoid this issue? Once we know that, we can use that information to find a way to implement it in KVM. Only if it is impossible will we adopt a different strategy that is specific to our platform.

Paolo

Jiewen and I can collect specific questions on this topic and continue the discussion here. For example, I do not think there is any method other than what I referenced above to program the SMBASE register, but I can ask if there are any other methods.
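[Editor's sketch of what the "registers only accessible from SMM" idea could look like; entirely hypothetical, no such QEMU ABI exists.]

    #include <stdint.h>

    #define HOTPLUG_CTRL_MAX_CPUS 288   /* illustrative upper bound */

    /* Hypothetical hot-plug controller register bank: SMM pre-assigns each
     * CPU's SMBASE, and QEMU loads the internal SMBASE register from here
     * when the CPU is hot-added.  Non-SMM accesses would be rejected. */
    typedef struct {
        uint32_t smbase[HOTPLUG_CTRL_MAX_CPUS]; /* indexed by APIC ID */
    } HotplugCtrlRegs;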
|
|
I give my thought. Paolo may add more.
toggle quoted message
Show quoted text
-----Original Message----- From: Kinney, Michael D Sent: Friday, August 23, 2019 11:25 PM To: Yao, Jiewen <jiewen.yao@...>; Paolo Bonzini <pbonzini@...>; Laszlo Ersek <lersek@...>; rfc@edk2.groups.io; Kinney, Michael D <michael.d.kinney@...> Cc: Alex Williamson <alex.williamson@...>; devel@edk2.groups.io; qemu devel list <qemu-devel@...>; Igor Mammedov <imammedo@...>; Chen, Yingwen <yingwen.chen@...>; Nakajima, Jun <jun.nakajima@...>; Boris Ostrovsky <boris.ostrovsky@...>; Joao Marcal Lemos Martins <joao.m.martins@...>; Phillip Goerl <phillip.goerl@...> Subject: RE: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
Hi Jiewen,
If a hot add CPU needs to run any code before the first SMI, I would recommend is only executes code from a write protected FLASH range without a stack and then wait for the first SMI. [Jiewen] Right. Another option from Paolo, the new CPU will not run until 0x7b. To mitigate DMA threat, someone need guarantee the low memory SIPI vector is DMA protected. NOTE: The LOW memory *could* be mapped to write protected FLASH AREA via PAM register. The Host CPU may setup that in SMM. If that is the case, we don’t need worry DMA. I copied the detail step here, because I found it is hard to dig them out again. ==================== (01a) QEMU: create new CPU. The CPU already exists, but it does not start running code until unparked by the CPU hotplug controller. (01b) QEMU: trigger SCI (02-03) no equivalent (04) Host CPU: (OS) execute GPE handler from DSDT (05) Host CPU: (OS) Port 0xB2 write, all CPUs enter SMM (NOTE: New CPU will not enter CPU because SMI is disabled) (06) Host CPU: (SMM) Save 38000, Update 38000 -- fill simple SMM rebase code. (07a) Host CPU: (SMM) Write to CPU hotplug controller to enable new CPU (07b) Host CPU: (SMM) Send INIT/SIPI/SIPI to new CPU. (08a) New CPU: (Low RAM) Enter protected mode. (08b) New CPU: (Flash) Signals host CPU to proceed and enter cli;hlt loop. (09) Host CPU: (SMM) Send SMI to the new CPU only. (10) New CPU: (SMM) Run SMM code at 38000, and rebase SMBASE to TSEG. (11) Host CPU: (SMM) Restore 38000. (12) Host CPU: (SMM) Update located data structure to add the new CPU information. (This step will involve CPU_SERVICE protocol) (13) New CPU: (Flash) do whatever other initialization is needed (14) New CPU: (Flash) Deadloop, and wait for INIT-SIPI-SIPI. (15) Host CPU: (OS) Send INIT-SIPI-SIPI to pull new CPU in.. ==================== For this OVMF use case, is any CPU init required before the first SMI?
[Jiewen] I am not sure what the detailed action in 08b is. And I am not sure what your "init" means here. Personally, I don't think we need much init work, such as microcode or MTRRs, but we need detailed info.

From Paolo's list of steps, are steps (08a) and (08b) really required? Can the SMI monarch use the Local APIC to send a directed SMI to the hot added CPU? The SMI monarch needs to know the APIC ID of the hot added CPU.

[Jiewen] I think that depends upon the virtual hardware design. I leave that question to Paolo.

Do we also need to handle the case where multiple CPUs are added at once? I think we would need to serialize the use of 3000:8000 for the SMM rebase operation on each hot added CPU. It would be simpler if we could guarantee that only one CPU can be added or removed at a time, and that the complete flow of adding a CPU to SMM and to the OS is finished before another add/remove event is processed.

[Jiewen] Right. I treat multiple CPU hot-add at the same time as a potential threat. We don't want to trust the end user. The solution could be:
1) Let trusted hardware guarantee hot-add one by one.
2) Let trusted software (SMM and init code) guarantee SMREBASE one by one (including any code that runs before SMREBASE).
3) Let trusted software (SMM and init code) support SMREBASE simultaneously (including any code that runs before SMREBASE).
Solutions #1 and #2 are the simple ones; a sketch of option #2 follows below.

Mike
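A minimal sketch of option #2 above: the SMM monarch walks the hot-added CPUs strictly one at a time, so the single trampoline at the default SMBASE is never in use by two CPUs at once. All names here are illustrative assumptions, not edk2 or QEMU APIs:

  #include <stdint.h>

  /* Hypothetical helpers, assumed to exist for this sketch. */
  extern void SendSmiToApic (uint32_t ApicId);
  extern void WaitForRebaseDone (uint32_t ApicId);

  static void
  RebaseHotAddedCpus (const uint32_t *ApicIds, unsigned Count)
  {
    /* Serialize the use of 3000:8000: fully rebase CPU i (steps
     * (09)-(11) of the flow above) before touching CPU i+1. */
    for (unsigned i = 0; i < Count; i++) {
      SendSmiToApic (ApicIds[i]);
      WaitForRebaseDone (ApicIds[i]);
    }
  }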
-----Original Message----- From: Yao, Jiewen Sent: Thursday, August 22, 2019 10:00 PM To: Kinney, Michael D <michael.d.kinney@...>; Paolo Bonzini <pbonzini@...>; Laszlo Ersek <lersek@...>; rfc@edk2.groups.io Cc: Alex Williamson <alex.williamson@...>; devel@edk2.groups.io; qemu devel list <qemu-devel@...>; Igor Mammedov <imammedo@...>; Chen, Yingwen <yingwen.chen@...>; Nakajima, Jun <jun.nakajima@...>; Boris Ostrovsky <boris.ostrovsky@...>; Joao Marcal Lemos Martins <joao.m.martins@...>; Phillip Goerl <phillip.goerl@...> Subject: RE: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
Thank you Mike!
That is a good reference on the real hardware behavior. (Glad it is public.)

For the threat model, the unique part in a virtual environment is temp RAM. The temp RAM on a real platform is per-CPU cache, while the temp RAM on a virtual platform is global memory. That brings one more potential attack surface in the virtual environment, if a hot-added CPU needs to run code with a stack or heap before the SMI rebase.

Other threats, such as SMRAM or DMA, are the same.
Thank you Yao Jiewen
-----Original Message----- From: Kinney, Michael D Sent: Friday, August 23, 2019 9:03 AM To: Paolo Bonzini <pbonzini@...>; Laszlo Ersek <lersek@...>; rfc@edk2.groups.io; Yao, Jiewen <jiewen.yao@...>; Kinney, Michael D <michael.d.kinney@...> Cc: Alex Williamson <alex.williamson@...>; devel@edk2.groups.io; qemu devel list <qemu-devel@...>; Igor Mammedov <imammedo@...>; Chen, Yingwen <yingwen.chen@...>; Nakajima, Jun <jun.nakajima@...>; Boris Ostrovsky <boris.ostrovsky@...>; Joao Marcal Lemos Martins <joao.m.martins@...>; Phillip Goerl <phillip.goerl@...> Subject: RE: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
Paolo,
I find the following links related to the discussions here, along with one example feature called GENPROTRANGE.

https://csrc.nist.gov/CSRC/media/Presentations/The-Whole-is-Greater/images-media/day1_trusted-computing_200-250.pdf
https://cansecwest.com/slides/2017/CSW2017_Cuauhtemoc-Rene_CPU_Hot-Add_flow.pdf
https://www.mouser.com/ds/2/612/5520-5500-chipset-ioh-datasheet-1131292.pdf
Best regards,
Mike
-----Original Message----- From: Paolo Bonzini [mailto:pbonzini@...] Sent: Thursday, August 22, 2019 4:12 PM To: Kinney, Michael D <michael.d.kinney@...>; Laszlo Ersek <lersek@...>; rfc@edk2.groups.io; Yao, Jiewen <jiewen.yao@...> Cc: Alex Williamson <alex.williamson@...>; devel@edk2.groups.io; qemu devel list <qemu-devel@...>; Igor Mammedov <imammedo@...>; Chen, Yingwen <yingwen.chen@...>; Nakajima, Jun <jun.nakajima@...>; Boris Ostrovsky <boris.ostrovsky@...>; Joao Marcal Lemos Martins <joao.m.martins@...>; Phillip Goerl <phillip.goerl@...> Subject: Re: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
On 23/08/19 00:32, Kinney, Michael D wrote:

Paolo,

It is my understanding that real HW hot plug uses the SDM defined methods. Meaning the initial SMI is to 3000:8000 and they rebase to TSEG in the first SMI. They must have chipset specific methods to protect 3000:8000 from DMA.

It would be great if you could check.

Can we add a chipset feature to prevent DMA to the 64KB range from 0x30000-0x3FFFF, and the UEFI Memory Map and ACPI content can be updated so the Guest OS knows not to use that range for DMA?

If real hardware does it at the chipset level, we will probably use Igor's suggestion of aliasing A-seg to 3000:0000. Before starting the new CPU, the SMI handler can prepare the SMBASE relocation trampoline at A000:8000, and the hot-plugged CPU will find it at 3000:8000 when it receives the initial SMI. Because this is backed by RAM at 0xA0000-0xAFFFF, DMA cannot access it and would still go through to RAM at 0x30000.

Paolo
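A rough sketch of what this aliasing could look like in QEMU's memory API terms; the smram_aseg region and the wiring point are assumptions, and whether the alias belongs in the SMM-only address space is exactly the open design question, so treat this as an illustration rather than a patch:

  /* Alias 64 KiB of the A-seg SMRAM region at the default SMBASE.
   * "smram_aseg" is assumed to be an existing MemoryRegion covering
   * SMRAM at 0xA0000-0xAFFFF; nothing here is a tested QEMU patch. */
  #include "qemu/osdep.h"
  #include "exec/memory.h"

  static MemoryRegion smbase_alias;

  static void map_default_smbase_alias(MemoryRegion *root,
                                       MemoryRegion *smram_aseg,
                                       Object *owner)
  {
      memory_region_init_alias(&smbase_alias, owner, "smbase-default-alias",
                               smram_aseg, 0 /* offset into A-seg */,
                               0x10000 /* 0x30000..0x3FFFF */);
      /* In a complete implementation "root" would be the SMM-only view
       * (as A-seg/TSEG are mapped today), so CPU accesses in SMM at
       * 0x30000 hit SMRAM through the alias, while DMA through the
       * normal address space still reaches RAM at 0x30000. */
      memory_region_add_subregion_overlap(root, 0x30000, &smbase_alias, 1);
  }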
On 08/22/19 20:51, Paolo Bonzini wrote:

On 22/08/19 20:29, Laszlo Ersek wrote:

On 08/22/19 08:18, Paolo Bonzini wrote:

On 21/08/19 22:17, Kinney, Michael D wrote:

DMA protection of memory ranges is a chipset feature. For the current QEMU implementation, what ranges of memory are guaranteed to be protected from DMA? Is it only A/B seg and TSEG?

Yes.

This thread (esp. Jiewen's and Mike's messages) is the first time that I've heard about the *existence* of such RAM ranges / the chipset feature. :)

Out of interest (independently of virtualization), how is a general purpose OS informed by the firmware, "never try to set up DMA to this RAM area"? Is this communicated through ACPI _CRS perhaps?

... Ah, almost: ACPI 6.2 specifies _DMA, in "6.2.4 _DMA (Direct Memory Access)". It writes,

For example, if a platform implements a PCI bus that cannot access all of physical memory, it has a _DMA object under that PCI bus that describes the ranges of physical memory that can be accessed by devices on that bus.

Sorry about the digression, and also about being continually late to this thread -- I'm primarily following and learning.

It's much simpler: these ranges are not in e820. For example:

kernel: BIOS-e820: [mem 0x0000000000059000-0x000000000008bfff] usable
kernel: BIOS-e820: [mem 0x000000000008c000-0x00000000000fffff] reserved

(1) Sorry, my _DMA quote was a detour from QEMU -- I wondered how a physical machine with actual RAM at 0x30000, and also chipset level protection against DMA to/from that RAM range, would expose the fact to the OS (so that the OS does not innocently try to set up DMA there).

(2) In the case of QEMU+OVMF, "e820" is not defined at the firmware level. While

- QEMU exports an "e820 map" (and OVMF does utilize that),
- and Linux parses the UEFI memmap into an "e820 map" (so that dependent logic only needs to deal with e820),

in edk2 the concepts are "GCD memory space map" and "UEFI memmap". So what OVMF does is, it reserves the TSEG area in the UEFI memmap:

https://github.com/tianocore/edk2/commit/b09c1c6f2569a

(This was later de-constified for the extended TSEG size, in commit 23bfb5c0aab6, "OvmfPkg/PlatformPei: prepare for PcdQ35TsegMbytes becoming dynamic", 2017-07-05.)

This is just to say that with OVMF, TSEG is not absent from the UEFI memmap; it is reserved instead. (Apologies if I misunderstood and you didn't actually claim otherwise.)

The ranges are not special-cased in any way by QEMU. Simply, the AB-segs and TSEG RAM are not part of the address space except when in SMM.

(or when TSEG is not locked, and open; but:) yes, this matches my understanding.

Therefore, DMA to those ranges ends up respectively in low VGA RAM [1] and in the bit bucket. When the AB-segs are open, for example, DMA to that area becomes possible.

Paolo

Which seems to imply that, if we alias 0x30000 to the AB-segs, and rely on the AB-segs for CPU hotplug, OVMF should close and lock down the AB-segs at first boot. Correct? (Because OVMF doesn't do anything about AB at the moment.)

Thanks
Laszlo
[1] old timers may remember DEF SEG=&HB800: BLOAD "foo.img",0. It still works with some display device models.
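In concrete edk2 terms, the UEFI-memmap reservation mentioned above boils down to a memory allocation HOB published from PEI; a minimal sketch in the spirit of OVMF's TSEG reservation, with TsegBase/TsegSize as illustrative parameters:

  /* Minimal sketch of reserving a RAM range in the UEFI memory map
   * from a PEI platform driver. EfiReservedMemoryType is what keeps
   * the OS (and its DMA allocations) out of the range. */
  #include <PiPei.h>
  #include <Library/HobLib.h>

  VOID
  ReserveTseg (
    EFI_PHYSICAL_ADDRESS  TsegBase,   // e.g. top of low RAM minus TSEG size
    UINT64                TsegSize    // e.g. PcdQ35TsegMbytes * SIZE_1MB
    )
  {
    BuildMemoryAllocationHob (TsegBase, TsegSize, EfiReservedMemoryType);
  }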
Paolo Bonzini <pbonzini@...>
On 21/08/19 22:17, Kinney, Michael D wrote:

Paolo,

It makes sense to match real HW.

Note that it'd also be fine to match some kind of official Intel specification, even if no processor (currently?) supports it.

That puts us back to the reset vector and handling the initial SMI at 3000:8000. That is all workable from a FW implementation perspective. It looks like the only issue left is DMA.

DMA protection of memory ranges is a chipset feature. For the current QEMU implementation, what ranges of memory are guaranteed to be protected from DMA? Is it only A/B seg and TSEG?

Yes.

Yes, all of these would work. Again, I'm interested in having something that has a hope of being implemented in real hardware.

Another, far easier to implement possibility could be a lockable MSR (it could be the existing MSR_SMM_FEATURE_CONTROL) that allows programming the SMBASE outside SMM. It would be nice if such a bit could be defined by Intel.
Paolo
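As a thought experiment, the lockable-MSR idea might look like this from firmware's side. IA32_SMBASE (MSR index 0x9E) and MSR_SMM_FEATURE_CONTROL (0x4E0) exist today, but IA32_SMBASE is read-only and SMM-scoped; making it writable outside SMM, and the lock bit below, are the hypothetical parts Intel would have to define:

  #include <stdint.h>

  #define MSR_SMM_FEATURE_CONTROL  0x4E0      /* existing MSR            */
  #define MSR_IA32_SMBASE          0x09E      /* existing, read-only MSR */
  #define SMBASE_WRITE_LOCK        (1u << 3)  /* hypothetical lock bit   */

  /* Assumed helpers wrapping the WRMSR/RDMSR instructions. */
  extern void     wrmsr(uint32_t index, uint64_t value);
  extern uint64_t rdmsr(uint32_t index);

  static void program_and_lock_smbase(uint64_t new_smbase)
  {
    wrmsr(MSR_IA32_SMBASE, new_smbase);   /* hypothetical: RO today */
    wrmsr(MSR_SMM_FEATURE_CONTROL,
          rdmsr(MSR_SMM_FEATURE_CONTROL) | SMBASE_WRITE_LOCK);
    /* Once locked, further SMBASE writes would #GP until the next
     * reset, closing the window for an attacker after firmware has
     * programmed the hot-added CPU. */
  }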