RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept


Tobin Feldman-Fitzthum <tobin@...>
 

Hello,

Dov Murik, James Bottomley, Hubertus Franke, and I have been working on a plan for fast live migration of SEV and SEV-ES (and SEV-SNP when it's out, and hopefully even Intel TDX) VMs. We have developed an approach that we believe is feasible, along with a demonstration that shows our solution to the most difficult part of the problem. In short, we have implemented a UEFI application that can resume from a VM snapshot. We think this is the crux of SEV-ES live migration. After describing the context of our demo and how it works, we explain how it can be extended to a full SEV-ES migration. Our goal is to show that fast SEV and SEV-ES live migration can be implemented in OVMF with minimal kernel changes. We provide a blueprint for doing so.

Typically the hypervisor facilitates live migration. AMD SEV excludes the hypervisor from the trust domain of the guest. When a hypervisor (HV) examines the memory of an SEV guest, it finds only ciphertext. If the HV moves the memory of an SEV guest, the ciphertext is invalidated, because SEV ties the encryption of each page to its physical address. Furthermore, with SEV-ES the hypervisor is largely unable to access guest CPU state. Thus, fast migration of SEV VMs requires support from inside the trust domain, i.e. the guest.

One approach is to add support for SEV migration to the Linux kernel. This would allow the guest to encrypt/decrypt its own memory with a transport key. That approach has met some resistance. We propose a similar approach implemented not in Linux, but in firmware, specifically OVMF. Since OVMF runs inside the guest, it has access to the guest memory and CPU state, so it should be able to perform the manipulations required for live migration of SEV and SEV-ES guests.

The biggest challenge of this approach is migrating the CPU state of an SEV-ES guest. In a normal (non-SEV) migration, the HV sets the CPU state of the target before the target begins executing. In our approach, the HV starts the target and OVMF must resume to whatever state the source was in. We believe this to be the crux (or at least the most difficult part) of live migration for SEV, and we hope that by demonstrating resume from EFI, we can show that our approach is generally feasible.

Our demo can be found at <https://github.com/secure-migration>. The tooling repository is the best starting point; it contains documentation about the project and the scripts needed to run the demo. There are two more repos associated with the project: a modified edk2 tree that contains our modified OVMF, and a modified QEMU with a couple of temporary changes needed for the demo. Our demonstration is aimed only at resuming from a VM snapshot in OVMF: we provide the source CPU state and source memory to the destination using temporary plumbing that violates the SEV trust model. We explain the setup in more depth in README.md. At the end we describe our plan for transferring CPU state and memory from source to destination. To be clear, the temporary tooling used for this demo isn't built for encrypted VMs, but below we explain how this demo applies to encrypted VMs and can be extended to them.

We implemented our resume code in a fashion very similar to the recommended S3 resume code. When the HV sets the CPU state of a guest, it can do so while the guest is not executing. Setting the state from inside the guest is a more delicate operation: there is no way to atomically set all of the CPU state from inside the guest. Instead, we must set most registers individually and account for the changes in control flow that doing so causes. We do this with a three-phase trampoline. OVMF calls phase 1, which runs on the OVMF map. Phase 1 sets up phase 2 and jumps to it. Phase 2 switches to an intermediate map that reconciles the OVMF map and the source map. Phase 3 switches to the source map, restores the registers, and returns into execution of the source. We will go through these phases in more depth, in reverse order.

The last thing that resume to EFI does is return. Specifically, we use IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a temporary stack and restores them atomically, thus returning to source execution. Prior to returning, we must manually restore most other registers to the values they had on the source. One particularly significant register is CR3: when we return to Linux, CR3 must be set to the source CR3, or the first instruction executed in Linux will cause a page fault. Likewise, the code that we use to restore the registers and return must be mapped in the source page table, or we would get a page fault executing the instructions prior to returning into Linux. The value of CR3 is so significant that it defines the three phases of the trampoline: phase 3 begins when CR3 is set to the source CR3. After setting CR3, we set all the other registers and return.
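To make the mechanics concrete, here is a minimal sketch of the tail of phase 3 (not the demo's actual code; general-purpose register restoration is elided and all names are illustrative). It assumes the trampoline page is mapped at the same virtual address in the source page table, which is exactly what the intermediate map described next arranges:

    /* Hypothetical tail of phase 3, as GCC-style inline assembly in C.
     * Assumes a 5-quadword frame (RIP, CS, RFLAGS, RSP, SS) has been
     * prepared, lowest field first, on a temporary stack. */
    typedef unsigned long long u64;

    struct iret_frame {
        u64 rip;    /* popped first by IRETQ */
        u64 cs;
        u64 rflags;
        u64 rsp;
        u64 ss;     /* popped last */
    };

    static void __attribute__((noreturn))
    phase3_return(u64 source_cr3, struct iret_frame *frame)
    {
        asm volatile(
            "movq %0, %%cr3\n\t" /* phase 3 begins: source page table  */
            /* ...restore general-purpose registers here...            */
            "movq %1, %%rsp\n\t" /* point RSP at the prepared frame    */
            "iretq"              /* atomically restore RIP, CS, RFLAGS,
                                    RSP, SS and resume the source      */
            : : "r"(source_cr3), "r"(frame) : "memory");
        __builtin_unreachable();
    }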

Phase 2 mainly exists to set up phase 3. OVMF uses a 1:1 (identity) mapping, meaning that virtual addresses are the same as physical addresses. The kernel page table uses an offset mapping, meaning that virtual addresses differ from physical addresses by a constant (for the most part). Crucially, this means that the virtual address of the page executed by phase 3 differs between the OVMF map and the source map. If we are executing code mapped in OVMF and we change CR3 to point to the source map, then even though the page may be mapped in the source map, its virtual address will be different, and we will face undefined behavior. To fix this, we construct intermediate page tables that map the pages for phases 2 and 3 both to the virtual address expected by OVMF and to the virtual address expected by the source map. Thus, we can switch CR3 from OVMF's map to the intermediate map, and then from the intermediate map to the source map. Phase 2 is much shorter than phase 3: it is mainly responsible for switching to the intermediate map, flushing the TLB, and jumping to phase 3.
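As a rough illustration of what the intermediate map does, the sketch below builds a 4-level page table in which one physical trampoline page is visible at two virtual addresses: its OVMF (identity) address and the address the source map uses for it. This is a simplification: the real tables must also cover phase 2, honor the SEV C-bit, and handle permissions, and alloc_page_table is an assumed helper returning zeroed, page-aligned memory:

    #include <stdint.h>

    #define PTE_PRESENT 0x1ULL
    #define PTE_WRITE   0x2ULL

    /* Assumed helper: returns a zeroed, page-aligned 4 KiB table whose
     * virtual address equals its physical address (OVMF's 1:1 map). */
    extern uint64_t *alloc_page_table(void);

    /* Extend a 4-level table so that virtual address 'va' maps to
     * physical address 'pa' with a 4 KiB page. */
    static void map_4k(uint64_t *pml4, uint64_t va, uint64_t pa)
    {
        uint64_t *table = pml4;

        /* Walk PML4 (bits 47:39), PDPT (38:30), PD (29:21), allocating
         * missing levels; the final level (PT) is filled in below. */
        for (int shift = 39; shift > 12; shift -= 9) {
            uint64_t *entry = &table[(va >> shift) & 0x1FF];
            if (!(*entry & PTE_PRESENT))
                *entry = (uint64_t)alloc_page_table()
                         | PTE_PRESENT | PTE_WRITE;
            table = (uint64_t *)(*entry & ~0xFFFULL);
        }
        table[(va >> 12) & 0x1FF] = pa | PTE_PRESENT | PTE_WRITE;
    }

    /* The trampoline page appears at both virtual addresses, so CR3 can
     * switch from the OVMF map to this map, and then on to the source
     * map, without the executing code vanishing from under RIP. */
    void build_intermediate_map(uint64_t *pml4, uint64_t trampoline_pa,
                                uint64_t source_va)
    {
        map_4k(pml4, trampoline_pa, trampoline_pa); /* OVMF's view   */
        map_4k(pml4, source_va, trampoline_pa);     /* source's view */
    }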

Fortunately, phase 1 is even simpler than phase 2. Phase 1 has two duties. First, since phases 2 and 3 operate without a stack and can't access values defined in OVMF (such as the addresses of the pages containing phases 2 and 3), phase 1 must pass these values to phase 2 by placing them in registers. Second, phase 1 must start phase 2 by jumping to it.
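Phase 1 might amount to nothing more than the following hand-off (hypothetical; the registers actually used in the demo may differ):

    #include <stdint.h>

    /* Hypothetical phase-1 hand-off: phases 2 and 3 run with no stack
     * and no access to OVMF globals, so everything they need travels in
     * registers. The choice of r12-r14 is illustrative only. */
    static void __attribute__((noreturn))
    phase1_start(uint64_t intermediate_cr3, uint64_t source_cr3,
                 uint64_t phase3_entry, uint64_t phase2_entry)
    {
        register uint64_t r12 asm("r12") = intermediate_cr3;
        register uint64_t r13 asm("r13") = source_cr3;
        register uint64_t r14 asm("r14") = phase3_entry;

        /* Jump (not call): there is no stack to return on. */
        asm volatile("jmp *%0"
                     : : "r"(phase2_entry), "r"(r12), "r"(r13), "r"(r14));
        __builtin_unreachable();
    }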

Given that we can resume to a snapshot in OVMF, we should be able to migrate an SEV guest as long as we can securely communicate the VM snapshot from source to destination. For our demo, we do this with a handful of QMP commands. More sophisticated methods are required for a production implementation.

When we refer to a snapshot, what we really mean is the device state, memory, and CPU state of a guest. In live migration this state is transmitted dynamically, as opposed to being saved and restored. Device state is not protected by SEV and can be handled entirely by the HV. Memory, on the other hand, cannot be handled by the HV alone: as mentioned previously, memory needs to be encrypted with a transport key. A Migration Handler on the source will coordinate with the HV to encrypt pages and transmit them to the destination. The destination HV will receive the pages over the network and pass them to the Migration Handler in the target VM so they can be decrypted. This transmission occurs continuously until the memory of the source and target converges.
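A sketch of the source-side loop follows; every name here (hv_next_dirty_page, transport_encrypt, hv_send_page) is invented for illustration, and the real protocol between the HV and the Migration Handler is still to be designed:

    #include <stdint.h>

    #define PAGE_SIZE 4096

    /* Assumed primitives, invented for illustration. */
    extern int  hv_next_dirty_page(uint64_t *gpa); /* 0 once converged */
    extern void transport_encrypt(const uint8_t *plain, uint8_t *cipher);
    extern void hv_send_page(uint64_t gpa, const uint8_t *cipher);

    /* Source-side Migration Handler: for each page the HV reports as
     * dirty, produce a transport-key ciphertext the HV can safely put
     * on the wire. Runs until source and target memory converge. */
    void migration_handler_source(void)
    {
        uint64_t gpa;
        uint8_t cipher[PAGE_SIZE];

        while (hv_next_dirty_page(&gpa)) {
            /* Guest-physical == guest-virtual under an identity map. */
            const uint8_t *plain = (const uint8_t *)gpa;
            transport_encrypt(plain, cipher);
            hv_send_page(gpa, cipher);
        }
    }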

Plain SEV does not protect the CPU state of the guest and therefore does not require any special mechanism for transmission of the CPU state; we plan to implement an end-to-end migration with plain SEV first. In SEV-ES, the PSP (Platform Security Processor) encrypts the CPU state on each VMExit and stores the encrypted state in memory. Normally this memory (known as the VMSA) is not mapped into the guest, but we can add an entry to the nested page tables that exposes the VMSA to the guest. This means that when the guest VMExits, the CPU state will be saved to guest memory. With the CPU state in guest memory, it can be transmitted to the target using the method described above.
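On the destination, once the VMSA page has been received and decrypted like any other page, the values the trampoline needs can simply be read out of guest memory. A sketch, with the architectural VMSA layout abstracted behind an invented accessor:

    #include <stdint.h>

    /* Invented accessor: the real field offsets come from the VMSA
     * layout in AMD's architecture manual and are omitted here. */
    extern uint64_t vmsa_read(const void *vmsa, const char *field);

    struct trampoline_state {
        uint64_t rip, rsp, rflags, cr3;
    };

    /* Pull the registers the trampoline restores out of a decrypted
     * VMSA page now sitting in target guest memory. */
    void extract_trampoline_state(const void *vmsa,
                                  struct trampoline_state *s)
    {
        s->rip    = vmsa_read(vmsa, "rip");
        s->rsp    = vmsa_read(vmsa, "rsp");
        s->rflags = vmsa_read(vmsa, "rflags");
        s->cr3    = vmsa_read(vmsa, "cr3");
    }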

In addition to the changes needed in OVMF to resume the VM, the transmission of the VM from source to target will require a new code path in the hypervisor. There will also need to be a few minor changes to Linux (adding a mapping for our phase 3 pages). Despite all the moving pieces, we believe that this is a feasible approach for supporting live migration for SEV and SEV-ES.

For the sake of brevity, we have left out a few issues, including SMP support, generation of the intermediate mappings, and more. We have included some notes about these issues in the COMPLICATIONS.md file. We also have an outline of an end-to-end implementation of live migration for SEV-ES in END-TO-END.md. See README.md for info on how to run the demo. While this is not a full migration, we hope it shows that fast live migration with SEV and SEV-ES is possible without major kernel changes.

-Tobin


Ashish Kalra <ashish.kalra@...>
 

Hello Tobin,

On Wed, Oct 28, 2020 at 03:31:44PM -0400, Tobin Feldman-Fitzthum wrote:
> [...]
> Plain SEV does not protect the CPU state of the guest and therefore
> does not require any special mechanism for transmission of the CPU
> state; we plan to implement an end-to-end migration with plain SEV
> first. In SEV-ES, the PSP (Platform Security Processor) encrypts the
> CPU state on each VMExit and stores the encrypted state in memory.
> Normally this memory (known as the VMSA) is not mapped into the guest,
> but we can add an entry to the nested page tables that exposes the
> VMSA to the guest.

I have a question here: is there any kind of integrity protection on
the CPU state when the target VM is resumed after migration? For
example, if a malicious hypervisor maps a page with subverted CPU state
into the nested page tables, what prevents the target VM from resuming
execution on subverted or compromised CPU state?

Thanks,
Ashish



Tobin Feldman-Fitzthum
 

On 2020-10-29 13:06, Ashish Kalra wrote:
> [...]
> I have a question here: is there any kind of integrity protection on
> the CPU state when the target VM is resumed after migration? For
> example, if a malicious hypervisor maps a page with subverted CPU
> state into the nested page tables, what prevents the target VM from
> resuming execution on subverted or compromised CPU state?

Good question. Here is my thinking. The VMSA is mapped into guest
memory, so it will be transmitted to the target like any other page,
with encryption and integrity checking. Thus we have integrity checking
for the CPU state while it is in flight.

I think you are wondering about something slightly different, though.
Once the page with the VMSA arrives at the target and is decrypted and
put in place, the hypervisor could potentially change the NPT to
replace the data. Since the page with the VMSA will be encrypted (and
the Migration Handler will expect this), the HV can't replace the page
with arbitrary values.

Since the VMSA is in memory, we have the protections that SEV provides
for memory. Prior to SNP, this does not include integrity protection.
The HV could attempt a replay attack by replacing the page with the
VMSA with an older version of the same page. That said, the target will
have just booted, so there isn't much to replay.

If we really need to, we could add functionality to the Migration
Handler that would allow the HV to ask for an HMAC of the VMSA on the
source. The Migration Handler on the target could use this to verify
the VMSA just prior to starting the trampoline. Given the above, I am
not sure this is necessary. Hopefully I've understood the attack you're
suggesting correctly.
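
A minimal sketch of that check, with HmacSha256 standing in for
whatever HMAC primitive the firmware provides (the name and signature
are invented for illustration):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define VMSA_SIZE 4096
    #define MAC_SIZE  32

    /* Invented primitive: HMAC-SHA256 of 'len' bytes under 'key'. */
    extern void HmacSha256(const uint8_t *key, size_t key_len,
                           const uint8_t *data, size_t len,
                           uint8_t mac[MAC_SIZE]);

    /* Target-side check: recompute the MAC over the received VMSA and
     * compare against the MAC the source Migration Handler produced.
     * (memcmp is used for brevity; a constant-time compare would be
     * preferable in a real implementation.) */
    int vmsa_verify(const uint8_t *vmsa,
                    const uint8_t *key, size_t key_len,
                    const uint8_t expected[MAC_SIZE])
    {
        uint8_t mac[MAC_SIZE];
        HmacSha256(key, key_len, vmsa, VMSA_SIZE, mac);
        return memcmp(mac, expected, MAC_SIZE) == 0;
    }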

-Tobin


Ashish Kalra <ashish.kalra@...>
 

Hello Tobin,

On Thu, Oct 29, 2020 at 04:36:07PM -0400, Tobin Feldman-Fitzthum wrote:
> [...]
> If we really need to, we could add functionality to the Migration
> Handler that would allow the HV to ask for an HMAC of the VMSA on the
> source. The Migration Handler on the target could use this to verify
> the VMSA just prior to starting the trampoline. Given the above, I am
> not sure this is necessary. Hopefully I've understood the attack
> you're suggesting correctly.

Yes, this is the attack I am suggesting: a compromised or malicious
hypervisor replacing the page containing the CPU state with compromised
data in the NPT when the target VM starts.

Thanks,
Ashish



Laszlo Ersek
 

Hi Tobin,

(keeping full context -- I'm adding Dave)

On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote:
> [...]

the one word that comes to my mind upon reading the above is,
"overwhelming".

(I have not been addressed directly, but:

- the subject says "RFC",

- and the documentation at

https://github.com/secure-migration/resume-from-edk2-tooling#what-changes-did-we-make

states that AmdSevPkg was created for convenience, and that the feature
could be integrated into OVMF. (Paraphrased.)

So I guess it's tolerable if I make a comment: )

I've checked out the "mh-state-dev" branch of
<https://github.com/secure-migration/resume-from-efi-edk2.git>. It has
80 commits on top of edk2 master (base commit: d5339c04d7cd,
"UefiCpuPkg/MpInitLib: Add missing explicit PcdLib dependency",
2020-04-23).

These commits were authored over the 6-7 months since April. It's
obviously huge work. To me, most of these commits clearly aim at getting
the demo / proof-of-concept functional, rather than guiding (more
precisely: hand-holding) reviewers through the construction of the feature.

In my opinion, the series is not upstreamable in its current format
(which is presently not much more readable than a single-commit code
drop). Upstreaming is probably not your intent, either, at this time.

I agree that getting feedback ("buy-in") at this level of maturity is
justified from your POV, before you invest more work into cleaning up /
restructuring the series.

My problem is that "hand-holding" is exactly what I'd need -- I cannot
dedicate one or two weeks, as an indivisible block, to understanding
your design. Nor can I approach the series patch-wise in its current
format. Personally I would need the patch series to lead me through the
whole design with baby steps ("ELI5"), meaning small code changes and
detailed commit messages. I'd *also* need the more comprehensive
guide-like documentation, as background material.

Furthermore, I don't have an environment where I can test this
proof-of-concept (and provide you with further incentive for cleaning up
the series, by reporting success).

So I hope others can spend the time discussing the design with you, and
testing / repeating the demo. For me to review the patches, the patches
should condense and replay your thinking process from the last 7 months,
in as small as possible logical steps. (On the list.)

I really don't want to be the bottleneck here, which is why I would
support introducing this feature as a separate top-level package
(AmdSevPkg).

Thanks
Laszlo


Tobin Feldman-Fitzthum
 

On 2020-11-03 09:59, Laszlo Ersek wrote:
Hi Tobin,
(keeping full context -- I'm adding Dave)
On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote:
Hello,
Dov Murik. James Bottomley, Hubertus Franke, and I have been working on
a plan for fast live migration of SEV and SEV-ES (and SEV-SNP when it's
out and even hopefully Intel TDX) VMs. We have developed an approach
that we believe is feasible and a demonstration that shows our solution
to the most difficult part of the problem. In short, we have implemented
a UEFI Application that can resume from a VM snapshot. We think this is
the crux of SEV-ES live migration. After describing the context of our
demo and how it works, we explain how it can be extended to a full
SEV-ES migration. Our goal is to show that fast SEV and SEV-ES live
migration can be implemented in OVMF with minimal kernel changes. We
provide a blueprint for doing so.
Typically the hypervisor facilitates live migration. AMD SEV excludes
the hypervisor from the trust domain of the guest. When a hypervisor
(HV) examines the memory of an SEV guest, it will find only a
ciphertext. If the HV moves the memory of an SEV guest, the ciphertext
will be invalidated. Furthermore, with SEV-ES the hypervisor is largely
unable to access guest CPU state. Thus, fast migration of SEV VMs
requires support from inside the trust domain, i.e. the guest.
One approach is to add support for SEV Migration to the Linux kernel.
This would allow the guest to encrypt/decrypt its own memory with a
transport key. This approach has met some resistance. We propose a
similar approach implemented not in Linux, but in firmware, specifically
OVMF. Since OVMF runs inside the guest, it has access to the guest
memory and CPU state. OVMF should be able to perform the manipulations
required for live migration of SEV and SEV-ES guests.
The biggest challenge of this approach involves migrating the CPU state
of an SEV-ES guest. In a normal (non-SEV migration) the HV sets the CPU
state of the target before the target begins executing. In our approach,
the HV starts the target and OVMF must resume to whatever state the
source was in. We believe this to be the crux (or at least the most
difficult part) of live migration for SEV and we hope that by
demonstrating resume from EFI, we can show that our approach is
generally feasible.
Our demo can be found at <https://github.com/secure-migration>. The
tooling repository is the best starting point. It contains documentation
about the project and the scripts needed to run the demo. There are two
more repos associated with the project. One is a modified edk2 tree that
contains our modified OVMF. The other is a modified qemu, that has a
couple of temporary changes needed for the demo. Our demonstration is
aimed only at resuming from a VM snapshot in OVMF. We provide the source
CPU state and source memory to the destination using temporary plumbing
that violates the SEV trust model. We explain the setup in more depth in
README.md. We are showing only that OVMF can resume from a VM snapshot.
At the end we will describe our plan for transferring CPU state and
memory from source to guest. To be clear, the temporary tooling used for
this demo isn't built for encrypted VMs, but below we explain how this
demo applies to and can be extended to encrypted VMs.
We Implemented our resume code in a very similar fashion to the
recommended S3 resume code. When the HV sets the CPU state of a guest,
it can do so when the guest is not executing. Setting the state from
inside the guest is a delicate operation. There is no way to atomically
set all of the CPU state from inside the guest. Instead, we must set
most registers individually and account for changes in control flow that
doing so might cause. We do this with a three-phase trampoline. OVMF
calls phase 1, which runs on the OVMF map. Phase 1 sets up phase 2 and
jumps to it. Phase 2 switches to an intermediate map that reconciles the
OVMF map and the source map. Phase 3 switches to the source map,
restores the registers, and returns into execution of the source. We
will go backwards through these phases in more depth.

The last thing that resume to EFI does is return. Specifically, we use
IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a
temporary stack and restores them atomically, thus returning to source
execution. Prior to returning, we must manually restore most other
registers to the values they had on the source. One particularly
significant register is CR3. When we return to Linux, CR3 must be set to
the source CR3 or the first instruction executed in Linux will cause a
page fault. The code that we use to restore the registers and return
must be mapped in the source page table or we would get a page fault
executing the instructions prior to returning into Linux. The value of
CR3 is so significant that it defines the three phases of the
trampoline. Phase 3 begins when CR3 is set to the source CR3. After
setting CR3, we set all the other registers and return.
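
To make the mechanics concrete, here is a minimal C sketch of a
phase-3-style epilogue. The structure layout, the symbol names, and the
register handling are illustrative assumptions, not the PoC code:

/* Sketch only: the real phase 3 is stackless assembly and receives its
 * inputs in registers (see phase 1).  The iretq frame (RIP, CS, RFLAGS,
 * RSP, SS captured from the source) must sit on a small stack that is
 * mapped in the source page table, and this code itself must be on a
 * page aliased in the source map. */
#include <stdint.h>

struct source_state {
    uint64_t cr3;        /* the source's CR3 */
    uint64_t iretq_rsp;  /* points at the prepared 5-word iretq frame */
    /* ... general-purpose registers, CR0/CR4, MSRs, ... */
};

static void __attribute__((noreturn)) phase3(const struct source_state *s)
{
    asm volatile(
        "movq %0, %%cr3\n\t"  /* enter the source map: phase 3 begins */
        "movq %1, %%rsp\n\t"  /* RSP -> saved RIP/CS/RFLAGS/RSP/SS    */
        /* ... restore the remaining registers here ... */
        "iretq"               /* atomically resume source execution   */
        :
        : "r"(s->cr3), "r"(s->iretq_rsp)
        : "memory");
    __builtin_unreachable();
}

The key property is that everything after the CR3 write touches only
registers and pages that are mapped in the source page table.
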
Phase 2 mainly exists to set up phase 3. OVMF uses a 1:1 identity mapping, meaning
that virtual addresses are the same as physical addresses. The kernel
page table uses an offset mapping, meaning that virtual addresses differ
from physical addresses by a constant (for the most part). Crucially,
this means that the virtual address of the page that is executed by
phase 3 differs between the OVMF map and the source map. If we are
executing code mapped in OVMF and we change CR3 to point to the source
map, although the page may be mapped in the source map, the virtual
address will be different, and we will face undefined behavior. To fix
this, we construct intermediate page tables that map the pages for
phases 2 and 3 both to the virtual address expected in OVMF and to the virtual
address expected in the source map. Thus, we can switch CR3 from OVMF's
map to the intermediate map and then from the intermediate map to the
source map. Phase 2 is much shorter than phase 3. Phase 2 is mainly
responsible for switching to the intermediate map, flushing the TLB, and
jumping to phase 3.
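
To illustrate the aliasing, here is a minimal sketch assuming a
hypothetical map_4k() helper that installs a single 4 KiB page-table
entry; the helper and the flag value are assumptions, not code from the
PoC:

#include <stdint.h>

#define PTE_PRESENT_RW  0x3ULL  /* present | writable (illustrative) */

/* Assumed helper: map va -> pa in the page-table hierarchy at 'pml4'. */
extern void map_4k(uint64_t *pml4, uint64_t va, uint64_t pa,
                   uint64_t flags);

void build_intermediate_map(uint64_t *pml4, uint64_t trampoline_pa,
                            uint64_t source_va)
{
    /* Alias 1: the identity address OVMF is currently executing from,
     * so the instruction after the first CR3 switch still fetches. */
    map_4k(pml4, trampoline_pa, trampoline_pa, PTE_PRESENT_RW);

    /* Alias 2: the offset-mapped address the source map uses for the
     * same physical page, so phase 3 can later switch to the source
     * CR3 without the instruction pointer becoming invalid. */
    map_4k(pml4, source_va, trampoline_pa, PTE_PRESENT_RW);
}

Because the same physical trampoline page is reachable at both virtual
addresses, each CR3 switch lands on a map where the next instruction
fetch still succeeds.
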
Fortunately phase 1 is even simpler than phase 2. Phase 1 has two
duties. First, since phases 2 and 3 operate without a stack and can't
access values defined in OVMF (such as the addresses of the pages
containing phase 2 and 3), phase 1 must pass these values to phase 2 by
putting them in registers. Second, phase 1 must start phase 2 by jumping
to it.
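
As a sketch (with arbitrarily chosen registers, which is an assumption
rather than the PoC's actual convention), phase 1 reduces to roughly:

#include <stdint.h>

static void __attribute__((noreturn))
phase1(uint64_t intermediate_cr3, uint64_t phase2_va, uint64_t phase3_va)
{
    asm volatile(
        "movq %0, %%r13\n\t"  /* intermediate CR3, consumed by phase 2 */
        "movq %2, %%r14\n\t"  /* entry point of phase 3                */
        "jmpq *%1"            /* tail-jump into phase 2; no return     */
        :
        : "r"(intermediate_cr3), "r"(phase2_va), "r"(phase3_va)
        : "r13", "r14");
    __builtin_unreachable();
}
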
Given that we can resume to a snapshot in OVMF, we should be able to
migrate an SEV guest as long as we can securely communicate the VM
snapshot from source to destination. For our demo, we do this with a
handful of QMP commands. More sophisticated methods are required for a
production implementation.

When we refer to a snapshot, what we really mean is the device state,
memory, and CPU state of a guest. In live migration this is transmitted
dynamically as opposed to being saved and restored. Device state is not
protected by SEV and can be handled entirely by the HV. Memory, on the
other hand, cannot be handled only by the HV. As mentioned previously,
memory needs to be encrypted with a transport key. A Migration Handler
on the source will coordinate with the HV to encrypt pages and transmit
them to the destination. The destination HV will receive the pages over
the network and pass them to the Migration Handler in the target VM so
they can be decrypted. This transmission will occur continuously until
the memory of the source and target converges.
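
As a purely illustrative sketch of the source side, assuming a
hypothetical request mailbox between the HV and the Migration Handler
-- mh_wait_for_request(), encrypt_with_transport_key(), and
mh_complete_request() are assumed names, not a real interface:

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

struct mh_request {
    uint64_t gpa;                 /* guest-physical page the HV wants sent */
    uint8_t out[PAGE_SIZE + 16];  /* ciphertext plus auth tag for the HV   */
};

extern struct mh_request *mh_wait_for_request(void);
extern void encrypt_with_transport_key(const void *in, size_t len,
                                       uint8_t *out);
extern void mh_complete_request(struct mh_request *req);

void migration_handler_source_loop(void)
{
    for (;;) {
        struct mh_request *req = mh_wait_for_request();

        /* Inside the guest this read yields plaintext, even though the
         * HV sees only ciphertext at the same address.  Assumes the MH
         * runs with guest-physical addresses mapped 1:1. */
        const void *page = (const void *)(uintptr_t)req->gpa;

        encrypt_with_transport_key(page, PAGE_SIZE, req->out);
        mh_complete_request(req);  /* HV may now transmit req->out */
    }
}
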
Plain SEV does not protect the CPU state of the guest and therefore does
not require any special mechanism for transmission of the CPU state. We
plan to implement an end-to-end migration with plain SEV first. In
SEV-ES, the PSP (platform security processor) encrypts CPU state on each
VMExit. The encrypted state is stored in memory. Normally this memory
(known as the VMSA) is not mapped into the guest, but we can add an
entry to the nested page tables that will expose the VMSA to the guest.
This means that when the guest VMExits, the CPU state will be saved to
guest memory. With the CPU state in guest memory, it can be transmitted
to the target using the method described above.
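
Reusing the illustrative helpers from the previous sketch, the
ES-specific step could then be as small as the following; VMSA_GPA
stands in for wherever the HV chooses to expose the VMSA and is an
assumption:

/* Sketch only: once the HV adds a nested-page-table entry exposing the
 * VMSA, the saved CPU state is just another guest page to the MH. */
#define VMSA_GPA 0xFFFFD000ULL  /* illustrative placement, HV's choice */

void send_vcpu_state(uint8_t out[PAGE_SIZE + 16])
{
    /* The PSP wrote the encrypted register file here on the last
     * VMExit; from inside the guest it reads back as plaintext. */
    const void *vmsa = (const void *)(uintptr_t)VMSA_GPA;

    encrypt_with_transport_key(vmsa, PAGE_SIZE, out);
    /* hand 'out' to the HV for transmission, exactly as with memory */
}
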
In addition to the changes needed in OVMF to resume the VM, the
transmission of the VM from source to target will require a new code
path in the hypervisor. There will also need to be a few minor changes
to Linux (adding a mapping for our Phase 3 pages). Despite all the
moving pieces, we believe that this is a feasible approach for
supporting live migration for SEV and SEV-ES.

For the sake of brevity, we have left out a few issues, including SMP
support, generation of the intermediate mappings, and more. We have
included some notes about these issues in the COMPLICATIONS.md file. We
also have an outline of an end-to-end implementation of live migration
for SEV-ES in END-TO-END.md. See README.md for info on how to run the
demo. While this is not a full migration, we hope to show that fast live
migration with SEV and SEV-ES is possible without major kernel changes.

-Tobin
the one word that comes to my mind upon reading the above is,
"overwhelming".
(I have not been addressed directly, but:
- the subject says "RFC",
- and the documentation at
https://github.com/secure-migration/resume-from-edk2-tooling#what-changes-did-we-make
states that AmdSevPkg was created for convenience, and that the feature
could be integrated into OVMF. (Paraphrased.)
So I guess it's tolerable if I make a comment: )
We've been looking forward to your perspective.

I've checked out the "mh-state-dev" branch of
<https://github.com/secure-migration/resume-from-efi-edk2.git>. It has
80 commits on top of edk2 master (base commit: d5339c04d7cd,
"UefiCpuPkg/MpInitLib: Add missing explicit PcdLib dependency",
2020-04-23).
These commits were authored over the 6-7 months since April. It's
obviously huge work. To me, most of these commits clearly aim at getting
the demo / proof-of-concept functional, rather than guiding (more
precisely: hand-holding) reviewers through the construction of the feature.
In my opinion, the series is not upstreamable in its current format
(which is presently not much more readable than a single-commit code
drop). Upstreaming is probably not your intent, either, at this time.
I agree that getting feedback ("buy-in") at this level of maturity is
justified from your POV, before you invest more work into cleaning up /
restructuring the series.
My problem is that "hand-holding" is exactly what I'd need -- I cannot
dedicate one or two weeks, as an indivisible block, to understanding
your design. Nor can I approach the series patch-wise in its current
format. Personally I would need the patch series to lead me through the
whole design with baby steps ("ELI5"), meaning small code changes and
detailed commit messages. I'd *also* need the more comprehensive
guide-like documentation, as background material.
Furthermore, I don't have an environment where I can test this
proof-of-concept (and provide you with further incentive for cleaning up
the series, by reporting success).
So I hope others can spend the time discussing the design with you, and
testing / repeating the demo. For me to review the patches, the patches
should condense and replay your thinking process from the last 7 months,
in as small as possible logical steps. (On the list.)
I completely understand your position. This PoC has a lot of
new ideas in it and you're right that our main priority was not
to hand-hold/guide reviewers through the code.

One thing that is worth emphasizing is that the pieces we
are showcasing here are not the immediate priority when it
comes to upstreaming. Specifically, we looked into the trampoline
to make sure it was possible to migrate CPU state via firmware.
While we need this for SEV-ES and our goal is to support SEV-ES,
it is not the first step. We are currently working on a PoC for
a full end-to-end migration with SEV (non-ES), which may be a better
place for us to begin a serious discussion about getting things
upstream. We will focus more on making these patches accessible
to the upstream community.

In the meantime, perhaps there is something we can do to help
make our current work more clear. We could potentially explain
things on a call or create some additional documentation. While
our goal is not to shove this version of the trampoline upstream,
it is significant to our plan as a whole and we want to help
people understand it.

-Tobin

I really don't want to be the bottleneck here, which is why I would
support introducing this feature as a separate top-level package
(AmdSevPkg).
Thanks
Laszlo


Laszlo Ersek
 

On 11/04/20 19:27, Tobin Feldman-Fitzthum wrote:

In the meantime, perhaps there is something we can do to help
make our current work more clear. We could potentially explain
things on a call or create some additional documentation. While
our goal is not to shove this version of the trampoline upstream,
it is significant to our plan as a whole and we want to help
people understand it.
From my personal (selfish) perspective, a call would be
counter-productive. Regarding documentation, I do have one thought that
might help, with the (very tricky) page table manipulations / phases:
diagrams (ascii or svg, perhaps). I don't know if that will help me look
at this in detail earlier, but *when* I do look at it, it will
definitely help me.

(If there are diagrams already, then I apologize for not noticing them.)

Thanks!
Laszlo


Dr. David Alan Gilbert
 

* Tobin Feldman-Fitzthum (tobin@...) wrote:
On 2020-11-03 09:59, Laszlo Ersek wrote:
Hi Tobin,

[snip: the original RFC and the earlier replies, quoted in full]
One thing that is worth emphasizing is that the pieces we
are showcasing here are not the immediate priority when it
comes to upstreaming. Specifically, we looked into the trampoline
to make sure it was possible to migrate CPU state via firmware.
While we need this for SEV-ES and our goal is to support SEV-ES,
it is not the first step. We are currently working on a PoC for
a full end-to-end migration with SEV (non-ES), which may be a better
place for us to begin a serious discussion about getting things
upstream. We will focus more on making these patches accessible
to the upstream community.
With my migration maintainer hat on, I'd like to understand a bit more
about these different approaches; they could be quite invasive, so I'd
like to make sure we're not doing one and throwing it away - it would
be great if you could explain your non-ES approach; you don't need to
have POC code to explain it.

Dave

[snip: remainder of quoted text]
--
Dr. David Alan Gilbert / dgilbert@... / Manchester, UK


Tobin Feldman-Fitzthum
 

On 2020-11-06 10:45, Laszlo Ersek wrote:
On 11/04/20 19:27, Tobin Feldman-Fitzthum wrote:

[snip: Tobin's paragraph, quoted above]
From my personal (selfish) perspective, a call would be
counter-productive. Regarding documentation, I do have one thought that
might help, with the (very tricky) page table manipulations / phases:
diagrams (ascii or svg, perhaps). I don't know if that will help me look
at this in detail earlier, but *when* I do look at it, it will
definitely help me.
(If there are diagrams already, then I apologize for not noticing them.)
We can work on some diagrams. I have a couple informal ones on
paper. Adding something visual to the docs seems like a good idea.

-Tobin

Thanks!
Laszlo


Tobin Feldman-Fitzthum
 

On 2020-11-06 11:38, Dr. David Alan Gilbert wrote:
* Tobin Feldman-Fitzthum (tobin@...) wrote:
[snip: the original RFC and the earlier replies, quoted in full]
With my migration maintainer hat on, I'd like to understand a bit more
about these different approaches; they could be quite invasive, so I'd
like to make sure we're not doing one and throwing it away - it would
be great if you could explain your non-ES approach; you don't need to
have POC code to explain it.
Our non-ES approach is a subset of our ES approach. For ES, the
Migration Handler in the guest needs to help out with memory and
CPU state. For plain SEV, the HV can set the CPU state, but we still
need a way to transfer the memory. The current POC only deals
with the CPU state.

We're still working out some of the details in QEMU, but the basic
idea of transferring memory is that each time the HV needs to send a
page to the target, it will ask the Migration Handler in the guest
for a version of the page that is encrypted with a transport key.
Since the MH is inside the guest, it can read from any address
in guest memory. The Migration Handlers on the source and the target
will share a key. Once the source encrypts the requested page with
the transport key, it can safely hand it off to the HV. Once the page
reaches the target, the target HV will pass the page into the
Migration Handler, which will decrypt using the transport key and
move the page to the appropriate address.
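
A matching sketch of the target side, again with assumed names
(decrypt_with_transport_key() and the 1:1 guest-physical mapping are
illustrative, not the planned interface):

#include <stddef.h>
#include <stdint.h>

extern int decrypt_with_transport_key(const uint8_t *in, size_t len,
                                      void *out);

/* Sketch only: called by the target MH for each page handed in by the
 * HV.  'gpa' is the address the page occupied on the source. */
int migration_handler_target_receive(uint64_t gpa, const uint8_t *in,
                                     size_t len)
{
    /* Decrypt with the shared transport key straight into place; the
     * memory controller re-encrypts with this guest's SEV key on the
     * way to DRAM. */
    return decrypt_with_transport_key(in, len, (void *)(uintptr_t)gpa);
}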

A few things to note:

- The Migration Handler on the source needs to be running in the
guest alongside the VM. On the target, the MH needs to startup
before we can receive any pages. In both cases we are thinking
that an additional vCPU can be started for the MH to run on.
This could be spawned dynamically or live for the duration of
the guest.

- We need to make sure that the Migration Handler on the target
does not overwrite itself when it receives pages from the
source. Since we run the same firmware on the source and
target, and since the MH is runtime code, the memory
footprint of the MH should match on the source and the
target. We will need to make sure there are no weird
relocations.

- There are some complexities arising from the fact that not
every page in an SEV VM is encrypted. We are looking into
the best way to handle encrypted vs. shared pages.

Hopefully those notes don't confound my earlier explanation too
much. I think that's most of the picture for non-ES migration.
Let me know if you have any questions. ES migration would use
the same approach for transferring memory.

-Tobin

[snip: remainder of quoted text]


Ashish Kalra <ashish.kalra@...>
 

Hello Tobin,

On Fri, Nov 06, 2020 at 04:48:12PM -0500, Tobin Feldman-Fitzthum wrote:
On 2020-11-06 11:38, Dr. David Alan Gilbert wrote:
* Tobin Feldman-Fitzthum (tobin@...) wrote:
[snip: the original RFC and the earlier replies, quoted in full, down to
Tobin's note on encrypted vs. shared pages]
- There are some complexities arising from the fact that not
every page in an SEV VM is encrypted. We are looking into
the best way to handle encrypted vs. shared pages.
Raising this question here as part of this discussion ... are you
thinking of adding the page encryption bitmap (as we do for the slow
migration patches) here to figure out if the guest pages are encrypted
or not?

The page encryption status will need notifications from the guest kernel
and OVMF.

Additionally, is the page encryption bitmap support going to be added as
a hypercall interface to the guest, which also means that the
guest kernel needs to be modified?
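
For illustration, guest-side tracking along those lines might look
roughly like the sketch below; the bitmap sizing and the
hv_notify_enc_status() wrapper are assumptions, not the interface from
the slow-migration patches:

#include <stdint.h>

#define MAX_GUEST_PAGES (1ULL << 20)  /* tracks 4 GiB of 4 KiB pages */

static uint64_t enc_bitmap[MAX_GUEST_PAGES / 64];

/* Assumed wrapper around a guest->HV hypercall reporting that a guest
 * frame changed between encrypted (private) and shared. */
extern void hv_notify_enc_status(uint64_t gfn, uint64_t npages,
                                 int encrypted);

void set_page_enc_status(uint64_t gfn, int encrypted)
{
    if (encrypted)
        enc_bitmap[gfn / 64] |= 1ULL << (gfn % 64);
    else
        enc_bitmap[gfn / 64] &= ~(1ULL << (gfn % 64));

    /* Both the guest kernel and OVMF would call this on C-bit flips,
     * keeping the HV's copy of the bitmap in sync. */
    hv_notify_enc_status(gfn, 1, encrypted);
}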

Thanks,
Ashish

[snip: remainder of quoted text]


Dr. David Alan Gilbert
 

* Tobin Feldman-Fitzthum (tobin@...) wrote:
On 2020-11-06 11:38, Dr. David Alan Gilbert wrote:
* Tobin Feldman-Fitzthum (tobin@...) wrote:
On 2020-11-03 09:59, Laszlo Ersek wrote:
Hi Tobin,

(keeping full context -- I'm adding Dave)

[...]
With my migration maintainer hat on, I'd like to understand a bit more
about these different approaches; they could be quite invasive, so I'd
like to make sure we're not doing one and throwing it away - it would
be great if you could explain your non-ES approach; you don't need to
have POC code to explain it.
Our non-ES approach is a subset of our ES approach. For ES, the
Migration Handler in the guest needs to help out with memory and
CPU state. For plain SEV, the HV can set the CPU state, but we still
need a way to transfer the memory. The current POC only deals
with the CPU state.
OK, so as long as that's a subset, and this POC glues on for SEV-ES
registers, that's fine.

We're still working out some of the details in QEMU, but the basic
idea of transferring memory is that each time the HV needs to send a
page to the target, it will ask the Migration Handler in the guest
for a version of the page that is encrypted with a transport key.
Since the MH is inside the guest, it can read from any address
in guest memory. The Migration Handlers on the source and the target
will share a key. Once the source encrypts the requested page with
the transport key, it can safely hand it off to the HV. Once the page
reaches the target, the target HV will pass the page into the
Migration Handler, which will decrypt using the transport key and
move the page to the appropriate address.
So somehow we have to get that transport key negotiated and into
the migration-handlers.

A few things to note:

- The Migration Handler on the source needs to be running in the
guest alongside the VM. On the target, the MH needs to start up
before we can receive any pages. In both cases we are thinking
that an additional vCPU can be started for the MH to run on.
This could be spawned dynamically or live for the duration of
the guest.
And on the source it needs to keep running even when the other vCPUs
stop for the stop-copy phase at the end.

I know various people had asked the question whether we could have
some form of helper vCPU or whether the vCPU would be guest visible.

- We need to make sure that the Migration Handler on the target
does not overwrite itself when it receives pages from the
source. Since we run the same firmware on the source and
target, and since the MH is runtime code, the memory
footprint of the MH should match on the source and the
target. We will need to make sure there are no weird
relocations.
So hmm; that depends whether you're going to transfer the MH
using the AMD hardware, or somehow rely on it being the same on
the two sides, I think.

- There are some complexities arising from the fact that not
every page in an SEV VM is encrypted. We are looking into
the best way to handle encrypted vs. shared pages.
Right.

Hopefully those notes don't confound my earlier explanation too
much. I think that's most of the picture for non-ES migration.
Let me know if you have any questions. ES migration would use
the same approach for transferring memory.
OK, good.

Dave

--
Dr. David Alan Gilbert / dgilbert@... / Manchester, UK


Tobin Feldman-Fitzthum
 

On 2020-11-06 17:17, Ashish Kalra wrote:
Hello Tobin,
On Fri, Nov 06, 2020 at 04:48:12PM -0500, Tobin Feldman-Fitzthum wrote:
On 2020-11-06 11:38, Dr. David Alan Gilbert wrote:
* Tobin Feldman-Fitzthum (tobin@...) wrote:
[...]
- There are some complexities arising from the fact that not
every page in an SEV VM is encrypted. We are looking into
the best way to handle encrypted vs. shared pages.
Raising this question here as part of this discussion: are you
thinking of adding the page encryption bitmap (as we do for the slow
migration patches) here to figure out if the guest pages are encrypted
or not?
We are using the bitmap for the first iteration of our end-to-end POC.

The page encryption status will need notifications from the guest kernel
and OVMF.
Additionally, is the page encryption bitmap support going to be added as
a hypercall interface to the guest, which also means that the
guest kernel needs to be modified?
Although the bitmap is handy, we would like to avoid the patches you
are alluding to. We are currently looking into how we can eliminate
the bitmap.
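
(For readers following along, the bitmap under discussion is simple
bookkeeping; a hedged C sketch of the idea -- one bit per 4 KiB guest
page frame, set while the page is encrypted/private and cleared while
it is shared -- follows. The names and encoding are our illustration,
not the interface from the slow-migration patches; there the guest
reports C-bit changes to the HV through a hypercall, which is exactly
the guest-kernel modification being debated.)

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* 4 KiB pages */

/* One bit per guest page frame: 1 = encrypted (private), 0 = shared.
 * The guest kernel and OVMF would update this (e.g. via a hypercall
 * to the HV) every time they flip a page's C-bit status. */
typedef struct {
  uint64_t *bits;    /* npages bits, rounded up to a multiple of 64 */
  uint64_t  npages;
} EncBitmap;

static void
enc_bitmap_set (EncBitmap *bm, uint64_t gpa, bool encrypted)
{
  uint64_t gfn = gpa >> PAGE_SHIFT;
  if (encrypted)
    bm->bits[gfn / 64] |= 1ULL << (gfn % 64);
  else
    bm->bits[gfn / 64] &= ~(1ULL << (gfn % 64));
}

/* Consulted per page during migration: encrypted pages must go
 * through the Migration Handler's transport-key path, while shared
 * pages can be copied directly by the HV. */
static bool
enc_bitmap_test (const EncBitmap *bm, uint64_t gpa)
{
  uint64_t gfn = gpa >> PAGE_SHIFT;
  return (bm->bits[gfn / 64] >> (gfn % 64)) & 1;
}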

-Tobin



Tobin Feldman-Fitzthum
 

On 2020-11-09 14:56, Dr. David Alan Gilbert wrote:
* Tobin Feldman-Fitzthum (tobin@...) wrote:
On 2020-11-06 11:38, Dr. David Alan Gilbert wrote:
* Tobin Feldman-Fitzthum (tobin@...) wrote:
On 2020-11-03 09:59, Laszlo Ersek wrote:
Hi Tobin,

(keeping full context -- I'm adding Dave)

[...]
Since the MH is inside the guest, it can read from any address
in guest memory. The Migration Handlers on the source and the target
will share a key. Once the source encrypts the requested page with
the transport key, it can safely hand it off to the HV. Once the page
reaches the target, the target HV will pass the page into the
Migration Handler, which will decrypt using the transport key and
move the page to the appropriate address.
So somehow we have to get that transport key negotiated and into
the migration-handlers.
Inject-launch-secret is one of the main pieces here. James might have
more info about this step.


A few things to note:
- The Migration Handler on the source needs to be running in the
guest alongside the VM. On the target, the MH needs to start up
before we can receive any pages. In both cases we are thinking
that an additional vCPU can be started for the MH to run on.
This could be spawned dynamically or live for the duration of
the guest.
And on the source it needs to keep running even when the other vCPUs
stop for the stop-copy phase at the end.
Yes. Good point.

I know various people had asked the question whether we could have
some form of helper vCPU or whether the vCPU would be guest visible.

- We need to make sure that the Migration Handler on the target
does not overwrite itself when it receives pages from the
source. Since we run the same firmware on the source and
target, and since the MH is runtime code, the memory
footprint of the MH should match on the source and the
target. We will need to make sure there are no weird
relocations.
So hmm; that depends whether you're going to transfer the MH
using the AMD hardware, or somehow rely on it being the same on
the two sides, I think.
We don't transfer the MH itself. Even if we did, we would still
need to make sure that the MH on the target and the OS on the
source do not overlap. Currently our approach for this is to
designate the MH as a runtime driver, meaning that the code for
the MH is on reserved pages that won't be mapped by Linux.
We'll use the same firmware and thus the same driver on the
source and destination. We think this will be enough, but it
is a somewhat delicate step that we may need to revisit.
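
(To illustrate the "runtime driver" point: in edk2 terms the MH would
be a module of type DXE_RUNTIME_DRIVER, so its image is loaded into
EfiRuntimeServicesCode memory, which the OS must preserve rather than
reclaim as ordinary RAM. A hypothetical skeleton -- the names here are
ours, not the PoC's -- might look like:)

#include <Uefi.h>
#include <Library/DebugLib.h>

//
// Hypothetical entry point for a Migration Handler runtime driver.
// Because the module type is DXE_RUNTIME_DRIVER, the DXE core loads
// this image into EfiRuntimeServicesCode pages. The OS treats those
// pages as reserved, which is what keeps the target's MH from being
// overwritten by incoming pages that belong to the source OS.
//
EFI_STATUS
EFIAPI
MigrationHandlerEntryPoint (
  IN EFI_HANDLE        ImageHandle,
  IN EFI_SYSTEM_TABLE  *SystemTable
  )
{
  DEBUG ((DEBUG_INFO, "MigrationHandler: resident as runtime code\n"));

  //
  // A real driver would set up the mailbox shared with the HV here
  // and park the dedicated vCPU in the page export/import loop.
  //
  return EFI_SUCCESS;
}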

-Tobin



Kalra, Ashish <Ashish.Kalra@...>
 


Hello Tobin,

-----Original Message-----
From: Tobin Feldman-Fitzthum <tobin@...>
Sent: Monday, November 9, 2020 2:28 PM
To: Kalra, Ashish <Ashish.Kalra@...>
Cc: Dr. David Alan Gilbert <dgilbert@...>; Laszlo Ersek <lersek@...>; devel@edk2.groups.io; dovmurik@...; Dov.Murik1@...; Singh, Brijesh <brijesh.singh@...>; tobin@...; Kaplan, David <David.Kaplan@...>; Grimm, Jon <Jon.Grimm@...>; Lendacky, Thomas <Thomas.Lendacky@...>; jejb@...; frankeh@...
Subject: Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept

On 2020-11-06 17:17, Ashish Kalra wrote:
Hello Tobin,

On Fri, Nov 06, 2020 at 04:48:12PM -0500, Tobin Feldman-Fitzthum wrote:
On 2020-11-06 11:38, Dr. David Alan Gilbert wrote:
* Tobin Feldman-Fitzthum (tobin@...) wrote:
On 2020-11-03 09:59, Laszlo Ersek wrote:
Hi Tobin,

(keeping full context -- I'm adding Dave)

On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote:
Hello,

Dov Murik. James Bottomley, Hubertus Franke, and I have been
working on a plan for fast live migration of SEV and SEV-ES
(and SEV-SNP when it's out and even hopefully Intel TDX) VMs.
We have developed an approach that we believe is feasible and
a demonstration that shows our solution to the most difficult
part of the problem. In short, we have implemented a UEFI
Application that can resume from a VM snapshot. We think this
is the crux of SEV-ES live migration. After describing the
context of our demo and how it works, we explain how it can
be extended to a full SEV-ES migration. Our goal is to show
that fast SEV and SEV-ES live migration can be implemented in
OVMF with minimal kernel changes. We provide a blueprint for
doing so.

Typically the hypervisor facilitates live migration. AMD SEV
excludes the hypervisor from the trust domain of the guest.
When a hypervisor
(HV) examines the memory of an SEV guest, it will find only a
ciphertext. If the HV moves the memory of an SEV guest, the
ciphertext will be invalidated. Furthermore, with SEV-ES the
hypervisor is largely unable to access guest CPU state. Thus,
fast migration of SEV VMs requires support from inside the
trust domain, i.e. the guest.

One approach is to add support for SEV Migration to the Linux kernel.
This would allow the guest to encrypt/decrypt its own memory
with a transport key. This approach has met some resistance.
We propose a similar approach implemented not in Linux, but
in firmware, specifically OVMF. Since OVMF runs inside the
guest, it has access to the guest memory and CPU state. OVMF
should be able to perform the manipulations required for live
migration of SEV and SEV-ES guests.

The biggest challenge of this approach involves migrating the
CPU state of an SEV-ES guest. In a normal (non-SEV migration)
the HV sets the CPU state of the target before the target
begins executing. In our approach, the HV starts the target
and OVMF must resume to whatever state the source was in. We
believe this to be the crux (or at least the most difficult
part) of live migration for SEV and we hope that by
demonstrating resume from EFI, we can show that our approach
is generally feasible.

Our demo can be found at
<https://nam11.safelinks.protection.outlook.com/?url=https%3A
%2F%2Fgithub.com%2Fsecure-migration&amp;data=04%7C01%7Cashish
.kalra%40amd.com%7C5180f68f099546c3a49e08d884edf727%7C3dd8961
fe4884e608e11a82d994e183d%7C0%7C0%7C637405504892572249%7CUnkn
own%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI
6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=dkF04%2FoQgl8rLYXXxF
2nQNwDr1VmfvMfZ8amC6QHZV4%3D&amp;reserved=0>. The tooling
repository is the best starting point. It contains
documentation about the project and the scripts needed to run
the demo. There are two more repos associated with the
project. One is a modified edk2 tree that contains our
modified OVMF. The other is a modified qemu, that has a
couple of temporary changes needed for the demo. Our
demonstration is aimed only at resuming from a VM snapshot in
OVMF. We provide the source CPU state and source memory to
the destination using temporary plumbing that violates the SEV trust model. We explain the setup in more depth in README.md. We are showing only that OVMF can resume from a VM snapshot.
At the end we will describe our plan for transferring CPU
state and memory from source to guest. To be clear, the
temporary tooling used for this demo isn't built for
encrypted VMs, but below we explain how this demo applies to
and can be extended to encrypted VMs.

We Implemented our resume code in a very similar fashion to
the recommended S3 resume code. When the HV sets the CPU
state of a guest, it can do so when the guest is not
executing. Setting the state from inside the guest is a
delicate operation. There is no way to atomically set all of
the CPU state from inside the guest. Instead, we must set
most registers individually and account for changes in
control flow that doing so might cause. We do this with a
three-phase trampoline. OVMF calls phase 1, which runs on the
OVMF map. Phase 1 sets up phase 2 and jumps to it. Phase 2
switches to an intermediate map that reconciles the OVMF map
and the source map. Phase 3 switches to the source map,
restores the registers, and returns into execution of the
source. We will go backwards through these phases in more depth.

The last thing that resume to EFI does is return.
Specifically, we use IRETQ, which reads the values of RIP,
CS, RFLAGS, RSP, and SS from a temporary stack and restores
them atomically, thus returning to source execution. Prior to
returning, we must manually restore most other registers to
the values they had on the source. One particularly
significant register is CR3. When we return to Linux, CR3
must be set to the source CR3 or the first instruction
executed in Linux will cause a page fault. The code that we
use to restore the registers and return must be mapped in the
source page table or we would get a page fault executing the
instructions prior to returning into Linux. The value of
CR3 is so significant, that it defines the three phases of
the trampoline. Phase 3 begins when CR3 is set to the source
CR3. After setting CR3, we set all the other registers and return.

Phase 2 mainly exists to setup phase 3. OVMF uses a 1-1
mapping, meaning that virtual addresses are the same as
physical addresses. The kernel page table uses an offset
mapping, meaning that virtual addresses differ from physical
addresses by a constant (for the most part). Crucially, this
means that the virtual address of the page that is executed
by phase 3 differs between the OVMF map and the source map.
If we are executing code mapped in OVMF and we change CR3 to
point to the source map, although the page may be mapped in
the source map, the virtual address will be different, and we
will face undefined behavior. To fix this, we construct
intermediate page tables that map the pages for phase
2 and 3 to the virtual address expected in OVMF and to the
virtual address expected in the source map. Thus, we can
switch CR3 from OVMF's map to the intermediate map and then
from the intermediate map to the source map. Phase 2 is much
shorter than phase 3. Phase 2 is mainly responsible for
switching to the intermediate map, flushing the TLB, and
jumping to phase 3.

Fortunately phase 1 is even simpler than phase 2. Phase 1 has
two duties. First, since phase 2 and 3 operate without a
stack and can't access values defined in OVMF (such as the
addresses of the pages containing phase 2 and 3), phase 1
must pass these values to phase 2 by putting them in
registers. Second, phase 1 must start phase 2 by jumping to
it.

Given that we can resume to a snapshot in OVMF, we should be
able to migrate an SEV guest as long as we can securely
communicate the VM snapshot from source to destination. For
our demo, we do this with a handful of QMP commands. More
sophisticated methods are required for a production implementation.

When we refer to a snapshot, what we really mean is the
device state, memory, and CPU state of a guest. In live
migration this is transmitted dynamically as opposed to being
saved and restored. Device state is not protected by SEV and
can be handled entirely by the HV. Memory, on the other hand,
cannot be handled only by the HV. As mentioned previously,
memory needs to be encrypted with a transport key. A
Migration Handler on the source will coordinate with the HV
to encrypt pages and transmit them to the destination. The
destination HV will receive the pages over the network and
pass them to the Migration Handler in the target VM so they
can be decrypted. This transmission will occur continuously
until the memory of the source and target converges.
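
As a rough illustration of the two ends of that exchange (all
names are ours, and a 1-1 guest mapping is assumed for
brevity; a real Migration Handler would also authenticate the
ciphertext):

    #include <stdint.h>

    #define PAGE_SIZE 4096ULL

    /* Hypothetical transport-key primitives; a real handler would use
     * an authenticated cipher (e.g. AES-GCM) under the shared key. */
    extern void transport_encrypt(const uint8_t *in, uint8_t *out,
                                  uint64_t len);
    extern void transport_decrypt(const uint8_t *in, uint8_t *out,
                                  uint64_t len);

    /* Source MH: running inside the guest, a read of guest memory
     * yields plaintext even though the HV sees only ciphertext. */
    void mh_export_page(uint64_t gpa, uint8_t *out)
    {
        transport_encrypt((const uint8_t *)gpa, out, PAGE_SIZE);
    }

    /* Target MH: decrypting in place from inside the guest means the
     * page lands in memory encrypted with the target's own SEV key. */
    void mh_import_page(uint64_t gpa, const uint8_t *in)
    {
        transport_decrypt(in, (uint8_t *)gpa, PAGE_SIZE);
    }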

Plain SEV does not protect the CPU state of the guest and
therefore does not require any special mechanism for
transmission of the CPU state. We plan to implement an
end-to-end migration with plain SEV first. In SEV-ES, the PSP
(platform security processor) encrypts the CPU state on each
VMExit. The encrypted state is stored in memory. Normally
this memory (known as the VMSA) is not mapped into the guest,
but we can add an entry to the nested page tables that
exposes the VMSA to the guest. This means that when the guest
VMExits, the CPU state is saved to guest memory. With the CPU
state in guest memory, it can be transmitted to the target
using the method described above.
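
Under that scheme, the SEV-ES CPU state becomes just one more
page for the export path sketched earlier; hypothetically:

    #include <stdint.h>

    extern void mh_export_page(uint64_t gpa, uint8_t *out);

    /* Hypothetical: once the HV has nested-mapped the VMSA at some
     * guest-physical address, the register state saved on the last
     * VMExit (readable in the clear from inside the guest) can be
     * exported exactly like an ordinary memory page. */
    void mh_export_vmsa(uint64_t vmsa_gpa, uint8_t *out)
    {
        mh_export_page(vmsa_gpa, out);
    }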

In addition to the changes needed in OVMF to resume the VM,
the transmission of the VM from source to target will require
a new code path in the hypervisor. There will also need to be
a few minor changes to Linux (adding a mapping for our Phase
3 pages). Despite all the moving pieces, we believe that this
is a feasible approach for supporting live migration for SEV and SEV-ES.

For the sake of brevity, we have left out a few issues,
including SMP support and generation of the intermediate
mappings; we have included some notes about these issues in
the COMPLICATIONS.md file. We also have an outline of an
end-to-end implementation of live migration for SEV-ES in
END-TO-END.md. See README.md for info on how to run the demo.
While this is not a full migration, we hope to show that fast
live migration with SEV and SEV-ES is possible without major
kernel changes.

-Tobin
The one word that comes to my mind upon reading the above is
"overwhelming".

(I have not been addressed directly, but:

- the subject says "RFC",

- and the documentation at

https://github.com/secure-migration/resume-from-edk2-tooling#what-changes-did-we-make

states that AmdSevPkg was created for convenience, and that the
feature could be integrated into OVMF. (Paraphrased.)

So I guess it's tolerable if I make a comment: )
We've been looking forward to your perspective.

I've checked out the "mh-state-dev" branch of
<https://github.com/secure-migration/resume-from-efi-edk2.git>. It has
80 commits on top of edk2 master (base commit: d5339c04d7cd,
"UefiCpuPkg/MpInitLib: Add missing explicit PcdLib dependency",
2020-04-23).

These commits were authored over the 6-7 months since April.
It's obviously a huge amount of work. To me, most of these
commits clearly aim at getting the demo / proof-of-concept
functional, rather than guiding (more precisely:
hand-holding) reviewers through the construction of the
feature.

In my opinion, the series is not upstreamable in its current
format (which is presently not much more readable than a
single-commit code drop). Upstreaming is probably not your
intent, either, at this time.

I agree that getting feedback ("buy-in") at this level of
maturity is justified from your POV, before you invest more
work into cleaning up / restructuring the series.

My problem is that "hand-holding" is exactly what I'd need -- I
cannot dedicate one or two weeks, as an indivisible block, to
understanding your design. Nor can I approach the series
patch-wise in its current format. Personally I would need the
patch series to lead me through the whole design with baby
steps ("ELI5"), meaning small code changes and detailed commit
messages. I'd *also* need the more comprehensive guide-like documentation, as background material.

Furthermore, I don't have an environment where I can test this
proof-of-concept (and provide you with further incentive for
cleaning up the series, by reporting success).

So I hope others can spend the time discussing the design with
you, and testing / repeating the demo. For me to review the
patches, the patches should condense and replay your thinking
process from the last 7 months, in as small as possible logical
steps. (On the list.)
I completely understand your position. This PoC has a lot of new
ideas in it and you're right that our main priority was not to
hand-hold/guide reviewers through the code.

One thing that is worth emphasizing is that the pieces we are
showcasing here are not the immediate priority when it comes to
upstreaming. Specifically, we looked into the trampoline to make
sure it was possible to migrate CPU state via firmware.
While we need this for SEV-ES and our goal is to support SEV-ES,
it is not the first step. We are currently working on a PoC for a
full end-to-end migration with SEV (non-ES), which may be a
better place for us to begin a serious discussion about getting
things upstream. We will focus more on making these patches
accessible to the upstream community.
With my migration maintainer hat on, I'd like to understand
a bit more about these different approaches; they could be
quite invasive, so I'd like to make sure we're not doing one
and throwing it away. It would be great if you could explain
your non-ES approach; you don't need POC code to explain it.
Our non-ES approach is a subset of our ES approach. For ES, the
Migration Handler in the guest needs to help out with memory and CPU
state. For plain SEV, the HV can set the CPU state, but we still need
a way to transfer the memory. The current POC only deals with the CPU
state.

We're still working out some of the details in QEMU, but the basic
idea of transferring memory is that each time the HV needs to send a
page to the target, it will ask the Migration Handler in the guest
for a version of the page that is encrypted with a transport key.
Since the MH is inside the guest, it can read from any address in
guest memory. The Migration Handlers on the source and the target
will share a key. Once the source encrypts the requested page with
the transport key, it can safely hand it off to the HV. Once the page
reaches the target, the target HV will pass the page into the
Migration Handler, which will decrypt using the transport key and
move the page to the appropriate address.
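
To sketch the HV's side of that loop (every interface named
here is hypothetical; none of this is from QEMU or our POC):

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096ULL

    extern bool next_dirty_page(uint64_t *gpa);     /* dirty-log walk */
    extern void mh_request_export(uint64_t gpa,     /* guest round-trip */
                                  uint8_t *ciphertext);
    extern void stream_send(uint64_t gpa, const uint8_t *ciphertext);

    /* One pass of the source-side migration loop. The HV never sees
     * plaintext; it only relays transport-key ciphertext produced by
     * the in-guest Migration Handler. Passes repeat until the dirty
     * set converges, as in ordinary live migration. */
    void migrate_ram_pass(void)
    {
        uint64_t gpa;
        uint8_t ciphertext[PAGE_SIZE];

        while (next_dirty_page(&gpa)) {
            mh_request_export(gpa, ciphertext);
            stream_send(gpa, ciphertext);
        }
    }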

A few things to note:

- The Migration Handler on the source needs to be running
inside the guest, alongside the guest OS. On the target, the
MH needs to start up before we can receive any pages. In both
cases we are thinking that an additional vCPU can be started
for the MH to run on. This vCPU could be spawned dynamically
or live for the duration of the guest.

- We need to make sure that the Migration Handler on the target
does not overwrite itself when it receives pages from the
source. Since we run the same firmware on the source and
target, and since the MH is runtime code, the memory
footprint of the MH should match on the source and the
target. We will need to make sure there are no weird
relocations.

- There are some complexities arising from the fact that not
every page in an SEV VM is encrypted. We are looking into
the best way to handle encrypted vs. shared pages.
Raising this question here as part of this discussion: are
you thinking of adding the page encryption bitmap (as we do
for the slow migration patches) here, to figure out whether
guest pages are encrypted or not?
We are using the bitmap for the first iteration of our end-to-end POC.
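
For concreteness, the kind of lookup we mean (structure and
names hypothetical; the bitmap itself would be maintained
from guest kernel and OVMF notifications):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical per-gfn encryption bitmap: bit set means the page
     * is encrypted with the guest key and must go through the
     * Migration Handler; clear means shared, so the HV can send it
     * directly. */
    struct enc_bitmap {
        uint64_t *bits;
        uint64_t  ngfns;
    };

    static bool gfn_is_encrypted(const struct enc_bitmap *bm,
                                 uint64_t gfn)
    {
        if (gfn >= bm->ngfns)
            return true;   /* be conservative past the tracked range */
        return bm->bits[gfn / 64] & (1ULL << (gfn % 64));
    }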
Ok.

The page encryption status will need notifications from the guest
kernel and OVMF.

Additionally, is the page encryption bitmap support going to
be added as a hypercall interface to the guest, which also
means that the guest kernel needs to be modified?
Although the bitmap is handy, we would like to avoid the
patches you are alluding to. We are currently looking into
how we can eliminate the bitmap.
Please note, the page encryption bitmap is also required for
SEV guest page migration and SEV guest debug support, so it
might be useful to have these patches available.

If you want us to push Brijesh's and my patches for the page
encryption bitmap separately for the kernel, let us know.

Thanks,
Ashish


Hopefully those notes don't confound my earlier explanation too much.
I think that's most of the picture for non-ES migration.
Let me know if you have any questions. ES migration would use the
same approach for transferring memory.

-Tobin

Dave

In the meantime, perhaps there is something we can do to help
make our current work more clear. We could potentially explain
things on a call or create some additional documentation. While
our goal is not to shove this version of the trampoline upstream,
it is significant to our plan as a whole and we want to help
people understand it.

-Tobin

I really don't want to be the bottleneck here, which is why I
would support introducing this feature as a separate top-level
package (AmdSevPkg).

Thanks
Laszlo


James Bottomley <jejb@...>
 

On Mon, 2020-11-09 at 17:37 -0500, Tobin Feldman-Fitzthum wrote:
On 2020-11-09 14:56, Dr. David Alan Gilbert wrote:
* Tobin Feldman-Fitzthum (tobin@...) wrote:
[...]
We're still working out some of the details in QEMU, but the
basic idea of transferring memory is that each time the HV
needs to send a page to the target, it will ask the Migration
Handler in the guest for a version of the page that is
encrypted with a transport key. [...]
So somehow we have to get that transport key negotiated and
into the migration handlers.
Inject-launch-secret is one of the main pieces here. James might have
more info about this step.
So there are a couple of ways I was thinking this could
work. In the current slow migration, the PSPs on each end
validate each other by exchanging keys. We could do something
similar by having the two MHs do an ECDHE exchange to agree a
trusted transfer key between them, and then having them both
exchange trusted information about the SEV environment, i.e.
both validating each other.

However, the alternative, simpler way is to have the machine
owner control everything. Encrypted boot would provision two
secrets: one for the actual encrypted root, which grub needs;
the other is what the MH needs. The MH secret would be the
private part of an ECDH key (effectively the MH identity)
plus the public ECDH key of the source MH, so only the source
MH would be able to make encrypted contact for migration. On
boot from image, the public key part would be empty,
indicating boot should proceed normally. On migration, we
make sure we know the source public key and provision it to
the target along with a random target key. To trigger the
migration, we tell the source what the target's public key
is, and the two can now make encrypted contact in a manner
that should be cryptographically secure. The MH ECDH key
would exist for the lifetime of the VM on a SEV system and
would be destroyed on either image shutdown or successful
migration.
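
To make that concrete, a rough sketch of the target-side key
derivation under this scheme (the primitives and names stand
in for a real ECDH implementation, e.g. X25519; nothing here
is actual MH code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define KEYLEN 32

    /* Hypothetical ECDH/KDF primitives. */
    extern void ecdh_shared(const uint8_t priv[KEYLEN],
                            const uint8_t peer_pub[KEYLEN],
                            uint8_t secret[KEYLEN]);
    extern void kdf(const uint8_t secret[KEYLEN], uint8_t key[KEYLEN]);

    /* The owner-provisioned MH secret: this guest's private key (the
     * MH identity) plus the public key of the only source MH allowed
     * to make contact. The public part is all-zero on a boot from
     * image, meaning no migration is expected. */
    struct mh_secret {
        uint8_t my_priv[KEYLEN];
        uint8_t source_pub[KEYLEN];
    };

    static bool is_zero(const uint8_t *p, size_t n)
    {
        uint8_t acc = 0;
        for (size_t i = 0; i < n; i++)
            acc |= p[i];
        return acc == 0;
    }

    /* Derive the transport key on the target; returns false on a
     * fresh boot, where no migration contact should be accepted. The
     * key would be destroyed on shutdown or successful migration. */
    bool derive_transport_key(const struct mh_secret *s,
                              uint8_t key[KEYLEN])
    {
        uint8_t shared[KEYLEN];

        if (is_zero(s->source_pub, KEYLEN))
            return false;   /* normal boot, not a migration target */
        ecdh_shared(s->my_priv, s->source_pub, shared);
        kdf(shared, key);   /* transport key for page encryption */
        return true;
    }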

James