[SOLVED] Live migration causes freeze on VM

rauth · Aug 3, 2023

I have a 6-node cluster running Proxmox VE 7.4. All nodes are Dell PowerEdge R630, and I now needed to add a seventh node, a Dell PowerEdge R640. Live migration of a VM from an R630 host to the R640 works fine, but when migrating from the new R640 host to an R630, the VM crashes/freezes and requires a reboot to get back to normal.

This happens with any type of VM, below is an example configuration that we use by default on all VMs running in the cluster:

cores: 8
cpu: Haswell-noTSX
ide0: stor01-vms:vm-10004-cloudinit,media=cdrom
ide2: none,media=cdrom
kvm: 1
memory: 32768
meta: creation-qemu=7.0.0,ctime=1664308344
numa: 0
ostype: l26
scsi0: stor01-vms:vm-10004-disk-0,cache=none,discard=on,iops_rd=2000,iops_wr=1000,mbps_rd=300,mbps_wr=200,size=160G,ssd=1
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=c438496d-1855-4e86-adbe-17731e37cf0d
sockets: 1
vga: std
vmgenid: ec907fe3-e72b-4535-8de5-ade34686f824

We set the CPU type to Haswell-noTSX to match the lowest-generation processor we have in the cluster, an Intel Xeon E5-2680 v3 @ 2.50GHz. Some hosts have Intel Xeon E5-2680 v4 CPUs, and live migration between them has always worked fine. (The new Dell R640 node has an Intel Xeon Gold 6132 CPU @ 2.60GHz.) I ran some tests with other CPU types such as kvm64 and IvyBridge, but the problem still occurs, so I don't know whether it is really a CPU incompatibility issue. My understanding is that, since the CPU type is set in the VM configuration, this failure should not occur. I am running the same version of Proxmox on all nodes; the only difference is the server generation itself.

# pveversion --verbose
proxmox-ve: 7.4-1 (running kernel: 5.15.108-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-4
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 17.2.6-pve1
ceph-fuse: 17.2.6-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-network-perl: 0.7.3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

* No error message is displayed on the VM console; it just freezes and does not accept any commands.

emunt6 · Aug 4, 2023

Hi!

There is a workaround for the freezes (I found it in another thread):
1. Suspend the VM
2. Migrate the VM
3. Resume the VM

Does this method work for you?
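
For reference, the three steps can be sketched on the command line roughly as follows. This is only a sketch: the VMID (10004) and target node name (pve-r630) in the usage comment are placeholders, the resume step assumes root SSH access between cluster nodes (after the migration the VM lives on the target node, so the resume has to be issued there), and DRY_RUN defaults to 1 so that the commands are only printed.

```shell
#!/bin/sh
# Sketch of the suspend -> migrate -> resume workaround.
# Assumptions: executed on the source node; VMID and target node are placeholders.

# Print the command when DRY_RUN=1 (default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# migrate_suspended VMID TARGET_NODE
migrate_suspended() {
    vmid="$1"
    target="$2"
    run qm suspend "$vmid" || return 1                      # pause the guest
    run qm migrate "$vmid" "$target" --online || return 1   # the VM process is still running, so migrate online
    run ssh "root@${target}" qm resume "$vmid"              # resume on the node that now owns the VM
}

# Usage (prints the three commands without running them):
#   migrate_suspended 10004 pve-r630
```

Set DRY_RUN=0 only after checking the printed commands against your own cluster.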

rauth · Aug 4, 2023

Hi @emunt6,

I have not tested this method, but I found the solution in these threads:

https://forum.proxmox.com/threads/vm-stuck-freeze-after-live-migration.114867/
https://forum.proxmox.com/threads/virtual-machines-freeze-just-after-a-live-migration.109796/

I updated my hosts to kernel 5.19 and it solved the problem

# apt install pve-kernel-5.19
# reboot

Neobin · Aug 5, 2023

rauth said:
Hi @emunt6,

I have not tested this method, but I found the solution in these threads:

https://forum.proxmox.com/threads/vm-stuck-freeze-after-live-migration.114867/
https://forum.proxmox.com/threads/virtual-machines-freeze-just-after-a-live-migration.109796/

I updated my hosts to kernel 5.19 and it solved the problem

# apt install pve-kernel-5.19
# reboot

Be advised that the 5.19 kernel has been EOL for quite some time; it is recommended to use the 6.2 one instead: [1].

Or, of course, upgrade to PVE 8, where the 6.2 kernel is the (current) default: [2].

[1] https://forum.proxmox.com/threads/opt-in-linux-6-2-kernel-for-proxmox-ve-7-x-available.124189
[2] https://pve.proxmox.com/wiki/Upgrade_from_7_to_8

adman · Aug 13, 2023

I'm seeing this behavior with Debian VMs with PVE 8, so new kernels haven't fixed the problem.

fiona · Aug 14, 2023

Hi,

adman said:
I'm seeing this behavior with Debian VMs with PVE 8, so new kernels haven't fixed the problem.

The same symptoms don't necessarily mean the same problem. If upgrading the kernel solved the problem for @rauth but not for you, you likely have a different one. Please share the output of pveversion -v from both the source and target node, as well as the VM configuration. What kind of physical CPU do the source and target nodes have?

tj90241 · Sep 14, 2023

adman, I bumped into the issue on the new kernels and proposed a patch to the kernel (upstream) tonight. It is related to the earlier FPU/PKRU fixes for live migrations.

As fiona suggests, the problem that you (adman) are observing is likely different from the OP's.

Just wanted to express my gratitude to the Proxmox community, as when I was first looking to understand the issue, I found a lot of breadcrumbs on the forums here that led to my understanding of what exactly the problem was.

fiona · Sep 25, 2023

Hi,

tj90241 said:
adman, I bumped into the issue on the new kernels and proposed a patch to the kernel (upstream) tonight. It is related to the earlier FPU/PKRU fixes for live migrations.

As fiona suggests, the problem that you (adman) are observing is likely different from the OP's.

Just wanted to express my gratitude to the Proxmox community, as when I was first looking to understand the issue, I found a lot of breadcrumbs on the forums here that led to my understanding of what exactly the problem was.

could you maybe share the link to the upstream patch?

Note that we do have a workaround for the FPU/PKRU issue when migrating from 5.15 to 6.2 in kernels >= 6.2.16-5: https://git.proxmox.com/?p=pve-kern...e;hp=e8568c4378ba3f8bb398284306c1d432bfeeb5a6
Is it related to that? Could it be that it's not complete?

tj90241 · Sep 25, 2023

fiona said:
Hi,

could you maybe share the link to the upstream patch?

Note that we do have a workaround for the FPU/PKRU issue when migrating from 5.15 to 6.2 in kernels >= 6.2.16-5: https://git.proxmox.com/?p=pve-kern...e;hp=e8568c4378ba3f8bb398284306c1d432bfeeb5a6
Is it related to that? Could it be that it's not complete?

Unfortunately, my patch was not accepted. You can find it here, though: https://lore.kernel.org/all/ZQRNmsWcOM1xbNsZ@luigi.stachecki.net/T/

As you guessed, it is indeed related to the PKRU issue. Essentially, the PKRU patch introduced a change where migrating from a kernel without that patch to a kernel with that patch in many cases triggers a bug in qemu that instantly corrupts the VM.

Upstream's belief is that the only way to (safely) proceed is to essentially power off the VMs, upgrade to the PKRU-patched kernel, and then power the VMs back on.

tj90241 · Sep 25, 2023

I should also mention that the PKRU patch had some (other) bad regressions that were found -- you should pull the fixes for those, too:

https://lore.kernel.org/lkml/Yv0T8iFq2xb1301w@work-vm/T/#m5d070c36f6e957c514ca601c6b39c66c9ce4453e

fiona · Sep 25, 2023

tj90241 said:
As you guessed, it is indeed related to the PKRU issue. Essentially, the PKRU patch introduced a change where migrating from a kernel without that patch to a kernel with that patch in many cases triggers a bug in qemu that instantly corrupts the VM.

That's what our workaround is about: https://git.proxmox.com/?p=pve-kern...e;hp=e8568c4378ba3f8bb398284306c1d432bfeeb5a6

tj90241 said:
Upstream's belief is that the only way to (safely) proceed is to essentially power off the VMs, upgrade to the PKRU-patched kernel, and then power the VMs back on.

We thought about fixing QEMU too (see the commit message from the workaround), but decided to go the kernel route, as we have enough information there to detect when a migration is coming from a host without the ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0") patch.

With what exact source and target kernel versions are you experiencing the issue? What physical CPUs do the nodes have? What CPU type did you set in the VM configuration?

fiona · Sep 25, 2023

tj90241 said:
I should also mention that the PKRU patch had some (other) bad regressions that were found -- you should pull the fixes for those, too:

https://lore.kernel.org/lkml/Yv0T8iFq2xb1301w@work-vm/T/#m5d070c36f6e957c514ca601c6b39c66c9ce4453e

Note that we base our kernel off of the Ubuntu one, so we get most stable fixes from there. That patch is already part of our kernels >= 6.1.

You are talking about regressions in the plural, but when looking at git log torvalds/master arch/x86/kvm, I only see this single one with

Code:

Fixes: ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0")

Is there another one?

tj90241 · Sep 26, 2023

I can't say too much here with regard to the Ubuntu kernel... but do note that the (would-be backported) patch is absent in Ubuntu's 5.15 kernel. It's that bad!

I see your workaround, though I think the scope of the problem extends beyond PKRU. This is a multi-faceted problem.

The problem is that the PKRU patch "fixes" things by addressing a long-standing issue in the KVM interface in a well-meaning manner: it effectively changes the "dump processor context" (KVM_GET_XSAVE) ioctl such that the source hypervisor now exports its xsave buffer in terms of the guest CPU context. It used to be in terms of the host CPU context, which is objectively wrong.

Unfortunately, as part of the PKRU patch, there is a subtle gotcha added:
The destination hypervisor (as part of KVM_SET_XSAVE ioctl) will now reject ANY context which does not match the guest context. So -- if you have, say, hypervisors that are all capable of AVX256 (again, suppose Broadwell)... trying to live-migrate a guest with Ivy-Bridge guest context from a kernel lacking Leo's patch to a kernel with Leo's patch will now result in instant corruption of the VM. Formerly, the breakage was limited to the very new features of Skylake -- no more...

I have replicated this in an environment containing all Skylake CPUs with Broadwell guests. Again: the hypervisors all support PKRU/are Skylake... but because of the changes introduced in the patch and the fact that the destination now enforces Skylake context (but the guest sends Broadwell)... it triggers a different bug entirely.

Put another way:
Yesterday, before Leo's patch, migrating from Skylake to Broadwell was broken. PKRU.
Today, with Leo's patch, migrating from Skylake to Skylake is broken if the guest is non-PKRU/Broadwell based.

Hopefully that makes sense. I am always happy to meet interactively to explain over a Zoom meeting or something else.

EDIT: well timed, as seanjc himself has bumped my initial patch with something that may be much better today!

fiona · Sep 26, 2023

tj90241 said:
I can't say too much here with regard to the Ubuntu kernel... but do note that the (would-be backported) patch is absent in Ubuntu's 5.15 kernel. It's that bad!

But the original ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0") also isn't part of that kernel, so no need for backporting the Fixes commit.

tj90241 said:
I see your workaround, though I think the scope of the problem extends beyond PKRU. This is a multi-faceted problem.

The problem is that the PKRU patch "fixes" things by addressing a long-standing issue in the KVM interface in a well-meaning manner: it effectively changes the "dump processor context" (KVM_GET_XSAVE) ioctl such that the source hypervisor now exports its xsave buffer in terms of the guest CPU context. It used to be in terms of the host CPU context, which is objectively wrong.

Unfortunately, as part of the PKRU patch, there is a subtle gotcha added:
The destination hypervisor (as part of KVM_SET_XSAVE ioctl) will now reject ANY context which does not match the guest context. So -- if you have, say, hypervisors that are all capable of AVX256 (again, suppose Broadwell)... trying to live-migrate a guest with Ivy-Bridge guest context from a kernel lacking Leo's patch to a kernel with Leo's patch will now result in instant corruption of the VM. Formerly, the breakage was limited to the very new features of Skylake -- no more...

I see, but I'm actually not aware of our users reporting the non-PKRU kind of issue. But maybe it's just a rarer configuration. There were quite a few reports because of the PKRU issue and that got resolved by our workaround. @adman never answered so we don't know if they had the non-PKRU issue.

tj90241 said:
I have replicated this in an environment containing all Skylake CPUs with Broadwell guests. Again: the hypervisors all support PKRU/are Skylake... but because of the changes introduced in the patch and the fact that the destination now enforces Skylake context (but the guest sends Broadwell)... it triggers a different bug entirely.

Put another way:
Yesterday, before Leo's patch, migrating from Skylake to Broadwell was broken. PKRU.
Today, with Leo's patch, migrating from Skylake to Skylake is broken if the guest is non-PKRU/Broadwell based.

Hopefully that makes sense. I am always happy to meet interactively to explain over a Zoom meeting or something else.

Thank you for making us aware of the other issues! We'll try to reproduce the "Broadwell guest from Skylake host to Skylake host" one and keep an eye on the upstream discussion. Just to clarify, the issue is supposed to happen when migrating from a kernel without Leo's patch to a kernel with Leo's patch or also when migrating between two kernels with the patch?

tj90241 said:
EDIT: well timed, as seanjc himself has bumped my initial patch with something that may be much better today!

For reference: https://lore.kernel.org/all/ZQRNmsW.../T/#m999fd96e3c54b5a32640e4c1be0479e2e06555d6

fiona · Sep 26, 2023

tj90241 said:
I have replicated this in an environment containing all Skylake CPUs with Broadwell guests. Again: the hypervisors all support PKRU/are Skylake... but because of the changes introduced in the patch and the fact that the destination now enforces Skylake context (but the guest sends Broadwell)... it triggers a different bug entirely.

Did you reproduce this in a Proxmox VE environment or something else?

tj90241 said:
Put another way:
Yesterday, before Leo's patch, migrating from Skylake to Broadwell was broken. PKRU.
Today, with Leo's patch, migrating from Skylake to Skylake is broken if the guest is non-PKRU/Broadwell based.

Thinking about it again, I feel like this is the issue fixed by our workaround: If the guest is non-PKRU-based but the hosts are. If the vCPU does not support the PKRU feature, but it's present in the migrated guest state (because it might've "leaked" on a kernel without Leo's patch), we mask it out.
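
To illustrate the masking idea in isolation, here is a small sketch using shell arithmetic on XSAVE feature bitmaps. It is not the kernel code from the workaround; the function name and the example values are made up for illustration. The only fact relied on is the architectural bit layout (x87 = bit 0, SSE = bit 1, AVX = bit 2, PKRU = bit 9).

```shell
#!/bin/sh
# Illustration only: clear the PKRU bit from an incoming XSAVE feature bitmap
# when the vCPU model does not support PKRU. Not the actual kernel patch.

# Architectural XSAVE state-component bit for PKRU.
XFEATURE_MASK_PKRU=$((1 << 9))

# sanitize_xstate_bv INCOMING GUEST_SUPPORTED
# Prints the bitmap that would be accepted for the guest, in hex.
sanitize_xstate_bv() {
    incoming=$(( $1 ))
    guest_supported=$(( $2 ))
    if [ $(( guest_supported & XFEATURE_MASK_PKRU )) -eq 0 ]; then
        # PKRU "leaked" into the migrated state although the vCPU lacks it: mask it out.
        incoming=$(( incoming & ~XFEATURE_MASK_PKRU ))
    fi
    printf '0x%x\n' "$incoming"
}

# Usage:
#   sanitize_xstate_bv 0x207 0x7     # guest has x87/SSE/AVX only, PKRU bit gets cleared
#   sanitize_xstate_bv 0x207 0x207   # guest supports PKRU, bitmap is left unchanged
```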

tj90241 · Sep 26, 2023

fiona said:
Did you reproduce this in a Proxmox VE environment or something else?

Thinking about it again, I feel like this is the issue fixed by our workaround: If the guest is non-PKRU-based but the hosts are. If the vCPU does not support the PKRU feature, but it's present in the migrated guest state (because it might've "leaked" on a kernel without Leo's patch), we mask it out.

Reproduced with Ubuntu 22.04 userspace across a variety of kernels - their HWE kernels, kernels self-compiled from kernel.org, etc.

re: your workaround... I guess what I mean by multi-faceted is that I would equally expect the following scenario to break (I do not have the means to verify):

Hypervisors both Broadwell
Guests use SandyBridge CPU model
Source kernel lacks PKRU patch
Destination kernel has PKRU patch

If I understand correctly, in this case the Broadwell hypervisor will send an XSAVE buffer with AVX256 state. The destination kernel will refuse it, as a PKRU-patched kernel will only accept an XSAVE buffer that is compatible with the guest CPU.

tj90241 · Sep 26, 2023

fiona said:
But the original ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0") also isn't part of that kernel, so no need for backporting the Fixes commit.

I see, but I'm actually not aware of our users reporting the non-PKRU kind of issue. But maybe it's just a rarer configuration. There were quite a few reports because of the PKRU issue and that got resolved by our workaround. @adman never answered so we don't know if they had the non-PKRU issue.

Thank you for making us aware of the other issues! We'll try to reproduce the "Broadwell guest from Skylake host to Skylake host" one and keep an eye on the upstream discussion. Just to clarify, the issue is supposed to happen when migrating from a kernel without Leo's patch to a kernel with Leo's patch or also when migrating between two kernels with the patch?

For reference: https://lore.kernel.org/all/ZQRNmsW.../T/#m999fd96e3c54b5a32640e4c1be0479e2e06555d6

I mean the PKRU patch - they have not backported it to their 5.15 series yet IIUC.

I did see user reports of people complaining of what looked like this exact scenario post-PKRU patch in IIRC Proxmox 5.19(?) kernel on the forums here... that was what pointed me down the path that led me to discovering the issue. I can try to dig up the threads later.

No problem -- yes, as you see in the test case I proposed above, it's basically any time:

The guest FPU context mismatches the host FPU context (i.e., the host CPU model and guest CPU model differ)
Source kernel does not have the PKRU patch (ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0"))
Destination kernel has the PKRU patch (same commit)

(...even if the feature difference does not amount to PKRU itself)
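
The three conditions above can be written down as a small predicate. This is only a sketch of the hypothesis as stated in this post, not kernel or QEMU logic: the function name is made up, the feature bitmaps are example values (0x7 = x87/SSE/AVX, bits 3 and 4 = MPX state), and whether a given kernel carries the patch has to be supplied by hand as yes/no.

```shell
#!/bin/sh
# Sketch of the suspected failure conditions; illustrative only.

# migration_at_risk HOST_XFEATURES GUEST_XFEATURES SRC_PATCHED DST_PATCHED
# SRC_PATCHED / DST_PATCHED are "yes" or "no" (does the kernel include ad856280ddea?).
migration_at_risk() {
    host=$(( $1 ))
    guest=$(( $2 ))
    src_patched="$3"
    dst_patched="$4"
    # State components the host CPU has but the guest CPU model does not expose.
    extra=$(( host & ~guest ))
    if [ "$extra" -ne 0 ] && [ "$src_patched" = "no" ] && [ "$dst_patched" = "yes" ]; then
        printf 'at-risk (extra host xfeatures: 0x%x)\n' "$extra"
    else
        echo "ok"
    fi
}

# Usage:
#   migration_at_risk 0x1f 0x7 no yes    # mismatch without PKRU involved, unpatched -> patched
#   migration_at_risk 0x1f 0x7 yes yes   # both kernels patched
```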

igort · Nov 27, 2023

Hello,

I have an almost identical problem. Migration from a weaker to a stronger node works well (I'm not 100% sure about that either), but from a stronger to a weaker one there are big problems. Some VMs don't have migration problems, but many do. I'm far from finding out exactly which VM can be safely migrated in both directions, for now I think Debian based ones can, RedHat based can't, Windows VMs seem to survive, Mikrotik VMs don't.

Currently, a very reliable indicator that a VM is frozen is when the message "Guest Agent not running" appears for that VM (before the migration it was working properly and the IP address of the VM was visible).

I have 3 nodes with different CPUs; 96 x Intel(R) Xeon(R) Platinum 8168, 56 x Intel(R) Xeon(R) CPU E5-2690 v4, 48 x Intel(R) Xeon(R) CPU E5-2680 v3
All three nodes have an Enterprise repository.
All three are up to date with pve-manager/8.1.3/b46aac3b42da5d15 (running kernel: 6.5.11-4-pve)
All VMs have the CPU set to x86-64-v2-AES.
Do you need any other information?

Is this a bug? Can I expect a quick solution? Is there any temporary solution (other than not using live migration)?
Thanks!

fiona · Nov 27, 2023

Hi,

igort said:
Hello,

I have an almost identical problem. Migration from a weaker to a stronger node works well (I'm not 100% sure about that either), but from a stronger to a weaker one there are big problems. Some VMs don't have migration problems, but many do. I'm far from finding out exactly which VM can be safely migrated in both directions, for now I think Debian based ones can, RedHat based can't, Windows VMs seem to survive, Mikrotik VMs don't.

Currently, a very reliable indicator that a VM is frozen is when the message "Guest Agent not running" appears for that VM (before the migration it was working properly and the IP address of the VM was visible).

I have 3 nodes with different CPUs; 96 x Intel(R) Xeon(R) Platinum 8168, 56 x Intel(R) Xeon(R) CPU E5-2690 v4, 48 x Intel(R) Xeon(R) CPU E5-2680 v3
All three nodes have an Enterprise repository.
All three are up to date with pve-manager/8.1.3/b46aac3b42da5d15 (running kernel: 6.5.11-4-pve)
All VMs have the CPU set to x86-64-v2-AES.
Do you need any other information?

Is this a bug? Can I expect a quick solution? Is there any temporary solution (other than not using live migration)?
Thanks!

Since you are using kernel 6.5, this likely is: https://forum.proxmox.com/threads/proxmox-8-1-kernel-6-5-11-4-rcu_sched-stall-cpu.136992/

For now, downgrading the kernel is suggested to avoid the issue. An alternative is turning off the pcid CPU flag. The issue seems to be resolved in kernel 6.7-rc1 so I'm currently searching for the fix.
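
As a rough sketch of the second option, the pcid flag can be switched off per VM through the cpu property. The VMID (101) in the usage comment is a placeholder, and DRY_RUN defaults to 1 so the command is only printed. Note that --cpu replaces the whole cpu setting, so any other flags already configured for the VM would need to be repeated, and the change most likely only takes effect after the VM has been fully stopped and started again.

```shell
#!/bin/sh
# Sketch: turn off the pcid CPU flag for one VM. VMID and CPU type are placeholders.

# Print the command when DRY_RUN=1 (default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# disable_pcid VMID CPUTYPE
disable_pcid() {
    vmid="$1"
    cputype="$2"
    run qm set "$vmid" --cpu "${cputype},flags=-pcid"
}

# Usage (prints the command without running it):
#   disable_pcid 101 x86-64-v2-AES
```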
