I have a 6-node cluster running Proxmox VE 7.4. All nodes are Dell PowerEdge R630, and I have now added a seventh node, a Dell PowerEdge R640. Live migration of a VM from an R630 host to the R640 works fine, but when migrating from the new R640 host to an R630, the VM crashes/freezes and needs a reboot to get back to normal.
This happens with any type of VM. Below is an example configuration that we use by default on all VMs running in the cluster:
cores: 8
cpu: Haswell-noTSX
ide0: stor01-vms:vm-10004-cloudinit,media=cdrom
ide2: none,media=cdrom
kvm: 1
memory: 32768
meta: creation-qemu=7.0.0,ctime=1664308344
numa: 0
ostype: l26
scsi0: stor01-vms:vm-10004-disk-0,cache=none,discard=on,iops_rd=2000,iops_wr=1000,mbps_rd=300,mbps_wr=200,size=160G,ssd=1
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=c438496d-1855-4e86-adbe-17731e37cf0d
sockets: 1
vga: std
vmgenid: ec907fe3-e72b-4535-8de5-ade34686f824
We set the CPU type to Haswell-noTSX to match the lowest-generation processor in the cluster, an Intel Xeon CPU E5-2680 v3 @ 2.50GHz. Some hosts have an Intel Xeon CPU E5-2680 v4, and live migration between them has always worked fine. The new Dell R640 node has an Intel Xeon Gold 6132 CPU @ 2.60GHz. I tested other CPU types such as kvm64 and IvyBridge, but the problem still occurs, so I don't know whether it is really a CPU incompatibility issue. My understanding is that, since the CPU type is fixed in the VM settings, this failure should not occur. I am running the same version of Proxmox on all nodes; the only difference is the server generation itself:
# pveversion --verbose
proxmox-ve: 7.4-1 (running kernel: 5.15.108-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-4
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 17.2.6-pve1
ceph-fuse: 17.2.6-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-network-perl: 0.7.3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
*No error message is displayed on the VM console; it just freezes and does not accept any commands.
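To help rule a host CPU difference in or out, the feature flags of the R640 and an R630 could be compared. The following is only a minimal sketch, assuming the contents of /proc/cpuinfo from each host have been saved to text files; the function names and file names are made up for illustration. Since the guest CPU type is fixed to Haswell-noTSX, a difference in host flags would not by itself explain the freeze, so the output should be read as a hint at most.

```python
import sys


def parse_flags(cpuinfo_text):
    """Return the CPU flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Match only the exact 'flags' key, not e.g. 'vmx flags'.
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_on_target(source_cpuinfo, target_cpuinfo):
    """Return sorted flags present on the source host but absent on the target."""
    return sorted(parse_flags(source_cpuinfo) - parse_flags(target_cpuinfo))


# Hypothetical CLI: script.py <source_cpuinfo_file> <target_cpuinfo_file>
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as src, open(sys.argv[2]) as dst:
        for flag in missing_on_target(src.read(), dst.read()):
            print(flag)
```

For the failing direction, the R640 file would be passed as the source and the R630 file as the target, for example `python3 cpuflags_diff.py r640_cpuinfo.txt r630_cpuinfo.txt` (hypothetical file names).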