trouble live migrating from 7.1 to 7.2 nodes

sungen

May 5, 2022
Hi,

I've updated one of four nodes in a cluster from 7.1 to 7.2. It's a standard install, except that the 7.2 node has a newer frr install. The updated node seems happy: it's in quorum, /etc/pve looks normal, etc.

Live migration of VMs from the 7.1 nodes to the 7.2 node is failing, with pretty generic output (in the web UI).

Code:
2022-05-05 02:21:03 starting migration of VM 404 to node d...
2022-05-05 02:21:03 starting VM 404 on remote node d...
2022-05-05 02:21:09 [d...] got timeout
2022-05-05 02:21:09 ERROR: online migrate failure - remote command failed with exit code 255
2022-05-05 02:21:09 aborting phase 2 - cleanup resources
2022-05-05 02:21:09 migrate_cancel
2022-05-05 02:21:15 ERROR: migration finished with problems (duration 00:00:12)
TASK ERROR: migration problems

Cold migration does work.

Hardware is AMD EPYC 7002 with external Ceph RBD storage.

I note that qemu has moved from version 6.1 to 6.2.

Would appreciate help with debugging and resolving this issue. Where should I start?

Cheers
 
There should be a start task on the target node - could you check the log of that? Also please include the output of pveversion -v on both nodes, the VM config, and any relevant parts of the journal on both nodes around the time of the migration.
 
Is your QEMU-Version set to latest or fixed to 6.1?
 
The QEMU version in the VM is set to "latest". The VM was created on 6.1 (per the meta: line in the VMID.conf file).

The task log on the target host just has "TASK ERROR: got timeout".
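
(For reference, this is roughly how the task log can be read on the target node from the CLI; /var/log/pve/tasks is the standard PVE task log location, and the find pattern below is just an illustration:)

Code:
# on the target node: list recent qmstart tasks
grep qmstart /var/log/pve/tasks/index
# each task's output is stored in a file named after its UPID
find /var/log/pve/tasks -name 'UPID:d:*:qmstart:*' -exec cat {} \;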



Code:
# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.30-2-pve)
pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1)
pve-kernel-helper: 7.2-2
pve-kernel-5.15: 7.2-1
pve-kernel-5.13: 7.1-9
pve-kernel-5.11: 7.0-10
pve-kernel-5.15.30-2-pve: 5.15.30-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-1-pve: 5.13.19-3
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph-fuse: 15.2.13-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-6
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.2-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.1.8-1
proxmox-backup-file-restore: 2.1.8-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-10
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-1
pve-ha-manager: 3.3-4
pve-i18n: 2.7-1
pve-qemu-kvm: 6.2.0-5
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

We have upstream frr and also the mgmt-vrf packages on the hosts. (We're using BGP to the hypervisor.) The network seems normal on the updated node.

Code:
root@d:mgmt:~# dpkg -l | grep frr
ii frr                                  8.2.2-0~deb10u1                 amd64        FRRouting suite of internet protocols (BGP, OSPF, IS-IS, ...)
ii  frr-pythontools                      8.2.2-0~deb10u1                all          FRRouting suite - Python tools
root@d:mgmt:~# dpkg -l | grep vrf
ii  mgmt-vrf                             1.3                            all          Linux tools and config for Management VRF
ii  vrf                                  1.3                            amd64        Linux tools for VRF

Also, as the Ceph RBD cluster is remote, we don't have the ceph package installed on the hypervisor.
 
also please include output of pveversion -v on both nodes, the VM config and any relevant parts of the journal on both nodes around the time of the migration.
 
From the 7.1 node (the source of the migration):

Code:
root@c:~# pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-1-pve)
pve-manager: 7.1-7 (running version: 7.1-7/df5740ad)
pve-kernel-5.13: 7.1-4
pve-kernel-helper: 7.1-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-1-pve: 5.13.19-3
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph-fuse: 15.2.13-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

This is a newly created minimal VM. It will offline migrate, but not online.

Code:
root@c:~# cat /etc/pve/qemu-server/100.conf
boot: order=scsi0
cores: 1
memory: 2048
meta: creation-qemu=6.1.0,ctime=1651753367
name: migr-test
numa: 0
ostype: l26
scsi0: cephrbd:vm-100-disk-2,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=1c3e80af-78f4-43e3-9af9-0f159f86f65c
sockets: 1
vmgenid: 54949fa9-f9b6-413f-b2df-f4a2cae75438
 
Live migration of this VM from the 7.2 node back to the 7.1 node works if I set the QEMU version to 6.1 (otherwise it fails, since the machine version is 6.2 and can't run on Proxmox 7.1).
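
(For reference, pinning the machine version is roughly this; the pc-i440fx type is an assumption based on the VM config having no explicit machine: line, i.e. the default:)

Code:
# pin the machine version so the VM can run on the 7.1 node's QEMU 6.1
qm set 100 --machine pc-i440fx-6.1
# go back to "latest" later by removing the pinned value
qm set 100 --delete machine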

I've looked in the system logs but don't see anything that looks like a problem. I'm not sure what is meant by looking at the "journal".

Thanks!
 
Here is a diff of pveversion -v on the new node compared to the list here: https://pve.proxmox.com/wiki/Downlo...Proxmox_Virtual_Environment_7.x_to_latest_7.2

I've got more kernels and don't have the Proxmox Ceph packages.

A difference is that I have the libqb100 package installed (on 7.1 and 7.2 nodes).

Code:
# diff wiki-pvever.out pvever-me.out
4a5,6
> pve-kernel-5.13: 7.1-9
> pve-kernel-5.11: 7.0-10
6,7c8,12
< ceph: 16.2.7
< ceph-fuse: 16.2.7
---
> pve-kernel-5.13.19-6-pve: 5.13.19-15
> pve-kernel-5.13.19-1-pve: 5.13.19-3
> pve-kernel-5.11.22-7-pve: 5.11.22-12
> pve-kernel-5.11.22-1-pve: 5.11.22-2
> ceph-fuse: 15.2.13-pve1
23d27
< libqb0: 1.0.5-1

This is from bullseye; is it OK?

Code:
# dpkg -l | grep libqb
ii  libqb100:amd64                       2.0.3-1                        amd64        high performance client server features library
 
Please do journalctl --since "START OF MIGRATION" --until "END OF MIGRATION" on both nodes (replace with the correct timestamps) and post the result.
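
For example (the timestamps below are just placeholders):

Code:
journalctl --since "2022-05-06 05:57:00" --until "2022-05-06 05:58:00"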
 
Thanks for the help.

Here's the core bit of the logs.

source host:

Code:
May 06 05:57:22 c pvedaemon[3498796]: <root@pam> starting task UPID:c:003580CF:50FB6499:6274F102:qmigrate:100:root@pam:
May 06 05:57:23 c pmxcfs[2544]: [status] notice: received log
May 06 05:57:28 c pmxcfs[2544]: [status] notice: received log
May 06 05:57:29 c pmxcfs[2544]: [status] notice: received log
May 06 05:57:34 c pmxcfs[2544]: [status] notice: received log
May 06 05:57:34 c pvedaemon[3506383]: migration problems
May 06 05:57:34 c pvedaemon[3498796]: <root@pam> end task UPID:c:003580CF:50FB6499:6274F102:qmigrate:100:root@pam: migration problems

dest host:

Code:
May 06 05:57:23 d qm[3265]: <root@pam> starting task UPID:d:00000CC2:000010B9:6274F103:qmstart:100:root@pam:
May 06 05:57:23 d qm[3266]: start VM 100: UPID:d:00000CC2:000010B9:6274F103:qmstart:100:root@pam:
May 06 05:57:28 d qm[3266]: got timeout
May 06 05:57:28 d qm[3265]: <root@pam> end task UPID:d:00000CC2:000010B9:6274F103:qmstart:100:root@pam: got timeout

I went ahead and tried the 7.2 node with the 5.13 kernel; this gave no change (it still failed).

Is it possible to make qm logging more verbose?
 
My guess is that some query to your Ceph cluster fails to return within the default timeout of 5s. Could you post your storage config? Note that we added some checks in the RBD storage plugin to prevent misconfigurations from causing data loss; these are only there on 7.2, not 7.1.
 
Note that the Proxmox cluster is a Ceph client only; the Ceph cluster itself is separate.

I verified that the VM will live migrate if there is no Ceph-backed disk image attached. Live migration also works with an NFS-backed disk attached.
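
(The test migrations were started from the CLI, roughly like this; "d" is the shortened name of the target node:)

Code:
# online (live) migration of the test VM to node d
qm migrate 100 d --online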

I've updated the 7.2 node's packages against the Proxmox Ceph repo (to get to 16.2.7); no change after this. I also added the "ceph" package, again with no change.

We've been using this RBD storage for 3-4 years now. The configuration is pretty simple; we just have an entry in /etc/pve/storage.cfg like:

Code:
rbd: prod
        content images
        krbd 0
        monhost ceph1-stor ceph2-stor ceph3-stor ceph4-stor ceph5-stor
        pool proxmox_prod
        username proxmox_prod

And we have a keyfile for cephx auth in /etc/pve/priv/ceph/STORAGENAME.keyring.
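
(As a client-side sanity check, something like the following should list the pool using the same credentials, assuming the rbd CLI from ceph-common is installed; names are taken from the storage.cfg above:)

Code:
# any one of the monitors from storage.cfg works for -m
rbd -m ceph1-stor --id proxmox_prod \
    --keyring /etc/pve/priv/ceph/prod.keyring \
    -p proxmox_prod ls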

Thanks again!
 
On the 7.2 node, storage live migration is working. That is, I can move a disk attached to a running VM between Ceph pools.
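
(Roughly how I'm testing that; "prod2" is a hypothetical second RBD storage entry:)

Code:
# move scsi0 of the running test VM to another storage
qm move_disk 100 scsi0 prod2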
 
Doing a packet capture, it seems that the qm process may be trying to reach the Ceph monitors, perhaps as a "sanity check" at the start of migration.

However, the mgmt VRF setup on the nodes isolates processes started from ssh from the VRF that can access Ceph.
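
(A minimal sketch of how to see the difference, assuming the iproute2 VRF helpers and nc are available; the monitor hostname is from our storage.cfg and 6789 is the default mon v1 port:)

Code:
# which VRF (if any) the current shell is bound to
ip vrf identify $$
# try reaching a monitor from the mgmt VRF vs. the default VRF
ip vrf exec mgmt nc -vz -w 3 ceph1-stor 6789
ip vrf exec default nc -vz -w 3 ceph1-stor 6789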

Would there be a way to disable the new (?) ceph checks?

The network architecture we have, with routing to the hypervisor and EVPN/VXLAN plus the mgmt VRF, has some issues, but redoing all of that would be a big lift.

Thanks again.
 
Hi,

On these hosts we have dedicated network links to reach the Ceph storage. Management connections to the hosts are made on a 1GE network. A "mgmt vrf" setup is in place which, by default, launches ssh sessions in the mgmt VRF (so their routes point to the 1GE network).

Migration is set up to go over a high-speed link, and this ssh works, staying on the high-speed link even though the session is in the VRF. However, new connections cannot be opened on the high-speed links from within this ssh session. This quirk of the setup was OK before; multi-homed hosts generally have some oddities...

So now, in 7.2, qm (apparently) wants to query the Ceph monitors before starting the migration. Since this runs inside the ssh session using the mgmt VRF, it fails.

I've adjusted the configuration so that the migration ssh sessions are not moved to the mgmt VRF, and now migration works.

I did go ahead and remove the "ceph" package from the node, keeping with our RBD-client-only config.

I didn't find exactly where these checks are in the PVE code.

Thanks again, Cheers!
 
