Live migration fails to resume.

effeeue05

New Member
Jun 24, 2024
Most of our guests fail to resume when being live-migrated from one node to another. This seems to happen on all guests; a newly created guest will not have the problem initially, but after an unknown amount of time the problem surfaces. The error we see is "tunnel replied 'ERR: resume failed - VM XXX qmp command 'query-status' failed - got timeout' to command 'resume XXX'". The problem existed on Proxmox 7 and still exists after a recent upgrade to Proxmox 8. I am able to manually resume the guest after the error. It's as if the process tries to resume the guest before it's ready to be resumed on the new node.
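
Manually resuming works, e.g. on the target node:
Code:
qm resume XXX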

Code:
proxmox-ve: 8.2.0 (running kernel: 6.8.4-3-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.2-1
proxmox-backup-file-restore: 3.2.2-1
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2

Any help figuring out this issue would be appreciated.
 
Hi,
please share the full migration log as well as the configuration file of an affected VM (qm config XXX). The journal from the target node around the time the issue occurs would also be interesting.
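
For example (adjust the VMID and time range as needed):
Code:
qm config XXX
journalctl --since "2024-06-24 10:55" --until "2024-06-24 11:10"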
 
@fiona

Code:
2024-06-24 11:02:05 starting migration of VM 106 to node 'NODE4' (x.x.x.18)
2024-06-24 11:02:05 starting VM 106 on remote node 'NODE4'
2024-06-24 11:02:18 start remote tunnel
2024-06-24 11:02:19 ssh tunnel ver 1
2024-06-24 11:02:19 starting online/live migration on unix:/run/qemu-server/106.migrate
2024-06-24 11:02:19 set migration capabilities
2024-06-24 11:02:19 migration downtime limit: 100 ms
2024-06-24 11:02:19 migration cachesize: 256.0 MiB
2024-06-24 11:02:19 set migration parameters
2024-06-24 11:02:19 start migrate command to unix:/run/qemu-server/106.migrate
2024-06-24 11:02:30 average migration speed: 187.7 MiB/s - downtime 10335 ms
2024-06-24 11:02:30 migration status: completed
2024-06-24 11:02:35 ERROR: tunnel replied 'ERR: resume failed - VM 106 qmp command 'query-status' failed - got timeout' to command 'resume 106'
2024-06-24 11:02:43 ERROR: migration finished with problems (duration 00:00:38)
TASK ERROR: migration problems



balloon: 0
boot: order=scsi0;net0
cores: 2
memory: 2048
meta: creation-qemu=7.2.0,ctime=1709660274
name: migrate-testing
net0: virtio=BC:24:11:FB:D5:FE,bridge=vmbr0,firewall=1,link_down=1,tag=1234
net1: virtio=BC:24:11:F8:88:B3,bridge=vmbr0,link_down=1
numa: 0
onboot: 1
ostype: l26
parent: daily_1719277291408
scsi0: rbd:vm-106-disk-0,iothread=1,size=128G
scsihw: virtio-scsi-single
smbios1: uuid=22d47330-5c06-4c97-b751-2867aed0a64e
sockets: 1
vmgenid: 493abb7f-05c7-4bf7-8811-7737b5de75f9

@justinclift - About 30 seconds.

Thanks
 
@fiona - I forgot to include the journal log from the time this occurred.

Jun 25 09:35:26 NODE4 pmxcfs[2048]: [status] notice: received log
Jun 25 09:35:26 NODE4 sshd[580845]: Accepted publickey for root from x.x.x.18 port 33370 ssh2: RSA SHA256:aeJdRRu5IP1SizlPkvvFMe/xFzUknqu33DRyG9UhKuA
Jun 25 09:35:26 NODE4 sshd[580845]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jun 25 09:35:26 NODE4 systemd[1]: Created slice user-0.slice - User Slice of UID 0.
Jun 25 09:35:26 NODE4 systemd[1]: Starting user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Jun 25 09:35:26 NODE4 systemd-logind[1708]: New session 1109 of user root.
Jun 25 09:35:26 NODE4 systemd[1]: Finished user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Jun 25 09:35:26 NODE4 systemd[1]: Starting user@0.service - User Manager for UID 0...
Jun 25 09:35:26 NODE4 (systemd)[580848]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0)
Jun 25 09:35:26 NODE4 systemd[580848]: Queued start job for default target default.target.
Jun 25 09:35:26 NODE4 systemd[580848]: Created slice app.slice - User Application Slice.
Jun 25 09:35:26 NODE4 systemd[580848]: Reached target paths.target - Paths.
Jun 25 09:35:26 NODE4 systemd[580848]: Reached target timers.target - Timers.
Jun 25 09:35:26 NODE4 systemd[580848]: Listening on dirmngr.socket - GnuPG network certificate management daemon.
Jun 25 09:35:26 NODE4 systemd[580848]: Listening on gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
Jun 25 09:35:26 NODE4 systemd[580848]: Listening on gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Jun 25 09:35:26 NODE4 systemd[580848]: Listening on gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Jun 25 09:35:26 NODE4 systemd[580848]: Listening on gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Jun 25 09:35:26 NODE4 systemd[580848]: Reached target sockets.target - Sockets.
Jun 25 09:35:26 NODE4 systemd[580848]: Reached target basic.target - Basic System.
Jun 25 09:35:26 NODE4 systemd[580848]: Reached target default.target - Main User Target.
Jun 25 09:35:26 NODE4 systemd[580848]: Startup finished in 227ms.
Jun 25 09:35:26 NODE4 systemd[1]: Started user@0.service - User Manager for UID 0.
Jun 25 09:35:26 NODE4 systemd[1]: Started session-1109.scope - Session 1109 of User root.
Jun 25 09:35:26 NODE4 sshd[580845]: pam_env(sshd:session): deprecated reading of user environment enabled
Jun 25 09:35:26 NODE4 sshd[580845]: Received disconnect from x.x.x.18 port 33370:11: disconnected by user
Jun 25 09:35:26 NODE4 sshd[580845]: Disconnected from user root x.x.x.18 port 33370
Jun 25 09:35:26 NODE4 sshd[580845]: pam_unix(sshd:session): session closed for user root
Jun 25 09:35:26 NODE4 systemd[1]: session-1109.scope: Deactivated successfully.
Jun 25 09:35:26 NODE4 systemd-logind[1708]: Session 1109 logged out. Waiting for processes to exit.
Jun 25 09:35:26 NODE4 systemd-logind[1708]: Removed session 1109.
Jun 25 09:35:26 NODE4 sshd[580869]: Accepted publickey for root from x.x.x.18 port 33376 ssh2: RSA SHA256:aeJdRRu5IP1SizlPkvvFMe/xFzUknqu33DRyG9UhKuA
Jun 25 09:35:26 NODE4 sshd[580869]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jun 25 09:35:26 NODE4 systemd-logind[1708]: New session 1111 of user root.
Jun 25 09:35:26 NODE4 systemd[1]: Started session-1111.scope - Session 1111 of User root.
Jun 25 09:35:26 NODE4 sshd[580869]: pam_env(sshd:session): deprecated reading of user environment enabled
Jun 25 09:35:27 NODE4 qm[580875]: <root@pam> starting task UPID:NODE4:0008DD0C:147BB0DB:667AD5AF:qmstart:106:root@pam:
Jun 25 09:35:27 NODE4 qm[580876]: start VM 106: UPID:NODE4:0008DD0C:147BB0DB:667AD5AF:qmstart:106:root@pam:
Jun 25 09:35:34 NODE4 kernel: rbd20: p1 p2 < p5 >
Jun 25 09:35:34 NODE4 kernel: rbd: rbd20: capacity 137438953472 features 0x3d
Jun 25 09:35:34 NODE4 systemd[1]: Started 106.scope.
Jun 25 09:35:35 NODE4 kernel: tap106i0: entered promiscuous mode
Jun 25 09:35:35 NODE4 kernel: vmbr0v2020: port 3(fwpr106p0) entered blocking state
Jun 25 09:35:35 NODE4 kernel: vmbr0v2020: port 3(fwpr106p0) entered disabled state
Jun 25 09:35:35 NODE4 kernel: fwpr106p0: entered allmulticast mode
Jun 25 09:35:35 NODE4 kernel: fwpr106p0: entered promiscuous mode
Jun 25 09:35:35 NODE4 kernel: vmbr0v2020: port 3(fwpr106p0) entered blocking state
Jun 25 09:35:35 NODE4 kernel: vmbr0v2020: port 3(fwpr106p0) entered forwarding state
Jun 25 09:35:35 NODE4 kernel: fwbr106i0: port 1(fwln106i0) entered blocking state
Jun 25 09:35:35 NODE4 kernel: fwbr106i0: port 1(fwln106i0) entered disabled state
Jun 25 09:35:35 NODE4 kernel: fwln106i0: entered allmulticast mode
Jun 25 09:35:35 NODE4 kernel: fwln106i0: entered promiscuous mode
Jun 25 09:35:35 NODE4 kernel: fwbr106i0: port 1(fwln106i0) entered blocking state
Jun 25 09:35:35 NODE4 kernel: fwbr106i0: port 1(fwln106i0) entered forwarding state
Jun 25 09:35:35 NODE4 kernel: fwbr106i0: port 2(tap106i0) entered blocking state
Jun 25 09:35:35 NODE4 kernel: fwbr106i0: port 2(tap106i0) entered disabled state
Jun 25 09:35:35 NODE4 kernel: tap106i0: entered allmulticast mode
Jun 25 09:35:35 NODE4 kernel: fwbr106i0: port 2(tap106i0) entered blocking state
Jun 25 09:35:35 NODE4 kernel: fwbr106i0: port 2(tap106i0) entered forwarding state
Jun 25 09:35:36 NODE4 kernel: tap106i1: entered promiscuous mode
Jun 25 09:35:36 NODE4 kernel: vmbr0: port 3(tap106i1) entered blocking state
Jun 25 09:35:36 NODE4 kernel: vmbr0: port 3(tap106i1) entered disabled state
Jun 25 09:35:36 NODE4 kernel: tap106i1: entered allmulticast mode
Jun 25 09:35:36 NODE4 kernel: vmbr0: port 3(tap106i1) entered blocking state
Jun 25 09:35:36 NODE4 kernel: vmbr0: port 3(tap106i1) entered forwarding state
Jun 25 09:35:36 NODE4 qm[580875]: <root@pam> end task UPID:NODE4:0008DD0C:147BB0DB:667AD5AF:qmstart:106:root@pam: OK
Jun 25 09:35:36 NODE4 sshd[580869]: Received disconnect from x.x.x.18 port 33376:11: disconnected by user
Jun 25 09:35:36 NODE4 sshd[580869]: Disconnected from user root x.x.x.18 port 33376
Jun 25 09:35:36 NODE4 sshd[580869]: pam_unix(sshd:session): session closed for user root
Jun 25 09:35:36 NODE4 systemd[1]: session-1111.scope: Deactivated successfully.
Jun 25 09:35:36 NODE4 systemd-logind[1708]: Session 1111 logged out. Waiting for processes to exit.
Jun 25 09:35:36 NODE4 systemd-logind[1708]: Removed session 1111.
Jun 25 09:35:36 NODE4 sshd[581309]: Accepted publickey for root from x.x.x.18 port 49484 ssh2: RSA SHA256:aeJdRRu5IP1SizlPkvvFMe/xFzUknqu33DRyG9UhKuA
Jun 25 09:35:36 NODE4 sshd[581309]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jun 25 09:35:36 NODE4 systemd-logind[1708]: New session 1112 of user root.
Jun 25 09:35:36 NODE4 systemd[1]: Started session-1112.scope - Session 1112 of User root.
Jun 25 09:35:36 NODE4 sshd[581309]: pam_env(sshd:session): deprecated reading of user environment enabled
Jun 25 09:35:49 NODE4 pmxcfs[2048]: [status] notice: received log
Jun 25 09:35:53 NODE4 qm[581315]: VM 106 qmp command failed - VM 106 qmp command 'query-status' failed - got timeout
Jun 25 09:35:57 NODE4 pvedaemon[517917]: VM 106 qmp command failed - VM 106 qmp command 'query-proxmox-support' failed - unable to connect to VM 106 qmp socket - timeout after 5>
Jun 25 09:35:57 NODE4 pvedaemon[511154]: VM 106 qmp command failed - VM 106 qmp command 'query-proxmox-support' failed - unable to connect to VM 106 qmp socket - timeout after 5>
Jun 25 09:35:57 NODE4 pmxcfs[2048]: [status] notice: received log
Jun 25 09:35:57 NODE4 sshd[581422]: Accepted publickey for root from x.x.x.13 port 39376 ssh2: RSA SHA256:oNqPAVatqUQeBGMm1In5Heq4ZBgQWqJOe80HO5hOqus
Jun 25 09:35:57 NODE4 sshd[581422]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jun 25 09:35:57 NODE4 systemd-logind[1708]: New session 1113 of user root.
Jun 25 09:35:57 NODE4 systemd[1]: Started session-1113.scope - Session 1113 of User root.
Jun 25 09:35:57 NODE4 sshd[581422]: pam_env(sshd:session): deprecated reading of user environment enabled
Jun 25 09:35:58 NODE4 sshd[581309]: Received disconnect from x.x.x.18 port 49484:11: disconnected by user
Jun 25 09:35:58 NODE4 sshd[581309]: Disconnected from user root x.x.x.18 port 49484
Jun 25 09:35:58 NODE4 sshd[581309]: pam_unix(sshd:session): session closed for user root
Jun 25 09:35:58 NODE4 systemd[1]: session-1112.scope: Deactivated successfully.
Jun 25 09:35:58 NODE4 systemd[1]: session-1112.scope: Consumed 1.454s CPU time.
Jun 25 09:35:58 NODE4 systemd-logind[1708]: Session 1112 logged out. Waiting for processes to exit.
Jun 25 09:35:58 NODE4 systemd-logind[1708]: Removed session 1112.
Jun 25 09:35:59 NODE4 sshd[581454]: Accepted publickey for root from x.x.x.18 port 47724 ssh2: RSA SHA256:aeJdRRu5IP1SizlPkvvFMe/xFzUknqu33DRyG9UhKuA
Jun 25 09:35:59 NODE4 sshd[581454]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jun 25 09:35:59 NODE4 systemd-logind[1708]: New session 1114 of user root.
Jun 25 09:35:59 NODE4 systemd[1]: Started session-1114.scope - Session 1114 of User root.
Jun 25 09:35:59 NODE4 sshd[581454]: pam_env(sshd:session): deprecated reading of user environment enabled
Jun 25 09:35:59 NODE4 sshd[581454]: Received disconnect from x.x.x.18 port 47724:11: disconnected by user
Jun 25 09:35:59 NODE4 sshd[581454]: Disconnected from user root x.x.x.18 port 47724
Jun 25 09:35:59 NODE4 sshd[581454]: pam_unix(sshd:session): session closed for user root
Jun 25 09:35:59 NODE4 systemd-logind[1708]: Session 1114 logged out. Waiting for processes to exit.
Jun 25 09:35:59 NODE4 systemd[1]: session-1114.scope: Deactivated successfully.
Jun 25 09:35:59 NODE4 systemd-logind[1708]: Removed session 1114.
Jun 25 09:36:00 NODE4 sshd[581461]: Accepted publickey for root from x.x.x.18 port 47732 ssh2: RSA SHA256:aeJdRRu5IP1SizlPkvvFMe/xFzUknqu33DRyG9UhKuA
Jun 25 09:36:00 NODE4 sshd[581461]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jun 25 09:36:00 NODE4 systemd-logind[1708]: New session 1115 of user root.
Jun 25 09:36:00 NODE4 systemd[1]: Started session-1115.scope - Session 1115 of User root.
Jun 25 09:36:00 NODE4 sshd[581461]: pam_env(sshd:session): deprecated reading of user environment enabled
Jun 25 09:36:01 NODE4 sshd[581461]: Received disconnect from x.x.x.18 port 47732:11: disconnected by user
Jun 25 09:36:01 NODE4 sshd[581461]: Disconnected from user root x.x.x.18 port 47732
Jun 25 09:36:01 NODE4 sshd[581461]: pam_unix(sshd:session): session closed for user root
Jun 25 09:36:01 NODE4 systemd[1]: session-1115.scope: Deactivated successfully.
Jun 25 09:36:01 NODE4 systemd-logind[1708]: Session 1115 logged out. Waiting for processes to exit.
Jun 25 09:36:01 NODE4 systemd-logind[1708]: Removed session 1115.
Jun 25 09:36:01 NODE4 pmxcfs[2048]: [status] notice: received log
 
Are those two sets of logs from the same migration attempt, or from different ones?

Asking because there's a large difference in the timestamps, so if those are supposed to be the same migration attempt then I'm wondering if there's a time sync problem between the nodes.
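
If it is time sync, comparing the clocks directly should show it, e.g. on both nodes at roughly the same moment:
Code:
date -R; timedatectl | grep -E 'Local time|synchronized'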
 
Sorry, I grabbed the wrong logs from the migration task. Here is the one that matches the journal log.

Code:
2024-06-25 09:35:26 starting migration of VM 106 to node 'NODE4' (x.x.x.18)
2024-06-25 09:35:26 starting VM 106 on remote node 'NODE4'
2024-06-25 09:35:36 start remote tunnel
2024-06-25 09:35:37 ssh tunnel ver 1
2024-06-25 09:35:37 starting online/live migration on unix:/run/qemu-server/106.migrate
2024-06-25 09:35:37 set migration capabilities
2024-06-25 09:35:37 migration downtime limit: 100 ms
2024-06-25 09:35:37 migration cachesize: 256.0 MiB
2024-06-25 09:35:37 set migration parameters
2024-06-25 09:35:37 start migrate command to unix:/run/qemu-server/106.migrate
2024-06-25 09:35:48 average migration speed: 187.7 MiB/s - downtime 10089 ms
2024-06-25 09:35:48 migration status: completed
2024-06-25 09:35:53 ERROR: tunnel replied 'ERR: resume failed - VM 106 qmp command 'query-status' failed - got timeout' to command 'resume 106'
2024-06-25 09:36:01 ERROR: migration finished with problems (duration 00:00:35)
TASK ERROR: migration problems
 
So the VM start task seems to complete successfully, but the QMP commands at the end of migration run into timeouts. Does the target node also have pve-qemu-kvm: 8.1.5-6?

What does the load on the target server look like? One way to check is grep '' /proc/pressure/*; the graphs in the node's Summary tab in the UI might also be worth a look.
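
For example, to log it to a file every few seconds during a migration:
Code:
while sleep 5; do date '+%F %T'; grep '' /proc/pressure/*; done >> pressure.log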

You could try using a bandwidth limit with qm set 106 --migrate_speed 60; maybe that helps. If you trust your migration network, you can also try pvesh set /cluster/options --migration type=insecure, or, if you have a dedicated migration network, pvesh set /cluster/options --migration network=A.B.C.D/E,type=insecure with your network's CIDR.
 
All nodes have the same version: pve-qemu-kvm: 8.1.5-6

Are you asking about the load on the target server during migration or just in general? I can check both but was just curious.

Let me try some of these suggestions and get back to you.
 
Here are the logs after setting --migrate_speed 60:

Code:
Migration Output:

2024-06-26 12:09:35 starting migration of VM 106 to node 'NODE4' (x.x.x.x)
2024-06-26 12:09:35 starting VM 106 on remote node 'NODE4'
2024-06-26 12:09:48 start remote tunnel
2024-06-26 12:09:48 ssh tunnel ver 1
2024-06-26 12:09:48 starting online/live migration on unix:/run/qemu-server/106.migrate
2024-06-26 12:09:48 set migration capabilities
2024-06-26 12:09:48 migration speed limit: 60.0 MiB/s
2024-06-26 12:09:48 migration downtime limit: 100 ms
2024-06-26 12:09:48 migration cachesize: 256.0 MiB
2024-06-26 12:09:48 set migration parameters
2024-06-26 12:09:48 start migrate command to unix:/run/qemu-server/106.migrate
2024-06-26 12:09:49 migration active, transferred 61.5 MiB of 2.0 GiB VM-state, 67.0 MiB/s
2024-06-26 12:09:50 migration active, transferred 123.0 MiB of 2.0 GiB VM-state, 60.9 MiB/s
2024-06-26 12:09:51 migration active, transferred 183.5 MiB of 2.0 GiB VM-state, 60.2 MiB/s
2024-06-26 12:09:52 migration active, transferred 244.6 MiB of 2.0 GiB VM-state, 60.0 MiB/s
2024-06-26 12:09:53 migration active, transferred 305.1 MiB of 2.0 GiB VM-state, 99.8 MiB/s
2024-06-26 12:09:54 migration active, transferred 366.1 MiB of 2.0 GiB VM-state, 60.1 MiB/s
2024-06-26 12:09:55 migration active, transferred 426.1 MiB of 2.0 GiB VM-state, 78.6 MiB/s
2024-06-26 12:09:56 migration active, transferred 486.1 MiB of 2.0 GiB VM-state, 59.3 MiB/s
2024-06-26 12:09:58 migration active, transferred 552.2 MiB of 2.0 GiB VM-state, 59.9 MiB/s
2024-06-26 12:10:08 average migration speed: 103.3 MiB/s - downtime 10451 ms
2024-06-26 12:10:08 migration status: completed
2024-06-26 12:10:13 ERROR: tunnel replied 'ERR: resume failed - VM 106 qmp command 'query-status' failed - got timeout' to command 'resume 106'
2024-06-26 12:10:21 ERROR: migration finished with problems (duration 00:00:46)
TASK ERROR: migration problems

Journal Output:

Jun 26 12:09:35 NODE4 pmxcfs[2048]: [status] notice: received log
Jun 26 12:09:35 NODE4 sshd[1118107]: Accepted publickey for root from x.x.x.x port 38120 ssh2: RSA SHA256:aeJdRRu5IP1SizlPkvvFMe/xFzUknqu33DRyG9UhKuA
Jun 26 12:09:35 NODE4 sshd[1118107]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jun 26 12:09:35 NODE4 systemd-logind[1708]: New session 1149 of user root.
Jun 26 12:09:35 NODE4 systemd[1]: Started session-1149.scope - Session 1149 of User root.
Jun 26 12:09:35 NODE4 sshd[1118107]: pam_env(sshd:session): deprecated reading of user environment enabled
Jun 26 12:09:35 NODE4 sshd[1118107]: Received disconnect from x.x.x.x port 38120:11: disconnected by user
Jun 26 12:09:35 NODE4 sshd[1118107]: Disconnected from user root x.x.x.x port 38120
Jun 26 12:09:35 NODE4 sshd[1118107]: pam_unix(sshd:session): session closed for user root
Jun 26 12:09:35 NODE4 systemd[1]: session-1149.scope: Deactivated successfully.
Jun 26 12:09:35 NODE4 systemd-logind[1708]: Session 1149 logged out. Waiting for processes to exit.
Jun 26 12:09:35 NODE4 systemd-logind[1708]: Removed session 1149.
Jun 26 12:09:35 NODE4 sshd[1118114]: Accepted publickey for root from x.x.x.x port 38124 ssh2: RSA SHA256:aeJdRRu5IP1SizlPkvvFMe/xFzUknqu33DRyG9UhKuA
Jun 26 12:09:35 NODE4 sshd[1118114]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jun 26 12:09:35 NODE4 systemd-logind[1708]: New session 1150 of user root.
Jun 26 12:09:35 NODE4 systemd[1]: Started session-1150.scope - Session 1150 of User root.
Jun 26 12:09:35 NODE4 sshd[1118114]: pam_env(sshd:session): deprecated reading of user environment enabled
Jun 26 12:09:36 NODE4 qm[1118120]: <root@pam> starting task UPID:NODE4:00110FA9:150DA3AC:667C4B50:qmstart:106:root@pam:
Jun 26 12:09:36 NODE4 qm[1118121]: start VM 106: UPID:NODE4:00110FA9:150DA3AC:667C4B50:qmstart:106:root@pam:
Jun 26 12:09:46 NODE4 kernel:  rbd20: p1 p2 < p5 >
Jun 26 12:09:46 NODE4 kernel: rbd: rbd20: capacity 137438953472 features 0x3d
Jun 26 12:09:46 NODE4 systemd[1]: Started 106.scope.
Jun 26 12:09:47 NODE4 kernel: tap106i0: entered promiscuous mode
Jun 26 12:09:47 NODE4 kernel: vmbr0v2020: port 3(fwpr106p0) entered blocking state
Jun 26 12:09:47 NODE4 kernel: vmbr0v2020: port 3(fwpr106p0) entered disabled state
Jun 26 12:09:47 NODE4 kernel: fwpr106p0: entered allmulticast mode
Jun 26 12:09:47 NODE4 kernel: fwpr106p0: entered promiscuous mode
Jun 26 12:09:47 NODE4 kernel: vmbr0v2020: port 3(fwpr106p0) entered blocking state
Jun 26 12:09:47 NODE4 kernel: vmbr0v2020: port 3(fwpr106p0) entered forwarding state
Jun 26 12:09:47 NODE4 kernel: fwbr106i0: port 1(fwln106i0) entered blocking state
Jun 26 12:09:47 NODE4 kernel: fwbr106i0: port 1(fwln106i0) entered disabled state
Jun 26 12:09:47 NODE4 kernel: fwln106i0: entered allmulticast mode
Jun 26 12:09:47 NODE4 kernel: fwln106i0: entered promiscuous mode
Jun 26 12:09:47 NODE4 kernel: fwbr106i0: port 1(fwln106i0) entered blocking state
Jun 26 12:09:47 NODE4 kernel: fwbr106i0: port 1(fwln106i0) entered forwarding state
Jun 26 12:09:47 NODE4 kernel: fwbr106i0: port 2(tap106i0) entered blocking state
Jun 26 12:09:47 NODE4 kernel: fwbr106i0: port 2(tap106i0) entered disabled state
Jun 26 12:09:47 NODE4 kernel: tap106i0: entered allmulticast mode
Jun 26 12:09:47 NODE4 kernel: fwbr106i0: port 2(tap106i0) entered blocking state
Jun 26 12:09:47 NODE4 kernel: fwbr106i0: port 2(tap106i0) entered forwarding state
Jun 26 12:09:47 NODE4 kernel: tap106i1: entered promiscuous mode
Jun 26 12:09:48 NODE4 kernel: vmbr0: port 3(tap106i1) entered blocking state
Jun 26 12:09:48 NODE4 kernel: vmbr0: port 3(tap106i1) entered disabled state
Jun 26 12:09:48 NODE4 kernel: tap106i1: entered allmulticast mode
Jun 26 12:09:48 NODE4 kernel: vmbr0: port 3(tap106i1) entered blocking state
Jun 26 12:09:48 NODE4 kernel: vmbr0: port 3(tap106i1) entered forwarding state
Jun 26 12:09:48 NODE4 qm[1118120]: <root@pam> end task UPID:NODE4:00110FA9:150DA3AC:667C4B50:qmstart:106:root@pam: OK
Jun 26 12:09:48 NODE4 sshd[1118114]: Received disconnect from x.x.x.x port 38124:11: disconnected by user
Jun 26 12:09:48 NODE4 sshd[1118114]: Disconnected from user root x.x.x.x port 38124
Jun 26 12:09:48 NODE4 sshd[1118114]: pam_unix(sshd:session): session closed for user root
Jun 26 12:09:48 NODE4 systemd-logind[1708]: Session 1150 logged out. Waiting for processes to exit.
Jun 26 12:09:48 NODE4 systemd[1]: session-1150.scope: Deactivated successfully.
Jun 26 12:09:48 NODE4 systemd-logind[1708]: Removed session 1150.
Jun 26 12:09:48 NODE4 sshd[1118569]: Accepted publickey for root from x.x.x.x port 38902 ssh2: RSA SHA256:aeJdRRu5IP1SizlPkvvFMe/xFzUknqu33DRyG9UhKuA
Jun 26 12:09:48 NODE4 sshd[1118569]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jun 26 12:09:48 NODE4 systemd-logind[1708]: New session 1151 of user root.
Jun 26 12:09:48 NODE4 systemd[1]: Started session-1151.scope - Session 1151 of User root.
Jun 26 12:09:48 NODE4 sshd[1118569]: pam_env(sshd:session): deprecated reading of user environment enabled
Jun 26 12:10:13 NODE4 qm[1118580]: VM 106 qmp command failed - VM 106 qmp command 'query-status' failed - got timeout
Jun 26 12:10:19 NODE4 sshd[1118569]: Received disconnect from x.x.x.x port 38902:11: disconnected by user
Jun 26 12:10:19 NODE4 sshd[1118569]: Disconnected from user root x.x.x.x port 38902
Jun 26 12:10:19 NODE4 sshd[1118569]: pam_unix(sshd:session): session closed for user root
Jun 26 12:10:19 NODE4 systemd-logind[1708]: Session 1151 logged out. Waiting for processes to exit.
Jun 26 12:10:19 NODE4 systemd[1]: session-1151.scope: Deactivated successfully.
Jun 26 12:10:19 NODE4 systemd[1]: session-1151.scope: Consumed 1.584s CPU time.
Jun 26 12:10:19 NODE4 systemd-logind[1708]: Removed session 1151.
Jun 26 12:10:19 NODE4 sshd[1118822]: Accepted publickey for root from x.x.x.x port 54548 ssh2: RSA SHA256:aeJdRRu5IP1SizlPkvvFMe/xFzUknqu33DRyG9UhKuA
Jun 26 12:10:19 NODE4 sshd[1118822]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jun 26 12:10:19 NODE4 systemd-logind[1708]: New session 1152 of user root.
Jun 26 12:10:19 NODE4 systemd[1]: Started session-1152.scope - Session 1152 of User root.
Jun 26 12:10:19 NODE4 sshd[1118822]: pam_env(sshd:session): deprecated reading of user environment enabled
Jun 26 12:10:20 NODE4 sshd[1118822]: Received disconnect from x.x.x.x port 54548:11: disconnected by user
Jun 26 12:10:20 NODE4 sshd[1118822]: Disconnected from user root x.x.x.x port 54548
Jun 26 12:10:20 NODE4 sshd[1118822]: pam_unix(sshd:session): session closed for user root
Jun 26 12:10:20 NODE4 systemd[1]: session-1152.scope: Deactivated successfully.
Jun 26 12:10:20 NODE4 systemd-logind[1708]: Session 1152 logged out. Waiting for processes to exit.
Jun 26 12:10:20 NODE4 systemd-logind[1708]: Removed session 1152.
Jun 26 12:10:20 NODE4 sshd[1118829]: Accepted publickey for root from x.x.x.x port 54558 ssh2: RSA SHA256:aeJdRRu5IP1SizlPkvvFMe/xFzUknqu33DRyG9UhKuA
Jun 26 12:10:20 NODE4 sshd[1118829]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jun 26 12:10:20 NODE4 systemd-logind[1708]: New session 1153 of user root.
Jun 26 12:10:20 NODE4 systemd[1]: Started session-1153.scope - Session 1153 of User root.
Jun 26 12:10:20 NODE4 sshd[1118829]: pam_env(sshd:session): deprecated reading of user environment enabled
Jun 26 12:10:21 NODE4 sshd[1118829]: Received disconnect from x.x.x.x port 54558:11: disconnected by user
Jun 26 12:10:21 NODE4 sshd[1118829]: Disconnected from user root x.x.x.x port 54558
Jun 26 12:10:21 NODE4 sshd[1118829]: pam_unix(sshd:session): session closed for user root
Jun 26 12:10:21 NODE4 systemd[1]: session-1153.scope: Deactivated successfully.
Jun 26 12:10:21 NODE4 systemd-logind[1708]: Session 1153 logged out. Waiting for processes to exit.
Jun 26 12:10:21 NODE4 systemd-logind[1708]: Removed session 1153.
Jun 26 12:10:21 NODE4 pmxcfs[2048]: [status] notice: received log

Pressure output, logged to a file every 5 seconds during the migration:

2024-06-26 12:09:34
2024-06-26 12:09:34 /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=6124315023
2024-06-26 12:09:34 /proc/pressure/cpu:full avg10=0.00 avg60=0.00 avg300=0.00 total=0
2024-06-26 12:09:34 /proc/pressure/io:some avg10=0.00 avg60=0.00 avg300=0.02 total=5127418486
2024-06-26 12:09:34 /proc/pressure/io:full avg10=0.00 avg60=0.00 avg300=0.02 total=5099851926
2024-06-26 12:09:34 /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=3992053
2024-06-26 12:09:34 /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=3986831
2024-06-26 12:09:39
2024-06-26 12:09:39 /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=6124328588
2024-06-26 12:09:39 /proc/pressure/cpu:full avg10=0.00 avg60=0.00 avg300=0.00 total=0
2024-06-26 12:09:39 /proc/pressure/io:some avg10=0.00 avg60=0.00 avg300=0.02 total=5127418544
2024-06-26 12:09:39 /proc/pressure/io:full avg10=0.00 avg60=0.00 avg300=0.02 total=5099851979
2024-06-26 12:09:39 /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=3992053
2024-06-26 12:09:39 /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=3986831
2024-06-26 12:09:44
2024-06-26 12:09:44 /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=6124341110
2024-06-26 12:09:44 /proc/pressure/cpu:full avg10=0.00 avg60=0.00 avg300=0.00 total=0
2024-06-26 12:09:44 /proc/pressure/io:some avg10=0.00 avg60=0.00 avg300=0.02 total=5127419980
2024-06-26 12:09:44 /proc/pressure/io:full avg10=0.00 avg60=0.00 avg300=0.02 total=5099853326
2024-06-26 12:09:44 /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=3992053
2024-06-26 12:09:44 /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=3986831
2024-06-26 12:09:49
2024-06-26 12:09:49 /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=6124355139
2024-06-26 12:09:49 /proc/pressure/cpu:full avg10=0.00 avg60=0.00 avg300=0.00 total=0
2024-06-26 12:09:49 /proc/pressure/io:some avg10=0.00 avg60=0.00 avg300=0.02 total=5127421062
2024-06-26 12:09:49 /proc/pressure/io:full avg10=0.00 avg60=0.00 avg300=0.01 total=5099854303
2024-06-26 12:09:49 /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=3992053
2024-06-26 12:09:49 /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=3986831
2024-06-26 12:09:54
2024-06-26 12:09:54 /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=6124369172
2024-06-26 12:09:54 /proc/pressure/cpu:full avg10=0.00 avg60=0.00 avg300=0.00 total=0
2024-06-26 12:09:54 /proc/pressure/io:some avg10=0.00 avg60=0.00 avg300=0.02 total=5127421299
2024-06-26 12:09:54 /proc/pressure/io:full avg10=0.00 avg60=0.00 avg300=0.01 total=5099854522
2024-06-26 12:09:54 /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=3992053
2024-06-26 12:09:54 /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=3986831
2024-06-26 12:09:59
2024-06-26 12:09:59 /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=6124380503
2024-06-26 12:09:59 /proc/pressure/cpu:full avg10=0.00 avg60=0.00 avg300=0.00 total=0
2024-06-26 12:09:59 /proc/pressure/io:some avg10=0.00 avg60=0.00 avg300=0.02 total=5127421322
2024-06-26 12:09:59 /proc/pressure/io:full avg10=0.00 avg60=0.00 avg300=0.01 total=5099854545
2024-06-26 12:09:59 /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=3992053
2024-06-26 12:09:59 /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=3986831
2024-06-26 12:10:04
2024-06-26 12:10:04 /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=6124391901
2024-06-26 12:10:04 /proc/pressure/cpu:full avg10=0.00 avg60=0.00 avg300=0.00 total=0
2024-06-26 12:10:04 /proc/pressure/io:some avg10=0.00 avg60=0.00 avg300=0.02 total=5127421376
2024-06-26 12:10:04 /proc/pressure/io:full avg10=0.00 avg60=0.00 avg300=0.01 total=5099854591
2024-06-26 12:10:04 /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=3992053
2024-06-26 12:10:04 /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=3986831
2024-06-26 12:10:09
2024-06-26 12:10:09 /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=6124402769
2024-06-26 12:10:09 /proc/pressure/cpu:full avg10=0.00 avg60=0.00 avg300=0.00 total=0
2024-06-26 12:10:09 /proc/pressure/io:some avg10=0.00 avg60=0.00 avg300=0.01 total=5127443838
2024-06-26 12:10:09 /proc/pressure/io:full avg10=0.00 avg60=0.00 avg300=0.01 total=5099876677
2024-06-26 12:10:09 /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=3992053
2024-06-26 12:10:09 /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=3986831
2024-06-26 12:10:14
2024-06-26 12:10:14 /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=6124413558
2024-06-26 12:10:14 /proc/pressure/cpu:full avg10=0.00 avg60=0.00 avg300=0.00 total=0
2024-06-26 12:10:14 /proc/pressure/io:some avg10=7.08 avg60=1.43 avg300=0.32 total=5128377143
2024-06-26 12:10:14 /proc/pressure/io:full avg10=6.93 avg60=1.40 avg300=0.31 total=5100790674
2024-06-26 12:10:14 /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=3992053
2024-06-26 12:10:14 /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=3986831
2024-06-26 12:10:19
2024-06-26 12:10:19 /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=6124422349
2024-06-26 12:10:19 /proc/pressure/cpu:full avg10=0.00 avg60=0.00 avg300=0.00 total=0
2024-06-26 12:10:19 /proc/pressure/io:some avg10=10.68 avg60=2.50 avg300=0.56 total=5129338315
2024-06-26 12:10:19 /proc/pressure/io:full avg10=10.43 avg60=2.44 avg300=0.54 total=5101734966
2024-06-26 12:10:19 /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=3992053
2024-06-26 12:10:19 /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=3986831
 
The same problem also occurs after setting the migration type to insecure. I can get those logs as well if you like, but they look the same to me.
 
Code:
2024-06-26 12:10:09 /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=6124402769
2024-06-26 12:10:09 /proc/pressure/cpu:full avg10=0.00 avg60=0.00 avg300=0.00 total=0
2024-06-26 12:10:09 /proc/pressure/io:some avg10=0.00 avg60=0.00 avg300=0.01 total=5127443838
2024-06-26 12:10:09 /proc/pressure/io:full avg10=0.00 avg60=0.00 avg300=0.01 total=5099876677
2024-06-26 12:10:09 /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=3992053
2024-06-26 12:10:09 /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=3986831
2024-06-26 12:10:14
2024-06-26 12:10:14 /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=6124413558
2024-06-26 12:10:14 /proc/pressure/cpu:full avg10=0.00 avg60=0.00 avg300=0.00 total=0
2024-06-26 12:10:14 /proc/pressure/io:some avg10=7.08 avg60=1.43 avg300=0.32 total=5128377143
2024-06-26 12:10:14 /proc/pressure/io:full avg10=6.93 avg60=1.40 avg300=0.31 total=5100790674
2024-06-26 12:10:14 /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=3992053
2024-06-26 12:10:14 /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=3986831
2024-06-26 12:10:19
2024-06-26 12:10:19 /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=6124422349
2024-06-26 12:10:19 /proc/pressure/cpu:full avg10=0.00 avg60=0.00 avg300=0.00 total=0
2024-06-26 12:10:19 /proc/pressure/io:some avg10=10.68 avg60=2.50 avg300=0.56 total=5129338315
2024-06-26 12:10:19 /proc/pressure/io:full avg10=10.43 avg60=2.44 avg300=0.54 total=5101734966
2024-06-26 12:10:19 /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=3992053
2024-06-26 12:10:19 /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=3986831
Is the pressure stall information from the source node or the target node? Target would be more interesting as that is where the timeout occurs.

The interesting thing is that the IO pressure becomes rather high at the time the migration ends. Is the VM running a workload with heavy IO? Is there some other task you configured that is triggered after migration? It would be good to know which task is responsible for that; you should be able to identify it with a tool like iotop.
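
For example, something like this on the target during the migration window:
Code:
# batch mode, only processes actually doing IO, with timestamps, in kB, every 5s
iotop -botqk -d 5 > iotop.log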

You might also want to test if the issue is still present with QEMU 9.0: https://forum.proxmox.com/threads/qemu-9-0-available-on-pve-no-subscription-as-of-now.149772/
 
The stall information was from the target node. Interestingly enough, this is just a test VM running Windows. Nothing is actually happening on the guest other than normal Windows processes on a fresh install, so nothing really special is going on. I'll watch and log iotop and run the tests again. I'll also test again after upgrading to QEMU 9.0. It may be Monday before all of this is done and I can reply back with results.
 
Updating the nodes did not help. (pve-qemu-kvm: 9.0.0-3)

Just to reiterate, this problem happens with all VMs regardless of which node they are on or which node I'm migrating them to. Yesterday I migrated all VMs off of a single node so I could ensure no other VMs were running on that server. I then tested migrating to that server using this VM, which is only used to test migrations. It's running a base install of Debian and there are no workloads running on it. I grabbed the output of iotop on the target node and logged it to a file while testing a migration. I'm attaching the results.

What I find interesting is the -S on the kvm command, which I understand starts the VM in a "paused" state. That may be normal, but I don't see anything that's trying to unpause the VM, at least not in the iotop output.
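
For what it's worth, the paused state is visible with something like:
Code:
# --verbose should include the QMP runstate (e.g. paused / prelaunch)
qm status 106 --verbose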
 


What I find interesting is the -S on the kvm command, which I understand starts the VM in a "paused" state. That may be normal, but I don't see anything that's trying to unpause the VM, at least not in the iotop output.
The migration worker is what resumes the VM at the end, but that very command fails for you for some reason:
Code:
2024-06-26 12:10:13 ERROR: tunnel replied 'ERR: resume failed - VM 106 qmp command 'query-status' failed - got timeout' to command 'resume 106'

Does it work if you migrate a VM without any disks at all, or with a storage other than rbd?
 
The migration worker is what resumes the VM at the end, but that very command fails for you for some reason:
Code:
2024-06-26 12:10:13 ERROR: tunnel replied 'ERR: resume failed - VM 106 qmp command 'query-status' failed - got timeout' to command 'resume 106'

Does it work if you migrate a VM without any disks at all, or with a storage other than rbd?

There wouldn't be anything to "resume" if there were no disks. However, I did move the testing VM to local disks and have performed a couple of migrations, and they migrate and resume just fine. How can I troubleshoot why rbd volumes and/or the snapshots associated with them would cause this problem?
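
For reference, I moved the disk with something along the lines of:
Code:
# move the disk to local storage, removing the rbd source afterwards
qm disk move 106 scsi0 local-zfs --delete 1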
 
There wouldn't be anything to "resume" if there were no disks.
Yes, there would. The VM on the target is always started in the prelaunch runstate, receives the migration data with the VM state and RAM, and needs to be resumed afterwards.
However, I did move the testing VM to local disks and have performed a couple of migrations, and they migrate and resume just fine. How can I troubleshoot why rbd volumes and/or the snapshots associated with them would cause this problem?
Please share your /etc/pve/storage.cfg. Judging from the lack of a ceph package in the pveversion output, I'd presume this is an external cluster and not a Proxmox VE-managed one? What does ceph -s show on a Ceph node? What Ceph version are you running?
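
For example:
Code:
cat /etc/pve/storage.cfg
ceph -s
ceph versions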
 
Code:
cluster:
    id:     <ommitted>
    health: HEALTH_OK
 
  services:
    mon:        5 daemons, quorum node12,node13,node15,node2,node8 (age 9d)
    mgr:        node2(active, since 8w), standbys: node12, node8, node13, node15
    mds:        1/1 daemons up, 1 hot standby
    osd:        432 osds: 432 up (since 8w), 432 in (since 3M)
    rbd-mirror: 1 daemon active (1 hosts)
 
  data:
    volumes: 1/1 healthy
    pools:   6 pools, 8784 pgs
    objects: 79.79M objects, 245 TiB
    usage:   739 TiB used, 770 TiB / 1.5 PiB avail
    pgs:     8784 active+clean
 
  io:
    client:   300 MiB/s rd, 376 MiB/s wr, 9.02k op/s rd, 10.48k op/s wr

Code:
cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content backup,vztmpl,iso

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1

rbd: rbd
        content rootdir,images
        krbd 1
        monhost x.x.x.71:6789;x.x.x.77:6789;x.x.x.81:6789;x.x.x.82:6789;x.x.x.84:6789
        pool rbd
        username <omitted>

cifs: imports
        path /mnt/pve/imports
        server x.x.x.245
        share proxmox-imports
        content iso
        prune-backups keep-all=1
        username <omitted>

cifs: backups
        path /mnt/pve/backups
        server x.x.x.245
        share proxmox-backups
        content backup,images
        prune-backups keep-all=1
        username <omitted>

Correct, this is an external Ceph cluster. Above is the information you requested.
 
Just guessing, but maybe it is related to the large number of OSDs (which did cause different issues in the past). It would be necessary to get a hold of what exactly the QEMU instance on the target is doing between finishing the migration and being able to react to the query-status command. From a quick glance, it's cleaning up the internal migration state and activating the block drives, but that should be very fast. In the case of librbd images it will additionally invalidate the cache. However, you are using krbd 1, so the images are accessed as raw block devices, not via librbd. EDIT: There is a callback to invalidate the cache for both raw block devices and librbd images (i.e. when using krbd 0).

Still, could you check if switching off krbd changes the behavior?
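
For example, for the rbd storage from your storage.cfg:
Code:
pvesm set rbd --krbd 0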

You could also try editing the vm_resume function in /usr/share/perl5/PVE/QemuServer.pm on the target node of the migration, namely increasing the timeout for the query-status command. This:
Code:
sub vm_resume {
    my ($vmid, $skiplock, $nocheck) = @_;

    PVE::QemuConfig->lock_config($vmid, sub {
    my $res = mon_cmd($vmid, 'query-status');
becomes:
Code:
sub vm_resume {
    my ($vmid, $skiplock, $nocheck) = @_;

    PVE::QemuConfig->lock_config($vmid, sub {
    my $res = mon_cmd($vmid, 'query-status', timeout => 30);
You'll need to run systemctl reload-or-restart pvedaemon.service pveproxy.service afterwards to have the API handlers pick up the change.
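
To see what the VM on the target reports while the resume hangs, you can also query QMP directly (a sketch using socat; the socket path follows the usual /run/qemu-server/<vmid>.qmp convention):
Code:
echo '{"execute":"qmp_capabilities"} {"execute":"query-status"}' \
  | socat - UNIX-CONNECT:/run/qemu-server/106.qmp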
 
If I switch off krbd, does that take effect immediately, or only after a migration? Is there a way to tell whether a connection is using librbd vs. krbd?

I'll also give the timeout adjustment a try and report back.
 
