Timeouts while trying to migrate a VM

Hello everyone,

we encountered some strange behavior when trying to migrate a VM from one node to another.

Proxmox Version: Virtual Environment 7.4-3 (PVE Manager Version pve-manager/7.4-3/9002ab8a)
We have 2 nodes with subscription and one without.

We tried to move one VM (101) from the node without a subscription to a node with a subscription.

1686827646554.png

The migration stops at this point, and on the target cvh37 we see timeouts in the log files for various VMs running on this host (see Part-of-syslog.txt).
Example entries:
Code:
Jun 15 09:33:50 cvh37 pvestatd[3535]: VM 104 qmp command failed - VM 104 qmp command 'query-proxmox-support' failed - unable to connect to VM 104 qmp socket - timeout after 51 retries
Jun 15 09:33:55 cvh37 pvestatd[3535]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 51 retries
Jun 15 09:34:00 cvh37 pvestatd[3535]: VM 123 qmp command failed - VM 123 qmp command 'query-proxmox-support' failed - unable to connect to VM 123 qmp socket - timeout after 51 retries

Even some Windows systems rebooted!

VM 101 Configuration:
1686829674928.png

pveversion (cvh37):
Code:
proxmox-ve: 7.4-1 (running kernel: 5.15.102-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.3-3
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-3
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-1
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.6.3
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20221111-1
pve-firewall: 4.3-1
pve-firmware: 3.6-4
pve-ha-manager: 3.6.0
pve-i18n: 2.11-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1

Migrations between the other 2 nodes run without problems.

regards
Bastian
 

Attachments

  • Part-of-syslog.txt
Could you please share the versions of both the source and target nodes? It would also be helpful to know whether the IO delay was outside the comfort zone for the machine doing the migration. You can check this either in the node's Summary tab in the web interface, or in the `wa` (IO wait) field of the CPU line in `top`.
Even some Windows systems rebooted!
Could you please elaborate further? Were these on other nodes? Were they VMs on the source node, VMs on the target node, Windows inside the VM being migrated, etc.?
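
If you would rather log the IO wait during a migration than watch `top`, here is a minimal sketch that computes it from two samples of the aggregate `cpu` line in `/proc/stat`. The function names are made up for illustration, and the result is only a rough equivalent of what `top` shows as `wa`.

```python
def read_cpu_times(path="/proc/stat"):
    """Return the counters of the aggregate 'cpu' line as a list of ints."""
    with open(path) as f:
        for line in f:
            if line.startswith("cpu "):
                return [int(x) for x in line.split()[1:]]
    raise RuntimeError("no aggregate cpu line found")


def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples."""
    # Field order: user nice system idle iowait irq softirq steal ...
    # Only the first 8 fields are used, because guest time is already
    # counted inside user/nice.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


# Usage on the node (hypothetical 5 second window):
#   import time
#   first = read_cpu_times(); time.sleep(5)
#   print(f"iowait: {iowait_percent(first, read_cpu_times()):.1f}%")
```

Sampling this on the target while the migration runs would show whether the IO wait climbs at the same time as the qmp timeouts appear.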
 
The migration was started from host pve to host cvh37.

This is the relevant part of the log from pve at that time:
Code:
Jun 15 09:30:55 pve pmxcfs[6174]: [dcdb] notice: data verification successful
Jun 15 09:31:32 pve pvedaemon[3016466]: <root@pam> starting task UPID:pve:00146D5E:247FF0B9:648ABE54:qmigrate:101:root@pam:
Jun 15 09:31:34 pve pmxcfs[6174]: [status] notice: received log
Jun 15 09:31:36 pve pmxcfs[6174]: [status] notice: received log
Jun 15 09:32:47 pve pvedaemon[3016466]: worker exit
Jun 15 09:32:47 pve pvedaemon[6363]: worker 3016466 finished
Jun 15 09:32:47 pve pvedaemon[6363]: starting 1 worker(s)
Jun 15 09:32:47 pve pvedaemon[6363]: worker 1365073 started
Jun 15 09:34:55 pve pvedaemon[1338718]: VM 101 qmp command failed - VM 101 qmp command 'block-job-cancel' failed - Block job 'drive-scsi0' not found
Jun 15 09:34:55 pve pmxcfs[6174]: [status] notice: received log
Jun 15 09:34:55 pve pmxcfs[6174]: [status] notice: received log
Jun 15 09:34:56 pve pmxcfs[6174]: [status] notice: received log
Jun 15 09:35:02 pve pmxcfs[6174]: [status] notice: received log
Jun 15 09:35:07 pve kernel:  zd32: p1 p2 p3 p4
Jun 15 09:35:54 pve sshd[1315910]: Received disconnect from 10.10.10.13 port 60514:11: disconnected by user
Jun 15 09:35:54 pve sshd[1315910]: Disconnected from user root 10.10.10.13 port 60514
Jun 15 09:35:54 pve sshd[1315910]: pam_unix(sshd:session): session closed for user root
Jun 15 09:35:54 pve systemd-logind[5566]: Session 2315 logged out. Waiting for processes to exit.
Jun 15 09:35:54 pve systemd[1]: session-2315.scope: Succeeded.
Jun 15 09:35:54 pve systemd-logind[5566]: Removed session 2315.

This is pveversion (pve):
Code:
proxmox-ve: 7.4-1 (running kernel: 5.15.102-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.3-3
pve-kernel-5.4: 6.4-20
pve-kernel-5.15.102-1-pve: 5.15.102-1
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-4
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-1
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.1-1
proxmox-backup-file-restore: 2.4.1-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-widget-toolkit: 3.6.5
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-1
pve-firewall: 4.3-1
pve-firmware: 3.6-4
pve-ha-manager: 3.6.0
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1


The server 101, which we tried to migrate, is a Windows 2019 Exchange Server.
This is the only VM at the source node. At the target there are about 20 other different VMs running.

I will try to move another smaller VM from cvh37 to pve and back, to see if this is working without problems.
 
From the versions

Diff:
-qemu-server: 7.4-3
+qemu-server: 7.4-2

I can see that qemu-server is slightly older on the target. Even though the version difference is minor, it is possible that this has an effect on the migration.
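
For comparing the two package lists without reading them line by line, here is a minimal sketch that parses two `pveversion -v` style outputs and reports the packages whose versions differ. The function names are hypothetical, and the parsing assumes the `name: version` layout seen in the outputs posted above.

```python
def parse_versions(text):
    """Map package name -> version from 'pveversion -v' style lines."""
    versions = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and rest.strip():
            # Keep only the version token, dropping trailing notes such as
            # "(running kernel: ...)".
            versions[name.strip()] = rest.split()[0]
    return versions


def version_diff(source, target):
    """Return {package: (source_version, target_version)} where they differ."""
    a, b = parse_versions(source), parse_versions(target)
    return {
        pkg: (a.get(pkg), b.get(pkg))
        for pkg in sorted(a.keys() | b.keys())
        if a.get(pkg) != b.get(pkg)
    }
```

A package that is installed on only one node shows up with `None` for the other side, so leftover packages such as old kernels are listed as well.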

On the other hand, it is possible that the target node cannot keep up with the IO required for a migration on top of the other 20 VMs. One possibility is to set a bandwidth limit for the migration; in the web UI that is under Datacenter > Options > Bandwidth Limit > Migration. Knowing the IO delay would also be helpful.
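
As far as I know, the same limit can also be set as a `bwlimit` entry in `/etc/pve/datacenter.cfg` or per migration with `qm migrate ... --bwlimit`, and in both cases the value is given in KiB/s. Here is a minimal sketch of the unit conversion. The helper names are hypothetical, the 100 MiB/s in the usage comment is an arbitrary example rather than a recommendation, and you should check the `qm` and `datacenter.cfg` man pages for your version before relying on it.

```python
def mib_to_kib(mib_per_s):
    """Convert a limit in MiB/s to the KiB/s value the bwlimit options expect."""
    if mib_per_s <= 0:
        raise ValueError("limit must be positive")
    return int(mib_per_s * 1024)


def datacenter_cfg_line(mib_per_s):
    """Build a 'bwlimit' entry for /etc/pve/datacenter.cfg (assumed syntax)."""
    return f"bwlimit: migration={mib_to_kib(mib_per_s)}"


def qm_migrate_args(vmid, target, mib_per_s):
    """Argument list for a single online migration with local disks."""
    return [
        "qm", "migrate", str(vmid), target,
        "--online", "--with-local-disks",
        "--bwlimit", str(mib_to_kib(mib_per_s)),
    ]


# Usage (example values only):
#   print(datacenter_cfg_line(100))
#   print(" ".join(qm_migrate_args(101, "cvh37", 100)))
```

Starting with a conservative value and raising it while watching the IO delay on the target is probably the safest way to find out how much the storage can take.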

I see this is local storage, what kind of hardware do you have for storage?
 
Hello,

Today the same thing happened during a bulk migration of many VMs.
On the target server, one VM (112) got timeouts and then restarted with a blue screen, with check disk repairing files.

Code:
Jun 28 08:21:48 cvh46 pvestatd[8093]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - got timeout
Jun 28 08:21:51 cvh46 systemd[1]: 126.scope: Succeeded.
Jun 28 08:21:51 cvh46 systemd[1]: 128.scope: Succeeded.
Jun 28 08:21:53 cvh46 pvestatd[8093]: VM 109 qmp command failed - VM 109 qmp command 'query-proxmox-support' failed - got timeout
Jun 28 08:21:58 cvh46 pvestatd[8093]: VM 106 qmp command failed - VM 106 qmp command 'query-proxmox-support' failed - unable to connect to VM 106 qmp socket - timeout after 51 retries
Jun 28 08:22:02 cvh46 pvedaemon[3726926]: VM 112 qmp command failed - VM 112 qmp command 'query-proxmox-support' failed - got timeout
Jun 28 08:22:03 cvh46 pvestatd[8093]: VM 112 qmp command failed - VM 112 qmp command 'query-proxmox-support' failed - unable to connect to VM 112 qmp socket - timeout after 51 retries
Jun 28 08:22:04 cvh46 pvestatd[8093]: status update time (39.877 seconds)
Jun 28 08:22:07 cvh46 kernel:  zd832: p1 p2 p3 p4
Jun 28 08:22:07 cvh46 kernel:  zd240: p1 p2 p3 p4
Jun 28 08:22:07 cvh46 kernel:  zd80: p1 p2 p3 p4
Jun 28 08:22:07 cvh46 sshd[2964027]: Received disconnect from 10.10.10.13 port 40918:11: disconnected by user
Jun 28 08:22:07 cvh46 sshd[2964027]: Disconnected from user root 10.10.10.13 port 40918
Jun 28 08:22:07 cvh46 sshd[2964027]: pam_unix(sshd:session): session closed for user root

The hardware is a Supermicro server:
https://store.supermicro.com/nl_en/as-1124us-tnr.html
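
To get an overview of which VMs on the target were hit, the `qmp command failed` lines from the syslog excerpt above can be tallied per VM. This is a minimal sketch with a hypothetical helper name. It only matches the message format seen in this thread.

```python
import re
from collections import Counter

# Matches e.g. "VM 108 qmp command failed - ..."
QMP_FAIL = re.compile(r"VM (\d+) qmp command failed")


def qmp_failures_per_vm(lines):
    """Count 'qmp command failed' log lines per VM ID."""
    counts = Counter()
    for line in lines:
        match = QMP_FAIL.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Usage on a node (path is an assumption; adjust to where your logs live):
#   with open("/var/log/syslog") as f:
#       for vmid, n in qmp_failures_per_vm(f).most_common():
#           print(vmid, n)
```

Comparing these counts against the migration start and end times should show whether the failures are limited to the migration window.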
 
