We have a 4-node cluster of Proxmox 6 (package versions below):
Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.18-1-pve)
pve-manager: 6.0-5 (running version: 6.0-5/f8a710d7)
pve-kernel-5.0: 6.0-6
pve-kernel-helper: 6.0-6
pve-kernel-4.15: 5.4-7
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-4.15.18-19-pve: 4.15.18-45
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-14-pve: 4.15.18-39
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph-fuse: 12.2.11+dfsg1-2.1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve2
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-3
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-63
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-5
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1
On one of the nodes we have some VMs that cannot be managed:
- VNC fails with
Code:
VM ID qmp command 'change' failed - got timeout
TASK ERROR: Failed to run vncproxy.
- Migration fails with
Code:
Task started by HA resource agent
2019-08-16 14:56:20 ERROR: migration aborted (duration 00:00:03): VM 118 qmp command 'query-machines' failed - got timeout
TASK ERROR: migration aborted
- Backup fails with
Code:
INFO: Starting Backup of VM 118 (qemu)
INFO: Backup started at 2019-08-16 11:33:30
INFO: status = running
INFO: update VM 118: -lock backup
INFO: VM Name: VMNAME
INFO: include disk 'scsi0' 'zesan-lvm:vm-118-disk-0' 25G
/dev/sdc: open failed: No medium found
/dev/sdc: open failed: No medium found
/dev/sdc: open failed: No medium found
/dev/sdc: open failed: No medium found
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: The backup started
INFO: creating archive '/mnt/pve/mntpoint/dump/vzdump-qemu-118-2019_08_16-11_33_30.vma.lzo'
ERROR: got timeout
INFO: aborting backup job
ERROR: VM 118 qmp command 'backup-cancel' failed - got timeout
ERROR: Backup of VM 118 failed - got timeout
INFO: Failed at 2019-08-16 11:43:36
INFO: Backup ended
INFO: Backup job finished with errors
TASK ERROR: job errors
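We also see the repeated "/dev/sdc: open failed: No medium found" lines in the backup log. As far as we can tell those come from LVM scanning a block device that currently reports no medium (our guess: an unused card reader or a stale SAN path), so they may be unrelated noise; a quick check for it would be (only the device name is taken from the log, the rest is a sketch):
Code:
# what is /dev/sdc and does it report a medium?
lsblk -o NAME,TYPE,SIZE,MODEL,TRAN /dev/sdc
# any LVM command re-scans devices, so the same warning should show up here too
pvs 2>&1 | grep -i -e sdc -e 'no medium' || echo "no LVM complaints about sdc"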
It seems that if we shut down and start the VMs, they start working again.
But this already happened a few days back; we have restarted the VMs (actually the whole cluster) since then, and now the issue is back for random machines on the same node.
This is breaking automatic backups for us and possibly HA.
How can I diagnose this?
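The common thread in all of these seems to be the QMP monitor socket not answering. As a first check we plan to probe it by hand on one of the affected guests; this is only a sketch, assuming VM 118 is affected, socat is installed, and the usual /var/run/qemu-server/<vmid>.qmp socket path:
Code:
# does a plain status query from the node also hang? (cap it at 10 s)
timeout 10 qm status 118 --verbose

# lower-level probe of the monitor socket itself
(echo '{"execute":"qmp_capabilities"}'; echo '{"execute":"query-status"}'; sleep 3) \
  | socat - UNIX-CONNECT:/var/run/qemu-server/118.qmp
# a responsive VM prints the QMP greeting, {"return": {}} and a
# {"return": {"status": "running", ...}} line; silence here matches the timeouts above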
Edit:
The logs for the pve-ha-lrm service are full of entries like this:
Code:
Aug 16 11:43:38 srv pve-ha-lrm[25059]: VM 118 qmp command 'query-status' failed - got timeout
Aug 16 11:43:38 srv pve-ha-lrm[25059]: VM 118 qmp command failed - VM 118 qmp command 'query-status' failed - got timeout
Aug 16 11:43:25 srv pve-ha-lrm[25012]: VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:43:25 srv pve-ha-lrm[25012]: VM 118 qmp command failed - VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:43:15 srv pve-ha-lrm[24963]: VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:43:15 srv pve-ha-lrm[24963]: VM 118 qmp command failed - VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:43:05 srv pve-ha-lrm[24912]: VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:43:05 srv pve-ha-lrm[24912]: VM 118 qmp command failed - VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:42:55 srv pve-ha-lrm[24851]: VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:42:55 srv pve-ha-lrm[24851]: VM 118 qmp command failed - VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:42:45 srv pve-ha-lrm[24802]: VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:42:45 srv pve-ha-lrm[24802]: VM 118 qmp command failed - VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:42:35 srv pve-ha-lrm[24753]: VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
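Since the LRM cannot even connect to the QMP socket, the next thing we intend to check is whether the kvm process for the VM is still healthy or stuck in uninterruptible sleep. A minimal sketch, assuming the standard Proxmox pidfile under /var/run/qemu-server/:
Code:
# PID of the QEMU/KVM process for VM 118 (pidfile written by Proxmox at start)
pid=$(cat /var/run/qemu-server/118.pid)
# overall process state; a 'D' in STAT usually means blocked on storage I/O
ps -o pid,stat,wchan:32,cmd -p "$pid"
# per-thread view, in case only the main-loop/monitor thread is blocked
ps -L -o lwp,stat,wchan:32 -p "$pid"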