What about pve-qemu-kvm=6.2.0-1?

Moving to pve-qemu-kvm=6.2.0-1 reproduces the "unable to connect to VM" error.
Node and VM were restarted before the backup attempt. Backup on an old node with virtio scsi single and iothread worked like a charm.

May 11 15:54:31 PROX-B1 pvestatd[3548]: VM 121 qmp command failed - VM 121 qmp command 'query-proxmox-support' failed - got timeout
May 11 15:54:31 PROX-B1 pvestatd[3548]: status update time (6.223 seconds)
May 11 15:54:41 PROX-B1 pvestatd[3548]: VM 121 qmp command failed - VM 121 qmp command 'query-proxmox-support' failed - got timeout
May 11 15:54:41 PROX-B1 pve-ha-lrm[140660]: VM 121 qmp command failed - VM 121 qmp command 'query-status' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:54:41 PROX-B1 pve-ha-lrm[140660]: VM 121 qmp command 'query-status' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:54:41 PROX-B1 pvestatd[3548]: status update time (6.222 seconds)
May 11 15:54:51 PROX-B1 pve-ha-lrm[140951]: VM 121 qmp command failed - VM 121 qmp command 'query-status' failed - got timeout
May 11 15:54:51 PROX-B1 pve-ha-lrm[140951]: VM 121 qmp command 'query-status' failed - got timeout
May 11 15:54:51 PROX-B1 pvestatd[3548]: VM 121 qmp command failed - VM 121 qmp command 'query-proxmox-support' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:54:52 PROX-B1 pvestatd[3548]: status update time (6.222 seconds)
May 11 15:55:01 PROX-B1 pvestatd[3548]: VM 121 qmp command failed - VM 121 qmp command 'query-proxmox-support' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:55:01 PROX-B1 pvestatd[3548]: status update time (6.231 seconds)
May 11 15:55:01 PROX-B1 pve-ha-lrm[141239]: VM 121 qmp command failed - VM 121 qmp command 'query-status' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:55:01 PROX-B1 pve-ha-lrm[141239]: VM 121 qmp command 'query-status' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:55:10 PROX-B1 pvestatd[3548]: status update time (5.030 seconds)
May 11 15:55:21 PROX-B1 pve-ha-lrm[141754]: VM 121 qmp command failed - VM 121 qmp command 'query-status' failed - got timeout
May 11 15:55:21 PROX-B1 pve-ha-lrm[141754]: VM 121 qmp command 'query-status' failed - got timeout
May 11 15:55:21 PROX-B1 pvestatd[3548]: VM 121 qmp command failed - VM 121 qmp command 'query-proxmox-support' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:55:21 PROX-B1 pvestatd[3548]: status update time (6.230 seconds)
May 11 15:55:25 PROX-B1 pvedaemon[3628]: worker exit
May 11 15:55:25 PROX-B1 pvedaemon[3626]: worker 3628 finished
May 11 15:55:25 PROX-B1 pvedaemon[3626]: starting 1 worker(s)
May 11 15:55:25 PROX-B1 pvedaemon[3626]: worker 142036 started
May 11 15:55:31 PROX-B1 pve-ha-lrm[142069]: VM 121 qmp command failed - VM 121 qmp command 'query-status' failed - got timeout
May 11 15:55:31 PROX-B1 pve-ha-lrm[142069]: VM 121 qmp command 'query-status' failed - got timeout
May 11 15:55:31 PROX-B1 pvestatd[3548]: VM 121 qmp command failed - VM 121 qmp command 'query-proxmox-support' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:55:31 PROX-B1 pvestatd[3548]: status update time (6.222 seconds)
May 11 15:55:41 PROX-B1 pve-ha-lrm[142357]: VM 121 qmp command failed - VM 121 qmp command 'query-status' failed - got timeout
May 11 15:55:41 PROX-B1 pve-ha-lrm[142357]: VM 121 qmp command 'query-status' failed - got timeout
May 11 15:55:41 PROX-B1 pvestatd[3548]: VM 121 qmp command failed - VM 121 qmp command 'query-proxmox-support' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:55:42 PROX-B1 pvestatd[3548]: status update time (6.229 seconds)
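
For anyone seeing the same "unable to connect to VM ... qmp socket" timeouts, a quick sanity check is whether the QMP socket still exists and whether the KVM process is still alive. A rough sketch, using standard Proxmox VE paths and VMID 121 from the log above:

Code:
ls -l /run/qemu-server/121.qmp       # QMP socket that pvestatd/pve-ha-lrm are timing out on
qm list                              # does PVE still consider VM 121 running?
ps aux | grep '[k]vm -id 121 '       # is the underlying KVM process still alive?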
@GrueneNeun Try apt install pve-qemu-kvm=6.1.1-2, restart the VM, and then back up again.

Works so far as expected, no errors in the log.
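
When testing a downgrade like this, note that the installed package and the QEMU build the guest is actually running can differ until the VM has been stopped and started again. A small sketch with standard PVE commands (VMID 121 from this thread; running-qemu is only reported while the guest is running and QMP is reachable):

Code:
pveversion -v | grep pve-qemu-kvm             # package version installed on the node
qm status 121 --verbose | grep running-qemu   # build the guest process was actually started with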
Yes, but I don't quite understand what to do now. Do we need to downgrade pve-qemu-kvm on all our nodes? If yes, will it affect the stability of the cluster? @Fabian_E, it would be great to hear some recommendations.

Since the workaround with iothread didn't work for either of you, the issue I managed to run into is obviously a different one. Unfortunately, I haven't been able to reproduce the issue with the timed-out QMP commands yet. There were a lot of upstream changes between QEMU 6.1.1-2 and 6.2.0-1, so it's not obvious what started triggering this; the backup code on our side mostly stayed the same between these versions.
Keeping pve-qemu-kvm downgraded might be the only way for now. It should not affect stability, but you won't get any fixes/updates for QEMU, of course...
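
If downgrading is the stopgap, it could look roughly like this on each affected node (package name and versions are the ones from this thread, assuming the 6.1.1-2 build is still available in the repository or apt cache; holding the package is optional and has to be undone once a fixed build arrives):

Code:
apt install pve-qemu-kvm=6.1.1-2     # go back to the known-good build
apt-mark hold pve-qemu-kvm           # optional: keep routine upgrades from pulling 6.2.0 back in
qm stop 121 && qm start 121          # the guest only picks up the other binary after a stop/start (or migration)
apt-mark unhold pve-qemu-kvm         # later, once a fixed package is available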
As for finding other workarounds and trying to track this down further: does spacing the backups out more help? Can you trigger the issue reliably, or does it depend on system/PBS load? Could you share your storage configuration and the configuration of some affected VMs? The one for the minimal VM with empty disks would be interesting too.
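
For reference, the pieces being asked for can be pulled with standard PVE commands; a sketch, using the VMID from this thread as an example:

Code:
qm config 121                # VM configuration, as posted below
cat /etc/pve/storage.cfg     # storage definitions (VG-Pool, cpool, ...)
pveversion -v                # package versions, including pve-qemu-kvm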
agent: 1
boot: order=virtio0;net0
cores: 8
cpu: EPYC
machine: pc-i440fx-6.0
memory: 8192
name: VGFILE2
net0: virtio=A2:64:23:42:7E:2A,bridge=vmbr0,firewall=1,tag=110
numa: 0
onboot: 1
ostype: win10
scsihw: virtio-scsi-single
smbios1: uuid=a193ab4a-b863-47f6-95f3-385428458c08
sockets: 1
virtio0: VG-Pool:vm-121-disk-0,discard=on,iothread=1,size=128G
virtio1: VG-Pool:vm-121-disk-1,discard=on,iothread=1,size=5T
virtio2: VG-Pool:vm-121-disk-2,discard=on,iothread=1,size=1T
vmgenid: c56fd5fc-a3f3-480d-9095-8ea11cdecbf8
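
Regarding the minimal VM with empty disks asked about above: a throwaway reproducer could be created roughly like this (VMID 999, the disk sizes, and the storage names are only examples; the PBS storage name in particular is a placeholder):

Code:
qm create 999 --name qmp-timeout-test --memory 2048 --cores 2 \
  --scsihw virtio-scsi-single \
  --scsi0 VG-Pool:32,iothread=1 \
  --scsi1 VG-Pool:32,iothread=1
qm start 999
vzdump 999 --storage <pbs-storage>   # then try backing up just this VM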
Can you trigger the issue reliably or does it depend on system/PBS load?

Yes, it happens reliably 100% of the time. The PBS load is ~0% (nothing happening), same with the node. The issue happens regardless of whether the VM is on or off.
agent: 1
balloon: 131072
bios: ovmf
boot: order=scsi0;net0
cores: 35
efidisk0: cpool:vm-110-disk-2,efitype=4m,size=528K
memory: 196608
meta: creation-qemu=6.1.0,ctime=1651062805
name: ****
net0: virtio=56:57:17:5E:60:F7,bridge=vmbr0,firewall=1
numa: 1
onboot: 1
ostype: l26
scsi0: cpool:vm-110-disk-0,discard=on,iothread=1,size=300G,ssd=1
scsi1: cpool:vm-110-disk-1,discard=on,iothread=1,size=6300G,ssd=1
scsi2: cpool:vm-110-disk-3,discard=on,iothread=1,size=10G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=ad9152a8-5683-4601-b725-12d134f00e65
sockets: 2
vmgenid: 74ed1dbe-56ca-4cdb-95c2-3188dda5be0a
The "creation-qemu" was apparently different because I had to downgrade everything.

What kind of storage is cpool, any special configuration for it?

When krbd is enabled, it doesn't trigger for me.

Many thanks for the suggestion!
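
For anyone wanting to test the krbd observation: it is a storage-level switch on the RBD storage, and running guests only change over after a stop/start or live migration, as noted further down. A sketch, assuming the cpool storage from the config above:

Code:
pvesm set cpool --krbd 1      # same as adding "krbd 1" to the cpool entry in /etc/pve/storage.cfg
qm stop 110 && qm start 110   # needed so the guest actually uses the krbd path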
Hello, KRBD doesn't help here, it was already enabled.
PVE + Ceph, 13-node cluster, all SSD; PBS all SSD; separate 10 Gbit networks for Ceph and for public + backup. Just now updated and rebooted the whole cluster. Backups always maxed out the 10 Gbit network (900 MB/s for all nodes together); since the update last Saturday, backups are as slow as 70-110 MB/s (all nodes together). Only a few smaller VMs are backed up successfully; everything above 128 GB of disk size is timing out (different QMP timeouts).

This might be a different issue then. Please open a new thread and link to it here or ping me with @Fabian_E, including the following information:
- pveversion -v
- /var/log/syslog during the backup
- whether pve-qemu-kvm=6.2.0-1 helps
- whether pve-qemu-kvm=6.1.1-2 helps
- whether Virtio SCSI single and enabling iothread on the disks helps

...with the largest disks (750 GB) would drop in performance by 95%, causing our help desk to blow up with complaints. I've checked KRBD and tested; this didn't help. The VMs needed to be shut down and restarted. This solved the issue.

Yes, this needs a VM stop/start (or a live migration) to use the krbd path.
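
For completeness, the "Virtio SCSI single and enabling iothread on the disks" item from the checklist corresponds roughly to settings like these (VM 110's existing disk spec is reused as the example; qm set replaces the whole disk definition, so all options have to be repeated, and the guest needs a stop/start afterwards to pick them up):

Code:
qm set 110 --scsihw virtio-scsi-single
qm set 110 --scsi0 cpool:vm-110-disk-0,discard=on,iothread=1,size=300G,ssd=1
qm stop 110 && qm start 110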