[SOLVED] Certain VMs from a cluster cannot be backed up and managed

Thanks, can you post the config of one of those VMs? qm config X

Please find config attached:

Also note Monday's backup log below, which only backs up some of the VMs:

VMID   NAME                    STATUS  TIME      SIZE     FILENAME
103    FortisNXFilter          OK      00:04:14  3.47GB   /Backups/dump/vzdump-qemu-103-2019_09_15-23_00_02.vma.lzo
104    AtlasNXFilter           OK      00:04:12  5.11GB   /Backups/dump/vzdump-qemu-104-2019_09_15-23_04_16.vma.lzo
105    replica.atlasict.co.za  OK      00:05:25  7.54GB   /Backups/dump/vzdump-qemu-105-2019_09_15-23_08_28.vma.lzo
106    BuildingAccessControl   FAILED  00:10:12  got timeout
107    Spiceworks              FAILED  00:10:08  got timeout
108    SolarWinds-NCentral     FAILED  00:10:05  got timeout
109    PFSense-AtlasICT        FAILED  00:10:05  got timeout
110    FortisTS01              FAILED  00:10:09  got timeout
111    FortisDC                FAILED  00:10:08  got timeout
112    LigoWaveController      FAILED  00:10:08  got timeout
113    AcutusAccounting        FAILED  00:10:14  got timeout
115    FortisMan3000           FAILED  00:10:12  got timeout
116    TrisnetWebServer        FAILED  00:10:15  got timeout
TOTAL                                  01:55:27  16.12GB
 


Hi, same here with one cluster node only, and Ceph:
proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-6 (running version: 6.0-6/c71f879f)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-4.15: 5.4-6
pve-kernel-5.0.21-1-pve: 5.0.21-1
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-5.0.15-1-pve: 5.0.15-1
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.2-pve1
ceph-fuse: 14.2.2-pve1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-5
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2
Sep 19 12:10:40 ac102 pve-ha-lrm[2302517]: VM 503 qmp command 'query-status' failed - unable to connect to VM 503 qmp socket - timeout after 31 retries#012
qm config 503
balloon: 512
bootdisk: scsi0
cores: 1
lock: backup
memory: 4096
name: Resolver58
net0: virtio=C2:88:FB:2A:44:8F,bridge=vmbr100
numa: 0
onboot: 1
ostype: l26
scsi0: cephvm:vm-503-disk-0,cache=writeback,size=10G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=033ffc7a-341b-4d96-b073-304ef970472d
sockets: 1
vmgenid: ebe633fd-8d32-4c6d-a340-dc26e8c9a977
Backup to NFS volume also fails:

INFO: starting new backup job: vzdump 503 --storage backup52 --compress lzo --mode snapshot --node ac102 --remove 0
INFO: Starting Backup of VM 503 (qemu)
INFO: Backup started at 2019-09-19 12:08:52
INFO: status = running
INFO: update VM 503: -lock backup
INFO: VM Name: Resolver58
INFO: include disk 'scsi0' 'cephvm:vm-503-disk-0' 10G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/backup52/dump/vzdump-qemu-503-2019_09_19-12_08_52.vma.lzo'
ERROR: got timeout
INFO: aborting backup job
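
Side note: the lock: backup entry in the qm config output above shows the VM was still locked by a backup job when the config was taken. Assuming no backup task is actually still running against it, the stale lock can be cleared with:
Code:
qm unlock 503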
 
Hi, same here: no backup, migration, or console; everything is struggling with "got timeout".

root@xen5:~# pveversion
pve-manager/6.0-7/28984024 (running kernel: 5.0.21-1-pve)

Storage on LVM and qcow2, backup to a local directory.

It seems that after a reboot of the node all is working for some time...

migration issue:

2019-09-19 13:52:23 ERROR: migration aborted (duration 00:00:03): VM 109 qmp command 'query-machines' failed - got timeout
TASK ERROR: migration aborted

console issue:

VM 109 qmp command 'change' failed - got timeout
TASK ERROR: Failed to run vncproxy.

Backup issue:

vzdump 301,400 --storage BACKUP-DS218
INFO: starting new backup job: vzdump 301 400 --storage BACKUP-DS218
INFO: Starting Backup of VM 301 (qemu)
INFO: Backup started at 2019-09-19 12:33:54
INFO: status = running
INFO: update VM 301: -lock backup
INFO: VM Name: w7ab
INFO: include disk 'scsi0' 'VMVG:vm-301-disk-0' 50G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/ds218/vzdump/dump/vzdump-qemu-301-2019_09_19-12_33_54.vma.lzo'
ERROR: got timeout
INFO: aborting backup job
ERROR: VM 301 qmp command 'backup-cancel' failed - got timeout
ERROR: Backup of VM 301 failed - got timeout
INFO: Failed at 2019-09-19 12:44:03
INFO: Starting Backup of VM 400 (qemu)
INFO: Backup started at 2019-09-19 12:44:03
INFO: status = running
INFO: update VM 400: -lock backup
INFO: VM Name: w10jb
INFO: include disk 'scsi1' 'VMSTORE:400/vm-400-disk-0.qcow2' 80G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/ds218/vzdump/dump/vzdump-qemu-400-2019_09_19-12_44_03.vma.lzo'
ERROR: got timeout
INFO: aborting backup job
ERROR: VM 400 qmp command 'backup-cancel' failed - got timeout
ERROR: Backup of VM 400 failed - got timeout
INFO: Failed at 2019-09-19 12:54:16
INFO: Backup job finished with errors
job errors
 
I have a cluster that is in this same scenario. It seems to have happened after upgrading the nodes yesterday. Today I tried to get a console, but it failed with this message:
Code:
VM 5355 qmp command 'change' failed - got timeout
TASK ERROR: Failed to run vncproxy.

Then I tried to migrate that VM, and I got this:
Code:
2019-09-20 12:11:42 ERROR: migration aborted (duration 00:00:03): VM 5373 qmp command 'query-machines' failed - got timeout
TASK ERROR: migration aborted
I get information on the VMs in the GUI, but I can't perform any actions on those VMs. I have seen this in the past, and the only way to fix it seems to be to restart the VM.

My storage is CEPH.
 
Can confirm I am seeing this issue on ZFS storage, so it doesn't seem to be a problem with the storage backend.
I am getting qmp timeouts on some actions like backups and console. Storage replication seems to be working OK.
It stopped working after around a week of VM uptime. After shutting down and powering up the VM the problem is gone, at least for a while, but one has to restart all of the VMs.
I don't know if this is relevant, but anyway: I am running around 20 Windows VMs and a few Linux ones on the host where the problem started, so it doesn't seem to be OS dependent. Also, it is one of three hosts in the cluster and so far it has only happened on this one.

Some of the relevant log entries
Code:
Sep 23 00:01:09 h3 pve-ha-lrm[1185494]: VM 10302 qmp command failed - VM 10302 qmp command 'query-status' failed - got timeout
Sep 23 00:01:09 h3 pve-ha-lrm[1185494]: VM 10302 qmp command 'query-status' failed - got timeout#012
Sep 23 00:01:19 h3 pve-ha-lrm[1188717]: VM 10302 qmp command failed - VM 10302 qmp command 'query-status' failed - got timeout
Sep 23 00:01:19 h3 pve-ha-lrm[1188717]: VM 10302 qmp command 'query-status' failed - got timeout#012

Sep 23 10:18:26 h3 pvesr[2013891]: VM 9803 qmp command failed - VM 9803 qmp command 'guest-ping' failed - got timeout
Sep 23 10:18:26 h3 pvesr[2013891]: Qemu Guest Agent is not running - VM 9803 qmp command 'guest-ping' failed - got timeout

Sep 23 09:14:25 h3 pvedaemon[767596]: VM 10904 qmp command failed - VM 10904 qmp command 'guest-ping' failed - got timeout
Sep 23 09:14:26 h3 pveproxy[737451]: 2019-09-23 09:14:26.259301 +0200 error AnyEvent::Util: Runtime error in AnyEvent::guard callback: Can't call method "_put_session" on an undefined value at /usr/lib/x86_64-linux-gnu/perl5/5.28/AnyEvent/Handle.pm line 2259 during global destruction.


Sep 23 09:14:20 h3 qm[744483]: VM 10904 qmp command failed - VM 10904 qmp command 'change' failed - got timeout
Sep 23 09:14:20 h3 qm[741842]: VM 10904 qmp command failed - VM 10904 qmp command 'change' failed - got timeout

Sorry for cutting out single log lines, but there is really nothing useful before or after them.

Code:
root@h3:~# pveversion -V
proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-6 (running version: 6.0-6/c71f879f)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.2-pve1
ceph-fuse: 14.2.2-pve1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
pve-zsync: 2.0-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2
And here is the config of one of the machines:
Code:
agent: 1
bootdisk: scsi0
cores: 1
cpu: host
ide2: none,media=cdrom
memory: 8192
name: RCP
net0: virtio=5E:02:9B:11:0A:FD,bridge=vmbr0,tag=98
numa: 1
onboot: 1
ostype: win8
scsi0: ZFS-H3-SSD:vm-9801-disk-0,cache=writeback,discard=on,size=50G
scsihw: virtio-scsi-pci
smbios1: uuid=30dd3844-d706-4c5a-aa1d-8554c2f71143
sockets: 2
vga: qxl
vmgenid: 11a00656-5a67-4e02-a421-65067de1f9b8
 
Same issue is seen here.

Actually, we know of 5 machines that cannot be migrated. The affected machines can be reached via SSH, ping, etc., and the services on these machines are still up. Last Friday this wasn't the case: even our domain controller became unreachable, which caused massive problems.

We run 6 nodes on pve-manager/6.0-7/28984024 (running kernel: 5.0.15-1-pve) with Ceph as the main storage backend and NFS to a NAS server for backups of the machines.
dir: local #unused
path /var/lib/vz
content backup,images,vztmpl,iso
maxfiles 1
shared 0

lvmthin: local-lvm #unused
thinpool data
vgname pve
content images,rootdir

nfs: Backup-Daily
export /raid0/data/_NAS_NFS_Exports_/Proxmox-Daily
path /mnt/pve/Backup-Daily
server nasserver3
content backup
maxfiles 3
options vers=3

nfs: Backup-Weekly
export /raid0/data/_NAS_NFS_Exports_/Proxmox-weekly
path /mnt/pve/Backup-Weekly
server nasserver3
content backup
maxfiles 2
options vers=3

rbd: HDD_Storage-VM
content images
krbd 0
pool HDD_Storage

nfs: CD-Images
export /raid0/data/_NAS_NFS_Exports_/CD-Images
path /mnt/pve/CD-Images
server nasserver3
content iso
maxfiles 1
options vers=3

nfs: Test
export /raid0/data/_NAS_NFS_Exports_/VM-HDDs
path /mnt/pve/Test
server nasserver3
content images
options vers=3
root@vm-3:/etc/pve# qm showcmd 300 --pretty
/usr/bin/kvm \
-id 300 \
-name Kanal-NB \
-chardev 'socket,id=qmp,path=/var/run/qemu-server/300.qmp,server,nowait' \
-mon 'chardev=qmp,mode=control' \
-chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
-mon 'chardev=qmp-event,mode=control' \
-pidfile /var/run/qemu-server/300.pid \
-daemonize \
-smp '1,sockets=1,cores=1,maxcpus=1' \
-nodefaults \
-boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
-vnc unix:/var/run/qemu-server/300.vnc,password \
-no-hpet \
-cpu 'kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,hv_synic,hv_stimer,hv_ipi,enforce' \
-m 4096 \
-device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
-device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
-device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' \
-device 'usb-tablet,id=tablet,bus=uhci.0,port=1' \
-device 'VGA,id=vga,bus=pci.0,addr=0x2' \
-device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' \
-iscsi 'initiator-name=iqn.1993-08.org.debian:01:81e7c9dd7f3d' \
-drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' \
-device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2' \
-device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' \
-drive 'file=rbd:HDD_Storage/vm-300-disk-1:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/HDD_Storage-VM.keyring,if=none,id=drive-scsi0,discard=on,format=raw,cache=none,aio=native,detect-zeroes=unmap' \
-device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' \
-netdev 'type=tap,id=net0,ifname=tap300i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' \
-device 'rtl8139,mac=DA:FB:8B:47:4D:CC,netdev=net0,bus=pci.0,addr=0x12,id=net0' \
-rtc 'driftfix=slew,base=localtime' \
-machine 'type=pc' \
-global 'kvm-pit.lost_tick_policy=discard'
boot: c
bootdisk: scsi0
cores: 1
description:
ide2: none,media=cdrom
memory: 4096
name: Kanal-NB
net0: rtl8139=DA:FB:8B:47:4D:CC,bridge=vmbr0,tag=106
onboot: 1
ostype: win7
parent: Sicherung
scsi0: HDD_Storage-VM:vm-300-disk-1,discard=on,size=320G
scsihw: virtio-scsi-pci
sockets: 1
startup: order=5
 
Thank you all for the new information. We are still trying to figure out what is causing this behavior. We'll keep you updated :)

Last Friday this wasn't the case: even our domain controller became unreachable, which caused massive problems.
Did the VM react to a shutdown command or did you have to stop it hard?
 
Thank you all for the new information. We are still trying to figure out what is causing this behavior. We'll keep you updated :)


Did the VM react to a shutdown command or did you have to stop it hard?
VMs I had issues with did react to normal shutdown commands from the GUI or CLI.

Edit:
Alternatively you can log into the VM and issue the shutdown command.
 
I had to stop it hard. It didn't react to any command except "Stop".
I moved the VM to another node and started it; it came up normally.

The other 5 VMs that are stuck at the moment are still reachable and providing services, but give a timeout when the backup runs or when trying to use the VNC console etc.

BTW: This phenomenon is not node dependent. The "Friday" incident happened on node VM-6 while the currently stuck VMs are on node VM-3. Let me just check whether we have some more VMs on other nodes that are currently stuck.
 
BTW: This phenomenon is not node dependent. The "Friday" incident happened on node VM-6 while the currently stuck VMs are on node VM-3. Let me just check whether we have some more VMs on other nodes that are currently stuck.
I'm not sure about that. Maybe that's the case, but for me VMs on other nodes didn't experience this bug, and they had been running longer. The other nodes are different hardware for me, though, and the node that experienced the problem is the most heavily loaded one for me.
 
So, some additional information:

Every guest that is currently stuck has ballooning enabled. But not every guest with ballooning is stuck, so this does not have to mean anything.
No additional stuck guests were found, and it hit different guests than on Friday.

PS:
This problem scares me. :oops: Our main SQL server is one of the stuck guests and I really don't want any downtime on this one during work hours.
I will gladly help if there is something to dig deeper into.
Oh, and: it hits Windows guests as well as Linux guests.
 
Thank you all for the new information. We are still trying to figure out what is causing this behavior. We'll keep you updated :)


Did the VM react to a shutdown command or did you have to stop it hard?

I have to stop it hard. All stuck VMs do still work, but they cannot be controlled from the UI or the shell.
 
So, some additional information:

Every guest that is currently stuck has ballooning enabled. But not every guest with ballooning is stuck, so this does not have to mean anything.
No additional stuck guests were found, and it hit different guests than on Friday.

PS:
This problem scares me. :oops: Our main SQL server is one of the stuck guests and I really don't want any downtime on this one during work hours.
I will gladly help if there is something to dig deeper into.
Oh, and: it hits Windows guests as well as Linux guests.
On the VMs that created issues for us we did not have ballooning enabled at all (we don't use ballooning anywhere). So this is not a factor.
 
We have the same issue on one node in our cluster. The backup fails on all VMs on that node. I was going to reboot the node, but the running VMs will not migrate, throwing this error:
task started by HA resource agent
2019-09-24 09:09:47 ERROR: migration aborted (duration 00:00:03): VM 303 qmp command 'query-machines' failed - got timeout
TASK ERROR: migration aborted

Our VMs are currently using qcow2 on NFS storage.

agent: 1
bootdisk: scsi0
cores: 4
ide2: none,media=cdrom
memory: 8192
name: REDACTED
net0: virtio=EA:02:EB:DB:CE:00,bridge=vmbr0,tag=20
numa: 0
onboot: 1
ostype: l26
protection: 1
scsi0: NFS:303/vm-303-disk-0.qcow2,discard=on,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=c7e3bfb6-e6cc-4054-be8d-9d0262384d30
sockets: 1
vmgenid: a087f717-755b-466b-befa-f65a26db19b3

/usr/bin/kvm \
-id 303 \
-name REDACTED \
-chardev 'socket,id=qmp,path=/var/run/qemu-server/303.qmp,server,nowait' \
-mon 'chardev=qmp,mode=control' \
-chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
-mon 'chardev=qmp-event,mode=control' \
-pidfile /var/run/qemu-server/303.pid \
-daemonize \
-smbios 'type=1,uuid=c7e3bfb6-e6cc-4054-be8d-9d0262384d30' \
-smp '4,sockets=1,cores=4,maxcpus=4' \
-nodefaults \
-boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
-vnc unix:/var/run/qemu-server/303.vnc,password \
-cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce \
-m 8192 \
-device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
-device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
-device 'vmgenid,guid=a087f717-755b-466b-befa-f65a26db19b3' \
-device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' \
-device 'usb-tablet,id=tablet,bus=uhci.0,port=1' \
-device 'VGA,id=vga,bus=pci.0,addr=0x2' \
-chardev 'socket,path=/var/run/qemu-server/303.qga,server,nowait,id=qga0' \
-device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' \
-device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' \
-device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' \
-iscsi 'initiator-name=iqn.1993-08.org.debian:01:34e47a6f4ec9' \
-drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' \
-device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' \
-device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' \
-drive 'file=/mnt/pve/NFS/images/303/vm-303-disk-0.qcow2,if=none,id=drive-scsi0,discard=on,format=qcow2,cache=none,aio=native,detect-zeroes=unmap' \
-device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' \
-netdev 'type=tap,id=net0,ifname=tap303i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' \
-device 'virtio-net-pci,mac=EA:02:EB:DB:CE:00,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' \
-machine 'type=pc'

All nodes of the cluster are running:
pve-manager/6.0-7/28984024 (running kernel: 5.0.21-1-pve)
 
I have to stop it hard. All stuck VMs do still work, but they cannot be controlled from the UI or the shell.
  • Do you have the guest agent installed in the VMs and enabled in the options?
  • Do the VMs react to a shutdown command right after a fresh start?
As far as I understand the situation, certain commands like backup or VNC get a timeout, but the shutdown command is still passed through to the VM.
After a clean start a VM can run for days or weeks until this issue hits it again, which makes it hard to reproduce and debug.
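
If you want to check whether the QMP monitor of an affected VM is really hung, one way is to poke its monitor socket directly. A minimal sketch, assuming socat is installed; the socket path is the standard /var/run/qemu-server/<vmid>.qmp as seen in the qm showcmd output earlier in the thread, and the socket normally only serves one client at a time, so avoid doing this while a backup or migration task is hitting the VM:
Code:
VMID=503   # example ID from this thread; replace with an affected VM
printf '%s\n' '{"execute":"qmp_capabilities"}' '{"execute":"query-status"}' \
  | timeout 10 socat -t 5 - UNIX-CONNECT:/var/run/qemu-server/${VMID}.qmp
On a healthy VM this prints the QMP greeting and a status reply within a second or two; on a stuck VM it will typically hang until the timeout, just like the PVE daemons do.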

PS:
This Problem scares me. :oops: Our main SQL Server is one of the stuck Guests and I really don't want any downtime on this one during workhours.
I will gladly help, if there is something to dig deeper into.
Oh and: it hits Windows Guests as well als Linux Guests.
If you shut it down outside of working hours and then start it fresh it should be fine for some time (as a workaround). Why your DC failed completely I cannot say for sure :/
 
  • Do you have the guest agent installed in the VMs and enabled in the options?
  • Do the VMs react to a shutdown command right after a fresh start?
As far as I understand the situation, certain commands like backup or VNC get a timeout, but the shutdown command is still passed through to the VM.
After a clean start a VM can run for days or weeks until this issue hits it again, which makes it hard to reproduce and debug.
We have one stuck VM with the guest agent enabled; the others have it disabled. This VM reacted to a shutdown command from the web GUI while stuck, and it reacts to a shutdown command right after a fresh restart.
VMs without the guest agent enabled fail on the shutdown command with: TASK ERROR: VM quit/powerdown failed
After a fresh reboot, the shutdown command succeeds.
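
In case it helps others checking the agent question: one way to see which VMs on a node have the agent option enabled is a quick grep over the local node's VM configs, for example:
Code:
# VMs on this node with the guest agent option enabled
grep -l '^agent: 1' /etc/pve/qemu-server/*.conf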
 
If possible, can you please help us by tracing your VMs?

Once a VM is in this state again, we would like to see the last 10 or so lines from the trace log.

How to:

Create a file with the tracing patterns we are interested in:
Let's store it for example in /root/trace_patterns
The content of it:
Code:
handle_qmp_command
monitor_qmp_cmd_in_band
monitor_qmp_cmd_out_of_band
qmp_job_cancel
qmp_job_pause
qmp_job_resume
qmp_job_complete
qmp_job_finalize
qmp_job_dismiss
qmp_block_job_cancel
qmp_block_job_pause
qmp_block_job_resume
qmp_block_job_complete
qmp_block_job_finalize
qmp_block_job_dismiss
qmp_block_stream
monitor_protocol_event_queue
monitor_suspend
monitor_protocol_event_handler
monitor_protocol_event_emit
monitor_protocol_event_queue

For each VM to be traced, check whether there are already custom arguments set, using qm config <vmid>. Look out for the args parameter.

If there is none, continue. If there is, make sure you include its content again in the next command.

Code:
qm set <vmid> --args '-trace events=/root/trace_patterns,file=/root/qemu_trace_<vmid>'

Don't forget to replace <vmid> with the respective VM IDs.
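
For illustration, a hedged sketch of what this looks like when a VM already has custom args set (the VM ID and the pre-existing args value below are examples only):
Code:
# Check for existing custom args first (example VM ID):
qm config 10302 | grep '^args'
# Suppose it returns (example value only):
#   args: -global kvm-pit.lost_tick_policy=discard
# Then re-add that value together with the tracing options in one go:
qm set 10302 --args '-global kvm-pit.lost_tick_policy=discard -trace events=/root/trace_patterns,file=/root/qemu_trace_10302'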

Once you do a clean start of the VM, the trace file should appear. It should not grow too quickly, roughly 50 MB to 60 MB per week.
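
When one of the traced VMs runs into the issue again, the last lines can then be collected with, for example:
Code:
tail -n 10 /root/qemu_trace_<vmid>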

Update: If you trace HA-enabled VMs, make sure the trace_patterns file is present on all nodes at the same location.
 
I will configure a multitude of VMs with tracing enabled, since this problem seems to randomly affect only some guests.

It will be a matter of luck to some degree, but I hope we will catch at least one guest.
 
One thing: if you have some of those VMs set to HA, make sure that the trace_patterns file is available on all nodes at the same location. The VM will fail to start if it cannot find the file.
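
For example, one way to push the file to the other nodes (the node names below are placeholders, adjust them to your cluster):
Code:
# Copy the trace pattern file to the same path on every other node
for node in pve2 pve3; do
    scp /root/trace_patterns root@${node}:/root/trace_patterns
done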
 
