[SOLVED] Certain VMs from a cluster cannot be backed up and managed

Thanks, can you post the config of one of those VMs? qm config X

Please find config attached:

Also note Monday's backup log below, which only backs up some of the VMs:

VMID   NAME                    STATUS  TIME      SIZE     FILENAME
103    FortisNXFilter          OK      00:04:14  3.47GB   /Backups/dump/vzdump-qemu-103-2019_09_15-23_00_02.vma.lzo
104    AtlasNXFilter           OK      00:04:12  5.11GB   /Backups/dump/vzdump-qemu-104-2019_09_15-23_04_16.vma.lzo
105    replica.atlasict.co.za  OK      00:05:25  7.54GB   /Backups/dump/vzdump-qemu-105-2019_09_15-23_08_28.vma.lzo
106    BuildingAccessControl   FAILED  00:10:12  got timeout
107    Spiceworks              FAILED  00:10:08  got timeout
108    SolarWinds-NCentral     FAILED  00:10:05  got timeout
109    PFSense-AtlasICT        FAILED  00:10:05  got timeout
110    FortisTS01              FAILED  00:10:09  got timeout
111    FortisDC                FAILED  00:10:08  got timeout
112    LigoWaveController      FAILED  00:10:08  got timeout
113    AcutusAccounting        FAILED  00:10:14  got timeout
115    FortisMan3000           FAILED  00:10:12  got timeout
116    TrisnetWebServer        FAILED  00:10:15  got timeout
TOTAL                                  01:55:27  16.12GB
 


Hi, same here with one cluster node only, and Ceph:
proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-6 (running version: 6.0-6/c71f879f)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-4.15: 5.4-6
pve-kernel-5.0.21-1-pve: 5.0.21-1
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-5.0.15-1-pve: 5.0.15-1
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.2-pve1
ceph-fuse: 14.2.2-pve1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-5
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2
Sep 19 12:10:40 ac102 pve-ha-lrm[2302517]: VM 503 qmp command 'query-status' failed - unable to connect to VM 503 qmp socket - timeout after 31 retries#012
qm config 503
balloon: 512
bootdisk: scsi0
cores: 1
lock: backup
memory: 4096
name: Resolver58
net0: virtio=C2:88:FB:2A:44:8F,bridge=vmbr100
numa: 0
onboot: 1
ostype: l26
scsi0: cephvm:vm-503-disk-0,cache=writeback,size=10G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=033ffc7a-341b-4d96-b073-304ef970472d
sockets: 1
vmgenid: ebe633fd-8d32-4c6d-a340-dc26e8c9a977
Backup to NFS volume also fails:

INFO: starting new backup job: vzdump 503 --storage backup52 --compress lzo --mode snapshot --node ac102 --remove 0
INFO: Starting Backup of VM 503 (qemu)
INFO: Backup started at 2019-09-19 12:08:52
INFO: status = running
INFO: update VM 503: -lock backup
INFO: VM Name: Resolver58
INFO: include disk 'scsi0' 'cephvm:vm-503-disk-0' 10G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/backup52/dump/vzdump-qemu-503-2019_09_19-12_08_52.vma.lzo'
ERROR: got timeout
INFO: aborting backup job
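
Side note: the lock: backup entry in the qm config output above shows the VM was still locked by a backup job when the config was taken. Assuming no backup task is actually still running against it, the stale lock can be cleared with:
Code:
qm unlock 503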
 
Hi, same here: no backup, migration, or console; everything is struggling with "got timeout".

root@xen5:~# pveversion
pve-manager/6.0-7/28984024 (running kernel: 5.0.21-1-pve)

Storage on LVM and qcow2, backup to a local directory.

It seems that after a reboot of the node all is working for some time...

migration issue:

2019-09-19 13:52:23 ERROR: migration aborted (duration 00:00:03): VM 109 qmp command 'query-machines' failed - got timeout
TASK ERROR: migration aborted

console issue:

VM 109 qmp command 'change' failed - got timeout
TASK ERROR: Failed to run vncproxy.

Backup issue:

vzdump 301,400 --storage BACKUP-DS218
INFO: starting new backup job: vzdump 301 400 --storage BACKUP-DS218
INFO: Starting Backup of VM 301 (qemu)
INFO: Backup started at 2019-09-19 12:33:54
INFO: status = running
INFO: update VM 301: -lock backup
INFO: VM Name: w7ab
INFO: include disk 'scsi0' 'VMVG:vm-301-disk-0' 50G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/ds218/vzdump/dump/vzdump-qemu-301-2019_09_19-12_33_54.vma.lzo'
ERROR: got timeout
INFO: aborting backup job
ERROR: VM 301 qmp command 'backup-cancel' failed - got timeout
ERROR: Backup of VM 301 failed - got timeout
INFO: Failed at 2019-09-19 12:44:03
INFO: Starting Backup of VM 400 (qemu)
INFO: Backup started at 2019-09-19 12:44:03
INFO: status = running
INFO: update VM 400: -lock backup
INFO: VM Name: w10jb
INFO: include disk 'scsi1' 'VMSTORE:400/vm-400-disk-0.qcow2' 80G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/ds218/vzdump/dump/vzdump-qemu-400-2019_09_19-12_44_03.vma.lzo'
ERROR: got timeout
INFO: aborting backup job
ERROR: VM 400 qmp command 'backup-cancel' failed - got timeout
ERROR: Backup of VM 400 failed - got timeout
INFO: Failed at 2019-09-19 12:54:16
INFO: Backup job finished with errors
job errors
 
I have a cluster that is in this same scenario. It seems to have happened after upgrading the nodes yesterday. Today I tried to get a console, but it failed with this message:
Code:
VM 5355 qmp command 'change' failed - got timeout
TASK ERROR: Failed to run vncproxy.

Then I tried to migrate that VM, and I got this:
Code:
2019-09-20 12:11:42 ERROR: migration aborted (duration 00:00:03): VM 5373 qmp command 'query-machines' failed - got timeout
TASK ERROR: migration aborted
I get information on the VMs in the GUI, but I can't perform any actions on those VMs. I have seen this in the past, and the only way to fix it seems to be to restart the VM.

My storage is CEPH.
 
Can confirm I am seeing this issue on ZFS storage, so it doesn't seem to be a problem with the storage backend.
I am getting qmp timeouts on some actions like backups and console. Storage replication seems to be working OK.
It stopped working after around a week of VM uptime. After shutting down and powering up the VM the problem is gone, at least for a while, but one has to restart all of the VMs.
I don't know if this is relevant, but anyway: I am running around 20 Windows VMs and a few Linux ones on the host where the problem started, so it doesn't seem to be OS dependent. Also, it is one of three hosts in the cluster and so far it has only happened on this one.

Some of the relevant log entries
Code:
Sep 23 00:01:09 h3 pve-ha-lrm[1185494]: VM 10302 qmp command failed - VM 10302 qmp command 'query-status' failed - got timeout
Sep 23 00:01:09 h3 pve-ha-lrm[1185494]: VM 10302 qmp command 'query-status' failed - got timeout#012
Sep 23 00:01:19 h3 pve-ha-lrm[1188717]: VM 10302 qmp command failed - VM 10302 qmp command 'query-status' failed - got timeout
Sep 23 00:01:19 h3 pve-ha-lrm[1188717]: VM 10302 qmp command 'query-status' failed - got timeout#012

Sep 23 10:18:26 h3 pvesr[2013891]: VM 9803 qmp command failed - VM 9803 qmp command 'guest-ping' failed - got timeout
Sep 23 10:18:26 h3 pvesr[2013891]: Qemu Guest Agent is not running - VM 9803 qmp command 'guest-ping' failed - got timeout

Sep 23 09:14:25 h3 pvedaemon[767596]: VM 10904 qmp command failed - VM 10904 qmp command 'guest-ping' failed - got timeout
Sep 23 09:14:26 h3 pveproxy[737451]: 2019-09-23 09:14:26.259301 +0200 error AnyEvent::Util: Runtime error in AnyEvent::guard callback: Can't call method "_put_session" on an undefined value at /usr/lib/x86_64-linux-gnu/perl5/5.28/AnyEvent/Handle.pm line 2259 during global destruction.


Sep 23 09:14:20 h3 qm[744483]: VM 10904 qmp command failed - VM 10904 qmp command 'change' failed - got timeout
Sep 23 09:14:20 h3 qm[741842]: VM 10904 qmp command failed - VM 10904 qmp command 'change' failed - got timeout

Sorry for cutting out single log lines, but there is really nothing useful before or after them.

Code:
root@h3:~# pveversion -V
proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-6 (running version: 6.0-6/c71f879f)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.2-pve1
ceph-fuse: 14.2.2-pve1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
pve-zsync: 2.0-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2
And here is the config of one of the machines:
Code:
agent: 1
bootdisk: scsi0
cores: 1
cpu: host
ide2: none,media=cdrom
memory: 8192
name: RCP
net0: virtio=5E:02:9B:11:0A:FD,bridge=vmbr0,tag=98
numa: 1
onboot: 1
ostype: win8
scsi0: ZFS-H3-SSD:vm-9801-disk-0,cache=writeback,discard=on,size=50G
scsihw: virtio-scsi-pci
smbios1: uuid=30dd3844-d706-4c5a-aa1d-8554c2f71143
sockets: 2
vga: qxl
vmgenid: 11a00656-5a67-4e02-a421-65067de1f9b8
 
Same issue is seen here.

Actually, we know of 5 machines that cannot be migrated. The affected machines can be reached via SSH, ping, etc., and the services on these machines are still up. Last Friday this wasn't the case: even our domain controller became unreachable, which caused massive problems.

We run 6 nodes on pve-manager/6.0-7/28984024 (running kernel: 5.0.15-1-pve) with Ceph as the main storage backend and NFS to a NAS server for backups of the machines.
dir: local #unused
path /var/lib/vz
content backup,images,vztmpl,iso
maxfiles 1
shared 0

lvmthin: local-lvm #unused
thinpool data
vgname pve
content images,rootdir

nfs: Backup-Daily
export /raid0/data/_NAS_NFS_Exports_/Proxmox-Daily
path /mnt/pve/Backup-Daily
server nasserver3
content backup
maxfiles 3
options vers=3

nfs: Backup-Weekly
export /raid0/data/_NAS_NFS_Exports_/Proxmox-weekly
path /mnt/pve/Backup-Weekly
server nasserver3
content backup
maxfiles 2
options vers=3

rbd: HDD_Storage-VM
content images
krbd 0
pool HDD_Storage

nfs: CD-Images
export /raid0/data/_NAS_NFS_Exports_/CD-Images
path /mnt/pve/CD-Images
server nasserver3
content iso
maxfiles 1
options vers=3

nfs: Test
export /raid0/data/_NAS_NFS_Exports_/VM-HDDs
path /mnt/pve/Test
server nasserver3
content images
options vers=3
root@vm-3:/etc/pve# qm showcmd 300 --pretty
/usr/bin/kvm \
-id 300 \
-name Kanal-NB \
-chardev 'socket,id=qmp,path=/var/run/qemu-server/300.qmp,server,nowait' \
-mon 'chardev=qmp,mode=control' \
-chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
-mon 'chardev=qmp-event,mode=control' \
-pidfile /var/run/qemu-server/300.pid \
-daemonize \
-smp '1,sockets=1,cores=1,maxcpus=1' \
-nodefaults \
-boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
-vnc unix:/var/run/qemu-server/300.vnc,password \
-no-hpet \
-cpu 'kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,hv_synic,hv_stimer,hv_ipi,enforce' \
-m 4096 \
-device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
-device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
-device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' \
-device 'usb-tablet,id=tablet,bus=uhci.0,port=1' \
-device 'VGA,id=vga,bus=pci.0,addr=0x2' \
-device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' \
-iscsi 'initiator-name=iqn.1993-08.org.debian:01:81e7c9dd7f3d' \
-drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' \
-device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2' \
-device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' \
-drive 'file=rbd:HDD_Storage/vm-300-disk-1:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/HDD_Storage-VM.keyring,if=none,id=drive-scsi0,discard=on,format=raw,cache=none,aio=native,detect-zeroes=unmap' \
-device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' \
-netdev 'type=tap,id=net0,ifname=tap300i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' \
-device 'rtl8139,mac=DA:FB:8B:47:4D:CC,netdev=net0,bus=pci.0,addr=0x12,id=net0' \
-rtc 'driftfix=slew,base=localtime' \
-machine 'type=pc' \
-global 'kvm-pit.lost_tick_policy=discard'
boot: c
bootdisk: scsi0
cores: 1
description:
ide2: none,media=cdrom
memory: 4096
name: Kanal-NB
net0: rtl8139=DA:FB:8B:47:4D:CC,bridge=vmbr0,tag=106
onboot: 1
ostype: win7
parent: Sicherung
scsi0: HDD_Storage-VM:vm-300-disk-1,discard=on,size=320G
scsihw: virtio-scsi-pci
sockets: 1
startup: order=5
 
Thank you all for the new information. We are still trying to figure out what is causing this behavior. We'll keep you updated :)

Last Friday this wasn't the case: even our domain controller became unreachable, which caused massive problems.
Did the VM react to a shutdown command or did you have to stop it hard?
 
Thank you all for the new information. We are still trying to figure out what is causing this behavior. We'll keep you updated :)


Did the VM react to a shutdown command or did you have to stop it hard?
VMs I had issues with did react to normal shutdown commands from the GUI or CLI.

Edit:
Alternatively you can log into the VM and issue the shutdown command.
 
I had to stop it hard. It didn't react to any command except "Stop".
I moved the VM to another node and started it; it came up normally.

The other 5 VMs that are stuck at the moment are still reachable and providing services, but give a timeout when the backup runs or when trying to use the VNC console etc.

BTW: This phenomenon is not node dependent. The "Friday" incident happened on node VM-6 while the currently stuck VMs are on node VM-3. Let me just check whether we have some more VMs on other nodes that are currently stuck.
 
BTW: This phenomenon is not node dependent. The "Friday" incident happened on node VM-6 while the currently stuck VMs are on node VM-3. Let me just check whether we have some more VMs on other nodes that are currently stuck.
I'm not sure about that. Maybe that's the case, but for me VMs on other nodes didn't experience this bug, and they had been running longer. The other nodes are different hardware for me, though, and the node that experienced the problem is the most heavily loaded one for me.
 
So, some additional information:

Every guest that is currently stuck has ballooning enabled. But not every guest with ballooning is stuck, so this does not have to mean anything.
No additional stuck guests were found, and it hit different guests than on Friday.

PS:
This problem scares me. :oops: Our main SQL server is one of the stuck guests and I really don't want any downtime on this one during work hours.
I will gladly help if there is something to dig deeper into.
Oh, and: it hits Windows guests as well as Linux guests.
 
Thank you all for the new information. We are still trying to figure out what is causing this behavior. We'll keep you updated :)


Did the VM react to a shutdown command or did you have to stop it hard?

I have to stop it hard. All stuck VMs do still work, but they cannot be controlled from the UI or the shell.
 
So, some additional information:

Every guest that is currently stuck has ballooning enabled. But not every guest with ballooning is stuck, so this does not have to mean anything.
No additional stuck guests were found, and it hit different guests than on Friday.

PS:
This problem scares me. :oops: Our main SQL server is one of the stuck guests and I really don't want any downtime on this one during work hours.
I will gladly help if there is something to dig deeper into.
Oh, and: it hits Windows guests as well as Linux guests.
On the VMs that created issues for us we did not have ballooning enabled at all (we don't use ballooning anywhere). So this is not a factor.
 
We have the same issue on one node in our cluster. The backup fails on all VMs on that node. I was going to reboot the node, but the running VMs will not migrate, throwing this error:
task started by HA resource agent
2019-09-24 09:09:47 ERROR: migration aborted (duration 00:00:03): VM 303 qmp command 'query-machines' failed - got timeout
TASK ERROR: migration aborted

Our VMs are currently using qcow2 on NFS storage.

agent: 1
bootdisk: scsi0
cores: 4
ide2: none,media=cdrom
memory: 8192
name: REDACTED
net0: virtio=EA:02:EB:DB:CE:00,bridge=vmbr0,tag=20
numa: 0
onboot: 1
ostype: l26
protection: 1
scsi0: NFS:303/vm-303-disk-0.qcow2,discard=on,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=c7e3bfb6-e6cc-4054-be8d-9d0262384d30
sockets: 1
vmgenid: a087f717-755b-466b-befa-f65a26db19b3

/usr/bin/kvm \
-id 303 \
-name REDACTED \
-chardev 'socket,id=qmp,path=/var/run/qemu-server/303.qmp,server,nowait' \
-mon 'chardev=qmp,mode=control' \
-chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
-mon 'chardev=qmp-event,mode=control' \
-pidfile /var/run/qemu-server/303.pid \
-daemonize \
-smbios 'type=1,uuid=c7e3bfb6-e6cc-4054-be8d-9d0262384d30' \
-smp '4,sockets=1,cores=4,maxcpus=4' \
-nodefaults \
-boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
-vnc unix:/var/run/qemu-server/303.vnc,password \
-cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce \
-m 8192 \
-device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
-device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
-device 'vmgenid,guid=a087f717-755b-466b-befa-f65a26db19b3' \
-device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' \
-device 'usb-tablet,id=tablet,bus=uhci.0,port=1' \
-device 'VGA,id=vga,bus=pci.0,addr=0x2' \
-chardev 'socket,path=/var/run/qemu-server/303.qga,server,nowait,id=qga0' \
-device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' \
-device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' \
-device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' \
-iscsi 'initiator-name=iqn.1993-08.org.debian:01:34e47a6f4ec9' \
-drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' \
-device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' \
-device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' \
-drive 'file=/mnt/pve/NFS/images/303/vm-303-disk-0.qcow2,if=none,id=drive-scsi0,discard=on,format=qcow2,cache=none,aio=native,detect-zeroes=unmap' \
-device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' \
-netdev 'type=tap,id=net0,ifname=tap303i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' \
-device 'virtio-net-pci,mac=EA:02:EB:DB:CE:00,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' \
-machine 'type=pc'

All nodes of the cluster are running:
pve-manager/6.0-7/28984024 (running kernel: 5.0.21-1-pve)
 
I have to stop it hard. All stuck VMs do still work, but they cannot be controlled from the UI or the shell.
  • Do you have the guest agent installed in the VMs and enabled in the options?
  • Do the VMs react to a shutdown command right after a fresh start?
As far as I understand the situation, certain commands like backup or VNC get a timeout, but the shutdown command is still passed through to the VM.
After a clean start a VM can run for days or weeks until this issue hits it again, which makes it hard to reproduce and debug.
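
If you want to check whether the QMP monitor of an affected VM is really hung, one way is to poke its monitor socket directly. A minimal sketch, assuming socat is installed; the socket path is the standard /var/run/qemu-server/<vmid>.qmp as seen in the qm showcmd output earlier in the thread, and the socket normally only serves one client at a time, so avoid doing this while a backup or migration task is hitting the VM:
Code:
VMID=503   # example ID from this thread; replace with an affected VM
printf '%s\n' '{"execute":"qmp_capabilities"}' '{"execute":"query-status"}' \
  | timeout 10 socat -t 5 - UNIX-CONNECT:/var/run/qemu-server/${VMID}.qmp
On a healthy VM this prints the QMP greeting and a status reply within a second or two; on a stuck VM it will typically hang until the timeout, just like the PVE daemons do.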

PS:
This Problem scares me. :oops: Our main SQL Server is one of the stuck Guests and I really don't want any downtime on this one during workhours.
I will gladly help, if there is something to dig deeper into.
Oh and: it hits Windows Guests as well als Linux Guests.
If you shut it down outside of working hours and then start it fresh it should be fine for some time (as a workaround). Why your DC failed completely I cannot say for sure :/
 
  • Do you have the guest agent installed in the VMs and enabled in the options?
  • Do the VMs react to a shutdown command right after a fresh start?
As far as I understand the situation, certain commands like backup or VNC get a timeout, but the shutdown command is still passed through to the VM.
After a clean start a VM can run for days or weeks until this issue hits it again, which makes it hard to reproduce and debug.
We have one stuck VM with the guest agent enabled; the others have it disabled. This VM reacted to a shutdown command from the web GUI while stuck, and it reacts to a shutdown command right after a fresh restart.
VMs without the guest agent enabled fail on the shutdown command with: TASK ERROR: VM quit/powerdown failed
After a fresh reboot, the shutdown command succeeds.
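
In case it helps others checking the agent question: one way to see which VMs on a node have the agent option enabled is a quick grep over the local node's VM configs, for example:
Code:
# VMs on this node with the guest agent option enabled
grep -l '^agent: 1' /etc/pve/qemu-server/*.conf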
 
If possible, can you please help us by tracing your VMs?

Once a VM is in this state again, we would like to see the last 10 or so lines from the trace log.

How to:

Create a file with the tracing patterns we are interested in:
Let's store it for example in /root/trace_patterns
The content of it:
Code:
handle_qmp_command
monitor_qmp_cmd_in_band
monitor_qmp_cmd_out_of_band
qmp_job_cancel
qmp_job_pause
qmp_job_resume
qmp_job_complete
qmp_job_finalize
qmp_job_dismiss
qmp_block_job_cancel
qmp_block_job_pause
qmp_block_job_resume
qmp_block_job_complete
qmp_block_job_finalize
qmp_block_job_dismiss
qmp_block_stream
monitor_protocol_event_queue
monitor_suspend
monitor_protocol_event_handler
monitor_protocol_event_emit
monitor_protocol_event_queue

For each VM to be traced, check whether there are already custom arguments set, using qm config <vmid>. Look out for the args parameter.

If there is none, continue. If there is, make sure you include its content again in the next command.

Code:
qm set <vmid> --args '-trace events=/root/trace_patterns,file=/root/qemu_trace_<vmid>'

Don't forget to replace <vmid> with the respective VM IDs.
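
For illustration, a hedged sketch of what this looks like when a VM already has custom args set (the VM ID and the pre-existing args value below are examples only):
Code:
# Check for existing custom args first (example VM ID):
qm config 10302 | grep '^args'
# Suppose it returns (example value only):
#   args: -global kvm-pit.lost_tick_policy=discard
# Then re-add that value together with the tracing options in one go:
qm set 10302 --args '-global kvm-pit.lost_tick_policy=discard -trace events=/root/trace_patterns,file=/root/qemu_trace_10302'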

Once you do a clean start of the VM, the trace file should appear. It should not grow too quickly, roughly 50 MB to 60 MB per week.
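
When one of the traced VMs runs into the issue again, the last lines can then be collected with, for example:
Code:
tail -n 10 /root/qemu_trace_<vmid>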

Update: If you trace HA-enabled VMs, make sure the trace_patterns file is present on all nodes at the same location.
 
I will configure a multitude of VMs with tracing enabled, since this problem seems to randomly affect only some guests.

It will be a matter of luck to some degree, but I hope we will catch at least one guest.
 
One thing: if you have some of those VMs set to HA, make sure that the trace_patterns file is available on all nodes at the same location. The VM will fail to start if it cannot find the file.
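
For example, one way to push the file to the other nodes (the node names below are placeholders, adjust them to your cluster):
Code:
# Copy the trace pattern file to the same path on every other node
for node in pve2 pve3; do
    scp /root/trace_patterns root@${node}:/root/trace_patterns
done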
 
