Hi all,
I recently ran into a serious problem in a Proxmox 8 cluster. It started when a Windows VM would not start and the task log showed the following error:
TASK ERROR: timeout waiting on systemd
While searching the Proxmox forums, I found reports that rebooting the host resolved this, so I tried the same approach and began migrating the VMs to another host. That triggered a series of further problems: some live migrations froze and the VM stopped responding; others failed with an error, and I could only migrate with the VM powered off; still others succeeded, but the "migrate" task never finished (it reached 100% without ever reporting "success"). Meanwhile, the error
rbd: rbd2: no lock owners detected
appeared constantly in the syslog.
Once I had finally moved all the VMs off, I rebooted the host, and the host console showed a message about a timeout while terminating RBD processes. We use Ceph storage, version 17.2, in an 8-node cluster. The problem affected only Windows VMs that use a TPM device; it seems the TPM process was stuck on the 2 hosts that run Windows VMs.
After rebooting these 2 hosts, everything returned to normal. My concern is that I still don't know what caused the problem; I only know that it involves the RBD process for the TPM device of Windows VMs.
Configuration of the affected Windows VM:
Code:
agent: 1
bios: ovmf
boot: order=scsi0
cores: 2
cpu: Haswell-noTSX
efidisk0: stor-vms:vm-10010-disk-2,efitype=4m,pre-enrolled-keys=1,size=528K
ide0: stor-vms:vm-10010-cloudinit,media=cdrom
ide2: none,media=cdrom
kvm: 1
machine: pc-i440fx-7.2
memory: 6144
meta: creation-qemu=7.2.0,ctime=1696525080
net0: virtio=BC:24:11:BE:2C:94,bridge=vmbr0,firewall=1,rate=12.5
numa: 0
ostype: win10
scsi0: stor-vms:vm-10010-disk-0,discard=on,iops_rd=2000,iops_wr=1000,mbps_rd=300,mbps_wr=200,size=50G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=4264191e-905d-4d06-a5be-4955a7ae0dc7
sockets: 1
tpmstate0: stor-vms:vm-10010-disk-1,size=4M,version=v2.0
vmgenid: 07cfe099-ca6f-4ba7-ab19-6f1ea61eba6e
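Since the `rbd: rbd2:` syslog line comes from the kernel RBD client, it may help to know exactly which RBD images this VM references before checking them on the affected host. The following is only a rough Python sketch for pulling the volume names out of a config like the one above; `list_volumes`, `DISK_KEY`, and `sample` are hypothetical names used for illustration, and the sample text is an abbreviated copy of the config.

```python
import re

# Config keys that can reference a storage volume (assumed set, for illustration).
DISK_KEY = re.compile(r"^(efidisk|tpmstate|scsi|virtio|sata|ide)\d+$")

def list_volumes(config_text):
    """Return (key, storage, volume) for every storage-backed drive entry."""
    volumes = []
    for line in config_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or not DISK_KEY.match(key):
            continue  # skips scsihw, net0, smbios1, etc.
        # The volume ID is the first comma-separated field,
        # e.g. "stor-vms:vm-10010-disk-1".
        volid = value.strip().split(",", 1)[0]
        storage, sep, volume = volid.partition(":")
        if not sep:
            continue  # e.g. "none" for an empty CD-ROM drive
        volumes.append((key, storage, volume))
    return volumes

# Abbreviated copy of the VM config shown above.
sample = """\
efidisk0: stor-vms:vm-10010-disk-2,efitype=4m,pre-enrolled-keys=1,size=528K
ide0: stor-vms:vm-10010-cloudinit,media=cdrom
ide2: none,media=cdrom
scsi0: stor-vms:vm-10010-disk-0,discard=on,size=50G,ssd=1
scsihw: virtio-scsi-pci
tpmstate0: stor-vms:vm-10010-disk-1,size=4M,version=v2.0
"""

for key, storage, volume in list_volumes(sample):
    print(key, storage, volume)
```

Each image found this way could then be inspected with `rbd status <pool>/<image>` (lists watchers) and `rbd lock ls <pool>/<image>` (lists locks). Here `<pool>` is a placeholder: the storage ID `stor-vms` is not necessarily the Ceph pool name, which is defined in `/etc/pve/storage.cfg`.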
Package versions (from pveversion -v):
proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
ceph: 17.2.7-pve3
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.4
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.5-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1