I have two clusters running the following version:
root@vsys07:/var/log# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-6 (running version: 6.0-6/c71f879f)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-4.15: 5.4-6
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-10-pve: 4.15.18-32
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2
both clusters were updated from 5.X
randomly a node will become inaccessible. It goes grey in the GUI, all VMs get a question mark next to them and I'm also unable to SSH into the host. The SSH session does NOT time out or get rejected, it just hangs forever. The VMs on the inaccessible host continue to run. I can log into the console, but almost any pve or ssh related command will hang and will not exit with Ctrl-C.
A 'reboot' always hangs on 'A stop job for PVE guest VMs' (<-- wording from memory)
Cold booting the server is the only thing that brings it back.
in /var/log/messages at the time the server becomes inaccessible, I see pvesr Call Trace messages (see attached file).
Any help or insight into this problem would appreciated, I'm getting tired of driving into the office in the middle of the night to reboot servers.
root@vsys07:/var/log# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-6 (running version: 6.0-6/c71f879f)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-4.15: 5.4-6
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-10-pve: 4.15.18-32
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2
both clusters were updated from 5.X
randomly a node will become inaccessible. It goes grey in the GUI, all VMs get a question mark next to them and I'm also unable to SSH into the host. The SSH session does NOT time out or get rejected, it just hangs forever. The VMs on the inaccessible host continue to run. I can log into the console, but almost any pve or ssh related command will hang and will not exit with Ctrl-C.
A 'reboot' always hangs on 'A stop job for PVE guest VMs' (<-- wording from memory)
Cold booting the server is the only thing that brings it back.
in /var/log/messages at the time the server becomes inaccessible, I see pvesr Call Trace messages (see attached file).
Any help or insight into this problem would appreciated, I'm getting tired of driving into the office in the middle of the night to reboot servers.