Hello
We have a problem with our Proxmox 5 cluster of 24 nodes (yes, it's old, but it will be upgraded soon).
The initial problem: the first node, pmx1, had its mount process lock up after a new NFS storage was added to the cluster (the other nodes were fine).
So we did a backup/restore of the LXC containers and VMs on it to another node, so that we could reboot pmx1.
But during the VM backup we got the following error:
INFO: starting kvm to execute backup task
kvm: -vnc unix:/var/run/qemu-server/283.vnc,x509,password: Failed to start VNC server: Our own certificate /etc/pve/local/pve-ssl.pem failed validation against /etc/pve/pve-root-ca.pem: The certificate hasn't got a known issuer
ERROR: Backup of VM 283 failed - start failed: command
We copied /etc/pve/local/pve-ssl.pem from node 2 to be able to run the backup (strange, since automated vzdump backups were working on this node 2 days earlier).
Then we rebooted the node, and the problems began.
First, corosync was at 100% CPU on this node.
It was restarted and was fine afterwards.
Then corosync crashed on node pmx17.
It was restarted and was fine again.
But all nodes reported that pmx001 was missing.
A second reboot of pmx001 was done.
Now corosync seems OK, with all 24 nodes listed.
But the FUSE filesystem /etc/pve/xxx is inaccessible on many nodes: reads block, and it cannot be written to.
How can we recover the correct cluster status from this point without rebooting any of the nodes?
What would be the correct steps to recover the cluster status once we shut down the pmx1 node with the bad certificate?
Any quick help would be appreciated!