After upgrading to Proxmox 6 i observed a random event - randomly VMs get into HA error states with a red circle above the VM icon.
Their HA State becomes "error". The cluster seems fine otherwise.
The logs indicate a "Main process exited, code=killed, status=6/ABRT" for the pve-cluster service which is subsequently restarted automatically.
This is a problem since these VMs will not respawn anymore (i assume?) if the node goes down.
The clunky solution, without rebooting the VM, is to go to the cluster wide HA config and remove and re-add the VM into HA. Unpleasant if there are multiple of them (i had this happen once with around 10 vms from all over the cluster).
Right now i have only one VM, id 134.
Error logs:
pve-ha-lrm service logs (reverse):
and phe-ha-crm logs:
pve-cluster log (this one is in reverse):
Their HA State becomes "error". The cluster seems fine otherwise.
The logs indicate a "Main process exited, code=killed, status=6/ABRT" for the pve-cluster service which is subsequently restarted automatically.
This is a problem since these VMs will not respawn anymore (i assume?) if the node goes down.
The clunky solution, without rebooting the VM, is to go to the cluster wide HA config and remove and re-add the VM into HA. Unpleasant if there are multiple of them (i had this happen once with around 10 vms from all over the cluster).
Right now i have only one VM, id 134.
Error logs:
pve-ha-lrm service logs (reverse):
Code:
Aug 30 10:40:55 server pve-ha-lrm[18799]: temporary inconsistent cluster state (cfs restart?), skip round
Aug 30 10:40:54 server pve-ha-lrm[18799]: updating service status from manager failed: Connection refused
Aug 30 10:40:14 server pve-ha-lrm[18799]: temporary inconsistent cluster state (cfs restart?), skip round
Aug 30 10:39:25 server pve-ha-lrm[21140]: service vm:134 is in an error state and needs manual intervention. Look up 'ERROR RECOVERY' in the documentation.
Aug 30 10:39:14 server pve-ha-lrm[21090]: Connection refused
Aug 30 10:39:14 server pve-ha-lrm[21090]: ipcc_send_rec[3] failed: Connection refused
Aug 30 10:39:14 server pve-ha-lrm[21090]: ipcc_send_rec[2] failed: Connection refused
Aug 30 10:39:14 server pve-ha-lrm[21090]: ipcc_send_rec[1] failed: Connection refused
Code:
Aug 30 10:39:20 server pve-ha-crm[18796]: service 'vm:134' got unrecoverable error (exit code 255))
Aug 30 10:39:20 server pve-ha-crm[18796]: service 'vm:134': state changed from 'started' to 'error'
Code:
Aug 30 10:39:14 server systemd[1]: Starting The Proxmox VE cluster filesystem...
Aug 30 10:39:14 server systemd[1]: Stopped The Proxmox VE cluster filesystem.
Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 11.
Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Failed with result 'signal'.
Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Main process exited, code=killed, status=6/ABRT
Aug 30 10:26:49 server pmxcfs[23824]: [dcdb] notice: data verification successful