Hi all,
I have a 3-node Proxmox cluster running PVE 6 on HPE DL380 Gen10 hardware. I had just upgraded from PVE 5 to PVE 6, so I was checking things. At some point I noticed that one node had the wrong time: it was 10 minutes behind. I added an NTP server and synced the time. Immediately the server rebooted, with about a dozen VMs on it!
As far as I understand, the server falsely thought that 10 minutes had passed with no keepalives from the other two nodes (or something like that) and decided to reboot itself, believing that it was isolated. When the server came back up, the time was again 10 minutes behind (I guess it takes the time from iLO), and when I synced with NTP it rebooted again!
Well, I really feel a bit disappointed about how easily the clustering algorithm failed. So I have a few questions about that:
- Why did the node reboot (fence itself) as a result of a time sync? Which part of the algorithm is the culprit?
- Is there a better clock function for the clustering algorithm that is not sensitive to time changes? I guess there is a function like "ticks since power-on" or "uptime".
- Building a clustering algorithm that also uses the storage for keepalives (like VMware does) is not that hard. There are challenges, but it's doable. Is that something Proxmox is looking at?
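To illustrate the second question, here is a minimal Python sketch of the difference between measuring a keepalive timeout with the wall clock and with a monotonic ("ticks since power-on") clock. It is not taken from Proxmox or corosync and makes no claim about how they are implemented. The names KeepaliveWatchdog and FakeClocks, the 60-second timeout, and the simulated 600-second clock step are all made up for the example.

```python
import time


class KeepaliveWatchdog:
    """Hypothetical watchdog: expires if no keepalive arrives within timeout_s."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock            # function returning "now" in seconds
        self.last_seen = clock()

    def keepalive(self):
        # Record the moment the last keepalive was received.
        self.last_seen = self.clock()

    def expired(self):
        # True if more than timeout_s seconds have passed according to the clock.
        return self.clock() - self.last_seen > self.timeout_s


class FakeClocks:
    """Simulated clocks so the effect of a time step can be shown deterministically."""

    def __init__(self):
        self.elapsed = 0.0               # seconds that really passed
        self.wall_offset = 1_000_000.0   # arbitrary wall-clock starting value

    def advance(self, seconds):
        self.elapsed += seconds

    def step_wall(self, seconds):
        # Simulates an abrupt wall-clock correction (e.g. a time sync).
        self.wall_offset += seconds

    def wall(self):
        return self.wall_offset + self.elapsed

    def monotonic(self):
        # Unaffected by wall-clock corrections.
        return self.elapsed


clocks = FakeClocks()
wall_wd = KeepaliveWatchdog(60, clock=clocks.wall)
mono_wd = KeepaliveWatchdog(60, clock=clocks.monotonic)

clocks.advance(1)        # one real second passes
clocks.step_wall(600)    # wall clock jumps forward by 10 minutes

wall_expired = wall_wd.expired()
mono_expired = mono_wd.expired()
print("wall-clock watchdog expired:", wall_expired)
print("monotonic watchdog expired:", mono_expired)
```

In this simulation the wall-clock watchdog sees 601 seconds elapse and expires, while the monotonic one sees only 1 second. In real code, time.monotonic() in Python (or clock_gettime with CLOCK_MONOTONIC on Linux) gives a clock that does not jump when the system time is set.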
Well, having said that, Proxmox is indeed an excellent and stable product and such foolish failures are a shame.
Best Regards,
Spiros