We have a couple of multi-node clusters running the latest 3.4 without any issues, and tried to reinstall one of them, a 4-node cluster, with PVE 4.
The base install was straightforward, but we ran into quorum issues when creating the cluster. All nodes were set up identically, but one node couldn't join the cluster successfully: it hung at 'waiting for quorum...'. The other nodes added it, but the node itself did nothing and syslog showed:
Oct 25 23:44:00 node3 pmxcfs[3788]: [status] crit: cpg_send_message failed: 9
I tried several times to delete and re-add the node, but no luck, so I installed the node from scratch. When adding it I had the same issue again; with the -force option it was able to join.
While testing I rebooted another node, and after that containers could not be started on this node. I also found quorum messages in syslog:
Oct 26 16:41:01 node1 pmxcfs[1302]: [quorum] crit: quorum_initialize failed: 2
Oct 26 16:41:01 node1 pmxcfs[1302]: [quorum] crit: can't initialize service
Oct 26 16:41:01 node1 pmxcfs[1302]: [confdb] crit: cmap_initialize failed: 2
Oct 26 16:41:01 node1 pmxcfs[1302]: [confdb] crit: can't initialize service
Oct 26 16:41:01 node1 pmxcfs[1302]: [dcdb] crit: cpg_initialize failed: 2
Oct 26 16:41:01 node1 pmxcfs[1302]: [dcdb] crit: can't initialize service
Oct 26 16:41:01 node1 pmxcfs[1302]: [status] crit: cpg_initialize failed: 2
Oct 26 16:41:01 node1 pmxcfs[1302]: [status] crit: can't initialize service
and this message again:
Oct 26 16:41:10 node1 pmxcfs[1302]: [status] crit: cpg_send_message failed: 9
Oct 26 16:41:10 node1 pmxcfs[1302]: [status] crit: cpg_send_message failed: 9
Oct 26 16:41:12 node1 pmxcfs[1302]: [status] crit: cpg_send_message failed: 9
Oct 26 16:41:12 node1 pmxcfs[1302]: [status] crit: cpg_send_message failed: 9
Oct 26 16:41:12 node1 pmxcfs[1302]: [status] crit: cpg_send_message failed: 9
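To compare nodes, the pmxcfs lines above can be tallied per service and error code. What follows is only a rough sketch, assuming the syslog layout shown above. The names tally_pmxcfs_errors, describe and CS_ERRORS are made up for illustration and are not part of Proxmox. The mapping of codes 2 and 9 to corosync's cs_error_t names is an assumption that should be checked against the corosync headers on the node.

```python
import re
from collections import Counter

# Matches pmxcfs critical lines in the layout seen in the syslog excerpt, e.g.
#   Oct 26 16:41:01 node1 pmxcfs[1302]: [quorum] crit: quorum_initialize failed: 2
PMXCFS_CRIT = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+(?P<node>\S+)\s+pmxcfs\[\d+\]:\s+"
    r"\[(?P<service>\w+)\]\s+crit:\s+(?P<msg>.+?)(?::\s+(?P<code>\d+))?$"
)

# ASSUMPTION: the trailing numbers are corosync cs_error_t values.
# Only the two codes seen in the logs are listed; verify before relying on them.
CS_ERRORS = {2: "CS_ERR_LIBRARY", 9: "CS_ERR_BAD_HANDLE"}


def describe(code):
    """Return the assumed cs_error_t name for a code, or 'unknown'."""
    return CS_ERRORS.get(code, "unknown")


def tally_pmxcfs_errors(lines):
    """Count pmxcfs 'crit' messages by (node, service, message, code)."""
    counts = Counter()
    for line in lines:
        m = PMXCFS_CRIT.match(line.strip())
        if not m:
            continue  # not a pmxcfs crit line
        code = int(m.group("code")) if m.group("code") else None
        counts[(m.group("node"), m.group("service"), m.group("msg"), code)] += 1
    return counts
```

Feeding it the syslog lines from each node after a reboot should show whether the initialize failures (code 2) always come before the cpg_send_message failures (code 9), or whether the two happen independently.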
After rebooting the node again, it had no quorum issue.
The same happened on another node after a reboot; after rebooting again the quorum issue was gone.
It seems clustering/quorum is not as reliable as in 3.4, where I never saw this issue on any node.
Has anything changed, or any idea what could cause this issue?
Also, the lack of live migration for containers made me decide to go back to 3.4.
It seems LXC has to improve its tools to be useful. Stopping containers in order to move them is a no-go for us, so we will stick with OpenVZ for the moment.