Behaviour on corosync crash


Nov 17, 2017

I'm currently evaluating Proxmox for a 5-node HA setup. During some tests I noticed that Proxmox isn't able to live-migrate VMs/containers when Corosync crashes on the node they are running on. Instead, the node gets restarted and the VMs/CTs are started on another node (after quite some time). I assume that Proxmox uses Corosync to initiate the live migration, but it still seems a bit odd, as there are other processes running (e.g. pvedaemon) which should also be able to initiate the migration.

So my question is: did I do something wrong, or is this the expected behaviour? What is the best practice for handling Corosync crashes?

Thanks in advance.

P.S. My Proxmox-Version is 5.1-36.
This is expected. But corosync should not crash.

Sure, but Murphy... (Corosync hasn't crashed even once so far; it was killed on purpose to see what happens. As I said, I'm evaluating at the moment.)

Would it be safe to monitor Corosync with a systemd watchdog or via supervisord, so that it is restarted in case of errors, or might this interfere with the fencing watchdog in a bad way?

BTW: What is the proper procedure for making manual changes to corosync.conf? When I just edit the config and bump the version as described in the Proxmox Wiki, Corosync refuses to (re)start because the other nodes have an older config. If I kill all instances to get the new config onto all nodes, the fencing watchdog kicks in and all nodes restart or shut down. In an older forum post, someone suggested temporarily disabling fencing in cluster.conf, but this seems to be outdated.
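
To make the "bump the version" step concrete: it means incrementing the config_version value in the totem section of corosync.conf. Below is a minimal Python sketch of just that text edit. The helper name bump_config_version and the sample cluster name are made up for illustration, and the sketch only rewrites a string; it does not distribute the file to other nodes or restart anything, so it does not address the fencing problem described above.

```python
import re


def bump_config_version(conf_text: str) -> str:
    """Return conf_text with the first config_version value incremented by one.

    Hypothetical helper for illustration; it edits text only.
    """
    # Match a line such as "  config_version: 5" and capture the number.
    pattern = re.compile(r"^(\s*config_version\s*:\s*)(\d+)", re.MULTILINE)
    match = pattern.search(conf_text)
    if match is None:
        raise ValueError("no config_version entry found")
    new_value = int(match.group(2)) + 1
    # Replace only the digits, leaving the rest of the file untouched.
    return conf_text[:match.start(2)] + str(new_value) + conf_text[match.end(2):]


# Made-up sample fragment, not a complete corosync.conf.
sample = """totem {
  cluster_name: testcluster
  config_version: 5
  version: 2
}
"""

if __name__ == "__main__":
    print(bump_config_version(sample))
```

The regular expression is anchored on the config_version key, so the separate "version" line in the totem section is left alone.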

BTW2: Commercial Support is per CPU Socket not per Core, correct?
