Hi
I ran another test this morning. I rebooted all nodes and stopped all VMs, then changed the order of the nodes to node2, node1, node3, saved, and rebooted. I don't think changing the order will help.
Once the cluster was up again, I rebooted Node3, and this time it executed a job and all the VMs on Node3 were migrated to Node1 & Node2. In the logs I found that pve-ha-crm performed the migration and then fenced Node3 and the VMs, as follows:
Jul 9 08:31:51 cluster1 pve-ha-crm[1094]: migrate service 'vm:104' to node 'cluster2' (running)
Jul 9 08:31:51 cluster1 pve-ha-crm[1094]: service 'vm:104': state changed from 'started' to 'migrate' (node = cluster3, target = cluster2)
Jul 9 08:31:51 cluster1 pve-ha-crm[1094]: migrate service 'vm:105' to node 'cluster1' (running)
Jul 9 08:31:51 cluster1 pve-ha-crm[1094]: service 'vm:105': state changed from 'started' to 'migrate' (node = cluster3, target = cluster1)
Jul 9 08:34:52 cluster1 pmxcfs[721]: [dcdb] notice: members: 1/721, 2/751
Jul 9 08:34:52 cluster1 pmxcfs[721]: [dcdb] notice: starting data syncronisation
Jul 9 08:34:52 cluster1 pmxcfs[721]: [status] notice: members: 1/721, 2/751
Jul 9 08:34:52 cluster1 pmxcfs[721]: [status] notice: starting data syncronisation
Jul 9 08:34:52 cluster1 corosync[1082]: [TOTEM ] A new membership (10.146.0.181:364) was formed. Members left: 3
Jul 9 08:34:52 cluster1 corosync[1082]: [QUORUM] Members[2]: 1 2
Jul 9 08:34:52 cluster1 corosync[1082]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 9 08:34:52 cluster1 pmxcfs[721]: [dcdb] notice: received sync request (epoch 1/721/0000000A)
Jul 9 08:34:52 cluster1 pmxcfs[721]: [status] notice: received sync request (epoch 1/721/0000000A)
Jul 9 08:34:52 cluster1 pmxcfs[721]: [dcdb] notice: received all states
Jul 9 08:34:52 cluster1 pmxcfs[721]: [dcdb] notice: leader is 1/721
Jul 9 08:34:52 cluster1 pmxcfs[721]: [dcdb] notice: synced members: 1/721, 2/751
Jul 9 08:34:52 cluster1 pmxcfs[721]: [dcdb] notice: start sending inode updates
Jul 9 08:34:52 cluster1 pmxcfs[721]: [dcdb] notice: sent all (0) updates
Jul 9 08:34:52 cluster1 pmxcfs[721]: [dcdb] notice: all data is up to date
Jul 9 08:34:52 cluster1 pmxcfs[721]: [status] notice: received all states
Jul 9 08:34:52 cluster1 pmxcfs[721]: [status] notice: all data is up to date
Jul 9 08:35:01 cluster1 pve-ha-crm[1094]: node 'cluster3': state changed from 'online' => 'unknown'
Jul 9 08:35:26 cluster1 pveproxy[25202]: proxy detected vanished client connection
Jul 9 08:35:51 cluster1 pve-ha-crm[1094]: service 'vm:104': state changed from 'migrate' to 'fence'
Jul 9 08:35:51 cluster1 pve-ha-crm[1094]: service 'vm:105': state changed from 'migrate' to 'fence'
Jul 9 08:35:51 cluster1 pve-ha-crm[1094]: node 'cluster3': state changed from 'unknown' => 'fence'
I repeated this process 2 - 3 more times to be sure it keeps working, and it no longer does. I checked the logs and found the following:
Jul 9 09:35:14 cluster1 pve-ha-crm[1090]: service 'vm:104': state changed from 'started' to 'freeze'
Jul 9 09:35:14 cluster1 pve-ha-crm[1090]: service 'vm:105': state changed from 'started' to 'freeze'
Jul 9 09:35:21 cluster1 pmxcfs[720]: [dcdb] notice: members: 1/720, 2/746
Jul 9 09:35:21 cluster1 pmxcfs[720]: [dcdb] notice: starting data syncronisation
Jul 9 09:35:21 cluster1 pmxcfs[720]: [status] notice: members: 1/720, 2/746
Jul 9 09:35:21 cluster1 pmxcfs[720]: [status] notice: starting data syncronisation
Jul 9 09:35:21 cluster1 corosync[1078]: [TOTEM ] A new membership (10.146.0.181:412) was formed. Members left: 3
Jul 9 09:35:21 cluster1 corosync[1078]: [QUORUM] Members[2]: 1 2
Jul 9 09:35:21 cluster1 corosync[1078]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 9 09:35:21 cluster1 pmxcfs[720]: [dcdb] notice: received sync request (epoch 1/720/00000006)
Jul 9 09:35:21 cluster1 pmxcfs[720]: [status] notice: received sync request (epoch 1/720/00000006)
Jul 9 09:35:21 cluster1 pmxcfs[720]: [dcdb] notice: received all states
Jul 9 09:35:21 cluster1 pmxcfs[720]: [dcdb] notice: leader is 1/720
Jul 9 09:35:21 cluster1 pmxcfs[720]: [dcdb] notice: synced members: 1/720, 2/746
Jul 9 09:35:21 cluster1 pmxcfs[720]: [dcdb] notice: start sending inode updates
Jul 9 09:35:21 cluster1 pmxcfs[720]: [dcdb] notice: sent all (0) updates
Jul 9 09:35:21 cluster1 pmxcfs[720]: [dcdb] notice: all data is up to date
Jul 9 09:35:21 cluster1 pmxcfs[720]: [status] notice: received all states
Jul 9 09:35:21 cluster1 pmxcfs[720]: [status] notice: all data is up to date
Jul 9 09:35:24 cluster1 pve-ha-crm[1090]: node 'cluster3': state changed from 'online' => 'unknown'
NB: When I switched Node3 back on, the VMs remained on Node1 & Node2 and did not migrate back to Node3. Do we need to do this manually??
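(My own guess, please correct me if I'm wrong: I understand automatic failback only happens when the service belongs to an HA group that gives its home node a higher priority and does not set the nofailback option. Something like this in /etc/pve/ha/groups.cfg, where "prefer_node3" and the priorities are just example values:)

```
group: prefer_node3
        nodes cluster3:2,cluster1:1,cluster2:1
        nofailback 0
```

(Without such a group, I suppose the VMs simply stay where the CRM put them and have to be migrated back by hand.)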
PVE version 4.0 Beta 24
Thanks for your help!!