Slow HA failover, over 3 minutes

crsprox

New Member
Jul 5, 2017
I'm currently using Proxmox 5.0 in a test configuration. I have two HP DL360 Gen8 servers with a Pi3 to hold quorum. I'm using GlusterFS (3.8.8) with the Pi3 as a third peer. When I test by pulling the plug on one of the HP servers to fail the three VMs over to the second server, they take over 3 minutes to come back up. The hosts have 90GB+ RAM and SSDs for the OS and VM pool. I'm using softdog for fencing.
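
In case it helps, these are the checks I've been using to confirm the cluster, fencing, and storage state before pulling the plug (standard Proxmox/Gluster commands; the volume name matches the log below):

# cluster membership and quorum
pvecm status

# confirm softdog is loaded and the Proxmox watchdog multiplexer is running
lsmod | grep softdog
systemctl status watchdog-mux

# GlusterFS client-side ping timeout for the VM volume
gluster volume get vmpool network.ping-timeout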

There appears to be at least 90 seconds of inactivity in the log, during which the system seems to just sit there. Is this amount of time expected? Is there anything I can do to improve it?

Thanks for any insight

Jul 05 13:20:43 pve04 corosync[4546]: notice [TOTEM ] A processor failed, forming new configuration.
Jul 05 13:20:43 pve04 corosync[4546]: [TOTEM ] A processor failed, forming new configuration.
Jul 05 13:20:45 pve04 corosync[4546]: notice [TOTEM ] A new membership (10.1.99.59:520) was formed. Members left: 1
Jul 05 13:20:45 pve04 corosync[4546]: notice [TOTEM ] Failed to receive the leave message. failed: 1
Jul 05 13:20:45 pve04 corosync[4546]: [TOTEM ] A new membership (10.1.99.59:520) was formed. Members left: 1
Jul 05 13:20:45 pve04 corosync[4546]: [TOTEM ] Failed to receive the leave message. failed: 1
Jul 05 13:20:45 pve04 pmxcfs[4396]: [dcdb] notice: members: 2/4396
Jul 05 13:20:45 pve04 pmxcfs[4396]: [status] notice: members: 2/4396
Jul 05 13:20:45 pve04 corosync[4546]: notice [QUORUM] Members[2]: 2 3
Jul 05 13:20:45 pve04 corosync[4546]: notice [MAIN ] Completed service synchronization, ready to provide service.
Jul 05 13:20:45 pve04 corosync[4546]: [QUORUM] Members[2]: 2 3
Jul 05 13:20:45 pve04 corosync[4546]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 05 13:20:54 pve04 pvestatd[4589]: got timeout
Jul 05 13:20:55 pve04 mnt-pve-bulkpool[4763]: [2017-07-05 17:20:55.605356] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-bulkpool-client-0: server 10.1.31.58:49155 has not responded in the last 2 seconds, disconnecting.
Jul 05 13:20:55 pve04 pvestatd[4589]: unable to activate storage 'bulkpool' - directory '/mnt/pve/bulkpool' does not exist or is unreachable
Jul 05 13:20:59 pve04 pvestatd[4589]: got timeout
Jul 05 13:20:59 pve04 mnt-pve-vmpool[4835]: [2017-07-05 17:20:59.706436] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-vmpool-client-0: server 10.1.31.58:49156 has not responded in the last 2 seconds, disconnecting.
Jul 05 13:20:59 pve04 pvestatd[4589]: unable to activate storage 'vmpool' - directory '/mnt/pve/vmpool' does not exist or is unreachable
Jul 05 13:20:59 pve04 pvestatd[4589]: status update time (8.780 seconds)
Jul 05 13:21:00 pve04 systemd[1]: Starting Proxmox VE replication runner...
Jul 05 13:21:01 pve04 systemd[1]: Started Proxmox VE replication runner.
Jul 05 13:22:00 pve04 systemd[1]: Starting Proxmox VE replication runner...
Jul 05 13:22:01 pve04 systemd[1]: Started Proxmox VE replication runner.
Jul 05 13:22:37 pve04 pve-ha-crm[4616]: successfully acquired lock 'ha_manager_lock'
Jul 05 13:22:37 pve04 pve-ha-crm[4616]: watchdog active
Jul 05 13:22:37 pve04 pve-ha-crm[4616]: status change slave => master
Jul 05 13:22:37 pve04 pve-ha-crm[4616]: node 'pve03': state changed from 'online' => 'unknown'
Jul 05 13:23:00 pve04 systemd[1]: Starting Proxmox VE replication runner...
Jul 05 13:23:01 pve04 systemd[1]: Started Proxmox VE replication runner.
Jul 05 13:23:37 pve04 pve-ha-crm[4616]: service 'vm:100': state changed from 'started' to 'fence'
Jul 05 13:23:37 pve04 pve-ha-crm[4616]: service 'vm:101': state changed from 'started' to 'fence'
Jul 05 13:23:37 pve04 pve-ha-crm[4616]: service 'vm:102': state changed from 'started' to 'fence'
Jul 05 13:23:37 pve04 pve-ha-crm[4616]: node 'pve03': state changed from 'unknown' => 'fence'
 
There appears to be at least 90sec of inactivity in the log that makes it appear the system is just sitting there.

Yes, this is the time required by the self-fencing algorithm (watchdog). The surviving node has to wait long enough to be certain the failed node's watchdog has expired, and that the node has reset itself, before it can safely recover the VMs. In the future, we plan to implement active fencing, which may reduce the time required.
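
If you want to watch the fencing and recovery steps as they happen during a test, the stock tools are enough (a minimal sketch; the unit names below are the standard Proxmox HA services):

# follow the CRM/LRM decisions live on the surviving node
journalctl -f -u pve-ha-crm -u pve-ha-lrm

# HA manager's view of nodes and services as they move through fence/recovery
ha-manager status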