2 node cluster, watchdog expire, reboot

Richard Goode · Dec 1, 2019

Hi.

I have a 2-node Proxmox VE 5.4 cluster. I know 3 nodes are recommended, but I only have 2 servers. I don't need automatic failover, just manual migration, config sync, etc.

The cluster has been running for some time (6+ months), and today I ran into an issue while upgrading from 5.4-3 to 5.4-13.

After I rebooted one node, the other node rebooted ~60 seconds later. It seems the watchdog expired.

How do I stop the watchdog / fencing in this cluster? Any failover I will handle manually.

I could have sworn I have rebooted nodes many times before and this didn't happen. Possibly I changed something and forgot.

Here are the logs from one node (the nodes are called vm3 and vm4):

Code:

Dec  1 19:26:16 vm4 pmxcfs[5011]: [dcdb] notice: members: 1/5011
Dec  1 19:26:16 vm4 pmxcfs[5011]: [status] notice: members: 1/5011
Dec  1 19:26:16 vm4 corosync[5221]: notice  [TOTEM ] A new membership (172.16.0.13:3468) was formed. Members left: 2
Dec  1 19:26:16 vm4 corosync[5221]: warning [CPG   ] downlist left_list: 1 received
Dec  1 19:26:16 vm4 corosync[5221]: notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec  1 19:26:16 vm4 corosync[5221]: notice  [QUORUM] Members[1]: 1
Dec  1 19:26:16 vm4 corosync[5221]:  [TOTEM ] A new membership (172.16.0.13:3468) was formed. Members left: 2
Dec  1 19:26:16 vm4 corosync[5221]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Dec  1 19:26:16 vm4 corosync[5221]:  [CPG   ] downlist left_list: 1 received
Dec  1 19:26:16 vm4 corosync[5221]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec  1 19:26:16 vm4 corosync[5221]:  [QUORUM] Members[1]: 1
Dec  1 19:26:16 vm4 corosync[5221]:  [MAIN  ] Completed service synchronization, ready to provide service.
Dec  1 19:26:16 vm4 pmxcfs[5011]: [status] notice: node lost quorum
Dec  1 19:26:20 vm4 pve-ha-crm[5418]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Dec  1 19:26:24 vm4 pve-ha-lrm[5464]: lost lock 'ha_agent_vm4_lock - cfs lock update failed - Permission denied
Dec  1 19:26:25 vm4 pve-ha-crm[5418]: status change master => lost_manager_lock
Dec  1 19:26:25 vm4 pve-ha-crm[5418]: watchdog closed (disabled)
Dec  1 19:26:25 vm4 pve-ha-crm[5418]: status change lost_manager_lock => wait_for_quorum
Dec  1 19:26:29 vm4 pve-ha-lrm[5464]: status change active => lost_agent_lock
Dec  1 19:27:00 vm4 systemd[1]: Starting Proxmox VE replication runner...
Dec  1 19:27:01 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec  1 19:27:02 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec  1 19:27:03 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec  1 19:27:04 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec  1 19:27:05 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec  1 19:27:06 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec  1 19:27:07 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec  1 19:27:08 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec  1 19:27:09 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec  1 19:27:10 vm4 pvesr[5555]: error with cfs lock 'file-replication_cfg': no quorum!
Dec  1 19:27:10 vm4 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Dec  1 19:27:10 vm4 systemd[1]: Failed to start Proxmox VE replication runner.
Dec  1 19:27:10 vm4 systemd[1]: pvesr.service: Unit entered failed state.
Dec  1 19:27:10 vm4 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Dec  1 19:27:16 vm4 watchdog-mux[3598]: client watchdog expired - disable watchdog updates

Immediately after that last log line, the server reboots.
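
The "~60 seconds" can be checked against the log above. This is only a small sketch: `to_seconds` is a hypothetical helper, and the two timestamps are copied from the `node lost quorum` and `client watchdog expired` lines.

```shell
# Convert an HH:MM:SS syslog timestamp to seconds since midnight.
# awk parses each field as a decimal number, so "08" is not read as octal.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Both events are on the same day, so no date handling is needed.
quorum_lost=$(to_seconds "19:26:16")       # pmxcfs: "node lost quorum"
watchdog_expired=$(to_seconds "19:27:16")  # watchdog-mux: "client watchdog expired"

echo "$((watchdog_expired - quorum_lost)) seconds"
```

So the watchdog fired exactly one minute after quorum was lost, which matches the delay observed before the second node rebooted.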

And pvecm under normal circumstances:

Code:

root@vm4:~# pvecm status
Quorum information
------------------
Date:             Sun Dec  1 20:23:52 2019
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          2/3476
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 172.16.0.12
0x00000001          1 172.16.0.13 (local)

Thanks,
Richard

aaron · Dec 2, 2019

Do you have any HA configured? If so, the problem is that with only two nodes you lose quorum as soon as one node goes offline. If HA is enabled, the remaining node will then fence itself.

Solutions: disable HA or add a third vote to the cluster. You can add a so-called QDevice on another machine (like a Raspberry Pi) and have a third vote. See https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_external_vote_support
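
To illustrate the vote arithmetic, here is a rough sketch. It assumes the plain majority rule (more than half of the expected votes) with no special corosync options such as `two_node`; the function names `quorum_threshold` and `is_quorate` are made up for this example. It agrees with the `pvecm status` output earlier in the thread (expected votes 2, quorum 2).

```shell
# Majority threshold for a given number of expected votes:
# floor(expected / 2) + 1 (shell arithmetic uses integer division).
quorum_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit status 0) if the votes still present reach the threshold.
is_quorate() {
    expected=$1
    present=$2
    [ "$present" -ge "$(quorum_threshold "$expected")" ]
}

# Two nodes, one of them rebooting: 1 of 2 votes present.
is_quorate 2 1 && echo "2 votes, 1 present: quorate" || echo "2 votes, 1 present: NOT quorate"

# Two nodes plus a QDevice, one node rebooting: 2 of 3 votes present.
is_quorate 3 2 && echo "3 votes, 2 present: quorate" || echo "3 votes, 2 present: NOT quorate"
```

With two votes the threshold is 2, so a single surviving node can never be quorate. With a third vote the threshold stays at 2, and one node can go down without the other losing quorum.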

Richard Goode · Dec 3, 2019

Thanks Aaron. I did consider a QDevice on a Pi, but I haven't gotten around to that yet.

I was playing with HA with 1 VM. After the first time the unexpected reboot happened, I removed this test VM from the HA config before I patched/rebooted the second node, but the same still happened.

To disable HA, do I just remove all resources from the HA list? If so, that's what I had done. Anyway, I'll test that again this afternoon.
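
One way to double-check that the HA list really is empty is to count the resources that `ha-manager status` reports. This is only a sketch: `count_ha_services` is a hypothetical helper, and it assumes that each HA resource shows up in that output as a line beginning with `service `.

```shell
# Count HA resources in `ha-manager status` output read from stdin.
# Assumption: every configured resource is printed as a line starting
# with "service ". grep -c exits non-zero when the count is 0, so the
# "|| true" keeps the function's exit status at 0 in that case.
count_ha_services() {
    grep -c '^service ' || true
}

# Only call ha-manager where it exists (i.e. on a Proxmox VE node).
if command -v ha-manager >/dev/null 2>&1; then
    ha-manager status | count_ha_services
fi
```

A result of 0 would mean no resources are under HA management. Any that remain can be taken out with `ha-manager remove <sid>`.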

As an alternative, I'm thinking of modifying the corosync config to reduce the required votes (or turning on the two_node option).

Richard Goode · Dec 3, 2019

OK, I tested this just now and the problem is resolved.

I'm certain that no resources were in my HA list the second time the problem occurred. However, since then (and since both nodes have been rebooted) it has been stable, so I think it must have been in an odd state that a reboot fixed.

Thanks for the help.

Regards,
Richard
