Hi.
I have a 2-node Proxmox VE 5.4 cluster - I know 3 nodes are recommended, but I only have 2 servers. I don't need automatic failover, just manual migration, config sync, etc.
The cluster has been running for some time (6+ months), and today I ran into an issue while upgrading (5.4-3 to 5.4-13).
After I rebooted one node, the other node rebooted ~60 seconds later. It seems the watchdog expired.
How do I disable the watchdog / fencing in this cluster? I will handle any failover manually.
I could have sworn I have rebooted nodes many times before without this happening; possibly I changed something and forgot.
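In case it's relevant, here is roughly how I'd check for (and remove) HA resources, since as far as I understand the watchdog is only armed while the LRM still has HA services to manage (the VM ID below is just an example, not one of mine):
Code:
# list configured HA resources and the current CRM/LRM state
ha-manager config
ha-manager status

# remove a resource from HA management (example service ID)
ha-manager remove vm:100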
Here are the logs from one node (the nodes are called vm3 and vm4):
Code:
Dec 1 19:26:16 vm4 pmxcfs[5011]: [dcdb] notice: members: 1/5011
Dec 1 19:26:16 vm4 pmxcfs[5011]: [status] notice: members: 1/5011
Dec 1 19:26:16 vm4 corosync[5221]: notice [TOTEM ] A new membership (172.16.0.13:3468) was formed. Members left: 2
Dec 1 19:26:16 vm4 corosync[5221]: warning [CPG ] downlist left_list: 1 received
Dec 1 19:26:16 vm4 corosync[5221]: notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec 1 19:26:16 vm4 corosync[5221]: notice [QUORUM] Members[1]: 1
Dec 1 19:26:16 vm4 corosync[5221]: [TOTEM ] A new membership (172.16.0.13:3468) was formed. Members left: 2
Dec 1 19:26:16 vm4 corosync[5221]: notice [MAIN ] Completed service synchronization, ready to provide service.
Dec 1 19:26:16 vm4 corosync[5221]: [CPG ] downlist left_list: 1 received
Dec 1 19:26:16 vm4 corosync[5221]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec 1 19:26:16 vm4 corosync[5221]: [QUORUM] Members[1]: 1
Dec 1 19:26:16 vm4 corosync[5221]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 1 19:26:16 vm4 pmxcfs[5011]: [status] notice: node lost quorum
Dec 1 19:26:20 vm4 pve-ha-crm[5418]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Dec 1 19:26:24 vm4 pve-ha-lrm[5464]: lost lock 'ha_agent_vm4_lock - cfs lock update failed - Permission denied
Dec 1 19:26:25 vm4 pve-ha-crm[5418]: status change master => lost_manager_lock
Dec 1 19:26:25 vm4 pve-ha-crm[5418]: watchdog closed (disabled)
Dec 1 19:26:25 vm4 pve-ha-crm[5418]: status change lost_manager_lock => wait_for_quorum
Dec 1 19:26:29 vm4 pve-ha-lrm[5464]: status change active => lost_agent_lock
Dec 1 19:27:00 vm4 systemd[1]: Starting Proxmox VE replication runner...
Dec 1 19:27:01 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 1 19:27:02 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 1 19:27:03 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 1 19:27:04 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 1 19:27:05 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 1 19:27:06 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 1 19:27:07 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 1 19:27:08 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 1 19:27:09 vm4 pvesr[5555]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 1 19:27:10 vm4 pvesr[5555]: error with cfs lock 'file-replication_cfg': no quorum!
Dec 1 19:27:10 vm4 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Dec 1 19:27:10 vm4 systemd[1]: Failed to start Proxmox VE replication runner.
Dec 1 19:27:10 vm4 systemd[1]: pvesr.service: Unit entered failed state.
Dec 1 19:27:10 vm4 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Dec 1 19:27:16 vm4 watchdog-mux[3598]: client watchdog expired - disable watchdog updates
Immediately after that last log line, the server reboots.
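For what it's worth, this is roughly how I'd check whether the watchdog is actually armed on a node; the active-file path is my assumption based on how I understand watchdog-mux works:
Code:
# state of the Proxmox watchdog multiplexer
systemctl status watchdog-mux.service

# as far as I know, this file only exists while a client (CRM/LRM) is connected,
# i.e. while an expiry would actually fence the node (path assumed)
ls -l /run/watchdog-mux.active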
And here is pvecm status under normal circumstances:
Code:
root@vm4:~# pvecm status
Quorum information
------------------
Date: Sun Dec 1 20:23:52 2019
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 2/3476
Quorate: Yes
Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000002 1 172.16.0.12
0x00000001 1 172.16.0.13 (local)
Thanks,
Richard