I have a running cluster (5.4-13) with 11 nodes and everything is up and running until I reboot one of the nodes. Attached are the corosync.conf and the pvecm status output; everything looks OK. However, when I reboot one of the nodes, I lose quorum on a random other node once the rebooted node comes back online. There are two different types of error messages that can occur on the running node that loses quorum:
======
ERROR1
======
Oct 17 10:19:12 pmnode12 corosync[10721]: error [TOTEM ] FAILED TO RECEIVE
Oct 17 10:19:12 pmnode12 corosync[10721]: [TOTEM ] FAILED TO RECEIVE
Oct 17 10:19:21 pmnode12 corosync[10721]: notice [TOTEM ] A new membership (213.132.140.97:86456) was formed. Members left: 8 13 9 4 5 1 15 14 6 3
Oct 17 10:19:21 pmnode12 corosync[10721]: notice [TOTEM ] Failed to receive the leave message. failed: 8 13 9 4 5 1 15 14 6 3
Oct 17 10:19:21 pmnode12 corosync[10721]: warning [CPG ] downlist left_list: 10 received
Oct 17 10:19:21 pmnode12 corosync[10721]: [TOTEM ] A new membership (213.132.140.97:86456) was formed. Members left: 8 13 9 4 5 1 15 14 6 3
Oct 17 10:19:21 pmnode12 corosync[10721]: [TOTEM ] Failed to receive the leave message. failed: 8 13 9 4 5 1 15 14 6 3
Oct 17 10:19:21 pmnode12 corosync[10721]: notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
======
ERROR2
======
Oct 17 10:27:57 pmnode13 corosync[28530]: [TOTEM ] Retransmit List: 2 3 4 c 5 6 7 8 9 a b
Oct 17 10:27:57 pmnode13 corosync[28530]: error [TOTEM ] FAILED TO RECEIVE
Oct 17 10:27:57 pmnode13 corosync[28530]: [TOTEM ] FAILED TO RECEIVE
Oct 17 10:27:57 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 50
Oct 17 10:27:58 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retry 70
Oct 17 10:27:58 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 60
Oct 17 10:27:59 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retry 80
Oct 17 10:27:59 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 70
Oct 17 10:28:00 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retry 90
Oct 17 10:28:00 pmnode13 systemd[1]: Starting Proxmox VE replication runner...
Oct 17 10:28:01 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 80
Oct 17 10:28:01 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retry 100
Oct 17 10:28:01 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retried 100 times
Oct 17 10:28:01 pmnode13 pmxcfs[645]: [dcdb] crit: cpg_send_message failed: 6
Oct 17 10:28:01 pmnode13 CRON[10771]: (root) CMD (cd /tmp && iostat -xkd 30 2 | sed 's/,/\./g' > io.tmp && sleep 1 && mv io.tmp iostat.cache 2>/dev/null)
Oct 17 10:28:02 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 90
Oct 17 10:28:02 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retry 10
Oct 17 10:28:03 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 100
Oct 17 10:28:03 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retried 100 times
Oct 17 10:28:03 pmnode13 pmxcfs[645]: [status] crit: cpg_send_message failed: 6
Oct 17 10:28:03 pmnode13 pve-firewall[2295]: firewall update time (7.641 seconds)
Oct 17 10:28:03 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retry 20
Oct 17 10:28:04 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 10
Oct 17 10:28:04 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retry 30
Oct 17 10:28:05 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 20
Oct 17 10:28:05 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retry 40
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 30
Oct 17 10:28:06 pmnode13 corosync[28530]: notice [TOTEM ] A new membership (213.132.140.98:86600) was formed. Members left: 8 2 9 4 5 1 15 14 6 3
Oct 17 10:28:06 pmnode13 corosync[28530]: notice [TOTEM ] Failed to receive the leave message. failed: 8 2 9 4 5 1 15 14 6 3
Oct 17 10:28:06 pmnode13 corosync[28530]: [TOTEM ] A new membership (213.132.140.98:86600) was formed. Members left: 8 2 9 4 5 1 15 14 6 3
Oct 17 10:28:06 pmnode13 corosync[28530]: [TOTEM ] Failed to receive the leave message. failed: 8 2 9 4 5 1 15 14 6 3
Oct 17 10:28:06 pmnode13 corosync[28530]: warning [CPG ] downlist left_list: 9 received
Oct 17 10:28:06 pmnode13 corosync[28530]: [CPG ] downlist left_list: 9 received
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [dcdb] notice: members: 13/645
Oct 17 10:28:06 pmnode13 corosync[28530]: notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 17 10:28:06 pmnode13 corosync[28530]: notice [QUORUM] Members[1]: 13
Oct 17 10:28:06 pmnode13 corosync[28530]: notice [MAIN ] Completed service synchronization, ready to provide service.
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: members: 13/645
Oct 17 10:28:06 pmnode13 corosync[28530]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 17 10:28:06 pmnode13 corosync[28530]: [QUORUM] Members[1]: 13
Oct 17 10:28:06 pmnode13 corosync[28530]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: node lost quorum
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retried 46 times
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [dcdb] crit: received write while not quorate - trigger resync
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [dcdb] crit: leaving CPG group
Oct 17 10:28:06 pmnode13 pve-ha-lrm[31006]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pmnode13/lrm_status.tmp.31006' - Permission denied
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retried 31 times
Oct 17 10:28:06 pmnode13 pvesr[10763]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [dcdb] notice: start cluster connection
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [dcdb] notice: members: 13/645
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [dcdb] notice: all data is up to date
Oct 17 10:28:06 pmnode13 pvestatd[757]: status update time (13.220 seconds)
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/222: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/108: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/123: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/157: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/181: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/198: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/189: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/111: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/117: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/101: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/217: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pmnode13/VM-backups-backup12: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pmnode13/VM-backups-backup11: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pmnode13/local: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pmnode13/VM-backups-backup17: -1
Oct 17 10:28:07 pmnode13 pvesr[10763]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 17 10:28:15 pmnode13 pvesr[10763]: error with cfs lock 'file-replication_cfg': no quorum!
Oct 17 10:28:15 pmnode13 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Oct 17 10:28:15 pmnode13 systemd[1]: Failed to start Proxmox VE replication runner.
Oct 17 10:28:15 pmnode13 systemd[1]: pvesr.service: Unit entered failed state.
Oct 17 10:28:15 pmnode13 systemd[1]: pvesr.service: Failed with result 'exit-code'.
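This is roughly how I check the affected node right after it drops out (just a sketch of the usual commands, nothing special):

pvecm status                    # quorum view; shows the node on its own with 1 vote
corosync-quorumtool -s          # corosync's own quorum status
corosync-cfgtool -s             # ring status of the local node
journalctl -u corosync -n 100   # recent corosync messages around the membership change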
Once one of the above errors occurs, the node ends up with a completely different ring ID and only 1 vote. Most of the time the cluster recovers automatically after a few minutes, but sometimes I have to restart corosync to get things back in sync.
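When it does not recover by itself, this is more or less all I do on the affected node (a sketch of the manual recovery, nothing more sophisticated than restarting corosync and checking the result):

systemctl restart corosync      # restart the totem/membership layer on the affected node
pvecm status                    # verify the node rejoined the primary component and is quorate again
journalctl -u corosync -n 50    # confirm the new membership includes all nodes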
I have no idea what causes or triggers the above errors, but I'd like to get this issue solved. I'm a little afraid that one day I will be unable to start VMs on a node that has entered this kind of failed state.
Kind regards and thanks in advance for your reply,
Gijsbert