My 3-node cluster had been running for at least a year with no issues. I moved house a few months ago and the home lab was down for a while. I brought it back up in a new switch environment and had to fiddle with the new switch a bit to get LACP configured correctly. However, everything was up and running.
Then all of a sudden (yes, I know that's not how computers work) one of the nodes lost sync and dropped off the cluster, and I was getting messages like this in the log:
Code:
Jun 23 19:03:39 pveb pmxcfs[1269337]: [status] notice: cpg_send_message retry 100
Jun 23 19:03:39 pveb pmxcfs[1269337]: [status] notice: cpg_send_message retried 100 times
Jun 23 19:03:39 pveb pmxcfs[1269337]: [status] crit: cpg_send_message failed: 6
Jun 23 19:03:40 pveb pmxcfs[1269337]: [status] notice: cpg_send_message retry 10
Jun 23 19:03:41 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 59
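If more detail helps diagnosis, I can gather per-link state with the standard corosync tooling and post the output:

```shell
# Show knet link status as seen from this node (run on each of pvea/b/c)
corosync-cfgtool -s

# Quorum and membership view
pvecm status

# Follow corosync's log live to catch retransmit / link-down messages
journalctl -u corosync -f
```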
Eventually the node would resync after almost a day.
Just recently I lost sync again and cannot get the cluster to resync. I have 3 nodes: pvea, pveb and pvec. They all have local hosts file entries for each other and ping times between them are around 0.5 ms. I am not doing anything fancy with networking: a and c have bonded interfaces; b does not (yet).
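For reference, the hosts entries on each node follow this pattern (illustrative layout; the names map to the addresses shown in the pvecm output below):

```shell
# /etc/hosts fragment, present on all three nodes
192.168.1.193    pvea
192.168.1.195    pveb
192.168.1.197    pvec
```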
pvea cluster status:
Code:
root@pvea:~# pvecm status
Cluster information
-------------------
Name: HomeLab
Config Version: 3
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Sun Jun 23 19:09:05 2024
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.ad21
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.1.193 (local)
0x00000002 1 192.168.1.195
0x00000003 1 192.168.1.197
pveb: pvecm does not run
pvec:
Code:
root@pvec:~# pvecm status
Cluster information
-------------------
Name: HomeLab
Config Version: 3
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Sun Jun 23 19:11:00 2024
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000003
Ring ID: 1.ad21
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.1.193
0x00000002 1 192.168.1.195
0x00000003 1 192.168.1.197 (local)
corosync on pveb just endlessly repeats the same retransmit list:
Code:
root@pveb:~# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
Active: active (running) since Tue 2024-06-18 11:41:42 BST; 5 days ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 1186 (corosync)
Tasks: 9 (limit: 18976)
Memory: 2.0G
CPU: 2d 6h 22min 59.510s
CGroup: /system.slice/corosync.service
└─1186 /usr/sbin/corosync -f
Jun 23 19:11:28 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:30 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:31 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:33 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:35 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:36 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:38 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:39 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:41 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:42 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
I am actively considering shutting down node pvec, removing it from the cluster, rebuilding it as pved or some such, and restoring the VMs and containers from backups stored on my NAS.
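Before going that far, one thing I'm tempted to try is restarting the cluster stack on the stuck node (my understanding is this is safe with no HA resources configured, but corrections welcome):

```shell
# Check the pmxcfs side first; it runs under the pve-cluster unit
systemctl status pve-cluster
journalctl -u pve-cluster --since "1 hour ago"

# Then restart the stack on pveb only
# (assumption: no HA resources, so no risk of fencing/self-reboot)
systemctl restart pve-cluster corosync
```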
Any pointers?