My 3-node cluster had been running for at least a year with no issues. I moved house a few months ago and the home lab was down for a while. I brought it back up in a new switch environment and had to fiddle with the new switch a bit to get LACP configured correctly. However, everything was up and running.
Then all of a sudden (yes, I know that's not how computers work) one of the nodes lost sync and dropped off the cluster, and I was getting messages like this in the log:
Code:
Jun 23 19:03:39 pveb pmxcfs[1269337]: [status] notice: cpg_send_message retry 100
Jun 23 19:03:39 pveb pmxcfs[1269337]: [status] notice: cpg_send_message retried 100 times
Jun 23 19:03:39 pveb pmxcfs[1269337]: [status] crit: cpg_send_message failed: 6
Jun 23 19:03:40 pveb pmxcfs[1269337]: [status] notice: cpg_send_message retry 10
Jun 23 19:03:41 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 59
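If more detail helps diagnosis, I can gather per-link state with the standard corosync tooling and post the output:

```shell
# Show knet link status as seen from this node (run on each of pvea/b/c)
corosync-cfgtool -s

# Quorum and membership view
pvecm status

# Follow corosync's log live to catch retransmit / link-down messages
journalctl -u corosync -f
```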
Eventually the node would resync after almost a day.
Just recently I lost sync again and cannot get the cluster to resync. I have 3 nodes: pvea, pveb and pvec. They all have local hosts file entries for each other and ping times between them are around 0.5 ms. I am not doing anything fancy with networking: a and c have bonded interfaces; b does not (yet).
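For reference, the hosts entries on each node follow this pattern (illustrative layout; the names map to the addresses shown in the pvecm output below):

```shell
# /etc/hosts fragment, present on all three nodes
192.168.1.193    pvea
192.168.1.195    pveb
192.168.1.197    pvec
```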
pvea cluster status:
Code:
root@pvea:~# pvecm status
Cluster information
-------------------
Name: HomeLab
Config Version: 3
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Sun Jun 23 19:09:05 2024
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.ad21
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.1.193 (local)
0x00000002 1 192.168.1.195
0x00000003 1 192.168.1.197
pveb: pvecm does not run
pvec:
Code:
root@pvec:~# pvecm status
Cluster information
-------------------
Name: HomeLab
Config Version: 3
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Sun Jun 23 19:11:00 2024
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000003
Ring ID: 1.ad21
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.1.193
0x00000002 1 192.168.1.195
0x00000003 1 192.168.1.197 (local)
corosync on pveb just endlessly repeats the same retransmit list:
Code:
root@pveb:~# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
Active: active (running) since Tue 2024-06-18 11:41:42 BST; 5 days ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 1186 (corosync)
Tasks: 9 (limit: 18976)
Memory: 2.0G
CPU: 2d 6h 22min 59.510s
CGroup: /system.slice/corosync.service
└─1186 /usr/sbin/corosync -f
Jun 23 19:11:28 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:30 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:31 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:33 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:35 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:36 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:38 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:39 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:41 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
Jun 23 19:11:42 pveb corosync[1186]: [TOTEM ] Retransmit List: 40 41 42 43 44 45 46 47 48 4a 4c 4d 4e 4f 51 54 55 5a>
I am actively considering shutting down node pvec, removing it from the cluster, rebuilding it as pved or some such, and restoring the VMs and containers from backups stored on my NAS.
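Before going that far, one thing I'm tempted to try is restarting the cluster stack on the stuck node (my understanding is this is safe with no HA resources configured, but corrections welcome):

```shell
# Check the pmxcfs side first; it runs under the pve-cluster unit
systemctl status pve-cluster
journalctl -u pve-cluster --since "1 hour ago"

# Then restart the stack on pveb only
# (assumption: no HA resources, so no risk of fencing/self-reboot)
systemctl restart pve-cluster corosync
```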
Any pointers?