Hi,
tl;dr:
I added nodes to my 4-node cluster; while adding node 10, the cluster collapsed. I think I recovered, but I had to guess far too much, so I would like to benefit from your experience on how to continue and what to do better.
I had a little 4-node cluster running for a few weeks and yesterday added some more nodes. The nodes have IPs 10.x.y.101 - 10.x.y.115. I kept things simple and followed the Admin Guide. All added nodes were "empty" (no VMs or anything), and all nodes are the same (cheap desktop) hardware with the same configuration except IP and hostname.
I was able to add nodes up to 9 in total and saw more and more of them turn green in the web GUI of the first node (10.x.y.101). I used "pvecm add 10.x.y.101 --node 102 --use_ssh" to add each one (with the node number unique on each, of course).
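For reference, the basic shape of that join step per the Admin Guide is roughly the following (a sketch only; I additionally passed a unique node id and node number, as in the command above):
Code:
# run on each new, empty node; 10.x.y.101 is the first/existing cluster node
# --use_ssh makes the join go over SSH instead of the API
pvecm add 10.x.y.101 --use_ssh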
When adding node 10, the command got stuck at "waiting for quorum".
The whole cluster stopped working: no web GUI could be accessed, and SSH with an SSH key did not work. Luckily this is still in the test room with root passwords, so I could log in with a password.
I found corosync at 100% CPU, and access to /etc/pve blocking (hence no SSH-key login, I assume).
I could not use the web GUI of any of the nodes (actually I only tried 5 or so).
pvecm status showed each node on its own. I rebooted node 10. On node 1 I still saw corosync at 100% CPU, so a bit later I also rebooted it (node 1). Directly after the reboot, I killed corosync on node 10 and stopped a pve service, as suggested by a teammate. I then had two sets of 5 nodes, each waiting for the sixth (I assume the reboot somehow "unlocked" something, every node came back up, some quorum started, and due to a weakness in the protocol I ended up "split-brain", which I think should not happen).
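In hindsight, I could probably have confirmed the two partitions on each node with something along these lines (a guess at useful checks, not what I actually ran at the time):
Code:
# quick look at corosync membership and link state on one node (guessed checks)
systemctl status corosync pve-cluster --no-pager
corosync-quorumtool -s      # quorum / membership as corosync sees it
corosync-cfgtool -s         # knet link status towards the other nodes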
I waited half an hour or so, but it did not resolve, so I rebooted nodes 8 and 9 to see what happens. A little later I noticed that the quorum of 6 was reached, and a little later the other 2 nodes joined as well, so there were 6 of 10, with 9 online. Apparently this broke the "deadlock".
I got heaps of log messages in journalctl, like:
Code:
Sep 12 18:46:20 xy-101 pvedaemon[3780712]: <root@pam> successful auth for user 'root@pam'
Sep 12 18:46:23 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:46:37 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:46:51 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:47:06 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:47:08 xy-101 kernel: INFO: task pvedaemon worke:3533781 blocked for more than 241 seconds.
Sep 12 18:47:08 xy-101 kernel: Tainted: P O 5.15.30-2-pve #1
Sep 12 18:47:08 xy-101 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 12 18:47:08 xy-101 kernel: task:pvedaemon worke state:D stack: 0 pid:3533781 ppid: 1069 flags:0x00000004
Sep 12 18:47:08 xy-101 kernel: Call Trace:
Sep 12 18:47:08 xy-101 kernel: <TASK>
Sep 12 18:47:08 xy-101 kernel: __schedule+0x33d/0x1750
Sep 12 18:47:08 xy-101 kernel: ? path_parentat+0x4c/0x90
Sep 12 18:47:08 xy-101 kernel: ? filename_parentat+0xd7/0x1e0
Sep 12 18:47:08 xy-101 kernel: schedule+0x4e/0xb0
Sep 12 18:47:08 xy-101 kernel: rwsem_down_write_slowpath+0x217/0x4d0
Sep 12 18:47:08 xy-101 kernel: down_write+0x43/0x50
Sep 12 18:47:08 xy-101 kernel: filename_create+0x75/0x150
Sep 12 18:47:08 xy-101 kernel: do_mkdirat+0x48/0x140
Sep 12 18:47:08 xy-101 kernel: __x64_sys_mkdir+0x4c/0x70
Sep 12 18:47:08 xy-101 kernel: do_syscall_64+0x59/0xc0
Sep 12 18:47:08 xy-101 kernel: ? __x64_sys_alarm+0x4a/0x90
Sep 12 18:47:08 xy-101 kernel: ? exit_to_user_mode_prepare+0x37/0x1b0
Sep 12 18:47:08 xy-101 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Sep 12 18:47:08 xy-101 kernel: ? do_syscall_64+0x69/0xc0
Sep 12 18:47:08 xy-101 kernel: ? asm_sysvec_apic_timer_interrupt+0xa/0x20
Sep 12 18:47:08 xy-101 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Sep 12 18:47:08 xy-101 kernel: RIP: 0033:0x7f493209cb07
Sep 12 18:47:08 xy-101 kernel: RSP: 002b:00007ffd493ce1b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
Sep 12 18:47:08 xy-101 kernel: RAX: ffffffffffffffda RBX: 0000555a5ecc22a0 RCX: 00007f493209cb07
Sep 12 18:47:08 xy-101 kernel: RDX: 0000555a5e7ecae5 RSI: 00000000000001ff RDI: 0000555a65f8c870
Sep 12 18:47:08 xy-101 kernel: RBP: 0000000000000000 R08: 0000555a5ecc22a0 R09: 0000000000000111
Sep 12 18:47:08 xy-101 kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000555a65f8c870
Sep 12 18:47:08 xy-101 kernel: R13: 0000555a65f972f8 R14: 0000555a608532d0 R15: 00000000000001ff
Sep 12 18:47:08 xy-101 kernel: </TASK>
Sep 12 18:47:08 xy-101 kernel: INFO: task pvedaemon worke:854324 blocked for more than 241 seconds.
Sep 12 18:47:08 xy-101 kernel: Tainted: P O 5.15.30-2-pve #1
Sep 12 18:47:08 xy-101 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 12 18:47:08 xy-101 kernel: task:pvedaemon worke state:D stack: 0 pid:854324 ppid: 1069 flags:0x00000004
...
Sep 12 18:49:01 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:49:13 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:49:25 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:49:39 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:49:52 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:50:01 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 10
Sep 12 18:50:02 xy-101 kernel: sched: RT throttling activated
Sep 12 18:50:02 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 20
Sep 12 18:50:03 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 30
Sep 12 18:50:04 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 40
Sep 12 18:50:05 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 50
Sep 12 18:50:06 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:50:06 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 60
Sep 12 18:50:07 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 70
Sep 12 18:50:08 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 80
Sep 12 18:50:09 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 90
Sep 12 18:50:10 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 100
Sep 12 18:50:10 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retried 100 times
Sep 12 18:50:10 xy-101 pmxcfs[910]: [status] crit: cpg_send_message failed: 6
Sep 12 18:50:10 xy-101 pve-firewall[1042]: firewall update time (7.938 seconds)
Sep 12 18:50:11 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 10
...
Sep 12 18:51:17 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 70
Sep 12 18:51:18 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 80
Sep 12 18:51:19 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 90
Sep 12 18:51:20 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 100
Sep 12 18:51:20 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retried 100 times
Sep 12 18:51:20 xy-101 pmxcfs[910]: [status] crit: cpg_send_message failed: 6
Sep 12 18:51:20 xy-101 pve-firewall[1042]: firewall update time (10.178 seconds)
Sep 12 18:51:21 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 10
Sep 12 18:51:22 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 20
Sep 12 18:51:23 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 30
Sep 12 18:51:24 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 40
Sep 12 18:51:25 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 50
Sep 12 18:51:26 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 60
Sep 12 18:51:26 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
...
(I can still look up the logs; what should I share?)
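For example, I could pull the corosync and pmxcfs (pve-cluster) messages for that time window like this (just my guess at what would be useful; the output file name is made up):
Code:
# extract corosync + pmxcfs logs for the incident window on one node
journalctl -u corosync -u pve-cluster \
    --since "2022-09-12 18:40" --until "2022-09-12 19:00" > xy-101-corosync-pmxcfs.log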
Then I tried again with node 10, and the whole cluster got stuck again. It was late, so I stopped node 10 and went home.
Today I tried to see whether this is reproducible, in order to collect better diagnostics, but this time the same procedure apparently worked fine, so I could not reproduce it.
My questions:
1. Did I do something wrong (i.e. adding 6 nodes right after one another), and/or did I hit some bug? Should I wait a few minutes after each node to let things settle down?
2. Why can one node pull down the whole cluster? I thought using a cluster should protect against exactly this. Can I do better or protect against it? Is this only an issue when adding nodes, or can it happen at any time?
3. If this happens, what is the best way to recover? Here, killing corosync on node 10 seemed to resolve the "dead" state, but I had no way to tell that it was node 10 (only that it was the most recently added one). Is there a way to identify a rogue node in order to turn it off (to quickly bring the cluster back to life until a maintenance window)?
4. When this happens again, what should I save (command output, logfiles) to assist troubleshooting? (A rough guess at what I would grab is sketched right after this list.)
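Regarding question 4, my current (unverified) guess for a minimal snapshot per node, taken while the problem is live, would be something like:
Code:
# rough guess at a per-node snapshot while the cluster is stuck
date > snapshot-$(hostname).txt
pvecm status >> snapshot-$(hostname).txt 2>&1
top -b -n 1 | head -n 20 >> snapshot-$(hostname).txt       # e.g. to capture the corosync CPU usage
cp /etc/corosync/corosync.conf corosync.conf.$(hostname)   # from /etc/corosync, since /etc/pve was blocking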
Since this seems to show that removing the node worked (and I had documented it anyway), I include the procedure here in case it helps someone else.
So I followed https://pve.proxmox.com/wiki/Cluster_Manager "Separate a Node Without Reinstalling" to step back. Additionally, I deleted its node directory on every other node:
Code:
ansible all -m ansible.builtin.shell -l \!xy-110 -a "rm -rf /etc/pve/nodes/xy-110/"
Then I searched for leftovers with find /etc/ -type f | xargs grep '\.110' and removed the node entry; the hits were:
Code:
/etc/pve/corosync.conf: ring0_addr: 10.241.197.110
/etc/corosync/corosync.conf: ring0_addr: 10.241.197.110
(I edited just one; the other was updated automatically.) I checked a second node, and it was clean as well (so it synced fine). I waited another minute and rebooted 110 and 101 (but no others).
At 101:
Code:
root@labhen197-101:~# pvecm status
Cluster information
-------------------
Name: xyz-pve
Config Version: 12
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Tue Sep 13 09:46:49 2022
Quorum provider: corosync_votequorum
Nodes: 9
Node ID: 0x00000001
Ring ID: 1.fd0
Quorate: Yes
Votequorum information
----------------------
Expected votes: 9
Highest expected: 9
Total votes: 9
Quorum: 5
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.x.y.101 (local)
0x00000002 1 10.x.y.105
0x00000003 1 10.x.y.106
0x00000004 1 10.x.y.107
0x00000005 1 10.x.y.108
0x00000006 1 10.x.y.109
0x00000066 1 10.x.y.102
0x00000067 1 10.x.y.103
0x00000068 1 10.x.y.104
root@labhen197-101:~#
In the web GUI on nodes 1 and 2 (and probably others, but I didn't check) I see 9 nodes in the cluster.
On node 10, I see just node 10.
So I assume I successfully removed the node from the cluster.
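As an extra cross-check (only a sketch; I mainly looked at the GUI), the membership could also be compared on the command line:
Code:
# on a remaining cluster node: should list 9 nodes, without xy-110
pvecm nodes
# on the separated node 110: should only show itself
pvecm status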
Then I added node 10 again:
pvecm add 10.241.197.101 --node 110 --use_ssh
Well, this time it worked, and now I need to find out how to continue correctly.
Any hint, pointer or suggestion from your experience is appreciated!
Steffen