Hi,
tl;dr:
I added nodes to my 4-node cluster; while adding node 10, the cluster collapsed. I think I recovered, but I had to guess far too much, so I would like to benefit from your experience on how to continue and what to do better.
I had a little 4-node cluster running for a few weeks and yesterday added some more nodes. The nodes have IPs 10.x.y.101 - 10.x.y.115. I kept things simple and followed the Admin Guide. All added nodes were "empty" (no VMs or anything), and all nodes are the same (cheap desktop) hardware with the same configuration except IP and hostname.
I was able to add nodes up to 9 in total and saw more and more of them turn green in the web GUI of the first node (10.x.y.101). I used "pvecm add 10.x.y.101 --node 102 --use_ssh" to add each one (with the node number unique on each, of course).
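For reference, the basic shape of that join step per the Admin Guide is roughly the following (a sketch only; I additionally passed a unique node id and node number, as in the command above):
Code:
# run on each new, empty node; 10.x.y.101 is the first/existing cluster node
# --use_ssh makes the join go over SSH instead of the API
pvecm add 10.x.y.101 --use_ssh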
When adding node 10, the command got stuck at "waiting for quorum".
The whole cluster stopped working: no web GUI could be accessed, and SSH with an SSH key did not work. Luckily this is still in the test room with root passwords, so I could log in with a password.
I found corosync at 100% CPU, and access to /etc/pve blocking (hence no SSH-key login, I assume).
I could not use the web GUI of any of the nodes (actually I only tried 5 or so).
pvecm status showed each node on its own. I rebooted node 10. On node 1 I still saw corosync at 100% CPU, so a bit later I also rebooted it (node 1). Directly after the reboot, I killed corosync on node 10 and stopped a pve service, as suggested by a teammate. I then had two sets of 5 nodes, each waiting for the sixth (I assume the reboot somehow "unlocked" something, every node came back up, some quorum started, and due to a weakness in the protocol I ended up "split-brain", which I think should not happen).
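In hindsight, I could probably have confirmed the two partitions on each node with something along these lines (a guess at useful checks, not what I actually ran at the time):
Code:
# quick look at corosync membership and link state on one node (guessed checks)
systemctl status corosync pve-cluster --no-pager
corosync-quorumtool -s      # quorum / membership as corosync sees it
corosync-cfgtool -s         # knet link status towards the other nodes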
I waited half an hour or so, but it did not resolve, so I rebooted nodes 8 and 9 to see what happens. A little later I noticed that the quorum of 6 was reached, and a little later the other 2 nodes joined as well, so there were 6 of 10, with 9 online. Apparently this broke the "deadlock".
I got heaps of log messages in journalctl, like:
Code:
Sep 12 18:46:20 xy-101 pvedaemon[3780712]: <root@pam> successful auth for user 'root@pam'
Sep 12 18:46:23 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:46:37 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:46:51 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:47:06 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:47:08 xy-101 kernel: INFO: task pvedaemon worke:3533781 blocked for more than 241 seconds.
Sep 12 18:47:08 xy-101 kernel: Tainted: P O 5.15.30-2-pve #1
Sep 12 18:47:08 xy-101 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 12 18:47:08 xy-101 kernel: task:pvedaemon worke state:D stack: 0 pid:3533781 ppid: 1069 flags:0x00000004
Sep 12 18:47:08 xy-101 kernel: Call Trace:
Sep 12 18:47:08 xy-101 kernel: <TASK>
Sep 12 18:47:08 xy-101 kernel: __schedule+0x33d/0x1750
Sep 12 18:47:08 xy-101 kernel: ? path_parentat+0x4c/0x90
Sep 12 18:47:08 xy-101 kernel: ? filename_parentat+0xd7/0x1e0
Sep 12 18:47:08 xy-101 kernel: schedule+0x4e/0xb0
Sep 12 18:47:08 xy-101 kernel: rwsem_down_write_slowpath+0x217/0x4d0
Sep 12 18:47:08 xy-101 kernel: down_write+0x43/0x50
Sep 12 18:47:08 xy-101 kernel: filename_create+0x75/0x150
Sep 12 18:47:08 xy-101 kernel: do_mkdirat+0x48/0x140
Sep 12 18:47:08 xy-101 kernel: __x64_sys_mkdir+0x4c/0x70
Sep 12 18:47:08 xy-101 kernel: do_syscall_64+0x59/0xc0
Sep 12 18:47:08 xy-101 kernel: ? __x64_sys_alarm+0x4a/0x90
Sep 12 18:47:08 xy-101 kernel: ? exit_to_user_mode_prepare+0x37/0x1b0
Sep 12 18:47:08 xy-101 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Sep 12 18:47:08 xy-101 kernel: ? do_syscall_64+0x69/0xc0
Sep 12 18:47:08 xy-101 kernel: ? asm_sysvec_apic_timer_interrupt+0xa/0x20
Sep 12 18:47:08 xy-101 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Sep 12 18:47:08 xy-101 kernel: RIP: 0033:0x7f493209cb07
Sep 12 18:47:08 xy-101 kernel: RSP: 002b:00007ffd493ce1b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
Sep 12 18:47:08 xy-101 kernel: RAX: ffffffffffffffda RBX: 0000555a5ecc22a0 RCX: 00007f493209cb07
Sep 12 18:47:08 xy-101 kernel: RDX: 0000555a5e7ecae5 RSI: 00000000000001ff RDI: 0000555a65f8c870
Sep 12 18:47:08 xy-101 kernel: RBP: 0000000000000000 R08: 0000555a5ecc22a0 R09: 0000000000000111
Sep 12 18:47:08 xy-101 kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000555a65f8c870
Sep 12 18:47:08 xy-101 kernel: R13: 0000555a65f972f8 R14: 0000555a608532d0 R15: 00000000000001ff
Sep 12 18:47:08 xy-101 kernel: </TASK>
Sep 12 18:47:08 xy-101 kernel: INFO: task pvedaemon worke:854324 blocked for more than 241 seconds.
Sep 12 18:47:08 xy-101 kernel: Tainted: P O 5.15.30-2-pve #1
Sep 12 18:47:08 xy-101 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 12 18:47:08 xy-101 kernel: task:pvedaemon worke state:D stack: 0 pid:854324 ppid: 1069 flags:0x00000004
...
Sep 12 18:49:01 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:49:13 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:49:25 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:49:39 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:49:52 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:50:01 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 10
Sep 12 18:50:02 xy-101 kernel: sched: RT throttling activated
Sep 12 18:50:02 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 20
Sep 12 18:50:03 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 30
Sep 12 18:50:04 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 40
Sep 12 18:50:05 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 50
Sep 12 18:50:06 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
Sep 12 18:50:06 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 60
Sep 12 18:50:07 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 70
Sep 12 18:50:08 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 80
Sep 12 18:50:09 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 90
Sep 12 18:50:10 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 100
Sep 12 18:50:10 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retried 100 times
Sep 12 18:50:10 xy-101 pmxcfs[910]: [status] crit: cpg_send_message failed: 6
Sep 12 18:50:10 xy-101 pve-firewall[1042]: firewall update time (7.938 seconds)
Sep 12 18:50:11 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 10
...
Sep 12 18:51:17 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 70
Sep 12 18:51:18 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 80
Sep 12 18:51:19 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 90
Sep 12 18:51:20 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 100
Sep 12 18:51:20 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retried 100 times
Sep 12 18:51:20 xy-101 pmxcfs[910]: [status] crit: cpg_send_message failed: 6
Sep 12 18:51:20 xy-101 pve-firewall[1042]: firewall update time (10.178 seconds)
Sep 12 18:51:21 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 10
Sep 12 18:51:22 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 20
Sep 12 18:51:23 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 30
Sep 12 18:51:24 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 40
Sep 12 18:51:25 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 50
Sep 12 18:51:26 xy-101 pmxcfs[910]: [status] notice: cpg_send_message retry 60
Sep 12 18:51:26 xy-101 corosync[1023]: [TOTEM ] Token has not been received in 6150 ms
...
(I can still look up the logs; what should I share?)
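For example, I could pull the corosync and pmxcfs (pve-cluster) messages for that time window like this (just my guess at what would be useful; the output file name is made up):
Code:
# extract corosync + pmxcfs logs for the incident window on one node
journalctl -u corosync -u pve-cluster \
    --since "2022-09-12 18:40" --until "2022-09-12 19:00" > xy-101-corosync-pmxcfs.log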
Then I tried again with node 10, and the whole cluster got stuck again. It was late, so I stopped node 10 and went home.
Today I tried to see whether this is reproducible, in order to collect better diagnostics, but this time the same procedure apparently worked fine, so I could not reproduce it.
My questions:
1. Did I do something wrong (i.e. adding 6 nodes right after one another), and/or did I hit some bug? Should I wait a few minutes after each node to let things settle down?
2. Why can one node pull down the whole cluster? I thought using a cluster should protect against exactly this. Can I do better or protect against it? Is this only an issue when adding nodes, or can it happen at any time?
3. If this happens, what is the best way to recover? Here, killing corosync on node 10 seemed to resolve the "dead" state, but I had no way to tell that it was node 10 (only that it was the most recently added one). Is there a way to identify a rogue node in order to turn it off (to quickly bring the cluster back to life until a maintenance window)?
4. When this happens again, what should I save (command output, logfiles) to assist troubleshooting? (A rough guess at what I would grab is sketched right after this list.)
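Regarding question 4, my current (unverified) guess for a minimal snapshot per node, taken while the problem is live, would be something like:
Code:
# rough guess at a per-node snapshot while the cluster is stuck
date > snapshot-$(hostname).txt
pvecm status >> snapshot-$(hostname).txt 2>&1
top -b -n 1 | head -n 20 >> snapshot-$(hostname).txt       # e.g. to capture the corosync CPU usage
cp /etc/corosync/corosync.conf corosync.conf.$(hostname)   # from /etc/corosync, since /etc/pve was blocking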
Since this seems to show that removing the node worked (and I had documented it anyway), I include the procedure here in case it helps someone else.
So I followed https://pve.proxmox.com/wiki/Cluster_Manager "Separate a Node Without Reinstalling" to step back. Additionally, I deleted its node directory on every other node:
Code:
ansible all -m ansible.builtin.shell -l \!xy-110 -a "rm -rf /etc/pve/nodes/xy-110/"
Then I searched for leftovers with find /etc/ -type f | xargs grep '\.110' and removed the node entry; the hits were:
Code:
/etc/pve/corosync.conf: ring0_addr: 10.241.197.110
/etc/corosync/corosync.conf: ring0_addr: 10.241.197.110
(I edited just one; the other was updated automatically.) I checked a second node, and it was clean as well (so it synced fine). I waited another minute and rebooted 110 and 101 (but no others).
At 101:
Code:
root@labhen197-101:~# pvecm status
Cluster information
-------------------
Name: xyz-pve
Config Version: 12
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Tue Sep 13 09:46:49 2022
Quorum provider: corosync_votequorum
Nodes: 9
Node ID: 0x00000001
Ring ID: 1.fd0
Quorate: Yes
Votequorum information
----------------------
Expected votes: 9
Highest expected: 9
Total votes: 9
Quorum: 5
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.x.y.101 (local)
0x00000002 1 10.x.y.105
0x00000003 1 10.x.y.106
0x00000004 1 10.x.y.107
0x00000005 1 10.x.y.108
0x00000006 1 10.x.y.109
0x00000066 1 10.x.y.102
0x00000067 1 10.x.y.103
0x00000068 1 10.x.y.104
root@labhen197-101:~#
In the web GUI on nodes 1 and 2 (and probably others, but I didn't check) I see 9 nodes in the cluster.
On node 10, I see just node 10.
So I assume I successfully removed the node from the cluster.
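As an extra cross-check (only a sketch; I mainly looked at the GUI), the membership could also be compared on the command line:
Code:
# on a remaining cluster node: should list 9 nodes, without xy-110
pvecm nodes
# on the separated node 110: should only show itself
pvecm status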
Then I added node 10 again:
pvecm add 10.241.197.101 --node 110 --use_ssh
Well, this time it worked, and now I need to find out how to continue correctly.
Any hint, pointer or suggestion from your experience is appreciated!
Steffen