(This is a non-production environment)
After a recent power cut, I ended up with a split cluster of three nodes (my switch may have come up last...).
After repeated attempts to reboot them all, one at a time, then two at a time... I gave up. Granted, I should have asked for help at this point...
One of these nodes is a primary NAS server and provides other core functions, while the other two are just for testing. In an attempt to safeguard the primary machine, I followed the instructions for removing nodes from a cluster and ended up with each machine running in isolation (non-clustered).
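For reference, this is roughly the procedure I followed on each node to take it out of the cluster. I'm reconstructing it from the pvecm docs ("Separate a Node Without Reinstalling"), so it may not match exactly what I typed:
Bash:
# stop the cluster filesystem and corosync
systemctl stop pve-cluster corosync
# start pmxcfs in local mode so /etc/pve stays writable standalone
pmxcfs -l
# remove the corosync configuration
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*
# stop the local-mode pmxcfs and bring the normal service back
killall pmxcfs
systemctl start pve-cluster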
I then created a 'new' cluster on the primary machine and attempted to rebuild the cluster around it. Booting a second machine and attempting to have it join the new cluster failed, so I rebuilt that machine and tried to add it to the newly created cluster. It didn't go as planned.
The machine wouldn't complete the join process.
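In case the exact steps matter, the create/join was done from the command line, something along these lines (cluster name and IP are placeholders):
Bash:
# on the primary machine
pvecm create <clustername>
# on the machine being added, pointing at the primary node's IP
pvecm add <primary-node-ip>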
On the primary node I could see /etc/pve/nodes/<new machine>; however, while /etc/pve was mounted on the new machine, it didn't contain the nodes/ folder.
In the GUI, the primary node was responsive as normal and the new machine was listed as a node with a green tick. However, selecting the new machine reported a communication error, and eventually the green tick was replaced by a grey '?'. Selecting the cluster reports: '/etc/pve/nodes/<new machine>/pve-ssl.pem' doesn't exist (500).
Strangely, the primary node does list the new node in /etc/pve/nodes/.
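The quick check I've been doing on both nodes is just this (hostname is a placeholder); as I understand it, the folder should contain pve-ssl.pem and pve-ssl.key once a join completes:
Bash:
ls -l /etc/pve/nodes/<new machine>/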
Trying to force updatecerts works on the primary machine but hangs on the new machine. I can SSH between them without a problem.
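What I'm actually running there is (roughly):
Bash:
pvecm updatecerts --force
# completes on the primary node, hangs indefinitely on the new node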
After removing the node and re-joining it, I now have the nodes/ folder and both nodes listed in /etc/pve/nodes. On the primary machine I can see the contents of both folders; however, on the new machine, while I can see the contents of /etc/pve/nodes/<primary machine>, an ls of the new machine's own folder hangs.
top on the primary node shows corosync hitting 100% CPU, and journalctl -f shows:
root@ProxML350:~# journalctl -f
Bash:
Jan 07 17:20:42 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retry 50
Jan 07 17:20:43 ProxML350 corosync[2905]: [TOTEM ] Retransmit List: 12 13 15 16 18 172 1b1 1c2 1c8
Jan 07 17:20:43 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retry 60
Jan 07 17:20:43 ProxML350 corosync[2905]: [TOTEM ] Retransmit List: 12 13 15 16 18 172 1b1 1c2 1c8
Jan 07 17:20:44 ProxML350 corosync[2905]: [TOTEM ] Retransmit List: 12 13 15 16 18 172 1b1 1c2 1c8
Jan 07 17:20:44 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retry 70
Jan 07 17:20:44 ProxML350 corosync[2905]: [TOTEM ] Retransmit List: 12 13 15 16 18 172 1b1 1c2 1c8
Jan 07 17:20:45 ProxML350 corosync[2905]: [TOTEM ] Retransmit List: 12 13 15 16 18 172 1b1 1c2 1c8
Jan 07 17:20:45 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retry 80
Jan 07 17:20:46 ProxML350 corosync[2905]: [TOTEM ] Retransmit List: 12 13 15 16 18 172 1b1 1c2 1c8
Jan 07 17:20:46 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retry 90
Jan 07 17:20:46 ProxML350 corosync[2905]: [TOTEM ] Retransmit List: 12 13 15 16 18 172 1b1 1c2 1c8
Jan 07 17:20:47 ProxML350 corosync[2905]: [TOTEM ] Retransmit List: 12 13 15 16 18 172 1b1 1c2 1c8
Jan 07 17:20:47 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retry 100
Jan 07 17:20:47 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retried 100 times
Jan 07 17:20:47 ProxML350 pmxcfs[2884]: [status] crit: cpg_send_message failed: 6
Jan 07 17:20:47 ProxML350 pve-firewall[3189]: firewall update time (10.016 seconds)
Jan 07 17:20:47 ProxML350 corosync[2905]: [KNET ] pmtud: Global data MTU changed to: 1397
Sorry, watching this a bit longer, I now see:
Bash:
Jan 07 17:29:51 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retry 10
Jan 07 17:29:51 ProxML350 corosync[2905]: [TOTEM ] Retransmit List: 12 13 15 16 18 86 a0 169 1b6 1ba 1c2
Jan 07 17:29:52 ProxML350 kernel: INFO: task pvescheduler:4263 blocked for more than 737 seconds.
Jan 07 17:29:52 ProxML350 kernel: Tainted: P O 6.8.12-17-pve #1
Jan 07 17:29:52 ProxML350 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 07 17:29:52 ProxML350 kernel: task:pvescheduler state:D stack:0 pid:4263 tgid:4263 ppid:3463 flags:0x00000006
Jan 07 17:29:52 ProxML350 kernel: Call Trace:
Jan 07 17:29:52 ProxML350 kernel: <TASK>
Jan 07 17:29:52 ProxML350 kernel: __schedule+0x42b/0x1500
Jan 07 17:29:52 ProxML350 kernel: ? try_to_unlazy+0x60/0xe0
Jan 07 17:29:52 ProxML350 kernel: ? terminate_walk+0x65/0x100
Jan 07 17:29:52 ProxML350 kernel: ? path_parentat+0x49/0x90
Jan 07 17:29:52 ProxML350 kernel: schedule+0x33/0x110
Jan 07 17:29:52 ProxML350 kernel: schedule_preempt_disabled+0x15/0x30
Jan 07 17:29:52 ProxML350 kernel: rwsem_down_write_slowpath+0x392/0x6a0
Jan 07 17:29:52 ProxML350 kernel: down_write+0x5c/0x80
Jan 07 17:29:52 ProxML350 kernel: filename_create+0xaf/0x1b0
Jan 07 17:29:52 ProxML350 kernel: do_mkdirat+0x59/0x180
Jan 07 17:29:52 ProxML350 kernel: __x64_sys_mkdir+0x4a/0x70
Jan 07 17:29:52 ProxML350 kernel: x64_sys_call+0x2e3/0x2480
Jan 07 17:29:52 ProxML350 kernel: do_syscall_64+0x81/0x170
Jan 07 17:29:52 ProxML350 kernel: ? __x64_sys_alarm+0x76/0xd0
Jan 07 17:29:52 ProxML350 kernel: ? arch_exit_to_user_mode_prepare.constprop.0+0x1a/0xe0
Jan 07 17:29:52 ProxML350 kernel: ? syscall_exit_to_user_mode+0x43/0x1e0
Jan 07 17:29:52 ProxML350 kernel: ? do_syscall_64+0x8d/0x170
Jan 07 17:29:52 ProxML350 kernel: ? arch_exit_to_user_mode_prepare.constprop.0+0x1a/0xe0
Jan 07 17:29:52 ProxML350 kernel: ? syscall_exit_to_user_mode+0x43/0x1e0
Jan 07 17:29:52 ProxML350 kernel: ? do_syscall_64+0x8d/0x170
Jan 07 17:29:52 ProxML350 kernel: ? _copy_to_user+0x25/0x50
Jan 07 17:29:52 ProxML350 kernel: ? cp_new_stat+0x143/0x180
Jan 07 17:29:52 ProxML350 kernel: ? __set_task_blocked+0x29/0x80
Jan 07 17:29:52 ProxML350 kernel: ? sigprocmask+0xb4/0xe0
Jan 07 17:29:52 ProxML350 kernel: ? arch_exit_to_user_mode_prepare.constprop.0+0x1a/0xe0
Jan 07 17:29:52 ProxML350 kernel: ? syscall_exit_to_user_mode+0x43/0x1e0
Jan 07 17:29:52 ProxML350 kernel: ? do_syscall_64+0x8d/0x170
Jan 07 17:29:52 ProxML350 kernel: ? arch_exit_to_user_mode_prepare.constprop.0+0x1a/0xe0
Jan 07 17:29:52 ProxML350 kernel: ? syscall_exit_to_user_mode+0x43/0x1e0
Jan 07 17:29:52 ProxML350 kernel: ? do_syscall_64+0x8d/0x170
Jan 07 17:29:52 ProxML350 kernel: ? irqentry_exit+0x43/0x50
Jan 07 17:29:52 ProxML350 kernel: ? exc_page_fault+0x94/0x1b0
Jan 07 17:29:52 ProxML350 kernel: entry_SYSCALL_64_after_hwframe+0x78/0x80
Jan 07 17:29:52 ProxML350 kernel: RIP: 0033:0x7dd8165c3f27
Jan 07 17:29:52 ProxML350 kernel: RSP: 002b:00007ffdef765638 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
Jan 07 17:29:52 ProxML350 kernel: RAX: ffffffffffffffda RBX: 00005941f55032a0 RCX: 00007dd8165c3f27
Jan 07 17:29:52 ProxML350 kernel: RDX: 0000000000000026 RSI: 00000000000001ff RDI: 00005941fc40c820
Jan 07 17:29:52 ProxML350 kernel: RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
Jan 07 17:29:52 ProxML350 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00005941f5508c88
Jan 07 17:29:52 ProxML350 kernel: R13: 00005941fc40c820 R14: 00005941f57a5450 R15: 00000000000001ff
Jan 07 17:29:52 ProxML350 kernel: </TASK>
Jan 07 17:29:52 ProxML350 corosync[2905]: [TOTEM ] Retransmit List: 12 13 15 16 18 86 a0 169 1b6 1ba 1c2
Jan 07 17:29:52 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retry 20
Jan 07 17:29:53 ProxML350 corosync[2905]: [TOTEM ] Retransmit List: 12 13 15 16 18 86 a0 169 1b6 1ba 1c2
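For completeness, these are the checks I've been running on both nodes; happy to post the output if it helps:
Bash:
pvecm status                     # quorum / membership as pmxcfs sees it
corosync-cfgtool -s              # knet link status for each node
systemctl status corosync pve-cluster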
Could anyone provide some pointers?
Thanks.