[SOLVED] Clustering issues

Scotty

(This is a non-production environment)

After a recent power cut, I ended up with a split cluster consisting of three nodes (my switch came up last...

After repeated attempts to reboot them all, one at a time, two...I gave up. Granted, I should have asked for help at this point...

One of these nodes is a primary NAS server and provides other core functions, while the other two are just for testing. In an attempt to safeguard the primary machine, I followed the instructions to remove nodes from a cluster and had each machine running in isolation (non-clustered).
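
For anyone following along, the separation was done roughly along the lines of the documented "separate a node without reinstalling" procedure; a sketch of what that looks like (double-check against the current docs before running it, as it wipes the cluster configuration on that node):

Code:
# run on the node being separated - this removes its cluster configuration
systemctl stop pve-cluster corosync
pmxcfs -l                      # start pmxcfs in local mode so /etc/pve stays writable
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*
killall pmxcfs
systemctl start pve-cluster

# then, on a remaining cluster node, remove the separated node from the member list
pvecm delnode <nodename>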

I then created a 'new' cluster with the primary machine and attempted to rebuild the cluster from there. Booting a second machine and attempting to have it join the new cluster failed, so I rebuilt that machine and attempted to add it to the newly created cluster. It didn't go as planned: the machine wouldn't complete the joining process.
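
For reference, the create/join itself was done with the standard commands, roughly:

Code:
# on the primary node (ProxML350)
pvecm create Cabin2

# on the machine that should join, pointing at the primary's IP
pvecm add 192.168.0.222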

On the primary node I could see /etc/pve/nodes/<new machine>; however, while /etc/pve was mounted on the new machine, it didn't contain the nodes/ folder.

In the GUI, the primary node was responsive as normal and the new machine was listed as a node with a green tick. However, selecting the new machine would report a communications error, and eventually the green tick was replaced by a grey '?'. Selecting the cluster reports '/etc/pve/nodes/<new machine>/pve-ssl.pem' doesn't exist (500).

Strangely, the primary node does list the new node in /etc/pve/nodes/.

Forcing updatecerts works on the primary machine but hangs on the new machine. I can ssh between them without a problem.

After removing and re-joining it, I now have the nodes/ folder and the two nodes listed in /etc/pve/nodes. On the primary machine, I can see the contents of both folders... however, on the new machine, while I can see the contents of nodes/<primary machine>, an ls of the local machine's folder hangs.
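
In case it's relevant, this is roughly what I've been using to check the state on the node where the listing hangs:

Code:
# are pmxcfs (pve-cluster) and corosync healthy on this node?
systemctl status pve-cluster corosync

# recent cluster-filesystem / corosync messages from this boot
journalctl -b -u pve-cluster -u corosync --no-pager | tail -n 50

# quorum and membership as seen from this node
pvecm status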

top on the primary node shows corosync hitting 100% CPU, and journalctl -f shows:

Bash:
root@ProxML350:~# journalctl -f
Jan 07 17:20:42 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retry 50
Jan 07 17:20:43 ProxML350 corosync[2905]:   [TOTEM ] Retransmit List: 12 13 15 16 18 172 1b1 1c2 1c8
Jan 07 17:20:43 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retry 60
Jan 07 17:20:43 ProxML350 corosync[2905]:   [TOTEM ] Retransmit List: 12 13 15 16 18 172 1b1 1c2 1c8
Jan 07 17:20:44 ProxML350 corosync[2905]:   [TOTEM ] Retransmit List: 12 13 15 16 18 172 1b1 1c2 1c8
Jan 07 17:20:44 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retry 70
Jan 07 17:20:44 ProxML350 corosync[2905]:   [TOTEM ] Retransmit List: 12 13 15 16 18 172 1b1 1c2 1c8
Jan 07 17:20:45 ProxML350 corosync[2905]:   [TOTEM ] Retransmit List: 12 13 15 16 18 172 1b1 1c2 1c8
Jan 07 17:20:45 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retry 80
Jan 07 17:20:46 ProxML350 corosync[2905]:   [TOTEM ] Retransmit List: 12 13 15 16 18 172 1b1 1c2 1c8
Jan 07 17:20:46 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retry 90
Jan 07 17:20:46 ProxML350 corosync[2905]:   [TOTEM ] Retransmit List: 12 13 15 16 18 172 1b1 1c2 1c8
Jan 07 17:20:47 ProxML350 corosync[2905]:   [TOTEM ] Retransmit List: 12 13 15 16 18 172 1b1 1c2 1c8
Jan 07 17:20:47 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retry 100
Jan 07 17:20:47 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retried 100 times
Jan 07 17:20:47 ProxML350 pmxcfs[2884]: [status] crit: cpg_send_message failed: 6
Jan 07 17:20:47 ProxML350 pve-firewall[3189]: firewall update time (10.016 seconds)
Jan 07 17:20:47 ProxML350 corosync[2905]:   [KNET  ] pmtud: Global data MTU changed to: 1397

Sorry, watching this a bit longer I see:

Bash:
Jan 07 17:29:51 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retry 10
Jan 07 17:29:51 ProxML350 corosync[2905]:   [TOTEM ] Retransmit List: 12 13 15 16 18 86 a0 169 1b6 1ba 1c2
Jan 07 17:29:52 ProxML350 kernel: INFO: task pvescheduler:4263 blocked for more than 737 seconds.
Jan 07 17:29:52 ProxML350 kernel:       Tainted: P           O       6.8.12-17-pve #1
Jan 07 17:29:52 ProxML350 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 07 17:29:52 ProxML350 kernel: task:pvescheduler    state:D stack:0     pid:4263  tgid:4263  ppid:3463   flags:0x00000006
Jan 07 17:29:52 ProxML350 kernel: Call Trace:
Jan 07 17:29:52 ProxML350 kernel:  <TASK>
Jan 07 17:29:52 ProxML350 kernel:  __schedule+0x42b/0x1500
Jan 07 17:29:52 ProxML350 kernel:  ? try_to_unlazy+0x60/0xe0
Jan 07 17:29:52 ProxML350 kernel:  ? terminate_walk+0x65/0x100
Jan 07 17:29:52 ProxML350 kernel:  ? path_parentat+0x49/0x90
Jan 07 17:29:52 ProxML350 kernel:  schedule+0x33/0x110
Jan 07 17:29:52 ProxML350 kernel:  schedule_preempt_disabled+0x15/0x30
Jan 07 17:29:52 ProxML350 kernel:  rwsem_down_write_slowpath+0x392/0x6a0
Jan 07 17:29:52 ProxML350 kernel:  down_write+0x5c/0x80
Jan 07 17:29:52 ProxML350 kernel:  filename_create+0xaf/0x1b0
Jan 07 17:29:52 ProxML350 kernel:  do_mkdirat+0x59/0x180
Jan 07 17:29:52 ProxML350 kernel:  __x64_sys_mkdir+0x4a/0x70
Jan 07 17:29:52 ProxML350 kernel:  x64_sys_call+0x2e3/0x2480
Jan 07 17:29:52 ProxML350 kernel:  do_syscall_64+0x81/0x170
Jan 07 17:29:52 ProxML350 kernel:  ? __x64_sys_alarm+0x76/0xd0
Jan 07 17:29:52 ProxML350 kernel:  ? arch_exit_to_user_mode_prepare.constprop.0+0x1a/0xe0
Jan 07 17:29:52 ProxML350 kernel:  ? syscall_exit_to_user_mode+0x43/0x1e0
Jan 07 17:29:52 ProxML350 kernel:  ? do_syscall_64+0x8d/0x170
Jan 07 17:29:52 ProxML350 kernel:  ? arch_exit_to_user_mode_prepare.constprop.0+0x1a/0xe0
Jan 07 17:29:52 ProxML350 kernel:  ? syscall_exit_to_user_mode+0x43/0x1e0
Jan 07 17:29:52 ProxML350 kernel:  ? do_syscall_64+0x8d/0x170
Jan 07 17:29:52 ProxML350 kernel:  ? _copy_to_user+0x25/0x50
Jan 07 17:29:52 ProxML350 kernel:  ? cp_new_stat+0x143/0x180
Jan 07 17:29:52 ProxML350 kernel:  ? __set_task_blocked+0x29/0x80
Jan 07 17:29:52 ProxML350 kernel:  ? sigprocmask+0xb4/0xe0
Jan 07 17:29:52 ProxML350 kernel:  ? arch_exit_to_user_mode_prepare.constprop.0+0x1a/0xe0
Jan 07 17:29:52 ProxML350 kernel:  ? syscall_exit_to_user_mode+0x43/0x1e0
Jan 07 17:29:52 ProxML350 kernel:  ? do_syscall_64+0x8d/0x170
Jan 07 17:29:52 ProxML350 kernel:  ? arch_exit_to_user_mode_prepare.constprop.0+0x1a/0xe0
Jan 07 17:29:52 ProxML350 kernel:  ? syscall_exit_to_user_mode+0x43/0x1e0
Jan 07 17:29:52 ProxML350 kernel:  ? do_syscall_64+0x8d/0x170
Jan 07 17:29:52 ProxML350 kernel:  ? irqentry_exit+0x43/0x50
Jan 07 17:29:52 ProxML350 kernel:  ? exc_page_fault+0x94/0x1b0
Jan 07 17:29:52 ProxML350 kernel:  entry_SYSCALL_64_after_hwframe+0x78/0x80
Jan 07 17:29:52 ProxML350 kernel: RIP: 0033:0x7dd8165c3f27
Jan 07 17:29:52 ProxML350 kernel: RSP: 002b:00007ffdef765638 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
Jan 07 17:29:52 ProxML350 kernel: RAX: ffffffffffffffda RBX: 00005941f55032a0 RCX: 00007dd8165c3f27
Jan 07 17:29:52 ProxML350 kernel: RDX: 0000000000000026 RSI: 00000000000001ff RDI: 00005941fc40c820
Jan 07 17:29:52 ProxML350 kernel: RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
Jan 07 17:29:52 ProxML350 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00005941f5508c88
Jan 07 17:29:52 ProxML350 kernel: R13: 00005941fc40c820 R14: 00005941f57a5450 R15: 00000000000001ff
Jan 07 17:29:52 ProxML350 kernel:  </TASK>
Jan 07 17:29:52 ProxML350 corosync[2905]:   [TOTEM ] Retransmit List: 12 13 15 16 18 86 a0 169 1b6 1ba 1c2
Jan 07 17:29:52 ProxML350 pmxcfs[2884]: [status] notice: cpg_send_message retry 20
Jan 07 17:29:53 ProxML350 corosync[2905]:   [TOTEM ] Retransmit List: 12 13 15 16 18 86 a0 169 1b6 1ba 1c2

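In case it helps narrow things down, I've also been looking at corosync directly on each node with something like:

Code:
corosync-cfgtool -s      # per-link / per-node knet link status
corosync-quorumtool -s   # quorum state as corosync sees it
corosync-cmapctl | grep -i members
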
Could anyone provide some pointers?

Thanks.
 
Does every node have a link to each other node? For each node, can you ping all the other nodes?
  • Please provide the respective output of pvecm status on each node.
  • Are the contents of /etc/pve/corosync.conf and /etc/corosync/corosync.conf each the same on all nodes?
 
I should have added that the above log is from the primary node; the new machine is showing:

Code:
Jan 07 17:49:23 HPLaptop pveproxy[7869]: worker exit
Jan 07 17:49:23 HPLaptop pveproxy[7870]: worker exit
Jan 07 17:49:23 HPLaptop pveproxy[7871]: worker exit
Jan 07 17:49:23 HPLaptop pveproxy[1309]: worker 7869 finished
Jan 07 17:49:23 HPLaptop pveproxy[1309]: starting 1 worker(s)
Jan 07 17:49:23 HPLaptop pveproxy[1309]: worker 7891 started
Jan 07 17:49:23 HPLaptop pveproxy[1309]: worker 7870 finished
Jan 07 17:49:23 HPLaptop pveproxy[1309]: starting 1 worker(s)
Jan 07 17:49:23 HPLaptop pveproxy[7891]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 2150.
Jan 07 17:49:23 HPLaptop pveproxy[1309]: worker 7892 started
Jan 07 17:49:23 HPLaptop pveproxy[7892]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 2150.
Jan 07 17:49:23 HPLaptop pveproxy[1309]: worker 7871 finished
Jan 07 17:49:23 HPLaptop pveproxy[1309]: starting 1 worker(s)
Jan 07 17:49:23 HPLaptop pveproxy[1309]: worker 7893 started
Jan 07 17:49:23 HPLaptop pveproxy[7893]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 2150.
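
(A quick check of whether the key pveproxy is complaining about actually exists on this node:)

Code:
# /etc/pve/local is a symlink to /etc/pve/nodes/<this hostname>
ls -l /etc/pve/local/ /etc/pve/nodes/HPLaptop/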
Does every node have a link to each other node? For each node, can you ping all the other nodes?
  • Please provide the respective output of pvecm status on each node.
  • Are the contents of /etc/pve/corosync.conf and /etc/corosync/corosync.conf each the same on all nodes?

(Primary Node)

Code:
root@ProxML350:~# pvecm status
Cluster information
-------------------
Name:             Cabin2
Config Version:   16
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Jan  8 10:32:10 2026
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.17c3
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.0.222 (local)
0x00000002          1 192.168.0.179

(New machine)

Code:
root@HPLaptop:~# pvecm status
Cluster information
-------------------
Name:             Cabin2
Config Version:   16
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Jan  8 10:33:16 2026
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1.17c3
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.0.222
0x00000002          1 192.168.0.179 (local)


I can confirm that the corosync.conf is the same in /etc/pve/ and /etc/corosync/ across both nodes. Here is the content:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: HPLaptop
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.0.179
  }
  node {
    name: ProxML350
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.0.222
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cabin2
  config_version: 16
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
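(For reference, I compared the files with something like the following, run from the primary node; the IP is the new machine's:)

Code:
# local copies on this node
diff /etc/pve/corosync.conf /etc/corosync/corosync.conf

# compare against the other node's copy over ssh
ssh root@192.168.0.179 cat /etc/pve/corosync.conf | diff - /etc/pve/corosync.conf
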
The attachments are the system reports for both nodes (ProxML350 being the primary). The primary's report could be generated via the GUI, but I can't connect to the other machine (although it is shown in the GUI) to create its report, so that one was generated via the CLI.
 


OK, thanks... can the nodes really ping each other? Please double-check. Is there a firewall somewhere that might block some ports, e.g. the corosync ports? Having a look into the respective journals of the two nodes within a specific timeframe (say 10 minutes, e.g. journalctl -b --since "2026-01-08 14:00:00" --until "2026-01-08 14:10:00") could also provide some more useful insights.
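
For the port check, something roughly like this on each node should do (the port numbers assume the default corosync/knet configuration):

Code:
# is corosync listening on the expected UDP ports? (5405 by default, 5406 on some setups)
ss -uapn | grep -E '5405|5406'

# is the Proxmox firewall active on this node?
pve-firewall status

# basic reachability (replace with the other node's IP)
ping -c 3 192.168.0.179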
 
Thanks (very much!) for the feedback...it's led me down a path to resolving the issue.

As mentioned at the start, I had suffered a power cut. Your comment about the firewall got me thinking (BTW, I could ping and ssh between both nodes). It turns out the 10Gb switch had reset its port configuration during/after the power cut, defaulting the port to 'fibre' rather than DAC-300CM. This was generating lots of CRC errors on the port connecting the primary server. Once this was resolved, I took a deeper look at the errors in the journal.
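
(For anyone hitting something similar: the host-side counters are worth a look too. A rough sketch, with enp5s0 standing in for whatever the actual 10Gb interface is called:)

Code:
# RX/TX error counters for the uplink
ip -s link show dev enp5s0

# NIC-level statistics, filtered for CRC / error counters
ethtool -S enp5s0 | grep -iE 'crc|err'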

Jan 08 14:30:56 HPLaptop pveproxy[1359]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 2150.

This took me in the direction of trying to update the certificates again:

Code:
root@HPLaptop:~# pvecm updatecerts -f
(re)generate node files
generate new node certificate
Could not find private key from /etc/pve/nodes/HPLaptop/pve-ssl.key
unable to generate pve certificate request:
command 'openssl req -batch -new -config /tmp/pvesslconf-1455.tmp -key /etc/pve/nodes/HPLaptop/pve-ssl.key -out /tmp/pvecertreq-1455.tmp' failed: exit code 1

I had previously thought this was due to not being able to navigate to the folder, so I assumed that was why it gave the error. However, now that I'm no longer getting the CRC errors, I can navigate to the folder without a problem.

A bit more reading took me to https://pve.proxmox.com/wiki/Proxmox_SSL_Error_Fixing. After following the instructions there and a clean reboot... everything is working.
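
(If it's useful to anyone else, a quick sanity check after following the wiki, assuming the same node name as here:)

Code:
# both the key and certificate should now exist for this node
ls -l /etc/pve/nodes/HPLaptop/pve-ssl.key /etc/pve/nodes/HPLaptop/pve-ssl.pem

# check the regenerated certificate looks sane
openssl x509 -in /etc/pve/local/pve-ssl.pem -noout -subject -enddate

# restart the web services so they pick up the new certificate
systemctl restart pveproxy pvedaemon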

Thanks very much for the pointers...I was pulling my hair out (and there isn't much left).