Totem is unable to form a cluster because of an operating system or network fault

number5

I am trying to join a new host to my cluster, and corosync reports the following error, followed by a couple of other errors:
corosync[7957]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.

Code:
Oct  8 10:09:20 host1 corosync[7957]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Oct  8 10:09:20 host1 pmxcfs[1226950]: [dcdb] notice: cpg_join retry 80
Oct  8 10:09:21 host1 corosync[7957]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Oct  8 10:09:21 host1 pmxcfs[1226950]: [dcdb] notice: cpg_join retry 90
Oct  8 10:09:22 host1 pve-firewall[8149]: status update error: Connection refused
Oct  8 10:09:22 host1 pmxcfs[1226950]: [dcdb] notice: cpg_join retry 100
Oct  8 10:09:23 host1 corosync[7957]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Oct  8 10:09:23 host1 pmxcfs[1226950]: [dcdb] notice: cpg_join retry 110
Oct  8 10:09:24 host1 pveproxy[972735]: ipcc_send_rec[1] failed: Connection refused
Oct  8 10:09:24 host1 pveproxy[972735]: ipcc_send_rec[2] failed: Connection refused
Oct  8 10:09:24 host1 pveproxy[972735]: ipcc_send_rec[3] failed: Connection refused
Oct  8 10:09:24 host1 corosync[7957]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.

Disabling the firewall makes no difference, and I can ping all the hosts by name without issue. Eventually I got pvecm to remove that one host, but corosync is still acting up.
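
For anyone triaging a similar journal, here is a minimal Python sketch that condenses a log excerpt like the one above into a few counts: messages per process, how often the "continuously in gather state" error appears, and the highest `cpg_join retry` value seen. The function name `summarize` and the regular expression are my own assumptions based on the line format shown in this post, not anything shipped with Proxmox VE or corosync.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt in this post:
#   Oct  8 10:09:20 host1 corosync[7957]:   [MAIN  ] Totem is unable ...
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+"
    r"(?P<proc>[\w.-]+)\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)
RETRY_RE = re.compile(r"cpg_join retry (\d+)")


def summarize(log_text):
    """Summarize a journal excerpt (hypothetical helper, not a Proxmox tool)."""
    per_process = Counter()
    gather_errors = 0
    max_retry = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not look like journal entries
        per_process[match["proc"]] += 1
        msg = match["msg"]
        if "totem is continuously in gather state" in msg:
            gather_errors += 1
        retry = RETRY_RE.search(msg)
        if retry:
            value = int(retry.group(1))
            max_retry = value if max_retry is None else max(max_retry, value)
    return {
        "per_process": dict(per_process),
        "gather_errors": gather_errors,
        "max_cpg_join_retry": max_retry,
    }
```

Feeding it the text of a saved journal file (for example via `summarize(open(path).read())`, where `path` is whatever file you exported) gives a quick view of whether the gather-state error is still recurring and whether the pmxcfs retry counter keeps climbing.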
 
Could you provide the complete output of `journalctl -b -u corosync`, if possible from multiple nodes?

Please also provide your interfaces file (`/etc/network/interfaces`) and your corosync config (`/etc/corosync/corosync.conf`).
 
Thank you for taking a look at this. I should point out that I got my cluster and the affected host back to normal sometime after submitting my original post. The new host is still not in, however.

While I was still attempting to join the new host ("host47"), I was also seeing this error on it:
host47 pveproxy[23528]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1891.
The journalctl output for that is included in the attached files as well.

Thanks again.
 

Attachments

  • journalctl_corosync_host47.txt (4.7 KB)
  • journalctl_corosync_host47_2.txt (577 KB)
  • interfaces_host41.txt (886 bytes)
  • journalctl_corosync_host41.txt (125.7 KB)
  • corosync.conf_host41.txt (1.5 KB)
Could you provide the complete journal for host47? It looks like the links are flapping, which could indicate an issue with the NIC(s) on that host.
Regarding the certificate issue: once the node is in the cluster and stable, you can run `pvecm updatecerts` on host47 to update the certificates.
 
OK, good to know. Please see the attached journal for that host. I still can't see any networking issues with it, though.
 

Attachments

  • journalctl_host47.txt (490.1 KB)
You could try following the `Separate A Node Without Reinstalling` guide in the documentation [0] to make that one host standalone again.
Then join it again to the cluster.

For the private key error, run `pvecm updatecerts --force` on host47 once it has been separated by following the guide in the docs.


[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node (5.5.1)
 
While I did not try the "Separate A Node Without Reinstalling" procedure specifically, I did reinstall the host once before, without luck.
 
To get a clear picture of the current situation, could you provide the corosync config again for host47 and for one of the nodes in the cluster, as well as the interfaces file for both nodes?
 
Mira, I don't believe host47 ever had a corosync config, and it does not have one now; I don't think one was ever created during the failed join. The corosync config for the rest of the hosts is the same one I previously uploaded: host41 and all the others share the same corosync.conf.

We are currently discussing purchasing a subscription. Once we figure this out, I am planning to come back to update the post with any new information.

Thanks again for taking the time to look into this.
 
If it doesn't have a corosync config in either /etc/pve/ or /etc/corosync, it should be fine to just run `pvecm updatecerts --force` on it. That should get rid of the error in the log.
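
The precondition above (no corosync config left in either location) can be checked with a short Python sketch. The helper name `find_corosync_configs` and its `root` parameter are hypothetical, and I am assuming that "a corosync config in /etc/pve/" means the file `/etc/pve/corosync.conf`; `/etc/corosync/corosync.conf` is the path already mentioned earlier in this thread.

```python
from pathlib import Path

# Assumed file locations; adjust if your setup differs.
DEFAULT_PATHS = ("/etc/pve/corosync.conf", "/etc/corosync/corosync.conf")


def find_corosync_configs(paths=DEFAULT_PATHS, root="/"):
    """Return the entries of `paths` that exist as regular files under `root`.

    `root` exists only so the check can be pointed at a copy of a node's
    filesystem (or a test directory) instead of the live system.
    """
    found = []
    for path in paths:
        candidate = Path(root) / path.lstrip("/")
        if candidate.is_file():
            found.append(str(candidate))
    return found
```

Calling `find_corosync_configs()` on the host in question and getting an empty list back would match the "no corosync config" case described above; a non-empty list means some cluster configuration is still present on that node.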

That said, according to the log, host47 does have corosync configured:
Code:
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [TOTEM ] Configuring link 0
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [TOTEM ] Configured link number 0: local addr: 10.1.1.47, port=5405
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 1 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 1 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 1 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [QUORUM] Sync members[1]: 13
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [QUORUM] Sync joined[1]: 13
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [TOTEM ] A new membership (d.27292) was formed. Members joined: 13
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 2 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 2 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 2 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 10 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 10 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [QUORUM] Members[1]: 13
Oct 12 08:01:45 itk-pve-47 systemd[1]: Started Corosync Cluster Engine.
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 10 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 10 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 10 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 10 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 11 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 11 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 11 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 11 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 11 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 11 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 12 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 12 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 12 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 12 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 12 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 12 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 3 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 3 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 3 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 8 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 8 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 8 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 5 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 5 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 5 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 6 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 6 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 6 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 13 (passive) best link: 0 (pri: 0)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 13 has no active links
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Oct 12 08:01:45 itk-pve-47 corosync[2070]:   [KNET  ] host: host: 7 has no active links
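
Because the excerpt above is long and repetitive, a minimal Python sketch like the following can reduce it to two facts: which node IDs the `[QUORUM] Members[...]` line reports, and which host IDs KNET reports as having no active links. The function name `knet_summary` and both regular expressions are assumptions derived only from the log lines shown here.

```python
import re

# Patterns based on the lines in the excerpt above, e.g.:
#   [KNET  ] host: host: 2 has no active links
#   [QUORUM] Members[1]: 13
NO_LINKS_RE = re.compile(r"\[KNET\s*\]\s+host: host: (\d+) has no active links")
MEMBERS_RE = re.compile(r"\[QUORUM\]\s+Members\[(\d+)\]:\s*([\d ]+)")


def knet_summary(log_text):
    """Collect quorum members and hosts without active links (hypothetical helper)."""
    no_links = set()
    members = []
    for line in log_text.splitlines():
        match = NO_LINKS_RE.search(line)
        if match:
            no_links.add(int(match.group(1)))
            continue
        match = MEMBERS_RE.search(line)
        if match:
            # Keep the most recent membership list seen in the log.
            members = [int(node_id) for node_id in match.group(2).split()]
    return {"members": members, "no_active_links": sorted(no_links)}
```

For the excerpt above, the interesting comparison is between the two lists: the only membership line is `Members[1]: 13`, while every host ID mentioned by KNET is reported with no active links, which is consistent with this node running corosync but not reaching any of its peers.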
 