Corosync issue when restarting some hypervisors

stalio

New Member
Jul 27, 2021
I have been using Proxmox for more than 10 years now and have encountered a number of issues that I was always more or less able to solve. Today I want to share a problem that is happening to me for which I can't find a solution. I think it might be a real problem that is worth investigating.

We have a 24-node PVE 6 cluster, updated from PVE 5 months ago. As we had heating issues in our data center, we powered off 11 nodes, leaving 13 of them up. Quorum was never lost.
Yesterday I wanted to turn on some of the hypervisors that were previously powered off, but they cannot join the corosync cluster anymore.

This is the message I find on working corosync members:

Code:
Jul 27 09:34:48 hnode21 corosync[1993]:   [TOTEM ] Message received from 192.168.145.119 has bad magic number (probably sent by unencrypted Kronosnet).. Ignoring
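
For reference, the transport and crypto settings that the running corosync daemon has actually loaded can be dumped with corosync-cmapctl and compared against what is on disk (the grep pattern below is just an example of which keys to look for):

Code:
# settings the running daemon has loaded
corosync-cmapctl | grep -Ei 'transport|crypto|secauth'
# settings in the on-disk configuration
grep -Ei 'transport|crypto|secauth' /etc/corosync/corosync.conf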

This is the status of the "good" part of the cluster:

Code:
root@hnode21:~# pvecm status
Cluster information
-------------------
Name:             u-lite-v2
Config Version:   43
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Jul 27 10:04:10 2021
Quorum provider:  corosync_votequorum
Nodes:            13
Node ID:          0x00000015
Ring ID:          1.a9e
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   24
Highest expected: 24
Total votes:      13
Quorum:           13 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.145.102
0x00000002          1 192.168.145.103
0x00000003          1 192.168.145.101
0x00000004          1 192.168.145.100
0x00000006          1 192.168.145.106
0x00000008          1 192.168.145.120
0x00000009          1 192.168.145.118
0x0000000b          1 192.168.145.108
0x0000000d          1 192.168.145.110
0x0000000f          1 192.168.145.116
0x00000012          1 192.168.145.113
0x00000015          1 192.168.145.121 (local)
0x00000018          1 192.168.145.115


This is what a node I just powered on sees:

Code:
root@hnode19:~# pvecm status
Cluster information
-------------------
Name:             u-lite-v2
Config Version:   43
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Jul 27 10:02:32 2021
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x0000000a
Ring ID:          a.b0e
Quorate:          No

Votequorum information
----------------------
Expected votes:   24
Highest expected: 24
Total votes:      1
Quorum:           13 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x0000000a          1 192.168.145.119 (local)

Note that the Ring ID is different. The corosync.conf (renamed corosync.txt) is attached; it is the same on all nodes.
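The link/ring status as corosync sees it locally can also be checked on a node from each side, for example:

Code:
# show the status of the local node's corosync links/rings
corosync-cfgtool -s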
 

Your config has

Code:
  cluster_name: u-lite-v2
  config_version: 43
  crypto_cipher: none
  crypto_hash: none
  interface {
    bindnetaddr: 192.168.145.101
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2

which is kind of conflicting (secauth vs crypto_*). Are the corosync versions identical on both partitions of the cluster? My guess is that they are not, and that the nodes that were powered off interpret the config as "don't use crypto", while the ones that stayed on give "secauth" higher priority and thus enable encryption, so the two partitions can't talk to each other.
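
One way to check whether the corosync versions really differ between the two partitions (the hostnames here are just the ones from the outputs above) could be:

Code:
# run on one node from each partition, e.g. hnode21 and hnode19
corosync -v
pveversion -v | grep -i corosync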
 
Thanks for the hint. I understood what happened and seem to have been able to fix the issue.
It all goes back to when I did the PVE 5 to PVE 6 upgrade and needed to force

Code:
transport: udp

in the totem section.

I then modified the corosync.conf file but never restarted the corosync processes, so totem is still using UDP on the running nodes. I went back to the old configuration file and things seem OK now. At a lower priority, I will have to switch UDP off.
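
For anyone in a similar situation, switching UDP off later roughly means cleaning up the totem section in the cluster-wide config and restarting corosync everywhere. A sketch only (the cipher/hash values are typical corosync 3 settings, not taken from this cluster):

Code:
# edit the cluster-wide config; Proxmox propagates /etc/pve/corosync.conf to
# /etc/corosync/corosync.conf once config_version is increased
#   - remove "transport: udp" (knet is the corosync 3 default)
#   - resolve the secauth vs crypto_* conflict, e.g.:
#       crypto_cipher: aes256
#       crypto_hash: sha256
#   - bump config_version
# then restart corosync on every node, one at a time
systemctl restart corosync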

Stefano.