Corosync issue when restarting some hypervisors

stalio

New Member
Jul 27, 2021
I have been using Proxmox for more than 10 years now and have encountered a number of issues that I was always more or less able to solve. Today I want to share a problem that is happening to me for which I can't find a solution. I think it might be a real problem that is worth investigating.

We have a 24-node PVE 6 cluster, updated from PVE 5 months ago. As we had heating issues in our data center, we powered off 11 nodes, leaving 13 of them up. Quorum was never lost.
Yesterday I wanted to turn on some of the hypervisors that were previously powered off, but they cannot join the corosync cluster anymore.

This is the message I find on working corosync members:

Code:
Jul 27 09:34:48 hnode21 corosync[1993]:   [TOTEM ] Message received from 192.168.145.119 has bad magic number (probably sent by unencrypted Kronosnet).. Ignoring
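
For reference, the transport and crypto settings that the running corosync daemon has actually loaded can be dumped with corosync-cmapctl and compared against what is on disk (the grep pattern below is just an example of which keys to look for):

Code:
# settings the running daemon has loaded
corosync-cmapctl | grep -Ei 'transport|crypto|secauth'
# settings in the on-disk configuration
grep -Ei 'transport|crypto|secauth' /etc/corosync/corosync.conf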

This is the status of the "good" part of the cluster:

Code:
root@hnode21:~# pvecm status
Cluster information
-------------------
Name:             u-lite-v2
Config Version:   43
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Jul 27 10:04:10 2021
Quorum provider:  corosync_votequorum
Nodes:            13
Node ID:          0x00000015
Ring ID:          1.a9e
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   24
Highest expected: 24
Total votes:      13
Quorum:           13 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.145.102
0x00000002          1 192.168.145.103
0x00000003          1 192.168.145.101
0x00000004          1 192.168.145.100
0x00000006          1 192.168.145.106
0x00000008          1 192.168.145.120
0x00000009          1 192.168.145.118
0x0000000b          1 192.168.145.108
0x0000000d          1 192.168.145.110
0x0000000f          1 192.168.145.116
0x00000012          1 192.168.145.113
0x00000015          1 192.168.145.121 (local)
0x00000018          1 192.168.145.115


This is what a node I just powered on sees:

Code:
root@hnode19:~# pvecm status
Cluster information
-------------------
Name:             u-lite-v2
Config Version:   43
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Jul 27 10:02:32 2021
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x0000000a
Ring ID:          a.b0e
Quorate:          No

Votequorum information
----------------------
Expected votes:   24
Highest expected: 24
Total votes:      1
Quorum:           13 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x0000000a          1 192.168.145.119 (local)

Note that the Ring ID is different. The corosync.conf (renamed corosync.txt) is attached; it is the same on all nodes.
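The link/ring status as corosync sees it locally can also be checked on a node from each side, for example:

Code:
# show the status of the local node's corosync links/rings
corosync-cfgtool -s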
 

Your config has

Code:
  cluster_name: u-lite-v2
  config_version: 43
  crypto_cipher: none
  crypto_hash: none
  interface {
    bindnetaddr: 192.168.145.101
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2

which is kind of conflicting (secauth vs crypto_*). Are the corosync versions identical on both partitions of the cluster? My guess is that they are not, and that the nodes that were powered off interpret the config as "don't use crypto", while the ones that stayed on give "secauth" higher priority and thus enable encryption, so the two partitions can't talk to each other.
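
One way to check whether the corosync versions really differ between the two partitions (the hostnames here are just the ones from the outputs above) could be:

Code:
# run on one node from each partition, e.g. hnode21 and hnode19
corosync -v
pveversion -v | grep -i corosync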
 
Thanks for the hint. I understood what happened and seem to have been able to fix the issue.
It all goes back to when I did the PVE 5 to PVE 6 upgrade and needed to force

Code:
transport: udp

in the totem section.

I then modified the corosync.conf file but never restarted the corosync processes, so totem is still using UDP on the running nodes. I went back to the old configuration file and things seem OK now. At a lower priority, I will have to switch UDP off.
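
For anyone in a similar situation, switching UDP off later roughly means cleaning up the totem section in the cluster-wide config and restarting corosync everywhere. A sketch only (the cipher/hash values are typical corosync 3 settings, not taken from this cluster):

Code:
# edit the cluster-wide config; Proxmox propagates /etc/pve/corosync.conf to
# /etc/corosync/corosync.conf once config_version is increased
#   - remove "transport: udp" (knet is the corosync 3 default)
#   - resolve the secauth vs crypto_* conflict, e.g.:
#       crypto_cipher: aes256
#       crypto_hash: sha256
#   - bump config_version
# then restart corosync on every node, one at a time
systemctl restart corosync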

Stefano.