Corosync disaster (different version between nodes)

grobs · Jun 8, 2021

Hi,

Short story:
I have two different versions of the corosync configuration in my cluster, and now "pvecm status" gives me this ugly error:
"Can't use an undefined value as a HASH reference at /usr/share/perl5/PVE/CLI/pvecm.pm line 479, <DATA> line 755."
My cluster is totally broken.

Long story:
I have been running Proxmox for some years now and, while looking at /etc/pve/corosync.conf (troubleshooting a networking issue), I saw this:

Code:

totem {
  cluster_name: pm5-cluster-01
  config_version: 43
  interface {
    bindnetaddr: 192.168.10.21 <========== old node IP
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

This IP belongs to an old node that I removed from the cluster a long time ago.
To clean things up, I decided to change it to the IP of a node that is currently part of the cluster (192.168.10.101) and to change the cluster_name from "pm5-cluster-01" to "cluster-01".

Spoiler alert: do not do this at home...

To do so, I did:

service pve-cluster stop (on each of my 11 nodes)
service pve-cluster start (on the node 192.168.10.101)
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
Edited /etc/pve/corosync.conf.new (bindnetaddr: 192.168.10.21 -> 192.168.10.101, cluster_name: pm5-cluster-01 -> cluster-01, config_version: 43 -> 44)
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
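
The copy, edit, move sequence above can be sketched on a scratch file. This is only an illustration of the config_version bump: bump_config_version is a hypothetical helper (not a Proxmox tool), and the sample file is a cut-down stand-in for the real /etc/pve/corosync.conf, which is never touched here.

```shell
# Hypothetical helper: print the given corosync.conf-style file with
# its config_version value incremented by one. All other lines are
# passed through unchanged.
bump_config_version() {
    awk '
        /^[ \t]*config_version:[ \t]*[0-9]+/ {
            match($0, /[0-9]+/)
            printf "%s%d\n", substr($0, 1, RSTART - 1), substr($0, RSTART, RLENGTH) + 1
            next
        }
        { print }
    ' "$1"
}

# Demo on a temporary file (an assumption for illustration only).
conf=$(mktemp)
printf 'totem {\n  cluster_name: pm5-cluster-01\n  config_version: 43\n}\n' > "$conf"

# Write the edited copy next to the original, then inspect it
# before it would replace the original.
bump_config_version "$conf" > "$conf.new"
grep config_version "$conf.new"

rm -f "$conf" "$conf.new"
```

Writing to a separate .new file first keeps the original intact until the edited copy has been checked.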

And it worked well:

Code:

pm6-01:~# pvecm status
Cluster information
-------------------
Name:             cluster-01 <==== NEW
Config Version:   44 <==== NEW
Transport:        knet
Secure auth:      on

...

Votequorum information
----------------------
Expected votes:   11
Highest expected: 11
Total votes:      1
Quorum:           6 Activity blocked
Flags:           


Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 192.168.10.101 (local)

Note: at this step, I did not notice the "Activity blocked" flag.
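
The numbers in the output above follow from majority-quorum arithmetic: with 11 expected votes the threshold is 6, and a single vote is below it. A minimal sketch of that calculation, assuming the plain majority rule (no two_node or other special votequorum options); the function names are made up for illustration:

```shell
# Majority threshold: floor(expected / 2) + 1.
quorum_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# Print "quorate" when the total votes reach the threshold,
# otherwise "blocked".
quorum_state() {
    expected=$1
    total=$2
    if [ "$total" -ge "$(quorum_threshold "$expected")" ]; then
        echo quorate
    else
        echo blocked
    fi
}

quorum_threshold 11   # prints 6
quorum_state 11 1     # prints blocked
```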

I did this to 2 other nodes and it seemed to work well.

But...

On one node, the "mv" command that replaces the corosync configuration didn't work and hung.
I was unable to get my prompt back, even with Ctrl+C or anything else.

Now my cluster is totally broken on every node...

Code:

# pvecm status
Can't use an undefined value as a HASH reference at /usr/share/perl5/PVE/CLI/pvecm.pm line 479, <DATA> line 755.
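
Because only some nodes received the edited file, the nodes can disagree on cluster_name and config_version. A minimal sketch for comparing two copies of the file, assuming they have been collected from two nodes into local files; conf_value and same_config are hypothetical helpers, and the sample contents are reduced to the two fields changed above:

```shell
# Print the value of the first "key: value" line in a
# corosync.conf-style file.
conf_value() {
    awk -v key="$2" '
        { line = $0; sub(/^[ \t]+/, "", line) }
        index(line, key ":") == 1 {
            sub(/^[^:]*:[ \t]*/, "", line)
            print line
            exit
        }
    ' "$1"
}

# Succeed only if both files agree on cluster_name and config_version.
same_config() {
    [ "$(conf_value "$1" cluster_name)" = "$(conf_value "$2" cluster_name)" ] &&
    [ "$(conf_value "$1" config_version)" = "$(conf_value "$2" config_version)" ]
}

# Demo with two scratch files standing in for copies from two nodes.
node_a=$(mktemp)
node_b=$(mktemp)
printf 'totem {\n  cluster_name: pm5-cluster-01\n  config_version: 43\n}\n' > "$node_a"
printf 'totem {\n  cluster_name: cluster-01\n  config_version: 44\n}\n' > "$node_b"

if same_config "$node_a" "$node_b"; then
    echo "configs match"
else
    echo "configs differ"
fi

rm -f "$node_a" "$node_b"
```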

I tried to do this:

service pveproxy stop
service pvedaemon stop
service corosync stop
pvecm expected 1
edit /etc/pve/corosync.conf ==> this hangs (same with cp / mv)
service pve-cluster stop
and then started everything again

but the result is the same.

This is a production cluster and I really don't know what to do.

Could you please help?

Best regards

ph0x · Jun 9, 2021

This is a joke, right?

grobs · Jun 9, 2021

Unfortunately not...
We were experiencing severe issues on the cluster the day before these changes, and this was a kind of "last chance" change.

grobs · Jun 9, 2021

To start somewhere, here is the state of one of the 11 nodes (pm6-staging-03):

Code:

pm6-staging-03:~# pvecm status
Can't use an undefined value as a HASH reference at /usr/share/perl5/PVE/CLI/pvecm.pm line 479, <DATA> line 755.

Code:

pm6-staging-03:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2021-06-09 11:37:07 CEST; 9min ago
  Process: 1733 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
 Main PID: 1783 (pmxcfs)
    Tasks: 5 (limit: 4915)
   Memory: 34.9M
   CGroup: /system.slice/pve-cluster.service
           └─1783 /usr/bin/pmxcfs

/etc/pve is available with all the configuration files, but in read-only mode.

Code:

pm6-staging-03:~# pct list
VMID       Status     Lock         Name         
102        stopped                 container01.domain
106        stopped                 container02.domain
...

Code:

pm6-staging-03:~# pct start 102
cluster not ready - no quorum?

Code:

pm6-staging-03:/var/lib/pve-cluster# pvecm expected 1
Unable to set expected votes: CS_ERR_INVALID_PARAM

Code:

pm6-staging-03:~# df -h
Filesystem                   Size   Used    Avail Use% Mounted on
...
rpool                          382G    128K  382G   1% /rpool
rpool/ROOT                     382G    128K  382G   1% /rpool/ROOT
rpool/data                     382G    256K  382G   1% /rpool/data
rpool/data/subvol-106-disk-1   5,0G    1,6G  3,5G  32% /rpool/data/subvol-106-disk-1
rpool/data/subvol-102-disk-1   5,0G    1,9G  3,2G  37% /rpool/data/subvol-102-disk-1
...
/dev/fuse                       30M    140K   30M   1% /etc/pve

EDIT: I see that the corosync configuration file is empty and can't be edited (read-only) due to the "no quorum" state:

Code:

-r--r----- 1 root www-data 0 juin   8 19:34 /etc/pve/corosync.conf
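
The zero size in the listing above is what marks the file as empty. A small sketch of the same check in script form, assuming a hypothetical helper conf_state that only inspects the file and never modifies it; it is demonstrated on scratch files rather than on /etc/pve/corosync.conf:

```shell
# Classify a corosync.conf-style file as missing, empty,
# lacking a config_version line, or ok.
conf_state() {
    if [ ! -e "$1" ]; then
        echo missing
    elif [ ! -s "$1" ]; then
        echo empty
    elif ! grep -q '^[[:space:]]*config_version:' "$1"; then
        echo no-version
    else
        echo ok
    fi
}

# Demo: a zero-byte file, like the one in the listing, and a filled one.
empty_conf=$(mktemp)
good_conf=$(mktemp)
printf 'totem {\n  config_version: 44\n}\n' > "$good_conf"

conf_state "$empty_conf"   # prints empty
conf_state "$good_conf"    # prints ok

rm -f "$empty_conf" "$good_conf"
```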

How can I at least start my containers?
