Corosync disaster (different version between nodes)

grobs

Hi,

Short story:
I have 2 different versions of the corosync configuration on my cluster, and now "pvecm status" gives me this ugly error:
"Can't use an undefined value as a HASH reference at /usr/share/perl5/PVE/CLI/pvecm.pm line 479, <DATA> line 755."
My cluster is totally broken.

Long story:
I've been running Proxmox for some years now and, while looking at /etc/pve/corosync.conf (to troubleshoot a networking issue), I saw this:

Code:
totem {
  cluster_name: pm5-cluster-01
  config_version: 43
  interface {
    bindnetaddr: 192.168.10.21 <========== old node IP
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

This IP belongs to an old node that I removed from this cluster a long time ago.
To clean things up, I decided to change it to the IP of a node that is currently part of the cluster (192.168.10.101) and to rename the cluster from "pm5-cluster-01" to "cluster-01".

Spoiler alert: do not do this at home...

To do so, I did:
  1. service pve-cluster stop (on each of my 11 nodes)
  2. service pve-cluster start (on the node 192.168.10.101)
  3. cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
  4. Edited /etc/pve/corosync.conf.new (bindnetaddr: 192.168.10.21 -> 192.168.10.101, cluster_name: pm5-cluster-01 -> cluster-01, config_version: 43 -> 44)
  5. mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
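A minimal sketch of a sanity check that could be run between steps 4 and 5, assuming the only things to verify are that the edited file is not empty and that its config_version is higher than the current one. The function names are hypothetical and are not part of any Proxmox or corosync tool.

```shell
# Print the config_version value found in a corosync.conf-style file.
get_config_version() {
  awk '$1 == "config_version:" { print $2; exit }' "$1"
}

# Succeed only if the candidate file is non-empty and its config_version
# is strictly greater than the one in the current file.
# Usage: check_new_corosync_conf <current> <candidate>
check_new_corosync_conf() {
  current=$1
  candidate=$2
  [ -s "$candidate" ] || { echo "candidate is empty or missing" >&2; return 1; }
  old=$(get_config_version "$current")
  new=$(get_config_version "$candidate")
  [ -n "$old" ] && [ -n "$new" ] || { echo "config_version not found" >&2; return 1; }
  [ "$new" -gt "$old" ] || { echo "config_version not incremented ($old -> $new)" >&2; return 1; }
}
```

With the paths from the steps above, it would be called as `check_new_corosync_conf /etc/pve/corosync.conf /etc/pve/corosync.conf.new`. It does not validate the rest of the file (node list, addresses, cluster name).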
And it worked well:

Code:
pm6-01:~# pvecm status
Cluster information
-------------------
Name:             cluster-01 <==== NEW
Config Version:   44 <==== NEW
Transport:        knet
Secure auth:      on

...

Votequorum information
----------------------
Expected votes:   11
Highest expected: 11
Total votes:      1
Quorum:           6 Activity blocked
Flags:           


Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 192.168.10.101 (local)

Hint: At this step, I didn't notice the "Activity blocked".
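The numbers in that output follow from how corosync's votequorum sizes a majority: with 11 expected votes, quorum is floor(11/2) + 1 = 6, so a single node holding 1 vote stays blocked. A small sketch of that arithmetic (the function names are illustrative only):

```shell
# Majority threshold used by votequorum: floor(expected_votes / 2) + 1.
quorum_threshold() {
  echo $(( $1 / 2 + 1 ))
}

# Succeed if total_votes reaches the threshold for expected_votes.
# Usage: is_quorate <total_votes> <expected_votes>
is_quorate() {
  [ "$1" -ge "$(quorum_threshold "$2")" ]
}
```

For the values shown above, `quorum_threshold 11` prints 6 and `is_quorate 1 11` fails, which matches "Quorum: 6 Activity blocked".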

I did the same on 2 other nodes and it seemed to work well.

But...

On one node, the "mv" command that replaces the corosync configuration didn't work and hung.
I was unable to get my prompt back, even with "ctrl+c" or anything else.

Now my cluster is totally broken on every node...

Code:
# pvecm status
Can't use an undefined value as a HASH reference at /usr/share/perl5/PVE/CLI/pvecm.pm line 479, <DATA> line 755.

I tried to do this:
  • service pveproxy stop
  • service pvedaemon stop
  • service corosync stop
  • pvecm expected 1
  • edit the /etc/pve/corosync.conf ==> this hangs (same with cp / mv)
  • service pve-cluster stop
  • and then start everything again
but the result is the same.

This is a production cluster and I really don't know what to do.

Could you please help?

Best regards
 
Unfortunately not...
We had been experiencing very bad issues on the cluster the day before those changes, and this was a kind of "last chance" change.
 
As a starting point, here is the state of one of the 11 nodes (pm6-staging-03):

Code:
pm6-staging-03:~# pvecm status
Can't use an undefined value as a HASH reference at /usr/share/perl5/PVE/CLI/pvecm.pm line 479, <DATA> line 755.

Code:
pm6-staging-03:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2021-06-09 11:37:07 CEST; 9min ago
  Process: 1733 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
 Main PID: 1783 (pmxcfs)
    Tasks: 5 (limit: 4915)
   Memory: 34.9M
   CGroup: /system.slice/pve-cluster.service
           └─1783 /usr/bin/pmxcfs

/etc/pve is available with all the configuration files, but in read-only mode.

Code:
pm6-staging-03:~# pct list
VMID       Status     Lock         Name         
102        stopped                 container01.domain
106        stopped                 container02.domain
...

Code:
pm6-staging-03:~# pct start 102
cluster not ready - no quorum?

Code:
pm6-staging-03:/var/lib/pve-cluster# pvecm expected 1
Unable to set expected votes: CS_ERR_INVALID_PARAM

Code:
pm6-staging-03:~# df -h
Filesystem                   Size   Used  Avail Use% Mounted on
...
rpool                          382G    128K  382G   1% /rpool
rpool/ROOT                     382G    128K  382G   1% /rpool/ROOT
rpool/data                     382G    256K  382G   1% /rpool/data
rpool/data/subvol-106-disk-1   5,0G    1,6G  3,5G  32% /rpool/data/subvol-106-disk-1
rpool/data/subvol-102-disk-1   5,0G    1,9G  3,2G  37% /rpool/data/subvol-102-disk-1
...
/dev/fuse                       30M    140K   30M   1% /etc/pve

EDIT: I see that the corosync configuration file is empty and can't be edited (read-only) because of the "no quorum" state:
Code:
-r--r----- 1 root www-data 0 juin   8 19:34 /etc/pve/corosync.conf
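Since the symptom here is an empty file in /etc/pve, one way to see which copies of the configuration still have content is to print the size and config_version of each. This is a sketch under assumptions: the helper name is hypothetical, and comparing /etc/pve/corosync.conf with the node-local /etc/corosync/corosync.conf is a suggestion, not something done in this thread.

```shell
# Print "<path> <bytes> <config_version or ->" for each given file,
# or "<path> missing -" when the file does not exist.
summarize_corosync_confs() {
  for f in "$@"; do
    if [ ! -e "$f" ]; then
      echo "$f missing -"
      continue
    fi
    bytes=$(wc -c < "$f" | tr -d ' ')
    version=$(awk '$1 == "config_version:" { print $2; exit }' "$f")
    echo "$f $bytes ${version:--}"
  done
}
```

It could be run on each node as `summarize_corosync_confs /etc/pve/corosync.conf /etc/corosync/corosync.conf`. A size of 0 or a differing config_version between nodes points at the copies that diverged.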

How can I at least start my containers?
 