[SOLVED] Fixing broken cluster

jlgarnier

Hi Community,

I have two small servers in my homelab: one at 192.168.1.100, the other at 192.168.1.110. After a network issue, I decided to give the second one a static DHCP address, 192.168.1.120. Unfortunately, this broke the cluster: the UI returns the error message "hostname lookup 'LAB-server5' failed - failed to get address info for: LAB-server5: Name or service not known (500)" and freezes. I now need to manually edit the cluster config file to update the IP address of the second server.

I've read that I could edit the corosync.conf file (https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_edit_corosync_conf), but it seems I don't have the required privileges, although I SSHed as 'root'...

Can anyone tell me the appropriate procedure to modify this file? Assuming this is the right file, of course...

Thanks in advance for any help!
 
Done, but this is obviously not enough... How can I edit the so-called "cluster config file" to indicate that Server5 has moved to 192.168.1.120? The procedure listed in the wiki doesn't work because any write in /etc/pve is rejected (Permission denied, both as root and as www-data)...

I can't use pvecm either as it doesn't find server5:
Bash:
root@LAB-server1:~:$ pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 LAB-server1 (local)
so I can't just
Code:
pvecm delnode LAB-server5
...

Thanks in advance for any help!
 
PVE blocks changes to the cluster file system (/etc/pve) while the cluster has no quorum. To work around it:
1. Stop the cluster services on the node you're changing (Server5)
Code:
systemctl stop pve-cluster
systemctl stop corosync
2. Mount pmxcfs in local mode (bypasses the cluster lock):
Code:
pmxcfs -l
3. Edit /etc/pve/corosync.conf: update the IP address (ring0_addr) and increase config_version (just add 1 to the value) so the other node will pick up the newer config.
4. Restart everything and regenerate certs:
Code:
killall pmxcfs
systemctl start corosync
systemctl start pve-cluster
pvecm updatecerts --force
5. Double-check /etc/hosts and /etc/network/interfaces for any leftover references to the old IP address.
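
If you'd rather script step 3 than edit by hand, here is a minimal sketch. It assumes the file uses the usual "key: value" layout shown in this thread, and update_corosync_conf is a hypothetical helper name, not a Proxmox tool. It rewrites one ring0_addr and adds 1 to config_version. I'd try it on a copy of the file first and check the result before putting it in place.

```shell
#!/bin/sh
# Sketch only: rewrite one ring0_addr and bump config_version by 1.
# update_corosync_conf is a hypothetical helper, not a Proxmox command.
# Usage: update_corosync_conf FILE OLD_IP NEW_IP
update_corosync_conf() {
    conf=$1 old_ip=$2 new_ip=$3
    tmp=$(mktemp) || return 1
    awk -v old="$old_ip" -v new="$new_ip" '
        # Replace the address only on a ring0_addr line that matches exactly.
        $1 == "ring0_addr:" && $2 == old { sub(/[^ ]+$/, new) }
        # Increment the version so the edited file counts as the newer one.
        $1 == "config_version:" { sub(/[0-9]+$/, $2 + 1) }
        { print }
    ' "$conf" > "$tmp" && cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return $status
}

# Example (the path of the working copy is an assumption):
#   cp /etc/pve/corosync.conf /root/corosync.conf.new
#   update_corosync_conf /root/corosync.conf.new 192.168.1.110 192.168.1.120
```

The function leaves every other line untouched, including indentation, so a diff against the original should show only the two changed values.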
 
Hi @psalkiewicz and thanks for this detailed procedure!

Everything went fine until I entered the
Code:
pvecm updatecerts --force
command, which timed out ("got timeout when trying to ensure cluster certificates and base file hierarchy is set up - no quorum (yet) or hung pmxcfs?").

Is there anything I must do on the second server? Should I edit the config file on Server1 too?

Thanks in advance for your help!
 
please post the output of the following commands from each node:

pvecm status
pvecm nodes
systemctl status pve-cluster corosync
cat /etc/corosync/corosync.conf
cat /etc/pve/corosync.conf

thanks!
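
To gather all of that in one go, something like the sketch below should work. collect_cluster_info is a hypothetical helper name, and the output file in the example is just a suggestion. It runs the five commands above, labels each block with the node name, and carries on if one of them fails.

```shell
#!/bin/sh
# Sketch: run each diagnostic from the list above and label its output.
# collect_cluster_info is a hypothetical helper name.
collect_cluster_info() {
    for cmd in \
        "pvecm status" \
        "pvecm nodes" \
        "systemctl status pve-cluster corosync" \
        "cat /etc/corosync/corosync.conf" \
        "cat /etc/pve/corosync.conf"
    do
        printf '### %s: %s\n' "$(uname -n)" "$cmd"
        # Keep going if a command fails; its error text is useful too.
        $cmd 2>&1 || printf '(exit status %s)\n' "$?"
        printf '\n'
    done
}

# Example: collect_cluster_info > "/tmp/cluster-info-$(uname -n).txt"
```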
 
Thanks @fabian !

Here are the results for Server1:

Code:
root@LAB-server1:~:$ pvecm status
Cluster information
-------------------
Name:             LAB-home
Config Version:   2
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Apr 16 09:04:41 2026
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.1ae
Quorate:          No

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.1.100 (local)

root@LAB-server1:~:$ pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 LAB-server1 (local)

root@LAB-server1:~:$ systemctl status pve-cluster corosync
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/usr/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Mon 2026-04-13 09:14:33 CEST; 2 days ago
 Invocation: 7c7de0fa3da2434bb8689427b40fec8b
    Process: 1424 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 1491 (pmxcfs)
      Tasks: 7 (limit: 18707)
     Memory: 45.8M (peak: 55.6M)
        CPU: 2min 2.755s
     CGroup: /system.slice/pve-cluster.service
             └─1491 /usr/bin/pmxcfs

Apr 15 23:14:32 LAB-server1 pmxcfs[1491]: [dcdb] notice: data verification successful
Apr 16 00:14:32 LAB-server1 pmxcfs[1491]: [dcdb] notice: data verification successful
Apr 16 01:14:32 LAB-server1 pmxcfs[1491]: [dcdb] notice: data verification successful
Apr 16 02:14:32 LAB-server1 pmxcfs[1491]: [dcdb] notice: data verification successful
Apr 16 03:14:32 LAB-server1 pmxcfs[1491]: [dcdb] notice: data verification successful
Apr 16 04:14:32 LAB-server1 pmxcfs[1491]: [dcdb] notice: data verification successful
Apr 16 05:14:32 LAB-server1 pmxcfs[1491]: [dcdb] notice: data verification successful
Apr 16 06:14:32 LAB-server1 pmxcfs[1491]: [dcdb] notice: data verification successful
Apr 16 07:14:32 LAB-server1 pmxcfs[1491]: [dcdb] notice: data verification successful
Apr 16 08:14:32 LAB-server1 pmxcfs[1491]: [dcdb] notice: data verification successful

● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: active (running) since Mon 2026-04-13 09:14:33 CEST; 2 days ago
 Invocation: 11acd14ea96e4308b2596980d2ed72f3
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 1621 (corosync)
      Tasks: 9 (limit: 18707)
     Memory: 151.1M (peak: 167.5M)
        CPU: 25min 26.982s
     CGroup: /system.slice/corosync.service
             └─1621 /usr/sbin/corosync -f

Apr 16 09:05:44 LAB-server1 corosync[1621]:   [KNET  ] rx: Packet rejected from 192.168.1.120:5405
Apr 16 09:05:44 LAB-server1 corosync[1621]:   [KNET  ] rx: Packet rejected from 192.168.1.120:5405
Apr 16 09:05:45 LAB-server1 corosync[1621]:   [KNET  ] rx: Packet rejected from 192.168.1.120:5405
Apr 16 09:05:46 LAB-server1 corosync[1621]:   [KNET  ] rx: Packet rejected from 192.168.1.120:5405
Apr 16 09:05:47 LAB-server1 corosync[1621]:   [KNET  ] rx: Packet rejected from 192.168.1.120:5405
Apr 16 09:05:48 LAB-server1 corosync[1621]:   [KNET  ] rx: Packet rejected from 192.168.1.120:5405
Apr 16 09:05:48 LAB-server1 corosync[1621]:   [KNET  ] rx: Packet rejected from 192.168.1.120:5405
Apr 16 09:05:49 LAB-server1 corosync[1621]:   [KNET  ] rx: Packet rejected from 192.168.1.120:5405
Apr 16 09:05:50 LAB-server1 corosync[1621]:   [KNET  ] rx: Packet rejected from 192.168.1.120:5405
Apr 16 09:05:51 LAB-server1 corosync[1621]:   [KNET  ] rx: Packet rejected from 192.168.1.120:5405

root@LAB-server1:~:$ cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: LAB-server1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.100
  }
  node {
    name: LAB-server5
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.110
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: LAB-home
  config_version: 2
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

root@LAB-server1:~:$ cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: LAB-server1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.100
  }
  node {
    name: LAB-server5
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.110
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: LAB-home
  config_version: 2
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

And for Server5:
Code:
[root@LAB-server5 ~]# pvecm status
Cluster information
-------------------
Name:             LAB-home
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Apr 16 09:09:06 2026
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2.1b3
Quorate:          No

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 192.168.1.120 (local)
[root@LAB-server5 ~]# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         2          1 LAB-server5 (local)
[root@LAB-server5 ~]# systemctl status pve-cluster corosync
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/usr/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Wed 2026-04-15 17:34:58 CEST; 15h ago
 Invocation: b142979a73e947f0bb9086337d09c0b0
    Process: 589634 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 589635 (pmxcfs)
      Tasks: 8 (limit: 38225)
     Memory: 17.2M (peak: 24M)
        CPU: 32.843s
     CGroup: /system.slice/pve-cluster.service
             └─589635 /usr/bin/pmxcfs

● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: active (running) since Wed 2026-04-15 17:34:46 CEST; 15h ago
 Invocation: c01fc3b3cd69405cb0437c537ceae0cd
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 589614 (corosync)
      Tasks: 9 (limit: 38225)
     Memory: 145.6M (peak: 162.3M)
        CPU: 4min 25.572s
     CGroup: /system.slice/corosync.service
             └─589614 /usr/sbin/corosync -f

Apr 15 17:34:46 LAB-server5 (corosync)[589614]: corosync.service: Referenced but unset environment variable evaluates to an empty string: CORO>
Apr 15 17:34:46 LAB-server5 corosync[589614]:   [WD    ] Watchdog not enabled by configuration
Apr 15 17:34:46 LAB-server5 corosync[589614]:   [WD    ] resource load_15min missing a recovery key.
Apr 15 17:34:46 LAB-server5 corosync[589614]:   [WD    ] resource memory_used missing a recovery key.
Apr 15 17:34:46 LAB-server5 corosync[589614]:   [KNET  ] host: host: 1 has no active links
Apr 15 17:34:46 LAB-server5 corosync[589614]:   [KNET  ] host: host: 1 has no active links
Apr 15 17:34:46 LAB-server5 corosync[589614]:   [KNET  ] host: host: 1 has no active links
[root@LAB-server5 ~]# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: LAB-server1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.100
  }
  node {
    name: LAB-server5
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.120
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: LAB-home
  config_version: 3
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

[root@LAB-server5 ~]# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: LAB-server1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.100
  }
  node {
    name: LAB-server5
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.120
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: LAB-home
  config_version: 3
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

As advised, I've bumped the Server5 config file to config_version 3.

Hope this helps!
 
you now also need to deploy the updated config file to /etc/pve on the other node (Server1), and then restart the corosync and pve-cluster services on both nodes.
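
The outputs above show the mismatch: Server1 is still on config_version 2 with ring0_addr 192.168.1.110 for LAB-server5, while Server5 is on version 3 with 192.168.1.120. A small sketch to check which copy is newer before overwriting anything (config_version_of and newer_conf are hypothetical helper names, and the file locations in the example are assumptions):

```shell
#!/bin/sh
# Sketch: print the config_version found in a corosync.conf file.
# config_version_of and newer_conf are hypothetical helper names.
config_version_of() {
    awk '$1 == "config_version:" { print $2; exit }' "$1"
}

# Print whichever of the two files carries the higher config_version.
# A file without a config_version line is treated as version 0.
newer_conf() {
    a=$(config_version_of "$1")
    b=$(config_version_of "$2")
    if [ "${a:-0}" -ge "${b:-0}" ]; then echo "$1"; else echo "$2"; fi
}

# Example on Server1, after copying Server5's file over (e.g. with scp)
# to /root/corosync.conf.server5:
#   newer_conf /etc/pve/corosync.conf /root/corosync.conf.server5
```

Since /etc/pve on Server1 is read-only without quorum, putting the newer file in place there would presumably need the same stop / pmxcfs -l / restart sequence described earlier in this thread.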