Cluster broke down; can no longer access GUI

Por12

Member
Mar 6, 2023
Hello,

I have a 4-node setup. Something seems to have gone wrong: after rebooting one of the nodes (let's call it node 1), it no longer sees the others. The remaining nodes also work very poorly; I can't even log in to the GUI (it always says the password is invalid, even though the same credentials work via SSH). I can still reach every node over SSH and I can ping the nodes from one another.

Here is /etc/network/interfaces on zeus, followed by /etc/pve/corosync.conf:

Code:
auto lo
iface lo inet loopback

iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.77.250/24
        gateway 192.168.77.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
root@zeus:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: aphrodite
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.1.200
  }
  node {
    name: apollo
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.88.240
  }
  node {
    name: ares
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.1.201
  }
  node {
    name: zeus
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.77.250
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: foa-cluster
  config_version: 8
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
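
If I understand correctly, when the cluster loses quorum /etc/pve becomes read-only, which might be why the GUI keeps rejecting a valid password while SSH still works. These are the basic checks I can run on each node (just standard Proxmox/corosync commands, shown here as a sketch):

Code:
# quorum and membership as corosync sees it
pvecm status

# health of the cluster services
systemctl status corosync pve-cluster

# recent corosync messages since the last boot
journalctl -u corosync -b --no-pager | tail -n 50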

Any idea? Thanks in advance
 
I've been rebooting servers and routers. Now one node can see another but not all.

[Four screenshots of the cluster view from the different nodes were attached here.]
 
That corosync config is odd: only two nodes are on the same network (192.168.1.200 and .201), while the other two are each on a different network (192.168.77.x and 192.168.88.x).

You will need routes so that each node can reach all of the others.
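
Something like this on each node would confirm reachability at the corosync level (the addresses are the ring0_addr values from your config; corosync uses UDP port 5405 by default, so plain ping alone doesn't prove the links are usable):

Code:
# from each node, ping every other node's ring0 address
for ip in 192.168.1.200 192.168.88.240 192.168.1.201 192.168.77.250; do
    ping -c 2 "$ip"
done

# show how corosync itself sees its links from this node
corosync-cfgtool -s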
 
Everything is routed at the router, thanks for the heads-up. There is 2-8 ms latency between nodes, so I don't think that should be the issue, right? I can ping one node from another.
 
OK, I see. I'll break the cluster down and reinstall so that I have a clean config, since I'm unable to remove a node (it hangs).

I already have good backups of all LXCs and VMs, and all valuable data lives on a secondary RAIDZ1 ZFS array. Anything else I should consider before reinstalling? Will my secondary ZFS array be recognized? Is it worth saving the /etc folder? I guess I would have to re-add all the storage configs manually, right?
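
Roughly what I had in mind, as a sketch (the pool name "tank" is just a placeholder for my data pool):

Code:
# on each node, before the reinstall: keep a copy of /etc for reference
# (copy the archive off the node before wiping it)
tar czf /root/etc-$(hostname).tar.gz /etc

# after the reinstall: the existing data pool should be importable
zpool import          # lists pools found on the attached disks
zpool import -f tank  # import the old pool ("tank" is a placeholder)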

Thanks
 
- Write down the MAC, IP and interface name of each server (PVE 8.2 uses kernel 6.8, which usually renames the NICs; the MACs will help you identify them).
- Read the whole documentation about the PVE cluster manager [1].
- Try to use two corosync links [2]; each link should be a local network, with all nodes in the same subnet (see the sketch after the links below).
- Take a backup of every VM/CT and test restoring each one (just in case).
- A backup of /etc of each host won't hurt (e.g. you may reuse parts of storage.cfg when configuring the new cluster, such as adding an NFS or PBS server).

[1] https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_redundancy
[2] https://pve.proxmox.com/wiki/Cluster_Manager
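
For reference, a node entry using two links would look roughly like this; the addresses and the second NIC here are only an illustration, adjust them to your hardware:

Code:
totem {
  ...
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}

nodelist {
  node {
    name: zeus
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.10   # first link, e.g. the main LAN
    ring1_addr: 10.10.10.1     # second link, e.g. a dedicated cluster NIC
  }
  ...
}

If you rebuild the cluster from scratch, the links can also be given at creation time instead of editing corosync.conf by hand, e.g. pvecm create <clustername> --link0 <addr0> --link1 <addr1>.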
 
Thanks Victor. As I have 3 locations connected via WireGuard (1 node at location 1, 1 at location 2, and 2 nodes at location 3), it seems best to avoid clustering and configure each node separately, even if it's more work.

Regards
 
Such "geographically distributed" cluster configuration isn't supported. It simply won't really work as expected due to the requirements for quorum (and may have issues with PMXCFS too).

Using independent PVE servers at each location is the best approach. Multi-datacenter management is in the works, so eventually we will be able to manage all our servers from a single interface (there is no ETA, so it will take some time until it arrives).
 
I have a similar problem where I no longer can see or access the other nodes in my cluster via the GUI.

Here is my /etc/pve/corosync.conf:

root@b360m:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: 800G1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.0.10
  }
  node {
    name: b360m
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.0.13
  }
  node {
    name: gigax99
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.0.15
  }
  node {
    name: m90q
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.0.12
  }
  node {
    name: wyse3040
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.0.51
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Hillcrest
  config_version: 23
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

It was working fine, and then this happened suddenly.
I can still access each node via SSH.

How can I fix this?
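
Would it be safe to just check quorum and restart the cluster services on each node? Something like this (a sketch using the standard PVE service names):

Code:
# check quorum and membership first
pvecm status

# restart the cluster stack on the affected node
systemctl restart corosync pve-cluster

# then watch the logs while it rejoins
journalctl -u corosync -u pve-cluster -f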
 
