Cluster broke down; can no longer access GUI

Por12

Member
Mar 6, 2023
Hello,

I have a 4-node setup. Something seems to have gone wrong: after rebooting one of the nodes (let's call it node 1), it no longer sees the others. The remaining nodes also work very poorly; I can't even log in to the GUI (it always says the password is invalid, even though the same credentials work via SSH). I can still reach every node over SSH and I can ping the nodes from one another.

Here is /etc/network/interfaces on zeus, followed by /etc/pve/corosync.conf:

Code:
auto lo
iface lo inet loopback

iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.77.250/24
        gateway 192.168.77.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
root@zeus:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: aphrodite
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.1.200
  }
  node {
    name: apollo
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.88.240
  }
  node {
    name: ares
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.1.201
  }
  node {
    name: zeus
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.77.250
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: foa-cluster
  config_version: 8
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
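
If I understand correctly, when the cluster loses quorum /etc/pve becomes read-only, which might be why the GUI keeps rejecting a valid password while SSH still works. These are the basic checks I can run on each node (just standard Proxmox/corosync commands, shown here as a sketch):

Code:
# quorum and membership as corosync sees it
pvecm status

# health of the cluster services
systemctl status corosync pve-cluster

# recent corosync messages since the last boot
journalctl -u corosync -b --no-pager | tail -n 50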

Any idea? Thanks in advance
 
I've been rebooting servers and routers. Now one node can see another but not all.

[Four screenshots of the cluster view from the different nodes were attached here.]
 
That corosync config is odd: only two nodes are on the same network (192.168.1.200 and .201), while the other two are each on a different network (192.168.77.x and 192.168.88.x).

You will need routes so that each node can reach all of the others.
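
Something like this on each node would confirm reachability at the corosync level (the addresses are the ring0_addr values from your config; corosync uses UDP port 5405 by default, so plain ping alone doesn't prove the links are usable):

Code:
# from each node, ping every other node's ring0 address
for ip in 192.168.1.200 192.168.88.240 192.168.1.201 192.168.77.250; do
    ping -c 2 "$ip"
done

# show how corosync itself sees its links from this node
corosync-cfgtool -s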
 
Everything is routed at the router, thanks for the heads-up. There is 2-8 ms latency between nodes, so I don't think that should be the issue, right? I can ping one node from another.
 
OK, I see. I'll break the cluster down and reinstall so that I have a clean config, since I'm unable to remove a node (it hangs).

I already have good backups of all LXCs and VMs, and all valuable data lives on a secondary RAIDZ1 ZFS array. Anything else I should consider before reinstalling? Will my secondary ZFS array be recognized? Is it worth saving the /etc folder? I guess I would have to re-add all the storage configs manually, right?
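
Roughly what I had in mind, as a sketch (the pool name "tank" is just a placeholder for my data pool):

Code:
# on each node, before the reinstall: keep a copy of /etc for reference
# (copy the archive off the node before wiping it)
tar czf /root/etc-$(hostname).tar.gz /etc

# after the reinstall: the existing data pool should be importable
zpool import          # lists pools found on the attached disks
zpool import -f tank  # import the old pool ("tank" is a placeholder)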

Thanks
 
- Write down the MAC, IP and interface name of each server (PVE 8.2 uses kernel 6.8, which usually renames the NICs; the MACs will help you identify them).
- Read the whole documentation about the PVE cluster manager [1].
- Try to use two corosync links [2]; each link should be a local network, with all nodes in the same subnet (see the sketch after the links below).
- Take a backup of every VM/CT and test restoring each one (just in case).
- A backup of /etc of each host won't hurt (e.g. you may reuse parts of storage.cfg when configuring the new cluster, such as adding an NFS or PBS server).

[1] https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_redundancy
[2] https://pve.proxmox.com/wiki/Cluster_Manager
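
For reference, a node entry using two links would look roughly like this; the addresses and the second NIC here are only an illustration, adjust them to your hardware:

Code:
totem {
  ...
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}

nodelist {
  node {
    name: zeus
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.10   # first link, e.g. the main LAN
    ring1_addr: 10.10.10.1     # second link, e.g. a dedicated cluster NIC
  }
  ...
}

If you rebuild the cluster from scratch, the links can also be given at creation time instead of editing corosync.conf by hand, e.g. pvecm create <clustername> --link0 <addr0> --link1 <addr1>.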
 
Thanks Victor. As I have 3 locations connected via WireGuard (1 node at location 1, 1 at location 2, and 2 nodes at location 3), it seems best to avoid clustering and configure each node separately, even if it's more work.

Regards
 
Such "geographically distributed" cluster configuration isn't supported. It simply won't really work as expected due to the requirements for quorum (and may have issues with PMXCFS too).

Using independent PVE servers at each location is the best approach. Multi-datacenter management is in the works, so eventually we will be able to manage all our servers from a single interface (there is no ETA, so it will take some time until it arrives).
 
I have a similar problem where I no longer can see or access the other nodes in my cluster via the GUI.

Here is my /etc/pve/corosync.conf:

root@b360m:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: 800G1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.0.10
  }
  node {
    name: b360m
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.0.13
  }
  node {
    name: gigax99
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.0.15
  }
  node {
    name: m90q
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.0.12
  }
  node {
    name: wyse3040
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.0.51
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Hillcrest
  config_version: 23
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

It was working fine, and then this happened suddenly.
I can still access each node via SSH.

How can I fix this?
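
Would it be safe to just check quorum and restart the cluster services on each node? Something like this (a sketch using the standard PVE service names):

Code:
# check quorum and membership first
pvecm status

# restart the cluster stack on the affected node
systemctl restart corosync pve-cluster

# then watch the logs while it rejoins
journalctl -u corosync -u pve-cluster -f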
 
