VE Cluster with 5 servers - issue

linhu · May 23, 2025

VE Cluster with 5 servers - issue

Hi, we have a ve clster with 5 servers, all server are:
Supermicro Server CSE-819U 2x 14-Core Xeon E5-2690 v4 2,6GHz 128GB 9361-8i
prox1 to prox5 have the same netconfig 192.168.1.150-154 (adminnet)
There are running 2-3 vm's on eatch server with local zfs storage, there are shared storage and 3 backupservers on the cluster, the shared storage are on the same net, the backupservers are on the same net

The strange thing is the cluster is not running very good with 5 servers online, because the prox2 goes offline,
- the smartdog shutting down the prox2,
- the prox2 comes up again but not online
- in the cluster i can see the prox2 is offline, i have to hardreset the server
- we have changed the hardware on prox2, also networkcard and mainboard
- after we have changed all hardware the server was running in standalone mode with proxmox and a test vm without issues
- now after 10days we have added prox2 to the cluster, and it goes offline after almost 1hour
- when we remove the server from the cluster, all is running without issues
- the difference between the prox2 and the other servers is, prox2 is new installed with version 8.2 and upgraded to 8.3, and the other servers are installed with version 8.0 and upgraded to 8.4

the internal net is a 10Gib net
HA is activatet on almost all vm's
all proxmox software installations are default and now running version 8.4

any ideas what is going on?

SteveITS · May 23, 2025

Per https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_requirements "If you are interested in High Availability, ... All nodes should have the same version."

If 2 cannot reliably communicate with the cluster, it will reboot itself (due to no quorum), is it rebooting? Can it communicate over the corosync network(s)? (pinging between all nodes)

linhu · May 23, 2025

SteveITS said:
Per https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_requirements "If you are interested in High Availability, ... All nodes should have the same version."

If 2 cannot reliably communicate with the cluster, it will reboot itself (due to no quorum), is it rebooting? Can it communicate over the corosync network(s)? (pinging between all nodes)

all servers are running on 8.4, yes it can communicate over the corosync net,

alexskysilk · May 23, 2025

linhu said:
any ideas what is going on?

All answers are in your logs. start there.

linhu · May 23, 2025

the log says there is no quorum, but why? we have chnaged the hardware, there is no issue on the switch or cable, so why says there are no qourum and it reboots but is not coming online it have to hardreset to reboot second time and then it comes online
the same server has no issues with version 8.0 the issue come after update to 8.2-8.4

alexskysilk · May 23, 2025

linhu said:
the log says there is no quorum, but why?

Thats also in the logs.

If I have to guess, its a networking issue. post the content of /etc/network/interfaces and /etc/pve/corosync.conf for validation and recommendations.

linhu · May 24, 2025

Code:

###interfaces prox1 3,4,5 the same only the ip is different###

auto lo
iface lo inet loopback

auto enp1s0f0
iface enp1s0f0 inet manual
#LAN1

auto enp1s0f1
iface enp1s0f1 inet manual
#LAN2

auto enp2s0f0
iface enp2s0f0 inet manual
#WAN1

auto enp2s0f1
iface enp2s0f1 inet manual
#WAN2

auto vmbr0
iface vmbr0 inet static
    address 192.168.1.150/24
    gateway 192.168.1.1
    bridge-ports enp1s0f0
    bridge-stp off
    bridge-fd 0

auto lan
iface lan inet manual
    bridge-ports enp1s0f1
    bridge-stp off
    bridge-fd 0
#serverlan

auto wan
iface wan inet manual
    bridge-ports enp2s0f1
    bridge-stp off
    bridge-fd 0
#lan-vpn-kontor

auto wwan
iface wwan inet manual
    bridge-ports enp2s0f0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 3500
#WAN

source /etc/network/interfaces.d/*

###interfaces prox2###

auto lo
iface lo inet loopback

auto enp1s0f0
iface enp1s0f0 inet manual
#LAN1

auto enp1s0f1
iface enp1s0f1 inet manual
#LAN2

auto enp2s0f0
iface enp2s0f0 inet manual
#WAN1

auto enp2s0f1
iface enp2s0f1 inet manual
#WAN2

auto vmbr0
iface vmbr0 inet static
    address 192.168.1.151/24
    gateway 192.168.1.1
    bridge-ports enp1s0f0
    bridge-stp off
    bridge-fd 0

auto lan
iface lan inet manual
    bridge-ports enp1s0f1
    bridge-stp off
    bridge-fd 0
#serverlan

auto wan
iface wan inet manual
    bridge-ports enp2s0f1
    bridge-stp off
    bridge-fd 0
#lan-vpn-kontor

auto wwan
iface wwan inet manual
    bridge-ports enp2s0f0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 3500
#WAN

source /etc/network/interfaces.d/*

linhu · May 24, 2025

we have removed prox2 again from the cluster but the config was the same with the

node {
name: prox2
nodeid: 2
quorum_votes: 1
ring0_addr: 192.168.1.151
}

Code:

###corosync config###

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: prox1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.150
  }
  node {
    name: prox3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.1.152
  }
  node {
    name: prox4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.1.153
  }
  node {
    name: prox5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.1.154
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: site
  config_version: 14
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

alexskysilk · May 24, 2025

I gather from this that you are using a single interface for all traffic, including corosync. This is bad practice and can lead to the exact behavior you are seeing.

If you want to eliminate the possibility of corosync interruption, do not comingle other traffic with it on the same interface. better yet, add a second interface to make sure interface issues dont get your node fenced.

linhu · May 24, 2025

alexskysilk said:
I gather from this that you are using a single interface for all traffic, including corosync. This is bad practice and can lead to the exact behavior you are seeing.

If you want to eliminate the possibility of corosync interruption, do not comingle other traffic with it on the same interface. better yet, add a second interface to make sure interface issues dont get your node fenced.

vmbr0 is only for internal trafic not for vm's,
vm interfaces are: enp1s0f1 is used for lan, enp2s0f0 is isolatet for wan trafic

alexskysilk · May 24, 2025

linhu said:
vmbr0 is only for internal trafic not for vm's,
vm interfaces are: enp1s0f1 is used for lan, enp2s0f0 is isolatet for wan trafic

...

I wont correct you, but you really should go back and read how networking works in a linux hypervisor environment. Suffice it to say it doesnt work like you think it does.

linhu · May 24, 2025

we are curently in progress to make a new network structure, also for the other clusters, we have not only one proxmox clutser, we are also changing the storage, so there is a lot of work in the next week

Search

Search

VE Cluster with 5 servers - issue

linhu

Member