I'm having an issue where Proxmox seems to fence and reboot one of my nodes when it shouldn't.
My setup:
2 Proxmox (8.1.10) virtual hosts with HA/corosync
1 Ubuntu (22.04 LTS) storage server as qdevice vote
All boards are Supermicro, and the cluster network (and Ceph, but that's irrelevant as Ceph works fine) runs over a separate 40GbE link
When I migrate all VMs to vhost1 and shut down vhost2 for maintenance, vhost1 reboots itself shortly afterwards (probably less than 30 seconds after vhost2 goes down). I believe this is Proxmox losing quorum and fencing vhost1. Which log should I look at to confirm?
While vhost2 is still down and after vhost1 comes back online, I cannot start any VMs on vhost1 due to lack of quorum. So it seems the qdevice isn't providing the third vote needed for quorum, but as far as I can tell everything is set up correctly.
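For reference, this is the vote math I'm assuming, as a rough sketch: votequorum needs a strict majority of the expected votes, so with 3 expected votes the threshold is 2, and vhost1 plus the qdevice should be enough on their own. The function names here are just made up for illustration.

```shell
# Strict majority of expected votes: floor(expected / 2) + 1
quorum_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# is_quorate EXPECTED PRESENT
# Exit status 0 if PRESENT votes reach the threshold for EXPECTED votes.
is_quorate() {
    [ "$2" -ge "$(quorum_threshold "$1")" ]
}

expected=3
echo "threshold: $(quorum_threshold "$expected")"
if is_quorate "$expected" 2; then
    echo "vhost1 + qdevice (2 votes): quorate"
fi
if ! is_quorate "$expected" 1; then
    echo "vhost1 alone (1 vote): not quorate"
fi
```

So if vhost1 ends up without quorum while vhost2 is down, the qdevice vote is presumably not being counted at that moment.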
Code:
root@VHOST1:~# pvecm s
Cluster information
-------------------
Name: homelab
Config Version: 4
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Mon May 13 16:57:21 2024
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1.10e9
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate Qdevice
Membership information
----------------------
Nodeid Votes Qdevice Name
0x00000001 1 A,V,NMW 172.16.0.1 (local)
0x00000002 1 A,V,NMW 172.16.0.2
0x00000000 1 Qdevice
Code:
root@VHOST1:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: VHOST1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.0.1
  }
  node {
    name: VHOST2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.0.2
  }
}

quorum {
  device {
    model: net
    net {
      algorithm: ffsplit
      host: 172.16.0.3
      tls: on
    }
    votes: 1
  }
  provider: corosync_votequorum
}

totem {
  cluster_name: homelab
  config_version: 4
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
I'm not sure whether these messages from the qnetd service indicate where the problem is:
Code:
root@storage1:/mnt/raid/storage# service corosync-qnetd status
● corosync-qnetd.service - Corosync Qdevice Network daemon
Loaded: loaded (/lib/systemd/system/corosync-qnetd.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2024-04-24 03:44:42 EDT; 2 weeks 5 days ago
Docs: man:corosync-qnetd
Main PID: 3636 (corosync-qnetd)
Tasks: 1 (limit: 154400)
Memory: 8.9M
CPU: 1min 15.283s
CGroup: /system.slice/corosync-qnetd.service
└─3636 /usr/bin/corosync-qnetd -f
May 12 10:05:30 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:41242 doesn't sent any message during 12000ms. Disconnecting
May 12 14:05:24 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:34990 doesn't sent any message during 12000ms. Disconnecting
May 12 18:05:23 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:34028 doesn't sent any message during 12000ms. Disconnecting
May 12 19:57:21 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.1:57120 doesn't sent any message during 12000ms. Disconnecting
May 12 19:57:22 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:40022 doesn't sent any message during 12000ms. Disconnecting
May 13 00:01:13 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:40406 doesn't sent any message during 12000ms. Disconnecting
May 13 00:55:12 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.1:46806 doesn't sent any message during 12000ms. Disconnecting
May 13 02:27:59 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.1:53194 doesn't sent any message during 12000ms. Disconnecting
May 13 02:35:44 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:55096 doesn't sent any message during 12000ms. Disconnecting
May 13 02:39:34 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:44588 doesn't sent any message during 12000ms. Disconnecting
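To see whether one node gets dropped more often than the other, the disconnect lines can be tallied per client IP. This is only a sketch: `count_qnetd_disconnects` is a hypothetical helper name, and it assumes the message format stays exactly as in the log above (IPv4-mapped `::ffff:` address followed by a port).

```shell
# Count qnetd "Disconnecting" messages per client IP (source port stripped).
# Reads log lines on stdin and prints "<ip> <count>" per client.
count_qnetd_disconnects() {
    grep 'Disconnecting' \
        | sed -n 's/.*Client ::ffff:\([0-9.]*\):[0-9]* .*/\1/p' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}
```

On the storage server it could be fed from the journal, e.g. `journalctl -u corosync-qnetd --no-pager | count_qnetd_disconnects`. For the ten lines shown above it would give 3 disconnects for 172.16.0.1 and 7 for 172.16.0.2.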