HA Fencing issues

mike15301

New Member
May 13, 2024
I'm having some issues with Proxmox seeming to fence and reboot one of my nodes when it shouldn't.

My setup:
2 Proxmox (8.1.10) virtual hosts with HA/corosync
1 Ubuntu (22.04 LTS) storage server as the qdevice vote
All boards are Supermicro, and the cluster network (and Ceph, but that's irrelevant since Ceph works fine) runs over a separate 40GbE link.

When I migrate all VMs to vhost1 and shut down vhost2 for maintenance, vhost1 will reboot itself shortly afterwards (maybe less than 30 seconds after vhost2 goes down?). I believe this is Proxmox losing quorum and fencing vhost1 - what log should I be looking at to check for sure?

While vhost2 is still down and after vhost1 comes back online, I cannot start any VMs on vhost1 due to lack of quorum. So it seems the qdevice isn't providing a third vote for quorum, but as far as I can tell everything is set up correctly.


Code:
root@VHOST1:~# pvecm s
Cluster information
-------------------
Name:             homelab
Config Version:   4
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon May 13 16:57:21 2024
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.10e9
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 172.16.0.1 (local)
0x00000002          1    A,V,NMW 172.16.0.2
0x00000000          1            Qdevice



Code:
root@VHOST1:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: VHOST1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.0.1
  }
  node {
    name: VHOST2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.0.2
  }
}

quorum {
  device {
    model: net
    net {
      algorithm: ffsplit
      host: 172.16.0.3
      tls: on
    }
    votes: 1
  }
  provider: corosync_votequorum
}

totem {
  cluster_name: homelab
  config_version: 4
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}


Not sure if these messages from the qnetd service indicate where the problem is:
Code:
root@storage1:/mnt/raid/storage# service corosync-qnetd status
● corosync-qnetd.service - Corosync Qdevice Network daemon
     Loaded: loaded (/lib/systemd/system/corosync-qnetd.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2024-04-24 03:44:42 EDT; 2 weeks 5 days ago
       Docs: man:corosync-qnetd
   Main PID: 3636 (corosync-qnetd)
      Tasks: 1 (limit: 154400)
     Memory: 8.9M
        CPU: 1min 15.283s
     CGroup: /system.slice/corosync-qnetd.service
             └─3636 /usr/bin/corosync-qnetd -f

May 12 10:05:30 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:41242 doesn't sent any message during 12000ms. Disconnecting
May 12 14:05:24 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:34990 doesn't sent any message during 12000ms. Disconnecting
May 12 18:05:23 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:34028 doesn't sent any message during 12000ms. Disconnecting
May 12 19:57:21 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.1:57120 doesn't sent any message during 12000ms. Disconnecting
May 12 19:57:22 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:40022 doesn't sent any message during 12000ms. Disconnecting
May 13 00:01:13 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:40406 doesn't sent any message during 12000ms. Disconnecting
May 13 00:55:12 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.1:46806 doesn't sent any message during 12000ms. Disconnecting
May 13 02:27:59 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.1:53194 doesn't sent any message during 12000ms. Disconnecting
May 13 02:35:44 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:55096 doesn't sent any message during 12000ms. Disconnecting
May 13 02:39:34 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:44588 doesn't sent any message during 12000ms. Disconnecting
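
For what it's worth, I can also pull status from both ends; I believe these are the standard tools for that (happy to post their output if useful):

Code:
# on a PVE node: local view of the qdevice daemon
root@VHOST1:~# corosync-qdevice-tool -s -v

# on the qnetd host: list connected clients and their vote state
root@storage1:~# corosync-qnetd-tool -l -v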
 
Hi,
I'm having some issues with Proxmox seeming to fence and reboot one of my nodes when it shouldn't.

My setup:
2 Proxmox (8.1.10) virtual hosts with HA/corosync
1 Ubuntu (22.04 LTS) storage server as the qdevice vote
All boards are Supermicro, and the cluster network (and Ceph, but that's irrelevant since Ceph works fine) runs over a separate 40GbE link.
It could be that Ceph traffic interferes with Corosync communication (which needs very low latency) then.

When I migrate all VMs to vhost1 and shut down vhost2 for maintenance, vhost1 will reboot itself shortly afterwards (maybe less than 30 seconds after vhost2 goes down?). I believe this is Proxmox losing quorum and fencing vhost1 - what log should I be looking at to check for sure?

While vhost2 is still down and after vhost1 comes back online, I cannot start any VMs on vhost1 due to lack of quorum. So it seems the qdevice isn't providing a third vote for quorum, but as far as I can tell everything is set up correctly.
Please provide the system log/journal from vhost1 around the time the issue happened.
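Something along these lines should capture the relevant window (the timestamps are placeholders, adjust them to when the reboot happened):

Code:
# journal around the suspected fencing time
root@VHOST1:~# journalctl --since "2024-05-12 19:45" --until "2024-05-12 20:10"

# or only the cluster/HA services from the previous boot
root@VHOST1:~# journalctl -b -1 -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux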

Code:
May 12 10:05:30 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:41242 doesn't sent any message during 12000ms. Disconnecting
May 12 14:05:24 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:34990 doesn't sent any message during 12000ms. Disconnecting
May 12 18:05:23 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:34028 doesn't sent any message during 12000ms. Disconnecting
May 12 19:57:21 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.1:57120 doesn't sent any message during 12000ms. Disconnecting
May 12 19:57:22 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:40022 doesn't sent any message during 12000ms. Disconnecting
May 13 00:01:13 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:40406 doesn't sent any message during 12000ms. Disconnecting
May 13 00:55:12 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.1:46806 doesn't sent any message during 12000ms. Disconnecting
May 13 02:27:59 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.1:53194 doesn't sent any message during 12000ms. Disconnecting
May 13 02:35:44 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:55096 doesn't sent any message during 12000ms. Disconnecting
May 13 02:39:34 storage1 corosync-qnetd[3636]: Client ::ffff:172.16.0.2:44588 doesn't sent any message during 12000ms. Disconnecting
Can you correlate these times with some other operations?
 
When I migrate all VMs to vhost1 and shut down vhost2 for maintenance
If you don't shut down vhost2 after migrating the VMs to vhost1, does vhost1 remain stable without rebooting? Are all the migrated VMs working as expected (no unknown links/dependencies on vhost2)?

While vhost2 is still down and after vhost1 comes back online, I cannot start any VMs on vhost1 due to lack of quorum.
What does root@VHOST1:~# pvecm s look like at that point?

Active: active (running) since Wed 2024-04-24 03:44:42 EDT; 2 weeks 5 days ago
Have you tried rebooting/restarting the QDevice?
 
I shut down vhost2 again last night and vhost1 stayed online this time; it doesn't happen every time... but it has happened at least 3 times out of maybe 10 or so in my previous testing and now while troubleshooting a cooling issue on vhost2.

It could be that Ceph traffic interferes with Corosync communication (which needs very low latency) then.
Is there a way to alter corosync behavior to increase the allowed latency? Or alternatively, to increase the number of checks the system does, kind of like the rise/fall options of haproxy or keepalived? I do have a 10G network as well, but all of the VMs use that network for traffic. It is likely to see less bandwidth use though (both the Ceph public/backend network and cluster live migration use the 40G) - do you think the 10G network would be better for corosync, leaving Ceph/migration on the 40G?
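
For reference, the knob I had in mind is the totem token timeout - something like this in the totem section of /etc/pve/corosync.conf (with config_version bumped), though I haven't tried it and have no idea what a sane value would be:

Code:
totem {
  ...
  token: 10000    # untested guess: allow up to 10s of token loss before a node is declared dead
}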

Please provide the system log/journal from vhost1 around the time the issue happened.
This is where I needed to look, thank you - it looks like the Mellanox card puked, and it just happened to be at the exact same time vhost2 went down for maintenance. I'll be sure to keep an eye out for this now, since this isn't the first time vhost1 went down when I took vhost2 down. It looks like this card is running older firmware than the other 2 servers as well; I'm going to update it and hope that helps.

Can you correlate these times with some other operations?
Yep... May 13th 02:27 is when the Mellanox card on vhost1 crashed. The next 2 from vhost2 are likely because it was purposely down (though I did shut it down before vhost1 crashed). The others were likely also a result of me rebooting or the like. I'll start keeping a log of exactly when I shut down/reboot so I can line these events up.

If you don't shut down vhost2 after migrating the VMs to vhost1, does vhost1 remain stable without rebooting? Are all the migrated VMs working as expected (no unknown links/dependencies on vhost2)?
Yes, everything remains stable without rebooting - with the exception of a little over a week ago, when vhost1 rebooted itself without warning. I'm now inclined to believe the Mellanox card crashed at that time and vhost1 fenced itself; the other 2 nodes were online at the time.

What does root@VHOST1:~# pvecm s look like at that point?
This will be on my list of things to check. It's odd that the PVE GUI told me there was not enough quorum when vhost1 came back online. I didn't bother messing with it and just continued booting vhost2 back up, then brought all the VMs back online after. Maybe it was just too soon and the Mellanox card was still having issues?


Thanks guys, I've now got a better idea of what's going on... looks like I'll need to keep an eye on that flaky card after I update the firmware. I'll post back if I get stuck again.
 
Revisiting this since I am still having issues. On a positive note, it looks like the firmware update to the Mellanox card may have fixed that issue; I haven't seen it crap out since.

I've also made some changes - corosync now runs on the 10G network shared by the VMs, as this network should see less traffic.
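
I'm also considering keeping the 40G network as a second corosync link for redundancy - if I'm reading the knet/multi-link docs right, the nodelist for the two main hosts would look something like this (just a sketch, not applied yet):

Code:
nodelist {
  node {
    name: VHOST1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.10   # 10G LAN, primary corosync link
    ring1_addr: 172.16.0.1     # 40G cluster network as a fallback link
  }
  node {
    name: VHOST2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.20
    ring1_addr: 172.16.0.2
  }
}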

I've also installed PVE on my workbench machine as a dual-boot Ubuntu/PVE box... this machine is usually powered off, or booted into Ubuntu. I wanted to have it as a backup PVE box in case I take the entire server rack down for whatever reason. So it is joined to the cluster, but configured to have no votes. I think I have everything set up correctly to do this, but for some reason I'm having very repeatable issues.

If I migrate all VMs to vhost2 and power off vhost1, vhost2 loses quorum and then fences itself and reboots. This happened 3 times in a row in testing, each time with the same symptoms: before the vhost1 power-down, expected and highest votes show 3 as they should, total votes 3, quorum 2. After the vhost1 power-down, expected, highest, and quorum stay the same, but total drops to 1 - as if the QDevice doesn't provide a vote.


Code:
root@VHOST2:~# pvecm status
Cluster information
-------------------
Name:             homelab
Config Version:   5
Transport:        knet
Secure auth:      on


Quorum information
------------------
Date:             Fri May 31 13:11:30 2024
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1.1616
Quorate:          Yes


Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice


Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1         NR 192.168.1.10
0x00000002          1    A,V,NMW 192.168.1.20 (local)
0x00000000          1            Qdevice
root@VHOST2:~# pvecm status
Cluster information
-------------------
Name:             homelab
Config Version:   5
Transport:        knet
Secure auth:      on


Quorum information
------------------
Date:             Fri May 31 13:11:32 2024
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2.161a
Quorate:          No


Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:            Qdevice


Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000002          1   A,NV,NMW 192.168.1.20 (local)
0x00000000          0            Qdevice (votes 1)

I noticed the last line there, where the qdevice shows 0 votes but puts "votes 1" in parentheses?

Here is the new corosync.conf with the no-vote backup host added - I did it this way because this host is only meant as a backup, not really part of the cluster per se, but there is apparently no other way to live migrate to it if it isn't included in the cluster:


Code:
root@VHOST2:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: VHOST1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.10
  }
  node {
    name: VHOST2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.20
  }
  node {
    name: VHOST-backup
    nodeid: 3
    quorum_votes: 0
    ring0_addr: 192.168.1.40
  }
}

quorum {
  device {
    model: net
    net {
      algorithm: ffsplit
      host: 192.168.1.30
      tls: on
    }
    votes: 1
  }
  provider: corosync_votequorum
  expected_votes: 3
}

totem {
  cluster_name: homelab
  config_version: 5
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

If this is not the correct way of doing this, please let me know how I should have it set up instead. I come from a long history of RHEL/CentOS, RHEV/oVirt, and OpenStack, so the internals of PVE are a bit different, though corosync isn't. But switching to PVE at home was because of the dangling carrot of live migration with vGPU support (not officially supported, but confirmed working with some caveats), and though we have plenty of RHEL licenses at work, those don't apply to my play toys at home.
 
And since it will probably get asked: the reason for a backup host to keep some things running if I take the entire rack down is pretty much only a couple of small VMs with external services and, most importantly, my router. I'd hate to have to go back to the days of running my router on its own piece of hardware when he's just so happy being virtualized, poor guy....
 
Code:
    Nodeid      Votes    Qdevice Name
0x00000001          1         NR 192.168.1.10
0x00000002          1    A,V,NMW 192.168.1.20 (local)
0x00000000          1            Qdevice
The QDevice was not registered (NR) for the first node.

Afterwards the QDevice goes from V = vote to NV = not vote. Does communication with the QDevice work at the time node1 is down? What do the logs say?
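
E.g. something like this on both ends (assuming the standard unit names):

Code:
# on vhost2: qdevice client + corosync
journalctl -u corosync-qdevice -u corosync --since "1 hour ago"

# on the qnetd server
journalctl -u corosync-qnetd --since "1 hour ago"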

I'm not sure the inactive third node doesn't mess up the situation. At least the documentation talks about the number of (active) nodes, not nodes with votes: https://manpages.debian.org/testing/corosync-qdevice/corosync-qdevice.8.en.html#ffsplit
 
There should be no reason for communication between the qdevice and vhost2 to not work when node 1 is down; each machine is connected with its own uplink to the switch. The V to NV change does look like it's the problem, but why is it switching?

The status output actually shows the number of nodes as "2", so I suspect that it is not counting the backup node without a vote as a node. However, the manual you linked does state: "To use this algorithm it's required to set the number of votes per node to 1 (default) and the qdevice number of votes has to be also 1. This is achieved by setting quorum.device.votes key in corosync.conf file to 1." - so I'm not sure if that means the node with 0 votes will cause ffsplit to not function.
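
If that requirement really applies to every node in the nodelist, I suppose the "proper" shape would be to give the backup box a full vote as well (and drop the expected_votes: 3 override so corosync derives it from the nodelist) - a sketch, not something I've tried yet:

Code:
  node {
    name: VHOST-backup
    nodeid: 3
    quorum_votes: 1    # every node at exactly 1 vote, as the ffsplit docs seem to require
    ring0_addr: 192.168.1.40
  }

Though if the backup box is normally powered off, it would then still count toward expected votes (3 nodes + qdevice = 4 votes, quorum 3), so a single remaining host plus the qdevice would only have 2 of 4 - which rather defeats the purpose, so maybe that's not the answer either.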

I didn't do anything to cause vhost1 to show as NR, so I don't know why that would be. Checking on it currently, everything looks as it should:

Code:
Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.1.10
0x00000002          1    A,V,NMW 192.168.1.20 (local)
0x00000000          1            Qdevice

I didn't touch any configs or services since last week when I was testing. I'm going to migrate everything to vhost2, shut down vhost1 now, and see what happens...
 
Immediately when I shut down vhost1, it changes to NR in pvecm status. After vhost1 shuts down, it disappears from the membership list and vhost2 goes to NV for the qdevice.


Code:
Every 1.0s: pvecm status                                                                                                                 VHOST2: Wed Jun  5 19:44:38 2024

Cluster information
-------------------
Name:             homelab
Config Version:   5
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Jun  5 19:44:39 2024
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2.1627
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:            Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000002          1   A,NV,NMW 192.168.1.20 (local)
0x00000000          0            Qdevice (votes 1)

During this time I have another terminal open pinging the qdevice, and it keeps responding right up until the point vhost2 fences itself and reboots.
 
The system logs from vhost2 when I shut down vhost1 above:

Code:
Jun 05 19:43:56 VHOST2 pve-ha-crm[2535]: node 'VHOST1': state changed from 'online' => 'maintenance'
Jun 05 19:44:01 VHOST2 corosync[2278]:   [CFG   ] Node 1 was shut down by sysadmin
Jun 05 19:44:01 VHOST2 pmxcfs[2133]: [dcdb] notice: members: 2/2133
Jun 05 19:44:01 VHOST2 pmxcfs[2133]: [status] notice: members: 2/2133
Jun 05 19:44:02 VHOST2 corosync[2278]:   [QUORUM] Sync members[1]: 2
Jun 05 19:44:02 VHOST2 corosync[2278]:   [QUORUM] Sync left[1]: 1
Jun 05 19:44:02 VHOST2 corosync[2278]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Jun 05 19:44:02 VHOST2 corosync[2278]:   [TOTEM ] A new membership (2.1627) was formed. Members left: 1
Jun 05 19:44:02 VHOST2 corosync[2278]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jun 05 19:44:02 VHOST2 corosync[2278]:   [QUORUM] Members[1]: 2
Jun 05 19:44:02 VHOST2 corosync[2278]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 05 19:44:02 VHOST2 pmxcfs[2133]: [status] notice: node lost quorum
Jun 05 19:44:02 VHOST2 corosync[2278]:   [KNET  ] link: host: 1 link: 0 is down
Jun 05 19:44:02 VHOST2 corosync[2278]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 05 19:44:02 VHOST2 corosync[2278]:   [KNET  ] host: host: 1 has no active links
Jun 05 19:44:06 VHOST2 pvestatd[2427]: status update time (5.831 seconds)
Jun 05 19:44:06 VHOST2 pve-ha-crm[2535]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Jun 05 19:44:09 VHOST2 pve-ha-lrm[2817]: lost lock 'ha_agent_VHOST2_lock - cfs lock update failed - Permission denied
Jun 05 19:44:11 VHOST2 pve-ha-crm[2535]: status change master => lost_manager_lock
Jun 05 19:44:11 VHOST2 pve-ha-crm[2535]: watchdog closed (disabled)
Jun 05 19:44:11 VHOST2 pve-ha-crm[2535]: status change lost_manager_lock => wait_for_quorum
Jun 05 19:44:14 VHOST2 pve-ha-lrm[2817]: status change active => lost_agent_lock
Jun 05 19:44:14 VHOST2 pvescheduler[3799257]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jun 05 19:44:14 VHOST2 pvescheduler[3799256]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jun 05 19:44:18 VHOST2 pvestatd[2427]: status update time (8.599 seconds)
Jun 05 19:44:28 VHOST2 pvestatd[2427]: status update time (7.841 seconds)
-- Boot a2330e858355432ca2ade3cc47de8b76 --
 
Just something I noticed, may be relevant but may be barking up the wrong tree!

You say:
I'd hate to have to go back to the days of running my router on its own piece of hardware when he's just so happy being virtualized
So on which server host is that virtualized router? What effect on the network is that having when its host is shut down?
 
Just something I noticed, may be relevant but may be barking up the wrong tree!

You say:
I'd hate to have to go back to the days of running my router on its own piece of hardware when he's just so happy being virtualized
So on which server host is that virtualized router? What effect on the network is that having when its host is shut down?
The router could be on either host. I'm using a physical enterprise managed switch - ports 1-16 are on VLAN 1 (cable modem uplinked), ports 17-32 are on VLAN 2 (5G modem uplinked), 33-48 are on VLAN 10 (local 10G LAN), and 49-52 are on VLAN 40 (40G cluster network). Both host servers have their two onboard 1G Ethernet ports connected to VLAN 1 and VLAN 2 on the switch, and the router gets 3 virtual NICs: one on VLAN 1, one on VLAN 2, and the 3rd on VLAN 10. So the router can float between either host, as well as the backup host, and it will always have 2 WANs and 1 LAN. Of course, the enterprise switch is the single point of failure in all of this, and ideally I'd have a 2nd with LACP and redundant links, but that's a future-Mike problem.

Anyway, in short, the router is only for WAN access. If the router is down, or any hosts are down, the enterprise switch is still up and both the LAN and the cluster network remain active. Hostnames are mapped in /etc/hosts on the physical hosts as per the PVE instructions, so DNS from the router should be irrelevant.
 
