New to Proxmox: how to troubleshoot 'fencing'

lknite

Sep 27, 2024
I purchased a 3rd server (pve-c) and set up all three with Proxmox a couple of weeks ago as a cluster, with Ceph as the underlying storage (using HCI).

Since then, twice I've received an email saying the new server (pve-c) was being fenced.
Code:
The node 'pve-c' failed and needs manual intervention.

The PVE HA manager tries to fence it and recover the configured HA resources to a healthy node if possible.

When I log in everything seems normal, but I can see a VM was moved to another server and another was shut down.

This is my first time using Proxmox, so my first issue to investigate: how do I look into this fencing? The server seems OK, so I'm not sure what could be failing. When I look at the 'System Log' via the GUI at the time of the email, I don't see anything unusual.

Additional detail: I looked at Ceph and it showed the 'manager' was down, which was on a different server, pve-a. Not sure why it was stopped, so I clicked start, and I also created a second manager on pve-b. The manager on pve-b went into standby, the manager on pve-a started, and after a few seconds Ceph showed HEALTH_OK.
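(For reference, the same checks can be done from a shell; a minimal sketch, assuming the standard Ceph CLI is available on the nodes:)

Code:
# overall cluster health, including which manager is active
ceph -s

# manager daemons only: the active one plus any standbys
ceph mgr stat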
 
This is my first time using Proxmox, so my first issue to investigate: how do I look into this fencing? The server seems OK, so I'm not sure what could be failing. When I look at the 'System Log' via the GUI at the time of the email, I don't see anything unusual.

I like non-GUI, so I would start with journalctl -b -1 -e and take it from there. Look for logs from the pve-ha-* services, corosync, and watchdog-mux. You can post the tail end here for further troubleshooting. If you are interested in digging deeper (without actually making any changes), you might find this one useful to read: https://forum.proxmox.com/threads/high-availability-watchdog-reboots.154580/
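For example, a sketch of narrowing the previous boot's journal down to the relevant units (adjust the unit names to what is actually present on your nodes):

Code:
# full journal from the previous boot, jumping to the end
journalctl -b -1 -e

# only the HA stack, corosync, and the watchdog from the previous boot
journalctl -b -1 -u pve-ha-lrm -u pve-ha-crm -u corosync -u watchdog-mux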
 
So far, the only thing I've discovered is that all three servers have been fenced at some point and then come back online, each within about 30 seconds.

Not all at once; I seem to lose one (and it comes back up quickly) about every couple of days. ... so weird.

I know Ceph is somewhat slow ... but everything else: 10 Gb NICs, NVMe, 32 cores on each, 128 GB of memory on each ... not sure what would time out.
 
Last edited:
Hello,

Do you have a dedicated NIC for Corosync? Ceph will quickly saturate a 10G NIC, and if Corosync is using the same network it will deem it unusable and lose quorum.

Check the logs for events where Corosync lost quorum. Nodes are fenced after 60s without quorum.
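A quick way to scan for such events (a sketch; the exact log wording varies between Corosync versions):

Code:
# look for quorum changes, link failures, and token timeouts
journalctl -u corosync | grep -Ei 'quorum|link|token'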
 
Hello,

Do you have a dedicated NIC for Corosync? Ceph will quickly saturate a 10G NIC, and if Corosync is using the same network it will deem it unusable and lose quorum.

Check the logs for events where Corosync lost quorum. Nodes are fenced after 60s without quorum.
I have 2 NICs on each server (.21, .22, .23):
vmbr0: 1 Gb, 10.0.0.21
vmbr1: 10 Gb, 10.0.1.21

I use vmbr1 for the Ceph storage, and I also created the Proxmox cluster using that interface. Sounds like I should have used vmbr0 for creating the cluster.

Should I adjust the IP the Proxmox nodes are using for the cluster? Or just dedicate vmbr0 to Corosync? I'll Google how to configure a NIC for Corosync once I get into the office.
 
You can find more info on how to set up Corosync at [1], and don't forget to bump the `config_version` for the changes to be applied. Consider temporarily disabling HA before changing any Corosync config so the nodes do not fence if there is an issue.
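One way to pause the HA stack while you edit (a sketch; stop the LRM before the CRM, on every node, and start them again once Corosync is healthy):

Code:
# on every node: stop the local resource manager first, then the manager
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm

# ... edit /etc/pve/corosync.conf and bump config_version ...

systemctl start pve-ha-crm
systemctl start pve-ha-lrm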

I think in your use case it makes sense to use vmbr0 as ring0 and vmbr1 as ring1 for fallback. You can use

Code:
corosync-cfgtool -n

to verify that both networks can be used by Corosync (they should both report `enabled` AND `connected`).

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_configuration
 
When I run 'corosync-cfgtool -n' I'm seeing:
Code:
# corosync-cfgtool -n
Local node ID 1, transport knet
nodeid: 2 reachable
   LINK: 0 udp (10.0.1.21->10.0.1.22) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 udp (10.0.1.21->10.0.1.23) enabled connected mtu: 1397

So I don't think I'm seeing both NICs there. How do I add the other NIC so that it can be used by Corosync?

The link is in the cluster status page:
[screenshot of the cluster status page]

Current state of corosync config:

Code:
# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-a
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.21
    ring1_addr: 10.0.1.21
  }
  node {
    name: pve-b
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.22
    ring1_addr: 10.0.1.22
  }
  node {
    name: pve-c
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.0.0.23
    ring1_addr: 10.0.1.23
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cluster
  config_version: 4
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
After rebooting, things went down, but I think I may just need to reboot all three hosts ... working on it.

After the reboot of a host, I can see the additional link now when I run: corosync-cfgtool -n
 
For now I'm leaving ring0 as 10.0.1.* and ring1 as 10.0.0.* ... and I think it will work if I reboot all three, which I'll try later on ... but I think right now it's set up with a fallback to 10.0.0.*, so we should be OK.

Thank you! I learned a lot with this one.
 
Hello,

In general you don't need to reboot for Corosync to pick up changes if you update the config version number. There are a few operations that do require the service to be restarted, for example switching the network of one ring from a known network to a new one. You can restart the service via

Code:
systemctl restart corosync.service

You can see the current state of the service with

Code:
systemctl status corosync.service
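After a restart, re-checking quorum and the links confirms the change took effect (a sketch):

Code:
# cluster membership and quorum state as Proxmox sees it
pvecm status

# per-link view: each link should report enabled and connected
corosync-cfgtool -n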
 
Thank you, I didn't know this. Very helpful ... I only started with Proxmox a couple of months ago; soaking up the knowledge ... no issues since adding the 2nd ring on a second NIC.
 
