New to Proxmox: how to troubleshoot 'fencing'

lknite

Sep 27, 2024
I purchased a 3rd server (pve-c) and set up all three with Proxmox a couple of weeks ago as a cluster, with Ceph as the underlying storage (using HCI).

Since then, twice I've received an email saying the new server (pve-c) was being fenced.
Code:
The node 'pve-c' failed and needs manual intervention.

The PVE HA manager tries to fence it and recover the configured HA resources to a healthy node if possible.

When I log in everything seems normal, but I can see a VM was moved to another server and another was shut down.

This is my first time using Proxmox, so my first issue to investigate: how do I look into this fencing? The server seems OK, so I'm not sure what could be failing. When I look at the 'System Log' via the GUI at the time of the email, I don't see anything unusual.

Additional detail: I looked at Ceph and it showed the 'manager' was down, which was on a different server, pve-a. Not sure why it was stopped, so I clicked start, and I also created a second manager on pve-b. The manager on pve-b went into standby, the manager on pve-a started, and after a few seconds Ceph showed HEALTH_OK.
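(For reference, the same checks can be done from a shell; a minimal sketch, assuming the standard Ceph CLI is available on the nodes:)

Code:
# overall cluster health, including which manager is active
ceph -s

# manager daemons only: the active one plus any standbys
ceph mgr stat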
 
This is my first time using Proxmox, so my first issue to investigate: how do I look into this fencing? The server seems OK, so I'm not sure what could be failing. When I look at the 'System Log' via the GUI at the time of the email, I don't see anything unusual.

I like non-GUI, so I would start with journalctl -b -1 -e and take it from there. Look for logs from the pve-ha-* services, corosync, and watchdog-mux. You can post the tail end here for further troubleshooting. If you are interested in digging deeper (without actually making any changes), you might find this one useful to read: https://forum.proxmox.com/threads/high-availability-watchdog-reboots.154580/
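For example, a sketch of narrowing the previous boot's journal down to the relevant units (adjust the unit names to what is actually present on your nodes):

Code:
# full journal from the previous boot, jumping to the end
journalctl -b -1 -e

# only the HA stack, corosync, and the watchdog from the previous boot
journalctl -b -1 -u pve-ha-lrm -u pve-ha-crm -u corosync -u watchdog-mux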
 
So far, the only thing I've discovered is that all three servers have been fenced at some point and then come back online, each within about 30 seconds.

Not all at once; I seem to lose one (and it comes back up quickly) about every couple of days. ... so weird.

I know Ceph is somewhat slow ... but everything else: 10 Gb NICs, NVMe, 32 cores on each, 128 GB of memory on each ... not sure what would time out.
 
Last edited:
Hello,

Do you have a dedicated NIC for Corosync? Ceph will quickly saturate a 10G NIC, and if Corosync is using the same network it will deem it unusable and lose quorum.

Check the logs for events where Corosync lost quorum. Nodes are fenced after 60s without quorum.
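A quick way to scan for such events (a sketch; the exact log wording varies between Corosync versions):

Code:
# look for quorum changes, link failures, and token timeouts
journalctl -u corosync | grep -Ei 'quorum|link|token'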
 
Hello,

Do you have a dedicated NIC for Corosync? Ceph will quickly saturate a 10G NIC, and if Corosync is using the same network it will deem it unusable and lose quorum.

Check the logs for events where Corosync lost quorum. Nodes are fenced after 60s without quorum.
I have 2 NICs on each server (.21, .22, .23):
vmbr0: 1 Gb, 10.0.0.21
vmbr1: 10 Gb, 10.0.1.21

I use vmbr1 for the Ceph storage, and I also created the Proxmox cluster using that interface. Sounds like I should have used vmbr0 for creating the cluster.

Should I adjust the IP the Proxmox nodes are using for the cluster? Or just dedicate vmbr0 to Corosync? I'll Google how to configure a NIC for Corosync once I get into the office.
 
You can find more info on how to set up Corosync at [1], and don't forget to bump the `config_version` for the changes to be applied. Consider temporarily disabling HA before changing any Corosync config so the nodes do not fence if there is an issue.
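One way to pause the HA stack while you edit (a sketch; stop the LRM before the CRM, on every node, and start them again once Corosync is healthy):

Code:
# on every node: stop the local resource manager first, then the manager
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm

# ... edit /etc/pve/corosync.conf and bump config_version ...

systemctl start pve-ha-crm
systemctl start pve-ha-lrm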

I think in your use case it makes sense to use vmbr0 as ring0 and vmbr1 as ring1 for fallback. You can use

Code:
corosync-cfgtool -n

to verify that both networks can be used by Corosync (they should both report `enabled` AND `connected`).

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_configuration
 
When I run 'corosync-cfgtool -n' I'm seeing:
Code:
# corosync-cfgtool -n
Local node ID 1, transport knet
nodeid: 2 reachable
   LINK: 0 udp (10.0.1.21->10.0.1.22) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 udp (10.0.1.21->10.0.1.23) enabled connected mtu: 1397

So I don't think I'm seeing both NICs there. How do I add the other NIC so that it can be used by Corosync?

The link is in the cluster status page:
[screenshot of the cluster status page]

Current state of corosync config:

Code:
# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-a
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.21
    ring1_addr: 10.0.1.21
  }
  node {
    name: pve-b
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.22
    ring1_addr: 10.0.1.22
  }
  node {
    name: pve-c
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.0.0.23
    ring1_addr: 10.0.1.23
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cluster
  config_version: 4
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
After rebooting, things went down, but I think I may just need to reboot all three hosts ... working on it.

After the reboot of a host, I can see the additional link now when I run: corosync-cfgtool -n
 
For now I'm leaving ring0 as 10.0.1.* and ring1 as 10.0.0.* ... and I think it will work if I reboot all three, which I'll try later on ... but I think right now it's set up with a fallback to 10.0.0.*, so we should be OK.

Thank you! I learned a lot with this one.
 
Hello,

In general you don't need to reboot for Corosync to pick up changes if you update the config version number. There are a few operations that do require the service to be restarted, for example switching the network of one ring from a known network to a new one. You can restart the service via

Code:
systemctl restart corosync.service

You can see the current state of the service with

Code:
systemctl status corosync.service
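After a restart, re-checking quorum and the links confirms the change took effect (a sketch):

Code:
# cluster membership and quorum state as Proxmox sees it
pvecm status

# per-link view: each link should report enabled and connected
corosync-cfgtool -n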
 
Thank you, I didn't know this. Very helpful ... I only started with Proxmox a couple of months ago; soaking up the knowledge ... no issues since adding the 2nd ring on a second NIC.
 
