Cluster HA problem

mladenciric

Hello everyone,
We have three Proxmox nodes in a cluster. One is at our site and the other two are at a remote location.
We have only one container for which we set up HA.
All three nodes run the same version of Proxmox; the package list is below.
Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.174-2-pve)
pve-manager: 6.4-14 (running version: 6.4-14/15e2bf61)
pve-kernel-5.4: 6.4-15
pve-kernel-helper: 6.4-15
pve-kernel-5.4.174-2-pve: 5.4.174-2
pve-kernel-5.4.166-1-pve: 5.4.166-1
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-2
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.7-pve1

All nodes have three NICs, all 1 Gbps. They are organized like this: the first is for management, the second for the cluster connection, and the third for the VMs.
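
As a quick reference, these generic commands show which NIC carries which role and which link corosync actually uses (nothing below depends on our specific addressing):
Code:
ip -br link                   # the three NICs and their link state
ip -br addr                   # which address/bridge sits on which NIC
cat /etc/network/interfaces   # management / cluster / VM bridge assignment
corosync-cfgtool -s           # status of the dedicated cluster link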

Yesterday we detected that some VMs were not accessible and that access to the nodes was also difficult.
After inspecting our network and machines we noticed that the cluster HA keeps reproducing the error, and one or two nodes reboot. The two nodes which reboot are the ones between which the HA replication is scheduled.
The error we found is that replication was started but never finished.
We tried to delete HA for the container, but the delete did not succeed.
Now, if one of those nodes is off, the other two work, with quorum 2 of 3 and no errors on the node consoles.
But if we power on the first node, the errors occur again.
The error we can see then is "old timestamps" at the LRM while the master is idle, but in some cases we find that the master also reports "old timestamps" and the cluster goes down; every machine keeps working but reports that the other two are in an unknown state.
If we power off node 1, the master (which is node 3) and node 2 work fine.
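
For context, these are the standard commands for checking quorum and the CRM/LRM state when the "old timestamps" warnings show up (plain PVE tooling, nothing specific to our setup):
Code:
pvecm status                              # corosync membership and quorum
ha-manager status                         # CRM master, LRM states, HA services
systemctl status pve-ha-crm pve-ha-lrm
journalctl -u pve-ha-crm -u pve-ha-lrm --since "1 hour ago"
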
We also noticed that another error occurs on the console when all three nodes are up and running:
Code:
INFO: task pve-bridge:2541 blocked for more than 120 seconds.
Tainted: P O 5.4.174-2-pve #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
INFO: task pvesr:2976 blocked for more than 120 seconds.
Tainted: P O 5.4.174-2-pve #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Node 1, when it is powered on, doesn't accept commands smoothly; it looks like it hangs, and it says the replication is waiting.
What is the most reasonable way to resolve this?
 
After a couple of hours inspecting the problem we found the following.
First case:
The two nodes pm4 and pm3 are running and pm0 is off - the cluster is stable and everything works.
The cluster of pm4 and pm3 is up and running.
Datacenter - HA: ct201 shows "deleting" - we cannot delete or edit it.
pm4 is master, its LRM state is idle.
pm3 LRM state is idle.
Second case, when we power on pm0: the cluster has quorum 3 of 3, but access to the nodes is poor, the VMs are inaccessible and the nodes reboot.
[screenshot attached: 20220427_111452.jpg]
Even simple list commands are unresponsive.

pm0 - LRM state is timeout
pm3 - LRM state is timeout
pm4 - LRM state is timeout
pm4 - CRM is buggy
[screenshot attached: 20220427_111501.jpg]
Stopping, starting and restarting the services pve-ha-crm and pve-ha-lrm was also unresponsive.
[screenshot attached: 20220427_112348.jpg]
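
When the restarts hang like that, a couple of generic systemd checks can show what is actually stuck:
Code:
systemctl list-jobs                                       # stop/start jobs queued or stuck
systemctl status pve-ha-crm pve-ha-lrm --no-pager
journalctl -u pve-ha-crm -u pve-ha-lrm -n 50 --no-pager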

If we turn off pm0 by CLI command it takes too long, and two operations hang: the Proxmox Replication runner and the PVE Cluster HA Resource Manager.
If we turn off pm0 with the power switch, then after a very short time the cluster, the nodes and also the VMs work as they should (that is also the case if we simply disconnect pm0 from all three networks).
We assumed that the problem is the replication scheduled via HA for ct201, which was running on pm0 but was being replicated from pm3!

We ran the following commands on all nodes:
Code:
cat /etc/pve/ha/resources.cfg
cat /etc/pve/ha/manager_status
ha-manager status

The first command reported that resources.cfg was empty.
The second command reported that manager_status has jobs for ct201: on pm3 and pm4 it says deleting, but on pm0 it says running.
[screenshot attached: 20220427_111601.jpg]
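
Since manager_status is a JSON file under /etc/pve, it can also be pretty-printed to see exactly which entries for ct201 are left (assuming python3 is installed, which it usually is):
Code:
python3 -m json.tool /etc/pve/ha/manager_status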

Next, we followed these instructions:
Code:
execute systemctl stop pve-ha-crm on all nodes, this stops all cluster managers
on a single node do: rm /etc/pve/ha/manager_status - this resets the manager status; it can only be done if no manager is active. As this file is normally just representative and only gets read anew when a CRM becomes the manager, that is the reason we had to stop all CRMs before doing this.
execute systemctl start pve-ha-crm on all nodes, this starts all cluster managers again
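
After the restart, one way to confirm the reset actually took effect is to check that a fresh manager status gets written once one of the CRMs becomes master again:
Code:
systemctl status pve-ha-crm --no-pager
ha-manager status                 # should report a quorate master and no stale ct201 entries
cat /etc/pve/ha/manager_status    # regenerated by the new master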

On pm3 and pm4 we successfully deleted manager_status, but on pm0 we cannot delete this file because of "no permission".
[screenshots attached: 20220427_111619.jpg, 20220427_112446.jpg]
Now the HA resources were clean.
But on node pm0 we still have resources.cfg empty and manager_status with "...ct:201 running...", and when we put pm0 back on the network the other two nodes become unstable.

What can we do to resolve this situation?
How do we stop the HA replication on pm0 and clear manager_status?
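
Regarding the stuck replication itself: storage replication jobs are handled by pvesr, separately from HA, so if the job still shows up there it can be listed and disabled or force-deleted by its job ID. A sketch with an assumed job ID (we have not verified this on pm0 yet):
Code:
pvesr list                   # all replication jobs and their IDs
pvesr status                 # last run and error state per job
pvesr disable 201-0          # job ID assumed for illustration
pvesr delete 201-0 --force   # remove the job config even if the job is broken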
 
You can't do an HA cluster with 2 DCs. You need 3 locations to keep quorum.

For example:

DC1: 1 host, DC2: 2 hosts

- In case of a network failure between the 2 DCs -> the DC1 host will reboot (no quorum).
- In case of a DC2 failure -> the DC1 host will reboot.


DC1: 2 hosts, DC2: 2 hosts
- In case of a network failure between the 2 DCs -> both the DC1 and DC2 hosts will reboot (no quorum).
- In case of any DC failure, the other DC will reboot.


Note that replication is not related to HA, but if a node doesn't have quorum, it can't launch the replication job.
When a VM with HA enabled is on a node, that node will reboot (fencing) if it loses quorum.
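
If a real third site is not an option, one common workaround is an external QDevice: a small machine at a third location that only contributes a quorum vote, not a full cluster node. Roughly (the address below is just a placeholder):
Code:
# on the external quorum host (plain Debian is enough):
apt install corosync-qnetd
# on the cluster (corosync-qdevice must be installed on all nodes):
pvecm qdevice setup 192.0.2.50   # address of the qnetd host, assumed for illustration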
 
Yes, that is correct, but as I described we have three nodes, pm0, pm3 and pm4, in the cluster. The problem appeared because an HA replication did not finish successfully, and now one node, pm0, causes the problem whenever it is started.
On the two nodes which work as they should we stopped HA for ct201. But on pm0 there is still a scheduled replication which we cannot stop or delete.
My question is: what can we do to abort the scheduled jobs? On pm0, resources.cfg is empty but manager_status shows that HA for ct201 still exists!
Is there a way to put node pm0 into a listen-only state so that it only learns about jobs on the cluster from the other two nodes?
 
UPDATE:
On node pm0, which had the incorrect HA manager_status, I did the following (the full command sequence is sketched below the list):
1 - take the node offline
2 - change the expected votes so that quorum is reached with only 1 node: pvecm expected 1
3 - stop the service: systemctl stop pve-ha-crm.service
4 - delete the manager status: rm /etc/pve/ha/manager_status
5 - stop the pve-cluster and corosync services
6 - change the expected votes back to 3 nodes (quorum 2): pvecm expected 3
7 - bring the node back online
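
Put together as shell commands, the sequence above looks roughly like this (run on pm0 only, while it is isolated from the other two nodes):
Code:
pvecm expected 1                    # reach quorum with this node's single vote
systemctl stop pve-ha-crm.service
rm /etc/pve/ha/manager_status
systemctl stop pve-cluster corosync
# reconnect the node; once pve-cluster and corosync are running again:
pvecm expected 3                    # back to 3 expected votes (quorum 2 of 3)
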
Now I don't have any HA settings at all on my cluster, BUT after some time (about 30 minutes) I was again stuck with lag and inaccessible VMs.
Again I found that the replication between pm0 and pm3 was the problem.
Finally, when all replication was disabled, the problem was gone!
But what caused this issue?
Further inspection gave me a clue: I noticed that pings to the management network interface on pm3 took much longer than pings to IP addresses on the same network on the other two nodes.
Two possible causes which I will inspect: the management NIC on node pm3, or the cables or port on the router.
For now I used ethtool to check the NICs and didn't find anything wrong; all three interfaces are 1 Gbps and all run at that speed.
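
For the follow-up on the slow pings, these are the checks planned next (the NIC name eno1 is only a placeholder for the real management interface on pm3):
Code:
ping -c 20 -q <management-IP-of-pm3>    # compare avg/max RTT against the other two nodes
ethtool eno1                            # negotiated speed, duplex, autoneg
ethtool -S eno1 | grep -iE "err|drop"   # driver error and drop counters
ip -s link show eno1                    # kernel-level RX/TX errors and drops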