Hello everyone,
We have a three-node Proxmox cluster.
One node is on site; the other two are at a remote location.
We have a single container, for which we set up HA.
All three nodes run the same Proxmox version; the package list is below:
Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.174-2-pve)
pve-manager: 6.4-14 (running version: 6.4-14/15e2bf61)
pve-kernel-5.4: 6.4-15
pve-kernel-helper: 6.4-15
pve-kernel-5.4.174-2-pve: 5.4.174-2
pve-kernel-5.4.166-1-pve: 5.4.166-1
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-2
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.7-pve1
All nodes have three NICs, all 1 Gbps, organized as follows: the first is for management, the second for the cluster connection, and the third for VMs.
Recently we detected that some VMs were not accessible, and access to the nodes themselves was also difficult.
After inspecting our network and machines, we noticed that cluster HA reproduces the error and one or two nodes reboot. The two nodes that reboot are the ones between which HA replication is scheduled.
The error we found is that replication was started but never finished.
We tried to delete HA for the container, but the deletion did not succeed.
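In case it helps others reproduce our attempt: HA resources can also be removed from the CLI with `ha-manager`. The VMID `100` below is only a placeholder; substitute the container's actual ID.

```shell
# Show all HA-managed resources and their current state
ha-manager status

# Remove the container from HA management (service ID format: ct:<VMID>)
ha-manager remove ct:100

# Confirm the resource is gone from the HA configuration
ha-manager config
```

With the resource removed while the cluster is quorate, the CRM should stop trying to recover or fence around that container.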
Now, if one of those nodes is off, the other two work fine, with quorum 2 of 3 and no errors on the node consoles.
But if we power on the first node, the errors occur again.
The error we see then is "old timestamps" for the LRM while the master is idle; in some cases the master also reports "old timestamps" and the cluster goes down: every machine keeps working, but each reports the other two as being in an unknown state.
If we power off node 1, the master (node 3) and node 2 work fine.
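As far as we understand, "old timestamps" in `ha-manager status` means a node's LRM has stopped updating its state file in /etc/pve, which usually points at a corosync/pmxcfs problem rather than the HA stack itself. These are the commands we used to narrow it down, run on each node while the problem was occurring:

```shell
# Quorum and membership as each node sees it
pvecm status

# Health of the corosync links on the dedicated cluster NIC
corosync-cfgtool -s

# Recent logs from the services involved in the "old timestamps" report
journalctl -u pve-ha-lrm -u pve-ha-crm -u corosync --since "1 hour ago"
```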
We also noticed that another error occurs on the console when all three nodes are up and running:
Code:
INFO: task pve-bridge:2541 blocked for more than 120 seconds.
      Tainted: P           O      5.4.174-2-pve #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
INFO: task pvesr:2976 blocked for more than 120 seconds.
      Tainted: P           O      5.4.174-2-pve #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
When node 1 is powered on, it doesn't accept commands smoothly; it looks like it hangs and says replication is waiting.
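The hung `pvesr` task in the log above suggests a replication job is stuck, so we tried inspecting and disabling it with `pvesr`. The job ID `100-0` below is only a placeholder (the format is `<vmid>-<jobnum>`):

```shell
# List replication jobs and their last status
pvesr status
pvesr list

# Temporarily disable the stuck job
pvesr disable 100-0

# Or remove the job entirely; --force removes the config even if cleanup fails
pvesr delete 100-0 --force
```

A task blocked in the kernel (the 120-second hung-task message) usually cannot be killed, so the node may still need a reboot after the job is removed.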
What is the most reasonable way to resolve this?