Proxmox cluster with local storage - HV fencing under high IO usage

trent--

Member
Mar 19, 2021
Hello,

I am experimenting with a Proxmox cluster of 3 nodes, with VMs on local storage.
We also have shared storage, but due to serious performance problems with it, we are trying to get rid of it.
I know that shared storage is obviously needed for HA, but we have no control over the hardware, our hoster doesn't help us with these problems, and we can do without HA, so that's why I'm running these experiments on local storage.

So, I ran the following fio config simultaneously on 4 VMs, all on the same HV:
Code:
[global]
name=fio-rand-write
filename=fio-rand-write
rw=randwrite
bs=4K
direct=1
numjobs=4
time_based
runtime=120

[file1]
size=10G
ioengine=libaio
iodepth=16
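
For reference, the job file above was launched on each VM with a plain fio invocation (the job file name below is just an assumption):
Code:
# run inside each VM; the job file above is saved as fio-rand-write.fio
fio fio-rand-write.fio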

I had previously run the same tests with one exception: direct was set to 0.
Those results were skewed by the VMs' OS page cache: VMs with more RAM got better results than VMs with less.

So I then ran the tests with direct=1, and the result was that the hypervisor was fenced from the cluster and rebooted automatically.

My understanding is the following: we had 16 IO stressors (numjobs=4 multiplied by 4 VMs), and each stressor could queue 16 IOs (iodepth=16), so up to 256 outstanding 4K random writes at once.
Maybe this saturated the HV with so much IO that it could not receive and answer corosync traffic in a timely manner?
Do you have another explanation for what happened?
Has anyone experienced similar problems with local disk IO resulting in an HV fence?
Anything else I should check ?

By the way, each HV has 2 x Intel® Xeon E5-2660 v4 CPUs (nproc=56) and 256 GB RAM, and the local storage is SSD, so quite good specs.
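
If guest IO pressure really is the culprit, one way to test the theory would be to cap per-disk IOPS on the VMs before re-running fio. A rough sketch, assuming a VM with ID 101 whose disk is scsi0 on the local-lvm storage (ID, volume and storage names are placeholders; take the real ones from qm config):
Code:
# Hypothetical example: limit VM 101's scsi0 disk to 500 read and 500 write IOPS
# (volume and storage names are placeholders - copy them from "qm config 101")
qm set 101 --scsi0 local-lvm:vm-101-disk-0,iops_rd=500,iops_wr=500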
 
Your theory sounds very plausible. There should be logs that can support it, both on the fenced HV and on the nodes that fenced it.

Without logs it's just a guessing game. You say it was fenced, but maybe the logs will show it was a kernel crash or something else. If it was truly fenced, the corosync logs should tell you why.
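
For example, something along these lines (a sketch; it assumes the systemd journal is persistent across reboots, and the timestamps are placeholders):
Code:
# On the node that rebooted: messages from the previous boot
journalctl -b -1 -u corosync -u pve-ha-lrm -u watchdog-mux
# On the surviving nodes: corosync activity around the time of the incident
journalctl -u corosync --since "YYYY-MM-DD 16:10" --until "YYYY-MM-DD 16:25"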


Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
You're right, sorry I forgot to post logs.

hv1 is the node that rebooted. Its logs show nothing indicating a proper shutdown; they go straight from "business as usual" to startup messages.

Here are the corosync logs from hv2, showing that the connection to hv1 was lost, hv1 was removed from the cluster, and then, 4 minutes later (after rebooting), hv1 joined the cluster again:
Code:
Aug 10 16:18:17 pve2 corosync[2907]:   [KNET  ] link: host: 1 link: 0 is down
Aug 10 16:18:17 pve2 corosync[2907]:   [KNET  ] link: host: 1 link: 1 is down
Aug 10 16:18:17 pve2 corosync[2907]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 16:18:17 pve2 corosync[2907]:   [KNET  ] host: host: 1 has no active links
Aug 10 16:18:17 pve2 corosync[2907]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 16:18:17 pve2 corosync[2907]:   [KNET  ] host: host: 1 has no active links
Aug 10 16:18:18 pve2 corosync[2907]:   [TOTEM ] Token has not been received in 2737 ms
Aug 10 16:18:23 pve2 corosync[2907]:   [QUORUM] Sync members[2]: 2 3
Aug 10 16:18:23 pve2 corosync[2907]:   [QUORUM] Sync left[1]: 1
Aug 10 16:18:23 pve2 corosync[2907]:   [TOTEM ] A new membership (2.11c) was formed. Members left: 1
Aug 10 16:18:23 pve2 corosync[2907]:   [TOTEM ] Failed to receive the leave message. failed: 1
Aug 10 16:18:23 pve2 corosync[2907]:   [QUORUM] Members[2]: 2 3
Aug 10 16:18:23 pve2 corosync[2907]:   [MAIN  ] Completed service synchronization, ready to provide service.

Aug 10 16:22:17 pve2 corosync[2907]:   [KNET  ] rx: host: 1 link: 0 is up
Aug 10 16:22:17 pve2 corosync[2907]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 16:22:17 pve2 corosync[2907]:   [QUORUM] Sync members[3]: 1 2 3
Aug 10 16:22:17 pve2 corosync[2907]:   [QUORUM] Sync joined[1]: 1
Aug 10 16:22:17 pve2 corosync[2907]:   [TOTEM ] A new membership (1.121) was formed. Members joined: 1
Aug 10 16:22:17 pve2 corosync[2907]:   [QUORUM] Members[3]: 1 2 3
Aug 10 16:22:17 pve2 corosync[2907]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 10 16:22:20 pve2 corosync[2907]:   [KNET  ] rx: host: 1 link: 1 is up
Aug 10 16:22:20 pve2 corosync[2907]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
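
For the next run I'll probably also watch the local disks from the HV side while the fio jobs are running, with something like this (iostat is from the sysstat package, which may need to be installed; the columns to watch are %util and await on the local SSDs):
Code:
# run on the HV during the test; shows extended per-device stats every second
iostat -x 1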
 
