Hello,
I am experimenting with a Proxmox cluster of 3 nodes, with VMs on local storage.
We also have shared storage, but due to serious performance problems with it, we are trying to get rid of it.
I know that shared storage is obviously needed for HA, but we have no control over the hardware and our hosting provider isn't helping us with these problems. We can live without HA, which is why I'm running these experiments on local storage.
So, I ran the following fio config simultaneously on 4 VMs on the same hypervisor (hv):
Code:
[global]
name=fio-rand-write
filename=fio-rand-write
rw=randwrite
bs=4K
direct=1
numjobs=4
time_based
runtime=120
[file1]
size=10G
ioengine=libaio
iodepth=16
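For reference, this is roughly how I drive the test from inside each VM while keeping an eye on the hypervisor from another shell (a sketch; I'm assuming the config above is saved as fio-rand-write.fio, and iostat comes from the sysstat package):
Code:
# inside each of the 4 VMs: run the job file shown above
fio fio-rand-write.fio

# on the hypervisor, in parallel: per-device utilisation and latency
iostat -x 1

# I/O pressure stall information, if PSI is enabled in the kernel
cat /proc/pressure/io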
I previously ran the same tests with one exception: direct was set to 0. Those results were skewed by the VMs' OS page cache: VMs with more RAM got better results than VMs with less.
So I then ran the tests with direct=1, and the result was that the hypervisor was fenced from the cluster and rebooted automatically.
My understanding is the following: we had 16 I/O stressors (numjobs=4 multiplied by 4 VMs), and each stressor could queue 16 I/Os (iodepth=16), so up to 256 outstanding I/Os against the local SSDs.
Maybe this could saturate the hv with so much I/O that it could not receive corosync requests and reply to them in a timely manner?
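If that theory is correct, a throttled rerun should not trigger a fence. This is the kind of thing I have in mind, with the same job passed on the command line plus a per-job IOPS cap (a sketch; the rate_iops value is an arbitrary guess):
Code:
# same workload, but each of the 4 jobs is capped at ~500 write IOPS
fio --name=fio-rand-write --filename=fio-rand-write \
    --rw=randwrite --bs=4k --direct=1 \
    --numjobs=4 --iodepth=16 --ioengine=libaio \
    --size=10G --time_based --runtime=120 \
    --rate_iops=500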
Do you have another explanation for what happened ?
Did anyone experience similar problems with local disk IO resulting in a hv fence ?
Anything else I should check ?
By the way, each hv has 2 × Intel Xeon E5-2660 v4 CPUs (nproc=56) with 256 GB RAM, and the local storage is SSD, so quite good specs.
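In case it helps, these are the checks I plan to run once the fenced node is back up (standard Proxmox / corosync tooling; journalctl -b -1 needs persistent journald logs):
Code:
# corosync logs from the previous boot: token timeouts, membership changes
journalctl -b -1 -u corosync

# HA manager and watchdog activity around the fence
journalctl -b -1 -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux

# current link status and quorum as seen from the surviving nodes
corosync-cfgtool -s
pvecm status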