[TUTORIAL] PVE 7.x Cluster Setup of shared LVM/LV with MSA2040 SAS [partial howto]

Another question about write performance:

I have done some tests with fio, and I get abysmal results when the VM disk file is not preallocated.

Preallocated: I get around 20,000 IOPS for 4k randwrite and 3 GB/s for 4M writes. (This is almost the same as my physical disk without GFS2.)

But when the disk is not preallocated, or when I take a snapshot of a preallocated drive (so new writes are no longer preallocated), I get:

60 IOPS for 4k randwrite, 40 MB/s for 4M writes.
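For reference, a sketch of the kind of fio runs described above (not the exact commands used; the target device is a placeholder and should point at a disk backed by the GFS2-hosted image):

Code:
# 4k random write, direct I/O, run inside the guest (device is a placeholder)
fio --name=randwrite4k --filename=/dev/sdb --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based

# 4M sequential write
fio --name=write4m --filename=/dev/sdb --ioengine=libaio --direct=1 \
    --rw=write --bs=4M --iodepth=8 --runtime=60 --time_based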
I have not examined, nor taken measurements in regard to performance, so I cannot provide you with data.
OK, thanks!

It works fine without LVM in my tests, so there is no need for lvmlockd, vgscan, and all the other LVM tooling.

Regarding performance, I have compared it with OCFS2, and it's really night and day for 4k direct writes when the file is not preallocated (I'm at around 20,000 IOPS on OCFS2 and 200 IOPS on GFS2).

I have also noticed that a qcow2 snapshot lowers 4k direct writes to around 100~200 IOPS. This also happens with local storage, so I'll look into implementing external qcow2 snapshots (the snapshot in an external file). I don't see a performance regression with external snapshots.
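For context, an external qcow2 snapshot is simply a new qcow2 overlay that uses the original image as a read-only backing file; something along these lines (a sketch, file names are placeholders):

Code:
# create an external snapshot: new writes go to the overlay,
# the original image becomes the read-only backing file
qemu-img create -f qcow2 -b vm-100-disk-0.qcow2 -F qcow2 vm-100-disk-0-snap1.qcow2

# inspect the resulting backing chain
qemu-img info --backing-chain vm-100-disk-0-snap1.qcow2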
 
Hey @spirit & @Glowsome thank you for such an informative thread.

I have 6 hosts in my cluster and 2 MSAs that I am trying to use as clustered, shared storage. I initially tried using LVM on top of iSCSI, but soon found out that the files were not being replicated across nodes and realised I needed GFS2. So I've installed and configured it to the best of my knowledge (I don't want to use LVM if I can avoid it, so I have configured only GFS2 and DLM), but I don't get a prompt back when I try to mount. Here is my dlm.conf:


Code:
log_debug=1
protocol=tcp
post_join_delay=10
enable_fencing=0
lockspace Xypro-Cluster nodir=1

Code:
# dlm_tool status
cluster nodeid 1 quorate 1 ring seq 9277 9277
daemon now 2743 fence_pid 0
node 1 M add 16 rem 0 fail 0 fence 0 at 0 0
node 2 M add 710 rem 0 fail 0 fence 0 at 0 0
node 3 M add 785 rem 0 fail 0 fence 0 at 0 0
node 4 M add 751 rem 0 fail 0 fence 0 at 0 0
node 5 M add 816 rem 0 fail 0 fence 0 at 0 0
node 6 M add 1145 rem 0 fail 0 fence 0 at 0 0
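While the mount hangs, these are the kinds of commands that can be run from another shell to gather more information (a sketch; nothing beyond dlm_tool and the system logs, and the lockspace name depends on what was passed to mkfs.gfs2 -t):

Code:
# list DLM lockspaces - the GFS2 lockspace should appear once the mount starts
dlm_tool ls
# dump dlm_controld's internal debug buffer
dlm_tool dump
# check the dlm and corosync logs, plus kernel messages
journalctl -u dlm.service -u corosync.service --since "10 min ago"
dmesg | grep -iE 'dlm|gfs2'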

I'd appreciate any help.
 
Hey @spirit & @Glowsome thank you for such an informative thread.
[...]
I'd appreciate any help.
Hi,
here is my dlm.conf:

Code:
# Enable debugging
log_debug=1
# Use sctp as the protocol (needed with multiple corosync links)
protocol=sctp
# Delay at join
#post_join_delay=10
# Disable fencing (for now)
enable_fencing=0

I'm using protocol=sctp because I have multiple corosync links, and in that case it is mandatory.
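For reference, "multiple corosync links" means each node has more than one ringX_addr in /etc/pve/corosync.conf, roughly like the excerpt below (node names and addresses are placeholders):

Code:
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.10.1
    ring1_addr: 192.168.20.1
  }
  # ... one entry per node, each with ring0_addr and ring1_addr
}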

Then I format my block device with GFS2. The -t option takes <corosync_clustername>:<fsname>, -j 4 creates one journal per node (4 nodes here), and -J 128 sets the journal size to 128 MiB:

Code:
mkfs.gfs2 -t <corosync_clustername>:testgfs2 -j 4 -J 128 /dev/mapper/36742b0f0000010480000000000e02bf3

(Here I'm using a multipath iSCSI LUN.)

And finally I mount it:

Code:
mount -t gfs2 -o noatime /dev/mapper/36742b0f0000010480000000000e02bf3 /mnt/pve/gfs2
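Not covered above, but to actually use the mountpoint for VM disks it can be registered as a shared directory storage; roughly like this (storage name and content types are examples to adapt):

Code:
# register the GFS2 mountpoint as a shared directory storage on the cluster
pvesm add dir gfs2 --path /mnt/pve/gfs2 --shared 1 --is_mountpoint yes --content images,rootdir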
 
Hi,

I’m writing this post after testing the Glowsome configuration for about two months, followed by four months of production use on three nodes with mixed servers connected via FC to a Lenovo DE2000H SAN.
I want to thank @Glowsome for the excellent work they’ve done.

I sincerely hope that this solution can become officially supported in Proxmox in the future.

Thank you again!
 
There is this tutorial, https://forum.proxmox.com/threads/poc-2-node-ha-cluster-with-shared-iscsi-gfs2.160177/, which I have used to set up a 2-node cluster in our lab: FC SAN (all-flash storage), with GFS2 directly on the multipath device (a simple setup).
From a feature perspective everything seems to be working (the only gap is that TPM 2.0 state blocks snapshots); all the basic features we need are there (snapshots + SAN).
In the lab it seems stable, performance is also OK, and even discard is supported on GFS2.
Some performance numbers from a Windows VM:

[screenshot: disk benchmark results from the Windows VM]

Sequential speeds show that the 8 Gbit HBAs are the bottleneck in this case.
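Regarding discard, it can be enabled via the discard mount option or run periodically; a quick way to check it on the mounted filesystem (the path is an example):

Code:
# trim unused blocks on the mounted GFS2 filesystem and report how much was freed
fstrim -v /mnt/pve/gfs2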
 
I have issues with DLM/mount on boot with this setup - although I'm not using LVM but the raw LUNs themselves. I've added some dependencies to the fstab entries, but the automatic mount still somehow runs into an indefinite "kern_stop" for the "mount" commands, which can only be fixed by rebooting.

My current workaround is to define the mount as "noauto" and mount it manually after the Proxmox box has completely booted. That has worked fine so far.

Here are my fstab entries:
Code:
/dev/disk/by-uuid/8ee5d7a9-7b19-4b45-b388-bb5758c20d77 /mnt/pve/storage-gfs2-01 gfs2 _netdev,noauto,noacl,lazytime,noatime,rgrplvb,discard,x-systemd.requires=dlm.service,x-systemd.requires=nvmf-connect-script.service,x-systemd.requires=pve-ha-crm.service,nofail 0 0
/dev/disk/by-uuid/1a89385a-965c-4014-9b83-f90a1f3782f6 /mnt/pve/storage-gfs2-02 gfs2 _netdev,noauto,noacl,lazytime,noatime,rgrplvb,discard,x-systemd.requires=dlm.service,x-systemd.requires=nvmf-connect-script.service,x-systemd.requires=pve-ha-crm.service,nofail 0 0

With the x-systemd.requires options and the _netdev flag, systemd adds the following dependencies:
Code:
After=dlm.service nvmf-connect-script.service pve-ha-crm.service
Requires=dlm.service nvmf-connect-script.service pve-ha-crm.service
After=blockdev@dev-disk-by\x2duuid-1a89385a\x2d965c\x2d4014\x2d9b83\x2df90a1f...target

DLM should obviously be started, and the NVMe-over-TCP connection should be established. The last entry (pve-ha-crm.service) was a first stab at a workaround, trying to wait for corosync to be ready, but it didn't work reliably. Systemd automatically added the "After=blockdev@...target", which seems fine.

I don't know whether it's a race condition caused by mounting two shares at once, or a fencing-related issue. This is my default DLM config; I'm using sctp because I've got two rings defined in corosync. I was already experimenting with disabling additional fencing-related options, though I wasn't sure whether disabling something like "enable_quorum_lockspace" would be a good idea...
Code:
# cat /etc/default/dlm
DLM_CONTROLD_OPTS="--enable_fencing 0 --protocol sctp --log_debug"

# options I might add next
# --enable_startup_fencing 0 --enable_quorum_fencing 0

Can anyone see an error I've overlooked?
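For reference, one direction to experiment with would be ordering the mounts explicitly after corosync and giving them a mount timeout so a hang cannot block boot forever; a sketch only (untested, and the extra options are assumptions, not taken from the setup above):

Code:
/dev/disk/by-uuid/8ee5d7a9-7b19-4b45-b388-bb5758c20d77 /mnt/pve/storage-gfs2-01 gfs2 _netdev,noatime,x-systemd.requires=corosync.service,x-systemd.requires=dlm.service,x-systemd.requires=nvmf-connect-script.service,x-systemd.mount-timeout=90s,nofail 0 0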
 
I have issues with DLM/mount on boot with this setup - although I'm not using LVM but the raw LUNs themselves.
[...]

Can anyone see an error I've overlooked?
Hi einhirn: you don't have to use DLM; it's only required by GFS2, not by shared LVM. I'd recommend having a look at https://kb.blockbridge.com/technote/proxmox-lvm-shared-storage/
 
it's only required by GFS2
Exactly - that's what I'm using. OK, I didn't mention that other than in the fstab lines, but since this thread is about using GFS2 I didn't think it necessary.

Btw: I'm also using shared thick-LVM storage via iSCSI with multipathing and NVMe-over-TCP, but I'd really like to use thin provisioning for VMs and possibly snapshots - even though I was surprised that qcow2 snapshots in PVE are internal (i.e. stored in the same file), but that's a different topic.
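(For reference, internal qcow2 snapshots can be listed directly on the image file; the path below is just an example:)

Code:
# list the internal snapshots stored inside a qcow2 image
qemu-img snapshot -l /mnt/pve/storage-gfs2-01/images/100/vm-100-disk-0.qcow2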
 
I have issues with DLM/Mount on boot with this setup - although I'm not using LVM but the raw LUNs themselves. I've added some dependencies to the FStab entries, but the automatic mount still somehow runs into indefinite "kern_stop" for the "mount" commands. Can only be fixed by rebooting.
[...]

Can anyone see an error I've overlooked?
It seems that there are some dependencies to take care of:



I'll try those and check whether it helps...
 
Hi there,

I must say we did configure a production cluster with GFS2, following the instructions in this thread, and it worked like a charm for around a year, but over time the storage became completely unstable and left the cluster unusable.

For the moment, we've switched to RAW storage. Losing the ability to have snapshots is preferable to having such an unstable filesystem.

Just wanted to leave this comment as a warning to potential users: GFS2 does work, but in the long term it can also become corrupted (maybe it requires some additional maintenance?).
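On the maintenance point: GFS2 ships a filesystem checker, but it has to be run with the filesystem unmounted on every node of the cluster; roughly like this (the device path is a placeholder):

Code:
# unmount the GFS2 filesystem on ALL nodes first, then on one node:
fsck.gfs2 -y /dev/mapper/<multipath-device>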
 
The main issue with GFS2 (and OCFS2) is that they are not really supported, so if something bad happens you are on your own. I might be wrong, but I also remember that there isn't much development activity around them. Luckily there is a good chance that Proxmox VE 9 will feature snapshot support with qcow2 on LVM-thick (there is development work going on at the moment; I don't know whether it will be ready in time) in a VMFS-like fashion. This should cover most of the use cases for which people use OCFS2 or GFS2, and it will be supported officially.
For the moment, we've switched to RAW storage. Losing the ability to have snapshots is preferable to having such an unstable filesystem.

Until the snapshot/qcow2 support on LVM-thick is available, this might be a workaround:

Alternatives to Snapshots
If an existing iSCSI/FC/SAS storage needs to be repurposed for a Proxmox VE cluster and using a network share like NFS/CIFS is not an option, it may be possible to rethink the overall strategy; if you plan to use a Proxmox Backup Server, then you could use backups and live restore of VMs instead of snapshots.

Backups of running VMs will be quick thanks to dirty bitmap (aka changed block tracking) and the downtime of a VM on restore can also be minimized if the live-restore option is used, where the VM is powered on while the backup is restored.
https://pve.proxmox.com/wiki/Migrate_to_Proxmox_VE#Alternatives_to_Snapshots
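As a rough sketch of that workflow (the VM ID and storage name are examples, and the live-restore flag should be verified against your PVE version):

Code:
# "pseudo-snapshot": back up the running VM 100 to a PBS storage named "pbs"
# (incremental and fast thanks to dirty-bitmap / changed-block tracking)
vzdump 100 --storage pbs --mode snapshot

# list the backups on that storage to find the volume ID
pvesm list pbs

# "roll back" by restoring; with live restore the VM starts while data streams in
# (check `qmrestore --help` for the exact option on your version)
qmrestore <backup-volume-id> 100 --force 1 --live-restore 1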


Now obviously this isn't a solution for every use case, but maybe it's enough for you. Even if you use other backup software and have a limited budget, you could still use PBS just for these "pseudo-snapshots" without obtaining a subscription, as long as you can live with the nag screen. I wouldn't do this as a permanent solution without a support subscription, but it can serve as a workaround until qcow2 on LVM-thick is supported.