Fibre Channel SAN with Live Snapshot

We've helped a few customers utilize their existing FC investment alongside Proxmox. But it only makes sense when there is a significant investment in FC, not when the target is the least expensive MSA/Powervault model.

OCFS2 and GFS2 are freely available. Nothing in Proxmox directly interacts with the guts of these technologies. For PVE - the target is simply a "directory pool", so nothing stops you from installing and using them.

If the requirement is to integrate these filesystems' deployment/configuration into the PVE GUI, that means PVE has to:
a) dedicate development time to building packages, installation, configuration templates, and API/CLI/GUI integration
b) expand QA infrastructure significantly
c) rely on the major sponsors of those systems (Red Hat/IBM and Oracle) to keep up with kernel development.

In the end, it's all software, and possibilities are limitless. However, you can't say the same about resources.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
I managed to set up a shared gfs2 filesystem for a proxmox cluster and would like to share some experience.

The filesystem resides directly on a Fibre Channel LUN on an (expensive) IBM Storwize system (originally bought for VMware) and is shared among all hosts in the cluster. This allows moving virtual machines between all hosts and also using qcow2 snapshots.

I had to make some changes to systemd to prevent the cluster from freezing under certain circumstances, for example when installing proxmox updates.

For availability of the FC LUN I use dm-multipath.

I could not mount the gfs2 filesystem via /etc/fstab because I did not have enough control over when the filesystem is mounted and unmounted. The filesystem can only be mounted after dlm.service has started and must be unmounted before dlm.service is stopped. In my case I ensured this with the following systemd units:

/etc/systemd/system/var-dmlun1.mount
[Unit]
DefaultDependencies=no
Before=umount.target
After=var-dlm_delay.service
Requires=var-dlm_delay.service
Conflicts=umount.target

[Mount]
What=/dev/mapper/dmlun1
Where=/var/dmlun1
Type=gfs2
Options=noatime

[Install]
WantedBy=pve-storage.target

This first unit mounts the filesystem. It depends on the service var-dlm_delay.service. The job of the delay service is to make sure the filesystem is mounted only when dlm.service is running and unmounted before dlm.service is stopped, and also to give dlm some time to disconnect from the cluster and notify the kernel before dlm.service is finally stopped:

/etc/systemd/system/var-dlm_delay.service
[Unit]
After=dlm.service
Requires=dlm.service

[Service]
RemainAfterExit=yes
ExecStart=/bin/sleep 0
ExecStop=/bin/sleep 10

[Install]
WantedBy=pve-storage.target
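Not shown above, but after creating the two units they presumably still need a daemon reload and to be enabled so that pve-storage.target pulls them in (a sketch; unit names as above, not part of the original post):

Bash:
# reload systemd and hook both units into pve-storage.target
systemctl daemon-reload
systemctl enable var-dlm_delay.service var-dmlun1.mount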

I also had to add overrides for some Proxmox services to ensure they start only after the filesystem is mounted and stop before it is unmounted.

/etc/systemd/system/pve-guests.service.d/override.conf
/etc/systemd/system/pve-ha-crm.service.d/override.conf
/etc/systemd/system/pve-ha-lrm.service.d/override.conf
/etc/systemd/system/pvestatd.service.d/override.conf

[Unit]
After=var-dmlun1.mount
Requires=var-dmlun1.mount
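One way to create these four identical drop-ins (just a sketch; running systemctl edit for each service works as well):

Bash:
# write the same [Unit] drop-in for each affected PVE service
for svc in pve-guests pve-ha-crm pve-ha-lrm pvestatd; do
    mkdir -p /etc/systemd/system/$svc.service.d
    printf '[Unit]\nAfter=var-dmlun1.mount\nRequires=var-dmlun1.mount\n' \
        > /etc/systemd/system/$svc.service.d/override.conf
done
systemctl daemon-reload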

For dlm.service I had to disable fencing. I do not understand exactly why, but with fencing enabled the cluster freezes in my environment.

/etc/default/dlm
# DLM_CONTROLD_OPTS -- set command line options for dlm_controld
# See dlm_controld man page for list of options.
DLM_CONTROLD_OPTS="--log_debug=0 --enable_fencing=0"
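The post does not mention it explicitly, but the changed options presumably only take effect after dlm_controld is restarted (assumption):

Bash:
systemctl restart dlm.service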
 
Largely inspired by another thread: https://forum.proxmox.com/threads/p...-lvm-lv-with-msa2040-sas-partial-howto.57536/

I wrote a doc for our internal needs:
Before configuring the block device, you need to know whether you want to use iSCSI or FC to connect to the storage array. In our lab we ran the tests with a Primera storage array and a SAN switch, using FC as the transport protocol.
Install multipath-tools and configure multipath.conf according to the storage array vendor's best practices.

Shell:
apt-get install multipath-tools -y

You need to do the same on all nodes where you want to see the block devices.
Add the multipath.conf file to /etc on all nodes.
As an example, for a Primera storage array with one LUN shared in the lab:

Shell:
cat /etc/multipath.conf

defaults {
    polling_interval 10
    retain_attached_hw_handler 0
    user_friendly_names yes
}

devices {
    device {
        vendor "3PARdata"
        product "VV"
        path_grouping_policy "group_by_prio"
        uid_attribute "ID_SERIAL"
        prio "alua"
        path_selector "service-time 0"
        path_checker "tur"
        hardware_handler "1 alua"
        failback "immediate"
        rr_weight "uniform"
        no_path_retry 18
        rr_min_io_rq 1
        fast_io_fail_tmo 10
        dev_loss_tmo "infinity"
    }
}

blacklist {
    wwid .*
}

blacklist_exceptions {
    wwid "360002ac000000000000002a400027b43"
}

multipaths {
    multipath {
        wwid "360002ac000000000000002a400027b43"
        alias mpatha
    }
}
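A sketch for pushing the file to every node and reloading multipathd (pvirtocbhpewd01..04 are the lab nodes mentioned below; passwordless ssh/scp between the nodes is assumed):

Shell:
for i in $(seq 1 4); do
    scp /etc/multipath.conf pvirtocbhpewd0$i:/etc/multipath.conf
    ssh pvirtocbhpewd0$i 'systemctl restart multipathd; multipath -ll'
done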


The WWID can be found using:

Shell:
/lib/udev/scsi_id -g -u -d /dev/mapper/mpathX
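If the multipath map does not exist yet (the blacklist above excludes everything that is not whitelisted), the same ID can also be read from one of the underlying SCSI path devices or from the multipath topology (the /dev/sdX name below is just an example):

Shell:
# query one of the raw paths behind the LUN
/lib/udev/scsi_id -g -u -d /dev/sdb
# or list the maps with their WWIDs
multipath -ll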

There are four Proxmox boxes in our lab, pvirtocbhpewd01..pvirtocbhpewd04.
The commands should be issued from one of the boxes.

Shell:
for i in $(seq 1 4); do ssh pvirtocbhpewd0$i apt install lvm2-lockd dlm-controld gfs2-utils -y; done

for i in $(seq 1 4); do ssh pvirtocbhpewd0$i 'echo DLM_CONTROLD_OPTS="--enable_fencing 0 --post_join_delay 10 --log_debug 1 --protocol sctp" >> /etc/default/dlm; systemctl restart dlm'; done
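To check that dlm_controld came up cleanly with the new options on every node, something like this can be used (sketch):

Shell:
for i in $(seq 1 4); do ssh pvirtocbhpewd0$i 'dlm_tool status'; done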

We use SCTP, like corosync, when multihomed corosync is used (best practice). Then we need to post-configure LVM by modifying /etc/lvm/lvm.conf and uncommenting:

Code:
use_lvmlockd = 1

lvmlockd_lock_retries = 3
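The effective values can be checked with lvmconfig (sketch):

Shell:
lvmconfig global/use_lvmlockd global/lvmlockd_lock_retries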

To allow lvmlockd to start properly after a reboot, we add a sleep before the daemon starts, by adding a sleep.conf file in /etc/systemd/system/lvmlockd.service.d.
File content:

Code:
[Service]
ExecStartPre=/usr/bin/sleep 20

Then we can properly start dlm on all nodes prior to lvmlockd; to ensure that, we execute:

Shell:
for i in $(seq 1 4); do ssh pvirtocbhpewd0$i "systemctl stop dlm; rmmod gfs2; rmmod dlm; sleep 3; systemctl restart udev; sleep 3; systemctl start dlm"; done

After that we will create the gfs2 filesystem. Before this, we initialize the physical volume and create the shared volume group:

Shell:
pvcreate /dev/mapper/XXX
# the VG must be created as a shared VG so that lvmlockd manages its locks (lvmlockd must be running)
vgcreate --shared san_vg /dev/mapper/XXX

XXX must be replaced by the required multipath device, e.g. /dev/mapper/mpatha in our lab. After creating the VG, we need to modify /etc/lvm/lvm.conf again and add an entry in the activation section for this VG; example from the lab:

Code:
activation {
    volume_list = [ "san_vg","san_vg/san_datastore" ]
}

Recopy /etc/lvm/lvm.conf on all nodes.
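The walkthrough later uses the logical volume /dev/san_vg/san_datastore but does not show its creation. Presumably something along these lines is needed (a sketch: the lockspace has to be started and lvmlockd running before LVs can be created in a shared VG, and the 100%FREE sizing is just an assumption):

Shell:
vgchange --lock-start san_vg
# create the LV without activating it; activation is done later via lvchange -asy
lvcreate -an -n san_datastore -l 100%FREE san_vg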
Next, we need to find the cluster name of the PVE cluster using:

Shell:
pve_cluster_name=`grep cluster_name /etc/pve/corosync.conf | awk '{print $2}'`

You can verify pve_cluster_name using

Shell:
echo $pve_cluster_name


After that we create the gfs2 filesystem:

Shell:
mkfs.gfs2 -t pve_cluster_name:fs_name -j 4 -J 64 blockdevice

pve_cluster_name is the one found previously; fs_name is the name of the filesystem you want to create; -j creates X journals, where X is the number of nodes that will use this filesystem (4 in our lab); -J is the journal size, here 64 MB; blockdevice is the block device to use. In our example the block device is /dev/san_vg/san_datastore and fs_name is san-lun0-gfs2.
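Filled in with the lab values (the $pve_cluster_name variable was set earlier; the LV has to be active on the node running mkfs so that the device node exists):

Shell:
mkfs.gfs2 -t $pve_cluster_name:san-lun0-gfs2 -j 4 -J 64 /dev/san_vg/san_datastore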

You can now test mounting the gfs2 filesystem on one node using the commands below:

Shell:
vgchange --lock-start san_vg
lvchange -asy /dev/san_vg/san_datastore
mkdir -p /mnt/pve/san-lun0-gfs2
mount -t gfs2 /dev/san_vg/san_datastore /mnt/pve/san-lun0-gfs2/

If all is OK you should see the mounted volume; you can then umount /mnt/pve/san-lun0-gfs2.
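For example (findmnt is just one way to verify the mount):

Shell:
findmnt /mnt/pve/san-lun0-gfs2
umount /mnt/pve/san-lun0-gfs2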

Now we need to prepare the environment to reboot properly: on start, first dlm, then lvmlockd, then start the lock on the VG, and finally activate the LV and mount it; the reverse when we stop. To do that we prepare unit files for systemd. We do not modify lvmlockd.service itself, so it remains unchanged when the packages are upgraded; instead we create two different unit files to manage proper start and stop.

The first one is responsible for starting the VG lockspaces; we create lvmlocks.service in /etc/systemd/system/:

Code:
[Unit]
Description=LVM locking start and stop
Documentation=man:lvmlockd(8)
Requires=lvmlockd.service dlm.service corosync.service
After=lvmlockd.service dlm.service corosync.service

[Service]
Type=oneshot
RemainAfterExit=yes

# start lockspaces and wait for them to finish starting
ExecStart=/usr/bin/bash -c "/sbin/lvm vgchange --lock-start --lock-opt auto"

# stop lockspaces and wait for them to finish stopping
ExecStop=/usr/bin/bash -c "/sbin/lvm vgchange --lock-stop"

[Install]
WantedBy=multi-user.target

We also add a sleep.conf file in /etc/systemd/system/lvmlocks.service.d:

Code:
[Service]
ExecStartPre=/usr/bin/sleep 20

Then a second unit file to activate the LVs and mount them, and the reverse on stop; we create a unit file in /etc/systemd/system/ named lvmshared.service:

Code:
[Unit]
Description=LVM locking LVs and mount LVs start and stop
Documentation=man:lvmlockd(8)
After=lvmlocks.service lvmlockd.service dlm.service
Before=pve-ha-lrm.service pve-guests.service

[Service]
Type=oneshot
RemainAfterExit=Yes

# start lockspaces LVs and mount LVs
ExecStart=/usr/bin/bash -c "/usr/sbin/vgs --noheadings -o name -S vg_shared=yes | xargs /usr/sbin/lvchange -asy && /usr/sbin/lvs --noheadings -o lv_path -S vg_shared=yes | xargs mount"
# umount the LVs, then deactivate them
ExecStop=/usr/bin/bash -c "/usr/sbin/lvs --noheadings -o lv_path -S vg_shared=yes | xargs umount && /usr/sbin/lvs --noheadings -o lv_path -S vg_shared=yes | xargs /usr/sbin/lvchange -an"

[Install]
WantedBy=multi-user.target

You can copy this file to all the nodes that need it; then we enable the services in systemd on all nodes concerned, using:

Shell:
systemctl enable lvmlocks
systemctl enable lvmshared
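A sketch for doing this on all lab nodes in one go (assumes the unit files and the lvmlocks.service.d/sleep.conf drop-in were written on the first node):

Shell:
for i in $(seq 2 4); do
    scp /etc/systemd/system/lvmlocks.service /etc/systemd/system/lvmshared.service pvirtocbhpewd0$i:/etc/systemd/system/
    scp -r /etc/systemd/system/lvmlocks.service.d pvirtocbhpewd0$i:/etc/systemd/system/
done
for i in $(seq 1 4); do ssh pvirtocbhpewd0$i 'systemctl daemon-reload; systemctl enable lvmlocks lvmshared'; done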

Last, we need to add an entry in /etc/fstab on each node. With noauto it is not mounted by fstab at boot; it provides the mount point and options used by lvmshared.service. In our lab we added one line at the end of fstab like this:

Code:
/dev/san_vg/san_datastore /mnt/pve/san-lun0-gfs2 gfs2 noatime,nodiratime,noauto 1 2
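A sketch to create the mount point and append the fstab entry on every node (the grep guard just avoids duplicate lines):

Shell:
for i in $(seq 1 4); do
    ssh pvirtocbhpewd0$i "mkdir -p /mnt/pve/san-lun0-gfs2; grep -q san-lun0-gfs2 /etc/fstab || echo '/dev/san_vg/san_datastore /mnt/pve/san-lun0-gfs2 gfs2 noatime,nodiratime,noauto 1 2' >> /etc/fstab"
done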

Now we can test the automatic mount by starting the services:

Shell:
systemctl start lvmlocks
systemctl start lvmshared

Again, you should be able to see the mount in the shell.

Now we add the storage in the PVE GUI: at the datacenter (cluster) level go to Storage, Add, Directory; fill in the ID with a human-readable name, tick Shared, enter the directory path (the mount point), and, if needed, restrict the list of nodes allowed to use this storage; the content type allows all types of content.
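The same can also be done from the CLI with pvesm; a sketch using the lab names (storage ID and content list are only examples):

Shell:
# optionally restrict with --nodes pvirtocbhpewd01,pvirtocbhpewd02,...
pvesm add dir san-lun0-gfs2 --path /mnt/pve/san-lun0-gfs2 --shared 1 --content images,rootdir,iso,vztmpl,backup,snippets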
 
We use SCTP, like corosync, when multihomed corosync is used (best practice). Then we need to post-configure LVM by modifying /etc/lvm/lvm.conf and uncommenting.
What do you need LVM for in your setup? LVM does not seem to be a requirement for GFS2/DLM to work. I put the GFS2 filesystem directly on a shared FC LUN provided by an IBM Storwize system. The Storwize system has some kind of logical volume management of its own, so adding another layer of logical volume management on top of that did not feel right.
 
You mean GFS2 on a raw device? Not tested. I read that GFS2 needs lvmlockd, so maybe I created the LV unnecessarily? All storage arrays use volume management internally anyway; all devices are virtual volumes in any case (I also use PowerMax and NetApp AFF, but not Pure or Hitachi).
I will check if I can set it up on a raw device directly.
 
You mean GFS2 on a raw device? Not tested. I read that GFS2 needs lvmlockd, so maybe I created the LV unnecessarily? All storage arrays use volume management internally anyway; all devices are virtual volumes in any case (I also use PowerMax and NetApp AFF, but not Pure or Hitachi).
I will check if I can set it up on a raw device directly.

Yes, on a raw device. I just re-tested distributed locks by running the following two commands on two different hosts. The second flock command immediately returns exit code 1, which makes me believe that distributed locks are working without LVM:

Bash:
# Host A
touch /var/dmlun1/tstlock
flock --nonblock /var/dmlun1/tstlock sleep 60

Bash:
# Host B
flock --nonblock /var/dmlun1/tstlock sleep 60
echo $?

Some additional information

Bash:
# Output of mount
/dev/mapper/dmlun1 on /var/dmlun1 type gfs2 (rw,noatime,rgrplvb)

Code:
# Relevant parts in multipath.conf
multipaths {
  multipath {
    wwid 123456789123456789123456789123456
    alias dmlun1
  }
}
 
Yes, on a raw device. I just re-tested distributed locks by running the following two commands on two different hosts. The second flock command immediately returns exit code 1, which makes me believe that distributed locks are working without LVM.
I did some tests, in particular using a Windows VM and CrystalDiskMark: a bit more performance on random I/O, a bit less on sequential. I still need to look at the automounter, but your setup is interesting as it removes lvmlockd from the equation.
 
I will simplify by using GFS2 directly on the raw device (no need for lvmlockd, and it is a bit unstable at startup). I also need to change the lock manager to test sanlock: using DLM with corosync can cause problems when there is an issue reaching a device (partial or full path loss), and that may disturb the cluster services, which is not a good idea.