[TUTORIAL] PoC 2 Node HA Cluster with Shared iSCSI GFS2

tscret

I'm working at an MSP that supports a number of 2-node, direct-attached-storage VMware clusters. I'm a Proxmox enthusiast and am trying to steer our business away from Broadcom.
One of my more senior workmates stated that, going by the docs, our usual architecture (2 servers plus an MSA) is not feasible with Proxmox. So, challenge accepted: I rolled out a little PoC and would like to share my playbook with you. I know a 2-node HA cluster can run into split-brain, but when storage and VM traffic flow through the same bond, losing that bond already causes problems that are out of scope here. I hope this helps someone - enjoy!

Code:
# Playbook: PoC 2-node HA PVE cluster with shared iSCSI, GFS2 and Corosync "two_node"
# By tscret, 06.01.2025

### Information Sources
=> https://forum.proxmox.com/threads/pve-7-x-cluster-setup-of-shared-lvm-lv-with-msa2040-sas-partial-howto.57536/
=> https://manpages.debian.org/unstable/corosync/votequorum.5.en.html

# Architecture:
# 2 nodes as nested virtualisation
# Syno DS1515+ as iSCSI portal with two LUNs - DSM 6

# To prove => a two-node cluster with direct-attached or iSCSI storage, with load balancing and HA - capable of thin provisioning and snapshots (QCOW2)

# Define LUN on Syno

> iSCSI Manager / Target
>> <Create> Name: poc | IQN: iqn.2000-01.com.synology:LAB-NAS01.Target-1.ae07f0977a - <NEXT> 0 Map later - <NEXT> - <Apply>
> iSCSI Manager / LUN
>> <Create> Name: lun-gfs2 | Location: Volume 1 | Total capacity: 500 GB | Space Allocation: Thin Provisioning - <NEXT> Map Later - <NEXT> - <Apply>
>> Select lun-gfs2 - <Action> <Edit> / Mapping - Select poc <ok>
>> !!!! Enable "Allow multiple sessions" on the target - required so both nodes can log in at the same time

# Set up two VM nodes (nested virtualisation)

# 6 vCPU (HOST) - 16 GB RAM - 64 GB Disk - 1 NIC on vlan120 - ISO Installer PVE 8.3 - TAG plb_ignore_vm
> Asterix VMID: 991 on LAB-PVE02 10.144.21.238/23
> Obelix VMID: 992 on LAB-PVE01 10.144.21.239/23

# Set up both hosts with search domain test.lan
# Root Password: <CHANGEME>

# Post Install Proxmox Helper Script
> Change Repos and Update
$ apt update && apt upgrade -y
$ reboot now

# Install openvswitch-switch (I just prefer OVS over Linux Bridge)
$ apt install openvswitch-switch -y

> Create Cluster (Test)
> Join Obelix to Cluster
> Create iSCSI Storage for GFS2
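# CLI alternative for the three GUI steps above (a sketch - the storage ID "syno-iscsi"
# and <SYNO-IP> are placeholders I chose; the IQN is the one created on the Syno earlier):
$ pvecm create Test                 # on Asterix
$ pvecm add 10.144.21.238           # on Obelix: join the cluster via Asterix's IP
$ pvesm add iscsi syno-iscsi --portal <SYNO-IP> --target iqn.2000-01.com.synology:LAB-NAS01.Target-1.ae07f0977a --content none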

# Change iSCSI to node.startup automatic
$ nano /etc/iscsi/iscsid.conf  # change node.startup to automatic
$ service iscsid restart
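# Non-interactive alternative (assumes the stock Debian iscsid.conf still contains the
# default "node.startup = manual" line - check before relying on this):
$ sed -i 's/^node.startup = manual/node.startup = automatic/' /etc/iscsi/iscsid.conf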

# Determine Disks
$ lsblk
NAME               MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda                  8:0    0   64G  0 disk
├─sda1               8:1    0 1007K  0 part
├─sda2               8:2    0  512M  0 part
└─sda3               8:3    0 63.5G  0 part
  ├─pve-swap       252:0    0  7.9G  0 lvm  [SWAP]
  ├─pve-root       252:1    0 25.9G  0 lvm  /
  ├─pve-data_tmeta 252:2    0    1G  0 lvm
  │ └─pve-data     252:4    0 19.8G  0 lvm
  └─pve-data_tdata 252:3    0 19.8G  0 lvm
    └─pve-data     252:4    0 19.8G  0 lvm
sdb                  8:16   0    1G  0 disk
sdc                  8:32   0  500G  0 disk
sr0                 11:0    1  1.3G  0 rom


# Edit /etc/pve/corosync.conf
$ cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
$ nano /etc/pve/corosync.conf.new

Edit the quorum section (and remember to increase config_version in the totem section by 1, as required whenever corosync.conf is changed):

quorum {
  provider: corosync_votequorum
  two_node: 1
}
>>>>>>>>>>>>
$ cp /etc/pve/corosync.conf /etc/pve/corosync.bak
$ mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
$ systemctl status corosync

# Check Quorum
$ pvecm status
Cluster information
-------------------
Name:             Test
Config Version:   2
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Jan  3 22:45:32 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.1b
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1
Flags:            2Node Quorate WaitForAll

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.144.21.238 (local)
0x00000002          1 10.144.21.239


# Make GFS2 Filesystem and Mount

>>>>>>>>>>> Bashscript
hosts="10.144.21.238 10.144.21.239"

for host in $hosts; do ssh $host 'apt install dlm-controld gfs2-utils -y'; done
for host in $hosts; do ssh $host 'mkdir /etc/dlm; echo protocol=tcp >> /etc/dlm/dlm.conf; echo enable_fencing=0 >> /etc/dlm/dlm.conf; systemctl restart dlm'; done
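# Optional sanity check (a sketch; dlm_tool ships with dlm-controld): the daemon should
# report its status on both nodes before we create the filesystem
for host in $hosts; do ssh $host 'dlm_tool status'; done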

#Copy snippet and run
# Note: the cluster name passed to mkfs.gfs2 must match the corosync cluster name (here: Test),
# otherwise lock_dlm will refuse to mount the filesystem
read -p "Please enter the cluster name (default: Datacenter): " clustername
clustername="${clustername:-Datacenter}"

read -p "Please enter the mount path (default: /mnt/pve/iscsi-gfs2): " mnt
mnt="${mnt:-/mnt/pve/iscsi-gfs2}"
read -p "Please enter the LUN device (default: /dev/sdb): " lun
lun="${lun:-/dev/sdb}"
num_hosts=$(echo $hosts | wc -w)
mkfs.gfs2 -t $clustername:iscsi-gfs2 -j $num_hosts -J 128 $lun
uuid=$(blkid $lun | sed -n 's/.*UUID=\"\([^\"]*\)\".*/\1/p')
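# Optional check (a sketch): tunegfs2 from gfs2-utils lists the superblock, so the
# lock table name ($clustername:iscsi-gfs2) and UUID can be confirmed before mounting
tunegfs2 -l $lun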

cat > "/etc/systemd/system/gfs2mount.service" <<EOT
[Unit]
Description=Mount GFS2 Service
After=iscsid.service dlm.service network.target iscsi.service multi-user.target
Requires=iscsid.service dlm.service iscsi.service

[Service]
Type=oneshot
ExecStartPre=/usr/bin/bash -c 'while ! lsblk -o NAME,UUID | grep -q "$uuid"; do sleep 5; done'
ExecStart=/usr/bin/mount -t gfs2 /dev/disk/by-uuid/$uuid $mnt
ExecStop=/usr/bin/umount $mnt
RemainAfterExit=true

[Install]
WantedBy=multi-user.target
EOT


for host in $hosts; do scp "/etc/systemd/system/gfs2mount.service" $host:/etc/systemd/system/; ssh $host "mkdir -p $mnt; systemctl daemon-reload; systemctl enable gfs2mount.service; systemctl start gfs2mount.service"; done


cat >> /etc/pve/storage.cfg << EOT

dir: GFS2
        path $mnt
        content rootdir,images
        prune-backups keep-all=1
        shared 1
EOT
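# Optional (an addition to the original playbook): verify the new storage on both nodes.
# Consider also adding "is_mountpoint 1" to the dir storage above so PVE refuses to
# write into an empty directory if the GFS2 mount is ever missing.
for host in $hosts; do ssh $host 'pvesm status'; done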

>>>>>>>>>>>>>>>>>>> End of Bashscript

# Tests carried out
(x) Mount at boot before HA start
(x) Read / write into GFS2
(x) HA failover on power outage of a node
(x) HA recovery after both nodes are back online
 
Thanks a lot for sharing this.
But in my testing environment, when I try to shut down a node it refuses to power off. I unset the quiet flag in GRUB and found out that dlm.service shuts down before the GFS2 filesystem is unmounted.
I worked around this issue by using /etc/fstab instead of the gfs2mount.service:

Code:
#iSCSI GFS2
UUID=UUID MOUNTPOINT gfs2 defaults,x-systemd.automount,x-systemd.mount-timeout=30,x-systemd.requires=iscsi.service,x-systemd.requires=dlm.service,x-systemd.before=pve-ha-lrm.service,_netdev 0 2

Useful links:
How to delay the mount of Shared Folder in /etc/fstab
Special Considerations when Mounting GFS2 File Systems
Configure iSCSI Initiator and Mount via fstab
 
Thanks, seems to be working!
 
Now the question is how to solve the quorum issue when one node goes down. The remaining node will reboot, since quorum is lost and nothing works any more.
 
Did you change the Corosync config?

Edit the quorum section:

quorum {
  provider: corosync_votequorum
  two_node: 1
}

This should reduce the required quorum to 1.
 
No, not yet. This will probably prevent the node from rebooting? But the question is: what happens when a node loses network connectivity? Will it try to start the virtual machines on both nodes? How does GFS2 behave in this situation? Do we get data corruption when a VM is powered on on both nodes? Or does GFS2 locking (which uses the same network as the cluster) prevent it?
EDIT: We will probably need an external quorum device to be safe: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_corosync_external_vote_support
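For reference, the QDevice setup from the linked chapter boils down to a few commands (a rough sketch; <QDEVICE-IP> is a placeholder for whatever external VM runs the qnetd daemon):

Code:
# on the external quorum host
apt install corosync-qnetd -y

# on both cluster nodes
apt install corosync-qdevice -y

# on one cluster node (needs root SSH access to the external host)
pvecm qdevice setup <QDEVICE-IP>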
 
When your network design is based on a bonded uplink and iSCSI runs on the same NIC/network as the cluster, how could an isolated node still write to the storage? So in such a small deployment this is, in my opinion, enough to prevent corruption.
 
Yes, this can help in the iSCSI case. I am actually using an FC SAN, so I will try the external quorum device on an external VM.
Anyway, thanks for the guide - the first tests show GFS2 performing well. I will keep validating this design.
 
It would be great if we could get something like a qdisk back, so the storage itself could act as a source of quorum...
But as of now, corosync has discontinued this approach.
 
I am doing more testing with two_node, and it seems that when one node is unreachable the other keeps running, but it is not possible to power on virtual machines from the GFS2 volume - probably due to DLM locking?
Can you confirm this behaviour?
EDIT: Strange, but I am not able to reproduce it any more. It seems to be working, HA included.
 
Please note the "2" at the end of the fstab entry above - it should be 0, to disable the systemd-triggered fsck of the GFS2 filesystem. fsck.gfs2 must only be run when the filesystem is unmounted on all nodes.
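A minimal sketch of generating that corrected entry from the playbook's variables (assumes $lun and $mnt as set in the script above, and uses "0 0" as recommended here):

Code:
uuid=$(blkid -s UUID -o value $lun)
echo "UUID=$uuid $mnt gfs2 defaults,x-systemd.automount,x-systemd.mount-timeout=30,x-systemd.requires=iscsi.service,x-systemd.requires=dlm.service,x-systemd.before=pve-ha-lrm.service,_netdev 0 0" >> /etc/fstab
systemctl daemon-reload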
 