Proxmox Cluster with local Gluster servers.

FrancisS

Hello,

My goal is to use Gluster to implement a very simple replicated local storage (like VMware vSAN).

I have a Proxmox cluster with node1 and node2 and some local disks.

I installed the Gluster server on each node and created a Gluster volume (distributed/replicated) with the local disks.

Now I want to mount the Gluster volume on node1 from "localhost" with node2 as backup, and
on node2 from "localhost" with node1 as backup.

But in the Proxmox GUI, while I can set localhost as the primary server, for the backup I can only set node1 OR node2.

If I set node1 as the backup, that is wrong for node1 itself (its backup needs to be node2), and the same problem applies to node2 if I set node2.

A possible workaround? I mounted the Gluster volume from /etc/fstab and created a shared "directory" storage from the Proxmox GUI.

Best regards.

Francis
 
GlusterFS manages the load balancing itself once you are connected.

The secondary IP is only used when you try to connect to the storage and the first Gluster node isn't up at the initial connection.

Only two nodes is a very bad idea; you can get split-brain with that.

https://docs.gluster.org/en/latest/Administrator-Guide/Split-brain-and-ways-to-deal-with-it/

If you can, use a third node for at least the arbiter.
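For reference, a replicated volume with an arbiter like the GlusterSSD volume below can be created roughly like this (a sketch only, not the actual command used for this setup; the IPs and brick paths are copied from the volume info that follows):

# Sketch: 2-way replica plus one arbiter brick (replica 3 arbiter 1);
# the last brick of the set (10.10.5.91) becomes the arbiter.
gluster volume create GlusterSSD replica 3 arbiter 1 \
  10.10.5.93:/Data/GlusterSSD/Brick1 \
  10.10.5.92:/Data/GlusterSSD/Brick1 \
  10.10.5.91:/Data/GlusterSSD/Brick1
gluster volume start GlusterSSD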



Volume Name: GlusterSSD
Type: Replicate
Volume ID: 248e00ac-f1d4-48e1-bf3c-06d03e549434
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.10.5.93:/Data/GlusterSSD/Brick1
Brick2: 10.10.5.92:/Data/GlusterSSD/Brick1
Brick3: 10.10.5.91:/Data/GlusterSSD/Brick1 (arbiter)
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet


Volume Name: GlusterEMMC
Type: Distributed-Replicate
Volume ID: da3ff246-481f-4a7e-9302-95eb361e6884
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: 10.10.5.91:/Data/GlusterEMMC/Brick1
Brick2: 10.10.5.92:/Data/GlusterEMMC/Brick1
Brick3: 10.10.5.93:/Data/GlusterEMMC/Brick1_a (arbiter)
Brick4: 10.10.5.91:/Data/GlusterEMMC/Brick2
Brick5: 10.10.5.93:/Data/GlusterEMMC/Brick2
Brick6: 10.10.5.92:/Data/GlusterEMMC/Brick2_a (arbiter)
Brick7: 10.10.5.93:/Data/GlusterEMMC/Brick3
Brick8: 10.10.5.92:/Data/GlusterEMMC/Brick3
Brick9: 10.10.5.91:/Data/GlusterEMMC/Brick3_a (arbiter)
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on

Example of my storage configuration in Proxmox (/etc/pve/storage.cfg):

glusterfs: GlusterSSD
path /mnt/pve/GlusterSSD
volume GlusterSSD
content vztmpl,images,iso,snippets,backup
prune-backups keep-all=1
server 10.10.5.93
server2 10.10.5.92


glusterfs: GlusterEMMC
path /mnt/pve/GlusterEMMC
volume GlusterEMMC
content vztmpl,backup,iso,snippets,images
prune-backups keep-last=2
server 10.10.5.91
server2 10.10.5.92
 
Hello Dark26,

Thank you,

I know the problems with two nodes. I have only two servers, so no node for an arbiter, but I planned to use fencing.

My Gluster nodes = Proxmox nodes; the Gluster network is connected back-to-back (the corosync network is also back-to-back).

Volume Name: gv1
Type: Distributed-Replicate
Volume ID: 3fde1bfe-9c2f-44d2-8282-941f2f2bf010
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: node1-gluster:/brick/brick1/brick
Brick2: node2-gluster:/brick/brick1/brick
Brick3: node1-gluster:/brick/brick2/brick
Brick4: node2-gluster:/brick/brick2/brick
Brick5: node1-gluster:/brick/brick3/brick
Brick6: node2-gluster:/brick/brick3/brick
Brick7: node1-gluster:/brick/brick4/brick
Brick8: node2-gluster:/brick/brick4/brick
Options Reconfigured:
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

node1 /etc/fstab

localhost:gv1 /gluster/gv1 glusterfs _netdev,backupvolfile-server=node2-gluster 0 1

node2 /etc/fstab

localhost:gv1 /gluster/gv1 glusterfs _netdev,backupvolfile-server=node1-gluster 0 1

and for the Proxmox storage:

dir: gluster
path /gluster/gv1
content iso,snippets,vztmpl,backup,rootdir,images
prune-backups keep-all=1
shared 1

On the cluster I also have an iSCSI/LVM (shared), GFS2, and directory storage.

Best regards.

Francis
 
Hello,

At system shutdown/reboot the glusterd service stops but the glusterfsd processes do not, so unmounting the bricks fails.

I saw some articles about a change in Gluster where stopping glusterd no longer stops the glusterfsd processes.

To stop the glusterfsd processes we have to stop the Gluster volumes before the glusterd service stops. How can I "autostop" the Gluster volumes?
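One way this is sometimes handled (a sketch only, untested here; the unit name is made up) is a small systemd unit ordered after glusterd, so that at shutdown its ExecStop runs first and stops the volume while glusterd is still up. Note that "gluster volume stop" stops the volume cluster-wide, so this only makes sense when the whole cluster is being shut down; killing only the local brick processes (e.g. killall glusterfsd) would be the per-node alternative.

# /etc/systemd/system/gluster-volume-autostop.service  (hypothetical name)
[Unit]
Description=Stop Gluster volumes before glusterd at shutdown
After=glusterd.service
Requires=glusterd.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/true
# --mode=script suppresses the interactive y/n confirmation
ExecStop=/usr/sbin/gluster --mode=script volume stop gv1

[Install]
WantedBy=multi-user.target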

Best regards.

Francis
 
Hi FrancisS,

I am planning to use GlusterFS as shared storage. Could you give me some idea about the performance of GlusterFS? Are you using iSCSI on top of GlusterFS?
 
Hi imran.tee,

The performance of GlusterFS depends on your disks, network bandwidth, number of nodes, etc. Generally you get good performance.

iSCSI on top of GlusterFS? No, but why?

For GlusterFS you do not need iSCSI; for iSCSI you do not need GlusterFS.

With iSCSI you can be a target (server) or an initiator (client); with Proxmox, most of the time you are an initiator, i.e. a network client of an iSCSI target (the network server).
iSCSI behaves like local disks, so on top you create a partition/filesystem, LVM, LVM/filesystem, etc.

For GlusterFS you have local disks on your Proxmox servers; on top of the disks you create bricks (filesystems), and with the bricks you create GlusterFS filesystems shared by all the Proxmox nodes.

You can use iSCSI disks as local disks for GlusterFS, but then you have GlusterFS network traffic on top of iSCSI network traffic, so what about the performance?

You can also have GlusterFS and export files as iSCSI targets, but why?

On my test environment I have:

- Local disks with a GlusterFS volume shared on the two nodes.

- An iSCSI storage shared between the two nodes, with shared LVM and GFS2.

Best regards.

Francis
 
Hi Francis,

Thanks a lot for your reply. Actually I am looking for a suitable storage solution with good performance and replication/redundancy.

I have one server for Proxmox and two servers for storage. Seeking expert suggestions. The server configurations are below:

Proxmox server:
Dell PowerEdge R650 Rack Server
64 core: 2x Intel Xeon Gold 6338
RAM: 512GB
HDD: 2TB
SSD: 2x960GB SATA SSD
4x10G SFP + 2x1G Ethernet port

Two Storage Server
Dell PowerEdge R730xd
2x10Core v3
32GB Ram DDR4
2x1TB SSD (For OS)
10x4TB SSD (For Data)
4x10G SFP + 2x1G Ethernet port

I have tried GlusterFS with replica 2 and arbiter 1, with both HW RAID10 and ZFS RAID10. IOPS vary between 10-20K and bandwidth between 30-70 MB/s. I also configured iSCSI on top of HW RAID and got 40-70K IOPS and 140-280 MB/s bandwidth.
 
Hi,

I suppose you use the two Storage Servers (SS) as the Gluster servers and the Proxmox Server (PS) as a Gluster client,

and that you have a 2x10Gb bond back-to-back between the SSs and another 2x10Gb bond to connect the PS via a switch.

For Gluster do not use hardware RAID; Gluster has its own RAID1.

Put the 10x4TB disks in JBOD, create a filesystem (a Gluster brick) on each disk, and with the bricks create a distributed/replicated Gluster volume.

It is a lot of work to implement an iSCSI cluster (like a storage array) with only two servers.

Best regards.
Francis

PS: It's also possible to have 3 combined Proxmox/Storage Servers (PSS); then you have both storage and hypervisor redundancy, but that needs 3 "identical" servers.
 
Hi Francis,

I was planning to use the 2 storage nodes (SS) for the data bricks and a partition on the Proxmox node (PS) as an arbiter to avoid split-brain situations.

I had no idea that Gluster could access disks directly. The tutorials I followed all configure Gluster on top of another file system. OK, let me try with JBOD, although I have no working experience with JBOD.

Thank you.

Regards,

Imran
 
Hi Imran,

Gluster cannot access the disks directly; Gluster sits on top of another filesystem.

You have to:

- configure the 10x4TB disks in JBOD,
- put the 10 disks into a Volume Group,
- create a Logical Volume for each of the 10 disks (1 LV -> 1 disk),
- create a filesystem (ext4/xfs) on each of the 10 Logical Volumes; those filesystems are the bricks,
- with the bricks, create a distributed/replicated Gluster volume (see the sketch below).

- (With your HW RAID setup you were only using the cache of the HW RAID card.)
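A rough sketch of those steps for the first disk, plus the volume creation with the arbiter on the Proxmox node as you planned above (hostnames ss1/ss2/pve1, device names, paths and the volume name are all placeholders, not taken from this thread):

# On each storage server, repeat for each of the 10 JBOD disks (sdb..sdk):
pvcreate /dev/sdb
vgcreate vg_gluster /dev/sdb                         # later: vgextend vg_gluster /dev/sdc ...
lvcreate -n brick1 -l 100%PVS vg_gluster /dev/sdb    # one LV pinned to one disk
mkfs.xfs /dev/vg_gluster/brick1
mkdir -p /data/brick1
echo '/dev/vg_gluster/brick1 /data/brick1 xfs defaults 0 2' >> /etc/fstab
mount /data/brick1
mkdir -p /data/brick1/brick                          # brick directory inside the mounted filesystem

# From one node (glusterd running everywhere, peers probed), create the volume:
# data bricks on the two storage servers, arbiter bricks on the Proxmox node.
gluster volume create gvdata replica 3 arbiter 1 \
  ss1:/data/brick1/brick ss2:/data/brick1/brick pve1:/arbiter/brick1/brick \
  ss1:/data/brick2/brick ss2:/data/brick2/brick pve1:/arbiter/brick2/brick
gluster volume start gvdata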

Best regards.
Francis
 
Wouldn't adding an extra LVM layer degrade performance?
 
Hi Imran,

Of course, but not by much, and you gain the LVM capabilities: change disk sizes, increase LV sizes, move data online, etc.
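For example (a sketch; the VG/LV and mount point names are the same placeholders used in the earlier sketch):

lvextend -L +100G /dev/vg_gluster/brick1    # grow the LV backing a brick
xfs_growfs /data/brick1                     # grow the mounted XFS filesystem online
pvmove /dev/sdb /dev/sdk                    # move extents off /dev/sdb to /dev/sdk online (e.g. to replace a disk)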

If you have time to compare with and without LVM, you are welcome to.

Best regards.

Francis
 
Hey Folks,

I'm trying to get clear on something here with a possible setup...

From what you've described here, it sounds like you can do a Proxmox hyper-converged GlusterFS setup on a system with hardware RAID. Is that correct?

I'm thinking of the following scenario:

3 x Dell R720
PERC HARDWARE RAID
2x256 GB SSD -- Proxmox OS (Raid1)
6X1TB SSD -- GlusterFS Brick (each XFS format)
2x10GB NIC -- Gluster Traffic and VM Traffic
4X1GB NIC -- Management Network

3 replica policy

Each 1TB drive is formatted with XFS, because Gluster doesn't use the raw disk, correct?


Would this work?
Seems like a simpler system than CEPH Hyper-converged...

Let me know your thoughts...
 
Hi Zubin,

Yes, it works. The problem is the shutdown of a node: Gluster does not stop correctly on Debian (I have not tested whether the problem is solved now).

>> Each 1TB is formatted with XFS, because it doesn't do the raw disk, correct?

Yes, and do not use hardware RAID for GlusterFS.

>> Seems like a simpler system than CEPH Hyper-converged...

No, both are simple; the advantage of Ceph HCI is the GUI integration.

Best regards.

Francis
 
Zubin,

For the network, you either have 2 switches (10G/1G each) or 2 switches (10G) plus 2 switches (1G):

Storage, VM memory migration, VM production, HA on the 2x10G bond; HV and VM management, HA on the 4x1G bond

or

Storage, VM memory migration, HA on the 2x10G bond; HV and VM management, HA on a 2x1G bond; VM production on a 2x1G bond.

You need 2 networks for HA. I prefer a dedicated 10G+ storage network (also used for VM migration when VMs have a lot of memory), without VM production traffic on it.
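For instance, a 2x10G bond for the dedicated storage/migration network could be declared roughly like this in /etc/network/interfaces on a Proxmox node (a sketch; the interface names, bond mode and address are assumptions and must match your NICs and switch configuration):

auto bond0
iface bond0 inet static
        address 10.10.10.11/24
        bond-slaves enp65s0f0 enp65s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
# dedicated storage / VM migration network, no VM bridge on top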
 
Gluster does not like hardware RAID? Even if the filesystem Gluster sits on is XFS?
 
I have 1 x 10Gb switch.
I have 1 x 1Gb switch.

Thanks for those recommendations. We'll look at getting another two switches of the same brand.
 
Zubin,

>> Gluster does not like hardware RAID? Even if the filesystem Gluster sits on is XFS?

Hardware RAID is not necessary; Gluster manages the redundancy, in your case 3 replicas (the same applies if you use ZFS, Btrfs, or LVM RAID).
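As an illustration, a replica 3 volume over the three R720s might be created roughly like this (a sketch; hostnames and brick paths are made up, and only two of the six brick sets are shown):

# Bricks are listed in groups of three, so each replica set spans all three nodes.
gluster volume create gvvm replica 3 \
  pve1:/bricks/ssd1/brick pve2:/bricks/ssd1/brick pve3:/bricks/ssd1/brick \
  pve1:/bricks/ssd2/brick pve2:/bricks/ssd2/brick pve3:/bricks/ssd2/brick
# The remaining four SSDs per node can be listed the same way, or added later:
#   gluster volume add-brick gvvm pve1:/bricks/ssd3/brick pve2:/bricks/ssd3/brick pve3:/bricks/ssd3/brick
gluster volume start gvvm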

You need 2 switches to avoid a SPOF. If the only switch crashes (the "same" problem if you have to reboot it after an upgrade), all the VMs crash.

Best regards.
Francis
 
In the past I was using GlusterFS with Proxmox, back when we could still put LXC containers on this storage. Not anymore.

But that's not the only reason. Gluster for VMs works fine, but you can have a problem if you have to reboot a GlusterFS server, for a kernel update for example.

Imagine you have a VM with a 50 GB disk.

While the Gluster server is rebooting, the VM image (qcow2/raw) keeps changing, and afterwards the WHOLE file needs to be healed, the full 50 GB.

it can take some time....

If all the VMs are shut down before restarting the GlusterFS server, no problem, but it can be an issue if you can't shut down all the VMs.
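Healing progress after such a reboot can be followed with the standard heal commands, e.g. (using the gv1 volume name from earlier in the thread):

gluster volume heal gv1 info                    # entries still pending heal, per brick
gluster volume heal gv1 statistics heal-count   # per-brick counts only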

With Ceph, only the blocks that have changed are "healed", so maybe only a few megabytes or gigabytes at most have to be healed. It's quicker.

For me GlusterFS is not good for VMs, but for standard files, a Samba share for example, it's very good; I still use it between my two NAS boxes.
 
@Dark26
GlusterFS is volume-level redundancy; I figured this might have been achieved by mirroring blocks, not files, but it sounds like my understanding was incorrect...

This seems like a pretty big reason to avoid GlusterFS for hypervisor storage, but it sounds very compatible with Proxmox Backup Server. Ironically, PBS doesn't offer any GUI option to mount GlusterFS; perhaps there's a huge caveat that I'm missing.
Cheers,

Tmanok
 
