Fine integration of GlusterFS in a Proxmox cluster

zeroun

New Member
Aug 21, 2024
Hello everyone.

I have a Proxmox VE 8.2.4 cluster composed of 3 nodes. For now I have three storages: a Samba storage on an external QNAP and ZFS storage with deduplication enabled for the VM images. I use the replication feature to protect the VM disks: each VM disk is replicated to all 3 nodes. I have also configured 2 VMs in HA. For now everything works wonderfully well.

Physically, the 3 nodes are small NUCs, each with a 16-thread CPU, 16 GB of RAM and a 1 TB SSD.

The VM disks have a capacity of between 8 and 32 GB.

I wanted to switch to GlusterFS to have a volume dedicated to the cluster's VM disk images, i.e. real shared storage, which would let me migrate a VM almost instantly (a few seconds).

But I am running into a thorny problem:

History:

When I reboot the cluster nodes, or when there is a power outage (no, I don't want a UPS), the configuration with ZFS storage combined with the replication feature works perfectly well. The disks of my VMs remain intact.

So I want a replica 3 on my GlusterFS volume, since I have 3 nodes. I configure GlusterFS independently on each node and mount the volume with the following service:

Code:
cat /etc/systemd/system/glusterfs-mount.service
[Unit]
Description=Mount GlusterFS Volume
After=network-online.target glusterd.service
Wants=network-online.target glusterd.service

[Service]
Type=idle
ExecStart=/bin/mount -t glusterfs localhost:/volumegfs /gluster
RemainAfterExit=yes
Restart=on-failure
RestartSec=10
StartLimitIntervalSec=600
StartLimitBurst=10

[Install]
WantedBy=multi-user.target
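
For reference, roughly the same mount could also be expressed as a single /etc/fstab entry instead of a dedicated unit (just a sketch, not what I use here; the x-systemd.requires option is an assumption on my part):

Code:
# /etc/fstab -- sketch of an fstab-based GlusterFS mount
localhost:/volumegfs  /gluster  glusterfs  defaults,_netdev,x-systemd.requires=glusterd.service  0  0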

The volumegfs volume mounts fine once all 3 nodes have been rebooted.

Bash:
df -h
Filesystem               Size    Used Avail Use% Mounted on
udev                      960M       0  960M   0% /dev
tmpfs                     197M    536K  197M   1% /run
/dev/mapper/k1--vg-root   3,8G    2,4G  1,3G  67% /
tmpfs                     984M       0  984M   0% /dev/shm
tmpfs                     5,0M       0  5,0M   0% /run/lock
/dev/mapper/k1--vg-gfs    7,8G    1,1G  6,4G  14% /mnt/gfs
/dev/sda1                 455M    156M  275M  37% /boot
localhost:/volumegfs      7,8G    1,1G  6,4G  15% /gluster
tmpfs                     197M       0  197M   0% /run/user/1002

The 3 nodes do not all restart at the same time, nor in any particular order. It therefore takes a little while before the volumegfs volume is mounted and operational on all nodes.

And this is where the disaster occurs. Proxmox VE does not let me attach a condition to a VM whose 'start at boot' flag is set: it simply starts it on its node. The volumegfs volume is not yet ready, which makes the VM fail to start.

The Proxmox web interface for declaring a GlusterFS storage does not yet meet my expectations (replica 3).

And I prefer to manage GlusterFS myself, outside of Proxmox.

I had to set up a Ceph cluster external to the Proxmox cluster in the past for a large infrastructure and it works wonderfully.

But here my infrastructure is very small, which is why GlusterFS seemed the sensible choice.

What I'm looking for is a way to tell Proxmox VE at boot that, when a VM has its 'start at boot' flag set, it should start it (which it already does perfectly well), BUT it must wait and retry as long as the volumegfs volume is not operational.

And apparently the virtual machine startup management of Proxmox VE is internal. I don't see a way out of this problem.

Thank you very much in advance for your help.
 
And this is where the disaster occurs. Proxmox VE does not let me attach a condition to a VM whose 'start at boot' flag is set: it simply starts it on its node. The volumegfs volume is not yet ready, which makes the VM fail to start.

The Proxmox web interface for declaring a GlusterFS storage does not yet meet my expectations (replica 3).

What I'm looking for is a way to tell Proxmox VE at boot that, when a VM has its 'start at boot' flag set, it should start it (which it already does perfectly well), BUT it must wait and retry as long as the volumegfs volume is not operational.

I think you ran into a very peculiar manifestation of the same [1], quoting:

Overall, it is the responsibility of the admin to set up everything for HA to work (mainly storage, but also passed through devices).

Yeah, it's not really something considered to be in scope.

[1] https://bugzilla.proxmox.com/show_bug.cgi?id=5156#c1
 
You need an additional line "After=glusterfs-mount.service" in your systemd PVE daemon startup unit ...
but perhaps that has to be redone after each PVE update ... :)
 
  • Like
Reactions: esi_y
You could quite easily write a script (cron @reboot?) and have it check that the GlusterFS volume is mounted and then start the VM. (You'd turn off "Start at boot" within Proxmox for that VM.)
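
For example (a rough, untested sketch; VMID 100, the script path and the /gluster mountpoint are assumptions):

Code:
#!/bin/bash
# /usr/local/bin/start-vm-when-gluster-ready.sh (hypothetical path)
# root crontab entry: @reboot /usr/local/bin/start-vm-when-gluster-ready.sh
VMID=100            # assumption: the VM that lives on GlusterFS
MOUNTPOINT=/gluster

# wait until the GlusterFS volume is actually mounted
until mountpoint -q "$MOUNTPOINT"; do
  sleep 10
done

# start the VM ("Start at boot" is disabled for it in Proxmox)
qm start "$VMID"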

Another option would be to add a node-wide "Start on boot delay", giving enough time for the GlusterFS volume to be mounted and available. You could also add a VM-specific delay in the VM's startup options (the latter only works if some other VM is loaded before it). Obviously this option, unlike the script above, would only wait; it wouldn't check whether the GlusterFS volume is actually available.
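
Something along these lines should do it for the VM-specific delay (VMIDs 100/101 and the 120 s value are just examples):

Code:
# VM 100 starts first; the next guest in the order only starts 120 s later,
# so the GlusterFS-backed VM 101 effectively waits for that delay
qm set 100 --startup order=1,up=120
qm set 101 --startup order=2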

I also believe it could be useful to many users if Proxmox introduced (in-house) some form of conditional templating for VM/LXC startup. However, the scope is so large and varied that this may prove difficult or not worth it.
 
I also believe it could be useful to many users if Proxmox introduced (in-house) some form of conditional templating for VM/LXC startup. However, the scope is so large and varied that this may prove difficult or not worth it.

The condition could be a simple script that returns the number of seconds to wait, 0 to proceed and -1 to abort. The rest is the user's problem.
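
Purely to illustrate the contract I have in mind (the hook itself does not exist in PVE, name and location are hypothetical):

Code:
#!/bin/bash
# hypothetical per-guest pre-start condition script
# prints the number of seconds to wait, 0 to proceed, -1 to abort
if ! mountpoint -q /gluster; then
  echo 30    # storage not ready: ask PVE to retry in 30 s
  exit 0
fi
echo 0       # proceed with the guest start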
 
An option I like (and have used here and there for file and service servers) is to introduce an additional target after multi-user, with the extra services and conditions in its target.wants directory (this can also be defined for a single service). It provides no emergency fallback, but it makes service maintenance easy.
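
A minimal sketch of that idea (the target name is made up):

Code:
# /etc/systemd/system/storage-ready.target -- made-up name, just a sketch
[Unit]
Description=Storage and extra conditions are ready
Requires=multi-user.target
After=multi-user.target glusterfs-mount.service

# services that must wait for it would then carry:
#   [Unit]
#   After=storage-ready.target
#   [Install]
#   WantedBy=storage-ready.target
# and the default target is switched with: systemctl set-default storage-ready.target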
 
Hello everyone.
Thank you all for your interest in my problem. I will first try waltar's solution by adding to

Code:
cat /etc/systemd/system/pve-manager.service
[Unit]
Description=PVE guests
ConditionPathExists=/usr/bin/pvesh
RefuseManualStart=true
RefuseManualStop=true
Wants=pvestatd.service
Wants=pveproxy.service
Wants=spiceproxy.service
Wants=pve-firewall.service
Wants=lxc.service
After=pveproxy.service
After=pvestatd.service
After=spiceproxy.service
After=pve-firewall.service
After=lxc.service
After=pve-ha-crm.service pve-ha-lrm.service


[Service]
Environment="PVE_LOG_ID=pve-guests"
ExecStartPre=-/usr/share/pve-manager/helpers/pve-startall-delay
ExecStart=/usr/bin/pvesh --nooutput create /nodes/localhost/startall
ExecStop=-/usr/bin/vzdump -stop
ExecStop=/usr/bin/pvesh --nooutput create /nodes/localhost/stopall
Type=oneshot
RemainAfterExit=yes
TimeoutSec=infinity


[Install]
WantedBy=multi-user.target
Alias=pve-manager.service

the line After=glusterfs-mount.service

and remembering, whenever I perform an update, to put this line back in the service configuration if it has disappeared.

This seems to me to be the simplest solution while keeping the web interface functionality, with the 'start at boot' yes|no option.

I will keep you informed. Thanks again to all for your answers and for enlightening me on my problem.
 
Many thanks @esi_y

This way I won't have any worries when Proxmox updates.
Code:
cat /etc/systemd/system/pve-manager.service.d/override.conf
[Unit]
After=glusterfs-mount.service

Code:
systemctl daemon-reload

The override with After=glusterfs-mount.service is now in place.

Code:
systemctl show pve-manager.service | grep After

RemainAfterExit=yes
After=systemd-journald.socket sysinit.target spiceproxy.service pveproxy.service system.slice lxc.service glusterfs-mount.service qmeventd.service pve-lxc-syscalld.service basic.target pve-firewall.service pve-ha-crm.service pve-ha-lrm.service pvestatd.service
 
  • Like
Reactions: waltar
Additional notes on installing GlusterFS under Proxmox

On all nodes:

Code:
apt install curl gpg
# import the GlusterFS release signing key
curl https://download.gluster.org/pub/gluster/glusterfs/11/rsa.pub | gpg --dearmor > /usr/share/keyrings/glusterfs-archive-keyring.gpg
# detect the Debian version, codename and architecture
DEBID=$(grep 'VERSION_ID=' /etc/os-release | cut -d '=' -f 2 | tr -d '"')
DEBVER=$(grep 'VERSION=' /etc/os-release | grep -Eo '[a-z]+')
DEBARCH=$(dpkg --print-architecture)
# add the upstream GlusterFS repository
echo "deb [signed-by=/usr/share/keyrings/glusterfs-archive-keyring.gpg] https://download.gluster.org/pub/gluster/glusterfs/LATEST/Debian/${DEBID}/${DEBARCH}/apt ${DEBVER} main" | tee /etc/apt/sources.list.d/gluster.list
apt update
apt install glusterfs-server
# start and enable the gluster management daemon
systemctl start glusterd
systemctl enable glusterd

On one node only (run the probes from a single node):

Code:
gluster peer probe <ip node 1>
gluster peer probe <ip node 2>
gluster peer probe <ip node n>
gluster peer status

<Creating my GlusterFS volume with a replica of 3 since I have 3 Proxmox nodes.>
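
Roughly what that step looks like (the brick directory under /mnt/gfs is an assumption based on my layout above; the host placeholders are as before):

Code:
gluster volume create volumegfs replica 3 \
  <ip node 1>:/mnt/gfs/brick <ip node 2>:/mnt/gfs/brick <ip node 3>:/mnt/gfs/brick
gluster volume start volumegfs
gluster volume info volumegfs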

To add the GlusterFS storage in Proxmox, I added this to /etc/pve/storage.cfg:

Code:
dir: gluster
        path /gluster
        content images
        prune-backups keep-all=1
        shared 1

This is consistent with my /etc/systemd/system/glusterfs-mount.service unit described at the top of this thread.
 
So now I have GlusterFS properly integrated into my Proxmox cluster.

And when the 3 nodes of the Proxmox cluster restart in any order, some slower than others, my VMs on GlusterFS start correctly without problems.

Fabulous Proxmox 8.2.4 under fabulous Linux :)

Many thanks again all.
 
  • Like
Reactions: gfngfn256 and esi_y
Hi all, again.

Oops.

There is one problem, though. It doesn't work exactly as I thought it would.

The gluster_auto_heal service plays its role perfectly.

The VM is only supposed to start once the $volume GlusterFS volume is mounted on the node and there are no more files being healed on GlusterFS.

So I tell myself that, normally, the VMs on the node should wait, since the override rule is in place. Why is that not the case? My VMs start anyway, while the gluster_auto_heal service has not yet finished. Why?

It takes a good ten minutes for the self-heal to complete and for the script to return exit code 0.

Here is:
Code:
cat /etc/systemd/system/gluster_auto_heal.service
[Unit]
Description=Gluster Auto Heal Service
After=network-online.target
After=glusterd.service
After=glusterfs-mount.service
Wants=network-online.target
Wants=glusterd.service
Wants=glusterfs-mount.service

[Service]
Type=idle
ExecStart=/bin/bash -c '/usr/local/bin/gluster_auto_heal gfs0'
RemainAfterExit=yes
SuccessExitStatus=0
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

Code:
cat /usr/local/bin/gluster_auto_heal
#!/bin/bash

volume="$1"

if [ "$volume" == "" ]; then
  echo "99"
  exit 99
fi


# Check that the volume is mounted (0 = mounted, 1 = not mounted)
ok=$(df -h | grep "$volume" | wc -l)
if [ "$ok" -eq 1 ]; then
  ok=0
else
  ok=1
fi

# Read each line of the gluster heal-count output
sum=0
while read -r line; do
  # If the line contains "Number of entries", extract the number and add it to the sum
  if [[ $line == *"Number of entries:"* ]]; then
    number=$(echo "$line" | awk '{print $4}')
    sum=$((sum + number))
  fi
done < <(gluster volume heal "$volume" statistics heal-count)

# Print the totals (S = entries still to heal, M = mount state)
echo "S${sum}, M${ok}"

# Exit 0 only when the volume is mounted and nothing is left to heal
retour=1
if [ "$sum" -eq 0 ] && [ "$ok" -eq 0 ]; then
  retour=0
fi
exit "${retour}"
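
To check it by hand, I run the script manually and look at the exit code:

Code:
/usr/local/bin/gluster_auto_heal gfs0 ; echo "exit code: $?"
# 0 -> volume mounted and nothing left to heal, 1 -> not mounted or heal still in progress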

Code:
cat /etc/systemd/system/pve-manager.service.d/override.conf
[Unit]
After=glusterfs-mount.service
After=gluster_auto_heal.service

Code:
systemctl show pve-manager.service | grep After
RemainAfterExit=yes
After=system.slice pvestatd.service systemd-journald.socket sysinit.target pve-ha-crm.service basic.target glusterfs-mount.service pve-firewall.service qmeventd.service spiceproxy.service lxc.service pve-ha-lrm.service pve-lxc-syscalld.service gluster_auto_heal.service pveproxy.service

Thanks again in advance for your insights ....
 
Code:
cat /etc/systemd/system/gluster_auto_heal.service
[Unit]
Description=Gluster Auto Heal Service
After=network-online.target
After=glusterd.service
After=glusterfs-mount.service
Wants=network-online.target
Wants=glusterd.service
Wants=glusterfs-mount.service

[Service]
Type=idle
ExecStart=/bin/bash -c '/usr/local/bin/gluster_auto_heal gfs0'
RemainAfterExit=yes
SuccessExitStatus=0
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

This is an educated guess, as I do not have enough experience with Gluster, but I suspect you need the auto-heal unit to be Type=oneshot for this to work.

See also:

https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html
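
If I read systemd.service correctly, with Type=idle (like Type=simple) the unit counts as "started" as soon as the process is forked, so your After=gluster_auto_heal.service ordering on pve-guests does not actually wait for the heal check to finish. With Type=oneshot, ordering waits until the script has exited. Roughly (untested):

Code:
# in /etc/systemd/system/gluster_auto_heal.service, replace Type=idle with:
[Service]
Type=oneshot
RemainAfterExit=yes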
 
