Hello everyone.
I have a Proxmox VE 8.2.4 cluster composed of 3 nodes. For now I have three storages, including an SMB share to an external QNAP and a ZFS storage with deduplication enabled for the VM images. I use the replication feature to protect the VM disks: each VM disk is replicated across the 3 nodes. I have also configured 2 VMs in HA. So far everything works wonderfully well.
Physically, the 3 nodes are small NUCs, each with a 16-thread CPU, 16 GB of RAM and a 1 TB SSD.
The VM disks are between 8 and 32 GB each.
I wanted to switch to GlusterFS to have a volume dedicated to the cluster's VM disk images, and therefore true shared storage, which would allow almost instant VM migration (a few seconds).
But I am running into a thorny problem.
Background:
When I reboot the cluster nodes, or when there is a power outage (no, I don't want a UPS), the ZFS storage combined with the replication feature works perfectly well: my VM disks remain intact.
So I want a replica 3 volume, since I have 3 nodes. I configure GlusterFS independently on each node and mount the volume with the following systemd service:
Code:
cat /etc/systemd/system/glusterfs-mount.service

[Unit]
Description=Mount GlusterFS Volume
After=network-online.target glusterd.service
Wants=network-online.target glusterd.service

[Service]
Type=idle
ExecStart=/bin/mount -t glusterfs localhost:/volumegfs /gluster
RemainAfterExit=yes
Restart=on-failure
RestartSec=10
StartLimitIntervalSec=600
StartLimitBurst=10

[Install]
WantedBy=multi-user.target
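For comparison, the same fuse mount could also be declared in /etc/fstab (just a sketch, reusing the same localhost:/volumegfs volume and /gluster mount point; _netdev tells systemd to treat it as a network mount):
Code:
# /etc/fstab -- sketch of an equivalent GlusterFS fuse mount
localhost:/volumegfs  /gluster  glusterfs  defaults,_netdev  0  0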
The volumegfs volume mounts fine once all 3 nodes have been rebooted:
Bash:
df -h
Sys. de fichiers Taille Utilisé Dispo Uti% Monté sur
udev 960M 0 960M 0% /dev
tmpfs 197M 536K 197M 1% /run
/dev/mapper/k1--vg-root 3,8G 2,4G 1,3G 67% /
tmpfs 984M 0 984M 0% /dev/shm
tmpfs 5,0M 0 5,0M 0% /run/lock
/dev/mapper/k1--vg-gfs 7,8G 1,1G 6,4G 14% /mnt/gfs
/dev/sda1 455M 156M 275M 37% /boot
localhost:/volumegfs 7,8G 1,1G 6,4G 15% /gluster
tmpfs 197M 0 197M 0% /run/user/1002
The 3 nodes do not all restart at the same time or in the same order, so it takes a little while for the volumegfs volume to be mounted and operational on every node.
And this is where the disaster happens. Proxmox VE does not let me attach a condition to a VM whose "start at boot" flag is set: it simply tries to start it on its node. The volumegfs volume is not ready yet, so the VM fails to start.
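At the systemd level, the ordering I would need looks roughly like the drop-in below (a sketch only, untested; pve-guests.service is the unit that autostarts guests at boot, and since my glusterfs-mount.service is Type=idle, systemd considers it started before the mount has actually completed, so ordering alone may not be enough):
Code:
# /etc/systemd/system/pve-guests.service.d/wait-for-gluster.conf  (sketch)
[Unit]
# do not autostart the guests before my custom mount service has run
After=glusterfs-mount.service
Wants=glusterfs-mount.service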
The Proxmox web interface for declaring a GlusterFS storage does not yet meet my expectations (replica 3, i.e. a volume created along the lines of the sketch below).
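To illustrate what I mean by replica 3 (the node names and brick path here are placeholders, not my real hostnames):
Bash:
# create a 3-way replicated volume, one brick per node (sketch)
gluster volume create volumegfs replica 3 \
    node1:/mnt/gfs/brick node2:/mnt/gfs/brick node3:/mnt/gfs/brick
gluster volume start volumegfs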
And I prefer to manage GlusterFS on my own, outside of Proxmox.
In the past I had to set up a Ceph cluster external to a Proxmox cluster for a large infrastructure, and it works wonderfully.
But here my infrastructure is very small, which is why GlusterFS seemed like the sensible choice.
What I'm looking for is a way to tell Proxmox VE at boot that when a VM has its "start at boot" flag it must indeed be started (which it already does perfectly well), BUT that it must wait and retry for as long as the volumegfs volume is not yet operational, roughly like the sketch below.
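Expressed as a shell script, the behaviour I am after would look something like this (only a sketch; the VMIDs are placeholders and /gluster is the mount point from above):
Bash:
#!/bin/bash
# wait until the GlusterFS fuse mount is really there, then start the guests
until mountpoint -q /gluster; do
    sleep 5
done
qm start 100   # placeholder VMIDs
qm start 101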
And apparently the VM startup manager of Proxmox VE is internal, so I don't see how to get around this problem.
Thank you very much in advance for your help.