[TUTORIAL] Condition VM start on GlusterFS reachability

The idle type allowed the service to remain failed for as long as the conditions were not met. By setting the type to oneshot, I no longer have that behaviour.

I might not have paid enough attention, but if I understand you correctly, your whole point is that you do not want the VMs to start unless and until you have a healthy filesystem. So why not have your script pass with flying colours when everything is right and take 10 minutes when it is not? Why do you want it to fail?
 
<(gluster volume heal $volume statistics heal-count)
I still do not understand how this could work without a "$", and I thought you had already confirmed it is running with Type=idle.
It's unbelievable that it could run for 10 minutes ... do a live run with "bash -xv /usr/local/bin/gluster_auto_heal" ... what happens, what goes wrong?!
 
I might not have paid enough attention, but if I understand you correctly, your whole point is that you do not want the VMs to start unless and until you have a healthy filesystem. So why not have your script pass with flying colours when everything is right and take 10 minutes when it is not? Why do you want it to fail?
Yes @esi_y, that's exactly it :) I want the gluster_auto_heal service to fail and stay in that state until GlusterFS is healthy.
 
In fact, after looking at the other post (the iSCSI one), I have come to the conclusion that overriding pve-manager as recommended does not work; it is not in the spirit of the tool. So I think this will have to be managed with scripts ...
 
I want the gluster_auto_heal service to fail and stay in that state until GlusterFS is healthy.
But I think that is not achievable with this kind of implementation, as systemd is not willing to wait 10 minutes for the other services to start while your glusterfs is not ready yet!
 
But I think that is not achievable with this kind of implementation, as systemd is not willing to wait 10 minutes for the other services to start while your glusterfs is not ready yet!
And yes, I realize that. So I will have to manage the startup of the VMs myself via scripts, like the person in the other post who talks about iSCSI does.
 
That's a way (starting the VMs by script), but you should nevertheless take a look at why your glusterfs needs so much time to become healthy.
 
That's a way (starting the VMs by script), but you should nevertheless take a look at why your glusterfs needs so much time to become healthy.
(Up to 10 minutes is needed across the entire GlusterFS cluster for self-healing to complete: each Proxmox node has a 1 Gbit/s connection and an SSD disk, with VM disks of up to 32 GB. Eight VMs of 32 GB makes 256 GB, so at worst, if every VM needed a repair, it would take about 30 minutes.)
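
As a rough sanity check of that figure (my assumption: a worst case where whole disk images have to be retransmitted at roughly line rate, 1 Gbit/s ≈ 125 MB/s):

Code:
# Worst case: 8 VMs x 32 GB = 256 GB over a 1 Gbit/s link (~125 MB/s)
echo $(( 256 * 1000 / 125 / 60 ))   # prints 34 (minutes), in line with the ~30 minute estimate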
 
So I will have to manage the startup of the VMs myself via scripts
As I suspected (& suggested in my earlier post). The upside is that it is fully customizable with granular control; the downside is that it's never healthy to mess with a host node (updates & re-installs etc.). There are cases, however, where it can't be avoided.

As I said in my post above, it would be highly beneficial if Proxmox introduced (in-house) some form of conditional templating for VM/LXC startup.
 
As I suspected (& suggested in my earlier post). The upside is that it is fully customizable with granular control; the downside is that it's never healthy to mess with a host node (updates & re-installs etc.). There are cases, however, where it can't be avoided.

As I said in my post above, it would be highly beneficial if Proxmox introduced (in-house) some form of conditional templating for VM/LXC startup.
Understood. In the other post, the user solves this by managing the startup of his VMs himself, using tags (iscsi in his case). For me it would be a gluster tag.
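
For reference, tagging a VM this way could look like the following (a hypothetical example using VM 900; the tag has to match the type_storage value used in the startup script further down):

Code:
# Tag VM 900 with "gluster" so the startup script will match it
qm set 900 --tags gluster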

That said, I agree with you: it's heavy to set up, error-prone, and so on.
 
It would indeed be good if the Proxmox team introduced a built-in conditional model for starting VMs.
 
Yes @esi_y, that's exactly it :) I want the gluster_auto_heal service to fail and stay in that state until GlusterFS is healthy.

I think I have lost the plot by now, but with oneshot everything waits for it; I originally understood that you saw no point in the machine even proceeding to boot the hypervisor without the filesystem being ready.

But I think that is not achievable with this kind of implementation, as systemd is not willing to wait 10 minutes for the other services to start while your glusterfs is not ready yet!

Why not?
 
Interesting discussion.

I will post back with the script-based solution, tested under Proxmox 8.2.4.
 
Fallback solution to start VMs whose disks reside on a replicated GlusterFS volume, if and only if the volume is mounted on the current Proxmox node and the volume's self-heal has finished (this can take up to 10 minutes in my case).

This way, VMs whose disks live on the GlusterFS cluster are guaranteed to start with an intact disk.

Successfully tested on a Proxmox cluster in version 8.2.4.

For each Proxmox node:

Code:
apt install jq

Code:
cat /etc/systemd/system/pve-manager.service.d/override.conf
[Unit]
After=glusterfs-mount.service
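
If you prefer not to create the directory and file by hand, such a drop-in can also be created with standard systemd tooling:

Code:
# Opens an editor and saves the result as /etc/systemd/system/pve-manager.service.d/override.conf
systemctl edit pve-manager.service
# Reload the unit definitions so the new ordering takes effect
systemctl daemon-reload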

Code:
cat /etc/systemd/system/gluster_auto_run_vm.service
[Unit]
Description=Gluster Auto Run VM Service
After=network-online.target
After=glusterd.service
After=glusterfs-mount.service
Wants=network-online.target
Wants=glusterd.service
Wants=glusterfs-mount.service

[Service]
Type=oneshot
ExecStart=/bin/bash -c '/usr/local/bin/gluster_auto_run_vm'
RemainAfterExit=yes
SuccessExitStatus=0
Restart=on-failure
RestartSec=10
TimeoutSec=infinity

[Install]
WantedBy=multi-user.target
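
Once the unit file and the two scripts below are in place, the service has to be enabled on each node so it runs at boot (the status output further down shows it as "enabled"):

Code:
systemctl daemon-reload
systemctl enable --now gluster_auto_run_vm.service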

Note that the gluster_auto_run_vm service is indeed of type oneshot. This allows you to shut a VM down and restart it manually. If you change the type to idle, a VM that you have shut down will restart by itself after 10 seconds ;-)
Code:
cat /usr/local/bin/gluster_auto_heal
#!/bin/bash

# Exit 0 only if the GlusterFS volume is mounted on this node and has no
# entries left to heal; otherwise exit non-zero.

volume="$1"
mount="$2"

if [ -z "$volume" ]; then
  echo "99"
  exit 99
fi

# Check that the volume is mounted on this node
mountpoint -q "/$mount" && ok=0 || ok=1

# Read each line of the gluster heal-count output
sum=0
while read -r line; do
  # If the line contains "Number of entries", extract the number and add it to the sum
  if [[ $line == *"Number of entries:"* ]]; then
    number=$(echo "$line" | awk '{print $4}')
    sum=$((sum + number))
  fi
done < <(gluster volume heal "$volume" statistics heal-count)

# Succeed only when nothing is left to heal and the volume is mounted
retour=1
if [ "$sum" -eq 0 ] && [ "$ok" -eq 0 ]; then
  retour=0
fi
exit "${retour}"
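
The heal check can be exercised by hand at any time (the same invocation appears in the test runs further down); exit status 0 means the volume is mounted and fully healed:

Code:
/usr/local/bin/gluster_auto_heal gfs0 gluster; echo $?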

Code:
cat /usr/local/bin/gluster_auto_run_vm

#!/bin/bash

# Start every VM on this node tagged with "$type_storage", but only after
# gluster_auto_heal confirms the GlusterFS volume is mounted and fully healed.

PVESH=$(which pvesh)
JQ=$(which jq)
HOSTNAME=$(which hostname)

type_storage="gluster"
volume="gfs0"
mountpoint="gluster"

node_name=$($HOSTNAME)

# Abort (and let systemd retry) if the volume is not mounted or still healing
/usr/local/bin/gluster_auto_heal "$volume" "$mountpoint"
ok=$?
if [ "$ok" -ne 0 ]; then
  exit 1
fi

if [ "$($PVESH get /nodes/"$node_name"/qemu --output json | $JQ length)" != "0" ]; then
    # Start each stopped VM on this node that carries the $type_storage tag
    for qemu_id in $($PVESH get /nodes/"$node_name"/qemu --output json | $JQ -r "map(select(.tags and (.tags | contains(\"$type_storage\"))) | .vmid) | @sh" | tr -d "'")
    do
        if [ "$($PVESH get /nodes/"$node_name"/qemu/"$qemu_id"/status/current --output json | $JQ -r '.status')" != "running" ]; then
            echo "vmid $qemu_id in $type_storage on node $node_name is starting."
            $PVESH create "/nodes/$node_name/qemu/$qemu_id/status/start"
        else
            echo "vmid $qemu_id in $type_storage is already running"
        fi
    done
else
    echo "The node $node_name has no VMs"
fi

exit 0
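
Both scripts need to be made executable, a standard step not spelled out above:

Code:
chmod +x /usr/local/bin/gluster_auto_heal /usr/local/bin/gluster_auto_run_vm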

Some examples of how the gluster_auto_run_vm service behaves on the nodes of the Proxmox cluster:
Code:
node 1 :

systemctl daemon-reload; systemctl restart gluster_auto_run_vm.service; systemctl status gluster_auto_run_vm.service
● gluster_auto_run_vm.service - Gluster Auto Run VM Service
     Loaded: loaded (/etc/systemd/system/gluster_auto_run_vm.service; enabled; preset: enabled)
     Active: active (exited) since Wed 2024-09-18 09:24:59 CEST; 10ms ago
    Process: 594953 ExecStartPre=/bin/bash -c /usr/local/bin/gluster_auto_heal gfs0 gluster (code=exited, status=0/SUCCESS)
    Process: 594970 ExecStart=/bin/bash -c /usr/local/bin/gluster_auto_run_vm (code=exited, status=0/SUCCESS)
   Main PID: 594970 (code=exited, status=0/SUCCESS)
        CPU: 8.425s

Sep 18 09:24:50 c1 systemd[1]: Starting gluster_auto_run_vm.service - Gluster Auto Run VM Service...
Sep 18 09:24:54 c1 bash[594970]: vmid 995 in gluster is already running
Sep 18 09:24:56 c1 bash[594970]: vmid 990 in gluster is already running
Sep 18 09:24:57 c1 bash[594970]: vmid 998 in gluster is already running
Sep 18 09:24:59 c1 bash[594970]: vmid 800 in gluster is already running
Sep 18 09:24:59 c1 systemd[1]: Finished gluster_auto_run_vm.service - Gluster Auto Run VM Service.


node 2 : (no VMs with a GlusterFS disk)

 gluster_auto_run_vm.service - Gluster Auto Run VM Service
     Loaded: loaded (/etc/systemd/system/gluster_auto_run_vm.service; enabled; preset: enabled)
     Active: active (exited) since Wed 2024-09-18 09:47:10 CEST; 9ms ago
    Process: 699600 ExecStartPre=/bin/bash -c /usr/local/bin/gluster_auto_heal gfs0 gluster (code=exited, status=0/SUCCESS)
    Process: 699617 ExecStart=/bin/bash -c /usr/local/bin/gluster_auto_run_vm (code=exited, status=0/SUCCESS)
   Main PID: 699617 (code=exited, status=0/SUCCESS)
        CPU: 2.520s

Sep 18 09:47:08 c2 systemd[1]: Starting gluster_auto_run_vm.service - Gluster Auto Run VM Service...
Sep 18 09:47:10 c2 systemd[1]: Finished gluster_auto_run_vm.service - Gluster Auto Run VM Service.


Node 3 : (one VM with a GlusterFS disk, not yet started)

gluster_auto_run_vm.service - Gluster Auto Run VM Service
     Loaded: loaded (/etc/systemd/system/gluster_auto_run_vm.service; enabled; preset: enabled)
     Active: active (exited) since Wed 2024-09-18 09:49:42 CEST; 11ms ago
    Process: 258138 ExecStartPre=/bin/bash -c /usr/local/bin/gluster_auto_heal gfs0 gluster (code=exited, status=0/SUCCESS)
    Process: 258155 ExecStart=/bin/bash -c /usr/local/bin/gluster_auto_run_vm (code=exited, status=0/SUCCESS)
   Main PID: 258155 (code=exited, status=0/SUCCESS)
        CPU: 4.944s

Sep 18 09:49:36 c3 systemd[1]: Starting gluster_auto_run_vm.service - Gluster Auto Run VM Service...
Sep 18 09:49:40 c3 bash[258155]: vmid 900 in gluster on node c3 is starting.
Sep 18 09:49:41 c3 pvesh[258173]: <root@pam> starting task UPID:c3:0003F080:0052C7E1:66EA8615:qmstart:900:root@pam:
Sep 18 09:49:41 c3 pvesh[258176]: start VM 900: UPID:c3:0003F080:0052C7E1:66EA8615:qmstart:900:root@pam:
Sep 18 09:49:42 c3 pvesh[258173]: <root@pam> end task UPID:c3:0003F080:0052C7E1:66EA8615:qmstart:900:root@pam: OK
Sep 18 09:49:42 c3 bash[258173]: UPID:c3:0003F080:0052C7E1:66EA8615:qmstart:900:root@pam:
Sep 18 09:49:42 c3 systemd[1]: Finished gluster_auto_run_vm.service - Gluster Auto Run VM Service.


Node 3 : just after reboot :

systemctl status gluster_auto_run_vm.service
● gluster_auto_run_vm.service - Gluster Auto Run VM Service
     Loaded: loaded (/etc/systemd/system/gluster_auto_run_vm.service; enabled; preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Wed 2024-09-18 10:07:22 CEST; 9s ago
    Process: 8271 ExecStart=/bin/bash -c /usr/local/bin/gluster_auto_run_vm (code=exited, status=1/FAILURE)
   Main PID: 8271 (code=exited, status=1/FAILURE)
        CPU: 128ms


/usr/local/bin/gluster_auto_heal gfs0 gluster;echo $?
1

After 10 minutes, VM 900 is still not started :

systemctl status gluster_auto_run_vm.service
● gluster_auto_run_vm.service - Gluster Auto Run VM Service
     Loaded: loaded (/etc/systemd/system/gluster_auto_run_vm.service; enabled; preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Wed 2024-09-18 10:17:22 CEST; 9s ago
    Process: 8271 ExecStart=/bin/bash -c /usr/local/bin/gluster_auto_run_vm (code=exited, status=1/FAILURE)
   Main PID: 8271 (code=exited, status=1/FAILURE)
        CPU: 128ms

and :
/usr/local/bin/gluster_auto_heal gfs0 gluster;echo $?
1

After 11 minutes, VM 900 is starting :

systemctl status gluster_auto_run_vm.service
● gluster_auto_run_vm.service - Gluster Auto Run VM Service
     Loaded: loaded (/etc/systemd/system/gluster_auto_run_vm.service; enabled; preset: enabled)
     Active: active (exited) since Wed 2024-09-18 10:18:41 CEST; 36s ago
    Process: 8780 ExecStart=/bin/bash -c /usr/local/bin/gluster_auto_run_vm (code=exited, status=0/SUCCESS)
   Main PID: 8780 (code=exited, status=0/SUCCESS)
        CPU: 4.985s

Sep 18 10:18:34 c3 systemd[1]: Starting gluster_auto_run_vm.service - Gluster Auto Run VM Service...
Sep 18 10:18:38 c3 bash[8780]: vmid 900 in gluster on node c3 is starting.
Sep 18 10:18:39 c3 pvesh[8824]: <root@pam> starting task UPID:c3:00002296:0001908B:66EA8A87:qmstart:900:root@pam:
Sep 18 10:18:39 c3 pvesh[8854]: start VM 900: UPID:c3:00002296:0001908B:66EA8A87:qmstart:900:root@pam:
Sep 18 10:18:41 c3 pvesh[8824]: <root@pam> end task UPID:c3:00002296:0001908B:66EA8A87:qmstart:900:root@pam: OK
Sep 18 10:18:41 c3 bash[8824]: UPID:c3:00002296:0001908B:66EA8A87:qmstart:900:root@pam:
Sep 18 10:18:41 c3 systemd[1]: Finished gluster_auto_run_vm.service - Gluster Auto Run VM Service.

and :

/usr/local/bin/gluster_auto_heal gfs0 gluster;echo $?
0

:p

By testing the self-heal status of the GlusterFS volume across the entire Proxmox cluster, I will also be able to make sure in the future that HA for the VMs behaves correctly on a GlusterFS volume.
But here too, a built-in condition model would be needed not only for storage but also for HA.

I agree with you: without this, it's heavy, very heavy, and not practical at all. And what an effort it took to arrive at something that finally works.

Long live the Proxmox developers, and long live a built-in startup-condition model for VMs and containers tied to storage, one that also takes the HA features into account, since the two are closely linked...
 
Additional Notes :

self-heal in progress:

Code:
gluster volume heal gfs0 statistics heal-count
Gathering count of entries to be healed on volume gfs0 has been successful

Brick <IP NODE 1>:/mnt/distributed/brick0
Number of entries: 3

Brick <IP NODE 2>:/mnt/distributed/brick0
Number of entries: 3

Brick <IP NODE 3>:/mnt/distributed/brick0
Number of entries: 0

Code:
gluster volume heal gfs0 info
Brick <IP NODE 1>:/mnt/distributed/brick0
/images/800/vm-800-disk-0.qcow2 - Possibly undergoing heal
/images/995/vm-995-disk-0.qcow2 - Possibly undergoing heal
/images/998/vm-998-disk-0.qcow2 - Possibly undergoing heal
Status: Connected
Number of entries: 3

Brick <IP NODE 2>:/mnt/distributed/brick0
/images/800/vm-800-disk-0.qcow2 - Possibly undergoing heal
/images/995/vm-995-disk-0.qcow2 - Possibly undergoing heal
/images/998/vm-998-disk-0.qcow2 - Possibly undergoing heal
Status: Connected
Number of entries: 3

Brick <IP NODE 3>:/mnt/distributed/brick0
Status: Connected
Number of entries: 0

self-heal finished :

Code:
gluster volume heal gfs0 statistics heal-count
Gathering count of entries to be healed on volume gfs0 has been successful

Brick <IP NODE 1>:/mnt/distributed/brick0
Number of entries: 0

Brick <IP NODE 2>:/mnt/distributed/brick0
Number of entries: 0

Brick <IP NODE 3>:/mnt/distributed/brick0
Number of entries: 0
 
I am a bit late to the party, but a naive question ...

You have a oneshot service, wanted by multi-user.target, that checks every 10 seconds (failing and being restarted) whether gluster health is OK, until it can finally run. Why didn't you simply let the VMs start at the end of the heal script, since it is oneshot anyway?

In any case, I'm happy you have it working according to your original requirements; I'm just curious whether it could be simplified, or whether this "staged" approach serves another purpose I have overlooked.
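
(A minimal sketch of what I am suggesting, under the assumption that I have understood the setup correctly: a single oneshot script that waits for the heal itself instead of relying on Restart=on-failure, and then starts the tagged VMs.)

Code:
#!/bin/bash
# Hypothetical simplification: block until the volume is mounted and healed,
# then start the VMs in the same run
until /usr/local/bin/gluster_auto_heal gfs0 gluster; do
    sleep 10
done
# ... then start the tagged VMs exactly as gluster_auto_run_vm does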
 
I am a bit late to the party, but a naive question ...

You have a oneshot service, wanted by multi-user.target, that checks every 10 seconds (failing and being restarted) whether gluster health is OK, until it can finally run. Why didn't you simply let the VMs start at the end of the heal script, since it is oneshot anyway?

In any case, I'm happy you have it working according to your original requirements; I'm just curious whether it could be simplified, or whether this "staged" approach serves another purpose I have overlooked.
The translation of what you are telling me into my language is poor. Can you phrase your question differently, @esi_y? :
"Why didn't you simply let the VMs start at the end of the heal script, since it is oneshot anyway?"
 
