[TUTORIAL] Condition VM start on GlusterFS reachability

The translation into my language of what you are telling me is bad. Can you phrase your question differently? :
"Why didn't you let the VMs simply start at the end of the heal script being oneshot?"

My English was not any better. :D I basically wonder: if you now have 2 services, one that is basically restarted every 10 seconds until it can actually do what it needs to do (start VMs), and the other that is there basically to facilitate that the first can start (gluster heal), then why not simply start the VMs at the end of the gluster heal?

Arguably, it would also start immediately when it can (not saying that the 10 seconds matter), and it would not have systemd keep starting, every 10 seconds, something that cannot run for 10 minutes.
 
My English was not any better. :D I basically wonder: if you now have 2 services, one that is basically restarted every 10 seconds until it can actually do what it needs to do (start VMs), and the other that is there basically to facilitate that the first can start (gluster heal), then why not simply start the VMs at the end of the gluster heal?

Arguably, it would also start immediately when it can (not saying that the 10 seconds matter), and it would not have systemd keep starting, every 10 seconds, something that cannot run for 10 minutes.

Thanks for explaining.

But it's still difficult.

Nevertheless, I think I guessed the meaning of your question:
" and the other is there to basically facilitate that it the first can start (gluster heal)"

When a service fails at time t, it may still have been active for a short moment at time t-1.

If you imagine 2 services, one that tests the gluster and the other that starts the VMs, the second will be called as soon as the first becomes active.

And since we have the behavior described above, this cannot work.

That's why I use only one service. It returns false (exit 1) if the cluster has not finished self-healing and does not move on to the next step:

starting the VMs with the tag "gluster".
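
To give an idea, here is a simplified sketch of that kind of script (the exact heal check and tag handling below are illustrative rather than my production script; the volume name gfs0 is the one used further down in this thread):

Code:
#!/bin/bash
# Simplified sketch of the single-service idea: fail (exit 1) while the volume
# still has entries to heal, otherwise start the VMs carrying the "gluster" tag.

VOLUME=gfs0

if ! HEAL_INFO=$(gluster volume heal "$VOLUME" info); then
    echo "Cannot query heal info for $VOLUME" >&2
    exit 1
fi

# Any brick reporting a non-zero "Number of entries" means healing is still pending.
if echo "$HEAL_INFO" | grep -q 'Number of entries: [1-9]'; then
    echo "$VOLUME is still healing, not starting the VMs yet" >&2
    exit 1
fi

# Start every VM whose config carries the "gluster" tag.
for VMID in $(qm list | awk 'NR>1 {print $1}'); do
    if qm config "$VMID" | grep -q '^tags:.*gluster'; then
        qm start "$VMID"
    fi
done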

If I understood your thinking correctly, this is an excellent question.
Thanks for asking it @esi_y . :-D
 
I was wondering if we could make the HA service wait to start, or condition migrating a VM from one node to another on the self-heal of the GlusterFS cluster being finished...

In short, big and vast development topics...

Or, just as Ceph has been wonderfully integrated into Proxmox VE, make available a Proxmox VE flavor with the same quality of integration, but for GlusterFS this time. For very modest hardware configurations like mine...
 
I was wondering if we could make the HA service wait to start, or condition migrating a VM from one node to another on the self-heal of the GlusterFS cluster being finished...

This is the issue (which I also took up in my linked bug report above): the CRS does not, e.g., allow for conditional rescheduling, i.e. why even attempt to migrate a service to a node where it cannot start? This is not strictly a topic of one specific filesystem; it is not even a filesystem topic per se. Imagine your migration network is up, but that particular node's access to the outside world got severed. Naturally, you would not want such a service to HA-migrate to such a node, where it would provide service to exactly no one.

I really hoped people would go to the Bugzilla report, +1 it and add these scenarios themselves, so that next time the HA stack is updated - and the CRS is in need of a refresh - it will take these situations into consideration. They would still be user-dependent, i.e. one would have to create a script to evaluate the conditions, but it should be modular.

PS On my inquiries above, I still scratch my head over why not to put this into one service that loops and waits, but I have not touched gluster for a while and do not want to ask silly questions, so if I find some extra time to try this myself, I will update it here later. It might be this (mis)understanding of mine regarding gluster rather than a language issue that's causing the confusion. (And trust me, my French would not have helped it, at all. ;))
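
To make the idea concrete, the "loop and wait" variant I had in mind would be roughly this (a sketch only; gfs0 and the heal-info parsing are assumptions taken from the rest of the thread, since as said I have not touched gluster recently):

Code:
#!/bin/bash
# Wait in a loop until the volume reports no pending heal entries,
# then carry on and start the VMs in the same service, instead of
# letting systemd restart a failing oneshot every 10 seconds.
while gluster volume heal gfs0 info | grep -q 'Number of entries: [1-9]'; do
    sleep 10
done
echo "gfs0 healed, safe to start the tagged VMs now"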
 
Additional notes :

In Proxmox VE 8.2.4, the services responsible for high availability (HA) are pve-ha-lrm (Local Resource Manager) and pve-ha-crm (Cluster Resource Manager). These services work together to manage and monitor HA resources in the Proxmox cluster.

  • pve-ha-lrm: This service runs on each node in the cluster and manages local resources.
  • pve-ha-crm: This service runs on the master node of the cluster and coordinates HA actions between the different nodes.
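
For reference, you can check both services on a node and see which node currently holds the CRM master role with:

Code:
systemctl status pve-ha-lrm
systemctl status pve-ha-crm
ha-manager status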
 
pve-ha-crm: This service runs on the master node of the cluster

Just to get technical (I love it): Proxmox does not use a master/slave node configuration, but rather a manager-lock system, so the CRM in fact runs on all nodes, as the docs state:
The cluster resource manager (pve-ha-crm) starts on each node and waits there for the manager lock, which can only be held by one node at a time. The node which successfully acquires the manager lock gets promoted to the CRM master.
 
Just to get technical (I love it): Proxmox does not use a master/slave node configuration, but rather a manager-lock system, so the CRM in fact runs on all nodes, as the docs state:

I think this is twisting the OP's words (and possibly understanding): he never claimed there to be a "master/slave" concept, he used the "master" designation that one literally gets from the CLI output, and there is indeed only one master at any given time (the "slaves" are, if you will, the LRMs). All the other instances of the CRM service (while admittedly running) are dormant, i.e. doing nothing, waiting to stand in if necessary.

And for total completeness, there are also bugs in this approach that no one even acknowledges 6 months in:
https://bugzilla.proxmox.com/show_bug.cgi?id=5243

If it sounds like a nitpick, consider this is only an issue because the CRM needs to be migratory in this "manager lock" system.
 
I think this is twisting the OP's words

I mostly commented because I have seen many who do not realize that Proxmox does not use a master-slave/node system (as opposed to other computer cluster systems that do). All my comments on this forum (& others) are directed to ALL users browsing the posts, so whether or not that was the OP's intention is not relevant; it is a question of what reading his post may imply to, or how it may be understood by, others.

________________________________________________________



Now to nitpick (you started & I just know you love it!):
he never claimed there to be a "master/slave" concept

Oxford dictionary definition of "master":
a man who has people working for him, especially servants or slaves.

There are other definitions there, but all imply being a master over others or something.

So as far as I see it, if the OP states:
This service runs on the master node of the cluster
the definite implication is that this master node has slave node/s within the cluster.
Thus I conclude that my understanding of the OP's post is far from either "twisting the OP's words" or their "understanding".

while admittedly running
Dormant or not, this contradicts the definite implication of the OP's statement:
This service runs on the master node of the cluster
This understanding can make a huge difference to correct system management.

Anyway enjoy your day!
 
I mostly commented because I have seen many who do not realize that Proxmox does not use a master-slave/node system (as opposed to other computer cluster systems that do). All my comments on this forum (& others) are directed to ALL users browsing the posts, so whether or not that was the OP's intention is not relevant; it is a question of what reading his post may imply to, or how it may be understood by, others.

Fair enough, I just felt like we ran into some possible translation hiccups in this thread before and wanted to express that there's nothing fundamentally wrong with his note; it covers the point that only one node actually is the CRM at any given time.

the definite implication is that this master node has slave node/s within the cluster.
Thus I conclude that my understanding of the OP's post is far from either "twisting the OP's words" or their "understanding".

I don't want to speculate where the OP took his note from, but ha-manager status literally designates "master" as the node the CRM is currently running on. It actually, at that point, is the master of the HA stack, so I understand why they use the term.

Dormant or not, this contradicts the definite implication of the OP's statement:

Alright, I take this one, the "master node of the cluster" was the trigger, it's the HA master.

When I look at the docs [1] now:

And you can view the actual HA manager and resource state with:
Code:
# ha-manager status
quorum OK
master node1 (active, Wed Nov 23 11:07:23 2016)
lrm elsa (active, Wed Nov 23 11:07:19 2016)
service vm:100 (node1, started)

So, well, not the best way to convey what you tried (by the docs, not you) either. ;)

This understanding can make a huge difference to correct system management.

Anyway enjoy your day!

Likewise, no hard feelings!

[1] https://pve.proxmox.com/pve-docs-6/ha-manager.1.html
 
The proxmox web interface to declare a glusterfs storage does not yet meet my expectations (Replica of 3).

I don't agree. In the web interface you can only specify two servers for glusterfs, but it's enough, even for a replica 3.

For gluster you only need 1 "frontend" of the three for the initial connection. After that, glusterfs manages itself.

For example I have a gluster volume with 2 replica servers (with data) and 1 server for the arbiter (with no data on it), and for the mount in my fstab on my workstation I only put the IP of the arbiter, and it works!
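
In my fstab that gives a single line along these lines (the IP, volume name and mount point here are just placeholders, not my real values):

Code:
# /etc/fstab on the workstation: only the arbiter's IP is given for the initial connection
<ip arbiter>:/<volume name>  /mnt/gluster  glusterfs  defaults,_netdev  0  0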
 
The proxmox web interface to declare a glusterfs storage does not yet meet my expectations (Replica of 3).

I don't agree. In the web interface you can only specify two servers for glusterfs, but it's enough, even for a replica 3.

For gluster you only need 1 "frontend" of the three for the initial connection. After that, glusterfs manages itself.

For example I have a gluster volume with 2 replica servers (with data) and 1 server for the arbiter (with no data on it), and for the mount in my fstab on my workstation I only put the IP of the arbiter, and it works!
As explained, @Dark26, the integration of glusterfs on the 3 Proxmox nodes is done autonomously (independently of Proxmox if you prefer). I do not use the GlusterFS storage feature you mention, but a simple directory storage (/gluster).

So, that you do not agree is a fact, but before that you need to read what is written. It is a standalone integration of GlusterFS on a Proxmox cluster, so the integration is done via a simple directory storage as previously described.

On all Proxmox nodes:

Code:
apt install curl gpg

mkdir -p /mnt/distributed/brick0 /gluster

curl https://download.gluster.org/pub/gluster/glusterfs/11/rsa.pub | gpg --dearmor > /usr/share/keyrings/glusterfs-archive-keyring.gpg

DEBID=$(grep 'VERSION_ID=' /etc/os-release | cut -d '=' -f 2 | tr -d '"')
DEBVER=$(grep 'VERSION=' /etc/os-release | grep -Eo '[a-z]+')
DEBARCH=$(dpkg --print-architecture)
echo "deb [signed-by=/usr/share/keyrings/glusterfs-archive-keyring.gpg] https://download.gluster.org/pub/gluster/glusterfs/LATEST/Debian/${DEBID}/${DEBARCH}/apt ${DEBVER} main" | tee /etc/apt/sources.list.d/gluster.list

apt update
apt install glusterfs-server
systemctl start glusterd
systemctl enable glusterd

I chose a replicated GlusterFS volume of 3 (replica 3).

Code:
gluster peer probe <ip node proxmox  1>
gluster peer probe <ip node proxmox  2>
gluster peer probe <ip node proxmox  3>
gluster peer status
gluster pool list

gluster volume create gfs0 replica 3 transport tcp  <ip node proxmox  1>:/mnt/distributed/brick0 <ip node proxmox  2>:/mnt/distributed/brick0 <ip node proxmox  3>:/mnt/distributed/brick0
gluster volume start gfs0
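
# Optional check (not part of the original steps): verify that the peers, bricks
# and the new volume look healthy before going further.
gluster volume info gfs0
gluster volume status gfs0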


# My tuning for my infra:

gluster volume set gfs0 cluster.shd-max-threads 4
gluster volume set gfs0 network.ping-timeout 5
gluster vol set gfs0 cluster.heal-timeout 10
gluster volume heal gfs0 enable
gluster volume set gfs0 cluster.quorum-type none
gluster vol set gfs0 cluster.quorum-reads false
# note: this second ping-timeout setting overrides the value of 5 set above
gluster volume set gfs0 network.ping-timeout 30
gluster volume set gfs0 cluster.favorite-child-policy mtime
gluster volume heal gfs0 granular-entry-heal disable
gluster volume set gfs0 cluster.data-self-heal-algorithm diff

gluster volume set gfs0 auth.allow <ip networkA node >.*,<ip networkB node >.*

Create the service:
Code:
cat /etc/systemd/system/glusterfs-mount.service
[Unit]
Description=Mount GlusterFS Volume
After=network-online.target glusterd.service
Wants=network-online.target glusterd.service

[Service]
Type=idle
ExecStart=/bin/mount -t glusterfs localhost:/gfs0 /gluster
RemainAfterExit=yes
# Restart the service only on failure
Restart=on-failure
# Wait 10 seconds before attempting a restart.
RestartSec=10
# Define a 10-minute (600-second) window for counting restart attempts.
StartLimitIntervalSec=600
# Limit the number of restart attempts to 10 within the defined window.
StartLimitBurst=10

[Install]
WantedBy=multi-user.target

Code:
systemctl daemon-reload
systemctl enable glusterfs-mount.service
systemctl start glusterfs-mount.service
Code:
df -h

localhost:/gfs0 419G  4.2G  398G   2% /gluster

Code:
cat /etc/pve/storage.cfg
...
dir: gluster
        path /gluster
        content images
        prune-backups keep-all=1
        shared 1
 
Like I say: what you wrote before the "Create the service:" part is OK, no problem with that.

If you use the glusterfs only with proxmox, on proxmox, I think it's better to let proxmox mount the glusterfs it needs itself. So there is no need to create a service for that, or to mount the filesystem manually on a local directory.

And if proxmox needs to access the storage, the proxmox processes check whether it is mounted, and then launch the mount process on all servers automatically.
 
For example I have this in storage.cfg:

Code:
glusterfs: Gluster_Emmc
        disable
        path /mnt/pve/Gluster_Emmc
        volume Gluster_Emmc
        content backup,iso,vztmpl,images,snippets
        prune-backups keep-all=1
        server 10.10.5.241
        server2 10.10.5.242

It's a replicated glusterfs, replica 3, with 3 servers: 10.10.5.241, 10.10.5.242, 110.10.5.243

only two of them in the conf file.

The only problem is if 241 and 242 are off. But it's not an issue: a replica with only one brick online doesn't work (it needs quorum)... so no problem, nothing works.


and proxmox mounts the glusterfs on /mnt/pve/Gluster_Emmc
 
I think it's better to let proxmox mount the glusterfs it needs itself.
So no.
And even with no more than a single node left, the VMs whose disks are on the gluster still work. So you are not testing what you say.

Yours sincerely.
 
Are you sure that the Gluster storage didn't change to "read only" on the node which is alive/alone?

Imagine you have node 1 on one side which can't access the other two (unplug the LAN network on node 1, for example).

You have one file on your gluster storage: file.txt

On node 1 you put this text inside: toto. On the other side (node 2 + node 3) you put tata in the file.

When you replug node 1, then you have the same file with different content (one with toto and one with tata).

--> then you have split brain: https://docs.gluster.org/en/latest/Administrator-Guide/Split-brain-and-ways-to-deal-with-it/

Normally, the node which is alone passes into read-only mode, so you can't write on it (you can't write toto). So when everybody comes back, there is no split brain (tata for everybody).

Maybe the VM doesn't die, but it can't write data to disk.
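
For reference, this is roughly how you can check for and resolve it from the CLI (volume name gfs0 taken from earlier in the thread; the file path is just an example):

Code:
# list the files currently in split-brain on the volume
gluster volume heal gfs0 info split-brain

# one possible resolution policy: keep the copy with the latest mtime
gluster volume heal gfs0 split-brain latest-mtime /file.txt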
 
No @Dark26, the filesystem of the VMs on the only node left alive is not read-only. Everything is fine. Yours sincerely.
 
I enjoy how a single typo of "1" can cause 2 posts. (zeroun, I know you were being sarcastic).
It is a mistake to read my words as sarcastic. I try to explain things, and when someone goes down a path that is not optimal, I try to explain why. I have tested the integration of GlusterFS with Proxmox for a long time, and I was faced with the problem of integrating a GlusterFS storage managed by Proxmox. So there you go. I try to explain the choices that were made. Far be it from me to be sarcastic. Everyone can make a mistake, including myself. And when someone writes their public IP, the least you can do is warn them. So be careful when communicating your IPs. Security is everyone's business.
 
