[SOLVED] Problems on shutdown/boots with nfs/glusterfs Filesystems and HA containers / systemd order of services!

Apollon77

Hi,

Some information on my setup: I have a cluster of Intel NUCs with SSDs, and most of the SSD space is a cluster-wide GlusterFS filesystem where all VM and also LXC container images are stored. Because containers cannot be placed on GlusterFS by default, I found ideas here in the forum to create a "directory storage on the GlusterFS mountpoint" and use this for containers, so I did it that way.
I use the most current Proxmox PVE "community" version (just updated yesterday).

Now to my problem:
VMs work just fine, they shut down on server shutdown/reboot and start correctly. No problem here!

But containers sometimes get problems on shutdown and mostly get problems on boot.

Shutdown Problem:
If I read the syslog correctly, the unmount is done "too early" and with that the directory is no longer available ... the containers then end up not being stopped correctly because of I/O errors. To be honest, I did not wait to see how long it takes until they get killed, but a kill -9 of the lxc process (so the very hard way) solved it; not a nice way though.
It cannot be the GlusterFS itself that is gone, because one of the VMs (101) (also located on GlusterFS) stopped successfully ... only the LXC container generates problems (probably that loop0 thing).

Log: see attachment

On boot it is sometimes the other way around and the system tries to start the LXC but the directory is not yet ready ... I need to find a log for that. I was able to work around it by increasing the "restarts" in HA mode, so after it gives up it tries again.

It feels to me that I "just" need to add some systemd dependencies to the correct pve/unmount service to make sure the mounts are unmounted after it and started before it (so a Wants/After flag) ... which would be the correct pve service to do that?
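For illustration, a drop-in on the generated mount unit like the one below is roughly what I have in mind (just an untested sketch; whether these are the right unit names is exactly my question):

Code:
# /etc/systemd/system/mnt-pve-glusterfs.mount.d/override.conf  (sketch, untested)
[Unit]
# Order the mount before the guest/HA services, so on shutdown it is
# unmounted only after those services have been stopped.
Before=pve-guests.service pve-ha-lrm.service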

Could anyone advise what to add where?

Thank you!

Ingo
 

So you mean I stop the container and then start it manually that way? But /tmp is a bad place because it will be cleaned on startup, or?!
 
I hope one day the Proxmox developers will change their decision and support LXC on Gluster again, like it was a long time ago.

But you can try:

Solution 1:

- If you have an "older" version of GlusterFS (5 maybe?), you can enable and use NFS directly with Gluster (the nfs.disable volume option).

Solution 2: if you have a newer version of Gluster, you can install nfs-ganesha (https://download.nfs-ganesha.org) to share the Gluster volume via NFS.

So you can have lxc on it. But for me the performance is bad.
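For Solution 1 the toggle would be something like this (just a sketch; "gv0" is a placeholder for your volume name, and note that newer Gluster releases dropped the built-in NFS server entirely):

Code:
# re-enable the legacy built-in Gluster NFS server (sketch; replace gv0 with your volume name)
gluster volume set gv0 nfs.disable off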
 
Because of performance issues I read about I did not want to use the NFS stuff, and going back to 5.x is also no option because it is EOL.

In general when looking in my syslog I find:

> Jun 25 14:28:15 pm7 systemd[1]: mnt-pve-glusterfs.mount: Succeeded.
Jun 25 14:28:15 pm7 systemd[1]: Unmounted /mnt/pve/glusterfs.
Jun 25 14:28:15 pm7 systemd[1]: mnt-pve-glusterfs2.mount: Succeeded.
Jun 25 14:28:15 pm7 systemd[1]: Unmounted /mnt/pve/glusterfs2.

This is done too early, as I see it in the log, because there may still be block devices active on it (the ones from the LXC), if I understand all of that correctly.

In fact, for containers a systemd service is also created, but the only dependency there is lxc.service (so lxc as a process is there) ... maybe pve-storage.target should be added here too?

Or should it go into lxc.service (that one at least waits for remote-fs.target) ... maybe also there?
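What I mean would look roughly like this for lxc.service (again only a sketch of the idea via "systemctl edit", not something I have tested):

Code:
# systemctl edit lxc.service  -> creates a drop-in such as:
[Unit]
# pull in the PVE storage target and start only after it, so the mounts exist first
Wants=pve-storage.target
After=pve-storage.target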

I also found this: https://stackoverflow.com/questions...ed-before-the-services-are-stopped-in-systemd ... but all these mounts are handled by PVE, kind of ...

When I look at the systemd deps for pve-storage I see


Code:
root@pm1:~# systemctl list-dependencies pve-storage.target
pve-storage.target
● └─remote-fs.target
●   ├─ceph-fuse.target
●   └─nfs-client.target
●     ├─auth-rpcgss-module.service
●     ├─nfs-blkmap.service
●     └─remote-fs-pre.target

No glusterfs in there ... hm ... but maybe that is even a different topic? (glusterd is defined in the pve-storage.target)

When I check the deps of glusterd:

Code:
root@pm1:~# systemctl list-dependencies glusterd.service
glusterd.service
● ├─system.slice
● └─sysinit.target
●   ├─apparmor.service
●   ├─blk-availability.service
●   ├─dev-hugepages.mount
●   ├─dev-mqueue.mount
●   ├─fake-hwclock.service
●   ├─iscsid.service
●   ├─keyboard-setup.service
●   ├─kmod-static-nodes.service
●   ├─lvm2-lvmpolld.socket
●   ├─lvm2-monitor.service
●   ├─open-iscsi.service
●   ├─proc-sys-fs-binfmt_misc.automount
●   ├─pvenetcommit.service
●   ├─sys-fs-fuse-connections.mount
●   ├─sys-kernel-config.mount
●   ├─sys-kernel-debug.mount
●   ├─systemd-ask-password-console.path
●   ├─systemd-binfmt.service
●   ├─systemd-hwdb-update.service
●   ├─systemd-journal-flush.service
●   ├─systemd-journald.service
●   ├─systemd-machine-id-commit.service
●   ├─systemd-modules-load.service
●   ├─systemd-random-seed.service
●   ├─systemd-sysctl.service
●   ├─systemd-sysusers.service
●   ├─systemd-timesyncd.service
●   ├─systemd-tmpfiles-setup-dev.service
●   ├─systemd-tmpfiles-setup.service
●   ├─systemd-udev-trigger.service
●   ├─systemd-udevd.service
●   ├─systemd-update-utmp.service
●   ├─cryptsetup.target
●   ├─local-fs.target
●   │ ├─-.mount
●   │ ├─boot-efi.mount
●   │ ├─gluster-brick1.mount
●   │ ├─systemd-fsck-root.service
●   │ └─systemd-remount-fs.service
●   └─swap.target
●     └─dev-pve-swap.swap

And for the glusterfs mount:

Code:
root@pm1:~# systemctl list-dependencies mnt-pve-glusterfs.mount
mnt-pve-glusterfs.mount
● ├─-.mount
● ├─system.slice
● └─network-online.target
●   └─networking.service

And here the deps of lxc
Code:
root@pm1:~# systemctl list-dependencies lxc
lxc.service
● ├─lxc-net.service
● ├─system.slice
● └─sysinit.target
●   ├─apparmor.service
●   ├─blk-availability.service
●   ├─dev-hugepages.mount
●   ├─dev-mqueue.mount
●   ├─fake-hwclock.service
●   ├─iscsid.service
●   ├─keyboard-setup.service
●   ├─kmod-static-nodes.service
●   ├─lvm2-lvmpolld.socket
●   ├─lvm2-monitor.service
●   ├─open-iscsi.service
●   ├─proc-sys-fs-binfmt_misc.automount
●   ├─pvenetcommit.service
●   ├─sys-fs-fuse-connections.mount
●   ├─sys-kernel-config.mount
●   ├─sys-kernel-debug.mount
●   ├─systemd-ask-password-console.path
●   ├─systemd-binfmt.service
●   ├─systemd-hwdb-update.service
●   ├─systemd-journal-flush.service
●   ├─systemd-journald.service
●   ├─systemd-machine-id-commit.service
●   ├─systemd-modules-load.service
●   ├─systemd-random-seed.service
●   ├─systemd-sysctl.service
●   ├─systemd-sysusers.service
●   ├─systemd-timesyncd.service
●   ├─systemd-tmpfiles-setup-dev.service
●   ├─systemd-tmpfiles-setup.service
●   ├─systemd-udev-trigger.service
●   ├─systemd-udevd.service
●   ├─systemd-update-utmp.service
●   ├─cryptsetup.target
●   ├─local-fs.target
●   │ ├─-.mount
●   │ ├─boot-efi.mount
●   │ ├─gluster-brick1.mount
●   │ ├─systemd-fsck-root.service
●   │ └─systemd-remount-fs.service
●   └─swap.target
●     └─dev-pve-swap.swap
 
I dug deeper and I have a new idea ... The problem is that all HA guests on the system are not stopped by the "stop all guests" call ... so they continue to run and then the filesystem gets killed (or glusterfs gets stopped) ... When you look at the log above you see


Code:
Jun 25 14:28:11 pm7 pve-guests[4227]: <root@pam> starting task UPID:pm7:00001087:0216D4F9:5EF4985B:stopall::root@pam:
Jun 25 14:28:11 pm7 pve-guests[4231]: all VMs and CTs stopped
Jun 25 14:28:11 pm7 pve-guests[4227]: <root@pam> end task UPID:pm7:00001087:0216D4F9:5EF4985B:stopall::root@pam: OK
Jun 25 14:28:11 pm7 systemd[1]: pve-guests.service: Succeeded.
Jun 25 14:28:11 pm7 systemd[1]: Stopped PVE guests.

But there are two HA guests on the system!

Same for a system with a mix of HA and non-HA guests: the non-HA guests are correctly stopped, both HA guests are NOT stopped correctly. Why is that?

In fact the HA guests are stopped by "pve-ha-lrm":

Code:
Jun 25 14:28:14 pm7 pve-ha-lrm[1347]: received signal TERM
Jun 25 14:28:14 pm7 pve-ha-lrm[1347]: got shutdown request with shutdown policy 'conditional'
Jun 25 14:28:14 pm7 pve-ha-lrm[1347]: reboot LRM, stop and freeze all services

So maybe the dependencies here do not work out ...
 
(Attachment "Bildschirmfoto 2020-07-06 um 22.43.12.png" shows a systemd-analyze boot chart)
See also here. This is a systemd-analyze chart ... why are those two mounts done that late in the startup process?
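(For reference, the chart comes from systemd-analyze; something like this should reproduce it and also show the chain for a single mount unit:)

Code:
# render the full boot chart to an SVG
systemd-analyze plot > boot.svg
# show what the mount unit had to wait for
systemd-analyze critical-chain mnt-pve-glusterfs.mount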
 
Do you have any idea?
Not off the top of my head. Unfortunately, it will take some time until I can take a closer look at this.

As a side note, I personally am actually a little skeptical about Gluster at the moment. There seems to be a problem with qemu-img which affects VM migration scenarios, for example. Regarding LXC on GlusterFS it is also relevant that we have experienced bad performance in the past. These are some reasons why ideas like supporting more than 2 Gluster servers are not that high priority for me at the moment.
 
Thank you. I see your points ... never hit them ;-) Yes, having the option to configure more than 2 Gluster servers in the config would be cool (I have a 3-replica system over 7 NUCs and a second 3-way replicated one over 3 NUCs; in fact the scenario like in the ticket is dangerous because it does not provide quorum).

For me, my problem looks more like a general systemd startup order topic in Proxmox which only manifests with the Gluster mounts. The Gluster mounts are defined in Proxmox's storage.cfg and are done very late (and removed very early). It feels to me (but I was not able to verify it because I am too inexperienced with systemd) that these storage mounts are initiated by the pve-guests package ... if that is the case, it does not make sense, because pve-ha-lrm is started BEFORE pve-guests (and so is also stopped before it), and pve-ha-lrm takes care of starting and stopping the HA guests. But as said, these are only assumptions ... (that is all I can do without knowing who mounts all the PVE-defined storage mounts) :-)

I understand your time constraints ... I'm basically out of ideas at the moment :-(
For me Proxmox with GlusterFS works really well and gives me HA the way I want it.
Ceph is way too complicated, and otherwise there is no other "really shared FS" available in Proxmox as far as my latest research shows.

Ingo
 
@Dominic The deeper I think/look into it, the more sure I am that the topic in fact has nothing to do with GlusterFS but with when the storage mounts are executed. When you check my chart in https://forum.proxmox.com/threads/p...s-systemd-order-of-services.71962/post-324061 then my "mnt-pve-apollonnas" NFS mount (ok, a bit covered by one of the red circles) is also done very late.

How did I come to this? A friend of mine has an NFS mount used for/by HA guests and he had comparable issues today! So I checked it on my system and yes ... an NFS mount is also executed with the pve-guests package, but the HA guests (pve-ha-lrm) are started way before (and so the mount will most likely also be removed before pve-ha-lrm is stopped) ... and so we have the issue, in my eyes.
 
wtf.. I'm not alone with this problem.. phhuuuu.. nice....

I did the same as Apollon77 ... I want to mount a NAS folder and after a restart I can't see it..

The statement from Apollon77 is right.. the HA guests are started before.. and I have no NFS mount point in my VM
 
@Dominic I know you are short on time, but it would help me to continue my own research if you could answer this question:
Which process/service is initiating the mount of all the storage locations defined in "storage.cfg"? Is it pve-guests or something else?
 
Some services are defined in pve-manager/services, some others you can find with
Code:
ls /etc/systemd/system/mnt-pve*

Could you please post the relevant lines from the following commands?
Code:
cat /etc/fstab
cat /etc/pve/storage.cfg

Some mount options like _netdev might have an impact, so that should make tracking the problem down easier. Additionally, this might be interesting because I think I am not really able to reproduce your problems. Is your problem only the messages and HA works or does HA not work at all?

Because one single time I got something that looked not so good
Code:
Aug  5 12:30:47 pveA pve-ha-lrm[13809]: shutdown CT 100: UPID:pveA:000035F1:00017630:5F2A8A57:vzshutdown:100:root@pam:
Aug  5 12:30:47 pveA kernel: [  957.973321] blk_update_request: I/O error, dev loop0, sector 4260400 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
Aug  5 12:30:47 pveA kernel: [  957.973400] EXT4-fs warning (device loop0): htree_dirblock_to_tree:997: inode #131113: lblock 0: comm openrc: error -5 reading directory block
Aug  5 12:30:47 pveA kernel: [  957.973484] blk_update_request: I/O error, dev loop0, sector 4260400 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
when I tried to shut down a node. The shutdown also took forever. However, that might also be the result of some previous work.

I tried controlled shutdown and forced stop of the cluster nodes some more times and, as expected, the container (ID 100) restarted on another node every time. Bulk start & stop from the GUI worked flawlessly, too. As a comparison, my config:

Code:
root@pveA:~# cat /etc/fstab | grep glusterfs
192.168.25.135:/gv0 /mnt/glusterfstab glusterfs defaults,_netdev 0 0
root@pveA:~# cat /etc/pve/storage.cfg | grep -A5 gluster
dir: glusterDir
    path /mnt/glusterfstab
    content rootdir
    is_mountpoint yes
    shared 1

root@pveA:~# cat /etc/pve/lxc/100.conf 
(...)
hostname: CT100
ostype: alpine
rootfs: glusterDir:100/vm-100-disk-0.raw,size=8G
root@pveA:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 pveA (local)
         2          1 pveC
         3          1 pveB
root@pveA:~# ha-manager config
ct:100
    state started
 
Hey Dominic and thank you for supporting me on this.

fstab is very normal; the standard stuff and the two GlusterFS bricks in this case:

Code:
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/pve/root / ext4 errors=remount-ro 0 1
UUID=10C4-4D23 /boot/efi vfat defaults 0 1
/dev/pve/swap none swap sw 0 0
proc /proc proc defaults 0 0
/dev/pve/data2 /gluster/brick1 xfs defaults 1 2
/dev/pve/data11 /gluster/brick11 xfs defaults 1 2

The relevant parts of storage.cfg are the NFS share I talked about, the two GlusterFS storages themselves, and the directory mounts to get containers running on them:

Code:
nfs: apollonnas-nfs
export /volume1/proxmox
path /mnt/pve/apollonnas-nfs
server apollonnas.fritz.box
content iso,images,backup,rootdir,vztmpl
maxfiles 3
options vers=3

glusterfs: glusterfs
path /mnt/pve/glusterfs
volume gv0
content images
server 192.168.178.50
server2 192.168.178.94

dir: glusterfs-container
path /mnt/pve/glusterfs
content rootdir
shared 1

glusterfs: glusterfs2
path /mnt/pve/glusterfs2
volume gv1
content images
server 192.168.178.50
server2 192.168.178.95

dir: glusterfs2-container
path /mnt/pve/glusterfs2
content rootdir
shared 1

ls /etc/systemd/system/mnt-pve*
is not giving me anything ... yes, the mnt-pve-* mount units are there (and I can use systemctl status to get their status), but only these three exist:

mnt-pve-apollonnas\x2dnfs.mount
mnt-pve-glusterfs2.mount
mnt-pve-glusterfs.mount

Code:
root@pm3:~# systemctl status mnt-pve-glusterfs.mount
● mnt-pve-glusterfs.mount - /mnt/pve/glusterfs
   Loaded: loaded (/proc/self/mountinfo)
   Active: active (mounted) since Sat 2020-07-18 18:13:55 CEST; 2 weeks 3 days ago
    Where: /mnt/pve/glusterfs
     What: 192.168.178.50:gv0
    Tasks: 0 (limit: 4915)
   Memory: 0B
   CGroup: /system.slice/mnt-pve-glusterfs.mount


... I think these mount units are also automatically created by systemd as soon as the mount is done.
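To see what ordering such a generated mount unit actually ended up with, something like this should show it:

Code:
# show the ordering/dependency properties systemd recorded for the generated mount unit
systemctl show -p After,Before,Wants,Requires mnt-pve-glusterfs.mount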

For me the mounts then show up as follows when I call "mount":

Code:
192.168.178.50:gv0 on /mnt/pve/glusterfs type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
192.168.178.50:gv1 on /mnt/pve/glusterfs2 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
apollonnas.fritz.box:/volume1/proxmox on /mnt/pve/apollonnas-nfs type nfs (rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.178.23,mountvers=3,mountport=892,mountproto=udp,local_lock=none,addr=192.168.178.23)


And here it does not matter whether I place a "HA container" on Gluster or on the NFS mount. As you can see in the one system chart above, the mounts happen very late and the HA manager already tries to start the container/VM (and the same on shutdown: they seem to be removed too early, before the HA manager stops the VMs/containers).

As comparison my config:

Code:

root@pveA:~# cat /etc/fstab | grep glusterfs
192.168.25.135:/gv0 /mnt/glusterfstab glusterfs defaults,_netdev 0 0
root@pveA:~# cat /etc/pve/storage.cfg | grep -A5 gluster
dir: glusterDir
path /mnt/glusterfstab
content rootdir
is_mountpoint yes
shared 1

Maybe this is the difference! You seem to have the GlusterFS mounted via fstab, while I have it defined as a Proxmox storage (as per documentation). Maybe this is the root cause ???!!


Is your problem only the messages and HA works or does HA not work at all?
HA itself works completely fine; node fencing, migrations and so on are super and I really like it. As you can read below, it is ONLY a problem on PVE node boot and even more so on node shutdown.

My problem is the following: If I place a VM/container on my NFS (ok, I only did that for testing) or on that special GlusterFS directory mount, everything works well as long as the LXC is not HA. If it is declared as HA, I see the following effects when stopping/rebooting/starting the node:

On PVE node start/boot up:
It seems the HA manager tries to start the LXC too early, because most of the time lxc throws errors that it cannot get the container FS. I fixed this by simply defining retries in HA, so after some minutes it gives up, does the retry and then it works directly. The reason seems to be that the gluster/NFS mount is not there yet in some cases when it already tries to start the container. Interestingly it mainly affects LXC containers and not VMs here. If I define the LXC as not HA but "start on boot" (so, if I understand correctly, the start is managed by the pve-guests process and not the HA manager), then it always works. It seems to be a timing thing on start.
This effect happens in, let's say, 70% of the cases, but here I could work around it with the retries.

On PVE node shutdown:
Here it is the same, just the other way around: It seems that first the non-HA VMs and LXCs are stopped, and this works. Then the HA ones are stopped, but here it happens 100% of the time for my containers that they start throwing FS errors (partly, as you can see in one of the logs, even before they are told to stop). HA VMs sometimes too.
So I understand it that way that the mount is already removed after pve-guests is done, but the HA manager has still not stopped the HA guests.
And then it runs into the effect that the LXCs do not shut down.


Because one single time I got something that looked not so good


Code:

Aug 5 12:30:47 pveA pve-ha-lrm[13809]: shutdown CT 100: UPID:pveA:000035F1:00017630:5F2A8A57:vzshutdown:100:root@pam:
Aug 5 12:30:47 pveA kernel: [ 957.973321] blk_update_request: I/O error, dev loop0, sector 4260400 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
Aug 5 12:30:47 pveA kernel: [ 957.973400] EXT4-fs warning (device loop0): htree_dirblock_to_tree:997: inode #131113: lblock 0: comm openrc: error -5 reading directory block
Aug 5 12:30:47 pveA kernel: [ 957.973484] blk_update_request: I/O error, dev loop0, sector 4260400 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0

when I tried to shut down a node. The shutdown also took forever. However, that might also be the result of some previous work.

Exactly that is how it looks for me on each shutdown for HA LXC containers. And yes, shutdown then takes ages and I want to get rid of that issue. I know how I can fix it when something happens and I'm around, but there is no real option if I'm not around.


Ingo
 
Maybe this is the difference! You seem to have the GlusterFS mounted via fstab, while I have it defined as a Proxmox storage (as per documentation). Maybe this is the root cause ???!!

The exact combination might really be relevant in this case. It should also be helpful to try parameters for the directory storages like is_mountpoint.
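For your setup that would mean something along these lines (just a sketch based on the storage.cfg you posted):

Code:
dir: glusterfs-container
    path /mnt/pve/glusterfs
    content rootdir
    shared 1
    is_mountpoint yes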

Are you using default HA shutdown policy? pvesh get cluster/options --output-format=json-pretty
What could be relevant in addition: Are you performing a shutdown and then only some seconds later boot the node again or do you let at least some (>3) minutes pass between shutdown and boot?


It would be great if you could take a second look at the NFS problems. NFS and shutdown should not give file system errors (and I really could not see any). So maybe we should get that correct first and look at the more complex Gluster/Directory construct afterwards.
 
The exact combination might really be relevant in this case. It should also be helpful to try parameters for the directory storages like is_mountpoint.

What should that parameter do differently?

From my point of view, the idea would be to not have Gluster mounted by PVE but to have it mounted by fstab directly, and so to remove the storage from PVE itself, or?! What effect would this have? I might not have the statistics any longer, but what else?

Are you using default HA shutdown policy? pvesh get cluster/options --output-format=json-pretty

Code:
root@pm3:~# pvesh get cluster/options --output-format=json-pretty
{
   "console" : "html5",
   "keyboard" : "de"
}


What could be relevant in addition: Are you performing a shutdown and then only some seconds later boot the node again or do you let at least some (>3) minutes pass between shutdown and boot?

It is mainly a topic with reboots, because otherwise the VMs are migrated away if I have not stopped them before ... I have not tried this because it is my "production setup" :-)

It would be great if you could take a second look at the NFS problems. NFS and shutdown should not give file system errors (and I really could not see any). So maybe we should get that correct first and look at the more complex Gluster/Directory construct afterwards.

Ok, I will give it a try

It could take some time, I need to see if I get it done before my vacation ;-)

Ingo
 
@Dominic I will hopefully manage to try NFS over the weekend. But one question: How can I mount GlusterFS directly in fstab, without doing it via a Proxmox storage, but still have it available for VMs as usual, using the better-performing gluster library and so on?
 
What should that parameter do differently?
It is about the availability of the directory storage. The exact effects are in the pvesm man page.

From my point of view, the idea would be to not have Gluster mounted by PVE but to have it mounted by fstab directly, and so to remove the storage from PVE itself, or?! What effect would this have? I might not have the statistics any longer, but what else?

It might have an effect on the mounting order, but I haven't checked this exactly yet.


There are detailed instructions in the Gluster documentation. Like Proxmox VE, the example uses "glusterfs" as the "fstype" for mount.
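As a rough illustration, such an fstab entry could look like this (servers and volume are taken from the storage.cfg above; the mount point and options are only an example, see the Gluster docs for the details):

Code:
# mount the Gluster volume directly via fstab (example values, adjust to your setup)
192.168.178.50:/gv0  /mnt/glusterfs  glusterfs  defaults,_netdev,backup-volfile-servers=192.168.178.94  0  0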
 
