Ceph RBD snapshot hangs when ceph-fuse mount present in LXC

zdude

I have been trying to get a backup solution configured for my system. The system is a hyperconverged Ceph cluster with unprivileged LXC containers, and the container root filesystems are stored on a Ceph RBD pool. Inside some containers, a CephFS filesystem is mounted via ceph-fuse.

If I take a snapshot of a container that does not have a CephFS fuse mount present, it always works correctly.
If I take a snapshot of a container with a CephFS fuse mount present and no files open from the mount, it works correctly (this is partly a theory, based on when/how the failures occur).
If I take a snapshot of a container with a CephFS fuse mount present and at least one file open, it seems to always fail.


Once a snapshot attempt has failed, some process appears to be stuck in a locked state and the host requires a reboot. After rebooting, the Proxmox GUI shows that a snapshot exists but it is not a parent of "NOW". When examining the RBD image directly, there is no snapshot present on the image.
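
The direct check I mention was something along these lines (pool name and container ID here are just example values):

Code:
# example values: pool "rbd", container 101 - adjust to your setup
rbd ls -p rbd
# list snapshots on the container's root disk image (nothing was listed)
rbd snap ls rbd/vm-101-disk-0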

The only thing I can find that might be relevant is an error in syslog indicating that my Ceph admin keyring is not present (followed by the key error), even though it definitely is. It almost looks as if a non-root process is trying to read the Ceph admin keyring, but I can't find any such process.

Anybody seen something like this before and if so, any suggestions to get around it?
 
Copying my update from elsewhere:

I have been able to get a little more detail and isolate what is causing the problem.

The sequence of events that causes a PVE hang appears to be the following:

An LXC container with a FUSE mount (of any kind) inside the container; any root filesystem storage type I tested is affected.
A process writing to a file inside the FUSE mount in the container (I have just been using fio in my test env; a rough example is below). Reads don't seem to cause any problems.
A snapshot taken of the LXC container's root disk.
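
For reference, the write load in my test env was roughly the following; the mount path, job parameters and container ID are just example values:

Code:
# run inside the container, writing into the ceph-fuse mount
fio --name=writeload --directory=/mnt/cephfs --rw=write --bs=1M --size=1G --time_based --runtime=600
# then, while fio is running, take a snapshot of the container root from the host:
pct snapshot 101 test-snap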


I also found this relevant section in Proxmox's documentation:

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pct_container_storage

It appears to be a somewhat known problem. Unfortunately, making about 50000 bind mounts on the host is just not feasible for me.
 
It appears to be a somewhat known problem. Unfortunately, making about 50000 bind mounts on the host is just not feasible for me.
That's not what I meant by directory storage. You can create a storage on a mounted cephfs [0] as a directory storage. And you can add your volume for an LXC as a directory [1]. Or just use it as is; then raw images are created and it works through the web UI, out of the box.

If that's not it, then I don't understand your use case. :confused:

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#storage_directory
[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_storage_backed_mount_points
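
As a rough sketch of what that looks like (storage name, paths, container ID and size are just placeholders):

Code:
# with the CephFS already mounted on the host, register the path as a directory storage
pvesm add dir cephfs-dir --path /mnt/pve/cephfs --content rootdir,images
# allocate a mount point volume for the container on that storage (8 GB here)
pct set 101 -mp0 cephfs-dir:8,mp=/data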
 
If I take a snapshot of a container with a CephFS fuse mount present and at least one file open, it seems to always fail.
This isn't cephfs specific; it's endemic to any mounts inside of a container. If you must operate this way, make sure to mount your external file system in a specific location ONLY (e.g. /mnt) and exclude it from backup (the --exclude-path directive in your backup job config).
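
Something along these lines (container ID and mount path are examples):

Code:
# keep the external mount under one path only inside the container (e.g. /mnt),
# then exclude that path from the backup
vzdump 101 --mode snapshot --exclude-path '/mnt/?*'
# the same exclude-path can be set in the backup job config or /etc/vzdump.conf for scheduled backups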
 
This isn't cephfs specific; it's endemic to any mounts inside of a container. If you must operate this way, make sure to mount your external file system in a specific location ONLY (e.g. /mnt) and exclude it from backup (the --exclude-path directive in your backup job config).

Following the configuration in this post still doesn't work. The backup fails at the snapshot step; the --exclude-path directive doesn't appear to be applied until the actual dump of the output, after the snapshot is taken.

https://forum.proxmox.com/threads/p...irectories-from-lxc-backups.64241/post-636511


That's not what I meant by directory storage. You can create a storage on a mounted cephfs [0] as a directory storage. And you can add your volume for an LXC as a directory [1]. Or just use it as is; then raw images are created and it works through the web UI, out of the box.

If that's not it, then I don't understand your use case. :confused:

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#storage_directory
[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_storage_backed_mount_points

My use case is a large number of Docker containers (722 as of right now) within the LXC container, each with its own granular permissions to a CephFS file system. Configured this way, each Docker container is able to mount and manage its own fuse mount independently. It just isn't feasible to manage hundreds of mount points from the host and pass them through to the individual containers inside LXC.
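
(For context, each of those host-side pass-throughs would be a bind mount roughly like the one below; container ID and paths are placeholders. That is what doesn't scale to hundreds of application-specific paths.)

Code:
# bind-mount one CephFS subdirectory from the host into the container
pct set 101 -mp0 /mnt/pve/cephfs/app1,mp=/srv/app1
# ...repeated for every application / permission boundary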

I realize that it is recommended to run Docker inside a VM; however, I have found that sharing a single kernel between LXC containers reduces overall memory usage significantly. Running inside a VM required an additional 115 GB of memory in testing.


My current solution to this challenge is to do a stop-mode backup during the lowest-traffic time of the week. It has been working fairly well so far; the only problem appears to be a shutdown sequencing error that leaves a Docker container's mount orphaned, preventing a restart of the LXC container.
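
The weekly job boils down to something like this (container ID and storage name are examples; I schedule it in the low-traffic window):

Code:
# stop-mode backup: the container is shut down, backed up, then started again
vzdump 101 --mode stop --storage backup-store --compress zstd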
 
My use case is a large number of Docker containers (722 as of right now) within the LXC container, each with its own granular permissions to a CephFS file system. Configured this way, each Docker container is able to mount and manage its own fuse mount independently. It just isn't feasible to manage hundreds of mount points from the host and pass them through to the individual containers inside LXC.
The ID mapping for unprivileged containers is indeed a problem in this case. And yes, running Docker in VMs is recommended. Did you ever try KSM to reduce the memory footprint? We use that even in VMs for our handful of Docker containers.
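
To check whether KSM is actually merging pages on the host, something like this gives a quick picture (ksmtuned is the service controlling it on PVE):

Code:
# is KSM enabled, and how many pages are currently being shared/merged?
cat /sys/kernel/mm/ksm/run
cat /sys/kernel/mm/ksm/pages_sharing
# ksmtuned activates KSM based on memory pressure
systemctl status ksmtuned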

My current solution to this challenge is to do a stop-mode backup during the lowest-traffic time of the week. It has been working fairly well so far; the only problem appears to be a shutdown sequencing error that leaves a Docker container's mount orphaned, preventing a restart of the LXC container.
That's the safest way of doing backups anyway. :)
 
