Backup stalled/frozen

proxbear · May 17, 2020

I'm running a CT that mounts a CIFS share and an SSHFS share (not as mount points, but mounted inside the privileged CT). When the nightly backup runs (mode: snapshot) it stalls; in the morning it says "Config locked (snapshot)" and I can't SSH into the box.

Question-1: Is the problem because of the underlying SSHFS mount?
- UPD: Yes, I've tracked it down to the SSHFS mount: if I mount an SSHFS share, "creating snapshot" hangs forever on the host.

Question-2: As all my mounts always go under /mnt/, is it solved by putting exclude-path: /mnt/ into /etc/vzdump.conf?

Question-3: How to unblock this deadlock without rebooting the whole host?

I've tried:

Code:

root@prox:~# pct unlock 102
trying to acquire lock...
can't lock file '/run/lock/lxc/pve-config-102.lock' - got timeout

Code:

root@prox:~# pct listsnapshot 102
    `-> vzdump                      2020-05-17 00:32:57     vzdump backup snapshot
    `-> current                                             You are here!

Code:

root@prox:~# cat /etc/pve/lxc/102.conf
#mp0%3A /wd1/encrypted,mp=/mnt/wd1
#mp1%3A /wd2/encrypted,mp=/mnt/wd2
arch: amd64
cores: 1
features: mount=cifs
hostname: vm-linuxtasks
lock: snapshot
memory: 1024
mp0: /wd1/encrypted,mp=/mnt/wd1
mp1: /wd2/encrypted,mp=/mnt/wd2
mp2: /wd3/encrypted,mp=/mnt/wd3
mp3: /data8x3/encrypted,mp=/mnt/data8x3
net0: name=eth0,bridge=vmbr0,firewall=1,gw=172.16.77.2,hwaddr=1A:2F:CB:57:43:FC,ip=172.16.77.30/24,type=veth
ostype: ubuntu
rootfs: encrypted-zfs:subvol-102-disk-4,size=50G
swap: 0

[vzdump]
#vzdump backup snapshot
arch: amd64
cores: 1
features: mount=cifs
hostname: vm-linuxtasks
memory: 1024
mp0: /wd1/encrypted,mp=/mnt/wd1
mp1: /wd2/encrypted,mp=/mnt/wd2
mp2: /wd3/encrypted,mp=/mnt/wd3
mp3: /data8x3/encrypted,mp=/mnt/data8x3
net0: name=eth0,bridge=vmbr0,firewall=1,gw=172.16.77.2,hwaddr=1A:2F:CB:57:43:FC,ip=172.16.77.30/24,type=veth
ostype: ubuntu
rootfs: encrypted-zfs:subvol-102-disk-4,size=50G
snapstate: prepare
snaptime: 1589668377
swap: 0
root@prox:~#

Both fuser and lsof hang when I call them: no response, they never return to the console with any output.

Question-4: What does the 'snapshot' actually mean? Is it a ZFS-snapshot or some kind of vzdump thing?
- I don't see any 'vzdump' snapshot on the underlying ZFS at that subvolume.
- pct listsnapshot gives me some kind of 'snapshot' which doesn't seem to relate to any ZFS-snapshot; I can see it only in the 102.conf but nowhere else.

Question-5: Who exactly (which process) is deadlocked? In the ps -ex I see only a general 'vzdump -a', but not a specific blocking process. I could attach gdb to it and look inside, but it's easier to ask I guess ;-).
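For Question-5, one way to narrow down which process is stuck is to list everything in uninterruptible sleep (state D in ps). Such processes generally cannot be killed until their wait completes, which would fit the fuser/lsof behaviour above. A minimal sketch, assuming procps ps and awk are available on the host; the function name filter_d_state is made up for illustration:

```shell
# Keep the header line plus every process whose STAT column starts with "D"
# (uninterruptible sleep). Expects `ps -eo pid,stat,wchan:32,args` on stdin.
# filter_d_state is a hypothetical helper name, not a Proxmox tool.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Intended use on the host:
#   ps -eo pid,stat,wchan:32,args | filter_d_state
```

The WCHAN column shows the kernel function the process is sleeping in, which may hint at whether the wait is inside FUSE or somewhere else.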

UPD: I had to reboot the host in the end, as I was not able to kill the vzdump "create storage snapshot" step. The added exclude-path didn't help either, as taking the snapshot ignores it; I guess it only applies later, to the backup's file enumeration. Today the "create storage" log entry is again the very last line: the server hung once more. This time I killed the task forked from the perl vzdump process, which let the backup routine continue, but CT 102 remains totally unresponsive: I can't stop it (lxc-stop 102), the command blocks and never returns.

I've tried to kill every task related to CT 102: this took CT 102 offline in the GUI, but I was not able to start it anymore:
Script exec /usr/share/lxc/hooks/lxc-pve-prestart-hook 102 lxc pre-start produced output: failed to remove directory '/sys/fs/cgroup/systemd/lxc/102/ns/user.slice/user-0.slice/session-21.scope': Device or resource busy

I'll most probably have to reboot once again. Very ugly, and it gives a bad feeling about the whole thing.

Proxmox should definitely realize that it has been taking a snapshot forever with no return, warn me, allow me to abort, and have some obvious setting to exclude anything 'dangerous' from snapshotting. As it is, the backup runs, halts on taking the snapshot and I see no way out.

Question-6: Do users paying for a subscription get support/responses faster? What about weekends? Is there a difference between asking as a free and as a paying user? I need this only at home and only privately, but it drives me crazy anyway to wait the whole weekend for a response, as the weekend is when one actually has time for this private stuff!

Thank you very much

wolfgang · May 20, 2020

Hi,

Q1:
A mount inside a container is not recommended. It is better to mount it on the host and use bind mounts instead.
Q2:
I guess no.
Q3:
If a CT crashes due to deployments, there is usually no other way than to restart the host to restore the CT.
The reason for this is that the namespace cannot be cleaned and therefore you cannot start the CT.
Q4:
ZFS snapshots are filesystem snapshots that vzdump uses for snapshot backups.
To view the snapshot you must use "zfs list -t all"
Q6:
see here https://www.proxmox.com/en/downloads/item/proxmox-ve-subscription-agreement
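For Q4, a rough sketch of how to check whether a "vzdump" snapshot actually exists on ZFS for a given CT. It assumes the usual subvol-<ctid>-disk-<n> dataset naming seen in the config above; the function name ct_vzdump_snapshots and the pool path tank/enc in the comment are made up for illustration:

```shell
# Print the snapshots named "vzdump" that belong to one CT's subvolumes.
# Expects `zfs list -H -t snapshot -o name` output on stdin.
# ct_vzdump_snapshots is a hypothetical helper name.
ct_vzdump_snapshots() {
    ctid=$1
    # Matches e.g. tank/enc/subvol-102-disk-4@vzdump (tank/enc is a placeholder)
    grep -E "(^|/)subvol-${ctid}-disk-[0-9]+@vzdump\$"
}

# Intended use on the host:
#   zfs list -H -t snapshot -o name | ct_vzdump_snapshots 102
```

If this prints nothing while pct listsnapshot still shows a vzdump entry with snapstate: prepare, the snapshot presumably never got past the preparation step.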

proxbear · May 20, 2020

Thank you!

wolfgang said:
Q1:
A mount inside a container is not recommended. It is better to mount it on the host and use bind mounts instead.

Hm, the issue here is that Proxmox does not run on an encrypted ZFS (booting from encrypted ZFS is a big ugly hack-around), so I try to avoid host-mounted filesystems: if the machine is physically taken away, all the secured stuff (scripts, mounts..) is readable from outside. That's why I mount from the secured CTs/VMs.

So to understand it better: is the deadlock when snapshotting a CT with an internal mount a known/accepted bug? From an architectural point of view, taking a snapshot from outside must not result in a host deadlock whose only solution is rebooting the host. It should not be possible to break the host by making a mount inside an unprivileged CT, should it?

Thank you

wolfgang · May 20, 2020

sshfs is not an encrypted FS; only the transport is encrypted.
Containers use the same Kernel as the host.
The root of the host is always capable to read all data from a container.
It does not matter where the mount is.
If you want that, you must use a VM instead.

proxbear · May 20, 2020

I have to be more precise:
- My CTs are better protected, because if the host is physically stolen, the CTs are not accessible anymore: they live in an encrypted ZFS volume, which has to be mounted manually after each host reboot
- Root can only read my CTs once I have mounted the encrypted ZFS; if my machine is physically taken away, root cannot access the CTs, as their volumes are not unlocked
- It does matter where the mount is, because when mounting automatically (fstab or script), the credentials are exposed in the filesystem; in the case of host mounting, the credentials are not encrypted and hence readable by the person with physical access; in the case of LXC, the whole volume is not accessible until I unlock the ZFS manually

Moreover, when there are a lot of mounts, it gets somewhat ugly to have everything mounted by the host:
- After a host reinstall, the backups of all the CTs don't contain the mount definitions
- The host must mount everything right after the reboot, instead of having the CT mount it itself when needed
- Separation of concerns: the consumer is the CT and not the host itself; the host can do nothing with the mount, it's purely for the CT's use cases
- Segregation of duties: the host is responsible for mounting the CT, the CT is responsible for mounting/using the mount point

But I've got the message anyway, I'll refactor it to have the host mount everything itself, although it's far from ideal. I'll put another script and separate credential files onto the encrypted ZFS, so the initial script will unlock the volume and create the mounts from the secured area.
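Since the CT config above already uses mp0 to mp3, a host-mounted share would need the next free mpN slot for its bind mount. A small sketch of finding that index from the config text; the function name next_free_mp is made up, and it only looks at the current config, stopping at the first [section] header where the snapshot copies begin:

```shell
# Print the lowest unused mpN index in a PVE container config read on stdin.
# next_free_mp is a hypothetical helper name, not part of pct.
next_free_mp() {
    awk '
        /^\[/ { exit }                # snapshot section reached, stop parsing
        /^mp[0-9]+:/ {                # e.g. "mp2: /wd3/encrypted,mp=/mnt/wd3"
            n = $1
            sub(/^mp/, "", n)
            sub(/:$/, "", n)
            used[n + 0] = 1
        }
        END { i = 0; while (i in used) i++; print i }
    '
}

# Intended use on the host:
#   next_free_mp < /etc/pve/lxc/102.conf
```

With the index known, the bind mount itself would be added with something like pct set 102 -mp4 /mnt/host-sshfs,mp=/mnt/remote, where both paths are placeholders and not taken from this setup.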

Anyway, the point that vzdump enters a deadlock when taking a snapshot with sshfs mounted (a FUSE problem?) is a much more general issue, I guess. I can take a snapshot of the filesystem/volume myself with fuse/sshfs mounted, so vzdump should be able to do it as well without requiring a reboot.

Thank you

blackpaw · Apr 18, 2021

Ran into this exact problem myself: a privileged CT hangs on the backup snapshot because of the fuse mount. It works fine if the CT is stopped or fuse support is disabled.

I'd really rather not mount on the server. Apart from the uid/gid hassles of bind mounts, I use containers so that I don't have stuff installed on the server; I like to keep my host servers pure.

Oddly, the unprivileged CT with a fuse mount backs up fine.

Is there a solution for this apart from host mounts?
