I'm running a CT that mounts a CIFS share and an SSHFS share (not as mount points, but mounted inside the privileged CT). When the nightly backup runs (mode: snapshot), it stalls; in the morning it says "Config locked (snapshot)" and I can't SSH into the box.
Question-1: Is the problem because of the underlying SSHFS mount?
- UPD: Yes, I've tracked it down to the SSHFS mount: as soon as an SSHFS share is mounted, the "creating snapshot" step hangs forever on the host.
Question-2: As all my mounts always go under /mnt/, would putting exclude-path: /mnt/ into /etc/vzdump.conf solve it?
Question-3: How can I resolve this deadlock without rebooting the whole host?
I've tried:
Code:
root@prox:~# pct unlock 102
trying to acquire lock...
can't lock file '/run/lock/lxc/pve-config-102.lock' - got timeout
Code:
root@prox:~# pct listsnapshot 102
`-> vzdump 2020-05-17 00:32:57 vzdump backup snapshot
`-> current You are here!
Code:
root@prox:~# cat /etc/pve/lxc/102.conf
#mp0%3A /wd1/encrypted,mp=/mnt/wd1
#mp1%3A /wd2/encrypted,mp=/mnt/wd2
arch: amd64
cores: 1
features: mount=cifs
hostname: vm-linuxtasks
lock: snapshot
memory: 1024
mp0: /wd1/encrypted,mp=/mnt/wd1
mp1: /wd2/encrypted,mp=/mnt/wd2
mp2: /wd3/encrypted,mp=/mnt/wd3
mp3: /data8x3/encrypted,mp=/mnt/data8x3
net0: name=eth0,bridge=vmbr0,firewall=1,gw=172.16.77.2,hwaddr=1A:2F:CB:57:43:FC,ip=172.16.77.30/24,type=veth
ostype: ubuntu
rootfs: encrypted-zfs:subvol-102-disk-4,size=50G
swap: 0
[vzdump]
#vzdump backup snapshot
arch: amd64
cores: 1
features: mount=cifs
hostname: vm-linuxtasks
memory: 1024
mp0: /wd1/encrypted,mp=/mnt/wd1
mp1: /wd2/encrypted,mp=/mnt/wd2
mp2: /wd3/encrypted,mp=/mnt/wd3
mp3: /data8x3/encrypted,mp=/mnt/data8x3
net0: name=eth0,bridge=vmbr0,firewall=1,gw=172.16.77.2,hwaddr=1A:2F:CB:57:43:FC,ip=172.16.77.30/24,type=veth
ostype: ubuntu
rootfs: encrypted-zfs:subvol-102-disk-4,size=50G
snapstate: prepare
snaptime: 1589668377
swap: 0
root@prox:~#
Both fuser and lsof hang when called: they produce no output and never return to the console.
Question-4: What does the 'snapshot' actually mean? Is it a ZFS-snapshot or some kind of vzdump thing?
- I don't see any 'vzdump' snapshot on the underlying ZFS for that subvolume.
- pct listsnapshot gives me some kind of 'snapshot' which doesn't seem to relate to any ZFS-snapshot; I can see it only in the 102.conf but nowhere else.
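A minimal sketch of one way to run that check on the host, assuming the `zfs` CLI is available. The helper name `snapshots_of` is made up for illustration; it matches on the dataset's last path component, so no pool name has to be guessed.

```shell
# Hypothetical helper: keep only snapshot lines that belong to the given
# dataset, matched by its last path component (e.g. "subvol-102-disk-4").
# Expects tab-separated input as produced by "zfs list -H".
snapshots_of() {
    awk -v ds="$1" -F'\t' 'index($1, "/" ds "@") > 0'
}

# Usage on the host (requires the zfs CLI):
#   zfs list -H -t snapshot -o name,creation | snapshots_of subvol-102-disk-4
```

If this prints nothing, no ZFS snapshot exists for that subvolume, regardless of what pct listsnapshot shows.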
Question-5: Who exactly (which process) is deadlocked? In the output of ps -ex I see only a generic 'vzdump -a', but no specific blocking process. I could attach gdb to it and look inside, but it's easier to ask I guess ;-).
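A minimal sketch of how the blocked tasks could be listed without touching the mount itself (which is presumably what makes fuser and lsof hang). It assumes the procps `ps`; `filter_dstate` and `list_dstate` are hypothetical helper names. Tasks in uninterruptible sleep show a state starting with `D`, and the WCHAN column names the kernel function they are waiting in.

```shell
# Hypothetical helper: keep only lines whose second column (STAT) starts
# with "D", i.e. tasks in uninterruptible sleep. The header line is dropped
# as a side effect, because "STAT" does not start with "D".
filter_dstate() {
    awk '$2 ~ /^D/'
}

# List PID, state, kernel wait channel and command line of all such tasks.
list_dstate() {
    ps -eo pid,stat,wchan:32,args | filter_dstate
}

# Usage on the host:
#   list_dstate
```

Whatever shows up here is a candidate for the process holding things up; this only narrows the search and does not prove which task started the deadlock.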
UPD: In the end I had to reboot the host, as I was not able to kill the vzdump task stuck at "create storage snapshot 'vzdump'". The added exclude-path didn't help either, as taking the snapshot ignores it; I guess it only applies later, when the backup enumerates files. Today the "create storage" log entry was again the very last line before the hang. This time I killed the task forked from the Perl vzdump process, which let the backup routine continue, but CT 102 remains totally unresponsive: I can't stop it (lxc-stop 102); the command blocks and never returns.
I then tried killing every task related to CT 102. This made CT 102 show as offline in the GUI, but I could no longer start it:
Script exec /usr/share/lxc/hooks/lxc-pve-prestart-hook 102 lxc pre-start produced output: failed to remove directory '/sys/fs/cgroup/systemd/lxc/102/ns/user.slice/user-0.slice/session-21.scope': Device or resource busy
I'll most probably have to reboot once again. Very ugly; it gives me a bad feeling about the whole thing.
Proxmox should definitely detect that taking a snapshot never returns, warn me, allow me to abort, and offer an obvious setting to exclude anything 'dangerous' from snapshotting. As it is, the backup runs, hangs while taking the snapshot, and I see no way out =>
Question-6: Do users paying for a subscription get support/responses faster? What about weekends? Is there a difference between asking as a free user and as a paying user? I need this only at home and only privately, but it still drives me crazy to wait the whole weekend for a response, as the weekend is exactly when one has time for this private stuff!
Thank you very much