[SOLVED] LXC backup has started failing, was working before

lifeboy

I have two LXCs that I can no longer back up. It may have started after a recent minor Proxmox upgrade, but I'm not sure.

If I start a backup, I see:
Code:
INFO: starting new backup job: vzdump 106 --storage cephfs --remove 0 --mode snapshot --node FT1-NodeA --compress zstd
INFO: filesystem type on dumpdir is 'ceph' -using /var/tmp/vzdumptmp3695676_106 for temporary files
INFO: Starting Backup of VM 106 (lxc)
INFO: Backup started at 2021-04-21 15:46:32
INFO: status = running
INFO: CT Name: signup
INFO: including mount point rootfs ('/') in backup
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
/dev/rbd16
INFO: creating vzdump archive '/mnt/pve/cephfs/dump/vzdump-lxc-106-2021_04_21-15_46_32.tar.zst'

When I stop the job from the GUI, I get:
Code:
INFO: cleanup temporary 'vzdump' snapshot
Removing snap: 100% complete...done.
ERROR: Backup of VM 106 failed - command 'set -o pipefail && lxc-usernsexec -m u:0:100000:65536 -m g:0:100000:65536 -- tar cpf - --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs '--xattrs-include=user.*' '--xattrs-include=security.capability' '--warning=no-file-ignored' '--warning=no-xattr-write' --one-file-system '--warning=no-file-ignored' '--directory=/var/tmp/vzdumptmp3695676_106' ./etc/vzdump/pct.conf ./etc/vzdump/pct.fw '--directory=/mnt/vzsnap0' --no-anchored '--exclude=lost+found' --anchored '--exclude=./tmp/?*' '--exclude=./var/tmp/?*' '--exclude=./var/run/?*.pid' ./ | zstd --rsyncable '--threads=1' >/mnt/pve/cephfs/dump/vzdump-lxc-106-2021_04_21-15_46_32.tar.dat' failed: interrupted by signal
INFO: Failed at 2021-04-21 15:47:23
INFO: Backup job finished with errors
TASK ERROR: job errors

I can take a manual RBD snapshot, and I can also create an LXC snapshot with pct snapshot and from the GUI. The backup, however, seems to get stuck right after the snapshot is created, once it starts writing the archive.

/var/log/vzdump/lxc-106.log contains the same info:
Code:
2021-04-21 15:46:32 INFO: Starting Backup of VM 106 (lxc)
2021-04-21 15:46:32 INFO: status = running
2021-04-21 15:46:32 INFO: CT Name: signup
2021-04-21 15:46:32 INFO: including mount point rootfs ('/') in backup
2021-04-21 15:46:32 INFO: backup mode: snapshot
2021-04-21 15:46:32 INFO: ionice priority: 7
2021-04-21 15:46:32 INFO: create storage snapshot 'vzdump'
2021-04-21 15:46:35 INFO: creating vzdump archive '/mnt/pve/cephfs/dump/vzdump-lxc-106-2021_04_21-15_46_32.tar.zst'
2021-04-21 15:47:22 INFO: cleanup temporary 'vzdump' snapshot
2021-04-21 15:47:23 ERROR: Backup of VM 106 failed - command 'set -o pipefail && lxc-usernsexec -m u:0:100000:65536 -m g:0:100000:65536 -- tar cpf - --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs '--xattrs-include=user.*' '--xattrs-include=security.capability' '--warning=no-file-ignored' '--warning=no-xattr-write' --one-file-system '--warning=no-file-ignored' '--directory=/var/tmp/vzdumptmp3695676_106' ./etc/vzdump/pct.conf ./etc/vzdump/pct.fw '--directory=/mnt/vzsnap0' --no-anchored '--exclude=lost+found' --anchored '--exclude=./tmp/?*' '--exclude=./var/tmp/?*' '--exclude=./var/run/?*.pid' ./ | zstd --rsyncable '--threads=1' >/mnt/pve/cephfs/dump/vzdump-lxc-106-2021_04_21-15_46_32.tar.dat' failed: interrupted by signal

Proxmox details:
# pveversion
pve-manager/6.3-4/0a38c56f (running kernel: 5.4.98-1-pve)

All other backups work fine; only two containers fail, and this is one of them.
 
A third container now also fails to back up. I can't find any error messages in the logs... Anyone? Any suspicions?

Here's what this one does when I start a backup:

Code:
1546713 ?        Ss     0:00 task UPID:FT1-NodeD:001799D9:00E2A8A3:60956940:vzdump:133:roland@pve:
1546794 ?        S      0:00 /bin/bash -c set -o pipefail && lxc-usernsexec -m u:0:100000:65536 -m g:0:100000:65536 -- tar cpf - --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs '--xattrs-include=user.*' '--xattrs-include=security.capability' '--warning=no-file-ignored' '--warning=no-xattr-write' --one-file-system '--warning=no-file-ignored' '--directory=/var/tmp/vzdumptmp1546713_133' ./etc/vzdump/pct.conf ./etc/vzdump/pct.fw '--directory=/mnt/vzsnap0' --no-anchored '--exclude=lost+found' --anchored '--exclude=./tmp/?*' '--exclude=./var/tmp/?*' '--exclude=./var/run/?*.pid' ./ | zstd --rsyncable '--threads=1' >/mnt/pve/cephfs/dump/vzdump-lxc-133-2021_05_07-18_22_24.tar.dat
1546795 ?        S      0:00 lxc-usernsexec -m u:0:100000:65536 -m g:0:100000:65536 -- tar cpf - --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs --xattrs-include=user.* --xattrs-include=security.capability --warning=no-file-ignored --warning=no-xattr-write --one-file-system --warning=no-file-ignored --directory=/var/tmp/vzdumptmp1546713_133 ./etc/vzdump/pct.conf ./etc/vzdump/pct.fw --directory=/mnt/vzsnap0 --no-anchored --exclude=lost+found --anchored --exclude=./tmp/?* --exclude=./var/tmp/?* --exclude=./var/run/?*.pid ./
1546797 ?        D      0:04 tar cpf - --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs --xattrs-include=user.* --xattrs-include=security.capability --warning=no-file-ignored --warning=no-xattr-write --one-file-system --warning=no-file-ignored --directory=/var/tmp/vzdumptmp1546713_133 ./etc/vzdump/pct.conf ./etc/vzdump/pct.fw --directory=/mnt/vzsnap0 --no-anchored --exclude=lost+found --anchored --exclude=./tmp/?* --exclude=./var/tmp/?* --exclude=./var/run/?*.pid ./
1550411 pts/2    S+     0:00 grep vzdump

Code:
# cat /var/log/vzdump/lxc-133.log 
2021-05-07 18:22:24 INFO: Starting Backup of VM 133 (lxc)
2021-05-07 18:22:24 INFO: status = running
2021-05-07 18:22:24 INFO: CT Name: support
2021-05-07 18:22:24 INFO: including mount point rootfs ('/') in backup
2021-05-07 18:22:24 INFO: backup mode: snapshot
2021-05-07 18:22:24 INFO: ionice priority: 7
2021-05-07 18:22:24 INFO: create storage snapshot 'vzdump'
2021-05-07 18:22:26 INFO: creating vzdump archive '/mnt/pve/cephfs/dump/vzdump-lxc-133-2021_05_07-18_22_24.tar.zst'

No errors or any other indication as to why the process doesn't continue...
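
One way to get more detail when a job sits silently like this: the ps listing above shows tar in state D (uninterruptible sleep, usually waiting on I/O in the kernel). Below is a minimal sketch that filters such processes out of ps output. It assumes procps ps and awk on the host, and list_dstate is just a made-up helper name, not a Proxmox tool.

```bash
# Sketch: keep only processes in uninterruptible sleep ("D" in the STAT
# column), like the tar process in the listing above. Expects input shaped
# like `ps -eo pid,stat,wchan:32,args`, i.e. STAT is the second field.
# list_dstate is a hypothetical helper name.
list_dstate() {
  # Skip the header row; print matching lines unchanged.
  awk 'NR > 1 && $2 ~ /^D/'
}

# Usage on the host:
#   ps -eo pid,stat,wchan:32,args | list_dstate
# For a PID it reports, the kernel stack (as root) hints at what it waits on:
#   cat /proc/<pid>/stack
```

If tar stays in D state for a long time with the same wchan, that would suggest the job is blocked in the kernel rather than failing, which would also explain why vzdump logs nothing.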
 
Some more digging done...

In /mnt/pve/cephfs/dump I find the backup file:

Code:
-rw-r--r-- 1 root root 3.4G May  7 18:38 /mnt/pve/cephfs/dump/vzdump-lxc-133-2021_05_07-18_22_24.tar.dat

and it is actually being written, but there's no progress indicator or any other feedback like there normally is while a backup is running.

Why would that be? Did a recent update change the feedback/progress indicators?
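
To tell a slow backup from a stalled one, you can sample the archive size twice and compute how fast it grows. A rough sketch, assuming GNU coreutils stat (for -c %s); growth_rate is a hypothetical helper name.

```bash
# Sketch: sample the size of a growing file twice and print the growth rate
# in bytes per second, to tell a slow-but-working backup from a stalled one.
# Assumes GNU stat; growth_rate is a hypothetical helper name.
growth_rate() (
  file=$1
  interval=${2:-10}          # seconds between the two samples
  before=$(stat -c %s -- "$file") || exit 1
  sleep "$interval"
  after=$(stat -c %s -- "$file") || exit 1
  # Integer bytes per second over the sampling interval.
  echo $(( (after - before) / interval ))
)

# Usage on the host, e.g.:
#   growth_rate /mnt/pve/cephfs/dump/vzdump-lxc-133-2021_05_07-18_22_24.tar.dat 30
```

A steady non-zero rate means tar is still working even though the task log shows nothing; a rate of 0 over several samples points to a real stall.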

Code:
# pveversion
pve-manager/6.4-4/337d6701 (running kernel: 5.4.106-1-pve)
 
Code:
INFO: starting new backup job: vzdump 133 --node FT1-NodeD --storage cephfs --mode snapshot --remove 0 --compress zstd
INFO: filesystem type on dumpdir is 'ceph' -using /var/tmp/vzdumptmp1546713_133 for temporary files
INFO: Starting Backup of VM 133 (lxc)
INFO: Backup started at 2021-05-07 18:22:24
INFO: status = running
INFO: CT Name: support
INFO: including mount point rootfs ('/') in backup
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
/dev/rbd2
INFO: creating vzdump archive '/mnt/pve/cephfs/dump/vzdump-lxc-133-2021_05_07-18_22_24.tar.zst'
INFO: Total bytes written: 30207201280 (29GiB, 7.9MiB/s)
INFO: archive file size: 19.87GB
INFO: cleanup temporary 'vzdump' snapshot
Removing snap: 100% complete...done.
INFO: Finished Backup of VM 133 (01:01:34)
INFO: Backup finished at 2021-05-07 19:23:58
INFO: Backup job finished successfully
TASK OK

The job finished after just over an hour (01:01:34). That resolves the issue for two of the guests I could not back up.
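
For perspective, the numbers in that log already explain the long silence: about 30 GB read at roughly 7.9 MiB/s takes around an hour. A small awk calculation of that arithmetic follows; eta_hms is a hypothetical helper name.

```bash
# Sketch: how long should tar take at the rate it reported? Inputs are the
# "Total bytes written" value and the MiB/s figure from the log above.
# eta_hms is a hypothetical helper name; awk does the floating-point math.
eta_hms() {
  awk -v b="$1" -v r="$2" 'BEGIN {
    s = int(b / (r * 1024 * 1024))   # whole seconds at r MiB/s
    printf "%02d:%02d:%02d\n", int(s / 3600), int(s % 3600 / 60), s % 60
  }'
}
```

With the figures from the log, eta_hms 30207201280 7.9 prints 01:00:46, close to the reported runtime of 01:01:34 (tar's rate is rounded, and the total also includes snapshot creation and cleanup).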

However, there's one that just cannot back up. Even if I stop the machine, I still can't make a backup. It simply starts the backup (at least it says so) and then nothing happens... :-(
 
When machine 105 is off, I can make a backup successfully. After all, it's basically a copy of the machine storage volume.

However, when the machine is running, this is what happens:

Code:
INFO: starting new backup job: vzdump 105 --remove 0 --compress zstd --storage cephfs --mode snapshot --node FT1-NodeA
INFO: filesystem type on dumpdir is 'ceph' -using /var/tmp/vzdumptmp3380810_105 for temporary files
INFO: Starting Backup of VM 105 (lxc)
INFO: Backup started at 2021-05-15 17:33:40
INFO: status = running
INFO: CT Name: signupdb
INFO: including mount point rootfs ('/') in backup
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'

When I look in /var/tmp/vzdumptmp3380810_105, there's nothing...


Code:
FT1-NodeA:~# ls -la /var/tmp/vzdumptmp3380810_105/
total 8
drwxr-xr-x  2 root root 4096 May 15 17:33 .
drwxrwxrwt 15 root root 4096 May 15 17:33 ..

None of the logs I could find give any indication of why the process is not proceeding.

The LXC is in a locked state, so I can't connect to it until I stop the backup process.
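
The lock can be read from the container's config while the job hangs. Below is a sketch that assumes the usual PVE config layout (/etc/pve/lxc/<vmid>.conf, current settings first, snapshot sections in [brackets]); ct_lock is a hypothetical helper name. As far as I know, the container is also frozen while the storage snapshot is taken, which would match not being able to connect to it.

```bash
# Sketch: print the "lock:" value of a container config, reading only the
# current settings (everything before the first [snapshot] section).
# Assumes the usual layout of /etc/pve/lxc/<vmid>.conf; ct_lock is a
# hypothetical helper name.
ct_lock() {
  awk '/^\[/ { exit } /^lock:/ { sub(/^lock:[ \t]*/, ""); print; exit }' "$1"
}

# Usage on the host:
#   ct_lock /etc/pve/lxc/105.conf
#   lxc-info -n 105 -s    # shows the state, e.g. RUNNING or FROZEN
# After aborting a hung job, a leftover lock can be cleared with:
#   pct unlock 105
```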

This is what the mounts in the LXC look like:

Code:
# mount
/dev/rbd0 on / type ext4 (rw,relatime,stripe=16)
none on /dev type tmpfs (rw,relatime,size=492k,mode=755)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
proc on /proc/sys/net type proc (rw,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
sysfs on /sys/devices/virtual/net type sysfs (rw,relatime)
sysfs on /sys/devices/virtual/net type sysfs (rw,nosuid,nodev,noexec,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
mqueue on /dev/mqueue type mqueue (rw,relatime)
udev on /dev/fuse type devtmpfs (rw,nosuid,relatime,size=65802644k,nr_inodes=16450661,mode=755)
lxcfs on /proc/cpuinfo type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
lxcfs on /proc/diskstats type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
lxcfs on /proc/loadavg type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
lxcfs on /proc/meminfo type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
lxcfs on /proc/stat type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
lxcfs on /proc/swaps type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
lxcfs on /proc/uptime type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
lxcfs on /sys/devices/system/cpu/online type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
devpts on /dev/lxc/console type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
devpts on /dev/console type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
none on /proc/sys/kernel/random/boot_id type tmpfs (ro,nosuid,nodev,noexec,relatime,size=492k,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666,max=1024)
devpts on /dev/ptmx type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666,max=1024)
devpts on /dev/lxc/tty1 type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666,max=1024)
devpts on /dev/lxc/tty2 type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666,max=1024)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
root@192.168.121.200:/backups/signupdb-pre-prod on /mnt/backup type fuse.sshfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=13166104k,mode=700)

What is happening? Where can I find more details?
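
One thing that stands out in this list is /mnt/backup, a fuse.sshfs mount. As far as I know, the Proxmox VE admin guide (container chapter, FUSE mounts) advises against FUSE mounts inside containers, because snapshot- and suspend-mode backups freeze the container. A sketch for spotting such mounts quickly follows; the list of filesystem types is an assumption, and network_mounts is a hypothetical helper name.

```bash
# Sketch: from `mount` output, list mounts whose type is a network or FUSE
# filesystem, ignoring fuse.lxcfs (present in every container). mount prints
# "SRC on TARGET type TYPE (OPTS)", so TARGET is field 3 and TYPE field 5.
# The type list is an assumption; network_mounts is a hypothetical name.
network_mounts() {
  awk '$5 ~ /^(fuse\.|nfs|cifs|smb3)/ && $5 != "fuse.lxcfs" { print $3, $5 }'
}

# Usage from the host (or run `mount | network_mounts` inside the container):
#   pct exec 105 -- mount | network_mounts
```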
 
Update: when I unmount the network mount at /mnt/backup (fuse.sshfs in the mount list above), the process completes properly.

I have other machines with the exact same mount that back up just fine.

I'm doing some more tests.
 
Finally! Mystery solved!

The LXC was (by mistake) running as a privileged container, and having a network share mounted inside it (the sshfs mount at /mnt/backup) causes the backup to stall without any error message. I converted the container to unprivileged and now the backup runs.

I figured this out by comparing, step by step, a container that backed up successfully with the one that didn't. The only difference was that the failing one was running as privileged.
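
That comparison can be partly scripted. Below is a sketch that reports privileged vs unprivileged and diffs two container configs. It assumes PVE writes "unprivileged: 1" for unprivileged containers (the key is absent or 0 for privileged ones); the ignore list and the function names are my own choices, not part of Proxmox.

```bash
# Sketch: report whether a container config is privileged. Only the current
# config is read (stop at the first [snapshot] section).
# ct_privilege is a hypothetical helper name.
ct_privilege() (
  v=$(awk '/^\[/ { exit } $1 == "unprivileged:" { print $2; exit }' "$1")
  if [ "$v" = "1" ]; then echo unprivileged; else echo privileged; fi
)

# Sketch: diff two configs key by key, ignoring keys that always differ
# between guests (the ignore list is an assumption; adjust it).
# ct_config_diff is a hypothetical helper name; exit status follows diff.
ct_config_diff() (
  ignore='^(#|hostname:|net[0-9]+:|rootfs:|lock:)'
  ta=$(mktemp) && tb=$(mktemp) || exit 2
  # Current config only, blank lines dropped, sorted so order doesn't matter.
  awk '/^\[/ { exit } NF' "$1" | grep -Ev "$ignore" | sort > "$ta"
  awk '/^\[/ { exit } NF' "$2" | grep -Ev "$ignore" | sort > "$tb"
  rc=0
  diff "$ta" "$tb" || rc=$?
  rm -f "$ta" "$tb"
  exit "$rc"
)

# Usage on the host, e.g.:
#   ct_privilege /etc/pve/lxc/105.conf
#   ct_config_diff /etc/pve/lxc/133.conf /etc/pve/lxc/105.conf
```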

I will open a ticket for this so that a warning can be shown in this situation, since it makes the backup process hang silently.