LXC Container Backup Suspend Mode exits with rsync error 23 after upgrade to PVE 7.1

Dec 22, 2021
Dear fellows,

this is my very first Proxmox problem.
I upgraded from PVE 6.4 to 7.1, which was all milk and honey and ran fine without any trouble.
But since the upgrade, I cannot back up one specific LXC container in suspend mode; only stop mode works.
The error shown is as follows:

INFO: including mount point rootfs ('/') in backup
INFO: excluding bind mount point mp0 ('/opt/cloud-data') from backup (not a volume)
INFO: mode failure - some volumes do not support snapshots
INFO: trying 'suspend' mode instead
INFO: backup mode: suspend
INFO: ionice priority: 7
INFO: CT Name: nextcloud-vfs
INFO: including mount point rootfs ('/') in backup
INFO: excluding bind mount point mp0 ('/opt/cloud-data') from backup (not a volume)
INFO: starting first sync /proc/2403890/root/ to /var/tmp/vzdumptmp3095719_108
INFO: first sync finished - transferred 19.67G bytes in 215s
INFO: suspending guest
INFO: starting final sync /proc/2403890/root/ to /var/tmp/vzdumptmp3095719_108
INFO: resume vm
INFO: guest is online again after 17 seconds
ERROR: Backup of VM 108 failed - command 'rsync --stats -h -X -A --numeric-ids -aH --delete --no-whole-file --inplace --one-file-system --relative '--exclude=/opt/cloud-data' '--exclude=/tmp/?*' '--exclude=/var/tmp/?*' '--exclude=/var/run/?*.pid' '--exclude=/opt/cloud-data' /proc/2403890/root//./ /var/tmp/vzdumptmp3095719_108' failed: exit code 23

What really drives me crazy is that this exact rsync command completes successfully without any errors if I just issue it on the command line. In addition, both the first and the final sync complete before the error occurs.

The following is the output of pct config <machine id>:

arch: amd64
cores: 16
hostname: nextcloud-vfs
memory: 65536
mp0: /rpool/data/vm-files/cloud-data,mp=/opt/cloud-data
nameserver: 1.1.1.1
net0: name=eth0,bridge=vmbr1,firewall=1,gw=192.168.56.1,gw6=fe80::2:4fff:fe7b:f1a1,hwaddr=4E:BD:69:A5:9D:BD,ip=192.168.56.48/24,ip6=fd16:c367:ccbe:8::48/64,type=veth
onboot: 1
ostype: debian
protection: 1
rootfs: Disk-Images-OS:108/vm-108-disk-0.raw,size=64G
searchdomain: DMZ.lan
startup: order=6,up=120,down=600
swap: 131072
unprivileged: 1
lxc.mount.entry: /dev/random dev/random none bind,ro,create=file 0 0
lxc.mount.entry: /dev/urandom dev/urandom none bind,ro,create=file 0 0
lxc.mount.entry: /dev/random var/spool/postfix/dev/random none bind,ro,create=file 0 0
lxc.mount.entry: /dev/urandom var/spool/postfix/dev/urandom none bind,ro,create=file 0 0
lxc.idmap: u 0 100000 1000
lxc.idmap: g 0 100000 1000
lxc.idmap: u 1000 1000 1
lxc.idmap: g 1000 1000 1
lxc.idmap: u 1001 101001 64535
lxc.idmap: g 1001 101001 64535
Any other LXC container with a similar configuration, except for mp0: and idmap, backs up without problems.
Thanks in advance for your kind help and a Merry Christmas to you all!

Kind regards,

christian
 
you could make the rsync as called by vzdump more verbose (/usr/share/perl5/PVE/VZDump/LXC.pm right at the top - revert with apt install --reinstall libpve-storage-perl) to see what causes the error..
 
Thanks for your kind answer. Hi Fabian. ;-)
Unfortunately I can't seem to figure it out.
I created a log file, and the summary at its end points to "previous errors" that I cannot spot because of the verbosity. A search for "rsync error" only matches that one line at the end of the file.
Could you give me a clue?
All files are processed - rsync doesn't stop on error.

2021/12/22 12:40:39 [2257572] total: matches=51594 hash_hits=215577 false_alarms=10 data=4313545
2021/12/22 12:40:39 [2257572] rsync[2257572] (sender) heap statistics:
2021/12/22 12:40:39 [2257572] arena: 144441344 (bytes from sbrk)
2021/12/22 12:40:39 [2257572] ordblks: 2344 (chunks not in use)
2021/12/22 12:40:39 [2257572] smblks: 1
2021/12/22 12:40:39 [2257572] hblks: 2 (chunks from mmap)
2021/12/22 12:40:39 [2257572] hblkhd: 663552 (bytes from mmap)
2021/12/22 12:40:39 [2257572] allmem: 145104896 (bytes from sbrk + mmap)
2021/12/22 12:40:39 [2257572] usmblks: 0
2021/12/22 12:40:39 [2257572] fsmblks: 96
2021/12/22 12:40:39 [2257572] uordblks: 11467424 (bytes used)
2021/12/22 12:40:39 [2257572] fordblks: 132973920 (bytes free)
2021/12/22 12:40:39 [2257572] keepcost: 131200 (bytes in releasable chunk)
2021/12/22 12:40:39 [2257572] Number of files: 419,099 (reg: 357,489, dir: 54,567, link: 6,990, dev: 2, special: 51)
2021/12/22 12:40:39 [2257572] Number of created files: 2 (reg: 2)
2021/12/22 12:40:39 [2257572] Number of deleted files: 0
2021/12/22 12:40:39 [2257572] Number of regular files transferred: 29
2021/12/22 12:40:39 [2257572] Total file size: 19.85G bytes
2021/12/22 12:40:39 [2257572] Total transferred file size: 328.46M bytes
2021/12/22 12:40:39 [2257572] Literal data: 4.31M bytes
2021/12/22 12:40:39 [2257572] Matched data: 324.15M bytes
2021/12/22 12:40:39 [2257572] File list size: 589.71K
2021/12/22 12:40:39 [2257572] File list generation time: 0.001 seconds
2021/12/22 12:40:39 [2257572] File list transfer time: 0.000 seconds
2021/12/22 12:40:39 [2257572] Total bytes sent: 17.99M
2021/12/22 12:40:39 [2257572] Total bytes received: 1.66M
2021/12/22 12:40:39 [2257572] sent 17.99M bytes received 1.66M bytes 1.27M bytes/sec
2021/12/22 12:40:39 [2257572] total size is 19.85G speedup is 1,010.40
2021/12/22 12:40:39 [2257572] sent 17987575 bytes received 1660534 bytes total size 19852364899
2021/12/22 12:40:39 [2257572] rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]
 
does your log file show the full file list? if so, adding '-q' for quiet should limit the output to just the errors, making the files/dirs causing the problem apparent.
 
I still get the full file list, and I also get it when backing up another container that works.
I changed the options to

my $rsync = ['rsync', '-q', '--log-file=/var/log/rsync2', '--stats', '-h', @xattr, '--numeric-ids',
'-aH', '--delete', '--no-whole-file',
($first ? '--sparse' : '--inplace'),
'--one-file-system', '--relative'];
Isn't there a string indicating errors I could search for?
 
I'm very sorry, I'm too dumb for this, but I don't have a single clue.
I did the following, and it didn't show anything other than what I already posted.

root@pve1:/var/log# grep -v ++++++++ /var/log/rsync2
2021/12/22 14:24:47 [3351842] building file list
2021/12/22 14:24:47 [3351842] .d..t.og... ./
2021/12/22 14:28:03 [3351842] Number of files: 419,110 (reg: 357,500, dir: 54,567, link: 6,990, dev: 2, special: 51)
2021/12/22 14:28:03 [3351842] Number of created files: 419,109 (reg: 357,500, dir: 54,566, link: 6,990, dev: 2, special: 51)
2021/12/22 14:28:03 [3351842] Number of deleted files: 0
2021/12/22 14:28:03 [3351842] Number of regular files transferred: 357,480
2021/12/22 14:28:03 [3351842] Total file size: 19.86G bytes
2021/12/22 14:28:03 [3351842] Total transferred file size: 19.73G bytes
2021/12/22 14:28:03 [3351842] Literal data: 19.73G bytes
2021/12/22 14:28:03 [3351842] Matched data: 0 bytes
2021/12/22 14:28:03 [3351842] File list size: 15.53M
2021/12/22 14:28:03 [3351842] File list generation time: 0.001 seconds
2021/12/22 14:28:03 [3351842] File list transfer time: 0.000 seconds
2021/12/22 14:28:03 [3351842] Total bytes sent: 19.76G
2021/12/22 14:28:03 [3351842] Total bytes received: 7.13M
2021/12/22 14:28:03 [3351842] sent 19.76G bytes received 7.13M bytes 100.60M bytes/sec
2021/12/22 14:28:03 [3351842] total size is 19.86G speedup is 1.00
2021/12/22 14:28:03 [3603830] building file list
2021/12/22 14:28:04 [3603830] >f..t...... etc/gdata/gdicap.cfg
2021/12/22 14:28:04 [3603830] >f..t...... etc/gdata/offline-machine.cfg
2021/12/22 14:28:04 [3603830] >f..t...... etc/gdata/offline-pgmupdate.cfg
2021/12/22 14:28:04 [3603830] >f..t...... etc/gdata/offline-soa.cfg
2021/12/22 14:28:04 [3603830] >f..t...... etc/gdata/offline-sod.cfg
2021/12/22 14:28:04 [3603830] >f..t...... etc/gdata/offline-stable.cfg
2021/12/22 14:28:04 [3603830] >f..t...... etc/gdata/offline-update.cfg
2021/12/22 14:28:04 [3603830] >f..t...... etc/gdata/pgm-update-status.cfg
2021/12/22 14:28:04 [3603830] .d..t...... tmp/
2021/12/22 14:28:57 [3603830] .d..t...... var/lib/elasticsearch/nodes/0/
2021/12/22 14:28:57 [3603830] >f..t...... var/lib/fail2ban/fail2ban.sqlite3
2021/12/22 14:28:57 [3603830] >f..t...... var/lib/mysql/ib_logfile0
2021/12/22 14:28:57 [3603830] >f..t...... var/lib/mysql/ib_logfile1
2021/12/22 14:28:57 [3603830] >f..t...... var/lib/mysql/ibdata1
2021/12/22 14:28:57 [3603830] >f..t...... var/lib/mysql/mysql/innodb_index_stats.ibd
2021/12/22 14:28:57 [3603830] >f..t...... var/lib/mysql/mysql/innodb_table_stats.ibd
2021/12/22 14:28:57 [3603830] >f..t...... var/lib/mysql/nextcloud/oc_authtoken.ibd
2021/12/22 14:28:58 [3603830] >f..t...... var/lib/php/sessions/sess_ljfhtqf6b8m5j9s5sddu23pkdh
2021/12/22 14:28:58 [3603830] >f..t...... var/lib/php/sessions/sess_pvqaof3kc9l2iptit982sfup7g
2021/12/22 14:28:58 [3603830] >f..t...... var/lib/php/sessions/sess_ti6pqs55d0udscu1s46bn5frgq
2021/12/22 14:28:58 [3603830] >f.st...... var/log/daemon.log
2021/12/22 14:28:58 [3603830] >f.st...... var/log/syslog
2021/12/22 14:28:58 [3603830] >f.st...... var/log/apache2/access.log

2021/12/22 14:28:58 [3603830] .d..t...... var/log/gdata/
2021/12/22 14:28:58 [3603830] >f.st...... var/log/gdata/gdavclient.log
2021/12/22 14:28:58 [3603830] >f.st...... var/log/gdata/gdb2bclient.log
2021/12/22 14:28:58 [3603830] >f..t...... var/log/gdata/job.lst
2021/12/22 14:28:58 [3603830] >f..t...... var/log/gdata/local_job.lst
2021/12/22 14:29:01 [3603830] Number of files: 419,110 (reg: 357,500, dir: 54,567, link: 6,990, dev: 2, special: 51)
2021/12/22 14:29:01 [3603830] Number of created files: 0
2021/12/22 14:29:01 [3603830] Number of deleted files: 0
2021/12/22 14:29:01 [3603830] Number of regular files transferred: 25
2021/12/22 14:29:01 [3603830] Total file size: 19.86G bytes
2021/12/22 14:29:01 [3603830] Total transferred file size: 324.37M bytes
2021/12/22 14:29:01 [3603830] Literal data: 1.51M bytes
2021/12/22 14:29:01 [3603830] Matched data: 322.86M bytes
2021/12/22 14:29:01 [3603830] File list size: 458.67K
2021/12/22 14:29:01 [3603830] File list generation time: 0.001 seconds
2021/12/22 14:29:01 [3603830] File list transfer time: 0.000 seconds
2021/12/22 14:29:01 [3603830] Total bytes sent: 13.91M
2021/12/22 14:29:01 [3603830] Total bytes received: 377.61K
2021/12/22 14:29:01 [3603830] sent 13.91M bytes received 377.61K bytes 244.16K bytes/sec
2021/12/22 14:29:01 [3603830] total size is 19.86G speedup is 1,390.77
2021/12/22 14:29:01 [3603830] sent 13905565 bytes received 377619 bytes total size 19864559538
2021/12/22 14:29:01 [3603830] rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]
root@pve1:/var/log#
 
Another thing I only noticed now is that snapshot mode doesn't work anymore for any of the other containers. They all fall back to suspend mode.
Before upgrading to PVE 7.1, all containers except the one with that bind mount worked in snapshot mode.
 
I'd grep for 'error', 'fail' and similar terms as well.. I am not sure what other options there are to add.. in general the handling hasn't changed between 6.x and 7.x with regard to the backup modes - if the storages of all mountpoints support snapshots, snapshot mode should be available. maybe you could post a config of a non-bind-mount container and /etc/pve/storage.cfg?
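A minimal sketch of how such a search could be narrowed down, since a plain grep for 'error' or 'fail' also matches file names in the transfer list. The helper name rsync_log_errors is made up for illustration, and the assumption that real rsync messages appear as "rsync:" or "rsync error" after the timestamp and PID comes from rsync's usual log format, not from anything confirmed in this thread. It may well return nothing beyond the final code 23 line.

```shell
# Hypothetical helper: show likely error lines from an rsync --log-file.
# $1: path to the log file (e.g. /var/log/rsync2)
rsync_log_errors() {
    # First drop itemized-change lines such as ">f..t...... some/path",
    # so that file names containing "error" or "fail" cannot match.
    # Then keep only lines that look like rsync messages or failures.
    grep -Ev '^[0-9/]+ [0-9:]+ \[[0-9]+\] [<>ch.*][fdLDS][^ ]{9} ' "$1" \
        | grep -Ei 'rsync:|rsync error|fail'
}
```

Called as rsync_log_errors /var/log/rsync2, this would print only the non-itemized lines that mention rsync messages or failures.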
 
Good morning (and thanks Dunuin for joining us)!
I already did grep for "error" and "fail" - without success.
Here's a config without an mp0 mount:
arch: amd64
cores: 4
hostname: OVPN-Standard
memory: 512
nameserver: 1.1.1.1
net0: name=eth0,bridge=vmbr1,firewall=1,gw=192.168.56.1,gw6=fe80::2:4fff:fe7b:f1a1,hwaddr=9E:60:92:64:5E:6A,ip=192.168.56.44/24,ip6=fd16:c367:ccbe:8::44/64,type=veth
onboot: 1
ostype: debian
protection: 1
rootfs: Disk-Images-OS:104/vm-104-disk-0.raw,size=32G
searchdomain: DMZ.lan
startup: order=5,up=60,down=300
swap: 1024
unprivileged: 1
lxc.mount.entry: /dev/random dev/random none bind,ro,create=file 0 0
lxc.mount.entry: /dev/urandom dev/urandom none bind,ro,create=file 0 0
lxc.mount.entry: /dev/random var/spool/postfix/dev/random none bind,ro,create=file 0 0
lxc.mount.entry: /dev/urandom var/spool/postfix/dev/urandom none bind,ro,create=file 0 0
lxc.mount.entry: /dev/net/tun dev/net/tun none bind,create=file 0 0
And here's /etc/pve/storage.cfg :
dir: local
path /var/lib/vz
content backup,vztmpl,iso
prune-backups keep-last=10
shared 0

zfspool: local-zfs
pool rpool/data
content images,rootdir
sparse 1

zfspool: local-ssd-zfs
pool ssd-pool
content images,rootdir
sparse 0

dir: Disk-Images-OS
path /ssd-pool/vm
content images,rootdir
shared 0

dir: OnboardSSD
path /mnt/pve/OnboardSSD
content snippets
is_mountpoint 1
nodes pve1
shared 0

dir: Templates
path /rpool/data/templates
content vztmpl
shared 0

dir: ISO-Images
path /rpool/data/iso-images
content iso
shared 0

dir: Snapshots
path /rpool/data/snapshots
content backup
prune-backups keep-last=10
shared 0

dir: Disk-Images-Data
path /rpool/data/vm-files
content rootdir,images
shared 0

dir: Schnipsel
path /rpool/data/snippets
content snippets
shared 0

dir: BakTemp
disable
path /mnt/temp/baktemp
content backup
prune-backups keep-last=1
shared 0
I'm going to try restoring containers from backups to see whether that improves anything. ;-)
 
Just in case it is relevant: apt install --reinstall libpve-storage-perl did NOT revert changes in /usr/share/perl5/PVE/VZDump/LXC.pm
Maybe some files that should have been replaced during the upgrade to 7.1 were not upgraded?
It's just a wild fairly uneducated guess. ;-)
 
sorry, that should of course been 'pve-container', not 'libpve-storage-perl' :-/
 
So, I tested restoring backups. No solution. What I did exactly: restored containers from backups, turned off protection in case something needs to be upgraded, reinstalled 'pve-container', and rebooted the node.
Something must have changed concerning backups, because they are much faster now.
With PVE 6.4, my backup job took from 00:30 to ~ 3 a.m. to complete.
With 7.1, it only takes from 00:30 to ~ 2 a.m. to complete, although neither the job nor the amount of data has changed significantly.
Merry Christmas to you all!
 
Very strange: the problem partially disappeared on its own by advanced wizardry and shady magic.
Container 108 with that suspend mode failure finally completed suspend mode backup successfully tonight.
What still remains is that no other container is able to run backups in snapshot mode (due to "some volumes do not support snapshots", although there is only the root filesystem without any further volume).
Nevertheless, KVM VMs were not affected and use snapshot mode as intended (they are in the same storage as the containers).
I wonder whether enabling new ZFS features caused this (even though the actual storages are directories in my case).
After upgrading to PVE 7.1, I ran zpool upgrade -a, which took just an instant to complete. Maybe the processing happened in the background over a longer period of time. Again, a wild uneducated guess, because I don't really understand causes and symptoms.
 
a 'dir' type storage with raw images doesn't support snapshots (and never has). if you want snapshot, use the ZFS storage plugin.. VMs use a totally different backup mechanism where 'snapshot' and 'suspend' have different semantics - 'snapshot' uses a qemu-internal snapshotting mechanism, not storage snapshots, so doesn't require support for the latter.
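To see which storage type a container's rootfs actually lives on, something like the following could be used. This is only a sketch: storage_type is a hypothetical helper, and it assumes that section headers in storage.cfg have the form "<type>: <id>" at the start of a line, as in the file posted above.

```shell
# Hypothetical helper: print the storage type for a given storage ID.
# $1: storage ID (the part before the colon in a rootfs/mpX volume)
# $2: path to storage.cfg (optional, defaults to /etc/pve/storage.cfg)
storage_type() {
    awk -v id="$1" '
        # Section headers look like "dir: local" or "zfspool: local-zfs".
        /^[a-z]+: / {
            sub(":", "", $1)
            if ($2 == id) print $1
        }
    ' "${2:-/etc/pve/storage.cfg}"
}
```

With the storage.cfg from this thread, storage_type Disk-Images-OS prints dir, which matches the explanation above: the rootfs is a raw image on a directory storage, so snapshot mode is not available for it.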
 
Ok, my fault, I'm fine with suspend mode if it works. I always had containers in directories on a ZFS Storage. Never mind.
But backing up container 108 in suspend mode only worked once since upgrading proxmox (when I wrote it above).
The new normal is that I have to stick to stop mode for that container.
Again, I did grep for error and fail and so on in rsync log file, but nothing except file names containing the strings showed up.
The strange point to me is that I can issue that rsync command manually on command line without any errors.
Yesterday, I installed the new updates for pve-container and so on, but no change.
I guess I have to live with stop mode until that gets fixed.
But thanks for your kind help anyway!
And a beautiful new year's eve!
Today, it's my love's birthday and we all know tomorrow. ;-)
 
So, after a while, it turns out as follows: once or twice a week, the daily backup in question works; the rest of the week, it fails.
The rsync logs don't show errors and I still don't have a clue, so I worked around it by setting up a stop mode backup for that specific container.
Happy new year to all of you!
 
TLDR: See screenshot below :)

I've also experienced the rsync error 23 problem with suspend mode backups for some LXC containers on Proxmox, where the mount point ACLs are set to Default (I assume this means ACL enabled).

Setting the container mount point options for ACLs to Disabled (under advanced settings) has resolved the issue for me.

1642637217023.png

From this I assume that rsync is not able to duplicate the ACLs for some files during the rsync stage of the backup, and exits with code 23. Not all container templates make use of ACLs on files, so containers using only standard UNIX permissions for all files might back up fine, while filesystems that have ACLs on some files (or just one file) would fail.
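For those who prefer the command line over the GUI, here is a rough, untested sketch of how the same setting might be applied. It assumes that the GUI option "ACLs: Disabled" corresponds to acl=0 on the rootfs line of the container config; rootfs_with_acl_disabled is a made-up helper name, and it only rewrites text, it changes nothing by itself.

```shell
# Hypothetical helper: read `pct config <vmid>` output on stdin and print
# the rootfs value with any existing acl=... option replaced by acl=0.
rootfs_with_acl_disabled() {
    sed -n 's/^rootfs: //p' \
        | sed -e 's/,acl=[01]//' -e 's/$/,acl=0/'
}
```

It could then be used along the lines of pct set 108 --rootfs "$(pct config 108 | rootfs_with_acl_disabled)". Check the printed value first, and expect that the container needs a restart before the new mount option takes effect.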

I discovered this issue after moving my container disk images to an NFS storage pool (for quicker migration between cluster nodes). My backup task is set to do snapshot backups, but container images stored on NFS storage don't support this and fall back to suspend mode backups, and this is where the issue surfaces.

Just my 2 cents - I hope it helps someone :)
 
I pay a Green Manalishi for your 2 Cents. ;-) Thank you, that's great. I'm going to try this out. In the past three days, the fallback to suspend mode worked on that container. Just like everyone, I like spontaneous errors without logging, but thanks to you, I sleep well now. ;-)
 
that likely means you need to enable ACLs on the dir where the temp copy lives during suspend backups! but it's a good hint for future reports (and a pity that rsync apparently doesn't log this properly :()
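One way to check that hint might be to probe whether the temporary directory accepts ACLs at all. This is a sketch under assumptions: it requires setfacl from the acl package, dir_supports_acls is a hypothetical name, and the return codes are just a convention chosen here (0 = ACL was set, 2 = could not create a probe file, anything else = setfacl failed).

```shell
# Hypothetical helper: try to set a POSIX ACL on a throwaway file.
# $1: directory to probe, e.g. /var/tmp (where vzdump put its temp copy
#     in the log above)
dir_supports_acls() {
    acl_probe=$(mktemp "$1/aclprobe.XXXXXX") || return 2
    acl_rc=0
    # Grant uid 0 read access via an explicit ACL entry; this fails if the
    # filesystem does not support ACLs.
    setfacl -m u:0:r "$acl_probe" 2>/dev/null || acl_rc=$?
    rm -f "$acl_probe"
    return "$acl_rc"
}
```

For example: dir_supports_acls /var/tmp && echo "ACLs work" || echo "ACLs not usable here". If the probe fails, pointing vzdump at a different temporary directory or enabling ACLs on the underlying filesystem would be the next things to look at.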