Backup misconfiguration?

Daniel-San
Hello guys,

I have a problem that I cannot explain to myself technically.
I have set up a single Proxmox VE host with two running containers, plus a PBS instance as a VM on separate hardware, where the hypervisor stores the backups of both containers every night.

One container has a 20 GB rootfs and, additionally, a 20 GB mount point.
Since I am using a Proxmox VE setup with a 1 TB ZFS RAID 1 pool, the rootfs and the mount point were each created as a subvolume on the local ZFS pool for virtual machines and LXCs.
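For reference, the extra mount point was added roughly like this (a sketch from memory, using my CT ID and path):

Bash:
# Allocate a new 20 GB subvolume on local-zfs and mount it at /var/piler
# inside the CT; backup=1 tells vzdump to include the mount point in backups.
pct set 101 -mp0 local-zfs:20,mp=/var/piler,backup=1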

The other container has a 15 GB rootfs only.

Now here is the thing:
The backup of the container with the additional mount point takes approximately 2 hours, and nearly 3 TB is crawled (bear in mind that I have only 1 TB of physical disk space overall) on every run of the backup job.
The duration and the amount of crawled data do not decrease after several days, as the deduplication functionality of PBS would suggest.

The other container finishes within two minutes after a few days and uploads approximately 65 megabytes to the PBS datastore every night. This growth is expected, since both LXCs are not yet in production use but system logs are still being written.

Both LXCs are based on the Debian 12 template from the Proxmox library.

I have attached the webhook notification of the last backup job.

Do you have some hints on where I can start troubleshooting?
It is absolutely weird that an LXC of at most 40 GB overall takes such a long time to back up, with nearly 3 TB of data crawled for it.

Thank you and best,
Daniel
 

Attachments

  • IMG_1905.jpeg (237 KB)
Give this a try: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_ct_change_detection_mode
And give this a read: https://pbs.proxmox.com/docs/backup-client.html#change-detection-mode
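For containers this can be set per backup job under the advanced options, or tested in a one-off run, roughly like this going by the linked docs (CTIDHERE and the storage name are placeholders):

Bash:
# One-off backup run from the PVE host; metadata mode only pays off from the
# second run onward, since it needs a previous snapshot to compare against.
vzdump CTIDHERE --storage PBSSTORAGEHERE --mode snapshot --pbs-change-detection-mode metadata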
If you want more input, please share your config(s) via pct config CTIDHERE, cat /etc/pve/storage.cfg and cat /etc/pve/jobs.cfg.
Thank you for the information.

Here is the requested output:


Bash:
arch: amd64
cores: 2
description:
features: keyctl=1,nesting=1
hostname: <$hostname>
memory: 2048
mp0: local-zfs:subvol-101-disk-1,mp=/var/piler,size=20G,backup=1
net0: name=eth0,bridge=lxcnet0,firewall=1,hwaddr=<$mac-address>,ip=dhcp,ip6=dhcp,type=veth
onboot: 1
ostype: debian
rootfs: local-zfs:subvol-101-disk-0,size=20G
swap: 1024
tags: debian;mailpiler
unprivileged: 1

And here is the storage config:

Bash:
dir: local
        path /var/lib/vz
        content vztmpl,iso,backup

zfspool: local-zfs
        pool rpool/data
        content rootdir,images
        sparse 1

pbs: PBS
        datastore <$storage-reponame>
        server <$pbshostname>
        content backup
        prune-backups keep-all=1
        username <$username>@pbs

I have read the docs. I will try setting the change detection mode to “metadata”.

Best
Daniel
 
A quick update:
Changing the detection mode in the PVE backup job configuration and recreating the datastore on PBS has no effect.
Almost 3 TB of data is still crawled and the job still takes almost 2 hours.

I also tried doing a full clone of the problematic LXC on the PVE host for testing purposes.
Creating the full clone crawls 2.67 TB of data from the origin LXC, too.

Could it be that PBS is not the problem, but the LXC is? That the LXC somehow got broken with regard to its allocated space?
 
It will not take effect on the first backup, since metadata mode needs a previous snapshot to compare against, and re-creating the datastore wasn't needed. What does the job config look like now? I recommend using metadata.
Can you share the CT config for the one whose backup crawls 3 TB?

Regarding the full clone you tried: this is about PBS backups, though.
 
The backup job has been changed to metadata via the PVE backup job configuration, under the advanced settings.
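For reference, the job entry in /etc/pve/jobs.cfg now looks roughly like this (job ID, schedule and VMIDs are placeholders; the last line is the actual change):

Bash:
vzdump: backup-xxxxxxxx-xxxx
        schedule 02:00
        storage PBS
        mode snapshot
        vmid 100,101
        pbs-change-detection-mode metadata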

The CT config is still the same as the one pasted a couple of posts above (first code block).

My thinking is as follows:
A full clone of my working container crawls only 15 GB of data (the 15 GB rootfs) according to the clone job summary, which is expected.
The problematic container has a 20 GB rootfs and a 20 GB mount point assigned; according to the logs, both the backup job and the full clone crawl 2.67 TB of data.
The container config for the problematic one shows 40 GB of assigned space in total (rootfs plus mount point).
So my theory is that there is a problem with space allocation, maybe in the ZFS subvolumes, and not a problem with the backup job itself.
I don't know how or why this could happen.
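One rough check from inside the container would be to look for big sparse files, i.e. files whose apparent size far exceeds their on-disk usage (a sketch; the 1 GB threshold is arbitrary and /var/piler is the mount point from my config):

Bash:
# List files larger than 1 GB (apparent size) on the rootfs and the mount
# point; ls -lsh prints the allocated size first, so a file with a tiny
# allocation but a huge apparent size would explain the crawled terabytes.
find / /var/piler -xdev -type f -size +1G -exec ls -lsh {} \;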

Best,
Daniel
 
Trimming logs is always hard unless you know exactly what you are looking for, and I kind of want to see it all. Just .zip it and it should compress well.
 
Now it makes sense why the log is so large. I'm sorry, I meant the PVE-side task log of the backup job. I'm curious why it would crawl ~3 TB of data when your CT only has ~40 GB in total.
 
Yeah, I see what you mean now. Sorry it took me this long to understand. To be honest, I'm a bit puzzled where these ~2.7 TB come from, too.
This is mostly for my curiosity, but would you mind sharing this from the PVE side?
Bash:
zfs list -rt all -o name,used,avail,refer,mountpoint,refquota,refreservation,logicalused,compressratio | grep -E "NAME|101"
I saw a similar issue here recently: https://forum.proxmox.com/threads/abnormally-large-processed-data-on-small-lxc.163340/
Make sure to follow the links there and in the linked issue too.
 
Here is the output:

Bash:
NAME                           USED  AVAIL  REFER  MOUNTPOINT                     REFQUOTA  REFRESERV  LUSED  RATIO
rpool/data/subvol-101-disk-0  1.01G  19.0G  1.01G  /rpool/data/subvol-101-disk-0       20G       none  2.00G  2.12x
rpool/data/subvol-101-disk-1  4.46M  20.0G  4.46M  /rpool/data/subvol-101-disk-1       20G       none  5.75M  1.93x
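At the ZFS level this looks sane to me: both subvolumes together use only about 1 GB. To compare allocated versus apparent sizes from the host, I could run something like this (a sketch, assuming the default subvolume mount points):

Bash:
# Allocated vs. apparent size of the CT's datasets, seen from the PVE host;
# a huge gap between the two would point at sparse files being read in full.
du -sh /rpool/data/subvol-101-disk-0 /rpool/data/subvol-101-disk-1
du -sh --apparent-size /rpool/data/subvol-101-disk-0 /rpool/data/subvol-101-disk-1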
 
I was already thinking about recreating the LXC like @veehexx has done it in his post.
That will take some time, because lots of system customizations were made around the piler application installation to get it running with current versions of its package dependencies.

Best,
Daniel
 