LXC Containers Backing Up Incredibly Slow

infinityM

Hey Guys,

We have an LXC container which is about 2TB in size, but backing it up takes 7+ hours.
Meanwhile I have several VMs which are 5-6TB in size yet only take about 20 minutes to back up...

Why are LXC containers so slow to back up?
 
To help better, please provide your storage config (cat /etc/pve/storage.cfg) and your container config.
Hey Matrix,

root@c6:~# cat /etc/pve/storage.cfg
dir: local
path /var/lib/vz
content vztmpl,iso,backup

lvmthin: local-lvm
thinpool data
vgname pve
content images,rootdir

rbd: Default
content images,rootdir
krbd 0
pool Default

nfs: NAS
export /volume1/Backups
path /mnt/pve/NAS
server 10.161.0.247
content snippets,vztmpl,backup,iso,rootdir,images
maxfiles 3

pbs: backups
datastore backups
server pmb.local
content backup
fingerprint # I removed the fingerprint
maxfiles 99
username root@pam

LXC Config

root@c6:~# cat /etc/pve/lxc/127.conf
arch: amd64
cores: 6
hostname: VPS127
memory: 8192
net0: name=eth0,bridge=vmbr0,firewall=1,gw=156.38.175.34,hwaddr=22:72:12:00:8F:59,ip=156.38.175.58/27,type=veth
onboot: 1
ostype: ubuntu
rootfs: Default:vm-127-disk-0,size=2000G
swap: 8192
unprivileged: 1
 
because for containers, PBS needs to read and chunk all the data to find out what to upload; for (running) VMs there is a shortcut, because we can tell QEMU to keep track of which chunks changed.
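Very roughly, the difference looks like this (a minimal Python sketch for illustration only, not PBS's actual code - the 4 MiB fixed chunk size is what VM backups use, everything else is made up):

Code:
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # fixed-size chunks, as used for VM (block) backups

def backup_without_bitmap(disk_path):
    # container / stopped-VM case: every byte has to be read and hashed
    # just to find out which chunks the server already has
    digests = []
    with open(disk_path, "rb") as disk:
        while True:
            chunk = disk.read(CHUNK_SIZE)
            if not chunk:
                break
            digests.append(hashlib.sha256(chunk).hexdigest())
    return digests  # only chunks unknown to the server get uploaded

def backup_with_bitmap(disk_path, dirty_chunks):
    # running-VM case: QEMU's dirty bitmap says which chunks changed,
    # so unchanged chunks are neither read nor hashed
    digests = {}
    with open(disk_path, "rb") as disk:
        for idx in dirty_chunks:  # usually a tiny subset of the whole disk
            disk.seek(idx * CHUNK_SIZE)
            digests[idx] = hashlib.sha256(disk.read(CHUNK_SIZE)).hexdigest()
    return digests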
 
Hey Fabian,

Are there any plans to get around this problem?
 
there is no cheap solution for this - the short-term answer would definitely be to move such workloads to VMs, where the intermediate layer can keep track of writes from the guest and speed up the backup.
 
yes, basically. then you only need to re-read all of the data when
- the VM was down (live-migrating to another node will now transfer the bitmap as well!)
- the encryption key or mode was changed
 
of course, increasing your read/chunk performance might also be an option depending on your current hardware ;)
 
read: I/O benchmark (more/faster disks, to some degree more RAM)
chunk/hash: proxmox-backup-client benchmark (faster CPU/more cores)
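If you just want a quick ballpark for the chunk/hash side, something like the following gives a rough single-core hashing number (a toy Python sketch, not the official tool - proxmox-backup-client benchmark is the authoritative measurement; the 4 MiB chunk size is only assumed for illustration):

Code:
import hashlib, os, time

CHUNK_SIZE = 4 * 1024 * 1024   # chunk size assumed for illustration
CHUNKS = 256                   # hash 1 GiB worth of data

def sha256_throughput():
    data = os.urandom(CHUNK_SIZE)
    start = time.monotonic()
    for _ in range(CHUNKS):
        hashlib.sha256(data).digest()
    elapsed = time.monotonic() - start
    return CHUNKS * CHUNK_SIZE / elapsed / 1024 ** 2   # MiB/s

print(f"single-core SHA-256: {sha256_throughput():.0f} MiB/s")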
 
I opened a feature request https://bugzilla.proxmox.com/show_bug.cgi?id=3138 to speed up container backups, but it got closed multiple times without discussion and even with wrong claims. The feature request is perfectly valid, because PBS would be the ONLY backup solution in the world that needs several hours to back up UNCHANGED file systems. I claim that a dramatic improvement IS possible, and it is not even too complicated to implement.
 
Yeah, I have the same issue, and I see nothing has changed in 2 years :(
I have a 6TB LXC disk and it takes 3 days to back it up...
I am using Proxmox Backup Server.
 
So I'm in the same boat. Before implementing PBS for production, I deployed it in a test cluster. A test LXC (4TB) takes 2 hours to back up, while nothing has changed. The strange thing is that during the 2 hours the backup takes to finish, there isn't much disk activity on the PBS or on the machine with the LXC... the CPU is even cold.
Test VMs that are 128GB, 300GB and 512GB take seconds to back up. There is one thing that doesn't make sense: the Proxmox staff keep saying that only VMs that are running can have the diff done quickly, but one machine with VMs only comes up periodically and does the backup right after startup, and that one finishes within a few seconds. All this while the LXC on a machine that runs 24/7 takes forever.

To add insult to injury, VM verification takes a second or two for small incremental backups, while the LXC takes 7 hours to verify an incremental backup (which is a few MB, might I add), and that hogs the disks like crazy.
 
besides the big difference (dirty bitmaps with running VMs allow skipping of read operations of unchanged chunks), there's also way less complexity involved in the processing of backup data for VMs
- VM backups are block-based, each input chunk is a fixed size, nothing has to be done except compressing, hashing, potentially encrypting
- CT/host backups are file-based, input chunks are variable size, a directory tree has to be parsed (potentially lots of random I/O), read, converted into a pxar archive stream, then compressed, hashed, potentially encrypted

you can probably guess the latter is affected by how your file system performs w.r.t. directory and metadata operations.
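To make that concrete, here is a very stripped-down sketch of the two read paths (Python, purely illustrative - the real client uses variable-sized, content-defined chunks for the file-based case, which the sketch skips to stay short):

Code:
import hashlib, os

CHUNK = 4 * 1024 * 1024

def chunk_vm_image(path):
    # fidx case: one sequential read of the image, fixed-size chunks,
    # no metadata work at all
    with open(path, "rb") as dev:
        while block := dev.read(CHUNK):
            yield hashlib.sha256(block).hexdigest()

def chunk_file_tree(root):
    # didx case: the whole directory tree has to be walked first
    # (lots of stat()/open() calls, often random I/O), then every file
    # is read and turned into archive chunks
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            try:
                with open(os.path.join(dirpath, name), "rb") as f:
                    while block := f.read(CHUNK):
                        yield hashlib.sha256(block).hexdigest()
            except OSError:
                continue  # skip unreadable files in this toy example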
To add insult to injury, VM verification takes a second or two for small incremental backups, while the LXC takes 7 hours to verify an incremental backup (which is a few MB, might I add), and that hogs the disks like crazy.
what do you mean by verify? a PBS verification happens on the PBS side, and consists of reading all chunks of the snapshot and hashing their contents, so it's both read and CPU intensive, but it's basically the same for fidx and didx backups, except that the index format is a bit different. the contained data is not logically analysed or verified at all, so the difference in contents doesn't matter.
 
you can probably guess the latter is affected by how your file system performs w.r.t. directory and metadata operations.
Yes, you are absolutely right. Though, you (i.e., the Proxmox operating system) already know what the filesystem of the underlying storage is, because I select it from a drop-down in the GUI - ZFS. There could be separate logic per storage type; it's not that complicated to encapsulate different diff functionality within a switch case.
what do you mean by verify? a PBS verification happens on the PBS side, and consists of reading all chunks of the snapshot and hashing their contents, so it's both read and CPU intensive, but it's basically the same for fidx and didx backups, except that the index format is a bit different. the contained data is not logically analysed or verified at all, so the difference in contents doesn't matter.
Yes, on this point I meant on the PBS side. Still, this one is a bit strange - even though backups are incremental for the CT (judging by the amount reported as sent from PVE to PBS), the verify (periodic job OR on receive - I tested both) takes 6-7 hours, and PBS seems to scratch through all 4TB of data on the disk. And yes, previous backups of this CT are already verified... one would assume that only the incremental backup of 10MB that was sent to PBS would get verified... as, you know, an increment? And although there are now 10 backups of the 4TB CT in PBS, it only reports that the storage is 4TB (and a bit more) occupied, so I don't presume every single backup is the whole 4TB of data. Even zfs list shows that it's not that big:

Code:
zfs list
NAME               USED  AVAIL     REFER  MOUNTPOINT
rpool             12.7G  1.74T       96K  /rpool
rpool/ROOT        12.7G  1.74T       96K  /rpool/ROOT
rpool/ROOT/pve-1  12.7G  1.74T     12.7G  /
rpool/data          96K  1.74T       96K  /rpool/data
storage_cold      3.39T  49.3T     3.39T  /mnt/datastore/storage_cold


Side point (third one):
What I was thinking (more hoping) was that PBS would be (more) aware of its storage, and if incremental backups were performed on a storage like ZFS, they would be based on its built-in snapshots.
 
Yes, you are absolutely right. Though, you (i.e., the Proxmox operating system) already know what the filesystem of the underlying storage is, because I select it from a drop-down in the GUI - ZFS. There could be separate logic per storage type; it's not that complicated to encapsulate different diff functionality within a switch case.
no, the backup client is file-system agnostic, it uses regular file/directory operations, there is no diffing at that level.
Yes, on this point I meant on the PBS side. Still, this one is a bit strange - even though backups are incremental for the CT (judging by the amount reported as sent from PVE to PBS), the verify (periodic job OR on receive - I tested both) takes 6-7 hours, and PBS seems to scratch through all 4TB of data on the disk. And yes, previous backups of this CT are already verified... one would assume that only the incremental backup of 10MB that was sent to PBS would get verified... as, you know, an increment? And although there are now 10 backups of the 4TB CT in PBS, it only reports that the storage is 4TB (and a bit more) occupied, so I don't presume every single backup is the whole 4TB of data.

if you verify a single snapshot, all the chunks of that snapshot will be verified. a snapshot is always complete, there are no incremental or full snapshots - all snapshots are "equal". only the uploading part is incremental in the sense that it skips re-uploading chunks that are already there on the server. the chunk store used by the datastore takes care of the deduplication. so yes, in your case, (re)verifying a single snapshot will read and verify 4TB of data. (re)verifying ten snapshots of the same CT with little churn between the snapshots will cause only a little bit more load than verifying a single one of them, since already verified chunks within a single verification task will not be verified a second time for obvious reasons.

TL;DR "verify after backup" is not necessary unless you are really paranoid. scheduled verification with sensible re-verification settings is the right choice for most setups to reduce the load caused by verification while retaining almost all the benefits.
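For illustration, verification behaves roughly like this simplified sketch (not actual PBS code - the snapshot/chunk layout here is hypothetical); note how chunks shared between snapshots are only read and hashed once per task:

Code:
import hashlib
from pathlib import Path

def verify_task(snapshots, chunk_dir):
    # snapshots: list of (name, [chunk digests from the snapshot's index]);
    # the flat chunk_dir layout is made up, the real chunk store differs
    verified = set()
    for name, digests in snapshots:
        for digest in digests:
            if digest in verified:
                continue          # chunk already checked earlier in this task
            data = (Path(chunk_dir) / digest).read_bytes()
            if hashlib.sha256(data).hexdigest() != digest:
                print(f"{name}: corrupt chunk {digest}")
            verified.add(digest)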

Side point (third one):
What I was thinking (more hoping) was that PBS would be (more) aware of its storage, and if incremental backups were performed on a storage like ZFS, they would be based on its built-in snapshots.

PBS (both the client and server side) is pretty much storage agnostic and only uses regular file-system APIs.
 
