Offsite backup options

websmith · May 18, 2022

Hi,
I am very happy with the Proxmox Backup Server so far - but I am wondering if there is any efficient way to back up the "chunks", i.e. the datastores, offsite - without having a PBS server offsite?

Right now I am using rsync - and it's "working", meaning chunks get updated, deleted etc. - but it is horribly slow: my last sync took over 4 hours to complete.

I am guessing it's because of the massive number of files in the .chunks directory.

It is similar when you have to manually remove a datastore and clean it up: deleting the .chunks directory takes ages. These operations are slow no matter whether the datastore is located on SSDs or rusty spinners.

So I wonder if anyone has any good ideas for making more efficient "backups" of the datastores.

Thanks in advance.

tom · May 18, 2022

websmith said:
So I wonder if anyone has any good ideas for making more efficient "backups" of the datastores.

Not the answer you want, but the perfect solution for off-site is an additional Proxmox Backup Server. There are also hosting companies offering this as a service, if you do not want to operate it yourself.

websmith · May 18, 2022

tom said:
Not the answer you want, but the perfect solution for off-site is an additional Proxmox Backup Server. There are also hosting companies offering this as a service, if you do not want to operate it yourself.

I am aware that it would be ideal, but this is a homelab I am running - and even though I like my backups being secure, paying hundreds of euros per year just to have an offline backup is prohibitive.

Right now I have an SSH box available with 250 GB of storage that I can rsync my data to - which works, but is extremely slow.

But I guess I will just have to set up yet another PBS in my homelab - just for storing a backup of the backups.

Not ideal - but at least I would have two copies.

I understand it's efficient to store files like that, using part of the hash as the folder name - but I know that other software that stores huge numbers of files, e.g. a caching proxy, uses a more nested directory structure, see https://www.nginx.com/blog/nginx-caching-guide/.

To quote: "levels sets up a two‑level directory hierarchy under /path/to/cache/. Having a large number of files in a single directory can slow down file access"

The top level would only contain single-character directories (0-9 and a-f for hex digests), each of those would contain the same set, and so forth - possibly 4 levels down.

I.e. these chunks:

.chunks/5edd/5edd397c3569048b2097caccdf9345d0847999965a291ac4d223032024ec67a9
.chunks/5ee0/5ee02e8d0bf5d6dcb7e2eb995f944c0df8b6df4e009c219ca5f2f068c27ee155
.chunks/5ee2/5ee2081883e760091971c9c2d5d9e8c8451d6ce212e51a6de563e7f1b4bf0079
.chunks/5ee2/5ee2852204377630ff9e6538ff6415afae5b14b130b00c1e3551797c43134bf4

Would be stored in the following structure:

.chunks/5/e/d/d/5edd397c3569048b2097caccdf9345d0847999965a291ac4d223032024ec67a9
.chunks/5/e/e/0/5ee02e8d0bf5d6dcb7e2eb995f944c0df8b6df4e009c219ca5f2f068c27ee155
.chunks/5/e/e/2/5ee2081883e760091971c9c2d5d9e8c8451d6ce212e51a6de563e7f1b4bf0079
.chunks/5/e/e/2/5ee2852204377630ff9e6538ff6415afae5b14b130b00c1e3551797c43134bf4

That allows the filesystem to cope better, since the directory entry counts are smaller.

That is something you could consider for the chunk storage. It should not matter in terms of read/write performance, since I assume your code already builds the first directory name from a file hash or something similar - now you would just need to generate 4 directory names instead, or however many levels you decide to use.

I am aware that the performance impact of having many files in a single directory depends very much on the filesystem, but having multiple levels can only improve performance - in particular for those filesystems that do not handle many files efficiently.
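
To make the two layouts above concrete, here is a minimal sketch that maps a chunk digest to its path under each scheme. The function names flat_path and nested_path are made up for illustration; this is not PBS code, just the path arithmetic from the listings above.

```bash
# Hypothetical helpers (not PBS code): map a chunk digest to its path.

# Current layout: one directory named after the first 4 hex characters.
flat_path() {
    prefix=$(printf '%.4s' "$1")
    printf '.chunks/%s/%s\n' "$prefix" "$1"
}

# Proposed layout: the same 4 characters, one directory level each.
nested_path() {
    # sed appends a "/" after every character, e.g. 5edd -> 5/e/d/d/
    levels=$(printf '%.4s' "$1" | sed 's|.|&/|g')
    printf '.chunks/%s%s\n' "$levels" "$1"
}

digest=5edd397c3569048b2097caccdf9345d0847999965a291ac4d223032024ec67a9
flat_path "$digest"
nested_path "$digest"
```

Note that with four hex characters both schemes end up with the same number of leaf directories (16^4 = 65536), so the number of chunk files per leaf directory stays the same; only the number of entries in the upper-level directories changes.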

dietmar · May 19, 2022

websmith said:
.chunks/5/e/d/d/5edd397c3569048b2097caccdf9345d0847999965a291ac4d223032024ec67a9
.chunks/5/e/e/0/5ee02e8d0bf5d6dcb7e2eb995f944c0df8b6df4e009c219ca5f2f068c27ee155
.chunks/5/e/e/2/5ee2081883e760091971c9c2d5d9e8c8451d6ce212e51a6de563e7f1b4bf0079
.chunks/5/e/e/2/5ee2852204377630ff9e6538ff6415afae5b14b130b00c1e3551797c43134bf4

That allows the filesystem to cope better, since the directory entry counts are smaller.

Our benchmarks indicate that a two-level structure is better in most cases.

websmith · May 19, 2022

dietmar said:
Our benchmarks indicate that a two-level structure is better in most cases.

Okay, that is fine - but you are running with only one level as far as I can see?

A directory that's named after the first 4 characters of the chunk's hash - or perhaps we have a different meaning of levels. I mean levels below the .chunks directory.

I.e. if I run

Bash:

find . -maxdepth 1 -type d | wc -l

in the .chunks directory I get 65537. If I run

Bash:

find . -maxdepth 2 -type d | wc -l

I get the exact same number, 65537.

So to my eyes it seems like you are only using 1 level and not two levels?
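
A note on the numbers: find counts the starting directory "." itself (depth 0), so 65537 is 65536 prefix directories (16^4, one per 4-character hex prefix) plus one. To count each level separately, something like the following sketch could be used. It assumes a find that supports -mindepth (GNU and BSD find do); the helper name count_dirs_at_depth and the throwaway demo tree are made up for illustration.

```bash
# Hypothetical helper: count directories at exactly the given depth.
count_dirs_at_depth() {
    # -mindepth skips the starting directory and anything shallower.
    find "$1" -mindepth "$2" -maxdepth "$2" -type d | wc -l
}

# Small throwaway tree standing in for a .chunks directory.
demo=$(mktemp -d)
mkdir -p "$demo/5edd" "$demo/5ee0" "$demo/5ee2"
touch "$demo/5edd/some-chunk-file"

count_dirs_at_depth "$demo" 1   # the prefix directories
count_dirs_at_depth "$demo" 2   # directories below them, if any

rm -rf "$demo"
```

In a real datastore the call would be count_dirs_at_depth .chunks 1 and count_dirs_at_depth .chunks 2; a zero at depth 2 means the prefix directories contain only files.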

Dunuin · May 19, 2022

You could try ZFS replication instead of rsync. ZFS can incrementally sync the changes between snapshots, which is usually very fast even with a lot of small files, because the sync is done at block level and not at file level. On the other hand, tunneling a ZFS replication over an SSH connection (for example with zfs send tank1/dataset@snap1 | ssh anotherhost zfs recv tank2/dataset) isn't that fast either, but it should be fast enough to saturate most consumer internet connections. For ZFS replication you of course need a ZFS pool on both servers.
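
The command above sends a full snapshot. For the incremental case, a rough sketch could look like this. It only prints the pipeline instead of running it, so it is safe to try. It assumes snap1 has already been fully sent and received on the target; the snapshot name snap2 and the helper name incremental_send_cmd are made up for illustration.

```bash
# Hypothetical helper: print (do not run) an incremental replication pipeline.
# Arguments: source dataset, previous snapshot, new snapshot, host, target dataset.
incremental_send_cmd() {
    # "zfs send -i old new" sends only the changes between the two snapshots.
    printf 'zfs send -i %s@%s %s@%s | ssh %s zfs recv %s\n' \
        "$1" "$2" "$1" "$3" "$4" "$5"
}

incremental_send_cmd tank1/dataset snap1 snap2 anotherhost tank2/dataset
```

The new snapshot would be created beforehand with zfs snapshot tank1/dataset@snap2, and the previous snapshot has to be kept on both sides until the next incremental send has finished.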
