Incrementally backup directory

Eldoril73

Jul 13, 2020
First of all, kudos to all the proxmox team for this release!

After backing up a large directory tree using proxmox-backup-client, I tried to back up the same directory again. It seems that the client doesn't try to skip unchanged files (i.e. it reads and sends data from all the files, changed and unchanged), which slows down subsequent backups (although it seems that deduplication on the server side will take care of the occupied space).

It would be nice to use the previous backup index to skip unchanged files.

Any chance it will be implemented?
 
i.e. it reads and sends data from all the files, changed and unchanged

It reads all files, but unchanged ones won't get sent - the "reused" metric is still missing from that output to make this more obvious.

It would be nice to use the previous backup index to skip unchanged files.

How'd you do that? File modification times cannot be trusted, as they can be set to arbitrary values. You'd need to permanently set up something like an inotify watch for the whole directory tree you want to back up, but that's not too cheap either.
We may see if we can make some sort of optimization that doesn't hog the system too much, but for now it's safer and easier to read everything and upload only what changed - at least for file-based backups.
 
Bacula, another popular open source backup system, leaves to the user the decision on what kind of logic must be used to check if a file has been changed or not: you can choose among file checksum (slower but safer) or file metadata (last change timestamp + file size, faster but less safe). I think the same could be done on PBS.
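To make the trade-off between the two checks concrete, here is a minimal Python sketch. It is not how Bacula or proxmox-backup-client work internally; `Entry`, `snapshot` and `is_changed` are hypothetical names, and the "previous index" is assumed to be whatever record of size, mtime and checksum an earlier run kept.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """What a (hypothetical) previous backup index remembers about one file."""
    size: int
    mtime_ns: int
    sha256: Optional[str] = None


def file_sha256(path: Path) -> str:
    """Hash the whole file in 1 MiB blocks; this is the slow part."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def snapshot(path: Path, with_checksum: bool = False) -> Entry:
    """Record the current state of a file for the next run to compare against."""
    st = path.stat()
    checksum = file_sha256(path) if with_checksum else None
    return Entry(st.st_size, st.st_mtime_ns, checksum)


def is_changed(path: Path, previous: Optional[Entry], mode: str = "metadata") -> bool:
    """Return True if the file should be read and backed up again.

    mode="metadata": compare size and mtime only (fast, can be fooled).
    mode="checksum": re-read and hash the file (slow, content-based).
    """
    if previous is None:
        return True  # never seen before
    if mode == "metadata":
        st = path.stat()
        return (st.st_size, st.st_mtime_ns) != (previous.size, previous.mtime_ns)
    if mode == "checksum":
        return previous.sha256 is None or file_sha256(path) != previous.sha256
    raise ValueError(f"unknown mode: {mode!r}")
```

The metadata mode only needs a `stat` call per file, which is why it would be so much faster on a large, rarely changing tree. Its weakness is the one raised earlier in the thread: a file rewritten with the same size and a restored mtime looks unchanged.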

In my case I have to back up a 700 GB directory full of videos. The server is an HP N54L microserver with a slow AMD CPU (it hosts both PVE and PBS), and every backup after the first full one takes around 6 hours at 100% CPU use. Using file metadata would cut the backup time to a few seconds (the video library rarely changes).
 
Hmm, an opt-in "check only fstat" mode? Possible, but not too sure about it...
Maybe a daemon mode where the client really sets up a directory inotify watch to track actual writes and other modifications. That would have the same effect but actually work in a guaranteed safe way. I need to look into how big the inotify overhead really is.
 
IMHO the inotify stuff would break if you restart the daemon, my idea is to schedule the backup via crontab. Anyway this would be an opt-in, so the user should be aware of what's going on. And the change to the source code should be more limited than the inotify one.
 
IMHO the inotify stuff would break if you restart the daemon

Not really, you can self re-exec with keeping watches open, it's just a file descriptor after all.

my idea is to schedule the backup via crontab

If there'd be a client-daemon, then the client would just redirect that backup request to the daemon, this should be no problem either.

Anyway this would be an opt-in, so the user should be aware of what's going on

Yeah that for sure.

And the change to the source code should be more limited than the inotify one.

IMO, they're in the same order of magnitude; a watch that remembers dirty files/trees isn't that complicated.

But anyway, that's all details and there's nothing decided yet, besides the fact that some (option for) optimizing would be great. :)
 
Hmm, an opt-in "check only fstat" mode? Possible, but not too sure about it...
Maybe a daemon mode where the client really sets up a directory inotify watch to track actual writes and other modifications. That would have the same effect but actually work in a guaranteed safe way. I need to look into how big the inotify overhead really is.

ctime can be trusted more (unlike mtime, it cannot be set to an arbitrary value) - it is updated by the system on any content or metadata modification.
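A small Python sketch of that difference, assuming a Linux or other POSIX system where `st_ctime` is the inode change time (on Windows it means creation time, so the demonstration does not apply there). `backdate_mtime` is a hypothetical helper name.

```python
import os
import tempfile


def backdate_mtime(path: str, fake_mtime: int = 0) -> os.stat_result:
    """Set atime and mtime to an arbitrary value and return the new stat."""
    os.utime(path, (fake_mtime, fake_mtime))
    return os.stat(path)


# Create a scratch file to experiment on.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"video data")
    path = tmp.name

before = os.stat(path)
after = backdate_mtime(path)  # pretend the file was last modified in 1970

print("mtime accepted the fake value:", after.st_mtime == 0)
print("ctime went backwards:", after.st_ctime < before.st_ctime)

os.unlink(path)
```

Any unprivileged user can rewrite mtime this way, while ctime is stamped by the kernel when the inode changes. That is not an absolute guarantee (someone who controls the system clock can still influence it), but it is a much harder value to fake than mtime.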

As for inotify - it cannot be trusted either: under heavy load some events can be missed, and it does not allow reliable tracking of subdirectories. Every newly created directory must be added explicitly, which introduces a possible race (changes made before the watch is in place can be missed) and inflates the number of watches on large file systems.

I believe the user should have a choice - in most cases (especially trusted environments) mtime is quite reliable, and in any case there is ctime - and this could drastically reduce backup time (especially if the list of files is cached locally - but that is another optimization).

PS: I just hit this issue on a relatively low-volume file system (only ~14G of data but more than 1 million files) - the second run takes about 40 minutes to back up:
Code:
nocdb.pxar.didx: had to upload 236.98 MiB of 14.78 GiB in 2631.97s, average speed 92.20 KiB/s).
nocdb.pxar.didx: backup was done incrementally, reused 14.55 GiB (98.4%)

The first backup took the same time, so the network does not play a role. Just for comparison - borg backup takes less than half as long (14 minutes) for the same source system and the same target (the borg server is located on the PBS VM).
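Working through the numbers from the log output above in a few lines of Python shows where the time goes. The figures are copied from the log; the variable names are only for illustration, and treating the elapsed time as dominated by local reads is an interpretation, not something the log states.

```python
KIB_PER_MIB = 1024
MIB_PER_GIB = 1024

# Figures taken from the log lines above.
uploaded_mib = 236.98
total_gib = 14.78
reused_gib = 14.55
elapsed_s = 2631.97

# Rate at which data actually went over the wire.
upload_kib_s = uploaded_mib * KIB_PER_MIB / elapsed_s

# Rate at which the client got through the whole tree, changed or not.
scan_mib_s = total_gib * MIB_PER_GIB / elapsed_s

reused_pct = 100 * reused_gib / total_gib
elapsed_min = elapsed_s / 60

print(f"upload rate : {upload_kib_s:.2f} KiB/s")
print(f"scan rate   : {scan_mib_s:.2f} MiB/s")
print(f"reused      : {reused_pct:.1f} %")
print(f"elapsed     : {elapsed_min:.1f} min")
```

The upload rate matches the 92.20 KiB/s the client reports, and the reuse ratio matches the 98.4%. The scan rate of a few MiB/s across the whole tree is the number that matters here: almost nothing is sent, yet every file is still opened and read.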

Thus if volumes are really high (100s of GB and more, big database files, lots of files etc.) it will thrash I/O as much as a regular vzdump - the fact that unmodified files are not transferred does not really help.
 
It's an old post but....
I also use borg backup for my backup repository, and I already love Proxmox Backup Server, but I think it could be better software if it offered different methods to detect new files, not only the full comparison.
Using the modification date would be much faster, and it would let users compare, test and choose the best of the different options for their scenario.
 
Oh yes, the user should be able to choose another method. I also wanted to use PBS to back up the files from my slow NAS, which holds a lot of big media files. But the runtime is too long for me because of the slow CPU there. Usually most of the files haven't changed. Will have a look at borg now. :(
 
Thanks. But I have already checked out borg. It seems to be a better fit if you have a lot of larger files.
But when you have a lot of smaller files, e.g. the filesystem of a desktop computer, then proxmox-backup-client seems to be the better choice.
 
I have the same sort of issue.
I have submitted the following feature request that would solve this:
https://bugzilla.proxmox.com/show_bug.cgi?id=4347

Basically: mount the previous backup, write the new backup on top of it with another tool such as rsync, and finally tell the server to save the result as a "new" backup.

I hope something like this can be implemented. It would greatly speed up file level backups.
 
Please don't create a solution that uses a local database to track a remote file system. I left Duplicati and Syncthing because they do this. Both of them have far too many problems keeping the database accurate and up-to-date. I was constantly repairing them and rebuilding them. It's more trouble than it's worth. Rsync is my tool of choice for syncing.
 
I would like an OPTIONAL ctime/mtime-based solution. This could massively speed up backups, and one could still do an extra checksum-based backup run on weekends.
 
