General backup thoughts - discussion starter

pongraczi

Hello Everyone,

This could be TL;DR

I have used Proxmox from the very beginning, and I have used ZFS for ages, since back when I had to compile it myself because official PVE did not support ZFS. I run smaller clusters (2 - 6 nodes; a cluster is a cluster once it has 3+ nodes, so if you have seen one, you have seen them all :).

History of my backup tasks
  • in the beginning I took backups using vzdump on lvm-ext4: it was horribly slow, it suspended VMs/OpenVZ CTs for a long time, and the server hardware was working hard every night,
  • when I moved to zfs, I started using its snapshots as backups (copied elsewhere): it just worked every time, with no downtime, no hammering of the hardware and no slowdown
  • now I use zfs based backups with simplesnap and pve-zsync, which are low-level tools without a fancy GUI, log analysis and so on:
    - they just work, but if something goes wrong and there is no monitoring (zabbix etc.), somebody needs to check it
    - another thing: replicating to more than one backup server can be tricky, because every zfs tool starts making its own new snapshots, which makes master-backup1-backup2 difficult (in this case, backup1->backup2 needs a custom solution)
  • a consultancy company advised me to use some other solution for backups; they recommend PBS and already use it themselves, so I installed it and started to evaluate it
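To illustrate the master-backup1-backup2 case from the list above: one possible custom solution is to forward the snapshots that already exist on backup1 instead of letting the tool create new ones. This is only a minimal sketch that builds the command line, not how simplesnap or pve-zsync work internally. The function names, dataset names and host name are hypothetical, and it assumes the snapshot lists are ordered oldest first and that the target dataset is not modified between runs.

```python
from typing import List, Optional


def latest_common_snapshot(source: List[str], target: List[str]) -> Optional[str]:
    """Return the newest snapshot name present on both sides (lists are oldest first)."""
    on_target = set(target)
    for name in reversed(source):
        if name in on_target:
            return name
    return None


def build_send_command(dataset: str, target_dataset: str, target_host: str,
                       source_snaps: List[str], target_snaps: List[str]) -> str:
    """Build a shell pipeline that forwards existing snapshots without creating new ones."""
    if not source_snaps:
        raise ValueError("source has no snapshots to send")
    newest = source_snaps[-1]
    base = latest_common_snapshot(source_snaps, target_snaps)
    if base == newest:
        return ""  # target already has the newest snapshot, nothing to do
    if base is None:
        # No common snapshot: full stream of the newest snapshot (initial copy).
        send = f"zfs send {dataset}@{newest}"
    else:
        # -I sends every intermediate snapshot between base and newest.
        send = f"zfs send -I {dataset}@{base} {dataset}@{newest}"
    return f"{send} | ssh {target_host} zfs receive {target_dataset}"


if __name__ == "__main__":
    # Hypothetical names: backup1 holds s1..s3, backup2 only has s1 so far.
    print(build_send_command("backup/vm-100-disk-0", "backup/vm-100-disk-0",
                             "backup2", ["s1", "s2", "s3"], ["s1"]))
```

Because backup1 never creates its own snapshots here, the snapshot names stay identical on all three machines, which is what keeps the incremental chain intact.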
My PBS experiences with vzdump
  • So, I put together a new local backup server with a lot of capacity to store backup data and installed the latest and greatest release (PBSlocal).
  • I also set one up at Hetzner for remote backup (PBSremote).
  • During the weekend I did some test backups of smaller VMs/CTs (some GBytes per VM/CT) from the production cluster (PVEcluster) to PBSremote. It went fine, I was surprised.
  • I started to make some bigger backups from PVEcluster to PBSlocal, and that went fine, too.
  • I also started to test the sync, which makes life really easy for backup1->backup2 sync tasks. It seems it just works.
  • Now that I have almost 3 TByte of compressed backups, I started to sync them from PBSlocal to PBSremote; it has been running for 4 days because of the uplink speed and still needs several days for the initial upload.
  • I have to say, I am very impressed with the PBS-PVE ecosystem, but......
And here I ran into problems I never experienced using native zfs based tools, and my happy smiley face was gone.
  • I found that during the backup, VMs (especially Windows guests) stopped responding for shorter or longer periods; one of them I had to kill and reboot.
  • I tried to use a similar regular schedule as I did with pve-zsync, every 4 hours during work time, but I was surprised: users immediately started reporting that their RDP sessions hung and stopped working, and I caused a little panic for some minutes. Cancelling the backup jobs solved it. I rescheduled the backup to the night, once per day.
  • Last night the backups of the VMs failed without a real reason (I mean, there is info in the logs, but it tells me nothing, and at this moment that is not important; what is important is that it just did not work).
  • We run services that need to run 0-24, all the time. Maybe on Saturday night we could find a time gap to stop the services in a defined order, make a backup and start them again, because these services run on several VMs/CTs and communicate with each other, and if one drops out, it can cause problems (it already caused some noise in our monitoring system; the quality of the applications used is not the question here).
  • After I experienced these issues, I read several threads on the forum to figure out what happened in the last 10+ years while I was living under a zfs-rock.
    • There are still various problems with vzdump on different filesystems, and repeated questions regarding zfs (its CoW nature, how that relates to the zvols of VMs etc.).
    • Vzdump is a filesystem agnostic backup tool by design, which sounds good at first, but after reality punches you in the face, one should reconsider forcing only this backup method.
My expectations of a backup system
  • Causes zero downtime, where zero is exactly zero.
  • Does not slow down the server, or only minimally (e.g. network transport of the backup).
  • Idiot proof: just works, no need to worry.
  • Can run frequently, during work time too.
  • Whatever else is important but I missed.
So, what is the question after all? Why on earth is there no zfs based backup system integrated into PBS/PVE?

I know this could start a flamewar regarding filesystems, what a backup is, etc., but my main goal is to start a discussion on how to achieve an integrated, zfs based backup utility that is as easy to use as PBS/vzdump, except that it should work :)

Pros
  • ZFS is mature, just works, is reliable, not a joke (like btrfs), and PVE/PBS support installing onto zfs natively etc.
  • pve-zsync is already written and just works. I have a feeling pve-zsync was written by someone from the resistance against vzdump :)
  • zero downtime.
  • zero CPU overhead.
  • the content is directly accessible (datasets and zvols are mountable).
  • the backup machine could use dedup/zstd.
Cons
  • zfs could be a no-go for a lot of users (for various reasons, pick one)
  • 3-2-1 backup strategies and their descendants do not exist for it, or I do not know about them (there is a high chance I missed them, I never looked :)
  • it forces you to use zfs everywhere, making ceph and other non-zfs storages useless
  • a redundant solution for an already solved problem: backup (anyway, as a lot of users have already experienced, vzdump is a great tool, but with handicaps since the beginning)
  • good base to start a flamewar
One could have the feeling that I did not cover some important questions/topics/problems which could turn this backup topic into a colorful, huge thread nobody wants to read.
Some quick notes regarding this:
  • a zfs snapshot could capture an inconsistent filesystem on running VMs/CTs
    • based on my own experience, it is not a problem with zfs. For CTs it is even less of a problem.
      Of course, when we start a new instance based on a snapshot, from the perspective of the VM/CT it is like booting after a crash. I am surprised, but Windows guests are pretty stable and their filesystem recovered every time. Linux VMs (ext4) can also recover.
    • CTs are the best, because their filesystem is native zfs, so it will never be inconsistent.
    • Of course, some in-memory, never written data could be lost. If you need this kind of backup, you have already lost hours/days of data, so who cares.
    • If one needs to clone in an idiot-proof, always working way, the important services or the machine itself should be turned off and the snapshot made in the offline state.
  • not every hardware is suitable for zfs; those can use vzdump, no problem
    • but for those who need their data safe and use local storage, I strongly recommend zfs; others could stick with lvm/ext4
  • restore strategies/checklists are not discussed: it seems most users do not have one, they just create backups somehow and hope for the best, so there is no difference. Those who have one are already prepared for the demands and will not be surprised.
  • regular backup checks: hands in the air, please, who does this regularly (after every backup)
  • I simply forgot everything else; maybe others could add missing things/POVs or link to discussions already written somewhere on the net.
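The offline snapshot mentioned in the notes above (turn the machine off, snapshot, start again) could be scripted roughly like this. It is a minimal sketch, assuming a KVM guest managed with the qm command and disks stored as zvols; the VM ID, dataset names and snapshot name are hypothetical, and the function only returns the commands so they can be reviewed before anything runs.

```python
from typing import List


def offline_snapshot_plan(vmid: int, datasets: List[str], snap_name: str) -> List[List[str]]:
    """Return the commands, in order, for a snapshot taken while the VM is shut down."""
    plan = [["qm", "shutdown", str(vmid)]]  # clean guest shutdown first
    for ds in datasets:
        # One snapshot per disk; the guest is off, so the disks cannot change in between.
        plan.append(["zfs", "snapshot", f"{ds}@{snap_name}"])
    plan.append(["qm", "start", str(vmid)])  # bring the VM back afterwards
    return plan


if __name__ == "__main__":
    # Hypothetical VM 100 with two disks on the pool "rpool".
    for cmd in offline_snapshot_plan(100, ["rpool/data/vm-100-disk-0",
                                           "rpool/data/vm-100-disk-1"], "offline-1"):
        print(" ".join(cmd))
```

Each entry can be passed to subprocess.run(cmd, check=True) so that the sequence stops at the first failing step. For a CT, pct shutdown and pct start would take the place of the qm calls.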
To have some concrete questions to answer, here are some:
  • Do you know a working backup solution that could be used with PVE? (please check my demands above)
  • Do you know how to make a zero downtime backup with vzdump?
  • Do you know how to make vzdump idiot (me) proof, so that it always works?
  • Do you have working backup strategies based on PVE? (3-2-1, 4-3-2-1-0, whatever fancy names you know)
  • Did you ever experience a non-bootable or totally useless backup? (backup type, non-recoverable DB and its type, unreadable old files, filesystems)
  • Did you ever feel you need another solution for backup than the one you already have? (the reason could be useful)
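For the 3-2-1 question above, here is a minimal sketch of how such a rule could be checked against a list of copies. It assumes the common reading of 3-2-1 (at least 3 copies including the production data, on at least 2 different kinds of storage, with at least 1 copy off-site); the class, the function and the media labels are hypothetical and only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Copy:
    name: str      # where the copy lives
    media: str     # kind of storage, free-form label
    offsite: bool  # True if the copy is outside the primary location


def satisfies_3_2_1(copies: List[Copy]) -> bool:
    """Check: >= 3 copies, >= 2 distinct media types, >= 1 off-site copy."""
    return (len(copies) >= 3
            and len({c.media for c in copies}) >= 2
            and any(c.offsite for c in copies))


if __name__ == "__main__":
    # Hypothetical labels for a production cluster plus a local and a remote backup server.
    setup = [
        Copy("PVEcluster", "local-zfs", offsite=False),
        Copy("PBSlocal", "local-zfs", offsite=False),
        Copy("PBSremote", "remote-storage", offsite=True),
    ]
    print(satisfies_3_2_1(setup))
```

Whether two ZFS pools on separate machines count as different media is a judgment call, which is why the media label is left free-form.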

Thank you for your attention, I hope this thread could help me to find a better solution which could help others, too.

István
 
Thank you for your comment and sorry about my delay.

We have 4 hardware nodes locally and 1 remotely, where production KVM/LXC machines are running; some of them need to run non-stop, and only occasionally is a scheduled maintenance outage allowed (except for malfunctions, which are very rare).
+1 backup server locally
+1 backup server remotely

The hardware usually has raidz2 (6 drives) + l2arc + slog on SSDs.

There are 26 VMs; some are small, some have 1TB and 1.3TB of storage (kvm). Total working data is about 3-4 TByte, spread across the VMs.

Local backup server: raidz2 as described above. It seems it can handle PBS, but verify/GC take a lot of time (I mean a lot, even though only 1 backup/day happens). This server can easily handle a zfs based backup twice a day (I mean easily, with almost no time/CPU etc.).

Remote backup server: pretty heavy compared to the local one, but the storage is on cifs, and now, after 2 months, it has grown to almost 9TB of data.
The proof-of-concept failed in our case, because after a while this cheap solution became unusable.
GC had no chance: in phase 1, once it was above 12%, each further 1% took 12-26 hours and that kept increasing, so there was no way to finish any GC.
The verify process worked, but was slow (by the way, the cifs storage is also backed by raidz2, but that zfs is not directly accessible, so a zfs based backup does not work on this storage).
Now the backup sync takes more than a day, so we are falling behind on the remote and the time gap keeps increasing.
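To put the GC numbers above into perspective, here is a small back-of-the-envelope sketch. It assumes GC phase 1 was at 12% and that the rate of 12-26 hours per 1% stayed constant; since the time per percent was actually increasing, the result is a lower bound. The function name is hypothetical.

```python
def remaining_days(done_percent: float, hours_per_percent: float) -> float:
    """Days left if every remaining percent takes hours_per_percent hours."""
    remaining_percent = 100.0 - done_percent
    return remaining_percent * hours_per_percent / 24.0


if __name__ == "__main__":
    best = remaining_days(12, 12)   # optimistic rate: 12 hours per percent
    worst = remaining_days(12, 26)  # pessimistic rate: 26 hours per percent
    print(f"remaining: {best:.0f} to {worst:.0f} days")  # remaining: 44 to 95 days
```

With daily backups arriving in the meantime, a GC run that needs well over a month can never catch up.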

My conclusion:
  • as Proxmox recommends, this backup really needs very fast storage, preferably SSD, to get acceptable response/action times in the longer term and for bigger data (I guess a few TB of data already needs this)
  • HDD raidz2 could work; at least our local backup server is able to manage ~8TB of backup data (2 months, started from 3-4TB)
  • cifs for storage could work for a small amount of data (<1TB); I have another system where I have about 530GB of backup data at this moment:
    - weekly GC takes ~15 hours
    - daily verify, only new data; the first verify took about 7-8 hours
  • of course, raidz2/3 is not optimal for speed; better to use a RAID10-like configuration (striped mirrors), combined with ssd/nvme
  • I like the system otherwise, even though it does not meet our demands (I know it is hard to achieve zero downtime with such a filesystem agnostic solution, whether with Proxmox or other similar backup systems)
  • just a note, I have used zfs under Proxmox since the ancient times when I had to compile zfs for the kernel, because Proxmox did not use it at that time (it was really in the early days, at the very beginning; I feel old :)
  • and yes, I like zfs as a filesystem, as I lost data using hardware raid10 and I prefer data safety :) (there are two kinds of sysadmins: those who have already lost data and those who will lose data). So, I learned the hard way.
What is really good in PBS (only my point of view):
  • nice, clean gui with relevant info
  • easy to add backups, scheduled jobs (verify, gc, prune)
  • easy to create remote backup sync
  • integration into Proxmox (side effect: moving VMs/CTs between remote locations is easy using these backups)
  • it seems to work as a VM on a Proxmox server or installed alongside PVE (I use both)
I hope my experiences help others to plan ahead for their backup solutions and save some time/cost by avoiding unnecessary learning curves.
 
I have to revise my previous conclusion, but first, an update on the current background:
  • on the primary, on-site backup server we have almost 9.5 TByte of backup data, with daily backups
  • we experienced very slow verify and garbage collection (days) on the primary backup server
  • the secondary (remote) server, using cifs storage, technically died due to the unbelievably slow I/O
So, my conclusion:
  • using PBS with such an amount of data needs powerful hardware. I mean, powerful.
  • you really need enterprise, datacenter grade storage, because PBS will be hammering it 0-24, all the time. Consumer grade HDDs will die quickly, I guess. We lost one HDD, maybe not related to PBS, but obviously we did not use datacenter HDDs (zfs itself does not need them; we have run lots of HDDs for a decade).
  • it is really better to use at least SSD, or better NVMe, drives for storage, especially if you expect a large amount of backup data
  • you should review the hardware of your compute nodes; I guess you need quick/powerful ones, depending on the amount of data
  • VMs could be worse than CTs
  • restoring a bigger VM (about 2.6TB of data) could take an unbelievably long time, especially if I want to find some files in 10-40 backups (kvm, not ct)
I think, after about 3 months of agony, we have to find a zfs based backup.

Seriously, if you plan to use PBS, you really need ssd/nvme drives in RAID10-like storage, otherwise you will be in trouble; it is just a matter of time.
 
