Best practice for VM/CT with large amounts of data

Mrt12

Often one wants to run Nextcloud, a file server or something similar, and then the question of how to back it up always comes up.
One reason I find PVE so attractive is that you can easily back up entire VMs or LXCs and get a single backup that contains everything. Restoring is then super easy.

I notice that, if Proxmox Backup Server is used, VM backups are very fast thanks to the dirty bitmap. But this only works as long as the VM has not been rebooted since the last backup. Backups of LXCs, on the other hand, always take a very long time if there is a lot of data. So I wonder what the best practice would be here, say for a Nextcloud + file server with 1TB of data in it.

a) use an LXC and just accept that the daily backups take a very long time, especially if the data lives on hard disks
b) use a VM. The backups are fast, as long as the VM is never rebooted.
c) use either a) or b) but exclude the large data from the backup, and run some kind of backup software *inside* the VM or LXC.
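For option c, excluding the big data volume from the vzdump/PBS backup could look roughly like this (VMIDs, storage names and mount points are just placeholders):

```bash
# VM: re-specify the data disk with backup=0 so vzdump/PBS skips it
# (the OS disk on scsi0 is still backed up as usual)
qm set 100 --scsi1 hdd-pool:vm-100-disk-1,backup=0

# LXC: same idea for the mount point that holds the data
pct set 101 --mp0 hdd-pool:subvol-101-disk-1,mp=/srv/data,backup=0
```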

What would be the "correct" and cleanest way to do it?
 
So I wonder what the best practice would be here, say for a Nextcloud + file server with 1TB of data in it.
This might be an unpopular opinion, but I would say that best practice is to separate data from blockstorage. If you just use a 1TB VM disk, you get all the downsides of blockstorage, plus backing that up is painful. It is less painful with PBS, but still painful.

On the other hand, you could run a 32GB Nextcloud VM that accesses a 1TB dataset, and back it up in one of these ways:

Option A for Nextcloud: back up the whole VM, which is only 32GB. This is fast even to an NFS share, and of course even faster to PBS.

Option B for Nextcloud:
Create a MariaDB dump and rsync the config, data and themes folders (a rough sketch follows after option C)

Option C for Nextcloud:
Use some Docker AIO that comes with Borg backup
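
A rough sketch of option B, assuming a standard Nextcloud install under /var/www/nextcloud with its data directory in /srv/nextcloud-data (paths, database name and user are made up):

```bash
# Put Nextcloud into maintenance mode so files and database stay consistent
sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --on

# Dump the MariaDB database
mysqldump --single-transaction -u nextcloud -p nextcloud > /backup/nextcloud-db.sql

# rsync config, themes and the data directory
rsync -a /var/www/nextcloud/config /var/www/nextcloud/themes /backup/nextcloud/
rsync -a /srv/nextcloud-data/ /backup/nextcloud-data/

sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --off
```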

The 1TB of data itself you can easily back up by (rough examples below):
- rsync
- sending ZFS snapshots
- S3
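
Rough examples for the data itself (dataset, host and bucket names are made up; for S3 I am assuming an already configured rclone remote):

```bash
# rsync the files to another machine
rsync -a --delete /tank/ncdata/ backuphost:/backup/ncdata/

# or send ZFS snapshots (incremental once the first full send exists)
zfs snapshot tank/ncdata@daily1
zfs send -i tank/ncdata@daily0 tank/ncdata@daily1 | ssh backuphost zfs receive backup/ncdata

# or push the files to S3-compatible storage, e.g. with rclone
rclone sync /tank/ncdata s3remote:my-backup-bucket/ncdata
```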

This might initially be a little more complex to set up, but it will perform better.
I am not very experienced with LXCs, but I think the same applies to them.
 
Of course I would make individual virtual disks for the data and the OS, where the OS would be a small disk (32-ish GB) on the SSD pool, and the actual data (1TB in this example) would be on a separate pool that could also be HDD.
Still the question remains: what would be the smartest way to set this up so that the backup times are reasonable?
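For example, attaching the separate 1TB data disk or mount point from an HDD-backed pool could look like this (IDs and storage names are placeholders):

```bash
# VM: OS stays on scsi0 (SSD pool); add a 1TB data disk on the HDD pool
qm set 100 --scsi1 hdd-pool:1024

# LXC: same idea with an extra mount point
pct set 101 --mp0 hdd-pool:1024,mp=/srv/data
```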
 
Of course I would make individual virtual disks for the data and the OS, where the OS would be a small disk (32-ish GB) on the SSD pool, and the actual data (1TB in this example) would be on a separate pool that could also be HDD.
Still the question remains: what would be the smartest way to set this up so that the backup times are reasonable?
You could use restic.
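
A minimal sketch with restic, assuming the repository lives on another host reachable over SFTP (host, paths and retention values are made up):

```bash
# Initialise the repository once
restic -r sftp:backup@backuphost:/srv/restic-repo init

# Daily backup of the data directory
restic -r sftp:backup@backuphost:/srv/restic-repo backup /tank/ncdata

# Keep a limited history and prune old data
restic -r sftp:backup@backuphost:/srv/restic-repo forget --keep-daily 7 --keep-weekly 4 --prune
```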

Phil
 
Of course I would make individual virtual disks for the data and the OS
That does not solve the issues I have with blockstorage.
A virtual disk will still use a fixed block size of 16k with the ZFS defaults, because ZVOLs have a static volblocksize.
Datasets are not blockstorage: they have an upper limit called the recordsize. Files on a dataset don't have a static record size; they are only limited by the maximum recordsize you set.

Or to give a real example: a 1MB file on a dataset with a 1MB recordsize will be a single record.
The same 1MB file inside a VM, on a raw disk backed by a ZVOL with 16k volblocksize, will be split into 64 blocks.
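
You can check this yourself on the pool; the ZVOL and dataset names below are just examples:

```bash
# ZVOL backing a VM disk: volblocksize is fixed when the ZVOL is created
zfs get volblocksize rpool/data/vm-100-disk-0

# Dataset: recordsize is only an upper limit and can be changed later
zfs get recordsize tank/ncdata
zfs set recordsize=1M tank/ncdata   # newly written files use records up to 1M
```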

There are many, many downsides to blockstorage. You can't zfs send the data and still have it easily searchable on the receiving side, you get less compression because of the smaller chunks, you get much worse metadata and ARC performance, you can't use rsync or S3 to send files somewhere else directly, and you may even have to worry about pool geometry and padding...

So IMHO you don't want to use it unless you have to. It's fine for running a Nextcloud LAMP stack or a Docker Nextcloud, but not so great for storing the Nextcloud data itself.