ZFS, Storage and Snapshot considerations for High-Change-Scripts

Jul 24, 2023
Hey PVE-Community,

Sorry for the wall of text, but I'd just like to explain the setup ;)

I have a 2-node cluster with a PBS connected. The servers are Hetzner AX102 machines, each with 2 x 1.92TB NVMe SSDs attached. I use it mainly to serve one large VM which runs a couple of websites, plus some smaller VMs and one container.

The VM has a 1.5TB volume/disk attached, but only uses about 900GB currently. As I don't see an option to reduce this without creating a new VM and rsyncing all the data to the second host, here is my problem:

Current Setup

Bash:
# zfs list
NAME                           USED  AVAIL  REFER  MOUNTPOINT
rpool                         1.34T   346G    96K  /rpool
rpool/ROOT                    3.08G   352G    96K  /rpool/ROOT
rpool/ROOT/pve-1              3.08G   352G  3.08G  /
rpool/data                    1.33T   346G   104K  /rpool/data
rpool/data/subvol-202-disk-0  1.05G  6.95G  1.05G  /rpool/data/subvol-202-disk-0
rpool/data/vm-100-disk-0      1.33T   346G  1.33T  -
rpool/data/vm-101-disk-0      90.9M   346G  90.9M  -
rpool/data/vm-101-disk-1      39.8M   346G  39.8M  -
rpool/data/vm-102-disk-0      4.77G   346G  4.77G  -

Unfortunately, when setting this up I did not think too much about storage and ZFS, especially in terms of sizing.

The current setup uses:
- Storage replication between the PVE hosts VIRTUAL01 and VIRTUAL02, so the VMs have a failover available if a physical server really crashes. HA is disabled; I would fail over manually if really needed.
- PBS with a daily backup to the PBS server, more for disaster recovery, because restoring the 1.5TB takes about 3 hours, which is too long for a failover scenario in my case.

Problem

Everything was running fine until tonight, when my Opsgenie app called me to tell me that the servers had gone offline. Luckily, due to a similar incident last week, I had at least set a reservation for the PVE ZFS volume itself, so the virtualization host VIRTUAL01 was still accessible via SSH and the GUI.


At first, I could not imagine what would cause this issue, because I do not use any thin provisioning or similar. Of the 1.84TB of available disk space, about 1.6TB is for the main VM and the remaining storage is attached to the smaller instances or simply not used.

Going by the timing, this happened at 02:15, when some backup scripts on the virtual machine itself are running, and this led me to the root cause:

As I use the storage replication feature, this of course creates a snapshot in the background; you just don't see it in the GUI.
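They are visible from the shell, though. Something like this should list them (the job ID and timestamp in the snapshot name are just an illustration of the naming scheme, not my actual output):

Bash:
# List snapshots of the replicated zvol; the replication job keeps one named
# roughly __replicate_<jobid>_<timestamp>__
zfs list -r -t snapshot -o name,used,creation rpool/data/vm-100-disk-0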

Some data scripts running on the VM at 02:15 create a lot of big TAR files (about 250GB) and send them to another customer of mine. And as the storage replication only runs once a day (which is fine for me), the snapshot keeps growing due to copy-on-write and fills up the storage until it's 100% full. Then the server goes offline.

Although I just did not think about it, it makes sense that this happens. I'm not sure why it did not happen in the last months, but maybe the amount of changed data was low enough to fit into the free space. And after the replication is applied, the snapshot size drops back to 0 anyway for the next window.

So my question is:

I cannot get rid of those scripts which just generate a lot of data and export them. So this will happen again if I keep the same setup.
But as I run the storage replication anyway, one idea would be to run the scripts on the "cold" replication on VIRTUAL02.

I don't need to (and I think I cannot) power up the same replicated VM on the VIRTUAL02 host. So my options would be:
  • Automatically create a VM and attach the same Disk to the new VM
  • Mount the /dev/zvol/rpool/data/vm-100-disk-0-part1 directly on the second PVE host VIRTUAL02 and run the scripts there
The second option sounds like bad practice to me: mounting a guest ZFS volume directly on the host. Is this really bad? Can anything go wrong, or could I maybe mount it read-only?
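Just to make it concrete, a read-only mount on the replica host would look roughly like this - only a sketch, assuming the guest filesystem is ext4, and it would have to be unmounted again before the next replication run:

Bash:
# Sketch only: mount the replicated partition read-only on the second host.
# 'noload' is ext4-specific and skips journal replay, so nothing gets written.
mkdir -p /mnt/vm100-ro
mount -o ro,noload /dev/zvol/rpool/data/vm-100-disk-0-part1 /mnt/vm100-ro
# ... run the export scripts against /mnt/vm100-ro ...
umount /mnt/vm100-ro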

Or what would you suggest to solve my issue?

The main VM (which has the 1.6TB volume attached) only uses about 900GB for now, but I don't see any chance of reducing this volume with the current amount of free space. The only way I see is disabling storage replication, creating a brand new VM on the second server with a smaller disk/volume, and then running rsync until I can switch over to the new VM and delete the old one. At least I am not aware of any solution to reduce the size of the existing ZFS volume without a huge headache ;)

Cheers,
Andy
 
- Run storage replication more often, so your snapshot is smaller.
- Add a second disk to that VM, check skip replication, and use it as the destination for those TAR files (see the sketch below). You will need to manually create the disk on the second node in order for the VM to boot properly if migrated to the other PVE node.
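Something along these lines from the CLI (VM ID, bus/slot, storage name and size are just examples):

Bash:
# Add a 300G scratch disk on local-zfs that is excluded from replication
qm set 100 --scsi1 local-zfs:300,discard=on,replicate=0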

Is thin provision checked in Datacenter, Storage, local-zfs?
 
Hey @VictorSTS - thanks for the help. I did not know about the "skip replication" feature for a single disk; I'll think about which of your two suggestions to use.

And yes, "thin provision" is enabled in the Datacenter->Storage configuration.
Does this mean that every disk I create on this storage is always thin provisioned and only takes the space it really needs? Because in the VM I created the file system with 1.6TB, so it might take up that much on the storage as well.

But reading this again, it seems it does not automatically take the full 1.6TB..

zfs list shows a usage of 1.33TB, but inside the VM the disk shows up as 1.5TB:

Code:
# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda      8:0    0  1.5T  0 disk
├─sda1   8:1    0  1.5T  0 part /
├─sda2   8:2    0    1K  0 part
└─sda5   8:5    0  975M  0 part [SWAP]
sr0     11:0    1 1024M  0 rom


I would love to find a way to shrink the VM disk as well, down to 1TB, since it does not need more for now, but I don't think there is an easy way to do this..
 
Does this mean that every disk I create on this storage is always thin provisioned and only takes the space it really needs?
Correct. But to keep disks thin provisioned, discard must be checked for every disk in the VM configuration and VirtIO SCSI [Single] should be used (there are other options, but these are the preferred ones).
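Roughly the CLI equivalent, if you prefer it over the GUI (VM ID and disk/storage names are assumptions; the drive options need to be re-specified when re-setting the disk):

Bash:
# Use the VirtIO SCSI single controller and enable discard + SSD emulation on the disk
qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi0 local-zfs:vm-100-disk-0,discard=on,ssd=1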

I would love to find a way to shrink the VM disk as well, down to 1TB, since it does not need more for now, but I don't think there is an easy way to do this..
If thin provisioning works properly, it doesn't really matter, as you will not have more data in your disks than what's currently used inside the VMs. For peace of mind, you may use disk quotas inside the VM and it will never use more than that.
 
Thanks for the update. I did not set the Discard option on the hard disk itself, and I assume it does not help much to set it now, after the disk has already been heavily used?

The guest operating system is a regular Debian 12.

 
Check discard and SSD emulation, reboot the VM then run fstrim -v / (or whichever mount point the disks are on). Unused space should be recovered in the storage eventually.

EDIT: the fstrim operation will be applied to the replicated disk too when the replication runs, so it will recover free space on the other server as well.
 
Amazing ;) Your information was really valuable, especially as I have only been using Proxmox for a short time and without proper training :-D

I just checked on a small test VM and it seems to work; zfs list now shows less space used for rpool/data/vm-102-disk-0, as the command freed up some space:
/: 5.1 GiB (5434249216 bytes) trimmed on /dev/sda1

I will do this on the big main VM as well; there shouldn't be any risk, I assume?


So now the volume usage on rpool/data/vm-102-disk-0 went down from 4.77GB to 1.31GB.

I assume the GUI just shows the configured size for the VM disk, not the actual usage when thin provisioning is applied:
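Both numbers can also be checked on the shell side; volsize is what the GUI shows as the disk size, while used is what the zvol actually occupies:

Bash:
zfs get volsize,used,refreservation rpool/data/vm-102-disk-0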


Bash:
# zfs list
NAME                           USED  AVAIL  REFER  MOUNTPOINT
rpool                         1.34T   349G    96K  /rpool
rpool/ROOT                    3.08G   356G    96K  /rpool/ROOT
rpool/ROOT/pve-1              3.08G   356G  3.08G  /
rpool/data                    1.33T   349G   104K  /rpool/data
rpool/data/subvol-202-disk-0  1.05G  6.95G  1.05G  /rpool/data/subvol-202-disk-0
rpool/data/vm-100-disk-0      1.33T   349G  1.33T  -
rpool/data/vm-101-disk-0      90.9M   349G  90.9M  -
rpool/data/vm-101-disk-1      39.8M   349G  39.8M  -
rpool/data/vm-102-disk-0      1.31G   349G  1.31G  -
 
But in general, before I learned all those things about the thin provisioning setup - what do you think about the approach of running those (also CPU-intensive) tasks on VIRTUAL02 against the replicated storage?
  • Automatically create a VM and attach the same Disk to the new VM
  • Mount the /dev/zvol/rpool/data/vm-100-disk-0-part1 directly on the second PVE host VIRTUAL02 and run the scripts there
It really sounds hacky, and I don't like the idea of mounting guest storage disks on PVE just to run some scripts, especially as they would then run as root - but it would also move the load away from the primary server ;)

Because although
Run storage replication more often, so your snapshot is smaller.
sounds very reasonable, the big chunk of data is generated within 30-60 minutes, and I thought running storage replication every 30 minutes would just produce load on the server and disks during the day, which I wanted to avoid...
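If I did try it, changing the interval itself is simple enough - something like this, assuming the replication job ID is 100-0 (the schedule can also be limited to a time window using the calendar-event syntax from the docs):

Bash:
pvesr list                              # show configured replication jobs
pvesr update 100-0 --schedule '*/30'    # run job 100-0 every 30 minutes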
 
In a production environment, "hacky" directly translates to "unsupported", and for me that is a no-go. If you want to go that route, check the ZFS snapshots and mount a read-only snapshot on the second server.

IMHO, this is probably much easier, safer and supported if done at the app level: just mount the needed directories using NFS or sshfs from a VM running on the second PVE node.
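For example, from a helper VM on the second node (host name and paths are made up, mounted read-only to be safe):

Bash:
apt install sshfs
mkdir -p /mnt/mainvm-data
sshfs backup@mainvm.example.com:/srv/data /mnt/mainvm-data -o ro
# ... build and ship the TAR files from here ...
fusermount -u /mnt/mainvm-data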
 
@VictorSTS thanks again for your help. I've just changed the large volume to thin provisioning and added the second disk without replication, and everything works like a charm.

I will check the docs to see if I can also just shrink the ZFS volume down to 900G, because I really don't need the storage, or if this always requires a new volume and a send/receive or rsync of all the data..

Bash:
# zfs list rpool/data/vm-100-disk-0
NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool/data/vm-100-disk-0   736G   972G   735G  -
 
> I will check the docs to see if I can also just shrink the ZFS volume down to 900G, because I really don't need the storage, or if this always requires a new volume and a send/receive or rsync of all the data

DO NOT TRY THIS WITHOUT A BACKUP, but I just tested this with a file-backed pool on zfs 2.2.3:


Code:
## on an XFS filesystem

truncate -s 1G zdisk1
truncate -s 2G zdisk2

zp=ztestevac; zpool create -o ashift=12 -o autoexpand=on -o autoreplace=off -O atime=off -O compression=zstd-3 $zp $PWD/zdisk2

## copy a ~500MB ISO to the temporary pool

zpool add ztestevac $PWD/zdisk1

zpool remove ztestevac $PWD/zdisk2

zpool status ztestevac

  pool: ztestevac
 state: ONLINE
remove: Removal of vdev 0 copied 182M in 0h0m, completed on Wed May 22 10:50:31 2024
        1.24K memory used for removed device mappings
config:
        NAME                        STATE     READ WRITE CKSUM
        ztestevac                   ONLINE       0     0     0
          /mnt/seatera4-xfs/zdisk1  ONLINE       0     0     0
errors: No known data errors

I wrote a script recently to migrate a mirrored zfs boot/root to smaller disks and this works with a mirror / raid10-equivalent pool, so:

Start with a 2GB disk
Add / stripe a 1GB disk onto the pool with no redundancy (MAKE SURE apps are stopped and nothing else is writing to the pool!)

Remove the original 2GB disk from the pool

ZFS evacuates the data onto the smaller disk, as long as it has enough space to fit (wait for resilver to finish!)

Remove original larger disk as it is no longer part of the pool

Bob = your uncle ;-)

IDK how much this concept affects I/O performance on the pool going forward, but it appears to be fairly easy. If it starts to matter, you could always send/recv the data onto a fresh pool later on.

EDIT: of course, if you're trying this on an rpool, you will need to use proxmox-boot-tool to fix EFI boot. I recommend trying this only on a single-disk data pool, or using my script on a mirror pool - again, only after backing up.

https://github.com/kneutron/ansites...replace-zfs-mirror-boot-disks-with-smaller.sh
 
Hey @Kingneutron,

of course I would never do that without having a proper backup or a recent replication on the second node..

I do not want to change the root device of the PVE host itself; it's "just" one volume of a PVE guest - but of course it's the only disk of this guest, and it has multiple partitions like a default Debian installation.

And the hosts only have two physical disks - I had really never thought of adding an ordinary file to a zpool; I did not know that this could work at all :-D

So I could try following your advice/script and create the pool using such files as "disks", and then, after everything has been changed, do a zfs send/receive to move the data back to the original zpool backed by the physical disks...

Anyway, I'm not sure if I really want to go down this risky path. Just to get a better gut feeling about the thin provisioning and reduce the risk of overprovisioning, maybe a simple refreservation solves this issue as well ;)
I would just like a clear state of the disks - I would not even need thin provisioning, but with a smaller disk I could track it better..
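If I go with the refreservation idea, it would just be something like this (the 1T value is only an example, not a recommendation):

Bash:
zfs get refreservation rpool/data/vm-100-disk-0      # current setting
zfs set refreservation=1T rpool/data/vm-100-disk-0   # guarantee 1T of pool space for the zvol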
 
You already have a ZFS rpool with ~964GB free; you don't want to create a ZFS pool on top of ZFS.
It would be better if you had a temporary disk you can add to the system (or a NAS): create a ZFS pool on it and copy the data there temporarily.
 
Thanks - as it's a production root server in a datacenter, I don't have physical access and can't just add/remove disks.. But maybe I'll do it with one of my spare servers and create the pool there. Then again, thinking about it, maybe it's just not worth the effort and risk ;)
 
