ZFS pool with mixed speed & size vdevs

pancakes

I have a fast 2TB NVMe SSD and a slow 4TB SATA SSD striped together (RAID0) in one ZFS zpool. I like ZFS (over LVM, which I had before) for its versatility in managing storage (e.g. one pool to use, including the pve root), but I'm running into the issue that the mixed hardware slows things down. With LVM I could make an lvm-thin pool that used only the fast disk (if created first), and store static data on a second lvm-thin pool on the remainder of the fast disk + the slow disk. It turns out ZFS filled up my fast disk first, and now I only have the slow disk left:

Code:
zpool list -v
NAME                                                  SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool                                                5.44T  3.20T  2.24T        -         -    12%    58%  1.00x    ONLINE  -
  nvme-eui.0025385a11b2xxxx-part3                    1.82T  1.72T  95.0G        -         -    35%  94.9%      -    ONLINE
  ata-Samsung_SSD_860_EVO_4TB_S45JNB0M500432F-part3  3.64T  1.48T  2.14T        -         -     1%  40.8%      -    ONLINE

which degrades my performance:
Code:
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=2g --iodepth=1 --runtime=30 --time_based --end_fsync=1
pve host zfs raid0
  WRITE: bw=53.1MiB/s (55.7MB/s), 53.1MiB/s-53.1MiB/s (55.7MB/s-55.7MB/s), io=1605MiB (1683MB), run=30229-30229msec

pve host lvm:
  WRITE: bw=220MiB/s (231MB/s), 220MiB/s-220MiB/s (231MB/s-231MB/s), io=6690MiB (7015MB), run=30431-30431msec

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=128m --numjobs=16 --iodepth=16 --runtime=30 --time_based --end_fsync=1
pve host zfs raid0
  WRITE: bw=741MiB/s (777MB/s), 16.8MiB/s-265MiB/s (17.6MB/s-278MB/s), io=22.6GiB (24.3GB), run=31220-31226msec

pve host lvm:
  WRITE: bw=2429MiB/s (2547MB/s), 140MiB/s-164MiB/s (147MB/s-172MB/s), io=72.7GiB (78.0GB), run=30127-30641msec

I see four options, none of which are ideal:
1. Accept this and move on
2. Re-create a setup with 2 zpools, one on the fast disk and one on the slow disk --> can I use a fraction of the fast disk (e.g. 500GB) for running virtual machines and use the remainder for data? I.e. can I create a zpool of partitions?
3. Somehow balance writing 2:1 between these asymmetric disks --> this will still mean that 66% of the data will be written to the slow disk, so my performance will still suffer
4. Manually move the static data off the fast disk to the slow disk (this data is hardly changing) --> is this possible? How?

Ideally I'd like to assign a zfs dataset to have a certain affinity with a vdev, e.g. zfs create rpool/backups/data-tim -affinity /dev/sda such that it has a preferred vdev to write to. Is this possible? Any other suggestions besides the above? Thanks :)
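
For reference, here is roughly what I imagine options 2 and 4 could look like on the command line. This is only a sketch: the pool, partition and dataset names are made up, and as far as I understand ZFS cannot pin a dataset to a single vdev inside one pool, so option 4 would only work across two separate pools:
Code:
# Option 2: a zpool can be built from partitions instead of whole disks.
# Hypothetical layout: nvme0n1p4 = ~500GB for VMs, nvme0n1p5 + sda1 = bulk data
# (in practice the /dev/disk/by-id/ paths are preferable to sdX/nvmeX names).
zpool create fast /dev/nvme0n1p4             # VM pool on the fast partition
zpool create bulk /dev/nvme0n1p5 /dev/sda1   # striped bulk pool, no redundancy

# Option 4: with two pools, a static dataset can be moved with zfs send/receive
zfs snapshot rpool/backups/data-tim@move
zfs send rpool/backups/data-tim@move | zfs receive bulk/data-tim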
 
I have combined one or more HDDs with an SSD (of different sizes) in a ZFS mirror (RAID1), which really improved read speeds/IOPS (for PVE) and worked fine (and still does for PBS).
You could partition the larger drive so that you can create a ZFS mirror (higher read IOPS for VMs) with the faster drive and keep the rest for storage you don't really care about (templates, downloads, ISOs, first set of backups that you also sync to another system, etc.).
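
A minimal sketch of what I mean, with made-up partition numbers and sizes (use the real /dev/disk/by-id paths and sizes that fit your drives):
Code:
# Split the 4TB SATA SSD: one ~2TB partition to mirror the NVMe, the rest for bulk data
sgdisk -n1:0:+1863G -n2:0:0 /dev/sda

# Mirror of the NVMe and the matching SATA partition: better read IOPS for the VMs
zpool create fast mirror /dev/nvme0n1p4 /dev/sda1

# Single-drive pool on the remainder for templates, ISOs, first set of backups, etc.
zpool create bulk /dev/sda2

Note that writes on such a mirror are limited by the slower side; the gain is mainly on reads.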
 
Thanks for the suggestion. That would mean two zpools, correct? What would be the advantage of making two single disk zpools, e.g. 0.5TB of my fast drive + 1.5TB / 4TB of my fast / slow drive?
 
That would mean two zpools, correct?
Yes, one as a mirror of two drives and one with a single drive.

What would be the advantage of making two single disk zpools, e.g. 0.5TB of my fast drive
I think I understand what you mean by this, but I don't really see what the advantage of a single drive would be, except maybe write speed (since this is the fastest drive).
+ 1.5TB / 4TB of my fast / slow drive?
I don't understand what this means. I don't think that ZFS supports something like this. It sounds a bit like what you said before:
3. Somehow balance writing 2:1 between these asymmetric disks --> this will still mean that 66% of the data will be written to the slow disk, so my performance will still suffer
I don't think that ZFS works like that. Maybe BTRFS does, but I don't have experience with it.
 
> I have a fast 2TB NVME SSD and a slow 4TB SATA SSD inside a RAID0 zfs zpool. I like zfs ... for its versatility in managing storage (e.g. one pool to use, including the pve root), however I'm running into the issue that my mixed hardware slows down

Yeah, this is like harnessing a horse and buggy to a Corvette. Not a good idea, as you won't get good performance out of either. And it confuses the hell out of the horse.

Run your proxmox rpool off the slower SATA and your LXC/VMs off the nvme using either lvm-thin or a separate zfs pool.

> 2. Re-create a setup with 2 zpools, one on the fast disk and one on the slow disk --> can I use a fraction of the fast disk (e.g. 500GB) for running virtual machines and use the remainder for data? I.e. can I create a zpool of partitions?

I think you have a very flawed understanding of zfs. You have a 2TB NVMe and a 4TB SATA SSD: use ~500GB for rpool on the slower SATA and create a separate partition + filesystem on it (e.g. XFS) for data that doesn't need the NVMe speed. You will need to mount this the regular way in /etc/fstab, and define it in PVE Storage if you want PVE to have access to it for ISOs and the like.
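
Roughly like this (device, mount point and storage names are placeholders; the last step can also be done in the GUI under Datacenter -> Storage -> Add -> Directory):
Code:
# Filesystem on the spare SATA partition and a mount point
mkfs.xfs /dev/sda4
mkdir -p /mnt/bulk

# Mount it the regular way at boot (using the filesystem UUID instead of /dev/sda4 is more robust)
echo '/dev/sda4  /mnt/bulk  xfs  defaults  0 2' >> /etc/fstab
mount /mnt/bulk

# Make it available to PVE as a directory storage for ISOs, templates, backups
pvesm add dir bulk --path /mnt/bulk --content iso,vztmpl,backup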

Be aware that if you ever have to reinstall, the PVE ISO will wipe the target disk(s) for boot/root, so you still need backups.
 
Thanks, I think I'll have to re-install my Proxmox on a new ZFS layout, as I expected but hoped I could prevent ;) I think I'll skip the mirror option and instead go for two zpools:
1. on a partition of the fast disk (~300GB, I'm currently only using 60GB of VMs and no plan to expand big time),
2. spanning the remainder of the fast disk + all of the slow disk.

One remaining question: is there any way to easily rebalance space between these two pools (in case I need more storage in one or the other)? I noticed that increasing a partition's size is easy, so I'm wondering whether a third layout might be an idea (see the sketch after this list):

1. on a partition of the fast disk (~100GB, enough for 60GB of VMs),
2. 300GB of unpartitioned space as a 'buffer' to be added to either zpool 1 or 3,
3. spanning the remainder of the fast disk + all of the slow disk.
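
From what I've read, growing a pool onto an enlarged partition would look roughly like this (pool and device names are examples), keeping in mind that a pool can only grow and never shrink, so the buffer can only be handed out once:
Code:
# Grow the underlying partition first, e.g. with parted
parted /dev/nvme0n1 resizepart 4 100%

# Then let ZFS pick up the extra space, either automatically ...
zpool set autoexpand=on rpool

# ... or explicitly for that one device
zpool online -e rpool /dev/nvme0n1p4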
 
1. on a partition of the fast disk (~300GB, I'm currently only using 60GB of VMs and no plan to expand big time),
2. spanning the remainder of the fast disk + all of the slow disk.
What is the point?
Install PVE on the slower SATA SSD because no speed is required for PVE itself; rpool will be the slow zpool.
Second zpool on the whole NVMe SSD.
Balancing data between the two disks is then done by moving vDisks between the two storages.
(don't forget ZFS eats up / quickly wears out consumer-grade SSDs...)
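
Moving a vDisk can be done in the GUI (VM -> Hardware -> select the disk -> Move Storage) or on the CLI, roughly like this (IDs and storage names are examples; exact option names can differ a bit between PVE versions):
Code:
# move VM 100's scsi0 disk to the NVMe-backed storage and drop the old copy
qm move-disk 100 scsi0 nvme-pool --delete 1

# same idea for a container volume
pct move-volume 101 rootfs nvme-pool --delete 1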
 
Thanks for your reply @_gabriel. The point is that I want my VMs on the fast disk and static storage on the slow disk; however, I don't want to use the full 2TB NVMe disk for VMs, hence the partitions. Apologies if this was not clear.

I have now made two pools, as I understood @leesteken's suggestion: one on (a part of) the fast NVMe disk, the other on the remainder of the fast disk + the full slow SATA disk:
Code:
tim@pve:/mnt/testvol$ zpool list -v
NAME                                SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool                               298G   114G   184G        -         -     1%    38%  1.00x    ONLINE  -
  nvme-eui.0025385a11b2d5e3-part3   299G   114G   184G        -         -     1%  38.3%      -    ONLINE
tank                               5.05T  1.37T  3.68T        -         -     0%    27%  1.00x    ONLINE  -
  nvme0n1p5                        1.43T   918G   538G        -         -     0%  63.0%      -    ONLINE
  sda                              3.64T   481G  3.16T        -         -     0%  12.9%      -    ONLINE

however, I still have the same/similar performance:
Code:
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=2g --iodepth=1 --runtime=30 --time_based --end_fsync=1
  WRITE: bw=70.0MiB/s (73.4MB/s), 70.0MiB/s-70.0MiB/s (73.4MB/s-73.4MB/s), io=2121MiB (2224MB), run=30277-30277msec

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=128m --numjobs=16 --iodepth=16 --runtime=30 --time_based --end_fsync=1
  WRITE: bw=741MiB/s (777MB/s), 16.8MiB/s-265MiB/s (17.6MB/s-278MB/s), io=22.6GiB (24.3GB), run=31220-31226msec

I also created an ext4 partition on the NVME disk (I kept 100GB free), which actually does have good performance as expected:
Code:
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=2g --iodepth=1 --runtime=30 --time_based --end_fsync=1
  WRITE: bw=198MiB/s (208MB/s), 198MiB/s-198MiB/s (208MB/s-208MB/s), io=6082MiB (6377MB), run=30698-30698msec

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=128m --numjobs=16 --iodepth=16 --runtime=30 --time_based --end_fsync=1
  WRITE: bw=1698MiB/s (1780MB/s), 99.6MiB/s-118MiB/s (104MB/s-124MB/s), io=50.8GiB (54.5GB), run=30103-30626msec

so it seems the underlying problem wasn't combining a fast NVMe & slow SATA disk in one pool, but rather how ZFS itself performs in these fio benchmarks. I've also read that ZFS performance is slower than ext4 under some benchmarks, and that tuning (reducing) the ZFS record size could help. I will continue testing ZFS pools with different settings on my machine to see what gives me optimal performance, also realising that fio benchmarks might not represent a fully realistic workload. If anyone has other suggestions I'd be happy to hear them :)
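
In case it helps someone who finds this later: recordsize is a per-dataset property, so my plan is to benchmark along these lines (16K is just one of the values I'll try, not a recommendation, and the dataset name is an example):
Code:
# current value (the default is 128K)
zfs get recordsize tank

# test dataset with a smaller recordsize, then run fio inside its mountpoint
zfs create -o recordsize=16K tank/fio-test
cd /tank/fio-test
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 \
    --size=2g --iodepth=1 --runtime=30 --time_based --end_fsync=1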