Nested virtualisation (filesystems) on ZFS

crembz

Member
May 8, 2023
Hi there,

So in building my home lab I've been generally pleased with the performance of Windows and Linux VMs despite using consumer-grade NVMe/SSD drives.

Most of my hosts contain a single SATA SSD and a single NVMe drive of the same size. The plan was to set these up as a mirror, knowing the SSD will drag the performance of the NVMe back a fair amount.

I have noticed, however, with both ESXi and Nutanix nested on PVE, that the performance is abysmal: neither is able to load its GUI properly, seemingly taking a very long time to initialise. Moving from ZFS to LVM-thin (ext4 root), the issues disappear, with both loading within reasonable timeframes. My thinking is that having VMFS or ADSF on top of ZFS on top of consumer SSDs is just amplifying writes chronically.
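A rough way to sanity-check that theory, assuming a pool called rpool and a zvol-backed disk named vm-100-disk-0 (both placeholder names), is to compare what the nested hypervisor thinks it is writing with what the pool is actually writing:

    # On the PVE host: watch the real write volume hitting the pool (5 s intervals)
    zpool iostat -v rpool 5

    # Check the volblocksize the nested hypervisor's disk was created with
    # (the zfspool storage's "blocksize" option controls this for new disks)
    zfs get volblocksize rpool/data/vm-100-disk-0

    # Lab-only experiment: disable sync writes on that one zvol and see whether
    # the nested GUI becomes responsive (this trades crash safety for speed)
    zfs set sync=disabled rpool/data/vm-100-disk-0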

I'm using these in a lab context to have a play with IaC and automation across hypervisors.

So this leaves me with a dilemma. I love the ability to replicate natively with PVE using ZFS, and migration between local ZFS stores is lightning fast compared to LVM, which seems to need to sync the entire disk across. It also seems like, under regular loads, ZFS has the edge for speed. However, it would seem that if I want to run these nested hypervisors I will need a change of plans.
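(For reference, the native replication mentioned here is driven by pvesr; a minimal sketch, assuming VM 100 and a second node called pve2, both placeholder names:

    # Replicate VM 100 to node pve2 every 15 minutes
    pvesr create-local-job 100-0 pve2 --schedule '*/15'

    # Show replication jobs, their state and last sync time
    pvesr status

After the first full send only incremental snapshot deltas go across, which is why moves between local ZFS stores feel so much faster than LVM's full-disk copy.)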

Option 1: have one single LVM-thin disk for the nested hypervisors and a single ZFS disk for everything else (a rough storage.cfg sketch follows below). This provides no real redundancy, with only the workloads living on ZFS capable of replication and HA.
Option 2: Run mdadm + LVM for redundancy and forget about HA completely. I will need to rely on backup/restore.
Option 3: Use a USB SSD with LVM-thin for the nested hypervisors (relying on backups for protection) and run a mirrored ZFS root for everything else.
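For Option 1, a minimal /etc/pve/storage.cfg sketch with the two stores side by side (local-lvm and local-zfs are the usual default names; the volume group, thin pool and dataset names below are assumptions):

    lvmthin: local-lvm
            thinpool data
            vgname pve
            content images,rootdir

    zfspool: local-zfs
            pool rpool/data
            content images,rootdir

The nested hypervisors' disks would then simply be placed on local-lvm and everything else on local-zfs.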

I don't really need HA on the nested hypervisors since they are purely sandpits. HA would be useful for the other workloads running on the cluster (TrueNAS, several Docker hosts, cloud management portal).

Anyone care to comment on those three options and help me rationalise them?

Thanks.
 
ZFS on top of consumer SSDs
This is never good, never!
I'm running VMware and Hyper-V on enterprise hard disks with normal performance.

I recommend buying used enterprise SSDs and running everything from them, even in a homelab. You will not have fun with consumer SSDs.
 
This is never good, never!
I'm running VMware and Hyper-V on enterprise hard disks with normal performance.

I recommend buying used enterprise SSDs and running everything from them, even in a homelab. You will not have fun with consumer SSDs.
Totally understand that, and that's the plan longer term; however, for now I have what I have.

Are you using enterprise HDDs or SSDs?

Out of the three options given my current constraints which would you suggest?

Option 1: have one single LVM-thin disk for the nested hypervisors and a single ZFS disk for everything else. This provides no real redundancy, with only the workloads living on ZFS capable of replication and HA.
Option 2: Run mdadm + LVM for redundancy and forget about HA completely. I will need to rely on backup/restore.
Option 3: Use a USB SSD with LVM-thin for the nested hypervisors (relying on backups for protection) and run a mirrored ZFS root for everything else.
 
Are you using enterprise HDDs or SSDs?
Both. I love used enterprise hardware for home labbing.

Option 1: have one single LVM-thin disk for the nested hypervisors and a single ZFS disk for everything else. This provides no real redundancy, with only the workloads living on ZFS capable of replication and HA.
I'm strongly against everything non-redundant, even in a homelab. Are you limited in the number of disks? What about just running Hyper-V and VMware from a USB stick instead of PVE if you need it?

Option 2: Run mdadm + LVM for redundancy and forget about HA completely. I will need to rely on backup/restore.
You won't have HA with ZFS replication either; you need real shared storage for that.

Option 3: Use a USB SSD with LVM-thin for the nested hypervisors (relying on backups for protection) and run a mirrored ZFS root for everything else.
I tried using USB3 and it's not going to fly as well as you want. USB3 can have very variable latency, so it'll feel sluggish. This depends heavily on the USB3 controller and USB3-to-SATA bridge in use, so your mileage may vary.
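If you do want to test a particular USB enclosure before committing to it, a quick single-threaded 4k random-write test with fio shows the latency behaviour described above; /dev/sdX is a placeholder for the USB disk, and writing to the raw device destroys its contents:

    fio --name=usb-latency --filename=/dev/sdX --rw=randwrite --bs=4k \
        --iodepth=1 --direct=1 --sync=1 --runtime=30 --time_based \
        --ioengine=libaio

Watch the completion-latency percentiles in the output; a problematic bridge will show occasional spikes far above the average.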
 
Enterprise kit is so expensive in Australia ... I have been looking!
I'm strongly against everything non-redundant, even in a homelab. Are you limited in the number of disks? What about just running Hyper-V and VMware from a USB stick instead of PVE if you need it?
I am limited to 2 internal disks in 5 of the 8 systems, specifically one NVMe and one SSD.

Hyper-V is giving me no issues at all. VMware and Nutanix, on the other hand, are not playing nicely on ZFS; both are fine on LVM. All other workloads seem to be OK as well. Are you suggesting just using a USB stick for the virtual hard drive, or running the other hypervisors bare metal off a USB stick?

You won't have HA with ZFS replication either; you need real shared storage for that.
Hrm, I was under the impression that Proxmox would restart VMs in case of a node failure if they were being replicated?
 
Enterprise kit is so expensive in Australia ... I have been looking!

I am limited to 2 internal disks in 5 of the 8 systems, specifically one NVMe and one SSD.

Hyper-V is giving me no issues at all. VMware and Nutanix, on the other hand, are not playing nicely on ZFS; both are fine on LVM. All other workloads seem to be OK as well. Are you suggesting just using a USB stick for the virtual hard drive, or running the other hypervisors bare metal off a USB stick?


Hrm, I was under the impression that Proxmox would restart VMs in case of a node failure if they were being replicated?
Yes, but ZFS is still local storage and not a shared filesystem like NFS or a cluster filesystem like Ceph. So the nodes will never really be in sync, and you will lose 1+ minutes of data once a VM/node fails.
Whether that is tolerable or not depends on your services.
And ZFS replication won't scale well. 8 nodes with a sum of 800GB of redundant capacity would mean you need to buy 16x 1TB disks so you can create 8 mirrors storing 16 copies of everything across all nodes.
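(For context, the data you can lose is bounded by the replication schedule, and HA is configured separately from replication; a minimal sketch, assuming VM 100 and the placeholder replication job 100-0 from earlier:

    # Shrink the possible data-loss window by replicating every minute
    pvesr update 100-0 --schedule '*/1'

    # Put the VM under HA so it gets restarted on another node after a failure
    ha-manager add vm:100

Even at one-minute intervals you can still lose up to the last interval's worth of writes, which is the point being made above.)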
 
Yeah, I understand. I'm also guessing that if ZFS is taxing my hardware, Ceph is going to be worse, right?

I am happy to restore from backup and lose a little bit of data. There isn't a lot going on in the cluster that isn't backed up or synced to GitHub or another cloud location. I back up stuff to a NAS, so as long as I can restore the VMs I'm happy with a little downtime.

What bugs me about LVM is how long migrations take when using local disks ... otherwise I'd just use that.
 
For Ceph you need multiple fast NICs (10+ Gbit): for example, a dedicated 1 Gbit NIC for low-latency cluster communication, a fast NIC that transfers the data between the nodes, and maybe a third Gbit NIC for the internet and your services. It sounds like you are running some NUCs or similar where adding multiple NICs isn't really an option.
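(For reference, that NIC split maps onto Ceph's public and cluster networks in ceph.conf; the subnets below are placeholders:

    [global]
        public_network  = 192.168.1.0/24   # client/monitor traffic and your services
        cluster_network = 10.10.10.0/24    # OSD replication between nodes

Without a fast dedicated cluster_network, every write gets replicated across the same link your VMs use, which is where slow NICs hurt most.)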
 
Yeah, I'm running a couple of ATX-sized systems and a bunch of tiny PCs. Pretty low-end I know; it pales in comparison to some people's labs.
 
