New PVE 9 node - storage setup help

mikeyo

Hi

About to build a new Proxmox 9 node with the following hardware -

Motherboard: Asus W880-ACE-SE
CPU: Intel 285K
RAM: 96GB
Storage:
1 x 4TB Gen5 NVMe,
3 x 1-2TB Gen4 NVMe,
1 x 500GB SATA SSD for Proxmox install.

My plan is to populate all the NVMe slots (1x Gen5 and 3x Gen4). I'll be running a mix of VM, Docker and AI workloads.

I want the best I/O performance from the drives, and I am not convinced that creating a ZFS pool from the 3 x Gen4 NVMe drives will give me this.

Please can I have some suggestions on how to best lay out the storage for optimum I/O.

Thank you.
 
If you want the 'best' performance from the drives, ZFS is not going to cut it. ZFS is about ultimate data resiliency, and to this day it has not been fully optimized for NVMe drives.

I would suggest creating a ZFS pool for VM boot disks, using SAS/SATA SSDs as the main storage tier and assigning an NVMe device as the SLOG/ZIL.
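Roughly like this, as a minimal sketch (device names, pool name and storage ID are placeholders, not a recommendation for your exact hardware; keep in mind a SLOG only accelerates synchronous writes):

  # Mirrored pool on two SATA/SAS SSDs (placeholder device names)
  zpool create -o ashift=12 vmpool mirror /dev/sda /dev/sdb

  # Add a small NVMe partition as a separate log device (SLOG)
  zpool add vmpool log /dev/nvme0n1p1

  # Register the pool in Proxmox VE for VM/container disks
  pvesm add zfspool vmpool-zfs --pool vmpool --content images,rootdir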

Where ultimate performance is required for actual workloads, meaning latency as well as bandwidth, you can do PCIe passthrough of NVMe drives directly to the guests that actually need it, probably your AI guests.
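Something like the following, assuming a placeholder VM ID (101) and PCI address, with IOMMU enabled in the BIOS/UEFI and on the kernel command line, and the VM using the q35 machine type:

  # Find the PCI address of the NVMe controller you want to hand over
  lspci -nn | grep -i nvme

  # Pass the whole controller through to VM 101 (address is a placeholder)
  qm set 101 -hostpci0 0000:02:00.0,pcie=1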

A note, in case you weren't aware: that motherboard uses switching to provide the multiple M.2 ports, so you'll never get full bandwidth simultaneously from the drives connected through the chipset switch.
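You can check how the slots are wired and what link each drive actually negotiated with lspci (the PCI address below is a placeholder):

  # PCIe topology: shows which M.2 slots sit behind the chipset switch
  lspci -tv

  # Negotiated link speed/width for one NVMe controller
  lspci -vvs 0000:02:00.0 | grep -i lnksta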
 
I was actually thinking of just formatting the drives as XFS and using qcow2 volumes. ZFS made me curious about striping the drives for better throughput.
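For reference, that XFS + qcow2 layout would look roughly like this in Proxmox VE (device, mount point and storage ID are placeholders; you'd also want the mount in /etc/fstab or a systemd mount unit so it survives reboots):

  # Format one NVMe with XFS and mount it
  mkfs.xfs /dev/nvme1n1
  mkdir -p /mnt/nvme1
  mount /dev/nvme1n1 /mnt/nvme1

  # Register it as directory storage; VM disks can then be created as qcow2 files on it
  pvesm add dir nvme1-dir --path /mnt/nvme1 --content images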
 

qcow2 files will have the additional overhead of another filesystem layer, so I would go with LVM or ZFS block storage.
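For example, an LVM-thin pool on one of the NVMe drives, registered as block storage (device, VG/pool names and storage ID are placeholders; the percentage leaves headroom for the thin pool metadata):

  # Volume group plus thin pool on one NVMe
  pvcreate /dev/nvme2n1
  vgcreate nvme_vg /dev/nvme2n1
  lvcreate -l 90%FREE --thinpool nvme_thin nvme_vg

  # Register the thin pool so VM disks are carved out as block volumes
  pvesm add lvmthin nvme-thin --vgname nvme_vg --thinpool nvme_thin --content images,rootdir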
 
@mikeyo, not sure if you have considered it, but in my case I configured a standalone data-tier server cluster, and from there I present block devices via NVMe-oF (RDMA)/iSCSI/SMB to my compute servers across a 400G backbone. The only difficult thing was that I just couldn't find a software system that did everything I needed, so in the end I scripted it all up in Linux to perfection.
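Not my actual scripts, but a minimal sketch of exporting a drive over NVMe-oF/RDMA with the in-kernel nvmet target and attaching it from a compute node (NQN, device and IP are placeholders):

  # --- storage node: export a local NVMe via the in-kernel nvmet target ---
  modprobe nvmet
  modprobe nvmet-rdma
  SUBSYS=/sys/kernel/config/nvmet/subsystems/nqn.2025-01.lab:nvme1
  mkdir -p "$SUBSYS"
  echo 1 > "$SUBSYS/attr_allow_any_host"            # lab only; restrict hosts in production
  mkdir -p "$SUBSYS/namespaces/1"
  echo /dev/nvme0n1 > "$SUBSYS/namespaces/1/device_path"
  echo 1 > "$SUBSYS/namespaces/1/enable"

  PORT=/sys/kernel/config/nvmet/ports/1
  mkdir -p "$PORT"
  echo 192.0.2.10 > "$PORT/addr_traddr"             # placeholder target IP
  echo rdma       > "$PORT/addr_trtype"
  echo 4420       > "$PORT/addr_trsvcid"
  echo ipv4       > "$PORT/addr_adrfam"
  ln -s "$SUBSYS" "$PORT/subsystems/"

  # --- compute node: attach it with nvme-cli; it appears as a local /dev/nvmeXnY ---
  nvme connect -t rdma -n nqn.2025-01.lab:nvme1 -a 192.0.2.10 -s 4420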

Anyhow, after many years of 'dabbling/testing', isolating my data tier was the missing magical ingredient in my lab. I've never looked back once; I can literally do anything I want with any of my compute servers and never worry about storage, backups, resiliency, latency or bandwidth ever again.

PS: I dropped ZFS for personal psychological reasons; hardware RAID10 fits my needs every time.
 
the "fastest" would be lvm-thick on mdraid10 or directly to individual devices.

The thing you need to realize is that if the speed is above what you actually NEED, you are losing out on other features that may be much more important, namely snapshots, inline compression, active checksumming, etc., with no useful gain.
 
the "fastest" would be lvm-thick on mdraid10 or directly to individual devices.

It would also (in the case of mdraid10) be something not supported by the Proxmox VE developers, see https://pve.proxmox.com/wiki/Software_RAID#mdraid
In theory one could do this by first installing Debian and afterwards the Proxmox packages, but imho it's not really worth it. I would use HW RAID if for whatever reason it's preferred, otherwise I would go with ZFS or (although it's still a technology preview) btrfs.

The thing you need to realize is that if the speed is above what you actually NEED, you are losing out on other features that may be much more important, namely snapshots, inline compression, active checksumming, etc., with no useful gain.
Or the possibility to replicate snapshots to another host (which is also what Proxmox VE's storage replication mechanism and pve-zsync use).
See also: https://forum.proxmox.com/threads/f...y-a-few-disks-should-i-use-zfs-at-all.160037/
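Under the hood that boils down to incremental ZFS send/receive, roughly like this (dataset, snapshot and host names are placeholders):

  # Initial full replication of a VM disk dataset to another host
  zfs snapshot rpool/data/vm-100-disk-0@rep1
  zfs send rpool/data/vm-100-disk-0@rep1 | ssh backup-host zfs receive tank/replica/vm-100-disk-0

  # Afterwards only the deltas since the previous snapshot go over the wire
  zfs snapshot rpool/data/vm-100-disk-0@rep2
  zfs send -i @rep1 rpool/data/vm-100-disk-0@rep2 | ssh backup-host zfs receive tank/replica/vm-100-disk-0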
 
Read what it says carefully. PVE can use mdadm easily and with full tooling support, BUT YOU SHOULDN'T. I agree with their assessment ;)

but imho it's not really worth it.
There are use cases for it. The converse of my original assertion is that there are cases where you MUST use it or you won't meet your minimum performance criteria. There are always tradeoffs, which is why it's important to consider what your actual NEEDS are when evaluating any kind of solution, instead of hyperfocusing on just one factor or, worse, WANTS.

I would use HW RAID if for whatever reason it's preferred, otherwise I would go with ZFS or (although it's still a technology preview) btrfs.
HW RAID can be a solution with legacy (SATA, SAS) drives; I have yet to see an NVMe HW RAID that's worth anything. btrfs is lighter than ZFS, but less mature (fwiw).