PROXMOX ON A BAREMETAL SERVER WITH 18 x 3.84TB NVMe (ZFS?)

VDNKH

Renowned Member
Aug 8, 2016
Greetings, good evening, fellow forum members.

The company I work for is in the process of acquiring a High-Grade dedicated server from OVHCloud.

I've previously worked with servers from that provider, but smaller ones from the Advanced range, using Proxmox and installing with the ISO without using the templates offered by OVH.

I normally install Proxmox with ZFS on a RAID 1. If I have two more SSDs, I create a second RAID 1 and host the VMs on that mirror. (Always using datacenter SSDs with PLP.)

But in this case, with the OVH High-Grade server, I'll have two 960GB NVMe SSDs dedicated to the OS (Proxmox) and 18 x 3.84TB NVMe SSDs.

I was planning to create a RAID 10 with ZFS, using 14 or 16 of those NVMe drives and hosting the virtual machines we'll be migrating there (more than 30 VMs with Windows + SQL Server).
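
To make that concrete, what I have in mind is a pool of striped mirrors along these lines (the device names below are just placeholders, and the list would continue up to 14 or 16 drives):

Code:
# Sketch only: striped mirrors ("RAID 10") on the 3.84TB NVMe drives, ashift=12 for 4K sectors
zpool create -o ashift=12 vmpool \
  mirror /dev/nvme2n1 /dev/nvme3n1 \
  mirror /dev/nvme4n1 /dev/nvme5n1 \
  mirror /dev/nvme6n1 /dev/nvme7n1 \
  mirror /dev/nvme8n1 /dev/nvme9n1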

And the remaining SSDs will be used for backups and to create a SLOG device. (I don't know if 3.84TB is too much for the SLOG.)

The RAM is more than enough, so it can have a good ARC configuration.
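
If I need to cap the ARC explicitly, my understanding is that on Proxmox it is just a module option plus an initramfs refresh; the 256 GiB value here is only an example:

Code:
# Example only: limit ARC to 256 GiB (value is in bytes), then rebuild the initramfs and reboot
echo "options zfs zfs_arc_max=274877906944" > /etc/modprobe.d/zfs.conf
update-initramfs -u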

I wanted to know what you think of this configuration.

The server specifications are as follows:

Dual Intel Xeon Gold 6554S - 2x36c/2x72t - 2.2GHz/3GHz
1.5TB DDR5 ECC 4800MHz
2 x 960GB Datacenter Class NVMe SSDs (for the OS)
18 x 3.84TB NVMe SSDs
 
Why would you need SLOG/L2ARC if you're already all-flash? It would only slow you down; it was intended to provide fast response on slow spinning disks. Mirrors are fine; make sure you have snapshots and backups on another server.
 
Depends on the drives. In my case, with 8x Micron 7450/7500 MAX drives in a RAID 10, I switched from ZFS to Btrfs.
On MySQL databases in particular, Btrfs is 5-10x as fast as ZFS.
And in almost every other situation Btrfs is at least 2x faster.

I filed an issue about this against OpenZFS:
https://github.com/openzfs/zfs/issues/16993#issuecomment-2742393686

But there are too many ZFS fanboys. It may be true that Btrfs doesn't have some of the features ZFS has, but in my opinion the huge performance degradation on fast NVMe drives in RAID 10 is simply not worth it.

Aside from the 5x performance on Btrfs, I don't need to waste memory on ARC either.
 
Please don't use BTRFS for real-world use cases unless you really don't care about your data. I highly doubt BTRFS is any faster given the same parameters (proper sync reads/writes); in most benchmarks I've seen in the past, BTRFS is slower. If you see 5x speedups, your benchmarks are wrong: you are probably comparing a synced write with an async write (meaning you are benchmarking disk writes vs. RAM). ARC is in RAM and BTRFS uses Linux's kernel page cache, which is the same thing. You seem to misunderstand the basic mechanics of the system.
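
If you want to see the difference yourself, compare something like the following two fio runs against the same dataset (paths and sizes here are just examples); only the second one forces every write to be persisted before it is acknowledged, which is what a database actually does.

Code:
# Async writes: largely absorbed by RAM / in-flight dirty data
fio --name=async --filename=/tank/fio.test --size=4G --rw=randwrite --bs=16k \
    --ioengine=libaio --iodepth=32 --direct=0 --runtime=60 --time_based
# Sync writes: every write is flushed before the next is acknowledged
fio --name=sync --filename=/tank/fio.test --size=4G --rw=randwrite --bs=16k \
    --ioengine=sync --iodepth=1 --fsync=1 --runtime=60 --time_based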
 
You are simply a noob; read the GitHub issue to the end first.

Btrfs is comparable to ZFS in performance on HDDs, and maybe on normal consumer SSDs.

But I'm talking about 20 GB/s write speeds, 40 GB/s read speeds, and millions of IOPS.
As long as you don't have the hardware to compare, stop talking nonsense about things you don't have a clue about.

I have facts, benchmarks, and tests in the GitHub issue.
It's confirmed.
 
Hi, as you have 30 VMs, I would keep 2 of those SSDs as hot spares and split the other 16 SSDs into 2 ZFS pools with 4 mirrored vdevs (RAID 10) each.
This lets you balance the IOPS between the 2 pools, and if one pool is lost, you don't lose the whole server.

pool1
- SSD1-SSD2 (Mirror)
- SSD3-SSD4 (Mirror)
- SSD5-SSD6 (Mirror)
- SSD7-SSD8 (Mirror)
- SSD9 (Spare)

pool2
- SSD10-SSD11 (Mirror)
- SSD12-SSD13 (Mirror)
- SSD14-SSD15 (Mirror)
- SSD16-SSD17 (Mirror)
- SSD18 (Spare)

As mentioned above, SLOG shouldn't be used in this scenario.
Oh, and don't keep your backups on your production server.
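
For the spare part of that layout, it is just one extra command per pool (device name is a placeholder); ZFS will then resilver onto the spare automatically when a mirror member faults:

Code:
# Attach a hot spare to an existing pool and verify
zpool add pool1 spare /dev/nvme10n1
zpool status pool1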
 
@santiagobiali Thank you very much, my friend; I really appreciate your opinion and will take it into account.
 
Thank you for your help, it is greatly appreciated.
 
Greetings, friend. I have been reluctant to use Btrfs since I read, a long time ago, that it was somewhat experimental and should not be used in production. Anyway, I will try to get up to date on the current status of Btrfs.
 
BTRFS has been used on Synology as the default filesystem for a decade. And I have been running 12x Kioxia CM7-R in BTRFS RAID 10 plus 8x Micron 7450 MAX in BTRFS RAID 10 for a little over a month now.

No issues so far. It's true that BTRFS was a little experimental in the past, but I think that's not an issue anymore, especially on the 6.8/6.11/6.14 kernels we have been running lately.
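
For reference, the kind of array I'm talking about is nothing exotic (device names are placeholders):

Code:
# Btrfs RAID 10 for both data and metadata across the NVMe drives
mkfs.btrfs -L fastpool -m raid10 -d raid10 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mkdir -p /mnt/fastpool
mount -o noatime /dev/nvme0n1 /mnt/fastpool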

The performance gains we got in our company are night and day, especially with everything MySQL/MSSQL related.
Read queries are usually almost 10x faster compared to a fully tuned ZFS dataset made especially for MySQL and passed through to an LXC container as /var/lib/mysql (for the sake of maximum performance).

Write speeds/IOPS saw a gain as well, but not as much as read speeds/IOPS. It's simply insane with everything that is read related.
We have no performance degradation on any workload.

Since ZFS relies heavily on RAM bandwidth, we specifically bought Genoa servers and populated all 12 memory channels so as not to have a bottleneck there.

However, we have tested a lot of servers and compared BTRFS to ZFS: on everything HDD related, ZFS clearly won.
In some situations ZFS and BTRFS were on par, but the faster the storage got, the bigger the difference between ZFS and BTRFS.

BTRFS scales almost linearly as your storage gets faster, while ZFS doesn't.
That's the rule I learned.

But I believe that ZFS 2.3 with Direct IO will gain a lot of performance with MySQL/MSSQL as well. I just don't believe it will catch up, and the scaling rule above will still hold.
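
For completeness, the "fully tuned" ZFS dataset I compared against was along these lines (pool/dataset names are placeholders, and the properties follow the usual InnoDB-on-ZFS advice), and the Direct IO bit in 2.3 is, as I understand it, a per-dataset property:

Code:
# Dataset bind-mounted into the container as /var/lib/mysql
zfs create -o recordsize=16k -o compression=lz4 -o atime=off \
           -o logbias=throughput -o primarycache=metadata tank/mysql
# OpenZFS 2.3 only: let InnoDB bypass the ARC via Direct I/O
zfs set direct=always tank/mysql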
 
How many Proxmox servers do you have in production? Be careful who you call names; I have been in the large-scale data storage game since Apple XRAID was the big thing, which was before either ZFS or BTRFS existed.

In the first GitHub post you admit to testing 2 completely different setups (2 disks with RAM cache vs. an 8-disk ZFS array with metadata off) with completely unrealistic expectations (16 kB reads when your disks don't even have a minimum block size that small).

I don't have time to go through your wall-of-text rants when others have already pointed out your flaws. The fact that BTRFS isn't performing poorly in those benchmarks is itself the flaw: you're measuring the RAM (kernel) cache.

And Synology is garbage; it's a home NAS pretending to be business grade. An Atom CPU driving 12 disks over NFS or SMB, BTRFS corruption... I've seen my fair share of major failures with it. When I see someone using Synology, we install rclone, migrate the data to VAST or TrueNAS, and throw it in the garbage.
 
1. 12-15 Proxmox servers alone.
2. There is no mention of metadata off; I tested with caching set to "all" and to "metadata only".
3. Block size on the disks is 4 kB native, and on the 7450/7500/CM7-R you can also reformat the disks to different block sizes (see the sketch below).
4. The others didn't point out anything; they were just too lazy to read the GitHub thread and mostly talked nonsense that had nothing to do with the issue, like you.
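
To illustrate point 3: checking and changing the LBA format with nvme-cli looks roughly like this (the device name is a placeholder, the --lbaf index for 4K differs per model, and reformatting destroys all data on the namespace):

Code:
# List the LBA formats the namespace supports, then reformat to the 4K entry
nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"
nvme format /dev/nvme0n1 --lbaf=1 --ses=0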

Fact is, "metadata only" is almost broken and has even worse performance degradation, up to 13x compared to Btrfs; "all" is still 5x slower compared to Btrfs.

And the thing you didn't read at all is that I've tested with cache and without cache (as far as "no cache" is possible).

Cheers
 
And Synology is not garbage at all; you are probably only aware of the home stuff.

There are RS servers, which run in proper high availability and have Xeons with ECC memory.

In addition, there is no other solution I'm aware of that features such fine-grained Samba user & group management in a GUI.
Built-in Microsoft 365 backups and a lot more.

TrueNAS is not even able to run in high availability, and you talk about enterprise grade? Really???

And before you start saying that TrueNAS has HA features: first, you need the Enterprise HA license, and second, it's not real HA, since it syncs data changes only periodically, which means that you lose data in case of a failover.

This also means that you can't use TrueNAS HA as a storage backend for VMs.
Which you can do with Synology.
Or with other proprietary solutions like JovianDSS.
 
I wanted to know what you think of this configuration.
Boot on a mirror and payload on striped mirrors is best practice. You don't need (or benefit from) a SLOG or L2ARC for your use case.

I have been reluctant to use Btrfs since I read, a long time ago, that it was somewhat experimental and should not be used in production. Anyway, I will try to get up to date on the current status of Btrfs.
No change. The integration into PVE has some issues, but that aside, there are many reports that suggest there are still maturity problems. I personally ran into an issue where, as the parent filesystem got full, guest subvolumes became unmountable, which was a major pain (--edit-- this happened last month). I echo @guruevi in his conclusion that this isn't a production-grade option.

When choosing the storage substrate, you need to consider suitability in total, not a single factor (e.g., performance). More importantly, you need to set acceptance criteria: if the minimum performance requirements are met, there is no real value in an arbitrary increase, especially if you have to trade off other features. People get too hung up on "t3h fastest" when that's really not what is going to cause them pain.
 
I know; we faced this decision some years ago.
We wanted a ZFS-over-iSCSI solution that was able to run in HA.
At that time we used Synology HA with iSCSI for the ESXi servers and needed to replace it, because Synology iSCSI had some issues.

In the end we chose JovianDSS, because it was the only solution at that time that was able to offer ZFS over iSCSI with live sync and HA.
TrueNAS was not able to cover that.

However, in the meantime we have no ESXi servers anymore, and Proxmox offers better solutions for that.
Ceph is amazing; Proxmox replication is amazing.

And where we need real high availability, we are still using Hyper-V, because it's the only solution I know of where the VMs don't even go offline or reboot during a failover if a host dies (still with JovianDSS as the backend).

But we are about to replace those as well, since Proxmox hasn't had a single crash or anything on our Genoa servers, ever.
And for power loss we have generators and UPSes anyway, so there is no issue there either.

Because of all that, we have now switched entirely to Btrfs with Proxmox replication.

However, that's all nonsense talk; the fact is that TrueNAS syncs only periodically, which I don't see as a real HA solution.
Maybe that has changed in the meantime; I'm unsure, but I don't believe it.

The TrueNAS team is pretty slow when it comes to development, so I don't believe that anything has changed in the last 5 years.
 
The TrueNAS team is pretty slow when it comes to development, so I don't believe that anything has changed in the last 5 years.
the fact is that TrueNAS syncs only periodically,
The more certain you feel about something, the more likely you are to be falling victim to the Dunning-Kruger effect. This isn't the TrueNAS forum so I won't go into detail, but suffice it to say your conclusions aren't... supported by data.
 
I just read it; they talk about resilient replication, so I think it's still the same periodic, simple zfs send/receive.
Even if it's done every 5 seconds, it's not the same.

On JovianDSS (ZFS) and Synology (Btrfs), every block change gets instantly synced over a dedicated interface to the other node.

The other thing they talk about is dual controllers, which isn't interesting, as it has nothing to do with HA.

If you see it differently, tell me what I missed there.
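
To be clear about what I mean by "periodically": that style of replication is essentially incremental zfs send/receive on a schedule, something like the sketch below (names are placeholders, snapshot rotation omitted), so anything written after the last snapshot is gone on failover. That is very different from every block change being mirrored synchronously to the partner node.

Code:
# Scheduled, snapshot-based replication to a second node
zfs snapshot rpool/data/vm-100-disk-0@rep_new
zfs send -i rpool/data/vm-100-disk-0@rep_prev rpool/data/vm-100-disk-0@rep_new | \
  ssh node2 zfs receive -F rpool/data/vm-100-disk-0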
 
Need data?
We had a ZFS fileserver hit by a power outage; nothing helped except a full restore from the second server, and the department couldn't work for a few days...
At another site, replacing a 16 TB HDD in a raidz2 pool of 4x6 HDDs (349 TB total, 114 TB = 32% used) had taken 32 h to reach 32% of the 16 TB, which works out to ~45 MB/s.
Trying to get any data out of that pool in normal operation is 10 times slower than XFS, because the ARC can't handle it anymore with 192 GB of RAM.
Ceph subscriptions start in the high 5 digits and can easily reach the 7-digit price range... I think it's because Ceph is so super stable and problem-free that so few tickets reach Red Hat, and the customers spend their money there because they don't know what else to do with it... or why not think for yourself... :-)