[SOLVED] going for max speed with proxmox 7; how to do it?

diversity

I'd like to set up Proxmox 7 to be as fast as it can possibly be with the hardware I already have and the hardware I am considering getting.

EDIT: (this will be for non-critical, non-server-related workloads)
EDIT 2:
I would like to have a dedicated VM to pass GPUs to, so that I can donate to Folding@home or other distributed science projects.
EDIT 3:

The workload up until this time has been (I'll edit the original post to reflect this) a few Windows Server 2019 VMs, all but two configured for web development (Visual Studio, SQL Server, IIS). Developers log into the VMs using RDP and the actual stress comes from the compilation stages.
One other VM is for GPU number crunching like AI and BOINC. I thought having ECC memory would benefit this, but the more I think about it, the more I suspect I could be missing the point. One other VM will be for running a Digital Audio Workstation.

What I already have:
* AMD Ryzen Threadripper 3990X (CPU, 64 cores / 128 threads)
* Gigabyte TRX40 AORUS XTREME (rev 1.0) (motherboard with fully updated BIOS)
* AORUS Gen4 AIC Adaptor (one PCIe card holding 4 x NVMe SSDs)
* 1 x EVGA RTX 2080 Ti (GPU)
* 3 x Gigabyte RTX 2060 (GPU)
* 2 x Samsung 980 Pro 1TB (NVMe SSD)
* 4 x WD_BLACK SN850 1TB (NVMe SSD)
* 4 x 32GB DDR4-2600MHz ECC (memory)

Considering buying:
* 8 x Kingston Technology 32GB DDR4-3200MHz ECC CL22 DIMM 2Rx8 Micron E (EDIT 4: if there is faster non-ECC memory out there that can actually speed things up more, then I am all for it)

I truly do not care how long it takes Proxmox to boot. All I am after is that once it is up and running, I can fire up a ton of VMs in nanoseconds.
And the performance of those VMs should be second to none.

ZFS is not a requirement as I will be making constant backups of the VMs. EDIT 5: although not a requirement, I do like the idea, so perhaps keeping ECC memory is better than going the non-ECC route.
EDIT 6: as compilation of binaries is the primary workload, I have been advised in this thread to go with ECC. So it's going to be ECC only unless I learn it's not important.

Any tips one could share?
 
ZFS is not a requirement as I will be making constant backups of the VMs.
ZFS isn't just RAID that you use so you lose no data if a drive suddenly fails. It's an advanced filesystem that, among other things, can do stuff that is LIKE RAID.
Backups won't help you against stuff like bit rot, where your data gets silently corrupted over time without you noticing it. If you don't know whether a file is corrupted or not, you also don't know whether your backup will overwrite a healthy version of a file with a corrupted one. Bit rot is silent, and it may take years until you realize a file is damaged because you can't open it anymore; at that point all your backups will contain the same corrupted file, so you can't restore a healthy version. At least not unless you plan to store your backups for eternity and never delete a single byte.
ZFS is self-healing because it calculates a checksum of every data block and saves it with the data. Once a month it does a scrub: it calculates the checksums of all data again and compares them to the already-saved checksums. If it finds data where the new checksum doesn't match the old one, it knows that bit rot occurred and a file got corrupted. If you also have some kind of redundancy (like a mirror or raidz1/2/3), it will automatically repair the damaged data.
It's the same reason you buy ECC RAM. You don't want bit flips in RAM that might corrupt your data. But data on SSDs and HDDs can bit flip as well. If you just use HW RAID or LVM there is nothing that can detect or repair these bit flips. So ZFS is like the ECC of your SSD/HDD.
And there are also other nice features like deduplication, block-level compression, replication, snapshots, clones, ...
So you might want to read a bit more on what ZFS can and can't do.
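If you want to trigger or check a scrub yourself, it is roughly this (a minimal sketch, assuming the default Proxmox pool name rpool):
Code:
# start a manual scrub (Proxmox also schedules one monthly by default)
zpool scrub rpool

# check progress and whether any checksum errors were found or repaired
zpool status rpool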

And most consumer/prosumer SSDs are terribly slow if you run specific workloads like sync writes. They are just not designed for server workloads, so they are missing power-loss protection and therefore can't use their internal RAM cache for sync writes. For workloads like that, an enterprise-grade SSD might be 30-100 times faster than your drives. You should also monitor the SMART values of your SSDs. The durability of your drives is really bad: they only have 600 TBW per 1TB of storage. My enterprise SSDs, for example, have 18,000 TBW per 1TB of storage. So your SSDs might die 30 times faster.
So depending on your workload, your prosumer SSDs might be the bottleneck.
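If you want to keep an eye on the wear and total writes yourself, something along these lines should work (a sketch assuming smartmontools is installed and the drive shows up as /dev/nvme0; adjust the device name to yours):
Code:
# show NVMe health and usage; "Percentage Used" and "Data Units Written"
# (1 unit = 512,000 bytes) tell you how much of the rated endurance is gone
smartctl -a /dev/nvme0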
 
Backups won't help you against stuff like bit rot, where your data gets silently corrupted over time without you noticing it
Understood. Important data on a VM goes to a TrueNAS.
The VM itself gets backed up to TrueNAS.

So I can always roll back to a previous snapshot if need be.
For workloads like that, an enterprise-grade SSD might be 30-100 times faster than your drives
Wow, I am now considering your drives. Is there any benchmark I can see?
All I found and used was something along the lines of
https://www.tomshardware.com/reviews/best-ssds,3891.html


So your SSDs will die 30 times faster.
Thanks for the heads up, but I am not looking for data retention/safety or durability. I want to live fast and die young ;)
SSDs can be replaced. Who knows what will be available by the time one of them dies ;)

So I hope that there is still room for a suggestion on how to get this hardware running at light speed.
Or indeed, I am open to suggestions like the enterprise SSD one made earlier.
 
For example, do I use the AORUS Gen4 AIC Adaptor (one PCIe card for 4 x NVMe SSDs) with the 4 x WD Black SSDs in RAID 0 mode?
Or put them in the 4 NVMe slots of the mobo and then see if I can RAID 0 them?
 
Or, with 256 GB of ECC memory, perhaps load an entire VM into a ramfs?
That would mean that loading a VM is no longer fast, but for one or two specific VMs that would be no problem.
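Just as a rough sketch of that idea (not a recommendation): a tmpfs (RAM-backed, like ramfs but with a size limit) mounted as a Proxmox directory storage would do it. The storage name and the 64G size below are placeholders, and everything on it is gone after a reboot or crash:
Code:
# RAM-backed filesystem as mount point
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=64G tmpfs /mnt/ramdisk

# register it in Proxmox as a directory storage for VM disk images
pvesm add dir ramdisk --path /mnt/ramdisk --content images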
 
Understood. Important data on a VM goes to a TrueNAS.
The VM itself gets backed up to TrueNAS.

So I can always roll back to a previous snapshot if need be.
You still don't get it. If you run your VMs on non-ZFS storage and only copy them from time to time to the NAS, how can you verify that they are healthy while they are on your Proxmox server, before you copy them over to the NAS? Your TrueNAS can't know whether a file is healthy or not. It will just take the file as it arrived and treat that as valid. If bit rot occurs while the VM is running on your Proxmox server and you don't have ZFS there, it will corrupt. You then copy the corrupted file to TrueNAS, and TrueNAS will make sure that this file won't change (or corrupt even more). But it is already corrupted when it arrives on TrueNAS, so TrueNAS will think that file is healthy even if it is not. So TrueNAS will tell you "everything healthy!" while many VMs might still be corrupted. So ZFS on TrueNAS won't prevent your VMs from corruption. If you want your data to be safe you need ZFS everywhere.
Wow, I am now considering your drives. Is there any benchmark I can see?
All I found and used was something along the lines of
https://www.tomshardware.com/reviews/best-ssds,3891.html
There is an official NVMe benchmark from the Proxmox staff. Not that many SSDs, and no consumer NVMe SSDs, but if you look at the benchmark's FAQ you see why:
Can I use consumer or pro-sumer SSDs, as these are much cheaper than enterprise-class SSDs?
No. Never. These SSDs won't provide the required performance, reliability or endurance. See the fio results from before and/or run your own fio tests.
In that benchmark, a "Samsung SSD 850 EVO 1TB" SATA SSD writes at just 1,359 KB/s while an enterprise SATA SSD like the "Intel DC S3500 120GB" writes at 48,398 KB/s. It should be similar for NVMe SSDs. Without power-loss protection you just get really poor sync writes. If you really want fast storage for server workloads you want some U.2 enterprise NVMe SSDs (best case Optanes) and not M.2 NVMe consumer/prosumer stuff.
If you really want to know how fast your drives will be, you can run the same fio benchmark they did. It's all explained and documented in the linked paper (you need to download the PDF).
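For reference, the sync-write test is roughly along these lines (double-check the exact parameters in the PDF; the device name is a placeholder, and the test writes directly to the raw device, so it destroys whatever is on it):
Code:
# 4K sync writes, queue depth 1, straight against the raw device (DESTROYS its data!)
fio --ioengine=libaio --filename=/dev/nvme2n1 --direct=1 --sync=1 \
    --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 \
    --time_based --name=sync-write-test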
 
As hard as this may sound: use this setup for gaming and get some server hardware if you want to run server workloads. Even with gaming grade specs this will still be a playground and nothing more.
 
You still don't get it. If you run your VMs on non-ZFS storage and only copy them from time to time to the NAS
OK, understood now. So although I said ZFS was not a requirement, now it is. I never intended to suggest that ZFS was off limits. Sorry for the confusion.


There is an official NVMe benchmark from the Proxmox staff. Not that many SSDs, and no consumer NVMe SSDs, but if you look at the benchmark's FAQ you see why:
I have indeed read that particular document. I also did not find much regarding NVMe drives. I did see an Intel Optane at the top of the benchmark.
Does this suggest that using the AORUS Gen4 AIC Adaptor (one PCIe card for 4 x NVMe SSDs) with 4 x WD Black SSDs in mirrored striped mode is a good way to go?


As hard as this may sound: use this setup for gaming and get some server hardware if you want to run server workloads. Even with gaming grade specs this will still be a playground and nothing more.
Great catch. I am actually going to use this setup for non-server-related workloads. All I ask is how to make it the fastest it can be. Care to venture a suggestion?
 
I am not trying to be a d*&k, but if the mentality remains that Proxmox is not for non-server workloads, then perhaps I should install a Windows machine (yeah, I know, it's evil) and go from there with VirtualBox?

I'd rather avoid that route as I have spent more than a year getting to know Proxmox and have 3 of them running quite sweetly as we speak. So this remark is not ill-intended, merely a slight wink at @ph0x's rather firm comment.
 
I have indeed read that particular document. I also did not find much regarding NVMe drives. I did see an Intel Optane at the top of the benchmark.
1st place: "Intel Optane SSD DC P4800X Series 375 GB"
Does this suggest that using the AORUS Gen4 AIC Adaptor (one PCIe card for 4 x NVMe SSDs) with 4 x WD Black SSDs in mirrored striped mode is a good way to go?
Yes, at least a striped mirror is how I would use it for the best performance while still being sure that the data is safe.
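As a rough sketch, creating such a striped mirror (RAID 10 equivalent) from the four WD drives could look like this; the by-id paths are placeholders for your actual drives, and ashift=12 assumes 4K sectors:
Code:
zpool create -o ashift=12 fastpool \
  mirror /dev/disk/by-id/nvme-WD_BLACK_SN850_1 /dev/disk/by-id/nvme-WD_BLACK_SN850_2 \
  mirror /dev/disk/by-id/nvme-WD_BLACK_SN850_3 /dev/disk/by-id/nvme-WD_BLACK_SN850_4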
 
Also, when using a PCIe SSD adaptor, one sacrifices a PCIe slot. Slots that could be used for GPU passthrough.
As hard as this may sound: use this setup for gaming and get some server hardware if you want to run server workloads. Even with gaming grade specs this will still be a playground and nothing more.
What I failed to mention is that I would like to have a dedicated VM to pass GPUs to, so that I can donate to Folding@home or other distributed science projects.
 
I am not trying to be a d*&k, but if the mentality remains that Proxmox is not for non-server workloads, then perhaps I should install a Windows machine (yeah, I know, it's evil) and go from there with VirtualBox?

I'd rather avoid that route as I have spent more than a year getting to know Proxmox and have 3 of them running quite sweetly as we speak. So this remark is not ill-intended, merely a slight wink at ph0x's rather firm comment.
Proxmox will run on nearly any hardware, and with good performance too. But you bought gaming hardware for server stuff. That's like buying a fast and fancy-looking sports car when your goal is to pull a heavy trailer. Sports cars aren't built for that, and a powerful but slow and ugly truck would be far better suited for the task.
So you can use it, just don't think it will be as fast or as reliable as a real server when it comes to server workloads like running multiple VMs in parallel.

And you don't want to run VirtualBox. Use at least a type 1 hypervisor like Hyper-V.
 
And you don't want to run VirtualBox. Use at least a type 1 hypervisor like Hyper-V
Hyper-V on Windows Server 2019 comes with its own set of problems. It is simply not yet able (please believe me, I tried for 2 months before I settled on Proxmox) to do proper GPU passthrough on AMD Ryzen / RTX 2060+ setups.
EDIT: and the SATA controller passthrough on at least 5 of the motherboards I tried also went nowhere.
 
OK, so coming back to the disk aspect of it all.
Would one say that having a single PCIe adaptor (holding 4 NVMe drives in RAID 10, i.e. mirrored striped) is better than putting those same 4 NVMe drives in the slots on the mobo and having Proxmox deal with the RAID aspect?

Ah, before I make an edit to this post: I remember that when I put in the PCIe adaptor (AORUS Gen4 AIC Adaptor, one PCIe card for 4 x NVMe SSDs) without any RAID done by the motherboard, I actually saw 4 extra disks in my BIOS. That could well mean that Proxmox can use them at leisure for ZFS.
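A quick way to confirm that Proxmox itself sees all four drives individually would be something like this (standard commands; the output will of course differ on your box):
Code:
# list NVMe block devices with model and size
lsblk -d -o NAME,MODEL,SIZE /dev/nvme?n1

# stable device paths to use later when creating a pool
ls -l /dev/disk/by-id/ | grep nvme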
 
But you bought gaming hardware for server stuff
Please forgive my ignorance. What made one think I was looking to do server stuff? (my thread-opening post has already been edited to reflect this)
I am looking for a non-server workload. Just freakishly fast performance with the stuff that is available to me.
 
Please forgive my ignorance. What made one think I was looking to do server stuff? (my thread-opening post has already been edited to reflect this)
I am looking for a non-server workload. Just freakishly fast performance with the stuff that is available to me.
You want to run VMs. That's part of virtualization, and that is server stuff. It doesn't matter that much what you are actually running inside the VM (except for DBs and so on). The fact that you run VMs at all is enough. You could run two Win10 VMs for gaming only... that is still a server workload and not something gaming hardware is made for. The more VMs you want to run in parallel, the worse stuff like consumer SSDs will handle it. And consumer hardware isn't built to survive 24/7 operation and may fail very quickly. If you want to run that computer 24/7 (and I assume that is your plan if you want to run TrueNAS and distributed computing), that is nothing consumer hardware is built for.

This is how a gaming PC expects you to store stuff:
SSD <- onboard controller/NVMe <- NTFS <- OS

This is how Proxmox stores stuff:
SSDs <- HBA <- pool <- zvol <- cache <- virtio <- virtual disk <- filesystem <- guest OS

Here you get write amplification with each step (ZFS, virtio, ...) between the guest OS and your physical SSD. And write amplification doesn't add up, it multiplies, so the factors compound instead of increasing linearly. Let's use some fictional numbers and say there is a factor 10 amplification inside the SSD (that's low for sync writes with consumer SSDs), a factor of 2 because of ZFS's ZIL, a factor of 3 because of other ZFS overhead, a factor of 4 from virtio virtualization and a factor of 2 inside the guest because of journaling. You wouldn't get a write amplification factor of 21 (10+2+3+4+2); the write amplification would be a factor of 480 (10*2*3*4*2). So for each 1GB you write inside the guest, your SSD would write 480GB!
And if you want some real numbers: on my home server, for every 1GB written inside a VM, 20 to 40 GB is written to the NAND of the SSD, because I get a write amplification of 20 for async writes and 40 for sync writes.
So if I run a Windows VM and install a 100GB Steam game, it will write 2 to 4 TB to the SSD. If I install Windows bare metal instead and install the same game, it might only write 120GB or so.
So if you think "600 TBW isn't that bad, I can install that game 6,000 times... I will never reach that..." that's wrong. At least if you run it inside a VM, because installing it 200 times inside a VM might be enough to kill the SSD. Now you know why enterprise SSDs are built for a much higher write endurance and why I sold my Samsung EVO NVMe SSDs and got some enterprise SATA SSDs instead. My 1TB enterprise SSD is now capable of writing 18,000 TB, while my old EVOs would have died after just 600 TB, like your SSDs.
So your 600 TBW isn't that much anymore, because if you write more than 20TB inside your VM within 5 years, you lose your warranty and you are past the SSD's life expectancy. That means your SSD can only handle a constant write rate over 5 years of about 136 KB/s (20TB / (5*365*24*60*60 sec)). Now let's say 10 VMs are sharing the same SSD: each VM may only write at up to 14 KB/s on average. If a VM writes at more than 14 KB/s you would lose your warranty and the drive may die in under 5 years. 14 KB/s is so low it can't even handle the logs...
Sure, you could ignore the warranty and write at more like 70 KB/s; now your SSD will die within a year. Or write at 840 KB/s and the SSD might die within a month...

And you are dreaming about four SSDs in RAID 0 that can each write at 5,300 MB/s, so 21,200 MB/s in total. Do the math yourself:
4 x 600 TBW = 2,400 TB life expectancy = 2,400,000,000 MB life expectancy
2,400,000,000 MB life expectancy / 21,200 MB/s = 113,208 sec
113,208 sec / 60 / 60 = 31.44 hours

So write at full speed for 31.44 hours and you might have killed 4 new drives. You say you don't care how long they survive, you just want maximum performance. That's not really cheap if you need to buy 4 new SSDs for $700 every second day...
Sure, in reality you are not always writing at full speed for that long, and your SSDs also wouldn't be able to sustain it for more than a few seconds because they are built for short bursts and not for long sustained writes like enterprise SSDs, but it is a nice way to show how bad the life expectancy of these SSDs really is. Also keep in mind: a write amplification factor of 20 doesn't only mean that 20 times more data is written to the SSD's NAND so the drive dies 20 times faster. It also means that your SSDs will effectively be 20 times slower. If your VM writes at 1,000 MB/s, your SSDs have to write at 20,000 MB/s. So if a RAID 0 can only handle 21,200 MB/s, you won't see better writes than 1,060 MB/s inside a VM. And you are running multiple VMs that need to share that speed, so most of the time you won't even see that inside a VM.


Maybe your write amplification isn't as bad as mine, but you will still get write amplification. And I'm no special case. It's just a normal home server that runs local services and I am the only person who uses it. Right now it writes around 900GB a day while idling, and most of that is just logs/metrics. Show me a gaming PC that writes 900GB a day while nobody is using it... if that were normal, no one would buy these Samsung QVOs anymore.

And there are a lot of things like this that you probably didn't think of when building that server.

Distributed computing projects like BOINC, F@H and so on are heavy scientific computations that I would also call server workloads. Especially if you want to run that many GPUs using CUDA/OpenCL in parallel.

And it sounds like you want to run TrueNAS on it. That again is server stuff. ZFS wasn't made to run on consumer hardware. It was initially a file system designed by Sun for Solaris, meant to run only in datacenters on big storage servers with dozens of HDDs. None of the engineers thought someone might want to run it on consumer stuff. It is designed in a way that relies on you running it on very stable hardware with ECC RAM, a UPS, redundant power supplies and so on. So even "simple" stuff like storing files is a server workload.

Look at the hardware recommendations for Proxmox and TrueNAS: they all tell you to get server hardware, not consumer hardware.
They don't do that because they get money from HP/Dell/Lenovo for selling stuff, but because consumer hardware just isn't made to reliably run heavy workloads 24/7.
Don't ask how many consumer PCs I have killed over the last 20 years running BOINC 24/7... I stopped counting failed hardware... it has to be at least 8 PSUs, a dozen RAM modules, some GPUs and so on...

So if you already have that hardware, that's fine and we can see what is best to do with it. At least the CPU and RAM are fine for a server, and the board has a lot of PCIe lanes. But just don't expect too much.
I've already seen in another thread that you complained that your EVOs are so slow. They just aren't fast SSDs for server workloads, so they will never be as fast as good enterprise SSDs when it comes to advanced stuff like sync writes, heavy IO, continuous IO and so on. Maybe you bought them because they have big numbers on paper and advertise "up to 5,300MB/s writes", but the reality is that you can consider yourself lucky if they reach something like 50MB/s (like a SATA enterprise SSD) on stuff like random sync writes on the host, or single-digit MB/s inside a VM. What really matters in a server isn't the big read/write numbers that might be achieved at best, but what the SSD can guarantee at minimum, under all conditions and at all times. And of course how reliable and durable the SSD is.

If your SSDs just can't handle server workloads and you don't care about enterprise features and data integrity, you could for example try to PCI passthrough one NVMe SSD to each VM. That way you get no virtualization overhead, no write amplification from mixed block sizes, and the workload is closer to what they were built for: just one OS using the drive without any advanced stuff in between. Then they are just like SSDs in a normal gaming PC and should perform better.
But you would lose all the nice enterprise features like redundancy, bit rot protection, block-level compression, deduplication and so on, and you can't run more than one VM off each SSD.
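A rough sketch of what that passthrough could look like (the PCI address 0000:21:00.0 and VM ID 101 are placeholders; IOMMU and the usual passthrough prerequisites have to be in place, and pcie=1 needs a q35 machine type):
Code:
# find the PCI address of the NVMe controller
lspci -nn | grep -i nvme

# hand that controller to VM 101 as a PCIe device
qm set 101 -hostpci0 0000:21:00.0,pcie=1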
 
@Dunuin thank you for your detailed explanation. Much appreciated.
Currently the Samsung SSDs have a wearout of 4 and 3% after about a year of operation. Are there statistics available on how many writes have been done in that time?
The workload up until this time has been (I'll edit the original post to reflect this) a few Windows Server 2019 VMs, all but two configured for web development (Visual Studio, SQL Server, IIS). Developers log into the VMs using RDP and the actual stress comes from the compilation stages.
One other VM is for GPU number crunching like AI and BOINC. I thought having ECC memory would benefit this, but the more I think about it, the more I suspect I could be missing the point. One other VM will be for running a Digital Audio Workstation.

TrueNAS runs on a different server on the network

So if I add 2 more WD Black SSDs in the 2 remaining M.2 slots on the motherboard, can I create an additional ZFS mirror and stripe it with the current one?

And if so, could I then also add the remaining 2 WD Black SSDs to the AIC Adaptor and create an extra mirror and stripe again?

And if that still gives better results, buy 2 extra SSDs and mirror and stripe once more.

If so, then I am going to try that, and I'd like to run performance tests after each step.

Currently I get
Code:
pveperf /var/lib/vz
CPU BOGOMIPS:      742451.20
REGEX/SECOND:      4016924
HD SIZE:           71.07 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND:     281.94
I am not sure why I am missing some stats in the output, as when @udo ran it in this thread (https://forum.proxmox.com/threads/measure-pve-performance.3000/) he got this result

Code:
pveperf /var/lib/vz
CPU BOGOMIPS:      27293.47
REGEX/SECOND:      1101025
HD SIZE:           543.34 GB (/dev/mapper/pve-data)
BUFFERED READS:    497.18 MB/sec
AVERAGE SEEK TIME: 5.51 ms
FSYNCS/SECOND:     5646.17
 
Current zpool status

Code:
zpool status
  pool: rpool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 00:36:04 with 0 errors on Sun Jul 11 01:00:05 2021
config:

        NAME                                 STATE     READ WRITE CKSUM
        rpool                                ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            nvme-eui.002538b9015063be-part3  ONLINE       0     0     0
            nvme-eui.002538b9015063e1-part3  ONLINE       0     0     0

errors: No known data errors
 
What would be more optimal in terms of random reads/writes?
Simply adding the new NVMe drives to mirror-0, or creating extra sets of mirrors, each containing 2 NVMe drives?

Also, will it matter for either scenario if the NVMe drives are not the same brand and model? They are the same marketed size (1TB), though.
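For reference, the two options would look roughly like this on the command line (device paths are placeholders): zpool attach adds another copy to the existing mirror (more redundancy, same capacity), while zpool add mirror adds a second mirror vdev that gets striped with mirror-0 (more capacity and generally more random IO). Keep in mind that disks added later to the boot pool won't have the boot partitions the Proxmox installer created:
Code:
# option A: grow the existing 2-way mirror with a third disk (redundancy only)
zpool attach rpool nvme-eui.002538b9015063be-part3 /dev/disk/by-id/nvme-NEW_DISK_1

# option B: add a second mirror vdev, striped with mirror-0 (more capacity and IOPS)
zpool add rpool mirror /dev/disk/by-id/nvme-NEW_DISK_1 /dev/disk/by-id/nvme-NEW_DISK_2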
 
