> @Dunuin thank you for your detailed explanation. Much appreciated.
> Currently the Samsung SSDs have a wearout of 4 and 3% after about a year of operation. Are there statistics available on how many writes have been done in that time?

You could check the SMART attributes of the SSDs with something like:
Code:
smartctl -a /dev/nvme0
But what you are able to see there and what not depends on the manufacturer. They might tell you how much data you tried to write to them over their lifetime (like "Data Units Written: 285,976,046 [146 TB]"), but they might not show you how much was actually written to the NAND because of the internal write amplification.

I also always install the sysstat package (
Code:
apt install sysstat
). With that there is an iostat command that will log IO activity. If you run something like
Code:
iostat 60 2
it will show you one statistic instantly, covering all your drives since boot. If you then wait 60 seconds for the command to finish, it will show you similar statistics again, but this time based only on the last 60 seconds.

ECC RAM will be way slower compared to non-ECC RAM. But using ECC means your server isn't crashing as much and you can rely on your data being what it is supposed to be. If you compile the same code 1,000,000 times with ECC, you get 1,000,000 identical binaries. Do it without ECC and only 999,999 binaries might be the same; one of them is corrupted and will cause unexpected behavior when you run it. It gets really useful when your RAM is slowly dying: then maybe every 10th binary is corrupted instead of every millionth, and you are searching for bugs in that program but just can't find them... because the source code is absolutely fine and only the binary is corrupted.

> The workload up until this time has been (I'll edit the original post to reflect this) a few Windows Server 2019 VMs, all but two configured for web development (Visual Studio, SQL Server, IIS). Developers log into the VMs using RDP and the actual stress comes from the compilation stages.
> One other VM is for GPU number crunching like AI and BOINC. I thought having ECC memory here would be a benefit, but I could be missing the point the more I think about it. One other VM will be for running a Digital Audio Workstation.
If you care about your data and you want to be able to rely on it always being correct, you should go with ECC RAM even if it is slower.
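To put a raw counter like the "Data Units Written" line quoted earlier into perspective: per the NVMe specification one data unit is 512,000 bytes, so the counter converts to terabytes as below. This is a minimal sketch; the echoed sample line stands in for real `smartctl -a /dev/nvme0` output, which you would pipe in instead.

```shell
# Convert the NVMe "Data Units Written" counter into terabytes.
# One NVMe data unit = 512,000 bytes (1000 units of 512 bytes).
echo "Data Units Written: 285,976,046 [146 TB]" |
  awk -F'[:[]' '/Data Units Written/ {
    gsub(/[ ,]/, "", $2)              # strip spaces and thousands separators
    printf "%.0f TB written\n", $2 * 512000 / 1e12
  }'
# prints: 146 TB written
```

On a real drive, replace the echo with `smartctl -a /dev/nvme0`; the result matches the bracketed value smartctl prints itself.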
Yes, it shouldn't matter for ZFS which slot you use, as long as there is no HW raid and ZFS can access each individual drive. ZFS is also not a real raid that uses arrays with a fixed stripe width or anything like that; it just distributes data blocks across multiple individual drives dynamically. So it can work like a raid10, but technically it isn't a raid.

> So if I add 2 more WD Black SSDs in the 2 remaining M.2 slots on the motherboard, I can create an additional ZFS mirror and stripe it with the current one?
> And if so, then I could also add the remaining 2 WD Black SSDs to the AIC adaptor and create an extra mirror and stripe it again?
> And if that still gives better results, buy 2 extra SSDs and mirror and stripe again.
> If so, then I am going to try that and I'd like to run performance tests after each step.

And yes, you can add pairs of drives later and stripe them so it becomes one big pool. It just isn't that easy if that pool is also your boot pool.
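A sketch of that growth path. The pool name "tank" and the nvme device names are placeholders (in practice use stable /dev/disk/by-id/ paths), and since `zpool add` is permanent, the dry run comes first:

```shell
# Preview the new layout without changing anything (-n = dry run).
zpool add -n tank mirror /dev/nvme2n1 /dev/nvme3n1

# Attach the new mirrored pair for real.
zpool add tank mirror /dev/nvme2n1 /dev/nvme3n1

# The pool now lists mirror-0 and mirror-1 side by side;
# ZFS stripes new writes across both mirrors dynamically.
zpool status tank
```

Repeating the `zpool add ... mirror ...` step for each further pair gives the three-stage expansion described above.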
> Currently I get:
> Code:
> pveperf /var/lib/vz
> CPU BOGOMIPS: 742451.20
> REGEX/SECOND: 4016924
> HD SIZE: 71.07 GB (rpool/ROOT/pve-1)
> FSYNCS/SECOND: 281.94
>
> I am not sure why I am missing some stats in the output, as when @udo ran it in this thread (https://forum.proxmox.com/threads/measure-pve-performance.3000/) he got this result:
> Code:
> pveperf /var/lib/vz
> CPU BOGOMIPS: 27293.47
> REGEX/SECOND: 1101025
> HD SIZE: 543.34 GB (/dev/mapper/pve-data)
> BUFFERED READS: 497.18 MB/sec
> AVERAGE SEEK TIME: 5.51 ms
> FSYNCS/SECOND: 5646.17

Pveperf only benchmarks your boot pool, not other pools. The missing lines are expected on ZFS: pveperf only reports BUFFERED READS and AVERAGE SEEK TIME when the path sits on a plain block device, like udo's LVM volume. "FSYNCS/SECOND: 281.94" is how many sync IOPS your boot pool can handle. All your SSDs will be terrible at handling sync writes because they are all missing powerloss protection, so don't expect good numbers there.
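The FSYNCS/SECOND figure can be approximated on any pool, not just the boot pool, with a small loop. This is a rough sketch assuming GNU dd: it uses O_SYNC writes rather than pveperf's fsync() calls, so treat the number as a ballpark. Run it with the current directory on the pool you want to measure.

```shell
# Rough DIY version of pveperf's FSYNCS/SECOND figure.
tmpfile=$(mktemp ./fsync-test.XXXXXX)
count=0
end=$(( $(date +%s) + 3 ))          # benchmark for ~3 seconds
while [ "$(date +%s)" -lt "$end" ]; do
    # write one 4 KiB block synchronously, without truncating the file
    dd if=/dev/zero of="$tmpfile" bs=4k count=1 oflag=sync conv=notrunc status=none
    count=$(( count + 1 ))
done
echo "$(( count / 3 )) sync writes/second"
rm -f "$tmpfile"
```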
> What would be more optimal in terms of random read/writes: simply adding new NVMe's to mirror-0, or creating extra sets of mirrors each containing 2 NVMe's?

For random read/write you want to stripe them. So the more mirrors you stripe, the better your random read/write should be.
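A back-of-envelope illustration of why more striped mirrors help. The IOPS figures are hypothetical spec-sheet numbers for a single NVMe SSD, and real pools scale less than ideally:

```shell
# A 2-way mirror can serve random reads from both sides (~2x one drive),
# but every write must go to both sides (~1x one drive).
# Striping N such mirrors multiplies both figures by N.
single_read_iops=400000       # hypothetical single-drive random read IOPS
single_write_iops=300000      # hypothetical single-drive random write IOPS
mirrors=2                     # e.g. 4 SSDs arranged as two striped mirrors
echo "ideal random read:  $(( single_read_iops  * 2 * mirrors )) IOPS"
echo "ideal random write: $(( single_write_iops * 1 * mirrors )) IOPS"
```

Growing mirror-0 itself (a 3-way or 4-way mirror) adds read speed and redundancy but no write scaling, which is why extra 2-drive mirrors are the better fit for mixed random IO.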
> Also, will it matter for either scenario if the NVMe's are not the same brand and model? They are the same size in terms of what is marketed (1TB), though.

If you stripe different 1TB SSD models, all the other SSDs will be slowed down to the speed of the slowest SSD. So it would be best not to mix different models.