What is the best-performing file system for Proxmox 5.2 on 4x NVMe SSD drives?

Deli Veli

Good day all,

Hardware Specs:
Dell PowerEdge R630,
Dual (2) Intel Xeon 8-core E5-2667 v3 CPUs, 3.2 GHz,
256 GB memory,
2x Intel S3610 SSD for the Proxmox OS (RAID 1 on a PERC H330 SAS RAID controller),
4x Intel P4510 series 1 TB U.2 NVMe SSD (VM storage),
front 4 bays configured for PCIe NVMe U.2 SSDs,

Proxmox Setup:
The OS runs on the 2 Intel S3610 SSDs, mirrored using the PERC H330 RAID controller.

Research and Expectations:
Since we have 4x NVMe drives, we are looking to create the fastest possible file system to run our VMs, including databases (MongoDB, MySQL, graph DBs, etc.) as well as web servers, application servers, Redis, queue services, etc.

We are willing to sacrifice 1 NVMe drive for parity, data redundancy, or fault tolerance. This will be used in a production environment as one of the servers serving ~3M users/month. Assume the network is not a bottleneck; that is out of scope for this thread.

Documentation Researched Already:
https://forum.proxmox.com/threads/what-is-the-best-file-system-for-proxmox.30228/
https://forum.proxmox.com/threads/which-filesystem-to-use-with-proxmox-sofraid-server.41988/
https://pve.proxmox.com/wiki/ZFS_on_Linux
https://pve.proxmox.com/wiki/ZFS:_Tips_and_Tricks

and a few others.

A Few Findings:
Here is what I did to benchmark.

I created 3 CentOS KVMs with the same RAM and CPU:

A: 1 CentOS KVM on the mirrored LVM-thin where the Proxmox OS is installed, on the Intel S3610 SSDs,

B: 1 CentOS KVM on ZFS RAIDZ-1 created on 3 of the Intel P4510 NVMe SSDs,
To create it I used:
Code:
zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>
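
(For reference, the resulting layout and ashift can be verified after creation; a minimal sketch, using the same <pool> placeholder as above:)
Code:
# show the vdev layout and health of the new pool
zpool status <pool>
# confirm the ashift that was actually applied
zpool get ashift <pool>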

C: 1 CentOS KVM on XFS on a single Intel P4510 NVMe SSD mounted at /mnt/nvmedrive,
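
For reference, the fio invocation used for these comparisons (presumably the same one posted in full further down for the LVM RAID10 run) was along these lines:
Code:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=10G --readwrite=randrw --rwmixread=75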

Benchmarking Results:
On A - KVM on the mirrored LVM-thin where the Proxmox OS is installed, on the Intel S3610 SSDs:
[Screenshot: fio results for configuration A]

On B - KVM on ZFS RAIDZ-1 created on 3 of the Intel P4510 NVMe SSDs:
[Screenshot: fio results for configuration B]

On C - KVM on XFS on a single Intel P4510 NVMe SSD mounted at /mnt/nvmedrive:
[Screenshot: fio results for configuration C]


Confusion:
According to the fio tests:
C is the fastest by a big margin (a single NVMe drive with an XFS filesystem mounted at /mnt/nvmedrive),
B (RAIDZ-1 on 3 NVMe drives) is much slower than C, also by a big margin,
A is on LVM-thin in the same location where the OS is installed, on non-high-performance SSDs mirrored behind the PERC.

I would expect RAIDZ to perform better in this case. What do you think about our findings? Are they reasonable?
 
If you want max performance, avoid ZFS (double write because of its journal) and avoid RAID-Z / RAID-5.

A RAID-10 with software RAID could be great. (I have never tried it, but it's possible to do RAID with LVM directly.)
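
(For the "software RAID" option, a minimal sketch with mdadm, assuming the four NVMe device names used elsewhere in this thread; this is an illustration, not something tested here:)
Code:
# create a 4-disk RAID-10 array from the NVMe drives
mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
# watch the initial sync progress
cat /proc/mdstat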
 
Update on another round of experiments.

Destroyed the RAIDZ-1, as it was not beneficial from a performance perspective. Created a ZFS RAID10 (striped mirrors) with the command below:
Code:
zpool create -f -o ashift=12 nvmeraid10pool mirror /dev/nvme0n1 /dev/nvme1n1 mirror /dev/nvme2n1 /dev/nvme3n1
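
(To actually use this pool for VM disks it would then be registered as ZFS storage in Proxmox; a minimal sketch, where the storage ID "nvme-r10" is just an example name:)
Code:
# register the pool as a ZFS storage backend for VM images and containers
pvesm add zfspool nvme-r10 --pool nvmeraid10pool --content images,rootdir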

Below are the results of the same benchmark. It gave a little better performance than RAIDZ-1, but it is still much slower than the single-drive benchmark.
[Screenshot: fio results for ZFS RAID10]
 
OK, below are the steps taken to create an LVM RAID10 on the 4x NVMe SSD drives, along with the benchmark results.

// Create the physical volumes
Code:
pvcreate /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

// Create Volume Group
Code:
vgcreate my_vol_grp /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

// Create a logical volume with RAID10 from the already created volume group
Code:
lvcreate --type raid10 -m 1 -i 2 -l 100%FREE -n lvm_raid10 my_vol_grp

// Create ext4 filesystem on the logical volume that you have created
Code:
mkfs.ext4 /dev/my_vol_grp/lvm_raid10


// Mount the new filesystem on the logical volume onto a folder

Code:
mkdir /mnt/lvm_raid10_mount/  
mount /dev/my_vol_grp/lvm_raid10 /mnt/lvm_raid10_mount/
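
(To make the mount survive a reboot it would also need an /etc/fstab entry; a minimal sketch, assuming the same paths as above:)
Code:
# /etc/fstab entry for the RAID10 logical volume
/dev/my_vol_grp/lvm_raid10  /mnt/lvm_raid10_mount  ext4  defaults  0  2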

Still retracing my steps to see whether I have done everything correctly, but the results look very promising. It is a huge performance boost over RAIDZ-1.

Code:
[root@localhost ~]# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=10G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.1
Starting 1 process
test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=364MiB/s,w=121MiB/s][r=93.1k,w=31.1k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=4184: Wed Dec 12 18:29:49 2018
   read: IOPS=93.3k, BW=364MiB/s (382MB/s)(7678MiB/21073msec)
   bw (  KiB/s): min=321200, max=394384, per=100.00%, avg=373271.33, stdev=15381.26, samples=42
   iops        : min=80300, max=98596, avg=93317.74, stdev=3845.41, samples=42
  write: IOPS=31.1k, BW=122MiB/s (128MB/s)(2562MiB/21073msec)
   bw (  KiB/s): min=107000, max=133400, per=100.00%, avg=124586.38, stdev=5305.34, samples=42
   iops        : min=26750, max=33350, avg=31146.50, stdev=1326.35, samples=42
  cpu          : usr=27.55%, sys=70.27%, ctx=4472, majf=0, minf=23
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwt: total=1965456,655984,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=364MiB/s (382MB/s), 364MiB/s-364MiB/s (382MB/s-382MB/s), io=7678MiB (8051MB), run=21073-21073msec
  WRITE: bw=122MiB/s (128MB/s), 122MiB/s-122MiB/s (128MB/s-128MB/s), io=2562MiB (2687MB), run=21073-21073msec

Disk stats (read/write):
    dm-0: ios=1948959/650587, merge=0/0, ticks=623937/157518, in_queue=781534, util=99.23%, aggrios=1965456/655988, aggrmerge=0/0, aggrticks=629961/158857, aggrin_queue=788040, aggrutil=99.15%
  vda: ios=1965456/655988, merge=0/0, ticks=629961/158857, in_queue=788040, util=99.15%
 
I would expect RAIDZ to perform better in this case. What do you think about our findings? Are they reasonable?

No RAIDZ can perform better than a single HDD/SSD or than any mirror! You are comparing different things (apples with oranges). I also think that your simple test (write/read a single file) has nothing to do with your real load! Generally speaking, I think this kind of test is not useful at all when you run many VMs with different kinds of loads at the same time. In the real world your load will certainly not use only one file; it will use hundreds of files that are read and written by many hundreds of different processes at the same time.

Another error is that you do not take into account the fact that ZFS can be tuned very differently for each VM/CT. You cannot do this with LVM or XFS.
Running your load without redundancy could be dangerous in any case!
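
(As an illustration of the per-VM/CT tuning mentioned above, ZFS properties can be set per dataset or zvol; a minimal sketch with example names, the zvol name is hypothetical:)
Code:
# disable atime updates and enable lz4 compression on the pool's root dataset
zfs set atime=off compression=lz4 nvmeraid10pool
# create a zvol for a database VM with a smaller volblocksize (must be chosen at creation time)
zfs create -V 100G -o volblocksize=16k nvmeraid10pool/vm-101-disk-1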
 
Another question. The solution above, i.e. creating the LVM RAID10, mounting it to a folder in the filesystem, and then using that inside Proxmox, does seem to work. However, when I add the LVM directly in the Proxmox GUI, it shows 100% of the space taken. Here are the steps for both scenarios.

Does NOT work:
Code:
pvcreate /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
vgcreate my_vol_grp /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
lvcreate --type raid10 -m 1 -i 2 -l 100%FREE -n lvm_raid10 my_vol_grp
Then I add LVM in the Proxmox GUI like this: Datacenter -> Storage -> Add -> LVM -> choose the volume group and give it an ID.
[Screenshot: Proxmox GUI, LVM storage shown as 100% used]

Works:
Code:
pvcreate /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
vgcreate my_vol_grp /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
lvcreate --type raid10 -m 1 -i 2 -l 100%FREE -n lvm_raid10 my_vol_grp
mkfs.ext4 /dev/my_vol_grp/lvm_raid10
mkdir /mnt/lvm_raid10_mount/
mount /dev/my_vol_grp/lvm_raid10 /mnt/lvm_raid10_mount/
Then I add a "Directory" on proxmox gui and I see the expected 1.79T space on 1T each LVM RAID10 setup. Read and Write iops are fast. Below is the proxmox gui
Screen Shot 2018-12-14 at 4.17.43 PM.png

Any clue?
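
(One diagnostic that might explain the difference, as an assumption rather than a confirmed answer: the Proxmox "LVM" storage type allocates new logical volumes from free space in the volume group, and since lvm_raid10 was created with -l 100%FREE the group may simply have no free extents left. Checking would look like:)
Code:
# VFree shows how much space is still unallocated in the volume group
vgs my_vol_grp
# list the logical volumes already carved out of it
lvs -a my_vol_grp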
 
@guletz What I meant is that I did not expect RAIDZ to perform that poorly compared to a single drive. Naturally, I was expecting some slowdown, but not to this extent. If you look at the results, you can see that RAIDZ-1 on NVMe PCIe SSDs performed almost the same as the mirrored, lower-performance SSDs behind the PERC. So what is the benefit of having NVMe SSDs then? That is why I was after squeezing better performance out of those 4 PCIe SSDs.

When it comes to the benchmark, all configurations were tested under the same load, on the same server, with the same VM. The purpose here is not to emulate actual performance testing under load; it was to see the differences between the storage configurations. The purpose of the single-drive benchmark was to get some initial data, a starting point.

About redundancy: the whole purpose of the thread is to find the best-performing RAID configuration. All other results are shown here for comparison purposes.
 
you can see that RAIDZ-1 on NVMe PCIe SSDs performed almost the same as the mirrored, lower-performance SSDs behind the PERC.

Apples with oranges ;) For a write/read on a single file? Try to test with > 1000 files.
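
(A sketch of what such a test could look like with fio, spreading I/O over many files with nrfiles and numjobs; the parameters are illustrative only:)
Code:
# random read/write across 1000 small files from 8 parallel jobs
fio --name=manyfiles --directory=/mnt/nvmedrive --ioengine=libaio --direct=1 \
    --rw=randrw --rwmixread=75 --bs=4k --iodepth=16 --numjobs=8 \
    --nrfiles=1000 --filesize=8M --group_reporting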


The purpose here is not to emulate actual performance testing under load; it was to see the differences between the storage configurations. The purpose of the single-drive benchmark was to get some initial data.

Wrong. When you reach a conclusion, what will you do next? Extrapolate it to your actual load? Yes, you can do that, but it will not be relevant for your load.

In the end you want the optimal storage layout for your load, not for a single-file-access load!

And for RAIDZ-1, the IOPS are the same as for a single disk. Any hardware mirror will look fast, because the RAID cache will lie that your data is already written to the disks. A RAID controller also will not guarantee that your data is OK. Even with enterprise SSD/NVMe (data center series) I see checksum errors on ZFS, so what is the point of having optimal speed but with data errors?
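
(Those checksum errors are the kind of thing a scrub surfaces; a minimal sketch of how to check, using the pool name from earlier in the thread:)
Code:
# re-read all data and verify it against its checksums
zpool scrub nvmeraid10pool
# the CKSUM column in the output reports any checksum errors found
zpool status -v nvmeraid10pool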
 
