What is best performing file system for Proxmox 5.2 on 4 x NVMe SSD Drives?

Discussion in 'Proxmox VE: Installation and configuration' started by Deli Veli, Dec 6, 2018.

  1. Deli Veli

    Deli Veli New Member

    Joined:
    Dec 6, 2018
    Messages:
    18
    Likes Received:
    0
    Good day all,

    Hardware Specs:
    Dell PowerEdge R630,
    Dual (2) Intel Xeon 8-core E5-2667 v3 CPUs, 3.2 GHz,
    256 GB memory,
    2x Intel S3610 SSDs for the Proxmox OS (RAID 1 via the PERC H330 SAS RAID controller),
    4x Intel P4510 series 1 TB U.2 NVMe SSDs (VM storage),
    front 4 bays configured for PCIe NVMe U.2 SSDs,

    Proxmox Setup:
    The OS runs on 2 Intel S3610 SSDs mirrored using the PERC H330 RAID controller.

    Research and Expectations:
    Since we have 4x NVMe drives, we are looking to create the fastest possible file system to run our VMs, including databases (MongoDB, MySQL, graph, etc.) as well as web servers, application servers, Redis, queue services, etc.

    We are willing to sacrifice 1 NVMe drive for parity, data redundancy, or fault tolerance. This will be used in a production environment as one of the servers serving ~3M users/month. Assume the network is not a bottleneck; it is out of scope for this thread.

    Documentation Already Researched:
    https://forum.proxmox.com/threads/what-is-the-best-file-system-for-proxmox.30228/
    https://forum.proxmox.com/threads/which-filesystem-to-use-with-proxmox-sofraid-server.41988/
    https://pve.proxmox.com/wiki/ZFS_on_Linux
    https://pve.proxmox.com/wiki/ZFS:_Tips_and_Tricks

    and a few others.

    A Few Findings:
    Here is what I did to benchmark.

    I created 3 CentOS KVMs with the same RAM and the same CPU:

    A: one CentOS KVM on the mirrored LVM-thin where the Proxmox OS is installed (Intel S3610 SSDs),

    B: one CentOS KVM on ZFS RAIDZ-1 created on 3 NVMe Intel P4510 SSDs,
    To create it I used: zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>

    C: one CentOS KVM on XFS on a single NVMe Intel P4510 SSD mounted at /mnt/nvmedrive,
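
    Each VM was benchmarked with a 4k random read/write fio run, along the lines of the invocation whose full output is pasted further down the thread:
    Code:
    fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=10G --readwrite=randrw --rwmixread=75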

    Benchmarking Results:
    On A - KVM on the mirrored LVM-thin where the Proxmox OS is installed (Intel S3610 SSDs):
    [Attachment: Screen Shot 2018-12-06 at 12.24.43 PM.png]

    On B - KVM on ZFS RAIDZ-1 created on 3 NVMe Intel P4510 SSDs:
    [Attachment: Screen Shot 2018-12-06 at 12.24.54 PM.png]
    On C - KVM on XFS on a single NVMe Intel P4510 SSD mounted at /mnt/nvmedrive:
    [Attachment: Screen Shot 2018-12-06 at 12.25.02 PM.png]


    Confusion:
    According to the fio tests:
    C is the fastest by a big margin - a single NVMe drive with an XFS filesystem mounted at /mnt/nvmedrive,
    B: RAIDZ-1 on 3 NVMe drives is much slower than C, also by a big margin,
    A is on LVM-thin in the same location where the OS is installed, on the PERC-mirrored SSDs, which are not high-performance drives.

    I would expect RAIDZ to perform better in this case. What do you think about our findings? Are these reasonable?
     
    #1 Deli Veli, Dec 6, 2018
    Last edited: Dec 6, 2018
  2. joshin

    joshin Member
    Proxmox Subscriber

    Joined:
    Jul 23, 2013
    Messages:
    92
    Likes Received:
    8
    I wouldn't do RAID-Z, I'd set up mirrored vdevs, and stripe them.
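    For the four drives here that would look something like the following (device names assumed; it is the same striped-mirror layout created later in the thread):
    Code:
    zpool create -f -o ashift=12 <pool> mirror /dev/nvme0n1 /dev/nvme1n1 mirror /dev/nvme2n1 /dev/nvme3n1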
     
    spirit likes this.
  3. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,302
    Likes Received:
    131
    If you want max performance, avoid ZFS (double write for the journal), and avoid RAIDZ or RAID 5.

    A RAID 10 with software RAID could be great. (I've never tried it, but it's possible to do RAID with LVM directly.)
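    For example, something along these lines with mdadm (device names assumed), or the equivalent built directly with LVM:
    Code:
    mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1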
     
  4. Deli Veli

    Deli Veli New Member

    Joined:
    Dec 6, 2018
    Messages:
    18
    Likes Received:
    0
    https://pve.proxmox.com/wiki/Software_RAID

    It looks like software RAID 10 is not a recommended approach in general. I don't mind setting it up and testing it; however, if it is not recommended for production environments by Proxmox, what would be the next best approach?
     
  5. Deli Veli

    Deli Veli New Member

    Joined:
    Dec 6, 2018
    Messages:
    18
    Likes Received:
    0
    Update on another round of experiments.

    Destroyed the RAIDZ-1 pool, since it was not beneficial from a performance perspective. Created a ZFS RAID 10 (striped mirrors) with the command below:
    Code:
    zpool create -f -o ashift=12 nvmeraid10pool mirror /dev/nvme0n1 /dev/nvme1n1 mirror /dev/nvme2n1 /dev/nvme3n1
    
    Below are the results of the same benchmark. It gave slightly better performance than RAIDZ-1, but it is still much slower than the single-drive benchmark.
    [Attachment: Screen Shot 2018-12-07 at 3.49.37 PM.png]
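    For completeness, the mirror/stripe layout of the new pool can be confirmed with the standard zpool commands:
    Code:
    zpool status nvmeraid10pool
    zpool list nvmeraid10pool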
     
  6. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,302
    Likes Received:
    131
  7. Deli Veli

    Deli Veli New Member

    Joined:
    Dec 6, 2018
    Messages:
    18
    Likes Received:
    0
    OK, below are the steps taken to create an LVM RAID 10 across the 4x NVMe SSD drives. Benchmark results are attached as well.

    // Create Physical Volume Drives
    Code:
    pvcreate /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
    // Create Volume Group
    Code:
    vgcreate my_vol_grp /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
    // Create a RAID10 logical volume from the volume group, using all free space (-i 2 = two stripes, -m 1 = each stripe is a two-way mirror, so all four drives are used)
    Code:
    lvcreate --type raid10 -m 1 -i 2 -l 100%FREE -n lvm_raid10 my_vol_grp
    // Create ext4 filesystem on the logical volume that you have created
    Code:
    mkfs.ext4 /dev/my_vol_grp/lvm_raid10

    // Mount the new file system on logical volume onto some folder

    Code:
    mkdir /mnt/lvm_raid10_mount/  
    mount /dev/my_vol_grp/lvm_raid10 /mnt/lvm_raid10_mount/
    I'm still retracing my steps to make sure I have done everything correctly; however, the results look very promising. It is a huge performance boost over RAIDZ-1.
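    One way to double-check those steps is to confirm that the logical volume really is laid out as raid10 across all four drives (the fields below are standard LVM reporting options, names as above):
    Code:
    lvs -a -o lv_name,segtype,stripes,devices my_vol_grp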

    Code:
    [root@localhost ~]# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=10G --readwrite=randrw --rwmixread=75
    test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
    fio-3.1
    Starting 1 process
    test: Laying out IO file (1 file / 10240MiB)
    Jobs: 1 (f=1): [m(1)][100.0%][r=364MiB/s,w=121MiB/s][r=93.1k,w=31.1k IOPS][eta 00m:00s]
    test: (groupid=0, jobs=1): err= 0: pid=4184: Wed Dec 12 18:29:49 2018
       read: IOPS=93.3k, BW=364MiB/s (382MB/s)(7678MiB/21073msec)
       bw (  KiB/s): min=321200, max=394384, per=100.00%, avg=373271.33, stdev=15381.26, samples=42
       iops        : min=80300, max=98596, avg=93317.74, stdev=3845.41, samples=42
      write: IOPS=31.1k, BW=122MiB/s (128MB/s)(2562MiB/21073msec)
       bw (  KiB/s): min=107000, max=133400, per=100.00%, avg=124586.38, stdev=5305.34, samples=42
       iops        : min=26750, max=33350, avg=31146.50, stdev=1326.35, samples=42
      cpu          : usr=27.55%, sys=70.27%, ctx=4472, majf=0, minf=23
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
         issued rwt: total=1965456,655984,0, short=0,0,0, dropped=0,0,0
         latency   : target=0, window=0, percentile=100.00%, depth=64
    
    Run status group 0 (all jobs):
       READ: bw=364MiB/s (382MB/s), 364MiB/s-364MiB/s (382MB/s-382MB/s), io=7678MiB (8051MB), run=21073-21073msec
      WRITE: bw=122MiB/s (128MB/s), 122MiB/s-122MiB/s (128MB/s-128MB/s), io=2562MiB (2687MB), run=21073-21073msec
    
    Disk stats (read/write):
        dm-0: ios=1948959/650587, merge=0/0, ticks=623937/157518, in_queue=781534, util=99.23%, aggrios=1965456/655988, aggrmerge=0/0, aggrticks=629961/158857, aggrin_queue=788040, aggrutil=99.15%
      vda: ios=1965456/655988, merge=0/0, ticks=629961/158857, in_queue=788040, util=99.15%
    
     
  8. guletz

    guletz Active Member

    Joined:
    Apr 19, 2017
    Messages:
    839
    Likes Received:
    114
    A RAIDZ cannot perform better than a single HDD/SSD or a mirror! You are comparing different things (apples with oranges...). I also think that your simple test (writing/reading one file) has little to do with your real load. Generally speaking, this kind of test is not very useful when you run many VMs with different kinds of loads at the same time. In the real world your load will certainly not use only one file; it will use hundreds of files, read and written by many hundreds of different processes at the same time.

    Another error is that you do not take into account the fact that ZFS can be tuned very differently for each VM/CT. You cannot do this with LVM or XFS.
    Running your load without redundancy could be dangerous in any case!
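    For example, a few of the per-dataset/per-zvol properties that can be set differently per guest (the pool and volume names here are only an illustration):
    Code:
    zfs set compression=lz4 nvmepool/subvol-101-disk-0      # a container dataset
    zfs set logbias=throughput nvmepool/vm-100-disk-0       # a VM zvol
    zfs set primarycache=metadata nvmepool/vm-100-disk-1    # e.g. when the guest DB caches data itself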
     
  9. Deli Veli

    Deli Veli New Member

    Joined:
    Dec 6, 2018
    Messages:
    18
    Likes Received:
    0
    Another question. The solution above, i.e. creating the LVM RAID 10, formatting it, mounting it to a folder in the filesystem, and then using that folder inside Proxmox, does seem to work. However, when I add the LVM directly in the Proxmox GUI, it shows 100% of the space taken. Here are the steps for both scenarios.

    Does NOT work:
    Code:
    pvcreate /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
    vgcreate my_vol_grp /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
    lvcreate --type raid10 -m 1 -i 2 -l 100%FREE -n lvm_raid10 my_vol_grp
    
    Then I add the LVM in the Proxmox GUI as such: Datacenter -> Storage -> Add -> LVM -> choose the volume group and give it an ID.
    [Attachment: Screen Shot 2018-12-14 at 4.10.29 PM.png]
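    A quick check of how much space is left unallocated in the volume group after the 100%FREE logical volume has been created (standard LVM commands, names as above):
    Code:
    vgs my_vol_grp
    lvs my_vol_grp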

    Works:
    Code:
    pvcreate /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
    vgcreate my_vol_grp /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
    lvcreate --type raid10 -m 1 -i 2 -l 100%FREE -n lvm_raid10 my_vol_grp
    mkfs.ext4 /dev/my_vol_grp/lvm_raid10
    mkdir /mnt/lvm_raid10_mount/
    mount /dev/my_vol_grp/lvm_raid10 /mnt/lvm_raid10_mount/
    
    Then I add a "Directory" on proxmox gui and I see the expected 1.79T space on 1T each LVM RAID10 setup. Read and Write iops are fast. Below is the proxmox gui
    [Attachment: Screen Shot 2018-12-14 at 4.17.43 PM.png]

    Any clue?
     
  10. Deli Veli

    Deli Veli New Member

    Joined:
    Dec 6, 2018
    Messages:
    18
    Likes Received:
    0
    @guletz What I meant is that I did not expect RAIDZ to perform that poorly compared to a single drive. Naturally, I was expecting some slowdown, but not to this extent. If you look at the results, you can see that RAIDZ-1 on the NVMe PCIe SSDs performed almost the same as the mirrored, non-performance SSDs behind the PERC. So what is the benefit of having NVMe SSDs then? That is why I am trying to squeeze better performance out of those 4 PCIe SSDs.

    As for the benchmark, all configurations were tested on the same server, with the same VM, under the same load. The purpose here is not to emulate actual performance testing under production load; it is to see the differences between the storage configurations. The single-drive benchmark was there to get some initial data, a starting point.

    About redundancy, the whole purpose of the thread is to find the best performing RAID configuration. All other results are shown here for comparison purposes.
     
  11. guletz

    guletz Active Member

    Joined:
    Apr 19, 2017
    Messages:
    839
    Likes Received:
    114
    Apples with oranges ;) For a read/write of a single file? Try to test with > 1000 files.
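    For example, the same fio test can be spread over many files with nrfiles (the directory and sizes below are just placeholders):
    Code:
    fio --name=manyfiles --directory=/mnt/test --nrfiles=1000 --size=10G --bs=4k --rw=randrw --rwmixread=75 --ioengine=libaio --direct=1 --iodepth=64 --gtod_reduce=1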


    Wrong. Once you reach a conclusion, what will you do with it? Extrapolate it to your actual load? You can do that, but it will not be relevant for your load.

    In the end you want the optimal storage layout for your load, not for a single-file-access load!

    And for RAIDZ-1, the IOPS are the same as for a single disk. Any hardware mirror will look fast because the RAID controller cache will lie and report your data as already written to disk. A RAID controller also will not guarantee that your data is OK. Even with enterprise SSDs/NVMe (data-center series) I see checksum errors on ZFS, so what is the point of having optimal speed but with data errors?
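    Those checksum errors show up in the CKSUM column of zpool status; a periodic scrub is the usual way to surface them (pool name assumed):
    Code:
    zpool scrub nvmepool
    zpool status -v nvmepool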
     
    #11 guletz, Dec 15, 2018
    Last edited: Dec 15, 2018