NVMe slows down after a few days (and other issues)

Alibek

Summary:
Platform: Supermicro A+ Server 2123BT-HNC0R, with the following per-node configuration:
NVMe: 4x 2.5" U.2 Intel Optane 900P 280GB
1x M.2 Samsung SM961 256GB (NVMe)
SAS: 2x 2.5" SSD Samsung PM1633a 7.68TB
FC: QLogic QLE8362 (attached to an FC switch, used for pools exported from external storage)
CPU: 2x AMD EPYC 7601 with SMT (64 cores/128 threads)
Memory: DDR4 ECC 2 TiB
Network: 2x 10Gbps Intel X550T Ethernet, bonded (balance-alb)
OS: Debian GNU/Linux 9.5 (stretch) with Linux kernel version 4.15.18-1-pve #1 SMP PVE 4.15.18-17 (Mon, 30 Jul 2018 12:53:35 +0200)
Virtualization: Proxmox 5.2-2, PVE Manager version pve-manager/5.2-7/8d88e66a
Ceph version: Luminous 12.2.7-pve1

Cause:
I found the following problem with the NVMe storage devices after the servers had been powered on and idle for a week without a reboot:
Code:
root@host2:~# nvme list
Node             SN                  Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     PHMXXXXXXXXX280AGN   INTEL SSDPE21D280GA                      1         280.07  GB / 280.07  GB    512   B +  0 B   E2010325
/dev/nvme1n1     PHMXXXXXXXXX280AGN   INTEL SSDPE21D280GA                      1         280.07  GB / 280.07  GB    512   B +  0 B   E2010325
/dev/nvme2n1     PHMXXXXXXXXX280AGN   INTEL SSDPE21D280GA                      1         280.07  GB / 280.07  GB    512   B +  0 B   E2010325
/dev/nvme3n1     PHMXXXXXXXXX280AGN   INTEL SSDPE21D280GA                      1         280.07  GB / 280.07  GB    512   B +  0 B   E2010325
/dev/nvme4n1     S346XXXXXXXXXXX       SAMSUNG MZVPW256HEGL-00000               1         239.87  GB / 256.06  GB    512   B +  0 B   CXZ7500Q

Code:
root@host2:~# for d in {0..4}; do hdparm -Tt --direct /dev/nvme${d}n1; done

/dev/nvme0n1:
 Timing O_DIRECT cached reads:   4870 MB in  2.00 seconds = 2435.09 MB/sec
 Timing O_DIRECT disk reads: 7276 MB in  3.00 seconds = 2425.19 MB/sec

/dev/nvme1n1:
 Timing O_DIRECT cached reads:   2758 MB in  2.00 seconds = 1379.17 MB/sec
 Timing O_DIRECT disk reads: 2726 MB in  3.00 seconds = 908.07 MB/sec

/dev/nvme2n1:
 Timing O_DIRECT cached reads:   614 MB in  2.12 seconds = 290.25 MB/sec
 Timing O_DIRECT disk reads:  64 MB in  3.13 seconds =  20.42 MB/sec

/dev/nvme3n1:
 Timing O_DIRECT cached reads:   4716 MB in  2.00 seconds = 2358.23 MB/sec
 Timing O_DIRECT disk reads: 6068 MB in  3.00 seconds = 2022.55 MB/sec

/dev/nvme4n1:
 Timing O_DIRECT cached reads:   2106 MB in  2.00 seconds = 1052.56 MB/sec
 Timing O_DIRECT disk reads: 3886 MB in  3.00 seconds = 1295.33 MB/sec


root@host4:~# for d in {0..4}; do hdparm -Tt --direct /dev/nvme${d}n1; done

/dev/nvme0n1:
 Timing O_DIRECT cached reads:    16 MB in  2.12 seconds =  7.56 MB/sec
 Timing O_DIRECT disk reads:  24 MB in  3.17 seconds =   7.58 MB/sec

/dev/nvme1n1:
 Timing O_DIRECT cached reads:    16 MB in  2.12 seconds =  7.55 MB/sec
 Timing O_DIRECT disk reads:  22 MB in  3.03 seconds =   7.26 MB/sec

/dev/nvme2n1:
 Timing O_DIRECT cached reads:   4814 MB in  2.00 seconds = 2406.97 MB/sec
 Timing O_DIRECT disk reads: 7204 MB in  3.00 seconds = 2400.93 MB/sec

/dev/nvme3n1:
 Timing O_DIRECT cached reads:   1010 MB in  2.07 seconds = 488.49 MB/sec
 Timing O_DIRECT disk reads: 290 MB in  3.02 seconds =  95.92 MB/sec

/dev/nvme4n1:
 Timing O_DIRECT cached reads:   3256 MB in  2.00 seconds = 1627.93 MB/sec
 Timing O_DIRECT disk reads: 6434 MB in  3.00 seconds = 2144.20 MB/sec
This problem is present on all 4 nodes: some NVMe storage devices slow down, some do not.

If I reboot the system, the problem disappears (but not completely, see below) and occurs again within a couple of days.
Code:
after reboot again:
root@host2:~# for d in {0..4}; do hdparm -Tt --direct /dev/nvme${d}n1; done

/dev/nvme0n1:
 Timing O_DIRECT cached reads:   4782 MB in  2.00 seconds = 2391.16 MB/sec
 Timing O_DIRECT disk reads: 1482 MB in  3.00 seconds = 493.41 MB/sec

/dev/nvme1n1:
 Timing O_DIRECT cached reads:   4868 MB in  2.00 seconds = 2434.48 MB/sec
 Timing O_DIRECT disk reads: 4566 MB in  3.00 seconds = 1521.70 MB/sec

/dev/nvme2n1:
 Timing O_DIRECT cached reads:   4880 MB in  2.00 seconds = 2440.29 MB/sec
 Timing O_DIRECT disk reads: 7016 MB in  3.00 seconds = 2338.38 MB/sec

/dev/nvme3n1:
 Timing O_DIRECT cached reads:   4870 MB in  2.00 seconds = 2435.76 MB/sec
 Timing O_DIRECT disk reads: 7048 MB in  3.00 seconds = 2349.15 MB/sec

/dev/nvme4n1:
 Timing O_DIRECT cached reads:   3042 MB in  2.00 seconds = 1521.08 MB/sec
 Timing O_DIRECT disk reads: 6612 MB in  3.00 seconds = 2203.62 MB/sec

Another odd thing - a strange device detection:

Code:
root@host2:~# smartctl -i /dev/nvme0 | grep model
Model Number: SAMSUNG MZVPW256HEGL-00000
root@host2:~# smartctl -i /dev/nvme0n1 | grep model
Model Number: INTEL SSDPE21D280GA

and

Code:
root@host2:~# smartctl -i /dev/nvme4 | grep model
Model Number: INTEL SSDPE21D280GA
root@host2:~# smartctl -i /dev/nvme4n1 | grep model
Model Number: SAMSUNG MZVPW256HEGL-00000

I think this is wrong naming of the NVMe devices by the kernel driver: the controller character device (/dev/nvme0) and the namespace block device (/dev/nvme0n1) do not refer to the same physical drive.
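
To check which controller a given namespace actually belongs to (rather than trusting that /dev/nvmeX and /dev/nvmeXn1 share a number), a sysfs walk like the following can help. This is a minimal sketch, assuming the usual /sys/class/nvme layout on kernel 4.15 where each namespace directory sits under its owning controller:

Code:
# For each NVMe controller, print its model/serial and the namespaces
# that are actually registered under it in sysfs.
for c in /sys/class/nvme/nvme[0-9]*; do
    echo "== $(basename "$c"): $(cat "$c/model") (SN $(cat "$c/serial"))"
    ls -d "$c"/nvme*n[0-9]* 2>/dev/null | xargs -r -n1 basename
done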

Solution:

Conclusion:
 
Hi,

Can you retry this test with fio?
I have not had good experiences with hdparm.

Code:
fio --filename=/dev/nvme<x>n1 --direct=1 --sync=1 --rw=write --bs=4k --numjobs=4 --iodepth=16 --runtime=360 --time_based --group_reporting --name=journal-test

For NVMe devices, use nvme-cli to get the SMART values.

Code:
nvme smart-log /dev/nvme0n1
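
If it helps, the same can be collected for all namespaces in one go; a small sketch (the glob assumes single-namespace drives, as in your nvme list output):

Code:
# Dump the SMART/health log for every NVMe namespace at once.
for d in /dev/nvme[0-9]n1; do
    echo "== $d"
    nvme smart-log "$d"
done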
 
Hi, Wolfgang! Thank you for your attention!

I use a slightly different fio command:
Code:
fio --filename=/dev/nvme${d}n1 --direct=1 --sync=1 --rw=write --bs=4k --numjobs=4 --iodepth=16 --runtime=60 --time_based --group_reporting --name=journal-test-nvme${d}n1 --ioengine=libaio --exec_prerun="echo 3 > /proc/sys/vm/drop_caches"

Your fio command uses ioengine=psync; that engine is about 2x slower than libaio.
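
A non-destructive way to compare the two engines on one device could look like this (a sketch; the device name and runtime are just examples, and note that synchronous engines such as psync effectively ignore iodepth, so their queue depth stays at 1):

Code:
# Read-only comparison of psync vs libaio on a single device (no data is written).
for eng in psync libaio; do
    fio --filename=/dev/nvme0n1 --direct=1 --rw=randread --bs=4k \
        --numjobs=4 --iodepth=16 --runtime=60 --time_based \
        --group_reporting --ioengine=$eng --name=read-$eng
done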

I attached the results (they also include the nvme-cli output and a dd benchmark run after fio, at the end of the files).

fio demonstrated the full speed of the NVMe devices, but dd, for example, showed very bad results (look at the end of the file fio-host4.txt):
Code:
root@host4:~# echo 3 > /proc/sys/vm/drop_caches && for d in {0..3}; do echo nvme${d}n1; dd if=/dev/nvme${d}n1 of=/dev/null bs=4k count=256000 oflag=nonblock; done
nvme0n1
256000+0 records in
256000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 112.44 s, 9.3 MB/s
nvme1n1
256000+0 records in
256000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.7016 s, 616 MB/s
nvme2n1
256000+0 records in
256000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 103.155 s, 10.2 MB/s
nvme3n1
256000+0 records in
256000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 136.361 s, 7.7 MB/s

I previously tried to reset the devices with nvme-cli on the other hosts before rebooting them, but that did not have any effect.
I have deliberately not rebooted host4; it stays in its current bad state with the NVMe devices.
 


Your fio command uses ioengine=psync; that engine is about 2x slower than libaio.
Please repeat the fio test with psync; most Linux programs use fwrite/fread.
 


It looks like libaio is more volatile than sync and psync.

Can you confirm that after a few days the speed is still low with fio and psync?
 
So... did these monster servers end up giving the performance you were after in production?

Oh, and Intel support basically blamed your use of their products, or the open-source stack, for the poor performance. Really now, Wintel.
 
Hi!
Since I found a way to speed up the slow NVMe devices (I read from the slow device with dd 10 times), I will move them into production and am preparing to put them into operation.
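
For reference, that warm-up pass looks roughly like this (a sketch; the device name, block size, and count are illustrative, as the exact values are not recorded here):

Code:
# Warm-up: read from the slow device 10 times with dd.
# Device name, block size, and count are example values.
for i in $(seq 1 10); do
    dd if=/dev/nvme0n1 of=/dev/null bs=4k count=256000 iflag=direct
done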
Currently these servers have Ceph pools created in tiering mode (a hot pool with replication 3/2 on the NVMe Optane drives, and a storage pool with erasure coding (k=2, m=2) on the SAS SSDs); a sketch of how such a layout can be built follows the output below:
Code:
# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE VAR   PGS
 0  nvme 0.25499  1.00000  261GiB 9.69GiB  251GiB 3.72  8.96  25
 1  nvme 0.25499  1.00000  261GiB 8.84GiB  252GiB 3.39  8.17  23
 2  nvme 0.25499  1.00000  261GiB 11.5GiB  249GiB 4.43 10.67  30
 3  nvme 0.25499  1.00000  261GiB 9.62GiB  251GiB 3.69  8.89  25
 4   ssd 6.00000  1.00000 6.99TiB 13.9GiB 6.97TiB 0.19  0.47 272
 5   ssd 6.00000  1.00000 6.99TiB 12.5GiB 6.97TiB 0.17  0.42 240
 6  nvme 0.25499  1.00000  261GiB 10.9GiB  250GiB 4.20 10.11  28
 7  nvme 0.25499  1.00000  261GiB 9.70GiB  251GiB 3.72  8.96  25
 8  nvme 0.25499  1.00000  261GiB 7.46GiB  253GiB 2.86  6.90  19
 9  nvme 0.25499  1.00000  261GiB 8.50GiB  252GiB 3.26  7.85  21
10   ssd 6.00000  1.00000 6.99TiB 12.6GiB 6.97TiB 0.18  0.42 247
11   ssd 6.00000  1.00000 6.99TiB 13.8GiB 6.97TiB 0.19  0.46 265
12  nvme 0.25499  1.00000  261GiB 9.26GiB  251GiB 3.55  8.56  24
13  nvme 0.25499  1.00000  261GiB 10.4GiB  250GiB 4.00  9.63  27
14  nvme 0.25499  1.00000  261GiB 9.85GiB  251GiB 3.78  9.10  26
15  nvme 0.25499  1.00000  261GiB 7.28GiB  253GiB 2.79  6.73  18
16   ssd 6.00000  1.00000 6.99TiB 13.9GiB 6.97TiB 0.19  0.47 272
17   ssd 6.00000  1.00000 6.99TiB 12.5GiB 6.97TiB 0.17  0.42 240
18  nvme 0.25499  1.00000  261GiB 9.27GiB  251GiB 3.55  8.56  24
19  nvme 0.25499  1.00000  261GiB 9.87GiB  251GiB 3.79  9.12  26
20  nvme 0.25499  1.00000  261GiB 10.4GiB  250GiB 4.00  9.64  27
21  nvme 0.25499  1.00000  261GiB 6.57GiB  254GiB 2.52  6.07  16
22   ssd 6.00000  1.00000 6.99TiB 12.7GiB 6.97TiB 0.18  0.43 244
23   ssd 6.00000  1.00000 6.99TiB 13.6GiB 6.97TiB 0.19  0.46 268
                    TOTAL 60.0TiB  255GiB 59.7TiB 0.42         
MIN/MAX VAR: 0.42/10.67  STDDEV: 2.62

# ceph df
GLOBAL:
    SIZE        AVAIL       RAW USED     %RAW USED
    60.0TiB     59.7TiB       255GiB          0.42
POOLS:
    NAME         ID     USED        %USED     MAX AVAIL     OBJECTS
    ec_ssd       39     48.7GiB      0.18       26.5TiB       13066
    rbd_nvme     40     44.6GiB      3.42       1.23TiB       13766
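
As mentioned above, a layout of this shape could be created on Luminous roughly as follows. This is a hedged sketch rather than the exact commands used here: the pool names match the output above, but the PG counts, profile and rule names, and the cache-tier choices are illustrative assumptions.

Code:
# Replicated 3/2 pool on the nvme device class (hot tier); rule name and PG counts are examples.
ceph osd crush rule create-replicated nvme_rule default host nvme
ceph osd pool create rbd_nvme 128 128 replicated nvme_rule
ceph osd pool set rbd_nvme size 3
ceph osd pool set rbd_nvme min_size 2

# Erasure-coded k=2,m=2 pool on the ssd device class (storage tier).
ceph osd erasure-code-profile set ec22 k=2 m=2 crush-device-class=ssd crush-failure-domain=host
ceph osd pool create ec_ssd 256 256 erasure ec22

# Put the NVMe pool in front of the EC pool as a writeback cache tier.
# (The tiering agent also needs hit_set and target size settings, omitted here.)
ceph osd tier add ec_ssd rbd_nvme
ceph osd tier cache-mode rbd_nvme writeback
ceph osd tier set-overlay ec_ssd rbd_nvme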

The performance of these servers is good:
Code:
#read:100%
root@zfsperf-test:~# fio fio-ceph-optane.cfg
job mnt-ceph1: (g=0): rw=randread, bs=128K-128K/128K-128K/128K-128K, ioengine=libaio, iodepth=256
job mnt-ceph2: (g=0): rw=randread, bs=128K-128K/128K-128K/128K-128K, ioengine=libaio, iodepth=256
job mnt-ceph3: (g=0): rw=randread, bs=128K-128K/128K-128K/128K-128K, ioengine=libaio, iodepth=256
job mnt-ceph4: (g=0): rw=randread, bs=128K-128K/128K-128K/128K-128K, ioengine=libaio, iodepth=256
job mnt-ceph5: (g=0): rw=randread, bs=128K-128K/128K-128K/128K-128K, ioengine=libaio, iodepth=256
fio-2.16
Starting 5 processes
Jobs: 5 (f=5): [r(5)] [100.0% done] [1814MB/0KB/0KB /s] [14.6K/0/0 iops] [eta 00m:00s]
job mnt-ceph1: (groupid=0, jobs=5): err= 0: pid=3738: Wed Oct 24 10:21:46 2018
  read : io=338878MB, bw=1881.7MB/s, iops=15046, runt=180096msec
    slat (usec): min=7, max=896345, avg=330.50, stdev=2062.42
    clat (msec): min=18, max=5030, avg=84.75, stdev=104.48
     lat (msec): min=18, max=5030, avg=85.08, stdev=104.54
    clat percentiles (msec):
     |  1.00th=[   31],  5.00th=[   35], 10.00th=[   38], 20.00th=[   41],
     | 30.00th=[   44], 40.00th=[   46], 50.00th=[   50], 60.00th=[   55],
     | 70.00th=[   62], 80.00th=[   92], 90.00th=[  178], 95.00th=[  269],
     | 99.00th=[  553], 99.50th=[  627], 99.90th=[ 1029], 99.95th=[ 1172],
     | 99.99th=[ 2024]
    lat (msec) : 20=0.01%, 50=51.62%, 100=29.55%, 250=13.20%, 500=4.32%
    lat (msec) : 750=1.12%, 1000=0.11%, 2000=0.10%, >=2000=0.01%
  cpu          : usr=0.72%, sys=5.18%, ctx=2986052, majf=0, minf=35379
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=115.9%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=2709751/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
   READ: io=338878MB, aggrb=1881.7MB/s, minb=1881.7MB/s, maxb=1881.7MB/s, mint=180096msec, maxt=180096msec

#write:80%
#read:30%
root@zfsperf-test:~# fio fio-ceph-optane.cfg
job mnt-ceph1: (g=0): rw=randrw, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=256
job mnt-ceph2: (g=0): rw=randrw, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=256
job mnt-ceph3: (g=0): rw=randrw, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=256
job mnt-ceph4: (g=0): rw=randrw, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=256
job mnt-ceph5: (g=0): rw=randrw, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=256
fio-2.16
Starting 5 processes
Jobs: 5 (f=5): [m(5)] [100.0% done] [89033KB/205.5MB/0KB /s] [11.2K/26.3K/0 iops] [eta 00m:00s]
job mnt-ceph1: (groupid=0, jobs=5): err= 0: pid=7079: Thu Oct 25 15:29:25 2018
  read : io=19387MB, bw=110273KB/s, iops=13783, runt=180032msec
    slat (usec): min=3, max=170789, avg=93.55, stdev=788.83
    clat (usec): min=115, max=256439, avg=12418.25, stdev=9013.07
     lat (usec): min=203, max=265456, avg=12512.60, stdev=9047.28
    clat percentiles (usec):
     |  1.00th=[  668],  5.00th=[ 1928], 10.00th=[ 3248], 20.00th=[ 5472],
     | 30.00th=[ 7584], 40.00th=[ 9408], 50.00th=[11328], 60.00th=[13120],
     | 70.00th=[15168], 80.00th=[17792], 90.00th=[21888], 95.00th=[25984],
     | 99.00th=[40192], 99.50th=[51968], 99.90th=[97792], 99.95th=[112128],
     | 99.99th=[146432]
  write: io=45227MB, bw=257243KB/s, iops=32148, runt=180032msec
    slat (usec): min=3, max=218966, avg=92.52, stdev=652.18
    clat (msec): min=2, max=486, avg=34.35, stdev=19.77
     lat (msec): min=2, max=486, avg=34.44, stdev=19.79
    clat percentiles (msec):
     |  1.00th=[   12],  5.00th=[   16], 10.00th=[   18], 20.00th=[   21],
     | 30.00th=[   24], 40.00th=[   27], 50.00th=[   30], 60.00th=[   33],
     | 70.00th=[   38], 80.00th=[   44], 90.00th=[   57], 95.00th=[   71],
     | 99.00th=[  112], 99.50th=[  131], 99.90th=[  186], 99.95th=[  212],
     | 99.99th=[  302]
    lat (usec) : 250=0.01%, 500=0.13%, 750=0.24%, 1000=0.23%
    lat (msec) : 2=0.97%, 4=2.40%, 10=9.08%, 20=24.82%, 50=52.27%
    lat (msec) : 100=8.79%, 250=1.06%, 500=0.01%
  cpu          : usr=4.55%, sys=12.92%, ctx=8483207, majf=0, minf=1313
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=116.9%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=2481495/w=5787819/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
   READ: io=19387MB, aggrb=110273KB/s, minb=110273KB/s, maxb=110273KB/s, mint=180032msec, maxt=180032msec
  WRITE: io=45227MB, aggrb=257243KB/s, minb=257243KB/s, maxb=257243KB/s, mint=180032msec, maxt=180032msec
Disk stats (read/write):
  rbd0: ios=574675/1380684, merge=0/20123, ticks=1535528/10217744, in_queue=11869616, util=100.00%
  rbd1: ios=596383/1426959, merge=0/18868, ticks=1586900/10506716, in_queue=12222016, util=100.00%
  rbd2: ios=549816/1316641, merge=0/16961, ticks=1441276/9625572, in_queue=11147208, util=100.00%
  rbd3: ios=585145/1404060, merge=0/19898, ticks=1576824/10210776, in_queue=11884440, util=100.00%
  rbd4: ios=594204/1425566, merge=0/19586, ticks=1632092/10526332, in_queue=12258172, util=100.00%

CPU and memory performance of these servers: https://browser.geekbench.com/v4/cpu/10088145
In an LXC container with cores=120 and cpuunits=1024, performance is: https://browser.geekbench.com/v4/cpu/10471366
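
For reference, those container limits can be set with pct; a sketch, assuming a hypothetical container ID 101:

Code:
# Assign 120 cores and a CPU weight of 1024 to the container (ID 101 is an example).
pct set 101 --cores 120 --cpuunits 1024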
 
