Slow disk performance

Tim Denis

Hi!
I've got an X9DRE-LN4F Supermicro board with 128 GB DDR3 and 2x Xeon E5-2620. Not the newest server, but it runs.
I noticed a high IO delay, and went on an investigation.

Specs:
- Proxmox 6.3
- 2x Samsung 850 Pro 1TB in a ZFS mirror (configured during install)
- BIOS (and IPMI) firmware updated to the latest available on the Supermicro site

The server also contains an NVMe drive (with an adapter - no M.2 slot on the board).

Whatever I do, I cannot get past 180 MB/s for disk writes.
So when I do: dd if=/dev/urandom of=/randomfile bs=10M count=1024
the speed is 180 MB/s in the beginning and gradually drops to around 40 MB/s.
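
A way to see whether the disk itself or the data source is the limit during such a test is to watch the device side with iostat (from the sysstat package) in a second shell - roughly like this:
Bash:
apt install sysstat
# extended per-device stats in MB/s, refreshed every second;
# if the disk sits well below 100% util while dd runs, the bottleneck is not the disk
iostat -xm 1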

EXACTLY the same pattern with the NVMe drive. Weird, because that is a different interface, right?

CPU is low during write.

I checked the connections, the right SATA ports, etc. Everything seems to be all right (yes, connected to the SATA3 ports, not the SATA2 ports the board also has).
Linux tells me the link speed is 6 Gb/s, so that must be SATA3... but the speeds do not match.
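
(The link speed can be checked roughly like this - smartctl comes from the smartmontools package, and /dev/sda is just an example device:)
Bash:
# the kernel log typically reports something like "ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)"
dmesg | grep -i 'SATA link up'
# smartctl also prints the negotiated speed, e.g. "SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)"
smartctl -i /dev/sda | grep -i 'SATA Version'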

Upon further investigation, I installed Windows 10 Pro bare metal on the machine... and guess what? I do get the speeds I expect (CrystalDiskMark):
SATA SSD: 562 MB/s read | 517 MB/s write. That's what I expect from a Samsung 850 Pro connected to a SATA3 port.
NVMe: 742 MB/s read | 296 MB/s write. Not really what I expect, but still a lot more than what I get in Proxmox.

Okay, maybe it is Proxmox. So I installed Fedora Server on the machine - also bare metal.
Same slow speeds. Exactly the same 180 MB/s region again.

Everywhere I look, I understand that the C602 chipset should be supported by the kernel...

I also added an add-in card providing 4 SATA ports: exactly the same speeds. I added an Intel 100GB S3700 SSD: exactly the same speeds...

So I feel something is capping the speed of the storage...

The goal for the server is to have TrueNAS running in a VM with an HBA passed through (PCIe). That works, and gives me the speeds I expect from the classic spinning HDDs. I wanted to add the NVMe drive as a cache to that TrueNAS VM, but same result: it is slow in TrueNAS as well.

As a final test, I passed the NVMe drive through to a Windows VM on the Proxmox host via PCIe passthrough. Running CrystalDiskMark in that VM gives me about the same results as I get on that NVMe drive when running bare-metal Windows on the server...

So, that leads me to the conclusion it must be the Linux kernel that does not allow the full speed...?

Any ideas or suggestions that might point me in the right direction are highly appreciated...!
 
So when I do: dd if=/dev/urandom of=/randomfile bs=10M count=1024

You're testing with a random number generator - those aren't meant to be fast, they're meant to produce good random data. /dev/urandom itself is the bottleneck here, not the disk.

Verify the speed with:
Code:
apt install pv
cat /dev/urandom | pv > /dev/null

You should use fio for proper benchmarking.
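
For example something along these lines (file path, size and block size are just placeholders, adapt them to your setup):
Code:
# simple sequential write test; --direct=1 requests O_DIRECT where the filesystem supports it
fio --name=seqwrite --filename=/tmp/fio.test --rw=write --bs=1M --size=4G --direct=1 --ioengine=libaio --group_reporting
rm /tmp/fio.test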
 
Thanks for your reply!
I did not know that /dev/urandom would be so slow... :)

cat /dev/urandom | pv > /dev/null
gave me the exact same 180MB/s limit. So that's clear now.

I started testing with fio.

Seems even slower than I thought!
Bash:
fio --filename=/media/nvme/fiofile.delete --size=5GB --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=5

Where I would vary the --size parameter...

For NVME storage:
Bash:
NVME 1GB
Run status group 0 (all jobs):
   READ: bw=124MiB/s (130MB/s), 124MiB/s-124MiB/s (130MB/s-130MB/s), io=14.5GiB (15.6GB), run=120047-120047msec
  WRITE: bw=124MiB/s (130MB/s), 124MiB/s-124MiB/s (130MB/s-130MB/s), io=14.5GiB (15.6GB), run=120047-120047msec

NVME 2GB
  Run status group 0 (all jobs):
   READ: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=15.1GiB (16.3GB), run=120046-120046msec
  WRITE: bw=129MiB/s (136MB/s), 129MiB/s-129MiB/s (136MB/s-136MB/s), io=15.2GiB (16.3GB), run=120046-120046msec

NVME 5GB
Run status group 0 (all jobs):
   READ: bw=65.4MiB/s (68.5MB/s), 65.4MiB/s-65.4MiB/s (68.5MB/s-68.5MB/s), io=7924MiB (8308MB), run=121235-121235msec
  WRITE: bw=65.6MiB/s (68.8MB/s), 65.6MiB/s-65.6MiB/s (68.8MB/s-68.8MB/s), io=7951MiB (8337MB), run=121235-121235msec

For the Samsung 850 Pro storage (in the ZFS mirror):

Bash:
Run status group 0 (all jobs):
   READ: bw=4008MiB/s (4203MB/s), 4008MiB/s-4008MiB/s (4203MB/s-4203MB/s), io=470GiB (504GB), run=120001-120001msec
  WRITE: bw=4005MiB/s (4200MB/s), 4005MiB/s-4005MiB/s (4200MB/s-4200MB/s), io=469GiB (504GB), run=120001-120001msec



2GB
Run status group 0 (all jobs):
   READ: bw=1032MiB/s (1082MB/s), 1032MiB/s-1032MiB/s (1082MB/s-1082MB/s), io=121GiB (130GB), run=120001-120001msec
  WRITE: bw=1032MiB/s (1082MB/s), 1032MiB/s-1032MiB/s (1082MB/s-1082MB/s), io=121GiB (130GB), run=120001-120001msec


5GB
Run status group 0 (all jobs):
   READ: bw=66.5MiB/s (69.7MB/s), 66.5MiB/s-66.5MiB/s (69.7MB/s-69.7MB/s), io=7980MiB (8368MB), run=120003-120003msec
  WRITE: bw=66.8MiB/s (70.0MB/s), 66.8MiB/s-66.8MiB/s (70.0MB/s-70.0MB/s), io=8012MiB (8401MB), run=120003-120003msec

So, all in all, even slower than the 180 MB/s I thought it was...
The really fast speeds (4,000+ MB/s) for the smaller tests are - I assume - due to the caching ZFS does in RAM (the ARC) for the zpool.
The NVMe drive is just formatted as ext4, so no such caching there.
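
If it really is the ARC, I guess that should show up in the ZFS cache statistics while the benchmark runs - something like this (arcstat ships with zfsutils-linux, if I'm not mistaken):
Bash:
# ARC size and hit/miss rate, one line per second, while fio is running
arcstat 1
# or read the raw counters directly
grep -E '^(hits|misses|size|c_max) ' /proc/spl/kstat/zfs/arcstats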

I'm puzzled ...
Any advice? Thanks!
 
I think your results are mixed up. NVME should be the faster one.

SSDs have a built-in cache and that's why they are fast. After 2-8 GB the cache is full and they drop to 100-700 MB/s, depending on the model.

CrystalDiskMark on Windows tests with 1 GB by default, so that's always just the cache. It's misleading, but that's how they advertise it.


ZFS uses copy-on-write (CoW) with very small block sizes, so that adds a lot of overhead. Consumer drives aren't really suited for that kind of workload.

That's why you get such bad results.
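
You can check the block size related settings of your pool like this ("rpool" is just an example name, replace it with your pool/dataset):
Code:
# recordsize (datasets) / volblocksize (zvols) and compression
zfs get recordsize,volblocksize,compression rpool
# ashift of the pool
zpool get ashift rpool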

Just take a look at the official Proxmox ZFS benchmark: https://www.proxmox.com/en/downloads/item/proxmox-ve-zfs-benchmark-2020


For reliable results I run:
Code:
fio --name=seqwrite --filename=seqwrite.fio --refill_buffers --rw=write --direct=1 --loops=3 --ioengine=libaio --bs=1m --size=5G --runtime=60 --group_reporting

For cache testing (system + SSD/NVMe):
Code:
fio --name=seqwrite --filename=seqwrite.fio --refill_buffers --rw=write --direct=1 --ioengine=libaio --bs=1m --size=1G --runtime=60 --group_reporting
 
I think your results are mixed up. NVME should be the faster one.
That should be the case. But it isn't. I'm 100% sure I didn't mix up the results.

SSDs have a built-in cache and that's why they are fast. After 2-8 GB the cache is full and they drop to 100-700 MB/s, depending on the model.
Yes, I know. But still, those first couple of GBs should be fast, no?
Or is Linux somehow skipping this cache? It shouldn't, since that cache is in the hardware...

CrystalDiskMark on Windows tests with 1 GB by default, so that's always just the cache. It's misleading, but that's how they advertise it.
Yes. But I tested with an 8 GB test file size, to make sure I avoid this problem.

ZFS uses copy-on-write (CoW) with very small block sizes, so that adds a lot of overhead. Consumer drives aren't really suited for that kind of workload.
Okay. But should I expect an impact larger than 50%?

That looks like a very good resource. I'm gonna read that!
Thanks for your help!
 
I have been struggling for several weeks now with the same problem: the SSD speed on the host is only around 24 MB/s. When I test with an HDD on the host, it also runs at 24 MB/s. The SSD is a Samsung 860 500GB. It really looks like the host Debian kernel has a problem.

I am running Proxmox 6.1, tested on 2 separate machines: one is a Supermicro server, the other is a PC converted to Proxmox. Both give the same result, so I rule out the SATA controller.
 
Hi,

I have a similar problem.

I have 3 NVMe Corsair MP510 1.8TB disks, all of them in a ZFS RAIDZ.

When I ran the fio benchmark from Proxmox in /dev/zvol/RAID I got really nice values:

Code:
root@px01:/dev/zvol/RAID# fio --name=seqwrite --filename=seqwrite.fio --refill_buffers --rw=write --loops=3 --ioengine=libaio --bs=1m --size=5G --runtime=60 --group_reporting
seqwrite: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=2011MiB/s][w=2011 IOPS][eta 00m:00s]
seqwrite: (groupid=0, jobs=1): err= 0: pid=849808: Sun Jul 18 21:37:04 2021
  write: IOPS=1939, BW=1939MiB/s (2034MB/s)(15.0GiB/7920msec); 0 zone resets
    slat (usec): min=273, max=453, avg=303.07, stdev=27.67
    clat (nsec): min=990, max=10317, avg=1143.23, stdev=176.85
     lat (usec): min=274, max=460, avg=304.37, stdev=27.71
    clat percentiles (nsec):
     |  1.00th=[ 1020],  5.00th=[ 1032], 10.00th=[ 1048], 20.00th=[ 1048],
     | 30.00th=[ 1064], 40.00th=[ 1080], 50.00th=[ 1112], 60.00th=[ 1128],
     | 70.00th=[ 1160], 80.00th=[ 1208], 90.00th=[ 1272], 95.00th=[ 1368],
     | 99.00th=[ 1592], 99.50th=[ 1704], 99.90th=[ 2352], 99.95th=[ 2928],
     | 99.99th=[ 7776]
   bw (  MiB/s): min= 1814, max= 2040, per=99.91%, avg=1937.60, stdev=89.59, samples=15
   iops        : min= 1814, max= 2040, avg=1937.60, stdev=89.59, samples=15
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 2=99.80%, 4=0.14%, 10=0.03%, 20=0.01%
  cpu          : usr=42.75%, sys=57.24%, ctx=21, majf=0, minf=14
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,15360,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1


Run status group 0 (all jobs):
  WRITE: bw=1939MiB/s (2034MB/s), 1939MiB/s-1939MiB/s (2034MB/s-2034MB/s), io=15.0GiB (16.1GB), run=7920-7920msec


but when I ran it from the VM (CentOS 8.3) I had performance problems:


Code:
[root@localhost ~]# fio --name=seqwrite --filename=seqwrite.fio --refill_buffers --rw=write --loops=3 --ioengine=libaio --bs=1m --size=5G --runtime=60 --group_reporting
seqwrite: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.19
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=1110MiB/s][w=1109 IOPS][eta 00m:00s]
seqwrite: (groupid=0, jobs=1): err= 0: pid=3280: Sun Jul 18 15:39:47 2021
  write: IOPS=719, BW=720MiB/s (755MB/s)(15.0GiB/21346msec); 0 zone resets
    slat (usec): min=342, max=12585, avg=728.40, stdev=578.64
    clat (usec): min=2, max=1113, avg= 3.14, stdev= 9.27
     lat (usec): min=345, max=12595, avg=732.24, stdev=579.06
    clat percentiles (usec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    3], 40.00th=[    3], 50.00th=[    3], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[    4], 95.00th=[    4],
     | 99.00th=[    6], 99.50th=[   11], 99.90th=[   18], 99.95th=[   20],
     | 99.99th=[  265]
   bw (  KiB/s): min=49053, max=1655568, per=100.00%, avg=992474.52, stdev=436697.03, samples=31
   iops        : min=   47, max= 1616, avg=968.77, stdev=426.49, samples=31
  lat (usec)   : 4=96.55%, 10=2.88%, 20=0.53%, 50=0.03%, 500=0.01%
  lat (msec)   : 2=0.01%
  cpu          : usr=16.18%, sys=61.19%, ctx=1097, majf=0, minf=13
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,15360,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1


Run status group 0 (all jobs):
  WRITE: bw=720MiB/s (755MB/s), 720MiB/s-720MiB/s (755MB/s-755MB/s), io=15.0GiB (16.1GB), run=21346-21346msec


Disk stats (read/write):
    dm-0: ios=0/7069, merge=0/0, ticks=0/101322, in_queue=101322, util=47.50%, aggrios=0/13676, aggrmerge=0/0, aggrticks=0/175770, aggrin_queue=175769, aggrutil=51.85%
  sda: ios=0/13676, merge=0/0, ticks=0/175770, in_queue=175769, util=51.85%

I am on Proxmox 7.0.8, and the VM configuration looks like this:
Code:
agent: 1
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 8
efidisk0: RAID:vm-102-disk-0,size=1M
ide2: local:iso/CentOS-Stream-8-x86_64-20210706-boot.iso,media=cdrom
machine: q35
memory: 24048
name: centos.raid
net0: virtio=B6:F2:FC:8D:D5:97,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: RAID:vm-102-disk-1,size=120G
scsihw: virtio-scsi-pci
smbios1: uuid=1e296c70-e03c-4748-a3ae-3a56b08200e1
sockets: 2
vga: virtio
vmgenid: 9d132687-a8de-4fb5-b7b9-79432b9c7e5f
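
As far as I understand, the disk line also accepts per-disk tuning options like a dedicated IO thread and the cache/aio mode, so maybe something like this would be worth testing (just an idea, I have not verified it):
Code:
# untested variant of the disk config with a dedicated IO thread per disk
scsihw: virtio-scsi-single
scsi0: RAID:vm-102-disk-1,size=120G,iothread=1,cache=none,aio=native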



Do you have any idea how I can get full performance from my NVMe disks?
 
