Slow disk performance

Tim Denis

Hi!
I've got an X9DRE-LN4F Supermicro board with 128 GB DDR3 and 2x Xeon E5-2620. Not the newest server, but it runs.
I noticed high IO delay and started investigating.

Specs:
- Proxmox 6.3
- 2x 1 TB Samsung 850 Pro in a ZFS mirror (configured during install)
- BIOS and IPMI firmware updated to the latest versions available on the Supermicro site

The server also contains an NVMe drive (via an adapter; the board has no M.2 slot).

Whatever I do, I cannot get past 180 MB/s for disk writes.
So when I do dd if=/dev/urandom bs=10M count=1024 of=/randomfile
the speed is 180 MB/s in the beginning and gradually drops down to around 40 MB/s.

EXACTLY the same pattern with the NVMe drive. Weird, because that is a different interface, right?

CPU usage is low during the write.
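(If anyone wants to reproduce this: something like the command below shows per-disk utilisation while the write runs; iostat comes from the sysstat package, which is an extra install, and the exact columns depend on the sysstat version.)
Code:
apt install sysstat
iostat -x 1     # watch the write throughput and %util columns for the disk under test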

I checked the connections, the right SATA ports, etc. Everything seems to be all right (yes, the drives are connected to the SATA3 ports, not the SATA2 ports the board also has).
Linux tells me the link speed is 6 Gb/s, so that must be SATA3... but the speeds do not match.
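(For anyone wanting to check the same thing, something along these lines reports the negotiated link speed; /dev/sda is just a placeholder for the SSD, and smartctl comes from the smartmontools package.)
Code:
# link speed as negotiated by the kernel at boot
dmesg | grep -i "SATA link up"
# what the drive itself reports
smartctl -i /dev/sda | grep -i "SATA Version"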

Upon further investigation, I installed Windows 10 Pro bare metal on the machine... and guess what? I do get the speeds I expect (CrystalDiskMark):
SATA SSD: 562 MB/s read | 517 MB/s write. That's what I expect from a Samsung 850 Pro connected to a SATA3 port.
NVMe: 742 MB/s read | 296 MB/s write. Not really what I expected, but still a lot more than what I get in Proxmox.

Okay, maybe it is Proxmox. So I installed Fedora Server on the machine, also bare metal.
Same slow speeds. Exactly the 180 MB/s region again.

Everywhere I look, I read that the C602 chipset should be supported by the kernel...

I also added an add-in card providing 4 SATA ports: exactly the same speeds. I added an Intel 100 GB S3700 SSD: exactly the same speeds...

So I feel something is capping the speed of the storage...

The goal for the server is to run TrueNAS in a VM with an HBA passed through via PCIe. That works and gives me the speeds I expect from the classic spinning HDDs. I wanted to add the NVMe drive as a cache to that TrueNAS VM, but same result: it is slow in TrueNAS as well.

As a final test, I passed the NVMe drive through to a Windows VM on the Proxmox host via PCIe passthrough. Running CrystalDiskMark in that VM gives me about the same results as on bare-metal Windows on this server...
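(Roughly what that passthrough looks like on the Proxmox side; the PCI address and VM ID below are placeholders:)
Code:
# find the PCI address of the NVMe controller
lspci | grep -i nvme
# attach it to the VM (requires a q35 machine type for pcie=1)
qm set 100 -hostpci0 01:00.0,pcie=1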

So that leads me to the conclusion that it must be the Linux kernel that does not allow the full speed...?

Any ideas or suggestions that might point me in the right direction are highly appreciated!
 
So when I do dd if=/dev/urandom bs=10M count=1024 of=/randomfile

You're testing with a random-number generator; those aren't meant to be fast but accurate. Your system has no entropy.

Verify the speed of /dev/urandom itself with:
Code:
apt install pv
cat /dev/urandom | pv > /dev/null

You should use fio for proper benchmarking.
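If you still want a quick dd-style sanity check, write zeros with direct I/O so the random generator isn't the bottleneck, roughly like this (the path is a placeholder, and keep in mind that zeros compress away to nothing on a ZFS dataset with compression enabled):
Code:
dd if=/dev/zero of=/tmp/ddtest bs=10M count=1024 oflag=direct conv=fdatasync status=progress
rm /tmp/ddtest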
 
Thanks for your reply!
I did not know that /dev/urandom would be so slow... :)

cat /dev/urandom | pv > /dev/null
gave me exactly the same 180 MB/s limit. So that's clear now.

I started testing with fio.

Seems even slower than I thought!
Bash:
fio --filename=/media/nvme/fiofile.delete --size=5GB --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=5

Where I would vary the --size parameter...

For the NVMe storage:
Bash:
NVME 1GB
Run status group 0 (all jobs):
   READ: bw=124MiB/s (130MB/s), 124MiB/s-124MiB/s (130MB/s-130MB/s), io=14.5GiB (15.6GB), run=120047-120047msec
  WRITE: bw=124MiB/s (130MB/s), 124MiB/s-124MiB/s (130MB/s-130MB/s), io=14.5GiB (15.6GB), run=120047-1200

NVME 2GB
  Run status group 0 (all jobs):
   READ: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=15.1GiB (16.3GB), run=120046-120046msec
  WRITE: bw=129MiB/s (136MB/s), 129MiB/s-129MiB/s (136MB/s-136MB/s), io=15.2GiB (16.3GB), run=120046-120046msec

NVME 5GB
Run status group 0 (all jobs):
   READ: bw=65.4MiB/s (68.5MB/s), 65.4MiB/s-65.4MiB/s (68.5MB/s-68.5MB/s), io=7924MiB (8308MB), run=121235-121235msec
  WRITE: bw=65.6MiB/s (68.8MB/s), 65.6MiB/s-65.6MiB/s (68.8MB/s-68.8MB/s), io=7951MiB (8337MB), run=121235-121235msec

For the Samsung 850 Pro storage (in the ZFS mirror):

Bash:
1GB
Run status group 0 (all jobs):
   READ: bw=4008MiB/s (4203MB/s), 4008MiB/s-4008MiB/s (4203MB/s-4203MB/s), io=470GiB (504GB), run=120001-120001msec
  WRITE: bw=4005MiB/s (4200MB/s), 4005MiB/s-4005MiB/s (4200MB/s-4200MB/s), io=469GiB (504GB), run=120001-120001msec



2GB
Run status group 0 (all jobs):
   READ: bw=1032MiB/s (1082MB/s), 1032MiB/s-1032MiB/s (1082MB/s-1082MB/s), io=121GiB (130GB), run=120001-120001msec
  WRITE: bw=1032MiB/s (1082MB/s), 1032MiB/s-1032MiB/s (1082MB/s-1082MB/s), io=121GiB (130GB), run=120001-120001msec


5GB
Run status group 0 (all jobs):
   READ: bw=66.5MiB/s (69.7MB/s), 66.5MiB/s-66.5MiB/s (69.7MB/s-69.7MB/s), io=7980MiB (8368MB), run=120003-120003msec
  WRITE: bw=66.8MiB/s (70.0MB/s), 66.8MiB/s-66.8MiB/s (70.0MB/s-70.0MB/s), io=8012MiB (8401MB), run=120003-120003msec

So, all in all, it's even slower than the 180 MB/s I thought it was...
The really fast speeds (4,000+ MB/s) for the smaller tests are, I assume, due to the RAM caching ZFS does for the zpool.
The NVMe drive is just formatted as ext4, so no such caching there.
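(That RAM cache would be the ZFS ARC; its current size on the host can be checked with the ZFS tooling, e.g.:)
Code:
# summary report shipped with the ZFS utilities
arc_summary | head -n 40
# or the raw counters
grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats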

I'm puzzled ...
Any advice? Thanks!
 
I think your results are mixed up. NVMe should be the faster one.

SSDs have a built-in cache, and that's why they are fast. After 2-8 GB the cache is full and they drop to 100-700 MB/s, depending on the model.

CrystalDiskMark on Windows tests with 1 GB by default, so that's always only the cache. It's misleading, but that's how they advertise it.


ZFS uses copy-on-write with very small block sizes, so that adds a lot of overhead. Consumer drives aren't really suited for that kind of workload.

That's why you get such bad results.
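(You can see which block sizes are actually in use with something like this; the pool and zvol names are placeholders:)
Code:
# recordsize applies to datasets, volblocksize to zvols
zfs get recordsize,compression rpool
zfs get volblocksize rpool/data/vm-100-disk-0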

Just take a look at the official Proxmox ZFS benchmark: https://www.proxmox.com/en/downloads/item/proxmox-ve-zfs-benchmark-2020


For reliable results I run:
Code:
fio --name=seqwrite --filename=seqwrite.fio --refill_buffers --rw=write --direct=1 --loops=3 --ioengine=libaio --bs=1m --size=5G --runtime=60 --group_reporting

For cache testing (system + SSD/NVMe):
Code:
fio --name=seqwrite --filename=seqwrite.fio --refill_buffers --rw=write --direct=1 --ioengine=libaio --bs=1m --size=1G --runtime=60 --group_reporting
 
I think your results are mixed up. NVMe should be the faster one.
That should be the case. But it isn't. I'm 100% sure I didn't mix up the results.

SSDs have a built-in cache, and that's why they are fast. After 2-8 GB the cache is full and they drop to 100-700 MB/s, depending on the model.
Yes, I know. But still, those first couple of GBs should be fast, no?
Or is Linux somehow skipping this cache? It shouldn't, since that's hardware...

CrystalDiskMark on Windows tests with 1 GB by default, so that's always only the cache. It's misleading, but that's how they advertise it.
Yes, but I tested with an 8 GB test file size, specifically to avoid this problem.

ZFS uses copy-on-write with very small block sizes, so that adds a lot of overhead. Consumer drives aren't really suited for that kind of workload.
Okay. But should I expect an impact larger than 50%?

That looks like a very good resource. I'm going to read it!
Thanks for your cooperation.
 
I have been struggling with this for several weeks now and I face the same problem: the SSD on the host only transfers at around 24 MB/s. When I test with an HDD on the host, it also runs at 24 MB/s. The SSD is a Samsung 860 500 GB. It really looks like the host Debian kernel has a problem.

I am running Proxmox 6.1 and tested on 2 separate machines: one a Supermicro server, the other Proxmox converted from a PC. Both give the same result, therefore I rule out the SATA controller.
 
Hi,

I have a similar problem.

I have 3 NVMe Corsair MP510 1.8 TB disks, all of them in a ZFS RAIDZ.

When I run the fio benchmark from Proxmox in /dev/zvol/RAID, I get really nice values:

Code:
root@px01:/dev/zvol/RAID# fio --name=seqwrite --filename=seqwrite.fio --refill_buffers --rw=write --loops=3 --ioengine=libaio --bs=1m --size=5G --runtime=60 --group_reporting
seqwrite: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=2011MiB/s][w=2011 IOPS][eta 00m:00s]
seqwrite: (groupid=0, jobs=1): err= 0: pid=849808: Sun Jul 18 21:37:04 2021
  write: IOPS=1939, BW=1939MiB/s (2034MB/s)(15.0GiB/7920msec); 0 zone resets
    slat (usec): min=273, max=453, avg=303.07, stdev=27.67
    clat (nsec): min=990, max=10317, avg=1143.23, stdev=176.85
     lat (usec): min=274, max=460, avg=304.37, stdev=27.71
    clat percentiles (nsec):
     |  1.00th=[ 1020],  5.00th=[ 1032], 10.00th=[ 1048], 20.00th=[ 1048],
     | 30.00th=[ 1064], 40.00th=[ 1080], 50.00th=[ 1112], 60.00th=[ 1128],
     | 70.00th=[ 1160], 80.00th=[ 1208], 90.00th=[ 1272], 95.00th=[ 1368],
     | 99.00th=[ 1592], 99.50th=[ 1704], 99.90th=[ 2352], 99.95th=[ 2928],
     | 99.99th=[ 7776]
   bw (  MiB/s): min= 1814, max= 2040, per=99.91%, avg=1937.60, stdev=89.59, samples=15
   iops        : min= 1814, max= 2040, avg=1937.60, stdev=89.59, samples=15
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 2=99.80%, 4=0.14%, 10=0.03%, 20=0.01%
  cpu          : usr=42.75%, sys=57.24%, ctx=21, majf=0, minf=14
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,15360,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1


Run status group 0 (all jobs):
  WRITE: bw=1939MiB/s (2034MB/s), 1939MiB/s-1939MiB/s (2034MB/s-2034MB/s), io=15.0GiB (16.1GB), run=7920-7920msec


But when I run it from a VM (CentOS 8.3), I have a performance problem:


Code:
[root@localhost ~]# fio --name=seqwrite --filename=seqwrite.fio --refill_buffers --rw=write --loops=3 --ioengine=libaio --bs=1m --size=5G --runtime=60 --group_reporting
seqwrite: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.19
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=1110MiB/s][w=1109 IOPS][eta 00m:00s]
seqwrite: (groupid=0, jobs=1): err= 0: pid=3280: Sun Jul 18 15:39:47 2021
  write: IOPS=719, BW=720MiB/s (755MB/s)(15.0GiB/21346msec); 0 zone resets
    slat (usec): min=342, max=12585, avg=728.40, stdev=578.64
    clat (usec): min=2, max=1113, avg= 3.14, stdev= 9.27
     lat (usec): min=345, max=12595, avg=732.24, stdev=579.06
    clat percentiles (usec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    3], 40.00th=[    3], 50.00th=[    3], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[    4], 95.00th=[    4],
     | 99.00th=[    6], 99.50th=[   11], 99.90th=[   18], 99.95th=[   20],
     | 99.99th=[  265]
   bw (  KiB/s): min=49053, max=1655568, per=100.00%, avg=992474.52, stdev=436697.03, samples=31
   iops        : min=   47, max= 1616, avg=968.77, stdev=426.49, samples=31
  lat (usec)   : 4=96.55%, 10=2.88%, 20=0.53%, 50=0.03%, 500=0.01%
  lat (msec)   : 2=0.01%
  cpu          : usr=16.18%, sys=61.19%, ctx=1097, majf=0, minf=13
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,15360,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1


Run status group 0 (all jobs):
  WRITE: bw=720MiB/s (755MB/s), 720MiB/s-720MiB/s (755MB/s-755MB/s), io=15.0GiB (16.1GB), run=21346-21346msec


Disk stats (read/write):
    dm-0: ios=0/7069, merge=0/0, ticks=0/101322, in_queue=101322, util=47.50%, aggrios=0/13676, aggrmerge=0/0, aggrticks=0/175770, aggrin_queue=175769, aggrutil=51.85%
  sda: ios=0/13676, merge=0/0, ticks=0/175770, in_queue=175769, util=51.85%

I have Proxmox 7.0.8, and the VM configuration looks like this:
Code:
agent: 1
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 8
efidisk0: RAID:vm-102-disk-0,size=1M
ide2: local:iso/CentOS-Stream-8-x86_64-20210706-boot.iso,media=cdrom
machine: q35
memory: 24048
name: centos.raid
net0: virtio=B6:F2:FC:8D:D5:97,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: RAID:vm-102-disk-1,size=120G
scsihw: virtio-scsi-pci
smbios1: uuid=1e296c70-e03c-4748-a3ae-3a56b08200e1
sockets: 2
vga: virtio
vmgenid: 9d132687-a8de-4fb5-b7b9-79432b9c7e5f



Do you have any idea how I can get full performance out of my NVMe disks?
 
