OK, I have been chasing a performance issue transferring data between two Proxmox hosts and I can't seem to figure it out. The general issue is that I am seeing really low transfer rates between the machines: copying a large 10GB file only gets about 80MB/s, regardless of whether I use scp, rsync, NFS, or iSCSI. Here is the basic layout of the two machines:
Host A
- dual Xeon E5-2680v3
- 256GB DDR4-2133 ECC RDIMMs
- dual port 10Gb RJ-45 NIC bonded ( rr ) added to vmbr1 ( ports, bond, and bridge MTU 9000; config sketch below )
- zpool: 6x Intel DC S3700, striped RAID0, compression=on, ashift=12, blocksize=32k
- 60GB ram disk mounted

Host B
- dual Xeon E5-2690v3
- 512GB DDR4-2133 ECC RDIMMs
- dual port 10Gb RJ-45 NIC bonded ( rr ) added to vmbr1 ( ports, bond, and bridge MTU 9000 )
- 80GB ram disk mounted
- zpool: 8x 6Gb/s 15K SAS2 drives, striped RAID0, compression=on, ashift=12, blocksize=32k
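For completeness, the bond/bridge setup on each host looks roughly like this in /etc/network/interfaces ( interface names and the address are placeholders, so treat this as a sketch rather than a verbatim copy ):
Code:
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-mode balance-rr
        bond-miimon 100
        mtu 9000

auto vmbr1
iface vmbr1 inet static
        address 10.0.0.1/24
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        mtu 9000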
Using iperf3 going host A to B across the bonded bridge, I am getting:
Code:
[ 5] 0.00-10.00 sec 22.9 GBytes 19.7 Gbits/sec 627 sender
[ 5] 0.00-10.00 sec 22.9 GBytes 19.7 Gbits/sec receiver
Totally what I expected, so one check mark here. Same results basically going either way between them. The bridge at the host level is working.
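For reference, these are plain iperf3 runs along these lines ( exact flags assumed, defaults otherwise ):
Code:
# on the receiving host
iperf3 -s
# on the sending host: default single TCP stream, 10 seconds
iperf3 -c <host-b-ip>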
On host A, doing some basic disk performance testing using 'dd', I get:
Code:
root@odin:/mnt/pve/ram# dd if=/dev/random of=/vm2-zfs-r0/test/test.dat bs=1M count=20000
20000+0 records in
20000+0 records out
20971520000 bytes (21 GB, 20 GiB) copied, 94.6553 s, 222 MB/s
root@odin:/mnt/pve/ram# dd if=/dev/zero of=/vm2-zfs-r0/test/test1.dat bs=1M count=20000
20000+0 records in
20000+0 records out
20971520000 bytes (21 GB, 20 GiB) copied, 5.78259 s, 3.6 GB/s
root@odin:/mnt/pve/ram# dd if=/dev/random of=/mnt/pve/ram/test2.dat bs=1M count=20000
20000+0 records in
20000+0 records out
20971520000 bytes (21 GB, 20 GiB) copied, 102.238 s, 205 MB/s
root@odin:/mnt/pve/ram# dd if=/dev/zero of=/mnt/pve/ram/test3.dat bs=1M count=20000
20000+0 records in
20000+0 records out
20971520000 bytes (21 GB, 20 GiB) copied, 15.1278 s, 1.4 GB/s
root@odin:/mnt/pve/ram# dd if=test2.dat of=/vm2-zfs-r0/test/test4.dat bs=1M
20000+0 records in
20000+0 records out
20971520000 bytes (21 GB, 20 GiB) copied, 18.981 s, 1.1 GB/s
root@odin:/mnt/pve/ram# dd if=test3.dat of=/vm2-zfs-r0/test/test5.dat bs=1M
20000+0 records in
20000+0 records out
20971520000 bytes (21 GB, 20 GiB) copied, 13.5282 s, 1.6 GB/s
root@odin:/mnt/pve/ram# time cp test2.dat /vm2-zfs-r0/test/test6.dat
real 0m19.271s
user 0m0.156s
sys 0m19.099s
root@odin:/mnt/pve/ram# time cp test3.dat /vm2-zfs-r0/test/test7.dat
real 0m14.861s
user 0m0.112s
sys 0m14.739s
root@odin:/mnt/pve/ram# rsync -avp test2.dat /vm2-zfs-r0/test
sending incremental file list
test2.dat
sent 20,976,640,100 bytes received 35 bytes 856,189,393.27 bytes/sec
total size is 20,971,520,000 speedup is 1.00
root@odin:/mnt/pve/ram# rsync -avp test3.dat /vm2-zfs-r0/test
sending incremental file list
test3.dat
sent 20,976,640,099 bytes received 35 bytes 1,133,872,439.68 bytes/sec
total size is 20,971,520,000 speedup is 1.00
Overall the disk performance is about what I was expecting, though I was a little surprised by the ram disk numbers ( /dev/zero at only about 1.4GB/s ) and by the speeds copying the zero and random files over to the zpool. Straight cp and rsync produced slower transfers, but still within expected results.
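One caveat I should flag on my own numbers: with compression=on, the /dev/zero writes to the zpool are probably measuring the compression path more than the disks, since long runs of zeros compress to almost nothing. That would show up in the pool's compression ratio, e.g. ( assuming the pool is named vm2-zfs-r0 ):
Code:
zfs get compression,compressratio vm2-zfs-r0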
I did the same type of disk benchmarks on host B and got the expected results: moving data around with dd gave 1+GB/s transfers, and cp/rsync around 800MB/s.
So now I created a VM on each host with basically the same configuration ( rough config sketch after the list ):
- 8 cores, host, numa=1
- 16GB ram
- VirtIO SCSI controller
- 100G disk scsi
- VirtIO NIC to vmbr1 firewall=off ( MTU 9000 set in guest os )
- Fedora 33 x86 guest os
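That maps to roughly the following in the VM config file ( sketch from memory; the VMID, MAC, and storage ID are placeholders ):
Code:
# /etc/pve/qemu-server/<vmid>.conf
cores: 8
cpu: host
numa: 1
memory: 16384
scsihw: virtio-scsi-pci
scsi0: vm2-zfs-r0:vm-<vmid>-disk-0,size=100G
net0: virtio=<mac>,bridge=vmbr1,firewall=0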
First test, iperf3 with both ends on the same bridge on the same host:
Code:
[ 5] 0.00-10.00 sec 19.8 GBytes 17.0 Gbits/sec 0 sender
[ 5] 0.00-10.00 sec 19.8 GBytes 17.0 Gbits/sec receiver
Pretty much expected results as the two ends are using the same bridge on the same host.
Now going from the VM on Host A to Host B, I am getting this in iperf3:
Code:
[ 5] 0.00-10.00 sec 14.1 GBytes 12.1 Gbits/sec 30 sender
[ 5] 0.00-10.00 sec 14.1 GBytes 12.1 Gbits/sec receiver
That is about 30% slower than expected, at 12.1Gb/s. The traffic for this test makes two hops: VM to Host A, then Host A to Host B. Surely the processing needed to forward from the first hop to the second can't take that big of a hit. Question 1: What would be causing the drop in performance/throughput here?
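One data point that might matter: the Retr column shows retransmits on these runs ( 627 on the host-to-host test above ), and balance-rr bonds can deliver packets out of order, which TCP counts against itself. Retransmit totals can be watched with something like:
Code:
netstat -s | grep -i retrans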
Now going from the VM on Host A to the VM on Host B using iperf3, I am getting the following:
Code:
[ 5] 0.00-10.00 sec 10.6 GBytes 9.11 Gbits/sec 25 sender
[ 5] 0.00-10.00 sec 10.6 GBytes 9.11 Gbits/sec receiver
That is about 50% slower than expected, at 9.11Gb/s. The traffic for this test makes three hops: VM to Host A, Host A to Host B, and then Host B to VM. Again, I cannot see the forwarding taking that big of a hit, as it is not that complex. Question 2: What would be causing the drop in performance/throughput here?
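One variable I have not isolated yet: whether a single TCP stream is simply CPU-bound in the guest. A multi-stream run would separate that out ( sketch, four parallel streams ):
Code:
iperf3 -c <vm-on-host-b-ip> -P 4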
Now doing some disk benchmarking with dd in a VM whose disk sits on the host zpool, I am getting:
Code:
[root@sauron gondor]# dd if=/dev/zero of=/gondor/test1.dat bs=1M count=20000
20000+0 records in
20000+0 records out
20971520000 bytes (21 GB, 20 GiB) copied, 54.3383 s, 386 MB/s
[root@sauron gondor]# dd if=/dev/random of=/gondor/test2.dat bs=1M count=20000
20000+0 records in
20000+0 records out
20971520000 bytes (21 GB, 20 GiB) copied, 136.695 s, 153 MB/s
The write performance in the VM is nowhere close to what is expected: raw writes from /dev/zero are about 1/3 of the host speed, and writes from /dev/random are down about 30%. I know there is overhead with the VM sitting on top of the zpool and handling all the abstraction of the hardware, but this seems waaaaaaay off. Question 3: What is causing the reduced disk performance from the VM's perspective?

Now copying/moving/transferring data from one VM to the other is the major source of pain, with transfer speeds only around 80MB/s:
-- using tar, mbuffer, and ssh ( pipeline sketched below ): in @ 77.9 MiB/s, out @ 77.9 MiB/s, 4202 MiB total, buffer 100% full apps/UpRev/video/06-Evans-Up
-- rsync: sent 998,511,690 bytes received 35 bytes 60,515,862.12 bytes/sec
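The tar/mbuffer/ssh pipeline was along these lines ( paths, host, and buffer size are placeholders; a sketch, not the exact command ):
Code:
tar -cf - /some/dir | mbuffer -m 1G | ssh <other-vm> "mbuffer -m 1G | tar -xf - -C /dest"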
One thing I did notice is that the VM receiving the data has at least 50% of its cores pegged at 100% iowait ( wa in top ). What is causing the huge amount of I/O wait? This might be the critical issue.
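For anyone wanting to reproduce what I am seeing, the iowait is visible inside the receiving VM with something like:
Code:
# per-device utilization and wait times, refreshed every second
iostat -x 1
# system-wide view; the 'wa' column is the iowait percentage
vmstat 1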