disk passthrough performance weirdness

RolandK

Renowned Member
Can somebody explain the following?

I have two hard disks, one SAS and one SATA:

Code:
# fdisk -l /dev/sdz
Disk /dev/sdz: 3.64 TiB, 4000787030016 bytes, 7814037168 sectors
Disk model: ST4000NM0034
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Code:
# fdisk -l /dev/sdae
Disk /dev/sdae: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: HGST HDN728080AL
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

I pass these disks through to a virtual machine:

Code:
scsi1: /dev/sdz,aio=threads,backup=0,iothread=1,size=3907018584K
scsi2: /dev/sdae,aio=threads,backup=0,iothread=1,size=7814026584K

When I write directly to the disks on the Proxmox host with "dd if=/dev/zero of=/dev/disk bs=4k", I get decent performance on both disks (about 200 MB/s on writes):

Code:
Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
sdae             0.00      0.00     0.00   0.00    0.00     0.00  424.00 217088.00 54102.00  99.22    7.75   512.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    3.29 100.00
sdz              0.00      0.00     0.00   0.00    0.00     0.00  384.00 196608.00 48895.00  99.22    8.60   512.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    3.30 100.00

Inside the virtual machine, writing the same way to the mapped/passed-through virtual disks, the write to the SAS disk is dead slow:

iostat in VM:
Code:
Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
sdr              0.00      0.00     0.00   0.00    0.00     0.00   75.00  76200.00 18216.00  99.59  156.37  1016.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   11.73  99.60
sdx              0.00      0.00     0.00   0.00    0.00     0.00  171.00 173736.00 43650.00  99.61   82.51  1016.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   14.11 100.40

iostat on HOST:
Code:
Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
sdae             0.00      0.00     0.00   0.00    0.00     0.00  339.00 172216.00     0.00   0.00    4.40   508.01    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.49 100.00
sdz              0.00      0.00     0.00   0.00    0.00     0.00   76.00  77216.00     0.00   0.00   12.93  1016.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.98  99.20

We see that the write to the SATA disk inside the VM is only a little slower than on the host, but the write to the SAS disk is less than half as fast as before.

What could be the reason for this? I'm out of ideas.

mappings:
sdr in VM -> sdz on host
sdx in VM -> sdae on host
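
For reference, the mapping could be confirmed from inside the VM, since virtio-scsi exposes the drive-scsiN name from the VM config in the by-id links (a quick sketch, output omitted):

Code:
# inside the VM: by-id links carry the drive-scsiN names from the VM config
ls -l /dev/disk/by-id/ | grep QEMU
# on the host: identify the physical disks by model/WWN
ls -l /dev/disk/by-id/ | grep -E 'sdz|sdae'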
 
First, dd is not a benchmark. If you really want to do benchmarks, use something like 'fio'.

Second, what is the SCSI controller of the VM? Also, what is the host CPU? (If you don't use iothreads, all I/O happens on the same thread for all disks.)
 
>First, dd is not a benchmark. If you really want to do benchmarks, use something like 'fio'.

You are right. I have always used dd for very quick/basic benchmarking, with good results, but apparently it's not reliable.

>Second, what is the SCSI controller of the VM? Also, what is the host CPU? (If you don't use iothreads, all I/O happens on the same thread for all disks.)

The controller is virtio-scsi-single.
The host CPU is an E5-2630L v3 @ 1.80GHz.

The physical controller on the host is a Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02).


I was trying to mimic the

Code:
dd if=/dev/zero of=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi22 bs=1024k oflag=direct

behaviour as closely as possible with fio, like this:

Code:
[test]
bs=1024k
filename=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi22
rw=write
direct=1
buffered=0
size=1g
ioengine=sync
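
The same job could presumably also be run as a one-liner with equivalent parameters (untested sketch):

Code:
fio --name=test --filename=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi22 \
    --rw=write --bs=1024k --direct=1 --size=1g --ioengine=sync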


Code:
# fio /root/fio.job
test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=sync, iodepth=1

fio-3.25
Starting 1 process

Jobs: 1 (f=1): [W(1)][100.0%][w=193MiB/s][w=193 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=4086046: Mon Dec  5 15:29:41 2022
  write: IOPS=205, BW=205MiB/s (215MB/s)(1024MiB/4989msec); 0 zone resets
    clat (usec): min=1835, max=19638, avg=4827.14, stdev=784.34
     lat (usec): min=1872, max=19669, avg=4866.28, stdev=783.42
    clat percentiles (usec):
     |  1.00th=[ 4080],  5.00th=[ 4146], 10.00th=[ 4228], 20.00th=[ 4359],
     | 30.00th=[ 4490], 40.00th=[ 4621], 50.00th=[ 4686], 60.00th=[ 4817],
     | 70.00th=[ 4948], 80.00th=[ 5080], 90.00th=[ 5473], 95.00th=[ 5932],
     | 99.00th=[ 6652], 99.50th=[ 6915], 99.90th=[12911], 99.95th=[19530],
     | 99.99th=[19530]

   bw (  KiB/s): min=188416, max=221184, per=100.00%, avg=210261.33, stdev=12952.69, samples=9
   iops        : min=  184, max=  216, avg=205.33, stdev=12.65, samples=9
  lat (msec)   : 2=0.29%, 4=0.20%, 10=99.22%, 20=0.29%
  cpu          : usr=1.34%, sys=1.08%, ctx=1024, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1024,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=205MiB/s (215MB/s), 205MiB/s-205MiB/s (215MB/s-215MB/s), io=1024MiB (1074MB), run=4989-4989msec


We can see that fio indeed behaves differently, providing decent/expected performance.


I tried to find out why.

Now things started to get interesting:

If I write RANDOM data that was dumped to a file on ZFS, i.e. data read back from the ARC cache (instead of reading from /dev/urandom directly, as that may be too slow), I also get decent performance with dd (like with fio), for both the SATA and the SAS passthrough disk.
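
The random test file used below can be prepared once up front, for example like this (a sketch; 1 GiB to match the dd runs below, path as used there):

Code:
# dump 1 GiB of random data to a file on the ZFS pool; repeated reads then come from ARC
dd if=/dev/urandom of=/backuppool/test.dat bs=1M count=1024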

Apparently, only writing ZEROES to the SAS passthrough disk is slow:

Code:
# dd if=/backuppool/test.dat of=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi26 bs=1024k oflag=direct count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.40889 s, 199 MB/s

root@pbs01:/backuppool# dd if=/dev/zero of=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi26 bs=1024k oflag=direct count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.38993 s, 199 MB/s

root@pbs01:/backuppool# dd if=/backuppool/test.dat of=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi22 bs=1024k oflag=direct count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.19516 s, 207 MB/s

root@pbs01:/backuppool# dd if=/dev/zero of=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi22 bs=1024k oflag=direct count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 13.5184 s, 79.4 MB/s  <- !!!

I then set both passthrough entries to detect_zeroes=0, and voilà, the problem went away.
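
For reference, that corresponds to adding detect_zeroes=0 to the two passthrough entries in the VM config, roughly like this (a sketch based on the entries shown above):

Code:
scsi1: /dev/sdz,aio=threads,backup=0,detect_zeroes=0,iothread=1,size=3907018584K
scsi2: /dev/sdae,aio=threads,backup=0,detect_zeroes=0,iothread=1,size=7814026584K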

So the question is: why does detect_zeroes seem to be automatically active for device name/path "/dev/disk/by-id/scsi-35000c500836b5fa7" but not for name/path "/dev/disk/by-id/ata-HGST_HDN728080ALE604_VJGD61TX"?

Furthermore, why does detecting zeroes have such a big impact on performance? Determining whether a data block's contents are all zero should be a blazingly fast operation for a recent CPU. We can compress zeroes to nothing with zstd at >1 GB/s, so why does detecting zeroes eat up more than half of 200 MB/s?
 
>So the question is: why does detect_zeroes seem to be automatically active for device name/path "/dev/disk/by-id/scsi-35000c500836b5fa7" but not for name/path "/dev/disk/by-id/ata-HGST_HDN728080ALE604_VJGD61TX"?
Maybe it's active for both, but the drives report their capabilities differently, such that QEMU does something for SAS that is slower (e.g. really writing zeroes instead of using 'unmap' or similar)?

Can you reproduce that with an upstream QEMU too? (That is, if you want to try to reproduce it, since I don't have any SAS disks here at the moment ;))
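
One way to compare what the two drives report to the kernel (and hence to QEMU) would be to look at the block-layer limits in sysfs, e.g. (a sketch; device names as on the host above):

Code:
# max bytes the kernel will put into a single write-zeroes / write-same / discard request
grep . /sys/block/sdz/queue/{write_zeroes_max_bytes,write_same_max_bytes,discard_max_bytes}
grep . /sys/block/sdae/queue/{write_zeroes_max_bytes,write_same_max_bytes,discard_max_bytes}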
 
>Maybe it's active for both, but the drives report their capabilities differently, such that QEMU
>does something for SAS that is slower (e.g. really writing zeroes instead of using 'unmap' or similar)?

@dcsapak, no, apparently it's a hardware issue.

I investigated further, and it comes down to slowness of the SAS disk with the BLKZEROOUT ioctl (I can also see with blkdiscard -z that the ioctl takes significantly longer on the SAS disk).
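
The difference can be reproduced directly on the host by timing a zero-out of the same range on both disks, e.g. (a sketch; careful, this destroys data in the given range):

Code:
# zero the first 1 GiB via the BLKZEROOUT ioctl and compare timings
time blkdiscard -z -o 0 -l 1G /dev/sdz     # SAS
time blkdiscard -z -o 0 -l 1G /dev/sdae    # SATA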

I made another weird observation during the investigation, reported at https://gitlab.com/qemu-project/qemu/-/issues/1362

Maybe you want to have a look. I think it's interesting.
 