Intel P3600 NVMe SSD Passthrough - Poor Performance

nitrag

I am passing through an Intel P3600 1.6TB NVMe SSD to a fresh install of Ubuntu 19.10 in a VM. With the Q35 machine type, 44 cores, and 48 GB of RAM, performance is terrible: it maxes out around 1000 MB/s and averages 450 MB/s.

It was previously passed through on a Dell R710 running ESXi, where I had no problem reaching 1500+ MB/s on PCIe 2.0. I was expecting 2500-3500 MB/s on this new server with PCIe 3.0.

Not to mention I get 1500 MB/s on my 970 EVO 500GB NVMe, which sits on a PCIe 3.0 x4 adapter card and serves as storage for my Proxmox VMs, though I chalked that up to it being shared VM storage with 6 VMs running.

Code:
hdparm -t /dev/nvme0n1p1
/dev/nvme0n1p1:
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
Timing buffered disk reads: 2966 MB in  3.00 seconds = 988.67 MB/sec

dd if=/dev/zero of=/tmp/benchmark bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.43485 s, 441 MB/s
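
Side note: I realize dd with oflag=dsync and a single 1 GiB block is as much a sync-latency test as a throughput test. A direct-I/O variant like the one below (assuming /tmp sits on the passed-through drive) should be closer to the raw sequential write speed:

Code:
dd if=/dev/zero of=/tmp/benchmark bs=1M count=4096 oflag=direct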

Code:
fio --name=seqread --rw=read --direct=1 --ioengine=libaio --bs=8k --numjobs=8 --size=1G --runtime=600  --group_reporting
seqread: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=1
...
fio-3.12
Starting 8 processes
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
Jobs: 7 (f=6): [_(1),R(3),f(1),R(3)][100.0%][r=383MiB/s][r=48.0k IOPS][eta 00m:00s]
seqread: (groupid=0, jobs=8): err= 0: pid=27687: Sun Feb 16 00:12:26 2020
  read: IOPS=51.5k, BW=402MiB/s (422MB/s)(8192MiB/20366msec)
    slat (usec): min=6, max=1241, avg=19.30, stdev=10.46
    clat (usec): min=2, max=3532, avg=129.98, stdev=42.18
     lat (usec): min=27, max=3551, avg=150.04, stdev=43.85
    clat percentiles (usec):
     |  1.00th=[   80],  5.00th=[   92], 10.00th=[   95], 20.00th=[   99],
     | 30.00th=[  104], 40.00th=[  116], 50.00th=[  125], 60.00th=[  135],
     | 70.00th=[  143], 80.00th=[  153], 90.00th=[  174], 95.00th=[  194],
     | 99.00th=[  245], 99.50th=[  265], 99.90th=[  351], 99.95th=[  420],
     | 99.99th=[ 1319]
   bw (  KiB/s): min=43184, max=59216, per=12.65%, avg=52111.00, stdev=2559.42, samples=318
   iops        : min= 5398, max= 7402, avg=6513.81, stdev=319.92, samples=318
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.04%, 50=0.20%, 100=23.08%
  lat (usec)   : 250=75.83%, 500=0.81%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.03%, 4=0.01%
  cpu          : usr=9.73%, sys=18.60%, ctx=1049518, majf=0, minf=131
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1048576,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
   READ: bw=402MiB/s (422MB/s), 402MiB/s-402MiB/s (422MB/s-422MB/s), io=8192MiB (8590MB), run=20366-20366msec
Disk stats (read/write):
  nvme0n1: ios=1113066/72, merge=0/69, ticks=127992/8, in_queue=0, util=99.61%
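
For reference, I know the job above runs 8 KiB sequential reads at the default iodepth=1, which is latency-bound by design; the drive's headline sequential numbers are quoted at much larger block sizes and deeper queues. A variant along these lines (same fio, different parameters, not yet run here) should show whether the ceiling is the drive or the passthrough path:

Code:
fio --name=seqread-qd32 --rw=read --direct=1 --ioengine=libaio --bs=128k --iodepth=32 --numjobs=4 --size=1G --runtime=60 --time_based --group_reporting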

Code:
06:10.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01) (prog-if 02 [NVM Express])
    Subsystem: Intel Corporation PCIe Data Center SSD
    Physical Slot: 16
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 21
    NUMA node: 0
    Region 0: Memory at fde50000 (64-bit, non-prefetchable) [size=16K]
    Expansion ROM at fde40000 [disabled] [size=64K]
    Capabilities: <access denied>
    Kernel driver in use: nvme
    Kernel modules: nvme

Code:
isdct show -a -intelssd

- Intel SSD DC P3600 Series CVMD426300961P6KGN -
AdminPath : /dev/nvme0
AggregationThreshold : 0
AggregationTime : 0
ArbitrationBurst : 0
Bootloader : 8B1B012B
CoalescingDisable : 0
DevicePath : /dev/nvme0n1
DeviceStatus : Healthy
DirectivesSupported : False
DisableThermalThrottle : The selected drive does not support this feature.
DynamicMMIOEnabled : The selected drive does not support this feature.
EndToEndDataProtCapabilities : 17
EnduranceAnalyzer : Media Workload Indicators have reset values. Run 60+ minute workload prior to running the endurance analyzer.
ErrorString :
Firmware : 8DV1RA03
FirmwareUpdateAvailable : Please contact your Intel representative about firmware update for this drive.
FormatNVMCryptoEraseSupported : True
FormatNVMSupported : True
HighPriorityWeightArbitration : 0
IOCompletionQueuesRequested : 30
IOSubmissionQueuesRequested : 30
Index : 0
Intel : True
IntelGen3SATA : False
IntelNVMe : True
InterruptVector : 0
IsDualPort : False
LBAFormat : 0
LatencyTrackingEnabled : Invalid Field in Command
LowPriorityWeightArbitration : 0
MaximumLBA : 3125627567
MediumPriorityWeightArbitration : 0
MetadataSetting : 0
MetadataSize : 0
ModelNumber : INTEL SSDPEDME016T4S
NVMeControllerID : 0
NVMeMajorVersion : 1
NVMeMinorVersion : 0
NVMePowerState : 0
NVMeTertiaryVersion : 0
NamespaceId : 1
NamespaceManagementSupported : False
NativeMaxLBA : 3125627567
NumErrorLogPageEntries : 63
NumLBAFormats : 6
NumberOfNamespacesSupported : 1
OEM : Oracle
PCIBus : 5
PCIDevice : 1
PCIDomain : 0
PCIFunction : 0
PCILinkGenSpeed : 3
PCILinkWidth : 4
PLITestTimeInterval : The selected drive does not support this feature.
PhyConfig : The selected drive does not support this feature.
PhySpeed : The selected drive does not support this feature.
PhysicalSectorSize : The selected drive does not support this feature.
PhysicalSize : 1600321314816
PowerGovernorAveragePower : Feature is not supported.
PowerGovernorBurstPower : Feature is not supported.
PowerGovernorMode : 0
Product : Middledale
ProductFamily : Intel SSD DC P3600 Series
ProductProtocol : NVME
ProtectionInformation : 0
ProtectionInformationLocation : 0
ReadErrorRecoveryTimer : Device does not support this command set.
SMARTEnabled : True
SMARTHealthCriticalWarningsConfiguration : 0
SMBusAddress : Invalid Field in Command
SMI : False
SanitizeBlockEraseSupported : False
SanitizeCryptoScrambleSupported : False
SanitizeOverwriteSupported : False
SectorDataSize : 512
SectorSize : 512
SelfTestSupported : False
SerialNumber : CVMD426300961P6KGN
TCGSupported : False
TelemetryLogSupported : False
TempThreshold : 85
TemperatureLoggingInterval : The selected drive does not support this feature.
TimeLimitedErrorRecovery : 0
TrimSupported : True
VolatileWriteCacheEnabled : False
WriteAtomicityDisableNormal : 0
WriteCacheReorderingStateEnabled : The selected drive does not support this feature.
WriteCacheState : The selected drive does not support this feature.
WriteErrorRecoveryTimer : Device does not support this command set.


I'm using the Q35 machine type and have tried various SCSI controllers and PCI vs. PCIe for the passthrough device, with no difference.
 
Hi,

How do you pass the disk through?
 
I used these instructions.
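
Roughly, what I did boils down to the following on the host (a sketch from memory, not a copy of the guide; the PCI address and VM ID are the ones from my setup):

Code:
# 1. Enable the IOMMU in /etc/default/grub on the Intel host:
#    GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
update-grub

# 2. Load the VFIO modules at boot:
echo -e "vfio\nvfio_iommu_type1\nvfio_pci\nvfio_virqfd" >> /etc/modules

# 3. Reboot, then attach the whole device to the VM:
qm set 17202 -hostpci0 02:00,pcie=1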

I will note that with the `--direct` parameter I was able to reach 1900 MB/s, but that still seems shy of what an enterprise drive should do.

This is the VM with 2 sockets and 24 cores/threads.

Code:
 # hdparm -Tt --direct /dev/nvme0n1

/dev/nvme0n1:
 Timing O_DIRECT cached reads:   3294 MB in  1.99 seconds = 1651.51 MB/sec
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
Timing O_DIRECT disk reads: 5640 MB in  3.00 seconds = 1879.65 MB/sec

# hdparm -Tt --direct /dev/nvme0n1

/dev/nvme0n1:
 Timing O_DIRECT cached reads:   3408 MB in  1.99 seconds = 1709.16 MB/sec
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 Timing O_DIRECT disk reads: 5880 MB in  3.00 seconds = 1959.73 MB/sec

Now on the Proxmox host itself (not in a VM), I was able to hit over 2300 MB/s with the 970 EVO 500GB:

Code:
# hdparm -Tt --direct /dev/nvme0n1

/dev/nvme0n1:
 Timing O_DIRECT cached reads:   3748 MB in  2.00 seconds = 1878.09 MB/sec
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 Timing O_DIRECT disk reads: 6952 MB in  3.00 seconds = 2317.22 MB/sec

Both cards are in PCIe 3.0 slots; there is also a PCH-attached slot, but it's only 2.0.
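
To rule out a link-training issue (isdct above already reports Gen3 x4, so this is mostly a sanity check), the negotiated link can be read from lspci as root, using whichever address the device shows up at (06:10.0 in the output above):

Code:
lspci -s 06:10.0 -vv | grep -E 'LnkCap|LnkSta'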

Even so, the fio bandwidth and IOPS results are around 10% of what a drive rated for 2500/1600 MB/s and up to 450k IOPS should deliver:

Code:
# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.12
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=190MiB/s,w=63.3MiB/s][r=48.6k,w=16.2k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=2693: Tue Feb 18 09:26:01 2020
  read: IOPS=67.8k, BW=265MiB/s (278MB/s)(3070MiB/11593msec)
   bw (  KiB/s): min=161736, max=386208, per=99.92%, avg=270951.61, stdev=60534.87, samples=23
   iops        : min=40434, max=96552, avg=67737.87, stdev=15133.69, samples=23
  write: IOPS=22.7k, BW=88.5MiB/s (92.8MB/s)(1026MiB/11593msec); 0 zone resets
   bw (  KiB/s): min=53912, max=129368, per=99.91%, avg=90539.74, stdev=20469.51, samples=23
   iops        : min=13478, max=32342, avg=22634.91, stdev=5117.36, samples=23
  cpu          : usr=38.09%, sys=61.65%, ctx=256, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
   READ: bw=265MiB/s (278MB/s), 265MiB/s-265MiB/s (278MB/s-278MB/s), io=3070MiB (3219MB), run=11593-11593msec
  WRITE: bw=88.5MiB/s (92.8MB/s), 88.5MiB/s-88.5MiB/s (92.8MB/s-92.8MB/s), io=1026MiB (1076MB), run=11593-11593msec
Disk stats (read/write):
  nvme0n1: ios=780449/261307, merge=0/22, ticks=126298/8385, in_queue=3576, util=99.20%
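
I realize the 450k IOPS figure is quoted for 4K random reads at high queue depth across multiple workers, while this job is a single process running a 75/25 mix on a filesystem. Something closer to the spec conditions would be along these lines (I haven't run it yet; it reads the raw device directly, which is safe for a read-only job):

Code:
fio --name=randread --rw=randread --bs=4k --iodepth=32 --numjobs=4 --direct=1 --ioengine=libaio --filename=/dev/nvme0n1 --runtime=60 --time_based --group_reporting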

Thanks for the help on this; I'm new to Proxmox. My next step, I think, is to boot into an Ubuntu live environment, mount the disk, and test it without any virtualization.
 
Can you please post your VM config?

Also, an 'lspci -tv' from the host would be helpful.
 
Code:
bios: ovmf
boot: d
cores: 24
cpu: kvm64,flags=+pdpe1gb;+aes
efidisk0: storageprox0:17202/vm-17202-disk-0.qcow2,size=128K
hostpci0: 02:00,pcie=1
machine: q35
memory: 49152
name: OSM
net0: virtio=32:C5:3D:AB:4B:4F,bridge=vmbr1
numa: 1
onboot: 1
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=da9d4dc8-3461-4cc4-a85a-42d5ddd3b352
sockets: 2
vmgenid: 3ac6bd69-3f58-40b4-bded-050bda007cc8

I've tried different CPU flags, NUMA on/off, PCIe options, and OVMF vs. SeaBIOS. The above is the current config.

Here's the lspci -tv: https://gist.github.com/nitrag/1df4ee409a4d3e1532ab28c5991878ff
 
I see two potential problems here:
You have a two-socket system, so if the VM runs on the wrong socket, the traffic has to cross the QPI link, which costs performance.
The CPU type should be set to "host".

Try enabling NUMA and giving the VM 2 sockets; that may help the VM map the PCIe device correctly.
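
From the CLI that would be something like this (VM ID 17202 taken from your config above), or the equivalent change in the GUI:

Code:
qm set 17202 -cpu host -numa 1 -sockets 2 -cores 24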
 
Setting the CPU type to host definitely helped; the IOPS roughly doubled. I also tested from an Ubuntu live environment, and hdparm there came within 5% of the results inside the VM, so in terms of maximum throughput it looks like a limitation of the hardware itself. Thanks for your help @wolfgang!

Code:
 fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=582MiB/s,w=194MiB/s][r=149k,w=49.7k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1424: Thu Feb 20 21:41:21 2020
  read: IOPS=125k, BW=490MiB/s (514MB/s)(3070MiB/6264msec)
   bw (  KiB/s): min=235344, max=606472, per=98.93%, avg=496513.33, stdev=120485.29, samples=12
   iops        : min=58836, max=151618, avg=124128.33, stdev=30121.32, samples=12
  write: IOPS=41.9k, BW=164MiB/s (172MB/s)(1026MiB/6264msec); 0 zone resets
   bw (  KiB/s): min=78520, max=201776, per=99.02%, avg=166079.33, stdev=40367.39, samples=12
   iops        : min=19630, max=50444, avg=41519.83, stdev=10091.85, samples=12
  cpu          : usr=37.79%, sys=61.10%, ctx=2759, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
   READ: bw=490MiB/s (514MB/s), 490MiB/s-490MiB/s (514MB/s-514MB/s), io=3070MiB (3219MB), run=6264-6264msec
  WRITE: bw=164MiB/s (172MB/s), 164MiB/s-164MiB/s (172MB/s-172MB/s), io=1026MiB (1076MB), run=6264-6264msec
Disk stats (read/write):
  nvme0n1: ios=766709/256325, merge=0/3, ticks=176648/4293, in_queue=2136, util=98.43%
 
