Tesla P4 massive performance problems

zenowl77 · Feb 22, 2024
I'm running hashcat for fun as a hobby project, to learn more about it in depth, on a Tesla P4 in a Windows VM, and the Tesla is underperforming by quite a bit. Just wondering if anyone has any fixes/suggestions?

nvidia-smi is set to persistence mode and the P0 performance state, the drivers are configured for max performance, and hashcat is using the highest-performance workload profile (-w 4). I have tried both the CUDA and the OpenCL instance of the device (roughly the same performance).

For reference on just how badly it is underperforming: on benchmarks I should see around 29.9 GH/s NTLM with a P4, but I see 1.838-2.167 GH/s NTLM. That is only about 6-7% of bare-metal performance.

Is there something I should tweak to get more out of it, or is this a bug in Proxmox that somehow cripples CUDA performance? I'm guessing something is wrong somewhere. Everything seems to be set up properly; I've put a lot of time into making sure it was all just right, but performance seems terrible for many demanding workloads. It's good enough for most tasks, even games, but I've noticed the same thing in other high-performance tasks like AI applications. I thought that was just because I had an old card (others report higher it/s for AI with the P4 than what I see, but I figured it was the models used or something); now I'm seeing it is not that...
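As a rough sanity check, here is the gap expressed as a percentage, using the numbers above (the hashcat benchmark command is shown for context; mode 1000 is NTLM):

```shell
# benchmark command behind the NTLM numbers:
#   hashcat -b -m 1000 -w 4
# observed (~2 GH/s) as a fraction of the expected bare-metal ~29.9 GH/s:
awk -v o=2.0 -v e=29.9 'BEGIN { printf "%.1f%% of bare metal\n", 100 * o / e }'
# → 6.7% of bare metal
```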
 
Silly question maybe: did you install the qemu-agent?
Was the VM graphics card configured appropriately in the VM config?

You may find reading through this helpful:
https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE#Enable_PCIe_Passthrough
 
Yes, I have, and I'm pretty sure I have the vGPU set up well; I went through patching it for the P4 to be able to run the latest drivers, etc. I'm just not sure what is causing the massive performance loss. I have an Intel Arc A310 in the same VM and it seems to perform fine.
 
Okay, check how much of the dedicated RAM is available to the GPU; with the Radeon I'm using, this was not always fully assigned and caused issues.
Also, when running hashcat, double-check that the benchmark is actually using the GPU (been there, really).
 
It is definitely allocating VRAM correctly and using the GPU; the CPU only gets about 1/7th of the performance the GPU is currently getting. Strangely, nvidia-smi says it is hitting 100% usage at times, although it bounces around from 5-6% to 30% to 100% and back to 3-8%, etc.; somehow it only averages roughly 6-7% usage overall.

I have tried changing the settings around a bit, but I will have to keep playing with it and see if I can find the problem.

In a few hours I will get back to it; I think I will try different profile types and manually disable any frame limit for the vGPU profile and see how those go. (Pretty sure I am already using the most performant profile type according to the documentation, but we'll see, I guess.)
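A sketch of how that bouncing utilization can be sampled and averaged; the nvidia-smi query is the real command, while the piped-in numbers are just the sample values described above:

```shell
# periodic sampling, one line per second (run inside the VM):
#   nvidia-smi --query-gpu=utilization.gpu,clocks.sm,power.draw --format=csv -l 1
# averaging a captured series like the one described (5%, 30%, 100%, 3%, 8%):
printf '5\n30\n100\n3\n8\n' | awk '{ s += $1; n++ } END { printf "avg %.0f%%\n", s / n }'
# → avg 29%
```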
 
I've only worked with AMD thus far.

The configuration of GPU PCI passthrough can be a bit delicate to get working well.

I wonder if you're doing the passthrough configuration all in the VM, or if you're using the Proxmox VE resource mapping?

Coincidentally, I've noticed what appears to be repeated VM instability when using resource mapping of USB and PCI devices. My impression is that these may cause errors on the device in the VM (also when using virtio).

It may be worth checking in MS Windows whether the Device Manager and Event Viewer report errors/warnings for the PCIe bus, drivers, or PCI devices.

Also, why on earth use Windows for hashcat =-D ?

On my VM I'm using virtio-GPU for the VM display, and I've simply added the GPU to the VM as a PCIe device (as a raw device, not a mapped device; ROM-Bar = on, Primary GPU = on, PCI-Express = on).

Your computer may also have BIOS/UEFI settings which prevent performance from going full throttle; I'm thinking of PCI bus specifics here.

Maybe run 3DMark or something similar to benchmark the GPU outside of hashcat?
 
NVIDIA vGPU is similar to AMD MxGPU: it lets you assign profiles for portioning the GPU out to multiple VMs. It's kind of nice and mostly has very few issues; I think the biggest one I've had once it was up and running is the lack of nesting support.

So while it is similar to passthrough, it's not actually passing through the whole card.

It is using the resource mapping. Maybe I should try without it and specify the card directly, just as a test; thank you for the suggestion.

The Event Viewer is not showing any errors related to that.

Haha, yeah, hashcat via Windows is not the best choice. But I currently have the vGPU set up in Windows with all 8GB of VRAM, and I have to mess with the vGPU licenses, etc., since it's a Tesla and uses the GRID drivers, and the licenses have to be set up again every x days. This is the VM I have working, and it's just not worth it right now to boot Linux back up, refresh the license, and reconfigure both VMs just to test hashcat a bit, then shut them both down and reconfigure again to get the VRAM back on the main VM. So hashcat on Windows it is. (Yeah, I'm kind of going the lazy route on this one; I'm doing other things too, so it's just not worth it.)

I should test all of that more; I'm pretty sure it was getting closer to full usage before. I think it is running in x8 mode, though (I'm pretty sure that should cause at most a 2-3% difference), but it is at least connected to a PCIe slot that is wired to the CPU and not the chipset (that was causing major issues before). So maybe it is still the board; I'll have to see if I can switch the Intel card into the slot the NVIDIA is in and put the NVIDIA card in the main slot. Not sure it will fit like that, though; I have the case packed pretty full, and the NVIDIA is a single-slot card.

It does seem to have performance issues in other applications too; hashcat just really helps put a number on the exact CUDA performance, rather than a frame rate or something else.
 
Just ran a PassMark 3D benchmark:

my score: 6,921

online score for GRID driver: 6,235
online bare-metal P4 score: 9,025

So according to PassMark, my score is about 11% better than the average when using a GRID driver, and only about 23% less than bare metal.
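The percentages work out as follows (scores taken from above):

```shell
awk 'BEGIN {
  mine = 6921; grid = 6235; bare = 9025          # PassMark 3D scores from this post
  printf "vs GRID-driver average: %+.1f%%\n", 100 * (mine - grid) / grid
  printf "vs bare metal:          %+.1f%%\n", 100 * (mine - bare) / bare
}'
# → vs GRID-driver average: +11.0%
# → vs bare metal:          -23.3%
```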
 
If you’re using vGPU, do you have the current vGPU software properly installed and licensed (can you see the license being checked out)?
Do you have any potential thermal issues? Is this a 1U or 2U server? What are your intake temperatures?
Are you sharing the PCIe bus with anything else? What is the layout of your bus-memory-CPU pipeline?
Is this the original CPU this system was specced with? Are you missing any memory modules?
There have been some fixes for CPU vulnerabilities since the last time anyone seriously benchmarked a P4; are you comparing apples to apples with ‘bare metal’?
 
Yeah, it's all set up and licensed properly. I haven't upgraded to v18.0 yet, still on 17.5 / 550 for now, but that's still pretty new.

It's a desktop PC repurposed as a Proxmox server.

Spec:
Motherboard: Gigabyte X299 UD4 Pro
CPU: i7-7820X
RAM: 96GB DDR4, a mix of 8/16GB DIMMs across 8 modules
PCIe slot 1: ASRock Intel Arc A310
PCIe slot 2: Inspur 9211-8i SAS card in IT mode
PCIe slot 3: Tesla P4 8GB
NVMe 1: WD SN550 Blue 1TB
NVMe 2: Samsung MZNLN256HMHQ-000H1 256GB
HDD 1/2: HGST 10TB enterprise (7200rpm)
HDD 3: HGST 12TB enterprise (7200rpm)
HDD 4/5: WD Red 8TB / HGST 8TB enterprise (7200rpm)
HDD 6: WD Blue SA510 1TB SATA SSD
HDD 7: 1TB WD Blue 2.5in HDD (5400rpm, non-SMR)

I am not sure about the pipeline layout on this board beyond which PCIe slots are CPU-connected vs chipset-connected, haha.

I have mitigations disabled (it's a homelab server behind a firewall; I'm not really worried about it), but that is a really good point: it could be the forced mitigations in the BIOS/microcode, the ones that cannot be disabled, that are crippling aspects of the performance, because I just cannot find the cause.
 
Yeah, these things aren’t designed to run at full power in a desktop.

Various problems: a desktop CPU that doesn’t have nearly enough memory channels or PCIe lanes to keep a P4 fed; the 3rd PCIe slot is likely only running at x4 (or maybe even x1 if you use the NVMe slots), which is already slower than its server counterparts; and you mix and match memory.

You basically have way too many things in your box to run anything at full speed. Nothing wrong with that; you can likely encode quite a few videos and play some games with it.
 
This CPU has 28 lanes, which isn't a lot (44 or more would be nicer), but the P4 is running at PCIe 3.0 x8, so that isn't terrible. It seems like it should be enough, even if just barely, to avoid a massive impact at least.

I also lowered the lanes for the Intel GPU to x8 in the primary slot, if I remember correctly (since it's just a 4GB / 30W card, it doesn't exactly need x16, especially since I just use it for video encoding and basic tasks).
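For what it's worth, the theoretical throughput of a PCIe 3.0 x8 link (8 GT/s per lane with 128b/130b encoding) backs up the "should be enough" intuition; hashcat-style compute rarely needs anywhere near this much bus bandwidth:

```shell
# 8 lanes × 8 GT/s × 128/130 encoding overhead, divided by 8 bits per byte:
awk 'BEGIN { printf "PCIe 3.0 x8 ≈ %.1f GB/s per direction\n", 8 * 8 * (128 / 130) / 8 }'
# → PCIe 3.0 x8 ≈ 7.9 GB/s per direction
```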
 
Here is the problem: your Intel video card and SAS card both use x8, your two NVMe slots are another x8, your NVIDIA requires x8, your NIC likely sits behind an x4 controller, and somewhere behind that x4 are all your SATA controllers and various other things your CPU does. And if you mix and match memory, your memory controller only has 2 of 4 channels available for some of the work. I don’t know your motherboard’s topology, but only a 23% benchmark performance reduction from bare metal is pretty good (see lspci, lstopo). If you want a number cruncher (the access patterns for hashcat and PassMark are completely different), you need dedicated x8 lanes and the six memory buses of that era’s Xeon CPUs.
 
I can definitely see what you mean: I do not exactly have enough lanes, and some are without a doubt shared (since there are really more lanes in use than there are lanes on the CPU, so they have to be).
I believe it's set:
Intel GPU: x8
SAS card: x4 (max for that slot)
Tesla P4: x8
NVMe 1/2: x4 each (max for the slots; using them disables onboard SATA ports, and one is basically just in SATA mode (the 256GB Samsung is on the SATA slot))
That is already the full 28 lanes across those devices by themselves, not including everything else... which is also why I limited the Intel card's lanes specifically, to help with this problem. I should probably limit it to x4 to help a little more; I do not exactly use the Intel card for anything serious.
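Tallying those assignments against the i7-7820X's 28 CPU lanes:

```shell
# Intel GPU x8 + SAS x4 + Tesla P4 x8 + two NVMe x4 slots:
awk 'BEGIN { printf "%d of 28 CPU lanes claimed\n", 8 + 4 + 8 + 4 + 4 }'
# → 28 of 28 CPU lanes claimed
```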

lspci & lstopo

lspci:
Code:
00:00.0 Host bridge: Intel Corporation Sky Lake-E DMI3 Registers (rev 04)
00:04.0 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.1 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.2 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.3 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.4 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.5 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.6 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.7 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:05.0 System peripheral: Intel Corporation Sky Lake-E MM/Vt-d Configuration Registers (rev 04)
00:05.2 System peripheral: Intel Corporation Sky Lake-E RAS (rev 04)
00:05.4 PIC: Intel Corporation Sky Lake-E IOAPIC (rev 04)
00:08.0 System peripheral: Intel Corporation Sky Lake-E Ubox Registers (rev 04)
00:08.1 Performance counters: Intel Corporation Sky Lake-E Ubox Registers (rev 04)
00:08.2 System peripheral: Intel Corporation Sky Lake-E Ubox Registers (rev 04)
00:14.0 USB controller: Intel Corporation 200 Series/Z370 Chipset Family USB 3.0 xHCI Controller
00:16.0 Communication controller: Intel Corporation 200 Series PCH CSME HECI #1
00:17.0 SATA controller: Intel Corporation 200 Series PCH SATA controller [AHCI mode]
00:1b.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #17 (rev f0)
00:1c.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #1 (rev f0)
00:1c.4 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #5 (rev f0)
00:1d.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #9 (rev f0)
00:1f.0 ISA bridge: Intel Corporation X299 Chipset LPC/eSPI Controller
00:1f.2 Memory controller: Intel Corporation 200 Series/Z370 Chipset Family Power Management Controller
00:1f.3 Audio device: Intel Corporation 200 Series PCH HD Audio
00:1f.4 SMBus: Intel Corporation 200 Series/Z370 Chipset Family SMBus Controller
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V
01:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue SN550 NVMe SSD (rev 01)
03:00.0 USB controller: ASMedia Technology Inc. ASM2142/ASM3142 USB 3.1 Host Controller
16:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04)
16:05.0 System peripheral: Intel Corporation Sky Lake-E VT-d (rev 04)
16:05.2 System peripheral: Intel Corporation Sky Lake-E RAS Configuration Registers (rev 04)
16:05.4 PIC: Intel Corporation Sky Lake-E IOxAPIC Configuration Registers (rev 04)
16:08.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:08.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:08.2 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:08.3 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:08.4 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:08.5 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:08.6 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:08.7 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:09.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:09.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0e.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0e.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0e.2 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0e.3 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0e.4 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0e.5 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0e.6 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0e.7 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0f.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0f.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:1d.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:1d.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:1d.2 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:1d.3 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:1e.0 System peripheral: Intel Corporation Sky Lake-E PCU Registers (rev 04)
16:1e.1 System peripheral: Intel Corporation Sky Lake-E PCU Registers (rev 04)
16:1e.2 System peripheral: Intel Corporation Sky Lake-E PCU Registers (rev 04)
16:1e.3 System peripheral: Intel Corporation Sky Lake-E PCU Registers (rev 04)
16:1e.4 System peripheral: Intel Corporation Sky Lake-E PCU Registers (rev 04)
16:1e.5 System peripheral: Intel Corporation Sky Lake-E PCU Registers (rev 04)
16:1e.6 System peripheral: Intel Corporation Sky Lake-E PCU Registers (rev 04)
17:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
64:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04)
64:05.0 System peripheral: Intel Corporation Sky Lake-E VT-d (rev 04)
64:05.2 System peripheral: Intel Corporation Sky Lake-E RAS Configuration Registers (rev 04)
64:05.4 PIC: Intel Corporation Sky Lake-E IOxAPIC Configuration Registers (rev 04)
64:08.0 System peripheral: Intel Corporation Sky Lake-E Integrated Memory Controller (rev 04)
64:09.0 System peripheral: Intel Corporation Sky Lake-E Integrated Memory Controller (rev 04)
64:0a.0 System peripheral: Intel Corporation Sky Lake-E Integrated Memory Controller (rev 04)
64:0a.1 System peripheral: Intel Corporation Sky Lake-E Integrated Memory Controller (rev 04)
64:0a.2 System peripheral: Intel Corporation Sky Lake-E Integrated Memory Controller (rev 04)
64:0a.3 System peripheral: Intel Corporation Sky Lake-E Integrated Memory Controller (rev 04)
64:0a.4 System peripheral: Intel Corporation Sky Lake-E Integrated Memory Controller (rev 04)
64:0a.5 System peripheral: Intel Corporation Sky Lake-E LM Channel 1 (rev 04)
64:0a.6 System peripheral: Intel Corporation Sky Lake-E LMS Channel 1 (rev 04)
64:0a.7 System peripheral: Intel Corporation Sky Lake-E LMDP Channel 1 (rev 04)
64:0b.0 System peripheral: Intel Corporation Sky Lake-E DECS Channel 2 (rev 04)
64:0b.1 System peripheral: Intel Corporation Sky Lake-E LM Channel 2 (rev 04)
64:0b.2 System peripheral: Intel Corporation Sky Lake-E LMS Channel 2 (rev 04)
64:0b.3 System peripheral: Intel Corporation Sky Lake-E LMDP Channel 2 (rev 04)
64:0c.0 System peripheral: Intel Corporation Sky Lake-E Integrated Memory Controller (rev 04)
64:0c.1 System peripheral: Intel Corporation Sky Lake-E Integrated Memory Controller (rev 04)
64:0c.2 System peripheral: Intel Corporation Sky Lake-E Integrated Memory Controller (rev 04)
64:0c.3 System peripheral: Intel Corporation Sky Lake-E Integrated Memory Controller (rev 04)
64:0c.4 System peripheral: Intel Corporation Sky Lake-E Integrated Memory Controller (rev 04)
64:0c.5 System peripheral: Intel Corporation Sky Lake-E LM Channel 1 (rev 04)
64:0c.6 System peripheral: Intel Corporation Sky Lake-E LMS Channel 1 (rev 04)
64:0c.7 System peripheral: Intel Corporation Sky Lake-E LMDP Channel 1 (rev 04)
64:0d.0 System peripheral: Intel Corporation Sky Lake-E DECS Channel 2 (rev 04)
64:0d.1 System peripheral: Intel Corporation Sky Lake-E LM Channel 2 (rev 04)
64:0d.2 System peripheral: Intel Corporation Sky Lake-E LMS Channel 2 (rev 04)
64:0d.3 System peripheral: Intel Corporation Sky Lake-E LMDP Channel 2 (rev 04)
65:00.0 PCI bridge: Intel Corporation Device 4fa1 (rev 01)
66:01.0 PCI bridge: Intel Corporation Device 4fa4
66:04.0 PCI bridge: Intel Corporation Device 4fa4
67:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A310] (rev 05)
68:00.0 Audio device: Intel Corporation DG2 Audio Controller
b2:03.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port D (rev 04)
b2:05.0 System peripheral: Intel Corporation Sky Lake-E VT-d (rev 04)
b2:05.2 System peripheral: Intel Corporation Sky Lake-E RAS Configuration Registers (rev 04)
b2:05.4 PIC: Intel Corporation Sky Lake-E IOxAPIC Configuration Registers (rev 04)
b2:12.0 Performance counters: Intel Corporation Sky Lake-E M3KTI Registers (rev 04)
b2:12.1 Performance counters: Intel Corporation Sky Lake-E M3KTI Registers (rev 04)
b2:12.2 System peripheral: Intel Corporation Sky Lake-E M3KTI Registers (rev 04)
b2:15.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers (rev 04)
b2:16.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers (rev 04)
b2:16.4 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers (rev 04)
b2:17.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers (rev 04)
b3:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)

lstopo:
Code:
Machine (94GB total)
  Package L#0
    NUMANode L#0 (P#0 94GB)
    L3 L#0 (11MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#8)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#9)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#10)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#11)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#12)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#13)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#14)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#15)
  HostBridge
    PCI 00:17.0 (SATA)
      Block(Disk) "sdb"
      Block(Disk) "sda"
    PCIBridge
      PCI 01:00.0 (NVMExp)
        Block(Disk) "nvme0n1"
    PCI 00:1f.6 (Ethernet)
      Net "enp0s31f6"
  HostBridge
    PCIBridge
      PCI 17:00.0 (3D)
  HostBridge
    PCIBridge
      PCIBridge
        PCIBridge
          PCI 67:00.0 (VGA)
  HostBridge
    PCIBridge
      PCI b3:00.0 (SAS)
        Block(Disk) "sdf"
        Block(Disk) "sdd"
        Block(Disk) "sde"
        Block(Disk) "sdc"
  Misc(MemoryModule)
  Misc(MemoryModule)
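One way to confirm what link the P4 actually negotiated (the 17:00.0 address comes from the lspci dump above; the echoed line is just a made-up sample showing what to look for):

```shell
# real check, run on the host:
#   lspci -s 17:00.0 -vv | grep -E 'LnkCap|LnkSta'
# extracting the negotiated width from a sample LnkSta line:
echo 'LnkSta: Speed 8GT/s (ok), Width x8 (downgraded)' | grep -o 'Width x[0-9]*'
# → Width x8
```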
 
Just ran a PassMark 3D benchmark:

my score: 6,921

online score for GRID driver: 6,235
online bare-metal P4 score: 9,025

So according to PassMark, my score is about 11% better than the average when using a GRID driver, and only about 23% less than bare metal.
If you use the card without vGPU, you're likely to see performance similar to, or at times higher than, bare metal; that was my experience with the Radeon card.
 
Yeah, these things aren’t designed to run at full power in a desktop.

Various problems: a desktop CPU that doesn’t have nearly enough memory channels or PCIe lanes to keep a P4 fed; the 3rd PCIe slot is likely only running at x4 (or maybe even x1 if you use the NVMe slots), which is already slower than its server counterparts; and you mix and match memory.

You basically have way too many things in your box to run anything at full speed. Nothing wrong with that; you can likely encode quite a few videos and play some games with it.
I'd agree for a 3D benchmark or something else media-heavy, but for hashcat I've not seen much variance in compute benchmarks regardless of the slot.
Also, unless the system is under a stress test or is indeed loaded to the max, performance issues due to bus constraints are unlikely.

On modern hardware and OSes, a lot works with zero tweaking required. Unless this is a system with a cheap and/or poorly designed motherboard, I'd doubt this is relevant.
 
64:0b.3 System peripheral: Intel Corporation Sky Lake-E LMDP Channel 2 (rev 04)
64:0c.0 System peripheral: Intel Corporation Sky Lake-E Integrated Memory Controller (rev 04)
64:0c.1 System peripheral: Intel Corporation Sky Lake-E Integrated Memory Controller (rev 04)
64:0c.2 System peripheral: Intel Corporation Sky Lake-E Integrated Memory Controller (rev 04)
64:0c.3 System peripheral: Intel Corporation Sky Lake-E Integrated Memory Controller (rev 04)
64:0c.4 System peripheral: Intel Corporation Sky Lake-E Integrated Memory Controller (rev 04)
64:0c.5 System peripheral: Intel Corporation Sky Lake-E LM Channel 1 (rev 04)
64:0c.6 System peripheral: Intel Corporation Sky Lake-E LMS Channel 1 (rev 04)
64:0c.7 System peripheral: Intel Corporation Sky Lake-E LMDP Channel 1 (rev 04)
64:0d.0 System peripheral: Intel Corporation Sky Lake-E DECS Channel 2 (rev 04)
64:0d.1 System peripheral: Intel Corporation Sky Lake-E LM Channel 2 (rev 04)
64:0d.2 System peripheral: Intel Corporation Sky Lake-E LMS Channel 2 (rev 04)
64:0d.3 System peripheral: Intel Corporation Sky Lake-E LMDP Channel 2 (rev 04)
65:00.0 PCI bridge: Intel Corporation Device 4fa1 (rev 01)
66:01.0 PCI bridge: Intel Corporation Device 4fa4
66:04.0 PCI bridge: Intel Corporation Device 4fa4
67:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A310] (rev 05)
68:00.0 Audio device: Intel Corporation DG2 Audio Controller
b2:03.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port D (rev 04)
b2:05.0 System peripheral: Intel Corporation Sky Lake-E VT-d (rev 04)
b2:05.2 System peripheral: Intel Corporation Sky Lake-E RAS Configuration Registers (rev 04)
b2:05.4 PIC: Intel Corporation Sky Lake-E IOxAPIC Configuration Registers (rev 04)
b2:12.0 Performance counters: Intel Corporation Sky Lake-E M3KTI Registers (rev 04)
b2:12.1 Performance counters: Intel Corporation Sky Lake-E M3KTI Registers (rev 04)
b2:12.2 System peripheral: Intel Corporation Sky Lake-E M3KTI Registers (rev 04)
b2:15.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers (rev 04)
b2:16.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers (rev 04)
b2:16.4 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers (rev 04)
b2:17.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers (rev 04)
b3:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)

lstopo:
Code:
Machine (94GB total)
  Package L#0
    NUMANode L#0 (P#0 94GB)
    L3 L#0 (11MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#8)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#9)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#10)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#11)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#12)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#13)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#14)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#15)
  HostBridge
    PCI 00:17.0 (SATA)
      Block(Disk) "sdb"
      Block(Disk) "sda"
    PCIBridge
      PCI 01:00.0 (NVMExp)
        Block(Disk) "nvme0n1"
    PCI 00:1f.6 (Ethernet)
      Net "enp0s31f6"
  HostBridge
    PCIBridge
      PCI 17:00.0 (3D)
  HostBridge
    PCIBridge
      PCIBridge
        PCIBridge
          PCI 67:00.0 (VGA)
  HostBridge
    PCIBridge
      PCI b3:00.0 (SAS)
        Block(Disk) "sdf"
        Block(Disk) "sdd"
        Block(Disk) "sde"
        Block(Disk) "sdc"
  Misc(MemoryModule)
  Misc(MemoryModule)
Looking at this, @guruevi could be right: the memory layout may be causing performance issues (96GB of RAM, 32+32+16+16?).
Best is to consult the motherboard manual to find how the RAM slots are assigned, and also to check that all the RAM is compatible.
Also, not all RAM slots are equal: only pair modules of equal size and speed within a slot pair, and smaller modules may need to go in the low (0-1) pair or the high (3-4/7-8) pair so performance doesn't suffer.

Looking at the lspci and lstopo output above, am I right that this is not a regular desktop but more of an enterprise/workstation-grade machine?
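The pairing rule above can be sketched as a small check. This is just an illustration; the module sizes and speeds below are hypothetical examples, and on a real system you would read them from `dmidecode -t memory`:

```python
# Sketch: sanity-check that each memory-channel slot pair holds matching
# modules, per the pairing rule above. Module values are hypothetical
# examples, not read from this machine.

def unmatched_pairs(modules):
    """modules: list of (size_gb, speed_mts) tuples in physical slot order.
    Returns indices of slot pairs whose two modules differ."""
    bad = []
    for i in range(0, len(modules) - 1, 2):
        if modules[i] != modules[i + 1]:
            bad.append(i // 2)
    return bad

# A 16GB+8GB module in every pair flags all four pairs:
mixed = [(16, 2666), (8, 2666)] * 4
print(unmatched_pairs(mixed))    # [0, 1, 2, 3]

# Matching modules within each pair flags nothing:
matched = [(16, 2666), (16, 2666), (8, 2666), (8, 2666)] * 2
print(unmatched_pairs(matched))  # []
```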
 
If you use the card without vGPU you're likely to see performance similar to, or at times higher than, bare metal; that's my experience with a Radeon card.
Sadly that would be a lot of work, so it's not really an option here. With vGPU, thanks to great devs keeping it supported, I can use the new 550 drivers that are only months old; bare-metal support ended with the 535 drivers, and I think the datacenter drivers are much older. I would also have to set up a whole new VM or completely reinstall drivers, since passthrough uses a completely different driver, and getting the card back onto Proxmox for vGPU after using it that way requires a reboot. I might try a whole new VM if I get desperate for a fix, but I'm not sure I want that performance back badly enough to give up vGPU entirely.

I'd agree for a 3D benchmark or something else media-heavy; for hashcat I've not seen much variance in compute benchmarks regardless of the slot.
Also, unless the system is under a stress test or is indeed loaded to the max, any performance issues due to bus constraints are unlikely.

On modern hardware and OSes a lot happens with zero tweaking required. Unless this is a system with a cheap and/or poorly designed motherboard, I doubt this is relevant.
This is what I was thinking. I know I have some performance loss because of it, but it's not 90%, lol; the card is still running at x8. I'd be seeing a big impact on 3D and everything else if it were that bad. The loss is definitely there, but it mostly only shows itself when everything in the system is stressed at once, which is rare.
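One way to confirm the negotiated link is still x8 is to parse the `LnkSta:` line from `sudo lspci -vv -s 17:00.0` (the P4's address in the listing above). A minimal sketch of that parse, using a made-up sample string rather than output captured from this machine:

```python
import re

# Sketch: extract the negotiated PCIe link speed/width from `lspci -vv`
# output. The sample text below is illustrative, not from this system.

def link_status(lspci_vv_text):
    """Returns (speed_gts, width) from the LnkSta line, or None if absent."""
    m = re.search(r"LnkSta:\s*Speed\s+([\d.]+)GT/s[^,]*,\s*Width\s+x(\d+)",
                  lspci_vv_text)
    return (float(m.group(1)), int(m.group(2))) if m else None

sample = "LnkSta: Speed 8GT/s (ok), Width x8 (downgraded)"
print(link_status(sample))  # (8.0, 8) -> PCIe 3.0 x8
```

If the width came back as x1 or the speed as 2.5GT/s, that would explain a far larger loss than a x16-to-x8 downgrade ever could.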

Looking at this, @guruevi could be right: the memory layout may be causing performance issues (96GB of RAM, 32+32+16+16?).
Best is to consult the motherboard manual to find how the RAM slots are assigned, and also to check that all the RAM is compatible.
Also, not all RAM slots are equal: only pair modules of equal size and speed within a slot pair, and smaller modules may need to go in the low (0-1) pair or the high (3-4/7-8) pair so performance doesn't suffer.

Looking at the lspci and lstopo output above, am I right that this is not a regular desktop but more of an enterprise/workstation-grade machine?
It has 8 modules, a mix of 16GB and 8GB; I forget the exact layout, but I think it's 16/8/16/8/16/8/16/8. I put them in slots according to how the CPU handles them, went through a lot of tweaking and benchmarking, and they hit 55GB/s with good performance results.

Yes, it is a higher-end board; I linked to it above. It's a Gigabyte X299 UD4 board that claims "Server-Class Digital Power Design" among other features, and the CPU is an i7-7820X Skylake-X.
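For reference, the kind of copy-bandwidth measurement behind a figure like that 55GB/s can be sketched crudely in a few lines. A single Python thread won't come near the aggregate number a proper tool (mbw, AIDA64, STREAM) reports; this only illustrates the method:

```python
import time

# Sketch: crude single-threaded memory-copy bandwidth estimate. This will
# report far less than a multi-channel aggregate figure; it only shows the
# timing method, not a real benchmark.

def copy_bandwidth_gbs(size_mb=256, runs=3):
    src = bytearray(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter()
        dst = bytes(src)  # one full copy of the buffer
        best = min(best, time.perf_counter() - t0)
    # a copy reads and writes each byte, so count ~2 bytes of traffic per byte
    return 2 * len(src) / best / 1e9

print(f"{copy_bandwidth_gbs():.1f} GB/s (single thread, rough)")
```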
 
I'm seeing pretty poor performance with an HP z420 and a P4 doing vGPU as well. I have mine split, though, instead of allocated 100% to one VM: split four ways across 4 VMs. I was wondering if it was my driver, but it sounds like you have an upgraded version and still see the same results. I'm on 535 / 16, and I see Jellyfin transcoding at *unreasonably* low rates.

Code:
frame=    1 fps=0.2 q=17.0 size=N/A time=00:00:00.00 bitrate=N/A speed=   0x   
frame=    2 fps=0.3 q=13.0 size=N/A time=00:00:00.04 bitrate=N/A speed=0.00596x   
frame=    4 fps=0.5 q=10.0 size=N/A time=00:00:00.12 bitrate=N/A speed=0.0167x   
frame=    6 fps=0.7 q=10.0 size=N/A time=00:00:00.20 bitrate=N/A speed=0.0261x   
frame=    8 fps=0.9 q=10.0 size=N/A time=00:00:00.29 bitrate=N/A speed=0.0343x   
frame=   10 fps=1.1 q=10.0 size=N/A time=00:00:00.37 bitrate=N/A speed=0.0417x   
frame=   12 fps=1.3 q=10.0 size=N/A time=00:00:00.45 bitrate=N/A speed=0.0483x
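To put those progress lines in perspective, the `speed=` field is the realtime factor, so they can be parsed directly. A small sketch using the last line above:

```python
import re

# Sketch: parse an ffmpeg progress line (like the ones quoted above) to
# read the realtime speed factor. At ~0.05x, transcoding would take
# roughly 20x the runtime of the source.

def parse_progress(line):
    m = re.search(r"frame=\s*(\d+)\s+fps=\s*([\d.]+).*speed=\s*([\d.]+)x", line)
    if not m:
        return None
    return {"frame": int(m.group(1)),
            "fps": float(m.group(2)),
            "speed": float(m.group(3))}

last = "frame=   12 fps=1.3 q=10.0 size=N/A time=00:00:00.45 bitrate=N/A speed=0.0483x"
p = parse_progress(last)
print(p["speed"])             # 0.0483
print(round(1 / p["speed"]))  # ~21x slower than realtime
```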
 
I'm seeing pretty poor performance with an HP z420 and a P4 doing vGPU as well. I have mine split, though, instead of allocated 100% to one VM: split four ways across 4 VMs. I was wondering if it was my driver, but it sounds like you have an upgraded version and still see the same results. I'm on 535 / 16, and I see Jellyfin transcoding at *unreasonably* low rates.

Code:
frame=    1 fps=0.2 q=17.0 size=N/A time=00:00:00.00 bitrate=N/A speed=   0x
frame=    2 fps=0.3 q=13.0 size=N/A time=00:00:00.04 bitrate=N/A speed=0.00596x
frame=    4 fps=0.5 q=10.0 size=N/A time=00:00:00.12 bitrate=N/A speed=0.0167x
frame=    6 fps=0.7 q=10.0 size=N/A time=00:00:00.20 bitrate=N/A speed=0.0261x
frame=    8 fps=0.9 q=10.0 size=N/A time=00:00:00.29 bitrate=N/A speed=0.0343x
frame=   10 fps=1.1 q=10.0 size=N/A time=00:00:00.37 bitrate=N/A speed=0.0417x
frame=   12 fps=1.3 q=10.0 size=N/A time=00:00:00.45 bitrate=N/A speed=0.0483x
Here is another thread where I linked to the forum posts and code pages the devs posted to get you up and running with 17.5 and to mask the P4 as an A5500, so you can install the 550 drivers if you want them.

Driver fixes and v17.5 for the P4

I sometimes split mine up, usually 4GB / 4GB between Linux and Windows for AI and other tasks. It does better on Linux than Windows, but I'm currently just using it with Windows. Typically I see zero performance issues on 3D tasks, encoding tasks, etc., and I also saw zero issues there on 535; I just upgraded for the system fallback setting and other fixes, plus having a newer version to pair with the 6.14 kernel.

There is basically no way you should be seeing rates THAT low if it is using the GPU (it looks like it just isn't; it doesn't even appear to be using your CPU properly, unless that is 4K or something).

The P4 should easily do 4K@30fps, 1080p@120+fps, 720p@200-240fps, etc.

Edit: do you have licensing set up? If unsure, run `& nvidia-smi | Select-String "License"` and check the output. If it says "Unlicensed" you're heavily limited and performance will suffer (you only get 15fps, etc.), and you need to set up this server and issue licenses from it for your P4. (Not advisable to ever use that in corporate settings of any kind; home/personal use only.)
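That license check can also be scripted against the full `nvidia-smi -q` output. A minimal sketch; note the exact field wording varies by driver branch, so the sample line below is an assumed format, not captured output:

```python
# Sketch: decide vGPU license state from `nvidia-smi -q` text. The
# "License Status" field name/format is assumed here and differs between
# driver branches; verify against your own driver's output.

def is_licensed(nvidia_smi_q_text):
    """True/False from the License Status line, or None if not present."""
    for line in nvidia_smi_q_text.splitlines():
        if "License Status" in line:
            return "Unlicensed" not in line
    return None

sample = "        License Status                    : Unlicensed (Restricted)"
print(is_licensed(sample))  # False -> expect the capped performance described above
```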
 