[SOLVED] vGPU just stopped working randomly (solution includes 6.14, Pascal fixes for 17.5, changing mock P4 to A5500, thanks to GreenDam)

Slight threadjack:

What version of what driver are you running now?

With kernel 6.14 and patched 16.9 drivers, when I start up a VM that has my P4 passed through (not using the A5500 patch yet), I get this on the host:

Note: I do not get this with kernel 6.11 on the host (only with 6.14)
Code:
[  141.872136] ------------[ cut here ]------------
[  141.872163] WARNING: CPU: 19 PID: 7908 at ./include/linux/rwsem.h:85 remap_pfn_range_internal+0x4af/0x5a0
[  141.872192] Modules linked in: ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel nf_tables nvme_fabrics nvme_keyring nfnetlink_cttimeout softdog sunrpc binfmt_misc bonding tls openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 psample nfnetlink_log nfnetlink nvidia_vgpu_vfio(OE) xfs nvidia(POE) amd_atl intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel ipmi_ssif crypto_simd cryptd mdev ast rapl pcspkr kvm acpi_ipmi ccp k10temp ipmi_si ptdma ipmi_devintf ipmi_msghandler joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq hid_generic usbkbd usbmouse mlx4_ib ib_uverbs usbhid ses hid enclosure ib_core mlx4_en mpt3sas igb xhci_pci nvme raid_class i2c_algo_bit ahci
[  141.872278]  dca mlx4_core scsi_transport_sas libahci nvme_core xhci_hcd i2c_piix4 i2c_smbus nvme_auth
[  141.872414] CPU: 19 UID: 0 PID: 7908 Comm: CPU 1/KVM Tainted: P           OE      6.14.0-2-pve #1
[  141.872432] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[  141.872445] Hardware name: Supermicro Super Server/H11SSL-NC, BIOS 3.0 07/01/2024
[  141.872460] RIP: 0010:remap_pfn_range_internal+0x4af/0x5a0
[  141.872473] Code: 31 db c3 cc cc cc cc 48 8b 7d a8 4c 89 fa 4c 89 ce 4c 89 4d c0 e8 81 e2 ff ff 85 c0 75 9c 4c 8b 4d c0 4d 8b 01 e9 aa fd ff ff <0f> 0b e9 d7 fb ff ff 0f 0b 48 8b 7d a8 4c 89 fa 48 89 de 4c 89 45
[  141.873338] RSP: 0018:ffffae58873774b0 EFLAGS: 00010246
[  141.873726] RAX: 00000000280200fb RBX: ffff8d25d7b6a0b8 RCX: 0000000000001000
[  141.874111] RDX: 0000000000000000 RSI: 00007bb81fe00000 RDI: ffff8d25d7b6a0b8
[  141.874520] RBP: ffffae5887377568 R08: 8000000000000037 R09: 0000000000000000
[  141.874903] R10: 0000000000000000 R11: ffff8d254bc47380 R12: 000000002000fdf1
[  141.875297] R13: 00007bb81fe01000 R14: 00007bb81fe00000 R15: 8000000000000037
[  141.875683] FS:  00007bbc4affd6c0(0000) GS:ffff8d638e980000(0000) knlGS:0000000000000000
[  141.876074] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  141.876469] CR2: 0000000000b50b10 CR3: 00000001ce5b8000 CR4: 0000000000350ef0
[  141.876859] Call Trace:
[  141.877245]  <TASK>
[  141.877620]  ? show_regs+0x6c/0x80
[  141.877994]  ? __warn+0x8d/0x150
[  141.878367]  ? remap_pfn_range_internal+0x4af/0x5a0
[  141.878731]  ? report_bug+0x182/0x1b0
[  141.879091]  ? handle_bug+0x6e/0xb0
[  141.879453]  ? exc_invalid_op+0x18/0x80
[  141.879809]  ? asm_exc_invalid_op+0x1b/0x20
[  141.880172]  ? remap_pfn_range_internal+0x4af/0x5a0
[  141.880530]  ? pat_pagerange_is_ram+0x7a/0xa0
[  141.880886]  ? memtype_lookup+0x3b/0x70
[  141.881244]  ? lookup_memtype+0xd1/0xf0
[  141.881595]  remap_pfn_range+0x5c/0xb0
[  141.881946]  ? up+0x58/0xa0
[  141.882302]  vgpu_mmio_fault_wrapper+0x1fa/0x340 [nvidia_vgpu_vfio]
[  141.882661]  __do_fault+0x3a/0x180
[  141.883016]  do_fault+0xca/0x4f0
[  141.883373]  __handle_mm_fault+0x840/0x10b0
[  141.883717]  handle_mm_fault+0x1a5/0x360
[  141.884056]  __get_user_pages+0x1f2/0x15d0
[  141.884402]  get_user_pages_unlocked+0xe7/0x370
[  141.884732]  hva_to_pfn+0x380/0x4c0 [kvm]
[  141.885127]  ? __perf_event_task_sched_out+0x5a/0x4a0
[  141.885447]  kvm_follow_pfn+0x97/0x100 [kvm]
[  141.885825]  __kvm_faultin_pfn+0x5c/0x90 [kvm]
[  141.886194]  kvm_mmu_faultin_pfn+0x19d/0x6e0 [kvm]
[  141.886576]  kvm_tdp_page_fault+0x8e/0xe0 [kvm]
[  141.886938]  kvm_mmu_do_page_fault+0x243/0x290 [kvm]
[  141.887301]  kvm_mmu_page_fault+0x8e/0x6d0 [kvm]
[  141.887646]  ? nv_vgpu_vfio_access+0x2d4/0x450 [nvidia_vgpu_vfio]
[  141.887915]  npf_interception+0xba/0x190 [kvm_amd]
[  141.888181]  svm_invoke_exit_handler+0x182/0x1b0 [kvm_amd]
[  141.888448]  svm_handle_exit+0xa2/0x200 [kvm_amd]
[  141.888705]  vcpu_enter_guest+0x4e8/0x1640 [kvm]
[  141.889033]  ? kvm_arch_vcpu_load+0xac/0x290 [kvm]
[  141.889359]  ? restore_fpregs_from_fpstate+0x3d/0xd0
[  141.889599]  kvm_arch_vcpu_ioctl_run+0x35d/0x750 [kvm]
[  141.889903]  kvm_vcpu_ioctl+0x2c2/0xaa0 [kvm]
[  141.890195]  ? kvm_vcpu_ioctl+0x23e/0xaa0 [kvm]
[  141.890488]  ? nv_vfio_mdev_read+0x23/0x70 [nvidia_vgpu_vfio]
[  141.890713]  __x64_sys_ioctl+0xa4/0xe0
[  141.890933]  x64_sys_call+0xb45/0x2540
[  141.891148]  do_syscall_64+0x7e/0x170
[  141.891364]  ? syscall_exit_to_user_mode+0x38/0x1d0
[  141.891575]  ? do_syscall_64+0x8a/0x170
[  141.891781]  ? arch_exit_to_user_mode_prepare.constprop.0+0xc8/0xd0
[  141.891989]  ? syscall_exit_to_user_mode+0x38/0x1d0
[  141.892193]  ? do_syscall_64+0x8a/0x170
[  141.892405]  ? syscall_exit_to_user_mode+0x38/0x1d0
[  141.892613]  ? do_syscall_64+0x8a/0x170
[  141.892820]  ? sysvec_apic_timer_interrupt+0x57/0xc0
[  141.893029]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  141.893245] RIP: 0033:0x7bbc53e81d1b
[  141.893496] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[  141.893983] RSP: 002b:00007bbc4aff7ee0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  141.894270] RAX: ffffffffffffffda RBX: 00005da48d103680 RCX: 00007bbc53e81d1b
[  141.894518] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000030
[  141.894764] RBP: 000000000000ae80 R08: 0000000000000000 R09: 0000000000000000
[  141.895011] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  141.895264] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[  141.895508]  </TASK>
[  141.895744] ---[ end trace 0000000000000000 ]---

In addition, I get this in the VM (with either 6.11 or 6.14 running on the host):
Code:
[    3.407489] ------------[ cut here ]------------
[    3.407491] WARNING: CPU: 1 PID: 560 at drivers/pci/msi/msi.c:888 __pci_enable_msi_range+0x1b3/0x1d0
[    3.407500] Modules linked in: overlay lz4 lz4_compress zram zsmalloc binfmt_misc nls_ascii nls_cp437 vfat fat nvidia_drm(POE) nvidia_modeset(POE) intel_rapl_msr intel_rapl_common nvidia(POE) kvm_amd ccp kvm nouveau irqbypass ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core mxm_wmi video snd_hwdep wmi snd_pcm iTCO_wdt drm_display_helper aesni_intel cec snd_timer rc_core intel_pmc_bxt crypto_simd iTCO_vendor_support snd cryptd pcspkr hid_generic watchdog i2c_algo_bit virtio_console soundcore button joydev evdev sg serio_raw fuse loop efi_pstore dm_mod configfs efivarfs qemu_fw_cfg ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic usbhid hid sd_mod t10_pi crc64_rocksoft crc64 crc_t10dif crct10dif_generic virtio_net virtio_scsi net_failover failover virtio_pci ahci virtio_pci_legacy_dev libahci ehci_pci virtio_pci_modern_dev virtio crct10dif_pclmul libata crct10dif_common
[    3.407572]  virtio_ring bochs crc32_pclmul drm_vram_helper uhci_hcd crc32c_intel drm_kms_helper scsi_mod psmouse drm_ttm_helper ttm i2c_i801 scsi_common i2c_smbus lpc_ich ehci_hcd drm usbcore usb_common
[    3.407585] CPU: 1 PID: 560 Comm: nvidia-gridd Tainted: P        W  OE      6.1.0-33-amd64 #1  Debian 6.1.133-1
[    3.407589] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 4.2025.02-3 04/03/2025
[    3.407590] RIP: 0010:__pci_enable_msi_range+0x1b3/0x1d0
[    3.407593] Code: 4c 89 ef e8 df fb ff ff 89 c6 85 c0 0f 84 68 ff ff ff 78 0e 39 c5 7f 8c 4d 85 e4 75 cc 41 89 f6 eb d8 41 89 c6 e9 50 ff ff ff <0f> 0b 41 be ea ff ff ff e9 43 ff ff ff 41 be de ff ff ff e9 38 ff
[    3.407595] RSP: 0018:ffffa973c0813998 EFLAGS: 00010202
[    3.407597] RAX: 0000000000000010 RBX: 0000000000000001 RCX: 0000000000000000
[    3.407598] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9b1c40fb7000
[    3.407599] RBP: 0000000000000001 R08: 0000000000000001 R09: ffff9b1c4b759708
[    3.407600] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[    3.407601] R13: ffff9b1c40fb7000 R14: ffff9b1c4586d3e0 R15: ffff9b1c4586d000
[    3.407604] FS:  00007fb7dd066040(0000) GS:ffff9b1fafc80000(0000) knlGS:0000000000000000
[    3.407605] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.407607] CR2: 0000000000b50b10 CR3: 0000000108f32000 CR4: 0000000000350ee0
[    3.407610] Call Trace:
[    3.407612]  <TASK>
[    3.407615]  ? __warn+0x7d/0xc0
[    3.407618]  ? __pci_enable_msi_range+0x1b3/0x1d0
[    3.407620]  ? report_bug+0xe2/0x150
[    3.407623]  ? handle_bug+0x41/0x70
[    3.407627]  ? exc_invalid_op+0x13/0x60
[    3.407629]  ? asm_exc_invalid_op+0x16/0x20
[    3.407634]  ? __pci_enable_msi_range+0x1b3/0x1d0
[    3.407636]  pci_enable_msi+0x16/0x30
[    3.407638]  nv_init_msi+0x1a/0xe0 [nvidia]
[    3.408007]  nv_open_device+0x843/0x940 [nvidia]
[    3.408364]  nvidia_open+0x361/0x610 [nvidia]
[    3.408721]  ? kobj_lookup+0xf1/0x160
[    3.408725]  nvidia_frontend_open+0x50/0xa0 [nvidia]
[    3.409109]  chrdev_open+0xc1/0x250
[    3.409113]  ? __unregister_chrdev+0x50/0x50
[    3.409116]  do_dentry_open+0x1e2/0x410
[    3.409119]  path_openat+0xb7d/0x1260
[    3.409122]  do_filp_open+0xaf/0x160
[    3.409126]  do_sys_openat2+0xaf/0x170
[    3.409128]  __x64_sys_openat+0x6a/0xa0
[    3.409131]  do_syscall_64+0x55/0xb0
[    3.409134]  ? call_rcu+0xde/0x630
[    3.409137]  ? mntput_no_expire+0x4a/0x250
[    3.409141]  ? kmem_cache_free+0x15/0x310
[    3.409144]  ? do_unlinkat+0xb8/0x320
[    3.409146]  ? exit_to_user_mode_prepare+0x40/0x1e0
[    3.409149]  ? syscall_exit_to_user_mode+0x1e/0x40
[    3.409150]  ? do_syscall_64+0x61/0xb0
[    3.409153]  ? exit_to_user_mode_prepare+0x40/0x1e0
[    3.409155]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[    3.409157] RIP: 0033:0x7fb7dcc36fc1
[    3.409159] Code: 75 57 89 f0 25 00 00 41 00 3d 00 00 41 00 74 49 80 3d 2a 26 0e 00 00 74 6d 89 da 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 93 00 00 00 48 8b 54 24 28 64 48 2b 14 25
[    3.409161] RSP: 002b:00007fffd8dede30 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
[    3.409163] RAX: ffffffffffffffda RBX: 0000000000080002 RCX: 00007fb7dcc36fc1
[    3.409164] RDX: 0000000000080002 RSI: 00007fffd8dedec0 RDI: 00000000ffffff9c
[    3.409165] RBP: 00007fffd8dedec0 R08: 0000000000000000 R09: 0000000000000064
[    3.409166] R10: 0000000000000000 R11: 0000000000000202 R12: 00007fffd8dedfec
[    3.409167] R13: 0000000000c22560 R14: 0000000000c22560 R15: 0000000000c22560
[    3.409169]  </TASK>
[    3.409170] ---[ end trace 0000000000000000 ]---
[    3.409171] NVRM: GPU 0000:01:00.0: Failed to enable MSI; falling back to PCIe virtual-wire interrupts.

Everything seems to be working regardless of these warnings. I just don't remember seeing them before I started changing kernels and updating drivers.

Is there any real benefit to using the 17.x drivers and mocking an A5500? To apply the A5500 patch, I guess I need to remove and reinstall vgpu_unlock-rs, but using the GreenDamTan repo and instructions?
Follow-up: this warning:
`WARNING: CPU: 1 PID: 560 at drivers/pci/msi/msi.c:888 __pci_enable_msi_range+0x1b3/0x1d0`
was because I did not have the nouveau driver blacklisted in my VM.
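For anyone else hitting that one, the usual guest-side fix is a modprobe blacklist plus an initramfs rebuild (a sketch, assuming Debian-style paths as in my guest):
Code:
# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

# then rebuild the initramfs and reboot:
# update-initramfs -u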

I am, however, still getting this warning on the host:
`WARNING: CPU: 5 PID: 24912 at ./include/linux/rwsem.h:85 remap_pfn_range_internal+0x4af/0x5a0`

I only get that warning when I start up a VM that has the vGPU passed to it. Either the 6.14 kernel or the 17.5 drivers introduced it (or maybe the newer QEMU that came with 8.4?), because I didn't see it before I updated everything.

It seems to be working, so I'm not worried.
 
@zenowl77 For the first time, I tried to start up a 2nd VM with a vGPU, and it failed when it had a different amount of memory. I set it to 4GB (the same as the one in use) and it failed with a different error:

Code:
kvm: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/00000000-0000-0000-0000-000000000210,enable-migration=on,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0: vfio 00000000-0000-0000-0000-000000000210: error getting device from group 97: Input/output error
Verify all devices in group 97 are bound to vfio-<bus> or pci-stub and not already in use
stopping swtpm instance (pid 223324) due to QEMU startup error
waited 10 seconds for mediated device driver finishing clean up
actively clean up mediated device with UUID 00000000-0000-0000-0000-000000000210
TASK ERROR: start failed: QEMU exited with code 1

My card is an 8GB card. One VM has a 4GB vGPU passed to it. Maybe it doesn't like two 4GB vGPUs? When I tried a 2GB vGPU, it failed with a message saying an mdev instance wasn't available.

Edit: I guess my 8GB card doesn't like 2x 4GB vGPUs. I used the profile override feature of the unlocker for one VM to force it to 2GB and it started up.
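For reference, the per-VM override looks something like this in vgpu_unlock-rs's profile_override.toml (the path, VM number, and exact values here are just an example; the 2GB framebuffer split matches the reference list further down the thread):
Code:
# /etc/vgpu_unlock/profile_override.toml
[vm.210]                               # per-VM section (210 is a placeholder VM number)
framebuffer = 0x74000000               # 2GB profile: framebuffer + reservation = 0x80000000
framebuffer_reservation = 0xC000000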
 
That is interesting, I haven't had that issue with mine. I am currently using 2x 4GB profiles, one for Windows 11 and one for Ubuntu 22.04, running Stable Diffusion in one and Ollama in the other.

Here is the configuration in mine for the 2 VMs:
Code:
[vm.107]
display_width = 3840
display_height = 2160
max_pixels = 8294400
cuda_enabled = 1
framebuffer = 0xEC000000
framebuffer_reservation = 0x14000000 # 4GB

[vm.128]
display_width = 3840
display_height = 2160
max_pixels = 8294400
cuda_enabled = 1
framebuffer = 0xEC000000
framebuffer_reservation = 0x14000000 # 4GB

You may of course also want to add frl_enabled = 0 to remove the frame rate limit; I keep forgetting to do that in my AI profiles, though I'm not sure it helps with AI workloads anyway.
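For example, a sketch of what that looks like when added to one of the [vm.*] sections above:
Code:
[vm.107]
cuda_enabled = 1
frl_enabled = 0                      # disable the frame rate limiter for this VM
framebuffer = 0xEC000000
framebuffer_reservation = 0x14000000 # 4GB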

Here is my VRAM reference section too, for anyone who wants it. I worked out a custom 7GB one just because sometimes I want to use almost all the VRAM while leaving enough for a 512MB profile or two. I just keep this at the end of my file to copy/paste from when changing the amounts.
Code:
#ref
#framebuffer = 0x1A000000
#framebuffer_reservation = 0x6000000 # 512MB
#framebuffer = 0x38000000
#framebuffer_reservation = 0x8000000 # 1GB
#framebuffer = 0x74000000
#framebuffer_reservation = 0xC000000 # 2GB
#framebuffer = 0xB0000000
#framebuffer_reservation = 0x10000000 # 3GB
#framebuffer = 0xEC000000
#framebuffer_reservation = 0x14000000 # 4GB
#framebuffer = 0x128000000
#framebuffer_reservation = 0x18000000 # 5GB
#framebuffer = 0x164000000
#framebuffer_reservation = 0x1C000000 # 6GB
#framebuffer = 0x1A0000000
#framebuffer_reservation = 0x20000000 # 7GB
#framebuffer = 0x1DC000000
#framebuffer_reservation = 0x24000000 # 8GB
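In case it helps anyone derive other sizes: as far as I can tell, the two values in each pair add up to the full profile size, and the reservation grows by 64MB per GB on top of a 64MB base. A couple of worked checks against the list above:
Code:
# 4GB entry: 0xEC000000  + 0x14000000 = 0x100000000 (4096MB); 0x14000000 = 320MB = 64MB + 4*64MB
# 7GB entry: 0x1A0000000 + 0x20000000 = 0x1C0000000 (7168MB); 0x20000000 = 512MB = 64MB + 7*64MB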
 
Yeah, not sure. I even overrode the [profile.nvidia-765] profile and set those same values in the framebuffer for 4GB just in case, but it didn't work.

The good thing is that it reminded me how to override one for a specific VM since I only needed 2GB anyway.

I'll leave that alone and see if I can figure out this warning that pops up whenever I start a VM that has a vGPU. I think it is related to the CPU type passed to the VM.

What CPU do you have in your host and what cpu type do you select for your VM?

Code:
[126838.253040] ------------[ cut here ]------------
[126838.253466] WARNING: CPU: 29 PID: 23749 at ./include/linux/rwsem.h:85 remap_pfn_range_internal+0x4af/0x5a0
[126838.253723] Modules linked in: veth nfs_layout_flexfiles rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel nf_tables nfnetlink_cttimeout softdog sunrpc binfmt_misc bonding tls openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 psample nfnetlink_log nfnetlink nvidia_vgpu_vfio(OE) amd_atl intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd nvidia(POE) ipmi_ssif kvm_amd polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd mdev rapl acpi_ipmi kvm ast pcspkr ccp k10temp ptdma ipmi_si ipmi_devintf joydev input_leds ipmi_msghandler mac_hid vhost_net vhost vhost_iotlb tap vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq hid_generic usbkbd usbmouse mlx4_ib usbhid ib_uverbs hid ib_core mlx4_en igb xhci_pci
[126838.253794]  nvme i2c_algo_bit mpt3sas ahci dca mlx4_core nvme_core xhci_hcd libahci raid_class i2c_piix4 nvme_auth i2c_smbus scsi_transport_sas
[126838.256903] CPU: 29 UID: 0 PID: 23749 Comm: CPU 0/KVM Tainted: P        W  OE      6.14.0-2-pve #1
[126838.257218] Tainted: [P]=PROPRIETARY_MODULE, [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[126838.257544] Hardware name: Supermicro Super Server/H11SSL-NC, BIOS 3.2 01/24/2025
[126838.257872] RIP: 0010:remap_pfn_range_internal+0x4af/0x5a0
[126838.258195] Code: 31 db c3 cc cc cc cc 48 8b 7d a8 4c 89 fa 4c 89 ce 4c 89 4d c0 e8 81 e2 ff ff 85 c0 75 9c 4c 8b 4d c0 4d 8b 01 e9 aa fd ff ff <0f> 0b e9 d7 fb ff ff 0f 0b 48 8b 7d a8 4c 89 fa 48 89 de 4c 89 45
[126838.258866] RSP: 0018:ffffbb367bdc3320 EFLAGS: 00010246
[126838.259203] RAX: 000000002c0644fb RBX: ffff923cd4d3c450 RCX: 0000000000001000
[126838.259552] RDX: 0000000000000000 RSI: 000076a58fe03000 RDI: ffff923cd4d3c450
[126838.259934] RBP: ffffbb367bdc33d8 R08: 8000000000000037 R09: 0000000000000000
[126838.260278] R10: 0000000000000000 R11: ffff92379bdf0b00 R12: 000000002000fdf4
[126838.260634] R13: 000076a58fe04000 R14: 000076a58fe03000 R15: 8000000000000037
[126838.260979] FS:  000076adda5a26c0(0000) GS:ffff9275cee80000(0000) knlGS:0000000000000000
[126838.261332] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[126838.261687] CR2: 00007f06ccbff718 CR3: 000000052bf0c000 CR4: 0000000000350ef0
[126838.262039] Call Trace:
[126838.262395]  <TASK>
[126838.262743]  ? show_regs+0x6c/0x80
[126838.263095]  ? __warn+0x8d/0x150
[126838.263461]  ? remap_pfn_range_internal+0x4af/0x5a0
[126838.263810]  ? report_bug+0x182/0x1b0
[126838.264160]  ? handle_bug+0x6e/0xb0
[126838.264515]  ? exc_invalid_op+0x18/0x80
[126838.264861]  ? asm_exc_invalid_op+0x1b/0x20
[126838.265210]  ? remap_pfn_range_internal+0x4af/0x5a0
[126838.265553]  ? pat_pagerange_is_ram+0x7a/0xa0
[126838.265893]  ? memtype_lookup+0x3b/0x70
[126838.266215]  ? lookup_memtype+0xd1/0xf0
[126838.266544]  remap_pfn_range+0x5c/0xb0
[126838.266859]  ? up+0x58/0xa0
[126838.267180]  vgpu_mmio_fault_wrapper+0x1fa/0x340 [nvidia_vgpu_vfio]
[126838.267513]  __do_fault+0x3a/0x180
[126838.267836]  do_fault+0xca/0x4f0
[126838.268153]  __handle_mm_fault+0x840/0x10b0
[126838.268478]  handle_mm_fault+0x1a5/0x360
[126838.268795]  fixup_user_fault+0x8c/0x1d0
[126838.269111]  hva_to_pfn+0x337/0x4c0 [kvm]
[126838.269504]  kvm_follow_pfn+0x97/0x100 [kvm]
[126838.269874]  __kvm_faultin_pfn+0x5c/0x90 [kvm]
[126838.270228]  kvm_mmu_faultin_pfn+0x19d/0x6e0 [kvm]
[126838.270596]  kvm_tdp_page_fault+0x8e/0xe0 [kvm]
[126838.270943]  kvm_mmu_do_page_fault+0x243/0x290 [kvm]
[126838.271283]  kvm_mmu_page_fault+0x8e/0x6d0 [kvm]
[126838.271622]  ? psi_group_change+0x1fd/0x410
[126838.271880]  ? __perf_event_task_sched_in+0x93/0x1f0
[126838.272132]  ? emulator_read_write+0x42/0x1c0 [kvm]
[126838.272482]  npf_interception+0xba/0x190 [kvm_amd]
[126838.272725]  svm_invoke_exit_handler+0x182/0x1b0 [kvm_amd]
[126838.272964]  svm_handle_exit+0xa2/0x200 [kvm_amd]
[126838.273196]  vcpu_enter_guest+0x4e8/0x1640 [kvm]
[126838.273494]  ? x86_emulate_instruction+0x42b/0x760 [kvm]
[126838.273793]  ? kvm_arch_vcpu_load+0xac/0x290 [kvm]
[126838.274074]  kvm_arch_vcpu_ioctl_run+0x35d/0x750 [kvm]
[126838.274353]  ? finish_wait+0x5a/0x80
[126838.274557]  kvm_vcpu_ioctl+0x2c2/0xaa0 [kvm]
[126838.274845]  ? up+0x58/0xa0
[126838.275042]  ? nv_vgpu_vfio_access+0x37f/0x430 [nvidia_vgpu_vfio]
[126838.275243]  ? __check_object_size+0x6a/0x300
[126838.275452]  __x64_sys_ioctl+0xa4/0xe0
[126838.275648]  x64_sys_call+0xb45/0x2540
[126838.275840]  do_syscall_64+0x7e/0x170
[126838.276029]  ? vfio_device_fops_read+0x27/0x50 [vfio]
[126838.276218]  ? vfs_read+0xfc/0x390
[126838.276406]  ? do_syscall_64+0x8a/0x170
[126838.276589]  ? __rseq_handle_notify_resume+0xa0/0x4e0
[126838.276776]  ? timer_delete_sync+0x10/0x20
[126838.276964]  ? arch_exit_to_user_mode_prepare.constprop.0+0xc8/0xd0
[126838.277154]  ? syscall_exit_to_user_mode+0x38/0x1d0
[126838.277357]  ? do_syscall_64+0x8a/0x170
[126838.277547]  ? nv_vgpu_vfio_access+0x37f/0x430 [nvidia_vgpu_vfio]
[126838.277753]  ? nv_vgpu_vfio_write+0xb4/0x150 [nvidia_vgpu_vfio]
[126838.277951]  ? nv_vfio_mdev_write+0x23/0x70 [nvidia_vgpu_vfio]
[126838.278149]  ? vfio_device_fops_write+0x27/0x50 [vfio]
[126838.278353]  ? vfs_write+0x104/0x480
[126838.278546]  ? arch_exit_to_user_mode_prepare.constprop.0+0xc8/0xd0
[126838.278745]  ? __rseq_handle_notify_resume+0xa0/0x4e0
[126838.278974]  ? __rseq_handle_notify_resume+0xa0/0x4e0
[126838.279173]  ? arch_exit_to_user_mode_prepare.constprop.0+0xc8/0xd0
[126838.279386]  ? syscall_exit_to_user_mode+0x38/0x1d0
[126838.279583]  ? do_syscall_64+0x8a/0x170
[126838.279781]  ? do_syscall_64+0x8a/0x170
[126838.279975]  ? do_syscall_64+0x8a/0x170
[126838.280166]  ? sysvec_apic_timer_interrupt+0x57/0xc0
[126838.280368]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[126838.280577] RIP: 0033:0x76addf6cdd1b
[126838.280772] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[126838.281194] RSP: 002b:000076adda59cee0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[126838.281420] RAX: ffffffffffffffda RBX: 00006356437e7740 RCX: 000076addf6cdd1b
[126838.281641] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000004e
[126838.281873] RBP: 000000000000ae80 R08: 0000000000000000 R09: 0000000000000000
[126838.282095] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[126838.282321] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[126838.282546]  </TASK>
[126838.282761] ---[ end trace 0000000000000000 ]---
 
Happy to hear it helped somehow at least.

Is your IOMMU flag set in GRUB?
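On an Intel host that is typically something like the line below in /etc/default/grub, followed by update-grub and a reboot (just a sketch; "quiet" is the usual default flag, and on recent kernels the AMD IOMMU is generally enabled by default, so iommu=pt is often all that's needed there):
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"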

I am working with an i7-7820X and I always use the "host" CPU type; the other types do not expose all the same features properly and, in my experience at least, end up with lower performance while simultaneously causing more CPU usage on the host side.

Also, while it is not always the case, I've quite a few times had luck with just restarting the vGPU manager service before starting the VMs:

systemctl restart nvidia-vgpu-mgr.service

If that turns out to be the fix, I usually have to do it every time after a reboot, but it will usually work in those instances at least.

I think I did run into that same error one time. Not sure what caused it, but the VM didn't start, I tried again, and it started. I think mine might have just been from too much memory being occupied at the time, though.

The P4 absolutely should work with two VMs set to 4GB each. It actually seems to be one of the most optimal configurations; I've noticed that when assigning 7-8GB to one VM it always comes out as 6.5GB, 7.5GB, etc., whereas 4GB is exactly 4GB.
 
Yeah, it is set. But I'll go back thru everything just to make sure.

Here is my nvidia-smi from one host that just has a single VM/vGPU in use:

The mdev type is nvidia-765. I wonder if something on the host is causing some memory usage and taking a bit off the max, since it shows 7680MiB.

Code:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.02             Driver Version: 550.144.02     CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P4                       On  |   00000000:81:00.0 Off |                    0 |
| N/A   36C    P8             10W /   75W |    4009MiB /   7680MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     23769    C+G   vgpu                                         3976MiB |
+-----------------------------------------------------------------------------------------+

Here is a different host with nothing running:
Code:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.02             Driver Version: 550.144.02     CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P4                       On  |   00000000:81:00.0 Off |                    0 |
| N/A   36C    P8             10W /   75W |      33MiB /   7680MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
I wonder if it is that 33MiB that is the issue.

Edit: Disregard the 33MiB part, I don't think that is it.
 
That's interesting, my P4 shows 8192MB; it should be exactly 8GB.

Here is what mine shows with the 2x 4GB profiles running.
Code:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.02             Driver Version: 550.144.02     CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P4                       On  |   00000000:17:00.0 Off |                  Off |
| N/A   59C    P0             47W /   68W |    8094MiB /   8192MiB |     73%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   3516431    C+G   vgpu                                         4030MiB |
|    0   N/A  N/A   3585720    C+G   vgpu                                         4030MiB |
+-----------------------------------------------------------------------------------------+
It seems for some reason your total VRAM is not what it should be?

EDIT: I suppose you could always do a 4GB+3GB setup, or 4GB+2GB+1GB, or something. Or even 4GB+3.5GB (you would have to figure out the settings for 3.5GB, as I did with the 7GB profile).
 
I'm investigating the VRAM issue now, going through my BIOS settings, etc.

Yeah, I ended up using 1x 4GB and 1x 2GB VMs for now. I'm more curious about the *why* right now, just to be sure it doesn't cause some other issue.
 
I would be too. So far I have not run into the same issue myself, but I found a post on a blog in Chinese where the user was also running vGPU on Proxmox with a P4 and had the exact nvidia-smi output you have, so it seems like some kind of BIOS setting or issue on some of the P4s. Have you noticed it saying 8192MB before at any point, or has it always been 7680MB?

Which is also weird because it's exactly 512MB less than it should be, which seems very specific and points to something allocating/reserving it. Is it your primary GPU? Could that be the issue?
 
Yeah, I noticed the 512MB and was proceeding along the same line of thought about something reserving it. It is indeed the only GPU in the system; the board is a "Supermicro H11SSL-NC 2.0" (all 3 nodes are the same).

I poked and prodded the BIOS and nothing stands out, and I don't really remember if it reported 8192 before. This is the first time I started a 2nd instance of a vGPU-enabled VM.

I'll keep looking around as a learning exercise.
 
Does the AMD CPU have an iGPU or anything? I am guessing no with that motherboard, but if so, enabling it could give you a second GPU and free up the rest of the Tesla. You would lose some RAM, but you could also use the iGPU in other ways.

I would be interested to know too, just for the sake of future reference, as that's kind of an annoyance.

One thing that can reduce VRAM is ECC, but it appears to be off in your nvidia-smi, and the ECC overhead should be 12.5%, so 1024MB on 8GB, which would make it 7168MB, and vGPU then shouldn't work if it was enabled. So it doesn't appear to be that at least. But then again, your ECC column says "0" whereas mine says "Off".

Disabling GPU ECC Memory and Persistence Mode
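For reference, on the host that usually comes down to a couple of nvidia-smi commands plus a reboot; a sketch, assuming GPU index 0:
Code:
nvidia-smi -i 0 -e 0    # disable ECC on GPU 0 (takes effect after the next reboot)
nvidia-smi -i 0 -pm 1   # enable persistence mode
reboot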
 
Funny, I just came across this post talking about ECC VRAM settings. I turned it off, rebooted, and now it shows up as the full 8GB:


Code:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.02             Driver Version: 550.144.02     CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P4                       On  |   00000000:81:00.0 Off |                  Off |
| N/A   36C    P8             10W /   75W |      34MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
 
Awesome, glad to see you got it.

That's a strange one, I thought vGPU wouldn't work if ECC was enabled; maybe that's only on old driver versions.

I wonder if there is any benefit at all to AI accuracy from using ECC. (Probably extremely minimal, if anything.)
 
My understanding is that ECC is really more for hardware faults, a little power blip or cosmic ray flipping a bit here or there. I definitely wanted it for my system memory with ZFS, but for my GPU workloads (transcoding and CodeProject.AI for person/vehicle detection with Blue Iris) I can live with a flipped bit.
 
Yeah, I definitely see the point with system memory; it does seem to make a noticeable difference in overall system stability. I do not currently have ECC RAM, but years ago, when using a Windows 2012 R2 copy as a server OS, it definitely made a noticeable impact.

I wonder if it could potentially have any impact on LLMs, etc., but the benefit would likely be so minimal that it's just not worth the 512MB loss, and the performance impact would probably also be something to consider. lol
 