Nvidia vGPU + Windows 11 guest - DRIVER POWER STATE FAILURE

gracexciv

Hello,
I have a PVE 8.1.4 setup with 2x Nvidia L40S cards in it, as well as a pair of VMs with Windows 11 installed.

Host and guest drivers as well as the DLS server setup all work well. I can confirm the vGPUs work and can be used in the guest VMs.
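(For anyone wanting to double-check the same thing, these are roughly the commands I used to verify it - assuming the standard vGPU host and guest drivers: on the host, `nvidia-smi vgpu` lists the active vGPU instances, and inside the Windows guest `nvidia-smi -q` shows the license state once the DLS token is in place.)

Code:
# on the Proxmox host: list the vGPU instances the driver sees
nvidia-smi vgpu
# inside the Windows guest (elevated prompt): check the license section
nvidia-smi -q | findstr /i "license"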

Setup here:

Code:
root@proxmox:~# pveversion -v
proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
ceph-fuse: 17.2.7-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.4
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.5-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1



Code:
root@proxmox:~# qm config 200
balloon: 0
bios: ovmf
boot: order=scsi0;ide0
cores: 4
cpu: x86-64-v2-AES
efidisk0: ss-lvm-pool:vm-200-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: mapping=vGPUpool,mdev=nvidia-1155,pcie=1
machine: pc-q35-8.1
memory: 16384
meta: creation-qemu=8.1.5,ctime=1716915047
name: VM-win11test1
net0: virtio=BC:24:11:5F:9D:85,bridge=vmbr20,firewall=1
net1: virtio=BC:24:11:E5:27:88,bridge=vmbr50,firewall=1
numa: 0
ostype: win11
scsi0: ss-lvm-pool:vm-200-disk-1,iothread=1,size=100G
scsihw: virtio-scsi-single
smbios1: uuid=4f344fe7-460e-4913-9715-251c871a5252
sockets: 2
tpmstate0: ss-lvm-pool:vm-200-disk-2,size=4M,version=v2.0
vmgenid: c5f7655c-c19c-4c0d-bb86-208ca0a77f1c
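In case it's useful context: the `nvidia-1155` in `hostpci0` is the mdev type name of the vGPU profile. On our setup the available profiles show up in sysfs under one of the L40S virtual functions - the PCI address below is only an example, substitute one of your own VF addresses (and the exact path may differ depending on driver version):

Code:
# example VF address - adjust to one of your L40S virtual functions
VF=0000:b5:00.4
ls /sys/bus/pci/devices/$VF/mdev_supported_types/
cat /sys/bus/pci/devices/$VF/mdev_supported_types/nvidia-1155/name
cat /sys/bus/pci/devices/$VF/mdev_supported_types/nvidia-1155/available_instances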

The only (minor) nuisance I have left is a blue screen that Windows occasionally throws on boot: DRIVER POWER STATE FAILURE.

I have noticed that this happens only when there is no other VM running with an allocated vGPU instance.
Say I have VM 200 and VM 201, both with a vGPU allocated.
If VM 200 is running and I boot VM 201, it's all nice and smooth.
If both VM 200 and VM 201 are off and I boot either, I get:
  1. DRIVER POWER STATE FAILURE (see attached). The VM reboots.
  2. On the second boot, I get PAGE FAULT IN NONPAGED AREA (what failed: nvlddmkm.sys). The VM reboots.
  3. On the third boot, the VM boots correctly.
This makes the boot sequence quite long, and I don't remember having any of these issues with vSphere, so I feel there must be a solution out there.

It appears as if the L40S needs to be somewhat powered on before the VM boots. I have looked all over the Nvidia documentation but can't find any relevant option for my case.

Has anyone had a similar experience?

thanks
 

Attachments

  • 1716988450229.png
  • 1716988719671.png
Syslog of the VM boot sequence:
Code:
May 29 14:23:05 proxmox pvedaemon[158074]: <root@pam> starting task UPID:proxmox:0002E736:0071FBEB:66572C39:vncproxy:200:root@pam:
May 29 14:23:05 proxmox pvedaemon[190262]: starting vnc proxy UPID:proxmox:0002E736:0071FBEB:66572C39:vncproxy:200:root@pam:
May 29 14:23:05 proxmox pveproxy[176949]: worker exit
May 29 14:23:05 proxmox pveproxy[1872]: worker 176949 finished
May 29 14:23:05 proxmox pveproxy[1872]: starting 1 worker(s)
May 29 14:23:05 proxmox pveproxy[1872]: worker 190265 started
May 29 14:23:26 proxmox nvidia-vgpu-mgr[188867]: notice: vmiop_env_log: (0x0): Plugin migration stage change none -> stop_and_copy. QEMU migration state: STOPNCOPY_ACTIVE
May 29 14:23:27 proxmox kernel: tap200i0: left allmulticast mode
May 29 14:23:27 proxmox kernel: fwbr200i0: port 2(tap200i0) entered disabled state
May 29 14:23:27 proxmox kernel: fwbr200i0: port 1(fwln200i0) entered disabled state
May 29 14:23:27 proxmox kernel: vmbr20: port 2(fwpr200p0) entered disabled state
May 29 14:23:27 proxmox kernel: fwln200i0 (unregistering): left allmulticast mode
May 29 14:23:27 proxmox kernel: fwln200i0 (unregistering): left promiscuous mode
May 29 14:23:27 proxmox kernel: fwbr200i0: port 1(fwln200i0) entered disabled state
May 29 14:23:27 proxmox kernel: fwpr200p0 (unregistering): left allmulticast mode
May 29 14:23:27 proxmox kernel: fwpr200p0 (unregistering): left promiscuous mode
May 29 14:23:27 proxmox kernel: vmbr20: port 2(fwpr200p0) entered disabled state
May 29 14:23:27 proxmox kernel: tap200i1: left allmulticast mode
May 29 14:23:27 proxmox kernel: fwbr200i1: port 2(tap200i1) entered disabled state
May 29 14:23:27 proxmox kernel: fwbr200i1: port 1(fwln200i1) entered disabled state
May 29 14:23:27 proxmox kernel: vmbr50: port 2(fwpr200p1) entered disabled state
May 29 14:23:27 proxmox kernel: fwln200i1 (unregistering): left allmulticast mode
May 29 14:23:27 proxmox kernel: fwln200i1 (unregistering): left promiscuous mode
May 29 14:23:27 proxmox kernel: fwbr200i1: port 1(fwln200i1) entered disabled state
May 29 14:23:27 proxmox kernel: fwpr200p1 (unregistering): left allmulticast mode
May 29 14:23:27 proxmox kernel: fwpr200p1 (unregistering): left promiscuous mode
May 29 14:23:27 proxmox kernel: vmbr50: port 2(fwpr200p1) entered disabled state
May 29 14:23:28 proxmox qmeventd[1417]: read: Connection reset by peer
May 29 14:23:28 proxmox pvedaemon[186579]: VM 200 qmp command failed - VM 200 not running
May 29 14:23:28 proxmox nvidia-vgpu-mgr[188867]: notice: vmiop_log: Stopping all vGPU migration threads
May 29 14:23:28 proxmox pvedaemon[158074]: <root@pam> end task UPID:proxmox:0002E736:0071FBEB:66572C39:vncproxy:200:root@pam: OK
May 29 14:23:28 proxmox qmeventd[190368]: Starting cleanup for 200
May 29 14:23:29 proxmox systemd[1]: 200.scope: Deactivated successfully.
May 29 14:23:29 proxmox systemd[1]: 200.scope: Consumed 7min 49.358s CPU time.
May 29 14:23:38 proxmox qmeventd[190368]: Finished cleanup for 200
May 29 14:23:38 proxmox kernel: nvidia-vgpu-vfio 00000000-0000-0000-0000-000000000200: Removing from iommu group 293
May 29 14:23:39 proxmox pvedaemon[158074]: <root@pam> starting task UPID:proxmox:0002E7C2:00720934:66572C5B:qmstart:200:root@pam:
May 29 14:23:39 proxmox pvedaemon[190402]: start VM 200: UPID:proxmox:0002E7C2:00720934:66572C5B:qmstart:200:root@pam:
May 29 14:23:39 proxmox kernel: nvidia-vgpu-vfio 00000000-0000-0000-0000-000000000200: Adding to iommu group 293
May 29 14:23:39 proxmox systemd[1]: Started 200.scope.
May 29 14:23:40 proxmox kernel: tap200i0: entered promiscuous mode
May 29 14:23:40 proxmox kernel: vmbr20: port 2(fwpr200p0) entered blocking state
May 29 14:23:40 proxmox kernel: vmbr20: port 2(fwpr200p0) entered disabled state
May 29 14:23:40 proxmox kernel: fwpr200p0: entered allmulticast mode
May 29 14:23:40 proxmox kernel: fwpr200p0: entered promiscuous mode
May 29 14:23:40 proxmox kernel: vmbr20: port 2(fwpr200p0) entered blocking state
May 29 14:23:40 proxmox kernel: vmbr20: port 2(fwpr200p0) entered forwarding state
May 29 14:23:40 proxmox kernel: fwbr200i0: port 1(fwln200i0) entered blocking state
May 29 14:23:40 proxmox kernel: fwbr200i0: port 1(fwln200i0) entered disabled state
May 29 14:23:40 proxmox kernel: fwln200i0: entered allmulticast mode
May 29 14:23:40 proxmox kernel: fwln200i0: entered promiscuous mode
May 29 14:23:40 proxmox kernel: fwbr200i0: port 1(fwln200i0) entered blocking state
May 29 14:23:40 proxmox kernel: fwbr200i0: port 1(fwln200i0) entered forwarding state
May 29 14:23:40 proxmox kernel: fwbr200i0: port 2(tap200i0) entered blocking state
May 29 14:23:40 proxmox kernel: fwbr200i0: port 2(tap200i0) entered disabled state
May 29 14:23:40 proxmox kernel: tap200i0: entered allmulticast mode
May 29 14:23:40 proxmox kernel: fwbr200i0: port 2(tap200i0) entered blocking state
May 29 14:23:40 proxmox kernel: fwbr200i0: port 2(tap200i0) entered forwarding state
May 29 14:23:40 proxmox kernel: tap200i1: entered promiscuous mode
May 29 14:23:40 proxmox kernel: vmbr50: port 2(fwpr200p1) entered blocking state
May 29 14:23:40 proxmox kernel: vmbr50: port 2(fwpr200p1) entered disabled state
May 29 14:23:40 proxmox kernel: fwpr200p1: entered allmulticast mode
May 29 14:23:40 proxmox kernel: fwpr200p1: entered promiscuous mode
May 29 14:23:40 proxmox kernel: vmbr50: port 2(fwpr200p1) entered blocking state
May 29 14:23:40 proxmox kernel: vmbr50: port 2(fwpr200p1) entered forwarding state
May 29 14:23:40 proxmox kernel: fwbr200i1: port 1(fwln200i1) entered blocking state
May 29 14:23:40 proxmox kernel: fwbr200i1: port 1(fwln200i1) entered disabled state
May 29 14:23:40 proxmox kernel: fwln200i1: entered allmulticast mode
May 29 14:23:40 proxmox kernel: fwln200i1: entered promiscuous mode
May 29 14:23:40 proxmox kernel: fwbr200i1: port 1(fwln200i1) entered blocking state
May 29 14:23:40 proxmox kernel: fwbr200i1: port 1(fwln200i1) entered forwarding state
May 29 14:23:40 proxmox kernel: fwbr200i1: port 2(tap200i1) entered blocking state
May 29 14:23:40 proxmox kernel: fwbr200i1: port 2(tap200i1) entered disabled state
May 29 14:23:40 proxmox kernel: tap200i1: entered allmulticast mode
May 29 14:23:40 proxmox kernel: fwbr200i1: port 2(tap200i1) entered blocking state
May 29 14:23:40 proxmox kernel: fwbr200i1: port 2(tap200i1) entered forwarding state
May 29 14:23:43 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
May 29 14:23:43 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 00000000-0000-0000-0000-000000000200 GPU PCI id 00:0d:00.4 config params vgpu_type_id=1155
May 29 14:23:43 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=1155
May 29 14:23:43 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_env_log: Successfully updated env symbols!
May 29 14:23:43 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): detected a VF at 0:d:0.4
May 29 14:23:43 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): gpu-pci-id : 0xd00
May 29 14:23:43 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): vgpu_type : Quadro
May 29 14:23:43 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): Framebuffer: 0x560000000
May 29 14:23:43 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): Virtual Device Id: 0x26b9:0x1893
May 29 14:23:43 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
May 29 14:23:43 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: ######## vGPU Manager Information: ########
May 29 14:23:43 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: Driver Version: 535.154.02
May 29 14:23:43 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): Detected ECC enabled on physical GPU.
May 29 14:23:43 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): Guest usable FB size is reduced due to ECC.
May 29 14:23:43 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0x120001)
May 29 14:23:43 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
May 29 14:23:43 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): vGPU migration enabled
May 29 14:23:44 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): vGPU manager is running in SRIOV with GSP mode.
May 29 14:23:44 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: display_init inst: 0 successful
May 29 14:23:44 proxmox kernel: [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000200: vGPU migration enabled with upstream V2 migration protocol
May 29 14:23:44 proxmox pveproxy[190265]: proxy detected vanished client connection
May 29 14:23:44 proxmox pvedaemon[158074]: <root@pam> end task UPID:proxmox:0002E7C2:00720934:66572C5B:qmstart:200:root@pam: OK
May 29 14:23:45 proxmox kernel: x86/split lock detection: #AC: CPU 7/KVM/190634 took a split_lock trap at address: 0x7eedd050
May 29 14:23:45 proxmox kernel: x86/split lock detection: #AC: CPU 3/KVM/190630 took a split_lock trap at address: 0x7eedd050
May 29 14:23:45 proxmox kernel: x86/split lock detection: #AC: CPU 5/KVM/190632 took a split_lock trap at address: 0x7eedd050
May 29 14:23:45 proxmox kernel: x86/split lock detection: #AC: CPU 6/KVM/190633 took a split_lock trap at address: 0x7eedd050
May 29 14:23:45 proxmox kernel: x86/split lock detection: #AC: CPU 1/KVM/190628 took a split_lock trap at address: 0x7eedd050
May 29 14:23:45 proxmox kernel: x86/split lock detection: #AC: CPU 4/KVM/190631 took a split_lock trap at address: 0x7eedd050
May 29 14:25:42 proxmox kernel: x86/split lock detection: #AC: CPU 0/KVM/190627 took a split_lock trap at address: 0xfffff803642e657f
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): Guest driver loaded.
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): Wiring up the functions for RPC version 23050000
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: ######## Guest NVIDIA Driver Information: ########
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: Driver Version: 538.15
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 14:26:50 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): Guest driver loaded.
May 29 14:26:50 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): Wiring up the functions for RPC version 23050000
May 29 14:26:50 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: ######## Guest NVIDIA Driver Information: ########
May 29 14:26:50 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: Driver Version: 538.15
May 29 14:26:50 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
May 29 14:27:21 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): vGPU license state: Licensed
 
Code:
root@proxmox:~# nvidia-smi -q -i b5:00.0

==============NVSMI LOG==============

Timestamp                                 : Wed May 29 14:31:08 2024
Driver Version                            : 535.154.02
CUDA Version                              : Not Found
vGPU Driver Capability
        Heterogenous Multi-vGPU           : Supported

Attached GPUs                             : 2
GPU 00000000:B5:00.0
    Product Name                          : NVIDIA L40S
    Product Brand                         : NVIDIA
    Product Architecture                  : Ada Lovelace
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : N/A
    vGPU Device Capability
        Fractional Multi-vGPU             : Supported
        Heterogeneous Time-Slice Profiles : Supported
        Heterogeneous Time-Slice Sizes    : Not Supported
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Enabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1324123033593
    GPU UUID                              : GPU-6cd1415b-3c88-c5e1-02c7-0c29b4037a0c
    Minor Number                          : 1
    VBIOS Version                         : 95.02.66.00.02
    MultiGPU Board                        : No
    Board ID                              : 0xb500
    Board Part Number                     : 900-2G133-0180-030
    GPU Part Number                       : 26B9-896-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G133.0242.00.03
        OEM Object                        : 2.1
        ECC Object                        : 6.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 535.154.02
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : SR-IOV
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0xB5
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x26B910DE
        Bus Id                            : 00000000:B5:00.0
        Sub System Id                     : 0x185110DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
                Device Current            : 1
                Device Max                : 4
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P8
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 46068 MiB
        Reserved                          : 1103 MiB
        Used                              : 0 MiB
        Free                              : 44964 MiB
    BAR1 Memory Usage
        Total                             : 65536 MiB
        Used                              : 1 MiB
        Free                              : 65535 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 192 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 32 C
        GPU T.Limit Temp                  : 55 C
        GPU Shutdown T.Limit Temp         : -5 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : N/A
    GPU Power Readings
        Power Draw                        : 37.51 W
        Current Power Limit               : 350.00 W
        Requested Power Limit             : 350.00 W
        Default Power Limit               : 350.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 350.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 1185 MHz
    Applications Clocks
        Graphics                          : 2520 MHz
        Memory                            : 9001 MHz
    Default Applications Clocks
        Graphics                          : 2520 MHz
        Memory                            : 9001 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2520 MHz
        SM                                : 2520 MHz
        Memory                            : 9001 MHz
        Video                             : 1965 MHz
    Max Customer Boost Clocks
        Graphics                          : 2520 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 875.000 mV
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes                             : None
 
Hmm, I have not seen that issue yet, but from the logs:
May 29 14:23:44 proxmox pvedaemon[158074]: <root@pam> end task UPID:proxmox:0002E7C2:00720934:66572C5B:qmstart:200:root@pam: OK
May 29 14:23:45 proxmox kernel: x86/split lock detection: #AC: CPU 7/KVM/190634 took a split_lock trap at address: 0x7eedd050
May 29 14:23:45 proxmox kernel: x86/split lock detection: #AC: CPU 3/KVM/190630 took a split_lock trap at address: 0x7eedd050
May 29 14:23:45 proxmox kernel: x86/split lock detection: #AC: CPU 5/KVM/190632 took a split_lock trap at address: 0x7eedd050
May 29 14:23:45 proxmox kernel: x86/split lock detection: #AC: CPU 6/KVM/190633 took a split_lock trap at address: 0x7eedd050
May 29 14:23:45 proxmox kernel: x86/split lock detection: #AC: CPU 1/KVM/190628 took a split_lock trap at address: 0x7eedd050
May 29 14:23:45 proxmox kernel: x86/split lock detection: #AC: CPU 4/KVM/190631 took a split_lock trap at address: 0x7eedd050
May 29 14:25:42 proxmox kernel: x86/split lock detection: #AC: CPU 0/KVM/190627 took a split_lock trap at address: 0xfffff803642e657f
May 29 14:26:17 proxmox nvidia-vgpu-mgr[190664]: notice: vmiop_log: (0x0): Guest driver loaded.
The only noticeable thing between the VM start and the first guest driver load is the split lock detection, so I'd guess there may be a connection.

This detects "bad" memory access and by default and "punishes" the processes by inserting 10ms sleeps which may be bad if that happens in the guest driver initialization
(we're currently working on a better solution, but none exist yet)

You can turn off the "punishment" by setting the following sysctl:

Code:
sysctl -w kernel.split_lock_mitigate=0

Note this is only temporary; after a reboot it's set to 1 again.
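If it turns out to help, you can make it persistent with a sysctl drop-in, e.g.:

Code:
# file name is just an example
echo 'kernel.split_lock_mitigate = 0' > /etc/sysctl.d/90-split-lock.conf
sysctl --system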

Would you mind disabling that before trying again?
 
Hello and thanks for your help!

I have run `sysctl -w kernel.split_lock_mitigate=0` successfully and started a VM from fully shut down.

See the syslog below; I did not notice any change in behavior - the same two-blue-screen sequence.


Code:
May 29 16:23:14 proxmox pvedaemon[208066]: <root@pam> starting task UPID:proxmox:00034477:007CFC0C:66574862:qmstart:200:root@pam:
May 29 16:23:14 proxmox pvedaemon[214135]: start VM 200: UPID:proxmox:00034477:007CFC0C:66574862:qmstart:200:root@pam:
May 29 16:23:14 proxmox kernel: nvidia-vgpu-vfio 00000000-0000-0000-0000-000000000200: Adding to iommu group 293
May 29 16:23:14 proxmox systemd[1]: Started 200.scope.
May 29 16:23:15 proxmox kernel: tap200i0: entered promiscuous mode
May 29 16:23:15 proxmox kernel: vmbr20: port 2(fwpr200p0) entered blocking state
May 29 16:23:15 proxmox kernel: vmbr20: port 2(fwpr200p0) entered disabled state
May 29 16:23:15 proxmox kernel: fwpr200p0: entered allmulticast mode
May 29 16:23:15 proxmox kernel: fwpr200p0: entered promiscuous mode
May 29 16:23:15 proxmox kernel: vmbr20: port 2(fwpr200p0) entered blocking state
May 29 16:23:15 proxmox kernel: vmbr20: port 2(fwpr200p0) entered forwarding state
May 29 16:23:15 proxmox kernel: fwbr200i0: port 1(fwln200i0) entered blocking state
May 29 16:23:15 proxmox kernel: fwbr200i0: port 1(fwln200i0) entered disabled state
May 29 16:23:15 proxmox kernel: fwln200i0: entered allmulticast mode
May 29 16:23:15 proxmox kernel: fwln200i0: entered promiscuous mode
May 29 16:23:15 proxmox kernel: fwbr200i0: port 1(fwln200i0) entered blocking state
May 29 16:23:15 proxmox kernel: fwbr200i0: port 1(fwln200i0) entered forwarding state
May 29 16:23:15 proxmox kernel: fwbr200i0: port 2(tap200i0) entered blocking state
May 29 16:23:15 proxmox kernel: fwbr200i0: port 2(tap200i0) entered disabled state
May 29 16:23:15 proxmox kernel: tap200i0: entered allmulticast mode
May 29 16:23:15 proxmox kernel: fwbr200i0: port 2(tap200i0) entered blocking state
May 29 16:23:15 proxmox kernel: fwbr200i0: port 2(tap200i0) entered forwarding state
May 29 16:23:15 proxmox kernel: tap200i1: entered promiscuous mode
May 29 16:23:16 proxmox kernel: vmbr50: port 3(fwpr200p1) entered blocking state
May 29 16:23:16 proxmox kernel: vmbr50: port 3(fwpr200p1) entered disabled state
May 29 16:23:16 proxmox kernel: fwpr200p1: entered allmulticast mode
May 29 16:23:16 proxmox kernel: fwpr200p1: entered promiscuous mode
May 29 16:23:16 proxmox kernel: vmbr50: port 3(fwpr200p1) entered blocking state
May 29 16:23:16 proxmox kernel: vmbr50: port 3(fwpr200p1) entered forwarding state
May 29 16:23:16 proxmox kernel: fwbr200i1: port 1(fwln200i1) entered blocking state
May 29 16:23:16 proxmox kernel: fwbr200i1: port 1(fwln200i1) entered disabled state
May 29 16:23:16 proxmox kernel: fwln200i1: entered allmulticast mode
May 29 16:23:16 proxmox kernel: fwln200i1: entered promiscuous mode
May 29 16:23:16 proxmox kernel: fwbr200i1: port 1(fwln200i1) entered blocking state
May 29 16:23:16 proxmox kernel: fwbr200i1: port 1(fwln200i1) entered forwarding state
May 29 16:23:16 proxmox kernel: fwbr200i1: port 2(tap200i1) entered blocking state
May 29 16:23:16 proxmox kernel: fwbr200i1: port 2(tap200i1) entered disabled state
May 29 16:23:16 proxmox kernel: tap200i1: entered allmulticast mode
May 29 16:23:16 proxmox kernel: fwbr200i1: port 2(tap200i1) entered blocking state
May 29 16:23:16 proxmox kernel: fwbr200i1: port 2(tap200i1) entered forwarding state
May 29 16:23:16 proxmox pvedaemon[208066]: <root@pam> starting task UPID:proxmox:00034561:007CFCC4:66574864:vncshell::root@pam:
May 29 16:23:16 proxmox pvedaemon[214369]: starting termproxy UPID:proxmox:00034561:007CFCC4:66574864:vncshell::root@pam:
May 29 16:23:16 proxmox pvedaemon[208066]: <root@pam> successful auth for user 'root@pam'
May 29 16:23:16 proxmox login[214374]: pam_unix(login:session): session opened for user root(uid=0) by root(uid=0)
May 29 16:23:16 proxmox systemd-logind[1424]: New session 64 of user root.
May 29 16:23:16 proxmox systemd[1]: Started session-64.scope - Session 64 of User root.
May 29 16:23:16 proxmox login[214379]: ROOT LOGIN  on '/dev/pts/0'
May 29 16:23:17 proxmox systemd[1]: session-64.scope: Deactivated successfully.
May 29 16:23:17 proxmox systemd-logind[1424]: Session 64 logged out. Waiting for processes to exit.
May 29 16:23:17 proxmox systemd-logind[1424]: Removed session 64.
May 29 16:23:17 proxmox pvedaemon[208066]: <root@pam> end task UPID:proxmox:00034561:007CFCC4:66574864:vncshell::root@pam: OK
May 29 16:23:18 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
May 29 16:23:18 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 00000000-0000-0000-0000-000000000200 GPU PCI id 00:0d:00.4 config params vgpu_type_id=1155
May 29 16:23:18 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=1155
May 29 16:23:18 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_env_log: Successfully updated env symbols!
May 29 16:23:18 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): detected a VF at 0:d:0.4
May 29 16:23:18 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): gpu-pci-id : 0xd00
May 29 16:23:18 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): vgpu_type : Quadro
May 29 16:23:18 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): Framebuffer: 0x560000000
May 29 16:23:18 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): Virtual Device Id: 0x26b9:0x1893
May 29 16:23:18 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
May 29 16:23:18 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: ######## vGPU Manager Information: ########
May 29 16:23:18 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: Driver Version: 535.154.02
May 29 16:23:18 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): Detected ECC enabled on physical GPU.
May 29 16:23:18 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): Guest usable FB size is reduced due to ECC.
May 29 16:23:18 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0x120001)
May 29 16:23:19 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
May 29 16:23:19 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): vGPU migration enabled
May 29 16:23:19 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): vGPU manager is running in SRIOV with GSP mode.
May 29 16:23:19 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: display_init inst: 0 successful
May 29 16:23:19 proxmox kernel: [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000200: vGPU migration enabled with upstream V2 migration protocol
May 29 16:23:20 proxmox pvedaemon[208066]: <root@pam> end task UPID:proxmox:00034477:007CFC0C:66574862:qmstart:200:root@pam: OK
May 29 16:23:20 proxmox pveproxy[210691]: proxy detected vanished client connection
May 29 16:23:20 proxmox kernel: x86/split lock detection: #AC: CPU 6/KVM/214365 took a split_lock trap at address: 0x7eedd050
May 29 16:23:20 proxmox kernel: x86/split lock detection: #AC: CPU 2/KVM/214361 took a split_lock trap at address: 0x7eedd050
May 29 16:23:28 proxmox systemd[1]: Stopping user@0.service - User Manager for UID 0...
May 29 16:23:28 proxmox systemd[213791]: Activating special unit exit.target...
May 29 16:23:28 proxmox systemd[213791]: Stopped target default.target - Main User Target.
May 29 16:23:28 proxmox systemd[213791]: Stopped target basic.target - Basic System.
May 29 16:23:28 proxmox systemd[213791]: Stopped target paths.target - Paths.
May 29 16:23:28 proxmox systemd[213791]: Stopped target sockets.target - Sockets.
May 29 16:23:28 proxmox systemd[213791]: Stopped target timers.target - Timers.
May 29 16:23:28 proxmox systemd[213791]: Closed dirmngr.socket - GnuPG network certificate management daemon.
May 29 16:23:28 proxmox systemd[213791]: Closed gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
May 29 16:23:28 proxmox systemd[213791]: Closed gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
May 29 16:23:28 proxmox systemd[213791]: Closed gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
May 29 16:23:28 proxmox systemd[213791]: Closed gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
May 29 16:23:28 proxmox systemd[213791]: Removed slice app.slice - User Application Slice.
May 29 16:23:28 proxmox systemd[213791]: Reached target shutdown.target - Shutdown.
May 29 16:23:28 proxmox systemd[213791]: Finished systemd-exit.service - Exit the Session.
May 29 16:23:28 proxmox systemd[213791]: Reached target exit.target - Exit the Session.
May 29 16:23:28 proxmox systemd[1]: user@0.service: Deactivated successfully.
May 29 16:23:28 proxmox systemd[1]: Stopped user@0.service - User Manager for UID 0.
May 29 16:23:28 proxmox systemd[1]: Stopping user-runtime-dir@0.service - User Runtime Directory /run/user/0...
May 29 16:23:28 proxmox systemd[1]: run-user-0.mount: Deactivated successfully.
May 29 16:23:28 proxmox systemd[1]: user-runtime-dir@0.service: Deactivated successfully.
May 29 16:23:28 proxmox systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.
May 29 16:23:28 proxmox systemd[1]: Removed slice user-0.slice - User Slice of UID 0.
May 29 16:24:51 proxmox kernel: x86/split lock detection: #AC: CPU 0/KVM/214359 took a split_lock trap at address: 0xfffff8040984b9dd
May 29 16:25:40 proxmox kernel: x86/split lock detection: #AC: CPU 4/KVM/214363 took a split_lock trap at address: 0x7eedd050
May 29 16:25:51 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): Guest driver loaded.
May 29 16:25:51 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): Wiring up the functions for RPC version 23050000
May 29 16:25:51 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: ######## Guest NVIDIA Driver Information: ########
May 29 16:25:51 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: Driver Version: 538.15
May 29 16:25:51 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
May 29 16:25:51 proxmox nvidia-vgpu-mgr[214387]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 16:25:51 proxmox nvidia-vgpu-mgr[214387]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 16:25:51 proxmox nvidia-vgpu-mgr[214387]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 16:25:51 proxmox nvidia-vgpu-mgr[214387]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 16:25:51 proxmox nvidia-vgpu-mgr[214387]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 16:25:51 proxmox nvidia-vgpu-mgr[214387]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 16:25:51 proxmox nvidia-vgpu-mgr[214387]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 16:25:51 proxmox nvidia-vgpu-mgr[214387]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 16:25:51 proxmox nvidia-vgpu-mgr[214387]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 16:25:51 proxmox nvidia-vgpu-mgr[214387]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 16:25:51 proxmox nvidia-vgpu-mgr[214387]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 16:25:51 proxmox nvidia-vgpu-mgr[214387]: error: vmiop_log: (0x0): VGPU message 0 failed, result code: 0xff100002
May 29 16:26:13 proxmox kernel: x86/split lock detection: #AC: CPU 5/KVM/214364 took a split_lock trap at address: 0x7eedd050
May 29 16:26:24 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): Guest driver loaded.
May 29 16:26:24 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): Wiring up the functions for RPC version 23050000
May 29 16:26:24 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: ######## Guest NVIDIA Driver Information: ########
May 29 16:26:24 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: Driver Version: 538.15
May 29 16:26:24 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
May 29 16:26:55 proxmox nvidia-vgpu-mgr[214387]: notice: vmiop_log: (0x0): vGPU license state: Licensed
 
See the syslog below; I did not notice any change in behavior - the same two-blue-screen sequence.
OK, well it was worth a shot anyway.


Maybe you're on the right track with
It appears as if the L40S needs to be somewhat powered on before the VM boots.
in that it may be a power issue?

Do you have any options in your BIOS for PCIe power management? (e.g. it may be named ASPM or similar)
If yes, can you change the value from what it is currently? You can also disable it on the kernel command line with `pcie_aspm=off`.
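For example, on a stock GRUB-based install that would look roughly like the following (if you boot via systemd-boot, the parameter goes into /etc/kernel/cmdline instead and you apply it with `proxmox-boot-tool refresh`):

Code:
# /etc/default/grub - append the option to the existing line, e.g.:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off"
# then apply and reboot
update-grub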

Also, can you provide the output of nvidia-smi immediately before starting the first VM? (the output of `dmesg` could maybe also help)

And do the logs inside the guest say anything?

Another thing to test could be to install a Linux guest VM, just to see whether it behaves differently (the dmesg/kernel logs may give more hints there than the Windows bluescreen).

EDIT: added another point for ASPM
 
Hey,
Thanks for all the pointers! I've been keeping an eye on my current solution, and it seems to be behaving.

But some answers for you first.

See attached nvidia-smi -q with VMs off.

My hypervisor does have ASPM settings in the BIOS; it was set to Disabled. I did try changing that, but it did not make a difference. As you can imagine, disabling ASPM on the kernel command line also did not make a difference.

I was able to export a dump file from the Windows VM, see attached.

With not much else to go on, I kept researching this DRIVER_POWER_STATE_FAILURE error.

A few people reported that disabling fast start and hibernation in Windows did the trick.

I've done both (this guide here for reference) and all my VMs now boot wonderfully!
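For reference, the command-line equivalent is roughly the following - disabling hibernation also disables fast startup, since fast startup depends on it (run from an elevated prompt inside the guest; the registry value is the explicit fast-startup toggle):

Code:
:: elevated command prompt inside the Windows guest
powercfg /hibernate off
:: optionally switch fast startup off explicitly via the registry
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Power" /v HiberbootEnabled /t REG_DWORD /d 0 /f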
 

Attachments

  • nvidiasmi with VMs off.zip
  • windows VM dump file.zip
A few people reported that disabling fast start and hibernation in Windows did the trick.

I've done both (this guide here for reference) and all my VMs now boot wonderfully!
Great, then it seems that maybe Windows just expects the device to be in a different power state for its 'fast start and hibernation', and that trips up their kernel/driver?
 
Hi there!

Jumping into this conversation. I see you use the L40S and got vGPU working. I have been trying for a long time to get this to work, but vGPU does not work for me in Proxmox. Which installation guide did you follow to set up vGPU on the L40S?

I used the Polloloco guide (https://gitlab.com/polloloco/vgpu-proxmox) to no avail in both Proxmox 7.4 and 8.2.
 
