GPU Pass-through causing nightly reboots

mustava · Mar 2, 2023

Hi All,
I have finally diagnosed the root cause of my long-running, persistent nightly reboots - it seems that my GPU pass-through is the culprit.
My server worked fine with GPU pass-through for months; I think the issue started to appear when I upgraded to 7.x.

I have a 1660S that is passed through to a VM for use by Docker containers. The pass-through itself works perfectly (excluding the reboots); however, after disabling it and changing the settings back to stock, the server has stayed up for two days (woah) without issue.
One odd thing is that the server still crashes/reboots even when the GPU is not in use by the VM (VM not booted).

I don't get any error messages in syslog or dmesg that I am aware of.

My config is as follows (I have tried every combination in-between here)

Hardware:
CPU: Intel 7700k
Mobo: ASRock Fatal1ty Z270 Gaming K6
GPU: MSI Nvidia 1660S

lspci -s 01:00

Code:

01:00.0 VGA compatible controller: NVIDIA Corporation TU116 [GeForce GTX 1660 SUPER] (rev a1)
01:00.1 Audio device: NVIDIA Corporation TU116 High Definition Audio Controller (rev a1)
01:00.2 USB controller: NVIDIA Corporation TU116 USB 3.1 Host Controller (rev a1)
01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU116 USB Type-C UCSI Controller (rev a1)

Vendor:device IDs

Code:

10de:21c4,10de:1aeb,10de:1aec,10de:1aed

Code:

01:00.0 0300: 10de:21c4 (rev a1)
01:00.1 0403: 10de:1aeb (rev a1)
01:00.2 0c03: 10de:1aec (rev a1)
01:00.3 0c80: 10de:1aed (rev a1)
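
The comma-separated ID list above can be derived from the numeric listing instead of being typed by hand. The following is a minimal Python sketch, assuming the listing was produced by `lspci -n -s 01:00`; the function name `vfio_ids` and the `SAMPLE` constant are made up for illustration.

```python
import re

# Sample numeric listing, copied from the output shown above
# (assumed to come from `lspci -n -s 01:00`).
SAMPLE = """\
01:00.0 0300: 10de:21c4 (rev a1)
01:00.1 0403: 10de:1aeb (rev a1)
01:00.2 0c03: 10de:1aec (rev a1)
01:00.3 0c80: 10de:1aed (rev a1)
"""

# One line looks like: "<slot> <class>: <vendor>:<device> (rev xx)"
LINE_RE = re.compile(
    r"^\S+\s+[0-9a-f]{4}:\s+([0-9a-f]{4}:[0-9a-f]{4})\b", re.IGNORECASE
)


def vfio_ids(lspci_n_output: str) -> str:
    """Return the vendor:device pairs as a comma-separated string."""
    ids = []
    for line in lspci_n_output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            pair = match.group(1).lower()
            if pair not in ids:  # keep order, skip duplicates
                ids.append(pair)
    return ",".join(ids)


print(vfio_ids(SAMPLE))
```

The resulting string is what goes after `ids=` in the `options vfio-pci` line; IDs for other passed-through devices still have to be added separately.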

/etc/default/grub (I have tried minimal configs here also)

Code:

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt video=efifb:off video=vesa:off"

/etc/modprobe.d/vfio.conf

Code:

options vfio-pci ids=1b4b:9230,1b21:2142,10de:21c4,10de:1aeb,10de:1aec,10de:1aed disable_vga=1

/etc/modprobe.d/pve-blacklist.conf

Code:

blacklist nvidiafb
blacklist radeon
blacklist nouveau
blacklist nvidia
blacklist i2c_nvidia_gpu

Any help or advice would be hugely appreciated.

mustava · Mar 8, 2023

Bump! I am desperate!!

mustava · Mar 8, 2023

Latest logs from the most recent reboot - it didn't even last 3 hours!

/var/log/syslog

Code:

Mar 08 11:12:16 pve kernel: perf: interrupt took too long (3956 > 3947), lowering kernel.perf_event_max_sample_rate to 50500
Mar 08 11:12:23 pve postfix/pickup[88305]: warning: /etc/postfix/main.cf, line 24: overriding earlier entry: relayhost=
Mar 08 11:17:01 pve CRON[89068]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Mar 08 11:17:01 pve CRON[89069]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Mar 08 11:17:01 pve CRON[89068]: pam_unix(cron:session): session closed for user root
Mar 08 11:21:41 pve pvedaemon[2981]: <root@pam> successful auth for user 'root@pam'
Mar 08 11:22:27 pve smartd[2622]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 62
Mar 08 11:22:27 pve smartd[2622]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 38
Mar 08 11:22:27 pve smartd[2622]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 120 to 119
Mar 08 11:22:33 pve smartd[2622]: Device: /dev/sdh [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 70 to 69
Mar 08 11:22:33 pve smartd[2622]: Device: /dev/sdh [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 30 to 31
Mar 08 11:36:42 pve pvedaemon[2983]: <root@pam> successful auth for user 'root@pam'
Mar 08 11:51:42 pve pvedaemon[2983]: <root@pam> successful auth for user 'root@pam'
Mar 08 11:52:28 pve smartd[2622]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 69 to 68
Mar 08 11:52:28 pve smartd[2622]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 31 to 32
Mar 08 11:52:38 pve smartd[2622]: Device: /dev/sdf [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 62
Mar 08 11:52:38 pve smartd[2622]: Device: /dev/sdf [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 38
Mar 08 11:58:39 pve pveproxy[2989]: worker 83946 finished
Mar 08 11:58:39 pve pveproxy[2989]: starting 1 worker(s)
Mar 08 11:58:39 pve pveproxy[2989]: worker 95888 started
Mar 08 11:58:43 pve pveproxy[95887]: got inotify poll request in wrong process - disabling inotify
-- Reboot --

dmesg

Code:

[26097.840539] vmbr0v4: port 4(fwpr105p1) entered forwarding state
[26097.846694] fwbr105i1: port 1(fwln105i1) entered blocking state
[26097.846697] fwbr105i1: port 1(fwln105i1) entered disabled state
[26097.846737] device fwln105i1 entered promiscuous mode
[26097.846763] fwbr105i1: port 1(fwln105i1) entered blocking state
[26097.846764] fwbr105i1: port 1(fwln105i1) entered forwarding state
[26097.852897] fwbr105i1: port 2(tap105i1) entered blocking state
[26097.852899] fwbr105i1: port 2(tap105i1) entered disabled state
[26097.852951] fwbr105i1: port 2(tap105i1) entered blocking state
[26097.852952] fwbr105i1: port 2(tap105i1) entered forwarding state
[26100.174620] pcieport 0000:00:1c.6: Intel SPT PCH root port ACS workaround enabled
[26100.222614] pcieport 0000:00:1c.4: Intel SPT PCH root port ACS workaround enabled
[26101.262760] vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x19@0x200
[26101.262766] vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x1e@0x400
[26101.398722] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[26101.398740] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[26101.430625] vfio-pci 0000:01:00.1: enabling device (0000 -> 0002)
[26101.454621] vfio-pci 0000:01:00.3: enabling device (0000 -> 0002)
[26136.984719] kvm [76987]: ignored rdmsr: 0x10f data 0x0
[26136.984749] kvm [76987]: ignored rdmsr: 0x123 data 0x0
[26136.984763] kvm [76987]: ignored rdmsr: 0xc0011020 data 0x0
[26137.190391] usb 1-8: reset high-speed USB device number 3 using xhci_hcd
[27891.775204] perf: interrupt took too long (2512 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[28444.018611] perf: interrupt took too long (3158 > 3140), lowering kernel.perf_event_max_sample_rate to 63250
[30022.025506] perf: interrupt took too long (3956 > 3947), lowering kernel.perf_event_max_sample_rate to 50500

dcsapak · Mar 8, 2023

hi,

mustava said:
I have a 1660S that is passed through to a VM for use by Docker containers. The pass-through itself works perfectly (excluding the reboots); however, after disabling it and changing the settings back to stock, the server has stayed up for two days (woah) without issue.
One odd thing is that the server still crashes/reboots even when the GPU is not in use by the VM (VM not booted).

this indicates that the pass through is not the problem, since when the vm is not booted, it's the same as when it's not configured to be passed through at all...
i'd run some hardware diagnostics (e.g. memtest, maybe cpu stress test)

maybe the correlation with the passthrough was just an accident? also you could try a different kernel, but spontaneous resets without any log are most often hardware problems (cpu/memory/psu/etc.)

mustava · Mar 13, 2023

dcsapak said:
hi,

this indicates that the pass through is not the problem, since when the vm is not booted, it's the same as when it's not configured to be passed through at all...
i'd run some hardware diagnostics (e.g. memtest, maybe cpu stress test)

maybe the correlation with the passthrough was just an accident? also you could try a different kernel, but spontaneous resets without any log are most often hardware problems (cpu/memory/psu/etc.)

Took your advice and ran some more testing.
With GPU pass-through disabled all together, the host is perfectly fine - it's been up for 3 days running various stress tests and workloads. The PSU is brand new and is rated much more than the server needs (850W for 7700k and 8 drives).
Memtest shows all clear. I tried two different sticks, with the same result.

What's interesting is that I also tried passing through the iGPU instead of the Nvidia GPU, which also caused the host to crash nightly in the same way.
I know the iGPU works fine with this workload, as I ran it for years with ESXi, but I changed to Proxmox because I wanted to use my Nvidia GPU for CUDA-related things.

Any advice would be appreciated

dcsapak · Mar 13, 2023

mustava said:
With GPU pass-through disabled all together, the host is perfectly fine - it's been up for 3 days running various stress tests and workloads.

what do you mean exactly with 'disabled all together' ? did you just not start the vms, or did you reconfigure something else too (e.g. disabled iommu)?

mustava said:
The PSU is brand new and is rated much more than the server needs (850W for 7700k and 8 drives).

"brand new" is not a guarantee for "not defective", and i have seen all sorts of weird issues that were caused by the psu

mustava said:
What's interesting is that I also tried passing through the iGPU instead of the Nvidia GPU, which also caused the host to crash nightly in the same way.
I know the iGPU works fine with this workload, as I ran it for years with ESXi, but I changed to Proxmox because I wanted to use my Nvidia GPU for CUDA-related things.

of course it's possible that passthrough on linux works differently than on esxi and that it causes the crash, but it's impossible to tell without any logs

are you sure that e.g. the devices are in separate iommu groups?

mustava · Mar 13, 2023

dcsapak said:
what do you mean exactly with 'disabled all together' ? did you just not start the vms, or did you reconfigure something else too (e.g. disabled iommu)?

Removed the device IDs from vfio.conf, took the drivers out of the blacklist and removed the PCI device from the VM.
The VM is sitting happy at the moment. When I stress test the VM with the GPU enabled, it works perfectly until it crashes (usually after about 12 hours?)

dcsapak said:
"brand new" is not a guarantee for "not defective", and i have seen all sorts of weird issues that were caused by the psu

I would agree, but in this case peak power draw would be when all the drives spin up at VM launch, plus when the GPU is under load, which seems to work fine for many hours as per above. I think it would be highly improbable for a PSU fault to present itself as 'fail after steady load at consistent intervals of around 12 hours'. I would expect volatile failures based on load?

dcsapak said:
are you sure that e.g. the devices are in separate iommu groups?

Fairly sure, everything looks fine to me? For GPU I passed :01:00.0 and ticked 'all functions'
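
One way to double-check the grouping is to read it straight from sysfs rather than eyeballing it. The following is a minimal Python sketch, assuming the standard Linux layout `/sys/kernel/iommu_groups/<n>/devices/<address>`; the helper names `iommu_groups`, `group_of` and `other_slots_in_group` are hypothetical.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses in it."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        return groups  # IOMMU disabled, or sysfs not available here
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(
                entry.name for entry in devices_dir.iterdir()
            )
    return groups


def group_of(address, groups):
    """Return the group number containing `address`, or None."""
    for number, devices in groups.items():
        if address in devices:
            return number
    return None


def other_slots_in_group(address, groups):
    """List devices sharing the group that are NOT functions of the same slot."""
    number = group_of(address, groups)
    if number is None:
        return []
    slot = address.rsplit(".", 1)[0]  # "0000:01:00.0" -> "0000:01:00"
    return [dev for dev in groups[number] if dev.rsplit(".", 1)[0] != slot]
```

For the GPU in this thread, `other_slots_in_group("0000:01:00.0", iommu_groups())` would list anything outside slot `01:00` that shares its group; an empty list means the four GPU functions are isolated from other slots.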

dcsapak · Mar 14, 2023

mustava said:
Removed the device IDs from vfio.conf, took the drivers out of the blacklist and removed the PCI device from the VM.

and if you really just remove the gpu from the vm config (so leaving the vfio.conf and driver blacklist in place) ?
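
To tell these configurations apart on the host, it can help to record which kernel driver currently owns each GPU function (for example `vfio-pci` versus a vendor driver, or none). The following is a minimal Python sketch, assuming the usual sysfs layout where `/sys/bus/pci/devices/<address>/driver` is a symlink to the bound driver; the function name `bound_driver` is hypothetical.

```python
import os


def bound_driver(address, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the driver bound to a PCI device, or None.

    `address` is the full PCI address including the domain,
    e.g. "0000:01:00.0" as printed in dmesg.
    """
    link = os.path.join(sysfs_root, address, "driver")
    if not os.path.islink(link):
        return None  # device missing, or no driver bound
    # The symlink target ends in the driver name, e.g. ".../drivers/vfio-pci"
    return os.path.basename(os.readlink(link))
```

Calling `bound_driver("0000:01:00.0")` through `bound_driver("0000:01:00.3")` before and after each configuration change gives a simple record of which state the host was actually in when a crash occurred.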

mustava said:
I would agree, but in this case peak power draw would be when all the drives spin up at VM launch, plus when the GPU is under load, which seems to work fine for many hours as per above. I think it would be highly improbable for a PSU fault to present itself as 'fail after steady load at consistent intervals of around 12 hours'. I would expect volatile failures based on load?

i'm inclined to agree with you, but there has been *very* weird behaviour with faulty psus, that includes transient failures at certain temperatures etc...
(i am not saying the psu is faulty, but you can really only rule it out when you tried a different one and the error still happens)

mustava said:
Fairly sure, everything looks fine to me? For GPU I passed :01:00.0 and ticked 'all functions'

yes that looks ok

not really sure what to look at next, since there are no real errors anywhere. i guess you could set up a remote syslog and see if that logs more before a reset
(there should be plenty of guides on how to do that on the internet, just search 'remote syslog on debian bullseye' )
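
Once a remote receiver is in place, it is worth confirming that messages actually arrive before waiting for the next crash. The following is a minimal Python sketch using the standard library's `logging.handlers.SysLogHandler` over UDP; it assumes the receiver listens on UDP port 514, and the function name `send_test_message` and the logger name are made up for illustration.

```python
import logging
import logging.handlers
import socket


def send_test_message(host, port=514, text="remote syslog test from pve"):
    """Send one syslog message over UDP to the given receiver."""
    handler = logging.handlers.SysLogHandler(
        address=(host, port), socktype=socket.SOCK_DGRAM
    )
    logger = logging.getLogger("remote-syslog-test")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # do not echo to the root logger
    logger.addHandler(handler)
    try:
        logger.info(text)
    finally:
        # Detach and close so repeated calls do not stack handlers
        logger.removeHandler(handler)
        handler.close()
```

Calling `send_test_message("<receiver address>")` from the Proxmox host and then looking for the text in the receiver's log shows whether the path works. UDP gives no delivery confirmation, so the check has to happen on the receiving side.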


mustava

New Member


dcsapak

Proxmox Staff Member
