I've got a home server with the following hardware:
- AMD Ryzen 1600
- Gigabyte A320M-S2H
- 32 GB RAM
- Nvidia GT 710
- some SSDs and HDDs
It took me months to pinpoint the issue, but I found that after I switched from the GT 710 to an RX 570 the server didn't crash for more than a month. Yesterday I switched back to the GT 710, and after barely 18 hours the server crashed. When the server crashes there is no log entry and no output from the GPU; sometimes even the Ethernet port lights go out.
I tried the GT 710 on Windows and it worked without any issues.
I also tried blacklisting the Nvidia driver, but it didn't help.
I tried blacklisting the PCI ID (typically done for GPU passthrough), but that didn't work either.
I also tried swapping some components around, and that's how I noticed that the server ran without issues with the RX 570.
More info:
During the crash I was only running some CTs; all VMs were shut down, so the load was really low.
I'm using a single boot disk (860 EVO SSD) with ZFS, and a scrub didn't find any corruption.
Months ago I also tried reinstalling everything, downclocking the RAM, and memtesting it.
This is my first post on this forum, so let me know if any info is missing.
Also, the hardware I'm using isn't enterprise-grade, so if this post is not allowed I will remove it ASAP.
Here's dmesg (08:00 is the GPU)
Code:
root@pve:/var/log# dmesg | grep 08:00
[ 0.176894] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
[ 0.180098] acpi PNP0A08:00: _OSC: platform does not support [SHPCHotplug LTR]
[ 0.180276] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[ 0.180288] acpi PNP0A08:00: [Firmware Info]: MMCONFIG for domain 0000 [bus 00-3f] only partially covers this bridge
[ 0.186247] pci 0000:08:00.0: [10de:128b] type 00 class 0x030000
[ 0.186262] pci 0000:08:00.0: reg 0x10: [mem 0xf6000000-0xf6ffffff]
[ 0.186271] pci 0000:08:00.0: reg 0x14: [mem 0xe8000000-0xefffffff 64bit pref]
[ 0.186280] pci 0000:08:00.0: reg 0x1c: [mem 0xf0000000-0xf1ffffff 64bit pref]
[ 0.186287] pci 0000:08:00.0: reg 0x24: [io 0xf000-0xf07f]
[ 0.186293] pci 0000:08:00.0: reg 0x30: [mem 0xf7000000-0xf707ffff pref]
[ 0.186305] pci 0000:08:00.0: BAR 3: assigned to efifb
[ 0.186378] pci 0000:08:00.1: [10de:0e0f] type 00 class 0x040300
[ 0.186391] pci 0000:08:00.1: reg 0x10: [mem 0xf7080000-0xf7083fff]
[ 0.188711] pci 0000:08:00.0: vgaarb: setting as boot VGA device
[ 0.188711] pci 0000:08:00.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
[ 0.188711] pci 0000:08:00.0: vgaarb: bridge control possible
[ 0.209706] pci 0000:08:00.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[ 0.209717] pci 0000:08:00.1: D0 power state depends on 0000:08:00.0
[ 0.798174] pci 0000:08:00.0: Adding to iommu group 15
[ 0.798200] pci 0000:08:00.1: Adding to iommu group 15
[ 9.622464] snd_hda_intel 0000:08:00.1: Disabling MSI
[ 9.622475] snd_hda_intel 0000:08:00.1: Handle vga_switcheroo audio client
[ 10.784476] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:08:00.1/sound/card0/input8
[ 10.784553] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:08:00.1/sound/card0/input9
[ 10.784592] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:08:00.1/sound/card0/input10
[ 10.784626] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:08:00.1/sound/card0/input11
[ 10.784660] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:03.1/0000:08:00.1/sound/card0/input12
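For reference, the "PCI ID" mentioned above is the vendor:device pair that the kernel prints in square brackets when it enumerates a device. A minimal sketch, assuming dmesg lines in the format shown above, of how those pairs can be pulled out; the function name `pci_ids` and the regular expression are illustrative only, not part of any tool:

```python
import re

# Matches kernel enumeration lines such as:
#   pci 0000:08:00.0: [10de:128b] type 00 class 0x030000
# (pattern written for the dmesg format shown above; an assumption, not a spec)
ID_LINE = re.compile(r"pci (\S+): \[([0-9a-f]{4}):([0-9a-f]{4})\] type")

def pci_ids(dmesg_lines):
    """Map PCI address -> 'vendor:device' for every enumeration line found."""
    ids = {}
    for line in dmesg_lines:
        m = ID_LINE.search(line)
        if m:
            ids[m.group(1)] = f"{m.group(2)}:{m.group(3)}"
    return ids

sample = [
    "[ 0.186247] pci 0000:08:00.0: [10de:128b] type 00 class 0x030000",
    "[ 0.186262] pci 0000:08:00.0: reg 0x10: [mem 0xf6000000-0xf6ffffff]",
    "[ 0.186378] pci 0000:08:00.1: [10de:0e0f] type 00 class 0x040300",
]
found = pci_ids(sample)
print(",".join(found.values()))
```

With the three sample lines (copied from the dmesg output above) this yields one ID for the GPU function (08:00.0) and one for its HDMI audio function (08:00.1).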
syslog (I manually reset the computer at around 15:36:00; the last entry before that is from 15:29:01)
Code:
May 25 15:24:01 pve systemd[1]: Started Proxmox VE replication runner.
May 25 15:25:00 pve systemd[1]: Starting Proxmox VE replication runner...
May 25 15:25:01 pve systemd[1]: pvesr.service: Succeeded.
May 25 15:25:01 pve systemd[1]: Started Proxmox VE replication runner.
May 25 15:26:00 pve systemd[1]: Starting Proxmox VE replication runner...
May 25 15:26:01 pve systemd[1]: pvesr.service: Succeeded.
May 25 15:26:01 pve systemd[1]: Started Proxmox VE replication runner.
May 25 15:27:00 pve systemd[1]: Starting Proxmox VE replication runner...
May 25 15:27:01 pve systemd[1]: pvesr.service: Succeeded.
May 25 15:27:01 pve systemd[1]: Started Proxmox VE replication runner.
May 25 15:28:00 pve systemd[1]: Starting Proxmox VE replication runner...
May 25 15:28:01 pve systemd[1]: pvesr.service: Succeeded.
May 25 15:28:01 pve systemd[1]: Started Proxmox VE replication runner.
May 25 15:29:00 pve systemd[1]: Starting Proxmox VE replication runner...
May 25 15:29:01 pve systemd[1]: pvesr.service: Succeeded.
May 25 15:29:01 pve systemd[1]: Started Proxmox VE replication runner.
May 25 15:36:21 pve systemd-modules-load[912]: Inserted module 'overlay'
May 25 15:36:21 pve dmeventd[928]: dmeventd ready for processing.
May 25 15:36:21 pve systemd-modules-load[912]: Inserted module 'aufs'
May 25 15:36:21 pve lvm[928]: Monitoring thin pool data--nvme-data--nvme-tpool.
May 25 15:36:21 pve systemd-modules-load[912]: Inserted module 'vfio'
May 25 15:36:21 pve systemd-modules-load[912]: Inserted module 'vfio_pci'
May 25 15:36:21 pve systemd-modules-load[912]: Inserted module 'iscsi_tcp'
May 25 15:36:21 pve systemd-modules-load[912]: Inserted module 'ib_iser'
May 25 15:36:21 pve systemd-modules-load[912]: Inserted module 'vhost_net'
May 25 15:36:21 pve systemd[1]: Starting Flush Journal to Persistent Storage...
May 25 15:36:21 pve systemd[1]: Started Flush Journal to Persistent Storage.
May 25 15:36:21 pve lvm[911]: 6 logical volume(s) in volume group "data-nvme" monitored
May 25 15:36:21 pve systemd[1]: Started udev Coldplug all Devices.
May 25 15:36:21 pve systemd[1]: Starting udev Wait for Complete Device Initialization...
May 25 15:36:21 pve systemd[1]: Starting Helper to synchronize boot up for ifupdown...
May 25 15:36:21 pve systemd-udevd[974]: Using default interface naming scheme 'v240'.
May 25 15:36:21 pve systemd-udevd[993]: Using default interface naming scheme 'v240'.
May 25 15:36:21 pve systemd-udevd[974]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
May 25 15:36:21 pve systemd-udevd[993]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
May 25 15:36:21 pve systemd-udevd[982]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
May 25 15:36:21 pve systemd[1]: Started Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
May 25 15:36:21 pve systemd[1]: Found device WDC_WD40EZAZ-00SF3B0 1.
May 25 15:36:21 pve systemd[1]: Reached target Sound Card.
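Since the machine hangs without logging anything, the most useful thing in the syslog is the silent gap itself. A rough sketch, assuming the classic syslog timestamp prefix shown above, that finds the largest gap between consecutive entries; the function name `largest_gap` is made up for illustration, and the year is an assumption because syslog timestamps do not carry one:

```python
from datetime import datetime

def largest_gap(lines, year=2021):
    """Return (gap_seconds, last_line_before_gap, first_line_after_gap).

    `year` is an assumed value: the syslog prefix has no year field.
    Returns None if fewer than two timestamped lines are found.
    """
    stamped = []
    for line in lines:
        try:
            # The prefix "May 25 15:29:01" is exactly 15 characters long.
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        stamped.append((ts, line.rstrip("\n")))

    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best

sample = [
    "May 25 15:29:00 pve systemd[1]: Starting Proxmox VE replication runner...",
    "May 25 15:29:01 pve systemd[1]: Started Proxmox VE replication runner.",
    "May 25 15:36:21 pve systemd-modules-load[912]: Inserted module 'overlay'",
]
gap, before, after = largest_gap(sample)
print(f"{gap:.0f} s of silence after: {before}")
```

On the excerpt above the gap is 440 seconds, between the 15:29:01 entry and the first boot message at 15:36:21. The replication runner normally logs every minute, so the missing 15:30:00 entry suggests the hang happened within a minute of 15:29:01.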
Code:
root@pve:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.4-5 (running version: 6.4-5/6c7bf5de)
pve-kernel-5.4: 6.4-1
pve-kernel-helper: 6.4-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-2
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-1
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-3
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1