PCIe/NVMe Server Crash

TheBull369

Jan 15, 2021
Hello,

I am very new to Proxmox and loving it so far, but after running for about a week I have had 2 crashes in the last 36 hours, each of which required a reboot. When it happens, the web interface is no longer available and any machines running off the NVMe drive stop working, while the virtual machines running on the non-NVMe drive stay up. The first time it crashed I could not retrieve any errors; the second time I was able to capture the messages below from syslog.

This is a new machine, and at the time of the crashes I only had a Windows 10 VM and 2 Ubuntu containers running. Any help would be appreciated; please let me know what additional information would be helpful in troubleshooting.

Error from syslog:
Code:
Jan 15 04:05:51 proxsrv kernel: [57859.268892] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f15 source:0x0100
Jan 15 04:05:51 proxsrv kernel: [57859.268893] pcieport 0000:00:1b.0: DPC: ERR_FATAL detected
Jan 15 04:05:51 proxsrv kernel: [57859.268894] nvme
Jan 15 04:05:51 proxsrv kernel: [57859.326043] pcieport 0000:00:1b.0: AER: Multiple Corrected error received: 0000:01:00.0
Jan 15 04:05:51 proxsrv kernel: [57859.326130] nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Jan 15 04:05:51 proxsrv kernel: [57859.326133] nvme 0000:01:00.0: AER:   device [1344:5410] error status/mask=00001040/00002000
Jan 15 04:05:51 proxsrv kernel: [57859.326134] nvme 0000:01:00.0: AER:    [ 6] BadTLP
Jan 15 04:05:51 proxsrv kernel: [57859.326135] nvme 0000:01:00.0: AER:    [12] Timeout
Jan 15 04:05:51 proxsrv kernel: [57859.444612] nvme nvme0: restart after slot reset
Jan 15 04:05:51 proxsrv kernel: [57859.722985] nvme nvme0: 16/0/0 default/read/poll queues
Jan 15 04:05:51 proxsrv kernel: [57859.756970] pcieport 0000:00:1b.0: AER: Device recovery successful
Jan 15 04:05:51 proxsrv kernel: [57859.756974] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0100
Jan 15 04:05:51 proxsrv kernel: [57859.756975] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
Jan 15 04:05:51 proxsrv kernel: [57859.756979] nvme nvme0: frozen state error detected, reset controller
Jan 15 04:05:53 proxsrv kernel: [57861.856549] pcieport 0000:00:1b.0: Data Li
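For anyone reading the AER lines above, the `error status/mask=00001040/00002000` value can be decoded by hand. Below is a minimal Python sketch that does so, assuming the line has `severity=Corrected` (uncorrectable errors use a different register layout). The function names are made up for illustration; the bit names follow the short labels the Linux kernel prints for the PCIe correctable error status register.

```python
import re

# Bit positions of the PCIe AER Correctable Error Status register,
# paired with the short names the Linux kernel prints in syslog.
AER_CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_bits(value):
    """Return the names of the known bits set in a correctable-error value."""
    return [
        name
        for bit, name in sorted(AER_CORRECTABLE_BITS.items())
        if value & (1 << bit)
    ]


def decode_aer_line(line):
    """Decode one 'error status/mask=' line; return None if it does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    mask = int(match.group(2), 16)
    return {"reported": decode_bits(status), "masked": decode_bits(mask)}


line = (
    "nvme 0000:01:00.0: AER:   device [1344:5410] "
    "error status/mask=00001040/00002000"
)
print(decode_aer_line(line))
# {'reported': ['BadTLP', 'Timeout'], 'masked': ['NonFatalErr']}
```

The decoded status matches the `[ 6] BadTLP` and `[12] Timeout` lines in the log: both are link-level (Data Link Layer) errors on the NVMe device at 0000:01:00.0.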
 
In addition to the above, the NVMe drive I am running is a Micron 2200S NVMe 256GB with firmware 22001040. Based on this post (https://www.reddit.com/r/linuxhardware/comments/kspktp/psa_fix_for_micron_2200s_ssd_crashes/), there does appear to be an issue between the version of smartmontools that Proxmox uses and that firmware version on the Micron. I cannot find any newer firmware, and I am running the latest version of the smartmontools package that the Proxmox distribution offers.
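To check whether another machine has the same drive and firmware combination, the model and firmware revision can be read from sysfs. The Python sketch below is only an illustration: the function names are hypothetical, and the substring match on "2200S" is a guess at how the model string is spelled in sysfs. The `model` and `firmware_rev` files under `/sys/class/nvme/<controller>/` are standard Linux NVMe driver attributes.

```python
from pathlib import Path


def read_nvme_identity(sysfs_root="/sys/class/nvme"):
    """Return {controller_name: (model, firmware)} for each NVMe controller."""
    found = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return found  # no NVMe controllers, or not a Linux host
    for ctrl in sorted(root.iterdir()):
        model_file = ctrl / "model"
        fw_file = ctrl / "firmware_rev"
        if model_file.is_file() and fw_file.is_file():
            found[ctrl.name] = (
                model_file.read_text().strip(),
                fw_file.read_text().strip(),
            )
    return found


def is_reported_combination(model, firmware):
    """True for the drive/firmware pair named in this post.

    Assumption: the sysfs model string contains '2200S'.
    """
    return "2200S" in model and firmware == "22001040"


for name, (model, fw) in read_nvme_identity().items():
    if is_reported_combination(model, fw):
        note = "same model and firmware as in this post"
    else:
        note = "different model or firmware"
    print(f"{name}: {model} fw={fw} ({note})")
```

A match only says the hardware is the same as in this report; it does not by itself confirm that the crashes have the same cause.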

Just to add a little bit of additional detail on my versions:

Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.73-1-pve)
pve-manager: 6.3-2 (running version: 6.3-2/22f57405)
pve-kernel-5.4: 6.3-1
pve-kernel-helper: 6.3-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-6
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-1
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
 
Does anyone have any suggestions or help on this one? Right now my server, which I cannot hard reboot at this hour, is somewhat operational: the virtual machines seem to be running, but I cannot access the Proxmox GUI, nor can I SSH into the Proxmox server itself. The only option I have is a hard reboot (hooking up a monitor shows nothing), which would force down all VMs.

As a side question to this, would paying for the community support offer any additional support options to me?
 
