Workaround: nvme nvme0: frozen state error detected - Samsung 980 M.2 NVMe

goseph

Hi,

PVE 7.1-12
Kernel Version Linux 5.13.19-6-pve #1 SMP PVE 5.13.19-15
Samsung 980 NVME M.2 latest firmware 2B4QFXO7
System: Shuttle DL20N6 (DL20N)

Strange things already happened with the Samsung 980 NVMe during the Proxmox installation. At the last step of the installer the drive was sometimes writable and sometimes not.
I put the Samsung 980 in an NVMe-to-USB adapter and everything was fine.

With the NVMe back in the M.2 slot, dmesg -T says:
Code:
[Fri Apr 15 09:21:44 2022] nvme nvme0: frozen state error detected, reset controller
[Fri Apr 15 09:21:44 2022] nvme nvme0: restart after slot reset
[Fri Apr 15 09:21:44 2022] nvme nvme0: Shutdown timeout set to 8 seconds
[Fri Apr 15 09:21:45 2022] nvme nvme0: 4/0/0 default/read/poll queues

[Fri Apr 15 09:26:05 2022] pcieport 0000:00:1c.4: AER:   Error of this Agent is reported first
[Fri Apr 15 09:26:05 2022] nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[Fri Apr 15 09:26:05 2022] nvme 0000:03:00.0:   device [144d:a809] error status/mask=00001000/0000e000
[Fri Apr 15 09:26:05 2022] nvme 0000:03:00.0:    [12] Timeout
[Fri Apr 15 09:26:05 2022] pcieport 0000:00:1c.4: AER: Multiple Corrected error received: 0000:00:1c.4
[Fri Apr 15 09:26:05 2022] pcieport 0000:00:1c.4: AER: can't find device of ID00e4
[Fri Apr 15 09:26:06 2022] pcieport 0000:00:1c.4: AER: Multiple Corrected error received: 0000:00:1c.4
[Fri Apr 15 09:26:06 2022] pcieport 0000:00:1c.4: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[Fri Apr 15 09:26:06 2022] pcieport 0000:00:1c.4:   device [8086:4dbc] error status/mask=00000001/00002000
[Fri Apr 15 09:26:06 2022] pcieport 0000:00:1c.4:    [ 0] RxErr

[Fri Apr 15 09:29:55 2022] blk_update_request: I/O error, dev nvme0n1, sector 57947608 op 0x1:(WRITE) flags 0x800 phys_seg 6 prio class 0
[Fri Apr 15 09:29:55 2022] blk_update_request: I/O error, dev nvme0n1, sector 1050624 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
[Fri Apr 15 09:29:55 2022] blk_update_request: I/O error, dev nvme0n1, sector 95757712 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Fri Apr 15 09:29:55 2022] EXT4-fs warning (device dm-1): ext4_end_bio:342: I/O error 10 writing to inode 2493463 starting block 10003122)
[Fri Apr 15 09:29:55 2022] Buffer I/O error on device dm-1, logical block 10003122
[Fri Apr 15 09:29:55 2022] blk_update_request: I/O error, dev nvme0n1, sector 95757696 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Fri Apr 15 09:29:55 2022] EXT4-fs warning (device dm-1): ext4_end_bio:342: I/O error 10 writing to inode 2493463 starting block 10003120)
[Fri Apr 15 09:29:55 2022] Buffer I/O error on device dm-1, logical block 10003120
[Fri Apr 15 09:29:55 2022] blk_update_request: I/O error, dev nvme0n1, sector 95742128 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Fri Apr 15 09:29:55 2022] EXT4-fs warning (device dm-1): ext4_end_bio:342: I/O error 10 writing to inode 2493453 starting block 10001174)
[Fri Apr 15 09:29:55 2022] Buffer I/O error on device dm-1, logical block 10001174

Eventually this ends with the drive becoming read-only.
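
To check whether another host shows the same pattern, a small filter over the kernel log can count the relevant events. This is only a sketch: the function name count_nvme_events is made up for illustration, and the three patterns are simply taken from the log excerpt above, so other drives or kernels may word their messages differently.

```shell
#!/bin/sh
# count_nvme_events: hypothetical helper (not part of any package).
# Reads kernel log text on stdin and prints how many lines match each
# of the three message types seen in the log above.
count_nvme_events() {
    awk '
        /frozen state error detected/ { frozen++ }
        /PCIe Bus Error/              { aer++ }
        /I\/O error, dev nvme/        { io++ }
        END { printf "frozen=%d aer=%d io=%d\n", frozen, aer, io }
    '
}

# Usage on a live system (needs permission to read the kernel log):
#   dmesg -T | count_nvme_events
```

Non-zero counts after applying a workaround would suggest the problem is still present on that machine.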

Does NOT help: apt update && apt install pve-kernel-5.15

Workaround (hope it helps others for now):
Code:
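# Open the GRUB defaults and change the GRUB_CMDLINE_LINUX_DEFAULT line to the one below.
# nvme_core.default_ps_max_latency_us=0 disables NVMe autonomous power state transitions (APST),
# pcie_aspm=off disables PCIe Active State Power Management.
nano /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
# Regenerate the GRUB configuration and reboot so the parameters take effect.
update-grub
reboot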
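
After the reboot it is worth confirming that both parameters actually reached the running kernel. The following is a minimal sketch, assuming a POSIX shell; has_kernel_param and check_workaround are hypothetical helper names, not existing tools.

```shell
#!/bin/sh
# has_kernel_param: hypothetical helper. Succeeds if the exact token $2
# appears in the space-separated kernel command line string $1.
has_kernel_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# check_workaround: hypothetical helper. Prints, for each of the two
# parameters from the workaround, whether it is present in the given
# command line string.
check_workaround() {
    for p in nvme_core.default_ps_max_latency_us=0 pcie_aspm=off; do
        if has_kernel_param "$1" "$p"; then
            echo "$p: active"
        else
            echo "$p: missing"
        fi
    done
}

# Usage after the reboot:
#   check_workaround "$(cat /proc/cmdline)"
```

If a parameter shows up as missing, the edit was probably not picked up by the boot loader. As far as I know, Proxmox hosts that boot via systemd-boot (for example ZFS root on UEFI) take the kernel command line from /etc/kernel/cmdline and need proxmox-boot-tool refresh instead of update-grub. The effective APST setting should also be readable from /sys/module/nvme_core/parameters/default_ps_max_latency_us.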

Just tell me what I should post so that this can be fixed. I have read that there are similar bug reports around, for Ubuntu as well:

https://bugzilla.kernel.org/show_bug.cgi?id=195039
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1942624

Might this happen with other NVMe drives, such as Kingston, as well?

Thanks a lot.
 
I'm running into a similar problem as well, but using a Silicon Power 512GB NVMe M.2 PCIe Gen3x4 2280 SSD (SP512GBP34A60M28).

Code:
2022-08-06T16:56:33.244Z    [70724.469616] nvme nvme0: frozen state error detected, reset controller
2022-08-06T16:56:33.244Z    [70724.469566] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
2022-08-06T16:56:33.244Z    [70724.469589] pcieport 0000:00:1b.0:   device [8086:a32c] error status/mask=00200000/00010000
2022-08-06T16:56:33.244Z    [70724.469559] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
2022-08-06T16:56:33.244Z    [70724.469580] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
2022-08-06T16:56:33.244Z    [70724.469597] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
2022-08-06T16:56:35.308Z    [70726.530267] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
2022-08-06T16:56:35.308Z    [70726.530284] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
2022-08-06T16:56:35.308Z    [70726.530276] pcieport 0000:00:1b.0:   device [8086:a32c] error status/mask=00200000/00010000
2022-08-06T16:56:35.308Z    [70726.530303] nvme nvme0: frozen state error detected, reset controller
2022-08-06T16:56:35.308Z    [70726.530257] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
2022-08-06T16:56:35.308Z    [70726.530254] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
2022-08-06T16:56:35.308Z    [70726.530242] pcieport 0000:00:1b.0: AER: device recovery failed
2022-08-06T16:56:35.308Z    [70726.530207] pcieport 0000:00:1b.0: DPC: Data Link Layer Link Active not set in 1000 msec
2022-08-06T16:56:35.308Z    [70726.530212] pcieport 0000:00:1b.0: AER: subordinate device reset failed
2022-08-06T16:56:37.388Z    [70728.610282] nvme nvme0: frozen state error detected, reset controller
2022-08-06T16:56:37.388Z    [70728.610262] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
2022-08-06T16:56:37.388Z    [70728.610254] pcieport 0000:00:1b.0:   device [8086:a32c] error status/mask=00200000/00010000
2022-08-06T16:56:37.388Z    [70728.610244] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
2022-08-06T16:56:37.388Z    [70728.610235] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
2022-08-06T16:56:37.388Z    [70728.610231] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
2022-08-06T16:56:37.388Z    [70728.610219] pcieport 0000:00:1b.0: AER: device recovery failed
2022-08-06T16:56:37.388Z    [70728.610184] pcieport 0000:00:1b.0: DPC: Data Link Layer Link Active not set in 1000 msec
2022-08-06T16:56:37.388Z    [70728.610190] pcieport 0000:00:1b.0: AER: subordinate device reset failed
2022-08-06T16:56:39.448Z    [70730.670143] nvme nvme0: frozen state error detected, reset controller
2022-08-06T16:56:39.448Z    [70730.670123] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
2022-08-06T16:56:39.448Z    [70730.670104] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
2022-08-06T16:56:39.448Z    [70730.670115] pcieport 0000:00:1b.0:   device [8086:a32c] error status/mask=00200000/00010000
2022-08-06T16:56:39.448Z    [70730.670095] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
2022-08-06T16:56:39.448Z    [70730.670092] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
2022-08-06T16:56:39.448Z    [70730.670079] pcieport 0000:00:1b.0: AER: device recovery failed
2022-08-06T16:56:39.448Z    [70730.670050] pcieport 0000:00:1b.0: AER: subordinate device reset failed
2022-08-06T16:56:39.448Z    [70730.670044] pcieport 0000:00:1b.0: DPC: Data Link Layer Link Active not set in 1000 msec
 
Does the mentioned workaround help you?
It helped me.

Fresh install of Proxmox 7.2-3. I had the NVMe drive installed during the initial Proxmox install. The NVMe was available and even had a VM running on it. I shut down all VMs/CTs and powered down the complete system. In the morning the NVMe was not showing up in the storage GUI. Troubleshooting: I couldn't find it with lsblk, so I powered down and verified it hadn't fallen out, then powered on again. The boot took quite a while, and just before the login prompt came up it flashed an error message saying it couldn't mount the NVMe (timeout).

I made the adjustments to grub as above. lsblk now shows the NVMe disk and partition. Should work now. Finishing here, then back to the VM fun!

Thanks for sharing goseph.
 
It’s been a week and I haven’t seen a recurrence of the problem since implementing this workaround. I’m going to call it successful. Thank you very much!

I’m provisioning another node this week with two of the same NVMe drives and that can serve as another test case. I won’t put in the workaround on the new node to see if the problem pops up again.

I’m very happy to have come across this thread!
 
It's been only a few days for me since modifying grub, with a couple of reboots while setting up the new system. I did hit one snag: after a reboot it hung for a bit and then complained about the NVMe. Unfortunately I failed to take note of the error message. I rebooted again and everything worked.
I would say this is a good fix, but it might still leave minor issues on some systems. I'm using an Asus Rampage V motherboard with an i7 6850, an older 'gaming' system.
I'm happy with this fix. It works great for me. I just wanted to note the minor issue in case someone is running into this with a production system.
 
