Proxmox just died with: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10

marcosscriven

I was playing a game on a Windows VM, and it suddenly paused.

I checked the Proxmox logs, and saw this:

Code:
[268690.209099] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[268690.289109] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[268690.289234] nvme nvme0: Removing after probe failure status: -19
[268690.313116] nvme0n1: detected capacity change from 1953525168 to 0
[268690.313116] blk_update_request: I/O error, dev nvme0n1, sector 119170336 op 0x1:(WRITE) flags 0x800 phys_seg 14 prio class 0
[268690.313117] blk_update_request: I/O error, dev nvme0n1, sector 293367304 op 0x1:(WRITE) flags 0x8800 phys_seg 5 prio class 0
[268690.313118] blk_update_request: I/O error, dev nvme0n1, sector 1886015680 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[268690.313119] blk_update_request: I/O error, dev nvme0n1, sector 335354496 op 0x1:(WRITE) flags 0x8800 phys_seg 82 prio class 0
[268690.313121] blk_update_request: I/O error, dev nvme0n1, sector 324852224 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[268690.313121] blk_update_request: I/O error, dev nvme0n1, sector 215307472 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[268690.313133] EXT4-fs warning (device dm-1): ext4_end_bio:346: I/O error 10 writing to inode 6186409 starting block 24684698)
[268690.313141] blk_update_request: I/O error, dev nvme0n1, sector 286812536 op 0x1:(WRITE) flags 0x8800 phys_seg 3 prio class 0
[268690.313141] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268690.313143] blk_update_request: I/O error, dev nvme0n1, sector 334614632 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[268690.313165] Aborting journal on device dm-1-8.
[268690.313168] blk_update_request: I/O error, dev nvme0n1, sector 334655488 op 0x1:(WRITE) flags 0x8800 phys_seg 6 prio class 0
[268690.313173] EXT4-fs error (device dm-1): ext4_journal_check_start:83: comm pmxcfs: Detected aborted journal
[268690.313175] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268690.313176] blk_update_request: I/O error, dev nvme0n1, sector 334647608 op 0x1:(WRITE) flags 0x8800 phys_seg 9 prio class 0
[268690.313178] EXT4-fs error (device dm-1): ext4_journal_check_start:83: comm rs:main Q:Reg: Detected aborted journal
[268690.313180] Buffer I/O error on dev dm-1, logical block 12615680, lost sync page write
[268690.313181] EXT4-fs error (device dm-1): ext4_journal_check_start:83: comm pveproxy worker: Detected aborted journal
[268690.313184] JBD2: Error -5 detected when updating journal superblock for dm-1-8.
[268690.313187] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268690.313193] EXT4-fs warning (device dm-1): ext4_end_bio:346: I/O error 10 writing to inode 5777456 starting block 560183)
[268690.313208] EXT4-fs error (device dm-1): ext4_journal_check_start:83: comm kworker/u64:0: Detected aborted journal
[268690.313210] Buffer I/O error on dev dm-1, logical block 0, lost sync page write
[268690.313211] Buffer I/O error on device dm-1, logical block 560183
[268690.313225] EXT4-fs (dm-1): I/O error while writing superblock
[268690.313229] EXT4-fs (dm-1): Remounting filesystem read-only
[268690.313236] Buffer I/O error on dev dm-1, logical block 23593269, lost async page write
[268690.313245] Buffer I/O error on dev dm-1, logical block 0, lost sync page write
[268690.313246] Buffer I/O error on dev dm-1, logical block 23593272, lost async page write
[268690.313251] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268690.313255] Buffer I/O error on dev dm-1, logical block 24650487, lost async page write
[268690.313258] EXT4-fs (dm-1): I/O error while writing superblock
[268690.313263] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268690.313266] EXT4-fs warning (device dm-1): ext4_end_bio:346: I/O error 10 writing to inode 5902737 starting block 559027)
[268690.313273] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268690.313280] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268690.313283] Buffer I/O error on dev dm-1, logical block 0, lost sync page write
[268690.313287] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268690.313292] EXT4-fs (dm-1): I/O error while writing superblock
[268690.313295] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268690.313295] EXT4-fs (dm-1): previous I/O error to superblock detected
[268690.313301] Buffer I/O error on dev dm-1, logical block 0, lost sync page write
[268690.313307] EXT4-fs (dm-1): I/O error while writing superblock
[268690.313312] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268690.313311] EXT4-fs (dm-1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 6186409, error -30)
[268690.313320] Buffer I/O error on device dm-1, logical block 24684698
[268690.313330] EXT4-fs (dm-1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 5902737, error -30)
[268690.313337] Buffer I/O error on device dm-1, logical block 559027
[268690.346871] nvme nvme0: failed to set APST feature (-19)
[268695.355772] process_cell: 40 callbacks suppressed
[268695.355775] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268695.355794] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268696.371288] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268696.371306] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268697.371295] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268697.371315] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268698.386891] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268698.387041] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268698.387358] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268698.387649] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268700.418147] process_cell: 2 callbacks suppressed
[268700.418150] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268700.418164] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268701.418159] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268701.418180] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268702.418135] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[268702.418265] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5

Then hundreds more of that last line.

The host file system suddenly became read-only, but oddly the Windows VM came back for a few moments.

Any ideas what's happened here? The drive is reasonably new (0% wear) and there were no thermal warnings.
 
SMART data for nvme0:

Code:
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    17,690,794 [9.05 TB]
Data Units Written:                 13,206,455 [6.76 TB]
Host Read Commands:                 291,944,005
Host Write Commands:                204,342,061
Controller Busy Time:               79
Power Cycles:                       166
Power On Hours:                     620
Unsafe Shutdowns:                   92
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
 
Any luck? I see this issue sporadically with my 870 EVO. I've seen someone say it was fixed after updating the firmware on their drive, so that might be an option. I've suspected power problems on my side, but there's no firmware update available for my drive, so I'm looking for any potential solution or place to investigate. I basically see nothing at all in any logs I can find when it happens.
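If it helps anyone check where they stand on firmware, smartctl prints the revision currently installed (the device path below is just an example; adjust it for your drive):

Code:
# Shows model, serial number and the firmware revision currently running
# (works for SATA and NVMe devices alike):
smartctl -i /dev/nvme0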
 
I think I may have fixed it. In my case I have several devices attached by PCIe, two on the motherboard, and three on a bifurcator:

On the motherboard:

x4 PCIe 3.0 Samsung 970 Evo Plus
x4 PCIe 4.0 WD SN850

On the bifurcator in x8x4x4 mode (all PCIe 3.0):

x8 RTX 3080 (with 2x6-pin power)
x4 GT 1030 (no power)
x4 Samsung SM951

Now, the bifurcator has a 4-pin 12V EPS connector that I didn't think I needed to use. The 3080 has plenty of power from its own supply, and I thought three NVMe drives + a 25W GPU should be fine to run off the PCIe bus?

However, since plugging in the extra power I've not seen this recur.
 
Interesting, so yours may be related to power. I only have a P400 in my risers: dual 750W PSUs, dual Xeon 2690 v2 CPUs, and only 6 drives loaded into my backplane, so I wouldn't think I'm stressing it from a power perspective. I guess one or both PSUs could be "bad".

Thanks for the information.
 
@ejkeebler - and anyone else reading this in the future: it turned out that adding extra power didn't fix the issue.

In my case I think it's because the SSD wasn't quite seated correctly.

It's extremely confusing behavior, however: why would an SSD work for a while and then stop, if it's just not physically seated correctly? All I can think (and I don't know for sure) is that either the fan spinning up more and causing vibration, or the GPU or SSD getting hotter and thermally expanding a tiny amount (or some combination thereof), was enough to disrupt the fragile connection to the drive.

It sort of seems unlikely to me - but when I physically checked the drive/heatsink, it was loose.

If this is a load of baloney, I'd like to hear it though.
 
Update: I'm still seeing this, especially during game loads in a Windows VM (using a virtual disk).

Any suggestions welcome here.
 
Set the PCIe speed for all hardware connected through the "bifurcator" to Gen2 or Gen1. If it runs reliably then, you have a signal integrity issue with PCIe Gen3 operation in combination with the "bifurcator".
Remember: the more transitions over connectors the signals have to make, the worse they get.
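If you want to see what a link has actually negotiated before and after such a change, lspci on the host shows it. As a rough example (01:00.0 is the NVMe controller's address from the log above; adjust for other devices):

Code:
# Compare the link's capability with what is currently negotiated:
lspci -s 01:00.0 -vv | grep -E 'LnkCap:|LnkSta:'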
 
I get this when my cards go over 80 degrees. I was using a passively cooled case that I have now added a fan to, and it hasn't recurred.
When you took the temperature, was it at the time it happened or afterwards?
I found my NVMe disks didn't come back when cooled until I rebooted.
 
Since the problem appears after a while when the system is busy, I suspect the PSU. Hardware gets warm and resistance increases. CPU and GPU are using more power and doing lots of high frequency switching between voltages. Maybe the PSU cannot keep up, even though the average load is well within specification?
 
Could just be a weak rail on the PSU - try balancing the connectors and see what happens.
Have you got voltage sensors configured on the motherboard? You could capture the values and see if it is power.
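As a rough sketch of what I mean by capturing the values (the log path is just an example), something like this on the host would let you line a voltage dip up with the next crash:

Code:
# Append a timestamped snapshot of all sensor readings once a minute:
while true; do
  { date; sensors; echo; } >> /var/log/sensor-watch.log
  sleep 60
done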
 
Thanks for all the replies - I should have stated that, since my initial post, I'm no longer using the bifurcator and am just using the motherboard's M.2 slot (direct CPU lanes).
 
Things I've done thus far:

* Updated to the latest stable BIOS
* Updated the NVMe firmware to the latest version
* Reseated the drive (which is directly on the CPU PCIe 4.0 bus, rather than the chipset 3.0 bus)
* Checked temp of drive at time of crash (51C)
* Checked temp of GPU at time of crash (61C)

The disk crash only happens when running Half-Life: Alyx, so I feel like this could just be a software issue. It can also happen very early into running the game.

The only thing I haven't done is @damo2929's suggestion of 'balancing', but I'm not sure which plugs are on which rails (this is a Corsair 750 SFX).

I'm also not sure how to monitor motherboard voltages (Aorus B550I Pro AX) via the Proxmox host. They don't seem to be displayed in `sensors`.
 
Maybe a silly question, but did you run sensors-detect (as root) on the Proxmox host (and add the modules it suggests to /etc/modules)?
Your M.2 drive and GPU temperatures look normal. I don't know the rest of your hardware, but Corsair does make reasonable PSUs. (I always suspect the PSU if you don't get error messages that indicate something else.)
The loading of levels in a game can tax the drive (loading the content), the CPU (decompressing the content) and the GPU (pre-compiling shaders).
 
Yes, I ran sensors-detect - although it didn't seem to suggest any modules, it just scanned for them (see the output at the end).

One thing I'm about to try is
Code:
acpi_enforce_resources=lax
in boot params.
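For reference, a rough sketch of where that goes on a Proxmox host (which file applies depends on whether the machine boots via GRUB or systemd-boot; the paths below are the stock ones):

Code:
# GRUB: append the option to GRUB_CMDLINE_LINUX_DEFAULT in
# /etc/default/grub, then regenerate the config:
update-grub

# systemd-boot (e.g. UEFI + ZFS installs): append it to the single
# line in /etc/kernel/cmdline, then:
proxmox-boot-tool refresh

# After a reboot, confirm it took effect:
cat /proc/cmdline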

I'm not even 100% sure what chip/module I need. These things are not well documented.

Code:
root@pve:~# sensors-detect
# sensors-detect version 3.6.0
# System: Gigabyte Technology Co., Ltd. B550I AORUS PRO AX [Default string]
# Kernel: 5.13.19-6-pve x86_64
# Processor: AMD Ryzen 9 3900X 12-Core Processor (23/113/0)

This program will help you determine which kernel modules you need
to load to use lm_sensors most effectively. It is generally safe
and recommended to accept the default answers to all questions,
unless you know what you're doing.

Some south bridges, CPUs or memory controllers contain embedded sensors.
Do you want to scan for them? This is totally safe. (YES/no):
Module cpuid loaded successfully.
Silicon Integrated Systems SIS5595...                       No
VIA VT82C686 Integrated Sensors...                          No
VIA VT8231 Integrated Sensors...                            No
AMD K8 thermal sensors...                                   No
AMD Family 10h thermal sensors...                           No
AMD Family 11h thermal sensors...                           No
AMD Family 12h and 14h thermal sensors...                   No
AMD Family 15h thermal sensors...                           No
AMD Family 16h thermal sensors...                           No
AMD Family 17h thermal sensors...                           Success!
    (driver `k10temp')
AMD Family 15h power sensors...                             No
AMD Family 16h power sensors...                             No
Hygon Family 18h thermal sensors...                         No
Intel digital thermal sensor...                             No
Intel AMB FB-DIMM thermal sensor...                         No
Intel 5500/5520/X58 thermal sensor...                       No
VIA C7 thermal sensor...                                    No
VIA Nano thermal sensor...                                  No

Some Super I/O chips contain embedded sensors. We have to write to
standard I/O ports to probe them. This is usually safe.
Do you want to scan for Super I/O sensors? (YES/no):
Probing for Super-I/O at 0x2e/0x2f
Trying family `National Semiconductor/ITE'...               No
Trying family `SMSC'...                                     No
Trying family `VIA/Winbond/Nuvoton/Fintek'...               No
Trying family `ITE'...                                      Yes
Found unknown chip with ID 0x8688
Probing for Super-I/O at 0x4e/0x4f
Trying family `National Semiconductor/ITE'...               No
Trying family `SMSC'...                                     No
Trying family `VIA/Winbond/Nuvoton/Fintek'...               No
Trying family `ITE'...                                      No

Some systems (mainly servers) implement IPMI, a set of common interfaces
through which system health data may be retrieved, amongst other things.
We first try to get the information from SMBIOS. If we don't find it
there, we have to read from arbitrary I/O ports to probe for such
interfaces. This is normally safe. Do you want to scan for IPMI
interfaces? (YES/no):
Probing for `IPMI BMC KCS' at 0xca0...                      No
Probing for `IPMI BMC SMIC' at 0xca8...                     No

Some hardware monitoring chips are accessible through the ISA I/O ports.
We have to write to arbitrary I/O ports to probe them. This is usually
safe though. Yes, you do have ISA I/O ports even if you do not have any
ISA slots! Do you want to scan the ISA I/O ports? (YES/no):
Probing for `National Semiconductor LM78' at 0x290...       No
Probing for `National Semiconductor LM79' at 0x290...       No
Probing for `Winbond W83781D' at 0x290...                   No
Probing for `Winbond W83782D' at 0x290...                   No

Lastly, we can probe the I2C/SMBus adapters for connected hardware
monitoring devices. This is the most risky part, and while it works
reasonably well on most systems, it has been reported to cause trouble
on some systems.
Do you want to probe the I2C/SMBus adapters now? (YES/no):
Using driver `i2c-piix4' for device 0000:00:14.0: AMD KERNCZ SMBus


Now follows a summary of the probes I have just done.
Just press ENTER to continue:

Driver `k10temp' (autoloaded):
  * Chip `AMD Family 17h thermal sensors' (confidence: 9)

No modules to load, skipping modules configuration.

Unloading cpuid... OK
 
OK, you can see CPU temperatures.

I think the driver does not support your motherboard, so you will not be able to see its voltages and the remaining temperatures.
I can see a pretty good range of temps, just no voltages. Sometimes in the past I've had luck with forcing a module that's 'similar enough', as it were (rough sketch after the sensors output below).

Code:
root@pve:~# sensors
gigabyte_wmi-virtual-0
Adapter: Virtual device
temp1:        +40.0°C
temp2:        +42.0°C
temp3:        +38.0°C
temp4:        +20.0°C
temp5:        +40.0°C
temp6:        +45.0°C

nvme-pci-0800
Adapter: PCI adapter
Composite:    +45.9°C  (low  = -273.1°C, high = +84.8°C)
                       (crit = +84.8°C)
Sensor 1:     +45.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +44.9°C  (low  = -273.1°C, high = +65261.8°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +16.8°C  (crit = +20.8°C)
temp2:        +16.8°C  (crit = +20.8°C)

iwlwifi_1-virtual-0
Adapter: Virtual device
temp1:            N/A

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +38.2°C
Tdie:         +38.2°C
Tccd1:        +47.2°C
Tccd2:        +36.8°C

nvme-pci-0100
Adapter: PCI adapter
Composite:    +48.9°C  (low  =  -5.2°C, high = +83.8°C)
                       (crit = +87.8°C)
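For what it's worth, the 'force a similar module' idea usually looks something like the sketch below. Whether the in-tree it87 driver's register layout actually matches this board's IT8688E is a guess (0x8628, the IT8628E, is just a nearby chip the driver does know about), so treat it purely as an experiment:

Code:
# Pretend the Super I/O chip is one the in-tree driver recognises:
modprobe it87 force_id=0x8628
sensors
# Only make it persistent if the readings look sane:
echo "it87" >> /etc/modules
echo "options it87 force_id=0x8628" >> /etc/modprobe.d/it87.conf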
 
Ok, I got something after building the ancient it87 out-of-tree driver:

Code:
root@pve:~/drivers/it87-it8688E# sensors
it8688-isa-0a40
Adapter: ISA adapter
in0:         276.00 mV (min =  +0.00 V, max =  +3.06 V)
in1:           1.99 V  (min =  +0.00 V, max =  +3.06 V)
in2:           1.99 V  (min =  +0.00 V, max =  +3.06 V)
in3:           2.00 V  (min =  +0.00 V, max =  +3.06 V)
in4:           1.03 V  (min =  +0.00 V, max =  +3.06 V)
in5:         912.00 mV (min =  +0.00 V, max =  +3.06 V)
in6:           1.40 V  (min =  +0.00 V, max =  +3.06 V)
3VSB:          3.31 V  (min =  +0.00 V, max =  +6.12 V)
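For anyone else with this board, building it was roughly the usual out-of-tree module dance (assuming the pve-headers for the running kernel are installed and the fork ships a standard Makefile; I'm deliberately not naming a specific repo):

Code:
# Build dependencies plus headers matching the running PVE kernel:
apt install build-essential pve-headers-$(uname -r)
# Inside a checkout of the out-of-tree it87 fork:
make && make install     # most forks' Makefiles also run depmod
modprobe -r it87 2>/dev/null
modprobe it87
sensors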
 
