Regular crash followed by input/output error on a NUC 11 - Samsung 970 Pro

RNab

Member
Jun 20, 2021
Hi all,

I've been struggling with some issues on my Proxmox server, which runs on a NUC 11 with a Samsung 970 Pro NVMe drive. I have already removed and reseated all components manually in case it was a connection issue. But every now and then the machine stops responding, and on the attached screen I see either an input/output error or dm_lvm_thin_block-related errors (I forgot the exact line; I will update next time).
What I have done so far:
- Monitored the smartctl output (it looks good; pasted below)
- Disconnected and reconnected both the RAM and the NVMe drive
- Did a full reinstall with a full disk wipe

I've pasted below an extract of syslog around the time of the latest crash (between 23:30 and 23:40) and the dmesg lines mentioning an error or warning.

Could anyone help me figure out what the correct next steps are to fix this? The disk is new (3 months old).

Currently installed on the machine are 3 LXC containers that are mostly idle. One of them is used intermittently (media center/Jellyfin).

Thank you in advance for any kind of help.

Smartctl

Code:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.2.16-14-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 PRO 1TB
Serial Number:                      S5JXNS0N702999F
Firmware Version:                   1B2QEXP7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,024,209,543,168 [1.02 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
Namespace 1 Utilization:            199,396,958,208 [199 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 57014138e2
Local Time is:                      Sun Oct  1 12:17:27 2023 +08
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0037):   Security Format Frmw_DL Self_Test Directvs
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     81 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        47 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    5,226,266 [2.67 TB]
Data Units Written:                 2,395,764 [1.22 TB]
Host Read Commands:                 52,953,136
Host Write Commands:                106,623,181
Controller Busy Time:               228
Power Cycles:                       33
Power On Hours:                     436
Unsafe Shutdowns:                   21
Media and Data Integrity Errors:    0
Error Information Log Entries:      36
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               47 Celsius
Temperature Sensor 2:               49 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0         36     0  0x000c  0x4004      -            0     0     -
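
To track these counters between crashes, the relevant fields can be pulled out of saved smartctl output. The following is a minimal Python sketch, not part of smartmontools: the function name `parse_health` and the choice of fields are illustrative assumptions, and the sample text is copied from the output above.

```python
# Sample lines copied from the smartctl output above.
SAMPLE = """\
Critical Warning:                   0x00
Available Spare:                    100%
Percentage Used:                    0%
Power Cycles:                       33
Power On Hours:                     436
Unsafe Shutdowns:                   21
Media and Data Integrity Errors:    0
Error Information Log Entries:      36
"""

# Counters worth comparing from one crash to the next (an arbitrary selection).
FIELDS = (
    "Power Cycles",
    "Unsafe Shutdowns",
    "Media and Data Integrity Errors",
    "Error Information Log Entries",
)


def parse_health(text):
    """Return {field name: integer value} for the counters listed in FIELDS."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if sep and name in FIELDS:
            # smartctl prints large numbers with thousands separators.
            counters[name] = int(value.strip().replace(",", ""))
    return counters


health = parse_health(SAMPLE)
for name, value in health.items():
    print(f"{name}: {value}")
```

On the values above, 21 of the 33 power cycles ended in an unsafe shutdown, while the media and data integrity error count is 0.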

Sys Log

Code:
2023-09-30T23:17:01.135707+08:00 pve CRON[1826125]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
2023-09-30T23:26:56.420028+08:00 pve smartd[765]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 71 to 72
2023-09-30T23:26:56.420136+08:00 pve smartd[765]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 55 to 54
2023-09-30T23:26:56.420159+08:00 pve smartd[765]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 45 to 46
2023-10-01T11:50:44.819060+08:00 pve kernel: [    0.000000] Linux version 6.2.16-14-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-14 (>
2023-10-01T11:50:44.819093+08:00 pve kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.2.16-14-pve root=/dev/mapper/pve-root ro quiet
2023-10-01T11:50:44.819094+08:00 pve kernel: [    0.000000] KERNEL supported cpus:
2023-10-01T11:50:44.819094+08:00 pve kernel: [    0.000000]   Intel GenuineIntel
2023-10-01T11:50:44.819095+08:00 pve kernel: [    0.000000]   AMD AuthenticAMD
2023-10-01T11:50:44.819095+08:00 pve kernel: [    0.000000]   Hygon HygonGenuine
2023-10-01T11:50:44.819096+08:00 pve kernel: [    0.000000]   Centaur CentaurHauls
2023-10-01T11:50:44.819102+08:00 pve kernel: [    0.000000]   zhaoxin   Shanghai 
2023-10-01T11:50:44.819103+08:00 pve kernel: [    0.000000] x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
2023-10-01T11:50:44.819103+08:00 pve kernel: [    0.000000] BIOS-provided physical RAM map:
2023-10-01T11:50:44.819103+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009efff] usable
2023-10-01T11:50:44.819104+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x000000000009f000-0x00000000000fffff] reserved
2023-10-01T11:50:44.819104+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000037f42fff] usable
2023-10-01T11:50:44.819105+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x0000000037f43000-0x0000000040953fff] reserved
2023-10-01T11:50:44.819106+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x0000000040954000-0x0000000040a1ffff] ACPI data
2023-10-01T11:50:44.819106+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x0000000040a20000-0x0000000040b5afff] ACPI NVS
2023-10-01T11:50:44.819106+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x0000000040b5b000-0x000000004173afff] reserved
2023-10-01T11:50:44.819107+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x000000004173b000-0x00000000417fefff] type 20
2023-10-01T11:50:44.819107+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x00000000417ff000-0x00000000417fffff] usable
2023-10-01T11:50:44.819113+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x0000000041800000-0x0000000047ffffff] reserved
2023-10-01T11:50:44.819114+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x0000000048e00000-0x000000004f7fffff] reserved
2023-10-01T11:50:44.819114+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x00000000c0000000-0x00000000cfffffff] reserved
2023-10-01T11:50:44.819114+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
2023-10-01T11:50:44.819115+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
2023-10-01T11:50:44.819115+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x00000000fed00000-0x00000000fed00fff] reserved
2023-10-01T11:50:44.819116+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x00000000fed20000-0x00000000fed7ffff] reserved
2023-10-01T11:50:44.819117+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
2023-10-01T11:50:44.819117+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
2023-10-01T11:50:44.819118+08:00 pve kernel: [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x00000008b07fffff] usable
2023-10-01T11:50:44.819118+08:00 pve kernel: [    0.000000] NX (Execute Disable) protection: active
2023-10-01T11:50:44.819118+08:00 pve kernel: [    0.000000] efi: EFI v2.70 by American Megatrends
2023-10-01T11:50:44.819119+08:00 pve kernel: [    0.000000] efi: ACPI=0x40adc000 ACPI 2.0=0x40adc014 TPMFinalLog=0x40ae6000 SMBIOS=0x41577000 SMBIOS 3.0=0x41576000 MEMATTR=0x3024a018 ESRT=0x30252e98 
2023-10-01T11:50:44.819120+08:00 pve kernel: [    0.000000] efi: Remove mem68: MMIO range=[0xc0000000-0xcfffffff] (256MB) from e820 map
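
In the extract above, the last entry before the reboot is at 23:26:56 and the next one is the kernel boot at 11:50:44, so nothing was logged inside the 23:30 to 23:40 window. A minimal Python sketch for locating such silent periods in a syslog file with ISO 8601 timestamps follows. The function name `find_gaps` and the 30-minute threshold are assumptions made for illustration, and the timestamps are copied from the extract.

```python
from datetime import datetime, timedelta

# Timestamps copied from the syslog extract above (message text omitted).
SAMPLE = [
    "2023-09-30T23:17:01.135707+08:00 pve CRON[1826125]:",
    "2023-09-30T23:26:56.420028+08:00 pve smartd[765]:",
    "2023-09-30T23:26:56.420159+08:00 pve smartd[765]:",
    "2023-10-01T11:50:44.819060+08:00 pve kernel:",
]


def find_gaps(lines, threshold=timedelta(minutes=30)):
    """Return (last_seen, next_seen) pairs separated by more than threshold."""
    gaps = []
    previous = None
    for line in lines:
        stamp = line.split(" ", 1)[0]
        try:
            current = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


gaps = find_gaps(SAMPLE)
for last_seen, next_seen in gaps:
    print(f"silent from {last_seen} to {next_seen} ({next_seen - last_seen})")
```

A long silent period ending in a boot banner only shows that nothing reached the log. It does not by itself say whether the storage, the kernel, or the power was at fault.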

dmesg error/warn
Code:
[    0.000000] x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
[   25.419445] EXT4-fs warning (device dm-9): ext4_multi_mount_protect:328: MMP interval 42 higher than expected, please wait.
[   72.466888] EXT4-fs warning (device dm-8): ext4_multi_mount_protect:328: MMP interval 42 higher than expected, please wait.


[   14.575842] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
[   71.515171] audit: type=1400 audit(1696132300.682:26): apparmor="DENIED" operation="mount" class="mount" info="failed perms check" error=-13 profile="lxc-102_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=2165 comm="(nft)" srcname="/" flags="rw, rbind"
[   71.532642] audit: type=1400 audit(1696132300.698:27): apparmor="DENIED" operation="mount" class="mount" info="failed perms check" error=-13 profile="lxc-102_</var/lib/lxc>" name="/dev/" pid=2174 comm="(sd-mkdcreds)" flags="rw, rslave"
[   71.573588] audit: type=1400 audit(1696132300.742:28): apparmor="DENIED" operation="mount" class="mount" info="failed perms check" error=-13 profile="lxc-102_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=2205 comm="(networkd)" srcname="/" flags="rw, rbind"
[   71.615099] audit: type=1400 audit(1696132300.782:29): apparmor="DENIED" operation="mount" class="mount" info="failed perms check" error=-13 profile="lxc-102_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=2218 comm="(resolved)" srcname="/" flags="rw, rbind"
[   71.648644] audit: type=1400 audit(1696132300.814:30): apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-102_</var/lib/lxc>//&:lxc-102_<-var-lib-lxc>:unconfined" pid=2216 comm="apparmor_parser"
[   71.678146] audit: type=1400 audit(1696132300.846:31): apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-102_</var/lib/lxc>//&:lxc-102_<-var-lib-lxc>:unconfined" pid=2222 comm="apparmor_parser"
[   71.678165] audit: type=1400 audit(1696132300.846:32): apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-102_</var/lib/lxc>//&:lxc-102_<-var-lib-lxc>:unconfined" pid=2222 comm="apparmor_parser"
[   71.771467] audit: type=1400 audit(1696132300.938:33): apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-102_</var/lib/lxc>//&:lxc-102_<-var-lib-lxc>:unconfined" pid=2223 comm="apparmor_parser"
[   71.815537] audit: type=1400 audit(1696132300.982:34): apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-102_</var/lib/lxc>//&:lxc-102_<-var-lib-lxc>:unconfined" pid=2228 comm="apparmor_parser"
 
Unsafe Shutdowns: 21
Error Information Log Entries: 36

Maybe replace that NVMe drive...
 
The unsafe shutdowns are a result of the system crashing, not the cause, though, no?

Just to add: after some research, it seems the Error Information Log entries are fairly benign. The only issue I spot when using the nvme error-log CLI is:

0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)

This seems to be benign, but I don't know.
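
For reference, the 0x4004 in the smartctl Status column and the 0x2002 from nvme error-log appear to be the same value: the raw field in the error log carries the phase tag in bit 0, and shifting it right by one bit gives 0x2002. A minimal Python sketch of the decoding follows. It assumes the status field layout from the NVMe base specification (status code in bits 7:0, status code type in bits 10:8, More in bit 13, Do Not Retry in bit 14). The function name and the small lookup table are illustrative, not taken from any tool.

```python
# A few generic command status codes (status code type 0) from the NVMe spec.
GENERIC_STATUS = {
    0x00: "Successful Completion",
    0x01: "Invalid Command Opcode",
    0x02: "Invalid Field in Command",
}


def decode_nvme_status(raw, includes_phase_bit=False):
    """Split an NVMe status value into its sub-fields.

    Set includes_phase_bit=True for the raw 16-bit field, where bit 0 is
    the phase tag (assumed here to be what smartctl prints).
    """
    if includes_phase_bit:
        raw >>= 1
    code = raw & 0xFF
    code_type = (raw >> 8) & 0x7
    name = GENERIC_STATUS.get(code, "unknown") if code_type == 0 else "unknown"
    return {
        "status_code": code,
        "status_code_type": code_type,
        "more": bool(raw & 0x2000),
        "do_not_retry": bool(raw & 0x4000),
        "name": name,
    }


from_nvme_cli = decode_nvme_status(0x2002)
from_smartctl = decode_nvme_status(0x4004, includes_phase_bit=True)
print(from_nvme_cli)
print(from_smartctl)
```

Under that layout, both values decode to a generic "Invalid Field in Command" status with the More bit set, which describes a command the drive rejected rather than a media error.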
 