Hi all,
I've been struggling with some issues on my Proxmox server, which runs on a NUC 11 with a Samsung 970 Pro NVMe drive. I have already physically removed and reinstalled all components in case it was a connection issue. Every now and then the machine stops responding, and on the attached screen I see either input/output errors or dm_lvm_thin_block related errors (I forgot the exact line; I will update next time).
What I have done so far:
- Monitored the smartctl output (it looks good; pasted below)
- Disconnected and reconnected both the RAM and the NVMe drive
- Did a full reinstall with a full disk wipe
I've pasted below an extract of the syslog around the time of the latest crash (between 23:30 and 23:40) and the dmesg lines mentioning an error or warning.
Could anyone help me figure out the correct next steps to fix this? The disk is new (3 months old).
The machine currently runs 3 LXC containers that are mostly idle. One of them is used intermittently (media center/Jellyfin).
Thank you in advance for any kind of help.
Smartctl
Code:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.2.16-14-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 PRO 1TB
Serial Number: S5JXNS0N702999F
Firmware Version: 1B2QEXP7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 1,024,209,543,168 [1.02 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,024,209,543,168 [1.02 TB]
Namespace 1 Utilization: 199,396,958,208 [199 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 57014138e2
Local Time is: Sun Oct 1 12:17:27 2023 +08
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0037): Security Format Frmw_DL Self_Test Directvs
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 81 Celsius
Critical Comp. Temp. Threshold: 81 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 47 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 5,226,266 [2.67 TB]
Data Units Written: 2,395,764 [1.22 TB]
Host Read Commands: 52,953,136
Host Write Commands: 106,623,181
Controller Busy Time: 228
Power Cycles: 33
Power On Hours: 436
Unsafe Shutdowns: 21
Media and Data Integrity Errors: 0
Error Information Log Entries: 36
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 47 Celsius
Temperature Sensor 2: 49 Celsius
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0         36     0  0x000c  0x4004      -            0     0     -
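For anyone who wants to keep an eye on the same counters between crashes, here is a minimal Python sketch that pulls a few fields out of saved smartctl text and reports the nonzero ones. The list of fields in `WATCH` is my own pick and not an official health check, and the helper names `parse_health` and `nonzero` are made up for this example. It assumes the plain `Name: value` layout shown above.

```python
# Fields to report when nonzero. This selection is an assumption made for
# illustration, not a list taken from smartmontools.
WATCH = (
    "Critical Warning",
    "Media and Data Integrity Errors",
    "Unsafe Shutdowns",
    "Error Information Log Entries",
)


def parse_health(text):
    """Return {field: int} for the watched 'Name: value' lines in smartctl text."""
    found = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        parts = value.split()
        if not sep or name not in WATCH or not parts:
            continue
        # Values look like '21', '0x00' or '5,226,266 [2.67 TB]'.
        token = parts[0].replace(",", "")
        base = 16 if token.lower().startswith("0x") else 10
        found[name] = int(token, base)
    return found


def nonzero(found):
    """Keep only the counters that are not zero."""
    return {name: count for name, count in found.items() if count}


# Lines copied from the smartctl output in this post.
sample = """\
Critical Warning: 0x00
Unsafe Shutdowns: 21
Media and Data Integrity Errors: 0
Error Information Log Entries: 36
"""

print(nonzero(parse_health(sample)))
# {'Unsafe Shutdowns': 21, 'Error Information Log Entries': 36}
```

Comparing the result from one run to the next would show whether the error log entry count keeps climbing or only the unsafe shutdown count moves with each hang.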
Sys Log
Code:
2023-09-30T23:17:01.135707+08:00 pve CRON[1826125]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
2023-09-30T23:26:56.420028+08:00 pve smartd[765]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 71 to 72
2023-09-30T23:26:56.420136+08:00 pve smartd[765]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 55 to 54
2023-09-30T23:26:56.420159+08:00 pve smartd[765]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 45 to 46
2023-10-01T11:50:44.819060+08:00 pve kernel: [ 0.000000] Linux version 6.2.16-14-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-14 (>
2023-10-01T11:50:44.819093+08:00 pve kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.2.16-14-pve root=/dev/mapper/pve-root ro quiet
2023-10-01T11:50:44.819094+08:00 pve kernel: [ 0.000000] KERNEL supported cpus:
2023-10-01T11:50:44.819094+08:00 pve kernel: [ 0.000000] Intel GenuineIntel
2023-10-01T11:50:44.819095+08:00 pve kernel: [ 0.000000] AMD AuthenticAMD
2023-10-01T11:50:44.819095+08:00 pve kernel: [ 0.000000] Hygon HygonGenuine
2023-10-01T11:50:44.819096+08:00 pve kernel: [ 0.000000] Centaur CentaurHauls
2023-10-01T11:50:44.819102+08:00 pve kernel: [ 0.000000] zhaoxin Shanghai
2023-10-01T11:50:44.819103+08:00 pve kernel: [ 0.000000] x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
2023-10-01T11:50:44.819103+08:00 pve kernel: [ 0.000000] BIOS-provided physical RAM map:
2023-10-01T11:50:44.819103+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009efff] usable
2023-10-01T11:50:44.819104+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x000000000009f000-0x00000000000fffff] reserved
2023-10-01T11:50:44.819104+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000037f42fff] usable
2023-10-01T11:50:44.819105+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x0000000037f43000-0x0000000040953fff] reserved
2023-10-01T11:50:44.819106+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x0000000040954000-0x0000000040a1ffff] ACPI data
2023-10-01T11:50:44.819106+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x0000000040a20000-0x0000000040b5afff] ACPI NVS
2023-10-01T11:50:44.819106+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x0000000040b5b000-0x000000004173afff] reserved
2023-10-01T11:50:44.819107+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x000000004173b000-0x00000000417fefff] type 20
2023-10-01T11:50:44.819107+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x00000000417ff000-0x00000000417fffff] usable
2023-10-01T11:50:44.819113+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x0000000041800000-0x0000000047ffffff] reserved
2023-10-01T11:50:44.819114+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x0000000048e00000-0x000000004f7fffff] reserved
2023-10-01T11:50:44.819114+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x00000000c0000000-0x00000000cfffffff] reserved
2023-10-01T11:50:44.819114+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
2023-10-01T11:50:44.819115+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
2023-10-01T11:50:44.819115+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x00000000fed00000-0x00000000fed00fff] reserved
2023-10-01T11:50:44.819116+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x00000000fed20000-0x00000000fed7ffff] reserved
2023-10-01T11:50:44.819117+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
2023-10-01T11:50:44.819117+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
2023-10-01T11:50:44.819118+08:00 pve kernel: [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x00000008b07fffff] usable
2023-10-01T11:50:44.819118+08:00 pve kernel: [ 0.000000] NX (Execute Disable) protection: active
2023-10-01T11:50:44.819118+08:00 pve kernel: [ 0.000000] efi: EFI v2.70 by American Megatrends
2023-10-01T11:50:44.819119+08:00 pve kernel: [ 0.000000] efi: ACPI=0x40adc000 ACPI 2.0=0x40adc014 TPMFinalLog=0x40ae6000 SMBIOS=0x41577000 SMBIOS 3.0=0x41576000 MEMATTR=0x3024a018 ESRT=0x30252e98
2023-10-01T11:50:44.819120+08:00 pve kernel: [ 0.000000] efi: Remove mem68: MMIO range=[0xc0000000-0xcfffffff] (256MB) from e820 map
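Since nothing is logged while the machine is hung, one way to narrow down when each hang started is to look for long silences in the syslog. The sketch below is a rough Python example under a few assumptions: every entry starts with an ISO 8601 timestamp as in the extract above, the 2 hour threshold is an arbitrary value I chose, and `find_gaps` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def find_gaps(lines, threshold=timedelta(hours=2)):
    """Return (previous, next, gap) for consecutive entries further apart than threshold."""
    gaps = []
    prev = None
    for line in lines:
        stamp = line.split(" ", 1)[0]
        try:
            now = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # line does not start with a timestamp, skip it
        if prev is not None and now - prev > threshold:
            gaps.append((prev, now, now - prev))
        prev = now
    return gaps


# Timestamps and sources copied from the extract above; messages shortened.
sample = [
    "2023-09-30T23:17:01.135707+08:00 pve CRON[1826125]: (root) CMD",
    "2023-09-30T23:26:56.420159+08:00 pve smartd[765]: Device: /dev/sda [SAT]",
    "2023-10-01T11:50:44.819060+08:00 pve kernel: [ 0.000000] Linux version 6.2.16-14-pve",
]

for before, after, gap in find_gaps(sample):
    print(f"no entries between {before} and {after} ({gap})")
```

On the extract above this reports a single silence of a little over 12 hours, from the last smartd entry at 23:26:56 to the first kernel line of the next boot.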
dmesg error/warn
Code:
[ 0.000000] x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
[ 25.419445] EXT4-fs warning (device dm-9): ext4_multi_mount_protect:328: MMP interval 42 higher than expected, please wait.
[ 72.466888] EXT4-fs warning (device dm-8): ext4_multi_mount_protect:328: MMP interval 42 higher than expected, please wait.
[ 14.575842] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
[ 71.515171] audit: type=1400 audit(1696132300.682:26): apparmor="DENIED" operation="mount" class="mount" info="failed perms check" error=-13 profile="lxc-102_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=2165 comm="(nft)" srcname="/" flags="rw, rbind"
[ 71.532642] audit: type=1400 audit(1696132300.698:27): apparmor="DENIED" operation="mount" class="mount" info="failed perms check" error=-13 profile="lxc-102_</var/lib/lxc>" name="/dev/" pid=2174 comm="(sd-mkdcreds)" flags="rw, rslave"
[ 71.573588] audit: type=1400 audit(1696132300.742:28): apparmor="DENIED" operation="mount" class="mount" info="failed perms check" error=-13 profile="lxc-102_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=2205 comm="(networkd)" srcname="/" flags="rw, rbind"
[ 71.615099] audit: type=1400 audit(1696132300.782:29): apparmor="DENIED" operation="mount" class="mount" info="failed perms check" error=-13 profile="lxc-102_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=2218 comm="(resolved)" srcname="/" flags="rw, rbind"
[ 71.648644] audit: type=1400 audit(1696132300.814:30): apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-102_</var/lib/lxc>//&:lxc-102_<-var-lib-lxc>:unconfined" pid=2216 comm="apparmor_parser"
[ 71.678146] audit: type=1400 audit(1696132300.846:31): apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-102_</var/lib/lxc>//&:lxc-102_<-var-lib-lxc>:unconfined" pid=2222 comm="apparmor_parser"
[ 71.678165] audit: type=1400 audit(1696132300.846:32): apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-102_</var/lib/lxc>//&:lxc-102_<-var-lib-lxc>:unconfined" pid=2222 comm="apparmor_parser"
[ 71.771467] audit: type=1400 audit(1696132300.938:33): apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-102_</var/lib/lxc>//&:lxc-102_<-var-lib-lxc>:unconfined" pid=2223 comm="apparmor_parser"
[ 71.815537] audit: type=1400 audit(1696132300.982:34): apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-102_</var/lib/lxc>//&:lxc-102_<-var-lib-lxc>:unconfined" pid=2228 comm="apparmor_parser"