Random crashes and not being able to start the host via IPMI

Happy New Year everyone!

I am currently encountering a recurring issue with one of my nodes (pve2). Both nodes are identical hardware-wise. The same random crash hasn't happened in a long time on the other node (pve1).

The node (pve2) just randomly crashes and then cannot be started via IPMI. It is a Supermicro board; when I try to power it on, it attempts to start a couple of times and then fails to boot. When I unplug power from the system and plug it back in, I can power it on via IPMI again. I would say the crash happens every 2-4 weeks.

The crash happened today (2024-01-02) at around 3:15am. Both nodes are behind a UPS, so it isn't a general power failure, since the other node (pve1) is still running.

I am not really sure what I should look for, since the logs don't really say anything about the crash. If anyone can point me in a direction, I would greatly appreciate it.

I have looked at a couple of threads like this one: https://forum.proxmox.com/threads/h...ter-crash-and-hopefully-fix-the-crash.131401/ but I am not sure whether it applies to my setup.
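
For anyone wanting to check the same things on their own node, something like the following should show them (this assumes a persistent journal and the ipmitool package):

Code:
# end of the previous boot's journal
journalctl -b -1 -e

# list all boots the journal knows about
journalctl --list-boots

# read the BMC's system event log from the OS
ipmitool sel elist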

Hardware (lshw -short)
Code:
/0                                     bus            H12SSL-i
/0/28                                  memory         128GiB System Memory
/0/28/2                                memory         32GiB DIMM DDR4 Synchronous Registered (Buffered) 3200 MHz (0.3 ns)
/0/28/3                                memory         32GiB DIMM DDR4 Synchronous Registered (Buffered) 3200 MHz (0.3 ns)
/0/28/6                                memory         32GiB DIMM DDR4 Synchronous Registered (Buffered) 3200 MHz (0.3 ns)
/0/28/7                                memory         32GiB DIMM DDR4 Synchronous Registered (Buffered) 3200 MHz (0.3 ns)
/0/2e                                  processor      AMD EPYC 7272 12-Core Processor
/0/100/3.3/0          /dev/nvme0       storage        Samsung SSD 970 PRO 512GB
/0/118/1.1                             bridge         Starship/Matisse GPP Bridge
/0/118/1.1/0          enp129s0         network        82599 10 Gigabit Network Connection
/0/120/3.1                             bridge         Starship/Matisse GPP Bridge
/0/120/3.1/0          enp193s0         network        82599 10 Gigabit Network Connection

pveversion
Code:
pve-manager/8.1.3/b46aac3b42da5d15 (running kernel: 6.5.11-7-pve)

last
Code:
reboot   system boot  6.5.11-7-pve     Tue Jan  2 09:40   still running
root     pts/0                         Wed Dec 20 16:18 - crash (12+17:21)

Code:
Jan 02 00:00:10 pve2 systemd[1]: Starting dpkg-db-backup.service - Daily dpkg database backup service...
Jan 02 00:00:10 pve2 systemd[1]: Starting logrotate.service - Rotate log files...
Jan 02 00:00:10 pve2 systemd[1]: dpkg-db-backup.service: Deactivated successfully.
Jan 02 00:00:10 pve2 systemd[1]: Finished dpkg-db-backup.service - Daily dpkg database backup service.
Jan 02 00:00:10 pve2 systemd[1]: Reloading pveproxy.service - PVE API Proxy Server...
Jan 02 00:00:11 pve2 pveproxy[501296]: send HUP to 1282
Jan 02 00:00:11 pve2 pveproxy[1282]: received signal HUP
Jan 02 00:00:11 pve2 pveproxy[1282]: server closing
Jan 02 00:00:11 pve2 pveproxy[1282]: server shutdown (restart)
Jan 02 00:00:11 pve2 systemd[1]: Reloaded pveproxy.service - PVE API Proxy Server.
Jan 02 00:00:11 pve2 systemd[1]: Reloading spiceproxy.service - PVE SPICE Proxy Server...
Jan 02 00:00:11 pve2 spiceproxy[501299]: send HUP to 1288
Jan 02 00:00:11 pve2 spiceproxy[1288]: received signal HUP
Jan 02 00:00:11 pve2 spiceproxy[1288]: server closing
Jan 02 00:00:11 pve2 spiceproxy[1288]: server shutdown (restart)
Jan 02 00:00:11 pve2 systemd[1]: Reloaded spiceproxy.service - PVE SPICE Proxy Server.
Jan 02 00:00:11 pve2 pvefw-logger[1743373]: received terminate request (signal)
Jan 02 00:00:11 pve2 pvefw-logger[1743373]: stopping pvefw logger
Jan 02 00:00:11 pve2 systemd[1]: Stopping pvefw-logger.service - Proxmox VE firewall logger...
Jan 02 00:00:11 pve2 systemd[1]: pvefw-logger.service: Deactivated successfully.
Jan 02 00:00:11 pve2 systemd[1]: Stopped pvefw-logger.service - Proxmox VE firewall logger.
Jan 02 00:00:11 pve2 systemd[1]: pvefw-logger.service: Consumed 5.447s CPU time.
Jan 02 00:00:11 pve2 spiceproxy[1288]: restarting server
Jan 02 00:00:11 pve2 spiceproxy[1288]: starting 1 worker(s)
Jan 02 00:00:11 pve2 spiceproxy[1288]: worker 501308 started
Jan 02 00:00:11 pve2 systemd[1]: Starting pvefw-logger.service - Proxmox VE firewall logger...
Jan 02 00:00:11 pve2 pvefw-logger[501310]: starting pvefw logger
Jan 02 00:00:11 pve2 systemd[1]: Started pvefw-logger.service - Proxmox VE firewall logger.
Jan 02 00:00:11 pve2 systemd[1]: logrotate.service: Deactivated successfully.
Jan 02 00:00:11 pve2 systemd[1]: Finished logrotate.service - Rotate log files.
Jan 02 00:00:12 pve2 pveproxy[1282]: restarting server
Jan 02 00:00:12 pve2 pveproxy[1282]: starting 3 worker(s)
Jan 02 00:00:12 pve2 pveproxy[1282]: worker 501315 started
Jan 02 00:00:12 pve2 pveproxy[1282]: worker 501316 started
Jan 02 00:00:12 pve2 pveproxy[1282]: worker 501317 started
Jan 02 00:00:16 pve2 spiceproxy[1743375]: worker exit
Jan 02 00:00:16 pve2 spiceproxy[1288]: worker 1743375 finished
Jan 02 00:00:17 pve2 pveproxy[1743380]: worker exit
Jan 02 00:00:17 pve2 pveproxy[1743381]: worker exit
Jan 02 00:00:17 pve2 pveproxy[1743379]: worker exit
Jan 02 00:00:17 pve2 pveproxy[1282]: worker 1743380 finished
Jan 02 00:00:17 pve2 pveproxy[1282]: worker 1743379 finished
Jan 02 00:00:17 pve2 pveproxy[1282]: worker 1743381 finished
Jan 02 00:01:08 pve2 pmxcfs[1120]: [dcdb] notice: data verification successful
Jan 02 00:17:01 pve2 CRON[535821]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 02 00:17:01 pve2 CRON[535822]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 02 00:17:01 pve2 CRON[535821]: pam_unix(cron:session): session closed for user root
Jan 02 00:24:01 pve2 CRON[550175]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 02 00:24:01 pve2 CRON[550176]: (root) CMD (if [ $(date +%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/trim ]; then /usr/lib/zfs-linux/trim; fi)
Jan 02 00:24:01 pve2 CRON[550175]: pam_unix(cron:session): session closed for user root
Jan 02 01:01:08 pve2 pmxcfs[1120]: [dcdb] notice: data verification successful
Jan 02 01:17:01 pve2 CRON[658709]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 02 01:17:01 pve2 CRON[658710]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 02 01:17:01 pve2 CRON[658709]: pam_unix(cron:session): session closed for user root
Jan 02 01:52:35 pve2 pmxcfs[1120]: [status] notice: received log
Jan 02 01:52:39 pve2 pmxcfs[1120]: [status] notice: received log
Jan 02 02:00:08 pve2 pmxcfs[1120]: [status] notice: received log
Jan 02 02:01:08 pve2 pmxcfs[1120]: [dcdb] notice: data verification successful
Jan 02 02:17:01 pve2 CRON[781579]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 02 02:17:01 pve2 CRON[781580]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 02 02:17:01 pve2 CRON[781579]: pam_unix(cron:session): session closed for user root
Jan 02 03:01:08 pve2 pmxcfs[1120]: [dcdb] notice: data verification successful
Jan 02 03:10:01 pve2 CRON[890169]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 02 03:10:01 pve2 CRON[890170]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
Jan 02 03:10:01 pve2 CRON[890169]: pam_unix(cron:session): session closed for user root
-- Reboot --
Jan 02 09:40:13 pve2 kernel: Linux version 6.5.11-7-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-7 (2023-12-05T09:44Z) ()
Jan 02 09:40:13 pve2 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.11-7-pve root=/dev/mapper/pve-root ro quiet
Jan 02 09:40:13 pve2 kernel: KERNEL supported cpus:
Jan 02 09:40:13 pve2 kernel:   Intel GenuineIntel
Jan 02 09:40:13 pve2 kernel:   AMD AuthenticAMD
Jan 02 09:40:13 pve2 kernel:   Hygon HygonGenuine
Jan 02 09:40:13 pve2 kernel:   Centaur CentaurHauls
Jan 02 09:40:13 pve2 kernel:   zhaoxin   Shanghai
Jan 02 09:40:13 pve2 kernel: BIOS-provided physical RAM map:
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000073ffffff] usable
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000000074000000-0x0000000074021fff] ACPI NVS
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000000074022000-0x0000000075daffff] usable
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000000075db0000-0x0000000075ffffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000000076000000-0x00000000a5892fff] usable
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000a5893000-0x00000000a7737fff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000a7738000-0x00000000a7822fff] ACPI data
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000a7823000-0x00000000a7ca3fff] ACPI NVS
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000a7ca4000-0x00000000a8d62fff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000a8d63000-0x00000000a8ec7fff] type 20
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000a8ec8000-0x00000000abffffff] usable
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000ac000000-0x00000000afffffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000b4000000-0x00000000b5ffffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000f4000000-0x00000000f5ffffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000fe000000-0x00000000ffffffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000000100000000-0x000000204f2fffff] usable
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x000000204f300000-0x000000204fffffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000010000000000-0x00000100201fffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000020030000000-0x00000200403fffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000020060000000-0x00000200801fffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000038090000000-0x00000380a03fffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x000007fc00000000-0x000007fc03ffffff] reserved
Jan 02 09:40:13 pve2 kernel: NX (Execute Disable) protection: active
Jan 02 09:40:13 pve2 kernel: efi: EFI v2.7 by American Megatrends
Jan 02 09:40:13 pve2 kernel: efi: ACPI=0xa7c85000 ACPI 2.0=0xa7c85014 SMBIOS=0xa8a1c000 SMBIOS 3.0=0xa8a1b000 MEMATTR=0x9f420018 ESRT=0x9f421f98
Jan 02 09:40:13 pve2 kernel: efi: Remove mem37: MMIO range=[0xb4000000-0xb5ffffff] (32MB) from e820 map
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0xb4000000-0xb5ffffff] reserved
Jan 02 09:40:13 pve2 kernel: efi: Remove mem38: MMIO range=[0xf4000000-0xf5ffffff] (32MB) from e820 map
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0xf4000000-0xf5ffffff] reserved
Jan 02 09:40:13 pve2 kernel: efi: Remove mem39: MMIO range=[0xfe000000-0xffffffff] (32MB) from e820 map
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0xfe000000-0xffffffff] reserved
Jan 02 09:40:13 pve2 kernel: efi: Remove mem41: MMIO range=[0x10000000000-0x100201fffff] (514MB) from e820 map
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0x10000000000-0x100201fffff] reserved
Jan 02 09:40:13 pve2 kernel: efi: Remove mem42: MMIO range=[0x20030000000-0x200403fffff] (260MB) from e820 map
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0x20030000000-0x200403fffff] reserved
Jan 02 09:40:13 pve2 kernel: efi: Remove mem43: MMIO range=[0x20060000000-0x200801fffff] (514MB) from e820 map
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0x20060000000-0x200801fffff] reserved
Jan 02 09:40:13 pve2 kernel: efi: Remove mem44: MMIO range=[0x38090000000-0x380a03fffff] (260MB) from e820 map
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0x38090000000-0x380a03fffff] reserved
Jan 02 09:40:13 pve2 kernel: efi: Remove mem45: MMIO range=[0x7fc00000000-0x7fc03ffffff] (64MB) from e820 map
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0x7fc00000000-0x7fc03ffffff] reserved
Jan 02 09:40:13 pve2 kernel: secureboot: Secure boot disabled
Jan 02 09:40:13 pve2 kernel: SMBIOS 3.2.0 present.
Jan 02 09:40:13 pve2 kernel: DMI: Supermicro Super Server/H12SSL-i, BIOS 2.5 09/08/2022
Jan 02 09:40:13 pve2 kernel: tsc: Fast TSC calibration using PIT
Jan 02 09:40:13 pve2 kernel: tsc: Detected 2899.975 MHz processor
Jan 02 09:40:13 pve2 kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
Jan 02 09:40:13 pve2 kernel: last_pfn = 0x204f300 max_arch_pfn = 0x400000000
Jan 02 09:40:13 pve2 kernel: MTRR map: 8 entries (3 fixed + 5 variable; max 20), built from 9 variable MTRRs
Jan 02 09:40:13 pve2 kernel: x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
Jan 02 09:40:13 pve2 kernel: last_pfn = 0xac000 max_arch_pfn = 0x400000000
Jan 02 09:40:13 pve2 kernel: found SMP MP-table at [mem 0x000fd260-0x000fd26f]
Jan 02 09:40:13 pve2 kernel: esrt: Reserving ESRT space from 0x000000009f421f98 to 0x000000009f421fd0.
Jan 02 09:40:13 pve2 kernel: e820: update [mem 0x9f421000-0x9f421fff] usable ==> reserved
Jan 02 09:40:13 pve2 kernel: Using GB pages for direct mapping
Jan 02 09:40:13 pve2 kernel: secureboot: Secure boot disabled
Jan 02 09:40:13 pve2 kernel: RAMDISK: [mem 0x30c85000-0x34639fff]

IPMI (Maintenance Event Log)
Code:
Severity    Date/Time    Interface    User    Source    Description    Category
OK    2024-01-02 10:36:41    Web    ADMIN(ADMIN)    192.168.XXX.XXX    [MEL-0129] Web login was successful.    account
OK    2024-01-02 10:36:41    Redfish    ADMIN(ADMIN)    192.168.XXX.XXX    [MEL-0133] Redfish session was created successfully.    account
OK    2024-01-02 10:36:15    IPMI    ADMIN    BMC    [MEL-0149] Primary NTP server access successful.    others
OK    2024-01-02 04:13:43    IPMI    ADMIN(ADMIN)    Localhost    [MEL-0207] The host FW user password has been removed.    account
 
I don't think this is an OS problem. Your description sounds more like a hardware problem.

Who assembled the server? What kind of chassis is this? Do you have the latest BIOS and BMC versions? Have you ever tried resetting BIOS and BMC?
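
If the BMC itself is wedged, you can sometimes recover it out of band without pulling power; a sketch with ipmitool (BMC address and credentials are placeholders):

Code:
# query the power state through the BMC
ipmitool -I lanplus -H <bmc-ip> -U ADMIN -P <password> chassis power status

# cold-reset the BMC itself, without touching host power
ipmitool -I lanplus -H <bmc-ip> -U ADMIN -P <password> mc reset cold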
 
I don't think this is an OS problem. Your description sounds more like a hardware problem.

Who assembled the server? What kind of chassis is this? Do you have the latest BIOS and BMC versions? Have you ever tried resetting BIOS and BMC?
I assembled the system myself. It is a 3U chassis from Intertech.

I reinstalled Proxmox on both nodes in October. I installed the latest BIOS and BMC firmware in July and also reset them. I saw that they released a new version in October.
BMC 01.01.10
BIOS Firmware Version 2.5

The crashes happened before too, so that didn't fix it.

What could cause this kind of hardware problem? After all, when I unplug the system and plug it back in, I can boot it again.
 
I spy a consumer NVMe. Is this the only drive in your PVE host, containing both the OS and VMs? It could be some sort of dying cells or a heat problem with the drive.

Besides that, there are several possible causes for crashes: faulty mainboard components, PSU, cooling, etc. Can you boot a live Linux system and run some heavy stress tests? Usually you can choose between different scenarios (RAM or CPU only, both, etc.).
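
If it is a CPU/RAM/PCIe fault, rasdaemon may record machine-check events that never make it into the plain syslog; a minimal sketch with the Debian packages:

Code:
apt install rasdaemon
systemctl enable --now rasdaemon

# later, summarize any recorded RAS/MCE events
ras-mc-ctl --summary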

Does your IPMI log contain any useful information/warnings?
 
I spy a consumer NVMe. Is this the only drive in your PVE host, containing both the OS and VMs? It could be some sort of dying cells or a heat problem with the drive.

Besides that, there are several possible causes for crashes: faulty mainboard components, PSU, cooling, etc. Can you boot a live Linux system and run some heavy stress tests? Usually you can choose between different scenarios (RAM or CPU only, both, etc.).

Does your IPMI log contain any useful information/warnings?
Yes, it is a consumer NVMe, because the VMs are all on a separate storage server. There aren't any VMs stored directly on Proxmox. I didn't think Proxmox would do any kind of heavy I/O to it.
If it is the NVMe, shouldn't there be some record of it somewhere in the logs?

Can you recommend an enterprise NVMe that is suitable for Proxmox?

Do you have a particular live Linux system in mind that I can use for stress testing?

I checked the IPMI maintenance event log, but those were the last 4 entries. Everything else is from 2023-12-18.
 
Yes, it is a consumer NVMe, because the VMs are all on a separate storage server. There aren't any VMs stored directly on Proxmox. I didn't think Proxmox would do any kind of heavy I/O to it.
Proxmox writes a lot for logs and graphs, but if you don't run VMs on it, it's probably fine (until it wears out). Proxmox also runs fine from an old HDD (which typically does not have issues with many writes).
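
If you want actual numbers, iostat can show the sustained write rate to the root disk; a sketch assuming the sysstat package and the device name from your lshw output:

Code:
apt install sysstat
# per-device throughput, refreshed every 10 seconds
iostat -d -h nvme0n1 10
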
If it is the NVMe, shouldn't there be some record of it somewhere in the logs?
I would expect the Proxmox logs (run journalctl and use the arrow keys) to be on that drive, and if it (temporarily) broke down, then there might be no logs of it.
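
One way to make sure the last messages survive a dying or wedged root disk is to mirror them to the other node; a minimal sketch with rsyslog (the target IP is a placeholder for pve1):

Code:
apt install rsyslog
# forward a copy of everything to pve1 via UDP
echo '*.* @192.168.1.10:514' > /etc/rsyslog.d/forward.conf
systemctl restart rsyslog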
 
I spy a consumer NVMe. Is this the only drive in your PVE host, containing both the OS and VMs? It could be some sort of dying cells or a heat problem with the drive.
That sounds quite plausible and could explain why the system doesn't want to start again straight away. If the NVMe is overheated, it will take a moment to cool down. On the other hand, I would also expect a message about this in the syslog, but that is not the case.

Do you have monitoring in place that shows you the temperature progression?
Otherwise, have you checked all cables for correct seating? Have you checked the power supply for defects? Did you perhaps misplace a standoff that is now bridging something somewhere?

Maybe you can swap parts between the two servers; then you can work by elimination and see whether the error moves with them or not.
 
Proxmox writes a lot for logs and graphs, but if you don't run VMs on it, it's probably fine (until it wears out). Proxmox also runs fine from an old HDD (which typically does not have issues with many writes).

I would expect the Proxmox logs (run journalctl and use the arrow keys) to be on that drive, and if it (temporarily) broke down, then there might be no logs of it.
It now shows 1% wear after 2 years of use. Temperature sensors 1 and 2 report 35°C and 44°C (smartctl).

The room where the hosts are located is cooled to 21°C. CPU temps are never above 45°C.

Is a SATA SSD more reliable for Proxmox than an NVMe drive?

smartctl

Code:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.11-7-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 PRO 512GB
Serial Number:                      S5JYNS0RB09598D
Firmware Version:                   1B2QEXP7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Utilization:            191,673,315,328 [191 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5b11407dbb
Local Time is:                      Tue Jan  2 15:12:41 2024 CET
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0037):   Security Format Frmw_DL Self_Test Directvs
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     81 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        35 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    45,982,830 [23.5 TB]
Data Units Written:                 56,856,580 [29.1 TB]
Host Read Commands:                 190,837,697
Host Write Commands:                938,543,267
Controller Busy Time:               1,696
Power Cycles:                       49
Power On Hours:                     5,876
Unsafe Shutdowns:                   13
Media and Data Integrity Errors:    0
Error Information Log Entries:      68
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               35 Celsius
Temperature Sensor 2:               44 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0         68     0  0x000c  0x4004      -            0     0     -
 
That sounds quite plausible and could explain why the system doesn't want to start again straight away. If the NVMe is overheated, it will take a moment to cool down. On the other hand, I would also expect a message about this in the syslog, but that is not the case.

Do you have monitoring in place that shows you the temperature progression?
Otherwise, have you checked all cables for correct seating? Have you checked the power supply for defects? Did you perhaps misplace a standoff that is now bridging something somewhere?

Maybe you can swap parts between the two servers; then you can work by elimination and see whether the error moves with them or not.
One of the crashes happened on a Sunday morning and I tried starting the system via IPMI on Monday morning. It wouldn't boot. I had to unplug the entire system and plug it back in; then I was able to start it via IPMI. The NVMe should've cooled down by then. Maybe the motherboard was stuck somewhere and couldn't find the NVMe anymore.

I checked all the cables in October, when I did the full reset of the Proxmox cluster.

Temperatures from the NVMe seemed normal when I checked them via smartctl.
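
To rule temperature out properly, I could log it over time instead of spot-checking; a small sketch I might run in tmux (device path as in my smartctl output):

Code:
# append the NVMe temperature readings once a minute
while true; do
    echo "$(date -Is) $(smartctl -A /dev/nvme0 | grep '^Temperature:')"
    sleep 60
done >> /var/log/nvme-temp.log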
 
Stresslinux.org provides bootable stress-test environments, to give just one example.
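
From any live system you can also drive the load directly with stress-ng; one possible invocation (tune duration and memory share to taste):

Code:
# load all CPUs plus 4 memory workers using 75% of RAM, for 4 hours
stress-ng --cpu 0 --vm 4 --vm-bytes 75% --timeout 4h --metrics-brief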

As @leesteken mentioned, there's no difference between SSDs and NVMe drives if both are enterprise grade. We often use PM17XX models as NVMe drives for PVE. These are PCIe cards, but even in specially cooled racks and datacenters they get pretty hot, although they have massive cooling plates.

Last year I had a Supermicro board (also AMD EPYC) which went wild until I completely reflashed everything (BIOS and IPMI) with the same versions as previously installed. To this day I don't know what the reason was (and neither did Supermicro support).
 
