Happy New Year everyone!
I am currently encountering a recurring issue with one of my nodes (pve2). Both nodes are exactly the same hardware-wise. The random crash hasn't happened in a long time on the other node (pve1).
The node (pve2) crashes at random and then cannot be started via IPMI. It is a Supermicro board; when I try to power it on, it tries a couple of times and then fails to boot. When I unplug power from the system and plug it back in, I can power the system on via IPMI again. I would say the crash happens every 2-4 weeks.
The crash happened today (2024-01-02) at around 3:15am. Both nodes are behind a UPS, so it isn't a general power failure, since the other node (pve1) is still running.
I am not really sure what I should look for, since the logs don't say anything about the crash. If anyone can point me in a direction, I would greatly appreciate that.
I have looked at a couple of threads like this one https://forum.proxmox.com/threads/h...ter-crash-and-hopefully-fix-the-crash.131401/ but I am not sure if this applies to my setup.
Hardware (lshw -short)
Code:
/0 bus H12SSL-i
/0/28 memory 128GiB System Memory
/0/28/2 memory 32GiB DIMM DDR4 Synchronous Registered (Buffered) 3200 MHz (0.3 ns)
/0/28/3 memory 32GiB DIMM DDR4 Synchronous Registered (Buffered) 3200 MHz (0.3 ns)
/0/28/6 memory 32GiB DIMM DDR4 Synchronous Registered (Buffered) 3200 MHz (0.3 ns)
/0/28/7 memory 32GiB DIMM DDR4 Synchronous Registered (Buffered) 3200 MHz (0.3 ns)
/0/2e processor AMD EPYC 7272 12-Core Processor
/0/100/3.3/0 /dev/nvme0 storage Samsung SSD 970 PRO 512GB
/0/118/1.1 bridge Starship/Matisse GPP Bridge
/0/118/1.1/0 enp129s0 network 82599 10 Gigabit Network Connection
/0/120/3.1 bridge Starship/Matisse GPP Bridge
/0/120/3.1/0 enp193s0 network 82599 10 Gigabit Network Connection
pveversion
Code:
pve-manager/8.1.3/b46aac3b42da5d15 (running kernel: 6.5.11-7-pve)
last
Code:
reboot system boot 6.5.11-7-pve Tue Jan 2 09:40 still running
root pts/0 Wed Dec 20 16:18 - crash (12+17:21)
Code:
Jan 02 00:00:10 pve2 systemd[1]: Starting dpkg-db-backup.service - Daily dpkg database backup service...
Jan 02 00:00:10 pve2 systemd[1]: Starting logrotate.service - Rotate log files...
Jan 02 00:00:10 pve2 systemd[1]: dpkg-db-backup.service: Deactivated successfully.
Jan 02 00:00:10 pve2 systemd[1]: Finished dpkg-db-backup.service - Daily dpkg database backup service.
Jan 02 00:00:10 pve2 systemd[1]: Reloading pveproxy.service - PVE API Proxy Server...
Jan 02 00:00:11 pve2 pveproxy[501296]: send HUP to 1282
Jan 02 00:00:11 pve2 pveproxy[1282]: received signal HUP
Jan 02 00:00:11 pve2 pveproxy[1282]: server closing
Jan 02 00:00:11 pve2 pveproxy[1282]: server shutdown (restart)
Jan 02 00:00:11 pve2 systemd[1]: Reloaded pveproxy.service - PVE API Proxy Server.
Jan 02 00:00:11 pve2 systemd[1]: Reloading spiceproxy.service - PVE SPICE Proxy Server...
Jan 02 00:00:11 pve2 spiceproxy[501299]: send HUP to 1288
Jan 02 00:00:11 pve2 spiceproxy[1288]: received signal HUP
Jan 02 00:00:11 pve2 spiceproxy[1288]: server closing
Jan 02 00:00:11 pve2 spiceproxy[1288]: server shutdown (restart)
Jan 02 00:00:11 pve2 systemd[1]: Reloaded spiceproxy.service - PVE SPICE Proxy Server.
Jan 02 00:00:11 pve2 pvefw-logger[1743373]: received terminate request (signal)
Jan 02 00:00:11 pve2 pvefw-logger[1743373]: stopping pvefw logger
Jan 02 00:00:11 pve2 systemd[1]: Stopping pvefw-logger.service - Proxmox VE firewall logger...
Jan 02 00:00:11 pve2 systemd[1]: pvefw-logger.service: Deactivated successfully.
Jan 02 00:00:11 pve2 systemd[1]: Stopped pvefw-logger.service - Proxmox VE firewall logger.
Jan 02 00:00:11 pve2 systemd[1]: pvefw-logger.service: Consumed 5.447s CPU time.
Jan 02 00:00:11 pve2 spiceproxy[1288]: restarting server
Jan 02 00:00:11 pve2 spiceproxy[1288]: starting 1 worker(s)
Jan 02 00:00:11 pve2 spiceproxy[1288]: worker 501308 started
Jan 02 00:00:11 pve2 systemd[1]: Starting pvefw-logger.service - Proxmox VE firewall logger...
Jan 02 00:00:11 pve2 pvefw-logger[501310]: starting pvefw logger
Jan 02 00:00:11 pve2 systemd[1]: Started pvefw-logger.service - Proxmox VE firewall logger.
Jan 02 00:00:11 pve2 systemd[1]: logrotate.service: Deactivated successfully.
Jan 02 00:00:11 pve2 systemd[1]: Finished logrotate.service - Rotate log files.
Jan 02 00:00:12 pve2 pveproxy[1282]: restarting server
Jan 02 00:00:12 pve2 pveproxy[1282]: starting 3 worker(s)
Jan 02 00:00:12 pve2 pveproxy[1282]: worker 501315 started
Jan 02 00:00:12 pve2 pveproxy[1282]: worker 501316 started
Jan 02 00:00:12 pve2 pveproxy[1282]: worker 501317 started
Jan 02 00:00:16 pve2 spiceproxy[1743375]: worker exit
Jan 02 00:00:16 pve2 spiceproxy[1288]: worker 1743375 finished
Jan 02 00:00:17 pve2 pveproxy[1743380]: worker exit
Jan 02 00:00:17 pve2 pveproxy[1743381]: worker exit
Jan 02 00:00:17 pve2 pveproxy[1743379]: worker exit
Jan 02 00:00:17 pve2 pveproxy[1282]: worker 1743380 finished
Jan 02 00:00:17 pve2 pveproxy[1282]: worker 1743379 finished
Jan 02 00:00:17 pve2 pveproxy[1282]: worker 1743381 finished
Jan 02 00:01:08 pve2 pmxcfs[1120]: [dcdb] notice: data verification successful
Jan 02 00:17:01 pve2 CRON[535821]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 02 00:17:01 pve2 CRON[535822]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 02 00:17:01 pve2 CRON[535821]: pam_unix(cron:session): session closed for user root
Jan 02 00:24:01 pve2 CRON[550175]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 02 00:24:01 pve2 CRON[550176]: (root) CMD (if [ $(date +%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/trim ]; then /usr/lib/zfs-linux/trim; fi)
Jan 02 00:24:01 pve2 CRON[550175]: pam_unix(cron:session): session closed for user root
Jan 02 01:01:08 pve2 pmxcfs[1120]: [dcdb] notice: data verification successful
Jan 02 01:17:01 pve2 CRON[658709]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 02 01:17:01 pve2 CRON[658710]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 02 01:17:01 pve2 CRON[658709]: pam_unix(cron:session): session closed for user root
Jan 02 01:52:35 pve2 pmxcfs[1120]: [status] notice: received log
Jan 02 01:52:39 pve2 pmxcfs[1120]: [status] notice: received log
Jan 02 02:00:08 pve2 pmxcfs[1120]: [status] notice: received log
Jan 02 02:01:08 pve2 pmxcfs[1120]: [dcdb] notice: data verification successful
Jan 02 02:17:01 pve2 CRON[781579]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 02 02:17:01 pve2 CRON[781580]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 02 02:17:01 pve2 CRON[781579]: pam_unix(cron:session): session closed for user root
Jan 02 03:01:08 pve2 pmxcfs[1120]: [dcdb] notice: data verification successful
Jan 02 03:10:01 pve2 CRON[890169]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 02 03:10:01 pve2 CRON[890170]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
Jan 02 03:10:01 pve2 CRON[890169]: pam_unix(cron:session): session closed for user root
-- Reboot --
Jan 02 09:40:13 pve2 kernel: Linux version 6.5.11-7-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-7 (2023-12-05T09:44Z) ()
Jan 02 09:40:13 pve2 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.11-7-pve root=/dev/mapper/pve-root ro quiet
Jan 02 09:40:13 pve2 kernel: KERNEL supported cpus:
Jan 02 09:40:13 pve2 kernel: Intel GenuineIntel
Jan 02 09:40:13 pve2 kernel: AMD AuthenticAMD
Jan 02 09:40:13 pve2 kernel: Hygon HygonGenuine
Jan 02 09:40:13 pve2 kernel: Centaur CentaurHauls
Jan 02 09:40:13 pve2 kernel: zhaoxin Shanghai
Jan 02 09:40:13 pve2 kernel: BIOS-provided physical RAM map:
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000073ffffff] usable
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000000074000000-0x0000000074021fff] ACPI NVS
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000000074022000-0x0000000075daffff] usable
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000000075db0000-0x0000000075ffffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000000076000000-0x00000000a5892fff] usable
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000a5893000-0x00000000a7737fff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000a7738000-0x00000000a7822fff] ACPI data
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000a7823000-0x00000000a7ca3fff] ACPI NVS
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000a7ca4000-0x00000000a8d62fff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000a8d63000-0x00000000a8ec7fff] type 20
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000a8ec8000-0x00000000abffffff] usable
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000ac000000-0x00000000afffffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000b4000000-0x00000000b5ffffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000f4000000-0x00000000f5ffffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x00000000fe000000-0x00000000ffffffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000000100000000-0x000000204f2fffff] usable
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x000000204f300000-0x000000204fffffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000010000000000-0x00000100201fffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000020030000000-0x00000200403fffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000020060000000-0x00000200801fffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x0000038090000000-0x00000380a03fffff] reserved
Jan 02 09:40:13 pve2 kernel: BIOS-e820: [mem 0x000007fc00000000-0x000007fc03ffffff] reserved
Jan 02 09:40:13 pve2 kernel: NX (Execute Disable) protection: active
Jan 02 09:40:13 pve2 kernel: efi: EFI v2.7 by American Megatrends
Jan 02 09:40:13 pve2 kernel: efi: ACPI=0xa7c85000 ACPI 2.0=0xa7c85014 SMBIOS=0xa8a1c000 SMBIOS 3.0=0xa8a1b000 MEMATTR=0x9f420018 ESRT=0x9f421f98
Jan 02 09:40:13 pve2 kernel: efi: Remove mem37: MMIO range=[0xb4000000-0xb5ffffff] (32MB) from e820 map
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0xb4000000-0xb5ffffff] reserved
Jan 02 09:40:13 pve2 kernel: efi: Remove mem38: MMIO range=[0xf4000000-0xf5ffffff] (32MB) from e820 map
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0xf4000000-0xf5ffffff] reserved
Jan 02 09:40:13 pve2 kernel: efi: Remove mem39: MMIO range=[0xfe000000-0xffffffff] (32MB) from e820 map
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0xfe000000-0xffffffff] reserved
Jan 02 09:40:13 pve2 kernel: efi: Remove mem41: MMIO range=[0x10000000000-0x100201fffff] (514MB) from e820 map
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0x10000000000-0x100201fffff] reserved
Jan 02 09:40:13 pve2 kernel: efi: Remove mem42: MMIO range=[0x20030000000-0x200403fffff] (260MB) from e820 map
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0x20030000000-0x200403fffff] reserved
Jan 02 09:40:13 pve2 kernel: efi: Remove mem43: MMIO range=[0x20060000000-0x200801fffff] (514MB) from e820 map
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0x20060000000-0x200801fffff] reserved
Jan 02 09:40:13 pve2 kernel: efi: Remove mem44: MMIO range=[0x38090000000-0x380a03fffff] (260MB) from e820 map
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0x38090000000-0x380a03fffff] reserved
Jan 02 09:40:13 pve2 kernel: efi: Remove mem45: MMIO range=[0x7fc00000000-0x7fc03ffffff] (64MB) from e820 map
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0x7fc00000000-0x7fc03ffffff] reserved
Jan 02 09:40:13 pve2 kernel: secureboot: Secure boot disabled
Jan 02 09:40:13 pve2 kernel: SMBIOS 3.2.0 present.
Jan 02 09:40:13 pve2 kernel: DMI: Supermicro Super Server/H12SSL-i, BIOS 2.5 09/08/2022
Jan 02 09:40:13 pve2 kernel: tsc: Fast TSC calibration using PIT
Jan 02 09:40:13 pve2 kernel: tsc: Detected 2899.975 MHz processor
Jan 02 09:40:13 pve2 kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
Jan 02 09:40:13 pve2 kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
Jan 02 09:40:13 pve2 kernel: last_pfn = 0x204f300 max_arch_pfn = 0x400000000
Jan 02 09:40:13 pve2 kernel: MTRR map: 8 entries (3 fixed + 5 variable; max 20), built from 9 variable MTRRs
Jan 02 09:40:13 pve2 kernel: x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
Jan 02 09:40:13 pve2 kernel: last_pfn = 0xac000 max_arch_pfn = 0x400000000
Jan 02 09:40:13 pve2 kernel: found SMP MP-table at [mem 0x000fd260-0x000fd26f]
Jan 02 09:40:13 pve2 kernel: esrt: Reserving ESRT space from 0x000000009f421f98 to 0x000000009f421fd0.
Jan 02 09:40:13 pve2 kernel: e820: update [mem 0x9f421000-0x9f421fff] usable ==> reserved
Jan 02 09:40:13 pve2 kernel: Using GB pages for direct mapping
Jan 02 09:40:13 pve2 kernel: secureboot: Secure boot disabled
Jan 02 09:40:13 pve2 kernel: RAMDISK: [mem 0x30c85000-0x34639fff]
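To narrow down the crash window, the silence around the `-- Reboot --` marker in the log above can be measured. Below is a minimal Python sketch, assuming the short journal format shown here (timestamp prefix like `Jan 02 03:10:01`, English month names) and that the year is supplied by hand, since this format does not include it. `find_reboot_gaps` and `parse_ts` are hypothetical helper names, not part of any Proxmox or systemd tooling.

```python
from datetime import datetime

REBOOT_MARKER = "-- Reboot --"

def parse_ts(line, year):
    """Parse the 'Jan 02 03:10:01' prefix of a short-format journal line."""
    # The short format carries no year, so the caller supplies it (assumption).
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def find_reboot_gaps(lines, year):
    """Return (last_before, first_after, gap) for every reboot marker."""
    gaps = []
    last_ts = None
    pending = False  # True once a reboot marker was seen and not yet resolved
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == REBOOT_MARKER:
            pending = True
            continue
        try:
            ts = parse_ts(line, year)
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if pending and last_ts is not None:
            gaps.append((last_ts, ts, ts - last_ts))
        pending = False
        last_ts = ts
    return gaps

# Two lines taken from the log above, plus the reboot marker between them.
sample = [
    "Jan 02 03:10:01 pve2 CRON[890169]: pam_unix(cron:session): session closed for user root",
    "-- Reboot --",
    "Jan 02 09:40:13 pve2 kernel: Linux version 6.5.11-7-pve",
]

for before, after, gap in find_reboot_gaps(sample, 2024):
    print(f"last entry {before}, first entry after reboot {after}, gap {gap}")
```

For this log, the last entry before the crash is the 03:10:01 cron job and the first entry afterwards is the 09:40:13 kernel boot line, so the node went down somewhere after 03:10 without writing anything further to the journal.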
IPMI (Maintenance Event Log)
Code:
Severity Date/Time Interface User Source Description Category
OK 2024-01-02 10:36:41 Web ADMIN(ADMIN) 192.168.XXX.XXX [MEL-0129] Web login was successful. account
OK 2024-01-02 10:36:41 Redfish ADMIN(ADMIN) 192.168.XXX.XXX [MEL-0133] Redfish session was created successfully. account
OK 2024-01-02 10:36:15 IPMI ADMIN BMC [MEL-0149] Primary NTP server access successful. others
OK 2024-01-02 04:13:43 IPMI ADMIN(ADMIN) Localhost [MEL-0207] The host FW user password has been removed. account