Help to diagnose random crash

crc-error-79 · Jul 9, 2023

Hello all,

my proxmox installation crashes after 2-3 days and I don't know why...

At the beginning I tough that were some issue during the backup on a Synology NAS via NFS but after disable the schedule the problem happened again..

The strange thing is that the entire system freezes and the hardware reset doesn't work.
I have to power off the machine (press the power button for few seconds) and then on again.

How could I do to find the issue?

System is
cpu i7 7700
mb asrock z270 itx
ram 32 gb
pcie dual nic 10gb x520
boot/main disk: dual kingston datacenter ssd 500gb in mirror.

Temperatures are 50-60°C for the cpu and 40 for the mb

I have 8 vms
- pfsense
- truenas (with 4 hdd passed)
- and the others are debian 11/12 with docker

System is 99% of the time idling, the big cpu usage is from pfsense (<10%) or jellyfin during transcoding (spikes >70% then it stabilizes around 40% because it uses the quicksync).

I think that the problem happen when system is not used (even pfsense is idling) but I not at 100% sure...

the syslog is clear (no errors or strange texts), like if I have normally reboot or after pushed the hardware reset.

Code:

Jul 09 18:17:01 zeus CRON[3443800]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 09 18:17:01 zeus CRON[3443801]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jul 09 18:17:01 zeus CRON[3443800]: pam_unix(cron:session): session closed for user root
Jul 09 18:21:51 zeus systemd[1]: Starting Cleanup of Temporary Directories...
Jul 09 18:21:51 zeus systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
Jul 09 18:21:51 zeus systemd[1]: Finished Cleanup of Temporary Directories.
-- Reboot --
Jul 09 18:41:39 zeus kernel: Linux version 5.15.108-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.108-1 (2023-06-17T09:41Z) ()
Jul 09 18:41:39 zeus kernel: Command line: initrd=\EFI\proxmox\5.15.108-1-pve\initrd.img-5.15.108-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on iommu=pt
Jul 09 18:41:39 zeus kernel: KERNEL supported cpus:

Cleverness · Jul 10, 2023

Hey, is this machine on Proxmox 8? I ran into a similar issue on one of my Intel Nuc nodes when upgrading. Anywhere from 30 minutes to 2 days it would freeze and require a physical reboot to recover. I even did a fresh install of 8 to see if maybe the upgrade process had an issue but it still happened and logs show no errors like yours.

I haven't found a solution yet so I reinstalled Proxmox 7.4 on this node and its been working normally ever since

mercapto · Jul 10, 2023

Although I have no solution for your issue, I can add the information that I also have random crash of the whole system since upgrading to V8.
Intel NUC system.
Could Not Link the freeze to certain actions. SSH Not reachable but system reacts still in hardware power Button and does shutoff.

crc-error-79 · Jul 10, 2023

Cleverness said:
Hey, is this machine on Proxmox 8? I ran into a similar issue on one of my Intel Nuc nodes when upgrading. Anywhere from 30 minutes to 2 days it would freeze and require a physical reboot to recover. I even did a fresh install of 8 to see if maybe the upgrade process had an issue but it still happened and logs show no errors like yours.

I haven't found a solution yet so I reinstalled Proxmox 7.4 on this node and its been working normally ever since

Ciao, it is a 7.4

I tried the 8 but I had issue with google coral on usb and stability issues with the zigbee on home assistant (on LXC), so I did a clean install with the 7.4 and I have restored/resinstall the vms.

mercapto said:
Although I have no solution for your issue, I can add the information that I also have random crash of the whole system since upgrading to V8.
Intel NUC system.
Could Not Link the freeze to certain actions. SSH Not reachable but system reacts still in hardware power Button and does shutoff.

The same problem I am having but with the 7.4 (no SSH, no ping, ho video & no hw reset, the only thing that works is remove the power plug or keep pressed the power button for few seconds)

Could this be related to some energy management settings enabled on bios? I mean cstates etc?

Also it is very strange because it happens only after 48-72 hours after the power on..

crc-error-79 · Jul 13, 2023

A little update:

After the last crash, I disabled the C-States and the deepsleep in S4-S5 in BIOS.
I don't know if the issue was caused by those 2 settings.. anyway now my uptime is 4 days and 2 hours a new record

I will keep it monitored, if pass the week I will add the [solved] to the post

mercapto · Jul 24, 2023

Small update:
I went back to kernel 5.15.108-1. Since then, system is stable again.

Fingers crossed!

crc-error-79 · Jul 24, 2023

mercapto said:
Small update:
I went back to kernel 5.15.108-1. Since then, system is stable again.

Fingers crossed!

Have you C-States and the deepsleep in S4-S5 enabled in your BIOS ?
Just to know if my problem was really related to these settings... Anyway since I disabled them I had 0 crash (I am still on 7.4)

mercapto · Jul 24, 2023

Have Not checked my BIOS settings. I'm on 8.0.3 and my problems occurred after upgrade to the version 8.

crc-error-79 · Jul 25, 2023

I have spoke too early..

It happened again, this time after 8 days and, coincidence or not, 3 days after I enabled the automatic backup (at 11.00 am) on the synology via nfs.
The crash happened between 8 and 9 am

Code:

Jul 25 06:37:48 zeus systemd[1]: Starting Daily apt upgrade and clean activities...
Jul 25 06:37:48 zeus systemd[1]: apt-daily-upgrade.service: Succeeded.
Jul 25 06:37:48 zeus systemd[1]: Finished Daily apt upgrade and clean activities.
Jul 25 06:54:26 zeus systemd[1]: Starting Daily apt download activities...
Jul 25 06:54:26 zeus systemd[1]: apt-daily.service: Succeeded.
Jul 25 06:54:26 zeus systemd[1]: Finished Daily apt download activities.
Jul 25 07:01:26 zeus smartd[2545]: Device: /dev/sde [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 61 to 62
Jul 25 07:01:26 zeus smartd[2545]: Device: /dev/sdf [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 56 to 58
Jul 25 07:17:01 zeus CRON[1617918]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 25 07:17:01 zeus CRON[1617923]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jul 25 07:17:01 zeus CRON[1617918]: pam_unix(cron:session): session closed for user root
Jul 25 07:17:47 zeus nfsidmap[1618965]: nss_getpwnam: name 'casa' not found in domain 'zerocinque.local'
Jul 25 07:31:26 zeus smartd[2545]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 59 to 60
Jul 25 07:31:26 zeus smartd[2545]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 61 to 62
Jul 25 08:01:26 zeus smartd[2545]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 62 to 61
Jul 25 08:01:26 zeus smartd[2545]: Device: /dev/sde [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 62 to 61
Jul 25 08:01:26 zeus smartd[2545]: Device: /dev/sdf [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 58 to 56
Jul 25 08:17:01 zeus CRON[1705853]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 25 08:17:01 zeus CRON[1705854]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jul 25 08:17:01 zeus CRON[1705853]: pam_unix(cron:session): session closed for user root
-- Reboot --
Jul 25 12:40:07 zeus kernel: Linux version 5.15.108-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.108-1 (2023-06-17T09:41Z) ()
Jul 25 12:40:07 zeus kernel: Command line: initrd=\EFI\proxmox\5.15.108-1-pve\initrd.img-5.15.108-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on iommu=pt
Jul 25 12:40:07 zeus kernel: KERNEL supported cpus:
Jul 25 12:40:07 zeus kernel:   Intel GenuineIntel

TOLF · Jul 25, 2023

Hello! I sadly join this group of desperate people that cannot manage to run a stable Proxmox8... Few weeks ago I upgraded to v8, and like everyone here, my node freezes after 24/48h of activity : no log, no log in netconsole (maybe my configuration is not correct...), no peak of activity for the VM, or backup or watever, just randomly freezes. I did not run autoremove after the upgrade, so I still have my kernel5. When I boot on 5.15.108-1-pve, the node is stable! Several days and no crash, working properly. But I don't see this as a solution because I'm scared that the OS become more and more buggy as I don't think debian12 is meant to work with kernel5. I don't know if the problem comes from the kernel or not, and I hope be get to see this proxmox8 stable for our machines... Meanwhile, I will use this kernel5, and pray for my system to not go crazy, and if it does, rollback to proxmox7.

I mostly have the same VMs as @crc-error-79, OPNsense + 2 debian12 + HomeAssistant. One particularity is I use PCI passthrough to pass a NIC to OPNsense, I have problem with the NIC not behind forwarded after VM reboot (and need full node reboot) but that's not the subject here. I also have PBS installed next to PVE but that should not be important as you might not have it installed and still get the problem.

Z170A GAMING PRO CARBON (MS-7A12)
Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
OS on 275GB Crucial_CT275MX3
ZFS RAID10 on 2 x ST2000VN004-2E41 and 2 x ST4000VN006-3CW1

crc-error-79 · Jul 25, 2023

TOLF said:
Hello! I sadly join this group of desperate people that cannot manage to run a stable Proxmox8... Few weeks ago I upgraded to v8, and like everyone here, my node freezes after 24/48h of activity : no log, no log in netconsole (maybe my configuration is not correct...), no peak of activity for the VM, or backup or watever, just randomly freezes. I did not run autoremove after the upgrade, so I still have my kernel5. When I boot on 5.15.108-1-pve, the node is stable! Several days and no crash, working properly. But I don't see this as a solution because I'm scared that the OS become more and more buggy as I don't think debian12 is meant to work with kernel5. I don't know if the problem comes from the kernel or not, and I hope be get to see this proxmox8 stable for our machines... Meanwhile, I will use this kernel5, and pray for my system to not go crazy, and if it does, rollback to proxmox7.

Hello and welcome to the club

I don't think it is related to the kernel, because on my previous installation (proxmox 7.4, before upgrade to 8 and rollback to 7.4) I used the kernel 6 without any issue.
I think that is something that 7.4 (with updates) and 8 have in common.
The next time it happens after the reboot could you check the syslog if even your system frezes during the cron job (like mine)?
Because I have the suspect that is something about the filesystem and its scrub/optimization

I have a similar setup to yours, mine is a Z270 + i7 7700 (so 1 generation above), maybe is something that affect these cpus' gens

TOLF · Jul 25, 2023

crc-error-79 said:
Hello and welcome to the club

I don't think it is related to the kernel, because on my previous installation (proxmox 7.4, before upgrade to 8 and rollback to 7.4) I used the kernel 6 without any issue.
I think that is something that 7.4 (with updates) and 8 have in common.
The next time it happens after the reboot could you check the syslog if even your system frezes during the cron job (like mine)?
Because I have the suspect that is something about the filesystem and its scrub/optimization

I have a similar setup to yours, mine is a Z270 + i7 7700 (so 1 generation above), maybe is something that affect these cpus' gens

I remember also having a cronjob as the last log before the crash. The thing is we cannot determine the exact hour of the crash, I should write a script that autoupdate a file with the current date to be sure. Here is the syslog before my last crash:

Code:

2023-07-24T02:17:01.257714+02:00 proxmox CRON[2514862]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
2023-07-24T02:25:36.982305+02:00 proxmox systemd[1]: Starting proxmox-backup-daily-update.service - Daily Proxmox Backup Server update and maintenance activit
ies...
2023-07-24T02:25:36.990514+02:00 proxmox systemd[1]: Starting pve-daily-update.service - Daily PVE download activities...
2023-07-24T02:25:36.990989+02:00 proxmox proxmox-daily-update[2527239]: starting apt-get update
2023-07-24T02:25:38.079371+02:00 proxmox proxmox-daily-update[2527239]: Get:1 http://security.debian.org bookworm-security InRelease [48.0 kB]
2023-07-24T02:25:38.079442+02:00 proxmox proxmox-daily-update[2527239]: Hit:2 http://ftp.fr.debian.org/debian bookworm InRelease
2023-07-24T02:25:38.079505+02:00 proxmox proxmox-daily-update[2527239]: Get:3 http://ftp.fr.debian.org/debian bookworm-updates InRelease [52.1 kB]
2023-07-24T02:25:38.079551+02:00 proxmox proxmox-daily-update[2527239]: Hit:4 http://download.proxmox.com/debian bookworm InRelease
2023-07-24T02:25:38.079571+02:00 proxmox proxmox-daily-update[2527239]: Hit:5 http://download.proxmox.com/debian/pbs bookworm InRelease
2023-07-24T02:25:38.079589+02:00 proxmox proxmox-daily-update[2527239]: Fetched 100 kB in 1s (179 kB/s)
2023-07-24T02:25:38.079608+02:00 proxmox proxmox-daily-update[2527239]: Reading package lists...
2023-07-24T02:25:38.121298+02:00 proxmox pveupdate[2527240]: <root@pam> starting task UPID:proxmox:00269299:009D830E:64BDC502:aptupdate::root@pam:
2023-07-24T02:25:38.147410+02:00 proxmox proxmox-daily-update[2527239]: TASK OK
2023-07-24T02:25:38.207628+02:00 proxmox systemd[1]: proxmox-backup-daily-update.service: Deactivated successfully.
2023-07-24T02:25:38.207730+02:00 proxmox systemd[1]: Finished proxmox-backup-daily-update.service - Daily Proxmox Backup Server update and maintenance activit
ies.
2023-07-24T02:25:38.207871+02:00 proxmox systemd[1]: proxmox-backup-daily-update.service: Consumed 1.107s CPU time.
2023-07-24T02:25:39.115508+02:00 proxmox pveupdate[2527897]: update new package list: /var/lib/pve-manager/pkgupdates
2023-07-24T02:25:40.903790+02:00 proxmox pveupdate[2527240]: <root@pam> end task UPID:proxmox:00269299:009D830E:64BDC502:aptupdate::root@pam: OK
2023-07-24T02:25:40.929182+02:00 proxmox systemd[1]: pve-daily-update.service: Deactivated successfully.
2023-07-24T02:25:40.929277+02:00 proxmox systemd[1]: Finished pve-daily-update.service - Daily PVE download activities.
2023-07-24T02:25:40.929442+02:00 proxmox systemd[1]: pve-daily-update.service: Consumed 3.409s CPU time.

Maybe it is a coincidence, but I think the host mainly crash during the night, usually around 2-3am, once a 8:30, once around 19:00, but usually during the night.

crc-error-79 · Jul 25, 2023

TOLF said:
The thing is we cannot determine the exact hour of the crash

Yes, but we can see what it was doing before.

In syslog you should have something similar to mine

My system crashed after 8.17 am, after that cronjob (I am not an expert I don't know what it was doing), then I have an "hole" in the log until I have hard-reset the pc when I arrived at home for lunch break

Code:

Jul 25 08:17:01 zeus CRON[1705853]: pam_unix(cron:session): session closed for user root
-- Reboot --
Jul 25 12:40:07 zeus kernel: Linux version 5.15.108-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.108-1 (2023-06-17T09:41Z) ()

The previous time instead, it successfully completed that cronjob and crashes during some operation on the filesystem

Code:

Jul 09 18:17:01 zeus CRON[3443800]: pam_unix(cron:session): session closed for user root
Jul 09 18:21:51 zeus systemd[1]: Starting Cleanup of Temporary Directories...
Jul 09 18:21:51 zeus systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
Jul 09 18:21:51 zeus systemd[1]: Finished Cleanup of Temporary Directories.
-- Reboot --
Jul 09 18:41:39 zeus kernel: Linux version 5.15.108-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.108-1 (2023-06-17T09:41Z) ()

Maybe today it failed during the startup of the system cleanup, so I don't have it in log

TOLF · Jul 25, 2023

This is crazy because you are using 7.4 with kernel 5, I'm using 8 with kernel 6 and get crashes, but no crashes with kernel 5. Two different setup with the same issue, I thought that was the kernel or something proxmox 8 do with kernel 6 that cause the crash but since you have the same problem with 7.4 I'm really confused. I will try to regenerate crashes, the problem is that I have a mail server I don't really want the server to be down for a full day.. (yeah no HA yet...).

I tired to surch in gz syslog, but the timestamp are weird and I cannot determine if it is a crash or not...

crc-error-79 · Jul 25, 2023

Yes it is very strange.

To me if it crashes I am angry a bit, but it is not the end of the world.
Because the worst thing it can happen is that lights and automations will not work ('cause of homeassistant down) and no internet (even if, sometimes it is not a bad thing). Other vms are less important services, so nothing mission critical, but just a small homelab.

I hope that it will not happen to you again, but if it would do, please check the syslog

TOLF · Jul 25, 2023

I rebooted the node at 17:00, At 17:30, I got a crash:

Code:

2023-07-25T17:17:01.587395+02:00 proxmox CRON[42019]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
2023-07-25T17:29:41.267391+02:00 proxmox smartd[2111]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 118
2023-07-25T17:29:41.441684+02:00 proxmox smartd[2111]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 119 to 102
2023-07-25T17:29:41.685429+02:00 proxmox proxmox proxmox-backup-proxy[2448]: write rrd data back to disk
2023-07-25T17:29:41.690491+02:00 proxmox proxmox proxmox-backup-proxy[2448]: starting rrd data sync
2023-07-25T17:29:41.695633+02:00 proxmox proxmox proxmox-backup-proxy[2448]: rrd journal successfully committed (25 files in 0.010 seconds)
2023-07-25T17:29:41.917443+02:00 proxmox smartd[2111]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 60 to 59
2023-07-25T17:29:41.917597+02:00 proxmox smartd[2111]: Device: /dev/sde [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 40 to 41
2023-07-25T19:11:47.014065+02:00 proxmox kernel: [    0.000000] Linux version 6.2.16-4-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Bin
utils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-5 (2023-07-14T17:53Z) ()

I will run a script to determine the exact time of the crash.

TOLF · Jul 25, 2023

OK got a crash, the script stops at 22:12:59 and here are the logs:

Code:

2023-07-25T22:11:48.798078+02:00 proxmox smartd[2157]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 66 to 67
2023-07-25T22:11:48.800498+02:00 proxmox smartd[2157]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 34 to 33
2023-07-25T22:11:48.976466+02:00 proxmox smartd[2157]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
2023-07-25T22:11:48.976612+02:00 proxmox smartd[2157]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 33 to 32
2023-07-25T22:11:49.082580+02:00 proxmox smartd[2157]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 75 to 76
2023-07-25T22:11:49.082695+02:00 proxmox smartd[2157]: Device: /dev/sdd [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 75 to 76
2023-07-25T22:11:49.184904+02:00 proxmox smartd[2157]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 62
2023-07-25T22:11:49.185017+02:00 proxmox smartd[2157]: Device: /dev/sde [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 39 to 38
2023-07-25T22:11:49.811193+02:00 proxmox proxmox proxmox-backup-proxy[2494]: write rrd data back to disk
2023-07-25T22:11:49.815975+02:00 proxmox proxmox proxmox-backup-proxy[2494]: starting rrd data sync
2023-07-25T22:11:49.820642+02:00 proxmox proxmox proxmox-backup-proxy[2494]: rrd journal successfully committed (25 files in 0.010 seconds)

Every 30min, smartd sends logs regarding temps of HDD, but no error

ulrich46 · Jul 25, 2023

This might be worth a shot...https://forum.proxmox.com/threads/vms-freeze-with-100-cpu.127459/post-575892

TOLF · Jul 26, 2023

I do have ballooning enable on my VMs, I'll try to disable it and see what happens. Also, I don't really have huge peaks of CPU/RAM that could cause a crash, it's just the host giving up randomly.

TOLF · Jul 26, 2023

Here is another crash. Now the host survive barely few hours, before time it was at least a day. It's getting worse for obscure reason :

Code:

2023-07-26T10:17:01.736299+02:00 proxmox CRON[107169]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
2023-07-26T10:17:03.176710+02:00 proxmox smartd[2053]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 62 to 60
2023-07-26T10:17:04.045766+02:00 proxmox proxmox proxmox-backup-proxy[2388]: write rrd data back to disk
2023-07-26T10:17:04.050466+02:00 proxmox proxmox proxmox-backup-proxy[2388]: starting rrd data sync
2023-07-26T10:17:04.075937+02:00 proxmox proxmox proxmox-backup-proxy[2388]: rrd journal successfully committed (25 files in 0.030 seconds)
2023-07-26T10:23:09.104249+02:00 proxmox pveproxy[2627]: worker exit
2023-07-26T10:23:09.126309+02:00 proxmox pveproxy[2614]: worker 2627 finished
2023-07-26T10:23:09.126415+02:00 proxmox pveproxy[2614]: starting 1 worker(s)
2023-07-26T10:23:09.126476+02:00 proxmox pveproxy[2614]: worker 116301 started
2023-07-26T10:30:02.522211+02:00 proxmox pvedaemon[2576]: <root@pam> successful auth for user 'root@pam'

Help to diagnose random crash

Member

Active Member

Member

Member

Member

Member

Member

Member

Member

Member

Member

Member

Member

Member

Member

Member

Member

New Member

Member

Member

We value your privacy