Help diagnosing random crashes

I don't think that's my case: CPU utilization is always below 30% and I have no memory ballooning. But maybe it could be related to the RAM, because utilization is always above 85%, with spikes above 95%. Maybe during a ZFS scrub the RAM goes to 100% and everything crashes.. who knows..

Here is another crash. Now the host barely survives a few hours; before, it lasted at least a day. It's getting worse for some obscure reason.
I think your problem is different: in my case the system freezes and then I get the crash; the host never survives.
Maybe it is something related to the backup, because in your log there is always this

Code:
proxmox proxmox proxmox-backup-proxy[2494]: write rrd data back to disk

Maybe there is some corruption on the disk
 
Whoever figures out Proxmox crashing on newer Intel cores is going to be a rockstar. I've considered buying a support license but I'm not convinced that would solve the issue; I might be out the money and still have the same problem. Mine also crashes randomly, with no real messages in journalctl before the crash. I suspect a C-state or S-state issue, as others have mentioned.

And for those questioning their VMs or LXCs: you can disable all of that and run just Proxmox, and the system will still go down randomly. I have a crontab entry to reboot my machine every night at 3am. That has eliminated a lot of manual reboots, but I still need to do it occasionally. I also have it on a smart plug so I can remotely power cycle it.
 

I honestly don't know what to think..
After disabling C-states I achieved a record uptime of 10 days. Then I re-enabled the backup on the NAS and it crashed after 3 days..
I thought it was solved with C-states, but it wasn't..

I'm almost at X time: I have an uptime of 2 days and 9 hours, so it should happen again tomorrow (I think)

I bought a used Xeon E3-1265L with mobo and RAM for 2 bucks; it should arrive next week.
I'll create a test machine with the same VMs and LXCs and I'll see what happens.. maybe with different memory and a fresh installation that will solve it..
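For anyone wanting to try the same C-state experiment without a BIOS option: C-states can usually also be limited from the kernel command line. A sketch for a Proxmox install booting from ZFS via proxmox-boot-tool (the parameters are standard kernel options; adapt the file and refresh tool to your own boot setup, and verify against your kernel's documentation before relying on it):

```shell
# Append C-state limits to the kernel command line. ZFS/UEFI Proxmox installs
# keep it in /etc/kernel/cmdline; GRUB installs use /etc/default/grub instead.
# intel_idle.max_cstate=1 caps the intel_idle driver at C1;
# processor.max_cstate=1 does the same for the ACPI idle driver.
echo "$(cat /etc/kernel/cmdline) intel_idle.max_cstate=1 processor.max_cstate=1" > /etc/kernel/cmdline
proxmox-boot-tool refresh   # rewrite the boot entries, then reboot
```

After rebooting, the active command line can be checked with `cat /proc/cmdline`.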
 
I suspect the silence on the matter is that it's acknowledged but folks don't have a solution, so there is no benefit in discussing it.
 
Another couple of days, another crash out of nowhere..

Code:
Jul 30 15:53:22 zeus kernel: usb 2-3: reset SuperSpeed USB device number 2 using xhci_hcd
Jul 30 15:53:22 zeus kernel: usb 2-3: LPM exit latency is zeroed, disabling LPM.
Jul 30 16:10:24 zeus smartd[2959]: Device: /dev/sdf [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 58 to 56
Jul 30 16:17:01 zeus CRON[277080]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 30 16:17:01 zeus CRON[277081]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jul 30 16:17:01 zeus CRON[277080]: pam_unix(cron:session): session closed for user root
-- Reboot --
Jul 30 16:39:13 zeus kernel: Linux version 5.15.108-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.108-2 (2023-07-20T10:06Z) ()
Jul 30 16:39:13 zeus kernel: Command line: initrd=\EFI\proxmox\5.15.108-1-pve\initrd.img-5.15.108-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on iommu=pt
Jul 30 16:39:13 zeus kernel: KERNEL supported cpus:
Jul 30 16:39:13 zeus kernel:   Intel GenuineIntel
Jul 30 16:39:13 zeus kernel:   AMD AuthenticAMD
Jul 30 16:39:13 zeus kernel:   Hygon HygonGenuine
Jul 30 16:39:13 zeus kernel:   Centaur CentaurHauls
Jul 30 16:39:13 zeus kernel:   zhaoxin   Shanghai 
Jul 30 16:39:13 zeus kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
 
Hello everyone!
I have installed proxmox 8 here:
Code:
root@pve:~# pveversion --verbose
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 2.99.0-1
proxmox-backup-file-restore: 2.99.0-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.5
pve-cluster: 8.0.1
pve-container: 5.0.3
pve-docs: 8.0.3
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.4
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1
for a couple months now, and I have been having this problem too! Randomly the system reboots out of nowhere, for no apparent reason!
I even reformatted the system 2 times; I thought it was some random bug caused by a bad configuration, but I didn't even touch the system configs!
I have these 2 LXC containers running, and almost every 2 days my system restarts out of nowhere!
I thought it might be something hardware related, like a poor quality SSD, but the logs don't indicate that!
Characteristics of my server:
Dell Optiplex 920 SFF (with S4/S5 C-states enabled too)
4 x Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz
10.96% RAM usage (2.13 GiB of 19.41 GiB) DDR3-1600MHz
And the cheapest, lowest quality 60GB SSD


Here is the main log of my latest crash

Code:
Jul 31 02:00:11 pve pvescheduler[545641]: INFO: Backup job finished successfully
Jul 31 02:00:11 pve postfix/pickup[538212]: 6DDC231627CB: uid=0 from=<root>
Jul 31 02:00:11 pve postfix/cleanup[545717]: 6DDC231627CB: message-id=<20230731010011.6DDC231627CB@pve.localhome>
Jul 31 02:00:11 pve postfix/qmgr[859]: 6DDC231627CB: from=<root@pve.localhome>, size=3572, nrcpt=1 (queue active)
Jul 31 02:00:11 pve postfix/smtp[545719]: 6DDC231627CB: to=<marcelocoutinho_1994@hotmail.com>, relay=hotmail-com.olc.protection.outlook.com[104.47.11.97]:25, delay=0.52, delays=0.02/0.01/0.45/0.05, dsn=5.7.1, status=bounced (host hotmail-com.olc.protection.outlook.com[104.47.11.97] said: 550 5.7.1 Service unavailable, Client host [148.63.65.93] blocked using Spamhaus. To request removal from this list see https://www.spamhaus.org/query/ip/148.63.65.93 (AS3130). [DB5EUR02FT027.eop-EUR02.prod.protection.outlook.com 2023-07-31T01:00:11.924Z 08DB91279DE20B4D] (in reply to MAIL FROM command))
Jul 31 02:00:11 pve postfix/smtp[545719]: 6DDC231627CB: lost connection with hotmail-com.olc.protection.outlook.com[104.47.11.97] while sending RCPT TO
Jul 31 02:00:11 pve postfix/cleanup[545717]: EDCB231627CA: message-id=<20230731010011.EDCB231627CA@pve.localhome>
Jul 31 02:00:11 pve postfix/bounce[545720]: 6DDC231627CB: sender non-delivery notification: EDCB231627CA
Jul 31 02:00:11 pve postfix/qmgr[859]: EDCB231627CA: from=<>, size=6163, nrcpt=1 (queue active)
Jul 31 02:00:11 pve postfix/qmgr[859]: 6DDC231627CB: removed
Jul 31 02:00:11 pve proxmox-mail-fo[545722]: pve proxmox-mail-forward[545722]: forward mail to <mail@example.mail>
Jul 31 02:00:11 pve postfix/pickup[538212]: F29E931627CC: uid=65534 from=<root>
Jul 31 02:00:11 pve postfix/cleanup[545717]: F29E931627CC: message-id=<20230731010011.EDCB231627CA@pve.localhome>
Jul 31 02:00:11 pve postfix/local[545721]: EDCB231627CA: to=<root@pve.localhome>, relay=local, delay=0.02, delays=0/0/0/0.01, dsn=2.0.0, status=sent (delivered to command: /usr/bin/proxmox-mail-forward)
Jul 31 02:00:11 pve postfix/qmgr[859]: EDCB231627CA: removed
Jul 31 02:00:11 pve postfix/qmgr[859]: F29E931627CC: from=<root@pve.localhome>, size=6347, nrcpt=1 (queue active)
Jul 31 02:00:12 pve postfix/smtp[545719]: F29E931627CC: to=<mail@example.mail>, relay=none, delay=0.23, delays=0.01/0/0.23/0, dsn=5.4.4, status=bounced (Host or domain name not found. Name service error for name=example.mail type=AAAA: Host found but no data record of requested type)
Jul 31 02:00:12 pve postfix/qmgr[859]: F29E931627CC: removed
Jul 31 02:00:12 pve postfix/cleanup[545717]: 3749631627CA: message-id=<20230731010012.3749631627CA@pve.localhome>
Jul 31 02:17:01 pve CRON[548816]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 31 02:17:01 pve CRON[548817]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jul 31 02:17:01 pve CRON[548816]: pam_unix(cron:session): session closed for user root
Jul 31 02:39:56 pve pmxcfs[776]: [dcdb] notice: data verification successful
Jul 31 02:57:05 pve systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
Jul 31 02:57:05 pve systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Jul 31 02:57:05 pve systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
Jul 31 02:57:05 pve systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
Jul 31 03:10:01 pve CRON[558581]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 31 03:10:01 pve CRON[558582]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
Jul 31 03:10:01 pve CRON[558581]: pam_unix(cron:session): session closed for user root
Jul 31 03:17:01 pve CRON[559867]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 31 03:17:01 pve CRON[559868]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jul 31 03:17:01 pve CRON[559867]: pam_unix(cron:session): session closed for user root
Jul 31 03:39:56 pve pmxcfs[776]: [dcdb] notice: data verification successful
Jul 31 03:50:07 pve chronyd[712]: Selected source 194.117.47.44 (2.debian.pool.ntp.org)
Jul 31 04:17:01 pve CRON[570909]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 31 04:17:01 pve CRON[570910]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jul 31 04:17:01 pve CRON[570909]: pam_unix(cron:session): session closed for user root
Jul 31 04:39:56 pve pmxcfs[776]: [dcdb] notice: data verification successful
Jul 31 05:04:15 pve systemd[1]: Starting pve-daily-update.service - Daily PVE download activities...
Jul 31 05:04:16 pve pveupdate[579601]: <root@pam> starting task UPID:pve:0008D822:0114E5E4:64C732C0:aptupdate::root@pam:
Jul 31 05:04:18 pve pveupdate[579618]: command 'apt-get update' failed: exit code 100
Jul 31 05:04:18 pve pveupdate[579601]: command 'apt-get update' failed: exit code 100
Jul 31 05:04:18 pve pveupdate[579601]: <root@pam> end task UPID:pve:0008D822:0114E5E4:64C732C0:aptupdate::root@pam: command 'apt-get update' failed: exit code 100
Jul 31 05:04:18 pve systemd[1]: pve-daily-update.service: Deactivated successfully.
Jul 31 05:04:18 pve systemd[1]: Finished pve-daily-update.service - Daily PVE download activities.
Jul 31 05:04:18 pve systemd[1]: pve-daily-update.service: Consumed 1.709s CPU time.
Jul 31 05:17:01 pve CRON[582237]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 31 05:17:01 pve CRON[582238]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jul 31 05:17:01 pve CRON[582237]: pam_unix(cron:session): session closed for user root
Jul 31 05:39:56 pve pmxcfs[776]: [dcdb] notice: data verification successful
Jul 31 06:17:01 pve CRON[593275]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 31 06:17:01 pve CRON[593276]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jul 31 06:17:01 pve CRON[593275]: pam_unix(cron:session): session closed for user root
Jul 31 06:25:01 pve CRON[594753]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 31 06:25:01 pve CRON[594754]: (root) CMD (test -x /usr/sbin/anacron || { cd / && run-parts --report /etc/cron.daily; })
Jul 31 06:25:01 pve CRON[594753]: pam_unix(cron:session): session closed for user root
Jul 31 06:29:55 pve systemd[1]: Starting apt-daily-upgrade.service - Daily apt upgrade and clean activities...
Jul 31 06:29:55 pve systemd[1]: apt-daily-upgrade.service: Deactivated successfully.
Jul 31 06:29:55 pve systemd[1]: Finished apt-daily-upgrade.service - Daily apt upgrade and clean activities.
Jul 31 06:39:56 pve pmxcfs[776]: [dcdb] notice: data verification successful
Jul 31 07:17:01 pve CRON[604368]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 31 07:17:01 pve CRON[604369]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jul 31 07:17:01 pve CRON[604368]: pam_unix(cron:session): session closed for user root
-- Reboot --
 
Can you folks share what is in your /etc/crontab?

Rebooting is different than crashing. It seems like something is calling a reboot.

I have the following in mine to cause a reboot at 4:04am (the `+4` delays the shutdown by 4 minutes after the 4:00 trigger)
0 4 * * * root /sbin/shutdown -r +4
 
rebooting is different then crashing. Seems like something is calling a reboot
It does not look like a graceful shutdown and/or a normal reboot. journalctl prints -- Reboot -- when it detects that the system was restarted between log lines; it also shows -- Reboot -- after an unexpected power interruption, for example. But maybe I'm misunderstanding what you are saying?
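To see what (if anything) was logged right before each of those markers, you can dump the journal and grab the line preceding every `-- Reboot --`. A quick sketch (in practice `journalctl -b -1 -e`, which shows the tail of the previous boot, is the more direct tool):

```shell
# Print the last line logged before each "-- Reboot --" marker.
# Normally you'd pipe in `journalctl` output; a tiny inline sample
# stands in for the real journal here.
sample='Jul 30 16:17:01 zeus CRON: session closed
-- Reboot --
Jul 30 16:39:13 zeus kernel: Linux version 5.15.108-1-pve'

printf '%s\n' "$sample" \
  | grep -B1 -e '-- Reboot --' \
  | grep -v -e '-- Reboot --' -e '^--$'
# prints: Jul 30 16:17:01 zeus CRON: session closed
```

The second `grep -v` strips the markers themselves and the `--` group separators grep emits between matches.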
 
...Three
That's the magic number
Yes, it is
It's the magic number..

And after 3 days it crashed. This time is different: it seems the cause is the NAS (192.168.201.148); I presume it went into sleep mode or something similar..
If so, my suspicion that NFS is the cause of my problem is true..
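If a sleeping NAS really is the trigger, a default NFS hard mount will block I/O indefinitely while the server is away, which can wedge anything touching that mount. A hedged sketch of softer mount options (the export path is made up; note that soft mounts trade hangs for possible I/O errors or data loss on interrupted writes, which matters for a backup target):

```shell
# /etc/fstab entry (sketch): finite timeouts instead of an indefinite hard hang.
# "soft" makes the client give up after a bounded series of retries and return
# an I/O error; timeo is in tenths of a second, retrans is the retry count.
192.168.201.148:/volume1/backup  /mnt/nas  nfs  soft,timeo=150,retrans=3,vers=4  0  0
```

See the nfs(5) man page for the exact timeout semantics before copying these values.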


Code:
Aug 02 06:25:01 zeus CRON[2406477]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))
Aug 02 06:25:02 zeus CRON[2406476]: pam_unix(cron:session): session closed for user root
Aug 02 06:27:02 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:27:53 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:30:07 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:30:53 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:33:15 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:33:55 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:36:20 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:36:55 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:39:26 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:39:55 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:42:31 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:42:57 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:45:36 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:45:58 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:48:41 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:49:00 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:51:13 zeus systemd[1]: Starting Daily apt upgrade and clean activities...
Aug 02 06:51:14 zeus systemd[1]: apt-daily-upgrade.service: Succeeded.
Aug 02 06:51:14 zeus systemd[1]: Finished Daily apt upgrade and clean activities.
Aug 02 06:51:49 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:52:00 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:54:54 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:55:02 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:57:59 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
Aug 02 06:58:03 zeus kernel: nfs: server 192.168.201.148 not responding, timed out
-- Reboot --
Aug 02 07:31:15 zeus kernel: Linux version 5.15.108-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.108-2 (2023-07-20T10:06Z) ()
Aug 02 07:31:15 zeus kernel: Command line: initrd=\EFI\proxmox\5.15.108-1-pve\initrd.img-5.15.108-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on iommu=pt
Aug 02 07:31:15 zeus kernel: KERNEL supported cpus:
Aug 02 07:31:15 zeus kernel:   Intel GenuineIntel
Aug 02 07:31:15 zeus kernel:   AMD AuthenticAMD


Code:
root@zeus:~# pveversion --verbose
proxmox-ve: 7.4-1 (running kernel: 5.15.108-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-4
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.3-1
proxmox-backup-file-restore: 2.4.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
root@zeus:~#
 
And again.. this time after a few hours.. please help..

Code:
Aug 02 13:10:42 zeus systemd[927423]: Reached target Sockets.
Aug 02 13:10:42 zeus systemd[927423]: Reached target Basic System.
Aug 02 13:10:42 zeus systemd[927423]: Reached target Main User Target.
Aug 02 13:10:42 zeus systemd[927423]: Startup finished in 154ms.
Aug 02 13:10:42 zeus systemd[1]: Started User Manager for UID 0.
Aug 02 13:10:42 zeus systemd[1]: Started Session 24 of user root.
Aug 02 13:10:42 zeus login[927438]: ROOT LOGIN  on '/dev/pts/0'
Aug 02 13:17:01 zeus CRON[936500]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 02 13:17:01 zeus CRON[936501]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 02 13:17:01 zeus CRON[936500]: pam_unix(cron:session): session closed for user root
Aug 02 13:31:16 zeus smartd[2557]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 60 to 61
-- Reboot --
Aug 02 18:02:02 zeus kernel: Linux version 5.15.108-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.108-2 (2023-07-20T10:06Z) ()
Aug 02 18:02:02 zeus kernel: Command line: initrd=\EFI\proxmox\5.15.108-1-pve\initrd.img-5.15.108-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on iommu=pt
 
And after 3 days it crashed. This time is different: it seems the cause is the NAS (192.168.201.148); I presume it went into sleep mode or something similar..
If so, my suspicion that NFS is the cause of my problem is true..
You could test this by disconnecting your NAS and seeing if your Proxmox host still crashes consistently. But I think that's unlikely.
and again.. this time after few hours.. please help..
Start replacing (parts of) your hardware to see if you can narrow it down. I doubt that it's something you can fix by adjusting a software configuration setting.
 
You could try this by disconnecting your NAS and see if your Proxmox host consistently crashes. But I think that's unlikely.

Start replacing (parts of) your hardware to see if you can narrow it down. I doubt that it's something you can fix by adjusting a software configuration setting.

Thanks for reply.
I just disabled the daily backup, leaving the NFS connected; if I get another crash I will try removing it.

Start replacing (parts of) your hardware to see if you can narrow it down. I doubt that it's something you can fix by adjusting a software configuration setting.
The parts are OK, the disks are new (Kingston datacenter SSDs, 3 months old); now I am checking the RAM with memtest..

Anyway, as I said, I bought a used Xeon E3; on the weekend I will move my VMs onto it..


can you folks share what is in your /etc/crontab

this is mine

Code:
root@zeus:~# cat /etc/crontab
# /etc/crontab: system-wide crontab
# Unlike any other crontab you don't have to run the `crontab'
# command to install the new version when you edit this file
# and files in /etc/cron.d. These files also have username fields,
# that none of the other crontabs do.

SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# Example of job definition:
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name command to be executed
17 *    * * *   root    cd / && run-parts --report /etc/cron.hourly
25 6    * * *   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
47 6    * * 7   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly )
52 6    1 * *   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly )

Plus I added this via crontab -e

Code:
@reboot  echo "powersave" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

to set the powersave governor
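For what it's worth, a `@reboot` crontab entry is a bit fragile for this; a small systemd oneshot unit is the more conventional way to set the governor at boot. A sketch (the unit name `cpu-powersave.service` is made up):

```ini
# /etc/systemd/system/cpu-powersave.service (hypothetical name)
[Unit]
Description=Set CPU frequency scaling governor to powersave
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo powersave | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl daemon-reload && systemctl enable --now cpu-powersave.service`, then verify with `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`.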
 
It does not look like a graceful shutdown and/or a normal reboot. journalctl shows -- Reboot -- when it detects that the system was restarted in between log lines. It also shows -- Reboot -- when an unexpected power interruption happened, for example. But maybe I'm misunderstanding what you are saying?
When mine crashes I get a bunch of endless gibberish on the screen. The journal checks out but stops logging at that point. The kernel is dead, but the gibberish on the screen keeps scrolling.

Yours looks like something is causing a reboot while the kernel is still alive. Random failures on 13th gen Intel seem to be the norm at the moment. Some folks have had luck disabling C-states in the BIOS; not a valid solution though. If you can keep the cores busy you can extend the time between crashes.
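As a crude illustration of the "keep the cores busy" workaround (not a fix; it just discourages deep C-state entry and wastes power): spawn one low-priority busy loop per core. Here `timeout 2` bounds each loop to 2 seconds for demonstration; in practice people run them indefinitely or use a tool like stress-ng instead.

```shell
# Spawn one niced busy loop per CPU so cores stay out of deep idle states.
# `timeout 2` bounds the demo; drop it (or use stress-ng) for continuous use.
for _ in $(seq "$(nproc)"); do
    timeout 2 nice -n 19 sh -c 'while :; do :; done' &
done
wait
echo "busy loops done"
```

`nice -n 19` keeps the loops from stealing CPU time from real workloads.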
 
Plus I added on crontab -e

Code:
@reboot  echo "powersave" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

this to set the powersave governor
powersave, in my experience, has prevented a hard crash, but things will still hang, and you get quite a bit of CPU degradation since the max frequency of the cores gets dropped.

I'm going to guess moving to the xeon will solve your problem.
 
powersave in my experience has prevented a hard crash but things will still hang and you get quite a bit of CPU degradation the max frequency of the cores gets dropped.
I set it because I am trying to reduce power consumption, even if I don't know whether this "mode" is really effective..
Anyway, uptime is above 4 days with no crash since I disabled the automatic backup to the Syno NAS via NFS.

Code:
root@zeus:~# uptime
 19:08:24 up 4 days,  1:06,  1 user,  load average: 2.01, 2.13, 1.95


I'm going to guess moving to the xeon will solve your problem.
I hope so.. tomorrow I will start assembling the parts. I am going to use version 8, so fingers crossed ;)
 
The NFS causing the crash made me think that I may have a similar problem. Indeed, I have 4 VMs running: 2 Debian, 1 Home Assistant, and 1 OPNsense. I tried stopping everything but OPNsense and it crashed; but yesterday I stopped OPNsense and left the other 3 VMs running. Proxmox has been running ever since, with no crash! It is important to say that OPNsense uses PCI passthrough for a second NIC! I think that might be the cause of the crash in my case. I won't be able to work on this next week, but afterwards I'll try to use a bridge instead of passthrough. I'm confident that was the problem. I will keep you updated.
 
OK, after about a week: that was the problem. I was using PCI passthrough for a NIC. After adding a bridge and a virtual NIC, all my VMs are up and running and Proxmox is not crashing anymore!
 
I was wrong. Even though Proxmox managed to run for a full week with kernel 6 and OPNsense shut down, I tried using a bridge on my NIC and it crashed again. After shutting down OPNsense and unplugging the NIC, it continues to crash...
 
I had similar issues with Proxmox 8.x. I back up to a Synology NAS via NFS, and a host completely froze up during a backup.

I also had migration issues, both live and cold would randomly crash.
Backup restores displayed the same symptoms as migrations.

However:
I had 2 clusters: one using LVM and the other using ZFS. The ZFS cluster had no issues with backups and migrations; the LVM cluster was the one that would freeze during backups and migrations.

BTW these did not happen with the Proxmox 7.x series, even with the testing kernel and LVM.
 
OK, so I removed the oldest RAM sticks in my server and I have not experienced any crash since, so they must be faulty. I will try to put them back to confirm that was my problem, because memtest did not show any errors; but since the crash occurs after about a day, maybe that's why a 2-hour test didn't show anything.
 
