Proxmox Random Crashing how to Debug and Logs to View

osopolar · Sep 27, 2023

Hi guys,
I am getting random server crashes several days apart for no apparent reason (I mean the server is not even being used).
I have had a crash on PBS which needed a reboot, but I have also had 2 crashes while the server was doing nothing, each of which required a reboot.
As a noob on Proxmox, how do I start to track down the issue?

Where are logs located to check for this issue?

Chris · Sep 27, 2023

osopolar said:
Hi guys,
I am getting random server crashes several days apart for no apparent reason (I mean the server is not even being used).
I have had a crash on PBS which needed a reboot, but I have also had 2 crashes while the server was doing nothing, each of which required a reboot.
As a noob on Proxmox, how do I start to track down the issue?

Where are logs located to check for this issue?

Hi,
please check your systemd journal for errors. You can get a paginated output of the journal in reverse by running journalctl -r. If you know the approximate time of the crash, you can limit the output to that time interval by journalctl --since <DATETIME> --until <DATETIME>.
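
A minimal sketch of the journal workflow described above, assuming the journal is persistent (so entries from before the reboot survive) and using the crash window from this thread as placeholder timestamps. The helper names `journal_window` and `storage_errors` are hypothetical, not part of any tool; `-r`, `--since`, `--until`, `--no-pager`, `-b` and `-p` are standard journalctl options.

```shell
#!/bin/sh
# journal_window SINCE UNTIL: print all journal entries in a time window,
# without the pager, so the output can be piped into other tools.
journal_window() {
    journalctl --no-pager --since "$1" --until "$2"
}

# storage_errors: read log lines on stdin and keep only those that hint at
# disk or filesystem trouble (the pattern list is an assumption; extend as needed).
storage_errors() {
    grep -E -i 'nvme|EXT4-fs|I/O error|read-only'
}

# Typical use on the Proxmox host, as root:
#   journalctl -r                          # newest entries first, paginated
#   journalctl -b -1 -p err --no-pager     # priority "err" and worse from the previous boot
#   journal_window "2023-09-26 14:00:00" "2023-09-26 18:30:00" | storage_errors
```

Filtering after the fact keeps the full window available for context while still surfacing the few lines that usually matter for a hardware fault.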

osopolar · Sep 28, 2023

Chris said:
Hi,
please check your systemd journal for errors. You can get a paginated output of the journal in reverse by running journalctl -r. If you know the approximate time of the crash, you can limit the output to that time interval by journalctl --since <DATETIME> --until <DATETIME>.

root@proxmox:~# journalctl --since "$(date -d '26 SEPT 2023 14:00:00' +'%Y-%m-%d %H:%M:%S')" --until "$(date -d '26 SEPT 2023 18:30:00' +'%Y-%m-%d %H:%M:%S')"
Sep 26 14:17:01 proxmox CRON[1450634]: pam_unix(cron:session): session opened for user root(uid=0) b>
Sep 26 14:17:01 proxmox CRON[1450635]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 26 14:17:01 proxmox CRON[1450634]: pam_unix(cron:session): session closed for user root
Sep 26 14:22:31 proxmox systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary D>
Sep 26 14:22:31 proxmox systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Sep 26 14:22:31 proxmox systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary D>
Sep 26 14:22:31 proxmox systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deac>
Sep 26 15:17:01 proxmox CRON[1465539]: pam_unix(cron:session): session opened for user root(uid=0) b>
Sep 26 15:17:01 proxmox CRON[1465540]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 26 15:17:01 proxmox CRON[1465539]: pam_unix(cron:session): session closed for user root
Sep 26 16:17:01 proxmox CRON[1480473]: pam_unix(cron:session): session opened for user root(uid=0) b>
Sep 26 16:17:01 proxmox CRON[1480474]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 26 16:17:01 proxmox CRON[1480473]: pam_unix(cron:session): session closed for user root
Sep 26 17:01:38 proxmox pvestatd[949]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox>
Sep 26 17:01:38 proxmox pvestatd[949]: status update time (8.320 seconds)
Sep 26 17:01:48 proxmox kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STA>
Sep 26 17:01:48 proxmox kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Sep 26 17:01:48 proxmox kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off>
Sep 26 17:01:48 proxmox kernel: nvme 0000:07:00.0: Unable to change power state from D3cold to D0, d>
Sep 26 17:01:48 proxmox kernel: nvme nvme0: Disabling device after reset failure: -19
Sep 26 17:01:48 proxmox kernel: nvme0n1: detected capacity change from 1953525168 to 0
Sep 26 17:01:48 proxmox kernel: Aborting journal on device nvme0n1p1-8.
Sep 26 17:01:48 proxmox kernel: EXT4-fs error (device nvme0n1p1) in ext4_do_update_inode:5288: Journ>
Sep 26 17:01:48 proxmox kernel: Buffer I/O error on dev nvme0n1p1, logical block 121667584, lost syn>
Sep 26 17:01:48 proxmox kernel: EXT4-fs error (device nvme0n1p1): ext4_dirty_inode:6120: inode #5138>
Sep 26 17:01:48 proxmox kernel: JBD2: I/O error when updating journal superblock for nvme0n1p1-8.
Sep 26 17:01:48 proxmox kernel: EXT4-fs error (device nvme0n1p1) in ext4_dirty_inode:6121: IO failure
Sep 26 17:01:48 proxmox kernel: Buffer I/O error on dev nvme0n1p1, logical block 0, lost sync page w>
Sep 26 17:01:48 proxmox kernel: EXT4-fs (nvme0n1p1): I/O error while writing superblock
Sep 26 17:01:48 proxmox kernel: EXT4-fs error (device nvme0n1p1): ext4_journal_check_start:83: comm >
Sep 26 17:01:48 proxmox kernel: Buffer I/O error on dev nvme0n1p1, logical block 0, lost sync page w>
Sep 26 17:01:48 proxmox kernel: EXT4-fs (nvme0n1p1): I/O error while writing superblock
Sep 26 17:01:48 proxmox kernel: EXT4-fs (nvme0n1p1): Remounting filesystem read-only
Sep 26 17:01:48 proxmox pvestatd[949]: status update time (8.369 seconds)
Sep 26 17:03:57 proxmox smartd[634]: Device: /dev/nvme0, removed NVMe device: Resource temporarily u>
Sep 26 17:17:01 proxmox CRON[1495394]: pam_unix(cron:session): session opened for user root(uid=0) b>
Sep 26 17:17:01 proxmox CRON[1495395]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 26 17:17:01 proxmox CRON[1495394]: pam_unix(cron:session): session closed for user root
Sep 26 17:53:13 proxmox pvedaemon[976]: <root@pam> successful auth for user 'root@pam'
Sep 26 17:53:32 proxmox pvedaemon[976]: <root@pam> starting task UPID:proxmox:0016F4A0:02246035:6513>
Sep 26 17:53:34 proxmox pvedaemon[1504416]: update new package list: /var/lib/pve-manager/pkgupdates
Sep 26 17:53:37 proxmox pvedaemon[976]: <root@pam> end task UPID:proxmox:0016F4A0:02246035:65130C8C:>
Sep 26 17:56:44 proxmox pvedaemon[976]: <root@pam> successful auth for user 'root@pam'
Sep 26 17:57:48 proxmox systemd-logind[637]: The system will reboot now!
Sep 26 17:57:48 proxmox systemd-logind[637]: System is rebooting.
Sep 26 17:57:48 proxmox systemd[1]: 100.scope: Deactivated successfully.
Sep 26 17:57:48 proxmox systemd[1]: Stopped 100.scope.
Sep 26 17:57:48 proxmox systemd[1]: 100.scope: Consumed 1h 25min 16.810s CPU time.
Sep 26 17:57:48 proxmox systemd[1]: Removed slice qemu.slice - Slice /qemu.
Sep 26 17:57:48 proxmox systemd[1]: qemu.slice: Consumed 6h 40min 23.765s CPU time.
Sep 26 17:57:48 proxmox systemd[1]: Removed slice system-modprobe.slice - Slice /system/modprobe.
Sep 26 17:57:48 proxmox systemd[1]: Stopped target graphical.target - Graphical Interface.
Sep 26 17:57:48 proxmox systemd[1]: Stopped target multi-user.target - Multi-User System.
Sep 26 17:57:48 proxmox systemd[1]: Stopped target getty.target - Login Prompts.
Sep 26 17:57:48 proxmox systemd[1]: Stopped target rpc_pipefs.target.

Chris · Sep 28, 2023

Looks like your NVMe drive is at fault here. Check its SMART values and maybe replace it.

osopolar · Sep 28, 2023

Chris said:
Looks like your NVMe drive is at fault here. Check its SMART values and maybe replace it.

Hi Chris, I removed it yesterday and replaced it.
I noticed it was read-only, but now it is read-write, so I will monitor it.
It is a brand-new unit; I checked again this morning:
https://paste.centos.org/view/e46b4a76

osopolar · Oct 6, 2023

osopolar said:
Hi Chris, I removed it yesterday and replaced it.
I noticed it was read-only, but now it is read-write, so I will monitor it.
It is a brand-new unit; I checked again this morning:
https://paste.centos.org/view/e46b4a76

Hi Chris,
I had another crash yesterday afternoon:
https://i.imgur.com/m8AxoX9.png

osopolar · Oct 6, 2023

Does this latest crash still point to a faulty NVMe drive?

https://i.imgur.com/m8AxoX9.png

Chris · Oct 6, 2023

osopolar said:
Hi Chris,
I had another crash yesterday afternoon:
https://i.imgur.com/m8AxoX9.png

As you can see, the kernel complains that the NVMe controller disappears. I suspect that either the controller is overheating or the drive is faulty. Check the SMART values and temperatures. What drive is this?

osopolar · Oct 6, 2023

Chris said:
As you can see, the kernel complains that the NVMe controller disappears. I suspect that either the controller is overheating or the drive is faulty. Check the SMART values and temperatures. What drive is this?

Hi Chris,
Cheers for the quick reply. It's a Crucial P3 Plus 1TB M.2 PCIe Gen4 NVMe Internal SSD - Up to 5000MB/s - CT1000P3PSSD8.
How can I check the SMART values and temperatures you mention?
Is there a terminal command or an area in Proxmox where I can view these?

Chris · Oct 6, 2023

osopolar said:
Hi Chris,
Cheers for the quick reply. It's a Crucial P3 Plus 1TB M.2 PCIe Gen4 NVMe Internal SSD - Up to 5000MB/s - CT1000P3PSSD8.
How can I check the SMART values and temperatures you mention?
Is there a terminal command or an area in Proxmox where I can view these?

Okay, yes, maybe not the most reliable disk you bought there; there is a reason these are so cheap. You can find the SMART values and temperatures in the WebUI by going to <nodename> > Disks > <diskname> > Show S.M.A.R.T. values.
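
For the terminal side of the question, a sketch assuming the drive is `/dev/nvme0` (as in the kernel log earlier in the thread) and that smartmontools is installed (the journal shows smartd running on this host). `show_smart` and `extract_temperature` are hypothetical helper names, and the `Temperature:` line that the second helper matches is what smartctl typically prints for NVMe devices; check it against your own output.

```shell
#!/bin/sh
# show_smart [DEVICE]: print the full SMART/health report for a drive.
# Defaults to /dev/nvme0, which is an assumption based on the log in this thread.
show_smart() {
    smartctl -a "${1:-/dev/nvme0}"
}

# extract_temperature: read a smartctl report on stdin and print the number
# from the first line starting with "Temperature:" (degrees Celsius).
extract_temperature() {
    awk '/^Temperature:/ { print $2; exit }'
}

# Typical use on the Proxmox host, as root:
#   show_smart /dev/nvme0
#   show_smart /dev/nvme0 | extract_temperature
```

Running the second command periodically while the system is under load gives a rough idea of whether the controller runs hot before it drops off the bus.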

osopolar · Oct 6, 2023

Chris said:
Okay, yes, maybe not the most reliable disk you bought there; there is a reason these are so cheap. You can find the SMART values and temperatures in the WebUI by going to <nodename> > Disks > <diskname> > Show S.M.A.R.T. values.

Just so you know, I have it in this slot:

https://i.imgur.com/aj5syE5.jpg

osopolar · Oct 6, 2023

SMART Info
https://i.imgur.com/Vb0Kdva.png

I was thinking of sending this back and getting a Samsung SSD 870 EVO, 1 TB, Form Factor 2.5”, Intelligent Turbo Write, Magician 6 Software, Black (Internal SSD)

osopolar · Oct 6, 2023

This is a log from 1 Oct until today with 68 mentions of "failed":
https://logs.notifiarr.com/?537af6f3faff6044#3htmbHe2ms6SVwpw1ZhdPDyNX5PGeAkuwUxJyX3HHdqG

Chris · Oct 6, 2023

The SMART values look good, nothing strange there. The increased number of Error Log entries stems from some systems sending non-NVMe commands to these disks and can be ignored.

Before anything else, you can check whether disabling the power-saving feature, by setting nvme_core.default_ps_max_latency_us=0 pcie_aspm=off on the kernel command line, has an effect, or whether the problem is present with another kernel version as well.

You can find the steps to take to edit the kernel commandline in the docs https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysboot_edit_kernel_cmdline
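
A hedged sketch of that change, assuming the host boots via GRUB and keeps its kernel options in `/etc/default/grub`. Installations that boot via systemd-boot use `/etc/kernel/cmdline` and `proxmox-boot-tool refresh` instead, as described in the linked documentation. `add_kernel_params` is a hypothetical helper that only previews the result and never writes to the file.

```shell
#!/bin/sh
# Parameters suggested by the kernel message and in the post above.
PARAMS='nvme_core.default_ps_max_latency_us=0 pcie_aspm=off'

# add_kernel_params FILE: print FILE to stdout with PARAMS appended inside the
# GRUB_CMDLINE_LINUX_DEFAULT="..." line. FILE itself is left untouched.
add_kernel_params() {
    sed -E "s|^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"|\1 ${PARAMS}\"|" "$1"
}

# Typical use on the Proxmox host, as root:
#   add_kernel_params /etc/default/grub | grep GRUB_CMDLINE_LINUX_DEFAULT   # preview
#   (edit /etc/default/grub by hand so the line matches the preview)
#   update-grub                                                             # regenerate the GRUB config
#   reboot
#   cat /proc/cmdline                                                       # confirm the parameters are active
```

Previewing first and editing by hand keeps a typo from ending up in the boot configuration; checking `/proc/cmdline` after the reboot confirms the running kernel actually received both options.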

osopolar · Oct 10, 2023

Chris said:
The SMART values look good, nothing strange there. The increased number of Error Log entries stems from some systems sending non-NVMe commands to these disks and can be ignored.

Before anything else, you can check whether disabling the power-saving feature, by setting nvme_core.default_ps_max_latency_us=0 pcie_aspm=off on the kernel command line, has an effect, or whether the problem is present with another kernel version as well.

You can find the steps to take to edit the kernel commandline in the docs https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysboot_edit_kernel_cmdline

thusband · Oct 10, 2023

Thanks but now the drive shows as not active. I ran an update this morning and rebooted but that drive now shows enabled but not active. I also got another smart values email warning and the ATA errors have increased further. I tried restarting everything but the drive won't kick in. I went ahead and bought a new one so will change it out in a day or two.

Not sure about editing the kernel command line. It sounds a bit intimidating.

osopolar · Oct 10, 2023

thusband said:
Thanks but now the drive shows as not active. I ran an update this morning and rebooted but that drive now shows enabled but not active. I also got another smart values email warning and the ATA errors have increased further. I tried restarting everything but the drive won't kick in. I went ahead and bought a new one so will change it out in a day or two.

Not sure about editing the kernel command line. It sounds a bit intimidating.

Hello, how is this connected with my post?
I don’t understand, sorry.

thusband · Oct 10, 2023

osopolar said:
Hello, how is this connected with my post?
I don’t understand, sorry.

Sorry, my fault completely, I was following your thread because I have a similar problem and another thread going. I got yours confused with mine.

Again, my apologies, I'll shut up now.

osopolar · Oct 10, 2023

thusband said:
Sorry, my fault completely, I was following your thread because I have a similar problem and another thread going. I got yours confused with mine.

Again, my apologies, I'll shut up now.

No worries, I was just wondering how this impacted my issues.

Chris · Oct 11, 2023

osopolar said:
View attachment 56382

The output you showed here is from within a VM, not the host. How is this related to your issue with the nvme controller disappearing? You will have to give more details.

Also, in addition to ruling out a faulty disk, run an extended memory test to check whether your RAM modules are fine; they too might cause issues.
