Proxmox Random Crashing: How to Debug and Which Logs to View

osopolar

Member
Nov 3, 2022
Hi guys,
I am getting random server crashes several days apart for no apparent reason (what I mean is, the server is not even being used when it happens).
I have had one crash on PBS which needed a reboot, but I have also had 2 crashes while the server was doing nothing, and those required a reboot as well.
As a noob on Proxmox, how do I start to track down the issue?

Where are the logs located that I should check for this issue?
 
Hi,
please check your systemd journal for errors. You can get a paginated output of the journal in reverse by running journalctl -r. If you know the approximate time of the crash, you can limit the output to that time interval by journalctl --since <DATETIME> --until <DATETIME>.
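
A minimal sketch of the journalctl calls described above, wrapped in two small shell functions. The function names crash_window and previous_boot_errors are made up for illustration; the options used (--since, --until, -r, -b, -p, --no-pager) are standard journalctl options. Looking at the previous boot only shows something if the journal from before the reboot was kept on disk, which is an assumption about the setup.

```shell
# Show journal entries between two timestamps ("YYYY-MM-DD HH:MM:SS").
crash_window() {
    journalctl --no-pager --since "$1" --until "$2"
}

# Show messages of priority "err" and worse from the previous boot,
# newest first. Needs a journal that survived the reboot.
previous_boot_errors() {
    journalctl --no-pager -r -b -1 -p err
}

# Example (run as root on the Proxmox host):
#   crash_window "2023-09-26 14:00:00" "2023-09-26 18:30:00"
#   previous_boot_errors
```

After a crash that needed a hard reset, the last lines of the previous boot are usually the most interesting ones, which is why the second function reads in reverse order.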
 
root@proxmox:~# journalctl --since "$(date -d '26 SEPT 2023 14:00:00' +'%Y-%m-%d %H:%M:%S')" --until "$(date -d '26 SEPT 2023 18:30:00' +'%Y-%m-%d %H:%M:%S')"
Sep 26 14:17:01 proxmox CRON[1450634]: pam_unix(cron:session): session opened for user root(uid=0) b>
Sep 26 14:17:01 proxmox CRON[1450635]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 26 14:17:01 proxmox CRON[1450634]: pam_unix(cron:session): session closed for user root
Sep 26 14:22:31 proxmox systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary D>
Sep 26 14:22:31 proxmox systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Sep 26 14:22:31 proxmox systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary D>
Sep 26 14:22:31 proxmox systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deac>
Sep 26 15:17:01 proxmox CRON[1465539]: pam_unix(cron:session): session opened for user root(uid=0) b>
Sep 26 15:17:01 proxmox CRON[1465540]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 26 15:17:01 proxmox CRON[1465539]: pam_unix(cron:session): session closed for user root
Sep 26 16:17:01 proxmox CRON[1480473]: pam_unix(cron:session): session opened for user root(uid=0) b>
Sep 26 16:17:01 proxmox CRON[1480474]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 26 16:17:01 proxmox CRON[1480473]: pam_unix(cron:session): session closed for user root
Sep 26 17:01:38 proxmox pvestatd[949]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox>
Sep 26 17:01:38 proxmox pvestatd[949]: status update time (8.320 seconds)
Sep 26 17:01:48 proxmox kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STA>
Sep 26 17:01:48 proxmox kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Sep 26 17:01:48 proxmox kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off>
Sep 26 17:01:48 proxmox kernel: nvme 0000:07:00.0: Unable to change power state from D3cold to D0, d>
Sep 26 17:01:48 proxmox kernel: nvme nvme0: Disabling device after reset failure: -19
Sep 26 17:01:48 proxmox kernel: nvme0n1: detected capacity change from 1953525168 to 0
Sep 26 17:01:48 proxmox kernel: Aborting journal on device nvme0n1p1-8.
Sep 26 17:01:48 proxmox kernel: EXT4-fs error (device nvme0n1p1) in ext4_do_update_inode:5288: Journ>
Sep 26 17:01:48 proxmox kernel: Buffer I/O error on dev nvme0n1p1, logical block 121667584, lost syn>
Sep 26 17:01:48 proxmox kernel: EXT4-fs error (device nvme0n1p1): ext4_dirty_inode:6120: inode #5138>
Sep 26 17:01:48 proxmox kernel: JBD2: I/O error when updating journal superblock for nvme0n1p1-8.
Sep 26 17:01:48 proxmox kernel: EXT4-fs error (device nvme0n1p1) in ext4_dirty_inode:6121: IO failure
Sep 26 17:01:48 proxmox kernel: Buffer I/O error on dev nvme0n1p1, logical block 0, lost sync page w>
Sep 26 17:01:48 proxmox kernel: EXT4-fs (nvme0n1p1): I/O error while writing superblock
Sep 26 17:01:48 proxmox kernel: EXT4-fs error (device nvme0n1p1): ext4_journal_check_start:83: comm >
Sep 26 17:01:48 proxmox kernel: Buffer I/O error on dev nvme0n1p1, logical block 0, lost sync page w>
Sep 26 17:01:48 proxmox kernel: EXT4-fs (nvme0n1p1): I/O error while writing superblock
Sep 26 17:01:48 proxmox kernel: EXT4-fs (nvme0n1p1): Remounting filesystem read-only
Sep 26 17:01:48 proxmox pvestatd[949]: status update time (8.369 seconds)
Sep 26 17:03:57 proxmox smartd[634]: Device: /dev/nvme0, removed NVMe device: Resource temporarily u>
Sep 26 17:17:01 proxmox CRON[1495394]: pam_unix(cron:session): session opened for user root(uid=0) b>
Sep 26 17:17:01 proxmox CRON[1495395]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 26 17:17:01 proxmox CRON[1495394]: pam_unix(cron:session): session closed for user root
Sep 26 17:53:13 proxmox pvedaemon[976]: <root@pam> successful auth for user 'root@pam'
Sep 26 17:53:32 proxmox pvedaemon[976]: <root@pam> starting task UPID:proxmox:0016F4A0:02246035:6513>
Sep 26 17:53:34 proxmox pvedaemon[1504416]: update new package list: /var/lib/pve-manager/pkgupdates
Sep 26 17:53:37 proxmox pvedaemon[976]: <root@pam> end task UPID:proxmox:0016F4A0:02246035:65130C8C:>
Sep 26 17:56:44 proxmox pvedaemon[976]: <root@pam> successful auth for user 'root@pam'
Sep 26 17:57:48 proxmox systemd-logind[637]: The system will reboot now!
Sep 26 17:57:48 proxmox systemd-logind[637]: System is rebooting.
Sep 26 17:57:48 proxmox systemd[1]: 100.scope: Deactivated successfully.
Sep 26 17:57:48 proxmox systemd[1]: Stopped 100.scope.
Sep 26 17:57:48 proxmox systemd[1]: 100.scope: Consumed 1h 25min 16.810s CPU time.
Sep 26 17:57:48 proxmox systemd[1]: Removed slice qemu.slice - Slice /qemu.
Sep 26 17:57:48 proxmox systemd[1]: qemu.slice: Consumed 6h 40min 23.765s CPU time.
Sep 26 17:57:48 proxmox systemd[1]: Removed slice system-modprobe.slice - Slice /system/modprobe.
Sep 26 17:57:48 proxmox systemd[1]: Stopped target graphical.target - Graphical Interface.
Sep 26 17:57:48 proxmox systemd[1]: Stopped target multi-user.target - Multi-User System.
Sep 26 17:57:48 proxmox systemd[1]: Stopped target getty.target - Login Prompts.
Sep 26 17:57:48 proxmox systemd[1]: Stopped target rpc_pipefs.target.
 
Looks like your NVMe drive is at fault here. Check its SMART values and maybe replace it.
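
To make the relevant lines easier to spot in a long journal, the output can be saved to a file and filtered. A rough sketch, assuming the journal excerpt was saved as plain text; the function name storage_errors and the list of patterns are chosen for this example only and are not an official list of error strings.

```shell
# Print only kernel lines that mention the NVMe driver, ext4/JBD2 or I/O errors.
# $1: text file with journal output, e.g. created with
#     journalctl --since "2023-09-26 14:00:00" --until "2023-09-26 18:30:00" > crash.log
storage_errors() {
    grep -E 'kernel: .*(nvme|EXT4-fs|JBD2|I/O error)' "$1"
}

# Example:
#   storage_errors crash.log
```

On a live system, journalctl -k limits the output to kernel messages of the current boot, so the filter is mostly useful for saved excerpts.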
 
Hi Chris,
I had another crash yesterday afternoon:
https://i.imgur.com/m8AxoX9.png
 
As you can see, the kernel is complaining that the NVMe controller disappears. I suspect that either the controller is overheating or the drive is faulty. Check the SMART values and temperatures. What drive is this?
 
Hi Chris,
Cheers for the quick reply. It's a Crucial P3 Plus 1TB M.2 PCIe Gen4 NVMe internal SSD (up to 5000MB/s), model CT1000P3PSSD8.
How can I check the SMART values and temperature you mention?
Is there a terminal command or an area in Proxmox where I can view these?
 
Okay, that is maybe not the most reliable disk you bought there; there is a reason these are so cheap. You can find the SMART values and temperatures in the WebUI by going to <nodename> > Disks > <diskname> > Show S.M.A.R.T. values.
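
The same values can also be read from a shell. A small sketch, assuming the smartmontools package (which provides smartctl) is installed and that the disk is /dev/nvme0 as in the log above; the function name nvme_health is hypothetical.

```shell
# Print the full SMART/health report for a device (default: /dev/nvme0).
# For NVMe disks the report includes temperature, spare capacity and error counts.
nvme_health() {
    smartctl -a "${1:-/dev/nvme0}"
}

# Example (run as root on the Proxmox host):
#   nvme_health
#   nvme_health /dev/nvme0 | grep -i -E 'temperature|error'
```

If the nvme-cli package happens to be installed, nvme smart-log /dev/nvme0 should give a similar overview.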
 
The SMART values look good, nothing strange there. The increased number of Error Log entries stems from some systems sending non-NVMe commands to these disks and can be ignored.

Before anything else, you can check whether disabling the power saving feature by setting nvme_core.default_ps_max_latency_us=0 pcie_aspm=off on the kernel command line has an effect, or whether the problem is present with another kernel version as well.

You can find the steps for editing the kernel command line in the docs: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysboot_edit_kernel_cmdline
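
As a rough illustration of that change on a GRUB-booted host: the linked docs describe editing GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and then running update-grub, while hosts booted via systemd-boot use /etc/kernel/cmdline and proxmox-boot-tool refresh instead. The helper name add_kernel_params below is made up for this example, and it assumes the GRUB variable sits on a single double-quoted line. Keep a backup of the file before touching it.

```shell
# Append kernel parameters to GRUB_CMDLINE_LINUX_DEFAULT in a GRUB defaults file.
# Usage: add_kernel_params FILE PARAM...
add_kernel_params() {
    file=$1
    shift
    for param in "$@"; do
        # Skip parameters that are already on the line.
        if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
            continue
        fi
        # Insert the parameter just before the closing double quote.
        sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $param\"|" "$file"
    done
}

# Example (run as root on a GRUB-booted host):
#   cp /etc/default/grub /etc/default/grub.bak
#   add_kernel_params /etc/default/grub nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
#   update-grub
#   reboot
#   cat /proc/cmdline    # after the reboot: both parameters should be listed
```

Editing the file by hand with a text editor does the same job; the helper only saves retyping and avoids adding a parameter twice.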
 
[Attached screenshot: 1696942372695.png]
 
Thanks, but now the drive shows as not active. I ran an update this morning and rebooted, and since then the drive shows as enabled but not active. I also got another SMART values email warning, and the ATA errors have increased further. I tried restarting everything but the drive won't kick in. I went ahead and bought a new one, so I will change it out in a day or two.

Not sure about editing the kernel command line. It sounds a bit intimidating.
 
Hello, how is this connected with my post?
Sorry, I don't understand.
 
Sorry, my fault completely. I was following your thread because I have a similar problem and another thread going, and I got yours confused with mine.

Again, my apologies, I'll shut up now.
 
No worries, I was just wondering how this impacted my issue.
 
The output you showed here is from within a VM, not the host. How is this related to your issue with the NVMe controller disappearing? You will have to give more details.

Also, in addition to ruling out a faulty disk, run an extended memory test to check that your RAM modules are fine; they too might cause issues.
 
