My Proxmox server has been crashing about once every week or two, starting a month ago. The VMs and the host become inaccessible, and I can't get access from the console monitor either. I have to hard reboot.
I checked the logs, but found no identifiable errors like the ones I saw on the console. There I saw kernel panic errors with lots of zeroes.
The server still boots.
What would be the best way to analyze the next crash when it happens, or to eliminate the crashes beforehand? Also, can these crashes harm my spinning hard drives?
Bash:
uname -r
6.8.8-2-pve
In the last logs before the crash, there is a small read error for one of the disks. But when I ran
smartctl -t short
it showed ok.
Bash:
Jul 14 05:11:33 proxmox systemd[1]: Starting pve-daily-update.service - Daily PVE download activities...
Jul 14 05:11:34 proxmox pveupdate[3735686]: <root@pam> starting task UPID:proxmox:00390090:05E0E284:66939646:aptupdate::root@pam:
Jul 14 05:11:35 proxmox pveupdate[3735696]: update new package list: /var/lib/pve-manager/pkgupdates
Jul 14 05:11:37 proxmox pveupdate[3735686]: <root@pam> end task UPID:proxmox:00390090:05E0E284:66939646:aptupdate::root@pam: OK
Jul 14 05:11:37 proxmox pveupdate[3735686]: ACME config found for node, but no custom certificate exists. Skipping ACME renewal until initial certificate has been deployed.
Jul 14 05:11:37 proxmox systemd[1]: pve-daily-update.service: Deactivated successfully.
Jul 14 05:11:37 proxmox systemd[1]: Finished pve-daily-update.service - Daily PVE download activities.
Jul 14 05:11:37 proxmox systemd[1]: pve-daily-update.service: Consumed 3.680s CPU time.
Jul 14 05:14:38 proxmox smartd[1363]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 83
Jul 14 05:15:01 proxmox CRON[3736557]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 14 05:15:01 proxmox CRON[3736558]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 14 05:15:01 proxmox CRON[3736557]: pam_unix(cron:session): session closed for user root
Jul 14 05:17:01 proxmox CRON[3736853]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 14 05:17:01 proxmox CRON[3736854]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jul 14 05:17:01 proxmox CRON[3736853]: pam_unix(cron:session): session closed for user root
Jul 14 05:25:01 proxmox CRON[3738012]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 14 05:25:01 proxmox CRON[3738013]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 14 05:25:01 proxmox CRON[3738012]: pam_unix(cron:session): session closed for user root
Jul 14 05:28:42 proxmox zed[3738552]: eid=53 class=scrub_finish pool='s10tb'
Jul 14 05:35:01 proxmox CRON[3739359]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 14 05:35:01 proxmox CRON[3739360]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 14 05:35:01 proxmox CRON[3739359]: pam_unix(cron:session): session closed for user root
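Most of that excerpt is cron noise, so a small filter could help to pull out only the disk-related lines before and after the next crash. This is just a rough sketch: the function name `disk_events` is made up, it reads log text from stdin, and it assumes the lines of interest come from `smartd`, `zed` or the kernel, as in the log above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only journal lines written by smartd, zed,
# or the kernel, and drop everything else (cron, pveupdate, systemd...).
disk_events() {
    grep -E ' (smartd|zed)\[[0-9]+\]: | kernel: '
}

# Example usage (assumes the journal is persistent, so that the
# previous boot is still available after the hard reboot):
#   journalctl -b -1 --no-pager | disk_events
```

With the log above, this would leave only the `smartd` line for /dev/sdb and the `zed` scrub_finish line for the s10tb pool.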
What would you do next?
Thank you all for your patience.