Proxmox shuts down

celdrith

New Member
Apr 26, 2023
Hello Proxmox,

My Proxmox server suddenly goes offline along with all its VMs. This is a real problem because it takes all my services down.

Where can I find logs that explain why it abruptly became unreachable?
 
In the UI you can select your node and then check System -> System Log.
If you're comfortable with the CLI, you can run journalctl -b -1 to check the logs from the previous boot.
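If the previous boot's journal is long, it can help to save it to a file and skim only the kernel and hardware-related lines. Below is a minimal Python sketch of that idea. It assumes the journal was saved as plain text first (for example with `journalctl -b -1 --no-pager > prev-boot.log`, where the file name is only an example) and that lines follow the usual `Mon DD HH:MM:SS host unit[pid]: message` layout. The function name and the keyword list are my own assumptions, not anything Proxmox ships.

```python
import re

# Hypothetical hint list: patterns that often accompany hardware trouble.
# Extend or trim it for your own system.
HARDWARE_HINTS = re.compile(r"Prefailure|WARNING|\bAER\b|I/O error")


def hardware_lines(journal_text):
    """Return journal lines that come from the kernel or match a hardware hint."""
    hits = []
    for line in journal_text.splitlines():
        from_kernel = " kernel: " in line
        if from_kernel or HARDWARE_HINTS.search(line):
            hits.append(line)
    return hits
```

Calling `hardware_lines(open("prev-boot.log").read())` would then return only the kernel messages plus any smartd prefailure or LVM warning lines, which is usually a much shorter list to read.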

Is this host part of a cluster?
 
This host was part of a cluster, yes. The other host stays online, but the primary host (this one) becomes unresponsive. It does not shut down; it just stops responding:
Code:
Jun 20 21:17:01 Osiris CRON[21231]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 20 21:17:01 Osiris CRON[21232]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jun 20 21:17:01 Osiris CRON[21231]: pam_unix(cron:session): session closed for user root
Jun 20 21:18:15 Osiris systemd[1]: Starting pve-daily-update.service - Daily PVE download activities...
Jun 20 21:18:16 Osiris pveupdate[21475]: <root@pam> starting task UPID:Osiris:000053E8:0008A747:66748078:aptupdate::root@pam:
Jun 20 21:18:18 Osiris pveupdate[21480]: update new package list: /var/lib/pve-manager/pkgupdates
Jun 20 21:18:20 Osiris pveupdate[21475]: <root@pam> end task UPID:Osiris:000053E8:0008A747:66748078:aptupdate::root@pam: OK
Jun 20 21:18:20 Osiris systemd[1]: pve-daily-update.service: Deactivated successfully.
Jun 20 21:18:20 Osiris systemd[1]: Finished pve-daily-update.service - Daily PVE download activities.
Jun 20 21:18:20 Osiris systemd[1]: pve-daily-update.service: Consumed 4.205s CPU time.
Jun 20 21:19:35 Osiris pvedaemon[1799]: <root@pam> successful auth for user 'root@pam'
Jun 20 21:21:59 Osiris pveproxy[20636]: Clearing outdated entries from certificate cache
Jun 20 21:23:11 Osiris pveproxy[17301]: worker exit
Jun 20 21:23:11 Osiris pveproxy[1825]: worker 17301 finished
Jun 20 21:23:11 Osiris pveproxy[1825]: starting 1 worker(s)
Jun 20 21:23:11 Osiris pveproxy[1825]: worker 22792 started
Jun 20 21:23:41 Osiris pveproxy[22792]: Clearing outdated entries from certificate cache
Jun 20 21:24:35 Osiris pvedaemon[1797]: <root@pam> successful auth for user 'root@pam'
Jun 20 21:33:59 Osiris pveproxy[8320]: worker exit
Jun 20 21:33:59 Osiris pvedaemon[1799]: <root@pam> end task UPID:Osiris:00001ACF:0001ED42:66746F40:vncproxy:200:root@pam: OK
Jun 20 21:34:36 Osiris pvedaemon[1798]: <root@pam> successful auth for user 'root@pam'
Jun 20 21:47:01 Osiris pmxcfs[1653]: [dcdb] notice: data verification successful
Jun 20 21:47:05 Osiris smartd[1398]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 116 to 117
Jun 20 21:47:05 Osiris smartd[1398]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 44 to 45
Jun 20 21:47:06 Osiris smartd[1398]: Device: /dev/sdg [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 48 to 47
Jun 20 21:50:35 Osiris pvedaemon[1797]: <root@pam> successful auth for user 'root@pam'
Jun 20 21:54:15 Osiris systemd[1]: Starting man-db.service - Daily man-db regeneration...
Jun 20 21:54:15 Osiris systemd[1]: man-db.service: Deactivated successfully.
Jun 20 21:54:15 Osiris systemd[1]: Finished man-db.service - Daily man-db regeneration.
Jun 20 21:57:44 Osiris pveproxy[19881]: worker exit
Jun 20 21:57:44 Osiris pveproxy[1825]: worker 19881 finished
Jun 20 21:57:44 Osiris pveproxy[1825]: starting 1 worker(s)
Jun 20 21:57:44 Osiris pveproxy[1825]: worker 29524 started
Jun 20 22:06:35 Osiris pvedaemon[1799]: <root@pam> successful auth for user 'root@pam'
Jun 20 22:17:01 Osiris CRON[33204]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 20 22:17:01 Osiris CRON[33205]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jun 20 22:17:01 Osiris CRON[33204]: pam_unix(cron:session): session closed for user root
Jun 20 22:17:06 Osiris smartd[1398]: Device: /dev/sdg [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 47 to 46
Jun 20 22:21:37 Osiris pvedaemon[1798]: <root@pam> successful auth for user 'root@pam'
Jun 20 22:21:39 Osiris pvedaemon[1798]: <root@pam> successful auth for user 'root@pam'
Jun 20 22:28:11 Osiris kernel: usb 1-9: USB disconnect, device number 2
Jun 20 22:28:13 Osiris kernel: nouveau 0000:09:00.0: DRM: DDC responded, but no EDID for HDMI-A-1
Jun 20 22:47:01 Osiris pmxcfs[1653]: [dcdb] notice: data verification successful
Jun 20 22:47:06 Osiris smartd[1398]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 72 to 71
Jun 20 22:47:06 Osiris smartd[1398]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 29
Jun 20 22:47:06 Osiris smartd[1398]: Device: /dev/sde [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 35 to 36
Jun 20 22:47:06 Osiris smartd[1398]: Device: /dev/sdg [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 46 to 45
Jun 20 23:17:01 Osiris CRON[44681]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 20 23:17:01 Osiris CRON[44682]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jun 20 23:17:01 Osiris CRON[44681]: pam_unix(cron:session): session closed for user root
Jun 20 23:17:06 Osiris smartd[1398]: Device: /dev/sde [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 34
Jun 20 23:17:06 Osiris smartd[1398]: Device: /dev/sdh [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 34 to 33
Jun 20 23:47:01 Osiris pmxcfs[1653]: [dcdb] notice: data verification successful
Jun 21 00:00:01 Osiris pvescheduler[52858]: <root@pam> starting task UPID:Osiris:0000CE7B:00177626:6674A661:vzdump:100:root@pam:
Jun 21 00:00:01 Osiris pvescheduler[52859]: INFO: starting new backup job: vzdump 100 --prune-backups 'keep-daily=2' --mailto dselen@nerthus.nl --fleecing 0 --compress zstd --storage Backup_Store --mode snapshot --quiet 1 --mailnotification always --notes-template '{{guestname}}'
Jun 21 00:00:01 Osiris pvescheduler[52859]: INFO: Starting Backup of VM 100 (qemu)
Jun 21 00:00:01 Osiris systemd[1]: Starting dpkg-db-backup.service - Daily dpkg database backup service...
Jun 21 00:00:01 Osiris systemd[1]: Starting logrotate.service - Rotate log files...
Jun 21 00:00:02 Osiris systemd[1]: dpkg-db-backup.service: Deactivated successfully.
Jun 21 00:00:02 Osiris systemd[1]: Finished dpkg-db-backup.service - Daily dpkg database backup service.
Jun 21 00:00:02 Osiris systemd[1]: Reloading pveproxy.service - PVE API Proxy Server...
Jun 21 00:00:03 Osiris pveproxy[52886]: send HUP to 1825
Jun 21 00:00:03 Osiris pveproxy[1825]: received signal HUP
Jun 21 00:00:03 Osiris pveproxy[1825]: server closing
Jun 21 00:00:03 Osiris pveproxy[1825]: server shutdown (restart)
Jun 21 00:00:03 Osiris systemd[1]: Reloaded pveproxy.service - PVE API Proxy Server.
Jun 21 00:00:03 Osiris systemd[1]: Reloading spiceproxy.service - PVE SPICE Proxy Server...
Jun 21 00:00:03 Osiris spiceproxy[52902]: send HUP to 1831
Jun 21 00:00:03 Osiris spiceproxy[1831]: received signal HUP
Jun 21 00:00:03 Osiris spiceproxy[1831]: server closing
Jun 21 00:00:03 Osiris spiceproxy[1831]: server shutdown (restart)
Jun 21 00:00:03 Osiris systemd[1]: Reloaded spiceproxy.service - PVE SPICE Proxy Server.
Jun 21 00:00:03 Osiris pvefw-logger[1382]: received terminate request (signal)
Jun 21 00:00:03 Osiris pvefw-logger[1382]: stopping pvefw logger
Jun 21 00:00:03 Osiris systemd[1]: Stopping pvefw-logger.service - Proxmox VE firewall logger...
Jun 21 00:00:03 Osiris systemd[1]: pvefw-logger.service: Deactivated successfully.
Jun 21 00:00:03 Osiris systemd[1]: Stopped pvefw-logger.service - Proxmox VE firewall logger.
Jun 21 00:00:03 Osiris systemd[1]: Starting pvefw-logger.service - Proxmox VE firewall logger...
Jun 21 00:00:03 Osiris pvefw-logger[52929]: starting pvefw logger
Jun 21 00:00:03 Osiris systemd[1]: Started pvefw-logger.service - Proxmox VE firewall logger.
Jun 21 00:00:03 Osiris systemd[1]: logrotate.service: Deactivated successfully.
Jun 21 00:00:03 Osiris systemd[1]: Finished logrotate.service - Rotate log files.
Jun 21 00:00:03 Osiris spiceproxy[1831]: restarting server
Jun 21 00:00:03 Osiris spiceproxy[1831]: starting 1 worker(s)
Jun 21 00:00:03 Osiris spiceproxy[1831]: worker 52934 started
Jun 21 00:00:04 Osiris pveproxy[1825]: restarting server
Jun 21 00:00:04 Osiris pveproxy[1825]: starting 3 worker(s)
Jun 21 00:00:04 Osiris pveproxy[1825]: worker 52935 started
Jun 21 00:00:04 Osiris pveproxy[1825]: worker 52936 started
Jun 21 00:00:04 Osiris pveproxy[1825]: worker 52937 started
Jun 21 00:00:08 Osiris spiceproxy[1832]: worker exit
Jun 21 00:00:08 Osiris spiceproxy[1831]: worker 1832 finished
Jun 21 00:00:09 Osiris pveproxy[20636]: worker exit
Jun 21 00:00:09 Osiris pveproxy[22792]: worker exit
Jun 21 00:00:09 Osiris pveproxy[29524]: worker exit
Jun 21 00:00:09 Osiris pveproxy[1825]: worker 22792 finished
Jun 21 00:00:09 Osiris pveproxy[1825]: worker 29524 finished
Jun 21 00:00:09 Osiris pveproxy[1825]: worker 20636 finished
Jun 21 00:00:26 Osiris kernel: pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0
Jun 21 00:00:26 Osiris kernel: pcieport 0000:00:01.3: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Jun 21 00:00:26 Osiris kernel: pcieport 0000:00:01.3:   device [1022:1453] error status/mask=00000040/00006000
Jun 21 00:00:26 Osiris kernel: pcieport 0000:00:01.3:    [ 6] BadTLP               
Jun 21 00:17:01 Osiris CRON[57339]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 21 00:17:01 Osiris CRON[57340]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jun 21 00:17:01 Osiris CRON[57339]: pam_unix(cron:session): session closed for user root
Jun 21 00:17:06 Osiris smartd[1398]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 75 to 79
Jun 21 00:17:06 Osiris smartd[1398]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 1 to 3
Jun 21 00:17:06 Osiris smartd[1398]: Device: /dev/sde [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 34 to 35
Jun 21 00:17:07 Osiris smartd[1398]: Device: /dev/sdh [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 33 to 34
Jun 21 00:19:24 Osiris kernel: perf: interrupt took too long (2503 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Jun 21 00:34:39 Osiris lvm[956]: WARNING: Thin pool HDD3-HDD3-tpool data is now 85.01% full.
Jun 21 00:41:17 Osiris lvm[956]: WARNING: Thin pool HDD6-HDD6-tpool data is now 85.01% full.
Jun 21 00:43:38 Osiris kernel: perf: interrupt took too long (3135 > 3128), lowering kernel.perf_event_max_sample_rate to 63750
Jun 21 00:47:01 Osiris pmxcfs[1653]: [dcdb] notice: data verification successful
Jun 21 00:47:05 Osiris smartd[1398]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 71 to 70
Jun 21 00:47:05 Osiris smartd[1398]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 29 to 30
Jun 21 00:47:06 Osiris smartd[1398]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 79 to 82
Jun 21 00:47:06 Osiris smartd[1398]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 3 to 5
Jun 21 00:47:07 Osiris smartd[1398]: Device: /dev/sdg [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 71 to 70
Jun 21 00:47:07 Osiris smartd[1398]: Device: /dev/sdg [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 29 to 30
Jun 21 00:47:07 Osiris smartd[1398]: Device: /dev/sdg [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 45 to 44
Jun 21 01:17:01 Osiris CRON[73849]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 21 01:17:01 Osiris CRON[73850]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jun 21 01:17:01 Osiris CRON[73849]: pam_unix(cron:session): session closed for user root
Jun 21 01:17:05 Osiris smartd[1398]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 70 to 71
Jun 21 01:17:05 Osiris smartd[1398]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 30 to 29
Jun 21 01:17:06 Osiris smartd[1398]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 100
Jun 21 01:17:06 Osiris smartd[1398]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 5 to 100
Jun 21 01:30:00 Osiris pvescheduler[52859]: INFO: Finished Backup of VM 100 (01:29:59)
Jun 21 01:30:00 Osiris pvescheduler[52859]: INFO: Backup job finished successfully
Jun 21 01:30:00 Osiris postfix/pickup[64451]: 36B39A9AC6: uid=0 from=<root>
Jun 21 01:30:00 Osiris postfix/cleanup[77297]: 36B39A9AC6: message-id=<20240620233000.36B39A9AC6@Osiris.nerthus.local>
Jun 21 01:30:00 Osiris postfix/qmgr[1730]: 36B39A9AC6: from=<root@Osiris.nerthus.local>, size=28683, nrcpt=1 (queue active)
Jun 21 01:30:00 Osiris postfix/smtp[77299]: 36B39A9AC6: to=<dselen@nerthus.nl>, relay=avas01.systemec.nl[89.20.83.19]:25, delay=0.56, delays=0.02/0.01/0.2/0.34, dsn=2.0.0, status=sent (250 Ok: queued as AD494D0061)
Jun 21 01:30:00 Osiris postfix/qmgr[1730]: 36B39A9AC6: removed
Jun 21 01:38:29 Osiris systemd[1]: Starting pve-daily-update.service - Daily PVE download activities...
Jun 21 01:38:31 Osiris pveupdate[78975]: <root@pam> starting task UPID:Osiris:00013484:00207A92:6674BD77:aptupdate::root@pam:
Jun 21 01:38:32 Osiris pveupdate[78980]: update new package list: /var/lib/pve-manager/pkgupdates
Jun 21 01:38:34 Osiris pveupdate[78975]: <root@pam> end task UPID:Osiris:00013484:00207A92:6674BD77:aptupdate::root@pam: OK
Jun 21 01:38:34 Osiris systemd[1]: pve-daily-update.service: Deactivated successfully.
Jun 21 01:38:34 Osiris systemd[1]: Finished pve-daily-update.service - Daily PVE download activities.
Jun 21 01:38:34 Osiris systemd[1]: pve-daily-update.service: Consumed 4.141s CPU time.
Jun 21 01:47:01 Osiris pmxcfs[1653]: [dcdb] notice: data verification successful
Jun 21 01:47:05 Osiris smartd[1398]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 75
Jun 21 01:47:05 Osiris smartd[1398]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 71 to 70
Jun 21 01:47:05 Osiris smartd[1398]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 29 to 30
Jun 21 01:47:05 Osiris smartd[1398]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 100 to 17
Jun 21 02:17:01 Osiris CRON[86818]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 21 02:17:01 Osiris CRON[86819]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jun 21 02:17:01 Osiris CRON[86818]: pam_unix(cron:session): session closed for user root
Jun 21 02:17:05 Osiris smartd[1398]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 70 to 71
Jun 21 02:17:05 Osiris smartd[1398]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 30 to 29
Jun 21 02:17:05 Osiris smartd[1398]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 17 to 15
Jun 21 02:17:06 Osiris smartd[1398]: Device: /dev/sdg [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 44 to 43
-- Reboot --
Jun 21 18:51:10 Osiris kernel: Linux version 5.15.143-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.143-1 (2024-02-08T18:12Z) ()
Jun 21 18:51:10 Osiris kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.143-1-pve root=/dev/mapper/pve-root ro quiet
Jun 21 18:51:10 Osiris kernel: KERNEL supported cpus:
Jun 21 18:51:10 Osiris kernel:   Intel GenuineIntel
Jun 21 18:51:10 Osiris kernel:   AMD AuthenticAMD
Jun 21 18:51:10 Osiris kernel:   Hygon HygonGenuine
Jun 21 18:51:10 Osiris kernel:   Centaur CentaurHauls
Jun 21 18:51:10 Osiris kernel:   zhaoxin   Shanghai 
Jun 21 18:51:10 Osiris kernel: BIOS-provided physical RAM map:

These are the logs, I don't see much to base a hypothesis on...
 
Code:
Jun 22 09:49:38 Osiris pvedaemon[89645]: <root@pam> successful auth for user 'root@pam'
Jun 22 09:51:18 Osiris smartd[1433]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 102 to 105
Jun 22 09:51:18 Osiris smartd[1433]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 42 to 44
Jun 22 09:51:18 Osiris smartd[1433]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 71 to 72
Jun 22 09:51:18 Osiris smartd[1433]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 29 to 28
Jun 22 09:51:18 Osiris smartd[1433]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 19 to 16
Jun 22 09:51:19 Osiris smartd[1433]: Device: /dev/sde [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 79 to 80
Jun 22 09:58:21 Osiris pveproxy[198970]: worker exit
Jun 22 09:58:22 Osiris pvedaemon[89645]: <root@pam> end task UPID:Osiris:0002FB13:00505E3E:66767C78:vncproxy:200:root@pam: OK
Jun 22 09:58:26 Osiris pvedaemon[76921]: <root@pam> starting task UPID:Osiris:00031357:00535CA2:66768422:vncproxy:200:root@pam:
Jun 22 09:58:26 Osiris pvedaemon[201559]: starting vnc proxy UPID:Osiris:00031357:00535CA2:66768422:vncproxy:200:root@pam:
Jun 22 09:58:28 Osiris pvedaemon[76921]: <root@pam> end task UPID:Osiris:00031357:00535CA2:66768422:vncproxy:200:root@pam: OK
Jun 22 09:58:29 Osiris pvedaemon[201575]: starting vnc proxy UPID:Osiris:00031367:00535D71:66768425:vncproxy:201:root@pam:
Jun 22 09:58:29 Osiris pvedaemon[82222]: <root@pam> starting task UPID:Osiris:00031367:00535D71:66768425:vncproxy:201:root@pam:
Jun 22 09:58:30 Osiris pvedaemon[82222]: <root@pam> end task UPID:Osiris:00031367:00535D71:66768425:vncproxy:201:root@pam: OK
Jun 22 10:04:38 Osiris pvedaemon[89645]: <root@pam> successful auth for user 'root@pam'
Jun 22 10:09:01 Osiris pveproxy[197805]: Clearing outdated entries from certificate cache
Jun 22 10:17:01 Osiris CRON[205098]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 22 10:17:01 Osiris CRON[205099]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jun 22 10:17:01 Osiris CRON[205098]: pam_unix(cron:session): session closed for user root
Jun 22 10:21:18 Osiris smartd[1433]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 105 to 107
Jun 22 10:21:18 Osiris smartd[1433]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 80 to 81
Jun 22 10:21:18 Osiris smartd[1433]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 72 to 71
Jun 22 10:21:18 Osiris smartd[1433]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 29
Jun 22 10:21:18 Osiris smartd[1433]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 16 to 15
Jun 22 10:21:19 Osiris smartd[1433]: Device: /dev/sdg [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 45 to 44
Jun 22 10:45:55 Osiris pmxcfs[48374]: [dcdb] notice: data verification successful
Jun 22 10:51:18 Osiris smartd[1433]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 107 to 108
Jun 22 10:51:18 Osiris smartd[1433]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 44 to 45
Jun 22 10:51:18 Osiris smartd[1433]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 71 to 72
Jun 22 10:51:18 Osiris smartd[1433]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 29 to 28
Jun 22 10:51:18 Osiris smartd[1433]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 15 to 14
Jun 22 11:17:01 Osiris CRON[216511]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 22 11:17:01 Osiris CRON[216512]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jun 22 11:17:01 Osiris CRON[216511]: pam_unix(cron:session): session closed for user root
Jun 22 11:21:18 Osiris smartd[1433]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 45 to 44
Jun 22 11:21:18 Osiris smartd[1433]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 14 to 13
Jun 22 11:21:19 Osiris smartd[1433]: Device: /dev/sdg [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 44 to 43
Jun 22 11:45:55 Osiris pmxcfs[48374]: [dcdb] notice: data verification successful
Jun 22 11:51:18 Osiris smartd[1433]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 44 to 43
Jun 22 11:51:18 Osiris smartd[1433]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 13 to 12
Jun 22 11:51:19 Osiris smartd[1433]: Device: /dev/sdg [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 43 to 42
Jun 22 12:17:01 Osiris CRON[227851]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 22 12:17:01 Osiris CRON[227852]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jun 22 12:17:01 Osiris CRON[227851]: pam_unix(cron:session): session closed for user root
Jun 22 12:21:18 Osiris smartd[1433]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 108 to 109
Jun 22 12:21:18 Osiris smartd[1433]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 43 to 44
Jun 22 12:21:19 Osiris smartd[1433]: Device: /dev/sde [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 80 to 81
Jun 22 12:45:55 Osiris pmxcfs[48374]: [dcdb] notice: data verification successful
Jun 22 12:51:18 Osiris smartd[1433]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 109 to 110
Jun 22 12:51:19 Osiris smartd[1433]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 81 to 82
Jun 22 12:51:19 Osiris smartd[1433]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 12 to 11
Jun 22 12:51:19 Osiris smartd[1433]: Device: /dev/sdg [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 114 to 115
Jun 22 12:51:19 Osiris smartd[1433]: Device: /dev/sdg [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 42 to 43
Jun 22 13:16:12 Osiris pvedaemon[89645]: <root@pam> successful auth for user 'root@pam'
Jun 22 13:17:01 Osiris CRON[239195]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 22 13:17:01 Osiris CRON[239196]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jun 22 13:17:01 Osiris CRON[239195]: pam_unix(cron:session): session closed for user root
Jun 22 13:20:30 Osiris pveproxy[198971]: Clearing outdated entries from certificate cache
Jun 22 13:20:30 Osiris pveproxy[197805]: Clearing outdated entries from certificate cache
Jun 22 13:20:42 Osiris pvedaemon[89645]: <root@pam> starting task UPID:Osiris:0003A915:0065E113:6676B38A:vzdump::root@pam:
Jun 22 13:20:42 Osiris pvedaemon[239893]: INFO: starting new backup job: vzdump 100 101 102 200 201 203 202 204 --mailto dselen@nerthus.nl --notes-template '{{guestname}}' --compress zstd --storage Backup_Storage --node Osiris --mailnotification always --prune-backups 'keep-daily=2' --fleecing 0 --all 0 --mode snapshot
Jun 22 13:20:42 Osiris pmxcfs[48374]: [status] notice: received log
Jun 22 13:20:42 Osiris pvedaemon[239893]: INFO: Starting Backup of VM 100 (qemu)
Jun 22 13:20:53 Osiris pveproxy[199064]: Clearing outdated entries from certificate cache
Jun 22 13:21:19 Osiris smartd[1433]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 110 to 111
Jun 22 13:21:19 Osiris smartd[1433]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 44 to 45
Jun 22 13:21:19 Osiris smartd[1433]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 11 to 10
Jun 22 13:21:40 Osiris pvedaemon[239893]: INFO: Finished Backup of VM 100 (00:00:58)
Jun 22 13:21:40 Osiris pvedaemon[239893]: INFO: Starting Backup of VM 102 (qemu)
Jun 22 13:21:59 Osiris pveproxy[197805]: worker exit
Jun 22 13:21:59 Osiris pveproxy[1834]: worker 197805 finished
Jun 22 13:21:59 Osiris pveproxy[1834]: starting 1 worker(s)
Jun 22 13:21:59 Osiris pveproxy[1834]: worker 240197 started
Jun 22 13:23:22 Osiris kernel: hrtimer: interrupt took 2120 ns
Jun 22 13:23:35 Osiris pvedaemon[82222]: worker exit
Jun 22 13:23:35 Osiris pvedaemon[1820]: worker 82222 finished
Jun 22 13:23:35 Osiris pvedaemon[1820]: starting 1 worker(s)
Jun 22 13:23:35 Osiris pvedaemon[1820]: worker 240488 started
Jun 22 13:24:55 Osiris pvedaemon[76921]: worker exit
Jun 22 13:24:55 Osiris pvedaemon[1820]: worker 76921 finished
Jun 22 13:24:55 Osiris pvedaemon[1820]: starting 1 worker(s)
Jun 22 13:24:55 Osiris pvedaemon[1820]: worker 240738 started
Jun 22 13:25:57 Osiris pveproxy[199064]: worker exit
Jun 22 13:25:57 Osiris pveproxy[1834]: worker 199064 finished
Jun 22 13:25:57 Osiris pveproxy[1834]: starting 1 worker(s)
Jun 22 13:25:57 Osiris pveproxy[1834]: worker 240946 started
Jun 22 13:25:59 Osiris pveproxy[198971]: worker exit
Jun 22 13:25:59 Osiris pveproxy[1834]: worker 198971 finished
Jun 22 13:25:59 Osiris pveproxy[1834]: starting 1 worker(s)
Jun 22 13:25:59 Osiris pveproxy[1834]: worker 240960 started
Jun 22 13:27:00 Osiris pvedaemon[239893]: INFO: Finished Backup of VM 102 (00:05:20)
Jun 22 13:27:00 Osiris pvedaemon[239893]: INFO: Starting Backup of VM 202 (qemu)
Jun 22 13:28:25 Osiris pveproxy[240946]: Clearing outdated entries from certificate cache
Jun 22 13:28:33 Osiris pvedaemon[89645]: <root@pam> starting task UPID:Osiris:0003AF29:006698F2:6676B561:qmigrate:102:root@pam:
Jun 22 13:28:34 Osiris pmxcfs[48374]: [status] notice: received log
Jun 22 13:28:36 Osiris pmxcfs[48374]: [status] notice: received log
Jun 22 13:28:49 Osiris pveproxy[240197]: Clearing outdated entries from certificate cache
Jun 22 13:28:49 Osiris kernel: fwbr102i0: port 2(tap102i0) entered disabled state
Jun 22 13:28:49 Osiris kernel: fwbr102i0: port 1(fwln102i0) entered disabled state
Jun 22 13:28:49 Osiris kernel: vmbr0: port 3(fwpr102p0) entered disabled state
Jun 22 13:28:49 Osiris kernel: device fwln102i0 left promiscuous mode
Jun 22 13:28:49 Osiris kernel: fwbr102i0: port 1(fwln102i0) entered disabled state
Jun 22 13:28:49 Osiris kernel: device fwpr102p0 left promiscuous mode
Jun 22 13:28:49 Osiris kernel: vmbr0: port 3(fwpr102p0) entered disabled state
Jun 22 13:28:49 Osiris qmeventd[1436]: read: Connection reset by peer
Jun 22 13:28:49 Osiris systemd[1]: 102.scope: Deactivated successfully.
Jun 22 13:28:49 Osiris systemd[1]: 102.scope: Consumed 20min 17.607s CPU time.
Jun 22 13:28:50 Osiris pvedaemon[89645]: <root@pam> end task UPID:Osiris:0003AF29:006698F2:6676B561:qmigrate:102:root@pam: OK
Jun 22 13:28:51 Osiris pveproxy[240960]: Clearing outdated entries from certificate cache
Jun 22 13:28:55 Osiris pvedaemon[241543]: starting vnc proxy UPID:Osiris:0003AF87:0066A1A9:6676B577:vncproxy:102:root@pam:
Jun 22 13:28:55 Osiris pvedaemon[240488]: <root@pam> starting task UPID:Osiris:0003AF87:0066A1A9:6676B577:vncproxy:102:root@pam:
Jun 22 13:29:03 Osiris pvedaemon[240488]: <root@pam> end task UPID:Osiris:0003AF87:0066A1A9:6676B577:vncproxy:102:root@pam: OK
Jun 22 13:29:03 Osiris pvedaemon[240738]: <root@pam> starting task UPID:Osiris:0003AFA9:0066A4D0:6676B57F:vncproxy:101:root@pam:
Jun 22 13:29:03 Osiris pvedaemon[241577]: starting vnc proxy UPID:Osiris:0003AFA9:0066A4D0:6676B57F:vncproxy:101:root@pam:
Jun 22 13:29:07 Osiris pvedaemon[240738]: <root@pam> end task UPID:Osiris:0003AFA9:0066A4D0:6676B57F:vncproxy:101:root@pam: OK
Jun 22 13:29:07 Osiris pvedaemon[89645]: <root@pam> starting task UPID:Osiris:0003AFBF:0066A663:6676B583:vncproxy:200:root@pam:
Jun 22 13:29:07 Osiris pvedaemon[241599]: starting vnc proxy UPID:Osiris:0003AFBF:0066A663:6676B583:vncproxy:200:root@pam:
Jun 22 13:29:09 Osiris pvedaemon[89645]: <root@pam> end task UPID:Osiris:0003AFBF:0066A663:6676B583:vncproxy:200:root@pam: OK
Jun 22 13:29:17 Osiris pvedaemon[241632]: starting vnc proxy UPID:Osiris:0003AFE0:0066AA72:6676B58D:vncproxy:200:root@pam:
Jun 22 13:29:17 Osiris pvedaemon[89645]: <root@pam> starting task UPID:Osiris:0003AFE0:0066AA72:6676B58D:vncproxy:200:root@pam:
Jun 22 13:29:18 Osiris pvedaemon[240738]: <root@pam> starting task UPID:Osiris:0003AFEB:0066AAB4:6676B58E:vncproxy:201:root@pam:
Jun 22 13:29:18 Osiris pvedaemon[241643]: starting vnc proxy UPID:Osiris:0003AFEB:0066AAB4:6676B58E:vncproxy:201:root@pam:
Jun 22 13:29:19 Osiris pvedaemon[89645]: <root@pam> end task UPID:Osiris:0003AFE0:0066AA72:6676B58D:vncproxy:200:root@pam: OK
Jun 22 13:29:20 Osiris pvedaemon[240738]: <root@pam> end task UPID:Osiris:0003AFEB:0066AAB4:6676B58E:vncproxy:201:root@pam: OK
Jun 22 13:31:11 Osiris pvedaemon[240738]: <root@pam> successful auth for user 'root@pam'
Jun 22 13:31:33 Osiris pvedaemon[239893]: INFO: Finished Backup of VM 202 (00:04:33)
Jun 22 13:31:33 Osiris pvedaemon[239893]: INFO: Starting Backup of VM 203 (qemu)
Jun 22 13:36:50 Osiris pvedaemon[239893]: INFO: Finished Backup of VM 203 (00:05:17)
Jun 22 13:36:50 Osiris pvedaemon[239893]: INFO: Starting Backup of VM 204 (qemu)
Jun 22 13:38:15 Osiris pmxcfs[48374]: [status] notice: received log
Jun 22 13:39:46 Osiris kernel: perf: interrupt took too long (3142 > 3133), lowering kernel.perf_event_max_sample_rate to 63500
Jun 22 13:40:06 Osiris pvedaemon[239893]: INFO: Finished Backup of VM 204 (00:03:16)
Jun 22 13:40:06 Osiris pvedaemon[239893]: INFO: Backup job finished successfully
Jun 22 13:40:06 Osiris postfix/pickup[237141]: 36B4FA03C3: uid=0 from=<root>
Jun 22 13:40:06 Osiris pvedaemon[89645]: <root@pam> end task UPID:Osiris:0003A915:0065E113:6676B38A:vzdump::root@pam: OK
Jun 22 13:40:06 Osiris postfix/cleanup[243706]: 36B4FA03C3: message-id=<20240622114006.36B4FA03C3@Osiris.nerthus.local>
Jun 22 13:40:06 Osiris postfix/qmgr[1758]: 36B4FA03C3: from=<root@Osiris.nerthus.local>, size=80906, nrcpt=1 (queue active)
Jun 22 13:40:07 Osiris postfix/smtp[243708]: 36B4FA03C3: to=<dselen@nerthus.nl>, relay=avas02.systemec.nl[89.20.83.31]:25, delay=1.2, delays=0.02/0.01/0.62/0.51, dsn=2.0.0, status=sent (250 Ok: queued as 575C12088BE)
Jun 22 13:40:07 Osiris postfix/qmgr[1758]: 36B4FA03C3: removed
-- Reboot --

These are the last log entries before the host stopped functioning.
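One thing the pasted journals do show is that logging simply stops and only resumes at the next boot (in the first paste, the last entry is at Jun 21 02:17:06 and the next boot starts at 18:51:10), with no shutdown messages in between. A rough Python sketch for locating those silent gaps in a saved journal is below. It assumes the `-- Reboot --` separator and the `Mon DD HH:MM:SS` timestamp prefix seen in the pastes, English month names, and a year supplied by the caller because the log lines do not carry one. The function names are hypothetical.

```python
from datetime import datetime

REBOOT_MARKER = "-- Reboot --"


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; the year is passed in by the caller."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def silent_gaps(journal_text, year):
    """Return (last_line_before, first_line_after, gap) for each reboot marker."""
    lines = [l for l in journal_text.splitlines() if l.strip()]
    gaps = []
    for i, line in enumerate(lines):
        if line.strip() != REBOOT_MARKER:
            continue
        if i == 0 or i + 1 >= len(lines):
            continue  # no log line on one side of the marker
        before, after = lines[i - 1], lines[i + 1]
        try:
            gap = parse_stamp(after, year) - parse_stamp(before, year)
        except ValueError:
            continue  # neighbouring line has no parseable timestamp
        gaps.append((before, after, gap))
    return gaps
```

The last line before each gap is the most interesting one: if it is routine (cron, smartd) rather than a shutdown sequence, that points towards a hard hang or power loss rather than a clean shutdown.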
 
This smells like a hardware issue. Run a memtest and, if that does not find anything, start replacing parts until the problem goes away. Your drives are also reporting ECC recoveries (data on the drive was corrupt but repairable because only a single bit was flipped). Are you running your system close to a nuclear reactor, high in the atmosphere, or near some other source of interference?
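To see how much the drive attributes actually move, the smartd lines can be summarised per device. The Python sketch below is only an illustration: it assumes the exact smartd line format shown in the pasted logs, and the function name is made up. As far as I know, smartd reports changes of the normalized attribute values by default rather than raw counts, so it is worth comparing the result against the raw values from `smartctl -A` on each drive before drawing conclusions.

```python
import re
from collections import defaultdict

# Matches lines such as:
# "Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 75 to 79"
SMARTD_CHANGE = re.compile(
    r"Device: (?P<dev>\S+) \[SAT\], SMART (?P<kind>Prefailure|Usage) Attribute: "
    r"(?P<id>\d+) (?P<name>\S+) changed from (?P<old>\d+) to (?P<new>\d+)"
)


def smart_ranges(journal_text):
    """Map (device, attribute name) to the (min, max) value seen in smartd change lines."""
    seen = defaultdict(list)
    for match in SMARTD_CHANGE.finditer(journal_text):
        key = (match["dev"], match["name"])
        seen[key].extend([int(match["old"]), int(match["new"])])
    return {key: (min(values), max(values)) for key, values in seen.items()}
```

An attribute that swings over a wide range on one drive but stays nearly flat on the others is a reasonable candidate to look at first.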
 
Not that I am aware of! The server itself is just tucked away next to a big closet, hence the ghetto setup with the monitor. I am afraid it's a hardware issue as well, but Memtest86 came back with a pass. Could it be one of the drives, and how would I efficiently debug the components?
 
It could be silent data corruption in important files on the drives that triggers a kernel crash, but usually it's the memory (even though it passes a memtest). Maybe relax the memory timings and/or speed?
It's hard to diagnose hardware issues. Just start replacing parts until it's fixed (and you might end up with a whole different system)? Memory usually fails before motherboards, which usually fail way before CPUs do. Power supplies can cause weird issues when they are stressed by other components (reduce the CPU speed or disable turbo, etc.).
Maybe boot the system with an Ubuntu installer (don't install, just select "try") and run some stress tests? This is probably not Proxmox-specific, and any guide on hardware troubleshooting might apply.
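The idea behind most stress tests is simple: repeat the same computation many times and check that the answer never changes. The toy Python sketch below illustrates that principle only. It is not a substitute for a proper memory or CPU stress tool, the function name and default sizes are arbitrary assumptions, and a clean run proves very little on its own.

```python
import hashlib
import os


def hash_consistency_check(rounds=50, size_mib=16):
    """Hash the same random buffer repeatedly; return how many rounds disagreed with the first."""
    buffer = os.urandom(size_mib * 1024 * 1024)
    expected = hashlib.sha256(buffer).hexdigest()
    mismatches = 0
    for _ in range(rounds):
        # Copy the buffer so the data passes through memory again each round.
        copy = bytes(bytearray(buffer))
        if hashlib.sha256(copy).hexdigest() != expected:
            mismatches += 1
    return mismatches
```

On healthy hardware the function should always return 0; any non-zero result from a check like this would be a strong hint that memory, CPU, or power delivery is misbehaving under load.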
 
Thanks. I also installed the amd64-microcode package, which might help. I will keep this thread updated and will test the RAM speeds and latency once I am able to.
 
Maybe run the system with just one of the memory DIMMs, and try another if it still fails?
Did you search this forum for known issues (and workarounds) with your system or motherboard? Is this a NUC with stability issues and no support from the manufacturer, for example?
 
My setup is:

ASUS B450M-A motherboard (I updated it to the latest BIOS a couple of days ago)
AMD Ryzen 1600
3x Corsair 8 GB
1x G.Skill 8 GB
About 5-8 hard drives with a SATA expansion card.
Nvidia GT 710 for video output.
Corsair 750 Silver power supply.
 
I do have to say that when only my TrueNAS Scale VM is running, everything seems to run fine. But once the Kubernetes worker nodes come online, it's only a matter of time before it panics...
The purpose of the TrueNAS VM is to export the disks as an NFS share for Proxmox and Kubernetes to use.
 
