Node randomly rebooting. Possible watchdog error?

domtech05 · Jan 18, 2025

Hey guys,
I'm really new to the whole Proxmox and networking scene. I have a node set up at the minute just running a few bits. I have an issue at the moment where the node seems to decide it wants to reboot itself. The reboot then hangs until I manually cut power and start the system myself. This causes a lot of issues, as the node hosts my DNS and DHCP servers. I've copied below the part of the log that I think is relevant. It looks like some kind of watchdog (I don't know much about this). If anyone can help me diagnose and solve this, as well as possibly explain a bit more of what's behind it, that would be amazing!
Many thanks, Dom

Code:

Jan 17 11:28:23 DSH-SRV-01 kernel: r8169 0000:06:00.0 enp6s0: NETDEV WATCHDOG: CPU: 6: transmit queue 0 timed out 9920 ms
Jan 17 11:28:23 DSH-SRV-01 kernel: r8169 0000:06:00.0 enp6s0: ASPM disabled on Tx timeout
Jan 17 11:28:23 DSH-SRV-01 systemd[1]: systemd-udevd.service: Watchdog timeout (limit 3min)!
Jan 17 11:28:23 DSH-SRV-01 systemd[1]: systemd-udevd.service: Killing process 465 (systemd-udevd) with signal SIGABRT.
Jan 17 11:29:53 DSH-SRV-01 systemd[1]: systemd-udevd.service: State 'stop-watchdog' timed out. Killing.
Jan 17 11:29:53 DSH-SRV-01 systemd[1]: systemd-udevd.service: Killing process 465 (systemd-udevd) with signal SIGKILL.
Jan 17 11:31:23 DSH-SRV-01 systemd[1]: systemd-udevd.service: Processes still around after SIGKILL. Ignoring.
Jan 17 11:32:53 DSH-SRV-01 systemd[1]: systemd-udevd.service: State 'final-sigterm' timed out. Killing.
Jan 17 11:32:53 DSH-SRV-01 systemd[1]: systemd-udevd.service: Killing process 465 (systemd-udevd) with signal SIGKILL.
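
The excerpt above mentions two different kinds of watchdog: the kernel's network transmit watchdog (the `NETDEV WATCHDOG` line from the r8169 driver) and systemd's per-service watchdog (the `Watchdog timeout` line for systemd-udevd). As a rough aid for reading a longer log, here is a minimal Python sketch that tells the two apart. The regular expressions are derived only from the lines quoted in this thread, and the category names are made up for illustration, so treat it as a starting point rather than a complete parser.

```python
import re

# Patterns based only on the log lines quoted in this thread.
# The category names are illustrative labels, not official terms.
PATTERNS = {
    "netdev_tx_timeout": re.compile(
        r"NETDEV WATCHDOG: .*transmit queue (\d+) timed out"
    ),
    "systemd_service_watchdog": re.compile(
        r"systemd\[1\]: (\S+): Watchdog timeout"
    ),
}


def classify(line):
    """Return (category, detail) for a watchdog-related line, else None.

    detail is the transmit queue number for a NIC timeout, or the unit
    name for a systemd service watchdog timeout.
    """
    for name, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            return name, match.group(1)
    return None


if __name__ == "__main__":
    sample = [
        "Jan 17 11:28:23 DSH-SRV-01 kernel: r8169 0000:06:00.0 enp6s0: "
        "NETDEV WATCHDOG: CPU: 6: transmit queue 0 timed out 9920 ms",
        "Jan 17 11:28:23 DSH-SRV-01 systemd[1]: systemd-udevd.service: "
        "Watchdog timeout (limit 3min)!",
    ]
    for entry in sample:
        print(classify(entry))
```

Feeding it saved journal output line by line shows whether the NIC timeouts or the service timeouts appear first, which may help narrow down where to look.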

waltar · Jan 18, 2025

That's caused by the "suspend to idle" power-saving mode being allowed in your BIOS CPU settings, so change that to allow only the S3 power-saving mode. Reason:
Upon wake, when systemd sees the clock has advanced beyond the 3 minute watchdog timeout, it will kill and restart services that have the watchdog enabled.
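
To see which suspend modes the running kernel reports, one option is to read `/sys/power/mem_sleep`, where the currently selected mode is shown in square brackets (`s2idle` is suspend-to-idle, `deep` is suspend-to-RAM, i.e. S3). The Python sketch below parses that file. It only shows what the kernel offers; it does not prove that the node is actually suspending, and the file may be absent on some systems, which the code treats as "unknown".

```python
from pathlib import Path

# Standard sysfs location on Linux; may not exist on every system.
MEM_SLEEP = Path("/sys/power/mem_sleep")


def parse_mem_sleep(text):
    """Return (available_modes, active_mode) from mem_sleep content.

    Example content: "s2idle [deep]" means both modes are available
    and "deep" (S3) is the one currently selected.
    """
    modes = []
    active = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        modes.append(token)
    return modes, active


if __name__ == "__main__":
    if MEM_SLEEP.exists():
        modes, active = parse_mem_sleep(MEM_SLEEP.read_text())
        print("available:", ", ".join(modes) or "none")
        print("selected:", active or "unknown")
    else:
        print("mem_sleep not present; suspend modes unknown")
```

If only `s2idle` is listed, the firmware is not offering S3 to the kernel, which would match not being able to find an S3 option in the BIOS.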

domtech05 · Jan 20, 2025

Thanks for your reply. I am using an old PC for my server and couldn't find an S3 setting. I did a bit of googling and found that enabling something called ErP might work? I tried it for a bit and it still crashed out, this time with the following console output:

Code:

Jan 20 11:07:50 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/pickup
Jan 20 11:08:50 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/pickup
Jan 20 11:09:50 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/pickup
Jan 20 11:10:34 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/qmgr
Jan 20 11:10:50 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/pickup
Jan 20 11:11:50 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/pickup
Jan 20 11:12:50 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/pickup
Jan 20 11:13:42 DSH-SRV-01 smartd[918]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 70 to 69
Jan 20 11:13:42 DSH-SRV-01 smartd[918]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 65 to 64
Jan 20 11:13:42 DSH-SRV-01 smartd[918]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 35 to 36
Jan 20 11:13:42 DSH-SRV-01 smartd[918]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 142 to 146
Jan 20 11:13:50 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/pickup
Jan 20 11:14:50 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/pickup
Jan 20 11:15:34 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/qmgr
Jan 20 11:15:50 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/pickup
Jan 20 11:16:50 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/pickup
Jan 20 11:17:50 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/pickup
Jan 20 11:18:50 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/pickup
Jan 20 11:19:50 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/pickup
Jan 20 11:20:34 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/qmgr
Jan 20 11:20:50 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/pickup
Jan 20 11:21:50 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/pickup
Jan 20 11:22:50 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/pickup
Jan 20 11:23:51 DSH-SRV-01 postfix/master[1227]: warning: unix_trigger_event: read timeout for service public/pickup
-- Reboot --

However, the reboot just hangs and the machine never restarts.
TIA,
Dom

waltar · Jan 20, 2025

Nevertheless, your problem needs to be fixed in your BIOS configuration: the (PVE) kernel is designed to follow the rules the BIOS gives it about CPU power-saving (frequency up/down) states.
