pve node stops responding and requires reboot

radar

Member
May 11, 2021
Hi,
I have a two-node Proxmox cluster and, from time to time, one of them (always the same one) stops responding until I reboot it. I cannot ping the node anymore. I don't have a screen or keyboard attached to check on it when it's not responding, but here are the logs from that node. I don't know if the "reboot" happened at 20:27, when it stopped responding, or at 21:28, when I forced the reboot.

Do you have any idea what's happening here? Since this node runs my DHCP and DNS, it's a critical one (I'm thinking about HA, but I'll open another thread for that).

Thanks in advance, and don't hesitate to ask if you need any additional log or output.

Code:
Feb 09 20:23:18 pve2 pmxcfs[985]: [dcdb] notice: data verification successful
Feb 09 20:27:24 pve2 pmxcfs[985]: [status] notice: received log
-- Reboot --
Feb 09 21:28:16 pve2 kernel: Linux version 6.8.12-8-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-8 (2025-01-24T12:32Z) ()
Feb 09 21:28:16 pve2 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-8-pve root=/dev/mapper/pve-root ro quiet
 
Nothing ugly in that log snippet. Hard to say; can you tell if the machine is hard-locked? Guessing not, since you have no keyboard or monitor on it. At first glance I would guess either a hardware issue of some sort or a network issue. Is there any way you can get a console on it to see what state it is in when it has crashed? Do you by chance have centralized logging where you can monitor the log at least during a crash? I need a bit more insight to troubleshoot, if possible.
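
If you can get to it once it comes back up, the journal from the previous boot may show what it was doing right before it died; a rough sketch, assuming the default journald setup:

Code:
# keep the journal across reboots (usually already the case on PVE, but worth checking)
mkdir -p /var/log/journal
systemctl restart systemd-journald

# warnings and errors from the boot before this one
journalctl -b -1 -p warning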
 
Hi,
Thanks for your response. I managed to get a keyboard and screen and found that the node had rebooted and, while rebooting, no boot option was found. So there are two issues I have to solve: why there is no boot option (not Proxmox related) and why it reboots.
I have no centralized logs, but that's a good hint. I'm going to set that up.
 
There are too many reboots on this node and I can't figure out why. I see a few warnings and errors but can't tell whether they are what leads to these reboots.
I'm attaching the log since Feb 1st, which includes 13 reboots. If you'd prefer that I paste the logs on Pastebin instead, I can (I can't paste them inline here since they cover several days).
Thanks a lot for your help.
 

Attachments

A few observations,

I can count all 13 reboots in your logs, and it appears you are possibly running HA?


For example, this one sticks out; all the events right before the "-- Boot xxxx" line show what was going on just before the reboot:
Code:
Feb 09 21:11:27 proxmox pve-ha-lrm[1045]: lost lock 'ha_agent_proxmox_lock - cfs lock update failed - Device or resource busy
Feb 09 21:11:27 proxmox pmxcfs[930]: [dcdb] crit: can't initialize service
Feb 09 21:11:27 proxmox pve-ha-crm[1032]: status change slave => wait_for_quorum
Feb 09 21:11:32 proxmox pve-ha-lrm[1045]: status change active => lost_agent_lock
Feb 09 21:11:33 proxmox pmxcfs[930]: [dcdb] notice: members: 1/930
Feb 09 21:11:33 proxmox pmxcfs[930]: [dcdb] notice: all data is up to date
-- Boot 7937282e8ae645b799e58d9d18dd916b --
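
Since those pve-ha-lrm lock messages normally only show up when the HA stack is active, something like this on that node would confirm what HA is doing there (standard PVE tooling, just a sketch):

Code:
# is HA configured on this cluster, and what resources does it manage?
ha-manager status

# are the HA daemons and the watchdog multiplexer running?
systemctl status pve-ha-crm pve-ha-lrm watchdog-mux --no-pager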

Just a few minutes later, this one shows something about gdrive failing, then the power key is pressed and it shuts down:
Code:
Feb 09 21:17:04 proxmox rclone[759]: Failed to create file system for "GDrive:": couldn't find root directory ID: Get "https://www.googleapis.com/drive/v3/files/root?alt=json&fields=id&prettyPrint=false&supportsAllDrives=true": dial tcp: lookup www.googleapis.com on 192.168.1.119:53: read udp 192.168.1.120:55109->192.168.1.119:53: i/o timeout
Feb 09 21:17:04 proxmox systemd[1]: gdrive.service: Main process exited, code=exited, status=1/FAILURE
Feb 09 21:17:04 proxmox systemd[1]: gdrive.service: Failed with result 'exit-code'.
Feb 09 21:17:14 proxmox systemd[1]: gdrive.service: Scheduled restart job, restart counter is at 1.
Feb 09 21:17:14 proxmox systemd[1]: Stopped gdrive.service - rclone for gdrive.
Feb 09 21:17:14 proxmox systemd[1]: Started gdrive.service - rclone for gdrive.
Feb 09 21:17:17 proxmox pvedaemon[1027]: <root@pam> successful auth for user 'root@pam'
Feb 09 21:19:55 proxmox systemd-logind[651]: Power key pressed short.
Feb 09 21:19:55 proxmox systemd-logind[651]: Powering off...
Feb 09 21:19:55 proxmox systemd-logind[651]: System is powering down.
-- Boot 250e2c5e6da84feb9f54d3b545373497 --

Then a few minutes later:
Code:
Feb 09 21:33:27 proxmox pmxcfs[922]: [dcdb] crit: received write while not quorate - trigger resync
Feb 09 21:33:27 proxmox pmxcfs[922]: [dcdb] crit: leaving CPG group
Feb 09 21:33:27 proxmox pmxcfs[922]: [dcdb] notice: start cluster connection
Feb 09 21:33:27 proxmox pmxcfs[922]: [dcdb] crit: cpg_join failed: 14
Feb 09 21:33:27 proxmox pmxcfs[922]: [dcdb] crit: can't initialize service
Feb 09 21:33:28 proxmox pve-ha-crm[1032]: status change wait_for_quorum => slave
Feb 09 21:33:28 proxmox pve-ha-crm[1032]: status change slave => wait_for_quorum
-- Boot c489fa33a197491d80d3603966d4e85b --
Feb 09 21:34:35 proxmox kernel: Linux version 6.8.12-4-pve

Possible watchdog restart here?
Code:
Feb 09 21:38:14 proxmox kernel: vmbr0: port 4(veth100i0) entered blocking state
Feb 09 21:38:14 proxmox kernel: vmbr0: port 4(veth100i0) entered forwarding state
Feb 09 21:38:29 proxmox watchdog-mux[653]: client watchdog expired - disable watchdog updates
-- Boot bb38f18e597647bc9a3f193025d75ad9 --
Feb 09 21:39:16 proxmox kernel: Linux version 6.8.12-4-pve


Little bit later:
Code:
Feb 09 21:56:48 proxmox pve-ha-crm[1030]: status change slave => wait_for_quorum
Feb 09 21:56:52 proxmox pve-ha-lrm[1045]: status change active => lost_agent_lock
Feb 09 21:56:54 proxmox pmxcfs[920]: [dcdb] notice: members: 1/920
Feb 09 21:56:54 proxmox pmxcfs[920]: [dcdb] notice: all data is up to date
Feb 09 21:57:09 proxmox pvescheduler[7209]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Feb 09 21:57:09 proxmox pvescheduler[7208]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
-- Boot 25e806e9c94b4e7084867c7df8a0abb7 --
Feb 09 21:58:25 proxmox kernel: Linux version 6.8.12-4-pve


Couple hours later:
Code:
Feb 09 23:17:01 proxmox CRON[23481]: pam_unix(cron:session): session closed for user root
Feb 09 23:28:30 proxmox smartd[648]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 68 to 72
-- Boot a6ed264ef5644f6ea381c3475c4cd3fe --
Feb 10 00:43:58 proxmox kernel: Linux version 6.8.12-4-pve

Then today:

Code:
Feb 11 21:26:58 proxmox systemd-shutdown[1]: Watchdog running with a hardware timeout of 10min.
Feb 11 21:26:58 proxmox systemd-shutdown[1]: Syncing filesystems and block devices.
Feb 11 21:26:58 proxmox systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Feb 11 21:26:58 proxmox systemd-journald[361]: Received SIGTERM from PID 1 (systemd-shutdow).
Feb 11 21:26:58 proxmox systemd-journald[361]: Journal stopped
-- Boot ed777731864747bf8d10280bbdeef529 --
Feb 11 21:27:34 proxmox kernel: Linux version 6.8.12-4-pve


Something to keep an eye on at least: is the watchdog triggering your reboots?

I have yet to see a solid root cause, but I can see it's definitely not happy and keeps rebooting. One reboot would appear to be due to the power switch on 2/9, so maybe throw that one out: Feb 09 21:19:55 proxmox systemd-logind[651]: Power key pressed short.
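
If you want to rule the watchdog in or out, it may be worth checking how it is set up on that node, roughly along these lines (paths are the stock PVE ones, adjust if yours differ):

Code:
# which watchdog module PVE is told to load (empty usually means the softdog fallback)
cat /etc/default/pve-ha-manager

# is a watchdog device present, and which modules are behind it?
ls -l /dev/watchdog*
lsmod | grep -iE 'softdog|wdt'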
 
Hi,
Thank you very much for analyzing this. We're on the same page regarding 2/9; I must have pressed the shutdown button.
Regarding HA, I'm thinking about enabling it but haven't yet.
I saw several issues related to quorum. Can this be the reason? On Feb 9th I had only 2 nodes, and I know 3 is the minimum to have quorum. But can that trigger a reboot?
Regarding the watchdog, I don't think I enabled one on purpose and can't see why a watchdog would be rebooting the node. I'll dig further and report back here if I discover anything new.

Thanks again.
 
It shouldn't be restarting, no. There's nothing wrong with HA; just make sure everything is solid: storage, network, cluster settings. Do you have a dedicated NIC for corosync?
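
Something like this on each node would show whether quorum and the corosync link look healthy (standard PVE/corosync commands, nothing exotic):

Code:
# cluster membership, votes and quorum state
pvecm status

# per-link health of the corosync ring(s)
corosync-cfgtool -s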
 
That's actually a tough question :rolleyes:
I didn't pay attention to corosync at all, and I have only one NIC per node, so control and data traffic go through the same interface.
 
Well, that's not to say this is the issue, just something to be aware of down the road if you ever put it into production; it could cause problems later.
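
One cheap way to see whether the shared NIC is already hurting corosync is to grep the journal for link complaints; a rough sketch, assuming only the standard corosync unit name:

Code:
# corosync complaints on the link: retransmits, token timeouts, link flaps
journalctl -u corosync --since "7 days ago" | grep -iE 'retransmit|token|link.*(down|up)' | tail -n 50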
 
Was anything more ever discovered about this issue? I have one node that falls off the network once or twice a week and another that falls off maybe once a month. Hardware tests are fine on both.

I found this thread because I was searching for the last event in my logs, "[dcdb] notice: data verification successful", to see if it was related. Because my logs stop entirely, there's not much I can do. I was originally going to set up a cron job that simply reboots the node if it can't ping the gateway. I already have daily reboots running around 4 a.m. for all nodes, but since those don't run when the node is hung, there's nothing I can do on that front.
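
For what it's worth, this is roughly the sketch I had in mind (gateway address and script path are placeholders, and of course it can't fire if the whole box is hung):

Bash:
#!/bin/bash
# hypothetical /usr/local/bin/gw-watchdog.sh - reboot if the gateway stops answering
GATEWAY="192.168.1.1"              # placeholder, use the real gateway address
if ! ping -c 3 -W 2 "$GATEWAY" > /dev/null 2>&1; then
    logger -t gw-watchdog "gateway $GATEWAY unreachable, rebooting"
    /sbin/reboot
fi

# example crontab entry (every 5 minutes):
# */5 * * * * /usr/local/bin/gw-watchdog.sh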

I'm going to wait for it to happen again and stick a monitor on the problem node(s) to see if they are also rebooting without a boot option.

Bash:
Mar 07 14:52:32 hl-pm-05 pmxcfs[951]: [status] notice: received log
Mar 07 15:07:32 hl-pm-05 pmxcfs[951]: [status] notice: received log
Mar 07 15:10:59 hl-pm-05 pmxcfs[951]: [dcdb] notice: data verification successful
-- Reboot --
Mar 08 10:07:17 hl-pm-05 kernel: Linux version 6.8.4-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-2 (2024-04-10T17:36Z) ()
 
Just keep an eye on the log output leading up to the random crashes/reboots and try to find a pattern if you can. If needed, set up rsyslog to push logging to an aggregator so you can keep an eye on them. It's never good if you need to reboot nodes daily; you should be able to go months between maintenance restarts.
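
Forwarding the logs off-box is usually only a line or two of rsyslog config per node; a minimal sketch, assuming a collector at 192.168.1.50 listening on UDP 514 (both made up), and that rsyslog is installed:

Code:
# apt install rsyslog   # if not already present on the node
# hypothetical drop-in: ship everything to the central syslog host (single @ = UDP, @@ = TCP)
echo '*.* @192.168.1.50:514' > /etc/rsyslog.d/90-forward.conf
systemctl restart rsyslog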