[Feature request] - Watchdog for standalone hosts - or workaround

morlies

I'm on the latest update and my host freezes every few hours. It is a standalone host, and I would appreciate an option to reboot it automatically.

Thread for such a problem

That wouldn't replace a proper analysis, but I had the same issue after some changes a while back, and it was solved by an update. Currently I have to reboot manually after my monitoring sends an "out of office" notification from my server check. As said, the purpose of this thread is not to get my issue solved. Just for information: it's a freeze without any unusual entries in the log.

Regular Debian watchdog cannot be installed as it would remove core PVE packages.

I guess this would be very helpful for everyone using a standalone host.
 
First, have you tried using crashdump? If not, please do so.

Regular Debian watchdog cannot be installed as it would remove core PVE packages.
Hmm ... I wasn't aware of that. I'm curious why this is the case. It hasn't always been the case. I remember installing and using it a few ... eh ... maybe more than a few years ago.
 
I've never heard of crashdump. Is it a package, and how do I activate or install it? It would be for analysis in case of a crash, right? Or would it restart the host as well?

Here is the error output I get when I try to install the watchdog package.
Code:
root@pve:~# apt-get -y install watchdog
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages were automatically installed and are no longer required:
  fonts-font-logos libnet-subnet-perl libpve-network-perl libpve-notify-perl proxmox-default-kernel
Use 'apt autoremove' to remove them.
The following packages will be REMOVED:
  proxmox-ve pve-container pve-ha-manager pve-manager qemu-server
The following NEW packages will be installed:
  watchdog
0 upgraded, 1 newly installed, 5 to remove and 0 not upgraded.
Need to get 69.7 kB of archives.
After this operation, 4,226 kB disk space will be freed.
Get:1 http://ftp.de.debian.org/debian bookworm/main amd64 watchdog amd64 5.16-1+b2 [69.7 kB]
Fetched 69.7 kB in 0s (280 kB/s)   
W: (pve-apt-hook) !! WARNING !!
W: (pve-apt-hook) You are attempting to remove the meta-package 'proxmox-ve'!
W: (pve-apt-hook)
W: (pve-apt-hook) If you really want to permanently remove 'proxmox-ve' from your system, run the following command
W: (pve-apt-hook)       touch '/please-remove-proxmox-ve'
W: (pve-apt-hook) run apt purge proxmox-ve to remove the meta-package
W: (pve-apt-hook) and repeat your apt invocation.
W: (pve-apt-hook)
W: (pve-apt-hook) If you are unsure why 'proxmox-ve' would be removed, please verify
W: (pve-apt-hook)       - your APT repository settings
W: (pve-apt-hook)       - that you are using 'apt full-upgrade' to upgrade your system
E: Sub-process /usr/share/proxmox-ve/pve-apt-hook returned an error code (1)
E: Failure running script /usr/share/proxmox-ve/pve-apt-hook
root@pve:~#
 
I've never heard of crashdump. Is it a package, and how do I activate or install it? It would be for analysis in case of a crash, right? Or would it restart the host as well?
Both, but it has to be a "real" kernel crash, which is very rare; a crash dump is the only way to debug one.
Otherwise, maybe use netconsole to log the kernel output to another machine.

I have used both methods successfully in the past, yet I haven't had any software-induced crashes for years (on >100 hosts), so it is VERY stable in general. Most of the time the crash is hardware-induced and therefore shows up in the IPMI system event log.
 
Is it an apt package, or how do I activate it? A package name or related information would help.
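
For reference, a rough sketch of the netconsole option mentioned above. It is an illustration under assumptions, not a tested recipe: the helper name `netconsole_param`, the ports, the interface name `eno1`, and the addresses are all hypothetical placeholders. For crash dumps, the usual Debian package is kdump-tools (assuming that is what "crashdump" refers to here); it is not covered by this sketch.

```shell
#!/bin/sh
# Sketch only: build the parameter string for the netconsole kernel module.
# Format used by the module:
#   <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
# All values passed in are placeholders chosen by the caller.
netconsole_param() {
    # $1 src port, $2 src IP, $3 device, $4 target port, $5 target IP, $6 target MAC
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Usage on the PVE host (as root), with hypothetical addresses:
#   modprobe netconsole \
#     netconsole="$(netconsole_param 6665 192.0.2.10 eno1 6666 192.0.2.20 aa:bb:cc:dd:ee:ff)"
#
# On the receiving machine, listen for the UDP stream, for example:
#   nc -u -l 6666        # some netcat flavours need: nc -u -l -p 6666
```

The kernel messages are sent over UDP, so anything printed shortly before a freeze has a chance of reaching the second machine even if the local disk log never gets written.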
 
if it's the host itself that is crashing/hanging, you could just activate HA for one of your guests (or add a tiny special guest solely for that purpose). with HA active, the LRM needs to write to /etc/pve, else the watchdog won't be pulled and once it expires, the node will hard-reset.
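
The HA workaround described above could be sketched roughly as follows. This is a minimal sketch, not an official procedure: the helper names `ha_sid` and `enable_ha_watchdog` and the VMID 100 are hypothetical, and it assumes that `ha-manager add` with `--state started` behaves on a standalone node the same way it does in a cluster.

```shell
#!/bin/sh
# Sketch only: put one guest under HA so that the LRM arms the node's watchdog.

# Build the HA service ID ("vm:<id>" for VMs, "ct:<id>" for containers).
ha_sid() {
    kind="$1"
    vmid="$2"
    case "$kind" in
        vm|ct) ;;
        *) echo "unknown guest type: $kind" >&2; return 1 ;;
    esac
    case "$vmid" in
        ''|*[!0-9]*) echo "invalid VMID: $vmid" >&2; return 1 ;;
    esac
    printf '%s:%s\n' "$kind" "$vmid"
}

# Register the guest as an HA resource, then print the HA status.
enable_ha_watchdog() {
    sid="$(ha_sid "$1" "$2")" || return 1
    ha-manager add "$sid" --state started && ha-manager status
}

# Usage on the PVE host (as root); 100 is a placeholder VMID:
#   enable_ha_watchdog vm 100
```

A tiny dedicated guest keeps the workaround separate from production guests, so removing it later with `ha-manager remove` does not touch anything else.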
 
I will check the dump for analysis, but I will open a separate thread in case I need help. Thanks for the links and hints.

I'll keep this thread open, as it would be really helpful to have a documented process, e.g. in the wiki (or, even better, a built-in function), for watchdog functionality. Proxmox is used in many home automation setups, such as Home Assistant or FHEM. I know that for professional customers it's not relevant, as commercial setups will have clusters.
 
if it's the host itself that is crashing/hanging, you could just activate HA for one of your guests (or add a tiny special guest solely for that purpose). with HA active, the LRM needs to write to /etc/pve, else the watchdog won't be pulled and once it expires, the node will hard-reset.

Why would he need to do that?

Fresh standalone PVE install from ISO:

Code:
# ha-manager status
quorum OK



# systemctl status watchdog-mux.service

● watchdog-mux.service - Proxmox VE watchdog multiplexer
     Loaded: loaded (/lib/systemd/system/watchdog-mux.service; static)
     Active: active (running) since Sun 2024-02-18 01:04:29 UTC; 2 days ago
   Main PID: 507 (watchdog-mux)
      Tasks: 1 (limit: 18987)
     Memory: 184.0K
        CPU: 5.703s
     CGroup: /system.slice/watchdog-mux.service
             └─507 /usr/sbin/watchdog-mux

Feb 18 01:04:29 a2 systemd[1]: Started watchdog-mux.service - Proxmox VE watchdog multiplexer.
Feb 18 01:04:29 a2 watchdog-mux[507]: Watchdog driver 'Software Watchdog', version 0



# strace -t -e ioctl  -p507  | grep WDIOC_KEEPALIVE

strace: Process 507 attached
15:18:35 ioctl(3, WDIOC_KEEPALIVE)      = 0
15:18:36 ioctl(3, WDIOC_KEEPALIVE)      = 0
15:18:37 ioctl(3, WDIOC_KEEPALIVE)      = 0
15:18:38 ioctl(3, WDIOC_KEEPALIVE)      = 0
15:18:39 ioctl(3, WDIOC_KEEPALIVE)      = 0
15:18:40 ioctl(3, WDIOC_KEEPALIVE)      = 0
15:18:41 ioctl(3, WDIOC_KEEPALIVE)      = 0
^Cstrace: Process 507 detached



# wdctl /dev/watchdog0

Device:        /dev/watchdog0
Identity:      Software Watchdog [version 0]
Timeout:       10 seconds
Pre-timeout:    0 seconds
Pre-timeout governor: noop
Available pre-timeout governors: noop

I would consider this in and of itself a bug on a standalone node. Undocumented feature at best?

May I file it?
 
Hmm ... I wasn't aware of that. I'm curious why this is the case. It hasn't always been the case. I remember installing and using it a few ... eh ... maybe more than a few years ago.

It's an unintended consequence, apparently, of having something that should not even be on a standalone node, or any node without HA, to begin with, in my opinion.

Code:
# apt install --dry-run -o Debug::pkgProblemResolver=true watchdog

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Starting pkgProblemResolver with broken count: 1
Starting 2 pkgProblemResolver with broken count: 1
Investigating (0) pve-ha-manager:amd64 < 4.0.3 @ii K Ib >
Broken pve-ha-manager:amd64 Conflicts on watchdog:amd64 < none -> 5.16-1+b2 @un puN >
  Considering watchdog:amd64 9998 as a solution to pve-ha-manager:amd64 9
  Removing pve-ha-manager:amd64 rather than change watchdog:amd64
Investigating (0) qemu-server:amd64 < 8.0.10 @ii K Ib >
Broken qemu-server:amd64 Depends on pve-ha-manager:amd64 < 4.0.3 @ii R > (>= 3.0-9)
  Considering pve-ha-manager:amd64 9 as a solution to qemu-server:amd64 7
  Removing qemu-server:amd64 rather than change pve-ha-manager:amd64
Investigating (0) pve-container:amd64 < 5.0.8 @ii K Ib >
Broken pve-container:amd64 Depends on pve-ha-manager:amd64 < 4.0.3 @ii R > (>= 3.0-9)
  Considering pve-ha-manager:amd64 9 as a solution to pve-container:amd64 6
  Removing pve-container:amd64 rather than change pve-ha-manager:amd64
Investigating (0) pve-manager:amd64 < 8.1.4 @ii K Ib >
Broken pve-manager:amd64 Depends on pve-container:amd64 < 5.0.8 @ii R > (>= 5.0.5)
  Considering pve-container:amd64 6 as a solution to pve-manager:amd64 1
  Removing pve-manager:amd64 rather than change pve-container:amd64
Investigating (0) proxmox-ve:amd64 < 8.1.0 @ii K Ib >
Broken proxmox-ve:amd64 Depends on pve-manager:amd64 < 8.1.4 @ii R > (>= 8.0.4)
  Considering pve-manager:amd64 1 as a solution to proxmox-ve:amd64 0
  Removing proxmox-ve:amd64 rather than change pve-manager:amd64
Done
The following packages will be REMOVED:
  proxmox-ve pve-container pve-ha-manager pve-manager qemu-server
The following NEW packages will be installed:
  watchdog
0 upgraded, 1 newly installed, 5 to remove and 4 not upgraded.
Remv proxmox-ve [8.1.0]
Remv pve-manager [8.1.4]
Remv qemu-server [8.0.10] [pve-ha-manager:amd64 ]
Remv pve-ha-manager [4.0.3] [pve-container:amd64 ]
Remv pve-container [5.0.8]
Inst watchdog (5.16-1+b2 Debian:12.5/stable [amd64])
Conf watchdog (5.16-1+b2 Debian:12.5/stable [amd64])
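
The resolver trace above starts from a single `Conflicts` relation declared by pve-ha-manager. A quick way to look at that field directly might be the following; it is only a sketch, and the helper name `conflicts_field` is made up for illustration.

```shell
#!/bin/sh
# Sketch only: print the Conflicts field from package metadata read on stdin.
conflicts_field() {
    sed -n 's/^Conflicts: //p'
}

# Usage on the PVE host:
#   apt-cache show pve-ha-manager | conflicts_field
```

If `watchdog` is listed there, APT can only install it by removing pve-ha-manager, and everything that depends on it follows, as the trace shows.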

EDIT: Split off to separate thread now:
https://forum.proxmox.com/threads/cannot-remove-pve-ha-manager-why.141940/#post-636316
 