Cannot remove pve-ha-manager, why?

esi_y

I am never planning to use the HA stack, but I cannot remove pve-ha-manager:

Code:
# apt remove pve-ha-manager --dry-run -o Debug::pkgProblemResolver=true 

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Starting pkgProblemResolver with broken count: 3
Starting 2 pkgProblemResolver with broken count: 3
Investigating (0) qemu-server:amd64 < 8.0.10 @ii K Ib >
Broken qemu-server:amd64 Depends on pve-ha-manager:amd64 < 4.0.3 @ii pR > (>= 3.0-9)
  Considering pve-ha-manager:amd64 10001 as a solution to qemu-server:amd64 3
  Removing qemu-server:amd64 rather than change pve-ha-manager:amd64
Investigating (0) pve-container:amd64 < 5.0.8 @ii K Ib >
Broken pve-container:amd64 Depends on pve-ha-manager:amd64 < 4.0.3 @ii pR > (>= 3.0-9)
  Considering pve-ha-manager:amd64 10001 as a solution to pve-container:amd64 2
  Removing pve-container:amd64 rather than change pve-ha-manager:amd64
Investigating (0) pve-manager:amd64 < 8.1.4 @ii K Ib >
Broken pve-manager:amd64 Depends on pve-container:amd64 < 5.0.8 @ii R > (>= 5.0.5)
  Considering pve-container:amd64 2 as a solution to pve-manager:amd64 1
  Removing pve-manager:amd64 rather than change pve-container:amd64
Investigating (0) proxmox-ve:amd64 < 8.1.0 @ii K Ib >
Broken proxmox-ve:amd64 Depends on pve-manager:amd64 < 8.1.4 @ii R > (>= 8.0.4)
  Considering pve-manager:amd64 1 as a solution to proxmox-ve:amd64 0
  Removing proxmox-ve:amd64 rather than change pve-manager:amd64
Done
The following packages will be REMOVED:
  proxmox-ve pve-container pve-ha-manager pve-manager qemu-server
0 upgraded, 0 newly installed, 5 to remove and 4 not upgraded.
Remv proxmox-ve [8.1.0]
Remv pve-manager [8.1.4]
Remv qemu-server [8.0.10] [pve-ha-manager:amd64 ]
Remv pve-ha-manager [4.0.3] [pve-container:amd64 ]
Remv pve-container [5.0.8]
 
It's an integral part of the whole system and never intended to be removed.
If you don't plan on using HA, simply don't enable HA on any resources.
 
It's an integral part of the whole system and never intended to be removed.

That's concerning in terms of its implications, especially when it comes to potential bugs.

If you don't plan on using HA, simply don't enable HA on any resources.

See for instance:
https://bugzilla.proxmox.com/show_bug.cgi?id=5243

In a complex piece of software like that, it is quite natural that one wants to be able to completely deactivate it, or even to prevent HA resources from being configured at all.



Also, the dependencies appear broken in terms of logic:

Code:
pve-ha-manager
  Reverse Depends: pve-container (>= 5.0.2)
  Reverse Depends: pve-ha-manager-dbgsym (= 4.0.1)
  Reverse Depends: pve-manager (8.0.0~8)
  Reverse Depends: qemu-server (>= 8.0.10)
pve-container
  Reverse Depends: pve-ha-manager (>= 4.0.1)
  Reverse Depends: pve-manager (>= 8.0.0~9)
pve-manager
  Reverse Depends: proxmox-ve (8.0.0)
proxmox-ve
pve-ha-manager-dbgsym
qemu-server
  Reverse Depends: proxmox-ve (8.0.0)
  Reverse Depends: pve-ha-manager (>= 4.0.1)
  Reverse Depends: pve-manager (>= 8.0.0~9)
  Reverse Depends: qemu-server-dbgsym (= 8.0.10)
qemu-server-dbgsym

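(For reference: a listing like the one above can be approximated with apt-cache itself, though it prints the depending packages without the version constraints shown here; a sketch:)

Code:
# list installed reverse dependencies for each package (illustrative;
# version constraints are not printed by apt-cache)
for p in pve-ha-manager pve-container pve-manager proxmox-ve qemu-server; do
  apt-cache rdepends --installed "$p"
done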
E.g. qemu-server or pve-container depends on pve-ha-manager? That cannot be the right way around ...
 
If each piece can be installed/enabled separately, the testing effort grows exponentially due to the number of combinations possible, which is also a concern.
Not each piece; this is called the HA stack, literally, in the docs. It is by its very nature an add-on. It needs the other components to do its part; the other components cannot depend on it. If that were the case, something would be terribly broken with the architecture as a whole.

Quite the contrary: even basic unit testing is based on the idea that less complex systems are easier to test.

I am now genuinely concerned about the dependency list: why is it botched?
 
That's concerning in terms of its implications, especially when it comes to potential bugs.
It's exactly the opposite. If they were completely removable, all combinations would have to be tested, with the same effort and scope.

E.g. qemu-server or pve-container depends on pve-ha-manager? That cannot be the right way around ...
Yes, that's in the very nature of these components. As both of these components have to deal with HA (if and only if it is enabled), they obviously also need to interact with the HA stack.

In a complex piece of software like that, it is quite natural that one wants to be able to completely deactivate it, or even to prevent HA resources from being configured at all.
Not really. If you don't know what something is doing and/or don't need it (as in your case), you simply do not enable it.
And if one is still learning, one never tries out such features in a production environment anyway. That's why staging and test environments exist.
 
It's exactly the opposite. If they were completely removable, all combinations would have to be tested, with the same effort and scope.

If a codebase of 1,000 lines potentially has 5 bugs per year, how many does one of 10,000 lines have, and which one would you rather run?

Yes, that's in the very nature of these components. As both of these components have to deal with HA (if and only if it is enabled), they obviously also need to interact with the HA stack.

Perfectly understandable, but what does it have to do with dependencies? With e.g. pve-ha-crm and pve-ha-lrm disabled, all the other services run just fine, i.e. those codepaths are either never reached or handled gracefully. How is this related to package dependencies?
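(To illustrate what I mean by "disabled", this is roughly how one would check that state; a sketch, assuming a standard PVE node:)

Code:
# check whether the HA services are currently enabled and running (illustrative)
systemctl is-enabled pve-ha-crm pve-ha-lrm
systemctl is-active pve-ha-crm pve-ha-lrm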

Not really. If you don't know what something is doing and/or don't need it (as in your case), you simply do not enable it.

Except it gets re-enabled on e.g. update?

And if one is still learning, one never tries out such features in a production environment anyway. That's why staging and test environments exist.

If there are more bugs in the HA stack, how is that related? They are running in my production, dormant, or so you believe.

But I really would like to know why the package dependencies are all wrong. That has nothing to do with the fact that the HA stack interacts with QEMU.
 
NB: I just want to make it clear that I am asking for something very reasonable, i.e. pve-ha-manager can depend on everything else, but qemu-server cannot depend on pve-ha-manager.

I am not asking to be able to e.g. remove qemu-server because I do not intend to run VMs; that would have been a case where your answer was reasonable.
 
Not each piece; this is called the HA stack, literally, in the docs. It is by its very nature an add-on.
Please show where it is called an add-on. It is an integral part of Proxmox VE, yet does not do anything if not enabled.
"Stack" simply means that it is made up of multiple, layered components - that is the definition of the word. Again, I do not know where you get "add-on" from.

Perfectly understandable, but what does it have to do with dependencies? With e.g. pve-ha-crm and pve-ha-lrm disabled, all the other services run just fine, i.e. those codepaths are either never reached or handled gracefully. How is this related to package dependencies?
Because they are still components of the whole Proxmox VE stack and the code is still called, even though nothing then happens inside the HA code. In this case, it's multiple packages. In other, compiled languages it would have been one big binary, still with all the components, including HA, compiled in. That's just how software works.
The only difference here - one is shipped as source and dynamically run, the other gets compiled ahead-of-time.

Please have a look at the Introduction and its first few sections - it explains a bit about the architecture of our whole stack. :)

Except it gets re-enabled on e.g. update?
No? HA for a resource never gets enabled automatically via package updates? Where did you get that from?
Or are you talking about a hypothetical scenario where you could remove the package? Again, it's a completely void argument in any case.

I am not asking to be able to e.g. remove qemu-server because I do not intend to run VMs; that would have been a case where your answer was reasonable.
But it's the exact same thing you are asking here for. It does not matter which component. qemu-server and pve-ha-manager are both integral parts of the whole system and are needed. They do nothing if not used.
 
Please show where it is called an add-on.

I did not quote anyone calling it an add-on. And I do not wish to get sidetracked from the subject of this post of mine, so I will explain why I posted it; see below.

It is an integral part of Proxmox VE, yet does not do anything if not enabled.

This is not a meaningful argument when all the services are enabled on boot. There is literally no way to disable HA. The bug report linked above, conveniently left unanswered, is a perfect example where, in your understanding, the "disabled" component was ... well, not disabled at all.

"Stack" simply means that it is made up of multiple, layered components - that is the definition of the word. Again, I do not know where you get "add-on" from.

I am sorry, but this is twisting it: when you call one piece of a LARGE stack "a stack", you are making it stand out. Why so? The docs literally and repeatedly refer to the "HA stack". The point is not what "stack" means; the point is that you do not refer to any other part of PVE as "a stack."

Because they are still components of the whole Proxmox VE stack and the code is still called, even though nothing then happens inside the HA code. In this case, it's multiple packages. In other, compiled languages it would have been one big binary

No, it would have been libraries. If it were one big binary, it would have shipped in one big package. Why doesn't it ship in one package when it's all so intertwined? As for the example that, in compiled languages, this would have been one large binary: I have never seen one big binary shipped in multiple package chunks. I don't want to nitpick, but why then split it (into packages)?

, still with all the components, including HA, compiled in. That's just how software works.
The only difference here - one is shipped as source and dynamically run, the other gets compiled ahead-of-time.

This has nothing to do with compiled versus interpreted. This is a case where you can have a meta package (PVE) and proper dependencies amongst packages. Otherwise, why not ship it as one?

Please have a look at the Introduction and its first few sections - it explains a bit about the architecture of our whole stack. :)

Excellent link, so how come ... this is not a problem:
Code:
# pveceph fs create
binary not installed: /usr/bin/ceph-mon

Is it shipped incomplete?

This was a really great example. I could totally understand it if I ran e.g. ha-manager status and it said "not installed".
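(For comparison, that is how an optional component looks: the Ceph binaries only land on the node once one explicitly pulls them in; a sketch:)

Code:
# Ceph is opt-in: its packages are only installed on explicit request
pveceph install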

No? HA for a resource never gets enabled automatically via package updates? Where did you get that from?
Or are you talking about a hypothetical scenario where you could remove the package? Again, it's a completely void argument in any case.

This is a misunderstanding; you use "enabled" as if there were some real way to enable/disable the component (a config line for an HA resource is not what my post is asking about). I said that even in the case where I systemctl disable the respective services, it is not reliable (across upgrades); but the component is clearly not that integrated either, as otherwise the whole rest of the stack would crumble.

But it's the exact same thing you are asking here for. It does not matter which component. qemu-server and pve-ha-manager are both integral parts of the whole system and are needed.

But they don't have broken logic in package dependencies:

pve-manager has only proxmox-ve depending on it
pve-manager depends on qemu-server
qemu-server does NOT depend on pve-manager


I really wonder: what is the explanation for this inconsistency?

They do nothing if not used.

What about the:


Moreover, because of these broken dependencies, it is e.g. not possible to install an alternative watchdog package, because the "HA stack" cannot be kicked out independently.

And there is no way to disable the HA stack; an empty config is not a disabled component. One can opt not to install Ceph, but like this, one has to run services that can at any time open a socket to watchdog_mux, and if they crash, the whole node reboots. Then you would be dispensing advice here that such a reboot was impossible for someone not running anything HA-"enabled".

How is it helpful to be running watchdog_mux on every node?
 
So this is really bad, because (including the package contents here for the benefit of everyone else wondering the same):

Code:
drwxr-xr-x root/root         0 2023-11-17 13:49 ./
drwxr-xr-x root/root         0 2023-11-17 13:49 ./etc/
drwxr-xr-x root/root         0 2023-11-17 13:49 ./etc/default/
-rw-r--r-- root/root        76 2023-05-02 16:04 ./etc/default/pve-ha-manager
drwxr-xr-x root/root         0 2023-11-17 13:49 ./lib/
drwxr-xr-x root/root         0 2023-11-17 13:49 ./lib/systemd/
drwxr-xr-x root/root         0 2023-11-17 13:49 ./lib/systemd/system/
-rw-r--r-- root/root       477 2023-05-02 16:04 ./lib/systemd/system/pve-ha-crm.service
-rw-r--r-- root/root       675 2023-05-02 16:04 ./lib/systemd/system/pve-ha-lrm.service
-rw-r--r-- root/root       172 2023-05-02 16:04 ./lib/systemd/system/watchdog-mux.service
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/sbin/
-rwxr-xr-x root/root       251 2023-11-17 13:49 ./usr/sbin/ha-manager
-rwxr-xr-x root/root       519 2023-11-17 13:49 ./usr/sbin/pve-ha-crm
-rwxr-xr-x root/root       519 2023-11-17 13:49 ./usr/sbin/pve-ha-lrm
-rwxr-xr-x root/root     18880 2023-11-17 13:49 ./usr/sbin/watchdog-mux
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/bash-completion/
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/bash-completion/completions/
-rw-r--r-- root/root       310 2023-11-17 13:49 ./usr/share/bash-completion/completions/ha-manager
-rw-r--r-- root/root       310 2023-11-17 13:49 ./usr/share/bash-completion/completions/pve-ha-crm
-rw-r--r-- root/root       310 2023-11-17 13:49 ./usr/share/bash-completion/completions/pve-ha-lrm
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/doc/
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/doc/pve-ha-manager/
-rw-r--r-- root/root       109 2023-11-17 13:49 ./usr/share/doc/pve-ha-manager/SOURCE
-rw-r--r-- root/root      3306 2023-11-17 13:49 ./usr/share/doc/pve-ha-manager/changelog.gz
-rw-r--r-- root/root       764 2023-05-02 16:04 ./usr/share/doc/pve-ha-manager/copyright
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/lintian/
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/lintian/overrides/
-rw-r--r-- root/root       119 2023-05-02 16:04 ./usr/share/lintian/overrides/pve-ha-manager
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/man/
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/man/man1/
-rw-r--r-- root/root     15677 2023-11-17 13:49 ./usr/share/man/man1/ha-manager.1.gz
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/man/man8/
-rw-r--r-- root/root      1227 2023-11-17 13:49 ./usr/share/man/man8/pve-ha-crm.8.gz
-rw-r--r-- root/root      1226 2023-11-17 13:49 ./usr/share/man/man8/pve-ha-lrm.8.gz
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/perl5/
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/perl5/PVE/
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/perl5/PVE/API2/
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/perl5/PVE/API2/HA/
-rw-r--r-- root/root      5886 2023-11-17 13:49 ./usr/share/perl5/PVE/API2/HA/Groups.pm
-rw-r--r-- root/root      8849 2023-11-17 13:49 ./usr/share/perl5/PVE/API2/HA/Resources.pm
-rw-r--r-- root/root      8059 2023-11-17 13:49 ./usr/share/perl5/PVE/API2/HA/Status.pm
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/perl5/PVE/CLI/
-rw-r--r-- root/root      5204 2023-11-17 13:49 ./usr/share/perl5/PVE/CLI/ha_manager.pm
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/
-rw-r--r-- root/root      6381 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/CRM.pm
-rw-r--r-- root/root      8450 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/Config.pm
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/Env/
-rw-r--r-- root/root      9942 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/Env/PVE2.pm
-rw-r--r-- root/root      5462 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/Env.pm
-rw-r--r-- root/root      5847 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/Fence.pm
-rw-r--r-- root/root      4807 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/FenceConfig.pm
-rw-r--r-- root/root      2698 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/Groups.pm
-rw-r--r-- root/root     26365 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/LRM.pm
-rw-r--r-- root/root     31845 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/Manager.pm
-rw-r--r-- root/root      6128 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/NodeStatus.pm
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/Resources/
-rw-r--r-- root/root      3359 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/Resources/PVECT.pm
-rw-r--r-- root/root      4076 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/Resources/PVEVM.pm
-rw-r--r-- root/root      4477 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/Resources.pm
-rw-r--r-- root/root      7081 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/Tools.pm
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/Usage/
-rw-r--r-- root/root      1007 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/Usage/Basic.pm
-rw-r--r-- root/root      3548 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/Usage/Static.pm
-rw-r--r-- root/root       856 2023-11-17 13:49 ./usr/share/perl5/PVE/HA/Usage.pm
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/perl5/PVE/Service/
-rw-r--r-- root/root       921 2023-11-17 13:49 ./usr/share/perl5/PVE/Service/pve_ha_crm.pm
-rw-r--r-- root/root       924 2023-11-17 13:49 ./usr/share/perl5/PVE/Service/pve_ha_lrm.pm
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/zsh/
drwxr-xr-x root/root         0 2023-11-17 13:49 ./usr/share/zsh/vendor-completions/
-rw-r--r-- root/root       362 2023-11-17 13:49 ./usr/share/zsh/vendor-completions/_ha-manager
-rw-r--r-- root/root       362 2023-11-17 13:49 ./usr/share/zsh/vendor-completions/_pve-ha-crm
-rw-r--r-- root/root       362 2023-11-17 13:49 ./usr/share/zsh/vendor-completions/_pve-ha-lrm

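(For anyone wanting to reproduce this: the listing above is the package content, obtainable roughly like so; a sketch, the exact filename depends on the version:)

Code:
# download the .deb and list its contents (illustrative)
apt download pve-ha-manager
dpkg-deb -c pve-ha-manager_*.deb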
It's a completely weird way to package it when:

Code:
Feb 20 21:53:49 n3 pvedaemon[1115]: BEGIN failed--compilation aborted at /usr/share/perl5/PVE/API2/Cluster.pm line 14.
Feb 20 21:53:49 n3 pvedaemon[1115]: Compilation failed in require at /usr/share/perl5/PVE/API2.pm line 15.
Feb 20 21:53:49 n3 pvedaemon[1115]: BEGIN failed--compilation aborted at /usr/share/perl5/PVE/API2.pm line 15.
Feb 20 21:53:49 n3 pvedaemon[1115]: Compilation failed in require at /usr/share/perl5/PVE/Service/pvedaemon.pm line 8.
Feb 20 21:53:49 n3 pvedaemon[1115]: BEGIN failed--compilation aborted at /usr/share/perl5/PVE/Service/pvedaemon.pm line 8.
Feb 20 21:53:49 n3 pvedaemon[1115]: Compilation failed in require at /usr/bin/pvedaemon line 11.
Feb 20 21:53:49 n3 pvedaemon[1115]: BEGIN failed--compilation aborted at /usr/bin/pvedaemon line 11.

... all because pvedaemon wants to get hold of HA e.g. PVE::API2::HA::Resources, PVE::API2::HA::Groups, PVE::API2::HA::Status.
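(One can verify that hard include directly; a sketch, the exact line numbers will vary by version:)

Code:
# show where the cluster API module hard-codes the HA API modules
grep -n 'PVE::API2::HA' /usr/share/perl5/PVE/API2/Cluster.pm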

1) Why would you package it separately when there are literally 3 services that are not needed for anything else; and then
2) Hardcode it like this?

It's one or the other. It's not even consistent within PVE, e.g. pve-common is not a separate package.

The issue however remains (not addressed):

If a codebase of 1,000 lines potentially has 5 bugs per year, how many does one of 10,000 lines have, and which one would you rather run?

If there are more bugs in the HA stack, how is that related? They are running in my production, dormant, or so you believe.

This is concerning to anyone running the solution.

I understand it's easier for you to ship it this way (except, why is HA alone split off into a package?). By your own explanation, had this been a compiled set of components, a missing kernel module for something I do not intend to use could absolutely not get me rebooted.

There's a need for a "supported" way to actually disable the HA stack (so that even if it has bugs, those can't be run into).
 
I really wonder: what is the explanation for this inconsistency?
There is no inconsistency: pve-manager is a leaf node in the dependency tree; it mounts all API calls and thus must depend on every package that provides API methods.
The dependency from proxmox-ve is a very specific exception that is only present due to proxmox-ve being a meta package, mostly useful for versioning purposes and for installing the correct current package set, including a small set of things like the current default kernel and pve-manager (which then pulls in the rest), on top of a vanilla Debian installation.

A dependency from qemu-server to pve-manager would not only make no sense at all, as qemu-server does not use any of the manager's code; it would also add a circular dependency, and those are a great PITA, not only for bootstrapping but also for each and every update. Currently, we only have one such circular dependency, pve-access-control <-> pve-cluster; there we contained the pain through binary package splits, so that it is mostly relevant for bootstrapping, which is manageable, but even that circular dependency would be great to get rid of sometime.

We had an optional include for the SDN stack during its experimental phase, but this alone was a huge PITA to manage even though it was new and had very limited entry points; adding more of that, especially for long-lived endpoints, is just not an option.
To be blunt, if you think that's easy you just have no idea what work goes into maintaining such things, and what the regression potential is.

While the watchdog-mux is always running, it never triggers as long as no HA resource was configured, as both LRM and CRM are idle then; so HA, and with it the watchdog triggering, is disabled by default.
The watchdog-mux starting at boot is a must to ensure that PVE can actually bind to the watchdog, as otherwise enabling HA later at any time won't work.
If you do not want that, I'd recommend disabling the watchdog-mux, pve-ha-crm and pve-ha-lrm systemd services, and then either rebooting, or stopping all LRMs, then all CRMs, and only then all watchdog-mux instances; they won't start at the next boot – so the feature you ask for is already present, just use it...
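(Spelled out, that procedure would look roughly like this; a sketch following the order described above, to be run on every node:)

Code:
# prevent the HA services from starting at the next boot
systemctl disable pve-ha-lrm pve-ha-crm watchdog-mux
# if not rebooting, stop them in order: all LRMs first, then all CRMs,
# and only then the watchdog-mux instances
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm
systemctl stop watchdog-mux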
 
There is no inconsistency: pve-manager is a leaf node in the dependency tree; it mounts all API calls and thus must depend on every package that provides API methods.
I have seen that since, but why is it not then all shoved into pve-manager? That's the inconsistency. Since it has all those uses (which I do not wish to argue), why is it not treated like e.g. pve-common? There's no pve-common package either. I understand why it's a separate git repo, not why it is a package.

The dependency from proxmox-ve is a very specific exception that is only present due to proxmox-ve being a meta package

I have no issue with meta packages.

A dependency from qemu-server to pve-manager would not only make no sense at all

I think my point got lost here, because I argue exactly that a dependency of qemu-server on pve-manager makes no sense. But there is a circular dependency between pve-ha-manager and qemu-server:

Code:
# apt depends qemu-server
qemu-server
  ...
  Depends: pve-ha-manager (>= 3.0-9)



# apt depends pve-ha-manager
pve-ha-manager
  ...
  Depends: qemu-server (>= 8.0.2)

That's the inconsistency. There is also a circular dependency with pve-container.

, as qemu-server does not use any of the manager's code; it would also add a circular dependency, and those are a great PITA

Exactly! And that was in my second post: by the same logic, pve-ha-manager's code should not be used by qemu-server either, or?

, not only for bootstrapping but also for each and every update. Currently, we only have one such circular dependency, pve-access-control <-> pve-cluster;

At least I got my point across by now, I guess: something is wrong with the pve-ha-manager dependencies.

To be blunt, if you think that's easy

I don't mind blunt; I don't think it's easy, but there is something wrong above.

While the watchdog-mux is always running, it never triggers as long as no HA resource was configured, as both LRM and CRM are idle then; so HA, and with it the watchdog triggering, is disabled by default.

Thomas, you yourself know full well (from your earlier post) that (exact terms now):
1) softdog is active always
2) watchdog_mux is part of pve-ha-manager and always listens on that socket (so it can be triggered); it currently has one bug where it is active when it should not have been, and I have been seeing some logs where I suspect another, but I will not go blunt on that one before I feel confident myself
3) pve-ha-manager conflicts with the watchdog package, which is a PITA on its own if someone does not use HA and wants their own watchdog

The watchdog-mux starting at boot is a must to ensure that PVE can actually bind to the watchdog, as otherwise enabling HA later at any time won't work.

I don't want to digress in this post, but watchdog-mux:
1) does not have to start on a node; it could be launched only when the CRM or LRM is about to become active
2) when it starts it does not have to start feeding softdog unless it has clients (there's nothing to multiplex when there are none)
3) could close /dev/watchdog0 in an orderly way as clients come and go
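(Whether watchdog-mux currently holds the watchdog device can be checked directly; a sketch, fuser is part of the psmisc package:)

Code:
# show which process, if any, has the watchdog device open
fuser -v /dev/watchdog /dev/watchdog0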

If you do not want that, I'd recommend disabling the watchdog-mux, pve-ha-crm and pve-ha-lrm systemd services, and then either rebooting, or stopping all LRMs, then all CRMs, and only then all watchdog-mux instances; they won't start at the next boot – so the feature you ask for is already present, just use it...

I know about this; it's been asked frequently on the forum, but:
1) I can't rely on that surviving upgrades
2) if it's just disabled on some nodes, they will be "excluded" from HA in a "not so nice" way
3) the softdog device is still there, albeit not active - yes, I know I can blacklist the module, but then the log is messy with the HA stack complaining
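(For reference, the blacklisting from point 3 is the standard modprobe mechanism; a sketch - note that a plain blacklist entry only stops alias-based autoloading, while an install override also blocks explicit loading:)

Code:
# keep the softdog module from being loaded (the filename is arbitrary)
cat > /etc/modprobe.d/no-softdog.conf <<'EOF'
blacklist softdog
install softdog /bin/false
EOF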

There are quite a few posts on the forum from completely confused people regarding HA, who believed they had not been using it. And there is the bug.
 
I have seen that since, but why is it not then all shoved into pve-manager? That's the inconsistency. Since it has all those uses (which I do not wish to argue), why is it not treated like e.g. pve-common? There's no pve-common package either. I understand why it's a separate git repo, not why it is a package.
There's the libpve-common-perl package that comes from pve-common and is handled like every other package, but as a relatively low-level package it includes no API endpoints itself (but lots of supporting and base code for those).
Oh, and yeah, I forgot to mention the ha <-> qemu-server and ha <-> container ones; those are also something we'd like to be able to resolve someday. They exist because a start/stop/migrate request to the VM/CT CLI/API is redirected to HA, and the LRM in charge of a guest then calls the actual lower-level start/stop/migrate to execute it.
This could be resolved by a package split, but as the pain is mostly felt on bootstrapping, which is 1) not done often and 2) done by staff with enough packaging and PVE experience to handle it correctly, it simply wasn't, due to the small ROI.
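(That redirection is visible at the CLI level: for an HA-managed guest, a start request effectively becomes a requested-state change that the responsible LRM then executes; a sketch, vm:100 is a made-up service ID:)

Code:
# instead of starting the guest directly, HA records the requested state;
# the LRM in charge of the guest then performs the actual start
ha-manager set vm:100 --state started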

But the existence of those does not justify adding more; they are already too many, exist for historic reasons, and have already been isolated such that the maintenance work is manageable. That, again, does not mean that adding more such non-ideal options is fine.
1) softdog is active always
softdog is just the default and fallback; one watchdog is always active, not necessarily softdog.

2) watchdog_mux is part of pve-ha-manager and always listens on that socket (so it can be triggered)
Yeah, anything that is installed or can be enabled can be triggered. The point is, the watchdog won't be triggered through the HA stack if HA was never configured during the current boot, which is the default – so this really is not the loaded gun you try to make it seem like.
Stick to the facts, don't shoot from the hip with half-knowledge, and we can talk about actual improvements.
3) pve-ha-manager conflicts with the watchdog package, which is a PITA on its own if someone does not use HA and wants their own watchdog
Yeah, because that would grab /dev/watchdog, breaking watchdog-mux and pve-ha-manager; PVE simply does not support other watchdogs, which is expressed by that package relation. The fix here would be for watchdog-mux to offer a device node that speaks the watchdog protocol but goes through itself; that would actually be a nice feature that could make it quite useful outside the PVE ecosystem.
1) does not have to start on a node; it could be launched only when the CRM or LRM is about to become active
No, as by then the watchdog could already be bound by something else. We had socket activation in an early prototype; it was removed for a few reasons, and this was one of them.
2) when it starts it does not have to start feeding softdog unless it has clients (there's nothing to multiplex when there are none)
See the reply to 1): either it starts and binds the watchdog, making it active, or it doesn't.
3) could close /dev/watchdog0 in an orderly way as clients come and go
Same here; a simple design choice.
1) I can't rely on that surviving upgrades
You can definitely rely on explicitly disabled, and especially masked, systemd services staying that way.
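(A minimal sketch of that; masking replaces the unit with a link to /dev/null, so it cannot be started at all, not even by a package upgrade:)

Code:
# mask the HA services so nothing can start them until unmasked
systemctl mask pve-ha-crm pve-ha-lrm watchdog-mux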
2) if it's just disabled on some nodes, they will be "excluded" from HA in a "not so nice" way
Either one wants an HA cluster or one doesn't. In the latter case, one disables it on the whole cluster (not required if one never configures HA services); one does not do so for only part of the cluster.
3) the softdog device is still there, albeit not active - yes, I know I can blacklist the module, but then the log is messy with the HA stack complaining
The log is not messy if those services are disabled.

So again, this can already be done. If you want an easier way, it's fine to file an enhancement request for that, but saying it's not possible, or that it's activated by default, is just bogus.
 
There's the libpve-common-perl package that comes from pve-common and is handled like every other package, but as a relatively low-level package it includes no API endpoints itself (but lots of supporting and base code for those).

There are 16 rdepends on libpve-common-perl; that's all fine. For pve-ha-manager there are 4: 2 circular, 1 meta, and pve-manager itself => pve-ha-manager should then have been packed into pve-manager.

Oh, and yeah, I forgot to mention the ha <-> qemu-server and ha <-> container ones; those are also something we'd like to be able to resolve someday. They exist because a start/stop/migrate request to the VM/CT CLI/API is redirected to HA, and the LRM in charge of a guest then calls the actual lower-level start/stop/migrate to execute it.
This could be resolved by a package split, but as the pain is mostly felt on bootstrapping, which is 1) not done often and 2) done by staff with enough packaging and PVE experience to handle it correctly, it simply wasn't, due to the small ROI.

Ok.

softdog is just the default and fallback; one watchdog is always active, not necessarily softdog.

The point is that it's always active. Arguably (the forum posts being a good source), it's not something one wants or even expects. I could understand it once in a cluster; on a standalone node, the module should not be loaded by default.

This is the whole point of my thread: do I have to advise people to blacklist softdog when they ask "how do I reliably keep the HA stack off (because of reboots)?" Is that it? Can there not be a simple 0/1 HA on/off switch (since it can't be ripped out of pvedaemon)? That 0/1 could make the difference between opening /dev/watchdog0 at boot or not. Is that too much to ask?

Yeah, anything that is installed or can be enabled can be triggered. The point is, the watchdog won't be triggered through the HA stack if HA was never configured during the current boot, which is the default – so this really is not the loaded gun you try to make it seem like.

The "current boot" is not documented, that happened now. :) It went off in lots of forum posts - people test around, turn it off and then experience random dangling CRM (which they have no predictability which node it might have been). The bugreport is factual (no reply there so far).

Stick to the facts, don't shoot from the hip with half-knowledge, and we can talk about actual improvements.

I think I know how that operates quite well by now (no one corrected me so far):
https://forum.proxmox.com/threads/getting-rid-of-watchdog-emergency-node-reboot.136789/#post-635602

The OP surely felt something "bogus" going on.

The issue is that it's not even documented; the watchdog is buried under the HA stack. I literally found ONE forum post where you briefly mentioned that it could trigger anyway.

Yeah, because that would grab /dev/watchdog, breaking watchdog-mux and pve-ha-manager; PVE simply does not support other watchdogs, which is expressed by that package relation. The fix here would be for watchdog-mux to offer a device node that speaks the watchdog protocol but goes through itself; that would actually be a nice feature that could make it quite useful outside the PVE ecosystem.

I am pretty sure you know I was not asking for added complexity, but for reduced complexity (for those who choose it). :)

No, as by then the watchdog could already be bound by something else. We had socket activation in an early prototype; it was removed for a few reasons, and this was one of them.

Will I find details regarding this in pve-devel somewhere?

You can definitely rely on explicitly disabled, and especially masked, systemd services staying that way.

Next thing I know, pvedaemon will reboot the whole thing when no watchdog_mux is detected running, or something; it's not as predictable as a built-in choice would be.

Either one wants an HA cluster or one doesn't. In the latter case, one disables it on the whole cluster (not required if one never configures HA services); one does not do so for only part of the cluster.

The log is not messy if those services are disabled.

OK, so I tell them to blacklist softdog and disable pve-ha-crm, pve-ha-lrm and watchdog-mux. Next thing, someone will complain that this unsupported thing of mine made them forget about it, and now they don't have HA when they want to use it because the GUI didn't give any hint. Then fingers point at me for my advice. Their other choice is to maybe have their hardware reboot, just because they want to run PVE.

So again, this can already be done. If you want an easier way, it's fine to file an enhancement request for that, but saying it's not possible, or that it's activated by default, is just bogus.

The softdog (or another kernel watchdog) is touched the moment watchdog_mux starts up, HA or not, cluster or standalone. Not a word about it in any official doc.

EDIT: It's pretty terrific to read on the forum about the watchdog being "disarmed" just because there are no clients on its socket - that's not how you would want to transport an explosive, if watchdog_mux were one.

Thanks for the replies though.
 
@cheiss Apologies for the haggling about whether it is hardcoded to be together or not; I did not check beforehand. You can imagine I still think it would have been better if it weren't, and that now the split packaging is wrong for a change, but I should have checked the content before posting the question in its original form!
 
