[PROPOSAL/RFC] - Invent a script to check for important changes missed during a cluster's/node's lifetime

Hey!


Back in February this year, after quite a long round of discussions, Proxmox introduced a change in the corosync config that allows overriding the token coefficient and lowers its default value. Quick background: by default, corosync derives the token and consensus timeouts from the cluster size.

Code:
token timeout = token + (#nodes - 2) * token_coefficient
consensus timeout = 1.2 * token timeout
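To make the formula a bit more tangible, here is a minimal Python sketch that simply evaluates it for a few cluster sizes. The function name is made up for illustration, and the inputs (3000 ms token, 650 ms coefficient) are assumptions on my side, so please check corosync.conf(5) and your own config for the values that actually apply to your cluster.

```python
def corosync_timeouts(nodes: int, token_ms: int, token_coefficient_ms: int) -> tuple[int, float]:
    """Return (token_timeout_ms, consensus_timeout_ms) using the formula quoted above."""
    # The coefficient only contributes for nodes beyond the second one,
    # so clamp at zero for very small clusters.
    extra_nodes = max(0, nodes - 2)
    token_timeout = token_ms + extra_nodes * token_coefficient_ms
    consensus_timeout = 1.2 * token_timeout
    return token_timeout, consensus_timeout


if __name__ == "__main__":
    # Assumed example inputs, not values read from a real cluster.
    for n in (3, 10, 30):
        token, consensus = corosync_timeouts(n, token_ms=3000, token_coefficient_ms=650)
        print(f"{n:>2} nodes: token={token} ms, consensus={consensus:.0f} ms")
```

Running it with a lower coefficient shows how strongly that single setting influences the timeouts on larger clusters.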

With a cluster size of around 30 nodes, the result approaches 60 seconds, which can lead to the watchdog kicking in and fencing nodes. This is just one example, but I quite often see with customers that such important changes are overlooked, either because changelogs are not read thoroughly or because they simply get lost in the day-to-day workload.

So my suggestion would be to build some kind of tooling or script, similar to the pveXtoY script, that gives the user a toolkit to check for relevant changes (be it new defaults, new formulas, or newly proposed best practices) that were introduced over past Proxmox generations, like the example above.

In other words, this tooling would evolve alongside Proxmox versions and provide a way to review existing clusters against new features and default values, compare them, and notify the user accordingly. Clusters are typically designed for long lifetimes and often span multiple Proxmox generations without benefiting from improved defaults and changes.
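As a rough idea of what one such check could look like, here is a minimal Python sketch that reads token_coefficient from the totem section of a corosync.conf and compares it against a suggested value. Everything here is hypothetical: the function names, the tiny line-based parser, the sample config, and the 125 ms threshold are placeholders for illustration, not anything Proxmox ships.

```python
from typing import Optional

OK, WARNING = 0, 1

# Made-up sample config; a real check would read the node's actual corosync.conf.
SAMPLE_CONF = """
totem {
  cluster_name: demo
  token_coefficient: 650
  interface {
    linknumber: 0
  }
}
"""


def read_totem_value(conf_text: str, key: str) -> Optional[int]:
    """Return the integer for `key: N` set directly inside totem { }, or None if unset."""
    depth = 0
    in_totem = False
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.endswith("{"):
            if depth == 0 and line[:-1].strip() == "totem":
                in_totem = True
            depth += 1
        elif line == "}":
            depth -= 1
            if depth == 0:
                in_totem = False
        elif in_totem and depth == 1 and ":" in line:
            name, value = line.split(":", 1)
            if name.strip() == key:
                return int(value.strip())
    return None


def check_token_coefficient(conf_text: str, recommended_ms: int) -> tuple[int, str]:
    """Compare the configured coefficient with a suggested value and report a finding."""
    current = read_totem_value(conf_text, "token_coefficient")
    if current is None:
        return WARNING, "token_coefficient is not set explicitly; the built-in default applies."
    if current > recommended_ms:
        return WARNING, f"token_coefficient is {current} ms, above the suggested {recommended_ms} ms."
    return OK, f"token_coefficient is {current} ms."


if __name__ == "__main__":
    # 125 is a placeholder threshold for this sketch.
    status, message = check_token_coefficient(SAMPLE_CONF, recommended_ms=125)
    print(status, message)
```

The point of returning a status plus a message is that many small checks like this could be collected and reported in one place, one per relevant change.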

Happy to hear ideas. This could be done by and/or within the Proxmox community as well; I'm here to ask for input so that we as a whole can come up with a good approach.

This idea has also been posted on the bugtracker as an enhancement idea.
 
This is a great idea. We use something conceptually similar on every Linux-based OS we have in the field to keep track of things that change over the years.
Our implementation is fairly simple: we store bash scripts in a special directory, and a check utility runs all of them (manually or via cron, and integrated into the welcome screen on each node). Files can be placed in the directory manually or, as we do most of our stuff via packages, provided by packages. Each script does its own check and reports back analogously to an Icinga check, with text and an exit code: 0 - all OK, 1 - warning, 2 - alarm, and a custom 3 - not applicable. Most checks include instructive text and commands, so you only need to run what you see on the screen.
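For anyone who wants to play with the idea, the runner part could look roughly like the Python sketch below. This is not our actual utility (that one runs bash scripts via a small shell wrapper); the function names and the example directory are made up, and treating unexpected exit codes as an alarm is just an assumption of this sketch.

```python
import os
import subprocess

# Exit code convention described above: 0 OK, 1 warning, 2 alarm, 3 not applicable.
LABELS = {0: "OK", 1: "WARNING", 2: "ALARM", 3: "NOT APPLICABLE"}


def label_for(exit_code: int) -> str:
    """Map an exit code to a label; unexpected codes are treated as an alarm."""
    return LABELS.get(exit_code, "ALARM")


def overall_status(exit_codes) -> int:
    """Worst status across all checks, ignoring 'not applicable' results."""
    relevant = [c if c in (0, 1, 2) else 2 for c in exit_codes if c != 3]
    return max(relevant, default=0)


def run_checks(directory: str) -> list[tuple[str, int, str]]:
    """Run every executable file in `directory` and collect (name, exit code, output)."""
    results = []
    if not os.path.isdir(directory):
        return results
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            continue
        proc = subprocess.run([path], capture_output=True, text=True)
        results.append((name, proc.returncode, proc.stdout.strip()))
    return results


def report(directory: str) -> int:
    """Print one line per check and return the overall status."""
    checks = run_checks(directory)
    for name, code, text in checks:
        print(f"[{label_for(code)}] {name}: {text}")
    return overall_status(code for _, code, _ in checks)
```

Calling something like report("/etc/cluster-checks.d") (a hypothetical path) from cron or a login hook would then give the per-node overview, and packages can drop new checks into the directory without touching the runner.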