Hey!
Back in February this year, after a fairly long round of discussions, Proxmox changed the corosync configuration so that the default token coefficient can be overridden, and lowered its default value. Quick background: by default, corosync derives the token and consensus timeouts from the cluster size:
Code:
token timeout = token + (#nodes - 2) * token_coefficient
consensus timeout = 1.2 * token timeout
With a cluster of around 30 nodes, the result approaches 60 seconds, which can lead to the watchdog kicking in and fencing nodes. This is just one example, but I quite often see customers overlook important changes like this, either because changelogs are not read thoroughly or because the changes simply get lost in the day-to-day workload.
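To play with the numbers, here is a minimal sketch that just evaluates the two formulas above for a given node count. The function name and the default values (`token=3000` ms, `token_coefficient=650` ms) are my assumptions, taken from what I understand the upstream corosync defaults to be; a real cluster may have different values in its corosync.conf, so treat this as an illustration only.

```python
def corosync_timeouts(nodes, token=3000, token_coefficient=650):
    """Return (token_timeout_ms, consensus_timeout_ms) for a cluster size.

    Assumed defaults, not read from any real cluster:
    token=3000 ms and token_coefficient=650 ms.
    """
    # The coefficient only adds time for nodes beyond the second one.
    extra_nodes = max(nodes - 2, 0)
    token_timeout = token + extra_nodes * token_coefficient
    # consensus timeout = 1.2 * token timeout, kept in integer milliseconds
    consensus_timeout = token_timeout * 12 // 10
    return token_timeout, consensus_timeout


if __name__ == "__main__":
    for size in (3, 10, 30):
        token_ms, consensus_ms = corosync_timeouts(size)
        print(f"{size} nodes: token={token_ms} ms, consensus={consensus_ms} ms")
```

Passing a smaller `token_coefficient` shows how much a lowered default shortens both timeouts on larger clusters.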
So my suggestion would be to build some kind of tooling or script, similar to the pveXtoY scripts, that lets users check for relevant changes (be it new defaults, new formulas, or newly proposed best practices) that were introduced over past Proxmox generations, like the example above.
In other words, this tooling would evolve alongside Proxmox versions and provide a way to review existing clusters against new features and default values, compare them, and notify the user accordingly. Clusters are typically designed for long lifetimes and often span multiple Proxmox generations without benefiting from improved defaults and changes.
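To make the idea a bit more concrete, here is a rough sketch of what such a check could look like. Everything in it is hypothetical: the `RECOMMENDED` rule table, the function names, and the value `125` are placeholders I made up for illustration, not official Proxmox defaults or an existing tool. It only handles simple `key: value` lines and ignores the section nesting of corosync.conf.

```python
import re

# Hypothetical rule table: setting name -> value a newer release would suggest.
# The entry below is an illustrative placeholder, not an official default.
RECOMMENDED = {"token_coefficient": "125"}


def parse_settings(conf_text):
    """Collect simple `key: value` lines from corosync.conf-style text.

    Section nesting is ignored; if a key appears more than once,
    the last occurrence wins.
    """
    settings = {}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]  # drop comments
        match = re.match(r"^\s*([A-Za-z0-9_]+)\s*:\s*(\S+)\s*$", line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def review(conf_text, recommended=RECOMMENDED):
    """Return one human-readable finding per setting that deviates."""
    settings = parse_settings(conf_text)
    findings = []
    for key, wanted in recommended.items():
        current = settings.get(key)
        if current is None:
            findings.append(f"{key}: not set explicitly, recommended {wanted}")
        elif current != wanted:
            findings.append(f"{key}: is {current}, recommended {wanted}")
    return findings


if __name__ == "__main__":
    sample = """
    totem {
      cluster_name: demo
      token_coefficient: 650
    }
    """
    for finding in review(sample):
        print(finding)
```

A real tool would ship one such rule table per Proxmox release and would read the cluster's actual configuration instead of a sample string.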
Happy to hear ideas. This could also be done by and/or within the Proxmox community. I'm here to ask for input so that we as a whole can come up with a good approach.
This idea has also been posted on the bugtracker as an enhancement idea.