Hi everyone,
We recently installed a cluster with Proxmox VE 6 and restored hundreds of VPS on it.
The cluster has 6 virtualization nodes and 4 storage nodes.
It was running fine for about 20 hours, no issues at all - all nodes showing up with green checkmarks.
All of a sudden, two of the nodes showed a red X.
We restarted corosync on the two nodes that had the X, and 5 minutes later all nodes but one were showing question marks.
The actual virtual machines seem to be intact and online, and the cluster itself seems to be fine, with Quorate = yes and 10 nodes showing online.
We can't see anything out of the ordinary in syslog. We did, however, see this:
Code:
[ 1489.040938] HTB: quantum of class 10001 is big. Consider r2q change
[ 4049.114183] perf: interrupt took too long (2509 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[ 4808.865178] perf: interrupt took too long (3143 > 3136), lowering kernel.perf_event_max_sample_rate to 63500
[ 5760.579921] perf: interrupt took too long (3940 > 3928), lowering kernel.perf_event_max_sample_rate to 50750
[ 7373.209122] perf: interrupt took too long (4949 > 4925), lowering kernel.perf_event_max_sample_rate to 40250
[ 8590.353579] INFO: NMI handler (ghes_notify_nmi) took too long to run: 1.538 msecs
[10160.077215] perf: interrupt took too long (6205 > 6186), lowering kernel.perf_event_max_sample_rate to 32000
[15252.770846] INFO: NMI handler (ghes_notify_nmi) took too long to run: 1.670 msecs
[15726.728383] perf: interrupt took too long (7766 > 7756), lowering kernel.perf_event_max_sample_rate to 25750
Any clue what we should do to troubleshoot this?
Any and all help will be greatly appreciated.