Question marks on all nodes but 1 in Proxmox VE 6

Mor H.

New Member
Jun 3, 2019
Hi everyone,

We recently installed a Proxmox VE 6 cluster and restored hundreds of VPSs onto it.
The cluster has 6 virtualization nodes and 4 storage nodes.

It was running fine for about 20 hours, no issues at all - all nodes showing up with green checkmarks.

All of a sudden, two of the nodes showed a red X.
We restarted corosync on those two nodes, and about 5 minutes later every node except one was showing a question mark.

The actual virtual machines seem to be intact and online, and the cluster itself seems fine: Quorate = yes and all 10 nodes show as online.
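For reference, these are just the standard commands we used to check the cluster state from one of the nodes (nothing custom):
Code:
# Quorum and membership view
pvecm status

# State of the cluster stack services
systemctl status corosync pve-cluster pvestatd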

We can't see anything out of the ordinary in syslog. We did, however, see this:
Code:
[ 1489.040938] HTB: quantum of class 10001 is big. Consider r2q change
[ 4049.114183] perf: interrupt took too long (2509 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[ 4808.865178] perf: interrupt took too long (3143 > 3136), lowering kernel.perf_event_max_sample_rate to 63500
[ 5760.579921] perf: interrupt took too long (3940 > 3928), lowering kernel.perf_event_max_sample_rate to 50750
[ 7373.209122] perf: interrupt took too long (4949 > 4925), lowering kernel.perf_event_max_sample_rate to 40250
[ 8590.353579] INFO: NMI handler (ghes_notify_nmi) took too long to run: 1.538 msecs
[10160.077215] perf: interrupt took too long (6205 > 6186), lowering kernel.perf_event_max_sample_rate to 32000
[15252.770846] INFO: NMI handler (ghes_notify_nmi) took too long to run: 1.670 msecs
[15726.728383] perf: interrupt took too long (7766 > 7756), lowering kernel.perf_event_max_sample_rate to 25750
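For completeness, this is roughly how we pulled those kernel messages (standard tools, nothing custom):
Code:
# Kernel ring buffer with human-readable timestamps
dmesg -T | tail -n 50

# Same messages via the journal, kernel entries only
journalctl -k --since "2 hours ago"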

Any clue what we should do to troubleshoot this?

Any and all help will be greatly appreciated.
 
Also, we're seeing this:
Code:
root@hyp08:~# systemctl status pvestatd
● pvestatd.service - PVE Status Daemon
  Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; vendor preset: enabled)
  Active: active (running) since Thu 2019-08-08 18:31:19 CEST; 1 day 3h ago
Main PID: 2217 (pvestatd)
   Tasks: 1 (limit: 13516)
  Memory: 196.8M
  CGroup: /system.slice/pvestatd.service
          └─2217 pvestatd

Aug 09 21:34:15 hyp08 pvestatd[2217]: could not activate storage 'local-zfs', zfs error: cannot open 'rpool': no such pool
Aug 09 21:34:25 hyp08 pvestatd[2217]: zfs error: cannot open 'rpool': no such pool
Aug 09 21:34:25 hyp08 pvestatd[2217]: zfs error: cannot open 'rpool': no such pool
Aug 09 21:34:25 hyp08 pvestatd[2217]: could not activate storage 'local-zfs', zfs error: cannot open 'rpool': no such pool
Aug 09 21:34:35 hyp08 pvestatd[2217]: zfs error: cannot open 'rpool': no such pool
Aug 09 21:34:35 hyp08 pvestatd[2217]: zfs error: cannot open 'rpool': no such pool
Aug 09 21:34:35 hyp08 pvestatd[2217]: could not activate storage 'local-zfs', zfs error: cannot open 'rpool': no such pool
Aug 09 21:34:45 hyp08 pvestatd[2217]: zfs error: cannot open 'rpool': no such pool
Aug 09 21:34:45 hyp08 pvestatd[2217]: zfs error: cannot open 'rpool': no such pool
Aug 09 21:34:45 hyp08 pvestatd[2217]: could not activate storage 'local-zfs', zfs error: cannot open 'rpool': no such pool

But we do not use ZFS storage at all, so why is it throwing that error, and how can we fix it?
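In case it helps, this is what we're planning to check next. We're assuming the storage definition lives in the cluster-wide /etc/pve/storage.cfg and that pvesm can scope or remove it, as per the docs - the commented-out commands are ideas we have not run yet:
Code:
# Look for a 'local-zfs' (zfspool) entry in the cluster-wide storage config
grep -B1 -A4 'local-zfs' /etc/pve/storage.cfg

# Confirm whether any ZFS pool is actually imported on this node
zpool list

# If the entry is stale, restrict it to the nodes that really have rpool, or drop it:
# pvesm set local-zfs --nodes <nodes-with-rpool>
# pvesm remove local-zfs

# And possibly restart the status daemon afterwards
# systemctl restart pvestatd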