Proxmox Cluster status ?

frenchface

New Member
Aug 8, 2021
I have a Proxmox cluster with 3 nodes, version 7.1-10.

Every day at least one node goes to status '?' and all of its VMs go to '?' as well.

So far the only way I can get them back online is to kill the server and start it back up.

I'm really not sure what logs I need to look at.

/var/log/messages
Code:
Mar  2 09:58:05 pve3 kernel: [ 4325.946512] 8021q: adding VLAN 0 to HW filter on device enp1s0f0
Mar  2 09:58:05 pve3 kernel: [ 4326.057002] cfg80211: Loading compiled-in X.509 certificates for regulatory database
Mar  2 09:58:05 pve3 kernel: [ 4326.058830] cfg80211: Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
Mar  2 09:58:05 pve3 kernel: [ 4326.060186] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
Mar  2 09:58:05 pve3 kernel: [ 4326.061519] cfg80211: failed to load regulatory.db
Mar  2 09:58:05 pve3 kernel: [ 4326.456402] device tap136i0 entered promiscuous mode
Mar  2 09:58:05 pve3 kernel: [ 4326.491910] fwbr136i0: port 1(fwln136i0) entered blocking state
Mar  2 09:58:05 pve3 kernel: [ 4326.492786] fwbr136i0: port 1(fwln136i0) entered disabled state
Mar  2 09:58:05 pve3 kernel: [ 4326.493668] device fwln136i0 entered promiscuous mode
Mar  2 09:58:05 pve3 kernel: [ 4326.494533] fwbr136i0: port 1(fwln136i0) entered blocking state
Mar  2 09:58:05 pve3 kernel: [ 4326.495337] fwbr136i0: port 1(fwln136i0) entered forwarding state
Mar  2 09:58:05 pve3 kernel: [ 4326.500353] vmbr0: port 16(fwpr136p0) entered blocking state
Mar  2 09:58:05 pve3 kernel: [ 4326.501189] vmbr0: port 16(fwpr136p0) entered disabled state
Mar  2 09:58:05 pve3 kernel: [ 4326.502071] device fwpr136p0 entered promiscuous mode
Mar  2 09:58:05 pve3 kernel: [ 4326.502931] vmbr0: port 16(fwpr136p0) entered blocking state
Mar  2 09:58:05 pve3 kernel: [ 4326.503748] vmbr0: port 16(fwpr136p0) entered forwarding state
Mar  2 09:58:05 pve3 kernel: [ 4326.508658] fwbr136i0: port 2(tap136i0) entered blocking state
Mar  2 09:58:05 pve3 kernel: [ 4326.509496] fwbr136i0: port 2(tap136i0) entered disabled state
Mar  2 09:58:05 pve3 kernel: [ 4326.510386] fwbr136i0: port 2(tap136i0) entered blocking state
Mar  2 09:58:05 pve3 kernel: [ 4326.511190] fwbr136i0: port 2(tap136i0) entered forwarding state
Mar  2 09:58:08 pve3 kernel: [ 4329.297522] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Mar  2 09:58:08 pve3 kernel: [ 4329.298922] vmbr0: port 9(veth140i0) entered blocking state
Mar  2 09:58:08 pve3 kernel: [ 4329.300215] vmbr0: port 9(veth140i0) entered forwarding state
Mar  2 09:58:08 pve3 kernel: [ 4329.520838] kauditd_printk_skb: 4 callbacks suppressed
Mar  2 09:58:08 pve3 kernel: [ 4329.520843] audit: type=1400 audit(1646233088.678:34): apparmor="STATUS" operation="profile_load" label="lxc-140_</var/lib/lxc>//&:lxc-140_<-var-lib-lxc>:unconfined" name="nvidia_modprobe" pid=14964 comm="apparmor_parser"
Mar  2 09:58:08 pve3 kernel: [ 4329.523528] audit: type=1400 audit(1646233088.678:35): apparmor="STATUS" operation="profile_load" label="lxc-140_</var/lib/lxc>//&:lxc-140_<-var-lib-lxc>:unconfined" name="nvidia_modprobe//kmod" pid=14964 comm="apparmor_parser"
Mar  2 09:58:08 pve3 kernel: [ 4329.642392] audit: type=1400 audit(1646233088.798:36): apparmor="STATUS" operation="profile_load" label="lxc-140_</var/lib/lxc>//&:lxc-140_<-var-lib-lxc>:unconfined" name="/usr/sbin/tcpdump" pid=14963 comm="apparmor_parser"
Mar  2 09:58:08 pve3 kernel: [ 4329.650126] audit: type=1400 audit(1646233088.806:37): apparmor="STATUS" operation="profile_load" label="lxc-140_</var/lib/lxc>//&:lxc-140_<-var-lib-lxc>:unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=14967 comm="apparmor_parser"
Mar  2 09:58:08 pve3 kernel: [ 4329.652303] audit: type=1400 audit(1646233088.806:38): apparmor="STATUS" operation="profile_load" label="lxc-140_</var/lib/lxc>//&:lxc-140_<-var-lib-lxc>:unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=14967 comm="apparmor_parser"
Mar  2 09:58:08 pve3 kernel: [ 4329.654403] audit: type=1400 audit(1646233088.806:39): apparmor="STATUS" operation="profile_load" label="lxc-140_</var/lib/lxc>//&:lxc-140_<-var-lib-lxc>:unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=14967 comm="apparmor_parser"
Mar  2 09:58:08 pve3 kernel: [ 4329.656507] audit: type=1400 audit(1646233088.806:40): apparmor="STATUS" operation="profile_load" label="lxc-140_</var/lib/lxc>//&:lxc-140_<-var-lib-lxc>:unconfined" name="/{,usr/}sbin/dhclient" pid=14967 comm="apparmor_parser"
Mar  2 09:58:08 pve3 kernel: [ 4329.677381] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Mar  2 09:58:08 pve3 kernel: [ 4329.679074] vmbr0: port 11(veth143i0) entered blocking state
Mar  2 09:58:08 pve3 kernel: [ 4329.680637] vmbr0: port 11(veth143i0) entered forwarding state
Mar  2 09:58:08 pve3 kernel: [ 4329.692565] audit: type=1400 audit(1646233088.850:41): apparmor="STATUS" operation="profile_load" label="lxc-140_</var/lib/lxc>//&:lxc-140_<-var-lib-lxc>:unconfined" name="lsb_release" pid=14972 comm="apparmor_parser"
Mar  2 09:58:08 pve3 kernel: [ 4329.698052] audit: type=1400 audit(1646233088.854:42): apparmor="STATUS" operation="profile_load" label="lxc-140_</var/lib/lxc>//&:lxc-140_<-var-lib-lxc>:unconfined" name="/usr/bin/man" pid=14971 comm="apparmor_parser"
Mar  2 09:58:08 pve3 kernel: [ 4329.700005] audit: type=1400 audit(1646233088.854:43): apparmor="STATUS" operation="profile_load" label="lxc-140_</var/lib/lxc>//&:lxc-140_<-var-lib-lxc>:unconfined" name="man_filter" pid=14971 comm="apparmor_parser"
Mar  2 10:50:58 pve3 kernel: [ 7499.372518] perf: interrupt took too long (2505 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Mar  2 13:16:32 pve3 kernel: [16233.638310] perf: interrupt took too long (3151 > 3131), lowering kernel.perf_event_max_sample_rate to 63250

/var/log/syslog
Code:
Mar  2 15:49:42 pve3 pve-ha-lrm[14050]: Task 'UPID:pve3:000036E3:0006980C:621F85F8:qmstart:120:root@pam:' still active, waiting
Mar  2 15:49:43 pve3 pve-ha-lrm[14486]: Task 'UPID:pve3:00003899:0006997F:621F85FC:qmstart:134:root@pam:' still active, waiting
Mar  2 15:49:47 pve3 pve-ha-lrm[14050]: Task 'UPID:pve3:000036E3:0006980C:621F85F8:qmstart:120:root@pam:' still active, waiting
Mar  2 15:49:47 pve3 pve-ha-lrm[14748]: Task 'UPID:pve3:0000399E:00069A0C:621F85FD:qmstart:150:root@pam:' still active, waiting
Mar  2 15:49:48 pve3 pve-ha-lrm[14486]: Task 'UPID:pve3:00003899:0006997F:621F85FC:qmstart:134:root@pam:' still active, waiting
Mar  2 15:49:52 pve3 pve-ha-lrm[14748]: Task 'UPID:pve3:0000399E:00069A0C:621F85FD:qmstart:150:root@pam:' still active, waiting
Mar  2 15:49:52 pve3 pve-ha-lrm[14050]: Task 'UPID:pve3:000036E3:0006980C:621F85F8:qmstart:120:root@pam:' still active, waiting
Mar  2 15:49:53 pve3 pve-ha-lrm[14486]: Task 'UPID:pve3:00003899:0006997F:621F85FC:qmstart:134:root@pam:' still active, waiting
Mar  2 15:49:57 pve3 pve-ha-lrm[14050]: Task 'UPID:pve3:000036E3:0006980C:621F85F8:qmstart:120:root@pam:' still active, waiting
Mar  2 15:49:57 pve3 pve-ha-lrm[14748]: Task 'UPID:pve3:0000399E:00069A0C:621F85FD:qmstart:150:root@pam:' still active, waiting
Mar  2 15:49:58 pve3 pve-ha-lrm[14486]: Task 'UPID:pve3:00003899:0006997F:621F85FC:qmstart:134:root@pam:' still active, waiting
Mar  2 15:50:02 pve3 pve-ha-lrm[14050]: Task 'UPID:pve3:000036E3:0006980C:621F85F8:qmstart:120:root@pam:' still active, waiting
Mar  2 15:50:02 pve3 pve-ha-lrm[14748]: Task 'UPID:pve3:0000399E:00069A0C:621F85FD:qmstart:150:root@pam:' still active, waiting
Mar  2 15:50:03 pve3 pve-ha-lrm[14486]: Task 'UPID:pve3:00003899:0006997F:621F85FC:qmstart:134:root@pam:' still active, waiting
Mar  2 15:50:07 pve3 pve-ha-lrm[14748]: Task 'UPID:pve3:0000399E:00069A0C:621F85FD:qmstart:150:root@pam:' still active, waiting
Mar  2 15:50:07 pve3 pve-ha-lrm[14050]: Task 'UPID:pve3:000036E3:0006980C:621F85F8:qmstart:120:root@pam:' still active, waiting
Mar  2 15:50:08 pve3 pve-ha-lrm[14486]: Task 'UPID:pve3:00003899:0006997F:621F85FC:qmstart:134:root@pam:' still active, waiting
Mar  2 15:50:12 pve3 pve-ha-lrm[14748]: Task 'UPID:pve3:0000399E:00069A0C:621F85FD:qmstart:150:root@pam:' still active, waiting
Mar  2 15:50:12 pve3 pve-ha-lrm[14050]: Task 'UPID:pve3:000036E3:0006980C:621F85F8:qmstart:120:root@pam:' still active, waiting
Mar  2 15:50:13 pve3 pve-ha-lrm[14486]: Task 'UPID:pve3:00003899:0006997F:621F85FC:qmstart:134:root@pam:' still active, waiting
Mar  2 15:50:17 pve3 pve-ha-lrm[14050]: Task 'UPID:pve3:000036E3:0006980C:621F85F8:qmstart:120:root@pam:' still active, waiting
Mar  2 15:50:17 pve3 pve-ha-lrm[14748]: Task 'UPID:pve3:0000399E:00069A0C:621F85FD:qmstart:150:root@pam:' still active, waiting
Mar  2 15:50:18 pve3 pve-ha-lrm[14486]: Task 'UPID:pve3:00003899:0006997F:621F85FC:qmstart:134:root@pam:' still active, waiting
Mar  2 15:50:22 pve3 pve-ha-lrm[14748]: Task 'UPID:pve3:0000399E:00069A0C:621F85FD:qmstart:150:root@pam:' still active, waiting
Mar  2 15:50:22 pve3 pve-ha-lrm[14050]: Task 'UPID:pve3:000036E3:0006980C:621F85F8:qmstart:120:root@pam:' still active, waiting
Mar  2 15:50:23 pve3 pve-ha-lrm[14486]: Task 'UPID:pve3:00003899:0006997F:621F85FC:qmstart:134:root@pam:' still active, waiting
Mar  2 15:50:27 pve3 pve-ha-lrm[14050]: Task 'UPID:pve3:000036E3:0006980C:621F85F8:qmstart:120:root@pam:' still active, waiting
Mar  2 15:50:27 pve3 pve-ha-lrm[14748]: Task 'UPID:pve3:0000399E:00069A0C:621F85FD:qmstart:150:root@pam:' still active, waiting
Mar  2 15:50:28 pve3 pve-ha-lrm[14486]: Task 'UPID:pve3:00003899:0006997F:621F85FC:qmstart:134:root@pam:' still active, waiting
Mar  2 15:50:32 pve3 pve-ha-lrm[14050]: Task 'UPID:pve3:000036E3:0006980C:621F85F8:qmstart:120:root@pam:' still active, waiting
Mar  2 15:50:32 pve3 pve-ha-lrm[14748]: Task 'UPID:pve3:0000399E:00069A0C:621F85FD:qmstart:150:root@pam:' still active, waiting
Mar  2 15:50:33 pve3 pve-ha-lrm[14486]: Task 'UPID:pve3:00003899:0006997F:621F85FC:qmstart:134:root@pam:' still active, waiting
Mar  2 15:50:37 pve3 pve-ha-lrm[14748]: Task 'UPID:pve3:0000399E:00069A0C:621F85FD:qmstart:150:root@pam:' still active, waiting
Mar  2 15:50:37 pve3 pve-ha-lrm[14050]: Task 'UPID:pve3:000036E3:0006980C:621F85F8:qmstart:120:root@pam:' still active, waiting
Mar  2 15:50:38 pve3 pve-ha-lrm[14486]: Task 'UPID:pve3:00003899:0006997F:621F85FC:qmstart:134:root@pam:' still active, waiting
Mar  2 15:50:42 pve3 pve-ha-lrm[14748]: Task 'UPID:pve3:0000399E:00069A0C:621F85FD:qmstart:150:root@pam:' still active, waiting
Mar  2 15:50:42 pve3 pve-ha-lrm[14050]: Task 'UPID:pve3:000036E3:0006980C:621F85F8:qmstart:120:root@pam:' still active, waiting
Mar  2 15:50:43 pve3 pve-ha-lrm[14486]: Task 'UPID:pve3:00003899:0006997F:621F85FC:qmstart:134:root@pam:' still active, waiting
Mar  2 15:50:47 pve3 pve-ha-lrm[14748]: Task 'UPID:pve3:0000399E:00069A0C:621F85FD:qmstart:150:root@pam:' still active, waiting
Mar  2 15:50:47 pve3 pve-ha-lrm[14050]: Task 'UPID:pve3:000036E3:0006980C:621F85F8:qmstart:120:root@pam:' still active, waiting
Mar  2 15:50:48 pve3 pve-ha-lrm[14486]: Task 'UPID:pve3:00003899:0006997F:621F85FC:qmstart:134:root@pam:' still active, waiting
Mar  2 15:50:52 pve3 pve-ha-lrm[14748]: Task 'UPID:pve3:0000399E:00069A0C:621F85FD:qmstart:150:root@pam:' still active, waiting
Mar  2 15:50:52 pve3 pve-ha-lrm[14050]: Task 'UPID:pve3:000036E3:0006980C:621F85F8:qmstart:120:root@pam:' still active, waiting
Mar  2 15:50:52 pve3 systemd[1]: user@0.service: State 'final-sigterm' timed out. Killing.
Mar  2 15:50:52 pve3 systemd[1]: user@0.service: Killing process 55583 (systemd) with signal SIGKILL.
Mar  2 15:50:53 pve3 pve-ha-lrm[14486]: Task 'UPID:pve3:00003899:0006997F:621F85FC:qmstart:134:root@pam:' still active, waiting
Mar  2 15:50:57 pve3 pve-ha-lrm[14050]: Task 'UPID:pve3:000036E3:0006980C:621F85F8:qmstart:120:root@pam:' still active, waiting
Mar  2 15:50:57 pve3 pve-ha-lrm[14748]: Task 'UPID:pve3:0000399E:00069A0C:621F85FD:qmstart:150:root@pam:' still active, waiting
Mar  2 15:50:58 pve3 pve-ha-lrm[14486]: Task 'UPID:pve3:00003899:0006997F:621F85FC:qmstart:134:root@pam:' still active, waiting


Any suggestions on what I should look at?
 
The '?' in the GUI happens when the 'pvestatd' daemon cannot update the status of the VMs/CTs/node, etc.
Normally this happens when a storage hangs (e.g. an NFS mount); that can be caused by overload, network issues, etc.
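
To narrow it down the next time a node shows '?', you could check whether pvestatd itself is stuck and whether every configured storage still answers. The commands below are only a suggestion for a standard PVE 7 install, run on the affected node:

Code:
# is pvestatd still running and logging?
systemctl status pvestatd
journalctl -u pvestatd -b --since "1 hour ago"

# does every configured storage still respond? a hanging NFS mount will make this stall
timeout 30 pvesm status

# look for processes stuck in uninterruptible sleep (state 'D'), typical for hung storage I/O
ps axo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'

# if only pvestatd is stuck, restarting it is usually enough instead of rebooting the whole node
systemctl restart pvestatd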

From the logs you posted, I cannot see why it would hang, though.
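
What the syslog does show is pve-ha-lrm waiting on several stuck qmstart tasks (VMs 120, 134 and 150). Next time it happens, it may be worth checking those tasks and the HA stack before resetting the node, for example:

Code:
# overall cluster and HA view
pvecm status
ha-manager status

# state of the VMs whose start tasks are stuck (IDs taken from your syslog)
qm status 120
qm status 134
qm status 150

# recent messages from the HA local resource manager
journalctl -u pve-ha-lrm -b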
 
