Node with question mark

decibel83

Hi,

suddenly, one node in my Proxmox 5.1 cluster became unavailable in the web interface and its icon turned grey with a question mark, as in the following screenshot:

Screen Shot 2018-02-05 at 09.18.32.png

This has happened three times on three different nodes (I rebooted node01 and node02 to solve the problem).

All virtual machines on the node are running fine, and the web interface on the failed node is reachable, but it shows the same thing (if I connect to the node03 web interface, all nodes are green except node03 itself).

All nodes are correctly pingable from node03.

In the datacenter summary all 11 nodes are Online.

I don't see any error in /var/log/syslog on node03.

The call to the status API (https://node1:8006/api2/json/cluster/status) returns all nodes online, but node03 is marked as local and has "level":null instead of "level":"":

Code:
{"data":[{"nodes":11,"id":"cluster","type":"cluster","name":"mycluster","version":11,"quorate":1},
{"nodeid":6,"id":"node/node10","local":0,"name":"node10","online":1,"ip":"192.168.60.10","level":"","type":"node"},
{"ip":"192.168.60.2","level":"","type":"node","nodeid":10,"id":"node/node02","local":0,"name":"node02","online":1},
{"ip":"192.168.60.11","level":"","type":"node","nodeid":2,"id":"node/node11","local":0,"name":"node11","online":1},
{"ip":"192.168.60.3","level":null,"type":"node","nodeid":7,"id":"node/node03","local":0,"name":"node03","online":1},
{"online":1,"nodeid":8,"id":"node/node01","name":"node01","local":1,"ip":"192.168.60.1","level":"","type":"node"},
{"name":"node09","local":0,"nodeid":11,"id":"node/node09","online":1,"type":"node","level":"","ip":"192.168.60.9"},
{"level":"","type":"node","ip":"192.168.60.5","id":"node/node05","nodeid":1,"local":0,"name":"node05","online":1},
{"ip":"192.168.60.6","level":"","type":"node","id":"node/node06","nodeid":3,"name":"node06","local":0,"online":1},
{"ip":"192.168.60.8","type":"node","level":"","online":1,"local":0,"name":"node08","id":"node/node08","nodeid":5},
{"level":"","type":"node","ip":"192.168.60.4","nodeid":9,"id":"node/node04","local":0,"name":"node04","online":1},
{"ip":"192.168.60.7","type":"node","level":"","name":"node07","local":0,"id":"node/node07","nodeid":4,"online":1}]}

This is the pvecm status output from node03:

Code:
root@node03:~# pvecm status
Quorum information
------------------
Date:             Mon Feb  5 10:09:54 2018
Quorum provider:  corosync_votequorum
Nodes:            11
Node ID:          0x00000007
Ring ID:          8/1256
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   11
Highest expected: 11
Total votes:      11
Quorum:           6
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000008          1 192.168.60.1
0x0000000a          1 192.168.60.2
0x00000007          1 192.168.60.3 (local)
0x00000009          1 192.168.60.4
0x00000001          1 192.168.60.5
0x00000003          1 192.168.60.6
0x00000004          1 192.168.60.7
0x00000005          1 192.168.60.8
0x0000000b          1 192.168.60.9
0x00000006          1 192.168.60.10
0x00000002          1 192.168.60.11

All nodes are updated to the latest version of Proxmox and the latest kernel (I updated all packages yesterday):

Code:
root@node03:~# pveversion -v
proxmox-ve: 5.1-38 (running kernel: 4.13.13-5-pve)
pve-manager: 5.1-43 (running version: 5.1-43/bdb08029)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-3-pve: 4.13.13-34
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-20
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-16
pve-qemu-kvm: 2.9.1-6
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.4-pve2~bpo9

Could you help me please?

Thanks!
 
Hi,
I have the same problem, and the only way to fix it is rebooting.

Does anyone else have this problem?
 

Same problem here, I've just posted another thread about this.

I think this is caused by the latest updates.
 
Hello everyone,

I am running version 5.1.
I have the same problem; it has happened 3 times since last month on 2 different nodes. I have to reboot the node to fix the issue, which is not a permanent solution.
On the console, my cluster status is OK. The cluster consists of 3 nodes.
One node is not responding in the web interface, but its containers work fine.
Please help me, I would like to find a permanent solution for this problem.

upload_2018-2-17_17-43-36.png


upload_2018-2-17_17-42-26.png
 
You should check whether the pvestatd daemon is still running, and whether there is a storage that blocks (e.g. NFS). pvestatd is responsible for collecting and sending that status information across the cluster; if it hangs or crashes (most often because of an error with a storage), it stops sending that information.
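
A minimal way to check this from the shell (a sketch, assuming PVE 5.x where pvestatd runs as a systemd unit):

Code:
# check whether pvestatd is still alive
systemctl status pvestatd

# list all configured storages; this hangs if a storage (e.g. NFS) blocks,
# which would also block pvestatd
pvesm status

# if pvestatd hung or crashed, restart it
systemctl restart pvestatd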
 
I've been having this issue as well... whenever I initiate a backup, the system faults and is thrown into this state. There really isn't anything in the logs to go on. Restarting services doesn't seem to fix the issue either. My backup is sent to an NFS share, but it NEVER had issues like this before the 5.x versions of Proxmox.

pvestatd is still running and a restart doesn't solve the issue.
 
Same here, every week.
 
While waiting for an update from the Proxmox team, I have decided to downgrade one node to version 5.0. I have not had the issue for 4 days now.
 
In my case everything runs fine for 2-3 weeks, and then it happens.

I'm investigating whether it could be caused by using unicast instead of multicast.

Are you using unicast too?
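
This is what I'm checking for (just a sketch; if no transport line shows up, corosync is using its default multicast transport):

Code:
# a unicast setup shows "transport: udpu" in the totem section
grep -i transport /etc/pve/corosync.conf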
 
I'm having issues with this constantly now, not just when backing up. I'm really not able to find any possible cause for why this is happening. Even simple power-offs of CTs cause this to happen now. Is there anyone on the Proxmox team who would be able to help us with this?!
 
Without knowing your setup, I assume the issue is somewhere in your cluster network. Most issues occur when the storage and cluster networks are not separated and/or the network does not meet the requirements regarding latency and reliability. Or the storage is overloaded.

Do you have a separate cluster network? Test with omping whether it works reliably.
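
For example, something along these lines (a sketch of the test suggested in the cluster network documentation; replace the node names with yours and run it on all nodes at roughly the same time):

Code:
# send 10000 multicast probes with a 1 ms interval between all listed nodes;
# noticeable packet loss here means the network is not suitable for corosync
omping -c 10000 -i 0.001 -F -q node01 node02 node03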

We can analyse your setup in depth by logging into your cluster via SSH; please contact our enterprise support team (subscription needed).
 
I ran into this grey question mark situation yesterday on PVE 5.1-43. Single-node setup (I set up PVE fresh 2 weeks ago) with all default options; no cluster. Rebooting the node fixed the issue.

After reboot, I applied all updates - now at 5.1-46. Let’s see if this happens again.
 
5.1-46; this has now happened for the second time this week on the same node in my cluster.
When I restart pvestatd on the node, the KVM guests become visible again.
Some of the LXC containers are running and some are dead.
"pct list" hangs...
 
It's happening every week for me too. I have 12 nodes, and when it happens, I have to stop all Proxmox services on every node:

Code:
service pve-cluster stop
service corosync stop
service pvestatd stop
service pveproxy stop
service pvedaemon stop

and then

Code:
service pve-cluster start
service corosync start
service pvestatd start
service pveproxy start
service pvedaemon start
 
Thanks for the hint, I will also try the "solution" from here ... we don't have ZFS:
https://forum.proxmox.com/threads/p...el-tainted-pvestatd-frozen.38408/#post-189727
 
Unfortunately, neither solution works for me.
Every day one node crashes... unusable.
@tom Is there an update planned for this issue?
 
Yes, I am facing the same issue.

I have been having sleepless nights for the last week.

Proxmox is a nightmare now; every day 2 or 3 nodes crash for me. I have 25 nodes with LXC.

No reply from Proxmox so far.
 
Same issue here on a 16-node LXC cluster. It only began with a recent update and reboot, so I also suspect this is an issue with the kernel. But I also notice that the issue persists for an hour or two every time, and then everything goes back to normal. So instead of rebooting every node, I just wait (of course, this is not a solution for a hosting provider).
 
