Node with question mark

decibel83

Hi,

suddenly, one node in my Proxmox 5.1 cluster became unavailable in the web interface and its icon turned grey with a question mark, as in the following screenshot:

Screen Shot 2018-02-05 at 09.18.32.png

This has happened three times on three different nodes (I rebooted node01 and node02 to solve the problem).

All virtual machines on the node are running fine, and the web interface on the failed node is reachable, but it shows the same thing (if I connect to the node03 web interface, all nodes are green except node03 itself).

All nodes are correctly pingable from node03.

In the datacenter summary all 11 nodes are Online.

I don't see any error in /var/log/syslog on node03.

The call to the status API (https://node1:8006/api2/json/cluster/status) returns all nodes online, but node03 is marked as local and has "level":null instead of "level":"":

Code:
{"data":[{"nodes":11,"id":"cluster","type":"cluster","name":"mycluster","version":11,"quorate":1},
{"nodeid":6,"id":"node/node10","local":0,"name":"node10","online":1,"ip":"192.168.60.10","level":"","type":"node"},
{"ip":"192.168.60.2","level":"","type":"node","nodeid":10,"id":"node/node02","local":0,"name":"node02","online":1},
{"ip":"192.168.60.11","level":"","type":"node","nodeid":2,"id":"node/node11","local":0,"name":"node11","online":1},
{"ip":"192.168.60.3","level":null,"type":"node","nodeid":7,"id":"node/node03","local":0,"name":"node03","online":1},
{"online":1,"nodeid":8,"id":"node/node01","name":"node01","local":1,"ip":"192.168.60.1","level":"","type":"node"},
{"name":"node09","local":0,"nodeid":11,"id":"node/node09","online":1,"type":"node","level":"","ip":"192.168.60.9"},
{"level":"","type":"node","ip":"192.168.60.5","id":"node/node05","nodeid":1,"local":0,"name":"node05","online":1},
{"ip":"192.168.60.6","level":"","type":"node","id":"node/node06","nodeid":3,"name":"node06","local":0,"online":1},
{"ip":"192.168.60.8","type":"node","level":"","online":1,"local":0,"name":"node08","id":"node/node08","nodeid":5},
{"level":"","type":"node","ip":"192.168.60.4","nodeid":9,"id":"node/node04","local":0,"name":"node04","online":1},
{"ip":"192.168.60.7","type":"node","level":"","name":"node07","local":0,"id":"node/node07","nodeid":4,"online":1}]}

This is the pvecm status output from node03:

Code:
root@node03:~# pvecm status
Quorum information
------------------
Date:             Mon Feb  5 10:09:54 2018
Quorum provider:  corosync_votequorum
Nodes:            11
Node ID:          0x00000007
Ring ID:          8/1256
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   11
Highest expected: 11
Total votes:      11
Quorum:           6
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000008          1 192.168.60.1
0x0000000a          1 192.168.60.2
0x00000007          1 192.168.60.3 (local)
0x00000009          1 192.168.60.4
0x00000001          1 192.168.60.5
0x00000003          1 192.168.60.6
0x00000004          1 192.168.60.7
0x00000005          1 192.168.60.8
0x0000000b          1 192.168.60.9
0x00000006          1 192.168.60.10
0x00000002          1 192.168.60.11

All nodes are updated to the latest version of Proxmox and the latest kernel (I updated all packages yesterday):

Code:
root@node03:~# pveversion -v
proxmox-ve: 5.1-38 (running kernel: 4.13.13-5-pve)
pve-manager: 5.1-43 (running version: 5.1-43/bdb08029)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-3-pve: 4.13.13-34
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-20
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-16
pve-qemu-kvm: 2.9.1-6
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.4-pve2~bpo9

Could you help me please?

Thanks!
 
Hi,
I have the same problem, and the only way to fix it is rebooting.

Does anyone else have this problem?
 

Same problem here, I've just posted another thread about this.

I think this is caused by the latest updates.
 
Hello everyone,

I am running version 5.1.
I have the same problem; it has happened 3 times since last month on 2 different nodes. I have to reboot the node to fix the issue, which is not a permanent solution.
On the console, my cluster status is OK. The cluster consists of 3 nodes.
One node is not responding in the web interface, but its containers work fine.
Please help me, I would like to find a permanent solution for this problem.

upload_2018-2-17_17-43-36.png


upload_2018-2-17_17-42-26.png
 
You should check whether the pvestatd daemon is still running, and whether there is a storage that blocks (e.g. NFS). pvestatd is responsible for collecting and sending that status information across the cluster; if it hangs or crashes (most often because of an error with a storage), it stops sending that information.
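
A minimal way to check this from the shell (a sketch, assuming PVE 5.x where pvestatd runs as a systemd unit):

Code:
# check whether pvestatd is still alive
systemctl status pvestatd

# list all configured storages; this hangs if a storage (e.g. NFS) blocks,
# which would also block pvestatd
pvesm status

# if pvestatd hung or crashed, restart it
systemctl restart pvestatd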
 
I've been having this issue as well... whenever I initiate a backup, the system faults and is thrown into this state. There really isn't anything in the logs to go on. Restarting services doesn't seem to fix the issue either. My backup is sent to an NFS share, but it NEVER had issues like this before the 5.x versions of Proxmox.

pvestatd is still running and a restart doesn't solve the issue.
 
Same here, every week.
 
While waiting for an update from the Proxmox team, I have decided to downgrade one node to version 5.0. I have not had the issue for 4 days now.
 
In my case everything runs fine for 2-3 weeks, and then it happens.

I'm investigating whether it could be caused by using unicast instead of multicast.

Are you using unicast too?
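
This is what I'm checking for (just a sketch; if no transport line shows up, corosync is using its default multicast transport):

Code:
# a unicast setup shows "transport: udpu" in the totem section
grep -i transport /etc/pve/corosync.conf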
 
I'm having issues with this constantly now, not just when backing up. I'm really not able to find any possible cause for why this is happening. Even simple power-offs of CTs cause this to happen now. Is there anyone on the Proxmox team who would be able to help us with this?!
 
Without knowing your setup, I assume the issue is somewhere in your cluster network. Most issues occur when the storage and cluster networks are not separated and/or the network does not meet the requirements regarding latency and reliability. Or the storage is overloaded.

Do you have a separate cluster network? Test with omping whether it works reliably.
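
For example, something along these lines (a sketch of the test suggested in the cluster network documentation; replace the node names with yours and run it on all nodes at roughly the same time):

Code:
# send 10000 multicast probes with a 1 ms interval between all listed nodes;
# noticeable packet loss here means the network is not suitable for corosync
omping -c 10000 -i 0.001 -F -q node01 node02 node03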

We can analyse your setup in depth by logging into your cluster via SSH; please contact our enterprise support team (subscription needed).
 
I ran into this grey question mark situation yesterday on PVE 5.1-43. Single-node setup (I set up PVE fresh 2 weeks ago) with all default options; no cluster. Rebooting the node fixed the issue.

After reboot, I applied all updates - now at 5.1-46. Let’s see if this happens again.
 
5.1-46; this has now happened for the second time this week on the same node in my cluster.
When I restart pvestatd on the node, the KVM guests become visible again.
Some of the LXC containers are running and some are dead.
"pct list" hangs...
 
It's happening every week for me too. I have 12 nodes, and when it happens, I have to stop all Proxmox services on every node:

Code:
service pve-cluster stop
service corosync stop
service pvestatd stop
service pveproxy stop
service pvedaemon stop

and then

Code:
service pve-cluster start
service corosync start
service pvestatd start
service pveproxy start
service pvedaemon start
 
Thanks for the hint, I will also try the "solution" from here ... we don't have ZFS:
https://forum.proxmox.com/threads/p...el-tainted-pvestatd-frozen.38408/#post-189727
 
Unfortunately, neither solution works for me.
Every day one node crashes... unusable.
@tom Is there an update planned for this issue?
 
Yes, I am facing the same issue.

I have been having sleepless nights for the last week.

Proxmox is a nightmare now; every day 2 or 3 nodes crash for me. I have 25 nodes with LXC.

No reply from Proxmox so far.
 
Same issue here on a 16-node LXC cluster. It only began with a recent update and reboot, so I also suspect this is an issue with the kernel. But I also notice that the issue persists for an hour or two every time, and then everything goes back to normal. So instead of rebooting every node, I just wait (of course, this is not a solution for a hosting provider).
 
