cluster member fails

aneubau

I have an annoying problem here with a Proxmox server. Running "pveca -l" shows the following error message:
4 : 192.168.200.3 N ERROR: 500 read timeout
It was previously the master when the error occurred. I could not log in to the web GUI of this server, and I still cannot. Since it happened after changing the IP address of one of the other cluster members, I removed the entries for the old IP address from /root/.ssh/authorized_keys and /root/.ssh/known_hosts.
I have restarted the pv* services. I deleted the cluster, recreated it from scratch, and recreated it with a different master; nothing helped.
Any ideas?
 
Is 'pvedaemon' running? What's the content of /etc/pve/cluster.cfg?

Yes, pvedaemon is running.
The content of /etc/pve/cluster.cfg is below (192.168.200.3 is the server with the error message):

maxcid 4

master 1 {
IP: 192.168.200.2
NAME: proxmox02
HOSTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAQEAoOXRuO2rSg7VGKZ9yiSMDrKVJJV+77NuRmHEbIsUQ0HInXVh3W6qGw6Uphcn5Y0+A6iPRkrwh94Jz5P+eJL+cyQD23G6Bh21oVVE5Zqm2YxUcYo2tw4tSKiNILHZ9bVijX4z6YqyW2zUsWk/AhGRQr59FeUMTU9LVQNPAMOrwUVLVRW+QCzWZMbHYksKeabBmkGPrzuUDy7pmYZKGcBKJlTw/YRn0pmkRxmTCvCTHfbuq8kzoHwY4IovWHotujIw+wkIteLzeM8M2xTMNi//+U5ij+YVmJOFSHZab9UcCWfBspDZyyW6R5xw07ejK7EBp8eUdhRO8gKkbSaZZ4lacw==
ROOTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAIEAlJrx42d8cz4my15Gd04spa2PkHieSdpP4tsGWSVed0w4crMKWpiUfgfYqZmHT/K50+FGbmQ79wRCzOoojza1MHifDezs3hkqGc5/tZoDBgPmz6Eia8M/fcC9/wghoKxdKz0dD672r8ZcDjA6XFIzbhhCSn4yCnuGJpuvd2mEOzU=
}

node 2 {
IP: 192.168.200.1
NAME: proxmox01
HOSTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAQEAuGoismBjlZtEAR17ZwXjZFXu4CP6HxbaN2FT5XKQhUG+L7nOULzycgr/HOEjO0mCYV+GctBAIaTOkwh/atroPckAG68ouXV/EgYMS3P2BJ78lxhE49PLl6myaNKmv7IJM6BOucJAUJ/t7JNba7Q+5fiic9BLd0gsK+SYlrFaWTJzy8nHzHjldl6ai8LvSZszhaMWxSRdkGPvL7LpnMJqHTGUk4x89ZV+YlKeZ2VwwIYCV9FAZkoWGWNPqGx6WlCbaKQGy4+tNvuLBFMOZ8w9JKNs+IBRoXuLdkJrwHivCaCcpSpNzg/iT4su9RqFjyINFKP+Hj/qyEAgCH1i/PrD2w==
ROOTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAIEAvDVCcbOFq0oEkyD3PpOgjlKGrBt5RlQ9CD1FJqK3eXzEsUBvXYmVLdWbVQ4tZ3378WvsRKlDPrNLZsOeZkDv6QLEvDumd9CKIQybvDAYKeVipiE3MT6p/jkgvxVv792Nflw1/tQEOOS4ebFJib43pjed5fC4+tPNXEoyy5ZzYEE=
}

node 3 {
IP: 192.168.200.4
NAME: proxmox04
HOSTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAQEAlKEf3ODIYOJyFhl/+x3r/Ed5rL7Bmw5Z68WMjdWt22rvYyEwH7jxBsBJPo74Ql4LBeuChHoMgTkRFnKpHPQ0zb+TlONLdMeIhccWLqyUBOKrI6IT0U/eHhCM5j30dguu7E77U2ZrFQRhXwK2bHj8u95mpSoRE836wWTn0tcm9Qa51msGzEKDNvmyl2HlkRDDJlp3b8id9NwqNxZTZkY00nH7IM4vtWG+ws1Jaw0ds5+cOQ5hCTeEOWbKF+mBf5Tm6kxS69aiT8/59FyCbo0aFhk4NHGTBVszokJr21JKwx4GbvKhRweMB1i5NuPmiy9qCUzdCBRmSJkgrCXWtrEdKw==
ROOTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAIEAvv8a8/ww7gbCHf5AC5fR3bZTzHU39TyGkboJr7EQjkLJnISPrJ6WW2wlQiejtdKYFNZ2BqKrBPLFbGkAN7DUAF8wYw3kitWOO4y5DqZ2OWw6LePsbUPDNSQKgviVnhM/pvt6RdS7l273WH9nBFYffA8+KJ8xLfPF932NxZOg8H8=
}

node 4 {
IP: 192.168.200.3
NAME: proxmox03
HOSTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAQEApJhOYWcLJwg1GsLaf2NnonYry9m44Tzyu0tDVotxRxYbcmHsbBu933jA55VERpPy+p9tAPhPAxULOto18Fj78g18+4gWD8w1b3rnC5HCFs9v46ijEkyTlGwzaT2aqaZr78fL99hCYFwFKttVjU+gxj2jx6UNxg6R02XFnlf1Jg3WuVUWqmmXY7Ee7TCXGeBpjbL25CKz+DxM3mMvd/P9G/HgQvDc2tm88Y5+ohP7I3RbQtpi3hyIpCQHpj7iceDmy5kpGPr/iAR4d2T6efouI6L/crHZSi2x1hMUdntcPK2xiuhZLN2mCIu6EpvOkBwJyexUPROP4QLUa5xIX8wF8w==
ROOTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAIEA46Ah/8a37OaSV1sYj6QqkG72+Cx8tjY/qtyjITkGl1N5y2VqcXJydaUX2hVJTXbObAYzmu1qJCjf1G8HzMhA+y5J4xk5fqoZlXc2kpzrcDV17P3yQOoMecpIJA5whPErwrKVomQk6NkmQ68nyhjjZuyjYeHdB3EHM881V2Dhis8=
}
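
For anyone comparing this file across nodes, here is a minimal Python sketch that reads text shaped like the listing above into a list of dictionaries. It only assumes the layout visible in this post (a `master N {` or `node N {` header, `KEY: value` lines, a closing `}`); the function names `parse_cluster_cfg` and `find_by_ip` are made up for illustration and are not part of any Proxmox tool.

```python
import re


def parse_cluster_cfg(text):
    """Parse text shaped like the cluster.cfg listing above into a list of dicts.

    Assumption: blocks look like 'master 1 {' / 'node 4 {', followed by
    'KEY: value' lines and a closing '}'. Other lines (e.g. 'maxcid 4')
    are ignored.
    """
    members = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = re.match(r"^(master|node)\s+(\d+)\s*\{$", line)
        if header:
            # Start of a new member block.
            current = {"role": header.group(1), "cid": int(header.group(2))}
        elif line == "}" and current is not None:
            # End of the current member block.
            members.append(current)
            current = None
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return members


def find_by_ip(members, ip):
    """Return all parsed members whose IP field equals the given address."""
    return [m for m in members if m.get("IP") == ip]
```

Running `find_by_ip(parse_cluster_cfg(open("/etc/pve/cluster.cfg").read()), "192.168.200.3")` on each node would make it easy to see whether every node still lists the same entry for the failing member.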
 
I have now found an error message in /var/log/auth that might be related to the issue:

Nov 3 18:06:44 proxmox03 sshd[11449]: error: connect_to localhost port 83: failed.

That's the port used by pvedaemon. Do you run a local firewall?
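
A quick way to tell whether anything accepts connections on that port is a plain TCP connect test run on the affected node. The Python sketch below does only that; the helper name `port_is_open` and the 3-second timeout are arbitrary choices, and a successful connect says nothing about whether the daemon behind the port is healthy.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example use on the failing node (port 83 is the one named in the log line):
# print("localhost:83 reachable:", port_is_open("localhost", 83))
```

If this returns False for `localhost` but True for `127.0.0.1`, name resolution or a packet filter on the loopback interface would be the next thing to look at.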
 
Hi All

I think I have a somewhat similar issue...

It's a 4-Node cluster:
Node 1 (Master):
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 172.30.0.3 M A 5 days 15:25 0.17 44% 2%
2 : 172.30.0.1 N S 01:56 0.50 31% 1%
3 : 172.30.0.50 N A 8 days 00:32 0.35 81% 7%
4 : 172.30.0.40 N A 01:53 0.49 70% 7%

Node 2 (Node):
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 172.30.0.3 M A 5 days 15:24 0.39 42% 2%
2 : 172.30.0.1 N S 01:54 0.56 31% 1%
3 : 172.30.0.50 N A 8 days 00:30 0.26 81% 7%
4 : 172.30.0.40 N ERROR: 500 read failed: Connection reset by peer

Node 3 (Node):
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 172.30.0.3 M A 5 days 15:26 0.62 44% 2%
2 : 172.30.0.1 N S 01:56 0.55 31% 1%
3 : 172.30.0.50 N A 8 days 00:33 0.26 81% 7%
4 : 172.30.0.40 N A 01:54 0.27 70% 7%

Node 4 (Node)
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 172.30.0.3 M A 5 days 15:27 0.61 44% 2%
2 : 172.30.0.1 N S 01:57 0.43 31% 1%
3 : 172.30.0.50 N A 8 days 00:33 0.19 81% 7%
4 : 172.30.0.40 N A 01:54 0.41 71% 7%


SSH works from any node to any other node.
On the failing node (172.30.0.1) I see the following in dmesg:
RPC: bad TCP reclen 0x504f5354 (non-terminal)

Any ideas? This issue appeared overnight...

I already tried to reset the cluster as mentioned here:
http://pve.proxmox.com/wiki/Proxmox_VE_Cluster

Thanks for your help!

Tobias
 
In my case the problem was related to a shared NFS store that I used as backup storage for all servers. The share was no longer available after the IP address changed.
The workaround would be:
Shut down the pve services and apache.
Remove or change the entry for the NFS share in /etc/pve/storage.cfg.
Restart all services.
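
To find leftover references to an old address before editing, a small Python sketch like the following can list the lines of a config file that mention a given IP. It does not assume anything about the storage.cfg syntax (which is not shown in this thread); it only searches for the address as a whole token, so 192.168.200.3 does not match 192.168.200.30. The function name `lines_mentioning_ip` is made up for illustration.

```python
import re


def lines_mentioning_ip(text, ip):
    """Return (line_number, line) pairs for lines that contain the exact IP.

    The lookarounds stop the address from matching inside a longer
    dotted number such as 192.168.200.30.
    """
    pattern = re.compile(r"(?<![\d.])" + re.escape(ip) + r"(?!\.?\d)")
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]


# Example use (the address is a placeholder for the old IP):
# with open("/etc/pve/storage.cfg") as handle:
#     for number, line in lines_mentioning_ip(handle.read(), "192.168.200.3"):
#         print(number, line)
```

The same check works for /root/.ssh/known_hosts or /etc/hosts, since it treats the file as plain text.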
 
Hi Aneubau

Nope, storage.cfg is correct - same thing on all systems.

What disturbs me is the RPC error...

CU
Tobias
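
A note on the RPC message quoted earlier: the value 0x504f5354 is what you get when the four ASCII bytes "POST" are read as an RPC-over-TCP record marker (a 4-byte big-endian header whose top bit flags the last fragment and whose lower 31 bits give the length). One possible reading, offered as an assumption and not a diagnosis, is that something sent an HTTP POST to a port where the kernel's RPC code was listening. The Python sketch below just decodes the value; `decode_record_marker` is a made-up helper name.

```python
import struct


def decode_record_marker(value):
    """Split a 32-bit RPC record marker into its parts and show its raw bytes."""
    raw = struct.pack(">I", value)  # 4 bytes, big-endian, as sent on the wire
    return {
        "last_fragment": bool(value & 0x80000000),  # top bit set = final fragment
        "length": value & 0x7FFFFFFF,               # lower 31 bits = fragment length
        "ascii": raw.decode("ascii", errors="replace"),
    }


info = decode_record_marker(0x504F5354)
print(info["ascii"], info["last_fragment"])  # prints: POST False
```

The top bit being clear matches the "(non-terminal)" part of the kernel message, which fits the idea that these bytes were never a real RPC header.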
 
I encountered a similar problem where a node could not be synced. The file /var/log/auth.log on the master was full of these messages:
Code:
sshd[7173]: error: connect_to localhost: unknown host (Name or service not known)
What fixed the problem for me was adding localhost to the file /etc/hosts on the master.
Code:
127.0.0.1       localhost.localdomain localhost
x.x.x.x   the.hostname.com pvelocalhost
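
To check for this condition without reading the file by eye, here is a minimal Python sketch. `resolves` asks the system resolver for a name, and `hosts_file_names` collects the names defined in hosts-file text so you can confirm that `localhost` is among them. Both function names are made up for illustration.

```python
import socket


def resolves(name):
    """Return the IPv4 address the system resolver gives for name, or None."""
    try:
        return socket.gethostbyname(name)
    except OSError:
        return None


def hosts_file_names(text):
    """Collect every hostname and alias defined in hosts-file formatted text."""
    names = set()
    for line in text.splitlines():
        # Drop comments, then split into: address, name, optional aliases.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            names.update(fields[1:])
    return names


# Example use on the master:
# print("localhost resolves to:", resolves("localhost"))
# with open("/etc/hosts") as handle:
#     print("localhost in /etc/hosts:", "localhost" in hosts_file_names(handle.read()))
```

If `resolves("localhost")` returns None, the sshd error above is the expected result, and adding the 127.0.0.1 line should clear it.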
 
