cluster member fails

aneubau

I have an annoying problem here with a Proxmox server. Running "pveca -l" shows the following error message:
4 : 192.168.200.3 N ERROR: 500 read timeout
This server was the master when the error occurred. I could not log in to the web GUI of this server and still cannot. Since it happened after changing the IP address of one of the other cluster members, I removed the key entries for the old IP address from /root/.ssh/authorized_keys and /root/.ssh/known_hosts.
I have restarted the pv* services. I deleted the cluster, recreated it from scratch, and recreated it with another master; nothing helped.
Any ideas?
 
Is 'pvedaemon' running? What's the content of /etc/pve/cluster.cfg?

Yes, pvedaemon is running.
The content of /etc/pve/cluster.cfg is below (192.168.200.3 is the server with the error message):

maxcid 4

master 1 {
IP: 192.168.200.2
NAME: proxmox02
HOSTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAQEAoOXRuO2rSg7VGKZ9yiSMDrKVJJV+77NuRmHEbIsUQ0HInXVh3W6qGw6Uphcn5Y0+A6iPRkrwh94Jz5P+eJL+cyQD23G6Bh21oVVE5Zqm2YxUcYo2tw4tSKiNILHZ9bVijX4z6YqyW2zUsWk/AhGRQr59FeUMTU9LVQNPAMOrwUVLVRW+QCzWZMbHYksKeabBmkGPrzuUDy7pmYZKGcBKJlTw/YRn0pmkRxmTCvCTHfbuq8kzoHwY4IovWHotujIw+wkIteLzeM8M2xTMNi//+U5ij+YVmJOFSHZab9UcCWfBspDZyyW6R5xw07ejK7EBp8eUdhRO8gKkbSaZZ4lacw==
ROOTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAIEAlJrx42d8cz4my15Gd04spa2PkHieSdpP4tsGWSVed0w4crMKWpiUfgfYqZmHT/K50+FGbmQ79wRCzOoojza1MHifDezs3hkqGc5/tZoDBgPmz6Eia8M/fcC9/wghoKxdKz0dD672r8ZcDjA6XFIzbhhCSn4yCnuGJpuvd2mEOzU=
}

node 2 {
IP: 192.168.200.1
NAME: proxmox01
HOSTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAQEAuGoismBjlZtEAR17ZwXjZFXu4CP6HxbaN2FT5XKQhUG+L7nOULzycgr/HOEjO0mCYV+GctBAIaTOkwh/atroPckAG68ouXV/EgYMS3P2BJ78lxhE49PLl6myaNKmv7IJM6BOucJAUJ/t7JNba7Q+5fiic9BLd0gsK+SYlrFaWTJzy8nHzHjldl6ai8LvSZszhaMWxSRdkGPvL7LpnMJqHTGUk4x89ZV+YlKeZ2VwwIYCV9FAZkoWGWNPqGx6WlCbaKQGy4+tNvuLBFMOZ8w9JKNs+IBRoXuLdkJrwHivCaCcpSpNzg/iT4su9RqFjyINFKP+Hj/qyEAgCH1i/PrD2w==
ROOTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAIEAvDVCcbOFq0oEkyD3PpOgjlKGrBt5RlQ9CD1FJqK3eXzEsUBvXYmVLdWbVQ4tZ3378WvsRKlDPrNLZsOeZkDv6QLEvDumd9CKIQybvDAYKeVipiE3MT6p/jkgvxVv792Nflw1/tQEOOS4ebFJib43pjed5fC4+tPNXEoyy5ZzYEE=
}

node 3 {
IP: 192.168.200.4
NAME: proxmox04
HOSTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAQEAlKEf3ODIYOJyFhl/+x3r/Ed5rL7Bmw5Z68WMjdWt22rvYyEwH7jxBsBJPo74Ql4LBeuChHoMgTkRFnKpHPQ0zb+TlONLdMeIhccWLqyUBOKrI6IT0U/eHhCM5j30dguu7E77U2ZrFQRhXwK2bHj8u95mpSoRE836wWTn0tcm9Qa51msGzEKDNvmyl2HlkRDDJlp3b8id9NwqNxZTZkY00nH7IM4vtWG+ws1Jaw0ds5+cOQ5hCTeEOWbKF+mBf5Tm6kxS69aiT8/59FyCbo0aFhk4NHGTBVszokJr21JKwx4GbvKhRweMB1i5NuPmiy9qCUzdCBRmSJkgrCXWtrEdKw==
ROOTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAIEAvv8a8/ww7gbCHf5AC5fR3bZTzHU39TyGkboJr7EQjkLJnISPrJ6WW2wlQiejtdKYFNZ2BqKrBPLFbGkAN7DUAF8wYw3kitWOO4y5DqZ2OWw6LePsbUPDNSQKgviVnhM/pvt6RdS7l273WH9nBFYffA8+KJ8xLfPF932NxZOg8H8=
}

node 4 {
IP: 192.168.200.3
NAME: proxmox03
HOSTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAQEApJhOYWcLJwg1GsLaf2NnonYry9m44Tzyu0tDVotxRxYbcmHsbBu933jA55VERpPy+p9tAPhPAxULOto18Fj78g18+4gWD8w1b3rnC5HCFs9v46ijEkyTlGwzaT2aqaZr78fL99hCYFwFKttVjU+gxj2jx6UNxg6R02XFnlf1Jg3WuVUWqmmXY7Ee7TCXGeBpjbL25CKz+DxM3mMvd/P9G/HgQvDc2tm88Y5+ohP7I3RbQtpi3hyIpCQHpj7iceDmy5kpGPr/iAR4d2T6efouI6L/crHZSi2x1hMUdntcPK2xiuhZLN2mCIu6EpvOkBwJyexUPROP4QLUa5xIX8wF8w==
ROOTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAIEA46Ah/8a37OaSV1sYj6QqkG72+Cx8tjY/qtyjITkGl1N5y2VqcXJydaUX2hVJTXbObAYzmu1qJCjf1G8HzMhA+y5J4xk5fqoZlXc2kpzrcDV17P3yQOoMecpIJA5whPErwrKVomQk6NkmQ68nyhjjZuyjYeHdB3EHM881V2Dhis8=
}
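For anyone comparing this file across several nodes, here is a minimal Python sketch that reads the block layout shown above into a list of entries. The function name parse_cluster_cfg is made up for illustration, and the parser only assumes the "role id { KEY: value }" layout visible in this post; it is not based on any official description of the format.

```python
import re

def parse_cluster_cfg(text):
    """Return one dict per 'master N { ... }' or 'node N { ... }' block."""
    nodes = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = re.match(r"^(master|node)\s+(\d+)\s*\{$", line)
        if header:
            # Start of a new block: remember role and cluster ID.
            current = {"role": header.group(1), "cid": int(header.group(2))}
        elif line == "}" and current is not None:
            nodes.append(current)
            current = None
        elif current is not None and ":" in line:
            # Inside a block: split "KEY: value" at the first colon.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return nodes

# Shortened sample taken from the post above (key lines omitted).
sample = """maxcid 4

master 1 {
IP: 192.168.200.2
NAME: proxmox02
}

node 4 {
IP: 192.168.200.3
NAME: proxmox03
}
"""

for entry in parse_cluster_cfg(sample):
    print(entry["cid"], entry["role"], entry["IP"], entry["NAME"])
```

Running the same function on the file from each node and comparing the results would show quickly whether one member still carries an outdated IP address.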
 
I have now found an error message in /var/log/auth that might be related to the issue:

Nov 3 18:06:44 proxmox03 sshd[11449]: error: connect_to localhost port 83: failed.

That's the port used by pvedaemon. Do you run a local firewall?
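A quick way to test whether anything accepts connections on that port is a plain TCP connect from the node itself. The sketch below uses only the Python standard library; the helper name port_open and the 3 second timeout are arbitrary choices, and a successful connect only shows that something is listening, not that pvedaemon is healthy.

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False

print("localhost:83 reachable:", port_open("localhost", 83))
```

If this prints False while pvedaemon is running, a local firewall rule or a broken "localhost" name resolution would be the first things to look at.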
 
Hi All


I think I have a somewhat similar issue...

It's a 4-Node cluster:
Node 1 (Master):
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 172.30.0.3 M A 5 days 15:25 0.17 44% 2%
2 : 172.30.0.1 N S 01:56 0.50 31% 1%
3 : 172.30.0.50 N A 8 days 00:32 0.35 81% 7%
4 : 172.30.0.40 N A 01:53 0.49 70% 7%

Node 2 (Node):
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 172.30.0.3 M A 5 days 15:24 0.39 42% 2%
2 : 172.30.0.1 N S 01:54 0.56 31% 1%
3 : 172.30.0.50 N A 8 days 00:30 0.26 81% 7%
4 : 172.30.0.40 N ERROR: 500 read failed: Connection reset by peer

Node 3 (Node):
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 172.30.0.3 M A 5 days 15:26 0.62 44% 2%
2 : 172.30.0.1 N S 01:56 0.55 31% 1%
3 : 172.30.0.50 N A 8 days 00:33 0.26 81% 7%
4 : 172.30.0.40 N A 01:54 0.27 70% 7%

Node 4 (Node)
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 172.30.0.3 M A 5 days 15:27 0.61 44% 2%
2 : 172.30.0.1 N S 01:57 0.43 31% 1%
3 : 172.30.0.50 N A 8 days 00:33 0.19 81% 7%
4 : 172.30.0.40 N A 01:54 0.41 71% 7%
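To compare these listings without reading them by eye, the output can be filtered for rows that carry an ERROR instead of the usual state columns. This is a minimal sketch that assumes the "CID : IP ROLE ..." row layout shown above; the function name failing_nodes is made up, and the sample is the listing from Node 2.

```python
import re

# One row: "<cid> : <ip> <role M|N> <rest of line>"
ROW = re.compile(r"^\s*(\d+)\s*:\s*(\S+)\s+([MN])\s+(.*)$")

def failing_nodes(output):
    """Return (cid, ip, message) for every row whose status is an ERROR."""
    failures = []
    for line in output.splitlines():
        match = ROW.match(line)
        if match and match.group(4).startswith("ERROR:"):
            message = match.group(4)[len("ERROR:"):].strip()
            failures.append((int(match.group(1)), match.group(2), message))
    return failures

sample = """CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 172.30.0.3 M A 5 days 15:24 0.39 42% 2%
2 : 172.30.0.1 N S 01:54 0.56 31% 1%
3 : 172.30.0.50 N A 8 days 00:30 0.26 81% 7%
4 : 172.30.0.40 N ERROR: 500 read failed: Connection reset by peer
"""

print(failing_nodes(sample))
# prints [(4, '172.30.0.40', '500 read failed: Connection reset by peer')]
```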


SSH works from every node to every other node.
On the failing node (172.30.0.1), dmesg shows the following:
RPC: bad TCP reclen 0x504f5354 (non-terminal)
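One observation on that number: the four bytes of 0x504f5354 are printable ASCII, which the short Python sketch below decodes. The helper name reclen_as_ascii is made up. The decoding itself is plain arithmetic; the interpretation is only a guess: it would fit an HTTP request arriving on a port where the kernel expects an RPC record header, but nothing in this thread confirms that.

```python
def reclen_as_ascii(hex_value):
    """Interpret a 32-bit record length as four big-endian ASCII bytes."""
    raw = int(hex_value, 16).to_bytes(4, "big")
    return raw.decode("ascii", errors="replace")

print(reclen_as_ascii("0x504f5354"))
# prints POST
```

If that guess is right, it would be worth checking which service on the cluster sends HTTP requests to 172.30.0.1 and to which port.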

Any idea? This issue appeared overnight...

I already tried to reset the cluster as mentioned here:
http://pve.proxmox.com/wiki/Proxmox_VE_Cluster

Thanks for your help!

Tobias
 
In my case the problem was related to a shared NFS store that I used as a backup target for all servers. The share was no longer available after the IP address changed.
The workaround would be:
Shut down the pve services and apache.
Remove or change the entry for the NFS share in /etc/pve/storage.cfg.
Restart all services.
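To find leftover references to an old address before editing, the text of a config file can be scanned for that exact IP. This is a minimal sketch: the function name lines_with_ip is made up, the addresses are placeholders, and the sample lines are invented for the demonstration and do not reproduce the real storage.cfg syntax.

```python
import re

def lines_with_ip(text, ip):
    """Return (line_number, line) pairs that mention exactly this IP."""
    # Guard both ends so 10.0.0.3 does not match inside 10.0.0.30 or 110.0.0.3.
    pattern = re.compile(r"(?<![\d.])" + re.escape(ip) + r"(?!\.?\d)")
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if pattern.search(line):
            hits.append((number, line.strip()))
    return hits

# Invented sample lines; 10.0.0.3 stands in for the old address.
sample = """backup-share on 10.0.0.30:/export/backup
old-share on 10.0.0.3:/export/backup
local directory /var/lib/vz
"""

print(lines_with_ip(sample, "10.0.0.3"))
# prints [(2, 'old-share on 10.0.0.3:/export/backup')]
```

In practice the text would come from reading /etc/pve/storage.cfg on each node, with the pve services stopped before any change as described above.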
 
Hi Aneubau


Nope, storage.cfg is correct - same thing on all systems.

What disturbs me is the RPC error...

CU
Tobias
 
I encountered a similar problem where a node could not be synced. The file /var/log/auth.log on the master was full of these messages:
Code:
sshd[7173]: error: connect_to localhost: unknown host (Name or service not known)
What fixed the problem for me was adding localhost to the file /etc/hosts on the master.
Code:
127.0.0.1       localhost.localdomain localhost
x.x.x.x   the.hostname.com pvelocalhost
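To check a node for this condition, the hosts file can be parsed for the names attached to 127.0.0.1. The sketch below follows the usual hosts layout (address first, then names, with "#" starting a comment); the function names names_for and has_localhost are made up for illustration.

```python
def names_for(hosts_text, address):
    """Return all host names listed for the given address."""
    names = []
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2 and fields[0] == address:
            names.extend(fields[1:])
    return names

def has_localhost(hosts_text):
    """True if 127.0.0.1 is mapped to the name 'localhost'."""
    return "localhost" in names_for(hosts_text, "127.0.0.1")

good = "127.0.0.1       localhost.localdomain localhost\n"
bad = "# 127.0.0.1 localhost\n127.0.0.1 localhost.localdomain\n"

print(has_localhost(good), has_localhost(bad))
# prints True False
```

On a real node the text would come from reading /etc/hosts; a False result would match the "connect_to localhost: unknown host" message above.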