cluster member fails

aneubau · Oct 30, 2009

I have an annoying problem here with a Proxmox server. "pveca -l" shows the following error message:
4 : 192.168.200.3 N ERROR: 500 read timeout
This server was the master when the error occurred. I could not log in to its web GUI and still cannot. Since it happened after changing the IP address of one of the other cluster members, I removed the key entries for the old IP address from /root/.ssh/authorized_keys and /root/.ssh/known_hosts.
I have restarted the pv* services. I also deleted the cluster and recreated it from scratch, including with a different master. Nothing helped.
Any ideas?

dietmar · Oct 30, 2009

Is 'pvedaemon' running? What's the content of /etc/pve/cluster.cfg?

aneubau · Oct 30, 2009

dietmar said:
Is 'pvedaemon' running? What's the content of /etc/pve/cluster.cfg?

Yes, pvedaemon is running.
The content of /etc/pve/cluster.cfg is below (192.168.200.3 is the server with the error message):

maxcid 4

master 1 {
IP: 192.168.200.2
NAME: proxmox02
HOSTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAQEAoOXRuO2rSg7VGKZ9yiSMDrKVJJV+77NuRmHEbIsUQ0HInXVh3W6qGw6Uphcn5Y0+A6iPRkrwh94Jz5P+eJL+cyQD23G6Bh21oVVE5Zqm2YxUcYo2tw4tSKiNILHZ9bVijX4z6YqyW2zUsWk/AhGRQr59FeUMTU9LVQNPAMOrwUVLVRW+QCzWZMbHYksKeabBmkGPrzuUDy7pmYZKGcBKJlTw/YRn0pmkRxmTCvCTHfbuq8kzoHwY4IovWHotujIw+wkIteLzeM8M2xTMNi//+U5ij+YVmJOFSHZab9UcCWfBspDZyyW6R5xw07ejK7EBp8eUdhRO8gKkbSaZZ4lacw==
ROOTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAIEAlJrx42d8cz4my15Gd04spa2PkHieSdpP4tsGWSVed0w4crMKWpiUfgfYqZmHT/K50+FGbmQ79wRCzOoojza1MHifDezs3hkqGc5/tZoDBgPmz6Eia8M/fcC9/wghoKxdKz0dD672r8ZcDjA6XFIzbhhCSn4yCnuGJpuvd2mEOzU=
}

node 2 {
IP: 192.168.200.1
NAME: proxmox01
HOSTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAQEAuGoismBjlZtEAR17ZwXjZFXu4CP6HxbaN2FT5XKQhUG+L7nOULzycgr/HOEjO0mCYV+GctBAIaTOkwh/atroPckAG68ouXV/EgYMS3P2BJ78lxhE49PLl6myaNKmv7IJM6BOucJAUJ/t7JNba7Q+5fiic9BLd0gsK+SYlrFaWTJzy8nHzHjldl6ai8LvSZszhaMWxSRdkGPvL7LpnMJqHTGUk4x89ZV+YlKeZ2VwwIYCV9FAZkoWGWNPqGx6WlCbaKQGy4+tNvuLBFMOZ8w9JKNs+IBRoXuLdkJrwHivCaCcpSpNzg/iT4su9RqFjyINFKP+Hj/qyEAgCH1i/PrD2w==
ROOTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAIEAvDVCcbOFq0oEkyD3PpOgjlKGrBt5RlQ9CD1FJqK3eXzEsUBvXYmVLdWbVQ4tZ3378WvsRKlDPrNLZsOeZkDv6QLEvDumd9CKIQybvDAYKeVipiE3MT6p/jkgvxVv792Nflw1/tQEOOS4ebFJib43pjed5fC4+tPNXEoyy5ZzYEE=
}

node 3 {
IP: 192.168.200.4
NAME: proxmox04
HOSTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAQEAlKEf3ODIYOJyFhl/+x3r/Ed5rL7Bmw5Z68WMjdWt22rvYyEwH7jxBsBJPo74Ql4LBeuChHoMgTkRFnKpHPQ0zb+TlONLdMeIhccWLqyUBOKrI6IT0U/eHhCM5j30dguu7E77U2ZrFQRhXwK2bHj8u95mpSoRE836wWTn0tcm9Qa51msGzEKDNvmyl2HlkRDDJlp3b8id9NwqNxZTZkY00nH7IM4vtWG+ws1Jaw0ds5+cOQ5hCTeEOWbKF+mBf5Tm6kxS69aiT8/59FyCbo0aFhk4NHGTBVszokJr21JKwx4GbvKhRweMB1i5NuPmiy9qCUzdCBRmSJkgrCXWtrEdKw==
ROOTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAIEAvv8a8/ww7gbCHf5AC5fR3bZTzHU39TyGkboJr7EQjkLJnISPrJ6WW2wlQiejtdKYFNZ2BqKrBPLFbGkAN7DUAF8wYw3kitWOO4y5DqZ2OWw6LePsbUPDNSQKgviVnhM/pvt6RdS7l273WH9nBFYffA8+KJ8xLfPF932NxZOg8H8=
}

node 4 {
IP: 192.168.200.3
NAME: proxmox03
HOSTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAQEApJhOYWcLJwg1GsLaf2NnonYry9m44Tzyu0tDVotxRxYbcmHsbBu933jA55VERpPy+p9tAPhPAxULOto18Fj78g18+4gWD8w1b3rnC5HCFs9v46ijEkyTlGwzaT2aqaZr78fL99hCYFwFKttVjU+gxj2jx6UNxg6R02XFnlf1Jg3WuVUWqmmXY7Ee7TCXGeBpjbL25CKz+DxM3mMvd/P9G/HgQvDc2tm88Y5+ohP7I3RbQtpi3hyIpCQHpj7iceDmy5kpGPr/iAR4d2T6efouI6L/crHZSi2x1hMUdntcPK2xiuhZLN2mCIu6EpvOkBwJyexUPROP4QLUa5xIX8wF8w==
ROOTRSAPUBKEY: AAAAB3NzaC1yc2EAAAABIwAAAIEA46Ah/8a37OaSV1sYj6QqkG72+Cx8tjY/qtyjITkGl1N5y2VqcXJydaUX2hVJTXbObAYzmu1qJCjf1G8HzMhA+y5J4xk5fqoZlXc2kpzrcDV17P3yQOoMecpIJA5whPErwrKVomQk6NkmQ68nyhjjZuyjYeHdB3EHM881V2Dhis8=
}

dietmar · Oct 30, 2009

aneubau said:
192.168.200.3 is the server with the error

Try to manually connect to 192.168.200.3

# ssh 192.168.200.3

Does that give you a shell without prompting for anything?

aneubau · Oct 30, 2009

dietmar said:
Try to manually connect to 192.168.200.3

# ssh 192.168.200.3

Does that give you a shell without prompting for anything?

This works as expected

dietmar · Oct 31, 2009

And pvedaemon is running on both the master and the node?

dietmar · Nov 1, 2009

Also, the tunnel daemon (pvetunnel) needs to be running. Any hints in syslog?

aneubau · Nov 3, 2009

dietmar said:
Also, the tunnel daemon (pvetunnel) needs to be running. Any hints in syslog?

pvetunnel is running. Syslog shows the same error message:

Nov 3 12:14:38 proxmox03 pvemirror[11518]: syncing vzlist from '192.168.200.3' failed: 500 read timeout

aneubau · Nov 3, 2009

dietmar said:
Also, the tunnel daemon (pvetunnel) needs to be running. Any hints in syslog?

I have now found an error message in /var/log/auth that might be related to the issue:

Nov 3 18:06:44 proxmox03 sshd[11449]: error: connect_to localhost port 83: failed.

dietmar · Nov 3, 2009

aneubau said:
I have now found an error message in /var/log/auth that might be related to the issue:

Nov 3 18:06:44 proxmox03 sshd[11449]: error: connect_to localhost port 83: failed.

That's the port used by pvedaemon. Do you run a local firewall?
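
A quick way to confirm whether anything is accepting connections on that port is a plain TCP connect from the affected machine. The following is only a minimal Python sketch: the helper name `port_is_open` is made up for illustration, and port 83 is simply the number from the sshd message above.

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False

if __name__ == "__main__":
    # Port 83 is taken from the sshd log line quoted in this thread.
    print("localhost:83 reachable:", port_is_open("localhost", 83))
```

A False result only says that the connect failed. It does not tell you whether the daemon is down, bound to another address, or blocked by a packet filter.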

aneubau · Nov 4, 2009

dietmar said:
That's the port used by pvedaemon. Do you run a local firewall?

No, I don't use any firewall on the proxmox server.

dietmar · Nov 4, 2009

I am out of ideas now, sorry. But if you provide me with a login I can debug it. Please contact me at dietmar@proxmox.com

iprigger · Nov 16, 2009

Hi All

aneubau said:
I have an annoying problem here with a Proxmox server. "pveca -l" shows the following error message:
4 : 192.168.200.3 N ERROR: 500 read timeout
This server was the master when the error occurred. I could not log in to its web GUI and still cannot. Since it happened after changing the IP address of one of the other cluster members, I removed the key entries for the old IP address from /root/.ssh/authorized_keys and /root/.ssh/known_hosts.
I have restarted the pv* services. I also deleted the cluster and recreated it from scratch, including with a different master. Nothing helped.
Any ideas?

I think I have a somewhat similar issue...

It's a 4-Node cluster:
Node 1 (Master):
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 172.30.0.3 M A 5 days 15:25 0.17 44% 2%
2 : 172.30.0.1 N S 01:56 0.50 31% 1%
3 : 172.30.0.50 N A 8 days 00:32 0.35 81% 7%
4 : 172.30.0.40 N A 01:53 0.49 70% 7%

Node 2 (Node):
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 172.30.0.3 M A 5 days 15:24 0.39 42% 2%
2 : 172.30.0.1 N S 01:54 0.56 31% 1%
3 : 172.30.0.50 N A 8 days 00:30 0.26 81% 7%
4 : 172.30.0.40 N ERROR: 500 read failed: Connection reset by peer

Node 3 (Node):
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 172.30.0.3 M A 5 days 15:26 0.62 44% 2%
2 : 172.30.0.1 N S 01:56 0.55 31% 1%
3 : 172.30.0.50 N A 8 days 00:33 0.26 81% 7%
4 : 172.30.0.40 N A 01:54 0.27 70% 7%

Node 4 (Node)
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 172.30.0.3 M A 5 days 15:27 0.61 44% 2%
2 : 172.30.0.1 N S 01:57 0.43 31% 1%
3 : 172.30.0.50 N A 8 days 00:33 0.19 81% 7%
4 : 172.30.0.40 N A 01:54 0.41 71% 7%
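
When comparing listings from several nodes, it can help to pull out only the rows that report an error. The sketch below is a rough Python illustration. It assumes the row layout seen in the listings pasted in this thread (CID, a colon, IP, role, then either the state columns or an ERROR message) and is not based on any documented pveca output format. The function name `failed_members` is invented.

```python
def failed_members(listing):
    """Return (cid, ip, message) for every row whose state column starts with ERROR."""
    failed = []
    for line in listing.splitlines():
        # Expected fields: cid, ':', ip, role, rest-of-line
        parts = line.split(None, 4)
        if len(parts) < 5 or parts[1] != ":" or not parts[0].isdigit():
            continue  # header line or anything else that is not a member row
        cid, _, ip, _role, rest = parts
        if rest.startswith("ERROR:"):
            failed.append((int(cid), ip, rest[len("ERROR:"):].strip()))
    return failed

# Rows copied from the Node 2 listing in this thread.
sample = """CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 172.30.0.3 M A 5 days 15:24 0.39 42% 2%
4 : 172.30.0.40 N ERROR: 500 read failed: Connection reset by peer"""

print(failed_members(sample))
```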

SSH works from every node to every other node.
On the failing node (172.30.0.1) I see the following in dmesg:
RPC: bad TCP reclen 0x504f5354 (non-terminal)

Any idea? This issue appeared overnight...

I already tried to reset the cluster as mentioned here:
http://pve.proxmox.com/wiki/Proxmox_VE_Cluster

Thanks for your help!

Tobias

aneubau · Nov 16, 2009

In my case the problem was related to a shared NFS store that I used for backups of all servers. The share had been unavailable since the IP address changed.
The workaround would be:
Shut down the pve services and apache.
Remove or change the entry for the NFS share in /etc/pve/storage.cfg.
Restart all services.
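
A read timeout like the one above is consistent with a process blocking on an unreachable NFS mount. One way to check a mount point without hanging your own shell is to probe it from a helper thread and give up after a few seconds. This is a minimal Python sketch only: the function name `path_responds` is invented, and `/mnt/backup` is a hypothetical mount path, not one taken from this thread.

```python
import os
import threading

def path_responds(path, timeout=3.0):
    """Stat a path in a helper thread.

    Returns True if the path exists, False if stat fails quickly,
    and None if stat did not return within `timeout` seconds
    (typical for a hard-mounted NFS share whose server is gone).
    """
    result = {}

    def probe():
        try:
            os.stat(path)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    # Daemon thread: a stuck stat call will not keep the interpreter alive.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout)
    return result.get("ok")

if __name__ == "__main__":
    # Hypothetical mount point, replace with the path of your NFS storage.
    print(path_responds("/mnt/backup"))
```

A result of None is the interesting case: the call never came back, which is what a stale mount usually looks like.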

iprigger · Nov 16, 2009

Hi Aneubau

aneubau said:
In my case the problem was related to a shared NFS store that I used for backups of all servers. The share had been unavailable since the IP address changed.
The workaround would be:
Shut down the pve services and apache.
Remove or change the entry for the NFS share in /etc/pve/storage.cfg.
Restart all services.

Nope, storage.cfg is correct and identical on all systems.

What disturbs me is the RPC error...

CU
Tobias

dietmar · Nov 17, 2009

iprigger said:
What disturbs me is the RPC error...

That does not look like an application bug.
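
One detail about the dmesg line may be worth noting: the reclen value 0x504f5354 consists of four printable ASCII bytes. The short Python sketch below decodes it (the helper name `decode_reclen` is made up for illustration). The result reads POST, which suggests, and no more than suggests, that an HTTP request arrived on a port where the kernel expected RPC traffic. The thread does not confirm which service or port was involved.

```python
def decode_reclen(value):
    """Interpret a 32-bit record length as four big-endian ASCII bytes."""
    return value.to_bytes(4, "big").decode("ascii")

# Value copied from the dmesg message quoted in this thread.
print(decode_reclen(0x504F5354))  # POST
```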

ano · Jun 7, 2010

I encountered a similar problem where a node could not be synced. The file /var/log/auth.log on the master was full of these messages:

Code:

sshd[7173]: error: connect_to localhost: unknown host (Name or service not known)

What fixed the problem for me was adding localhost to the file /etc/hosts on the master.

Code:

127.0.0.1       localhost.localdomain localhost
x.x.x.x   the.hostname.com pvelocalhost
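
To check several machines for the same gap, the hosts file can be scanned for a line that actually names localhost. This is a minimal Python sketch under the usual hosts file layout (address first, then names, `#` starting a comment). The function name `localhost_addresses` is invented for illustration.

```python
def localhost_addresses(hosts_text):
    """Return the addresses of all hosts-file lines that list the exact name 'localhost'."""
    addresses = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2 and "localhost" in fields[1:]:
            addresses.append(fields[0])
    return addresses

# First line is the fix from this thread; second line uses a placeholder address.
good = "127.0.0.1       localhost.localdomain localhost\n10.0.0.5 the.hostname.com pvelocalhost\n"
broken = "10.0.0.5 the.hostname.com pvelocalhost\n"

print(localhost_addresses(good))
print(localhost_addresses(broken))
```

An empty list for a machine's /etc/hosts content means no line names localhost, which matches the "unknown host" message above. Note that `pvelocalhost` does not count, since the match is on the whole name.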
