Clustering has stopped working: 500 read timeout, even from a node to itself.

Since 4:15 PM yesterday, the logs on all 3 nodes in my cluster have been filling up with:
MASTER:
Dec 20 06:24:06 hyper1 pvemirror[3373]: starting cluster syncronization
Dec 20 06:24:16 hyper1 pvemirror[3373]: syncing vzlist from '10.9.0.8' failed: 500 read timeout
Dec 20 06:24:26 hyper1 pvemirror[3373]: syncing vzlist from '10.9.0.9' failed: 500 read timeout
Dec 20 06:24:36 hyper1 pvemirror[3373]: syncing vzlist from '10.9.0.10' failed: 500 read timeout

Other Node:
Dec 20 06:24:14 hyper3 pvemirror[3366]: syncing master configuration from '10.9.0.8'
Dec 20 06:24:24 hyper3 pvemirror[3366]: syncing vzlist from '10.9.0.8' failed: 500 read timeout
Dec 20 06:24:34 hyper3 pvemirror[3366]: syncing vzlist from '10.9.0.9' failed: 500 read timeout
Dec 20 06:24:44 hyper3 pvemirror[3366]: syncing vzlist from '10.9.0.10' failed: 500 read timeout

We hadn't changed anything, except that all the nodes have an NFS mount to a server that has since failed, and I can't umount it...

I tried stopping the cluster and tunnel services on all nodes and recreating the cluster, but something is clearly stopping the nodes from communicating with each other. Output from each:
MASTER:
hyper1:/etc/pve# pveca -l
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 10.9.0.8 M ERROR: 500 read timeout
2 : 10.9.0.9 N ERROR: 500 read timeout
3 : 10.9.0.10 N ERROR: 500 read timeout

hyper2:/var/log# pveca -l
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 10.9.0.8 M ERROR: 500 read timeout
2 : 10.9.0.9 N ERROR: 500 read timeout


hyper3:~/.ssh# pveca -l
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 10.9.0.8 M ERROR: 500 read timeout
2 : 10.9.0.9 N ERROR: 500 read timeout
3 : 10.9.0.10 N ERROR: 500 read timeout

I think it's interesting that hyper2 only sees itself and the master, and yet hyper3 lists all 3.
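
For anyone hitting the same thing: a dead NFS server can leave the status-gathering daemons hanging while they stat the mount, which looks exactly like these timeouts. A quick way to check on each node (the share name below is a made-up example, not our real one):

hyper1:~# grep nfs /proc/mounts      # list the NFS mounts the kernel still has
hyper1:~# ls /mnt/pve/deadfiler      # 'deadfiler' is hypothetical; if this never returns, the mount is hung

If the ls just sits there, the mount is stale and anything that touches it (including the storage checks) will block.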


This is a production environment with customer VMs on it. Hoping someone else has seen something similar or has some useful suggestions!

Thanks in advance.
 
So all 3 of these servers had an NFS mount from a filer that has failed and been removed. Running umount -f 'mount point' cleared things up on hyper1 and hyper2, but not on hyper3. I also forgot to mention that we were unable to connect to the web interface on any of these nodes either.

We can now get to the web interface on 1 and 2, but not 3.
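
For reference, the kind of commands involved (the mount point is an example; use whatever grep nfs /proc/mounts shows on your node):

hyper1:~# umount -f /mnt/pve/deadfiler   # force unmount of the dead share ('deadfiler' is a made-up name)
hyper1:~# umount -l /mnt/pve/deadfiler   # lazy unmount as a fallback if -f refuses; detaches the path even while busy

The lazy unmount at least detaches the path so new processes stop blocking on it, even if already-stuck processes stay stuck.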

The 'clustering' looks a little better but is still not working properly:
Master:
hyper1:/boot/grub# pveca -l
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 10.9.0.8 M S 5 days 04:40 0.24 56% 4%
2 : 10.9.0.9 N S 35 days 18:38 1.29 64% 7%
3 : 10.9.0.10 N ERROR: 500 read timeout
hyper2:/boot# pveca -l
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 10.9.0.8 M ERROR: 500 Can't connect to 127.0.0.1:50000 (connect: Connection refused)
2 : 10.9.0.9 N S 35 days 18:48 0.14 64% 7%
3 : 10.9.0.10 N ERROR: 500 Can't connect to 127.0.0.1:50001 (connect: Connection refused)


hyper3:/boot# pveca -l
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 10.9.0.8 M S 5 days 04:42 0.06 55% 4%
2 : 10.9.0.9 N S 35 days 18:41 0.20 64% 7%
3 : 10.9.0.10 N ERROR: 500 read timeout
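
Those 'Can't connect to 127.0.0.1:50000/50001' errors on hyper2 suggest the local tunnels it uses to reach the other nodes aren't up. If memory serves, those are maintained by the pvetunnel daemon, so restarting it (and pvemirror) on the affected node is worth a try, something like:

hyper2:~# /etc/init.d/pvetunnel restart   # re-establish the local port-forward tunnels to the other nodes
hyper2:~# /etc/init.d/pvemirror restart   # restart the sync daemon that was logging the timeouts
hyper2:~# netstat -ltnp | grep 5000       # the 5000x tunnel ports should be listening again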
 
You need to remove that failed NFS storage from all nodes (manually edit /etc/pve/storage.cfg), and reboot the nodes if you can't umount.
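
From memory, the entry to remove is the nfs stanza for the failed filer, roughly like this (the storage name, server and export here are only placeholders):

nfs: deadfiler
        path /mnt/pve/deadfiler
        server 10.9.0.50
        export /srv/export
        content images,iso

Delete the whole stanza (keep a backup copy of storage.cfg first), and the status checks will stop trying to stat that storage.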
 
Thanks Dietmar,

I did that and it worked on 1 and 2. On 3 the unmount succeeded but didn't clear up the issue, so I ended up having to manually migrate the VMs off 3 and then reboot it. The only thing I noticed is that the load average stayed quite high on 3 even after unmounting the failed NFS mount, while it went back to normal on 1 and 2.
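
From what I can tell, the lingering high load is processes stuck in uninterruptible sleep ('D' state) waiting on the dead mount; they inflate the load average without using any CPU, and usually only a reboot clears them. Something like this should show them:

hyper3:~# ps axo pid,stat,wchan,cmd | awk '$2 ~ /D/'   # list processes in D state and what they are waiting on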

Either way...Thanks.
 
I had the same issue last night. I was trying to install nfs-kernel-server, portmap, etc., and also updated the kernel from pvetest. After that I lost communication with the other node in the cluster and rebooted. Then a host of networking errors appeared that required me to clean out the 70-net files. I eventually rolled back to the stable repo kernel.
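
For rolling back, the gist is to drop the pvetest line from your APT sources and go back to the stable kernel. Roughly (the meta-package name below is only an example from memory; check which kernel/meta-package you were actually running before):

hyper1:~# nano /etc/apt/sources.list           # comment out the 'pvetest' repository line, keep the stable 'pve' one
hyper1:~# aptitude update
hyper1:~# aptitude install proxmox-ve-2.6.32   # example meta-package name; install the stable one you had before
hyper1:~# reboot                               # and pick the stable kernel if GRUB still defaults to the pvetest one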