Since 4:15 PM yesterday, the logs on all 3 nodes in my cluster have been filling up with:
MASTER:
Dec 20 06:24:06 hyper1 pvemirror[3373]: starting cluster syncronization
Dec 20 06:24:16 hyper1 pvemirror[3373]: syncing vzlist from '10.9.0.8' failed: 500 read timeout
Dec 20 06:24:26 hyper1 pvemirror[3373]: syncing vzlist from '10.9.0.9' failed: 500 read timeout
Dec 20 06:24:36 hyper1 pvemirror[3373]: syncing vzlist from '10.9.0.10' failed: 500 read timeout
Other Node:
Dec 20 06:24:14 hyper3 pvemirror[3366]: syncing master configuration from '10.9.0.8'
Dec 20 06:24:24 hyper3 pvemirror[3366]: syncing vzlist from '10.9.0.8' failed: 500 read timeout
Dec 20 06:24:34 hyper3 pvemirror[3366]: syncing vzlist from '10.9.0.9' failed: 500 read timeout
Dec 20 06:24:44 hyper3 pvemirror[3366]: syncing vzlist from '10.9.0.10' failed: 500 read timeout
We hadn't changed anything, except that we have an NFS mount to a server that has failed, and I can't umount it...
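For anyone wanting to check whether a dead NFS mount is what blocks things on a node, here is a minimal sketch of a probe. It assumes GNU coreutils `timeout` and `stat` are installed and that the kernel lets a process stuck on NFS be killed; the function names `probe_mount` and `list_hung_nfs` are made up for illustration, and mount points containing spaces are not handled.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if stat on the path answers within the limit.
# A hung NFS mount usually blocks stat, so timeout kills it and returns non-zero.
probe_mount() {
    # $1 = path to probe, $2 = time limit in seconds (defaults to 5)
    timeout -s KILL "${2:-5}" stat -t "$1" >/dev/null 2>&1
}

# Hypothetical helper: print every NFS mount point that does not respond.
list_hung_nfs() {
    # /proc/mounts columns: device, mount point, filesystem type, ...
    awk '$3 ~ /^nfs/ {print $2}' /proc/mounts | while read -r mnt; do
        probe_mount "$mnt" 5 || echo "unresponsive: $mnt"
    done
}
```

Running `list_hung_nfs` on each node should print nothing when every NFS mount answers, and one line per mount that does not.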
I tried stopping the cluster and tunnel services on all nodes and recreating the cluster, but something is clearly stopping the nodes from communicating with each other. Output from each:
MASTER:
hyper1:/etc/pve# pveca -l
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 10.9.0.8 M ERROR: 500 read timeout
2 : 10.9.0.9 N ERROR: 500 read timeout
3 : 10.9.0.10 N ERROR: 500 read timeout
hyper2:/var/log# pveca -l
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 10.9.0.8 M ERROR: 500 read timeout
2 : 10.9.0.9 N ERROR: 500 read timeout
hyper3:~/.ssh# pveca -l
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
1 : 10.9.0.8 M ERROR: 500 read timeout
2 : 10.9.0.9 N ERROR: 500 read timeout
3 : 10.9.0.10 N ERROR: 500 read timeout
I think it's interesting that hyper2 only sees itself and the master, while hyper3 lists all 3.
This is a production environment with customer VMs on it. Hoping someone else has seen something similar or has some useful suggestions!
Thanks in advance.