Hello,
We are having trouble with our PVE cluster. Servers randomly stop with io-error, sometimes at the worst possible moment, so that the superblock gets damaged.
The problem started to occur when we upgraded from version 4 to version 6. We did the upgrade by the book and had no errors.
Our cluster is built on 2 Supermicro servers with Xeon processors, connected via LACP to a core switch.
This core switch has 2 Synology NAS systems connected in an HA cluster, also connected with LACP.
We connect our shares with SMB and NFS. Both protocols have the problem.
We updated the servers to the latest version, set SMB protocol to ver3.0 NFS, restarted the servers, and moved them between hypervisors.
The only thing we know the failures have in common is that the problem is not related to time.
We only know that servers with high disk I/O are affected, and that the issue occurs only once every so often; if the server is reading or writing at that moment, the problem starts.
All our servers use qcow.
I cannot find anything that explains why it reports an I/O error: nothing in the logs like a disconnected network or a lost connection to SMB, or anything else.
We failed the Synology NAS over to the passive one. The only thing we did not replace is the switch, but it reports no errors on the lines.
I hope people here have experienced some of these errors, and that we can find a solution with your help.
These hypervisors run about 30 servers, many of them Linux servers in a production network. Restoring and recovering servers is getting hard.
I hope someone can help or point me in a good direction!
Aron