IO errors

Aron Dijkstra · Mar 31, 2020

Hello,

We are having trouble with our PVE cluster. Servers randomly stop with an io-error, sometimes at the worst possible moment, so that the superblock gets damaged.
The problem started to occur when we upgraded from version 4 to version 6. We did the upgrade by the book and had no errors.
Our cluster is built on 2 Supermicro servers with Xeon processors, connected via LACP to a core switch.
This core switch has 2 Synology NAS systems connected in an HA cluster, also via LACP.
We connect our shares with SMB and NFS. Both protocols show the problem.

We updated the servers to the latest version, set the SMB protocol to version 3.0, tried NFS as well, restarted the servers, and moved them between hypervisors.
The only common factor we know of is that the problem is not related to time.
We only know that servers with high disk IO are affected, and that the issue occurs only once every so often: if the server is reading or writing at that moment, the problem starts.
All our servers use qcow.

I cannot find any reason why it reports an io-error: nothing in the logs like a disconnected network or a lost connection to SMB, or anything else.
We failed the Synology NAS over to the passive one. The only thing we did not replace is the switch, but it reports no errors on the lines.

I hope people here have experienced some of these errors and that we can find a solution with your help.
These hypervisors run about 30 servers, many of them Linux servers in a production network. Restoring and recovering servers is getting hard.

I hope someone can help or point me in a good direction!

Aron

apoc · Mar 31, 2020

Newer versions of software usually have enhancements for performance.
It might be the case that you are overwhelming the switch, the NAS, or the NIC.

Especially if it is a "cheap" switch it might just drop packets on overload.

I would try to set IO limits on the high-load VMs/vdisks.
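
For reference, Proxmox VE takes per-disk limits as options on the VM's drive entry (`mbps_rd`, `mbps_wr`, `iops_rd`, `iops_wr`). Below is a minimal Python sketch that only assembles such a drive string and prints the resulting `qm set` command. The VM ID `100`, the bus `scsi0`, the storage name `nas-nfs`, and the volume name are all hypothetical, and the numbers are placeholders rather than recommendations.

```python
def drive_with_limits(volume, mbps_rd=None, mbps_wr=None, iops_rd=None, iops_wr=None):
    """Build a drive option string with optional read/write throttle limits."""
    limits = {
        "mbps_rd": mbps_rd,  # read bandwidth limit in MB/s
        "mbps_wr": mbps_wr,  # write bandwidth limit in MB/s
        "iops_rd": iops_rd,  # read operations per second
        "iops_wr": iops_wr,  # write operations per second
    }
    parts = [volume]
    for key, value in limits.items():
        if value is None:
            continue  # limit not requested, leave it out of the string
        if value <= 0:
            raise ValueError(f"{key} must be positive")
        parts.append(f"{key}={value}")
    return ",".join(parts)


# Hypothetical storage and volume names; replace with the VM's real drive entry.
opts = drive_with_limits(
    "nas-nfs:100/vm-100-disk-0.qcow2",
    mbps_rd=100, mbps_wr=50, iops_rd=500, iops_wr=300,
)
print(f"qm set 100 --scsi0 {opts}")
```

The sketch does not run `qm` itself. Note that the drive entry is replaced as a whole, so any other options already set on that disk would need to be included in the string as well.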

Aron Dijkstra · Apr 1, 2020

Hi, tburger,

Thank you for your answer. The switch is an HP enterprise-class model. It cost around 2500 euros and is overqualified for this job.
I do not think this is the issue, but the idea of limiting the IO is interesting. Then we have some time to investigate further.

Is there a way to see the current IO usage (statistics) so I know the correct setting?

Thanks

Aron
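
One possible way to get IO numbers, sketched here under assumptions: inside a Linux guest, `/proc/diskstats` exposes cumulative per-device counters, and the difference between two snapshots gives IOPS and throughput. Because the VM disks in this setup live on NFS/SMB shares, the host's own `/proc/diskstats` would likely not show that traffic, so the sketch is meant to be run inside a guest. The function names and the sample snapshot values are made up for illustration.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats(text):
    """Return {device: (reads, sectors_read, writes, sectors_written)}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpected lines
        # fields: major, minor, name, reads, reads_merged, sectors_read,
        #         ms_reading, writes, writes_merged, sectors_written, ...
        stats[fields[2]] = (int(fields[3]), int(fields[5]),
                            int(fields[7]), int(fields[9]))
    return stats


def io_rates(before, after, seconds):
    """Per-device IOPS and MB/s between two snapshots taken `seconds` apart."""
    rates = {}
    for dev, new in after.items():
        old = before.get(dev)
        if old is None:
            continue  # device appeared between snapshots
        r_ios, r_sec, w_ios, w_sec = (n - o for n, o in zip(new, old))
        rates[dev] = {
            "read_iops": r_ios / seconds,
            "write_iops": w_ios / seconds,
            "read_mbps": r_sec * SECTOR_BYTES / seconds / 1e6,
            "write_mbps": w_sec * SECTOR_BYTES / seconds / 1e6,
        }
    return rates


def sample(interval=5.0, path="/proc/diskstats"):
    """Take two live snapshots `interval` seconds apart (Linux only)."""
    with open(path) as fh:
        before = parse_diskstats(fh.read())
    time.sleep(interval)
    with open(path) as fh:
        after = parse_diskstats(fh.read())
    return io_rates(before, after, interval)


# Made-up snapshots, 10 seconds apart, to show the calculation offline.
BEFORE = "   8       0 sda 1000 0 80000 500 2000 0 160000 900 0 700 1400\n"
AFTER = "   8       0 sda 1500 0 120000 700 2600 0 260000 1200 0 900 1900\n"
rates = io_rates(parse_diskstats(BEFORE), parse_diskstats(AFTER), 10.0)
for dev, r in rates.items():
    print(dev, r)
```

Calling `sample()` inside a busy guest during normal load and again during peak load would give a rough baseline from which a limit could be chosen. The per-VM disk IO graphs in the Proxmox web interface are another place to look.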
