IO errors

Aron Dijkstra

Well-Known Member
Aug 6, 2016
Hello,

We are having trouble with our PVE cluster. Servers randomly stop with an io-error, sometimes at the worst possible moment, so that the superblock gets damaged.
The problem started to occur when we upgraded from version 4 to version 6. We did the upgrade by the book and had no errors.
Our cluster is built on 2 Supermicro servers with Xeon processors, connected via LACP to a core switch.
This core switch has 2 Synology NAS systems connected to it in an HA cluster, also via LACP.
We connect our shares with SMB and NFS. Both protocols show the problem.

We updated the servers to the latest version, set the SMB protocol to version 3.0, tried NFS, restarted the servers, and moved them between hypervisors.
The only thing the failures have in common is that the problem is not related to time.
We only know that servers with high disk IO are affected, and that the issue occurs only once in a while: if the server is reading or writing at that moment, the problem starts.
All our servers use qcow.

I cannot find anything that explains the io-error: nothing in the logs about a disconnected network, a lost connection to SMB, or anything similar.
We failed the Synology NAS over to the passive one. The only thing we did not replace is the switch, but it reports no errors on the lines.

I hope people here have experienced some of these errors and that we can find a solution with your help.
These hypervisors run about 30 servers, many of them Linux servers in a production network. Restoring and recovering servers is getting hard.

I hope someone can help or point me in the right direction!

Aron
 
Newer versions of software usually have enhancements for performance.
It might be the case that you are overwhelming the switch, the NAS, or the NIC.

Especially if it is a "cheap" switch, it might just drop packets on overload.

I would try to set IO limits on the high-load VMs/vdisks.
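
As a rough illustration of the IO-limit suggestion, here is a minimal Python sketch that builds (but does not run) a `qm set` command with per-disk throttling. It assumes the Proxmox VE drive options `mbps_rd`, `mbps_wr`, `iops_rd`, and `iops_wr`; the VM ID, the storage name `nas-nfs`, the disk path, and the helper names are hypothetical and would need to match the actual VM configuration.

```python
import shlex

# Drive options assumed from the Proxmox VE disk settings (MB/s and operations/s).
LIMIT_KEYS = ("mbps_rd", "mbps_wr", "iops_rd", "iops_wr")


def with_io_limits(drive_spec, **limits):
    """Return drive_spec with the given limit options set, replacing old values."""
    unknown = set(limits) - set(LIMIT_KEYS)
    if unknown:
        raise ValueError(f"unsupported limit option(s): {sorted(unknown)}")
    # Keep the volume and every option that is not being overridden.
    parts = [p for p in drive_spec.split(",") if p.split("=", 1)[0] not in limits]
    parts += [f"{key}={limits[key]}" for key in LIMIT_KEYS if key in limits]
    return ",".join(parts)


def qm_set_command(vmid, bus, drive_spec, **limits):
    """Build (but do not execute) a `qm set` argument list for one virtual disk."""
    return ["qm", "set", str(vmid), f"--{bus}", with_io_limits(drive_spec, **limits)]


if __name__ == "__main__":
    # Hypothetical VM 101 with a qcow2 disk on a storage called "nas-nfs".
    cmd = qm_set_command(
        101, "scsi0", "nas-nfs:101/vm-101-disk-0.qcow2,size=32G",
        mbps_rd=50, mbps_wr=50,
    )
    print(shlex.join(cmd))
```

The existing drive line from the VM configuration should be passed in unchanged as `drive_spec`, so that only the limit options are added or replaced. The same limits can also be set per disk in the web interface.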
 
Hi, tburger,

Thank you for your answer. The switch is an HP enterprise-class model. It cost around 2500 euros and is overqualified for this job.
I do not think it is the issue, but the idea of limiting the IO is interesting. That would give us some time to investigate further.

Is there a way to see the current IO usage (statistics) so I know the correct setting?

Thanks,

Aron
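
One possible way to get a baseline for the question above is to sample `/proc/diskstats` twice and compute the rates in between. The Python sketch below is only an illustration under stated assumptions: it relies on the standard Linux `/proc/diskstats` layout (512-byte sectors), and the function names and the 5-second interval are made up for this example.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats(text):
    """Map device name -> (reads, sectors_read, writes, sectors_written)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpected lines
        stats[fields[2]] = (
            int(fields[3]), int(fields[5]), int(fields[7]), int(fields[9])
        )
    return stats


def io_rates(before, after, seconds):
    """Compute per-device IOPS and MB/s between two parsed snapshots."""
    rates = {}
    for dev, new in after.items():
        old = before.get(dev)
        if old is None:
            continue  # device appeared between the two samples
        d_reads, d_rsect, d_writes, d_wsect = (n - o for n, o in zip(new, old))
        rates[dev] = {
            "read_iops": d_reads / seconds,
            "read_mbps": d_rsect * SECTOR_BYTES / 1e6 / seconds,
            "write_iops": d_writes / seconds,
            "write_mbps": d_wsect * SECTOR_BYTES / 1e6 / seconds,
        }
    return rates


def measure(interval=5.0, path="/proc/diskstats"):
    """Sample the stats file twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_diskstats(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_diskstats(f.read())
    return io_rates(before, after, interval)
```

Calling `measure()` inside one of the busy Linux guests during normal load gives numbers to compare against candidate limits. On the hypervisor itself this would only cover local block devices, not disks stored on the NFS/SMB shares, so sampling inside the guests is probably more informative here.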
 
