2 clusters unstable io-errors

Aron Dijkstra

Hi,

I have 2 clusters running Proxmox, both on Supermicro server hardware connected with LACP links to a core switch. Storage for both is on Synology hardware, but with different configurations: one is a single rack-mountable RS 8-series unit and the other is a clustered RS 3617 system.

I sometimes get IO errors on the servers, independent of the operating system. The two clusters are in separate datacenters (200 km apart).
The HDDs on the storage systems all show 100%. I replaced the caching disk, switches and cables. The servers are running version 6.1-5 (cluster one has 2 servers, cluster two has 3). One cluster uses jumbo frames, the other the normal MTU of 1500.

It seems, though it is only a theory, that disk usage triggers the IO error: not high load, but how frequently the disk is used (the server has to be reading or writing at the moment the breakage occurs).

The only thing I have not done is a clean install of the servers and a new cluster.

Both clusters were upgraded from version 4.x to 6.x (as described in the tutorials). The servers had no issues on Proxmox 4.
The servers mount the storage over NFS and SMB. No luck with either.

What can it be? The two setups are completely different: different sites, different hardware (the only things they have in common are Proxmox and the brands of equipment).
Servers break about once a week, sometimes with a corrupted disk as well.

We have a subscription and use the enterprise repository.

The only thing I can think of is a reinstall, but then we have other issues. And would that even solve the problem?
 
Let me rephrase this to confirm that I understand the situation correctly.

You have two clusters, each in a different datacenter. Both use storage on Synology boxes, and when there is a lot of IO (in terms of IOPS, not bandwidth) you encounter IO errors and VM disk corruption?

You see the same effect whether the share is accessed over NFS or Samba/CIFS.

Is there anything in the logs (/var/log/syslog, /var/log/kern.log) that gives us an idea of what the problem is?

Have you checked whether the latest firmware updates on the Synology boxes solve the problem?
 
Hi Aaron! :)

Yes, 2 clusters in 2 datacenters, and the storage is indeed on Synology. We get IO errors on the VMs, and sometimes the system hangs and the icon on the left in Proxmox shows io-error with a yellow sign.

We have tried 3 different versions of the Synology firmware.

Syslog is clean: no errors from an hour before until the time of the crash.
Only these repeated entries:

May 18 12:33:23 spt-phs-002 snmpd[1272]: error on subcontainer 'ia_addr' insert (-1)
May 18 12:33:53 spt-phs-002 snmpd[1272]: error on subcontainer 'ia_addr' insert (-1)
May 18 12:34:00 spt-phs-002 systemd[1]: Starting Proxmox VE replication runner...

In kern.log there is also nothing around the time of the io-error, only entries from when I started the machine again:
May 18 12:26:49 spt-phs-002 kernel: [10533404.490002] device tap101i0 entered promiscuous mode
May 18 16:04:58 spt-phs-002 kernel: [10546492.861776] CIFS VFS: Cancelling wait for mid 4862237929 cmd: 5
May 18 16:04:58 spt-phs-002 kernel: [10546492.862060] CIFS VFS: Cancelling wait for mid 4862237930 cmd: 16
May 18 16:04:58 spt-phs-002 kernel: [10546492.862312] CIFS VFS: Cancelling wait for mid 4862237931 cmd: 6
May 18 16:04:59 spt-phs-002 kernel: [10546493.958291] CIFS VFS: Close unmatched open
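
For anyone who wants to check their own kern.log for the same pattern, here is a minimal Python sketch that picks out storage-related kernel messages and counts them per minute. The keyword list and the function name storage_events_per_minute are assumptions for illustration, not anything Proxmox ships; the sample lines are the ones posted above.

```python
import re
from collections import Counter

# Assumed keywords that tend to appear when a network share stalls.
# Extend or trim this list for your own setup.
KEYWORDS = ("CIFS VFS", "nfs: server", "I/O error")

# Matches "May 18 16:04:58 host kernel: [seconds.micro] message"
# and captures the timestamp truncated to the minute.
LINE_RE = re.compile(
    r"^(?P<minute>\w{3}\s+\d+\s\d{2}:\d{2}):\d{2}\s+\S+\s+kernel:\s+"
    r"\[\s*\d+\.\d+\]\s+(?P<msg>.*)$"
)

def storage_events_per_minute(lines):
    """Count kernel log lines per minute whose message contains a keyword."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and any(k in m.group("msg") for k in KEYWORDS):
            counts[m.group("minute")] += 1
    return counts

sample = [
    "May 18 12:26:49 spt-phs-002 kernel: [10533404.490002] device tap101i0 entered promiscuous mode",
    "May 18 16:04:58 spt-phs-002 kernel: [10546492.861776] CIFS VFS: Cancelling wait for mid 4862237929 cmd: 5",
    "May 18 16:04:58 spt-phs-002 kernel: [10546492.862060] CIFS VFS: Cancelling wait for mid 4862237930 cmd: 16",
    "May 18 16:04:58 spt-phs-002 kernel: [10546492.862312] CIFS VFS: Cancelling wait for mid 4862237931 cmd: 6",
    "May 18 16:04:59 spt-phs-002 kernel: [10546493.958291] CIFS VFS: Close unmatched open",
]

for minute, n in storage_events_per_minute(sample).items():
    print(minute, n)
```

To run it against a real log, pass an open file instead of the sample list, e.g. storage_events_per_minute(open("/var/log/kern.log")). Bursts that line up with the io-error times would point at the share rather than the guest.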

thanks for the support so far!
 
Hmm, these kinds of problems can be quite tricky to tackle... :/

Which PVE versions do you have installed? Please post the output of pveversion -v
 
Yes, they are :( I switched the core switch, updated firmware, replaced caching disks, moved from Linux bridge to OVS and vice versa, rebooted servers, tried with and without jumbo frames, and tried different client OSes. The only things I have not done are change the LACP links to the network and reformat the hypervisors (which were upgraded from version 4).

proxmox-ve: 6.1-2 (running kernel: 5.3.13-1-pve)
pve-manager: 6.1-5 (running version: 6.1-5/9bf06119)
pve-kernel-5.3: 6.1-1
pve-kernel-helper: 6.1-1
pve-kernel-4.15: 5.4-12
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.4.134-1-pve: 4.4.134-112
pve-kernel-4.4.62-1-pve: 4.4.62-88
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.4.6-1-pve: 4.4.6-48
ceph-fuse: 12.2.12-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-9
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.1-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12+deb10u1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-2
pve-cluster: 6.1-2
pve-container: 3.0-16
pve-docs: 6.1-3
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-9
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 3.13.2-1
qemu-server: 6.1-4
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2
 
It seems that the issue is related to the uptime...

The longer the hypervisor has been up, the more often the problem occurs.

For example, if the server has been up for a day or, say, 2 weeks, there are no issues, or maybe one at most. If the server has been up for, say, 4 or 5 months, the issues can occur daily!

Rebooting the server stops the daily crashes.
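
The number in square brackets in the kern.log lines posted earlier is seconds since boot, so the log itself gives a rough way to check this uptime theory. A small Python sketch, assuming the usual kernel timestamp format (the function name uptime_days is made up for illustration):

```python
import re

def uptime_days(kernel_line):
    """Return uptime in days from the '[seconds.micro]' stamp of a kernel log line."""
    m = re.search(r"\[\s*(\d+\.\d+)\]", kernel_line)
    if m is None:
        raise ValueError("no kernel timestamp found")
    return float(m.group(1)) / 86400  # 86400 seconds per day

line = ("May 18 12:26:49 spt-phs-002 kernel: [10533404.490002] "
        "device tap101i0 entered promiscuous mode")
print(f"{uptime_days(line):.1f} days")  # prints "121.9 days"
```

That is roughly four months of uptime for the host in the posted log, which would fit the pattern described here. It is only an approximation, since the kernel clock can drift a little from wall-clock time.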
 
