keep getting I/O errors on my drives (SSD)

tando

New Member
Apr 25, 2021
I run K3S and Rancher on my Proxmox virtual environment. Every now and then one of the nodes shows error messages concerning the hard disk.
I see the error messages on the worker nodes and on the storage nodes I created for Longhorn.
I don't know whether the problem is related to Rancher and Kubernetes or whether it is a Proxmox problem.
I'm not pointing fingers, I'm just looking for a solution.
Maybe one of the forum members knows what it is.

Normally if I reboot the VM the problem disappears, but it keeps coming back, and not on the same VM: as mentioned, it hits one of the worker nodes or one of the storage nodes.

What Proxmox shows
 

Attachments

  • buffer IO error.PNG (19.7 KB, screenshot of the buffer I/O errors)
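A quick way to capture more detail than the screenshot, assuming the VMs run a systemd-based Linux guest (the grep pattern is only a rough filter):

  # inside the affected VM, right after the errors appear
  dmesg -T | grep -iE 'i/o error|blk_update_request|buffer'
  # or the kernel log for the current boot
  journalctl -k -b --no-pager | tail -n 200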
Did you try to change the cables?
Did you start a long or short self-test? (smartctl -t short /dev/sdc or smartctl -t long /dev/sdc)
What does smartctl -a /dev/sdc return?
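For reference, a possible sequence, assuming /dev/sdc is the disk that logs the errors (the self-test runs in the background, so the log is read afterwards):

  smartctl -t short /dev/sdc      # start a short self-test (usually a couple of minutes)
  smartctl -l selftest /dev/sdc   # read the self-test log once it has finished
  smartctl -a /dev/sdc            # full report: health status, attributes, error log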
 
We see a fair bit of that behavior. One cluster (recently upgraded from 5.4/luminous to 6.4/nautilus) was nearly unusable. The gear was under warranty (Supermicro servers with SM-branded LSI 3108 controllers in JBOD mode) and it turned out to be some as-yet unidentified issue between the Linux 5.4 kernel drivers for megaraid/scsi and the firmware in the chassis backplanes. The vendor replaced the backplanes (with the newest available firmware) and the issue hasn't appeared again.

What hardware is hosting the disks that drop out? Is this Ceph storage? Does ceph/ceph-osd.NNN.log show anything interesting? Anything in kern.log in the same timeframe?
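A rough sketch of where one could look on the Proxmox host; the time window and the OSD id NNN are placeholders, and the Ceph log only applies if Ceph is actually in use:

  journalctl -k --since "1 hour ago"                 # kernel messages around the time of the errors
  grep -iE 'error|reset|timeout' /var/log/kern.log   # same, if rsyslog writes kern.log
  less /var/log/ceph/ceph-osd.NNN.log                # only with Ceph; NNN = OSD id of the dropped disk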
 
It is on ZFS storage. It's a mirror set of two 1 TB Samsung SSDs. The machine is an HP Z620 workstation.
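A minimal sketch of what could be checked on the host for that setup; /dev/sdX stands for each of the two Samsung SSDs:

  zpool status -v                 # per-disk READ/WRITE/CKSUM counters for the mirror
  zpool events -v | tail -n 50    # recent ZFS events, including I/O errors
  smartctl -a /dev/sdX            # SMART data for each SSD in the mirror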