keep getting I/O errors on my drives (SSD)

tando

New Member
Apr 25, 2021
I run K3S and Rancher on my Proxmox virtual environment. Every now and then one of the nodes shows error messages concerning the hard disk.
I see the error messages on the worker nodes and on the storage nodes I created for Longhorn.
I don't know whether the problem is related to Rancher and Kubernetes or whether it is a Proxmox problem.
I'm not pointing fingers, I'm just looking for a solution.
Maybe one of the forum members knows what it is.

Normally if I reboot the VM the problem disappears, but it keeps coming back, and not on the same VM: as mentioned, it hits one of the worker nodes or one of the storage nodes.

What Proxmox shows
 

Attachments

  • buffer IO error.PNG (19.7 KB, screenshot of the buffer I/O errors)
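A quick way to capture more detail than the screenshot, assuming the VMs run a systemd-based Linux guest (the grep pattern is only a rough filter):

  # inside the affected VM, right after the errors appear
  dmesg -T | grep -iE 'i/o error|blk_update_request|buffer'
  # or the kernel log for the current boot
  journalctl -k -b --no-pager | tail -n 200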
Did you try to change the cables?
Did you start a long or short self-test? (smartctl -t short /dev/sdc or smartctl -t long /dev/sdc)
What does smartctl -a /dev/sdc return?
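For reference, a possible sequence, assuming /dev/sdc is the disk that logs the errors (the self-test runs in the background, so the log is read afterwards):

  smartctl -t short /dev/sdc      # start a short self-test (usually a couple of minutes)
  smartctl -l selftest /dev/sdc   # read the self-test log once it has finished
  smartctl -a /dev/sdc            # full report: health status, attributes, error log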
 
We see a fair bit of that behavior. One cluster (recently upgraded from 5.4/luminous to 6.4/nautilus) was nearly unusable. The gear was under warranty (Supermicro servers with SM-branded LSI 3108 controllers in JBOD mode) and it turned out to be some as-yet unidentified issue between the Linux 5.4 kernel drivers for megaraid/scsi and the firmware in the chassis backplanes. The vendor replaced the backplanes (with the newest available firmware) and the issue hasn't appeared again.

What hardware is hosting the disks that drop out? Is this Ceph storage? Does ceph/ceph-osd.NNN.log show anything interesting? Anything in kern.log in the same timeframe?
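A rough sketch of where one could look on the Proxmox host; the time window and the OSD id NNN are placeholders, and the Ceph log only applies if Ceph is actually in use:

  journalctl -k --since "1 hour ago"                 # kernel messages around the time of the errors
  grep -iE 'error|reset|timeout' /var/log/kern.log   # same, if rsyslog writes kern.log
  less /var/log/ceph/ceph-osd.NNN.log                # only with Ceph; NNN = OSD id of the dropped disk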
 
It is on ZFS storage. It's a mirror set of two 1 TB Samsung SSDs. The machine is an HP Z620 workstation.
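A minimal sketch of what could be checked on the host for that setup; /dev/sdX stands for each of the two Samsung SSDs:

  zpool status -v                 # per-disk READ/WRITE/CKSUM counters for the mirror
  zpool events -v | tail -n 50    # recent ZFS events, including I/O errors
  smartctl -a /dev/sdX            # SMART data for each SSD in the mirror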