Hope you’re all well.
I have a question that’s been wrecking my head for months now.
I have a 3-node cluster (Dell PowerEdge R7525 servers) with the following drive configuration:
Node 1: 10x 2TB NVMe KINGSTON SKC2500M82000G in raidz2
Node 2: 10x 2TB NVMe KINGSTON SKC2500M82000G in raidz2
Node 3: 8x 4TB Crucial MX500 in raidz2
A little while ago I got a ZFS scrub_finish email for the pool on Node 2. It reported a problem with two drives.
The error mentioned: One or more devices are faulted in response to persistent errors.
The pool became degraded but was still accessible. I moved the VMs off it immediately. SMART didn’t show any issues for the two problematic drives.
I then left it alone, as I was planning to remove the two drives and rebuild the pool when I upgraded the node to PVE 9. I didn’t have any spare drives, so the plan was to create a new, smaller pool.
Today I started on that, and as soon as I upgraded to PVE 9, the degraded state went away and I got an email saying that a ZFS resilver had finished with 0 errors. The pool changed to ONLINE and the drives looked fine. I then ran zpool upgrade, as suggested.
But then I got an email saying that the two drives (the ones that were marked as faulted prior to upgrading to PVE 9) had the following error:
Media and Data Integrity Errors changed from 0 to 12.
The zpool was still online, and I did a scrub that reported 0 errors. I then ran short SMART tests, and they did not report any errors.
I then upgraded Node 1 to PVE 9. After the upgrade I received an email saying one of the drives reported an error: Media and Data Integrity Errors changed from 0 to 16.
The pool is still online, and the drive in question has passed a short SMART test.
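In case it helps anyone compare numbers, here is a rough Python sketch of how the counter could be read per drive and checked for changes. It assumes smartmontools 7.0 or newer (for smartctl's --json output) and that the NVMe health log appears under the nvme_smart_health_information_log key with a media_errors field. The function names and the device paths are made up for illustration, and I have not run this against these exact drives.

```python
import json
import subprocess


def media_errors_from_json(report):
    """Extract the NVMe media error counter from one smartctl JSON report."""
    data = json.loads(report)
    # Assumption: the health log is exposed under this key by smartctl --json.
    health = data.get("nvme_smart_health_information_log", {})
    return int(health.get("media_errors", 0))


def read_media_errors(device):
    """Query one device with smartctl (needs root); device path is hypothetical."""
    # smartctl's exit status is a bitmask, so a non-zero code is not treated as fatal.
    result = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return media_errors_from_json(result.stdout)


def changed_counters(before, after):
    """Return {device: (old, new)} for every device whose counter moved."""
    return {
        dev: (before.get(dev, 0), new)
        for dev, new in after.items()
        if new != before.get(dev, 0)
    }


if __name__ == "__main__":
    # Offline example using a hand-written report instead of a real drive.
    sample = json.dumps({"nvme_smart_health_information_log": {"media_errors": 12}})
    print(media_errors_from_json(sample))
```

Taking a snapshot of all ten drives before and after a scrub or a reboot and passing both to changed_counters would show whether the counter keeps climbing or only jumped once around the upgrade.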
Would anyone have any idea what could be happening? If the drives are bad, that’s not a problem: I can just remove them and create a new, smaller pool. But I’m a bit hesitant to do that if these are just false positives. Has anyone encountered this? Could this just be some strange behaviour related to the Dell servers? I know the drives should be enterprise class, but we could not afford them at the time, and especially not now.
Thank you so much!