Issues with removing SAS disks from ZFS configuration

liamlows

Hey y'all,

I wrote about this in a previous post, but it was far too much information and hard to read through, so I am removing the old post and consolidating everything here now that I have explored a bit more. So let's get started.

For my current setup I am using an R720XD with a PERC H710P Mini flashed to IT mode for pass-through functionality. I have two 1 TB SSDs used for Proxmox, LXCs, and VMs; those drives are totally fine. I also have ten 1.2 TB 10K SAS drives (with two more on the way), all of the same model (Seagate ST1200MM0108, a Secure SED FIPS 140-2 drive), used for media storage, backups, NAS, etc.

When I initially received the server it came with 8 drives. Of those 8, one didn't show up in the Proxmox GUI from the get-go, one was in a single-disk ZFS pool, three were in a RAIDz1 config, and the last three were in another RAIDz1 config. The disk that was in the single-disk ZFS pool is fine and works well (sda). One of the disks from the three-disk arrays is also fine (sdd).

The SAS Drives

So these drives are the ones I am having a massive pain with. As of right now, here is the status of the drives (based on what I can access with fdisk):
  • 4 of these drives (sda, sdd, sdh, sdi) work fine: I can access them with fdisk and I can select them when creating a ZFS pool from the Proxmox GUI
    • sdh and sdi were recently purchased and work fine
    • sda was the disk in the single-disk ZFS pool
    • sdd was one of the disks in the two 3-disk RAIDz1 configurations
  • The other 6 drives (sdl, sdf, sdk, sde, sdj, sdg) have issues; these are the disks I will be referring to for the remainder of this post.
    • sdl is the disk that didn't show up from the get-go; it has never been in a ZFS configuration
    • sdf, sdk, sde, sdj, and sdg were in the two 3-disk RAIDz1 configurations
fdisk
For these 6 problem disks, directly after destroying the ZFS pools, fdisk -l displayed the following output; it was the same for each disk except for the IDs.
[Screenshot attachment: fdisk -l output for one of the affected disks]
In addition, whenever I tried to access one of these disks with fdisk /dev/sdX, I would get an error such as fdisk: cannot open /dev/sdj: Input/output error, so I was unable to do anything with fdisk on these disks.
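For reference, these are roughly the commands I was running (/dev/sdj is just one example of the problem disks):
Code:
# list partition tables and device info for all disks
fdisk -l
# try to open one of the problem disks interactively
fdisk /dev/sdj
# -> fdisk: cannot open /dev/sdj: Input/output error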

fsck
I also ran fsck /dev/sdX on the disks to see what information I could get; the output is shown below, and again it was pretty much the same for all 6 disks:
[Screenshot attachment: fsck output for one of the affected disks]
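This is roughly what I ran. As far as I understand, fsck assumes an ext-style filesystem, and these disks were ZFS pool members, so the "bad superblock" complaint may not mean much on its own:
Code:
# fsck defaults to e2fsck here; on a disk that never held an ext
# filesystem it reports a bad magic number / bad superblock
fsck /dev/sde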

boot errors

Another error I saw with these disks appeared during boot, and for a little while after Proxmox had fully booted. The following block repeats several times for the different disks, pretty much the same for each one:
Code:
sd 0:0:11:0: [sdl] tag#9050 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
sd 0:0:11:0: [sdl] tag#9050 Sense Key : Aborted Command [current]
sd 0:0:11:0: [sdl] tag#9050 Add. Sense: Logical block guard check failed
sd 0:0:11:0: [sdl] tag#9050 CDB: Read(32)
sd 0:0:11:0: [sdl] tag#9050 CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
sd 0:0:11:0: [sdl] tag#9050 CDB[10]: 8b ba 0b a8 8b ba 0b a8 00 00 00 00 00 00 00 08
blk_update_request: protection error, dev sdl, sector 2344225704 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Buffer I/O error on dev sdl, logical block 293028213, async page read
sd 0:0:11:0: [sdl] tag#9249 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
sd 0:0:11:0: [sdl] tag#9249 Sense Key : Aborted Command [current]
sd 0:0:11:0: [sdl] tag#9249 Add. Sense: Logical block guard check failed
sd 0:0:11:0: [sdl] tag#9249 CDB: Read(32)
sd 0:0:11:0: [sdl] tag#9249 CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
sd 0:0:11:0: [sdl] tag#9249 CDB[10]: 8b ba 0b a8 8b ba 0b a8 00 00 00 00 00 00 00 08
blk_update_request: protection error, dev sdl, sector 2344225704 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
Unfortunately, I have tried to understand what this means but am still figuring it out. All I have really noticed is the protection error, which could be related to the fact that these are self-encrypting drives. I'm not sure, though.
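In case it helps anyone else reading: my current guess (and it is only a guess) is that the "Logical block guard check failed" / "protection error" lines are about T10 Protection Information (DIF) being enabled on these drives rather than the encryption itself. Something like the following, using sg3_utils, should show whether protection is enabled and allow reformatting a drive without it. Note that sg_format low-level formats the drive, destroys all data, and can take hours on a 1.2 TB disk; I have not confirmed this is the right fix for my case:
Code:
apt install sg3-utils
# check whether T10 protection information is enabled (look for "prot_en=1")
sg_readcap --long /dev/sdl
# low-level format with protection information disabled
# (DESTROYS ALL DATA, can run for several hours)
sg_format --format --fmtpinfo=0 /dev/sdl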

In addition, I tried using dd to fill the disks with zeros, but that didn't seem to do anything and the partitions still remained afterwards. I also ran smartctl -a /dev/sdX, which produced output with no errors and marked the drives as OK.
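Roughly, from memory (/dev/sde stands in for each of the problem disks):
Code:
# fill the disk with zeros; the old partitions were still listed afterwards
dd if=/dev/zero of=/dev/sde bs=1M status=progress
# SMART health summary and error log - came back clean, drives marked OK
smartctl -a /dev/sde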

Now the strange part!
After doing all of the above and a lot of googling, I tried writing a fresh filesystem to each disk to see if that changed anything, and sure enough it made 4 of the 6 disks show up in the Proxmox GUI. I ran mkfs.ext2 /dev/sdX on each disk and got the following output each time:
[Screenshot attachment: mkfs.ext2 output for one of the affected disks]
Clearly, there is an issue right at the end when it writes the superblocks and the final filesystem accounting information. That said, of the 6 disks, sdf, sdj, sdg, and sdl now appear in the Proxmox GUI, while sde and sdk still do not (those 2 disks did show up when I created the first RAIDz1 arrays). Furthermore, I still can't access any of these 6 disks with fdisk, but when I run fdisk -l, the disks no longer show the partitions or the red "corrupt GPT" error.
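For anyone else who ends up here: as far as I can tell, the Proxmox GUI only offers disks that look unused, so clearing any leftover ZFS labels and the GPT is probably worth trying on the two stubborn disks. A rough (and destructive) sketch - I have not verified that it actually fixes sde and sdk:
Code:
# clear any leftover ZFS vdev labels from a former pool member
zpool labelclear -f /dev/sde
# wipe remaining filesystem/RAID signatures and the partition table
wipefs -a /dev/sde
sgdisk --zap-all /dev/sde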

So to conclude: of the 6 disks that are having issues, 4 now seem usable (they show up in the Proxmox GUI when creating a ZFS pool) and 2 do not.

That's pretty much it. I'm sorry it's long, but I figure every piece of information could help. I would really love to understand what is going on here, so any and all comments are welcome! Are the 4 "usable" disks OK to use, or are they likely to fail soon? And for the 2 that still do not show up in Proxmox, is there anything I can do to make them appear?

Thank y'all so much in advance!
 
In general - and especially with hardware that's a bit older - always try to upgrade to the latest available firmware (Dell provides updates for quite a long time and they are easily accessible).

I haven't seen this exact error before, but a quick search yields:
https://serverfault.com/questions/971722/dmesg-full-of-i-o-errors-smart-ok-four-disks-affected
which suggests checking cabling and controllers (I would do that after upgrading to the latest firmware).
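As a rough starting point, something like this shows the firmware revision each drive reports (the HBA firmware version usually shows up in dmesg during boot):
Code:
# drive model and firmware revision
smartctl -i /dev/sdl
# controller firmware/driver messages from boot
dmesg | grep -i firmware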

I hope this helps!
 
Thank you for the reply @Stoiko, I will definitely check that! Just to let you know, the drives seemingly repaired themselves and there were no issues creating the zpool I wanted. However, after a few months of running, two of the drives have completely faulted, so I am back with this issue, haha.

When I received the R720XD I updated all the firmware I could, to my knowledge. I will go back in and double-check that is still the case, then take a look at the SAS cabling and the HBA to see if there could be a problem there. You don't think that flashing the H710P could have introduced some of these errors, do you?
 
Just to get this right - you set up the pool completely fresh again with the same drives, it ran for a few months, and then the issues reappeared?

Is it the exact same issue with the same error messages?

Otherwise it sounds odd - but maybe there is some kind of overflow in the controller/drive firmware which causes the drives to fail after a long period of time - there again, updating all the firmware might help.

Did you upgrade recently, or are you still running the same kernel as you were in October? If not, definitely try upgrading.
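A quick way to check, something like:
Code:
# running kernel
uname -r
# installed Proxmox VE package versions, including the kernel packages
pveversion -v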

In any case - if your data is important - I'd consider replacing drives/cables, since errors that have come back once usually come back again.

I hope this helps!
 
Hey, sorry for the late response - work was busy this week - but thank you so much for taking the time to respond! Going down the list, here are the answers to your questions:
Just to get this right - you set up the pool completely fresh again with the same drives, it ran for a few months, and then the issues reappeared?
Yes - when I started playing around with the first set of drives I had (about 5), I ran into the errors documented above. Then one day I was suddenly able to wipe the drives (though I was never able to initialize GPT on them) and add them to the pool. More specifically, the drives finally appeared for selection in the web UI when creating the ZFS pool, and as far as I know I didn't do anything specific to make that happen. I'm pretty sure the errors were still present in the PVE shell at the time I created the pool, but the drives were working well enough to be seen.
Is it the exact same issue with the same error messages?
This is interesting: when I run fdisk -l, all the drives now show up and I see no errors about a corrupt backup GPT table. When I run fsck /dev/sdX on the drives that are faulted, they show the same error as in the original post (bad magic number / bad superblock), but when I run it on the drives that are degraded, it says those drives are currently in use.
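(For reference, I'm going by the per-device state in the pool status here, roughly:)
Code:
# pool health per device, including FAULTED/DEGRADED members and error counters
zpool status -v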
Otherwise it sounds odd - but maybe there is some kind of overflow in the controller/drive firmware which causes the drives to fail after a long period of time - there again, updating all the firmware might help.
This could be it. Like I said, I flashed the controller to IT mode so it could pass the drives through for the ZFS pools. Unfortunately, because of the flash, neither iDRAC nor the Lifecycle Controller recognizes the controller anymore, so I'm not sure how I would update its firmware.
Did you upgrade recently, or are you still running the same kernel as you were in October? If not, definitely try upgrading.
I did upgrade fairly recently, and the node with all these issues is currently running manager version 7.2-7.
In any case - if your data is important - I'd consider replacing drives/cables, since errors that have come back once usually come back again.
I'll definitely look into this. It is weird that the error appeared, then vanished (at least to my eyes), and then appeared again - it doesn't make much sense. Luckily the data right now is just a few movies/TV shows, so I could back it all up to one of the spare SATA SSDs in the system that isn't being used and start from scratch.

One thing I did notice that is really weird is that the RAM usage on the system is incredibly high: the KVM process is currently using 32.8 GB of RAM. I know ZFS uses a lot of RAM, but I didn't realize it would (or should) use that much. Let me know if you want to see the output of top, arc_summary, or cat /proc/meminfo.
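(As far as I understand, the ZFS ARC can use up to about half of the RAM by default and can be capped; something like the following is what I'd try, with the 8 GiB value just an example. Though if it's literally the kvm process holding that RAM, I guess that would be the VM's own allocated memory rather than the ARC.)
Code:
# cap the ZFS ARC at 8 GiB (value in bytes; example only)
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
# apply the option early at boot, then reboot
update-initramfs -u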

Lastly, I recently opened another post about the ZFS pool issue here. Let me know if you have any other ideas or questions.

Thanks again @Stoiko Ivanov !

Also, do you think it would be worth purchasing an LSI pass-through controller (LSI SAS 9207-8i)? I am already thinking of ordering a replacement for the SAS cable that runs from the drive backplane to the controller, but I'm worried this may be a controller issue. If I were to purchase the LSI controller, would I need to reconfigure the entire system, or would the drives simply show up as they do now (as if no change had taken place)?
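From what I've read, ZFS identifies pool members by the labels on the disks themselves, so moving to a different HBA should mostly be an export and re-import - something like this, where "tank" is just a placeholder pool name:
Code:
# before swapping the controller
zpool export tank
# after booting with the new HBA, import again using stable by-id paths
zpool import -d /dev/disk/by-id tank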
 
