ZFS Pool Keeps Losing an SSD

jaybod

New Member
Oct 5, 2024
I have a ZFS pool set up with two new 4TB SSDs in a mirrored config. They were both running fine for approximately a month until I noticed health warnings on the pool. One of the SSDs had dropped out and was no longer detected in the disks listed within the system. When I remove both disks and reinstall them, the Proxmox disk utility picks both back up again. I wiped both disks and set the mirrored pool up again, and all ran well for approximately a week until I noticed the same thing. It is the same disk that is dropping out. Again, if I power off the system and restart, the disk comes back online, but after a while it drops out once more.

When I restart Proxmox and check the SMART data, both within Proxmox and on another PC, they both say the disk is OK and in the same health as the other disk. I don't know why the pool keeps dropping the same disk, and I'm not sure how to tell whether the disk is faulty or whether this is a Proxmox software issue that keeps dropping the disk out of the pool. I have another ZFS pool with two HDDs mirrored and there are no issues there. I'm just looking to see if there is anything I can try to confirm what is happening, or to confirm whether the disk is faulty.
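A few commands that can help pin down what is happening when the disk drops (a rough sketch; the pool name tank and the device /dev/sdb are placeholders, substitute your own):

    # Show pool state and per-device read/write/checksum error counters
    zpool status -v tank

    # Check the kernel log for SATA link resets or I/O errors around the time the disk dropped
    journalctl -k | grep -iE 'ata|i/o error|reset'

    # Full SMART report, including the error log and self-test history
    smartctl -a /dev/sdb

    # Start a long SMART self-test; read the result later with smartctl -a
    smartctl -t long /dev/sdb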
 
Thanks, I have switched the disks around so they are each using the other's power and SATA cables, and I'm monitoring to see what happens. If the same disk drops out, I'm not sure what it is then; if the other disk drops out, it will point to an issue with the power/SATA cables.
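One thing that helps when swapping cables around: identify the physical drive by its serial number rather than by device name, since /dev letters can change between boots. For example (the device name is a placeholder):

    # List drives with model and serial so you can tell which physical disk actually dropped
    lsblk -o NAME,MODEL,SERIAL,SIZE

    # Or read the serial straight from the drive's SMART identity page
    smartctl -i /dev/sdb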
 
The SMART test tells you the disk thinks it is working fine, which points to one of the other culprits.
That does not have to be the case with ZFS.

ZFS relies heavily on fast storage, so if you have reads that take longer than 1-2 seconds (the disk's internal error correction doing its thing), the block will be marked as failed, and if that happens more and more, the disk will be thrown out of the pool. We experienced this a lot with non-enterprise disks; after switching to enterprise disks, we never experienced it again.

We identified the problem as the different firmware in SAS and SATA disks. In the consumer market it's more important to try to read and recover from a failure, while in the enterprise world, where you have redundancy everywhere, failing fast is more important, so the I/O timeouts are much stricter and an error is returned much sooner. This holds especially true for ZFS. The disks that ZFS threw out had no bad blocks, and long SMART tests did not yield any errors. We re-added the disks and they would work for a couple of months until they failed again.
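On many consumer SATA drives you can check, and sometimes shorten, this firmware error-recovery timeout via SCT Error Recovery Control (often called TLER/ERC). A sketch of how to inspect and set it with smartctl; note that not all drives support it and the setting may not survive a power cycle (/dev/sdb is a placeholder):

    # Show the current SCT ERC read/write recovery timeouts, if the drive supports them
    smartctl -l scterc /dev/sdb

    # Set both read and write recovery timeouts to 7 seconds (values are in tenths of a second)
    smartctl -l scterc,70,70 /dev/sdb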
 
