Hi folks,
I've been using ZFS on Proxmox for a few years now. Ever since I got my first drive shelf and HBA card, ZFS has been randomly suspending pools at startup. Anything that relies on a suspended pool tends to hang, and only a restart fixes it. When I try to shut down, the whole system hangs and I have to hold the power button to turn it off. I've tried updating the firmware on the HBA card, but that didn't solve the issue. I'm at a loss and don't know how to diagnose this further.
My setup:
- Proxmox 8.4.12 on Kernel 6.8.12-9-pve
- CPU: Ryzen 9 5950x
- RAM: 4x32GB G.Skill 3200MT/s (it might be 3600MT/s; it gets reset to 2666MT/s because of the hard power-offs)
- Boot: 2x Samsung 860/870 SSDs in zraid-1
- HBA: LSI 9300-16e (IT Mode | FW: 16.00.10.00)
- Drive Shelves:
  - 2x EMC KTN-STL3 15-Bay 3.5"
  - 1x Dell EMC 25-Bay 2.5"
  - 1x NetApp 4U NAJ-0801 24-Bay 3.5"
I have several HDD pools and one SSD pool on these shelves (I don't think the composition matters, but I'll post it if it does). I'm pretty sure every one of them has been suspended at startup at some point. The only ZFS pool that has never been suspended or thrown errors is the boot pool, whose drives are connected to the motherboard's SATA ports. I want to make it abundantly clear, though, that the drives themselves work fine: the issue eventually goes away after a number of restarts.
The EMC shelves are connected via 1.5m SFF-8644 to SFF-8088 cables by 10gtek; the NetApp shelf is connected via a 2m SFF-8644 to QSFP (SFF-8436) cable, also by 10gtek. These cables are rated for SAS 2.0, while the HBA and, I think, the shelves can do SAS 3.0. I was tempted to buy SAS 3.0 capable cables, but they're a bit hard to find and I don't know whether they would fix it.
I know ZFS is highly regarded for its data integrity, but the frustration this issue has caused is really taking the wind out of its sails. If I could at least get ZFS to either not import the erroring pools at startup, or forcibly export them once they've been suspended, I could work around this issue. Any help is much appreciated.
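To make the workaround idea a bit more concrete, here is a minimal sketch of the kind of startup check I have in mind. It assumes that `zpool list -H -o name,health` reports `SUSPENDED` in the health column for a suspended pool and that the command itself still returns while a pool is in that state (hence the timeout). The function names and the pool name `tank` in the usage note are hypothetical.

```python
import subprocess


def parse_pool_health(output: str) -> dict[str, str]:
    """Parse `zpool list -H -o name,health` output into {pool: health}."""
    pools = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pools[fields[0]] = fields[1]
    return pools


def suspended_pools(output: str) -> list[str]:
    """Return the names of pools whose health column reads SUSPENDED."""
    return [name for name, health in parse_pool_health(output).items()
            if health == "SUSPENDED"]


def read_pool_health(timeout_s: float = 10.0) -> str:
    """Run zpool in scripted mode (-H: no headers, tab-separated columns).

    The timeout is an assumption: it guards against the command hanging
    while a pool is suspended.
    """
    result = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,health"],
        capture_output=True, text=True, check=True, timeout=timeout_s,
    )
    return result.stdout


if __name__ == "__main__":
    for name in suspended_pools(read_pool_health()):
        print(f"pool {name} is suspended")
```

For example, feeding it the text `"rpool\tONLINE\ntank\tSUSPENDED\n"` would flag only `tank`. Something like this could run before the services that depend on the pools are started, so they don't hang on a suspended pool.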