Hi Pavel, Holr,
We use HP DL360 Gen9 servers (8 SAS SSDs + P840ar controller) under Proxmox 5.3 and Proxmox 6.2 in HBA mode, and we encounter many instabilities.
Sometimes an SSD is marked FAULTED by ZFS, and the CRC counter increments in the smartctl output.
Sometimes an SSD is abruptly ejected by the hpsa driver, which makes the kernel unstable; the node can even reboot (via softdog).
Log: PHYSICAL RESETTING ...
Every time, HP concludes that both the SSD and the controller are fine.
I would like to hear your feedback on running ZFS on your HP hardware.
Do you encounter similar problems?
Example:
### hpsa resetting physical DRV (slot 4)
Jan 26 07:08:31 172.30.18.2 kernel: [1764641.096331] hpsa 0000:03:00.0: scsi 0:0:4:0: resetting physical Direct-Access ATA MK000960GWEZK PHYS DRV SSDSmartPathCap- En- Exp=1
### drive never comes back … WAITING 8sec…
Jan 26 07:08:52 172.30.18.2 kernel: [1764662.332010] hpsa 0000:03:00.0: waiting 2 secs for device to become ready.
Jan 26 07:08:54 172.30.18.2 kernel: [1764664.344480] hpsa 0000:03:00.0: waiting 4 secs for device to become ready.
Jan 26 07:08:58 172.30.18.2 kernel: [1764668.440504] hpsa 0000:03:00.0: waiting 8 secs for device to become ready.
### hpsa reports low-level errors and WAITING 16sec…
Jan 26 07:09:04 172.30.18.2 kernel: [1764674.587638] hpsa 0000:03:00.0: SCSI status: LUN:0000000000800301 CDB:12010000040000000000000000000000
Jan 26 07:09:04 172.30.18.2 kernel: [1764674.587642] hpsa 0000:03:00.0: SCSI Status = 02, Sense key = 0x05, ASC = 0x25, ASCQ = 0x00
Jan 26 07:09:04 172.30.18.2 kernel: [1764674.594966] hpsa 0000:03:00.0: Acknowledging event: 0x80000012 (HP SSD Smart Path configuration change)
Jan 26 07:09:06 172.30.18.2 kernel: [1764676.632554] hpsa 0000:03:00.0: waiting 16 secs for device to become ready.
Jan 26 07:09:20 172.30.18.2 kernel: [1764689.691849] hpsa 0000:03:00.0: SCSI status: LUN:0000000000800301 CDB:12010000040000000000000000000000
Jan 26 07:09:20 172.30.18.2 kernel: [1764689.691853] hpsa 0000:03:00.0: SCSI Status = 02, Sense key = 0x05, ASC = 0x25, ASCQ = 0x00
Jan 26 07:09:20 172.30.18.2 kernel: [1764689.699148] hpsa 0000:03:00.0: Acknowledging event: 0x80000012 (HP SSD Smart Path configuration change)
Jan 26 07:09:23 172.30.18.2 kernel: [1764692.760665] hpsa 0000:03:00.0: waiting 32 secs for device to become ready.
Jan 26 07:09:35 172.30.18.2 kernel: [1764704.795884] hpsa 0000:03:00.0: SCSI status: LUN:0000000000800301 CDB:12010000040000000000000000000000
Jan 26 07:09:35 172.30.18.2 kernel: [1764704.795888] hpsa 0000:03:00.0: SCSI Status = 02, Sense key = 0x05, ASC = 0x25, ASCQ = 0x00
Jan 26 07:09:35 172.30.18.2 kernel: [1764704.803167] hpsa 0000:03:00.0: Acknowledging event: 0x80000012 (HP SSD Smart Path configuration change)
Jan 26 07:09:50 172.30.18.2 kernel: [1764719.899953] hpsa 0000:03:00.0: SCSI status: LUN:0000000000800301 CDB:12010000040000000000000000000000
Jan 26 07:09:50 172.30.18.2 kernel: [1764719.899956] hpsa 0000:03:00.0: SCSI Status = 02, Sense key = 0x05, ASC = 0x25, ASCQ = 0x00
Jan 26 07:09:50 172.30.18.2 kernel: [1764719.907229] hpsa 0000:03:00.0: Acknowledging event: 0x80000012 (HP SSD Smart Path configuration change)
Jan 26 07:09:55 172.30.18.2 kernel: [1764725.016873] hpsa 0000:03:00.0: waiting 32 secs for device to become ready.
Jan 26 07:10:05 172.30.18.2 kernel: [1764735.003979] hpsa 0000:03:00.0: SCSI status: LUN:0000000000800301 CDB:12010000040000000000000000000000
Jan 26 07:10:05 172.30.18.2 kernel: [1764735.003982] hpsa 0000:03:00.0: SCSI Status = 02, Sense key = 0x05, ASC = 0x25, ASCQ = 0x00
Jan 26 07:10:05 172.30.18.2 kernel: [1764735.011280] hpsa 0000:03:00.0: Acknowledging event: 0x80000012 (HP SSD Smart Path configuration change)
Jan 26 07:10:20 172.30.18.2 kernel: [1764750.108138] hpsa 0000:03:00.0: SCSI status: LUN:0000000000800301 CDB:12010000040000000000000000000000
Jan 26 07:10:20 172.30.18.2 kernel: [1764750.108142] hpsa 0000:03:00.0: SCSI Status = 02, Sense key = 0x05, ASC = 0x25, ASCQ = 0x00
Jan 26 07:10:20 172.30.18.2 kernel: [1764750.115434] hpsa 0000:03:00.0: Acknowledging event: 0x80000012 (HP SSD Smart Path configuration change)
Jan 26 07:10:28 172.30.18.2 kernel: [1764757.785081] hpsa 0000:03:00.0: waiting 32 secs for device to become ready.
…
### Without response to IO requests the kernel becomes unstable.
Jan 26 07:10:34 172.30.18.2 kernel: [1764763.929130] INFO: task zvol:2194 blocked for more than 120 seconds.
Jan 26 07:10:34 172.30.18.2 kernel: [1764763.929170] Tainted: P O 4.15.18-9-pve #1
Jan 26 07:10:34 172.30.18.2 kernel: [1764763.929194] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 26 07:10:34 172.30.18.2 kernel: [1764763.929227] zvol D 0 2194 2 0x80000000
…
### On PVE5 the node reboots due to the kernel softdog
### On PVE6 the node remains inaccessible and requires a manual reboot
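For anyone who wants to compare their own logs, here is a minimal Python sketch that decodes the "SCSI Status / Sense key / ASC / ASCQ" lines from the excerpt above. It is not part of our tooling: the function name `decode_sense` and the lookup tables are illustrative and deliberately incomplete. Going by the standard SCSI tables, status 02 is CHECK CONDITION, sense key 0x05 is ILLEGAL REQUEST, and ASC/ASCQ 0x25/0x00 is LOGICAL UNIT NOT SUPPORTED, so the repeated command (the CDB beginning with 12 is an INQUIRY) appears to be addressed to a device the controller no longer presents.

```python
import re

# Matches the hpsa sense line format seen in the log excerpt above.
SENSE_RE = re.compile(
    r"SCSI Status = (?P<status>[0-9A-Fa-f]{2}), "
    r"Sense key = 0x(?P<key>[0-9A-Fa-f]{2}), "
    r"ASC = 0x(?P<asc>[0-9A-Fa-f]{2}), "
    r"ASCQ = 0x(?P<ascq>[0-9A-Fa-f]{2})"
)

# Partial lookup tables with standard SCSI values (not exhaustive).
STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION"}
SENSE_KEYS = {
    0x02: "NOT READY",
    0x03: "MEDIUM ERROR",
    0x04: "HARDWARE ERROR",
    0x05: "ILLEGAL REQUEST",
    0x06: "UNIT ATTENTION",
}
ASC_ASCQ = {(0x25, 0x00): "LOGICAL UNIT NOT SUPPORTED"}


def decode_sense(line):
    """Return (status, sense key, additional sense) names, or None if no match."""
    m = SENSE_RE.search(line)
    if m is None:
        return None
    status = int(m["status"], 16)
    key = int(m["key"], 16)
    asc = int(m["asc"], 16)
    ascq = int(m["ascq"], 16)
    return (
        STATUS.get(status, "unknown"),
        SENSE_KEYS.get(key, "unknown"),
        ASC_ASCQ.get((asc, ascq), "unknown"),
    )


if __name__ == "__main__":
    sample = (
        "Jan 26 07:09:04 172.30.18.2 kernel: [1764674.587642] hpsa 0000:03:00.0: "
        "SCSI Status = 02, Sense key = 0x05, ASC = 0x25, ASCQ = 0x00"
    )
    print(decode_sense(sample))
```

Lines that do not carry sense data (for example the "waiting N secs" messages) return None, so the function can be applied to a whole log file line by line.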
Thank you for sharing your experience.