Proxmox + ZFS + HP P440ar on HBA mode

Pavel Hruška

Member
May 1, 2018
Hi there, I want to run Proxmox on an HP DL160 Gen9 and use ZFS. The server has an HP P440ar controller, currently in RAID mode, but I've read that it is possible to switch this controller to HBA mode. I have similar setups already running, but there I use HW RAID (RAID6 or RAID1/10) with LVM on top of it.

What is unclear to me is whether, in HBA mode, I lose predictive-failure and drive-failure notification, including the front-panel indication on the physical drive (all drives are HP's, so they should support the smart carrier and all that "smart stuff" that just works in RAID mode). Or maybe someone could explain how those notification LEDs normally work, i.e. when the drive is actually marked as bad by the controller.

In short: how do I know which physical drive is bad and needs to be replaced when I see a problem in ZFS? Does anyone have experience with this?
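For reference, this is roughly how I'd try to map a ZFS device to a physical drive from the OS; the pool name "tank" and the device /dev/sdX are just placeholders:
Code:
zpool status -v tank                     # identify the FAULTED/DEGRADED device, e.g. sdX
ls -l /dev/disk/by-id/ | grep sdX        # the by-id symlink usually embeds the drive serial
smartctl -i /dev/sdX | grep -i serial    # cross-check the serial before pulling the drive
But that only gives me a serial number, not a blinking bay LED, which is why I'm asking.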

NOTE: Please do not argue here that HW RAID is bad and should be avoided (because, well, it works, even with Proxmox + LVM, for example), nor that I need a better SAS controller. I know all that. I just want to know how to handle drive failures in this scenario.

Thank you.
 
Hi, I have some experience with HP DL360 Gen9 servers with the HP P440ar. We set the drives to HBA mode (so each physical disk is exposed to Proxmox for a Ceph cluster), but we had to set up each drive as an individual RAID 0 (I am not sure whether we could run the drives without this RAID 0 approach, but the systems are in production right now, so I can't experiment for a few more weeks). We've set up iLO on the servers, and from its menu you can make an individual drive bay light up. But even before that, when we've had to replace drives, they start blinking orange if the controller detects physical problems.
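If you'd rather not go through iLO, the bay LED can apparently also be driven from the OS with the HP CLI tools; a sketch, assuming the controller is in slot 0 and the drive address (here 1I:1:1) is looked up first:
Code:
hpssacli ctrl slot=0 pd all show                  # list drives and their port:box:bay addresses
hpssacli ctrl slot=0 pd 1I:1:1 modify led=on      # blink the bay LED of that drive
hpssacli ctrl slot=0 pd 1I:1:1 modify led=off     # turn it off again
I haven't verified this on the P440ar in HBA mode, so treat it as a pointer rather than a tested procedure.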

Do note that if you go for a similar HBA/individual-drive RAID 0 setup, we've had to reboot the server and go into the BIOS to get it to accept a replacement drive.
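For completeness, this is roughly how such single-drive RAID 0 logical drives are created with hpssacli (slot 0 and drive 1I:1:4 are just examples); whether doing this from the running OS would avoid the BIOS trip after a swap, I haven't been able to test:
Code:
hpssacli ctrl slot=0 pd all show                           # find the address of the new drive
hpssacli ctrl slot=0 create type=ld drives=1I:1:4 raid=0   # wrap it in a single-drive RAID 0
hpssacli ctrl slot=0 ld all show                           # confirm the new logical drive shows up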
 
We set the drives to HBA mode (so each physical disk is exposed to Proxmox for a Ceph cluster), but we had to set up each drive as an individual raid 0

Well, I think this is not real HBA mode; your drives are still exposed to Proxmox through the HW RAID layer, just as individual RAID 0 logical drives.

The P440ar should support switching to HBA mode with this command (if the controller is in slot 0):
Code:
hpssacli cmd -q "controller slot=0 modify hbamode=on forced"

To view current config:
Code:
hpssacli cmd -q "controller slot=0 show config detail"

Not sure if you did such config on your setup.
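To check whether HBA mode is actually active, the "show config detail" output can be filtered; a sketch, assuming the firmware reports an "HBA Mode Enabled" line (the exact wording may differ between firmware revisions):
Code:
hpssacli cmd -q "controller slot=0 show config detail" | grep -i "hba mode"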

Do note that if you go for a similar HBA/individual-drive RAID 0 setup, we've had to reboot the server and go into the BIOS to get it to accept a replacement drive.
This does not look like a perfect solution for production and is exactly what I'd like to avoid. I do not want to lose hot-swap capability; swapping bad drives is one of the most common things I do when managing servers.
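For the notification side, I'd probably rely on the ZFS event daemon (zed) rather than the controller; a minimal sketch of /etc/zfs/zed.d/zed.rc, assuming local mail delivery works on the node:
Code:
# /etc/zfs/zed.d/zed.rc (excerpt)
ZED_EMAIL_ADDR="root"      # where zed mails fault/degrade events
ZED_NOTIFY_VERBOSE=1       # also notify on non-fault events such as completed scrubs
# then: systemctl restart zfs-zed
That covers "a drive went bad", but it still doesn't tell me which bay to pull, hence my question.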
 
Hi Pavel, Holr,

we use HP DL360 Gen9 servers (8 SAS SSDs + P840ar controller) under Proxmox 5.3 and Proxmox 6.2 in HBA mode and encounter many instabilities.
Sometimes an SSD is marked FAULTED by ZFS and the CRC error counter increments in the smartctl output.
Sometimes an SSD is abruptly ejected by the hpsa driver, making the kernel unstable; the node can even reboot (via softdog).
Log: "resetting physical ..."
HP systematically concludes that both the SSD and the controller are fine.
I would like to know your feedback on your HP hardware with ZFS.
Do you encounter similar problems?
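For reference, this is how we watch the CRC counter mentioned above (/dev/sda is just an example; depending on the drive, the counter appears as an attribute like UDMA_CRC_Error_Count or in the SAS error counter log):
Code:
smartctl -A /dev/sda | grep -i crc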

Example :


### hpsa resetting physical DRV (slot 4)
Jan 26 07:08:31 172.30.18.2 kernel: [1764641.096331] hpsa 0000:03:00.0: scsi 0:0:4:0: resetting physical Direct-Access ATA MK000960GWEZK PHYS DRV SSDSmartPathCap- En- Exp=1

### drive never comes back … WAITING 8sec…
Jan 26 07:08:52 172.30.18.2 kernel: [1764662.332010] hpsa 0000:03:00.0: waiting 2 secs for device to become ready.
Jan 26 07:08:54 172.30.18.2 kernel: [1764664.344480] hpsa 0000:03:00.0: waiting 4 secs for device to become ready.
Jan 26 07:08:58 172.30.18.2 kernel: [1764668.440504] hpsa 0000:03:00.0: waiting 8 secs for device to become ready.

### hpsa report low level errors and WAITING 16sec…
Jan 26 07:09:04 172.30.18.2 kernel: [1764674.587638] hpsa 0000:03:00.0: SCSI status: LUN:0000000000800301 CDB:12010000040000000000000000000000
Jan 26 07:09:04 172.30.18.2 kernel: [1764674.587642] hpsa 0000:03:00.0: SCSI Status = 02, Sense key = 0x05, ASC = 0x25, ASCQ = 0x00
Jan 26 07:09:04 172.30.18.2 kernel: [1764674.594966] hpsa 0000:03:00.0: Acknowledging event: 0x80000012 (HP SSD Smart Path configuration change)
Jan 26 07:09:06 172.30.18.2 kernel: [1764676.632554] hpsa 0000:03:00.0: waiting 16 secs for device to become ready.
Jan 26 07:09:20 172.30.18.2 kernel: [1764689.691849] hpsa 0000:03:00.0: SCSI status: LUN:0000000000800301 CDB:12010000040000000000000000000000
Jan 26 07:09:20 172.30.18.2 kernel: [1764689.691853] hpsa 0000:03:00.0: SCSI Status = 02, Sense key = 0x05, ASC = 0x25, ASCQ = 0x00
Jan 26 07:09:20 172.30.18.2 kernel: [1764689.699148] hpsa 0000:03:00.0: Acknowledging event: 0x80000012 (HP SSD Smart Path configuration change)
Jan 26 07:09:23 172.30.18.2 kernel: [1764692.760665] hpsa 0000:03:00.0: waiting 32 secs for device to become ready.
Jan 26 07:09:35 172.30.18.2 kernel: [1764704.795884] hpsa 0000:03:00.0: SCSI status: LUN:0000000000800301 CDB:12010000040000000000000000000000
Jan 26 07:09:35 172.30.18.2 kernel: [1764704.795888] hpsa 0000:03:00.0: SCSI Status = 02, Sense key = 0x05, ASC = 0x25, ASCQ = 0x00
Jan 26 07:09:35 172.30.18.2 kernel: [1764704.803167] hpsa 0000:03:00.0: Acknowledging event: 0x80000012 (HP SSD Smart Path configuration change)
Jan 26 07:09:50 172.30.18.2 kernel: [1764719.899953] hpsa 0000:03:00.0: SCSI status: LUN:0000000000800301 CDB:12010000040000000000000000000000
Jan 26 07:09:50 172.30.18.2 kernel: [1764719.899956] hpsa 0000:03:00.0: SCSI Status = 02, Sense key = 0x05, ASC = 0x25, ASCQ = 0x00
Jan 26 07:09:50 172.30.18.2 kernel: [1764719.907229] hpsa 0000:03:00.0: Acknowledging event: 0x80000012 (HP SSD Smart Path configuration change)
Jan 26 07:09:55 172.30.18.2 kernel: [1764725.016873] hpsa 0000:03:00.0: waiting 32 secs for device to become ready.
Jan 26 07:10:05 172.30.18.2 kernel: [1764735.003979] hpsa 0000:03:00.0: SCSI status: LUN:0000000000800301 CDB:12010000040000000000000000000000
Jan 26 07:10:05 172.30.18.2 kernel: [1764735.003982] hpsa 0000:03:00.0: SCSI Status = 02, Sense key = 0x05, ASC = 0x25, ASCQ = 0x00
Jan 26 07:10:05 172.30.18.2 kernel: [1764735.011280] hpsa 0000:03:00.0: Acknowledging event: 0x80000012 (HP SSD Smart Path configuration change)
Jan 26 07:10:20 172.30.18.2 kernel: [1764750.108138] hpsa 0000:03:00.0: SCSI status: LUN:0000000000800301 CDB:12010000040000000000000000000000
Jan 26 07:10:20 172.30.18.2 kernel: [1764750.108142] hpsa 0000:03:00.0: SCSI Status = 02, Sense key = 0x05, ASC = 0x25, ASCQ = 0x00
Jan 26 07:10:20 172.30.18.2 kernel: [1764750.115434] hpsa 0000:03:00.0: Acknowledging event: 0x80000012 (HP SSD Smart Path configuration change)
Jan 26 07:10:28 172.30.18.2 kernel: [1764757.785081] hpsa 0000:03:00.0: waiting 32 secs for device to become ready.


### Without response to IO requests the kernel becomes unstable.
Jan 26 07:10:34 172.30.18.2 kernel: [1764763.929130] INFO: task zvol:2194 blocked for more than 120 seconds.
Jan 26 07:10:34 172.30.18.2 kernel: [1764763.929170] Tainted: P O 4.15.18-9-pve #1
Jan 26 07:10:34 172.30.18.2 kernel: [1764763.929194] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 26 07:10:34 172.30.18.2 kernel: [1764763.929227] zvol D 0 2194 2 0x80000000


### On PVE5 the node reboots due to the kernel softdog
### On PVE6 the node remains inaccessible and requires a manual reboot

Thank you for sharing your experience
 
we use HP DL360 Gen9 servers (8 SAS SSDs + P840ar controller) under Proxmox 5.3 and Proxmox 6.2 in HBA mode and encounter many instabilities.
[...]
Thank you for sharing your experience
@auranext
did you ever solve this issue?
 
