Problems with Dell PowerEdge R430 (PERC H330 Mini Disk Controller)

JMM

Aug 28, 2019
Hello to all,
I am installing Proxmox on two Dell PowerEdge R430 servers with a PERC H330 Mini controller. After I upgrade Proxmox from 5.4 to 6.0, the server freezes completely whenever it runs or restores a backup.
The servers' firmware and BIOS are up to date and the controller is working in HBA mode, although this also happens in RAID mode.

This problem occurs on two brand-new servers that have the same hardware config.

If I stay on 5.4 everything is OK. After the update to 6.0, the freeze happens whenever a backup or restore reaches around 20%.

I have also tried a fresh 6.0 install and the same thing happens on both servers.

Has anyone run into this same problem? Is there any special config for this hardware?

Thanks!
 
I've never had a similar issue, but I do own several R630s with the PERC H330. A couple of questions:
  1. What is your storage setup? Are you using SAS/SATA/PCIe SSDs or SATA/SAS HDDs? Are the drives in an array?
  2. What file system are you using?
  3. If ZFS, how much RAM do you have, and do you have an SSD for cache? What are the ZFS settings (compression, checksums, encryption)?
  4. What is the size of the backups you are restoring?
  5. Where are you restoring the backup from? Is it a NAS? Does it perform well?
  6. If NAS, what is your network setup?
If I were to make a wild guess, it looks like your CPU or RAM is overwhelmed. You may need to tune the ZFS ARC size. The CPU is likely overwhelmed either by unarchiving (assuming the backups are gzip or lzo) or by compression and checksum computation (assuming ZFS is used with default settings).
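
To make the ARC tuning idea concrete: on Linux the ARC ceiling is controlled by the `zfs_arc_max` module parameter, which takes a value in bytes and is usually set through a line in `/etc/modprobe.d/zfs.conf`. The small Python sketch below only computes that line. The helper name `arc_max_option` and the 8 GiB figure are illustrative assumptions, not values taken from this thread.

```python
# Sketch: build the modprobe option line that caps the ZFS ARC.
# The limit used in the example (8 GiB) is an assumption for illustration only.
GIB = 1024 ** 3


def arc_max_option(limit_gib: int) -> str:
    """Return an 'options zfs ...' line capping the ARC at limit_gib GiB."""
    if limit_gib <= 0:
        raise ValueError("limit must be positive")
    # zfs_arc_max is expressed in bytes.
    return f"options zfs zfs_arc_max={limit_gib * GIB}"


if __name__ == "__main__":
    print(arc_max_option(8))
```

The printed line would go into `/etc/modprobe.d/zfs.conf`. On a root-on-ZFS install the initramfs usually has to be refreshed and the host rebooted before the new limit applies.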

You've also mentioned HBA mode. Is it the default mode when drives are connected as non-RAID, or was the card flashed to IT mode? It may be a case of corrupted firmware.
 
Syslog showed this during the last freeze, while restoring a Clonezilla backup:

Aug 28 14:42:10 pve1 kernel: [ 880.083452] [Firmware Bug]: APEI: Invalid physical address in GAR [0x0/0/0/0/0]
Aug 28 14:42:11 pve1 kernel: [ 881.171300] sd 0:0:1:0: [sdb] tag#2 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Aug 28 14:42:11 pve1 kernel: [ 881.171304] sd 0:0:1:0: [sdb] tag#2 CDB: Write(10) 2a 00 04 4e 71 10 00 01 00 00
Aug 28 14:42:11 pve1 kernel: [ 881.171307] print_req_error: I/O error, dev sdb, sector 72249616
Aug 28 14:42:11 pve1 kernel: [ 881.173045] sd 0:0:1:0: [sdb] tag#27 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Aug 28 14:42:11 pve1 kernel: [ 881.173049] sd 0:0:1:0: [sdb] tag#27 CDB: Write(10) 2a 00 04 4e 74 58 00 00 98 00
Aug 28 14:42:11 pve1 kernel: [ 881.173051] print_req_error: I/O error, dev sdb, sector 72250456
Aug 28 14:42:11 pve1 kernel: [ 881.173343] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
Aug 28 14:42:11 pve1 kernel: [ 881.173392] IP: complete_cmd_fusion+0xbd/0x530 [megaraid_sas]
Aug 28 14:42:11 pve1 kernel: [ 881.173419] PGD 0 P4D 0
Aug 28 14:42:11 pve1 kernel: [ 881.173435] Oops: 0000 [#1] SMP PTI
Aug 28 14:42:11 pve1 kernel: [ 881.173453] Modules linked in: tcp_diag inet_diag nfsv3 nfs_acl nfs lockd grace fscache ip_set ip6table_filter ip6_tables iptable_filter softdog nfnetlink_log nfnetlink mgag200 ttm drm_kms_helper ipmi_ssif intel_rapl drm sb_edac x86_pkg_temp_thermal i2c_algo_bit intel_powerclamp fb_sys_fops syscopyarea sysfillrect sysimgblt coretemp kvm_intel kvm irqbypass mxm_wmi crct10dif_pclmul crc32_pclmul shpchp ghash_clmulni_intel pcbc dcdbas aesni_intel aes_x86_64 crypto_simd glue_helper cryptd intel_cstate snd_pcm snd_timer snd soundcore intel_rapl_perf mei_me pcspkr mei input_leds lpc_ich ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter mac_hid vhost_net vhost tap ib_iser rdma_cm sunrpc iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 zfs(PO) zunicode(PO)
Aug 28 14:42:11 pve1 kernel: [ 881.173803] zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq ses enclosure scsi_transport_sas hid_generic usbmouse usbkbd usbhid hid tg3 ptp pps_core ahci libahci megaraid_sas
Aug 28 14:42:11 pve1 kernel: [ 881.173899] CPU: 4 PID: 31913 Comm: kvm Tainted: P O 4.15.18-12-pve #1
Aug 28 14:42:11 pve1 kernel: [ 881.173934] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.9.1 12/07/2018
Aug 28 14:42:11 pve1 kernel: [ 881.173972] RIP: 0010:complete_cmd_fusion+0xbd/0x530 [megaraid_sas]
Aug 28 14:42:11 pve1 kernel: [ 881.174001] RSP: 0018:ffff8b36bfd03e68 EFLAGS: 00010013
Aug 28 14:42:11 pve1 kernel: [ 881.174026] RAX: ffff8b3691918000 RBX: 0000000000000000 RCX: 0000000000000000
Aug 28 14:42:11 pve1 kernel: [ 881.174058] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000082
Aug 28 14:42:11 pve1 kernel: [ 881.174091] RBP: ffff8b36bfd03ec0 R08: 0000000000000000 R09: ffff8b369a7310a0
Aug 28 14:42:11 pve1 kernel: [ 881.174123] R10: ffff8b3696093948 R11: ffff8b3691878000 R12: ffff8b2e7588fbd8
Aug 28 14:42:11 pve1 kernel: [ 881.174154] R13: ffff8b3691878008 R14: ffff8b36917047a8 R15: ffff8b3691878000
Aug 28 14:42:11 pve1 kernel: [ 881.174187] FS: 00007f8174fff700(0000) GS:ffff8b36bfd00000(0000) knlGS:0000000000000000
Aug 28 14:42:11 pve1 kernel: [ 881.174224] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 28 14:42:11 pve1 kernel: [ 881.174250] CR2: 0000000000000000 CR3: 0000000796e0a006 CR4: 00000000003626e0
Aug 28 14:42:11 pve1 kernel: [ 881.174283] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 28 14:42:11 pve1 kernel: [ 881.174315] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 28 14:42:11 pve1 kernel: [ 881.174347] Call Trace:
Aug 28 14:42:11 pve1 kernel: [ 881.174360] <IRQ>
Aug 28 14:42:11 pve1 kernel: [ 881.174377] megasas_isr_fusion+0x3d/0x190 [megaraid_sas]
Aug 28 14:42:11 pve1 kernel: [ 881.174406] __handle_irq_event_percpu+0x84/0x1a0
Aug 28 14:42:11 pve1 kernel: [ 881.174430] handle_irq_event_percpu+0x32/0x80
Aug 28 14:42:11 pve1 kernel: [ 881.174453] handle_irq_event+0x3b/0x60
Aug 28 14:42:11 pve1 kernel: [ 881.174473] handle_edge_irq+0x78/0x1a0
Aug 28 14:42:11 pve1 kernel: [ 881.174493] handle_irq+0x20/0x30
Aug 28 14:42:11 pve1 kernel: [ 881.174512] do_IRQ+0x4e/0xd0
Aug 28 14:42:11 pve1 kernel: [ 881.174530] common_interrupt+0x84/0x84
Aug 28 14:42:11 pve1 kernel: [ 881.174549] </IRQ>
Aug 28 14:42:11 pve1 kernel: [ 881.174564] RIP: 0010:finish_task_switch+0x7b/0x220
Aug 28 14:42:11 pve1 kernel: [ 881.174587] RSP: 0018:ffffb6e12984fba0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdc
Aug 28 14:42:11 pve1 kernel: [ 881.174622] RAX: ffff8b369ba116c0 RBX: ffff8b35d5ec5b00 RCX: 0000000000000000
Aug 28 14:42:11 pve1 kernel: [ 881.174654] RDX: 0000000000007f81 RSI: 0000000074fff700 RDI: ffff8b36bfd228c0
Aug 28 14:42:11 pve1 kernel: [ 881.174687] RBP: ffffb6e12984fbc8 R08: 000000000000131f R09: 0000000000000018
Aug 28 14:42:11 pve1 kernel: [ 881.174719] R10: ffffb6e1001d3e28 R11: 000000000000003b R12: ffff8b36bfd228c0
Aug 28 14:42:11 pve1 kernel: [ 881.174751] R13: ffff8b36458f39c0 R14: ffff8b369ba116c0 R15: 0000000000000000
Aug 28 14:42:11 pve1 kernel: [ 881.174786] __schedule+0x3e8/0x870
Aug 28 14:42:11 pve1 kernel: [ 881.174806] ? get_futex_key+0x380/0x3c0
Aug 28 14:42:11 pve1 kernel: [ 881.174826] schedule+0x36/0x80
Aug 28 14:42:11 pve1 kernel: [ 881.174844] futex_wait_queue_me+0xc4/0x120
Aug 28 14:42:11 pve1 kernel: [ 881.174866] futex_wait+0x119/0x260
Aug 28 14:42:11 pve1 kernel: [ 881.174889] ? vmexit_fill_RSB+0x10/0x40 [kvm_intel]
Aug 28 14:42:11 pve1 kernel: [ 881.174916] ? vmx_vcpu_run+0x418/0x5e0 [kvm_intel]
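
As a cross-check on the log above, the two failed `Write(10)` CDBs can be decoded by hand. In a WRITE(10) command, byte 0 is the opcode `0x2a`, bytes 2-5 are the big-endian logical block address, and bytes 7-8 are the transfer length in blocks. The Python sketch below (the function name `decode_write10` is only an illustrative choice) shows that the decoded addresses equal the sector numbers reported by `print_req_error`, which is consistent with a 512-byte logical block size on sdb.

```python
# Sketch: decode a SCSI WRITE(10) CDB as printed in the kernel log.
def decode_write10(cdb_hex: str) -> dict:
    """Return the LBA and block count encoded in a 10-byte WRITE(10) CDB."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between byte pairs is ignored
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),     # bytes 2-5: logical block address
        "blocks": int.from_bytes(cdb[7:9], "big"),  # bytes 7-8: transfer length
    }


if __name__ == "__main__":
    for cdb in ("2a 00 04 4e 71 10 00 01 00 00",
                "2a 00 04 4e 74 58 00 00 98 00"):
        print(decode_write10(cdb))
```

Both writes are ordinary, fairly small requests (256 and 152 blocks), so nothing in the commands themselves looks unusual; the failure is reported by the host side (`hostbyte=DID_ERROR`).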
 
Vladimir,

1 - Three 2 TB SAS drives;
2 - ZFS RAIDZ-1
3 - 32 GB RAM, no SSD; the ZFS settings are the defaults from the Proxmox installation
4 - The backups vary in size, from a few MB to 20-30 GB.
5 - Tried it from the NAS (Synology DS218j) and from local storage, same result
6 - The network is all 1 Gbit.

The drives came in RAID mode. No firmware was installed; only the mode was changed.
 
JMM, what is the status of your hard disk?
The log shows:
print_req_error: I/O error, dev sdb, sector 72249616
print_req_error: I/O error, dev sdb, sector 72250456
You should check the SMART status of sdb.
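
Before blaming a single disk, it can help to tally the I/O errors per device across the whole syslog and see whether only sdb is affected. A minimal Python sketch follows; the function name `io_errors_by_device` is hypothetical and the regular expression matches only the `print_req_error` format shown in this thread.

```python
# Sketch: group "print_req_error" sectors by block device.
import re
from collections import defaultdict

IO_ERROR = re.compile(r"print_req_error: I/O error, dev (\w+), sector (\d+)")


def io_errors_by_device(lines):
    """Map each device name to the list of failing sectors found in lines."""
    errors = defaultdict(list)
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            errors[match.group(1)].append(int(match.group(2)))
    return dict(errors)


if __name__ == "__main__":
    sample = [
        "print_req_error: I/O error, dev sdb, sector 72249616",
        "print_req_error: I/O error, dev sdb, sector 72250456",
    ]
    print(io_errors_by_device(sample))
```

If errors turn up on several devices, or keep appearing after the suspect disk is pulled, the controller, cabling, or driver becomes the more likely cause.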
 
adhiete, I removed that disk from the server and still have the same issues.
 
Please try updating the firmware for the system (all components, especially the RAID card) to the latest available version.
If this does not resolve the issue, please run a memory test and check the hardware for potential physical problems (e.g. a cable not connected properly).

Hope this helps!
 
Hi Stoiko,

I have already done all of that. I am about to just return both servers, because I am running out of ideas and can't find any information about this issue.
 
Aug 28 14:42:11 pve1 kernel: [ 881.173934] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.9.1 12/07/2018
Hmm - while the version seems to fit, Dell lists a newer release date for the BIOS:

https://www.dell.com/support/home/us/en/04/product-support/product/poweredge-r430/drivers

- maybe run the Lifecycle Controller once more.

Apart from that, you could try setting the BIOS and RAID controller to factory defaults.
It may also be worth contacting Dell directly.

Sorry I can't provide anything more helpful!
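
When comparing against Dell's listing, it may help to read the date in the quoted `Hardware name` line unambiguously. SMBIOS defines the BIOS release date string as mm/dd/yyyy, so, assuming the kernel prints that field unchanged, `12/07/2018` is 7 December 2018. A small Python sketch (the `parse_bios` name is hypothetical) that extracts both fields:

```python
# Sketch: extract BIOS version and date from a kernel "Hardware name" line.
import re
from datetime import datetime

BIOS = re.compile(r"BIOS (\S+) (\d{2}/\d{2}/\d{4})")


def parse_bios(line: str):
    """Return (version, date) from a 'Hardware name: ... BIOS x.y.z mm/dd/yyyy' line."""
    match = BIOS.search(line)
    if match is None:
        raise ValueError("no BIOS field found")
    # SMBIOS dates are month-first (mm/dd/yyyy).
    released = datetime.strptime(match.group(2), "%m/%d/%Y").date()
    return match.group(1), released


if __name__ == "__main__":
    line = "Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.9.1 12/07/2018"
    version, released = parse_bios(line)
    print(version, released.isoformat())
```

The date embedded in the firmware is not necessarily the date the package was published on the vendor site, so a mismatch between the two is not by itself proof that the update failed.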
 
That's what I thought too... the BIOS update I downloaded is the Feb 2019 one, but it shows up there as Jul 2018.

I'm getting a little disappointed with Dell and all these 'details', maybe it's time to try new brands...
 
@JMM
LOL :D Wait till you try HP ;)

On the serious side of things, I'd suggest you try different drives and different controllers. I guess grabbing a used H330 is cheaper than shipping two servers back, so that's that. But I have the most doubts about the drives: the sdb drive seems to be causing issues. Try removing all drives except the first one and run the installation on a single drive. Or maybe you have a totally different drive available.

Again, not that I'm promoting Dell, but I have two R630s running perfectly well with Proxmox. I use SATA disks with the H330. I also encourage you to check for the latest updates for the BIOS, the Lifecycle Controller and the H330 Mini RAID card. Sometimes the automatic update does not work and you may need to download the firmware manually and install it via iDRAC.
 