7 OSDs down across two nodes since upgrading to v8 - HELP!

leex12

Member
Mar 11, 2022
Some strange stuff has been happening since I upgraded to v8 this week.

I have a six-node cluster with Ceph. The actual upgrade process was fine; I did one node a day over the course of the week and everything seemed fine.

I then had an issue with two nodes which have started to drop their OSDs. Between the two nodes I am down 7 OSDs. When it was one drive I thought OK, maybe hardware, but not this many over a 24-36 hour window. '573 daemons have recently crashed' was not happening before the upgrade, and these crashes seem to point to the 7 OSDs.

Before I start doing something radical and probably stupid, are there any recommendations on how to approach this?
 
Code:
Jun 14 07:00:26 pve03 systemd[1]: Starting ceph-osd@9000.service - Ceph object storage daemon osd.9000...
Jun 14 07:00:26 pve03 systemd[1]: Started ceph-osd@9000.service - Ceph object storage daemon osd.9000.
Jun 14 07:00:34 pve03 systemd[1]: ceph-osd@9000.service: Main process exited, code=killed, status=6/ABRT
Jun 14 07:00:34 pve03 systemd[1]: ceph-osd@9000.service: Failed with result 'signal'.
Jun 14 07:00:34 pve03 systemd[1]: ceph-osd@9000.service: Consumed 3.614s CPU time.
Jun 14 07:00:44 pve03 systemd[1]: ceph-osd@9000.service: Scheduled restart job, restart counter is at 3.
Jun 14 07:00:44 pve03 systemd[1]: Stopped ceph-osd@9000.service - Ceph object storage daemon osd.9000.
Jun 14 07:00:44 pve03 systemd[1]: ceph-osd@9000.service: Consumed 3.614s CPU time.
Jun 14 07:00:44 pve03 systemd[1]: ceph-osd@9000.service: Start request repeated too quickly.
Jun 14 07:00:44 pve03 systemd[1]: ceph-osd@9000.service: Failed with result 'signal'.
Jun 14 07:00:44 pve03 systemd[1]: Failed to start ceph-osd@9000.service - Ceph object storage daemon osd.9000.

I have looked at a few of the failed OSDs and this pattern seems standard.
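For reference, the '573 daemons have recently crashed' warning comes from Ceph's crash module; the reports behind it can be listed and inspected like this (a minimal sketch; the crash ID is a placeholder):

Code:
# List all recorded crash reports
ceph crash ls
# Show the stack trace and metadata for one report
ceph crash info <crash-id>
# Acknowledge all reports, which clears the health warning
ceph crash archive-all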
 
So I went radical... physically removed the drives from the two nodes, reformatted them, and recreated new OSDs. They work for a while, then crap out.
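(For anyone repeating this, the destroy/recreate cycle on PVE looks roughly like the sketch below; the OSD ID and device name are examples.)

Code:
# Take the OSD out of the cluster and stop its daemon
ceph osd out osd.9000
systemctl stop ceph-osd@9000.service
# Destroy the OSD and clean up its partitions on the disk
pveceph osd destroy 9000 --cleanup 1
# Create a fresh OSD on the wiped device
pveceph osd create /dev/sdX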

I have run three drives against a test program and they are passing, so I don't think we are looking at hard drive failure.
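(smartmontools is already installed on PVE, so an extended self-test is one way to cross-check; /dev/sdX is a placeholder.)

Code:
# Start an extended (long) SMART self-test
smartctl -t long /dev/sdX
# Review the result and the SMART attributes once it completes
smartctl -a /dev/sdX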

The issue has happened across two nodes, so I struggle to see it as a server failure. I have managed to get a couple of external USB drives working. Both servers have an SSD which is partitioned between boot and an OSD, and that is still going.
 
Hi,
please share the output of pveversion -v. The log says that the service failed because it got a signal. Is there anything more in the system journal around the time the issue happens? Assuming it's a segfault or similar, you can try apt install systemd-coredump; then you should get a core dump the next time it crashes.
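For example, something like this (the unit name and time window are placeholders taken from the log above):

Code:
# Journal around one of the crashes
journalctl -u ceph-osd@9000.service --since "2024-06-14 06:55" --until "2024-06-14 07:05"
# Kernel messages from the current boot
journalctl -k | grep -iE 'dmar|iommu|segfault'
# After installing systemd-coredump, list and inspect the dumps
coredumpctl list ceph-osd
coredumpctl info ceph-osd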
 
I have done a quick and dirty spreadsheet across the versions (attached); pve3 + pve4 are the nodes that have the problem.

Code:
proxmox-ve: 8.2.0 (running kernel: 6.8.4-3-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-13
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
pve-kernel-5.15.152-1-pve: 5.15.152-1
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph: 17.2.7-pve3
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.2
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.3-1
proxmox-backup-file-restore: 3.2.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2
 

Attachments

  • proxmox Versions.zip
    8.4 KB
Below is a recreated OSD... it stayed up for a few hours, then died.

grep -Hn 'ERR' /var/log/ceph/ceph-osd.9101.log
Code:
/var/log/ceph/ceph-osd.9101.log:28764:2024-06-16T21:52:08.451+0100 754587c8a3c0 -1  ** ERROR: osd init failed: (5) Input/output error
/var/log/ceph/ceph-osd.9101.log:30201:2024-06-16T21:52:22.652+0100 75c64c2933c0 -1  ** ERROR: osd init failed: (5) Input/output error
/var/log/ceph/ceph-osd.9101.log:31638:2024-06-16T21:52:36.709+0100 738fcb01c3c0 -1  ** ERROR: osd init failed: (5) Input/output error
/var/log/ceph/ceph-osd.9101.log:33075:2024-06-17T09:29:11.159+0100 7525f68873c0 -1  ** ERROR: osd init failed: (5) Input/output error
/var/log/ceph/ceph-osd.9101.log:34512:2024-06-17T09:29:25.180+0100 7a3ce3e913c0 -1  ** ERROR: osd init failed: (5) Input/output error
 
When I look in the full log I see this: "_verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0xbfa2820a, device location [0x9627286000~1000], logical extent 0x100000~1000, object #-1:2c740c03:::osdmap.194823:0#"
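(To gauge how widespread these are, the same pattern can be grepped across every OSD log, e.g.:)

Code:
# Count checksum failures per OSD log file
grep -c '_verify_csum' /var/log/ceph/ceph-osd.*.log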
 

Attachments

  • ceph-osd.9101.zip
    318.9 KB

I see a bunch of these errors on pve03 when I tried to add a new drive:
Code:
Jun 17 09:29:54 pve03 kernel: DMAR: ERROR: DMA PTE for vPFN 0x7ee69 already set (to 7ee69003 not 262743001)

The CRC errors look very similar to an old issue relating to the kernel.

Any thoughts or suggestions? I am left thinking I should just kill the node and re-install.
 
So I did the fresh install to see if that changed anything... it didn't. The node had been working fine with the boot disk and an external disk. I re-added an SSD for Ceph. It worked fine for over an hour, then my console erupted with DMAR: ERROR: DMA PTE for vPFN messages.

This is in the system log, and there are a lot of them:

Code:
 DMAR: ERROR: DMA PTE for vPFN 0x83b83 already set (to 83b83003 not 33d6ad003)
Jun 18 18:28:22 pve03 kernel: ------------[ cut here ]------------
Jun 18 18:28:22 pve03 kernel: WARNING: CPU: 7 PID: 0 at drivers/iommu/intel/iommu.c:2214 __domain_mapping+0x375/0x4f0
Jun 18 18:28:22 pve03 kernel: Modules linked in: ceph libceph netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel nf_tables nvme_fabrics nvme_core nvme_auth 8021q garp mrp bonding tls softdog sunrpc nfnetlink_log nfnetlink binfmt_misc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_pmc_core_pltdrv intel_pmc_core intel_vsec pmt_telemetry pmt_class intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd rapl dell_wmi jc42 dell_smbios acpi_power_meter dell_wmi_descriptor ledtrig_audio joydev ipmi_si acpi_ipmi mgag200 sparse_keymap mei_me input_leds ipmi_devintf dcdbas i2c_algo_bit pcspkr intel_cstate intel_wmi_thunderbolt ee1004 mei intel_pch_thermal ie31200_edac ipmi_msghandler mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap efi_pstore
Jun 18 18:28:22 pve03 kernel:  dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq mlx4_ib ib_uverbs ib_core ses mlx4_en enclosure scsi_transport_sas hid_logitech_hidpp hid_logitech_dj hid_generic usbkbd usbmouse usbhid hid dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c uas usb_storage xhci_pci xhci_pci_renesas i2c_i801 crc32_pclmul xhci_hcd i2c_smbus mlx4_core tg3 ahci megaraid_sas libahci video wmi
Jun 18 18:28:22 pve03 kernel: CPU: 7 PID: 0 Comm: swapper/7 Tainted: P        W  O       6.8.8-1-pve #1
Jun 18 18:28:22 pve03 kernel: Hardware name: Dell Inc. PowerEdge R230/0FRVY0, BIOS 2.20.0 02/22/2024
Jun 18 18:28:22 pve03 kernel: RIP: 0010:__domain_mapping+0x375/0x4f0
Jun 18 18:28:22 pve03 kernel: Code: 48 89 c2 4c 89 4d b0 48 c7 c7 78 5d c3 a1 e8 92 6e 6d ff 8b 05 b0 6b 9e 01 4c 8b 4d b0 85 c0 74 09 83 e8 01 89 05 9f 6b 9e 01 <0f> 0b e9 fe fe ff ff 8b 45 c4 4c 89 ee 4c 89 f7 8d 58 01 48 8b 45
Jun 18 18:28:22 pve03 kernel: RSP: 0018:ffffb339c0294a30 EFLAGS: 00010246
Jun 18 18:28:22 pve03 kernel: RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
Jun 18 18:28:22 pve03 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jun 18 18:28:22 pve03 kernel: RBP: ffffb339c0294ac0 R08: 0000000000000000 R09: ffff9dcb4175dc18
Jun 18 18:28:22 pve03 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff9dcb4175dc18
Jun 18 18:28:22 pve03 kernel: R13: ffff9dcb402d7900 R14: 0000000000000001 R15: 000000033d6ad003
Jun 18 18:28:22 pve03 kernel: FS:  0000000000000000(0000) GS:ffff9dce9fd80000(0000) knlGS:0000000000000000
Jun 18 18:28:22 pve03 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 18 18:28:22 pve03 kernel: CR2: 0000706fefaf9f00 CR3: 000000015d036002 CR4: 00000000003706f0
Jun 18 18:28:22 pve03 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 18 18:28:22 pve03 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jun 18 18:28:22 pve03 kernel: Call Trace:
Jun 18 18:28:22 pve03 kernel:  <IRQ>
Jun 18 18:28:22 pve03 kernel:  ? show_regs+0x6d/0x80
Jun 18 18:28:22 pve03 kernel:  ? __warn+0x89/0x160
Jun 18 18:28:22 pve03 kernel:  ? __domain_mapping+0x375/0x4f0
Jun 18 18:28:22 pve03 kernel:  ? report_bug+0x17e/0x1b0
Jun 18 18:28:22 pve03 kernel:  ? handle_bug+0x46/0x90
Jun 18 18:28:22 pve03 kernel:  ? exc_invalid_op+0x18/0x80
Jun 18 18:28:22 pve03 kernel:  ? asm_exc_invalid_op+0x1b/0x20
Jun 18 18:28:22 pve03 kernel:  ? __domain_mapping+0x375/0x4f0
Jun 18 18:28:22 pve03 kernel:  ? post_alloc_hook+0xcc/0x120
Jun 18 18:28:22 pve03 kernel:  ? kmem_cache_alloc+0x133/0x360
Jun 18 18:28:22 pve03 kernel:  intel_iommu_map_pages+0xe1/0x140
Jun 18 18:28:22 pve03 kernel:  ? alloc_iova+0x259/0x290
Jun 18 18:28:22 pve03 kernel:  __iommu_map+0x11e/0x280
Jun 18 18:28:22 pve03 kernel:  iommu_map+0x43/0xd0
Jun 18 18:28:22 pve03 kernel:  __iommu_dma_map+0x89/0xf0
Jun 18 18:28:22 pve03 kernel:  iommu_dma_map_page+0xc0/0x2c0
Jun 18 18:28:22 pve03 kernel:  dma_map_page_attrs+0x6f/0x2c0
Jun 18 18:28:22 pve03 kernel:  mlx4_en_prepare_rx_desc+0x161/0x1b0 [mlx4_en]
Jun 18 18:28:22 pve03 kernel:  mlx4_en_process_rx_cq+0xb2f/0x1010 [mlx4_en]
Jun 18 18:28:22 pve03 kernel:  mlx4_en_poll_rx_cq+0x6d/0x100 [mlx4_en]
Jun 18 18:28:22 pve03 kernel:  __napi_poll+0x30/0x200
Jun 18 18:28:22 pve03 kernel:  net_rx_action+0x181/0x2e0
Jun 18 18:28:22 pve03 kernel:  __do_softirq+0xd6/0x31c
Jun 18 18:28:22 pve03 kernel:  __irq_exit_rcu+0xd7/0x100
Jun 18 18:28:22 pve03 kernel:  irq_exit_rcu+0xe/0x20
Jun 18 18:28:22 pve03 kernel:  common_interrupt+0xa4/0xb0
Jun 18 18:28:22 pve03 kernel:  </IRQ>
Jun 18 18:28:22 pve03 kernel:  <TASK>
Jun 18 18:28:22 pve03 kernel:  asm_common_interrupt+0x27/0x40
Jun 18 18:28:22 pve03 kernel: RIP: 0010:cpuidle_enter_state+0xce/0x470
Jun 18 18:28:22 pve03 kernel: Code: 11 03 ff e8 f4 ee ff ff 8b 53 04 49 89 c6 0f 1f 44 00 00 31 ff e8 42 01 02 ff 80 7d d7 00 0f 85 e7 01 00 00 fb 0f 1f 44 00 00 <45> 85 ff 0f 88 83 01 00 00 49 63 d7 4c 89 f1 48 8d 04 52 48 8d 04
Jun 18 18:28:22 pve03 kernel: RSP: 0018:ffffb339c011fe50 EFLAGS: 00000246
Jun 18 18:28:22 pve03 kernel: RAX: 0000000000000000 RBX: ffff9dce9fdd5c70 RCX: 0000000000000000
Jun 18 18:28:22 pve03 kernel: RDX: 0000000000000007 RSI: 0000000000000000 RDI: 0000000000000000
Jun 18 18:28:22 pve03 kernel: RBP: ffffb339c011fe88 R08: 0000000000000000 R09: 0000000000000000
Jun 18 18:28:22 pve03 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
Jun 18 18:28:22 pve03 kernel: R13: ffffffffa266fa80 R14: 000005e69d14faa9 R15: 0000000000000001
Jun 18 18:28:22 pve03 kernel:  cpuidle_enter+0x2e/0x50
Jun 18 18:28:22 pve03 kernel:  call_cpuidle+0x23/0x60
Jun 18 18:28:22 pve03 kernel:  do_idle+0x207/0x260
Jun 18 18:28:22 pve03 kernel:  cpu_startup_entry+0x2a/0x30
Jun 18 18:28:22 pve03 kernel:  start_secondary+0x119/0x140
Jun 18 18:28:22 pve03 kernel:  secondary_startup_64_no_verify+0x184/0x18b
Jun 18 18:28:22 pve03 kernel:  </TASK>
Jun 18 18:28:22 pve03 kernel: ---[ end trace 0000000000000000 ]---
Jun 18 18:28:26 pve03 kernel: DMAR: ERROR: DMA PTE for vPFN 0x83b82 already set (to 83b82003 not 356b31003)
Jun 18 18:28:26 pve03 kernel: ------------[ cut here ]------------
Jun 18 18:28:26 pve03 kernel: WARNING: CPU: 0 PID: 0 at drivers/iommu/intel/iommu.c:2214 __domain_mapping+0x375/0x4f0
Jun 18 18:28:26 pve03 kernel: Modules linked in: ceph libceph netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel nf_tables nvme_fabrics nvme_core nvme_auth 8021q garp mrp bonding tls softdog sunrpc nfnetlink_log nfnetlink binfmt_misc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_pmc_core_pltdrv intel_pmc_core intel_vsec pmt_telemetry pmt_class intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd rapl dell_wmi jc42 dell_smbios acpi_power_meter dell_wmi_descriptor ledtrig_audio joydev ipmi_si acpi_ipmi mgag200 sparse_keymap mei_me input_leds ipmi_devintf dcdbas i2c_algo_bit pcspkr intel_cstate intel_wmi_thunderbolt ee1004 mei intel_pch_thermal ie31200_edac ipmi_msghandler mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap efi_pstore
Jun 18 18:28:26 pve03 kernel:  dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq mlx4_ib ib_uverbs ib_core ses mlx4_en enclosure scsi_transport_sas hid_logitech_hidpp hid_logitech_dj hid_generic usbkbd usbmouse usbhid hid dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c uas usb_storage xhci_pci xhci_pci_renesas i2c_i801 crc32_pclmul xhci_hcd i2c_smbus mlx4_core tg3 ahci megaraid_sas libahci video wmi
Jun 18 18:28:26 pve03 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: P        W  O       6.8.8-1-pve #1
Jun 18 18:28:26 pve03 kernel: Hardware name: Dell Inc. PowerEdge R230/0FRVY0, BIOS 2.20.0 02/22/2024
Jun 18 18:28:26 pve03 kernel: RIP: 0010:__domain_mapping+0x375/0x4f0
Jun 18 18:28:26 pve03 kernel: Code: 48 89 c2 4c 89 4d b0 48 c7 c7 78 5d c3 a1 e8 92 6e 6d ff 8b 05 b0 6b 9e 01 4c 8b 4d b0 85 c0 74 09 83 e8 01 89 05 9f 6b 9e 01 <0f> 0b e9 fe fe ff ff 8b 45 c4 4c 89 ee 4c 89 f7 8d 58 01 48 8b 45
Jun 18 18:28:26 pve03 kernel: RSP: 0018:ffffb339c0003a30 EFLAGS: 00010246
Jun 18 18:28:26 pve03 kernel: RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
Jun 18 18:28:26 pve03 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jun 18 18:28:26 pve03 kernel: RBP: ffffb339c0003ac0 R08: 0000000000000000 R09: ffff9dcb4175dc10
Jun 18 18:28:26 pve03 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff9dcb4175dc10
Jun 18 18:28:26 pve03 kernel: R13: ffff9dcb402d7900 R14: 0000000000000001 R15: 0000000356b31003
Jun 18 18:28:26 pve03 kernel: FS:  0000000000000000(0000) GS:ffff9dce9fa00000(0000) knlGS:0000000000000000
Jun 18 18:28:26 pve03 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 18 18:28:26 pve03 kernel: CR2: 000061c82c255000 CR3: 000000015d036006 CR4: 00000000003706f0
Jun 18 18:28:26 pve03 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 18 18:28:26 pve03 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jun 18 18:28:26 pve03 kernel: Call Trace:
Jun 18 18:28:26 pve03 kernel:  <IRQ>
Jun 18 18:28:26 pve03 kernel:  ? show_regs+0x6d/0x80
Jun 18 18:28:26 pve03 kernel:  ? __warn+0x89/0x160
Jun 18 18:28:26 pve03 kernel:  ? __domain_mapping+0x375/0x4f0
Jun 18 18:28:26 pve03 kernel:  ? report_bug+0x17e/0x1b0
Jun 18 18:28:26 pve03 kernel:  ? handle_bug+0x46/0x90
Jun 18 18:28:26 pve03 kernel:  ? exc_invalid_op+0x18/0x80
Jun 18 18:28:26 pve03 kernel:  ? asm_exc_invalid_op+0x1b/0x20
Jun 18 18:28:26 pve03 kernel:  ? __domain_mapping+0x375/0x4f0
Jun 18 18:28:26 pve03 kernel:  ? post_alloc_hook+0xcc/0x120
Jun 18 18:28:26 pve03 kernel:  ? kmem_cache_alloc+0x133/0x360
Jun 18 18:28:26 pve03 kernel:  intel_iommu_map_pages+0xe1/0x140
Jun 18 18:28:26 pve03 kernel:  ? alloc_iova+0x259/0x290
Jun 18 18:28:26 pve03 kernel:  __iommu_map+0x11e/0x280
Jun 18 18:28:26 pve03 kernel:  iommu_map+0x43/0xd0
Jun 18 18:28:26 pve03 kernel:  __iommu_dma_map+0x89/0xf0
Jun 18 18:28:26 pve03 kernel:  iommu_dma_map_page+0xc0/0x2c0
Jun 18 18:28:26 pve03 kernel:  dma_map_page_attrs+0x6f/0x2c0
Jun 18 18:28:26 pve03 kernel:  mlx4_en_prepare_rx_desc+0x161/0x1b0 [mlx4_en]
Jun 18 18:28:26 pve03 kernel:  mlx4_en_process_rx_cq+0xb2f/0x1010 [mlx4_en]
Jun 18 18:28:26 pve03 kernel:  mlx4_en_poll_rx_cq+0x6d/0x100 [mlx4_en]
Jun 18 18:28:26 pve03 kernel:  __napi_poll+0x30/0x200
Jun 18 18:28:26 pve03 kernel:  net_rx_action+0x181/0x2e0
Jun 18 18:28:26 pve03 kernel:  __do_softirq+0xd6/0x31c
Jun 18 18:28:26 pve03 kernel:  __irq_exit_rcu+0xd7/0x100
Jun 18 18:28:26 pve03 kernel:  irq_exit_rcu+0xe/0x20
Jun 18 18:28:26 pve03 kernel:  common_interrupt+0xa4/0xb0
Jun 18 18:28:26 pve03 kernel:  </IRQ>
Jun 18 18:28:26 pve03 kernel:  <TASK>
Jun 18 18:28:26 pve03 kernel:  asm_common_interrupt+0x27/0x40
Jun 18 18:28:26 pve03 kernel: RIP: 0010:cpuidle_enter_state+0xce/0x470
Jun 18 18:28:26 pve03 kernel: Code: 11 03 ff e8 f4 ee ff ff 8b 53 04 49 89 c6 0f 1f 44 00 00 31 ff e8 42 01 02 ff 80 7d d7 00 0f 85 e7 01 00 00 fb 0f 1f 44 00 00 <45> 85 ff 0f 88 83 01 00 00 49 63 d7 4c 89 f1 48 8d 04 52 48 8d 04
Jun 18 18:28:26 pve03 kernel: RSP: 0018:ffffffffa2403db8 EFLAGS: 00000246
Jun 18 18:28:26 pve03 kernel: RAX: 0000000000000000 RBX: ffff9dce9fa55c70 RCX: 0000000000000000
Jun 18 18:28:26 pve03 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jun 18 18:28:26 pve03 kernel: RBP: ffffffffa2403df0 R08: 0000000000000000 R09: 0000000000000000
Jun 18 18:28:26 pve03 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
Jun 18 18:28:26 pve03 kernel: R13: ffffffffa266fa80 R14: 000005e78e81a0de R15: 0000000000000001
Jun 18 18:28:26 pve03 kernel:  ? cpuidle_enter_state+0xbe/0x470
Jun 18 18:28:26 pve03 kernel:  cpuidle_enter+0x2e/0x50
Jun 18 18:28:26 pve03 kernel:  call_cpuidle+0x23/0x60
Jun 18 18:28:26 pve03 kernel:  do_idle+0x207/0x260
Jun 18 18:28:26 pve03 kernel:  cpu_startup_entry+0x2a/0x30
Jun 18 18:28:26 pve03 kernel:  rest_init+0xd0/0xd0
Jun 18 18:28:26 pve03 kernel:  arch_call_rest_init+0xe/0x30
Jun 18 18:28:26 pve03 kernel:  start_kernel+0x729/0xb00
Jun 18 18:28:26 pve03 kernel:  x86_64_start_reservations+0x18/0x30
Jun 18 18:28:26 pve03 kernel:  x86_64_start_kernel+0xbf/0x110
Jun 18 18:28:26 pve03 kernel:  secondary_startup_64_no_verify+0x184/0x18b
Jun 18 18:28:26 pve03 kernel:  </TASK>
Jun 18 18:28:26 pve03 kernel: ---[ end trace 0000000000000000 ]---
Jun 18 18:28:27 pve03 kernel: DMAR: ERROR: DMA PTE for vPFN 0x83b81 already set (to 83b81003 not 325641003)
Jun 18 18:28:27 pve03 kernel: ------------[ cut here ]------------
 
Googling around this... a number of non-Proxmox folks say the issue is related to NIC drivers and VT-d?
 
@fiona I really need some guidance here, but I think I may have got to the bottom of this...

So to recap: a six-server cluster, four of which are Dell R230s. Two of those have come through the upgrade process fine, and two haven't. I upgraded my Ceph version ages ago and had been running with no issues. I made the jump to 8 and the problems started.

On the two servers which have issues, the boot disk has survived with no issues. This drive is partitioned to also hold a Ceph OSD, and that is all working fine. What occurred to me last night is that this drive is not running off the Dell controller but off a SATA port on the motherboard. As I mentioned, I have thrown a couple of external USB drives onto the two servers that have issues and have managed to get a Ceph OSD running on those.

So the issue has to be related to the controller.

If I do an lsmod and scan through, the two 'failed' servers are picking up megaraid_sas (192512) / scsi_transport_sas 53248 1 ses,
and the two that have continued to work are picking up mpt3sas (364544) / scsi_transport_sas 53248 1 mpt3sas.

The four servers should be identical other than the drives, so I am not sure why the difference in driver. Are there updates to the megaraid drivers that I could try?
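For completeness, one way to confirm which kernel driver is actually bound to each controller, rather than inferring it from lsmod (a quick sketch):

Code:
# Show storage controllers and the driver bound to each
lspci -nnk | grep -A3 -iE 'raid|sas'
# The "Kernel driver in use:" line should name megaraid_sas or mpt3sas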

Full listing attached
 

Attachments

  • proxmox Versions (2).zip
    14.3 KB
The kernel traces clearly mention iommu, so I would first try with intel_iommu=off. They also show you are already using 6.8.8-1-pve. There is no newer kernel package yet, so you could try booting into kernel 6.5 instead to find out if the issue is kernel-related.
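For example, depending on how the node boots (a sketch; append to the existing options rather than replacing them, and take the exact 6.5 version from the kernel list):

Code:
# GRUB-booted systems: append intel_iommu=off to GRUB_CMDLINE_LINUX_DEFAULT
# in /etc/default/grub, then regenerate the config
update-grub

# systemd-boot (e.g. ZFS root): append intel_iommu=off to /etc/kernel/cmdline, then
proxmox-boot-tool refresh

# To try kernel 6.5 instead
apt install proxmox-kernel-6.5
proxmox-boot-tool kernel list            # note the installed 6.5 version
proxmox-boot-tool kernel pin <version>   # boot it by default

reboot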
 
@fiona thanks very much for the support! Switching IOMMU back off has stopped the crashes! That helps my paranoia, as it was defaulted to off in v7, so I hadn't imagined the issue being related to the upgrade.

I checked the BIOS on my four Dell R230s and it's all the same. The controllers were all Dell 330s which I flashed to IT mode, so I am not sure why two of the servers are picking up megaraid_sas and the other two mpt3sas. I built them all at the same time, and I thought in the same way.

I have not changed the setting on the mpt3sas machines as they are working. Do you think I should switch it off, just in case?

I don't think we have done enough here to prove a bug, but for sure there is something with ceph/iommu/mpt3sas that needs further investigation, or at least an honourable mention in the release notes?
 
I have not changed the setting on the mpt3sas machines as they are working. Do you think I should switch it off, just in case?
If you do not require the setting, and since it caused issues on similar servers, it's probably better to turn it off there too.
I don't think we have done enough here to prove a bug, but for sure there is something with ceph/iommu/mpt3sas that needs further investigation, or at least an honourable mention in the release notes?
The known issues section already mentions that the new iommu default can cause issues with certain hardware.
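A quick way to verify where a node stands after rebooting:

Code:
# Confirm the option made it onto the kernel command line
cat /proc/cmdline
# Check whether DMAR/IOMMU still comes up on this boot
dmesg | grep -iE 'DMAR|IOMMU' | head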
 
