Recent content by leex12

  1. 7 OSD down across two nodes Issues Since Upgrading to v8 - HELP!

    @fiona thanks very much for the support! Switching iommu back off has stopped the crash! That helps my paranoia, as it defaulted to off in v7, so I didn't imagine the issue being related to the upgrade. I checked the BIOS on my four Dell R230 servers and it's all the same. The controllers were...
  2. 7 OSD down across two nodes Issues Since Upgrading to v8 - HELP!

    As a real dumb question - once I have made that change, how can I check that it has been applied?
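Assuming the change in question was a kernel boot parameter (e.g. `intel_iommu=off` added via `/etc/default/grub`), one way to confirm it actually took effect after the reboot is to check the running kernel's command line; a minimal sketch:

```shell
# Show the command line the running kernel actually booted with;
# after a reboot the new parameter (e.g. intel_iommu=off) should appear here.
cat /proc/cmdline

# The kernel log also reports IOMMU/DMAR state at boot, e.g.:
#   dmesg | grep -i -e DMAR -e IOMMU
# (may need root; a line like "DMAR: IOMMU enabled" means VT-d is still active)
```

If the parameter is missing from `/proc/cmdline`, the bootloader config was likely edited but `update-grub` (or the equivalent for your boot setup) was not re-run.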
  3. 7 OSD down across two nodes Issues Since Upgrading to v8 - HELP!

    @fiona really need some guidance here but I think I may have got to the bottom of this ... So to recap .. 6 server cluster, four of which are Dell R230 servers: two which have gone through the upgrade process and are all fine, and two which aren't. I upgraded my Ceph version ages ago and...
  4. 7 OSD down across two nodes Issues Since Upgrading to v8 - HELP!

    Googling around this .. a number of non-Proxmox folks saying the issue is related to NIC drivers and VT-d?
  5. Jumbo frame size set - netmtu in COROSYNC.CONF ?

    I have moved everything back to 8745
    Jun 18 18:38:44 pve01 corosync[1262]: [KNET ] pmtud: PMTUD completed for host: 2 link: 0 current link mtu: 8629
    Jun 18 18:38:44 pve01 corosync[1262]: [KNET ] pmtud: Starting PMTUD for host: 2 link: 1
    Jun 18 18:38:44 pve01 corosync[1262]: [KNET ] udp...
  6. Jumbo frame size set - netmtu in COROSYNC.CONF ?

    My question was: should I be setting netmtu to stop the messages? There is nothing exciting in my config:
    logging {
      debug: on
      to_syslog: yes
    }
    nodelist {
      node {
        name: pve01
        nodeid: 5
        quorum_votes: 1
        ring0_addr: 192.168.20.1
        ring1_addr: 10.107.0.1
      }
      node {...
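For reference, `netmtu` is a totem-section option in corosync.conf, not a nodelist one; a minimal sketch of where it would sit (the cluster name is a placeholder, and the 8745 value mirrors the figure validated between nodes elsewhere in this thread). Note that kronosnet normally discovers the usable path MTU itself via PMTUD, so setting this is optional:

```
totem {
  version: 2
  cluster_name: mycluster   # hypothetical name
  netmtu: 8745              # cap totem packets at the validated path MTU
}
```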
  7. 7 OSD down across two nodes Issues Since Upgrading to v8 - HELP!

    So I did the fresh install to see if that impacted anything .. it didn't. It had been working fine with the boot disc and an external disc. Re-added an SSD for Ceph. It worked fine for over an hour, then my console erupted with 'DMAR: ERROR: DMA PTE for vPFN'. This is in the system log and there are a lot of them...
  8. 7 OSD down across two nodes Issues Since Upgrading to v8 - HELP!

    I see a bunch of these errors on pve03 when I tried to add a new drive:
    Jun 17 09:29:54 pve03 kernel: DMAR: ERROR: DMA PTE for vPFN 0x7ee69 already set (to 7ee69003 not 262743001)
    The CRC errors look very similar to an old issue relating to the kernel. Any thoughts or suggestions? I am just...
  9. 7 OSD down across two nodes Issues Since Upgrading to v8 - HELP!

    When you look in the full log I see this: "_verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0xbfa2820a, device location [0x9627286000~1000], logical extent 0x100000~1000, object #-1:2c740c03:::osdmap.194823:0#"
  10. 7 OSD down across two nodes Issues Since Upgrading to v8 - HELP!

    Below is a recreated OSD .. stayed up for a few hours then died.
    grep -Hn 'ERR' /var/log/ceph/ceph-osd.9101.log
    /var/log/ceph/ceph-osd.9101.log:28764:2024-06-16T21:52:08.451+0100 754587c8a3c0 -1 ** ERROR: osd init failed: (5) Input/output error...
  11. 7 OSD down across two nodes Issues Since Upgrading to v8 - HELP!

    I have done a dirty spreadsheet across the versions, attached .. pve3 + pve4 are the nodes that have the problem
    proxmox-ve: 8.2.0 (running kernel: 6.8.4-3-pve)
    pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
    proxmox-kernel-helper: 8.1.0
    pve-kernel-5.15: 7.4-13
    proxmox-kernel-6.8...
  12. 7 OSD down across two nodes Issues Since Upgrading to v8 - HELP!

    So I went radical .. physically removed drives from the two nodes, reformatted them and recreated new OSDs. They will work for a while then crap out. I have run three drives against a test program and they are passing, so I don't think we are looking at hard drive failure. The issue has happened...
  13. Jumbo frame size set - netmtu in COROSYNC.CONF ?

    Not sure if I was snow blind to them in v7, but since upgrading to v8 I have tons of messages 'complaining' about MTU size. Historically I have had the MTU on NICs, bridges and VLANs set to 9000. Since seeing these messages I have validated that 8745 seems to be a sweet spot working between nodes...
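One way to validate a path MTU between nodes like the 8745 figure above is to ping with the don't-fragment bit set; the ICMP payload must be the MTU minus 28 bytes of IPv4+ICMP headers (the node address below is a placeholder):

```shell
# IPv4 header (20 B) + ICMP header (8 B) = 28 bytes of overhead,
# so for a link MTU of 8745 the largest unfragmented ping payload is:
PAYLOAD=$((8745 - 28))
echo "$PAYLOAD"    # prints 8717

# Probe between nodes with the don't-fragment bit set (Linux ping);
# 192.168.20.2 is a placeholder for another node's address:
#   ping -M do -c 3 -s "$PAYLOAD" 192.168.20.2
# If that succeeds but PAYLOAD+1 fails, 8745 is the usable path MTU.
```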
  14. 7 OSD down across two nodes Issues Since Upgrading to v8 - HELP!

    14 07:00:26 pve03 systemd[1]: Starting ceph-osd@9000.service - Ceph object storage daemon osd.9000...
    Jun 14 07:00:26 pve03 systemd[1]: Started ceph-osd@9000.service - Ceph object storage daemon osd.9000.
    Jun 14 07:00:34 pve03 systemd[1]: ceph-osd@9000.service: Main process exited, code=killed...