Recent content by FXKai

  1. [SOLVED] OSD's not starting after upgrade 18.2.4 -> 18.2.7

    We found a kind of solution ourselves. We had caching values set in our ceph.conf. Apparently, Ceph has become stricter about these since version 18.2.7 (we are now on 19.2.3), and the startup checks have become more picky about them. Example error message...
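    A sketch of how one could track such a rejected option down, assuming the OSDs run as systemd instance units; the OSD id 0 and the grep pattern are placeholders, not taken from the post:
      # show the most recent startup messages of a failing OSD since the last boot
      journalctl -b -u ceph-osd@0.service --no-pager | tail -n 50
      # list cache-related settings still present in the local ceph.conf
      grep -in cache /etc/ceph/ceph.conf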
  2. Incorrect file permission set on file `/etc/pve/ceph.conf` causes `ceph` user could not access the ceph config file in PVE cluster

    While it is ugly that this service has to be patched at all, the right, update-safe way of "fixing" the file would be to use override files:
    # mkdir /etc/systemd/system/ceph-mgr@.service.d
    # cat >/etc/systemd/system/ceph-mgr@.service.d/override.conf <<EOF
    [Service]
    ExecStart=...
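    For reference, a minimal sketch of what a complete drop-in could contain; the ExecStart line below is an assumption based on the stock ceph-mgr@.service shipped with the ceph packages, not the exact patch from the post, so copy the real line from /lib/systemd/system/ceph-mgr@.service and change only what you need:
      [Service]
      # an empty ExecStart= clears the command inherited from the packaged unit
      ExecStart=
      # hypothetical replacement command line
      ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph
    After systemctl daemon-reload and a restart of the ceph-mgr@<id> instance, the drop-in keeps working across package updates, unlike editing the shipped unit file in place.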
  3. [SOLVED] OSD's not starting after upgrade 18.2.4 -> 18.2.7

    Hi, after running the latest apt dist-upgrade and rebooting one of our storage servers, all OSD's on this node refuse to start. After downgrading all CEPH packages the OSD's start immediately. Any help appreciated :)
    Setup: 7 x CEPH Storage nodes, 3 x PVE Compute Nodes
    Log files of the...
  4. Kernel panic, machine stuck, task khugepaged:796 blocked for more than 120 seconds

    PS: some more findings
    ls -la /etc/pve/qemu-server/ gives you an idea which VMs are still running on this host
    ls -la /var/run/qemu-server/ still gives you access to the vnc, serial and qemu sockets (also for debugging)
    # still working
    qm migrate <id> <target server>
    qm terminal <id>
    qm...
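    A short sketch of how those still-working pieces can be used while the host is partially stuck; VM id 100, the target node name and the --online flag are illustrative assumptions:
      # config files visible here belong to VMs currently owned by this node
      ls -la /etc/pve/qemu-server/
      # the qemu control sockets (vnc, serial, qmp) remain usable for debugging
      ls -la /var/run/qemu-server/
      # move a still-responsive VM away from the affected host
      qm migrate 100 pve-node2 --online
      # attach to the VM's serial terminal (needs a serial port configured in the VM)
      qm terminal 100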
  5. [SOLVED] kernel panic since update to Proxmox 7.1

    PS: some more findings
    ls -la /etc/pve/qemu-server/ gives you an idea which VMs are still running on this host
    # still working
    qm migrate <id> <target server>
    ps ax blocks just before the stale VM, but you can still loop over the procfs to get process info, i.e.
    for i in `ls /proc |egrep...
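    The loop in the quoted post is cut off above; a sketch of one possible variant (the exact fields read from /proc are an assumption, not the original command):
      # walk the numeric /proc entries directly instead of calling ps,
      # which hangs as soon as it touches the stale VM's process
      for i in $(ls /proc | grep -E '^[0-9]+$'); do
          # print name, state and pid straight from the status file
          awk '/^Name|^State|^Pid/ {printf "%s ", $2} END {print ""}' /proc/$i/status 2>/dev/null
      done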
  6. Kernel panic, machine stuck, task khugepaged:796 blocked for more than 120 seconds

    Unfortunately, we are back in the same situation: the BIOS update did not solve the problem, but we can narrow it down to the fact that the blocking VM has its data on the local RAID controller (Perc H755, 4 x NVMe RAID-10). Other VMs (running purely on CEPH storage) are not affected.
  7. Kernel panic, machine stuck, task khugepaged:796 blocked for more than 120 seconds

    We updated the BIOS on our Dell R6525 and the Perc firmware, which helped to solve the problem. I guess the firmware of the Perc controller was the main issue.
  8. Kernel panic, machine stuck, task khugepaged:796 blocked for more than 120 seconds

    [1286374.147087] INFO: task khugepaged:796 blocked for more than 120 seconds.
    [1286374.154005] Tainted: P O 5.15.35-1-pve #1
    [1286374.160027] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [1286374.168042] task:khugepaged state:D stack: 0...
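    For reference, the knob mentioned in the log can also be read and set via sysctl; note that silencing the warning does not fix the underlying blocked task:
      # current timeout in seconds (0 disables the warning entirely)
      sysctl kernel.hung_task_timeout_secs
      # equivalent of the echo shown in the log line above
      sysctl -w kernel.hung_task_timeout_secs=0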
  9. Kernel panic, machine stuck, task khugepaged:796 blocked for more than 120 seconds

    # lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    Address sizes:         48 bits physical, 48 bits virtual
    CPU(s):                128
    On-line CPU(s) list:   0-127
    Thread(s)...
  10. Kernel panic, machine stuck, task khugepaged:796 blocked for more than 120 seconds

    Hello Everybody, not sure if this thread (https://forum.proxmox.com/threads/kernel-panic-since-update-to-proxmox-7-1.101164/#post-437435) is related but since we updated to PVE7.2 we have repeated crashes with kernel message INFO: task khugepaged:796 blocked for more than 120 seconds...
  11. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    Ahh perfect. This means after waiting a few days, your performance came back to normal?
  12. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    This means that your HW setup runs with NPS1 (see page 10 of https://developer.amd.com/wp-content/resources/56745_0.80.pdf for details). I'm not too much of an expert in the nasty details of the AMD Epyc NUMA architecture, but I would say that it is not your bottleneck (might give you some extra...
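    A quick way to verify the NPS setting from inside Linux is to check how many NUMA nodes the kernel sees (NPS1 on a single-socket Epyc shows exactly one node); a sketch:
      # NUMA node count and per-node CPU ranges as seen by the kernel
      lscpu | grep -i numa
      # same information including per-node memory sizes, if numactl is installed
      numactl --hardware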
  13. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    For the Nexus, I don't know which exact model you have, but one of the faster ones, a 3172PQ, seems to be around 850 ns; other models might be (significantly) slower, but you will need to google this. The Lenovo switch I mentioned above is around 570 ns, while the Mellanox goes down to 270 ns for...
  14. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    1.) We use 2 x Mellanox SX1012 with 40GE QSFP+ and MLAG (before we used 2 x Lenovo GE8124E on MLAG with 10GE QSFP, similar performance). Note: both switches, the Lenovo and the SX1012, are cut-through switches
    2.) default Linux/PVE driver with Mellanox 40GE CX-354 QSFP+
    3.) no Intel DPDK running...
  15. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    Hi Alibek, I would say that after 3-4 weeks of pulling my hair out, the cluster came back to normal operation speed. We actually never really figured out what the problem was, but our feeling is that the reconstruction of the OMAP data structures took quite some time in the background. We also...
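    If you want to check whether such background conversion or recovery is still running after an upgrade, the cluster status and per-OSD latencies give a rough indication; a sketch, not taken from the original post:
      # overall health plus any recovery/rebalance progress
      ceph -s
      # commit/apply latency per OSD; values should settle once background work is done
      ceph osd perf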