Search results

  1. Kernel panic, machine stuck, task khugepaged:796 blocked for more than 120 seconds

    PS: some more findings. ls -la /etc/pve/qemu-server/ gives you an idea which VMs are still running on this host, and ls -la /var/run/qemu-server/ still gives you access to the VNC, serial and QEMU sockets (also for debugging). Still working: qm migrate <id> <target server>, qm terminal <id>, qm...
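
    The commands quoted in this snippet, gathered as a minimal shell sketch (the <id> and <target server> placeholders are from the original post):

      # list VM configs and runtime sockets still visible on this host
      ls -la /etc/pve/qemu-server/
      ls -la /var/run/qemu-server/
      # these qm subcommands reportedly still work on the stuck node
      qm migrate <id> <target server>
      qm terminal <id>
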
  2. [SOLVED] kernel panic since update to Proxmox 7.1

    PS: some more findings. ls -la /etc/pve/qemu-server/ gives you an idea which VMs are still running on this host. Still working: qm migrate <id> <target server>. ps ax blocks just before the stale VM, but you can still loop over procfs to get process info, i.e. for i in `ls /proc |egrep...
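
    The procfs loop is truncated in the preview; a hedged sketch of the idea (which per-PID files to read is an assumption, not the original command):

      # walk /proc directly because `ps ax` hangs just before the stale VM;
      # printing the PID and process name is an illustrative choice
      for i in $(ls /proc | grep -E '^[0-9]+$'); do
          echo "$i $(cat /proc/$i/comm 2>/dev/null)"
      done
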
  3. Kernel panic, machine stuck, task khugepaged:796 blocked for more than 120 seconds

    Unfortunately, we are back in the same situation: the BIOS update did not solve the problem, but we can narrow it down to the blocking VM having data on the local RAID controller (PERC H755, 4 x NVMe RAID-10). Other VMs (running purely on Ceph storage) are not affected.
  4. Kernel panic, machine stuck, task khugepaged:796 blocked for more than 120 seconds

    We updated the BIOS on our Dell R6525 and the PERC firmware, which helped to solve the problem. I guess the firmware of the PERC controller was the main issue.
  5. Kernel panic, machine stuck, task khugepaged:796 blocked for more than 120 seconds

    [1286374.147087] INFO: task khugepaged:796 blocked for more than 120 seconds.
    [1286374.154005] Tainted: P O 5.15.35-1-pve #1
    [1286374.160027] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [1286374.168042] task:khugepaged state:D stack: 0...
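
    The log itself names the relevant knob; a small sketch based only on that hint:

      # the watchdog behind this warning; 120 matches the "120 seconds" in the log
      cat /proc/sys/kernel/hung_task_timeout_secs
      # as the message says, writing 0 only silences the warning, it does not unblock the task
      echo 0 > /proc/sys/kernel/hung_task_timeout_secs
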
  6. Kernel panic, machine stuck, task khugepaged:796 blocked for more than 120 seconds

    # lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    Address sizes:         48 bits physical, 48 bits virtual
    CPU(s):                128
    On-line CPU(s) list:   0-127
    Thread(s)...
  7. Kernel panic, machine stuck, task khugepaged:796 blocked for more than 120 seconds

    Hello everybody, not sure if this thread (https://forum.proxmox.com/threads/kernel-panic-since-update-to-proxmox-7-1.101164/#post-437435) is related, but since we updated to PVE 7.2 we have had repeated crashes with the kernel message INFO: task khugepaged:796 blocked for more than 120 seconds...
  8. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    Ah, perfect. This means that after waiting a few days, your performance came back to normal?
  9. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    This means that your HW setup runs with NPS1 (see page 10 of https://developer.amd.com/wp-content/resources/56745_0.80.pdf for details). I'm not too much of an expert in the nasty details of the AMD Epyc NUMA architecture, but I would say that it is not your bottleneck (might give you some extra...
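
    Not quoted in the thread, but a quick hedged way to confirm the NPS setting the kernel actually sees (standard tools, not commands from the original post):

      # with NPS1 an Epyc socket is exposed as a single NUMA node
      lscpu | grep -i 'numa node'
      numactl --hardware    # needs the numactl package
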
  10. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    For the Nexus, I don't know which exact model you have, but one of the faster ones, a 3172PQ, seems to be around 850 ns; other models might be (significantly) slower, but you will need to google this. The Lenovo switch I mentioned above is around 570 ns, while the Mellanox goes down to 270 ns for...
  11. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    1.) We use 2 x Mellanox SX1012 with 40GE QSFP+ and MLAG (before that we used 2 x Lenovo GE8124E on MLAG with 10GE QSFP, similar performance). Note: both switches, the Lenovo and the SX1012, are cut-through switches.
    2.) Default Linux/PVE driver with Mellanox 40GE CX-354 QSFP+.
    3.) No Intel DPDK running...
  12. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    Hi Alibek, I would say that after 3-4 weeks of pulling my hair out, the cluster came back to normal operating speed. We never really figured out what the problem was, but our feeling is that the reconstruction of the OMAP data structures took quite some time in the background. We also...
  13. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    Some more investigation reveals that since the day(s) we split our 7TB SSDs into 4 OSDs (around Dec 8th), the latencies on these OSDs dropped significantly and never spiked again, so we can at least say that this solved our issue. What caused the high latencies on these drives after the...
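
    The thread preview does not show the exact commands used for the split; a hedged sketch with ceph-volume (the device path is a placeholder and the old OSD is assumed to have been removed first):

      # create four OSDs on one large SSD/NVMe device
      ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1
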
  14. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    2 weeks later everything got "normal"; rados bench gives these values:
    Total time run:          60.0139
    Total writes made:       2301842
    Write size:              4096
    Object size:             4096
    Bandwidth (MB/sec):      149.825
    Stddev Bandwidth:        6.85404
    Max bandwidth (MB/sec):  163.844
    Min...
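
    For context, a rados bench invocation of the shape that produces 4 KiB write figures like these (pool name and thread count are assumptions, not taken from the post):

      # 60 s of 4 KiB object writes against an assumed pool named SSD
      rados bench -p SSD 60 write -b 4096 -t 16 --no-cleanup
      # remove the benchmark objects afterwards
      rados -p SSD cleanup
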
  15. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    We doubled our RAM but saw no difference. As I am starting to hunt down the bug, I would like to see the PVE compile flags for Ceph, or even compile it myself. Is there a guide on how to rebuild the Proxmox packages on your own?
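
    No guide is quoted in the preview; a heavily hedged sketch, assuming the Ceph packaging lives in the usual Proxmox git repositories and follows their common make deb convention:

      # assumption: repository path and build target follow the usual Proxmox packaging layout
      git clone git://git.proxmox.com/git/ceph.git
      cd ceph
      make deb    # expect to install the build dependencies first; the build takes a long time
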
  16. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    We have approx. 30GB cached, so I guess it's not this. Still, we will double the RAM soon. Meanwhile, I am hesitant to "escape" forward to Pacific unless I find some valid reasoning for the problem...
  17. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    What makes me wonder, and what I cannot explain, is that the cluster used to be I/O bottlenecked (if I interpret it correctly) and since the update this has changed. See the two example Ceph nodes below ... Updated PVE to PVE 7.x two days ago, running the latest kernel, but no changes.
  18. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    128 Queues ...
    root@xx-ceph01:~# fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -pool=SSD -runtime=30 -rbdname=testimg -iodepth=128
    test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
    fio-3.12
    Starting 1 process...
  19. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    Both, launched from a Ceph node or from a compute node, deliver the same result. I just got a warning that one of my Ceph nodes ran out of swap (while still having 30GB of Linux fs cache free); I don't know if this is related, but swapoff plus rados bench does not change a thing. Can it be that Octopus eats more RAM...
  20. [SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

    Our ceph.conf:
    [global]
         auth client required = cephx
         auth cluster required = cephx
         auth service required = cephx
         cluster network = xx
         fsid = aef995d0-0244-4a65-8b8a-2e75740b4cbb
         # keyring = /etc/pve/priv/$cluster.$name.keyring
         mon allow pool delete = true
         ...
