Search results

  1. T

    walk_pgd_range crash pve9.1 on 6.18+

    Only 1G hugepages. But recently I do not see those anymore. I upgraded the kernel and changed a few settings, e.g. async IO: threads (previously, on io_uring, I had these issues).
  2. T

    walk_pgd_range crash pve9.1 on 6.18+

    This also happens on normal 2K pages. It does not matter which kernel.
  3. T

    walk_pgd_range crash pve9.1 on 6.18+

    Just had another freeze on the 6.17.4-2-pve kernel. Exactly the same issue.
  4. T

    walk_pgd_range crash pve9.1 on 6.18+

    Well, I have set every server to 6.17 pve2. I had issues with passthrough + Blackwell before, but will see now as some time has passed. Some rebooted already and I have them on 6.17 now, so within 2-3 days I should see whether they hit this bug and reboot again or not. Also I have seen some other issue...
  5. T

    walk_pgd_range crash pve9.1 on 6.18+

    Hehe, so Google does not see everything. I am using this: https://prebuiltkernels.com/ Then I made a small script to install kernels from there: kernel="6.18.7-pbk"; deb="${kernel%-*}"; deb="${deb/-rc/~rc}-1"; rm -f ./linux-*.deb; wget...
  6. T

    walk_pgd_range crash pve9.1 on 6.18+

    Recently I reported a slab memory leak and it was fixed. I am having yet another issue and wondering where to report it. Would you be able to tell me if this is the right place, or should I send it to someone else? This issue also looks like a memory leak. It happens on multiple servers...
  7. T

    Passthrough RTX 6000/5090 CPU Soft BUG lockup, D3cold to D0, after guest shutdown

    Can you share your test suite? It might be interesting. We are testing with a mining program, but I am wondering if we could use something that exercises many different things. Also, if you have 2 CPUs then numa should be 1 in host. And if you are allocating more than 300GB RAM, consider using...
  8. T

    Passthrough RTX 6000/5090 CPU Soft BUG lockup, D3cold to D0, after guest shutdown

    6.17 should also be fine, but I see improvements in the newest kernels (6.17+), even in terms of boot time.
  9. T

    Passthrough RTX 6000/5090 CPU Soft BUG lockup, D3cold to D0, after guest shutdown

    Can you try a 6.18 kernel? For me those work best on the 5090 and 6000: https://prebuiltkernels.com/
  10. T

    Passthrough RTX 6000/5090 CPU Soft BUG lockup, D3cold to D0, after guest shutdown

    I have returned mine to NVIDIA and replaced them with 600W ones. Anyway, I did not solve the Max-Q ones. Make sure you also do this: sudo echo 'SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{d3cold_allowed}="0"' | sudo tee /etc/udev/rules.d/99-nvidia-d3cold.rules sudo udevadm control...
  11. T

    Passthrough RTX 6000/5090 CPU Soft BUG lockup, D3cold to D0, after guest shutdown

    The RTX PRO 6000 600W version and the RTX 5090 seem to be fixed when we add the udev rule. But I also just got the RTX PRO 6000 Max-Q 300W version, which has an even stranger issue. The crash happens instantly on a random GPU when I turn on a VM. A different GPU every time. And without passthrough it works fine. Every...
  12. T

    Passthrough RTX 6000/5090 CPU Soft BUG lockup, D3cold to D0, after guest shutdown

    My RTX 6000 Max-Q cards are disappearing on the host when I just power on a VM 2-3 times... and a random GPU is gone. This is madness. NVIDIA still says "passed to developers" and gives no other info. Sitting on 48 GPUs that are broken... And yes, I confirmed that the ATTR was applied: for d in...
  13. T

    Passthrough RTX 6000/5090 CPU Soft BUG lockup, D3cold to D0, after guest shutdown

    I have implemented this too and I still see the issue. But here is the thing: my RTX PRO 6000 Blackwell 96GB 600W is, I think, working fine now, but the RTX PRO 6000 Blackwell 96GB 300W Max-Q has this issue regardless. I also upgraded to the 6.17 kernel and flashed that UEFI firmware fix, but still no...
  14. T

    Passthrough RTX 6000/5090 CPU Soft BUG lockup, D3cold to D0, after guest shutdown

    Actually mortise, I think this solves the issue. I am not sure, as it does not happen often, but I think you hit the spot. Where did you hear about this? I wonder why I could not find that solution anywhere.
  15. T

    Passthrough RTX 6000/5090 CPU Soft BUG lockup, D3cold to D0, after guest shutdown

    Yeah, it seems like it! I have upgraded a few servers with RTX 4090, RTX 5090 and also RTX 6000 Blackwell to that kernel, proxmox-kernel-6.14.8-2-bpo12-pve/stable, and so far it works OK, plus that crazy fast startup. So only one thing still remains: crashing GPUs when the VM guest does some strange...
  16. T

    Passthrough RTX 6000/5090 CPU Soft BUG lockup, D3cold to D0, after guest shutdown

    Got a response from NVIDIA that they were able to reproduce this issue and they are thinking about a fix. I have also run apt install proxmox-kernel-6.14.8-2-bpo12-pve/stable and I see that the RTX 6000 boots super fast now, versus very slowly with the older 6.8 and 6.11 kernels. In 6.14 they added some...
  17. T

    Passthrough RTX 6000/5090 CPU Soft BUG lockup, D3cold to D0, after guest shutdown

    One of my clients confirmed that his RTX 6000 Blackwell crashed all the time when he was training with Unsloth. After adding that fix in the VM, it no longer crashes! Can anyone confirm? I have asked NVIDIA support if they can fix this on the host side or in the GPU BIOS, as changing anything in my...
  18. T

    Passthrough RTX 6000/5090 CPU Soft BUG lockup, D3cold to D0, after guest shutdown

    Oh, I actually did not see it. Thanks. Hmm, that is interesting. I can try this setting in the VM: options nvidia-drm modeset=0 Still, you have rock-solid Windows, and we got this issue after a Windows shutdown as well. I am wondering if you set something special in Windows or maybe the drivers have...
  19. T

    Passthrough RTX 6000/5090 CPU Soft BUG lockup, D3cold to D0, after guest shutdown

    That also did not help. So currently the issue is not solved. I am out of ideas and waiting for Proxmox tech support or NVIDIA support.
  20. T

    Passthrough RTX 6000/5090 CPU Soft BUG lockup, D3cold to D0, after guest shutdown

    Today I got an answer from Proxmox. It is not my Proxmox installation, and it was installed from clean Debian. But here it is: According to the report, the package is not installed correctly, which may affect EFI VMs. Please reinstall: apt update apt install --reinstall pve-edk2-firmware # Check...
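
A note on the small install script quoted in result 5: its first two expansions turn a kernel name into a Debian package version. Here is a minimal sketch of just that step, not the poster's full script (which is cut off after wget). The function name deb_version and the release-candidate name 6.19-rc3-pbk are made up for illustration. It uses sed instead of the original ${deb/-rc/~rc} so it also runs in plain POSIX sh and avoids any tilde expansion in the replacement.

```shell
# Hypothetical helper: derive the Debian package version from a kernel name
# such as "6.18.7-pbk", mirroring the expansions in the quoted one-liner.
deb_version() {
    kernel="$1"
    # Drop the shortest trailing "-suffix" (for example "-pbk").
    base="${kernel%-*}"
    # Debian orders "~rc" before the final release, so "-rc" becomes "~rc".
    base="$(printf '%s\n' "$base" | sed 's/-rc/~rc/')"
    # Append the "-1" package revision used in the quoted script.
    printf '%s-1\n' "$base"
}

deb_version "6.18.7-pbk"     # prints 6.18.7-1
deb_version "6.19-rc3-pbk"   # hypothetical rc name; prints 6.19~rc3-1
```

The tilde matters because Debian version comparison sorts "~" before everything else, so a release candidate packaged as 6.19~rc3-1 is treated as older than the final 6.19-1.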