VM with docker crashes proxmox host

oester

I have a brand-new server (HP Z440) with a 3090 installed. When I try to bring up a number of Docker containers, the host crashes during the image pull. The VM is Ubuntu 24.04 with the NVIDIA drivers, PCI passthrough to the VM. I can see the 3090 in the VM. I have a stack with ollama, open-webui, and searxng.

The ollama container spins up fine. However, during the pull of the open-webui container, both the Docker VM and the Proxmox host crash. No other VMs or processes are on this host. The system requires a hard reboot.

Looking for ideas.

>>>>

docker compose pull
[+] Pulling 10/19
⠇ open-webui [⣿⣿⣿⣿⣿⣿⣿⣿⡀⣿⠀⣦⠀⠀⠀] Pulling 19.9s
✔ 3da95a905ed5 Pull complete 7.6s
✔ 483d0dd37518 Pull complete 7.9s
✔ 02a5d22e0d6f Pull complete 8.8s
✔ 471797cdda8c Pull complete 8.9s
✔ d735c6810219 Pull complete 8.9s
✔ 4f4fb700ef54 Pull complete 8.9s
✔ eb54bd960342 Pull complete 8.9s
✔ 1e80ef81ce95 Pull complete 9.0s
⠹ dc06c47d3f8d Downloading [===========> ] 75.15MB/341.1MB 19.2s
✔ 4fa722384786 Download complete 3.1s
⠹ 7c1003dda2a4 Downloading [==> ] 72.43MB/1.272GB 19.2s
⠹ d03263fb7c48 Downloading [================================> ] 59.47MB/91.32MB 19.2s
⠹ c840efba173d Waiting 19.2s
⠹ 9c18d807ca17 Waiting 19.2s
⠹ a9250ae4506e Waiting 19.2s
⠇ searxng [⠀] Pulling 19.9s
⠇ d20f48c1461f Waiting 18.9s
✔ ollama Pulled 0.7s
>>> Host crash
ssh_dispatch_run_fatal: Connection to 192.168.25.84 port 22: message authentication code incorrect
 
The VM is Ubuntu 24.04 with the NVIDIA drivers, PCI passthrough to the VM.
That is most probably the reason why it crashes. PCIe passthrough depends heavily on the hardware used. Sometimes it works, sometimes it does not, and often it cannot be fixed in software.

Maybe look into bind-mounting the NVIDIA devices into an LXC container and running your AI stack from there. That is not prone to the errors caused by PCIe passthrough.
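
For reference, a rough sketch of what the bind-mount approach could look like on the Proxmox host. It assumes the NVIDIA driver is already loaded on the host, that GNU stat is available, and that the device nodes carry the usual names shown in the comments; the container will most likely also need a matching user-space driver version. lxc_lines_for_dev is a hypothetical helper, not a Proxmox tool. It only prints candidate lines for the container config (/etc/pve/lxc/<CTID>.conf) so you can review them before adding anything.

```shell
# Hypothetical helper: print LXC config lines that expose one host
# character device to a container. Assumes GNU stat (coreutils).
lxc_lines_for_dev() {
  dev="$1"
  [ -c "$dev" ] || { echo "not a character device: $dev" >&2; return 1; }
  # %t is the device major number in hex; convert it to decimal.
  major=$((0x$(stat -c '%t' "$dev")))
  echo "lxc.cgroup2.devices.allow: c ${major}:* rwm"
  # The mount target is relative to the container rootfs (no leading slash).
  echo "lxc.mount.entry: ${dev} ${dev#/} none bind,optional,create=file"
}

# Example (device names are an assumption, check ls -l /dev/nvidia* first):
#   for d in /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools; do
#     lxc_lines_for_dev "$d"
#   done
```

Devices that share a major number produce the same devices.allow line more than once; the duplicates can simply be dropped when copying the output into the config.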
 
brand-new server (HP Z440)
Released in 2014, I would not call that "brand-new"! You probably mean a freshly installed server.

I totally agree with LnxBil above.

I'll just add: given the dated server (fully EoSL since 6/30/2023, 9 years after release), adding a GPU released 6 to 7 years later (RTX 3090, circa 2020, but now also EoL) and then doing PCI passthrough to a VM is inevitably going to crash something.

Either get yourself newer HW, or run an OS and software released closer in date to your current HW.
 
Following up here: this appears to be unrelated to the PCIe passthrough. It seems related to the network interface being renamed (briefly) during Docker startup. If I run the docker command directly on the VM console, it works. I see these messages about the port (eth0) entering blocking state, being renamed, and then entering non-blocking state. This is my only Proxmox host with a single network interface/bridge, and I don't see this message on any of my other Proxmox hosts. Any suggestions on what I can change on the Proxmox host to avoid this issue?

Image 7-13-25 at 9.11 AM.jpg
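
To compare this host with the others in text form rather than from a screenshot, something like the following could pull the relevant kernel messages out of the log. This is only a sketch: filter_net_events is a hypothetical helper name, and the patterns assume the usual kernel wording ("renamed from", "entered blocking/forwarding/disabled state"), which may differ slightly from what the console shows.

```shell
# Hypothetical helper: keep only kernel log lines about interface renames
# and bridge-port state changes. Reads log lines on stdin.
filter_net_events() {
  grep -E 'renamed from|entered (blocking|forwarding|disabled) state'
}

# Example usage, on the VM or on the Proxmox host:
#   journalctl -k -b --no-pager | filter_net_events
# After a crash, the previous boot can be checked with -b -1
# (this assumes the journal is persistent on that machine):
#   journalctl -k -b -1 --no-pager | filter_net_events
```

Running the same filter on one of the hosts that does not crash would show whether the messages are really absent there or just scrolled past on the console.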