Hi,
we are struggling with vGPUs for vLLM / PyTorch inside a VM.
Setup:
We operate a node with 4x NVIDIA L40S in a Proxmox cluster (8.4.1). The driver on the host is 580.65.05, and the vGPUs are set up correctly and work reliably and flawlessly for VDIs.
For LLM inference we have mapped 4 vGPUs into a virtual machine. All 4 vGPUs show up correctly in the guest system (using nvidia-smi). We set up vLLM (0.11.0) and torch 2.8.0+cu128 in a Python virtual environment. To our understanding, torch ships with precompiled CUDA + NCCL versions. Inference with models that fit on a SINGLE vGPU (L40S) works flawlessly.
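As a baseline we confirmed inside the guest that the driver, the CUDA build of torch, and the bundled NCCL all agree on what they see. A quick sketch of those checks (assuming torch is installed in the active virtual environment):

```shell
# Driver version and vGPU visibility as reported by the guest driver
nvidia-smi

# torch version, the CUDA version it was built against, and how many
# devices it can enumerate (should print 4 here)
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.device_count())"

# NCCL version bundled with the torch wheel
python -c "import torch; print(torch.cuda.nccl.version())"
```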
Problem:
We followed the troubleshooting guide from vLLM to test whether torch + CUDA + NCCL actually work. Executing the suggested test script fails with ncclUnhandledCudaError. I have attached the error output log. When we don't use vGPUs but use PCI passthrough instead, the test passes and vLLM inference can make use of all L40S cards. We suspect there is a problem between the vGPU guest driver and the CUDA / NCCL version that ships with PyTorch. Do you have any advice on how to make torch / NCCL / vLLM work with vGPUs?
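One thing we still want to try, on the assumption that vGPU profiles do not expose GPU peer-to-peer access the way passthrough does: re-run the test with verbose NCCL logging and P2P disabled, so NCCL falls back to shared-memory transport through the host. `test.py` stands in here for the sanity-check script from the vLLM troubleshooting guide:

```shell
# Verbose NCCL logging: shows which CUDA call raises ncclUnhandledCudaError
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV

# vGPU typically lacks peer-to-peer (and GPUDirect) support, so force
# NCCL to skip P2P and use shared-memory transport instead
export NCCL_P2P_DISABLE=1

# One process per vGPU, as in the vLLM troubleshooting guide
torchrun --nproc_per_node=4 test.py
```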
Disclaimer:
I opened this ticket previously with NVIDIA Enterprise Support, but was deferred to the hypervisor vendor (Proxmox):
> This case involves use of a Linux KVM Partner hypervisor, which is out of scope for NVIDIA Enterprise Support.
> All technical issues should be reported to your Linux KVM Hypervisor vendor.
> The Linux KVM Hypervisor vendor will contact NVIDIA directly as required, and is also responsible for delivering any SW updates that may be needed to address the issue.
Help would be much appreciated!
Thanks & kind regards,
Tobias