Switching a passthrough GPU between VMs?

shodan

Member
Sep 1, 2022
Hi,

Imagine I have a Proxmox server with four active desktop VMs, two Windows and two Linux.

Is it possible to swap the GPU between the VMs at will, and efficiently?

To begin with: can GPU passthrough settings be changed at all while the VM is running?

Then, does swapping the GPU cause the VMs to crash, or the GPU itself to enter a crashed state?

How long does it take, say, compared with the time to switch between ports on a KVM switch?

Would it make sense to live-snapshot and suspend the VMs, connect the GPU to the desired one, and then unfreeze it? And then when you switch again, freeze the VM, switch the GPU, and unfreeze the new VM?

Would the GPU need to "reboot" between swaps? Is it going to be like pressing Ctrl+Win+Shift+B and restarting the video system each time?

Would it cause a running application that uses the GPU for 3D or video compression to crash?

What happens if two VMs try to use GPU passthrough at the same time? Do they both crash and take the GPU down with them?

How do I implement easy switching logic that works from wherever I am interacting with these VMs?

That was a lot of questions; I wanted to create this list first and put it out there to find out whether other people are trying this.


What I have in mind is multiple things, really.

First would be my desktop computer, running Proxmox, with all my screens connected to the physical GPU. Right now it just shows the Proxmox login prompt.
I would like the physical outputs to display the content of a VM instead of the content of the Proxmox host.

I also access my VMs from other computers, in fullscreen, using Moonlight/Sunshine, but when the GPU detaches, my Moonlight client becomes inoperative. So the scripts have to be efficient: I run the script and reconnect, and it should be the new VM, or else I will have to start logging into the console and that's not going to be a good experience.

And then there will be more complications, such as VMs and CT containers running parallel AI applications like Stable Diffusion and the oobabooga text-generation UI, which need the host to use the GPU, and it's unclear how I can swap between GPU passthrough and host GPU use without rebooting the entire thing!
 
I will now try to have ChatGPT answer all of these questions.


Code:
Please consider the following proxmox support forum question
And break it down in terms of individual goals

------------------------------

This Proxmox support forum post contains several distinct technical goals and questions about dynamically swapping a GPU between multiple virtual machines (VMs) and containers (CTs) efficiently. Let’s break down the post into its individual goals and questions:

Primary Goals:​

  1. Dynamically Swap a GPU Between VMs at Will:
    • The user wants to swap the GPU between VMs efficiently and on demand.
  2. Efficient Switching Without VM or GPU Crashes:
    • The user wants to understand if switching the GPU while the VM is running is possible, and whether such switching could cause crashes for the VMs or the GPU.
  3. Minimal Downtime During GPU Swapping:
    • The user is concerned about how long it takes to switch the GPU between VMs and wants a comparison with switching between ports on a KVM (keyboard-video-mouse) switch.
  4. VM Suspension for Safe GPU Switching:
    • The user proposes the idea of suspending or freezing VMs (taking live snapshots), switching the GPU to the new VM, and unfreezing the VMs to minimize downtime. They are seeking advice on whether this is a reasonable approach.
  5. Behavior of GPU When Swapped:
    • The user is asking if the GPU needs to "reboot" or restart each time it’s swapped between VMs and if this would be similar to pressing a key combination (Ctrl+Win+Shift+B) to reset the video system.
  6. Impact on GPU-Intensive Applications:
    • The user is concerned about whether applications like 3D rendering or video compression running on a VM could crash when the GPU is swapped.
  7. Handling Multiple VMs Trying to Use GPU Passthrough Simultaneously:
    • The user wonders what happens if two VMs attempt to use the GPU at the same time. Specifically, they are concerned about both VMs and the GPU crashing due to simultaneous access attempts.

Secondary/Implementation Goals:​

  1. Efficient Scripted GPU Switching:
    • The user is looking to automate and streamline the process of switching the GPU between VMs, using scripts that ensure the Moonlight/Sunshine streaming service reconnects to the correct VM without manual intervention.
  2. Output VM Display to Physical GPU Ports:
    • The user wants the physical GPU outputs (which currently display the Proxmox host interface) to switch and display the contents of a VM instead.
  3. Handle GPU Sharing Between VMs and Host (AI Workloads):
    • The user needs to juggle both VM GPU passthrough for desktops and the host itself using the GPU for container-based AI applications (e.g., Stable Diffusion, oobabooga UI). They are concerned about how to handle GPU passthrough for VMs and still allow the host to use the GPU without frequent reboots.

Key Questions from the Post:​

  • Is it possible to change GPU passthrough settings while a VM is running?
  • Does swapping the GPU between VMs cause them or the GPU to crash?
  • How long does the GPU swapping process take compared to a KVM switch?
  • Would suspending/freezing VMs and swapping the GPU be an efficient way to avoid downtime?
  • Does the GPU need to reboot when switching between VMs, similar to resetting the video system (Ctrl+Win+Shift+B)?
  • Will GPU-intensive applications (e.g., 3D rendering, video compression) crash if the GPU is swapped?
  • What happens if two VMs try to use GPU passthrough at the same time?
  • How to implement an easy switching logic using scripts to avoid manual intervention?
  • How can the physical GPU outputs show the VM content instead of the Proxmox login?
  • How to handle GPU sharing between the VMs and the Proxmox host running containers, without rebooting everything?

Possible Solutions/Implementation Considerations:​

  • PCIe GPU Passthrough Limitations: Directly swapping a GPU between running VMs is generally not supported without shutting down the VMs. A typical method is to shut down or suspend one VM, reassign the GPU to another, and restart/resume the second VM.
  • Scripts/Automation: Automating the switching process can help minimize downtime. The user will need to develop scripts to detach the GPU from one VM, reattach it to another, and restart the VMs efficiently.
  • Display Redirection: Getting the GPU to output VM content directly to physical ports likely requires configuring Proxmox to pass the GPU to a VM while allowing it to take control of the GPU’s physical outputs.
  • GPU Sharing Mechanisms: Some GPUs support SR-IOV (single root input/output virtualization), allowing multiple VMs to share the same GPU resources, though this depends on hardware capabilities.
  • Handling Multiple GPU-Intensive Workloads: Running parallel AI workloads in Proxmox containers alongside VMs using GPU passthrough adds complexity. Solutions like GPU partitioning or resource allocation with specific NVIDIA technologies (such as vGPU) could help, but they also require specific hardware support.
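Before attempting any of this, it helps to confirm the GPU's PCI address, its vendor:device IDs, and its IOMMU group on the host. A minimal, hedged sketch (the 0000:01:00.0 address below is only an example and must be replaced with your card's address):

Code:
# list GPUs (and their HDMI audio functions) with vendor:device IDs
lspci -nn | grep -Ei 'vga|3d|audio'

# show which IOMMU group the card belongs to
readlink /sys/bus/pci/devices/0000:01:00.0/iommu_group

# list everything sharing that group (all of it has to be passed through together)
ls /sys/bus/pci/devices/0000:01:00.0/iommu_group/devices/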
 
NOTE: THIS IS AUTOGENERATED, NONE OF THIS CODE HAS BEEN TESTED!
I'm just exploring the answers to something that maybe nobody has tried before.

Code:
Please attempt to solve
1. Dynamically Swap a GPU Between VMs at Will:

Dynamically swapping a GPU between VMs at will in Proxmox presents a significant challenge due to limitations with GPU passthrough technology and how hardware interacts with virtual machines. However, there are ways to address this need, though not truly "dynamic" in the sense of hot-swapping the GPU while the VM is running. Let’s break down some possible solutions and steps you can take to achieve GPU swapping with as little downtime as possible.

Challenges with Dynamically Swapping GPUs Between VMs:​

  1. GPU Passthrough Mechanism: When a GPU is passed through to a VM, it is dedicated exclusively to that VM. The GPU drivers, both on the host and the guest OS, expect full control of the GPU. This makes it hard to just reassign the GPU without either shutting down or suspending the VM that currently holds the GPU.
  2. PCI Device Reset Issues: After detaching a GPU from one VM, it may require a full reset before it can be reassigned to another VM. Not all GPUs handle resets gracefully, leading to potential issues like GPU lockups, crashes, or requiring host reboots.
  3. Hardware Support: Some GPUs (e.g., NVIDIA professional GPUs like the Quadro/RTX series or AMD workstation cards) handle GPU resets better than consumer-grade GPUs. Consumer-grade GPUs (e.g., NVIDIA GeForce cards) often have issues when hot-swapping or resetting between different VMs without reboots.
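Whether a given card can be reset cleanly can be probed from the host before building a workflow around swapping it. On reasonably recent kernels the available reset mechanisms are exposed in sysfs; a small sketch (example address, and only try the manual reset while no VM holds the device):

Code:
# which reset methods the kernel knows for this device (e.g. flr, pm, bus)
cat /sys/bus/pci/devices/0000:01:00.0/reset_method

# trigger a function-level reset by hand to see whether the card survives it
echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset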

Strategies to Minimize Downtime when Swapping a GPU Between VMs:​

  1. VM Suspend/Shutdown for GPU Reassignment:
    • The easiest, most reliable way to swap a GPU between VMs is to suspend or shut down the VM currently using the GPU, reassign the GPU to the next VM, and start or resume that VM.
    • While this is not a true "live" swap, suspending the VM preserves its state; note, however, that a plain in-RAM suspend keeps the QEMU process (and the GPU) alive, so the card is only really released once the source VM is shut down or hibernated, and hibernation usually fails for VMs with passthrough devices.
    • Steps:
      1. Suspend the VM that is currently using the GPU:
        1. Code:
          qm suspend <VMID>
      2. Detach the GPU from the suspended VM:
        • This may require editing the VM configuration file in /etc/pve/qemu-server/<VMID>.conf to remove or comment out the PCI device.
      3. Attach the GPU to the target VM:
        • Add the same PCI device to the new VM's configuration file, or use the GUI to attach the GPU.
      4. Resume/start the target VM:
        1. Code:
          qm start <VMID>
          • Example:
            • VM 101 (Linux) is using the GPU.
            • Suspend VM 101, detach the GPU, and reassign the GPU to VM 102 (Windows).
  2. Scripted Automation for Faster GPU Swapping:
    • Since the process above requires several manual steps, it’s possible to automate this using Proxmox's API or command-line tools to script the suspend, GPU detach, reassign, and resume/start actions.
    • A basic example of a bash script to swap GPU between two VMs:
    • Code:
      #!/bin/bash
      
      VM_FROM=$1 # The VM currently using the GPU
      VM_TO=$2   # The VM you want to switch the GPU to
      GPU_ID="hostpci0" # Assuming hostpci0 is the GPU PCI device in VM config
      
      echo "Suspending VM $VM_FROM..."
      qm suspend $VM_FROM
      
      echo "Detaching GPU from VM $VM_FROM..."
      sed -i "/$GPU_ID/d" /etc/pve/qemu-server/$VM_FROM.conf
      
      echo "Attaching GPU to VM $VM_TO..."
      echo "$GPU_ID: 01:00.0" >> /etc/pve/qemu-server/$VM_TO.conf
      
      echo "Starting VM $VM_TO..."
      qm start $VM_TO
    • You would need to provide the VM ID of the source and target VMs when running the script, and modify it based on your GPU's PCI device ID.
  3. Use VM Snapshots to Preserve State:
    • Instead of full shutdowns, you can take a live snapshot before suspending the VM. This will allow you to revert back if needed.
    • Example:
    • Code:
      qm snapshot <VMID> <snapshot-name>
  4. Consider Multiple GPUs or SR-IOV:
    • One approach to avoid needing to swap a single GPU is to use multiple GPUs or SR-IOV (Single Root I/O Virtualization)-capable GPUs that allow multiple VMs to share the same GPU resources.
      • SR-IOV: This technology allows a single physical GPU to be divided into multiple virtual GPUs, each of which can be assigned to a different VM.
      • Multiple GPUs: Having dedicated GPUs for different VMs would allow you to avoid swapping GPUs altogether, though this can be cost-prohibitive.
  5. GPU Reset Workaround:
    • Some users have reported success with modifying the GPU reset behavior in Proxmox using the vfio-pci kernel driver with specific flags or patches. This may allow the GPU to reset between VM switches without requiring a full Proxmox reboot. However, success depends on the GPU model.
    • Example modification in /etc/modprobe.d/vfio.conf for NVIDIA GPUs:
    • Code:
      options vfio-pci ids=10de:1b80,10de:10f0 disable_vga=1
    • KVM/Headless Workstation Setup:
      • If using remote access (like Moonlight/Sunshine for game streaming), ensure that when switching GPUs between VMs, you also script the reconnection of the remote client to the new VM. This avoids the need to manually log in to the Proxmox console for every switch.

Steps to Implement Dynamic GPU Swapping:

  1. Identify the GPU's PCI ID and confirm the VM configurations.
  2. Suspend the VM that currently has the GPU, or take a snapshot for safety.
  3. Detach the GPU from the current VM and reassign it to the target VM.
  4. Start/resume the target VM.
  5. Use automation scripts to streamline this process and reduce downtime.

Considerations:
  • GPU swapping is not instantaneous and requires a methodical approach to avoid crashes.
  • GPU reset issues can make some GPUs less suitable for swapping without a host reboot.
  • You may need to experiment with Proxmox settings or GPU-specific kernel modules to improve the swap's reliability.

Conclusion:​

While true "dynamic" GPU swapping between running VMs is not feasible with typical consumer GPUs due to how passthrough works, you can achieve minimal downtime swapping by suspending VMs, detaching and reattaching the GPU, and using automation scripts.
 
Code:
Please attempt to solve
2.Efficient Switching Without VM or GPU Crashes:

This should take the form of a script executed on the host
The ability to execute this script from inside each of the four virtual machines
The ability to execute this script from a control web page accessible by the thin client running moonlight client
And from a non-gpu using dedicated "control" desktop which the proxmox host attached keyboard/mouse/monitor can use to swap to another VM in a single click


To implement efficient GPU switching without VM or GPU crashes, we need to combine several components: a host-side script for switching, the ability to execute the script from within the virtual machines (VMs), and a control interface accessible via a web page and from a dedicated "control" desktop. This will allow for smooth and user-friendly switching between VMs.

Here’s how we can achieve these goals step-by-step:

Step 1: Host-Side Script for GPU Switching

We'll start by writing a script that runs on the Proxmox host. This script will handle suspending the currently running VM that has the GPU, detaching the GPU, and attaching it to the target VM. For reliability, we'll add error handling to avoid crashes or state corruption.

Script (gpu_switch.sh):​

This script will accept two arguments: the ID of the VM currently using the GPU and the ID of the target VM that will take over the GPU.
Code:
#!/bin/bash

# Get VM IDs from the command line
VM_FROM=$1  # VM currently using the GPU
VM_TO=$2    # VM to switch GPU to
GPU_ID="hostpci0"  # Assuming GPU is on hostpci0

# Check if VM_FROM is running
STATUS_FROM=$(qm status $VM_FROM | grep "status" | awk '{print $2}')

if [ "$STATUS_FROM" == "running" ]; then
    echo "Suspending VM $VM_FROM..."
    qm suspend $VM_FROM
    sleep 3
else
    echo "VM $VM_FROM is not running, skipping suspension..."
fi

# Detach GPU from VM_FROM
echo "Detaching GPU from VM $VM_FROM..."
sed -i "/$GPU_ID/d" /etc/pve/qemu-server/$VM_FROM.conf

# Check if VM_TO is running
STATUS_TO=$(qm status $VM_TO | grep "status" | awk '{print $2}')

if [ "$STATUS_TO" == "running" ]; then
    echo "Stopping VM $VM_TO to reassign GPU..."
    qm stop $VM_TO
    sleep 3
fi

# Attach GPU to VM_TO
echo "Attaching GPU to VM $VM_TO..."
echo "$GPU_ID: 01:00.0" >> /etc/pve/qemu-server/$VM_TO.conf

# Start VM_TO
echo "Starting VM $VM_TO..."
qm start $VM_TO

echo "GPU has been switched from VM $VM_FROM to VM $VM_TO."
  • This script will suspend the current VM (VM_FROM) using the GPU.
  • It will then detach the GPU (hostpci0) from that VM by modifying the Proxmox configuration.
  • Next, it will check if the target VM (VM_TO) is running and stop it if necessary.
  • Finally, it attaches the GPU to the target VM and starts it.

Making the Script Executable:​

Place the script in a directory on the Proxmox host (e.g., /root/scripts/) and make it executable:
Code:
chmod +x /root/scripts/gpu_switch.sh

Step 2: Executing the Script from Inside Each VM

To enable executing the script from inside the VMs, we can expose an SSH access point from each VM to the Proxmox host, allowing the VM user to trigger the script via an SSH command.

Steps to Execute from VMs:​

  1. Configure Passwordless SSH from VMs to Proxmox Host:
    • Generate SSH keys in each VM:
    • Code:
      ssh-keygen -t rsa
    • Copy the public key to the Proxmox host:
    • Code:
      ssh-copy-id root@<proxmox-host-ip>
  2. Trigger the Script: From within the VM, users can execute:
For example, to switch from VM 101 (Linux) to VM 102 (Windows), they would run:
Code:
ssh root@<proxmox-host-ip> "/root/scripts/gpu_switch.sh 101 102"

This can be further simplified by creating a small bash script inside the VM to hide the SSH command, making it user-friendly:

Code:
#!/bin/bash
# Script to switch GPU from within VM

VM_FROM=101
VM_TO=102

ssh root@<proxmox-host-ip> "/root/scripts/gpu_switch.sh $VM_FROM $VM_TO"
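Handing every VM an unrestricted root key to the hypervisor is risky. One common hardening step (a suggestion, not part of the original answer) is to pin the key in root's authorized_keys on the Proxmox host to a forced command, so a given VM's key can only ever trigger its own predefined switch; the key material and VM IDs below are placeholders:

Code:
# /root/.ssh/authorized_keys on the Proxmox host (one line per VM key)
command="/root/scripts/gpu_switch.sh 101 102",no-port-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAA... user@vm101

With a forced command, whatever command the client requests is ignored and the fixed one runs instead, so the arguments cannot be tampered with from inside the VM.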

Step 3: Control Web Page for Thin Client (Moonlight Client)

To control GPU switching from a web interface, we’ll set up a lightweight web server (e.g., NGINX or Apache) on the Proxmox host or a dedicated server. The web interface will allow users to click buttons to switch the GPU between VMs.

Steps for Control Web Page:​

  1. Install a Web Server: On the Proxmox host or dedicated control machine (php-fpm is needed for the backend script in step 3):
    Code:
    apt-get install nginx php-fpm
  2. Create the Web Page: In the /var/www/html/ directory, create a simple web page (index.html) with buttons to trigger GPU switching.
    Code:
    <!DOCTYPE html>
    <html>
    <head>
        <title>GPU Switch Control</title>
    </head>
    <body>
        <h1>GPU Switch Control Panel</h1>
        <button onclick="window.location.href='/switch?from=101&to=102'">Switch to Windows VM (102)</button>
        <button onclick="window.location.href='/switch?from=102&to=101'">Switch to Linux VM (101)</button>
    </body>
    </html>
  3. Create a Backend Script to Handle Requests: Use a simple PHP script or a CGI bash script to handle the web requests. For example, using NGINX and PHP, create /var/www/html/switch.php:
    Code:
    <?php
        // VM IDs are numeric; never pass raw request input to a shell
        $vm_from = intval($_GET['from']);
        $vm_to   = intval($_GET['to']);

        // note: the web server user (www-data) cannot run qm or edit /etc/pve
        // by itself; in practice this needs e.g. a narrowly scoped sudo rule
        $command = "/root/scripts/gpu_switch.sh $vm_from $vm_to";
        shell_exec($command);

        echo "GPU has been switched from VM $vm_from to VM $vm_to.";
    ?>
  4. Configure NGINX to serve the web page and handle the PHP script.
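As a rough illustration of that last step (assuming php-fpm is installed alongside nginx; the configuration file path and the php-fpm socket name vary by distribution and PHP version, so treat this as an untested sketch):

Code:
# /etc/nginx/conf.d/gpu-switch.conf
server {
    listen 80;
    root /var/www/html;
    index index.html;

    location ~ \.php$ {
        include snippets/fastcgi-php.conf;
        # adjust to the php-fpm socket actually present on the host
        fastcgi_pass unix:/run/php/php8.2-fpm.sock;
    }
}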

Accessing the Web Page:​

  • You can access the control panel from the Moonlight client’s thin client browser by navigating to the IP of the Proxmox host or dedicated control server (e.g., http://<proxmox-host-ip>/).

Step 4: Control Desktop for Host-Keyboard/Mouse-Monitor Setup

For the "control" desktop (which does not use the GPU and is used to control GPU switching), you can follow the same approach as the web interface, allowing the user to click buttons and trigger GPU switching.

You can either:

  • Use the Web Page you already created and access it from the control desktop's web browser.
  • Use a Desktop Shortcut: Create a desktop shortcut that runs an SSH command to trigger GPU switching directly from the control desktop.

Steps for Desktop Shortcut (Linux):​

  1. Create a .desktop file (Linux) for each action. For example, to switch to VM 102:
Code:
[Desktop Entry]
Name=Switch to Windows VM (102)
Exec=ssh root@<proxmox-host-ip> "/root/scripts/gpu_switch.sh 101 102"
Icon=utilities-terminal
Type=Application

Place this file in the control desktop's application directory, making it a clickable shortcut.

Conclusion

In summary, you can implement efficient GPU switching without VM or GPU crashes by following these steps:

  1. Create a host-side script (gpu_switch.sh) for managing the GPU handoff.
  2. Enable SSH access from inside VMs to allow users to trigger the GPU switching script.
  3. Build a web interface accessible from a Moonlight thin client or any browser.
  4. Set up a dedicated control desktop with web access or desktop shortcuts to manage GPU switching.
This method ensures you can dynamically switch the GPU between VMs with minimal downtime, avoiding crashes, and enabling easy control through multiple interfaces.
 
Code:
Please attempt to solve
3. Minimal Downtime During GPU Swapping:

To solve "Minimal Downtime During GPU Swapping", we will need to carefully design a process that handles GPU passthrough between Virtual Machines (VMs) in Proxmox as efficiently as possible. The downtime primarily comes from the need to:

  1. Suspend or stop the VM currently using the GPU.
  2. Detach the GPU from that VM.
  3. Attach the GPU to the new VM.
  4. Start the new VM.
The process must be designed to minimize the time between suspending one VM and starting the other VM, ensuring minimal impact on the user experience.

Here's a practical approach that reduces downtime, combines the techniques from previous steps, and introduces a refined way to switch GPUs quickly.


Steps for Minimizing Downtime:

  1. Suspending or Shutting Down the Current VM:
    • Suspending rather than shutting down is attractive because it pauses the VM's state and brings it back up faster later.
    • Be aware, though, that a plain qm suspend leaves the QEMU process running with the GPU still bound to it, and suspend-to-disk generally fails for VMs with vfio passthrough devices, so in practice a clean shutdown of the source VM is what actually frees the card.
  2. GPU Detachment:
    • After suspending the VM, remove the GPU from the configuration of that VM. This action is performed on the host-level configuration file (/etc/pve/qemu-server/<VMID>.conf).
    • QEMU can hot-unplug some PCI devices, but GPUs rarely survive it; stopping the VM before touching the passthrough configuration is the safest way to guarantee no crashes.
  3. Attaching the GPU to the New VM:
    • Once the GPU is free, attach it to the target VM. You can configure the target VM with the PCI passthrough device (hostpci0 in the example) and start or resume it.
  4. Starting the New VM:
    • The new VM with the GPU can now be started. Using suspend/resume for this VM (rather than stop/start) can also speed up the process, as it restores the VM state quickly without the need for a full boot.

The Script for Minimal Downtime GPU Swapping

Let's refine the script that switches the GPU between VMs in the shortest time possible. This script will:

  1. Suspend the VM currently using the GPU.
  2. Detach the GPU from that VM.
  3. Attach the GPU to the target VM.
  4. Resume or start the target VM.

Script (gpu_switch_min_downtime.sh):​

Code:
#!/bin/bash

# Get VM IDs from command line arguments
VM_FROM=$1  # VM currently using the GPU
VM_TO=$2    # VM to switch GPU to
GPU_ID="hostpci0"  # Assuming GPU is on hostpci0

# Check if VM_FROM is running
STATUS_FROM=$(qm status $VM_FROM | grep "status" | awk '{print $2}')

if [ "$STATUS_FROM" == "running" ]; then
    echo "Suspending VM $VM_FROM..."
    qm suspend $VM_FROM
    sleep 2  # Wait to ensure VM is fully suspended
else
    echo "VM $VM_FROM is not running, skipping suspension..."
fi

# Detach GPU from VM_FROM
echo "Detaching GPU from VM $VM_FROM..."
sed -i "/$GPU_ID/d" /etc/pve/qemu-server/$VM_FROM.conf

# Config edits take effect the next time the VM starts; no reload step is needed
# ("qm rescan" only rescans storage for disk images and does not touch PCI settings)

# Check if VM_TO is running
STATUS_TO=$(qm status $VM_TO | grep "status" | awk '{print $2}')

if [ "$STATUS_TO" == "running" ]; then
    echo "Stopping VM $VM_TO to reassign GPU..."
    qm stop $VM_TO
    sleep 2  # Wait to ensure VM has stopped fully
fi

# Attach GPU to VM_TO
echo "Attaching GPU to VM $VM_TO..."
echo "$GPU_ID: 01:00.0" >> /etc/pve/qemu-server/$VM_TO.conf

# (no reload needed; the new hostpci entry is picked up when VM_TO is started)

# Start the target VM
echo "Starting VM $VM_TO..."
qm start $VM_TO

# Verify the GPU was successfully switched
STATUS_TO_NEW=$(qm status $VM_TO | grep "status" | awk '{print $2}')
if [ "$STATUS_TO_NEW" == "running" ]; then
    echo "GPU successfully switched from VM $VM_FROM to VM $VM_TO."
else
    echo "Failed to start VM $VM_TO. Please check logs."
fi

Explanation of the Script:​

  • Step 1: Suspend the VM currently using the GPU:
    • The script checks if the source VM (VM_FROM) is running. If so, it suspends the VM, saving its state but not shutting it down, ensuring that the VM can resume quickly after the GPU is detached.
    • This minimizes downtime, as suspending is much faster than a full shutdown.
  • Step 2: Detach the GPU from the current VM:
    • The script modifies the configuration file of the source VM to remove the GPU (hostpci0) from its settings. It uses sed to directly edit the Proxmox configuration.
    • After detaching the GPU, no extra reload step is needed: the edited configuration is read the next time the VM is started (qm rescan only rescans storage, so it does not apply here).
  • Step 3: Attach the GPU to the new VM:
    • The script adds the GPU (hostpci0) to the target VM (VM_TO); the change takes effect when that VM is started.
  • Step 4: Start or resume the new VM:
    • If the target VM was already running, it is stopped and then started again after attaching the GPU.
    • Finally, the script verifies whether the new VM started successfully and reports the result.

Step 4: Automating GPU Swapping Execution from Various Points

To minimize downtime further, we can provide users different interfaces to trigger the GPU swap from various control points. This will allow quick switching without having to SSH into the Proxmox host.

Option 1: Execute Script from the VMs Themselves

  • You can run the gpu_switch_min_downtime.sh script directly from the VMs by configuring passwordless SSH to the Proxmox host.
    • SSH from the VM to the Proxmox host to execute the script:
    • Code:
      ssh root@<proxmox-host-ip> "/root/scripts/gpu_switch_min_downtime.sh 101 102"

Option 2: Control Web Page for Moonlight Client or Thin Client

  • Set up a simple web interface to allow users to initiate the GPU switch with a click of a button. Here’s a refined setup:
    1. Web Page Example:
      1. Code:
        <!DOCTYPE html><html>
        <head>
            <title>GPU Switch Control</title>
            <meta http-equiv="refresh" content="5"> <!-- Auto-refresh to check updates -->
        </head>
        <body>
            <h1>GPU Switch Control Panel</h1>
            <button onclick="window.location.href='/switch.php?from=101&to=102'">Switch to Windows VM</button>
            <button onclick="window.location.href='/switch.php?from=102&to=101'">Switch to Linux VM</button>
        </body>
        </html>
    2. PHP Backend Script:
      Code:
      <?php
          // sanitize: VM IDs are numeric, never pass raw request input to a shell
          $vm_from = intval($_GET['from']);
          $vm_to = intval($_GET['to']);
          $command = "/root/scripts/gpu_switch_min_downtime.sh $vm_from $vm_to";
          shell_exec($command);
          echo "Switching GPU from VM $vm_from to VM $vm_to...";
      ?>
      1. This allows users of a Moonlight thin client or a control desktop to switch between VMs with minimal downtime by simply clicking a button on a web page.

Option 3: Control Desktop with Shortcut Buttons

  • You can also provide a desktop shortcut on a control desktop (a non-GPU VM or dedicated control machine) to trigger the script with a single click:
    Example for Linux (switch_vm.desktop):
    Code:
    [Desktop Entry]
    Name=Switch to Linux VM
    Exec=ssh root@<proxmox-host-ip> "/root/scripts/gpu_switch_min_downtime.sh 102 101"
    Icon=utilities-terminal
    Type=Application
  • The user simply clicks this shortcut to execute the switching script.

Additional Optimizations:​

  • Use of SR-IOV (Single Root I/O Virtualization): If your hardware and GPU support it (such as some NVIDIA and AMD enterprise GPUs), SR-IOV could reduce downtime further by allowing multiple VMs to share a single GPU concurrently. This would eliminate the need to fully detach and reattach the GPU between VMs.
  • GPU Reset Workarounds: Depending on your GPU model, you may need to implement workarounds to avoid issues with GPU reset when it is passed between VMs. Some Proxmox users have had success using kernel module options or patches to improve GPU reset behavior (especially with NVIDIA GeForce cards).
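For AMD cards hit by the well-known reset bug, one workaround often mentioned on the forum is the third-party vendor-reset kernel module. The following is only a sketch of the usual installation steps (repository URL and package names as commonly documented, untested here):

Code:
# build prerequisites and kernel headers for the running PVE kernel
apt update && apt install -y git dkms pve-headers-$(uname -r)

# build and install the out-of-tree module (gnif/vendor-reset)
git clone https://github.com/gnif/vendor-reset.git
cd vendor-reset && dkms install .

# load it now and at every boot
echo vendor-reset >> /etc/modules
modprobe vendor-reset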

Conclusion:​

By using this method—suspending the current VM, detaching the GPU, and attaching it to the target VM—we can achieve minimal downtime during GPU swapping in Proxmox. The downtime is mostly limited to the suspend/resume cycle, which is much faster than fully shutting down and booting VMs. Additionally, the ability to execute this process via a script, from VMs themselves, through a web interface, or from a control desktop, makes the process user-friendly and flexible.
 
Code:
Please attempt to solve
VM Suspension for Safe GPU Switching:
    The user proposes the idea of suspending or freezing VMs (taking live snapshots), switching the GPU to the new VM, and unfreezing the VMs to minimize downtime. They are seeking advice on whether this is a reasonable approach.

The idea of suspending/freezing VMs to enable safe GPU switching is a reasonable approach in Proxmox and can be effective in minimizing downtime. This method ensures that the VM's state is preserved, and no data or application processes are lost when switching the GPU from one VM to another. It also provides a way to ensure that VMs do not experience crashes during the GPU detachment process.

Here's a step-by-step explanation and proposed solution for implementing VM suspension for safe GPU switching in Proxmox.


Understanding the Process:

The user's proposed workflow involves:

  1. Suspending (or freezing) the VM currently using the GPU.
  2. Detaching the GPU from that VM.
  3. Attaching the GPU to a new VM.
  4. Resuming the suspended VM or unfreezing it once the switch is complete.
This approach is effective because:

  • VM Suspension saves the current state (CPU, memory, I/O operations, etc.) of the VM, allowing it to resume exactly where it left off after the GPU switch.
  • Live Snapshots can be taken before suspension to ensure that the VM can revert to a stable state in case anything goes wrong during GPU detachment or attachment.
  • Minimizing Downtime: The use of suspension/resume is much faster than fully shutting down and restarting the VMs, making the downtime minimal.
However, suspending the VM that holds the GPU is critical because it ensures that any active processes (like 3D rendering or video processing) are paused safely. Directly switching a GPU from a running VM can cause the VM to crash or lose data, so suspension is essential for preventing these issues.


Steps for Safe GPU Switching via VM Suspension:​

Let’s break down how to implement the process in Proxmox, with a combination of suspension, live snapshots, and GPU passthrough switching.


Step 1: Suspending the VM Safely

  • Suspend the VM that currently holds the GPU to preserve its state. Note that a plain qm suspend pauses the guest's CPU and memory in RAM rather than saving it to disk, and the QEMU process keeps the GPU bound; in practice the VM has to be shut down (or hibernated, which QEMU typically refuses for passthrough devices) before the card is actually free to detach.
  • You can suspend a VM in Proxmox using the following command:
    Code:
    qm suspend <VMID>
  • Why Suspend?
    Suspending a VM is better than shutting it down because it allows for faster recovery, and when the VM is resumed, it continues exactly where it left off, with all applications and processes in the same state.

Step 2: Taking a Live Snapshot (Optional for Extra Safety)

  • Live Snapshots can be taken just before suspending the VM to ensure that if something goes wrong during the GPU switch, you can roll back to a stable state.
  • Taking a snapshot is as simple as:
    Code:
    qm snapshot <VMID> <snapshot-name>
  • Live snapshots preserve the state of the VM's CPU, RAM, and disk data, so if a failure occurs after switching the GPU, you can revert the VM back to this snapshot.

Step 3: Detaching the GPU from the Suspended VM

  • After suspending the VM, the GPU needs to be detached from the VM’s configuration file so that it can be reattached to another VM.
  • Edit the configuration file for the VM in /etc/pve/qemu-server/<VMID>.conf, and remove the line for the GPU passthrough (usually identified as hostpci0):
    Code:
    sed -i "/hostpci0/d" /etc/pve/qemu-server/<VMID>.conf
  • This removes the GPU device from the VM configuration and prepares it for attachment to another VM.

Step 4: Attaching the GPU to the Target VM

  • Now that the GPU is free, you can attach it to the target VM by adding the PCI passthrough configuration to the target VM’s configuration file.
  • To attach the GPU to the new VM, edit the configuration file /etc/pve/qemu-server/<VMID>.conf for the target VM and add the PCI passthrough device, typically on hostpci0:
    Code:
    echo "hostpci0: 01:00.0" >> /etc/pve/qemu-server/<Target-VMID>.conf
  • This attaches the GPU to the new VM; the change takes effect the next time the VM is started (qm rescan only rescans storage and is not needed here).

Step 5: Resuming the Target VM

  • Now, you can start or resume the target VM with the GPU attached. If the VM was suspended, you can resume it with:
    Code:
    qm resume <VMID>
  • If the VM was powered off, you can start it with:
    Code:
    qm start <VMID>
  • Why Resume? Resuming the VM is much faster than starting it from a shutdown state, as all of its memory and process states are restored instantly. The GPU will now be available to the newly resumed VM, allowing it to continue processing workloads that require GPU acceleration.

Step 6: (Optional) Resuming the Suspended VM

  • Once the GPU is successfully switched, you can resume the previously suspended VM if needed, though it will no longer have access to the GPU.
  • If this VM does not require the GPU for certain tasks, it can continue running without it. However, if the VM requires the GPU, it will need to remain suspended until the GPU can be reassigned back to it.

Proposed Script for Safe GPU Switching

Here is a script that implements the entire process, ensuring minimal downtime and safe suspension of VMs:
Code:
#!/bin/bash

# Variables for source and target VM IDs
VM_FROM=$1  # VM currently using the GPU
VM_TO=$2    # VM to switch GPU to
GPU_ID="hostpci0"  # PCI device for GPU passthrough

# Step 1: Suspend the VM currently using the GPU
echo "Suspending VM $VM_FROM..."
qm suspend $VM_FROM
sleep 2  # Small pause to ensure suspension completes

# Optional Step: Take a live snapshot before switching (for safety)
echo "Taking a snapshot of VM $VM_FROM before switching..."
qm snapshot $VM_FROM "gpu-switch-snapshot-$(date +%s)"

# Step 2: Detach GPU from VM_FROM by editing its configuration
echo "Detaching GPU from VM $VM_FROM..."
sed -i "/$GPU_ID/d" /etc/pve/qemu-server/$VM_FROM.conf
# (config edit takes effect when the VM next starts; "qm rescan" is not needed)

# Step 3: Attach GPU to VM_TO by adding it to the target VM's configuration
echo "Attaching GPU to VM $VM_TO..."
echo "$GPU_ID: 01:00.0" >> /etc/pve/qemu-server/$VM_TO.conf

# Step 4: Start or resume the target VM with the GPU
echo "Starting or resuming VM $VM_TO..."
qm start $VM_TO

echo "GPU has been successfully switched from VM $VM_FROM to VM $VM_TO."

Notes on the Script:​

  1. The script takes two arguments: VM_FROM (the VM currently using the GPU) and VM_TO (the VM to switch the GPU to).
  2. It suspends the first VM, optionally takes a live snapshot, and detaches the GPU by modifying the VM configuration.
  3. The GPU is then attached to the target VM, and the target VM is started or resumed.
  4. This approach ensures that no VM is running while the GPU is detached, preventing GPU-related crashes.

Why is this a Reasonable Approach?

  • Minimizes Downtime: Suspending the VMs and using the qm start or qm resume commands minimizes the downtime compared to full VM shutdowns and restarts.
  • Safety: Suspending the VMs ensures that no in-use processes will crash, and any tasks requiring the GPU will resume safely once the GPU is reassigned.
  • Live Snapshots: If you are concerned about system stability, using live snapshots before GPU switching ensures that you can roll back to a known working state in case of failures.

Additional Considerations:​

  1. GPU Reset Issues: Some GPUs (especially NVIDIA consumer cards) may have issues with resetting after being detached from one VM and reattached to another. In some cases, this requires a host reboot. This is more common with consumer GPUs like NVIDIA GeForce, while professional GPUs (e.g., NVIDIA Quadro or AMD Radeon Pro) tend to handle this more gracefully.
  2. Hardware Compatibility: Ensure that your hardware supports GPU passthrough and VM suspension correctly. Some GPUs may require specific configurations in the Proxmox kernel or the use of the vfio-pci driver.
  3. Multiple GPUs: If you can, consider using multiple GPUs (e.g., one GPU per VM) to avoid the need for frequent GPU switching. However, this is often limited by budget or hardware capacity.

Conclusion

Suspending or freezing VMs to safely switch GPUs is a reasonable and efficient approach that minimizes downtime and avoids crashes. By suspending the VM, you ensure that no active processes are interrupted, and by resuming the VM after the GPU switch, the user can continue working from exactly where they left off. Taking live snapshots adds an extra layer of safety, allowing you to roll back to a known stable state if needed.

This method, combined with automation via the script provided, ensures minimal disruption while switching GPUs between VMs in a Proxmox environment.
 
Code:
Attempt to answer
Behavior of GPU When Swapped:
    The user is asking if the GPU needs to "reboot" or restart each time it’s swapped between VMs and 
    if this would be similar to pressing a key combination (Ctrl+Win+Shift+B) to reset the video system.

Behavior of GPU When Swapped Between VMs:

The behavior of a GPU when it's swapped between VMs is not exactly like rebooting or resetting the video system using the key combination Ctrl+Win+Shift+B. The key combination on Windows triggers a driver-level reset of the display subsystem, which reinitializes the graphics driver without rebooting the whole system. However, when a GPU is swapped between VMs in a virtualized environment like Proxmox, there are deeper hardware and driver-level considerations involved.

Let’s break down the behavior and what happens when a GPU is detached from one VM and attached to another.


1. Does the GPU Need to "Reboot" When Swapped Between VMs?

Yes, in a way, the GPU does need to reset or reinitialize when it is detached from one VM and passed through to another. Here’s why:

  • PCI Device Reset: When you detach a GPU from one VM and pass it to another, the host system (Proxmox) will typically trigger a PCI reset on the GPU. This is needed to clear the GPU's state (including any data in its memory, pending processes, etc.) and return it to a "clean" state before it can be reassigned to another VM. This is different from the Ctrl+Win+Shift+B shortcut, as the reset happens at the hardware level (PCI reset) rather than just reloading the graphics driver.
  • Driver Reinitialization: When the GPU is passed to a new VM, the new VM’s operating system will need to initialize the GPU drivers for that VM. This is similar to the process that occurs when you boot a computer with a GPU or when a system restarts after a GPU reset. During this process, the VM loads the necessary drivers and initializes the GPU for use. In Windows, this would involve loading the NVIDIA or AMD driver stack, which communicates with the GPU and prepares it for handling tasks like rendering, video decoding, or compute.
So, while the GPU does not "reboot" in the traditional sense (like a computer reboot), it does undergo a reset and driver reinitialization when passed from one VM to another.


2. Is the Reset Similar to the Ctrl+Win+Shift+B Key Combination?

  • The Ctrl+Win+Shift+B key combination in Windows forces the display driver to restart, reinitializing the display outputs and recovering from common GPU driver failures (e.g., when the screen goes black or becomes unresponsive). This operation affects only the driver and display output, without resetting the underlying hardware.
  • GPU swapping between VMs, on the other hand, involves a full reset of the GPU's state, including clearing any running processes, GPU memory, and context. This happens because a VM expects full, clean control of the GPU once it is attached. The reset here is deeper, as it ensures the GPU is ready for fresh use in the next VM.
  • In contrast to Ctrl+Win+Shift+B, which is relatively quick (since only the display system is reset), a PCI device reset of a GPU may take longer, especially if the GPU has pending tasks or if the VM's operating system takes time to initialize the GPU drivers.

3. How Long Does the GPU Reset Take?

  • The time it takes to reset and initialize the GPU when swapped between VMs depends on several factors:
    • GPU model: Consumer GPUs like NVIDIA GeForce or AMD Radeon tend to have more issues with resets compared to workstation or server GPUs (e.g., NVIDIA Quadro, Tesla, or AMD Radeon Pro), which are designed with virtualization and reset functionality in mind.
    • Driver loading time: The operating system in the VM (whether Linux or Windows) may take a few seconds to load the GPU drivers and initialize the hardware.
    • GPU workload: If the GPU was handling a complex workload (e.g., rendering or video encoding) before the swap, it might take longer to reset.
In general, the reset process can take anywhere from a few seconds to upwards of 10–20 seconds depending on the hardware and workload.
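For a rough number on a specific card, the bare reset can be timed from the host while no VM holds the device (example address; guest driver re-initialisation comes on top of this):

Code:
# time a manual function-level reset of the GPU
time sh -c 'echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset'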


4. Does GPU Swapping Cause a Full "Reboot" of the GPU?

No, swapping the GPU between VMs does not cause a full hardware reboot like restarting the entire computer or GPU firmware. It’s more of a reset or reinitialization process:

  • Hardware Reset: The GPU is reset to a clean state (via a PCI reset), but the underlying firmware (like the GPU’s BIOS) does not reboot in the traditional sense.
  • Driver Reinitialization: The operating system in the target VM will reinitialize the drivers, but this is similar to how drivers load when the VM boots up. The VM treats the GPU as if it is being powered up for the first time.

Key Differences Between a "Reboot" and a "Reset" in the Context of GPU Swapping:​

  • Reboot: Implies a full restart of the GPU, including its BIOS/firmware, similar to restarting a whole system. This doesn’t happen when swapping the GPU between VMs.
  • Reset: This clears the GPU’s internal state and prepares it for reassignment to another VM. This is what happens when you swap the GPU between VMs.
    The GPU does not restart its firmware or BIOS, but it does reset any running workloads, clear its memory, and get re-initialized by the new VM’s operating system.

5. Additional Considerations for GPU Reset During VM Swapping:

  • Consumer GPUs (like NVIDIA GeForce) may have issues with resetting properly when swapped between VMs. These GPUs are not designed with virtualization in mind, so sometimes a reset doesn’t fully succeed, leading to the need for a host reboot. Some users have experienced issues where the GPU becomes “stuck” after being passed between VMs, especially on consumer GPUs.
  • Professional GPUs (like NVIDIA Quadro, Tesla, or AMD Radeon Pro) handle reset functionality much better, as they are designed for use in virtualized environments and support features like SR-IOV (which allows a single GPU to be shared between multiple VMs simultaneously without the need for swapping).
  • SR-IOV or vGPU (virtual GPU) technology would allow multiple VMs to share a single GPU without needing to swap the device entirely. This could eliminate the need for GPU resets between VM switches, further reducing downtime and simplifying the workflow.

Conclusion:

When a GPU is swapped between VMs in Proxmox, it undergoes a PCI reset, clearing its state and requiring reinitialization by the new VM. This reset is deeper than the driver-level reset triggered by Ctrl+Win+Shift+B, and involves more than just reinitializing the display system. The GPU does not undergo a full reboot in the traditional sense (like restarting its firmware or BIOS), but it does go through a hardware reset and driver reinitialization each time it is swapped. This process can take a few seconds to complete, depending on the hardware and the workload being handled by the GPU.

For users who need to frequently swap a GPU between VMs, using workstation-class GPUs (like Quadro or Radeon Pro) or investigating SR-IOV or vGPU technologies can help mitigate reset issues and improve the swapping process.
 
Code:
Attempt to answer

Impact on GPU-Intensive Applications:
    The user is concerned about whether applications like 3D rendering or video
    compression running on a VM could crash when the GPU is swapped.

and

Handling Multiple VMs Trying to Use GPU Passthrough Simultaneously:
    The user wonders what happens if two VMs attempt to use the GPU at the same time.
    Specifically, they are concerned about both VMs and the GPU crashing due to
    simultaneous access attempts.


1. Impact on GPU-Intensive Applications When Swapping the GPU

The user is concerned about what happens to GPU-intensive applications like 3D rendering or video compression when the GPU is swapped from one VM to another. These types of applications rely heavily on the GPU, and losing access to it can cause major problems. Let's explore the likely impact:

What Happens to GPU-Intensive Applications When a GPU is Detached?

  • Applications Using the GPU Will Crash or Freeze:
    • When a GPU is detached from a running VM, applications that are actively using the GPU (like rendering software, video encoding, machine learning tasks, or gaming) will almost certainly crash or freeze. This is because these applications are designed to continuously communicate with the GPU for processing tasks. Once the GPU is no longer available, the application loses its primary compute resource and cannot continue functioning normally.
    • For example, if you are rendering a 3D scene in Blender or running a CUDA task in a deep learning framework, and the GPU is detached, the rendering or training job will either crash immediately or become unresponsive.
  • Potential Loss of Data:
    • In many cases, if the application does not have a mechanism to gracefully handle the loss of the GPU, you may lose any unsaved work or processing progress. This is especially true for workloads that are highly time-sensitive or real-time, like video compression, live rendering, or streaming.
  • VM Stability:
    • The VM itself will not necessarily crash. Most operating systems will continue running, but any GPU-dependent processes will be interrupted. However, if the OS relies on the GPU for graphics output (e.g., Windows running with a GPU driver for rendering the desktop), you may experience issues like a frozen screen, application crashes, or a forced restart of the display subsystem (akin to pressing Ctrl+Win+Shift+B on Windows).
    • In Linux environments with headless applications (like rendering farms or deep learning servers), the loss of the GPU can cause job crashes, but the overall system should remain stable.

Can This Be Prevented?

  • Graceful Suspension:
    • The best way to prevent application crashes when swapping the GPU is to suspend the VM that is using the GPU before the swap. Suspending the VM freezes its current state, including any GPU processes. This ensures that when the VM is resumed later (once the GPU is reattached or available again), the GPU-intensive application can continue where it left off.
    • However, this still means you cannot actively use the application while the GPU is swapped to another VM. Suspending the VM prevents immediate crashes, but it also pauses the execution of GPU-intensive tasks.
  • Checkpointing / Save States:
    • Some applications offer a checkpointing or save state mechanism that can store intermediate states to disk, allowing the job to resume after an interruption. In environments like machine learning or render farms, periodically saving checkpoints ensures that the process can be restarted with minimal data loss after a GPU is swapped or recovered.
    • Rendering Applications: Some rendering software (like Blender or Maya) may have autosave features, so users should configure these to ensure minimal data loss during a GPU switch.
    • Compute Jobs (AI, CUDA): Machine learning frameworks often checkpoint their models during training. Users should ensure that checkpoints are frequently saved to avoid losing progress during GPU swaps.

Workaround for Avoiding Crashes:​

  • Use Multiple GPUs: If multiple GPUs are available, the application can continue running on one GPU while the other is swapped to a different VM. However, this requires dedicated hardware resources that may not be available.
  • SR-IOV or vGPU: If the hardware supports SR-IOV or virtual GPUs (vGPU), the same physical GPU can be shared by multiple VMs without requiring a hard swap. This allows applications to continue running without crashing, as the GPU resource is never fully detached from any VM.

2. Handling Multiple VMs Trying to Use GPU Passthrough Simultaneously

The user is concerned about what happens if two VMs attempt to use GPU passthrough at the same time, and whether this could cause both VMs or the GPU itself to crash.

Can Multiple VMs Use GPU Passthrough at the Same Time?

  • No, Not with Direct PCI Passthrough:
    • With direct PCI passthrough, a physical GPU can only be assigned to one VM at a time. GPU passthrough grants the VM exclusive access to the GPU's hardware resources, meaning that no other VM can use the GPU while it's attached to a particular VM. If two VMs try to use the GPU at the same time, Proxmox will prevent this from happening at the configuration level — it won’t allow a second VM to start with the GPU already attached to another VM.
    • If a VM is already using the GPU, attempting to start a second VM with the same GPU assigned will result in an error like this in Proxmox:
    • Code:
      kvm: -device vfio-pci,host=01:00.0,id=hostpci0,bus=pci.0,addr=0x10: Device 'vfio-pci' is already in use by another VM.

What Happens If Two VMs Try to Access the GPU?

  • Proxmox Blocks the Second VM:
    • Proxmox will not allow multiple VMs to use the same GPU simultaneously in direct passthrough mode. If a VM is already running with the GPU attached, Proxmox will block the second VM from starting, showing an error message in the web UI or logs. The second VM will not crash, but it won’t be able to boot or resume until the GPU is available.
  • No GPU-Level Crash:
    • The GPU itself will not crash or lock up due to this conflict because Proxmox ensures that only one VM can hold the GPU at a time. There is no risk of simultaneous access causing hardware-level crashes, as Proxmox’s PCI passthrough mechanism ensures exclusive access.

What About Shared GPU Resources?

  • SR-IOV or Virtual GPU (vGPU):
    • In environments where SR-IOV (Single Root I/O Virtualization) or vGPU (Virtual GPU) is supported, the GPU can be shared between multiple VMs. This is a more advanced feature supported by certain enterprise-class GPUs (e.g., NVIDIA Quadro, Tesla, or AMD Radeon Pro) and requires specialized drivers.
    • In this case, each VM gets a virtual instance of the GPU, and multiple VMs can perform GPU-intensive tasks at the same time. This eliminates the need for GPU swapping altogether and allows for parallel GPU usage across multiple VMs.
    • Note: Not all GPUs support SR-IOV or vGPU. Most consumer GPUs (like the NVIDIA GeForce series) do not have this functionality. You would need workstation or server-grade GPUs to take advantage of this feature.
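For completeness, on hardware whose driver actually exposes SR-IOV, the virtual functions are created through sysfs and then passed to VMs like any other PCI device. A sketch only; on most consumer GPUs the sriov_numvfs attribute simply is not there:

Code:
# how many virtual functions the device could expose (absent or 0 on most GPUs)
cat /sys/bus/pci/devices/0000:01:00.0/sriov_totalvfs

# create four VFs, which then appear as extra PCI functions to pass through
echo 4 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs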

Risk of Crashing If Sharing is Forced:​

  • While Proxmox prevents simultaneous GPU usage with passthrough, some users might attempt to use unsafe methods like reassigning the GPU between VMs without proper detachment or shutdown (using hot-plugging techniques or misconfiguring PCI devices). In these cases:
    • GPU Lockup: The GPU could lock up or enter an unstable state if forced to switch too quickly between VMs. Some consumer GPUs are known to have issues with reset functionality, which could lead to the GPU becoming unusable until the host system is rebooted.
    • Host-Level Crash: Improper GPU swapping or forcing the GPU to be shared (without SR-IOV/vGPU) can lead to host crashes or kernel panics, affecting the entire Proxmox server.

Conclusion:​

Impact on GPU-Intensive Applications:​

  • If you swap the GPU from one VM to another while GPU-intensive applications (like 3D rendering, video compression, or machine learning tasks) are running, those applications will crash or freeze because they lose access to the GPU. To prevent crashes, you should suspend the VM before switching the GPU, allowing the application to resume when the GPU is returned.

Handling Multiple VMs Trying to Use GPU Passthrough Simultaneously:​

  • Proxmox will prevent multiple VMs from using the same GPU simultaneously with PCI passthrough. Only one VM can use the GPU at a time, and Proxmox will block the second VM from starting until the GPU is free.
  • If you need to share the GPU between multiple VMs concurrently, you’ll need SR-IOV or vGPU technology, which is supported by certain professional-grade GPUs but not by consumer-grade GPUs.
 
Code:
Give a short answer for each of the following based on the above discussion
[...]

Key Questions from the Post:

  1. Is it possible to change GPU passthrough settings while a VM is running?
    • No; in practice the VM must be stopped (a plain suspend does not release the GPU) before changing GPU passthrough settings.
  2. Does swapping the GPU between VMs cause them or the GPU to crash?
    • If not done properly (e.g., without suspending the VM), GPU-intensive applications may crash, but the VM itself generally remains stable.
  3. How long does the GPU swapping process take compared to a KVM switch?
    • GPU swapping typically takes a few seconds to 20 seconds, depending on VM suspension and GPU reinitialization, longer than a traditional KVM switch.
  4. Would suspending/freezing VMs and swapping the GPU be an efficient way to avoid downtime?
    • Partially; pausing or snapshotting protects the guest's state, but the VM holding the GPU generally has to be fully stopped before the card can be reassigned.
  5. Does the GPU need to reboot when switching between VMs, similar to resetting the video system (Ctrl+Win+Shift+B)?
    • The GPU undergoes a hardware reset, which is deeper than the video system reset (Ctrl+Win+Shift+B), but it doesn't fully "reboot."
  6. Will GPU-intensive applications (e.g., 3D rendering, video compression) crash if the GPU is swapped?
    • Yes, GPU-intensive applications will crash or freeze if the GPU is swapped while they are running.
  7. What happens if two VMs try to use GPU passthrough at the same time?
    • Proxmox will prevent the second VM from starting, and no crash will occur, but only one VM can use the GPU at a time.
  8. How to implement an easy switching logic using scripts to avoid manual intervention?
    • Use bash scripts on the host to automate the suspension, GPU detachment, and reattachment, along with SSH or web interfaces for easy control.
  9. How can the physical GPU outputs show the VM content instead of the Proxmox login?
    • Assign the GPU to a VM via passthrough (marking it as the primary/x-vga device), which will make the VM take control of the physical GPU outputs; see the sketch after this list.
  10. How to handle GPU sharing between the VMs and the Proxmox host running containers, without rebooting everything?
  • Use SR-IOV or vGPU technologies, if supported, to share the GPU between VMs and the Proxmox host simultaneously.
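As a concrete illustration of points 8 and 9 (VM ID and device address are examples, untested): the attach/detach can also be done with qm set instead of editing the config file by hand, and marking the card as the primary GPU is what makes the VM drive the physical outputs:

Code:
# give VM 101 the GPU as its primary display adapter (pcie=1 needs a q35 machine type)
qm set 101 -hostpci0 01:00,pcie=1,x-vga=1

# remove it again once the VM is stopped
qm set 101 -delete hostpci0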


--------

Ok, so let me know if you spot any errors in all this, if there is a better way to achieve these goals, or if you have some lateral method to achieve them that isn't discussed above. Oh, and do there already exist tested scripts to do any of this?
 
I have a workstation with multiple displays, each connected to a VM with GPU/PCI passthrough.

Is it possible to swap the GPU between the VMs at will and efficiently?
Yes, if you have the right hardware.
AMD has a reset bug; there is a chance the GPU isn't returned to the host correctly. So don't switch an AMD graphics card around much on the host.
Nvidia has better chances at this stage.

The GPU switch has to be manual because once the VM is shut down, the GPU is returned to the host. You'll see the fans start spinning loudly without drivers loaded. You can then assign it to another VM. Doing this too soon will likely cause the host to crash.

GPU sharing is possible with VMware Workstation inside your Windows 11 VM. E.g., on Proxmox you pass your Nvidia GPU through to the Windows 11 VM; inside that Windows 11 you have VMware Workstation installed, and any nested VMs inside it can share your GPU.

Your host is more likely to crash with passthrough than without it.
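On the point above about the card going back to the host after a VM shutdown: when the host itself needs the GPU for containers (Stable Diffusion, text-generation UIs), the device can in principle be rebound between vfio-pci and the host driver from the command line. A hedged sketch for an Nvidia card (example address; nothing may be using the card on either side, and AMD's reset bug makes this far less reliable there):

Code:
# release the card from vfio-pci once no VM is using it
echo 0000:01:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind

# hand it to the host's nvidia driver so containers can use it
echo 0000:01:00.0 > /sys/bus/pci/drivers/nvidia/bind

# do the reverse (unbind from nvidia, bind back to vfio-pci) before
# giving it to a VM again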
 
