TrueNAS as VM question

fearz

New Member
Aug 7, 2024
Hello,

I have a TrueNAS SCALE VM with hard disks passed through via PCIe, shared over NFS to all Proxmox cluster nodes.

I also have other VMs that use that shared TrueNAS NFS storage as their main disk storage. The issue is that when the TrueNAS VM restarts, all other VMs using that shared NFS storage get I/O errors; the only solution is to restart those VMs after the TrueNAS VM has become healthy again.

Is there any way to bring those VMs back to a normal state once the TrueNAS VM is healthy again?
 
Hi fearz:
You could consider using the "Start/Shutdown order" setting in Options for the VMs that use the NFS export on TrueNAS, so that they shut down before TrueNAS and start after it.
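For example, something like this (a rough sketch; 100, 201 and 202 are only placeholder VMIDs for the TrueNAS VM and the VMs that depend on it):

Bash:
# Start TrueNAS first and wait before starting anything else
# ("up" is the delay in seconds after this VM has started)
qm set 100 --startup order=1,up=120

# Dependent VMs start later; shutdown happens in reverse order,
# so they are stopped before TrueNAS
qm set 201 --startup order=2
qm set 202 --startup order=2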
 

This I already use when the server starts up or shuts down...

But I'm talking about when the server is already up and I have other VMs on other nodes, also already up, that depend on the TrueNAS VM.

If the TrueNAS VM goes down for whatever reason, the VMs on other nodes that depend on that shared storage fail / give errors, and I have to restart them to get them working again.
 
OK, I think using a TrueNAS VM inside Proxmox VE to serve an NFS export for the Proxmox cluster is the last choice in this environment! If you want to handle unexpected I/O errors for the VMs that depend on the NFS export from the TrueNAS VM, you will need to write a custom script along these lines:
  1. Regularly write a test file to the NFS export and check the result, then
  2. If the write test fails (perhaps with a threshold of up to 3 failures), then
  3. Check whether each VM that depends on this NFS export is running or stopped, then
  4. If it is running, send qm stop <VMID> to that VM
  5. If it is already stopped, do nothing for that VM
  6. If the write test succeeds, then
  7. Check whether each VM that depends on this NFS export is running or stopped, then
  8. If it is running, do nothing for that VM
  9. If it is already stopped, send qm start <VMID> to that VM
You would need to run this script on each node in the Proxmox cluster, e.g. via a cron job, to cover the case where TrueNAS is alive but a Proxmox node has lost network access to the NFS export served by TrueNAS.

PS: Using a simple script to work around the I/O error problem may not be best practice, and maybe someone else can offer a better solution. Of course, replacing the TrueNAS VM with a proper HA NFS solution would also be one of the better options. ^_^
 
The following script was generated by AI, for your reference. ^_^

Bash:
#!/bin/bash

# Define the NFS mount point and the file to write
NFS_MOUNT="/mnt/nfs"
TEST_FILE="test.txt"

# Define the threshold for write failures
FAILURE_THRESHOLD=3

# Define the list of VM IDs associated with the NFS export
VM_IDS=(101 102 103)

# Initialize failure counter
FAILURE_COUNT=0

while true; do
    # Attempt to write to the NFS mount (with a timeout, since a hung NFS hard mount can block forever)
    if timeout 10 sh -c "echo Test > $NFS_MOUNT/$TEST_FILE"; then
        # If write is successful, reset failure counter
        FAILURE_COUNT=0
       
        # Check each VM and start if stopped
        for vm in "${VM_IDS[@]}"; do
            if ! qm status "$vm" | grep -q "running"; then
                echo "Starting VM $vm"
                qm start "$vm"
            fi
        done
    else
        # Increment failure counter if write fails
        ((FAILURE_COUNT++))
       
        # If failure threshold is reached, stop VMs
        if [ $FAILURE_COUNT -ge $FAILURE_THRESHOLD ]; then
            echo "Write test failed $FAILURE_THRESHOLD times. Stopping VMs."
           
            # Check each VM and stop if running
            for vm in "${VM_IDS[@]}"; do
                if qm status "$vm" | grep -q "running"; then
                    echo "Stopping VM $vm"
                    qm stop "$vm"
                fi
            done
           
            # Reset failure counter after taking action
            FAILURE_COUNT=0
        fi
    fi
   
    # Wait before next test
    sleep 60
done

Explanation:

  1. NFS Write Test: The script regularly attempts to write to a file on the NFS mount.
  2. Failure Handling: If the write fails, it increments a failure counter. If this counter reaches a threshold (3 in this case), it stops all running VMs associated with the NFS export.
  3. VM State Management: If the write is successful, it checks each VM and starts any that are stopped. If the write fails and the threshold is reached, it stops any running VMs.
  4. Loop and Delay: The script runs indefinitely, testing the NFS write every minute.

Notes:

  • Ensure you have qm installed and configured properly for managing VMs.
  • Replace /mnt/nfs with your actual NFS mount point and adjust VM_IDS with your actual VM IDs.
  • You may need to adjust permissions or paths based on your system configuration.
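Since the script loops forever, a plain per-minute cron entry would spawn overlapping copies; one simple way to keep a single copy running on each node (as suggested above) is an @reboot cron entry. A rough sketch, assuming the script is saved as /usr/local/bin/nfs-vm-watch.sh (an example path, adjust as needed):

Bash:
# Make the script executable and start one copy at boot on every node
chmod +x /usr/local/bin/nfs-vm-watch.sh
(crontab -l 2>/dev/null; echo "@reboot /usr/local/bin/nfs-vm-watch.sh >> /var/log/nfs-vm-watch.log 2>&1") | crontab -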