How to recover from 100% disk use

  pool: Backup_Storage
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
config:

        NAME              STATE     READ WRITE CKSUM
        Backup_Storage    DEGRADED     0     0     0
          sdb             DEGRADED     0     0     0  too many errors

errors: No known data errors
 
!! EVERYTHING AT YOUR OWN RISK !!
I would do the following:

- Identify the affected disk with zpool status -x (see also the link in the message you posted).
- Replace the cable on that disk.
- Run zpool clear Backup_Storage <diskwitherror>. The state should then change from DEGRADED to ONLINE. Check this with zpool status.
- Observe the system for a while. If the error does not recur, the matter is settled.
- If the error does recur, replace the disk (see link).
- Now you can run zfs set reservation=2g Backup_Storage.
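
For reference, the command sequence from the list above could be wrapped up roughly like this. It is only a sketch: the pool and disk names are taken from the status output above, while the `run` helper, the `recover_pool` function and the `DRY_RUN` switch are my own hypothetical additions and not part of ZFS. By default nothing is executed, the commands are only printed. Swapping the cable (or the disk) remains a manual step.

```shell
#!/usr/bin/env bash
# Sketch only: POOL and DISK default to the names from the status output above.
POOL="${POOL:-Backup_Storage}"
DISK="${DISK:-sdb}"
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 actually runs them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_pool() {
    # 1. Show only pools that currently have problems
    run zpool status -x
    # 2. (Manual step, not scripted: replace the cable on the affected disk)
    # 3. Clear the error counters for that disk
    run zpool clear "$POOL" "$DISK"
    # 4. Check that the state went from DEGRADED back to ONLINE
    run zpool status "$POOL"
    # 5. Reserve some space, as suggested in the last step of the list
    run zfs set reservation=2g "$POOL"
}
```

Calling `recover_pool` prints the four commands for review; `DRY_RUN=0 recover_pool` (as root) would execute them.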
 
I made a script to help others out.
If you end up filling up your space, run this as a privileged user such as root.

Steps:
1. Set the paths "SRC_DIR", "DEST_DIR" and "BACKUP_FILE"
2. Run the script and select "MOVE"; you need to free up at least 100 MB
3. Manually run a Garbage Collect job in PBS now; it should work (ignore the warnings, the moved files are restored in step 4)
4. Run the script again and select "RESTORE" to put your moved files back as they were
5. Run a full Verify in PBS
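
To see whether step 2 has freed enough space, something like the following could be used. This is a minimal sketch, not part of the script: the function names `avail_mb` and `has_free_space` are hypothetical, and the 100 MB figure is simply the threshold mentioned in step 2.

```shell
#!/usr/bin/env bash
# avail_mb PATH: print the space available on the filesystem holding PATH, in MiB.
# Uses the POSIX output format of df (-P) with 1024-byte blocks (-k).
avail_mb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# has_free_space PATH MIN_MB: succeed if at least MIN_MB MiB are available.
has_free_space() {
    local avail
    avail="$(avail_mb "$1")"
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}
```

For example, `has_free_space /path/to/<datastore-name> 100 && echo "enough room for Garbage Collect"` checks the datastore path before starting the Garbage Collect job.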

Thank you!

Just had this issue, been trying to solve it for a couple of days and using this script cracked it for me!
 
I updated my script based on user feedback and added a few failsafes and guardrails.
The script should now give you a more guided experience:

Instructions are the same as before.
Steps:
  • Set the paths "SRC_DIR" and "DEST_DIR" ("BACKUP_FILE" is now derived from "DEST_DIR")
  • Follow the "Order of operations":
  1. Prune Job -- (in PBS GUI)
  2. Move
  3. Garbage Collect -- (in PBS GUI)
  4. Restore
  5. Garbage Collect -- (in PBS GUI)
Save the script below as "anex_pbs_script.sh" and make it executable:
chmod +x ./anex_pbs_script.sh

Bash:
#!/usr/bin/env bash

# Set Source and destination directories
# Create the destination directory before running the script, where your PBS files will be moved to clear some space.

# Variables (EDIT BELOW):
SRC_DIR="/path/to/<datastore-name>/.chunks"
DEST_DIR="/path/to/destination"

# ---------------------------------------------------------------------
# DO NOT EDIT BELOW:
# ---------------------------------------------------------------------

# Build paths:
CHUNKS_DIR="$DEST_DIR/backup/.chunks"
BACKUP_FILE="$DEST_DIR/logs/move_log.txt"
LOCK_FILE="$DEST_DIR/.state_lock"

# Pre-flight Configuration Check:
check_config() {
    # Ensure log directory exists
    mkdir -p "$(dirname "$BACKUP_FILE")"

    # Check if CHUNKS_DIR exists, create if not
    if [ ! -d "$CHUNKS_DIR" ]; then
        echo "Destination directory '$CHUNKS_DIR' does not exist. Attempting to create..."
        if ! mkdir -pv "$CHUNKS_DIR"; then
             echo -e "\e[31mERROR: Failed to create destination directory '$CHUNKS_DIR'. Aborting.\e[0m"
             exit 1
        fi
    fi

    # Set Permissions on CHUNKS_DIR (Must be backup:backup 750)
    # This ensures consistency with PBS datastore structure
    chown backup:backup "$CHUNKS_DIR"
    chmod 750 "$CHUNKS_DIR"

    # Check if CHUNKS_DIR is writable
    if [ ! -w "$CHUNKS_DIR" ]; then
        echo -e "\e[31mCRITICAL ERROR: Destination directory '$CHUNKS_DIR' is NOT writable.\e[0m"
        echo "Please fix permissions or check the path."
        exit 1
    fi
}

clear
function_move() {

    # warn the user that the script will move files from SRC_DIR to CHUNKS_DIR, make it eye catching:
    echo -e "\e[31m----- WARNING: This script will move files -----\e[0m"
    echo -e "from: \e[33m'$SRC_DIR'\e[0m"
    echo -e "to:   \e[32m'$CHUNKS_DIR'\e[0m"
    echo -e "\e[31m----- WARNING: Do Not Run MOVE second time without Running RESTORE first -----\e[0m"

    # Prompt the user to select the number of latest modified folders to move
    echo
    echo "Select the number of latest folders to move:"
    echo "This is done to free up space for garbage cleaning to be able to run"
    echo
    options=("<Manually input>" "5" "10" "25" "50" "100")
    select opt in "${options[@]}"
    do
        case $opt in
            "<Manually input>")
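                read -p "Enter the number of items: " NUM_ITEMS
                # Accept only a positive whole number; otherwise ask again
                if ! [[ "$NUM_ITEMS" =~ ^[1-9][0-9]*$ ]]; then
                    echo "Please enter a positive whole number."
                    continue
                fi
                break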
                ;;
            "5")
                NUM_ITEMS=5
                break
                ;;
            "10")
                NUM_ITEMS=10
                break
                ;;
            "25")
                NUM_ITEMS=25
                break
                ;;
            "50")
                NUM_ITEMS=50
                break
                ;;
            "100")
                NUM_ITEMS=100
                break
                ;;
            *) echo "Invalid option $REPLY";;
        esac
    done

    # Find the latest modified folders or files in SRC_DIR
    LATEST_ITEMS=$(ls -t "$SRC_DIR" | head -n "$NUM_ITEMS")

    # Output the result
    # echo "The latest $NUM_ITEMS modified items are:"
    # echo "$LATEST_ITEMS"

    # Show human-readable sizes of the selected items
    echo "Sizes of the selected items:"
    for ITEM in $LATEST_ITEMS;
    do
      du -sh "$SRC_DIR/$ITEM"
    done

    # Show the total combined size
    echo
    echo "Total combined size of the selected items:"
    du -ch $(for ITEM in $LATEST_ITEMS; do echo "$SRC_DIR/$ITEM"; done) | grep total

    # Ask for confirmation to proceed with the move
    while true; do
        read -p "Do you want to proceed with moving these items to $CHUNKS_DIR? (y/n): " CONFIRM
        case $CONFIRM in
            [Yy]* )
                break
                ;;
            [Nn]* )
                read -p "Do you want to retry or exit? (r/e): " RETRY
                case $RETRY in
                    [Rr]* )
                        function_move
                        return
                        ;;
                    [Ee]* )
                        echo "Operation cancelled."
                        exit 1
                        ;;
                    * )
                        echo "Invalid option. Please enter 'r' to retry or 'e' to exit."
                        ;;
                esac
                ;;
            * )
                echo "Invalid option. Please enter 'y' to proceed or 'n' to cancel."
                ;;
        esac
    done

    # Ensure the destination directory exists
    if [ ! -d "$CHUNKS_DIR" ]; then
        mkdir -pv "$CHUNKS_DIR"
    fi
    # Change ownership of the destination directory to 34:34
    chown 34:34 "$CHUNKS_DIR"

    # Backup 'permissions', 'uid', 'gid', 'File path', 'last modification', 'last access' and 'last status change' to a file
    > "$BACKUP_FILE"
    for ITEM in $LATEST_ITEMS;
    do
    find "$SRC_DIR/$ITEM" -exec stat -c "%a %U %G %n %Y %X %Z" {} \; >> "$BACKUP_FILE"
    done

    # Copy the latest folders or files to CHUNKS_DIR
    # Copy and Verify before Deleting
    for ITEM in $LATEST_ITEMS;
    do
        echo "Moving '$ITEM'..."
        # 1. Copy
        cp -rpv "$SRC_DIR/$ITEM" "$CHUNKS_DIR"
        CP_EXIT_CODE=$?
      
        # 2. Verify
        if [ $CP_EXIT_CODE -eq 0 ] && [ -e "$CHUNKS_DIR/$ITEM" ]; then
            # Verification Successful
          
            # Backup stats again just to be safe/redundant or relies on earlier bulk backup?
            # The bulk backup earlier is fine.
          
            # 3. Delete Source
            rm -r "$SRC_DIR/$ITEM"
        else
            # Verification Failed
            echo -e "\e[31mCRITICAL ERROR: Failed to move '$ITEM'. Copy failed or Destination file missing.\e[0m"
            echo "Halting operation completely to prevent data loss."
            echo "Please check '$CHUNKS_DIR' and manual intervention is required."
            return 1 # Exit function, do not continue loop
        fi
    done

    echo
    echo "Items moved successfully."
    echo "State Locked. You must run Restore to clear the lock."
  
    # Create Lock File
    touch "$LOCK_FILE"

    echo
    echo -e "\e[31m----- DO NOT RUN AGAIN UNTIL FILES ARE RESTORED BACK TO '$SRC_DIR' -----\e[0m"
    echo
    echo "Next Steps:"
    echo
    echo -e " 1). Run a \e[32m\"GARBAGE COLLECT\"\e[0m Job in Proxmox Backup Server to free up space"
    echo " 2). Then Run RESTORE with this script to move the files back to '$SRC_DIR'"
    echo " 3). Or you can Exit now and run this script later to Restore."
    echo -e "\e[31mDO NOT RUN \"PRUNE JOB\"\e[0m as it will mark the moved files for deletion and restore won't work then, corrupting the PBS backups"
    echo
    read -p "Press Enter to return to menu..."
}

function_restore() {
    # Restore the items from CHUNKS_DIR to SRC_DIR
    echo "Restoring items from $CHUNKS_DIR to $SRC_DIR..."

    # Copy the items back to SRC_DIR
    while IFS=' ' read -r SRC_PERM SRC_USER SRC_GROUP SRC_PATH SRC_MODTIME SRC_ACCESSTIME SRC_CHANGETIME; do
        # Quote SRC_DIR inside the expansion so it is matched literally, not as a glob pattern
        REL_PATH="${SRC_PATH#"$SRC_DIR"/}"
        DEST_PATH="$SRC_DIR/$REL_PATH"
        DEST_DIR_PATH=$(dirname "$DEST_PATH")

        # Copy the item back
        cp -rpv "$CHUNKS_DIR/$REL_PATH" "$DEST_PATH"

        # Restore permissions, timestamps, and ownership
        if [ -e "$DEST_PATH" ]; then
            DEST_PERM=$(stat -c %a "$DEST_PATH")
            if [ "$SRC_PERM" != "$DEST_PERM" ]; then
                chmod "$SRC_PERM" "$DEST_PATH"
            fi
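            # Restore ownership first (chown changes ctime, not mtime/atime)
            chown "$SRC_USER:$SRC_GROUP" "$DEST_PATH"
            # Set the timestamps. ctime cannot be set with touch; a plain
            # 'touch -d' would overwrite both mtime and atime, so the saved
            # SRC_CHANGETIME stays in the log for reference only.
            touch -m -d "@$SRC_MODTIME" "$DEST_PATH"
            touch -a -d "@$SRC_ACCESSTIME" "$DEST_PATH"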
        fi
    done < "$BACKUP_FILE"

    # Archive the leftover files and folders in CHUNKS_DIR instead of deleting them
    TIMESTAMP=$(date +%Y-%m-%d_%H-%M-%S)
    # Ensure archive is a sibling of the chunks folder, not inside it
    DEST_ARCHIVE="${CHUNKS_DIR%/}_archive_$TIMESTAMP"
  
    echo "Archiving backup files to '$DEST_ARCHIVE'..."
    mkdir -p "$DEST_ARCHIVE"
    # Set Permissions on ARCHIVE (Must be backup:backup 750)
    chown backup:backup "$DEST_ARCHIVE"
    chmod 750 "$DEST_ARCHIVE"
  
    # Archive the log file with matching timestamp
    ARCHIVED_LOG_FILE="${BACKUP_FILE%.*}_$TIMESTAMP.txt"
    cp "$BACKUP_FILE" "$ARCHIVED_LOG_FILE"
    echo "Archived metadata log to '$ARCHIVED_LOG_FILE'"

    # Move leftover files to the archive.
    # Only the items recorded in BACKUP_FILE (the ones this script put there) are moved;
    # anything else found in CHUNKS_DIR is left untouched.
  
    while IFS=' ' read -r _ _ _ SRC_PATH _ _ _; do
        REL_PATH="${SRC_PATH#"$SRC_DIR"/}"
        DEST_PATH="$CHUNKS_DIR/$REL_PATH"
        if [ -e "$DEST_PATH" ]; then
             mv "$DEST_PATH" "$DEST_ARCHIVE/"
        fi
    done < "$BACKUP_FILE"
  
    # Remove Lock File
    if [ -f "$LOCK_FILE" ]; then
        rm "$LOCK_FILE"
        echo "State Lock removed."
    fi

    echo
    echo "----------------------------------------------------------------"
    echo "RESTORE COMPLETED SUCCESSFULLY"
    echo "----------------------------------------------------------------"
    echo "1. Items restored to Source."
    echo "2. Metadata (permissions/ownership) re-applied."
    echo "3. Restored files were also archived (in case of an emergency); you can delete the archive after confirming that PBS works properly again:"
    echo -e "   -> \e[33m$DEST_ARCHIVE\e[0m"
    echo "4. Metadata log preserved at:"
    echo -e "   -> \e[33m$ARCHIVED_LOG_FILE\e[0m"
    echo "----------------------------------------------------------------"
    echo -e "\e[32mDONE\e[0m"
    echo -e "Run another \e[32mGARBAGE COLLECT\e[0m job to fix the warnings from the previous \"GARBAGE COLLECT\""
    echo "Returning to the menu for a new session..."
    echo
}

echo
echo -e "\e[31m----- WARNING: Do Not Run MOVE second time without Running RESTORE first -----\e[0m"
# Start the function_move function or restore function based on user input
# Main Loop
check_config

while true; do
    echo
    if [ -f "$LOCK_FILE" ]; then
        # Locked State Message
        echo -e "\e[32mMove Completed, your files are safe in :\e[0m"
        echo -e "\e[33m$CHUNKS_DIR\e[0m"
        echo "1. Check the PBS Datastore should now have free space,"
        echo -e "2. Next Step: Run a \e[32m\"GARBAGE COLLECT\"\e[0m job only"
        echo -e "\e[31mDO NOT RUN \"PRUNE JOB\"\e[0m as it will mark the moved files for deletion and restore won't work then, corrupting the PBS backups"
    else
        # Unlocked State Message
        echo -e "\e[33mMake sure you have run at least one full 'PRUNE JOB' in PBS before proceeding.\e[0m"
        echo -e "\e[33mThe prune job only marks files for deletion; nothing is deleted.\e[0m"
        echo
        echo -e "Order of operations:"
        echo -e "1). \e[33mPrune Job\e[0m        --  (in PBS GUI)"
        echo -e "2). Move"
        echo -e "3). \e[33mGarbage Collect\e[0m  --  (in PBS GUI)"
        echo -e "4). Restore"
        echo -e "5). \e[33mGarbage Collect\e[0m  --  (in PBS GUI)"
    fi
  
    echo
    echo -e "\e[31m----- WARNING: Do Not Run MOVE second time without Running RESTORE first -----\e[0m"
    echo
    echo -e "\e[32m      SOURCE Directory: $SRC_DIR\e[0m"
    echo -e "\e[33m DESTINATION Directory: $CHUNKS_DIR\e[0m"
    echo
    echo "Do you want to move items or restore items?"
    echo

    # Dynamic Menu Options
    if [ -f "$LOCK_FILE" ]; then
        echo "1) Move   --   (Locked until Restore is done)"
    else
        echo "1) Move"
    fi
    echo "2) Restore"
    echo "3) Exit"
  
    read -p "#? " choice
  
    case $choice in
        1)
            if [ -f "$LOCK_FILE" ]; then
                echo
                echo -e "\e[31m----- WARNING: LOCKED -----\e[0m"
                echo -e "\e[31mA 'Move' operation was already completed. Doing it again implies you want to move MORE files.\e[0m"
                echo -e "\e[31mWARNING: This will overwrite the backup log. IF YOU PROCEED WITHOUT RESTORING FIRST, PREVIOUS METADATA LOGS WILL BE LOST.\e[0m"
                echo
                read -p "Type 'YES' to ignore this warning and proceed (ANYTHING ELSE TO CANCEL): " FORCE_CONFIRM
                if [ "$FORCE_CONFIRM" != "YES" ]; then
                    echo "Cancelled."
                    continue
                fi
            fi
            function_move
            ;;
        2)
            function_restore
            ;;
        3)
            echo "Exiting."
            exit 0
            ;;
        *)
            echo "Invalid option."
            ;;
    esac
done
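
A possible refinement of the "Copy and Verify before Deleting" step: the script above treats a copy as good when cp exits with 0 and the destination exists. A stricter check would compare the contents before removing the source. The sketch below does that with diff -rq; the names `verify_copy` and `safe_move` are hypothetical and not part of the script, and the comparison reads both copies in full, so it is slower on large chunk folders.

```shell
#!/usr/bin/env bash
# verify_copy SRC DST: succeed only if DST exists and has the same content as SRC.
verify_copy() {
    local src="$1" dst="$2"
    [ -e "$src" ] && [ -e "$dst" ] || return 1
    # -r: recurse into directories, -q: only report whether files differ
    diff -rq -- "$src" "$dst" > /dev/null
}

# safe_move SRC DEST_PARENT: copy SRC into DEST_PARENT, verify, then delete SRC.
safe_move() {
    local src="$1" dest_parent="$2" name
    name="$(basename -- "$src")"
    # Copy recursively, preserving mode, ownership and timestamps
    cp -rp -- "$src" "$dest_parent/" || return 1
    # Keep the source if the copy does not match byte for byte
    verify_copy "$src" "$dest_parent/$name" || return 1
    rm -r -- "$src"
}
```

Inside the move loop this would stand in for the cp / check / rm sequence, e.g. `safe_move "$SRC_DIR/$ITEM" "$CHUNKS_DIR" || return 1`.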
 