Patchy summary usage graphs

Articuler · Apr 30, 2023

I've been seeing patchy usage graphs on the summary page for my LXCs and VMs. The graphs will display correctly for a time then there will be large gaps in between. It's the same for the CPU, memory, and network graphs.

I've already tried clearing out /var/lib/rrdcached and restarting, but no dice. My BIOS clock also looks to be set correctly, to UTC. I see the following errors in /var/log/syslog.

Code:

Apr 30 01:39:19 pve pmxcfs[1598]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/100: -1
Apr 30 01:39:19 pve pmxcfs[1598]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/102: -1
Apr 30 01:39:19 pve pmxcfs[1598]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/104: -1
Apr 30 01:39:19 pve pmxcfs[1598]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/111: -1
Apr 30 01:39:19 pve pmxcfs[1598]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/110: -1
Apr 30 01:39:19 pve pmxcfs[1598]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/108: -1
Apr 30 01:39:19 pve pmxcfs[1598]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/107: -1
Apr 30 01:39:19 pve pmxcfs[1598]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/103: -1
Apr 30 01:39:19 pve pmxcfs[1598]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/109: -1
Apr 30 01:39:19 pve pmxcfs[1598]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/106: -1
Apr 30 01:39:19 pve pmxcfs[1598]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/112: -1
Apr 30 01:39:19 pve pmxcfs[1598]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/101: -1
Apr 30 01:39:19 pve pmxcfs[1598]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/105: -1
Apr 30 01:39:19 pve pmxcfs[1598]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve/local-zfs: -1

Has anybody seen this before? InfluxDB looks to be getting all the usage metrics; it's just the Proxmox GUI that is having issues.

Articuler · May 9, 2023

For anyone else unfortunate enough to run into this, I think I found a solution at least for my case.

Each batch of RRDC update errors also seemed to be accompanied by a name resolution error for my InfluxDB server. It looked something like:

Code:

pve pvestatd[1730]: metrics send error 'InfluxDB': 500 Can't connect to influxdb.example.com:8086 (Temporary failure in name resolution)
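To check whether the same correlation shows up on another host, something like the following sketch could help. It assumes the traditional syslog timestamp layout seen in the log excerpt above (month, day, HH:MM:SS as the first three fields) and matches the two message types per minute rather than per second, since the two daemons may not log in the same second. The function name is hypothetical.

```shell
#!/bin/bash
# Sketch: list every minute in a syslog file that contains both an
# "RRDC update error" line and a "metrics send error" line.
# Assumes the first three fields are month, day and HH:MM:SS.
correlate_errors() {
    local logfile="$1"
    awk '
        /RRDC update error/  { rrd[$1 " " $2 " " substr($3, 1, 5)] = 1 }
        /metrics send error/ { met[$1 " " $2 " " substr($3, 1, 5)] = 1 }
        END { for (t in rrd) if (t in met) print t }
    ' "${logfile}" | sort
}

# Example call:
# correlate_errors /var/log/syslog
```

If every minute with RRDC update errors also appears in this list, the failed metrics push is a likely culprit; an empty list points to a different cause.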

It looks like the logic for generating summary usage graphs is coupled too closely with the code that pushes influxDB metrics. If the influxDB push fails then there will be gaps in the usage page.

I solved this by hardcoding the domain/IP pair for the InfluxDB server in /etc/hosts, bypassing the need for a DNS lookup. The errors in the syslog disappeared and my summary graphs no longer have gaps.
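As a rough sketch of that /etc/hosts workaround: the helper below appends an entry only if the name is not already mapped. The function name is hypothetical, and 192.0.2.10 is a documentation placeholder address, not the real server; substitute your own InfluxDB IP and hostname.

```shell
#!/bin/bash
# Sketch: add "<ip> <name>" to a hosts file unless the name is already mapped.
# The file defaults to /etc/hosts; pass another path to try it out safely.
add_hosts_entry() {
    local ip="$1" name="$2" file="${3:-/etc/hosts}"
    # Do nothing if a non-comment line already maps this exact name.
    if awk -v n="${name}" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "${file}"; then
        return 0
    fi
    printf '%s %s\n' "${ip}" "${name}" >> "${file}"
}

# Example call (as root, placeholder address):
# add_hosts_entry 192.0.2.10 influxdb.example.com
```

Note that a static entry has to be updated by hand if the server's address ever changes.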

Neobin · May 10, 2023

Articuler said:
It looks like the logic for generating summary usage graphs is coupled too closely with the code that pushes influxDB metrics. If the influxDB push fails then there will be gaps in the usage page.

Unfortunately, yes:
https://bugzilla.proxmox.com/show_bug.cgi?id=4130

CLE · Jun 20, 2023

I have similar symptoms but, as I just found out, with another cause.

I, too, have regular (and by regular I mean like clockwork) gaps in my graphs.

I run 12 active LXC containers and 2 VMs on a 3-node cluster, and until now I couldn't quite put my finger on what it was. Then, upon reading a similar thread, it dawned on me:

The time during which I see graphs being recorded is the time my PBS backup server target is booting or running. After it is shut down on schedule, the graphs go missing again.

So that seems to be the cause in my case. I just wonder how I could work around it, as I don't intend to keep the more-or-less offline backup server running all the time. I'd appreciate any and all ideas!

Another observation I made:

I once had a container with an ID of 115. That one was deleted, and when I created a new one the ID got reused. This single container does NOT exhibit the same behavior: its graphs are available all the time, no matter whether the backup server is running or not.

Dunuin · Jun 20, 2023

PVE has a serious problem with unavailable storages: it polls them every couple of seconds, and if a storage isn't answering, pvestatd can get stuck. My solution was to write a hook script that enables the PBS storages when the backup job starts and disables them after the backup job has finished. PVE has no problem with an unavailable PBS as long as the PBS storage is disabled.
The downside: as the PBS storages are always disabled while no backup job is running, you will have to manually enable the storage whenever you want to do a manual backup.

CLE · Jun 20, 2023

That sounds like a workable solution. Would you please be so kind as to share your hook script?

Thank you so much!

Dunuin · Jun 20, 2023

Here is some quick & dirty code (I didn't plan to show it to anyone) that I've been using successfully on 4 nodes for some months. But all my PVE nodes are unclustered, so it might be problematic if you are running a cluster, as I didn't account for that. I started rewriting all of it some weeks ago but never finished.

Bash:

#! /bin/bash

################################################################################
#
# Dunuins Vzdump Hook Script for Voyager
#
# This script for PVE 7.3 will enable storages when a backup starts and disable
# them after the backup has finished. This is useful in case your backup
# storages aren't online 24/7, as the webUI will get unresponsive or even
# totally unusable when there are unreachable PBS, NFS or SMB storages.
#
# To install it, create a new file at the location of your choice and edit it.
# For example:
# $ nano /var/lib/vz/snippets/vzdump_hook.sh
# Paste all of this code there and save it with CTRL+X, Y.
# Make that script owned by root:
# $ chown root:root /var/lib/vz/snippets/vzdump_hook.sh
# Make the script executable by root:
# $ chmod 750 /var/lib/vz/snippets/vzdump_hook.sh
# Add hook script to /etc/vzdump.conf:
# $ echo 'script: /var/lib/vz/snippets/vzdump_hook.sh' >> /etc/vzdump.conf
#
# Last edit: 2023.05.21 03:12
###############################################################################

# You might want to edit the following parameters:

# Define which storageIDs should be allowed to be enabled/disabled by this
# script. When the array is empty, it will allow all storages. If you only want
# to allow specific storages, add a new storageID by creating a new line like
# this below "incl_storids=()":
# incl_storids+=("MyStorageID")
declare -a incl_storids
incl_storids=()
#incl_storids+=("YourStorageID")

# Define which storageIDs should not be allowed to be enabled/disabled by this
# script. When the array is empty, it won't prevent enabling/disabling of any
# storages. If you want to exclude a storage from enabling/disabling add a new
# line like this below "excl_storids=()":
# excl_storids+=("MyStorageID")
declare -a excl_storids
excl_storids=()
#excl_storids+=("YourStorageID")

# set the VMIDs here of the VMs that can't be run at the same time
declare -a vmids_sharing_device1
vmids_sharing_device1=()
#vmids_sharing_device1+=(YourVMID)

# where to store config files
conf_dir="/tmp"

# if VMs that got shutdown because they were part of a group sharing a device
# should be started again when the backup job ends
resume_vmids_sharing_device="false"

# how many seconds to wait between retries
retry_timeout=60

# how many seconds to wait for a VM to shutdown
shutdown_timeout=600

# how often to retry
retry_amount=3

# how many seconds to wait for pvesm commands to finish
check_interval=10

# how many seconds need to be passed since booting the server before backup jobs are allowed to run.
# might be useful in case you virtualize a PBS or NAS that needs to be started first after boot
# in order for the backup storage to get available
boot_delay=300

# End of Config

phase=$1
case "${phase}" in
    job-init \
        | job-start \
        | job-end \
        | job-abort)
        # undef for Proxmox Backup Server storages
        # undef in phase 'job-init' except when --dumpdir is used directly
        dumpdir=$(printenv DUMPDIR)
        # undef when --dumpdir is used directly
        storeid=$(printenv STOREID)
        case "${phase}" in
            job-init)
                # pause backup job if uptime is below boot_delay
                # calculate uptime in seconds
                uptimesec=$(($(date +%s)-$(date +%s --date="$(uptime -s)"))) 
                while [ ${uptimesec} -lt ${boot_delay} ]; do
                    sleep 10
                    # calculate uptime in seconds
                    uptimesec=$(($(date +%s)-$(date +%s --date="$(uptime -s)")))
                done     
                # remove files that store the last running VM of a group
                if [ -f "${conf_dir}/vzdump_resume_vmid1" ]; then
                    rm "${conf_dir}/vzdump_resume_vmid1"
                fi
                # enable storage
                if [ ! -z "${storeid}" ]; then
                    storid_allowed=0
                    if [ ${#incl_storids[@]} -gt 0 ]; then
                        for stid in "${incl_storids[@]}"; do
                            if [ "${stid}" = "${storeid}" ]; then
                                storid_allowed=1
                                break
                            fi
                        done
                    else
                        storid_allowed=1
                    fi
                    if [ ${#excl_storids[@]} -gt 0 ]; then
                        for stid in "${excl_storids[@]}"; do
                            if [ "${stid}" = "${storeid}" ]; then
                                storid_allowed=0
                                break
                            fi
                        done
                    fi
                    if [ ${storid_allowed} -eq 1 ]; then
                        storestatus=$(pvesm status --storage ${storeid} | grep -E "^${storeid}[[:space:]]" | tr -s ' ' | cut -d ' ' -f3)
                        if [ "${storestatus}" = "disabled" ]; then
                            # enable storage and wait for it to become active
                            retries=0
                            while [ ${retries} -lt ${retry_amount} ]; do
                                /usr/sbin/pvesm set "${storeid}" --disable 0
                                timeoutcounter=0
                                while [ ${timeoutcounter} -lt ${retry_timeout} ]; do
                                    sleep ${check_interval}
                                    # arithmetic add; plain += would append the digits as a string
                                    timeoutcounter=$((timeoutcounter + check_interval))
                                    storestatus=$(pvesm status --storage ${storeid} | grep -E "^${storeid}[[:space:]]" | tr -s ' ' | cut -d ' ' -f3)
                                    if [ "${storestatus}" = "active" ]; then
                                        echo "$(date '+%Y-%m-%d %H:%M:%S') - Storage '${storeid}' successfully enabled " >> /tmp/hook.log
                                        break 2
                                    fi
                                done
                                # arithmetic increment; plain += would append "1" as a string
                                retries=$((retries + 1))
                            done
                            if [ ${retries} -ge ${retry_amount} ]; then
                                # fail because storage couldn't be successfully enabled
                                /usr/sbin/pvesm set "${storeid}" --disable 1
                                echo "$(date '+%Y-%m-%d %H:%M:%S') - Error: failed to enable storage '${storeid}'" >> /tmp/hook.log
                                exit 1
                            fi
                        fi
                    fi
                fi
                ;;
            job-end)
                # disable storage
                if [ ! -z "${storeid}" ]; then
                    storid_allowed=0
                    if [ ${#incl_storids[@]} -gt 0 ]; then
                        for stid in "${incl_storids[@]}"; do
                            if [ "${stid}" = "${storeid}" ]; then
                                storid_allowed=1
                                break
                            fi
                        done
                    else
                        storid_allowed=1
                    fi
                    if [ ${#excl_storids[@]} -gt 0 ]; then
                        for stid in "${excl_storids[@]}"; do
                            if [ "${stid}" = "${storeid}" ]; then
                                storid_allowed=0
                                break
                            fi
                        done
                    fi
                    if [ ${storid_allowed} -eq 1 ]; then
                        storestatus=$(pvesm status --storage ${storeid} | grep -E "^${storeid}[[:space:]]" | tr -s ' ' | cut -d ' ' -f3)
                        if [ "${storestatus}" != "disabled" ]; then
                            # disable storage and wait for it to become disabled
                            retries=0
                            while [ ${retries} -lt ${retry_amount} ]; do
                                /usr/sbin/pvesm set "${storeid}" --disable 1
                                timeoutcounter=0
                                while [ ${timeoutcounter} -lt ${retry_timeout} ]; do
                                    sleep ${check_interval}
                                    # arithmetic add; plain += would append the digits as a string
                                    timeoutcounter=$((timeoutcounter + check_interval))
                                    storestatus=$(pvesm status --storage ${storeid} | grep -E "^${storeid}[[:space:]]" | tr -s ' ' | cut -d ' ' -f3)
                                    if [ "${storestatus}" = "disabled" ]; then
                                        echo "$(date '+%Y-%m-%d %H:%M:%S') - Storage '${storeid}' successfully disabled " >> /tmp/hook.log
                                        break 2
                                    fi
                                done
                                # arithmetic increment; plain += would append "1" as a string
                                retries=$((retries + 1))
                            done
                            if [ ${retries} -ge ${retry_amount} ]; then
                                # fail because storage couldn't be successfully disabled
                                echo "$(date '+%Y-%m-%d %H:%M:%S') - Error: failed to disable storage '${storeid}'" >> /tmp/hook.log
                                exit 1
                            fi
                        fi
                    fi
                fi
                # start VMs again that got shutdown for the backup
                if [ "${resume_vmids_sharing_device}" = "true" ]; then
                    if [ -f "${conf_dir}/vzdump_resume_vmid1" ]; then
                        retries=0
                        start_vmid=$(<"${conf_dir}/vzdump_resume_vmid1")
                        # wait until VM is running
                        while [ ${retries} -lt ${retry_amount} ]; do
                            qm start ${start_vmid}
                            timeoutcounter=0
                            while [ ${timeoutcounter} -lt ${retry_timeout} ]; do
                                sleep ${check_interval}
                                # arithmetic add; plain += would append the digits as a string
                                timeoutcounter=$((timeoutcounter + check_interval))
                                vmstatus=$(qm status ${start_vmid} | grep status | sed 's/^status: \(.*\)/\1/')
                                if [ "${vmstatus}" = "running" ]; then
                                    echo "$(date '+%Y-%m-%d %H:%M:%S') - VM with VMID ${start_vmid} successfully started again" >> /tmp/hook.log
                                    break 2
                                fi
                            done
                            # arithmetic increment; plain += would append "1" as a string
                            retries=$((retries + 1))
                        done
                        # remove file that store the last running VM of a group
                        rm "${conf_dir}/vzdump_resume_vmid1"
                    fi
                fi
                ;;
        esac
        ;;
    backup-start \
        | backup-end \
        | backup-abort \
        | log-end \
        | pre-stop \
        | pre-restart \
        | post-restart)
        mode=$2
        vmid=$3
 
        # shutdown VMs sharing the GPU before doing a backup
        if [ "${phase}" = "backup-start" ]; then
            if [ ${#vmids_sharing_device1[@]} -gt 0 ]; then
                # find out position in the array
                this_vmid_position=-1
                for (( i=0; i<${#vmids_sharing_device1[@]}; i++ )); do
                    if [ "${vmids_sharing_device1[${i}]}" = "${vmid}" ]; then
                            this_vmid_position=${i}
                            break
                    fi
                done
                if [ ${this_vmid_position} -ge 0 ]; then
                    # find out which VMs are running
                    declare -a vmids_state
                    vmids_state=()
                    for (( i=0; i<${#vmids_sharing_device1[@]}; i++ )); do
                        # find out state of VM
                        vmids_state+=($(qm status ${vmids_sharing_device1[${i}]} | grep status | sed 's/^status: \(.*\)/\1/'))
                    done
                    # shutdown running VM
                    for (( i=0; i<${#vmids_state[@]}; i++ )); do
                        if [ "${vmids_state[${i}]}" = "running" ]; then
                            echo "$(date '+%Y-%m-%d %H:%M:%S') - Shutting down VM with VMID '${vmids_sharing_device1[${i}]}'" >> /tmp/hook.log
                            qm shutdown ${vmids_sharing_device1[${i}]} && qm wait ${vmids_sharing_device1[${i}]} -timeout ${shutdown_timeout}
                            if [ "${resume_vmids_sharing_device}" = "true" ]; then
                                echo "${vmids_sharing_device1[${i}]}" > "${conf_dir}/vzdump_resume_vmid1"
                            fi
                        fi
                    done
                fi
            fi
        fi
 
        ;;
    *)
        echo "$(date '+%Y-%m-%d %H:%M:%S') - Error: Phase '${phase}' unknown" >> /tmp/hook.log
        exit 1
        ;;
esac

exit 0

In addition to disabling/enabling storages, it is a workaround for 2 other problems I encountered.
1.) It can pause backup jobs for X seconds after boot so they won't fail because the virtualized PBS isn't available yet when backup jobs with the "repeat missed" option enabled trigger. So basically a hacky "Start delay" for backup jobs...
2.) It allows you to back up multiple VMs that share the same PCIe device while one of the VMs is running: it can shut down all of them and later restart the VM that was shut down for the backup job. Without that, the backups of all other VMs using the same device would fail, as a backup needs to start the VM and you can't start a VM whose PCI device is already in use.

CLE · Jun 20, 2023

Whoa, thanks for that. To be honest, that looks to be a tiny wee little bit more involved than I thought it'd be.

Not sure how I'm going to test that, especially given that I have a cluster. I guess I'll wait and see if someone from the Proxmox team has some input to share until I get around to trying it...

Dunuin · Jun 20, 2023

CLE said:
Not sure how i'm going to test that, especially regarding i got a cluster - i guess i'll wait and see if someone from Proxmox team will have some input to share until i get around to try it...

They know about the problem but fixing it would require a rewrite of the whole pvestatd...so a lot of work...and as a "mid to long term goal" I wouldn't bet this will be fixed soon.
See here: https://bugzilla.proxmox.com/show_bug.cgi?id=3714

A quick workaround would be to use the crontab or systemd timer to do a /usr/sbin/pvesm set "YourPbsStorage" --disable 0 and /usr/sbin/pvesm set "YourPbsStorage" --disable 1 in case your PBS always gets started and shutdown at the same time of the day.
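A minimal sketch of that cron approach, assuming a small wrapper script around the two pvesm commands quoted above. The script path, the cron times, and the overridable PVESM variable (included only so the script can be dry-run) are assumptions of this sketch, not part of the original workaround.

```shell
#!/bin/bash
# Sketch: enable or disable one PVE storage; meant to be called from cron.
# PVESM can be overridden for dry runs, e.g. PVESM=echo.
PVESM="${PVESM:-/usr/sbin/pvesm}"

set_storage_enabled() {
    local storeid="$1" action="$2"
    case "${action}" in
        enable)  "${PVESM}" set "${storeid}" --disable 0 ;;
        disable) "${PVESM}" set "${storeid}" --disable 1 ;;
        *)
            echo "usage: $0 <storeid> enable|disable" >&2
            return 2
            ;;
    esac
}

# Act only when called with exactly two arguments, e.g. from cron.
if [ "$#" -eq 2 ]; then
    set_storage_enabled "$1" "$2"
fi

# Hypothetical /etc/cron.d/pbs-storage entries (times and path are placeholders):
# 55 0 * * * root /usr/local/sbin/pbs-storage.sh YourPbsStorage enable
# 30 3 * * * root /usr/local/sbin/pbs-storage.sh YourPbsStorage disable
```

The enable time should fall shortly after the PBS is reachable and the disable time shortly before it shuts down, so pvestatd never polls the storage while it is offline.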

CLE · Jun 21, 2023

I see, and I also read your initial post that is linked from that bug...

Interestingly, my web UI doesn't hang at all (and I think it never really did). The graphs are missing for all containers and VMs except the one that has its ID reused from before. That part I also can't quite wrap my head around.

Anyway, as I don't see much in the way of errors either in the UI or in syslog, I probably can't be of much help here and will have to wait until it is fixed...

Dunuin said:
A quick workaround would be to use the crontab or systemd timer to do a /usr/sbin/pvesm set "YourPbsStorage" --disable 0 and /usr/sbin/pvesm set "YourPbsStorage" --disable 1 in case your PBS always gets started and shutdown at the same time of the day.

Yup, that is something I read in your post as well. Since I send a magic packet to wake up the backup server and basically shut it down at a fixed time (for the time being - permanent workaround), I think that may be workable as well. Thanks again!

seele · Feb 29, 2024

Thanks for the discussion, it made me realize that when I use the iSCSI feature, the graphs may not display properly!
So I must choose another way to monitor my VMs.
