[SOLVED] error fetching datastores - 500 after upgrade to 2.2

rlljorge
Member · Jun 2, 2020
Hi,

My scheduled backups fail randomly; I am receiving this intermittent error on different nodes:

TASK ERROR: could not activate storage 'pbs02': pbs02: error fetching datastores - 500 Can't connect to 10.250.5.84:8007

The pbs02 storage is mounted on all nodes of the cluster, and I can see the files and info.
If I execute the scheduled or an individual backup manually, it works successfully.
I have jumbo frames configured and tested on all Proxmox nodes and on the PBS server.
I attached the log files showing the scheduled backup failing at 12:30 and a manual execution succeeding at 12:34 for the same VM.
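(A hedged sanity check for the jumbo-frame claim: send non-fragmentable pings sized to the full MTU, so any hop with a smaller MTU in between fails loudly. The target IP is the PBS address from the error above; adjust as needed.)

```shell
# A 9000-byte MTU leaves an 8972-byte ICMP payload:
# 9000 minus 20-byte IP header minus 8-byte ICMP header.
mtu=9000
payload=$((mtu - 28))
echo "payload=$payload"

# Non-fragmentable ping at full size; any hop with MTU < 9000 breaks it:
#   ping -M do -s "$payload" -c 3 10.250.5.84
```

If this ping fails while a normal `ping` works, some switch or interface on the path is not actually passing jumbo frames.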

Regards,

Rodrigo L L Jorge
 

Attachments

  • job.txt (9.3 KB)
  • pbs.txt (12.7 KB)
  • node.txt (1.8 KB)

dcsapak
Proxmox Staff Member · Feb 1, 2016 · Vienna
hi,

i'd look at your network: you not only get a 'can't connect', you also get corosync retransmissions, which indicate an overloaded network
 

rlljorge
Member · Jun 2, 2020
Hi,

The traffic for the corosync ring network is separate from the data network, and the ring network is not used for the Proxmox Backup Server.

The problem occurs on 12 distinct nodes using the same PBS; it started after the upgrade to 2.2-3.

Can I downgrade to 2.1? Is there some way?

Best Regards,
 

dcsapak
Proxmox Staff Member · Feb 1, 2016 · Vienna
Can I downgrade to 2.1? Is there some way?
no, that's not really supported

The traffic for the corosync ring network is separate from the data network, and the ring network is not used for the Proxmox Backup Server.
ok, i'd investigate regardless, so that it doesn't become a problem later on

The problem occurs on 12 distinct nodes using the same PBS; it started after the upgrade to 2.2-3.
mhm... can you post your network config from pve & pbs?
 

rlljorge
Member · Jun 2, 2020
Hello,

The proxmox network config:

Code:
root@proxmox11:~# pveversion
pve-manager/7.2-4/ca9d43cc (running kernel: 5.15.30-2-pve)

root@proxmox11:~# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto eno3
iface eno3 inet static
        address 192.168.30.139/27
#ring0

auto eno4
iface eno4 inet static
        address 192.168.30.171/27
#ring1

auto enp4s0f0
iface enp4s0f0 inet manual
        mtu 1500

auto enp4s0f1
iface enp4s0f1 inet manual
        mtu 1500

auto enp6s0f0
iface enp6s0f0 inet manual
        mtu 9000

auto enp6s0f1
iface enp6s0f1 inet manual
        mtu 9000

auto bond0
iface bond0 inet manual
        bond-slaves eno1 enp4s0f0
        bond-miimon 100
        bond-mode active-backup
        bond-primary eno1

auto bond1
iface bond1 inet manual
        bond-slaves eno2 enp4s0f1
        bond-miimon 100
        bond-mode active-backup
        bond-primary eno2

auto bond2
iface bond2 inet manual
        bond-slaves enp6s0f0 enp6s0f1
        bond-miimon 100
        bond-mode active-backup
        bond-primary enp6s0f0
        mtu 9000

auto bond2.45
iface bond2.45 inet static
        address 10.250.5.43/25
        mtu 9000

auto bond2.51
iface bond2.51 inet static
        address 10.250.6.42/25
        mtu 9000

auto vmbr0
iface vmbr0 inet static
        address 10.1.0.73/24
        gateway 10.1.0.253
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

auto vmbr2
iface vmbr2 inet manual
        bridge-ports bond2
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
        mtu 9000

The storage config:
Code:
pbs: pbs02
        datastore backup
        server 10.250.5.84
        content backup
        fingerprint ca:11:18:47:cf:29:4d:3b:1c:b3:43:d8:78:d5:67:3e:d2:35:f7:0b:42:6e:43:f5:1a:09:28:de:00:68:a4:e1
        prune-backups keep-all=1
        username root@pam
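(A hedged debugging sketch: since the 500 here is ultimately a failed TCP connect, logging plain reachability of the PBS API port around the schedule window can show whether the network or the daemon is at fault. The host and port below are taken from the storage.cfg above; the probe helper itself is illustrative.)

```shell
# Probe TCP reachability of the PBS API port (host/port from storage.cfg above).
# Requires bash for the /dev/tcp pseudo-device.
probe() {
    local host=$1 port=$2
    if timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "$(date -Is) $host:$port ok"
    else
        echo "$(date -Is) $host:$port FAILED"
    fi
}

probe 10.250.5.84 8007
```

Running this every few seconds (e.g. under `watch`) while a scheduled job starts would show whether the port is genuinely unreachable at that moment or only slow to answer.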

Proxmox Backup Server:
Code:
root@pbs02:~# proxmox-backup-manager version
proxmox-backup-server 2.2.3-2 running version: 2.2.3
root@pbs02:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface enp6s0f1 inet manual

iface enp8s0f0 inet manual

iface enp8s0f1 inet manual

auto bond0
iface bond0 inet manual
        bond-miimon 100
        bond-mode active-backup
        bond-primary enp8s0f1
        bond-slaves enp8s0f0 enp8s0f1
        mtu 9000

iface enp6s0f0 inet manual

auto bond0.11
iface bond0.11 inet static
        address 10.1.0.84/24
        gateway 10.1.0.253
        mtu 1500

auto bond0.45
iface bond0.45 inet static
        address 10.250.5.84/25
        mtu 9000

Best Regards,


Rodrigo
 

dcsapak
Proxmox Staff Member · Feb 1, 2016 · Vienna
ok, and what else goes over vmbr2/bond2 on the pve side?
are you sure all networking in between also has mtu 9000 set?

aside from that i cannot really say what's causing this; at least here we don't see this behaviour...
 

rlljorge
Member · Jun 2, 2020
Hi,

vmbr2 is used for NFS traffic from the VMs to the NAS storage; in some cases the VMs mount the NFS shares directly.
I checked, and jumbo frames are enabled and working on all nodes.

The problem occurs when the backup job is started by the scheduler; when I execute the same job manually, it works fine.

Regards,

Rodrigo L L Jorge
 
May 4, 2021
hello - I have the same thing happening. I'm in the process of upgrading my cluster from the 6.4 to the 7 series. Before doing that, I upgraded my PBS server as recommended by your upgrade docs. The PBS server upgraded to version 2.2-3, works fine, and I can see my datastore everywhere (in the GUI, on the command line). I can ping the PBS from any Proxmox node, and all backups work fine IF done manually. But they fail when scheduled and generate this error:

vzdump could not activate storage error fetching datastores - 500 Can't connect to xx.xx.xx.xx:8007

I even removed my PBS from Datacenter storage and added it back using the fingerprint and server info. It works perfectly except when running scheduled backup jobs from the cron file.

I figured I should get this fixed before continuing with the cluster upgrade. Or will moving to Proxmox 7 fix this?

Thanks
John
 
May 4, 2021
Adding some more info as I investigate.

I just tried connecting to the datastore from the command line on the cluster nodes using the "proxmox-backup-client list --repository xx.xx.xx.xx Datastore" command, and it worked (eventually), BUT it prompted me to accept the fingerprint twice! Why the double fingerprint? A side effect of upgrading my PBS and deleting/re-adding the PBS in Datacenter?

Also, I noticed that this proxmox-backup-client command expects the IP address followed by the datastore name, but the error I get in the failed backups shows vzdump trying to connect to the IP address and port number of the PBS server, not the datastore name. Is that normal? Why would it change / how do I fix it?

Thanks
John
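(Editorial note, hedged: the repository string proxmox-backup-client expects has the form [user@realm@]host[:port]:datastore, while the IP:port in the vzdump error is just the TCP endpoint PVE dials before any datastore is selected, so seeing only IP:port there is expected. Placeholder values below.)

```
# Repository format: [user@realm@]host[:port]:datastore (placeholder values)
proxmox-backup-client list --repository 'root@pam@xx.xx.xx.xx:8007:Datastore'
```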
 

itNGO
Well-Known Member · Jun 12, 2020 · Germany · it-ngo.com
We have a comparable issue here on some nodes.
Since 2.2, the syslog on PVE often shows:

Code:
Jun 27 12:52:43 RZB-CPVE3 pvedaemon[2204395]: BACKUP: error fetching datastores - 500 Can't connect to 10.255.192.119:8007
Jun 27 12:52:52 RZB-CPVE3 pvestatd[1652]: BACKUP: error fetching datastores - 500 Can't connect to 10.255.192.119:8007
Jun 27 12:52:52 RZB-CPVE3 pvestatd[1652]: status update time (7.574 seconds)
Jun 27 12:53:22 RZB-CPVE3 pvestatd[1652]: BACKUP: error fetching datastores - 500 read timeout
Jun 27 12:53:22 RZB-CPVE3 pvestatd[1652]: status update time (7.585 seconds)
Jun 27 12:53:26 RZB-CPVE3 pvedaemon[2204395]: BACKUP: error fetching datastores - 500 read timeout
Jun 27 12:53:31 RZB-CPVE3 pvestatd[1652]: BACKUP: error fetching datastores - 500 Can't connect to 10.255.192.119:8007
Jun 27 12:53:32 RZB-CPVE3 pvestatd[1652]: status update time (7.546 seconds)
Jun 27 12:53:41 RZB-CPVE3 pvestatd[1652]: BACKUP: error fetching datastores - 500 Can't connect to 10.255.192.119:8007
Jun 27 12:53:41 RZB-CPVE3 pvestatd[1652]: status update time (7.627 seconds)

It recovers by itself, but it keeps happening over and over, every few minutes to hours, even when the PBS is totally idle.
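(A small analysis sketch: bucketing those messages by minute shows whether the failures cluster around pvestatd's periodic status updates. The sample lines below are copied from the syslog excerpt above; point the grep at a full log capture in practice.)

```shell
# Count 'error fetching datastores' messages per minute (sample lines from above).
grep 'error fetching datastores' <<'EOF' | awk '{print $1, $2, substr($3, 1, 5)}' | sort | uniq -c
Jun 27 12:52:43 RZB-CPVE3 pvedaemon[2204395]: BACKUP: error fetching datastores - 500 Can't connect to 10.255.192.119:8007
Jun 27 12:52:52 RZB-CPVE3 pvestatd[1652]: BACKUP: error fetching datastores - 500 Can't connect to 10.255.192.119:8007
Jun 27 12:53:22 RZB-CPVE3 pvestatd[1652]: BACKUP: error fetching datastores - 500 read timeout
Jun 27 12:53:26 RZB-CPVE3 pvedaemon[2204395]: BACKUP: error fetching datastores - 500 read timeout
EOF
```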
 

pmbaeum
New Member · Sep 29, 2020
Hi, same problem here.
The syslog on my PVE hosts is full of error 500 messages.

Right now I have about a 50-60% success rate with my backups.

PBS and PVE communicate on the 192.168.40.0/24 network, connected directly with 10G cards and a 10G switch.

PVE1 Syslog
Code:
Jun 29 11:01:07 pve1 pvestatd[1441]: PBS01: error fetching datastores - 500 Can't connect to 192.168.40.5:8007
Jun 29 11:01:14 pve1 pvestatd[1441]: PBS2TB: error fetching datastores - 500 Can't connect to 192.168.40.5:8007
Jun 29 11:01:14 pve1 pvestatd[1441]: status update time (14.226 seconds)
Jun 29 11:01:20 pve1 pvestatd[1441]: status update time (5.411 seconds)
Jun 29 11:02:11 pve1 pvestatd[1441]: PBS01: error fetching datastores - 500 Can't connect to 192.168.40.5:8007
Jun 29 11:02:18 pve1 pvestatd[1441]: PBS2TB: error fetching datastores - 500 Can't connect to 192.168.40.5:8007
Jun 29 11:02:18 pve1 pvestatd[1441]: status update time (14.201 seconds)
Jun 29 11:03:15 pve1 pvestatd[1441]: PBS01: error fetching datastores - 500 Can't connect to 192.168.40.5:8007
Jun 29 11:03:20 pve1 pvestatd[1441]: status update time (11.391 seconds)
Jun 29 11:04:07 pve1 pvestatd[1441]: PBS2TB: error fetching datastores - 500 Can't connect to 192.168.40.5:8007
Jun 29 11:04:14 pve1 pvestatd[1441]: PBS01: error fetching datastores - 500 Can't connect to 192.168.40.5:8007
Jun 29 11:04:14 pve1 pvestatd[1441]: status update time (14.212 seconds)
Jun 29 11:04:20 pve1 pvestatd[1441]: status update time (5.727 seconds)
Jun 29 11:05:11 pve1 pvestatd[1441]: PBS01: error fetching datastores - 500 Can't connect to 192.168.40.5:8007
Jun 29 11:05:18 pve1 pvestatd[1441]: PBS2TB: error fetching datastores - 500 Can't connect to 192.168.40.5:8007
Jun 29 11:05:18 pve1 pvestatd[1441]: status update time (14.230 seconds)
Jun 29 11:06:15 pve1 pvestatd[1441]: PBS01: error fetching datastores - 500 Can't connect to 192.168.40.5:8007
Jun 29 11:06:20 pve1 pvestatd[1441]: status update time (11.359 seconds)

PVE1 /etc/network/interfaces
Code:
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

iface enp1s0f0 inet manual

auto enp1s0f1
iface enp1s0f1 inet manual
#10G Backend

auto vmbr0
iface vmbr0 inet static
        address 192.168.0.28/23
        gateway 192.168.0.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

auto vmbr1
iface vmbr1 inet manual
        bridge-ports eno2
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

auto vmbr2
iface vmbr2 inet static
        address 192.168.40.10/24
        bridge-ports enp1s0f1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

PBS01 /etc/network/interfaces
Code:
auto lo
iface lo inet loopback

auto enp1s0f0
iface enp1s0f0 inet static
        address 192.168.0.21/23
        gateway 192.168.0.1

iface enp1s0f1 inet manual

auto enp2s0f0
iface enp2s0f0 inet static
        address 192.168.40.5/24
#Backup Network | 10G

iface enp2s0f1 inet manual

iface ens1f0 inet manual

iface ens1f1 inet manual

PBS01 Syslog
Code:
Jun 29 11:00:10 pbs01 proxmox-backup-proxy[3509]: removing backup snapshot "/mnt/datastore/storage2tb/vm/100/2022-06-28T11:09:47Z"
Jun 29 11:00:10 pbs01 proxmox-backup-proxy[3509]: removing backup snapshot "/mnt/datastore/storage2tb/vm/114/2022-06-28T11:09:48Z"
Jun 29 11:07:37 pbs01 proxmox-backup-proxy[3509]: starting new backup on datastore 'storage2tb': "vm/100/2022-06-29T09:07:34Z"
Jun 29 11:07:37 pbs01 proxmox-backup-proxy[3509]: download 'index.json.blob' from previous backup.
Jun 29 11:07:38 pbs01 proxmox-backup-proxy[3509]: register chunks in 'drive-scsi0.img.fidx' from previous backup.
Jun 29 11:07:38 pbs01 proxmox-backup-proxy[3509]: download 'drive-scsi0.img.fidx' from previous backup.
Jun 29 11:07:38 pbs01 proxmox-backup-proxy[3509]: starting new backup on datastore 'storage2tb': "vm/114/2022-06-29T09:07:34Z"
Jun 29 11:07:38 pbs01 proxmox-backup-proxy[3509]: created new fixed index 1 ("vm/100/2022-06-29T09:07:34Z/drive-scsi0.img.fidx")
 
May 4, 2021
ok, and what else goes over vmbr2/bond2 on the pve side?
are you sure all networking in between also has mtu 9000 set?

aside from that i cannot really say what's causing this; at least here we don't see this behaviour...
Hi dcsapak

Don't know if you noticed yet or not, but several more of us seem to have this issue. Perhaps you can get the Devs to dig into it a bit more?

Thanks
 

dcsapak
Proxmox Staff Member · Feb 1, 2016 · Vienna
Don't know if you noticed yet or not, but several more of us seem to have this issue. Perhaps you can get the Devs to dig into it a bit more?
hi, yes ofc we're actively trying to find the issue. sadly we were not able to reproduce it here so far on many different setups... maybe if you share some more details about your setup (cpu/memory/nic/network/etc.) we can find some common theme...
 
May 4, 2021
hi, yes ofc we're actively trying to find the issue. sadly we were not able to reproduce it here so far on many different setups... maybe if you share some more details about your setup (cpu/memory/nic/network/etc.) we can find some common theme...
no problem. Running a 3 node HA cluster that was on Proxmox 6.4, connecting to a PBS server which was on 1.something (I don't remember the exact version of 1). The PBS is running as a VM on one of the nodes. Everything has been running perfectly for over a year, until I upgraded the PBS to new version 2.2-3. As soon as the PBS went to its new version, the scheduled backup jobs started failing with 500 error.

Datastore is live & working, and fully accessible by any command or GUI control. Only cron jobs have this error.

However I just completed upgrading all my nodes in my Proxmox cluster to 7.2-4. Waiting to see if that fixes anything. Will let you know
 

pmbaeum
New Member · Sep 29, 2020
Hi,

my systemreport:
https://pastebin.com/dz1Ds4gY

HW on PBS:
CPU: 2x Intel Xeon E5-2620 v3 (SR207), 6 cores @ 2.40 GHz, 15 MB cache, LGA 2011-3
Memory: 64GB Registered ECC DDR4 SDRAM (8x 8GB DIMM)
NIC: Intel Ethernet Controller 10-Gigabit X540-AT2


Main PVE systemreport:
https://pastebin.com/Z60JgPxR

Hardware:
CPU: 1x Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz
Memory: 64GB Registered ECC DDR4 SDRAM (4x16GB DIMM)
NIC: Dell Broadcom 57810 2x 10GbE RJ-45 Dual Port (NetXtreme II BCM57810)
 
May 4, 2021
no problem. Running a 3 node HA cluster that was on Proxmox 6.4, connecting to a PBS server which was on 1.something (I don't remember the exact version of 1). The PBS is running as a VM on one of the nodes. Everything has been running perfectly for over a year, until I upgraded the PBS to new version 2.2-3. As soon as the PBS went to its new version, the scheduled backup jobs started failing with 500 error.

Datastore is live & working, and fully accessible by any command or GUI control. Only cron jobs have this error.

However I just completed upgrading all my nodes in my Proxmox cluster to 7.2-4. Waiting to see if that fixes anything. Will let you know
nope - upgrading Proxmox to 7.2-4 didn't help. It must be some change in PBS... an initial command to activate the storage? Something that changed from version 1?
 

mira
Proxmox Staff Member · Aug 1, 2018
Some of our dependencies were upgraded (e.g. tokio), so it might be related to that.
But without a reproducer in house where we can freely test and reproduce the issue, it's difficult to find the actual cause.

As my colleague pointed out, we couldn't yet reproduce this on multiple setups here. This includes private setups by some of us.
 
May 4, 2021
13
0
1
42
Some of our dependencies were upgraded (e.g. tokio), so it might be related to that.
But without a reproducer in house where we can freely test and reproduce the issue, it's difficult to find the actual cause.

As my colleague pointed out, we couldn't yet reproduce this on multiple setups here. This includes private setups by some of us.
Hi Mira - I understand it's difficult to troubleshoot something that is not consistent. In continuing to research it myself, I think I might have figured it out.

Could it be the syntax used in the cron job?

Since the start (over a year ago), my cron file has contained the following:

Code:
PATH="/usr/sbin:/usr/bin:/sbin:/bin"
5 */4 * * * root vzdump 100 102 103 104 106 --mailnotification failure --storage BuffaloNAS --mailto me@myemail.com --mode snapshot --quiet 1
10 */24 * * * root vzdump 105 --mailnotification failure --storage BuffaloNAS --mailto me@myemail.com --mode snapshot --quiet 1

This has worked perfectly. However, I am wondering whether the upgraded PBS is pickier about commands, and whether I should be using the proxmox-backup-client command instead of vzdump? Something like what was mentioned in a different post on this site:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
So I create /root/pbs.sh like that:

Code:
#!/bin/bash
PBS_PASSWORD='mypassword'
PBS_FINGERPRINT='10:d3:7f:79:7e:e etcetera myfingerprint'
export PBS_PASSWORD
export PBS_FINGERPRINT

proxmox-backup-client login --repository root@pam@192.168.0.46:sauvegardes
proxmox-backup-client backup root.pxar:/ --repository root@pam@192.168.0.46:sauvegardes && proxmox-backup-client prune host/se4fs --repository root@pam@192.168.0.46:sauvegardes --keep-daily 14

pbs.sh runs very well, and then I add a file sauvhome in cron.d like that:

Code:
0 3 * * tue,wed,thu,fri,sat root /root/pbs.sh
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Is it possible to do a full-image VM snapshot backup like vzdump does, but using the proxmox-backup-client command instead? Are you able to specify what I should enter in my cron file (or .sh file, if following the example above)?

Thanks
John
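(Editorial sketch, hedged: proxmox-backup-client does file/host-level backups; full VM image backups to PBS normally still go through vzdump with a pbs-type storage entry, as in the storage.cfg posted earlier in this thread. Under that assumption, a minimal cron.d line with a placeholder VMID might look like:)

```
# /etc/cron.d/ sketch - placeholder VMID, pbs-type storage 'pbs02' as configured above
30 2 * * * root vzdump 100 --storage pbs02 --mode snapshot --quiet 1
```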
 
