[SOLVED] error fetching datastores - 500 after upgrade to 2.2

rlljorge · Jun 30, 2022

I still have the same error:

TASK ERROR: could not activate storage 'pbs02': pbs02: error fetching datastores - 500 Can't connect to 10.250.5.84:8007

Work's with intermittent backup is a very big impact and risk for the operation.

Please, Proxmox Staff help us to identify the root cause.

Best Regards,

Rodrigo

itNGO · Jun 30, 2022

rlljorge said:
I still have the same error:

TASK ERROR: could not activate storage 'pbs02': pbs02: error fetching datastores - 500 Can't connect to 10.250.5.84:8007

Work's with intermittent backup is a very big impact and risk for the operation.

Please, Proxmox Staff help us to identify the root cause.

Best Regards,

Rodrigo

If this is a production system, you should have a subscription. And with this, you can make a ticket at support to get "fast" help.
If you don't have a subscription you have to wait until someone can look into this further more.

itNGO · Jun 30, 2022

In the meantime, did you double check that PVE, PBS "AND!" Switch have all set/support the correct MTU?
If you never changed it, defaults to 1500. We had a Client where MTU was wrong on PVE which caused also many 500 Can't connect errors.
After fixing MTU the error where gone.

Also check the IO delay on your PBS during Backups. If it really high this can also be problematic in 2.2, we are still investigating this....

alex · Jul 1, 2022

I am also having this error but it only seems to occur with automated backups.
If i hit "run now" it doesn't fail, at least out of the 30 times I have done it it hasn't failed.
It only occurred since PBS 2.2.
Backup load has not changed on the PBS since update to 2.2 and reboot.
@itNGO your comment regarding MTU suddenly worried me as we just replaced two core switches a few weeks ago but the MTU's are all fine.

karpuzvepeynir · Jul 2, 2022

Hi all,
I'm happy to see this thread. I have 2 PBS which they had been upgraded 2.2-3, they have the same problems. I have 6.4 and 7.x PVEs. All PVEs were affected by this problem after 2.2-3 upgrade.

backup failed: could not activate storage 'xxx': xxx: error fetching datastores - 500 Can't connect to yyy:8007

My all PBSs were working good since 1.1 version. I'm also sure no network issues about this problem.

itNGO · Jul 2, 2022

I can confirm this happens on some VMs and some Clusters here also.
Today there was no scheduled Job at all and we started 2 Backups which where missed due to Timeout/Error 500 and somewhat large. (4TB in sum)
Both are running normal now....

karpuzvepeynir · Jul 4, 2022

When automated backup job started and failed, I don't see any log on PBS files below:

/var/log/proxmox-backup/api/access.log
/var/log/proxmox-backup/api/auth.log

But when I click to "run now" in backup menu, everything goes good. What is the working differences between cron job and run now button?

mira · Jul 5, 2022

Both the command line tool `vzdump` and `Run Now` in the GUI use the same API call, so there should not be any difference.

timproxmox · Jul 5, 2022

mira said:
Both the command line tool `vzdump` and `Run Now` in the GUI use the same API call, so there should not be any difference.

I have to manually run jobs as I can't rely on the schedule. Although it 'should not make a difference' we are getting the error since the upgrade. Please don't take offense to this, we just need it looked into.

Interesting enough, when I went to run a manual job this time, it failed. I noticed the syslog said 'Reached target Sockets' but not sure at this stage if it is related. I stopped the two current sync jobs and the garbage collect job and ran the backup again - it worked.
Could the other jobs be causing the backup server to not respond to new requests?

karpuzvepeynir · Jul 5, 2022

I pasted the syslog when started the automated backup job on 22:06. Then, I click to "run now" on 22:06:32. I masked some personal infos, don't mind them. I don't know if this log may help to understand something. I can look at anything which will be helpful if Proxmox staffs wish.

Code:

root@prox-host:~# cat /var/log/syslog.1 | grep "Jul  4" | grep 22:06
Jul  4 22:06:00 prox-host systemd[1]: Starting Proxmox VE replication runner...
Jul  4 22:06:01 prox-host systemd[1]: pvesr.service: Succeeded.
Jul  4 22:06:01 prox-host systemd[1]: Started Proxmox VE replication runner.
Jul  4 22:06:01 prox-host cron[8375]: (*system*vzdump) RELOAD (/etc/cron.d/vzdump)
Jul  4 22:06:01 prox-host CRON[14960]: (root) CMD (vzdump 423 428 461 484 488 489 80434 81126 81141 --quiet 1 --mailnotification always --mode snapshot --mailto abc@def.com --node prox-host --storage pbs-host)
Jul  4 22:06:02 prox-host vzdump[14961]: <root@pam> starting task UPID:prox-host:00003AA6:D2D3B356:62C33A1A:vzdump::root@pam:
Jul  4 22:06:08 prox-host pvestatd[8386]: pbs-host: error fetching datastores - 500 Can't connect to 192.168.55.110:8007
Jul  4 22:06:08 prox-host pvestatd[8386]: status update time (7.326 seconds)
Jul  4 22:06:09 prox-host vzdump[15014]: could not activate storage 'pbs-host': pbs-host: error fetching datastores - 500 Can't connect to 192.168.55.110:8007
Jul  4 22:06:09 prox-host vzdump[14961]: <root@pam> end task UPID:prox-host:00003AA6:D2D3B356:62C33A1A:vzdump::root@pam: could not activate storage 'pbs-host': pbs-host: error fetching datastores - 500 Can't connect to 192.168.55.110:8007
Jul  4 22:06:09 prox-host postfix/pickup[30466]: 35079181F22: uid=0 from=<root>
Jul  4 22:06:09 prox-host postfix/cleanup[16942]: 35079181F22: message-id=<20220704190609.35079181F22@prox-host.domain.com>
Jul  4 22:06:09 prox-host postfix/qmgr[8369]: 35079181F22: from=<root@prox-host.domain.com>, size=1694, nrcpt=1 (queue active)
Jul  4 22:06:11 prox-host postfix/smtp[16944]: 35079181F22: to=<abc@def.com>, relay=mx.xyz.com[xx.xx.xx.xx]:25, delay=2, delays=0.02/0.01/0.02/2, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 4LcFdY4txlz4xbT6)
Jul  4 22:06:11 prox-host postfix/qmgr[8369]: 35079181F22: removed
Jul  4 22:06:32 prox-host pvedaemon[891]: <root@pam> starting task UPID:prox-host:000075BC:D2D3BF13:62C33A38:vzdump::root@pam:
Jul  4 22:06:32 prox-host pvedaemon[30140]: INFO: starting new backup job: vzdump 423 428 461 484 488 489 80434 81126 81141 --quiet 1 --storage pbs-host --mailto abc@def.com --all 0 --node prox-host --mode snapshot --mailnotification always
Jul  4 22:06:32 prox-host pvedaemon[30140]: INFO: Starting Backup of VM 423 (qemu)
Jul  4 22:06:47 prox-host pvedaemon[23412]: worker exit
Jul  4 22:06:47 prox-host pvedaemon[8406]: worker 23412 finished
Jul  4 22:06:47 prox-host pvedaemon[8406]: starting 1 worker(s)
Jul  4 22:06:48 prox-host pvedaemon[8406]: worker 13200 started

itNGO · Jul 6, 2022

So Proxmox-Team what else can we do, to help you getting this solved?
Any more data, logs? Any Remote-Access to servers with this issue? Tell us what you need, we will try to make it accessible...

Thank you!

Björn B. · Jul 7, 2022

We are also experiencing the same issue with PBS 2.2-3 (since latest update)
3 nodes will randomly fail to run a backup task every 2-3 days. Manually works fine. Also a 4th node which is standalone fails as well every 2-3 days.
All nodes are pve 7.2-5
A freshly installed and not yet rolled out setup of PBS 2.2-3 + 2 PVE 7.2-5 seem to work fine.

Another customer was experiencing random Windows VM freezes again with 7.2-5. After installing 7.2-7 it seems to be resolved. So maybe it's not a problem with PBS itself but with the nodes. It's a long shot but who knows. I cannot reproduce because said customer has no PBS.
Our Update Window for the 7.2-5 with PBS will be later this month.

apollo13 · Jul 8, 2022

Hi @mira, I also see this issue and see it so frequently that one could call it reproducable; and I am located in Austria so we could (easily?) do a debugging session.

I'll try to provide some hints for you so you might be able to reproduce on your own. The issue seems to occur with TLS handshaking, though it is not clear to me why. I am using pbs003.tuxis.nl as backup server (I am sure @tuxis would give you an account there).

Every now and then I see

Code:

Jul 08 09:09:08 ps-pmx2 pvestatd[4226]: pbs-tuxis: error fetching datastores - 500 Can't connect to pbs003.tuxis.nl:8007
Jul 08 09:09:08 ps-pmx2 pvestatd[4226]: status update time (7.358 seconds)

I can trigger this relatively often by going via node -> pbs storage -> backups in the gui and it will show the corresponding error. And now it starts to get weird; I attached tcpdump and whenever the request fails I get this output:

As you can see in there, after the Client-Hello we just get an ACK and not a Server-Hello. The Server-Hello is only sent (!!) after the client timed out (seven seconds -- does that number sound familiar)? I could imagine a bug in the TLS statemachine. This would also explain why one might not see the issue when simply trying to curl that PBS endpoint etc… I will try to further compare working and non-working requests; please let me know if you spot anything useful in my picture.

apollo13 · Jul 8, 2022

Ok, I have correlated the tcpdumps with our perimeter firewall and they show the same results. The Server-Hello is coming late. I only have one firewall in front of that now where I currently cannot get tcpdumps. But if someone from @tuxis wanted to hop on a call with me and get tcpdumps on their end we should be able to confirm the behavior I am seeing (It would confirm that the answer originated late and is not delayed on the way).

DJI · Jul 8, 2022

I have deleted and recreated the datastores from the VE machines, and so far its working without issue. It would normally have thrown a 500 by now.

However, I am sure others would have tried this and maybe I have just not waited long enough for a fail?

itNGO · Jul 8, 2022

DJI said:
I have deleted and recreated the datastores from the VE machines, and so far its working without issue. It would normally have thrown a 500 by now.

However, I am sure others would have tried this and maybe I have just not waited long enough for a fail?

I am not sure, but I believe this happens only if you have several hundreds of Snapshots/Backups on PBS... on an empty Datastore we could never reproduce this failure....

apollo13 · Jul 8, 2022

itNGO said:
I am not sure, but I believe this happens only if you have several hundreds of Snapshots/Backups on PBS... on an empty Datastore we could never reproduce this failure....

Given what I have seen in the tcpdumps the connection fails way before the application layer is established. It would be great if anyone could confirm those findings. So either we are hunting multiple issues here or it is pure luck that it doesn't show up. We have less than hundred backups in the store, so…

DJI · Jul 8, 2022

apollo13 said:
Given what I have seen in the tcpdumps the connection fails way before the application layer is established. It would be great if anyone could confirm those findings. So either we are hunting multiple issues here or it is pure luck that it doesn't show up. We have less than hundred backups in the store, so…

I should have been clearer, I deleted the link to the data stores and recreated the link, it has 9579 backups on one store, 482 on the next and 1836 on the final one.
Backing up every 2 hours, no issues.... yet!

rlljorge · Jul 8, 2022

DJI said:
I should have been clearer, I deleted the link to the data stores and recreated the link, it has 9579 backups on one store, 482 on the next and 1836 on the final one.
Backing up every 2 hours, no issues.... yet!

Not worked for me !

DJI · Jul 8, 2022

rlljorge said:
Not worked for me !

Sounds like I am just waiting for a failure then

[SOLVED] error fetching datastores - 500 after upgrade to 2.2

Well-Known Member

Famous Member

Famous Member

Member

Active Member

Famous Member

Active Member

Proxmox Staff Member

Member

Active Member

Famous Member

Member

Renowned Member

Renowned Member

Active Member

Famous Member

Renowned Member

Active Member

Well-Known Member

Active Member

We value your privacy