backup failed: could not activate storage - backups worked for over a year, no changes

Jun 4, 2024
I've had PBS working and backing up for over a year now: 399 backups and only 1 failure. However, for the last week I've been getting the following message in my emails. Note that ALL backups are failing, on all server nodes. My PVE environment is 4 server nodes with multiple VMs on each, and I keep all of the nodes updated via the same command below.

vzdump backup status (pve.mydomain.com) : backup failed: could not activate storage 'Luke': Luke: error fetching datastores - 500 Can't connect to 10.5.1.7:8007 (Connection timed out)

The only change was my normal "sudo apt update && sudo apt dist-upgrade", which has been my update process for over a year now; again, NO problems at all until last week. There was an update, but I didn't pay attention to the timestamp to see whether it lined up with when the failures started. That update only had a few items, and to me nothing critical that would have caused this issue. Note: NO hardware changes, and no server power cycles around the time frame in question. The server was power cycled a few weeks ago due to weather-related power outages, but there were many successful backups after power was restored, well before THIS issue started.
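Since the error is a plain connection timeout to the PBS API port, a quick sanity check from one of the PVE nodes can at least separate a network/firewall problem from something on the PBS box itself. Roughly what I'd run (IP and port taken from the error above, adjust to your setup):

# From a PVE node: is the PBS host reachable, and is the API port answering?
ping -c 3 10.5.1.7
curl -vk https://10.5.1.7:8007/    # should return the PBS login page HTML if the API is up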

I cannot rule out hardware yet, but I am ordering new HDs and will completely destroy the RAID and start fresh if I cannot find some other solution. It could be hardware related; it's an older Dell server that is due to be replaced in the next year, but again it has been running fine for over a year without any issues until this. It could also be an OS/software bug, which makes sense to me too. Still actively troubleshooting this, just wondering if anyone else has seen this or has suggestions.
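If it does turn out to be hardware, the SMART data of the member disks can usually still be read through the PERC with smartmontools. This is just a rough sketch, assuming the controller is exposed via the megaraid driver; the device IDs (0, 1, ...) and the /dev/sda handle will differ per box:

# Check SMART health of the physical disks behind the PERC controller
smartctl -a -d megaraid,0 /dev/sda
smartctl -a -d megaraid,1 /dev/sda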

Here's a copy of my datastore.cfg. The drives are 1TB SSDs in a RAID5 configuration on a PERC controller (a couple of checks I'd run against this setup follow below the config).
root@pbs-luke:~# cat /etc/proxmox-backup/datastore.cfg
datastore: Backups
        comment
        gc-schedule daily
        path /Backups
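Since the "could not activate storage 'Luke'" message comes from the PVE side, when this happens again it might help to poke both ends directly. These are just the checks I'd try, using the storage and datastore names from my setup:

# On a PVE node: try to query/activate the PBS-backed storage directly
pvesm status --storage Luke

# On the PBS host: confirm the datastore exists and the API service is running
proxmox-backup-manager datastore list
systemctl status proxmox-backup-proxy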
 
It's almost certainly the 6.8.4 kernel. Check whether your network card naming is still the same and whether /etc/network/interfaces is correct.
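A quick way to compare the current NIC names against what the config expects (nothing fancy, just what I'd run):

# NIC names as the kernel currently sees them
ip -br link

# Names referenced in the network config
grep -E 'iface|bridge-ports' /etc/network/interfaces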
 
Maybe I'm behind on versions here, not sure, but here's what I see....
root@pbs-luke:~# uname -an
Linux pbs-luke 5.15.149-1-pve #1 SMP PVE 5.15.149-1 (2024-03-29T14:24Z) x86_64 GNU/Linux
root@pbs-luke:~# ls -al /boot/
total 307092
drwxr-xr-x 5 root root 4096 Jun 3 09:58 .
drwxr-xr-x 20 root root 4096 Jun 28 2022 ..
-rw-r--r-- 1 root root 261208 Feb 8 12:12 config-5.15.143-1-pve
-rw-r--r-- 1 root root 261096 Mar 29 09:24 config-5.15.149-1-pve
-rw-r--r-- 1 root root 260563 May 11 2022 config-5.15.35-1-pve
-rw-r--r-- 1 root root 260941 Jun 8 2022 config-5.15.35-2-pve
drwxr-xr-x 2 root root 4096 Jun 28 2022 efi
drwxr-xr-x 6 root root 4096 Jun 3 09:59 grub
-rw-r--r-- 1 root root 60228224 Mar 18 11:02 initrd.img-5.15.143-1-pve
-rw-r--r-- 1 root root 60226984 Apr 26 10:28 initrd.img-5.15.149-1-pve
-rw-r--r-- 1 root root 61813020 Jun 28 2022 initrd.img-5.15.35-1-pve
-rw-r--r-- 1 root root 61842961 Jun 29 2022 initrd.img-5.15.35-2-pve
-rw-r--r-- 1 root root 182704 Aug 15 2019 memtest86+.bin
-rw-r--r-- 1 root root 184884 Aug 15 2019 memtest86+_multiboot.bin
drwxr-xr-x 2 root root 4096 Apr 26 10:28 pve
-rw-r--r-- 1 root root 6111059 Feb 8 12:12 System.map-5.15.143-1-pve
-rw-r--r-- 1 root root 6114392 Mar 29 09:24 System.map-5.15.149-1-pve
-rw-r--r-- 1 root root 6079590 May 11 2022 System.map-5.15.35-1-pve
-rw-r--r-- 1 root root 6079552 Jun 8 2022 System.map-5.15.35-2-pve
-rw-r--r-- 1 root root 11382656 Feb 8 12:12 vmlinuz-5.15.143-1-pve
-rw-r--r-- 1 root root 11388448 Mar 29 09:24 vmlinuz-5.15.149-1-pve
-rw-r--r-- 1 root root 10866496 May 11 2022 vmlinuz-5.15.35-1-pve
-rw-r--r-- 1 root root 10865376 Jun 8 2022 vmlinuz-5.15.35-2-pve
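For what it's worth, the boot directory only shows what has been installed at some point; to see which kernel packages are actually installed and what the repos currently offer, something like this should work (if I remember right, the package naming changed from pve-kernel-* to proxmox-kernel-* in newer releases, so I match both):

# Kernel packages installed on this PBS host
dpkg -l | grep -E 'pve-kernel|proxmox-kernel'

# Kernel packages available from the configured repositories
apt list 'pve-kernel*' 'proxmox-kernel*' 2>/dev/null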
 
On a side note, I "might" have just fixed it. I ran a new test backup, and it completed without any issue.

The root cause, at least at this point, appears to be that my SSH host key changed. However, NO logs were showing that; I just remembered having that issue in the past when trying to migrate VMs between nodes. We shall see tomorrow whether this was the fix, since the backups are set to run at 9pm each night.
 
Kernel version is old, yes.
A changed SSH key... that error message is not very straightforward.
Good luck, then.
 
The initial test command I ran was:
ssh -o "HostKeyAlias=pve-luke" root@10.x.y.z

That gave the "host key changed" error message, but it also provided the command line to edit/remove the stale entry. Once I ran that and then tried the above command again, it brought in the new key. At that point my test backup worked, so in theory overnight ALL of them should work, but we shall see. I'll update here either way.
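For anyone who hits the same thing, the command ssh suggests is the usual ssh-keygen removal. Roughly what it boils down to (the alias matches the HostKeyAlias above; the known_hosts location can differ, e.g. PVE nodes also keep a cluster-wide file under /etc/pve):

# Remove the stale host key entry for the alias, then reconnect to accept the new one
ssh-keygen -R "pve-luke"
ssh -o "HostKeyAlias=pve-luke" root@10.x.y.z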
 
