Hi all,
First, to explain our infrastructure:
We have 7 nodes in total, 6 of which form a cluster: 3 nodes in location A and 3 in location B (locations A and B are clustered), plus 1 standalone node in location C. Our PBS is also located in location C. We have been backing up all VMs in the A/B cluster to the PBS in location C for months.
I first updated the single node in location C and saw no issues after a day, so I went ahead and updated one node each in locations A and B and waited to see whether any other issues would appear.
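For context, the updates were done in place; a typical PVE 6.x update looks roughly like this (a sketch of the usual apt workflow, not necessarily the exact commands we ran):

apt update
apt dist-upgrade    # moves PVE 6.2.x to 6.4.x within the same major release
pveversion          # confirm the new version afterwards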
We then noticed that all scheduled backups of VMs on those updated nodes fail, while backups of VMs on the non-updated nodes still succeed. I also can't migrate any VMs from an updated node to a non-updated node.
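To illustrate, a failing migration attempt from the CLI looks like this (VMID and target node are placeholders):

qm migrate 2020 <non-updated-node> --online    # fails when going from an updated to a non-updated node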
However, the backups of VMs on the node in location C work correctly, which led me to believe it might be network-related, since the PBS and the VMs in location C are on the same network. But I can ping and reach all servers across locations without any issue. I was also able to add a test datastore on the PBS and attach it to PVE in locations A and B, yet I ran into the same failure when running a backup from the updated nodes.
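The reachability checks were plain pings; for a heavier test that actually exercises the PBS API and TLS path from an updated node, the client's benchmark subcommand could be used (repository string is a placeholder):

ping <pbs-ip>                                                                 # basic reachability, works fine
proxmox-backup-client benchmark --repository root@pam@<pbs-ip>:Datastore-A    # sustained upload/TLS test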
Next I looked at versions: location C is a single node that has been updated, so it is entirely on the new version, whereas locations A and B each have 3 nodes, of which only 1 per location is updated. Since I also can't migrate VMs from a non-updated node to an updated node within the same location, I started to suspect that the version mismatch between nodes might be causing the issue. I'd prefer not to test this theory by updating the remaining nodes, though, since that could leave me unable to back up the remaining VMs either.
Some general info:
PBS 1.1.5
PVE 6.2.10 on the non-updated nodes
PVE 6.4.6 on the updated nodes
Proxmox Backup Client 1.1.7-1
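For completeness, these versions can be read per node like so (if "proxmox-backup-manager versions" isn't available on this release, "dpkg -l proxmox-backup-server" shows the same):

pveversion -v                      # on each PVE node
proxmox-backup-manager versions    # on the PBS host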
The task output in PVE:
INFO: starting new backup job: vzdump 2020 --node dcuxvm202 --mode snapshot --storage PBS-A --remove 0
INFO: Starting Backup of VM 2020 (qemu)
INFO: Backup started at 2021-05-20 10:31:38
INFO: status = running
INFO: VM Name: BLUE-LEU-ANSIBLE-01
INFO: include disk 'scsi0' 'Silver:vm-2020-disk-0' 80G
INFO: include disk 'scsi1' 'Silver:vm-2020-disk-1' 50G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/2020/2021-05-20T08:31:38Z'
INFO: started backup task '0f0a9145-e720-4559-be73-13795f85628f'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: scsi1: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: 2% (2.8 GiB of 130.0 GiB) in 3s, read: 953.3 MiB/s, write: 1.3 MiB/s
INFO: 2% (3.9 GiB of 130.0 GiB) in 13m 27s, read: 1.4 MiB/s, write: 0 B/s
ERROR: backup write data failed: command error: write_data upload error: broken pipe
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 2020 failed - backup write data failed: command error: write_data upload error: broken pipe
INFO: Failed at 2021-05-20 10:45:06
INFO: Backup job finished with errors
TASK ERROR: job errors
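For what it's worth, the failing job is reproducible on demand from the updated node's shell with the same parameters as the scheduled job above:

vzdump 2020 --storage PBS-A --mode snapshot --remove 0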
The task output in PBS:
2021-05-20T10:31:39+02:00: starting new backup on datastore 'Datastore-A': "vm/2020/2021-05-20T08:31:38Z"
2021-05-20T10:31:39+02:00: GET /previous: 400 Bad Request: no valid previous backup
2021-05-20T10:31:39+02:00: created new fixed index 1 ("vm/2020/2021-05-20T08:31:38Z/drive-scsi0.img.fidx")
2021-05-20T10:31:39+02:00: created new fixed index 2 ("vm/2020/2021-05-20T08:31:38Z/drive-scsi1.img.fidx")
2021-05-20T10:31:39+02:00: add blob "/Backups/A/vm/2020/2021-05-20T08:31:38Z/qemu-server.conf.blob" (425 bytes, comp: 425)
2021-05-20T10:45:05+02:00: backup failed: connection error: Connection timed out (os error 110)
2021-05-20T10:45:05+02:00: removing failed backup
2021-05-20T10:45:05+02:00: PUT /fixed_index: 400 Bad Request: Problems reading request body: error reading a body from connection: broken pipe
2021-05-20T10:45:05+02:00: TASK ERROR: connection error: Connection timed out (os error 110)
Output in Syslog on PBS:
May 20 10:31:39 pbs proxmox-backup-proxy[8609]: starting new backup on datastore 'Datastore-A': "vm/2020/2021-05-20T08:31:38Z"
May 20 10:31:39 pbs proxmox-backup-proxy[8609]: GET /previous: 400 Bad Request: no valid previous backup
May 20 10:31:39 pbs proxmox-backup-proxy[8609]: created new fixed index 1 ("vm/2020/2021-05-20T08:31:38Z/drive-scsi0.img.fidx")
May 20 10:31:39 pbs proxmox-backup-proxy[8609]: created new fixed index 2 ("vm/2020/2021-05-20T08:31:38Z/drive-scsi1.img.fidx")
May 20 10:31:39 pbs proxmox-backup-proxy[8609]: add blob "/Backups/A/vm/2020/2021-05-20T08:31:38Z/qemu-server.conf.blob" (425 bytes, comp: 425)
May 20 10:39:35 pbs sshd[19039]: Accepted password for root from 10.10.10.82 port 51625 ssh2
May 20 10:39:35 pbs sshd[19039]: pam_unix(sshd:session): session opened for user root by (uid=0)
May 20 10:39:35 pbs systemd-logind[967]: New session 30 of user root.
May 20 10:39:35 pbs systemd[1]: Created slice User Slice of UID 0.
May 20 10:39:35 pbs systemd[1]: Starting User Runtime Directory /run/user/0...
May 20 10:39:35 pbs systemd[1]: Started User Runtime Directory /run/user/0.
May 20 10:39:35 pbs systemd[1]: Starting User Manager for UID 0...
May 20 10:39:35 pbs systemd[19050]: pam_unix(systemd-user:session): session opened for user root by (uid=0)
May 20 10:39:35 pbs systemd[19050]: Listening on GnuPG cryptographic agent and passphrase cache.
May 20 10:39:35 pbs systemd[19050]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
May 20 10:39:35 pbs systemd[19050]: Reached target Timers.
May 20 10:39:35 pbs systemd[19050]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
May 20 10:39:35 pbs systemd[19050]: Listening on GnuPG network certificate management daemon.
May 20 10:39:35 pbs systemd[19050]: Reached target Paths.
May 20 10:39:35 pbs systemd[19050]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
May 20 10:39:35 pbs systemd[19050]: Reached target Sockets.
May 20 10:39:35 pbs systemd[19050]: Reached target Basic System.
May 20 10:39:35 pbs systemd[19050]: Reached target Default.
May 20 10:39:35 pbs systemd[19050]: Startup finished in 60ms.
May 20 10:39:35 pbs systemd[1]: Started User Manager for UID 0.
May 20 10:39:35 pbs systemd[1]: Started Session 30 of user root.
May 20 10:39:43 pbs sshd[19039]: pam_unix(sshd:session): session closed for user root
May 20 10:39:43 pbs systemd-logind[967]: Session 30 logged out. Waiting for processes to exit.
May 20 10:39:43 pbs systemd[1]: session-30.scope: Succeeded.
May 20 10:39:43 pbs systemd-logind[967]: Removed session 30.
May 20 10:39:53 pbs systemd[1]: Stopping User Manager for UID 0...
May 20 10:39:53 pbs systemd[19050]: Stopped target Default.
May 20 10:39:53 pbs systemd[19050]: Stopped target Basic System.
May 20 10:39:53 pbs systemd[19050]: Stopped target Paths.
May 20 10:39:53 pbs systemd[19050]: Stopped target Timers.
May 20 10:39:53 pbs systemd[19050]: Stopped target Sockets.
May 20 10:39:53 pbs systemd[19050]: gpg-agent.socket: Succeeded.
May 20 10:39:53 pbs systemd[19050]: Closed GnuPG cryptographic agent and passphrase cache.
May 20 10:39:53 pbs systemd[19050]: gpg-agent-browser.socket: Succeeded.
May 20 10:39:53 pbs systemd[19050]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
May 20 10:39:53 pbs systemd[19050]: gpg-agent-extra.socket: Succeeded.
May 20 10:39:53 pbs systemd[19050]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
May 20 10:39:53 pbs systemd[19050]: dirmngr.socket: Succeeded.
May 20 10:39:53 pbs systemd[19050]: Closed GnuPG network certificate management daemon.
May 20 10:39:53 pbs systemd[19050]: gpg-agent-ssh.socket: Succeeded.
May 20 10:39:53 pbs systemd[19050]: Closed GnuPG cryptographic agent (ssh-agent emulation).
May 20 10:39:53 pbs systemd[19050]: Reached target Shutdown.
May 20 10:39:53 pbs systemd[19050]: systemd-exit.service: Succeeded.
May 20 10:39:53 pbs systemd[19050]: Started Exit the Session.
May 20 10:39:53 pbs systemd[19050]: Reached target Exit the Session.
May 20 10:39:53 pbs systemd[19051]: pam_unix(systemd-user:session): session closed for user root
May 20 10:39:53 pbs systemd[1]: user@0.service: Succeeded.
May 20 10:39:53 pbs systemd[1]: Stopped User Manager for UID 0.
May 20 10:39:53 pbs systemd[1]: Stopping User Runtime Directory /run/user/0...
May 20 10:39:53 pbs systemd[1]: run-user-0.mount: Succeeded.
May 20 10:39:53 pbs systemd[1]: user-runtime-dir@0.service: Succeeded.
May 20 10:39:53 pbs systemd[1]: Stopped User Runtime Directory /run/user/0.
May 20 10:39:53 pbs systemd[1]: Removed slice User Slice of UID 0.
May 20 10:42:04 pbs proxmox-backup-proxy[8609]: error during snapshot file listing: 'unable to load blob '"/Backups/A/vm/2020/2021-05-20T08:31:38Z/index.json.blob"' - No such file or directory (os error 2)'
May 20 10:44:48 pbs proxmox-backup-proxy[8609]: error during snapshot file listing: 'unable to load blob '"/Backups/A/vm/2020/2021-05-20T08:31:38Z/index.json.blob"' - No such file or directory (os error 2)'
May 20 10:45:05 pbs proxmox-backup-proxy[8609]: backup failed: connection error: Connection timed out (os error 110)
May 20 10:45:05 pbs proxmox-backup-proxy[8609]: removing failed backup
May 20 10:45:05 pbs proxmox-backup-proxy[8609]: PUT /fixed_index: 400 Bad Request: Problems reading request body: error reading a body from connection: broken pipe
May 20 10:45:05 pbs proxmox-backup-proxy[8609]: removing backup snapshot "/Backups/A/vm/2020/2021-05-20T08:31:38Z"
May 20 10:45:05 pbs proxmox-backup-proxy[8609]: TASK ERROR: connection error: Connection timed out (os error 110)
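One detail that stands out to me in these logs: the tiny qemu-server.conf blob (425 bytes) uploads fine, while the bulk chunk uploads stall until the connection times out. I understand that pattern can also be consistent with an MTU/fragmentation problem on the WAN path between locations, which could be ruled out with something like this (IP is a placeholder; 1472 = 1500 minus 28 bytes of ICMP/IP headers):

ping -M do -s 1472 <pbs-ip>    # run from an updated PVE node; errors or drops suggest a path-MTU issue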
If any more information is required, I'd be happy to provide it; I'm pretty much out of ideas.
Thanks in advance!