Restore of a large VM won't work - timeout

Hey,
we are running into issues while trying to restore a large (1.5 TB) VM.
We can reproduce this behavior. Small VMs restore fine. The problem exists for different VMs / backups.
The snapshots we are trying to restore are verified.
We've tried this multiple times; it stops at a different progress / chunk each time.
No other tasks ran on the backup server during the restore.

How can we go further and analyse / solve this problem?
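(For reference: a basic throughput / TLS check between the PVE node and the PBS can be done with the client benchmark. This is only a sketch; the repository string below is a placeholder for your own user, host and datastore.)

Code:
# run on the PVE node; measures TLS upload speed, hashing and compression
# replace the repository with your own <user>@<realm>@<pbs-host>:<datastore>
proxmox-backup-client benchmark --repository root@pam@pbs.example.com:datastore1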

I've attached the logs for the restore job from PBS and from PVE.
Further details below.



Environment:
Network
PBS is connected to PVE directly on the same switch, with no additional hop.
10 Gbit/s Network.

PVE
64 x AMD EPYC 7502P 32-Core Processor
VM storage: dedicated 3-node Ceph cluster, connected to PVE via 20 Gbit/s

PBS
24 x AMD EPYC 7272 12-Core Processor
64 GB RAM
Storage: RAIDZ1 on enterprise SATA SSDs


We use the following versions:
PVE
Code:
proxmox-ve: 8.1.0 (running kernel: 6.5.13-1-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
pve-kernel-5.15: 7.4-3
proxmox-kernel-6.5.13-1-pve-signed: 6.5.13-1
proxmox-kernel-6.5: 6.5.13-1
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 18.2.1-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.2
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.1
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.1.0
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.5
proxmox-widget-toolkit: 4.1.4
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.4
pve-edk2-firmware: 4.2023.08-4
pve-firewall: 5.0.3
pve-firmware: 3.9-2
pve-ha-manager: 4.0.3
pve-i18n: 3.2.1
pve-qemu-kvm: 8.1.5-3
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve2


PBS
Code:
proxmox-backup: 3.0.1 (running kernel: 6.5.13-1-pve)
proxmox-backup-server: 3.1.4-1 (running version: 3.1.4)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
pve-kernel-5.15: 7.4-3
proxmox-kernel-6.5.13-1-pve-signed: 6.5.13-1
proxmox-kernel-6.5: 6.5.13-1
proxmox-kernel-6.5.11-6-pve-signed: 6.5.11-6
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
proxmox-kernel-6.2.16-12-pve: 6.2.16-12
pve-kernel-6.2.16-3-pve: 6.2.16-3
pve-kernel-6.2.11-2-pve: 6.2.11-2
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.60-1-pve: 5.15.60-1
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.15.35-1-pve: 5.15.35-3
ifupdown2: 3.2.0-1+pmx8
libjs-extjs: 7.0.0-4
proxmox-backup-docs: 3.1.4-1
proxmox-backup-client: 3.1.4-1
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.5
proxmox-widget-toolkit: 4.1.3
pve-xtermjs: 5.3.0-3
smartmontools: 7.3-pve1
zfsutils-linux: 2.2.2-pve2
 

Attachments

  • task-pbs-01-reader-2024-03-21T11 33 52Z.log (973.3 KB)
  • task-proxmox-02-qmrestore-2024-03-21T11_33_22Z.log (12.6 KB)
Further findings and details:
  • doesn't work for different VMs / snapshots
  • rebooting PBS and PVE didn't change anything
  • doesn't work from different PBS instances, same behavior
  • stops every time at a different progress, chunk and time
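(A sketch of how the point of the stall could be watched from both sides while a restore is running; the PBS address below is a placeholder.)

Code:
# on the PBS: follow the proxy log to see when the reader connection drops
journalctl -u proxmox-backup-proxy.service -f

# on the PVE node: inspect the TCP connection to the PBS for stalls / retransmits
# (replace 192.0.2.10 with the address of your PBS)
ss -tni dst 192.0.2.10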
 
Further analysis:
PBS and PVE are in different VLANs in our environment. After we put both hosts into the same VLAN, so that no other device is involved, the restore works without any problems!
The router between the VLANs is an OPNsense firewall.

After this finding we played around with sysctl settings:
net.ipv4.tcp_keepalive_time=40
net.ipv4.tcp_keepalive_intvl=3
net.ipv4.tcp_keepalive_probes=2

This didn't change anything.
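(For completeness, a sketch of how such keepalive values can be applied persistently; the drop-in file name is only an example.)

Code:
# apply at runtime
sysctl -w net.ipv4.tcp_keepalive_time=40
sysctl -w net.ipv4.tcp_keepalive_intvl=3
sysctl -w net.ipv4.tcp_keepalive_probes=2

# keep across reboots via a sysctl drop-in file
cat > /etc/sysctl.d/90-tcp-keepalive.conf <<'EOF'
net.ipv4.tcp_keepalive_time = 40
net.ipv4.tcp_keepalive_intvl = 3
net.ipv4.tcp_keepalive_probes = 2
EOF
sysctl --system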
We changed the session / connection handling on the firewall:
(screenshot of the firewall session / connection settings attached)

But nothing changed.
At the moment we don't have any further ideas.
Putting the backup server into the same network would work, but it is not ideal for security reasons.

We found many "timeout" threads here in the forum complaining about timeouts while syncing backups from PVE to PBS. We think it is the same root cause, just in the other direction. None of these threads are really solved.
Aren't there any plans to look into this and finally solve it?
 
There is no syncing of backups from PVE to PBS (or did you mean creating backups on PBS?). In any case, it sounds like your firewall mishandles the connection or loses a connection tracking entry, and the traffic goes nowhere as a result. I am not sure how PBS is supposed to solve that. The HTTP/2 connection used for creating or restoring backups already has a built-in keepalive mechanism, so data should always be flowing.
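(If a lost connection tracking entry is the suspicion, the pf state table on the OPNsense box can be inspected from its shell while a restore runs; the grep below assumes the default PBS port 8007.)

Code:
# on the OPNsense shell: show pf states for the PBS port (default 8007)
pfctl -s state | grep 8007

# show the configured pf state timeouts (tcp.established etc.)
pfctl -s timeouts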
 
I guess this is more of an OPNsense issue. We have the same problem here, with the same restore failure. Connecting two datacenters with IPsec/GRE and restoring from one datacenter to the other through OPNsense is currently not possible. You should open a post about this in the OPNsense forum.
 
We've solved this. It wasn't a PBS-specific problem; we noticed that we occasionally had problems with long-running requests in general.
In the end it was solved by disabling pfsync.
What the exact problem is, or why it happens, is still unclear.
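(For reference, a sketch of how one could check from the OPNsense shell whether pfsync is still active; after disabling state synchronization in the HA settings, the pfsync0 interface should no longer list a sync device or peer.)

Code:
# on the OPNsense shell: inspect the pfsync interface
# with state sync disabled there should be no syncdev / syncpeer configured
ifconfig pfsync0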
 
So you had an HA OPNsense setup, disabled the state sync, and that fixed it?
 
