Restore of a large VM won't work - timeout

Hey,
we are running into issues while trying to restore a large (1.5 TB) VM.
We can reproduce this behavior. Small VMs restore fine. The problem exists for different VMs / backups.
The snapshots we are trying to restore are verified.
We've tried this multiple times; it stops at a different progress / chunk each time.
No other tasks ran on the backup server during the restore.

How can we go further and analyse / solve this problem?
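(For reference: a basic throughput / TLS check between the PVE node and the PBS can be done with the client benchmark. This is only a sketch; the repository string below is a placeholder for your own user, host and datastore.)

Code:
# run on the PVE node; measures TLS upload speed, hashing and compression
# replace the repository with your own <user>@<realm>@<pbs-host>:<datastore>
proxmox-backup-client benchmark --repository root@pam@pbs.example.com:datastore1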

I've attached the logs for the restore job from PBS and from PVE.
Further details below.



Environment:
Network
PBS is connected to PVE directly on the same switch, with no additional hop.
10 Gbit/s Network.

PVE
64 x AMD EPYC 7502P 32-Core Processor
VM storage: dedicated 3-node Ceph cluster, connected to PVE via 20 Gbit/s

PBS
24 x AMD EPYC 7272 12-Core Processor
64 GB RAM
Storage: RAIDZ1 on enterprise SATA SSDs


We use the following versions:
PVE
Code:
proxmox-ve: 8.1.0 (running kernel: 6.5.13-1-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
pve-kernel-5.15: 7.4-3
proxmox-kernel-6.5.13-1-pve-signed: 6.5.13-1
proxmox-kernel-6.5: 6.5.13-1
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 18.2.1-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.2
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.1
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.1.0
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.5
proxmox-widget-toolkit: 4.1.4
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.4
pve-edk2-firmware: 4.2023.08-4
pve-firewall: 5.0.3
pve-firmware: 3.9-2
pve-ha-manager: 4.0.3
pve-i18n: 3.2.1
pve-qemu-kvm: 8.1.5-3
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve2


PBS
Code:
proxmox-backup: 3.0.1 (running kernel: 6.5.13-1-pve)
proxmox-backup-server: 3.1.4-1 (running version: 3.1.4)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
pve-kernel-5.15: 7.4-3
proxmox-kernel-6.5.13-1-pve-signed: 6.5.13-1
proxmox-kernel-6.5: 6.5.13-1
proxmox-kernel-6.5.11-6-pve-signed: 6.5.11-6
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
proxmox-kernel-6.2.16-12-pve: 6.2.16-12
pve-kernel-6.2.16-3-pve: 6.2.16-3
pve-kernel-6.2.11-2-pve: 6.2.11-2
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.60-1-pve: 5.15.60-1
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.15.35-1-pve: 5.15.35-3
ifupdown2: 3.2.0-1+pmx8
libjs-extjs: 7.0.0-4
proxmox-backup-docs: 3.1.4-1
proxmox-backup-client: 3.1.4-1
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.5
proxmox-widget-toolkit: 4.1.3
pve-xtermjs: 5.3.0-3
smartmontools: 7.3-pve1
zfsutils-linux: 2.2.2-pve2
 

Attachments

  • task-pbs-01-reader-2024-03-21T11 33 52Z.log (973.3 KB)
  • task-proxmox-02-qmrestore-2024-03-21T11_33_22Z.log (12.6 KB)
Further findings and details:
  • doesn't work for different VMs / snapshots
  • rebooting PBS and PVE didn't change anything
  • doesn't work from different PBS instances, same behavior
  • stops every time at a different progress, chunk and time
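(A sketch of how the point of the stall could be watched from both sides while a restore is running; the PBS address below is a placeholder.)

Code:
# on the PBS: follow the proxy log to see when the reader connection drops
journalctl -u proxmox-backup-proxy.service -f

# on the PVE node: inspect the TCP connection to the PBS for stalls / retransmits
# (replace 192.0.2.10 with the address of your PBS)
ss -tni dst 192.0.2.10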
 
Further analysis:
PBS and PVE are in different VLANs in our environment. After we put both hosts into the same VLAN, so that no other device is involved, the restore works without any problems!
The router between the VLANs is an OPNsense firewall.

After this finding we played around with sysctl settings:
net.ipv4.tcp_keepalive_time=40
net.ipv4.tcp_keepalive_intvl=3
net.ipv4.tcp_keepalive_probes=2

This didn't change anything.
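(For completeness, a sketch of how such keepalive values can be applied persistently; the drop-in file name is only an example.)

Code:
# apply at runtime
sysctl -w net.ipv4.tcp_keepalive_time=40
sysctl -w net.ipv4.tcp_keepalive_intvl=3
sysctl -w net.ipv4.tcp_keepalive_probes=2

# keep across reboots via a sysctl drop-in file
cat > /etc/sysctl.d/90-tcp-keepalive.conf <<'EOF'
net.ipv4.tcp_keepalive_time = 40
net.ipv4.tcp_keepalive_intvl = 3
net.ipv4.tcp_keepalive_probes = 2
EOF
sysctl --system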
We changed the session / connection handling on the firewall:
(screenshot of the firewall session / connection settings attached)

But nothing changed.
At the moment we don't have any further ideas.
Putting the backup server into the same network would work, but it is not ideal for security reasons.

We found many "timeout" threads here in the forum complaining about timeouts while syncing backups from PVE to PBS. We think it is the same root cause, just in the other direction. None of these threads are really solved.
Aren't there any plans to look into this and finally solve it?
 
There is no syncing of backups from PVE to PBS (or did you mean creating backups on PBS?). In any case, it sounds like your firewall mishandles the connection or loses a connection tracking entry, and the traffic goes nowhere as a result. I am not sure how PBS is supposed to solve that. The HTTP/2 connection used for creating or restoring backups already has a built-in keepalive mechanism, so data should always be flowing.
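(If a lost connection tracking entry is the suspicion, the pf state table on the OPNsense box can be inspected from its shell while a restore runs; the grep below assumes the default PBS port 8007.)

Code:
# on the OPNsense shell: show pf states for the PBS port (default 8007)
pfctl -s state | grep 8007

# show the configured pf state timeouts (tcp.established etc.)
pfctl -s timeouts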
 
I guess this is more of an OPNsense issue. We have the same problem here, with the same restore failure. Connecting two datacenters with IPsec/GRE and restoring from one datacenter to the other through OPNsense is currently not possible. You should open a post about this in the OPNsense forum.
 
We've solved this. It wasn't a PBS-specific problem; we noticed that we occasionally had problems with long-running requests in general.
In the end it was solved by disabling pfsync.
What the exact problem is, or why it happens, is still unclear.
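(For reference, a sketch of how one could check from the OPNsense shell whether pfsync is still active; after disabling state synchronization in the HA settings, the pfsync0 interface should no longer list a sync device or peer.)

Code:
# on the OPNsense shell: inspect the pfsync interface
# with state sync disabled there should be no syncdev / syncpeer configured
ifconfig pfsync0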
 
So you had an HA OPNsense setup, disabled the state sync, and that fixed it?
 
