I've done some further digging based on the advice given here. My setup is as follows:
3 local nodes connected at 10Gb/s, with a total of 11 Ceph SSD OSDs and 7 Ceph HDD OSDs. This is the only storage in my "datacenter" aside from the mandatory boot disk, which usually has about 400GB available on each node. I run 16 VMs that do various things: 2 are camera servers that each write about 6MB/s to disk, and 1 is that massive Plex server with relatively low disk usage. The rest are some DNS servers, Hass.io, a torrent tracker cluster, web servers, and a couple of other niche things, all mostly compute- and RAM-bound. They are all on the same subnet, and each has at least 2 links to the network via 802.3ad LACP, active-backup, or some combination of the two.
The off-site server is reached over a 500Mb/s pfSense VPN tunnel with 30ms of latency, and both sides are on fiber. The remote server is a single Proxmox node (it used to be part of the local cluster) that runs some Linux VMs, PBS, and another Hass.io instance. It has 4x Seagate Exos 8TB HDDs in a ZFS RAID 10 and a 1TB NVMe SSD as the boot disk.
I am able to back up to the PBS servers and saturate the link, but restoring (which I just learned is not ideal from HDDs) runs at about 8-10MB/s, or roughly 64-80Mb/s. I can yoink network interfaces and interrupt the backups and they will usually continue; my problem is the unreliability of the restore. I can wait 3 weeks for it to restore, since once about 100GB of data is restored the Plex server runs like a champ. What I am struggling with is why something like a temporary slowdown of the HDD pool (an OSD dying and the pool recovering relatively quickly), or something as simple as reconnecting the VPN tunnel (simulating a very brief network interruption), is enough to leave the task running and writing log entries while it refuses to pull any more data from the still-available PBS server. I only need to pull from this server once, and that's because I lost too much data to recover my Plex server (yay Ceph). That's why we have backups, etc. Does my question make sense: why is restoring a one-shot fuse operation, unlike the backup process, which is very resilient?
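For scale, here is the rough arithmetic behind those rates as a quick Python sketch. It assumes decimal units (1GB = 1000MB, 1MB/s = 8Mb/s) and a constant transfer rate, and uses only the figures quoted above (8-10MB/s restore speed, 500Mb/s tunnel, ~100GB until Plex is usable):

```python
# Rough transfer-time arithmetic using the figures from this post.
# Assumes decimal units (1 GB = 1000 MB, 1 MB/s = 8 Mb/s) and a constant rate.

def mbytes_to_mbits(rate_mb_s: float) -> float:
    """Convert a rate in MB/s to Mb/s."""
    return rate_mb_s * 8


def hours_to_transfer(size_gb: float, rate_mb_s: float) -> float:
    """Hours needed to move size_gb at a constant rate_mb_s."""
    seconds = size_gb * 1000 / rate_mb_s
    return seconds / 3600


link_mb_s = 500 / 8  # 500 Mb/s VPN tunnel -> 62.5 MB/s

for rate in (8, 10):
    # Prints e.g. "8 MB/s = 64 Mb/s, 13% of the tunnel, 100 GB in 3.5 h"
    print(f"{rate} MB/s = {mbytes_to_mbits(rate):.0f} Mb/s, "
          f"{rate / link_mb_s:.0%} of the tunnel, "
          f"100 GB in {hours_to_transfer(100, rate):.1f} h")
```

So even at the slow restore rate, the ~100GB I actually need would only take a few hours; the raw speed isn't really the problem, the stall after an interruption is.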
Thanks in advance,
Cody