transfer rate drops to a crawl

I've got a lot of syncs running site to site to support a migration.
I need them to go go go go. Not lag for a day.

What is happening when a sync job drops to a few KB/s and just stays there until you kill it?
There are no log entries for this, other than the dramatic time gap between entries.

Are there priorities for the types of jobs?
Does a datastore read always win out over a datastore write? Seems that way.

How does PBS decide to split up its bandwidth?
When a sync that was using most of the bandwidth finishes, why doesn't a concurrent sync pick up speed and use that now available bandwidth?
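
For what it's worth, the only bandwidth knob I know of here is a per-job rate limit. Whether that would help in this situation is a guess, the job ID below is made up, and the --rate-in option may depend on your PBS version:

Code:
# list the configured sync jobs (to get the job ID)
proxmox-backup-manager sync-job list
# cap the inbound rate for one job so a concurrent sync isn't starved (hypothetical job ID)
proxmox-backup-manager sync-job update s-migrate-01 --rate-in 100M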

I understand contention on the source server and network issues. If I could lay the blame on either of those things, it would be fixable, but I don't see them. Yet.

I'm not seeing log errors for locked backups, or at least not with any frequency. Mostly I cause those myself when I try to start the sync job while it's already running.

I need this to work. Now. I'm considering wiping a datastore and trying again.
 
Hmm. Maybe I fixed it?

I was getting this.

Mismatch between pool hostid and system hostid on imported pool.


I did the fix where you generate a new hostid, turn multihost on, and then back off.
The sync job apparently immediately started working at the prior top speed.

I didn't cause this issue. My weekday nemesis decided to redo a custom drive pool. I hope I've fixed it. Really under the gun here.
 
I don't think the ZFS issue, as it was, exists any longer.
As far as I can tell, I've fixed that ... but the sync failed again _after_ I fixed the zpool ID thing.
Code:
rm /etc/hostid                 # drop the stale hostid
zgenhostid                     # generate a fresh /etc/hostid
zpool set multihost=on rpool   # toggling multihost makes the pool record the new hostid
zpool set multihost=off rpool

Sync has been working steadily since I restarted the job 4 hours ago.

@waltar If you've got any details about this known issue, that would be nice.
I've come to respect your ZFS knowledge, and perhaps English isn't your first language, but as written, your contribution here wasn't much help.

---
A couple of hours later: the sync died again. Restarting it produces a bunch of 'resync' entries, which can be confusing.
After the resync stuff, it eventually gets to downloading an archive image, and from the PBS side, seems to do very little.
Code:
2025-05-10T14:57:31-04:00: re-sync snapshot vm/207/2025-05-10T04:09:05Z
2025-05-10T14:57:32-04:00: sync archive drive-sata0.img.fidx

On the pull side there's no CPU in use, minimal disk activity, and a constant 200k download.
On the source side, I see 'get chunk' and 'download chunk' entries being processed for that job. It doesn't look real quick, but I don't think a human can tell how fast it's working by watching the log roll, so I discount that impression. It sure doesn't look as fast as the log rolls on other servers, though.
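
Rather than eyeballing the log roll in the GUI, I could tail the task log on the source and compare timestamps between chunk entries; the UPID below is just a placeholder:

Code:
# dump the full log of the running reader task, timestamps included (placeholder UPID)
proxmox-backup-manager task log 'UPID:pbs-source:...'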

I know garbage collect has run during this time.
It might have collided with another sync job too, hard to tell.
I've bounced the sync target box. I might bounce the source too after it's done with some stuff, to try to get a clean test again.
I sure don't know what's the matter yet, but the target machine seems to be in the middle of it all.
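
For the collision question, listing the running tasks on each box during the slow window would at least show whether GC, verify, or another sync is overlapping; the datastore name below is made up:

Code:
# show running and recent tasks (look for overlapping GC / verify / sync)
proxmox-backup-manager task list
# check when garbage collection last ran on the datastore (hypothetical name)
proxmox-backup-manager garbage-collection status mystore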

I guess I'm going to try tcpdumps.
This is a new colo. We may have network issues beyond my usual purview.
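
If I do go the tcpdump route, something like this is the plan; the interface name and source address are placeholders, and 8007 is the port PBS talks on:

Code:
# capture the sync traffic between the two PBS boxes for later analysis
tcpdump -i eth0 -w /tmp/pbs-sync.pcap host 192.0.2.10 and tcp port 8007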

--------------------------------------------

... Right, so the above 'fix' isn't permanent.

I bounced the box and the zpool ID issue came back. I ran the fix again, and the sync is working again.
I've tried so many things that it's hard to tell what worked and what didn't.

I know there's a ZFS issue, and that may be the entire problem. I'll chew on that for a bit.
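
My working guess for why the fix doesn't survive a reboot: the initramfs can still carry the old hostid, so the pool import at boot trips the mismatch again. If that's what's going on, rebuilding it should make the fix stick; the proxmox-boot-tool step only applies to ZFS-root boxes managed by that tool:

Code:
# rebuild the initramfs so it embeds the current /etc/hostid
update-initramfs -u -k all
# on ZFS-root systems using proxmox-boot-tool, copy the refreshed initramfs to the ESPs
proxmox-boot-tool refresh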

-------------------------------------------

No, this is entirely queuing and locking in PBS.
And possibly some obfuscation where it pretends to transfer data that is then discarded.

The server that won't download anything just fired right back up and started pulling data. No intervention on my part.

It's chugging away, and it's pulling data from a PBS that is running another sync job at the same time. (Two different namespaces.)
No issues at the moment. It's being cooperative and doing what I expect.
I can no more explain why it works now than why it didn't work earlier.
 