Slow Migration between nodes

drjaymz@

I recently set up three nodes running a fairly small VM which consists of 3 disks (25/4/25 GB, so 54 GB in total). Migration of that VM from one node to another tops out at 120 Megabit/s, and you can quite clearly see the limitation on the IO graphs.
I can quite happily get 950 Megabit/s between nodes for other data, or when doing a backup to a network drive, so why is this bit so slow? There isn't much in the way of contention, and I'd expect the migration to take less than 10 minutes, not over an hour.
The IO limitation is so flat that it looks exactly like a bandwidth limit has been set.

I think what it's doing is creating a snapshot, compressing it (which comes to about 10 GB) and then pushing that over - but the performance is terrible.

I do have replication set up between the nodes, and that only takes a minute or two every 15 minutes with the changes we accumulate - but the initial sync, I think, went similarly slowly.

The nodes do have additional 10 Gbit ports which I plan to use for replication, but replication isn't using more than 10% of the primary link to start with.

I have looked around and found others noting slow behaviour after nodes have been running for a long time, but these are only a week old, so I don't know what else to check.
 
The default migration method is secure, i.e. via SSH IIRC. If the migration network is local, you could try insecure mode by editing /etc/pve/datacenter.cfg and setting migration: insecure.
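For example, something along these lines in /etc/pve/datacenter.cfg (there should only be one migration line; 10.10.10.0/24 is just a placeholder for your own migration subnet, and the exact property-string syntax can vary a bit between PVE versions):

    # /etc/pve/datacenter.cfg
    migration: type=insecure,network=10.10.10.0/24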

Is the migration from local to local storage or are you using a shared storage like ceph?

In the local-to-local case the low performance could also be storage-related, since the migration has to read all the data (somewhat fast), send it to the second node (usually fast) and write it to the new storage (usually the slowest part).

In the case of a shared storage, a migration only needs to copy the RAM of the machine (plus delta-blocks that changed while the migration was running), so it is much faster.
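If you want to work out which of those legs is the slow one, a rough sketch would be to test each separately (addresses and test paths below are just placeholders for your setup):

    # raw network throughput between the two nodes
    iperf3 -s                        # on the target node
    iperf3 -c <target-node-ip> -t 30 # on the source node

    # sequential write speed on the target storage
    # (drop --direct=1 if the filesystem doesn't support O_DIRECT, e.g. ZFS)
    fio --name=seqwrite --filename=/var/lib/vz/fio.test --rw=write \
        --bs=1M --size=4G --direct=1 --ioengine=libaio

Running the same fio job with --rw=read on the source storage covers the read leg.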
 
It's local storage, but I can write to that local storage an order of magnitude quicker, unless something is horribly misconfigured somewhere.
To rule that out, I could try restoring a backup to another node to see what speed that writes at - it should come in from the network at close to full bandwidth and then decompress and write to the disk at a reasonable rate. If that's also strangled, then there's definitely something wonky there.

Shared storage may be an answer and something I need to look at right now whilst I only have one VM on there.
 
It restored 54 GB in about 11 minutes. That means the limitation was NOT the write performance of the disk; in this case it was the read-from-the-network performance. So it's unsolved as of now.

I will try the insecure copy and see if that makes any difference.
 
OK that is copying very quickly now.

[screenshot: node network traffic graph]

Sorry there is such a long gap in the middle, but you can see the inbound copy was 12 MByte/s and the outbound is now 120 MByte/s. So yeah - it's the secure method that slows it down, by exactly 90%.
 
Scratch that - it's related to that first node. It's exactly the same migrating back: 10% speed.

Yep, pushing it back the other way...

[screenshot: node network traffic graph showing the return migration]

You can see it's now at 10% - that looks like it's deliberately capped somewhere?
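If I understand the options correctly, the usual places a migration cap could be hiding would be the datacenter and storage configs (the migration=12000 value below is just an illustration of the KiB/s syntax, not my config):

    # datacenter-wide limits, e.g. bwlimit: migration=12000
    grep -i bwlimit /etc/pve/datacenter.cfg
    # per-storage limits
    grep -i bwlimit /etc/pve/storage.cfg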
 
This is interesting, as I'm having a very similar problem. I'm getting transfer rates of around 2-5 MiB/s between my servers, and I don't think it's the write performance on the server I'm migrating everything onto, but I could be wrong.

My setup:
3 servers, all older equipment but functional: a PowerEdge R420 and two old Acer Veritons. I'm testing this with some older HDDs - the R420 has 4 TB Western Digital enterprise HDDs, one of the Acers has a 10 TB Seagate, and the last Acer, I believe, has just a 2 TB Western Digital desktop HDD.

All VM storage is on ZFS, and the maximum transfer speed I see when migrating a VM is about 10 MB/s. I'm not sure why this is the case.
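To rule the destination pool in or out, I guess I could check whether it can write faster than that by watching it while writing a few GB locally (pool name and path below are just placeholders; /dev/urandom rather than /dev/zero so ZFS compression doesn't skew it):

    # in one terminal: pool write bandwidth, refreshed every 2 seconds
    zpool iostat -v tank 2
    # in another: write ~4 GB of data to the pool
    dd if=/dev/urandom of=/tank/ddtest bs=1M count=4096 status=progress

If that sustains well above ~10 MB/s, the disks probably aren't what's limiting the migration.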
 
@cshill When the migration is happening, does top (or better yet, htop) show an individual process taking up about ~99% of a CPU core?

If so, check if it's an ssh process (it'll be in the name). That would indicate the bottleneck is ssh.

That might be what's going on.
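A rough way to look from a plain shell, if htop isn't installed (nothing here is specific to Proxmox):

    # on the source node, while the migration is running
    top -o %CPU
    # or list ssh processes by CPU usage directly
    ps -C ssh -o pid,pcpu,args --sort=-pcpu

A single-threaded ssh pinned at ~100% of one core would point at the cipher/ssh overhead rather than the disks or the network.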
 
I figured out my problem. I started some network testing and noticed packets were dropping on the way to the one server I was migrating back and forth to. I checked a couple of options on the server, but the settings were good. I went upstairs and noticed the cable I had plugged in was Cat5, not Cat5e, and then found out the room itself wasn't wired correctly either - it only had a 100 Mb/s link. I moved the server again, this time into the server room, and laughed because the wall jack there was also limited to 100 Mbps. So I ran a long cable directly into the switches and that fixed the speed problem. I now have a separate problem, but I've just posted about it if you want to check it out - I think you can find it via my profile?
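For anyone hitting the same thing later: the negotiated link speed shows up straight away with ethtool (interface name below is just an example):

    ethtool eno1 | grep -Ei 'speed|duplex'
    # "Speed: 100Mb/s" would explain a ceiling of roughly 10-12 MB/s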
 