Live-Migration almost freezes Targetnode

abien · Apr 6, 2022

We could not solve this issue reliably and ended up moving away from ZFS for primary VM storage. At this point we have switched to Ceph with an all-flash configuration. The investment in terms of EUR was quite substantial but the results are pretty good - I dare say "worth it". After roughly one year of service, Ceph delivers what it promises. There have been no unexplained glitches, no weird crashes or missing logs, and no performance issues. Even though it comes with lots of new concepts and can be scary at first, it's pretty straightforward to maintain once you start using it.

j4ys0n · Apr 27, 2022

Same thing here. I thought my new server crashed, until I logged into another node and saw the IO delay spike.

Are individuals able to contribute to the Proxmox source code? I'd love to fix things and add features.

JamesT · Apr 27, 2022

@j4ys0n do you mind if I ask your server specs? My current servers, which are still running (and which were the topic of my original post earlier in this thread), are Dell R520s.
I'm just configuring a new R740 two-node HPC cluster now and hoping that newer hardware will somehow make this problem go away, but I'm not feeling that optimistic, and even less so after your recent reply.

fiona · Apr 27, 2022

Hi,

j4ys0n said:
Are individuals able to contribute to the Proxmox source code? I'd love to fix things and add features.

Yes, contributions are welcome. See the developer documentation.

j4ys0n · Apr 27, 2022

JamesT said:
@j4ys0n do you mind if I ask your server specs? My current servers, which are still running (and which were the topic of my original post earlier in this thread), are Dell R520s.
I'm just configuring a new R740 two-node HPC cluster now and hoping that newer hardware will somehow make this problem go away, but I'm not feeling that optimistic, and even less so after your recent reply.

I should have added more info to the post. The server isn't completely new - it's some hardware I had lying around that I got a "new" (also old) motherboard for: i7 7700K, 32GB of 2400 MHz memory, 4 Kingston SSDs in RAID10, a 10G NIC and an ASRock Rack board. It's for some services I wanted to move off of another server so I can upgrade it. I do have a newer Epyc server to upgrade and another new Epyc server in the works - I'll try again on those when the time comes, but that will be a week or two.

hohl · May 26, 2022

Seeing the same issue here: the target node freezes during live migrations with Proxmox 7.2 (and it already happened with 7.1).

All nodes use Enterprise NVMe 4.0 disks, ZFS RAID 10, 10G NICs, latest AMD EPYC CPUs and a mixture of Supermicro and Dell mainboards. Discard/trim is enabled for all VMs; the full list of packages in use is attached.

Live migration works at up to ~900 MB/s. But even if it's throttled to ~50 MB/s through the bandwidth options, it still renders the entire target node (i.e. all VMs on it) unusable. (A few guest OSes/setups on the VMs also seem to crash in relation to this, but that might just be because they fail to handle multi-minute I/O waits.)

From what I can tell, there seem to be two things necessary for the target node freeze to happen:
- live migration of a rather disk I/O heavy VM (like a database server), and
- discard/trim enabled on the VM.
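
To see which VMs meet the second condition, one could scan the VM config text for disks that carry `discard=on`. The following is only a minimal sketch, assuming config lines follow the usual `key: volume,option=value,...` layout; the function name, the list of disk key prefixes and the sample config are hypothetical and only for illustration.

```python
import re

# Disk key prefixes as they commonly appear in VM configs (assumed list).
DISK_KEY = re.compile(r"^(scsi|virtio|sata|ide)\d+$")


def disks_with_discard(config_text):
    """Return the disk keys whose option list contains discard=on."""
    found = []
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # Snapshot sections follow; only inspect the current config.
            break
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or not DISK_KEY.match(key):
            continue
        options = [opt.strip() for opt in value.split(",")]
        if "discard=on" in options:
            found.append(key)
    return found


# Hypothetical sample config, not taken from any real node.
SAMPLE = """\
cores: 4
scsi0: local-zfs:vm-100-disk-0,discard=on,iothread=1,size=32G
scsi1: local-zfs:vm-100-disk-1,size=100G
net0: virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0
"""

print(disks_with_discard(SAMPLE))
```

Run against each VM's config, this would give a rough list of candidates to migrate with extra care; it does not confirm that discard is the cause.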

One more thing: it seems to primarily freeze the other VMs on the target node, and has less impact on the VM that is being transferred.

It doesn't seem to be influenced much by other factors. There are also some slower nodes at work in different clusters (including ones with slow spinning HDDs) where this issue also happens, but less frequently. However, the reason it happens less frequently there might simply be that these slower nodes usually don't run the most I/O-heavy workloads in the first place.

I'm just posting all these details in the hope that they might help somebody isolate the cause of this issue. For now, I will just try to avoid the migration feature altogether. (Live migrations are primarily used here to apply Proxmox updates without any downtime for the guest systems anyhow.)

DerDanilo · Jun 11, 2022

Managing multiple systems as well. Local LVM, ZFS and network storage (CIFS, NFS or Ceph RBD), it's all the same. One VM migration renders all VMs on the host unusable (CPU freeze / kernel crash). Have to bulk stop and start all VMs afterwards.
Only offline migration seems to work without issues.

Any hint on how to solve this? Is this a bug? Kernel issues?

Mike.a · Aug 16, 2022

I'm seeing the exact same issue, only in my case ZFS isn't involved, but rather Ceph.

filipealvarez · Nov 5, 2023

Any news?


