Live migration almost freezes the target node

We could not solve this issue reliably and ended up moving away from ZFS for primary VM storage. At this point we have switched to Ceph with an all-flash configuration. The investment in terms of EUR was quite substantial, but the results are pretty good - I dare say "worth it". After roughly one year of service, Ceph delivers what it promises: no unexplained glitches, no weird crashes, no missing logs and no performance issues. Even though it comes with lots of new concepts and can be scary at first, it's pretty straightforward to maintain once you start using it.
 
Same thing here. I thought my new server had crashed until I logged into another node and saw the IO delay spike.

Are individuals able to contribute to the Proxmox source code? I'd love to fix things, add features :)

 
@j4ys0n do you mind if I ask your server specs? My current servers, which are still running (and which were the topic of my original post earlier in this thread), are Dell R520s.
I'm just configuring a new two-node R740 HPC cluster now and hoping that newer hardware will somehow make this problem go away, but I'm not feeling that optimistic - even less so after your recent reply.
 
I should have added more info to the post. The server isn't completely new - it's some hardware I had lying around that I got a "new" (also old) motherboard for: i7-7700K, 32 GB of 2400 MHz memory, 4 Kingston SSDs in RAID 10, a 10G NIC and an ASRock Rack board. It's for some services I wanted to move off of another server so I can upgrade it. I do have a newer EPYC server to upgrade and another new EPYC server in the works - I'll try again on those when the time comes, though that will be a week or two out.
 
Seeing the same issue here: the target node freezes during live migrations with Proxmox 7.2 (it already happened with 7.1 as well).

All nodes use enterprise NVMe 4.0 disks, ZFS RAID 10, 10G NICs, the latest AMD EPYC CPUs and a mixture of Supermicro and Dell mainboards. Discard/trim is enabled for all VMs; the full list of packages in use is attached.

Live migration runs at up to ~900 MB/s. But even when it's throttled to ~50 MB/s through the bandwidth options, it still renders the entire target node (i.e. all VMs on it) unusable. (A few guest OSes/setups on the VMs also seem to crash in relation to this, but that might just be because they fail to handle multiple minutes of IO wait.)
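
For reference, the throttling mentioned above is done roughly like this (VMID and node name are just placeholders; the limit is given in KiB/s, so ~50 MB/s is about 51200):

    qm migrate 101 node2 --online --with-local-disks --bwlimit 51200

or as a cluster-wide default in /etc/pve/datacenter.cfg:

    bwlimit: migration=51200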

From what I can tell, there seem to be two things necessary for the target node freeze to happen:
- live migration of a rather disk-I/O-heavy VM (like a database server), and
- discard/trim enabled on the VM (an example of how that is set follows below).
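
For reference, discard is just a per-disk flag in the VM config; on our VMs it is set roughly like this (VMID, bus and storage/volume names are examples, not the exact config here):

    qm set 101 --scsi0 local-zfs:vm-101-disk-0,discard=on

It then shows up as discard=on on the disk line in "qm config 101".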

One more thing: it seems to primarily freeze the other VMs on the target node, while the VM being transferred is affected less.

It doesn't seem to be influenced much by other factors. There are also some slower nodes at work in different clusters (including ones with slow spinning HDDs) where this issue happens as well, just less frequently. However, the reason it happens less frequently there might simply be that those slower nodes usually don't run the most I/O-heavy workloads in the first place.

I'm just posting all these details in the hope they might help somebody isolate the reason for this issue. For now, I will simply try to avoid the live-migration feature altogether. (Live migrations are primarily used here to apply Proxmox updates without any downtime for the guest systems anyway.)
 

Attachments

  • packages-and-versions-on-the-proxmox-nodes.txt
Managing multiple systems as well. Local LVM, ZFS and network storage (CIFS, NFS or Ceph RBD) - it's all the same. A single VM migration renders all VMs on the host unusable (CPU freeze / kernel crash), and I have to bulk-stop and start all VMs afterwards.
Only offline migration seems to work without issues.
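
In case it helps anyone, "offline" here simply means migrating while the VM is shut down, e.g. (VMID and node name are placeholders):

    qm shutdown 101
    qm migrate 101 node2

The live variant adds --online (and --with-local-disks for local storage), and that is the one that triggers the freeze here.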

Any hint on how to solve this? Is this a bug? Kernel issues?
 
