"when we run a restore it randomly picks a VM and crashes it"

Ya. This was waiting for me at work today. Mystery problem. No details. Everyone is upset.

Entirely over my objections, we have changed our corosync VLAN to the same one used for migrations and restores.

When they run a restore, it's flooding corosync. And now I need to untangle this again.

RTFM. https://pve.proxmox.com/pve-docs/chapter-pvecm.html
 
I ultimately heard about 5 different conflicting stories about issues that occurred during a restore.
Boy, did I catch it from all sides. Lotta fun.
My weekday counterpart has no time for it, because it all sounds like BS.
I did find an issue with migration that threw an SSH error, which I resolved.

Sometimes you get a problem report, even from the most experienced staff, and you just know up front you will never track down whatever the h3ll happened. So you shake the chicken, do a little voodoo dance, and act like it's fixed.
 
Just a wild guess, but maybe the OOM killer steps in because your host would otherwise be out of memory, like here?

*edit*
Check your logs for OOM killer events:
dmesg -T | egrep -i 'killed process'
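The ring buffer rolls over eventually, so if the restore was a while back, the persistent journal is a safer place to look. Something like this; the pattern is just my guess at the usual kernel wording, adjust as needed:

# kernel messages only, scan for OOM kills
journalctl -k | egrep -i 'killed process|out of memory'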
 
Hey thanks. I hadn't considered OOM. In this case it's not likely. The PVE host has 512 GB of RAM and is about half used. Even with ZFS chewing up RAM, the machine is currently OK. I didn't know that dmesg command, that's kinda cool.

I think my users were freaked out by the corosync connection getting flooded during a restore. That makes it look like the whole cluster goes down, and it can cause extreme consternation. I think the problem reports I got were not entirely accurate, and colored by people being upset.
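For what it's worth, next time they run a restore I plan to watch corosync directly instead of relying on the reports. Something along these lines; the unit names are the standard PVE ones, nothing exotic:

# follow corosync and the Proxmox cluster filesystem while the restore runs
journalctl -fu corosync -u pve-cluster

# quorum / membership as corosync sees it right now
corosync-quorumtool -s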
 
Don't think the corosync connection should get flooded. Have you set up corosync redundantly, on a dedicated interface (no bridge / bond)? That's what the docs suggest.

but that won't solve the issue of a random vm getting killed ...
 
My friend, you're trying to be helpful. Thanks, but you didn't read the first post.
Your point about not using a bond ... Doesn't seem like good advice. Maybe even bad advice.
The random VM getting killed ... I think my users are hallucinating.

A redundant corosync interface isn't the magical fix one might think. Yes, I do have a redundant interface: two completely different sets of switches, different subnets. The redundant link doesn't kick in until the primary link completely drops out. So if that interface drops offline, the next one comes online. But if the primary thinks it's still online, corosync is too stupid to shift to the other link when communication stops. It is very far from an optimal situation. I wish a redundant link were the solution its name seems to imply.
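For reference, the two links look roughly like this in my corosync.conf. Addresses and node names here are made up, and check man 5 corosync.conf for knet_link_priority; if I remember right, the higher number wins when knet is in passive mode:

nodelist {
  node {
    name: pve01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.11    # primary corosync VLAN
    ring1_addr: 10.20.20.11    # second set of switches, different subnet
  }
  # one entry like this per node
}

totem {
  interface {
    linknumber: 0
    knet_link_priority: 20     # preferred link
  }
  interface {
    linknumber: 1
    knet_link_priority: 10     # fallback
  }
}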

My company made a network change that affects corosync traffic. Then we had problems. They seem to be related.
This week I've added a new VLAN to logically (re)-segregate corosync and ceph/vsan traffic.

I am just right now about to edit corosync.conf and shift the primary IPs for the cluster onto this VLAN.
I find the whole editing operation to be ... well ... rocket surgery. But I've done it a few times.
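Roughly the routine, for anyone who lands here later. It's basically what the pvecm docs describe; the temp file names are just where I happen to stash my copies:

# never edit the live file in place - work on copies
cp /etc/pve/corosync.conf /root/corosync.conf.bak
cp /etc/pve/corosync.conf /root/corosync.conf.new

# change the ringX_addr entries to the new VLAN and bump config_version by 1
nano /root/corosync.conf.new

# moving it back into /etc/pve should get it synced by pmxcfs and picked up by corosync
mv /root/corosync.conf.new /etc/pve/corosync.conf

# then confirm the cluster is still happy
corosync-cfgtool -s
pvecm status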

Woo hoo! Here we go ....
 
Well, corosync is about latency; that's why it is suggested to use one or more dedicated interfaces for it without any abstraction layer (like a bond or bridge) to keep latency low. An LACP bond could take too long for corosync to recover ...

If you use corosync over a bridge / bond, you have to take extra care to prioritize corosync traffic, because if latency rises during a migration or a backup, your hosts could fence and reboot.
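If you want to see what corosync itself thinks of the links while a backup or migration is running, something like this should show it (the output format differs a bit between versions):

# per-link state as knet sees it on this node
corosync-cfgtool -s

# and follow the corosync log for token / retransmit warnings
journalctl -fu corosync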

But whatever, this is most likely not the cause of your VM issue.