"when we run a restore it randomly picks a VM and crashes it"

Ya. This was waiting for me at work today. Mystery problem. No details. Everyone is upset.

Entirely over my objections, we have changed our corosync VLAN to the same one used for migrations and restores.

When they run a restore, it's flooding corosync. And now I need to untangle this again.

RTFM. https://pve.proxmox.com/pve-docs/chapter-pvecm.html
 
I ultimately heard about 5 different conflicting stories about issues that occurred during a restore.
Boy, did I catch it from all sides. Lotta fun.
My weekday counterpart has no time for it, because it all sounds like BS.
I did find an issue with migration that threw an SSH error, which I resolved.

Sometimes you get a problem report, even from the most experienced staff, and you just know up front you will never track down whatever the h3ll happened. So you shake the chicken, do a little voodoo dance, and act like it's fixed.
 
Just a wild guess, but maybe the OOM killer steps in because your host would otherwise be out of memory, like here?

*edit*
Check your logs for OOM killer events:
dmesg -T | egrep -i 'killed process'
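The ring buffer rolls over eventually, so if the restore was a while back, the persistent journal is a safer place to look. Something like this; the pattern is just my guess at the usual kernel wording, adjust as needed:

# kernel messages only, scan for OOM kills
journalctl -k | egrep -i 'killed process|out of memory'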
 
Hey thanks. I hadn't considered OOM. In this case it's not likely. The PVE host has 512 GB of RAM and is about half used. Even with ZFS chewing up RAM, the machine is currently OK. I didn't know that dmesg command, that's kinda cool.

I think my users were freaked out by the corosync connection getting flooded during a restore. That makes it look like the whole cluster goes down, and it can cause extreme consternation. I think the problem reports I got were not entirely accurate, and colored by people being upset.
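For what it's worth, next time they run a restore I plan to watch corosync directly instead of relying on the reports. Something along these lines; the unit names are the standard PVE ones, nothing exotic:

# follow corosync and the Proxmox cluster filesystem while the restore runs
journalctl -fu corosync -u pve-cluster

# quorum / membership as corosync sees it right now
corosync-quorumtool -s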
 
Don't think the corosync connection should get flooded. Have you set up corosync redundantly, on a dedicated interface (no bridge / bond)? That's what the docs suggest.

but that won't solve the issue of a random vm getting killed ...
 
My friend, you're trying to be helpful. Thanks, but you didn't read the first post.
Your point about not using a bond ... Doesn't seem like good advice. Maybe even bad advice.
The random VM getting killed ... I think my users are hallucinating.

A redundant corosync interface isn't the magical fix one might think. Yes, I do have a redundant interface: two completely different sets of switches, different subnets. The redundant link doesn't kick in until the primary link completely drops out. So if that interface drops offline, the next one comes online. But if the primary thinks it's still online, corosync is too stupid to shift to the other link when communication stops. It is very far from an optimal situation. I wish a redundant link were the solution its name seems to imply.
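For reference, the two links look roughly like this in my corosync.conf. Addresses and node names here are made up, and check man 5 corosync.conf for knet_link_priority; if I remember right, the higher number wins when knet is in passive mode:

nodelist {
  node {
    name: pve01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.11    # primary corosync VLAN
    ring1_addr: 10.20.20.11    # second set of switches, different subnet
  }
  # one entry like this per node
}

totem {
  interface {
    linknumber: 0
    knet_link_priority: 20     # preferred link
  }
  interface {
    linknumber: 1
    knet_link_priority: 10     # fallback
  }
}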

My company made a network change that affects corosync traffic. Then we had problems. They seem to be related.
This week I've added a new VLAN to logically (re)-segregate corosync and ceph/vsan traffic.

I am just right now about to edit corosync.conf and shift the primary IPs for the cluster onto this VLAN.
I find the whole editing operation to be ... well ... rocket surgery. But I've done it a few times.
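Roughly the routine, for anyone who lands here later. It's basically what the pvecm docs describe; the temp file names are just where I happen to stash my copies:

# never edit the live file in place - work on copies
cp /etc/pve/corosync.conf /root/corosync.conf.bak
cp /etc/pve/corosync.conf /root/corosync.conf.new

# change the ringX_addr entries to the new VLAN and bump config_version by 1
nano /root/corosync.conf.new

# moving it back into /etc/pve should get it synced by pmxcfs and picked up by corosync
mv /root/corosync.conf.new /etc/pve/corosync.conf

# then confirm the cluster is still happy
corosync-cfgtool -s
pvecm status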

Woo hoo! Here we go ....
 
Well, corosync is about latency; that's why it is suggested to use one or more dedicated interfaces for it without any abstraction layer (like a bond or bridge) to keep latency low. An LACP bond could take too long for corosync to recover ...

If you use corosync over a bridge / bond, you have to take extra care to prioritize corosync traffic, because if latency rises during a migration or a backup, your hosts could fence and reboot.
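If you want to see what corosync itself thinks of the links while a backup or migration is running, something like this should show it (the output format differs a bit between versions):

# per-link state as knet sees it on this node
corosync-cfgtool -s

# and follow the corosync log for token / retransmit warnings
journalctl -fu corosync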

But whatever, this is most likely not the cause of your VM issue.