Hypervisor rebooted when VM memory resized

Dragonn

Hello there,

yesterday one of our hypervisors crashed (actually it was probably rebooted by the watchdog) when a VM had its memory resized. I would really appreciate it if you could give me some insight into the VM placement algorithm, because I am still unable to completely understand how node selection works.

According to my reconstruction from the logs, this is exactly what happened:
  • There are 11 identical hypervisors in the cluster
    • 64GB RAM + 8GB swap each
    • using OVS bridges for networking
    • not using memory ballooning
  • On one of them, there were 8 VMs running, using 52 GB of memory in total (+ some KVM overhead, of course); everything was fine
  • A user decided to resize the memory of a single VM, so he changed the memory size from 8 GB to 30 GB and clicked Reboot
  • The VM was shut down via the QEMU guest agent, the configuration was applied, and (12:52:22) the VM was started again on the same hypervisor
    • note that now all VMs on that hypervisor would need 52 - 8 + 30 = 74 GB of RAM (+ overhead), but there is only 64 GB RAM + 8 GB swap available (see the quick check after this list)
  • The hypervisor was choking itself
    • (12:53:30) monitoring graphs show 100% memory usage, load rising because of intensive swapping
    • (12:53:49) the consul service was complaining it could not see the master servers (looks like huge packet loss to me)
    • (12:54:41) pvestatd reports lock timeout
    • (12:55:20) last entry in monitoring graphs, load 35, swap usage ~80%
    • (12:55:23) last log message in syslog
    • (12:56:57) VMs are getting started on other hypervisors (as expected)
    • (12:59:10) hypervisor booting again
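
A quick check that could have caught this ahead of time -- a rough sketch that sums the configured memory of the VMs running on the node and compares it to the host's total (assumes KVM guests without ballooning):

total_vm_mib=0
for vmid in $(qm list | awk '$3 == "running" {print $1}'); do
    mem=$(qm config "$vmid" | awk '/^memory:/ {print $2}')
    total_vm_mib=$((total_vm_mib + ${mem:-512}))    # 512 MiB is the default when no memory line is set
done
host_mib=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)
echo "VMs want ${total_vm_mib} MiB, host has ${host_mib} MiB (before KVM overhead)"

In my case this would have reported roughly 75776 MiB wanted against roughly 65536 MiB of RAM.
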
So my theory is (please correct me if you see it differently):
  • Proxmox doesn't check ahead if VM fits into hypervisor when starting it
  • When VM is rebooted, no migration consideration is done
  • while VM memory was allocating, essential processes like corosync, pmxcfs or openvswitchd were swapped and lagged
  • hypervisor was rebooted by watchdog before swap was consumed and VM could be OOM killed
All VMs have the same configuration except for CPU, RAM, disk size and allowed VLANs. Here is one of them as an example:
agent: 1
bootdisk: scsi0
cores: 8
ide0: none,media=cdrom
memory: 10240
name: ofswebperfdb
net0: virtio=7A:D4:7E:42:2C:BD,bridge=vmbr0,tag=341,trunks=302
numa: 0
onboot: 1
ostype: l26
scsi0: oceph1:vm-142-disk-0,backup=0,discard=on,size=80G,ssd=1
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=31490500-68ae-444a-989f-d9f9e4505301
sockets: 1
vcpus: 8
vmgenid: 87c722a6-266d-4000-8b24-be1d536c1e06

We are currently running Proxmox VE 6.2:
proxmox-ve: 6.2-1 (running kernel: 5.4.48-ls-2)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.4.44-2-pve: 5.4.44-2
ceph-fuse: 14.2.9-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-5
libpve-guest-common-perl: 3.1-1
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-3
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-9
pve-cluster: 6.1-8
pve-container: 3.1-12
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.1-1
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-13
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-11
smartmontools: 6.6-1
spiceterm: 3.1-1
vncterm: 1.6-1

I honestly thought Proxmox keeps track of the minimal memory allocation for each VM, so I expected it to fail to start a VM when there is not enough memory.
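
If I wanted to approximate such a check myself, I guess a guest hookscript could refuse the start in its pre-start phase. A rough sketch (the 4 GiB headroom, the script name and the snippet path are just my assumptions):

#!/bin/bash
# hypothetical pre-start guard, registered per VM with something like:
#   qm set <vmid> --hookscript local:snippets/memcheck.sh
# Proxmox calls hookscripts with two arguments: the VMID and the phase
vmid="$1"
phase="$2"
if [ "$phase" = "pre-start" ]; then
    want_mib=$(qm config "$vmid" | awk '/^memory:/ {print $2}')
    avail_mib=$(awk '/^MemAvailable:/ {print int($2 / 1024)}' /proc/meminfo)
    headroom_mib=4096    # assumed reserve for the host OS and KVM overhead
    if [ $((avail_mib - headroom_mib)) -lt "${want_mib:-512}" ]; then
        echo "refusing to start VM $vmid: wants ${want_mib} MiB, only ${avail_mib} MiB available"
        exit 1           # as far as I understand, a non-zero exit in pre-start aborts the start
    fi
fi
exit 0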

Also, I would really like to be able to define some amount of memory that is reserved for the hypervisor OS, which Proxmox would then try not to use for VMs. I know it's not possible in all cases, especially if you are using memory ballooning, but I would like to reserve about 10-15% of memory for KVM overhead and OS breathing space.

Now I would really appreciate any advice on what I can do to make the cluster more foolproof. Generally, I want users not to have to bother with cluster usage, VM placement, etc.

My current thoughts go towards the following (rough sketch after the list):
  • significantly lowering vm.swappiness (we have the default of 60 right now)
  • experimenting with vm.overcommit_memory and vm.overcommit_ratio
  • turning off swap for the corosync, pve-cluster and openvswitch-nonetwork systemd units
  • shrinking swap significantly (maybe to 512M or so) or turning it off completely
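
Concretely, I imagine something along these lines (example values only, untested -- I would try it on a single node first):

# sysctl knobs
cat > /etc/sysctl.d/90-hypervisor.conf <<'EOF'
vm.swappiness = 10
vm.overcommit_memory = 2
vm.overcommit_ratio = 90
EOF
sysctl --system

# keep cluster-critical daemons out of swap
# (MemorySwapMax= only has an effect with the unified cgroup hierarchy / cgroup v2)
for unit in corosync pve-cluster; do    # plus the OVS unit, whatever it is called on these hosts
    mkdir -p /etc/systemd/system/${unit}.service.d
    printf '[Service]\nMemorySwapMax=0\n' > /etc/systemd/system/${unit}.service.d/noswap.conf
done
systemctl daemon-reload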

Thanks in advance for all the time you spent reading and thinking about this long post. Here's an emoticon for your good mood ;)
 
Proxmox doesn't check ahead if VM fits into hypervisor when starting it
No it does not. Due to things like KSM it is possible to overprovision the RAM quite significantly if a lot of VMs are using the same OS for example. Therefore enforcing such limits would have an impact on other use cases.

When VM is rebooted, no migration consideration is done
No, there is not a lot of magic behind it. If you order a VM to boot, it will boot on the node it is currently located at.

while VM memory was allocating, essential processes like corosync, pmxcfs or openvswitchd were swapped and lagged
Seems plausible, but that would need to be checked in the logs, also in the logs of the other nodes. For example, search for "corosync" in the syslogs. If there were issues with that node, the other nodes should have seen that node go down.
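
Something like this, adjusted to the time of the incident:

# on the affected node and on its peers
journalctl -u corosync -u pve-cluster --since yesterday
grep -iE 'corosync|knet|quorum' /var/log/syslog /var/log/syslog.1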



Do you have HA-enabled guests? That's when the watchdog becomes active; it fires if the node cannot communicate with the other cluster nodes in time.

If you do not want swap, you can also disable it completely.
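
Roughly:

swapoff -a    # disable swap right away
# then comment out or remove the swap entry in /etc/fstab so it stays off after a reboot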
 
Thanks @aaron for all your answers.

No it does not. Due to things like KSM it is possible to overprovision the RAM quite significantly if a lot of VMs are using the same OS for example. Therefore enforcing such limits would have an impact on other use cases.
Yeah, I understand that there are cases where this behavior is not wanted. I am currently in a situation where I am trying to build the cluster to be as reliable as possible. I don't mind reserving RAM and never using it; I'd rather do that than have an unexpected VM failure because of an OOM condition.

Edit: By the way, all our hypervisors and VMs are based on the same Debian Stretch template, so there will probably be a lot of identical memory pages to merge. But I am not sure I want/need to use KSM in my current setup.
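
If I decide to check how much it would actually merge, or to switch it off, I assume something like this would do it (standard kernel KSM sysfs knobs plus the ksmtuned service PVE ships):

cat /sys/kernel/mm/ksm/pages_sharing    # anything > 0 means KSM is merging pages
systemctl disable --now ksmtuned        # stop the KSM tuning daemon
echo 2 > /sys/kernel/mm/ksm/run         # unmerge all shared pages and stop KSM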

No, there is not a lot of magic behind it. If you order a VM to boot, it will boot on the node it is currently located at.
I thought that since every start command goes through the HA manager, it would do some basic checking.

Seems plausible, but that would need to be checked in the logs, also in the logs of the other nodes. For example, search for "corosync" in the syslogs. If there were issues with that node, the other nodes should have seen that node go down.

Do you have HA-enabled guests? That's when the watchdog becomes active; it fires if the node cannot communicate with the other cluster nodes in time.
Yes, I have HA enabled for every running VM, so the watchdog is active on every hypervisor. The other servers lost communication with the problem hypervisor and reported it in the corosync and pmxcfs logs -- no flapping, just hard dead. This is completely expected behavior and I have no issue with it.

The only thing I am not sure about is how exactly the mechanism for resetting the watchdog timer works. I suppose the pve-ha-lrm daemon writes its status into /etc/pve/local/lrm_status and, if that succeeds, it resets the watchdog?

So, to prevent the hypervisor from fencing itself, I need the pve-ha-lrm, pmxcfs and corosync daemons to always be working? What about the OVS daemon? Is it actually needed for networking to work?
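
For my own reference, I assume the relevant pieces can be checked with something like:

ha-manager status                                     # cluster-wide HA view
systemctl status pve-ha-lrm pve-ha-crm watchdog-mux corosync pve-cluster
journalctl -u watchdog-mux --since yesterday          # client expiry should be logged here, if it makes it to disk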

If you do not want swap, you can also disable it completely.
In general I think swap is a good thing to have, but I don't feel I really need it on the hypervisors, so I am willing to disable it if I don't find any other solution. I would rather have the starting VM get killed than the whole hypervisor reboot.
 
