Hello there,
yesterday one of our hypervisors crashed (actually it was probably rebooted by the watchdog) when a VM had its memory resized. I would really appreciate it if you could give me some insight into the VM placement algorithm, because I am still unable to completely understand how node selection works.
According to my reconstruction from the logs, this is exactly what happened:
- There are 11 identical hypervisors in the cluster
- 64GB RAM + 8GB swap each
- using OVS bridges for networking
- not using memory ballooning
- On one of them, there were 8 VMs running, using 52GB memory total (+ some KVM overhead, of course), and everything was fine
- A user decided to resize the memory of a single VM, so he changed the memory size from 8GB to 30GB and clicked Reboot
- the VM was shut down via the QEMU guest agent, the configuration was applied, and (12:52:22) the VM was started again on the same hypervisor
- note that now all VMs on that hypervisor would need 52 - 8 + 30 = 74 GB RAM (+ overhead), but there is only 64GB RAM + 8GB swap available
- the hypervisor started choking itself
- (12:53:30) monitoring graphs show 100% memory usage, load rising because of intensive swapping
- (12:53:49) the consul service complained it could not see the master servers (looks like heavy packet loss to me)
- (12:54:41) pvestatd reports lock timeout
- (12:55:20) last entry in monitoring graphs, load 35, swap usage ~80%
- (12:55:23) last log message in syslog
- (12:56:57) VMs are getting started on other hypervisors (as expected)
- (12:59:10) hypervisor booting again
My conclusions:
- Proxmox doesn't check ahead whether a VM fits on the hypervisor when starting it
- when a VM is rebooted, no migration is considered
- while the VM's memory was being allocated, essential processes like corosync, pmxcfs and openvswitchd were swapped out and lagging
- the hypervisor was rebooted by the watchdog before swap was exhausted and the VM could be OOM-killed
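One mitigation sketch for the first point (this is not built-in Proxmox behavior): a per-VM hookscript whose pre-start phase aborts the start when the configured memory would not fit into the node's `MemAvailable`. Proxmox hookscripts are attached with `qm set <vmid> --hookscript local:snippets/<name>`, and a non-zero exit in the pre-start phase aborts the start; the 10% overhead factor below is my guess, not an official figure.

```shell
#!/bin/sh
# Sketch of a Proxmox hookscript guard: place it in the snippets storage and
# attach with  qm set <vmid> --hookscript local:snippets/mem-guard.sh

# Succeeds if the VM's configured memory, plus ~10% for KVM/QEMU overhead
# (a rough guess, not an official number), fits into MemAvailable.
mem_fits() {
    want_mib=$(awk '/^memory:/ {print $2; exit}' "$1")
    avail_mib=$(awk '/^MemAvailable:/ {print int($2 / 1024)}' /proc/meminfo)
    [ $(( ${want_mib:-0} * 110 / 100 )) -le "$avail_mib" ]
}

# Hookscript entry point: $1 = vmid, $2 = phase.
if [ "$2" = "pre-start" ]; then
    mem_fits "/etc/pve/qemu-server/$1.conf" || {
        echo "refusing to start VM $1: not enough free memory" >&2
        exit 1   # non-zero exit in pre-start aborts the VM start
    }
fi
```

This only guards starts on the current node, of course; it doesn't make Proxmox pick a better node.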
Config of one of the affected VMs:

agent: 1
bootdisk: scsi0
cores: 8
ide0: none,media=cdrom
memory: 10240
name: ofswebperfdb
net0: virtio=7A4:7E:42:2C:BD,bridge=vmbr0,tag=341,trunks=302
numa: 0
onboot: 1
ostype: l26
scsi0: oceph1:vm-142-disk-0,backup=0,discard=on,size=80G,ssd=1
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=31490500-68ae-444a-989f-d9f9e4505301
sockets: 1
vcpus: 8
vmgenid: 87c722a6-266d-4000-8b24-be1d536c1e06
We are currently using Proxmox v6.2-1; full package versions (pveversion -v output):
proxmox-ve: 6.2-1 (running kernel: 5.4.48-ls-2)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.4.44-2-pve: 5.4.44-2
ceph-fuse: 14.2.9-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-5
libpve-guest-common-perl: 3.1-1
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-3
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-9
pve-cluster: 6.1-8
pve-container: 3.1-12
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.1-1
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-13
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-11
smartmontools: 6.6-1
spiceterm: 3.1-1
vncterm: 1.6-1
I honestly thought Proxmox kept track of the minimum memory allocation for each VM, so I expected it to fail to start a VM when there is not enough memory.
I would also really like to define some amount of memory reserved for the hypervisor OS, which Proxmox would then try not to use for VMs. I know it's not possible in all cases, especially if you are using memory ballooning, but I would like to reserve about 10-15% of memory for KVM overhead and OS breathing space.
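I know of no built-in knob for such a reservation, but as a rough sketch it can at least be monitored: sum the configured memory of all VM configs on the node and compare it against 85% of physical RAM. The 85% figure is just my 10-15% idea expressed as a budget, and the directory argument is parameterized so it can be pointed at /etc/pve/qemu-server:

```shell
# Sketch: warn when the sum of all configured VM memory on a node exceeds
# 85% of physical RAM (keeping ~15% for the host OS and KVM overhead).

# Sum of "memory: <MiB>" lines across all VM configs in a directory.
sum_vm_memory_mib() {
    grep -h '^memory:' "$1"/*.conf 2>/dev/null | awk '{s += $2} END {print s + 0}'
}

# 85% of MemTotal, in MiB -- the headroom fraction is a guess to tune.
host_budget_mib() {
    awk '/^MemTotal:/ {print int($2 / 1024 * 85 / 100)}' /proc/meminfo
}
```

Run from cron or a monitoring check, e.g. `[ "$(sum_vm_memory_mib /etc/pve/qemu-server)" -le "$(host_budget_mib)" ] || echo "node overcommitted"`.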
Now I would really appreciate any advice on what I can do to make the cluster more foolproof. Generally, I want users not to have to bother with cluster usage, VM placement, etc.
My current thoughts go towards:
- significantly lowering vm.swappiness (we have the default 60 right now)
- experimenting with vm.overcommit_memory and vm.overcommit_ratio
- turning off swapping for the corosync, pve-cluster and openvswitch-nonetwork systemd units
- shrinking swap significantly (maybe to 512M or so) or turning it off completely
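For the sysctl side of that list, a sketch of what I have in mind (the file name and values are starting points to experiment with, not recommendations):

```
# /etc/sysctl.d/99-hypervisor.conf -- values to experiment with, not gospel
vm.swappiness = 1            # strongly prefer reclaiming page cache over swapping
vm.overcommit_memory = 2     # refuse allocations beyond the commit limit
vm.overcommit_ratio = 90     # commit limit = swap + 90% of RAM
```

For the per-service part, a systemd drop-in with `MemorySwapMax=0` (e.g. in /etc/systemd/system/corosync.service.d/override.conf) should keep that unit out of swap, though it needs the unified cgroup hierarchy. And vm.overcommit_memory = 2 in particular needs care: it can make large VM starts fail outright with an allocation error, which in this scenario might actually be the desired behavior.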
Thanks in advance for all the time you spent reading and thinking about this long post. Here's an emoticon for your good mood :)