[SOLVED] PVE6 - CPU/MEM hotplug

bjsko

Active Member
Sep 25, 2019
Hi,

We have two PVE 6.0-x clusters with three nodes each. The nodes are Dell R640 servers. Both clusters are fairly new and we are in the process of installing our first VMs. The VMs are running SLES 15 SP1.

As we would like to have the option of adding vCPU/RAM on the fly, we activated hotplugging on the VMs by ticking the relevant boxes under Options → Hotplug.
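For reference, the same settings can also be applied from the node's CLI with qm; the sketch below assumes a hypothetical VMID 100 and the maximum topology we happen to use:
Code:
# enable the hotplug categories on the VM (hypothetical VMID 100)
qm set 100 --hotplug disk,network,usb,memory,cpu
# memory and CPU hotplug also require NUMA to be enabled on the VM
qm set 100 --numa 1
# define the maximum topology, then adjust the live vCPU count within that limit
qm set 100 --sockets 2 --cores 8
qm set 100 --vcpus 4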

I have looked at https://pve.proxmox.com/wiki/Hotplug_(qemu_disk,nic,cpu,memory)

The VMs contain a default udev rule that ships with the SLES installation (/usr/lib/udev/rules.d/80-hotplug-cpu-mem.rules):
Code:
# do not edit this file, it will be overwritten on update

#
# Hotplug physical CPU
#
SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}="1"

#
# Hotplug physical memory. Instances of tmpfs are remounted so their
# size are recalculated. This might be needed if some sizes were
# specified relative to the total amount of memory (boo#869603). For
# now make it simple and remount all tmpfs regardless of how their
# size are specified. It should be handled by the kernel as it has a
# lot of shortcomings anyways (tmpfs mounted by other processes, mount
# namespaces, ...)
#
SUBSYSTEM=="memory", ACTION=="add", PROGRAM=="/usr/bin/uname -m", RESULT!="s390x", ATTR{state}=="offline", \
  ATTR{state}="online", \
  RUN+="/bin/sh -c ' \
    while read src dst fs opts unused; do \
      case $fs in \
      tmpfs)  mount -o remount \"$dst\" ;; \
      esac \
    done </proc/self/mounts'"

Code:
memhp_default_state=online
has been added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub on the VMs.
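For completeness, the change on the guests looks roughly like this (the other kernel parameters are just whatever the SLES installer put there, elided here):
Code:
# /etc/default/grub on the guest
GRUB_CMDLINE_LINUX_DEFAULT="... memhp_default_state=online"

# regenerate the GRUB configuration so the parameter takes effect on the next boot
grub2-mkconfig -o /boot/grub2/grub.cfg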

SLES 15 SP1 runs a 4.12.x kernel, so in theory memhp_default_state=online alone should be enough; I have not yet tested removing the default udev rule.

Anyway, hotplugging itself seems to be working just fine.
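By "working" I mean that added vCPUs and memory blocks actually come online inside the guest. Roughly what we check after adding resources in the Proxmox GUI (just a sketch):
Code:
# inside the SLES guest, after adding vCPUs/RAM in Proxmox
lscpu | grep -i '^CPU(s):'                                      # number of online CPUs
cat /sys/devices/system/memory/memory*/state | sort | uniq -c   # all blocks should report "online"
free -g                                                         # total memory should reflect the new size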


However, the VMs behave inconsistently badly. We have seen at least three different failure scenarios:
  • The VM will not boot at all (typically if it is fairly big memory/CPU-wise: more than 10 vCPUs and/or above 80-100 GB RAM, though sometimes a big VM boots just fine)
  • The VM will freeze completely with no errors/dumps whatsoever after a while
  • The VM will crash with kernel dumps in the console after a while
"A while" is normally a few hours.

If we remove hotplugging of RAM/CPU from the VM in Proxmox (leaving just the default "Disk, Network, USB" options), it behaves perfectly fine.
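In other words, reverting to the default hotplug set makes the problem go away (again a sketch with a hypothetical VMID 100):
Code:
# leave only the default hotplug categories enabled
qm set 100 --hotplug disk,network,usb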

We have seen the same behaviour on several 6.0-x versions, currently on 6.0-7.

Before posting stack traces and other logs, I'm just curious whether anyone else has observed behaviour like this. What additional info would be needed to troubleshoot the issue further?

Any help or insight would be highly appreciated.

BR
Bjørn
 
Updating my own (long-forgotten) question here in case anyone else runs into this... We worked with SUSE Support for quite a while and this was solved in a SUSE kernel patch. SLES 15 SP1 with kernel 4.12.14-197.61 and above hotplugs memory as it should.
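To check whether a guest already runs a fixed kernel (a quick sketch; the flavour suffix may differ on your installation):
Code:
# inside the SLES guest
uname -r    # should report 4.12.14-197.61-default or newer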
 