[SOLVED] PVE6 - CPU/MEM hotplug

bjsko

Active Member
Sep 25, 2019
Hi,

We have two PVE 6.0-x clusters with three nodes each. The nodes are Dell R640 servers. Both clusters are fairly new and we are in the process of installing our first VMs. The VMs are running SLES 15 SP1.

As we would like to have the option of adding vCPU/RAM on the fly, we activated hotplugging on the VMs by ticking the relevant boxes under Options → Hotplug.
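For reference, the same settings can also be applied from the node's CLI with qm; the sketch below assumes a hypothetical VMID 100 and the maximum topology we happen to use:
Code:
# enable the hotplug categories on the VM (hypothetical VMID 100)
qm set 100 --hotplug disk,network,usb,memory,cpu
# memory and CPU hotplug also require NUMA to be enabled on the VM
qm set 100 --numa 1
# define the maximum topology, then adjust the live vCPU count within that limit
qm set 100 --sockets 2 --cores 8
qm set 100 --vcpus 4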

I have looked at https://pve.proxmox.com/wiki/Hotplug_(qemu_disk,nic,cpu,memory)

The VMs contain a default udev rule that ships with the SLES installation (/usr/lib/udev/rules.d/80-hotplug-cpu-mem.rules):
Code:
# do not edit this file, it will be overwritten on update

#
# Hotplug physical CPU
#
SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}="1"

#
# Hotplug physical memory. Instances of tmpfs are remounted so their
# size are recalculated. This might be needed if some sizes were
# specified relative to the total amount of memory (boo#869603). For
# now make it simple and remount all tmpfs regardless of how their
# size are specified. It should be handled by the kernel as it has a
# lot of shortcomings anyways (tmpfs mounted by other processes, mount
# namespaces, ...)
#
SUBSYSTEM=="memory", ACTION=="add", PROGRAM=="/usr/bin/uname -m", RESULT!="s390x", ATTR{state}=="offline", \
  ATTR{state}="online", \
  RUN+="/bin/sh -c ' \
    while read src dst fs opts unused; do \
      case $fs in \
      tmpfs)  mount -o remount \"$dst\" ;; \
      esac \
    done </proc/self/mounts'"

Code:
memhp_default_state=online
has been added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub on the VMs.
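For completeness, the change on the guests looks roughly like this (the other kernel parameters are just whatever the SLES installer put there, elided here):
Code:
# /etc/default/grub on the guest
GRUB_CMDLINE_LINUX_DEFAULT="... memhp_default_state=online"

# regenerate the GRUB configuration so the parameter takes effect on the next boot
grub2-mkconfig -o /boot/grub2/grub.cfg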

SLES 15 SP1 runs a 4.12.x kernel, so in theory memhp_default_state=online alone should be enough; I have not yet tested removing the default udev rule.

Anyway, hotplugging itself seems to be working just fine.
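By "working" I mean that added vCPUs and memory blocks actually come online inside the guest. Roughly what we check after adding resources in the Proxmox GUI (just a sketch):
Code:
# inside the SLES guest, after adding vCPUs/RAM in Proxmox
lscpu | grep -i '^CPU(s):'                                      # number of online CPUs
cat /sys/devices/system/memory/memory*/state | sort | uniq -c   # all blocks should report "online"
free -g                                                         # total memory should reflect the new size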


However, the VMs behave inconsistently badly. We have seen at least three different failure scenarios:
  • The VM will not boot at all (typically if it is fairly big memory/CPU-wise: more than 10 vCPUs and/or above 80-100 GB RAM, though sometimes a big VM boots just fine)
  • The VM will freeze completely with no errors/dumps whatsoever after a while
  • The VM will crash with kernel dumps in the console after a while
"A while" is normally a few hours.

If we remove hotplugging of RAM/CPU from the VM in Proxmox (leaving just the default "Disk, Network, USB" options), it behaves perfectly fine.
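In other words, reverting to the default hotplug set makes the problem go away (again a sketch with a hypothetical VMID 100):
Code:
# leave only the default hotplug categories enabled
qm set 100 --hotplug disk,network,usb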

We have seen the same behaviour on several 6.0-x versions, currently on 6.0-7.

Before posting stack traces and other logs, I'm just curious whether anyone else has observed behaviour like this. What additional info would be needed to troubleshoot the issue further?

Any help or insight would be highly appreciated.

BR
Bjørn
 
Updating my own (long-forgotten) question here in case anyone else runs into this... We worked with SUSE Support for quite a while and this was solved in a SUSE kernel patch. SLES 15 SP1 with kernel 4.12.14-197.61 and above hotplugs memory as it should.
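To check whether a guest already runs a fixed kernel (a quick sketch; the flavour suffix may differ on your installation):
Code:
# inside the SLES guest
uname -r    # should report 4.12.14-197.61-default or newer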
 