[SOLVED] PVE 4 CPU Cores

adamb · Oct 22, 2015

Hey all. I just did a 3.4 -> 4.0 upgrade. The upgrade itself went really well. However I am running into an issue I can't seem to pin down.

We are a bit odd as we run one large VM within a 3 node HA stack. I am finding that when I get above 7-10 cores I am running into a very odd issue. While the VM is booting the KVM process on the host starts eating CPU cycles. So bad that sometimes it causes an NMI on the host. This happens while the guest is booting and setting up the kvm-clock per CPU.

The guest says the following
Total of 46 processors activated (211355.09 BogoMIPS)

The KVM process on the host is currently running at 4000+ %CPU on top. Sometimes it will manage to get the VM up and sometimes it will cause the host to NMI. Once the guest is up, load drops and all is well.

The hardware is HP DL380 Gen9's. The guest OS is CentOS 6.6 with 320G of ram and 46 vCPU's. Virtio drivers.

I already did a complete shell swap to ensure it wasn't a hardware issue. I can reproduce the issue on multiple nodes.

adamb · Oct 22, 2015

I have also found that even when shutting the VM down, the KVM process on the host skyrockets to 4000+% for a extended period of time.

adamb · Oct 22, 2015

Interesting that ksmd is running and I have it set to not start within "/etc/default/ksmtuned"

This process also pegs out at 100% cpu while the guest is booting or shutting down.

adamb · Oct 22, 2015

Here is a screen shot of the host after a panic. Not sure if it helps at all.

adamb · Oct 22, 2015

If I drop the RAM given to the VM down to 10G, the issue is completely resolved. If I go above 30G the issue starts to happen again. This is a pretty big show stopper for us. I am hoping someone has some input.

sigxcpu · Oct 22, 2015

Are you sure it is a hard limit at 30GB? Like, it does it at 31 but not at 30? Or it simply gets worser and worser when increasing the RAM size?
I think allocating and freeing 320GB of RAM (if qemu/KVM or the guest initializes it) is not cheap in any scenario.
I've seen that KVM indeed pegs all the allocated cores between VM shutdown and process exit. Now, if at that moment it frees those 320GB it will stay there for a long time. I don't know the exact reason, but my feeling is that it is a bug. I see no reason to use ALL the assigned cores at the end. It looks like it is polling for something in a busy loop.

adamb · Oct 22, 2015

sigxcpu said:
Are you sure it is a hard limit at 30GB? Like, it does it at 31 but not at 30? Or it simply gets worser and worser when increasing the RAM size?
I think allocating and freeing 320GB of RAM (if qemu/KVM or the guest initializes it) is not cheap in any scenario.
I've seen that KVM indeed pegs all the allocated cores between VM shutdown and process exit. Now, if at that moment it frees those 320GB it will stay there for a long time. I don't know the exact reason, but my feeling is that it is a bug. I see no reason to use ALL the assigned cores at the end. It looks like it is polling for something in a busy loop.

It gets worse as you add more ram to the VM. We have been running a configuration like this since PVE 2.x without any issues. I agree this has to be some type of bug.

Here is the version I am running on all of my nodes.

root@testprox1:~# pveversion -v
proxmox-ve: 4.0-16 (running kernel: 4.2.2-1-pve)
pve-manager: 4.0-50 (running version: 4.0-50/d3a6b7e5)
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-23
qemu-server: 4.0-31
pve-firmware: 1.1-7
libpve-common-perl: 4.0-32
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-27
pve-libspice-server1: 0.12.5-1
vncterm: 1.2-1
pve-qemu-kvm: 2.4-10
pve-container: 1.0-10
pve-firewall: 2.0-12
pve-ha-manager: 1.0-10
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.3-1
lxcfs: 0.9-pve2
cgmanager: 0.37-pve2
criu: 1.6.0-1
zfsutils: 0.6.5-pve4~jessie

udo · Oct 22, 2015

adamb said:
It gets worse as you add more ram to the VM. We have been running a configuration like this since PVE 2.x without any issues. I agree this has to be some type of bug.

Here is the version I am running on all of my nodes.

root@testprox1:~# pveversion -v
proxmox-ve: 4.0-16 (running kernel: 4.2.2-1-pve)
pve-manager: 4.0-50 (running version: 4.0-50/d3a6b7e5)
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-23
qemu-server: 4.0-31
pve-firmware: 1.1-7
libpve-common-perl: 4.0-32
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-27
pve-libspice-server1: 0.12.5-1
vncterm: 1.2-1
pve-qemu-kvm: 2.4-10
pve-container: 1.0-10
pve-firewall: 2.0-12
pve-ha-manager: 1.0-10
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.3-1
lxcfs: 0.9-pve2
cgmanager: 0.37-pve2
criu: 1.6.0-1
zfsutils: 0.6.5-pve4~jessie

Hi,
how many CPUs (and cores) has your host? How many RAM?

Do you have NUMA enabled in the VM??
Would your mem of the VM fit for NUMA?

Udo

adamb · Oct 22, 2015

udo said:
Hi,
how many CPUs (and cores) has your host? How many RAM?

Do you have NUMA enabled in the VM??
Would your mem of the VM fit for NUMA?

Udo

The host has 48 cores in total including hyper-threading. With 384G of ram. I did enable NUMA by checking the box but it didn't seem to make a difference.

I am not quite sure what you mean by "Would your mem of the VM fit for NUMA?". I have 192G per CPU for a total of 384G (12x 32G DIMM's). Seeing that the issue starts with even 30G, im well within NUMA of one CPU if I understand the concept correctly. Correct me if I am way out of the ball park on that one.

I appreciate the input!

udo · Oct 22, 2015

adamb said:
The host has 48 cores in total including hyper-threading. With 384G of ram. I did enable NUMA by checking the box but it didn't seem to make a difference.

I am not quite sure what you mean by "Would your mem of the VM fit for NUMA?".

Hi,
if you select an mem settings, which is not numa-aware you get an error message if you select NUMA - this is not the case I assume.

BTW. I would select NUMA because you have more than one CPU.

I have 192G per CPU for a total of 384G (12x 32G DIMM's). Seeing that the issue starts with even 30G, im well within NUMA of one CPU if I understand the concept correctly. Correct me if I am way out of the ball park on that one.

I appreciate the input!

Hmm,
have you tried if the effect is the same with cpu=host?

Udo

adamb · Oct 22, 2015

udo said:
Hi,
if you select an mem settings, which is not numa-aware you get an error message if you select NUMA - this is not the case I assume.

BTW. I would select NUMA because you have more than one CPU.

Hmm,
have you tried if the effect is the same with cpu=host?

Udo

Whether NUMA is enabled or not it doesn't make a difference. I don't get any error messages. The same issue occurs with the CPU load skyrocketing. I also tried enabling transparent huge pages and disabling ksmtuned.

Appreciate the input on using NUMA, I wasn't quite clear if I should.

We have always used cpu=host, I did try the default to but same issue.

adamb · Oct 23, 2015

Proxmox dev's can you please provide input. We would really like to move forward with being able to sell proxmox4 to our clients as proxmox3 is running a bit short on life span.

The load issue persists whether I have 20G given to a VM or 300G. However, the more ram the longer the high load lasts. If I could prevent it from completing eating up all CPU cycle to prevent a NMI that would be great.

There are two places which the load skyrockets. 1 being as soon as you hit start on the guest. The console won't even display and says "Guest has not initialized the display yet", the host load pegs all cores during this. Sometimes it will get past this part and sometimes it will cause an NMI. The next issue is while the VM is booting and is doing the the "kvm-clock: cpu 1" process for each CPU. It also pegs out all available cores for a long period of time. Sometimes the guest will make it past this and sometimes the host will NMI.

sigxcpu · Oct 23, 2015

Try this: http://www.linux-kvm.com/content/get-performance-boost-backing-your-kvm-guest-hugetlbfs

adamb · Oct 23, 2015

sigxcpu said:
Try this: http://www.linux-kvm.com/content/get-performance-boost-backing-your-kvm-guest-hugetlbfs

I have tried the following without any luck, but this is just transparent huge pages.

echo always > /sys/kernel/mm/transparent_hugepage/enabled

I can then see when I start the VM it starts to use the transparent huge pages.

root@testprox1:~# cat /proc/meminfo | grep HugePages
AnonHugePages: 7723008 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0

I have seen those guides, but it has never been clear how to add it to a proxmox VM as the config file format is different.

sigxcpu · Oct 23, 2015

you add in the <VMID>.conf

Code:

args: -mem-path /hugepages

adamb · Oct 23, 2015

sigxcpu said:
you add in the <VMID>.conf

Code:

args: -mem-path /hugepages

Thanks! Trying now.

sigxcpu · Oct 23, 2015

It is best to add vm.hugepages=something into your /etc/sysctl.conf and reboot the machine as you will not be able to reserve them if you already fiddled with stuff.

It looks like a page is 2MB, so for 320GB you will want to allocate 163840 of them.

adamb · Oct 23, 2015

sigxcpu said:
It is best to add vm.hugepages=something into your /etc/sysctl.conf and reboot the machine as you will not be able to reserve them if you already fiddled with stuff.

It looks like a page is 2MB, so for 320GB you will want to allocate 163840 of them.

Not having much luck. I was able to reserve the pages no issues within sysctl.conf.

root@testprox1:~# cat /proc/meminfo | grep HugePages
AnonHugePages: 0 kB
HugePages_Total: 163840
HugePages_Free: 163840
HugePages_Rsvd: 0
HugePages_Surp: 0

I added the mount point to /etc/fstab, but it doesn't seem to take effect.

Code:

root@testprox1:~# cat /etc/fstab 
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/pve/root / ext3 errors=remount-ro 0 1
/dev/pve/data /var/lib/vz ext3 defaults 0 1
UUID=6C3D-6237 /boot/efi vfat defaults 0 1
/dev/pve/swap none swap sw 0 0
proc /proc proc defaults 0 0
hugetlbfs       /hugepages  hugetlbfs       mode=1770,gid=2021        0 0

Code:

root@testprox1:~# mount -a
root@testprox1:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                   10M     0   10M   0% /dev
tmpfs                  76G  9.7M   76G   1% /run
/dev/dm-0              69G  1.8G   64G   3% /
tmpfs                 189G   63M  189G   1% /dev/shm
tmpfs                 5.0M     0  5.0M   0% /run/lock
tmpfs                 189G     0  189G   0% /sys/fs/cgroup
/dev/sda2             126M  144K  125M   1% /boot/efi
/dev/mapper/pve-data  157G   14G  143G   9% /var/lib/vz
/dev/fuse              30M   24K   30M   1% /etc/pve
cgmfs                 100K     0  100K   0% /run/cgmanager/fs
tmpfs                  38G     0   38G   0% /run/user/0

Code:

Running as unit 100.scope.
kvm: -object memory-backend-ram,size=330000M,id=ram-node0: cannot set up guest memory 'ram-node0': Cannot allocate memory
TASK ERROR: start failed: command '/usr/bin/systemd-run --scope --slice qemu --unit 100 -p 'CPUShares=1000' /usr/bin/kvm -id 100 -chardev 'socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -vnc unix:/var/run/qemu-server/100.vnc,x509,password -pidfile /var/run/qemu-server/100.pid -daemonize -smbios 'type=1,uuid=8680e698-5127-49d8-b204-fff6d6216d8e' -name Medent -smp '46,sockets=1,cores=46,maxcpus=46' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000' -vga cirrus -cpu host,+kvm_pv_unhalt,+kvm_pv_eoi,-kvm_steal_time -m 330000 -object 'memory-backend-ram,size=330000M,id=ram-node0' -numa 'node,nodeid=0,cpus=0-45,memdev=ram-node0' -k en-us -mem-path /hugepages -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:d8c0c737df7e' -drive 'file=/dev/OS/vm-100-disk-1,if=none,id=drive-virtio0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100' -drive 'file=/dev/disk1/vm-100-disk-1,if=none,id=drive-virtio1,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'virtio-blk-pci,drive=drive-virtio1,id=virtio1,bus=pci.0,addr=0xb' -drive 'file=/dev/disk2/vm-100-disk-1,if=none,id=drive-virtio2,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'virtio-blk-pci,drive=drive-virtio2,id=virtio2,bus=pci.0,addr=0xc' -netdev 'type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=86:83:0D:41:5C:CE,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300'' failed: exit code 1

sigxcpu · Oct 23, 2015

Where did

Code:

mode=1770,gid=2021

come from?

adamb · Oct 23, 2015

sigxcpu said:
Where did

Code:

mode=1770,gid=2021

come from?

From the debain wiki

https://wiki.debian.org/Hugepages

But I also tried the way listed in the link you provided.

[SOLVED] PVE 4 CPU Cores

Famous Member

Famous Member

Famous Member

Famous Member

Famous Member

Well-Known Member

Famous Member

Distinguished Member

Famous Member

Distinguished Member

Famous Member

Famous Member

Well-Known Member

Famous Member

Well-Known Member

Famous Member

Well-Known Member

Famous Member

Well-Known Member

Famous Member