Guest Applications being Killed + Undocumented Balloon setting

Dec 19, 2016
Two issues experienced with LVM guests on Proxmox 4.4-1 since updating using the Subscription Repository:

1.) On some of our Ubuntu 14.04 LTS guests we have found processes being stopped without warning and without any information being logged by either the host or the guest.

We checked the syslog on all of the hosts and found no indication of the host stopping any processes. We also checked the syslogs on the guests and found no mention of the kernel OOM killer terminating the processes.

Is this perhaps a result of the QEMU updates?

2.) Since the update, there is a "balloon" option for all KVM virtual machines that was not present before, and it was enabled for all VMs. This option does not seem to be documented anywhere, nor does it appear in the roadmap or release notes.

Could someone please indicate what this new tickbox does?

pveversion:
proxmox-ve: 4.4-76 (running kernel: 4.4.35-1-pve)
pve-manager: 4.4-1 (running version: 4.4-1/eb2d6f1e)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.35-1-pve: 4.4.35-76
pve-kernel-4.2.6-1-pve: 4.2.6-36
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-101
pve-firmware: 1.1-10
libpve-common-perl: 4.0-83
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-70
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-9
pve-container: 1.0-88
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-2
lxcfs: 2.0.5-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80


Regards,
Joe
 

Attachments

  • proxmoxbalooningram.png (19.3 KB): screenshot of the Ballooning RAM setting
1.) On some of our Ubuntu 14.04 LTS guests we have found processes being stopped without warning and without any information being logged by either the host or the guest.
Are these containers or QEMU/KVM guests?

2.) Since the update, there is a "balloon" option for all KVM virtual machines that was not present before, and it was enabled for all VMs. This option does not seem to be documented anywhere, nor does it appear in the roadmap or release notes.
Ballooning is a QEMU mechanism that lets memory which is not used by the guest go back to the host.
It has been enabled by default for quite some time.

for details see
http://pve.proxmox.com/wiki/Dynamic_Memory_Management#Ballooning
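
If you want to control this per VM, the balloon device can also be disabled or given a lower target from the CLI; for example (VM ID 100 below is just a placeholder, and disabling the device only takes effect after the VM has been powered off and started again):

qm set 100 -balloon 0                          # disable the balloon device for this VM
qm set 100 -balloon 1024                       # set the balloon target to 1024 MB (memory can be reclaimed down to this)
grep balloon /etc/pve/qemu-server/100.conf     # check the current setting in the VM config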
 
Updated the original post with a screenshot of the ballooning RAM setting.

As previously said, this just enables the VM to hand back memory it does not require; that memory can then be used by other VMs or by the host's page cache. What could be happening is that host processes (or the page cache) take up all of that freed memory, and when the VM then demands more memory, the host cannot reclaim it before the VM times out the request.

This may then cause a kernel panic within the VM, something that is hard to find in logs at their standard levels. Is your host under a fair amount of RAM pressure? For example, during general running of the VMs, what does free -m output?
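
A rough way to check this on the host side would be something like the following (generic commands as a sketch, not Proxmox-specific tooling; <vmid> is a placeholder):

free -m                                  # overall host memory and page cache usage
cat /sys/kernel/mm/ksm/pages_sharing     # how many pages KSM is currently sharing
qm monitor <vmid>                        # then type "info balloon" to see the balloon size QEMU reports for that VM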
 

Thanks for the information. As far as we can see, this feature was not present on the host with the previous version we were running. We have disabled the balloon memory feature until the dust settles / we find the root cause of the applications being killed.

Below is the free -m output from one of the problematic QEMU guests:

userx@Ubuntuguestxyz:/home/mysql/database# free -m
             total       used       free     shared    buffers     cached
Mem:          5969       5720        248          0        216       2090
-/+ buffers/cache:       3414       2554
Swap:         3813         13       3800
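
(Reading that output: the 5720 MB shown as used includes 216 MB of buffers and 2090 MB of cache, leaving roughly 3414 MB actually used by processes, as the -/+ buffers/cache line shows, so the guest does not appear to be under memory pressure at that moment.)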
 
Then the error probably lies inside the guest; look there for clues.

The only indication of errors found on the affected QEMU KVMs is that the applications are being terminated like this: mysqld got signal 11
Another application:
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x00007f9bf0138080

The above is the only information being logged that is of any help.

There is nothing in the kernel.log that correlates with the above-mentioned applications being killed.

Are there any other specific logs that I can check to confirm?
 
Thanks for the information. As far as we can see, this feature was not present on the host with the previous version we were running. We have disabled the balloon memory feature until the dust settles / we find the root cause of the applications being killed.
Ballooning has been enabled for quite some time; we simply did not expose it in the GUI.

The only indication of errors found on the affected QEMU KVMs is that the applications are being terminated like this: mysqld got signal 11
Another application:
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x00007f9bf0138080
This is a faulty memory access; it can be caused by a faulty program, faulty hardware, or something similar.
See for example: https://en.wikipedia.org/wiki/Segmentation_fault
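
To dig further inside the guest, you could for example check whether the guest kernel recorded the faults and enable core dumps for the crashing processes (generic commands, shown only as a sketch):

dmesg -T | grep -i segfault                           # kernel-side record of user-space segfaults
grep -i segfault /var/log/kern.log /var/log/syslog    # the same, from the persisted logs
ulimit -c unlimited                                   # allow core dumps in the current shell
cat /proc/sys/kernel/core_pattern                     # where a core file would be written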
 

We are busy planning to take the affected node out of the cluster to perform a memory test.

What is unusual, though, is that we have had the same problem on another host in the cluster since we did the updates last week.

Please let me know if there is anything else I can check and confirm, or if there is any further information I can provide.
 
Update:

We have performed a RAM test on the affected host; this did not reveal any problems with the RAM.
Furthermore, we have still had the segmentation fault issue occur on a handful of KVMs.
We are now busy scheduling a test on our RAID array to confirm that all is healthy.

Could you guys please advise as to what can be investigated next?
 

Further Update:

We have completed a RAID array health check; this did not reveal any problems.
Please note that we have now confirmed that both the RAM and the HDD setup are healthy.
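
For reference, disk health checks of this kind can be run with smartmontools, which is already installed on the host according to the pveversion output above; the device name below is only an example:

smartctl -H /dev/sda    # overall SMART health verdict for one physical disk
smartctl -a /dev/sda    # full SMART attributes and error log for that disk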

Please advise as to what we should do next.

Have there been any hotfixes since the 16th of December 2016 that could resolve this issue?

Furthermore, what do we need to do to get assistance from Proxmox?
 
We have also been experiencing "segfaults" on Debian 7 and 8 guests (QEMU/KVM) for the last couple of months, but managed to get the situation stable after stopping the "ksmtuned" service on both hosts. After 6 weeks without incident, we updated to Proxmox 4.4-5, restarted the hosts, forgot to stop the "ksmtuned" service, and the problem almost immediately resurfaced. I have now stopped the "ksmtuned" service on both hosts again and rebooted all the guests. The only message on the hosts that seems to correlate is "kvm: zapping shadow pages for mmio generation wraparound".

The same servers were used in our Ceph cluster running the same VMs, before migrating the storage to a SAS enclosure and upgrading to Proxmox 4, so I doubt it is hardware related...
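
For anyone wanting to try the same, stopping KSM on a Proxmox host and confirming that page sharing actually winds down looks roughly like this (service name as shipped with ksm-control-daemon; adjust if yours differs):

systemctl stop ksmtuned                  # stop the KSM tuning daemon
systemctl disable ksmtuned               # keep it from starting again at boot
echo 2 > /sys/kernel/mm/ksm/run          # tell KSM to stop and unmerge all currently shared pages
cat /sys/kernel/mm/ksm/pages_sharing     # should drop towards 0 once unmerging finishes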
 
Hi Jacques,
I will be looking into the ksmtuned service asap, do some testing, and revert back.


 
Unfortunately the problem persists. After 48 hours we had segfaults on two guests, with no error being logged on the host they are running on. I should also mention that we are running Open vSwitch on these hosts, but I doubt that has any relevance.

 
Hi all, it would seem that even after disabling ksmtuned in our cluster, we have had another VM application thread die without warning.

We are currently on version Virtual Environment 4.4-12/e71b7a74.

Does anyone have any suggestions that I have not tested?
 
