Guest Applications being Killed + Undocumented Balloon setting

Dec 19, 2016
Two issues experienced with LVM guests on Proxmox 4.4-1 since updating via the subscription repository.

1.) On some of our Ubuntu 14.04 LTS guests we have been finding processes stopped without warning and without any information being logged on either the host or the guest.

We checked the syslog on all of the hosts and found no indication of the host stopping any processes. We also checked the syslogs on the guests and found no mention of the kernel OOM killer terminating the processes.

Is this perhaps a result of the QEMU updates?

2.) Since the update there is a "balloon" option for all KVM virtual machines that was not present before, and it was enabled for all VMs. This option does not seem to be documented anywhere, nor does it appear in the roadmap or release notes.

Could someone please indicate what this new tickbox does?

pveversion:
proxmox-ve: 4.4-76 (running kernel: 4.4.35-1-pve)
pve-manager: 4.4-1 (running version: 4.4-1/eb2d6f1e)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.35-1-pve: 4.4.35-76
pve-kernel-4.2.6-1-pve: 4.2.6-36
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-101
pve-firmware: 1.1-10
libpve-common-perl: 4.0-83
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-70
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-9
pve-container: 1.0-88
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-2
lxcfs: 2.0.5-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80


Regards,
Joe
 

Attachment: proxmoxbalooningram.png
1.) On some of our Ubuntu 14.04 LTS guests we have been finding processes stopped without warning and without any information being logged on either the host or the guest.
Are these containers or QEMU/KVM guests?

2.) Since the update there is a "balloon" option for all KVM virtual machines that was not present before, and it was enabled for all VMs. This option does not seem to be documented anywhere, nor does it appear in the roadmap or release notes.
Ballooning is a QEMU mechanism that allows memory not used by the guest to be handed back to the host.
It has been enabled by default for quite some time.

For details, see:
http://pve.proxmox.com/wiki/Dynamic_Memory_Management#Ballooning
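
For reference, the same setting can also be checked or changed per VM from the CLI; something along these lines should work (VMID 100 is only an example):

# show the current memory/balloon settings of a VM (100 is an example VMID)
qm config 100 | grep -Ei 'memory|balloon'

# set a minimum balloon target of 1024 MB for that VM
qm set 100 --balloon 1024

# disable ballooning for that VM entirely
qm set 100 --balloon 0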
 
Updated the original post with a screenshot of the ballooning RAM setting.

As previously said, this just enables the VM to hand back memory it does not require; that memory can then be used by other VMs or by the host page cache. What could be happening is that host processes take up all of this available memory, or it gets used as page cache, and when the VM then demands more memory the host cannot free it before the VM times out the request.

This may then cause a kernel panic within the VM, something that is hard to find in the logs at their standard levels. Is your host under a fair amount of RAM pressure? For example, during normal running of the VMs, what does free -m output?
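
For example, a quick picture of host memory pressure and KSM activity can be had with something like this, run on the host (the sysfs path is the standard KSM one):

# overall host memory usage, including buffers/cache
free -m

# number of guest pages KSM is currently deduplicating (0 means KSM is idle)
cat /sys/kernel/mm/ksm/pages_sharing

# resident memory of the running KVM guest processes, in KB
ps -o pid,rss,comm -C kvm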
 
Thanks for the information. As far as we can see, this feature was not present on the host with the previous version we were running. We have disabled the balloon memory feature until the dust settles and we find the root cause of the applications being killed.

Below is the output from one of the problematic QEMU guests:

userx@Ubuntuguestxyz:/home/mysql/database# free -m
             total       used       free     shared    buffers     cached
Mem:          5969       5720        248          0        216       2090
-/+ buffers/cache:       3414       2554
Swap:         3813         13       3800
 
Then the error probably lies inside the guest; look there for clues.

The only indication of errors found on the affected QEMU KVM guests is that applications are being terminated like this: mysqld got signal 11
Another application:
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x00007f9bf0138080

The above is the only information being logged that is of any help.

There is nothing in the kernel log that correlates with the above-mentioned applications being killed.

Are there any other specific logs that I can check to confirm?
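
For context, this is roughly how we have been searching inside the guests so far (standard Ubuntu 14.04 log locations):

# search the guest logs for segfaults, signal 11 terminations and OOM activity
grep -iE 'segfault|signal 11|oom' /var/log/syslog /var/log/kern.log

# check the kernel ring buffer in case something never made it to disk
dmesg | grep -iE 'segfault|oom|mce'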
 
Thanks for the information. As far as we can see, this feature was not present on the host with the previous version we were running. We have disabled the balloon memory feature until the dust settles and we find the root cause of the applications being killed.
Ballooning has been enabled for quite some time; we simply did not expose it in the GUI.

The only indication of errors found on the affected QEMU KVM guests is that applications are being terminated like this: mysqld got signal 11
Another application:
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x00007f9bf0138080
This is a faulty memory access; it can be caused by a faulty program, faulty hardware, or something similar.
See, for example: https://en.wikipedia.org/wiki/Segmentation_fault
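
One quick way to check the faulty-hardware angle before pulling a node for a full memory test would be to grep the host kernel log for machine-check or memory-controller errors, roughly like this:

# on the host: look for machine-check / EDAC / hardware error reports
dmesg | grep -iE 'mce|machine check|edac|hardware error'
grep -iE 'mce|edac|hardware error' /var/log/kern.log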
 
We are busy planning to take the affected node out of the cluster to perform a memory test.

What is unusual, though, is that we have had the same problem on another host in the cluster since we did the updates last week.

Please let me know if there is anything else I can check and confirm, or if there is any further information I can provide.
 
Update:

We have performed a RAM test on the affected host; this did not reveal any problems with the RAM.
Furthermore, we have still had the segmentation fault issue occur on a handful of KVMs.
We are now busy scheduling a test on our RAID array to confirm that all is healthy.
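
For reference, the disk-level part of that check will be roughly along these lines (smartmontools is already installed on the host; /dev/sda is a placeholder, and a hardware RAID controller may need its vendor tool instead):

# SMART health, attributes and error log for a physical disk (sda is a placeholder)
smartctl -a /dev/sda

# start an extended self-test; the result shows up later under "smartctl -a"
smartctl -t long /dev/sda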

Could you guys please advise as to what can be investigated next?
 
Further Update:

We have completed a RAID array health check; this did not reveal any problems.
Please note that we have confirmed that both the RAM and HDD setup are healthy.

Please advise as to what we need to do next.

Have there been any hotfixes since the 16th of December 2016 that could resolve this issue?

Furthermore, what do we need to do to get assistance from Proxmox?
 
We have also been experiencing "segfaults" on Debian 7 and 8 guests (QEMU/KVM) for the last couple of months, but managed to get the situation stable after stopping the "ksmtuned" service on both hosts. After 6 weeks without incident, we updated to Proxmox 4.4-5, restarted the hosts, forgot to stop the "ksmtuned" service and the problem almost immediately resurfaced. I have now stopped the "ksmtuned" service on both hosts again and rebooted all the guests. The only message on the hosts that seems to correlate is "kvm: zapping shadow pages for mmio generation wraparound".

The same servers were used in our Ceph cluster running the same VMs, before migrating the storage to a SAS enclosure and upgrading to Proxmox 4, so I doubt it is hardware related...
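
For anyone wanting to reproduce this workaround on a PVE 4.x host: stopping ksmtuned only stops the tuning daemon, so the sketch below also switches KSM scanning itself off via sysfs and then checks the counters (standard KSM sysfs knobs; run as root):

# stop and disable the KSM tuning daemon (ships with the ksm-control-daemon package)
systemctl stop ksmtuned
systemctl disable ksmtuned

# switch KSM scanning off; writing 2 instead of 0 would also unmerge existing pages
echo 0 > /sys/kernel/mm/ksm/run

# check the current run state and how many pages are still being shared
cat /sys/kernel/mm/ksm/run /sys/kernel/mm/ksm/pages_sharing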
 
Hi Jacques,
I will be looking into the ksmtuned service ASAP, do some testing, and report back.


 
Unfortunately the problem persists. After 48 hours we had segfaults on two guests with no error being logged on the host they are running on. I should also mention that we are running Open vSwitch on these hosts, but I doubt that has any relevance.

 
Hi all, it would seem that even after disabling ksmtuned in our cluster we have had another VM application thread die without warning.

We are currently on Proxmox Virtual Environment 4.4-12/e71b7a74.

Does anyone have any suggestions that I have not yet tested?
 
