Proxmox VE 2.2 catastrophe! - RESOLVED

Eric Renfro

Guest
I just upgraded to Proxmox 2.2 from 2.1, and now I'm getting kernel panic after kernel panic after kernel panic, all about being out of memory.

None of these VMs had issues prior to upgrading the hosts.
None of these VMs, even combined, use more than 50% of the physical memory in the hosts they run on.
Not all VMs even kernel panic. Just my web servers and my VPN/ZNC VM, for absolutely no understandable reason.

At first, I thought it was just something weird going on with my web VM, but I moved it to a 2.1 server and it ran flawlessly. I moved it back, and it panicked. To further test this theory, I upgraded another VM. Same problem. Some VMs have no problem at all; the web and VPN/ZNC VMs kernel panic with OOM.

During bootup, no less. It just literally starts killing everything until it gets to the final message:
Kernel panic - not syncing: Out of memory and no killable processes...

I've tried the pvetest repository to see if there were any immediate fixes for this... Nothing.
 
Re: Proxmox VE 2.2 catastrophe!

Furthermore...

I even tried with an Ubuntu 10.04.3 Server CD, an 11.10 Server CD, and a 12.04 Server CD, all going into rescue mode to try to recover from the problem.

It works fine at the very minimal stages, but when you go through the recovery options where it starts loading kernel modules, it suddenly goes into an OOM-kill state and panics just like everything else has. This is truly a catastrophe in this release, and I don't even know where to begin diagnosing it. Meanwhile, I have VMs down that shouldn't be because of this!
 
Re: Proxmox VE 2.2 catastrophe! -- Determination

Okay,

So, researching the problem head on, I tried changing disk types from virtio to SCSI, network devices from virtio to e1000, etc.

None of those made a difference, so I went a step further and looked at the conf files themselves in /etc/pve/qemu-server. The one thing the two VMs I was having problems with had in common was a balloon: value in their configs. Since ballooning relates to memory, a lightbulb went off in my head. I commented out the balloon line, and PRESTO... the VMs are back online. For now.
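
For reference, here is a rough sketch of the workaround (the VMID 132 is just my example; I actually edited the files by hand, so treat the sed line as illustrative only and back the configs up first):

Code:
# find which VM configs have a balloon setting
grep -l '^balloon:' /etc/pve/qemu-server/*.conf

# comment out the balloon line in one config (132 is a placeholder VMID)
sed -i 's/^balloon:/#balloon:/' /etc/pve/qemu-server/132.conf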

The problem seems to be related to ballooning: when anything starts to utilize the balloon or provide the balloon interface, it suddenly kills off all memory until the guest kernel panics. This is, of course, undesired behavior. :)

Now that the root cause has been determined... Proxmox? When will this be fixed and ready for at least testing?
 
Re: Proxmox VE 2.2 catastrophe! -- Determination

AFAIK we do not use the balloon driver.

Okay? When you go into the monitor of a VM and run balloon XXXX, as long as the guest has virtio_balloon loaded, it will set the balloon target for that VM, and Proxmox itself will store that balloon setting in its /etc/pve/qemu-server/###.conf. The problem doesn't actually start until the guest is rebooted and the distribution loads virtio_balloon, possibly because kvm was started with the balloon already instantiated (and yes, you can see the balloon settings in kvm's command-line args). Then the balloon becomes a black hole and starts sucking up all the RAM until the guest panics.

That's what I'm seeing, so I'm pretty sure Proxmox VE is using something to set, remember, and call kvm with balloon-oriented settings, because it is. I used to use the balloon all the time to start VMs with less memory than the set maximum and balloon up only as needed, dynamically. I can't do that anymore, because somewhere along the line the balloon was messed up royally by this VE 2.2 update.
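
To illustrate what I mean, roughly (the VMID 132 and the 1024 MB target are just examples from my setup, and the exact output may differ):

Code:
# on the host: open the VM's monitor and set a balloon target
qm monitor 132
qm> balloon 1024
qm> quit

# the balloon setting ends up in the VM config...
grep '^balloon' /etc/pve/qemu-server/132.conf

# ...and balloon options are visible on the running kvm process
ps -ww ax | grep '[k]vm' | grep -o 'balloon[^ ]*'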
 
Re: Proxmox VE 2.2 catastrophe! -- Determination

Can you please post the config of such a VM?

No problem, dietmar:

Code:
#Corporate Web Server
#
#eth0%3A 172.17.102.2
#eth1%3A 172.18.0.12
#
#vip1%3A 172.17.102.0
#vip2%3A x.x.x.x
balloon: 2048
boot: dc
bootdisk: virtio0
cores: 1
cpu: phenom
memory: 2048
name: cweb2
net0: virtio=22:86:CE:FC:95:29,bridge=vmbr0
net1: virtio=CE:97:C9:6A:EA:BF,bridge=vmbr1
onboot: 1
ostype: l26
sockets: 1
startup: order=1
virtio0: san:132/vm-132-disk-1.qcow2,cache=writeback,backup=no

You'll notice that in this config, memory is 2048 and balloon is 2048, so it's not actually ballooning at startup, yet it still runs into the OOM-kill panic at boot just because balloon is set. The guest OS was Ubuntu 10.04, 11.10, and 12.04, as well as Debian 6. It will even cause the OOM-kill panic on "Recovery" mode runs of Ubuntu 10.04 Server and 12.04 Server, after you run through some of the on-screen recovery steps (basically once it starts loading modules, including virtio_balloon, I'm sure).

This is how you can reproduce the issue over and over again, though.
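
If you want to check the guest-side trigger yourself, something like this from inside the guest (or a rescue shell) should do; the blacklist part is only a stopgap idea I haven't relied on myself:

Code:
# inside the guest: is the virtio balloon driver loaded?
lsmod | grep virtio_balloon

# stopgap: keep the module from loading at all (standard modprobe blacklisting)
echo 'blacklist virtio_balloon' > /etc/modprobe.d/blacklist-balloon.conf
update-initramfs -u    # rebuild the initramfs so the blacklist applies at boot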
 
Re: Proxmox VE 2.2 catastrophe! -- Determination

OK, found the bug. I just uploaded a fix to the stable repository.

Please update and test.
 
Re: Proxmox VE 2.2 catastrophe! -- Determination

Thanks Dietmar. I have updated and confirmed the balloon problem is solved. Things are back to normal in my environment, and I can once again re-allocate memory online as needed, as designed (originally this was just a test to check out ballooning, but it became something rather interesting once it all worked. <G>)

Here's an unrelated FYI:
I did notice some issues trying to create a new VM, where "Storage" was grayed out during VM creation, but that's a different issue. It may even be related to Ceph: when a Ceph store isn't available, the storage options disappear entirely. I had two Ceph storage environments set up, but one was down at the time I noticed this problem, since I'm still testing. Once I removed both Ceph storages, the option showed up properly again.
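
A rough command-line equivalent of what I did, for anyone curious (the storage name ceph1 is just a placeholder):

Code:
# list defined storages and whether they are active; the dead Ceph store showed as inactive
pvesm status

# temporarily drop the unreachable storage definition (removes only the definition, not the data)
pvesm remove ceph1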

 
Re: Proxmox VE 2.2 catastrophe! -- Determination

Also, this reminds me of a certain issue I have regarding ballooning. Right now, my agents are set up to explicitly ssh to the parent host for static VMs that don't migrate. But for the VMs that do migrate around, there is currently no 100% viable way to list all VMs from any one node and determine where a given VM is actually running.

qm list is per host, presently. But something that could show all hosts' VMs would be extremely useful, especially in cases where some agent/script needs to know which host a VM is running on so it can manipulate it appropriately, or even manipulate it from any host via the command line. This is how my balloon system works: it uses a Zabbix agent that runs on the hypervisors and in the guests. When the guest memory statistics show memory at high levels, it logs into the host via a macro setting (basically a variable in the host definition itself) and runs the appropriate qm set <vmid> -balloon <value>. This, of course, can only be done on the physical host the VM is running on, and that limitation really hinders the ability to automate such processes, especially with 4+ hypervisors.
Is something like this in the works, to allow qm list, or something similar, to show all VMs (running, not running, etc.), and/or even to connect to a VM's monitor from any node to handle live changes like that?
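
In the meantime, this is roughly the kind of cluster-wide lookup I mean, as a sketch (the node names, VMID, and balloon target are placeholders, and it assumes root ssh between the nodes):

Code:
#!/bin/bash
# find which cluster node a VM is on, then set its balloon target there
VMID=132
TARGET_MB=1024

for node in pve1 pve2 pve3 pve4; do
    if ssh "root@$node" "qm list 2>/dev/null | awk '{print \$1}' | grep -qx $VMID"; then
        echo "VM $VMID found on $node, setting balloon to ${TARGET_MB} MB"
        ssh "root@$node" "qm set $VMID -balloon $TARGET_MB"
        break
    fi
done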
 
