VMs locking up in PMVE 3.4.6

hacman

Renowned Member
Oct 11, 2013
Newcastle upon Tyne, UK
Hi all,

Something of a plea for help here, as this has been driving me mad and I can't seem to find the solution. It's probably something really silly and simple too!

So we've just set up a two-node cluster of PMVE 3.4.

We've a number of CentOS 6.7 VMs running at present, and from time to time they just randomly hang.

Symptoms are that the CPU use reported in Proxmox drops to 0.00%, and the VM will not respond via either the network or the console.

The VMs have to be stopped at the CLI with the "qm stop" command and started again, after which they run happily until the next seemingly random occurrence.
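
For reference, the recovery is just the standard qm commands at the node shell, along these lines (VMID 101 here is only an example):

Code:
qm stop 101     # force-stop the hung guest
qm start 101    # boot it back up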

To give you a rundown on our environment:

Nodes:
DL365 with 2 x Opteron 2379. 32GB RAM per node. Local SAS storage, P400i with 512MB cache. Onboard NIC plus HP NC380T.

Code:
root:~# pveversion -v
proxmox-ve-2.6.32: 3.4-156 (running kernel: 2.6.32-39-pve)
pve-manager: 3.4-6 (running version: 3.4-6/102d4547)
pve-kernel-2.6.32-39-pve: 2.6.32-156
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-17
qemu-server: 3.4-6
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-33
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-10
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Two network bridges, an Internal one and an External one. Two NICs each, via a bond.
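
For clarity, the bond/bridge layout in /etc/network/interfaces is roughly along these lines - interface names, bond mode and addresses here are illustrative rather than our exact values:

Code:
# example layout only - real NIC names, bond mode and addresses differ
auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond_mode active-backup
        bond_miimon 100

auto vmbr0
iface vmbr0 inet static
        address 192.0.2.10
        netmask 255.255.255.0
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0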

VMs:
CentOS 6.7. Set up as KVM with hardware virt, 1 CPU (host), fixed RAM. VIRTIO NIC and storage controller, VMDK disk format. Tried with the QCOW2 format, and also with the LSI storage controller and the E1000 NIC, but this had no effect.

Configurations of two of the VMs included below:

Code:
root:~# qm config 101
bootdisk: ide0
cores: 1
cpu: host
description: Observium monitoring VM. %0A
hotplug: network,usb
ide0: VMStore:101/vm-101-disk-2.vmdk,format=vmdk,cache=writeback,size=76G
memory: 4608
name: OPS
net0: virtio=46:C3:5A:24:8F:B5,bridge=vmbr0
numa: 0
onboot: 1
ostype: l24
scsihw: virtio-scsi-pci
smbios1: uuid=f30620b4-0c81-40bf-bfd6-4f2608eac7bf
sockets: 1
vga: std

root:~# qm config 102
bootdisk: ide0
cores: 1
cpu: host
ide0: VMStore:102/vm-102-disk-2.vmdk,format=vmdk,cache=writethrough,size=26G
memory: 2048
name: NS1
net0: virtio=E6:A4:F3:2E:F2:AD,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
smbios1: uuid=a4bf5775-f5d2-473f-9c74-c427f2a41002
sockets: 1
vga: std
root:~#

Any help or thoughts anyone has on what the root cause of this may be would be much appreciated!

Thanks,

Jon
 
The ostype of the 101 VM is set to l24, but shouldn't it be l26 (for Linux kernel >= 2.6)? That's only a side note though, it shouldn't be the cause of the hang.

What's in the logs from the VMs after the hang? Anything suspicious?
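
For a start, something along these lines is worth checking (paths assume a CentOS guest and a stock Proxmox host; adjust as needed):

Code:
# inside the CentOS guest, after it has been rebooted
grep -iE 'hung|panic|oops|lockup' /var/log/messages
last -x | head                      # reboot/shutdown history

# on the Proxmox host
grep -i kvm /var/log/syslog
grep 101 /var/log/daemon.log        # substitute the actual VMID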
 

Hi,

I had set this to l26 at first, but changed it as part of my troubleshooting. As you say it shouldn't cause the issue, but I was trying to rule out as much as possible.

Looking through the logs I can't find anything suspicious. The system seems to be responding normally and then just dies.

Are you able to give pointers as to which logs on the host may be of help? I've left one VM in the crashed state in case there is anything you can suggest checking.

Thanks for your help!

Jon
 
I've rebooted one of the VMs and changed it to the RAW storage format to see if this makes a difference.

I think this is one of the last few settings I've not tried, so we'll see.

In the meantime any further assistance people can offer will be greatly appreciated.

Thanks,

Jon


Sent from my iPhone using Tapatalk
 
Hi all,

Again last night we had another lockup, on yet a different VM.

Details below:

Code:
root:~# qm config 103
bootdisk: ide0
cores: 1
cpu: host
hotplug: network
ide0: VMStore:103/vm-103-disk-2.vmdk,format=vmdk,cache=writethrough,size=26G
memory: 2048
name: NS2
net0: virtio=8E:5C:FC:26:F8:1D,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=4c0ff92c-fb07-46d0-90b1-d7ffccea3673
sockets: 1
vga: std

Again this one had to be killed at the CLI, as the GUI just wouldn't do anything with it.

Code:
root:~# qm stop 103
VM quit/powerdown failed - terminating now with SIGTERM
VM still running - terminating now with SIGKILL

On booting back up the logs don't seem to suggest anything out of the ordinary. Nothing odd that I can see in the host logs either. :(

This is not looking good...

Jon
 
We have the same symptoms on one standalone hypervisor.
Both with Windows and CentOS guests. It seems to happen after a high load. Nothing strange in the logs. We also have to stop/start the VM again.
We have not found any clear explanation yet... Still investigating... Restarting the guest daily (cron job) seems to help, but we're not sure (maybe related to a guest driver, VirtIO?). If you find something on your side please share it ;-)
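
(The daily restart is just a cron entry, roughly along one of these lines - the VMID and schedule here are only examples:)

Code:
# host-side /etc/cron.d entry: clean shutdown, then start again
0 4 * * * root qm shutdown 103 && qm start 103
# or simply from inside the guest itself:
0 4 * * * root /sbin/shutdown -r now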
 

Still nothing certain here, and although the VMs affected are under load at the time, it's nothing especially high.

Are you able to perhaps paste the output of pveversion -v in here, and if possible the config of the VMs (qm config [VMID]) for comparison?

What sort of hardware are you using?

Thanks,

Jon
 

We are up-to-date:

Code:
proxmox-ve-2.6.32: not correctly installed (running kernel: 3.10.0-11-pve)
pve-manager: 3.4-9 (running version: 3.4-9/4b51d87a)
pve-kernel-3.10.0-10-pve: 3.10.0-34
pve-kernel-3.10.0-8-pve: 3.10.0-30
pve-kernel-3.10.0-11-pve: 3.10.0-36
pve-kernel-2.6.32-37-pve: 2.6.32-150
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-3
pve-cluster: 3.0-18
qemu-server: 3.4-6
pve-firmware: not correctly installed
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-33
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-11
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

This is recent Supermicro hardware, E5-2630 v3 processors

Here is an example of a VM which sometimes crashes (it seems it has not happened for a week):

Code:
root@pve-build:~# qm config 103
boot: c
bootdisk: virtio0
cores: 8
cpu: host
description: Linux stable virtual build bot%0A
hotplug: disk,network,usb
ide2: none,media=cdrom
keyboard: fr-be
memory: 12000
name: linux-libc25-x86-64-ig482-stable
net0: virtio=00:0C:29:79:09:88,bridge=vmbr0
numa: 1
onboot: 1
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=76231c3d-39a7-49c4-8564-270b8af63af1
sockets: 1
tablet: 0
virtio0: vg15k.1-fs:103/vm-103-disk-1.raw,format=raw,cache=writeback,discard=on,size=40G
virtio1: vg15k.1-fs:103/vm-103-disk-2.raw,format=raw,cache=writeback,backup=no,size=40G
 

So that rules out an issue with the AMD CPU type, and also the idea that the disk image format plays a part (you're on RAW). As you're seeing this with Windows guests too, I guess we can rule out an issue with the guest as well. Same for NUMA enabled vs disabled.

Very interesting - suggests we're looking at something to do with PMVE or its components here.

Thanks for the info!

Jon
 
Hi,

This morning we had yet another lockup.

This was the machine I converted to RAW storage type, so that's not the issue.

Can anyone with a bit more experience or one of the Proxmox team please advise on further things we can check?

This has sadly now gone from being a major issue to a showstopper for us moving forward with Proxmox :(.

Thanks in advance!

Jon


Sent from my iPhone using Tapatalk
 
Hi,

I noticed you use vmdk as the disk image format. Could that be the cause? It may well be the root of the issue (fishing a little in the dark, though).
QEMU and vmdk sometimes give problems together.

Could you try to convert a machine to raw or qcow2 and then see if that fixes the hangs?
Use:
Code:
# to raw:
qemu-img convert -f vmdk -O raw centos7.vmdk centos7.img

# to qcow2:
qemu-img convert -f vmdk -O qcow2 centos7.vmdk centos7.qcow2

Replace 'centos7' accordingly.
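
A hedged follow-up: after converting you also need to point the VM at the new image. Assuming the file ends up in the VM's directory on the same storage, something like this should make Proxmox list it, after which you can attach it from the GUI or the config file:

Code:
# rescan storages so the converted image shows up as an unused disk
qm rescan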
 

Hi Thomas,

Thanks for the info.

We have already tried QCOW2 and RAW formats, and the lockup issue was still present.

The VMDK format was set in an attempt to rule out the other formats. We've now converted them all back. I can re-post the configs again if required.

Thanks,

Jon
 
I would try a totally different hardware machine if possible... even just for a test, to confirm this happens on different servers,
and/or I would try an older pve version, just to get a hint.

Marco
 

Hi Marco,

Thanks for the info.

The machine hardware is known good, as we had previously been using oVirt (KVM) on this with some quite heavy loads - no issues.

I'll look at downgrading shortly - any particular versions you can recommend?

Thanks,

Jon
 

Most people are running VMs perfectly fine on that pve version, so... if it is not the hw, and not the hypervisor, it should be... the vms?

You could try to install and test (at least for long enough to match your freeze timing) a reproducible vm, like a standard win/linux installed from iso or cd.
That _must_ run stable. If not, it is the hw, or the pve on that hardware, imho.

Since there are almost no clues, another thing you could try is to up/downgrade the kernel version. If you're running 2.6.x you can try 3.x or another 2.6.x.
Many hw issues are related to kernel compatibility.

If you're going to try another pve version, I guess you should reinstall; I don't know how to downgrade (maybe make an image of your current install so you can revert later).

If your vms are not on local storage, I would also try to run them on local storage, if possible to avoid any network/storage issue.

[edit] It may be that your hw is good, but sometimes it can be pretty nasty to find out that hw (even a golden one) is the issue; see this thread http://forum.proxmox.com/threads/22265-Proxmox-3-4-on-IBM-x3850-X6 [/edit]

Marco
 

Hi Marco,

Thanks for that.

Is there a certain kernel you'd recommend trying? With PMVE, is there any further configuration that needs to be done once the kernel is swapped out?

Thanks,

Jon
 

I'm not that techie :) but I think you could start by trying the nearest one, older or newer, or even the 3.x (note: no openvz support).
If you're still in trouble, maybe by asking pve support you'll have more chances to spot what is happening on your server, but they'll need every possible HW detail about it, or direct ssh access (that needs paid support, though).

In my relatively low-tech experience, troubles like that are difficult to spot, but I would definitely try
- another server, totally different, whatever is the known story of the current one
- standard linux vms installed from scratch

once I have a stable environment with a simple setup I would build on that:
- if the "simple" vm is stable on the new server, and not on the old, I get a hint (the old server)
- if the "real" vm is unstable on the new server, where the "simple" runs fine, I get another hint (the real vm)

There could be more sophisticated ways, but a more practical approach like this could be faster and maybe find other alternatives for you...

Marco
 
You can install the 3.10.0-x kernel directly from our repos with apt-get. Note that you lose OpenVZ (container) support, but for testing whether the VMs run fine that shouldn't matter.
use
Code:
apt-get update
apt-get install pve-kernel-3.10.0-11-pve
and reboot. (The actual kernel subversion could differ depending on which repository you use; simply take the newest 3.10.0-x kernel.)
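
After the reboot you can check that the intended kernel is actually running with:

Code:
uname -r    # should report the 3.10.0-x-pve version you installed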

Also testing Proxmox VE 4.0 beta2 would be a good option.
 
Hi all,

Thanks for the suggestions.

We were thinking of the 4.0 beta idea - as this has a newer 4.x kernel which will contain a number of fixes - but we may try the 3.10 kernel first, as containers/OpenVZ are not important to us.

In the meantime we've transferred a single VM that was previously affected to another node, but in typical style the fault has not happened again since :(.

@Robhost - no, we're an AMD shop (for reasons I no longer remember).

Thanks,

Jon
 
