KVMs randomly shut down

gconcepts

New Member
Good afternoon, wonderful folks. This is my first post on the forums, and I could use any assistance available.

I have a single-node Proxmox instance hosting Windows Server KVMs, Linux KVMs, and some Linux containers. I started noticing an issue over the weekend: I would look at the Proxmox web interface and notice that a couple of KVMs were in a stopped state. I'm not sure what could cause this.

My Proxmox host is a Dell PowerEdge R900. I have Proxmox installed on a USB 3.0 SSD drive (this one specifically: http://www.amazon.com/VisionTek-120...1626&sr=8-16&keywords=usb+3.0+ssd+flash+drive). The storage for the VMs is on a RAID array, and only the Proxmox host operating system is installed on the USB flash drive. I installed it on the flash drive because the R900 has a SATA port on the inside but no SATA power connector for it.

Below is my pveversion output. What logs can I look at to tell me what's going on?

root@proxmox10:~# pveversion -v
proxmox-ve: 4.1-28 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-2 (running version: 4.1-2/78c5f4a2)
pve-kernel-4.2.6-1-pve: 4.2.6-28
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-29
qemu-server: 4.0-42
pve-firmware: 1.1-7
libpve-common-perl: 4.0-42
libpve-access-control: 4.0-10
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-18
pve-container: 1.0-35
pve-firewall: 2.0-14
pve-ha-manager: 1.0-16
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 2.0.0-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie
root@proxmox10:~#
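
As a starting point for this kind of unexplained VM stop (assuming a stock PVE 4.x install on Debian Jessie, where the host logs to /var/log/syslog and the systemd journal), the host logs around the time a VM disappeared are usually the most telling. A rough first pass might be:
Code:
# look for OOM kills or QEMU/KVM messages in the host syslog (default Debian Jessie path)
grep -iE "oom|killed process|qemu|kvm" /var/log/syslog

# or watch kernel/journal output live while waiting for the next stop
journalctl -f

The node's Task History in the web GUI is also worth a glance: a stop requested through the API usually shows up there as a task, while a process that simply died does not.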
 
Any help will be greatly appreciated.
 
Any ideas, anyone? I also noticed that sometimes the LXC containers freeze and I cannot SSH into them, and even when attempting to force-stop a container, it won't respond for a while. Could this be due to bad RAM, hard drive corruption, or some kernel bug?
 
Does this indicate RAM issues? I noticed the following in the syslog:

Code:
Mar 23 07:36:06 proxmox10 kernel: [45705.955755] dmidecode.sudo invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Mar 23 07:36:06 proxmox10 kernel: [45705.955766] dmidecode.sudo cpuset=110 mems_allowed=0
Mar 23 07:36:06 proxmox10 kernel: [45705.955778] CPU: 1 PID: 14125 Comm: dmidecode.sudo Tainted: P O 4.2.6-1-pve #1
Mar 23 07:36:06 proxmox10 kernel: [45705.955780] Hardware name: Dell Inc. PowerEdge R900/0X947H, BIOS 1.1.13 07/09/2009
Mar 23 07:36:06 proxmox10 kernel: [45705.955783] 0000000000000000 000000007bf3cc13 ffff880e3129fcb8 ffffffff81801158
Mar 23 07:36:06 proxmox10 kernel: [45705.955787] 0000000000000000 ffff880af7f24b00 ffff880e3129fd38 ffffffff817ff67a
Mar 23 07:36:06 proxmox10 kernel: [45705.955790] ffff881fab48e8f8 ffffffff81e60cc0 ffff880e3129fd18 ffffffff810c5b9c
Mar 23 07:36:06 proxmox10 kernel: [45705.955793] Call Trace:
Mar 23 07:36:06 proxmox10 kernel: [45705.955812] [<ffffffff81801158>] dump_stack+0x45/0x57
Mar 23 07:36:06 proxmox10 kernel: [45705.955816] [<ffffffff817ff67a>] dump_header+0xaf/0x238
Mar 23 07:36:06 proxmox10 kernel: [45705.955824] [<ffffffff810c5b9c>] ? __rwsem_do_wake+0x10c/0x140
Mar 23 07:36:06 proxmox10 kernel: [45705.955834] [<ffffffff81185693>] oom_kill_process+0x1e3/0x3c0
Mar 23 07:36:06 proxmox10 kernel: [45705.955839] [<ffffffff811f0971>] mem_cgroup_oom_synchronize+0x531/0x600
Mar 23 07:36:06 proxmox10 kernel: [45705.955849] [<ffffffff811ecb60>] ? mem_cgroup_css_online+0x250/0x250
Mar 23 07:36:06 proxmox10 kernel: [45705.955852] [<ffffffff81185e63>] pagefault_out_of_memory+0x13/0x80
Mar 23 07:36:06 proxmox10 kernel: [45705.955862] [<ffffffff8106735f>] mm_fault_error+0x7f/0x160
Mar 23 07:36:06 proxmox10 kernel: [45705.955865] [<ffffffff81067823>] __do_page_fault+0x3e3/0x410
Mar 23 07:36:06 proxmox10 kernel: [45705.955867] [<ffffffff81067872>] do_page_fault+0x22/0x30
Mar 23 07:36:06 proxmox10 kernel: [45705.955945] [<ffffffff8180a048>] page_fault+0x28/0x30

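The mem_cgroup_oom_synchronize frames in that trace suggest this was a memory-cgroup OOM kill (container 110 hitting its own limit) rather than the host running out of RAM. On a PVE 4.x host the container's limit and fail counters can be read straight from the cgroup filesystem; a quick check for CT 110 would be roughly:
Code:
# cgroup v1 memory accounting for container 110 (path matches the /lxc/110 seen in the log)
cat /sys/fs/cgroup/memory/lxc/110/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/lxc/110/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/lxc/110/memory.failcnt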
 
I also noticed after recent updates that some containers max out their RAM usage.

Code:
Mar 23 08:02:47 proxmox10 kernel: [47306.772442] Call Trace:
Mar 23 08:02:47 proxmox10 kernel: [47306.772623] [<ffffffff810c5b9c>] ? __rwsem_do_wake+0x10c/0x140
Mar 23 08:02:47 proxmox10 kernel: [47306.772726] [<ffffffff811ecb60>] ? mem_cgroup_css_online+0x250/0x250
Mar 23 08:02:47 proxmox10 kernel: [47306.772743] [<ffffffff81067823>] __do_page_fault+0x3e3/0x410
Mar 23 08:02:48 proxmox10 kernel: [47307.800543] Task in /lxc/110 killed as a result of limit of /lxc/110
Mar 23 08:02:48 proxmox10 kernel: [47307.800559] memory: usage 2097152kB, limit 2097152kB, failcnt 696340
Mar 23 08:02:48 proxmox10 kernel: [47307.800562] memory+swap: usage 5242784kB, limit 5242880kB, failcnt 582511
Mar 23 08:02:48 proxmox10 kernel: [47307.800564] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Mar 23 08:02:48 proxmox10 kernel: [47307.800566] Memory cgroup stats for /lxc/110: cache:2944KB rss:2094208KB rss_huge:0KB mapped_file:1952KB dirty:0KB writeback:0KB swap:3145$
Mar 23 08:02:48 proxmox10 kernel: [47307.800590] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
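
That excerpt is fairly explicit: a task inside container 110 was killed because the container hit its 2GB memory (and roughly 5GB memory+swap) limit, not because the host was exhausted. If the workload in that container genuinely needs more, raising its limits would be one option, for example (the values here are only placeholders):
Code:
# give CT 110 4GB of RAM and 4GB of swap; pick values that fit the real workload
pct set 110 -memory 4096 -swap 4096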
 
Hi,

how much RAM do you have in the server?
how much RAM did you assign to the VMs/containers?
what do you run inside the VMs/containers?

Also noticed after recent updates that some containers max out ram usage.

Does that mean you upgraded? (Because the versions in your first post are not current.)
If yes, did you reboot the server?
 

Hi dcsapak,

Thanks for the response. Yes, I applied some updates to the Proxmox server yesterday hoping that would resolve the issue. Please see below for the updated pveversion output, as well as answers to your questions.

How much RAM do you have in the server? 128GB.
How much RAM did you assign to the VMs/containers? It varies. The VM with the most RAM has 12GB assigned to it. I also have ballooning enabled on all the VMs.
What do you run inside the VMs/containers? Most of the VMs run Windows Server, which in turn runs SQL Server and SharePoint; the containers are fairly lightweight, with a couple running Java applications on Tomcat, OpenLDAP, WordPress, etc.

The server has never reached full RAM usage: out of the 128GB the server has, max usage is never more than 78GB.

[Attached screenshot: upload_2016-3-23_10-34-53.png]
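
With ballooning enabled, the memory a guest actually holds can differ from its assigned maximum, so host graphs alone don't tell the whole story. If it helps, the per-VM settings and the current balloon size can be checked from the host; VMID 101 below is only a placeholder:
Code:
# show the configured memory/balloon settings for one VM
qm config 101 | grep -Ei "memory|balloon"

# open the QEMU monitor for that VM and type "info balloon" to see the current size
qm monitor 101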



root@proxmox10:~# pveversion -v
proxmox-ve: 4.1-28 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-2 (running version: 4.1-2/78c5f4a2)
pve-kernel-4.2.6-1-pve: 4.2.6-28
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-29
qemu-server: 4.0-42
pve-firmware: 1.1-7
libpve-common-perl: 4.0-42
libpve-access-control: 4.0-10
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-18
pve-container: 1.0-35
pve-firewall: 2.0-14
pve-ha-manager: 1.0-16
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie
root@proxmox10:~#
 
How did you upgrade?
The way to go is
Code:
apt-get update
apt-get dist-upgrade
or via the GUI under Updates.

(Because your versions do not differ from the first post.)

What does your storage look like?
Do you use ZFS?
 

That's correct, I upgraded using the commands below:
Code:
apt-get update
apt-get dist-upgrade

The patches I applied previously were from a post I saw regarding backups failing due to containers waiting for a timeout. After I applied that patch, it seemed to resolve the issue of the backups failing due to container timeouts.

I use local storage, no ZFS. I have the Proxmox host installed on an internal USB SSD flash drive. Then I have a RAID array connected to the server's built-in RAID controller, set up as local storage on LVM, and I use that as the storage for the VMs. I also have an iSCSI connection to a Windows machine that I use only for backups; the backup storage is backed by that iSCSI connection. I just ran apt-get update and apt-get dist-upgrade and there are some updates available. I am applying them now and will see if that resolves the issue. Please let me know if there is any other information I can provide.
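
A setup like that (host OS on the flash drive, VM disks on LVM over the hardware RAID, backups on an iSCSI-backed directory) would typically show up in /etc/pve/storage.cfg roughly as sketched below. The storage IDs, volume group and paths here are made up for illustration, not taken from this server; pvesm status lists what is actually configured.
Code:
# /etc/pve/storage.cfg (illustrative sketch only; IDs, VG name and paths are hypothetical)
dir: local
        path /var/lib/vz
        content iso,vztmpl,rootdir

lvm: vm-raid
        vgname vmdata
        content images

dir: iscsi-backups
        path /mnt/backup-lun
        content backup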
 


I have finished applying the upgrade. Below is the latest pveversion output. I will monitor and see if that resolves the issue.

proxmox-ve: 4.1-39 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-22 (running version: 4.1-22/aca130cf)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.2.8-1-pve: 4.2.8-39
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-36
qemu-server: 4.0-64
pve-firmware: 1.1-7
libpve-common-perl: 4.0-54
libpve-access-control: 4.0-13
libpve-storage-perl: 4.0-45
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-9
pve-container: 1.0-52
pve-firewall: 2.0-22
pve-ha-manager: 1.0-25
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie


Thanks
 
Hi,
it looks like you don't have the right repository enabled.

I assume you don't have a valid subscription, so pve-enterprise won't work.

Please enable the no-subscription list:
Code:
cat /etc/apt/sources.list.d/pve-no-subscription.list
deb http://download.proxmox.com/debian jessie pve-no-subscription
"apt-get update" and "apt-get dist-upgrade" should then update your system (reboot afterwards because of the new kernel).

Regarding the KVM shutdowns: if you use Ubuntu inside the VMs, use the "right" kernel: linux-image-virtual.

Udo
 
Sorry, I didn't read carefully enough; your system is up to date!

Udo
 
Thanks for the response, Udo. I already applied the update. I have the VMs currently running and will update this thread. The issue usually occurs with a VM crashing after about 8 hours, so I should know more by this time tomorrow.

Thanks
 
Hi,
check the kernel package if the guest is Ubuntu.

Udo

This also affects Windows machines; there is only one Ubuntu KVM and the others are Windows. I will be replacing some of the RAM sticks on that box by Monday and will see if that resolves the issue. Even with the latest update I applied today, the issue still occurred, and when it does, the syslog shows a message similar to the one I pasted above.
 

So I noticed a strange behavior. When the crash occurs for the KVMs, some LXC containers run wild, using the most CPU time. I cannot SSH into those LXC containers to see what's eating the CPU. The only way I can get one to stop is to kill it from the web GUI on :8006, and sometimes I have to do it multiple times before it actually gets killed.

The LXC container is built from a Debian template and the only things installed on it are Java and the custom application. Note that after the container is stopped or restarted, the custom application isn't running, as I prefer to start it manually. So when the CPU craziness occurs, I can't tell what's going on inside the container that causes the host to run out of CPU.

As another note, on the Dell DRAC logs page I noticed some errors regarding a certain DIMM. I have some new RAM coming in by Monday, so I will reseat the RAM sticks in that slot and replace them if the same error still comes up. Could that RAM error cause the containers to be erratic like that?

"Persistent correctable memory error rate has increased for a memory device at location Memory Board C DIMM5"
 
Thanks, Udo and dcsapak. The issue seems to have been related to bad RAM. I swapped the RAM and deleted the LXC containers that were misbehaving, and the issue hasn't occurred since last Friday. I will close this thread.
 
