KVMs randomly shut down

gconcepts

New Member
Good afternoon, wonderful folks. This is my first post on the forums, and I could use any assistance available.

I have a single-node Proxmox instance hosting Windows Server KVMs, Linux KVMs, and some Linux containers. I started noticing an issue over the weekend: I would look at the Proxmox web interface and notice that a couple of KVMs were in a stopped state. I'm not sure what could cause this.

My Proxmox host is a Dell PowerEdge R900. I have Proxmox installed on a USB 3.0 SSD drive (this one specifically: http://www.amazon.com/VisionTek-120...1626&sr=8-16&keywords=usb+3.0+ssd+flash+drive). The storage for the VMs is on a RAID array, and only the Proxmox host operating system is installed on the USB flash drive. I installed it on the flash drive because the R900 has a SATA port on the inside but no SATA power connector for it.

Below is my pveversion output. What logs can I look at to tell me what's going on?

root@proxmox10:~# pveversion -v
proxmox-ve: 4.1-28 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-2 (running version: 4.1-2/78c5f4a2)
pve-kernel-4.2.6-1-pve: 4.2.6-28
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-29
qemu-server: 4.0-42
pve-firmware: 1.1-7
libpve-common-perl: 4.0-42
libpve-access-control: 4.0-10
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-18
pve-container: 1.0-35
pve-firewall: 2.0-14
pve-ha-manager: 1.0-16
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 2.0.0-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie
root@proxmox10:~#
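
As a starting point for this kind of unexplained VM stop (assuming a stock PVE 4.x install on Debian Jessie, where the host logs to /var/log/syslog and the systemd journal), the host logs around the time a VM disappeared are usually the most telling. A rough first pass might be:
Code:
# look for OOM kills or QEMU/KVM messages in the host syslog (default Debian Jessie path)
grep -iE "oom|killed process|qemu|kvm" /var/log/syslog

# or watch kernel/journal output live while waiting for the next stop
journalctl -f

The node's Task History in the web GUI is also worth a glance: a stop requested through the API usually shows up there as a task, while a process that simply died does not.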
 
Any help will be greatly appreciated.
 
Any ideas, anyone? I also noticed that sometimes the LXC containers freeze and I cannot SSH into them, and even when attempting to force-stop a container, it won't respond for a while. Could this be due to bad RAM, hard drive corruption, or some kernel bug?
 
Does this indicate RAM issues? I noticed the following in the syslog:

Code:
Mar 23 07:36:06 proxmox10 kernel: [45705.955755] dmidecode.sudo invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Mar 23 07:36:06 proxmox10 kernel: [45705.955766] dmidecode.sudo cpuset=110 mems_allowed=0
Mar 23 07:36:06 proxmox10 kernel: [45705.955778] CPU: 1 PID: 14125 Comm: dmidecode.sudo Tainted: P O 4.2.6-1-pve #1
Mar 23 07:36:06 proxmox10 kernel: [45705.955780] Hardware name: Dell Inc. PowerEdge R900/0X947H, BIOS 1.1.13 07/09/2009
Mar 23 07:36:06 proxmox10 kernel: [45705.955783] 0000000000000000 000000007bf3cc13 ffff880e3129fcb8 ffffffff81801158
Mar 23 07:36:06 proxmox10 kernel: [45705.955787] 0000000000000000 ffff880af7f24b00 ffff880e3129fd38 ffffffff817ff67a
Mar 23 07:36:06 proxmox10 kernel: [45705.955790] ffff881fab48e8f8 ffffffff81e60cc0 ffff880e3129fd18 ffffffff810c5b9c
Mar 23 07:36:06 proxmox10 kernel: [45705.955793] Call Trace:
Mar 23 07:36:06 proxmox10 kernel: [45705.955812] [<ffffffff81801158>] dump_stack+0x45/0x57
Mar 23 07:36:06 proxmox10 kernel: [45705.955816] [<ffffffff817ff67a>] dump_header+0xaf/0x238
Mar 23 07:36:06 proxmox10 kernel: [45705.955824] [<ffffffff810c5b9c>] ? __rwsem_do_wake+0x10c/0x140
Mar 23 07:36:06 proxmox10 kernel: [45705.955834] [<ffffffff81185693>] oom_kill_process+0x1e3/0x3c0
Mar 23 07:36:06 proxmox10 kernel: [45705.955839] [<ffffffff811f0971>] mem_cgroup_oom_synchronize+0x531/0x600
Mar 23 07:36:06 proxmox10 kernel: [45705.955849] [<ffffffff811ecb60>] ? mem_cgroup_css_online+0x250/0x250
Mar 23 07:36:06 proxmox10 kernel: [45705.955852] [<ffffffff81185e63>] pagefault_out_of_memory+0x13/0x80
Mar 23 07:36:06 proxmox10 kernel: [45705.955862] [<ffffffff8106735f>] mm_fault_error+0x7f/0x160
Mar 23 07:36:06 proxmox10 kernel: [45705.955865] [<ffffffff81067823>] __do_page_fault+0x3e3/0x410
Mar 23 07:36:06 proxmox10 kernel: [45705.955867] [<ffffffff81067872>] do_page_fault+0x22/0x30
Mar 23 07:36:06 proxmox10 kernel: [45705.955945] [<ffffffff8180a048>] page_fault+0x28/0x30

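The mem_cgroup_oom_synchronize frames in that trace suggest this was a memory-cgroup OOM kill (container 110 hitting its own limit) rather than the host running out of RAM. On a PVE 4.x host the container's limit and fail counters can be read straight from the cgroup filesystem; a quick check for CT 110 would be roughly:
Code:
# cgroup v1 memory accounting for container 110 (path matches the /lxc/110 seen in the log)
cat /sys/fs/cgroup/memory/lxc/110/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/lxc/110/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/lxc/110/memory.failcnt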
 
I also noticed after recent updates that some containers max out their RAM usage.

Code:
Mar 23 08:02:47 proxmox10 kernel: [47306.772442] Call Trace:
Mar 23 08:02:47 proxmox10 kernel: [47306.772623] [<ffffffff810c5b9c>] ? __rwsem_do_wake+0x10c/0x140
Mar 23 08:02:47 proxmox10 kernel: [47306.772726] [<ffffffff811ecb60>] ? mem_cgroup_css_online+0x250/0x250
Mar 23 08:02:47 proxmox10 kernel: [47306.772743] [<ffffffff81067823>] __do_page_fault+0x3e3/0x410
Mar 23 08:02:48 proxmox10 kernel: [47307.800543] Task in /lxc/110 killed as a result of limit of /lxc/110
Mar 23 08:02:48 proxmox10 kernel: [47307.800559] memory: usage 2097152kB, limit 2097152kB, failcnt 696340
Mar 23 08:02:48 proxmox10 kernel: [47307.800562] memory+swap: usage 5242784kB, limit 5242880kB, failcnt 582511
Mar 23 08:02:48 proxmox10 kernel: [47307.800564] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Mar 23 08:02:48 proxmox10 kernel: [47307.800566] Memory cgroup stats for /lxc/110: cache:2944KB rss:2094208KB rss_huge:0KB mapped_file:1952KB dirty:0KB writeback:0KB swap:3145$
Mar 23 08:02:48 proxmox10 kernel: [47307.800590] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
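
That excerpt is fairly explicit: a task inside container 110 was killed because the container hit its 2GB memory (and roughly 5GB memory+swap) limit, not because the host was exhausted. If the workload in that container genuinely needs more, raising its limits would be one option, for example (the values here are only placeholders):
Code:
# give CT 110 4GB of RAM and 4GB of swap; pick values that fit the real workload
pct set 110 -memory 4096 -swap 4096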
 
Hi,

how much RAM do you have in the server?
how much RAM did you assign to the VMs/containers?
what do you run inside the VMs/containers?

Also noticed after recent updates that some containers max out ram usage.

Does that mean you upgraded? (Because the versions in your first post are not current.)
If yes, did you reboot the server?
 

Hi dcsapak,

Thanks for the response. Yes, I applied some updates to the Proxmox server yesterday hoping that would resolve the issue. Please see below for the updated pveversion output, as well as answers to your questions.

How much RAM do you have in the server? 128GB.
How much RAM did you assign to the VMs/containers? It varies. The VM with the most RAM has 12GB assigned to it. I also have ballooning enabled on all the VMs.
What do you run inside the VMs/containers? Most of the VMs run Windows Server, which in turn runs SQL Server and SharePoint; the containers are fairly lightweight, with a couple running Java applications on Tomcat, OpenLDAP, WordPress, etc.

The server has never reached full RAM usage: out of the 128GB the server has, max usage is never more than 78GB.

[Attached screenshot: upload_2016-3-23_10-34-53.png]
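
With ballooning enabled, the memory a guest actually holds can differ from its assigned maximum, so host graphs alone don't tell the whole story. If it helps, the per-VM settings and the current balloon size can be checked from the host; VMID 101 below is only a placeholder:
Code:
# show the configured memory/balloon settings for one VM
qm config 101 | grep -Ei "memory|balloon"

# open the QEMU monitor for that VM and type "info balloon" to see the current size
qm monitor 101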



root@proxmox10:~# pveversion -v
proxmox-ve: 4.1-28 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-2 (running version: 4.1-2/78c5f4a2)
pve-kernel-4.2.6-1-pve: 4.2.6-28
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-29
qemu-server: 4.0-42
pve-firmware: 1.1-7
libpve-common-perl: 4.0-42
libpve-access-control: 4.0-10
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-18
pve-container: 1.0-35
pve-firewall: 2.0-14
pve-ha-manager: 1.0-16
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie
root@proxmox10:~#
 
How did you upgrade?
The way to go is
Code:
apt-get update
apt-get dist-upgrade
or via the GUI under Updates.

(Because your versions do not differ from the first post.)

What does your storage look like?
Do you use ZFS?
 

That's correct, I upgraded using the commands below:
Code:
apt-get update
apt-get dist-upgrade

The patches I applied previously were from a post I saw regarding backups failing due to containers waiting for a timeout. After I applied that patch, it seemed to resolve the issue of the backups failing due to container timeouts.

I use local storage, no ZFS. I have the Proxmox host installed on an internal USB SSD flash drive. Then I have a RAID array connected to the server's built-in RAID controller, set up as local storage on LVM, and I use that as the storage for the VMs. I also have an iSCSI connection to a Windows machine that I use only for backups; the backup storage is backed by that iSCSI connection. I just ran apt-get update and apt-get dist-upgrade and there are some updates available. I am applying them now and will see if that resolves the issue. Please let me know if there is any other information I can provide.
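
A setup like that (host OS on the flash drive, VM disks on LVM over the hardware RAID, backups on an iSCSI-backed directory) would typically show up in /etc/pve/storage.cfg roughly as sketched below. The storage IDs, volume group and paths here are made up for illustration, not taken from this server; pvesm status lists what is actually configured.
Code:
# /etc/pve/storage.cfg (illustrative sketch only; IDs, VG name and paths are hypothetical)
dir: local
        path /var/lib/vz
        content iso,vztmpl,rootdir

lvm: vm-raid
        vgname vmdata
        content images

dir: iscsi-backups
        path /mnt/backup-lun
        content backup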
 


I have finished applying the upgrade. Below is the latest pveversion output. I will monitor and see if that resolves the issue.

proxmox-ve: 4.1-39 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-22 (running version: 4.1-22/aca130cf)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.2.8-1-pve: 4.2.8-39
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-36
qemu-server: 4.0-64
pve-firmware: 1.1-7
libpve-common-perl: 4.0-54
libpve-access-control: 4.0-13
libpve-storage-perl: 4.0-45
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-9
pve-container: 1.0-52
pve-firewall: 2.0-22
pve-ha-manager: 1.0-25
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie


Thanks
 
Hi,
it looks like you don't have the right repository enabled.

I assume you don't have a valid subscription, so pve-enterprise won't work.

Please enable the no-subscription list:
Code:
cat /etc/apt/sources.list.d/pve-no-subscription.list
deb http://download.proxmox.com/debian jessie pve-no-subscription
"apt-get update" and "apt-get dist-upgrade" should then update your system (reboot afterwards because of the new kernel).

Regarding the KVM shutdowns: if you use Ubuntu inside the VMs, use the "right" kernel: linux-image-virtual.

Udo
 
Sorry, I didn't read carefully enough; your system is up to date!

Udo
 
Thanks for the response, Udo. I already applied the update. I have the VMs currently running and will update this thread. The issue usually occurs with a VM crashing after about 8 hours, so I should know more by this time tomorrow.

Thanks
 
Hi,
check the kernel package if the guest is Ubuntu.

Udo

This also affects Windows machines; there is only one Ubuntu KVM and the others are Windows. I will be replacing some of the RAM sticks on that box by Monday and will see if that resolves the issue. Even with the latest update I applied today, the issue still occurred, and when it does, the syslog shows a message similar to the one I pasted above.
 

So I noticed a strange behavior. When the crash occurs for the KVMs, some LXC containers run wild, using the most CPU time. I cannot SSH into those LXC containers to see what's eating the CPU. The only way I can get one to stop is to kill it from the web GUI on :8006, and sometimes I have to do it multiple times before it actually gets killed.

The LXC container is built from a Debian template and the only things installed on it are Java and the custom application. Note that after the container is stopped or restarted, the custom application isn't running, as I prefer to start it manually. So when the CPU craziness occurs, I can't tell what's going on inside the container that causes the host to run out of CPU.

As another note, on the Dell DRAC logs page I noticed some errors regarding a certain DIMM. I have some new RAM coming in by Monday, so I will reseat the RAM sticks in that slot and replace them if the same error still comes up. Could that RAM error cause the containers to be erratic like that?

"Persistent correctable memory error rate has increased for a memory device at location Memory Board C DIMM5"
 
Thanks, Udo and dcsapak. The issue seems to have been related to bad RAM. I swapped the RAM and deleted the LXC containers that were misbehaving, and the issue hasn't occurred since last Friday. I will close this thread.
 
