Restarted a node, some KVMs on other nodes panic

RobFantini
Renowned Member
2 of the 3 KVMs had high memory usage.

One did not have swap.

All 3 were busy with disk I/O.

One of the nodes uses on-board SATA, the other is a recent high-end Supermicro with an IT-mode HBA.

Code:
 # pveversion -v
proxmox-ve: 4.4-79 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-79
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-108
pve-firmware: 1.1-10
libpve-common-perl: 4.0-91
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-73
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-1
pve-container: 1.0-93
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-1
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve14~bpo80
ceph: 10.2.5-1~bpo80+1
 

tom
Proxmox Staff Member
we have a ceph cluster

a minute after restarting one node, at least 3 key KVMs panicked.

screenshot attached.

the KVMs are on 2 different nodes

Post your VM config:

> qm config VMID

What OS do you run inside, in detail?
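For reference, a minimal way to gather that from inside a Debian-family guest might look like this (a sketch, not PVE-specific):

Code:
# run inside the guest, not on the PVE host
cat /etc/os-release      # distribution and release (or /etc/debian_version)
uname -a                 # running kernel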
 

RobFantini
Renowned Member
All three run Debian Jessie.
Code:
boot: cn
bootdisk: scsi0
cores: 2
memory: 1024
name: fbcadmin
net0: virtio=DE:60:C3:F6:55:23,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
protection: 1
scsi0: ceph-kvm3:vm-100-disk-1,discard=on,size=8G
smbios1: uuid=195cf837-ebaa-49c2-95e9-5ba7a0869cb0
sockets: 1
 

RobFantini
Renowned Member
Also, none of the systems logged out-of-memory errors, so it is probably not a memory issue. Note: the memory on the above config was 512 MB yesterday.

Kernel running, per uname -a:
Code:
Linux fbcadmin 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64 GNU/Linux
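As a rough cross-check (a sketch, assuming the guests log to the usual places), OOM-killer activity would normally show up like this:

Code:
# inside each guest: look for OOM-killer messages
dmesg -T | grep -i -E 'out of memory|oom'
grep -i oom /var/log/kern.log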
 

RobFantini
Renowned Member
After some research, since I have 8 nodes, I'll try using 5 for OSDs and 3 for VMs. I am not sure yet where to place the 3 mons.
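A possible starting point (a sketch; the exact pveceph syntax should be double-checked against the installed version): look at where the mons currently are, then create them on the chosen VM nodes.

Code:
# current monitor map and overall cluster state
ceph mon stat
ceph -s
# run on each chosen VM node to add a monitor there
pveceph createmon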
 

tom
Proxmox Staff Member
Code:
...
scsi0: ceph-kvm3:vm-100-disk-1,discard=on,size=8G
...
Make sure that you use the VirtIO SCSI controller (not LSI); see the VM options. I remember some panics when using LSI recently, but I did not debug them further, as a modern OS should use virtio-scsi anyway.
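For reference, the switch can also be done from the CLI (a sketch, assuming VMID 100 from the config above; a full stop/start of the VM is needed so the guest sees the new controller):

Code:
# change the SCSI controller type for VM 100 to virtio-scsi
qm set 100 --scsihw virtio-scsi-pci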
 

udo
Famous Member
After some research, since I have 8 nodes, I'll try using 5 for OSDs and 3 for VMs. I am not sure yet where to place the 3 mons.
Hi Rob,
I'm not sure if this helps with this issue, but I had a separate Ceph cluster (8 nodes) where the mons ran on the PVE nodes.
So I would run the mons on the VM nodes.

Was the restarted node an OSD+mon node? There is an issue where the OSD stop is not recognized early enough, because the mon also dies too fast. If you restart a node and shut down the ceph-osd first, the VMs see roughly 20 seconds less I/O stall.
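For reference, a rough sketch of that order of operations (assuming the OSDs are managed by the usual systemd units; noout just avoids needless rebalancing during the reboot):

Code:
# on the node about to be restarted
ceph osd set noout              # don't rebalance while the node is down
systemctl stop ceph-osd.target  # stop the local OSDs cleanly first
reboot
# once the node is back and its OSDs have rejoined
ceph osd unset noout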

Normally the VMs should handle a short I/O stall without trouble, but perhaps not?! (I don't know whether discard is also a problem in this case.)

Udo
 

udo
Famous Member
There is an issue where the OSD stop is not recognized early enough, because the mon also dies too fast. If you restart a node and shut down the ceph-osd first, the VMs see roughly 20 seconds less I/O stall.
Answering myself:
I got an email that this bug (#18516) is solved now, but I don't know how long it will take for these changes to land in a Ceph release (I guess 10.2.6).
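Once that release is out, whether the fix has actually arrived can be checked like this (a sketch; output format varies slightly between releases):

Code:
# installed package version on each node
ceph --version
# versions reported by the running OSD daemons
ceph tell osd.* version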

Udo
 

RobFantini
Renowned Member
Make sure that you use the VirtIO SCSI controller (not LSI); see the VM options. I remember some panics when using LSI recently, but I did not debug them further, as a modern OS should use virtio-scsi anyway.
They are set to LSI; I'll do the switch. Thank you.
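A quick way to confirm the change afterwards (again assuming VMID 100):

Code:
qm config 100 | grep -E 'scsihw|scsi0'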
 

RobFantini
Renowned Member
Hi Rob,
I'm not sure if this helps with this issue, but I had a separate Ceph cluster (8 nodes) where the mons ran on the PVE nodes.
So I would run the mons on the VM nodes.

Was the restarted node an OSD+mon node? There is an issue where the OSD stop is not recognized early enough, because the mon also dies too fast. If you restart a node and shut down the ceph-osd first, the VMs see roughly 20 seconds less I/O stall.

Normally the VMs should handle a short I/O stall without trouble, but perhaps not?! (I don't know whether discard is also a problem in this case.)

Udo
Udo: yes, the restarted node ran mon+osd.
 
