Restarted a node, some KVMs on other nodes panic

RobFantini

We have a Ceph cluster.

A minute after restarting one node, at least 3 key KVMs panicked.

Screenshot attached.

The KVMs are on 2 different nodes.
 

Attachments

  • services-kvm-freeze-s24_Proxmox_Virtual_Environment.png
2 of the 3 KVMs had high memory usage.

One did not have swap.

All 3 were busy with disk I/O.

One of the nodes uses onboard SATA; the other is a recent high-end Supermicro with an IT-mode HBA.

Code:
 # pveversion -v
proxmox-ve: 4.4-79 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-79
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-108
pve-firmware: 1.1-10
libpve-common-perl: 4.0-91
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-73
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-1
pve-container: 1.0-93
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-1
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve14~bpo80
ceph: 10.2.5-1~bpo80+1
 
We have a Ceph cluster.

A minute after restarting one node, at least 3 key KVMs panicked.

Screenshot attached.

The KVMs are on 2 different nodes.

Post your VM config:

> qm config VMID

What OS do you run inside, in detail?
 
All three run Debian Jessie.
Code:
boot: cn
bootdisk: scsi0
cores: 2
memory: 1024
name: fbcadmin
net0: virtio=DE:60:C3:F6:55:23,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
protection: 1
scsi0: ceph-kvm3:vm-100-disk-1,discard=on,size=8G
smbios1: uuid=195cf837-ebaa-49c2-95e9-5ba7a0869cb0
sockets: 1
 
Also, none of the systems logged out-of-memory errors, so it's probably not a memory issue. Note that the memory in the above config was 512 MB yesterday.

Kernel running, per uname -a:
Linux fbcadmin 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64 GNU/Linux
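
For reference, a minimal sketch of how that memory change would typically be made from the CLI, assuming VMID 100 from the config above (the same setting is also available in the GUI):
Code:
# raise the VM's memory from 512 MB to 1024 MB (VMID 100 assumed from the config above)
qm set 100 --memory 1024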
 
After some research: since I have 8 nodes, I'll try using 5 for OSDs and 3 for VMs. I am not sure yet where to place the 3 mons.
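
If the mons end up on the 3 VM nodes (as suggested later in this thread), a minimal sketch of creating one there with the PVE 4.x tooling, assuming pveceph is already initialized on that node:
Code:
# run on each node that should host a monitor (Proxmox VE's Ceph integration)
pveceph createmon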
 
...

scsi0: ceph-kvm3:vm-100-disk-1,discard=on,size=8G
...

Make sure that you use the virtio-scsi controller (not LSI), see the VM options. I remember some panics when using LSI recently, but I did not debug it further, as modern OSes should use virtio-scsi anyway.
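
For reference, a minimal sketch of making that switch from the CLI, assuming VMID 100 from the config posted above (the same setting can be changed in the GUI):
Code:
# change the SCSI controller type from the LSI default to virtio-scsi
qm set 100 --scsihw virtio-scsi-pci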
 
After some research: since I have 8 nodes, I'll try using 5 for OSDs and 3 for VMs. I am not sure yet where to place the 3 mons.
Hi Rob,
I'm not sure if this helps with this issue, but I had a separate Ceph cluster (8 nodes) where the mons ran on the PVE nodes.
So I would run the mons on the VM nodes.

Was the restarted node an OSD+mon node? There is an issue where the OSD stop is not recognized early enough, because the mon also dies too fast. If you restart a node and shut down the ceph-osd first, the VMs see roughly 20 seconds less I/O stall.

Normally the VMs should handle short I/O stalls without trouble, but perhaps not?! (I don't know if discard is also a problem in this case.)

Udo
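
For reference, a minimal sketch of the maintenance sequence Udo describes, run on the node about to be restarted and assuming systemd-managed OSDs (ceph-osd.target, as shipped with Ceph Jewel); the noout flag is an addition not mentioned in the thread, used to keep the cluster from rebalancing during the short downtime:
Code:
# keep the cluster from marking OSDs out and rebalancing during the reboot
ceph osd set noout

# stop this node's OSDs cleanly so the cluster notices the stop immediately
systemctl stop ceph-osd.target

reboot

# once the node is back and its OSDs have rejoined the cluster:
ceph osd unset noout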
 
There is an issue where the OSD stop is not recognized early enough, because the mon also dies too fast. If you restart a node and shut down the ceph-osd first, the VMs see roughly 20 seconds less I/O stall.
Answering myself:
I got an email that this bug (#18516) is solved now, but I don't know how long it will take for the change to land in a Ceph release (I guess 10.2.6).

Udo
 
Make sure that you use the virtio-scsi controller (not LSI), see the VM options. I remember some panics when using LSI recently, but I did not debug it further, as modern OSes should use virtio-scsi anyway.

They are set to LSI; I'll do the switch. Thank you.
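
For reference, a minimal sketch of verifying the change and restarting a guest so it actually boots with the new controller, assuming VMID 100 from the config above and that a full stop/start is needed for the controller change to take effect:
Code:
# confirm the controller setting now appears in the VM config
qm config 100 | grep scsihw

# stop and start the VM so it comes up on the virtio-scsi controller
qm shutdown 100
qm start 100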
 
Hi Rob,
I'm not sure if this helps with this issue, but I had a separate Ceph cluster (8 nodes) where the mons ran on the PVE nodes.
So I would run the mons on the VM nodes.

Was the restarted node an OSD+mon node? There is an issue where the OSD stop is not recognized early enough, because the mon also dies too fast. If you restart a node and shut down the ceph-osd first, the VMs see roughly 20 seconds less I/O stall.

Normally the VMs should handle short I/O stalls without trouble, but perhaps not?! (I don't know if discard is also a problem in this case.)

Udo

Udo: yes, the restarted node ran mon+OSD.
 
