Cluster slowing down and cutting connection - pvesr cpu 100%

Veikko

Hi!
I just updated a 3-node cluster to the latest version. I am on the community subscription, so it's 5.4-5. Detailed info below.

Soon after updating, I found that I could not start a container on one of the nodes. I had migrated the VMs off each node before restarting it, and noticed the problem when I migrated the VMs and CTs back. Then the node started dropping off the GUI, and Ceph started showing symptoms as well (OSDs dropping out, etc.). I have restarted the node a couple of times, but I cannot get rid of the symptoms. Now one of the other nodes seems to be slowing down too, and its VMs are becoming unresponsive.
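
For anyone hitting the same symptoms, commands along these lines should pull out more detail; the CT ID 101 and the log path are just placeholders:

# Ceph health and OSD state on the affected node
ceph -s
ceph osd tree

# Start a failing container in the foreground with debug logging
lxc-start -n 101 -F -l DEBUG -o /tmp/lxc-101.log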

A bug in the latest build? This is, after all, a production environment, hence the subscription.

Any ideas?

proxmox-ve: 5.4-1 (running kernel: 4.15.18-14-pve)
pve-manager: 5.4-5 (running version: 5.4-5/c6fdb264)
pve-kernel-4.15: 5.4-2
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-14-pve: 4.15.18-39
pve-kernel-4.15.18-12-pve: 4.15.18-36
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-2-pve: 4.15.18-21
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-3-pve: 4.13.16-50
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-4-pve: 4.13.13-35
pve-kernel-4.13.13-1-pve: 4.13.13-31
pve-kernel-4.13.4-1-pve: 4.13.4-26
ceph: 12.2.12-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-9
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-51
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-42
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-26
pve-cluster: 5.0-37
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-20
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-51
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2
pvecm status
Quorum information
------------------
Date: Mon May 20 16:17:25 2019
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1/1976
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.10.41 (local)
0x00000002 1 192.168.10.42
0x00000003 1 192.168.10.43
 
Hi!
Now I have a little more info: after a restart, my nodes seem to work fine for about 5-10 minutes. Then a pvesr process starts to build up and quickly drives all CPU cores to 100%. I searched the forums and there was a similar case in February, but no solution there. Any hints? I am running production services on this cluster, so I need a solution ASAP.
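
For context, pvesr is the storage replication runner; on PVE 5.x it is triggered every minute by a systemd timer, so it can be inspected roughly like this (nothing here is specific to my setup):

# When the replication runner last fired and when it fires next
systemctl list-timers pvesr.timer

# Recent runs of the replication service
journalctl -u pvesr.service --since "1 hour ago"

# Status of all configured replication jobs
pvesr status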

I have zero replication jobs set up. My nodes use ZFS for the system disk, and Ceph storage with 4 OSDs per node plus 1 SSD cache device per node.
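
To double-check that no replication jobs exist, something like this should do (replication.cfg is the cluster-wide job configuration on the pmxcfs filesystem):

# List configured replication jobs (should be empty here)
pvesr list

# Cluster-wide replication job definitions
cat /etc/pve/replication.cfg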

Watching htop over SSH, it is obvious that the process drives all cores to 100%, and soon after that the node disappears (red cross) from the GUI.
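
As far as I understand, the red cross usually just means the node has stopped reporting status, so checking the status daemon while the node is wedged might help (assuming the standard PVE services):

# The GUI marks a node red when pvestatd stops reporting
systemctl status pvestatd pveproxy pve-cluster

# Restarting the status daemon sometimes brings the node back in the GUI
systemctl restart pvestatd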
 

Attachments

  • pve2-report-Mon-20-May-2019-18-49.txt
    66.7 KB
I have now shut down the defective node, as pvesr stalled the whole machine, making Ceph remarkably slow and thus slowing down all the VMs. Ceph is now in a degraded state (3/1 configuration, so no data-loss problems). I will fire up the defective node once more and try to gather some information. But if it does the same thing again, I think I will need to remove it from the cluster and re-install? At least I have come up with no other solution.
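
Before firing it up and shutting it down again, it is probably worth telling Ceph not to rebalance while the node is out; these are standard Ceph commands, nothing cluster-specific:

# Keep Ceph from marking the node's OSDs out and rebalancing
ceph osd set noout

# ... shut down / debug the node ...

# Watch recovery, then restore normal behaviour afterwards
ceph -s
ceph osd unset noout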
 
