Strange behavior of PVE hyperconverged cluster

jedo19

Active Member
Nov 16, 2018
I have a 5-node PVE cluster (PVE 8.0.4) built from some dated hardware; it consists of:
2 nodes with 1x E5-2640 v3 CPU, 96G RAM, 2x 320G SAS disks in RAID1 for the system and 2x 8T SAS for Ceph
1 node with 2x E5-2640 v2 CPUs, 64G RAM, 2x 512G SATA disks in RZ1 for the system and 2x 8T SATA for Ceph + a 512G NVMe for the DB/WAL devices
1 node with 2x E5-2640 v0 CPUs, 64G RAM, 2x 128G SAS disks in RAID1 for the system and 12x 2T SAS for Ceph
1 node with 2x E5-2630 v0 CPUs, 64G RAM, 2x 256G SAS disks in RAID1 for the system and 4x 4T SAS for Ceph

I run Ceph MONs on nodes 1 and 2.

All of the nodes have dual 10G cards: one port is used for Ceph, the other for the world-facing bridge, and a separate 1G link is used for the nodes to talk to each other (a simplified sketch is below).
Mainly two nodes carry the production containers/VMs; the other three are used more or less for their disks or for testing VMs/CTs.
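
The per-node network config looks roughly like this (interface names and addresses are simplified placeholders, not copied from the actual nodes):

# /etc/network/interfaces (simplified placeholder)

# first 10G port: Ceph traffic
auto eno1
iface eno1 inet static
        address 10.10.10.11/24

# world-facing bridge on the second 10G port
auto vmbr0
iface vmbr0 inet static
        address 192.0.2.11/24
        gateway 192.0.2.1
        bridge-ports eno2
        bridge-stp off
        bridge-fd 0

# 1G link: node-to-node cluster (corosync) traffic
auto eno3
iface eno3 inet static
        address 10.10.20.11/24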

The cluster mostly works without any problems. I have had some disk failures, etc., and it healed itself nicely, so I am fairly satisfied with the setup. Someone might say the storage on the Ceph nodes is not balanced if you look closely, but I tried to match 16T on each node, except node 4, which carries 24T.
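
For anyone who wants to double-check that distribution, the raw capacity and usage per host and per OSD can be listed with:

ceph osd df tree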

Now to the problem.
One of the VMs had a 6T disk attached to it as transitional storage. At one point I was testing snapshots, so I snapshotted this VM; all was good at that point. I think I removed it manually from the pool at some point, but I don't really remember. The other night I was looking at the VM's snapshots through PVE's web interface and I still saw that snapshot (although it didn't exist on Ceph anymore). I clicked delete and tried to remove the snapshot; obviously, that did not finish and it failed. The VM lived on node 3 and the delete was issued through node 1.
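
For reference, this is roughly what I was planning to check on the Ceph and PVE side (I have not done it yet; the pool name "ceph-vm", VMID 100, image name "vm-100-disk-0" and snapshot name "testsnap" below are placeholders, not my real values):

# does the snapshot still exist on the Ceph side?
rbd ls ceph-vm
rbd snap ls ceph-vm/vm-100-disk-0

# what PVE itself still has recorded for the snapshot
cat /etc/pve/qemu-server/100.conf

# if the snapshot only exists in the config, qm can drop the config entry
# even when removing the storage-side snapshot fails
qm delsnapshot 100 testsnap --force 1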

At that moment I observed strange behavior on node 1. The load slowly crept up, and "ps" just froze and gave no output. At a load of around 90 I tried restarting all of the PVE services, with no luck. I managed to migrate the VMs off, but the CTs would not die. After the migration I rebooted the node, and it came back about 10 minutes later. After Ceph had healed, I ran a backup of one of the CTs on node 2, and the same thing happened: the load started creeping up and "ps" gave no output on the console. After a reboot everything settled down, and it has worked fine ever since.
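
If it happens again, this is what I plan to capture before rebooting (generic commands, nothing node-specific; since "ps" itself may hang again, dmesg is probably the more reliable place to look):

# processes stuck in uninterruptible sleep (D state) would explain a climbing load with no CPU use
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

# kernel messages about blocked or hung tasks
dmesg -T | grep -i -e "blocked for more than" -e "hung task"

# Ceph and PVE storage health at that moment
ceph -s
pvesm status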

I was wondering why something like this would happen, and where I should be looking if it happens again. Restarting is really the last resort, so I want to understand the issue.

B.
 
