So, I am running a 6-node cluster with a dedicated 10Gb backend Corosync network carrying no other traffic, and a 10Gb frontend network dedicated to the various VLANs and networks our VMs/CTs run on. I am also running PBS for backups, with a 20Gb bonded link (802.3ad) to that system. Internet is a business-dedicated 1Gbps x 1Gbps fiber link that rarely sees above about 20% utilization.
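For reference, the bond to PBS is a standard 802.3ad setup in /etc/network/interfaces; this is a sanitized sketch, so the NIC names are placeholders, not my actual interfaces:

Code:
# /etc/network/interfaces on the PVE side (sketch; eno1/eno2 are placeholders)
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4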
All VMs and CTs are on dedicated Intel datacenter-grade NVMe mirrored storage (ZFS) local to each node; PBS is on the same kind of mirrored NVMe (ZFS) storage.
So I spin up a CT aptly named mediamaster that runs a few Plex-related things: basically nzbget, sonarr, radarr, portainer, and plexpy. Plex itself is in another CT on another node, running just fine, including GPU transcoding on a Tesla P4.
I run the Plex-related CTs privileged since I mount the media via NFS, served by dual TrueNAS Core systems (100Gb uplinks to our core).
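Inside the CT the shares are plain NFS mounts in /etc/fstab over the vlan55 leg; a sanitized example, where the TrueNAS IP and export path are placeholders:

Code:
# /etc/fstab inside the CT (server IP and export path are placeholders)
10.200.55.10:/mnt/tank/media  /mnt/media  nfs  rw,hard,vers=4.1  0  0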
I had issues with this same CT before. On Proxmox 7.4 the CT ran great for over a year, never an issue. Then when I upgraded to 8, it started crashing ANY node I put it on, and crashing it hard; the only way to get the node back was to kill -9 the PID of the LXC. I thought I had fixed the problem by rebuilding the CT from scratch from a Turnkey image and reloading everything.
Then, about 3 or 4 days ago, it started happening again. No matter which node I put it on, the entire node goes unresponsive in the web UI, and I have to kill -9 the LXC to get that node back.
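By "kill -9 the LXC" I mean SSHing into the node (the web UI is already gone at that point), finding the CT's PID, and killing it, roughly:

Code:
# on the affected node, over SSH
lxc-info -n 146 -p   # prints the init PID of CT 146
kill -9 <PID>        # substitute the PID from above; the only thing that brings the node back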
First, I want to figure out what the heck is causing the problem, and second, I want to know how or why this can happen at all if Proxmox is considered enterprise-ready.
Here is the config for the CT in question; hopefully someone can spot whether I have pulled some kind of boneheaded stunt.
vlan55 is a dedicated VLAN for NFS traffic to my TrueNAS systems; vlan50 is the frontend network for the CT.
Code:
root@proxmox03:~# cat /etc/pve/nodes/proxmox03/lxc/146.conf
arch: amd64
cores: 12
features: mount=nfs,nesting=1
hostname: mediamaster
memory: 16384
nameserver: 10.200.0.1
net0: name=eth0,bridge=vmbr50,gw=10.200.50.1,hwaddr=7E:E2:40:D7:55:C0,ip=10.200.50.6/24,type=veth
net1: name=eth1,bridge=vmbr55,hwaddr=32:2F:71:AF:E7:F8,ip=10.200.55.6/24,type=veth
onboot: 0
ostype: debian
rootfs: ssdimages:subvol-146-disk-1,size=500G
swap: 512
tags: plex