How can a single CT take down an entire Node within a cluster?

helojunkie

Well-Known Member
Jul 28, 2017
San Diego, CA
So, I am running a 6-node cluster with a dedicated 10Gb backend Corosync network carrying no other traffic, and a 10Gb frontend network dedicated to the various VLANs and networks our VMs/CTs run on. I am also running PBS for backups, with a 20Gb bonded link (802.3ad) to that system. Internet is a dedicated business fiber link, 1Gbps symmetric, which rarely sees above about 20% utilization.

All VMs and CTs are on dedicated Intel datacenter-grade NVMe mirrored storage (ZFS) local to each node; PBS uses the same mirrored NVMe (ZFS) storage.

So I spin up a CT aptly named mediamaster that runs a few Plex-related things: basically nzbget, sonarr, radarr, portainer, and plexpy. Plex itself is in another CT on another node and runs just fine, including GPU transcoding on a Tesla P4.

I run the Plex CTs privileged, since I mount the media via NFS served by dual TrueNAS Core systems (100Gb uplinks to our core).
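For context, mounting NFS inside a privileged CT with the `mount=nfs` feature typically looks something like this in the container's /etc/fstab (server name and paths here are hypothetical, not from the thread):

```
# /etc/fstab inside the privileged CT (hostname and paths are hypothetical)
truenas01:/mnt/tank/media  /mnt/media  nfs  vers=4.2,hard,timeo=600  0  0
```

Worth noting for later in this thread: `hard` NFS mounts (the default) block processes in uninterruptible sleep (D state) if the server or export ever stalls, and that kind of hang can spread to host-side tooling that touches the container.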

So I had some issues with this same CT before. I was running Proxmox 7.4 and the CT ran great for over a year; never had issues. Then, when I upgraded to 8, it started crashing ANY node I put it on, and crashing it hard. The only way to get the node back was to kill -9 the PID of the LXC. I thought I had fixed the problem by rebuilding the CT from scratch from a TurnKey image and reloading everything.

Then, about 3 or 4 days ago, it started happening again. No matter which node I put it on, it causes the entire node to go unresponsive in the web UI, and I have to kill -9 the LXC to get that node back.
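For anyone following along, here is a sketch of the recovery step described above (the CT id is from this thread; the output format is standard lxc tooling, verify on your own node before relying on it):

```shell
# Find the monitor PID of the stuck container and kill it hard.
# lxc-info prints a line like "PID:            4321"; awk grabs the last field.
CTID=146
if command -v lxc-info >/dev/null 2>&1; then
    PID=$(lxc-info -n "$CTID" -p 2>/dev/null | awk '{print $NF}')
    [ -n "$PID" ] && kill -9 "$PID"
fi

# The same parsing, shown on a captured sample line:
echo 'PID:            4321' | awk '{print $NF}'    # prints 4321
```

kill -9 on the container's monitor process is very much a last resort; a cleaner first attempt would be `pct stop 146`, which in this thread's failure mode presumably hangs like the other `pct` commands.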

First, I want to figure out what on earth is causing the problem; second, I want to know how or why, if Proxmox is considered enterprise-ready, this could happen at all.


Here is the config for the CT in question; hopefully someone can see whether I have pulled some kind of boneheaded stunt.

vlan55 is a dedicated VLAN for NFS traffic to my TrueNAS systems; vlan50 is the frontend IP for the CT.

Code:
root@proxmox03:~# cat /etc/pve/nodes/proxmox03/lxc/146.conf
arch: amd64
cores: 12
features: mount=nfs,nesting=1
hostname: mediamaster
memory: 16384
nameserver: 10.200.0.1
net0: name=eth0,bridge=vmbr50,gw=10.200.50.1,hwaddr=7E:E2:40:D7:55:C0,ip=10.200.50.6/24,type=veth
net1: name=eth1,bridge=vmbr55,hwaddr=32:2F:71:AF:E7:F8,ip=10.200.55.6/24,type=veth
onboot: 0
ostype: debian
rootfs: ssdimages:subvol-146-disk-1,size=500G
swap: 512
tags: plex
 
So I had some issues with this same CT before. I was running Proxmox 7.4 and the CT ran great for over a year; never had issues. Then, when I upgraded to 8, it started crashing ANY node I put it on, and crashing it hard. The only way to get the node back was to kill -9 the PID of the LXC. I thought I had fixed the problem by rebuilding the CT from scratch from a TurnKey image and reloading everything.
How was this upgrade performed? From what I recall, the Proxmox 8 upgrade has you reload all of the local storage as if installing fresh. Did you do this, or did you attempt an in-place upgrade?

CTs share the kernel with whatever node they run on, so since this particular CT is crashing the node, I can only think of some kernel-level error or event being triggered.
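Since the container shares the host kernel, a kernel-level fault usually leaves a trace in the host's kernel log. A sketch of what to check right after the CT wedges (the message patterns below are the common hung-task/OOM/NFS signatures in general, not messages taken from this thread):

```shell
# Scan recent kernel messages for the classic lockup signatures.
# Either command may require root depending on kernel.dmesg_restrict.
{ dmesg 2>/dev/null || journalctl -k --no-pager 2>/dev/null || true; } \
    | grep -iE 'hung task|blocked for more than|out of memory|nfs.*not responding' \
    || echo 'no obvious kernel-level errors logged'
```

A "task ... blocked for more than 120 seconds" warning in particular would point at I/O (often NFS) rather than a bug in the container itself.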

The only way to get the node back was to kill -9 the PID of the LXC. I thought I had fixed the problem by rebuilding the CT from scratch from a TurnKey image and reloading everything.

Is there a way you can verify whether it is storage-related at all? Given the time span between completely fresh CTs, I am wondering if there is a log or disk write going crazy that floods the CT storage somehow and, once it caps out, kills the CT and the node.
 
First, thank you for your help...

How was this upgrade performed? From what I recall, the Proxmox 8 upgrade has you reload all of the local storage as if installing fresh. Did you do this, or did you attempt an in-place upgrade?

So at the same time I upgraded to 8, I added two additional nodes. Both were totally fresh bare-metal installs that were then added to the cluster. I have tried the CT on every node in my cluster, and regardless of which node it's on, within about an hour the node has 'gone gray'. Importantly, the node and all its CTs/VMs continue to operate as normal except the one CT in question; I just lose the web UI.

So 'crashing the node' may be the wrong wording: I can SSH into the node, but I cannot run any `pct` commands via the CLI, while `qm` commands work just fine.
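That symptom pattern (`pct` hangs, `qm` fine) often points at a process stuck in uninterruptible sleep, e.g. on NFS I/O, that container-related tooling then blocks behind. A hedged way to look for it from the SSH session (a standard procps invocation, nothing Proxmox-specific, and not a command actually run in this thread):

```shell
# List processes in D (uninterruptible) state plus the kernel function they
# are blocked in; NFS-related wchan values would support the storage theory.
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'
```

D-state processes cannot be killed, not even with -9, until the I/O they are waiting on completes or times out, which would also explain why killing the LXC monitor is the only thing that "works".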


Is there a way you can verify whether it is storage-related at all? Given the time span between completely fresh CTs, I am wondering if there is a log or disk write going crazy that floods the CT storage somehow and, once it caps out, kills the CT and the node.

When you say 'storage', which storage? The NFS storage? We have a TON of stuff using that storage, including Nextcloud, so I don't think it has anything to do with NFS. If you mean the storage of the CT itself, I have given it 500GB, it has never exceeded 100GB, and it's on NVMe.
 
So 'crashing the node' may be the wrong wording: I can SSH into the node, but I cannot run any `pct` commands via the CLI, while `qm` commands work just fine.
That's something. Have you checked the journalctl logs for anything suspicious, and have you checked the LXC services?
Code:
systemctl status lxc-net.service
systemctl status lxc-monitord.service
systemctl status lxc.service
systemctl status lxcfs.service
 
That's something. Have you checked the journalctl logs for anything suspicious, and have you checked the LXC services?
Code:
systemctl status lxc-net.service
systemctl status lxc-monitord.service
systemctl status lxc.service
systemctl status lxcfs.service

OK, I have not checked any of this yet; I am assuming I should do so while the CT is in the failed state. I will restart it and run these when it fails.

Thank You
 
OK, so I restarted the LXC container in question, and it "crashed" the node about 10 minutes later. Again, I have access to the node via SSH and the VMs on the node are running, but the LXC container is non-responsive: I can't console in, SSH in, or otherwise access it. Here is the output of the commands you asked about:


Code:
root@proxmox03:~# systemctl status lxc-net.service
● lxc-net.service - LXC network bridge setup
     Loaded: loaded (/lib/systemd/system/lxc-net.service; enabled; preset: enabled)
     Active: active (exited) since Fri 2023-09-22 09:50:54 PDT; 3 weeks 0 days ago
       Docs: man:lxc
   Main PID: 3315 (code=exited, status=0/SUCCESS)
        CPU: 4ms

Sep 22 09:50:54 proxmox03 systemd[1]: Starting lxc-net.service - LXC network bridge setup...
Sep 22 09:50:54 proxmox03 systemd[1]: Finished lxc-net.service - LXC network bridge setup.


Code:
root@proxmox03:~# systemctl status lxc-monitord.service
● lxc-monitord.service - LXC Container Monitoring Daemon
     Loaded: loaded (/lib/systemd/system/lxc-monitord.service; enabled; preset: enabled)
     Active: active (running) since Fri 2023-09-22 09:50:54 PDT; 3 weeks 0 days ago
       Docs: man:lxc
   Main PID: 3313 (lxc-monitord)
      Tasks: 1 (limit: 77094)
     Memory: 560.0K
        CPU: 8ms
     CGroup: /system.slice/lxc-monitord.service
             └─3313 /usr/libexec/lxc/lxc-monitord --daemon

Sep 22 09:50:54 proxmox03 systemd[1]: Started lxc-monitord.service - LXC Container Monitoring Daemon.


Code:
root@proxmox03:~# systemctl status lxc.service
● lxc.service - LXC Container Initialization and Autoboot Code
     Loaded: loaded (/lib/systemd/system/lxc.service; enabled; preset: enabled)
     Active: active (exited) since Fri 2023-09-22 09:50:54 PDT; 3 weeks 0 days ago
       Docs: man:lxc-autostart
             man:lxc
   Main PID: 3379 (code=exited, status=0/SUCCESS)
        CPU: 35ms

Sep 22 09:50:54 proxmox03 systemd[1]: Starting lxc.service - LXC Container Initialization and Autoboot Code...
Sep 22 09:50:54 proxmox03 systemd[1]: Finished lxc.service - LXC Container Initialization and Autoboot Code.

Code:
root@proxmox03:~# systemctl status lxcfs.service
● lxcfs.service - FUSE filesystem for LXC
     Loaded: loaded (/lib/systemd/system/lxcfs.service; enabled; preset: enabled)
     Active: active (running) since Fri 2023-09-22 09:50:52 PDT; 3 weeks 0 days ago
       Docs: man:lxcfs(1)
   Main PID: 2654 (lxcfs)
      Tasks: 9 (limit: 77094)
     Memory: 7.3M
        CPU: 12.364s
     CGroup: /system.slice/lxcfs.service
             └─2654 /usr/bin/lxcfs /var/lib/lxcfs

Sep 22 09:50:52 proxmox03 lxcfs[2654]: - proc_meminfo
Sep 22 09:50:52 proxmox03 lxcfs[2654]: - proc_stat
Sep 22 09:50:52 proxmox03 lxcfs[2654]: - proc_swaps
Sep 22 09:50:52 proxmox03 lxcfs[2654]: - proc_uptime
Sep 22 09:50:52 proxmox03 lxcfs[2654]: - proc_slabinfo
Sep 22 09:50:52 proxmox03 lxcfs[2654]: - shared_pidns
Sep 22 09:50:52 proxmox03 lxcfs[2654]: - cpuview_daemon
Sep 22 09:50:52 proxmox03 lxcfs[2654]: - loadavg_daemon
Sep 22 09:50:52 proxmox03 lxcfs[2654]: - pidfds
Sep 22 09:50:52 proxmox03 lxcfs[2654]: Ignoring invalid max threads value 4294967295 > max (100000).


Here is the latest output of journalctl:

Code:
Oct 13 11:55:32 proxmox03 kernel: EDID block 0 (tag 0x00) checksum is invalid, remainder is 233
Oct 13 11:55:32 proxmox03 kernel: nouveau 0000:42:00.0: DRM: DDC responded, but no EDID for VGA-1
Oct 13 11:55:42 proxmox03 kernel: EDID block 0 (tag 0x00) checksum is invalid, remainder is 233
Oct 13 11:55:42 proxmox03 kernel: nouveau 0000:42:00.0: DRM: DDC responded, but no EDID for VGA-1
Oct 13 11:55:52 proxmox03 kernel: EDID block 0 (tag 0x00) checksum is invalid, remainder is 233
Oct 13 11:55:52 proxmox03 kernel: nouveau 0000:42:00.0: DRM: DDC responded, but no EDID for VGA-1
Oct 13 11:56:02 proxmox03 kernel: EDID block 0 (tag 0x00) checksum is invalid, remainder is 233
Oct 13 11:56:02 proxmox03 kernel: nouveau 0000:42:00.0: DRM: DDC responded, but no EDID for VGA-1
Oct 13 11:56:13 proxmox03 kernel: EDID block 0 (tag 0x00) checksum is invalid, remainder is 233
Oct 13 11:56:13 proxmox03 kernel: nouveau 0000:42:00.0: DRM: DDC responded, but no EDID for VGA-1
Oct 13 11:56:20 proxmox03 sshd[2726740]: Accepted publickey for root from 10.200.70.5 port 45232 ssh2: RSA SHA256:
Oct 13 11:56:20 proxmox03 sshd[2726740]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Oct 13 11:56:20 proxmox03 systemd-logind[2674]: New session 144464 of user root.
Oct 13 11:56:20 proxmox03 systemd[1]: Started session-144464.scope - Session 144464 of User root.
Oct 13 11:56:20 proxmox03 sshd[2726740]: pam_env(sshd:session): deprecated reading of user environment enabled
Oct 13 11:56:21 proxmox03 sshd[2726740]: Received disconnect from 10.200.70.5 port 45232:11: disconnected by user
Oct 13 11:56:21 proxmox03 sshd[2726740]: Disconnected from user root 10.200.70.5 port 45232
Oct 13 11:56:21 proxmox03 sshd[2726740]: pam_unix(sshd:session): session closed for user root
Oct 13 11:56:21 proxmox03 systemd[1]: session-144464.scope: Deactivated successfully.
Oct 13 11:56:21 proxmox03 systemd-logind[2674]: Session 144464 logged out. Waiting for processes to exit.
Oct 13 11:56:21 proxmox03 systemd-logind[2674]: Removed session 144464.
Oct 13 11:56:21 proxmox03 sshd[2726749]: Accepted publickey for root from 10.200.70.5 port 45236 ssh2: RSA SHA256:
Oct 13 11:56:21 proxmox03 sshd[2726749]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Oct 13 11:56:21 proxmox03 systemd-logind[2674]: New session 144465 of user root.
Oct 13 11:56:21 proxmox03 systemd[1]: Started session-144465.scope - Session 144465 of User root.
Oct 13 11:56:21 proxmox03 sshd[2726749]: pam_env(sshd:session): deprecated reading of user environment enabled
Oct 13 11:56:22 proxmox03 sshd[2726749]: Received disconnect from 10.200.70.5 port 45236:11: disconnected by user
Oct 13 11:56:22 proxmox03 sshd[2726749]: Disconnected from user root 10.200.70.5 port 45236
Oct 13 11:56:22 proxmox03 sshd[2726749]: pam_unix(sshd:session): session closed for user root
Oct 13 11:56:22 proxmox03 systemd[1]: session-144465.scope: Deactivated successfully.
Oct 13 11:56:22 proxmox03 systemd-logind[2674]: Session 144465 logged out. Waiting for processes to exit.
Oct 13 11:56:22 proxmox03 systemd-logind[2674]: Removed session 144465.
Oct 13 11:56:22 proxmox03 pmxcfs[3502]: [status] notice: received log
Oct 13 11:56:22 proxmox03 sshd[2726756]: Accepted publickey for root from 10.200.70.5 port 45248 ssh2: RSA SHA256:
Oct 13 11:56:22 proxmox03 sshd[2726756]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Oct 13 11:56:22 proxmox03 systemd-logind[2674]: New session 144466 of user root.
Oct 13 11:56:22 proxmox03 systemd[1]: Started session-144466.scope - Session 144466 of User root.
Oct 13 11:56:22 proxmox03 sshd[2726756]: pam_env(sshd:session): deprecated reading of user environment enabled
Oct 13 11:56:23 proxmox03 kernel: EDID block 0 (tag 0x00) checksum is invalid, remainder is 233
Oct 13 11:56:23 proxmox03 kernel: nouveau 0000:42:00.0: DRM: DDC responded, but no EDID for VGA-1
Oct 13 11:56:23 proxmox03 sshd[2726756]: Received disconnect from 10.200.70.5 port 45248:11: disconnected by user
Oct 13 11:56:23 proxmox03 sshd[2726756]: Disconnected from user root 10.200.70.5 port 45248
Oct 13 11:56:23 proxmox03 sshd[2726756]: pam_unix(sshd:session): session closed for user root
Oct 13 11:56:23 proxmox03 systemd[1]: session-144466.scope: Deactivated successfully.
Oct 13 11:56:23 proxmox03 systemd-logind[2674]: Session 144466 logged out. Waiting for processes to exit.
Oct 13 11:56:23 proxmox03 systemd-logind[2674]: Removed session 144466.
Oct 13 11:56:24 proxmox03 sshd[2727871]: Accepted publickey for root from 10.200.70.5 port 45254 ssh2: RSA SHA256:
Oct 13 11:56:24 proxmox03 sshd[2727871]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Oct 13 11:56:24 proxmox03 systemd-logind[2674]: New session 144467 of user root.
Oct 13 11:56:24 proxmox03 systemd[1]: Started session-144467.scope - Session 144467 of User root.
Oct 13 11:56:24 proxmox03 sshd[2727871]: pam_env(sshd:session): deprecated reading of user environment enabled
Oct 13 11:56:25 proxmox03 sshd[2727871]: Received disconnect from 10.200.70.5 port 45254:11: disconnected by user
Oct 13 11:56:25 proxmox03 sshd[2727871]: Disconnected from user root 10.200.70.5 port 45254
Oct 13 11:56:25 proxmox03 sshd[2727871]: pam_unix(sshd:session): session closed for user root
Oct 13 11:56:25 proxmox03 systemd[1]: session-144467.scope: Deactivated successfully.
Oct 13 11:56:25 proxmox03 systemd-logind[2674]: Session 144467 logged out. Waiting for processes to exit.
Oct 13 11:56:25 proxmox03 systemd-logind[2674]: Removed session 144467.
Oct 13 11:56:33 proxmox03 kernel: EDID block 0 (tag 0x00) checksum is invalid, remainder is 233
Oct 13 11:56:33 proxmox03 kernel: nouveau 0000:42:00.0: DRM: DDC responded, but no EDID for VGA-1
Oct 13 11:56:43 proxmox03 kernel: EDID block 0 (tag 0x00) checksum is invalid, remainder is 233
Oct 13 11:56:43 proxmox03 kernel: nouveau 0000:42:00.0: DRM: DDC responded, but no EDID for VGA-1
Oct 13 11:56:54 proxmox03 kernel: EDID block 0 (tag 0x00) checksum is invalid, remainder is 233
Oct 13 11:56:54 proxmox03 kernel: nouveau 0000:42:00.0: DRM: DDC responded, but no EDID for VGA-1
Oct 13 11:57:04 proxmox03 kernel: EDID block 0 (tag 0x00) checksum is invalid, remainder is 233
Oct 13 11:57:04 proxmox03 kernel: nouveau 0000:42:00.0: DRM: DDC responded, but no EDID for VGA-1



Other than an SSH connection from another of my nodes every second (I assume this is part of the clustering?) and something about a VGA card, I don't see anything that stands out.
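For what it's worth, a gray node in the web UI generally means the node stopped reporting status rather than that it is down, and the usual suspect is pvestatd blocking (often on storage). A possible next check, which is an assumption on my part since the thread never ran it:

```shell
# Check the state of the main PVE services; "active" is healthy.
# pvestatd feeds the UI status, pvedaemon serves the API, pveproxy the web UI.
for svc in pvestatd pvedaemon pveproxy; do
    state=$(systemctl is-active "$svc" 2>/dev/null) || true
    echo "$svc: ${state:-unknown}"
done
```

If pvestatd shows active but the node is still gray, `systemctl restart pvestatd` sometimes brings the UI back without rebooting, though it would not fix whatever the daemon was stuck on.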
 
Can you post the output of

Code:
systemctl status pve-container@[ctid_of_bad_container].service
 
I was never able to figure this one out; recreating the CT and trying everything from scratch yielded the exact same results, regardless of which node I used to create it. The only thing I can think of is that I was running Docker on it (which I do in a lot of CTs), but this one had a LOT of Docker containers in it, while my others typically only have one or two, mostly for monitoring.

I created a VM and moved everything over, and the problem has not recurred.
 