PMXCFS Ghost Config File

britmob · Oct 8, 2022

Hello, I am seeing a particularly odd issue with pmxcfs.

There is a configuration file for an lxc that appears to exist in /etc/pve/nodes/[node]/lxc, but I cannot actually see it.

See here:

Code:

root@hoopy:/etc/pve/nodes/hoopy# systemctl status pve-ha-crm
● pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2022-10-07 12:41:45 EDT; 1 day 3h ago
   Main PID: 1292 (pve-ha-crm)
      Tasks: 1 (limit: 26282)
     Memory: 65.6M
        CPU: 25.235s
     CGroup: /system.slice/pve-ha-crm.service
             └─1292 pve-ha-crm

Oct 08 16:39:11 hoopy pve-ha-crm[1292]: recover service 'ct:118' to previous failed and fenced node 'gfman' again
Oct 08 16:39:11 hoopy pve-ha-crm[1292]: got unexpected error - Configuration file 'nodes/gfman/lxc/118.conf' does not exist
Oct 08 16:39:21 hoopy pve-ha-crm[1292]: recover service 'ct:118' to previous failed and fenced node 'gfman' again
Oct 08 16:39:21 hoopy pve-ha-crm[1292]: got unexpected error - Configuration file 'nodes/gfman/lxc/118.conf' does not exist
Oct 08 16:39:31 hoopy pve-ha-crm[1292]: recover service 'ct:118' to previous failed and fenced node 'gfman' again
Oct 08 16:39:31 hoopy pve-ha-crm[1292]: got unexpected error - Configuration file 'nodes/gfman/lxc/118.conf' does not exist
Oct 08 16:39:41 hoopy pve-ha-crm[1292]: recover service 'ct:118' to previous failed and fenced node 'gfman' again
Oct 08 16:39:41 hoopy pve-ha-crm[1292]: got unexpected error - Configuration file 'nodes/gfman/lxc/118.conf' does not exist
Oct 08 16:39:51 hoopy pve-ha-crm[1292]: recover service 'ct:118' to previous failed and fenced node 'gfman' again
Oct 08 16:39:51 hoopy pve-ha-crm[1292]: got unexpected error - Configuration file 'nodes/gfman/lxc/118.conf' does not exist


root@hoopy:/etc/pve/nodes/gfman/lxc# touch 118.conf
touch: cannot touch '118.conf': File exists
root@hoopy:/etc/pve/nodes/gfman/lxc# cat 118.conf
cat: 118.conf: No such file or directory
root@hoopy:/etc/pve/nodes/gfman/lxc# ls
108.conf  127.conf  128.conf

118.conf cannot be modified, nor can its contents be read, yet some reference to it prevents me from recreating or amending it.

This stops all HA functions from working, rendering my cluster dead.

I have tried restarting all nodes at the same time, restarting pve-ha-crm/lrm, and moving this config file around to other nodes, but nothing seems to work. I have no idea what is wrong, but it appears to be a lower-level issue with pmxcfs.

If someone could please advise me :'(

fiona · Oct 10, 2022

Hi,

britmob said:
root@hoopy:/etc/pve/nodes/gfman/lxc# touch 118.conf
touch: cannot touch '118.conf': File exists
root@hoopy:/etc/pve/nodes/gfman/lxc# cat 118.conf
cat: 118.conf: No such file or directory

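pmxcfs is special in this regard: when a guest config file is created, it also checks whether a file with that ID already exists for another node. That is why touch can report "File exists" even though the file is not in the directory you are looking at.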

britmob said:
This stops all HA functions from working, rendering my cluster dead.

I have tried restarting all nodes at the same time, restarting pve-ha-crm/lrm, and moving this config file around to other nodes, but nothing seems to work. I have no idea what is wrong, but it appears to be a lower-level issue with pmxcfs.

Can you see the file in another node's folder with ls /etc/pve/nodes/*/lxc/118.conf? Please share the output of ha-manager status --verbose and pveversion -v.

Does removing the service ct:118 from HA and re-adding it work?
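
The cross-node check described above can be probed with a small wrapper around the same ls idea. This is only a sketch: the function name find_guest_config and its optional base-directory argument are hypothetical and not part of Proxmox VE; it assumes the usual /etc/pve/nodes/NODE/lxc and /etc/pve/nodes/NODE/qemu-server layout.

```shell
# Hypothetical helper, not a Proxmox VE command.
# Prints every path under any node directory that holds a config
# for the given guest ID.
# Usage: find_guest_config VMID [BASE_DIR]
find_guest_config() {
    vmid="$1"
    base="${2:-/etc/pve/nodes}"   # assumed default location of node directories
    for f in "$base"/*/lxc/"$vmid".conf "$base"/*/qemu-server/"$vmid".conf; do
        # An unmatched glob stays as a literal pattern, so confirm the path exists.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

Running find_guest_config 118 on any cluster node should then show under which node's directory the config actually lives, if it exists at all.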

britmob · Oct 10, 2022

I tried looking at other nodes for config files. I found that from any node XYZ, there was a file at /etc/pve/nodes/[XYZ]/lxc/118.conf on that local system. None of these files worked; they were just the blank pointers seen in the initial post.

I've attached the ha-manager output below.

I tried removing the HA entry, but it was stuck on deleting for some minutes. I then entirely removed the ct config file, restarted the ha daemons, and then replaced it from a backup. This helped, but now pve-ha-crm is complaining about a different container's config file instead of the original.

Additionally, here is the process status from systemctl on the master crm node.

Code:

root@cavejohnson:~# systemctl status pve-ha-crm
● pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2022-10-10 08:03:20 HDT; 6min ago
    Process: 3831101 ExecStart=/usr/sbin/pve-ha-crm start (code=exited, status=0/SUCCESS)
   Main PID: 3831102 (pve-ha-crm)
      Tasks: 1 (limit: 57672)
     Memory: 92.9M
        CPU: 659ms
     CGroup: /system.slice/pve-ha-crm.service
             └─3831102 pve-ha-crm

Oct 10 08:08:59 cavejohnson pve-ha-crm[3831102]: recover service 'ct:124' to previous failed and fenced node 'gfman' again
Oct 10 08:08:59 cavejohnson pve-ha-crm[3831102]: got unexpected error - Configuration file 'nodes/gfman/lxc/124.conf' does not exist
Oct 10 08:09:09 cavejohnson pve-ha-crm[3831102]: recover service 'ct:124' to previous failed and fenced node 'gfman' again
Oct 10 08:09:09 cavejohnson pve-ha-crm[3831102]: got unexpected error - Configuration file 'nodes/gfman/lxc/124.conf' does not exist
Oct 10 08:09:19 cavejohnson pve-ha-crm[3831102]: recover service 'ct:124' to previous failed and fenced node 'gfman' again
Oct 10 08:09:19 cavejohnson pve-ha-crm[3831102]: got unexpected error - Configuration file 'nodes/gfman/lxc/124.conf' does not exist
Oct 10 08:09:29 cavejohnson pve-ha-crm[3831102]: recover service 'ct:124' to previous failed and fenced node 'gfman' again
Oct 10 08:09:29 cavejohnson pve-ha-crm[3831102]: got unexpected error - Configuration file 'nodes/gfman/lxc/124.conf' does not exist
Oct 10 08:09:39 cavejohnson pve-ha-crm[3831102]: recover service 'ct:124' to previous failed and fenced node 'gfman' again
Oct 10 08:09:39 cavejohnson pve-ha-crm[3831102]: got unexpected error - Configuration file 'nodes/gfman/lxc/124.conf' does not exist

Code:

root@cavejohnson:~# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.35-3-pve)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-helper: 7.2-12
pve-kernel-5.15: 7.2-5
pve-kernel-5.15.35-3-pve: 5.15.35-6
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-3
libpve-http-server-perl: 4.1-4
libpve-storage-perl: 7.2-10
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.6-1
proxmox-backup-file-restore: 2.2.6-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-4
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1

fiona · Oct 11, 2022

britmob said:
I tried looking at other nodes for config files. I found that from any node XYZ, there was a file at /etc/pve/nodes/[XYZ]/lxc/118.conf on that local system. None of these files worked; they were just the blank pointers seen in the initial post.

What do you mean by "there was a file"? That the touch command failed with the "file exists" error? That just means that the file exists in the directory for some node, not that it exists in the directory of that node.

britmob said:

I've attached the ha-manager output below.

I tried removing the HA entry, but it was stuck on deleting for some minutes. I then entirely removed the ct config file, restarted the ha daemons, and then replaced it from a backup. This helped, but now pve-ha-crm is complaining about a different container's config file instead of the original.

Additionally, here is the process status from systemctl on the master crm node.

Code:

root@cavejohnson:~# systemctl status pve-ha-crm
● pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2022-10-10 08:03:20 HDT; 6min ago
    Process: 3831101 ExecStart=/usr/sbin/pve-ha-crm start (code=exited, status=0/SUCCESS)
   Main PID: 3831102 (pve-ha-crm)
      Tasks: 1 (limit: 57672)
     Memory: 92.9M
        CPU: 659ms
     CGroup: /system.slice/pve-ha-crm.service
             └─3831102 pve-ha-crm

Oct 10 08:08:59 cavejohnson pve-ha-crm[3831102]: recover service 'ct:124' to previous failed and fenced node 'gfman' again
Oct 10 08:08:59 cavejohnson pve-ha-crm[3831102]: got unexpected error - Configuration file 'nodes/gfman/lxc/124.conf' does not exist
Oct 10 08:09:09 cavejohnson pve-ha-crm[3831102]: recover service 'ct:124' to previous failed and fenced node 'gfman' again
Oct 10 08:09:09 cavejohnson pve-ha-crm[3831102]: got unexpected error - Configuration file 'nodes/gfman/lxc/124.conf' does not exist
Oct 10 08:09:19 cavejohnson pve-ha-crm[3831102]: recover service 'ct:124' to previous failed and fenced node 'gfman' again
Oct 10 08:09:19 cavejohnson pve-ha-crm[3831102]: got unexpected error - Configuration file 'nodes/gfman/lxc/124.conf' does not exist
Oct 10 08:09:29 cavejohnson pve-ha-crm[3831102]: recover service 'ct:124' to previous failed and fenced node 'gfman' again
Oct 10 08:09:29 cavejohnson pve-ha-crm[3831102]: got unexpected error - Configuration file 'nodes/gfman/lxc/124.conf' does not exist
Oct 10 08:09:39 cavejohnson pve-ha-crm[3831102]: recover service 'ct:124' to previous failed and fenced node 'gfman' again
Oct 10 08:09:39 cavejohnson pve-ha-crm[3831102]: got unexpected error - Configuration file 'nodes/gfman/lxc/124.conf' does not exist

What is the output of ls /etc/pve/nodes/*/lxc/124.conf? My guess is that the configuration file is in the lxc directory for a different node and not where the HA manager expects it. If it is, try moving it to the lxc directory for gfman.
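
As a rough sketch of the move suggested above, assuming the container is stopped and its storage is reachable from the target node, a small helper could look like this. The function name move_guest_config and its optional base-directory argument are hypothetical, added only for illustration; on a real cluster the base is /etc/pve/nodes.

```shell
# Hypothetical helper, not a Proxmox VE command.
# Moves a container config from one node's lxc directory to another's.
# Usage: move_guest_config VMID FROM_NODE TO_NODE [BASE_DIR]
move_guest_config() {
    vmid="$1"; from="$2"; to="$3"
    base="${4:-/etc/pve/nodes}"   # assumed default location of node directories
    src="$base/$from/lxc/$vmid.conf"
    dst_dir="$base/$to/lxc"
    if [ ! -e "$src" ]; then
        echo "no config at $src" >&2
        return 1
    fi
    if [ ! -d "$dst_dir" ]; then
        echo "missing target directory $dst_dir" >&2
        return 1
    fi
    # Move rather than copy, so the ID does not end up under two nodes.
    mv "$src" "$dst_dir/"
}
```

For the case in this thread the call would be move_guest_config 124 somenode gfman, where somenode stands for whichever node the ls output shows.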
