I had a few containers acting weird, and when I tried to reboot one of them, it hung while shutting down. Killing it with pct did not help, and neither did killing the PID, so I rebooted the node. After the node came back up, none of the LXCs that have a bind mount to a Ceph pool would start. The bind mount in the config is:
Code:
mp0:/mnt/pve/cephfs/media,mp=/mnt/files
In the node shell I went to /mnt/pve/cephfs and none of the files were there. That was when I saw that the status for only that pool now shows as "unknown", while the other pools are fine.
If I click on it and try to look at any of the content, I get "mount error" and a couple of things to check. The results of systemctl status mnt-pve-cephfs.mount are:
Code:
● mnt-pve-cephfs.mount - /mnt/pve/cephfs
Loaded: loaded (/run/systemd/system/mnt-pve-cephfs.mount; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2021-11-25 13:44:20 EST; 10s ago
Where: /mnt/pve/cephfs
What: 192.168.4.50,192.168.4.54,192.168.4.56,192.168.4.58,192.168.4.60,192.168.4.62,192.168.4.66:/
Nov 25 13:44:20 Dak1 systemd[1]: Mounting /mnt/pve/cephfs...
Nov 25 13:44:20 Dak1 mount[8198]: mount error: no mds server is up or the cluster is laggy
Nov 25 13:44:20 Dak1 systemd[1]: mnt-pve-cephfs.mount: Mount process exited, code=exited, status=32/n/a
Nov 25 13:44:20 Dak1 systemd[1]: mnt-pve-cephfs.mount: Failed with result 'exit-code'.
Nov 25 13:44:20 Dak1 systemd[1]: Failed to mount /mnt/pve/cephfs.
The results of journalctl -xe:
Code:
--
-- A start job for unit user@0.service has finished successfully.
--
-- The job identifier is 635.
Nov 25 13:48:40 Dak1 systemd[1]: Started Session 1 of user root.
-- Subject: A start job for unit session-1.scope has finished successfully
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit session-1.scope has finished successfully.
--
-- The job identifier is 710.
Nov 25 13:48:40 Dak1 login[1999]: pam_unix(login:session): session opened for user root by root(uid=0)
Nov 25 13:48:40 Dak1 login[2004]: ROOT LOGIN on '/dev/pts/0' from '192.168.1.58'
Nov 25 13:48:43 Dak1 systemd[1]: Reloading.
Nov 25 13:48:44 Dak1 systemd[1]: /lib/systemd/system/fail2ban.service:12: PIDFile= references path below legacy directory /var/run/, updating /var/run/fail2ban/fai
Nov 25 13:48:45 Dak1 systemd[1]: Mounting /mnt/pve/cephfs...
-- Subject: A start job for unit mnt-pve-cephfs.mount has begun execution
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit mnt-pve-cephfs.mount has begun execution.
--
-- The job identifier is 786.
Nov 25 13:48:45 Dak1 mount[2104]: mount error: no mds server is up or the cluster is laggy
Nov 25 13:48:45 Dak1 kernel: libceph: auth protocol 'cephx' mauth authentication failed: -13
Nov 25 13:48:45 Dak1 kernel: ceph: No mds server is up or the cluster is laggy
Nov 25 13:48:45 Dak1 systemd[1]: mnt-pve-cephfs.mount: Mount process exited, code=exited, status=32/n/a
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- An n/a= process belonging to unit mnt-pve-cephfs.mount has exited.
--
-- The process' exit code is 'exited' and its exit status is 32.
Nov 25 13:48:45 Dak1 systemd[1]: mnt-pve-cephfs.mount: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The unit mnt-pve-cephfs.mount has entered the 'failed' state with result 'exit-code'.
Nov 25 13:48:45 Dak1 systemd[1]: Failed to mount /mnt/pve/cephfs.
-- Subject: A start job for unit mnt-pve-cephfs.mount has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit mnt-pve-cephfs.mount has finished with a failure.
--
-- The job identifier is 786 and the job result is failed.
Nov 25 13:48:45 Dak1 pvestatd[1281]: mount error: See "systemctl status mnt-pve-cephfs.mount" and "journalctl -xe" for details.
Nov 25 13:48:45 Dak1 pmxcfs[981]: [status] notice: received log
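So, going by the kernel line "libceph: auth protocol 'cephx' mauth authentication failed: -13", it appears to be some sort of cephx authentication problem rather than the MDS actually being down, but I don't know how to fix it.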
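For reference, the -13 in that kernel line is the Linux errno EACCES (permission denied), which would fit a key mismatch. Below is a minimal sketch of the checks that seem relevant, not a confirmed fix. It assumes the storage ID is "cephfs" (so the mount secret would live at /etc/pve/priv/ceph/cephfs.secret), that the storage mounts as client.admin, and that the ceph CLI on the node still works. The helper names keys_match and check_cephfs_auth are made up for illustration.

```shell
#!/bin/sh
# Sketch only. Assumptions (not confirmed by the logs above):
#   - the storage ID is "cephfs", so Proxmox reads the mount secret from
#     /etc/pve/priv/ceph/cephfs.secret
#   - the storage mounts as client.admin
#   - the node still has a working admin keyring for the ceph CLI

# keys_match (hypothetical helper): succeed only if the key string in $1
# equals the contents of the secret file in $2, ignoring whitespace.
# Returns 2 if the file cannot be read.
keys_match() {
    [ -r "$2" ] || return 2
    [ "$1" = "$(tr -d '[:space:]' < "$2")" ]
}

# check_cephfs_auth (hypothetical helper): show cluster and MDS state, then
# compare the cluster's key for client.admin with the storage secret.
check_cephfs_auth() {
    secret=${1:-/etc/pve/priv/ceph/cephfs.secret}
    ceph -s          # overall health; are the monitors reachable at all?
    ceph fs status   # is any MDS active for the filesystem?
    if keys_match "$(ceph auth get-key client.admin)" "$secret"; then
        echo "secret matches client.admin key"
    else
        echo "secret missing or different from client.admin key"
    fi
}
```

Running check_cephfs_auth as root on the affected node would print the cluster state followed by one of the two messages. If the storage uses a different ID or client name, the secret path and the client.admin argument would need to be changed to match.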
pveversion -v:
Code:
proxmox-ve: 6.4-1 (running kernel: 5.11.7-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-helper: 6.4-8
pve-kernel-5.4: 6.4-7
pve-kernel-5.11.7-1-pve: 5.11.7-1~bpo10
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph: 15.2.15-pve1~bpo10
ceph-fuse: 15.2.15-pve1~bpo10
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve1~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.6-pve1~bpo10+1