Grey Question Mark, after CIFS share offline

fpdragon · Oct 10, 2024

Hi, I am using a cluster of 4 PVE 8.2.4 machines. One of the VMs serves a CIFS share that is mounted as a "SMB/CIFS" Storage mount.

That normally works well but now the storage VM failed and this lead unstable results in the PVE UI.

I see the typical "grey question mark" problem and the VM names are not resolved.
And I can't access any of the PVE storages. I just receive a "Connection timed out (596)"

I tried to disable and re enable the CIFS mont from "Datacenter" -> "Storage". The network drive disappears and reappears under the different nodes but the access is still not possible.

If possible I really don't want to reboot my nodes. Hopefully just something needs to be restarted to get it up and running.

Here is an overview of one of the nodes services:

Bash:

root@p-virt-sw-3:~# systemctl status
● p-virt-sw-3
    State: running
    Units: 596 loaded (incl. loaded aliases)
     Jobs: 0 queued
   Failed: 0 units
    Since: Tue 2024-09-10 15:49:09 CEST; 4 weeks 1 day ago
  systemd: 252.30-1~deb12u2
   CGroup: /
           ├─init.scope
           │ └─1 /sbin/init
           ├─qemu.slice
           │ ├─1027.scope
           │ │ └─811359 /usr/bin/kvm -id 1027 -name SwarmIomProc-Win7,debug-threads=on -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/1027.qmp,server=on,wait=off -mon chardev=qmp,mode=control -chardev socket,id=qmp-event,path=/var/run/qmev>
           │ ├─2003.scope
           │ │ ├─6327 swtpm socket --tpmstate backend-uri=file:///dev/zvol/ssd2-nvme-zfs/vm-2003-disk-1,mode=0600 --ctrl type=unixio,path=/var/run/qemu-server/2003.swtpm,mode=0600 --pid file=/var/run/qemu-server/2003.swtpm.pid --terminate --daemon --log >
           │ │ └─6334 /usr/bin/kvm -id 2003 -name SwDevZes-Win10,debug-threads=on -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/2003.qmp,server=on,wait=off -mon chardev=qmp,mode=control -chardev socket,id=qmp-event,path=/var/run/qmeventd.>
           │ ├─2008.scope
           │ │ ├─6571 swtpm socket --tpmstate backend-uri=file:///dev/zvol/rpool/vm-2008-disk-1,mode=0600 --ctrl type=unixio,path=/var/run/qemu-server/2008.swtpm,mode=0600 --pid file=/var/run/qemu-server/2008.swtpm.pid --terminate --daemon --log "file=/r>
           │ │ └─6578 /usr/bin/kvm -id 2008 -name SwDevJel-Win10,debug-threads=on -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/2008.qmp,server=on,wait=off -mon chardev=qmp,mode=control -chardev socket,id=qmp-event,path=/var/run/qmeventd.>
           │ ├─2010.scope
           │ │ ├─6980 swtpm socket --tpmstate backend-uri=file:///dev/zvol/rpool/vm-2010-disk-1,mode=0600 --ctrl type=unixio,path=/var/run/qemu-server/2010.swtpm,mode=0600 --pid file=/var/run/qemu-server/2010.swtpm.pid --terminate --daemon --log "file=/r>
           │ │ └─6986 /usr/bin/kvm -id 2010 -name SwDevBoc-Win10,debug-threads=on -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/2010.qmp,server=on,wait=off -mon chardev=qmp,mode=control -chardev socket,id=qmp-event,path=/var/run/qmeventd.>
           │ └─2027.scope
           │   └─37935 /usr/bin/kvm -id 2027 -name SwarmIomLive-Win7,debug-threads=on -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/2027.qmp,server=on,wait=off -mon chardev=qmp,mode=control -chardev socket,id=qmp-event,path=/var/run/qmeve>
           ├─system.slice
           │ ├─chrony.service
           │ │ ├─3225 /usr/sbin/chronyd -F 1
           │ │ └─3234 /usr/sbin/chronyd -F 1
           │ ├─corosync.service
           │ │ └─3377 /usr/sbin/corosync -f
           │ ├─cron.service
           │ │ └─3379 /usr/sbin/cron -f
           │ ├─dbus.service
           │ │ └─2941 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
           │ ├─ksmtuned.service
           │ │ ├─   2950 /bin/bash /usr/sbin/ksmtuned
           │ │ └─1430211 sleep 60
           │ ├─lxc-monitord.service
           │ │ └─3171 /usr/libexec/lxc/lxc-monitord --daemon
           │ ├─lxcfs.service
           │ │ └─2966 /usr/bin/lxcfs /var/lib/lxcfs
           │ ├─proxmox-firewall.service
           │ │ └─3380 /usr/libexec/proxmox/proxmox-firewall
           │ ├─pve-cluster.service
           │ │ └─3294 /usr/bin/pmxcfs
           │ ├─pve-firewall.service
           │ │ └─3479 pve-firewall
           │ ├─pve-ha-crm.service
           │ │ └─3519 pve-ha-crm
           │ ├─pve-ha-lrm.service
           │ │ └─3538 pve-ha-lrm
           │ ├─pve-lxc-syscalld.service
           │ │ └─2947 /usr/lib/x86_64-linux-gnu/pve-lxc-syscalld/pve-lxc-syscalld --system /run/pve/lxc-syscalld.sock
           │ ├─pvedaemon.service
           │ │ ├─   3507 pvedaemon
           │ │ ├─ 463060 "pvedaemon worker"
           │ │ ├─ 470732 "pvedaemon worker"
           │ │ ├─ 488150 "pvedaemon worker"
           │ │ ├─1399185 "pvedaemon worker"
           │ │ ├─1399248 "pvedaemon worker"
           │ │ └─1399262 "pvedaemon worker"
           │ ├─pvefw-logger.service
           │ │ └─1059595 /usr/sbin/pvefw-logger
           │ ├─pveproxy.service
           │ │ ├─   3530 pveproxy
           │ │ ├─1059599 "pveproxy worker"
           │ │ ├─1059600 "pveproxy worker"
           │ │ └─1059601 "pveproxy worker"
           │ ├─pvescheduler.service
           │ │ └─7496 pvescheduler
           │ ├─pvestatd.service
           │ │ ├─   3491 pvestatd
           │ │ └─1394818 pvestatd
           │ ├─qmeventd.service
           │ │ └─2953 /usr/sbin/qmeventd /var/run/qmeventd.sock
           │ ├─rpc-statd.service
           │ │ └─3712 /sbin/rpc.statd
           │ ├─rpcbind.service
           │ │ └─2926 /sbin/rpcbind -f -w
           │ ├─rrdcached.service
           │ │ └─3268 /usr/bin/rrdcached -B -b /var/lib/rrdcached/db/ -j /var/lib/rrdcached/journal/ -p /var/run/rrdcached.pid -l unix:/var/run/rrdcached.sock
           │ ├─smartmontools.service
           │ │ └─2952 /usr/sbin/smartd -n -q never
           │ ├─spiceproxy.service
           │ │ ├─   3536 spiceproxy
           │ │ └─1059593 "spiceproxy worker"
           │ ├─ssh.service
           │ │ └─3192 "sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups"
           │ ├─system-getty.slice
           │ │ └─getty@tty1.service
           │ │   └─3188 /sbin/agetty -o "-p -- \\u" --noclear - linux
           │ ├─system-postfix.slice
           │ │ └─postfix@-.service
           │ │   ├─   3370 /usr/lib/postfix/sbin/master -w
           │ │   ├─   3372 qmgr -l -t unix -u
           │ │   └─1401054 pickup -l -t unix -u -c
           │ ├─systemd-journald.service
           │ │ └─1470 /lib/systemd/systemd-journald
           │ ├─systemd-logind.service
           │ │ └─2956 /lib/systemd/systemd-logind
           │ ├─systemd-udevd.service
           │ │ └─udev
           │ │   └─1491 /lib/systemd/systemd-udevd
           │ ├─watchdog-mux.service
           │ │ └─2957 /usr/sbin/watchdog-mux
           │ └─zfs-zed.service
           │   └─2962 /usr/sbin/zed -F
           └─user.slice
             └─user-0.slice
               ├─session-16.scope
               │ └─23349 /usr/bin/kvm -id 1036 -name FpgaLibero-Ubuntu,debug-threads=on -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/1036.qmp,server=on,wait=off -mon chardev=qmp,mode=control -chardev socket,id=qmp-event,path=/var/run/qme>
               ├─session-829.scope
               │ ├─1408393 "sshd: root@pts/0"
               │ ├─1408400 /bin/login -f
               │ ├─1408405 -bash
               │ ├─1430303 systemctl status
               │ └─1430304 pager
               └─user@0.service
                 └─init.scope
                   ├─22358 /lib/systemd/systemd --user
                   └─22359 "(sd-pam)"

Not sure how to look through the logs. The "System Log" Tab is empty in the UI.

Hope someone can help without the need of a reboot.
Thanks.

Moayad · Oct 10, 2024

Hi,

Can you still ping the NFS share server? If you disabled the NFS storage in the Proxmox VE -> Storage do you still see the grey question mark?

fpdragon · Oct 10, 2024

I have a SMB/CIFS share (WS2022 VM).
I had it running with NFS but I switched to CIFS since I got the impression that the NFS implementation of WS2022 is not that great.

Pinging the server is possible from each node.

I tried to disable and re enable the CIFS share multiple times.

Thanks for the answer

fpdragon · Oct 10, 2024

The thing is, I had to reboot the storage VM and now the whole cluster seems to become unstable.

I read several things that the pvestatd service may be involved.
I tried to stop and restart the service but it takes very very long and goes through multiple kill/term signal timeouts. Igues pvestatd can't run stable any more due to the loss of the CIFS connection.

fpdragon · Oct 10, 2024

Moayad · Oct 10, 2024

Hi,

Can you please try to manual mount your CIIFS on a test VM or LXC or even your Proxmox VE using `mount`, to narrow down if the issure related to the CIFS connection or not.

Did you check the syslog for anything related to?

fpdragon · Oct 10, 2024

I guess this line could be of interest?
pvestatd[2893887]: start failed - can't acquire lock '/var/run/pvestatd.pid.lock' - Resource temporarily unavailable

manual mount:
not sure about that. I always used the PVE UI. I need to find out the equal command.

I only can access syslog for 2 of 4 cluster machines. seems that everything broke away. however all VMs are still up and running, which is important for me.

fpdragon · Oct 10, 2024

I can't start the pvestatd service due to the /var/run/pvestatd.pid.lock file.
Any idea how to resolve that without rebooting?

Moayad · Oct 10, 2024

fpdragon said:
pvestatd[2893887]: start failed - can't acquire lock '/var/run/pvestatd.pid.lock' - Resource temporarily unavailable

You can move/remove the `/var/run/pvestatd.pid.lock` and then restart the pvestatd service using systemctl restart pvestatd command.

fpdragon · Oct 10, 2024

Thanks.

I did so and now I fixed p-virt-sw-2.
So 2 of 4 nodes are completely up and running again.

However the other two just have the grey question mark and the UI is not updated any more.
But I still have terminal access.

But on these two machines pvestatd ist already running.
No idea how to track down the problem over the term console.

Moayad · Oct 10, 2024

fpdragon said:
But on these two machines pvestatd ist already running.

Make sure that you restart the pvestatd on the two remaining nodes.

fpdragon · Oct 10, 2024

Took a long time but now I stopped and started the service on both machines and now everything seems to be fine again.
Without a reboot of any node.

Thanks a log Moayad!

Now I try to start the CIFS share again.
Let's see if this also works again.

fpdragon · Oct 10, 2024

Hmmm...

re enabled the CIFS share from the Datacenter storage UI but it failed.
p-virt-sw-1 works
The other 3 nodes have the grey question mark and are inaccessible.

So the CIFS connection is still broken on 3 of 4 machines.

Moayad · Oct 10, 2024

Can you please check the storage status and the output of the following commands:

Bash:

pvesm status --storage <CIFS storage>
top
free -h

For more information can you please check the syslog using journalctl since you re-enabled the CIFS share? You can generate the syslog as the following command:

Code:

journalctl --since '2024-10-10 00:00:00' --until 'now' > /tmp/$(hostname)-syslog.txt

You have to edit the time in the above command.

fpdragon · Oct 10, 2024

Tested the same with the old NFS share.
Exactly the same results.

first working node:

other three nodes:

After waiting several minutes the whole nodes are also gone again.

3 of 4 nodes have grey question marks again.
1 is just completely green.

I guess I can do the same procedure again:
disable CIFS and NFS share
pvestatd stop
pvestatd start

fpdragon · Oct 10, 2024

Moayad said:
Can you please check the storage status and the output of the following commands:

Bash:

pvesm status --storage <CIFS storage> top free -h

For more information can you please check the syslog using journalctl since you re-enabled the CIFS share? You can generate the syslog as the following command:

Code:

journalctl --since '2024-10-10 00:00:00' --until 'now' > /tmp/$(hostname)-syslog.txt

You have to edit the time in the above command.

on the working node:

on the others:

(still waiting)

fpdragon · Oct 10, 2024

Code:

root@p-virt-sw-2:/tmp# journalctl --since '2024-10-10 15:00:00' --until 'now'
Oct 10 15:00:01 p-virt-sw-2 systemd[1]: pvestatd.service: Found left-over process 2836895 (pvestatd) in control group while starting unit. Ignoring.
Oct 10 15:00:01 p-virt-sw-2 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 10 15:00:01 p-virt-sw-2 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Oct 10 15:00:07 p-virt-sw-2 pvestatd[2908261]: start failed - can't acquire lock '/var/run/pvestatd.pid.lock' - Resource temporarily unavailable
Oct 10 15:00:07 p-virt-sw-2 pvestatd[2908261]: start failed - can't acquire lock '/var/run/pvestatd.pid.lock' - Resource temporarily unavailable
Oct 10 15:00:07 p-virt-sw-2 systemd[1]: pvestatd.service: Control process exited, code=exited, status=255/EXCEPTION
Oct 10 15:01:37 p-virt-sw-2 systemd[1]: pvestatd.service: State 'stop-sigterm' timed out. Killing.
Oct 10 15:01:37 p-virt-sw-2 systemd[1]: pvestatd.service: Killing process 2836895 (pvestatd) with signal SIGKILL.
Oct 10 15:01:45 p-virt-sw-2 pmxcfs[2553]: [status] notice: received log
Oct 10 15:02:13 p-virt-sw-2 pvedaemon[2877409]: <root@pam> successful auth for user 'root@pam'
Oct 10 15:03:07 p-virt-sw-2 systemd[1]: pvestatd.service: Processes still around after SIGKILL. Ignoring.
Oct 10 15:04:17 p-virt-sw-2 pveproxy[2866030]: worker exit
Oct 10 15:04:17 p-virt-sw-2 pveproxy[2733]: worker 2866030 finished
Oct 10 15:04:17 p-virt-sw-2 pveproxy[2733]: starting 1 worker(s)
Oct 10 15:04:17 p-virt-sw-2 pveproxy[2733]: worker 2913396 started
Oct 10 15:04:37 p-virt-sw-2 systemd[1]: pvestatd.service: State 'final-sigterm' timed out. Killing.
Oct 10 15:04:37 p-virt-sw-2 systemd[1]: pvestatd.service: Killing process 2836895 (pvestatd) with signal SIGKILL.
Oct 10 15:06:07 p-virt-sw-2 systemd[1]: pvestatd.service: Processes still around after final SIGKILL. Entering failed mode.
Oct 10 15:06:07 p-virt-sw-2 systemd[1]: pvestatd.service: Failed with result 'exit-code'.
Oct 10 15:06:07 p-virt-sw-2 systemd[1]: pvestatd.service: Unit process 2836895 (pvestatd) remains running after unit stopped.
Oct 10 15:06:07 p-virt-sw-2 systemd[1]: Failed to start pvestatd.service - PVE Status Daemon.
Oct 10 15:06:43 p-virt-sw-2 systemd[1]: pvestatd.service: Found left-over process 2836895 (pvestatd) in control group while starting unit. Ignoring.
Oct 10 15:06:43 p-virt-sw-2 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 10 15:06:43 p-virt-sw-2 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Oct 10 15:06:44 p-virt-sw-2 pvestatd[2916359]: starting server
Oct 10 15:06:44 p-virt-sw-2 systemd[1]: Started pvestatd.service - PVE Status Daemon.
Oct 10 15:06:48 p-virt-sw-2 pvedaemon[2877409]: worker exit
Oct 10 15:06:48 p-virt-sw-2 pvedaemon[2721]: worker 2877409 finished
Oct 10 15:06:48 p-virt-sw-2 pvedaemon[2721]: starting 1 worker(s)
Oct 10 15:06:48 p-virt-sw-2 pvedaemon[2721]: worker 2916442 started
Oct 10 15:07:28 p-virt-sw-2 pmxcfs[2553]: [status] notice: received log
Oct 10 15:07:49 p-virt-sw-2 pveproxy[2898202]: Clearing outdated entries from certificate cache
Oct 10 15:08:02 p-virt-sw-2 pvedaemon[2881614]: <root@pam> successful auth for user 'root@pam'
Oct 10 15:08:03 p-virt-sw-2 pveproxy[2894137]: Clearing outdated entries from certificate cache
Oct 10 15:08:19 p-virt-sw-2 pveproxy[2898202]: proxy detected vanished client connection
Oct 10 15:08:33 p-virt-sw-2 pveproxy[2898202]: proxy detected vanished client connection
Oct 10 15:09:09 p-virt-sw-2 pveproxy[2898202]: 2024-10-10 15:09:09.754451 +0200 error AnyEvent::Util: Runtime error in AnyEvent::guard callback: Can't call method "_put_session" on an undefined value at /usr/lib/x86_64-linux-gnu/perl5/5.36/AnyEvent/Handl>
Oct 10 15:09:09 p-virt-sw-2 pveproxy[2733]: worker 2898202 finished
Oct 10 15:09:09 p-virt-sw-2 pveproxy[2733]: starting 1 worker(s)
Oct 10 15:09:09 p-virt-sw-2 pveproxy[2733]: worker 2917941 started
Oct 10 15:09:09 p-virt-sw-2 pveproxy[2913396]: Clearing outdated entries from certificate cache
Oct 10 15:09:13 p-virt-sw-2 pveproxy[2917940]: got inotify poll request in wrong process - disabling inotify
Oct 10 15:09:24 p-virt-sw-2 pveproxy[2894137]: proxy detected vanished client connection
Oct 10 15:09:26 p-virt-sw-2 pveproxy[2894137]: proxy detected vanished client connection
Oct 10 15:09:29 p-virt-sw-2 pveproxy[2894137]: proxy detected vanished client connection
Oct 10 15:09:32 p-virt-sw-2 pveproxy[2917940]: proxy detected vanished client connection
Oct 10 15:09:33 p-virt-sw-2 pveproxy[2917940]: worker exit
Oct 10 15:09:34 p-virt-sw-2 pveproxy[2894137]: proxy detected vanished client connection
Oct 10 15:11:48 p-virt-sw-2 pveproxy[2894137]: worker exit
Oct 10 15:11:48 p-virt-sw-2 pveproxy[2733]: worker 2894137 finished
Oct 10 15:11:48 p-virt-sw-2 pveproxy[2733]: starting 1 worker(s)
Oct 10 15:11:48 p-virt-sw-2 pveproxy[2733]: worker 2919049 started
Oct 10 15:13:38 p-virt-sw-2 pveproxy[2919049]: Clearing outdated entries from certificate cache
Oct 10 15:13:38 p-virt-sw-2 pveproxy[2917941]: Clearing outdated entries from certificate cache
Oct 10 15:14:01 p-virt-sw-2 pvedaemon[2920058]: starting termproxy UPID:p-virt-sw-2:002C8E7A:0F864135:6707D319:vncshell::root@pam:
Oct 10 15:14:01 p-virt-sw-2 pvedaemon[2916442]: <root@pam> starting task UPID:p-virt-sw-2:002C8E7A:0F864135:6707D319:vncshell::root@pam:
Oct 10 15:14:01 p-virt-sw-2 pvedaemon[2881614]: <root@pam> successful auth for user 'root@pam'
Oct 10 15:14:08 p-virt-sw-2 pvedaemon[2916442]: <root@pam> end task UPID:p-virt-sw-2:002C8E7A:0F864135:6707D319:vncshell::root@pam: OK
Oct 10 15:14:18 p-virt-sw-2 pvedaemon[2920159]: starting termproxy UPID:p-virt-sw-2:002C8EDF:0F8647A8:6707D32A:vncshell::root@pam:
Oct 10 15:14:18 p-virt-sw-2 pvedaemon[2881614]: <root@pam> starting task UPID:p-virt-sw-2:002C8EDF:0F8647A8:6707D32A:vncshell::root@pam:
Oct 10 15:14:18 p-virt-sw-2 pvedaemon[2916442]: <root@pam> successful auth for user 'root@pam'
Oct 10 15:14:20 p-virt-sw-2 pveproxy[2919049]: proxy detected vanished client connection
Oct 10 15:14:53 p-virt-sw-2 pmxcfs[2553]: [dcdb] notice: data verification successful
Oct 10 15:16:32 p-virt-sw-2 pveproxy[2919049]: proxy detected vanished client connection
Oct 10 15:16:45 p-virt-sw-2 pmxcfs[2553]: [status] notice: received log

fpdragon · Oct 10, 2024

Now I had to remove the lock files on all 3 nodes but afterwards the pvestatd service could be started and now all nodes are green again.

However, without any CIFS or NFS share mounted.

Seems that mounting has been broken on 3 of 4 nodes.
or maybe a problem with the cluster syncing? but the login credentials for the shares have not changed.

fpdragon · Oct 10, 2024

Found something else:
PVE has left the mount directorys under /mnt/pve/...

On good node I am able to access the (now empty) directroy by:
cd /mnt/pve/SwFileServer-cifs

On the other nodes the same command leads to a stuck terminal.

I can ls in /mnt/pve but when going in a subfolder the terminal freezes.

fpdragon · Oct 10, 2024

Seems that I also can't remove these directories.

Grey Question Mark, after CIFS share offline

Member

Proxmox Staff Member

Member

Member

Member

Proxmox Staff Member

Member

Member

Proxmox Staff Member

Member

Proxmox Staff Member

Member

Member

Proxmox Staff Member

Member

Member

Member

Member

Member

Member

We value your privacy