[SOLVED] Proxmox question marks on all machines and storage

ixproxmox

I have two Proxmox machines in a cluster. I do not run any VMs on shared storage (but I do back up to the shared NFS storage), and there is no HA, redundancy, or remote storage for any VM.

Suddenly, all machines show a question mark, and about every second night two of the VMs (running a really light load) go down for no apparent reason. I can't connect to their consoles, but I can stop and start them. Then they run fine for two days (but still with question marks on all machines and all storage).

I suspect this started because the shared NFS storage filled up with backups. But again, I do not run any VMs off it, so I can't understand how that could be related. I have since freed up space on the backup NFS device.
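For reference, this is roughly how I check whether the shared storage itself is hanging (the mount path below is just an example; the real one sits under /mnt/pve/<storage ID>):

Code:
# storage status as PVE sees it; if this hangs, a mount is usually dead or very slow
pvesm status

# free space on the backup NFS storage (path is an example, adjust to your storage ID)
df -h /mnt/pve/backup-nfs

# test whether the mount answers at all, bounded so the shell doesn't hang
timeout 5 ls /mnt/pve/backup-nfs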

pve-manager/6.2-12/b287dd27

Cluster information
-------------------
Name: XX
Config Version: 2
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Tue Dec 22 01:26:41 2020
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000002
Ring ID: 1.1cd
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1
0x00000002 1 (local)
 
Hi,

Please post the output of pveversion -v. Have you also checked your syslog or journalctl?
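For example, something like this shows the recent messages from the services involved in status collection (adjust the time range as needed):

Code:
# recent log entries from the status daemon, cluster filesystem and GUI proxy
journalctl -u pvestatd -u pve-cluster -u pveproxy --since "2 days ago"

# or follow the log live while the question marks are showing
journalctl -f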
 
Note that I ran an apt update/upgrade on each server after posting here (to see whether an upgrade would fix it), just in case there is a difference.

:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.34-1-pve)
pve-manager: 6.2-12 (running version: 6.2-12/b287dd27)
pve-kernel-helper: 6.3-3
pve-kernel-5.4: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.2-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
 
I can't say that I noticed anything in the last few months; I have checked both of those logs. It also hits two machines at the same time, so it must be something they have in common, I think.

I ran the update to 6.3 on both nodes:
proxmox-ve: 6.3-1 (running kernel: 5.4.34-1-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
....

It doesn't seem to have helped. Still question marks everywhere. All the machines are running, but their names don't show in the list (the name shows if I go into a VM's detailed view). It must be some process or something that is stuck... My guess is something related to NFS.
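If it really is a stuck process, it should show up as a process in uninterruptible sleep (state D, typically blocked on NFS I/O that never returns); something like this lists them:

Code:
# processes stuck in uninterruptible sleep, usually blocked on dead storage I/O
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'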
 
Is there anything I can do to get things up and running here? I still have question marks everywhere, and I have upgraded both nodes several times. I assume it doesn't have anything to do with the cluster, since I have had this issue before on a single Proxmox server. All the servers work, but I can't see the names of the machines unless I go into each one.
 
service pve-cluster stop
service corosync stop
service pvestatd stop
service pveproxy stop
service pvedaemon stop

service pve-cluster start
service corosync start
service pvestatd start
service pveproxy start
service pvedaemon start

Tried this before, but now it actually worked. Got one node up and working, testing it on 2nd node now...
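For the record, the grey question marks seem to come from pvestatd (the daemon that collects node/VM/storage status for the GUI) being blocked, so on systemd-based releases a shorter first attempt is often enough:

Code:
# often sufficient on its own: restart the status daemon plus the GUI/API services
systemctl restart pvestatd pveproxy pvedaemon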
 
It worked for me.
Researching the issue, I came across a post where the poster had issues connecting to his NAS from Proxmox. I had been reworking the IP connections to my NAS, so I may have had the same issue.

Everything was back up, but the backup job looked to be hung. I clicked on the mount, clicked on the files tab, it timed out, and all the question marks came back. I reset them again and then disabled the backup job to stop it from restarting every time I did the reset. I then just had question marks on the storage items. It hung again while I was checking them. After one more reset, I was able to delete the backup NFS storage item. All the question marks are gone. Unfortunately, I get an error trying to remount the share. I have an SMB share to the same NAS that works fine.
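Before re-adding the NFS storage, it may be worth checking whether the NAS still exports the share and answers from the PVE node's point of view (the IP below is a placeholder for the NAS):

Code:
# list the exports the NAS offers (showmount comes with the nfs-common package)
showmount -e 192.168.1.50

# or let PVE itself scan the NFS server
pvesm scan nfs 192.168.1.50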
 
ixproxmox said:
service pve-cluster stop
service corosync stop
service pvestatd stop
service pveproxy stop
service pvedaemon stop

service pve-cluster start
service corosync start
service pvestatd start
service pveproxy start
service pvedaemon start

Tried this before, but now it actually worked. Got one node up and working, testing it on 2nd node now...
It worked very well for some time, but after a while the same problem occurred again: all the nodes, with all storage and pools in the cluster, are showing question marks.

Please check what the root problem could be.
@ixproxmox @Moayad @fabian
 
We had this problem today as well.

Background: we have two nodes in the cluster, an older one with USB 2 ports and a newer one with USB 3. We do weekly offsite backups to external USB hard drives. The older node 1 has greater resources than the newer node 2, so it carries a heavier VM load. But as node 1 only has USB 2, its backups take about 12 hours at roughly 40 MB/s, whereas node 2 can write at about 115 MB/s.

Last week I created an LXC container on node 2 to act as an NFS server so that I can back up both nodes in a single night, instead of backing up one node the first night, moving the drive to the second node, and backing that one up the second night. Added benefit: node 1 would be able to write at 115 MB/s over the 2 Gbps LAGged NICs and the USB 3 connection of the other node.
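For completeness, the export inside the container is nothing special, roughly something like this (path and subnet are placeholders for my network):

Code:
# /etc/exports inside the NFS container
/export/backups 10.0.0.0/24(rw,sync,no_subtree_check,no_root_squash)

# re-export after editing
exportfs -ra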

The problem: yesterday I did a backup from node 1 and was somewhat disappointed with the transfer speeds, but thought to myself, "there are about two dozen explanations for this, let's see how the local backup goes over NFS".

Today I tried to back up node 2 over the NFS share of the locally mounted container. It got to about 4% before the backup hung and the GUI crashed. After an F5, everything was question marked. Same result connecting to either node.

If I clicked on the NFS share, it timed out with:
Code:
unable to activate storage 'nfs01' - directory '/mnt/pve/nfs01' does not exist or is unreachable (500)

Next I tried:
service pvestatd restart
on both nodes. No change. I disabled the NFS share under Datacenter > Storage and retried the pvestatd restart. Still no change.

I then tried all of the commands suggested by ixproxmox on both servers. Node 1 came back. Node 2 remained all greyed out.

I tried to reboot node 2, but it wouldn't, likely because of the crashed backup process. reboot -f was equally ineffective (the task log showed a bulk shutdown of all VMs and containers, but nothing happened). As a last resort I issued an IPMI restart command, with the obvious risk to all the VMs involved.

This obviously rebooted the server and everything went back to normal (fortunately).
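In hindsight, before going for the IPMI reset it might have been worth trying to force the dead mount away, which sometimes unblocks the stuck processes (the path is the one reported in the error above):

Code:
# does the mount still answer? (bounded by a timeout so this can't hang the shell)
timeout 5 stat /mnt/pve/nfs01

# if not, lazy + forced unmount of the stale NFS mount
umount -f -l /mnt/pve/nfs01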

At this point I feel very disinclined to EVER try that again. Yet lots of folks report success running LXC-based NFS servers, and NFS per se seems to be rock solid, having been in use practically forever. Also, from what I can find, this problem doesn't seem to be limited to NFS; reports of the grey question marks date back to about 2020.

Does anyone have any pointers as to what could be causing this?
 

By the way, here's my pveversion:

Code:
proxmox-ve: 8.2.0 (running kernel: 6.8.12-2-pve)
pve-manager: 8.2.7 (running version: 8.2.7/3e0176e6bb2ade3b)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-11
proxmox-kernel-6.8: 6.8.12-2
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.8-4-pve-signed: 6.8.8-4
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-6-pve-signed: 6.5.13-6
proxmox-kernel-6.5: 6.5.13-6
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5.13-1-pve-signed: 6.5.13-1
pve-kernel-5.4: 6.4-18
pve-kernel-5.15.143-1-pve: 5.15.143-1
pve-kernel-5.15.131-2-pve: 5.15.131-3
pve-kernel-5.15.131-1-pve: 5.15.131-2
pve-kernel-5.15.126-1-pve: 5.15.126-1
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.104-1-pve: 5.15.104-2
pve-kernel-5.4.189-2-pve: 5.4.189-2
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown: not correctly installed
ifupdown2: 3.2.0-1+pmx9
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.3
libpve-guest-common-perl: 5.1.4
libpve-http-server-perl: 5.1.1
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.10
libpve-storage-perl: 8.2.5
libqb0: 1.0.5-1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-4
proxmox-backup-client: 3.2.7-1
proxmox-backup-file-restore: 3.2.7-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.2.0
pve-docs: 8.2.3
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.0.7
pve-firmware: 3.13-2
pve-ha-manager: 4.0.5
pve-i18n: 3.2.3
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.4
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1
Same on both nodes.
 
