one node in 3-node cluster goes into "?" status

starkruzr

hi,

just upgraded to 7.2 and noticing that my cluster is in a state where:
  1. my sidebar looks like this:
    [attached image: 1657347636205.png]
  2. I can't list or start containers on the "?" node, ibnmajid. VMs seem to work fine. Ceph seems to be fine. Info about that node:
    [attached image: 1657347713480.png]
  3. output of pvecm status:
    root@ganges:~# pvecm status
    Cluster information
    -------------------
    Name:             BrokenWorks
    Config Version:   3
    Transport:        knet
    Secure auth:      on

    Quorum information
    ------------------
    Date:             Sat Jul  9 01:39:05 2022
    Quorum provider:  corosync_votequorum
    Nodes:            3
    Node ID:          0x00000002
    Ring ID:          1.45b
    Quorate:          Yes

    Votequorum information
    ----------------------
    Expected votes:   3
    Highest expected: 3
    Total votes:      3
    Quorum:           2
    Flags:            Quorate

    Membership information
    ----------------------
        Nodeid      Votes Name
    0x00000001          1 192.168.9.10
    0x00000002          1 192.168.9.11 (local)
    0x00000003          1 192.168.9.12
    root@ganges:~#
What else can I provide to troubleshoot?
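
For anyone reading along: the pvecm output above shows the cluster as quorate with all three votes present, so corosync membership itself does not look like the problem. A grey "?" in the sidebar is, as far as I know, usually about the node's status daemon (pvestatd) not delivering updates rather than about quorum. Below is a minimal sketch for pulling the relevant fields out of `pvecm status`; the function name `quorum_summary` is made up for illustration, and the field labels are taken from the output quoted above.

```shell
#!/bin/sh
# Hypothetical helper: read `pvecm status` output on stdin and print a
# one-line quorum summary. Field labels match the output quoted in this thread.
quorum_summary() {
    awk '
        /^Quorate:/        { quorate = $2 }   # "Quorate:  Yes"
        /^Expected votes:/ { expected = $3 }  # "Expected votes:  3"
        /^Total votes:/    { total = $3 }     # "Total votes:  3"
        END { printf "quorate=%s votes=%s/%s\n", quorate, total, expected }
    '
}
```

Usage would be `pvecm status | quorum_summary` on any node. If that reports quorate with full votes, as it does here, the next thing worth looking at on the "?" node is probably `systemctl status pvestatd pve-cluster`.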
 
but I still cannot create or do anything with containers on this host.
what do you mean by that exactly? what can't you do? what error messages do you get?
 
mostly I get hanging and timeouts. if I do `pct list` *right* after restarting the node, I get the list of containers on it. if I try to start, create, or delete one, the process hangs indefinitely, and if I cancel it, `pct list` will hang just like anything else to do with containers on it.

another detail: VMs seem to mostly work on this node, but if I try to open a console on any, it also times out.

one more: if I start the MDS on this host, it sticks around for a minute and then seems to stop. other than that I have two active MDSes, which is not something I ever configured. it started doing that after I updated to 7.2.

here's what happens in syslog when I try to start the MDS:
Code:
Jul 13 01:33:29 ibnmajid systemd[1]: Started Ceph metadata server daemon.
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: starting mds.ibnmajid at
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.661-0500 7fcfa8342700 -1 MDSIOContextBase: failed with -108, restarting...
Jul 13 01:33:29 ibnmajid ceph-mds[65187]:     -1> 2022-07-13T01:33:29.661-0500 7fcfa8342700 -1 MDSIOContextBase: failed with -108, restarting...
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.661-0500 7fcfa7b41700 -1 mds.0.log Blocklisted during JournalPointer read!  Respawning...
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: did not load config file, using default settings.
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: ignoring --setuser ceph since I am not root
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: ignoring --setgroup ceph since I am not root
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.673-0500 7f1340b1b780 -1 Errors while parsing config file!
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.673-0500 7f1340b1b780 -1 can't open ceph.conf: (2) No such file or directory
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: unable to get monitor info from DNS SRV with service name: ceph-mon
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.689-0500 7f1340b1b780 -1 failed for service _ceph-mon._tcp
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.689-0500 7f1340b1b780 -1 monclient: get_monmap_and_config cannot identify monitors to contact
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: failed to fetch mon config (--no-mon-config to skip)
Jul 13 01:33:29 ibnmajid systemd[1]: ceph-mds@ibnmajid.service: Main process exited, code=exited, status=1/FAILURE
Jul 13 01:33:29 ibnmajid systemd[1]: ceph-mds@ibnmajid.service: Failed with result 'exit-code'.
Jul 13 01:33:29 ibnmajid systemd[1]: ceph-mds@ibnmajid.service: Scheduled restart job, restart counter is at 3.
Jul 13 01:33:29 ibnmajid systemd[1]: Stopped Ceph metadata server daemon.
Jul 13 01:33:29 ibnmajid systemd[1]: ceph-mds@ibnmajid.service: Start request repeated too quickly.
Jul 13 01:33:29 ibnmajid systemd[1]: ceph-mds@ibnmajid.service: Failed with result 'exit-code'.
Jul 13 01:33:29 ibnmajid systemd[1]: Failed to start Ceph metadata server daemon.
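
A note on the log above: the first failure is the "Blocklisted during JournalPointer read" line, and -108 is ESHUTDOWN on Linux, which fits a client that the cluster has blocklisted. The config-file errors only show up after the respawn, so they may be a consequence rather than the cause. A small sketch for checking how often the MDS hit this; `mds_blocklist_hits` is a hypothetical name, and the match string is copied from the log above.

```shell
#!/bin/sh
# Hypothetical helper: read journal text on stdin and print how many lines
# report the MDS being blocklisted. grep -c exits non-zero on zero matches,
# so "|| true" keeps the function from failing when the count is 0.
mds_blocklist_hits() {
    grep -c 'Blocklisted during' || true
}
```

Something like `journalctl -u ceph-mds@ibnmajid --since today | mds_blocklist_hits` would give the count. Assuming this Ceph release uses the "blocklist" wording (the log suggests it does), `ceph osd blocklist ls` should list the entries currently in effect.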
 
what does 'ceph -s' say?

what else is running on that node?

can you post part of the journal when it hangs?
 
Code:
root@ibnmajid:~# ceph -s
  cluster:
    id:     310af567-1607-402b-bc5d-c62286a129d5
    health: HEALTH_WARN
            insufficient standby MDS daemons available

  services:
    mon: 3 daemons, quorum ibnmajid,ganges,riogrande (age 14h)
    mgr: riogrande(active, since 14h)
    mds: 2/2 daemons up
    osd: 18 osds: 18 up (since 14h), 18 in (since 14h)

  data:
    volumes: 2/2 healthy
    pools:   7 pools, 1537 pgs
    objects: 804.39k objects, 1.9 TiB
    usage:   4.2 TiB used, 10 TiB / 14 TiB avail
    pgs:     1537 active+clean

  io:
    client:   138 KiB/s wr, 0 op/s rd, 9 op/s wr
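
One possible reading of this output, offered as a guess: it lists two volumes and two MDS daemons up, so the "two active MDSes" might simply be one active MDS per CephFS volume, which would also explain the "insufficient standby" warning. The sketch below just puts those two numbers side by side; `mds_vs_volumes` is a hypothetical name and the parsing assumes the `ceph -s` layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read `ceph -s` output on stdin and compare the number
# of MDS daemons that are up with the number of CephFS volumes.
mds_vs_volumes() {
    awk '
        $1 == "mds:"     { split($2, m, "/"); up = m[1] }    # "mds: 2/2 daemons up"
        $1 == "volumes:" { split($2, v, "/"); vols = v[2] }  # "volumes: 2/2 healthy"
        END {
            printf "mds_up=%s volumes=%s\n", up, vols
            if (up != "" && up == vols)
                print "note: every running MDS may already be active, leaving no standby"
        }
    '
}
```

Run it as `ceph -s | mds_vs_volumes`. `ceph fs status` gives the authoritative per-filesystem view of which MDS holds which rank.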

what else is running on that node?
two VMs. one is a Windows 10 instance, one is a random Linux VM I don't need.

can you post part of the journal when it hangs?
Code:
root@ibnmajid:~# pct destroy 106
rbd error: 'storage-fastwrx'-locked command timed out - aborting

from the journal:
Code:
Jul 13 13:22:01 ibnmajid pct[620159]: <root@pam> starting task UPID:ibnmajid:00097691:004F00C4:62CF0D49:vzdestroy:106:root@pam:
Jul 13 13:22:05 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:06 ibnmajid pvestatd[1504]: status update time (6.289 seconds)
Jul 13 13:22:15 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:15 ibnmajid pvestatd[1504]: status update time (6.264 seconds)
Jul 13 13:22:25 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:25 ibnmajid pvestatd[1504]: status update time (6.292 seconds)
Jul 13 13:22:35 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:36 ibnmajid pvestatd[1504]: status update time (6.248 seconds)
Jul 13 13:22:45 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:45 ibnmajid pvestatd[1504]: status update time (6.302 seconds)
Jul 13 13:22:49 ibnmajid pvedaemon[610587]: <root@pam> end task UPID:ibnmajid:0009721C:004EDF0D:62CF0CF2:vncproxy:102:root@pam: OK
Jul 13 13:22:55 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:55 ibnmajid pvedaemon[610587]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout afte>
Jul 13 13:22:55 ibnmajid pvestatd[1504]: status update time (6.269 seconds)
Jul 13 13:22:55 ibnmajid pvedaemon[600231]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout afte>
Jul 13 13:22:55 ibnmajid pvedaemon[600231]: <root@pam> starting task UPID:ibnmajid:00097998:004F1612:62CF0D7F:vncproxy:108:root@pam:
Jul 13 13:22:55 ibnmajid pvedaemon[620952]: starting vnc proxy UPID:ibnmajid:00097998:004F1612:62CF0D7F:vncproxy:108:root@pam:
Jul 13 13:22:59 ibnmajid qm[620954]: VM 108 qmp command failed - VM 108 qmp command 'set_password' failed - unable to connect to VM 108 qmp socket - timeout after 31 retries
Jul 13 13:22:59 ibnmajid pvedaemon[620952]: Failed to run vncproxy.
Jul 13 13:22:59 ibnmajid pvedaemon[600231]: <root@pam> end task UPID:ibnmajid:00097998:004F1612:62CF0D7F:vncproxy:108:root@pam: Failed to run vncproxy.
Jul 13 13:23:01 ibnmajid pct[620177]: rbd error: 'storage-fastwrx'-locked command timed out - aborting
Jul 13 13:23:01 ibnmajid pct[620159]: <root@pam> end task UPID:ibnmajid:00097691:004F00C4:62CF0D49:vzdestroy:106:root@pam: rbd error: 'storage-fastwrx'-locked command timed out >
root@ibnmajid:~#
108 is the Windows VM that does not permit the VNC connection for the console to work.
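
The journal above mixes two symptoms: repeated qmp socket timeouts for VM 108 and a timed-out storage lock for the container destroy. A rough sketch for tallying both from journal text, to see whether only one VM or one storage is affected; both function names are made up for illustration, and the match strings are copied from the log lines above.

```shell
#!/bin/sh
# Hypothetical helper: count "qmp command failed" journal lines per VM ID.
# Each line contains the phrase once, so this is one count per log line.
qmp_failures_by_vm() {
    grep -o 'VM [0-9]* qmp command failed' |
        awk '{ count[$2]++ } END { for (vm in count) print vm, count[vm] }' |
        sort -n
}

# Hypothetical helper: count lock timeouts per lock name, e.g.
# "rbd error: 'storage-fastwrx'-locked command timed out".
locked_storage_timeouts() {
    grep -o "'[^']*'-locked command timed out" |
        cut -d"'" -f2 | sort | uniq -c | awk '{ print $2, $1 }'
}
```

For example, `journalctl -b | qmp_failures_by_vm` and `journalctl -b | locked_storage_timeouts` on the affected node. If only 108 and only 'storage-fastwrx' show up, that points at something specific to that storage rather than the whole node.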
 
I mean, there is always the option to get a subscription [1] (at least basic) and open a support ticket (for setups supported by Proxmox).

Do not get me wrong, only meant as a hint. :)

[1] https://proxmox.com/en/proxmox-ve/pricing
I don't have $900 a year to spend on my 3-node homelab cluster. I would actually be willing to pay for a ticket on an individual basis, but that isn't an option.
 
