one node in 3-node cluster goes into "?" status

starkruzr

Well-Known Member
hi,

just upgraded to 7.2 and I'm noticing that my cluster is in a state where:
  1. my sidebar looks like this:
    [screenshot: sidebar with node ibnmajid showing a "?" status icon]
  2. I can't list or start containers on the "?" node, ibnmajid. VMs seem to work fine. Ceph seems to be fine. Info about that node:
    [screenshot: node info panel for ibnmajid]
  3. output of pvecm status:
Code:
root@ganges:~# pvecm status
Cluster information
-------------------
Name:             BrokenWorks
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Jul 9 01:39:05 2022
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1.45b
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.9.10
0x00000002          1 192.168.9.11 (local)
0x00000003          1 192.168.9.12
root@ganges:~#
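For completeness, a minimal sketch of the node-level checks that usually narrow down a "?" node (assuming the standard PVE service names, run on the affected node):
Code:
# pvestatd feeds the status the GUI shows; a "?" node usually means it is stuck
systemctl status pvestatd pve-cluster corosync
# recent log entries from the status daemon and the cluster filesystem
journalctl -b -u pvestatd -u pve-cluster
# /etc/pve is the pmxcfs mount; if it hangs, most PVE tooling hangs with it
ls /etc/pve/nodes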
What else can I provide to troubleshoot?
 
mostly I get hanging and timeouts. if I do `pct list` *right* after restarting the node, I get the list of containers on it. if I try to start, create, or delete one, the process hangs indefinitely, and if I cancel it, `pct list` will hang just like anything else to do with containers on it.
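For what it's worth, a sketch of how to see where a hung `pct` call is blocked (the PID is a placeholder; attach to the real one):
Code:
# find the hanging pct process
pgrep -af pct
# see which syscall it is blocked in (<PID> is a placeholder)
strace -f -p <PID>
# processes stuck in uninterruptible sleep (D state) point at hung I/O
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'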

another detail: VMs seem to mostly work on this node, but if I try to open a console on any of them, it also times out.

one more: if I start the MDS on this host, it sticks around for a minute and then seems to stop. on top of that, I have two active MDSes, which is not something I ever configured; it started doing that after I updated to 7.2.
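A sketch of how the MDS layout could be inspected here (two actives are actually expected if there are two CephFS volumes, since each filesystem needs its own active MDS):
Code:
# which daemon is active for which filesystem, and whether any standby exists
ceph fs status
# state transitions of this node's MDS around the time it drops out
journalctl -u ceph-mds@ibnmajid --since "1 hour ago"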

here's what happens in syslog when I try to start the MDS:
Code:
Jul 13 01:33:29 ibnmajid systemd[1]: Started Ceph metadata server daemon.
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: starting mds.ibnmajid at
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.661-0500 7fcfa8342700 -1 MDSIOContextBase: failed with -108, restarting...
Jul 13 01:33:29 ibnmajid ceph-mds[65187]:     -1> 2022-07-13T01:33:29.661-0500 7fcfa8342700 -1 MDSIOContextBase: failed with -108, restarting...
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.661-0500 7fcfa7b41700 -1 mds.0.log Blocklisted during JournalPointer read!  Respawning...
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: did not load config file, using default settings.
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: ignoring --setuser ceph since I am not root
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: ignoring --setgroup ceph since I am not root
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.673-0500 7f1340b1b780 -1 Errors while parsing config file!
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.673-0500 7f1340b1b780 -1 can't open ceph.conf: (2) No such file or directory
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: unable to get monitor info from DNS SRV with service name: ceph-mon
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.689-0500 7f1340b1b780 -1 failed for service _ceph-mon._tcp
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.689-0500 7f1340b1b780 -1 monclient: get_monmap_and_config cannot identify monitors to contact
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: failed to fetch mon config (--no-mon-config to skip)
Jul 13 01:33:29 ibnmajid systemd[1]: ceph-mds@ibnmajid.service: Main process exited, code=exited, status=1/FAILURE
Jul 13 01:33:29 ibnmajid systemd[1]: ceph-mds@ibnmajid.service: Failed with result 'exit-code'.
Jul 13 01:33:29 ibnmajid systemd[1]: ceph-mds@ibnmajid.service: Scheduled restart job, restart counter is at 3.
Jul 13 01:33:29 ibnmajid systemd[1]: Stopped Ceph metadata server daemon.
Jul 13 01:33:29 ibnmajid systemd[1]: ceph-mds@ibnmajid.service: Start request repeated too quickly.
Jul 13 01:33:29 ibnmajid systemd[1]: ceph-mds@ibnmajid.service: Failed with result 'exit-code'.
Jul 13 01:33:29 ibnmajid systemd[1]: Failed to start Ceph metadata server daemon.
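Two things stand out in that log: the daemon was blocklisted while reading its journal, and the respawned process could not find ceph.conf. On PVE, /etc/ceph/ceph.conf is normally a symlink into the pmxcfs mount, so a hung /etc/pve would be worth ruling out. A sketch of both checks (blocklist syntax as of Pacific):
Code:
# on PVE the ceph config lives on the cluster filesystem via a symlink
ls -l /etc/ceph/ceph.conf /etc/pve/ceph.conf
# was this MDS blocklisted by the monitors?
ceph osd blocklist ls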
 
what does 'ceph -s' say?

what else is running on that node?

can you post part of the journal when it hangs?
 
Code:
root@ibnmajid:~# ceph -s
  cluster:
    id:     310af567-1607-402b-bc5d-c62286a129d5
    health: HEALTH_WARN
            insufficient standby MDS daemons available

  services:
    mon: 3 daemons, quorum ibnmajid,ganges,riogrande (age 14h)
    mgr: riogrande(active, since 14h)
    mds: 2/2 daemons up
    osd: 18 osds: 18 up (since 14h), 18 in (since 14h)

  data:
    volumes: 2/2 healthy
    pools:   7 pools, 1537 pgs
    objects: 804.39k objects, 1.9 TiB
    usage:   4.2 TiB used, 10 TiB / 14 TiB avail
    pgs:     1537 active+clean

  io:
    client:   138 KiB/s wr, 0 op/s rd, 9 op/s wr
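On the HEALTH_WARN itself: with two filesystems and only two MDS daemons up (the third won't stay running, per the log above), there is no standby left, which is exactly what the warning says. A sketch to confirm (the fs name is a placeholder):
Code:
# list the filesystems, then check how many standbys each one wants
ceph fs ls
ceph fs get <fsname> | grep standby_count_wanted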

what else is running on that node?
two VMs. one is a Windows 10 instance, one is a random Linux VM I don't need.

can you post part of the journal when it hangs?
Code:
root@ibnmajid:~# pct destroy 106
rbd error: 'storage-fastwrx'-locked command timed out - aborting
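That timeout looks like PVE's per-storage lock rather than an RBD image lock: some earlier rbd/pct command is presumably still holding it. A sketch for hunting the culprit (these process names are the usual suspects, not a guarantee):
Code:
# long-running rbd or pct processes that may still hold the storage lock
ps faxww | grep -E '[r]bd|[p]ct'
# stuck kernel RBD maps can also wedge storage operations
rbd showmapped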

from the journal:
Code:
Jul 13 13:22:01 ibnmajid pct[620159]: <root@pam> starting task UPID:ibnmajid:00097691:004F00C4:62CF0D49:vzdestroy:106:root@pam:
Jul 13 13:22:05 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:06 ibnmajid pvestatd[1504]: status update time (6.289 seconds)
Jul 13 13:22:15 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:15 ibnmajid pvestatd[1504]: status update time (6.264 seconds)
Jul 13 13:22:25 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:25 ibnmajid pvestatd[1504]: status update time (6.292 seconds)
Jul 13 13:22:35 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:36 ibnmajid pvestatd[1504]: status update time (6.248 seconds)
Jul 13 13:22:45 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:45 ibnmajid pvestatd[1504]: status update time (6.302 seconds)
Jul 13 13:22:49 ibnmajid pvedaemon[610587]: <root@pam> end task UPID:ibnmajid:0009721C:004EDF0D:62CF0CF2:vncproxy:102:root@pam: OK
Jul 13 13:22:55 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:55 ibnmajid pvedaemon[610587]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout afte>
Jul 13 13:22:55 ibnmajid pvestatd[1504]: status update time (6.269 seconds)
Jul 13 13:22:55 ibnmajid pvedaemon[600231]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout afte>
Jul 13 13:22:55 ibnmajid pvedaemon[600231]: <root@pam> starting task UPID:ibnmajid:00097998:004F1612:62CF0D7F:vncproxy:108:root@pam:
Jul 13 13:22:55 ibnmajid pvedaemon[620952]: starting vnc proxy UPID:ibnmajid:00097998:004F1612:62CF0D7F:vncproxy:108:root@pam:
Jul 13 13:22:59 ibnmajid qm[620954]: VM 108 qmp command failed - VM 108 qmp command 'set_password' failed - unable to connect to VM 108 qmp socket - timeout after 31 retries
Jul 13 13:22:59 ibnmajid pvedaemon[620952]: Failed to run vncproxy.
Jul 13 13:22:59 ibnmajid pvedaemon[600231]: <root@pam> end task UPID:ibnmajid:00097998:004F1612:62CF0D7F:vncproxy:108:root@pam: Failed to run vncproxy.
Jul 13 13:23:01 ibnmajid pct[620177]: rbd error: 'storage-fastwrx'-locked command timed out - aborting
Jul 13 13:23:01 ibnmajid pct[620159]: <root@pam> end task UPID:ibnmajid:00097691:004F00C4:62CF0D49:vzdestroy:106:root@pam: rbd error: 'storage-fastwrx'-locked command timed out >
root@ibnmajid:~#
108 is the Windows VM whose console won't accept the VNC connection.
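The vncproxy failure is downstream of the QMP timeouts: pvedaemon can't reach VM 108's monitor socket to set the VNC password. A sketch for probing that (standard PVE socket/pid paths):
Code:
# the QMP socket pvedaemon is timing out on
ls -l /var/run/qemu-server/108.qmp
# is the kvm process alive, and is it stuck in uninterruptible sleep?
ps -o pid,stat,wchan:32,cmd -p "$(cat /var/run/qemu-server/108.pid)"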
 
I mean, there is always the option to get a subscription [1] (at least Basic) and open a support ticket (for setups supported by Proxmox).

Don't get me wrong, this is only meant as a hint. :)

[1] https://proxmox.com/en/proxmox-ve/pricing
I don't have $900 a year to spend on my 3-node homelab cluster. I would actually be willing to pay for a ticket on an individual basis, but that isn't an option.