one node in 3-node cluster goes into "?" status

starkruzr

Well-Known Member
hi,

just upgraded to 7.2 and I'm noticing that my cluster is in a state where:
  1. my sidebar looks like this:
    [screenshot: sidebar with node ibnmajid showing a "?" status icon]
  2. I can't list or start containers on the "?" node, ibnmajid. VMs seem to work fine. Ceph seems to be fine. Info about that node:
    [screenshot: node info panel for ibnmajid]
  3. output of pvecm status:
Code:
root@ganges:~# pvecm status
Cluster information
-------------------
Name:             BrokenWorks
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Jul 9 01:39:05 2022
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1.45b
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.9.10
0x00000002          1 192.168.9.11 (local)
0x00000003          1 192.168.9.12
root@ganges:~#
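For completeness, a minimal sketch of the node-level checks that usually narrow down a "?" node (assuming the standard PVE service names, run on the affected node):
Code:
# pvestatd feeds the status the GUI shows; a "?" node usually means it is stuck
systemctl status pvestatd pve-cluster corosync
# recent log entries from the status daemon and the cluster filesystem
journalctl -b -u pvestatd -u pve-cluster
# /etc/pve is the pmxcfs mount; if it hangs, most PVE tooling hangs with it
ls /etc/pve/nodes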
What else can I provide to troubleshoot?
 
mostly I get hanging and timeouts. if I do `pct list` *right* after restarting the node, I get the list of containers on it. if I try to start, create, or delete one, the process hangs indefinitely, and if I cancel it, `pct list` will hang just like anything else to do with containers on it.
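For what it's worth, a sketch of how to see where a hung `pct` call is blocked (the PID is a placeholder; attach to the real one):
Code:
# find the hanging pct process
pgrep -af pct
# see which syscall it is blocked in (<PID> is a placeholder)
strace -f -p <PID>
# processes stuck in uninterruptible sleep (D state) point at hung I/O
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'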

another detail: VMs seem to mostly work on this node, but if I try to open a console on any of them, it also times out.

one more: if I start the MDS on this host, it sticks around for a minute and then seems to stop. on top of that, I have two active MDSes, which is not something I ever configured; it started doing that after I updated to 7.2.
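A sketch of how the MDS layout could be inspected here (two actives are actually expected if there are two CephFS volumes, since each filesystem needs its own active MDS):
Code:
# which daemon is active for which filesystem, and whether any standby exists
ceph fs status
# state transitions of this node's MDS around the time it drops out
journalctl -u ceph-mds@ibnmajid --since "1 hour ago"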

here's what happens in syslog when I try to start the MDS:
Code:
Jul 13 01:33:29 ibnmajid systemd[1]: Started Ceph metadata server daemon.
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: starting mds.ibnmajid at
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.661-0500 7fcfa8342700 -1 MDSIOContextBase: failed with -108, restarting...
Jul 13 01:33:29 ibnmajid ceph-mds[65187]:     -1> 2022-07-13T01:33:29.661-0500 7fcfa8342700 -1 MDSIOContextBase: failed with -108, restarting...
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.661-0500 7fcfa7b41700 -1 mds.0.log Blocklisted during JournalPointer read!  Respawning...
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: did not load config file, using default settings.
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: ignoring --setuser ceph since I am not root
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: ignoring --setgroup ceph since I am not root
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.673-0500 7f1340b1b780 -1 Errors while parsing config file!
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.673-0500 7f1340b1b780 -1 can't open ceph.conf: (2) No such file or directory
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: unable to get monitor info from DNS SRV with service name: ceph-mon
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.689-0500 7f1340b1b780 -1 failed for service _ceph-mon._tcp
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: 2022-07-13T01:33:29.689-0500 7f1340b1b780 -1 monclient: get_monmap_and_config cannot identify monitors to contact
Jul 13 01:33:29 ibnmajid ceph-mds[65187]: failed to fetch mon config (--no-mon-config to skip)
Jul 13 01:33:29 ibnmajid systemd[1]: ceph-mds@ibnmajid.service: Main process exited, code=exited, status=1/FAILURE
Jul 13 01:33:29 ibnmajid systemd[1]: ceph-mds@ibnmajid.service: Failed with result 'exit-code'.
Jul 13 01:33:29 ibnmajid systemd[1]: ceph-mds@ibnmajid.service: Scheduled restart job, restart counter is at 3.
Jul 13 01:33:29 ibnmajid systemd[1]: Stopped Ceph metadata server daemon.
Jul 13 01:33:29 ibnmajid systemd[1]: ceph-mds@ibnmajid.service: Start request repeated too quickly.
Jul 13 01:33:29 ibnmajid systemd[1]: ceph-mds@ibnmajid.service: Failed with result 'exit-code'.
Jul 13 01:33:29 ibnmajid systemd[1]: Failed to start Ceph metadata server daemon.
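Two things stand out in that log: the daemon was blocklisted while reading its journal, and the respawned process could not find ceph.conf. On PVE, /etc/ceph/ceph.conf is normally a symlink into the pmxcfs mount, so a hung /etc/pve would be worth ruling out. A sketch of both checks (blocklist syntax as of Pacific):
Code:
# on PVE the ceph config lives on the cluster filesystem via a symlink
ls -l /etc/ceph/ceph.conf /etc/pve/ceph.conf
# was this MDS blocklisted by the monitors?
ceph osd blocklist ls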
 
what does 'ceph -s' say?

what else is running on that node?

can you post part of the journal when it hangs?
 
Code:
root@ibnmajid:~# ceph -s
  cluster:
    id:     310af567-1607-402b-bc5d-c62286a129d5
    health: HEALTH_WARN
            insufficient standby MDS daemons available

  services:
    mon: 3 daemons, quorum ibnmajid,ganges,riogrande (age 14h)
    mgr: riogrande(active, since 14h)
    mds: 2/2 daemons up
    osd: 18 osds: 18 up (since 14h), 18 in (since 14h)

  data:
    volumes: 2/2 healthy
    pools:   7 pools, 1537 pgs
    objects: 804.39k objects, 1.9 TiB
    usage:   4.2 TiB used, 10 TiB / 14 TiB avail
    pgs:     1537 active+clean

  io:
    client:   138 KiB/s wr, 0 op/s rd, 9 op/s wr
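On the HEALTH_WARN itself: with two filesystems and only two MDS daemons up (the third won't stay running, per the log above), there is no standby left, which is exactly what the warning says. A sketch to confirm (the fs name is a placeholder):
Code:
# list the filesystems, then check how many standbys each one wants
ceph fs ls
ceph fs get <fsname> | grep standby_count_wanted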

what else is running on that node?
two VMs. one is a Windows 10 instance, one is a random Linux VM I don't need.

can you post part of the journal when it hangs?
Code:
root@ibnmajid:~# pct destroy 106
rbd error: 'storage-fastwrx'-locked command timed out - aborting
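That timeout looks like PVE's per-storage lock rather than an RBD image lock: some earlier rbd/pct command is presumably still holding it. A sketch for hunting the culprit (these process names are the usual suspects, not a guarantee):
Code:
# long-running rbd or pct processes that may still hold the storage lock
ps faxww | grep -E '[r]bd|[p]ct'
# stuck kernel RBD maps can also wedge storage operations
rbd showmapped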

from the journal:
Code:
Jul 13 13:22:01 ibnmajid pct[620159]: <root@pam> starting task UPID:ibnmajid:00097691:004F00C4:62CF0D49:vzdestroy:106:root@pam:
Jul 13 13:22:05 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:06 ibnmajid pvestatd[1504]: status update time (6.289 seconds)
Jul 13 13:22:15 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:15 ibnmajid pvestatd[1504]: status update time (6.264 seconds)
Jul 13 13:22:25 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:25 ibnmajid pvestatd[1504]: status update time (6.292 seconds)
Jul 13 13:22:35 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:36 ibnmajid pvestatd[1504]: status update time (6.248 seconds)
Jul 13 13:22:45 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:45 ibnmajid pvestatd[1504]: status update time (6.302 seconds)
Jul 13 13:22:49 ibnmajid pvedaemon[610587]: <root@pam> end task UPID:ibnmajid:0009721C:004EDF0D:62CF0CF2:vncproxy:102:root@pam: OK
Jul 13 13:22:55 ibnmajid pvestatd[1504]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout after 3>
Jul 13 13:22:55 ibnmajid pvedaemon[610587]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout afte>
Jul 13 13:22:55 ibnmajid pvestatd[1504]: status update time (6.269 seconds)
Jul 13 13:22:55 ibnmajid pvedaemon[600231]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - unable to connect to VM 108 qmp socket - timeout afte>
Jul 13 13:22:55 ibnmajid pvedaemon[600231]: <root@pam> starting task UPID:ibnmajid:00097998:004F1612:62CF0D7F:vncproxy:108:root@pam:
Jul 13 13:22:55 ibnmajid pvedaemon[620952]: starting vnc proxy UPID:ibnmajid:00097998:004F1612:62CF0D7F:vncproxy:108:root@pam:
Jul 13 13:22:59 ibnmajid qm[620954]: VM 108 qmp command failed - VM 108 qmp command 'set_password' failed - unable to connect to VM 108 qmp socket - timeout after 31 retries
Jul 13 13:22:59 ibnmajid pvedaemon[620952]: Failed to run vncproxy.
Jul 13 13:22:59 ibnmajid pvedaemon[600231]: <root@pam> end task UPID:ibnmajid:00097998:004F1612:62CF0D7F:vncproxy:108:root@pam: Failed to run vncproxy.
Jul 13 13:23:01 ibnmajid pct[620177]: rbd error: 'storage-fastwrx'-locked command timed out - aborting
Jul 13 13:23:01 ibnmajid pct[620159]: <root@pam> end task UPID:ibnmajid:00097691:004F00C4:62CF0D49:vzdestroy:106:root@pam: rbd error: 'storage-fastwrx'-locked command timed out >
root@ibnmajid:~#
108 is the Windows VM whose console won't accept the VNC connection.
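The vncproxy failure is downstream of the QMP timeouts: pvedaemon can't reach VM 108's monitor socket to set the VNC password. A sketch for probing that (standard PVE socket/pid paths):
Code:
# the QMP socket pvedaemon is timing out on
ls -l /var/run/qemu-server/108.qmp
# is the kvm process alive, and is it stuck in uninterruptible sleep?
ps -o pid,stat,wchan:32,cmd -p "$(cat /var/run/qemu-server/108.pid)"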
 
I mean, there is always the option to get a subscription [1] (at least Basic) and open a support ticket (for setups supported by Proxmox).

Don't get me wrong, this is only meant as a hint. :)

[1] https://proxmox.com/en/proxmox-ve/pricing
I don't have $900 a year to spend on my 3-node homelab cluster. I would actually be willing to pay for a ticket on an individual basis, but that isn't an option.