7.3-3, three-node cluster: one node goes all "question marks" on me and no Ceph connection on it responds unless it already existed when the problem started

starkruzr

Well-Known Member
hi,

I have had a problem with one node in my cluster for months now with no idea how to fix it, and no one on this forum or on the Ceph mailing list has ever replied, which makes me think no one has seen this before and no one knows how to troubleshoot it. So now I am trying to figure out how to wipe a node that is functioning as a Ceph node and reinstall it without losing any Ceph data.

The issue: about six months ago this node started refusing to respond when I tried to stop or move the containers and VMs on it. It will start them after a reboot, but any attempt to do anything else to them results in a long wait until the task eventually fails with "TASK ERROR: rbd error: 'storage-fastwrx'-locked command timed out - aborting" ('fastwrx' being the name of the all-SSD pool I use for block devices for rootvols). I get the same error if I try to create any containers or VMs on it. If errors related to this are logged anywhere else, I haven't found them yet: there is nothing notable in syslog, dmesg, or /var/log/ceph.log, and all of the OSDs seem to be ticking along without incident. `ceph health` says everything is fine except that there aren't enough standby MDSes, which I think changed when I updated to Quincy -- I have two CephFS filesystems and I guess it now wants a standby MDS for each. I only have 3 nodes, so that's not an option for now.
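In case it's useful for anyone reading, here is a sketch of the two checks I can run to narrow this down (the pool name is the one above; the CephFS name is a placeholder):

Code:
# time a plain RBD listing against the pool -- if this also hangs, the problem
# is on the Ceph/client side rather than in the PVE storage layer
time rbd -p fastwrx ls

# the "insufficient standby MDS daemons" warning can be silenced per filesystem
# if adding another standby isn't an option ('myfs' is a placeholder name)
ceph fs set myfs standby_count_wanted 0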

Most importantly, I cannot create any new VMs or containers on the node's Ceph storage; I get the same complaint about the fastwrx pool. (Creating them on "local" and "local-lvm" storage does work.) This means all my workloads are slowly shifting to the other two nodes, which is not sustainable and wastes the resources of the "broken" node. There is nothing running on the "broken" node that I need to keep, but I do need the data on its OSDs.

The other day the situation got worse: even though pvecm status showed everything as fine, the node showed nothing but "question marks" for itself and all of its VMs and containers, and it wouldn't serve the web interface. A reboot seems to have fixed this temporarily, but I'm afraid things are continuing to deteriorate.

I am guessing no one knows how to help me troubleshoot this, but I'm hoping someone has an idea of how I can nuke the node and reinstall it without screwing up the rest of the cluster.

TIA.
 
I assume you have a default pool setup with size 3 and min_size 2?
In this case simply remove any monitors, managers and MDSs on the node you want to remove.
Then remove the OSDs on that node.

Once that is done you can follow our docs on how to remove a node from the cluster: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node
(There's also a section on how to remove a node without reinstalling it, if required)

This should cleanly remove the node from the cluster, and you should be able to add it again once it's reinstalled.
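Roughly, the commands would look something like this (a sketch only; the node name and OSD id are just examples from this thread, and the OSDs should only be destroyed once the cluster has finished rebalancing):

Code:
# on the node being removed
pveceph mds destroy ibnmajid
pveceph mon destroy ibnmajid
pveceph mgr destroy ibnmajid

# for each OSD on that node: take it out, wait for rebalancing,
# then stop and destroy it
ceph osd out 12
systemctl stop ceph-osd@12
pveceph osd destroy 12 --cleanup

# finally, from one of the remaining nodes, remove it from the PVE cluster
pvecm delnode ibnmajid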

If you want us to take a look at the situation, I'd ask you to provide the output of `pvecm status`, `pveceph status`, the ceph config `/etc/pve/ceph.conf` and the network config `/etc/network/interfaces`.

Do you see any syslog messages mentioning `pvestatd` and errors or warnings?
 
Hi, thanks so much for replying - I actually turned it down to 2/2 for more space. I haven't run into any issues with it yet, though of course I can always put it back up to 3/2 if that's a prerequisite for getting things migrated.

I can do that:

Code:
root@ibnmajid:~# pvecm status
Cluster information
-------------------
Name:             BrokenWorks
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Dec  1 18:41:14 2022
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.4ec
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.9.10 (local)
0x00000002          1 192.168.9.11
0x00000003          1 192.168.9.12

Code:
root@ibnmajid:~# pveceph status
  cluster:
    id:     310af567-1607-402b-bc5d-c62286a129d5
    health: HEALTH_WARN
            insufficient standby MDS daemons available

  services:
    mon: 3 daemons, quorum ibnmajid,ganges,riogrande (age 13h)
    mgr: riogrande(active, since 13h)
    mds: 2/2 daemons up, 1 hot standby
    osd: 18 osds: 18 up (since 13h), 18 in (since 39h)

  data:
    volumes: 2/2 healthy
    pools:   7 pools, 1537 pgs
    objects: 959.76k objects, 2.0 TiB
    usage:   4.2 TiB used, 10 TiB / 14 TiB avail
    pgs:     1537 active+clean

  io:
    client:   86 KiB/s rd, 294 KiB/s wr, 14 op/s rd, 20 op/s wr

Code:
root@ibnmajid:~# cat /etc/pve/ceph.conf
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 192.168.9.10/24
     fsid = 310af567-1607-402b-bc5d-c62286a129d5
     mon_allow_pool_delete = true
     mon_host = 192.168.9.10 192.168.9.11 192.168.9.12
     osd_pool_default_min_size = 2
     osd_pool_default_size = 2
     public_network = 192.168.9.10/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.ganges]
     host = ganges
     mds_standby_for_name = pve

[mds.ibnmajid]
     host = ibnmajid
     mds_standby_for_name = pve

[mds.riogrande]
     host = riogrande
     mds_standby_for_name = pve

Code:
root@ibnmajid:~# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface enp35s0 inet manual

iface enp36s0 inet manual

iface enp43s0f0 inet manual

iface enp43s0f1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.9.10/24
    gateway 192.168.9.1
    bridge-ports enp1s0f0
    bridge-stp off
    bridge-fd 0

Code:
root@ibnmajid:~# cat /var/log/syslog | grep pvestatd | tail -n 30
Nov 30 03:05:08 ibnmajid pvestatd[1695]: mount error: Job failed. See "journalctl -xe" for details.
Nov 30 03:05:08 ibnmajid pvestatd[1695]: mount error: Job failed. See "journalctl -xe" for details.
Nov 30 03:05:18 ibnmajid pvestatd[1695]: mount error: Job failed. See "journalctl -xe" for details.
Nov 30 03:05:18 ibnmajid pvestatd[1695]: mount error: Job failed. See "journalctl -xe" for details.
Nov 30 03:05:29 ibnmajid pvestatd[1695]: mount error: Job failed. See "journalctl -xe" for details.
Nov 30 03:05:29 ibnmajid pvestatd[1695]: mount error: Job failed. See "journalctl -xe" for details.
Nov 30 03:05:39 ibnmajid pvestatd[1695]: mount error: Job failed. See "journalctl -xe" for details.
Nov 30 04:20:54 ibnmajid pvestatd[1695]: status update time (314.785 seconds)
Nov 30 22:04:49 ibnmajid pvestatd[1695]: auth key pair too old, rotating..
Dec  1 04:09:05 ibnmajid pvestatd[622601]: ipcc_send_rec[1] failed: Connection refused
Dec  1 04:09:05 ibnmajid pvestatd[622601]: ipcc_send_rec[2] failed: Connection refused
Dec  1 04:09:05 ibnmajid pvestatd[622601]: ipcc_send_rec[3] failed: Connection refused
Dec  1 04:09:05 ibnmajid pvestatd[622601]: Unable to load access control list: Connection refused
Dec  1 04:09:05 ibnmajid pvestatd[622601]: ipcc_send_rec[1] failed: Connection refused
Dec  1 04:09:05 ibnmajid pvestatd[622601]: ipcc_send_rec[2] failed: Connection refused
Dec  1 04:09:05 ibnmajid pvestatd[622601]: ipcc_send_rec[3] failed: Connection refused
Dec  1 04:09:05 ibnmajid systemd[1]: pvestatd.service: Control process exited, code=exited, status=111/n/a
Dec  1 04:09:05 ibnmajid pvestatd[1695]: received signal TERM
Dec  1 04:09:05 ibnmajid pvestatd[1695]: server closing
Dec  1 04:09:05 ibnmajid pvestatd[1695]: server stopped
Dec  1 04:09:05 ibnmajid systemd[1]: pvestatd.service: Failed with result 'exit-code'.
Dec  1 04:09:05 ibnmajid systemd[1]: pvestatd.service: Consumed 22min 17.691s CPU time.
Dec  1 04:09:59 ibnmajid pvestatd[622674]: starting server
Dec  1 04:24:37 ibnmajid pvestatd[622674]: Use of uninitialized value in subtraction (-) at /usr/share/perl5/PVE/LXC.pm line 251.
Dec  1 04:24:37 ibnmajid pvestatd[622674]: status update time (868.740 seconds)
Dec  1 04:55:59 ibnmajid pvestatd[1687]: starting server
Dec  1 04:57:42 ibnmajid pvestatd[1687]: unable to get PID for CT 101 (not running?)
Dec  1 04:57:43 ibnmajid pvestatd[1687]: status update time (93.902 seconds)
Dec  1 16:35:13 ibnmajid pvestatd[1687]: modified cpu set for lxc/101: 0
Dec  1 16:35:14 ibnmajid pvestatd[1687]: status update time (3887.687 seconds)
 
With size 2 and min_size 2 you can't lose a node or even a single OSD, since the affected PGs would drop below min_size.
With size 3 you can lose at least 1 or 2 OSDs, or, depending on the current usage of the pool, even a complete host, as long as at least 2 others remain.
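Going back to 3/2 is just a per-pool setting; a sketch using the pool name from this thread (repeat for your other pools):

Code:
ceph osd pool set fastwrx size 3
ceph osd pool set fastwrx min_size 2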

The logs hint at there not being a quorum. Check the output of `systemctl status pve-cluster.service` and the output of `systemctl status corosync.service`.
 
Code:
root@ibnmajid:~# sudo systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2022-12-01 04:55:58 CST; 1 day 1h ago
   Main PID: 1207 (pmxcfs)
      Tasks: 9 (limit: 77006)
     Memory: 70.0M
        CPU: 1min 34.284s
     CGroup: /system.slice/pve-cluster.service
             └─1207 /usr/bin/pmxcfs

Dec 02 05:15:28 ibnmajid pmxcfs[1207]: [status] notice: received log
Dec 02 05:19:32 ibnmajid pmxcfs[1207]: [status] notice: received log
Dec 02 05:19:33 ibnmajid pmxcfs[1207]: [status] notice: received log
Dec 02 05:31:29 ibnmajid pmxcfs[1207]: [status] notice: received log
Dec 02 05:34:34 ibnmajid pmxcfs[1207]: [status] notice: received log
Dec 02 05:34:34 ibnmajid pmxcfs[1207]: [status] notice: received log
Dec 02 05:47:29 ibnmajid pmxcfs[1207]: [status] notice: received log
Dec 02 05:49:35 ibnmajid pmxcfs[1207]: [status] notice: received log
Dec 02 05:49:35 ibnmajid pmxcfs[1207]: [status] notice: received log
Dec 02 05:55:58 ibnmajid pmxcfs[1207]: [dcdb] notice: data verification successful

Code:
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2022-12-01 04:55:58 CST; 1 day 1h ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 1363 (corosync)
      Tasks: 9 (limit: 77006)
     Memory: 135.1M
        CPU: 11min 51.979s
     CGroup: /system.slice/corosync.service
             └─1363 /usr/sbin/corosync -f

Dec 01 04:56:01 ibnmajid corosync[1363]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Dec 01 04:56:01 ibnmajid corosync[1363]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Dec 01 04:56:01 ibnmajid corosync[1363]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Dec 01 04:56:01 ibnmajid corosync[1363]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Dec 01 04:56:01 ibnmajid corosync[1363]:   [QUORUM] Sync members[3]: 1 2 3
Dec 01 04:56:01 ibnmajid corosync[1363]:   [QUORUM] Sync joined[2]: 2 3
Dec 01 04:56:01 ibnmajid corosync[1363]:   [TOTEM ] A new membership (1.4ec) was formed. Members joined: 2 3
Dec 01 04:56:01 ibnmajid corosync[1363]:   [QUORUM] This node is within the primary component and will provide service.
Dec 01 04:56:01 ibnmajid corosync[1363]:   [QUORUM] Members[3]: 1 2 3
Dec 01 04:56:01 ibnmajid corosync[1363]:   [MAIN  ] Completed service synchronization, ready to provide service.

I think there was no quorum for a short while when I had my "?" problem with the problematic node. I've had quorum since then.
 
Yes, that would explain it.

What's the current situation? Is it stable now?
 
It is stable, but no VM or container started on the problematic node is able to use Ceph block device resources; it times out trying to access them, UNLESS the VM or container already existed when this problem started ~6 months ago. The node is also unable to move containers or VMs off of Ceph storage onto anything else.
 
Can you try migrating a VM or container and provide the unfiltered journal/syslog around that time (~5-10 minutes before the migration until 5-10 minutes after the migration failed)?
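Something like this should capture the relevant window (the timestamps are placeholders):

Code:
journalctl --since "2022-12-05 10:00" --until "2022-12-05 10:45" > migration-window.log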
 
So, I have since reformatted and reinstalled all three nodes and added a fourth. I am able to migrate a VM off of and back onto the problematic node (Orinoco), but if I try to create a new VM on it, I get all kinds of "timed out" errors. I don't know how to get it to be more specific, unfortunately. You'll see some complaints about osd.12 in here, which is on Orinoco but is not part of the pool that the VM's rootvol is being created on.

syslog: https://pastebin.com/1pbcnd5Q
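In case it helps, here's a sketch of what I can run to look at osd.12 specifically (the admin-socket command has to run on Orinoco, which hosts that OSD):

Code:
ceph osd perf                          # commit/apply latency of every OSD
ceph tell osd.12 bench                 # quick write benchmark of just this OSD
ceph daemon osd.12 dump_ops_in_flight  # run on orinoco: any stuck ops?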

Another strange detail: 102 was created on this node right after I reinstalled, with no issues. 105 is the new one I just tried to create; it "exists" but again has problems using the 32 GB image created as its rootvol on fastwrx, the all-flash Ceph pool.

Code:
root@orinoco:/var/log# qm status 105
status: running
root@orinoco:/var/log# qm status 102
status: running
root@orinoco:/var/log# qm status 105
status: running

`qm status 105` above took about 8 seconds to return "status: running", while `qm status 102` returned almost immediately. I tried 105 again and got roughly the same 8-second delay. Very, very weird behavior.
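Here's a sketch of how I could compare that delay against talking to the pool and image directly (the image name is my guess at what PVE created for 105):

Code:
time qm status 105
# compare against querying the pool/image directly
time rbd -p fastwrx ls
time rbd info fastwrx/vm-105-disk-0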
 
Do you have the same MTU configured on all nodes on the Ceph network(s)?
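As a sketch, assuming the Ceph network runs over vmbr0 at the default MTU of 1500, you can compare the configured and the actually usable MTU on each node like this:

Code:
# configured MTU on the bridge carrying the Ceph network
ip link show vmbr0 | grep -o 'mtu [0-9]*'

# path MTU test to another node: 1472 = 1500 minus 28 bytes of IP/ICMP header;
# for jumbo frames (MTU 9000) you'd use -s 8972 instead
ping -M do -s 1472 -c 3 192.168.9.11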
 
