[SOLVED] Update 7.0-11 to 7.1-8 Ceph issues

enderst

Updated 7.0-11 to 7.1-8 and ceph is giving me grief.
Code:
Dec 16 15:42:06 hoc-node01 systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
Dec 16 15:42:06 hoc-node01 systemd[1]: Mounting /mnt/pve/cephfs...
Dec 16 15:42:07 hoc-node01 kernel: [16124.141954] libceph: mon0 (1)172.16.30.21:6789 socket closed (con state OPEN)
Dec 16 15:42:18 hoc-node01 kernel: [16135.640535] libceph: mon3 (1)172.16.30.24:6789 socket closed (con state OPEN)
Dec 16 15:42:19 hoc-node01 kernel: [16136.249131] libceph: mon1 (1)172.16.30.22:6789 socket closed (con state V1_BANNER)
Dec 16 15:42:19 hoc-node01 kernel: [16136.501125] libceph: mon1 (1)172.16.30.22:6789 socket closed (con state V1_BANNER)
Dec 16 15:42:19 hoc-node01 kernel: [16137.017131] libceph: mon1 (1)172.16.30.22:6789 socket closed (con state V1_BANNER)
Dec 16 15:42:21 hoc-node01 kernel: [16138.041112] libceph: mon1 (1)172.16.30.22:6789 socket closed (con state V1_BANNER)
Dec 16 15:42:31 hoc-node01 kernel: [16148.281160] libceph: mon1 (1)172.16.30.22:6789 socket closed (con state V1_BANNER)
Dec 16 15:42:31 hoc-node01 kernel: [16148.537104] libceph: mon1 (1)172.16.30.22:6789 socket closed (con state V1_BANNER)
Dec 16 15:42:32 hoc-node01 kernel: [16149.049142] libceph: mon1 (1)172.16.30.22:6789 socket closed (con state V1_BANNER)
Dec 16 15:42:33 hoc-node01 kernel: [16150.073158] libceph: mon1 (1)172.16.30.22:6789 socket closed (con state V1_BANNER)
Dec 16 15:42:38 hoc-node01 kernel: [16155.196974] libceph: mon2 (1)172.16.30.23:6789 socket error on write
Dec 16 15:42:38 hoc-node01 kernel: [16155.448972] libceph: mon2 (1)172.16.30.23:6789 socket error on write
Dec 16 15:42:38 hoc-node01 kernel: [16155.960941] libceph: mon2 (1)172.16.30.23:6789 socket error on write
Dec 16 15:42:39 hoc-node01 kernel: [16156.984945] libceph: mon2 (1)172.16.30.23:6789 socket error on write
Dec 16 15:42:43 hoc-node01 kernel: [16160.657483] libceph: mon3 (1)172.16.30.24:6789 socket closed (con state OPEN)
Another cluster with the same hardware updated just fine.
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-8 (running version: 7.1-8/5b267f33)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-4
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.11.22-2-pve: 5.11.22-4
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: not correctly installed
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.2.0-3
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3
Any ideas how to get Ceph online again? I haven't found a similar issue in the forum, and it's weird that the other cluster updated fine.
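(For anyone hitting the same symptoms: a rough first step, assuming the monitor units follow the standard ceph-mon@<hostname>.service naming, is to check whether the monitors are running at all and what they last logged on each node.)
Code:
# is the monitor on this node up, and what did it last log?
systemctl status ceph-mon@hoc-node01.service
journalctl -u ceph-mon@hoc-node01.service -b --no-pager | tail -n 50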
 
I added the following to /etc/pve/ceph.conf per the doc and Ceph came back to life. I then removed it and restarted a monitor, and ceph-mon for that host would not start. Re-adding it and restarting ceph-mon for that host still failed. Rebooted and it's back. Ceph is cleaning; I'll try again after that finishes.
Code:
[mon]
    mon_mds_skip_sanity = true
https://pve.proxmox.com/wiki/Ceph_O..._crashes_after_minor_16.2.6_to_16.2.7_upgrade
BTW this was Ceph 16.2.5 to 16.2.7.
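For reference, a rough sketch of applying that workaround, assuming the [mon] section is added to the cluster-wide /etc/pve/ceph.conf and the monitors are then restarted one node at a time:
Code:
# /etc/pve/ceph.conf is shared across the cluster, so the [mon] section
# with mon_mds_skip_sanity = true only needs to be added once.
# Then restart the monitor on each node, waiting for quorum in between:
systemctl restart ceph-mon@hoc-node01.service
ceph -s   # confirm the mon rejoined quorum before moving to the next node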
 
Cleaning finished overnight; it was very slow. I removed the sanity check from ceph.conf and restarted the monitor for that host, and it fails to start.
Code:
# systemctl restart ceph-mon@hoc-node03.service
# systemctl status ceph-mon@hoc-node03.service
● ceph-mon@hoc-node03.service - Ceph cluster monitor daemon
     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
             └─ceph-after-pve-cluster.conf
     Active: failed (Result: signal) since Fri 2021-12-17 08:33:09 MST; 58s ago
    Process: 333391 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id hoc-node03 --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
   Main PID: 333391 (code=killed, signal=ABRT)
        CPU: 104ms

Dec 17 08:33:09 hoc-node03 systemd[1]: ceph-mon@hoc-node03.service: Scheduled restart job, restart counter is at 5.
Dec 17 08:33:09 hoc-node03 systemd[1]: Stopped Ceph cluster monitor daemon.
Dec 17 08:33:09 hoc-node03 systemd[1]: ceph-mon@hoc-node03.service: Start request repeated too quickly.
Dec 17 08:33:09 hoc-node03 systemd[1]: ceph-mon@hoc-node03.service: Failed with result 'signal'.
Dec 17 08:33:09 hoc-node03 systemd[1]: Failed to start Ceph cluster monitor daemon.
Dec 17 08:33:58 hoc-node03 systemd[1]: ceph-mon@hoc-node03.service: Start request repeated too quickly.
Dec 17 08:33:58 hoc-node03 systemd[1]: ceph-mon@hoc-node03.service: Failed with result 'signal'.
Dec 17 08:33:58 hoc-node03 systemd[1]: Failed to start Ceph cluster monitor daemon.

# ceph -s
  cluster:
    id:     231c3ca9-5ead-4497-b375-5ae6e722e6dd
    health: HEALTH_WARN
            1/4 mons down, quorum hoc-node01,hoc-node02,hoc-node04
            5 daemons have recently crashed

  services:
    mon: 4 daemons, quorum hoc-node01,hoc-node02,hoc-node04 (age 9m), out of quorum: hoc-node03
    mgr: hoc-node02(active, since 3m), standbys: hoc-node01, hoc-node04, hoc-node03
    mds: 1/1 daemons up, 3 standby
    osd: 20 osds: 20 up (since 85s), 20 in (since 9d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 199.03k objects, 739 GiB
    usage:   2.0 TiB used, 16 TiB / 18 TiB avail
    pgs:     193 active+clean

  io:
    client:   2.7 KiB/s rd, 1.4 MiB/s wr, 0 op/s rd, 73 op/s wr

Then I see this in the Ceph status summary of the Proxmox web UI:
5 daemons have recently crashed
mon.hoc-node03 crashed on host hoc-node03 at 2021-12-17T15:32:18.617966Z
mon.hoc-node03 crashed on host hoc-node03 at 2021-12-17T15:32:28.913936Z
mon.hoc-node03 crashed on host hoc-node03 at 2021-12-17T15:32:39.152173Z
mon.hoc-node03 crashed on host hoc-node03 at 2021-12-17T15:32:49.394701Z
mon.hoc-node03 crashed on host hoc-node03 at 2021-12-17T15:32:59.650578Z
After restarting the service failed, I rebooted the host, thinking (hoping) there were more services that needed to be reloaded or restarted, but no change: the monitor for that host still fails to start. I assume the same will happen on the other hosts, and I'll be back to a down filesystem.
I'm sure it's probably not a good thing to leave the sanity skip enabled. Any ideas?
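A side note on the "5 daemons have recently crashed" warning: Ceph keeps those crash reports until they are acknowledged, so as a sketch (once the monitor problem itself is sorted, and with <crash-id> as a placeholder for an ID from the list) they can be inspected and archived via the crash module:
Code:
# list recorded crashes and look at one in detail
ceph crash ls
ceph crash info <crash-id>
# archive them once dealt with; this clears the HEALTH_WARN entry
ceph crash archive-all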
 
Has nobody else had this happen? I'm nervous about leaving:
Code:
[mon]
    mon_mds_skip_sanity = true
in /etc/pve/ceph.conf
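If it helps, one way to double-check what a running monitor actually has for that option is to query its admin socket on the node it runs on (mon ID assumed to match the node name, as in the service names above):
Code:
# run on the node hosting the monitor
ceph daemon mon.hoc-node01 config get mon_mds_skip_sanity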
 
I'm having this happen now as well. I think you're supposed to switch it back once the upgrade to the mon db is done, but how do I tell when that happens?
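One thing that might help with the "how to tell" part: ceph versions lists the version string every running daemon reports, grouped by daemon type, so the daemons themselves are all upgraded once every mon/mgr/osd/mds entry shows 16.2.7:
Code:
# version strings reported by all running daemons, grouped by type
ceph versions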
 
I tried after the upgrade. Details in #3.
 
Yup, I can confirm. Tried to turn it off this morning and I still get assertion failures. I wonder if I can delete and re-create each mon one at a time...
 
So this is now solved, I think. According to my error log, it seems like, somehow, an MDS was reporting an older version than 16.2.7. Checking the CephFS section in the Proxmox web UI, I actually had 2 MDSes that were reporting _no_ version string at all. I first ran
Code:
ceph mds fail clusterfs-performance:0
as recommended, and that caused one of the two weird MDSes to display the correct version. On a whim, I then restarted the other and it did as well. Once all MDSes were displaying version 16.2.7, I was able to remove the mon_mds_skip_sanity flag and restart all monitors. Things seem to be correct now...
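For anyone wanting that last step as commands, a rough sketch (node names as elsewhere in this thread; restart one monitor at a time):
Code:
# once every MDS reports 16.2.7: remove the [mon] mon_mds_skip_sanity
# entry from /etc/pve/ceph.conf, then restart the monitors in turn
systemctl restart ceph-mon@hoc-node01.service
ceph -s   # confirm the monitor is back in quorum before the next node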
 
For what it's worth, this did resolve itself over time. I removed mon_mds_skip_sanity = true and restarted the monitor on each host without issue.
 
