[SOLVED] Update 7.0-11 to 7.1-8 Ceph issues

enderst

Updated 7.0-11 to 7.1-8 and ceph is giving me grief.
Code:
Dec 16 15:42:06 hoc-node01 systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
Dec 16 15:42:06 hoc-node01 systemd[1]: Mounting /mnt/pve/cephfs...
Dec 16 15:42:07 hoc-node01 kernel: [16124.141954] libceph: mon0 (1)172.16.30.21:6789 socket closed (con state OPEN)
Dec 16 15:42:18 hoc-node01 kernel: [16135.640535] libceph: mon3 (1)172.16.30.24:6789 socket closed (con state OPEN)
Dec 16 15:42:19 hoc-node01 kernel: [16136.249131] libceph: mon1 (1)172.16.30.22:6789 socket closed (con state V1_BANNER)
Dec 16 15:42:19 hoc-node01 kernel: [16136.501125] libceph: mon1 (1)172.16.30.22:6789 socket closed (con state V1_BANNER)
Dec 16 15:42:19 hoc-node01 kernel: [16137.017131] libceph: mon1 (1)172.16.30.22:6789 socket closed (con state V1_BANNER)
Dec 16 15:42:21 hoc-node01 kernel: [16138.041112] libceph: mon1 (1)172.16.30.22:6789 socket closed (con state V1_BANNER)
Dec 16 15:42:31 hoc-node01 kernel: [16148.281160] libceph: mon1 (1)172.16.30.22:6789 socket closed (con state V1_BANNER)
Dec 16 15:42:31 hoc-node01 kernel: [16148.537104] libceph: mon1 (1)172.16.30.22:6789 socket closed (con state V1_BANNER)
Dec 16 15:42:32 hoc-node01 kernel: [16149.049142] libceph: mon1 (1)172.16.30.22:6789 socket closed (con state V1_BANNER)
Dec 16 15:42:33 hoc-node01 kernel: [16150.073158] libceph: mon1 (1)172.16.30.22:6789 socket closed (con state V1_BANNER)
Dec 16 15:42:38 hoc-node01 kernel: [16155.196974] libceph: mon2 (1)172.16.30.23:6789 socket error on write
Dec 16 15:42:38 hoc-node01 kernel: [16155.448972] libceph: mon2 (1)172.16.30.23:6789 socket error on write
Dec 16 15:42:38 hoc-node01 kernel: [16155.960941] libceph: mon2 (1)172.16.30.23:6789 socket error on write
Dec 16 15:42:39 hoc-node01 kernel: [16156.984945] libceph: mon2 (1)172.16.30.23:6789 socket error on write
Dec 16 15:42:43 hoc-node01 kernel: [16160.657483] libceph: mon3 (1)172.16.30.24:6789 socket closed (con state OPEN)
Another cluster with the same hardware updated just fine.
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-8 (running version: 7.1-8/5b267f33)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-4
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.11.22-2-pve: 5.11.22-4
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: not correctly installed
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.2.0-3
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3
Any ideas how to get Ceph online again? I haven't found a similar issue in the forum, and it's weird that the other cluster updated fine.
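(For anyone hitting the same symptoms: a rough first step, assuming the monitor units follow the standard ceph-mon@<hostname>.service naming, is to check whether the monitors are running at all and what they last logged on each node.)
Code:
# is the monitor on this node up, and what did it last log?
systemctl status ceph-mon@hoc-node01.service
journalctl -u ceph-mon@hoc-node01.service -b --no-pager | tail -n 50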
 
I added the following to /etc/pve/ceph.conf per the doc and Ceph came back to life. I then removed it and restarted a monitor, and ceph-mon for that host would not start. Re-adding it and restarting ceph-mon for that host still failed. Rebooted and it's back. Ceph is cleaning; I'll try again after that finishes.
Code:
[mon]
    mon_mds_skip_sanity = true
https://pve.proxmox.com/wiki/Ceph_O..._crashes_after_minor_16.2.6_to_16.2.7_upgrade
BTW this was Ceph 16.2.5 to 16.2.7.
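For reference, a rough sketch of applying that workaround, assuming the [mon] section is added to the cluster-wide /etc/pve/ceph.conf and the monitors are then restarted one node at a time:
Code:
# /etc/pve/ceph.conf is shared across the cluster, so the [mon] section
# with mon_mds_skip_sanity = true only needs to be added once.
# Then restart the monitor on each node, waiting for quorum in between:
systemctl restart ceph-mon@hoc-node01.service
ceph -s   # confirm the mon rejoined quorum before moving to the next node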
 
Cleaning finished overnight; it was very slow. I removed the sanity check from ceph.conf and restarted the monitor for that host, and it fails to start.
Code:
# systemctl restart ceph-mon@hoc-node03.service
# systemctl status ceph-mon@hoc-node03.service
● ceph-mon@hoc-node03.service - Ceph cluster monitor daemon
     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
             └─ceph-after-pve-cluster.conf
     Active: failed (Result: signal) since Fri 2021-12-17 08:33:09 MST; 58s ago
    Process: 333391 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id hoc-node03 --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
   Main PID: 333391 (code=killed, signal=ABRT)
        CPU: 104ms

Dec 17 08:33:09 hoc-node03 systemd[1]: ceph-mon@hoc-node03.service: Scheduled restart job, restart counter is at 5.
Dec 17 08:33:09 hoc-node03 systemd[1]: Stopped Ceph cluster monitor daemon.
Dec 17 08:33:09 hoc-node03 systemd[1]: ceph-mon@hoc-node03.service: Start request repeated too quickly.
Dec 17 08:33:09 hoc-node03 systemd[1]: ceph-mon@hoc-node03.service: Failed with result 'signal'.
Dec 17 08:33:09 hoc-node03 systemd[1]: Failed to start Ceph cluster monitor daemon.
Dec 17 08:33:58 hoc-node03 systemd[1]: ceph-mon@hoc-node03.service: Start request repeated too quickly.
Dec 17 08:33:58 hoc-node03 systemd[1]: ceph-mon@hoc-node03.service: Failed with result 'signal'.
Dec 17 08:33:58 hoc-node03 systemd[1]: Failed to start Ceph cluster monitor daemon.

# ceph -s
  cluster:
    id:     231c3ca9-5ead-4497-b375-5ae6e722e6dd
    health: HEALTH_WARN
            1/4 mons down, quorum hoc-node01,hoc-node02,hoc-node04
            5 daemons have recently crashed

  services:
    mon: 4 daemons, quorum hoc-node01,hoc-node02,hoc-node04 (age 9m), out of quorum: hoc-node03
    mgr: hoc-node02(active, since 3m), standbys: hoc-node01, hoc-node04, hoc-node03
    mds: 1/1 daemons up, 3 standby
    osd: 20 osds: 20 up (since 85s), 20 in (since 9d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 199.03k objects, 739 GiB
    usage:   2.0 TiB used, 16 TiB / 18 TiB avail
    pgs:     193 active+clean

  io:
    client:   2.7 KiB/s rd, 1.4 MiB/s wr, 0 op/s rd, 73 op/s wr

Then I see this in the Ceph status summary of the Proxmox web UI:
5 daemons have recently crashed
mon.hoc-node03 crashed on host hoc-node03 at 2021-12-17T15:32:18.617966Z
mon.hoc-node03 crashed on host hoc-node03 at 2021-12-17T15:32:28.913936Z
mon.hoc-node03 crashed on host hoc-node03 at 2021-12-17T15:32:39.152173Z
mon.hoc-node03 crashed on host hoc-node03 at 2021-12-17T15:32:49.394701Z
mon.hoc-node03 crashed on host hoc-node03 at 2021-12-17T15:32:59.650578Z
After restarting the service failed, I rebooted the host, thinking (hoping) there were more services that needed to be reloaded or restarted, but no change: the monitor for that host still fails to start. I assume the same will happen on the other hosts, and I'll be back to a down filesystem.
I'm sure it's probably not a good thing to leave the sanity skip enabled. Any ideas?
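A side note on the "5 daemons have recently crashed" warning: Ceph keeps those crash reports until they are acknowledged, so as a sketch (once the monitor problem itself is sorted, and with <crash-id> as a placeholder for an ID from the list) they can be inspected and archived via the crash module:
Code:
# list recorded crashes and look at one in detail
ceph crash ls
ceph crash info <crash-id>
# archive them once dealt with; this clears the HEALTH_WARN entry
ceph crash archive-all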
 
Has nobody else had this happen? I'm nervous about leaving:
Code:
[mon]
    mon_mds_skip_sanity = true
in /etc/pve/ceph.conf
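If it helps, one way to double-check what a running monitor actually has for that option is to query its admin socket on the node it runs on (mon ID assumed to match the node name, as in the service names above):
Code:
# run on the node hosting the monitor
ceph daemon mon.hoc-node01 config get mon_mds_skip_sanity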
 
I'm having this happen now as well. I think you're supposed to switch it back once the upgrade to the mon db is done, but how do I tell when that happens?
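One thing that might help with the "how to tell" part: ceph versions lists the version string every running daemon reports, grouped by daemon type, so the daemons themselves are all upgraded once every mon/mgr/osd/mds entry shows 16.2.7:
Code:
# version strings reported by all running daemons, grouped by type
ceph versions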
 
I tried after the upgrade. Details in #3.
 
Yup, I can confirm. Tried to turn it off this morning and I still get assertion failures. I wonder if I can delete and re-create each mon one at a time...
 
So this is now solved, I think. According to my error log, it seems like, somehow, an MDS was reporting an older version than 16.2.7. Checking the CephFS section in the Proxmox web UI, I actually had 2 MDSes that were reporting _no_ version string at all. I first ran
Code:
ceph mds fail clusterfs-performance:0
as recommended, and that caused one of the two weird MDSes to display the correct version. On a whim, I then restarted the other and it did as well. Once all MDSes were displaying version 16.2.7, I was able to remove the mon_mds_skip_sanity flag and restart all monitors. Things seem to be correct now...
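For anyone wanting that last step as commands, a rough sketch (node names as elsewhere in this thread; restart one monitor at a time):
Code:
# once every MDS reports 16.2.7: remove the [mon] mon_mds_skip_sanity
# entry from /etc/pve/ceph.conf, then restart the monitors in turn
systemctl restart ceph-mon@hoc-node01.service
ceph -s   # confirm the monitor is back in quorum before the next node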
 
For what it's worth, this did resolve itself over time. I removed mon_mds_skip_sanity = true and restarted the monitor on each host without issue.
 
