Cluster down - unable to restart Ceph

pgo

New Member
Jan 30, 2020
Dear all,

We have an 8-node Proxmox cluster that is running badly. The cluster uses a Ceph filesystem (nodes proxmox0-7 plus an additional Ceph monitor machine).
We experienced problems on the cluster, and now Ceph is down. Restarting VMs also fails with the error: TASK ERROR: storage 'proxmox-images' is not online

How can we return to normal operation?
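In case it is useful, the Proxmox storage status can be checked on each node with pvesm; this only reports whether 'proxmox-images' is considered active on that node, it does not change anything:

# pvesm status
(lists every configured storage and whether it is active or inactive on this node)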


# service ceph status

● ceph.service - PVE activate Ceph OSD disks
   Loaded: loaded (/etc/systemd/system/ceph.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Mon 2020-02-03 08:57:04 CET; 1h 39min ago

 Main PID: 3702938 (code=exited, status=0/SUCCESS)
Feb 03 08:57:00 proxmox0 ceph-disk[3702938]: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@10.service → /lib/systemd/system/ceph-osd@.service.
Feb 03 08:57:01 proxmox0 ceph-disk[3702938]: Removed /run/systemd/system/ceph-osd.target.wants/ceph-osd@34.service.
Feb 03 08:57:01 proxmox0 ceph-disk[3702938]: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@34.service → /lib/systemd/system/ceph-osd@.service.
Feb 03 08:57:01 proxmox0 ceph-disk[3702938]: Removed /run/systemd/system/ceph-osd.target.wants/ceph-osd@24.service.
Feb 03 08:57:02 proxmox0 ceph-disk[3702938]: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@24.service → /lib/systemd/system/ceph-osd@.service.
Feb 03 08:57:02 proxmox0 ceph-disk[3702938]: Removed /run/systemd/system/ceph-osd.target.wants/ceph-osd@1.service.
Feb 03 08:57:03 proxmox0 ceph-disk[3702938]: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@1.service → /lib/systemd/system/ceph-osd@.service.
Feb 03 08:57:03 proxmox0 ceph-disk[3702938]: Removed /run/systemd/system/ceph-osd.target.wants/ceph-osd@6.service.
Feb 03 08:57:03 proxmox0 ceph-disk[3702938]: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@6.service → /lib/systemd/system/ceph-osd@.service.
Feb 03 08:57:04 proxmox0 systemd[1]: Started PVE activate Ceph OSD disks.
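Note that ceph.service here is only the PVE one-shot unit that activates the OSD disks, as shown above; the OSD daemons themselves run as ceph-osd@<id> units. A sketch of how to check one of them on the node that hosts it, using osd.7 as an example:

# systemctl status ceph-osd@7.service
(shows whether the daemon for osd.7 is running, or why it exited)
# journalctl -u ceph-osd@7 -n 100 --no-pager
(last 100 log lines of that OSD daemon)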


Details:
# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 9.4 (stretch)
Release: 9.4
Codename: stretch

# ceph health detail
HEALTH_OK

However, the health status changes depending on when the command is run; errors pop up regularly in the cluster log:
2020-02-01 12:00:00.000111 mon.proxmox0 mon.0 192.168.110.10:6789/0 70495 : cluster [INF] overall HEALTH_OK
2020-02-01 12:45:45.762714 osd.38 osd.38 192.168.110.14:6801/5292 27730 : cluster [ERR] 4.39d shard 46: soid 4:b9dd82d6:::rbd_data.a61df9238e1f29.0000000000004013:head candidate had a read error
2020-02-01 12:46:59.896623 osd.38 osd.38 192.168.110.14:6801/5292 27731 : cluster [ERR] 4.39d deep-scrub 0 missing, 1 inconsistent objects
2020-02-01 12:46:59.896630 osd.38 osd.38 192.168.110.14:6801/5292 27732 : cluster [ERR] 4.39d deep-scrub 1 errors
2020-02-01 12:47:02.474397 mon.proxmox0 mon.0 192.168.110.10:6789/0 72890 : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2020-02-01 12:47:02.474418 mon.proxmox0 mon.0 192.168.110.10:6789/0 72891 : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)
2020-02-01 13:00:00.000125 mon.proxmox0 mon.0 192.168.110.10:6789/0 73566 : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
2020-02-01 13:45:46.663934 mon.proxmox0 mon.0 192.168.110.10:6789/0 75818 : cluster [WRN] Health check failed: 9 slow requests are blocked > 32 sec. Implicated osds 19 (REQUEST_SLOW)
2020-02-01 13:45:53.549245 mon.proxmox0 mon.0 192.168.110.10:6789/0 75824 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 9 slow requests are blocked > 32 sec. Implicated osds 19)
2020-02-01 14:00:00.000117 mon.proxmox0 mon.0 192.168.110.10:6789/0 76514 : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
2020-02-01 15:00:00.000153 mon.proxmox0 mon.0 192.168.110.10:6789/0 79496 : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
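For reference only, the documented Luminous workflow for a single inconsistent PG looks roughly like this, using PG 4.39d from the log above (a sketch for context, not advice specific to this cluster):

# rados list-inconsistent-obj 4.39d --format=json-pretty
(lists the inconsistent objects in that PG and which shard had the read error)
# ceph pg repair 4.39d
(asks the primary OSD to repair the PG from its healthy replicas)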


We tried to replace 2 disks (osd.7 and osd.13, which had SMART errors) following the documented procedure at https://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-osds/, but the OSDs do not come back online.
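For completeness, on Luminous with the PVE 5 tooling that procedure maps roughly to the following (a sketch only; /dev/sdX stands for the replacement disk and has to be adjusted per node, osd.7 is used as the example):

# ceph osd out 7
(mark the failing OSD out so data is rebalanced away from it)
# systemctl stop ceph-osd@7
(stop the OSD daemon on its node)
# ceph osd purge 7 --yes-i-really-mean-it
(remove the OSD from the CRUSH map, the OSD map and the auth keys)
# pveceph createosd /dev/sdX
(create a new OSD on the replacement disk)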


#/etc/pve/nodes# ceph -w
  cluster:
    id:     19d1ead5-b8bf-4cef-a686-6fd707307258
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum proxmox0,proxmox1,ceph-mon0
    mgr: proxmox1(active), standbys: proxmox0, ceph-mon0
    mds: cephfs-1/1/1 up {0=proxmox4=up:active}, 1 up:standby
    osd: 51 osds: 49 up, 49 in

  data:
    pools:   4 pools, 4096 pgs
    objects: 4031k objects, 14452 GB
    usage:   29007 GB used, 27905 GB / 56912 GB avail
    pgs:     4096 active+clean

  io:
    client: 3062 B/s rd, 12037 kB/s wr, 0 op/s rd, 306 op/s wr


root@proxmox0:/etc/pve/nodes# ceph osd tree | grep -i down
7 0 osd.7 down 0 1.00000
13 0 osd.13 down 0 1.00000
51 0 osd.51 down 0 1.00000
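To see why these stay down, a sketch of what can be checked on the node hosting each of them (ceph-disk is the tool this cluster already uses, as the ceph.service log above shows):

# ceph-disk list
(shows which local disks/partitions are prepared or activated as Ceph OSDs)
# ceph osd df tree
(shows per-OSD status and CRUSH weight; the down OSDs above currently report weight 0)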


root@proxmox1:~# pveversion -v
proxmox-ve: 5.4-2 (running kernel: 4.13.16-2-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
pve-kernel-4.15: 5.4-12
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-3-pve: 4.13.16-50
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.13.16-1-pve: 4.13.16-46
ceph: 12.2.12-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-12
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-56
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-14
libpve-storage-perl: 5.0-44
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-7
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-38
pve-container: 2.0-41
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-7
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-4
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-54
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3


Many thanks for your insights!
--
Peter
 

For an overview, please post a pvereport for each node where the inactive OSDs are located.
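For example (pvereport writes to stdout, so redirect it to a file on each node; the path is only a suggestion):

# pvereport > /tmp/pvereport-$(hostname).txt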
 
