PMX 4.4: OSDs suddenly not starting on some nodes

lifeboy

Renowned Member
Code:
# pveversion -v
proxmox-ve: 4.4-111 (running kernel: 4.4.128-1-pve)
pve-manager: 4.4-24 (running version: 4.4-24/08ba4d2d)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.117-2-pve: 4.4.117-110
pve-kernel-4.4.128-1-pve: 4.4.128-111
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+2
libqb0: 1.0.1-1
pve-cluster: 4.0-55
qemu-server: 4.0-115
pve-firmware: 1.1-12
libpve-common-perl: 4.0-96
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.9.1-9~pve4
pve-container: 1.0-106
pve-firewall: 2.0-33
pve-ha-manager: 1.0-41
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.8-2~pve4
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
ceph: 10.2.10-1~bpo80+1

Last night everything was running fine; this morning three Proxmox nodes suddenly showed as down. I restarted the failed nodes and they came back up, but some OSDs did not. On the node where the OSDs don't start I get the following error:

Code:
root@hp1:~# systemctl status ceph -l
● ceph.service - PVE activate Ceph OSD disks
   Loaded: loaded (/etc/systemd/system/ceph.service; enabled)
   Active: failed (Result: exit-code) since Fri 2018-07-06 08:40:47 SAST; 3h 14min ago
  Process: 1340 ExecStart=/usr/sbin/ceph-disk --log-stdout activate-all (code=exited, status=1/FAILURE)
 Main PID: 1340 (code=exited, status=1/FAILURE)

Jul 06 08:40:46 hp1 ceph-disk[1340]: mount_activate: Failed to activate
Jul 06 08:40:46 hp1 ceph-disk[1340]: ceph-disk: Error: No cluster conf found in /etc/ceph with fsid a6092407-216f-41ff-bccb-9bed78587ac3
Jul 06 08:40:46 hp1 ceph-disk[1340]: mount_activate: Failed to activate
Jul 06 08:40:46 hp1 ceph-disk[1340]: ceph-disk: Error: No cluster conf found in /etc/ceph with fsid a6092407-216f-41ff-bccb-9bed78587ac3
Jul 06 08:40:47 hp1 ceph-disk[1340]: mount_activate: Failed to activate
Jul 06 08:40:47 hp1 ceph-disk[1340]: ceph-disk: Error: No cluster conf found in /etc/ceph with fsid a6092407-216f-41ff-bccb-9bed78587ac3
Jul 06 08:40:47 hp1 ceph-disk[1340]: ceph-disk: Error: One or more partitions failed to activate
Jul 06 08:40:47 hp1 systemd[1]: ceph.service: main process exited, code=exited, status=1/FAILURE
Jul 06 08:40:47 hp1 systemd[1]: Failed to start PVE activate Ceph OSD disks.
Jul 06 08:40:47 hp1 systemd[1]: Unit ceph.service entered failed state.
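
For reference, the same activation step can be re-run by hand outside of systemd to get more detail (a rough debugging step; I'm assuming the --verbose flag is available in this ceph-disk build and that it fails with the same fsid error):

Code:
# re-run the activation step that ceph.service performs, with more detail
/usr/sbin/ceph-disk --verbose --log-stdout activate-all

# show the disks/partitions ceph-disk knows about and their current state
ceph-disk list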

However:

Code:
root@hp1:~# cat /etc/ceph/ceph.conf
[global]
fsid = 7cae4d25-6864-46de-84dc-d8fd4f75ca6f
mon_initial_members = hp1
mon_host = 192.168.121.30
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

This node was running perfectly yesterday, and it appears that something changed during the night. How is it possible that the fsid no longer matches?

Can I simply change the fsid in ceph.conf, or where does the system get "a6092407-216f-41ff-bccb-9bed78587ac3" from?

All the other nodes have a6092407-216f-41ff-bccb-9bed78587ac3 as their fsid. Am I right in thinking that changing ceph.conf on this node would fix the problem?
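
If I understand it correctly, ceph-disk reads the ceph_fsid file stamped on each OSD data partition at creation time and then looks for a config in /etc/ceph with a matching fsid, so "a6092407-216f-41ff-bccb-9bed78587ac3" is presumably coming from the OSDs themselves rather than from any config file. A rough way to compare the two (with /dev/sdb1 standing in for one of the OSD data partitions on this node):

Code:
# fsid the local config file advertises
grep fsid /etc/ceph/ceph.conf

# fsid stamped on an OSD data partition when it was created
mkdir -p /mnt/osd-check
mount /dev/sdb1 /mnt/osd-check    # placeholder device, use an actual OSD data partition
cat /mnt/osd-check/ceph_fsid      # this is the value ceph-disk tries to match against /etc/ceph
umount /mnt/osd-check

# fsid the running monitors report (needs a reachable mon and a valid keyring)
ceph fsid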
 
