PMX 4.4: OSDs suddenly not starting on some nodes

lifeboy

Renowned Member
Code:
# pveversion -v
proxmox-ve: 4.4-111 (running kernel: 4.4.128-1-pve)
pve-manager: 4.4-24 (running version: 4.4-24/08ba4d2d)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.117-2-pve: 4.4.117-110
pve-kernel-4.4.128-1-pve: 4.4.128-111
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+2
libqb0: 1.0.1-1
pve-cluster: 4.0-55
qemu-server: 4.0-115
pve-firmware: 1.1-12
libpve-common-perl: 4.0-96
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.9.1-9~pve4
pve-container: 1.0-106
pve-firewall: 2.0-33
pve-ha-manager: 1.0-41
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.8-2~pve4
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
ceph: 10.2.10-1~bpo80+1

Last night everything was running fine; this morning three Proxmox nodes suddenly showed as down. I restarted the failed nodes and they came back up, but some OSDs did not. On the node where the OSDs don't start I get the following error:

Code:
root@hp1:~# systemctl status ceph -l
● ceph.service - PVE activate Ceph OSD disks
   Loaded: loaded (/etc/systemd/system/ceph.service; enabled)
   Active: failed (Result: exit-code) since Fri 2018-07-06 08:40:47 SAST; 3h 14min ago
  Process: 1340 ExecStart=/usr/sbin/ceph-disk --log-stdout activate-all (code=exited, status=1/FAILURE)
 Main PID: 1340 (code=exited, status=1/FAILURE)

Jul 06 08:40:46 hp1 ceph-disk[1340]: mount_activate: Failed to activate
Jul 06 08:40:46 hp1 ceph-disk[1340]: ceph-disk: Error: No cluster conf found in /etc/ceph with fsid a6092407-216f-41ff-bccb-9bed78587ac3
Jul 06 08:40:46 hp1 ceph-disk[1340]: mount_activate: Failed to activate
Jul 06 08:40:46 hp1 ceph-disk[1340]: ceph-disk: Error: No cluster conf found in /etc/ceph with fsid a6092407-216f-41ff-bccb-9bed78587ac3
Jul 06 08:40:47 hp1 ceph-disk[1340]: mount_activate: Failed to activate
Jul 06 08:40:47 hp1 ceph-disk[1340]: ceph-disk: Error: No cluster conf found in /etc/ceph with fsid a6092407-216f-41ff-bccb-9bed78587ac3
Jul 06 08:40:47 hp1 ceph-disk[1340]: ceph-disk: Error: One or more partitions failed to activate
Jul 06 08:40:47 hp1 systemd[1]: ceph.service: main process exited, code=exited, status=1/FAILURE
Jul 06 08:40:47 hp1 systemd[1]: Failed to start PVE activate Ceph OSD disks.
Jul 06 08:40:47 hp1 systemd[1]: Unit ceph.service entered failed state.
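
For reference, the same activation step can be re-run by hand outside of systemd to get more detail (a rough debugging step; I'm assuming the --verbose flag is available in this ceph-disk build and that it fails with the same fsid error):

Code:
# re-run the activation step that ceph.service performs, with more detail
/usr/sbin/ceph-disk --verbose --log-stdout activate-all

# show the disks/partitions ceph-disk knows about and their current state
ceph-disk list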

However:

Code:
root@hp1:~# cat /etc/ceph/ceph.conf
[global]
fsid = 7cae4d25-6864-46de-84dc-d8fd4f75ca6f
mon_initial_members = hp1
mon_host = 192.168.121.30
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

This node was running perfectly yesterday, and it appears that something changed during the night. How is it possible that the fsid no longer matches?

Can I simply change the fsid in ceph.conf, or where does the system get "a6092407-216f-41ff-bccb-9bed78587ac3" from?

All the other nodes have a6092407-216f-41ff-bccb-9bed78587ac3 as their fsid. Am I right in thinking that changing ceph.conf on this node would fix the problem?
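
If I understand it correctly, ceph-disk reads the ceph_fsid file stamped on each OSD data partition at creation time and then looks for a config in /etc/ceph with a matching fsid, so "a6092407-216f-41ff-bccb-9bed78587ac3" is presumably coming from the OSDs themselves rather than from any config file. A rough way to compare the two (with /dev/sdb1 standing in for one of the OSD data partitions on this node):

Code:
# fsid the local config file advertises
grep fsid /etc/ceph/ceph.conf

# fsid stamped on an OSD data partition when it was created
mkdir -p /mnt/osd-check
mount /dev/sdb1 /mnt/osd-check    # placeholder device, use an actual OSD data partition
cat /mnt/osd-check/ceph_fsid      # this is the value ceph-disk tries to match against /etc/ceph
umount /mnt/osd-check

# fsid the running monitors report (needs a reachable mon and a valid keyring)
ceph fsid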
 
