Code:
# pveversion -v
proxmox-ve: 4.4-111 (running kernel: 4.4.128-1-pve)
pve-manager: 4.4-24 (running version: 4.4-24/08ba4d2d)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.117-2-pve: 4.4.117-110
pve-kernel-4.4.128-1-pve: 4.4.128-111
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+2
libqb0: 1.0.1-1
pve-cluster: 4.0-55
qemu-server: 4.0-115
pve-firmware: 1.1-12
libpve-common-perl: 4.0-96
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.9.1-9~pve4
pve-container: 1.0-106
pve-firewall: 2.0-33
pve-ha-manager: 1.0-41
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.8-2~pve4
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
ceph: 10.2.10-1~bpo80+1
Last night everything was running fine, but this morning three Proxmox nodes suddenly showed as down. I restarted the failed nodes and they came back up, but some of the OSDs did not. I get the following error on the node where the OSDs don't start:
Code:
root@hp1:~# systemctl status ceph -l
● ceph.service - PVE activate Ceph OSD disks
Loaded: loaded (/etc/systemd/system/ceph.service; enabled)
Active: failed (Result: exit-code) since Fri 2018-07-06 08:40:47 SAST; 3h 14min ago
Process: 1340 ExecStart=/usr/sbin/ceph-disk --log-stdout activate-all (code=exited, status=1/FAILURE)
Main PID: 1340 (code=exited, status=1/FAILURE)
Jul 06 08:40:46 hp1 ceph-disk[1340]: mount_activate: Failed to activate
Jul 06 08:40:46 hp1 ceph-disk[1340]: ceph-disk: Error: No cluster conf found in /etc/ceph with fsid a6092407-216f-41ff-bccb-9bed78587ac3
Jul 06 08:40:46 hp1 ceph-disk[1340]: mount_activate: Failed to activate
Jul 06 08:40:46 hp1 ceph-disk[1340]: ceph-disk: Error: No cluster conf found in /etc/ceph with fsid a6092407-216f-41ff-bccb-9bed78587ac3
Jul 06 08:40:47 hp1 ceph-disk[1340]: mount_activate: Failed to activate
Jul 06 08:40:47 hp1 ceph-disk[1340]: ceph-disk: Error: No cluster conf found in /etc/ceph with fsid a6092407-216f-41ff-bccb-9bed78587ac3
Jul 06 08:40:47 hp1 ceph-disk[1340]: ceph-disk: Error: One or more partitions failed to activate
Jul 06 08:40:47 hp1 systemd[1]: ceph.service: main process exited, code=exited, status=1/FAILURE
Jul 06 08:40:47 hp1 systemd[1]: Failed to start PVE activate Ceph OSD disks.
Jul 06 08:40:47 hp1 systemd[1]: Unit ceph.service entered failed state.
However:
Code:
root@hp1:~# cat /etc/ceph/ceph.conf
[global]
fsid = 7cae4d25-6864-46de-84dc-d8fd4f75ca6f
mon_initial_members = hp1
mon_host = 192.168.121.30
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
This node was running perfectly yesterday, and it appears something changed during the night. How is it possible that the fsid no longer matches?
Can I just change the fsid in that ceph.conf, or where does the system get "a6092407-216f-41ff-bccb-9bed78587ac3" from?
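My guess is that ceph-disk reads that fsid from the ceph_fsid file stored on each OSD data partition and then looks for a conf in /etc/ceph with a matching fsid. If that's right, something like this should show which fsid the OSDs were created with (the device name below is just a placeholder for one of this node's OSD data partitions):
Code:
# mount one OSD data partition read-only and read the cluster fsid
# recorded on it (replace /dev/sdX1 with an actual OSD data partition)
mkdir -p /mnt/osd-check
mount -o ro /dev/sdX1 /mnt/osd-check
cat /mnt/osd-check/ceph_fsid
umount /mnt/osd-check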
All the other nodes have a6092407-216f-41ff-bccb-9bed78587ac3 as their fsid, so I would think that changing the ceph.conf on this node should fix the problem?
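Before editing anything, I was planning to compare this node's config with the rest of the cluster. If I remember correctly, pveceph normally makes /etc/ceph/ceph.conf a symlink to the cluster-wide /etc/pve/ceph.conf, so roughly this is what I had in mind (assuming /etc/pve is mounted, and running the last command on a node where the monitors are healthy):
Code:
# is /etc/ceph/ceph.conf still the usual symlink into /etc/pve?
ls -l /etc/ceph/ceph.conf
# fsid in the cluster-wide config
grep fsid /etc/pve/ceph.conf
# fsid the running cluster reports (run on a healthy node)
ceph fsid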