Hello all,
I am a little bit new to Ceph. A few months ago I installed a new Proxmox cluster with Proxmox 5 beta on 3 Dell PE R630 nodes, each with 2 SSDs (one for the OS and one for journals) and 8 × 500 GB hard drives for OSDs, so I have 24 OSDs in total. Proxmox and Ceph share the same servers.
I created a pool 'VM Pool' with several VMs on it and everything was fine. Yesterday, I tried to upgrade the cluster from 5 beta to 5.0. I had already tested the procedure on a test cluster running PVE 4.4 with Jewel, upgrading first to Luminous following the documentation, then to PVE 5.0, and it went fine.
I thought the upgrade on the real cluster would be easier, because it was already running Ceph Luminous and all I had to do was upgrade to 5.0. So I logged in on the first node, migrated the existing VMs (two) to the second node, ran 'ceph osd set noout' to avoid rebalancing, then apt-get update and apt-get dist-upgrade.
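For reference, this is roughly the sequence I used (noout is a cluster-wide flag, so it only needed to be set once, from the first node):
Code:
# cluster-wide flag to avoid rebalancing while OSDs are down
ceph osd set noout
# then on the node being upgraded, after migrating its VMs away
apt-get update
apt-get dist-upgrade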
The upgrade went OK, so I rebooted the node.
After the reboot, I checked the Ceph status:
Code:
# ceph -s
    cluster b5a08127-b65a-430c-ad34-810752429977
     health HEALTH_WARN
            1088 pgs degraded
            1088 pgs stuck degraded
            1088 pgs stuck unclean
            1088 pgs stuck undersized
            1088 pgs undersized
            recovery 57444/172332 objects degraded (33.333%)
            8/24 in osds are down
            noout flag(s) set
            1 mons down, quorum 1,2 1,2
     monmap e4: 3 mons at {0=192.168.10.2:6789/0,1=192.168.10.3:6789/0,2=192.168.10.4:6789/0}
            election epoch 222, quorum 1,2 1,2
        mgr active: 1 standbys: 2
     osdmap e894: 24 osds: 16 up, 24 in
            flags noout
      pgmap v4701364: 1088 pgs, 2 pools, 222 GB data, 57444 objects
            665 GB used, 10507 GB / 11172 GB avail
            57444/172332 objects degraded (33.333%)
                1088 active+undersized+degraded
  client io 10927 B/s wr, 0 op/s rd, 0 op/s wr
There were some warnings, but I thought it was just because I had rebooted and the cluster was still working to reintegrate the local OSDs.
I did not unset noout. I went to the second node, migrated its VMs to the third node, then ran update and dist-upgrade. All seemed OK, so I rebooted.
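In hindsight, I suppose I should have waited for the first node's OSDs to come back up before touching the second node, and only dropped the flag once everything was healthy again, something like:
Code:
# check that all OSDs are back up before upgrading the next node
ceph osd tree
ceph -s
# only once every node is upgraded and the cluster is healthy again
ceph osd unset noout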
After the second node came back up, I checked the Proxmox cluster, which was OK, then the Ceph cluster, and this time it was not OK; I got a timeout:
Code:
# ceph -s
2017-08-14 19:09:37.396061 7f7460aaa700 0 monclient(hunting): authenticate timed out after 300
2017-08-14 19:09:37.396082 7f7460aaa700 0 librados: client.admin authentication error (110) Connection timed out
[errno 110] error connecting to the cluster
So it seems there is an authentication error. I tested on the first node and got the same message. I tried rebooting the node, to no avail.
I have not upgraded the last node yet, because it now hosts all the running VMs (5), which I planned to migrate to the first nodes. I fear that if I stop them, I will not be able to recover them after the upgrade and node reboot.
I am a bit puzzled at this point and don't know how to troubleshoot the problem. I see nothing very informative in the Ceph logs.
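I suppose the next step would be to look at the monitor daemons directly on each node (the mon IDs come from my ceph.conf below), something like:
Code:
# is the monitor running on this node (mon.0 on the first node)?
systemctl status ceph-mon@0
journalctl -u ceph-mon@0 -n 50
# query the local mon through its admin socket, which should work even without quorum
ceph daemon mon.0 mon_status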
I also tried to restart Ceph; the command did not output any error message, but it did not succeed either:
Code:
# systemctl start ceph.target
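Maybe I should try the individual units instead of the whole target, for example on the first node:
Code:
# restart just the monitor on this node and check its status
systemctl restart ceph-mon@0
systemctl status ceph-mon@0
# same idea for an OSD that lives on this node, e.g. osd.0
systemctl restart ceph-osd@0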
Could someone help me troubleshoot the problem?
Here are the versions I now have on the first node:
Code:
~# pveversion -v
proxmox-ve: 5.0-20 (running kernel: 4.10.17-2-pve)
pve-manager: 5.0-30 (running version: 5.0-30/5ab26bc)
pve-kernel-4.10.1-2-pve: 4.10.1-2
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.4.49-1-pve: 4.4.49-86
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.10.8-1-pve: 4.10.8-7
pve-kernel-4.10.11-1-pve: 4.10.11-9
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-12
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-16
libpve-guest-common-perl: 2.0-11
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-14
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-4
pve-container: 2.0-15
pve-firewall: 3.0-2
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.0.8-3
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
Thanks in advance for your advice.
P.S.: I forgot, here is my Ceph configuration:
Code:
:/etc/ceph# cat ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 192.168.10.0/24
filestore xattr use omap = true
fsid = b5a08127-b65a-430c-ad34-810752429977
keyring = /etc/pve/priv/$cluster.$name.keyring
osd journal size = 5120
osd pool default min size = 1
public network = 192.168.10.0/24
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
[mon.1]
host = prox2orsay
mon addr = 192.168.10.3:6789
[mon.2]
host = prox3orsay
mon addr = 192.168.10.4:6789
[mon.0]
host = prox1orsay
mon addr = 192.168.10.2:6789