[solved] Ceph timeout after upgrade from PVE 5 beta to 5.0

alain

Renowned Member
May 17, 2009
224
2
83
France/Paris
Hello all,

I am a little bit new to Ceph. A few months ago I installed a new Proxmox cluster with Proxmox 5 beta on 3 Dell PE R630 nodes, each with 2 SSDs (one for the OS and one for journals) and 8 x 500 GB HDDs for OSDs, so I have 24 OSDs in total. Proxmox and Ceph share the same servers.

I created a pool 'VM Pool' and several VMs on it, and it was OK. Yesterday, I tried to upgrade the cluster from 5 beta to 5.0. I had already tested this with a test cluster that was on PVE 4.4 with Jewel, upgrading first to Luminous following the documentation (it was OK), then to PVE 5.0. It went OK.

I thought the upgrade on the real cluster would be easier, because it was already running Ceph Luminous, and all I had to do was upgrade to 5.0. So I logged on to the first node, migrated the existing VMs (two) to the second node, ran 'ceph osd set noout' to avoid rebalancing, then apt-get update and apt-get dist-upgrade.

It was OK, so I rebooted the node.
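For reference, the per-node sequence was roughly the following (a sketch from memory, not my exact shell history; the VM ID and target node are placeholders):
Code:
# on the node about to be upgraded
ceph osd set noout                        # avoid rebalancing while its OSDs are down
qm migrate <vmid> <target-node> --online  # repeat for each running VM on this node
apt-get update
apt-get dist-upgrade
reboot
# only once all nodes are upgraded and all OSDs are back up and in:
ceph osd unset noout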

After the reboot, I checked the Ceph status:

Code:
# ceph -s
    cluster b5a08127-b65a-430c-ad34-810752429977
     health HEALTH_WARN
            1088 pgs degraded
            1088 pgs stuck degraded
            1088 pgs stuck unclean
            1088 pgs stuck undersized
            1088 pgs undersized
            recovery 57444/172332 objects degraded (33.333%)
            8/24 in osds are down
            noout flag(s) set
            1 mons down, quorum 1,2 1,2
     monmap e4: 3 mons at {0=192.168.10.2:6789/0,1=192.168.10.3:6789/0,2=192.168.10.4:6789/0}
            election epoch 222, quorum 1,2 1,2
        mgr active: 1 standbys: 2
     osdmap e894: 24 osds: 16 up, 24 in
            flags noout
      pgmap v4701364: 1088 pgs, 2 pools, 222 GB data, 57444 objects
            665 GB used, 10507 GB / 11172 GB avail
            57444/172332 objects degraded (33.333%)
                1088 active+undersized+degraded
  client io 10927 B/s wr, 0 op/s rd, 0 op/s wr

There were some warnings, but I thought it was because I had just rebooted and the cluster was still reintegrating the local OSDs.
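In hindsight, I suppose I should have checked which OSDs were down before continuing, for example with the tree view, to confirm that the 8 down OSDs were all on the node I had just rebooted (just a guess at the right check, I did not actually run it at the time):
Code:
# show OSDs grouped by host, with their up/down and in/out status
ceph osd tree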

I did not unset noout. I went to the second node, migrated the VMs it was hosting to the third node, then ran update and dist-upgrade. All seemed OK and I rebooted.

After that, I checked the Proxmox cluster, which was OK, then the Ceph cluster, and this time it was not OK; I got a timeout:

Code:
# ceph -s
2017-08-14 19:09:37.396061 7f7460aaa700  0 monclient(hunting): authenticate timed out after 300
2017-08-14 19:09:37.396082 7f7460aaa700  0 librados: client.admin authentication error (110) Connection timed out
[errno 110] error connecting to the cluster

So it seems there is an authentication error. I tested on the first node and got the same message. I tried to reboot the node, to no avail.

I have not upgraded the last node yet, because it now hosts all the running VMs (5), and I planned to migrate them to the first nodes. I fear that if I stop them, I will not be able to recover them after the upgrade and node reboot.

I am a bit puzzled at this point, and don't know how to troubleshoot the problem. I see nothing very informative in the Ceph logs.
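If it is relevant, here is what I think should let me check each monitor locally without needing quorum, through its admin socket (the monitor IDs 0/1/2 are the ones from my ceph.conf; I am not sure this is the right approach):
Code:
# is the monitor daemon running on this node at all?
systemctl status ceph-mon@0
# query the local monitor through its admin socket (does not require quorum)
ceph daemon mon.0 mon_status
# equivalent, addressing the socket file directly
ceph --admin-daemon /var/run/ceph/ceph-mon.0.asok mon_status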

I tried to restart the cluster; the command did not output any error message, but it did not succeed either:
Code:
# systemctl start ceph.target
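Maybe I should restart the individual Ceph targets instead of the umbrella ceph.target (just a guess from the systemd unit names on the node, I have not tried it):
Code:
# restart the monitor and OSD daemons explicitly
systemctl restart ceph-mon.target
systemctl restart ceph-osd.target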

Could someone help me troubleshoot the problem?

Here is the version I now have on the first node:
Code:
~# pveversion -v
proxmox-ve: 5.0-20 (running kernel: 4.10.17-2-pve)
pve-manager: 5.0-30 (running version: 5.0-30/5ab26bc)
pve-kernel-4.10.1-2-pve: 4.10.1-2
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.4.49-1-pve: 4.4.49-86
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.10.8-1-pve: 4.10.8-7
pve-kernel-4.10.11-1-pve: 4.10.11-9
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-12
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-16
libpve-guest-common-perl: 2.0-11
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-14
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-4
pve-container: 2.0-15
pve-firewall: 3.0-2
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.0.8-3
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1

Thanks in advance for your advice.

P.S.: I forgot, here is my Ceph configuration:
Code:
:/etc/ceph# cat ceph.conf
[global]
         auth client required = cephx
         auth cluster required = cephx
         auth service required = cephx
         cluster network = 192.168.10.0/24
         filestore xattr use omap = true
         fsid = b5a08127-b65a-430c-ad34-810752429977
         keyring = /etc/pve/priv/$cluster.$name.keyring
         osd journal size = 5120
         osd pool default min size = 1
         public network = 192.168.10.0/24

[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.1]
         host = prox2orsay
         mon addr = 192.168.10.3:6789

[mon.2]
         host = prox3orsay
         mon addr = 192.168.10.4:6789

[mon.0]
         host = prox1orsay
         mon addr = 192.168.10.2:6789
 
Some more information after reading another thread:

Code:
~# systemctl status ceph ceph-osd
● ceph.service - PVE activate Ceph OSD disks
   Loaded: loaded (/etc/systemd/system/ceph.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Mon 2017-08-14 22:19:06 CEST; 18h ago
 Main PID: 2428 (code=exited, status=0/SUCCESS)

Aug 14 22:19:04 prox1orsay ceph-disk[2428]: Created symlink /etc/systemd/system/ceph-osd.target.wants/ceph-osd@9.service → /lib/systemd/system/ceph-osd@.service.
Aug 14 22:19:04 prox1orsay ceph-disk[2428]: Removed /etc/systemd/system/ceph-osd.target.wants/ceph-osd@7.service.
Aug 14 22:19:04 prox1orsay ceph-disk[2428]: Created symlink /etc/systemd/system/ceph-osd.target.wants/ceph-osd@7.service → /lib/systemd/system/ceph-osd@.service.
Aug 14 22:19:05 prox1orsay ceph-disk[2428]: Removed /etc/systemd/system/ceph-osd.target.wants/ceph-osd@6.service.
Aug 14 22:19:05 prox1orsay ceph-disk[2428]: Created symlink /etc/systemd/system/ceph-osd.target.wants/ceph-osd@6.service → /lib/systemd/system/ceph-osd@.service.
Aug 14 22:19:05 prox1orsay ceph-disk[2428]: Removed /etc/systemd/system/ceph-osd.target.wants/ceph-osd@3.service.
Aug 14 22:19:05 prox1orsay ceph-disk[2428]: Created symlink /etc/systemd/system/ceph-osd.target.wants/ceph-osd@3.service → /lib/systemd/system/ceph-osd@.service.
Aug 14 22:19:06 prox1orsay ceph-disk[2428]: Removed /etc/systemd/system/ceph-osd.target.wants/ceph-osd@0.service.
Aug 14 22:19:06 prox1orsay ceph-disk[2428]: Created symlink /etc/systemd/system/ceph-osd.target.wants/ceph-osd@0.service → /lib/systemd/system/ceph-osd@.service.
Aug 14 22:19:06 prox1orsay systemd[1]: Started PVE activate Ceph OSD disks.
Unit ceph-osd.service could not be found.

and :
Code:
~# ls /var/lib/ceph/osd/
ceph-0  ceph-3  ceph-4  ceph-5  ceph-6  ceph-7  ceph-8  ceph-9

It seems at first sight that ceph-1 and ceph-2 are missing...
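If those two do not come back by themselves, I suppose something like this would let me see and re-activate the OSD disks on the node (only a guess, based on ceph-disk, which the PVE activation service above uses):
Code:
# list the Ceph partitions on this node and their state
ceph-disk list
# try to (re)activate every prepared OSD disk
ceph-disk activate-all
# check an individual OSD daemon
systemctl status ceph-osd@1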
 
Finally, I solved the problem by rebooting the second node. I think it was in some state of error, and as the cluster has only three nodes, the monitor quorum needs two of them, and since the third node was not yet upgraded, there was no quorum.

It is much better now:
Code:
~# ceph -s
  cluster:
    id:     b5a08127-b65a-430c-ad34-810752429977
    health: HEALTH_WARN
            application not enabled on 1 pool(s)

  services:
    mon: 3 daemons, quorum 0,1,2
    mgr: prox1orsay(active)
    osd: 24 osds: 24 up, 24 in

  data:
    pools:   2 pools, 1088 pgs
    objects: 60304 objects, 233 GB
    usage:   700 GB used, 10472 GB / 11172 GB avail
    pgs:     1088 active+clean

  io:
    client:   610 kB/s rd, 368 MB/s wr, 426 op/s rd, 409 op/s wr

I still have this annoying warning 'application not enabled on 1 pool(s)'; I already had it on my test cluster, but apart from that everything is working.
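For the record, from what I read about Luminous, this warning should go away once the pool is tagged with the application that uses it (RBD in my case); I have not tried it yet on this cluster:
Code:
# tell Ceph the pool holds RBD images (pool name is mine, adjust as needed)
ceph osd pool application enable "VM Pool" rbd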

I apologize for the noise on the forum.
 
