[SOLVED] ceph upgrade to jewel: HA migrate not working anymore

Ruben Waitz

Hi,

I've upgraded ceph from Hammer to Jewel on 2 of 4 nodes in our proxmox 4.4 cluster.
On one of the "pve jewel nodes", HA migration (to/from that node) and the Ceph log in the pve UI are broken.
The HA migration task reports "HA 200 - Migrate OK" but nothing happens after that.
In the pve UI, the Ceph log for this node only shows: "unable to open file - No such file or directory"

When I remove an LXC container from the HA setup, I can migrate it to the 'problem node' just fine.

I've carefully followed the steps at https://pve.proxmox.com/wiki/Ceph_Hammer_to_Jewel and checked the directory/file permissions; everything looks OK. Furthermore, all nodes are included in the HA setup and there are no HA groups excluding particular nodes. I've also rebooted the node to make sure it gets a clean start.
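For reference, these are roughly the checks I did (a rough sketch; the OSD id is a placeholder for this node):

ceph -s
ceph --version
ls -ld /var/lib/ceph /var/lib/ceph/osd/ceph-*   # should be owned by ceph:ceph after the Jewel upgrade
systemctl status ceph-osd@<id>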

Am I missing something?

Thanks
Ruben


ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
proxmox-ve: 4.4-78 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-5 (running version: 4.4-5/c43015a5)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-78
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-85
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-71
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.1-1
pve-container: 1.0-90
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.6-5
lxcfs: 2.0.5-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
 
Hi,

did you unset noout?
And does the ceph cluster show any warnings?
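I mean something like this, just as an example:

ceph osd unset noout
ceph health detail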
 
Hi Wolfgang,

Thank you for replying.

I've run 'ceph osd unset noout' again, just to make sure. Ceph and the cluster look like they're working fine, except for the HA issue, which appeared after upgrading to Jewel. Maybe the problem is elsewhere in HA.
The only warning I get is "HEALTH_WARN: crush map has legacy tunables (require bobtail, min is firefly)", alternating with "HEALTH_OK" every 10 seconds or so. I haven't upgraded the crush map to the advised 'hammer' tunables yet (see the commands below).
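If I understand the documentation correctly, that tunables change would be something like the following (I haven't run it yet, since it will trigger data rebalancing):

ceph osd crush show-tunables
ceph osd crush tunables hammer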

There's another Jewel node in the cluster with the same config, and it works fine, so I assume this shouldn't be the problem.

Do you have any other ideas?

Thanks!
 
Does this only happen with LXC or also with KVM?
If it also happens with KVM, do you use krbd or librbd (without the krbd flag)?
 
Hi,

This also happens with KVM guests.
In the rbd: section of /etc/pve/storage.cfg I've set krbd to 1 (snippet below).
The only difference on this node is that I don't run a ceph mon on it, so the cluster keeps an odd quorum (3 monitors on 4 nodes), as advised in a ceph book I read. Could that be the problem here?
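For completeness, the rbd entry looks roughly like this (storage name and monitor IPs are placeholders):

rbd: ceph-vm
        monhost 192.168.0.1;192.168.0.2;192.168.0.3
        pool rbd
        krbd 1
        content images,rootdir
        username admin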

Otherwise maybe the easiest solution is to reinstall the node and add it to the cluster again.
 
That is no problem, but do you have OSDs on these nodes?
Or more specifically: which version of librbd is running on this node?
 
Yes, each node has 1 OSD.
dpkg-query -l | grep librbd shows 10.2.5-1~bpo80+1 (for both librbd1 and python-rbd). It's the same on the other, working Jewel node. The two old 'hammer' nodes have version 0.80.8-1~bpo70+1.
 
Are you sure about version 0.80.8 on the old nodes?
Because that would be Firefly, not Hammer.
 
...
The only warning I get is "HEALTH_WARN: crush map has legacy tunables (require bobtail, min is firefly)", alternating with "HEALTH_OK" every 10 seconds or so. I haven't upgraded the crush map to the advised 'hammer' tunables yet.
...
Hi,
this sounds like not all of your mons are on Jewel yet. It's strongly recommended to upgrade all mons to Jewel.
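You can check which version each mon reports with something like this (the mon IDs 0/1/2 are just examples):

ceph tell mon.0 version
ceph tell mon.1 version
ceph tell mon.2 version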

Udo
 
@wolfgang Sorry, you're right, I made a mistake. The old nodes are on 0.94.9-1~bpo80+1 (hammer).
@udo Yes, I intend to do so, but I'm afraid I'll mess things up during the upgrade. Originally there were 4 hammer nodes; after upgrading 2 of them to Jewel, one of the Jewel nodes started showing these problems. With Wolfgang I'm trying to figure out what went (and still is going) wrong.
 
I removed the OSD of the failing node and did a fresh Jewel install (rough steps below). After that I upgraded the remaining hammer nodes to Jewel. Everything seems to work again.
The missing Ceph log in the pve UI is still there; I think /var/log/ceph/ceph.log is written by the ceph MON, which I didn't install on the 'problem' node (to keep an odd number of MONs).
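For anyone finding this later, the OSD removal before the reinstall was roughly the standard procedure (osd.3 is a placeholder for that node's OSD id):

ceph osd out 3
# wait until the cluster has rebalanced and is healthy again
systemctl stop ceph-osd@3
ceph osd crush remove osd.3
ceph auth del osd.3
ceph osd rm 3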

Hope this helps someone else.
@wolfgang and @udo, thanks for your input!
 
