[SOLVED] Ceph stopped talking after 3.3 upgrade

wahmed

I just upgraded all 6 Proxmox nodes, and now the Ceph cluster nodes have stopped talking to each other. I had Ceph 0.72 previously; after the upgrade it is 0.80. Did I miss an upgrade step?
 
Re: Ceph stopped talking after 3.3 upgrade

Hi,
have you looked at the Ceph docs?
Code:
Upgrade daemons in the following order:

        Monitors
        OSDs
        MDSs and/or radosgw

If the ceph-mds daemon is restarted first, it will wait until all OSDs have been upgraded before finishing its startup sequence. If the ceph-mon daemons are not restarted prior to the ceph-osd daemons, they will not correctly register their new capabilities with the cluster and new features may not be usable until they are restarted a second time.
and
We recommend adding the following to the [mon] section of your ceph.conf prior to upgrade:
Code:
mon warn on legacy crush tunables = false
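
The restart order quoted above can be sketched as a small dry-run helper. This is only a sketch: restart_plan is a hypothetical name, and it assumes the sysvinit script /etc/init.d/ceph accepts a daemon type argument, as the Firefly-era packages do. It prints the commands instead of running them, so they can be reviewed first.

```shell
# Print (do not run) restart commands in the documented order:
# monitors first, then OSDs, then MDSs.
# restart_plan is a hypothetical helper; pass the daemon types present on the node.
restart_plan() {
    for daemon in mon osd mds; do
        for present in "$@"; do
            if [ "$present" = "$daemon" ]; then
                # Assumption: the sysvinit script takes a daemon type here.
                echo "/etc/init.d/ceph restart $daemon"
            fi
        done
    done
}
```

For a node with monitors and OSDs but no MDS, `restart_plan osd mon` lists the mon restart before the osd restart regardless of the argument order.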

Udo
 
Re: Ceph stopped talking after 3.3 upgrade

I had not read the Ceph docs prior to the upgrade. I did not think Proxmox was going to upgrade Ceph; I simply did the upgrade through the GUI. I don't have an MDS. I have already restarted the nodes several times, but no go. Not sure if it is too late already, but I am going to add the mon warn setting to the conf and reboot. Let's see if it makes any difference. My tension level is rising with each passing minute. :) 7 hours till the work day starts for 55 users!
 
Re: Ceph stopped talking after 3.3 upgrade

Hi Wasim,
do the monitors start?
Code:
netstat -an | grep 6789 | grep -i listen
Is the filesystem too full?
Code:
df -k
du -hs /var/lib/ceph/mon/*/store.db
You can try starting the mon in the foreground to see error messages,
like this (in this case mon-b; see "ls /var/lib/ceph/mon/"):
Code:
ceph-mon -i b -d -c /etc/ceph/ceph.conf
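
The listening check above can also be wrapped in a small helper for scripting. This is a sketch only: mon_is_listening is a hypothetical name, and it assumes the usual `netstat -an` layout where the local address ends in `:PORT` followed by whitespace.

```shell
# Reads `netstat -an` output on stdin and succeeds if a socket is
# listening on the given port (default 6789, the monitor port used in this thread).
# mon_is_listening is a hypothetical helper name.
mon_is_listening() {
    port="${1:-6789}"
    grep -i 'listen' | grep -Eq "[:.]${port}[[:space:]]"
}
```

Usage: `netstat -an | mon_is_listening && echo "mon port is open"`. Note that an open port only shows that some process holds it, not that the monitor is healthy.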
Udo
 
Re: Ceph stopped talking after 3.3 upgrade

netstat shows it is listening.

After trying to run the mon in the foreground, this is what I got:
~# ceph-mon -i 0 -d -c /etc/ceph/ceph.conf
2014-09-18 01:29:10.057089 7f2700d13780 0 ceph version 0.80.5 (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx), process ceph-mon, pid 7267
2014-09-18 01:29:10.058536 7f2700d13780 -1 asok(0x3444d20) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-mon.0.asok': (17) File exists
IO error: lock /var/lib/ceph/mon/ceph-0/store.db/LOCK: Resource temporarily unavailable
IO error: lock /var/lib/ceph/mon/ceph-0/store.db/LOCK: Resource temporarily unavailable
2014-09-18 01:29:10.058655 7f2700d13780 -1 failed to create new leveldb store

Looks like the MONs are not starting at all.
 
Re: Ceph stopped talking after 3.3 upgrade

Hi,
Perhaps there is more than one mon process running?
Use 0 instead of b ;-)
Code:
root@proxmox4:~# ls -lsa /var/lib/ceph/mon/ceph-b/store.db/LOCK
0 -rw-r--r-- 1 root root 0 Feb  3  2014 /var/lib/ceph/mon/ceph-b/store.db/LOCK
root@proxmox4:~# fuser /var/lib/ceph/mon/ceph-b/store.db/LOCK
/var/lib/ceph/mon/ceph-b/store.db/LOCK:  6357
Is there enough free space inside /var/lib/ceph?

Udo
 
Re: Ceph stopped talking after 3.3 upgrade

0 is what I actually used, since that's the mon ID. With ls -lsa and fuser, this is what I get:
Code:
root@CA-00-01-01-01:/etc/ceph# ls -lsa /var/lib/ceph/mon/ceph-0/store.db/LOCK
0 -rw-r--r-- 1 root root 0 Jul 26 17:09 /var/lib/ceph/mon/ceph-0/store.db/LOCK
root@CA-00-01-01-01:/etc/ceph# fuser /var/lib/ceph/mon/ceph-0/store.db/LOCK
/var/lib/ceph/mon/ceph-0/store.db/LOCK:  3199

There is plenty of space on the local OS disk. All 6 nodes are behaving exactly the same: none of the MONs will come up.
If I run ceph -s, these are some of the error messages it shows:
Code:
2014-09-18 01:46:10.879483 7fcd13e80700  0 monclient(hunting): authenticate timed out after 300
2014-09-18 01:46:10.879523 7fcd13e80700  0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: TimedOut
Could it be that the Proxmox upgrade messed up the Ceph admin authentication key?
 
Re: Ceph stopped talking after 3.3 upgrade

But your mon is running (and has been since 26.07.14!!). What is the output of
Code:
ps aux | grep 3199
If you kill that mon and start a new one in the foreground, perhaps you will get better error messages?!
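
The lookup of the process holding the LOCK file can be sketched as a small parser for fuser output. lock_holder_pids is a hypothetical helper name, and the parsing assumes the "path:  PID" layout shown in this thread.

```shell
# Reads fuser output on stdin and prints the PIDs it reports, one per line.
# lock_holder_pids is a hypothetical helper name.
lock_holder_pids() {
    # Drop everything up to the first colon (the path, which may itself
    # contain digits such as "ceph-0"), then keep only the numeric PIDs.
    sed 's/^[^:]*://' | grep -Eo '[0-9]+'
}
```

Usage: `fuser /var/lib/ceph/mon/ceph-0/store.db/LOCK 2>&1 | lock_holder_pids` prints the PID to inspect with ps before deciding whether to kill it.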

Udo
 
Re: Ceph stopped talking after 3.3 upgrade

I am extremely HAPPY to report that all MONs are back in business again!!! I can get some sleep before the work day starts after all. :)

Udo, following your initial hint about the Ceph docs, I read the whole page on upgrading Ceph from 0.72 to 0.80. I think a combination of a few things fixed the MON issue. This is what I did:

1. Changed /etc/apt/sources.list.d/ceph.list on all Proxmox nodes. The repo line said deb http://ceph.com/debian-emperor wheezy main; I changed it to debian-firefly.
2. Ran #apt-get update
3. Ran #apt-get dist-upgrade
4. Rebooted.
5. Installed ceph-deploy: #apt-get install ceph-deploy
6. Installed/upgraded all MONs from each Proxmox node: #ceph-deploy install --release firefly node1 node2 node3 ...........
7. Rebooted nodes one by one.
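
The repository change in step 1 can be sketched as a filter that previews the edited line before anything is written back. switch_to_firefly is a hypothetical helper name; it only rewrites text read on stdin and does not touch the file.

```shell
# Rewrites apt source lines read on stdin, switching the Ceph
# repository from the emperor release to firefly.
# switch_to_firefly is a hypothetical helper name.
switch_to_firefly() {
    sed 's/debian-emperor/debian-firefly/'
}
```

Usage: `switch_to_firefly < /etc/apt/sources.list.d/ceph.list` shows the result; lines that do not mention debian-emperor pass through unchanged.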

And all MONs came back to life one by one!! Thank you so much, Udo, for your help tonight!!
 
Re: Ceph stopped talking after 3.3 upgrade

I am curious how come you are using ceph-deploy and not pveceph?

The recommended Proxmox method is to use pveceph.

Is there an advantage to using ceph-deploy?
What changes must we make to use ceph-deploy and still take advantage of the Proxmox cluster replication and Ceph management?

I am interested because I would like to try to install one or more MDS and it appears to be very easy using ceph-deploy.

Thanks.

Serge
 
Re: Ceph stopped talking after 3.3 upgrade

I thought I heard this had been fixed. I may be wrong though.

In any case, it would be interesting to look into.

Serge
 
Re: Ceph stopped talking after 3.3 upgrade

I probably could have used #pveceph. But I was not thinking quite straight last night out of panic. :)
I used ceph-deploy because I am familiar with it. I learned Ceph before it was included in Proxmox. My whole Ceph cluster was based on Ubuntu for many months, and I upgraded it twice during those months using ceph-deploy. I was not quite sure what #pveceph install would do if a Ceph installation already exists on the same node. Probably nothing. It would be worth finding out.

No changes are necessary to use ceph-deploy. Just install it with #apt-get install ceph-deploy. It is just a standalone tool.

Yes, you are right about MDS being easy to install with ceph-deploy. Actually, it is the only method right now to have an MDS on a Proxmox node, since Proxmox+Ceph does not support MDS yet.
 
Re: Ceph stopped talking after 3.3 upgrade

netstat shows it is listening.

After trying to run the mon in the foreground, this is what I got:
~# ceph-mon -i 0 -d -c /etc/ceph/ceph.conf
2014-09-18 01:29:10.057089 7f2700d13780 0 ceph version 0.80.5 (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx), process ceph-mon, pid 7267
2014-09-18 01:29:10.058536 7f2700d13780 -1 asok(0x3444d20) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-mon.0.asok': (17) File exists
IO error: lock /var/lib/ceph/mon/ceph-0/store.db/LOCK: Resource temporarily unavailable
IO error: lock /var/lib/ceph/mon/ceph-0/store.db/LOCK: Resource temporarily unavailable
2014-09-18 01:29:10.058655 7f2700d13780 -1 failed to create new leveldb store

Looks like the MONs are not starting at all.

Hi Wasim,
I have just updated the mon nodes inside my Proxmox cluster, and on one node I had the same issue.

But the problem was easily solved: the old monitor process simply did not die during the restart:
Code:
apt-get install --only-upgrade ceph ceph-common ceph-fs-common ceph-fuse ceph-mds libcephfs1 python-ceph

root@proxmox2:~# /etc/init.d/ceph restart
=== mon.a === 
=== mon.a === 
Stopping Ceph mon.a on proxmox2...done
=== mon.a === 
Starting Ceph mon.a on proxmox2...
IO error: lock /var/lib/ceph/mon/ceph-a/store.db/LOCK: Resource temporarily unavailable
IO error: lock /var/lib/ceph/mon/ceph-a/store.db/LOCK: Resource temporarily unavailable
2014-09-26 21:25:53.146853 7f6b1133b780 -1 failed to create new leveldb store
failed: 'ulimit -n 32768;  /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf --cluster ceph '
Starting ceph-create-keys on proxmox2...

root@proxmox2:~# fuser /var/lib/ceph/mon/ceph-a/store.db/LOCK 
/var/lib/ceph/mon/ceph-a/store.db/LOCK: 652183
root@proxmox2:~# ps aux | grep 652183
root      652183  1.4  0.4 934032 134872 ?       Sl   Jun16 2084:34 ceph-mon -i a
root      773100  0.0  0.0   7792   948 pts/4    S+   21:27   0:00 grep 652183

root@proxmox2:~# kill 652183
root@proxmox2:~# kill 652183
root@proxmox2:~# kill 652183
-bash: kill: (652183) - No such process
root@proxmox2:~# /etc/init.d/ceph restart
=== mon.a === 
=== mon.a === 
Stopping Ceph mon.a on proxmox2...done
=== mon.a === 
Starting Ceph mon.a on proxmox2...
Starting ceph-create-keys on proxmox2...


root@proxmox2:~# ceph -s
    cluster 591db070-15c1-4c7a-b107-67717bdb87d9
     health HEALTH_WARN some monitors are running older code
...
The other two mon nodes updated without issues.

Next time you won't need to delete and restore with ceph-deploy ;)

Udo
 
Re: Ceph stopped talking after 3.3 upgrade

I finally upgraded my two cluster members from PVE 3.2 to PVE 3.3 latest.

In the course of the upgrade, the ceph.conf file got truncated.

I had 4 OSDs defined on each proxmox cluster member. The ceph.conf file has nothing below the monitor configuration.

I see no reference of any OSDs in /etc/pve/ceph.

I have not touched the HDDs themselves and they are still showing in the Ceph -> Disk GUI pane.

I do not know what to do. Is there a way to recover such a screw-up?

Serge
 
Re: Ceph stopped talking after 3.3 upgrade

I had 4 OSDs defined on each proxmox cluster member. The ceph.conf file has nothing below the monitor configuration.

I see no reference of any OSDs in /etc/pve/ceph.

First, the update does not touch the ceph configuration. Second, it is normal that there is no reference to OSDs (not needed).
 
Re: Ceph stopped talking after 3.3 upgrade

OK Thanks. I could not find any sample ceph.conf file and I thought I remembered seeing something below.

Is there a way to recover the OSD mapping I had before the upgrade? It seems to me that I should have something more under /etc/pve/ceph and I do not see any subdirectories containing MON or OSD information.

I am wondering how to check things out and begin the troubleshooting.

The Web interface gives me timeouts when I try to see the Ceph status and the OSD panel.

Rebooting makes no difference and I do not know where to begin.

How can I manually shut down and restart Ceph in a controlled manner?

Serge
 
Re: Ceph stopped talking after 3.3 upgrade

I am getting this for all Ceph commands:

2014-10-02 20:55:37.279808 7fde044b3700 0 -- :/1261613 >> 10.10.10.50:6789/0 pipe(0xd151f0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0xd15460).fault
2014-10-02 20:55:40.279868 7fde043b2700 0 -- :/1261613 >> 10.10.10.51:6789/0 pipe(0xd14390 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0xd14600).fault
2014-10-02 20:55:43.280361 7fde044b3700 0 -- :/1261613 >> 10.10.10.50:6789/0 pipe(0xd15bf0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0xd15e60).fault
2014-10-02 20:55:46.280463 7fde043b2700 0 -- :/1261613 >> 10.10.10.51:6789/0 pipe(0xd16160 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0xd163d0).fault
...

/etc/init.d/ceph restart
/etc/init.d/ceph stop
/etc/init.d/ceph start

These have no output and no effect.

Serge
 
Re: Ceph stopped talking after 3.3 upgrade

Hi Serge,
it looks to me like you have only two Ceph mons (10.10.10.50 + 10.10.10.51). With only two mons, both must be running to get quorum; without a third mon, losing either one stops the cluster.
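
For reference, a monitor quorum needs a strict majority of the configured monitors. A minimal sketch of that arithmetic (mon_quorum_size is a hypothetical helper name, not a Ceph command):

```shell
# Smallest number of running monitors that forms a strict majority
# of n configured monitors: floor(n / 2) + 1.
# mon_quorum_size is a hypothetical helper name.
mon_quorum_size() {
    echo $(( $1 / 2 + 1 ))
}
```

With 2 monitors the majority is 2, so no monitor may fail; with 3 monitors the majority is still 2, so one may be down.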

Are both mons running?
Code:
netstat -an | grep 6789
Which error messages do you get if you start your mons in the foreground? (Stop the mon first with "kill PID"; find the PID with "ps aux | grep ceph".)

To start the mon in the foreground, first look up its ID with
Code:
ls /var/lib/ceph/mon/
If the directory is named ceph-0, you need the 0.
Code:
ceph-mon -i 0 -d -c /etc/ceph/ceph.conf
Is there any output that looks bad?

Do the disks have enough space?
Code:
df -h

BTW, you should open a new thread for such things...

Udo
 
