[SOLVED] Ceph stopped talking after 3.3 upgrade

wahmed · Sep 18, 2014

I just upgraded all 6 Proxmox nodes and now the Ceph cluster nodes have stopped talking to each other. I had Ceph 0.72 previously; after the upgrade it is 0.80. Did I miss an upgrade step?

udo · Sep 18, 2014

symmcom said:
I just upgraded all 6 Proxmox nodes and now the Ceph cluster nodes have stopped talking to each other. I had Ceph 0.72 previously; after the upgrade it is 0.80. Did I miss an upgrade step?

Hi,
have you looked at the Ceph docs:

Code:

Upgrade daemons in the following order:

        Monitors
        OSDs
        MDSs and/or radosgw

If the ceph-mds daemon is restarted first, it will wait until all OSDs have been upgraded before finishing its startup sequence. If the ceph-mon daemons are not restarted prior to the ceph-osd daemons, they will not correctly register their new capabilities with the cluster and new features may not be usable until they are restarted a second time.

and
We recommend adding the following to the [mon] section of your ceph.conf prior to upgrade:

Code:

mon warn on legacy crush tunables = false
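
As a rough illustration of the daemon order quoted above, the following shell sketch sorts a list of daemon names so that monitors come first, then OSDs, then everything else (MDS, radosgw). The function name `upgrade_order` and the `type.id` naming of its arguments are assumptions made for this example; it only prints an order and does not restart anything.

```shell
# Hypothetical helper: print daemon names in the documented upgrade order
# (monitors, then OSDs, then MDS/radosgw). Names are assumed to look like
# "mon.a" or "osd.3"; anything else is placed in the last group.
upgrade_order() {
    for daemon in "$@"; do
        case "$daemon" in
            mon.*) rank=1 ;;
            osd.*) rank=2 ;;
            *)     rank=3 ;;
        esac
        # Prefix each name with its rank so sort can order the groups.
        printf '%s %s\n' "$rank" "$daemon"
    done | sort -k1,1n -k2,2 | cut -d' ' -f2
}

upgrade_order osd.1 mon.a mds.x osd.0
```

Within each group the names are simply sorted alphabetically; the order inside a group is a choice of this sketch, not something the Ceph docs prescribe.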

Udo

wahmed · Sep 18, 2014

I did not read the Ceph docs before upgrading. I did not think Proxmox was going to upgrade Ceph; I simply ran the upgrade through the GUI. I don't have an MDS. I have already restarted the nodes several times, but no go. Not sure if it is already too late, but I am going to add the mon warn setting to the conf and reboot. Let's see if it makes any difference. My tension level is rising with each passing minute.

7 hours till work day starts for 55 users!

udo · Sep 18, 2014

Hi Wasim,
does the monitor start?

Code:

netstat -an | grep 6789 | grep -i listen

Is the filesystem too full?

Code:

df -k
du -hs /var/lib/ceph/mon/*/store.db

You can try starting the mon in the foreground to see error messages,
like this (in this case mon-b; see "ls /var/lib/ceph/mon/"):

Code:

ceph-mon -i b -d -c /etc/ceph/ceph.conf
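
The listening check above can also be wrapped in a small helper so it is usable in a script. This is only a sketch: the function name `has_listener` is made up for this example, and it assumes the usual `netstat -an` line layout with `address:port` columns and a `LISTEN` state.

```shell
# Hypothetical helper: read "netstat -an"-style lines on stdin and succeed
# if some socket on the given port is in the LISTEN state.
has_listener() {
    port="$1"
    awk -v p="$port" '
        # Match ":PORT" (or ".PORT") followed by whitespace on a LISTEN line.
        tolower($0) ~ /listen/ && $0 ~ ("[:.]" p "[ \t]") { found = 1 }
        END { exit !found }
    '
}

# Intended use on a monitor node (6789 is the monitor port used in this thread):
#   netstat -an | has_listener 6789 && echo "mon is listening"
```

Note that a listening socket only shows that some process holds the port; as the rest of the thread shows, it can be an old monitor process that never exited.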

Udo

wahmed · Sep 18, 2014

netstat shows it is listening.

After trying to run the mon in the foreground, this is what I got:
~# ceph-mon -i 0 -d -c /etc/ceph/ceph.conf
2014-09-18 01:29:10.057089 7f2700d13780 0 ceph version 0.80.5 (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx), process ceph-mon, pid 7267
2014-09-18 01:29:10.058536 7f2700d13780 -1 asok(0x3444d20) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-mon.0.asok': (17) File exists
IO error: lock /var/lib/ceph/mon/ceph-0/store.db/LOCK: Resource temporarily unavailable
IO error: lock /var/lib/ceph/mon/ceph-0/store.db/LOCK: Resource temporarily unavailable
2014-09-18 01:29:10.058655 7f2700d13780 -1 failed to create new leveldb store

Looks like the MONs are not starting at all.

udo · Sep 18, 2014

Hi,
Perhaps there is more than one mon process?
Use 0 instead of b ;-)

Code:

root@proxmox4:~# ls -lsa /var/lib/ceph/mon/ceph-b/store.db/LOCK
0 -rw-r--r-- 1 root root 0 Feb  3  2014 /var/lib/ceph/mon/ceph-b/store.db/LOCK
root@proxmox4:~# fuser /var/lib/ceph/mon/ceph-b/store.db/LOCK
/var/lib/ceph/mon/ceph-b/store.db/LOCK:  6357

Is there enough free space in /var/lib/ceph?

Udo

wahmed · Sep 18, 2014

udo said:
Hi,
Perhaps there is more than one mon process?
Use 0 instead of b ;-)

Code:

root@proxmox4:~# ls -lsa /var/lib/ceph/mon/ceph-b/store.db/LOCK
0 -rw-r--r-- 1 root root 0 Feb  3  2014 /var/lib/ceph/mon/ceph-b/store.db/LOCK
root@proxmox4:~# fuser /var/lib/ceph/mon/ceph-b/store.db/LOCK
/var/lib/ceph/mon/ceph-b/store.db/LOCK:  6357

Is there enough free space in /var/lib/ceph?

Udo

0 is what I actually used, since that is the mon ID. With ls -lsa and fuser this is what I get:

Code:

root@CA-00-01-01-01:/etc/ceph# ls -lsa /var/lib/ceph/mon/ceph-0/store.db/LOCK
0 -rw-r--r-- 1 root root 0 Jul 26 17:09 /var/lib/ceph/mon/ceph-0/store.db/LOCK
root@CA-00-01-01-01:/etc/ceph# fuser /var/lib/ceph/mon/ceph-0/store.db/LOCK
/var/lib/ceph/mon/ceph-0/store.db/LOCK:  3199

Plenty of space on the local OS disk. All 6 nodes behave exactly the same. None of the MONs will come up.
If I run ceph -s, the following are some of the error messages it shows:

Code:

2014-09-18 01:46:10.879483 7fcd13e80700  0 monclient(hunting): authenticate timed out after 300
2014-09-18 01:46:10.879523 7fcd13e80700  0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: TimedOut

Could it be that the Proxmox upgrade messed up the Ceph admin authentication key?

udo · Sep 18, 2014

symmcom said:
0 is what I actually used, since that is the mon ID. With ls -lsa and fuser this is what I get:

Code:

root@CA-00-01-01-01:/etc/ceph# ls -lsa /var/lib/ceph/mon/ceph-0/store.db/LOCK
0 -rw-r--r-- 1 root root 0 Jul 26 17:09 /var/lib/ceph/mon/ceph-0/store.db/LOCK
root@CA-00-01-01-01:/etc/ceph# fuser /var/lib/ceph/mon/ceph-0/store.db/LOCK
/var/lib/ceph/mon/ceph-0/store.db/LOCK:  3199

Plenty of space on the local OS disk. All 6 nodes behave exactly the same. None of the MONs will come up.

But your mon is running (and has been since 26.07.14!!). What is the output of

Code:

ps aux | grep 3199

If you kill that mon and start a new one in the foreground, perhaps you will get better error messages.
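
To script the "who holds the LOCK file" check, the PIDs can be pulled out of the fuser output shown earlier in the thread. This is a minimal sketch; the helper name `lock_holder_pids` is made up here, and it assumes fuser output of the form `path:  PID [PID...]`, possibly with access-type letters appended to the PIDs.

```shell
# Hypothetical helper: extract the PIDs from fuser-style output such as
#   /var/lib/ceph/mon/ceph-0/store.db/LOCK:  3199
# Prints one PID per line; returns non-zero if no PID was found.
lock_holder_pids() {
    # Drop the "path:" prefix, turn every non-digit run into a newline,
    # then discard empty lines.
    sed 's/^.*://' | tr -cs '0-9' '\n' | grep .
}

# Intended use (as root, on the monitor node):
#   fuser /var/lib/ceph/mon/ceph-0/store.db/LOCK 2>&1 | lock_holder_pids
#   ps -o pid,lstart,args -p 3199   # procps: shows when that process started
```

A start time that predates the upgrade is a strong hint that the old monitor binary is still running and blocking the new one.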

Udo

wahmed · Sep 18, 2014

I am extremely HAPPY to inform that all MONs are back in business again!!! I can get some sleep before work day starts after all.

Udo, following your initial hint about the Ceph docs, I read the whole page on upgrading Ceph from 0.72 to 0.80. I think a combination of a few things fixed the MON issue. This is what I did:

1. Changed /etc/apt/sources.list.d/ceph.list on all Proxmox nodes. The repo line said deb http://ceph.com/debian-emperor wheezy main; I changed it to debian-firefly.
2. Ran #apt-get update
3. Ran #apt-get dist-upgrade
4. Rebooted.
5. Installed ceph-deploy: #apt-get install ceph-deploy
6. Installed/upgraded Ceph on all MONs from each Proxmox node: #ceph-deploy install --release firefly node1 node2 node3 ...........
7. Rebooted the nodes one by one.
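
The repository change in step 1 can be sketched as a small shell helper. The function name `switch_ceph_repo` is hypothetical, and the sketch assumes the list file contains a line of the form quoted in the post (`.../debian-emperor wheezy main`); the remaining steps are listed as comments using the commands from the post.

```shell
# Hypothetical helper: rewrite the release name in an APT source list,
# e.g. debian-emperor -> debian-firefly. Arguments: file, old, new.
switch_ceph_repo() {
    list_file="$1"
    old_release="$2"
    new_release="$3"
    # Write to a temporary file first, then replace the original.
    sed "s|/debian-${old_release} |/debian-${new_release} |" "$list_file" \
        > "${list_file}.new" && mv "${list_file}.new" "$list_file"
}

# Outline of the sequence from the post, to be run as root on each node:
#   switch_ceph_repo /etc/apt/sources.list.d/ceph.list emperor firefly
#   apt-get update
#   apt-get dist-upgrade
#   apt-get install ceph-deploy
#   ceph-deploy install --release firefly node1 node2 node3
```

Writing to a temporary file and renaming it avoids relying on `sed -i`, which behaves differently across sed implementations.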

And all MONs came back to life one by one!! Thank you sooo much Udo for your help tonight!!

sdutremble · Sep 18, 2014

I am curious how come you are using ceph-deploy and not pveceph?

The recommended Proxmox method is to use pveceph.

Is there an advantage to using ceph-deploy?
What changes must we make to use ceph-deploy and still take advantage of the Proxmox cluster replication and Ceph management?

I am interested because I would like to try to install one or more MDS and it appears to be very easy using ceph-deploy.

Thanks.

Serge

udo · Sep 18, 2014

sdutremble said:
I am interested because I would like to try to install one or more MDS and it appears to be very easy using ceph-deploy.

Hi,
You know that only one MDS can be active at a time?! That is the reason CephFS isn't production-stable yet.

Udo

sdutremble · Sep 18, 2014

I thought I heard this had been fixed. I may be wrong though.

In any case, it would be interesting to look into.

Serge

wahmed · Sep 18, 2014

sdutremble said:
I am curious how come you are using ceph-deploy and not pveceph?

The recommended Proxmox method is to use pveceph.

Is there an advantage to using ceph-deploy?
What changes must we make to use ceph-deploy and still take advantage of the Proxmox cluster replication and Ceph management?

I am interested because I would like to try to install one or more MDS and it appears to be very easy using ceph-deploy.

I probably could have used #pveceph, but I was not thinking quite straight last night out of panic.

I used ceph-deploy because I am familiar with it. I learned Ceph before it was included in Proxmox. My whole Ceph cluster was based on Ubuntu for many months, and I upgraded it twice during those months using ceph-deploy. I was not quite sure what #pveceph install would do if a Ceph installation already exists on the same node. Probably nothing. It would be worth finding out.

No changes are necessary to use ceph-deploy. Just install it with #apt-get install ceph-deploy. It is just a standalone tool.

Yes, you are right about MDS being easy to install with ceph-deploy. Actually, it is the only method right now to have an MDS on a Proxmox node, since Proxmox+Ceph does not support MDS yet.

udo · Sep 26, 2014

symmcom said:
netstat shows it is listening.

After trying to run the mon in the foreground, this is what I got:
~# ceph-mon -i 0 -d -c /etc/ceph/ceph.conf
2014-09-18 01:29:10.057089 7f2700d13780 0 ceph version 0.80.5 (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx), process ceph-mon, pid 7267
2014-09-18 01:29:10.058536 7f2700d13780 -1 asok(0x3444d20) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-mon.0.asok': (17) File exists
IO error: lock /var/lib/ceph/mon/ceph-0/store.db/LOCK: Resource temporarily unavailable
IO error: lock /var/lib/ceph/mon/ceph-0/store.db/LOCK: Resource temporarily unavailable
2014-09-18 01:29:10.058655 7f2700d13780 -1 failed to create new leveldb store

Looks like the MONs are not starting at all.

Hi Wasim,
I have just updated the mon nodes in my Proxmox cluster, and on one node I had the same issue.

But the problem was easily solved: the monitor process simply did not die during the restart:

Code:

apt-get install --only-upgrade ceph ceph-common ceph-fs-common ceph-fuse ceph-mds libcephfs1 python-ceph

root@proxmox2:~# /etc/init.d/ceph restart
=== mon.a === 
=== mon.a === 
Stopping Ceph mon.a on proxmox2...done
=== mon.a === 
Starting Ceph mon.a on proxmox2...
IO error: lock /var/lib/ceph/mon/ceph-a/store.db/LOCK: Resource temporarily unavailable
IO error: lock /var/lib/ceph/mon/ceph-a/store.db/LOCK: Resource temporarily unavailable
2014-09-26 21:25:53.146853 7f6b1133b780 -1 failed to create new leveldb store
failed: 'ulimit -n 32768;  /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf --cluster ceph '
Starting ceph-create-keys on proxmox2...

root@proxmox2:~# fuser /var/lib/ceph/mon/ceph-a/store.db/LOCK 
/var/lib/ceph/mon/ceph-a/store.db/LOCK: 652183
root@proxmox2:~# ps aux | grep 652183
root      652183  1.4  0.4 934032 134872 ?       Sl   Jun16 2084:34 ceph-mon -i a
root      773100  0.0  0.0   7792   948 pts/4    S+   21:27   0:00 grep 652183

root@proxmox2:~# kill 652183
root@proxmox2:~# kill 652183
root@proxmox2:~# kill 652183
-bash: kill: (652183) - No such process
root@proxmox2:~# /etc/init.d/ceph restart
=== mon.a === 
=== mon.a === 
Stopping Ceph mon.a on proxmox2...done
=== mon.a === 
Starting Ceph mon.a on proxmox2...
Starting ceph-create-keys on proxmox2...


root@proxmox2:~# ceph -s
    cluster 591db070-15c1-4c7a-b107-67717bdb87d9
     health HEALTH_WARN some monitors are running older code
...

The other two mon nodes updated without issues.

Next time you won't need to delete and restore with ceph-deploy.
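
The repeated `kill` in the transcript above is effectively a manual wait for the old monitor to exit. A minimal sketch of that wait as a shell function follows; the name `wait_for_exit` and the timeout parameter are assumptions of this example, not part of any Ceph tooling.

```shell
# Hypothetical helper: wait until the process with the given PID is gone.
# Returns 0 once it has exited, 1 if it is still there after the timeout
# (in seconds, default 10). "kill -0" only tests for existence; run it as
# root, since it also fails for processes you are not allowed to signal.
wait_for_exit() {
    pid="$1"
    timeout="${2:-10}"
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Intended use, with the PID reported by fuser for the LOCK file:
#   kill 652183
#   wait_for_exit 652183 30 && /etc/init.d/ceph restart
```

Restarting the init script only after the old process is confirmed gone avoids the "Resource temporarily unavailable" lock error seen above.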

Udo

sdutremble · Oct 2, 2014

I finally upgraded my two cluster members from PVE 3.2 to PVE 3.3 latest.

In the course of the upgrade, the ceph.conf file got truncated.

I had 4 OSDs defined on each proxmox cluster member. The ceph.conf file has nothing below the monitor configuration.

I see no reference of any OSDs in /etc/pve/ceph.

I have not touched the HDDs themselves and they are still showing in the Ceph -> Disk GUI pane.

I do not know what to do. Is there a way to recover such a screw-up?

Serge

dietmar · Oct 2, 2014

sdutremble said:
I had 4 OSDs defined on each proxmox cluster member. The ceph.conf file has nothing below the monitor configuration.

I see no reference of any OSDs in /etc/pve/ceph.

First, the update does not touch the ceph configuration. Second, it is normal that there is no reference to OSDs (not needed).

sdutremble · Oct 2, 2014

OK Thanks. I could not find any sample ceph.conf file and I thought I remembered seeing something below.

Is there a way to recover the OSD mapping I had before the upgrade? It seems to me that I should have something more under /etc/pve/ceph and I do not see any subdirectories containing MON or OSD information.

I am wondering how to check things out and begin the troubleshooting.

The Web interface gives me timeouts when I try to see the Ceph status and the OSD panel.

Rebooting makes no difference and I do not know where to begin.

How can I manually shut down and restart Ceph in a controlled manner?

Serge

udo · Oct 2, 2014

sdutremble said:
...
Rebooting makes no difference and I do not know where to begin.

How can I manually shut down and restart Ceph in a controlled manner?

Serge

Hi,
start on the console with commands like this:

Code:

ceph health detail
ceph -s
ceph osd tree

Udo

sdutremble · Oct 3, 2014

I am getting this for all:

2014-10-02 20:55:37.279808 7fde044b3700 0 -- :/1261613 >> 10.10.10.50:6789/0 pipe(0xd151f0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0xd15460).fault
2014-10-02 20:55:40.279868 7fde043b2700 0 -- :/1261613 >> 10.10.10.51:6789/0 pipe(0xd14390 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0xd14600).fault
2014-10-02 20:55:43.280361 7fde044b3700 0 -- :/1261613 >> 10.10.10.50:6789/0 pipe(0xd15bf0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0xd15e60).fault
2014-10-02 20:55:46.280463 7fde043b2700 0 -- :/1261613 >> 10.10.10.51:6789/0 pipe(0xd16160 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0xd163d0).fault
...

/etc/init.d/ceph restart
/etc/init.d/ceph stop
/etc/init.d/ceph start

Has no output and no effect.

Serge

udo · Oct 3, 2014

sdutremble said:
I am getting this for all:

2014-10-02 20:55:37.279808 7fde044b3700 0 -- :/1261613 >> 10.10.10.50:6789/0 pipe(0xd151f0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0xd15460).fault
2014-10-02 20:55:40.279868 7fde043b2700 0 -- :/1261613 >> 10.10.10.51:6789/0 pipe(0xd14390 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0xd14600).fault
2014-10-02 20:55:43.280361 7fde044b3700 0 -- :/1261613 >> 10.10.10.50:6789/0 pipe(0xd15bf0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0xd15e60).fault
2014-10-02 20:55:46.280463 7fde043b2700 0 -- :/1261613 >> 10.10.10.51:6789/0 pipe(0xd16160 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0xd163d0).fault
...

/etc/init.d/ceph restart
/etc/init.d/ceph stop
/etc/init.d/ceph start

Has no output and no effect.

Serge

Hi Serge,
it looks to me like you have only two ceph-mons (10.10.10.50 + 10.10.10.51). With only two mons, both must be running to reach quorum; a third mon is recommended.
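
For reference, Ceph monitors form a quorum by strict majority, so the number of mons that must be up can be worked out with a one-line calculation. The function name `quorum_size` is made up for this sketch.

```shell
# Smallest number of monitors that forms a strict majority of mon_count.
quorum_size() {
    mon_count="$1"
    echo $(( mon_count / 2 + 1 ))
}

quorum_size 2   # two mons: both must be up
quorum_size 3   # three mons: one may be down
```

This is why a two-monitor setup tolerates no monitor failure at all, while three monitors tolerate one.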

Are both mons running?

Code:

netstat -an | grep 6789

Which error messages do you get if you start your mons in the foreground? (Stop the mon first with "kill PID"; find the PID with "ps aux | grep ceph".)

To find the mon ID for starting the mon in the foreground, look in

Code:

ls /var/lib/ceph/mon/

If the directory is named ceph-0, you need the ID 0.

Code:

ceph-mon -i 0 -d -c /etc/ceph/ceph.conf

Is there any output that looks bad?

Do the disks have enough space?

Code:

df -h

BTW, you should open a new thread for such things...

Udo
