One Node of a Proxmox Ceph Cluster Can't Sync and Restarts Afterwards

Dany Kurniawan

New Member
Feb 13, 2019
Hello everyone.
I have successfully deployed Proxmox VE 5.3 with Ceph on a 3-node cluster.

Accidentally, one of the 3 nodes went down because of an electrical power problem, leaving the other 2 nodes active.
When I power that server on again, it won't join the Ceph cluster to sync, and after that it restarts every time. When I pull out the network cable connected to the Ceph cluster, the server runs normally, so I guess the problem is in the Ceph sync process.

What logs do I need to trace for this problem? I use QuantaGrid D52BQ-2U servers and a Cisco Nexus switch.

Has anyone ever had an issue like this?
 
Code:
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped PVE API Proxy Server.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped target PVE Storage Target.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped target ceph target allowing to start/stop all ceph-mon@.service instances at once.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopping Ceph cluster monitor daemon...
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped target ceph target allowing to start/stop all ceph-osd@.service instances at once.
Jul 20 17:09:59 pveceph-node1 ceph-mon[1853]: 2019-07-20 17:09:59.492466 7fac3d612700 -1 received  signal: Terminated from  PID: 1 task name: /sbin/init  UID: 0
Jul 20 17:09:59 pveceph-node1 ceph-mon[1853]: 2019-07-20 17:09:59.492505 7fac3d612700 -1 mon.pveceph-node1@1(probing) e3 *** Got Signal Terminated ***
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped target ceph target allowing to start/stop all ceph-mds@.service instances at once.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped target ceph target allowing to start/stop all ceph-mgr@.service instances at once.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopping Ceph cluster manager daemon...
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopping OpenBSD Secure Shell server...
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped OpenBSD Secure Shell server.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped Ceph cluster manager daemon.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Removed slice system-ceph\x2dmgr.slice.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped PVE Cluster Ressource Manager Daemon.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopping PVE API Daemon...
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopping Proxmox VE watchdog multiplexer...
Jul 20 17:15:57 pveceph-node1 systemd[1]: Starting Flush Journal to Persistent Storage...
Jul 20 17:15:57 pveceph-node1 systemd-modules-load[460]: Inserted module 'iscsi_tcp'
Jul 20 17:15:57 pveceph-node1 systemd[1]: Started Flush Journal to Persistent Storage.
Jul 20 17:15:57 pveceph-node1 systemd[1]: Started Load/Save Random Seed.
Jul 20 17:15:57 pveceph-node1 systemd[1]: Mounted RPC Pipe File System.
Jul 20 17:15:57 pveceph-node1 keyboard-setup.sh[462]: cannot open file /tmp/tmpkbd.D7IQVE

Here is the update from syslog. The window between 17:09 and 17:15 is the restart period.

Code:
Jul 20 17:09:59 pveceph-node1 ceph-mon[1853]: 2019-07-20 17:09:59.492466 7fac3d612700 -1 received  signal: Terminated from  PID: 1 task name: /sbin/init  UID: 0
Jul 20 17:09:59 pveceph-node1 ceph-mon[1853]: 2019-07-20 17:09:59.492505 7fac3d612700 -1 mon.pveceph-node1@1(probing) e3 *** Got Signal Terminated ***

I think the problem is in this error log, but I don't know what it means or why the monitor got a terminate signal.
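Looking at these lines again, "Terminated from PID: 1" seems to just mean that systemd itself stopped the monitor as part of a shutdown sequence (the surrounding lines show other units being stopped too), so maybe the real question is what triggered that shutdown. A rough sketch for checking the previous boot's journal (this assumes a persistent journal, i.e. `Storage=persistent` in `/etc/systemd/journald.conf`):

```shell
#!/bin/sh
# Show warnings and errors from the boot *before* the unexpected restart.
# Requires a persistent journal; without one, -b -1 has nothing to show.
if command -v journalctl >/dev/null 2>&1; then
    # Errors from journalctl (e.g. no previous boot recorded) are suppressed;
    # in that case tail simply prints nothing.
    journalctl -b -1 -p warning --no-pager 2>/dev/null | tail -n 50
else
    echo "journalctl not available here"
fi
```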
 
Hey, can you see if the server is still correctly in the Proxmox VE cluster? As root just execute:
Code:
pvecm status

Also a
Code:
cat /etc/pve/ceph.conf
ceph -s
would be interesting.
 
Here is the result from pvecm status

Code:
root@pveceph-node1:~# pvecm status
Quorum information
------------------
Date:             Mon Jul 22 14:52:48 2019
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1/2280
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.1.50
0x00000002          1 192.168.1.51 (local)
0x00000003          1 192.168.1.52

Here is the result from cat /etc/pve/ceph.conf

Code:
root@pveceph-node1:~# cat /etc/pve/ceph.conf
[global]
         auth client required = cephx
         auth cluster required = cephx
         auth service required = cephx
         cluster network = 10.10.10.0/24
         fsid = 9acbf954-000d-4996-b0db-2f3915c68c80
         keyring = /etc/pve/priv/$cluster.$name.keyring
         mon allow pool delete = true
         osd journal size = 5120
         osd pool default min size = 2
         osd pool default size = 3
         public network = 10.10.10.0/24

[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.pveceph-master]
         host = pveceph-master
         mon addr = 10.10.10.100:6789

[mon.pveceph-node2]
         host = pveceph-node2
         mon addr = 10.10.10.102:6789

[mon.pveceph-node1]
         host = pveceph-node1
         mon addr = 10.10.10.101:6789

When I run the command
Code:
ceph -s
it just freezes. I think that's because I unplugged the cable that Ceph uses to communicate.
Code:
2019-07-22 15:01:59.138918 7ffb53f18700  0 monclient(hunting): authenticate timed out after 300
2019-07-22 15:01:59.138960 7ffb53f18700  0 librados: client.admin authentication error (110) Connection timed out
[errno 110] error connecting to the cluster
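Since `ceph -s` hangs for the full 300-second auth timeout rather than failing immediately, it might be worth checking plain TCP reachability of the monitor ports first. A rough sketch using the mon addresses from the ceph.conf above (bash-only, since it relies on the `/dev/tcp` redirection):

```shell
#!/bin/bash
# Probe a Ceph monitor's TCP port (default 6789) with a 3-second timeout.
# Uses bash's /dev/tcp pseudo-device, so it needs bash, not plain sh.
check_mon() {
    if timeout 3 bash -c "</dev/tcp/$1/${2:-6789}" 2>/dev/null; then
        echo "$1:${2:-6789} reachable"
    else
        echo "$1:${2:-6789} NOT reachable"
    fi
}

# Monitor addresses taken from the ceph.conf shown earlier:
for mon in 10.10.10.100 10.10.10.101 10.10.10.102; do
    check_mon "$mon"
done
```

If a mon address is not reachable at the TCP level, the problem is the network path (cabling, VLAN, switch config) rather than cephx authentication.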


Every server has 2 interfaces: eth0 for internet and the PVE cluster, and eth1 for Ceph communication.
 

Attachments

  • ceph-s.PNG
Update:
I have done nothing, but the server restarted by itself. I checked the log and it shows the following.

Code:
Jul 22 16:31:35 pveceph-node1 pvestatd[1867]: status update time (5.114 seconds)
Jul 22 16:31:45 pveceph-node1 pvestatd[1867]: got timeout
Jul 22 16:31:45 pveceph-node1 pvestatd[1867]: status update time (5.113 seconds)
Jul 22 16:31:55 pveceph-node1 pvestatd[1867]: got timeout
Jul 22 16:31:55 pveceph-node1 pvestatd[1867]: status update time (5.115 seconds)
Jul 22 16:32:00 pveceph-node1 systemd[1]: Starting Proxmox VE replication runner...
Jul 22 16:32:01 pveceph-node1 systemd[1]: Started Proxmox VE replication runner.
NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL 
Jul 22 16:35:59 pveceph-node1 systemd-modules-load[475]: Inserted module 'iscsi_tcp'
Jul 22 16:35:59 pveceph-node1 systemd[1]: Mounted Debug File System.
Jul 22 16:35:59 pveceph-node1 systemd[1]: Mounted POSIX Message Queue File System.
Jul 22 16:35:59 pveceph-node1 systemd[1]: Mounted Huge Pages File System.

Is this because of Ceph?
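The run of NUL bytes in the middle of the syslog is itself a clue: it usually means the node went down hard while the log was being written (power loss or a hardware/watchdog reset) rather than shutting down cleanly. A small sketch to count them (the syslog path is the Debian/PVE default; adjust if yours differs):

```shell
#!/bin/sh
# Count NUL bytes in a log file. A non-zero count right before a boot
# usually means the machine was reset or lost power mid-write, i.e. the
# restart was not a clean, software-initiated shutdown.
count_nul_bytes() {
    tr -cd '\0' < "$1" | wc -c
}

# On the affected node (Debian/PVE default log path):
# count_nul_bytes /var/log/syslog
```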
 
On the Cisco Nexus switch, I use VLAN 200 and VLAN 250:
VLAN 200 is for internet traffic
VLAN 250 is for the Ceph cluster

I activated those VLANs with IGMP snooping on,
and on every VLAN I defined an IGMP snooping querier.

Do I need to disable the IGMP snooping querier on VLAN 250?
If I disable IGMP snooping on VLAN 200, the servers can't reach quorum.
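Since corosync on PVE 5.x uses multicast by default, IGMP snooping without a properly working querier on the cluster VLAN is a classic cause of nodes dropping out after a few minutes. The Proxmox docs suggest testing multicast between all nodes with omping; a sketch, to be run on all three nodes at the same time (hostnames taken from the ceph.conf above):

```shell
#!/bin/sh
# Multicast sanity check between cluster nodes using omping
# (apt install omping). Run the same command on every node simultaneously.
NODES="pveceph-master pveceph-node1 pveceph-node2"

if command -v omping >/dev/null 2>&1; then
    # ~10 minutes of one-packet-per-second multicast; watch for loss that
    # starts after roughly 5 minutes, which points at the IGMP snooping
    # querier dropping the group membership.
    omping -c 600 -i 1 -q $NODES
else
    echo "omping not installed; install it with: apt install omping" >&2
fi
```

If packet loss only appears after several minutes, the querier configuration on that VLAN is the likely culprit.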
 
