One Proxmox Ceph Cluster Node Can't Sync and Keeps Restarting

Dany Kurniawan

New Member
Feb 13, 2019
Hello everyone.
I have successfully deployed Proxmox VE 5.3 with Ceph across a 3-node cluster.

Accidentally, one of the 3 nodes went down because of an electrical power problem, leaving 2 nodes active.
When I power that server on again, it won't rejoin the Ceph cluster to sync, and then it restarts every time. When I pull out the network cable connected to the Ceph cluster, the server runs normally, so I guess the problem is in the Ceph sync process.

Which logs do I need to check to trace this problem? I use QuantaGrid D52BQ-2U servers and a Cisco Nexus switch.

Has anyone else ever had an issue like this?
 
Code:
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped PVE API Proxy Server.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped target PVE Storage Target.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped target ceph target allowing to start/stop all ceph-mon@.service instances at once.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopping Ceph cluster monitor daemon...
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped target ceph target allowing to start/stop all ceph-osd@.service instances at once.
Jul 20 17:09:59 pveceph-node1 ceph-mon[1853]: 2019-07-20 17:09:59.492466 7fac3d612700 -1 received  signal: Terminated from  PID: 1 task name: /sbin/init  UID: 0
Jul 20 17:09:59 pveceph-node1 ceph-mon[1853]: 2019-07-20 17:09:59.492505 7fac3d612700 -1 mon.pveceph-node1@1(probing) e3 *** Got Signal Terminated ***
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped target ceph target allowing to start/stop all ceph-mds@.service instances at once.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped target ceph target allowing to start/stop all ceph-mgr@.service instances at once.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopping Ceph cluster manager daemon...
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopping OpenBSD Secure Shell server...
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped OpenBSD Secure Shell server.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped Ceph cluster manager daemon.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Removed slice system-ceph\x2dmgr.slice.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopped PVE Cluster Ressource Manager Daemon.
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopping PVE API Daemon...
Jul 20 17:09:59 pveceph-node1 systemd[1]: Stopping Proxmox VE watchdog multiplexer...
Jul 20 17:15:57 pveceph-node1 systemd[1]: Starting Flush Journal to Persistent Storage...
Jul 20 17:15:57 pveceph-node1 systemd-modules-load[460]: Inserted module 'iscsi_tcp'
Jul 20 17:15:57 pveceph-node1 systemd[1]: Started Flush Journal to Persistent Storage.
Jul 20 17:15:57 pveceph-node1 systemd[1]: Started Load/Save Random Seed.
Jul 20 17:15:57 pveceph-node1 systemd[1]: Mounted RPC Pipe File System.
Jul 20 17:15:57 pveceph-node1 keyboard-setup.sh[462]: cannot open file /tmp/tmpkbd.D7IQVE

Here is the syslog update. The gap between 17:09 and 17:15 is the restart window.

Code:
Jul 20 17:09:59 pveceph-node1 ceph-mon[1853]: 2019-07-20 17:09:59.492466 7fac3d612700 -1 received  signal: Terminated from  PID: 1 task name: /sbin/init  UID: 0
Jul 20 17:09:59 pveceph-node1 ceph-mon[1853]: 2019-07-20 17:09:59.492505 7fac3d612700 -1 mon.pveceph-node1@1(probing) e3 *** Got Signal Terminated ***

I think the problem is in this error log, but I don't know what it means or why the monitor got a Terminated signal.
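One way to dig further (a sketch, assuming persistent journald storage is enabled so logs from before the reboot survive; otherwise the previous boot's journal is lost) is to pull the journal from the boot just before the restart:

```shell
# Sketch: inspect the boot before the unexpected restart.
# Requires persistent journald storage (/var/log/journal must exist).

# List all recorded boots with their indices
journalctl --list-boots

# Show warnings and errors from the previous boot (-b -1)
journalctl -b -1 -p warning --no-pager
```

The last messages of boot -1 should show what initiated the shutdown, since the syslog paste above only shows systemd stopping services, not why.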
 
Hey, can you check whether the server is still correctly part of the Proxmox VE cluster? As root, just execute:
Code:
pvecm status

Also a
Code:
cat /etc/pve/ceph.conf
ceph -s
would be interesting.
 
Here is the result from pvecm status

Code:
root@pveceph-node1:~# pvecm status
Quorum information
------------------
Date:             Mon Jul 22 14:52:48 2019
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1/2280
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.1.50
0x00000002          1 192.168.1.51 (local)
0x00000003          1 192.168.1.52

Here is the result from cat /etc/pve/ceph.conf

Code:
root@pveceph-node1:~# cat /etc/pve/ceph.conf
[global]
         auth client required = cephx
         auth cluster required = cephx
         auth service required = cephx
         cluster network = 10.10.10.0/24
         fsid = 9acbf954-000d-4996-b0db-2f3915c68c80
         keyring = /etc/pve/priv/$cluster.$name.keyring
         mon allow pool delete = true
         osd journal size = 5120
         osd pool default min size = 2
         osd pool default size = 3
         public network = 10.10.10.0/24

[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.pveceph-master]
         host = pveceph-master
         mon addr = 10.10.10.100:6789

[mon.pveceph-node2]
         host = pveceph-node2
         mon addr = 10.10.10.102:6789

[mon.pveceph-node1]
         host = pveceph-node1
         mon addr = 10.10.10.101:6789

When I run the command
Code:
ceph -s
it just freezes. I think that's because I unplugged the cable that Ceph uses to communicate.
Code:
2019-07-22 15:01:59.138918 7ffb53f18700  0 monclient(hunting): authenticate timed out after 300
2019-07-22 15:01:59.138960 7ffb53f18700  0 librados: client.admin authentication error (110) Connection timed out
[errno 110] error connecting to the cluster


Every server has 2 interfaces: eth0 for internet and the PVE cluster, and eth1 for Ceph communication.
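Since `ceph -s` hangs with an authentication timeout, a quick way to see whether the monitors are reachable over the Ceph network at all is to probe their TCP ports directly. A minimal sketch (not from the thread; it assumes `nc` is installed and parses the `mon addr` lines out of the ceph.conf posted above):

```shell
#!/bin/sh
# Sketch: probe each Ceph monitor's TCP port listed in ceph.conf.
# Assumes nc(1) is available; CONF defaults to the PVE-managed path.
CONF=${CONF:-/etc/pve/ceph.conf}

# Emit "host port" for every "mon addr = host:port" line in the config
mon_addrs() {
    awk -F' *= *' '/mon addr/ { split($2, a, ":"); print a[1], a[2] }' "$CONF"
}

probe_mons() {
    mon_addrs | while read -r host port; do
        # -z: only attempt the TCP handshake; -w2: two-second timeout
        if nc -z -w2 "$host" "$port"; then
            echo "mon $host:$port reachable"
        else
            echo "mon $host:$port NOT reachable"
        fi
    done
}

if [ -r "$CONF" ]; then
    probe_mons
fi
```

If all three monitors time out while eth1 is plugged in, the problem is below Ceph (cabling, VLAN, or switch configuration) rather than in the daemons themselves.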
 

Attachments

  • ceph-s.PNG (1.6 KB)
Update:
I have done nothing, but the server restarted by itself. I checked the log, and it shows this:

Code:
Jul 22 16:31:35 pveceph-node1 pvestatd[1867]: status update time (5.114 seconds)
Jul 22 16:31:45 pveceph-node1 pvestatd[1867]: got timeout
Jul 22 16:31:45 pveceph-node1 pvestatd[1867]: status update time (5.113 seconds)
Jul 22 16:31:55 pveceph-node1 pvestatd[1867]: got timeout
Jul 22 16:31:55 pveceph-node1 pvestatd[1867]: status update time (5.115 seconds)
Jul 22 16:32:00 pveceph-node1 systemd[1]: Starting Proxmox VE replication runner...
Jul 22 16:32:01 pveceph-node1 systemd[1]: Started Proxmox VE replication runner.
NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL 
Jul 22 16:35:59 pveceph-node1 systemd-modules-load[475]: Inserted module 'iscsi_tcp'
Jul 22 16:35:59 pveceph-node1 systemd[1]: Mounted Debug File System.
Jul 22 16:35:59 pveceph-node1 systemd[1]: Mounted POSIX Message Queue File System.
Jul 22 16:35:59 pveceph-node1 systemd[1]: Mounted Huge Pages File System.

Is this because of Ceph?
 
On the Cisco Nexus switch, I use VLAN 200 and VLAN 250:
VLAN 200 is for internet traffic.
VLAN 250 is for the Ceph cluster.

I activated those VLANs with IGMP snooping on,
and on every VLAN I defined an IGMP snooping querier (ip igmp snooping querier).

Do I need to disable the IGMP snooping querier on VLAN 250?
If I disable IGMP snooping on VLAN 200, none of the servers can reach quorum.
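For reference, removing the querier only on the Ceph VLAN (while leaving VLAN 200 alone, since quorum breaks without snooping there) might look roughly like this in NX-OS syntax. This is a sketch only: the querier address 10.10.10.1 is a placeholder for whatever address was configured, and the exact command form varies by NX-OS release:

```
! Sketch - verify against your NX-OS release documentation
vlan configuration 250
  no ip igmp snooping querier 10.10.10.1
```

Worth noting: Ceph itself uses unicast TCP, so IGMP snooping settings on VLAN 250 would not normally affect Ceph traffic; it is corosync's multicast on the cluster VLAN that depends on a working querier.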