Good evening,
After some tests, I discovered that if one of my four nodes goes down, disk I/O hangs.
The VMs and CTs stay up, but none of their disks are available for I/O anymore.
I have 3 Ceph monitors.
When I reboot the node, the Ceph logs show:
Code:
2019-01-24 10:28:08.240463 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175770 : cluster [INF] osd.2 marked itself down
2019-01-24 10:28:08.276487 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175771 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2019-01-24 10:28:08.286445 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175773 : cluster [INF] Standby daemon mds.bluehub-prox05 assigned to filesystem cephfs as rank 0
2019-01-24 10:28:09.238925 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175776 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-01-24 10:28:09.238987 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175777 : cluster [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2019-01-24 10:28:10.427732 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175784 : cluster [INF] daemon mds.bluehub-prox05 is now active in filesystem cephfs as rank 0
2019-01-24 10:28:11.401683 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175785 : cluster [INF] Health check cleared: FS_DEGRADED (was: 1 filesystem is degraded)
2019-01-24 10:28:11.402667 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175786 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2019-01-24 10:28:11.402698 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175787 : cluster [WRN] Health check failed: Degraded data redundancy: 3660/1823339 objects degraded (0.201%), 14 pgs degraded (PG_DEGRADED)
2019-01-24 10:28:24.577475 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8302 : cluster [INF] mon.bluehub-prox03 calling monitor election
2019-01-24 10:28:24.595828 mon.bluehub-prox05 mon.2 10.9.9.5:6789/0 1879958 : cluster [INF] mon.bluehub-prox05 calling monitor election
2019-01-24 10:28:34.598958 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8303 : cluster [INF] mon.bluehub-prox03 is new leader, mons bluehub-prox03,bluehub-prox05 in quorum (ranks 1,2)
2019-01-24 10:28:34.625022 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8308 : cluster [WRN] Health check failed: 1/3 mons down, quorum bluehub-prox03,bluehub-prox05 (MON_DOWN)
2019-01-24 10:28:34.642025 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8310 : cluster [WRN] overall HEALTH_WARN 1 osds down; 1 host (1 osds) down; 22/1823339 objects misplaced (0.001%); Reduced data availability: 1 pg peering; Degraded data redundancy: 67494/1823339 objects degraded (3.702%), 188 pgs degraded; 1/3 mons down, quorum bluehub-prox03,bluehub-prox05
2019-01-24 10:28:34.676528 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8311 : cluster [WRN] Health check update: 22/1823513 objects misplaced (0.001%) (OBJECT_MISPLACED)
2019-01-24 10:28:34.676588 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8312 : cluster [WRN] Health check update: Degraded data redundancy: 68479/1823513 objects degraded (3.755%), 199 pgs degraded (PG_DEGRADED)
2019-01-24 10:28:34.676648 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8313 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2019-01-24 10:29:09.215151 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8318 : cluster [WRN] Health check update: 22/1823515 objects misplaced (0.001%) (OBJECT_MISPLACED)
2019-01-24 10:29:09.215200 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8319 : cluster [WRN] Health check update: Degraded data redundancy: 68479/1823515 objects degraded (3.755%), 199 pgs degraded (PG_DEGRADED)
2019-01-24 10:29:11.285465 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8320 : cluster [WRN] Health check failed: Reduced data availability: 76 pgs inactive (PG_AVAILABILITY)
2019-01-24 10:29:15.441692 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8321 : cluster [WRN] Health check update: Degraded data redundancy: 68479/1823515 objects degraded (3.755%), 199 pgs degraded, 203 pgs undersized (PG_DEGRADED)
2019-01-24 10:32:24.933885 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 1 : cluster [INF] mon.bluehub-prox02 calling monitor election
2019-01-24 10:32:24.938282 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 2 : cluster [INF] mon.bluehub-prox02 calling monitor election
2019-01-24 10:32:24.979618 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 3 : cluster [INF] mon.bluehub-prox02 is new leader, mons bluehub-prox02,bluehub-prox03,bluehub-prox05 in quorum (ranks 0,1,2)
2019-01-24 10:32:25.002587 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 4 : cluster [WRN] mon.2 10.9.9.5:6789/0 clock skew 0.491436s > max 0.05s
2019-01-24 10:32:25.002706 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 5 : cluster [WRN] mon.1 10.9.9.3:6789/0 clock skew 0.491068s > max 0.05s
2019-01-24 10:32:25.009771 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 10 : cluster [WRN] Health check failed: clock skew detected on mon.bluehub-prox03, mon.bluehub-prox05 (MON_CLOCK_SKEW)
2019-01-24 10:32:25.009805 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 11 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum bluehub-prox03,bluehub-prox05)
2019-01-24 10:32:25.010647 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 12 : cluster [WRN] message from mon.2 was stamped 0.491892s in the future, clocks not synchronized
2019-01-24 10:32:25.022385 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 13 : cluster [WRN] overall HEALTH_WARN 1 osds down; 1 host (1 osds) down; 22/1823515 objects misplaced (0.001%); Reduced data availability: 76 pgs inactive; Degraded data redundancy: 68479/1823515 objects degraded (3.755%), 199 pgs degraded, 203 pgs undersized; clock skew detected on mon.bluehub-prox03, mon.bluehub-prox05
2019-01-24 10:32:25.428905 mon.bluehub-prox05 mon.2 10.9.9.5:6789/0 1880004 : cluster [INF] mon.bluehub-prox05 calling monitor election
2019-01-24 10:32:25.429032 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8348 : cluster [INF] mon.bluehub-prox03 calling monitor election
2019-01-24 10:32:29.988287 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 16 : cluster [INF] Manager daemon bluehub-prox05 is unresponsive. No standby daemons available.
2019-01-24 10:32:29.988376 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 17 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)
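For reference, while the node is down these are the kinds of queries I would run to see which placement groups are blocking I/O and how the pools are configured, since the log above reports inactive and degraded PGs. This is only a minimal diagnostic sketch, not output from my cluster:
Code:
# overall health with the reason behind every warning
ceph health detail

# list the PGs that are currently inactive (these block client I/O)
ceph pg dump_stuck inactive

# size/min_size of every pool: a PG stops serving I/O as soon as
# fewer than min_size replicas remain available
ceph osd pool ls detail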
Here is my network configuration:
Code:
cat /etc/network/interfaces

auto lo
iface lo inet loopback

iface eno1 inet manual
#Production

auto vmbr0
iface vmbr0 inet static
        address 10.169.136.75
        netmask 255.255.255.128
        gateway 10.169.136.1
        bridge_ports eno1
        bridge_stp off
        bridge_fd 0

iface eno2 inet manual

iface enp0s29f0u2 inet manual

iface ens6f0 inet manual

iface ens6f1 inet manual

iface ens2f0 inet manual

iface ens2f1 inet manual

auto vlan1050
iface vlan1050 inet static
        vlan_raw_device ens2f0
        address 10.9.9.1
        netmask 255.255.255.0
        network 10.9.9.0
#Ceph

auto vlan1048
iface vlan1048 inet static
        vlan_raw_device ens2f0
        address 10.1.1.1
        netmask 255.255.255.0
        network 10.1.1.0
#Cluster
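Both the Ceph network (vlan1050) and the cluster network (vlan1048) are VLANs on the same ens2f0 interface. To rule out a network issue I would also verify that the Ceph VLAN is up and the other monitors are reachable over it; a quick check, assuming the other nodes follow the same addressing as in the log above (10.9.9.2, 10.9.9.3, 10.9.9.5):
Code:
# confirm the Ceph VLAN device is up and tagged on ens2f0
ip -d link show vlan1050

# reach the other Ceph monitors over the Ceph network
ping -c 3 10.9.9.2
ping -c 3 10.9.9.3
ping -c 3 10.9.9.5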
Here is the cluster status:
Code:
Quorum information
------------------
Date: Wed Jan 23 17:09:11 2019
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000001
Ring ID: 1/4580
Quorate: Yes
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.1.1.1 (local)
0x00000002 1 10.1.1.2
0x00000003 1 10.1.1.3
0x00000004 1 10.1.1.5
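(I believe the output above was taken with pvecm status; the same information can be re-checked at any time with the commands below.)
Code:
# Proxmox view of corosync quorum and membership
pvecm status

# raw corosync quorum status
corosync-quorumtool -s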
My package versions:
Code:
proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
pve-kernel-4.15: 5.3-1
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
ceph: 12.2.10-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-19
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-36
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-33
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-45
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
Is it possible that the problem is the clock skew?
Code:
cluster [WRN] mon.2 10.9.9.5:6789/0 clock skew 0.491436s > max 0.05s
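In case it matters, this is how I would check time synchronisation on each node. I am assuming systemd-timesyncd here, which I think is the Proxmox 5.x default; replace it with chronyc/ntpq if you run chrony or ntpd:
Code:
# local clock state and whether NTP synchronisation is active
timedatectl status

# status of the time sync daemon (assumption: systemd-timesyncd)
systemctl status systemd-timesyncd

# ask the monitor quorum how far apart the mon clocks currently are
ceph time-sync-status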