Hi there,
One of my 5 nodes reboots every 5-10 minutes, and it's torture for my Ceph storage! The cluster ran for 3 months without any problem, but now it's falling apart. :-(
Is there a timer (a watchdog?) or something else that could cause this?
I can't find anything in syslog or kern.log; it's a hard reset, not a normal shutdown!
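Here is roughly what I was planning to check next; I'm not sure these are the right places to look (the IPMI one assumes ipmitool and a BMC are available, and the journal one only helps if persistent journaling is enabled):

journalctl -b -1 -e                        # end of the journal from the boot before the reset
ipmitool sel list                          # BMC event log, in case a hardware reset/power event was recorded
dmesg -T | grep -i -e mce -e watchdog      # machine-check or watchdog traces after the node comes back
lsmod | grep -e softdog -e ipmi_watchdog   # is a kernel watchdog module loaded at all?
systemctl status watchdog-mux              # the Proxmox watchdog multiplexer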
I don't use the cluster manager because I don't trust it.
Can somebody point me in the right direction to find the cause?
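To rule out HA fencing, this is how I would confirm that no HA resources are configured on this cluster (just my understanding of the tooling, happy to be corrected):

ha-manager status                  # should report no services/resources if HA is really unused
cat /etc/pve/ha/resources.cfg      # should be empty or missing when nothing is managed by HA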
proxmox-ve: 4.2-51 (running kernel: 4.4.8-1-pve)
pve-manager: 4.2-5 (running version: 4.2-5/7cf09667)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.8-1-pve: 4.4.8-51
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-75
pve-firmware: 1.1-8
libpve-common-perl: 4.0-62
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-17
pve-container: 1.0-64
pve-firewall: 2.0-27
pve-ha-manager: 1.0-31
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie
ceph: 0.94.7-1~bpo80+1
Quorum information
------------------
Date: Mon Aug 22 19:11:56 2016
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000002
Ring ID: 4684
Quorate: Yes
Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 4
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.11.12.1
0x00000002 1 10.11.12.2 (local)
0x00000004 1 10.11.12.4
0x00000005 1 10.11.12.5
Membership information
----------------------
Nodeid Votes Name
1 1 node1pv
2 1 node2pv (local)
4 1 node4pv
5 1 node5pv
2016-08-22 19:14:32.194370 mon.0 [INF] pgmap v7526378: 1664 pgs: 892 active+clean, 9 active+undersized+degraded+remapped+backfilling, 763 active+undersized+degraded+remapped+wait_backfill; 3097 GB data, 7777 GB used, 81603 GB / 89380 GB avail; 163 kB/s wr, 37 op/s; 394883/2743666 objects degraded (14.393%); 1114915/2743666 objects misplaced (40.636%); 559 MB/s, 140 objects/s recovering
:-(
regards