Node reboot loop

proxtest

Active Member
Mar 19, 2014
Hi there,

One of my 5 nodes reboots every 5-10 minutes, which is torture for my Ceph storage! The cluster ran for 3 months without any problem, but now it's a mess. :-(

Is there a timer (watchdog?) or something else that could cause this?
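
For reference, a quick way to check whether the watchdog stack is even armed (a sketch for a stock PVE 4.x install; the service and module names below assume the default watchdog-mux/softdog setup):

# are the HA services and the watchdog multiplexer running?
systemctl status pve-ha-lrm.service pve-ha-crm.service watchdog-mux.service
# is the softdog kernel module loaded?
lsmod | grep softdog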

I can't find anything in syslog or kern.log; it's a hard reset, not a normal shutdown!
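
If it really is a hard reset, the last kernel messages usually never reach the local disk. Two places that can still hold a clue (assuming ipmitool is installed and the box has a BMC; the journalctl call only works if journald storage is set to persistent):

# BMC/IPMI event log often records thermal, power or NMI events around a spontaneous reset
ipmitool sel list
# kernel messages from the previous boot (requires Storage=persistent in journald.conf)
journalctl -k -b -1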

I don't use the cluster manager because I don't trust it. :)

Can somebody point me in the right direction to find the reason?
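
One way to catch a panic before the box resets is netconsole, which streams kernel messages over UDP to another node. A minimal sketch only; the IPs and interface (10.11.12.3 for the affected node, 10.11.12.1 as receiver, eth0) are assumptions based on the membership list below and need adjusting:

# on the rebooting node: send kernel messages via UDP to 10.11.12.1:6666
modprobe netconsole netconsole=6666@10.11.12.3/eth0,6666@10.11.12.1/
# on the receiving node: listen and log everything that arrives
# (netcat-traditional syntax; with openbsd nc use: nc -u -l 6666)
nc -u -l -p 6666 | tee /tmp/netconsole.log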

proxmox-ve: 4.2-51 (running kernel: 4.4.8-1-pve)
pve-manager: 4.2-5 (running version: 4.2-5/7cf09667)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.8-1-pve: 4.4.8-51
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-75
pve-firmware: 1.1-8
libpve-common-perl: 4.0-62
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-17
pve-container: 1.0-64
pve-firewall: 2.0-27
pve-ha-manager: 1.0-31
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie
ceph: 0.94.7-1~bpo80+1

Quorum information
------------------
Date: Mon Aug 22 19:11:56 2016
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000002
Ring ID: 4684
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 4
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.11.12.1
0x00000002 1 10.11.12.2 (local)
0x00000004 1 10.11.12.4
0x00000005 1 10.11.12.5

Membership information
----------------------
Nodeid Votes Name
1 1 node1pv
2 1 node2pv (local)
4 1 node4pv
5 1 node5pv

2016-08-22 19:14:32.194370 mon.0 [INF] pgmap v7526378: 1664 pgs: 892 active+clean, 9 active+undersized+degraded+remapped+backfilling, 763 active+undersized+degraded+remapped+wait_backfill; 3097 GB data, 7777 GB used, 81603 GB / 89380 GB avail; 163 kB/s wr, 37 op/s; 394883/2743666 objects degraded (14.393%); 1114915/2743666 objects misplaced (40.636%); 559 MB/s, 140 objects/s recovering
:-(


regards
 
One of my nodes started doing the same as 'proxtest' after updating the node today. Any word on this issue?