Hello, we updated only one of our clusters to the new version last weekend. Yesterday the cluster rebooted two times, and three more times today. Every time it comes back, one or more OSDs are down, different ones each time; if the affected OSD is destroyed and recreated (roughly the commands shown right after the log below), there seems to be no further problem with that OSD so far. This is what is logged every time the cluster reboots:
Jul 22 18:24:58 int101 pmxcfs[3477]: [status] notice: received log
Jul 22 18:24:58 int101 pmxcfs[3477]: [status] notice: received log
Jul 22 18:24:58 int101 pmxcfs[3477]: [status] notice: received log
Jul 22 18:24:58 int101 pmxcfs[3477]: [status] notice: received log
Jul 22 18:25:00 int101 systemd[1]: Starting Proxmox VE replication runner...
Jul 22 18:25:01 int101 systemd[1]: pvesr.service: Succeeded.
Jul 22 18:25:01 int101 systemd[1]: Finished Proxmox VE replication runner.
Jul 22 18:25:33 int101 pveproxy[1632072]: worker exit
Jul 22 18:25:33 int101 pveproxy[4182]: worker 1632072 finished
Jul 22 18:25:33 int101 pveproxy[4182]: starting 1 worker(s)
Jul 22 18:25:33 int101 pveproxy[4182]: worker 1741226 started
Jul 22 18:26:00 int101 systemd[1]: Starting Proxmox VE replication runner...
Jul 22 18:26:01 int101 systemd[1]: pvesr.service: Succeeded.
Jul 22 18:26:01 int101 systemd[1]: Finished Proxmox VE replication runner.
Jul 22 18:26:51 int101 pmxcfs[3477]: [dcdb] notice: data verification successful
Jul 22 18:27:00 int101 systemd[1]: Starting Proxmox VE replication runner...
Jul 22 18:27:01 int101 systemd[1]: pvesr.service: Succeeded.
Jul 22 18:27:01 int101 systemd[1]: Finished Proxmox VE replication runner.
Jul 22 18:27:08 int101 smartd[2041]: Device: /dev/sdj [SAT], CHECK POWER STATUS spins up disk (0x80 -> 0xff)
Jul 22 18:27:18 int101 corosync[3865]: [KNET ] link: host: 5 link: 0 is down
Jul 22 18:27:18 int101 corosync[3865]: [KNET ] link: host: 4 link: 0 is down
Jul 22 18:27:18 int101 corosync[3865]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Jul 22 18:27:18 int101 corosync[3865]: [KNET ] host: host: 5 has no active links
Jul 22 18:27:18 int101 corosync[3865]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Jul 22 18:27:18 int101 corosync[3865]: [KNET ] host: host: 4 has no active links
Jul 22 18:27:18 int101 kernel: ixgbe 0000:05:00.0 enp5s0f0: NIC Link is Down
Jul 22 18:27:18 int101 kernel: vmbr16: port 1(enp5s0f0.16) entered disabled state
Jul 22 18:27:18 int101 kernel: vmbr17: port 1(enp5s0f0.17) entered disabled state
Jul 22 18:27:18 int101 kernel: vmbr18: port 1(enp5s0f0.18) entered disabled state
Jul 22 18:27:18 int101 kernel: vmbr91: port 1(enp5s0f0.91) entered disabled state
Jul 22 18:27:18 int101 kernel: ixgbe 0000:05:00.1 enp5s0f1: NIC Link is Down
Jul 22 18:27:19 int101 corosync[3865]: [KNET ] link: host: 3 link: 0 is down
Jul 22 18:27:19 int101 corosync[3865]: [KNET ] link: host: 2 link: 0 is down
Jul 22 18:27:19 int101 corosync[3865]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 22 18:27:19 int101 corosync[3865]: [KNET ] host: host: 3 has no active links
Jul 22 18:27:19 int101 corosync[3865]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 22 18:27:19 int101 corosync[3865]: [KNET ] host: host: 2 has no active links
Jul 22 18:27:20 int101 corosync[3865]: [TOTEM ] Token has not been received in 3712 ms
Jul 22 18:27:21 int101 corosync[3865]: [TOTEM ] A processor failed, forming new configuration: token timed out (4950ms), waiting 5940ms for consensus.
Jul 22 18:27:27 int101 corosync[3865]: [QUORUM] Sync members[1]: 1
Jul 22 18:27:27 int101 corosync[3865]: [QUORUM] Sync left[4]: 2 3 4 5
Jul 22 18:27:27 int101 corosync[3865]: [TOTEM ] A new membership (1.1f55) was formed. Members left: 2 3 4 5
Jul 22 18:27:27 int101 corosync[3865]: [TOTEM ] Failed to receive the leave message. failed: 2 3 4 5
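For reference, this is roughly how I destroy and recreate an affected OSD (the standard pveceph/ceph commands; the OSD id and the device are placeholders):

# mark the OSD out and stop its service
ceph osd out <id>
systemctl stop ceph-osd@<id>
# remove it and clean the disk, then create a new OSD on the same device
pveceph osd destroy <id> --cleanup
pveceph osd create /dev/sdX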
This never happened before. Any ideas on how to debug and fix this?
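From the log it looks like both ixgbe ports (enp5s0f0/enp5s0f1) lose link at the same moment and corosync then loses all knet links, so the next time it happens I plan to check the network side first, something like this (interface names taken from the log above):

# corosync / cluster view
pvecm status
corosync-cfgtool -s
# link state and error counters on the ixgbe ports
ethtool enp5s0f0
ip -s link show enp5s0f0
# corosync and pmxcfs messages around the event
journalctl -u corosync -u pve-cluster --since "today"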
Regarding the OSDs, I see these messages after the cluster has restarted, until ceph health is OK again:
Jul 22 18:31:31 int101 ceph-osd[3921]: 2021-07-22T18:31:31.035+0200 7f05c3c40700 -1 --2- 10.10.40.101:0/3921 >> [v2:10.10.40.105:6830/567869,v1:10.10.40.105:6831/567869] conn(0x55b38983b000 0x55b395c39400 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:10.10.40.105:6830/567869,v1:10.10.40.105:6831/567869] is using msgr V1 protocol
But the mon dump shows v2 addresses configured for all monitors:
root@int101:~# ceph mon dump
epoch 8
fsid b70b6772-1c34-407d-a701-462c14fde916
last_changed 2021-07-18T10:24:22.240410+0200
created 2018-03-01T10:14:32.869926+0100
min_mon_release 16 (pacific)
election_strategy: 1
0: [v2:10.10.40.101:3300/0,v1:10.10.40.101:6789/0] mon.int101
1: [v2:10.10.40.102:3300/0,v1:10.10.40.102:6789/0] mon.int102
2: [v2:10.10.40.103:3300/0,v1:10.10.40.103:6789/0] mon.int103
3: [v2:10.10.40.105:3300/0,v1:10.10.40.105:6789/0] mon.int105
dumped monmap epoch 8
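If it is relevant, I can also check the msgr2 settings and the daemon versions, something like:

# confirm all daemons are on the same release
ceph versions
# check whether msgr2 binding is enabled for the OSDs
ceph config get osd ms_bind_msgr2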
Regards