Long term reboots of CEPH cluster

radekcz · Mar 7, 2022

We are solving a long term issue with a 3 node proxmox CEPH cluster. From time to time one of the nodes in the cluster just accidentally reboots.
The physical setup is as seen here:

Lets focus at pve1 configuration which represents the same config at the other nodes.

Here is the pveversion:

Code:

# pveversion --verbose
proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-2-pve: 5.11.22-4
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-1
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1

After some diagnostics and observations the problem with the reboots must be somewhere within the networking setup i suppose. So we have separated corosync network to a dedicated 1GB bond. Therefore our current networking looks like this (the IPs are masked for security reasons):

Code:

auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual
        ovs_type OVSPort
        ovs_bridge vmbr2

auto eno2
iface eno2 inet manual
        ovs_type OVSPort
        ovs_bridge vmbr2

auto eno3
iface eno3 inet manual

auto eno4
iface eno4 inet manual

auto ens6f0
iface ens6f0 inet manual

auto ens6f1
iface ens6f1 inet manual

auto vlan99
iface vlan99 inet static
        address 192.ip.233/24
        gateway 192.ip.1
        ovs_type OVSIntPort
        ovs_bridge vmbr0
        ovs_options tag=99

auto vlan100
iface vlan100 inet static
        address 192.ip.11/24
        ovs_type OVSIntPort
        ovs_bridge vmbr1
        ovs_options tag=100
#cluster network

auto bond0
iface bond0 inet manual
        ovs_bonds eno3 eno4
        ovs_type OVSBond
        ovs_bridge vmbr0
        ovs_options lacp=active bond_mode=balance-tcp
#2x 1G: guests' data trunk + pve web access

auto bond1
iface bond1 inet manual
        ovs_bonds ens6f0 ens6f1
        ovs_type OVSBond
        ovs_bridge vmbr1
        ovs_options bond_mode=balance-tcp lacp=active tag=100
#2x 10G: cluster + storage

auto vmbr0
iface vmbr0 inet manual
        ovs_type OVSBridge
        ovs_ports vlan99 bond0

auto vmbr1
iface vmbr1 inet manual
        ovs_type OVSBridge
        ovs_ports vlan100 bond1

auto vmbr2
iface vmbr2 inet static
        address 10.ip.11/24
        ovs_type OVSBridge
        ovs_ports eno1 eno2
        post-up ovs-vsctl set Bridge vmbr2 rstp_enable=true
#2x 1G: corosync

So actually the 10GB network (ens6f0 and ens6f1) is used for cluster and storage traffic. The 1Gb ports are used for user traffic to the VMs and as I have written we have added a 1GB dedicated network foc corosync only.

The config of corosync looks like this:

Code:

#cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.ip.11
    ring1_addr: 10.ip.11
  }
  node {
    name: pve2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.ip.12
    ring1_addr: 10.ip.12
  }
  node {
    name: pve3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.ip.13
    ring1_addr: 10.ip.13
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cluster01name
  config_version: 12
  interface {
    linknumber: 0
    knet_link_priority: 2
  }
  interface {
    linknumber: 1
    knet_link_priority: 255
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

The pveceph is fine:

Code:

# pveceph status
  cluster:
    id:     0c0d803f-db1a-4a20-ae35-de91cbf243ac
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum pve2a,pve3a,pve1a (age 2m)
    mgr: pve2a(active, since 29h), standbys: pve3a, pve1a
    osd: 21 osds: 21 up (since 4h), 21 in (since 4h)
 
  data:
    pools:   3 pools, 641 pgs
    objects: 2.32M objects, 8.9 TiB
    usage:   26 TiB used, 23 TiB / 50 TiB avail
    pgs:     641 active+clean
 
  io:
    client:   16 MiB/s rd, 2.9 MiB/s wr, 278 op/s rd, 470 op/s wr

There were no significant modifications to the ceph itself:

Code:

# cat /etc/ceph/ceph.conf
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 192.ip.11/24
     fsid = 0c0d803f-db1a-4a20-ae35-de91cbf243ac
     mon_allow_pool_delete = true
     mon_host = 192.ip.12 192.ip.13 192.ip.11
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 192.ip.11/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.pve1a]
     public_addr = 192.ip.11

[mon.pve2a]
     public_addr = 192.ip.12

[mon.pve3a]
     public_addr = 192.ip.13

After some inspection in the logs we have found at the last reboot this info:

Code:

Feb 27 04:01:55 pve3a kernel: [809316.723693] igb 0000:02:00.0 eno1: igb: eno1 NIC Link is Down
Feb 27 04:01:55 pve3a ovs-vswitchd: ovs|290627|rstp_sm|ERR|vmbr2 transmitting bpdu in disabled role on port 8002
Feb 27 04:02:05 pve3a pvestatd[4454]: got timeout
Feb 27 04:02:06 pve3a pvestatd[4454]: status update time (5.573 seconds)
Feb 27 04:02:17 pve3a corosync[4348]:   [KNET  ] rx: host: 1 link: 0 is up
Feb 27 04:02:17 pve3a corosync[4348]:   [KNET  ] rx: host: 1 link: 1 is up
Feb 27 04:02:17 pve3a corosync[4348]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 255)
Feb 27 04:02:17 pve3a corosync[4348]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 255)
Feb 27 04:02:18 pve3a corosync[4348]:   [QUORUM] Sync members[3]: 1 2 3
Feb 27 04:02:18 pve3a corosync[4348]:   [QUORUM] Sync joined[1]: 1
Feb 27 04:02:18 pve3a corosync[4348]:   [TOTEM ] A new membership (1.2132) was formed. Members joined: 1
Feb 27 04:02:18 pve3a corosync[4348]:   [QUORUM] Members[3]: 1 2 3
Feb 27 04:02:18 pve3a corosync[4348]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 27 04:02:19 pve3a kernel: [809340.802573] vmx_set_msr: 152 callbacks suppressed

Any help? Ideas how to fix this long term problem (more then 5 month of random reboots)

aaron · Mar 7, 2022

You did mask the IP addresses used by Ceph a bit too much. Therefore, I can only come up with some educated guesses.

You have HA guests, I assume?

In that case, a stable Corosync connection is important, as it is used to determine if a node is still part of the cluster.
If Ceph or other services that can use up all the available bandwidth, then the latency for the Corosync packets on the same physical network, can go up too high to still be considered usable by Corosync.

The logs you posted look like a node finished rebooted and corosync was able to communicate with it on both links. Check for other corosync messages where a link is reported as down. Depending on how those logs look, you might be able to draw a conclusion of what is going on.

Correct me if I am wrong, but from what I could gather from the network config is that Corosync is sharing the network with storage (Ceph?) bond1/vmbr1?

Then it has vmbr2. Is that only for Corosync or also used by other services/VMs?

In case vmbr2 is only used for Corosync, check what you get in RSTP log messages. If that has issues at the same time as the other vmbr1/bond1 link is having issues, you could run into the situation that both corosync links are considered as down.

I would try to simplify that network though. Removing the vmbr2 and the rstp config and configuring two different subnets on the NICs itself. Then add the resulting 3rd and new subnet as another Corosync link. Giving Corosync another option to fall back to.

radekcz · Mar 7, 2022

aaron said:
You did mask the IP addresses used by Ceph a bit too much. Therefore, I can only come up with some educated guesses.

You have HA guests, I assume?

In that case, a stable Corosync connection is important, as it is used to determine if a node is still part of the cluster.
If Ceph or other services that can use up all the available bandwidth, then the latency for the Corosync packets on the same physical network, can go up too high to still be considered usable by Corosync.

The logs you posted look like a node finished rebooted and corosync was able to communicate with it on both links. Check for other corosync messages where a link is reported as down. Depending on how those logs look, you might be able to draw a conclusion of what is going on.

Correct me if I am wrong, but from what I could gather from the network config is that Corosync is sharing the network with storage (Ceph?) bond1/vmbr1?

Then it has vmbr2. Is that only for Corosync or also used by other services/VMs?

In case vmbr2 is only used for Corosync, check what you get in RSTP log messages. If that has issues at the same time as the other vmbr1/bond1 link is having issues, you could run into the situation that both corosync links are considered as down.

I would try to simplify that network though. Removing the vmbr2 and the rstp config and configuring two different subnets on the NICs itself. Then add the resulting 3rd and new subnet as another Corosync link. Giving Corosync another option to fall back to.

1. HA state
Yes, the HA manager is set up and running

Code:

#ha-manager status
quorum OK
master pve2a (active, Mon Mar  7 17:24:18 2022)
lrm pve1a (active, Mon Mar  7 17:24:14 2022)
lrm pve2a (active, Mon Mar  7 17:24:21 2022)
lrm pve3a (active, Mon Mar  7 17:24:20 2022)
service vm:100 (pve3, stopped)
service vm:102 (pve1, started)
service vm:103 (pve1, started)
service vm:104 (pve2, started)
service vm:105 (pve3, stopped)
service vm:106 (pve1, started)
service vm:107 (pve3, started)

the configuration is done via group.cfg

Code:

group: g5-pve1primary
        nodes pve3:2,pve2:2,pve1:4
        nofailback 0
        restricted 0

group: g6-pve2primary
        nodes pve3:2,pve2:4,pve1:2
        nofailback 0
        restricted 0

group: g7-pve3primary
        nodes pve1:2,pve3:4,pve2:2
        nofailback 0
        restricted 0

2. corosync log

Code:

#journalctl -b -u corosync.service --> log before a node reboot
-- Journal begins at Tue 2022-01-11 19:42:57 CET, ends at Mon 2022-03-07 17:28:47 CET. --
Mar 04 02:44:43 pve3a systemd[1]: Starting Corosync Cluster Engine...
Mar 04 02:44:43 pve3a corosync[3272]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
Mar 04 02:44:43 pve3a corosync[3272]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Mar 04 02:44:43 pve3a corosync[3272]:   [TOTEM ] Initializing transport (Kronosnet).
Mar 04 02:44:43 pve3a corosync[3272]:   [TOTEM ] totemknet initialized
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Mar 04 02:44:43 pve3a corosync[3272]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Mar 04 02:44:43 pve3a corosync[3272]:   [QB    ] server name: cmap
Mar 04 02:44:43 pve3a corosync[3272]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Mar 04 02:44:43 pve3a corosync[3272]:   [QB    ] server name: cfg
Mar 04 02:44:43 pve3a corosync[3272]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Mar 04 02:44:43 pve3a corosync[3272]:   [QB    ] server name: cpg
Mar 04 02:44:43 pve3a corosync[3272]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Mar 04 02:44:43 pve3a corosync[3272]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Mar 04 02:44:43 pve3a corosync[3272]:   [WD    ] Watchdog not enabled by configuration
Mar 04 02:44:43 pve3a corosync[3272]:   [WD    ] resource load_15min missing a recovery key.
Mar 04 02:44:43 pve3a corosync[3272]:   [WD    ] resource memory_used missing a recovery key.
Mar 04 02:44:43 pve3a corosync[3272]:   [WD    ] no resources configured.
Mar 04 02:44:43 pve3a corosync[3272]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Mar 04 02:44:43 pve3a corosync[3272]:   [QUORUM] Using quorum provider corosync_votequorum
Mar 04 02:44:43 pve3a corosync[3272]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Mar 04 02:44:43 pve3a corosync[3272]:   [QB    ] server name: votequorum
Mar 04 02:44:43 pve3a corosync[3272]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Mar 04 02:44:43 pve3a corosync[3272]:   [QB    ] server name: quorum
Mar 04 02:44:43 pve3a corosync[3272]:   [TOTEM ] Configuring link 0
Mar 04 02:44:43 pve3a corosync[3272]:   [TOTEM ] Configured link number 0: local addr: 192.ip.13, port=5405
Mar 04 02:44:43 pve3a corosync[3272]:   [TOTEM ] Configuring link 1
Mar 04 02:44:43 pve3a corosync[3272]:   [TOTEM ] Configured link number 1: local addr: 10.ip.13, port=5406
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 1 has no active links
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 2)
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 1 has no active links
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 2)
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 1 has no active links
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 2)
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 3 has no active links
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 2)
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 3 has no active links
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 2)
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 3 has no active links
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 2 has no active links
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 2)
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 1 has no active links
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 2)
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 1 has no active links
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 2)
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 1 has no active links
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 2)
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 3 has no active links
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 2)
Mar 04 02:44:43 pve3a corosync[3272]:   [QUORUM] Sync members[1]: 2
Mar 04 02:44:43 pve3a corosync[3272]:   [QUORUM] Sync joined[1]: 2
Mar 04 02:44:43 pve3a corosync[3272]:   [TOTEM ] A new membership (2.2137) was formed. Members joined: 2
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 3 has no active links
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 2)
Mar 04 02:44:43 pve3a corosync[3272]:   [KNET  ] host: host: 3 has no active links
Mar 04 02:44:43 pve3a corosync[3272]:   [QUORUM] Members[1]: 2
Mar 04 02:44:43 pve3a corosync[3272]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 04 02:44:43 pve3a systemd[1]: Started Corosync Cluster Engine.
Mar 04 02:44:45 pve3a corosync[3272]:   [KNET  ] rx: host: 3 link: 0 is up

3. corosync shared
half true. There is a bridge at 2x 1GB ports. Each port goes to a different switch (both switches are in a stack) --> that is the primary ring that is in the vmbr2. The secondary ring on the other side does share its net with the 10GB networking. the vmbr2 is only dedicated for corosync, and nothing more.

4. IPs
fixed. Now it should be more intuitive. Substitute instead of IP something like 10.11 and it will fit.

radekcz · Mar 7, 2022

If it may be of some use, we have observerd that when we do an apt-update/upgrade on whatever node, the node reboots suddenly itself also. So it works as follows:
apt-update
apt-upgrade
reboot
apt-get update --fix-missing
apt upgrade
upgrade takes horribly long to be done (2h)

aaron · Mar 8, 2022

First of, if you use apt update and apt full-upgrade, or the pveupgrade tool, you don't run into the situation that you reboot into a potentially only half updated system with new but missing dependencies!

That also might reduce the upgrade time.

The corosync logs you posted look like the node just rebooted and started to connect to all the other services?

Do you have some network monitoring set up? How is the connection and latency on the 2 networks used by Corosync? Are some other things logged around the same time in the syslogs (/var/log/syslog)? How is the situation seen on the other nodes?

radekcz · Mar 8, 2022

aaron said:
First of, if you use apt update and apt full-upgrade, or the pveupgrade tool, you don't run into the situation that you reboot into a potentially only half updated system with new but missing dependencies!

That also might reduce the upgrade time.

The corosync logs you posted look like the node just rebooted and started to connect to all the other services?

Do you have some network monitoring set up? How is the connection and latency on the 2 networks used by Corosync? Are some other things logged around the same time in the syslogs (/var/log/syslog)? How is the situation seen on the other nodes?

sure, here is the graph. We can see around 04:00 AM a peak when was the reboot:

in the syslog was nothing more then info about a reboot

Code:

Mar  7 04:06:40 pve1a systemd[1]: Starting Daily PVE download activities...
Mar  7 04:06:40 pve1a kernel: [691500.915949] kvm [7128]: vcpu4, guest rIP: 0xfffff801c0d61ea3 vmx_set_msr: BTF|LBR in IA32_DEBUGCTLMSR 0x1, nop
Mar  7 04:18:34 pve1a systemd-modules-load[1477]: Inserted module 'iscsi_tcp'
Mar  7 04:18:34 pve1a systemd-modules-load[1477]: Inserted module 'ib_iser'
Mar  7 04:18:34 pve1a kernel: [    0.000000] Linux version 5.13.19-3-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.13.19-7 (Thu, 20 Jan 2022 16:37:56 +0100) ()
Mar  7 04:18:34 pve1a systemd-modules-load[1477]: Inserted module 'vhost_net'

the 10G network (second ring) does look to have any problems at all

radekcz · Mar 14, 2022

Sadly, the `pveupgrade` command did not help. the node rebootet itslef again, but there was a corelation with high iops of system discs (durning the upgrade theis utilization in iostat was still at 100%. Can that be a reason why the node reboots itself?

Search

Search

Long term reboots of CEPH cluster

radekcz

New Member

aaron

Proxmox Staff Member

radekcz

New Member

radekcz

New Member

aaron

Proxmox Staff Member

radekcz

New Member

radekcz

New Member