We are trying to solve a long-term issue with a 3-node Proxmox Ceph cluster: from time to time one of the nodes in the cluster just reboots unexpectedly.
The physical setup is as shown here:
Let's focus on the pve1 configuration, which is the same as on the other nodes.
Here is the pveversion output:
Code:
# pveversion --verbose
proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-2-pve: 5.11.22-4
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-1
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1
After some diagnostics and observations, I suppose the problem with the reboots must be somewhere in the networking setup. So we have separated the corosync network onto a dedicated 1 Gbit bond. Our current networking therefore looks like this (the IPs are masked for security reasons):
Code:
auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual
        ovs_type OVSPort
        ovs_bridge vmbr2

auto eno2
iface eno2 inet manual
        ovs_type OVSPort
        ovs_bridge vmbr2

auto eno3
iface eno3 inet manual

auto eno4
iface eno4 inet manual

auto ens6f0
iface ens6f0 inet manual

auto ens6f1
iface ens6f1 inet manual

auto vlan99
iface vlan99 inet static
        address 192.ip.233/24
        gateway 192.ip.1
        ovs_type OVSIntPort
        ovs_bridge vmbr0
        ovs_options tag=99

auto vlan100
iface vlan100 inet static
        address 192.ip.11/24
        ovs_type OVSIntPort
        ovs_bridge vmbr1
        ovs_options tag=100
#cluster network

auto bond0
iface bond0 inet manual
        ovs_bonds eno3 eno4
        ovs_type OVSBond
        ovs_bridge vmbr0
        ovs_options lacp=active bond_mode=balance-tcp
#2x 1G: guests' data trunk + pve web access

auto bond1
iface bond1 inet manual
        ovs_bonds ens6f0 ens6f1
        ovs_type OVSBond
        ovs_bridge vmbr1
        ovs_options bond_mode=balance-tcp lacp=active tag=100
#2x 10G: cluster + storage

auto vmbr0
iface vmbr0 inet manual
        ovs_type OVSBridge
        ovs_ports vlan99 bond0

auto vmbr1
iface vmbr1 inet manual
        ovs_type OVSBridge
        ovs_ports vlan100 bond1

auto vmbr2
iface vmbr2 inet static
        address 10.ip.11/24
        ovs_type OVSBridge
        ovs_ports eno1 eno2
        post-up ovs-vsctl set Bridge vmbr2 rstp_enable=true
#2x 1G: corosync
So the 10 Gbit network (ens6f0 and ens6f1) is actually used for cluster and storage traffic. The 1 Gbit ports are used for user traffic to the VMs, and as I have written, we have added a dedicated 1 Gbit network for corosync only.
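In case it is relevant: this is roughly how we check that the bonds and the corosync bridge are healthy on the OVS side. It is only a sketch using the standard openvswitch-switch tools (exact subcommand availability and output can vary between OVS versions); the bond and bridge names are the ones from the config above:
Code:
# LACP / slave status of the two bonds (1G guest/web bond and 10G cluster/storage bond)
ovs-appctl bond/show bond0
ovs-appctl lacp/show bond1

# ports attached to the corosync bridge (eno1/eno2) and its RSTP state
ovs-vsctl list-ports vmbr2
ovs-appctl rstp/show vmbr2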
The config of corosync looks like this:
Code:
# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.ip.11
    ring1_addr: 10.ip.11
  }
  node {
    name: pve2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.ip.12
    ring1_addr: 10.ip.12
  }
  node {
    name: pve3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.ip.13
    ring1_addr: 10.ip.13
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cluster01name
  config_version: 12
  interface {
    linknumber: 0
    knet_link_priority: 2
  }
  interface {
    linknumber: 1
    knet_link_priority: 255
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
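For reference, this is roughly how we verify on each node that both knet links (link 0 = 10 Gbit, link 1 = dedicated 1 Gbit) are seen as connected; just the standard corosync and PVE tools, so treat it as a sketch rather than a full procedure:
Code:
# local knet link status for both rings
corosync-cfgtool -s

# quorum membership as corosync sees it
corosync-quorumtool -s

# the same information through the PVE layer
pvecm status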
The pveceph status is fine:
Code:
# pveceph status
  cluster:
    id:     0c0d803f-db1a-4a20-ae35-de91cbf243ac
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pve2a,pve3a,pve1a (age 2m)
    mgr: pve2a(active, since 29h), standbys: pve3a, pve1a
    osd: 21 osds: 21 up (since 4h), 21 in (since 4h)

  data:
    pools:   3 pools, 641 pgs
    objects: 2.32M objects, 8.9 TiB
    usage:   26 TiB used, 23 TiB / 50 TiB avail
    pgs:     641 active+clean

  io:
    client: 16 MiB/s rd, 2.9 MiB/s wr, 278 op/s rd, 470 op/s wr
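Besides the snapshot above, we also check whether any Ceph daemon crashed around the reboots. This is just the built-in Ceph crash module and health output, so it is more of a sanity check than a diagnosis:
Code:
# any crash reports collected by the Ceph crash module
ceph crash ls

# more detail than the plain HEALTH_OK line above
ceph health detail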
There were no significant modifications to Ceph itself:
Code:
# cat /etc/ceph/ceph.conf
[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 192.ip.11/24
        fsid = 0c0d803f-db1a-4a20-ae35-de91cbf243ac
        mon_allow_pool_delete = true
        mon_host = 192.ip.12 192.ip.13 192.ip.11
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 192.ip.11/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.pve1a]
        public_addr = 192.ip.11

[mon.pve2a]
        public_addr = 192.ip.12

[mon.pve3a]
        public_addr = 192.ip.13
After inspecting the logs, we found this info from around the last reboot:
Code:
Feb 27 04:01:55 pve3a kernel: [809316.723693] igb 0000:02:00.0 eno1: igb: eno1 NIC Link is Down
Feb 27 04:01:55 pve3a ovs-vswitchd: ovs|290627|rstp_sm|ERR|vmbr2 transmitting bpdu in disabled role on port 8002
Feb 27 04:02:05 pve3a pvestatd[4454]: got timeout
Feb 27 04:02:06 pve3a pvestatd[4454]: status update time (5.573 seconds)
Feb 27 04:02:17 pve3a corosync[4348]: [KNET ] rx: host: 1 link: 0 is up
Feb 27 04:02:17 pve3a corosync[4348]: [KNET ] rx: host: 1 link: 1 is up
Feb 27 04:02:17 pve3a corosync[4348]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 255)
Feb 27 04:02:17 pve3a corosync[4348]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 255)
Feb 27 04:02:18 pve3a corosync[4348]: [QUORUM] Sync members[3]: 1 2 3
Feb 27 04:02:18 pve3a corosync[4348]: [QUORUM] Sync joined[1]: 1
Feb 27 04:02:18 pve3a corosync[4348]: [TOTEM ] A new membership (1.2132) was formed. Members joined: 1
Feb 27 04:02:18 pve3a corosync[4348]: [QUORUM] Members[3]: 1 2 3
Feb 27 04:02:18 pve3a corosync[4348]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 27 04:02:19 pve3a kernel: [809340.802573] vmx_set_msr: 152 callbacks suppressed
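For completeness, this is a sketch of what we look at right after an unexpected reboot. The unit names assume a stock PVE 7.1 install, and reading the previous boot requires a persistent journal, so adjust as needed:
Code:
# logs from the previous boot, limited to the cluster / HA / watchdog units
journalctl -b -1 -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux

# current HA state, to see whether HA is active on this node at all
ha-manager status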
Any help? Any ideas on how to fix this long-term problem (more than 5 months of random reboots)?