[SOLVED] Node reboot while disk operation in Ceph

daubner

Hello!

I have another issue I'd appreciate community feedback on. Last Friday, one of our nodes inexplicably crashed during the migration of a VM disk from Ceph storage to a local one. I'm attaching the Ceph log on Pastebin as it's too long: https://pastebin.com/wkxHP7rt


Code:
proxmox-ve: 8.3.0 (running kernel: 6.8.12-9-pve)
pve-manager: 8.3.5 (running version: 8.3.5/dac3aa88bac3f300)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8: 6.8.12-9
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph: 19.2.0-pve2
ceph-fuse: 19.2.0-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
dnsmasq: 2.90-4~deb12u1
frr-pythontools: 8.5.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.2.0
libpve-network-perl: 0.10.1
libpve-rs-perl: 0.9.2
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.4-1
proxmox-backup-file-restore: 3.3.4-1
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.7
pve-cluster: 8.0.10
pve-container: 5.2.4
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-3
pve-ha-manager: 4.0.6
pve-i18n: 3.4.1
pve-qemu-kvm: 9.2.0-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.8
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2


Other nodes in the cluster have been rebooted before, and afterwards the Ceph monitor on the affected node didn't want to come up and needed to be reinstalled. I can provide more logs if that would help. Thank you for your help and have a nice rest of the day.
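
For anyone who runs into the same thing: recreating a monitor on PVE is typically just destroying and re-creating it with pveceph; a rough sketch only (assuming the standard pveceph tooling, exact steps may differ depending on how broken the mon is):

Code:
# on the affected node (nextclouda): drop the broken monitor, then re-create it
pveceph mon destroy nextclouda
pveceph mon create
# verify the new monitor rejoined quorum
ceph quorum_status --format json-pretty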
 
It sounds like you have an issue with your Corosync links.

Please provide the following information:
  • Contents of /etc/network/interfaces
  • The output from corosync-cfgtool -s
  • The output from ha-manager status
  • The contents of /etc/ceph/ceph.conf
 
Code:
root@nextclouda:~# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto eno1np0
iface eno1np0 inet manual

iface ens1f0np0 inet manual
#WAN

auto ens1f1np1
iface ens1f1np1 inet static
        address 10.0.0.1/24
        mtu 9000
#ceph-cluster

auto eno2np1
iface eno2np1 inet static
        address 10.0.1.1/24
        mtu 9000
#ceph-public

auto eno3np2
iface eno3np2 inet manual

iface eno4np3 inet manual

iface ens2f0np0 inet manual

iface ens2f1np1 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves eno1np0 eno3np2
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
#MNG

auto vmbr0
iface vmbr0 inet static
        address 10.70.73.1/28
        gateway 10.70.73.14
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
#MNG

auto vmbr1
iface vmbr1 inet manual
        bridge-ports ens1f0np0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 1001-1004
#WAN

auto zakA
iface zakA inet manual
        bridge-ports vmbr1.1001
        bridge-stp off
        bridge-fd 0
#vSwitch Customer A

auto zakB
iface zakB inet manual
        bridge-ports vmbr1.1002
        bridge-stp off
        bridge-fd 0
#vSwitch Customer B

auto zakC
iface zakC inet manual
        bridge-ports vmbr1.1003
        bridge-stp off
        bridge-fd 0
#vSwitch Customer C

auto Internal
iface Internal inet manual
        bridge-ports none
        bridge-stp off
        bridge-fd 0
#Internal communication

source /etc/network/interfaces.d/*

Code:
root@nextclouda:~# corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0 udp
        addr    = 10.70.73.1
        status:
                nodeid:          1:     localhost
                nodeid:          2:     connected
                nodeid:          3:     connected

Code:
root@nextclouda:~# ha-manager status
quorum OK
master nextcloudb (active, Tue Apr  8 08:50:14 2025)
lrm nextclouda (active, Tue Apr  8 08:50:10 2025)
lrm nextcloudb (idle, Tue Apr  8 08:50:13 2025)
lrm nextcloudc (idle, Tue Apr  8 08:50:13 2025)
service vm:100 (nextclouda, started)

Code:
root@nextclouda:~# cat /etc/ceph/ceph.conf
[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 10.0.0.1/24
        fsid = cf282c03-77a3-458d-8989-b4a477f121dd
        mon_allow_pool_delete = true
        mon_host = 10.0.1.1 10.0.1.2 10.0.1.3
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 10.0.1.1/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
        keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
        keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.nextclouda]
        host = nextclouda
        mds_standby_for_name = pve

[mds.nextcloudb]
        host = nextcloudb
        mds_standby_for_name = pve

[mds.nextcloudc]
        host = nextcloudc
        mds_standby_for_name = pve

[mon.nextclouda]
        public_addr = 10.0.1.1

[mon.nextcloudb]
        public_addr = 10.0.1.2

[mon.nextcloudc]
        public_addr = 10.0.1.3

It may have something to do with our Xen server. Until last Friday we had it on the same network, and it caused weird issues like a broadcast storm during a reboot (Brief broadcast storm when booting up a server).
Since I readdressed the whole cluster to its own network subnet, the broadcast storm hasn't occurred.
 
@daubner Thank you for sharing this information.

Since I readdressed the whole cluster to its own network subnet, the broadcast storm hasn't occurred.

Your information makes it reasonable to conclude that a prolonged broadcast storm on the host network caused the reboot. This is based on the assumption that the cluster was set up much like you have it right now. I'll explain.

Code:
root@nextclouda:~# corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0 udp
        addr    = 10.70.73.1

This output tells me you are running corosync on your host network. Corosync is sensitive to latency.
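
If it helps, corosync keeps per-link latency statistics you can read directly; a quick check (assuming corosync 3.x with knet, which PVE 8 ships):

Code:
# measured knet link latency per node and link
corosync-cmapctl -m stats | grep -E 'latency_(ave|max)'

Consistently high or spiky values on the link corosync is using are a warning sign.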

Code:
root@nextclouda:~# ha-manager status
quorum OK
master nextcloudb (active, Tue Apr  8 08:50:14 2025)
lrm nextclouda (active, Tue Apr  8 08:50:10 2025)
lrm nextcloudb (idle, Tue Apr  8 08:50:13 2025)
lrm nextcloudc (idle, Tue Apr  8 08:50:13 2025)
service vm:100 (nextclouda, started)

The ha-manager status output lets me know you have PVE HA enabled for at least one guest (vm:100).

If HA is enabled (it is) and the LRM (Local Resource Manager) is active (it is), then a PVE node that loses corosync quorum for more than 60 seconds will fence itself via the watchdog and reboot. Your corosync is running on your host network, and that is where the broadcast storm happened. This would have disrupted corosync and triggered the reboot of your PVE node.
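
If you want to confirm that chain of events after the fact, the journal from the previous boot is usually enough; the units below are the standard PVE HA and watchdog services:

Code:
# HA manager and watchdog services involved in self-fencing
systemctl status pve-ha-lrm pve-ha-crm watchdog-mux
# logs from the boot that ended with the unexpected reboot
journalctl -b -1 -u corosync -u pve-ha-lrm

A node that fenced itself typically logs the LRM losing quorum shortly before the log simply stops.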

In your original post, you indicated:

One of our nodes crashed during last Friday inexplicably during migration of a vm disk from ceph storage to a local one

The interfaces and ceph.conf let me know you have isolated the Ceph and guest networks (both good). Having Ceph isolated tells me it is unlikely that the disk migration caused the reboot.

However, if Ceph and corosync shared a network in your previous configuration, that would have likely caused or contributed to the disruption of corosync that led to the reboot.
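
If you ever want to double-check which networks the Ceph daemons are actually using, looking at their sockets is a quick way to verify it (nothing PVE-specific assumed here):

Code:
# established connections of the Ceph daemons, with local/peer addresses
ss -tnp | grep -E 'ceph-(mon|osd|mds)'

With your current config, the local addresses should fall into 10.0.0.0/24 and 10.0.1.0/24 only.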

Prevention

There is a chance you will have a node reboot in the future, even if the source of the broadcast storm is removed. Your host is sharing its network with corosync. If you cause some congestion on the host network, you may trigger a reboot, as it would disrupt the corosync communication.

From your interfaces file, it appears you have unused NICs. I recommend you use one of those NICs and a separate switch to create a dedicated corosync network. It only needs to be 1 Gbps, as corosync does not need a lot of bandwidth; it just needs consistently low latency. You can keep your host network as a backup corosync link in case the dedicated switch dies, but do not use it as the primary corosync link.
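
As a rough sketch of what that could look like (the interface name eno4np3 and the 10.0.2.0/24 subnet are example choices, not something from your current config):

Code:
# /etc/network/interfaces - one of the spare NICs, cabled to a separate 1G switch
auto eno4np3
iface eno4np3 inet static
        address 10.0.2.1/24
#corosync (dedicated)

Code:
# /etc/pve/corosync.conf - edit a copy, bump config_version, then move it in place
nodelist {
  node {
    name: nextclouda
    nodeid: 1
    quorum_votes: 1
    # link 0: dedicated corosync network, link 1: host network as fallback
    ring0_addr: 10.0.2.1
    ring1_addr: 10.70.73.1
  }
  # nextcloudb and nextcloudc get analogous ring0_addr/ring1_addr entries
}

totem {
  # keep your existing totem settings; add per-link priorities so the
  # dedicated link is preferred (higher value wins in knet passive mode)
  interface {
    linknumber: 0
    knet_link_priority: 20
  }
  interface {
    linknumber: 1
    knet_link_priority: 10
  }
}

The documentation linked below describes the full procedure for adding and reprioritizing links.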

Here is more information on the corosync network (a.k.a. cluster network): Proxmox VE Cluster Network.
 
Thank you very much for your insight.
We'll be adding a dedicated network for corosync as you described when we deploy to production. We originally had it configured, but reinstalled the nodes without it while we were analyzing the broadcast issue (we were ruling out layer 2 problems, since we had network bonds on both the management and corosync networks).

I'll mark this thread as solved. Thank you once again and have a nice rest of the day!
 