Proxmox Cluster Random Reboot

Brendan1

New Member
Dec 18, 2025
Hello community,
I am seeking technical assistance regarding an incident that occurred on our Proxmox infrastructure on February 17, 2026.

At 14:25:56, our node pve2 experienced an unexplained reboot. Initial analysis confirms that the High Availability stack fenced node pve2 after it lost communication with the rest of the cluster. Although service has since been restored, we want to identify the root cause of this isolation. Note that all three nodes in this cluster run Proxmox version 8.4 and that Corosync traffic is routed through a dedicated, physically separate network.

Furthermore, we observed a similar issue on another infrastructure located in a different rack (not on the same day), even though that cluster runs different Proxmox versions.

I have attached the logs recorded during the minutes surrounding the reboot of pve2, as well as the logs from pve1 to provide a complete overview of the cluster state during the event.
I would appreciate any insight if you have encountered similar cases or if you can identify a specific pattern in the provided logs.
 

Attachments

Hi,

you have an issue with two OSDs:
Feb 17 14:27:37 pve1 ceph-osd[2774]: 2026-02-17T14:27:37.688+0100 7f32fcfef6c0 -1 osd.2 3302 heartbeat_check: no reply from 10.13.31.3:6812 osd.1 since back 2026-02-17T14:27:09.786249+0100 front 2026-02-17T14:27:09.786304+0100 (oldest deadline 2026-02-17T14:27:35.086110+0100)
Feb 17 14:27:37 pve1 ceph-osd[2774]: 2026-02-17T14:27:37.688+0100 7f32fcfef6c0 -1 osd.2 3302 heartbeat_check: no reply from 10.13.31.3:6804 osd.4 since back 2026-02-17T14:27:09.786438+0100 front 2026-02-17T14:27:09.786401+0100 (oldest deadline 2026-02-17T14:27:35.086110+0100)

It seems Ceph OSDs 1 and 4 on 10.13.31.3 stopped responding, causing the cluster to mark them down. This made the storage for some VMs temporarily unavailable. Proxmox HA detected the storage issue and moved VMs 101 and 114 into the fence state to safely stop and restart them elsewhere, preventing a potential split-brain.
Feb 17 14:28:20 pve1 pve-ha-crm[2801]: service 'vm:101': state changed from 'started' to 'fence'
Feb 17 14:28:20 pve1 pve-ha-crm[2801]: service 'vm:114': state changed from 'started' to 'fence'
Check the network and OSDs.
 
Hi,

The OSDs and networks don't seem to have any issues. What can we do to prevent this from happening? Do you think it's because of LACP?
 
Hey,

It seems Ceph OSDs 1 and 4 on 10.13.31.3 stopped responding, causing the cluster to mark them down. This made the storage for some VMs temporarily unavailable. Proxmox HA detected the storage issue and moved VMs 101 and 114 into the fence state to safely stop and restart them elsewhere, preventing a potential split-brain.
Correct, they stopped responding, but not because of any storage issue. Proxmox has no such failover mechanism.
Proxmox HA does not handle failover due to hardware failure or missing storage; it only fences a node if that node fully loses its connection to the other nodes. If the node can still reach the other nodes but its storage fails, HA won't do anything.

They simply stopped responding, as pve02 restarted.

If my suspicion is correct, you have defined only one corosync network, the one on your dedicated link? (Please share /etc/pve/corosync.conf.)
This link failed on pve02 (which I sadly can't prove, as the logs don't say anything).
=> Is anything logged on the switch, maybe? A frozen switch port? It could be anything.


What I can prove is that pve02 lost its link on ring0, according to pve01.

Code:
Feb 17 14:27:14 pve1 corosync[2419]:   [KNET  ] link: host: 3 link: 0 is down
Feb 17 14:27:14 pve1 corosync[2419]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 17 14:27:14 pve1 corosync[2419]:   [KNET  ] host: host: 3 has no active links
Feb 17 14:27:15 pve1 corosync[2419]:   [TOTEM ] Token has not been received in 2737 ms
Feb 17 14:27:16 pve1 corosync[2419]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Feb 17 14:27:20 pve1 corosync[2419]:   [QUORUM] Sync members[2]: 1 2
Feb 17 14:27:20 pve1 corosync[2419]:   [QUORUM] Sync left[1]: 3
Feb 17 14:27:20 pve1 corosync[2419]:   [TOTEM ] A new membership (1.230) was formed. Members left: 3
Feb 17 14:27:20 pve1 corosync[2419]:   [TOTEM ] Failed to receive the leave message. failed: 3
Feb 17 14:27:20 pve1 pmxcfs[2313]: [dcdb] notice: members: 1/2313, 2/3079
Feb 17 14:27:20 pve1 pmxcfs[2313]: [dcdb] notice: starting data syncronisation
Feb 17 14:27:20 pve1 corosync[2419]:   [QUORUM] Members[2]: 1 2
Feb 17 14:27:20 pve1 corosync[2419]:   [MAIN  ] Completed service synchronization, ready to provide service.

Nothing about that appears on pve02.
Shortly after, OSDs 1 and 4 failed; those should be located on pve02 => due to the restart.
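As an aside, the timeout values in the pve1 TOTEM log lines above can be reproduced from corosync's documented formula. A minimal sketch, assuming the corosync 3 defaults (token: 3000 ms, token_coefficient: 650 ms), since no totem timeout overrides appear to be configured:

```python
# Sketch only: derives the timeouts seen in the pve1 TOTEM log lines,
# assuming corosync 3 defaults (token: 3000 ms, token_coefficient: 650 ms).

def runtime_token_timeout(nodes: int, token: int = 3000, coeff: int = 650) -> int:
    # man corosync.conf: effective token timeout =
    #   token + (number_of_nodes - 2) * token_coefficient
    return token + (nodes - 2) * coeff

def consensus_timeout(token_timeout: int) -> int:
    # default consensus timeout is 1.2 x the effective token timeout
    return round(1.2 * token_timeout)

tok = runtime_token_timeout(3)      # three-node cluster
print(tok, consensus_timeout(tok))  # 3650 4380, matching "token timed out
                                    # (3650ms), waiting 4380ms for consensus"
```

So a link loss costs roughly 3.65 s before a processor is declared failed, plus another 4.38 s waiting for consensus, which lines up with the 14:27:14 to 14:27:20 window in the log.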

I would recommend defining a second or even a third ring. The more corosync networks, the better (up to 8).
=> https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_redundancy
You can even use the Ceph network here; even though it runs over a bond, it is only there for redundancy if the dedicated link fails.
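For illustration, adding a second ring means giving every node a ring1_addr and a second totem interface. This is only a sketch, not your actual config: the ring1 addresses reuse the 10.13.31.x Ceph network visible in the logs above (pve3's 10.13.31.4 is a guess), and the ring0 addresses are whatever your dedicated link already uses. Remember to increment config_version and follow the documented edit procedure from the wiki link above.

```
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.13.30.2
    ring1_addr: 10.13.31.2
  }
  node {
    name: pve2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.13.30.3
    ring1_addr: 10.13.31.3
  }
  node {
    name: pve3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.13.30.4
    ring1_addr: 10.13.31.4
  }
}

totem {
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}
```

With link_mode: passive, corosync will keep using ring0 and only fall back to ring1 when ring0 goes down, which is exactly the failure seen here.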
 
Contents of network/interfaces and corosync.conf

/etc/pve/corosync.conf

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.13.30.2
  }
  node {
    name: pve2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.13.30.3
  }
  node {
    name: pve3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.13.30.4
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster-hbs
  config_version: 3
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}


network/interfaces

Code:
auto lo
iface lo inet loopback

iface idrac inet manual

iface eno3 inet manual

iface eno4 inet manual

auto eno1
iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto ens1f0
iface ens1f0 inet manual

auto ens1f1
iface ens1f1 inet manual

auto bond1
iface bond1 inet manual
        bond-slaves eno2 ens1f1
        bond-miimon 100
        bond-mode 802.3ad

auto bond0
iface bond0 inet manual
        bond-slaves eno1 ens1f0
        bond-miimon 100
        bond-mode 802.3ad

auto vmbr0
iface vmbr0 inet static
        address 10.13.30.2/24
        gateway 10.13.30.254
        bridge-ports bond0.1330
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

auto vmbr1
iface vmbr1 inet static
        address 10.13.31.2/24
        bridge-ports bond1.1331
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

auto vmbr2
iface vmbr2 inet static
        address 10.13.32.50/24
        bridge-ports bond0.1332
        bridge-stp off
        bridge-fd 0

auto vmbr3
iface vmbr3 inet static
        address 10.13.33.50/24
        bridge-ports bond0.1333
        bridge-stp off
        bridge-fd 0

auto vmbr4
iface vmbr4 inet static
        address 10.13.34.50/24
        bridge-ports bond0.1334
        bridge-stp off
        bridge-fd 0

auto vmbr5
iface vmbr5 inet static
        address 10.13.35.50/24
        bridge-ports bond0.1335
        bridge-stp off
        bridge-fd 0

auto vmbr6
iface vmbr6 inet static
        address 10.13.36.50/24
        bridge-ports bond0.1336
        bridge-stp off
        bridge-fd 0

auto vmbr7
iface vmbr7 inet static
        address 10.13.37.50/24
        bridge-ports bond0.1337
        bridge-stp off
        bridge-fd 0
 
So here's what I'd suggest:

Don't use 10.13.30.x for corosync at all.

Assign arbitrary addresses to bond0 and bond1, ON DIFFERENT SUBNETS. Ideally, they should be on separate VLANs too. Use those addresses as ring0 and ring1.
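A hedged sketch of what that could look like in /etc/network/interfaces. The 10.13.40.x and 10.13.41.x subnets are made-up examples, not values from this thread; the point is simply that each bond carries its own address on its own subnet, while the existing VLAN subinterfaces (bond0.1330, bond1.1331, ...) and bridges stay untouched:

```
# Hypothetical example: static addresses on the bonds themselves,
# on two different subnets, to be used as corosync ring0 / ring1.
# The vmbrX bridges on bond0.13xx / bond1.13xx keep working unchanged.
auto bond0
iface bond0 inet static
        address 10.13.40.2/24
        bond-slaves eno1 ens1f0
        bond-miimon 100
        bond-mode 802.3ad

auto bond1
iface bond1 inet static
        address 10.13.41.2/24
        bond-slaves eno2 ens1f1
        bond-miimon 100
        bond-mode 802.3ad
```

Then the 10.13.40.x addresses would go into ring0_addr and the 10.13.41.x ones into ring1_addr (or as additional rings alongside the dedicated link), so corosync survives the loss of any single path.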
 