Sudden reboots on Proxmox

samumsq

New Member
Aug 27, 2025
Hi everyone,

We have a three-node PVE 8.2.4 cluster running on HPE ProLiant DL385 Gen10 Plus servers. We have configured Ceph and HA on separate networks, each using 20 Gb bonds.
A few weeks ago one of our nodes suddenly rebooted (the syslog only shows the reboot).

Log from the faulty node

Aug 12 04:08:21 pve02-poz ceph-mgr[2694]: ::ffff:192.168.112.7 - - [12/Aug/2025:04:08:21] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.45.0"
Aug 12 04:08:26 pve02-poz ceph-mgr[2694]: ::ffff:192.168.112.7 - - [12/Aug/2025:04:08:26] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.45.0"
Aug 12 04:08:31 pve02-poz ceph-mgr[2694]: ::ffff:192.168.112.7 - - [12/Aug/2025:04:08:31] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.45.0"
-- Reboot --
Aug 12 04:12:46 pve02-poz kernel: Linux version 6.8.12-1-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) ()
Aug 12 04:12:46 pve02-poz kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-1-pve root=/dev/mapper/pve-root ro quiet
Aug 12 04:12:46 pve02-poz kernel: KERNEL supported cpus:
Aug 12 04:12:46 pve02-poz kernel: Intel GenuineIntel

After the reboot, all Ceph OSDs on that node stopped processing connections and Ceph stopped working because of slow ops.
We had to manually reboot the node again to get it working.

We checked the other nodes to see whether a network failure caused a loss of quorum and Proxmox forced a reboot, but they only seem to lose the connection after the reboot happens.

Logs from the working nodes

Aug 12 04:08:37 pve01-poz ceph-mgr[2959]: ::ffff:192.168.112.7 - - [12/Aug/2025:04:08:37] "GET /metrics HTTP/1.1" 200 44690 "" "Prometheus/2.45.0"
Aug 12 04:08:38 pve01-poz corosync[3024]: [KNET ] link: host: 2 link: 1 is down
Aug 12 04:08:38 pve01-poz corosync[3024]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
Aug 12 04:08:38 pve01-poz corosync[3024]: [KNET ] host: host: 2 has no active links
Aug 12 04:08:39 pve01-poz corosync[3024]: [TOTEM ] Token has not been received in 2250 ms
Aug 12 04:08:40 pve01-poz corosync[3024]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Aug 12 04:08:40 pve01-poz pvestatd[3489]: got timeout
Aug 12 04:08:42 pve01-poz ceph-mgr[2959]: ::ffff:192.168.112.7 - - [12/Aug/2025:04:08:42] "GET /metrics HTTP/1.1" 200 44690 "" "Prometheus/2.45.0"
Aug 12 04:08:43 pve01-poz corosync[3024]: [QUORUM] Sync members[2]: 1 3
Aug 12 04:08:43 pve01-poz corosync[3024]: [QUORUM] Sync left[1]: 2
Aug 12 04:08:43 pve01-poz corosync[3024]: [TOTEM ] A new membership (1.4ff) was formed. Members left: 2
Aug 12 04:08:43 pve01-poz corosync[3024]: [TOTEM ] Failed to receive the leave message. failed: 2
Aug 12 04:08:43 pve01-poz pmxcfs[2719]: [dcdb] notice: members: 1/2719, 3/2468
Aug 12 04:08:43 pve01-poz pmxcfs[2719]: [dcdb] notice: starting data syncronisation
Aug 12 04:08:43 pve01-poz corosync[3024]: [QUORUM] Members[2]: 1 3
Aug 12 04:08:43 pve01-poz corosync[3024]: [MAIN ] Completed service synchronization, ready to provide service.

Aug 12 04:08:38 pve03-poz ceph-mgr[2705]: ::ffff:192.168.112.7 - - [12/Aug/2025:04:08:38] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.45.0"
Aug 12 04:08:38 pve03-poz corosync[2736]: [KNET ] link: host: 2 link: 1 is down
Aug 12 04:08:38 pve03-poz corosync[2736]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
Aug 12 04:08:38 pve03-poz corosync[2736]: [KNET ] host: host: 2 has no active links
Aug 12 04:08:39 pve03-poz corosync[2736]: [TOTEM ] Token has not been received in 2250 ms
Aug 12 04:08:39 pve03-poz pvestatd[3148]: got timeout
Aug 12 04:08:40 pve03-poz corosync[2736]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Aug 12 04:08:43 pve03-poz ceph-mgr[2705]: ::ffff:192.168.112.7 - - [12/Aug/2025:04:08:43] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.45.0"
Aug 12 04:08:43 pve03-poz corosync[2736]: [QUORUM] Sync members[2]: 1 3
Aug 12 04:08:43 pve03-poz corosync[2736]: [QUORUM] Sync left[1]: 2
Aug 12 04:08:43 pve03-poz corosync[2736]: [TOTEM ] A new membership (1.4ff) was formed. Members left: 2
Aug 12 04:08:43 pve03-poz corosync[2736]: [TOTEM ] Failed to receive the leave message. failed: 2
Aug 12 04:08:43 pve03-poz pmxcfs[2468]: [dcdb] notice: members: 1/2719, 3/2468
Aug 12 04:08:43 pve03-poz pmxcfs[2468]: [dcdb] notice: starting data syncronisation
Aug 12 04:08:43 pve03-poz pmxcfs[2468]: [status] notice: members: 1/2719, 3/2468
Aug 12 04:08:43 pve03-poz pmxcfs[2468]: [status] notice: starting data syncronisation
Aug 12 04:08:43 pve03-poz corosync[2736]: [QUORUM] Members[2]: 1 3
Aug 12 04:08:43 pve03-poz corosync[2736]: [MAIN ] Completed service synchronization, ready to provide service.

HPE support told us that they can't see anything wrong in the iLO logs on the server that failed, and that the only anomaly they can see is the following log entries. But we are not sure if this is truly a problem:
Informational,1814,15437,0x45,PCIe VDM Driver,0x08,PCIe MCTP Error Msg, ,Engineering, ,08/23/2025 13:08:35,src/dvrpcievdm_driver.c(1003): Expected Event(SHUTDOWN / RESET): dvrpmctp_disable_asic.

Informational,1815,1766,0x45,PCIe VDM Driver,0x08,PCIe MCTP Error Msg, ,Engineering, ,08/23/2025 13:10:21,src/dvrpcievdm_driver.c(906): Expected Event(PCIe VDM Init): dvrpmctp_enable_asic.


We are worried because it happened to us again a few days ago, with exactly the same problem and the same logs, and we can't find the source of the problem.

Thanks in advance for your time and help.
 
Node pve2 fenced itself as it reached the corosync timeouts to the other 2 nodes. I would reset all the eth cables on pve2 once.
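If you want to confirm the fencing, check whether the HA services and the watchdog were armed around that time, something like this (adjust the time window to your incident):

# is HA configured/active on the cluster at all?
ha-manager status

# watchdog and HA activity around the reboot (run on each node)
journalctl -u watchdog-mux -u pve-ha-lrm -u pve-ha-crm --since "2025-08-12 03:50" --until "2025-08-12 04:20"

A node that gets fenced by the watchdog is hard-reset, so its own syslog simply ends, exactly like the "-- Reboot --" in your log.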
We had something similar before and set a slightly higher token timeout than the default in /etc/pve/corosync.conf, and after that the issue was gone.
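A minimal sketch of that change in the totem section of /etc/pve/corosync.conf; the exact value is site-specific (10000 ms here is just an example), and config_version must be incremented so the change gets applied:

totem {
  ...
  # your logs show the token timing out at 3000 ms;
  # raise it to give the links more headroom (example value)
  token: 10000
}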
 
Hi Waltar,
Thank you for your reply.

We'll check the corosync config
Apart from that, what do you mean by "I would reset all the eth cables on pve2 once"? Just unplug and plug them back in, or is there something else?
For the corosync connection we have a bond of 2 interfaces on each PVE node, should we change that?

Thank you again for your time and help
 
Yes, I mean just unplug and re-plug the cables.
For the corosync connection I would definitely NOT bond them and instead build 2 corosync rings if that's possible, which works very well.
:)
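In case it helps, a rough sketch of how two separate corosync links look in /etc/pve/corosync.conf once each node has a second, physically independent NIC/subnet (node names taken from your logs, addresses are placeholders; bump config_version again when you edit):

nodelist {
  node {
    name: pve01-poz
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # first corosync network
    ring1_addr: 10.10.20.1   # second, independent network
  }
  # ... same pattern for pve02-poz (nodeid 2) and pve03-poz (nodeid 3)
}

totem {
  ...
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}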
 
Hi again Waltar,

Thanks for your kind reply. We'll break the bond and set up corosync on two different networks then.

Also, we'd like to ask about one more thing, in case it's not set up properly.
Our servers have 10G Ethernet ports, but the ports on our switches are 10G SFP, so we use converters and Cat6 cables. Should we change to SFP network cards in our servers?
It's been working like this since last year, but after the recent failures we are looking to improve everything we can to avoid problems in the future.

Thanks again for all your help
 
No, I would not change the cards unless you go to an even higher speed (25/100Gb), and then use those for data rather than for the corosync links.
 
Thanks again for your help
I'm sorry for asking so many questions; we are a bit lost with this problem.

I think this is the last one: we'd like to know why it's better to have corosync on two different networks instead of a bond.
Since you told us to do it, we understand that it's the correct way. But when we first set it up as a bond, we did it so that the server would never lose connectivity if one interface stopped working, and because a bond should not add latency.

Once again thanks for all your help
 
Does LACP cause problems?
In our cluster we have LACP bonds for Ceph, corosync, and the VM network.
We did it to increase bandwidth and avoid losing connectivity. Is it correct to configure it like this?
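For context, the bonds are defined along these lines in /etc/network/interfaces (NIC names and the address below are placeholders, not the real ones):

auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2                # placeholder NIC names
    bond-mode 802.3ad                    # LACP
    bond-miimon 100
    bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
    address 10.0.0.12/24                 # placeholder address
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0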

Thanks again for your help
 
Hi MarkusKo,

Thanks for your reply, we didn't know that an LACP bond could raise latency.
We'll change the configuration then.

Thanks for all your help MarkusKo and waltar

Does this forum have a recommendation system, or is there something I can do for you?
 