Hi everyone,
I have a three-node PVE 8.2.4 cluster running on HPE ProLiant DL385 Gen10 Plus servers. We have configured Ceph and HA on separate networks, each using 20 Gb bonds.
A few weeks ago one of our nodes suddenly rebooted (syslog shows nothing except the reboot itself).
Log from the faulty node:

Aug 12 04:08:21 pve02-poz ceph-mgr[2694]: ::ffff:192.168.112.7 - - [12/Aug/2025:04:08:21] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.45.0"
Aug 12 04:08:26 pve02-poz ceph-mgr[2694]: ::ffff:192.168.112.7 - - [12/Aug/2025:04:08:26] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.45.0"
Aug 12 04:08:31 pve02-poz ceph-mgr[2694]: ::ffff:192.168.112.7 - - [12/Aug/2025:04:08:31] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.45.0"
-- Reboot --
Aug 12 04:12:46 pve02-poz kernel: Linux version 6.8.12-1-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) ()
Aug 12 04:12:46 pve02-poz kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-1-pve root=/dev/mapper/pve-root ro quiet
Aug 12 04:12:46 pve02-poz kernel: KERNEL supported cpus:
Aug 12 04:12:46 pve02-poz kernel: Intel GenuineIntel
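For completeness, this is roughly how we looked for traces of the crash after the fact (just a sketch, assuming persistent journald logging is enabled and that kdump/pstore may or may not be configured on the node):

# List the recorded boots and read the tail of the boot that ended in the crash
journalctl --list-boots
journalctl -b -1 -n 100 --no-pager
# Check whether kdump or pstore captured anything at the moment of the reset
ls /var/crash /sys/fs/pstore

The previous boot's journal simply ends at 04:08:31 with the last Prometheus scrape shown above; there is no panic, oops or watchdog message before the gap.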
After the reboot, all Ceph OSDs on that node stopped processing connections and Ceph stopped working because of slow ops.
We had to manually reboot the node a second time to get it working again.
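For reference, this is roughly how we confirmed the slow ops before the second reboot (standard Ceph CLI; osd.N below is a placeholder for one of the affected OSDs on that node):

# Overall cluster state and which OSDs are reporting slow ops
ceph -s
ceph health detail
# Ops stuck in flight on one of the affected OSDs (run on the node that hosts it)
ceph daemon osd.N dump_ops_in_flight

The slow ops were only ever reported against OSDs on the node that had rebooted.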
We checked the other nodes to see whether a network failure had caused a loss of quorum and Proxmox had fenced the node, but the link only seems to go down after the reboot has already started.
Logs on the working nodes:

Aug 12 04:08:37 pve01-poz ceph-mgr[2959]: ::ffff:192.168.112.7 - - [12/Aug/2025:04:08:37] "GET /metrics HTTP/1.1" 200 44690 "" "Prometheus/2.45.0"
Aug 12 04:08:38 pve01-poz corosync[3024]: [KNET ] link: host: 2 link: 1 is down
Aug 12 04:08:38 pve01-poz corosync[3024]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
Aug 12 04:08:38 pve01-poz corosync[3024]: [KNET ] host: host: 2 has no active links
Aug 12 04:08:39 pve01-poz corosync[3024]: [TOTEM ] Token has not been received in 2250 ms
Aug 12 04:08:40 pve01-poz corosync[3024]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Aug 12 04:08:40 pve01-poz pvestatd[3489]: got timeout
Aug 12 04:08:42 pve01-poz ceph-mgr[2959]: ::ffff:192.168.112.7 - - [12/Aug/2025:04:08:42] "GET /metrics HTTP/1.1" 200 44690 "" "Prometheus/2.45.0"
Aug 12 04:08:43 pve01-poz corosync[3024]: [QUORUM] Sync members[2]: 1 3
Aug 12 04:08:43 pve01-poz corosync[3024]: [QUORUM] Sync left[1]: 2
Aug 12 04:08:43 pve01-poz corosync[3024]: [TOTEM ] A new membership (1.4ff) was formed. Members left: 2
Aug 12 04:08:43 pve01-poz corosync[3024]: [TOTEM ] Failed to receive the leave message. failed: 2
Aug 12 04:08:43 pve01-poz pmxcfs[2719]: [dcdb] notice: members: 1/2719, 3/2468
Aug 12 04:08:43 pve01-poz pmxcfs[2719]: [dcdb] notice: starting data syncronisation
Aug 12 04:08:43 pve01-poz corosync[3024]: [QUORUM] Members[2]: 1 3
Aug 12 04:08:43 pve01-poz corosync[3024]: [MAIN ] Completed service synchronization, ready to provide service.

Aug 12 04:08:38 pve03-poz ceph-mgr[2705]: ::ffff:192.168.112.7 - - [12/Aug/2025:04:08:38] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.45.0"
Aug 12 04:08:38 pve03-poz corosync[2736]: [KNET ] link: host: 2 link: 1 is down
Aug 12 04:08:38 pve03-poz corosync[2736]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
Aug 12 04:08:38 pve03-poz corosync[2736]: [KNET ] host: host: 2 has no active links
Aug 12 04:08:39 pve03-poz corosync[2736]: [TOTEM ] Token has not been received in 2250 ms
Aug 12 04:08:39 pve03-poz pvestatd[3148]: got timeout
Aug 12 04:08:40 pve03-poz corosync[2736]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Aug 12 04:08:43 pve03-poz ceph-mgr[2705]: ::ffff:192.168.112.7 - - [12/Aug/2025:04:08:43] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.45.0"
Aug 12 04:08:43 pve03-poz corosync[2736]: [QUORUM] Sync members[2]: 1 3
Aug 12 04:08:43 pve03-poz corosync[2736]: [QUORUM] Sync left[1]: 2
Aug 12 04:08:43 pve03-poz corosync[2736]: [TOTEM ] A new membership (1.4ff) was formed. Members left: 2
Aug 12 04:08:43 pve03-poz corosync[2736]: [TOTEM ] Failed to receive the leave message. failed: 2
Aug 12 04:08:43 pve03-poz pmxcfs[2468]: [dcdb] notice: members: 1/2719, 3/2468
Aug 12 04:08:43 pve03-poz pmxcfs[2468]: [dcdb] notice: starting data syncronisation
Aug 12 04:08:43 pve03-poz pmxcfs[2468]: [status] notice: members: 1/2719, 3/2468
Aug 12 04:08:43 pve03-poz pmxcfs[2468]: [status] notice: starting data syncronisation
Aug 12 04:08:43 pve03-poz corosync[2736]: [QUORUM] Members[2]: 1 3
Aug 12 04:08:43 pve03-poz corosync[2736]: [MAIN ] Completed service synchronization, ready to provide service.
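This is roughly what we ran on the surviving nodes to rule out a quorum loss or self-fence (a sketch, assuming the standard Proxmox HA stack; adjust the --since/--until window to the incident time):

# Quorum and corosync link state as seen from the surviving nodes
pvecm status
corosync-cfgtool -s
# Any watchdog or HA fencing activity around the time of the reboot
journalctl -u corosync -u watchdog-mux -u pve-ha-lrm -u pve-ha-crm --since "2025-08-12 04:00" --until "2025-08-12 04:20"

On both surviving nodes corosync only reports host 2 going down at 04:08:38, i.e. after the faulty node had already started rebooting, and they kept quorum (2 of 3) the whole time.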
HPE support told us they can't see anything wrong in the iLO logs of the server that failed. The only anomaly they pointed out are the entries below, but we are not sure whether they are really a problem.
Informational,1814,15437,0x45,PCIe VDM Driver,0x08,PCIe MCTP Error Msg, ,Engineering, ,08/23/2025 13:08:35,src/dvrpcievdm_driver.c(1003): Expected Event(SHUTDOWN / RESET): dvrpmctp_disable_asic.
Informational,1815,1766,0x45,PCIe VDM Driver,0x08,PCIe MCTP Error Msg, ,Engineering, ,08/23/2025 13:10:21,src/dvrpcievdm_driver.c(906): Expected Event(PCIe VDM Init): dvrpmctp_enable_asic.
We are worried because it happened again a few days ago with exactly the same symptoms and the same logs, and we still can't find the source of the problem.
Thanks in advance for your time and help.